(This page is a child of the TaskList page)
Solr would benefit from a flexible document processing framework that meets the requirements of enterprise-grade content integration. Most search projects have some need for processing the incoming content prior to indexing, for example:
- Language identification
- Text extraction (Tika)
- Entity extraction and classification
- Data normalization and cleansing
- 3rd party systems integration (e.g. enrich document from external source)
The built-in UpdateRequestProcessorChain is capable of doing simple processing jobs, but it is only built for local execution on the indexer node in the same thread. This means that any performance-heavy processing chain will slow down the indexers, with no way to scale out processing independently. We have seen FAST systems with far more servers doing document processing than indexing.
There are many processing pipeline frameworks from which to get inspiration, such as the one in FAST ESP, OpenPipeline, OpenPipe (now on GitHub), Pypes, UIMA, Eclipse SMILA, Apache commons pipeline, Piped, Behemoth, Findwise's yet-to-be-announced pipeline and others. Indeed, some of these are already being used with Solr as a pre-processing server.
A choice of technologies is good, but it can be a bit too much and fragmented as well...
There has recently been interest within the search community in a true open source pipeline with a healthy community behind it and a rich pool of processors. See this presentation from Lucene Eurocon 2010 and this blog post for thoughts from Findwise, the recent solr-user thread Pipeline for Solr, and Cominvent's talk at Lucene Eurocon 2011, Improving Solr's Update Chain. In addition to developing a preferred, truly open source solution, it should also be possible to improve interoperability and compatibility between the existing pipelines.
Here are a few things that we could consider in order to ease this situation:
- Start talking together and try to find common ground, places to cooperate, consolidate etc
- Develop a common Java interface which models a document processor, enabling cross-pipeline use of the same processor
- Develop a Java wrapper for executing Python processors (reuse of ESP processors, Pypes processors and Piped processors) in a Java pipeline
- Specify a common "Document" model which may be serialized between various components (Avro based?)
- Establish a source repository (outside of the ASF) of reusable processors, maintained by a large community
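As a rough illustration of the second and fourth points above, a shared processor contract and document model might look something like the following. This is only a sketch: the names PipelineDocument, DocumentProcessor and LowercaseProcessor are hypothetical and do not come from any of the existing pipeline projects, and a real common model would also need to settle serialization, metadata and error handling.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical minimal document model: field name -> list of values. */
final class PipelineDocument {
    private final Map<String, List<Object>> fields = new LinkedHashMap<>();

    void addValue(String field, Object value) {
        fields.computeIfAbsent(field, k -> new ArrayList<>()).add(value);
    }

    void setValues(String field, List<Object> values) {
        fields.put(field, new ArrayList<>(values));
    }

    boolean hasField(String field) {
        return fields.containsKey(field);
    }

    List<Object> getValues(String field) {
        List<Object> values = fields.get(field);
        return values == null ? Collections.<Object>emptyList() : values;
    }
}

/** Hypothetical framework-neutral contract for a single processing step. */
interface DocumentProcessor {
    void process(PipelineDocument doc) throws Exception;
}

/** Example processor: lower-cases every String value of one field. */
final class LowercaseProcessor implements DocumentProcessor {
    private final String field;

    LowercaseProcessor(String field) {
        this.field = field;
    }

    @Override
    public void process(PipelineDocument doc) {
        if (!doc.hasField(field)) {
            return; // nothing to normalize
        }
        List<Object> out = new ArrayList<>();
        for (Object v : doc.getValues(field)) {
            out.add(v instanceof String ? ((String) v).toLowerCase(Locale.ROOT) : v);
        }
        doc.setValues(field, out);
    }
}
```

A processor written against such an interface depends only on the document model, so in principle each pipeline framework would need just one thin adapter to host it.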
Wishes for a Lucene targeted pipeline
Here are some thoughts and wishes for a new pipeline project mainly targeted at Lucene-based search engines (including Solr, ElasticSearch and Lucene itself). It should probably build upon or fork one of the existing projects and follow their best practices.
- Apache licensed
- Java based
- Lightweight (not over-engineered)
- Support for multiple named pipelines, addressable at document ingestion
- Support for a rich document format, including token streams (pre-analyzed content)
- Support for metadata on document and field level (e.g. tokenized=true, language=en)
- Well defined dead-simple API and SDK for the processing stages
- Easy configuration of pipelines through separate config and GUI
- Run standalone as well as embedded in another framework (such as Solr's UpdateChain)
- Do not directly depend on Solr, but allow easy, tight integration with either Lucene or Solr
- SDK for stage developers - to encourage stage development
- Easily debuggable and testable
- Separate stages repository (e.g. a GitHub space, outside of ASF svn) to encourage sharing
- Integration points for UIMA, LingPipe, OpenNLP etc
- Be able to run Lucene's Tokenizers and Token Filters directly and ship this to Lucene as the new "pre-analyzed" field (see SOLR-1535)
- Support for writing stages in JVM scripting languages such as Jython
- GUI for configuring pipelines
- Hot pluggable pipelines
- Wrappers for custom FAST ESP stages to work with minor modification
- Wrappers for custom UpdateProcessor stages to work with minor modification
- Robust - if a batch fails, it should re-schedule to another processor
- Optimize for performance through e.g. batch support
- Allow scaling out processing to multiple dedicated servers for heavy tasks. Cloud-friendly
- Support status callbacks to the client
- Do not over-architect, as Eclipse SMILA and others have done by going crazy with ESB etc
- Do not try to be a connector framework as well. Let ManifoldCF do that job. Focus on the pipeline!
- Do not keep the source private (although Apache licensed) as DieselPoint did with OpenPipeline - create a community!
Jan Høydahl: I think OpenPipe is a hot candidate to fork as a new open source framework. It already supports most of the above, is Apache licensed, and is abandoned by its original developers.
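To make two of the wishes above more concrete (multiple named pipelines addressable at ingestion, and metadata on document and field level), here is a minimal sketch. All class and method names (Field, Doc, Stage, PipelineRegistry) are invented for illustration; this is not the API of OpenPipe or any other project mentioned on this page.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical field: a value plus per-field metadata such as language=en. */
final class Field {
    final String name;
    Object value;
    final Map<String, String> metadata = new HashMap<>();

    Field(String name, Object value) {
        this.name = name;
        this.value = value;
    }
}

/** Hypothetical document: named fields plus document-level metadata. */
final class Doc {
    final Map<String, Field> fields = new LinkedHashMap<>();
    final Map<String, String> metadata = new HashMap<>();

    Field put(String name, Object value) {
        Field f = new Field(name, value);
        fields.put(name, f);
        return f;
    }
}

/** Hypothetical stage contract, kept to a single method on purpose. */
interface Stage {
    void process(Doc doc);
}

/** Holds several named pipelines; the client picks one per document. */
final class PipelineRegistry {
    private final Map<String, List<Stage>> pipelines = new HashMap<>();

    void register(String name, List<Stage> stages) {
        pipelines.put(name, new ArrayList<>(stages));
    }

    void run(String pipelineName, Doc doc) {
        List<Stage> stages = pipelines.get(pipelineName);
        if (stages == null) {
            throw new IllegalArgumentException("Unknown pipeline: " + pipelineName);
        }
        for (Stage stage : stages) {
            stage.process(doc); // stages run in registration order
        }
    }
}
```

A client would then name the pipeline when submitting a document, for example registry.run("english", doc), and later stages (or the indexer) could read field metadata such as language to decide how to treat the content.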
Your question here
- Q: Is there a JIRA issue that tracks the Solr-side development of this?
- A: Not yet
- Q: How is this related to https://issues.apache.org/jira/browse/SOLR-2129?
- A: SOLR-2129 is an UpdateProcessor for UIMA (see SolrUIMA). Here we're talking about a new standalone framework and a way to integrate this and other existing pipelines cleanly with Solr/Lucene.
- Q: Will the pipelines have to be linear? For instance, could we implement a first stage in the pipeline that would be a splitter? The splitter could, for example, break up a large XML document into chapters, then push each chapter to the next stage where other processing will take place. In the end, the Lucene index would have one document per chapter.
- A: The new framework can be whatever we want it to be. If you are talking about the Solr UpdateChain, we suggest in SOLR-2841 a way to support non-linear chains. For splitting into chapters, however, I think that an UpdateRequestHandler may be a better choice, see http://wiki.apache.org/solr/XsltUpdateRequestHandler
- Q: How will the pipelines support compound files, e.g. archives, e-mail messages with attachments (which could be archives), etc.? This could be a problem if pipelines are linear.
- A: This is an open question. For the new pipeline framework, there are many possibilities, which must be discussed. If you're thinking about the Solr UpdateChain, you can choose to have your UpdateRequestHandler understand the input format and do the splitting for you. But it should also be possible to write an UpdateProcessor which splits the incoming SolrInputDocument into multiple sub documents, generating a unique ID for each. You would somehow need to inject these sub documents again, either by using SolrJ from your UpdateProcessor or by instantiating a "sub chain" in another thread to push the sub docs into the index. This is, however, left as an exercise for the user.
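For the new framework, one possible way to handle splitters and compound documents is to let every stage return zero or more documents instead of exactly one. The sketch below illustrates that idea only. FanOutStage, ChapterSplitter and FanOutPipeline are hypothetical names, documents are modelled as plain maps for brevity, and the "chapters", "parent_id" and "text" field names are assumptions. It does not use or describe the Solr UpdateChain API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stage contract: one document in, zero or more documents out. */
interface FanOutStage {
    List<Map<String, Object>> process(Map<String, Object> doc);
}

/** Splits the assumed "chapters" field (a List) into one sub document per chapter. */
final class ChapterSplitter implements FanOutStage {
    @Override
    public List<Map<String, Object>> process(Map<String, Object> doc) {
        Object chapters = doc.get("chapters");
        if (!(chapters instanceof List)) {
            return Collections.singletonList(doc); // nothing to split, pass through
        }
        List<Map<String, Object>> out = new ArrayList<>();
        int n = 0;
        for (Object chapter : (List<?>) chapters) {
            Map<String, Object> sub = new HashMap<>();
            sub.put("id", doc.get("id") + "_" + n++); // unique ID derived from the parent ID
            sub.put("parent_id", doc.get("id"));
            sub.put("text", chapter);
            out.add(sub);
        }
        return out;
    }
}

/** Runs stages in order, feeding every output document into the next stage. */
final class FanOutPipeline {
    private final List<FanOutStage> stages;

    FanOutPipeline(List<FanOutStage> stages) {
        this.stages = new ArrayList<>(stages);
    }

    List<Map<String, Object>> run(Map<String, Object> input) {
        List<Map<String, Object>> current = Collections.singletonList(input);
        for (FanOutStage stage : stages) {
            List<Map<String, Object>> next = new ArrayList<>();
            for (Map<String, Object> doc : current) {
                next.addAll(stage.process(doc));
            }
            current = next;
        }
        return current;
    }
}
```

With this shape, an archive or an e-mail with attachments could be unpacked by an early stage, and each resulting sub document would simply continue through the remaining stages. Nested archives would need the unpacking stage to recurse or to be applied repeatedly, which this sketch does not attempt.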