This page continues a discussion about how to integrate Lucene with Derby. Lucene is an Apache text search engine. The discussion began on the Derby user mailing list in the Full Text Indexing thread. JIRA enhancement request DERBY-472 tracks this discussion.
This page briefly describes Lucene's capabilities and then explores text-searching features and use cases which Derby might support. Please feel free to expand this list of features and use cases.
Lucene's Capabilities
Lucene provides a Java library for indexing and searching documents. Lucene ships with English, German, and Russian support, and you can find plugins for other languages, including Chinese, Japanese, and Korean. Plugins exist for the following document formats:
- plain text
- HTML
- XML
- Open Office
- Word
- Excel
- PowerPoint
- IMAP mail
- RTF
The following high-level concepts drive Lucene's design:
- Crawling - A component crawls through some repository (say a web, filesystem, or database), looking for documents to index.
- Analyzing - The resulting documents are analyzed into useful terms:
  - Lexing - The text is broken up into language-specific words.
  - Stemming - Inflectional markers are stripped and words are reduced to standard forms. For instance, English possessives, plurals, and tenses disappear and the words bat, bats, bat's, batted, and batting all become the word bat.
  - Stopping - Noise words (like "the" and "an") are thrown away.
- Indexing - An index is built keyed by useful terms. For each useful term, the index tracks various statistics including the term's word offsets into documents.
- Querying - Complex queries can be built out of words and phrases, arbitrarily connected by ANDs, ORs, and NOTs. Queries allow exact matches and various kinds of fuzzy matches. Queries may be expressed in a text-based query language or as graphs of Lucene search objects.
- Filtering - Query results may be run through noise filters to sift out irrelevant documents.
- Hits - Filtered query results, sorted by relevance, appear as lists of document hits.
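The analyzing, indexing, and querying steps above can be sketched in plain Java. This is a toy model for illustration only, not Lucene's API: the class `ToySearch`, its methods, the stop list, and the suffix-stripping rules are all assumptions made up for this sketch, and the word offsets it records count only the terms that survive stopping.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy text-search pipeline (hypothetical, not Lucene's API): analyze, index, query. */
class ToySearch {
    // Hypothetical stop list; real analyzers ship much larger ones.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "of", "the");

    // term -> (document id -> word offsets of the term among the document's kept terms)
    final Map<String, Map<Integer, List<Integer>>> index = new HashMap<>();

    /** Analyzing: lexing, then stemming, then stopping. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // Lexing: lowercase, then split on anything that is not a letter or apostrophe.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            String term = stem(word).replace("'", "");
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    /** Stemming: crude suffix stripping, an assumption standing in for a real stemmer. */
    static String stem(String word) {
        for (String suffix : new String[] {"'s", "ing", "ed", "s"}) {
            int stemLength = word.length() - suffix.length();
            if (word.endsWith(suffix) && stemLength >= 3) {
                String s = word.substring(0, stemLength);
                boolean verbSuffix = suffix.equals("ing") || suffix.equals("ed");
                int n = s.length();
                // Undo consonant doubling after a verb suffix: "batt" -> "bat".
                if (verbSuffix && n >= 4 && s.charAt(n - 1) == s.charAt(n - 2)
                        && "aeiou".indexOf(s.charAt(n - 1)) < 0) {
                    s = s.substring(0, n - 1);
                }
                return s;
            }
        }
        return word;
    }

    /** Indexing: record each term's word offsets per document. */
    void add(int docId, String text) {
        List<String> terms = analyze(text);
        for (int offset = 0; offset < terms.size(); offset++) {
            index.computeIfAbsent(terms.get(offset), t -> new TreeMap<>())
                 .computeIfAbsent(docId, d -> new ArrayList<>())
                 .add(offset);
        }
    }

    /** Querying: ids of the documents that contain the analyzed form of a word. */
    Set<Integer> docsWith(String word) {
        List<String> terms = analyze(word);
        if (terms.isEmpty()) {
            return new TreeSet<>();
        }
        return new TreeSet<>(index.getOrDefault(terms.get(0), Map.of()).keySet());
    }

    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    /** Documents in a that are NOT in b. */
    static Set<Integer> not(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }

    public static void main(String[] args) {
        ToySearch search = new ToySearch();
        search.add(1, "The bat's wings");
        search.add(2, "Batting practice and batted balls");
        search.add(3, "An owl hunting");
        System.out.println(analyze("bat bats bat's batted batting"));
        System.out.println(and(search.docsWith("bats"), search.docsWith("wing")));
        System.out.println(not(search.docsWith("bat"), search.docsWith("practice")));
    }
}
```

Because queries pass through the same `analyze` method as documents, a search for "bats" matches documents that contain "bat's" or "batted". The sketch leaves out crawling, fuzzy matching, filtering, and relevance ranking, all of which Lucene itself provides.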
Features We Want
Integrating Lucene with Derby may involve some or all of the following features. We would probably phase these features in over several releases.
- Complex Searches - Text-search documents. Restrict the search by metadata that is stored in Derby. Join search results with supplementary information stored in Derby.
- Administration - Be able to use off-the-shelf tools to maintain and optimize Lucene indexes.
- Import/Export - Rapid import/export of text-searchable documents.
- Security - Restrict text-searching to authorized documents.
- Recovery - Recover text-search indexes after a crash.
- Parallelism - Be able to throw many processors at a text-search.
- Plugins - Lucene support should not bloat the core Derby release.
- Customizing - Customers should be able to supply their own analyzers and filters and store these in the database.
- Query API - Customers should be able to express queries with Lucene's query language or with graphs of Lucene search objects.
- Convenience - Make it easy to declare which document fields appear in Lucene indexes and which are stored in columns.
Use Cases to Support
Use Case | Description | Example |
Loose Coupling | Store documents outside Derby in a filesystem or web. | Web-advertising: Maintain a searchable web of content. When the user searches for content, return web pages as well as advertising JSPs bound to certain keywords. |
Moderate Coupling | Store documents inside Derby but maintain text-search indexes outside Derby in a filesystem. Provides transactional versioning and audit trail for documents which can be text-searched. | Law office: Be able to transactionally store legal documents and search for them later. |
Tight Coupling | Transactionally store documents and text-search indexes inside Derby. | Online market: Be able to search for an item immediately after its description is posted. |
Issues
- Index Latency - The first phase of Lucene support will probably not store the Lucene indexes in the database, so there will be some lag between storing a document and seeing it appear in searches. How long can this lag be? A minute? An hour? A day? Similarly, after a crash, we may need to rebuild the Lucene indexes. How long can this rebuilding take?