Please refer to the Geode 1.2.0 documentation for the final implementation.
Requirements
- Allow users to create Lucene indexes on data stored in Geode
- Update the indexes asynchronously to avoid impacting write latency
- Allow users to perform text (Lucene) searches on Geode data using the Lucene index. Results from text searches may be stale due to asynchronous index updates.
- Provide high availability of indexes using Geode's HA capabilities
- Scalability
- Performance comparable to RAMFSDirectory
Out of Scope
- Building the next/better Solr/Elasticsearch
- Enhancing the current Geode OQL to use the Lucene index
Related Documents
A previous integration of Lucene and GemFire.
Similar efforts done by other data products
Hibernate Search: integrates Lucene-based full-text search with Hibernate entities.
Solandra: Solandra embeds Solr in Cassandra.
Terminology
- Documents: In Lucene, a Document is the unit of search and index. An index consists of one or more Documents.
- Fields: A Document consists of one or more Fields. A Field is simply a name-value pair.
- Indexing involves adding Documents to an IndexWriter, and searching involves retrieving Documents from an index via an IndexSearcher.
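The relationship between Documents, Fields, and searching can be sketched as a toy model. This is illustrative pseudocode-made-runnable, not the real Lucene API (which uses Document, Field, IndexWriter, and IndexSearcher classes): a Document is a set of name/value Fields, an index is a collection of Documents, and a search retrieves Documents whose field values match a term.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the terminology above (not Lucene's actual classes):
// a Document is a list of name/value Fields; searching an index of
// Documents returns those whose named field contains the search term.
public class TerminologySketch {
    record Field(String name, String value) {}
    record Document(List<Field> fields) {}

    static List<Document> search(List<Document> index, String fieldName, String term) {
        List<Document> hits = new ArrayList<>();
        for (Document doc : index) {
            for (Field f : doc.fields()) {
                if (f.name().equals(fieldName)
                        && f.value().toLowerCase().contains(term.toLowerCase())) {
                    hits.add(doc);
                    break; // the Document matches; no need to check more Fields
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Document> index = List.of(
            new Document(List.of(new Field("customer", "John Smith"), new Field("tags", "gold"))),
            new Document(List.of(new Field("customer", "Jane Doe"), new Field("tags", "silver"))));
        System.out.println(search(index, "customer", "john").size()); // 1
    }
}
```

Real Lucene additionally builds an inverted index over analyzed terms, which is what makes search fast; this sketch only captures the data model.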
API
User Input
- A region and list of to-be-indexed fields
- [Optional] An Analyzer per field; the StandardAnalyzer is used for any field without a specified Analyzer
Key points
- A single index will not support multiple regions. Join queries between regions are not supported.
- Heterogeneous objects in single region will be supported
- Only top level fields of nested objects can be indexed, not nested collections
- The index needs to be created before the region is created (for phase1)
- Pagination of results will be supported
Java API
Now that this feature has been implemented, please refer to the javadocs for details on the Java API.
Examples
// Get LuceneService
LuceneService luceneService = LuceneServiceProvider.get(cache);

// Create an index on fields with the default analyzer:
luceneService.createIndex(indexName, regionName, "field1", "field2", "field3");

// Create an index on fields with specified analyzers:
Map<String, Analyzer> analyzerPerField = new HashMap<String, Analyzer>();
analyzerPerField.put("field1", new StandardAnalyzer());
analyzerPerField.put("field2", new KeywordAnalyzer());
luceneService.createIndex(indexName, regionName, analyzerPerField);

Region region = cache.createRegionFactory(RegionShortcut.PARTITION).create(regionName);

// Create a query
LuceneQuery query = luceneService.createLuceneQueryFactory()
    .setLimit(200)
    .setPageSize(20)
    .create(indexName, regionName, queryString, "field1" /* default field */);

// Search using the query
PageableLuceneQueryResults<K, Object> results = query.findPages();

// Pagination
while (results.hasNext()) {
  results.next().stream().forEach(struct -> {
    Object value = struct.getValue();
    System.out.println("Key is " + struct.getKey() + ", value is " + value);
  });
}
Gfsh API
// List indexes
gfsh> list lucene indexes [with-stats]

// Create an index
gfsh> create lucene index --name=indexName --region=/orders --field=customer,tags

// Create an index with analyzers
gfsh> create lucene index --name=indexName --region=/orders --field=customer,tags --analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer,org.apache.lucene.analysis.bg.BulgarianAnalyzer

// Execute a Lucene query
gfsh> search lucene --regionName=/orders --queryStrings="John*" --defaultField=field1 --limit=100
XML Configuration
<cache xmlns="http://geode.apache.org/schema/cache"
       xmlns:lucene="http://geode.apache.org/schema/lucene"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://geode.apache.org/schema/cache
           http://geode.apache.org/schema/cache/cache-1.0.xsd
           http://geode.apache.org/schema/lucene
           http://geode.apache.org/schema/lucene/lucene-1.0.xsd"
       version="1.0">
  <region name="region" refid="PARTITION">
    <lucene:index name="index">
      <lucene:field name="a" analyzer="org.apache.lucene.analysis.core.KeywordAnalyzer"/>
      <lucene:field name="b" analyzer="org.apache.lucene.analysis.core.SimpleAnalyzer"/>
      <lucene:field name="c" analyzer="org.apache.lucene.analysis.standard.ClassicAnalyzer"/>
    </lucene:index>
  </region>
</cache>
REST API
Spring Data GemFire Support
Implementation Flowchart
Inside LuceneIndex
A closer look at Partitioned region data flow
Processing Queries
Implementation Details
Index Storage
- FileRegion: holds the metadata about index files
- ChunkRegion: holds the actual data chunks for a given index file

For each entry event, the index update:
- Creates a document for the indexed fields. Field values are obtained from the AsyncEvent through reflection (for domain objects) or via the PdxInstance interface (for PDX or JSON); the Lucene document object is constructed and added to the LuceneIndex associated with that region.
- Determines the bucket id of the entry.
- Gets the RegionDirectory for that bucket and saves the document into it.
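The per-bucket write path above can be sketched as follows. The class and method names here are illustrative, not Geode internals; the one factual anchor is that a Geode partitioned region defaults to 113 total buckets, and the key's hash determines the bucket.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the index write path: derive the entry's bucket id from its
// key, then store the extracted document fields in that bucket's own
// directory (standing in for the per-bucket RegionDirectory).
public class BucketWritePath {
    static final int NUM_BUCKETS = 113; // Geode's default total bucket count

    static int bucketIdFor(Object key) {
        // Hash-based routing, analogous to Geode's default partitioning.
        return Math.abs(key.hashCode() % NUM_BUCKETS);
    }

    // bucketId -> (entry key -> indexed field values)
    static final Map<Integer, Map<Object, Map<String, Object>>> directories = new HashMap<>();

    static void index(Object key, Map<String, Object> indexedFields) {
        int bucketId = bucketIdFor(key);
        directories.computeIfAbsent(bucketId, b -> new HashMap<>())
                   .put(key, indexedFields);
    }

    public static void main(String[] args) {
        index("order-1", Map.of("customer", "John", "tags", "gold"));
        index("order-2", Map.of("customer", "Jane", "tags", "silver"));
        System.out.println("buckets used: " + directories.keySet());
    }
}
```

Keeping one directory per bucket is what lets the index move and recover with its colocated data bucket.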
Storage with different region types
PersistentRegions
Walkthrough creating index in Geode region
A LuceneIndex can be created and destroyed. Creating an index on a region that already contains data is not supported for now.
Handling failures, restarts, and rebalance
The index region and async event queue will be recovered along with their colocated data region's buckets. So during failover the new primary will be able to read and write the index as usual.
Aggregation
In the case of partitioned regions, the query must be sent out to all the primaries. The results will then need to be aggregated back together. Lucene search will use FunctionService to distribute query to primaries.
Input to primaries
- Serialized Query
- CollectorManager to be used for local aggregation
- Result limit
Output from primaries
- Merged collector created from results of search on local bucket indexes.
We are still investigating options for how to aggregate the data, see Text Search Aggregation Options.
In the case of replicated regions, the query will be sent to one of the members and executed there. Aggregation will be handled on that member before the results are returned to the caller.
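The merge step on the aggregating member can be sketched like this. It is a simplification under stated assumptions: each primary is assumed to return its local hits as (key, score) pairs, whereas the real implementation merges Lucene collectors via the CollectorManager mentioned above.

```java
import java.util.Comparator;
import java.util.List;

// Sketch of aggregating per-primary search results: flatten all members'
// hit lists, rank by descending score, and apply the query's result limit.
public class ResultAggregation {
    record Hit(Object key, float score) {}

    static List<Hit> merge(List<List<Hit>> perMemberHits, int limit) {
        return perMemberHits.stream()
                .flatMap(List::stream)
                .sorted(Comparator.comparingDouble((Hit h) -> h.score()).reversed())
                .limit(limit)
                .toList();
    }

    public static void main(String[] args) {
        List<Hit> member1 = List.of(new Hit("k1", 0.9f), new Hit("k2", 0.4f));
        List<Hit> member2 = List.of(new Hit("k3", 0.7f));
        // Limit 2 keeps k1 (0.9) and k3 (0.7); k2 (0.4) is dropped.
        System.out.println(merge(List.of(member1, member2), 2));
    }
}
```

Note that each primary only needs to ship its top `limit` hits, since anything ranked below that locally can never appear in the merged top `limit`.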
Result collection and paging
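The paging behavior behind PageableLuceneQueryResults can be sketched as slicing the ranked result list into fixed-size pages. This is an illustrative model only; the actual implementation fetches values lazily per page rather than holding them all in memory.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of result paging: split a ranked hit list into pages of
// pageSize entries, with a shorter final page if needed.
public class Paging {
    static <T> List<List<T>> pages(List<T> results, int pageSize) {
        List<List<T>> pages = new ArrayList<>();
        for (int i = 0; i < results.size(); i += pageSize) {
            pages.add(results.subList(i, Math.min(i + pageSize, results.size())));
        }
        return pages;
    }

    public static void main(String[] args) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < 45; i++) hits.add(i);
        System.out.println(pages(hits, 20).size()); // 3 pages: 20 + 20 + 5
    }
}
```

This matches the Java API example earlier, where setPageSize(20) with a limit of 200 yields up to ten pages iterated via hasNext()/next().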
JMX MBean
A Lucene Service MBean is available and accessed through an ObjectName like:
GemFire:service=CacheService,name=LuceneService,type=Member,member=192.168.2.13(59583)<ec><v5>-1026
This MBean provides these operations:
/**
 * Returns an array of {@link LuceneIndexMetrics} for the
 * {@link com.gemstone.gemfire.cache.lucene.LuceneIndex} instances defined in this member.
 *
 * @return an array of LuceneIndexMetrics for the LuceneIndexes defined in this member
 */
public LuceneIndexMetrics[] listIndexMetrics();

/**
 * Returns an array of {@link LuceneIndexMetrics} for the
 * {@link com.gemstone.gemfire.cache.lucene.LuceneIndex} instances defined on the input
 * region in this member.
 *
 * @param regionPath The full path of the region to retrieve
 *
 * @return an array of LuceneIndexMetrics for the LuceneIndex instances defined on the
 *         input region in this member
 */
public LuceneIndexMetrics[] listIndexMetrics(String regionPath);

/**
 * Returns a {@link LuceneIndexMetrics} for the
 * {@link com.gemstone.gemfire.cache.lucene.LuceneIndex} with the input index name
 * defined on the input region in this member.
 *
 * @param regionPath The full path of the region to retrieve
 * @param indexName The name of the index to retrieve
 *
 * @return a LuceneIndexMetrics for the LuceneIndex with the input index name defined
 *         on the input region in this member
 */
public LuceneIndexMetrics listIndexMetrics(String regionPath, String indexName);
A LuceneIndexMetrics data bean includes raw stat values like:
Region=/data2; index=full_index
  commitTime->107608255573
  commits->5999
  commitsInProgress->0
  documents->498
  queryExecutionTime->0
  queryExecutionTotalHits->0
  queryExecutions->0
  queryExecutionsInProgress->0
  updateTime->7313618780
  updates->6419
  updatesInProgress->0
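Since only raw counters are exposed, averages can still be derived on the client side, for example mean commit latency as cumulative commit time divided by commit count. The sketch below assumes commitTime is in nanoseconds, as Geode stat times typically are; this is client-side arithmetic, not an MBean API.

```java
// Deriving an average latency from the raw stat values shown above:
// commitTime is a cumulative duration and commits is a count, so their
// ratio gives the mean latency per commit.
public class StatDerivation {
    static double avgCommitMillis(long commitTimeNanos, long commits) {
        return (commitTimeNanos / (double) commits) / 1_000_000.0;
    }

    public static void main(String[] args) {
        // Sample values from the stats dump above.
        long commitTimeNanos = 107608255573L;
        long commits = 5999;
        System.out.printf("average commit latency: %.1f ms%n",
                avgCommitMillis(commitTimeNanos, commits)); // ~17.9 ms
    }
}
```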
Limitations include:
- no rates or average latencies are available
- no aggregation (which means no rollups across members in the GemFire -> Distributed MBean)