- IndexWriter.close() will result in creation of a new set of segment files. This could trigger a segment merge operation, which could be resource intensive (think compaction in LSM).
- writer.setMaxBufferedDocs() and writer.setRAMBufferSizeMB(): a larger RAM buffer means larger segments, which means less merging later.
- writer.addIndexesNoOptimize()
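The buffer-size trade-off can be illustrated with a toy model. The sketch below does not use Lucene at all: `SegmentFlushSketch` and `flushedSegments` are hypothetical names, and the model assumes a segment is flushed purely on document count, whereas real flushing also depends on RAM usage and the merge policy.

```java
// Toy model only (hypothetical names, no Lucene dependency):
// counts how many segments get flushed for a given buffered-document limit.
final class SegmentFlushSketch {

    /**
     * Returns the number of segments written when docCount documents are added
     * and the buffer is flushed every maxBufferedDocs documents, with one final
     * flush on close for any remainder.
     */
    static int flushedSegments(int docCount, int maxBufferedDocs) {
        if (maxBufferedDocs <= 0) {
            throw new IllegalArgumentException("maxBufferedDocs must be positive");
        }
        int segments = 0;
        int buffered = 0;
        for (int i = 0; i < docCount; i++) {
            buffered++;
            if (buffered == maxBufferedDocs) { // buffer full: flush a new segment
                segments++;
                buffered = 0;
            }
        }
        if (buffered > 0) {                    // close() flushes what is left
            segments++;
        }
        return segments;
    }

    public static void main(String[] args) {
        System.out.println(flushedSegments(10_000, 100));
        System.out.println(flushedSegments(10_000, 1_000));
    }
}
```

Fewer, larger segments leave less work for later merges, at the cost of holding more documents in memory before each flush.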
() PUTs -> [Cache]
[Cache] .down.> [Async Queue]
[Async Queue] -right-> [Lucene Indexer]
[Lucene Indexer] -up-> [GeodeFSDirectory]
[GeodeFSDirectory] -left-> [Cache]
[Cache] -up-> () Search
() User -down-> [Cache] : PUTs
node cluster {
  database {
    () indexPR1
  }
  [Cache] ..> [PR 1]
  [PR 1] -down-> [FSDirectoryPR1]
  [FSDirectoryPR1] -> indexPR1
  database {
    () indexPR2
  }
  [Cache] ..> [PR 2]
  [PR 2] -down-> [FSDirectoryPR2]
  [FSDirectoryPR2] -> indexPR2
}
A search request will be intercepted by a custom ParserAggregator. This component will distribute the search query to all PRs. Each PR will route the request to its local Lucene index. The results will be routed back to the ParserAggregator, which will reorder and trim the aggregated result set and return it to the user.
() User -> [Cache] : Search
node cluster {
  database {
    () indexPR1
  }
  [Cache] ..> [PR 1]
  [PR 1] --> [ParserAggregator]
  [ParserAggregator] --> [LucenePR1]
  [LucenePR1] --> [FSDirectoryPR1]
  [FSDirectoryPR1] -> indexPR1
  database {
    () indexPR2
  }
  [ParserAggregator] --> [LucenePR2]
  [LucenePR2] --> [FSDirectoryPR2]
  [FSDirectoryPR2] -> indexPR2
}
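The reorder-and-trim step of the ParserAggregator could look roughly like the sketch below. It is only an illustration: `ResultAggregatorSketch`, `Hit`, and `mergeAndTrim` are hypothetical names, and it assumes each PR returns hits with a numeric score that can be compared across PRs (scores computed against separate Lucene indexes are not strictly comparable, so a real design may need to normalize them).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of the scatter-gather merge step (hypothetical names, no Lucene dependency).
final class ResultAggregatorSketch {

    /** One hit returned by a PR's local index: the entry key and its score. */
    static final class Hit {
        final String key;
        final float score;

        Hit(String key, float score) {
            this.key = key;
            this.score = score;
        }
    }

    /** Merges the per-PR hit lists, orders by descending score, keeps the top `limit`. */
    static List<Hit> mergeAndTrim(List<List<Hit>> perPartitionHits, int limit) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perPartitionHits) {
            all.addAll(hits);
        }
        Comparator<Hit> byScoreDescending = (a, b) -> Float.compare(b.score, a.score);
        all.sort(byScoreDescending); // stable sort: ties keep PR order
        return new ArrayList<>(all.subList(0, Math.min(limit, all.size())));
    }

    public static void main(String[] args) {
        List<Hit> pr1 = List.of(new Hit("a", 0.9f), new Hit("b", 0.4f));
        List<Hit> pr2 = List.of(new Hit("c", 0.7f));
        for (Hit h : mergeAndTrim(List.of(pr1, pr2), 2)) {
            System.out.println(h.key + " " + h.score);
        }
    }
}
```

Each PR only needs to return its own top `limit` hits, since no PR can contribute more than that to the final result.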
Here the search request is handled by Lucene, and Lucene's parser and aggregator are used. DistributedFSDirectory will provide a unified view to Lucene. Lucene will request DistributedFSDirectory to fetch index chunks. DistributedFSDirectory will aggregate the index chunks from the PRs that host the data. This is similar in behavior to a Cache Client, which reaches different PRs and provides a unified data view to the user.
() User -> [Cache] : Search
node cluster {
  database {
    () indexPR1
  }
  [Cache] ..> [PR 1]
  [PR 1] --> [LucenePR1]
  [LucenePR1] --> [DistributedFSDirectory]
  [DistributedFSDirectory] -down-> [FSDirectoryPR1]
  [FSDirectoryPR1] -> indexPR1
  database {
    () indexPR2
  }
  [DistributedFSDirectory] -down-> [FSDirectoryPR2]
  [FSDirectoryPR2] -> indexPR2
}
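The unified view that DistributedFSDirectory presents to Lucene might be sketched as follows. This is an assumption-laden illustration, not the proposed implementation: `DistributedDirectorySketch` and `PartitionStore` are hypothetical names, each PR is modeled as an in-memory map from file name to ordered chunks, and each file is assumed to live on exactly one PR.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Sketch of a directory that hides which PR hosts which index file (hypothetical names).
final class DistributedDirectorySketch {

    /** Stand-in for one PR's local store: file name -> ordered list of chunks. */
    static final class PartitionStore {
        private final Map<String, List<byte[]>> files = new HashMap<>();

        void appendChunk(String fileName, byte[] chunk) {
            files.computeIfAbsent(fileName, k -> new ArrayList<>()).add(chunk);
        }

        List<byte[]> chunks(String fileName) {
            return files.get(fileName); // null when this PR does not host the file
        }

        Iterable<String> fileNames() {
            return files.keySet();
        }
    }

    private final List<PartitionStore> partitions;

    DistributedDirectorySketch(List<PartitionStore> partitions) {
        this.partitions = partitions;
    }

    /** Unified, sorted listing of the file names found on any PR. */
    List<String> listAll() {
        TreeSet<String> names = new TreeSet<>();
        for (PartitionStore p : partitions) {
            for (String name : p.fileNames()) {
                names.add(name);
            }
        }
        return new ArrayList<>(names);
    }

    /** Locates the PR hosting the file and concatenates its chunks in order. */
    byte[] readFile(String fileName) {
        for (PartitionStore p : partitions) {
            List<byte[]> chunks = p.chunks(fileName);
            if (chunks != null) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                for (byte[] chunk : chunks) {
                    out.write(chunk, 0, chunk.length);
                }
                return out.toByteArray();
            }
        }
        throw new IllegalArgumentException("no such file: " + fileName);
    }

    public static void main(String[] args) {
        PartitionStore pr1 = new PartitionStore();
        PartitionStore pr2 = new PartitionStore();
        pr1.appendChunk("file-a", new byte[] {1, 2});
        pr1.appendChunk("file-a", new byte[] {3});
        pr2.appendChunk("file-b", new byte[] {9});
        DistributedDirectorySketch dir = new DistributedDirectorySketch(List.of(pr1, pr2));
        System.out.println(dir.listAll());
        System.out.println(dir.readFile("file-a").length);
    }
}
```

The caller never learns which PR served a file, which mirrors how a Cache Client hides data placement from the user.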
Advantages
Here the search request is handled by Solr. Solr distributes queries to Solr agents, and its aggregator is used. SolrCloud solves some issues related to index distribution. These issues are not relevant if the index is managed in Cache, so Solr's *Distributed Search* seems like a promising solution.
Before SolrCloud, Solr supported Distributed Search, which allowed one query to be executed across multiple shards, so the query was executed against the entire Solr index and no documents would be missed from the search results. So splitting the core across shards is not exclusively a SolrCloud concept. There were, however, several problems with the distributed approach that necessitated improvement with SolrCloud:
- Splitting of the core into shards was somewhat manual.
- There was no support for distributed indexing, which meant that you needed to explicitly send documents to a specific shard; Solr couldn't figure out on its own what shards to send documents to.
- There was no load balancing or failover, so if you got a high number of queries, you needed to figure out where to send them and if one shard died it was just gone.
() User -> [Cache] : Search
node cluster {
  database {
    () indexPR1
  }
  [Cache] ..> [PR 1]
  [PR 1] --> [SolrServer]
  [SolrServer] --> [SolrPR1]
  [SolrPR1] -down-> [FSDirectoryPR1]
  [FSDirectoryPR1] -> indexPR1
  database {
    () indexPR2
  }
  [SolrServer] --> [SolrPR2]
  [SolrPR2] -down-> [FSDirectoryPR2]
  [FSDirectoryPR2] -> indexPR2
}
A custom implementation of IndexWriter and IndexReader could be provided as an alternative to the FSDirectory implementation. FSDirectory is a file-like interface. Lucene constructs a file and hands it over to FSDirectory for writes and reads. Lucene manages file merges. The directory implementation does not have visibility into the contents of the file. The IndexWriter approach is one layer above FSDirectory. Lucene interacts with the IndexReader/IndexWriter layer at document- and term-level granularity. The following are the important classes and methods to look at:
org.apache.lucene.index.MultiReader: An IndexReader which reads multiple indexes, appending their content.
termDocs(Term term): Returns an enumeration of all the documents which contain term.
termPositions: Returns an enumeration of all the documents which contain term. For each document, in addition to the document number and frequency of the term in that document, a list of all of the ordinal positions of the term in the document is available.
org.apache.lucene.index.IndexWriter
updateDocument, addDocument
IndexWriter can control how the terms are distributed and persisted. In case of a distributed search, MultiReader can distribute the query to shard based sub-readers and each sub-reader streams filtered results from the shard to the query coordinator.
A map with this form <term, map <docId, list <position>>> is needed for supporting various lucene functions.
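A minimal sketch of such a map, in plain Java, is shown below. `PositionalIndexSketch` is a hypothetical name, tokenization is reduced to lowercasing and whitespace splitting, and the `termDocs` and `termPositions` methods are only rough analogues of the Lucene methods of the same name listed above, not their actual signatures.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of <term, map<docId, list<position>>> (hypothetical names, no Lucene dependency).
final class PositionalIndexSketch {

    // term -> (docId -> ordinal positions of the term within that document)
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    /** Indexes a document using naive whitespace tokenization. */
    void addDocument(int docId, String text) {
        String[] tokens = text.trim().toLowerCase().split("\\s+");
        for (int position = 0; position < tokens.length; position++) {
            if (tokens[position].isEmpty()) {
                continue; // empty input yields a single empty token
            }
            postings.computeIfAbsent(tokens[position], t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position);
        }
    }

    /** Ids of the documents that contain the term, in ascending order. */
    List<Integer> termDocs(String term) {
        Map<Integer, List<Integer>> docs = postings.get(term);
        if (docs == null) {
            return new ArrayList<>();
        }
        return new ArrayList<>(docs.keySet());
    }

    /** For each document containing the term, the positions where it occurs. */
    Map<Integer, List<Integer>> termPositions(String term) {
        Map<Integer, List<Integer>> docs = postings.get(term);
        if (docs == null) {
            return Collections.emptyMap();
        }
        return docs;
    }

    public static void main(String[] args) {
        PositionalIndexSketch index = new PositionalIndexSketch();
        index.addDocument(1, "to be or not to be");
        index.addDocument(2, "be quick");
        System.out.println(index.termDocs("be"));
        System.out.println(index.termPositions("to"));
    }
}
```

The term frequency in a document is the length of its position list, and phrase queries can be answered by checking for adjacent positions across terms.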
Lucene / Solr support flat, JSON and API based interfaces for faceting
// Create Readers
DirectoryReader indexReader = DirectoryReader.open(indexDir);
IndexSearcher searcher = new IndexSearcher(indexReader);
TaxonomyReader taxoReader = new DirectoryTaxonomyReader(taxoDir);
// Create counters along dimensions
FacetSearchParams fsp = new FacetSearchParams(new CountFacetRequest(new CategoryPath("Author"), 10));
// Aggregates the facet counts
FacetsCollector fc = FacetsCollector.create(fsp, searcher.getIndexReader(), taxoReader);
// Search
searcher.search(...);
// Retrieve results
List<FacetResult> facetResults = fc.getFacetResults();
{
high_popularity : {
type : query,
q : "popularity:[8 TO 10]",
facet : { average_price : "avg(price)" }
}
}
Example response:
"high_popularity": {
"count": 147,
"average_price": 74.25
}
{
prices : {
type : range,
field : price,
start : 0,
end : 40,
gap : 20
}
}
"prices": {
  "buckets": [
    {
      "val": 0.0,  // the bucket value represents the start of each range. This bucket covers 0-20
      "count": 5
    },
    {
      "val": 20.0,
      "count": 1
    }
  ]
}
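How the range facet arrives at such buckets can be sketched in plain Java. This is an illustration rather than Solr's implementation: `RangeFacetSketch` and `rangeFacet` are hypothetical names, the price values are made up to reproduce the counts above, and buckets are assumed to be lower-inclusive and upper-exclusive with values outside [start, end) ignored.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of range-facet bucketing (hypothetical names, no Solr dependency).
final class RangeFacetSketch {

    /**
     * Counts values into buckets of width gap covering [start, end).
     * Each key is the start of its bucket, matching "val" in the response.
     */
    static Map<Double, Integer> rangeFacet(double[] values, double start, double end, double gap) {
        if (gap <= 0 || end <= start) {
            throw new IllegalArgumentException("need gap > 0 and end > start");
        }
        int bucketCount = (int) Math.ceil((end - start) / gap);
        int[] counts = new int[bucketCount];
        for (double v : values) {
            if (v < start || v >= end) {
                continue; // outside the faceted range
            }
            int idx = (int) Math.floor((v - start) / gap);
            if (idx >= bucketCount) {
                idx = bucketCount - 1; // guard against floating-point rounding
            }
            counts[idx]++;
        }
        Map<Double, Integer> buckets = new LinkedHashMap<>();
        for (int i = 0; i < bucketCount; i++) {
            buckets.put(start + i * gap, counts[i]);
        }
        return buckets;
    }

    public static void main(String[] args) {
        double[] prices = {5, 10, 12, 15, 19, 25, 45}; // made-up sample data
        System.out.println(rangeFacet(prices, 0, 40, 20)); // prints {0.0=5, 20.0=1}
    }
}
```

With start 0, end 40 and gap 20, the two buckets cover 0-20 and 20-40, and the value 45 falls outside the range and is not counted.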