facet

The facet function provides aggregations that are rolled up over buckets. Under the covers, the facet function pushes the aggregation down into the search engine using Solr's JSON Facet API. This provides sub-second performance for many use cases. The facet function is appropriate for use with a low to moderate number of distinct values in the bucket fields. For high cardinality aggregations, see the rollup function.

Parameters

  • collection: (Mandatory) Collection the facets will be aggregated from.
  • q: (Mandatory) The query to build the aggregations from.
  • buckets: (Mandatory) Comma separated list of fields to rollup over. The comma separated list represents the dimensions in a multi-dimensional rollup.
  • bucketSorts: Comma separated list of sorts to apply to each dimension in the buckets parameter. Sorts can be on the computed metrics or on the bucket values.
  • bucketSizeLimit: The number of buckets to include. This value is applied to each dimension.
  • metrics: List of metrics to compute for the buckets. Currently supported metrics are sum(col), avg(col), min(col), max(col), count(*).

Syntax

Example 1:

Code Block
facet(collection1, 
      q="*:*", 
      buckets="a_s",
      bucketSorts="sum(a_i) desc",
      bucketSizeLimit=100,
      sum(a_i), 
      sum(a_f), 
      min(a_i), 
      min(a_f), 
      max(a_i), 
      max(a_f),
      avg(a_i), 
      avg(a_f), 
      count(*))

The example above shows a facet function with rollups over a single bucket, where the buckets are returned in descending order by the calculated value of the sum(a_i) metric.
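
Streaming expressions such as the one above are submitted to a collection's /stream handler. As a minimal sketch (assuming the Python requests library, a Solr node at localhost:8983, and a collection named collection1), the snippet below sends the expression and prints each bucket tuple from the result set.

Code Block
import requests

# The facet expression from Example 1, trimmed to a few metrics.
expr = """facet(collection1,
      q="*:*",
      buckets="a_s",
      bucketSorts="sum(a_i) desc",
      bucketSizeLimit=100,
      sum(a_i),
      count(*))"""

# POST the expression to the /stream handler of the target collection.
resp = requests.post("http://localhost:8983/solr/collection1/stream",
                     data={"expr": expr})

# Each bucket is returned as a tuple; the final tuple carries an EOF marker.
for doc in resp.json()["result-set"]["docs"]:
    print(doc)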

Example 2:

Code Block
facet(collection1, 
      q="*:*", 
      buckets="year_i, month_i, day_i",
      bucketSorts="year_i desc, month_i desc, day_i desc",
      bucketSizeLimit=100,
      sum(a_i), 
      sum(a_f), 
      min(a_i), 
      min(a_f), 
      max(a_i), 
      max(a_f),
      avg(a_i), 
      avg(a_f), 
      count(*))

The example above shows a facet function with rollups over three buckets, where the buckets are returned in descending order by bucket value.

features

The features function extracts the key terms from a text field in a classification training set stored in a SolrCloud collection. It uses an algorithm known as Information Gain to select the important terms from the training set. The features function was designed to work specifically with the train function, which uses the extracted features to train a text classifier.

The features function is designed to work with a training set that provides both positive and negative examples of a class. It emits a tuple for each feature term that is extracted along with the inverse document frequency (IDF) for the term in the training set. 

The features function uses a query to select the training set from a collection. The IDF for each selected feature is calculated relative to the training set matching the query. This allows multiple training sets to be stored in the same SolrCloud collection without polluting the IDF across training sets.
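
The selection step can be pictured with a small, self-contained sketch. The Python function below (hypothetical names, not Solr's implementation) computes the information gain of a single term from document counts in the positive and negative classes; terms with the highest gain are the most useful for separating the classes, and the features function emits the top numTerms of them.

Code Block
import math

def entropy(p):
    # Shannon entropy of a binary distribution with positive-class probability p.
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def information_gain(pos_with_term, neg_with_term, pos_total, neg_total):
    # Information gain of a term: class entropy minus the conditional entropy
    # after splitting the training set on presence/absence of the term.
    total = pos_total + neg_total
    with_term = pos_with_term + neg_with_term
    without_term = total - with_term
    h_class = entropy(pos_total / total)
    h_with = entropy(pos_with_term / with_term) if with_term else 0.0
    h_without = entropy((pos_total - pos_with_term) / without_term) if without_term else 0.0
    h_cond = (with_term / total) * h_with + (without_term / total) * h_without
    return h_class - h_cond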

Parameters

  • collection: (Mandatory) The collection that holds the training set.
  • q: (Mandatory) The query that defines the training set. The IDF for the features will be generated specific to the result set matching the query.
  • featureSet: (Mandatory) The name of the feature set. This can be used to retrieve the features if they are stored in a SolrCloud collection.
  • field: (Mandatory) The text field to extract the features from.
  • outcome: (Mandatory) The field that defines the class, positive or negative.
  • numTerms: (Mandatory) How many feature terms to extract.
  • positiveLabel: (defaults to 1) The value in the outcome field that defines a positive outcome.

Syntax

Code Block
features(collection1, 
         q="*:*", 
         featureSet="features1", 
         field="body", 
         outcome="out_i", 
         numTerms=250)

gatherNodes

The gatherNodes function provides breadth-first graph traversal. For details, see the section Graph Traversal.

model

The model function retrieves and caches logistic regression text classification models that are stored in a SolrCloud collection. The model function is designed to work with models that are created by the train function, but can also be used to retrieve text classification models trained outside of Solr, as long as they conform to the specified format. After the model is retrieved it can be used by the classify function to classify documents.

A single model tuple is fetched and returned based on the id parameter. The model is retrieved by matching the id parameter with a model name in the index. If more than one iteration of the named model is stored in the index, the highest iteration is selected.

Caching

The model function has an internal LRU (least-recently-used) cache so models do not have to be retrieved with each invocation of the model function. The time to cache for each model ID can be passed as a parameter to the function call. Retrieving a cached model does not reset the time for expiring the model ID in the cache.
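
As an illustration of this caching behavior (a sketch only, with hypothetical names, not the class Solr uses), a cache can evict on both size and a fixed time-to-live set at insertion, so that reads refresh the LRU order but never the expiry:

Code Block
import time
from collections import OrderedDict

class ModelCache:
    def __init__(self, max_size=100):
        self.max_size = max_size
        self.entries = OrderedDict()  # model id -> (model, expires_at)

    def get(self, model_id):
        entry = self.entries.get(model_id)
        if entry is None or entry[1] < time.time():
            self.entries.pop(model_id, None)   # expired or missing
            return None
        self.entries.move_to_end(model_id)     # refresh LRU order, not the expiry
        return entry[0]

    def put(self, model_id, model, cache_millis):
        # The expiry is fixed at insertion time, mirroring cacheMillis.
        self.entries[model_id] = (model, time.time() + cache_millis / 1000.0)
        self.entries.move_to_end(model_id)
        if len(self.entries) > self.max_size:
            self.entries.popitem(last=False)   # evict the least-recently-used entry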

Model Storage

The models are stored in Solr using the fields described below. The train function outputs this format, so you only need to know the schema details if you plan to use the model function with logistic regression models trained outside of Solr (see the sketch after the field list).

  • name_s (Single value, String, Stored): The name of the model.
  • iteration_i (Single value, Integer, Stored): The iteration number of the model. Solr can store all iterations of the models generated by the train function. 
  • terms_ss (Multi value, String, Stored): The array of terms/features of the model.
  • weights_ds (Multi value, double, Stored): The array of term weights. Each weight corresponds by array index to a term.
  • idfs_ds (Multi value, double, Stored): The array of term IDFs (Inverse document frequency). Each IDF corresponds by array index to a term.
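
To make the schema concrete, the following Python sketch applies a stored model to a tokenized document using a tf-idf style feature value and the logistic (sigmoid) function. This is only one plausible way to consume the stored fields outside of Solr; the classify function computes its features internally and may scale them differently.

Code Block
import math

def score_document(tokens, model):
    # model is a dict with the stored fields: terms_ss, weights_ds, idfs_ds.
    total = 0.0
    for term, weight, idf in zip(model["terms_ss"], model["weights_ds"], model["idfs_ds"]):
        tf = tokens.count(term)
        if tf:
            total += weight * tf * idf   # assumed tf-idf style feature value
    # Sigmoid maps the weighted sum to a probability of the positive class.
    return 1.0 / (1.0 + math.exp(-total))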

Parameters

  • collection: (Mandatory) The collection where the model is stored.
  • id: (Mandatory) The id/name of the model. The model function always returns one model. If there are multiple iterations of the named model, the highest iteration is returned.
  • cacheMillis: (Optional) The amount of time to cache the model in the LRU cache.

Syntax

Code Block
model(modelCollection, 
      id="myModel"
      cacheMillis="200000") 

random

The random function searches a SolrCloud collection and emits a pseudo-random set of results that match the query. Each invocation of random will return a different pseudo-random result set.

Parameters

  • collection: (Mandatory) The collection the pseudo-random result set will be drawn from.
  • q: (Mandatory) The query to select the result set from.
  • rows: (Mandatory) The number of pseudo-random results to return.
  • fl: (Mandatory) The field list to return.
  • fq: (Optional) Filter query

Syntax

Code Block
random(baskets, 
       q="productID:productX", 
       rows="100", 
       fl="basketID") 

In the example above the random function is searching the baskets collection for all documents matching "productID:productX". It will return 100 pseudo-random results. The field list returned is the basketID.

significantTerms

The significantTerms function queries a SolrCloud collection, but instead of returning documents, it returns significant terms found in documents in the result set. The significantTerms function scores terms based on how frequently they appear in the result set and how rarely they appear in the entire corpus. The significantTerms function emits a tuple for each term which contains the term, the score, the foreground count and the background count. The foreground count is the number of documents in the result set that contain the term. The background count is the number of documents in the entire corpus that contain the term. The foreground and background counts are global for the collection.
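
One illustrative way to turn these counts into a score (not necessarily the exact formula Solr applies internally) is to reward terms that are common in the foreground set but rare in the corpus:

Code Block
import math

def significance_score(foreground, background, foreground_total, corpus_size):
    # Fraction of the result set that contains the term...
    fg_rate = foreground / foreground_total
    # ...boosted by how rare the term is in the whole corpus.
    rarity = math.log((corpus_size + 1) / (background + 1))
    return fg_rate * rarity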

Parameters

  • collection: (Mandatory) The collection that the function is run on.
  • q: (Mandatory) The query that describes the foreground document set.
  • limit: (Optional, Default 20) The max number of terms to return.
  • minDocFreq: (Optional, Defaults to 5 documents) The minimum number of documents the term must appear in on a shard. This is a float value. If greater than 1.0 it is treated as the absolute number of documents; if less than 1.0 it is treated as a percentage of documents.
  • maxDocFreq: (Optional, Defaults to 30% of documents) The maximum number of documents the term can appear in on a shard. This is a float value. If greater than 1.0 it is treated as the absolute number of documents; if less than 1.0 it is treated as a percentage of documents.
  • minTermLength: (Optional, Default 4) The minimum length of the term to be considered significant.

Syntax

Code Block
significantTerms(collection1, 
                 q="body:Solr", 
                 minDocFreq="10",
                 maxDocFreq=".20",
                 minTermLength="5")

In the example above the significantTerms function is querying collection1 and returning at most 20 significant terms (the default limit) that are at least 5 characters long and appear in 10 or more documents, but in no more than 20% of the corpus.

shortestPath

The shortestPath function is an implementation of a shortest path graph traversal. The shortestPath function performs an iterative breadth-first search through an unweighted graph to find the shortest paths between two nodes in a graph. The shortestPath function emits a tuple for each path found. Each tuple emitted will contain a path key which points to a List of nodeIDs comprising the path.

Parameters

  • collection: (Mandatory) The collection that the traversal will be run on.
  • from: (Mandatory) The nodeID to start the search from.
  • to: (Mandatory) The nodeID to end the search at.
  • edge: (Mandatory) Syntax: from_field=to_field. The from_field defines which field to search from. The to_field defines which field to search to. See the example below for a detailed explanation.
  • threads: (Optional : Default 6) The number of threads used to perform the partitioned join in the traversal.
  • partitionSize: (Optional : Default 250) The number of nodes in each partition of the join.
  • fq: (Optional) Filter query
  • maxDepth: (Mandatory) Limits the search to a maximum depth in the graph.

Syntax

Code Block
shortestPath(collection, 
             from="john@company.com", 
             to="jane@company.com",
             edge="from_address=to_address",
             threads="6",
             partitionSize="300", 
             fq="limiting query", 
             maxDepth="4")

The expression above performs a breadth-first search to find the shortest paths in an unweighted, directed graph.

The search starts from the nodeID "john@company.com" in the from_address field and searches for the nodeID "jane@company.com" in the to_address field. This search is performed iteratively until the maxDepth has been reached. Each level in the traversal is implemented as a parallel partitioned nested loop join across the entire collection. The threads parameter controls the number of threads performing the join at each level, while the partitionSize parameter controls the number of nodes in each join partition. The maxDepth parameter controls the number of levels to traverse. fq is a limiting query applied to each level in the traversal.
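
Conceptually, the traversal behaves like the breadth-first search sketched below in Python (a hypothetical helper, not Solr's implementation); in Solr each level is executed as a parallel, partitioned join against the collection rather than against an in-memory edge map.

Code Block
def shortest_paths(edges, start, end, max_depth):
    # edges maps a from_field value to the list of to_field values it points at.
    frontier = [[start]]
    visited = {start}
    for _ in range(max_depth):
        found, next_frontier, level_nodes = [], [], set()
        for path in frontier:
            for neighbor in edges.get(path[-1], []):
                if neighbor == end:
                    found.append(path + [neighbor])       # complete shortest path
                elif neighbor not in visited:
                    next_frontier.append(path + [neighbor])
                    level_nodes.add(neighbor)
        if found:
            return found           # all shortest paths end at the same depth
        visited |= level_nodes     # nodes seen at this level cannot begin a shorter path later
        frontier = next_frontier
    return []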

stats

The stats function gathers simple aggregations for a search result set. The stats function does not support rollups over buckets, so the stats stream always returns a single tuple with the rolled up stats. Under the covers the stats function pushes down the generation of the stats into the search engine using the StatsComponent. The stats function currently supports the following metrics: count(*), sum(), avg(), min(), and max().

Parameters

  • collection: (Mandatory) Collection the stats will be aggregated from.
  • q: (Mandatory) The query to build the aggregations from.
  • metrics: (Mandatory) The metrics to include in the result tuple. Currently supported metrics are sum(col), avg(col), min(col), max(col) and count(*).

Syntax

Code Block
stats(collection1, 
      q="*:*", 
      sum(a_i), 
      sum(a_f), 
      min(a_i), 
      min(a_f), 
      max(a_i), 
      max(a_f), 
      avg(a_i), 
      avg(a_f), 
      count(*))

timeseries

train

The train function trains a Logistic Regression text classifier on a training set stored in a SolrCloud collection. It uses a parallel, iterative, batch gradient descent approach to train the model. The training algorithm is embedded inside Solr so with each iteration only the model is streamed across the network.

The train function wraps a features function which provides the terms and inverse document frequency (IDF) used to train the model. The train function operates over the same training set as the features function, which includes both positive and negative examples of the class.

With each iteration the train function emits a tuple with the model. The model contains the feature terms, weights, and the confusion matrix for the model. The optimized model can then be used to classify documents based on their feature terms.
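
The core of each iteration can be sketched as plain batch gradient descent for logistic regression (a conceptual Python sketch with hypothetical names; Solr parallelizes the batch across the collection's shards and streams only the model back):

Code Block
import math

def train_iterations(examples, labels, iterations, learning_rate=0.01):
    # examples: list of feature vectors; labels: 1.0 (positive) or 0.0 (negative).
    weights = [0.0] * len(examples[0])
    for _ in range(iterations):
        gradient = [0.0] * len(weights)
        for x, y in zip(examples, labels):
            z = sum(w * xi for w, xi in zip(weights, x))
            error = 1.0 / (1.0 + math.exp(-z)) - y       # prediction minus outcome
            for j, xi in enumerate(x):
                gradient[j] += error * xi
        # One batch update per iteration (Solr emits a model tuple after each pass).
        weights = [w - learning_rate * g / len(examples) for w, g in zip(weights, gradient)]
    return weights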

Parameters

  • collection: (Mandatory) Collection that holds the training set.
  • q: (Mandatory) The query that defines the training set. The IDF for the features will be generated specific to the result set matching the query.
  • name: (Mandatory) The name of the model. This can be used to retrieve the model if it is stored in a SolrCloud collection.
  • field: (Mandatory) The text field to extract the features from.
  • outcome: (Mandatory) The field that defines the class, positive or negative.
  • maxIterations: (Mandatory) How many training iterations to perform.
  • positiveLabel: (defaults to 1) The value in the outcome field that defines a positive outcome.

Syntax

Code Block
train(collection1,
      features(collection1, q="*:*", featureSet="first", field="body", outcome="out_i", numTerms=250),
      q="*:*",
      name="model1",
      field="body",
      outcome="out_i",
      maxIterations=100)

topic

...