This page has been replaced with a newer version for Solr 6, found at Streaming Expressions. The content on this page is valid for Solr 5 ONLY.
Streaming Expressions provide a simple query language for SolrCloud that merges search with parallel computing. Under the covers Streaming Expressions are backed by a java Streaming API that provides a fast map/reduce implementation for SolrCloud.
Streaming Expressions provide a higher level query language so you don't have to be a java programmer to access the parallel computing capabilities in the Streaming API.
The Language Basics
- Streaming Expressions are composed of functions.
- All functions behave like Streams, which means that they don't hold all the data in memory at once.
- Functions are designed to work with entire result sets rather then the top N results like normal search. The /export handler was developed to support this.
- Functions emit a stream of Tuples (key/value Maps)
- Functions can be compounded, or wrapped, to form complex operations
- Functions can be parallelized across a worker collection.
- Under the covers SolrCloud supports map/reduce "Shuffling" to partition result sets across worker nodes.
- Under the covers all functions map and are compiled to Streaming API java objects.
Searches a SolrCloud collection and emits a stream of Tuples that match the query.
- Collection: (Mandatory) the collection being search
- zkHost : (Optional) if collection being searched is found in a different zkHost then the local stream handler.
- qt : (Optional) specifies the query type. Set this to "/export" to work with large result sets. Defaults to "/select".
- rows: (Mandatory with /select handler) The rows parameter specifies how many rows to return. This param is only needed with the /select handler as the /export handler always returns all rows.
- q : (Mandatory) The query
- fl : (Mandatory) The field list
- sort : (Mandatory) The sort criteria
- partitionKeys: A list of fields (comma separated), to partition search results across worker nodes. This parameter is only required with the parallel() expression (see below)
Merges two Streaming Expressions and maintains the ordering of the underlying streams. Note that the merge function maintains the ordering of the underlying streams, so the sorts of the underlying streams must line up with the sort parameter provided to the merge function.
- StreamExpression A
- StreamExpression B
- on : Sort criteria for performing the merge
Wraps a Streaming Expression and groups tuples by common fields. The group function emits one group head Tuple per group. The group head Tuple contains a list of all the Tuples within the group. The group by parameter must match-up with the sort order of the underlying stream. The group function implements a non-co-located grouping algorithm. This means that records from the same group do not need to be co-located on the same shard. When executed in parallel the partitionKeys parameter must be the same as the group by field so that records from the same group will be shuffled to the same worker.
- by : grouping criteria
Wraps a Streaming Expression and emits a unique stream of Tuples based on the over parameter. The unique function relies on the sort order of the underlying stream. The over parameter must match up with the sort order of the underlying stream. The unique function implements a non-co-located unique algorithm. This means that records with the same unique over field do not need to be co-located on the same shard. When executed in the parallel, the partitionKeys parameter must be the same as the unique over field so that records with the same keys will be shuffled to the same worker.
- over: unique criteria
Wraps a Streaming Expression and re-orders the Tuples. The top function emits only the top N tuples in the new sort order. The top function re-orders the underlying stream so the sort criteria does not have to match up with the underlying stream.
- n : Number of top tuples to return
- sort: Sort criteria for selecting the top N tuples.
The expression below finds the top 3 results of the underlying search. Notice that it reverses the sort order. The top function re-orders the results of the underlying stream.
Wraps a Streaming Expression and sends it to N worker nodes to be processed in parallel. The parallel function requires that the partitionKeys parameter be provided to the underlying searches. The partitionKeys parameter will partition the search results (Tuples) across the worker nodes. Tuples with the same values in the partitionKeys field will be shuffled to the same worker nodes. The parallel function maintains the sort order of the Tupes returned by the worker nodes. So the sort criteria of the parallel function must match up with the sort order of the Tuples returned by the workers.
- Collection: Name of the worker collection to send the StreamExpression to.
- StreamExpression: Expression to send to the worker collection.
- workers: Number of workers in the worker collection to send the expression to.
- zkHost: zkHost where the worker collection resides.
- sort: sort criteria for ordering Tuples returned by the worker nodes.
HTTP Interface: Stream Handler
Solr has a new /stream handler that takes Streaming Expression requests and returns the Tuples as a JSON stream. The stream http parameter is used to specify the Streaming Expression. For example, this curl command (encodes and) POSTS to the
/stream handler a simple "search()" expression:
For the above example the
/stream handler responded with the following JSON response (without the formatting):
This response needs to be handled in a different manner then a normal search response. The first thing to notice is that numFound and start are both set to -1. In the first release these values are unsupported and in future releases may be removed. The reason is that numFound is often impossible to know because the streams are being transformed after they leave Solr. So only after the final EOF Tuple is read can you be sure the stream is finished. The start header from Solr is also not supported as in the future paging will likely be implement above Solr in a Streaming Expression.
Also notice the final doc which only contains "EOF": true. This is the EOF Tuple which marks the end of the stream. In your code you'll need to use a streaming JSON implementation because Streaming Expressions return the entire result set which may be millions of results. In your JSON client you'll need to iterate each doc (Tuple) and check for the EOF Tuple to determine the end of stream.
In the future the EOF Tuple will also hold aggregations that are gathered by Streaming Expressions.
The org.apache.solr.client.solrj.io package provides Java classes that compile Streaming Expressions into live Streaming API objects. These classes can be used to execute Streaming Expressions from inside a Java application. For example:
The search expression allows you to specify a request hander using the qt parameter. By default the /select handler is used. This /select handler can be used for simple rapid prototyping of expressions. For production you will most likely want to use the /export handler which is designed to sort and export entire result sets. The /export handler is not used by default because it has much stricter requirements then the /select handler so it's not as easy to get started working with. To read more about the export handlers requirements review the documentation.
The parallel expression sends a Streaming Expression to a worker collection to be executed in parallel. A worker collection can be any SolrCloud collection that has the /stream handler configured. Unlike normal SolrCloud collections, worker collections don't have to hold any data. Worker collections can be empty collections that exists only to execute Streaming Expressions.
The initial release of Streaming Expressions and the Streaming API require that all sort fields contain non-null data. There are also a number of data requirements for the /export handler concerning sort and fl fields, please see the /export handler documentation for details.