Status
Current state: Under Discussion
Discussion thread: here (<- link to https://mail-archives.apache.org/mod_mbox/lucene-dev/)
JIRA: https://issues.apache.org/jira/browse/SOLR-14597
Release: 8.7 (proposed)
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast). Confluence supports inline comments that can also be used.
Motivation
Solr has a wide array of query parsers, but lacks one targeted at advanced users who prioritize selecting an accurate set of documents over the relevancy-based ordering of documents. Such users are often non-technical, but they use the search interface in a professional capacity daily and are able to invest time in learning query syntax. The standard Solr query parser, along with its facility for local parameters, is the total Swiss Army knife, but has a syntax that only a search engineer could love. The XML query parser is even more capable and even less user friendly. Edismax is great for use cases where relevancy is a primary concern and users are not expected to have the time or patience to learn an explicit syntax. There are also a number of very useful features, such as span queries and complex phrase queries, that presently require an understanding of local parameters to use. Providing them to end users therefore requires either query string pre-processing or allowing users to invoke any query parser they like via local params.
Additionally, the current set of parsers is very aggressive in removing punctuation before passing tokens on to analysis, which makes it very difficult to craft systems that work with punctuation in the query string. A parser that minimizes its impact on punctuation opens up new possibilities, such as synonyms based on patterns that include punctuation, for example equating 401(k) with 401k.
Attribution
This parser was developed by the Library of Congress for use on Congress.gov and is being donated to Lucene/Solr. The contact person for legal, procedural and process matters is Mike Nibeck, and the primary technical contact is Gus Heck (who was contracted to lead the development effort for this feature). Additional contributors include Rohit Gupta, Jay Muntz, Peter Fries and Steve Ge. The Library of Congress CLA is already on file; Gus and Mike are listed.
Public Interfaces
This SIP will add a new query parser and several new analysis components that can be used to leverage its ability to pass punctuation down into analysis. Specifically, it will add:
- org.apache.solr.aqp.QueryParser generated from org/apache/solr/aqp/QueryParser.jj and supporting classes.
- org.apache.solr.analysis.TokenAnalyzerFilterFactory which is able to apply the analysis from any type in a solr schema to an individual token.
- org.apache.lucene.analysis.miscellaneous.DropIfFlaggedFilterFactory which allows for the dropping of tokens that bear a given set of flags.
- org.apache.lucene.analysis.miscellaneous.PatternTypingFilterFactory which recognizes tokens based on regex patterns and then applies types and flags to those tokens.
Proposed Changes
Parser
The advanced query syntax is distinct from the existing syntaxes in several ways:
- There are no infix operators, all operators are prefix operators
- Most operators are distributive across parentheses
- Some operators have been assigned to different punctuators
It has the following constructs:
Name | Symbol | Example | Explanation |
SHOULD | ~ | ~foo | explicitly override the default operator to enforce should logic |
MUST_NOT | ! | !foo | Chosen over '-' to reduce conflicts with hyphenated words. It is not an operator at the end of a token, so there is no conflict with exclamations (Spanish uses an upside down ! in front) |
MUST | + | +foo | Similar to standard query parser |
ANALYZED_PHRASE | "" | "foo" | phrase search including synonyms and full analysis |
LITERAL_PHRASE | '' | 'foo' | phrase search with reduced analysis (see below for details) |
GROUP | () | (foo bar) | applies the default operator (or other specified operator) to the terms within the parentheses, and causes them to be considered as a unit. |
DISTANCE | n/#() | n/3(foo bar) | Specifies a span query where foo and bar occur (in either order) within 3 tokens of each other |
ORDERED_DISTANCE | w/#() | w/4(foo bar) | Specifies a span query where foo and bar occur within 4 tokens of each other with foo occurring before bar. |
PREFIX | * | foo* | Specifies a prefix search matching any tokens starting with 'foo'. Default settings require at least 3 prefix characters. |
FIELD | : | title:foo | searches the title field for foo |
RANGE | :[ TO ] | votes:[0 TO 10} | Typical lucene range searches on text, date or numeric data, inclusive and exclusive bounds supported as in standard parser |
Several elements of other syntaxes are intentionally omitted:
- ~ is not available for fuzzy search (this operator was put to better use)
- ^ is not available for boosting (these users are not focused on tuning relevancy for others, but rather on obtaining results for themselves).
- {!foo} is not available for local params. (As a result, the advanced syntax is relatively safe to pass to Solr without pre-processing, since there are no "escape routes" into arbitrary syntaxes and parsers.)
"Literal" searches are performed by appending _lit to the field for the literal search. This is treated as a fielded phrase search on an alternate field (i.e. _text__lit or title_lit) so the following two searches are equivalent:
title:'foo and bar'
title_lit:"foo and bar"
This does impose requirements on the indexing strategy, but this is an "Advanced" feature (hence the name!) so that's ok. The result is that "literal" search can be as literal or as analyzed as desired, depending on the configuration of the corresponding _lit field.
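The equivalence between the two forms can be illustrated with a small Python sketch. This is not the parser's implementation (the real parser is generated from QueryParser.jj); the helper name rewrite_literal and the regular expression are hypothetical and only handle the simple field:'phrase' case with no apostrophes inside the phrase.

```python
import re

# Hypothetical illustration only: matches field:'phrase' where the phrase
# contains no single quotes. The proposed parser does this in its grammar.
LITERAL = re.compile(r"(\w+):'([^']*)'")

def rewrite_literal(query: str) -> str:
    """Rewrite field:'phrase' as a phrase search on the field's _lit twin."""
    return LITERAL.sub(lambda m: f'{m.group(1)}_lit:"{m.group(2)}"', query)

print(rewrite_literal("title:'foo and bar'"))  # title_lit:"foo and bar"
```

Because the suffix is simply appended to the field name, a field such as _text_ ends up as _text__lit, with a double underscore.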
Treatment of parentheses is conservative: parentheses occurring within a token do not break the token, and the parser is careful to eliminate only syntax parentheses and to leave in place any parentheses not required to complete the syntax. The query +(401(k)) will therefore require the token '401(k)' rather than '401' and 'k' or '401(k'.
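One way to picture this rule is the following Python sketch, which strips a pair of parentheses only when the opening one is closed by the very last character. The function names are hypothetical and the logic is an assumption about how such a rule could work, not the grammar used by the proposed parser.

```python
def wraps_whole(text: str) -> bool:
    """True if the first '(' is closed by the final ')' of the text."""
    if not (text.startswith("(") and text.endswith(")")):
        return False
    depth = 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                # The opening parenthesis closed here; it wraps the whole
                # text only if this is the last character.
                return i == len(text) - 1
    return False

def strip_group_parens(text: str) -> str:
    """Remove enclosing (syntax) parentheses, keep those inside a token."""
    while wraps_whole(text):
        text = text[1:-1]
    return text

# The operand of +(401(k)) once the leading '+' operator is consumed.
print(strip_group_parens("(401(k))"))  # 401(k)
```

The parentheses around k survive because they do not enclose the whole operand.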
Analysis Components
One of the major goals of this parser is to enable a configuration that can apply synonyms to punctuated constructs that have significance to the user but are typically destroyed by the existing parsers. An example configuration of a field type to achieve this (anticipating the use of this parser) looks like this:
<fieldType name="text_aqp" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.PatternTypingFilterFactory" patternFile="patterns.txt"/>
    <filter class="solr.TokenAnalyzerFilterFactory" asType="text_general" preserveType="true"/>
    <filter class="solr.TypeAsSynonymFilterFactory" prefix="__TAS__" synFlagsMask="0" ignore="word"/>
    <filter class="solr.DropIfFlaggedFilterFactory" dropFlags="2"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/> <!-- query parser already handles splitting -->
    <filter class="solr.PatternTypingFilterFactory" patternFile="patterns.txt"/>
    <filter class="solr.TokenAnalyzerFilterFactory" asType="text_en_aqp" preserveType="true"/>
    <filter class="solr.TypeAsSynonymFilterFactory" prefix="__TAS__" synFlagsMask="0" ignore="word"/>
    <filter class="solr.DropIfFlaggedFilterFactory" dropFlags="2"/>
  </analyzer>
</fieldType>

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_general_lit" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

---- patterns.txt ----
2 (\d+)\(?([a-z])\)? ::: legal2_$1_$2
2 (\d+)\(?([a-z])\)?\(?(\d+)\)? ::: legal3_$1_$2_$3
2 C\+\+ ::: c_plus_plus
There's a lot to unpack there, so starting from the top:
- The advanced query parser always splits on whitespace, so the whitespace tokenizer is used at index time to ensure corresponding tokens.
- PatternTypingFilterFactory matches incoming tokens to the 2nd (white space delimited) column of patterns.txt in sequence so the token C++ would match the last line and 401(k) would match the first line.
- Upon matching 401(k), PatternTypingFilterFactory adds the type attribute __TAS__legal2_401_k and sets the second bit of the flags attribute (determined by the first column of patterns.txt). The purpose of the __TAS__ prefix is to avoid any cases in which a token from the text might coincide with the tokens from the synonyms when this type is converted into a token later on. The ::: in patterns.txt is just a separator to make it easier to see where patterns end and replacements begin.
- TokenAnalyzerFilterFactory conducts the text_general analysis on the tokens provided and is instructed to add the existing token type to any tokens produced. At this point 401(k) is broken into 401 and k, each with type __TAS__legal2_401_k and flag = 2.
- TypeAsSynonymFilterFactory converts the type into a token, but a new ignore attribute allows it to skip the standard "word" type that every token gets by default. Note that this new token will NOT bear the flag set by PatternTypingFilterFactory.
- DropIfFlaggedFilterFactory drops every token that has all of the specified flags set. With dropFlags set to 2, a token arriving with a flags value of 5 will not be dropped, but 2, 6, 10 etc. would be dropped. If dropFlags were set to 3, then any flags attribute with both of the low bits set (3, 7, 11 etc.) would be dropped, while values such as 1, 2, 5 or 6 would be retained.
- In addition to text_general there is a text_general_lit type. It can be used by a text_aqp_literal type (omitted for brevity), which would be identical to text_aqp except for the field type configured in TokenAnalyzerFilterFactory.
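The pattern-matching step in the list above can be sketched in Python as follows. This is an illustration, not the filter's code: it assumes each rule must match the whole token, it returns the bare replacement from patterns.txt, and it assumes the __TAS__ prefix is contributed by the prefix attribute configured on TypeAsSynonymFilterFactory. The names load_rules and type_token are hypothetical.

```python
import re

# The three rules from patterns.txt: flags, pattern, ':::' separator, template.
PATTERNS_TXT = r"""
2 (\d+)\(?([a-z])\)? ::: legal2_$1_$2
2 (\d+)\(?([a-z])\)?\(?(\d+)\)? ::: legal3_$1_$2_$3
2 C\+\+ ::: c_plus_plus
"""

def load_rules(text):
    """Parse lines of the form '<flags> <regex> ::: <template>'."""
    rules = []
    for line in text.strip().splitlines():
        left, template = line.split(" ::: ")
        flags, pattern = left.split(None, 1)
        # Translate Java-style $1 group references to Python's \g<1>.
        py_template = re.sub(r"\$(\d+)", r"\\g<\1>", template.strip())
        rules.append((int(flags), re.compile(pattern), py_template))
    return rules

def type_token(token, rules):
    """Return (type, flags) from the first rule that matches, in file order."""
    for flags, pattern, template in rules:
        match = pattern.fullmatch(token)  # assumption: whole-token match
        if match:
            return match.expand(template), flags
    return "word", 0  # assumed default for tokens that match no rule

RULES = load_rules(PATTERNS_TXT)
for tok in ["401(k)", "401k", "C++", "retirement"]:
    print(tok, type_token(tok, RULES))
```

Under these assumptions both 401(k) and 401k are typed legal2_401_k with flags 2, which is what lets them converge on a single synonym token later in the chain.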
Thus you would configure fields like this:
<field name="bill_text" type="text_aqp" indexed="true" stored="false" multiValued="true"/>
<field name="bill_text_lit" type="text_aqp_literal" indexed="true" stored="false" multiValued="true"/>
The net result is that both 401k and 401(k) produce __TAS__legal2_401_k and match the same documents, but the analysis does not produce tokens for '401' or 'k', so Rhode Island phone numbers (area code 401) and documents pertaining to Vitamin K do not match.
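The last two stages of the chain can be modelled with a short Python sketch. It is a simplified model rather than the Lucene filters themselves: the Token class and function names are hypothetical, the input tokens are written out by hand as they might look after TokenAnalyzerFilterFactory, the type is assumed to be stored without the prefix, and the drop test assumes a token is removed only when all bits of dropFlags are set.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    term: str
    type: str = "word"
    flags: int = 0

def type_as_synonym(tokens, prefix="__TAS__", ignore=("word",)):
    """Emit an extra token built from each non-ignored type."""
    out = []
    for tok in tokens:
        out.append(tok)
        if tok.type not in ignore:
            # synFlagsMask="0": the synonym token carries no flags.
            out.append(Token(prefix + tok.type, tok.type, 0))
    return out

def drop_if_flagged(tokens, drop_flags=2):
    """Keep a token unless all bits in drop_flags are set on it."""
    return [t for t in tokens if (t.flags & drop_flags) != drop_flags]

def surviving_terms(tokens):
    return sorted({t.term for t in drop_if_flagged(type_as_synonym(tokens))})

# Hand-written inputs: '401(k)' split into two typed tokens, '401k' kept whole.
with_parens = [Token("401", "legal2_401_k", 2), Token("k", "legal2_401_k", 2)]
without_parens = [Token("401k", "legal2_401_k", 2)]

print(surviving_terms(with_parens))     # ['__TAS__legal2_401_k']
print(surviving_terms(without_parens))  # ['__TAS__legal2_401_k']
```

In this model the flagged originals are removed and only the unflagged synonym survives, so neither '401' nor 'k' reaches the index or the query.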
Compatibility, Deprecation, and Migration Plan
- What impact (if any) will there be on existing users?
None.
- If we are changing behavior how will we phase out the older behavior?
No behavior change, entirely additive.
- If we need special migration tools, describe them here.
None
- When will we remove the existing behavior?
Not Applicable
Security considerations
This feature improves security for users that choose to use it by eliminating the need to consider what parsers might be invoked via local parameters. There are no known ways in which it would provide new attack vectors. By default, it also prevents the most expensive varieties of prefix queries. Some of the analysis techniques may be resource-intensive, and users will need to consider carefully the performance impact of the regular expressions used in PatternTypingFilterFactory.
Test Plan
This contribution comes with an extensive unit test suite.
Rejected Alternatives
Many parts of this are possible via other means but there is no solution that provides all of the features described.
1 Comment
Alexandre Rafalovitch
PatternTypingFilterFactory and DropIfFlaggedFilterFactory seem to be quite similar to KeywordMarkerFilterFactory and TypeTokenFilterFactory to the degree that perhaps the existing classes should be enhanced instead to support additional functionality. Especially since keyword marking is integrated into other parts of Solr (e.g. not dropping it as stopword, I think). Also TypeTokenFilter can work as both a blacklist and a whitelist. Both types of filtering are useful. E.g. I used it in the book to allow to search for emails only extracted from some text: https://github.com/arafalov/solr-indexing-book/blob/master/published/text2/conf/schema.xml