This page exists for the Solr Community to share Tips, Tricks, and Advice about Analyzers, Tokenizers and Filters.
Reference material previously located on this page has been migrated to the Official Solr Reference Guide. If you need help, please consult the Reference Guide for the version of Solr you are using. The sections below will point to corresponding sections of the Reference Guide for each specific feature.
If you'd like to share information about how you use this feature, please add it to this page.
Analyzers, Tokenizers, and Token Filters
For a complete list of the Tokenizers and TokenFilters that come out of the box, please consult the Lucene javadocs, the Solr javadocs, and the automatically generated list at solr-start.com (look at the analyzer-* entries; there are quite a few). If you have any tips or tricks you'd like to mention about using any of these classes, please add them below.
Note: For a good background on Lucene Analysis, it's recommended that you read the following sections in Lucene In Action:
- 1.5.3 : Analyzer
- Chapter 4.0 through 4.7 at least
High Level Concepts
Individual Solr stemmers are documented in the Solr Reference Guide section Filter Descriptions.
Analyzers are documented in the Solr Reference Guide section Analyzers.
CharFilters are documented in the Solr Reference Guide section CharFilterFactories.
Tokenizers are documented in the Solr Reference Guide section Tokenizers.
Token Filters are documented in the Solr Reference Guide section Filter Descriptions.
Specifying an Analyzer in the schema
If you want to use custom CharFilters, Tokenizers or TokenFilters, you'll need to write a very simple factory that subclasses BaseTokenizerFactory or BaseTokenFilterFactory.
When To Use a CharFilter vs a TokenFilter
There are several pairs of CharFilters and TokenFilters that have related functionality (e.g. MappingCharFilter and ASCIIFoldingFilter) or nearly identical functionality (e.g. PatternReplaceCharFilterFactory and PatternReplaceFilterFactory), and it may not always be obvious which is the best choice.
The ultimate decision depends largely on which Tokenizer you are using, and whether you need to "outsmart" it by preprocessing the stream of characters.
For example, maybe you have a tokenizer such as StandardTokenizer and you are pretty happy with how it works overall, but you want to customize how some specific characters behave.
In such a situation you could modify the rules and re-build your own tokenizer with javacc, but perhaps it's easier to simply map some of the characters before tokenization with a CharFilter.
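The trade-off above can be illustrated with a small sketch. This is not Solr or Lucene code: `simple_tokenize` is a stand-in tokenizer that keeps only runs of word characters, `map_chars` is a stand-in for a mapping CharFilter, and the `MAPPING` table is a hypothetical one chosen for the illustration. It shows why a mapping applied before tokenization can preserve information that no TokenFilter could recover afterwards.

```python
import re


def simple_tokenize(text):
    """Stand-in tokenizer: lowercases and keeps runs of word characters only."""
    return re.findall(r"\w+", text.lower())


def map_chars(text, mapping):
    """Stand-in CharFilter: rewrites the raw character stream before tokenization."""
    for source, target in mapping.items():
        text = text.replace(source, target)
    return text


# Hypothetical mapping, chosen only for this illustration.
MAPPING = {"++": "plusplus", "#": "sharp"}

text = "C++ and C#"

# Without the character mapping, the tokenizer discards "+" and "#",
# so both language names collapse to the same token.
without_char_filter = simple_tokenize(text)

# With the mapping applied first, the distinction survives tokenization.
with_char_filter = simple_tokenize(map_chars(text, MAPPING))

print(without_char_filter)
print(with_char_filter)
```

Once the tokenizer has emitted "c" for both inputs, a filter that runs later in the chain only sees "c" and cannot tell them apart, which is the case where preprocessing the characters is the simpler option.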
Notes On Specific Factories
Documentation at MappingCharFilterFactory.
Documentation at PatternReplaceCharFilterFactory.
Documentation at HTMLStripCharFilterFactory.
Solr provides the following TokenizerFactories (Tokenizers and TokenFilters):
Documentation at Keyword Tokenizer.
Documentation at Letter Tokenizer.
Documentation at White Space Tokenizer.
Documentation at Lower Case Tokenizer.
Documentation at Standard Tokenizer.
- Some token types are number, alphanumeric, email, acronym, URL, etc.: "I.B.M. cat's can't" ==> ACRONYM: "I.B.M.", APOSTROPHE: "cat's", APOSTROPHE: "can't"
- Word boundary rules from http://unicode.org/reports/tr29/#Word_Boundaries: "I.B.M. 8.5 can't!!!" ==> ALPHANUM: "I.B.M.", NUM: "8.5", ALPHANUM: "can't"
Documentation at Classic Tokenizer.
Documentation at UAX29 URL Email Tokenizer.
Documentation at Regular Expression Pattern Tokenizer.
Documentation at ICU Tokenizer.
Overall documented at Filter Descriptions.
Documentation at Classic Filter.
Strips all characters after an apostrophe (including the apostrophe itself).
"Türkiye'de", "2003'te" ==> "Türkiye", "2003".
Documentation at Lower Case Filter.
Documented at Type Token Filter.
Documented at Trim Filter.
A token filter that truncates each term to a specified maximum length (five characters in the example below).
"abcdefg", "1234567" ==> "abcde", "12345".
Emits tokens for each capture group in a regular expression.
For example, a pattern with an outer group that captures the scheme and host of a URL and a nested group that captures the host alone will tokenize the input text "http://www.foo.com/index" into "http://www.foo.com" and "www.foo.com".
If none of the patterns match, or if preserve_original is true, the original token will also be emitted.
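The behaviour described above can be sketched with Python's `re` module. This is only an illustration of the idea, not the filter's implementation: the function name, the order in which tokens are emitted, and the `URL_PATTERN` regular expression are all assumptions made for this example.

```python
import re

# Hypothetical pattern: the outer group captures scheme plus host,
# the nested group captures the host alone.
URL_PATTERN = r"(https?://([a-zA-Z\-_0-9.]+))"


def capture_group_tokens(token, patterns, preserve_original=True):
    """Emit one token per non-empty capture group of every match.

    The original token is also emitted when preserve_original is true
    or when none of the patterns match.
    """
    captured = []
    for pattern in patterns:
        for match in re.finditer(pattern, token):
            captured.extend(group for group in match.groups() if group)
    if preserve_original or not captured:
        return [token] + captured
    return captured


url = "http://www.foo.com/index"
groups_only = capture_group_tokens(url, [URL_PATTERN], preserve_original=False)
with_original = capture_group_tokens(url, [URL_PATTERN], preserve_original=True)
no_match = capture_group_tokens("plain", [URL_PATTERN], preserve_original=False)

print(groups_only)
print(with_original)
print(no_match)
```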
Documentation at Pattern Replace Filter.
Documentation at Stop Filter.
Documentation at Common Grams Filter.
Documentation at Edge N-Gram Filter.
This FilterFactory is very useful for matching prefix substrings (or suffix substrings if side="back") of particular terms in the index at query time. Edge n-gram analysis can be performed at either index or query time (or both), but it is typically more useful to generate the n-grams at index time, with all of the n-grams indexed at the same position. At query time the query term can then be matched directly without any n-gram analysis. Unlike wildcards, n-gram query terms can be used within quoted phrases.
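As a rough sketch of index-time edge n-grams, the following generates the grams for one term and then matches a query term against them directly. The function and parameter names are this example's own (they only loosely mirror the filter's minimum size, maximum size, and side settings), and a plain Python set stands in for the index.

```python
def edge_ngrams(term, min_gram, max_gram, side="front"):
    """Return the edge n-grams of term, from min_gram to max_gram characters.

    side="front" takes prefixes, side="back" takes suffixes.
    Assumes min_gram >= 1.
    """
    upper = min(max_gram, len(term))
    if side == "front":
        return [term[:n] for n in range(min_gram, upper + 1)]
    return [term[-n:] for n in range(min_gram, upper + 1)]


front_grams = edge_ngrams("apple", 2, 4)
back_grams = edge_ngrams("apple", 2, 4, side="back")

# Index time: every gram of the term is stored (conceptually at one position).
indexed = set(front_grams)

# Query time: the query term is looked up as-is, with no n-gram analysis.
prefix_hit = "app" in indexed
infix_hit = "ppl" in indexed

print(front_grams, back_grams, prefix_hit, infix_hit)
```

Because the query term is an ordinary term rather than a wildcard, it can take part in a quoted phrase like any other term.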
Documentation at Keep Word Filter.
Documentation at Word Delimiter Filter.
One use for WordDelimiterFilter is to help match words with different delimiters. One way of doing so is to specify generateWordParts="1" catenateWords="1" in the analyzer used for indexing, and generateWordParts="1" in the analyzer used for querying. Given that the current StandardTokenizer immediately removes many intra-word delimiters, it is recommended that this filter be used after a tokenizer that leaves them in place (such as WhitespaceTokenizer).
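The effect of that index/query asymmetry can be sketched as follows. This is a deliberately reduced model, not WordDelimiterFilter itself: it splits only on non-alphanumeric delimiters (the real filter has further rules, such as case changes), it assumes tokens are already lowercased, and the function names are this example's own.

```python
import re


def word_parts(token):
    """Split a token on runs of non-alphanumeric characters."""
    return [part for part in re.split(r"[^A-Za-z0-9]+", token) if part]


def index_terms(token):
    """Index side: the word parts plus their concatenation."""
    parts = word_parts(token)
    terms = set(parts)
    if len(parts) > 1:
        terms.add("".join(parts))
    return terms


def query_terms(token):
    """Query side: the word parts only."""
    return set(word_parts(token))


indexed = index_terms("wi-fi")

# Each differently delimited query form produces only terms that the
# indexed form "wi-fi" also contains.
for query in ("wifi", "wi-fi", "wi_fi", "wi.fi"):
    print(query, query_terms(query) <= indexed)
```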
In some cases you might want to adjust how WordDelimiterFilter splits on a per-character basis. To do this, you can supply a configuration file with the "types" attribute that specifies custom character categories. An example file is in subversion here. This is especially useful to add "hashtag or currency" searches.
Documentation at Synonym Filter.
Keep in mind that while the SynonymFilter will happily work with synonyms containing multiple words (e.g. "sea biscuit, sea biscit, seabiscuit"), the recommended approach for dealing with synonyms like this is to expand the synonym when indexing. This is because there are two potential issues that can arise at query time:
- The Lucene QueryParser tokenizes on white space before giving any text to the Analyzer, so if a person searches for the words sea biscit, the analyzer will be given the words "sea" and "biscit" separately, and will not know that they match a synonym.
- Phrase searching (e.g. "sea biscit") will cause the QueryParser to pass the entire string to the analyzer, but if the SynonymFilter is configured to expand the synonyms, then when the QueryParser gets the resulting list of tokens back from the Analyzer, it will construct a MultiPhraseQuery that will not have the desired effect. This is because of the limited mechanism available for the Analyzer to indicate that two terms occupy the same position: there is no way to indicate that a "phrase" occupies the same position as a term. For our example the resulting MultiPhraseQuery would be "(sea | sea | seabiscuit) (biscuit | biscit)", which would not match the simple case of "seabiscuit" occurring in a document.
Even when you aren't worried about multi-word synonyms, idf differences still make index time synonyms a good idea. Consider the following scenario:
- An index with a "text" field, which at query time uses the SynonymFilter with the synonym "TV, Television"
- Many thousands of documents containing the term "text:TV"
- A few hundred documents containing the term "text:Television"
A query for text:TV will expand into (text:TV text:Television), and the lower docFreq for text:Television will give the documents that match "Television" a much higher score than comparable docs that match "TV", which may be somewhat counterintuitive to the client. Index time expansion (or reduction) will result in the same idf for all documents regardless of which term the original text contained.
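To put rough numbers on this scenario, the sketch below uses one classic form of Lucene's TF-IDF inverse document frequency, 1 + ln(numDocs / (docFreq + 1)). The document counts are hypothetical, the scoring model in your Solr version may differ, and the index-time case assumes for simplicity that no document originally contained both terms.

```python
import math


def classic_idf(doc_freq, num_docs):
    """One classic Lucene-style idf form: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical corpus statistics for the scenario in the text.
NUM_DOCS = 100_000
DF_TV = 20_000          # "many thousands" of documents
DF_TELEVISION = 300     # "a few hundred" documents

# Query-time expansion: each term keeps its own document frequency.
idf_tv = classic_idf(DF_TV, NUM_DOCS)
idf_television = classic_idf(DF_TELEVISION, NUM_DOCS)

# Index-time expansion: every document that had either term now has both,
# so both terms share one document frequency and therefore one idf.
df_expanded = DF_TV + DF_TELEVISION
idf_tv_expanded = classic_idf(df_expanded, NUM_DOCS)
idf_television_expanded = classic_idf(df_expanded, NUM_DOCS)

print(round(idf_tv, 3), round(idf_television, 3))
print(round(idf_tv_expanded, 3), round(idf_television_expanded, 3))
```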
Documentation at Remove Duplicates Token Filter.
Replaces accented characters in the ISO Latin 1 character set (ISO-8859-1) by their unaccented equivalent. In Solr 3.x, this filter is deprecated. This filter does not exist at all in 4.x versions. Use ASCIIFoldingFilterFactory instead.
Documentation at ASCII Folding Filter.
Documentation at Phonetic Filter.
Documentation at Double Metaphone Filter.
Documentation at Beider-Morse Filter.
This is especially useful for Central European and Eastern European surnames. For example, one can use this filter factory to find documents that contain the surname "Kracovsky" when the original search term was "Crakowski", or vice versa. For more information, check out the paper about Beider-Morse Phonetic Matching (BMPM) at http://stevemorse.org/phonetics/bmpm.htm.
Documentation at Shingle Filter.
This filter was deprecated and removed from Lucene in 5.0.
A PositionFilter manipulates the position of tokens in the stream.
It sets the positionIncrement of every token to the configured "positionIncrement" value, except for the first token returned, which retains its original positionIncrement value.
PositionFilter can be used with a query Analyzer to prevent expensive Phrase and MultiPhraseQueries. When QueryParser parses a query, it first divides text on whitespace, and then analyzes each whitespace-separated token. Some TokenStreams such as StandardTokenizer or WordDelimiterFilter may divide one of these whitespace-separated tokens into multiple tokens.
The QueryParser will turn "multiple tokens" into a Phrase or MultiPhraseQuery, but "multiple tokens at the same position with only a position count of 1" is treated as a special case. You can use PositionFilter at the end of your query Analyzer to force every token after the first one to have a position increment of zero, which triggers this case.
For example, by default a query of "Wi-Fi" with StandardTokenizer will create a PhraseQuery over the tokens "Wi" and "Fi".
If you instead wrap the StandardTokenizer with PositionFilter, the "Fi" will have a position increment of zero, creating a BooleanQuery over the two terms.
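The "Wi-Fi" case can be modelled in a few lines. This is a simplified model of the behaviour described above, not QueryParser or PositionFilter code: tokens are plain (term, position increment) pairs, and `query_kind` encodes only the rules stated in the text (one position gives a BooleanQuery, several positions give a Phrase or MultiPhraseQuery).

```python
def apply_position_filter(tokens, position_increment=0):
    """Keep the first token's increment; overwrite the increment of all others."""
    return [
        (term, inc if index == 0 else position_increment)
        for index, (term, inc) in enumerate(tokens)
    ]


def query_kind(tokens):
    """Simplified model of how the token stream shapes the resulting query."""
    if len(tokens) == 1:
        return "TermQuery"
    position_count = sum(1 for _, inc in tokens if inc > 0)
    if position_count <= 1:
        return "BooleanQuery"
    stacked = any(inc == 0 for _, inc in tokens[1:])
    return "MultiPhraseQuery" if stacked else "PhraseQuery"


# "Wi-Fi" split into two tokens at consecutive positions.
wifi_tokens = [("Wi", 1), ("Fi", 1)]
filtered_tokens = apply_position_filter(wifi_tokens)

print(query_kind(wifi_tokens))      # two positions
print(query_kind(filtered_tokens))  # one position
```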
Another example is when exact-match hits are wanted for any shingle within the query. (This was done at http://sesam.no to replace three proprietary 'FAST Query-Matching servers' with two open-source Solr indexes; for background reading, see sesat and the mailing list.) This required all words and shingles in the query to be placed at the same position, so that all shingles are treated as synonyms of each other.
With only the ShingleFilter the shingles generated are synonyms only to the first term in each shingle group. For example the query "abcd efgh ijkl" results in a query like:
- ("abcd" "abcd efgh" "abcd efgh ijkl") ("efgh" "efgh ijkl") ("ijkl")
where "abcd efgh" and "abcd efgh ijkl" are synonyms of "abcd", and "efgh ijkl" is a synonym of "efgh".
ShingleFilter does not offer a way to alter this behaviour.
Using the PositionFilter in combination with the ShingleFilter makes it possible to make all shingles synonyms of each other.
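The grouping shown above, and the effect of collapsing it to a single position, can be sketched like this. The functions are illustrative stand-ins with names of this example's own; they model the shingle groups and the flattening step only, not the actual filters or their options.

```python
def shingle_groups(words, max_shingle_size):
    """Group shingles by starting word; each group shares one position."""
    groups = []
    for start in range(len(words)):
        end_limit = min(len(words), start + max_shingle_size)
        groups.append(
            [" ".join(words[start:end]) for end in range(start + 1, end_limit + 1)]
        )
    return groups


def flatten_to_one_position(groups):
    """Model of the position-filter step: every shingle lands at one position."""
    return [shingle for group in groups for shingle in group]


words = "abcd efgh ijkl".split()
groups = shingle_groups(words, max_shingle_size=3)
single_position = flatten_to_one_position(groups)

print(groups)
print(single_position)
```

With three separate groups the query needs a match at each position; with a single flattened group any one shingle is enough to match, which is the "all shingles are synonyms of each other" behaviour.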
Documentation at Reversed Wildcard Filter.
Add this filter to the index analyzer, but not the query analyzer. The standard Solr query parser (SolrQuerySyntax) will use this to reverse wildcard and prefix queries to improve performance (for example, translating myfield:*foo into myfield:oof*). To avoid collisions and false matches, reversed tokens are indexed with a prefix that should not otherwise appear in indexed text.
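The rewrite described above can be sketched as follows. This is a model of the idea rather than the filter or query parser code: the `MARKER` character, the choice to index the original token alongside the reversed one, and the function names are all assumptions made for this illustration.

```python
# Assumed marker: a control character unlikely to appear in indexed text.
MARKER = "\u0001"


def index_forms(token):
    """Index side: the original token plus a marked, reversed copy."""
    return [token, MARKER + token[::-1]]


def rewrite_leading_wildcard(query_term):
    """Query side: turn a pure leading-wildcard term into a marked prefix query."""
    if query_term.startswith("*") and "*" not in query_term[1:] and "?" not in query_term:
        return MARKER + query_term[:0:-1] + "*"
    return query_term


def prefix_matches(prefix_query, indexed_terms):
    """Evaluate a trailing-wildcard query as a simple prefix test."""
    prefix = prefix_query[:-1]
    return any(term.startswith(prefix) for term in indexed_terms)


indexed = index_forms("bigfoo")
rewritten = rewrite_leading_wildcard("*foo")

print(repr(rewritten))
print(prefix_matches(rewritten, indexed))
```

The marker keeps the reversed form of one word from colliding with the ordinary form of another (for example, a reversed "foo" never matches an indexed "oof").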
See Unicode Collation.
This filter works like CollationKeyFilterFactory, except that it uses ICU for collation. This produces smaller and faster sort keys, and it supports more locales. See UnicodeCollation for more information; the same concepts apply.
The only configuration difference is that locales should be specified to this filter with RFC 3066 locale IDs.
Note: to use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your SOLR_HOME/lib.
Documentation at ICU Normalizer 2 Filter.
Documentation at ICU Folding Filter.
Documentation at ICU Transform Filter.