Apache Solr Documentation

6.5 Ref Guide (PDF Download)
*** As of June 2017, the latest Solr Ref Guide is located at https://lucene.apache.org/solr/guide ***


This section contains information about tokenizers and filters related to character set conversion or for use with specific languages. For the European languages, tokenization is fairly straightforward. Tokens are delimited by white space and/or a relatively small set of punctuation characters. In other languages the tokenization rules are often not so simple. Some European languages may require special tokenization rules as well, such as rules for decompounding German words.

For information about language detection at index time, see Detecting Languages During Indexing.

Topics discussed in this section:

KeywordMarkerFilterFactory

Protects words from being modified by stemmers. A customized protected word list may be specified with the "protected" attribute in the schema. Any words in the protected word list will not be modified by any stemmer in Solr.

A sample Solr protwords.txt with comments can be found in the sample_techproducts_configs config set directory.
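A sketch of how the filter might be wired into an analysis chain (the field type name and stemmer choice are illustrative; protwords.txt is assumed to be in the config directory):

```xml
<fieldType name="text_en_protected" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- marks words listed in protwords.txt so the stemmer below leaves them untouched -->
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```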

KeywordRepeatFilterFactory

Emits each token twice, once with the KEYWORD attribute and once without. If placed before a stemmer, the unstemmed token is preserved at the same position as the stemmed one. Queries matching the original exact term will get a better score while still maintaining the recall benefit of stemming. Another advantage of keeping the original token is that wildcard truncation will work as expected.

To configure, add the KeywordRepeatFilterFactory early in the analysis chain. It is recommended to also include RemoveDuplicatesTokenFilterFactory to avoid duplicates when tokens are not stemmed.

A sample fieldType configuration could look like this:
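A minimal sketch of such a field type (the name is illustrative):

```xml
<fieldType name="text_keyword_repeat" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- emits each token twice, once marked KEYWORD -->
    <filter class="solr.KeywordRepeatFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <!-- drops the duplicate when stemming did not change the token -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```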

When adding the same token twice, it will also score twice (double), so you may have to re-tune your ranking rules.

StemmerOverrideFilterFactory

Overrides stemming algorithms by applying a custom mapping, then protecting these terms from being modified by stemmers.

A customized mapping of words to stems, in a tab-separated file, can be specified to the "dictionary" attribute in the schema. Words in this mapping will be stemmed to the stems from the file, and will not be further changed by any stemmer.

A sample stemdict.txt with comments can be found in the Source Repository.
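A sketch of how the filter might be configured (the file name stemdict.txt follows the sample above; the stemmer choice is illustrative):

```xml
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- applies the tab-separated word/stem mappings and protects those words from further stemming -->
  <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict.txt"/>
  <filter class="solr.PorterStemFilterFactory"/>
</analyzer>
```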

Dictionary Compound Word Token Filter

This filter splits, or decompounds, compound words into individual words using a dictionary of the component words. Each input token is passed through unchanged. If it can also be decompounded into subwords, each subword is also added to the stream at the same logical position.

Compound words are most commonly found in Germanic languages.

Factory class: solr.DictionaryCompoundWordTokenFilterFactory

Arguments:

dictionary: (required) The path of a file that contains a list of simple words, one per line. Blank lines and lines that begin with "#" are ignored. This path may be an absolute path, or path relative to the Solr config directory.

minWordSize: (integer, default 5) Any token shorter than this is not decompounded.

minSubwordSize: (integer, default 2) Subwords shorter than this are not emitted as tokens.

maxSubwordSize: (integer, default 15) Subwords longer than this are not emitted as tokens.

onlyLongestMatch: (true/false) If true (the default), only the longest matching subwords will generate new tokens.

Example:

Assume that germanwords.txt contains at least the following words: dumm kopf donau dampf schiff
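A representative analyzer configuration for this filter (all arguments other than dictionary are left at their defaults here):

```xml
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="germanwords.txt"/>
</analyzer>
```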

In: "Donaudampfschiff dummkopf"

Tokenizer to Filter: "Donaudampfschiff"(1), "dummkopf"(2)

Out: "Donaudampfschiff"(1), "Donau"(1), "dampf"(1), "schiff"(1), "dummkopf"(2), "dumm"(2), "kopf"(2)

Unicode Collation

Unicode Collation is a language-sensitive method of sorting text that can also be used for advanced search purposes.

Unicode Collation in Solr is fast, because all the work is done at index time.

Rather than specifying an analyzer within <fieldtype ... class="solr.TextField">, the solr.CollationField and solr.ICUCollationField field type classes provide this functionality. solr.ICUCollationField, which is backed by the ICU4J library, provides more flexible configuration, has more locales, is significantly faster, and requires less memory and less index space, since its keys are smaller than those produced by the JDK implementation that backs solr.CollationField.

solr.ICUCollationField is included in the Solr analysis-extras contrib - see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your SOLR_HOME/lib in order to use it.

solr.ICUCollationField and solr.CollationField fields can be created in two ways:

  • Based upon a system collator associated with a Locale.
  • Based upon a tailored RuleBasedCollator ruleset.

Arguments for solr.ICUCollationField, specified as attributes within the <fieldtype> element:

Using a System collator:

locale: (required) RFC 3066 locale ID. See the ICU locale explorer for a list of supported locales.

strength: Valid values are primary, secondary, tertiary, quaternary, or identical. See Comparison Levels in ICU Collation Concepts for more information.

decomposition: Valid values are no or canonical. See Normalization in ICU Collation Concepts for more information.

Using a Tailored ruleset:

custom: (required) Path to a UTF-8 text file containing rules supported by the ICU RuleBasedCollator.

strength: Valid values are primary, secondary, tertiary, quaternary, or identical. See Comparison Levels in ICU Collation Concepts for more information.

decomposition: Valid values are no or canonical. See Normalization in ICU Collation Concepts for more information.

Expert options:

alternate: Valid values are shifted or non-ignorable. Can be used to ignore punctuation/whitespace.

caseLevel: (true/false) If true, in combination with strength="primary", accents are ignored but case is taken into account. The default is false. See CaseLevel in ICU Collation Concepts for more information.

caseFirst: Valid values are lower or upper. Useful to control which is sorted first when case is not ignored.

numeric: (true/false) If true, digits are sorted according to numeric value, e.g. foobar-9 sorts before foobar-10. The default is false.

variableTop: Single character or contraction. Controls what is variable for the alternate option.

Sorting Text for a Specific Language

In this example, text is sorted according to the default German rules provided by ICU4J.
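A representative field type definition (the name collatedGERMAN is illustrative):

```xml
<fieldType name="collatedGERMAN" class="solr.ICUCollationField"
           locale="de"
           strength="primary"/>
```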

Locales are typically defined as a combination of language and country, but you can specify just the language if you want. For example, if you specify "de" as the language, you will get sorting that works well for the German language. If you specify "de" as the language and "CH" as the country, you will get German sorting specifically tailored for Switzerland.

In the example above, we defined the strength as "primary". The strength of the collation determines how strict the sort order will be, but it also depends upon the language. For example, in English, "primary" strength ignores differences in case and accents.

Another example:
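A sketch of a collated field type for Polish (the name is illustrative):

```xml
<fieldType name="collatedPOLISH" class="solr.ICUCollationField"
           locale="pl_PL"
           strength="secondary"/>
```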

The type will be used for the fields where the data contains Polish text. The "secondary" strength will ignore case differences, but, unlike "primary" strength, a letter with diacritic(s) will be sorted differently from the same base letter without diacritics.

An example using the "city_sort" field to sort:
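A sketch of the field declarations involved (the field names and the collated type name are illustrative):

```xml
<!-- a stored field holding the original text -->
<field name="city" type="text_general" indexed="true" stored="true"/>
<!-- an unstored sort field using a collated type such as the one defined above -->
<field name="city_sort" type="collatedPOLISH" indexed="true" stored="false"/>
<copyField source="city" dest="city_sort"/>
```

Queries can then sort on it with a request parameter such as sort=city_sort asc.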

Sorting Text for Multiple Languages

There are two approaches to supporting multiple languages: if there is a small list of languages you wish to support, consider defining collated fields for each language and using copyField. However, adding a large number of sort fields can increase disk and indexing costs. An alternative approach is to use the Unicode default collator.

The Unicode default or ROOT locale has rules that are designed to work well for most languages. To use the default locale, simply define the locale as the empty string. This Unicode default sort is still significantly more advanced than the standard Solr sort.
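A sketch of a field type using the ROOT locale (note the empty locale string; the name is illustrative):

```xml
<fieldType name="collatedROOT" class="solr.ICUCollationField"
           locale=""
           strength="primary"/>
```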

Sorting Text with Custom Rules

You can define your own set of sorting rules. It's easiest to take existing rules that are close to what you want and customize them.

In the example below, we create a custom rule set for German called DIN 5007-2. This rule set treats umlauts in German differently: it treats ö as equivalent to oe, ä as equivalent to ae, and ü as equivalent to ue. For more information, see the ICU RuleBasedCollator javadocs.

This example shows how to create a custom rule set for solr.ICUCollationField and dump it to a file:

This rule set can now be used for custom collation in Solr:

JDK Collation

As mentioned above, ICU Unicode Collation is better in several ways than JDK Collation, but if you cannot use ICU4J for some reason, you can use solr.CollationField.

The principles of JDK Collation are the same as those of ICU Collation; you just specify language, country and variant arguments instead of the combined locale argument.

Arguments for solr.CollationField, specified as attributes within the <fieldtype> element:

Using a System collator (see Oracle's list of locales supported in Java 8):

language: (required) ISO-639 language code

country: ISO-3166 country code

variant: Vendor or browser-specific code

strength: Valid values are primary, secondary, tertiary or identical. See Oracle Java 8 Collator javadocs for more information.

decomposition: Valid values are no, canonical, or full. See Oracle Java 8 Collator javadocs for more information.

Using a Tailored ruleset:

custom: (required) Path to a UTF-8 text file containing rules supported by the JDK RuleBasedCollator

strength: Valid values are primary, secondary, tertiary or identical. See Oracle Java 8 Collator javadocs for more information.

decomposition: Valid values are no, canonical, or full. See Oracle Java 8 Collator javadocs for more information.

A solr.CollationField example:
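A sketch of a JDK-collated field type for German (the name is illustrative; note the separate language and country arguments in place of ICU's combined locale):

```xml
<fieldType name="collatedGERMAN" class="solr.CollationField"
           language="de"
           country="DE"
           strength="primary"/>
```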

ASCII & Decimal Folding Filters

Ascii Folding

This filter converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists. Only those characters with reasonable ASCII alternatives are converted.

This can increase recall by causing more matches. On the other hand, it can reduce precision because language-specific character differences may be lost.

Factory class: solr.ASCIIFoldingFilterFactory

Arguments: None

Example:
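A minimal analyzer using this filter:

```xml
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
```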

In: "Björn Ångström"

Tokenizer to Filter: "Björn", "Ångström"

Out: "Bjorn", "Angstrom"

Decimal Digit Folding

This filter converts any character in the Unicode "Decimal Number" general category ("Nd") into its equivalent Basic Latin digit (0-9).

This can increase recall by causing more matches. On the other hand, it can reduce precision because language-specific character differences may be lost.

Factory class: solr.DecimalDigitFilterFactory

Arguments: None

Example:
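A minimal analyzer using this filter:

```xml
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.DecimalDigitFilterFactory"/>
</analyzer>
```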

 

Language-Specific Factories

These factories are each designed to work with specific languages. The languages covered here are:

Arabic

Solr provides support for the Light-10 (PDF) stemming algorithm, and Lucene includes an example stopword list.

This algorithm defines both character normalization and stemming, so these are split into two filters to provide more flexibility.

Factory classes: solr.ArabicStemFilterFactory, solr.ArabicNormalizationFilterFactory

Arguments: None

Example:
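A sketch of an Arabic analysis chain, with normalization applied before stemming as the text above describes:

```xml
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.ArabicNormalizationFilterFactory"/>
  <filter class="solr.ArabicStemFilterFactory"/>
</analyzer>
```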

Brazilian Portuguese

This is a Java filter written specifically for stemming the Brazilian dialect of the Portuguese language. It uses the Lucene class org.apache.lucene.analysis.br.BrazilianStemmer. Although that stemmer can be configured to use a list of protected words (which should not be stemmed), this factory does not accept any arguments to specify such a list.

Factory class: solr.BrazilianStemFilterFactory

Arguments: None

Example:

In: "praia praias"

Tokenizer to Filter: "praia", "praias"

Out: "pra", "pra"

Bulgarian

Solr includes a light stemmer for Bulgarian, following this algorithm (PDF), and Lucene includes an example stopword list.

Factory class: solr.BulgarianStemFilterFactory

Arguments: None

Example:

Catalan

Solr can stem Catalan using the Snowball Porter Stemmer with an argument of language="Catalan". Solr includes a set of contractions for Catalan, which can be stripped using solr.ElisionFilterFactory.

Factory class: solr.SnowballPorterFilterFactory

Arguments:

language: (required) stemmer language, "Catalan" in this case

Example:
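A sketch of a Catalan analysis chain, stripping contractions before stemming (lang/contractions_ca.txt is the contractions file shipped with Solr's language configs):

```xml
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.ElisionFilterFactory" ignoreCase="true" articles="lang/contractions_ca.txt"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Catalan"/>
</analyzer>
```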

In: "llengües llengua"

Tokenizer to Filter: "llengües"(1) "llengua"(2),

Out: "llengu"(1), "llengu"(2)

Chinese

Chinese Tokenizer

The Chinese Tokenizer is deprecated as of Solr 3.4. Use the solr.StandardTokenizerFactory instead.

Factory class: solr.ChineseTokenizerFactory

Arguments: None

Example:

Chinese Filter Factory

The Chinese Filter Factory is deprecated as of Solr 3.4. Use the solr.StopFilterFactory instead.

Factory class: solr.ChineseFilterFactory

Arguments: None

Example:

Simplified Chinese

For Simplified Chinese, Solr provides support for Chinese sentence and word segmentation with the solr.HMMChineseTokenizerFactory in the analysis-extras contrib module. This component includes a large dictionary and segments Chinese text into words with the Hidden Markov Model. To use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your solr_home/lib.

Factory class: solr.HMMChineseTokenizerFactory

Arguments: None

Examples:

To use the default setup with fallback to English Porter stemmer for English words, use:

<analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>

Or to configure your own analysis setup, use the solr.HMMChineseTokenizerFactory along with your custom filter setup.

CJK

This tokenizer breaks Chinese, Japanese and Korean language text into tokens. These are not whitespace-delimited languages. The tokens generated by this tokenizer are "doubles", overlapping pairs of CJK characters found in the field text.

Factory class: solr.CJKTokenizerFactory

Arguments: None

Example:

Czech

Solr includes a light stemmer for Czech, following this algorithm, and Lucene includes an example stopword list.

Factory class: solr.CzechStemFilterFactory

Arguments: None

Example:

In: "prezidenští, prezidenta, prezidentského"

Tokenizer to Filter: "prezidenští", "prezidenta", "prezidentského"

Out: "preziden", "preziden", "preziden"

Danish

Solr can stem Danish using the Snowball Porter Stemmer with an argument of language="Danish".

Also relevant are the Scandinavian normalization filters.

Factory class: solr.SnowballPorterFilterFactory

Arguments:

language: (required) stemmer language, "Danish" in this case

Example:

In: "undersøg undersøgelse"

Tokenizer to Filter: "undersøg"(1) "undersøgelse"(2),

Out: "undersøg"(1), "undersøg"(2)

Dutch

Solr can stem Dutch using the Snowball Porter Stemmer with an argument of language="Dutch".

Factory class: solr.SnowballPorterFilterFactory

Arguments:

language: (required) stemmer language, "Dutch" in this case

Example:

In: "kanaal kanalen"

Tokenizer to Filter: "kanaal", "kanalen"

Out: "kanal", "kanal"

Finnish

Solr includes support for stemming Finnish, and Lucene includes an example stopword list.

Factory class: solr.FinnishLightStemFilterFactory

Arguments: None

Example:

In: "kala kalat"

Tokenizer to Filter: "kala", "kalat"

Out: "kala", "kala"

French

Elision Filter

Removes article elisions from a token stream. This filter can be useful for languages such as French, Catalan, Italian, and Irish.

Factory class: solr.ElisionFilterFactory

Arguments:

articles: The pathname of a file that contains a list of articles, one per line, to be stripped. Articles are words such as "le", which are commonly abbreviated, such as in l'avion (the plane). This file should include the abbreviated form, which precedes the apostrophe. In this case, simply "l". If no articles attribute is specified, a default set of French articles is used.

ignoreCase: (boolean) If true, the filter ignores the case of words when comparing them to the common word file. Defaults to false

Example:
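A sketch of an analyzer using this filter for French (lang/contractions_fr.txt is the articles file shipped with Solr's language configs):

```xml
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.ElisionFilterFactory" ignoreCase="true" articles="lang/contractions_fr.txt"/>
</analyzer>
```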

In: "L'histoire d'art"

Tokenizer to Filter: "L'histoire", "d'art"

Out: "histoire", "art"

French Light Stem Filter

Solr includes three stemmers for French: one in the solr.SnowballPorterFilterFactory, a lighter stemmer called solr.FrenchLightStemFilterFactory, and an even less aggressive stemmer called solr.FrenchMinimalStemFilterFactory. Lucene includes an example stopword list.

Factory classes: solr.FrenchLightStemFilterFactory, solr.FrenchMinimalStemFilterFactory

Arguments: None

Examples:

In: "le chat, les chats"

Tokenizer to Filter: "le", "chat", "les", "chats"

Out: "le", "chat", "le", "chat"

Galician

Solr includes a stemmer for Galician following this algorithm, and Lucene includes an example stopword list.

Factory class: solr.GalicianStemFilterFactory

Arguments: None

Example:

In: "felizmente Luzes"

Tokenizer to Filter: "felizmente", "luzes"

Out: "feliz", "luz"

German

Solr includes four stemmers for German: one in the solr.SnowballPorterFilterFactory language="German", a stemmer called solr.GermanStemFilterFactory, a lighter stemmer called solr.GermanLightStemFilterFactory, and an even less aggressive stemmer called solr.GermanMinimalStemFilterFactory. Lucene includes an example stopword list.

Factory classes: solr.GermanStemFilterFactory, solr.GermanLightStemFilterFactory, solr.GermanMinimalStemFilterFactory

Arguments: None

Examples:

In: "haus häuser"

Tokenizer to Filter: "haus", "häuser"

Out: "haus", "haus"

Greek

This filter converts uppercase letters in the Greek character set to the equivalent lowercase character.

Factory class: solr.GreekLowerCaseFilterFactory

Arguments: None

Use of custom charsets is no longer supported as of Solr 3.1. If you need to index text in these encodings, please use Java's character set conversion facilities (InputStreamReader, etc.) during I/O, so that Lucene can analyze this text as Unicode instead.

Example:

Hindi

Solr includes support for stemming Hindi following this algorithm (PDF), support for common spelling differences through the solr.HindiNormalizationFilterFactory, support for encoding differences through the solr.IndicNormalizationFilterFactory following this algorithm, and Lucene includes an example stopword list.

Factory classes: solr.IndicNormalizationFilterFactory, solr.HindiNormalizationFilterFactory, solr.HindiStemFilterFactory

Arguments: None

Example:
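A sketch of a Hindi analysis chain, applying the normalization filters before stemming:

```xml
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.IndicNormalizationFilterFactory"/>
  <filter class="solr.HindiNormalizationFilterFactory"/>
  <filter class="solr.HindiStemFilterFactory"/>
</analyzer>
```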

Indonesian

Solr includes support for stemming Indonesian (Bahasa Indonesia) following this algorithm (PDF), and Lucene includes an example stopword list.

Factory class: solr.IndonesianStemFilterFactory

Arguments: None

Example:

In: "sebagai sebagainya"

Tokenizer to Filter: "sebagai", "sebagainya"

Out: "bagai", "bagai"

Italian

Solr includes two stemmers for Italian: one in the solr.SnowballPorterFilterFactory language="Italian", and a lighter stemmer called solr.ItalianLightStemFilterFactory. Lucene includes an example stopword list.

Factory class: solr.ItalianLightStemFilterFactory

Arguments: None

Example:

In: "propaga propagare propagamento"

Tokenizer to Filter: "propaga", "propagare", "propagamento"

Out: "propag", "propag", "propag"

Irish

Solr can stem Irish using the Snowball Porter Stemmer with an argument of language="Irish". Solr includes solr.IrishLowerCaseFilterFactory, which can handle Irish-specific constructs. Solr also includes a set of contractions for Irish which can be stripped using solr.ElisionFilterFactory.

Factory class: solr.SnowballPorterFilterFactory

Arguments:

language: (required) stemmer language, "Irish" in this case

Example:

In: "siopadóireacht síceapatacha b'fhearr m'athair"

Tokenizer to Filter: "siopadóireacht", "síceapatacha", "b'fhearr", "m'athair"

Out: "siopadóir", "síceapaite", "fearr", "athair"

Japanese

Solr includes support for analyzing Japanese, via the Lucene Kuromoji morphological analyzer, which includes several analysis components - more details on each below:

  • JapaneseIterationMarkCharFilter normalizes Japanese horizontal iteration marks (odoriji) to their expanded form.
  • JapaneseTokenizer tokenizes Japanese using morphological analysis, and annotates each term with part-of-speech, base form (a.k.a. lemma), reading and pronunciation. 
  • JapaneseBaseFormFilter replaces original terms with their base forms (a.k.a. lemmas).
  • JapanesePartOfSpeechStopFilter removes terms that have one of the configured parts-of-speech.
  • JapaneseKatakanaStemFilter normalizes common katakana spelling variations ending in a long sound character (U+30FC) by removing the long sound character.

Also useful for Japanese analysis, from lucene-analyzers-common:

  • CJKWidthFilter folds fullwidth ASCII variants into the equivalent Basic Latin forms, and folds halfwidth Katakana variants into their equivalent fullwidth forms.

Japanese Iteration Mark CharFilter

Normalizes horizontal Japanese iteration marks (odoriji) to their expanded form.  Vertical iteration marks are not supported.

Factory class: JapaneseIterationMarkCharFilterFactory

Arguments:

normalizeKanji: set to false to not normalize kanji iteration marks (default is true)

normalizeKana: set to false to not normalize kana iteration marks (default is true)

Japanese Tokenizer

Tokenizer for Japanese that uses morphological analysis, and annotates each term with part-of-speech, base form (a.k.a. lemma), reading and pronunciation. 

JapaneseTokenizer has a search mode (the default) that does segmentation useful for search: a heuristic is used to segment compound terms into their constituent parts while also keeping the original compound terms as synonyms.

Factory class: solr.JapaneseTokenizerFactory

Arguments:

mode: Use search mode to get a noun-decompounding effect useful for search. search mode improves segmentation for search at the expense of part-of-speech accuracy. Valid values for mode are:

  • normal: default segmentation
  • search: segmentation useful for search (extra compound splitting)
  • extended: search mode plus unigramming of unknown words (experimental)

For some applications it might be good to use search mode for indexing and normal mode for queries to increase precision and prevent parts of compounds from being matched and highlighted.

userDictionary: filename for a user dictionary, which allows overriding the statistical model with your own entries for segmentation, part-of-speech tags and readings without a need to specify weights. See lang/userdict_ja.txt for a sample user dictionary file.

userDictionaryEncoding: user dictionary encoding (default is UTF-8)

discardPunctuation: set to false to keep punctuation, true to discard (the default)

Japanese Base Form Filter

Replaces original terms' text with the corresponding base form (lemma).  (JapaneseTokenizer annotates each term with its base form.)

Factory class: JapaneseBaseFormFilterFactory

(no arguments)

Japanese Part Of Speech Stop Filter

Removes terms with one of the configured parts-of-speech.  JapaneseTokenizer annotates terms with parts-of-speech.

Factory class : JapanesePartOfSpeechStopFilterFactory

Arguments:

tags: filename for a list of parts-of-speech for which to remove terms; see conf/lang/stoptags_ja.txt in the sample_techproducts_configs config set for an example.

enablePositionIncrements: if luceneMatchVersion is 4.3 or earlier and enablePositionIncrements="false",  no position holes will be left by this filter when it removes tokens.  This argument is invalid if luceneMatchVersion is 5.0 or later.

Japanese Katakana Stem Filter

Normalizes common katakana spelling variations ending in a long sound character (U+30FC) by removing the long sound character.

CJKWidthFilterFactory should be specified prior to this filter to normalize half-width katakana to full-width.

Factory class: JapaneseKatakanaStemFilterFactory

Arguments:

minimumLength: terms below this length will not be stemmed. Default is 4, value must be 2 or more. 

CJK Width Filter

Folds fullwidth ASCII variants into the equivalent Basic Latin forms, and folds halfwidth Katakana variants into their equivalent fullwidth forms.

Factory class: CJKWidthFilterFactory

(no arguments)

 

Example:
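A sketch of a Japanese field type combining the components above, modeled on the text_ja type shipped with Solr's sample configs (the stopword/stoptag file paths assume the lang/ directory from those configs):

```xml
<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer>
    <!-- morphological tokenization; search mode splits compounds for better recall -->
    <tokenizer class="solr.JapaneseTokenizerFactory" mode="search" userDictionary="lang/userdict_ja.txt"/>
    <!-- replace inflected terms with their base form (lemma) -->
    <filter class="solr.JapaneseBaseFormFilterFactory"/>
    <!-- remove terms with unwanted parts-of-speech -->
    <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt"/>
    <!-- normalize fullwidth ASCII and halfwidth Katakana -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt"/>
    <!-- strip trailing long-sound character from katakana variants -->
    <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```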

Hebrew, Lao, Myanmar, Khmer

Lucene provides support, in addition to UAX#29 word break rules, for Hebrew's use of the double and single quote characters, and for segmenting Lao, Myanmar, and Khmer into syllables with the solr.ICUTokenizerFactory in the analysis-extras contrib module. To use this tokenizer, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your solr_home/lib.

See the ICUTokenizer for more information.

Latvian

Solr includes support for stemming Latvian, and Lucene includes an example stopword list.

Factory class: solr.LatvianStemFilterFactory

Arguments: None

Example:

In: "tirgiem tirgus"

Tokenizer to Filter: "tirgiem", "tirgus"

Out: "tirg", "tirg"

Norwegian

Solr includes two classes for stemming Norwegian, NorwegianLightStemFilterFactory and NorwegianMinimalStemFilterFactory. Lucene includes an example stopword list.

Another option is to use the Snowball Porter Stemmer with an argument of language="Norwegian".

Also relevant are the Scandinavian normalization filters.

Norwegian Light Stemmer

The NorwegianLightStemFilterFactory requires a "two-pass" sort for the -dom and -het endings. This means that in the first pass the word "kristendom" is stemmed to "kristen", and then all the general rules apply so it will be further stemmed to "krist". The effect of this is that "kristen," "kristendom," "kristendommen," and "kristendommens" will all be stemmed to "krist."

The second pass is to pick up -dom and -het endings. Consider this example:

One pass                        Two passes
Before            After         Before            After
forlegen          forleg        forlegen          forleg
forlegenhet       forlegen      forlegenhet       forleg
forlegenheten     forlegen      forlegenheten     forleg
forlegenhetens    forlegen      forlegenhetens    forleg
firkantet         firkant       firkantet         firkant
firkantethet      firkantet     firkantethet      firkant
firkantetheten    firkantet     firkantetheten    firkant

Factory class: solr.NorwegianLightStemFilterFactory

Arguments: variant: Choose the Norwegian language variant to use. Valid values are:

  • nb: Bokmål (default)
  • nn: Nynorsk
  • no: both

Example:
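A sketch of an analyzer using the light stemmer (lowercasing first, since the stemmer expects lowercased input as in the example below):

```xml
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.NorwegianLightStemFilterFactory" variant="nb"/>
</analyzer>
```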

In: "Forelskelsen"

Tokenizer to Filter: "forelskelsen"

Out: "forelske"

Norwegian Minimal Stemmer

The NorwegianMinimalStemFilterFactory stems plural forms of Norwegian nouns only.

Factory class: solr.NorwegianMinimalStemFilterFactory

Arguments: variant: Choose the Norwegian language variant to use. Valid values are:

  • nb: Bokmål (default)
  • nn: Nynorsk
  • no: both

Example:

In: "Bilens"

Tokenizer to Filter: "bilens"

Out: "bil"

Persian

Persian Filter Factories

Solr includes support for normalizing Persian, and Lucene includes an example stopword list.

Factory class: solr.PersianNormalizationFilterFactory

Arguments: None

Example:

Polish

Solr provides support for Polish stemming with the solr.StempelPolishStemFilterFactory, and solr.MorfologikFilterFactory for lemmatization, in the contrib/analysis-extras module. The solr.StempelPolishStemFilterFactory component includes an algorithmic stemmer with tables for Polish. To use either of these filters, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your solr_home/lib.

Factory class: solr.StempelPolishStemFilterFactory and solr.MorfologikFilterFactory

Arguments: None

Example:
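Sketches of the two alternative chains, one per filter (as the note below explains, the lower case filter goes after the Morfologik filter, while it can precede the Stempel stemmer):

```xml
<!-- Stempel algorithmic stemmer -->
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.StempelPolishStemFilterFactory"/>
</analyzer>

<!-- Morfologik dictionary lemmatizer; with no dictionary attribute, the Polish dictionary is used -->
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.MorfologikFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```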

In: "studenta studenci"

Tokenizer to Filter: "studenta", "studenci"

Out: "student", "student"

More information about the Stempel stemmer is available in the Lucene javadocs.

Note that the lower case filter is applied after the Morfologik stemmer; this is because the Polish dictionary contains proper names, and term case may be important to resolve ambiguities (or even to look up the correct lemma at all).

The Morfologik dictionary param value is a constant specifying which dictionary to choose. The dictionary resource must be named path/to/language.dict and have an associated .info metadata file. See the Morfologik project for details. If the dictionary attribute is not provided, the Polish dictionary is loaded and used by default.

 

Portuguese

Solr includes four stemmers for Portuguese: one in the solr.SnowballPorterFilterFactory, an alternative stemmer called solr.PortugueseStemFilterFactory, a lighter stemmer called solr.PortugueseLightStemFilterFactory, and an even less aggressive stemmer called solr.PortugueseMinimalStemFilterFactory. Lucene includes an example stopword list.

Factory classes: solr.PortugueseStemFilterFactory, solr.PortugueseLightStemFilterFactory, solr.PortugueseMinimalStemFilterFactory

Arguments: None

Example:

In: "praia praias"

Tokenizer to Filter: "praia", "praias"

Out: "pra", "pra"

Romanian

Solr can stem Romanian using the Snowball Porter Stemmer with an argument of language="Romanian".

Factory class: solr.SnowballPorterFilterFactory

Arguments:

language: (required) stemmer language, "Romanian" in this case

Example:

Russian

Russian Stem Filter

Solr includes two stemmers for Russian: one in the solr.SnowballPorterFilterFactory language="Russian", and a lighter stemmer called solr.RussianLightStemFilterFactory. Lucene includes an example stopword list.

Factory class: solr.RussianLightStemFilterFactory

Arguments: None

Example:

Scandinavian

Scandinavian is a language group spanning three languages, Norwegian, Swedish and Danish, which are very similar.

Swedish å,ä,ö are in fact the same letters as Norwegian and Danish å,æ,ø and thus interchangeable when used between these languages. They are however folded differently when people type them on a keyboard lacking these characters.

In that situation almost all Swedish people use a, a, o instead of å, ä, ö. Norwegians and Danes, on the other hand, usually type aa, ae and oe instead of å, æ and ø. Some, however, use a, a, o, oo, ao and sometimes permutations of everything above.

There are two filters for helping with normalization between Scandinavian languages: solr.ScandinavianNormalizationFilterFactory, which tries to preserve the special characters (æäöå), and solr.ScandinavianFoldingFilterFactory, which folds them more broadly (e.g. ø/ö->o).

See also each language section for other relevant filters.

Scandinavian Normalization Filter

This filter normalizes use of the interchangeable Scandinavian characters æÆäÄöÖøØ and folded variants (aa, ao, ae, oe and oo) by transforming them to åÅæÆøØ.

It's a semantically less destructive solution than ScandinavianFoldingFilter, most useful when a person with a Norwegian or Danish keyboard queries a Swedish index and vice versa. This filter does not perform the common Swedish folds of å and ä to a nor ö to o.

Factory class: solr.ScandinavianNormalizationFilterFactory

Arguments: None

Example:
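A minimal analyzer using this filter:

```xml
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.ScandinavianNormalizationFilterFactory"/>
</analyzer>
```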

In: "blåbærsyltetøj blåbärsyltetöj blaabaersyltetoej blabarsyltetoj"

Tokenizer to Filter: "blåbærsyltetøj", "blåbärsyltetöj", "blaabaersyltetoej", "blabarsyltetoj"

Out: "blåbærsyltetøj", "blåbærsyltetøj", "blåbærsyltetøj", "blabarsyltetoj"

Scandinavian Folding Filter

This filter folds Scandinavian characters åÅäæÄÆ->a and öÖøØ->o. It also discriminates against the use of double vowels aa, ae, ao, oe and oo, leaving just the first one.

It's a semantically more destructive solution than ScandinavianNormalizationFilter, but can in addition help with matching raksmorgas as räksmörgås.

Factory class: solr.ScandinavianFoldingFilterFactory

Arguments: None

Example:
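As with the normalization filter, this filter is typically placed after tokenization and lowercasing. A minimal sketch (field type boilerplate omitted):

```xml
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.ScandinavianFoldingFilterFactory"/>
</analyzer>
```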

In: "blåbærsyltetøj blåbärsyltetöj blaabaarsyltetoej blabarsyltetoj"

Tokenizer to Filter: "blåbærsyltetøj", "blåbärsyltetöj", "blaabaarsyltetoej", "blabarsyltetoj"

Out: "blabarsyltetoj", "blabarsyltetoj", "blabarsyltetoj", "blabarsyltetoj"

Serbian

Serbian Normalization Filter

Solr includes a filter that normalizes Serbian Cyrillic and Latin characters. Note that this filter only works with lowercased input.

See the Solr wiki for tips & advice on using this filter: https://wiki.apache.org/solr/SerbianLanguageSupport

Factory class: solr.SerbianNormalizationFilterFactory

Arguments: haircut : Selects the extent of normalization. Valid values are:

  • bald: (Default behavior) Cyrillic characters are first converted to Latin; then, Latin characters have their diacritics removed, with the exception of "LATIN SMALL LETTER D WITH STROKE" (U+0111) which is converted to "dj"
  • regular: Only Cyrillic to Latin normalization will be applied, preserving the Latin diacritics

Example:
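A sketch of an analyzer chain using this filter; because the filter only works with lowercased input, LowerCaseFilterFactory must come before it (the haircut value shown here is the default):

```xml
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SerbianNormalizationFilterFactory" haircut="bald"/>
</analyzer>
```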


Spanish

Solr includes two stemmers for Spanish: one in the solr.SnowballPorterFilterFactory with language="Spanish", and a lighter stemmer called solr.SpanishLightStemFilterFactory. Lucene includes an example stopword list.

Factory class: solr.SpanishLightStemFilterFactory

Arguments: None

Example:
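A sketch of an analyzer using the lighter Spanish stemmer, with lowercasing applied first since the stemmer expects lowercased tokens:

```xml
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SpanishLightStemFilterFactory"/>
</analyzer>
```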

In: "torear toreara torearlo"

Tokenizer to Filter: "torear", "toreara", "torearlo"

Out: "tor", "tor", "tor"

Swedish

Swedish Stem Filter

Solr includes two stemmers for Swedish: one in the solr.SnowballPorterFilterFactory with language="Swedish", and a lighter stemmer called solr.SwedishLightStemFilterFactory. Lucene includes an example stopword list.

Also relevant are the Scandinavian normalization filters.

Factory class: solr.SwedishLightStemFilterFactory

Arguments: None

Example:
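A sketch of an analyzer using the lighter Swedish stemmer, with lowercasing applied first since the stemmer expects lowercased tokens:

```xml
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SwedishLightStemFilterFactory"/>
</analyzer>
```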

In: "kloke klokhet klokheten"

Tokenizer to Filter: "kloke", "klokhet", "klokheten"

Out: "klok", "klok", "klok"

Thai

This tokenizer converts sequences of Thai characters into individual Thai words. Unlike European languages, Thai does not use whitespace to delimit words.

Factory class: solr.ThaiTokenizerFactory

Arguments: None

Example:
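A minimal sketch of a Thai analyzer using this tokenizer:

```xml
<analyzer>
  <tokenizer class="solr.ThaiTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```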

Turkish

Solr includes support for stemming Turkish with the solr.SnowballPorterFilterFactory; support for case-insensitive search with the solr.TurkishLowerCaseFilterFactory; support for stripping apostrophes and following suffixes with solr.ApostropheFilterFactory (see Role of Apostrophes in Turkish Information Retrieval); and support for a form of stemming that truncates tokens at a configurable maximum length through the solr.TruncateTokenFilterFactory (see Information Retrieval on Turkish Texts). Lucene also includes an example stopword list.

Factory class: solr.TurkishLowerCaseFilterFactory

Arguments: None

Example:
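A sketch combining apostrophe stripping, Turkish-aware lowercasing, and Snowball stemming:

```xml
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.ApostropheFilterFactory"/>
  <filter class="solr.TurkishLowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Turkish"/>
</analyzer>
```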

Another example, illustrating diacritics-insensitive search:
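One way to achieve this (a sketch, not the only possible chain) is to fold diacritics with the general-purpose ASCIIFoldingFilterFactory after Turkish-aware lowercasing:

```xml
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.ApostropheFilterFactory"/>
  <filter class="solr.TurkishLowerCaseFilterFactory"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
```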

Ukrainian

Solr provides support for Ukrainian lemmatization with the solr.MorfologikFilterFactory, in the contrib/analysis-extras module. To use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your solr_home/lib.

Lucene also includes an example Ukrainian stopword list, in the lucene-analyzers-morfologik jar.

Factory class: solr.MorfologikFilterFactory

Arguments: 

dictionary: (required) lemmatizer dictionary - the lucene-analyzers-morfologik jar contains a Ukrainian dictionary at org/apache/lucene/analysis/uk/ukrainian.dict.

Example:
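A sketch of a Ukrainian field type (the field type name is illustrative); lowercasing is placed after the Morfologik filter so that dictionary lookups see the original term case:

```xml
<fieldType name="text_uk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.MorfologikFilterFactory" dictionary="org/apache/lucene/analysis/uk/ukrainian.dict"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```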

Note that the lower case filter is applied after the Morfologik stemmer; this is because the Ukrainian dictionary contains proper names, so proper term case may be important to resolve ambiguities (or even to look up the correct lemma at all).

The Morfologik dictionary param value is a constant specifying which dictionary to choose. The dictionary resource must be named path/to/language.dict and have an associated .info metadata file. See the Morfologik project for details. If the dictionary attribute is not provided, the Polish dictionary is loaded and used by default.

 
