Tokenizers

You configure the tokenizer for a text field type in schema.xml with a <tokenizer> element, as a child of <analyzer>:
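
A minimal sketch of such a field type (the name text_general is illustrative):

  <fieldType name="text_general" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
    </analyzer>
  </fieldType>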

The class attribute names a factory class that will instantiate a tokenizer object when needed. Tokenizer factory classes implement the org.apache.lucene.analysis.util.TokenizerFactory interface. A TokenizerFactory's create() method returns a Tokenizer (a kind of TokenStream); when Solr uses the tokenizer, it supplies a Reader object that provides the content of the text field.

Arguments may be passed to tokenizer factories by setting attributes on the <tokenizer> element.
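
For example, a sketch passing maxTokenLength (an argument of the Standard Tokenizer described below):

  <tokenizer class="solr.StandardTokenizerFactory" maxTokenLength="100"/>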

The following sections describe the tokenizer factory classes included in this release of Solr.

For user tips about Solr's tokenizers, see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters.

Standard Tokenizer

This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions:

  • Periods (dots) that are not followed by whitespace are kept as part of the token, which preserves Internet domain names. Under the UAX#29 rules this applies only when the characters on both sides of the period are the same kind, both letters or both digits; a mixed sequence such as object.1 is split at the period.
  • The "@" character is among the set of token-splitting punctuation, so email addresses are not preserved as single tokens.

Note that words are split at hyphens.

The Standard Tokenizer supports Unicode standard annex UAX#29 word boundaries with the following token types: <ALPHANUM>, <NUM>, <SOUTHEAST_ASIAN>, <IDEOGRAPHIC>, and <HIRAGANA>.

Factory class: solr.StandardTokenizerFactory

Arguments:

maxTokenLength: (integer, default 255) Solr ignores tokens that exceed the number of characters specified by maxTokenLength.

Example:
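
A configuration sketch using the default arguments:

  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>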

In: "Please, email john.doe@foo.com by 03-09, re: m37-xq."

Out: "Please", "email", "john.doe", "foo.com", "by", "03", "09", "re", "m37", "xq"

Classic Tokenizer

The Classic Tokenizer preserves the same behavior as the Standard Tokenizer of Solr versions 3.1 and previous. It does not use the Unicode standard annex UAX#29 word boundary rules that the Standard Tokenizer uses. This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions:

  • Periods (dots) that are not followed by whitespace are kept as part of the token.
  • Words are split at hyphens, unless there is a number in the word, in which case the token is not split and the numbers and hyphen(s) are preserved.
  • Recognizes Internet domain names and email addresses and preserves them as a single token.

Factory class: solr.ClassicTokenizerFactory

Arguments:

maxTokenLength: (integer, default 255) Solr ignores tokens that exceed the number of characters specified by maxTokenLength.

Example:
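
A configuration sketch:

  <analyzer>
    <tokenizer class="solr.ClassicTokenizerFactory"/>
  </analyzer>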

In: "Please, email john.doe@foo.com by 03-09, re: m37-xq."

Out: "Please", "email", "john.doe@foo.com", "by", "03-09", "re", "m37-xq"

Keyword Tokenizer

This tokenizer treats the entire text field as a single token.

Factory class: solr.KeywordTokenizerFactory

Arguments: None

Example:
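
A configuration sketch:

  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>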

In: "Please, email john.doe@foo.com by 03-09, re: m37-xq."

Out: "Please, email john.doe@foo.com by 03-09, re: m37-xq."

Letter Tokenizer

This tokenizer creates tokens from strings of contiguous letters, discarding all non-letter characters.

Factory class: solr.LetterTokenizerFactory

Arguments: None

Example:
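
A configuration sketch:

  <analyzer>
    <tokenizer class="solr.LetterTokenizerFactory"/>
  </analyzer>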

In: "I can't."

Out: "I", "can", "t"

Lower Case Tokenizer

Tokenizes the input stream by delimiting at non-letters and then converting all letters to lowercase. Whitespace and non-letters are discarded.

Factory class: solr.LowerCaseTokenizerFactory

Arguments: None

Example:
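
A configuration sketch:

  <analyzer>
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
  </analyzer>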

In: "I just LOVE my iPhone!"

Out: "i", "just", "love", "my", "iphone"

N-Gram Tokenizer

Reads the field text and generates n-gram tokens of sizes in the given range.

Factory class: solr.NGramTokenizerFactory

Arguments:

minGramSize: (integer, default 1) The minimum n-gram size, must be > 0.

maxGramSize: (integer, default 2) The maximum n-gram size, must be >= minGramSize.

Example:

Default behavior. Note that this tokenizer operates over the whole field. It does not break the field at whitespace. As a result, the space character is included in the n-grams.
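
A configuration sketch using the defaults:

  <analyzer>
    <tokenizer class="solr.NGramTokenizerFactory"/>
  </analyzer>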

In: "hey man"

Out: "h", "e", "y", " ", "m", "a", "n", "he", "ey", "y ", " m", "ma", "an"

Example:

With an n-gram size range of 4 to 5:
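
A configuration sketch for this range:

  <analyzer>
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="4" maxGramSize="5"/>
  </analyzer>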

In: "bicycle"

Out: "bicy", "bicyc", "icyc", "icycl", "cycl", "cycle", "ycle"

Edge N-Gram Tokenizer

Reads the field text and generates edge n-gram tokens of sizes in the given range.

Factory class: solr.EdgeNGramTokenizerFactory

Arguments:

minGramSize: (integer, default is 1) The minimum n-gram size, must be > 0.

maxGramSize: (integer, default is 1) The maximum n-gram size, must be >= minGramSize.

side: ("front" or "back", default is "front") Whether to compute the n-grams from the beginning (front) of the text or from the end (back).

Example:

Default behavior (min and max default to 1):
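
A configuration sketch:

  <analyzer>
    <tokenizer class="solr.EdgeNGramTokenizerFactory"/>
  </analyzer>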

In: "babaloo"

Out: "b"

Example:

Edge n-gram range of 2 to 5:
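
A configuration sketch for this range:

  <analyzer>
    <tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="2" maxGramSize="5"/>
  </analyzer>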

In: "babaloo"

Out:"ba", "bab", "baba", "babal"

Example:

Edge n-gram range of 2 to 5, from the back side:
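
A configuration sketch mirroring the side argument described above:

  <analyzer>
    <tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="2" maxGramSize="5" side="back"/>
  </analyzer>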

In: "babaloo"

Out: "oo", "loo", "aloo", "baloo"

ICU Tokenizer

This tokenizer processes multilingual text and tokenizes it appropriately based on its script attribute.

You can customize this tokenizer's behavior by specifying per-script rule files. To add per-script rules, add a rulefiles argument, which should contain a comma-separated list of code:rulefile pairs in the following format: four-letter ISO 15924 script code, followed by a colon, then a resource path. For example, to specify rules for Latin (script code "Latn") and Cyrillic (script code "Cyrl"), you would enter Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi.

The default solr.ICUTokenizerFactory provides UAX#29 word break rules tokenization (like solr.StandardTokenizer), but also includes custom tailorings for Hebrew (special handling of double and single quotation marks) and syllable tokenization for Khmer, Lao, and Myanmar.

Factory class: solr.ICUTokenizerFactory

Arguments:

rulefiles: a comma-separated list of code:rulefile pairs in the following format: four-letter ISO 15924 script code, followed by a colon, then a resource path.

Example:
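
A default configuration sketch (add a rulefiles attribute for per-script tailoring as described above):

  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
  </analyzer>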

To use this tokenizer, you must add additional .jars to Solr's classpath (as described in the section Lib Directives in SolrConfig). See the solr/contrib/analysis-extras/README.txt for information on which jars you need to add to your SOLR_HOME/lib.

 

Path Hierarchy Tokenizer

This tokenizer creates synonym-like tokens from file path hierarchies, emitting one token for each level of the path.

Factory class: solr.PathHierarchyTokenizerFactory

Arguments:

delimiter: (character, no default) Specifies the file path delimiter in the input text. This can be useful for working with backslash delimiters.

replace: (character, no default) Specifies the delimiter character Solr uses in the tokenized output.

Example:
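
A sketch that splits on backslashes and emits forward slashes in the output, matching the example below:

  <analyzer>
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="\" replace="/"/>
  </analyzer>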

In: "c:\usr\local\apache"

Out: "c:", "c:/usr", "c:/usr/local", "c:/usr/local/apache"

Regular Expression Pattern Tokenizer

This tokenizer uses a Java regular expression to break the input text stream into tokens. The expression provided by the pattern argument can be interpreted either as a delimiter that separates tokens, or as a pattern whose matches should be extracted from the text as tokens.

See the Javadocs for java.util.regex.Pattern for more information on Java regular expression syntax.

Factory class: solr.PatternTokenizerFactory

Arguments:

pattern: (Required) The regular expression, as defined in java.util.regex.Pattern.

group: (Optional, default -1) Specifies which regex group to extract as the token(s). The value -1 means the regex should be treated as a delimiter that separates tokens. Non-negative group numbers (>= 0) indicate that character sequences matching that regex group should be converted to tokens. Group zero refers to the entire regex, groups greater than zero refer to parenthesized sub-expressions of the regex, counted from left to right.

Example:

A comma-separated list. Tokens are separated by a sequence of zero or more spaces, a comma, and zero or more spaces.
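
A delimiter-style sketch; with the default group of -1, the pattern (here using \s* for the optional spaces) splits the input:

  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,\s*"/>
  </analyzer>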

In: "fee,fie, foe , fum, foo"

Out: "fee", "fie", "foe", "fum", "foo"

Example:

Extract simple, capitalized words. A sequence of at least one capital letter followed by zero or more letters of either case is extracted as a token.
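
A matching-style sketch; group="0" turns each whole-pattern match into a token:

  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[A-Z][A-Za-z]*" group="0"/>
  </analyzer>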

In: "Hello. My name is Inigo Montoya. You killed my father. Prepare to die."

Out: "Hello", "My", "Inigo", "Montoya", "You", "Prepare"

Example:

Extract part numbers which are preceded by "SKU", "Part" or "Part Number", case sensitive, with an optional colon separator. Part numbers must be all numeric digits, with an optional hyphen. Regex capture groups are numbered by counting left parentheses from left to right. Group 3 is the subexpression "[0-9-]+", which matches one or more digits or hyphens.
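
A sketch extracting group 3 as the token:

  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="(SKU|Part(\sNumber)?):?\s([0-9-]+)" group="3"/>
  </analyzer>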

In: "SKU: 1234, Part Number 5678, Part: 126-987"

Out: "1234", "5678", "126-987"

Simplified Regular Expression Pattern Tokenizer

This tokenizer is similar to the PatternTokenizerFactory described above, but uses Lucene RegExp pattern matching to construct distinct tokens for the input stream.  The syntax is more limited than PatternTokenizerFactory, but the tokenization is quite a bit faster.

Factory class: solr.SimplePatternTokenizerFactory

Arguments:

pattern: (Required) The regular expression, as defined in the RegExp javadocs, identifying the characters to include in tokens. The matching is greedy such that the longest token matching at a given point is created. Empty tokens are never created.

maxDeterminizedStates: (Optional, default 10000) The limit on the total number of states in the determinized automaton computed from the regular expression.

Example:

To match tokens delimited by simple whitespace characters:
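
A sketch in which the pattern names the characters that make up a token, so every run of characters other than space, tab, carriage return, and newline becomes a token:

  <analyzer>
    <tokenizer class="solr.SimplePatternTokenizerFactory" pattern="[^ \t\r\n]+"/>
  </analyzer>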

Simplified Regular Expression Pattern Splitting Tokenizer

This tokenizer is similar to the SimplePatternTokenizerFactory described above, but uses Lucene RegExp pattern matching to identify sequences of characters that should be used to split tokens.  The syntax is more limited than PatternTokenizerFactory, but the tokenization is quite a bit faster.

Factory class: solr.SimplePatternSplitTokenizerFactory

Arguments:

pattern: (Required) The regular expression, as defined in the RegExp javadocs, identifying the characters that should split tokens. The matching is greedy such that the longest token separator matching at a given point is matched. Empty tokens are never created.

maxDeterminizedStates: (Optional, default 10000) The limit on the total number of states in the determinized automaton computed from the regular expression.

Example:

To match tokens delimited by simple whitespace characters:
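
A sketch in which the pattern instead names the separator characters:

  <analyzer>
    <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>
  </analyzer>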

 

UAX29 URL Email Tokenizer

This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions:

  • Periods (dots) that are not followed by whitespace are kept as part of the token.
  • Words are split at hyphens, as with the Standard Tokenizer.
  • Recognizes and preserves as single tokens the following: 
    • Internet domain names containing top-level domains validated against the white list in the IANA Root Zone Database when the tokenizer was generated
    • email addresses
    • file://, http(s)://, and ftp:// URLs
    • IPv4 and IPv6 addresses

The UAX29 URL Email Tokenizer supports Unicode standard annex UAX#29 word boundaries with the following token types: <ALPHANUM>, <NUM>, <URL>, <EMAIL>, <SOUTHEAST_ASIAN>, <IDEOGRAPHIC>, and <HIRAGANA>.

Factory class: solr.UAX29URLEmailTokenizerFactory

Arguments:

maxTokenLength: (integer, default 255) Solr ignores tokens that exceed the number of characters specified by maxTokenLength.

Example:
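
A configuration sketch using the default maxTokenLength:

  <analyzer>
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
  </analyzer>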

In: "Visit http://accarol.com/contact.htm?from=external&a=10 or e-mail bob.cratchet@accarol.com"

Out: "Visit", "http://accarol.com/contact.htm?from=external&a=10", "or", "e", "mail", "bob.cratchet@accarol.com"

White Space Tokenizer

Simple tokenizer that splits the text stream on whitespace and returns sequences of non-whitespace characters as tokens. Note that any punctuation will be included in the tokens.

Factory class: solr.WhitespaceTokenizerFactory

Arguments:

rule: Specifies how to define whitespace for the purpose of tokenization. Valid values:

  • java: (default) Uses Character.isWhitespace(int)
  • unicode: Uses Unicode's WHITESPACE property

Example:
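
A configuration sketch using the default rule:

  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory" rule="java"/>
  </analyzer>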

In: "To be, or what?"

Out: "To", "be,", "or", "what?"

 
