Apache Solr Documentation

*** As of June 2017, the latest Solr Ref Guide is located at https://lucene.apache.org/solr/guide ***

Please note comments on these pages have now been disabled for all users.

A Char Filter is a component that pre-processes input characters. Char Filters can be chained like Token Filters and are placed in front of a Tokenizer. They can add, change, or remove characters while preserving the original character offsets, which supports features such as highlighting.
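
To illustrate chaining, here is a minimal sketch of a field type that runs two Char Filters ahead of the Tokenizer. The field type name text_html_folded, the choice of tokenizer, and the mapping file name are assumptions made for illustration, not part of this guide's original examples.

```xml
<!-- Hypothetical field type: the name and the tokenizer are illustrative assumptions. -->
<fieldType name="text_html_folded" class="solr.TextField">
  <analyzer>
    <!-- Char Filters are applied in the order listed, before the Tokenizer sees any text. -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
```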

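Topics discussed in this section:

  • solr.MappingCharFilterFactory
  • solr.HTMLStripCharFilterFactory
  • solr.ICUNormalizer2CharFilterFactory
  • solr.PatternReplaceCharFilterFactory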

solr.MappingCharFilterFactory

This filter creates org.apache.lucene.analysis.charfilter.MappingCharFilter, which can be used for changing one string to another (for example, for normalizing é to e).

This filter requires specifying a mapping argument, which is the path and name of a file containing the mappings to perform.

Example:
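
A minimal sketch of an analyzer that uses this filter, assuming a mapping file named mapping-FoldToASCII.txt is available in the collection's configuration directory (the file name and the tokenizer are illustrative assumptions):

```xml
<analyzer>
  <!-- mapping: path and name of the file that holds the "source" => "target" lines. -->
  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
  <!-- The tokenizer below is an assumption; any tokenizer can follow the Char Filter. -->
  <tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>
```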


Mapping file syntax:

  • Comment lines beginning with a hash mark (#), as well as blank lines, are ignored.

  • Each non-comment, non-blank line consists of a mapping of the form: "source" => "target"
    • Double-quoted source string, optional whitespace, an arrow (=>), optional whitespace, double-quoted target string.
  • Trailing comments on mapping lines are not allowed.
  • The source string must contain at least one character, but the target string may be empty.

  • The following character escape sequences are recognized within source and target strings:

    Escape sequence | Resulting character (ECMA-48 alias) | Unicode character | Example mapping line
    --------------- | ----------------------------------- | ----------------- | --------------------
    \\ | \ | U+005C | "\\" => "/"
    \" | " | U+0022 | "\"and\"" => "'and'"
    \b | backspace (BS) | U+0008 | "\b" => " "
    \t | tab (HT) | U+0009 | "\t" => ","
    \n | newline (LF) | U+000A | "\n" => "<br>"
    \f | form feed (FF) | U+000C | "\f" => "\n"
    \r | carriage return (CR) | U+000D | "\r" => "/carriage-return/"
    \uXXXX | Unicode char referenced by the 4 hex digits | U+XXXX | "\uFEFF" => ""
    • A backslash followed by any other character is interpreted as if the character were present without the backslash.

solr.HTMLStripCharFilterFactory

This filter creates org.apache.lucene.analysis.charfilter.HTMLStripCharFilter. This Char Filter strips HTML from the input stream and passes the result to another Char Filter or a Tokenizer.

This filter:

  • Removes HTML/XML tags while preserving other content.
  • Removes attributes within tags and supports optional attribute quoting.
  • Removes XML processing instructions, such as: <?foo bar?>
  • Removes XML comments.
  • Removes XML elements starting with <! and ending with >.
  • Removes contents of <script> and <style> elements.
  • Handles XML comments inside these elements (normal comment processing will not always work).
  • Replaces numeric character entity references like &#65; or &#x7f; with the corresponding character.
  • The terminating ';' is optional if the entity reference is at the end of the input; otherwise the terminating ';' is mandatory, to avoid false matches on something like "Alpha&Omega Corp".
  • Replaces all named character entity references with the corresponding character.
  • &nbsp; is replaced with a space instead of the 0xa0 character.
  • Newlines are substituted for block-level elements.
  • CDATA sections (<![CDATA[ ... ]]>) are recognized.
  • Inline tags, such as <b>, <i>, or <span> will be removed.
  • Uppercase forms of character entities such as quot, gt, lt, and amp (for example, &QUOT; or &AMP;) are recognized and handled like their lowercase forms.

The input need not be an HTML document. The filter removes only constructs that look like HTML. If the input doesn't include anything that looks like HTML, the filter won't remove any input.

The table below presents examples of HTML stripping.

Input | Output
----- | ------
my <a href="www.foo.bar">link</a> | my link
<br>hello<!--comment--> | hello
hello<script><!-- f('<!--internal--></script>'); --></script> | hello
if a<b then print a; | if a<b then print a;
hello <td height=22 nowrap align="left"> | hello
a<b &#65 Alpha&Omega Ω | a<b A Alpha&Omega Ω

Example:
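
A minimal sketch of an analyzer that strips HTML before tokenizing. The filter is shown without arguments, and the tokenizer is an illustrative assumption:

```xml
<analyzer>
  <!-- Strips tags, comments, and script/style contents before tokenization. -->
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <!-- The tokenizer below is an assumption; any tokenizer can follow the Char Filter. -->
  <tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>
```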

 

solr.ICUNormalizer2CharFilterFactory

This filter performs pre-tokenization Unicode normalization using ICU4J.

Arguments:

name: A Unicode Normalization Form, one of nfc, nfkc, nfkc_cf. Default is nfkc_cf.

mode: Either compose or decompose. Default is compose. Use decompose with name="nfc" or name="nfkc" to get NFD or NFKD, respectively.

filter: A UnicodeSet pattern. Codepoints outside the set are always left unchanged. Default is [] (the null set, no filtering - all codepoints are subject to normalization).

Example:
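
A minimal sketch that spells out the documented default arguments explicitly. The tokenizer is an illustrative assumption, and the ICU analysis libraries are assumed to be on Solr's classpath:

```xml
<analyzer>
  <!-- name and mode are shown with their documented defaults; omitting them should have the same effect. -->
  <charFilter class="solr.ICUNormalizer2CharFilterFactory" name="nfkc_cf" mode="compose"/>
  <!-- The tokenizer below is an assumption; any tokenizer can follow the Char Filter. -->
  <tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>
```

Setting name="nfc" together with mode="decompose" would request NFD instead, as described in the arguments above.
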

solr.PatternReplaceCharFilterFactory

This filter uses regular expressions to replace or change character patterns.

Arguments:

pattern: the regular expression pattern to apply to the incoming text.

replacement: the text to use to replace matching patterns.

You can configure this filter in schema.xml like this:
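
A minimal sketch, reusing the charFilter element quoted in the comments at the end of this page; the surrounding analyzer and the tokenizer are illustrative assumptions:

```xml
<analyzer>
  <!-- Keeps "No." (group 1) and the number (group 2) but drops any whitespace between them. -->
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="([nN][oO]\.)\s*(\d+)" replacement="$1$2"/>
  <!-- The tokenizer below is an assumption; any tokenizer can follow the Char Filter. -->
  <tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>
```

With this pattern, an input such as "No. 543" would reach the tokenizer as "No.543".
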

The table below presents examples of regex-based pattern replacement:

Input | pattern | replacement | Output | Description
----- | ------- | ----------- | ------ | -----------
see-ing looking | (\w+)(ing) | $1 | see-ing look | Removes "ing" from the end of a word.
see-ing looking | (\w+)ing | $1 | see-ing look | Same as above. The second pair of parentheses can be omitted.
No.1 NO. no. 543 | [nN][oO]\.\s*(\d+) | #$1 | #1 NO. #543 | Replaces some string literals.
abc=1234=5678 | (\w+)=(\d+)=(\d+) | $3=$1=$2 | 5678=abc=1234 | Changes the order of the groups.
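
To show how a table row maps onto the filter's arguments, here is a sketch of the last row (reordering the groups) written as a charFilter element. Only the pattern and replacement values come from the table; the rest is illustrative:

```xml
<!-- pattern and replacement are taken from the last table row: abc=1234=5678 becomes 5678=abc=1234. -->
<charFilter class="solr.PatternReplaceCharFilterFactory"
            pattern="(\w+)=(\d+)=(\d+)" replacement="$3=$1=$2"/>
```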

Related Topics


11 Comments

  1. Just to clarify,

    in the PatternReplaceCharFilterFactory's example,

    "see-ing looking" is filtered to "see-ing look" because "see-" is not recognized as a word, right?

    1. Right. More specifically, "see-" doesn't match \w+ because the hyphen (-) isn't included in the \w character set [a-zA-Z_0-9] - see http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html

  2. Hi,

    I think the parameter for solr.PatternReplaceCharFilterFactory is not 'replaceWith'. The correct one is 'replacement'.

    <charFilter class="solr.PatternReplaceCharFilterFactory"
                 pattern="([nN][oO]\.)\s*(\d+)" replacement="$1$2"/>
    1. Thanks Jorge, I've replaced replaceWith with replacement.

  3. The MappingCharFilterFactory section needs to describe what the mapping file should look like. A file example will help.

     

    1. Good point, I've added a description of the mapping file syntax.

  4. Hi,

    How can the result of a filter be copied into a new field? I've tried copyField and it didn't work; I'm still getting the source data as is.

    Thanks,

    Yousef.

    1. yousef: analyzers (including the CharFilterFactories that might start an analyzer) only affect the indexed terms in a field, so the results can't be copied to another field.

      I've added a note about this to the Analyzers page to make it clearer.

  5. Hi, 

    I'm stuck on this problem and need help.

    Problem statement: remove "." from alphanumeric strings only.

    For example:

    Want: Java1.7 -> Java17

    Don't want: ASP.NET -> ASPNET

    That is, the dot must be removed only from strings that contain numbers.

    I hope the question is clear.

    Thanks

    Vivek

     

     

  6. Can we please add a related topic of Filter Descriptions? I didn't realize these were two different things. We should also add a reference to the "Analysis" section of the admin tool, because that's a very powerful way to debug these patterns (we should add that to the Filter Descriptions page too!)