Apache Solr Documentation

This Unreleased Guide Will Cover Apache Solr 6.4

Phonetic matching algorithms may be used to encode tokens so that two different spellings that are pronounced similarly will match.

For overviews of and comparisons between algorithms, see http://en.wikipedia.org/wiki/Phonetic_algorithm and http://ntz-develop.blogspot.com/2011/03/phonetic-algorithms.html

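Algorithms discussed in this section:

  • Beider-Morse Phonetic Matching (BMPM)
  • Daitch-Mokotoff Soundex
  • Double Metaphone
  • Metaphone
  • Soundex
  • Refined Soundex
  • Caverphone
  • Kölner Phonetik a.k.a. Cologne Phonetic
  • NYSIIS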

Beider-Morse Phonetic Matching (BMPM)

To use this encoding in your analyzer, see Beider Morse Filter in the Filter Descriptions section.

Beider-Morse Phonetic Matching (BMPM) is a "soundalike" tool for searching personal names (or just surnames) in a Solr/Lucene index. It is designed to give more precise matches than older phonetic codecs such as regular soundex, metaphone, and caverphone.

In general, phonetic matching lets you search a name list for names that are phonetically equivalent to the desired name. BMPM is similar to a soundex search in that an exact spelling is not required. Unlike soundex, it does not generate a large quantity of false hits.

From the spelling of the name, BMPM attempts to determine the language. It then applies phonetic rules for that particular language to transliterate the name into a phonetic alphabet. If it is not possible to determine the language with a fair degree of certainty, it uses generic phonetic rules instead. Finally, it applies language-independent rules regarding such things as voiced and unvoiced consonants and vowels to further ensure the reliability of the matches.

For example, assume that a more permissive phonetic search for "Stephen" in a database finds "Stefan", "Steph", "Stephen", "Steve", "Steven", "Stove", and "Stuffin". "Stefan", "Stephen", and "Steven" are probably relevant, and are names that you want to see. "Stuffin", however, is probably not relevant, and BMPM rejects it. BMPM also rejects "Steph", "Steve", and "Stove". Of those, "Stove" is probably not one that you would have wanted, but "Steph" and "Steve" are ones that you might have been interested in, so the reduction in false hits can come at the cost of a few missed matches.

For Solr, BMPM searching is available for the following languages:

  • English
  • French
  • German
  • Greek
  • Hebrew written in Hebrew letters
  • Hungarian
  • Italian
  • Polish
  • Romanian
  • Russian written in Cyrillic letters
  • Russian transliterated into English letters
  • Spanish
  • Turkish

BMPM was developed with Jewish surnames in mind, but the name matching is also applicable to non-Jewish surnames from the countries in which those languages are spoken.

For more information, see http://stevemorse.org/phoneticinfo.htm and http://stevemorse.org/phonetics/bmpm.htm.

Daitch-Mokotoff Soundex

To use this encoding in your analyzer, see Daitch-Mokotoff Soundex Filter in the Filter Descriptions section.

The Daitch-Mokotoff Soundex algorithm is a refinement of the Russell and American Soundex algorithms. It yields greater accuracy when matching surnames, especially Slavic and Yiddish ones, that have similar pronunciation but different spellings.

The main differences compared to the other soundex variants are:

  • coded names are 6 digits long
  • initial character of the name is coded
  • rules to encode multi-character n-grams
  • multiple possible encodings for the same name (branching)

Note: the implementation used by Solr (commons-codec's DaitchMokotoffSoundex) has additional branching rules compared to the original description of the algorithm.

For more information, see http://en.wikipedia.org/wiki/Daitch%E2%80%93Mokotoff_Soundex and http://www.avotaynu.com/soundex.htm

Double Metaphone

To use this encoding in your analyzer, see Double Metaphone Filter in the Filter Descriptions section. Alternatively, you may specify encoder="DoubleMetaphone" with the Phonetic Filter, but note that the Phonetic Filter version will not provide the second ("alternate") encoding that is generated by the Double Metaphone Filter for some tokens.

Encodes tokens using the Double Metaphone algorithm by Lawrence Philips. See the original article at http://www.drdobbs.com/the-double-metaphone-search-algorithm/184401251?pgno=2

Metaphone

To use this encoding in your analyzer, specify encoder="Metaphone" with the Phonetic Filter.

Encodes tokens using the Metaphone algorithm by Lawrence Philips, described in "Hanging on the Metaphone" in Computer Language, Dec. 1990.  

See http://en.wikipedia.org/wiki/Metaphone

Soundex

To use this encoding in your analyzer, specify encoder="Soundex" with the Phonetic Filter.

Encodes tokens using the Soundex algorithm, which is used to relate similar names, but can also be used as a general purpose scheme to find words with similar phonemes.

See  http://en.wikipedia.org/wiki/Soundex
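As a rough illustration of what such an encoder does, here is a minimal Python sketch of American Soundex. It is not the code Solr runs (Solr uses a Java implementation); the function name `soundex` and the decision to ignore anything outside A-Z are assumptions made for this example. The last lines apply it to the names from the "Stephen" example in the BMPM section.

```python
from collections import defaultdict

# Letter-to-digit table of American Soundex; vowels, H, W and Y have no digit.
CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(name: str) -> str:
    """Return the 4-character Soundex code of `name` (illustrative sketch)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in ("H", "W"):
            continue  # H and W do not separate two letters with the same digit
        digit = CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel resets `prev`, so a repeated digit is kept
    return (code + "000")[:4]


# Group the names from the "Stephen" example by their Soundex code.
names = ["Stefan", "Steph", "Stephen", "Steve", "Steven", "Stove", "Stuffin"]
buckets = defaultdict(list)
for n in names:
    buckets[soundex(n)].append(n)

for key, group in sorted(buckets.items()):
    print(key, group)
# S310 ['Steph', 'Steve', 'Stove']
# S315 ['Stefan', 'Stephen', 'Steven', 'Stuffin']
```

With plain Soundex, "Stuffin" lands in the same bucket as "Stephen", which is the kind of false hit that more selective algorithms try to avoid.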

Refined Soundex

To use this encoding in your analyzer, specify encoder="RefinedSoundex" with the Phonetic Filter.

Encodes tokens using an improved version of the Soundex algorithm.   

See http://en.wikipedia.org/wiki/Soundex
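To give a feel for how the refined variant differs, the following Python sketch encodes a word without truncating it to four characters and gives vowels their own digit. The letter table is an assumption: it is the US English mapping as commonly attributed to commons-codec's RefinedSoundex, reproduced from memory, so check it against the library before relying on it. The name `refined_soundex` is also made up for this example.

```python
# Assumed digit for each letter A..Z (see the caveat above); vowels map to "0".
REFINED_MAP = "01360240043788015936020505"


def refined_soundex(word: str) -> str:
    """Return a Refined-Soundex-style code for `word` (illustrative sketch)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    out = [letters[0]]  # the first letter is kept as-is ...
    last = None
    for c in letters:   # ... and is also encoded as a digit
        digit = REFINED_MAP[ord(c) - ord("A")]
        if digit != last:  # collapse runs of the same digit
            out.append(digit)
        last = digit
    return "".join(out)


print(refined_soundex("testing"))  # T6036084
print(refined_soundex("dogs"))     # D6043
```

Because the code length grows with the word and more letter groups are distinguished, this scheme tends to produce fewer collisions than the four-character Soundex code.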

Caverphone

To use this encoding in your analyzer, specify encoder="Caverphone" with the Phonetic Filter.

Caverphone is an algorithm created by the Caversham Project at the University of Otago.  The algorithm is optimised for accents present in the southern part of the city of Dunedin, New Zealand.

See http://en.wikipedia.org/wiki/Caverphone and the Caverphone 2.0 specification at http://caversham.otago.ac.nz/files/working/ctp150804.pdf

Kölner Phonetik a.k.a. Cologne Phonetic 

To use this encoding in your analyzer, specify encoder="ColognePhonetic" with the Phonetic Filter.

The Kölner Phonetik, an algorithm published by Hans Joachim Postel in 1969, is optimized for the German language.

See  http://de.wikipedia.org/wiki/K%C3%B6lner_Phonetik
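The rule table on the page linked above is compact enough to sketch in Python. The version below is an illustration only: the function name `cologne_phonetic` is hypothetical, non-letters are simply dropped, and the output has not been compared with commons-codec's ColognePhonetic for every input.

```python
def cologne_phonetic(word: str) -> str:
    """Encode `word` following the Kölner Phonetik rule table (sketch)."""
    # Upper-case first ("ß".upper() is "SS"), fold umlauts, keep A-Z only.
    folded = word.upper().translate(str.maketrans("ÄÖÜ", "AOU"))
    letters = [c for c in folded if "A" <= c <= "Z"]

    digits = []
    for i, c in enumerate(letters):
        prev = letters[i - 1] if i > 0 else ""
        nxt = letters[i + 1] if i + 1 < len(letters) else ""
        if c in set("AEIJOUY"):
            d = "0"
        elif c == "H":
            d = ""  # H is ignored
        elif c == "B":
            d = "1"
        elif c == "P":
            d = "3" if nxt == "H" else "1"
        elif c in set("DT"):
            d = "8" if nxt in set("CSZ") else "2"
        elif c in set("FVW"):
            d = "3"
        elif c in set("GKQ"):
            d = "4"
        elif c == "C":
            if i == 0:
                d = "4" if nxt in set("AHKLOQRUX") else "8"
            elif prev in set("SZ"):
                d = "8"
            else:
                d = "4" if nxt in set("AHKOQUX") else "8"
        elif c == "X":
            d = "8" if prev in set("CKQ") else "48"
        elif c == "L":
            d = "5"
        elif c in set("MN"):
            d = "6"
        elif c == "R":
            d = "7"
        else:  # S and Z
            d = "8"
        digits.append(d)

    raw = "".join(digits)
    # Collapse runs of the same digit, then drop every "0" except a leading one.
    collapsed = "".join(d for i, d in enumerate(raw) if i == 0 or d != raw[i - 1])
    return collapsed[:1] + collapsed[1:].replace("0", "")


print(cologne_phonetic("Müller-Lüdenscheidt"))  # 65752682
print(cologne_phonetic("Wikipedia"))            # 3412
```

Unlike Soundex, the code is purely numeric, has no fixed length, and several letters are encoded differently depending on their neighbours.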

NYSIIS

To use this encoding in your analyzer, specify encoder="Nysiis" with the Phonetic Filter.

NYSIIS (New York State Identification and Intelligence System) is a phonetic encoding used to relate similar names, but it can also be used as a general purpose scheme to find words with similar phonemes.

See http://en.wikipedia.org/wiki/NYSIIS and http://www.dropby.com/NYSIIS.html

6 Comments

  1. Hi, I just downloaded the Beider-Morse Phonetic Matching (BMPM) source files from http://stevemorse.org/phoneticinfo.htm and it seems to me that Portuguese is already one of the languages it supports.

    I took a deeper look and at the end, I noted code related to:

    • Arabic
    • Czech
    • Dutch
    • Portuguese.

    Given that, I believe this page/documentation should be updated to avoid further misunderstanding.

    1. Solr's BMPM capabilities are provided by commons-codec v1.10, which contains a Java reimplementation of the v3.04 sources on stevemorse.org.  Browsing the commons-codec source, I can see resources for the four languages you mention, but it's not clear to what extent they are used.  I'll try to dig deeper later this week.

  2. I am running into an issue that I cannot explain, so if someone has any idea it would be great. I know this is not a support line, but it seems to me like a very straightforward issue that someone might run into. The scenario is simple: I have configured spellcheck, and once I add phonetic search, no results are returned for any query at all. Is this a known issue? If it is not, I can create a question on StackOverflow, but I wanted to ask if, at a high level, there is something that prevents me from having phonetic search and spellcheck in the same request handler.

    1. Indeed, this is not a support forum.  We have a mailing list and an IRC channel.  The mailing list has the larger audience.

      Filters for phonetic matching change the spelling of the words in your index – that's how they work.  If you're building suggestions from your index, the tokens generated by a phonetic filter will be completely useless for spell-checking.

      1. Shawn Heisey, I do understand that this is not a support channel, and asking for support was not my purpose. I have implemented both phonetic search and spellcheck in different fields, and they work fine separately, but they do not work together in the same request handler. My intention was to find out whether it is a known issue (or even an issue at all) that both stop working (and no results are returned) when they are added to the same request handler, so that I can work on isolating the problem and submitting a JIRA issue.

        1. What is/isn't expected/supported and what may/may-not be a bug is exactly the type of question to raise on the solr-user@lucene mailing list, as Shawn advised.

          Documentation comments are for precisely that: commenting on the documentation itself.