SpellChecker
A Spell Checker allows to suggest a list of words similar to a misspelled word. This implementation is based on David Spencer's code using the n-gram method and the Levenshtein distance.
Structure of a dictionary index
An index (the dictionary) with all the possible words (a lucene index) must be created. The structure of this index is (for a 3-4 gram) this:
Index Structure |
Example |
word |
kings |
gram3 |
kin, ing, ngs |
gram4 |
king, ings |
start3 |
kin |
start4 |
king |
end3 |
ngs |
end4 |
ings |
Import: Adding Words to the Dictionary
We can add the words coming from a Lucene Index (more precisely from a set of Lucene fields), and from a text file with a list of words.
- Example: we can add all the keywords of a given Lucene field of my index.
SpellChecker spell= new SpellChecker(dictionaryDirectory); spell.indexDictionary(new LuceneDictionary(my_luceneReader,my_fieldname));
Getting a List of Suggested Words
The suggestSimilar method returns a list of suggested words sorted by:
- the Levenshtein distance (the most similar word to the misspelled word is the first in the list). 2. (optionally) the popularity of the word in a given Lucene Field.
Furthermore, that list can be restricted only to the words present in a given Lucene Field.
- First example: the suggestSimilar(misspelled_word, num_list) method.
The num_list is the maximum number of words returned.
In this example the list is just sorted with the Levenshtein distance.String[] l=spellChecker.suggestSimilar("sevanty", 2); //l[0] = "seventy"
- Second example: the suggestSimilar(misspelled_word, num_list, myIndexReader,myField, morePopular) Note: if myIndexReader and myField are null this method is the same as the first method
- The returned words are restricted only to the words presents in the field myField of the Lucene Index "myIndexReader" 2. The list is also sorted with a second criterium: the popularity (the frequency) of the word in the user field 3. If morePopular is true and the mispelled word exists in the user field, return only the words more frequent than this.
See the test case code for an example.
- The returned words are restricted only to the words presents in the field myField of the Lucene Index "myIndexReader" 2. The list is also sorted with a second criterium: the popularity (the frequency) of the word in the user field 3. If morePopular is true and the mispelled word exists in the user field, return only the words more frequent than this.
Changes
Version 1.1 :
- sort fixed (the sort was inversed!)
- set gram dynamically (depending of the length of the word)
- use the FuzzyQuery score: ((edit distance)/(length of word))
- new Dictionary interface + LuceneDictionary and PlaintextDictionary implementation
- replace addWords method by indexDictionary(Dictionnary dic)
- add a new public method: boolean exist(word)
- add a build.xml
Credits
- Maisonneuve Nicolas
- Spencer David