FSA Dictionary with morfologik-addon

Note: the morfologik-addon is still under development.

Morfologik provides tools for finite state automata (FSA) construction and FSA-based morphological dictionaries.

The Morfologik Addon implements OpenNLP interfaces and extensions that allow the use of the Morfologik FSA dictionary tools:

  • opennlp.morfologik.tagdict.MorfologikPOSTaggerFactory
    • Extends: opennlp.tools.postag.POSTaggerFactory
    • Helps create a POSTagger model with an embedded TagDictionary based on FSA
  • opennlp.morfologik.tagdict.MorfologikTagDictionary
    • Implements: opennlp.tools.postag.TagDictionary
    • A TagDictionary based on FSA is much smaller than the default XML-based one and consumes less memory.
  • opennlp.morfologik.lemmatizer.MorfologikLemmatizer
    • Implements: opennlp.tools.lemmatizer.DictionaryLemmatizer
    • A dictionary-based lemmatizer that uses an FSA dictionary.

The addon also provides a command line interface with the following tools:

  • MorfologikDictionaryBuilder
    • builds a binary POS dictionary using Morfologik
  • XMLDictionaryToTable
    • reads an OpenNLP XML tag dictionary and writes it to a tab-separated file that can be built into an FSA dictionary
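
To make the XML-to-table step concrete, here is a minimal Python sketch of the idea. It is not the addon's implementation. It assumes that an OpenNLP XML tag dictionary consists of entry elements with a tags attribute and a token child, and it emits one word/tag row per pair. The sample entries and the two-column layout are assumptions for illustration; the real XMLDictionaryToTable tool may write different columns.

```python
import xml.etree.ElementTree as ET

# Assumed shape of an OpenNLP XML tag dictionary; the entries are made up.
SAMPLE_XML = """<dictionary case_sensitive="true">
  <entry tags="n v-fin"><token>casa</token></entry>
  <entry tags="art"><token>o</token></entry>
</dictionary>"""


def xml_dictionary_to_rows(xml_text, separator="\t"):
    """Return one 'word<separator>tag' row per (word, tag) pair."""
    rows = []
    for entry in ET.fromstring(xml_text).iter("entry"):
        word = entry.findtext("token", default="")
        # The tags attribute is assumed to hold space-separated tags.
        for tag in entry.get("tags", "").split():
            rows.append(separator.join((word, tag)))
    return rows


rows = xml_dictionary_to_rows(SAMPLE_XML)
print("\n".join(rows))
```

A flat table like this is a convenient intermediate form because each row is an independent entry that a dictionary builder can sort and compile.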

Addon Installation

Note: currently the addon is not available as a distributable and is not in any public Maven repository.

The addon should be compiled and the result should be copied on top of an OpenNLP binary distribution.

To create the binary distribution execute:

The distribution will be created at target/apache-opennlp-morfologik-addon-1.0-SNAPSHOT-bin.zip

Example of usage

Embed an FSA-based dictionary in a POSModel

In this example we will use the free CoNLL-X Portuguese data to train a POS tagger model and embed an FSA dictionary in it.

Download the Corpus

Download the Portuguese data from http://ilk.uvt.nl/conll/free_data.html

Portuguese train: portuguese_bosque_train.conll
Portuguese test: portuguese_bosque_test.conll
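
For readers who want to inspect the corpus before training, the following Python sketch reads CoNLL-X style data into sentences of (word, tag) pairs. It assumes the usual CoNLL-X layout: ten tab-separated columns with FORM in column 2 and POSTAG in column 5, and a blank line between sentences. The sample rows are invented, and read_conllx is a hypothetical helper name, not part of OpenNLP.

```python
def read_conllx(lines):
    """Yield sentences as lists of (form, postag) pairs."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split("\t")
        sentence.append((columns[1], columns[4]))  # FORM, POSTAG
    if sentence:
        yield sentence


# Invented rows in the assumed ten-column layout.
SAMPLE = [
    "1\tO\to\tart\tart\tM|S\t2\t>N\t_\t_",
    "2\tgato\tgato\tn\tn\tM|S\t3\tSUBJ\t_\t_",
    "3\tdorme\tdormir\tv\tv-fin\tPR|3S|IND\t0\tSTA\t_\t_",
    "",
    "1\tSim\tsim\tadv\tadv\t_\t0\tSTA\t_\t_",
]
sentences = list(read_conllx(SAMPLE))
print(sentences)
```

To run it on the downloaded file, pass an open file object for portuguese_bosque_train.conll instead of SAMPLE, using whatever encoding the file was saved in.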

Train and evaluate a baseline model without dictionary

We start by training and evaluating a model without a tag dictionary.
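
The accuracy figures quoted on this page are presumably token-level accuracy: the share of tokens whose predicted tag matches the gold tag. A minimal Python sketch of that metric follows. The tag sequences are made up, and word_accuracy is a hypothetical helper, not an OpenNLP function.

```python
def word_accuracy(gold_sentences, predicted_sentences):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = 0
    total = 0
    for gold, predicted in zip(gold_sentences, predicted_sentences):
        for gold_tag, predicted_tag in zip(gold, predicted):
            correct += gold_tag == predicted_tag
            total += 1
    return correct / total if total else 0.0


# Made-up gold and predicted tags for two short sentences.
gold = [["art", "n", "v-fin"], ["adv"]]
pred = [["art", "n", "n"], ["adv"]]
accuracy = word_accuracy(gold, pred)
print(f"{accuracy:.2%}")
```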

Train and evaluate with a dictionary extracted from training data

Now we create a model with an embedded XML dictionary created from the training data. All entries that appear more than twice will be included (tagDictCutoff parameter).
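
The cutoff rule can be sketched in a few lines of Python. This is an illustration of the rule as stated above (keep a word/tag pair only if it occurs more than twice), not OpenNLP's code. Whether OpenNLP compares strictly or inclusively against the cutoff, and what extra filtering it applies, is not verified here. The training pairs are invented.

```python
from collections import Counter, defaultdict


def build_tag_dictionary(sentences, cutoff=2):
    """Map each word to the tags seen with it more than `cutoff` times."""
    pair_counts = Counter(pair for sentence in sentences for pair in sentence)
    dictionary = defaultdict(set)
    for (word, tag), count in pair_counts.items():
        if count > cutoff:  # "more than twice" when cutoff is 2
            dictionary[word].add(tag)
    return {word: sorted(tags) for word, tags in dictionary.items()}


# Invented training data: ("o", "art") occurs three times, everything else less.
sentences = [
    [("o", "art"), ("gato", "n")],
    [("o", "art"), ("canto", "n")],
    [("o", "art"), ("canto", "v-fin")],
    [("o", "pron-pers"), ("gato", "n")],
]
tag_dict = build_tag_dictionary(sentences, cutoff=2)
print(tag_dict)
```

The cutoff keeps rare, possibly noisy pairs out of the dictionary, since a tag dictionary restricts which tags the tagger may assign to a known word.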

Note how the accuracy improved from 96.097% to 96.489% after including the dictionary, which shows that the dictionary helps on this data.
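
A gain of about 0.39 percentage points looks small, so it helps to express it as a reduction of the remaining errors. The Python arithmetic below uses only the two accuracy figures quoted above; the variable names are illustrative.

```python
baseline_accuracy = 96.097    # percent, without a tag dictionary
dictionary_accuracy = 96.489  # percent, with the embedded tag dictionary

baseline_error = 100.0 - baseline_accuracy
dictionary_error = 100.0 - dictionary_accuracy

absolute_gain = dictionary_accuracy - baseline_accuracy
relative_error_reduction = (baseline_error - dictionary_error) / baseline_error

print(f"{absolute_gain:.3f} points, {relative_error_reduction:.1%} fewer errors")
```

In other words, the dictionary removes roughly one in ten of the baseline model's tagging errors on this test set.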

Create an FSA Dictionary from the XML Tag Dictionary

Now we extract the XML tag dictionary and convert it to an FSA dictionary.

Comparing sizes

We can now compare the size of the XML dictionary and the FSA dictionary:

The FSA dictionary is much smaller than the original XML version. It also consumes less memory at runtime, because the XML dictionary is loaded into a hash map, while the FSA dictionary is loaded as a compact finite state automaton.
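
The size difference comes from sharing: a minimal automaton stores common prefixes and common suffixes only once. The Python sketch below illustrates this by building a plain trie and then merging states that accept the same continuations. It is a toy illustration under simplifying assumptions, not the construction algorithm Morfologik uses, and the word list and the END marker are made up.

```python
END = "$"  # hypothetical end-of-entry marker; entries must not contain it


def build_trie(entries):
    """Build a character trie stored as nested dictionaries."""
    root = {}
    for entry in entries:
        node = root
        for ch in entry:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def count_trie_states(node):
    """Count the states of the trie (prefixes shared, suffixes not)."""
    return 1 + sum(
        count_trie_states(child) for key, child in node.items() if key != END
    )


def count_minimal_states(root):
    """Count states after merging states with identical continuations."""
    registry = {}

    def canonical_id(node):
        edges = tuple(
            sorted(
                (key, canonical_id(child))
                for key, child in node.items()
                if key != END
            )
        )
        signature = (node.get(END, False), edges)
        return registry.setdefault(signature, len(registry))

    canonical_id(root)
    return len(registry)


words = ["walking", "talking", "walked", "talked"]
trie = build_trie(words)
trie_states = count_trie_states(trie)
minimal_states = count_minimal_states(trie)
print(trie_states, minimal_states)
```

For this word list the trie needs 19 states and the merged automaton needs 9, because the shared endings are stored once. Real tag dictionaries contain many entries with common endings and tags, which is where the savings come from.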

Train a POS model with the FSA dictionary

We can use MorfologikPOSTaggerFactory to create a POS model with an embedded FSA dictionary:

Evaluate

We can evaluate again and verify that the accuracy did not change.
