
...

Overview of POS Tagger

This project provides a UIMA wrapper around the popular OpenNLP part-of-speech tagger. The UIMA examples project provides a default wrapper from which we have borrowed liberally. We have created our own wrapper so that it will work better with our type system and to add features and supporting components. Additionally, both the OpenNLP package and the UIMA examples OpenNLP wrappers lack documentation for how to generate training data, build a part-of-speech tagging model, and build a tag dictionary. The latter in particular can be confusing if you are new to OpenNLP.

...

The model derives from a combination of GENIA, Penn Treebank (Wall Street Journal) and anonymized clinical data per Safe Harbor HIPAA guidelines. Prior to model building, the clinical data was deidentified for patient names to preserve patient confidentiality. Any person name in the model will originate from non-patient data sources.

...

This is a corpus owned and maintained by Mayo Clinic. Unfortunately, because of legal and privacy issues it is not currently available for distribution. However, a part-of-speech model based on this data is released.

...

...

If you'd like to use your own algorithms on the Mayo Clinic corpus, please contact clinicalnlp@mayo.edu for its availability.

...

GENIA

GENIA is a literature mining project in molecular biology from the University of Tokyo. Its corpus, a collection of biomedical literature, has been annotated with POS tags. You can download a copy of the POS corpus, version 3.02p, that we used to build our model from Topics.

...

Another strategy is to take the output of the chunker training data, as detailed in the section called "Prepare Penn Treebank training data" from the Chunker component, and convert it to the correct format.
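
The conversion could look roughly like the sketch below. It assumes the chunker training data has one token per line in the form "word POS chunk-tag", with a blank line between sentences, and that the POS tagger expects one sentence per line as space-separated word_tag pairs. The function name chunk_to_pos_lines is hypothetical and is not part of this project or of OpenNLP.

```python
def chunk_to_pos_lines(lines):
    """Yield one 'word_tag word_tag ...' line per sentence.

    Assumed input: 'word POS chunk-tag' per line, blank line between sentences.
    """
    sentence = []
    for raw in lines:
        fields = raw.split()
        if not fields:
            # A blank line closes the current sentence.
            if sentence:
                yield " ".join(sentence)
                sentence = []
            continue
        if len(fields) < 2:
            raise ValueError(f"expected at least 'word POS', got: {raw!r}")
        word, tag = fields[0], fields[1]
        sentence.append(f"{word}_{tag}")
    # Flush the last sentence if the input does not end with a blank line.
    if sentence:
        yield " ".join(sentence)


if __name__ == "__main__":
    sample = ["He PRP B-NP", "runs VBZ B-VP", "", "Stop VB B-VP"]
    for line in chunk_to_pos_lines(sample):
        print(line)
```

The chunk tag in the third column is simply dropped, since the POS tagger training data has no use for it.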

...

...

No problem. OpenNLP splits the word from the tag using the last underscore. However, there will be difficulties if your data uses an underscore as a part-of-speech tag.
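
To illustrate the "last underscore" rule, here is a small sketch that mimics the splitting behaviour described above. It is not OpenNLP code, and split_token is a hypothetical helper; it only shows why underscores inside a word are harmless while an underscore used as the tag itself is not.

```python
from typing import Tuple


def split_token(token: str) -> Tuple[str, str]:
    """Split a 'word_tag' token at its last underscore."""
    word, sep, tag = token.rpartition("_")
    if not sep:
        raise ValueError(f"no underscore in token: {token!r}")
    return word, tag


if __name__ == "__main__":
    # An underscore inside the word is fine: only the last one separates the tag.
    print(split_token("E_coli_NN"))
    # If the tag itself is an underscore ('x' tagged '_'), the split goes wrong:
    # the word keeps a trailing underscore and the tag comes out empty.
    print(split_token("x__"))
```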

...


...

This is a problem. OpenNLP will not be able to handle a token that contains a space in it. GENIA, for example, contains 108 occurrences of spaces inside tokens. The white space must be removed from these tokens or ignored (see above).
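
One way to remove the white space is sketched below. It assumes the corpus has already been read into (word, tag) pairs per sentence; the function name to_training_line is hypothetical. Tokens that consist only of white space are skipped rather than written out.

```python
import re


def to_training_line(tagged_tokens):
    """Build one 'word_tag ...' training line, stripping white space inside tokens."""
    parts = []
    for word, tag in tagged_tokens:
        cleaned = re.sub(r"\s+", "", word)  # remove every white-space character
        if not cleaned:
            continue  # nothing left of the token, so ignore it
        parts.append(f"{cleaned}_{tag}")
    return " ".join(parts)


if __name__ == "__main__":
    print(to_training_line([("in vitro", "FW"), ("assay", "NN")]))
```

Whether to glue the pieces together, as here, or to drop such tokens entirely depends on how the same tokens will look at tagging time.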

...

Creating a model

java -cp <classpath> opennlp.tools.postag.POSTaggerME <training-data> <model-name> iterations cutoff
Where

...

OpenNLP provides a default tag dictionary for its English part-of-speech model, tag.bin.gz. The dictionary can be downloaded from http://opennlp.sourceforge.net/models/english/parser/tagdict. You should use this tag dictionary only if you are using the model from http://opennlp.sourceforge.net/models/english/parser/tag.bin.gz.

...


...

If you want to use the tag dictionary in a case-insensitive way, entries in the tag dictionary that are not all lowercase will be ignored, because the tag dictionary does not lowercase entries read in from the file. It lowercases only the words that are compared against the dictionary when "CaseSensitive" is set to false. Therefore, if you want the tag dictionary to be used in a case-insensitive way, be sure to build the tag dictionary using 'false' as the third argument.
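
The effect can be seen in the toy model below. This is not OpenNLP's implementation; ToyTagDictionary is a hypothetical class that reproduces only the behaviour described above: entries are stored exactly as read, and only the looked-up word is lowercased when case sensitivity is off.

```python
class ToyTagDictionary:
    """Toy model of the lookup behaviour described in the text (not OpenNLP code)."""

    def __init__(self, entries, case_sensitive):
        # Entries are kept exactly as they appear in the file: no lowercasing.
        self._entries = dict(entries)
        self._case_sensitive = case_sensitive

    def get_tags(self, word):
        # Only the word being looked up is lowercased.
        key = word if self._case_sensitive else word.lower()
        return self._entries.get(key)


if __name__ == "__main__":
    entries = {"Aspirin": ["NNP"], "tablet": ["NN"]}
    tag_dict = ToyTagDictionary(entries, case_sensitive=False)
    # 'Aspirin' is looked up as 'aspirin', which never matches the stored key.
    print(tag_dict.get_tags("Aspirin"))
    # 'Tablet' is looked up as 'tablet', which matches the lowercase entry.
    print(tag_dict.get_tags("Tablet"))
```

In this model the mixed-case entry can never be reached, which is why the dictionary itself needs to be built in lowercase.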

...

Analysis engines (annotators)

...

  • false negative (FN)
    a token in the gold standard data that was not correctly generated by the tokenizer/POS tagger

An example is given in "Evaluate a POS tagger using generated tokens".
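
As a rough illustration of the false-negative definition above, the sketch below counts gold-standard tokens that the system did not reproduce exactly. It assumes each token is represented as a (begin, end, tag) tuple; this representation and the function name false_negatives are assumptions made for the example, not the project's evaluation code.

```python
def false_negatives(gold, generated):
    """Return gold tokens with no exact (begin, end, tag) match in the system output."""
    generated_set = set(generated)
    return [token for token in gold if token not in generated_set]


if __name__ == "__main__":
    gold = [(0, 3, "DT"), (4, 11, "NN"), (12, 14, "VBZ")]
    # The second token has the right span but the wrong tag;
    # the third was tokenized with a different end offset.
    generated = [(0, 3, "DT"), (4, 11, "JJ"), (12, 15, "VBZ")]
    print(false_negatives(gold, generated))
```

Under this representation both a tagging error and a tokenization error on a gold token count as a false negative.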

...