Overview of Chunker

In cTAKES, when we refer to a "chunker" we usually mean a shallow parser, i.e. a component that tags noun phrases, verb phrases, etc.

This project supports three tasks:

  • Building a model from training data;
  • Tagging text, using a trained model;
  • Adjusting the end offset of certain chunks so they envelop other chunks, for certain patterns of chunks.
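
The third task can be pictured with a minimal sketch. It assumes a hypothetical Chunk type with a type label and character offsets, and a hypothetical pattern such as NP PP NP. Neither the class names nor the pattern come from the cTAKES code; they only illustrate what "extending the end offset so a chunk envelops other chunks" means.

```java
import java.util.List;

final class ChunkAdjusterSketch {

    /** Hypothetical chunk: a type label (e.g. "NP") plus character offsets. */
    static final class Chunk {
        final String type;
        final int begin;
        int end; // mutable because the adjustment only changes the end offset

        Chunk(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /**
     * Wherever consecutive chunk types match the pattern, extends the end
     * offset of the first matching chunk to the end offset of the last one.
     */
    static void extendToEnvelop(List<Chunk> chunks, String[] pattern) {
        if (pattern.length == 0) {
            return; // nothing to match
        }
        for (int i = 0; i + pattern.length <= chunks.size(); i++) {
            boolean matches = true;
            for (int j = 0; j < pattern.length; j++) {
                if (!chunks.get(i + j).type.equals(pattern[j])) {
                    matches = false;
                    break;
                }
            }
            if (matches) {
                chunks.get(i).end = chunks.get(i + pattern.length - 1).end;
            }
        }
    }
}
```

With the pattern NP PP NP, the first noun phrase would be stretched to cover the prepositional phrase and the noun phrase that follow it, while the other chunks keep their offsets.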

This project provides a UIMA wrapper around the popular OpenNLP chunker. The UIMA examples project provides default wrappers for several of the components in OpenNLP, but not for the chunker. We have borrowed from the UIMA examples project liberally. Our wrapper works with our type system. Additionally, we added features and supporting components.

A chunker model is included with this project.

The model derives from a combination of GENIA, Penn Treebank (Wall Street Journal) and clinical data anonymized per Safe Harbor HIPAA guidelines. Before the model was built, patient names were removed from the clinical data to preserve patient confidentiality. Any person name in the model therefore originates from non-patient data sources.

Building a model - Prepare GENIA training data

You need to download a copy of GENIA's Treebank corpus from tokyo.ac.jp/~genia/topics/Corpus/GTB.html. The version we used is called "beta". It is distributed as a set of two files, one dated Sept. 22, 2004, with 200 "abstracts", and the other dated July 11, 2005, with 300 "abstracts". Please download both. After extraction, place all the .tree files from the two downloads into one directory, which we'll refer to as <genia-trees>.

Please also download chunklink from ilk.uvt.nl. The version we used is chunklink_2-2-2000_for_conll.pl. This tool, from the Induction of Linguistic Knowledge (ILK) group of Tilburg University, The Netherlands, converts Penn Treebank II files into a one-word-per-line format.

Next, we'll use data.chunk.genia.Genia2PTB to convert the GENIA Treebank corpus to Penn Treebank II format, then use chunklink to convert the result to chunk data, and finally use data.chunk.Chunklink2OpenNLP to convert the chunk data to OpenNLP format.

The Genia2PTB Java class does three things: a) it renames the .tree files to names that look like wsj_0001.mrg, puts them in the directory structure expected by chunklink, and creates a mapping between the new names and the original names; b) it reformats the POS tags; c) it adds an extra set of parentheses to each line of the data.
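
As a rough illustration of steps a) and c), the sketch below plans the renames and wraps a tree line in parentheses. It is not the Genia2PTB source: the sequential numbering in sorted order, the direction of the mapping, and the exact spacing of the added parentheses are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class GeniaRenameSketch {

    /**
     * Maps each original file name to a PTB-style name such as wsj_0001.mrg.
     * Assumption: files are numbered sequentially in sorted order.
     */
    static Map<String, String> planRenames(List<String> originalNames) {
        List<String> sorted = new ArrayList<>(originalNames);
        Collections.sort(sorted);
        Map<String, String> mapping = new LinkedHashMap<>();
        int n = 1;
        for (String name : sorted) {
            mapping.put(name, String.format(Locale.ROOT, "wsj_%04d.mrg", n++));
        }
        return mapping;
    }

    /** Wraps one tree line in an extra set of parentheses (spacing is assumed). */
    static String wrapInParens(String treeLine) {
        return "( " + treeLine + " )";
    }
}
```

Keeping the mapping is what later lets you trace a converted file such as wsj_0001.mrg back to its original GENIA file name.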

  • Run data.chunk.genia.Genia2PTB:

java -cp <classpath> data.chunk.genia.Genia2PTB <genia-trees> <ptb-trees> <genia-ptb-name-mapping>
<genia-trees> is the directory that holds the GENIA corpus files.
<ptb-trees> is the directory where the converted PTB trees will be written.
<genia-ptb-name-mapping> is a file that Genia2PTB will create to save the file name mappings.

There are a number of problematic sentences in the second set of 300 treebanked abstracts (in <ptb-trees> after processing by data.chunk.genia.Genia2PTB) that cause the chunklink script to fail. We removed them when building our model. The original GENIA file names and the numbers of the lines to remove are listed below for your reference. You need to remove these lines from the output of Genia2PTB. To find the converted file names, look in <genia-ptb-name-mapping>.

Where a file has more than one problematic line, the line numbers are separated by commas.

  • 93123257.tree - 6
  • 93172387.tree - 3
  • 93186809.tree - 5
  • 93280865.tree - 7
  • 94085904.tree - 6
  • 94193110.tree - 2
  • 96247631.tree - 3, 5
  • 96353916.tree - 10
  • 96357043.tree - 4
  • 97031819.tree - 3, 4
  • 97054651.tree - 7
  • 97074532.tree - 6, 7
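
Removing the listed lines by hand is error-prone, so a small helper may be useful. The sketch below is a hypothetical utility, not part of this project. It assumes the line numbers above are 1-based and that the files are UTF-8 text.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class LineRemoverSketch {

    /** Returns a copy of the lines without the given 1-based line numbers. */
    static List<String> removeLines(List<String> lines, Set<Integer> lineNumbers) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            if (!lineNumbers.contains(i + 1)) {
                kept.add(lines.get(i));
            }
        }
        return kept;
    }

    /** Rewrites the file in place without the given 1-based line numbers. */
    static void removeLinesFromFile(Path file, Set<Integer> lineNumbers) throws IOException {
        List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
        Files.write(file, removeLines(lines, lineNumbers), StandardCharsets.UTF_8);
    }
}
```

For the file converted from 96247631.tree, for example, you would look up its new name in <genia-ptb-name-mapping> and pass the set containing 3 and 5. All numbers are removed in one pass, so removing line 3 does not shift the meaning of line 5.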

  • Run chunklink:

perl chunklink_2-2-2000_for_conll.pl -NHhftc <ptb-trees>/wsj_????.mrg > <chunklink-chunks>
<chunklink-chunks> is the redirected standard output from chunklink.

The chunklink script doesn't seem to work on Windows directly, but we did manage to run it in a Cygwin session.

  • Run data.chunk.Chunklink2OpenNLP:

java -cp <classpath> data.chunk.Chunklink2OpenNLP <chunklink-chunks> <training-data>
<chunklink-chunks> is the output of chunklink from the previous step.
<training-data> is the resulting training data file.
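
To give a feel for this conversion, the sketch below reorders columns into the one-token-per-line "word POS chunk-tag" layout, with a blank line between sentences, that OpenNLP chunker training data uses. It is not the Chunklink2OpenNLP source. The column positions are passed in as parameters because this page does not describe the chunklink output layout, and skipping lines that start with '#' as header comments is likewise an assumption.

```java
import java.util.ArrayList;
import java.util.List;

final class ChunkColumnsSketch {

    /**
     * Converts whitespace-separated rows into "word POS chunk-tag" lines.
     * Rows starting with '#' are skipped (assumed to be header comments);
     * blank rows are kept as sentence separators.
     */
    static List<String> toOpenNlpLines(List<String> rows, int wordCol, int posCol, int chunkCol) {
        List<String> out = new ArrayList<>();
        for (String row : rows) {
            String trimmed = row.trim();
            if (trimmed.startsWith("#")) {
                continue;
            }
            if (trimmed.isEmpty()) {
                out.add("");
                continue;
            }
            String[] cols = trimmed.split("\\s+");
            out.add(cols[wordCol] + " " + cols[posCol] + " " + cols[chunkCol]);
        }
        return out;
    }
}
```

Before relying on anything like this, inspect a few lines of your own <chunklink-chunks> file to see which columns it actually contains.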

Building a model - Prepare Penn Treebank training data

Preparing Penn Treebank data is similar to preparing GENIA data, as described in the section called "Prepare GENIA training data" above, except that the first step (running Genia2PTB) is not necessary because the corpus is already in Penn Treebank format.

  • Run chunklink:

perl chunklink_2-2-2000_for_conll.pl -NHhftc <ptb-corpus>/wsj_????.mrg > <chunklink-chunks>

<ptb-corpus> is your Penn Treebank corpus directory.
<chunklink-chunks> is the redirected standard output from chunklink.

  • Run data.chunk.Chunklink2OpenNLP:

java -cp <classpath> data.chunk.Chunklink2OpenNLP <chunklink-chunks> <training-data>

<chunklink-chunks> is the output of chunklink from the previous step.
<training-data> is the resulting training data file.

Build a model from your training data

Building a chunker model is much easier than preparing the training data. After you have obtained training data, run the OpenNLP tool:

java -cp <classpath> opennlp.tools.chunker.ChunkerME <training-data> <model-name> <iterations> <cutoff>
<training-data> is an OpenNLP training data file.
<model-name> is the file name of the resulting model. The name should end with either .txt (for a plain text model) or .bin.gz (for a compressed binary model).
<iterations> determines how many training iterations will be performed. The default is 100.
<cutoff> determines the minimum number of times a feature has to be seen to be considered for inclusion in the model. The default cutoff is 5.
The iterations and cutoff arguments are optional as a pair: provide both or provide neither.
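
The "both or neither" rule can be made concrete with a small sketch. The class below is hypothetical and is not the OpenNLP implementation. It only shows how a wrapper script might validate the four arguments and fall back to the defaults of 100 iterations and a cutoff of 5 stated above.

```java
final class TrainerArgsSketch {

    static final int DEFAULT_ITERATIONS = 100;
    static final int DEFAULT_CUTOFF = 5;

    final String trainingData;
    final String modelName;
    final int iterations;
    final int cutoff;

    private TrainerArgsSketch(String trainingData, String modelName, int iterations, int cutoff) {
        this.trainingData = trainingData;
        this.modelName = modelName;
        this.iterations = iterations;
        this.cutoff = cutoff;
    }

    /** Accepts either two arguments (defaults apply) or all four. */
    static TrainerArgsSketch parse(String[] args) {
        if (args.length != 2 && args.length != 4) {
            throw new IllegalArgumentException(
                "expected <training-data> <model-name> [<iterations> <cutoff>]");
        }
        int iterations = DEFAULT_ITERATIONS;
        int cutoff = DEFAULT_CUTOFF;
        if (args.length == 4) {
            iterations = Integer.parseInt(args[2]);
            cutoff = Integer.parseInt(args[3]);
        }
        return new TrainerArgsSketch(args[0], args[1], iterations, cutoff);
    }
}
```

Passing three arguments is rejected outright, which avoids silently treating a lone number as either the iteration count or the cutoff.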

Analysis engines (annotators)


The file cTAKESdesc/chunkerdesc/analysis_engine/Chunker.xml provides a descriptor for the Chunker analysis engine, the UIMA component we have written that wraps the OpenNLP chunker. It calls edu.mayo.bmi.uima.chunker.Chunker, whose Javadoc explains how to customize this descriptor. It defines the following parameters:

  • ModelFile - the file that contains the chunker tagging model;
  • ChunkerCreatorClass - the full class name of an implementation of the interface edu.mayo.bmi.uima.chunker.ChunkerCreator.


The file cTAKESdesc/chunkerdesc/analysis_engine/ChunkerAggregate.xml provides a descriptor that defines a pipeline for shallow parsing so that all the necessary inputs (e.g. tokens, sentences, and POS tags) have been added to the CAS. It inherits two parameters from Chunker.xml and three from POSTagger.xml.

  • Start UIMA CPE GUI.

java -cp <classpath> org.apache.uima.tools.cpm.CpmFrame

  • Open this file.
  • Set the parameters for the collection reader to point to a local collection of files that you want shallow parsed.
  • Set the parameters for the Chunker as appropriate for your environment.
  • Set the output directory of the XCAS Writer CAS Consumer.

The results of running the pipeline are written to the output directory as XCAS files. These files can be viewed in the CAS Visual Debugger.
