The OpenNLP Annotations Project proposal describes a community-based annotation project on free data.
The OpenNLP project has long been limited in openness because the training
data necessary to train statistical models for the various components is often guarded
by strict copyright rules and cannot be distributed under an open source license such as
the Apache License 2.0 (AL 2.0).
The non-open training data also handicaps the development of OpenNLP itself:
it is difficult for the developers to train and test on the same data,
and potential contributors face a high barrier because acquiring training data is either
expensive or takes time. Furthermore, distributing models which have been trained on
copyright-protected training data may not be possible under an open source license.
Users and potential users also face a high entry barrier because the available models do not perform
well enough for their needs, and extending the models with a small amount of custom training
data is difficult because of the barriers to acquiring the original training data.
It is proposed that the OpenNLP community shares a corpus of free text and joins
forces to add annotations to this data.
There is no existing annotation project infrastructure which can be reused by this project.
The project needs to create fundamental tooling itself, which can be used for OpenNLP Annotations
and private annotation projects.
The Corpus Server is responsible for hosting collections of texts with annotations. The text
with annotations is represented by UIMA CASes.
All CASes in the server are indexed to make them retrievable via search queries.
These queries are the basis for a couple of features the server supports.
The Corpus Server is a separate server and defines a remote interface which
is used by all other dependent systems to access the hosted data.
In an annotation project a team of annotators needs to find and distribute
a set of CASes which need to be labeled. After a CAS is identified it must be
passed to one annotator until it is returned or released by a timeout. In the
meantime it should not be distributed to a second annotator in the team.
To define such a queue a user specifies a search query which puts all matching
CASes in the queue. To retrieve such a CAS the annotators can call an iterator
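The hand-out rule described above (one annotator per CAS until it is returned or its lease times out) could be sketched roughly as follows. This is only an illustration under assumptions: the class `TaskQueue`, its method names, and the use of plain string ids in place of CASes are hypothetical and not part of any existing OpenNLP or UIMA API. The list passed to the constructor stands in for the result of the search query.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical in-memory task queue; CASes are referred to by id only. */
class TaskQueue {
    private final Deque<String> pending = new ArrayDeque<>();
    // CAS id -> instant at which the current lease expires
    private final Map<String, Instant> leased = new LinkedHashMap<>();
    private final Duration leaseTimeout;

    /** matchingCasIds stands in for the CASes matched by the search query. */
    TaskQueue(List<String> matchingCasIds, Duration leaseTimeout) {
        this.pending.addAll(matchingCasIds);
        this.leaseTimeout = leaseTimeout;
    }

    /** Hands out the next CAS id, or empty if all are done or currently leased. */
    synchronized Optional<String> next(Instant now) {
        reclaimExpired(now);
        String casId = pending.pollFirst();
        if (casId == null) {
            return Optional.empty();
        }
        leased.put(casId, now.plus(leaseTimeout));
        return Optional.of(casId);
    }

    /** Called when an annotator returns a labeled CAS; it leaves the queue. */
    synchronized boolean complete(String casId) {
        return leased.remove(casId) != null;
    }

    /** Puts CASes whose lease has expired back at the end of the queue. */
    private void reclaimExpired(Instant now) {
        Iterator<Map.Entry<String, Instant>> it = leased.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Instant> entry = it.next();
            if (!entry.getValue().isAfter(now)) {
                it.remove();
                pending.addLast(entry.getKey());
            }
        }
    }
}
```

Passing the current time in as a parameter keeps the sketch easy to test. A real server would more likely read a clock itself and persist the lease table.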
Search for a CAS
A user wants to search for a certain CAS and retrieve it.
UIMA AAE process support
To programmatically work with the CASes a user needs to create a UIMA Analysis Engine.
Such an engine can be used to train OpenNLP or to automatically label CASes in the
A user defines a Task Queue which contains all the articles which should be fed into
the UIMA AAE. At the end of the AAE pipeline the user can insert a special AE to write
the results back to the Corpus Server if required.
The service will be implemented as a RESTful service based on JAX-RS and Jersey.
The interface needs to support these methods in the initial version:
- add - inserts a CAS into the server
- update - replaces an existing CAS with the provided one
- get - retrieves a CAS from the server
- query - retrieves a list of CASes which are matched by the provided search query
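As a rough illustration of the four methods listed above, the following sketch expresses them as a plain Java interface with an in-memory stand-in. Everything here is an assumption made for illustration: the names `CorpusStore` and `InMemoryCorpusStore`, the use of a string id per CAS, passing the CAS as an XMI string, and the substring match used in place of a real search query. The JAX-RS resource mapping is deliberately left out so that the example only needs the JDK.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical Java view of the four operations; names and types are assumptions. */
interface CorpusStore {
    void add(String casId, String xmi);       // inserts a new CAS
    void update(String casId, String xmi);    // replaces an existing CAS
    String get(String casId);                 // returns the XMI, or null if unknown
    List<String> query(String searchQuery);   // ids of the matching CASes
}

/** In-memory stand-in; a real server would use a database and a search index. */
class InMemoryCorpusStore implements CorpusStore {
    private final Map<String, String> casById = new LinkedHashMap<>();

    @Override
    public void add(String casId, String xmi) {
        if (casById.putIfAbsent(casId, xmi) != null) {
            throw new IllegalStateException("CAS already exists: " + casId);
        }
    }

    @Override
    public void update(String casId, String xmi) {
        if (casById.replace(casId, xmi) == null) {
            throw new IllegalStateException("Unknown CAS: " + casId);
        }
    }

    @Override
    public String get(String casId) {
        return casById.get(casId);
    }

    @Override
    public List<String> query(String searchQuery) {
        // Naive substring match, used here only as a placeholder for an index lookup.
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : casById.entrySet()) {
            if (entry.getValue().contains(searchQuery)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }
}
```

Keeping add and update separate, as in the list above, lets the server reject an accidental overwrite on add and an update of a CAS that was never stored.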
The articles will be represented by UIMA CASes in the XMI format, and each article
is stored in one CAS.
In the initial version the server should use Apache Derby as the database and Lucene to index the
CASes with annotations.
- The server should support multiple corpora in a later version.
- The query method might not be well suited to iterate over all documents
and needs to be replaced.
The WikiNews importer is a data conversion tool which takes the WikiNews dump published by Wikimedia for a defined language as
input, parses it and uploads the produced CASes into the Corpus Server.
Since WikiNews is updated daily, the importer should also be able to load new articles into the Corpus Server, e.g. based on
a time stamp.
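The time-stamp-based incremental load mentioned above could look roughly like this. It is a minimal sketch under assumptions: `DumpArticle` and `IncrementalImport` are hypothetical names, and the sketch assumes each parsed dump entry carries a last-modified time, which the proposal does not specify.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical view of one parsed dump entry. */
final class DumpArticle {
    final String title;
    final Instant lastModified;

    DumpArticle(String title, Instant lastModified) {
        this.title = title;
        this.lastModified = lastModified;
    }
}

class IncrementalImport {
    /** Returns the articles modified strictly after the previous import run. */
    static List<DumpArticle> newSince(List<DumpArticle> dump, Instant lastImport) {
        List<DumpArticle> fresh = new ArrayList<>();
        for (DumpArticle article : dump) {
            if (article.lastModified.isAfter(lastImport)) {
                fresh.add(article);
            }
        }
        return fresh;
    }

    /** The newest time stamp seen, to be remembered for the next run. */
    static Instant newWatermark(List<DumpArticle> dump, Instant lastImport) {
        Instant newest = lastImport;
        for (DumpArticle article : dump) {
            if (article.lastModified.isAfter(newest)) {
                newest = article.lastModified;
            }
        }
        return newest;
    }
}
```

After each run the importer would store the value returned by `newWatermark` and pass it in as `lastImport` the next day.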
The initial version can hopefully be started with code Olivier Grisel developed for pignlproc.
The Tagging Server is responsible for annotating a (piece of) text on user request.
The request can contain already confirmed annotations which need to be taken
into account during the tagging. The server is intended to be used by annotation
tools which need to update the proposed annotations based on the current user input.
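One simple way to take confirmed annotations into account is to keep them untouched and discard any proposed annotation that overlaps one of them. The sketch below shows only this post-filtering idea. It is an assumption about how the requirement could be met, not a description of how OpenNLP models would be constrained during tagging, and the `Span` and `ConfirmedFirstMerge` names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical annotation span over character offsets; end is exclusive. */
final class Span {
    final String type;
    final int begin;
    final int end;

    Span(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }

    boolean overlaps(Span other) {
        return begin < other.end && other.begin < end;
    }
}

class ConfirmedFirstMerge {
    /** Keeps every confirmed span and only those proposals that do not overlap one. */
    static List<Span> merge(List<Span> confirmed, List<Span> proposed) {
        List<Span> result = new ArrayList<>(confirmed);
        for (Span proposal : proposed) {
            boolean clash = false;
            for (Span fixed : confirmed) {
                if (proposal.overlaps(fixed)) {
                    clash = true;
                    break;
                }
            }
            if (!clash) {
                result.add(proposal);
            }
        }
        // Return the spans in text order.
        result.sort(Comparator.comparingInt((Span s) -> s.begin));
        return result;
    }
}
```

For the text "Angela Merkel visited Paris", a confirmed person span over offsets 0 to 13 would suppress a proposed location span over 0 to 6, while a proposed location span over 22 to 27 would be kept.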
An annotation task can be done best if the user has access to a UI which is
optimized for the task. This library should help to build optimized
and task-specific annotation tools.
OpenNLP Online Demo
The online demo can be used to test the various OpenNLP components easily
in a browser. A user should be able to paste a piece of text into an input
form (similar to Google Translate's UI) and then see the processed text with
annotations next to it.
It should use the Tagging Server to annotate the text and the
WikiNews Annotation Model
The headline annotation marks the headline of the WikiNews article and
is extracted from the wiki markup.
The byline annotation marks the byline of the WikiNews article and
is also extracted from the wiki markup.
The paragraph annotation marks the paragraphs of the WikiNews articles and
can be extracted from the wiki markup.
A sentence annotation marks one individual sentence in the text which is terminated
by an end-of-sentence character.
A token annotation marks one individual token in the text.
A named entity annotation type will be created for every entity type that will
be labelled in the corpus.
To get the project started it should be divided into two phases. In the first phase the
focus is on providing basic infrastructure and importing free data. In the second
phase the project should focus on efficiently crowdsourcing annotations.
The first phase should develop the fundamental infrastructure of the project.
That is, a corpus server which can host the text with annotations, an integration
for the Apache UIMA Cas Editor to label data, tooling to import the free WikiNews
corpus, guidelines on how to label the data and an OpenNLP integration to pre-label texts.
The second phase should develop new web-based annotation tooling that enables
the community to easily add, correct and acknowledge annotations from a big group
After the first automatic labeling of the data, we will be able to train models
on it and to distribute them. The quality of the models will improve over time as
the training data is corrected and extended by the community.
– NOTES –
1. How do we make sure copyrighted material doesn't make it into the corpus? If it does it could cause issues down the line... (jk)