On this page we would like to suggest and discuss components and tooling for the UIMA sandbox.
The sandbox was designed to host UIMA analysis components such as annotators, parsers, or consumers, as well as UIMA tooling. The provided components are free to use, and everyone is invited to suggest new components or to work on existing ones.
Suggested Analysis Components
Parser
- document text parser: a parser component that extracts the plain text from a PDF or HTML document, using open source libraries such as PDFBox or NekoHTML (a minimal extraction sketch follows).
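As a rough sketch of the PDF half (assuming Apache PDFBox 2.x on the classpath; the class name is a placeholder), the extraction itself is only a few lines, which the sandbox component would wrap in a UIMA collection reader or annotator:

```java
// Minimal sketch: extracting plain text from a PDF with Apache PDFBox 2.x.
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfTextExtractor {
    public static String extract(File pdfFile) throws IOException {
        try (PDDocument document = PDDocument.load(pdfFile)) {
            // PDFTextStripper walks the page content streams and
            // returns the document's plain text.
            return new PDFTextStripper().getText(document);
        }
    }
}
```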
Annotators
- simple whitespace tokenizer: a simple tokenizer that extracts tokens from plain text documents in whitespace-separated languages (see the sketch after this list).
- language detection annotator: an annotator that detects the language of a document, for example using simple language-specific word lists.
- word list annotator: an annotator that uses a word list to create annotations of a specified type. The word list can be provided either as XML input or in a compiled format.
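As a sketch of the tokenizer idea (the word list annotator would follow the same skeleton), a JCas annotator can do this in a few lines. For simplicity it indexes plain Annotation instances; a real component would define a Token type in its type system descriptor and use the JCasGen-generated class instead:

```java
// Minimal sketch of a whitespace tokenizer as a UIMA JCas annotator.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;

public class WhitespaceTokenizer extends JCasAnnotator_ImplBase {
    // One token per maximal run of non-whitespace characters.
    private static final Pattern TOKEN = Pattern.compile("\\S+");

    @Override
    public void process(JCas jcas) {
        String text = jcas.getDocumentText();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            new Annotation(jcas, m.start(), m.end()).addToIndexes();
        }
    }
}
```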
Consumer
- casToXML: a UIMA CAS consumer that writes the analyzed documents in a configurable XML representation to the file system. The types that should be serialized can be specified in the settings of the CAS consumer.
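A rough skeleton for such a consumer, here using UIMA's built-in XMI serialization rather than a custom configurable XML format, and with placeholder output file naming:

```java
// Minimal sketch of a CAS consumer that serializes each CAS to XMI.
// A configurable XML representation with a type filter would go
// beyond this skeleton.
import java.io.FileOutputStream;
import java.io.OutputStream;

import org.apache.uima.cas.CAS;
import org.apache.uima.cas.impl.XmiCasSerializer;
import org.apache.uima.collection.CasConsumer_ImplBase;
import org.apache.uima.resource.ResourceProcessException;

public class XmlCasWriter extends CasConsumer_ImplBase {
    private int docCount = 0;

    @Override
    public void processCas(CAS cas) throws ResourceProcessException {
        try (OutputStream out =
                new FileOutputStream("doc-" + (docCount++) + ".xmi")) {
            // Writes the full CAS, including all indexed feature
            // structures, as XMI.
            XmiCasSerializer.serialize(cas, out);
        } catch (Exception e) {
            throw new ResourceProcessException(e);
        }
    }
}
```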
Tooling
Document Model
Create a standard model to represent document structure and properties in the CAS, to replace the current default of plain text.
Structure
This should be functionally equivalent to XML. For example, the following type system might do it.
ElementAnnotations can form a hierarchy of annotations representing structure:
| Feature Name | Super Type | Element Type |
|---|---|---|
| ElementAnnotation | Annotation | |
| attributes | FSArray | AttributeFS |
| children | FSArray | ElementAnnotation |
| name | String | |
| parent | ElementAnnotation | |
| qualifiedName | String | |
| uri | String | |
An ElementAnnotation can have many attributes, which are just name-value pairs:
| Feature Name | Super Type | Element Type |
|---|---|---|
| AttributeFS | TOP | |
| localName | String | |
| qualifiedName | String | |
| type | String | CDATA, ID, IDREF, IDREFS, NMTOKEN, NMTOKENS, ENTITY, ENTITIES, NOTATION |
| uri | String | |
| value | String | |
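As a purely hypothetical illustration (assuming ElementAnnotation and AttributeFS are the JCasGen-generated classes for the proposed types; none of this code exists yet), an HTML parser might record a table cell like this:

```java
// Hypothetical sketch: recording an HTML <td> element, assuming
// ElementAnnotation and AttributeFS are JCasGen-generated classes
// for the proposed types above.
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.cas.FSArray;

public class ElementExample {
    static void recordTableCell(JCas jcas, int begin, int end) {
        AttributeFS align = new AttributeFS(jcas);  // hypothetical class
        align.setLocalName("align");
        align.setType("CDATA");
        align.setValue("left");

        ElementAnnotation td = new ElementAnnotation(jcas, begin, end);
        td.setName("TD");
        FSArray attrs = new FSArray(jcas, 1);
        attrs.set(0, align);
        td.setAttributes(attrs);
        td.addToIndexes();
    }
}
```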
Parsers (filters) for various file types (Word, PDF, plain text, HTML, XML) would extract the plain text and set it into the CAS, and would also create ElementAnnotations, either by converting explicit markup (HTML, XML) or by discovering structure in the other formats.
If this became standard and annotators could depend on it, better extraction quality would result. For example, HTML is usually converted to plain text in which the boundaries between table cells are lost. If instead the table structure were represented using ElementAnnotations, then an annotator might decide that the boundaries of ElementAnnotations named "TD" (i.e., HTML table cells) are actually paragraph terminators.
For example, in this table:
| Maker | Model | Year |
|---|---|---|
| Honda | Accord | 2007 |
| Toyota | Camry | 2006 |
If converted to plain text, you might get the line:
Honda Accord 2007
Entity extraction might parse this as just a VEHICLE. But with cells as boundaries, it would be seen as three separate entities, COMPANY, VEHICLE, and YEAR, which is more correct, since they weren't originally part of a single sentence.
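A hypothetical sketch of the cell-boundary idea, again assuming the JCasGen-generated ElementAnnotation class proposed above: collect the end offsets of all "TD" elements so that a downstream sentence or paragraph annotator can treat them as hard boundaries.

```java
// Hypothetical sketch: gather the end offsets of "TD" elements as
// paragraph terminators, assuming the proposed ElementAnnotation class.
import java.util.ArrayList;
import java.util.List;

import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;

public class CellBoundaryExample {
    static List<Integer> cellBoundaries(JCas jcas) {
        List<Integer> boundaries = new ArrayList<>();
        for (Annotation a : jcas.getAnnotationIndex(ElementAnnotation.type)) {
            ElementAnnotation elem = (ElementAnnotation) a;
            if ("TD".equalsIgnoreCase(elem.getName())) {
                boundaries.add(elem.getEnd());
            }
        }
        return boundaries;
    }
}
```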
Properties
There should be a standard representation of document properties. By properties, I'm talking about the things you get in MS Word when you select the File->Properties dialog, on the Summary and Custom tabs.
The representation should be based on the Dublin Core Metadata Initiative. For example:
| Feature Name | Super Type | Element Type |
|---|---|---|
| PropertyFS | TOP | |
| name | String | |
| scheme | String | Boolean, Box, DCMIType, Double, IMT, Integer, ISO3166, ISO639-2, Long, Period, RFC2278, RFC3066, String, URI, W3CDTF |
| value | String | |
The scheme tells you how to interpret the value.
Names would be DCMI names, such as charset, creator, date modified, format (IMT MIME type), identifier, language, title, and so on. A minimal set would be required.
The document parser would create these FS's from the document content and/or the document container (file system, content management system, HTTP headers, HTML META elements, etc.).
This would replace the current class SourceDocumentInformation.
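As a hypothetical sketch (assuming PropertyFS is the JCasGen-generated class for the proposed type; the property values shown are placeholders), a document parser might record properties like this:

```java
// Hypothetical sketch: recording DCMI-style document properties,
// assuming the proposed PropertyFS class. Values are placeholders.
import org.apache.uima.jcas.JCas;

public class PropertyExample {
    static void recordProperties(JCas jcas) {
        addProperty(jcas, "format", "IMT", "text/html");
        addProperty(jcas, "language", "RFC3066", "en-US");
        addProperty(jcas, "modified", "W3CDTF", "2007-06-01T12:00:00Z");
    }

    static void addProperty(JCas jcas, String name, String scheme,
            String value) {
        PropertyFS prop = new PropertyFS(jcas);  // hypothetical class
        prop.setName(name);
        prop.setScheme(scheme);
        prop.setValue(value);
        prop.addToIndexes();
    }
}
```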
Comments
Andreas Baumann
http://textcat.sourceforge.net/ for language detection annotator (as a first step)?
Thilo Goetz
Good idea, we could provide a simple wrapper. Note, though, that we can't distribute it because it's LGPL-licensed: see here for the ASF's position.
Have you tried it? How's the quality?
Andreas Baumann
Tried the C version some time ago, with quite good results. I just opted for n-gramming instead of short words; the language profile should be more accurate. Didn't know about the license, so yes, jtextcat is not an option.
Adam Holmberg
For a whitespace tokenizer, we might try using the BreakIterator class in ICU. It can find words, lines, and sentences according to Unicode rules.
ICU also has a charset detector and an encoding converter, which would help in parsing and converting plain text files into a Java String (UTF-16).
ICU also has a regular expression class that fully conforms to Unicode.
http://icu-project.org
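For illustration, a minimal word-segmentation loop with ICU4J's BreakIterator might look like the following (the JDK's java.text.BreakIterator offers nearly the same API):

```java
// Small illustration of word segmentation with ICU4J's BreakIterator.
import java.util.Locale;

import com.ibm.icu.text.BreakIterator;

public class BreakIteratorDemo {
    public static void main(String[] args) {
        String text = "UIMA analyzes unstructured information.";
        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        words.setText(text);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            String token = text.substring(start, end);
            if (!token.trim().isEmpty()) {  // skip whitespace-only segments
                System.out.println(token);
            }
        }
    }
}
```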