A place to collect ideas for the next version of the UIMA Java core.

Framework interoperability

There are many big-data frameworks now. UIMA has a particular slant on things to encourage component development and reuse (I'm thinking of externalization of the Type System, merging of type systems). UIMA also has its scale-out approach, and the RUTA workbench facility. This topic is where we can think about using UIMA components in other frameworks (e.g. Apache Spark), or vice versa.

Interoperability could be facilitated by more standards around REST service packaging.

Complete JSON deserialization with an eye toward being "permissive" to receive data models from other frameworks?

Big changes

More use of Java compiler (ecj) and decompiling

A portable Java compiler from Eclipse (ecj) and decompiling capabilities (e.g. Procyon) are appropriately licensed and could be bundled and run as part of framework startup.

JCasGen could be "automatic" for merged type systems, and merged instances of JCasGen'd user classes?
- Users still would need a generated version for their code to compile against.
PEAR definitions for JCas cover classes could be merged?
Could generate one kind of Java cover class for all types (lazily, loaded on demand).
- eliminate / reduce use of TypeImpl in runtime.
- generate for all merged types (except custom built ins)
  - (as opposed to current impl, where no JCas cover class is generated if it doesn't exist - the "standard" one is used instead)
Use class loader technology to support multiple type systems.
- Having same-named types, sharing the JCas cover types for those, but (after merging) having different sets of features.
- This would only be used for UIMA (merged) Types that have same name but have different feature sets.
- Current design uses the same JCas cover class for differing type systems (e.g., ones that have a different # of features for a type). In this case, the JCas cover type only is being used to set/read slots it knows about; other facilities might be used to read/set additional slots.

Feature Structure == an instance of its Java Cover class

Only one representation of a FS; the static fields of the class hold the TypeImpl info.

Features represented directly as fields.

To get around "reflection" slowness:
- Support set/get by int <- class <- feature-name-string
- Support set/get (bulk) ? <ordering among fields significant?>
- possibly use something like ReflectASM which is like Java reflection but has a byte-code generator and is much faster (but probably not as fast as custom support code compiled into the Java Cover class).
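The "set/get by int" idea above can be sketched roughly as follows. This is only an illustration under assumptions: `TokenFs`, its feature codes, and the `featureCode` lookup are hypothetical names, not existing UIMA classes. The feature name is resolved to an int once, and later accesses go through a generated `switch` over plain fields instead of Java reflection.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical generated cover class for a "Token" type; not an actual UIMA class.
final class TokenFs {
    // Feature codes, resolved once per class from the feature name string.
    static final int BEGIN = 0;
    static final int END = 1;

    private static final Map<String, Integer> FEATURE_CODES = new HashMap<>();
    static {
        FEATURE_CODES.put("begin", BEGIN);
        FEATURE_CODES.put("end", END);
    }

    // Features represented directly as fields.
    private int begin;
    private int end;

    // int <- class <- feature-name-string: done once, outside any hot loop.
    static int featureCode(String featureName) {
        Integer code = FEATURE_CODES.get(featureName);
        if (code == null) {
            throw new IllegalArgumentException("Unknown feature: " + featureName);
        }
        return code;
    }

    // Generic getter by feature code; a switch over fields, no reflection.
    int getInt(int featureCode) {
        switch (featureCode) {
            case BEGIN: return begin;
            case END: return end;
            default: throw new IllegalArgumentException("Bad feature code: " + featureCode);
        }
    }

    // Generic setter by feature code.
    void setInt(int featureCode, int value) {
        switch (featureCode) {
            case BEGIN: begin = value; break;
            case END: end = value; break;
            default: throw new IllegalArgumentException("Bad feature code: " + featureCode);
        }
    }
}
```

A caller would look up `TokenFs.featureCode("end")` once and then reuse the int for every `getInt` / `setInt` call.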

User customization of Java cover classes, and PEAR classpath isolation issues

Currently users may customize their JCas cover classes. PEAR classpath isolation allows the use case where different customizations are present in one pipeline. The current implementation supports this, and switches the set of JCas cover classes as PEAR boundaries are crossed. The idea of a Feature Structure being an instance of its cover class breaks down when multiple definitions of that class exist. Ideas for fixing this are still needed.

Consider ideas from other popular big-data frameworks: Hadoop, Spark

These typically have approaches to type systems that use user-defined Java types, and allow any kind of Java object in the fields. There are newer kinds of serialization / deserialization that work for all kinds of Java objects, but are more efficient than Java reflection-based approaches (e.g. Kryo, which Spark can use).

Add support for Collections and Maps

Users have wanted these kinds of objects; some implementations I've seen have tried to implement Sets using a combination of HashSet and UIMA FSLists, duplicating the data and keeping things in sync, which was very inefficient. More on this topic here.

More concurrency

Support parallel running of pipeline components.

This needs a careful trade-off against slowdowns caused by synchronization and cache-line interference. The key is to separate the things being updated.
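One way to picture "separating the things being updated" is sketched below. It is a minimal illustration, not a proposed UIMA API: `Component` and `runAll` are hypothetical names. Each component reads the shared, immutable text and writes only to its own result list, so no locking is needed while the components run; the results are gathered afterwards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelComponents {
    // Hypothetical component: reads the shared text, returns its own private results.
    interface Component {
        List<String> process(String text);
    }

    // Runs all components in parallel; each one updates only its own output.
    static List<List<String>> runAll(String text, List<Component> components) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, components.size()));
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (Component c : components) {
                futures.add(pool.submit(() -> c.process(text)));
            }
            // Collect results in submission order once each task has finished.
            List<List<String>> results = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while running components", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A component failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Merging the per-component results back into shared indexes is the step where synchronization cost would reappear, which is presumably where special index support would matter.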

Consider special index support for this

Supporting Java 8 streams

Iterating over FSs: alternative: have generator of FSs, process with stream APIs
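As a rough sketch of the "generator of FSs, processed with stream APIs" idea: any iterator over feature structures can be wrapped as a `java.util.stream.Stream` using the standard `StreamSupport` and `Spliterators` helpers. The `Span` class and the `FsStreams` helper below are hypothetical stand-ins, not UIMA classes.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

final class FsStreams {
    // Hypothetical stand-in for a feature structure with begin/end offsets.
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    // Wraps an iterator (e.g. one obtained from an index) as a sequential stream.
    static <T> Stream<T> stream(Iterator<T> iterator) {
        return StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(iterator, Spliterator.ORDERED),
                false);
    }

    // Example pipeline: lengths of all spans longer than minLength, in iteration order.
    static List<Integer> lengthsLongerThan(Iterator<Span> index, int minLength) {
        return stream(index)
                .map(s -> s.end - s.begin)
                .filter(len -> len > minLength)
                .collect(Collectors.toList());
    }
}
```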

Possibly have a new kind of managed component, being either:
- The "functions" the standard operations on streams use
- new standard operations on streams (unlikely I think)
- I think this might be deferred until we have some more experience

(Unlikely) Making the element of the "stream" be a new CAS - replacement for CAS Multipliers. Seems like the wrong granularity... Maybe best to let Java evolve this for a few more releases.

Other changes

Integrate key ideas from uimaFIT

These include:

Alternative, Java-centric way of specifying a type system - the user writes a Java class with annotations.
Alternative, Java-centric way of specifying configuration information
Convenience methods (e.g. selecting groups of feature structures using SQL-like specifications)
Others?
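To make the "Java class with annotations" idea for type systems concrete, here is a minimal sketch. The annotations `@UimaType` and `@UimaFeature` are invented for this illustration; they are not uimaFIT or UIMA annotations. A type description (name plus features and their ranges) is derived from the class by reflection.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.Map;
import java.util.TreeMap;

final class AnnotatedTypes {
    // Hypothetical annotation naming the UIMA type a class describes.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface UimaType {
        String name();
    }

    // Hypothetical annotation marking a field as a feature of the type.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface UimaFeature {
    }

    // A user-written type definition.
    @UimaType(name = "example.Token")
    static final class Token {
        @UimaFeature int begin;
        @UimaFeature int end;
        String scratch; // not annotated, so not part of the type system
    }

    // Reads the declared type name from the class.
    static String typeName(Class<?> c) {
        UimaType t = c.getAnnotation(UimaType.class);
        if (t == null) {
            throw new IllegalArgumentException(c.getName() + " is not annotated with @UimaType");
        }
        return t.name();
    }

    // Maps feature name to range type name, sorted by feature name.
    static Map<String, String> features(Class<?> c) {
        Map<String, String> result = new TreeMap<>();
        for (Field f : c.getDeclaredFields()) {
            if (f.isAnnotationPresent(UimaFeature.class)) {
                result.put(f.getName(), f.getType().getSimpleName());
            }
        }
        return result;
    }
}
```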

Better support for "run-time" dynamic typing

Moving towards "dynamic" typing - see paper: http://aclweb.org/anthology/W14-5209

Supporting "combining specifications" that map type systems

Different components should be easily combinable even if they have different type systems, if a mapping can be found and specified. For more complex mappings, custom adapters could be supported?
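A simple "combining specification" could be little more than a table of type and feature renames. The sketch below assumes that simple case only; `TypeMapping` and its methods are hypothetical, and feature structures are modeled as plain maps from feature name to value. More complex mappings would need the custom adapters mentioned above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical rename-only mapping between two type systems.
final class TypeMapping {
    private final Map<String, String> typeNames = new HashMap<>();
    // Key is "sourceType:sourceFeature", value is the target feature name.
    private final Map<String, String> featureNames = new HashMap<>();

    TypeMapping mapType(String sourceType, String targetType) {
        typeNames.put(sourceType, targetType);
        return this;
    }

    TypeMapping mapFeature(String sourceType, String sourceFeature, String targetFeature) {
        featureNames.put(sourceType + ":" + sourceFeature, targetFeature);
        return this;
    }

    // Unmapped types keep their name.
    String targetType(String sourceType) {
        return typeNames.getOrDefault(sourceType, sourceType);
    }

    // Renames mapped features; unmapped features are copied unchanged.
    Map<String, Object> convert(String sourceType, Map<String, Object> features) {
        Map<String, Object> out = new TreeMap<>();
        for (Map.Entry<String, Object> e : features.entrySet()) {
            String key = sourceType + ":" + e.getKey();
            out.put(featureNames.getOrDefault(key, e.getKey()), e.getValue());
        }
        return out;
    }
}
```

For example, a mapping built with `mapType("a.Tok", "b.Token")` and `mapFeature("a.Tok", "pos", "partOfSpeech")` would let a component written against `b.Token` consume data produced as `a.Tok`.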

Using the Web to facilitate component combinations

A user wanting to combine X with Y should be able to look up an adapter on the Web and download it, or at least find 90% of the work already done. It should be easy for users to share this information on the Web.

Judicious substitution of other packages for hand-built code

There's a plus and a minus for this. Plus: we get better-tested code, (perhaps) better function, and better performance for some typical capabilities (e.g., parsing XML to/from Java objects). Minus: it makes the code depend on these other packages. Also, if it's working fine now, there's little motivation to invest in changing it.

Some areas to consider:

XML parsing and writing for descriptors - use JAXB or Jackson (already used for JSON support)
