JSON Serialization for UIMA

This page collects thoughts toward adding JSON serialization to UIMA.  It covers both the CAS and UIMA metadata (particularly Type Systems).

Why?

Recent trends show people wanting to move things into the "cloud" and hook things up using various mashup techniques.  Many new standards are emerging to support this, such as JAX-RS for enabling REST services.  A popular data format for communication between cloud services is JSON; its popularity is rising while XML's is falling (see http://www.google.com/trends/explore#q=json,xml ).

JSON serialization would ideally support the community of developers wanting to incorporate UIMA pipelines into other cloud-based applications.  

  • should be simple for simple use cases, but gracefully extend to support the full capabilities of CAS data
  • should align with popular or emerging trends in JSON use and library support, especially extensions to support references (think of Feature Structure (FS) references) and namespaces (where needed to avoid name collisions)

Extensions above basic JSON

Full CAS support requires three extensions above basic JSON:

  1. support for references (to enable building linked data structures with shared structure, back pointers, etc.)
  2. support for namespaces (along the lines we use in XMI/XML serialization)
  3. support for type hierarchies - to enable common operations like "iterate over all FSs of type Foo (including all subtypes of Foo)"

Looking around, it seems JSON-LD (a W3C standard) is a good candidate to use as a guide in adding these kinds of things.

Prototype

The design space for this is large, and there are many options which may correspond to useful use cases.  Here are some thoughts.

Keeping the NameSpace stuff simple when possible

It may be the case that for many use cases, the type names are unique without resorting to namespaces.  In this case, it would be good to omit any complexity related to namespaces.  This can be determined dynamically at serialization time, and could also be a configurable behavior.
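One way the "determine dynamically" idea could work is sketched below in Python. This is only an illustration, not a proposed UIMA API: the function name `short_names` and the `org.example` / `com.other` type names are made up, and the input is assumed to be a list of distinct, fully qualified type names.

```python
from collections import Counter

def short_names(type_names):
    """Map each fully qualified type name to the name to serialize.

    Hypothetical helper: assumes type_names holds distinct names.
    """
    # Count how often each short name (the last dotted segment) occurs.
    counts = Counter(name.rsplit(".", 1)[-1] for name in type_names)
    result = {}
    for name in type_names:
        short = name.rsplit(".", 1)[-1]
        # Keep the namespace only where the short name would collide.
        result[name] = short if counts[short] == 1 else name
    return result

types = ["uima.tcas.DocumentAnnotation", "org.example.Token", "com.other.Token"]
names = short_names(types)
```

Here only the two `Token` types would keep their namespaces; a type system with no collisions would serialize with short names throughout.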

Wide variety of options

Configurable

Here are some options, many taken from the current XMI serialization:

    • Delta Cas vs whole CAS
    • Type (and feature) filtering
    • Pretty Printing
    • whether or not to serialize null values, empty arrays, empty Strings
    • Type Augmentation Style
    • FeatureStructure Style
    • including the View information (which FSs are considered "indexed"), or not.
    • including either a Reference (URI) to Type System information, or embedding essential type system information for used types

The Type Augmentation style: this is how the FS type is added to the serialization.  In "normal" JSON serialization of objects coming from a typed object system (like Java or UIMA), an object's type is not normally serialized.  There are several conventions for adding this; popular serialization mechanisms such as Jackson support multiple styles.  
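To make the idea of multiple styles concrete, here is a small Python sketch of three common conventions, loosely modeled on the wrapper-object, property, and wrapper-array styles that Jackson offers. The function names are invented for this page, and the `@type` key is an assumption borrowed from JSON-LD rather than a settled choice.

```python
def as_wrapper_object(type_name, features):
    # The type name is the single key wrapping the features.
    return {type_name: features}

def as_property(type_name, features, key="@type"):
    # The type name is stored as one more property next to the features.
    return {key: type_name, **features}

def as_wrapper_array(type_name, features):
    # A two-element array: type name first, then the features.
    return [type_name, features]

fs = {"sofa": 1, "begin": 0, "end": 5669, "language": "en"}
wrapped = as_wrapper_object("DocumentAnnotation", fs)
tagged = as_property("DocumentAnnotation", fs)
paired = as_wrapper_array("DocumentAnnotation", fs)
```

The property style keeps each FS flat but risks colliding with a feature of the same name, which is one reason to pick a reserved-looking key.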

The FeatureStructure Style allows for different structuring of the collection of FSs.  Some possibilities:

    • as a set of properties, keyed by the "id" of the FS: example (8 is the "id"): {"8" : {"DocumentAnnotation" : {"sofa": 1, "begin": 0, "end": 5669, "language": "en"}}}
    • as a set of feature structure types, keyed by the type, with a value being an array (collection) of all Feature Structures for that Type.

Assuming simple deserialization, the first style supports following FS References as an O(1) operation, while the 2nd supports an O(1) operation to get all FSs of a particular type, and then easy iteration through the array of all FSs of that type; but note that this differs from UIMA iteration over a Type in that UIMA picks up all subtypes of that type as well.

Both of these advantages can be obtained using a custom deserializer, which could create additional indexes to enable O(1) operations.  In an advanced scenario, we could imagine having the deserializer take as input a UIMA index specification, which it could use to build and populate equivalent indices to replicate what UIMA does.  (But this would be a lot of work, repeated for multiple languages: JavaScript, Python, Ruby, etc.)
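As a rough illustration of a deserializer that builds an additional index, the Python sketch below reads the id-keyed style and derives a by-type index from it. The `Token` entry and the function name `index_by_type` are hypothetical; the index covers exact types only, not subtypes.

```python
import json
from collections import defaultdict

# Id-keyed style; the "12"/Token entry is made up for illustration.
text = """
{"8":  {"DocumentAnnotation": {"sofa": 1, "begin": 0, "end": 5669, "language": "en"}},
 "12": {"Token": {"sofa": 1, "begin": 0, "end": 4}}}
"""
by_id = json.loads(text)

def index_by_type(by_id):
    """Build type name -> list of FSs, so lookup by exact type is O(1)."""
    by_type = defaultdict(list)
    for fs_id, wrapped in by_id.items():
        # Each value is assumed to be a single {type name: features} pair.
        (type_name, features), = wrapped.items()
        by_type[type_name].append({"id": fs_id, **features})
    return dict(by_type)

by_type = index_by_type(by_id)
```

After one pass over the data, both following a reference (`by_id["8"]`) and listing all FSs of a type (`by_type["Token"]`) are dictionary lookups.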

TypeSystem info as part of a serialization

To implement many kinds of processing, the deserializer of the JSON data would need information from the type system.  For instance, a feature with a value of 123 might be a number, or it could be a reference to another FS.  Likewise, to iterate over all "Annotations", one would need to know which types are subtypes of Annotation.
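Both needs can be sketched in a few lines of Python. Everything here is hypothetical: the `supertype_of` and `fs_ref_features` tables stand in for whatever type system information ends up being available, and the `sentence` feature is invented for the example.

```python
# Hypothetical type system information; names are illustrative only.
supertype_of = {"Annotation": None, "Token": "Annotation",
                "Sentence": "Annotation", "Sofa": None}
fs_ref_features = {"Token": {"sentence"}}

def subtypes_of(root, supertype_of):
    """Return root plus every type whose supertype chain reaches root."""
    found = set()
    for type_name in supertype_of:
        current = type_name
        while current is not None:
            if current == root:
                found.add(type_name)
                break
            current = supertype_of.get(current)
    return found

def resolve(type_name, feature, value, fs_ref_features, by_id):
    """Follow value as an FS reference only if the feature is declared as one."""
    if feature in fs_ref_features.get(type_name, set()):
        return by_id[str(value)]
    return value

by_id = {"123": {"Sentence": {"begin": 0, "end": 40}}}
annotation_types = subtypes_of("Annotation", supertype_of)
as_ref = resolve("Token", "sentence", 123, fs_ref_features, by_id)
as_number = resolve("Token", "begin", 123, fs_ref_features, by_id)
```

The same value 123 yields an FS in one case and stays a plain number in the other, which is exactly the ambiguity the type system information removes.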

In one future scenario, we could imagine having type systems live in the cloud.  If they did, then one could serialize a URI reference to the type system as part of the serialization.

In another scenario, these might not be available.  In that case, we would need to serialize some parts of the type system, such as: which features were FS References, and the type/subtype hierarchy.  This would only need be done for the types that were included in the serialization.
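A minimal Python sketch of picking out the "essential" parts follows. It assumes, beyond what is stated above, that the ancestors of each used type should be included too so the subtype chain stays intact; the output keys `supertypes` and `fsRefs` and all type and feature names are made up.

```python
def essential_type_info(used_types, supertype_of, fs_ref_features):
    """Collect hierarchy and FS-reference info for the used types only."""
    needed = set()
    for type_name in used_types:
        # Walk up the supertype chain, stopping at already-visited types.
        while type_name is not None and type_name not in needed:
            needed.add(type_name)
            type_name = supertype_of.get(type_name)
    ordered = sorted(needed)
    return {
        "supertypes": {t: supertype_of.get(t) for t in ordered},
        "fsRefs": {t: sorted(fs_ref_features[t])
                   for t in ordered if t in fs_ref_features},
    }

# Hypothetical type system; only Token appears in the serialized data.
supertype_of = {"Annotation": None, "Token": "Annotation",
                "Sentence": "Annotation", "Entity": "Annotation"}
fs_ref_features = {"Token": {"lemma"}, "Entity": {"head"}}
info = essential_type_info({"Token"}, supertype_of, fs_ref_features)
```

In this example `Sentence` and `Entity` are left out entirely because no FS of those types was serialized.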

Non-Configurable

Some design choices are probably better built into the design and not made configurable (following Maven's "convention over configuration" approach), to keep things simpler.  An example: for the "id" used in the reference, we could do what the XMI serialization does (it uses the "address" of the item in the CAS Heap); or we could use a "dense" number, starting at 1 and incrementing by 1.
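The "dense" alternative could be as simple as the following Python sketch, which renumbers addresses in the order they are first seen. The function name and the sample addresses are illustrative.

```python
def dense_ids(addresses):
    """Assign 1, 2, 3, ... to addresses in order of first appearance."""
    mapping = {}
    for address in addresses:
        if address not in mapping:
            mapping[address] = len(mapping) + 1
    return mapping

# Made-up heap addresses; 8 is visited twice, as a shared FS would be.
ids = dense_ids([8, 45, 8, 120])
```

Dense ids are shorter and do not expose heap layout, at the cost of keeping a mapping table during serialization.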

Other choices are the names of the parts of the serialized CAS, for instance a part describing the metadata ("@context", following the naming conventions in JSON-LD), the part holding the feature structures ("_featureStructures" or something like that), etc.  Putting a "_" in front of the name makes it less likely to collide with other names going forward, I think.
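Putting those part names together, the overall shape of a serialized CAS might look like the output of this Python sketch. The contents of "@context" are a placeholder (the `supertype` entry is invented); only the two top-level names come from the discussion above.

```python
import json

def build_document(type_info, feature_structures):
    """Assemble the top-level parts under their fixed, non-configurable names."""
    return {"@context": type_info, "_featureStructures": feature_structures}

doc = build_document(
    # Placeholder metadata; the real content of "@context" is undecided.
    {"DocumentAnnotation": {"supertype": "Annotation"}},
    {"8": {"DocumentAnnotation": {"sofa": 1, "begin": 0, "end": 5669, "language": "en"}}},
)
text = json.dumps(doc, indent=2)
```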

API

The most popular JSON Java utilities seem to be the Jackson ones.  For serializing, that API (for the streaming case) adopts an API style in 3 steps:

    • Use a static method to get an instance of a serializer
    • Call configuration methods on the serializer instance
    • Call the serializer instance to do the serialization

Since the Jackson streaming API has years of investment and is designed for high efficiency and reuse by other middleware, I suggest we incorporate it.

It would be good to expose the same generator configuration capabilities, I think, for simplicity and familiarity of others who already know this interface.

To serialize a CAS, one would create and configure a Jackson generator, and then pass it to an instance of the (existing, shared) XmiCasSerializer, via a new serialize method call.

  • A set of static methods on XmiCasSerializer could hide this detail.

Configuring a CAS Serializer

UIMA CAS serialization has several configurable choices:

    • Where the output goes
    • Whether or not to pretty-print
    • Delta or not (Delta is default if mark is set in the CAS)
    • Filter what's serialized by whether or not the types / features are contained in a typeSystem
    • Where to report errors (for example, while serializing a List of things, encountering a malformed list structure in the CAS)
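To show how these choices might hang together, here is a Python sketch of a configuration object and a toy serializer. It is not the UIMA or Jackson API: `SerializerConfig`, `serialize`, and the field names are hypothetical, delta serialization is left out, and type filtering is reduced to a plain set of type names.

```python
import io
import json
from dataclasses import dataclass
from typing import Callable, Optional, Set

@dataclass
class SerializerConfig:
    pretty_print: bool = False
    filter_types: Optional[Set[str]] = None    # None keeps every type
    on_error: Callable[[str], None] = print    # where to report errors

def serialize(fs_by_id, out, config):
    """Write the id-keyed FSs that pass the type filter to out as JSON."""
    kept = {}
    for fs_id, wrapped in fs_by_id.items():
        if len(wrapped) != 1:
            # Report the malformed entry and carry on with the rest.
            config.on_error(f"FS {fs_id}: expected exactly one type key")
            continue
        type_name = next(iter(wrapped))
        if config.filter_types is None or type_name in config.filter_types:
            kept[fs_id] = wrapped
    json.dump(kept, out, indent=2 if config.pretty_print else None)

errors = []
out = io.StringIO()   # "where the output goes" is any writable text stream
config = SerializerConfig(filter_types={"Token"}, on_error=errors.append)
data = {"8": {"DocumentAnnotation": {"begin": 0}},
        "9": {"Token": {"begin": 0}},
        "10": {}}
serialize(data, out, config)
```

Passing the output stream and the error callback in from outside keeps the serializer itself free of decisions about files, logging, or exceptions.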

There are two kinds of objects to serialize:  CASs and UIMA Component Descriptions.
