Approaches to generating and loading JCas classes for V3

There are two parts to JCas generation.  

  • Generation of getters/setters and constructors and superclass specifications
  • Generation of the storage for the data included in the type.

Generation of the first part (getters/setters, constructors, superclass specifications) is particular to a merged type system; generation of the second part (storage) potentially is as well.

A single JVM may have multiple merged type systems:

  • running simultaneously, or sequentially, one after another in a single pipeline (with varying type systems, e.g. processing a sequence of deserialized CASs)
  • these can be handled using separate class loaders per different merged type system, or via forming a Union (if possible) among the type systems and using that (but note it may not be possible, due to incompatible range types, and may be inefficient - storage is needed for the merged feature set which may be large).  
  • The class loader used to load the generated JCas classes must also be used for any Application, Annotator, or External resource code that references the generated JCas classes by name (as opposed to referencing them "generically" via Type and Feature objects), in order to be able to "see" the instances.

Alternatives for storage generation of a type

One alternative is to generate particular fields for each feature.  The getters and setters then are simple references to those features.

Another alternative is to have an indirection via feature offsets into two different kinds of data: data the garbage collector follows (i.e., references) and primitive (non-reference) data.

  • This alternative allows a close emulation of the capabilities and limitations of v2 JCas
    • it is possible to not have all features described in the JCas, but they can be present (for serialization/deserialization and for access via type system indirection).
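
The two storage alternatives can be sketched roughly as follows. This is only an illustration: the class names, the offset constants, and the choice of an int array plus an Object array are assumptions made for the example, not the actual V3 layout.

```java
// Alternative 1: one Java field per feature; accessors touch the field directly.
class TokenWithFields {
    private int _begin;     // primitive-valued feature
    private String _lemma;  // reference-valued feature

    int getBegin() { return _begin; }
    void setBegin(int v) { _begin = v; }
    String getLemma() { return _lemma; }
    void setLemma(String v) { _lemma = v; }
}

// Alternative 2: features live at offsets in two arrays, one holding
// primitive data and one holding references the garbage collector follows.
class OffsetBackedFs {
    private final int[] intData;
    private final Object[] refData;

    OffsetBackedFs(int intSlots, int refSlots) {
        intData = new int[intSlots];
        refData = new Object[refSlots];
    }

    // Generic access by offset; also works for features the JCas class does not name.
    int getIntValue(int offset) { return intData[offset]; }
    void setIntValue(int offset, int v) { intData[offset] = v; }
    Object getRefValue(int offset) { return refData[offset]; }
    void setRefValue(int offset, Object v) { refData[offset] = v; }
}

class TokenWithOffsets extends OffsetBackedFs {
    // Hypothetical offsets; a real design would assign them at type system commit.
    static final int BEGIN = 0;  // slot in the int array
    static final int END = 1;    // in the type system, but with no named accessor here
    static final int LEMMA = 0;  // slot in the reference array

    TokenWithOffsets() { super(2, 1); }

    int getBegin() { return getIntValue(BEGIN); }
    void setBegin(int v) { setIntValue(BEGIN, v); }
    String getLemma() { return (String) getRefValue(LEMMA); }
    void setLemma(String v) { setRefValue(LEMMA, v); }
}
```

In the second form, the END slot shows how a feature can be present in storage (and so be serialized or reached generically) without appearing in the JCas class.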

When to Generate

  • Dynamic - at type system "commit"
    • Pros:
      • guaranteed exact match, no need for code to check this
      • can be done lazily - a large type system with most types unused need not generate most types
        • requires a custom class loader (I think)
    • Cons:
      • user code must commit the type system before any reference to a named JCas type.  This "error" can be checked for, if a special class loader is used.
      • Affects the class loading design, and for some use cases requires changes to user code
      • Supporting Customization via merging is more complex.
  • preexisting - by the user externally, ahead of time, running JCasGen on a fully merged type system
    • Pros:
      • Maybe less complex
      • Classes can be customized - custom fields and methods can be added (possible with the V2 design)
    • Cons:
      • may be out of sync, need to have more runtime checks
  • preexisting at JVM startup time (via -javaagent).
    • Cons:
      • -javaagent is a dependency outside of main Java
      • the merged type system computed might be different - users can write UIMA app code to do custom merging.
      • uses one classpath; in embedded apps there may be multiple classpaths. 
      • is not "lazy" - does whole type system generation
      • the user must take extra care when starting a UIMA-based application. Analytics application vendors cannot easily create standalone executable JARs or, e.g., Groovy/Jython scripts

Some Q&A

  • What happens when a type system changes?  Some ways this might occur:
    • CAS reinitialized due to restoring a persisted (binary/serialized) CAS.  
      • Two sub-cases:
        • The app/externalresource/annotator code has references by name to the generated JCas classes (e.g. they code "myFooInstance.getMyFeature()").
          • If this code was executed, the lazy loading and linking of Java would have linked the (e.g.) annotator reference to the class.  Java doesn't support "unlinking" in general, so the only way I see to handle this case is to create a new UIMATypeSystemClassLoader for the new type system (assuming it was different of course), and reload/relink new versions of the classes to connect with the new definitions of the JCas classes.
        • If there are no ref-by-name, then maybe (to be determined) we can redefine the class.  The APIs are there, it just depends if this is allowed.  This would be a case where the type system was thought to be "variable", so the code could actually not really have by-name references (other than, say, to built-in types, whose definition is imagined to be shared in any case). 
    • New types defined at runtime
      • either externally by an application (already possible), or internally by an AE while a pipeline is running (a potential new feature being discussed)
      • These would, I think, always be referenced via some kind of indirection, not by a Java-linked name.  For the moment, I was thinking that since the new JCas would support Java HashMaps, that this requirement could be met by implementing a map from a user-specified key (e.g., a string) to a value; these could be constrained using Java generics.  This would be an initial way to support this requirement; we could later investigate more native ways.
  • is JCas the only option or is a basic CAS API still available?
    • The intent is to allow existing users of UIMA to run their pipelines in V3, as much as may be feasible.  So, yes, I think we should continue to support the following CAS and JCas api varieties:
      • JCas - creation via new operator, getters/setters for each field, array getters/setters for array-valued fields
      • So-called low-level JCas (makes use of the _Type class)
      • CAS Apis - use Feature objects to specify the feature, and have variety for the various range types.  (Probably will add "array" styles, which are currently missing I think).  This API doesn't use by-name java-linked references.  As of Java 8, there is a significant improvement to the compiler + JVM that makes this style just as efficient as the by-name linked style.
      • Low-level CAS Apis
  • To what extent can users in the new design customize JCas classes?
  • One's expectations towards a more native JCas is the ability to store arbitrary objects in the CAS, potentially in customized JCas classes. Is this part of the new design? If yes, how to preserve such objects when a classloader changes and JCas is regenerated?
    • The main idea is that on regeneration, there's a "merge" step from any customizations that might be present in the classpath.
    • Another goal is that because the new V3 design has a new kind of range type for Features, called "JavaObject", which I hope will turn out to be able to hold arbitrary Java Objects (although we might limit this initially), this will allow storing arbitrary Java objects in the CAS as the values of particular features, without any customization.  

Setup for Type System class loader

We can imagine a new version of UIMAClassLoader, called UIMATypeSystemLoader, which has no URL and only serves to lazily generate and load JCas types.  This loader would have a reference to a committed type system, and would need to be in the parent chain of anything that referenced the generated types by name.  This class loader would be associated with a TypeSystem instance.

This might be set up using UIMAFramework.withUIMATypeSystemContext("a top level class to load and run"); this would load that named class under an instance of UIMATypeSystemLoader.

It would be nice to also run if no such context was set up - in this case, only one type system might be supported (and exceptions thrown if it was redefined).  If no UIMATypeSystemLoader was in the parent path, then type system commit would need to batch generate and load (using that arbitrary class loader) all the types (lazy not supported).  

If there was an instance of UIMATypeSystemLoader in the parent chain, then this would support lazy loading.

The UIMATypeSystemLoader has a ref to the committed type system, which is null before a commit.  Type system commit walks up the class loader chain looking for an instance of UIMATypeSystemLoader, and the first one found has its ref changed from null to the committed type system. If the ref was not null, the committed type system is compared with the previous value: if equal, OK, and things are left as is.  If not, an exception is thrown: can't change the merged type system using the same UIMATypeSystemLoader.
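
The commit-time walk described above might look roughly like the following. This is a minimal sketch, not a proposed implementation: TypeSystemLoaderSketch, CommitSketch, and the use of a plain Object (compared with equals) in place of a TypeSystem instance are all assumptions made for illustration.

```java
// Hypothetical stand-in for UIMATypeSystemLoader: no URLs, just a parent
// and a reference to the committed type system (null before commit).
class TypeSystemLoaderSketch extends ClassLoader {
    private Object committedTypeSystem;

    TypeSystemLoaderSketch(ClassLoader parent) { super(parent); }

    // First commit binds the type system; a later commit must be equal to it.
    synchronized void bind(Object typeSystem) {
        if (committedTypeSystem == null) {
            committedTypeSystem = typeSystem;
        } else if (!committedTypeSystem.equals(typeSystem)) {
            throw new IllegalStateException(
                "can't change merged type system using the same loader");
        }
    }

    synchronized Object getCommittedTypeSystem() { return committedTypeSystem; }
}

class CommitSketch {
    // Walks up the parent chain from 'start'. Returns the first
    // TypeSystemLoaderSketch found (after binding it), or null if there is
    // none, in which case batch generation would be needed instead.
    static TypeSystemLoaderSketch commit(Object typeSystem, ClassLoader start) {
        for (ClassLoader cl = start; cl != null; cl = cl.getParent()) {
            if (cl instanceof TypeSystemLoaderSketch) {
                TypeSystemLoaderSketch tsl = (TypeSystemLoaderSketch) cl;
                tsl.bind(typeSystem);
                return tsl;
            }
        }
        return null;
    }
}
```

A null return corresponds to the "no context set up" case, where lazy generation is not available.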

Finding the right class loader at type system commit time

An external App can create a "stand-alone" type system.

  • this same type system instance could be used for multiple UIMA pipelines (shared instance)
  • It could be created before any CASs and be associated with many CASs eventually
  • It could be created after a CAS was created, and replace that CAS's type system (e.g. deserialization)

Type System creation is via new TypeSystemImpl() or one of its callers.  This is a public API, so we would like to keep this (for the default case).

  • In this case, we have no access to a UIMA Pipeline or CAS in general, so can only look up the classloader parent chain of this internal uima core class instance.
  • An alternative is to allow the user to set up a type system loader and pass that to a new version of the type system impl, or to a new version of commit.
    • commit is currently called by CAS.commitTypeSystem(), which also does some work re: the old FsGenerator setups.
    • The call from CAS.commitTypeSystem has (obviously) a ref to a CAS, and therefore, can find the class loader associated with that CAS - the sharedViewData has a JCasClassLoader, set by the createCas calls -> doCreateCas -> setJCasClassLoader.
      • So, from the caller to commit which is via a CAS, we can use that class loader.

Approach - outside of UIMA framework

The UIMA framework could take an approach which says a particular UIMA application (imagine it running as a servlet) has its own classpath, set up and managed outside of the UIMA framework (e.g., by the J2EE servlet APIs).  Using the preexisting alternative, the user could generate the JCas classes, and include them in the servlet's classpath.  

  • drawback - we would need to include some code to verify the loaded class had the right methods and feature range mappings.
  • supports lazy loading trivially, but not lazy generation
  • restricts particular UIMA app to one type system.

Flow of JCas class generation 

There are two parts:

  • generation and loading (done lazily unless no suitable UIMATypeSystemClassLoader in the parent chain, in which case done in batch mode via injection into current classloader)
    • the current class loader is the JCasClassLoader if one is defined, otherwise the thread's context class loader, or the UIMA framework class loader if all else fails
  • creation of indirection function interface method handles to use for creation and accessing of features
    • done on first-need, not in batch
    • stored in either the TypeImpl (creation) or the FeatureImpl (set/get of features)
      • The feature ones are not needed (and not created) for JCas-style access to features by named getters/setters
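
The fallback order for the "current" class loader used in batch mode could be expressed as below. This is a sketch under assumptions: LoaderChoiceSketch is a made-up name, and its own defining class loader stands in for the UIMA framework class loader.

```java
class LoaderChoiceSketch {
    // Picks the loader to inject generated classes into when no
    // type-system class loader is in the parent chain.
    static ClassLoader chooseLoader(ClassLoader jcasClassLoader) {
        if (jcasClassLoader != null) {
            return jcasClassLoader;                      // 1. explicit JCas class loader
        }
        ClassLoader contextLoader = Thread.currentThread().getContextClassLoader();
        if (contextLoader != null) {
            return contextLoader;                        // 2. thread context class loader
        }
        return LoaderChoiceSketch.class.getClassLoader(); // 3. stand-in for the framework loader
    }
}
```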

[Figure: generating JCas sequence]

Content of generated class

The content has information enabling backwards compatibility (mainly, the _Type class and its support of the Low-Level APIs).  The main class also has generated content for backwards compatibility.

Main class:  There's one instance of this 

  • extends = the super class from the UIMA type system.
  • backwards compatibility:
    • a global index into the JCasRegistry for this class (need to watch out for memory leaks)
    • an int that represents the "id".  This value increments by one and is maintained per CAS (not per CAS view)
    • more - tbd
  • for each defined feature:
    • a private field, with a name starting with _ and corresponding to the feature name.  (Note that feature names cannot start with _)
    • getters and setters for the field, with specific typing info.  
      • Also, a version of these for arrays, to allow element access
      • These directly access the fields.
  • constructors (used by new operator), taking a JCas as an argument, for backwards compatibility
    • For subtypes of Annotation, a 3 arg constructor specifying begin and end (backwards compat).
  • (Cannot have a static ref to the TypeImpl because for built-ins, there's one class shared by multiple type systems)
  • A reference to the shared (per view) instance of the corresponding _Type class
    • From here, the TypeImpl can be located, as well as the CAS view
    • This instance is created on demand, if needed, from an identity hash map (class of instance -> instance of _Type) in the CAS view.
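
To make the shape of a generated main class more concrete, here is a rough sketch. All names (ViewSketch, TypeCompanionSketch, AnnotationSketch, TokenSketch) are hypothetical stand-ins for the CAS view, the _Type class, and generated JCas classes; the JCasRegistry index and the "id" are left out.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Stand-in for a CAS view; holds the per-view _Type instances, created on demand.
class ViewSketch {
    private final Map<Class<?>, TypeCompanionSketch> companions = new IdentityHashMap<>();

    TypeCompanionSketch companionFor(Class<?> jcasClass) {
        return companions.computeIfAbsent(jcasClass, c -> new TypeCompanionSketch(c, this));
    }
}

// Stand-in for an instance of a generated _Type class.
class TypeCompanionSketch {
    final Class<?> jcasClass;
    final ViewSketch view;

    TypeCompanionSketch(Class<?> jcasClass, ViewSketch view) {
        this.jcasClass = jcasClass;
        this.view = view;
    }
}

// Stand-in for a generated supertype with begin/end features.
class AnnotationSketch {
    private final ViewSketch _view;
    private int _begin;
    private int _end;

    AnnotationSketch(ViewSketch view) { _view = view; }

    AnnotationSketch(ViewSketch view, int begin, int end) {
        this(view);
        _begin = begin;
        _end = end;
    }

    int getBegin() { return _begin; }
    void setBegin(int v) { _begin = v; }
    int getEnd() { return _end; }
    void setEnd(int v) { _end = v; }

    // Located through the view rather than a static, since built-in
    // classes may be shared by multiple type systems.
    TypeCompanionSketch companion() { return _view.companionFor(getClass()); }
}

// Stand-in for a generated subtype: one private "_"-prefixed field per feature.
class TokenSketch extends AnnotationSketch {
    private String _lemma;
    private String[] _tags = new String[0];

    TokenSketch(ViewSketch view) { super(view); }
    TokenSketch(ViewSketch view, int begin, int end) { super(view, begin, end); }

    String getLemma() { return _lemma; }
    void setLemma(String v) { _lemma = v; }

    // Whole-array and per-element accessors for an array-valued feature.
    String[] getTags() { return _tags; }
    void setTags(String[] v) { _tags = v; }
    String getTags(int i) { return _tags[i]; }
    void setTags(int i, String v) { _tags[i] = v; }
}
```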

_Type class:  There's one instance of this per CAS view (per JCas class).

  • extends = the super _Type class corresponding to the super type in the UIMA type system
  • backwards compatibility:
    • generated low-level APIs for setting and getting values as is now done
    • The same global index into JCasRegistry from the main class, via static reference to corresponding field
    • has the same control flags as v2
  • ref to CAS/JCas view
  • ref to TypeImpl

Accessing features via Feature instances

For both backwards compatibility and to support more generic processing, UIMA provides access to the features of a FeatureStructure via API calls where the feature is identified indirectly, as an argument, which is a reference to an instance of FeatureImpl.

In Java 8, this can be supported via lazy, on-demand generation of getters and setters, which have the same performance as native access.  This can be achieved using the MethodHandle and LambdaMetafactory capabilities available in Java 8.

The APIs will call a getter/setter method specialized to the classes supported in UIMA V2: boolean, byte, short, int, long, float, double, String, FeatureStructure, and JavaObject (new in v3). (Arrays will be either FeatureStructure or JavaObject).

On first call, these methods in FeatureImpl will use appropriate reflection to set up the proper method handles / functional interfaces to support subsequent high-speed access.
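
A minimal sketch of this first-call linkage for an int-valued feature is shown here, using java.lang.invoke.LambdaMetafactory. TokenFs and IntFeatureSketch are hypothetical stand-ins for a generated JCas class and for FeatureImpl; finding the accessor by method name is an assumption of the example.

```java
import java.lang.invoke.CallSite;
import java.lang.invoke.LambdaMetafactory;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.function.ObjIntConsumer;
import java.util.function.ToIntFunction;

// Hypothetical stand-in for a generated JCas class with one int feature.
class TokenFs {
    private int _begin;
    public int getBegin() { return _begin; }
    public void setBegin(int v) { _begin = v; }
}

// Hypothetical stand-in for FeatureImpl, restricted to int-valued features.
class IntFeatureSketch {
    private final Class<?> fsClass;
    private final String getterName;
    private final String setterName;
    private ToIntFunction<Object> getter;   // linked on first get
    private ObjIntConsumer<Object> setter;  // linked on first set

    IntFeatureSketch(Class<?> fsClass, String getterName, String setterName) {
        this.fsClass = fsClass;
        this.getterName = getterName;
        this.setterName = setterName;
    }

    int getIntValue(Object fs) {
        if (getter == null) {
            getter = link(ToIntFunction.class, "applyAsInt", getterName,
                          MethodType.methodType(int.class));
        }
        return getter.applyAsInt(fs);
    }

    void setIntValue(Object fs, int v) {
        if (setter == null) {
            setter = link(ObjIntConsumer.class, "accept", setterName,
                          MethodType.methodType(void.class, int.class));
        }
        setter.accept(fs, v);
    }

    // Looks up the accessor reflectively once, then wraps it in a
    // functional-interface instance via LambdaMetafactory.
    @SuppressWarnings("unchecked")
    private <T> T link(Class<?> functionalInterface, String samName,
                       String methodName, MethodType signature) {
        try {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            MethodHandle target = lookup.findVirtual(fsClass, methodName, signature);
            MethodType instantiated = target.type();  // e.g. (TokenFs)int
            MethodType erased = instantiated.changeParameterType(0, Object.class);
            CallSite site = LambdaMetafactory.metafactory(
                lookup, samName, MethodType.methodType(functionalInterface),
                erased, target, instantiated);
            return (T) site.getTarget().invoke();
        } catch (Throwable t) {
            throw new IllegalStateException("cannot link " + methodName, t);
        }
    }
}
```

After the first call, each access goes through the cached functional-interface instance, so no further reflection is involved.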

Generation notes

Need to package ASM embedded within new prefix in project (ASM requirement?)

Need to ensure the package name is defined before doing class definition

Need to avoid circular refs or see if they can be handled by delaying resolution in the class loader
