Explore the consequences of allowing the various implementations of Java List, Set, and Map to be first-class UIMA types.

  • The types could even have the same package names, so their JCas cover classes follow the naming conventions
  • Would need serialization / deserialization code for our various forms
  • Would (optionally) need C++ implementation support.  Alternatively, the new types could be excluded from being passed to C++ components.

Issues with "equals" and "hashcode" and "compare"

The Java collections call the Java implementations of these methods.  We don't (currently) have a way of defining them for UIMA Types.

  • An extension to the type system specification could define these in terms of UIMA Types used as keys.
    • How should type merging be handled? Merging might add new features that could be specified as part of these functions.
      • For hashcode / equals, it is sufficient to say whether a feature is included or not.
      • For compare, there is also an "ordering" consideration: which feature would be compared first.  This seems to be unsolvable given the constraints of independently developed components and type system definitions. Perhaps for this we might "require" a hand-merged type system specification, and throw an error otherwise.
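
As a rough illustration of the "feature is included or not" idea for hashcode / equals, here is a minimal sketch in plain Java. It does not use any UIMA API: KeyedFs is a hypothetical stand-in for a feature structure (a map from feature name to value), and the set of key feature names plays the role of the proposed type system extension. The feature names "begin", "end", and "confidence" are assumptions chosen only for the example.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical stand-in for a feature structure; not a UIMA class. */
final class KeyedFs {
    // All features of this structure: feature name -> value.
    private final Map<String, Object> features = new LinkedHashMap<>();
    // Names of the features that take part in equals/hashCode.
    // A sorted set makes the result independent of declaration order.
    private final SortedSet<String> keyFeatures;

    KeyedFs(Collection<String> keyFeatures) {
        this.keyFeatures = new TreeSet<>(keyFeatures);
    }

    KeyedFs set(String feature, Object value) {
        features.put(feature, value);
        return this;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof KeyedFs)) return false;
        KeyedFs that = (KeyedFs) other;
        if (!keyFeatures.equals(that.keyFeatures)) return false;
        // Only key features are compared; all other features are ignored.
        for (String f : keyFeatures) {
            if (!Objects.equals(features.get(f), that.features.get(f))) return false;
        }
        return true;
    }

    @Override
    public int hashCode() {
        int h = 17;
        for (String f : keyFeatures) {
            h = 31 * h + Objects.hashCode(features.get(f));
        }
        return h;
    }
}
```

With key features "begin" and "end", two KeyedFs instances that differ only in "confidence" are equal, so a feature added later by type merging changes nothing unless it is explicitly declared as a key. As with any Java object used in a HashSet or as a HashMap key, changing the value of a key feature while the object is in the collection would break lookups.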

4 Comments

  1. How about defining equals and hashcode basically as "feature structure identity", so for the current implementation based on the CAS address? For more sophisticated comparisons, UIMA provides the indexes. Regarding FS identity, I think "compare" doesn't generally have a meaning. But when we have an index, it might be possible to obtain a comparator from the index that is based on the indexed features.

  2. I think that having feature-structure identity is one possible choice, but it seems to me that the equals, hashcode, and compare need to be user-specified (analogous to how this is done in plain Java programming).  

    Here's a made-up use case I'm thinking about.  Suppose you have some text, and annotators are examining it and adding annotations to it.  Now, imagine that these annotations might form a graph, and you might have many (100's, 1000's) of these graphs, each one "attached" to some phrase.  The elements of the graph might also have some confidence or other kinds of measures (as fields), and these might change as more analysis is done.  A subsequent annotator might want to iterate, for each annotated phrase, over these annotations in some kind of sort order.

    Although this is similar to what UIMA provides with its indexes, the different idea is that (for whatever reasons) the pipeline designer had 100's or 1000's of these graphs, all attached to specific spots in the text, or linked via some graph structure from those spots.  Hence the need for some kind of "collection" that can be the value of some field of some annotation, and the need to have a user-defined way of sorting, etc. those things in the collection.

    My first naive guess at specifying this would be to essentially copy the existing "spec" we have for set and sorted indexes: a set of keys (fields in the Feature Structure) and a flag saying ascending / descending.  This is weaker than what Java provides (arbitrary user-defined functions...), but I can't see a "portable" way to specify that much flexibility.  Of course, if we restrict this to Java, we could have code snippets, perhaps :-)

     

  3. Would you imagine that the actual UIMA type system specification be extended or that this would be a convenience extension e.g. for the Java-based UIMA implementation?

  4. A first try at implementing this might be just an extension for the Java-based UIMA.   As long as the pipeline components were in one JVM, then each Annotator would have access to these Java collection objects.  If the pipeline had remote components, even if they also were Java components, the CAS would be "serialized/deserialized" and the remote component (assuming it was interested in these objects too) would need to deserialize them, and then would need the type info about this in the UIMA Type system, including a definition of what the "keys" sort/set info was, I think.   

    In terms of supporting this for other programming paradigms, perhaps those could wait until someone got that itch to scratch; I suspect C/C++ has libraries that support the same functionality as Java collections, and the serialization/deserialization for C/C++ could be extended to support these kinds of objects.
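
To make the key-plus-direction idea from comment 2 more concrete, here is a minimal sketch in plain Java of a comparator built from an ordered list of keys, each with an ascending / descending flag. Nothing here is UIMA API: SortKey and KeySpecComparators are hypothetical names, a feature structure is modelled as a simple map from feature name to value, and the sketch assumes every key feature holds a non-null, mutually Comparable value.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** One sort key: a feature name plus a direction. Hypothetical, not a UIMA class. */
final class SortKey {
    final String feature;
    final boolean ascending;

    SortKey(String feature, boolean ascending) {
        this.feature = feature;
        this.ascending = ascending;
    }
}

final class KeySpecComparators {
    private KeySpecComparators() {}

    /**
     * Builds a comparator that compares by each key in list order;
     * earlier keys take precedence over later ones.
     */
    @SuppressWarnings({"unchecked", "rawtypes"})
    static Comparator<Map<String, Object>> fromKeys(List<SortKey> keys) {
        // Start with a comparator that treats everything as equal.
        Comparator<Map<String, Object>> result = (a, b) -> 0;
        for (SortKey key : keys) {
            // Assumes the values of this feature are non-null and Comparable.
            Comparator<Map<String, Object>> byFeature =
                (a, b) -> ((Comparable) a.get(key.feature)).compareTo(b.get(key.feature));
            result = result.thenComparing(key.ascending ? byFeature : byFeature.reversed());
        }
        return result;
    }
}
```

A comparator built this way could be passed to a TreeSet or to List.sort, for example to iterate over the annotations attached to a phrase by descending confidence and then ascending begin offset. The position of each key in the list decides which feature is compared first, which is exactly the part that independently developed type system definitions cannot agree on automatically when they are merged.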