A place to record analysis and observations about the design represented by the Cas-obj prototype.

Columns: aspect; sub aspect; picture; space/time tradeoffs, locality-of-reference (L1/L2/L3 memory caching); backwards compatibility; alternatives; notes
Overview 

UIMA Entities (internal view)

       
Data Storage

Where: Each FS's storage is represented by values as part of 1 Java object

  • can be GC'd
  • No central CAS "Heap"

UIMA Feature Structure diagram


More space:

  • always have Java cover object (vs possibility of no Java objects)
  • Java cover object: 3 object overheads / FS (vs 1)
  • Java cover object has denormalized shared additional fields

Faster: locality of reference high.

Faster: operations (except for some needing FSids) won't need to use JCasHashMap to convert from int offsets in the heap to JCas cover objects.

 

for reduced space / FS could:

  • avoid java object overheads for Obj & int arrays (but gives up GC by individual object)
  • share cas ref, type ref, typesystem ref.
denormalized: each has cas ref, type ref, typesystem ref   
Data Storage

"fs-id" - an int (dense) representing the unique ID of a FS.

  • assigned lazily, not all FSs might have these
  • not reused in case FS is garbage collected
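
A minimal Java sketch of the lazy, dense, non-reused fs-id scheme listed above. The class name FsIdRegistrySketch, its method names, the starting id of 1, and the use of two maps are illustrative assumptions, not the prototype's actual code. The forget method only simulates an FS being garbage collected.

```java
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;

/** Hypothetical registry that hands out fs-ids on first request only. */
final class FsIdRegistrySketch {
    private int nextId = 1; // assumption: ids start at 1 and only ever grow
    private final Map<Object, Integer> fsToId = new IdentityHashMap<>();
    private final Map<Integer, Object> idToFs = new HashMap<>();

    /** Returns the id of the FS, assigning the next dense id if it has none yet. */
    synchronized int idOf(Object fs) {
        Integer id = fsToId.get(fs);
        if (id == null) {
            id = nextId++;
            fsToId.put(fs, id);
            idToFs.put(id, fs);
        }
        return id;
    }

    /** True only if an id has already been assigned to this FS. */
    synchronized boolean hasId(Object fs) {
        return fsToId.containsKey(fs);
    }

    /** Reverse lookup; returns null if the id is unknown or was forgotten. */
    synchronized Object fsOf(int id) {
        return idToFs.get(id);
    }

    /**
     * Simulates the FS being garbage collected: both mappings are dropped,
     * but nextId is not decremented, so the id is never handed out again.
     * Note: these maps hold strong references, so a real version would need
     * weak references to let the GC reclaim FSs that have ids.
     */
    synchronized void forget(Object fs) {
        Integer id = fsToId.remove(fs);
        if (id != null) {
            idToFs.remove(id);
        }
    }
}
```

An FS that is never asked for its id never appears in either map, which is the point of assigning ids lazily.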
        
Data Storage

Feature Structure representation: as 3 Java objects:

  • array of Ints
  • array of Objects
  • container of above, with additional refs to
    • cas
    • typesystem
    • type
    • a ref to a shared int array representing offsets in the top two arrays, indexed by known JCas features
 

The offset array is an object that roughly corresponds to the _Type object in the JCas, in that it provides a way to get from a designated field to the offset. The JCas provides this as special named fields that are part of the _Type object. The CasObj provides this as an int array object.

The cas ref is used for "addToIndexes" to locate the view containing the indexes to be added to.

The offset array is shared among all FSs associated with a particular type system, with some exceptions (e.g. SourceDocumentInformation) - but I think this is just a quick-fix anomaly.
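
A rough Java sketch of the three-object layout and the shared offset array described above. All names here (FsContainerSketch, buildOffsets, the stand-in Object and String refs) are hypothetical, and for simplicity the offsets are computed against a single ordered list of a type's effective features. The real prototype may lay things out differently.

```java
import java.util.List;

/** Hypothetical container for one feature structure. */
final class FsContainerSketch {
    final Object casView;     // stand-in for the ref to the CAS view
    final Object typeSystem;  // stand-in for the type system ref
    final String type;        // stand-in for the type ref
    final int[] featOffsets;  // shared, not copied, across FSs of the same type
    final int[] intValues;    // per-FS int slots, or null if none are needed
    final Object[] objValues; // per-FS reference slots, or null if none are needed

    FsContainerSketch(Object casView, Object typeSystem, String type,
                      int[] featOffsets, int intSlots, int objSlots) {
        this.casView = casView;
        this.typeSystem = typeSystem;
        this.type = type;
        this.featOffsets = featOffsets;
        // A missing array saves one object overhead for this FS.
        this.intValues = intSlots > 0 ? new int[intSlots] : null;
        this.objValues = objSlots > 0 ? new Object[objSlots] : null;
    }

    /**
     * Builds the offset array once per type system: entry i is the position
     * of knownFeatures[i] among the type's effective features, or -1 if the
     * type system does not define that feature.
     */
    static int[] buildOffsets(String[] knownFeatures, List<String> effectiveFeatures) {
        int[] offsets = new int[knownFeatures.length];
        for (int i = 0; i < knownFeatures.length; i++) {
            offsets[i] = effectiveFeatures.indexOf(knownFeatures[i]);
        }
        return offsets;
    }
}
```

Every FS of the same type would be constructed with the same featOffsets instance, so the per-FS cost of the offsets is one reference rather than one array.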

  

cas ref is to one view; used for add/remove-indexes, getView, get the "fs-id"

   
Data Storage

"get" and "set" operations for features

  • some builtin hard-coded offsets
  • there's a shared int[] that maps JCas features to offsets in the int[] and obj[] values
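
To make the two access paths concrete, here is a small Java sketch of feature getters and setters. The class name, the feature names (begin, end, pos, count), and the specific offset values are assumptions made for illustration. Only the overall idea (hard-coded offsets for some built-ins, a shared int[] lookup for the rest) comes from the notes above.

```java
/** Hypothetical cover-class-like accessor over the int[] and Object[] value arrays. */
final class FeatureAccessSketch {
    // Built-in features: offsets fixed at compile time (values are assumed).
    static final int BEGIN_OFFSET = 0;
    static final int END_OFFSET = 1;

    // Positions in the shared featOffsets array for features this class knows.
    static final int FEAT_POS = 0;   // a reference-valued feature, stored in objValues
    static final int FEAT_COUNT = 1; // an int-valued feature, stored in intValues

    private final int[] featOffsets; // shared mapping: known feature -> slot
    private final int[] intValues;
    private final Object[] objValues;

    FeatureAccessSketch(int[] featOffsets, int intSlots, int objSlots) {
        this.featOffsets = featOffsets;
        this.intValues = new int[intSlots];
        this.objValues = new Object[objSlots];
    }

    // Hard-coded path: no lookup in featOffsets.
    int getBegin() { return intValues[BEGIN_OFFSET]; }
    void setBegin(int v) { intValues[BEGIN_OFFSET] = v; }
    int getEnd() { return intValues[END_OFFSET]; }
    void setEnd(int v) { intValues[END_OFFSET] = v; }

    // Looked-up path: one extra array read to find the slot.
    String getPos() { return (String) objValues[featOffsets[FEAT_POS]]; }
    void setPos(String v) { objValues[featOffsets[FEAT_POS]] = v; }
    int getCount() { return intValues[featOffsets[FEAT_COUNT]]; }
    void setCount(int v) { intValues[featOffsets[FEAT_COUNT]] = v; }
}
```

With featOffsets = {0, 2}, for example, pos lives in objValues[0] and count in intValues[2], after the two built-in int slots.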
        
Data Storage

JCas _Type classes

These are not used, but are "supported" for backwards compatibility. Support includes their low-level APIs (question)

Data Storage

Low-level API support, including C++ and binary (de)serialization: partially started, remainder TBD

Views

The FS obj has a link to the CAS view it was originally created in; this is used for the obj.addToIndexes style of add/remove

Indexes

Bag - structure

UIMA-CO Bag Index

  • 1 collection per (instantiated) type (lazy construction)
  • Collection structures (especially concurrent ones) have significant space overheads
    • but probably has low memory-cache-dumping for add/remove and simple iteration ops (linked lists) (a good thing)
  • size() operation may be slow, especially for concurrent collections

 

  There is effectively one bag index per view. The index is kept by type, with the type parts lazily created.   
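
A minimal Java sketch of a per-view bag index kept by type, with the per-type collections created lazily. The class and method names are hypothetical, and the choice of ConcurrentHashMap plus ConcurrentLinkedQueue is an assumption standing in for whatever collections the prototype actually uses.

```java
import java.util.Collection;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical bag index for one view: one collection per instantiated type. */
final class BagIndexSketch {
    private final ConcurrentMap<String, Collection<Object>> byType =
        new ConcurrentHashMap<>();

    /** Adds an FS, creating the collection for its type on first use. */
    void add(String typeName, Object fs) {
        byType.computeIfAbsent(typeName, t -> new ConcurrentLinkedQueue<>()).add(fs);
    }

    /** Removes one occurrence of the FS; returns false if it was not indexed. */
    boolean remove(String typeName, Object fs) {
        Collection<Object> c = byType.get(typeName);
        return c != null && c.remove(fs);
    }

    /**
     * Number of FSs indexed for a type. ConcurrentLinkedQueue.size() walks
     * the whole list, so this is linear in the number of elements.
     */
    int size(String typeName) {
        Collection<Object> c = byType.get(typeName);
        return c == null ? 0 : c.size();
    }

    /** Number of types that have had a collection created so far. */
    int instantiatedTypeCount() {
        return byType.size();
    }
}
```

Queries for a type that was never added to do not create a collection, so only types with indexed instances pay the per-collection overhead.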

 


2 Comments

  1. Some explanations/clarifications around the FS/cover-class object design (have tried to include responses to the (question)'s above):

    Motivated primarily by simplification, the initial approach involved a new much simpler format for JCas cover classes, with a single class per type (i.e. no _Type classes). With this approach, as mentioned in the table, the featOffsets array effectively replaces the _Type object. I have some modifications to JCasGen to generate this style of cover classes, but it became apparent that a requirement to regen cover classes would be pretty inconvenient in many cases from a compatibility pov, and so in a later iteration I adapted it to also work with existing cover classes.

    An example of the simpler format is the current SourceDocumentInformation class - in general the offsets might vary at runtime depending on which fields the cover class has versus the effective fields of its corresponding type, and so these need to be looked up. The static FEATURES array field declares which features the cover class knows about, and based on this the non-static featOffsets array of offsets is populated. Its contents could differ per typesystem, but for a given typesystem/coverclass pair there will only be one of these array objects, which all FS instances of that type will reference.

    The comment in the table about SourceDocumentInformation - having extra int[] offsets I think is maybe confusing the featOffsets array with the intValues[] array. However, it's true that for built-in types like this an additional simplification can be made - the offsets are actually invariant even between typesystems, so special-case static values can be used. I made this change to other built-in types (Annotation etc) but did miss SourceDocumentInformation.

    Something to point out here is that the _Type approach is better for saving space since there's a single ref per FS, whereas there could be multiple featOffset array refs per FS (one per coverclass in the type hierarchy).

    This is an example of one of the design tradeoffs that I'm not sure about - the space-saving of using _Type classes versus the added complexity. Adding in backwards-compatibility, I now think keeping the _Type objects might make more sense.

     

    Re additional object overhead per FS (the value arrays and the refs to them) - it's true that this is a disadvantage, but it was the only viable approach I could think of which would be fully general and provide full compatibility. If the FS doesn't have features of certain types then the overhead could be less than 3 (the corresponding array is null).

    The decision for having separate cas, type, typesystem refs was both for locality of reference and simplicity, but it would be trivial to obtain the typesystem indirectly, and also not hard to change the usage of the _Type classes so that the cas and type refs are also eliminated. I don't have a good feel for the best tradeoff here! However a change I do think is worthwhile which I have made (may not be in the latest published version), is to "denormalize" (to borrow Marshall's terminology) the typesystemimpl ref which was in the cas metadata, adding a ref per CASImpl (view). This would have insignificant space impact.

    There are other tradeoffs which I made a rough guess about - for example having 2 arrays per FS (obj and int). This seemed like the right balance - other options could include having a single obj array or greater than two (a long[] array to avoid wrapper class overhead).

     

    Re the FS ID questions - the design really tried to avoid use of these at all (since the FS obj refs themselves are the logical equivalent), but there are some reasons why they're still needed for a couple of specific things (LL api compatibility, consistency across existing serialization formats, consistent index ordering for otherwise equal FSs). Currently there are maps in each direction for this and the ID is lazily assigned. I couldn't see any reason it shouldn't be dense - apart from the case where existing binary serialization is used and the heap structure needs to be simulated. The id values themselves are not reused if the FS is GC'd - I could see no reason for doing this (and there would be a cost in terms of processing and complexity).

    An idea I was playing with though is to have an int id field in the feature structure as an alternative to one of the maps. There are obvious pros/cons to this and again the best tradeoff wasn't obvious to me.

  2. Re: offset for built-in types and making them static.  Before doing this it would be good to see if the type is "featureFinal" - otherwise, the features might be extended via the type merge mechanism.  Even if this happened, it probably would be possible to guarantee that the "built-in" features come first.

    Re: the _Type being better for space vs having a ref in every FS instance.  Although true, it's better for locality of reference (LOR) to have the extra reference right next to the other references. I think the general guiding principle ought to be if the improvement in LOR occurs in a high-frequency use case path, then it might be a good thing to do.