A place to record analysis and observations about the design represented by the Cas-obj prototype.
aspect | sub aspect | picture | space/time tradeoffs, locality-of-reference (L1/L2/L3 memory caching) | backwards compatibility | alternatives | notes | |||
---|---|---|---|---|---|---|---|---|---|
Overview | |||||||||
Data Storage | Where: Each FS's storage is represented by values as part of 1 Java object
| UIMA Feature Structure diagram | More space:
Faster: locality of reference high. Faster: operations (except for some needing FSids) won't need to use JCasHashMap to convert from int offsets in the heap to JCas cover objects. | for reduced space / FS could:
| denormalized: each has cas ref, type ref, typesystem ref | ||||
Data Storage | "fs-id" - an int (dense) representing the unique ID of a FS.
| ||||||||
Data Storage | Feature Structure representation: as 3 Java objects:
| The offset array is an object that roughly corresponds to the _Type object in the JCas, in that it provides a way to get from a designated field to the offset. The JCas provides this as specially named fields, part of the _Type object. The CasObj provides this as an int array object. The cas ref is used for "addToIndexes" to locate the view containing the indexes to be added to. The offset array is shared among all FSs associated with a particular type system | cas ref is to one view; used for add/remove-indexes, getView, get the "fs-id" | ||||||
Data Storage | "get" and "set" operations for features
| ||||||||
Data Storage | JCAS _Type classes | These are not used, but are "supported" for backwards compatibility. Support includes their low-level APIs | |||||||
Data Storage | low level API support, including C++, binary (de)serialization | partially started, remainder TBD | |||||||
Views | FS obj has a link to the CAS view it was originally created in; this is used for obj.addToIndexes-style add/remove | ||||||||
Indexes | Bag - structure |
| There is effectively one bag index per view. The index is kept by type, with the type parts lazily created. |
2 Comments
Nick Hill
Some explanations/clarifications around the FS/cover-class object design (have tried to include responses to the questions above):
Motivated primarily by simplification, the initial approach involved a new much simpler format for JCas cover classes, with a single class per type (i.e. no _Type classes). With this approach, as mentioned in the table, the featOffsets array effectively replaces the _Type object. I have some modifications to JCasGen to generate this style of cover classes, but it became apparent that a requirement to regen cover classes would be pretty inconvenient in many cases from a compatibility pov, and so in a later iteration I adapted it to also work with existing cover classes.
An example of the simpler format is the current SourceDocumentInformation class - in general the offsets might vary at runtime depending on which fields the cover class has versus the effective fields of its corresponding type, and so these need to be looked up. The static FEATURES array field declares which features the cover class knows about, and based on this the non-static featOffsets array of offsets is populated. Its contents could differ per typesystem, but for a given typesystem/coverclass pair there will only be one of these array objects, which all FS instances of that type will reference.
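The lookup described above can be sketched roughly as follows. This is only an illustration, not the prototype's code: `TypeDef`, `resolveOffsets` and the feature names are hypothetical stand-ins, and it assumes a feature's offset is simply its position in the type's effective feature list, with -1 marking a feature the type does not define.

```java
import java.util.Arrays;
import java.util.List;

final class FeatOffsetsSketch {
    /** Hypothetical stand-in for one type as defined in one type system. */
    static final class TypeDef {
        private final List<String> featureNames;

        TypeDef(String... featureNames) {
            this.featureNames = Arrays.asList(featureNames);
        }

        /** Position of the feature in this type's layout, or -1 if absent (an assumption). */
        int offsetOf(String name) {
            return featureNames.indexOf(name);
        }
    }

    // Features the (hypothetical) cover class knows about, declared statically.
    static final String[] FEATURES = {"uri", "offsetInSource"};

    /** Builds the per-typesystem offsets array; one such array would be shared by all FSs of the type. */
    static int[] resolveOffsets(String[] features, TypeDef type) {
        int[] offsets = new int[features.length];
        for (int i = 0; i < features.length; i++) {
            offsets[i] = type.offsetOf(features[i]);
        }
        return offsets;
    }

    public static void main(String[] args) {
        // A type system where another feature happens to precede the ones the cover class knows.
        TypeDef merged = new TypeDef("extra", "uri", "offsetInSource");
        System.out.println(Arrays.toString(resolveOffsets(FEATURES, merged))); // prints [1, 2]
    }
}
```

The point of the sketch is that the cover class's own feature indices stay fixed while the resolved offsets can differ from one type system to another.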
The comment in the table about SourceDocumentInformation having extra int[] offsets is, I think, confusing the featOffsets array with the intValues[] array. However, it's true that for built-in types like this an additional simplification can be made - the offsets are actually invariant even between typesystems, so special-case static values can be used. I made this change to other built-in types (Annotation etc) but did miss SourceDocumentInformation.
Something to point out here is that the _Type approach is better for saving space since there's a single ref per FS, whereas there could be multiple featOffsets array refs per FS (one per coverclass in the type hierarchy).
This is an example of one of the design tradeoffs that I'm not sure about - the space-saving of using _Type classes versus the added complexity. Adding in backwards-compatibility, I now think keeping the _Type objects might make more sense.
Re additional object overhead per FS (the value arrays and the refs to them) - it's true that this is a disadvantage, but it was the only viable approach I could think of which would be fully general and provide full compatibility. If the FS doesn't have features of certain types then the overhead could be less than 3 (the corresponding array is null).
The decision for having separate cas, type, typesystem refs was both for locality of reference and simplicity, but it would be trivial to obtain the typesystem indirectly, and also not hard to change the usage of the _Type classes so that the cas and type refs are also eliminated. I don't have a good feel for the best tradeoff here! However a change I do think is worthwhile which I have made (may not be in the latest published version), is to "denormalize" (to borrow Marshall's terminology) the typesystemimpl ref which was in the cas metadata, adding a ref per CASImpl (view). This would have insignificant space impact.
There are other tradeoffs which I made a rough guess about - for example having 2 arrays per FS (obj and int). This seemed like the right balance - other options could include having a single obj array or greater than two (a long[] array to avoid wrapper class overhead).
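A rough sketch of the two-array layout, to make the object-count tradeoff concrete. All names here (`Fs`, `javaObjectCount`, the accessors) are hypothetical, and the rule that an array is left null when the type has no features of that kind follows the description above rather than the prototype's actual code.

```java
final class TwoArrayFsSketch {
    /** Hypothetical FS: one object plus up to two value arrays. */
    static final class Fs {
        final int[] intValues;    // int-valued features; null if the type has none
        final Object[] objValues; // reference-valued features; null if the type has none

        Fs(int intFeatureCount, int objFeatureCount) {
            intValues = intFeatureCount == 0 ? null : new int[intFeatureCount];
            objValues = objFeatureCount == 0 ? null : new Object[objFeatureCount];
        }

        int getInt(int offset) { return intValues[offset]; }
        void setInt(int offset, int value) { intValues[offset] = value; }

        Object getRef(int offset) { return objValues[offset]; }
        void setRef(int offset, Object value) { objValues[offset] = value; }

        /** Java objects used to represent this FS: the FS itself plus each non-null array. */
        int javaObjectCount() {
            return 1 + (intValues != null ? 1 : 0) + (objValues != null ? 1 : 0);
        }
    }

    public static void main(String[] args) {
        Fs annotationLike = new Fs(2, 1); // e.g. two int features and one reference feature
        annotationLike.setInt(0, 5);
        annotationLike.setInt(1, 12);
        Fs intsOnly = new Fs(2, 0);
        System.out.println(annotationLike.javaObjectCount()); // prints 3
        System.out.println(intsOnly.javaObjectCount());       // prints 2
    }
}
```

A single Object[] would cut this to two objects per FS at the cost of boxing ints, while adding a long[] would avoid wrappers for longs at the cost of a possible fourth object.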
Re the FS ID questions - the design really tried to avoid use of these at all (since the FS obj refs themselves are the logical equivalent), but there are some reasons why they're still needed for a couple of specific things (LL api compatibility, consistency across existing serialization formats, consistent index ordering for otherwise equal FSs). Currently there are maps in each direction for this and the ID is lazily assigned. I couldn't see any reason it shouldn't be dense - apart from the case where existing binary serialization is in use and the heap structure needs to be simulated. The id values themselves are not reused if the FS is GC'd - I could see no reason for doing this (and there would be a cost in terms of processing and complexity).
An idea I was playing with though is to have an int id field in the feature structure as an alternative to one of the maps. There are obvious pros/cons to this and again the best tradeoff wasn't obvious to me.
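For illustration, lazy dense ID assignment with a map in each direction might look roughly like the sketch below. The class and method names are hypothetical, IDs starting at 1 is an assumption, and the sketch holds strong references, so it ignores the GC behaviour discussed above.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

final class LazyFsIdSketch {
    // FS -> id, keyed by object identity since the FS reference is the logical identity.
    private final Map<Object, Integer> idByFs = new IdentityHashMap<>();
    // id -> FS; dense ids allow a plain list, with id n stored at index n - 1.
    private final List<Object> fsById = new ArrayList<>();
    private int nextId = 1; // assumption: ids start at 1 and are never reused

    /** Returns the id of the FS, assigning the next dense id on first request. */
    int getId(Object fs) {
        Integer id = idByFs.get(fs);
        if (id == null) {
            id = nextId++;
            idByFs.put(fs, id);
            fsById.add(fs);
        }
        return id;
    }

    /** Returns the FS for an id, or null if no such id has been assigned. */
    Object getFs(int id) {
        return (id >= 1 && id <= fsById.size()) ? fsById.get(id - 1) : null;
    }

    public static void main(String[] args) {
        LazyFsIdSketch ids = new LazyFsIdSketch();
        Object first = new Object();
        Object second = new Object();
        System.out.println(ids.getId(second)); // prints 1: ids follow request order, not creation order
        System.out.println(ids.getId(first));  // prints 2
        System.out.println(ids.getId(second)); // prints 1 again
    }
}
```

Storing an int id field in the FS itself would replace `idByFs`, trading one map lookup for four extra bytes in every FS, including those that never need an ID.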
Marshall Schor
Re: offset for built-in types and making them static. Before doing this it would be good to see if the type is "featureFinal" - otherwise, the features might be extended via the type merge mechanism. Even if this happened, it probably would be possible to guarantee that the "built-in" features come first.
Re: the _Type being better for space vs having a ref in every FS instance. Although true, it's better for locality of reference (LOR) to have the extra reference right next to the other references. I think the general guiding principle ought to be if the improvement in LOR occurs in a high-frequency use case path, then it might be a good thing to do.