Processor architecture in 2015 has a very large imbalance between the speed of a core and the speed of main memory access.  This is helped by having L1/ L2/ L3 memory hierarchies.  If the core UIMA framework uses many cache lines for high-frequency operations (such as getting a value from a Feature structure), this reduces the cache available to the application, and can, as a result, slow down (perhaps significantly) the speed of that app.

  • The amount of slow-down depends on how "likely" the app would have previously "fit" in the cache.
  • The "cache-line" - the amount of data moved - is typically 128 bytes.
  • The size of a cache-line + the lower cost of memory imply a trade-off for having "locality-of-reference" copies of some data near the place where it would be used

Short-term improvements to the core

Many of these involve copying data to be closer to things using it - closer means more likely to be in a cache-line, due to other data also being referenced.  This costs memory space (because data is duplicated) but ideally, reduces cache loading. 

Getting/Setting values in a Feature Structure

The main activities in using Feature Structures involve setting and getting values.  This has (currently) many levels of indirection, meaning, there's a lot of pointer-chasing.  

One common operation is, given an address of the FS, loading or storing to the particular offset for that feature slot.  In the current implementation, the feature is represented by a feature - code - an integer which is unique for each feature in the entire type system. So for 1000 types, each having on average 10 features (for example), you would have 10,000 features.  There's a int array of this size, indexed by the feature-code, that stores the offset for this.

Many of the offsets are actually constants, but the code doesn't take advantage of this (these are for the built-in features). So one change is using static final int values for these offsets, vs looking them up on every reference at run time.

Where they do need to be looked up, put the information nearer to other things already being accessed.

Improving UIMA Index corruption testing

UIMA 2.7.0 introduce testing for index corruption.  This, when done, currently has a big cache load, which could be greatly reduced by storing a bit per FeatureSturcture per View, in a spot local to the FS, which indicated if the the FS was indexed in one of the corruptable indexes. For this purpose, we could allocate 1 word in the heap per FS that had the possibility of corrupting indexes.  (e.g., we could skip it for arrays, which cannot be used as keys in indexes).  One word would allow for up to 32 "views", which is likely to cover all the interesting cases I think. For views beyond that, we could revert to the previous system, or just declare this to be more likely a user error than an intent, and signal an error. 

Creating Feature Structures

When FSs are created (admittedly a lower frequency than setting/getting values from feature structures) the number of slots needed on the heap is looked up in a big int array (indexed by type code).  The information for this could be moved closer to other information more likely already to be in the cache.

Careful attention to Concurrent structures

Most concurrent implementations come at a cost in memory and cache loading.  For instance, a concurrent hash table, with a concurrency level of 16, takes 17+ cache lines (out of typically 256 L1 cache lines).  This is because each reference randomly chooses one of 16 sub-hash tables.  Another way to look at this, is that each "get" might refer to a line for the table, a line for the subtable, a line for the hash entry in the table - or 3 lines per reference.  If the table was frequently used, the first line might be in the cache, and the 2nd might or might not, depending on the concurrency level. The 3rd reference would likely not be in the table (unless a recently accessed key was being reaccessed). 

So a lower concurrency level can result in somewhat lower cache loading. 



  • No labels