2 Related Documents
- off-heap memory – memory that is not on the standard Java heap and is not managed by Java and its garbage collector
4 Architectural Description
The goals and general approach to using off-heap memory are described in MemScale (high-level charter/goals).
Experimentation reveals that using off-heap memory can provide a significant performance boost by minimizing the CPU usage of the Java garbage collector. The extra processing required to serialize/deserialize and store/fetch the data through sun.misc.Unsafe is offset by giving the garbage collector less Java heap memory to manage, and by not consuming the CPU cycles the collector would otherwise use.
4.1 Configuration
- off-heap-memory-size – new GemFire property to specify the size of memory in kilobytes, megabytes or gigabytes to allocate for off-heap storage
- lock-memory – new GemFire property that allows both heap and off-heap memory to be locked in memory
- off-heap – new Region attribute to specify usage of off-heap memory for Region entries
4.1.1 off-heap-memory-size GemFire property
Examples of valid values: 65536k (kilobytes), 4096m (megabytes), 120g (gigabytes).
4.1.2 lock-memory GemFire property
4.1.3 off-heap Region attribute
4.2 Slabs
Our implementation organizes the off-heap memory into slabs. Rather than expose another configuration detail to the user, we calculate the best possible slab size.
We believe that Integer.MAX_VALUE bytes (about 2 GB) is an acceptable maximum slab size. This makes 2 GB the maximum size of a single object or value.
For simplicity all slabs are the same size, so we choose a slab size that divides the total off-heap memory with negligible waste.
If the total off-heap memory is 2 GB or less, we will have one slab. If the off-heap memory is greater than 2 GB then there will be 2 or more slabs.
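The sizing rule above can be sketched as follows. This is a minimal illustration of the described behavior; the class and method names are ours, not the product's:

```java
// Sketch of slab sizing: equal-sized slabs, each no larger than
// Integer.MAX_VALUE bytes, covering the configured off-heap total.
public class SlabSizing {
    static final long MAX_SLAB_SIZE = Integer.MAX_VALUE; // ~2 GB per slab

    // Smallest slab count whose equal-sized slabs each fit in MAX_SLAB_SIZE.
    static int slabCount(long totalOffHeapBytes) {
        return (int) ((totalOffHeapBytes + MAX_SLAB_SIZE - 1) / MAX_SLAB_SIZE);
    }

    // Equal slab size; any remainder (fewer than slabCount bytes) is the
    // "negligible waste" mentioned above.
    static long slabSize(long totalOffHeapBytes) {
        return totalOffHeapBytes / slabCount(totalOffHeapBytes);
    }
}
```

For example, 1 GB of off-heap memory yields a single slab, while 5 GB yields three equal slabs of roughly 1.7 GB each.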
4.3 Off-Heap Memory statistics
|Statistic||Description|
|defragmentations||The total number of times off-heap memory has been defragmented.|
|defragmentationTime||The total time spent on the defragmentation of off-heap memory.|
|fragmentation||The percentage of off-heap memory fragmentation. Updated every time a defragmentation is performed.|
|fragments||The number of fragments of free off-heap memory. Updated every time a defragmentation is done.|
|freeMemory||The amount of off-heap memory, in bytes, that is not being used.|
|largestFragment||The largest fragment of memory found by the last defragmentation of off-heap memory. Updated every time a defragmentation is done.|
|maxMemory||The maximum amount of off-heap memory, in bytes. This is the amount of memory allocated at startup and does not change.|
|objects||The number of objects stored in off-heap memory.|
|reads||The total number of reads of off-heap memory.|
|usedMemory||The amount of off-heap memory, in bytes, that is being used to store data.|
4.4 Management and Monitoring (M&M) Support
4.4.1 JMX Changes
MemberMXBean has new attributes which expose several of the off-heap memory statistics for the member.
- getOffHeapDefragmentationTime - provides the value of the defragmentationTime statistic
- getOffHeapFragmentation - provides the value of the fragmentation statistic
- getOffHeapFreeMemory - provides the value of the freeMemory statistic
- getOffHeapObjects - provides the value of the objects statistic
- getOffHeapUsedMemory - provides the value of the usedMemory statistic
- getOffHeapMaxMemory - provides the value computed by freeMemory + usedMemory
If defragmentationTime is non-zero then the member has had to compact its off-heap memory. This may indicate that the member is close to running out of off-heap memory, and it shows how much time the GemFire XD process is spending on compaction. Does this help with capacity planning or analyzing usage patterns? Highly variable row sizes will likely result in more compacting than uniform row sizes.
Because FreeMemory + UsedMemory always equals MaxMemory, the maxMemory statistic is not exposed directly; getOffHeapMaxMemory computes the value instead (it would never change anyway). Note: the MemberMXBean operation listGemFireProperties includes off-heap-memory-size, which is also the same as MaxMemory.
Objects shows the number of GemFire objects stored in the member's off-heap memory.
In theory, Fragmentation would be a valuable attribute to monitor. An excessively high Fragmentation percentage indicates that the off-heap memory is fragmented enough to be difficult to fill up or utilize. This means the member might become critical (see the docs about ResourceManager) and throw LowMemoryException. It is also possible that a member may throw OutOfOffHeapMemoryException, which causes the cache to close in order to maintain consistency across the cluster. The latter is a last resort; each member will first use the off-heap eviction and critical thresholds on the ResourceManager to try to prevent data inconsistency or member loss. This may deserve further discussion (it's complicated).
RegionMXBean listRegionAttributes includes a new boolean attribute that indicates if the region is configured to use off-heap memory or regular Java heap memory.
- one of the attributes returned by the listRegionAttributes operation is offHeap which will have a value of true if that region is configured to use off-heap memory (GFXD: was enableOffHeapMemory)
4.4.2 Pulse Support
Changes made to Pulse for GemFireXD off-heap will need to be ported, with minor changes if any. If possible, we should expose all the off-heap attributes on MemberMXBean, including the new getOffHeapMaxMemory.
Each region using off-heap should also be flagged as an off heap region. The attribute was "enable-off-heap-memory" in GFXD, but will be "off-heap" in GemFire.
4.4.3 gfsh Changes
- alter disk-store: new option "--off-heap" for setting off-heap for each region in the disk-store
- create region: new option "--off-heap" for setting off-heap
- describe member: now displays the off-heap size
- describe offline-disk-store: now shows if a region is off-heap
- describe region: now displays the off-heap region attribute
- show metrics: now has an offheap category. The offheap metrics are: maxMemory, freeMemory, usedMemory, objects, fragmentation, and defragmentationTime
- start server: added --lock-memory, --off-heap-memory-size, --critical-off-heap-percentage, and --eviction-off-heap-percentage
4.5 Design Limits and Constraints
Discuss minimum and maximum configuration and usage. Most of the limits are due to the maximum sizes of the primitives used, such as an integer for the number of slabs or a long for memory address pointers.
4.5.1 Maximum Data Size
The maximum off-heap data size is 2 GB, but this includes implementation-specific meta-data, which lowers the maximum serialized user data to somewhat less than 2 GB.
4.5.2 All Members Must Configure Off-Heap Memory Enabled Uniformly
We do not allow two members to host data for the same Region with different values for off-heap enabled. We stepped through some scenarios, and the ResourceManager and LRU settings simply did not work properly unless the members were configured the same way. TODO: this constraint may change for rolling a region to off-heap
4.6 Value Equality Checks
Equality checks are done without deserializing the value stored in off-heap memory. If the value is serialized using PDX then PdxInstance equals is used for comparison; otherwise the raw serialized bytes are compared. This means that custom equals() implementations will not be called when checking off-heap values for equality. Value equality checks are done by Region.replace(K,V,V) and Region.remove(Object, Object).
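The non-PDX comparison semantics can be illustrated with a minimal sketch. The method name is hypothetical, and real byte arrays would come from GemFire serialization; the point is that equality is defined by the serialized form, never by the domain class's equals():

```java
import java.util.Arrays;

// Illustrative only: models the byte-level comparison rule described above.
public class OffHeapEquality {
    // Non-PDX path: two values are "equal" exactly when their raw serialized
    // bytes match, so a custom equals() on the value class is not consulted.
    static boolean serializedEquals(byte[] stored, byte[] candidate) {
        return Arrays.equals(stored, candidate);
    }
}
```

A consequence: two objects that compare equal via a custom equals() on-heap may still fail Region.replace(K,V,V) if their serialized forms differ.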
4.7 Off-Heap Object Details
Off-heap memory performs best with uniform value sizes, as this minimizes the impact of fragmentation. Our defragmentation strategy does not move live data, so usage patterns involving mass creates followed by mass destroys (if destroying at all) will also minimize the effects of fragmentation.
The worst usage patterns are those that create and then perform updates with many different value sizes without ever destroying or clearing. This may result in severe fragmentation that our defragmentation algorithm will struggle with.
Defragmentation involves moving address spaces to a free list from which future creates will pull. There are 16,384 free lists, one for each value size that is a multiple of 8 bytes, which allows for a maximum size of 8 × 16,384 = 131,072 bytes. Beyond this the implementation uses a ConcurrentSkipList for the free list of larger chunks. Contiguous entries in a free list can be combined. The defragmentation of free spaces involves removing them from specific free lists so they can be used for any size.
The number of free lists for values smaller than what's ultimately stored in the ConcurrentSkipList can be changed by setting the System property gemfire.OFF_HEAP_FREE_LIST_COUNT. We do not intend to expose this to the customer and will only use this to enable testing and profiling for determining the best number of free lists.
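Assuming the free lists are keyed by 8-byte size increments as described above, selecting a free list for an allocation request might look like the following sketch (names, constants, and the rounding detail are illustrative, not the product's implementation):

```java
// Sketch: map a requested chunk size to one of the fixed-size free lists,
// or report that it belongs in the skip-list of larger chunks.
public class FreeListIndex {
    static final int FREE_LIST_COUNT = 16384; // default free-list count
    static final int GRANULARITY = 8;         // one list per 8-byte increment
    static final int MAX_LISTED_SIZE = FREE_LIST_COUNT * GRANULARITY; // 131,072

    // Returns the free-list index for a chunk size, or -1 when the size is
    // large enough to be tracked in the skip-list structure instead.
    static int indexFor(int chunkSize) {
        int rounded = (chunkSize + GRANULARITY - 1) / GRANULARITY * GRANULARITY;
        if (rounded > MAX_LISTED_SIZE) return -1;
        return rounded / GRANULARITY - 1;
    }
}
```

So an 8-byte chunk maps to list 0, a 9-byte chunk rounds up to the 16-byte list, and anything over 131,072 bytes falls through to the skip-list.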
The on-heap foot-print is a long pointing to an absolute memory address for the off-heap value. The first 8 bytes of the off-heap value contain meta-data about the value, including its size.
Some values stored in an off-heap region will not use any off-heap memory. If the serialized size of the value is smaller than 8 bytes then it will not consume any off-heap memory.
Also, if the value is an instance of java.lang.Long whose most significant byte is just an extension of the sign bit (that is, the most significant 9 bits are all 1s or all 0s), then it will not consume any off-heap memory. Such values are stored in the primitive long field that is normally used to reference off-heap memory. Since this primitive field is part of the heap overhead of an off-heap region, the value consumes no off-heap memory. Keep this in mind when estimating how much off-heap memory is needed and when viewing off-heap statistics.
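The "most significant 9 bits are all 1s or all 0s" test can be expressed as a sign-extension round trip. This sketch is illustrative and deliberately ignores any tag bits the real encoding reserves to distinguish an inlined value from a genuine off-heap address:

```java
// Sketch: a long qualifies for inlining in the on-heap reference field when
// its top 9 bits are uniform sign extension, i.e. the value fits in 56 bits.
public class InlineLong {
    static boolean fitsInline(long value) {
        // True iff the value survives truncation to 56 bits followed by
        // arithmetic sign extension back to 64 bits.
        return (value << 8) >> 8 == value;
    }
}
```

Small magnitudes (including all negatives close to zero) qualify, while values needing the full 64 bits, such as Long.MAX_VALUE, do not.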
TODO: determine if we will eventually move to off-heap keys because this will impact how we compare two keys for off-heap values.
5 Feature Interactions
5.1 Callback Events (EntryEvents)
Callback Events (EntryEvents) are not fully usable beyond the scope of the callback on an off-heap region. Invoking methods that acquire values after the callback returns may throw an exception indicating that the value is no longer available: IllegalStateException("Attempt to access off heap value after the EntryEvent was released.").
If the user needs to save a reference to the EntryEvent in a variable for use after the CacheListener callback returns, they must instead create an on-heap copy of the value to ensure that the value remains available to them.
For example, invoking com.gemstone.gemfire.cache.EntryEvent#getNewValue() is only safe while the thread is executing within the com.gemstone.gemfire.cache.CacheListener#afterUpdate(EntryEvent) method or another callback method. The EntryEvent methods getOldValue(), getSerializedOldValue(), and getSerializedNewValue() are similarly affected.
5.2 ResourceManager
Existing ResourceManager thresholds still apply because off-heap values have a heap foot-print, which could trigger the existing thresholds.
We will also introduce new ResourceManager support for off-heap memory thresholds (eviction and critical) which will trigger thresholds specific to off-heap memory. The eviction threshold will support all eviction actions. The critical threshold will result in LowMemoryExceptions thrown when attempting write operations on an off-heap table/region.
We need two sets of thresholds because a GemFire node may have a mix of regions that use off-heap entries and regions that use heap entries.
The ResourceAdvisor, profiles and message will require modification to include off-heap memory in addition to Java heap memory. Low memory messages will be sent separately for heap and off-heap memory status changes. Regions using off-heap will then throw LowMemoryException when either heap or off-heap is critical. Regions using heap will throw LowMemoryException only when heap is critical.
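The combined threshold rule described above can be summarized as a small predicate. This is a sketch of the stated behavior, not product code; the names are ours:

```java
// Sketch: which critical-memory states make a region reject write operations
// with LowMemoryException.
public class LowMemoryCheck {
    static boolean throwsLowMemory(boolean regionIsOffHeap,
                                   boolean heapCritical,
                                   boolean offHeapCritical) {
        // Off-heap regions still carry an on-heap foot-print, so they are
        // sensitive to both thresholds; heap regions only to the heap one.
        return heapCritical || (regionIsOffHeap && offHeapCritical);
    }
}
```

Note the asymmetry: a heap-only region is unaffected by an off-heap critical state, but an off-heap region reacts to either.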
5.3 Compression
Compression can be used with an off-heap region.
5.4 Delta Propagation
Every time a delta is applied, the member must deserialize the off-heap value, apply the delta, and re-serialize the value back to off-heap. So off-heap with delta does much more work than heap with delta (heap keeps the object form in memory, letting it apply the delta without a deserialize and serialize).
5.5 Querying/OQL
Use with querying/OQL will probably cause frequent bulk copying to the heap.
5.6 Indexes
Off-heap regions with indexes are supported. In the first implementation the index data for an off-heap region will be on the heap, so heap usage will be lower but can still be high depending on the type and number of indexes.
- hash index - supported
- primary key index - supported
- functional index
  - range index - problematic since it copies the value. Since it would copy all the off-heap memory to heap, we decided not to support it. If a customer tries to create one on an off-heap region it will throw an UnsupportedOperationException. This is what the product already did for these indexes on an overflow region. At some point in the future we may add support for these indexes to off-heap.
  - compact range index - supported
5.7 Security OperationContext
Note that not all operations done on an off-heap region will keep the values in the OperationContext off-heap. Some of them will copy the value onto the heap before passing it to the OperationContext.
5.8 Result of Region.put, Region.destroy, Region.remove, and Region.Entry.setValue
The result of these operations is sometimes null even if an old value existed. The reason is better performance for the normal case where the caller does not want the old value. These operations will now also return null if the region is off-heap and the old value was stored in off-heap memory and was not compressed.
5.9 PartitionAttributes localMaxMemory
The default value of localMaxMemory for an off-heap Partitioned Region will be 100% of off-heap-memory-size instead of 90% of the heap memory size.
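A sketch of the default selection, assuming sizes are expressed in megabytes. The helper is hypothetical; the product computes this internally:

```java
// Sketch: default localMaxMemory for a partitioned region depends on
// whether the region is configured for off-heap storage.
public class LocalMaxMemoryDefault {
    static long defaultLocalMaxMemoryMB(boolean offHeap,
                                        long offHeapMemorySizeMB,
                                        long heapSizeMB) {
        return offHeap ? offHeapMemorySizeMB     // 100% of off-heap-memory-size
                       : heapSizeMB * 90 / 100;  // 90% of the heap size
    }
}
```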
General requirements and purpose for the off-heap memory feature:
- perform no worse than Java heap for JVMs hosting a given volume of data
- allow certain features to have worse performance: deltas, querying/OQL
- primary purpose is to increase data density while minimizing GC overhead
- allow the customer to use more than 70% of memory, with sustained maximum CPU usage, without risking GC pauses
- manage 50 GB of customer data in a single member - initial experiments show we can surpass this
- manage 10 TB of customer data in one GemFire cluster - initial experiments attained 1 TB with 8 members
- emphasis on primary usage pattern with cycles of many creates followed by clear (or many destroys)
- ensure that all GemFire features work even if performance is compromised for certain features such as deltas
- ensure capability of scaling to 50 GB of user data in a single member and 10 TB in one cluster
In general we recommend against combining the use of off-heap memory with the following due to performance and memory overhead implications:
- do not use with deltas
- do not use with Functional Range indexes
- do not use with querying/OQL
The following conditions result in significant fragmentation and would make use of off-heap memory questionable:
- value size is non-uniform
- value sizes vary dramatically
- usage pattern involves many updates instead of creates & destroys/clear
- usage pattern involves updates that result in significant changes to the value size
We do recommend using off-heap memory for all or any of the following:
- do use if value size is uniform
- do use if usage pattern involves cycles of many creates followed by clear (or many destroys)
- do use if customer needs to scale to very large memory
TODO: We need to determine the heap foot-print of off-heap values and then calculate some meaningful guidelines for sizing GemFire servers, a reasonable minimum overall off-heap-memory-size, minimum entry value size (for example: 3 byte value sizes would not be a good match for off-heap), etc.
|type||32-bit JVM||64-bit JVM (< 32 GB Java heap)||64-bit JVM (> 32 GB Java heap)|
|ordinary object pointer||4 bytes||4 bytes (compressed OOPs)||8 bytes|
|object header||8 bytes||12 bytes||16 bytes|
|primitive long||8 bytes||8 bytes||8 bytes|
|object Long||16 bytes||24 bytes||24 bytes|
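As a worked example from the table above, compare the per-entry on-heap cost of referencing a long value as a boxed java.lang.Long object versus the primitive long field an off-heap region uses, on a 64-bit JVM with compressed OOPs (< 32 GB heap). The class and constants below are illustrative only:

```java
// Worked example: on-heap cost of holding one long per entry, taken from the
// size table (64-bit JVM, compressed OOPs).
public class FootprintExample {
    static final int OOP = 4;         // compressed ordinary object pointer
    static final int BOXED_LONG = 24; // java.lang.Long instance

    static long boxedCostBytes(long entryCount) {
        return entryCount * (OOP + BOXED_LONG); // pointer + Long object
    }

    static long primitiveCostBytes(long entryCount) {
        return entryCount * 8;                  // primitive long field only
    }
}
```

For one million entries that is 28 MB boxed versus 8 MB primitive, which is part of why the off-heap reference is kept as a primitive long.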