To be Reviewed By: February 23th, 2022
Authors: Alberto Gomez
Status: Discussion
Superseded by: N/A
Related: N/A
Problem
Geode supports the storage of data in off-heap memory by using the Java Unsafe class.
At startup, each Geode server with off-heap memory configured creates several off-heap fragments (each with a maximum of 2GB of size) so that any request to allocate an object could get a piece of any of the available fragments to store the object.
A list of tiny and huge memory chunks is also maintained by Geode so that off-heap memory freed when deleting an off-heap object will be put in one of these two lists (depending on the amount of allocated memory) for further allocation requests of objects of the same size.
The possibility to store an object in off-heap memory not only depends on the amount of free off-heap memory available but also on how fragmented the off-heap memory is, i.e. what is the size available in each of the fragments, how many elements are there in the tiny and huge lists and what is the size of each.
Depending on how fragmented the off-heap memory is, it might be that objects of a certain size will not be able to be stored in it no matter what the total available off-heap memory is.
Even though Geode offers several stats related to the status of the off-heap memory area (usedMemory, freeMemory, fragmentation, largestFragment and fragments), the ones that provide information about the level of the fragmentation of the off-heap memory (fragmentation, largestFragment and fragments) are only updated when defragmentation is executed.
Given that the defragmentation of the off-heap memory area is only run -automatically- whenever a request to store an object cannot be satisfied, the fragmentation status (given by the above stats) is only known right after the server has been started (no fragmentation) or right after the memory has been defragmented.
Running defragmentation can be a very CPU intensive task (and Java GC demanding) and it blocks incoming off-heap memory allocation requests. Therefore, being timely aware of the fragmentation status of the off-heap memory can allow Geode users to plan for measures to avoid an automatic defragmentation to kick in at an inconvenient time. A possible measure could be to restart the servers to defragment the off-heap memory at a convenient time.
Anti-Goals
This RFC does not propose to add the possibility to run off-heap defragmentations on demand or to improve the defragmentation algorithm. Nevertheless, some ideas in that direction will be proposed in another RFC.
Solution
This RFC proposes to make timely visible the off-heap fragmentation status by:
- Updating the largestFragment stat periodically.
- Adding a new off-heap stat called "freedChunks" that will provide the number of elements in the tiny and huge lists. This stat will be periodically updated.
Ideally, the two stats above should be updated in real time. Nevertheless, given that the calculation could be very costly, it is proposed that it is done periodically, in a background thread, in order to limit the CPU consumption and to not affect the latency of allocations and deallocations.
These stats would be available via JMX and gfsh.
These stats could be described as follows in the Geode documentation:
- largestFragment: Largest off-heap memory fragment that can be used to allocate an object. When an object is allocated from a fragment, the size of the fragment is decreased with the size of the object plus some heading and possibly some alignment bytes.
- freedChunks: Number of off-heap memory chunks that have been freed since the last defragmentation and that are not currently being used to store an object in the off-heap memory space.
There are two types of chunks depending on the size of them: tiny if the size is lower than OFF_HEAP_ALIGNMENT* OFF_HEAP_FREE_LIST_COUNT (default = 512K) and huge if the size is greater.
These chunks can also be used to allocate objects in the off-heap but the size of the freedChunk must comply with some conditions with respect to the object to be allocated. In order to allocate an object of tiny size between X and X-OFF_HEAP_ALIGMENT-1, a tinty freedChunk of size X must be found. In order to allocate an object of huge size between X and X-256-1, a huge freedChunk of size X must be found.
Other information that could be added to the off-heap documentation could be:
- When an off-heap object is freed, a freedChunk entry is created in either the tiny or the huge list depending on the freed memory.
- When defragmentation is run, Geode takes all freedChunks and remains of fragments and creates a new set of fragments by concatenating adjacent freedChunks and fragments.
Changes and Additions to Public Interfaces
A new off-heap stat will be available: freedChunks
This stat and the current largestFragment one will be available via JMX and via gfsh.
Performance Impact
No significant performance impacts are foreseen.
Backwards Compatibility and Upgrade Path
No impacts
Prior Art
N/A
9 Comments
Darrel Schneider
Is "freedChunks" just the number of elements in the free lists?
When you update "largestFragment" periodically would it just report the largest item in a free list or would it actually do a "dry run" of defragmentation and report the largest fragment a defrag could produce right now?
I don't think the size of the free lists tells us anything about fragmentation. It might be that all the memory that was allocated has been freed. And if we defragment we will be back to one 2G chunk. In that case we have 0% fragmentation even though the free lists might have a large size. Fragmentation is introduced when some chunks of memory remain allocated and they are in between freed chunks.
Could we do a periodic dry run defragmentation in the background that just updates the existing fragmentation stats? It would only run if an actual defragmentation has not been done for the last X seconds. It would be nice if we could find a way to safely scan the free lists to do this without locking them but this might be hard to do.
Alberto Gomez
Yes, "freedChunks" would be the number of elements in the free lists.
I thought at first about doing the "dry run" of defragmentation to update the stats but discarded it for two reasons:
I do think the size of the free lists gives you useful information about the fragmentation level. It is true that after running defragmentation you will have a clearer picture but if we do not have this stat, we are blind before defragmentation is run, which is only run when you cannot allocate.
I do not see the value of the "dry run" approach because, as I said, above, you will not be in the state given by the "dry run" until you defrag.
Darrel Schneider
Would "largestFreeChunk" be a better name than "largestFragment"?
A possible performance impact is that computing these stats in the background could end up consuming an entire CPU (if the computation is done with a single thread). Currently we don't store the size of each freelist so to compute this in the background you will need to use something like OffHeapStoredObjectAddressStack.computeTotalSize but that code keeps detecting concurrent mods and restarting. So to do this while freelists are being actively used could actually get starved and consume 100% CPU without ever completing. So I think we would need to have a better way to estimate the size of the free lists.
I still want to understand how these two stats would be used to detect fragmentation? It is possible to have an app that allocates all off heap memory into a bunch of objects of size 32 and then frees them all (causing a very high freeChunkCount and a largestFragment of 32). If you looked at these new stats would that tell you something useful? That app might then turn around and allocate all these 32 byte chunks from the free list. So no allocations would fail.
So could you write up the use cases that you see these two stats being helpful with and how you would actually use them in a deployed system?
Alberto Gomez
I would prefer to keep "largestFragment" because "largestFreeChunk" could lead you to think that the elements in free lists could also be included to give you the largest fragment/free chunk. But we know that the elements in the free lists are not as useful as the the free memory of a fragment because elements in free lists can only be reused by objects of the same size, they cannot be partially used and the remaining part left to allocate another object.
In order decrease the consumption of CPU when calculating the size of the tiny free list I propose to add the size to it in a member variable that will be updated every time you add or remove an element so that it would be available for the stat. I have created a draft pull request with the implementation of the ideas in this RFC: https://github.com/apache/geode/pull/7384). Also, the idea would be to update the stat not very frequently. Maybe every hour by default.
It is true that there are cases in which these stats would not be useful to detect future problems in allocation depending on the pattern of modifications done and the future use of the memory.
Regarding the use cases in which these two stats would be useful here come some examples:
The operator could set a threshold on the value of "largestFragment" so that when it goes below a certain value he could plan for a controlled restart of the members at a convenient time in order to perfectly defragment the off-heap memory.
What would be the use of "freedChunks" then?
In a situation in which "freeMemory" and "largestFragment" are the same, a higher value of "freedChunks" would indicate that the available memory would be more difficult to use, i.e. we would be in a higher risk to get an out-of-memory error because these freed chunks are less flexible (can only be used to store objects of the size of the chunk and cannot be split).
As a result, an operator could have heuristics that according to first, the value of "largestFragment" and then the value of "freedChunks", decide to plan for a restart of the servers or even to scale memory of the system.
Darrel Schneider
I agree that largestFragment is a better name. It would help if you added to the RFC actual definitions of the new stats and that would allow you to define "fragment".
Some minor corrections to some of your statements (in case they make it into documentation).
You say: "If the "largestFragment" value is lower than the object size you will get an out of memory error."
You will only get the out of memory error if defragmentation (which will be done by the thread trying to allocate) does not produce a fragment big enough.
It seems like if you always restarted a member when largestFragment gets low then you never give online defragmentation a chance. But I can see that you may want to operate that way to prevent defragmentation pauses. You would need to have redundancy but the challenge would be that the largestFragment may get small at the same time on multiple members. I guess the operator would just need to be careful to only do one restart at a time.
You say: "If there is not a free list element that is the same size of the object"
The "same size" is not quite correct. We have two types of free lists: tiny and huge. For a tiny list its items will be used to allocate objects in the range FLS-(OFF_HEAP_ALIGNMENT-1)..FLS where FLS is the Free List Size. OFF_HEAP_ALIGNMENT defaults to 8 but can be configured with a system property. So basically if you have a free lists of chunk whose size is 16 then allocations of size 9..16 can use that list. The unused memory is still allocated but never used for that allocation. This is called internal fragmentation. So the bigger OFF_HEAP_ALIGNMENT is the more internal fragmentation is allowed but does make it more likely that variable size allocations will be able to reuse an item in the free list.
For huge free lists (chunks are all bigger than OFF_HEAP_ALIGNMENT*OFF_HEAP_FREE_LIST_COUNT which by default is 520k) the internal fragmentation allowed is 256 bytes (instead of OFF_HEAP_ALIGNMENT).
As far as the usefulness of the new "freedChunks" goes have you considered not adding it? It seems like have every allocation and free also update a size will be the biggest performance impact of this. And from you description it seems much less useful than the largestFragment stat (which is very easy to compute). I guess I mainly see it as a good indicator of how expensive the next defragmentation will be since more free chunk require more work during defrag.
Alberto Gomez
Thanks for the clarifications. I was not very strict in my explanations in order not to complicate them but I appreciate your attention to the detail and the rigorous descriptions. I have added some definitions to the RFC based on the information you have provided.
While it is true that the "largestFragment" stat would be the most important piece of information in order to make a decision on when to plan for a defragmention, I still think the "freedChunks" provides useful information as well. It is not the same having a "small" value of "largestFragment" if "freedChunks" is zero or close to zero than when it has a very high value. Also, following the evolution over time of "freedChunks" could help us see if our application is able to reuse the "freedChunks" or not.
Regarding the impact in performance, bear in mind that for every allocation and free, the extra overhead would only consist of incrementing or decrementing an AtomicInteger member variable in OffHeapStoredObjectAddressStack.
Alberto Gomez
Darrel Schneider Do you feel we need more discussion on the RFC or should I go ahead with the implementation?
Darrel Schneider
I looked over your draft PR and it looks like adding a length for each tiny free list will not impact performance or add that much extra memory overhead. All the places that needs to change the length are already synchronized on "this" and doing an extra inc or dec inside that sync shouldn't hurt performance too much.
I would say you should not use an AtomicInteger since that uses extra memory and all your mods are inside a sync anyway. So instead just add a "private volatile int length" field. This will allow you to read the value without syncing (just like the atomic did) and reduce how much memory is used by each tiny free list.
So I don't need anymore discussion. Thanks for all your quick responses to my questions!
Alberto Gomez
Thank you!