Geode is a distributed data management system that runs on multiple machines and networks, and uses CPU, Memory, and Disk and network bandwidth to provide real time data access to applications. In the remainder of this article, whenever we refer to resource management, we are referring to the effective use of CPU, Memory, Disk and network bandwidth in the Geode cluster. Geode is a very dynamic system with the ability to add capacity on the fly. It contains a group membership system that allows it to react to membership changes in the system. In order to realize end to end Quality of Service (QoS) guarantees, Geode needs to constantly monitor the use of available cluster resources and make load balancing and resource allocation decisions on the basis of resource usage. The same information is made available to administrators and end users through management and monitoring mechanisms.
A Geode distributed system is comprised of processes running on a number of machines, communicating with each other, with stateful message exchanges among the members. (For example, a system may define multiple server-groups with each group defining a set of data regions, such that updates to a region in this group are replicated to other members of the group. Another example of communication would be clients connecting up to servers and establishing subscriptions on the servers). This implies that actions of one member of the system will have an impact on other members of the system and it is important to ensure that resources are available for each member to perform their functions in order to provide end-to-end QoS to the application cluster accessing Geode. Resource management tries to prevent problems from developing in specific members by moving and reassigning data or workload from heavily loaded members, and also providing the basis for making decisions to isolate or remove slow members to ensure that overall system functioning stays unaffected.
As briefly discussed in earlier sections, CPU, Memory, Disk and Network bandwidth are the four crucial resources that are critical to providing QoS in Geode. Of these, Geode uses memory extensively to provide its data management capability. In addition to the data that is stored in Geode, indexes defined on regions to speed up queries, intermediate results generated by queries, asynchronous queues that hold results of application subscriptions, transient memory usage for doing Geode operations and book-keeping, and metadata to speed up access to the data and optimize data distribution all contribute to the use of memory in Geode. Memory management is the key to ensuring that Geode can provide continuous availability, high throughput and low latency across the cluster. The next section provides more information on the various knobs provided by Geode to manage resources in the system.
In this section, we provide specifics on how Geode implements resource management for the resources involved in a Geode Distributed system.
Geode manages data in memory (with the ability to overflow or persist to disk based on configuration) and Geode cache servers are built on Java which means that cache servers are subject to garbage collection. A detailed explanation of how Java manages objects in VM memory is beyond the scope of this article, but it is important to note that the memory space in a VM is divided into different pools and objects get promoted from one pool to the next (get copied from one area of memory to the next) depending on how long they live. The longer the object survives, the more expensive it becomes to reclaim the memory associated with it when it is no longer in use. Replacing older objects with newer objects can exacerbate memory problems, but at the same time, you do want to remove objects that have not been accessed in a long time to keep memory footprint under control. Geode uses object pools wherever possible to reduce the amount of garbage created and ensures that temporary memory used is released quickly allowing it to be garbage collected cheaply.
Geode allows each cache server to specify data policies to control the amount of data stored in the server. If eviction is used, then it automatically implies that Geode is being used as a cache and if an entry cannot be found in the cache, there is an alternate data store from which it can be loaded. Read more about eviction here.
Depending on the number of servers and the partitioning policies used to partition the data, it is possible for some servers to end up hosting more data than other server. A server may end up hosting more data to provide redundancy for another server that might have been taken down for maintenance perhaps. Rebalancing allows data to be migrated from one server to another, ensuring that each server is only handling as much data as it is capable of dealing with. This allows each VM to control its memory footprint (as well as data workload). For more on rebalancing, read Rebalancing Partitioned Region Data.
The Geode resource manager:
The Geode resource manager works closely with the server VM's tenured garbage collector to control heap usage and ensures that the server remains responsive under heavy load conditions and more importantly ensures that the server cluster does not suffer any QoS degradation due to the load on one of the servers. For more about the Geode resource manager, read How the Resource Manager Works.
Geode regions store entries which are comprised of keys and values. Values can be large nested complex objects which take up a lot of space on the server heap. Geode regions can be configured to overflow values to disk to control the memory footprint within the server. For more about overflowing entries to disk and configuring regions with the disk attributes for overflow, read Disk Storage.
Managing memory footprint of client subscriptions:
In the client-server topology, Geode client applications establish subscriptions with the server and qualifying updates are queued up for delivery to clients. These queues contain references to the entries that get dispatched to the client. Client queues can be configured to overflow to disk to reduce the memory footprint of the cache server. The implementation ensures that there is no unnecessary memory usage while fetching the entries from disk and delivering them to clients. Read more about client subscriptions here.
Geode cache servers can be configured to use TCP, reliable unicast or reliable multi-cast as the communication protocol for exchanging metadata, as well as distributing data across the server cluster. Client applications use TCP to communicate with servers. Clients generally connect to multiple servers depending on their data access needs and also to ensure redundancy for their subscription queues. Server clusters that use TCP to communicate with each other also pool their connections and use idle time algorithms to return connections to the pool or close connections when they are not needed. For more on connection management within the cluster, read Topologies and Communication.
The goal with connection management and connection timeouts is to scale and actively manage the amount of workload that needs to be handled by each cache server. Each server can be configured with a maximum client limit. In addition to this, Geode allows customers to configure a custom load probe, which essentially reports the overall load factor on a server VM. A custom load probe implements a pre-defined interface and the information it provides is used to redirect existing and new client connections to servers that are lightly loaded to ensure better QoS. Clients pool their connections to servers as well, leasing them out to application threads for accessing data on the server and returning the connections to the pool when the application thread has completed its data access. Pooled client connections can be configured to expire, which gives the server cluster the opportunity to rebalance connections and ensure that each server is able to service its clients optimally. Last but not the least, the ability to efficiently serialize objects over the network using a language neutral wire protocol and transmit object deltas to receivers limits the amount of bandwidth use and allows applications to make effective use of this resource in conjunction with the other resources that impact application QoS. The DataSerializable protocol in Geode provides an extremely byte-efficient representation of the object on the wire reducing bandwidth consumption when the object is transmitted across machine boundaries. Efficient wire representation also reduces CPU usage when the bytes are sent on the wire. In most real world deployments, when an object is frequently updated, there is a small part of the object that gets updated (like price on an object representing a stock symbol). The ability to identify what part of the object was changed, and then serialize just those changes and apply them elsewhere, without compromising on data consistency (aka Delta Propagation) further optimizes the wire protocol, reduces garbage generation, improves bandwidth utilization and reduces CPU usage.
The Geode cluster provides distributed data management to applications at very high levels of throughput and extremely low latencies. It receives updates to data from many different sources and needs to propagate them to the relevant cluster members to ensure data consistency to applications. All of this makes each server a very parallelized entity that makes heavy use of multi-threading to do its work. Geode makes efficient use of thread pools and queues to deliver some of the key benefits of Staged Event Driven Architectures (viz. High Concurrency, Load conditioning, scalable I/O and so on). In a later section, we will see how resource management for each managed resource in a cache works to ensure optimal performance of each cache server in the cluster.
Data stored in Geode can be overflowed to disk (to control the memory foot print of the VM holding the data) and persisted to disk as well (to handle high availability and recovery in the event of system failure). Writing to disk offers reliability at the cost of performance. Disk space is treated as a managed resource in Geode and we make efforts to maximize disk performance by providing various configuration options to limit disk usage. Geode regions that are configured to overflow or persist can configure disk artifacts to go to multiple disk directories that may reside on different spindles. For each disk directory, a maximum size can be specified. Writes to disk can be synchronous or asynchronous depending on the trade-offs the application wishes to make between performance and reliability. In addition to this, disk files (also known as Oplogs) can be periodically rolled to reduce the number of files used by Geode concurrently. For more on disk attributes, disk tuning and persistence, read Disk Storage.
In previous sections, we touched upon the various resources that need to be managed by Geode and provided details on how Geode accomplishes the same. As you may have recognized by now, imbalance in one resource can exacerbate issues with other resources and can cause performance degradation and an inability to provide QoS to end user applications accessing the cluster. A server that uses up too much memory will get hit by garbage collection pauses which will cause clients connected up to it to fail over to other servers, causing unnecessary network traffic and increase resource usage on the other servers, increasing CPU and potentially disk usage on them. In order to prevent problems of this nature, it is important to do proper capacity planning for your Geode installation. You can find more on capacity planning here. This developer note also provides some practical guidelines for capacity planning in Geode. Function execution and querying are two Geode capabilities that impact performance and need to be considered as part of capacity planning. Function execution and querying can use up CPU, cause message traffic on the network and generate transient data causing the garbage collector to get engaged. Unlike capacity planning for data volumes and number of clients that access the cluster, function execution and querying are harder to plan and predict because the functions execute in the server process and queries run on the server and both of these can do arbitrarily complex things, the process for adding queries and functions should be done just like queries are approved for a traditional relational database. When a new query needs to be added to a relational database, the query has to be approved by the database adminstrator, who considers the impact of the query on the overall system, adds new indexes as appropriate, creates views if needed and only then is the query added to the system. With function execution, if the function does a lot of updates, then it need to be called out as such so that it can be routed efficiently. Also, if a function needs to access related regions, then it might make sense to co-locate those regions to improve execution times on the function and also to reduce network round trips to fetch any required information.
'Stop the world' garbage collection pauses are one of the biggest culprits that cause QoS degradation in Java based systems (including Geode) and tuning the VM to eliminate or reduce such pauses can help alleviate the problem considerably. For more on garbage collection, please read this note. In this document, we have tried to highlight the need for resource management and also explained how Geode attempts to manage resources to ensure predictable QoS across the cluster. Through examples and scenarios, it should be clear that over and above the attempts made by Geode to do effective resource management, the application architect has a significant role to play in ensuring that availability and prudent use of system resources to allow Geode to provide the high throughput, low latency and high availability across the cluster. The application architect needs to consider data partitioning and colocation strategies along with capacity planning and provisioning at the design and deployment stage of a Geode installation, keeping in mind the QoS the system is expected to provide to applications.
Along with these, GC tuning and run time monitoring of the system are things that will ensure that the system is able to meet or exceed its QoS guarantees.