We must replace the old JGroups build in Geode with something new due to licensing restrictions. The old JGroups had a LGPL license, which is incompatible with Apache 2.0 licensing. This document explains the requirements Geode has for a membership system and examines several alternatives for moving forward, concluding that we should replace JGroups with a custom Geode-centric solution.
How Geode uses JGroups
Geode’s dynamic membership model is based on the JGroups dynamic model provided by pbcast.GMS, where new nodes can be added at will without taking the distributed system down.
Geode uses the membership view primarily for replication and event delivery. For replication we form a subset of the membership view that is known to have a cache Region and label this a DistributionAdvisor. Our current replication scheme requires a return-receipt from any specified recipient as long as the recipient is in the membership view.
The membership view is also used for selecting an Elder for the distributed lock service. The Elder keeps track of who is allowed to grant lock requests, and it is the oldest non-admin member in the view.
Network partition detection is also custom built into the JGroups GMS protocol and the failure-detection protocols have been altered to let Geode feed “suspicion” into JGroups to help it figure out that failures have happened.
Peer authentication has been built into the 2.2.9 JGroups stack by introducing a custom AUTH protocol that intercepts Join requests and requires authentication checks to pass before allowing the request to reach the GMS membership protocol.
In the future Geode servers were also going to rely on JGroups for reliable UDP transmission of messages that are broadcast to the whole membership set, such as StartupMessage, ShutdownMessage, CreateRegionMessage and PDX registrations. Sending these messages over TCP/IP stream connections is a barrier to increasing the size of the distributed system, especially at startup time when we must create 4M of these connections (M=member count) just to join the distributed system.
Reliable UDP communication is also needed for out-of-band low-priority communication, such as sending alerts to management nodes. Creating TCP/IP connections to send alerts can block operations during periods when there are already bad things happening. We recently saw this in a large production system, where an alert that members weren’t acknowledging a membership view change blocked operations because the management node that was to receive the alert was sick and not accepting connections.
Geode integrates a JGroups GossipServer into the Locator service. GossipServer is used to provide information on who is in the distributed system when a new member is joining the distributed system.
Finally, Geode clients use the membership system's classes to form IDs, and these contain a JGroups IpAddress.
In brief, the membership service must
deliver notification of membership changes to the DistributedSystem’s MembershipManager
allow new members to join without taking down the system
provide identity for each peer in the distributed system and allow clients to have a similar identity. The identity must be unique for the peer and old identities should not be reused (at least not very quickly)
transmit information about each member’s DistributedSystem characteristics (VM type, DirectChannel port, Groups, Name, etc) in the member’s ID
efficiently and quickly detect loss of a member (failure detection)
support the notion of an Elder member for Geode’s Distributed Lock Service
support Geode’s model of handling network partitions (winning/losing partitions)
allow Geode to give advice on which members might be sick or out of action
support rolling upgrade (old members can’t rejoin once upgrade has begun & the service itself must support backward compatibility)
integrate with Geode’s authentication service and require authentication before allowing a new member to join
A UDP messaging services must
Be compatible with the membership service’s IDs (an ID from membership identifies endpoings in the UDP messaging service)
Support rolling upgrade (on-wire compatibility across releases)
Options for replacing JGroups v2.2.9
There are a number of options for us to choose from. Here are a few:
Move to a newer version of JGroups
Create a custom solution
One of the nice things about Geode is that membership management is dynamic and has no single point of failure. Its weakest link is the Locator services, where if all Locators take a nose-dive clients cannot get info about servers and new servers can't be added to the cluster until one of the locators is available again. Even if the locators are down the server cluster remains viable and available to clients that are already connected to servers.
For this reason a solution like Zookeeper seems inadequate. Users would have to configure Zookeeper clusters and make sure the cluster is configured so that servers have minimal risk of losing contact with it. Losing contact with the cluster would require a server to shut down. Zookeeper clusters are typically pretty small, so a Geode user with 200 servers might feel at risk when using even a large 7-node Zookeeper cluster. Zookeeper also doesn't answer requirements 7 through 11 or offer UDP messaging.
JGroups has evolved and solved a lot of problems that had to be fixed in the 2.2.9 copy currently in the Geode repository. However, in order to use it we would have to fork it and modify certain parts in order to answer requirements 4, 7, 9 and 10. We could fork only parts, such as GMS and the failure detection protocols but the View class needs to carry Credentials for authentication in order to be useful to Geode. If not used for cluster membership, JGroups might still be useful for reliable UDP messaging.
Akka looks promising and a lot of people are using it.
A custom solution that does not leverage other projects for clustering could also be implemented, especially if JGroups is used for reliable UDP messaging.
Examination of the options
In this section we'll look at each of the options and see how it might address the requirements.
A newer version of JGroups could be modified to fit the requirements, just as is being done with the initial, incubating, version of Geode.
New versions of JGroups use the Apache license so they are compatible with Geode and could replace the old 2.2.9 version. A new version of JGroups would be a natural fit for replacing the old version. A number of the changes made to the older JGroups for GemFire, such as out-of-band messaging and out-of-band messages are now available in off-the-shelf JGroups. The GossipServer service used in the Locator is now, sadly, missing but there is a GossipRouter service that might plug into the Locator instead. Other problems, such as the membership view being transmitted in a message header instead of the body have also been corrected (allowing for large membership views through body fragmentation). Message-handling executors have been added and the old up/down thread pattern has been removed. UUIDs are used for member IDs in preference to InetAddress:port (though these are still used for identifying endpoints associated with UUID identifiers).
The downside of using JGroups is that we have to fork it in order to insert our failure-detection SLAs, network-partition detection and handling, ability to shut down in all circumstances, etc. We largely rewrote failure-detection, but the GMS may need a lot fewer changes thanks to improvements made in JGroups over the years like simultaneous handling of joins & leaves.
We would also need to change JGroups to support rolling upgrade. The old 2.2.9 JGroups code was modified to use versioned serialization streams and include version information so that recipients of messages would know how to deserialize the messages and react to them.
If we continue to use and modify JGroups one other concern to note is that the code we inserted into the old LGPL version of JGroups is probably still covered by LGPL and can’t just be copied into a newer version to cover our needs. It will need to be written from scratch but can use the same algorithms as the old implementation.
If JGroups is not used for membership it could still be used for UDP communication, but rolling upgrade support might be an issue if the version of JGroups being used by Geode is ever changed.
Many Apache projects use Apache Zookeeper for collaboration between processes. Zookeeper provides collaboration primitives that we can use to build membership services, distributed locking and more.
If you haven’t been exposed to how Zookeeper is used you might take a look at this page: http://nofluffjuststuff.com/blog/scott_leberknight/2013/07/distributed_coordination_with_zookeeper_part_3_group_membership_example
Zookeeper meets requirements 1 through 6 in the list you saw earlier in this document.
Zookeeper can be used to join a distributed system by creating an ephemeral, sequenced z-node in the distributed system’s “group” node. Each z-node can contain additional details about the member, fulfilling one of the important requirements Geode has for a membership system.
Zookeeper gives its clients notification when a node changes. This can be used to tell members when a new peer has joined or left the system. Notification only lets you see the current state of the group z-node, so we have to keep a snapshot of the last state and do a diff to see what’s changed. Apparently it is possible for changes to the group node to be missed so there probably needs to be some periodic polling of the group node to check for changes in addition to the notification mechanism.
Zookeeper’s getChildren() method returns a list of nodes that has no particular order but if we use sequential z-nodes each node is assigned a sequential number that can be used to impose ordering and select an Elder for the Distributed Lock Service.
To detect crashed clients Zookeeper uses UDP heartbeats. This is notoriously fragile under heavy CPU load and is one reason I crippled the similar mechanism in JGroups 2.2.9 by making it defer to the decisions of the stronger tcp/ip stream connection-based FD_SOCK. I have no proof but suspect this will result in members being kicked out of the system under heavy load, transferring the load to other already-burdened servers...
There doesn’t appear to be a way for clients to raise suspicion about another client, so Geode could not hint to Zookeeper to check up on a member that’s not responding to messages. It does seem possible to delete another client’s node though, so that could be used to remove a problematic member. Another mechanism for doing a health check on that member would be needed in Geode before taking the drastic step of unilaterally kicking it out. Adding an external health checker, such as a phi-based system or a membership-ring system, is a possibility.
Zookeeper doesn’t deal with network partitioning but this could be handled when membership changes. Each member would have to decide whether now-missing members have crashed or not and perform loss calculations. Geode sends Shutdown messages, so this could be done.
Zookeeper does not include any form of send/reply messaging so we would still need some other solution for out-of-band messaging and broadcast messaging.
Zookeeper would not support the notion of blocking a member from joining based Geode version information. We would have to build this on top of zookeeper as this can cause distributed hangs for algorithms that are changing behavior in the newer version of Geode.
Zookeeper has a pluggable authentication system and ACL-based access control. It’s not clear to me whether it is compatible with the pluggable authentication system in Geode. It certainly wouldn’t interact with Geode’s authentication system directly. This, too, would have to be built on top of zookeeper.
The down-side to Zookeeper is that it is not a shared-nothing membership system like we currently have. Geode would be dependent on Zookeeper servers which must be deployed and managed, and if Zookeeper goes down the entire distributed system will become unavailable. It’s not exactly a single point of failure since Zookeeper has redundant servers, but it forms a static-membership core to Geode that can put a large distributed system at risk. If the Zookeeper servers are not reachable a node in a Geode Distributed System will have to shut down immediately since it can draw no other conclusion but that it is isolated from the primary partition. This could be a significant risk in a large distributed system. It’s akin to requiring locators to always be available.
Zookeeper might give us trouble in large distributed systems because access to zookeeper would be required for servers to remain up. We should consider using the Apache Curator project or perhaps parts of Helix that handle some of the problems people usually run into with Zookeeper and for UDP messaging we could use a JGroups channel having no membership or failure detection protocols.
So, to sum up, zookeeper could be used as the basis for a membership management service, replacing some of the functionality we currently have built into JGroups. We would have to implement a fair portion of what we need outside of zookeeper, and using zookeeper comes with some risks. We'd still need a different solution for UDP messaging.
Akka is used by Google Compute Engine and other projects for clustering. Google has posted that it achieved 1500 nodes in a cluster with stable performance using a simple application and fairly loose timeouts.
Akka clustering documentation can be found here: http://doc.akka.io/docs/akka/snapshot/common/cluster.html
Akka uses configured seed-nodes to join the cluster, which is compatible with Geode’s Locator discovery pattern, but they have not solved the concurrent-startup problem for seed nodes that Geode has licked in its JGroups improvements. The first seed node needs to be brought up before any other seed nodes are started.
Akka delivers membership change notifications using an Actor model for individual MemberUp/MemberRemoved events and you can also get the cluster state. Cluster state is fairly complete and has both a lead-member, a sorted set of members and a set of unreachable members. One thing missing from cluster state is any form of unique identifier.
Akka allows new members to join without taking down the system. New members can’t join though if the coordinator (“leader”) is examining an unresponsive member.
Akka provides each member with a unique ID containing hostname:port:UID. The host name can be an address. IDs are not allowed to rejoin a system once they leave or are kicked out. This is compatible with what Geode needs.
Akka member IDs do not provide storage for extra information, so Geode member characteristics can’t be transmitted with the cluster state.
Akka can auto-detect that a member is “unreachable” and boot it out. The member is left to figure out what to do and by default will just form its own 1-member system.
Akka member IDs have an olderThan method that could be used to determine the Elder in the system, assuming the Member<->InternalDistributedMember mapping problem already mentioned is solved.
Akka uses a gossip-based algorithm based on Dynamo and Riak to distribute membership changes. The cluster state is sent to random members of the system with preference for those that have not seen the latest version. The cluster state is completely installed once all members have been included in the “seenBy” set that is transmitted with cluster state. Cluster state also includes a set of Unreachable nodes. Somehow this would have to map to a coherent network-partition-detection algorithm for Geode. The gossip-based diffusion of membership events doesn’t map at all to the current 2-phased installation algorithm.
Akka uses phi-accrual failure detection, which we have considered using for Geode. However, it is a heartbeat-based algorithm and so is prone to false-positives when CPU or network load is high. Akka docs say to tune the phi-threshold to suit the environment.
There is an API in the failure detector registry that will let Geode feed heartbeats to Akka but there is no way to feed suspicion to it based on non-responsiveness in our other communication channels.
Akka would not support the notion of blocking a member from joining based Geode version information. We would not be able to prevent a node using the old version of the product from joining during a rolling upgrade. This can cause distributed hangs for algorithms that are changing behavior in the newer version of Geode.
Akka would not integrate with Geode’s peer-to-peer authentication system as it does not allow you to send credentials when joining and does not include credentials of the sender in its cluster-state events.
Akka does provide messaging so it could handle the out-of-band messages we currently send over JGroups.
Akka lets you plug in your own serializer so we could insert DataSerializable and DataSerializableFixedID. Rolling upgrade could not be supported at this level, though, because no info about the destination address is given to the serializer.
There are a few problems that make the native clustering in Akka unusable for Geode.
First, we can’t support authentication since we can’t send credentials in join messages. As with zookeeper this would have to be built on top of Akka.
Second, Akka has no notion of time or identity in its cluster state, so if a message is sent by member R to member S we can’t check to see if S is at a compatible state with R before allowing the message to be processed. We need that for some distributed algorithms. We would need to build this on top of Akka.
Third, Akka does not have a network partition detection algorithm at all, much less one that maps to Geode’s. This also would have to be built on top of Akka.
So, Akka looks like a better fit than Zookeeper in some ways and it looks like less of a good fit in others. We would have to build a lot of stuff on top of it.
Each of the options has downsides and would require us to implement a lot of additional functionality.
We would like to avoid making heavy modifications to JGroups but we will continue to use JGroups for UDP communication since it fits well with the current Geode architecture.
We will create interfaces for the services we currently get from JGroups so that different implementations can be plugged in but will implement most of these services from the ground up.