This is a functional specification for the membership manager in Geode. This replaces the JGroupMembershipManager that is in the incubating version of Geode.
The primary functions of the membership manager are to implement membership for the distributed system and handle all message sending/receiving. It has a plug-in for the DistributionManager so it can receive messages and it must map between whatever internal identifiers are used for membership/messaging and Geode's InternalDistributedMember identifiers.
The membership manager can forcefully shut down a Geode cache if it detects it is no longer a member of the distributed system.
Interfaces
There are a number of existing interfaces in Geode that must be implemented by the membership manager:
MembershipManager - provides membership and messaging functionality to the DistributionManager
NetMember - represents a member ID in the membership manager, plugs into InternalDistributedMember
MemberServices - factory for creating a NetMember
NetView - this is actually a class that represents a membership set. It must be created by the membership manager for use by DistributionManager
QuorumChecker - used during AutoReconnect to poll to see if a quorum of a NetView is reachable
Internally there are interfaces in the membership manager that provide separation of concerns for each of its components. This should allow us to plug in different implementations for each component, such as ring-based or phi-accrual-based health monitoring.
Service - interface implemented by all internal components
void init(Services s)
void start(CancelCriterion c) - called after all services have been initialized with init() and all services are available via Services
void stop()
void installView(NetView v)
void beSick(), playDead(), beHealthy() - used for membership testing
ServiceFactory - used by the membership manager to instantiate its services
ServiceConfig create(Manager m, ServiceConfig sc)
Authenticator - authenticates a member
String authenticate(NetMember m) - the returned String is a rejection message
Object getCredentials()
HealthMonitor - monitors members and instigates removal of those deemed dead
void contactedBy(NetMember m) - tells the monitor that we've had contact with another member
void suspect(NetMember m) - tells the monitor that the member is suspected of being ill or dead
void checkSuspect(NetMember m) - requests a health check on another member. This should initiate removal of the member if it does not pass the test
JoinLeave - manages joining, leaving and removing members.
boolean join() - joins the distributed system and returns true if successful, false if not. Throws SystemConnectException and GemFireConfigException
void leave() - leaves the distributed system. Should be invoked before stop()
void remove(NetMember m) - force another member out of the system
Locator - used by TcpServer to handle peer-location requests. Implements TcpHandler
Manager - internal interface for working with the membership manager. Extends MessageHandler
void send(DistributionMessage m)
InternalDistributedMember getMemberID(NetMember m)
void forceDisconnect(String reason)
MessageHandler - receives messages from a Messenger
void handle(DistributionMessage m)
Messenger - sends and receives messages. All messages are delivered to the Manager unless there is a handler installed for the message's class
void addHandler(Class c, MessageHandler h) - adds a handler for the given class/interface of messages
void send(DistributionMessage m) - sends an asynchronous message
NetMember getMemberID() - returns the endpoint ID for this member
Services - provides access to ServiceConfig and a directory of the membership manager's internal components
get/setAuthenticator
get/setConfig
get/setHealthMonitor
get/setJoinLeave
get/setLocator
get/setManager
ServiceConfig - provides configuration information for the manager and its components
DistributionConfig getDistributionConfig()
Properties getProperties()
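The two-phase lifecycle described for Service (init() on every component first, then start()) can be sketched as follows. This is only an illustration under stated assumptions: CancelCriterion, Service and Services here are simplified stand-ins for the real types, installView() and the testing hooks are omitted, register(), startAll() and stopAll() are hypothetical helper names, and stopping in reverse order is an assumption that the specification does not state.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Geode's CancelCriterion.
interface CancelCriterion {
    boolean isCancelInProgress();
}

// Reduced form of the Service interface listed above.
interface Service {
    void init(Services s);
    void start(CancelCriterion c);
    void stop();
}

// Minimal component directory; the spec's Services also has per-component accessors.
class Services {
    private final List<Service> all = new ArrayList<>();

    void register(Service s) {
        all.add(s);
    }

    // Phase 1 initializes every component, phase 2 starts them, so that
    // start() can rely on every peer being available through Services.
    void startAll(CancelCriterion c) {
        for (Service s : all) {
            s.init(this);
        }
        for (Service s : all) {
            s.start(c);
        }
    }

    // Assumption: components are stopped in reverse registration order.
    void stopAll() {
        for (int i = all.size() - 1; i >= 0; i--) {
            all.get(i).stop();
        }
    }
}

// Test double that records the order of lifecycle calls.
class RecordingService implements Service {
    private final String name;
    private final List<String> log;

    RecordingService(String name, List<String> log) {
        this.name = name;
        this.log = log;
    }

    public void init(Services s) { log.add(name + ".init"); }
    public void start(CancelCriterion c) { log.add(name + ".start"); }
    public void stop() { log.add(name + ".stop"); }
}

class LifecycleDemo {
    static List<String> run() {
        List<String> log = new ArrayList<>();
        Services services = new Services();
        services.register(new RecordingService("messenger", log));
        services.register(new RecordingService("joinLeave", log));
        services.startAll(() -> false);
        services.stopAll();
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Running LifecycleDemo records both init() calls before either start() call, which is the ordering guarantee the start() description relies on.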
Implementation Notes
To preserve as much of the current membership behavior as possible, and so foster adoption of Geode by the GemFire user base, the existing JGroupMembershipManager will be copied and most of its code will be preserved. It will continue to hold the DirectChannel but will now also hold a ServiceConfig that it will use in place of the JGroups channel.
The implementation of each of the other components will be in separate packages to keep the code clean and possibly allow for different implementations to be plugged in.
Authenticator will use Geode's authentication API to authenticate another member and to get credentials for JoinLeave to use in sending membership views and join requests.
HealthMonitor will initially use the NetView to form a look-to-the-right ring in which each member monitors the member to its right. HealthMonitor will keep a record of the last time a message was received from each member in the system. If the member it is watching has not made contact in the last member-timeout milliseconds, it will request a heartbeat from that member and perform a timed attempt to connect to the member's DirectChannel port (if available). If the member does not respond within member-timeout milliseconds, HealthMonitor will remove it using the JoinLeave.removeMember() API. The implementation of removeMember will forward the request to the current membership coordinator, which will perform its own health check on the member before removing it (by sending out a new NetView). Once the heartbeat request has been sent, HealthMonitor will go on to examine the next member in the view.
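As a rough illustration of the ring described above, the sketch below picks the member to the right in the view and decides whether it has been silent for longer than member-timeout. It is not the planned implementation: members are plain Strings rather than NetMember, the clock is passed in explicitly, RingMonitor, rightOf() and needsCheck() are hypothetical names, and treating a member that has never made contact as needing a check is an assumption.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of look-to-the-right monitoring over an ordered membership view.
class RingMonitor {
    private final long memberTimeoutMillis;
    private final Map<String, Long> lastContact = new HashMap<>();
    private List<String> view;

    RingMonitor(long memberTimeoutMillis, List<String> view) {
        this.memberTimeoutMillis = memberTimeoutMillis;
        this.view = view;
    }

    // Corresponds to Service.installView(): the ring follows the latest view.
    void installView(List<String> newView) {
        view = newView;
    }

    // Corresponds to HealthMonitor.contactedBy(): remember when we last heard from a member.
    void contactedBy(String member, long nowMillis) {
        lastContact.put(member, nowMillis);
    }

    // Returns the member immediately to the right of 'member', wrapping at the
    // end of the view, or null if there is nobody else to monitor.
    String rightOf(String member) {
        int i = view.indexOf(member);
        if (i < 0 || view.size() < 2) {
            return null;
        }
        return view.get((i + 1) % view.size());
    }

    // True when the member has not made contact within member-timeout and
    // should therefore be asked for a heartbeat.
    // Assumption: a member we have never heard from also needs a check.
    boolean needsCheck(String member, long nowMillis) {
        Long last = lastContact.get(member);
        return last == null || nowMillis - last > memberTimeoutMillis;
    }
}
```

With a view of a, b, c, member b watches c and member c wraps around to watch a. The heartbeat request, the DirectChannel connection attempt and the removal request are left out because the specification does not define their signatures.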
JoinLeave will use Messenger, and possibly the membership manager, to communicate with other members. It will use TcpClient to contact Locators when joining in order to find the current membership coordinator. Once it knows the coordinator it will send it a Join message including authentication credentials. JoinLeave will also implement membership coordination functions (i.e., replace what we're doing with JGroups GMS). It will be responsible for detecting a network partition and invoking forceDisconnect() in the membership manager.
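The join sequence above (ask Locators for the coordinator, then send it a join request with credentials) might look roughly like the sketch below. LocatorClient, JoinSender and JoinSketch are hypothetical seams invented for illustration; the specification names TcpClient and Messenger but does not define these signatures. The sketch also returns false where the real join() may throw SystemConnectException or GemFireConfigException.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical wrapper around the TcpClient lookup.
interface LocatorClient {
    // Asks one locator for the current coordinator; empty if it cannot answer.
    Optional<String> findCoordinator(String locatorAddress);
}

// Hypothetical wrapper around sending the Join message through Messenger.
interface JoinSender {
    // Sends a join request with credentials; true if the coordinator admits us.
    boolean sendJoin(String coordinator, Object credentials);
}

class JoinSketch {
    private final List<String> locators;
    private final LocatorClient locatorClient;
    private final JoinSender sender;
    private final Object credentials;

    JoinSketch(List<String> locators, LocatorClient locatorClient,
               JoinSender sender, Object credentials) {
        this.locators = locators;
        this.locatorClient = locatorClient;
        this.sender = sender;
        this.credentials = credentials;
    }

    // Tries each locator in turn until one names a coordinator, then sends
    // the join request to that coordinator.
    boolean join() {
        for (String locator : locators) {
            Optional<String> coordinator = locatorClient.findCoordinator(locator);
            if (coordinator.isPresent()) {
                return sender.sendJoin(coordinator.get(), credentials);
            }
        }
        return false; // no locator could name a coordinator
    }
}
```

In this sketch the credentials object would come from Authenticator.getCredentials(), as described for the Authenticator component.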
The Locator component will persist the current membership view and will respond to requests for the ID of the current membership coordinator. If there is no membership coordinator (meaning the Locator is booting up) then it will return its best guess of who the coordinator is based on who has contacted it.
The Manager API is what should be used by all components to interact with the membership manager.
Messenger will use a trimmed-down modern JGroups stack to perform UDP messaging. JGroups will no longer be forked for use in Geode but will be added as a dependency. Messenger will be responsible for installing the current NetView in its JGroups stack as a native JGroups View so that UDP broadcast works and multicast message garbage-collection can be properly performed. Note that this switch to using off-the-shelf JGroups means we will start seeing more log messages from JGroups than in the past.
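Independently of the JGroups transport, the delivery rule stated for Messenger in the interface list (messages go to the Manager unless a handler is installed for the message's class) can be sketched as below. DistributionMessage and HeartbeatMessage are empty stand-ins for the real message classes, DispatchSketch and deliver() are hypothetical names, and resolving overlapping handlers in registration order is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-ins for real message classes.
class DistributionMessage {}
class HeartbeatMessage extends DistributionMessage {}

// Mirrors the MessageHandler interface listed above.
interface MessageHandler {
    void handle(DistributionMessage m);
}

class DispatchSketch {
    // Insertion-ordered so that the first matching registration wins (assumption).
    private final Map<Class<?>, MessageHandler> handlers = new LinkedHashMap<>();
    private final MessageHandler manager;

    DispatchSketch(MessageHandler manager) {
        this.manager = manager;
    }

    // Mirrors Messenger.addHandler(): register a handler for a class or interface.
    void addHandler(Class<?> c, MessageHandler h) {
        handlers.put(c, h);
    }

    // Called for each received message: a handler registered for the message's
    // class (or a supertype) takes it, otherwise it goes to the Manager.
    void deliver(DistributionMessage m) {
        for (Map.Entry<Class<?>, MessageHandler> e : handlers.entrySet()) {
            if (e.getKey().isInstance(m)) {
                e.getValue().handle(m);
                return;
            }
        }
        manager.handle(m);
    }
}
```

A HealthMonitor could, for example, register for its heartbeat messages in init() so that they never reach the Manager, while ordinary messages still take the default path.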
Testing
Since this implements an existing interface in Geode, there are already many tests that exercise it. These tests will need some attention if they refer to any JGroups code. The use of interfaces in this version of the MembershipManager should allow us to create real unit tests, as opposed to integration tests, for each component to achieve a higher level of code coverage.
Jepsen testing should be performed to ensure that the membership manager behaves as expected during network failures, GC pauses, etc. All releases of Geode should require Jepsen testing.