由于licensing的限制, 我们必须替换内置在 Geode 中老的JGroups 通信堆栈.  老的JGroups 有 LGPL 授权, 与 Apache 2.0 的授权是不兼容的.  此文档解释了对于成员关系系统上 Geode的一些需求和未来的一些替换, 最后围绕以 Geode 为核心完成对 JGroups 的替换.

 

Geode 怎样使用 JGroups

Geode 的动态成员关系模型基于通过pbcast.GMS 来提供的 JGroups 动态模型, 在新的节点添加进来的地方, 不会发生分布式系统瘫痪宕机的情况.  

Geode使用成员关系视图来进行复制和事件投递. 对于复制, 我们形成一个成员关系视图的子集, 让一个 Cache Region 知道, 同时标记它为 DistributionAdvisor. 我们当前的复制模式需要来自一个特定接收者的'返回接收', 只要这个接收者在成员关系视图中.

成员关系视图也可以被用来在分布式锁服务中选择一个老成员.老陈冠负责追踪谁允许被授权锁请求, 它在视图中是最老的'非管理员'成员.

网络分区检测也构建在JGroups GMS协议之中, 故障检协议已经被替换让Geode发送"怀疑对象"给JGroups, 帮助它描述出故障已经发生.
 

 

Peer 认证已经构建在2.2.9 JGroups 堆栈之上, 通过引入自定义的认证协议, 拦截'加入'请求', 同时要求认证在允许请求到达GMS 成员关系协议之前来检查发送.

未来 Geode 服务器也会响应JGroups, 对于可靠的 UDP 消息传输, 这个 UDP 消息用来广播整个成员关系集合, 例如StartupMessage, ShutdownMessage, CreateRegionMessage 和 PDX 注册.

跨 TCP/IP 流连接发送这些消息有一个障碍是增加分布式系统的成员数量.特别是在启动时, 我们必须创建4个成员的连接来加入到分布式系统中.

可靠的 UDP 通信在带外的, 低优先级通信的环境下也需要, 如管理节点的告警. 创建 TCP/IP  连接发送告警能够阻塞故障或错误的操作发生.我们最近也看到了一些大型的生产系统中, 当一个告警阻塞了操作 — 成员还没有确认一个成员关系视图的变化, 因为管理节点接收到的告警是有问题的, 不接受连接.

Geode 集成了一个 JGroups GossipServer 到  Locator 服务当中.  GossipServer 用于提供一些信息, 谁在分布式系统中, 当一个新成员正加入到分布式系统中时.

最后, Geode 客户端使用成员关系系统的类来形成 IDs, 这些包含了一个 JGroups IpAddress.

 

需求

简要来讲, 成员关系服务必须

  1. 投递成员变化的通告消息到 分布式系统的成员关系管理器

  2. 允许新的成员关系加入, 而无系统停机的情况

  3. 在分布式系统里为每个成员提供一个身份, 同时允许客户端有类似的身份.  对于 Peer 来说, 此 ID 必须是唯一的, 老的身份不应该被重用 (至少不是非常快)

  4. 在成员 ID中, 传输信息包括每个成员的 DistributedSystem 特征 (VM type, DirectChannel port, Groups, Name, etc) 

  5. 高效, 快速地检测到一个成员丢失情况 (故障检测)

  6. 对于分布式锁服务, 支持老成员的想法

  7. 支持 Geode 模型来处理网络分区 (获得/丢失分区)

  8. 允许 Geode 给出通知, 哪个成员可能存在问题

  9. 支持滚动升级 (一旦升级开始, 老的成员不能重新加入 & 服务自身支持向后兼容)

  10. Geode 认证服务集成, 在允许新的成员加入前进行认证

 UDP 消息服务必须

  1. 兼容成员关系服务 IDs  (, 成员关系的一个 ID 标识了 UDP 消息服务中的端点)

  2. 支持滚动升级 (跨发布版本的on-wire 兼容性)



替换 JGroups v2.2.9 的选项


对于我们来说有一些选项可以选择.  选项如下:

迁移到新的JGroups版本

使用 Zookeeper

使用 Akka

创建一个自定义解决方案


Geode做的最好事情之一是成员管理管理是动态的, 毫无单点故障.  稍微弱的地方是 Locator 服务, 如果所有的 Locators 都有一个急速下降的客户端数, 不能获得服务器的相关信息, 且新的服务器不能添加到集群中, 直到一个Locator再次恢复可用性.  即时 Locators 都宕掉了, 服务器集群仍然是可用的, 客户端仍然能够连接到服务器上.

对于这种情况, 如 Zookeepr 看起来是不满足要求的. 用户必须配置Zookeeper集群, 保证集群是配置好的, 把服务器丢失连接的风险降低到最小. 丢失一个集群的连接将需要一个服务器停机. Zookeeper集群通常情况下比较小, 因此一个 带有200台服务器的Geode 用户将会感觉存在风险, 当使用 一个大的, 7个节点的Zookeeper 集群时.Zookeeper并不响应7~11个节点的需求, 或者提供 UDP 消息通信.

 

JGroups 已经牵扯和解决了大量的必须在Geode 中 2.2.9 版本上修复的问题. 然而, 为了使用它, 我们 Fork 了一个版本, 修改特定的部分为了满足4, 7, 9 和 10版本.我们 Fork 出特定的部分, 如GMS 和故障检测协议 , 但是视图(View)类需要传递认证的 Credentials.  如果你没有在集群成员关系上使用它, JGroups 仍然在可靠的 UDP 消息传输上是有用的.

Akka 看起来是有前途的, 现在有大量的人使用它.

没有利用其他工程的自定义解决方案也能实现这个功能, 特别是如果 JGroups 被用于可靠地  UDP 消息传输时.

上述选项检查

本章节我们看一下每个选项, 看一下每个选项如何定位这些需求.

JGroups

A newer version of JGroups could be modified to fit the requirements, just as is being done with the initial, incubating, version of Geode.

New versions of JGroups use the Apache license so they are compatible with Geode and could replace the old 2.2.9 version.  A new version of JGroups would be a natural fit for replacing the old version.  A number of the changes made to the older JGroups for GemFire, such as out-of-band messaging and out-of-band messages are now available in off-the-shelf JGroups.  The GossipServer service used in the Locator is now, sadly, missing but there is a GossipRouter service that might plug into the Locator instead.  Other problems, such as the membership view being transmitted in a message header instead of the body have also been corrected (allowing for large membership views through body fragmentation).  Message-handling executors have been added and the old up/down thread pattern has been removed.  UUIDs are used for member IDs in preference to InetAddress:port (though these are still used for identifying endpoints associated with UUID identifiers).

The downside of using JGroups is that we have to fork it in order to insert our failure-detection SLAs, network-partition detection and handling, ability to shut down in all circumstances, etc.  We largely rewrote failure-detection, but the GMS may need a lot fewer changes thanks to improvements made in JGroups over the years like simultaneous handling of joins & leaves.

We would also need to change JGroups to support rolling upgrade.  The old 2.2.9 JGroups code was modified to use versioned serialization streams and include version information so that recipients of messages would know how to deserialize the messages and react to them.

If we continue to use and modify JGroups one other concern to note is that the code we inserted into the old LGPL version of JGroups is probably still covered by LGPL and can’t just be copied into a newer version to cover our needs.  It will need to be written from scratch but can use the same algorithms as the old implementation.

If JGroups is not used for membership it could still be used for UDP communication, but rolling upgrade support might be an issue if the version of JGroups being used by Geode is ever changed.


Zookeeper

Many Apache projects use Apache Zookeeper for collaboration between processes.  Zookeeper provides collaboration primitives that we can use to build membership services, distributed locking and more.

If you haven’t been exposed to how Zookeeper is used you might take a look at this page: http://nofluffjuststuff.com/blog/scott_leberknight/2013/07/distributed_coordination_with_zookeeper_part_3_group_membership_example

Zookeeper meets requirements 1 through 6 in the list you saw earlier in this document.

Zookeeper can be used to join a distributed system by creating an ephemeral, sequenced z-node in the distributed system’s “group” node.  Each z-node can contain additional details about the member, fulfilling one of the important requirements Geode has for a membership system.

Zookeeper gives its clients notification when a node changes.  This can be used to tell members when a new peer has joined or left the system.  Notification only lets you see the current state of the group z-node, so we have to keep a snapshot of the last state and do a diff to see what’s changed.  Apparently it is possible for changes to the group node to be missed so there probably needs to be some periodic polling of the group node to check for changes in addition to the notification mechanism.

Zookeeper’s getChildren() method returns a list of nodes that has no particular order but if we use sequential z-nodes each node is assigned a sequential number that can be used to impose ordering and select an Elder for the Distributed Lock Service.

To detect crashed clients Zookeeper uses UDP heartbeats.  This is notoriously fragile under heavy CPU load and is one reason I crippled the similar mechanism in JGroups 2.2.9 by making it defer to the decisions of the stronger tcp/ip stream connection-based FD_SOCK.  I have no proof but suspect this will result in members being kicked out of the system under heavy load, transferring the load to other already-burdened servers...

There doesn’t appear to be a way for clients to raise suspicion about another client, so Geode could not hint to Zookeeper to check up on a member that’s not responding to messages.  It does seem possible to delete another client’s node though, so that could be used to remove a problematic member.  Another mechanism for doing a health check on that member would be needed in Geode before taking the drastic step of unilaterally kicking it out.  Adding an external health checker, such as a phi-based system or a membership-ring system, is a possibility.

Zookeeper doesn’t deal with network partitioning but this could be handled when membership changes.  Each member would have to decide whether now-missing members have crashed or not and perform loss calculations.  Geode sends Shutdown messages, so this could be done.

Zookeeper does not include any form of send/reply messaging so we would still need some other solution for out-of-band messaging and broadcast messaging.

 

Zookeeper would not support the notion of blocking a member from joining based Geode version information.  We would have to build this on top of zookeeper as this can cause distributed hangs for algorithms that are changing behavior in the newer version of Geode.

 Zookeeper has a pluggable authentication system and ACL-based access control.  It’s not clear to me whether it is compatible with the pluggable authentication system in Geode.  It certainly wouldn’t interact with Geode’s authentication system directly.  This, too, would have to be built on top of zookeeper.

The down-side to Zookeeper is that it is not a shared-nothing membership system like we currently have.  Geode would be dependent on Zookeeper servers which must be deployed and managed, and if Zookeeper goes down the entire distributed system will become unavailable.  It’s not exactly a single point of failure since Zookeeper has redundant servers, but it forms a static-membership core to Geode that can put a large distributed system at risk.  If the Zookeeper servers are not reachable a node in a Geode Distributed System will have to shut down immediately since it can draw no other conclusion but that it is isolated from the primary partition.  This could be a significant risk in a large distributed system.  It’s akin to requiring locators to always be available.

Zookeeper might give us trouble in large distributed systems because access to zookeeper would be required for servers to remain up.  We should consider using the Apache Curator project or perhaps parts of Helix that handle some of the problems people usually run into with Zookeeper and for UDP messaging we could use a JGroups channel having no membership or failure detection protocols.

So, to sum up, zookeeper could be used as the basis for a membership management service, replacing some of the functionality we currently have built into JGroups.  We would have to implement a fair portion of what we need outside of zookeeper, and using zookeeper comes with some risks.  We'd still need a different solution for UDP messaging.

Akka 集群

Akka 被用于Google Compute Engine 和其他的集群工程中.  Google 发布博客宣称在一个集群中它能够达到1500个节点, 且性能稳定, 同时超时范围宽松.

Akka 集群的相关文档, 如下所示: http://doc.akka.io/docs/akka/snapshot/common/cluster.html

Akka 使用配置的种子节点加入到集群中, 它兼容 Geode的 Locator 发现模式, 但是对于种子节点, 它也不能够解决并发启动的问题,  Geode 尝试了JGroups 增强实现.  第一个种子节点需要提出, 在其他的种子节点启动之前.

Akka delivers membership change notifications using an Actor model for individual MemberUp/MemberRemoved events and you can also get the cluster state.  Cluster state is fairly complete and has both a lead-member, a sorted set of members and a set of unreachable members.  One thing missing from cluster state is any form of unique identifier.

Akka allows new members to join without taking down the system.  New members can’t join though if the coordinator (“leader”) is examining an unresponsive member.

Akka provides each member with a unique ID containing hostname:port:UID.  The host name can be an address.  IDs are not allowed to rejoin a system once they leave or are kicked out.  This is compatible with what Geode needs.

Akka member IDs do not provide storage for extra information, so Geode member characteristics can’t be transmitted with the cluster state.  

Akka can auto-detect that a member is “unreachable” and boot it out.  The member is left to figure out what to do and by default will just form its own 1-member system.  

Akka member IDs have an olderThan method that could be used to determine the Elder in the system, assuming the Member<->InternalDistributedMember mapping problem already mentioned is solved.

Akka uses a gossip-based algorithm based on Dynamo and Riak to distribute membership changes.  The cluster state is sent to random members of the system with preference for those that have not seen the latest version.  The cluster state is completely installed once all members have been included in the “seenBy” set that is transmitted with cluster state.  Cluster state also includes a set of Unreachable nodes.  Somehow this would have to map to a coherent network-partition-detection algorithm for Geode.  The gossip-based diffusion of membership events doesn’t map at all to the current 2-phased installation algorithm.

Akka uses phi-accrual failure detection, which we have considered using for Geode.  However, it is a heartbeat-based algorithm and so is prone to false-positives when CPU or network load is high.  Akka docs say to tune the phi-threshold to suit the environment.

There is an API in the failure detector registry that will let Geode feed heartbeats to Akka but there is no way to feed suspicion to it based on non-responsiveness in our other communication channels.

Akka would not support the notion of blocking a member from joining based Geode version information.  We would not be able to prevent a node using the old version of the product from joining during a rolling upgrade.  This can cause distributed hangs for algorithms that are changing behavior in the newer version of Geode.

Akka would not integrate with Geode’s peer-to-peer authentication system as it does not allow you to send credentials when joining and does not include credentials of the sender in its cluster-state events.

Akka does provide messaging so it could handle the out-of-band messages we currently send over JGroups.

Akka lets you plug in your own serializer so we could insert DataSerializable and DataSerializableFixedID.  Rolling upgrade could not be supported at this level, though, because no info about the destination address is given to the serializer.

There are a few problems that make the native clustering in Akka unusable for Geode.

First, we can’t support authentication since we can’t send credentials in join messages.  As with zookeeper this would have to be built on top of Akka.

Second, Akka has no notion of time or identity in its cluster state, so if a message is sent by member R to member S we can’t check to see if S is at a compatible state with R before allowing the message to be processed.  We need that for some distributed algorithms.  We would need to build this on top of Akka.

Third, Akka does not have a network partition detection algorithm at all, much less one that maps to Geode’s.  This also would have to be built on top of Akka.

So, Akka looks like a better fit than Zookeeper in some ways and it looks like less of a good fit in others.  We would have to build a lot of stuff on top of it. 

Conclusion

Each of the options has downsides and would require us to implement a lot of additional functionality.

We would like to avoid making heavy modifications to JGroups but we will continue to use JGroups for UDP communication since it fits well with the current Geode architecture.

We will create interfaces for the services we currently get from JGroups so that different implementations can be plugged in but will implement most of these services from the ground up.


  • No labels