Java Broker Design - HA using BDB

Overview

A number of designs for providing HA in the Java Broker have previously been considered and subsequently discarded for one reason or another.

Currently users of the Java Broker are advised that for HA they need to synchronously replicate their store files to a secondary location, and then start a secondary broker on failure of their primary instance. This is far from ideal: the broker must be started manually (or by external tooling) after the failure, and the ability to synchronously replicate the store is, in general, limited to high-end SAN solutions.

The addition of HA capabilities to the Berkeley DB Java Edition provides a potentially low-cost tactical solution to providing Active-Passive clustering for persistent messaging.

Requirements

There are (at least) two potential use cases for clustering with HA.

Firstly one can consider clustered replication as an "alternative" to persistence as a mechanism for providing reliability. In this use-case the user is looking to survive the failure of a single node in the cluster without losing transient messages. The user must cater for the case where the entire cluster goes down (in which case all transient message data is lost), but is willing to live with this for the performance advantages of using in-memory message storage rather than on-disk storage.

Secondly one can use clustered replication in "addition" to persistence. Here persistence on disk is still required, but the user is looking for either a) a further guarantee that if the primary broker goes down (and its storage devices are potentially lost), a secondary instance will exist with an up-to-date copy of the data; or b) reduced downtime in the case of a transient broker failure where the cost of restarting a broker is higher than the client failing over to a replacement instance.

For this proposal we will consider only the second case. In order to meet such requirements the HA solution must guarantee replication of both persistent messages and durable configuration (queues, exchanges, bindings). We also require that any HA solution maintains atomicity with respect to transactions (at least with respect to persistent enqueues and dequeues). That is, if a broker failure happens while a transaction is being committed, either the entire commit is successfully replicated, or the transaction is rolled back. Since the scope of a transaction may include multiple queues on the same virtual host, replication must therefore take place at the virtual host or broker level.

Proposal

Berkeley DB Java Edition provides a mechanism for HA replication:

http://docs.oracle.com/cd/E17277_02/html/ReplicationGuide/index.html

We can use this mechanism to provide basic active-passive HA clustering at the virtual host level.

The user must configure the virtual host in Qpid to use a BDB store in HA mode, identifying the "replication group" in which it is going to take part. Each virtual host in a broker may be independently replicated. When attaching to the group the virtual host will become either the "master" node or a "replica". While the virtual host is a replica it must not allow any incoming connections. Upon becoming a master the virtual host MUST re-read all its configuration and message data from the store, and then allow clients to connect to it.
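The role handling described above can be sketched as a small state holder. This is a minimal illustration only: ReplicatedVirtualHost, Role and onRoleChange are hypothetical names, not Qpid or BDB classes, and a plain list stands in for the replicated store.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of a replicated virtual host; not the Qpid API. */
final class ReplicatedVirtualHost {
    enum Role { DETACHED, REPLICA, MASTER }

    private Role role = Role.DETACHED;
    private boolean acceptingConnections = false;
    private final List<String> recovered = new ArrayList<>();
    private final List<String> store; // stands in for the replicated BDB store

    ReplicatedVirtualHost(List<String> store) {
        this.store = store;
    }

    /** Called when the replication group reports a new role for this node. */
    synchronized void onRoleChange(Role newRole) {
        // Stop serving first, so a demoted master never accepts new work.
        acceptingConnections = false;
        role = newRole;
        if (newRole == Role.MASTER) {
            // Re-read everything: the store may have changed while this node was a replica.
            recovered.clear();
            recovered.addAll(store);
            acceptingConnections = true;
        }
    }

    /** Returns true only while this node is the master and has finished recovery. */
    synchronized boolean tryConnect() {
        return acceptingConnections;
    }

    synchronized Role role() {
        return role;
    }

    synchronized List<String> recoveredEntries() {
        return new ArrayList<>(recovered);
    }
}
```

In a real broker the role change would presumably be driven by a notification from the replication layer rather than called directly, and recovery would rebuild queues, exchanges, bindings and messages instead of copying a list.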

Electing a Master

The BDB HA implementation takes care of the election of the master node (see http://docs.oracle.com/cd/E17277_02/html/ReplicationGuide/introduction.html#masterselect). Since election requires a simple majority, the smallest group size that can survive a node failure is three. Since many people will wish to have only a two-node group (active and passive), there is a special process for selecting a master in a two-node group where one node has failed. In a two-node group one node may be designated the "primary". If the "primary" node loses contact with the other node, it will elect itself master and activate itself. If the secondary node loses touch with the "primary", it will cease to be active even if it was previously the master (see http://docs.oracle.com/cd/E17277_02/html/ReplicationGuide/two-node.html).
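The majority arithmetic behind these group sizes can be written down directly. The sketch below is illustrative only: ElectionQuorum and its methods are hypothetical names, and the election itself is performed by BDB HA, not by code like this.

```java
/** Hypothetical helper illustrating simple-majority arithmetic; not part of BDB or Qpid. */
final class ElectionQuorum {
    private ElectionQuorum() {}

    /** Smallest number of nodes that forms a simple majority of the group. */
    static int majority(int groupSize) {
        return groupSize / 2 + 1;
    }

    /** Number of nodes that can fail while a majority can still be formed. */
    static int tolerableFailures(int groupSize) {
        return groupSize - majority(groupSize);
    }

    /**
     * Whether a node may act as master given how many nodes it can reach,
     * counting itself. The designatedPrimary flag models only the two-node
     * special case described in the text.
     */
    static boolean mayBeMaster(int groupSize, int reachableNodes, boolean designatedPrimary) {
        if (reachableNodes >= majority(groupSize)) {
            return true;
        }
        return groupSize == 2 && designatedPrimary;
    }
}
```

With these definitions a two-node group tolerates no failures under a plain majority rule, while a three-node group tolerates one, which is why the "primary" designation is needed for the two-node case.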

Reliability

In the non-HA case the broker does not acknowledge transactions or synchronous operations until it has confirmed that the operation has been written to disk (not just to a memory buffer). HA introduces a new condition - how many replicas must have seen the update (and to what extent must they have persisted it). The strongest guarantee would be to not acknowledge until all replicas have persisted the state (along with the state reliably being persisted locally). Users may decide that the extra reliability inherent in HA clustering means that the on-disk reliability may be lessened by accepting the writing to the OS buffer as sufficient to claim successful transfer. In this case a message would only be "lost" if all machines running nodes in the cluster failed during the period between a transaction being "committed" and the OS buffer being flushed to disk. See http://docs.oracle.com/cd/E17277_02/html/ReplicationGuide/txn-management.html#durability.
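The trade-off above has two independent parts: how far the local write is pushed towards disk, and how many replicas must acknowledge before the commit is reported. The sketch below models both. The enum names follow the sync and acknowledgement policies described in the linked BDB durability documentation, but the types here are stand-ins defined for illustration, and the assumption that the master counts towards the majority is ours.

```java
/** Illustrative model of the durability choices; these are stand-in types, not the BDB classes. */
final class DurabilitySketch {
    /** How far a commit is pushed before it is considered written locally. */
    enum SyncPolicy { SYNC, WRITE_NO_SYNC, NO_SYNC }

    /** How many replicas must acknowledge before the commit is reported. */
    enum AckPolicy { ALL, SIMPLE_MAJORITY, NONE }

    private DurabilitySketch() {}

    /** Replica acknowledgements the master waits for in a group of the given size. */
    static int requiredReplicaAcks(AckPolicy policy, int groupSize) {
        switch (policy) {
            case ALL:
                return groupSize - 1;
            case SIMPLE_MAJORITY:
                // Assumption: the master counts itself, so it needs majority - 1 replicas.
                return groupSize / 2;
            default:
                return 0;
        }
    }

    /** Data handed to the OS buffer survives the broker process dying. */
    static boolean survivesBrokerCrash(SyncPolicy local) {
        return local != SyncPolicy.NO_SYNC;
    }

    /** Only data flushed to disk survives the machine itself failing. */
    static boolean survivesMachineCrash(SyncPolicy local) {
        return local == SyncPolicy.SYNC;
    }
}
```

For a three-node group, requiring all replicas means waiting for two acknowledgements, while a simple majority needs only one. Relaxing the local policy to WRITE_NO_SYNC then relies on at least one acknowledging machine staying up until its OS buffer is flushed.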

Recovery After Failure

BDB HA allows a node with an "out of date" store to re-join the cluster; however, if the node is too far behind, the whole store must be retransmitted (although again BDB HA provides facilities for this: http://docs.oracle.com/cd/E17277_02/html/ReplicationGuide/logfile-restore.html).
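The choice between catching up and a full retransmission can be pictured as a comparison of log positions. The sketch below is a simplified model under stated assumptions: RejoinPlanner and the transaction-number parameters are hypothetical, and BDB HA makes this decision internally using its own log bookkeeping.

```java
/** Hypothetical model of the re-join decision; BDB HA performs the real check itself. */
final class RejoinPlanner {
    enum Plan { ALREADY_CURRENT, INCREMENTAL_CATCH_UP, FULL_STORE_COPY }

    private RejoinPlanner() {}

    /**
     * @param nodeLastTxn             last transaction the returning node has applied
     * @param masterOldestRetainedTxn oldest transaction the master can still replay from its log
     * @param masterLastTxn           last transaction committed on the master
     */
    static Plan plan(long nodeLastTxn, long masterOldestRetainedTxn, long masterLastTxn) {
        if (nodeLastTxn >= masterLastTxn) {
            return Plan.ALREADY_CURRENT;
        }
        // The master can replay only what it still holds, so the node's next
        // needed transaction must not be older than the oldest retained one.
        if (nodeLastTxn + 1 >= masterOldestRetainedTxn) {
            return Plan.INCREMENTAL_CATCH_UP;
        }
        return Plan.FULL_STORE_COPY;
    }
}
```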

Design Questions

How to manage two-node groups? An external process is needed which can designate the remaining node as the "primary".
How to configure the durability guarantees - should we offer a relaxation to "WRITE_NO_SYNC" on the primary if it is also committing to the network? (Note WRITE_NO_SYNC will survive broker failure, but not machine failure).
Dynamic management of groups through the management interface - should it be possible to add nodes to and remove nodes from groups?

Other Considerations

Interop with the C++ Broker

A new effort around HA clustering is currently being undertaken on the C++ Broker. This effort is concentrating on the ability to replicate queues through the use of special "browsers" and using QMF to inform replicas of configuration state changes. Ultimately this approach may lead to a mechanism the Java Broker should also support in order to be more interoperable with the C++ codebase; however, for our current requirements it is insufficient, as it does not support transactional atomicity and recovery after a node failure requires a full replay of the store.

HA Clustering for Transient Messages / non-BDB stores

While BDB offers an HA solution, this does not help those who are focused on transient messages or wish to use a non-BDB store. For these cases we need an alternative solution. One proposal would be to write a "replicating facade" that would wrap any existing store and serialize and replicate transactions to other nodes in a cluster. Note that in order to be able to perform "recovery" of a restarted node it would also require the persistence of "last processed transaction" and "outstanding uncommitted transactions", and the nodes would need to keep a replay buffer of in-doubt transactions which could be replayed to failed nodes. A process would also need to be put in place for leadership election, though something like Apache ZooKeeper might help us in all this.
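A rough shape for such a facade is sketched below. Everything in it is hypothetical: TxnStore, ReplicaLink and ReplicatingFacade are illustrative names rather than existing Qpid interfaces, transactions are reduced to lists of strings, and the sketch leaves out acknowledgement handling, in-doubt transaction tracking, persistence of the last processed transaction, and leadership election.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical minimal store interface; not the Qpid store API. */
interface TxnStore {
    void apply(long txnId, List<String> operations);
}

/** Hypothetical channel to one other node in the cluster. */
interface ReplicaLink {
    /** Returns true if the replica applied the transaction. */
    boolean send(long txnId, List<String> operations);
}

/** Wraps any store, forwards each transaction to the replicas and keeps a bounded replay buffer. */
final class ReplicatingFacade implements TxnStore {
    private static final class Entry {
        final long txnId;
        final List<String> ops;

        Entry(long txnId, List<String> ops) {
            this.txnId = txnId;
            this.ops = ops;
        }
    }

    private final TxnStore delegate;
    private final List<ReplicaLink> replicas;
    private final Deque<Entry> replayBuffer = new ArrayDeque<>();
    private final int replayCapacity;
    private long lastProcessedTxn = -1;

    ReplicatingFacade(TxnStore delegate, List<ReplicaLink> replicas, int replayCapacity) {
        this.delegate = delegate;
        this.replicas = new ArrayList<>(replicas);
        this.replayCapacity = replayCapacity;
    }

    @Override
    public synchronized void apply(long txnId, List<String> operations) {
        List<String> copy = new ArrayList<>(operations);
        delegate.apply(txnId, copy);
        for (ReplicaLink replica : replicas) {
            // A real implementation would act on failed sends; this sketch ignores the result.
            replica.send(txnId, copy);
        }
        replayBuffer.addLast(new Entry(txnId, copy));
        if (replayBuffer.size() > replayCapacity) {
            replayBuffer.removeFirst(); // oldest transactions fall out of the buffer
        }
        lastProcessedTxn = txnId;
    }

    /** Replays buffered transactions newer than replicaLastTxn; returns how many were sent. */
    synchronized int replayTo(ReplicaLink replica, long replicaLastTxn) {
        int sent = 0;
        for (Entry entry : replayBuffer) {
            if (entry.txnId > replicaLastTxn) {
                replica.send(entry.txnId, entry.ops);
                sent++;
            }
        }
        return sent;
    }

    synchronized long lastProcessedTxn() {
        return lastProcessedTxn;
    }
}
```

Because the facade applies the transaction locally before forwarding it, this sketch does not by itself provide the atomicity required earlier in this document; a complete design would need a commit protocol between the nodes.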
