- Provide a new clustering paradigm in which there is no single Master node.
Background and strategic fit
In the 0.x baseline, Apache NiFi employs a "Master-Worker" paradigm when a cluster is created. The Master Node is referred to as the NiFi Cluster Manager, or NCM. If the NCM is down, there is no way for a user to see the current flow or change anything about the flow. The data does continue to flow through NiFi on all nodes. However, we are unable to see anything about which nodes are connected or how the nodes are performing. Additionally, we are unable to see what the flow actually looks like or make any sort of changes or respond to any REST API calls.
We would like to provide a new clustering paradigm for 1.0.0 of NiFi in which there is no single point of failure for the Control Plane. We refer to this as a Zero Master Clustering model.
User interaction and design
Under the new model, rather than connecting to the NCM in order to see and interact with the flow, the user should be able to point their browser to any node in a NiFi cluster and should be able then view and interact with the flow from that node. All nodes should be capable of handling user requests.
A single node will be responsible for Cluster Maintenance. This node will be referred to as the Cluster Coordinator and will be auto-elected via ZooKeeper's Auto Election mechanisms. If this node is lost, a new node will automatically be elected.
The 0.x clustering model called for all nodes to periodically send a heartbeat message to the NCM that included a great deal of status information. This will be broken down into separate concerns:
- Nodes will create a ephemeral ZooKeeper node that indicates the node's Identifier and hostname and port. This allows discovery of which nodes are in the cluster and ZooKeeper Watches can be employed to detect changes to this list. Each node that has an ephemeral node is known to be part of the cluster. This does NOT imply a "Connected" state, as a node in the cluster may have one of several states: Connecting, Connected, Disconnecting, Disconnected.
- Nodes will not publish status information to this ephemeral ZNode, as this adds extra tension on ZooKeeper. Rather, we can push data to the elected Cluster Coordinator.
- Status information will not be pushed to the Cluster Coordinator. Rather, nodes will obtain this information by federating the request to all nodes and aggregating the responses. The result can be cached for performance reasons if desirable.
- Since all nodes should be able to service REST API Requests, all nodes will need to be informed whenever a node's status changes. The internal cluster protocol can be used for this.
If a user connects to a node in a cluster, but that node is unable to communicate with ZooKeeper, requests that do not need to be federated across the cluster, such as obtaining the flow itself, will still be serviced. However, any request that requires communicating with either ZooKeeper or other nodes in the cluster will fail with a 503: Service Unavailable status code. This means that the UI will be highly available and be able to service all requests as long as the ZooKeeper quorum is present. However, in the absence of a ZooKeeper quorum, the nodes in the NiFi cluster will still be capable of providing a visualization of the flow itself but will be unable to show statistics or node information.
Below is a list of questions to be addressed as a result of this requirements document: