Motivation
Ignite clusters must support a dynamically changing topology: in a distributed system, nodes can fail and be restarted at any time, and nodes can be added to or removed from the cluster. Users expect that topology changes do not break the consistency of the cluster and that the cluster remains operational.
Description
Problem statement
This document describes the process of creating a cluster of Ignite nodes and adding new nodes to it. It describes the following concepts:
- Cluster setup - the process of assembling a predefined set of nodes into a cluster and preparing it for the next lifecycle steps;
- Cluster initialization - after the cluster has been set up, it should be moved to a state in which it is ready to serve user requests;
- Node validation - in order to maintain the cluster configuration in a consistent state, the joining nodes should be compatible with all of the existing cluster nodes.
Terminology
Meta Storage
Meta Storage is a subset of cluster nodes hosting a Raft group responsible for storing a master copy of cluster metadata.
Cluster Management Group
Cluster Management Group, or CMG, is a subset of cluster nodes hosting a Raft group. The CMG leader is responsible for orchestrating the node join process.
Init command
The init command is issued by a user with a CLI tool and moves a cluster from the idle state into the running state.
Idle and running cluster
An idle cluster is a cluster that has been assembled for the first time and has never received the init command; therefore, the Meta Storage of this cluster does not exist yet. A cluster is considered running once it has obtained the information about the Meta Storage and CMG location.
Empty and initialized node
Every Ignite node starts in the empty state. After joining the cluster and passing the validation step, a node obtains the location of the Meta Storage and transitions to the initialized state.
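The node state transition can be illustrated with a minimal sketch. All class and method names below are hypothetical and are not part of the proposed API; the sketch only assumes that a successful validation response carries the Meta Storage node names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of the empty -> initialized node transition. */
final class NodeStateSketch {
    enum State { EMPTY, INITIALIZED }

    // Every node starts in the empty state.
    private State state = State.EMPTY;
    private List<String> metaStorageNodes = Collections.emptyList();

    State state() {
        return state;
    }

    List<String> metaStorageNodes() {
        return metaStorageNodes;
    }

    /**
     * Applies a successful validation response: the node learns the
     * Meta Storage location and becomes initialized.
     */
    void onValidationPassed(List<String> metaStorageNodeNames) {
        if (metaStorageNodeNames.isEmpty()) {
            throw new IllegalArgumentException("Meta Storage location is required");
        }
        this.metaStorageNodes =
                Collections.unmodifiableList(new ArrayList<>(metaStorageNodeNames));
        this.state = State.INITIALIZED;
    }
}
```

In this sketch a node never moves back to the empty state, which matches the description above only as far as the document goes; node reset is not covered here.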
Cluster Tag
A cluster tag is a string that uniquely identifies a cluster. It is generated once per cluster and is distributed across the nodes during the node join process. The purpose of a cluster tag is to determine whether a joining node used to be a member of another cluster, in which case it should be rejected.
A cluster tag consists of two parts:
- . Its purpose is to make the debugging and error reporting easier.
- Its purpose is to ensure that Cluster Tags are different between different clusters.
Physical topology
Physical topology consists of nodes that can communicate with each other on the network level. Nodes enter the topology through a network discovery mechanism, currently provided by the SWIM protocol. However, such nodes may not yet have passed the validation step, so not all nodes from the physical topology can participate in cluster-wide activities. In order to do that, a node must enter the logical topology.
Logical topology
Logical topology consists of nodes that have passed the validation step and are therefore able to participate in cluster-wide activities. Logical topology is maintained by the Cluster Management Group.
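The relationship between the two topologies can be sketched as two sets, where the logical topology only ever contains nodes that are already present in the physical one. This is an illustrative model with hypothetical names, not the proposed implementation; in the real design the logical topology is maintained through the CMG Raft group.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical model of the physical and logical topologies. */
final class TopologySketch {
    // Nodes reported by network discovery (e.g. SWIM).
    private final Set<String> physical = new LinkedHashSet<>();
    // Nodes that have passed validation.
    private final Set<String> logical = new LinkedHashSet<>();

    /** Called when network discovery reports a new node. */
    void onNodeDiscovered(String nodeName) {
        physical.add(nodeName);
    }

    /**
     * Called after a node has passed validation.
     * Returns false if the node is not part of the physical topology.
     */
    boolean addToLogicalTopology(String nodeName) {
        if (!physical.contains(nodeName)) {
            return false;
        }
        logical.add(nodeName);
        return true;
    }

    /** Only members of the logical topology take part in cluster-wide activities. */
    boolean canParticipateInClusterActivities(String nodeName) {
        return logical.contains(nodeName);
    }
}
```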
Protocol description
Initial cluster setup
This section describes the process of assembling a new Ignite cluster from a set of empty nodes.
- Initial set of nodes is started, providing the following properties:
- provided by an IP Finder. Concrete IP Finder implementations can be used to obtain the and are specified either via the configuration or the CLI.
- The nodes assemble into a physical topology using a network discovery protocol (e.g. SWIM), bootstrapped with the provided seed members.
- An init command is sent by a user to a single node in the cluster, providing the following information:
- Consistent IDs (names) of the nodes that will host the Meta Storage;
- Consistent IDs (names) of the nodes that will comprise the Cluster Management Group. It is possible for both of these node sets to be the same.
- The node that has received the command propagates it to all members of the physical topology that were specified in the init command. These members should start the corresponding Raft groups and, after the group leaders are elected, the initial node should return a response to the user. In case of errors, the Raft groups should be removed and an error response should be returned to the user. If no response has been received, the user should retry sending the command with the same parameters to a different node.
- As soon as the CMG leader is elected, the leader initializes the CMG state by applying a Raft command (e.g. ClusterInitCommand), which includes:
- A generated Cluster Tag;
- Ignite product version.
- After the command has been applied, the leader sends a message to all nodes in the physical topology, containing the location of the CMG nodes. At this point the cluster can be considered as running.
- Upon receiving the message, each node sends a join request to the CMG leader, which consists of:
- Protocol version (an integer which is increased every time the join procedure is changed, needed to support backwards compatibility);
- Ignite product version;
- Information from the join requests gets validated on the leader and, if the properties are equal to the CMG state (in case of the protocol version, a different comparison algorithm might be used), a successful response is sent, containing:
- Consistent IDs (names) of the Meta Storage nodes;
- Cluster Tag.
If the properties do not match, an error response is sent and the joining node is rejected.
- If the joining node has passed the validation and received the validation response, it starts some local recovery procedures (if necessary) and sends a response to the CMG leader, indicating that it is ready to be added to the logical topology.
- The CMG leader issues a Raft command (e.g. AddNodeCommand), which adds the node to the logical topology.
An example of this flow can be found in the diagram below. Some initialization steps are omitted for Node C as they are identical to the corresponding steps on Node B.
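The validation and logical topology steps of the flow above can also be sketched in code. This is a simplified, single-process model under several assumptions: all names are hypothetical, Raft replication is replaced by plain fields, the protocol version is compared by equality (the document allows a different comparison algorithm), and the `knownClusterTag` field is an assumed way for an already initialized node to present its Cluster Tag.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of the CMG leader's join validation; not the Ignite API. */
final class CmgLeaderSketch {
    /** Join request; knownClusterTag is an assumed field, null for an empty node. */
    static final class JoinRequest {
        final String nodeName;
        final int protocolVersion;
        final String productVersion;
        final String knownClusterTag;

        JoinRequest(String nodeName, int protocolVersion, String productVersion, String knownClusterTag) {
            this.nodeName = nodeName;
            this.protocolVersion = protocolVersion;
            this.productVersion = productVersion;
            this.knownClusterTag = knownClusterTag;
        }
    }

    /** Meta Storage names and Cluster Tag on success, an error message otherwise. */
    static final class JoinResponse {
        final boolean accepted;
        final List<String> metaStorageNodes;
        final String clusterTag;
        final String error;

        private JoinResponse(boolean accepted, List<String> metaStorageNodes, String clusterTag, String error) {
            this.accepted = accepted;
            this.metaStorageNodes = metaStorageNodes;
            this.clusterTag = clusterTag;
            this.error = error;
        }
    }

    // CMG state, as it would be written by a ClusterInitCommand-like Raft command.
    private final String clusterTag;
    private final String productVersion;
    private final int protocolVersion;
    private final List<String> metaStorageNodes;

    private final Set<String> validatedNodes = new LinkedHashSet<>();
    private final Set<String> logicalTopology = new LinkedHashSet<>();

    CmgLeaderSketch(String clusterTag, String productVersion, int protocolVersion, List<String> metaStorageNodes) {
        this.clusterTag = clusterTag;
        this.productVersion = productVersion;
        this.protocolVersion = protocolVersion;
        this.metaStorageNodes = Collections.unmodifiableList(new ArrayList<>(metaStorageNodes));
    }

    /** Compares the join request with the CMG state and accepts or rejects the node. */
    JoinResponse validate(JoinRequest request) {
        if (request.protocolVersion != protocolVersion) {
            return reject("Unsupported join protocol version: " + request.protocolVersion);
        }
        if (!productVersion.equals(request.productVersion)) {
            return reject("Product version mismatch: " + request.productVersion);
        }
        if (request.knownClusterTag != null && !clusterTag.equals(request.knownClusterTag)) {
            return reject("Node belongs to another cluster: " + request.knownClusterTag);
        }
        validatedNodes.add(request.nodeName);
        return new JoinResponse(true, metaStorageNodes, clusterTag, null);
    }

    /** AddNodeCommand-like step: only validated nodes enter the logical topology. */
    boolean onNodeReady(String nodeName) {
        if (!validatedNodes.contains(nodeName)) {
            return false;
        }
        logicalTopology.add(nodeName);
        return true;
    }

    Set<String> logicalTopology() {
        return Collections.unmodifiableSet(logicalTopology);
    }

    private static JoinResponse reject(String error) {
        return new JoinResponse(false, Collections.<String>emptyList(), null, error);
    }
}
```

Splitting `validate` and `onNodeReady` mirrors the two-phase exchange in the list above: a node is validated first, performs its local recovery, and only then asks to be added to the logical topology.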
Adding a new node to a running cluster
Depending on the state of the node and the cluster, there are four possible scenarios:
Empty node joins an idle cluster
This scenario is equivalent to the initial cluster setup.
Initialized node joins an idle cluster
Such a node should never be able to join the cluster, because it will fail the cluster tag validation step.
Empty node joins a running cluster
- The new node enters the physical topology;
- CMG leader discovers the new node and sends a message to it, containing the location of the CMG nodes;
- Upon receiving the message, the joining node should execute the validation procedure and be added to the logical topology, as described by steps 7-9 of the Initial cluster setup section.
Initialized node joins a running cluster
- The new node starts the CMG group server or client, depending on the existing local node configuration;
- The new node enters the physical topology;
- If the new node is elected CMG leader, it should use the local CMG state to validate itself, otherwise see point 4.
- CMG leader discovers the new node and sends a message to it, containing the location of the CMG nodes;
- Upon receiving the message, the joining node should execute the validation procedure and be added to the logical topology, as described by steps 7-9 of the Initial cluster setup section.
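The four scenarios above can be summarized as a small decision function. The enum and method names are hypothetical and serve only to restate the scenarios in a compact form.

```java
/** Hypothetical summary of the four join scenarios. */
final class JoinScenarios {
    enum NodeState { EMPTY, INITIALIZED }

    enum ClusterState { IDLE, RUNNING }

    enum Outcome {
        /** Same as the initial cluster setup: the node waits for the init command. */
        INITIAL_SETUP,
        /** The node fails the cluster tag validation and is rejected. */
        REJECT,
        /** The CMG leader contacts the node, which is then validated and added. */
        VALIDATE_AND_JOIN,
        /** The node first starts its CMG server or client, then is validated and added. */
        START_CMG_THEN_VALIDATE
    }

    static Outcome resolve(NodeState node, ClusterState cluster) {
        if (cluster == ClusterState.IDLE) {
            return node == NodeState.EMPTY ? Outcome.INITIAL_SETUP : Outcome.REJECT;
        }
        return node == NodeState.EMPTY
                ? Outcome.VALIDATE_AND_JOIN
                : Outcome.START_CMG_THEN_VALIDATE;
    }
}
```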
Implementation details
Node start flow
The following changes are proposed to the node start scenario to accommodate the changes to the join protocol:
Each blue rectangle represents the start of a component; changes are marked in red and notable action points are marked in green.
According to the diagram, the following changes are proposed:
- RESTManager component is started earlier.
- CMGManager component, responsible for managing CMG interactions, is introduced.
- nodeRecoveryFinished action item is introduced. It is a step within the components' start process that denotes that a given node has finished its recovery and is ready to be included in the logical topology.
RestManager changes
RestManager should be started earlier, since it is required to register REST message handlers early to handle the init command.
CMGManager
CMGManager is responsible for interacting with the CMG and should perform the following actions:
- Launch the local CMG Raft server if a local configuration already exists;
- Register a REST message handler for the init command;
- Register a message handler that will be listening for messages from the CMG leader and sending a join request.
Changes in API
TODO
Risks and Assumptions
- Proposed implementation does not discuss message encryption and security credentials;
- Protocol for nodes leaving the topology is out of scope of this document;
- Protocol for migrating CMG and Meta Storage to different nodes is out of scope of this document.
Discussion Links
https://lists.apache.org/thread/4lor2vxkg6x94thprvcr0h19rkm6j1gt
Reference Links
- IEP-73: Node startup
- https://github.com/apache/ignite-3/blob/main/modules/runner/README.md
- IEP-67: Networking module
- IEP-61: Common Replication Infrastructure
Tickets
IGNITE-15114