ID | IEP-128
Author |
Sponsor |
Created |
Status |
- CMG is the Cluster Management Group (it’s a Raft group)
- MG is the Metastorage Group (also a Raft group)
An Ignite cluster has 2 system Raft groups, CMG and MG; both are required for normal functioning. A Raft group loses availability (and stops being operational) if it loses its majority.
If the CMG becomes unavailable, the cluster loses the ability to let nodes join (so initialized nodes cannot start) and hence becomes fragile.
If the MG becomes unavailable, primary partition leases stop being updated (and, as a result, they expire after a few seconds), and the Metastorage SafeTime stops being updated (so SchemaSync starts hanging); both cause incoming operations (any transactional load) to hang, and the cluster becomes inoperable.
Hence we need a way to forcefully repair these groups.
Other types of potential problems (not related to a loss of majority) can also arise; they are summarized in the corresponding section closer to the end of the document.
- It must be possible to repair CMG and/or MG if they lose majority
- The old majority coming back online must not cause trouble
- Best-effort consistency guarantees for the MG: allow a node to join the cluster if only minor inconsistencies (like divergence in SafeTime) are detected. If major inconsistencies are present (like a diverged Catalog), the node will only be allowed to play a ‘zombie’ role (so that the user is able to copy partitions’ data from it).
We can forcefully reset the voting set of a group (breaking the safety of the Raft protocol). This effectively means that we create a new majority that is independent from the old one. Having two independent majorities might cause the following problems if nodes that didn’t take part in the repair come back online:
- If the returning nodes were included in the voting set of the group before the majority loss, then they can form the old majority again. The new leader would append the new configuration (containing the old voting members) to all nodes that it knows about (for MG, this is all nodes in the cluster due to learners). This might cause the leadership of the group to be hijacked by the old nodes that saw no repair, which might result in inconsistent data in the group state machines
- It might happen that the nodes that come back online had applied some command X before going offline such that no node (which took part in the repair) saw that command, leading to the same inconsistent data problem. This could happen like this:
- ABC is the voting member set, N is a learner
- AB and N get X, it gets applied at AB and N
- AB get destroyed, N goes down
- CDE are appointed as the new voting member set
- N starts; only this node has X, and X is applied
We rely on the following property of our networking (it has to be implemented): nodes having different cluster IDs cannot establish a network connection (the handshake fails). A node that was never a part of any cluster (a blank node) can establish a connection with any other node.
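To make this property concrete, here is a minimal sketch of the handshake check (class and method names are illustrative, not the actual networking API): a connection is rejected when both sides already belong to clusters with different IDs, while a blank node may connect to anyone.

    import java.util.UUID;

    // Sketch only: the real check lives in the networking layer, which exchanges
    // cluster IDs as part of the handshake messages.
    final class ClusterIdHandshakeCheck {
        /** localClusterId / remoteClusterId are null for a blank node that never joined any cluster. */
        static boolean handshakeAllowed(UUID localClusterId, UUID remoteClusterId) {
            if (localClusterId == null || remoteClusterId == null) {
                return true; // a blank node can establish a connection with any node
            }
            return localClusterId.equals(remoteClusterId); // different cluster IDs -> handshake fails
        }
    }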
For CMG, we destroy old CMG and recreate another one, doing a reinit with a new cluster ID. As a result, old and new nodes cannot connect to each other.
For MG, we recreate the CMG first, even if it was not broken (to change the cluster identity and prevent old MG nodes from mixing with the new ones); then we repair the MG on the available nodes; finally, after a node gets migrated to the new cluster, when its Metastorage is about to be started, we check whether a joining node that thinks it is a voting member is actually one of metastorageNodes. Together, these measures solve problem 1.
Also, to solve problem 2, we do additional validations on join to only join nodes that are compatible by their Metastorage and switch others to a special ‘zombie’ state.
All data stored in CMG either can be safely lost or can be restored.
First, the user should try to restart CMG nodes to restore the majority. If it does not work, they’ll have to forcefully assign a new majority using the ‘recovery cluster reset’ command (issued manually via CLI/REST).
From a bird’s eye view:
- Assign new cluster ID to the available nodes
- Restart them, recreate CMG on start, do reinit
- Migrate nodes that were offline during steps 1 and 2 to the new cluster
Details:
- User sees in the monitoring that the CMG majority is lost; they try to bring the lost nodes back to life, but this does not work
- User issues a ‘cluster reset’ command passing new CMG nodes (these must be in the physical topology) (‘ignite recovery cluster reset --cluster-management-group=<new-cmg-nodes>’)
- The command gets sent to the first node from new-cmg-nodes
- The node that got the repair command is the Repair Conductor
- The general procedure for CMG recreation is invoked with new CMG nodes given in the command (see below)
After this sequence is executed, the nodes of the cluster that got the ClusterResetMessage will have switched to the new CMG.
The procedure might fail at least for the following reasons:
- Some of the CMG nodes specified by the user are not in the physical topology
- The Conductor does not have all the information it needs to start the procedure
In any of these cases, an error should be reported to the user, and the user should try again (possibly choosing other CMG nodes and/or a different Conductor).
This procedure is initiated while repairing the CMG, MG or both. When only the MG lost its majority, we still re-create the CMG to prohibit the old MG majority from joining the new cluster and interfering with the MG repair.
The procedure gets the following parameters:
- New CMG nodes
- New MG replication factor (if MG is to be repaired)
New CMG nodes will be specified by the user as the ‘cluster reset’ command argument.
The node that got the repair command is the Repair Conductor. The Conductor must have the cluster state in its Vault (to reuse the old cluster name and so on when building the new cluster state), and at least 1 revision must be applied in the Metastorage underlying structures (the MG itself might not be available, but its storage is inspectable). If one of these conditions is broken, it returns an error asking the user to try issuing the command to another node. This makes sure that the Conductor has all the information it needs and that the cluster was initialized (so the initialConfiguration is already applied, as we’ll throw away the contents of the CMG).
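A rough sketch of these preconditions, assuming hypothetical accessors for the cluster state stored in the Vault and for the locally applied Metastorage revision (not the real APIs):

    // Sketch only: accessor and parameter names are assumptions.
    final class ResetPreconditions {
        static void ensureConductorIsEligible(Object clusterStateFromVault, long appliedMetastorageRevision) {
            if (clusterStateFromVault == null) {
                throw new IllegalStateException(
                        "No cluster state in the Vault; issue 'cluster reset' to another node");
            }
            if (appliedMetastorageRevision < 1) {
                throw new IllegalStateException(
                        "No Metastorage revision applied locally; issue 'cluster reset' to another node");
            }
        }
    }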
- The Conductor sends a ClusterResetMessage to all the nodes that are currently in the physical topology, including itself (with one difference: if MG is to be repaired, the message to self includes additional attributes conductor=true and nodes=<set-of-nodes-to-which-the-message-gets-sent>); the message contains the following (a sketch of the message shape is given after this list):
- new CMG nodes
- MG nodes (taken from the old CMG state)
- cluster name (taken from the old CMG state)
- new random cluster ID
- mgReplicationFactor, if provided
- Upon receiving a ClusterResetMessage, a node does the following:
- Stores it to the Vault
- Responds with OK
- Restarts itself (Conductor does this only after it gets OK from all other nodes to which it sent the message, or a timeout passes)
- If a node sees a ClusterResetMessage in the Vault on startup, it uses clusterId from it for handshakes
- During node startup, when going to start the CMGManager, if a ClusterResetMessage is in the Vault, the node does the following:
- Destroys the CMG Raft group and removes local CMG data
- Initializes a new CMG Raft node using the new CMG nodes (from the message) as voting set
- Submits a CmgInitCommand (having CMG nodes, MG nodes, cluster name, new cluster ID) to the new CMG
- If the cluster ID in the local CMG data is different from the cluster ID in the message:
- Removes the ClusterResetMessage from the Vault
- If MG repair is requested in the message, and the MG is not yet available, carries out MG repair logic (see below); otherwise, proceeds with normal startup
- LogicalTopology events are accompanied with a context (bearing the clusterId), so that LogicalTopology listeners are able to react to a non-monotonic change of the Logical Topology version to 1 (they can distinguish between version 1 created for the old cluster ID and version 1 created for the new one)
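For illustration, the contents of the reset message described above could look roughly like the record below (field and type names are illustrative; the real message class and serialization will differ):

    import java.util.Set;
    import java.util.UUID;

    // Sketch only: mirrors the fields listed in the procedure above.
    record ResetMessageSketch(
            Set<String> newCmgNodes,        // new CMG voting set chosen by the user
            Set<String> metastorageNodes,   // taken from the old CMG state
            String clusterName,             // reused from the old cluster state
            UUID newClusterId,              // fresh random ID; old nodes cannot handshake with it
            Integer mgReplicationFactor,    // only present if MG repair was requested
            boolean conductor,              // true only in the copy the Conductor sends to itself
            Set<String> participatingNodes  // only in the Conductor's copy: nodes that got the message
    ) {}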
It might happen that some nodes were down (or were segmented from the main cluster segment) while the CMG repair was happening.
The user issues the ‘cluster migrate’ command via the CLI specifying --old-cluster-url and --new-cluster-url.
- The CLI obtains new cluster state from --new-cluster-url
- It then invokes the /recovery/cluster/migrate REST endpoint on the node at --old-cluster-url sending it the new cluster state
- The node receiving the command sends ClusterResetMessage (having CMG nodes, MG nodes, cluster name and cluster ID from new cluster node) to every node that it sees in its physical topology (including itself)
- This makes all the nodes still in the old cluster switch to the new cluster
If some nodes are stopped and need to be migrated, it’s preferred to start and migrate them one by one. This minimizes potential problems with clients accessing a node that was started just to be migrated.
We will add a mechanism to restart a node in-process (that is, without stopping the current process and starting another one). IgniteServer already contains a reference to an IgniteImpl. The restart will stop the existing IgniteImpl and start another one, swapping the reference value.
The Ignite instance returned by IgniteServer#api() will be a wrapper that will delegate to current IgniteImpl; it will switch from old IgniteImpl to new one during a restart. All public interfaces obtained via Ignite (like IgniteTables, Table and so on) will be wrappers of the same kind.
The restart will be as transparent for the user as possible. While a restart is underway, calls via the wrappers will be blocked; the wait is limited by a timeout. If the timeout elapses before the restart finishes, an unchecked NodeRestartingException will be thrown to the user.
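A minimal sketch of the wrapper idea, assuming a generic delegate holder (the real wrappers will exist per public interface): the restart code holds a write lock for the whole stop-old/start-new sequence, so API calls block until the swap completes or the timeout elapses.

    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.locks.ReentrantReadWriteLock;
    import java.util.function.UnaryOperator;

    // Sketch only: illustrates blocking calls during an in-process restart.
    final class RestartAwareDelegate<T> {
        private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        private final long timeoutMs;
        private T delegate;

        RestartAwareDelegate(T initial, long timeoutMs) {
            this.delegate = initial;
            this.timeoutMs = timeoutMs;
        }

        /** Used by API wrappers: returns the current instance, waiting while a restart is underway. */
        T current() {
            try {
                if (!lock.readLock().tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
                    throw new NodeRestartingException("Node restart did not finish within the timeout");
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new NodeRestartingException("Interrupted while waiting for node restart");
            }
            try {
                return delegate;
            } finally {
                lock.readLock().unlock();
            }
        }

        /** Used by the restart logic: stops the old instance and starts a new one under the write lock. */
        void restart(UnaryOperator<T> stopOldAndStartNew) {
            lock.writeLock().lock();
            try {
                delegate = stopOldAndStartNew.apply(delegate);
            } finally {
                lock.writeLock().unlock();
            }
        }
    }

    // The design names an unchecked NodeRestartingException; a trivial placeholder for the sketch:
    class NodeRestartingException extends RuntimeException {
        NodeRestartingException(String message) {
            super(message);
        }
    }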
Unlike the CMG, MG contents cannot be thrown away and recreated. The MG is present at every initialized Ignite node (thanks to Raft learners), so the strategy of a forceful repair is to pick the nodes with the freshest copies of the MG data and form the new MG voting set from them.
Forceful MG repair looks like forceful CMG recreation followed by MG-specific steps.
On the high level, it works like this:
- Recreate CMG to make sure the old MG majority cannot intervene
- Collect information about MG Raft indexes+terms and choose the freshest nodes
- Reset MG using the freshest nodes as the new voting set
- Migrate to the new cluster any nodes of the old cluster that were down/segmented during the repair
- User gets a notification about the absence of a majority of the MG
- User tries to restart the Ignite nodes hosting MG nodes (or just their Raft nodes inside the Ignite nodes)
- If this does not work, the user issues a ‘recovery cluster reset {--cluster-management-group=<new-cmg-nodes>|--node=<existing-node>} --metastorage-replication-factor=N’ command first making sure that every possible node starts and joins. (If the user specifies the --node argument instead of --cluster-management-group, the command will take current CMG voting members set from the CMG leader (via --node); if the CMG is not available, the command will fail)
- The general procedure for CMG recreation is invoked (see above); Metastorage-specific steps get executed after the CMG is recreated (see below)
If the Repair Conductor crashes after it removes the ClusterResetMessage from the Vault, but before it finishes the MG repair, the procedure has to be repeated again. (Failover is absent on purpose to avoid simultaneous repair attempts and to make sure each attempt starts with a node restart)
It is important that information about MG Raft index+term pairs is collected from all nodes forming the new cluster, so if a Conductor cannot get information from some nodes (or they simply don’t start up in time), the procedure must be started over (by issuing another ‘cluster reset’ command).
These are the steps that get executed after the general CMG recreation procedure ends, when Metastorage-specific attributes are present in the ClusterResetMessage. At this point, the node has found the ClusterResetMessage in its Vault on startup, taken part in the CMG recreation, removed the message from the Vault, and started joining the cluster (passing the basic validation).
- The node pauses its startup routine before starting the Metastorage
- The Repair Conductor waits till all nodes listed in the message appear in the ‘basically validated’ node set (or higher - that is, become fully validated or appear in the logical topology)
- The Conductor broadcasts a MetastorageRepairStartMessage to all nodes that are currently in the ‘basically validated’ node set (or higher)
- Upon receiving a MetastorageRepairStartMessage, an Ignite node does the following:
- Starts the Metastorage Raft group
- Obtains Raft index and term of the Metastorage and sends them in a response to the Conductor
- Waits for the Metastorage Raft group majority to get established, then proceeds with the normal startup
- After receiving responses from all the nodes, the Conductor chooses new MG nodes from those that successfully returned responses [the set of nodes that returned successful responses is U] (see details below); if a node leaves the ‘basically validated’ node set (or a higher set), the procedure ends with an error and ‘cluster reset’ has to be repeated. The new node set must include at least one of the nodes with the highest index+term among U.
- The Conductor submits CmgChangeMgNodesCommand (containing new MG nodes) to the CMG
- Execution of CmgChangeMgNodesCommand on a node changes the metastorageNodes in the cluster state
- The Conductor then chooses one of the new MG nodes having the highest index+term as the new leader and sends it a BecomeMetastorageLeaderMessage
- When a node receives a BecomeMetastorageLeaderMessage, it does resetPeers on the Metastorage Raft group by passing only self as the voting set (and no learners). After it becomes a leader, it returns a successful result to the Conductor. This step is the one that breaks the safety of the Raft protocol.
- When a node becomes an MG leader because it executed BecomeMetastorageLeaderMessage, it does not start issuing idle safe time commands and managing learners yet
- After getting a successful response to BecomeMetastorageLeaderMessage, the Conductor forms a new Raft configuration including all new MG nodes as the voting set and other nodes from U as learners. It then does changePeers via the new leader to switch to this configuration. After the leader elected on the previous step (the sole leader) gets reelected (or ceases to be a leader), the prohibition for it to send idle safe time commands and manage learners is lifted
Having U with their index+term pairs, the Conductor needs to choose new MG nodes. User supplies the desired number of new MG nodes via the command line (--metastorage-replication-factor=N), and the Conductor just chooses top N nodes (by their index+term) trying to make sure that they are in different availability zones.
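A sketch of this selection, assuming the usual Raft freshness ordering (term first, then index) and leaving out the availability-zone spreading; node and field names are illustrative:

    import java.util.Comparator;
    import java.util.List;

    // Sketch only: picks the N freshest nodes from U; the freshest node is always
    // included, as required above.
    record MgIndexTerm(String nodeName, long index, long term) {}

    final class MgVotingSetChooser {
        static List<String> chooseVotingSet(List<MgIndexTerm> responsesFromU, int replicationFactor) {
            return responsesFromU.stream()
                    .sorted(Comparator.comparingLong(MgIndexTerm::term)
                            .thenComparingLong(MgIndexTerm::index)
                            .reversed())
                    .limit(replicationFactor)
                    .map(MgIndexTerm::nodeName)
                    .toList();
        }
    }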
This is analogous to the same action for CMG repair, see Migrating nodes still using the old CMG to the new CMG.
After such a node (which did not take part in the Metastorage repair) is migrated to the new cluster, it could potentially cause problems. The following sections are about preventing them.
If, during a join, a node detects that, according to the MG configuration saved in the MG on this node, it is a member of the voting set (i.e., it’s a peer, not a learner), while it is NOT one of the metastorageNodes in the CMG, then, before starting its MG Raft member, it raises a flag that disallows its Raft node from becoming a candidate. As soon as the Raft node applies a new Raft configuration (coming from the new leader), this flag is cleared. After this, the Raft node is ‘converted’ to the new MG and cannot hijack the leadership.
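A minimal sketch of this guard (names are assumptions; the actual integration with the Raft implementation will differ): the flag is raised before the Raft node starts and cleared once a configuration replicated by the new leader is applied.

    import java.util.Set;
    import java.util.concurrent.atomic.AtomicBoolean;

    // Sketch only: the election code is expected to consult candidacyAllowed
    // before starting a pre-vote/vote.
    final class StaleVoterGuard {
        private final AtomicBoolean candidacyAllowed = new AtomicBoolean(true);

        void beforeStartingMgRaftNode(Set<String> localMgVoters, Set<String> cmgMetastorageNodes, String selfName) {
            boolean thinksItIsVoter = localMgVoters.contains(selfName);
            boolean isVoterAccordingToCmg = cmgMetastorageNodes.contains(selfName);

            if (thinksItIsVoter && !isVoterAccordingToCmg) {
                candidacyAllowed.set(false); // a stale voter must not try to become a leader
            }
        }

        /** Called when the Raft node applies a new configuration coming from the new leader. */
        void onNewConfigurationApplied() {
            candidacyAllowed.set(true);
        }

        boolean candidacyAllowed() {
            return candidacyAllowed.get();
        }
    }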
If some node was not online during the MG repair, it might have already applied an entry that no node in the repaired set had ever seen. This may lead to the Metastorage data diverging.
- Initially, we disallow a node from joining if its Metastorage has any divergence (it is only allowed to exist as a ‘zombie’ node that can be used to download partitions’ data using special tools (out of scope of this phase of the design))
- Later, we might whitelist some types of differences that are considered safe (like if only SafeTime propagation commands differ, or only partition Leases differ and they are all expired)
Currently, during its startup, a node (among other things) does the following:
- Validate itself against the CMG leader by sending a JoinRequestCommand to the CMG (upon successful validation it gets added to the ‘validated’ set)
- Start remaining components
- Complete join via the CMG leader (after this the node gets added to the logical topology)
We need to validate a joining node Metastorage against the ‘cluster’ Metastorage state. Join happens via interaction with the CMG, but an MG leader is needed for validation. This creates a cyclic dependency between validation and the startup of the MG. To avoid this, we split the validation into 2 phases: basic validation and Metastorage validation.
The part of the startup related to join and Metastorage startup now looks like this:
- Validate itself against the CMG leader by sending a JoinRequestCommand to the CMG, doing only basic validations, that is, everything except Metastorage (upon successful validation the node gets added to the ‘basically_validated’ set)
- Start Metastorage (and wait for it to catch up with the leader and apply everything that is committed [this is what is already done during node startup])
- Validate the Metastorage against the MG leader
- Record the fact of Metastorage validity in the CMG by sending a ValidMetastorageCommand (the node gets added to the ‘fully_validated’ set)
- Start remaining components
- Complete join via the CMG leader (after this the node gets added to the logical topology)
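For illustration, the validation states a node passes through on join could be modeled like this (names are illustrative; the real CMG state machine may track them differently):

    // Sketch only: the order of the sets a joining node moves through.
    enum JoinValidationState {
        BASICALLY_VALIDATED,   // passed all checks except the Metastorage one (JoinRequestCommand)
        FULLY_VALIDATED,       // Metastorage validated against the MG leader (ValidMetastorageCommand)
        IN_LOGICAL_TOPOLOGY    // join completed via the CMG leader
    }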
We could compute a hash/checksum for each Metastorage command (including the previous command hash), getting something like a log chain (similar to how git commits are chained). We currently store some metadata about each applied Metastorage revision, namely <revision, ts>; we’ll add the revision hash to this metadata. A joining node will take <revision, hash> for the latest command it has applied on Metastorage recovery as <nodeRevision, nodeHash> and will use it when validating against the MG leader. It will ask the MG leader to provide its current history in the form of <revision, hash> pairs. Let’s denote min and max revisions in the leader’s history as minRevision and maxRevision and leader’s hash for revision R as leaderHash(R). Outcomes:
- minRevision <= nodeRevision <= maxRevision and leaderHash(nodeRevision) = nodeHash -> node is allowed to join
- minRevision <= nodeRevision <= maxRevision and leaderHash(nodeRevision) != nodeHash -> the node is not allowed to join
- nodeRevision > maxRevision -> disallow join (maybe it actually diverged and the leader is going to grow its history in another direction while the current node is joining), or we could wait for the MG history on the leader to reach the same revision as the joining node has and then try again
- nodeRevision < minRevision (some history is removed due to compaction) -> we don’t know if it diverged -> the node is not allowed to join
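A sketch of the chained hashes and the resulting join decision, under the assumption that a hash such as SHA-256 over (previous hash, revision, command bytes) is used and that the leader exposes its retained history as a revision-to-hash map (all names here are illustrative):

    import java.nio.ByteBuffer;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.NavigableMap;

    // Sketch only: illustrates the outcomes listed above.
    final class RevisionHashChain {
        static byte[] nextHash(byte[] previousHash, long revision, byte[] commandBytes) throws NoSuchAlgorithmException {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            digest.update(previousHash);
            digest.update(ByteBuffer.allocate(Long.BYTES).putLong(revision).array());
            digest.update(commandBytes);
            return digest.digest();
        }

        enum JoinDecision { ALLOW, REJECT_DIVERGED, WAIT_OR_REJECT, REJECT_COMPACTED }

        /** leaderHistory maps revision to hash for the leader's retained (non-compacted) history. */
        static JoinDecision validate(NavigableMap<Long, byte[]> leaderHistory, long nodeRevision, byte[] nodeHash) {
            long minRevision = leaderHistory.firstKey();
            long maxRevision = leaderHistory.lastKey();

            if (nodeRevision > maxRevision) {
                return JoinDecision.WAIT_OR_REJECT;   // leader has not reached this revision yet
            }
            if (nodeRevision < minRevision) {
                return JoinDecision.REJECT_COMPACTED; // divergence cannot be ruled out
            }
            return MessageDigest.isEqual(leaderHistory.get(nodeRevision), nodeHash)
                    ? JoinDecision.ALLOW
                    : JoinDecision.REJECT_DIVERGED;
        }
    }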
If a network partition happens and the majority of MG remains functional in one segment, while the user only observes another segment where they do a forceful repair, there is a possibility for the MG to diverge. To reduce the probability of such an event, we could add TopologyValidators similar to Apache Ignite 2: if a validator thinks that the current topology became invalid, it switches the Metastorage to read-only mode. For example, a validator might react to a sudden drop of the number of online nodes.
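As an illustration of the idea (the interface below is hypothetical for Ignite 3, loosely modeled on Apache Ignite 2’s TopologyValidator), a validator could flag a topology as invalid when too large a fraction of nodes disappears at once, which would switch the Metastorage to read-only mode:

    import java.util.Collection;

    // Sketch only: a hypothetical validator interface for the read-only switch.
    interface MetastorageTopologyValidator {
        /** Returns false if the Metastorage should be switched to read-only mode for this topology. */
        boolean validate(Collection<String> onlineNodes, int previousTopologySize);
    }

    final class SuddenDropValidator implements MetastorageTopologyValidator {
        private final double maxDropFraction; // e.g. 0.5 means "losing more than half at once is suspicious"

        SuddenDropValidator(double maxDropFraction) {
            this.maxDropFraction = maxDropFraction;
        }

        @Override
        public boolean validate(Collection<String> onlineNodes, int previousTopologySize) {
            if (previousTopologySize == 0) {
                return true;
            }
            double dropped = previousTopologySize - onlineNodes.size();
            return dropped / previousTopologySize <= maxDropFraction;
        }
    }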
There are two possibilities:
- All nodes that were not destroyed have returned to the cluster or were cleaned, and the Metastorage did not diverge. In the worst case, we lost some Metastorage updates cluster-wide, but each of them was only known to the nodes that were destroyed or cleaned. Not a single node in the repaired cluster has ever seen any of these updates, including updates to the Catalog. Not a single node in the repaired cluster has any tuple of a version that was lost. So the cluster was just cleanly pushed back in time with regards to the Metastorage.
- Some nodes that did not take part in the repair were attempted to be returned to the cluster, but the Metastorage on them was found to have diverged with respect to the new cluster, so they are not allowed to finish the join procedure. They can only be ‘zombies’. The nodes that took part in the repair demonstrate the same property as in item 1 (above).
The cluster consists of 3 nodes: ABC. The CMG is on A, the MG is on B. Cluster name/ID is Galileo/12345.
- All 3 nodes get segmented (we get 3 network segments, one per node)
- User is in the B segment. They see that cmg.available metric drops to 0 on B (the only observable node)
- User issues
ignite recovery cluster reset --cluster-management-group=B
command - B gets ResetClusterMessage(cmgNodes=[B], clusterName=Galileo, clusterId=54321). It saves it to the Vault and restarts
- During the restart, it starts using new name/ID (Galileo/54321) for network handshakes
- Network partition disappears, A and C get restarted, but they cannot connect to B. They form a cluster AC (as A still hosts old CMG)
- B finds the message in the Vault, clears the CMG locally and does reinit: the new cluster is formed. Now it removes the message from the Vault.
- User issues
ignite recovery cluster migrate --old-cluster-url=A --new-cluster-url=B
command - As A and C see each other in the physical topology, both A and C get ResetClusterMessage(cmgNodes=[B], clusterName=Galileo, clusterId=54321), they save these to their Vaults and restart
- Upon restart, A and C find the message in the Vault, clear their local CMG, connect to the new CMG (from the message) and remove the message. Now they are also in the new cluster
The cluster consists of 3 nodes: ABC. Both the CMG and MG are on A. Cluster name/ID is Galileo/12345.
- All 3 nodes get segmented (we get 3 network segments, one per node)
- User is in the B segment. They see that cmg.available and mg.available metrics drop to 0 on B (the only observable node)
- User issues
ignite recovery cluster reset --cluster-management-group=B
--metastorage-replication-factor=1
command - B gets ResetClusterMessage(cmgNodes=[B], clusterName=Galileo, clusterId=54321, mgReplicationFactor=1, conductor=true, nodes=[B]). It saves it to the Vault and restarts
- During the restart, it starts using new clusterId (54321) for network handshakes
- Network partition disappears, A and C get restarted, but they cannot connect to B. They form a cluster AC (as A still hosts old MG and CMG) and start writing to the Metastorage with term=6 (and indexes=101+)
- B finds the message in the Vault, clears the CMG locally and does reinit: the new cluster is formed. Now it removes the message from the Vault.
- B finds that the MG on it has <index, term> equal to <100, 5>. It chooses itself as the new MG voting set and does resetPeers. Now we have a second, independent, MG.
- User issues
ignite recovery cluster migrate --old-cluster-url=A --new-cluster-url=B
command - As A and C see each other in the physical topology, both A and C get ResetClusterMessage(cmgNodes=[B], clusterName=Galileo, clusterId=54321), they save these to their Vaults and restart
- Upon restart, A and C find the message in the Vault, clear their local CMG, connect to the new CMG (from the message) and remove the message.
- A and C try to validate their Metastorage. It has diverged, so none of them can join, both switch to the ‘zombie’ state
The cluster consists of 5 nodes: ABCDE. The CMG is on CDE, the MG is on ABC. Cluster name/ID is Galileo/12345.
- Nodes A and B are destroyed
- User sees that cmg.available and mg.available metrics drop to 0 on CDE
- User issues
ignite recovery cluster reset --node=C
--metastorage-replication-factor=3
command - CDE get ResetClusterMessage(cmgNodes=[CDE], clusterName=Galileo, clusterId=54321, mgReplicationFactor=3, conductor=true, nodes=[CDE]). They save it to the Vault and restart
- During the restart, CDE start using new clusterId (54321) for network handshakes
- CDE find the message in the Vault, clear the CMG locally and do reinit: the new cluster is formed. Now CDE remove the message from the Vault.
- C is the Repair Conductor. It initiates the MG repair. CDE find that the MG on them has <index, term> equal to <100, 5>, <102, 5> and <102, 5>, respectively, and return this information to C.
- The Conductor chooses CDE as the new voting members
- D is appointed to be the leader; then C does changePeers and expands the MG configuration to CDE. Now there is a fully functional MG, and the nodes finish their startup procedure.
Sometimes, the user knows that a node is going to fail. In such situations, an API for manual CMG/MG reconfiguration using normal Raft protocol could save the day. This could look like a ‘cmg/metastorage reconfigure’ command that would invoke changePeers on the corresponding group. This is out of scope of this design.
Another way is to automatically reconfigure a system group if it loses a minority for some time (analogously to what is done for partitions). This is also out of scope of this design.
It might happen that a command gets added to a Raft group’s log that fails during execution with a 100% rate. We can just truncate the log starting with such a command and restart the Raft node.
- /management/v1/recovery/cluster/reset
- /management/v1/recovery/cluster/migrate
- /management/v1/recovery/cmg/state/local
- /management/v1/recovery/cmg/state/global
- /management/v1/recovery/cmg/restart
- /management/v1/recovery/cmg/truncate
- /management/v1/recovery/metastorage/state/local
- /management/v1/recovery/metastorage/state/global
- /management/v1/recovery/metastorage/restart
- /management/v1/recovery/metastorage/truncate
- ignite recovery cluster reset [--node <nodeName> | --cluster-management-group <nodeNames>] [--metastorage-replication-factor=N]
- ignite recovery cluster migrate --old-cluster-url <nodeName> --new-cluster-url <nodeName>
- ignite recovery cluster restart cmg|metastorage [--nodes <nodeNames>]
- ignite recovery cluster states cmg|metastorage [--local [--nodes <nodeNames>] | --global]
- ignite recovery cluster truncate cmg|metastorage --index <index>
‘states’ commands return either local (Healthy/Initializing/Snapshot installation/Catching up/Broken) or global (Available/Degraded/Unavailable) states of the corresponding groups. For ‘states --local’, kind (voting member/learner) and index+term are also returned.
- cmg.availablePeers is an int gauge, reports number of available members of the CMG voting set
- cmg.available is an int gauge, its value is 1 if the CMG majority is available, 0 otherwise
- metastorage.availablePeers is an int gauge, reports number of available members of the MG voting set
- metastorage.available is an int gauge, its value is 1 if the MG majority is available, 0 otherwise
- schemaSync.waits is a reservoir (a histogram?) of times of all waits (in milliseconds) caused by schema synchronization (might indicate that something is wrong with the Metastorage)
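For reference, the ‘available’ gauges boil down to a simple majority check against the corresponding voting set; a sketch (input names are assumptions):

    // Sketch only: how cmg.available / metastorage.available relate to availablePeers.
    final class GroupAvailability {
        static int majority(int votingSetSize) {
            return votingSetSize / 2 + 1;
        }

        /** 1 if a majority of the voting set is reachable, 0 otherwise. */
        static int availableGauge(int availablePeers, int votingSetSize) {
            return availablePeers >= majority(votingSetSize) ? 1 : 0;
        }
    }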
Logical topology versions are no longer monotonic, as during a cluster reset the version number falls back to 1. This has to be handled carefully in the rebalancing code.
- IEP-77: Node Join Protocol and Initialization for Ignite 3
- IEP-126: Table/zone disaster recovery