ID | IEP-133
Author |
Sponsor |
Created | 23 Dec 2024
Status | DRAFT
Replicated zones are zones whose partitions have replicas on every cluster node that matches the zone's node filter and storage profile filter. Such zones can provide better performance on SQL joins.
Currently, Apache Ignite 3 allows setting any number of replicas for any zone, regardless of its initial replica count. The problem is that in large clusters, if a zone is created with a replica count equal to the cluster size, every replication group has too many members participating in consensus, which has a serious impact on performance. Moreover, there is no automatic scale-up for this case: if the user decides to scale up the cluster, they have to change the replica count manually (unless it is already large enough to exceed the number of nodes).
Therefore, we need special zone options that allow replication group members which are not part of the consensus, and that let such zones scale along with the cluster.
Replication group - full set of replicas of a partition.
Consensus group - subset of replicas of a partition that maintains the data consistency in the replication group. For example, in the case of Raft it is the set of voting members.
Consensus replicas - replicas that form the consensus group.
Learners - replicas of a replication group that are not included into the consensus group.
Quorum - the minimal subset of replicas in the consensus group that is required for the group to be fully operational and to maintain data consistency; in the case of Raft it is the majority of voting members.
Data node - a cluster node that is included into the distribution zone because it matches its filters and storage profiles; the zone's data and replicas are placed on such nodes.
REPLICAS
This is an existing parameter of integer type that specifies the constant size of a replication group. The proposal is to add one more acceptable value for this parameter: ALL, meaning that replicas should be created on every data node of the zone.
QUORUM_SIZE
This is the new parameter proposed here. It specifies the size of the majority quorum. The size of the consensus group can be derived from it, depending on the replication algorithm; in the case of Raft it is calculated as QUORUM_SIZE * 2 - 1, so the quorum size is the size of the majority of the consensus group of a replication group. This implies that the size of the consensus group will be an odd number, provided there are enough replicas.
The QUORUM_SIZE parameter may be combined with any sufficient number of replicas, whether ALL or a concrete value. For example, with 10 data nodes, 7 replicas and quorum size 3, 5 replicas will form the consensus group and 2 will be learners.
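To make the arithmetic above concrete, here is a minimal sketch of the derivation, assuming a Raft-style majority quorum; the class and method names are illustrative only and not part of the actual Ignite 3 code:

// Illustrative helper; names are hypothetical, not the actual Ignite 3 API.
final class QuorumMath {
    /** Majority quorum for a given consensus group size: floor(n / 2) + 1. */
    static int majority(int consensusGroupSize) {
        return consensusGroupSize / 2 + 1;
    }

    /** Consensus group size derived from QUORUM_SIZE (Raft case), capped by the replica count. */
    static int consensusGroupSize(int quorumSize, int replicas) {
        return Math.min(quorumSize * 2 - 1, replicas);
    }

    public static void main(String[] args) {
        int replicas = 7;   // REPLICAS
        int quorumSize = 3; // QUORUM_SIZE

        int consensus = consensusGroupSize(quorumSize, replicas); // 5
        int learners = replicas - consensus;                      // 2

        System.out.println("consensus=" + consensus + ", learners=" + learners);
        System.out.println("majority of consensus group: " + majority(consensus)); // 3, matches QUORUM_SIZE
    }
}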
The default value should be
Lower and upper boundaries:
The quorum size parameter can conflict with an invalid replica count or an insufficient number of data nodes. In such cases we should throw an exception rather than create a group without a majority. There should also be dedicated error codes for each case.
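A possible shape of that validation, as a hedged sketch (the exact boundaries are defined by this proposal's limits above; the exception type and messages are placeholders for the dedicated error codes):

// Sketch only: placeholders for the dedicated error codes and exact boundary checks.
static void validateQuorumSize(int quorumSize, int replicas, int dataNodes) {
    if (replicas < quorumSize) {
        throw new IllegalArgumentException("QUORUM_SIZE " + quorumSize
                + " cannot be satisfied with only " + replicas + " replicas");
    }

    if (dataNodes < quorumSize) {
        throw new IllegalArgumentException("QUORUM_SIZE " + quorumSize
                + " cannot be satisfied with only " + dataNodes + " matching data nodes");
    }
}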
Quorum size was chosen as the parameter because it is straightforward for the user: as long as they keep QUORUM_SIZE nodes in the zone, there should be no data loss, unless multiple nodes are lost simultaneously so that the quorum cannot be transferred to the nodes that remain online.
Also, many widely-used replication algorithms with strict consistency are based on consensus algorithms that require a majority quorum, whose size is floor(nodes / 2) + 1. For example, Raft [1] and ZooKeeper [2] use a majority quorum to replicate the data. Membership-based protocols, such as Hermes [3], may not use a majority for replication but still need a majority-based protocol to maintain a consistent view of the cluster topology on every node. So, the size of the consensus group can be calculated from the quorum size.
The aforementioned parameters of distribution zones should be accessible via SQL commands CREATE ZONE and ALTER ZONE.
QUORUM_SIZE should be added as an optional parameter of these commands. It should be of type int.
For example:
CREATE ZONE MY_ZONE WITH REPLICAS = ALL, QUORUM_SIZE = 2
This means that replicas will be created on all nodes of the cluster, and the size of the consensus group will be 3. So, if there are 7 nodes, there will be 7 replicas for each partition (3 consensus replicas and 4 learners).
Now the assignments should be calculated taking the size of the quorum (and thus of the consensus group) into account, using the flags Assignment.forPeer and Assignment.forLearner.
An additional requirement is that consensus replicas should be distributed across the cluster as evenly as the algorithm allows for all replicas. This is required because the consensus group bears additional load related to maintaining data consistency; also, group leaders are chosen from the members of the consensus group.
The assignments calculation for learners via Rendezvous should be simple: the Rendezvous algorithm builds a list of nodes for each partition and then takes the first N of them (where N is the replica factor) as assignments. We should modify it so that it takes the first N nodes as the consensus group (where N is the size of that group) and the next M nodes as learners (where M is the learner count).
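A minimal sketch of that modification, assuming the Assignment.forPeer / Assignment.forLearner factories mentioned above and a Rendezvous-ordered node list per partition (the method shape and signatures are illustrative, not the actual distribution code):

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch only: the real code lives in the affinity/distribution module and may differ in shape.
final class RendezvousSplitSketch {
    /** Splits the Rendezvous-ordered node list for one partition into consensus replicas (peers) and learners. */
    static Set<Assignment> partitionAssignments(List<String> nodesInRendezvousOrder, int consensusGroupSize, int replicas) {
        Set<Assignment> assignments = new HashSet<>();

        int total = Math.min(replicas, nodesInRendezvousOrder.size());

        for (int i = 0; i < total; i++) {
            String consistentId = nodesInRendezvousOrder.get(i);

            // The first N nodes become consensus (voting) replicas, the next M become learners.
            assignments.add(i < consensusGroupSize
                    ? Assignment.forPeer(consistentId)
                    : Assignment.forLearner(consistentId));
        }

        return assignments;
    }
}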
On node join, if the node matches the zone filter and has an appropriate storage profile, it may be included into the zone according to the zone's scale-up timer. In this case, assignments should be recalculated as for a scale-up of a regular zone, so new replicas will be created on the new node. Those replicas will be either learners or consensus replicas, according to the new assignments: moving some consensus replicas to the new node allows it to take over part of the consensus-related load. This also launches the rebalancing process that moves data to the new node. It may additionally involve upgrading replicas on new nodes from learners to consensus replicas and downgrading some consensus replicas on older nodes to learners, when new consensus replicas are created on the new nodes, because we want to preserve the consensus group size.
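For illustration, a purely conceptual sketch of such a scale-up for one partition (node names are hypothetical; the consensus group size stays 3 while the replica count grows with the zone):

import java.util.Set;

// Conceptual example only: scale-up with REPLICAS = ALL and a consensus group of size 3.
final class ScaleUpExample {
    public static void main(String[] args) {
        Set<String> peersBefore = Set.of("A", "B", "C");
        Set<String> learnersBefore = Set.of("D", "E");

        // Node "F" joins and matches the zone filters, so the recalculated assignments place a
        // replica on it. If that replica is a consensus one, an older peer is demoted to a
        // learner so that the consensus group size is preserved; rebalancing copies data to "F".
        Set<String> peersAfter = Set.of("A", "B", "F");      // "C" demoted, "F" promoted
        Set<String> learnersAfter = Set.of("C", "D", "E");

        System.out.println(peersBefore + " / " + learnersBefore + " -> " + peersAfter + " / " + learnersAfter);
    }
}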
On node shutdown or failure, the zone scale-down may happen according to the scale-down timer. This will also trigger the assignments recalculation. There are several possibilities (for each partition group):
Default timers are: IMMEDIATE for scale-up and INFINITE for scale-down. They should remain the same for the zones with REPLICAS = ALL.
No changes to the read-write transaction protocol are planned; it continues to work as is, through primary replicas only.
Read-only transactions should be able to work with learners in the same way as with regular non-primary replicas. Safe time should be propagated to the learners along with any data changes and with idle propagation requests, as there is no difference from the point of view of the transaction protocol. So, safe-time-based reads from learners will work in the same manner as they do from non-primary replicas.
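As a rough illustration of that rule (the types are simplified stand-ins for the actual transaction-protocol classes, not the real API):

// Simplified stand-in: a learner, like any non-primary replica, can serve a read-only
// request at readTimestamp once its locally observed safe time has reached that timestamp.
final class SafeTimeReadSketch {
    static boolean canServeReadOnly(long readTimestamp, long replicaSafeTime) {
        return replicaSafeTime >= readTimestamp;
    }
}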
In the case of quorum loss, partitions may be reset in the same way as designed for regular zones without learners. We will have to ensure this by extending the test coverage for disaster recovery scenarios:
[1] Raft: https://raft.github.io/raft.pdf
[2] ZooKeeper internals, quorums: https://zookeeper.apache.org/doc/current/zookeeperInternals.html#sc_quorum
[3] Hermes protocol: https://arxiv.org/pdf/2001.09804 (sections 2.4, 3.4)