IDIEP-133

Author:
Sponsor:
Created: 23 Dec 2024
Status: DRAFT

Motivation

Replicated zones are zones with partitions that have replicas on every node of the cluster that matches the zone filter and storage profile filter. These zones can provide better performance on SQL joins. 

Currently, Apache Ignite 3 allows setting any number of replicas for any zone, regardless of its initial replica number. The problem is that in large clusters, if we create a zone with a replica count equal to the cluster size, each replication group will have too many members maintaining consensus, which seriously impacts performance. Moreover, there is no automatic scale-up for this case: if the user scales up their cluster, they have to change the replica count manually (unless it is already large enough to exceed the number of nodes).

Therefore, we need special zone options that allow creating replication group members that are not part of the consensus, and that let the zone scale along with the cluster.

Description

Requirements

  • the user should be able to create a zone with partitions that have replicas on every node of the cluster that matches the zone filter and storage profiles;
  • replicas should remain operational even when they are not included into the consensus group; let’s call such replicas learners. Learners are created on every node matching the zone filters where a regular replica is not created;
  • read-write transactions work as on regular zones (i.e. through primary replicas);
  • read-only transactions and SQL within read-only transactions use learners as regular non-primary replicas;
  • filters and storage profiles for the zones having learners should work in the same way as for regular zones;
  • auto scaling: if new nodes are included into the cluster, learners are created there automatically (if the nodes match the zone filters).

Proposal

Definitions

Replication group - full set of replicas of a partition.

Consensus group - subset of replicas of a partition that maintains the data consistency in the replication group. For example, in the case of Raft it is the set of voting members.

Consensus replicas - replicas that form the consensus group.

Learners - replicas of a replication group that are not included into the consensus group.

Quorum - the minimal subset of replicas in the consensus group that is required for it to be fully operational and maintain the data consistency, in the case of Raft it is the majority of voting members.

Data node - a node of the cluster that is included into the distribution zone because it matches its filters and storage profiles; it is therefore used for storing the data, and replicas are created on it.

Changes of distribution zone parameters

REPLICAS

This is an existing parameter of integer type that specifies the constant size of a replication group. The proposal is to add one more acceptable value for this parameter: ALL, meaning that replicas should be created on every data node of the zone.

QUORUM_SIZE

This is the new parameter proposed here. It specifies the size of the majority quorum. The size of the consensus group can be derived from it, depending on the replication algorithm: in the case of Raft it is calculated as QUORUM_SIZE * 2 - 1, so the quorum size is the size of the majority of the consensus group of a replication group. This implies that the size of the consensus group will be an odd number, if there are enough replicas.

The QUORUM_SIZE parameter may be set with any sufficient number of replicas, either ALL or not. This means that we can have, for example, 10 data nodes, 7 replicas and quorum size 3, meaning that 5 replicas will form the consensus group and 2 will be learners.
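
The relation between the quorum size, the consensus group size and the learner count can be illustrated with a small sketch (the helper class below is purely illustrative and is not part of the Ignite API):

public final class QuorumMath {
    /**
     * For Raft-style majority quorums the consensus group size is derived from the
     * quorum size as QUORUM_SIZE * 2 - 1, capped by the number of available replicas.
     */
    public static int consensusGroupSize(int quorumSize, int replicas) {
        return Math.min(quorumSize * 2 - 1, replicas);
    }

    /** Replicas that do not fit into the consensus group become learners. */
    public static int learnersCount(int quorumSize, int replicas) {
        return replicas - consensusGroupSize(quorumSize, replicas);
    }

    public static void main(String[] args) {
        // The example from the text: 7 replicas and quorum size 3
        // result in a consensus group of 5 replicas and 2 learners.
        System.out.println(consensusGroupSize(3, 7)); // 5
        System.out.println(learnersCount(3, 7));      // 2
    }
}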

The default value should be:

  • If there are 4 or fewer data nodes: min(2, data_nodes_count);
  • If there are at least 5 data nodes: 3.

Lower and upper boundaries:

  • Lower: 1 if there is only one replica and 2 if there is more than one replica. Having a quorum of 1 node where there are more replicas makes no sense and decreases reliability;
  • Upper: no less than the lower bound, but such that the consensus group fits into the configured replica count.

The quorum size parameter can conflict with an incorrect replica count or an insufficient number of data nodes. In such cases we should throw an exception instead of creating a group without a majority, with a dedicated error code for each case.
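
A minimal sketch of the default value and boundary rules described above, assuming a Raft-style majority quorum; the class, method names and exception type are placeholders rather than the actual implementation:

public final class QuorumSizeRules {
    /** Default: min(2, dataNodesCount) for 4 or fewer data nodes, otherwise 3. */
    public static int defaultQuorumSize(int dataNodesCount) {
        return dataNodesCount <= 4 ? Math.min(2, dataNodesCount) : 3;
    }

    /** Lower bound: 1 for a single replica, 2 otherwise. */
    public static int lowerBound(int replicas) {
        return replicas == 1 ? 1 : 2;
    }

    /** Upper bound: the largest quorum whose consensus group (2 * quorum - 1) still fits into the replica count. */
    public static int upperBound(int replicas) {
        return Math.max(lowerBound(replicas), (replicas + 1) / 2);
    }

    /** Rejects configurations that would produce a group without a majority. */
    public static void validate(int quorumSize, int replicas) {
        if (quorumSize < lowerBound(replicas) || quorumSize > upperBound(replicas)) {
            throw new IllegalArgumentException("QUORUM_SIZE " + quorumSize + " is out of bounds ["
                + lowerBound(replicas) + ", " + upperBound(replicas) + "] for " + replicas + " replicas");
        }
    }
}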

Quorum size is chosen as the parameter because it is quite straightforward for the user: as long as they are able to keep QUORUM_SIZE nodes in the zone, there should be no data loss, unless multiple nodes are lost simultaneously so that the quorum cannot be transferred to the nodes that remain online.

Also, many widely-used replication algorithms with strict consistency are based on consensus algorithms that require a majority quorum, which has size floor(nodes / 2) + 1. For example, Raft [1] and ZooKeeper [2] use a majority quorum to replicate the data. Membership-based protocols, such as Hermes [3], may not use a majority for replication but need a majority-based protocol to maintain a consistent view of the cluster topology on every node. So, the size of the consensus group can be calculated using the quorum size.

SQL API changes

The aforementioned parameters of distribution zones should be accessible via SQL commands CREATE ZONE and ALTER ZONE.

QUORUM_SIZE should be added as an optional parameter of these commands. It should be of type int.

For example:

CREATE ZONE MY_ZONE WITH REPLICAS = ALL, QUORUM_SIZE = 2

This means that replicas will be created on all nodes of the cluster, and the size of the consensus group will be 3. So, if there are 7 nodes, there will be 7 replicas for each partition (3 consensus replicas and 4 learners).
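
As an illustration only, the same DDL could be issued through the Ignite 3 Java client; the connection address below is a placeholder:

import org.apache.ignite.client.IgniteClient;

public class CreateZoneExample {
    public static void main(String[] args) throws Exception {
        // Placeholder address; point it at an actual cluster endpoint.
        try (IgniteClient client = IgniteClient.builder()
                .addresses("127.0.0.1:10800")
                .build()) {
            // Replicas on every matching data node, quorum size 2 (consensus group of 3).
            client.sql().execute(null, "CREATE ZONE MY_ZONE WITH REPLICAS = ALL, QUORUM_SIZE = 2");
        }
    }
}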

Assignments

Now the assignments should be calculated taking the size of the quorum (and the consensus group) into account, using the Assignment.forPeer and Assignment.forLearner flags.

The additional requirement here is that consensus group replicas should be distributed across the cluster as evenly as the algorithm allows for all replicas. This is required because the consensus group bears additional load related to providing data consistency; also, group leaders are chosen from the members of the consensus group.

The assignments calculation for learners via Rendezvous should be simple: the Rendezvous algorithm builds an ordered list of nodes for each partition and takes the first N of them (where N is the replica factor) as assignments. We should modify it to take the first N nodes as the consensus group (where N is the size of this group) and the next M nodes as learners (where M is the learner count).
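
A minimal sketch of the modified slicing step, assuming the Rendezvous hash has already produced the ordered node list for a partition; the Assignment record below is a simplified stand-in for the real class with the forPeer/forLearner flags:

import java.util.ArrayList;
import java.util.List;

record Assignment(String nodeName, boolean isPeer) {
    static Assignment forPeer(String nodeName) { return new Assignment(nodeName, true); }
    static Assignment forLearner(String nodeName) { return new Assignment(nodeName, false); }
}

final class LearnerAwareAssignments {
    /**
     * Slices the node list already ordered by the Rendezvous hash for a partition:
     * the first consensusGroupSize nodes become consensus replicas (peers),
     * the next learnersCount nodes become learners.
     */
    static List<Assignment> slice(List<String> orderedNodes, int consensusGroupSize, int learnersCount) {
        List<Assignment> assignments = new ArrayList<>();

        int total = Math.min(orderedNodes.size(), consensusGroupSize + learnersCount);

        for (int i = 0; i < total; i++) {
            String node = orderedNodes.get(i);

            assignments.add(i < consensusGroupSize ? Assignment.forPeer(node) : Assignment.forLearner(node));
        }

        return assignments;
    }
}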

Distribution zone scale up and scale down

Scale up

On node join, if the node matches the zone filter and has an appropriate storage profile, it may be included into the zone according to the zone scale-up timer. In this case, assignments should be recalculated as for a scale-up of a regular zone, so new replicas will be created on the new node. The replicas will be either learners or consensus ones, according to the new assignments: moving some consensus replicas to the new node allows it to take over part of the consensus-related load. This also launches the rebalancing process that moves data to the new node. It can also involve promoting learners on new nodes to consensus replicas and demoting some consensus replicas on older nodes to learners, when new consensus replicas are created on the new nodes, as we want to preserve the consensus group size.

Scale down

On node shutdown or failure, the zone scale-down may happen according to the scale-down timer. This will also trigger the assignments recalculation. There are several possibilities for each partition group (a simplified sketch follows the list):

  • Learner(s) left the cluster: no actions required, it is not expected that the location of other consensus or learner replicas will change;
  • Consensus replica left the cluster, there are learners, and the majority quorum is preserved: to maintain the number of consensus replicas in the zone, a learner should be promoted to a consensus replica and included into the consensus group;
  • Consensus replica left the cluster, there are no learners, but majority quorum is preserved: no actions required;
  • Consensus replica left the cluster, there are no learners, and the majority quorum is lost: the zone should be recoverable using the regular disaster recovery procedure;
  • Multiple nodes failed at once, consensus replicas left the cluster, there are learners but the majority quorum is lost: this also requires the disaster recovery procedure. Learners cannot be promoted to consensus replicas automatically, because this may lead to split-brain (for example, when the “failed” nodes are still running but reside in an unavailable network segment).
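
A simplified sketch of this decision logic for a single partition group; the types and action names are placeholders, not the actual implementation:

import java.util.List;

final class ScaleDownHandler {
    enum Action { NONE, PROMOTE_LEARNER, DISASTER_RECOVERY }

    /** Decides how to react after nodes left the cluster, given the surviving replicas of the partition. */
    static Action onNodesLeft(List<String> aliveConsensusReplicas, List<String> aliveLearners, int quorumSize) {
        // The majority quorum is preserved while at least quorumSize consensus replicas are alive.
        if (aliveConsensusReplicas.size() < quorumSize) {
            // Quorum lost: learners must not be promoted automatically (split-brain risk),
            // so the regular disaster recovery procedure is required.
            return Action.DISASTER_RECOVERY;
        }

        // Quorum preserved: promote a learner only if the consensus group shrank below
        // its configured size (quorumSize * 2 - 1) and there is a learner to promote.
        boolean consensusShrank = aliveConsensusReplicas.size() < quorumSize * 2 - 1;

        return consensusShrank && !aliveLearners.isEmpty() ? Action.PROMOTE_LEARNER : Action.NONE;
    }
}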

Zone auto-adjust timers

Default timers are: IMMEDIATE for scale-up and INFINITE for scale-down. They should remain the same for the zones with REPLICAS = ALL.

Risks and assumptions

Read-write transactions

No changes of read-write transaction protocol are planned. It continues working as is - through primary replicas only.

Read-only transactions

Read-only transactions should be able to work with learners in the same way as with regular non-primary replicas. Safe time should be propagated to the learners along with any data changes and with idle propagation requests, so that there is no difference from the point of view of the transaction protocol. Thus, safe-time-based reading from learners will work in the same manner as it works with non-primary replicas.

Disaster recovery

In case of quorum loss, partitions may be reset in the same way as designed for regular zones without learners. We will have to ensure this by extending the test coverage for disaster recovery scenarios:

  • if the quorum is lost and there is no ability to turn a learner into a consensus replica (they may be offline, outdated, etc.), then disaster recovery should work as for regular zones without learners;
  • the case when there are fewer nodes than the configured quorum size, meaning that the configured quorum size will be overridden.

Reference links

[1] https://raft.github.io/raft.pdf

[2] https://zookeeper.apache.org/doc/current/zookeeperInternals.html#sc_quorum

[3] Hermes protocol: https://arxiv.org/pdf/2001.09804 (sections 2.4, 3.4)

Tickets

IGNITE-23790
