When persistent storage was introduced in Apache Ignite, the cluster membership mechanism was not changed, which means that every node startup or shutdown is treated as a cluster membership change. This leads to several performance, usability and consistency issues:
Other persistence-enabled clusters separate cluster membership and data distribution. If a node fails, it is considered temporarily offline, and no data movement starts until an administrator confirms that the node is offline permanently.
To resolve the issues described above, we introduce a concept of affinity baseline topology: a target set of nodes intended to keep data for persistence-enabled caches. We will also attach a list of hashes to each baseline topology generated on branching points that will allow us to keep track of activation history and prevent data divergence in the cluster. A branching point is a cluster action which may affect data integrity of the cluster. One of the examples of branching points is cluster activation.
Baseline topology and branching history must be stored in a reliable fail-safe metadata storage which should be available for reading and updating upon node join. This is needed to detect branch divergence and prevent stale nodes from joining the cluster. Each baseline topology change and branching action is saved in the metadata storage.
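The divergence check described above can be sketched as follows. This is a minimal illustration, not the Ignite implementation: the class name `BranchingHistorySketch`, the representation of the history as a list of `Long` hashes, and the "node history must be a prefix of the cluster history" rule are all assumptions made for the example.

```java
import java.util.List;

/** Hypothetical sketch; none of these names are Ignite APIs. */
final class BranchingHistorySketch {
    /**
     * Assumed compatibility rule: a joining node is accepted only if its
     * branching history is a prefix of the cluster's branching history.
     * A node that recorded a branching point the cluster never saw has diverged.
     */
    public static boolean isCompatible(List<Long> clusterHistory, List<Long> nodeHistory) {
        if (nodeHistory.size() > clusterHistory.size())
            return false; // The node has branching points unknown to the cluster.

        // Compare the common prefix hash by hash.
        return clusterHistory.subList(0, nodeHistory.size()).equals(nodeHistory);
    }
}
```

Under this rule, a node that was offline during the latest activations can still join, while a node that was activated separately from the cluster is rejected.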
During Phase I, cache affinity topology will be decoupled from cluster nodes topology.
Affinity for persistence-enabled caches is calculated using the baseline topology and then offline nodes are subtracted from the mapping. When a new node joins the cluster or a node goes offline, the baseline topology does not change, only affinity mapping is corrected with regard to offline nodes. Since affinity must be calculated for offline nodes, the cluster must be able to create 'phantom' topology nodes and pass them to an affinity function. Since affinity function may use arbitrary node attributes, we will introduce an interface that declares which node attributes are used in affinity calculation. Required node attributes will be stored in the metadata storage.
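The two-step calculation (assignment over the baseline, then subtraction of offline nodes) can be sketched as follows. This is an illustration under stated assumptions: the class `BaselineAffinitySketch`, the string node IDs, and the toy round-robin affinity function are hypothetical and stand in for a real affinity function.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch; none of these names are Ignite APIs. */
final class BaselineAffinitySketch {
    /**
     * Toy affinity function (an assumption for this example): partition p is
     * assigned to (backups + 1) consecutive baseline nodes starting at index p.
     * Offline baseline nodes take part here as 'phantom' nodes.
     */
    public static List<List<String>> idealAssignment(List<String> baseline, int partitions, int backups) {
        List<List<String>> result = new ArrayList<>();
        int copies = Math.min(backups + 1, baseline.size());

        for (int p = 0; p < partitions; p++) {
            List<String> owners = new ArrayList<>();

            for (int i = 0; i < copies; i++)
                owners.add(baseline.get((p + i) % baseline.size()));

            result.add(owners);
        }

        return result;
    }

    /** Removes offline nodes from each partition's owner list; the baseline itself is untouched. */
    public static List<List<String>> subtractOffline(List<List<String>> ideal, Set<String> online) {
        List<List<String>> result = new ArrayList<>();

        for (List<String> owners : ideal) {
            List<String> alive = new ArrayList<>();

            for (String node : owners) {
                if (online.contains(node))
                    alive.add(node);
            }

            result.add(alive);
        }

        return result;
    }
}
```

For a baseline of `A`, `B`, `C` with one backup, taking `B` offline leaves the ideal assignment unchanged and only shortens the owner lists of the partitions that `B` held.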
Baseline topology for a newly created cluster (or an old cluster with old persistence files) is created during the first cluster activation. Subsequent baseline topology changes should be either confirmed manually or approved automatically via a user-defined pluggable interface. When a cluster with an existing baseline topology is started, it waits for all nodes in the baseline topology to come online. Once all the nodes are started, the cluster can perform auto-activation.
Functionality that is not related to data affinity (compute grid, services) is not affected by Phase I. Functionality related to in-memory caches should work the same way as before the baseline is introduced.
During Phase II, cache affinity for in-memory caches and persistence-enabled caches should be merged into one Affinity Topology. In order to improve user experience and keep the old behavior for in-memory caches, affinity topology switch policies are introduced.
The policy is defined by a single boolean switch (auto-adjust enabled) and two timeouts: a soft timeout, which is extended after each topology change, and a hard timeout, which triggers a baseline change once it has elapsed since the first topology change event.
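The interaction of the two timeouts can be sketched as follows. This is a minimal model, not the actual implementation: the class `BaselineAutoAdjustSketch`, its method names, and the use of millisecond timestamps passed in by the caller are assumptions made for the example.

```java
/** Hypothetical sketch; none of these names are Ignite APIs. */
final class BaselineAutoAdjustSketch {
    private final boolean enabled;
    private final long softTimeoutMs;
    private final long hardTimeoutMs;

    /** Timestamps of the first and the latest pending topology change, -1 if none. */
    private long firstChangeTs = -1;
    private long lastChangeTs = -1;

    public BaselineAutoAdjustSketch(boolean enabled, long softTimeoutMs, long hardTimeoutMs) {
        this.enabled = enabled;
        this.softTimeoutMs = softTimeoutMs;
        this.hardTimeoutMs = hardTimeoutMs;
    }

    /** Records a node join or leave; each event extends the soft deadline. */
    public void onTopologyChange(long nowMs) {
        if (firstChangeTs < 0)
            firstChangeTs = nowMs;

        lastChangeTs = nowMs;
    }

    /** Returns the time at which the baseline should be adjusted, or -1 if nothing is pending. */
    public long adjustDeadline() {
        if (!enabled || firstChangeTs < 0)
            return -1;

        // Soft deadline moves with every event; hard deadline is fixed by the first event.
        return Math.min(lastChangeTs + softTimeoutMs, firstChangeTs + hardTimeoutMs);
    }

    public boolean shouldAdjust(long nowMs) {
        long deadline = adjustDeadline();

        return deadline >= 0 && nowMs >= deadline;
    }

    /** Clears pending events after the baseline has been changed. */
    public void onBaselineAdjusted() {
        firstChangeTs = -1;
        lastChangeTs = -1;
    }
}
```

The hard timeout bounds how long a steady stream of joins and leaves can postpone the adjustment, which the soft timeout alone would not guarantee.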
The policy configuration should be adjustable at runtime via JMX beans or the control.sh utility, and it should be persisted to the node metastore. A joining node should use the most recent cluster configuration, agreed upon at cluster join.
After baseline adjustment on timeout is introduced, we can change the behavior of in-memory caches to conform to the baseline topology. This should resolve the issues with joins between in-memory and persistent caches and open up more opportunities for PME optimizations.
Finally, we should introduce a way to track which nodes have 'seen' the latest version of each partition upon node failure, and handle the partition loss policy accordingly.
Functionality related to compute grid and services is not affected by Phase II.
During Phase III, a procedure of graceful node decommissioning is introduced. This procedure should make it possible to shrink clusters with 0 backups. During a decommission procedure, the cluster should calculate an intermediate affinity and rebalance partitions that are owned by the node being decommissioned. After rebalance is finished, the node may be excluded from the cluster.
An open question for Phase III is whether Service Grid and Compute functionality should be allowed on non-affinity nodes.
It is necessary to add the ability to manage baseline topology to the command-line utilities (visorcmd, the control.sh script), and to consider adding it to the WebConsole.
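One way to picture the rebalance step is to compute, for every partition owned by the leaving node, a remaining node that should receive it. The sketch below is illustrative only: the class `DecommissionSketch`, the string node IDs, and the round-robin choice of the receiving node are assumptions, and the list of remaining nodes is assumed not to contain the leaving node.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; none of these names are Ignite APIs. */
final class DecommissionSketch {
    /**
     * For each partition owned by the leaving node, picks a remaining node
     * that does not already own it. Returns partition -> receiving node.
     */
    public static Map<Integer, String> rebalanceTargets(
        List<List<String>> current, List<String> remaining, String leaving) {
        Map<Integer, String> targets = new HashMap<>();

        for (int p = 0; p < current.size(); p++) {
            List<String> owners = current.get(p);

            if (!owners.contains(leaving))
                continue; // The partition is not affected by the decommission.

            String target = null;

            // Assumed selection rule: round-robin over remaining nodes, offset by partition.
            for (int i = 0; i < remaining.size() && target == null; i++) {
                String candidate = remaining.get((p + i) % remaining.size());

                if (!owners.contains(candidate))
                    target = candidate;
            }

            if (target == null)
                throw new IllegalStateException("No remaining node can take partition " + p);

            targets.put(p, target);
        }

        return targets;
    }
}
```

With 0 backups, the leaving node holds the only copy of its partitions, so it can be excluded only after every partition in the returned map has been rebalanced to its target.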
Additional information in the metadata storage increases disk space consumption. Since we store additional information about caches, we introduce another point of configuration validation. Metadata storage introduction may require additional changes to PageMemory because metadata storage may be required during node join (currently, the memory recovery happens after discovery start).
Closed Tickets Phase II
Alexey Goncharuk, Sergey Puchnin, it is not clear from the description whether in-memory only caches are affected by proposed functionality or not.
Also, it may be useful to add a list of use cases and corner cases.
I also propose explaining the details of the following correction:
> ... only affinity mapping is corrected with regard to offline node
As far as I understand this means the following:
when a node fails or leaves, some backup partitions may become primary by affinity (a phantom node can't be a primary node), but no rebalancing is triggered by this correction.