ID: IEP-4
Authors: Alexey Goncharuk, Sergey Puchnin
Sponsor: Alexey Goncharuk
Created: 2017-10-16
Status: DRAFT


Motivation

When persistent storage was introduced in Apache Ignite, the cluster membership mechanism was not changed, which means that every node startup or shutdown is treated as a cluster membership change. This leads to several performance, usability, and consistency issues:

  • The RebalanceDelay configuration property is not flexible enough to protect the cluster from unnecessary rebalancing
  • Cluster startup in inactive mode requires user intervention
  • A separate class of cluster activation scenarios may result in data inconsistency
  • Lost partition recovery is possible only with user intervention

Other persistence-enabled clusters separate cluster membership and data distribution. If a node fails, it is considered temporarily offline, and no data movement starts until an administrator confirms that the node is offline permanently.

Description

To resolve the issues described above, we introduce the concept of an affinity baseline topology: a target set of nodes intended to keep data for persistence-enabled caches. We will also attach to each baseline topology a list of hashes generated at branching points, which will allow us to keep track of the activation history and prevent data divergence in the cluster. A branching point is a cluster action that may affect data integrity of the cluster; one example of a branching point is cluster activation.
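
To make the concept more concrete, below is a minimal Java sketch of the data such a baseline topology record could carry; the class and field names are assumptions for illustration only, not the final implementation.

    import java.util.Collection;
    import java.util.List;

    /**
     * Illustrative sketch only: the target set of nodes plus the branching
     * history used to detect divergence. Class and field names are assumptions.
     */
    public class BaselineTopologyRecord {
        /** Monotonically growing ID, incremented on every baseline topology change. */
        private int id;

        /** Consistent IDs of the nodes that form the baseline topology. */
        private Collection<Object> consistentIds;

        /** Hashes generated at branching points (e.g. cluster activations). */
        private List<Long> branchingHistory;
    }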

Baseline topology and branching history must be stored in a reliable fail-safe metadata storage which should be available for reading and updating upon node join. This is needed to detect branch divergence and prevent stale nodes from joining the cluster. Each baseline topology change and branching action is saved in the metadata storage.
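
A minimal sketch of what such a metadata storage contract could look like is shown below; the interface and method names are assumptions for illustration, not an existing API.

    import java.io.Serializable;

    /**
     * Sketch of a fail-safe, locally persisted metadata storage that is readable
     * and writable during node join. Interface and method names are assumptions.
     */
    public interface MetadataStorage {
        /** Reads a value previously saved under the given key, or returns null if absent. */
        Serializable read(String key);

        /** Atomically and durably saves the value under the given key. */
        void write(String key, Serializable value);
    }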

Phase I

During Phase I, cache affinity topology will be decoupled from the cluster nodes topology.

Affinity for persistence-enabled caches is calculated using the baseline topology, and offline nodes are then subtracted from the mapping. When a new node joins the cluster or a node goes offline, the baseline topology does not change; only the affinity mapping is corrected with regard to offline nodes. Since affinity must be calculated for offline nodes, the cluster must be able to create 'phantom' topology nodes and pass them to the affinity function. Since an affinity function may use arbitrary node attributes, we will introduce an interface that declares which node attributes are used in affinity calculation. The required node attributes will be stored in the metadata storage.
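
A minimal sketch of such an interface is shown below; the interface and method names are assumptions for illustration, not a final API.

    import java.util.Collection;

    /**
     * Sketch: an affinity function implementing this interface declares which
     * node attributes it uses, so the cluster can persist them in the metadata
     * storage and reconstruct 'phantom' nodes for offline baseline members.
     * The interface and method names are assumptions.
     */
    public interface AffinityAttributeAware {
        /** @return Names of the node attributes that participate in affinity calculation. */
        Collection<String> affinityAttributeNames();
    }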

Baseline topology for a newly created cluster (or an old cluster with old persistence files) is created during the first cluster activation. Subsequent baseline topology changes should be either confirmed manually or approved automatically via a user-defined pluggable interface. When a cluster with an existing baseline topology is started, it waits for all nodes in the baseline topology to go online. Once all the nodes are started, the cluster can perform auto-activation.
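
A possible shape of the user-defined pluggable approval hook mentioned above is sketched below; the interface name and signature are assumptions for illustration only.

    import java.util.Collection;

    /**
     * Sketch of a pluggable policy that decides whether a proposed baseline
     * topology change may be applied automatically. Names are assumptions.
     */
    public interface BaselineChangeApprovalPolicy {
        /**
         * @param proposedBaseline Consistent IDs of the nodes in the proposed baseline topology.
         * @return {@code true} to apply the change automatically,
         *         {@code false} to require manual confirmation by an administrator.
         */
        boolean approve(Collection<Object> proposedBaseline);
    }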

Functionality that is not related to data affinity (compute grid, services) is not affected by Phase I. Functionality related to in-memory caches should work the same way as it did before the baseline topology was introduced.

Phase II

During Phase II, cache affinity for in-memory caches and persistence-enabled caches should be merged into one Affinity Topology. In order to improve user experience and keep the old behavior for in-memory caches, affinity topology switch policies are introduced. 

The policy is defined by a single boolean switch (auto-adjust enabled) and two timeouts: a soft timeout, which is restarted after each topology change, and a hard timeout, which triggers a baseline change once that much time has passed since the first unprocessed topology change event.
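
The sketch below illustrates the intended interplay of the two timeouts (class and method names are assumptions, not the final API): the soft timer is restarted on every topology change, while the hard timeout caps the total delay measured from the first unprocessed change.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.ScheduledFuture;
    import java.util.concurrent.TimeUnit;

    /** Illustrative sketch of the auto-adjust policy semantics; names are assumptions. */
    public class BaselineAutoAdjustPolicy {
        private final boolean enabled;
        private final long softTimeoutMs;
        private final long hardTimeoutMs;

        private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        private ScheduledFuture<?> softTask;
        private long firstChangeTs = -1;

        public BaselineAutoAdjustPolicy(boolean enabled, long softTimeoutMs, long hardTimeoutMs) {
            this.enabled = enabled;
            this.softTimeoutMs = softTimeoutMs;
            this.hardTimeoutMs = hardTimeoutMs;
        }

        /** Called on every cluster topology change event (node join or leave). */
        public synchronized void onTopologyChanged() {
            if (!enabled)
                return;

            long now = System.currentTimeMillis();

            if (firstChangeTs < 0)
                firstChangeTs = now;

            // Restart the soft timer, but never let the total delay exceed the hard timeout.
            long delay = Math.max(0, Math.min(softTimeoutMs, hardTimeoutMs - (now - firstChangeTs)));

            if (softTask != null)
                softTask.cancel(false);

            softTask = scheduler.schedule(this::adjustBaseline, delay, TimeUnit.MILLISECONDS);
        }

        private synchronized void adjustBaseline() {
            firstChangeTs = -1;

            // A real implementation would set the baseline topology to the
            // current set of online server nodes here.
        }
    }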

The policy configuration should be adjustable at runtime via JMX beans or the control.sh utility and should be persisted to the node metastore. A joining node should use the most recent cluster-wide configuration, which is aligned upon cluster join.
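
As an illustration, the runtime management interface could look roughly like the following JMX bean; the bean name and methods are assumptions, not an existing API.

    /**
     * Sketch of a JMX management interface for the auto-adjust policy.
     * The bean name and methods are assumptions for illustration only.
     */
    public interface BaselineAutoAdjustMXBean {
        boolean isAutoAdjustEnabled();

        void setAutoAdjustEnabled(boolean enabled);

        long getSoftTimeout();

        void setSoftTimeout(long timeoutMs);

        long getHardTimeout();

        void setHardTimeout(long timeoutMs);
    }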

After baseline adjustment on timeout is introduced, we can change the behavior of in-memory caches to conform to the baseline topology. This should resolve the issues with joins between in-memory and persistent caches and open up more opportunities for partition map exchange (PME) optimizations.

Finally, we should track which node has seen the latest version of each partition upon node failure and handle the partition LOSS policy accordingly.

Functionality related to compute grid and services is not affected by Phase II.

Phase III

During Phase III, a procedure for graceful node decommissioning is introduced. This procedure should make it possible to shrink clusters even for caches with 0 backups. During decommissioning, the cluster should calculate an intermediate affinity and rebalance the partitions owned by the node being decommissioned. After rebalancing is finished, the node may be excluded from the cluster.
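
An illustrative outline of the decommissioning flow is sketched below; all class and method names are assumptions of this sketch rather than a concrete API.

    import java.util.Collection;
    import java.util.HashSet;

    /** Sketch of the graceful decommissioning flow described above; names are assumptions. */
    public class DecommissionProcedure {
        public void decommission(Object nodeConsistentId, ClusterControl cluster) {
            // 1. Build an intermediate baseline that excludes the node being decommissioned.
            Collection<Object> newBaseline = new HashSet<>(cluster.currentBaselineIds());
            newBaseline.remove(nodeConsistentId);

            // 2. Rebalance partitions owned by that node to the remaining baseline nodes
            //    using the intermediate affinity, and wait for completion.
            cluster.rebalanceTo(newBaseline);

            // 3. Only after rebalancing has finished, exclude the node from the cluster.
            cluster.setBaseline(newBaseline);
            cluster.stopNode(nodeConsistentId);
        }

        /** Hypothetical facade over cluster management operations used by this sketch. */
        public interface ClusterControl {
            Collection<Object> currentBaselineIds();

            void rebalanceTo(Collection<Object> baseline);

            void setBaseline(Collection<Object> baseline);

            void stopNode(Object consistentId);
        }
    }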

An open question for Phase III is whether Service Grid and Compute functionality should be allowed on non-affinity nodes.

Usability considerations

It is necessary to add the ability to manage the baseline topology to the command-line utilities (visorcmd, the control.sh script), and to consider adding it to the Web Console as well.
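
Besides the management tools, baseline management could also be exposed via the public Java API. The usage sketch below shows one possible shape; the API shown here is assumed for illustration and may differ from the final implementation.

    import java.util.Collection;

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cluster.BaselineNode;

    /** Usage sketch; the API shape is an assumption of this example. */
    public class BaselineManagementExample {
        public static void main(String[] args) {
            Ignite ignite = Ignition.start("ignite-config.xml");

            // Activate the cluster; the first activation records the baseline topology.
            ignite.cluster().active(true);

            // Read the current baseline topology.
            Collection<BaselineNode> baseline = ignite.cluster().currentBaselineTopology();
            System.out.println("Baseline nodes: " + baseline.size());

            // Reset the baseline to the current set of server nodes
            // (e.g. after a node was added or permanently removed).
            ignite.cluster().setBaselineTopology(ignite.cluster().topologyVersion());
        }
    }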

Risks and Assumptions

Additional information in the metadata storage increases disk space consumption. Since we store additional information about caches, we introduce another point of configuration validation. The introduction of the metadata storage may require additional changes to PageMemory because the metadata storage may be needed during node join (currently, memory recovery happens after discovery starts).

Discussion Links

http://apache-ignite-developers.2346864.n4.nabble.com/Cluster-auto-activation-design-proposal-td20295.html

http://apache-ignite-developers.2346864.n4.nabble.com/Design-proposal-automatic-creation-of-BaselineTopology-td20756.html

http://apache-ignite-developers.2346864.n4.nabble.com/Baseline-auto-adjust-s-discuss-td40330.html


Open Tickets Phase I


Closed Tickets Phase I


Open Tickets Phase II


Closed Tickets Phase II



2 Comments

  1. Alexey Goncharuk: Sergey Puchnin, it is not clear from the description whether in-memory-only caches are affected by the proposed functionality or not.

    Also, it may be useful to add a list of use cases and corner cases.

  2. I also propose to explain the details of the following correction:
    > ... only affinity mapping is corrected with regard to offline nodes
    As far as I understand, this means the following:
    when a node fails or leaves, some backup partitions may become primary by affinity (a phantom node cannot be the primary node), but no rebalancing is triggered by this correction.