IDIEP-140
Author:
Sponsor:
Created:

Status: DRAFT

Motivation

Apache Ignite provides good horizontal scalability, enabling applications built on top of it to store more data and handle more user requests as they grow. At some point, support for deploying an application across more than one datacenter (DC) becomes necessary to ensure disaster resilience and recovery.

Right now Ignite does not support deployments spanning two or more DCs effectively, although such deployments are not prohibited even in current versions. The reason is that internal components are not optimized to perform their functions in a multi-DC environment. For instance, the SQL engine may request pieces of data from a remote DC even though the original request could be fulfilled entirely from the local DC. This leads to lower throughput and higher latency of user operations, as well as decreased cluster stability (at least for the ring-shaped cluster topology used by TcpDiscoverySpi).

In short, Ignite 2.x does not support multi-DC (MDC) deployments efficiently, and this IEP describes the improvements needed to close this gap in functionality.

Description

Support for MDC is not a single big feature but rather a set of improvements to existing components that allow an Ignite cluster to reach the following goals:

  1. Minimizing the impact on cluster stability in an MDC environment for the ring-shaped topology.
  2. Maximizing the efficiency of user operations by making internal components aware of the MDC configuration.
  3. Addressing the increased risk of situations like split-brain.

The first improvement concerns the Discovery SPI implementation and aims at minimizing the number of cross-DC edges in the ring topology, thus increasing cluster stability. It enables the subsequent set of improvements.

Improvements to all other components (KV API, SQL, rebalancing, and so on) can be done in parallel, in one or more phases, after the Discovery SPI improvement is implemented. The necessary changes for each component are covered in subsequent sections of this document.

More effective mechanisms for detecting and mitigating split-brain situations can then be built on top of the modified Discovery SPI.

High-level work decomposition

Preparation phase

Introduce new attributes to label Ignite nodes belonging to different DCs, and develop the components necessary for operating in a multi-DC environment (TopologyValidator, SegmentationResolver with a suitable segmentation policy, tooling for managing a stretched cluster).
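The node-labeling part can be sketched with the existing user-attribute mechanism. This is only an illustration: NODE_DC is the example attribute name used later in this IEP, and the DcNodeConfig helper is hypothetical.

```java
import java.util.Collections;

import org.apache.ignite.configuration.IgniteConfiguration;

public class DcNodeConfig {
    // Label a node with the DC it belongs to. "NODE_DC" is the attribute
    // name used as an example in this IEP; treat it as a placeholder.
    public static IgniteConfiguration withDc(IgniteConfiguration cfg, String dcId) {
        cfg.setUserAttributes(Collections.singletonMap("NODE_DC", dcId));
        return cfg;
    }
}
```

Other nodes (and cluster-group filters) could then read the label via ClusterNode.attribute("NODE_DC").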

Discovery SPI improvements

Modification of TcpDiscoverySpi is needed to fulfill the following requirements:

  1. No more than 2 cross-DC connections are allowed between any two datacenters in the ring.
  2. To achieve the first requirement, the join protocol should be modified to support adding a node at an arbitrary point in the ring.
  3. To enable the second requirement, a new node attribute is introduced that labels a node as belonging to a particular DC (e.g. NODE_DC with String values).
  4. The TcpDiscoveryNode properties order and internalOrder are reworked to shield higher-level code from the changed behavior and to preserve as many internal discovery invariants as possible.

At the same time, improvements to TcpDiscoverySpi must not affect single-DC installations: no additional configuration complexity or performance penalty is introduced.
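Requirement 1 can be stated as a simple invariant over the ring order. The following sketch (hypothetical helper, not Ignite code) counts cross-DC edges given the ordered DC labels of ring nodes; with the modified join protocol, this count should stay at most 2 for a two-DC cluster.

```java
import java.util.List;

public class RingTopology {
    // Count edges of the ring whose endpoints lie in different DCs.
    // ringDcs holds the DC label of each node in ring order.
    public static int crossDcEdges(List<String> ringDcs) {
        int edges = 0;
        int n = ringDcs.size();
        for (int i = 0; i < n; i++) {
            // The ring wraps around: the last node connects back to the first.
            if (!ringDcs.get(i).equals(ringDcs.get((i + 1) % n)))
                edges++;
        }
        return edges;
    }
}
```

For example, the ring A, A, B, B has 2 cross-DC edges (the minimum for two DCs), while the interleaved ring A, B, A, B has 4.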

Other components' improvements

The following list of components is not exhaustive; other components may be added later with a detailed plan for their optimization for MDC deployment.

  1. KV API, SQL API: fulfill requests from the same DC where the request originator resides (a thick client, a server handling a thin client request, a compute task requesting data from a cache, etc.).
  2. Thick clients: attach them to a server node from the same (or the closest from a network perspective) DC.
  3. Rebalancing: keep partition loading as DC-local as possible; partitions should be loaded from nodes in the same DC whenever possible. The same applies to historical rebalancing.
  4. Affinity: different affinity functions may provide different guarantees in the presence of split-brain scenarios. We need to examine these options and provide the end user with a clear list of the trade-offs each affinity function offers (e.g. at least one copy of each partition in each DC, all PRIMARY copies in one DC, etc.).
  5. Snapshots: partition files must be loaded from nodes in the same DC whenever possible.
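The common pattern behind items 1, 3, and 5 is "prefer a copy in my DC, fall back to the primary". A minimal sketch of that selection rule, assuming a hypothetical NodeInfo holder (real code would inspect ClusterNode attributes such as NODE_DC):

```java
import java.util.List;
import java.util.UUID;

public class DcAwareReads {
    public static class NodeInfo {
        final UUID id;
        final String dc;
        final boolean primary;

        public NodeInfo(UUID id, String dc, boolean primary) {
            this.id = id;
            this.dc = dc;
            this.primary = primary;
        }
    }

    // Returns a same-DC owner (backup or primary) if one exists,
    // otherwise falls back to the primary node.
    public static NodeInfo pickReadTarget(List<NodeInfo> owners, String localDc) {
        NodeInfo primary = null;
        for (NodeInfo n : owners) {
            if (n.dc.equals(localDc))
                return n;
            if (n.primary)
                primary = n;
        }
        return primary;
    }
}
```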

Split-brain handling

At the moment Apache Ignite provides two interfaces to handle split-brain situations: TopologyValidator and ConflictResolver.

The proposed improvements, however, may enable more efficient algorithms for detecting and handling split-brain, e.g. fast detection of a split-brain on the edges of DCs, or simpler and more efficient TopologyValidator implementations based on the notion of a "Leader/Follower DC". All of these ideas are still a matter of discussion and clarification.
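To make the "Leader/Follower DC" idea concrete: after a split, only the segment that still contains nodes from the leader DC would stay operational. The sketch below models just that decision rule with plain DC labels; a real implementation would be a TopologyValidator over ClusterNode attributes, and is still an open design question.

```java
import java.util.Collection;

public class LeaderDcValidator {
    private final String leaderDc;

    public LeaderDcValidator(String leaderDc) {
        this.leaderDc = leaderDc;
    }

    // A segment is considered valid only if at least one of its nodes
    // belongs to the designated leader DC.
    public boolean validate(Collection<String> segmentDcs) {
        return segmentDcs.contains(leaderDc);
    }
}
```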

Thin client improvements

The thin client has a partition-awareness (PA) feature. With this feature the client connects to all cluster server nodes, tries to determine the correct node for a given key, and sends the request directly to that node (see IEP-23: Partition Awareness for Thin Clients). To achieve this, the client requests the primary-node map for the cache from a server. Non-partition-aware requests (requests not related to a cache, or without a specified key that can be mapped to a node) are distributed randomly across all nodes.

In the multi-DC case, we can make the following improvements to the PA mechanism:

  • PA "write" requests are still sent to primary nodes (nothing changes), but PA "read" requests (if the readFromBackup flag is set and the write synchronization mode is not PRIMARY_SYNC) can be sent to a backup node if that node is located in the same DC as the client (if there are no partition copies in the same DC, the request should be sent to the primary node).
  • Non-PA requests should be sent to a random node in the same DC as the client.
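The non-PA routing rule above can be sketched as follows. The Server holder and the pick method are illustrative only; the real client would choose among its open connections.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class NonPaRouting {
    public static class Server {
        final String addr;
        final String dc;

        public Server(String addr, String dc) {
            this.addr = addr;
            this.dc = dc;
        }
    }

    // Pick a random server in the client's DC; if the client's DC has no
    // servers, fall back to a random server from the whole list.
    public static Server pick(List<Server> servers, String localDc, Random rnd) {
        List<Server> sameDc = new ArrayList<>();
        for (Server s : servers)
            if (s.dc.equals(localDc))
                sameDc.add(s);
        List<Server> pool = sameDc.isEmpty() ? servers : sameDc;
        return pool.get(rnd.nextInt(pool.size()));
    }
}
```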

To achieve these improvements, it is proposed to change the "cache partitions" messages (adding information about partitions in the current DC) and to introduce a new "data center nodes" message. Alternatively, instead of introducing a new message, we could reuse the CLUSTER_GROUP_GET_NODE_IDS message, but in that case the client would need to know the name of the server attribute storing the DC ID (which is server-internal information), and everything could break if the server changed where the DC ID is stored.

Protocol changes

Operation codes

A new operation code for the "data center nodes" request is required:

Name                        Code
OP_CLUSTER_GET_DC_NODES     5103

OP_CLUSTER_GET_DC_NODES message format

Request
String            Data center ID

Response
int               Nodes count
(UUID) * count    Node IDs
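The response body above can be sketched as a straightforward length-prefixed encoding. This is an illustration of the layout only: the DcNodesMessage helper is hypothetical, and the real thin-client framing (request header, string encoding, byte order) is omitted.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

public class DcNodesMessage {
    // Encode the response: int node count followed by that many UUIDs
    // (each UUID as two longs: most-significant then least-significant bits).
    public static byte[] writeResponse(List<UUID> nodeIds) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            out.writeInt(nodeIds.size());
            for (UUID id : nodeIds) {
                out.writeLong(id.getMostSignificantBits());
                out.writeLong(id.getLeastSignificantBits());
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decode the same layout back into a list of node IDs.
    public static List<UUID> readResponse(byte[] body) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(body));
            int count = in.readInt();
            List<UUID> ids = new ArrayList<>(count);
            for (int i = 0; i < count; i++)
                ids.add(new UUID(in.readLong(), in.readLong()));
            return ids;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```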

OP_CACHE_PARTITIONS message format changes

Request
bool              Custom mapping
String            Data center ID (new field)
int               Caches count
(int) * count     Cache IDs


Response
long              Major topology version
int               Minor topology version
int               Partition mappings count
(Caches configuration and partition mapping) * count    Caches configurations and partition mappings


Caches configuration and partition mapping
bool              Applicable. Flag that shows whether standard affinity is used for the caches.
int               Count of caches
(Cache ID + key configuration) * count    Cache IDs and cache key configurations
Partition mapping Primary partition mapping (if Applicable is true)
Partition mapping Current DC partition mapping (if Applicable is true; new field)


Partition mapping
int               Nodes count
(Node partitions) * count    Node partitions information


Node partitions
UUID              Node ID
int               Partitions count
(int) * count     Partitions


Risks and Assumptions

The current approach in this IEP mostly modifies the internal logic of existing components; no public API changes or breaks of binary compatibility are needed. Protocols stay largely the same as well, with some internal tweaks and possibly some refactoring.

Discussion Links

[Will be added later]

Tickets
