For Ignite 3.x the concept is formulated for the distributed table. The table is the base component that allows storing and updating data in the cluster. The table provides basic consistency guarantees for data writes and reads.


All distributed table structures require redundancy of stored data to avoid losing entries when one (or more) members of the structure go down. Moreover, the data that is available should stay consistent at all times, even while only a part of the data in the structure is available.

The atomic protocol should provide the ability to maintain the data redundancy level and keep data consistency unless all copies of the data are lost.


Table creation

Table creation requires the following parameters to be specified for the protocol's purposes:

  • Number of partitions - the total number of parts the data will be distributed among in the cluster.
  • Number of replicas - the redundancy level; the desired number of copies of each partition (1 means no redundancy - a single copy of each partition in the cluster).

Key value interface

The familiar interface for atomic storage in a table is available through the Key-Value view[1] of a table:

Key Value interface
public interface KeyValueView<K, V>

All batch methods won't have atomicity guarantees; they are added to optimize network communication.

It is an analogue of the Ignite cache interface from Ignite 2.x.
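To illustrate the shape of such a view, below is a minimal in-memory sketch. The real Ignite 3 KeyValueView operates against cluster partitions and has a richer method set; the batch methods here (getAll/putAll) mirror the note above that batch calls carry no atomicity guarantee. This is an illustration, not the actual interface definition.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a key-value view over a table; the real Ignite 3 view
// routes calls to cluster partitions rather than a local map.
interface KeyValueView<K, V> {
    V get(K key);
    void put(K key, V value);
    Map<K, V> getAll(Iterable<K> keys);   // batch call: no atomicity guarantee
    void putAll(Map<K, V> pairs);         // batch call: no atomicity guarantee
}

class LocalKeyValueView<K, V> implements KeyValueView<K, V> {
    private final Map<K, V> store = new HashMap<>();

    @Override public V get(K key) { return store.get(key); }

    @Override public void put(K key, V value) { store.put(key, value); }

    @Override public Map<K, V> getAll(Iterable<K> keys) {
        Map<K, V> result = new HashMap<>();
        for (K key : keys) {
            V value = store.get(key);
            if (value != null) result.put(key, value);
        }
        return result;
    }

    @Override public void putAll(Map<K, V> pairs) { store.putAll(pairs); }
}
```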


Partition consistency

Every partition's replicas will be served by one RAFT group (implemented in IEP-61[2]). All synchronization guarantees between replicas will be provided by the RAFT protocol.

Since RAFT elects a leader on its own, there is no difference between primary and backup replicas - all the replicas are equal for the atomic protocol implementation.
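The property RAFT provides to a partition is that every replica applies the same ordered command log to its state machine, so deterministic application makes all replicas converge to the same state. A minimal sketch of that idea, assuming a trivially simple "upsert" command (not the actual RAFT or Ignite implementation):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a partition replica as a deterministic state machine:
// replaying the same command log in the same order yields the same state.
class PartitionReplica {
    private final Map<String, String> state = new HashMap<>();

    // A command here is simply an upsert of one entry.
    void apply(String key, String value) { state.put(key, value); }

    Map<String, String> state() { return state; }
}

class ReplicationDemo {
    // Apply an ordered log (as RAFT would deliver it) to a fresh replica.
    static PartitionReplica replay(List<String[]> log) {
        PartitionReplica replica = new PartitionReplica();
        for (String[] cmd : log) replica.apply(cmd[0], cmd[1]);
        return replica;
    }
}
```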

Partition distribution

All partitions should be distributed across the cluster as evenly as possible, pursuing a balanced load. For this purpose, the Rendezvous affinity function will be used (a similar one is used in Ignite 2.x).

The function is calculated once per cluster topology on one of the nodes, and its result (the partition distribution) is stored in the Distributed metastorage. All other nodes get the precalculated partition distribution and are able to use a local copy of it (no recalculation required) before the table becomes available for operations.
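A minimal sketch of the rendezvous (highest-random-weight) approach, assuming a simple hash mix rather than Ignite's actual function: for each partition, every node gets a score from hash(partition, node), and the top `replicas` nodes own that partition's copies. The result is deterministic for a fixed topology, which is what allows it to be computed once and shared.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Rendezvous (highest-random-weight) assignment sketch: deterministic,
// needs no coordination, and shifts few partitions when topology changes.
class RendezvousAffinity {
    // Mix the partition id and node name into a pseudo-random score.
    private static long score(int partition, String node) {
        long h = 1125899906842597L;          // arbitrary odd seed
        h = 31 * h + partition;
        for (int i = 0; i < node.length(); i++) h = 31 * h + node.charAt(i);
        h ^= (h >>> 33);
        h *= 0xff51afd7ed558ccdL;
        h ^= (h >>> 33);
        return h;
    }

    // For each partition, pick the `replicas` highest-scoring nodes.
    static List<List<String>> assign(int partitions, int replicas, List<String> nodes) {
        List<List<String>> assignment = new ArrayList<>();
        for (int p = 0; p < partitions; p++) {
            final int part = p;
            List<String> sorted = new ArrayList<>(nodes);
            sorted.sort(Comparator.comparingLong((String n) -> score(part, n)).reversed());
            assignment.add(sorted.subList(0, replicas));
        }
        return assignment;
    }
}
```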

Mapping entry to partition

A table entry consists of:

  • Key part - a set of unique columns. It can be interpreted as the primary key.
  • Optional affinity part - a fixed subset of key columns that defines the partition the entry belongs to. By default, the affinity part equals the key part.
  • Value part - the rest of the columns.
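The mapping itself can be sketched as hashing the affinity columns and taking the result modulo the partition count (the real affinity function may differ; the column values below are illustrative). Entries that share an affinity key therefore always land in the same partition.

```java
import java.util.List;
import java.util.Objects;

// Sketch: an entry's partition is derived from its affinity columns only,
// so all entries with the same affinity key share a partition.
class PartitionMapper {
    static int partitionFor(List<Object> affinityColumns, int partitions) {
        int hash = Objects.hash(affinityColumns.toArray());
        return Math.floorMod(hash, partitions);  // floorMod keeps the result non-negative
    }
}
```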

Flow description

The table starts to be created through the public API. Meanwhile, the partition distribution is calculated, and it will be available on each node by the time the table is returned to the client code.

Every invocation of the table API determines a set of data entries, which are mapped to partitions by the key part of the entries. The distribution determines a RAFT group for a specific partition; every partition update is transformed into a RAFT command and applied through the RAFT group API.
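Put together, the write path of a batch operation can be sketched as: map each entry to its partition, then hand each per-partition group to that partition's RAFT group as a command. The names below are illustrative, not the actual internal API, and the hash-based mapping stands in for the affinity function.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the write path: key -> partition -> per-partition command batch.
class WriteFlow {
    // Group entries by the partition their key maps to; each group would
    // then be submitted to the corresponding RAFT group as one command.
    static Map<Integer, Map<String, String>> groupByPartition(Map<String, String> entries, int partitions) {
        Map<Integer, Map<String, String>> byPartition = new HashMap<>();
        for (Map.Entry<String, String> e : entries.entrySet()) {
            int part = Math.floorMod(e.getKey().hashCode(), partitions);
            byPartition.computeIfAbsent(part, p -> new HashMap<>()).put(e.getKey(), e.getValue());
        }
        return byPartition;
    }
}
```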


  2. IEP-61 Common Replication Infrastructure

