IDIEP-126
Author:
Sponsor:
Created:
Status: DRAFT


Motivation

This document describes various situations in which a user would have difficulty operating with the data in tables, and the expected behavior of the system in each of them. The main example of such a situation is data unavailability.

Description

The following definitions will be used throughout the document (a minimal code sketch of these states follows the list).

  • Local partition states. A local property of a replica, storage, state machine, etc., associated with the partition:
    • Healthy
      The state machine is running and everything is fine.
    • Initializing
      The Ignite node is online, but the corresponding raft group has not yet completed its initialization.
    • Snapshot installation
      A full state transfer is taking place. Once it is finished, the partition will become healthy or catching-up. Until then, data cannot be read and log replication is paused.
    • Catching-up
      The node is replicating data from the leader, and its data is slightly behind. More specifically, the node has not yet replicated the tail of the log, defined either as the last N log entries or as the log entries of the last M seconds. The latest index is not known on the follower node, while the time lag can be estimated as the difference between safe time and the node's clock, so the time interval seems like the preferred option.
    • Broken
      Something is wrong with the state machine. Some data might be unavailable for reading, the log cannot be replicated, and this state will not change automatically without intervention.
  • Global partition states. A global property of a partition that specifies its apparent functionality from the user's point of view:
    • Available partition
      A healthy partition that can process read and write requests. This means that the majority of peers are healthy at the moment.
    • Degraded partition
      A partition that is by all means available to the user, but is at a higher risk of having issues than other partitions. For example, one of the group's peers is offline: there is still a majority, but the backup factor is lowered.
    • Read-only partition
      A partition that can process read requests, but cannot process write requests. There is no healthy majority, but there is at least one alive (healthy/catching-up) peer that can in principle serve historical read-only queries.
    • Unavailable partition
      A partition that cannot process any requests.
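To make the states above concrete, here is a minimal sketch of how they could be modeled in code; the enum and constant names are illustrative assumptions, not a final API.

  // Hypothetical enums mirroring the partition states defined above; names are illustrative only.
  enum LocalPartitionState {
      HEALTHY,               // State machine is running, everything is fine.
      INITIALIZING,          // Node is online, the raft group has not finished its initialization.
      SNAPSHOT_INSTALLATION, // Full state transfer in progress; reads and log replication are paused.
      CATCHING_UP,           // Replicating the log tail from the leader; data is slightly behind.
      BROKEN                 // State machine is damaged; manual intervention is required.
  }

  enum GlobalPartitionState {
      AVAILABLE,   // Healthy majority; read and write requests are processed.
      DEGRADED,    // Still available, but the backup factor is lowered.
      READ_ONLY,   // No healthy majority, but at least one peer can serve historical reads.
      UNAVAILABLE  // No requests can be processed.
  }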

Building blocks

Building blocks are a set of operations that can be executed by Ignite or by the user in order to improve cluster state.

Each building block must either be an automatic action with a configurable timeout (if applicable) or a documented API, with mandatory diagnostics/metrics that allow users to make decisions about these actions (a hypothetical API sketch follows the list of building blocks).

  1. Offline Ignite node is brought back online, having all recent data.
    Not a disaster recovery mechanism, but worth mentioning.
    A node with usable data that does not require a full state transfer will become a peer and will participate in voting and replication, allowing the partition to be available if the majority is healthy. This is the best case for the user: they simply restart the offline nodes and the cluster continues to be operable.

  2. Automatic group scale-down.
    Should happen when an Ignite node is offline for too long.
    Not a disaster recovery mechanism, but worth mentioning.
    Only happens when the majority is online, meaning that user data is safe.

  3. Manual partition restart.
    Should be performed manually for broken peers.

  4. Manual group peers/learners reconfiguration.
    Should be performed on a group manually, if the majority is considered permanently lost.

  5. Freshly re-entering the group.
    Should happen when an Ignite node returns to the group, but the partition data is missing.

  6. Cleaning the partition data.
    If, for some reason, we know that a certain partition on a certain node is broken, we may ask Ignite to drop its data and re-enter the group empty (as stated in option 5).
    Having a dedicated operation for cleaning the partition is preferable, because:
    - a partition is stored in several storages
    - not all of them have a “file per partition” storage format, not even close
    - there's also the raft log that should be cleaned, most likely
    - maybe the raft meta as well

  7. Partial truncation of the log’s suffix.
    This is a case of partial cleanup of partition data. This operation might be useful if we know that there’s junk in the log, but storages are not corrupted, so there’s a chance to save some data. Can be replaced with “clean partition data”.
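As an illustration, the building blocks above could be exposed through an internal management API roughly like the following. The interface, method names, and signatures are assumptions made for this sketch; the user-facing counterparts are described in the REST / CLI section below.

  // Hypothetical internal API mirroring the manual building blocks; names and signatures are illustrative only.
  // GlobalPartitionState refers to the enum sketched earlier in this document.
  import java.util.Map;
  import java.util.Set;
  import java.util.concurrent.CompletableFuture;

  interface DisasterRecoveryManager {
      /** Building block 3 (and 6, when purge is true): restart the given partitions on the given nodes. */
      CompletableFuture<Void> restartPartitions(Set<String> nodeNames, String zoneName,
              Set<Integer> partitionIds, boolean purge);

      /** Building block 4: forcefully reset peers/learners when the majority is considered permanently lost. */
      CompletableFuture<Void> resetPartitions(String zoneName, Set<Integer> partitionIds);

      /** Building block 7: truncate the log suffix of a single partition starting from the given index. */
      CompletableFuture<Void> truncateLogSuffix(String zoneName, int partitionId, long index);

      /** Diagnostics used to decide on the operations above (see the Metrics section). */
      CompletableFuture<Map<Integer, GlobalPartitionState>> globalStates(String zoneName);
  }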


In order for the user to make decisions about manual operations, we must provide partition states for all partitions in all tables/zones. Both global and local states. Global states are more important, because they directly correlate with user experience.


Some states will automatically lead to “available” partitions if the system overall is healthy and we simply wait for some time. For example, we wait until a snapshot installation or a rebalance is complete, and we're happy. This is not considered a building block, because it's a natural artifact of the architecture.

The current list is not exhaustive; it consists of basic actions that we could implement that would cover a wide range of potential issues.
Any other addition to the list of building blocks would simply refine it, potentially allowing users to recover faster, with less data being lost, or without data loss at all. All such refinements will be considered in later releases and should not be a part of this document.

Scenarios

For the purpose of this section, imagine that we have a partition with a corresponding raft group on nodes A, B and C. Nodes D and E might be used in some examples too. The list of scenarios in this document is not complete, but rather represents the most typical situations and what may happen because of them. A more complete diagram can be found by following the Miro link posted earlier in the document. It's hard to explain all transitions in a linear fashion.


Minority offline

Node A has left the topology. This is a typical scenario of a node restart. Nothing bad happens, the partition is available, and two outcomes are possible:

  • Node A returns in time (before the scale-down timeout):
    • Node returns with data.
      Expected behavior: we replicate the data that it missed, either via log replication or using a full state transfer procedure.
    • Node returns without data.
      Expected behavior: we replicate everything via a full state transfer once the node has re-entered the group. This can be done if we still have valid meta.
      Otherwise, we would have to remove the node from the peers and re-enter the group after that. (Apparently, there's an issue with double voting when a node returns without meta. Pull Request #825 addresses the problem; if that's a legitimate solution, then we should use it.)
    • Node returns with broken data. Covered later in the document.

  • Node A doesn't return in time (the scale-down timeout expires):
    Expected behavior: we distribute the 3rd replica to a new node (D, for instance) and start the rebalance. If node A returns after that, its data will be purged.


Too many nodes catching-up

Node C is slow, so nodes A and B form the quorum. If node A shuts down, there’s a delay until node C catches up to the latest data. What to do:

  • Wait for the catch-up to end; there's no way to speed up the process. Ideally, we should show this state in monitoring tools, because slow nodes are always a problem.


Minority in broken state

Node A has its state machine in a broken state. This is not a typical scenario; we don't expect state machines to break. Two possible outcomes:

  • Node A is shut down.
    Expected behavior: we fall back to the previous scenario where A was offline. A simple restart may fix the issue.

  • Node A stays alive for some period of time. If the replica is not repaired, the backup factor is lowered and there is a risk of the partition becoming read-only or unavailable if something happens to node B or C.
    • We could restart the raft group for a specific partition on a specific node; it might fix the issue.
    • We could (potentially) detect replicas in an error state and restart them automatically.
    • We should tell the user that one of the replicas for the partition is unavailable.
    • The same applies to any number of replicas in an error state.
    • In any case, we either fix the issue with a restart, delete the corrupted data and rebalance it, or exclude node A from the group. There's an API for all of these cases.


Majority offline

Nodes A and B have left the topology. Two outcomes are possible:

  • Node C is the primary replica. The partition becomes read-only. All data is available for reading until the lease expires.
  • Node C is not the primary replica. The partition becomes unavailable (at the time of writing this document; please see the clarification below).

    Clarification: we only read data from the primary replica right now. If we don’t prolong the lease on a group without a majority, we won’t even be able to read the data.

    We plan to change that in the future, and read-only partitions will exist longer. This part of the conversation is out of scope of disaster recovery scenarios.


If A or B returns with the data:

  • Expected behavior: a leader is elected, the missing data is transferred, and a leaseholder is elected.


If A or B returns without data:

  • We must check the following: can nodes without data participate in the leader election before data is transferred?
    • If no: data can only be transferred from the leader, and a leader can't be elected until there's a healthy majority. This is a bad situation, and on its own it won't lead to an available partition. Other actions are required, such as:
      - return the 3rd node with the data; this behavior has already been described previously.
      - force node C to become a leader, shrinking the raft group in an unsafe manner; part of the data might be lost in the process.
    • If yes: data will be rebalanced without any issues.

If A and B don’t return:

  • Expected behavior: there's no real difference from the user's point of view. As in the previous case, we can't do anything to safely revive the partition in this state, except for waiting. A manual peers/learners configuration update is, once again, an option. Part of the data might be lost.


Majority in error state/unavailable

  • Same as with a minority: bad replicas may be restarted. The situation reduces to a restart with data deletion or a restart without data deletion. Overall, all the ideas are described in the previous sections.


Minority lost part of the log

This situation may occur if fsync has been disabled.

The missing part of the log will be replicated from the leader. We must check that this actually works; otherwise, we would have to clear the storage and re-enter the group. It is one or the other.


Majority lost part of the log

This situation may occur if fsync has been disabled.

We can't detect whether data has been lost, and we will automatically restore the raft group with the latest data that we have.

It is possible in this situation that the elected leader will not have the highest value of the committed index. In this case, the minority of nodes that have a higher index value will have to either truncate their logs and continue to function properly, or become broken and be cleaned up manually if truncating the log is impossible.

Corrupted Raft Log

Upon restart, we should drop the corrupted part of the log. It will be replicated once again from the leader.

Fsync

Different levels of guarantees are possible:

  • If we sync everything, then nothing should be lost.
  • If we only sync txn states, then only data from the given partition can be lost (lost update/updateAll entries).
  • If we don't sync anything, then data from the given partition and data from other partitions can be lost (due to a lost txn state).


More on all these options appears later in the document. Considering the statements above, we could have different fsync policies, not just “enabled/disabled”.
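For illustration, such a policy could be modeled as a small configuration enum like the one below; the name and values are assumptions for the sketch, not a final configuration schema.

  // Hypothetical fsync policy reflecting the three guarantee levels above; not a final configuration schema.
  enum LogFsyncPolicy {
      ALL,             // Sync every log entry: nothing should be lost.
      TX_STATES_ONLY,  // Sync only transaction state entries: only data of the given partition can be lost.
      NONE             // Sync nothing: data of this partition and, via lost txn states, of other partitions can be lost.
  }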

Resulting issues with the data

What happens when a partition loses part of the log? This can happen after a manual peers/learners configuration update or after physically losing part of the log; the most extreme case is a total partition loss.

  • Indexes that are already built may locally be seen as not yet built, and there's no log entry with the “finish build” command. This breaks secondary indexes completely.
    • How the scenario looks:
      - The primary replica finishes the build.
      - It writes the corresponding marker to the meta-storage after replicating the last build command.
      - For some reason, that last command is lost.
      - Local index data is in an unfinished state, while globally we treat it as fully built.
    • We need some type of recovery procedure (see the sketch after this list).
      For example: never read from an index if its “last built rowId” is not at the terminal value. This means that a disaster happened and the data may not be consistent. We should wait until the primary replica completes the build, and only then allow reading. Throwing an exception immediately, if we see an incomplete index, is a good option.

  • Transactions that are already committed may lose information about the transaction state, making write intent resolution problematic.
    • Can we recover the state using the rest of the cluster? Ideally, yes, but only if there's a list of recently committed transactions on those nodes. Generally speaking, it's impossible. As a result, different partitions will see different states of transactions.
    • There must be a way to reconcile ongoing transactions with commit partitions. If the commit partition has lost part of the transaction states, we must gather the entire list of ongoing transactions and explicitly abort them.
      A manual peers/learners configuration update is naturally a trigger for such an event. But if we simply lost part of the log, we can't notice that easily.
      As far as I know, we should have a procedure for cleaning old transactions after a cluster restart. Should we do it on every leader re-election? That's too often; a better approach is a leader re-election after losing the majority. I struggle to describe the approach in better terms, the problem seems to be quite complicated. We should discuss the approach.
      Do we require an explicit abort for each transaction for secondary storage replication to work? This is an open question. We should figure it out.

  • Transactions that are already committed may lose part of the data.
    • There’s nothing we can do about it.
    • There’s a risk that secondary storage replication might be affected. How do we prevent inconsistencies between primary and secondary storages? I guess raft will figure it out for us?
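A minimal sketch of the index-read guard proposed above, assuming a hypothetical “last built rowId” marker with a dedicated terminal value; the class, method, and constant names are illustrative only.

  // Sketch of the proposed guard: refuse to read from an index whose local build marker
  // has not reached the terminal value, even if the index is globally marked as built.
  class IndexReadGuard {
      /** Hypothetical terminal value meaning "build finished"; the real marker type may differ. */
      private static final long BUILD_COMPLETE = -1L;

      static void ensureIndexReadable(String indexName, long lastBuiltRowId) {
          if (lastBuiltRowId != BUILD_COMPLETE) {
              // A disaster happened: the "finish build" command was lost and local index data is incomplete.
              // Fail fast instead of returning possibly inconsistent results; the primary replica must finish the build first.
              throw new IllegalStateException("Index " + indexName
                      + " is locally incomplete and cannot be read until the primary replica finishes the build.");
          }
      }
  }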


Proper documentation is required for everything. Users should know what to expect if something bad happens.

API

Here's an example of what the API may look like; the real implementation might differ in some specific parts.

Metrics

There are local and global metrics. We only use local ones, because global partition states can only be known on leader nodes. I propose that the following information be available to the user (names are subject to change):

  • zones.<zoneId>.<partId>.localState.<consistentId>
    = Healthy / Init / CatchUp / Snapshot / Error(msg)
  • zones.<zoneId>.<partId>.globalState
    = Available / ReadOnly / Unavailable (no consistent ID here)


Similar information should be accessible via system views, something like:

  • system.partition_local_states
    zone | zoneId | partId | nodeId | state
    A    | 2      | 15     | Node0  | Healthy

  • system.partition_global_states
    zone | zoneId | partId | state
    A    | 2      | 15     | Available
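Assuming the views exist with the names and columns proposed above, finding partitions that are not fully available could look like this (the JDBC URL and the view/column names are assumptions and may change):

  // Query the proposed system view over JDBC to list partitions that are not available.
  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class PartitionStateCheck {
      public static void main(String[] args) throws Exception {
          try (Connection conn = DriverManager.getConnection("jdbc:ignite:thin://127.0.0.1:10800");
               Statement stmt = conn.createStatement();
               ResultSet rs = stmt.executeQuery(
                       "SELECT zone, partId, state FROM system.partition_global_states "
                               + "WHERE state <> 'Available'")) {
              while (rs.next()) {
                  System.out.println(rs.getString("zone") + " / partition " + rs.getInt("partId")
                          + " -> " + rs.getString("state"));
              }
          }
      }
  }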

REST / CLI

REST and CLI mirror each other.

  • ignite recovery restart-partitions --nodes <nodeIds> --zones <zoneNames>
    [--partitions <partitionIds>] [--purge]

  • // It’s like a “set baseline”, but much stronger
    ignite recovery reset-partitions [--zones <zoneNames>]
    [--partitions <partitionIds>]

  • ignite recovery truncate-log-suffix --zone <zoneName> --partition <partitionId> --index <index>

  • ignite recovery states [--local [--nodes <nodeIds>] | --global] [--zones <zoneNames>] [--partitions <partitionIds>]

Risks and Assumptions

// Describe project risks, such as API or binary compatibility issues, major protocol changes, etc.

Discussion Links

// Links to discussions on the devlist, if applicable.

Reference Links

// Links to various reference documents, if applicable.

Tickets

Open Tickets


Closed Tickets

