...

We can’t detect if data has been lost, and we will automatically restore the raft group with the latest data that we have.

It is possible in this situation that the elected leader will not have the highest committed index. In that case, the minority of nodes that have a higher index will have to either truncate their logs and continue to function properly, or become broken and be cleaned up manually if truncating the log is impossible.
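
To make the two outcomes concrete, here is a minimal sketch of the decision such a minority node faces after a forced restore. All names here (ForcedRestoreHandler, canTruncateLogTo, markBroken) are hypothetical and only illustrate the idea; the real logic would live inside the raft layer.

```java
/** Sketch: what a minority node with a higher committed index might do after a forced restore. */
class ForcedRestoreHandler {
    /**
     * Reconciles the local log with the forcibly restored raft group.
     *
     * @param localCommittedIndex Committed index on this (minority) node.
     * @param restoredLeaderIndex Committed index of the newly elected leader.
     */
    void reconcile(long localCommittedIndex, long restoredLeaderIndex) {
        if (localCommittedIndex <= restoredLeaderIndex) {
            return; // The leader is at least as up to date as this node, nothing to do.
        }

        if (canTruncateLogTo(restoredLeaderIndex)) {
            // Drop the log suffix that the restored group no longer knows about and keep working.
            truncateLogTo(restoredLeaderIndex);
        } else {
            // Truncation is impossible (for example, the extra entries are already applied
            // to the storage), so the node becomes broken and must be cleaned up manually.
            markBroken();
        }
    }

    private boolean canTruncateLogTo(long index) { return false; } // Hypothetical.

    private void truncateLogTo(long index) { } // Hypothetical.

    private void markBroken() { } // Hypothetical.
}
```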

...

  • Indexes that are already built may be seen locally as not yet built, and there’s no log entry with the “finish build” command. This breaks secondary indexes completely; we must address it before the release.
  • How the scenario looks:
    • Primary replica finishes the build
    • It writes the corresponding marker to the meta-storage after replicating the last build command
    • For some reason, that last command becomes lost
    • Local index data is in an unfinished state, while globally we treat the index as fully built

    • We need some type of recovery procedure.
      For example: never read from an index if its “last built rowId” has not reached the terminal value. A non-terminal value means that a disaster happened and the data may not be consistent. We should wait until the primary replica completes the build, and only then allow reading. Throwing an exception immediately when we see an incomplete index is a good option (see the sketch after this list).
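
A minimal sketch of such a read guard, assuming a hypothetical “last built rowId” value and a TERMINAL_ROW_ID sentinel; the real check would sit on the index read path.

```java
import java.util.UUID;

/** Sketch: refuse to read from a secondary index whose local build state is incomplete. */
class IndexReadGuard {
    /** Sentinel meaning “the build has passed the last row”. The name is a placeholder. */
    static final UUID TERMINAL_ROW_ID = new UUID(-1L, -1L);

    /**
     * @param lastBuiltRowId Last row id processed by the local index build.
     * @throws IllegalStateException If the local index is not fully built, for example because
     *         the final build command was lost together with part of the raft log.
     */
    static void checkReadable(UUID lastBuiltRowId) {
        if (!TERMINAL_ROW_ID.equals(lastBuiltRowId)) {
            throw new IllegalStateException(
                    "Index is not fully built locally; wait until the primary replica finishes the build");
        }
    }
}
```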

  • Transactions that are already committed may lose information about transaction state, making write intent resolution problematic.
    • Can we recover the state using the rest of the cluster? Ideally, yes, but only if there’s a list of recently committed transactions on those nodes. Generally speaking, it’s impossible. As a result, different partitions will see different states of transactions.
    • There must be a way to reconcile ongoing transactions with commit partitions. If the commit partition lost part of the transaction states, we must gather the entire list of ongoing transactions and explicitly abort them (a sketch follows this list).
      A manual peers/learners configuration update is a natural trigger for such an event. But if we simply lost part of the log, we can’t notice that easily.
      As far as I know, we should have a procedure for cleaning up old transactions after a cluster restart. Should we do it on every leader re-election? That would be too often; a better trigger is a leader re-election that follows a loss of the majority. I struggle to describe the approach in better terms; the problem seems to be quite complicated, and we should discuss it.
      Do we require explicit abort for each transaction for secondary storage replication to work? This is an open question. We should figure it out.
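
A minimal sketch of the explicit-abort reconciliation described in this item; ongoingTransactions, commitPartitionHasStateFor and abort are hypothetical placeholders for whatever the real transaction machinery exposes.

```java
import java.util.Set;
import java.util.UUID;

/** Sketch: abort every ongoing transaction whose state the commit partition can no longer confirm. */
class LostTxStateReconciler {
    /**
     * Called after the commit partition group was forcibly restored (for example, after a manual
     * peers/learners configuration update) and may have lost part of its transaction states.
     */
    void reconcile(Set<UUID> ongoingTransactions) {
        for (UUID txId : ongoingTransactions) {
            if (!commitPartitionHasStateFor(txId)) {
                // The outcome of this transaction is unknown, so it is explicitly aborted;
                // its write intents can then be resolved as aborted as well.
                abort(txId);
            }
        }
    }

    private boolean commitPartitionHasStateFor(UUID txId) { return false; } // Hypothetical.

    private void abort(UUID txId) { } // Hypothetical.
}
```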

  • Transactions that are already committed may lose part of the data.
    • There’s nothing we can do about it.
    • There’s a risk that secondary storage replication might be affected. How do we prevent inconsistencies between the primary and secondary storages? I guess raft will figure it out for us?

...

Here’s an example of what the API may look like; the real implementation might differ in some specific parts.
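
For instance, a hedged sketch in Java, assuming recovery is exposed through a dedicated manager; every name and signature below (DisasterRecoveryApi, resetPartitions, localPartitionStates, the state enum) is a placeholder, not the final API.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

/**
 * Sketch of a possible disaster recovery API. All names and signatures are illustrative
 * placeholders and may differ from the final implementation.
 */
interface DisasterRecoveryApi {
    /**
     * Forcibly restores the given partitions of a table, re-creating each raft group from the
     * replicas that are still available (possibly losing the tail of the data).
     */
    CompletableFuture<Void> resetPartitions(String zoneName, String tableName, Set<Integer> partitionIds);

    /** Returns the locally known partition states on this node, keyed by partition id. */
    Map<Integer, LocalPartitionState> localPartitionStates(String zoneName);

    /** Possible local states of a partition, as seen by the node itself. */
    enum LocalPartitionState {
        HEALTHY, INITIALIZING, INSTALLING_SNAPSHOT, BROKEN
    }
}
```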

Metrics

There are local and global metrics. We only use local metrics, because global partition states can only be known at leader nodes. I propose that the following information be available to the user (names are subject to change):
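
As an illustration only, a partition’s local state could be exposed as a simple node-local metric; the naming scheme and types below are placeholders, not the proposed metric names.

```java
import java.util.function.Supplier;

/** Sketch: exposing the local state of a partition as a node-local metric. Names are placeholders. */
class LocalPartitionStateMetric {
    private final int partitionId;
    private final Supplier<String> stateSupplier;

    LocalPartitionStateMetric(int partitionId, Supplier<String> stateSupplier) {
        this.partitionId = partitionId;
        this.stateSupplier = stateSupplier;
    }

    /** Metric name, e.g. “partition.42.localState”. */
    String name() {
        return "partition." + partitionId + ".localState";
    }

    /** Current local state of the partition, e.g. “HEALTHY” or “BROKEN”. */
    String value() {
        return stateSupplier.get();
    }
}
```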

...