This Confluence has been LDAP enabled, if you are an ASF Committer, please use your LDAP Credentials to login. Any problems file an INFRA jira ticket please.

Child pages
  • Ignite Persistent Store - under the hood

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: delta rebalancing notes, explanations added

...

  • Write Ahead Log (WAL) segments - constant size file files (WAL work directory 0...9.wal;, WAL archive 0.wal…)
  • CP markers - small files for events of starting and finishing checkpoints (UUID-Begin.bin, UUID-End.bin)
  • Page store - constains cache parameters, page store for partitions and SQL indexes data (uses file per partition: cache-(cache_name)\part1,2,3.bin, and index.bin)

...

Page store is the storage for all pages related to particual cache and its partition(cache's partitions and SQL indexes).

 

Using page identifier it is possible to map from page ID to file and to position in particular file:

 

No Format
pageId = ... || partition ID || page index and(idx)
//pageId can be easily converted to file + offset in this file
offset = idx * pageSize

...

Cache page storage contains following files:

  • part-0.bin, part-1.bin, ..., part-21023.bin (shown as P1,P2,PN at picture) - Cache partition pages.
  • index.bin - index partition data, special partition with number 65535 is used for SQL indexes and saved to index.bin
  • cache_data.dat - special file with stored cache data (configuration) . StoredCacheData includes more data than CacheConfiguration, e.g. Query entities

 

Persistence

Checkpointing

...

Checkpointing can be defined as process of

...

storing dirty pages from RAM on a disk, with results of consistent memory state is saved to disk. At the point of process end, page state is saved as it was for the time the process begins.

There are two approaches to implementation of checkpointing:

  • Sharp Checkpointing - if checkpoint is completed all data structures on disk are consistent, data is consistent in terms of references and transactions.
  • Fuzzy Checkpointing - means state on disk may require recovery itself

Implemented Approach implemented in Ignite - Sharp Checkpoint; F.C. - to be done in future releases.

To achieve consistency Checkpoint checkpoint read-write lock is used (see GridCacheDatabaseSharedManager#checkpointLock)

  • Cache Updates - holds read lock
  • Checkpointer - holds write lock for short time. Holding write lock means all state is consistent, updates are not possible. Usage of CP Lock checkpoint lock allows to do sharp checkpoint

Under CP checkpoint write lock held we do the following:

      1. WAL marker is added: CP checkpoint (begin) record is added - CheckpointRecord - marks consistent state in WAL
      2. Collect pages were changed since last checkpoint

And then CP then checkpoint write lock is released, updates and transactions can run.

...

Copy on write technique is used. If there is modification in page which is under CP checkpoint now we will create temporary copy of page.

...

it is updated directly in memory bypassing CP checkpoint pool.


If page was already flushed to disk, dirty flag is cleared. Every future write to such page (which was initially involved into CP checkpoint, but was flushed) does not require CP checkpoint pool usage, it is written dirrectly in segment.

...

Possible future optimisation: for full update we may send page store file over network.

Order of nodes join is not relevant, there is possible situation that oldest node has older partition state, but joining node has higher partition counter. In this case rebalancing will be triggered by coordinator. Rebalancing will be performed from the newly joined node to existing one (note this behaviour may be changed under IEP-4 Baseline topology for caches)

WAL history size

In corner case we need to store WAL only for 1 checkpoint in past for successful recovery (PersistentStoreConfiguration#walHistSize )

...