
The following notation is used: words written in italics and without spaces denote a class name (without the package name) or a method name, for example, GridCacheMapEntry.

Ignite Persistent Store

File types

The following file types are used for persisting data: cache pages (the page store), checkpoint markers, and WAL segments.

  • Write Ahead Log (WAL) segments - constant size files (WAL work directory 0...9.wal, WAL archive 0.wal…)
  • CP markers (UUID-Begin.bin, UUID-End.bin)
  • Page store (uses file per partition: cache-(cache_name)\part1,2,3.bin, and index.bin)

Folder structure

Ignite with persistence enabled uses the following folder structure.

The consistent ID may be configured using IgniteConfiguration; by default it is generated from the set of local IP addresses.

Page store

Each cache partition has a corresponding file in the page store directory (a particular node may not own all partitions).

The special partition 65535 is used for SQL indexes and is saved to index.bin.

Persistence and Crash Recovery

Crash recovery can be:

  • local (most databases are able to do this),
  • distributed (the whole cluster state is restored).

Local Crash Recovery

Ignite Durable Memory is the basis for all data structures. No cache state is stored on heap now.

To save the cache state to disk we can dump all its pages to disk. The first prototypes used this simple approach: stop all updates and save all pages.


Checkpointing can be of two types:

  • Sharp Checkpointing - once a checkpoint is completed, all data structures on disk are consistent; data is consistent in terms of references and transactions.
  • Fuzzy Checkpointing - the state on disk may itself require recovery.

Sharp checkpointing is implemented; fuzzy checkpointing is to be done in future releases.

To achieve consistency, a checkpoint read-write lock is used (see GridCacheDatabaseSharedManager#checkpointLock):

  • Cache updates hold the read lock.
  • The checkpointer holds the write lock for a short time. Holding the write lock means that all state is consistent and updates are not possible. Using the CP lock makes a sharp checkpoint possible.

While the CP write lock is held, we do the following:

      1. A WAL marker is added: the CP (begin) record - CheckpointRecord - marks a consistent state in the WAL.
      2. The pages that were changed since the last checkpoint are collected.

Then the CP write lock is released, and updates and transactions can run again.

Dirty pages are tracked in a set: when a page changes from non-dirty to dirty, it is added to this set.

The collection of pages (GridCacheDatabaseSharedManager.Checkpoint#cpPages) allows us to collect and then write the pages that were changed since the last checkpoint.
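
The locking scheme above can be illustrated with a minimal sketch. It is not Ignite's actual code: the class CheckpointLockSketch, its methods and the in-memory list standing in for the WAL are all hypothetical, and only the idea (updates under the read lock, checkpoint begin under the write lock) comes from the text.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical model of the checkpoint lock; names are illustrative only. */
class CheckpointLockSketch {
    private final ReentrantReadWriteLock checkpointLock = new ReentrantReadWriteLock();

    /** Pages that became dirty since the last checkpoint began. */
    private Set<Long> dirtyPages = ConcurrentHashMap.newKeySet();

    /** Stand-in for the WAL; a real WAL is a file, not a list. */
    final List<String> wal = Collections.synchronizedList(new ArrayList<>());

    /** A cache update: many updates may hold the read lock at once. */
    void update(long pageId) {
        checkpointLock.readLock().lock();
        try {
            dirtyPages.add(pageId);
        } finally {
            checkpointLock.readLock().unlock();
        }
    }

    /** Checkpoint begin: the write lock is held only for marker + page collection. */
    Set<Long> beginCheckpoint() {
        checkpointLock.writeLock().lock();
        try {
            wal.add("CheckpointRecord(begin)");
            Set<Long> cpPages = dirtyPages;               // pages to write in this checkpoint
            dirtyPages = ConcurrentHashMap.newKeySet();   // new updates go to a fresh set
            return cpPages;
        } finally {
            checkpointLock.writeLock().unlock();
        }
    }
}
```

The returned set plays the role of cpPages: it can be written to disk after the write lock is released, while new updates fill the fresh dirty set.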

Checkpoint Pool

In parallel with the process of writing pages to disk, some thread may want to update data in a page that is being written.

For such cases the checkpoint pool is used: it holds pages that are updated in parallel with the write. This pool has a limited size.

The copy-on-write technique is used. If a page that is currently under CP is modified, a temporary copy of the page is created.

If a page

  • was not involved in the checkpoint,
  • but was updated concurrently with the checkpointing process,

it is updated directly in memory, bypassing the CP pool.

If a page was already flushed to disk, its dirty flag is cleared. Every later write to such a page (one that was initially involved in the CP but has already been flushed) does not require the CP pool; it is written directly to the segment.
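
A single-threaded sketch of the copy-on-write idea follows. It assumes that the temporary copy preserves the page content as of checkpoint begin and that the checkpointer writes that copy; the text does not state which side gets the copy, so treat this as one possible reading. All names (CheckpointPoolSketch, cpPool and so on) are hypothetical, and the pool size limit is not modelled.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical single-threaded model of the checkpoint pool (no size limit modelled). */
class CheckpointPoolSketch {
    final Map<Long, byte[]> memory = new HashMap<>();  // pages in durable memory
    final Map<Long, byte[]> disk = new HashMap<>();    // stand-in for the page store
    final Set<Long> cpPages = new HashSet<>();         // pages of the running checkpoint, not yet flushed
    final Map<Long, byte[]> cpPool = new HashMap<>();  // temporary copies

    void beginCheckpoint(Set<Long> dirtyPages) {
        cpPages.addAll(dirtyPages);
    }

    /** Update one byte of a page, copying the page first if the checkpoint still needs it. */
    void write(long pageId, int offset, byte value) {
        byte[] page = memory.computeIfAbsent(pageId, id -> new byte[8]);
        if (cpPages.contains(pageId) && !cpPool.containsKey(pageId))
            cpPool.put(pageId, page.clone());   // assumption: keep the checkpoint-time content
        page[offset] = value;                   // the update itself always goes to memory
    }

    /** Checkpointer writes one page; afterwards the page no longer needs the pool. */
    void flush(long pageId) {
        if (!cpPages.remove(pageId))
            return;                             // not part of this checkpoint
        byte[] copy = cpPool.remove(pageId);
        disk.put(pageId, copy != null ? copy : memory.get(pageId).clone());
    }
}
```

In this model a page outside the checkpoint, or one that was already flushed, never touches cpPool, which matches the two bypass cases described above.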


  • The percentage of dirty pages is a trigger for checkpointing (e.g. 75%).
  • A timeout is also a trigger: a checkpoint is done every N seconds.
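
The two triggers can be combined into a single check, sketched below. The method name and all parameters are hypothetical; the 75% threshold is only the example value from the list above.

```java
/** Hypothetical helper that combines the two checkpoint triggers. */
class CheckpointTriggerSketch {
    static boolean shouldCheckpoint(long dirtyPages, long totalPages, double dirtyThreshold,
                                    long millisSinceLastCheckpoint, long timeoutMillis) {
        // Trigger 1: the share of dirty pages reached the threshold (e.g. 0.75).
        boolean tooManyDirty = totalPages > 0
            && (double) dirtyPages / totalPages >= dirtyThreshold;

        // Trigger 2: the checkpoint timeout elapsed.
        boolean timedOut = millisSinceLastCheckpoint >= timeoutMillis;

        return tooManyDirty || timedOut;
    }
}
```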




We can’t control the moment when a node crashes.

Let's suppose we have saved the tree leaves but didn’t save the tree root (during page allocation they may be reordered because allocation is multithreaded). In this case all updates will be lost.

At the same time, we can’t write every memory page update to disk each time - it is too slow.

The technique that solves this is called write-ahead logging. Before doing the actual update, we append information about the planned change to a cyclic file named the WAL (the operation is called WAL append or WAL log).

After a crash we can read and replay the WAL on top of the already saved page set. We can restore the state that was the last committed state of the crashed process. The restore is based on the page store + WAL.
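
As an illustration of the log-first, apply-second order and of replay, here is a minimal sketch. It models the WAL as an in-memory list and the page store as a map, which is a strong simplification; WalSketch, LogRecord and recover are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of write-ahead logging; a real WAL is a durable file. */
class WalSketch {
    /** One logged change; a null value means "remove". */
    static final class LogRecord {
        final String key;
        final String value;

        LogRecord(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    final List<LogRecord> wal = new ArrayList<>();
    final Map<String, String> memory = new HashMap<>();

    void put(String key, String value) {
        wal.add(new LogRecord(key, value));  // 1. append the planned change to the log
        memory.put(key, value);              // 2. only then apply it in memory
    }

    void remove(String key) {
        wal.add(new LogRecord(key, null));
        memory.remove(key);
    }

    /** Rebuild the last logged state from the saved pages plus the log. */
    static Map<String, String> recover(Map<String, String> savedPages, List<LogRecord> wal) {
        Map<String, String> state = new HashMap<>(savedPages);
        for (LogRecord r : wal) {
            if (r.value == null)
                state.remove(r.key);
            else
                state.put(r.key, r.value);
        }
        return state;
    }
}
```

Replaying the whole log over an empty saved state reproduces the in-memory state before the crash; with checkpoints, only the log written after the last saved state would have to be replayed.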


In practice we can’t replay the WAL from the beginning of time, because Volume(HDD) < Volume(full WAL). We need a procedure to discard the oldest part of the changes in the WAL, and this is done during checkpointing.

A consistent state comes only from the pair of WAL and page store.

An operation is acknowledged after the operation and the page update(s) were logged. The checkpoint will be started later by its triggers.

WAL records for recovery

Crash recovery involves the following records written to the WAL. They are of 2 main types:

  1. Logical records
    1. Operation description - which operation we want to do. Contains the operation type (put, remove) and (Key, Value, Version) - DataRecord
    2. Transactional record - this record is a marker of begin, prepare, commit, and rollback of transactions (TxRecord)
    3. Checkpoint record - marker of the beginning of checkpointing (CheckpointRecord)

Structure of data record:

  2. Physical records
    1. Full page snapshot - this record is issued for the first page update after a successful checkpoint. The record is logged when the page state changes from 'clean' to 'dirty' (PageSnapshot)
    2. Delta record - describes a change of a memory region, i.e. a page change. Subclass of PageDeltaRecord. Contains the bytes changed in the page, e.g. bytes 5-10 were changed to [...,]. These are relatively small records for B+tree records

Page snapshots and related deltas are combined during WAL replay.

For a particular cache entry update we log records in the following order:

  1. logical record with the planned change - DataRecord with several DataEntry(ies)
  2. page record:
    1. option: the page changed by this update was initially clean, so the full page is logged - PageSnapshot,
    2. option: the page was already modified, so a delta record is issued - PageDeltaRecord
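
The ordering above can be sketched as follows. Records are represented as plain strings and the class UpdateLoggingSketch is hypothetical; only the record names and their order come from the text.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of the record order for one cache entry update. */
class UpdateLoggingSketch {
    final List<String> wal = new ArrayList<>();
    final Set<Long> dirtyPages = new HashSet<>();

    void logUpdate(String key, long pageId) {
        // 1. Logical record with the planned change.
        wal.add("DataRecord(" + key + ")");

        // 2. Physical record: full snapshot on the clean -> dirty transition, delta otherwise.
        if (dirtyPages.add(pageId))
            wal.add("PageSnapshot(" + pageId + ")");
        else
            wal.add("PageDeltaRecord(" + pageId + ")");
    }

    /** After a successful checkpoint all pages are clean again. */
    void checkpointFinished() {
        dirtyPages.clear();
    }
}
```

Two updates that touch the same page between checkpoints therefore produce one PageSnapshot followed by one PageDeltaRecord.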

Planned future optimisation: refer from PageDeltaRecord to the data in the logical record. This will allow byte updates not to be stored twice. There is a file WAL pointer, a pointer to a record counted from the beginning of time. This reference may be used.


WAL structure


WAL file segments and rotation structure



See also WAL history size section below

Local Recovery Process

Let’s assume a node is starting with existing files.

  1. We need to check whether the page store is consistent.
  2. Or we need to find out whether the crash happened while a checkpoint (CP) was running.

Ignite manages 2 types of CP markers on disk (standalone files, each includes a timestamp and a WAL pointer):

  • CP begin
  • CP end

If we observe only a CP begin marker and there is no CP end marker, the CP did not finish and the page store is not consistent.

For a crash without a running CP the restore is simple: logical records are applied.

Let’s suppose the crash occurred in the middle of a checkpoint. In that case the restore process will discover start markers for CP1 and CP2 and an end marker for CP1.

For the completed checkpoint CP1 we apply only physical records; for the incomplete CP2 - only logical records (as the physical ones may be corrupted).

Page snapshot records are required to avoid applying data from delta records twice.

When the replay is finished, the CP2 marker will be added.

If a transaction begin record has no corresponding end, the tx change is not applied.
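
The marker check can be sketched as a scan over marker file names. The sketch assumes the UUID-Begin.bin / UUID-End.bin naming mentioned in the file types list; the class and method names are hypothetical, and real marker files also carry a timestamp and a WAL pointer that are ignored here.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical scan of checkpoint marker files found on node start. */
class CheckpointMarkerSketch {
    private static final String BEGIN_SUFFIX = "-Begin.bin";
    private static final String END_SUFFIX = "-End.bin";

    /** Returns the IDs of checkpoints that have a begin marker but no end marker. */
    static Set<String> unfinishedCheckpoints(List<String> markerFiles) {
        Set<String> begun = new TreeSet<>();
        Set<String> ended = new TreeSet<>();

        for (String name : markerFiles) {
            if (name.endsWith(BEGIN_SUFFIX))
                begun.add(name.substring(0, name.length() - BEGIN_SUFFIX.length()));
            else if (name.endsWith(END_SUFFIX))
                ended.add(name.substring(0, name.length() - END_SUFFIX.length()));
        }

        begun.removeAll(ended);   // what is left was started but never finished
        return begun;
    }
}
```

An empty result corresponds to the simple case (no CP was running at crash time); a non-empty result means the page store may be inconsistent and the two-phase replay described above is needed.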

Summary, limitations and performance 


Because checkpoints are consistent, we can’t start the next CP until the previous one is completed.

The following situation is possible:

  • updates are coming fast from worker threads
  • the CP pool (for copy on write) may become full with newly originated changes

In that case we block new updates and wait for the running CP to finish.

To avoid such a scenario:

  • increase the frequency of checkpoints (to minimize the amount of data to be saved in each CP)
  • increase the CP buffer size

The WAL and the page store may be saved to different devices to avoid their mutual influence.

If the same records are updated many times, this may generate load on the WAL but no significant load on the page store.

To provide recovery guarantees, each write (log()) to the WAL should:

  • call write() itself,
  • and also require fsync (force the OS to flush buffers to the real device).

fsync is an expensive operation. There is an optimisation for the case when updates come faster than disk writes: the fsyncDelayNanos delay (1ns-1ms, 1ns by default) is used. This delay is used to park threads so that more than one fsync request is accumulated.
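
The write() plus fsync pair can be shown with standard Java NIO, where FileChannel.force plays the role of fsync. This is a generic sketch, not Ignite's WAL manager: the class WalWriteSketch and the method logAndSync are hypothetical, and the fsyncDelayNanos batching is not modelled.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: append a record to a segment file and force it to the device. */
class WalWriteSketch {
    /** Returns the segment size after the append. */
    static long logAndSync(Path segment, String record) throws IOException {
        try (FileChannel ch = FileChannel.open(segment,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            ByteBuffer buf = ByteBuffer.wrap(record.getBytes(StandardCharsets.UTF_8));

            // write(): data reaches OS buffers, but not necessarily the device.
            while (buf.hasRemaining())
                ch.write(buf);

            // fsync: ask the OS to flush the file content to the storage device.
            ch.force(false);

            return ch.size();
        }
    }
}
```

Opening the channel per record keeps the sketch short; a real log would keep the segment open and, as described above, try to serve several pending requests with one force call.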

Future optimisation: a standalone thread will be responsible for writing data to disk. Worker threads will do all the preparation and hand the buffer over for writing.

See also WAL history size section below.

WAL mode

There are several levels of guarantees (WALMode):


  • DEFAULT - fsync() on each commit. Survives any crash (OS and process crash).
  • LOG_ONLY - write() on commit; synchronisation is the responsibility of the OS. Survives a process kill, but not an OS failure.
  • BACKGROUND - nothing is done on commit (records are accumulated in memory); write() on timeout. kill -9 may cause the loss of several latest updates.


However, several nodes contain the same data, so it is possible to restore data from other nodes.

Distributed Recovery

Partition update counter. This mechanism was already used in continuous queries.

  • A partition update counter is associated with each partition.
  • Each update causes an increment of the partition update counter.

Each update (counter) is replicated to the backup. If the counter is equal on the primary and the backup, replication is finished.

The partition update counter is saved with the update records in the WAL.
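
A minimal model of the counter is sketched below. PartitionCounterSketch and its methods are hypothetical; the sketch only shows the increment-per-update rule and the equality check between primary and backup.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical per-partition update counter. */
class PartitionCounterSketch {
    private final AtomicLong updateCounter = new AtomicLong();

    /** Called on the primary for every update; returns the counter to replicate. */
    long onUpdate() {
        return updateCounter.incrementAndGet();
    }

    /** Called on the backup when a replicated update with the given counter arrives. */
    void applyReplicated(long counter) {
        updateCounter.accumulateAndGet(counter, Math::max);
    }

    long current() {
        return updateCounter.get();
    }

    /** Replication is finished when both copies have seen the same counter. */
    static boolean inSync(PartitionCounterSketch primary, PartitionCounterSketch backup) {
        return primary.current() == backup.current();
    }
}
```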

Node Join (with data from persistence)

Consider a partition on the joining node that was in the owning state with update counter = 50. Existing nodes have update counter = 150.

A node join causes a partition map exchange, and the update counter is sent with the other partition data. (The joining node will have a new ID, and from the point of view of discovery this node is a new node.)

The coordinator observes the older partition state and forces the partition to the moving state. Forcing the moving state is required to set up uploading of newer data.

Rebalancing of fresh data to the joined node may now run in 2 modes:

  • There is WAL on the primary node. The WAL includes a checkpoint marker with partition update cntr = 45.
    • We can send only WAL logical update records to the backup.
  • If the counter in the WAL is too big, e.g. 200, we don’t have the delta (can't send WAL records).
    • The joined node will have to clear the partition data.
    • The partition state is set to the renting state.
    • When the clean up is finished, the partition goes to the moving state.
    • We can’t use delta updates because of a possible problem with keys deleted earlier: we can get a stale key if we send only the delta of changes.
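
The choice between the two modes can be written as a comparison of counters. The sketch below is hypothetical (RebalanceModeSketch, Mode and choose are illustrative names) and uses the numbers from the example: the joining node has counter 50, and the oldest checkpoint marker still present in the primary's WAL has counter 45 or 200.

```java
/** Hypothetical decision between WAL delta rebalancing and full rebalancing. */
class RebalanceModeSketch {
    enum Mode { WAL_DELTA, FULL }

    /**
     * @param joiningCounter partition update counter on the joining node
     * @param oldestCounterInWal counter stored with the oldest checkpoint marker in the primary's WAL
     */
    static Mode choose(long joiningCounter, long oldestCounterInWal) {
        // The WAL can supply the delta only if it reaches back at least to the
        // state the joining node already has.
        return oldestCounterInWal <= joiningCounter ? Mode.WAL_DELTA : Mode.FULL;
    }
}
```

With the values from the text, choose(50, 45) selects the WAL delta, while choose(50, 200) selects full rebalancing, which requires clearing the partition first.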

Possible future optimisation: for a full update we may send the page store file over the network.

WAL history size

In the corner case we need to store the WAL for only 1 checkpoint in the past for a successful recovery (PersistentStoreConfiguration#walHistSize).

We can’t delete WAL segments considering only the history size in bytes or segments. It is possible to replay the WAL only starting from a checkpoint marker.

WAL history size is measured in the number of checkpoints.

Assuming that checkpoints are triggered mostly by timeout, we can estimate the possible downtime after which a node may still be rebalanced using delta logical WAL records.

By default the WAL history size is 20, to increase the probability that rebalancing can be done using logical deltas from the WAL.
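
The downtime estimate is a simple product, sketched below. The helper name is hypothetical, and the checkpoint interval is an input chosen by the reader; the text only gives the default history size of 20.

```java
import java.time.Duration;

/** Hypothetical estimate of how long a node may be down and still use delta rebalancing. */
class WalHistorySketch {
    static Duration maxDowntimeForDeltaRebalance(int walHistorySize, Duration checkpointInterval) {
        // The WAL keeps walHistorySize checkpoints; if they are triggered by timeout,
        // the history covers roughly walHistorySize * checkpointInterval of updates.
        return checkpointInterval.multipliedBy(walHistorySize);
    }
}
```

For example, with a history of 20 checkpoints and an assumed interval of 3 minutes, the estimate is about one hour. Checkpoints triggered by the dirty pages threshold happen more often and shorten this window.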
