Page Memory

PageMemory is an abstraction for working with pages in memory. Internally it interacts with a file store, which is responsible for allocating page IDs and for writing and reading pages. One should distinguish the concept of a page from that of a page buffer. A page is a block of data of fixed length with a unique identifier called FullPageId. A page buffer is a region in memory associated with some page. PageMemory fully handles the process of loading pages into the corresponding page buffers and evicting unused page buffers from memory. At any moment in time, PageMemory may keep any subset of page buffers in memory (as many as fit into the allocated RAM).
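A minimal sketch of how a caller might interact with such an abstraction is shown below. The interface and method names (acquirePage, releasePage) are assumptions made for this illustration and do not reproduce the actual org.apache.ignite.internal.pagemem.PageMemory API.

import java.nio.ByteBuffer;

/** Illustrative only: a simplified view of a page memory abstraction. */
interface SimplePageMemory {
    /** Loads the page into a page buffer (if it is not already loaded) and pins it in memory. */
    ByteBuffer acquirePage(int cacheId, long pageId);

    /** Unpins the page buffer, making it eligible for eviction again. */
    void releasePage(int cacheId, long pageId);
}

class PageMemoryUsage {
    static long readFirstLong(SimplePageMemory mem, int cacheId, long pageId) {
        ByteBuffer buf = mem.acquirePage(cacheId, pageId); // the page buffer is now guaranteed to be in RAM

        try {
            return buf.getLong(0);
        }
        finally {
            mem.releasePage(cacheId, pageId); // allow eviction once we are done
        }
    }
}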

FullPageId

FullPageId consists of a cache ID (32 bits) and a page ID (64 bits). The page ID is effectively a virtual page identifier that may change during the page lifecycle (see Page ID Rotation below). EffectivePageId, the part of the page ID consisting of the partition ID and page index, does not change during the page lifecycle.

PageId and EffectivePageId

The structure of page ID is defined by the following diagram:

...

  • Offset is a reserved field used either for page ID rotation or for referencing a record in a data page
  • Flags is a reserved field used for page ID rotation
  • Partition ID is either a partition identifier in [0, 65500] or the reserved value 0xFFFF used for the index partition. Other values are reserved for future use.
  • Page Index is a monotonically growing number within each partition (see the sketch after this list)
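A minimal sketch of packing and unpacking these fields is given below. The 16-bit partition ID and 32-bit page index follow the description above; the exact widths and positions of the offset and flags fields (8 bits each here) are assumptions made for this example, the diagram above being the authoritative layout.

/** Illustrative sketch of the page ID layout described above (field widths partly assumed). */
final class PageIdSketch {
    static long pageId(int offset, int flags, int partId, int pageIdx) {
        return ((long)(offset & 0xFF) << 56)
            | ((long)(flags & 0xFF) << 48)
            | ((long)(partId & 0xFFFF) << 32)
            | (pageIdx & 0xFFFFFFFFL);
    }

    /** Effective page ID: partition ID and page index only, stable across page ID rotation. */
    static long effectivePageId(long pageId) {
        return pageId & 0x0000FFFFFFFFFFFFL; // drop offset and flags, keep partition ID + page index
    }

    static int partId(long pageId) { return (int)((pageId >>> 32) & 0xFFFF); }

    static int pageIdx(long pageId) { return (int)(pageId & 0xFFFFFFFFL); }

    public static void main(String[] args) {
        long id = pageId(0, 1, 0xFFFF, 42); // page 42 of the index partition (0xFFFF)

        assert partId(id) == 0xFFFF && pageIdx(id) == 42;
        assert effectivePageId(id) == effectivePageId(pageId(3, 1, 0xFFFF, 42)); // rotation does not change it
    }
}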

Page State

At any moment in time, a page can be in one of the following states:

...

PageMemory keeps track of dirty page IDs and is capable of supplying this list to the checkpointer thread. When a checkpoint starts, the collection of dirty pages is atomically reset to a new empty collection.
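A minimal sketch of this bookkeeping is shown below, assuming a simple concurrent set that the checkpointer swaps out atomically at checkpoint start. The class and method names are illustrative, not Ignite's implementation.

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

/** Illustrative sketch: tracking dirty page IDs and handing them to the checkpointer. */
final class DirtyPageTracker {
    /** Current collection of dirty page IDs; replaced atomically when a checkpoint begins. */
    private final AtomicReference<Set<Long>> dirtyPages =
        new AtomicReference<>(ConcurrentHashMap.newKeySet());

    /** Called whenever a page buffer is modified under its write lock. */
    void markDirty(long fullPageId) {
        dirtyPages.get().add(fullPageId);
    }

    /**
     * Called by the checkpointer thread at checkpoint start: atomically installs a fresh
     * empty collection and returns the previous one, which becomes the checkpoint write set.
     */
    Set<Long> beginCheckpoint() {
        return dirtyPages.getAndSet(ConcurrentHashMap.newKeySet());
    }
}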

Page Modification

Each page buffer has an associated read-write lock paired with a tag. The page tag is the part of the page ID used for page ID rotation. In order to read or write page buffer contents, one must acquire the read or write lock. The page buffer read-write lock requires a correct tag to be passed in order for the lock to be acquired. If the passed-in tag differs from the actual page tag, it means that the page has been reused and the link to this page is no longer valid (see Page ID Rotation below).
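A minimal sketch of such a tag-checked acquisition is shown below, built on a standard ReentrantReadWriteLock. The class name and the way failure is signaled are assumptions for this example.

import java.util.concurrent.locks.ReentrantReadWriteLock;

/**
 * Illustrative sketch: a read-write lock that only grants access when the caller
 * supplies the tag currently associated with the page buffer.
 */
final class TaggedReadWriteLock {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    private volatile int tag; // updated under the write lock when the page ID is rotated

    TaggedReadWriteLock(int initialTag) {
        tag = initialTag;
    }

    /**
     * Tries to acquire the read lock for the expected tag. Returns false if the tag
     * has changed, meaning the page was reused and the caller's link is stale.
     */
    boolean readLock(int expectedTag) {
        lock.readLock().lock();

        if (tag != expectedTag) {
            lock.readLock().unlock(); // stale link: give the lock back and report failure

            return false;
        }

        return true;
    }

    void readUnlock() {
        lock.readLock().unlock();
    }
}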

Internal Data Structures

Depending on the page type and the amount of free space in a page, it can optionally be tracked in either a FreeList or a ReuseList. FreeList is a data structure used to track partially free pages (applicable to data pages). ReuseList tracks completely free pages that can be reused by any other data structure in the database.
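A minimal sketch contrasting what the two lists keep track of is given below, using plain in-memory collections of page IDs. The real structures are themselves page-based; this is only meant to illustrate the difference in what is tracked.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch: what a free list vs. a reuse list keeps track of. */
final class FreeAndReuseListsSketch {
    /** Partially free data pages, bucketed by remaining free space so a fitting page can be found quickly. */
    private final Map<Integer, Deque<Long>> freeListBuckets = new TreeMap<>();

    /** Completely free pages that any data structure may take and repurpose. */
    private final Deque<Long> reuseList = new ArrayDeque<>();

    /** Tracks a data page that still has the given amount of free space. */
    void putFreePage(long pageId, int freeSpace) {
        freeListBuckets.computeIfAbsent(freeSpace, k -> new ArrayDeque<>()).add(pageId);
    }

    /** A page that became completely empty goes to the reuse list. */
    void recyclePage(long pageId) {
        reuseList.add(pageId);
    }
}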

Each partition has a special dedicated page (a meta page) which holds the state of the partition and page IDs serving as roots for all the data structures associated with the partition.

Data partitions

For data partitions, the following data structures are used:

  • FreeList (working as a reuse list at the same time) to track free and empty pages.
  • Partition hash index used as the main storage for the cache.
  • RowStore to keep key-value pairs

Index partition

For the index partition, the following data structures are used:

  • ReuseList (since BTrees fully acquire pages and do not need to track free space)
  • Metadata storage. This data structure keeps track of all allocated indexes and contains pairs (index name, root page ID)

Page ID Rotation

The structure of BPlusTree combined with the presence of a ReuseList means that it is possible to observe a state in which upper pages of a BPlusTree refer to a page that has been returned to the reuse list by a concurrent remove and then taken for yet another place in the same BPlusTree (the ABA problem). In such a situation a deadlock is possible because the page locking order can be violated. To avoid this, we update the page tag each time a page is returned to a reuse list. If a reader then finds that the lock cannot be acquired because of a page tag mismatch, it restarts the operation from scratch, reading an updated link.
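A minimal sketch of the reader-side retry loop this implies is shown below, reusing the illustrative TaggedReadWriteLock from the Page Modification section. The link-resolution helper is hypothetical and stands in for re-reading the parent page to obtain a fresh link.

/** Illustrative sketch: restart an operation when a page tag mismatch reveals a stale link. */
final class RetryOnRotation {
    /** Hypothetical view of a resolved link: the target page's lock plus the tag encoded in the link. */
    record ResolvedLink(TaggedReadWriteLock pageLock, int expectedTag) {}

    interface LinkResolver {
        /** Re-reads the parent page and resolves the current link. Hypothetical helper for this sketch. */
        ResolvedLink resolve();
    }

    static void readWithRetry(LinkResolver resolver, Runnable readAction) {
        while (true) {
            ResolvedLink link = resolver.resolve();

            if (link.pageLock().readLock(link.expectedTag())) {
                try {
                    readAction.run(); // the page still holds the data the link refers to
                }
                finally {
                    link.pageLock().readUnlock();
                }

                return;
            }
            // Tag mismatch: the page was recycled and reused elsewhere; restart with a fresh link.
        }
    }
}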

Checkpointing

Checkpointing is the process of writing dirty pages from memory to persistent storage.

...

  • The last checkpoint start marker has a corresponding checkpoint end marker. This means that the process stopped or crashed when no checkpoint was in progress, thus the page store (disk) contains a consistent state of the data structures (as of the time of that checkpoint). We only need to replay the logical updates that were logged to the WAL since the last checkpoint marker.
  • The last checkpoint start marker does not have a corresponding checkpoint end marker. This means that the process crashed or stopped during a checkpoint and the page store contains half-written pages. In this case the recovery consists of two steps. First, we must replay all WAL records starting from the last finished checkpoint marker and ending with the start checkpoint marker. This brings the data structures to a consistent state. After that, we must replay the WAL logical updates from record B to the end of the log (the same as above); see the sketch after this list.
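A minimal sketch of this recovery decision is given below. The WAL pointer type and the replay helpers (replayPhysicalRecords, replayLogicalUpdates) are hypothetical, and it is assumed here that record B in the description above corresponds to the start marker of the unfinished checkpoint; only the branching mirrors the two cases listed.

/** Illustrative sketch of the two recovery cases described above. */
final class RecoverySketch {
    /** Hypothetical position in the WAL. */
    record WalPointer(long segmentIdx, int offset) {}

    /** Hypothetical replay API for this sketch. */
    interface WalReplayer {
        void replayPhysicalRecords(WalPointer from, WalPointer to);

        void replayLogicalUpdates(WalPointer from);
    }

    static void recover(WalPointer lastCheckpointStart,
        boolean lastCheckpointHasEndMarker,
        WalPointer previousFinishedCheckpointMarker,
        WalReplayer wal) {
        if (lastCheckpointHasEndMarker) {
            // Case 1: the last checkpoint completed, so pages on disk are consistent.
            // Only logical updates written after that checkpoint need to be replayed.
            wal.replayLogicalUpdates(lastCheckpointStart);
        }
        else {
            // Case 2: crash in the middle of a checkpoint; the page store may contain
            // half-written pages. First restore page consistency from the WAL records
            // between the last finished checkpoint marker and the start marker...
            wal.replayPhysicalRecords(previousFinishedCheckpointMarker, lastCheckpointStart);

            // ...then replay logical updates from that start marker (record B, assumed) onwards.
            wal.replayLogicalUpdates(lastCheckpointStart);
        }
    }
}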

Page Store

Page store is an interface responsible for allocating, reading and writing page buffers to physical durable storage. A page store is allocated on a per-partition basis. Refer to org.apache.ignite.internal.pagemem.store.PageStore for the store API. Here we describe the meaning of the main store methods:

...

All methods of the page store must be thread-safe.
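A minimal sketch of such an interface, simplified for illustration, is shown below. It is not the actual org.apache.ignite.internal.pagemem.store.PageStore contract; the method names and signatures here are assumptions.

import java.nio.ByteBuffer;

/**
 * Illustrative, simplified page store contract. One instance serves one partition.
 * All methods must be safe to call from multiple threads concurrently.
 */
interface SimplePageStore {
    /** Allocates a new page in the store and returns its page index within the partition. */
    long allocatePage();

    /** Reads the page with the given ID into the supplied buffer. */
    void read(long pageId, ByteBuffer pageBuf);

    /** Writes the page buffer to durable storage at the position derived from the page ID. */
    void write(long pageId, ByteBuffer pageBuf);

    /** Forces all previously written pages to the physical device (fsync). */
    void sync();
}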

Write-Ahead-Log

WAL is an append-only data structure that logs logical database modifications and page changes. Refer to org.apache.ignite.internal.pagemem.wal.IgniteWriteAheadLogManager for the API. Here we describe the meaning of the main WAL methods:

...

The current WAL implementation is segment-based. The WAL is written to files of fixed length called segments; each segment has a monotonically growing absolute index. A segment is first written to one of the preallocated work files in the WAL work directory. The index of the work file is the absolute index of the segment modulo the number of preallocated segments. At any moment in time, only one work file is open for writing. After a work file has been fully written, it is copied to the WAL archive directory by a background WAL archiver, then the work file is cleaned (filled with zeros) and given back to the WAL for further writes. The WAL archiver and the WAL are properly synchronized so that the WAL does not overwrite unarchived work files.
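A minimal sketch of the work-file mapping described above is given below; the constant value and names are illustrative.

/** Illustrative sketch: mapping a WAL segment's absolute index to a preallocated work file. */
final class WalSegmentMapping {
    /** Number of preallocated work files in the WAL work directory (an example value). */
    static final int WORK_SEGMENTS = 10;

    /** The work file a segment is written to is chosen by absolute index modulo the pool size. */
    static int workFileIndex(long absoluteSegmentIdx) {
        return (int)(absoluteSegmentIdx % WORK_SEGMENTS);
    }

    public static void main(String[] args) {
        // Segments 0, 10, 20, ... reuse work file 0 once the earlier segment has been archived.
        assert workFileIndex(0) == 0 && workFileIndex(10) == 0 && workFileIndex(23) == 3;
    }
}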

Rebalancing

Rebalancing logic is based on per-partition update counters. During rebalancing, every node sends its update counters for each partition to the coordinator. The coordinator then assigns partition states according to the update counters: nodes that have the highest value for a partition are considered its owners.
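A minimal sketch of this owner selection is given below, assuming the coordinator simply holds a map from node ID to that node's per-partition update counters; the names are illustrative.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative sketch: the coordinator picks as owners the nodes reporting the highest update counter. */
final class OwnerSelection {
    /**
     * @param countersByNode Per-node array of update counters, indexed by partition.
     * @param partId Partition to resolve.
     * @return Node IDs that reported the highest counter for the partition (its owners).
     */
    static List<String> owners(Map<String, long[]> countersByNode, int partId) {
        long max = Long.MIN_VALUE;
        List<String> owners = new ArrayList<>();

        for (Map.Entry<String, long[]> e : countersByNode.entrySet()) {
            long cntr = e.getValue()[partId];

            if (cntr > max) { // a more up-to-date node: start a new owner set
                max = cntr;
                owners.clear();
                owners.add(e.getKey());
            }
            else if (cntr == max) // equally up to date: also an owner
                owners.add(e.getKey());
        }

        return owners;
    }
}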

...