PageMemory is an abstraction for working with pages in memory. Internally it interacts with a file store which is responsible for allocating page ids, writing and reading pages. One should distinguish a concept of a page and a page buffer. A page is a block of data of a fixed length with a unique identifier called FullPageId. A page buffer is a region in memory associated with some page. PageMemory fully handles the process of loading pages to the corresponding page buffers and evicting unused page buffers from memory. At any moment in time page memory may keep any subset of page buffers (fitting to the allocated RAM).
FullPageId consists of cache ID (32 bits) and page ID (64 bits). Page ID is effectively a virtual page identifier that may change during the page lifecycle (see PageRotation below). EffectivePageId, which is a part of Page ID (partition ID, page index) does not change during the page lifecycle.
The structure of page ID is defined by the following diagram:
+---------+-----------+------------+--------------------------+
| 8 bits | 8 bits | 16 bits | 32 bits |
+---------+-----------+------------+--------------------------+
| OFFSET | FLAGS |PARTITION ID| PAGE INDEX |
+---------+-----------+------------+--------------------------+
| EFFECTIVE PAGE ID |
+---------------------------------------+
At any moment in time page can be in the following states:
PageMemory keeps track of dirty page IDs and is capable of supplying this list to the checkpointer thread. When checkpoint starts, the collection of dirty pages is atomically reset to a new empty collection.
Each page buffer has an associated read write lock paired with a tag. Page tag is a part of page ID used for page rotation. In order to read or write page buffer contents one must acquire the read or write lock. The page buffer read write lock requires a correct tag to be passed in order for the lock to be acquired. If the passed in tag differs from the actual page tag, it means that the page has been reused and the link to this page is no longer valid (see page ID rotation below).
Depending on page type and the amount of free space in a page it can be optionally tracked in either FreeList or ReuseList. FreeList is a data structure used to track partially free pages (applicable for data pages). ReuseList tracks completely free pages that can be used by any other data structure in the database.
Each partition has a special dedicated page (a meta page) which holds the state of the partition and page IDs serving as roots for all the data structures associated with the partition.
For data partitions, the following data structures are used:
For index partition, the following data structures are used:
The structure of BPlusTree combined with the presence of ReuseList means that it is possible to observe a state when upper pages of BPlusTree refer to a page which has been returned to the reuse list by a concurrent remove, and this page has been taken for yet another place in the same BPlusTree (ABA problem). In such situation it is possible to get a deadlock because the order of page locking can be violated. In order to avoid such situations, we update page tag each time a page is returned to a reuse list. In this case if a reader finds that the lock cannot be acquired because of the page tag mismatch, it restarts an operation from scratch, reading an updated link.
Checkpointing is a process of writing dirty pages from memory to persistent storage.
The following key components are needed for checkpoint
After a checkpoint begun, PageMemory is moved to a copy-on-write mode. This means that if a dirty page buffer is modified, the page will be moved to the Dirty In Checkpoint state and PageMemory will create a temp copy of the page buffer which is visible for the PageMemory users (the current contents of the page). At the same time the checkpoint page buffer is visible only for checkpointer which will eventually flush it to disk. When checkpointer finishes writing, it will flush all copied pages to the main memory.
Upon each page buffer modification (before the page write lock release) the corresponding changes must be logged to the WAL. More specifically, if the page being modified is clean, the whole modified page buffer must be logged. If the page being modified is dirty, only binary page delta should be modified to reduce WAL consumption and improve performance.
Upon start, a node can observe the following states of checkpoint markers:
Page store is an interface responsible for allocating, reading and writing page buffers to a physical durable storage. Page store is allocated on per-partition basis. Refer to org.apache.ignite.internal.pagemem.store.PageStore for the store API. Here we describe the meaning of the main store methods
All methods of the page store must be thread-safe.
WAL is an append-only data structure that logs logical database modifications and page changes. Refer to org.apache.ignite.internal.pagemem.wal.IgniteWriteAheadLogManager for the API. Here we describe the meaning of the main WAL methods
Decoupling
In current API of the WAL we rely on java objects rather than on byte arrays. As per discussion with Veritas, we need to decouple the java records serialization logic and writing the serialized values to disk. In order to achieve ideal integration, we must supply a block-based WAL implementation.
Current implementation of WAL works based on segment-based WAL structure. WAL is written to a files of fixed length called segments, each segment has a monotonically growing absolute index. The segment is first written to one of the preallocated work files in WAL work directory. The index of the work file is the absolute index of the segment modulo the number of preallocated segments. In each moment in time only one work file is opened for writing. After a work file has been written, it is copied to the WAL archive directory by a background WAL archiver, then this work file is cleaned (filled with zeros) and given back to WAL for further writes. WAL archiver and WAL are properly synchronized so that WAL does not overwrite unarchived work files.
Rebalancing logic is based on per-partition update counters. During rebalancing, every node sends it's update counters for each partition to the coordinator. Then coordinator assigns partition states according to update counters: nodes that have the highest value for a partition are considered its owners.
A node that needs to obtain more recent data for a partition will include its current update counter in GridDhtPartitionDemandMessage. A supplier might return either full partition or changes since that counter, extracted from WAL.