Apache Ignite needs to support running SQL queries from disk without loading them in memory and, in general, be fully operational even if all the data is on disk.
Name | Description |
---|---|
Persistent Store | Disk-based (preferably SSD) file or block-level store for data stored in Apache Ignite |
Write Ahead Log (WAL) | Temporary disk-based journal which keeps track of transactional updates that have not been stored in the Persistent Store yet |
Sorted Index | Index that is capable of performing range-based SQL queries |
Hash Index | Index that is capable of doing direct equality-based lookups |
Snapshottable Index | Index that is capable of performing timestamp-based snapshots of the data |
Watch the recording of the Apache Ignite Community meetup to quickly grasp the benefits, specificities, architecture and implementation details of the store.
The main purpose of the Persistent Store is to provide data persistence and fault-tolerance, i.e. no data should ever be lost regardless of any type of failures.
Data partitioning details:
MOVING
, i.e. not eligible to become primary partition yet. We may consider adding the ability to create new files, either periodically, or upon a request. In this case, the process of "applying" logs will create new files, instead of updating the old ones.
NOTE: Apache Ignite Persistent Store currently already supports the following functionality which we should be able to leverage:
Apache Ignite will maintain a WAL
file for all caches. The purpose of the WAL
file is to provide a recovery mechanism for the transactional data stored on disk.
WAL
update details:
WAL
file.WAL
data and move it to Persistent Store
(this will prevent unconditional growth of the WAL
file).WAL
and Persistent Store
data.It should be possible to set the maximum size of the log file.
We should have several WAL files. Whether these should be the same files is an open question, we may decide to can create new ones each time.
We should be able to set the lifetime of the WAL and a place where older WAL files should be moved (after their contents are transferred to the Persistent Store). For example, there may be SSDs and conventional disks. All logs that are required for failure handling are kept on the SSDs and traditional disks, and those that are not needed anymore are removed from SSDs. In general, almost complete analogy with Oracle mode ARCHIVELOG.
In addition to the Persistent Store
, Apache Ignite will maintain a per-partition hash-index for key-based data access both, in memory and on disk. This hash-index
will guarantee fast data lookups based on a primary key.
The on disk hash-index
will always have the latest consistent state, up until the last WAL
flush to the Persistent Store
. In case of failures or restarts, Apache Ignite should be able to immediately recreate the state of the hash-index
on disk by applying the remaining portion of the WAL
that has not been copied yet to the Persistent Store
. Once the disk state is restored, Apache Ignite can start loading the hash-index
into memory in the background, without blocking any cache operations.
Based on configuration, hash-index
should be optionally snapshottable (especially if we decide to use it for SQL queries).
Apache Ignite will need to maintain on-disk and in-memory sorted indexes. These indexes will have the latest state up until the last WAL
flush. It is important that indexes can be loaded in-memory as-is, in the same format as on-disk, without any additional processing, for better performance. Just like the hash-index
, the sorted-indexes
will be split into file-per-partition, which will make it very easy to copy partitions between nodes in case of failures or topology changes.
The on-disk sorted-index
will have the same failure and recovery process, as described for the hash-indexes
above.
NOTE: for in-memory representation of sorted index, need to consider removing per-partition split and keeping the whole index in one data structure in-memory. This will help avoid merge-sort overhead for range-based and order-by queries and potentially provide better performance.
All the indexes that are stored on disk will also be maintained in-memory. However, Apache Ignite will be able to perform all the cache operations, including SQL queries, based exclusively on the on-disk indexes, without having to load them in-memory. The in-memory indexes are only maintained for performance reasons and do not have any effect on the overall supported functionality.
Client node failures do not have effect on persisted state on servers.
In case when all the primaries failed with backups, application must detect that some of partitions are lost and needs to block all the queries until the failed node will be restarted, because otherwise query results will be wrong (as well as cache.get, etc...)
In case when backups or primaries still exist, the failure of a node does not affect query result correctness. Since we will store backup entries in Persistent Store as well as primaries, affinity change to new topology will not affect data consistency between stores on different nodes.
On restart, Apache Ignite will determine if another server contains the latest file for a certain partition, and will copy that file before resuming operation.
It is possible that after all nodes agreed during the prepare
phase, the commit
phase will fail on some node. To handle such cases, nodes should persist the changes to the WAL
during the prepare
phase, but mark them as "uncommitted"
. Then, whenever the commit happens, the "uncommitted"
flag should be flipped.
If after a crash, certain transactions were left in "uncommitted"
state, then their persisted state should be ignored and loaded from other cluster nodes, primary or backups, which have the consistent state.
If after restart, none of the nodes have a semi-failed transaction in committed
state, then the transaction should be rolled back and the "uncommitted"
persisted state should be ignored.
NOTE: this protocol for transactional consistency is already implemented in Apache Ignite for in-memory transactions and will need to be expanded to support disk-based transactions.
Whenever crashes or topology changes occur, an internal cluster protocol will need to determine which cluster node has the most up-to-date partition files for Persistent Store, hash-index
, and sorted-indexes
, and copy these files to the nodes which either have outdated partition data or to the new nodes which have no data at all. For performance reasons we should investigate to only copy the missing delta instead of the whole partition files.
The copy functionality itself should be provided by a pluggable interface and potentially have several implementations. For example, we can copy partition files using standard Java-based File APIs or native Linux scp
or cp
commands.