Apache Ignite needs to support running SQL queries against data on disk without loading it into memory and, in general, be fully operational even if all the data is on disk.
|Disk-based (preferably SSD) file or block-level store for data stored in Apache Ignite|
|Temporary disk-based journal which keeps track of transactional updates that have not been stored in the |
|Index that is capable of performing range-based SQL queries|
|Index that is capable of doing direct equality-based lookups|
Index that is capable of performing timestamp-based snapshots of the data
Watch the recording of the Apache Ignite Community meetup to quickly grasp the benefits, specifics, architecture and implementation details of the store.
The main purpose of the Persistent Store is to provide data persistence and fault-tolerance, i.e. no data should ever be lost, regardless of the type of failure.
Data partitioning details:
MOVING, i.e. not eligible to become primary partition yet.
We may consider adding the ability to create new files, either periodically or upon request. In this case, the process of "applying" logs will create new files instead of updating the old ones.
NOTE: Apache Ignite Persistent Store currently already supports the following functionality which we should be able to leverage:
Apache Ignite will maintain a WAL file for all caches. The purpose of the WAL file is to provide a recovery mechanism for the transactional data stored on disk.
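The recovery role of the WAL can be illustrated with a minimal in-memory sketch. Everything here is an assumption made for illustration: the class `WalSketch`, its `Record` type, and the `checkpoint`/`recover` method names are hypothetical and are not part of any Apache Ignite API. Updates are logged first, a checkpoint moves pending records into the store, and recovery rebuilds the latest state from the store plus the WAL tail.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory model of a WAL sitting in front of a persistent store. */
class WalSketch {
    /** One logged update; a null value means "remove the key". */
    static final class Record {
        final String key;
        final String value;

        Record(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    private final List<Record> wal = new ArrayList<>();         // stands in for the WAL file
    private final Map<String, String> store = new HashMap<>();  // stands in for the Persistent Store
    private int checkpointed = 0;                               // WAL records already applied to the store

    /** Log the update; the store itself is only modified by checkpoint(). */
    void put(String key, String value) {
        wal.add(new Record(key, value));
    }

    /** Move all pending WAL records into the store. */
    void checkpoint() {
        for (; checkpointed < wal.size(); checkpointed++)
            apply(store, wal.get(checkpointed));
    }

    /** Rebuild the latest state after a crash: store contents plus the WAL tail. */
    Map<String, String> recover() {
        Map<String, String> state = new HashMap<>(store);
        for (int i = checkpointed; i < wal.size(); i++)
            apply(state, wal.get(i));
        return state;
    }

    /** Read-only look at what has reached the store so far. */
    String storedValue(String key) {
        return store.get(key);
    }

    private static void apply(Map<String, String> target, Record r) {
        if (r.value == null)
            target.remove(r.key);
        else
            target.put(r.key, r.value);
    }
}
```

In this model a crash between `put` and `checkpoint` loses nothing, because `recover` replays every record after the last checkpoint position.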
WAL update details:
WAL data and move it to Persistent Store (this will prevent unconditional growth of the
It should be possible to set the maximum size of the log file.
We should have several WAL files. Whether these files should be reused is an open question; we may decide to create new ones each time.
We should be able to set the lifetime of the WAL and a location to which older WAL files are moved (after their contents are transferred to the Persistent Store). For example, there may be SSDs and conventional disks. All logs that are required for failure handling are kept on the SSDs and traditional disks, and those that are no longer needed are removed from the SSDs. In general, this is an almost complete analogy with Oracle's ARCHIVELOG mode.
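A size limit combined with an archive location could look roughly like the following sketch, written against the standard `java.nio.file` API. The class `WalSegments`, the `wal-N.log` file naming, and the `work`/`archive` directory layout are hypothetical choices for illustration only; the sketch assumes that the caller knows which segments have already been transferred to the Persistent Store.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch: size-capped WAL segments with a separate archive directory. */
class WalSegments {
    private final Path workDir;          // fast storage for segments still needed for recovery
    private final Path archiveDir;       // slower storage for segments already applied to the store
    private final long maxSegmentBytes;  // configured maximum size of one segment file
    private int current = 0;             // index of the segment currently being written
    private int firstUnarchived = 0;     // lowest segment index still located in workDir

    WalSegments(Path workDir, Path archiveDir, long maxSegmentBytes) {
        this.workDir = workDir;
        this.archiveDir = archiveDir;
        this.maxSegmentBytes = maxSegmentBytes;
    }

    /** Convenience factory that creates both directories under a fresh temp directory. */
    static WalSegments inTempDir(long maxSegmentBytes) {
        try {
            Path root = Files.createTempDirectory("wal-sketch");
            return new WalSegments(Files.createDirectories(root.resolve("work")),
                Files.createDirectories(root.resolve("archive")), maxSegmentBytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static Path segment(Path dir, int idx) {
        return dir.resolve("wal-" + idx + ".log");
    }

    /** Append one record, rolling over to a new segment when the limit would be exceeded. */
    void append(String record) {
        byte[] bytes = (record + "\n").getBytes(StandardCharsets.UTF_8);
        try {
            Path seg = segment(workDir, current);
            if (Files.exists(seg) && Files.size(seg) + bytes.length > maxSegmentBytes)
                seg = segment(workDir, ++current);
            Files.write(seg, bytes, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Move closed segments with index below upTo to the archive; the active segment stays. */
    void archiveBefore(int upTo) {
        int limit = Math.min(upTo, current);
        try {
            for (; firstUnarchived < limit; firstUnarchived++)
                Files.move(segment(workDir, firstUnarchived), segment(archiveDir, firstUnarchived));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int currentSegment() { return current; }

    boolean isArchived(int idx) { return Files.exists(segment(archiveDir, idx)); }

    boolean isInWorkDir(int idx) { return Files.exists(segment(workDir, idx)); }
}
```

Creating a new file on every rollover corresponds to one of the two options left open above; reusing a fixed set of files would need a different naming scheme.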
In addition to the Persistent Store, Apache Ignite will maintain a per-partition hash-index for key-based data access, both in memory and on disk. This hash-index will guarantee fast data lookups based on a primary key.
The on-disk hash-index will always have the latest consistent state, up until the last WAL flush to the Persistent Store. In case of failures or restarts, Apache Ignite should be able to immediately recreate the state of the hash-index on disk by applying the remaining portion of the WAL that has not yet been copied to the Persistent Store. Once the disk state is restored, Apache Ignite can start loading the hash-index into memory in the background, without blocking any cache operations.
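The catch-up step described above can be sketched as follows. This is only a model under stated assumptions: `PartitionedHashIndex`, `WalEntry`, and `catchUp` are hypothetical names, the index maps a key to an opaque store location, and the WAL is represented as an in-memory list whose positions play the role of WAL offsets.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical per-partition hash index that is brought up to date from a WAL. */
class PartitionedHashIndex {
    /** A WAL record saying "this key now lives at this location in the store". */
    static final class WalEntry {
        final String key;
        final long location;

        WalEntry(String key, long location) {
            this.key = key;
            this.location = location;
        }
    }

    private final List<Map<String, Long>> partitions = new ArrayList<>();
    private int appliedUpTo = 0; // number of WAL entries already reflected in the index

    PartitionedHashIndex(int partitionCount) {
        for (int i = 0; i < partitionCount; i++)
            partitions.add(new HashMap<>());
    }

    /** Map a key to its partition; floorMod keeps the result non-negative. */
    int partitionOf(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    /** Direct equality lookup: one partition, one hash probe. */
    Long lookup(String key) {
        return partitions.get(partitionOf(key)).get(key);
    }

    /**
     * Apply WAL entries in [appliedUpTo, upTo). During normal operation this models a WAL
     * flush; after a restart, calling it with upTo = wal.size() replays the remaining tail.
     * Returns the number of entries applied.
     */
    int catchUp(List<WalEntry> wal, int upTo) {
        int applied = 0;
        for (; appliedUpTo < upTo; appliedUpTo++, applied++) {
            WalEntry e = wal.get(appliedUpTo);
            partitions.get(partitionOf(e.key)).put(e.key, e.location);
        }
        return applied;
    }
}
```

Because only the entries after `appliedUpTo` are replayed, the cost of recovery in this model is proportional to the unflushed WAL tail rather than to the size of the index.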
Based on configuration, the hash-index should be optionally snapshottable (especially if we decide to use it for SQL queries).
Apache Ignite will need to maintain on-disk and in-memory sorted indexes. These indexes will have the latest state up until the last WAL flush. It is important that indexes can be loaded in-memory as-is, in the same format as on-disk, without any additional processing, for better performance. Sorted-indexes will likewise be split into file-per-partition, which will make it very easy to copy partitions between nodes in case of failures or topology changes.
The on-disk sorted-index will have the same failure and recovery process as described for the
NOTE: for the in-memory representation of the sorted index, we need to consider removing the per-partition split and keeping the whole index in one in-memory data structure. This will help avoid merge-sort overhead for range-based and order-by queries and potentially provide better performance.
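The trade-off in the note above can be made concrete with a small sketch that uses `java.util.TreeMap` as a stand-in for a sorted index. The class `SortedIndexSketch` and both method names are hypothetical. With one index per partition, a range query has to merge the sorted sub-ranges (here with a heap-based k-way merge); with a single structure, the range is read directly.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.TreeMap;

/** Hypothetical comparison of per-partition and single-structure sorted indexes. */
class SortedIndexSketch {
    /** Range query [from, to] over per-partition indexes; results need a k-way merge. */
    static List<Integer> rangePerPartition(List<TreeMap<Integer, String>> partitions, int from, int to) {
        List<Iterator<Integer>> iterators = new ArrayList<>();
        // Heap entries are {currentKey, partitionIndex}, ordered by currentKey.
        PriorityQueue<int[]> heap = new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));

        for (TreeMap<Integer, String> partition : partitions) {
            Iterator<Integer> it = partition.subMap(from, true, to, true).keySet().iterator();
            if (it.hasNext())
                heap.add(new int[] {it.next(), iterators.size()});
            iterators.add(it);
        }

        List<Integer> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] head = heap.poll();
            result.add(head[0]);
            Iterator<Integer> it = iterators.get(head[1]);
            if (it.hasNext())
                heap.add(new int[] {it.next(), head[1]});
        }
        return result;
    }

    /** The same query over one global index: a plain sub-map view, no merging. */
    static List<Integer> rangeSingle(TreeMap<Integer, String> index, int from, int to) {
        return new ArrayList<>(index.subMap(from, true, to, true).keySet());
    }
}
```

Both methods return the keys in ascending order; the per-partition variant pays an extra heap operation per returned key, which is the merge overhead the note refers to.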
All the indexes that are stored on disk will also be maintained in-memory. However, Apache Ignite will be able to perform all the cache operations, including SQL queries, based exclusively on the on-disk indexes, without having to load them in-memory. The in-memory indexes are only maintained for performance reasons and do not have any effect on the overall supported functionality.
Client node failures have no effect on the persisted state on servers.
If all primaries and their backups for some partitions have failed, the application must detect that those partitions are lost and block all queries until the failed nodes are restarted; otherwise query results (as well as cache.get, etc.) will be wrong.
If backups or primaries still exist, the failure of a node does not affect query result correctness. Since backup entries are stored in the Persistent Store as well as primaries, an affinity change to a new topology will not affect data consistency between stores on different nodes.
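The two cases above reduce to a simple check: a partition is lost only when none of its owners (primary or backups) is alive. The sketch below models this with plain collections; `PartitionLossCheck`, `lostPartitions`, and `ensureQueryable` are hypothetical names and do not represent Apache Ignite's actual partition-loss handling.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical check for partitions that have no surviving primary or backup. */
class PartitionLossCheck {
    /**
     * @param owners     partition id mapped to the nodes holding it (primary and backups)
     * @param aliveNodes nodes currently in the topology
     * @return ids of partitions for which every owner is down
     */
    static Set<Integer> lostPartitions(Map<Integer, List<String>> owners, Set<String> aliveNodes) {
        Set<Integer> lost = new TreeSet<>();
        for (Map.Entry<Integer, List<String>> e : owners.entrySet()) {
            boolean anyAlive = false;
            for (String node : e.getValue()) {
                if (aliveNodes.contains(node)) {
                    anyAlive = true;
                    break;
                }
            }
            if (!anyAlive)
                lost.add(e.getKey());
        }
        return lost;
    }

    /** Block queries (here: throw) while any partition is lost, since results would be incomplete. */
    static void ensureQueryable(Map<Integer, List<String>> owners, Set<String> aliveNodes) {
        Set<Integer> lost = lostPartitions(owners, aliveNodes);
        if (!lost.isEmpty())
            throw new IllegalStateException("Partitions lost, queries are blocked: " + lost);
    }
}
```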
On restart, Apache Ignite will determine if another server contains the latest file for a certain partition, and will copy that file before resuming operation.
It is possible that after all nodes agreed during the prepare phase, the commit phase will fail on some node. To handle such cases, nodes should persist the changes to the WAL during the prepare phase, but mark them as "uncommitted". Then, whenever the commit happens, the "uncommitted" flag should be flipped.
If, after a crash, certain transactions were left in the "uncommitted" state, then their persisted state should be ignored and the data should be loaded from other cluster nodes, primary or backups, which have the consistent state.
If, after restart, none of the nodes have a semi-failed transaction in the committed state, then the transaction should be rolled back and the "uncommitted" persisted state should be ignored.
NOTE: this protocol for transactional consistency is already implemented in Apache Ignite for in-memory transactions and will need to be expanded to support disk-based transactions.
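The recovery rules for transactions that were prepared but possibly not committed can be summarized in a small decision function. This is a sketch under stated assumptions, not the existing Apache Ignite protocol: `TxRecovery`, its `State` and `Decision` enums, and the idea of querying peers through `stateOf` are hypothetical simplifications.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-node record of transaction markers persisted in the WAL. */
class TxRecovery {
    enum State { UNCOMMITTED, COMMITTED }

    enum Decision { KEEP_LOCAL, LOAD_FROM_PEER, ROLL_BACK }

    private final Map<String, State> walTxStates = new HashMap<>();

    /** Prepare phase: changes are written to the WAL and flagged as uncommitted. */
    void onPrepare(String txId) {
        walTxStates.put(txId, State.UNCOMMITTED);
    }

    /** Commit phase: flip the flag of a previously prepared transaction. */
    void onCommit(String txId) {
        if (walTxStates.replace(txId, State.COMMITTED) == null)
            throw new IllegalStateException("Transaction was never prepared: " + txId);
    }

    /** State recorded on this node, or null if the transaction is unknown here. */
    State stateOf(String txId) {
        return walTxStates.get(txId);
    }

    /** Decide after a restart what to do with the locally persisted state of a transaction. */
    Decision decide(String txId, Collection<TxRecovery> peers) {
        if (stateOf(txId) == State.COMMITTED)
            return Decision.KEEP_LOCAL;
        // Local state is uncommitted (or missing): it is ignored in every remaining case.
        for (TxRecovery peer : peers) {
            if (peer.stateOf(txId) == State.COMMITTED)
                return Decision.LOAD_FROM_PEER;
        }
        return Decision.ROLL_BACK; // no node committed, so the transaction is rolled back
    }
}
```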
Whenever crashes or topology changes occur, an internal cluster protocol will need to determine which cluster node has the most up-to-date partition files for Persistent Store, sorted-indexes, and copy these files to the nodes which either have outdated partition data or are new and have no data at all. For performance reasons, we should investigate copying only the missing delta instead of the whole partition files.
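One possible way to pick the source node and size the delta is to compare a per-partition update counter across nodes. The sketch below assumes such a counter exists and increases monotonically with every applied update; `PartitionSync`, `freshest`, and `missingUpdates` are hypothetical names, and a counter of 0 stands for a new node with no data.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical selection of the most up-to-date node for one partition. */
class PartitionSync {
    /** Node with the highest update counter; ties are broken by node name for determinism. */
    static String freshest(Map<String, Long> countersByNode) {
        String best = null;
        for (Map.Entry<String, Long> e : new TreeMap<>(countersByNode).entrySet()) {
            if (best == null || e.getValue() > countersByNode.get(best))
                best = e.getKey();
        }
        return best;
    }

    /**
     * For every node that is behind the freshest one, the number of updates it is missing.
     * A node whose counter is 0 is missing everything, which amounts to a full copy.
     */
    static Map<String, Long> missingUpdates(Map<String, Long> countersByNode) {
        Map<String, Long> plan = new TreeMap<>();
        String source = freshest(countersByNode);
        if (source == null)
            return plan; // no nodes, nothing to copy
        long max = countersByNode.get(source);
        for (Map.Entry<String, Long> e : countersByNode.entrySet()) {
            if (e.getValue() < max)
                plan.put(e.getKey(), max - e.getValue());
        }
        return plan;
    }
}
```

Copying only `max - counter` updates is what makes the delta approach cheaper than shipping whole partition files, provided the source node can still produce those updates.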
The copy functionality itself should be provided by a pluggable interface and potentially have several implementations. For example, we can copy partition files using standard Java-based File APIs or native Linux