...
Reuse the current index layout and simply treat the deletion vectors as a new index file type.
2.2. Deletion vectors index file encoding
Like the hash index, there is one deletion vector index per bucket. A deletion vector index file therefore needs to contain the bitmaps of multiple files in the same bucket, so its structure is essentially a map<fileName, bitmap>. To support high-performance reads, we designed the following file encoding to store it:
...
{
"org.apache.paimon.avro.generated.record": {
"_VERSION": 1,
"_KIND": 0,
"_PARTITION": "\u0000\u0000\u0000\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000p\u0000\u0000\u0000\u0000\u0000\u0000",
"_BUCKET": 0,
"_TYPE": "DELETION_VECTORS",
"_FILE_NAME": "index-32f16270-5a81-4e5e-9f93-0e096b8b58d3-0",
"_FILE_SIZE": x,
"_ROW_COUNT": count of the map,
"_DELETE_INDEX_RANGES": "binary Map<String, Pair<Integer, Integer>>", where the key is the fileName and the value is <start offset of the serialized bitmap in the index file, size of the serialized bitmap>
}
}
...
- First, record a constant magic number as an int.
- Then, record the serialized bitmap.
e.g.:
...
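As a sketch of this encoding, the snippet below packs the bitmaps of one bucket into a single byte stream and records a <start offset, size> range per data file, which is what _DELETE_INDEX_RANGES would store. The class name, method names, and the magic-number value are hypothetical, and java.util.BitSet stands in for the real bitmap implementation to keep the sketch dependency-free:

```java
import java.io.ByteArrayOutputStream;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

public class DeletionVectorsEncodingSketch {
    // Hypothetical value; the proposal only requires "a const magic number".
    static final int MAGIC_NUMBER = 0x50414D44;

    static void writeInt(ByteArrayOutputStream out, int v) {
        out.write((v >>> 24) & 0xFF);
        out.write((v >>> 16) & 0xFF);
        out.write((v >>> 8) & 0xFF);
        out.write(v & 0xFF);
    }

    // Serialize every bitmap of the bucket into one index file body and
    // return, per data file, the <start offset, size> pair of its bitmap.
    public static Map<String, int[]> encode(Map<String, BitSet> bitmaps, ByteArrayOutputStream out) {
        Map<String, int[]> ranges = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> e : bitmaps.entrySet()) {
            int start = out.size();
            writeInt(out, MAGIC_NUMBER);                // 1. constant magic number (int)
            byte[] bitmap = e.getValue().toByteArray(); // 2. serialized bitmap
            out.write(bitmap, 0, bitmap.length);
            ranges.put(e.getKey(), new int[] {start, out.size() - start});
        }
        return ranges;
    }
}
```

A reader can then seek directly to a file's bitmap using its recorded range, without scanning the whole index file.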
3. Write
3.1. Overview
Referring to the existing lookup mechanism, we design a deleteFile generation mechanism based on compaction + lookup:
1. New data is written to the level-0 layer.
2. Perform a compaction with each write and force-merge the level-0 layer data, see ForceUpLevel0Compaction.
3. Implement a merge like LookupDeleteFileMergeFunctionWrapper, which has the following characteristics:
   - a. When records do not belong to the level-0 layer, no deletion is generated.
   - b. When records belong to the level-0 + level-x layers, no deletion is generated.
   - c. When records belong only to the level-0 layer, look up the other layers and update the map<fileName, bitmap>.
4. After the compaction finishes, the bitmaps of the before files in this merge are no longer useful and are deleted from the map<fileName, bitmap>.
5. Finally, write the new deleteFile and mark the old deleteFile as removed.
6. For full compaction, an optimization can be made: directly clear the map<fileName, bitmap>.
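The level-based decision in steps 3a to 3c can be sketched as follows. All names here are hypothetical; the real logic would live inside the lookup merge wrapper:

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class DeletionDecisionSketch {
    // Per-data-file deletion bitmaps accumulated during the merge.
    final Map<String, BitSet> deletionVectors = new HashMap<>();

    // 'levels' = the LSM levels in which this key appears during the merge.
    // Returns true only in case c, where a lookup into older layers is needed.
    boolean needsLookup(Set<Integer> levels) {
        if (!levels.contains(0)) {
            return false; // case a: record not in level-0, no deletion generated
        }
        if (levels.size() > 1) {
            return false; // case b: level-0 + level-x, no deletion generated
        }
        return true;      // case c: only level-0, look up the other layers
    }

    // When the lookup finds the key in an older data file, mark that row
    // deleted in that file's bitmap.
    void markDeleted(String fileName, int position) {
        deletionVectors.computeIfAbsent(fileName, k -> new BitSet()).set(position);
    }
}
```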
Example:
Assume the LSM has a total of four layers, the initial stage is as follows (left to right are: file content, LSM tree, delete file):
Then a new file, f7, is added to level-0. Suppose compaction picks the level-0 layer and the level-2 layer; the following changes occur:
- 1 belongs only to the level-0 layer, so we look up the old data and find that f1 also contains 1; f1's bitmap is therefore modified to add 1.
- 2 and 9 belong to both the level-0 and level-2 layers, so there is no need to modify any bitmap.
- The bitmap of f6 can be removed because f6 has been compacted. f5, f6, and f7 are marked as REMOVE, and the old delete file is marked as REMOVE.
Finally, assuming that compaction has generated f8 and f9, the final result is as follows: f8 and f9 are marked as ADD, and the new delete file is marked as ADD.
3.2. Implementation
Considerations for implementation:
- Currently, when 'changelog-producer' = 'lookup' is set, the data write is not atomic but split into two steps: first, data is written to create snapshot1; then lookup compaction generates snapshot2. We need to consider the atomicity of this.
- In most cases, data goes to level-0 first and is then rewritten. The write overhead is somewhat high, and perhaps some optimization can be done here.
- If a changelog needs to be generated, in theory the changelog and the delete file can be produced simultaneously (without reading twice).
- The merge engine is still available.
4. Read
4.1. Overview
- For each read task, load the corresponding deleteFile.
- Construct the map<fileName, bitmap> from the deleteFile.
- Get the bitmap based on the file name, then pass it to the reader.
5. Maintenance
5.1. compaction
We can incorporate bitmap evaluation into the compaction pick: for example, when the proportion of deleted rows in a file reaches a threshold such as 50%, we pick it for compaction.
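A minimal sketch of such a pick heuristic, assuming the row count is available from the file meta and the deleted-row count from the cardinality of the file's deletion vector (all names hypothetical):

```java
public class DeletionRatioPickSketch {
    // Pick a file for compaction once its deleted-row ratio reaches the
    // threshold (e.g. 0.5). Empty files are never picked.
    static boolean shouldPick(long rowCount, long deletedRows, double threshold) {
        return rowCount > 0 && (double) deletedRows / rowCount >= threshold;
    }
}
```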
5.2. expire
Determine whether to delete based on the `delete` and `add` records in the deleteFileManifest.
6. Other considerations
- Impact on file meta: Currently, the stats (min, max, null count) in file meta are already unreliable, so no special handling will be performed for this aspect.
- ...
Public Interfaces
How to use
A new conf:
- deletion-vectors.enabled: controls whether to enable deletion vectors mode, i.e. write the deletion vectors index and read with it so that no merge is needed.
Limitations:
- The first version only supports `file.format` = `orc` or `parquet`
- Only supports `changelog-producer` = `none` or `lookup`
Classes
Abstract an interface DeletionVector to represent a deletion vector, and provide a BitmapDeletionVector implementation based on RoaringBitmap:
public interface DeletionVector {
    void delete(long position);
    boolean isDeleted(long position);
    boolean isEmpty();
    byte[] serializeToBytes();
    DeletionVector deserializeFromBytes(byte[] bytes);
}
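For illustration, a minimal DeletionVector implementation could look like the following. The proposal's BitmapDeletionVector is backed by RoaringBitmap, but java.util.BitSet is substituted here only to keep the sketch dependency-free, and deserializeFromBytes is made static for convenience:

```java
import java.util.BitSet;

public class BitSetDeletionVectorSketch {
    private final BitSet bitmap;

    public BitSetDeletionVectorSketch() {
        this.bitmap = new BitSet();
    }

    private BitSetDeletionVectorSketch(BitSet bitmap) {
        this.bitmap = bitmap;
    }

    // Mark the row at 'position' as deleted.
    public void delete(long position) {
        bitmap.set(Math.toIntExact(position));
    }

    public boolean isDeleted(long position) {
        return bitmap.get(Math.toIntExact(position));
    }

    public boolean isEmpty() {
        return bitmap.isEmpty();
    }

    // Round-trip the bitmap through bytes, as the index file requires.
    public byte[] serializeToBytes() {
        return bitmap.toByteArray();
    }

    public static BitSetDeletionVectorSketch deserializeFromBytes(byte[] bytes) {
        return new BitSetDeletionVectorSketch(BitSet.valueOf(bytes));
    }
}
```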
Add a DeletionVectorsIndexFile to read, write and delete deletion vectors:
public class DeletionVectorsIndexFile {
    public long fileSize(String fileName);
    public Map<String, DeletionVector> readAllDeletionVectors(String fileName, Map<String, Pair<Integer, Integer>> deletionVectorRanges);
    public DeletionVector readDeletionVector(String fileName, Pair<Integer, Integer> deletionVectorRange);
    public Pair<String, Map<String, Pair<Integer, Integer>>> write(Map<String, DeletionVector> input);
    public void delete(String fileName);
}
Extend the current IndexMaintainer
interface, and create DeletionVectorsIndexMaintainer implements IndexMaintainer<KeyValue, DeletionVectorIndex>
public interface IndexMaintainer<T, U> {
    void notifyNewRecord(T record);

    List<IndexFileMeta> prepareCommit();

    /* (new) delete file's index */
    default void delete(String fileName) {
        throw new UnsupportedOperationException();
    }

    /* (new) get file's index */
    default Optional<U> indexOf(String fileName) {
        throw new UnsupportedOperationException();
    }

    /** Factory to restore {@link IndexMaintainer}. */
    interface Factory<T, U> {
        IndexMaintainer<T, U> createOrRestore(
                @Nullable Long snapshotId, BinaryRow partition, int bucket);
    }
}
Add an ApplyDeletionVectorReader implements RecordReader<KeyValue> to read with a DeletionVector:
public class ApplyDeletionVectorReader implements RecordReader<KeyValue> {

    public ApplyDeletionVectorReader(RecordReader<KeyValue> reader, DeletionVector deletionVector) {
        this.reader = reader;
        this.deletionVector = deletionVector;
    }

    @Nullable
    @Override
    public RecordIterator<KeyValue> readBatch() throws IOException {
        RecordIterator<KeyValue> batch = reader.readBatch();
        if (batch == null) {
            return null;
        }
        return new RecordIterator<KeyValue>() {
            @Override
            public KeyValue next() throws IOException {
                while (true) {
                    KeyValue kv = batch.next();
                    if (kv == null) {
                        return null;
                    }
                    if (!deletionVector.isDeleted(kv.position())) {
                        return kv;
                    }
                }
            }
            ...
        };
    }
    ...
}
Compatibility, Deprecation, and Migration Plan
New conf
...
...
Conversion between delete file mode and original mode
- original mode -> delete file mode: can be switched directly by setting `deletion-vectors.enabled` = `true`.
- delete file mode -> original mode: in theory, perform a full compaction, clean up the old snapshots, and then set `deletion-vectors.enabled` = `false`.
[1]: https://github.com/apache/iceberg
...