...
Here is an example; suppose we have deleted some lines of the files:
When writing new deletions, write them directly into a new delete file.
Suppose we need to delete the 4th line of data_file1 and the 6th line of data_file5:
Advantages:
- Simple storage, can be directly written into formats like Parquet, Orc, Avro, etc.
...
Here is an example (bin {3} means a bitmap, in binary, to which row 3 has been added):
When writing new deletions, a new delete file is created and the old one is marked as removed at the same time:
Advantages:
- Simple implementation, similar to Paimon's existing index (one index file per bucket), just adding a delete file manifest.
- The number of delete files is stable and corresponds to the number of buckets.
- The logic for cleaning up delete files is simple: just clean up the older ones.
...
Therefore, considering both implementation and performance aspects, Approach 2 is ultimately chosen.
2. Protocol design
2.1. Layout
Update: reuse the current index layout and treat deletemap as a new index file type
├── deleteFile
│   ├── delete-32f16270-5a81-4e5e-9f93-0e096b8b58d3-0
│   └── delete-32f16270-5a81-4e5e-9f93-0e096b8b58d3-1
├── manifest
│ ├── delete-manifest-40d555d3-dc04-4363-a2b3-3860829c427b-0
│ ├── delete-manifest-4c3b5202-986f-4ea3-aa8e-941531e451ac-0
│ ├── manifest-5aeccf75-6dfa-4619-9df0-520488aabe76-0
│ ├── manifest-e93ed16f-91b7-47fe-919d-e8b518992ed3-0
│ ├── manifest-list-8a724129-758d-478c-989c-1b2538e76d44-0
│ └── manifest-list-8a724129-758d-478c-989c-1b2538e76d44-1
├── pt=p
│ ├── bucket-0
│ │ └── data-78c9ef18-7c31-4cf1-8cd0-4772cc0a1678-0.orc
│ ├── bucket-1
│ │ └── data-1ce952bf-8514-49ce-9a55-1b70ed509d78-0.orc
├── schema
│ └── schema-0
└── snapshot
├── EARLIEST
├── LATEST
├── snapshot-1
└── snapshot-2
snapshot-x
{
...
"deleteFileManifest" : "delete-manifest-40d555d3-dc04-4363-a2b3-3860829c427b-0",
}
delete-manifest-xxx (avro)
{
"org.apache.paimon.avro.generated.record": {
"_VERSION": 1,
"_KIND": 0,
"_PARTITION": "\u0000\u0000\u0000\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000p\u0000\u0000\u0000\u0000\u0000\u0000",
"_BUCKET": 0,
"_TYPE": "RoaringBitmap",
"_FILE_NAME": "delete-32f16270-5a81-4e5e-9f93-0e096b8b58d3-0",
"_FILE_SIZE": x,
"_ROW_COUNT": x
}
}
delete-xxx
The tentative format is Avro; JSON is shown here as an example:
{
"fileName1": the serialized bitmap1,
"fileName2": the serialized bitmap2,
...
}
2.2. Implementation
It essentially follows the same logic as the existing Paimon index.
The index manifest entry records the deleteMap as a new index file type:
{
"org.apache.paimon.avro.generated.record": {
"_VERSION": 1,
"_KIND": 0,
"_PARTITION": "\u0000\u0000\u0000\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000p\u0000\u0000\u0000\u0000\u0000\u0000",
"_BUCKET": 0,
"_TYPE": "DELETE_MAP",
"_FILE_NAME": "index-32f16270-5a81-4e5e-9f93-0e096b8b58d3-0",
"_FILE_SIZE": x,
"_ROW_COUNT": x
}
}
2.2. DeleteMap index file encoding
Like the hash index, there is one deleteMap index per bucket. Therefore, a deleteMap index file needs to contain the bitmaps of multiple files in the same bucket; its structure is effectively a map<fileName, bitmap>. To support high-performance reads, we designed the following file encoding to store this map:
- First, record the size of the map as an int.
- Then, record the maximum fileName length as an int (the lengths of the file names may differ).
- Next, record <fileName (padded to the max length), starting position of the serialized bitmap, size of the bitmap> entries in sequence.
- Finally, record the <serialized bitmap> payloads in sequence.
e.g.:
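As a sketch of the encoding above (with plain byte arrays standing in for serialized RoaringBitmaps, and the class name DeleteMapEncoding being illustrative, not Paimon's actual API), the layout could be written with a DataOutputStream:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Map;

// Sketch of the deleteMap index file encoding described above.
public class DeleteMapEncoding {

    public static byte[] encode(Map<String, byte[]> bitmaps) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);

        // 1. size of the map
        out.writeInt(bitmaps.size());

        // 2. max fileName length (file names may have different lengths)
        int maxLen = 0;
        for (String name : bitmaps.keySet()) {
            maxLen = Math.max(maxLen, name.getBytes(StandardCharsets.UTF_8).length);
        }
        out.writeInt(maxLen);

        // 3. index entries: <padded fileName, start pos of bitmap, bitmap size>;
        //    the first bitmap starts right after the header and all entries.
        int pos = 4 + 4 + bitmaps.size() * (maxLen + 4 + 4);
        for (Map.Entry<String, byte[]> e : bitmaps.entrySet()) {
            byte[] name = e.getKey().getBytes(StandardCharsets.UTF_8);
            out.write(Arrays.copyOf(name, maxLen)); // zero-padded to maxLen
            out.writeInt(pos);                      // starting position of the bitmap
            out.writeInt(e.getValue().length);      // size of the bitmap
            pos += e.getValue().length;
        }

        // 4. the serialized bitmaps, in the same order
        for (byte[] bitmap : bitmaps.values()) {
            out.write(bitmap);
        }
        return bos.toByteArray();
    }
}
```

Because the fixed-width index entries come before the payloads, a reader can locate one file's bitmap without deserializing the others.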
2.3. Classes
BitmapDeleteIndex implements DeleteIndex based on RoaringBitmap. The DeleteIndex interface:
public interface DeleteIndex {
void delete(long position);
boolean isDeleted(long position);
boolean isEmpty();
byte[] serializeToBytes();
DeleteIndex deserializeFromBytes(byte[] bytes);
}
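A dependency-free sketch of BitmapDeleteIndex against this interface, using java.util.BitSet as a stand-in for RoaringBitmap (the real implementation would wrap org.roaringbitmap.RoaringBitmap; this is illustrative, not Paimon's actual code):

```java
import java.util.BitSet;

// Sketch of BitmapDeleteIndex implementing the DeleteIndex contract above.
// java.util.BitSet stands in for RoaringBitmap; BitSet is int-indexed, so a
// 64-bit bitmap (e.g. Roaring64Bitmap) would be needed for very large files.
public class BitmapDeleteIndex /* implements DeleteIndex */ {

    private final BitSet bitmap;

    public BitmapDeleteIndex() {
        this(new BitSet());
    }

    private BitmapDeleteIndex(BitSet bitmap) {
        this.bitmap = bitmap;
    }

    public void delete(long position) {
        bitmap.set(Math.toIntExact(position));
    }

    public boolean isDeleted(long position) {
        return bitmap.get(Math.toIntExact(position));
    }

    public boolean isEmpty() {
        return bitmap.isEmpty();
    }

    public byte[] serializeToBytes() {
        return bitmap.toByteArray();
    }

    public BitmapDeleteIndex deserializeFromBytes(byte[] bytes) {
        return new BitmapDeleteIndex(BitSet.valueOf(bytes));
    }
}
```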
DeleteMapIndexFile manages reading and writing these index files:
public class DeleteMapIndexFile {
public long fileSize(String fileName);
public Map<String, long[]> readDeleteIndexBytesOffsets(String fileName);
public Map<String, DeleteIndex> readAllDeleteIndex(String fileName, Map<String, long[]> deleteIndexBytesOffsets);
public DeleteIndex readDeleteIndex(String fileName, long[] deleteIndexBytesOffset);
public String write(Map<String, DeleteIndex> input);
public void delete(String fileName);
}
Extend the IndexMaintainer interface, and add DeleteMapIndexMaintainer implements IndexMaintainer<KeyValue, DeleteIndex>:
public interface IndexMaintainer<T, U> {
void notifyNewRecord(T record);
List<IndexFileMeta> prepareCommit();
/* (new) delete file's index */
default void delete(String fileName) {
throw new UnsupportedOperationException();
}
/* (new) get file's index */
default Optional<U> indexOf(String fileName) {
throw new UnsupportedOperationException();
}
/** Factory to restore {@link IndexMaintainer}. */
interface Factory<T, U> {
IndexMaintainer<T, U> createOrRestore(
@Nullable Long snapshotId, BinaryRow partition, int bucket);
}
}
3. Write
3.1. Overview
Referring to the existing lookup mechanism, we design a deleteFile generation mechanism based on compaction + lookup:
1. New data is written to the level-0 layer.
2. Perform a compaction with each write and force-merge the level-0 layer data (see ForceUpLevel0Compaction).
3. Implement a merge function like LookupDeleteFileMergeFunctionWrapper, with the following characteristics:
   - a. When records do not belong to the level-0 layer, no deletion is generated.
   - b. When records belong to the level-0 + level-x layers, no deletion is generated.
   - c. When records belong only to the level-0 layer, look up the other layers and update the map<fileName, bitmap>.
4. After the compaction finishes, the bitmaps of the "before" files in this merge are no longer useful and are deleted from the map<fileName, bitmap>.
5. Finally, write the new deleteFile and mark the old deleteFile as removed.
6. For a full compaction, an optimization can be made: directly clear the map<fileName, bitmap>.
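The bitmap bookkeeping in steps 3c and 4 can be sketched as a toy model (the class and method names are illustrative, and a set of deleted row positions stands in for the per-file bitmap):

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy sketch of the compaction-time bookkeeping from the steps above.
// "deleteMap" plays the role of map<fileName, bitmap>.
public class DeleteMapBookkeeping {

    // Step 3c: a record seen only in level-0 shadows an older row, so the
    // older file's bitmap gains the position of that row.
    public static void markDeleted(
            Map<String, Set<Long>> deleteMap, String oldFile, long position) {
        deleteMap.computeIfAbsent(oldFile, f -> new HashSet<>()).add(position);
    }

    // Step 4: after compaction, bitmaps of the rewritten ("before") files
    // are useless and are dropped from the map.
    public static void dropCompacted(
            Map<String, Set<Long>> deleteMap, Collection<String> compactedFiles) {
        deleteMap.keySet().removeAll(compactedFiles);
    }
}
```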
Example:
Assume the LSM has four levels in total; the initial stage is as follows (left to right: file content, LSM tree, delete file):
Then a new file f7 is added to level-0. Suppose compaction picks the level-0 and level-2 layers; the following changes will occur:
- 1 belongs only to the level-0 layer, so it needs to look up old data; it finds that f1 also contains 1, so f1's bitmap is modified to add 1.
- 2 and 9 belong to both the level-0 and level-2 layers, so there is no need to modify the bitmap.
- The bitmap of f6 can be removed because f6 has been compacted. f5, f6, and f7 are marked as REMOVE, and the old delete file is marked as REMOVE.
Finally, assuming that compaction has generated f8 and f9, the final result is as follows: f8 and f9 are marked as ADD, and the new delete file is marked as ADD.
3.2. Implementation
Considerations for implementation:
- Currently, when 'changelog-producer' = 'lookup' is set, the data write behavior is not atomic but is divided into two steps: first, data is written to create snapshot1; then lookup compaction generates snapshot2. We need to consider the atomicity of this.
- In most cases, data is written to level-0 first and then rewritten, so the write overhead is somewhat high; perhaps some optimization can be done in this regard.
- If a changelog needs to be generated, in theory the changelog and the delete file can be produced simultaneously (without reading twice).
- The merge engine is still available.
4. Read
4.1. Overview
- For each read task, load the corresponding deleteFile.
- Construct a map<fileName, bitmap>.
- Get the bitmap based on the file name, then pass it to the reader.
4.2. Classes
public class ApplyDeleteIndexReader implements RecordReader<KeyValue> {
    private final RecordReader<KeyValue> reader;
    private final DeleteIndex deleteIndex;
public ApplyDeleteIndexReader(RecordReader<KeyValue> reader, DeleteIndex deleteIndex) {
this.reader = reader;
this.deleteIndex = deleteIndex;
}
@Nullable
@Override
public RecordIterator<KeyValue> readBatch() throws IOException {
RecordIterator<KeyValue> batch = reader.readBatch();
if (batch == null) {
return null;
}
return new RecordIterator<KeyValue>() {
@Override
public KeyValue next() throws IOException {
while (true) {
KeyValue kv = batch.next();
if (kv == null) {
return null;
}
if (!deleteIndex.isDeleted(kv.position())) {
return kv;
}
}
}
};
}
...
}
...
5. Maintenance
5.1. Compaction
...
- Impact on file meta: Currently, the stats (min, max, null count) in file meta are already unreliable, so no special handling will be performed for this aspect.
- ...
Compatibility, Deprecation, and Migration Plan
New conf
delete-map.enabled: controls whether to enable delete map mode. Limitations:
- The first version only supports file.format = parquet; more formats will be supported in the future.
- Only changelog-producer = none or lookup is supported.
Migration
- Conversion between delete file mode (just a temporary name) and the original LSM:
  - LSM -> delete file mode: can be switched directly (just set the parameter that controls whether delete file mode is enabled).
  - Delete file mode -> LSM: in theory, perform a full compaction, then clean up the old snapshots, and then disable delete file mode.
...