...
Here is an example; suppose we have deleted some lines of the files:
When writing new deletions, write them directly into a new delete file.
Suppose we need to delete the 4th line of data_file1 and the 6th line of data_file5:
Advantages:
- Simple storage, can be directly written into formats like Parquet, Orc, Avro, etc.
...
Here is an example (bin {3} means a bitmap, in binary, to which row 3 has been added):
When writing new deletions, a new delete file is created and the old one is marked as removed at the same time:
Advantages:
- Simple implementation, similar to Paimon's existing index (one index file per bucket), just adding a delete file manifest.
- The number of delete files is stable and corresponds to the number of buckets.
- The logic for cleaning up delete files is simple: just clean up the older ones.
...
Therefore, considering both implementation and performance aspects, Approach 2 is ultimately chosen.
2. Protocol design
2.1. Layout
Update: reuse the current index layout and treat deletemap as a new index file type
├── deleteFile
│   ├── delete-32f16270-5a81-4e5e-9f93-0e096b8b58d3-0
│   └── delete-32f16270-5a81-4e5e-9f93-0e096b8b58d3-1
├── manifest
│ ├── delete-manifest-40d555d3-dc04-4363-a2b3-3860829c427b-0
│ ├── delete-manifest-4c3b5202-986f-4ea3-aa8e-941531e451ac-0
│ ├── manifest-5aeccf75-6dfa-4619-9df0-520488aabe76-0
│ ├── manifest-e93ed16f-91b7-47fe-919d-e8b518992ed3-0
│ ├── manifest-list-8a724129-758d-478c-989c-1b2538e76d44-0
│ └── manifest-list-8a724129-758d-478c-989c-1b2538e76d44-1
├── pt=p
│ ├── bucket-0
│ │ └── data-78c9ef18-7c31-4cf1-8cd0-4772cc0a1678-0.orc
│ ├── bucket-1
│ │ └── data-1ce952bf-8514-49ce-9a55-1b70ed509d78-0.orc
├── schema
│ └── schema-0
└── snapshot
├── EARLIEST
├── LATEST
├── snapshot-1
└── snapshot-2
snapshot-x
{
...
"deleteFileManifest" : "delete-manifest-40d555d3-dc04-4363-a2b3-3860829c427b-0",
}
delete-manifest-xxx (avro)
{
"org.apache.paimon.avro.generated.record": {
"_VERSION": 1,
"_KIND": 0,
"_PARTITION": "\u0000\u0000\u0000\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000p\u0000\u0000\u0000\u0000\u0000\u0000",
"_BUCKET": 0,
"_TYPE": "RoaringBitmap",
"_FILE_NAME": "delete-32f16270-5a81-4e5e-9f93-0e096b8b58d3-0",
"_FILE_SIZE": x,
"_ROW_COUNT": x
}
}
delete-xxx
The tentative format is Avro; JSON is shown here as an example:
{
"fileName1": the serialized bitmap1,
"fileName2": the serialized bitmap2,
...
}
2.2. Implementation
It essentially follows the same logic as the existing Paimon index.
The index manifest entry records the deleteMap as a new index file type:
{
"org.apache.paimon.avro.generated.record": {
"_VERSION": 1,
"_KIND": 0,
"_PARTITION": "\u0000\u0000\u0000\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000p\u0000\u0000\u0000\u0000\u0000\u0000",
"_BUCKET": 0,
"_TYPE": "DELETE_MAP",
"_FILE_NAME": "index-32f16270-5a81-4e5e-9f93-0e096b8b58d3-0",
"_FILE_SIZE": x,
"_ROW_COUNT": x
}
}
2.2. DeleteMap index file encoding
Like the hash index, there is one deleteMap index per bucket. Therefore, a deleteMap index file needs to contain the bitmaps of multiple files in the same bucket; its structure is effectively a map<fileName, bitmap>. To support high-performance reads, we designed the following file encoding to store this map:
- First, record the size of the map as an int.
- Then, record the maximum fileName length as an int (the lengths of the file names may differ).
- Next, record <fileName (padded to the max length), starting position of the serialized bitmap, size of the bitmap> entries in sequence.
- Finally, record the <serialized bitmap> payloads in sequence.
e.g.:
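As a sketch of the encoding above (with plain byte arrays standing in for serialized RoaringBitmaps, and the class name DeleteMapEncoding being illustrative, not Paimon's actual API), the layout could be written with a DataOutputStream:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Map;

// Sketch of the deleteMap index file encoding described above.
public class DeleteMapEncoding {

    public static byte[] encode(Map<String, byte[]> bitmaps) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);

        // 1. size of the map
        out.writeInt(bitmaps.size());

        // 2. max fileName length (file names may have different lengths)
        int maxLen = 0;
        for (String name : bitmaps.keySet()) {
            maxLen = Math.max(maxLen, name.getBytes(StandardCharsets.UTF_8).length);
        }
        out.writeInt(maxLen);

        // 3. index entries: <padded fileName, start pos of bitmap, bitmap size>;
        //    the first bitmap starts right after the header and all entries.
        int pos = 4 + 4 + bitmaps.size() * (maxLen + 4 + 4);
        for (Map.Entry<String, byte[]> e : bitmaps.entrySet()) {
            byte[] name = e.getKey().getBytes(StandardCharsets.UTF_8);
            out.write(Arrays.copyOf(name, maxLen)); // zero-padded to maxLen
            out.writeInt(pos);                      // starting position of the bitmap
            out.writeInt(e.getValue().length);      // size of the bitmap
            pos += e.getValue().length;
        }

        // 4. the serialized bitmaps, in the same order
        for (byte[] bitmap : bitmaps.values()) {
            out.write(bitmap);
        }
        return bos.toByteArray();
    }
}
```

Because the fixed-width index entries come before the payloads, a reader can locate one file's bitmap without deserializing the others.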
2.3. Classes
BitmapDeleteIndex implements DeleteIndex based on RoaringBitmap. The DeleteIndex interface:
public interface DeleteIndex {
void delete(long position);
boolean isDeleted(long position);
boolean isEmpty();
byte[] serializeToBytes();
DeleteIndex deserializeFromBytes(byte[] bytes);
}
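A dependency-free sketch of BitmapDeleteIndex against this interface, using java.util.BitSet as a stand-in for RoaringBitmap (the real implementation would wrap org.roaringbitmap.RoaringBitmap; this is illustrative, not Paimon's actual code):

```java
import java.util.BitSet;

// Sketch of BitmapDeleteIndex implementing the DeleteIndex contract above.
// java.util.BitSet stands in for RoaringBitmap; BitSet is int-indexed, so a
// 64-bit bitmap (e.g. Roaring64Bitmap) would be needed for very large files.
public class BitmapDeleteIndex /* implements DeleteIndex */ {

    private final BitSet bitmap;

    public BitmapDeleteIndex() {
        this(new BitSet());
    }

    private BitmapDeleteIndex(BitSet bitmap) {
        this.bitmap = bitmap;
    }

    public void delete(long position) {
        bitmap.set(Math.toIntExact(position));
    }

    public boolean isDeleted(long position) {
        return bitmap.get(Math.toIntExact(position));
    }

    public boolean isEmpty() {
        return bitmap.isEmpty();
    }

    public byte[] serializeToBytes() {
        return bitmap.toByteArray();
    }

    public BitmapDeleteIndex deserializeFromBytes(byte[] bytes) {
        return new BitmapDeleteIndex(BitSet.valueOf(bytes));
    }
}
```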
DeleteMapIndexFile manages reading and writing these index files:
public class DeleteMapIndexFile {
public long fileSize(String fileName);
public Map<String, long[]> readDeleteIndexBytesOffsets(String fileName);
public Map<String, DeleteIndex> readAllDeleteIndex(String fileName, Map<String, long[]> deleteIndexBytesOffsets);
public DeleteIndex readDeleteIndex(String fileName, long[] deleteIndexBytesOffset);
public String write(Map<String, DeleteIndex> input);
public void delete(String fileName);
}
Extend the IndexMaintainer interface, and add DeleteMapIndexMaintainer implements IndexMaintainer<KeyValue, DeleteIndex>:
public interface IndexMaintainer<T, U> {
void notifyNewRecord(T record);
List<IndexFileMeta> prepareCommit();
/* (new) delete file's index */
default void delete(String fileName) {
throw new UnsupportedOperationException();
}
/* (new) get file's index */
default Optional<U> indexOf(String fileName) {
throw new UnsupportedOperationException();
}
/** Factory to restore {@link IndexMaintainer}. */
interface Factory<T, U> {
IndexMaintainer<T, U> createOrRestore(
@Nullable Long snapshotId, BinaryRow partition, int bucket);
}
}
3. Write
3.1. Overview
Referring to the existing lookup mechanism, we design a deleteFile generation mechanism based on compaction + lookup:
1. New data is written to the level-0 layer.
2. Perform a compaction with each write and force-merge the level-0 layer data (see ForceUpLevel0Compaction).
3. Implement a merge function like LookupDeleteFileMergeFunctionWrapper, with the following characteristics:
   - a. When records do not belong to the level-0 layer, no deletion is generated.
   - b. When records belong to the level-0 + level-x layers, no deletion is generated.
   - c. When records belong only to the level-0 layer, look up the other layers and update the map<fileName, bitmap>.
4. After the compaction finishes, the bitmaps of the "before" files in this merge are no longer useful and are deleted from the map<fileName, bitmap>.
5. Finally, write the new deleteFile and mark the old deleteFile as removed.
6. For a full compaction, an optimization can be made: directly clear the map<fileName, bitmap>.
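The bitmap bookkeeping in steps 3c and 4 can be sketched as a toy model (the class and method names are illustrative, and a set of deleted row positions stands in for the per-file bitmap):

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy sketch of the compaction-time bookkeeping from the steps above.
// "deleteMap" plays the role of map<fileName, bitmap>.
public class DeleteMapBookkeeping {

    // Step 3c: a record seen only in level-0 shadows an older row, so the
    // older file's bitmap gains the position of that row.
    public static void markDeleted(
            Map<String, Set<Long>> deleteMap, String oldFile, long position) {
        deleteMap.computeIfAbsent(oldFile, f -> new HashSet<>()).add(position);
    }

    // Step 4: after compaction, bitmaps of the rewritten ("before") files
    // are useless and are dropped from the map.
    public static void dropCompacted(
            Map<String, Set<Long>> deleteMap, Collection<String> compactedFiles) {
        deleteMap.keySet().removeAll(compactedFiles);
    }
}
```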
Example:
Assume the LSM has four levels in total; the initial stage is as follows (left to right: file content, LSM tree, delete file):
Then a new file f7 is added to level-0. Suppose compaction picks the level-0 and level-2 layers; the following changes will occur:
- 1 belongs only to the level-0 layer, so it needs to look up old data; it finds that f1 also contains 1, so f1's bitmap is modified to add 1.
- 2 and 9 belong to both the level-0 and level-2 layers, so there is no need to modify the bitmap.
- The bitmap of f6 can be removed because f6 has been compacted. f5, f6, and f7 are marked as REMOVE, and the old delete file is marked as REMOVE.
Finally, assuming that compaction has generated f8 and f9, the final result is as follows: f8 and f9 are marked as ADD, and the new delete file is marked as ADD.
3.2. Implementation
Considerations for implementation:
- Currently, when 'changelog-producer' = 'lookup' is set, the data write behavior is not atomic but is divided into two steps: first, data is written to create snapshot1; then lookup compaction generates snapshot2. We need to consider the atomicity of this.
- In most cases, data is written to level-0 first and then rewritten, so the write overhead is somewhat high; perhaps some optimization can be done in this regard.
- If a changelog needs to be generated, in theory the changelog and the delete file can be produced simultaneously (without reading twice).
- The merge engine is still available.
4. Read
4.1. Overview
- For each read task, load the corresponding deleteFile.
- Construct a map<fileName, bitmap>.
- Get the bitmap based on the file name, then pass it to the reader.
4.2. Classes
public class ApplyDeleteIndexReader implements RecordReader<KeyValue> {
    private final RecordReader<KeyValue> reader;
    private final DeleteIndex deleteIndex;
public ApplyDeleteIndexReader(RecordReader<KeyValue> reader, DeleteIndex deleteIndex) {
this.reader = reader;
this.deleteIndex = deleteIndex;
}
@Nullable
@Override
public RecordIterator<KeyValue> readBatch() throws IOException {
RecordIterator<KeyValue> batch = reader.readBatch();
if (batch == null) {
return null;
}
return new RecordIterator<KeyValue>() {
@Override
public KeyValue next() throws IOException {
while (true) {
KeyValue kv = batch.next();
if (kv == null) {
return null;
}
if (!deleteIndex.isDeleted(kv.position())) {
return kv;
}
}
}
};
}
...
}
...
5. Maintenance
5.1. Compaction
...
- Impact on file meta: Currently, the stats (min, max, null count) in file meta are already unreliable, so no special handling will be performed for this aspect.
- ...
Compatibility, Deprecation, and Migration Plan
New conf
delete-map.enabled: controls whether to enable delete map mode. Limitations:
- The first version only supports file.format = parquet; more formats will be supported in the future.
- Only changelog-producer = none or lookup is supported.
Migration
- Conversion between delete file mode (just a temporary name) and the original LSM:
  - LSM -> delete file mode: can be switched directly (just set the parameter that controls whether delete file mode is enabled).
  - Delete file mode -> LSM: in theory, perform a full compaction, then clean up the old snapshots, and then disable delete file mode.
...