Discussion thread	https://lists.apache.org/thread/7qjzbcfzdshqb3h7ft31v9o3x43t8k6r
Vote thread
ISSUE	https://github.com/apache/incubator-paimon/issues/2898
Release	0.8

Motivation

Position delete is one way to implement a Merge-On-Read (MOR) structure, and it has been adopted by other formats such as Iceberg[1] and Delta[2]. By combining it with Paimon's LSM tree, we can create a new position deletion mode, called `deletion vectors mode`, that is unique to Paimon.

Under this mode, extra overhead (lookup and writing the delete file) is introduced during writing, but during reading, data can be retrieved directly using "data + filter with deletion vector", avoiding additional merge costs between different files. Furthermore, this mode can be easily integrated into native engine solutions like Spark + Gluten[3] in the future, thereby significantly enhancing read performance.

Goals

Must

Data read and write operations are accurate.
Reader can directly obtain the final data through "data + filter with position delete" without additional merging.

Should

The number of delete files written each time is controllable.
The additional overhead caused by writing is controllable.
Read performance is superior to the original LSM merge.
Unused delete files can be automatically cleaned up.

Implement

1. Delete File

A delete file is used to mark rows of an original data file as deleted. The following figure illustrates how data is updated and deleted under the delete file mode:

Currently, there are two ways to represent the deletion of records:

Position delete: marking the specific row in a file as deleted.
Equality delete: writing a filter directly to represent the deletion.

Taking into account:

Paimon can obtain the old records during lookup compaction.
Inserts in Paimon may also result in updates (deletes), which are difficult to represent using equality delete.
Position delete is sufficiently efficient for the reader.

...

Therefore, we do not consider equality delete and will only implement the delete file using position delete. There are three design approaches, as follows:

1.1. Approach 1

Store deletes as list<file_name, pos>, which is doubly sorted by file_name and pos.

...

High redundancy, with the file_name being repeated extensively.
When reading, it is necessary to read all the delete files first, and then construct the bitmap for the corresponding data file.

Approach 1 is inefficient, so we do not choose it. Approach 2 and Approach 3 both store the bitmap directly in the delete file, but their implementations differ.

1.2. Approach 2 (pick)

One delete file per bucket, with a structure of map<file_name, bitmap>. When reading a specific data file, read the bucket's delete file, construct the map<file_name, bitmap>, and then get the corresponding bitmap by file_name.

...

Delete files are read and written at the bucket level.
In extreme cases, if the deletion is distributed across all buckets, the delete files for all buckets will need to be rewritten.
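
The per-bucket map<file_name, bitmap> of Approach 2 can be pictured with a minimal sketch. It assumes `java.util.BitSet` as a stand-in for the real bitmap type, and the class and method names are hypothetical; it only illustrates the lookup by file name, not Paimon's actual implementation.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of one bucket's delete file: map<file_name, bitmap>.
// java.util.BitSet stands in for the real bitmap type (an assumption of this sketch).
class BucketDeleteFileSketch {

    private final Map<String, BitSet> bitmaps = new HashMap<>();

    // Mark the row at `position` of data file `fileName` as deleted.
    void markDeleted(String fileName, int position) {
        bitmaps.computeIfAbsent(fileName, k -> new BitSet()).set(position);
    }

    // A row is deleted only if its file has a bitmap and the bit is set.
    boolean isDeleted(String fileName, int position) {
        BitSet bitmap = bitmaps.get(fileName);
        return bitmap != null && bitmap.get(position);
    }

    // Number of deleted rows recorded for one data file.
    int deletedCount(String fileName) {
        BitSet bitmap = bitmaps.get(fileName);
        return bitmap == null ? 0 : bitmap.cardinality();
    }
}
```

Because all bitmaps of a bucket live in one map, any new deletion in the bucket means the whole structure is rewritten, which matches the bucket-level read and write cost noted above.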

1.3. Approach 3

One delete file per write, with a structure of list<bitmap>, plus additional metadata <delete file name, offset, size> that points to each bitmap (this structure is also called a deletion vector).

When reading a specific data file, obtain the delete_file's file name based on the metadata, and then according to the offset + size, retrieve the corresponding bitmap.

...

More changes to the Paimon protocol are needed: a file becomes a tuple <data_file, delete_meta>, and the logic for cleaning up delete files is more complex.
When writing, it is necessary to merge the bitmaps generated by each bucket into a single delete file.
In extreme cases, if there are deletions with every write, then a new delete file will be generated with each write operation (however, there is a maximum number guaranteed because with each full compaction, all delete files become invalid).
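
To make the Approach 3 layout concrete, here is a rough sketch under stated assumptions: `java.util.BitSet` stands in for the bitmap, the delete file is modeled as an in-memory byte array, and all class, field and method names are hypothetical. It shows bitmaps being concatenated into one delete file while <delete file name, offset, size> metadata is recorded per data file.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of Approach 3: one delete file holding list<bitmap>,
// plus per-data-file metadata <delete file name, offset, size>.
class SharedDeleteFileSketch {

    // Metadata pointing at one bitmap inside the shared delete file.
    static final class DeleteMeta {
        final String deleteFileName;
        final int offset;
        final int size;

        DeleteMeta(String deleteFileName, int offset, int size) {
            this.deleteFileName = deleteFileName;
            this.offset = offset;
            this.size = size;
        }
    }

    private final byte[] content;                 // concatenated serialized bitmaps
    private final Map<String, DeleteMeta> metas;  // data file name -> pointer

    private SharedDeleteFileSketch(byte[] content, Map<String, DeleteMeta> metas) {
        this.content = content;
        this.metas = metas;
    }

    // Append every bitmap to a single buffer and remember where each one starts.
    static SharedDeleteFileSketch write(String deleteFileName, Map<String, BitSet> bitmaps) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Map<String, DeleteMeta> metas = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> entry : bitmaps.entrySet()) {
            byte[] bytes = entry.getValue().toByteArray();
            metas.put(entry.getKey(), new DeleteMeta(deleteFileName, out.size(), bytes.length));
            out.write(bytes, 0, bytes.length);
        }
        return new SharedDeleteFileSketch(out.toByteArray(), metas);
    }

    DeleteMeta metaOf(String dataFileName) {
        return metas.get(dataFileName);
    }

    // Read only the slice [offset, offset + size) that belongs to one data file.
    BitSet read(String dataFileName) {
        DeleteMeta meta = metas.get(dataFileName);
        if (meta == null) {
            return new BitSet();
        }
        return BitSet.valueOf(Arrays.copyOfRange(content, meta.offset, meta.offset + meta.size));
    }
}
```

The reader touches only one slice of the delete file, but the metadata has to travel with each data file, which is the protocol change listed above.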

1.4. Test

Before deciding on which approach to go with, let's first conduct a performance test on bitmaps, based on org.roaringbitmap.RoaringBitmap[4]. The reasons for choosing it are as follows:

...

data rate / max num	add(ms)	serialization(ms)	deserialization(ms)	file size(MB)	contains(ms)
20% /2,000,000	43	5	26	0.24	7
50% /2,000,000	47	3	52	0.24	5
80% /2,000,000	57	1	24	0.24	8
20% /20,000,000	450	13	247	2.4	49
50% /20,000,000	629	6	222	2.4	76
80% /20,000,000	1040	5	222	2.4	121
20% /200,000,000	5079	44	2262	24	442
50% /200,000,000	9469	43	2773	24	1107
80% /200,000,000	13625	38	2233	24	1799
20% /2,000,000,000	93753	568	22290	239	5747
50% /2,000,000,000	166070	679	22339	239	14735
80% /2,000,000,000	218233	553	22684	239	26504

To summarize:

The bitmap's serialization and deserialization time, file size, and add/contains cost are roughly proportional to the amount of data.
When the data volume reaches 2 billion, the bitmap is essentially unusable (for example, deserialization alone takes about 22 seconds and the file is 239 MB).

Let's make some choices:

1. RoaringBitmap or Roaring64NavigableMap?

...

Therefore, considering both implementation and performance aspects, Approach 2 is ultimately chosen.

2. Protocol design

2.1. layout

Reuse the current index layout and just treat the deletionVectors as a new index file type

2.2. Deletion vectors index file encoding

Like the hash index, each bucket has one deletionVector index. A deletionVector index file therefore needs to contain the bitmaps of multiple files in the same bucket, so its structure is effectively a map<fileName, bitmap>. To support high-performance reads, we have designed the following file encoding to store it:

...

First, record a const magic number by an int.
Then, record serialized bitmap.
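
The two steps stated above (a constant magic number written as an int, followed by the serialized bitmap) could look roughly like the sketch below. The magic value, the class name and the use of `java.util.BitSet` in place of the real bitmap are all assumptions for illustration, and the rest of the file layout is deliberately not modeled.

```java
import java.nio.ByteBuffer;
import java.util.BitSet;

// Hypothetical codec for a single bitmap entry: [magic number (int)] [serialized bitmap].
class BitmapEntryCodecSketch {

    // Placeholder constant chosen for this sketch; not the real magic number.
    static final int MAGIC_NUMBER = 0x44564D31;

    static byte[] encode(BitSet bitmap) {
        byte[] body = bitmap.toByteArray();
        ByteBuffer buffer = ByteBuffer.allocate(Integer.BYTES + body.length);
        buffer.putInt(MAGIC_NUMBER); // first, the constant magic number as an int
        buffer.put(body);            // then, the serialized bitmap
        return buffer.array();
    }

    static BitSet decode(byte[] entry) {
        ByteBuffer buffer = ByteBuffer.wrap(entry);
        int magic = buffer.getInt();
        if (magic != MAGIC_NUMBER) {
            throw new IllegalArgumentException("Unexpected magic number: " + magic);
        }
        // BitSet.valueOf(ByteBuffer) consumes the bytes between position and limit.
        return BitSet.valueOf(buffer);
    }
}
```

Checking the magic number before deserializing lets a reader reject a wrong offset or a corrupted entry early instead of building a meaningless bitmap.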

3. Write

3.1. Overview

Following the existing lookup mechanism, design a deleteFile generation mechanism based on compaction + lookup:

...

f8 and f9 are marked as ADD, and the new delete file is marked as ADD.

3.2. Implementation

Considerations for implementation:

~~Currently, when set 'changelog-producer' = 'lookup'~~, the data write behavior is not atomic but divided into two steps: first, data is written to create snapshot1, then lookup compaction generates snapshot2. We need to consider the atomicity of this.
In most cases, the data will be transferred to level-0 first, and then rewritten. The writing overhead is a bit high, and perhaps some optimization can be done in this regard.
If a changelog needs to be generated, in theory the changelog and the delete file can be produced simultaneously (without reading twice).
The merge engine is still available.

4. Read

4.1. Overview

For each read task, load the corresponding deleteFile.
Construct the map<fileName, bitmap> from deleteFile.
Get the bitmap based on the filename, then pass it to the reader.
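
The three read steps can be sketched as follows. This is an illustration only: rows are modeled as an in-memory list, `java.util.BitSet` stands in for the bitmap, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;

// Hypothetical read path: "data + filter with deletion vector", with no merging.
class ReadWithDeletionSketch {

    // `deleteFile` is the map<fileName, bitmap> built from the loaded deleteFile.
    static <T> List<T> readFile(String fileName, List<T> rows, Map<String, BitSet> deleteFile) {
        BitSet bitmap = deleteFile.get(fileName); // get the bitmap by file name
        if (bitmap == null) {
            return new ArrayList<>(rows);         // no deletions recorded for this file
        }
        List<T> result = new ArrayList<>();
        for (int position = 0; position < rows.size(); position++) {
            if (!bitmap.get(position)) {          // keep rows whose position is not deleted
                result.add(rows.get(position));
            }
        }
        return result;
    }
}
```

Each data file is filtered on its own, so no other file has to be opened to decide whether a row is still alive.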

5. Maintenance

5.1. compaction

We can incorporate bitmap evaluation into the compaction pick: for example, when the proportion of deleted rows in a file reaches a threshold such as 50%, we can pick that file for compaction.
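
A minimal sketch of such a pick rule is shown below. The `FileStats` type, its fields and the threshold parameter are hypothetical names introduced for illustration; the 50% value is just the example figure used above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical compaction pick based on the ratio of deleted rows per file.
class CompactionPickSketch {

    static final class FileStats {
        final String fileName;
        final long rowCount;        // total rows in the data file
        final long deletedRowCount; // rows marked in the file's bitmap

        FileStats(String fileName, long rowCount, long deletedRowCount) {
            this.fileName = fileName;
            this.rowCount = rowCount;
            this.deletedRowCount = deletedRowCount;
        }
    }

    // Return the names of files whose deleted-row ratio reaches the threshold.
    static List<String> pick(List<FileStats> files, double threshold) {
        List<String> picked = new ArrayList<>();
        for (FileStats file : files) {
            if (file.rowCount > 0
                    && (double) file.deletedRowCount / file.rowCount >= threshold) {
                picked.add(file.fileName);
            }
        }
        return picked;
    }
}
```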

5.2. expire

Determine whether to delete based on the delete and add records in the deleteFileManifest.

6. Other considerations

Impact on file meta: Currently, the stats (min, max, null count) in file meta are already unreliable, so no special handling will be performed for this aspect.
...

Public Interfaces

How to use

a new conf:

deletion-vectors.enabled: control whether to enable deletion vectors mode: write deletion vectors index and read using it without merge.

...

Only support for tables with primary keys
The first version only supports `file.format` = `orc` or `parquet`
Only support `changelog-producer` = `none` or `lookup`

other:

`changelog-producer.lookup-wait` can't be `false`
`merge-engine` can't be `first-row`, because reading a first-row table already requires no merging, so deletion vectors are not needed
This mode filters out the data in level-0, so reading an `APPEND` snapshot through time travel will see delayed data
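
The constraints listed above can be collected into a small validation sketch. The class and method names are hypothetical, options are modeled as a plain string map, and an option that is not set is simply skipped (an assumption of this sketch, since defaults are not stated here).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical check of the option constraints for deletion vectors mode.
class DeletionVectorsOptionsCheckSketch {

    static List<String> violations(Map<String, String> options, boolean hasPrimaryKeys) {
        List<String> problems = new ArrayList<>();
        if (!"true".equals(options.get("deletion-vectors.enabled"))) {
            return problems; // mode is off, nothing to check
        }
        if (!hasPrimaryKeys) {
            problems.add("deletion vectors mode only supports tables with primary keys");
        }
        requireOneOf(options, "file.format", Arrays.asList("orc", "parquet"), problems);
        requireOneOf(options, "changelog-producer", Arrays.asList("none", "lookup"), problems);
        forbid(options, "changelog-producer.lookup-wait", "false", problems);
        forbid(options, "merge-engine", "first-row", problems);
        return problems;
    }

    // Unset options are skipped: this sketch makes no assumption about defaults.
    private static void requireOneOf(
            Map<String, String> options, String key, List<String> allowed, List<String> problems) {
        String value = options.get(key);
        if (value != null && !allowed.contains(value)) {
            problems.add(key + " must be one of " + allowed + " but was " + value);
        }
    }

    private static void forbid(
            Map<String, String> options, String key, String forbidden, List<String> problems) {
        if (forbidden.equals(options.get(key))) {
            problems.add(key + " can't be " + forbidden);
        }
    }
}
```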

other:

Since there is no need to merge when reading, in this mode we can support filter pushdown on non-PK fields, and data reading concurrency is no longer limited!

Classes

Add RecordWithPositionIterator to get row position:

Code Block

language	java
title	RecordWithPositionIterator.java

public interface RecordWithPositionIterator<T> extends RecordReader.RecordIterator<T> {

    /**
     * Get the row position of the row returned by {@link RecordReader.RecordIterator#next}.
     *
     * @return the row position from 0 to the number of rows in the file
     */
    long returnedPosition();
}

Abstract an interface DeletionVector to represent the deletion vector, and provide a BitmapDeletionVector based on RoaringBitmap to implement it:

Code Block

language	java
title	DeletionVector.java

public interface DeletionVector {

    void delete(long position);

    boolean checkedDelete(long position);
    
    boolean isDeleted(long position);

    boolean isEmpty();

    byte[] serializeToBytes();

    DeletionVector deserializeFromBytes(byte[] bytes);
}
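
As an illustration of what an implementation of these methods might look like, here is a self-contained sketch backed by `java.util.BitSet` rather than RoaringBitmap. The class name is hypothetical, `checkedDelete` is assumed to return true only when the position was not already deleted, and deserialization is written as a static factory for convenience.

```java
import java.util.BitSet;

// Hypothetical BitSet-backed deletion vector; mirrors the method names above
// but does not implement the real interface, so it can be compiled on its own.
class BitSetDeletionVectorSketch {

    private final BitSet bits;

    BitSetDeletionVectorSketch() {
        this(new BitSet());
    }

    private BitSetDeletionVectorSketch(BitSet bits) {
        this.bits = bits;
    }

    void delete(long position) {
        bits.set(toIndex(position));
    }

    // Assumed semantics: true only if this call newly marked the position.
    boolean checkedDelete(long position) {
        int index = toIndex(position);
        if (bits.get(index)) {
            return false;
        }
        bits.set(index);
        return true;
    }

    boolean isDeleted(long position) {
        return bits.get(toIndex(position));
    }

    boolean isEmpty() {
        return bits.isEmpty();
    }

    byte[] serializeToBytes() {
        return bits.toByteArray();
    }

    static BitSetDeletionVectorSketch deserializeFromBytes(byte[] bytes) {
        return new BitSetDeletionVectorSketch(BitSet.valueOf(bytes));
    }

    // BitSet is int-indexed, so positions beyond Integer.MAX_VALUE are rejected.
    private static int toIndex(long position) {
        return Math.toIntExact(position);
    }
}
```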

Add a DeletionVectorsIndexFile to read, write and delete deletionVector:

Code Block

language	java
title	DeletionVectorsIndexFile.java

public class DeletionVectorsIndexFile {  
   
    public long fileSize(String fileName);
    
    public Map<String, DeletionVector> readAllDeletionVectors(String fileName, Map<String, Pair<Integer, Integer>> deletionVectorRanges);
    
    public DeletionVector readDeletionVector(String fileName, Pair<Integer, Integer> deletionVectorRange);

    public Pair<String, Map<String, Pair<Integer, Integer>>> write(Map<String, DeletionVector> input);

    public void delete(String fileName);
}

Add DeletionVectorsMaintainer to maintain dv:

Code Block

language	java
title	DeletionVectorsMaintainer.java

public class DeletionVectorsMaintainer {

    public void notifyNewDeletion(String fileName, long position);

    public void removeDeletionVectorOf(String fileName);

    List<IndexFileMeta> prepareCommit();

    public Optional<DeletionVector> deletionVectorOf(String fileName);
}

Add ApplyDeletionVectorReader, which implements RecordReader, to read with DeletionVector:

Code Block

language	java
title	ApplyDeletionVectorReader.java

public class ApplyDeletionVectorReader<T> implements RecordReader<T> {

    // The wrapped reader and the deletion vector of the file it reads.
    private final RecordReader<T> reader;
    private final DeletionVector deletionVector;

    public ApplyDeletionVectorReader(RecordReader<T> reader, DeletionVector deletionVector) {
        this.reader = reader;
        this.deletionVector = deletionVector;
    }

    @Nullable
    @Override
    public RecordIterator<T> readBatch() throws IOException {
        RecordIterator<T> batch = reader.readBatch();

        if (batch == null) {
            return null;
        }

        // The batch must expose row positions so that deleted rows can be skipped.
        FileRecordIterator<T> batchWithPosition = (FileRecordIterator<T>) batch;

        return batchWithPosition.filter(
                a -> !deletionVector.isDeleted(batchWithPosition.returnedPosition()));
    }
  ...
}

Compatibility, Deprecation, and Migration Plan

Conversion between deletion vectors mode and original mode

original mode -> deletion vectors mode: perform a full compaction, then set `deletion-vectors.enabled` = `true`; time travel to the snapshots before it was enabled will be prohibited.
deletion vectors mode -> original mode: in theory, perform a full compaction, clean up the old snapshots, and then set `deletion-vectors.enabled` = `false`; time travel to the earlier snapshots will be prohibited.

Future work

Integrate deletion vectors with append table
...

[1]: https://github.com/apache/iceberg

...
