Status
Current state: [One of "Under Discussion", "Accepted", "Rejected"]
Discussion thread:
JIRA or Github Issue:
Released: <Doris Version>
Google Doc: <If the design in question is unclear or needs to be discussed and reviewed, a Google Doc can be used first to facilitate comments from others.>
Motivation
Doris's Unique-Key model is widely used in scenarios with update requirements, such as Flink-CDC synchronization, user profiling, and e-commerce orders. However, Unique-Key currently uses a Merge-On-Read implementation, and query performance is poor for the following reasons:
Because merging happens at read time, a query must read all data that passes the filters and perform a merge sort to determine which row holds the latest version of each key. This merge sort incurs additional CPU cost
Because it is impossible to tell whether a given row in a segment is the latest version, Unique-Key cannot push aggregate-function predicates down to storage, which increases the amount of data read
The following figure is a simplified representation of the execution process of the Unique-Key model:
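The merge-on-read behavior described above can be sketched as follows. This is a minimal, hypothetical Python illustration (the function and variable names are not Doris APIs): each segment is sorted by key and tagged with a version, and the reader must merge-sort every segment to keep only the newest version of each key.

```python
import heapq

def merge_on_read(segments):
    """segments: list of (version, rows), where rows are (key, value) pairs sorted by key."""
    # Decorate each row with (key, -version): heapq.merge then yields rows in
    # key order, and among equal keys the newest version comes first.
    decorated = [[((key, -version), value) for key, value in rows]
                 for version, rows in segments]
    result = []
    last_key = object()
    for (key, _neg_version), value in heapq.merge(*decorated):
        if key != last_key:  # first occurrence is the newest version: keep it
            result.append((key, value))
            last_key = key
    return result
```

Note that every filtered row from every segment flows through the merge, even if only a small fraction survives; this is the CPU cost the proposal aims to remove.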
Related Research
We investigated how other mainstream data warehouses implement their data-update models. Besides the Merge-On-Read used by Unique-Key, the implementations basically fall into the following categories:
Merge-On-Write(Mark Delete and Insert)
Scheme overview:
Check whether a write is an update via a primary key index, and locate the file and row number of the row whose key is being overwritten
Mark the old row as deleted by recording its rowid in a delete bitmap, and filter out rows recorded in the bitmap when reading
Advantages: key comparisons are avoided at read time, and various predicate push-downs and secondary indexes are supported
Disadvantages: additional cost of querying the index and updating the bitmap on write
Representative system: Hologres, Aliyun ADB, ByteHouse
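The mark-delete-and-insert scheme can be sketched as follows. This is an illustrative Python model under assumed names (MowTable, write_segment, and scan are hypothetical, not real Doris or Hologres APIs): a primary key index maps each key to its row location, updates mark the overwritten row in a delete bitmap, and reads simply skip marked rows without comparing keys.

```python
class MowTable:
    """Toy model of Merge-On-Write via mark delete and insert."""

    def __init__(self):
        self.segments = []          # each segment is a list of (key, value)
        self.pk_index = {}          # primary key index: key -> (segment_id, rowid)
        self.delete_bitmap = set()  # set of (segment_id, rowid) marked deleted

    def write_segment(self, rows):
        seg_id = len(self.segments)
        self.segments.append(rows)
        for rowid, (key, _value) in enumerate(rows):
            old = self.pk_index.get(key)
            if old is not None:
                # Update: mark the old location of this key as deleted.
                self.delete_bitmap.add(old)
            self.pk_index[key] = (seg_id, rowid)

    def scan(self):
        # Read path: no key comparison, just filter by the delete bitmap.
        return [(k, v)
                for seg_id, rows in enumerate(self.segments)
                for rowid, (k, v) in enumerate(rows)
                if (seg_id, rowid) not in self.delete_bitmap]
```

The advantage shows up in scan(): it never compares keys across segments, so predicates and secondary indexes can apply to each segment independently.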
Copy-On-Write
Scheme overview:
When writing/updating data, the original file is synchronously merged into a new version of the file. Even if only one byte of new data is committed, the entire column data file needs to be rewritten.
Advantages: Highest query efficiency
Disadvantages: The cost of writing data is very high; this is suitable for scenarios with low-frequency bulk updates and high-frequency reads
Representative system: Delta Lake, Hudi
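A minimal sketch of the copy-on-write cost model, assuming the file is an immutable key-value snapshot (the function name is illustrative, not a Delta Lake or Hudi API): every update produces a complete new file version, which is why even a one-row update rewrites everything.

```python
def copy_on_write_update(current_file, updates):
    """current_file: dict key -> value, modeling one immutable file version.

    Returns a brand-new file version with updates applied; the old version
    is left untouched. Note that the whole file is copied regardless of how
    small `updates` is -- this is the write amplification of Copy-On-Write.
    """
    new_file = dict(current_file)  # rewrite the entire file
    new_file.update(updates)
    return new_file
```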
Delta-Store
Scheme overview:
Divide data into base data and delta data. Each key is unique in base data
When writing/updating data, first look up the base data. If the key already exists (an update), write the updated data to the delta store; on read, merge the delta store with the base data to obtain the latest data. If the key is not found in the base data (an insert), write it to the memtable and flush it into the base data
Advantages: Good writing performance, efficient support for partial updates
Disadvantages: Poor support for secondary indexes
Representative system: Kudu
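The base/delta split described above can be sketched as follows (a hypothetical Python model; the class and method names are illustrative, not Kudu APIs): keys are unique in base data, updates to existing keys land in the delta store, and reads merge delta over base.

```python
class DeltaStoreTable:
    """Toy model of the Delta-Store scheme: base data plus a delta store."""

    def __init__(self):
        self.base = {}   # base data: each key is unique here
        self.delta = {}  # delta store: latest updated value per key

    def write(self, key, value):
        if key in self.base:
            # Update path: the key exists in base, so record the new value
            # in the delta store instead of rewriting base data.
            self.delta[key] = value
        else:
            # Insert path: new key goes into base data (via memtable flush
            # in the real scheme; modeled as a direct insert here).
            self.base[key] = value

    def read(self, key):
        # Read path: merge delta over base to get the latest value.
        return self.delta.get(key, self.base.get(key))
```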
Solution Selection
Given Doris's Unique-Key usage scenarios (high real-time requirements for import, high import frequency, and queries that should be as fast as possible), Merge-On-Write is the more suitable optimization approach.
Detailed Design
Key Design Points
To implement the Mark Delete and Insert scheme, consider the following:
Add a primary key index that can quickly check, during import, whether a key already exists and, if so, locate the file and row number of the existing row
Add a delete bitmap to mark the old key as deleted when an update occurs
The update of delete bitmap needs to consider transaction support and multi-version support
Adapt queries to filter through the delete bitmap
Handle conflicts between compaction and data import: after a compaction task has read a rowset's data but before it commits, an ongoing import task may mark a row of that rowset as deleted; such rows must be re-marked as deleted in the newly generated rowset
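The transaction/multi-version requirement for the delete bitmap can be sketched as follows. This is an assumed illustration (VersionedDeleteBitmap and its methods are hypothetical names, not Doris code): each import version records its own set of delete marks, so a reader pinned at version V sees exactly the deletes committed at versions up to V, and concurrent imports at higher versions do not disturb it.

```python
class VersionedDeleteBitmap:
    """Toy multi-version delete bitmap: one mark set per import version."""

    def __init__(self):
        self.marks = {}  # version -> {(segment_id, rowid), ...}

    def mark_deleted(self, version, segment_id, rowid):
        # Each import transaction records deletes under its own version,
        # so an uncommitted/newer version never leaks into older readers.
        self.marks.setdefault(version, set()).add((segment_id, rowid))

    def is_deleted(self, read_version, segment_id, rowid):
        # A reader at read_version sees the union of all delete marks
        # committed at versions <= read_version.
        return any((segment_id, rowid) in rows
                   for version, rows in self.marks.items()
                   if version <= read_version)
```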
Primary key index
Several key points of primary key index design and selection:
High QPS (target: a single BE can support 100,000 QPS)
Doris uses columnar storage. When the primary key contains multiple columns, a point lookup would need to read the value of each key column separately, incurring multiple disk IOs. To improve primary key index performance, we store the encoded primary key in the index: when the primary key contains multiple columns, these columns are encoded together and stored as a single key.
Do not consume memory excessively
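The multi-column key encoding mentioned above can be sketched as follows. This is a simplified, hypothetical encoder (the real on-disk format is not specified here; type tags, the fixed-width integer layout, and non-negative-integer-only support are assumptions of this sketch): the goal is one byte string per row whose lexicographic order matches column-by-column key order, so the index needs only a single comparison per lookup.

```python
def encode_key(columns):
    """Encode a list of str/int key columns into one order-preserving bytes key."""
    out = b""
    for col in columns:
        if isinstance(col, int):
            # Fixed-width big-endian preserves numeric order for
            # non-negative integers (simplifying assumption of this sketch).
            out += b"\x01" + col.to_bytes(8, "big")
        else:
            # A 0x00 terminator keeps "ab" < "abc" in plain byte order.
            out += b"\x02" + col.encode("utf-8") + b"\x00"
    return out
```

With such an encoding, a point lookup reads one index entry instead of touching every key column's data file.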
The primary key index selection uses a scheme similar to RocksDB Partitioned Index [refer to https://github.com/facebook/rocksdb/wiki/Partitioned-Index-Filters ], created in Segment when MemTable flushes. The main considerations for choosing this scheme are as follows:
RocksDB has proved that this index scheme can support very high QPS. With a reasonably configured BlockCache that keeps the index-page cache hit rate above 90%, query performance can be guaranteed
Because it is a file-based index, it can be loaded on demand, and the consumption of memory is controllable
The scheme prefix-encodes the primary key data, which further saves space, reducing both memory-cache occupancy and the amount of data read from disk
Impact of Too Many Small Files on Point-Lookup Performance under High-Frequency Import
In the default configuration, a tablet may have up to 500 rowsets, and probing them one by one for a key may perform poorly
To address this, each tablet can maintain an additional CuckooFilter to accelerate point lookups on small files. Compared with a BloomFilter of the same size, a CuckooFilter has a lower false-positive rate, supports deletion, and can store values (such as row numbers). (This scheme is inspired by SlimDB: https://www.vldb.org/pvldb/vol10/p2037-ren.pdf ; a scheme similar to SlimDB's Multi Cuckoo Filter can quickly determine whether a key exists with lower memory and IO consumption.)
Add a corresponding BloomFilter for the primary key index. The overall design is shown in the figure below
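The filter-based acceleration can be sketched as follows. In this hypothetical Python model a plain set stands in for the Bloom/Cuckoo filter (and the class and function names are illustrative, not Doris code); the point is that a cheap in-memory membership check rules out most rowsets before any expensive index seek is performed.

```python
class RowsetIndex:
    """Toy rowset with a membership filter in front of its primary key index."""

    def __init__(self, keys):
        self.filter = set(keys)  # stand-in for a Bloom/Cuckoo filter
        self.index = {k: i for i, k in enumerate(keys)}  # pk index: key -> rowid
        self.seeks = 0           # counts "expensive" index seeks

    def lookup(self, key):
        if key not in self.filter:
            # Cheap in-memory pre-check: no index seek, no IO.
            return None
        self.seeks += 1          # only now pay for an index seek
        return self.index.get(key)

def point_lookup(rowsets, key):
    """Probe rowsets one by one; the filter skips most of them cheaply."""
    for rs in rowsets:
        rowid = rs.lookup(key)
        if rowid is not None:
            return rs, rowid
    return None
```

With hundreds of rowsets per tablet, almost all probes stop at the filter check, so the per-import cost of a point lookup stays bounded.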
Scheduling
Specific implementation steps and approximate schedule.