Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

This FLIP proposes upgrading the version of FRocksDB in the Flink Project from 6.20.3 to 8.10.0.

RocksDBStateBackend is widely used by Flink users in large state scenarios.The last upgrade of FRocksDB was in version Flink-1.14, which mainly supported features such as support arm platform, deleteRange API, period compaction, etc. It has been a long time since then, and RocksDB has now been released to version 8.x. The main motivation for this upgrade is to leverage the features of higher versions of Rocksdb to make Flink RocksDBStateBackend more powerful. While RocksDB is also continuously optimizing and bug fixing, we hope to keep FRocksDB more or less in sync with RocksDB and upgrade it periodically.

(Although this upgrade almost doesn't change anything in flink code, the main reason for creating this FLIP is because the feature and performance changes it brings may be visible to users)

Main benefits

This section will introduce the main benefits of upgrading RocksDB, including optimizing the performance of Rocksdb rescaling using IngestDB API and other potential optimization points that have not yet been implemented, such as async io and tiered storage etc.

IngestDB

The problem of slow RocksDBStateBackend rescaling has been troubling users for a long time. Such as when the parallelism of a job changes from 2 to 1. When RocksDB performs rescaling recovery, it is necessary to merge the data of two RocksDB into one DB instance. 

In the current implementation, merging data from two RocksDB can only traverse all records of one DB and then insert them into to the other one. To address this issue, we extended RocksDB API to directly merge data from multiple RocksDB instances and support IngestDB Recovery Mode FLINK-31238 - Getting issue details... STATUS in RocksDBStateBackend. In this way, the time for merging DB data has been increased from minutes or even half an hour to a few seconds.

Now the APIs required by IngestDB have been merged into the main branch of RockDB FLINK-33337 - Getting issue details... STATUS , and the implementation of IngestDB has also already been completed https://github.com/apache/flink/pull/24031.(special thanks to Stefan Richter )

However, the feature of IngestDB still cannot be used because FRocksDB has not been upgraded to the corresponding version yet.

We tested the performance of ingestDB with a snapshot size of 1g. The results are as follows:

  • In scenarios where parallelism increases(1->2, 2->8), the performance of using IngestDB Mode is the same as before, because during the recovery phase, only deleteRange will be called, which will return very quickly.
  • In other scenarios, IngestDB has shown significant improvements compared to previous recovery methods.

CurrentModeIngestDBMode
Restore DetailsMax Restore TimeRestore DetailsMax Restore Time
1->2task0: 809ms
task1: 738ms
809mstask0: 789ms
task1: 745ms
789ms
2->1task0: 42979ms42979mstask0: 890ms890ms
2->8task0 : 677ms
task1 : 566ms
task2 : 580ms
task3 : 610ms
task4 : 534ms
task5 : 439ms
task6 : 436ms
task7 : 481ms
677mstask0 : 496ms
task1 : 474ms
task2 : 408ms
task3 : 415ms
task4 : 456ms
task5 : 428ms
task6 : 433ms
task7 : 407ms
496ms
8->2task0 : 33603ms
task1 : 42867ms
42867mstask0 : 680ms
task1 : 650ms
680ms
2->3task0 : 421ms
task1 : 15299ms
task2 : 474ms
15299mstask0 : 499ms
task1 : 952ms
task2 : 428ms
952ms
3->2task0 : 14833ms
task1 : 14810ms
14833mstask0 : 738ms
task1 : 669ms
738ms

Other Potential Optimization Points

In addition to IngestDB, upgrading Rocksdb can also bring some other features that will be very useful for Flink. These features have not yet been used and implemented in Flink, so they are listed here as potential optimization points

AsyncIO & MultiGet

RocksDB supports MultiGet & Async IO in higher versions to improve the performance by over 100% according to different data pipelines which is also required by one of the FLIP-426 changes.Therefore, using Async IO can significantly improve the performance of state batch access and traversal, which are also widely used in Flink. Although further discussion may be needed on how to adapt async io in Flink RocksdbStateBackend, all of these require us to first upgrade FRocksDB as a prerequisite.

Tiered Storage 

In FLIP-423, some limitations of current State management were mentioned, including Local Disk Constraints, Light and Fast Checkpoints, Spiky Resource Usage, Elasticity and Fast Rescaling.

RocksDB now supports Tiered Storage which can assign data temperature when creating the new SST which hints the file system to put the data on the corresponding storage media, so the data in a single DB instance can be placed on different storage media

This feature can also address some of the aforementioned issues. At the same time, RocksDB has also optimized the performance of Tiered Storage, including such as Per Key Placement Compaction and Secondary Cache.

Although FLIP-423 has designed a complete solution ForSt for these scenarios, perhaps RocksDBStateBackend can also directly reuse the existing features of RocksDB for some enhancements at a relatively low cost, and there is no conflict between these two optimization

BlobDB

In the usage scenarios of Flink, the state value may be very large in many scenarios, such as machine learning features or long-term user-defined states. In LSM, this large-value use case can cause a lot of read and write amplification, affecting the performance of RocksDB. Fortunately, RocksDB supports BlobDB to address this issue. The basic idea, which was proposed in the WiscKey paper, is a key value separation: by storing large values in dedicated blob files and storing only small pointers to them in the LSM tree, avoid copying the values over and over again during compaction. This reduces write amplification, which has several potential benefits like improved SSD lifetime, and better write and read performance. Perhaps BlobDB can also be supported in RocksDBStateBackend in future FLIP.

Public Interfaces

No changes 

Proposed Changes

  • Flink
    • Update FRocksDB version to 8.10.0 after create official artifacts
  • FRockDB
    • In FrocksDB 6 thread-local perf-context is disabled by reverting a specific commit (FLINK-19710). However, this creates conflicts and makes upgrading more difficult. We found that disabling PERF_CONTEXT can improve the performance of statebenchmark by about 5% and it doesn't create any conflicts. So we plan to drop the support of PERF_CONTEXT in the new FRocksDB release.

Compatibility, Deprecation, and Migration Plan

this change doesn't break any existing formats or jobs

Test Plan

For version upgrades, performance testing is very important, so we upgraded RocksDB to version 8.10.0 on the Flink Master branch and compared it with the current version 6.20.3 on StateBenchmark and Nexmark, respectively. The test results are as follows:

Nexmark

We ran the Flink Nexmark benchmark on 4 independent machines. Each machine consists of an AMD EPYC 9Y24 CPU and NVME ssd disk. Flink's cluster consists of 1 JobManager and 8 TaskManagers (each with 1 slot and  with 8 GB of memory). And we selected all queries that use state in Nexmark . From the test results, it appears that the performance of most Queries is similar to the previous throughput.

Nexmark
Query
Rocksdb-6.20.3Rocksdb 8.10Compare
Throughput(r/s)Throughput(r/s)
q33.29M3.3M0.304%
q4445.24K423.51K-4.881%
q5730.26K986.83K35.134%
q7292.4K302.88K3.584%
q83.49M3.48M-0.287%
q9252.69K228.57K-9.545%
q11621.61K620.93K-0.109%
q123.17M3.15M-0.631%
q151.87M1.8M-3.743%
q16544.57K603.11K10.750%
q172.62M2.59M-1.145%
q18828.87K841.21K1.489%
q19875.74K902.24K3.026%
q20333.23K318.04K-4.558%
Total--2.099%

StateBenchmark

We conducted State Benchmark testing on RocksDB in Flink Speed Center (special thanks to Zakelly Lan and Roman Khachatryan ), and ran three rounds of testing to reduce jitter. From the test results, it appears that the read performance of RocksDB V8 is not much different from V6, but there is a slight regression in write performance

BenchmarkV8-Round1V8-Round2V8-Round3V8-AverageV6-Round1V6-Round2V6-Round3V6-AverageV8 vs. V6
listAdd448.491818453.333675457.166387452.997293464.362323473.023696469.028608468.804876-3.37%
listAddAll263.26387270.054733262.261058265.19322284.309874289.76554288.108038287.394484-7.73%
listAppend442.991378448.92388442.740273444.885177456.645484449.272529472.059381459.325798-3.14%
listGet112.747402114.132869116.11602114.332097107.290669110.100745108.338023108.5764795.30%
listGetAndIterate110.51841114.477101116.979456113.991656105.17187108.271787105.472538106.3053987.23%
listUpdate446.459229464.632036451.062437454.051234459.690066472.406343471.146269467.747559-2.93%
mapAdd342.679319328.865235331.80544334.449998377.897566395.306567388.034114387.079416-13.60%
mapContains43.02230841.73088642.78191542.51170341.55285742.43321742.38554242.1238720.92%
mapEntries307.232009308.416811312.635165309.427995296.845214303.213684301.87162300.6435062.92%
mapGet43.6099243.52694344.93493144.023931341.77190741.95509742.50260642.07653674.63%
mapIsEmpty38.45956838.82762138.86335938.716849337.62688838.6907338.2172838.17829931.41%
mapIterator308.022383313.170994309.486335310.226571296.904146304.785345299.106992300.2654943.32%
mapKeys312.194847313.422331312.004277312.540485298.532292309.173538299.409632302.3718213.36%
mapPutAll117.615508115.833148116.925634116.79143111.550326115.296096113.288307113.3782433.01%
mapRemove350.185313356.262055356.58919354.345519382.432421398.891158383.474288388.265956-8.74%
mapUpdate326.56323343.909292337.205441335.892654379.508134377.136306388.68193381.775457-12.02%
mapValues311.106842311.309557311.582689311.333029300.050187295.121068307.620506300.9305873.46%
valueAdd336.48586331.392677343.310332337.062956382.798278379.32724392.866154384.997224-12.45%
valueGet554.671769562.49161546.083539554.415639563.677194594.294875596.363248584.778439-5.19%
valueUpdate325.541024337.096711353.488646338.708794383.990973396.756718396.228346392.325346-13.67%

Overall, We still think that compared to the huge benefits brought by the upgraded version, the slight regression in write performance is still acceptable since the write overhead is a very small part of the overall.

Rejected Alternatives

  • Upgrade FRocksDB to 8.11.x or 9.X version
    It has been some time since we started upgrading FRocksDB. At that time, the latest release version of RocksDB was 8.10.0, but now RocksDB has released several new versions. However, we still plan to upgrade to 8.10.0 in this upgrade. The main reason is that
    When we upgrade, we need to do a lot of adaptation, testing, and bug fixing work, which requires a lot of time and effort. In later released versions of RocksDB, we did not see such attractive features that are worth repeating these job. In fact, we have also conducted quick benchmark validation based on the latest version 9.X, and its performance is similar to 8.10.0. So we believe there is no need to upgrade to other versions.

   


  • No labels