Status
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
Currently, some default config values of blocking shuffle are not good enough and to achieve good stability and performance, users have to tune several config options. For example,
- Shuffle data compression is not enabled by default, users usually need to enable data compression first;
- Sort-shuffle is not enabled by default, for large parallelism jobs, users must enable it manually;
- As reported in the user mailing list, in some scenarios (for example, data skew), the read buffer request timeout exception may occur, users need to increase the batch read memory size to solve this issue;
- The current default value of memory size per partition for sort-shuffle is pretty small, which may influence blocking shuffle performance, usually, users need to increase this value.
Public Interfaces
This FLIP only changes the default value of 4 config options related to blocking shuffle, including taskmanager.network.blocking-shuffle.compression.enabled, taskmanager.network.sort-shuffle.min-parallelism, taskmanager.memory.framework.off-heap.batch-shuffle.size, taskmanager.network.sort-shuffle.min-buffers. No new public interface will be added and pipeline streaming mode is not influenced.
Proposed Changes
Key | Current Value | Target Value | Reason |
---|---|---|---|
taskmanager.network.blocking-shuffle.compression.enabled | false | true | Usually, data compression can reduce both disk and network IO which is good for performance. At the same time, it can save storage space. |
taskmanager.network.sort-shuffle.min-parallelism | Integer.MAX_VALUE | 1 | For high parallelism batch jobs, sort-shuffle is better for both stability and performance. (We tested setting this value to 1, 128, 256, 512 and 1024 with TPC-DS and the result showed that 1 is the best one.) |
taskmanager.memory.framework.off-heap.batch-shuffle.size | 32m | 64m | Aside from the improvement in FLINK-24954, increasing this value can also help to solve the read buffer request issue. (Previously, when choosing the default value, both ‘32M' and '64M' are OK for tests and we chose the smaller one in a cautious way.) |
taskmanager.network.sort-shuffle.min-buffers | 64 | 512 | The current default value is quite modest and the performance can be influenced especially after we enable sort shuffle for large parallelism batch jobs by default. |
Compatibility, Deprecation, and Migration Plan
There should be no compatibility issues. Though the number of network buffers required per partition is increased, the default TM size can still support 8 blocking result partitions concurrently, which is big enough. For large parallelism jobs, because sort-shuffle decouples the network buffer consumption from parallelism, less network buffers are required which is good for usability.
Test Plan
- Run Flink tests multiple times (20) to ensure that the change does not influence test stability; (Already done)
- Run TPC-DS tests on a cluster to to ensure both performance and stability (no failures) will be improved. (Already done)
Rejected Alternatives
Use 128 as the default value of taskmanager.network.sort-shuffle.min-parallelism.