Status
Current state: Under Discussion
Discussion thread: old thread, (ongoing) thread
JIRA: KAFKA-13361
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
As a follow-up to KIP-390: Support Compression Level, this proposal adds per-codec configuration options to the Producer, Broker, and Topic configurations, enabling fine-tuning of compression behavior.
Modern compression codecs have intrinsic mechanisms with numerous tunable parameters, enabling them to adapt to different data patterns, size constraints, and latency requirements. Until now, Kafka has only allowed compression with each codec's default settings, with compression level (introduced in KIP-390) being the sole tunable parameter.
Fine-tuning compression is increasingly important. As event-driven architectures scale, multi-region Kafka deployments become more common, and Kafka shifts from a simple messaging system to a source of truth for long-lived data, compression strategy has a direct impact on network transfer efficiency, disk usage, and retention periods. For example, a user may prioritize speed when compressing messages into a regional Kafka cluster, but maximum ratio when storing the same messages in an aggregation cluster for long-term retention.
To ensure Kafka exposes only meaningful and safe-to-use knobs, I have evaluated and tested the available parameters of each supported codec (GZIP, Snappy, LZ4, and Zstd), eliminating those with negligible impact or potential for instability. The resulting set represents the most impactful and broadly applicable parameters for tuning performance and compression ratio in real-world Kafka workloads.
This feature enables the use cases described above while maintaining Kafka’s operational simplicity.
Public Interfaces
This KIP exposes per-codec compression tuning options on the Producer/Topic/Broker. Ranges/defaults match the validators and defaults defined in CompressionType.
| Config key | Purpose | Valid values | Default |
|---|---|---|---|
| compression.gzip.buffer | GZIP I/O buffer size (bytes). | ≥ 512 | 8192 |
| compression.gzip.strategy | GZIP deflater strategy. | 0=DEFAULT, 1=FILTERED, 2=HUFFMAN_ONLY | 0 |
| compression.lz4.block | LZ4 block size code. | 4–7 → {4: 64 KB, 5: 256 KB, 6: 1 MB, 7: 4 MB} | 4 (64 KB) |
| compression.snappy.block | Snappy block size (bytes). | 1024 – 536,870,912 | 32768 (32 KB) |
| compression.zstd.window | Zstd long-mode window log (memory/ratio trade-off). | 0 (disabled) or 10–27 | 0 |
| compression.zstd.workers | Zstd worker threads. | 0 (single-threaded) or 4–16 | 0 |
Two additional notes: the GZIP strategy is restricted to the three JDK-provided strategies, and for Zstd workers, 0 keeps the current single-threaded behavior while any value greater than 0 enables multi-threaded compression (higher throughput at the cost of more memory and CPU).
Although these parameters differ per codec, they have one thing in common from the point of view of the compression process: each affects the memory footprint during compression.
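To make the LZ4 row in the table concrete: the block-size codes map to byte sizes via the LZ4 frame-format rule size = 1 << (2 * code + 8). The sketch below is illustrative only (it is not part of the proposed Kafka code):

```java
public class Lz4BlockSize {
    // Illustrative sketch: map an LZ4 block-size code (compression.lz4.block)
    // to its size in bytes using the LZ4 frame-format rule
    // size = 1 << (2 * code + 8), so codes 4..7 give 64 KB, 256 KB, 1 MB, 4 MB.
    static int lz4BlockSizeBytes(int blockCode) {
        if (blockCode < 4 || blockCode > 7) {
            throw new IllegalArgumentException("compression.lz4.block must be in [4, 7]");
        }
        return 1 << (2 * blockCode + 8);
    }

    public static void main(String[] args) {
        for (int code = 4; code <= 7; code++) {
            // 4 -> 65536 bytes, 5 -> 262144, 6 -> 1048576, 7 -> 4194304
            System.out.println(code + " -> " + lz4BlockSizeBytes(code) + " bytes");
        }
    }
}
```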
Proposed Changes
Producer
The producer will support per-codec fine-tuning through additional configuration keys. All new configs have defaults matching Kafka’s current behavior, ensuring zero change for users who do not explicitly set them. All applicable parameters can be set together, letting users trade off speed, ratio, memory, and CPU usage depending on workload and deployment. Default values are chosen from codec maintainers’ recommendations and match the behavior before this KIP.
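As a sketch of how a user might combine these keys, the snippet below builds a producer configuration with the proposed Zstd options. The property names are the ones from the table above; passing them to a real KafkaProducer assumes this KIP is implemented, so a plain java.util.Properties is used here to keep the example self-contained.

```java
import java.util.Properties;

public class TunedProducerConfig {
    // Sketch of a producer configuration using the per-codec keys proposed
    // in this KIP (illustrative; not part of the actual Kafka code).
    public static Properties props() {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092");
        p.put("compression.type", "zstd");
        p.put("compression.zstd.level", "6");    // level, from KIP-390
        p.put("compression.zstd.window", "24");  // long-mode window log; 0 disables
        p.put("compression.zstd.workers", "4");  // >0 enables multi-threaded compression
        return p;
    }

    public static void main(String[] args) {
        props().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

In a real application these Properties would simply be handed to `new KafkaProducer<>(props)`; users who omit the new keys get exactly the pre-KIP behavior.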
Broker
As today, the broker will use the newly introduced compression settings when it re-compresses a delivered record batch. The one thing to note is that the broker re-compresses a batch only when it was compressed with a different codec than the topic's compression setting; the detailed per-codec configurations are not considered. For example, if a batch is compressed with GZIP at level 1 and the topic's compression config is GZIP at level 9, the broker does not re-compress the batch, since the codec is the same.
Whether to re-compress is determined solely by a codec mismatch, exactly as today. The reasoning behind this behavior is discussed in the Rejected Alternatives section below.
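The decision rule described above can be captured in a few lines. The helper below is hypothetical (it is not the actual broker code path) and exists only to make the rule explicit:

```java
public class RecompressionCheck {
    // Hypothetical helper capturing the broker rule described above:
    // re-compression depends only on a codec mismatch; per-codec options
    // such as level, buffer size, or window log are never compared.
    static boolean needsRecompression(String batchCodec, String topicCodec) {
        if (topicCodec.equals("producer")) {
            return false; // topic keeps whatever codec the producer used
        }
        return !batchCodec.equals(topicCodec);
    }

    public static void main(String[] args) {
        // GZIP level 1 batch arriving at a GZIP level 9 topic: same codec, no work.
        System.out.println(needsRecompression("gzip", "gzip")); // false
        System.out.println(needsRecompression("gzip", "zstd")); // true
    }
}
```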
Consumer
If the user uses GZIP, the buffer size of the BufferedInputStream that reads decompressed data from GZIPInputStream is controlled by `compression.gzip.buffer`. In all other cases, there are no explicit changes on the consumer side.
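The read path above can be sketched with plain JDK classes. This is an illustration of the stream layering (a BufferedInputStream, sized by `compression.gzip.buffer`, wrapping a GZIPInputStream), not the actual consumer code:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipBufferSketch {
    // Sketch of the consumer read path: the buffer size passed here plays the
    // role of compression.gzip.buffer (default 8192 per this KIP).
    static byte[] decompress(byte[] gzipped, int bufferSize) {
        try (InputStream in = new BufferedInputStream(
                new GZIPInputStream(new ByteArrayInputStream(gzipped)), bufferSize)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[bufferSize];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static byte[] compress(byte[] raw) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = "hello kafka".getBytes(StandardCharsets.UTF_8);
        byte[] roundTrip = decompress(compress(raw), 8192); // 8192 = default
        System.out.println(new String(roundTrip, StandardCharsets.UTF_8)); // hello kafka
    }
}
```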
Benchmarks
To evaluate both compression ratio and end-to-end encode speed on realistic payloads, I added a purpose-built microbenchmark, TestCompression (org.apache.kafka.jmh.log.TestCompression). This complements the traditional TestLinearWriteSpeed tool:
- TestLinearWriteSpeed focuses on the disk write path of already-compressed batches; it does not report compressed size or the time spent compressing.
- TestCompression measures the compression work itself.
What TestCompression measures
For each run/configuration it reports:
- Compressed size: average bytes per batch (compressedAvgBytes)
- Uncompressed baseline: average bytes per batch (uncompressedAvgBytes)
- Compression ratio: compressed / uncompressed
- Throughput (MB/s): converting uncompressed input to MemoryRecords with the chosen codec (e.g., zstd, snappy); average, median, and best of N runs (mbps_avg, mbps_median, mbps_best)
It supports four data modes to reflect different realities:
random (low redundancy), zeros (max redundancy), mixed (configurable % of zeros), and real-world data, i.e., debezium (Connect JSON events with optional schemas).
How it works (at a glance)
- Builds batches of SimpleRecords from files or synthetic payloads.
- Encodes them into MemoryRecords using Compression builders wired to producer/topic/broker-style options.
- Repeats for multiple runs (with optional warmup) and summarizes.
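The metrics above can be sketched in miniature with the JDK's Deflater standing in for Kafka's codec wiring. This is an illustration of what "ratio" means in the reports below, not the actual TestCompression code:

```java
import java.util.Random;
import java.util.zip.Deflater;

public class RatioSketch {
    // Illustrative sketch of the ratio metric TestCompression reports:
    // compress a payload and return compressed / uncompressed bytes.
    // Uses the JDK Deflater (zlib) in place of Kafka's Compression builders.
    static double compressionRatio(byte[] payload, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(payload);
        deflater.finish();
        byte[] out = new byte[payload.length + 1024]; // slack for incompressible input
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(out, total, out.length - total);
        }
        deflater.end();
        return (double) total / payload.length;
    }

    public static void main(String[] args) {
        byte[] zeros = new byte[64 * 1024];   // "zeros" mode: max redundancy
        byte[] random = new byte[64 * 1024];  // "random" mode: low redundancy
        new Random(42).nextBytes(random);
        // Zeros compress to nearly nothing; random data stays near ratio 1.0.
        System.out.printf("zeros ratio=%.3f random ratio=%.3f%n",
                compressionRatio(zeros, 6), compressionRatio(random, 6));
    }
}
```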
Running the benchmark
Here is a very simple configuration using gzip compression together with the additional gzip tunables (level, buffer, strategy), run against randomly generated data:

```
# Random 1 KiB messages, gzip level=6 buffer=32768 strategy=DEFAULT
java -cp ... org.apache.kafka.jmh.log.TestCompression \
  --compression gzip \
  --compression-property compression.gzip.level=6 \
  --compression-property compression.gzip.buffer=32768 \
  --compression-property compression.gzip.strategy=0 \
  --batch-size 10 --batch-count 1000 \
  --msg-size 1024 \
  --data random \
  --runs 5 --warmup 1 \
  --csv results.csv
```
or debezium-shaped data:
```
java -cp ... org.apache.kafka.jmh.log.TestCompression \
  --compression zstd \
  --compression-property compression.zstd.level=6 \
  --compression-property compression.zstd.window=24 \
  --compression-property compression.zstd.workers=4 \
  --data debezium --dbz-event UPDATE --dbz-max-fields 10 --dbz-schemas false \
  --batch-size 10 --batch-count 1000 --runs 5 --warmup 1 \
  --csv results.csv
```
The benchmark can also be run from an IDE; just update the program arguments in your run/debug configuration. For instance, if one runs:

```
--compression gzip --runs 1 --warmup 1 --matrix true --matrix-algos snappy
```
It would print out:
```
MATRIX sweep: algos=[snappy], data=[RANDOM, ZEROS, MIXED, DEBEZIUM], preset=full, runs=1 warmup=0, batchCount=1000 batchSize=10
[                                        ]   0% 0/20 | elapsed 00:00:00 | eta --:--:-- | starting…
Starting TestCompression:
  Compression codec : snappy
  Batch count       : 1000
  Batch size        : 10
  Message size      : 1024
  Data source       : random
  Runs & Warmup     : 1 & 0
  Codec config      : snappy.block=8192
run  mb_sec   uncompressed_avg_bytes  compressed_avg_bytes  ratio
1    30.954   10391.0                 10425.0               1.003
...
...
...
[=====================================>  ]  95% 19/20 | elapsed 00:00:01 | eta 00:00:00 | snappy | debezium | blk=65536
Starting TestCompression:
  Compression codec : snappy
  Batch count       : 1000
  Batch size        : 10
  Message size      : -
  Data source       : debezium(json) [event=UPDATE, maxFields=10, schemas=false]
  Runs & Warmup     : 1 & 0
  Codec config      : snappy.block=131072
run  mb_sec   uncompressed_avg_bytes  compressed_avg_bytes  ratio
1    852.616  6755.8                  2197.1                0.325
Uncompressed Size (avg): 6755.8 bytes.
Compressed Size (avg)  : 2197.1 bytes.
Compression Ratio      : 0.325 (67.5% smaller)
[=======================================>] 100% 20/20 | elapsed 00:00:01 | eta 00:00:00 | snappy | debezium | blk=131072
```
This benchmarking framework allows evaluating the complete configuration matrix — spanning all compression algorithms × data types × per-codec configuration combinations × multiple runs with warm-up iterations. This results in a substantial search space (currently 368 configurations). Some combinations are impractical in real-world scenarios — for example, Zstd at level 22 delivers extremely high compression ratios but is unsuitable when low-latency compression is required.
Benchmarking Results
Test System Specifications
- Kernel: Darwin 24.5.0
- CPU: Apple M3 Pro (12 cores) @ 4.06 GHz
- GPU: Apple M3 Pro (18-core integrated) @ 1.38 GHz
- Memory: 36 GiB
- Swap: Disabled
To give a complete picture, each codec is benchmarked for both:
- Throughput (MB/s) – how quickly the algorithm processes data.
- Compression Ratio – how much the data size is reduced.
For each codec (Zstd, LZ4, Gzip, Snappy), I have provided a pair of graphs:
- Graphs 1: Throughput vs. configuration
- Graphs 2: Compression Ratio vs. configuration
This dual view makes the trade-offs easier to understand. For instance:
- Zstd at high levels greatly increases ratio but reduces speed.
- LZ4 maintains high speed but offers modest compression gains.
- Gzip balances both but is sensitive to buffer tuning.
- Snappy favors speed with minimal ratio change.
The graphs below illustrate how quickly each algorithm processes data under different configurations (note: the green portion marks each codec's default configuration):
Graphs 1: Throughput vs. configuration
ZSTD
- demonstrates the trade-off between compression level and throughput, with extreme levels (e.g., 22) offering minimal speed but maximal compression ratio.
LZ4
- maintains consistently high throughput, with block size adjustments having minor performance impact.
GZIP
- sensitive to buffer size, showing substantial variance in throughput depending on configuration.
Snappy
- optimized for speed, with block size influencing throughput less drastically than in other codecs.
The graphs below illustrate how effectively each algorithm reduces data size under different configurations:
Graphs 2: Compression Ratio vs. configuration
ZSTD
- best ratios at higher levels, but gains diminish after mid-range.
LZ4
- stable for most levels, slight improvement at mid-range.
GZIP
- consistent ratios, minor gains with buffer changes.
Snappy
- stable ratios regardless of block size.
Across all datasets, Random data remains incompressible (ratio ≈ 1.0), Zeros compress nearly perfectly (≈ 0), and Mixed stays around 0.5 (the generator produces 50% random and 50% zeros). Debezium shows codec-dependent variability.
The attached KIP-780-benchmarks.xlsx contains all benchmark results; the most important sheet is KIP-780-final, which holds all the data behind the graphs above.
Summary & Recommendations
From the combined compression ratio and throughput results:
| Codec | Best Ratio | Best Throughput | Trade-off & Recommended Use |
|---|---|---|---|
| Zstd | Excellent at high levels (esp. ≥ 10) | Moderate at low levels, drops fast with higher levels | Use when compression ratio is critical and some latency is acceptable. Lower levels (1–3) give a good speed–ratio balance. |
| Snappy | Modest, stable across settings | Extremely high (4–5 GB/s) | Use for near real-time streaming or when CPU cost must be minimal. Ratio gains are negligible. |
| LZ4 | Moderate, stable | High, with spikes at certain parameter combos | Use when both speed and reasonable compression are needed; a solid default for mixed workloads. |
| Gzip | Good, but not much better than LZ4/Zstd at low levels | Lowest of all tested | Use when portability and wide compatibility matter, or when CPU is abundant and speed is less important. |
In other words: if speed is key for you, choose Snappy or LZ4. If space savings matter most, choose Zstd at higher levels. Gzip is legacy but still useful for compatibility with older tools and environments. LZ4's default configuration balances the two (low latency plus space savings).
Know your DATA! If the dataset is already highly random (e.g., encrypted or compressed), no algorithm will improve the ratio; compression only wastes CPU. If it contains long runs of identical or repetitive patterns (e.g., zeros, logs, telemetry), higher compression ratios are possible, and more aggressive codecs like Zstd or Gzip will pay off.
Compatibility, Deprecation, and Migration Plan
Since the default values of newly introduced options are all identical to the currently used values, there are no compatibility or migration issues.
Rejected Alternatives
None.







