Status
Current state: Accepted
Discussion thread: here
Vote Thread: here
JIRA: here
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
Apache Kafka currently exposes only 2 out of 7 available Linux I/O metrics from /proc/self/io (read_bytes and write_bytes) as linux-disk-read-bytes and linux-disk-write-bytes, preventing operators from understanding critical I/O behaviour in production.
The 5 missing metrics (rchar, wchar, syscr, syscw, and cancelled_write_bytes) are essential for diagnosing performance issues and optimising Kafka deployments. Without them, operators cannot calculate cache hit ratios to understand page cache effectiveness (rchar vs read_bytes), detect write amplification from filesystem overhead (write_bytes vs wchar), identify inefficient I/O patterns from excessive system calls, or monitor log compaction activity. These blind spots make troubleshooting significantly harder.
The proposed change exposes all 7 metrics via JMX with low overhead: the data is already being read from the kernel, so we are simply parsing and exposing the 5 additional fields that are currently discarded.
This brings Kafka's I/O observability in line with widely deployed data systems such as PostgreSQL and MySQL, enabling operators to proactively monitor I/O health, troubleshoot issues faster, optimise capacity planning by understanding actual storage requirements including amplification, and reduce costs by making data-driven infrastructure decisions.
The change is purely additive, backward compatible, and provides immediate value to anyone running Kafka on Linux in production.
Proposed Changes
Extend LinuxIoMetricsCollector to parse and expose all 7 metrics from /proc/self/io instead of just 2.
- Currently exposed metrics:
- read_bytes - Bytes read from storage (actual disk I/O)
- write_bytes - Bytes written to storage (actual disk I/O)
- Newly exposed metrics proposed by this KIP:
- rchar - Total bytes read (including page cache)
- wchar - Total bytes written (including buffered writes)
- syscr - Number of read system calls
- syscw - Number of write system calls
- cancelled_write_bytes - Bytes cancelled before write (truncations)
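The parsing step described above can be sketched roughly as follows. This is a minimal illustration, not the actual LinuxIoMetricsCollector code: the class name `ProcIoParser` and its `parse` method are hypothetical, and the sketch assumes each line of /proc/self/io has the form `name: value`.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not Kafka's LinuxIoMetricsCollector.
final class ProcIoParser {
    // The seven field names as they appear in /proc/self/io.
    static final List<String> FIELDS = List.of(
        "rchar", "wchar", "syscr", "syscw",
        "read_bytes", "write_bytes", "cancelled_write_bytes");

    // Parses "name: value" lines; unknown or malformed lines are skipped.
    static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> values = new LinkedHashMap<>();
        for (String line : lines) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String name = line.substring(0, colon).trim();
            if (!FIELDS.contains(name)) {
                continue;
            }
            try {
                values.put(name, Long.parseLong(line.substring(colon + 1).trim()));
            } catch (NumberFormatException e) {
                // Leave the field absent rather than guessing a value.
            }
        }
        return values;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "rchar: 4052", "wchar: 3000", "syscr: 13", "syscw: 10",
            "read_bytes: 0", "write_bytes: 0", "cancelled_write_bytes: 0");
        System.out.println(parse(sample));
    }
}
```

Returning a map with absent entries for unparseable fields lets the caller decide whether a partially readable file should still register metrics.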
New Metrics Overview
- linux-disk-rchar (Total Characters Read)
- What it measures: Total bytes read by the Kafka process, including data served from the Linux page cache.
- Why it matters:
- Represents the logical read volume.
- Includes both cache hits and disk reads
- Essential for calculating cache effectiveness
- Key insight: Cache Hit Ratio = (rchar - read_bytes) / rchar
- Example interpretation:
- rchar = 10 GB, read_bytes = 100 MB → 99% cache hit ratio (excellent!)
- rchar = 1 GB, read_bytes = 900 MB → 10% cache hit ratio (poor, consider adding RAM)
- linux-disk-wchar (Total Characters Written)
- What it measures: Total bytes written by the Kafka process, including buffered writes that may not have reached disk yet.
- Why it matters:
- Represents the logical write volume.
- Combined with write_bytes, reveals write amplification
- Critical for capacity planning and storage sizing
- Key insight: Write Amplification = write_bytes / wchar
- Example interpretation:
- wchar = 1 GB, write_bytes = 1 GB → 1.0x amplification (ideal, no overhead)
- wchar = 1 GB, write_bytes = 3 GB → 3.0x amplification (filesystem/RAID overhead)
- linux-disk-syscr (Read System Call Count)
- What it measures: Total number of read-related system calls (read, pread, readv, etc.) made by the Kafka process.
- Why it matters:
- High syscall count with low byte count indicates inefficient I/O patterns.
- Many small reads have high kernel overhead.
- Essential for optimizing I/O batching.
- Key insight: Average Read Size = read_bytes / syscr
- Example interpretation:
- read_bytes = 1 GB, syscr = 10,000 → 100 KB per syscall (good batching)
- read_bytes = 1 GB, syscr = 10,000,000 → 100 bytes per syscall (poor, tune fetch.min.bytes)
- Optimization target: Larger average read sizes reduce syscall overhead and improve throughput.
- linux-disk-syscw (Write System Call Count)
- What it measures: Total number of write-related system calls (write, pwrite, writev, etc.) made by the Kafka process.
- Why it matters:
- Reveals write batching effectiveness.
- High syscall count indicates potential for optimisation.
- Correlates with CPU overhead from kernel transitions.
- Key insight: Average Write Size = write_bytes / syscw
- Example interpretation:
- write_bytes = 500 MB, syscw = 1,000 → 500 KB per syscall (excellent batching)
- write_bytes = 500 MB, syscw = 500,000 → 1 KB per syscall (poor, check batch.size configuration)
- Optimization target: Kafka batch writes should result in large average write sizes.
- linux-disk-cancelled-write-bytes (Cancelled Write Bytes)
- What it measures: Bytes that were queued for writing but were cancelled before reaching storage, typically due to truncation operations.
- Why it matters
- Indicates log compaction and cleanup activity
- Reveals aggressive truncation patterns
- Helps diagnose write cancellation behaviour during crashes or forced shutdowns
- Key insight:
- High values indicate
- Active log compaction (log.cleaner)
- Frequent topic deletions
- Log retention policies causing truncation
- Potential inefficiency if very high relative to actual writes
- Example interpretation
- cancelled_write_bytes = 10 MB, write_bytes = 1 GB → 1% cancelled (normal)
- cancelled_write_bytes = 500 MB, write_bytes = 1 GB → 50% cancelled (investigate compaction settings)
- Use cases
- Monitor log compaction effectiveness
- Detect unexpected truncation activity
- Correlate with broker restarts or crashes
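The key-insight formulas in this section can be sketched as plain arithmetic. The snippet below is an illustration under stated assumptions: the class `IoRatios` and its method names are hypothetical, the formulas are taken as written above, and the sample numbers use decimal units (1 GB = 10^9 bytes). Since the counters only ever increase, a monitoring system would normally apply the ratios to the growth between two samples rather than to the raw totals, which is what `delta` is for.

```java
// Hypothetical helper for illustration; not part of Kafka.
final class IoRatios {
    // Growth of a monotonically increasing counter between two samples.
    static long delta(long earlier, long later) {
        return later - earlier;
    }

    // Division that yields NaN instead of failing when nothing was counted.
    static double ratio(long numerator, long denominator) {
        return denominator == 0 ? Double.NaN : (double) numerator / denominator;
    }

    // (rchar - read_bytes) / rchar
    static double cacheHitRatio(long rchar, long readBytes) {
        return ratio(rchar - readBytes, rchar);
    }

    // write_bytes / wchar
    static double writeAmplification(long writeBytes, long wchar) {
        return ratio(writeBytes, wchar);
    }

    // read_bytes / syscr
    static double averageReadSize(long readBytes, long syscr) {
        return ratio(readBytes, syscr);
    }

    // write_bytes / syscw
    static double averageWriteSize(long writeBytes, long syscw) {
        return ratio(writeBytes, syscw);
    }

    // cancelled_write_bytes / write_bytes
    static double cancelledFraction(long cancelledWriteBytes, long writeBytes) {
        return ratio(cancelledWriteBytes, writeBytes);
    }

    public static void main(String[] args) {
        // Two made-up samples of rchar and read_bytes taken some interval apart.
        long rchar = delta(2_000_000_000L, 12_000_000_000L); // 10 GB read logically
        long readBytes = delta(50_000_000L, 150_000_000L);   // 100 MB read from storage
        System.out.printf("cache hit ratio: %.2f%n", cacheHitRatio(rchar, readBytes));
        System.out.printf("write amplification: %.1fx%n",
            writeAmplification(3_000_000_000L, 1_000_000_000L));
        System.out.printf("average write size: %.0f bytes%n",
            averageWriteSize(500_000_000L, 1_000L));
    }
}
```

Returning NaN for an empty interval is one possible choice; a dashboard could equally drop the data point.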
Metric Semantics and Behaviour
| Scenario | rchar | wchar | syscr | syscw | cancelled_write_bytes |
|---|---|---|---|---|---|
| Normal steady-state operation | Available (MB-GB range) | Available (MB-GB range) | Available (thousands-millions) | Available (thousands-millions) | Available (low MB range) |
| Broker cold start (empty cache) | Equals read_bytes initially | Available | Available | Available | Minimal (near zero) |
| High cache hit ratio (>90%) | Much larger than read_bytes | Available | Available | Available | Available |
| Low cache hit ratio (<20%) | Close to read_bytes value | Available | Available | Available | Available |
| Write amplification scenario | Available | Much smaller than write_bytes | Available | Available | Available |
| Optimal I/O batching | Available | Available | Low count (large avg size) | Low count (large avg size) | Available |
| Poor I/O batching (tiny reads/writes) | Available | Available | Very high count (small avg size) | Very high count (small avg size) | Available |
| Active log compaction | Available | Available | Available | Available | High (MB-GB range) |
| No log compaction activity | Available | Available | Available | Available | Low (near zero) |
| Topic deletion in progress | Available | Available | Available | Available | Elevated (MB range) |
| Non-Linux platform | N/A | N/A | N/A | N/A | N/A |
| Linux platform without /proc/self/io | N/A | N/A | N/A | N/A | N/A |
Metric Cardinality:
- Question: How many metric instances will this create per broker?
- The proposed I/O metrics have low cardinality - specifically 7 metrics per broker process.
- These metrics are process-level counters, not topic-level (like `BytesInPerSec`), partition-level (like `LogEndOffset`), or replica-level (like `UnderReplicatedPartitions`) metrics.
- Cardinality Breakdown - Per Broker Process:
- `linux-disk-read-bytes`: 1 metric (existing)
- `linux-disk-write-bytes`: 1 metric (existing)
- `linux-disk-rchar`: 1 metric (new)
- `linux-disk-wchar`: 1 metric (new)
- `linux-disk-syscr`: 1 metric (new)
- `linux-disk-syscw`: 1 metric (new)
- `linux-disk-cancelled-write-bytes`: 1 metric (new)
Metric Behaviour During Operations:
1. During Normal Operation: All metrics are monotonically increasing counters.
2. During Partition Reassignment: Metrics reflect total process I/O per broker, regardless of partition ownership.
3. Metric collection frequency - how often are these updated?
Kernel-Side Update Frequency
Source: /proc/self/io (procfs virtual filesystem)
- Update mechanism: Updated by Linux kernel in real-time as I/O operations occur
- Granularity: Every I/O syscall updates the relevant counters immediately
- Atomicity: All 7 fields are obtained in a single read of the file, so each collection reflects one snapshot rather than seven separate reads
Example: The kernel updates these instantly as I/O happens:
rchar: 4052 # Updated on every read() syscall
wchar: 3000 # Updated on every write() syscall
syscr: 13 # Incremented on each read syscall
syscw: 10 # Incremented on each write syscall
read_bytes: 0 # Updated when data physically read from disk
write_bytes: 0 # Updated when data physically written to disk
cancelled_write_bytes: 0 # Updated when buffered writes are cancelled
4. Does the Linux kernel version affect the availability of these metrics?
- All metrics require Linux kernel 2.6.20+ (released February 2007)
- Bottom Line:
- Virtually all modern Linux systems support all 7 metrics
- Even 10+ year old kernels have full support
- No distribution-specific patches or backports needed
Graceful Degradation:
- Non-Linux systems: Metrics simply not available (existing behavior)
- Linux without procfs: Detected and disabled automatically
- Containers without /proc mounted: Detected and disabled
- No errors, no exceptions - just a debug log message
5. Container/Virtualization Impact
Docker/Kubernetes:
- /proc/self/io works correctly in containers by default
- No special configuration needed - procfs is mounted automatically
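The graceful-degradation behaviour described above (detect, then disable without throwing) can be sketched along these lines. This is an assumption-laden illustration rather than the collector's real logic: `ProcIoAvailability`, `hasAllFields`, and `usable` are hypothetical names, and the sketch prints to standard output where a real implementation would emit a debug log message.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical availability probe; class and method names are illustrative only.
final class ProcIoAvailability {
    static final List<String> REQUIRED = List.of(
        "rchar", "wchar", "syscr", "syscw",
        "read_bytes", "write_bytes", "cancelled_write_bytes");

    // True when every expected "name:" prefix appears at the start of some line.
    static boolean hasAllFields(List<String> lines) {
        for (String field : REQUIRED) {
            String prefix = field + ":";
            if (lines.stream().noneMatch(line -> line.startsWith(prefix))) {
                return false;
            }
        }
        return true;
    }

    // Never throws: a missing or unreadable file simply means "unavailable".
    static boolean usable(Path procIo) {
        try {
            return hasAllFields(Files.readAllLines(procIo));
        } catch (IOException | SecurityException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        boolean ok = usable(Paths.get("/proc/self/io"));
        System.out.println(ok
            ? "registering linux-disk-* metrics"
            : "I/O metrics unavailable; skipping registration");
    }
}
```

On a non-Linux host or in a container without /proc mounted, `usable` returns false and the caller can skip metric registration.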
Public Interfaces
JMX Metrics
Scope Note: The `linux-disk-*` prefix is a historical naming convention inherited
from the two existing metrics. All seven metrics in this group reflect aggregate I/O
for the Kafka JVM process across all configured `log.dirs`. They do not provide
per-disk or per-log-directory visibility.
All metrics are exposed under:
- Brokers: kafka.server:type=KafkaServer,name=<metric-name>
- Controllers: kafka.server:type=ControllerServer,name=<metric-name>
In combined mode (broker + controller on the same node), both sets of metrics are registered, but they share the same underlying LinuxIoMetricsCollector reading from the same /proc/self/io, so the values reflect aggregate process-level I/O.
For example (existing metric):
Broker: kafka.server type=KafkaServer name=linux-disk-read-bytes Value=16384 JMXTool=1.5.3
Controller: kafka.server type=ControllerServer name=linux-disk-read-bytes Value=16384 JMXTool=1.5.3
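To show how a client might address one of these MBeans, here is a small self-contained sketch. It does not connect to a broker: it registers a stand-in MBean (the hypothetical `FixedGauge`) in the local platform MBean server under the broker naming pattern above and reads its `Value` attribute, the attribute name shown in the sample output. Reading from a real broker would go through a remote JMX connection instead, which is not shown here.

```java
import java.lang.management.ManagementFactory;
import javax.management.Attribute;
import javax.management.AttributeList;
import javax.management.AttributeNotFoundException;
import javax.management.DynamicMBean;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanInfo;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical sketch; FixedGauge only stands in for a real Kafka gauge.
final class JmxReadSketch {
    // Exposes a single read-only "Value" attribute.
    static final class FixedGauge implements DynamicMBean {
        private final long value;

        FixedGauge(long value) {
            this.value = value;
        }

        public Object getAttribute(String name) throws AttributeNotFoundException {
            if ("Value".equals(name)) {
                return value;
            }
            throw new AttributeNotFoundException(name);
        }

        public void setAttribute(Attribute attribute) throws AttributeNotFoundException {
            throw new AttributeNotFoundException("read-only");
        }

        public AttributeList getAttributes(String[] names) {
            AttributeList list = new AttributeList();
            for (String n : names) {
                if ("Value".equals(n)) {
                    list.add(new Attribute(n, value));
                }
            }
            return list;
        }

        public AttributeList setAttributes(AttributeList attributes) {
            return new AttributeList();
        }

        public Object invoke(String action, Object[] params, String[] signature) {
            return null; // no operations are exposed
        }

        public MBeanInfo getMBeanInfo() {
            MBeanAttributeInfo info = new MBeanAttributeInfo(
                "Value", "java.lang.Long", "gauge value", true, false, false);
            return new MBeanInfo(FixedGauge.class.getName(), "stand-in gauge",
                new MBeanAttributeInfo[] {info}, null, null, null);
        }
    }

    // Builds the broker-side object name for a given metric.
    static ObjectName brokerMetric(String metricName) throws Exception {
        return new ObjectName("kafka.server:type=KafkaServer,name=" + metricName);
    }

    static long readValue(MBeanServer server, ObjectName name) throws Exception {
        return ((Number) server.getAttribute(name, "Value")).longValue();
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = brokerMetric("linux-disk-rchar");
        server.registerMBean(new FixedGauge(4052L), name);
        System.out.println(name + " = " + readValue(server, name));
        server.unregisterMBean(name);
    }
}
```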
New Metrics:
Metric Name: linux-disk-rchar
Type: Gauge
Unit: Bytes
Description: Total bytes read (including page cache hits)
Use Case: Calculate cache hit ratio: (rchar - read_bytes) / rchar
────────────────────────────────────────
Metric Name: linux-disk-wchar
Type: Gauge
Unit: Bytes
Description: Total bytes written (including buffered writes)
Use Case: Detect write amplification: write_bytes / wchar
────────────────────────────────────────
Metric Name: linux-disk-syscr
Type: Gauge
Unit: Count
Description: Number of read system calls
Use Case: Identify inefficient I/O patterns: read_bytes / syscr for avg read size
────────────────────────────────────────
Metric Name: linux-disk-syscw
Type: Gauge
Unit: Count
Description: Number of write system calls
Use Case: Analyze write batching: write_bytes / syscw for avg write size
────────────────────────────────────────
Metric Name: linux-disk-cancelled-write-bytes
Type: Gauge
Unit: Bytes
Description: Bytes cancelled before write (truncations)
Use Case: Monitor log compaction and cleanup activity
────────────────────────────────────────
Metric Name: linux-disk-read-bytes (existing)
Type: Gauge
Unit: Bytes
Description: Bytes read from storage layer (actual disk I/O)
Use Case: Track physical disk reads
────────────────────────────────────────
Metric Name: linux-disk-write-bytes (existing)
Type: Gauge
Unit: Bytes
Description: Bytes written to storage layer (actual disk I/O)
Use Case: Track physical disk writes
Compatibility, Deprecation, and Migration Plan
Backward Compatibility - Fully backward compatible
- Existing metrics remain unchanged
- New metrics are additive only
- No changes to APIs, configurations, or protocols
- Existing dashboards and alerts continue to work
Forward Compatibility - Forward compatible:
- If new metrics are added to /proc/self/io in future Linux kernels, they can be added following the same pattern
- No breaking changes anticipated
Migration Plan
No migration required - this is a purely additive change with no data migration, configuration changes, or operational impact.
Test Plan
The feature will be validated through:
- Unit Tests
- Test parsing all 7 metrics from mock `/proc/self/io` data
- Test with realistic production values
- Test edge cases (zeros, large numbers, missing fields)
- Integration Tests
- Verify all 7 metrics are registered in JMX on Linux systems
- Verify metrics are NOT registered on non-Linux systems
- Verify metric values update correctly over time
- Manual Testing - Validation on Real Kafka Cluster:
1. Start Kafka broker on Linux
2. Verify metrics appear in JMX:
```bash
jconsole # Connect to Kafka broker
# Navigate to kafka.server:type=KafkaServer,name=linux-disk-*
# Verify all 7 metrics are present
```
3. Generate load (produce/consume messages)
4. Verify metric values update correctly
5. Calculate cache hit ratio: `(rchar - read_bytes) / rchar`
6. Calculate write amplification: `write_bytes / wchar`
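The edge cases named in the unit-test list (zeros, large numbers, missing fields) might be exercised as in the sketch below. It is illustrative only: `ProcIoEdgeCases`, `field`, and `check` are hypothetical names, the lookup is a deliberately tiny stand-in for the real parser, and the project's actual tests would use its own test framework rather than a `main` method.

```java
import java.util.List;
import java.util.OptionalLong;

// Hypothetical edge-case checks; not the KIP's actual test code.
final class ProcIoEdgeCases {
    // Looks up one "name: value" field; empty when absent or not a signed long.
    static OptionalLong field(List<String> lines, String name) {
        String prefix = name + ":";
        for (String line : lines) {
            if (line.startsWith(prefix)) {
                try {
                    String raw = line.substring(prefix.length()).trim();
                    return OptionalLong.of(Long.parseLong(raw));
                } catch (NumberFormatException e) {
                    return OptionalLong.empty();
                }
            }
        }
        return OptionalLong.empty();
    }

    static void check(boolean condition, String message) {
        if (!condition) {
            throw new AssertionError(message);
        }
    }

    public static void main(String[] args) {
        // Zero counters, as on a freshly started process.
        check(field(List.of("read_bytes: 0"), "read_bytes").getAsLong() == 0L, "zero value");
        // The largest value a signed 64-bit long can hold.
        check(field(List.of("rchar: 9223372036854775807"), "rchar").getAsLong()
            == Long.MAX_VALUE, "large value");
        // A field that is not present in the input.
        check(!field(List.of("rchar: 1"), "syscw").isPresent(), "missing field");
        // A value that is not a number at all.
        check(!field(List.of("syscr: abc"), "syscr").isPresent(), "malformed value");
        System.out.println("all edge-case checks passed");
    }
}
```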
Rejected Alternatives
- Use External Node Exporters (e.g., Prometheus node_exporter)
Rejected because:
- Node exporters track their own process (/proc/self/io of node_exporter), not Kafka's process
- Cannot correlate with Kafka-specific operations (produce, consume, compaction)
- Requires additional infrastructure deployment and maintenance
- Metrics not integrated with Kafka's existing monitoring framework
- No automatic correlation in dashboards
- Calculate and Expose only Derived Metrics (e.g., cache_hit_ratio)
Rejected because:
- Too opinionated - different operators need different analysis
- Loses granularity - operators cannot build custom metrics
- Doesn't address all use cases (write amplification, syscall analysis, cancelled writes)
- Raw metrics are more flexible and composable
- Use Different Metric Name Prefixes: linux-cache-rchar and linux-cache-wchar (instead of linux-disk-*); linux-io-rchar for all new metrics; separate prefixes for cache vs disk metrics.
Rejected because:
- Breaks naming consistency with existing linux-disk-read-bytes and linux-disk-write-bytes
- The terms rchar, wchar, syscr, syscw are well-established Linux kernel terminology
- Grouping all LinuxIoMetricsCollector metrics under linux-disk-* is clearer for operators
- Less disruptive.
- Easier to discover related metrics (all start with linux-disk-*)
- Expose Per-Partition or Per-Topic I/O Metrics
Rejected because:
- /proc/self/io provides process-level metrics only, not per-topic/partition
- Out of scope for this KIP
- Could be a separate future enhancement if needed
- Wait for Platform-Agnostic Solution: Wait until a cross-platform I/O metrics solution is available for Linux/Windows/macOS.
Rejected because:
- Kafka production deployments are overwhelmingly on Linux
- The existing LinuxIoMetricsCollector is already Linux-specific
- No timeline for when/if other platforms would support equivalent metrics
- Users need this observability now
- Platform-specific implementations are common in Kafka (e.g., Linux-specific tuning)
- Use BPF/eBPF for Detailed I/O Tracing
Rejected because:
- BPF requires additional kernel modules and privileges
- Much higher overhead than reading /proc/self/io
- Unnecessary complexity for the use case
- /proc/self/io provides exactly the metrics we need with negligible overhead
- BPF is better suited for detailed tracing, not ongoing metrics collection