
Status

Current state: Accepted

Discussion thread: here

Vote Thread: here

JIRA: here

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation


Apache Kafka currently exposes only 2 of the 7 I/O counters available in /proc/self/io on Linux: read_bytes and write_bytes, published as linux-disk-read-bytes and linux-disk-write-bytes. This leaves operators unable to observe critical I/O behaviour in production.

The 5 missing metrics (rchar, wchar, syscr, syscw, and cancelled_write_bytes) are essential for diagnosing performance issues and optimising Kafka deployments. Without them, operators cannot calculate cache hit ratios to understand page cache effectiveness (rchar vs read_bytes), detect write amplification from filesystem overhead (write_bytes vs wchar), identify inefficient I/O patterns from excessive system calls, or monitor log compaction activity. These blind spots make troubleshooting significantly harder.
 
The proposed change exposes all 7 metrics via JMX with low overhead: the data is already being read from the kernel, so the change only parses and exposes the 5 additional fields that are currently discarded.

This brings Kafka's I/O observability in line with industry standards like PostgreSQL and MySQL, enabling operators to proactively monitor I/O health, troubleshoot issues faster, optimise capacity planning by understanding actual storage requirements including amplification, and reduce costs by making data-driven infrastructure decisions.

The change is purely additive, backward compatible, and provides immediate value to anyone running Kafka on Linux in production.

Proposed Changes

Extend LinuxIoMetricsCollector to parse and expose all 7 metrics from /proc/self/io instead of just 2.

  • Currently exposed metrics:
    • read_bytes - Bytes read from storage (actual disk I/O)
    • write_bytes - Bytes written to storage (actual disk I/O)
  • Newly exposed metrics proposed by this KIP:
    • rchar - Total bytes read (including page cache)
    • wchar - Total bytes written (including buffered writes)
    • syscr - Number of read system calls
    • syscw - Number of write system calls
    • cancelled_write_bytes - Bytes cancelled before write (truncations)
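
As a rough illustration of the parsing step, the sketch below reads the `key: value` lines found in /proc/self/io into a map. The class name `ProcIoParser` and its `parse` method are hypothetical and exist only for this example; this is not the code of Kafka's LinuxIoMetricsCollector, and the choice to skip malformed lines silently is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not Kafka's LinuxIoMetricsCollector. */
final class ProcIoParser {

    /** Parses "key: value" lines, as found in /proc/self/io, into a map. */
    static Map<String, Long> parse(String content) {
        Map<String, Long> fields = new LinkedHashMap<>();
        for (String line : content.split("\n")) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // skip blank or malformed lines
            }
            String key = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            try {
                fields.put(key, Long.parseLong(value));
            } catch (NumberFormatException e) {
                // Assumption: a field with a non-numeric value is ignored.
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
            "rchar: 4052", "wchar: 3000", "syscr: 13", "syscw: 10",
            "read_bytes: 0", "write_bytes: 0", "cancelled_write_bytes: 0");
        System.out.println(parse(sample));
    }
}
```

Keeping every parsed field in a map, rather than picking out two named fields, means a counter added by a future kernel would be carried along without changing the parsing loop.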

New Metrics Overview

  • linux-disk-rchar (Total Characters Read)
    • What it measures: Total bytes read by the Kafka process, including data served from the Linux page cache.
    • Why it matters:
      • Represents the logical read volume.
      • Includes both cache hits and disk reads
      • Essential for calculating cache effectiveness
    • Key insight: Cache Hit Ratio = (rchar - read_bytes) / rchar
    • Example interpretation:
      • rchar = 10 GB, read_bytes = 100 MB → 99% cache hit ratio (excellent!)
      • rchar = 1 GB, read_bytes = 900 MB → 10% cache hit ratio (poor, consider adding RAM)
  • linux-disk-wchar (Total Characters Written)
    • What it measures: Total bytes written by the Kafka process, including buffered writes that may not have reached disk yet.
    • Why it matters:
      • Represents the logical write volume.
      • Combined with write_bytes, reveals write amplification
      • Critical for capacity planning and storage sizing
    • Key insight: Write Amplification = write_bytes / wchar
    • Example interpretation:
      • wchar = 1 GB, write_bytes = 1 GB → 1.0x amplification (ideal, no overhead)
      • wchar = 1 GB, write_bytes = 3 GB → 3.0x amplification (filesystem/RAID overhead)
  • linux-disk-syscr (Read System Call Count)
    • What it measures: Total number of read-related system calls (read, pread, readv, etc.) made by the Kafka process.
    • Why it matters:
      • High syscall count with low byte count indicates inefficient I/O patterns.
      • Many small reads have high kernel overhead.
      • Essential for optimizing I/O batching.
    • Key insight: Average Read Size = read_bytes / syscr
    • Example interpretation:
      • read_bytes = 1 GB, syscr = 10,000 → 100 KB per syscall (good batching)
      • read_bytes = 1 GB, syscr = 10,000,000 → 100 bytes per syscall (poor, tune fetch.min.bytes)
    • Optimization target: Larger average read sizes reduce syscall overhead and improve throughput.
  • linux-disk-syscw (Write System Call Count)
    • What it measures: Total number of write-related system calls (write, pwrite, writev, etc.) made by the Kafka process.
    • Why it matters:
      • Reveals write batching effectiveness.
      • High syscall count indicates potential for optimisation.
      • Correlates with CPU overhead from kernel transitions.
    • Key insight: Average Write Size = write_bytes / syscw
    • Example interpretation:
      • write_bytes = 500 MB, syscw = 1,000 → 500 KB per syscall (excellent batching)
      • write_bytes = 500 MB, syscw = 500,000 → 1 KB per syscall (poor, check batch.size configuration)
    • Optimization target: Kafka batch writes should result in large average write sizes.
  • linux-disk-cancelled-write-bytes (Cancelled Write Bytes)
    • What it measures: Bytes that were queued for writing but were cancelled before reaching storage, typically due to truncation operations.
    • Why it matters:
      • Indicates log compaction and cleanup activity
      • Reveals aggressive truncation patterns
      • Helps diagnose write cancellation behaviour during crashes or forced shutdowns
    • Key insight: High values indicate
      • Active log compaction (log.cleaner)
      • Frequent topic deletions
      • Log retention policies causing truncation
      • Potential inefficiency if very high relative to actual writes
    • Example interpretation:
      • cancelled_write_bytes = 10 MB, write_bytes = 1 GB → 1% cancelled (normal)
      • cancelled_write_bytes = 500 MB, write_bytes = 1 GB → 50% cancelled (investigate compaction settings)
    • Use cases:
      • Monitor log compaction effectiveness
      • Detect unexpected truncation activity
      • Correlate with broker restarts or crashes
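
The "Key insight" formulas in the overview above can be gathered into a small helper. The sketch below assumes the raw counter values have already been obtained (for example from JMX); the class name `IoRatios`, its method names, and the choice to return NaN when the denominator is zero are all assumptions made for this illustration.

```java
/** Hypothetical helper deriving the ratios described above from raw counters. */
final class IoRatios {

    /** Cache Hit Ratio = (rchar - read_bytes) / rchar. NaN if nothing was read. */
    static double cacheHitRatio(long rchar, long readBytes) {
        return rchar == 0 ? Double.NaN : (double) (rchar - readBytes) / rchar;
    }

    /** Write Amplification = write_bytes / wchar. NaN if nothing was written. */
    static double writeAmplification(long writeBytes, long wchar) {
        return wchar == 0 ? Double.NaN : (double) writeBytes / wchar;
    }

    /** Average size per system call = bytes / syscalls. NaN if no calls were made. */
    static double averageBytesPerCall(long bytes, long syscalls) {
        return syscalls == 0 ? Double.NaN : (double) bytes / syscalls;
    }

    /** Fraction of written bytes that were cancelled = cancelled_write_bytes / write_bytes. */
    static double cancelledFraction(long cancelledWriteBytes, long writeBytes) {
        return writeBytes == 0 ? Double.NaN : (double) cancelledWriteBytes / writeBytes;
    }

    public static void main(String[] args) {
        // Values taken from the example interpretations above (1 GB = 10^9 bytes).
        System.out.println(cacheHitRatio(10_000_000_000L, 100_000_000L));
        System.out.println(writeAmplification(3_000_000_000L, 1_000_000_000L));
        System.out.println(averageBytesPerCall(1_000_000_000L, 10_000L));
        System.out.println(cancelledFraction(10_000_000L, 1_000_000_000L));
    }
}
```

Because the underlying counters are cumulative since process start, ratios computed this way describe the whole process lifetime; applying the same functions to the difference between two samples gives the ratio for that interval instead.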

Metric Semantics and Behaviour

| Scenario | rchar | wchar | syscr | syscw | cancelled_write_bytes |
|---|---|---|---|---|---|
| Normal steady-state operation | Available (MB-GB range) | Available (MB-GB range) | Available (thousands-millions) | Available (thousands-millions) | Available (low MB range) |
| Broker cold start (empty cache) | Equals read_bytes initially | Available | Available | Available | Minimal (near zero) |
| High cache hit ratio (>90%) | Much larger than read_bytes | Available | Available | Available | Available |
| Low cache hit ratio (<20%) | Close to read_bytes value | Available | Available | Available | Available |
| Write amplification scenario | Available | Much smaller than write_bytes | Available | Available | Available |
| Optimal I/O batching | Available | Available | Low count (large avg size) | Low count (large avg size) | Available |
| Poor I/O batching (tiny reads/writes) | Available | Available | Very high count (small avg size) | Very high count (small avg size) | Available |
| Active log compaction | Available | Available | Available | Available | High (MB-GB range) |
| No log compaction activity | Available | Available | Available | Available | Low (near zero) |
| Topic deletion in progress | Available | Available | Available | Available | Elevated (MB range) |
| Non-Linux platform | N/A | N/A | N/A | N/A | N/A |
| Linux platform without /proc/self/io | N/A | N/A | N/A | N/A | N/A |


Metric Cardinality:

  • Question: How many metric instances will this create per broker?
    • The proposed I/O metrics have low cardinality - specifically 7 metrics per broker process.
    • These metrics are process-level counters, not topic-level (like `BytesInPerSec`), partition-level (like `LogEndOffset`), or replica-level (like `UnderReplicatedPartitions`) metrics.
    • Cardinality Breakdown - Per Broker Process:
      • `linux-disk-read-bytes`: 1 metric (existing)
      • `linux-disk-write-bytes`: 1 metric (existing)
      • `linux-disk-rchar`: 1 metric (new)
      • `linux-disk-wchar`: 1 metric (new)
      • `linux-disk-syscr`: 1 metric (new)
      • `linux-disk-syscw`: 1 metric (new)
      • `linux-disk-cancelled-write-bytes`: 1 metric (new)


Metric Behaviour During Operations:

1. During normal operation: All metrics are monotonically increasing counters.

2. During Partition Reassignment: Metrics reflect total process I/O per broker, regardless of partition ownership.

3. Metric collection frequency - how often are these updated?
Kernel-Side Update Frequency

Source: /proc/self/io (procfs virtual filesystem)
    - Update mechanism: Updated by Linux kernel in real-time as I/O operations occur
    - Granularity: Every I/O syscall updates the relevant counters immediately
    - Atomicity: All 7 metrics in the file are read atomically in a single file read operation

Example: The kernel updates these counters immediately as I/O happens:
    rchar: 4052              # Updated on every read() syscall
    wchar: 3000              # Updated on every write() syscall
    syscr: 13                # Incremented on each read syscall
    syscw: 10                # Incremented on each write syscall
    read_bytes: 0            # Updated when data physically read from disk
    write_bytes: 0           # Updated when data physically written to disk
    cancelled_write_bytes: 0 # Updated when buffered writes are cancelled
    
4. Does the Linux kernel version affect the availability of these metrics?

   - All metrics require Linux kernel 2.6.20+ (released February 2007)
   - Bottom line:
        - Kernels shipped by current mainstream Linux distributions expose all 7 metrics
        - Even 10+ year old kernels have full support
        - No distribution-specific patches or backports needed

   Graceful degradation:
        - Non-Linux systems: Metrics simply not available (existing behavior)
        - Linux without procfs: Detected and disabled automatically
        - Containers without /proc mounted: Detected and disabled
        - No errors or exceptions are raised; only a debug message is logged

 5. Container/Virtualization Impact

        Docker/Kubernetes:
        - /proc/self/io works correctly in containers by default
        - No special configuration needed - procfs is mounted automatically
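
The graceful-degradation behaviour described above (non-Linux systems, missing procfs, containers without /proc) could be sketched roughly as follows. `ProcIoProbe` and `tryRead` are hypothetical names for this example, and the exact detection logic in Kafka's collector may differ; the point is only that an unreadable path yields an empty result instead of an exception.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

/** Hypothetical probe for illustration; names are not Kafka's. */
final class ProcIoProbe {

    /**
     * Returns the file content if the path can be read, or Optional.empty()
     * if it is missing or unreadable, so the caller can skip metric registration.
     */
    static Optional<String> tryRead(Path procIo) {
        if (!Files.isReadable(procIo)) {
            return Optional.empty(); // e.g. non-Linux platform or /proc not mounted
        }
        try {
            byte[] raw = Files.readAllBytes(procIo);
            return Optional.of(new String(raw, StandardCharsets.UTF_8));
        } catch (IOException e) {
            // Assumption: a real collector would log this at debug level and disable itself.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        Optional<String> content = tryRead(Paths.get("/proc/self/io"));
        System.out.println(content.isPresent() ? "I/O metrics available" : "I/O metrics disabled");
    }
}
```

Returning an `Optional` keeps the "not available" case on the normal control path, which matches the stated goal of no errors or exceptions on unsupported platforms.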

Public Interfaces

JMX Metrics

Scope Note: The `linux-disk-*` prefix is a historical naming convention inherited 
from the two existing metrics. All seven metrics in this group reflect aggregate I/O
for the Kafka JVM process across all configured `log.dirs`. They do not provide
per-disk or per-log-directory visibility.

All metrics are exposed under:
- Brokers: kafka.server:type=KafkaServer,name=<metric-name>
- Controllers: kafka.server:type=ControllerServer,name=<metric-name>

In combined mode (broker + controller on the same node), both sets of metrics are registered, but they share the same underlying LinuxIoMetricsCollector reading from the same /proc/self/io, so the values reflect aggregate process-level I/O.

For example (existing metric):

Broker:     kafka.server type=KafkaServer name=linux-disk-read-bytes  Value=16384 JMXTool=1.5.3
Controller: kafka.server type=ControllerServer name=linux-disk-read-bytes  Value=16384 JMXTool=1.5.3
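
For readers who consume these metrics programmatically, the sketch below builds the MBean name from the pattern documented above and reads the gauge's `Value` attribute through the standard `javax.management` API. The class name `LinuxIoJmxReader` is hypothetical, and the example assumes the gauge exposes its reading under an attribute named `Value`, as the sample output above suggests.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

/** Hypothetical JMX reader for illustration. */
final class LinuxIoJmxReader {

    /** Builds kafka.server:type=<type>,name=<metric>; type is KafkaServer or ControllerServer. */
    static ObjectName nameFor(String type, String metric) throws Exception {
        return new ObjectName("kafka.server:type=" + type + ",name=" + metric);
    }

    /** Reads the gauge's "Value" attribute (assumed attribute name) as a long. */
    static long read(MBeanServerConnection connection, ObjectName name) throws Exception {
        return ((Number) connection.getAttribute(name, "Value")).longValue();
    }

    public static void main(String[] args) throws Exception {
        // Inside a broker JVM the platform MBean server holds the metric;
        // a remote client would obtain an MBeanServerConnection via a JMX connector.
        MBeanServer local = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = nameFor("KafkaServer", "linux-disk-rchar");
        if (local.isRegistered(name)) {
            System.out.println(name + " = " + read(local, name));
        } else {
            System.out.println(name + " is not registered in this JVM");
        }
    }
}
```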

New Metrics:

Metric Name: linux-disk-rchar
Type: Gauge
Unit: Bytes
Description: Total bytes read (including page cache hits)
Use Case: Calculate cache hit ratio: (rchar - read_bytes) / rchar
────────────────────────────────────────
Metric Name: linux-disk-wchar
Type: Gauge
Unit: Bytes
Description: Total bytes written (including buffered writes)
Use Case: Detect write amplification: write_bytes / wchar
────────────────────────────────────────
Metric Name: linux-disk-syscr
Type: Gauge
Unit: Count
Description: Number of read system calls
Use Case: Identify inefficient I/O patterns: read_bytes / syscr for avg read size
────────────────────────────────────────
Metric Name: linux-disk-syscw
Type: Gauge
Unit: Count
Description: Number of write system calls
Use Case: Analyze write batching: write_bytes / syscw for avg write size
────────────────────────────────────────
Metric Name: linux-disk-cancelled-write-bytes
Type: Gauge
Unit: Bytes
Description: Bytes cancelled before write (truncations)
Use Case: Monitor log compaction and cleanup activity
────────────────────────────────────────
Metric Name: linux-disk-read-bytes (existing)
Type: Gauge
Unit: Bytes
Description: Bytes read from storage layer (actual disk I/O)
Use Case: Track physical disk reads
────────────────────────────────────────
Metric Name: linux-disk-write-bytes (existing)
Type: Gauge
Unit: Bytes
Description: Bytes written to storage layer (actual disk I/O)
Use Case: Track physical disk writes

Compatibility, Deprecation, and Migration Plan

  • Backward Compatibility - Fully backward compatible
     - Existing metrics remain unchanged
     - New metrics are additive only
     - No changes to APIs, configurations, or protocols
     - Existing dashboards and alerts continue to work


  • Forward Compatibility - Forward compatible:
     - If new metrics are added to /proc/self/io in future Linux kernels, they can be added following the same pattern
     - No breaking changes anticipated


  • Migration Plan

    No migration required - this is a purely additive change with no data migration, configuration changes, or operational impact.
      

Test Plan

The feature will be validated through:

  • Unit Tests
    • Test parsing all 7 metrics from mock `/proc/self/io` data
    • Test with realistic production values
    • Test edge cases (zeros, large numbers, missing fields)
  • Integration Tests
    • Verify all 7 metrics are registered in JMX on Linux systems
    • Verify metrics are NOT registered on non-Linux systems
    • Verify metric values update correctly over time
  • Manual Testing - Validation on Real Kafka Cluster:
    1. Start Kafka broker on Linux
    2. Verify metrics appear in JMX:
       ```bash
       jconsole # Connect to Kafka broker
       # Navigate to kafka.server:type=KafkaServer,name=linux-disk-*
       # Verify all 7 metrics are present
       ```
    3. Generate load (produce/consume messages)
    4. Verify metric values update correctly
    5. Calculate cache hit ratio: `(rchar - read_bytes) / rchar`
    6. Calculate write amplification: `write_bytes / wchar`
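
The "values update correctly" checks in the steps above amount to taking two snapshots and comparing them. A minimal sketch of that comparison follows; `IoSnapshotDiff` and its methods are hypothetical names, and the snapshots are assumed to be maps from field name to counter value, however they were collected.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper comparing two snapshots of the I/O counters. */
final class IoSnapshotDiff {

    /** Returns (after - before) for every counter present in both snapshots. */
    static Map<String, Long> deltas(Map<String, Long> before, Map<String, Long> after) {
        Map<String, Long> out = new TreeMap<>();
        for (Map.Entry<String, Long> entry : before.entrySet()) {
            Long later = after.get(entry.getKey());
            if (later != null) {
                out.put(entry.getKey(), later - entry.getValue());
            }
        }
        return out;
    }

    /** True if no counter decreased between the two snapshots. */
    static boolean monotonic(Map<String, Long> before, Map<String, Long> after) {
        for (long delta : deltas(before, after).values()) {
            if (delta < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Long> before = new TreeMap<>();
        before.put("rchar", 4052L);
        before.put("syscr", 13L);
        Map<String, Long> after = new TreeMap<>();
        after.put("rchar", 9000L);
        after.put("syscr", 20L);
        System.out.println(deltas(before, after) + " monotonic=" + monotonic(before, after));
    }
}
```

A decrease between snapshots would normally mean the broker process restarted in between, since the counters are per-process and start again from zero.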

Rejected Alternatives

  • Use External Node Exporters (e.g., Prometheus node_exporter)
     Rejected because:
     - Node exporters track their own process (/proc/self/io of node_exporter), not Kafka's process
     - Cannot correlate with Kafka-specific operations (produce, consume, compaction)
     - Requires additional infrastructure deployment and maintenance
     - Metrics not integrated with Kafka's existing monitoring framework
     - No automatic correlation in dashboards
  • Calculate and Expose only Derived Metrics (e.g., cache_hit_ratio)
     Rejected because:
     - Too opinionated - different operators need different analysis
     - Loses granularity - operators cannot build custom metrics
     - Doesn't address all use cases (write amplification, syscall analysis, cancelled writes)
     - Raw metrics are more flexible and composable
  • Use Different Metric Name Prefixes: linux-cache-rchar and linux-cache-wchar (instead of linux-disk-*); linux-io-rchar for all new metrics; Separate prefixes for cache vs disk metrics.
    Rejected because:
     - Breaks naming consistency with existing linux-disk-read-bytes and linux-disk-write-bytes
     - The terms rchar, wchar, syscr, syscw are well-established Linux kernel terminology
     - Grouping all LinuxIoMetricsCollector metrics under linux-disk-* is clearer for operators
     - Less disruptive.
     - Easier to discover related metrics (all start with linux-disk-*)
  • Expose Per-Partition or Per-Topic I/O Metrics
    Rejected because:
     - /proc/self/io provides process-level metrics only, not per-topic/partition
     - Out of scope for this KIP
     - Could be a separate future enhancement if needed
  • Wait for Platform-Agnostic Solution: Wait until a cross-platform I/O metrics solution is available for Linux/Windows/macOS.
    Rejected because:
     - Kafka production deployments are overwhelmingly on Linux
     - The existing LinuxIoMetricsCollector is already Linux-specific
     - No timeline for when/if other platforms would support equivalent metrics
     - Users need this observability now
     - Platform-specific implementations are common in Kafka (e.g., Linux-specific tuning)
  • Use BPF/eBPF for Detailed I/O Tracing
    Rejected because:
     - BPF requires additional kernel modules and privileges
     - Much higher overhead than reading /proc/self/io
     - Unnecessary complexity for the use case
     - /proc/self/io provides exactly the metrics we need with negligible overhead
     - BPF is better suited for detailed tracing, not ongoing metrics collection