Status

Current state: Accepted

Discussion thread: here

Vote: here

JIRA: here

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Kafka operators rely heavily on storage metrics to manage retention, capacity planning, and alerting. Today, Kafka exposes only absolute partition size metrics (bytes). While useful, these metrics do not directly indicate how close a topic is to its configured retention limits.

This limitation becomes more pronounced in the presence of tiered storage, where data is split across local and remote tiers with different retention constraints.

Problems with the Current State

  • Operators must manually calculate storage utilization percentages by correlating log size with retention configurations.
  • Alerting based on absolute sizes is difficult because retention limits vary widely across topics.
  • Tiered storage requires correlating multiple metrics to understand overall and local storage utilization.
  • Capacity planning and operational troubleshooting lack a clear, normalized signal indicating retention pressure.

Proposed Change

This KIP proposes introducing percentage-based, per-partition storage metrics that express log size as a proportion of configured retention limits.

New Metrics Overview

  1. RetentionSizeInPercent
    • Represents total partition size as a percentage of retention.bytes
    • Available for:
      • Non-tiered topics
      • Tiered storage topics with remote copy enabled (local + remote data)
    • Enables operators to quickly identify partitions approaching or exceeding retention limits
  2. LocalRetentionSizeInPercent
    • Represents local log size as a percentage of local.retention.bytes
    • Available only for tiered storage topics 
    • Helps operators monitor pressure on local disks independently of remote storage

Both metrics:

  • Are reported at partition granularity
  • Are exposed via JMX
  • Use integer percentages
  • May exceed 100% to reflect delayed retention cleanup

Metric Semantics and Behavior

Scenario                                      | RetentionSizeInPercent | LocalRetentionSizeInPercent
----------------------------------------------|------------------------|----------------------------
Non-tiered topic with retention configured    | Available              | N/A
Non-tiered topic without retention configured | 0                      | N/A
Tiered topic with remote copy enabled         | Available              | Available
Tiered topic with remote copy disabled        | Available              | Available
Unlimited retention (-1)                      | 0                      | 0
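The semantics above (integer percentages, values above 100% allowed, 0 when retention is unlimited) can be illustrated with a minimal sketch. The class and method names are hypothetical and not part of this KIP. Truncating toward zero and returning 0 for a zero limit are assumptions made for the example, not behavior stated here.

```java
public final class RetentionPercentSketch {
    private RetentionPercentSketch() {}

    /**
     * Hypothetical helper: expresses a log size as an integer percentage of a
     * retention limit (retention.bytes or local.retention.bytes).
     *
     * Assumptions for this sketch: the result is truncated toward zero, and a
     * non-positive limit (including -1, "unlimited") is reported as 0.
     */
    public static int percentOfLimit(long sizeBytes, long limitBytes) {
        if (limitBytes <= 0) {
            return 0; // unlimited or unset retention: nothing to compare against
        }
        // Not capped at 100, so delayed cleanup stays visible.
        // Overflow for sizes above roughly 92 PB is ignored in this sketch.
        long percent = sizeBytes * 100 / limitBytes;
        return (int) Math.min(percent, Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        long gib = 1024L * 1024 * 1024;
        System.out.println(percentOfLimit(5 * gib, 10 * gib));  // 50
        System.out.println(percentOfLimit(12 * gib, 10 * gib)); // 120
        System.out.println(percentOfLimit(5 * gib, -1));        // 0
    }
}
```

The same function would apply to both metrics: total size against retention.bytes for RetentionSizeInPercent, and local log size against local.retention.bytes for LocalRetentionSizeInPercent.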

Design Principles

  • Retention-aware: Metrics reflect configured retention limits, not absolute sizes
  • Tier-aware: Distinguish between total storage usage and local disk usage
  • Operationally meaningful: Enable simple alerting and dashboards
  • Low overhead: Values are derived during existing retention-related workflows, not calculated on demand

Public Interfaces

JMX Metrics

  • Non-tiered topics
    kafka.log:type=Log,name=RetentionSizeInPercent,topic=<topic>,partition=<partition>
  • Tiered storage topics
    kafka.log.remote:type=RemoteLogManager,name=RetentionSizeInPercent,topic=<topic>,partition=<partition>
    kafka.log.remote:type=RemoteLogManager,name=LocalRetentionSizeInPercent,topic=<topic>,partition=<partition>

These metrics complement existing size metrics and do not modify or replace any current interfaces.
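As an illustration of how a monitoring client might address these MBeans, the sketch below builds the object names listed above using the standard javax.management API. The class and method names are hypothetical. Reading the gauge through an attribute called "Value" is an assumption based on how existing Kafka gauges are exposed over JMX; this KIP does not specify it.

```java
import java.io.IOException;
import javax.management.JMException;
import javax.management.MBeanServerConnection;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

public final class RetentionMetricNames {
    private RetentionMetricNames() {}

    /** Object name of RetentionSizeInPercent for a non-tiered topic partition. */
    public static ObjectName nonTiered(String topic, int partition) {
        return name("kafka.log:type=Log,name=RetentionSizeInPercent", topic, partition);
    }

    /** Object name of RetentionSizeInPercent for a tiered storage topic partition. */
    public static ObjectName tiered(String topic, int partition) {
        return name("kafka.log.remote:type=RemoteLogManager,name=RetentionSizeInPercent",
                topic, partition);
    }

    /** Object name of LocalRetentionSizeInPercent (tiered storage topics only). */
    public static ObjectName tieredLocal(String topic, int partition) {
        return name("kafka.log.remote:type=RemoteLogManager,name=LocalRetentionSizeInPercent",
                topic, partition);
    }

    /** Reads the gauge; the "Value" attribute name is an assumption of this sketch. */
    public static int readPercent(MBeanServerConnection conn, ObjectName metric)
            throws JMException, IOException {
        return ((Number) conn.getAttribute(metric, "Value")).intValue();
    }

    private static ObjectName name(String prefix, String topic, int partition) {
        try {
            return new ObjectName(prefix + ",topic=" + topic + ",partition=" + partition);
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("Invalid topic for JMX name: " + topic, e);
        }
    }
}
```

A client would obtain an MBeanServerConnection to the broker (for example via JMXConnectorFactory) and pass it to readPercent together with one of the names above.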

Existing size metric

Size of a partition on disk (in bytes)
kafka.log:type=Log,name=Size,topic=([-.\w]+),partition=([0-9]+)

Compatibility, Deprecation, and Migration Plan

  • Backward compatible: No existing metrics or behavior are changed
  • No deprecations
  • No migration required: Metrics become available immediately after upgrade

Test Plan

The feature will be validated through:

  • Unit tests covering percentage calculations, edge cases, and retention configurations
  • Integration tests verifying correct exposure via JMX
  • Tiered storage tests ensuring correct behavior during leadership changes and retention cleanup

Rejected Alternatives

  • Single unified metric for all storage types
    Rejected due to differing semantics between tiered and non-tiered storage.
  • Floating-point percentages
    Rejected in favor of simpler integer-based metrics sufficient for alerting.
  • Broker-level aggregated metrics
    Rejected because per-partition visibility is essential for identifying retention pressure.
  • Capping values at 100%
    Rejected to preserve visibility into retention lag and cleanup delays.
  • Similar metrics for time-based retention
    Rejected because, for topics that are older than the configured retention time and still receive writes, log segments remain continuously near expiration. Such a metric would hover close to zero in steady state and so would give topic owners little operational signal, since data older than the retention time is already expected to be eligible for deletion.