Status
Current state: Accepted
Discussion thread: here
Vote: here
JIRA: here
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
Kafka operators rely heavily on storage metrics to manage retention, capacity planning, and alerting. Today, Kafka exposes only absolute partition size metrics (bytes). While useful, these metrics do not directly indicate how close a topic is to its configured retention limits.
This limitation becomes more pronounced in the presence of tiered storage, where data is split across local and remote tiers with different retention constraints.
Problems with the Current State
- Operators must manually calculate storage utilization percentages by correlating log size with retention configurations.
- Alerting based on absolute sizes is difficult because retention limits vary widely across topics.
- Tiered storage requires correlating multiple metrics to understand overall and local storage utilization.
- Capacity planning and operational troubleshooting lack a clear, normalized signal indicating retention pressure.
Proposed Change
This KIP proposes introducing percentage-based, per-partition storage metrics that express log size as a proportion of configured retention limits.
New Metrics Overview
- RetentionSizeInPercent
  - Represents total partition size as a percentage of retention.bytes
  - Available for:
    - Non-tiered topics
    - Tiered storage topics with remote copy enabled (local + remote data)
  - Enables operators to quickly identify partitions approaching or exceeding retention limits
- LocalRetentionSizeInPercent
  - Represents local log size as a percentage of local.retention.bytes
  - Available only for tiered storage topics
  - Helps operators monitor pressure on local disks independently of remote storage
Both metrics:
- Are reported at partition granularity
- Are exposed via JMX
- Use integer percentages
- May exceed 100% to reflect delayed retention cleanup
Metric Semantics and Behavior
Scenario | RetentionSizeInPercent | LocalRetentionSizeInPercent |
Non-tiered topic with retention configured | Available | N/A |
Non-tiered topic without retention configured | 0 | N/A |
Tiered topic with remote copy enabled | Available | Available |
Tiered topic with remote copy disabled | Available | Available |
Unlimited retention (-1) | 0 | 0 |
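The semantics in the table can be illustrated with a small calculation. The following is a minimal sketch, not the KIP's implementation: the class and method names are hypothetical, truncating integer division is an assumed rounding mode, and reporting 0 for a limit of 0 bytes is an assumption made only to avoid division by zero.

```java
/** Illustrative sketch only; names and rounding mode are assumptions. */
final class RetentionPercentSketch {

    /**
     * Expresses sizeBytes as an integer percentage of retentionBytes.
     * A non-positive limit (for example retention.bytes = -1, meaning
     * unlimited) is reported as 0. The result is not capped at 100, so a
     * partition that is waiting for retention cleanup can report more.
     */
    static int percentOf(long sizeBytes, long retentionBytes) {
        if (retentionBytes <= 0 || sizeBytes <= 0) {
            return 0;
        }
        // Truncating division (assumed). sizeBytes * 100 only overflows a
        // long for sizes above roughly 92 PB, which this sketch ignores.
        long percent = sizeBytes * 100 / retentionBytes;
        return (int) Math.min(percent, Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        System.out.println(percentOf(512L, 1024L));   // half of the limit
        System.out.println(percentOf(1536L, 1024L));  // cleanup is lagging
        System.out.println(percentOf(1536L, -1L));    // unlimited retention
    }
}
```

The same function would apply to LocalRetentionSizeInPercent by passing the local log size and local.retention.bytes instead.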
Design Principles
- Retention-aware: Metrics reflect configured retention limits, not absolute sizes
- Tier-aware: Distinguish between total storage usage and local disk usage
- Operationally meaningful: Enable simple alerting and dashboards
- Low overhead: Values are derived during existing retention-related workflows, not calculated on demand
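Because the metrics are normalized, a single pair of thresholds can apply to every topic regardless of its configured retention size. The sketch below shows one way an alerting rule might classify a reading; the class, the enum, and the threshold values used in the usage note are hypothetical and not part of this KIP.

```java
/** Illustrative sketch only; not part of the proposed interfaces. */
final class RetentionAlertSketch {

    enum Level { OK, WARN, CRITICAL }

    /**
     * Classifies an integer percentage against two thresholds chosen by
     * the operator. Values above 100 are valid inputs, since the metrics
     * are not capped.
     */
    static Level classify(int percent, int warnAt, int criticalAt) {
        if (percent >= criticalAt) {
            return Level.CRITICAL;
        }
        if (percent >= warnAt) {
            return Level.WARN;
        }
        return Level.OK;
    }
}
```

For example, with a hypothetical warning threshold of 80 and a critical threshold of 100, a reading of 85 would be classified as WARN and a reading of 120 as CRITICAL.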
Public Interfaces
JMX Metrics
- Non-tiered topics: kafka.log:type=Log,name=RetentionSizeInPercent,topic=<topic>,partition=<partition>
- Tiered storage topics:
  kafka.log.remote:type=RemoteLogManager,name=RetentionSizeInPercent,topic=<topic>,partition=<partition>
  kafka.log.remote:type=RemoteLogManager,name=LocalRetentionSizeInPercent,topic=<topic>,partition=<partition>
These metrics complement existing size metrics and do not modify or replace any current interfaces.
Existing size metric
Size of a partition on disk (in bytes): kafka.log:type=Log,name=Size,topic=([-.\w]+),partition=([0-9]+)
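As a rough illustration of how a monitoring client might address the new MBeans, the sketch below builds the object names listed above using the standard javax.management API. The class and method names are hypothetical. Reading the gauge through an attribute named "Value" is an assumption based on how Kafka gauges are conventionally exposed over JMX; this KIP does not specify it.

```java
import javax.management.MBeanServerConnection;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

/** Illustrative sketch only; helper names are hypothetical. */
final class RetentionMetricNames {

    /** Object name of RetentionSizeInPercent for a non-tiered topic. */
    static ObjectName nonTiered(String topic, int partition) {
        return parse("kafka.log:type=Log,name=RetentionSizeInPercent"
                + ",topic=" + topic + ",partition=" + partition);
    }

    /**
     * Object name for a tiered storage topic; metric is either
     * "RetentionSizeInPercent" or "LocalRetentionSizeInPercent".
     */
    static ObjectName tiered(String metric, String topic, int partition) {
        return parse("kafka.log.remote:type=RemoteLogManager,name=" + metric
                + ",topic=" + topic + ",partition=" + partition);
    }

    /** Reads the gauge, assuming it is exposed as the "Value" attribute. */
    static Object readValue(MBeanServerConnection conn, ObjectName name)
            throws Exception {
        return conn.getAttribute(name, "Value");
    }

    // Legal Kafka topic names need no quoting in an ObjectName, so plain
    // concatenation is sufficient for this sketch.
    private static ObjectName parse(String name) {
        try {
            return new ObjectName(name);
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("Invalid metric name: " + name, e);
        }
    }
}
```

A client would obtain an MBeanServerConnection to the broker (for example through a JMXConnector) and pass it to readValue together with one of the names built above.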
Compatibility, Deprecation, and Migration Plan
- Backward compatible: No existing metrics or behavior are changed
- No deprecations
- No migration required: Metrics become available immediately after upgrade
Test Plan
The feature will be validated through:
- Unit tests covering percentage calculations, edge cases, and retention configurations
- Integration tests verifying correct exposure via JMX
- Tiered storage tests ensuring correct behavior during leadership changes and retention cleanup
Rejected Alternatives
- Single unified metric for all storage types
Rejected due to differing semantics between tiered and non-tiered storage.
- Floating-point percentages
Rejected in favor of simpler integer-based metrics, which are sufficient for alerting.
- Broker-level aggregated metrics
Rejected because per-partition visibility is essential for identifying retention pressure.
- Capping values at 100%
Rejected to preserve visibility into retention lag and cleanup delays.
- Similar metrics for time-based retention
Rejected because, for topics older than the configured retention time that have ongoing production, log segments tend to remain near expiration continuously. This would cause such a metric to hover close to zero in steady state. As a result, it may not provide a meaningful operational signal to topic owners, since data older than the retention time is expected to be eligible for deletion.