
Compaction enables Kafka to remove old messages that are flagged for deletion while retaining other messages for a relatively longer time. Today, a log segment may remain uncompacted for a long time, because eligibility for log compaction is determined by the compaction ratio ("min.cleanable.dirty.ratio") and the minimum compaction lag ("min.compaction.lag.ms") settings. The ability to delete a log message through compaction in a timely manner has become an important requirement in some use cases (e.g., GDPR). For example, one use case is to delete PII (Personally Identifiable Information) within 7 days while keeping non-PII data indefinitely in compacted form. The goal of this change is to provide a time-based compaction policy that ensures the cleanable section is compacted after a specified time interval, regardless of the dirty ratio and the minimum compaction lag. The dirty ratio and the minimum compaction lag are still honored as long as the time-based compaction rule is not violated. In other words, if Kafka receives a deletion request for a key (e.g., a record with that key and a null value), the corresponding log segment will be picked up for compaction after the configured time interval so that the key is removed.
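The decision rule described above can be sketched roughly as follows. This is a simplified illustration, not the log cleaner's actual logic: the function name, its parameters, and in particular `max_compaction_lag_ms` (a stand-in for the proposed time-based setting) are hypothetical, and the real cleaner works on log segments rather than a single age value. The 7-day default mirrors the PII example; the other defaults are illustrative.

```python
DAY_MS = 24 * 60 * 60 * 1000


def should_compact(dirty_ratio, oldest_dirty_age_ms, *,
                   min_cleanable_dirty_ratio=0.5,
                   min_compaction_lag_ms=0,
                   max_compaction_lag_ms=7 * DAY_MS):
    """Return True if the log should be picked up for compaction.

    dirty_ratio: fraction of the log that has not been compacted yet.
    oldest_dirty_age_ms: age of the oldest uncompacted message.
    max_compaction_lag_ms: hypothetical name for the proposed time bound.
    """
    # Time-based rule: once the oldest dirty message exceeds the maximum
    # lag, compact regardless of dirty ratio and minimum compaction lag.
    if oldest_dirty_age_ms >= max_compaction_lag_ms:
        return True
    # Otherwise the existing rules are still honored.
    if oldest_dirty_age_ms < min_compaction_lag_ms:
        return False
    return dirty_ratio > min_cleanable_dirty_ratio


print(should_compact(0.1, 8 * DAY_MS))  # True: older than 7 days
print(should_compact(0.1, 1 * DAY_MS))  # False: dirty ratio too low
print(should_compact(0.6, 1 * DAY_MS))  # True: dirty ratio rule applies
```

The point of the sketch is the ordering of the checks: the time bound is evaluated first, so a log with a low dirty ratio is still compacted once its oldest dirty message is too old.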

Example

A compacted topic with the user id as the key and PII in the value:


1 => {name: "John Doe", phone: "5555555"}
2 => {name: "Jane Doe", phone: "6666666"}


# to remove the phone number we can replace the value with a new message
1 => {name: "John Doe"}


# to completely delete key 1 we can send a tombstone record
1 => null


# but until compaction runs (and some other conditions are met), reading the whole topic will get all three values for key 1, and the old values are still retained on disk.
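To make the effect of compaction on this example concrete, here is a minimal in-memory sketch. It is not Kafka code: the `compact` function and the list-of-tuples log are assumptions made for illustration, and it drops tombstones immediately, whereas a real broker keeps them for a configurable period before removing them.

```python
def compact(log):
    """Keep only the latest record per key and drop tombstones (None values)."""
    latest_offset = {}
    for offset, (key, _value) in enumerate(log):
        latest_offset[key] = offset  # later records overwrite earlier ones
    return [
        (key, value)
        for offset, (key, value) in enumerate(log)
        if latest_offset[key] == offset and value is not None
    ]


log = [
    (1, {"name": "John Doe", "phone": "5555555"}),
    (2, {"name": "Jane Doe", "phone": "6666666"}),
    (1, {"name": "John Doe"}),  # replaces the value, removing the phone number
    (1, None),                  # tombstone: delete key 1 entirely
]

# Before compaction, a full read still sees every record for key 1.
print(len([record for record in log if record[0] == 1]))  # 3

# After compaction, only the latest non-deleted value per key remains.
print(compact(log))  # [(2, {'name': 'Jane Doe', 'phone': '6666666'})]
```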

If there is a requirement to guarantee a maximum time that an old record can exist (for example, under one interpretation of GDPR rules for PII), a new topic setting is needed, because the existing configurations either focus only on the minimum time a record should live or can meet the requirement only by sacrificing the efficiencies gained from the dirty ratio settings.

This example mentions GDPR because it is widely known, but the requirement here is to provide some guarantees around a tombstone or a new value leading to deletion of old values within a maximum time.


Note: This change focuses on when to compact a log segment. It does not conflict with KIP-280, which focuses on how to compact a log.
