Current state: Under Discussion
Currently a client (e.g. producer, consumer) fetches metadata from the least loaded node. Because the Kafka controller sends UpdateMetadataRequest to brokers concurrently and brokers may process the UpdateMetadataRequest at different times, a client can fetch metadata that is older than the metadata already in its cache. This can cause OffsetOutOfRangeException in the consumer even if there is no log truncation in the Kafka cluster (see KAFKA-6262 for more detail). For MirrorMaker, whose offset reset policy is oldest, it can cause MM to rewind and consume from the oldest offset. This increases the latency of transmitting data from the source to the destination cluster and duplicates a large amount of data in the destination cluster.
In this KIP we propose to add leader_epoch and partition_epoch fields to the MetadataResponse so that a client can refresh metadata again if the incoming metadata is older than the existing metadata in its cache.
1) Add znode /partition_epoch with the following json format
2) Update the znodes /brokers/topics/[topic]/partitions/[partition] to use the following json format
1) Add partition_epoch to UpdateMetadataRequest
2) Add partition_epoch and leader_epoch to MetadataResponse
3) Add partition_epoch and leader_epoch to OffsetCommitRequest
4) Add partition_epoch and leader_epoch to OffsetFetchResponse
5) Add partition_epoch to LeaderAndIsrRequest
6) Add partition_epoch to FetchResponse
7) Add leader_epoch and partition_epoch to ListOffsetResponse
Offset topic schema
Add partition_epoch and leader_epoch to the schema of the offset topic value.
1) Create new class OffsetAndOffsetEpoch
2) Add the following methods to the interface org.apache.kafka.clients.consumer.Consumer
3) Add field offsetEpoch to the class org.apache.kafka.clients.consumer.OffsetAndMetadata and the class org.apache.kafka.clients.consumer.OffsetAndTimestamp
4) Add new Error INVALID_PARTITION_EPOCH
consumer.poll() will throw InvalidPartitionEpochException if the partition_epoch of the given partition is different from the corresponding partition_epoch from the FetchResponse. This can happen if the user consumes data with a previously used offset after the topic is deleted.
1) Topic creation and partition expansion.
When the broker creates a topic or expands its partitions, it should increment the global partition_epoch in the znode
/partition_epoch by 1 for each new partition to be created. Thus each partition can be identified by a globally unique partition_epoch. The resulting partition -> partition_epoch mapping for this partition should be written in the znode
2) Topic deletion.
When a topic is deleted, the global partition_epoch in the znode
/partition_epoch will be incremented by 1. This can help us recognize the more recent metadata after the topic deletion.
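The epoch bookkeeping for topic creation, partition expansion and topic deletion can be sketched as below. This is only an in-memory illustration: the class `PartitionEpochAllocator`, its method names, and the starting value of 0 are assumptions made for the example, and a real implementation would read and write the /partition_epoch znode instead of a field.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for the /partition_epoch znode; not Kafka code.
class PartitionEpochAllocator {
    // Global partition_epoch; starting at 0 is an assumption of this sketch.
    private int globalPartitionEpoch = 0;
    // topic -> (partition -> partition_epoch)
    private final Map<String, Map<Integer, Integer>> epochs = new HashMap<>();

    // Topic creation or partition expansion: bump the global epoch once per new partition.
    void createPartitions(String topic, int firstPartition, int count) {
        Map<Integer, Integer> byPartition = epochs.computeIfAbsent(topic, t -> new HashMap<>());
        for (int p = firstPartition; p < firstPartition + count; p++) {
            globalPartitionEpoch += 1;
            byPartition.put(p, globalPartitionEpoch);
        }
    }

    // Topic deletion: forget the topic's partitions and bump the global epoch by 1.
    void deleteTopic(String topic) {
        epochs.remove(topic);
        globalPartitionEpoch += 1;
    }

    int globalEpoch() {
        return globalPartitionEpoch;
    }

    // Returns null if the partition does not exist.
    Integer epochOf(String topic, int partition) {
        Map<Integer, Integer> byPartition = epochs.get(topic);
        return byPartition == null ? null : byPartition.get(partition);
    }
}
```

Because the counter never decreases, a topic that is deleted and re-created under the same name receives partition_epoch values strictly larger than any it had before.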
3) Metadata propagation from controller to brokers and from brokers to clients.
Controller should include the current maximum partition_epoch in the UpdateMetadataRequest. Controller should also include partition_epoch and leader_epoch for each partition in the UpdateMetadataRequest. Broker should similarly include max_partition_epoch, partition_epoch and leader_epoch in the MetadataResponse to clients.
4) Client's metadata refresh
After the client receives a MetadataResponse from a broker, it compares the MetadataResponse with the cached metadata to check whether the MetadataResponse is outdated. The MetadataResponse is outdated if, among the partitions that the client is interested in, there exists a partition such that the following is true:
The client should refresh metadata again with the existing backoff mechanism if the MetadataResponse is determined to be outdated.
Note that a producer is interested in all partitions. A consumer may be interested only in the partitions that it has explicitly subscribed to. The purpose of checking only a subset of partitions is to avoid unnecessary metadata refreshes when the metadata is outdated only for partitions the client does not need. In other words, it is an optimization and is not needed for correctness.
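A rough sketch of the client-side staleness check follows. The exact per-partition condition is not reproduced in the text above, so the comparison rule here is an assumption: a partition is treated as older if its partition_epoch is lower than the cached one, or if the partition_epoch is equal and its leader_epoch is lower. The class `MetadataFreshness`, the `Epochs` holder and the string partition keys are hypothetical names used only for illustration.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of the Kafka client.
class MetadataFreshness {
    // Epoch pair for one partition; -1 stands for "unknown".
    static final class Epochs {
        final int partitionEpoch;
        final int leaderEpoch;

        Epochs(int partitionEpoch, int leaderEpoch) {
            this.partitionEpoch = partitionEpoch;
            this.leaderEpoch = leaderEpoch;
        }
    }

    // ASSUMPTION: "older" means a lower partition_epoch, or the same
    // partition_epoch with a lower leader_epoch.
    static boolean isOlder(Epochs incoming, Epochs cached) {
        if (incoming.partitionEpoch != cached.partitionEpoch) {
            return incoming.partitionEpoch < cached.partitionEpoch;
        }
        return incoming.leaderEpoch < cached.leaderEpoch;
    }

    // The response is outdated if any partition of interest is older in the
    // response than in the cache. Partitions missing on either side are skipped.
    static boolean isOutdated(Map<String, Epochs> cached,
                              Map<String, Epochs> response,
                              Set<String> interested) {
        for (String tp : interested) {
            Epochs old = cached.get(tp);
            Epochs fresh = response.get(tp);
            if (old != null && fresh != null && isOlder(fresh, old)) {
                return true;
            }
        }
        return false;
    }
}
```

With cached epochs of -1 the check never reports the response as outdated, which matches the intent of preserving existing behavior when no epoch is known.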
5) Offset commit
When consumer commits offset, it includes leader_epoch and partition_epoch together with offset in the OffsetCommitRequest. The leader_epoch should be the largest leader_epoch of messages whose offset < the commit offset. If no message has been consumed since consumer initialization, the leader_epoch from seek(...) or OffsetFetchResponse should be used. The partition_epoch should be read from the last FetchResponse corresponding to the given partition and commit offset. The coordinator should extract these values from the OffsetCommitRequest and write them into the offset topic.
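As an illustration of how a consumer could pick the leader_epoch to send with a commit, the sketch below tracks the largest leader_epoch seen among consumed messages and falls back to the initial value when nothing has been consumed. The class `CommitEpochTracker` and its methods are hypothetical, and the sketch assumes the commit offset is the current position (last consumed offset + 1).

```java
// Hypothetical per-partition tracker; not part of the Kafka consumer.
class CommitEpochTracker {
    // leader_epoch from seek(...) or OffsetFetchResponse; -1 if unknown.
    private final int initialLeaderEpoch;
    private int maxConsumedLeaderEpoch = -1;
    private boolean consumedAny = false;
    // Next offset to consume; used as the commit offset in this sketch.
    private long position;

    CommitEpochTracker(long startOffset, int initialLeaderEpoch) {
        this.position = startOffset;
        this.initialLeaderEpoch = initialLeaderEpoch;
    }

    // Called for each consumed message, in offset order.
    void onRecord(long offset, int leaderEpoch) {
        consumedAny = true;
        maxConsumedLeaderEpoch = Math.max(maxConsumedLeaderEpoch, leaderEpoch);
        position = offset + 1;
    }

    long commitOffset() {
        return position;
    }

    // Largest leader_epoch among messages with offset < commitOffset(),
    // or the initial epoch if nothing has been consumed yet.
    int leaderEpochForCommit() {
        return consumedAny ? maxConsumedLeaderEpoch : initialLeaderEpoch;
    }
}
```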
6) Consumer rebalance or initialization using the offset and epoch from the Kafka offset topic
After the consumer receives an OffsetFetchResponse, it remembers the leader_epoch and the partition_epoch for each partition it needs to consume. The consumer then repeatedly refreshes metadata for as long as the metadata is considered outdated, using criteria similar to those in the subsection "Client's metadata refresh" above. The only difference is that, instead of using the leader_epoch and partition_epoch from the cached metadata, the consumer uses the values from the OffsetFetchResponse.
For the existing version of the offset topic, leader_epoch and partition_epoch are not available in the value of the offset topic message. In that case both the leader_epoch and the partition_epoch are assumed to be -1 and the client preserves the existing behavior without doing the additional metadata refresh.
7) Consumer initialization if offset is stored externally.
When committing offset, user should use the newly added API
positionAndOffsetEpoch(...) to read the offset and offsetEpoch (which encodes leader_epoch and partition_epoch). The leader_epoch should be the largest leader_epoch of those messages whose offset < position. If no message has been consumed since consumer initialization, the leader_epoch from seek(...) or OffsetFetchResponse should be used. The partition_epoch should be read from the last FetchResponse corresponding to the given partition and position offset. Both offset and the offsetEpoch (in the form of byte array) should be written to the external store.
When initializing consumer, user should read the externally stored offsetEpoch (as byte array) together with the offset. Then user should call the newly added API
seek(partition, offset, offsetEpoch) to seek to the previous offset. Later, when consumer.poll() is called, the consumer repeatedly refreshes metadata for as long as the metadata is considered outdated, using criteria similar to those in the subsection "Client's metadata refresh" above. The only difference is that, instead of using the leader_epoch and partition_epoch from the cached metadata, the consumer uses the values extracted from the offsetEpoch provided to seek(...).
If the offsetEpoch passed to seek(...) cannot be decoded by the Kafka client implementation, an
IllegalArgumentException is thrown from seek(...).
If the user calls the existing API seek(partition, offset) and there is no committed offset/leader_epoch/partition_epoch in the Kafka offset topic, both the leader_epoch and the partition_epoch are assumed to be -1 and the client preserves the existing behavior without doing the additional check.
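To make the byte-array round trip concrete, here is a minimal sketch of how an offsetEpoch could be encoded for an external store and decoded again. The wire layout (a 2-byte version followed by two 4-byte integers) and the class `OffsetEpochCodec` are assumptions for illustration only; the KIP text above does not define the encoding.

```java
import java.nio.ByteBuffer;

// Hypothetical codec; the real offsetEpoch layout is not specified here.
class OffsetEpochCodec {
    static final short VERSION = 0;
    // 2-byte version + 4-byte partition_epoch + 4-byte leader_epoch.
    static final int SIZE = Short.BYTES + 2 * Integer.BYTES;

    // Packs both epochs into an opaque byte array for the external store.
    static byte[] encode(int partitionEpoch, int leaderEpoch) {
        return ByteBuffer.allocate(SIZE)
                .putShort(VERSION)
                .putInt(partitionEpoch)
                .putInt(leaderEpoch)
                .array();
    }

    // Returns {partition_epoch, leader_epoch}; rejects anything it cannot decode.
    static int[] decode(byte[] bytes) {
        if (bytes == null || bytes.length != SIZE) {
            throw new IllegalArgumentException("offsetEpoch has unexpected length");
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        short version = buf.getShort();
        if (version != VERSION) {
            throw new IllegalArgumentException("unsupported offsetEpoch version " + version);
        }
        int partitionEpoch = buf.getInt();
        int leaderEpoch = buf.getInt();
        return new int[] {partitionEpoch, leaderEpoch};
    }
}
```

Keeping the value opaque to the user leaves room to change the layout later; a version prefix is one common way to do that, though it is only a design choice of this sketch.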
8) Consumption after topic deletion
consumer.poll() will throw InvalidPartitionEpochException if the partition_epoch of the given partition is different from the corresponding partition_epoch from the FetchResponse. This can happen if the user consumes data with a previously used offset after the topic is deleted.
After the consumer receives a FetchResponse from a broker, it should verify that the partition_epoch from the FetchResponse equals the partition_epoch associated with the last used offset. The partition_epoch associated with the last used offset can be obtained from seek(...) and OffsetFetchResponse. If there is no partition_epoch associated with the last used offset, this check is not performed. If there is a partition_epoch associated with the last used offset and its value is different from the partition_epoch from the FetchResponse, consumer.poll() should throw InvalidPartitionEpochException (constructed with the partition -> partition_epoch mapping) for those partitions.
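The check described above might look roughly like the following. The exception class here is a local stand-in with the same name as the proposed one, `PartitionEpochCheck` and the string partition keys are hypothetical, and the sketch assumes the exception carries the partition_epoch values seen in the FetchResponse.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the FetchResponse check; not Kafka client code.
class PartitionEpochCheck {
    static final int UNKNOWN_EPOCH = -1;

    // Local stand-in for the proposed exception.
    static class InvalidPartitionEpochException extends RuntimeException {
        final Map<String, Integer> partitionEpochs;

        InvalidPartitionEpochException(Map<String, Integer> partitionEpochs) {
            super("partition_epoch mismatch for " + partitionEpochs.keySet());
            this.partitionEpochs = partitionEpochs;
        }
    }

    // expected: partition_epoch tied to the last used offset (from seek(...) or
    // OffsetFetchResponse); absent or -1 means the check is skipped.
    static void verify(Map<String, Integer> expected, Map<String, Integer> fromFetchResponse) {
        Map<String, Integer> mismatched = new HashMap<>();
        for (Map.Entry<String, Integer> e : fromFetchResponse.entrySet()) {
            Integer exp = expected.get(e.getKey());
            if (exp == null || exp == UNKNOWN_EPOCH) {
                continue; // no epoch associated with the last used offset
            }
            if (!exp.equals(e.getValue())) {
                mismatched.put(e.getKey(), e.getValue());
            }
        }
        if (!mismatched.isEmpty()) {
            throw new InvalidPartitionEpochException(mismatched);
        }
    }
}
```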
Compatibility, Deprecation, and Migration Plan
The KIP changes the inter-broker protocol. Therefore the migration requires two rolling bounces. In the first rolling bounce we deploy the new code, but brokers still communicate using the existing protocol. In the second rolling bounce we change the config so that brokers start to communicate with each other using the new protocol.
- Use a global per-metadata version.
This would be a bit more complicated because it introduces new state in Kafka. leader_epoch is existing state that we already maintain in ZooKeeper. By using the per-partition leader_epoch, the client is forced to refresh metadata only if the MetadataResponse contains outdated metadata for the partitions that the client is interested in.
1) We can use the leader_epoch to better handle the offset reset in case of unclean leader election.