Current state: Under Discussion
Discussion thread: here
JIRA: Not created
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Searching for an offset by timestamp in Kafka currently has very coarse granularity (log segment level), and it also does not work well when a replica is reassigned. This KIP introduces a time-based log index to allow searching for messages in Kafka by timestamp at a finer granularity.
The time index will also be used for time based log retention.
This KIP depends on KIP-32.
No actual public interface change. The search-by-timestamp function will still be provided by OffsetRequest.
In order to enable timestamp-based search at a finer granularity, we need to add the timestamp to the log indices as well. The broker will build the time index based on the LogAppendTime of messages.
Because all the index files are memory-mapped, the main consideration here is to avoid significantly increasing the memory consumption.
The time index file is built per log segment, just like the log index file.
We create another index file for each log segment, named SegmentBaseOffset.time.index, with entries at minute granularity. The time index entry format is:
```
Time Index Entry => Timestamp Offset
  Timestamp => int64
  Offset => int32
```
The time index granularity does not change the accuracy of timestamp-based search; it only affects the time a search takes. Lookups work the same way as offset search: find the closest indexed timestamp and its corresponding offset, then linearly scan the log from there until the target message is found. The reason we prefer minute-level indexing is that timestamp-based search is usually rare, so it is probably not worth investing a significant amount of memory in it.
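To make the search concrete, the following is a minimal sketch of the per-segment lookup, assuming the 12-byte entry layout above with the 4-byte offset stored relative to the segment's base offset; the class and method names are illustrative, not actual broker code:

```java
import java.nio.ByteBuffer;

public class TimeIndexLookup {
    private static final int ENTRY_SIZE = 12; // 8-byte timestamp + 4-byte relative offset

    private final ByteBuffer index;       // typically a memory-mapped .time.index file
    private final long segmentBaseOffset; // base offset of the owning log segment

    public TimeIndexLookup(ByteBuffer index, long segmentBaseOffset) {
        this.index = index;
        this.segmentBaseOffset = segmentBaseOffset;
    }

    /**
     * Returns the absolute offset recorded by the last index entry whose timestamp
     * is <= targetTimestamp, or the segment base offset if every indexed timestamp
     * is newer. The caller then scans the log linearly from the returned offset
     * until it finds the exact target message.
     */
    public long lookup(long targetTimestamp) {
        int entries = index.limit() / ENTRY_SIZE;
        int lo = 0, hi = entries - 1, found = -1;
        while (lo <= hi) { // binary search over the timestamps, which are in order
            int mid = (lo + hi) >>> 1;
            long ts = index.getLong(mid * ENTRY_SIZE);
            if (ts <= targetTimestamp) { found = mid; lo = mid + 1; }
            else hi = mid - 1;
        }
        if (found < 0) return segmentBaseOffset;
        return segmentBaseOffset + index.getInt(found * ENTRY_SIZE + 8);
    }
}
```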
The following table gives a summary of the memory consumption at different index granularities. The numbers are calculated for a broker with 3500 partitions.
Granularity | Entries per partition per day | Memory consumption (3500 partitions)
---|---|---
Second | 86400 | 3.4 GB
Minute | 1440 | 57 MB
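As a sanity check, these figures match the 12-byte entry format above, assuming one entry per second (or per minute) and one day of retained log per partition: 86400 entries × 12 bytes × 3500 partitions ≈ 3.4 GB, while 1440 entries × 12 bytes × 3500 partitions ≈ 58 MB.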
This configuration allows users to change the granularity of the time index.
Based on the proposal in KIP-32, the broker will build the time index in the following way:
On broker startup, the broker needs to find the earliest and latest timestamps of the current active log segment. The latest timestamp may be needed for the next time index append, and the earliest timestamp is needed to enforce time-based log rolling, so the broker will scan the current active log segment to find both.
To enforce time-based log retention, the broker checks the last time index entry of a log segment: its timestamp is the latest timestamp of the messages in that segment, so if that timestamp has expired, the broker deletes the log segment.
Because the broker keeps the earliest timestamp of the messages in the current active log segment, it can also enforce time-based log rolling: if the time elapsed since that timestamp exceeds the configured log rolling interval, the broker rolls out a new log segment.
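A minimal sketch of these three steps, with illustrative names standing in for the real broker settings and code:

```java
/**
 * A sketch of the startup scan and the time-based retention / rolling checks
 * described above. All names are illustrative, not actual broker code.
 */
class TimeBasedLogPolicy {
    private final long retentionMs; // e.g. the time-based retention setting
    private final long rollMs;      // e.g. the time-based rolling setting

    TimeBasedLogPolicy(long retentionMs, long rollMs) {
        this.retentionMs = retentionMs;
        this.rollMs = rollMs;
    }

    /** Startup: scan the active segment's message timestamps once to recover
     *  the earliest (for rolling) and latest (for time index appends) timestamps. */
    static long[] scanActiveSegment(Iterable<Long> messageTimestamps) {
        long earliest = Long.MAX_VALUE, latest = Long.MIN_VALUE;
        for (long ts : messageTimestamps) {
            earliest = Math.min(earliest, ts);
            latest = Math.max(latest, ts);
        }
        return new long[] { earliest, latest };
    }

    /** Retention: delete a segment once the latest timestamp in it, read from
     *  its last time index entry, has expired. */
    boolean shouldDelete(long lastTimeIndexTimestamp, long now) {
        return now - lastTimeIndexTimestamp > retentionMs;
    }

    /** Rolling: roll a new segment once the earliest timestamp in the active
     *  segment is older than the configured rolling interval. */
    boolean shouldRoll(long earliestTimestamp, long now) {
        return now - earliestTimestamp > rollMs;
    }
}
```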
The broker's time index only provides minute-level accuracy; the exact target message is located by the linear scan described above.
For documentation purposes, the following records a few of the discussions around this KIP.
| Option 1 (LogAppendTime index) | Option 2 (CreateTime index)
---|---|---
Accuracy of searching by time | Millisecond | Locates the first message in the log that falls into the target minute
Order of timestamps in the actual log | Monotonically increasing | Out of order
Broker log retention / rolling policy enforcement | Simple to implement if we have LogAppendTime in each message; otherwise a separate design is needed | Needs to be implemented separately
Exposure of LogAppendTime to users | Optional; depends on whether we include LogAppendTime in each message (see the detailed discussion in KIP-32) | Not necessarily needed
Memory consumption | Uses memory-mapped files; typically needs less memory than Option 2 | All entries are in memory; the memory footprint is higher than Option 1
Complexity | Both options are similar for indexing | Similar to Option 1, but needs a separate design to honor log retention / rolling
Application friendliness | Users need to track both CreateTime (assuming we include it in the message) and LogAppendTime (see the use case discussion below) | Users only need to track CreateTime
# | Use case | Goal | Solution with LogAppendTime index | Solution with CreateTime index | Comparison
---|---|---|---|---|---
1 | Search by timestamp | Not lose messages | If users want to search for a message with CreateTime CT, they can use CT to search in the LogAppendTime index, because LogAppendTime > CT for the same message (assuming no clock skew). If the clock is skewed, they can search with CT - X, where X is the maximum skew. If users want to search for a message with LogAppendTime LAT, they can search with LAT directly and get millisecond accuracy. | Users can search with CT directly and get an offset with minute-level granularity. | If the latency in the pipeline is greater than one minute, users might consume fewer messages by using the CreateTime index; otherwise, the LogAppendTime index is probably preferred. Consider two messages m1 and m2 appended with LAT1 < LAT2: if users search with CT after they have consumed m2, they will have to reconsume from m1, and depending on how big LAT2 - LAT1 is, the number of messages to be reconsumed can be very large.
2 | Search by timestamp (bootstrap) | | In the bootstrap case, all the LATs will be close together. For example, if users bootstrap historical data into Kafka and then want to process only the data of the last 3 days, the LogAppendTime index does not help much, because every bootstrapped message receives nearly the same LAT; users would need to filter out the data older than 3 days before dumping it into Kafka. | In the bootstrap case, the CreateTime does not change, so if users follow the same procedure described for the LogAppendTime index, searching by timestamp still works. | The LogAppendTime index needs further attention from users.
3 | Failover from cluster 1 to cluster 2 | | Similar to search by timestamp. Users can choose to use the CT or the LAT of cluster 1 to search on cluster 2. Searching with CT - MaxLatencyOfCluster provides a strong guarantee of not losing messages, but may return some duplicates, depending on the difference in latency between cluster 1 and cluster 2. | Users can use CT to search and get minute-level granularity; duplicates are still unavoidable. There can be some tricky cases here. Consider the following case [1]: m1 is created before m2, but due to latency differences, m1 arrives at cluster 1 before m2 does, while m2 arrives at cluster 2 before m1 does. If a consumer consumed m2 in cluster 2 and fails over to cluster 1, simply searching by CT2 will miss m1, because m1 has a larger offset than m2 in cluster 2 but a smaller offset than m2 in cluster 1. So the same trick of searching with CT - MaxLatencyOfCluster is still needed. | In the cross-cluster failover case, both solutions can provide a strong guarantee of not losing messages, but both depend on knowledge of MaxLatencyOfCluster.
4 | Get lag for consumers by time | Know how long a consumer is lagging by time. | With LogAppendTime in each message, a consumer can easily find out its lag by time and estimate how long it might take to reach the log end. | Not supported. |
5 | Broker-side latency metric | Let the broker report the latency of each topic, i.e. LAT - CT. | The latency can simply be reported as LAT - CT. | The latency can be reported as System.currentTimeMillis - CT. | The two solutions are equivalent. This latency information can be used as MaxLatencyOfCluster in use case 3.
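As a small illustration of use case 5 (a hypothetical helper, not an actual broker metric), the two reporting options differ only in where the reference time comes from:

```java
/** Hypothetical helper for use case 5: report LAT - CT when LogAppendTime is
 *  available (Option 1); otherwise fall back to wall clock minus CreateTime. */
static long topicLatencyMs(long createTime, long logAppendTime, boolean hasLogAppendTime) {
    return hasLogAppendTime ? logAppendTime - createTime
                            : System.currentTimeMillis() - createTime;
}
```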
From the use cases listed above, having a LogAppendTime index is generally better than having a CreateTime-based index.
The change is backward compatible after KIP-31 and KIP-32 are checked in.
The most straightforward approach to a time index is to let the log index files carry a timestamp with each entry:
```
Log Index Entry => Offset Position Timestamp
  Offset => int32
  Position => int32
  Timestamp => int64
```
Because the index entry size becomes 16 bytes instead of 8, the index file size doubles. As an example, one of our brokers has ~3500 partitions, and its index files take about 16 GB of memory; with this new format, the memory consumption would be 32 GB.
Instead, the proposal above (Option 1) builds a separate LogAppendTime-based time index per log segment. The time index is built from the log index file: every time a new entry is inserted into the log index file, we look at the timestamp of the message, and if it falls into the next minute, we insert an entry into the time index as well.
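A minimal sketch of this append path, assuming the 12-byte entry layout and a memory-mapped time index file; names are illustrative, not actual broker code:

```java
import java.nio.ByteBuffer;
import java.util.concurrent.TimeUnit;

/** Sketch of the minute-granularity append path: whenever a log index entry is
 *  written, a time index entry is added only when the message timestamp crosses
 *  into a new minute. */
class TimeIndexAppender {
    private final ByteBuffer timeIndex; // memory-mapped SegmentBaseOffset.time.index
    private final long baseOffset;      // base offset of the owning log segment
    private long lastIndexedMinute = -1;

    TimeIndexAppender(ByteBuffer timeIndex, long baseOffset) {
        this.timeIndex = timeIndex;
        this.baseOffset = baseOffset;
    }

    /** Called on each log index insert with the message's timestamp and absolute offset. */
    void maybeAppend(long timestamp, long offset) {
        long minute = TimeUnit.MILLISECONDS.toMinutes(timestamp);
        if (minute > lastIndexedMinute) {                  // first message of a new minute
            timeIndex.putLong(timestamp);                  // 8-byte timestamp
            timeIndex.putInt((int) (offset - baseOffset)); // 4-byte relative offset
            lastIndexedMinute = minute;
        }
    }
}
```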
Users don't typically need to look up offsets with seconds granularity.
Another option (Option 2) is to build the index based on the CreateTime of messages. Similar to Option 1, there will be one time index file per log segment.
The biggest challenge of indexing using CreateTime is that CreateTime can be out of order.
One solution is as follows: the broker maintains an in-memory timestamp index map from CreateTime to offset, and persists it in a timestamp index file for each log segment. The entries in the file have the following format:
```
Time Index Entry => Timestamp Offset
  Timestamp => int64
  Offset => int32
```
The timestamp index file is thus simply a persistent copy of the in-memory timestamp index map; the broker will load the map from the file on startup.
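A sketch of this option, assuming a sorted in-memory map keyed by CreateTime that is flushed to, and reloaded from, the timestamp index file; the class and its behavior are illustrative, not a committed design:

```java
import java.io.*;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

/** Sketch of Option 2: an in-memory CreateTime -> relative-offset map with a
 *  persistent copy using the Timestamp/Offset entry layout above. */
class CreateTimeIndex {
    // Sorted map, so a lookup can find the insertion point for an
    // out-of-order CreateTime.
    private final ConcurrentSkipListMap<Long, Integer> byTimestamp = new ConcurrentSkipListMap<>();

    void put(long createTime, int relativeOffset) {
        // Keep the earliest offset seen for a timestamp so a search never skips messages.
        byTimestamp.merge(createTime, relativeOffset, Math::min);
    }

    /** Persist the map as a sequence of (int64 timestamp, int32 offset) entries. */
    void flush(File file) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(file))) {
            for (Map.Entry<Long, Integer> e : byTimestamp.entrySet()) {
                out.writeLong(e.getKey());
                out.writeInt(e.getValue());
            }
        }
    }

    /** Reload the persistent copy on broker startup. */
    void load(File file) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream(file))) {
            while (in.available() > 0)
                byTimestamp.merge(in.readLong(), in.readInt(), Math::min);
        }
    }
}
```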