This page is meant as a template for writing a KIP. To create a KIP choose Tools->Copy on this page and modify with your content and replace the heading with the next KIP number and a description of your issue. Replace anything in italics with your own description.
Current state: Draft
Discussion thread: here [Change the link from the KIP proposal email archive to your own email thread]
JIRA: here [Change the link from KAFKA-1 to your own ticket]
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Kafka Streams have a very tight coupling with the input partition counts. There are dimensions that could be scaled:
- Number of consumers
- Number of tasks (thread)
- Number of state stores
Out of all the scaling needs, as we discussed in KIP-X, the scaling out consumer is indeed hard as we have to struggle with server side filtering and a not well scaled offset commit mode. Furthermore, most cases the number of consumers is not the bottleneck - the total parallelism is. The biggest win could in fact come from scaling the single consumer application, instead of adding more consumers.
One more thing we don't want to tackle is the synchronization. Constantly synchronizing the states between threads means a lot of performance penalty. So the individual commit feature is still a favored approach to solve the problem. What we need is just a different way of viewing the offset data format. Instead of mapping from offset → committed boolean, we actually need to map from worker-id → committed offsets. From the old design, coordination point was pushed down and way frequent that potentially hurts the performance. If different sub task gets the chance to be assigned with partial tasks, the
High Level Design
In offset commit protocol, we always have a topic partition → offset mapping to remember our progress. In fact if suppose we build a consumer with multi-threading access, we could actually do the rebalance assignment of key ranges to workers and let those mappings returned and stored on broker side. In this way, say if we have two workers A and B sharing the same consumer, they should be able to commit their progress individually by (worker-id, offset) pairs. Adding the group assignment message which has key range mapping, we could easily do the client side filtering for the first generation if possible. This work also unblocks the potential later if we want consumer level scaling by defining their individual key ranges, so that we could allow concurrent commit.
Concurrent RocksDB access
To make sure we have predictable amount of state, threads will be accessing a shared state store which means the task assignment pointing to the stream thread will be shared between background threads. To make a fair assignment and avoid concurrent access as much as possible, we take an iterative approach. Based on the parallelising factor, we would determine the task assignment
Describe the new thing you want to do in appropriate detail. This may be fairly extensive and have large subsections of its own. Or it may be a few sentences. Use judgement based on the scope of the change.
Compatibility, Deprecation, and Migration Plan
- What impact (if any) will there be on existing users?
- If we are changing behavior how will we phase out the older behavior?
- If we need special migration tools, describe them here.
- When will we remove the existing behavior?
If there are alternative ways of accomplishing the same thing, what were they? The purpose of this section is to motivate why the design is the way it is and not some other way.
Add parallelism on producing. Right now producers use customized partitioners to choose to write to different partitions. We could add a key-range to producer as well to split the batches to write different keys concurrently. The offset of each split phantom partitions will be strictly larger than the log end offset when doing the split, and during the split producers will temporarily restrict access until the split is finished.
To merge a split, one has to first move the partition leaders to co-locate on the same broker. Any produce request going to the primary partition broker will be written to the primary partition log instead of fan-out partition log. The fan-out partitions also have a consumed offset which means there are still data that are not consumed. Either way, we will advance the partition offset with this batch while adding a range commit to the offset log.