Table of Contents |
---|
Status
Current state: Discarded in deference to KIP-266
Discussion thread:
JIRA:
Jira | ||||||||
---|---|---|---|---|---|---|---|---|
|
Jira | ||||||||
---|---|---|---|---|---|---|---|---|
|
...
Motivation
The current implementation of poll takes a timeout parameter, but the timeout doesn't apply to all parts of the implementation that block.
...
This causes a number of problems. First, several have noticed conditions in which wait-for-metadata blocks the thread uninterruptibly forever (see the linked Jiras). Second, there is a need for users of Consumer to just do #1, which they currently do by passing a timeout of 0ms, but the semantics of this usage are not clean (e.g., it's not clear whether or not it's safe to ignore the result). Third, during a regular poll, despite passing a specific timeout, there's no guarantee about how long the call will block, since an indefinite amount of time may be spent in phase #1.
Proposed Public Interface Change
We propose the following interface changes to Consumer and KafkaConsumer:
...
Just as it does today, poll will still send async requests when the timeout is too low to actually wait for a response.
Compatibility, Deprecation, and Migration Plan
Since poll will send async requests and check on the responses in subsequent calls, it should continue to work as expected if callers retry their calls.
There is a possibility that callers depend on the consumer blocking until it gets an assignment, though, so we plan to throw a TimeoutException if phase #1 times out. This should be sufficient to alert callers to the change in semantics and protect them from broken assumptions.
Test Plan
This change will break numerous of our tests that call poll. We will update the tests either to allow more time to poll or to call the new method.
Also Qiang Zhao has submitted a new unit test verifying poll doesn't block forever: https://github.com/apache/kafka/pull/4861/files#diff-444ce0cab1dc8ec666294c4eafc9baefR1195
Rejected Alternatives
Just change the semantics of poll with no new method
This would be the simplest thing to do. We could just change the semantics of poll to use the timeout for both metadata and data. Adding about 30s to the timeouts in all our tests is sufficient to get them to pass, so we could share the same advice in the release notes.
But this actually breaks the existing use case for blocking on assignment, which we have witnessed in some open source projects. There would simply be no way to allow enough time for assignment, but not enough to to a poll, so you'd be forced to save the results for later processing. In other words, this alternative causes more disruption for some users, since it would force them to restructure their Consumer usage instead of just using a new method where appropriate.
Method variant with separate timeout for metadata
We could instead add a new poll variant like:
...
Plus, from an informal survey of usage, it really seems like you would have two distinct use cases, poll(some_value, 0)
with the result ignored for just getting an assignment, and poll(some_value, some_other_value)
with the result saved for really doing a poll. In the former usage, I'd argue it's actually a bit unsafe to ignore the result, since you can't be sure that poll won't return some results, and if it does, you'll miss them (the consumer would have no way of knowing that you're ignoring the results). The latter usage is also not great, since it's not clear how I as a caller should divide my available time between the assignment "phase" and actually polling.
Specify the metadata timeout via config
We could instead add a new config like one of these:
...