Current state: Accepted
Discussion thread: here (Not happening yet)
JIRA: here
Instantiating a new client may fail fatally if the bootstrap server cannot be resolved, whether because of misconfiguration or a transient network issue such as slow DNS. This is suboptimal because it can take a long time for the address to become available at the DNS server, and in the meantime users must keep retrying the instantiation. In addition, the ConfigException that is thrown does not accurately reflect the root cause of the problem, which makes this failure case hard to handle. We think it is reasonable to give users a grace period in which resolution can be retried if the address cannot be resolved immediately. Poisoning the client during construction can also be obstructive; I think it is better to fail the client on its first attempt to connect to the network.
This KIP proposes moving the bootstrapping logic from the constructor to the NetworkClient poll, for two reasons:
1. It avoids failing the client upon instantiation. In many cases, that failure also kills the application, which might not be desirable.
2. Piggybacking on the client poll is a more natural way to retry.
We propose to add a new configuration option for timing out the bootstrapping process, a new exception type for handling bootstrap-related issues, and additional logging to aid in diagnosing bootstrapping failures.
Client Constructor: The constructor will only parse the bootstrap configuration.
NetworkClient:
bootstrap.resolve.timeout.ms
The proposed configuration specifies the maximum amount of time a client can spend trying to resolve the bootstrap server address. If resolution cannot be completed within this time, a BootstrapResolutionException will be thrown.
Type: long
Default: 120000 (i.e. 2 minutes)
Valid Values: 0 - LONG_MAX
Importance: high
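As a rough illustration of how the proposed option might be supplied alongside the existing bootstrap.servers setting, here is a minimal sketch using plain java.util.Properties. The class and method names and the host broker.example.com:9092 are hypothetical, and bootstrap.resolve.timeout.ms is only the key proposed by this KIP, not one available in released clients.

```java
import java.util.Properties;

// Hypothetical helper; only the two configuration keys come from the KIP text.
final class BootstrapTimeoutConfigExample {
    // Proposed by this KIP; assumed not to exist in released clients.
    static final String BOOTSTRAP_RESOLVE_TIMEOUT_MS = "bootstrap.resolve.timeout.ms";

    static Properties clientProperties(String bootstrapServers, long resolveTimeoutMs) {
        // The KIP lists the valid range as 0 - LONG_MAX, so reject negatives.
        if (resolveTimeoutMs < 0) {
            throw new IllegalArgumentException("timeout must be non-negative");
        }
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty(BOOTSTRAP_RESOLVE_TIMEOUT_MS, Long.toString(resolveTimeoutMs));
        return props;
    }
}
```

For example, clientProperties("broker.example.com:9092", 30000L) would shorten the grace period from the proposed 2-minute default to 30 seconds.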
Name: BootstrapResolutionException extends KafkaException
Message: "Couldn't resolve server {} from {} as DNS resolution failed for {}"
Type: Non-retriable.
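The intended flow (retry resolution on each poll, and fail only once the bootstrap timeout has elapsed) could be sketched roughly as follows. This is not the NetworkClient implementation: PollDrivenBootstrap, its injected resolver and clock, and the exception message are all hypothetical, and the exception here extends RuntimeException as a stand-in for KafkaException so that the example has no Kafka dependency.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.function.LongSupplier;
import java.util.function.Predicate;

// Stand-in for the proposed exception (the KIP has it extend KafkaException).
class BootstrapResolutionException extends RuntimeException {
    BootstrapResolutionException(String message) { super(message); }
}

// Hypothetical sketch: bootstrap is attempted lazily on poll, not in the constructor.
final class PollDrivenBootstrap {
    private final String host;
    private final long timeoutMs;
    private final Predicate<String> resolver; // returns true if the host resolves
    private final LongSupplier clockMs;       // injected so the sketch is testable
    private long firstAttemptMs = -1;
    private boolean bootstrapped = false;

    PollDrivenBootstrap(String host, long timeoutMs,
                        Predicate<String> resolver, LongSupplier clockMs) {
        // Construction only records configuration; no DNS lookup happens here.
        this.host = host;
        this.timeoutMs = timeoutMs;
        this.resolver = resolver;
        this.clockMs = clockMs;
    }

    // A resolver backed by real DNS, usable as PollDrivenBootstrap::resolvesViaDns.
    static boolean resolvesViaDns(String host) {
        try {
            return InetAddress.getAllByName(host).length > 0;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    // Returns true once bootstrapped, false while still retrying,
    // and throws once the bootstrap timeout has elapsed without success.
    boolean poll() {
        if (bootstrapped) return true;
        long now = clockMs.getAsLong();
        if (firstAttemptMs < 0) firstAttemptMs = now; // timer starts on first poll
        if (resolver.test(host)) {
            bootstrapped = true;
            return true;
        }
        if (now - firstAttemptMs >= timeoutMs) {
            // Illustrative message only; the KIP's template has three placeholders.
            throw new BootstrapResolutionException(
                "DNS resolution for " + host + " did not succeed within " + timeoutMs + " ms");
        }
        return false;
    }
}
```

A caller would construct it with, for instance, PollDrivenBootstrap::resolvesViaDns and System::currentTimeMillis, then invoke poll() from its normal polling loop. Starting the timer on the first poll rather than at construction is an assumption of this sketch.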
This section outlines how clients can react to bootstrap failures. In particular, I want to cover two common cases:
Case 1: Non-transient case
When the bootstrap timeout expires, the client will throw a BootstrapResolutionException.
Case 2: Transient Network Issue
Consumer poll won't return any records until the client has been bootstrapped. If the issue cannot be resolved within the bootstrap timeout, a BootstrapResolutionException will be thrown.
Case 1: Non-transient case
The BootstrapResolutionException will be thrown in send() and partitionsFor() when the bootstrap timeout expires. If max.block.ms elapses before the bootstrap timeout expires, a TimeoutException will be thrown instead.
Case 2: Transient Network Issue
The send() and partitionsFor() methods will be blocked on bootstrap until either the max.block.ms or the bootstrap timeout elapses.
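To make the interaction between the two timers concrete, here is a small sketch that models a single blocking call issued at the moment bootstrapping starts, with an address that never resolves. The class, enum, and method names are hypothetical, and the tie-breaking when both timers are equal is an assumption, since the text above does not specify it.

```java
// Hypothetical model of which exception a blocked send()/partitionsFor() would surface.
final class ProducerBootstrapOutcome {
    enum Outcome { TIMEOUT_EXCEPTION, BOOTSTRAP_RESOLUTION_EXCEPTION }

    // Assumes the call and the bootstrap timer start together and the
    // address is never resolved.
    static Outcome whenNeverResolved(long maxBlockMs, long bootstrapResolveTimeoutMs) {
        // Assumption: on a tie, the bootstrap timeout is reported.
        return maxBlockMs < bootstrapResolveTimeoutMs
            ? Outcome.TIMEOUT_EXCEPTION
            : Outcome.BOOTSTRAP_RESOLUTION_EXCEPTION;
    }
}
```

Under this model, the producer's default max.block.ms of 60000 combined with the proposed default of 120000 gives TIMEOUT_EXCEPTION, so a first blocked call would see a TimeoutException before the bootstrap timeout is reached.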
Case 1: Non-transient case
The AdminClient API call results will either time out, if the request timeout elapses first, or be completed exceptionally with a BootstrapResolutionException.
Case 2: Transient Network Issue
The user won't be able to get the results back until the address is resolved. Meanwhile, the API calls can expire.
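From the caller's side, the two outcomes described above might be told apart as in the following sketch. It uses java.util.concurrent.CompletableFuture as a stand-in for the future returned by an admin call, and a nested stand-in exception extending RuntimeException rather than KafkaException; AdminResultHandling and describe are hypothetical names.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical caller-side handling; not part of any Kafka API.
final class AdminResultHandling {
    // Stand-in for the exception proposed by the KIP.
    static class BootstrapResolutionException extends RuntimeException {
        BootstrapResolutionException(String message) { super(message); }
    }

    static String describe(CompletableFuture<?> result, long waitMs) {
        try {
            result.get(waitMs, TimeUnit.MILLISECONDS);
            return "completed";
        } catch (TimeoutException e) {
            // The caller stopped waiting before the address was resolved.
            return "timed out";
        } catch (ExecutionException e) {
            // A future completed exceptionally surfaces its cause here.
            return e.getCause() instanceof BootstrapResolutionException
                ? "bootstrap failed"
                : "failed";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return "interrupted";
        }
    }
}
```

Checking the cause of the ExecutionException lets an application distinguish a bootstrap failure, which calls for checking the network setup or configuration, from an ordinary request failure.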
The exception is meant to be fatal, so the user should check their network setup, configuration, or adjust the timeout.
The user can continue to retry, but this exception is meant to alert the user to take action when bootstrapping fails.
NetworkClient
Test DNS resolution upon its initial poll
Test if the right exception type is thrown
Existing clients (Consumer, Producer, AdminClient)
Test successful bootstrapping upon retrying
We've discussed many alternatives. Eventually, we asked ourselves what the goal of this KIP is, i.e., giving people a chance to retry on DNS resolution without poisoning the client. That came down to two resolutions: 1. giving people a configurable timeout, and 2. adding a fatal error to alert the user.
Here are the rejected alternatives:
Maintain the current code behavior and add a retry loop with a timeout.
Pros: Same logic, less code change.
Cons: Do users want to be blocked on instantiating the client? I don't like this idea.
Throw a DNS resolution exception upon failure, with no retry.
Pros: No additional config is needed
Cons: This is a behavioral change, and the application owner might need to rewrite the exception handling, i.e. the logic that catches the DNS failure.