Status

Current state: Under discussion

Discussion thread

JIRA

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

The consumer session timeout is a crucial configuration for group stability. When a member fails, the group must pause in order to rebalance partitions. If the failure is spurious, then we often get two rebalances: one when the member is removed from the group and another when it rejoins. Spurious failures may be unlikely when the cluster has dedicated resources and has been properly tuned. However, as explained in KIP-537, multi-tenant cloud environments with dynamically changing loads are becoming the norm. In practice, we find that transient network/load failures are much more common than genuine client failures, so we propose to increase the consumer's default session timeout from 10s to 40s to reduce the likelihood of spurious failures.

A second, related issue concerns the consistency of configurations. By default, the consumer uses a 30s request.timeout.ms, which is consistent with both the producer and the admin client. However, this does not work well with the 10s default for session.timeout.ms. When the connection to the coordinator is not closed cleanly, the consumer will wait 30s before it disconnects and retries. By the time that happens, the group has often already kicked the member out and moved on. By increasing the session timeout to 40s, we allow enough time to reconnect and retry after one request timeout.
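The timing argument above can be checked with simple arithmetic. The following is a minimal sketch, not Kafka client code: the class and method names are hypothetical, and it assumes the simplification that the session timer and the timed-out request start at roughly the same moment.

```java
// Hypothetical helper for illustration only; not part of the Kafka client API.
final class TimeoutBudget {
    private TimeoutBudget() {}

    // Time remaining in the session once a single request timeout has elapsed.
    // A value of zero or less means the session expired before the consumer
    // even noticed the dead connection.
    static long slackAfterOneRequestTimeoutMs(long sessionTimeoutMs, long requestTimeoutMs) {
        return sessionTimeoutMs - requestTimeoutMs;
    }

    // True if the consumer still has time to reconnect and retry before
    // the coordinator expires its session.
    static boolean canRetryBeforeSessionExpires(long sessionTimeoutMs, long requestTimeoutMs) {
        return slackAfterOneRequestTimeoutMs(sessionTimeoutMs, requestTimeoutMs) > 0;
    }
}
```

With the current defaults (10s session timeout, 30s request timeout) the slack is negative, so the member is removed before it can retry. With the proposed 40s session timeout, roughly 10s remain for the reconnect and retry.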

Public Interfaces

We propose to increase the default value of `session.timeout.ms` in the consumer from 10s to 40s. We are also changing the behavior of group.min.session.timeout.ms and group.max.session.timeout.ms as described below.

Note that we are not making any change to heartbeat.interval.ms. Although the increased session timeout allows for less frequent heartbeats, the heartbeat also serves the purpose of discovering that a rebalance is in progress, so heartbeating less often would delay that discovery.
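For applications that depend on the old behavior, the session timeout can still be set explicitly. The sketch below shows what that might look like using plain java.util.Properties with the configuration keys named in this proposal. The class and method names are hypothetical, and the values are the ones discussed here (40s proposed default, 10s previous default, 30s request timeout).

```java
import java.util.Properties;

// Hypothetical helper for illustration only; not part of the Kafka client API.
final class ConsumerTimeoutConfigs {
    private ConsumerTimeoutConfigs() {}

    // The consumer timeouts as they would look under this proposal.
    static Properties proposedDefaults() {
        Properties props = new Properties();
        props.setProperty("session.timeout.ms", "40000"); // proposed default (was 10000)
        props.setProperty("request.timeout.ms", "30000"); // unchanged
        return props;
    }

    // Returns a copy of the given configs with an explicit session timeout,
    // e.g. to keep the previous 10s behavior after upgrading the client.
    static Properties withSessionTimeoutMs(Properties base, int sessionTimeoutMs) {
        Properties props = new Properties();
        props.putAll(base);
        props.setProperty("session.timeout.ms", Integer.toString(sessionTimeoutMs));
        return props;
    }
}
```

For example, `ConsumerTimeoutConfigs.withSessionTimeoutMs(ConsumerTimeoutConfigs.proposedDefaults(), 10000)` yields a configuration pinned to the previous 10s value while leaving the request timeout untouched.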

Proposed Changes

In addition to the default session timeout change for the consumer, we propose to change the behavior of the broker configurations group.min.session.timeout.ms and group.max.session.timeout.ms. Previously, if a consumer attempted to join a group with a session timeout outside the range allowed by these configurations, the broker would return INVALID_SESSION_TIMEOUT in the JoinGroup response, which was treated as a fatal error. Instead, the coordinator will now clamp the value to the nearest limit. For example, if the session timeout is less than group.min.session.timeout.ms, then the coordinator will ignore the client-provided value and use group.min.session.timeout.ms. Likewise, a value greater than group.max.session.timeout.ms will be replaced by group.max.session.timeout.ms. At the same time, we will make both of these configurations dynamic so that they can be changed without restarting brokers.
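The proposed coordinator behavior amounts to clamping the requested value into the configured range. A minimal sketch, assuming a hypothetical helper rather than the actual coordinator code (the class name, method name, and the limits used in the example are illustrative):

```java
// Hypothetical helper for illustration only; not the broker's implementation.
final class SessionTimeoutClamp {
    private SessionTimeoutClamp() {}

    // Returns the session timeout the coordinator would use for a member:
    // the requested value if it lies within [minMs, maxMs], otherwise the
    // nearest limit. minMs and maxMs stand in for group.min.session.timeout.ms
    // and group.max.session.timeout.ms.
    static int effectiveSessionTimeoutMs(int requestedMs, int minMs, int maxMs) {
        if (minMs > maxMs) {
            throw new IllegalArgumentException("min session timeout exceeds max");
        }
        return Math.max(minMs, Math.min(requestedMs, maxMs));
    }
}
```

With illustrative limits of 6000 and 300000, a request for 5000 would be raised to 6000, a request for 40000 would be accepted unchanged, and a request for 600000 would be lowered to 300000. In none of these cases would the member receive INVALID_SESSION_TIMEOUT.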

The motivation for this change of behavior is primarily to give operators more graceful options for restricting session timeouts. Today, if an operator wants to change either of these settings, there is no safe way to do so without potentially causing existing applications to fail.

Compatibility, Deprecation, and Migration Plan

We don't foresee any complications with compatibility. New clients will take the new default and old clients will continue to use the old value.

Rejected Alternatives

N/A
