Status

Current state: "Under Discussion"

Discussion thread

Voting thread: 

JIRA: KAFKA-15229 - Getting issue details... STATUS

Motivation

The Kafka Connect config task.shutdown.graceful.timeout.ms. has a default value of 5s. As per it's definition:

Amount of time to wait for tasks to shutdown gracefully. This is the total amount of time, not per task. All task have shutdown triggered, then they are waited on sequentially.

it is the total timeout for all tasks to shutdown. When a worker needs to shutdown, it needs to stop all it's running tasks. It would first issue a stop for all the assigned tasks and then wait for all the tasks to stop. The timeout above defines how long to wait for tasks to be stopped. The default value of this config(5s) is ok for smaller clusters running a handful of tasks,  but on a larger cluster it is possible that all the tasks won't be able to successfully stop within the above timeout value. For all such tasks, we would notice log lines like:

Graceful stop of task <task-id> failed.

In case of failure in graceful stop of tasks, the tasks are cancelled which means that they won't send out an UNASSIGNED  status update. Let's say the task stop was triggered by a worker(non-leader) going down. This doesn't lead to confusions when the cluster is configured to use the EAGER  protocol because the tasks would be immediately reassigned after the round of revocations. However, If the cluster is configured to use Incremental Cooperative Assignor, then the task wouldn't be reassigned until scheduled.rebalance.delay.max.ms interval elapses. So, for that amount of duration, the task's status would show up as RUNNING . This can be confusing for the users.

This problem can be exacerbated on cloud environments(like kubernetes pods) because there is a high chance that the running status would be associated with an older worker_id which doesn't even exist in the cluster anymore.

While the net effect of all of this is not catastrophic i.e it won't lead to any processing delays or loss of data but the status of the task would be off leading to confusion. So, the proposal is to increase the default value to a higher value. This would also be helpful because:

  • While users can set this config on their clusters, this is a low importance config and generally goes unnoticed. Increasing the default value would help new users, who aren't aware of this behaviour to not get confused.
  • Since version 2.3, Incremental Cooperative Assignor has been the default connect protocol and since 2.4.0, it is SESSIONED . Considering this, it makes sense to not have the default value of the config to such a low value of 5s. Also, one thing worth mentioning is that in KIP-415: Incremental Cooperative Rebalancing in Kafka Connect#Compatibility,Deprecation,andMigrationPlan, it is mentioned that with version 3.0, the Connect protocol 0 would be marked as deprecated and removed in version 4.0. However, I believe that hasn't happened yet but we might decide to deprecate the Connect protocol 0 in the future.

Public Interfaces

No new Public Interfaces would be added. We will update the default value of  task.shutdown.graceful.timeout.ms config to 30s. The documentation would also be updated to reflect the same.

Proposed Changes

As already mentioned, the default value of  task.shutdown.graceful.timeout.ms config would be updated to 30s. We have been using this value internally for some time and have noticed that for clusters running close to 30 tasks, we don't notice failures in graceful shutdowns. Of-course, it is not always possible to achieve graceful shutdown for all the tasks always because of the number of tasks running on the worker or if a task's stop method takes time due to some reason. It should be noted as well that the task.shutdown.graceful.timeout.ms  config is used as a timeout for some of the internal resources created within the Connect runtime, so we don't want to set a very large value thereby delaying the closure of workers for example. Considering these points, 30s seems like a decent value as default.

One thing to note here is that when a task is restarted, it is stopped and then started. The stop of the task is also governed by the same  task.shutdown.graceful.timeout.ms  config. More importantly, the restart happens on the tick thread which is also responsible for invoking poll()  regularly to ensure group membership. Considering an increase in task restart time can lead to spurious poll timeouts on workers, the stop phase in task restart would use a timeout of 5s. The value would be an internal config and won't be exposed to the users.

Compatibility, Deprecation, and Migration Plan

Changes are backward compatible. It involves no deprecations and when a worker running a version with this change is started, the new values would be in-effect.

Test Plan

Update any tests referring to the config value and depending upon the previous value. 

Rejected Alternatives

N/A



  • No labels