To be Reviewed By: April 22nd, 2022

Authors: Alberto Gomez (alberto.gomez@est.tech)

Status: Draft | Discussion | Active | Dropped | Superseded

Superseded by: N/A

Related: N/A

Problem

Threads that handle client requests can hang indefinitely because of bugs in the code. This has been observed in systems in the field, with threads stuck for months. Each stuck thread reduces the capacity of the system, and a complete service outage can follow if all the threads assigned to client requests reach that state.

These threads normally get stuck either waiting for a condition that is never fulfilled (for example, waiting for a CountDownLatch that is never counted down) or waiting indefinitely (and uninterruptibly) for a response from another member of the system.
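The first failure mode can be reproduced in a few lines. The following is a minimal sketch, not Geode code: the class name, method names, and thread name are hypothetical. It shows a thread parked in the WAITING state on a latch that nobody counts down, which stays there until something external intervenes.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class StuckLatchDemo {

  /** Starts a daemon thread that waits on the latch with no timeout. */
  static Thread startWorker(CountDownLatch latch) {
    Thread worker = new Thread(() -> {
      try {
        latch.await(); // blocks until the count reaches zero, however long that takes
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // released only because another thread interrupted us
      }
    }, "hypothetical-request-thread");
    worker.setDaemon(true); // do not keep the JVM alive if the worker never returns
    worker.start();
    return worker;
  }

  /** Polls until the thread reaches the expected state or the time budget runs out. */
  static boolean reachesState(Thread t, Thread.State expected, long budgetMillis) {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(budgetMillis);
    while (t.getState() != expected) {
      if (System.nanoTime() - deadline >= 0) {
        return false;
      }
      Thread.yield();
    }
    return true;
  }

  public static void main(String[] args) throws InterruptedException {
    CountDownLatch neverCountedDown = new CountDownLatch(1);
    Thread worker = startWorker(neverCountedDown);

    // The worker parks inside await() and would remain there indefinitely.
    System.out.println("parked in WAITING: "
        + reachesState(worker, Thread.State.WAITING, 2000));

    // Only an interrupt (or a countDown() that never comes) gets it out.
    worker.interrupt();
    worker.join(2000);
    System.out.println("alive after interrupt: " + worker.isAlive());
  }
}
```

In this sketch the wait is interruptible, so an interrupt is enough to free the thread. The second failure mode, an uninterruptible wait, does not react to interrupts and needs a timeout inside the wait itself.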

Anti-Goals

This proposal does not intend to provide a mechanism for identifying deadlocks in a timely manner so that their root cause can be addressed and the system recovered. The aim is only to provide a defensive mechanism that releases threads that have been stuck for such an unreasonable amount of time that stopping them is considered the best option.

Solution

To release client request threads that have been stuck for so long that the client can be assumed to have lost interest in the response, it is proposed to add a mechanism to Geode that releases these threads after a configurable time.

The release of the threads would be done by two means:

  • For threads waiting uninterruptibly for a response from another member, the new configurable timeout will allow the thread to exit the wait once the waiting time reaches the timeout value.
  • For threads waiting on any other condition, the mechanism implemented by the ThreadsMonitoringProcess and AbstractExecutor classes, which Geode uses to detect threads that have been stuck for some time, could be enhanced to also interrupt threads that have been stuck for longer than the new configurable timeout.
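The two means above can be sketched as follows. This is an illustration under stated assumptions, not the implementation in the draft pull request and not the API of ThreadsMonitoringProcess or AbstractExecutor: the class StuckThreadRelease, its nested Monitor, and all method names are hypothetical, and a CountDownLatch stands in for "a response from another member".

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class StuckThreadRelease {

  /**
   * First means: waits while ignoring interrupts, but gives up after maxWaitMillis.
   * Returns true if the latch reached zero, false if the timeout expired first.
   */
  static boolean awaitUninterruptibly(CountDownLatch latch, long maxWaitMillis) {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxWaitMillis);
    boolean interrupted = false;
    try {
      while (true) {
        long remaining = deadline - System.nanoTime();
        if (remaining <= 0) {
          return latch.getCount() == 0; // timeout reached: stop waiting
        }
        try {
          return latch.await(remaining, TimeUnit.NANOSECONDS);
        } catch (InterruptedException e) {
          interrupted = true; // remember the interrupt and keep waiting
        }
      }
    } finally {
      if (interrupted) {
        Thread.currentThread().interrupt(); // restore the flag for the caller
      }
    }
  }

  /** Second means: a hypothetical monitor that tracks when each thread started its task. */
  static class Monitor {
    private final Map<Thread, Long> startNanos = new ConcurrentHashMap<>();

    void taskStarted(Thread t) {
      startNanos.put(t, System.nanoTime());
    }

    void taskFinished(Thread t) {
      startNanos.remove(t);
    }

    /** Interrupts every tracked thread whose task has run longer than maxStuckMillis. */
    int interruptStuck(long maxStuckMillis) {
      long limit = TimeUnit.MILLISECONDS.toNanos(maxStuckMillis);
      long now = System.nanoTime();
      int interruptedCount = 0;
      for (Map.Entry<Thread, Long> entry : startNanos.entrySet()) {
        if (now - entry.getValue() > limit) {
          entry.getKey().interrupt();
          interruptedCount++;
        }
      }
      return interruptedCount;
    }
  }

  public static void main(String[] args) throws InterruptedException {
    // First means: the wait ends at the timeout even though no response arrives.
    System.out.println("response arrived: "
        + awaitUninterruptibly(new CountDownLatch(1), 100));

    // Second means: the monitor interrupts a thread stuck in an interruptible wait.
    Monitor monitor = new Monitor();
    Thread worker = new Thread(() -> {
      monitor.taskStarted(Thread.currentThread());
      try {
        new CountDownLatch(1).await(); // never counted down
      } catch (InterruptedException e) {
        System.out.println("worker released by interrupt");
      } finally {
        monitor.taskFinished(Thread.currentThread());
      }
    });
    worker.setDaemon(true);
    worker.start();
    Thread.sleep(200);
    monitor.interruptStuck(50);
    worker.join(2000);
  }
}
```

The sketch keeps the two paths separate because an interrupt has no effect on a wait that deliberately swallows interrupts. That wait can only be bounded from inside, by a deadline.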

A draft pull request has been created in order to show what the implementation of this mechanism could look like: https://github.com/apache/geode/pull/7555

Changes and Additions to Public Interfaces

Two new configurable parameters are proposed to specify the new timeouts:

  • maxWaitTimeout
  • maxThreadStuckTime

Alternatively, just one parameter could be used to release threads waiting on any of the conditions described above.
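As an illustration of how the two parameters might be carried around in code, here is a minimal sketch. Only the names maxWaitTimeout and maxThreadStuckTime come from this proposal. Everything else is an assumption made for the example: the ReleaseTimeouts class, reading the values from JVM system properties, millisecond units, and treating 0 as "disabled". The way Geode would actually expose these parameters is not specified here.

```java
public class ReleaseTimeouts {
  // Assumption: both values are in milliseconds and 0 means the feature is off.
  final long maxWaitTimeoutMillis;
  final long maxThreadStuckTimeMillis;

  ReleaseTimeouts(long maxWaitTimeoutMillis, long maxThreadStuckTimeMillis) {
    if (maxWaitTimeoutMillis < 0 || maxThreadStuckTimeMillis < 0) {
      throw new IllegalArgumentException("timeouts must be >= 0");
    }
    this.maxWaitTimeoutMillis = maxWaitTimeoutMillis;
    this.maxThreadStuckTimeMillis = maxThreadStuckTimeMillis;
  }

  /** Hypothetical loader: reads the two parameters from JVM system properties. */
  static ReleaseTimeouts fromSystemProperties() {
    return new ReleaseTimeouts(
        Long.getLong("maxWaitTimeout", 0L),
        Long.getLong("maxThreadStuckTime", 0L));
  }

  /** True if uninterruptible waits should be bounded. */
  boolean waitTimeoutEnabled() {
    return maxWaitTimeoutMillis > 0;
  }

  /** True if the monitor should interrupt threads stuck for too long. */
  boolean stuckTimeEnabled() {
    return maxThreadStuckTimeMillis > 0;
  }

  public static void main(String[] args) {
    System.setProperty("maxWaitTimeout", "600000"); // example value only
    ReleaseTimeouts timeouts = fromSystemProperties();
    System.out.println("wait timeout enabled: " + timeouts.waitTimeoutEnabled());
    System.out.println("stuck time enabled: " + timeouts.stuckTimeEnabled());
  }
}
```

With the single-parameter alternative, both fields would simply be populated from the same value.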

Performance Impact

No impacts foreseen

Backwards Compatibility and Upgrade Path

No impacts foreseen

Prior Art

-

FAQ


Errata

