...

This proposal is not intended to provide a mechanism for identifying deadlocks in a timely manner so that their root cause can be addressed and the system recovered. The aim is only to provide a defensive mechanism that releases threads which have been stuck for so unreasonably long that stopping them is considered the best option.

...

Original Proposal (kept for the record, but abandoned for the reasons given in the comments)

To release client request threads that have been stuck for so long that the client can be assumed to have lost interest in the response, it is proposed to add a mechanism to Geode that releases these threads after a configurable time.

The threads would be released in two ways:

  • For threads waiting uninterruptibly for a response from another member, a new configurable timeout would let the thread exit the wait once the waiting time reaches the timeout value.
  • For threads waiting on any other condition, the mechanism Geode already uses to detect threads that have been stuck for some time (implemented by the ThreadsMonitoringProcess and AbstractExecutor classes) could be enhanced to also interrupt threads that have been stuck for longer than the new configurable timeout.
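
The two release paths can be illustrated with a minimal, self-contained sketch. It does not use Geode's actual ThreadsMonitoringProcess or AbstractExecutor code; the class StuckThreadRelease and its methods are hypothetical names chosen for illustration, and the timeouts are passed in milliseconds as an assumption.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch only; these are not Geode's real classes. */
class StuckThreadRelease {

  /**
   * First path: wait for a reply with an upper bound instead of indefinitely.
   * Returns false if maxWaitTimeoutMillis elapses before the reply arrives.
   */
  static boolean awaitReply(CountDownLatch reply, long maxWaitTimeoutMillis)
      throws InterruptedException {
    return reply.await(maxWaitTimeoutMillis, TimeUnit.MILLISECONDS);
  }

  // Time (in ms) at which each monitored thread started its current task.
  private final Map<Thread, Long> startTimes = new ConcurrentHashMap<>();
  private final long maxThreadStuckTimeMillis;

  StuckThreadRelease(long maxThreadStuckTimeMillis) {
    this.maxThreadStuckTimeMillis = maxThreadStuckTimeMillis;
  }

  /** Called by an executor just before a thread starts a task. */
  void beforeExecute(Thread thread, long nowMillis) {
    startTimes.put(thread, nowMillis);
  }

  /** Called by an executor when the thread finishes its task. */
  void afterExecute(Thread thread) {
    startTimes.remove(thread);
  }

  /**
   * Second path: one monitoring pass that interrupts every thread whose
   * current task has run longer than the limit. Returns how many threads
   * were interrupted in this pass.
   */
  int interruptStuckThreads(long nowMillis) {
    int interrupted = 0;
    for (Map.Entry<Thread, Long> entry : startTimes.entrySet()) {
      if (nowMillis - entry.getValue() > maxThreadStuckTimeMillis) {
        entry.getKey().interrupt();
        interrupted++;
      }
    }
    return interrupted;
  }
}
```

An interrupt only helps threads that block interruptibly, which is why the sketch handles the wait for a remote reply through a separate bounded wait.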

A draft pull request has been created in order to show what the implementation of this mechanism could look like: https://github.com/apache/geode/pull/7555

Changes and Additions to Public Interfaces

Two new configurable parameters are proposed to specify the new timeouts:

  • maxWaitTimeout
  • maxThreadStuckTime

Alternatively, a single parameter could be used to release threads waiting on either of the conditions described above.

Solution

When a thread is stuck in a Geode member, the only generally safe way to release it is to restart the member. Selectively stopping the stuck thread may lead to data inconsistencies or other problems, and is therefore not recommended.

To release stuck threads in a Geode server safely, the following is proposed:

  • Enhance Geode's current stuck-thread detection mechanism (based on ThreadsMonitoringProcess and AbstractExecutor) so that it also detects when a thread has been stuck for longer than a reasonable period, set by a new configurable parameter.
  • Send an alert when this mechanism detects that a thread has been stuck for longer than the configured maximum.
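
The detect-and-alert behaviour proposed above can be sketched as follows. This is an illustrative sketch under stated assumptions, not Geode's implementation: the class StuckThreadAlerter, its methods, and the alert callback are hypothetical, and raising only one alert per stuck task is a design choice made for the example, not something the proposal specifies.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical sketch only; this is not Geode's real monitoring code. */
class StuckThreadAlerter {
  // Time (in ms) at which each monitored thread started its current task.
  private final Map<Thread, Long> startTimes = new ConcurrentHashMap<>();
  // Threads already reported, so each stuck task raises a single alert.
  private final Set<Thread> alerted = ConcurrentHashMap.newKeySet();
  private final long maxStuckMillis;
  private final Consumer<String> alertSink;

  /** The limit is given in minutes, mirroring the proposed parameter. */
  StuckThreadAlerter(long maxThreadStuckTimeMinutes, Consumer<String> alertSink) {
    this.maxStuckMillis = TimeUnit.MINUTES.toMillis(maxThreadStuckTimeMinutes);
    this.alertSink = alertSink;
  }

  void beforeExecute(Thread thread, long nowMillis) {
    startTimes.put(thread, nowMillis);
  }

  void afterExecute(Thread thread) {
    startTimes.remove(thread);
    alerted.remove(thread);
  }

  /**
   * One monitoring pass. The stuck thread is left untouched; only an alert
   * is emitted. Returns the number of new alerts raised in this pass.
   */
  int checkOnce(long nowMillis) {
    int raised = 0;
    for (Map.Entry<Thread, Long> entry : startTimes.entrySet()) {
      long stuckMillis = nowMillis - entry.getValue();
      // alerted.add returns true only the first time a thread is reported.
      if (stuckMillis > maxStuckMillis && alerted.add(entry.getKey())) {
        alertSink.accept("Thread " + entry.getKey().getName() + " stuck for "
            + TimeUnit.MILLISECONDS.toMinutes(stuckMillis) + " minutes");
        raised++;
      }
    }
    return raised;
  }
}
```

Because the monitor never interrupts or stops the thread, the decision about whether and when to restart the member stays with whoever receives the alert.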

Systems external to Geode could receive the new alert and, if appropriate, restart the member with stuck threads at a convenient time.

Changes and Additions to Public Interfaces

A new configurable parameter is proposed to specify the maximum time a thread can be stuck before the member sends an alert:

  • max-thread-stuck-time-minutes

Performance Impact

No impact is foreseen.

...