...
Superseded by: N/A
Related: N/A
Problem
Threads that handle client requests can hang indefinitely because of bugs in the code. This has been observed in systems in the field, where threads have remained stuck for months. Each stuck thread reduces the capacity of the system, and a complete service outage could follow if all the threads assigned to client requests reach that state.
These threads normally get stuck either waiting for a condition that is never fulfilled (for example, waiting for a CountDownLatch that is never counted down) or waiting indefinitely (and uninterruptibly) for an answer from another member of the system.
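To illustrate the first failure mode, the following minimal sketch contrasts an unbounded wait with a bounded one. The class and method names (BoundedLatchWait, awaitWithTimeout) are hypothetical and are not taken from the Geode code base; only the java.util.concurrent calls are real.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, for illustration only; not part of Geode.
final class BoundedLatchWait {

  /**
   * Waits for the latch for at most timeoutMillis milliseconds.
   * Returns true if the latch reached zero, false if the wait timed out
   * or the waiting thread was interrupted.
   *
   * A plain latch.await() has no such bound: if no other thread ever calls
   * countDown(), the caller blocks until it is interrupted.
   */
  static boolean awaitWithTimeout(CountDownLatch latch, long timeoutMillis) {
    try {
      return latch.await(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      // Restore the interrupt status so that callers can still observe it.
      Thread.currentThread().interrupt();
      return false;
    }
  }
}
```

A wait of this kind is still interruptible. The second failure mode described above (uninterruptible waits) does not respond to Thread.interrupt() and would need a different release path.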
Anti-Goals
This proposal does not intend to provide a mechanism to identify deadlocks in a timely manner so that their root cause can be addressed and the system recovered. The aim is only to provide a defensive mechanism that releases threads that have been stuck for such an unreasonable amount of time that stopping them is considered the best option.
Solution
To release client request threads that have been stuck for so long that the client can be assumed to have lost interest in the response, it is proposed to add a mechanism to Geode that releases these threads after a configurable time.
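As a rough illustration of the idea, and not a description of the actual implementation, such a mechanism could be sketched as a registry of in-flight requests that is scanned periodically. Everything named here (StuckThreadReleaser, requestStarted, requestFinished, releaseStuckThreads, maxStuckMillis) is hypothetical, and the sketch assumes that the stuck threads are in interruptible waits.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, for illustration only; not the mechanism in the draft pull request.
final class StuckThreadReleaser {

  // Start time (System.nanoTime) of the request each thread is currently handling.
  private final Map<Thread, Long> startTimes = new ConcurrentHashMap<>();
  private final long maxStuckNanos;

  // maxStuckMillis stands in for the configurable time after which a thread is released.
  StuckThreadReleaser(long maxStuckMillis) {
    this.maxStuckNanos = TimeUnit.MILLISECONDS.toNanos(maxStuckMillis);
  }

  // Called by a request-handling thread when it starts processing a request.
  void requestStarted() {
    startTimes.put(Thread.currentThread(), System.nanoTime());
  }

  // Called by the same thread when the request completes, normally or not.
  void requestFinished() {
    startTimes.remove(Thread.currentThread());
  }

  /**
   * Interrupts every registered thread whose current request has run longer
   * than the configured limit. Returns the number of threads interrupted.
   */
  int releaseStuckThreads() {
    long now = System.nanoTime();
    int released = 0;
    for (Map.Entry<Thread, Long> entry : startTimes.entrySet()) {
      if (now - entry.getValue() > maxStuckNanos) {
        entry.getKey().interrupt();
        // Drop the entry so the same request is not interrupted twice.
        startTimes.remove(entry.getKey(), entry.getValue());
        released++;
      }
    }
    return released;
  }
}
```

In this sketch, releaseStuckThreads() would be invoked periodically, for example from a scheduled task. Thread.interrupt() only wakes threads blocked in interruptible operations, so threads waiting uninterruptibly would not be released by this approach alone.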
...
A draft pull request shows what an implementation of this mechanism could look like: https://github.com/apache/geode/pull/7555
Changes and Additions to Public Interfaces
Two new configurable parameters are proposed to specify the new timeouts:
...
Alternatively, just one parameter could be used to release threads waiting on any of the conditions described above.
Performance Impact
No impact is foreseen.
Backwards Compatibility and Upgrade Path
No impact is foreseen.
Prior Art
-