To be Reviewed By: April 22th, 2022
Authors: Alberto Gomez (alberto.gomez@est.tech)
Status: Draft | Discussion | Active | Dropped | Superseded
Superseded by: N/A
Related: N/A
Problem
Threads that handle client requests could hang forever due to bugs in code. This has been observed in systems in the field with threads stuck for months. This provokes a reduction of the capacity of the system that could lead to a complete service outage if all the threads assigned to client requests reach that state.
These threads normally get stuck waiting for a condition to be fulfilled that never does (for example waiting for a CountDownLatch to be decreased) or waiting indefinitely (and uninterruptibly) for an answer from another member of the system.
Anti-Goals
This proposal does not intend to provide a mechanism to timely identify deadlocks in the system in order to act on the root cause and get the system out of them. The aim is just to provide a defensive mechanism to release threads stuck for such an unreasonable amount of time that stopping them is considered the best option.
Original Proposal (left just for the record but abandoned for the reasons shown in the comments)
In order to release client request threads that have been stuck for such a long time that it can be assumed that the client has already lost interest in the response, it is proposed to include a mechanism in Geode that releases these threads after some configurable time.
The release of the threads would be done by two means:
For threads waiting uninterruptibly for a response from another member, the new configurable timeout will allow the thread to exit from this wait when the waiting time reaches the timeout value.
For threads waiting on a different condition than the one above, the mechanism implemented by the ThreadsMonitoringProcess and AbstractExecutor classes used by Geode to detect threads stuck for some time, could be enhanced to also interrupt threads that have been stuck for longer than the new configurable timeout.
A draft pull request has been created in order to show what the implementation of this mechanism could look like: https://github.com/apache/geode/pull/7555
Changes and Additions to Public Interfaces
Two new configurable parameters are proposed in order to specify the new time-outs available:
maxWaitTimeoutmaxThreadStuckTime
Alternatively, just one parameter could be used to release threads waiting on any of the conditions described above.
Solution
When a thread is stuck in a Geode member, the only generic safe action to release it is to restart the member. Stopping the stuck thread selectively may lead to data inconsistencies or other types of problems and therefore it is not recommended.
In order to be able to release stuck threads in a Geode server in a safe way the following is proposed:
- Enhance the current mechanism in Geode to detect threads that are stuck (based on ThreadsMonitoringProcess and AbstractExecutor) to detect when a thread has been stuck for longer than a reasonable period (which would be a new configurable parameter).
- Send an alert when the above mechanism detects that a thread has been stuck for longer than the maximum value configured.
External systems to Geode could receive the new alert and possibly issue a restart of the member with stuck threads at a convenient time.
Changes and Additions to Public Interfaces
A new configurable parameter is proposed in order to specify the maximum time a thread can be stuck before the member sends an alert:
max-thread-stuck-time-minutes
Performance Impact
No impacts foreseen
Backwards Compatibility and Upgrade Path
No impacts foreseen
Prior Art
-
13 Comments
Anthony Baker
What thread pools are you proposing to interrupt?
If a thread doing a distribution operation is interrupted, that will have implications for data consistency. How to you propose to deal with this?
Alberto Gomez
The idea would be to interrupt any thread no matter what it is doing assuming that after so much time stuck (let's say 24 hours), it makes no difference if it is interrupted or not.
Anthony Baker Do you think the system should be more selective and never interrupt certain threads?
Anthony Baker
I think if we interrupt certain threads we will introduce a data consistency problem. If member A and member B have different values for a given key due to thread interruption, how do we resolve this?
Alberto Gomez
You have a point here. But again, if the thread in charge of replicating an entry has been stuck for many hours, then the inconsistency will already be there and I would not count on waiting for it to be fixed. At least, if you interrupt the thread you would be releasing wasted resources.
Anthony Baker
Right. So the question is which is worse – data consistency or non-availability? In the case of data loss I think you would encounter downstream failures since callbacks / WAN / CQ events would all be impacted.
I agree that non-availability is also a concern. Perhaps we can think of this as an administrative tool rather than a product property. "If your cluster is in a bad state, here's a tool to help you out". Have you thought about a gfsh command that could interrupt a given thread on a member? Of course this is dangerous and should require superuser permissions.
Alberto Gomez
I am not so sure that for the case described (thread stuck for hours/days), it is a matter of data inconsistency vs non-availability. Again, if the stuck thread is the one in charge of replicating the data, if we do not act, we would still have the data inconsistency problem and eventually, the availability problem.
I also considered the administrative tool approach, but it would not help in the cases where the thread is in an uninterruptible wait (for those cases we need to set a timeout as you can see in the draft PR). Also I do not see a clear advantage over a mechanism that automatically stops the threads after some configurable safe time (which should be quite high).
Dan Smith
I kinda agree with Anthony Baker that the proposed fix here is worse than the original problem. If a member stops waiting for a response to a message, that could just result in data inconsistency for a single key. But, if that message is related to more critical metadata like what bucket lives where, it could potentially result in removing the last copy of bucket, or all sort of other nasty issues.
Interrupting threads at arbitrary places where they might be stuck also seems like it will have completely unpredictable results, potentially resulting in cascading failures to other members.
I think the only safe thing do to if a member has stuck threads would be to shutdown the entire member. Maybe that is a better action to take if we detect a thread or reply that is stuck for a long time.
I wonder if this is another case where being able to plug in custom logic might be the better way to go - perhaps ThreadsMonitoringProcess could notify a callback that some threads have been stuck for X amount of time, and that callback could do whatever a user wants - shutdown the member, ping and administrator, etc.
Alberto Gomez
Anthony Baker, Dan Smith , thanks for the valuable feedback.
So, summarizing your comments, the only safe action to take when a thread is stuck for long is to restart the server.
Things being so, what if instead of the original proposal, Geode sends an alert when a thread has been stuck for some configurable time? That way, a client registered to notifications could notice and decide on what action to take and when to take it if appropriate.
Alberto Gomez
Anthony Baker , Dan Smith , do you agree on going on with the proposal of sending an alert when a thread has been stuck for long?
Anthony Baker
That seems good to me. Here's another place where we send an alert based on not receiving a message reply:
https://geode.apache.org/docs/guide/114/managing/troubleshooting/system_failure_and_recovery.html#sys_failure__section_2C5E8A37733D4B31A12F22B9155796FD
Alberto Gomez
Anthony Baker Thanks for the pointer!
Kirk Lund
Some concerns I have on interrupting stuck threads:
It might be better to monitor for stuck threads without interrupting them. This could generate a logging based Alert as you mention in the comments. You could also define a callback (notification listener or observer) that is invoked with information about the thread being stuck. Then you could implement a callback notification listener that does something like:
Any new "feature" should be something that most users will want to enable and use. If there's any doubt about how many users will want to run with this turned on, then it's much better to make some sort of SPI or plugin hook or listener mechanism for customizing specific deployments of Geode by a limited number of users.
Alberto Gomez
Kirk Lund Thanks for the thorough answer.
You guys have convinced me that interrupting a stuck thread is dangerous and the only safe action is to bounce the server.
For now, I think the alert option could be good enough although I will give a thought to the callback notification listener alternative.