With https://issues.apache.org/jira/browse/SLING-3278 we have started implementing a HealthChecksExecutor service that will help handle "slow" Health Checks (HC) better.
Discussions on our dev list show that we have slightly different ideas about how that should work, so I'm starting this page to clarify the use cases and requirements.
Typical use cases
The general idea is that executing a set of HCs is guaranteed to return a result within a specified amount of time, either fresh or cached. If an actual result is not yet available for a given health check, a Result with a HEALTH_CHECK_ERROR state is returned, with a WARN log message that indicates the problem.
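The time-bounded behaviour described above could be sketched as follows. This is only an illustration under stated assumptions: CheckStatus and TimeBoundedExecutor are hypothetical names, a plain java.util.concurrent pool stands in for the Sling thread pool, and caching and the WARN log message are left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the HC result state; not the real Sling API type.
enum CheckStatus { OK, WARN, CRITICAL, HEALTH_CHECK_ERROR }

final class TimeBoundedExecutor {

    private TimeBoundedExecutor() {}

    /** Runs all checks in parallel and returns one status per check after at most about timeoutMs. */
    static Map<String, CheckStatus> execute(Map<String, Callable<CheckStatus>> checks, long timeoutMs) {
        List<String> names = new ArrayList<>(checks.keySet());
        List<Callable<CheckStatus>> tasks = new ArrayList<>(checks.values());
        Map<String, CheckStatus> results = new LinkedHashMap<>();
        // Assumption: a throwaway pool; the real service would use a configurable Sling thread pool.
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, tasks.size()));
        try {
            // invokeAll returns when all tasks are done or the timeout expires;
            // tasks that have not completed by then are cancelled.
            List<Future<CheckStatus>> futures = pool.invokeAll(tasks, timeoutMs, TimeUnit.MILLISECONDS);
            for (int i = 0; i < names.size(); i++) {
                results.put(names.get(i), statusOf(futures.get(i)));
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            for (String name : names) {
                results.putIfAbsent(name, CheckStatus.HEALTH_CHECK_ERROR);
            }
        } finally {
            pool.shutdownNow();
        }
        return results;
    }

    private static CheckStatus statusOf(Future<CheckStatus> future) {
        if (future.isCancelled()) {
            return CheckStatus.HEALTH_CHECK_ERROR; // no result within the timeout
        }
        try {
            return future.get(); // already done, returns immediately
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return CheckStatus.HEALTH_CHECK_ERROR;
        } catch (ExecutionException e) {
            return CheckStatus.HEALTH_CHECK_ERROR; // the check itself threw
        }
    }
}
```

With a 300 ms timeout, a check that returns immediately keeps its own status, while one that sleeps for several seconds is reported as HEALTH_CHECK_ERROR. A real implementation would probably keep the slow check running in the background so that a later call can pick up its result from the cache.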
A) JMX MBean
An MBean acts as a JMX facade for an HC and wants to get a result quickly. Getting a cached result is acceptable, as long as it has not expired, and reading the MBean attributes several times per second should not cause the HC to be executed every time.
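A minimal sketch of such a cache, assuming a simple time-to-live policy, might look like this. TtlResultCache is a hypothetical name, and the injectable clock is an assumption made here only to keep the example testable.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical TTL cache for HC results, keyed by health check name.
final class TtlResultCache<R> {

    private static final class Entry<R> {
        final R value;
        final long createdAt;

        Entry(R value, long createdAt) {
            this.value = value;
            this.createdAt = createdAt;
        }
    }

    private final Map<String, Entry<R>> entries = new ConcurrentHashMap<>();
    private final long ttlMs;
    private final LongSupplier clock; // assumption: injectable clock, e.g. System::currentTimeMillis

    TtlResultCache(long ttlMs, LongSupplier clock) {
        this.ttlMs = ttlMs;
        this.clock = clock;
    }

    /** Returns the cached result if it is younger than the TTL, otherwise runs the check again. */
    R get(String checkName, Supplier<R> check) {
        long now = clock.getAsLong();
        // compute() is atomic per key, so concurrent readers of one HC trigger a single execution.
        // Note that the check runs while the map entry is locked, which is acceptable for a sketch only.
        Entry<R> entry = entries.compute(checkName, (name, old) ->
                (old != null && now - old.createdAt < ttlMs) ? old : new Entry<>(check.get(), now));
        return entry.value;
    }
}
```

An MBean that exposes several attributes of the same HC would then call get() once per attribute read, and only the first read within the TTL window actually executes the check.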
B) Human user, webconsole
A human user looks at a set of HCs, selected by their tags, on the webconsole. The user expects the webconsole page to refresh within N seconds (specified in the webconsole execution form), even if some HCs take longer than that to execute. Getting cached results is fine, as long as they are not expired, and their age should be displayed on the webconsole, as well as the HC's name and tags. For this use case, HCs should be executed in parallel using the Sling thread pools mechanism.
C) HTTP front-end, machine client
An HTTP front-end uses a set of HCs, selected by tags, to decide whether to include a Sling instance in the pool used to process incoming requests. Every few seconds, the front-end pings a specific servlet on the Sling instance, which provides the aggregated HC results in a format specified by the front-end.
Cached results are not acceptable in this case: the servlet must either reply quickly with an OK status or fail after a timeout specified by the front-end. HCs are executed in parallel, as in use case B.
Georg: Cached results are totally fine for short TTLs. The default is 2 sec; the actual driver for having the cache is JMX, and the JMX module would even be fine with a 500 ms TTL, as the whole point is to avoid executing checks several times for each JMX attribute exposed.
Suggested API HealthCheckExecutor
API variant A
If several HCs are provided, they are executed in parallel using a configurable Sling thread pool.
The default execution timeout is configurable and can be overridden in the options map.
The Results class needs some changes to handle caching and to provide enough metadata to display the results:
For use case C we also need an AggregatedResult, which extends Result with a getIndividualResults() method that returns a List<Result>. The AggregatedResult's state is the highest of its individual results' states, and its log is the concatenation of the individual results' logs.
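The aggregation rule could be sketched roughly as below. The Result class here is a simplified, hypothetical stand-in (a status plus a list of log lines), not the real org.apache.sling.hc.api.Result, and treating an empty aggregate as OK is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Simplified stand-in for the HC Result class; hypothetical, for illustration only.
class Result {
    // Declared in ascending order of severity so that compareTo() can rank states.
    enum Status { OK, WARN, CRITICAL, HEALTH_CHECK_ERROR }

    private final Status status;
    private final List<String> log;

    Result(Status status, List<String> log) {
        this.status = status;
        this.log = Collections.unmodifiableList(new ArrayList<>(log));
    }

    Status getStatus() { return status; }

    List<String> getLog() { return log; }
}

class AggregatedResult extends Result {
    private final List<Result> individualResults;

    AggregatedResult(List<Result> individualResults) {
        super(highestStatus(individualResults), mergedLog(individualResults));
        this.individualResults = Collections.unmodifiableList(new ArrayList<>(individualResults));
    }

    List<Result> getIndividualResults() { return individualResults; }

    // The aggregate state is the most severe state among the individual results.
    private static Status highestStatus(List<Result> results) {
        Status highest = Status.OK; // assumption: no results means OK
        for (Result r : results) {
            if (r.getStatus().compareTo(highest) > 0) {
                highest = r.getStatus();
            }
        }
        return highest;
    }

    // The aggregate log is all individual logs, in the order of the individual results.
    private static List<String> mergedLog(List<Result> results) {
        List<String> merged = new ArrayList<>();
        for (Result r : results) {
            merged.addAll(r.getLog());
        }
        return merged;
    }
}
```

Aggregating an OK result and a CRITICAL result this way yields a CRITICAL aggregate whose log holds the lines of both, which is what a servlet for use case C would need to decide on its reply.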
API variant B
This API variant does not need to change the class org.apache.sling.hc.api.Result as it is returned by health check implementations. Rather, it introduces an execution result that can be used by the three consumer use cases to show rich execution details.
As HealthCheckExecutionResult has a natural order, IMHO it's better to return a list.
API variant B is mostly in line with the current version in SVN, but it improves the structure of HealthCheckExecutionResult and adds the convenience method execute(String... tags).