Overview
With https://issues.apache.org/jira/browse/SLING-3278 we have started implementing a HealthChecksExecutor service that will help handle "slow" Health Checks (HC) better.
Discussions on our dev list show that we have slightly different ideas about how that should work, so I'm starting this page to clarify the use cases and requirements.
Typical use cases
The general idea is that executing a set of HCs is guaranteed to return a result within a specified amount of time, either a fresh or a cached result. If an actual result is not yet available for a given health check, a Result with a HEALTH_CHECK_ERROR state is returned, with a WARN log message that indicates the problem.
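The timeout guarantee can be sketched with plain java.util.concurrent primitives. This is only an illustration under assumptions: SketchResult and TimeBoundedExecutor are hypothetical stand-ins (not the Sling Result or HealthCheckExecutor), the thread pool is a plain JDK pool rather than a Sling thread pool, and caching is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical, simplified stand-in for a health check result.
final class SketchResult {
    enum Status { OK, WARN, CRITICAL, HEALTH_CHECK_ERROR }

    final Status status;
    final String message;

    SketchResult(Status status, String message) {
        this.status = status;
        this.message = message;
    }
}

// Hypothetical executor: runs checks in parallel, bounded by one overall deadline.
final class TimeBoundedExecutor {

    static List<SketchResult> execute(List<Callable<SketchResult>> checks, long timeoutMs) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, checks.size()));
        try {
            List<Future<SketchResult>> futures = new ArrayList<>();
            for (Callable<SketchResult> check : checks) {
                futures.add(pool.submit(check));
            }
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
            List<SketchResult> results = new ArrayList<>();
            for (Future<SketchResult> future : futures) {
                // All checks share the same deadline, so wait only for what is left of it.
                long remainingNanos = Math.max(0L, deadline - System.nanoTime());
                try {
                    results.add(future.get(remainingNanos, TimeUnit.NANOSECONDS));
                } catch (TimeoutException e) {
                    results.add(error("WARN: no result available within " + timeoutMs + " ms"));
                } catch (ExecutionException e) {
                    results.add(error("WARN: health check failed: " + e.getCause()));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    results.add(error("WARN: interrupted while waiting for result"));
                }
            }
            return results;
        } finally {
            // Simplification: a real executor might let slow checks finish to fill a cache.
            pool.shutdownNow();
        }
    }

    private static SketchResult error(String message) {
        return new SketchResult(SketchResult.Status.HEALTH_CHECK_ERROR, message);
    }
}
```

In this sketch a check that misses the deadline is interrupted and reported as HEALTH_CHECK_ERROR, while checks that finished in time keep their real result.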
A) JMX MBean
An MBean acts as a JMX facade for an HC and wants to get a result quickly. Getting a cached result is acceptable, as long as that's not expired, and getting the MBean attributes several times per second should not cause the HC to be executed every time.
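A minimal sketch of the caching behaviour this use case asks for, assuming a simple per-check time to live. TtlResultCache and its injectable clock are hypothetical names introduced for illustration, not part of the Sling API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical cache: re-executes a check only when its cached result has expired.
final class TtlResultCache<R> {

    private static final class Entry<R> {
        final R result;
        final long createdAtMs;

        Entry(R result, long createdAtMs) {
            this.result = result;
            this.createdAtMs = createdAtMs;
        }
    }

    private final Map<String, Entry<R>> entries = new ConcurrentHashMap<>();
    private final long ttlMs;
    private final LongSupplier clock; // injectable so that expiry can be tested

    TtlResultCache(long ttlMs, LongSupplier clock) {
        this.ttlMs = ttlMs;
        this.clock = clock;
    }

    /** Returns the cached result for checkName, or runs the check if none is fresh. */
    R get(String checkName, Supplier<R> check) {
        long now = clock.getAsLong();
        // compute() is atomic per key, so concurrent callers do not run the check twice.
        Entry<R> entry = entries.compute(checkName, (name, old) ->
                (old == null || now - old.createdAtMs >= ttlMs)
                        ? new Entry<>(check.get(), now)
                        : old);
        return entry.result;
    }
}
```

With a TTL of 2000 ms, reading ten MBean attributes within the same second would run the underlying check once and serve the other nine reads from the cache.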
B) Human user, webconsole
A human user looks at a set of HCs, selected by their tags, on the webconsole. The user expects the webconsole page to refresh within N seconds (specified in the webconsole execution form), even if some HCs take longer than that to execute. Getting cached results is fine, as long as they are not expired, and their age should be displayed on the webconsole, as well as the HC's name and tags. For this use case, HCs should be executed in parallel using the Sling thread pools mechanism.
C) HTTP front-end, machine client
An HTTP front-end uses a set of HCs, selected by tags, to decide whether to include a Sling instance in the pool used to process incoming requests. Every few seconds, the front-end pings a specific servlet on the Sling instance, which provides the aggregated HC results in a format specified by the front-end. Cached results are not acceptable in this case: the servlet must either reply quickly with an OK status, or fail after a timeout specified by the front-end. HCs are executed in parallel as in use case B.
Georg: Cached results are totally fine for short TTLs. The default is 2 seconds. The actual driver for having the cache is JMX, and the JMX module would even be fine with a 500 ms TTL, as the whole point is to avoid executing a check once for each JMX attribute exposed.
Suggested API HealthCheckExecutor
API variant A
    interface HealthCheckExecutor {
        /**
         * @param options for things like "no cached results", "execution timeout", extensible
         * @param hc we might have to use ServiceReference instead of HC unfortunately
         */
        List<Result> execute(Map<String, Object> options, HealthCheck... hc);
    }
If several HCs are provided, they are executed in parallel using a configurable Sling thread pool.
The default execution timeout is configurable and can be overridden in the options map.
The Result class needs some changes to handle caching and to provide enough metadata to display the results:
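One way the options map could be read, as a sketch only. The key names "timeoutMs" and "allowCached" and the ExecutionOptions helper are assumptions made for this example; the page does not define them.

```java
import java.util.Map;

// Hypothetical helper for reading executor options with configured defaults.
final class ExecutionOptions {

    static final String TIMEOUT_MS = "timeoutMs";     // assumed key name
    static final String ALLOW_CACHED = "allowCached"; // assumed key name

    /** Returns the timeout from the options, or the configured default if absent. */
    static long timeoutMs(Map<String, Object> options, long defaultTimeoutMs) {
        Object value = options == null ? null : options.get(TIMEOUT_MS);
        return value instanceof Number ? ((Number) value).longValue() : defaultTimeoutMs;
    }

    /** Cached results are allowed unless the caller explicitly opts out. */
    static boolean allowCached(Map<String, Object> options) {
        Object value = options == null ? null : options.get(ALLOW_CACHED);
        return !(value instanceof Boolean) || (Boolean) value;
    }
}
```

Unknown keys are simply ignored here, which keeps the map extensible as the API comment suggests.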
    public class Result {
        /**
         * Optional metadata, the HealthCheckExecutor initializes this
         * with the HC's service properties, and adds timing information
         * like the result's creation timestamp and time to live (TTL), used
         * for caching. HCs can add any relevant metadata here. TTL, for example,
         * can be defined in the HC's service properties, with a default set
         * by the executor's configuration.
         */
        public Map<String, Object> getMetadata();

        /**
         * If this Result's metadata includes a creation timestamp and
         * time to live, this is used to expire results from the
         * HealthCheckExecutor's cache
         */
        public boolean isExpired();

        // the rest is unchanged
    }
For use case C we also need an AggregatedResult, which extends Result with a getIndividualResults() method that returns a List<Result>. The AggregatedResult's state is the highest of its individual results' states, and its log is the concatenation of the individual results' logs.
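The aggregation rule can be sketched as follows. SimpleResult and AggregatedResultSketch are hypothetical stand-ins for Result and AggregatedResult; the status ordering relies on the enum declaration order, and treating an empty set of results as OK is an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified stand-in for Result: a status plus log lines.
class SimpleResult {
    // Declared from least to most severe, so compareTo() gives the severity order.
    enum Status { OK, WARN, CRITICAL, HEALTH_CHECK_ERROR }

    private final Status status;
    private final List<String> log;

    SimpleResult(Status status, List<String> log) {
        this.status = status;
        this.log = Collections.unmodifiableList(new ArrayList<>(log));
    }

    Status getStatus() { return status; }

    List<String> getLog() { return log; }
}

// Hypothetical aggregate: highest status wins, logs are concatenated in order.
final class AggregatedResultSketch extends SimpleResult {

    private final List<SimpleResult> individualResults;

    AggregatedResultSketch(List<SimpleResult> individualResults) {
        super(highestStatus(individualResults), concatenatedLogs(individualResults));
        this.individualResults = Collections.unmodifiableList(new ArrayList<>(individualResults));
    }

    List<SimpleResult> getIndividualResults() { return individualResults; }

    private static Status highestStatus(List<SimpleResult> results) {
        Status highest = Status.OK; // assumption: no results means OK
        for (SimpleResult r : results) {
            if (r.getStatus().compareTo(highest) > 0) {
                highest = r.getStatus();
            }
        }
        return highest;
    }

    private static List<String> concatenatedLogs(List<SimpleResult> results) {
        List<String> all = new ArrayList<>();
        for (SimpleResult r : results) {
            all.addAll(r.getLog());
        }
        return all;
    }
}
```

A front-end servlet could then inspect only the aggregate's status for its include/exclude decision and fall back to getIndividualResults() when details are needed.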
API Variant B
This API variant does not need to change the class org.apache.sling.hc.api.Result as it is returned by health check implementations. Instead, it introduces an execution result that the three consumer use cases can use to show rich execution details.
    /**
     * Executes health checks registered as OSGi services and
     * implementing the interface {@link HealthCheck}.
     */
    @ProviderType
    public interface HealthCheckExecutor {

        /**
         * Executes all health checks for the given tags. Will look up
         * the relevant checks using HealthCheckFilter.
         *
         * @return List of results (can be empty but never null)
         */
        List<HealthCheckExecutionResult> execute(String... tags);

        /**
         * Executes all given health checks.
         *
         * @return List of results (can be empty but never null)
         */
        List<HealthCheckExecutionResult> execute(ServiceReference... healthCheckReferences);
    }
As HealthCheckExecutionResult has a natural order, IMHO it's better to return a list.
    /**
     * Interface for health check results that are returned from
     * the executor providing additional information (execution
     * timing, health check meta data).
     */
    @ProviderType
    public interface HealthCheckExecutionResult {

        /**
         * The actual result as returned by the health check
         */
        Result getHealthCheckResult();

        /**
         * Meta information about the health check like its name,
         * tags or additional attributes that might be added via
         * OSGi service properties later.
         */
        HealthCheckDescriptor getHealthCheckDescriptor();

        /**
         * The elapsed time in ms.
         */
        long getElapsedTimeInMs();

        /**
         * The time the health check finished.
         */
        Date getFinishedAt();
    }
API variant B is mostly in line with the current version in SVN, but it improves the structure of HealthCheckExecutionResult and adds the convenience method execute(String... tags).