Current state: "Accepted"
Discussion thread: here
Voting thread: here
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
When running a Trogdor cluster, it is useful to get information about the health of the Trogdor cluster itself.
Currently, a user must query Trogdor’s REST API to get any information about a Trogdor cluster. This is a burden on the user and limits how much health information is readily available. Adding metrics would make it much easier to monitor agents and tasks in Trogdor clusters.
We define a new trogdor-metrics group that captures the metrics as defined below.
The total number of active agents in the Trogdor cluster
The total number of created tasks in the Trogdor cluster
The total number of running tasks in the Trogdor cluster
The total number of done tasks in the Trogdor cluster
All metrics listed above are cumulative counts: each records how many tasks or agents have entered the respective state so far, not how many are in that state right now. Because the counters are cumulative, once a Trogdor cluster has finished all of its tasks we expect created-task-count = running-task-count = done-task-count.
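The cumulative behavior described above can be sketched in plain Java. This is a minimal illustration, not the proposed TrogdorMetrics implementation: the class name `CumulativeTaskCounters` and its methods are hypothetical, and it assumes every task is counted once as created, once as running, and once as done.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration; not the actual TrogdorMetrics implementation.
final class CumulativeTaskCounters {
    private final AtomicLong created = new AtomicLong();
    private final AtomicLong running = new AtomicLong();
    private final AtomicLong done = new AtomicLong();

    // Each counter only ever increases: it records how many tasks have
    // reached a state so far, not how many are in that state right now.
    void taskCreated() { created.incrementAndGet(); }
    void taskRunning() { running.incrementAndGet(); }
    void taskDone()    { done.incrementAndGet(); }

    long createdTaskCount() { return created.get(); }
    long runningTaskCount() { return running.get(); }
    long doneTaskCount()    { return done.get(); }

    // True once every created task has also started and finished
    // (assumption: each task passes through all three states).
    boolean allTasksFinished() {
        long c = created.get();
        return c == running.get() && c == done.get();
    }
}
```

With two tasks, the three counters only become equal (all at 2) after both tasks have been created, started, and completed.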
We propose adding a TrogdorMetrics class to Trogdor that exposes the aforementioned metrics. Since Trogdor agents and tasks share a common Platform class, a TrogdorContainer class will be created inside the Platform class to allow for the creation of a shared TrogdorMetrics instance between the Agent and Coordinator classes.
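One way to picture the sharing arrangement is the sketch below. All classes here are simplified stand-ins with a `Sketch` suffix, invented for illustration; they are not the real Trogdor Platform, Agent, or Coordinator classes, and the fields on the metrics object are assumptions. The point is only that both components obtain the same metrics instance through the platform's container.

```java
import java.util.concurrent.atomic.AtomicLong;

// Simplified stand-ins for illustration only; the real Trogdor classes differ.
final class TrogdorMetricsSketch {
    final AtomicLong activeAgents = new AtomicLong();
    final AtomicLong createdTasks = new AtomicLong();
}

// Holds the single metrics instance that is shared through the platform.
final class TrogdorContainerSketch {
    private final TrogdorMetricsSketch metrics = new TrogdorMetricsSketch();
    TrogdorMetricsSketch metrics() { return metrics; }
}

final class PlatformSketch {
    private final TrogdorContainerSketch container = new TrogdorContainerSketch();
    TrogdorContainerSketch container() { return container; }
}

final class AgentSketch {
    private final TrogdorMetricsSketch metrics;

    AgentSketch(PlatformSketch platform) {
        this.metrics = platform.container().metrics();
        // Assumption: an agent counts itself as active when it starts.
        this.metrics.activeAgents.incrementAndGet();
    }

    TrogdorMetricsSketch metrics() { return metrics; }
}

final class CoordinatorSketch {
    private final TrogdorMetricsSketch metrics;

    CoordinatorSketch(PlatformSketch platform) {
        this.metrics = platform.container().metrics();
    }

    // Assumption: the coordinator records task creation.
    void createTask() { metrics.createdTasks.incrementAndGet(); }

    TrogdorMetricsSketch metrics() { return metrics; }
}
```

Because both constructors read from the same `PlatformSketch`, updates made by the agent and by the coordinator land on one object, which is the property the container is meant to provide.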
Compatibility, Deprecation, and Migration Plan
There should be no impact on compatibility, deprecation, or migration since this KIP simply adds some metrics to Trogdor.
Since there technically is a STOPPING state for a task in addition to PENDING, RUNNING and DONE, it would be nice to have metrics for each of these states.
However, because the counters are cumulative, the number of tasks currently pending can be deduced by subtracting running-task-count from created-task-count. Similarly, the number of tasks currently running can be deduced by subtracting done-task-count from running-task-count. The number of done tasks is reported directly by done-task-count, with no arithmetic necessary. This allows fewer metrics to be tracked. The STOPPING state is transient and adds little to the metrics, so it was deemed sufficient to track only PENDING, RUNNING, and DONE tasks.
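The arithmetic for recovering current per-state counts from the cumulative counters can be sketched as follows. The helper class `TaskStateMath` and its method names are hypothetical, and the sketch assumes every task enters RUNNING before it reaches DONE.

```java
// Hypothetical helper; the names are illustrative and not part of Trogdor.
final class TaskStateMath {
    private TaskStateMath() { }

    // Tasks that have been created but have not started running yet.
    static long currentlyPending(long createdTaskCount, long runningTaskCount) {
        return createdTaskCount - runningTaskCount;
    }

    // Tasks that have started running but have not finished yet.
    static long currentlyRunning(long runningTaskCount, long doneTaskCount) {
        return runningTaskCount - doneTaskCount;
    }
}
```

For example, with created-task-count = 10, running-task-count = 7, and done-task-count = 4, there are 10 - 7 = 3 tasks currently pending and 7 - 4 = 3 tasks currently running.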