Requirements

  • One endpoint

    • Operator preference

  • Minimize libprocess events when updating metrics

    • Allow wider use of metrics within libprocess

  • High performance

    • Minimal locking, minimal copying

API

Objects derived from Metric can be constructed directly and must be copyable; copies share state through internally shared data structures.

Explicit calls to process::metrics::add and process::metrics::remove are required for the metric to show up in the endpoint.

Metric

  • public

    • Constructor that takes a name string, plus an optional Duration and capacity that define the window over which history is retained and data is aggregated. If the window is omitted, no history is kept and no statistics are calculated [RB 18718, 20015].

    • A pure virtual value method that returns a Future<double> [RB 18718].

    • A statistics method that returns a struct containing fields such as: max, min, mean, median, p90, p95, p99 [RB 20047, 20018].

    • A poll method that takes a Duration and starts a Timer that calls snapshot and push. Used to pull changes into the history. Derived classes can still push changes explicitly using push if they want to [RB ???].

  • protected

    • A push method that takes a double and a process::Time that defaults to Clock::now(). Used to push changes to the history [RB 20015].

  • private

    • A shared data pointer that contains a process::TimeSeries<double> [RB 20015].
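The window and capacity behaviour of the history can be sketched with standard-library types. This is only an illustration: `History` is a hypothetical stand-in for process::TimeSeries<double>, and the policy of discarding the oldest samples when over capacity is an assumption, not something the design specifies.

```cpp
#include <chrono>
#include <cstddef>
#include <deque>
#include <utility>

// Hypothetical stand-in for the shared process::TimeSeries<double>.
class History {
public:
  using Clock = std::chrono::steady_clock;

  History(std::chrono::milliseconds window, std::size_t capacity)
    : window_(window), capacity_(capacity) {}

  // Mirrors the protected push(double, Time) above; time defaults to now.
  // Assumes timestamps are pushed in non-decreasing order.
  void push(double value, Clock::time_point time = Clock::now()) {
    samples_.emplace_back(time, value);

    // Drop samples older than the window, measured from the newest sample.
    while (!samples_.empty() && time - samples_.front().first > window_) {
      samples_.pop_front();
    }

    // Assumption: when over capacity, discard the oldest samples first.
    while (samples_.size() > capacity_) {
      samples_.pop_front();
    }
  }

  std::size_t size() const { return samples_.size(); }

private:
  std::chrono::milliseconds window_;
  std::size_t capacity_;
  std::deque<std::pair<Clock::time_point, double>> samples_;
};
```

With a 100 ms window and a capacity of 3, a sample pushed 120 ms after the first one evicts the first, and a fourth in-window sample evicts the oldest remaining one.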

Counter

  • operator++, operator+=, operator--, operator-= overloads [RB 18718].

  • reset() method to set the counter to zero.

Gauge

No extra API [RB 18718].

Timer

  • start/stop methods [RB 20339] that take a key as a string. The key is necessary because a single Timer instance may be used to time concurrent events.

Calling stop before start is an error.

Calling value before stop returns 0, otherwise it returns the delta between the last stop and its start.

Class diagram

[Figure: class diagram, "Metrics - New Page (2).png"]

Private implementation details

When a Metric is constructed, it is not automatically added to the metric system. There is a running metrics process to which add and remove calls are dispatched. This is also the process that handles the endpoints below. The metrics process instance is lazily initialized on the first call to add or remove using process::Once to instantiate a static pointer to the process. The process is never explicitly terminated.

The Metric base class contains a shared data object that contains the TimeSeries for the historical data. The time series must be locked before reading or writing. On a call to the statistics() method, the statistics are calculated on demand after copying the relevant data from the locked time series. A Statistics class exists that takes a TimeSeries in the constructor to abstract the calculation.
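The copy-then-calculate approach can be sketched as follows. `SeriesHolder` and the field layout of `Statistics` are hypothetical, a std::mutex stands in for whatever lock the shared data uses, and the nearest-rank percentile is an assumed method; the design does not say how percentiles are computed.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <mutex>
#include <numeric>
#include <vector>

struct Statistics {
  std::size_t count = 0;  // Zero means the series was empty.
  double min = 0, max = 0, mean = 0, median = 0, p90 = 0, p95 = 0, p99 = 0;
};

// Nearest-rank percentile over sorted, non-empty values (assumed method).
inline double percentile(const std::vector<double>& sorted, double p) {
  std::size_t rank = static_cast<std::size_t>(std::ceil(p * sorted.size()));
  rank = std::min(rank, sorted.size());
  return sorted[rank == 0 ? 0 : rank - 1];
}

// Hypothetical holder for the locked series inside the shared data.
class SeriesHolder {
public:
  void push(double value) {
    std::lock_guard<std::mutex> lock(mutex_);
    samples_.push_back(value);
  }

  Statistics statistics() const {
    std::vector<double> values;
    {
      // Hold the lock only long enough to copy the samples out.
      std::lock_guard<std::mutex> lock(mutex_);
      values = samples_;
    }

    Statistics s;
    if (values.empty()) {
      return s;
    }

    // All calculation happens on the private copy, outside the lock.
    std::sort(values.begin(), values.end());
    s.count = values.size();
    s.min = values.front();
    s.max = values.back();
    s.mean = std::accumulate(values.begin(), values.end(), 0.0) / values.size();
    s.median = percentile(values, 0.5);
    s.p90 = percentile(values, 0.9);
    s.p95 = percentile(values, 0.95);
    s.p99 = percentile(values, 0.99);
    return s;
  }

private:
  mutable std::mutex mutex_;
  std::vector<double> samples_;
};
```

Sorting the copy rather than the stored series keeps writers from being blocked by the calculation.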

The need for explicit calls could be removed if the constructor and destructor of the Metric base class made them directly. However, when the caller's Metric is destroyed, the shared data will have two references: the one being destroyed and the one registered with the MetricsProcess. As such, the call to remove should only be made when the refcount is 2; the usual check for unique() on the shared pointer is not sufficient. Further, the explicit calls mirror the two-phase initialization we use throughout Mesos: construct/initialize and finalize/destroy.
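The refcount-of-two condition can be illustrated with std::shared_ptr. This is a single-threaded sketch of the rejected alternative, not the proposed design: `Registry` is a hypothetical synchronous stand-in for the MetricsProcess, and `use_count()` is only reliable here because nothing else touches the pointer concurrently.

```cpp
#include <map>
#include <memory>
#include <string>

struct Data {
  std::string name;
};

// Hypothetical synchronous stand-in for the MetricsProcess.
class Registry {
public:
  void add(const std::shared_ptr<Data>& data) { metrics_[data->name] = data; }
  void remove(const std::string& name) { metrics_.erase(name); }
  bool contains(const std::string& name) const {
    return metrics_.count(name) > 0;
  }

private:
  std::map<std::string, std::shared_ptr<Data>> metrics_;
};

class Metric {
public:
  Metric(Registry& registry, const std::string& name)
    : registry_(&registry), data_(std::make_shared<Data>()) {
    data_->name = name;
    registry_->add(data_);  // The registry now holds the second reference.
  }

  Metric(const Metric&) = default;
  Metric& operator=(const Metric&) = delete;  // Keeps the sketch simple.

  ~Metric() {
    // Exactly two references remain when this is the last caller-side
    // copy: this handle and the registry's copy. unique() never holds.
    if (data_.use_count() == 2) {
      registry_->remove(data_->name);
    }
  }

private:
  Registry* registry_;
  std::shared_ptr<Data> data_;
};
```

Destroying a copy while another caller-side copy is alive leaves the count at three, so the metric stays registered until the last copy goes away.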

Factory methods were also considered for creation and retrieval of Metrics, but the asynchronous nature of the factory methods makes them difficult to work with. Similarly, ownership semantics become confused.

The metric system stores the metrics in a hashmap<std::string, process::Owned<Metric> > container. The key is the name of the Metric and is also stored in the Metric. This allows trivial future extension of the endpoints to filter the returned metrics. A future implementation may take this further and build a tree with Metric objects at the leaves and strings down the branches to allow arbitrary nesting of names. The Metric pointer value is created using an explicit call to the copy constructor for the metric type. As such, the signature for add is:

template <typename T> Future<Nothing> add(const T& metric);

The endpoint method iterates through the hashmap building a map of JSON key to the Future returned by Metric::snapshot. It then uses await on those futures and builds a JSON object mapping the keys to values in a continuation. Any failed futures are ignored and not added to the JSON. The JSON format is:

{
 "name": 42.0,
 "name/min": 42.0,
 "name/max": 42.0,
 "name/median": 42.0,
 ...
 "foo": 3.0,
 ...
}
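The final step of the endpoint, after the futures have been awaited, might look like the sketch below. `Sample` is a hypothetical stand-in for a completed Future<double>, and the string building stands in for the real JSON library; names are assumed not to need escaping.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical result of an awaited future: either failed or a value.
struct Sample {
  bool failed;
  double value;
};

// Builds the flat JSON object, skipping any metric whose future failed.
inline std::string toJson(const std::map<std::string, Sample>& samples) {
  std::ostringstream out;
  out << "{";
  bool first = true;
  for (const auto& entry : samples) {
    if (entry.second.failed) {
      continue;  // Failed futures are ignored and not added to the JSON.
    }
    if (!first) {
      out << ",";
    }
    first = false;
    // Note: default stream formatting prints 42.0 as 42.
    out << "\"" << entry.first << "\":" << entry.second.value;
  }
  out << "}";
  return out.str();
}
```

Keys such as "name/min" need no special handling here because the output object is flat.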

Counter

Current value is stored as int64_t so we can take advantage of atomic increment and decrement support in the language and avoid locking. It is cast to a double on calls to value.
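A minimal sketch of the lock-free counter, assuming std::atomic and std::shared_ptr in place of whatever the shared data object actually uses. Only the prefix forms of ++ and -- are shown.

```cpp
#include <atomic>
#include <cstdint>
#include <memory>

class Counter {
public:
  Counter() : data_(std::make_shared<std::atomic<std::int64_t>>(0)) {}

  // Copies share the same underlying atomic, so any copy can update it.
  Counter& operator++() { data_->fetch_add(1); return *this; }
  Counter& operator--() { data_->fetch_sub(1); return *this; }
  Counter& operator+=(std::int64_t n) { data_->fetch_add(n); return *this; }
  Counter& operator-=(std::int64_t n) { data_->fetch_sub(n); return *this; }

  void reset() { data_->store(0); }

  // The integer is cast to double only when the value is read.
  double value() const { return static_cast<double>(data_->load()); }

private:
  std::shared_ptr<std::atomic<std::int64_t>> data_;
};
```

Because the state lives behind the shared pointer, incrementing a copy is visible through the original, which is what lets the metrics process read a registered copy.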

Gauge

The function object is stored as a Deferred<Future<double>(void)>.
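As a rough illustration, a gauge is just a stored callable that is evaluated on each read. The sketch uses std::function<double()> as a stand-in; unlike a Deferred, it runs synchronously in the caller's thread rather than being dispatched to the owning process.

```cpp
#include <functional>
#include <utility>

class Gauge {
public:
  explicit Gauge(std::function<double()> f) : f_(std::move(f)) {}

  // Evaluates the stored function each time the value is requested.
  double value() const { return f_(); }

private:
  std::function<double()> f_;
};
```

A gauge built from a lambda that reads a queue length, for example, reports the current length on every call without any explicit updates.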

Timer

Contains a map from string keys to Stopwatch instances, which are stopped by default. value returns elapsed().us(). The map is locked on start, stop, and value calls.
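The keyed start/stop behaviour can be sketched with std::chrono in place of Stopwatch. Reporting the stop-before-start error by throwing std::logic_error is an assumption, as is discarding a key once it has been stopped.

```cpp
#include <chrono>
#include <map>
#include <mutex>
#include <ratio>
#include <stdexcept>
#include <string>

class Timer {
public:
  using Clock = std::chrono::steady_clock;

  void start(const std::string& key) {
    std::lock_guard<std::mutex> lock(mutex_);
    started_[key] = Clock::now();
  }

  // Calling stop before start for a key is an error.
  void stop(const std::string& key) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = started_.find(key);
    if (it == started_.end()) {
      throw std::logic_error("stop called before start: " + key);
    }
    last_ = Clock::now() - it->second;
    started_.erase(it);
  }

  // Returns 0 before any stop, otherwise the delta in microseconds
  // between the most recent stop and its matching start.
  double value() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return std::chrono::duration<double, std::micro>(last_).count();
  }

private:
  mutable std::mutex mutex_;
  std::map<std::string, Clock::time_point> started_;
  Clock::duration last_ = Clock::duration::zero();
};
```

Two overlapping events can be timed by calling start with two different keys and stopping each independently; value then reflects whichever was stopped last.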

One library or two

The initial design split the metrics library from the statistics library under the assumption that there would be many more metrics than statistics. However, this adds significant asynchronous issues, as the shared lifetime of Metric objects that are referenced by Statistics objects becomes difficult to reason about. The same result (having metrics with no statistics attached) can be achieved through the simple API described above.

Further, any metrics/statistics interprocess communication requires the use of libprocess mechanisms which may themselves contain metrics objects. This sets up the potential for infinite accounting.

The aggregation data itself would be better handled under a split library, as the per-Statistic TimeSeries would be owned by a Statistics process and updates to it would then be deferred and serialized. With a single library and a per-Metric TimeSeries, the TimeSeries needs to be locked both to be updated and for the statistics to be calculated. If the lock is found to be prohibitively expensive, it is possible to use a lock-free list for the TimeSeries and to calculate the statistics on demand. However, the current metric usages are all ostensibly single-threaded, so there should be no contention. Similarly, the lock is only held long enough to copy out the values from the TimeSeries.

An alternative is to still have a single library and a single metrics process but have the TimeSeries owned by the MetricsProcess directly instead of the Metric base class. High-frequency metrics may put pressure on the deferral queue, but this may be a reasonable cost compared to lock contention. However, this violates the requirement to minimize libprocess events on metric updates.

Another benefit of having a single library is that the API is much simpler to reason about: Metrics are created and their statistics are exposed; there’s no need for users to also create Statistics objects. Similarly, the requirement for a single endpoint is trivially realised by having a single process responsible for tracking Metrics and Statistics.

Testing

  • Individual Metric types and operations on them.

    • Counter ++, --, +=, -=

    • Gauge

    • Timer

  • Base accessors

    • value, statistics with and without window

  • Statistics from TimeSeries

    • empty and non-empty TimeSeries

  • JSON content

    • before add, after add, after remove, after destruction, after destruction without remove
