ID: IEP-116
Author:
Sponsor:
Created:
Status: IN PROGRESS



Motivation

Apache Ignite 3 is a complex, highly distributed system, so the ability to monitor its state is essential. Monitoring is impossible without metrics: measurable indicators that reflect characteristics of the system (performance, availability, liveness, etc.).

Apache Ignite 2 has a metrics implementation [1] that laid the foundation for collecting metrics and exporting them to external tools. That implementation still has drawbacks, mainly related to managing metrics while developing the main system functionality. For example, it cannot guarantee that a given metric exists, or that there is no data race between metric initialization and reading/writing metric values. These problems lead to system failures that are unrelated to the main functionality and, therefore, to an unpleasant user experience.

Existing popular metrics frameworks are good enough in general, but they do not seem to meet the specific requirements of Apache Ignite 3 [2] in some performance-related aspects.

This proposal describes the design of a metrics subsystem which is based on the community experience and requirements described below. 

Requirements

Ease of maintaining existing metrics and adding new ones

The Apache Ignite 2 metrics implementation (like other known frameworks) makes it possible to add, remove or look up a metric by a string name. Any metric value can be added or accessed anywhere in the code base. A metric name is a unique identifier represented by a string literal or a concatenation of strings. This approach looks simple and clear to a system developer, but it brings a lot of problems:

  1. It is really easy to make a mistake in the metric name.
    As a result some metric could be added with one name and accessed by another name. It could lead to null reference access errors. Or it could be removed by the wrong name and the original metric value will still be stored in memory. Or a metric value from the wrong metrics set could be obtained and erroneously modified.

    Compile time checking can solve this problem.

  2. It is really easy to scatter related metrics across the whole codebase.
    This makes it difficult to find all places where a metric is used. The approach is error-prone because a developer can’t be sure that every relevant place in the code base has been found and fixed or removed.

    It also makes it almost impossible to get a complete list of the metrics that the system exposes.

    This problem could be solved by standard Java language means (types for metrics containers) and by compile-time checking (using access methods instead of access by string names).


Avoidance of data races

Scattering metrics across the codebase leads to race conditions, because metrics can be initialized and accessed in different places of the system and at different times.

Effectively atomic initialization of the whole metrics set could solve this problem.

Enabling/disabling

It is useful to control which metrics are calculated and which are not. The system has to make it possible to enable or disable metrics at runtime. This could slightly improve performance or reduce the amount of data transferred when metric values are exported to external tools.

It is reasonable to enable or disable related metrics at once instead of enabling or disabling metrics one by one.

Metrics consistency

The set of related metrics should always be consistent regardless of component properties. For example, Apache Ignite 2 could return different metrics for a cache with WAL enabled and for a cache with WAL disabled. Moreover, these metrics sets could change when WAL is enabled or disabled.

The proper distribution of metrics over metrics sets and metrics set immutability can solve this problem.

Compact format for exporting metrics

Although popular APM systems like Prometheus use plain text formats for exporting metrics, this is not always a good idea. For large deployments a plain text format could be a real problem. Apache Ignite 3 should provide a compact binary format for exporting.

This could be provided by a stable, versioned and effectively immutable schema for metrics sets. 
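To illustrate why a stable, versioned schema helps here, the sketch below shows one possible encoding. It is only an assumption made for illustration: the proposal does not define a wire format, and the CompactExport class, its methods and the field order are hypothetical. With a schema known to both sides, metric names do not have to be sent with every export; only the schema version and the values in schema order are written.

```java
import java.nio.ByteBuffer;

// Hypothetical encoder, not part of the proposal: the schema (metric names and
// their order) is assumed to be known to both sides, so only values are sent.
final class CompactExport {
    // Layout: int schema version, int value count, then the values in schema order.
    static byte[] encode(int schemaVersion, long[] values) {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES * 2 + Long.BYTES * values.length);

        buf.putInt(schemaVersion).putInt(values.length);

        for (long v : values)
            buf.putLong(v);

        return buf.array();
    }

    // Rejects data written with another schema version instead of misreading it.
    static long[] decode(int expectedSchemaVersion, byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);

        int ver = buf.getInt();

        if (ver != expectedSchemaVersion)
            throw new IllegalArgumentException("Unexpected schema version: " + ver);

        long[] values = new long[buf.getInt()];

        for (int i = 0; i < values.length; i++)
            values[i] = buf.getLong();

        return values;
    }
}
```

In this layout two long metrics take 24 bytes per export regardless of how long their names are, whereas a text format repeats every name in every export.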

Support of standard export formats 

Despite the previous requirement, the system should still provide industry-standard ways of exporting metrics (MBean, OpenMetrics, etc.). It is all about user adoption and experience.

Reasonable performance characteristics

Because the implementation runs on the JVM, a managed runtime, it should take into account the negative effects of disregarding Java memory management. So primitive types should be preferred to boxed types.

Also, concurrency approaches appropriate to each particular use case should be used. It is always a trade-off (LongAdder vs volatile long, concurrent throughput vs memory layout, etc.).

Metric value lookup by name should be prohibited. Metric instances should be cached instead.

Node local metrics

Metrics are local to a node. This fundamental limitation was introduced in Apache Ignite 2 because distributed metrics are really difficult to calculate and aggregate. Metrics aggregation is the responsibility of specialized 3rd party tools.

Definitions

Metric (metric value) - a value of some type which could be instantiated, accessed and modified. The metric value has a meaning only in the context where it is defined. Interpretation of the metric value is a task of a 3rd party APM tool user. Metric has a name and a description. How and where these properties are kept depends on the implementation.

Component - a system entity (e.g. node, storage engine, memory region, etc) which produces some metrics.


Metrics source - a class that implements some interface and provides access to related metrics. A metrics source always corresponds to some component or entity of the system (e.g. a node is a metrics source, a storage engine is a metrics source). A metrics source has a unique name and a type (e.g. class name). The metrics source exposes an interface for modifying metric values. Instead of looking up metrics by name and modifying metric values directly, developers must use the methods defined in the metrics source.

Metrics holder - an object which keeps references to all related metric instances. The holder is always encapsulated within a metrics source instance. The metrics source refers to the holder instance if metrics are enabled for this source; otherwise the holder reference is null. This makes it possible to disable metrics atomically and clean up all resources while doing so.

Metrics set - a set of metrics; in practice, a mapping of a metric name to the metric itself. A metrics set is immutable. Its only purpose is to provide access to metric values for exporting. The metrics set has the same name as the corresponding metrics source. For metrics sources of the same type the resulting metrics sets must have the same layout.

Metrics registry - a system component (manager). A metrics source must be registered in the metrics registry after the corresponding component is initialized and must be removed when the component is destroyed or stopped. The metrics registry also provides access to all enabled metrics through the corresponding metrics sets.

Exporter - a component responsible for exporting metrics to 3rd party tools (e.g. Control Center, Prometheus, etc). 

Design

Entities and relationships

The diagram below shows relationships between entities.

Note the following:

  1. Metrics registry lifetime is equal to the node lifetime. It is a manager in terms of Apache Ignite 3.
  2. Metrics source lifetime is equal to the component lifetime. It is an auxiliary entity in relation to a system component.
  3. Holder lifetime depends on the state of the metrics source which owns this holder. If the metrics source is enabled then the holder exists, otherwise it does not. The holder is a volatile field of the corresponding metrics source class.
  4. Metrics set lifetime depends on the state of the metrics source which produces this metrics set when metrics are enabled. It is the metrics registry's responsibility to manage the metrics sets produced by metrics sources.


Defining a metrics source for a component

To enable a component to produce metrics, the following steps should be taken:

  1. Define a metrics source class and a metrics holder class.
  2. Define all metric variables in the holder class.
  3. Implement the initialization logic which creates the actual metric instances and builds an immutable metrics set instance (executed when metrics are enabled). Implement additional cleanup logic if needed (executed when metrics are disabled).
  4. Define methods that implement the desired manipulations with metrics. Every such method must check the state of the metrics source (enabled/disabled) by comparing the holder reference with null (a kind of guard) before modifying a metric value. Read the holder reference only once and use it within the scope of the method in order to avoid data races.
  5. The metrics source should be registered in the metrics registry on component start and unregistered on component stop. The initial metrics source state (enabled/disabled) should be set according to the configuration of each particular component.
  6. Add the logic that modifies metric values (add invocations of metrics source methods).
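The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the proposed API: the TxMetricsSource class, its metric names and the use of Map<String, LongSupplier> as a stand-in for the metrics set are all hypothetical.

```java
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.LongSupplier;

// Hypothetical metrics source for a transaction component (illustration only).
final class TxMetricsSource {
    // Holder keeps references to all metric instances of this source.
    private static final class Holder {
        final LongAdder commits = new LongAdder();
        final LongAdder rollbacks = new LongAdder();
    }

    private final String name;

    // Non-null only while metrics are enabled for this source.
    private volatile Holder holder;

    TxMetricsSource(String name) {
        this.name = name;
    }

    String name() {
        return name;
    }

    boolean enabled() {
        return holder != null;
    }

    // Creates the metric instances and returns an immutable name -> reader mapping
    // (a stand-in for the metrics set), or null if the source is already enabled.
    synchronized Map<String, LongSupplier> enable() {
        if (holder != null)
            return null;

        Holder h = new Holder();

        Map<String, LongSupplier> set = Map.of(
            "Commits", h.commits::sum,
            "Rollbacks", h.rollbacks::sum);

        holder = h; // Publish the holder only after it is fully initialized.

        return set;
    }

    // Dropping the reference disables all metrics of this source at once.
    synchronized void disable() {
        holder = null;
    }

    void onCommit() {
        Holder h = holder; // Read the volatile reference only once.

        if (h != null)
            h.commits.increment();
    }

    void onRollback() {
        Holder h = holder;

        if (h != null)
            h.rollbacks.increment();
    }
}
```

Component code would then call onCommit() or onRollback() and never touch a metric by name. When the source is disabled these calls cost a single volatile read.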

Metrics lifecycle

Registering metrics source

A metrics source can be registered in the metrics registry using the register(MetricsSource src) method of the MetricsRegistry class.

Registration fails with an exception if a metrics source with the given name already exists.

Enable metrics

Metrics could be enabled using the enable(metricSourceName) method of MetricsRegistry class.

Preconditions:

Metrics source is instantiated and registered in the metrics registry.

Actions:

  1. Check that a metrics source with the given name is present in the metrics registry.
    Fail with an exception if not (fail fast).
  2. Do nothing if the metrics source is already enabled.
  3. Call the MetricsSource.enable() method, which produces an instance of the MetricsSet class. The metrics source also creates a Holder instance internally and initializes the corresponding reference. Add the produced metrics set to the map of metrics sets under the same name as the metrics source.

Disable metrics

Metrics could be disabled using the disable(metricSourceName) method of MetricsRegistry class.

Preconditions:

Metrics source is instantiated and registered in the metrics registry.

Actions:

  1. Check that a metrics source with the given name is present in the metrics registry.
    Fail with an exception if not (fail fast).
  2. Do nothing if the metrics source is already disabled.
  3. Remove the metrics set from the map of metrics sets using the metrics source name.
  4. Call the MetricsSource.disable() method, which cleans up all resources if needed and assigns null to the holder reference internally.

Unregistering metrics source

A metrics source can be unregistered from the metrics registry using the unregister(MetricsSource src) method of the MetricsRegistry class.

Unregistration fails with an exception if a metrics source with the given name doesn’t exist. The metrics set should also be removed from the metrics sets map.
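The lifecycle described in this section could look roughly like the sketch below. The method names register, unregister, enable and disable follow the text. Everything else is an assumption made for illustration: the minimal MetricsSource contract, the use of Map<String, LongSupplier> in place of the MetricsSet class, the metricsSet accessor and the choice of IllegalStateException.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical minimal contract of a metrics source (signatures are assumptions).
interface MetricsSource {
    String name();

    boolean enabled();

    // Returns the metrics set; a plain map is used here as a stand-in.
    Map<String, LongSupplier> enable();

    void disable();
}

final class MetricsRegistry {
    private final Map<String, MetricsSource> sources = new HashMap<>();

    // Metrics sets of enabled sources, keyed by metrics source name.
    private final Map<String, Map<String, LongSupplier>> sets = new ConcurrentHashMap<>();

    synchronized void register(MetricsSource src) {
        if (sources.putIfAbsent(src.name(), src) != null)
            throw new IllegalStateException("Metrics source already registered: " + src.name());
    }

    synchronized void unregister(MetricsSource src) {
        if (sources.remove(src.name()) == null)
            throw new IllegalStateException("Metrics source not registered: " + src.name());

        sets.remove(src.name());
    }

    synchronized void enable(String srcName) {
        MetricsSource src = registered(srcName);

        if (!src.enabled())
            sets.put(srcName, src.enable());
    }

    synchronized void disable(String srcName) {
        MetricsSource src = registered(srcName);

        if (src.enabled()) {
            sets.remove(srcName);

            src.disable();
        }
    }

    // Used by exporters; returns null if the source is disabled or unknown.
    Map<String, LongSupplier> metricsSet(String srcName) {
        return sets.get(srcName);
    }

    // Fail fast if the source is unknown.
    private MetricsSource registered(String srcName) {
        MetricsSource src = sources.get(srcName);

        if (src == null)
            throw new IllegalStateException("Metrics source not registered: " + srcName);

        return src;
    }
}
```

All state changes are serialized on the registry in this sketch, which keeps enabling and disabling simple at the cost of some contention. These operations are expected to be rare compared to metric updates.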

Metrics types

Apache Ignite 3, unlike Apache Ignite 2, will support only number-based metrics, because it is not clear what, for example, a string metric means and which operations are applicable to such metrics.

Scalar metrics

Metric

A metric is just a wrapper around a numeric value which can be increased or decreased by some amount. Support of the IntMetric, LongMetric and DoubleMetric types is enough. Only primitive types are allowed as a metric value. These types should be backed by volatile fields of the corresponding primitive types, updated through atomic field updaters in order to reduce memory footprint (see IntMetricImpl class in Apache Ignite 2 for example).

It also could be useful to have LongAdderMetric and DoubleAccumulatorMetric based on LongAdder and DoubleAccumulator respectively (see LongAdderMetric in Apache Ignite 2 for example).
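The two flavors of scalar metric mentioned above could be sketched as follows. The class names are taken from the text, but the bodies are assumptions written for illustration and are not copied from the Apache Ignite code base.

```java
import java.util.concurrent.atomic.AtomicLongFieldUpdater;
import java.util.concurrent.atomic.LongAdder;

// Compact variant: one volatile long field updated through a shared static
// updater, so no separate AtomicLong object is allocated per metric.
final class LongMetric {
    private static final AtomicLongFieldUpdater<LongMetric> UPDATER =
        AtomicLongFieldUpdater.newUpdater(LongMetric.class, "val");

    private volatile long val;

    void add(long delta) {
        UPDATER.addAndGet(this, delta);
    }

    void increment() {
        add(1);
    }

    void decrement() {
        add(-1);
    }

    long value() {
        return val;
    }
}

// Contention-friendly variant: better update throughput under many writers,
// at the cost of a larger memory footprint.
final class LongAdderMetric {
    private final LongAdder val = new LongAdder();

    void add(long delta) {
        val.add(delta);
    }

    void increment() {
        val.increment();
    }

    long value() {
        return val.sum();
    }
}
```

Which variant fits depends on the update pattern: a value touched by a few threads is well served by the field updater, while a hot counter updated from many threads is a candidate for LongAdder.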

The aforementioned metrics are just values and don’t have any behavior. Apache Ignite 2 also provides HitRateMetric, which accumulates approximate hit rate statistics based on hits in the last time interval (see HitRateMetric class in Apache Ignite 2).

Gauge

A gauge is an instantaneous measurement of a value provided by some existing component. For example, the Java ThreadPoolExecutor already provides some metrics such as completed task count, task count, etc.

The gauge just uses a value supplier which returns the desired value from an existing component (see LongGauge class in Apache Ignite 2 for example).

Apache Ignite 3 should provide IntGauge, LongGauge and DoubleGauge. The return value must always be of a primitive (unboxed) type.
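A gauge over an existing component could be sketched like this. The LongGauge name comes from the text. Its constructor, the accessor names and the GaugeExample helper are assumptions made for illustration. LongSupplier is used so that the value is never boxed.

```java
import java.util.concurrent.ThreadPoolExecutor;
import java.util.function.LongSupplier;

// Hypothetical gauge: it stores no value of its own and asks the supplier on every read.
final class LongGauge {
    private final String name;

    private final LongSupplier supplier;

    LongGauge(String name, LongSupplier supplier) {
        this.name = name;
        this.supplier = supplier;
    }

    String name() {
        return name;
    }

    // Returns a primitive long, no boxing involved.
    long value() {
        return supplier.getAsLong();
    }
}

final class GaugeExample {
    // Exposes a counter that the thread pool already maintains.
    static LongGauge completedTasks(ThreadPoolExecutor pool) {
        return new LongGauge("CompletedTaskCount", pool::getCompletedTaskCount);
    }
}
```

Because the gauge only reads from the component, it adds no cost to the component's own hot path. The cost is paid when the value is exported.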

Composite metrics

A composite metric is a group of closely related values. The values themselves are numeric. Apache Ignite 2 has only one composite metric: the histogram (see HistogramMetric class). Because users expect different behavior from histogram metrics (see histograms in Dropwizard [3]), it could be renamed to BucketMetric or DistributionMetric.

Note: Dynamic metric reconfiguration (e.g. changing histogram buckets) is out of scope for this design. This feature does not seem to be needed, and it leads to a lot of problems (keeping configuration for such a metric, unclear management scope, unclear conflict resolution).
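A composite metric with fixed buckets could be sketched as below. The DistributionMetric name is one of the candidates mentioned above. The bucket semantics (inclusive upper bounds plus one overflow bucket) and the method names are assumptions made for illustration and may differ from HistogramMetric in Apache Ignite 2.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical distribution metric with bucket bounds fixed at creation time.
final class DistributionMetric {
    // Ascending, inclusive upper bounds of the buckets.
    private final long[] bounds;

    // One counter per bound plus a final overflow bucket.
    private final AtomicLongArray counts;

    DistributionMetric(long[] bounds) {
        this.bounds = bounds.clone();
        this.counts = new AtomicLongArray(bounds.length + 1);
    }

    // Increments the first bucket whose bound is not exceeded by x.
    void add(long x) {
        int i = 0;

        while (i < bounds.length && x > bounds[i])
            i++;

        counts.incrementAndGet(i);
    }

    // Returns a snapshot of all bucket counters as primitives.
    long[] value() {
        long[] res = new long[counts.length()];

        for (int i = 0; i < res.length; i++)
            res[i] = counts.get(i);

        return res;
    }
}
```

With bounds {10, 100}, the values 5 and 10 fall into the first bucket, 11 into the second and 1000 into the overflow bucket. Since the bounds never change, the layout of the metric stays stable, which is in line with the note on reconfiguration.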

Metrics naming

A metric name is the concatenation of the name of the metrics source containing the metric and the short name of the metric (e.g. QueueSize), separated by a dot. Short metric names should start with an uppercase character and use camel case.

A metrics source name is a dot-separated combination of names, like a Java package name. Metrics source names should use only lowercase characters.

Only Latin characters are allowed.

Example:

Let’s say we have a partition. Transactions can obtain locks on the whole partition in some cases. Our metric has the name PartitionLocksCount and our metrics source has the name partition.<part_no>.tx, where <part_no> is a partition number. The fully qualified name of the metric will be: partition.<part_no>.tx.PartitionLocksCount.

Such notation makes it possible to build a hierarchy of metrics similar to file system directories.
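The naming rules can be checked mechanically. The sketch below is only an illustration: the MetricNames helper is hypothetical, and the regular expressions encode one reading of the rules. In particular, digits are allowed as an assumption, since source names such as partition.<part_no>.tx contain numbers.

```java
import java.util.regex.Pattern;

// Hypothetical helper that validates names and builds fully qualified metric names.
final class MetricNames {
    // Source name: lowercase dot-separated segments (digits allowed by assumption).
    private static final Pattern SOURCE = Pattern.compile("[a-z0-9]+(\\.[a-z0-9]+)*");

    // Short metric name: upper camel case, Latin letters (digits allowed by assumption).
    private static final Pattern METRIC = Pattern.compile("[A-Z][A-Za-z0-9]*");

    static String fullName(String sourceName, String metricName) {
        if (!SOURCE.matcher(sourceName).matches())
            throw new IllegalArgumentException("Invalid metrics source name: " + sourceName);

        if (!METRIC.matcher(metricName).matches())
            throw new IllegalArgumentException("Invalid metric name: " + metricName);

        return sourceName + '.' + metricName;
    }
}
```

For partition number 42 this gives fullName("partition.42.tx", "PartitionLocksCount"), which evaluates to partition.42.tx.PartitionLocksCount.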

Exporters

This proposal doesn’t define the exporter design; additional research is needed. So the simplest implementations are allowed for demonstration purposes.

JMX

The JMX exporter (or reporter) should, if enabled, expose each enabled metrics source as an MXBean in which each metric is represented by a JMX attribute (see JmxMetricExporterSpi class in Apache Ignite 2 for example).
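One way to expose a metrics set whose attributes are known only at runtime is a dynamic MBean. The sketch below makes that substitution explicitly: it implements the standard javax.management.DynamicMBean interface rather than a statically typed MXBean interface. The MetricsSetMBean class and the use of Map<String, LongSupplier> as the metrics set are assumptions made for illustration.

```java
import java.util.Map;
import java.util.function.LongSupplier;
import javax.management.Attribute;
import javax.management.AttributeList;
import javax.management.AttributeNotFoundException;
import javax.management.DynamicMBean;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanInfo;
import javax.management.ReflectionException;

// Hypothetical read-only bean: every metric of the set becomes one JMX attribute.
final class MetricsSetMBean implements DynamicMBean {
    private final Map<String, LongSupplier> metrics;

    MetricsSetMBean(Map<String, LongSupplier> metrics) {
        this.metrics = Map.copyOf(metrics);
    }

    @Override public Object getAttribute(String attr) throws AttributeNotFoundException {
        LongSupplier s = metrics.get(attr);

        if (s == null)
            throw new AttributeNotFoundException(attr);

        return s.getAsLong();
    }

    // Unknown attribute names are skipped, as the JMX contract allows.
    @Override public AttributeList getAttributes(String[] attrs) {
        AttributeList res = new AttributeList();

        for (String a : attrs) {
            LongSupplier s = metrics.get(a);

            if (s != null)
                res.add(new Attribute(a, s.getAsLong()));
        }

        return res;
    }

    @Override public void setAttribute(Attribute attr) throws AttributeNotFoundException {
        throw new AttributeNotFoundException("Metrics are read-only: " + attr.getName());
    }

    @Override public AttributeList setAttributes(AttributeList attrs) {
        return new AttributeList(); // Nothing can be set.
    }

    @Override public Object invoke(String action, Object[] params, String[] sig)
        throws ReflectionException {
        throw new ReflectionException(new NoSuchMethodException(action));
    }

    @Override public MBeanInfo getMBeanInfo() {
        MBeanAttributeInfo[] attrs = metrics.keySet().stream()
            .sorted()
            .map(n -> new MBeanAttributeInfo(n, "long", "Metric " + n, true, false, false))
            .toArray(MBeanAttributeInfo[]::new);

        return new MBeanInfo(getClass().getName(), "Metrics set", attrs, null, null, null);
    }
}
```

An exporter could register such a bean with the platform MBean server (ManagementFactory.getPlatformMBeanServer()) when a metrics source is enabled and unregister it when the source is disabled. The ObjectName layout is left open here.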

Metrics Management

It should be possible to enable or disable metrics sources by name via Ignite Shell.

Syntax (not final, discuss with CLI team):

ignite metrics enable <metrics_source_name>

ignite metrics disable <metrics_source_name>

Further Steps

  • Adding new composite metrics if needed: for example, histogram with order statistics (quantiles, average, median, etc).
  • Adding different exporters as extensions if needed. 

References

  1. IEP-35 Monitoring & Profiling
  2. Metrics: 3rd party vs proprietary
  3. https://metrics.dropwizard.io/4.2.0/getting-started.html#histograms
  4. https://github.com/apache/ignite/pull/7074
