ID | IEP-116 |
Author | |
Sponsor | |
Created |
|
Status |
Apache Ignite 3 is a complex and highly distributed system and the ability to monitor the state of the system is really important. This ability is impossible without metrics - some measurable indicators that reflect the characteristics of the system (performance, availability, liveness, etc.).
Apache Ignite 2 has a metrics implementation [1] that laid the foundation for collecting metrics and exporting them to external tools. However, this implementation has drawbacks, mainly related to managing metrics while developing the main system functionality. For example, it cannot guarantee that a given metric value exists, or that there is no data race between metric initialization and reading or writing metric values. These problems lead to system failures that are unrelated to the main functionality and, therefore, to an unpleasant user experience.
Existing popular metrics frameworks are good enough for general use, but they do not appear to meet the specific requirements of Apache Ignite 3 [2] in some performance-related aspects.
This proposal describes the design of a metrics subsystem which is based on the community experience and requirements described below.
The Apache Ignite 2 metrics implementation (like other well-known frameworks) makes it possible to add, remove or look up a metric by a string name. Any metric value can be added or accessed anywhere in the code base. A metric name is a unique identifier represented by a string literal or a concatenation of strings. This approach looks simple and clear to a system developer, but it brings a lot of problems:
Metrics smeared over the code base lead to race conditions, because metrics can be initialized and accessed in different places of the system and at different times.
Effectively atomic initialization of the whole metrics set could solve this problem.
It is useful to manage what metrics should be calculated and what should not. The system has to provide the possibility to enable or disable some metrics at runtime. It could improve performance a little or reduce the amount of data transfer during export of metrics values to external tools.
It is reasonable to enable or disable related metrics at once instead of enabling or disabling metrics one by one.
The set of related metrics should always be consistent, independently of component properties. For example, Apache Ignite 2 could return different metrics for a cache with WAL enabled and for a cache with WAL disabled. Moreover, these metrics sets could be modified by enabling or disabling WAL.
The proper distribution of metrics over metrics sets and metrics set immutability can solve this problem.
Popular APM systems like Prometheus use plain text formats for exporting metrics, but this is not always a good idea. For large deployments a plain text format could be a real problem. Apache Ignite 3 should provide a compact binary format for exporting.
This could be provided by a stable, versioned and effectively immutable schema for metrics sets.
Despite the previous requirement the system still should provide industry standard ways for exporting metrics (MBean, OpenMetrics, etc). It is all about user adoption and experience.
Because the implementation runs on a managed runtime (the JVM), it should take into account the negative effects of disregarding Java memory management. Primitive types should therefore be preferred to boxed types.
Proper concurrency approaches should also be chosen for each particular use case. It is always a trade-off (LongAdder vs volatile long: concurrent throughput vs memory layout, etc).
Metric value lookup by name should be prohibited. Instances of metric types should be cached instead.
This fundamental limitation was imposed by Apache Ignite 2 because distributed metrics are really difficult to calculate and aggregate. Metrics aggregation is a responsibility of special 3rd party tools.
Metric (metric value) - a value of some type which could be instantiated, accessed and modified. The metric value has a meaning only in the context where it is defined. Interpretation of the metric value is a task of a 3rd party APM tool user. Metric has a name and a description. How and where these properties are kept depends on the implementation.
Component - a system entity (e.g. node, storage engine, memory region, etc) which produces some metrics.
Metrics source - a class that implements some interface and provides access to related metrics. Metrics source always corresponds to some component or entity of the system (e.g. node is a metrics source, storage engine is a metric source). Metrics source has a unique name and type (e.g. class name). The metrics source exposes an interface for modification of metrics values. Instead of looking up metrics by a name and modifying metrics value directly, developers must use methods defined in the metrics source.
Metrics holder - an object which keeps references to all related metric instances. The holder is always encapsulated within a metrics source instance. The metrics source refers to the holder instance if metrics are enabled for this source, otherwise the holder reference is null. This makes it possible to disable metrics atomically and clean up all resources while doing so.
Metrics set - a set of metrics. It is effectively a mapping of a metric name to the metric itself. A metrics set is immutable. Its only purpose is to provide access to metric values for exporting. The metrics set has the same name as the corresponding metrics source. For metrics sources of the same type the resulting metrics sets must have the same layout.
Metrics registry - a system component (manager). A metrics source must be registered in the metrics registry after the corresponding component is initialized and must be removed when the component is destroyed or stopped. The metrics registry also provides access to all enabled metrics through the corresponding metrics sets.
Exporter - a component responsible for exporting metrics to 3rd party tools (e.g. Control Center, Prometheus, etc).
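The relationship between a metrics source, its holder and the resulting metrics set can be sketched as follows. This is a minimal illustration only: the class names TxMetricsSourceSketch and MetricsSetSketch, the metric names Commits and Rollbacks, the use of LongAdder as the metric type, and returning null from enable() when the source is already enabled are all assumptions, not part of the proposal.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;

/** Immutable name-to-metric mapping used only for export (hypothetical shape). */
final class MetricsSetSketch {
    private final String name;
    private final Map<String, LongAdder> metrics;

    MetricsSetSketch(String name, Map<String, LongAdder> metrics) {
        this.name = name;
        // Defensive copy plus unmodifiable view: the layout cannot change later.
        this.metrics = Collections.unmodifiableMap(new LinkedHashMap<>(metrics));
    }

    String name() { return name; }

    Map<String, LongAdder> metrics() { return metrics; }
}

/** Metrics source of a hypothetical transaction component. */
final class TxMetricsSourceSketch {
    /** Holder: keeps references to all metric instances of this source. */
    private static final class Holder {
        final LongAdder commits = new LongAdder();
        final LongAdder rollbacks = new LongAdder();
    }

    private final String name;

    /** Null while metrics are disabled for this source. */
    private volatile Holder holder;

    TxMetricsSourceSketch(String name) { this.name = name; }

    String name() { return name; }

    /** Creates all metrics at once; returns null if already enabled (assumption). */
    synchronized MetricsSetSketch enable() {
        if (holder != null)
            return null;

        Holder h = new Holder();
        Map<String, LongAdder> m = new LinkedHashMap<>();
        m.put("Commits", h.commits);
        m.put("Rollbacks", h.rollbacks);

        holder = h; // Publish the fully initialized holder with one volatile write.
        return new MetricsSetSketch(name, m);
    }

    /** Drops the holder so that all metric instances become unreachable from the source. */
    synchronized void disable() { holder = null; }

    boolean enabled() { return holder != null; }

    /** Component code calls methods like this instead of looking metrics up by name. */
    void onCommit() {
        Holder h = holder; // Read the reference once to avoid a race with disable().
        if (h != null)
            h.commits.increment();
    }

    void onRollback() {
        Holder h = holder;
        if (h != null)
            h.rollbacks.increment();
    }
}
```

Because the holder is published with a single volatile write, a caller of onCommit() sees either no metrics at all or the complete, initialized set.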
The diagram below shows relationships between entities.
Note, the following:
In order to provide an ability to produce metrics by some component the following steps should be taken:
Metrics source could be registered in the metrics registry using the register(MetricsSource src) method of the MetricsRegistry class.
Fail with exception if metrics source with given name already exists.
Metrics could be enabled using the enable(metricSourceName) method of the MetricsRegistry class.
Preconditions:
Metrics source is instantiated and registered in the metrics registry.
Actions:
The MetricsSource.enable() method produces an instance of the MetricsSet class. The metrics source also creates an instance of Holder internally and initializes the corresponding reference. The produced metrics set is added to the map of metrics sets under the same name as the metrics source.
Metrics could be disabled using the disable(metricSourceName) method of the MetricsRegistry class.
Preconditions:
Metrics source is instantiated and registered in the metrics registry.
Actions:
The MetricsSource.disable() method cleans up all resources if needed and assigns null to the holder reference internally.
Metrics source could be unregistered from the metrics registry using the unregister(MetricsSource src) method of the MetricsRegistry class.
Fail with exception if metrics source with given name doesn’t exist. Also metrics set should be removed from metrics sets map.
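The register, enable, disable and unregister steps above can be sketched as a small registry. This is only an illustration under stated assumptions: the SourceSketch interface, the SimpleSourceSketch stand-in, the use of IllegalStateException, the Object placeholder for a metrics set, and removing the metrics set from the map on disable are not defined by the proposal.

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Minimal view of a metrics source as seen by the registry (hypothetical). */
interface SourceSketch {
    String name();

    /** Returns the produced metrics set, or null if already enabled. */
    Object enable();

    void disable();
}

/** Trivial source used only to exercise the registry. */
final class SimpleSourceSketch implements SourceSketch {
    private final String name;
    private boolean enabled;

    SimpleSourceSketch(String name) { this.name = name; }

    @Override public String name() { return name; }

    @Override public synchronized Object enable() {
        if (enabled)
            return null;
        enabled = true;
        return "metrics set of " + name; // Placeholder for a real metrics set.
    }

    @Override public synchronized void disable() { enabled = false; }
}

final class MetricsRegistrySketch {
    private final Map<String, SourceSketch> sources = new ConcurrentHashMap<>();
    private final Map<String, Object> sets = new ConcurrentHashMap<>();

    /** Fails if a source with the same name is already registered. */
    synchronized void register(SourceSketch src) {
        if (sources.putIfAbsent(src.name(), src) != null)
            throw new IllegalStateException("Metrics source already exists: " + src.name());
    }

    /** Precondition: the source is registered. Stores the produced metrics set. */
    synchronized void enable(String name) {
        Object set = required(name).enable();
        if (set != null)
            sets.put(name, set);
    }

    /** Precondition: the source is registered. Dropping the set here is an assumption. */
    synchronized void disable(String name) {
        required(name).disable();
        sets.remove(name);
    }

    /** Fails if the source is unknown; also removes its metrics set. */
    synchronized void unregister(SourceSketch src) {
        if (sources.remove(src.name()) == null)
            throw new IllegalStateException("Metrics source doesn't exist: " + src.name());
        src.disable();
        sets.remove(src.name());
    }

    /** Read-only view of the metrics sets of all enabled sources, for exporters. */
    Map<String, Object> metricsSets() { return Collections.unmodifiableMap(sets); }

    private SourceSketch required(String name) {
        SourceSketch src = sources.get(name);
        if (src == null)
            throw new IllegalStateException("Metrics source doesn't exist: " + name);
        return src;
    }
}
```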
Apache Ignite 3, unlike Apache Ignite 2, will support only numeric metrics, because it is not clear what, for example, a string metric means and which operations apply to such metrics.
A metric is just a wrapper around a numeric value which could be increased or decreased by some value. Support of the IntMetric, LongMetric and DoubleMetric types is enough. Only primitive types are allowed as a metric value. These types should use volatile variables of the corresponding types, and atomic field updaters should be used in order to reduce memory footprint (see the IntMetricImpl class in Apache Ignite 2 for example).
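A minimal sketch of an int metric built on a volatile field and an atomic field updater is shown below. The class name IntMetricSketch and its method names are illustrative assumptions and do not reproduce the Apache Ignite 2 class.

```java
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

final class IntMetricSketch {
    /** One static updater shared by all instances, so no per-metric AtomicInteger object is needed. */
    private static final AtomicIntegerFieldUpdater<IntMetricSketch> UPDATER =
        AtomicIntegerFieldUpdater.newUpdater(IntMetricSketch.class, "val");

    private final String name;
    private final String description;

    /** Primitive value; must be a volatile int for the field updater to work. */
    private volatile int val;

    IntMetricSketch(String name, String description) {
        this.name = name;
        this.description = description;
    }

    String name() { return name; }

    String description() { return description; }

    /** Atomically adds a (possibly negative) delta. */
    void add(int delta) { UPDATER.addAndGet(this, delta); }

    void increment() { add(1); }

    void decrement() { add(-1); }

    /** Overwrites the current value. */
    void value(int newVal) { val = newVal; }

    /** Returns the current value as a primitive, without boxing. */
    int value() { return val; }
}
```

Compared with holding an AtomicInteger per metric, this layout saves one object header and one reference per metric instance.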
It also could be useful to have LongAdderMetric and DoubleAccumulatorMetric based on LongAdder and DoubleAccumulator respectively (see LongAdderMetric in Apache Ignite 2 for example).
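The two contention-friendly variants could look roughly like this. The class names and methods are assumptions made for illustration; using Double::sum as the accumulator function is one possible choice and is not prescribed by the proposal.

```java
import java.util.concurrent.atomic.DoubleAccumulator;
import java.util.concurrent.atomic.LongAdder;

/** Long metric optimized for many concurrent writers and rare reads. */
final class LongAdderMetricSketch {
    private final LongAdder adder = new LongAdder();

    void add(long delta) { adder.add(delta); }

    void increment() { adder.increment(); }

    void decrement() { adder.decrement(); }

    /** Sums the internal cells; not an atomic snapshot under concurrent updates. */
    long value() { return adder.sum(); }
}

/** Double metric that accumulates values with a sum function (assumed here). */
final class DoubleAccumulatorMetricSketch {
    private final DoubleAccumulator acc = new DoubleAccumulator(Double::sum, 0.0);

    void add(double x) { acc.accumulate(x); }

    double value() { return acc.get(); }
}
```

LongAdder spreads updates over several cells, which improves throughput under contention at the cost of a larger memory footprint than a single volatile long.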
The aforementioned metrics are just values and don't have any behavior. Apache Ignite 2 provides HitRateMetric, which accumulates approximate hit rate statistics based on hits in the last time interval (see the HitRateMetric class in Apache Ignite 2).
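To illustrate the idea of an approximate hit rate over the last time interval, here is a simplified sketch that splits the interval into fixed-size buckets. It is not the Apache Ignite 2 algorithm: it uses synchronized methods instead of lock-free arrays, and the class name HitRateSketch and the injected clock are assumptions made to keep the example short and testable.

```java
import java.util.Arrays;
import java.util.function.LongSupplier;

final class HitRateSketch {
    private final long bucketMillis;
    private final long[] counts;  // Hits per bucket.
    private final long[] epochs;  // Which time slice each bucket currently holds.
    private final LongSupplier clock;

    /**
     * @param windowMillis Length of the observed interval.
     * @param buckets Number of buckets the interval is split into.
     * @param clock Millisecond clock, e.g. System::currentTimeMillis.
     */
    HitRateSketch(long windowMillis, int buckets, LongSupplier clock) {
        if (buckets <= 0 || windowMillis < buckets)
            throw new IllegalArgumentException("Invalid window or bucket count");

        this.bucketMillis = windowMillis / buckets;
        this.counts = new long[buckets];
        this.epochs = new long[buckets];
        Arrays.fill(epochs, -1);
        this.clock = clock;
    }

    /** Records hits in the bucket of the current time slice. */
    synchronized void add(long hits) {
        long epoch = clock.getAsLong() / bucketMillis;
        int slot = (int) (epoch % counts.length);

        if (epochs[slot] != epoch) { // Bucket holds an expired slice: reuse it.
            epochs[slot] = epoch;
            counts[slot] = 0;
        }

        counts[slot] += hits;
    }

    /** Sums hits of all buckets that still belong to the observed interval. */
    synchronized long value() {
        long epoch = clock.getAsLong() / bucketMillis;
        long sum = 0;

        for (int i = 0; i < counts.length; i++) {
            if (epoch - epochs[i] < counts.length)
                sum += counts[i];
        }

        return sum;
    }
}
```

The result is approximate because the oldest bucket is dropped as a whole, so the effective interval varies between the window length minus one bucket and the full window length.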
A gauge is an instantaneous measurement of a value provided by some existing component. For example, Java ThreadPoolExecutor already provides some metrics like completed task count, task count, etc.
The gauge just uses a value supplier which returns the desired value from an existing component (see the LongGauge class in Apache Ignite 2 for example).
Apache Ignite 3 should provide IntGauge, LongGauge and DoubleGauge. The return value must always be a primitive (unboxed) type.
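A gauge can be sketched as a thin wrapper over a primitive supplier. The class name LongGaugeSketch and the factory method completedTasks are hypothetical; ThreadPoolExecutor.getCompletedTaskCount() is a standard JDK method.

```java
import java.util.concurrent.ThreadPoolExecutor;
import java.util.function.LongSupplier;

final class LongGaugeSketch {
    private final String name;

    /** LongSupplier returns a primitive long, so reading the gauge never boxes. */
    private final LongSupplier supplier;

    LongGaugeSketch(String name, LongSupplier supplier) {
        this.name = name;
        this.supplier = supplier;
    }

    String name() { return name; }

    /** Reads the current value from the underlying component on every call. */
    long value() { return supplier.getAsLong(); }

    /** Example: expose the completed task count of an existing thread pool. */
    static LongGaugeSketch completedTasks(ThreadPoolExecutor pool) {
        return new LongGaugeSketch("CompletedTaskCount", pool::getCompletedTaskCount);
    }
}
```

The gauge stores no state of its own, so it costs nothing until an exporter reads it.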
A composite metric is a group of closely related values. The values themselves are numeric. Apache Ignite 2 has only one composite metric: the histogram (see the HistogramMetric class). Because users expect different behavior from histogram metrics (see histograms in Dropwizard [3]), it could be renamed to BucketMetric or DistributionMetric.
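A bucket-based composite metric with fixed bounds could be sketched as follows. The class name DistributionMetricSketch and the bucket semantics (a value falls into the first bucket whose upper bound is greater than or equal to it, with one extra overflow bucket) are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicLongArray;

final class DistributionMetricSketch {
    /** Strictly ascending upper bounds of the buckets; fixed at construction. */
    private final long[] bounds;

    /** One counter per bound plus one overflow bucket. */
    private final AtomicLongArray counts;

    DistributionMetricSketch(long[] bounds) {
        for (int i = 1; i < bounds.length; i++) {
            if (bounds[i] <= bounds[i - 1])
                throw new IllegalArgumentException("Bounds must be strictly ascending");
        }

        this.bounds = bounds.clone();
        this.counts = new AtomicLongArray(bounds.length + 1);
    }

    /** Increments the counter of the bucket that the value falls into. */
    void add(long x) {
        int idx = Arrays.binarySearch(bounds, x);

        // When the value is not a bound, binarySearch returns -(insertionPoint) - 1.
        if (idx < 0)
            idx = -idx - 1;

        counts.incrementAndGet(idx);
    }

    /** Copies the counters; the copy is not an atomic snapshot under concurrent updates. */
    long[] value() {
        long[] res = new long[counts.length()];

        for (int i = 0; i < res.length; i++)
            res[i] = counts.get(i);

        return res;
    }
}
```

Because the bounds are fixed at construction, the layout of the composite value never changes, which matches the note on dynamic reconfiguration.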
Note: Dynamic metric reconfiguration (e.g. changing histogram buckets) is out of scope for this design. This feature does not seem to be needed and leads to a lot of problems (keeping configuration for such a metric, unclear management scope, unclear conflict resolution).
The fully qualified metric name is the name of the metrics source that contains the metric followed by the short name of the metric (e.g. QueueSize), separated by a dot. Short metric names should start with an uppercase character and use camel case.
A metrics source name is a combination of names separated by dots, like a Java package name. A metrics source name should use only lowercase characters.
Only Latin symbols are allowed.
Example:
Let's say we have a partition. Transactions can obtain locks on the whole partition in some cases. Our metric has the name PartitionLocksCount and our metrics source has the name partition.<part_no>.tx, where <part_no> is a partition number. The fully qualified name of the metric will be: partition.<part_no>.tx.PartitionLocksCount.
Such notation makes it possible to build a hierarchy of metrics similar to file system directories.
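The naming rules can be checked mechanically. The sketch below is only an illustration: the class MetricNamesSketch is hypothetical, and allowing digits in both kinds of names is an assumption made because the example source name contains a partition number.

```java
import java.util.regex.Pattern;

final class MetricNamesSketch {
    /** Lowercase Latin letters and digits in dot-separated segments (digits are an assumption). */
    private static final Pattern SOURCE_NAME = Pattern.compile("[a-z0-9]+(\\.[a-z0-9]+)*");

    /** Camel case starting with an uppercase Latin letter. */
    private static final Pattern METRIC_NAME = Pattern.compile("[A-Z][A-Za-z0-9]*");

    private MetricNamesSketch() { }

    /** Validates both parts and joins them with a dot. */
    static String qualifiedName(String sourceName, String metricName) {
        if (!SOURCE_NAME.matcher(sourceName).matches())
            throw new IllegalArgumentException("Invalid metrics source name: " + sourceName);

        if (!METRIC_NAME.matcher(metricName).matches())
            throw new IllegalArgumentException("Invalid metric name: " + metricName);

        return sourceName + '.' + metricName;
    }
}
```

For a partition number of 42, qualifiedName("partition.42.tx", "PartitionLocksCount") yields partition.42.tx.PartitionLocksCount.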
This proposal does not define the exporter design; additional research is needed. The simplest implementations are therefore allowed for demonstration purposes.
JMX exporter (or reporter) should expose (if enabled) each enabled metrics source as MXBean where each metric is represented by an attribute in terms of JMX (see the JmxMetricExporterSpi class in Apache Ignite 2 for example).
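One possible way to represent each metric as a JMX attribute is a read-only DynamicMBean built from the metric names of a source. This is a sketch under assumptions, not the exporter design: the class name MetricsSourceMBeanSketch is hypothetical, only long-valued metrics are handled, and a DynamicMBean is used instead of a statically typed MXBean interface because the attribute set is known only at runtime.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;
import javax.management.Attribute;
import javax.management.AttributeList;
import javax.management.AttributeNotFoundException;
import javax.management.DynamicMBean;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanInfo;

final class MetricsSourceMBeanSketch implements DynamicMBean {
    private final String sourceName;

    /** Metric name to value reader; stands in for a real metrics set. */
    private final Map<String, LongSupplier> metrics;

    MetricsSourceMBeanSketch(String sourceName, Map<String, LongSupplier> metrics) {
        this.sourceName = sourceName;
        this.metrics = new LinkedHashMap<>(metrics);
    }

    @Override public Object getAttribute(String attribute) throws AttributeNotFoundException {
        LongSupplier s = metrics.get(attribute);
        if (s == null)
            throw new AttributeNotFoundException(attribute);
        return s.getAsLong();
    }

    /** Unknown attribute names are silently skipped, as the JMX contract allows. */
    @Override public AttributeList getAttributes(String[] attributes) {
        AttributeList list = new AttributeList();
        for (String a : attributes) {
            LongSupplier s = metrics.get(a);
            if (s != null)
                list.add(new Attribute(a, s.getAsLong()));
        }
        return list;
    }

    /** Describes every metric as a readable, non-writable attribute of type long. */
    @Override public MBeanInfo getMBeanInfo() {
        MBeanAttributeInfo[] attrs = metrics.keySet().stream()
            .map(n -> new MBeanAttributeInfo(n, "long", n, true, false, false))
            .toArray(MBeanAttributeInfo[]::new);

        return new MBeanInfo(getClass().getName(), sourceName, attrs, null, null, null);
    }

    @Override public void setAttribute(Attribute attribute) {
        throw new UnsupportedOperationException("Metrics are read-only");
    }

    @Override public AttributeList setAttributes(AttributeList attributes) {
        throw new UnsupportedOperationException("Metrics are read-only");
    }

    @Override public Object invoke(String actionName, Object[] params, String[] signature) {
        throw new UnsupportedOperationException("No operations are exposed");
    }
}
```

Such a bean could be registered with ManagementFactory.getPlatformMBeanServer().registerMBean(bean, objectName) when the source is enabled and unregistered when it is disabled; the ObjectName layout is left open here.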
It should be possible to enable or disable metrics sources by name via Ignite Shell.
Syntax (not final, discuss with CLI team):
ignite metrics enable <metrics_source_name>
ignite metrics disable <metrics_source_name>