Status
Current state: “Under Discussion”
Discussion thread: https://lists.apache.org/thread/sb2jrrl9f5vn26mfmwf9qo7w2rkcryfj
JIRA: SOLR-17458
Released: 10
Motivation
Solr currently uses the Dropwizard Metrics 4 framework for metric collection and event measurement. There are a number of problems with Dropwizard that motivate moving off of it:
No tag/attribute metrics support for aggregation
Metrics with Dropwizard have no concept of tags; the tags are instead embedded in the metric names themselves. See the example below from the /admin/metrics API for someone looking for the number of select requests:
"QUERY./select.requests":0,
"QUERY./select.serverErrors":{
"count":0,
"meanRate":0.0,
"1minRate":0.0,
"5minRate":0.0,
"15minRate":0.0
}
In order to retrieve this programmatically, or just as a user, one must rely on the format of the metric name, which in this example is <category>.<handler>.<type>. Solr offers filtering in a number of ways, such as regex: the example above would be retrieved with QUERY\./select\..*. If the metric name changes at all, that workflow breaks.
In a tag-based metric framework, appending or removing tags unrelated to a given metric query workflow would not break that workflow.
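For illustration, suppose the request counter were instead exposed as a hypothetical solr_requests_total metric with category and handler labels. The lookup then depends only on stable label names, not on field positions inside the metric name:

```
# Prometheus exposition format
solr_requests_total{category="QUERY",handler="/select"} 0

# PromQL selector; unaffected if unrelated labels are later added or removed
solr_requests_total{category="QUERY",handler="/select"}
```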
Complex and difficult filtering
Adding to the point above, filtering with Dropwizard is difficult and complex. The Prometheus Exporter shipped with Solr has a very large and complex JQ config file with a templating feature that is not easy to understand. For example, to extract all metrics for a core from the core registry, here is the JQ query template from the Prometheus Exporter configuration:
<template name="core" defaultType="COUNTER">
.metrics | to_entries | .[] | select(.key | startswith("solr.core.")) as $parent |
$parent.key | split(".") as $parent_key_items |
$parent_key_items | length as $parent_key_item_len |
(if $parent_key_item_len == 3 then $parent_key_items[2] else "" end) as $core |
(if $parent_key_item_len == 5 then $parent_key_items[2] else "" end) as $collection |
(if $parent_key_item_len == 5 then $parent_key_items[3] else "" end) as $shard |
(if $parent_key_item_len == 5 then $parent_key_items[4] else "" end) as $replica |
(if $parent_key_item_len == 5 then ($collection + "_" + $shard + "_" + $replica) else $core end) as $core |
$parent.value | to_entries | .[] | {KEYSELECTOR} as $object |
$object.key | split(".")[0] as $category |
$object.key | split(".")[1] | rtrimstr("]") | split("[") | .[0] as $handler | .[1] // "false" as $internal |
select($handler | startswith("/")) |
{METRIC} as $value |
if $parent_key_item_len == 3 then
{
name: "solr_metrics_core_{UNIQUE}",
type: "{TYPE}",
help: "See following URL: https://solr.apache.org/guide/solr/latest/deployment-guide/metrics-reporting.html",
label_names: ["category", "handler", "internal", "core"],
label_values: [$category, $handler, $internal, $core],
value: $value
}
else
{
name: "solr_metrics_core_{UNIQUE}",
type: "{TYPE}",
help: "See following URL: https://solr.apache.org/guide/solr/latest/deployment-guide/metrics-reporting.html",
label_names: ["category", "handler", "internal", "core", "collection", "shard", "replica"],
label_values: [$category, $handler, $internal, $core, $collection, $shard, $replica],
value: $value
}
end
</template>
The query is complex, and because the tags are encoded in the metric names it is brittle: backwards compatibility is difficult to maintain and regressions are easy to introduce.
Maintenance and operational overhead of the prometheus exporter
The Prometheus Exporter is a Solr-specific process that scrapes /admin/metrics, and it requires constant maintenance whenever Solr introduces new metrics or makes big changes. Running this extra external process is also operationally costly.
Proposed changes
OpenTelemetry SDK - Prometheus
Solr already uses Open Telemetry as a module for distributed tracing which integrates the SDK with auto-configuration. We aim to leverage the Open Telemetry SDK and move its dependencies into core.
The Open Telemetry framework has two different interfaces: the API and the SDK.
- The API is used to measure/capture metrics with instruments in Solr. Open Telemetry supports different instrument types, to which we will migrate Dropwizard's equivalents. See the Meter section of the Open Telemetry documentation.
- The SDK is then used to configure and export the API measurements collected from Solr, providing a number of different exporters: OTLP, Prometheus, or even in-memory metric readers. Alternatively, the Java agent can be used as a "zero-code" option.
In Solr core, we will implement the Open Telemetry API to collect and record metrics.
We will then add only the Open Telemetry Prometheus exporter SDK artifact to core to expose these metrics through the /admin/metrics endpoint in Prometheus format. This works out of the box, meaning that by default Solr's method for exposing metrics is Prometheus and a pull-based protocol.
SolrMetricsContext needs to be completely refactored from wrapping Dropwizard to wrapping Open Telemetry. Some custom metric types that Solr builds on Dropwizard, such as MetricsMaps or complex Gauges, may be impossible or too difficult to port to OTel.
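As a minimal sketch of what recording through the Open Telemetry API could look like (the scope name solr.core and metric name solr_requests are illustrative, and the in-memory reader comes from the opentelemetry-sdk-testing artifact):

```java
import java.util.List;
import java.util.stream.Collectors;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;
import io.opentelemetry.sdk.metrics.SdkMeterProvider;
import io.opentelemetry.sdk.testing.exporter.InMemoryMetricReader;

public class TaggedCounterSketch {
  static List<String> collectNames() {
    InMemoryMetricReader reader = InMemoryMetricReader.create();
    SdkMeterProvider provider =
        SdkMeterProvider.builder().registerMetricReader(reader).build();
    // An instrumentation scope plays roughly the role of a Dropwizard registry.
    Meter meter = provider.get("solr.core");
    LongCounter requests = meter.counterBuilder("solr_requests")
        .setDescription("Requests per category/handler").build();
    // What Dropwizard encoded in the name "QUERY./select.requests"
    // becomes attributes (tags) on a single instrument:
    requests.add(1, Attributes.builder()
        .put("category", "QUERY").put("handler", "/select").build());
    List<String> names = reader.collectAllMetrics().stream()
        .map(m -> m.getName()).collect(Collectors.toList());
    provider.close();
    return names;
  }

  public static void main(String[] args) {
    // One metric is exported, named independently of its tags.
    System.out.println(collectNames());
  }
}
```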
OpenTelemetry SDK - OTLP
Users who want to collect Solr metrics with a push-based protocol using OTLP would need to enable the Open Telemetry OTLP module, the same way the module is enabled for distributed tracing. OTLP is becoming an industry standard, with many different plugins and tools for exporting these metrics.
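If the module follows standard Open Telemetry auto-configuration, push export would likely be driven by the usual otel.* system properties; the endpoint value below is illustrative:

```
-Dotel.metrics.exporter=otlp
-Dotel.exporter.otlp.endpoint=http://localhost:4317
-Dotel.exporter.otlp.protocol=grpc
-Dotel.metric.export.interval=60000
```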
Metrics API /admin/metrics
The metrics endpoint will be changed to output Prometheus metrics as the standard for pull-based metric pipelines. This will use the Open Telemetry SDK's Prometheus exporter to transform the API measurements into the Prometheus data model.
Filtering
We will maintain filtering capabilities on the metrics API so users can limit the amount of data scraped from the endpoint. Enhancements may include filtering by tags, by specific metric names, or by registries/groups, preserving that concept with Open Telemetry. We still need to see what the Open Telemetry SDK supports for filtering inside Solr; filtering can also happen at the exporter level with an OTel Collector.
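If the Open Telemetry SDK's view mechanism turns out to fit, server-side filtering could look roughly like the sketch below, which drops a whole family of instruments before export (the solr_cache_* name is illustrative; views, wildcard instrument selectors, and Aggregation.drop() are standard SDK features, and the in-memory reader is from opentelemetry-sdk-testing):

```java
import io.opentelemetry.sdk.metrics.Aggregation;
import io.opentelemetry.sdk.metrics.InstrumentSelector;
import io.opentelemetry.sdk.metrics.SdkMeterProvider;
import io.opentelemetry.sdk.metrics.View;
import io.opentelemetry.sdk.testing.exporter.InMemoryMetricReader;

public class ViewFilterSketch {
  static int exportedCount() {
    InMemoryMetricReader reader = InMemoryMetricReader.create();
    SdkMeterProvider provider = SdkMeterProvider.builder()
        .registerMetricReader(reader)
        // Drop every instrument whose name matches solr_cache_* before export.
        .registerView(
            InstrumentSelector.builder().setName("solr_cache_*").build(),
            View.builder().setAggregation(Aggregation.drop()).build())
        .build();
    provider.get("solr.core").counterBuilder("solr_cache_hits").build().add(5);
    int size = reader.collectAllMetrics().size();
    provider.close();
    return size;
  }

  public static void main(String[] args) {
    // The dropped counter never reaches the reader.
    System.out.println("exported metrics: " + exportedCount());
  }
}
```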
Deprecation of Prometheus Exporter
With the deprecation and removal of the Prometheus Exporter, users will be encouraged to adopt alternative solutions such as the OTel Collector. For those requiring pre-aggregation filtering or custom solutions, tools like Telegraf can be recommended. Deprecation also removes the burden of maintaining the exporter.
JVM Metrics Collection
JVM metrics can be collected using the Open Telemetry runtime-telemetry-java17 library, which gathers comprehensive metric sets from JFR and JMX. We can also programmatically filter which metrics we want based on the available JfrFeatures, giving users further control. See the runtime-telemetry-java17 JfrFeature table.
Snippet of these metrics output from OTel:
jvm_gc_duration_seconds_count{jvm_gc_action="end of minor GC",jvm_gc_name="G1 Young Generation",otel_scope_name="io.opentelemetry.runtime-telemetry-java17",otel_scope_version="2.14.0-alpha"} 3
jvm_gc_duration_seconds_sum{jvm_gc_action="end of minor GC",jvm_gc_name="G1 Young Generation",otel_scope_name="io.opentelemetry.runtime-telemetry-java17",otel_scope_version="2.14.0-alpha"} 0.013224165999999999
jvm_gc_duration_seconds_bucket{jvm_gc_action="end of minor GC",jvm_gc_name="G1 Young Generation",otel_scope_name="io.opentelemetry.runtime-telemetry-java8",otel_scope_version="2.14.0-alpha",le="0.01"} 3
jvm_gc_duration_seconds_bucket{jvm_gc_action="end of minor GC",jvm_gc_name="G1 Young Generation",otel_scope_name="io.opentelemetry.runtime-telemetry-java8",otel_scope_version="2.14.0-alpha",le="0.1"} 3
jvm_gc_duration_seconds_bucket{jvm_gc_action="end of minor GC",jvm_gc_name="G1 Young Generation",otel_scope_name="io.opentelemetry.runtime-telemetry-java8",otel_scope_version="2.14.0-alpha",le="1.0"} 3
jvm_gc_duration_seconds_bucket{jvm_gc_action="end of minor GC",jvm_gc_name="G1 Young Generation",otel_scope_name="io.opentelemetry.runtime-telemetry-java8",otel_scope_version="2.14.0-alpha",le="10.0"} 3
jvm_gc_duration_seconds_bucket{jvm_gc_action="end of minor GC",jvm_gc_name="G1 Young Generation",otel_scope_name="io.opentelemetry.runtime-telemetry-java8",otel_scope_version="2.14.0-alpha",le="+Inf"} 3
jvm_gc_duration_seconds_count{jvm_gc_action="end of minor GC",jvm_gc_name="G1 Young Generation",otel_scope_name="io.opentelemetry.runtime-telemetry-java8",otel_scope_version="2.14.0-alpha"} 3
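Assuming the runtime-telemetry-java17 builder API with its per-JfrFeature toggles (feature names per the JfrFeature table), opting in or out of individual feature sets might look like:

```java
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.instrumentation.runtimemetrics.java17.JfrFeature;
import io.opentelemetry.instrumentation.runtimemetrics.java17.RuntimeMetrics;

public class JvmMetricsSketch {
  static RuntimeMetrics start(OpenTelemetry openTelemetry) {
    return RuntimeMetrics.builder(openTelemetry)
        // Toggle individual JfrFeatures; names follow the JfrFeature table.
        .enableFeature(JfrFeature.GC_DURATION_METRICS)
        .disableFeature(JfrFeature.THREAD_METRICS)
        .build(); // starts collection and registers the observables
  }

  public static void main(String[] args) {
    // A real caller would pass Solr's configured SDK instead of a no-op instance.
    RuntimeMetrics metrics = start(OpenTelemetry.noop());
    System.out.println("JVM metrics collection started");
    metrics.close();
  }
}
```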
New metrics
We will also introduce new metrics that did not exist before, such as metrics on replica state, the Overseer, and ZooKeeper. For example, something similar to the below:
solr_core_is_leader{core="core1",host="localhost:8983",shard="shard1"} 1
solr_core_is_leader{core="core2",host="localhost:8983",shard="shard1"} 0
solr_core_state{core="core1",host="localhost:8983",shard="shard1",state="active"} 1
solr_core_state{core="core1",host="localhost:8983",shard="shard1",state="recovery"} 0
The Prometheus Exporter originally retrieved these metrics by scraping the /admin/collections handler and transforming its output. We will expose these natively from /admin/metrics instead.
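A state metric like this could plausibly be recorded with an asynchronous gauge observed at scrape time; the sketch below mirrors the names and labels above and uses the in-memory reader from opentelemetry-sdk-testing to make it self-contained:

```java
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.Meter;
import io.opentelemetry.sdk.metrics.SdkMeterProvider;
import io.opentelemetry.sdk.testing.exporter.InMemoryMetricReader;

public class LeaderGaugeSketch {
  static int observedMetricCount() {
    InMemoryMetricReader reader = InMemoryMetricReader.create();
    SdkMeterProvider provider =
        SdkMeterProvider.builder().registerMetricReader(reader).build();
    Meter meter = provider.get("solr.core");
    boolean isLeader = true; // in Solr this would come from cluster state
    // Asynchronous gauge: the callback runs at collection time,
    // reporting 1 when this core is the shard leader.
    meter.gaugeBuilder("solr_core_is_leader").ofLongs()
        .buildWithCallback(obs -> obs.record(isLeader ? 1 : 0,
            Attributes.builder().put("core", "core1").put("shard", "shard1").build()));
    int size = reader.collectAllMetrics().size();
    provider.close();
    return size;
  }

  public static void main(String[] args) {
    System.out.println(observedMetricCount() + " metric observed");
  }
}
```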
Use-case migration with Open Telemetry
Pull model with metrics API (GET /admin/metrics) and filtering
The GET /admin/metrics API will continue to exist, allowing users to scrape metrics with a pull-based system. However, the removal of Dropwizard means the current format, its naming conventions, and its usage will change. The endpoint will now output Prometheus standard formatted metrics, and filters will be based on tags instead of regex. Below are some use cases and how the user would migrate:
Manual Access to Solr metric endpoint
For example, users currently curl Solr as follows to see the number of /select requests and errors:
curl 'localhost:8983/solr/admin/metrics?regex=QUERY\./select\..*'
With the new endpoint changes, the curl would look more like something below:
curl 'localhost:8983/solr/admin/metrics?name=solr_requests_total&category=QUERY&handler=/select'
Programmatic reading of the endpoint parsing JSON or XML
This is a breaking change and will no longer be supported. User applications need to change to follow the Prometheus exposition format and data model instead: https://prometheus.io/docs/concepts/data_model/
Basic usage of the Prometheus Exporter with default configuration
The Prometheus Exporter will be deprecated. Users running the standard Solr configuration file for the Prometheus Exporter can instead scrape directly from Solr nodes with a Prometheus server or their own application.
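A minimal Prometheus scrape job against the new endpoint might look like the following (host and path per the examples in this document):

```yaml
scrape_configs:
  - job_name: "solr"
    metrics_path: "/solr/admin/metrics"
    static_configs:
      - targets: ["localhost:8983"]
```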
Custom usage of the Prometheus Exporter
Users who use a more custom configuration file for the Prometheus Exporter, for complex pre-filtering or aggregation, should instead adopt third-party tooling such as the OTel Collector or Telegraf, which offer far more powerful aggregation and filtering methods.
An example OTel Collector config is shown below that receives OTLP and exports to Prometheus on port `8889`.
service:
# extensions: [health_check, pprof, zpages]
pipelines:
metrics:
receivers: [otlp]
exporters: [prometheus]
telemetry:
metrics:
address: 0.0.0.0:8888
level: detailed
receivers:
# Data sources: traces, metrics, logs
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
exporters:
prometheus:
endpoint: 0.0.0.0:8889
namespace: default
Users can also use the Prometheus OTLP receiver, as described at https://prometheus.io/docs/guides/opentelemetry/, if they don't want to run an OTel Collector.
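Per that guide, the receiver sits behind a Prometheus feature flag, and the OTLP exporter is then pointed at the Prometheus OTLP endpoint (flag and path as documented by Prometheus; host and port are illustrative):

```
# Start Prometheus with its OTLP receiver enabled
prometheus --enable-feature=otlp-write-receiver

# Point the OTLP metrics exporter at Prometheus
-Dotel.exporter.otlp.protocol=http/protobuf
-Dotel.exporter.otlp.metrics.endpoint=http://localhost:9090/api/v1/otlp/v1/metrics
```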
Usage of the /admin/metrics?wt=prometheus endpoint
Scrape and pull metric models will not change. The only changes are new metrics, naming conventions, and tags.
Tests asserting metrics
Solr uses /admin/metrics for some integration testing. For example, PeerSyncReplicationTest asserts on REPLICATION.peerSync.errors to detect failures. This workflow will stay the same, but developers should use an Open Telemetry SDK MetricReader to find the metrics their tests assert on.
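For example, using the SDK's testing support (opentelemetry-sdk-testing), a test can read a counter back directly from an in-memory reader; the solr_peersync_errors name below is a hypothetical stand-in for whatever the migrated metric ends up being called:

```java
import io.opentelemetry.sdk.metrics.SdkMeterProvider;
import io.opentelemetry.sdk.metrics.data.LongPointData;
import io.opentelemetry.sdk.testing.exporter.InMemoryMetricReader;

public class MetricAssertionSketch {
  static long peerSyncErrors() {
    InMemoryMetricReader reader = InMemoryMetricReader.create();
    SdkMeterProvider provider =
        SdkMeterProvider.builder().registerMetricReader(reader).build();
    // Stand-in for the counter the code under test would increment.
    provider.get("solr.core").counterBuilder("solr_peersync_errors").build().add(2);
    // Pull the collected data points and sum the counter's value.
    long errors = reader.collectAllMetrics().stream()
        .filter(m -> m.getName().equals("solr_peersync_errors"))
        .flatMap(m -> m.getLongSumData().getPoints().stream())
        .mapToLong(LongPointData::getValue)
        .sum();
    provider.close();
    return errors;
  }

  public static void main(String[] args) {
    System.out.println("errors=" + peerSyncErrors());
  }
}
```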
Metric reporters
Metric reporters will be deprecated. (TBD need more research and information on reporters)
Push model with OTLP
Users will now have the option of using Open Telemetry's push model with OTLP metrics, similar to how traces and spans are exported, but must enable the Open Telemetry OTLP module.
Execution plan
Many large changes will take place, including breaking backwards compatibility. We will create a feature branch for code reviews until the feature is ready for main. At a high level, the changes will be split into three major parts for code review:
Open Telemetry Dependencies and module refactoring
Solr needs to add a few main dependencies to instrument and collect metrics in Solr core:
io.opentelemetry:opentelemetry-api - Enables Solr to record telemetry through a global Open Telemetry context.
opentelemetry-runtime-telemetry-java17 - Enabled on startup so that JVM metrics are automatically recorded and collected with the Open Telemetry API.
opentelemetry-exporter-prometheus - The part of the Open Telemetry SDK needed to export the metrics collected through the Open Telemetry API. We will include only opentelemetry-exporter-prometheus and exclude other unneeded SDK dependencies, allowing Solr to hold a Prometheus MetricReader and expose these metrics at the /admin/metrics endpoint.
In CoreContainer we will need to refactor how Open Telemetry is initialized, which is currently done through TracerConfigurator:loadTracer but creates an OpenTelemetrySdk that supports only tracing. This needs to be refactored to support both tracing and metrics.
The Open Telemetry module goes through auto-configuration initialization for OTLP. When a user enables this module, the initialization of Open Telemetry will also need to enable OTLP alongside the Prometheus metric reader so that the /admin/metrics endpoint remains available. This requires some custom changes to Open Telemetry auto-configuration for metrics.
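One possible shape for this, assuming the opentelemetry-sdk-extension-autoconfigure artifact, is a meter-provider customizer that always attaches a Prometheus reader while leaving OTLP to the usual otel.* properties. In this sketch the exporters are set to "none" only to keep it dependency-free, and port 0 picks an ephemeral port (9464 is the conventional Prometheus exporter port); Solr would instead surface the reader behind /admin/metrics:

```java
import java.util.Map;
import io.opentelemetry.exporter.prometheus.PrometheusHttpServer;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.autoconfigure.AutoConfiguredOpenTelemetrySdk;

public class MetricsAutoConfigSketch {
  static OpenTelemetrySdk configure() {
    return AutoConfiguredOpenTelemetrySdk.builder()
        // In Solr the OTLP side would stay configurable via otel.* properties;
        // "none" here just keeps this sketch free of the OTLP dependency.
        .addPropertiesSupplier(() -> Map.of(
            "otel.traces.exporter", "none",
            "otel.metrics.exporter", "none",
            "otel.logs.exporter", "none"))
        // Always register a Prometheus reader so pull-based metrics stay
        // available regardless of how OTLP export is configured.
        .addMeterProviderCustomizer((meterBuilder, config) ->
            meterBuilder.registerMetricReader(
                PrometheusHttpServer.builder().setPort(0).build()))
        .build()
        .getOpenTelemetrySdk();
  }

  public static void main(String[] args) {
    OpenTelemetrySdk sdk = configure();
    System.out.println("sdk configured: " + (sdk != null));
    sdk.close();
  }
}
```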
Metric wrappers and /admin/metrics endpoint
Before recording metrics and telemetry, we will need to create wrappers around Open Telemetry so that recording metrics is easy going forward. Solr currently does this by wrapping SolrMetricManager and SolrMetricsContext around Dropwizard. We will modify these wrappers to create Open Telemetry instruments such as counters and gauges, while also mapping Dropwizard registries onto Open Telemetry's equivalent, called a scope. MetricsHandler will also drop all of its PrometheusFormatter code; instead, we will collect the Open Telemetry metrics in Prometheus format using the Prometheus metric reader created in the CoreContainer refactoring. This lays the foundation of recordable, reachable Prometheus metrics on the /admin/metrics endpoint.
Metric API instrumentation and Deprecation of Dropwizard
SolrMetricProducer has 219 implementations. We will need to visit each overridden initializeMetrics() call and migrate it to the equivalent Open Telemetry metric API for measuring events instead of Dropwizard. We will also remove Dropwizard as a dependency from Solr, along with all its existing usages. This also means deprecating the metric reporters and the Prometheus Exporter. Lastly, we will need to rewrite and update all metric documentation.