Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
Jobs with a large number of tasks and operators can produce very large metric payloads. OpenTelemetry exporter gRPC calls can then be rejected because the payload is too large, which leads to loss of exported metric data in production environments. The existing implementation has the following limitations:
While the OTel exporter supports gzip compression for requests, this option is not exposed through the Flink configuration.
All metric data is exported in a single gRPC call with no batching, which affects sufficiently complex jobs.
Metric attributes can be unexpectedly large. For example, the task_name attribute can contain hundreds of operator names and appear in multiple metrics, which significantly increases the overall payload.
Public Interfaces
This change does not modify existing public interfaces; it only introduces new Flink configuration options.
Gzip compression config option
Config description: Compression method for the OTel reporter. Supported values are 'gzip' and 'none'. Default is 'none'.
Config name: metrics.reporter.otel.exporter.compression
Default value: none
Support of batching for metric data export
Batch size:
Config description: Number of metric data points per batch. Values <= 0 disable batching (all metrics are sent in a single request).
Config name: metrics.reporter.otel.batch.size
Default value: 0
Support of exported metric attributes truncation
Config description: Limits on the length of exported attribute values. The configuration is prefix based. For example, to limit the task_name attribute, set transform.attribute-value-length-limits.task_name: 100 in the config for the OTel reporter. The special key '*' defines a global limit for all attributes that are not explicitly listed. For example, transform.attribute-value-length-limits.*: 1024 limits all attributes to attributeValue.substring(0, 1024). The global limit defaults to Integer.MAX_VALUE if not set. Individual attribute limits always override the global limit and are matched exactly against the attribute name. A value of 0 drops the attribute. Negative values are interpreted as no limit for the attribute (this can be used to override the global limit).
Config name: metrics.reporter.otel.transform.attribute-value-length-limits.<attribute_name>
Default value: N/A
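The limit resolution rules described above can be sketched as follows. This is only an illustration under stated assumptions: the class AttributeValueLimits and its apply method are hypothetical names, not existing Flink or OpenTelemetry APIs, and the sketch assumes that a global limit of 0 drops every attribute without an explicit override, which the description above does not spell out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper for illustration; not part of Flink or the OpenTelemetry SDK. */
final class AttributeValueLimits {
    private static final String GLOBAL_KEY = "*";

    private final Map<String, Integer> perAttribute = new HashMap<>();
    private final int globalLimit;

    /** Takes the parsed suffix -> limit pairs, e.g. {"*": 1024, "task_name": 100}. */
    AttributeValueLimits(Map<String, Integer> configured) {
        int global = Integer.MAX_VALUE; // default when '*' is not set
        for (Map.Entry<String, Integer> entry : configured.entrySet()) {
            if (GLOBAL_KEY.equals(entry.getKey())) {
                global = entry.getValue();
            } else {
                perAttribute.put(entry.getKey(), entry.getValue());
            }
        }
        this.globalLimit = global;
    }

    /** Returns the possibly truncated value, or empty if the attribute is dropped. */
    Optional<String> apply(String attributeName, String value) {
        // An exact-name override wins over the global limit.
        int limit = perAttribute.getOrDefault(attributeName, globalLimit);
        if (limit == 0) {
            return Optional.empty(); // 0 drops the attribute
        }
        if (limit < 0 || value.length() <= limit) {
            return Optional.of(value); // negative means no limit
        }
        return Optional.of(value.substring(0, limit));
    }
}
```

With limits {"*": 8, "task_name": 4}, a task_name value is cut to 4 characters while any other attribute is cut to 8.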
Proposed Changes
Gzip compression support for gRPC OTel exporter.
This is a trivial change in the factory to support the new config option, since the OTel client already allows this property to be configured. Note that the OTel collector supports both options out of the box without additional configuration.
Compression will be disabled by default.
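To give a rough feel for why gzip helps with this kind of payload, the sketch below compresses a synthetic, highly repetitive text payload using only the JDK. It is not the exporter code path: the real payload is a protobuf-encoded OTLP request, the class GzipPayloadDemo and the payload layout are made up for illustration, and actual compression ratios will differ.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

/** Hypothetical demo class; stands in for the real OTLP request encoding. */
final class GzipPayloadDemo {

    /** Compresses the given bytes with gzip using the JDK implementation. */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Builds a synthetic payload where every metric repeats the same long attribute. */
    static byte[] syntheticPayload(int metricCount) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < metricCount; i++) {
            sb.append("task_name=Source -> Map -> Filter -> Sink;metric=")
              .append(i)
              .append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = syntheticPayload(1000);
        byte[] compressed = gzip(raw);
        System.out.println("raw bytes: " + raw.length + ", gzip bytes: " + compressed.length);
    }
}
```

Because the same attribute values repeat across many metrics, such payloads tend to compress well, which is the main argument for exposing the option.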
Support of batching for metric data export
The proposed changes affect the OpenTelemetryMetricReporter#report() method. I suggest splitting the gRPC payload into batches whenever the number of metric data points exceeds the configured batch size, on the assumption that payload size is highly correlated with the number of metric data points. All gRPC calls will be made asynchronously, with a follow-up check on the success of all batches. If some batches fail, we will log an error, matching the existing behaviour.
Enabling batching makes it easier to choose a predictable gRPC payload limit on the OTel collector. For example, assuming a batch size of 1000 and 2-3 KB per metric data point, a batch is roughly 2-3 MB, so max_recv_msg_size_mib could be set to around 4 (MiB).
Batching will be disabled by default.
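The batching behaviour described above can be sketched as follows. This is a simplified illustration, not the proposed implementation: MetricBatcher, partition and exportAll are hypothetical names, each list element is assumed to correspond to one metric data point, and a CompletableFuture<Boolean> stands in for whatever result type the real exporter call returns.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

/** Hypothetical helper for illustration; not part of Flink or the OpenTelemetry SDK. */
final class MetricBatcher {

    /** Splits items into batches; batchSize <= 0 disables batching (single batch). */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0 || items.size() <= batchSize) {
            return Collections.singletonList(items);
        }
        List<List<T>> batches = new ArrayList<>();
        for (int from = 0; from < items.size(); from += batchSize) {
            int to = Math.min(from + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(from, to)));
        }
        return batches;
    }

    /**
     * Starts one asynchronous send per batch, then waits for all of them
     * and returns the number of batches that failed.
     */
    static <T> int exportAll(
            List<T> items,
            int batchSize,
            Function<List<T>, CompletableFuture<Boolean>> sender) {
        List<CompletableFuture<Boolean>> results = new ArrayList<>();
        for (List<T> batch : partition(items, batchSize)) {
            results.add(sender.apply(batch)); // all sends are started before any is awaited
        }
        int failed = 0;
        for (CompletableFuture<Boolean> result : results) {
            // An exceptional completion counts as a failed batch.
            if (!result.exceptionally(t -> false).join()) {
                failed++;
            }
        }
        return failed;
    }
}
```

A caller would log an error when exportAll returns a non-zero count, which corresponds to the "log on failure" behaviour kept from the existing reporter.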
Support of exported metric attributes truncation
Add an internal “Adapter/Transformer” class to the OTel reporter that truncates metric metadata at metric registration time. I suggest allowing a global truncation limit with per-attribute overrides.
Because attribute truncation can lead to collisions in the exported metric data, we will detect such occurrences when a metric is added to the exporter and log each detected collision as a warning.
The OTel collector also allows configuring “transformations” of the data that can perform truncation; however, truncating at the source is preferable because it avoids unnecessary network and CPU load as well as large payload sizes.
Attribute truncation will be disabled by default.
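One possible way to detect truncation collisions is sketched below. It assumes that each metric can be reduced to a string identity (for example, metric name plus serialized attributes) both before and after truncation; the class TruncationCollisionDetector and this identity scheme are hypothetical and only illustrate the idea.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of Flink or the OpenTelemetry SDK. */
final class TruncationCollisionDetector {
    // Maps a truncated identity to the original identity that first claimed it.
    private final Map<String, String> firstOwner = new HashMap<>();

    /**
     * Records the metric and returns true if a different original identity
     * already maps to the same truncated identity.
     */
    boolean registerAndCheck(String originalId, String truncatedId) {
        String previous = firstOwner.putIfAbsent(truncatedId, originalId);
        return previous != null && !previous.equals(originalId);
    }
}
```

The reporter would call registerAndCheck when a metric is added and log a warning whenever it returns true. Re-registering the same original metric is not reported as a collision.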
Compatibility, Deprecation, and Migration Plan
All proposed changes are forward and backward compatible. Default behaviour does not change for users. The suggested robustness improvements are opt-in.
Test Plan
Unit and integration tests will be added to cover the following behaviours:
gRPC payload is compressed when gzip is enabled.
A large number of MetricData points is split into multiple gRPC calls when batching is enabled.
Large metric attributes are truncated when they exceed the configured limits.
Rejected Alternatives
I considered additional robustness changes related to retries on failure. However, the current OTel client does not expose sufficient detail to distinguish retriable from non-retriable errors, so for the time being I am leaving retries out of scope.