Motivation

The current way of taking performance metrics out of airflow is by using StatsD (https://github.com/statsd/statsd) library. StatsD has been an easy and common choice by many applications for years but due to its inherent limitations and simplicity, there is a need to make the instrumentation of metrics for airflow more informative and useful to make an effective use of it. In addition to the metrics instrumented in StatsD, we have a new requirements to capture traces that would provide better visibility across multiple DAGs and Tasks that may constitute to a single DAG run of airflow. There is also a growing requirements to add additional context to existing logging of Airflow that is based on python logging, such as better description of request context and resource context of what and where the log message is related to.

Considerations

What change do you propose to make?

This proposal would like to have a new emerging telemetry standard OpenTelemetry (https://opentelemetry.io/) as part of airflow’s support for its metrics, traces, and logs. In addition to the existing StatsD instrumentation, this will be a configurable option to make the similar metrics available in OpenTelemetry (OTEL) protocol. Also, in addition to the metrics, there will be additional feature to generate Distributed Tracing (https://www.splunk.com/en_us/data-insider/what-is-distributed-tracing.html) and logs under the same OpenTelemetry standard.
What problem does it solve?

The current StatsD based metrics instrumentation has several short-comings

  • StatsD relies on UDP protocol to emit its metrics. UDP does not guarantee a reliable data transfer. OpenTelemetry on the other hand has extensible output exporters that can emit metrics in much wider format and protocol.
  • StatsD, relying on UDP, is also not secure in transmitting its metrics data. Various security compliances will not allow packets of data to be transmitted without a secure and reliable transport layer security in place.
  • StatsD does NOT support tagging of metric data. Because of that, it is common for StatsD to include context information such as dag_id and task_id in the metric name itself. Also, having additional context data will make the metric name longer and more complicated.
    • For example, currently StatsD metric format forces airflow to product metric that has dag_id and task_id like this: some.metric.<dag_id>.<task_id>, since its format does not have tagging capabilities.
    • With tagging, you can create metric that looks something like this : some.metric dag_id="<dag_id>" task_id="<task_id>" which reduces footprint on metric name, as you will not be seeing a high cardinality in metric namespace. This also enables metric to have additional tag keys and values that can be extended with richer context (e.g. custom metrics).
  • StatsD does not support Distributed Tracing. Distribute Tracing is useful when distributed services and workloads working together needs to be monitored, and identifying bottlenecks and problem areas in much comprehensive way.
  • StatsD does not support Logging, so emitting logging information has to be done separately. OpenTelemetry supports logging.

Why is it needed?

OpenTelemetry has been considered as the next standard specification for telemetry data (metrics, traces, and logs) and has been gaining popularity and support from many users and monitoring tool vendors.

StatsD is considered legacy and has not been gaining new features for a while. Its future of providing advanced features such as distributed tracing and logs are not clear.

Adopting OpenTelemetry will also greatly help Airflow to be more compatible with other monitoring tools (prometheus, datadog, tanzu observability, dynatrace, new relic, etc.), as its export features will be more compatible.

OpenTelemetry will help telemetry data to be transmitted more reliably and securely.

OpenTelemetry is also designed to be suitable for modern application architectures like MicroService Architecture.

Are there any downsides to this change?

OpenTelemetry instrumentation is going to affect additional network traffic that can be more than what StatsD produces (due to its nature of containing more information of metrics, traces, and logs), with additional functionalities. Also, OpenTelemetry will require OTel collecting agent and optionally a collection gateway which can be an additional process running for airflow.

Which users are affected by the change?

  • Airflow users who needs to monitor airflow using monitoring tools.
  • SRE team wanting to monitor airflow and centralize its telemetry data

How are users affected by the change? (e.g. DB upgrade required?)

Users will be able to monitor airflow with more context, making their monitoring more productive and efficient.

Users will be able to use this feature to ‘centralize’ their telemetry to achieve ‘single-pane of glass’ using APM, SaaS monitoring solutions, and other third party tools.

There will be changes in the database model on dagrun and task instance side. Additional column to contain serialized span data will be introduced.

Existing StatsD implementation will go through the DEPRECATION once the implementation of OpenTelemetry is completed.

Implementation Details (1st POC)

There are useful learnings that came from our Outreachy Internship project where we developed a Proof-Of-Concept of the integration. Those findings are documented here to provide information on how the integration will look like:

OTEL concepts

Open-Telemetry introduces several concepts:

  • Data gathered by OpenTelemetry
    • Instrumentation 
    • Open Telemetry Metrics
    • Open Telemetry Logging
  • Open Telemetry Collector
  • Open Telemetry Exporters

Data gathered can be grouped in one of those types

Instrumentation determines what kind of information is gathered from the running application. There are many integrations that are available for Python applications: Flask, SQLAlachemy, Database etc. 

Metrics can be used to publish custom metric from Airflow

Logs can be used to publish logs from Airflow

Collector can be run in one of two modes Agent and Gateway. It is used to gather all the open-telemetry data of various types (instrumentation, metrics, logs) as Collector Agent (usually sidecar next to Airflow process) and send it to Collector Gateway which exports the data

Exporters are used to export the gathered telemetry data to the analytics backends that can process and make available for the users. There are many backends (commercial and free that provide their exporters) - LightStep, Prometheus, DataDog etc. 

Integration scenarios

Typical integration of Open-Telemetry is presented below. This would be our recommendation for Open Telemetry for Airflow in Helm Chart/Kubernetes Operator modes where the collector Agents will run as sidecars to all components and standalone collector Gateway can be configured to export the Telemetry using analytics backend exporter chosen by the user.

Also a simpler setup should be possible for smaller installations, where all airflow components communicate with a single collector (separate component) and this collector directly exports data to the backend systems. 

OTEL instrumentation mode: auto vs. manual

Open-Telemetry instrumentation can be integrated with application in one of two modes: auto-instrumentation and manual instrumentation. Manual instrumentation requires manual initialization of the open-telemetry in the code of Airflow where auto-instrumentation relies on automated instrumentation of Airflow when particular library is installed and application is executed via “opentelemetry-instrument” command. 

The POC during the internship project was performed with “auto-instrumentation” and the experiences with it were a bit of mixed:

  • It allows the addition of instrumentation of a particular feature (flask, sqlalchemy, etc.) without modifying the application. This is nice and extensible.
  • However automated instrumentation does not always work out-of-the-box. Particularly, the forking model used by Airflow internally and by Gunicorn, prevent auto-instrumentation to work without tweaks to “post-fork” hooks. Generally speaking, Airflow does not work with auto-instrumentation without making it “Open-Telemetry aware”, in the POC we’ve extended Airflow and added flags supporting “open-telemetry” but the fact that we have to do it undermines quite a bit the benefits of auto-instrumentation. More information about open-telemetry workarounds for forking is available here.
  • OpenTelemetry auto-instrumentation required to run “opentelemetry-instrument” command line to start airflow (for example “opentelemetry-instrument airflow webserver” which might complicate deployments of airflow with/without instrumentation as we would have to provide custom scripts or modify existing deployments to allow this option.
  • The auto-instrumentation works on the basis of what packages are installed in the environment. This means that in order to disable/enable particular instrumentation you need to install/uninstall packages in the execution environment. That makes it difficult to selectively enable/disable certain instrumentation (it requires changing the deployment environment). This might be problematic if some instrumentation will prove to be to “heavy” and needs to be disabled (some of the instrumentations introduced might cause an overhead and more resource usage).
  • The manual instrumentation is quite a bit more involved as it requires manual enabling/disabling of particular integration, and it does not allow new instrumentations to be added without modifying Airflow. However it allows to configure and enable open-telemetry at the level of Airflow configuration (via the same mechanisms Airflow is configured today) and allows to disable/enable each instrumentation separately.

Our recommendation is to follow the manual instrumentation route and pick the instrumentations supported by Airflow. It will allow users to configure open-telemetry using Airflow configuration and allow the users to fine-tune which of the instrumentations to enable and control overhead caused by the instrumentation.

Metrics integration

Metrics integration is rather straightforward. It is fully controlled from the application (no auto-instrumentation equivalent) and the API to produce the metrics is rather standard and simple to implement. It should be relatively easy to map the current (and new) metrics of Airflow to the Open-Telemetry metrics. The consideration here is the scope of the spans in which the events should be emitted should correspond to Airflow `entities` (Tasks/DAGs//DagRuns etc.). Detailed list of spans and metrics produced should be produced at a later stage.

Log integration

Log integration - the log integration might become a challenge as Airflow has a complex logging infrastructure, including taking over stdout and stderr and logging redirection which is likely to result with Open-Telemetry Logging being confused. 

It’s worth noting that Logging integration are subject to change as it is in the BETA phase. The APIs are not complex, but the integration details might change in the future so the recommendation is to treat it as experimental as well and introduce them gradually. Logging were not completed in the scope of the POC for Open-Telemetry run as Outreachy internship. 

While Metrics seems to be ready to integrate within the scope of this AIP and Open-Telemetry integrations (no surprises foreseen), we believe Log integration requires more work and POC that should be accompanied with an approach to rewrite and simplify logging integration as Airflow logging is particularly vulnerable to disruption and it is integrated with Airflow UI in a subtle way. We propose to exclude Logging from the scope of the AIP (and implement it as a follow-up AIP).

According to the OpenTelemetry documentation related to logs (https://opentelemetry.io/docs/reference/specification/logs/overview/), it may be worth implementing a logging strategy that would contain the following attributes when emitting logs 1) request context such as trace ID or span ID - or even DAG id or TASK id to be included in the log format so that aside from the text message, the log can later be easily correlated to any runtime activities during that timestamp, and 2) resource context such as source of where the lot originated such as host name, IP address, POD ID or container ID, etc. that can easily relate to where it was originated. When implementing the logs, it is assumed that application specific logging would have these requirements be considered.

Performance overhead

The POC had not measured the impact of instrumentation, the open-telemetry focus is to implement as performant solution as possible including maintaining and publishing performance benchmarks: https://open-telemetry.github.io/opentelemetry-python/benchmarks/index.html

The impact of particular instrumentations should be measured as part of the project and some guidelines for the users produced, however the selective enablement/disablement of the instrumentations should help the users to fine-tune the observability of their systems vs performance overhead.

Other Internal POC to test out Opentelemetry (2nd POC)

There was also an additional internal POC that was done which tested generating telemetry data as well as distributed traces using OTEL. Details can be found here in this PDF doc:

Code Changes

Even though this is not a final version, here is the branch for the github repo used for this POC that outlines the extent of the code changes required: https://github.com/howardyoo/airflow/tree/opentelemetry-poc-1

Code Changes Summary

https://github.com/howardyoo/airflow/commit/cc83c2b377ac22f0e7ef82e7f59784df972037fd lists out the DIFF on the opentelemetry-poc-1  branch against the existing branch, but here is the overview of the expected change to implement OpenTelemetry support and instrumentation to Apache Airflow.

airflow/configuration.py

Has the additional configurations added:

  • otel_on
  • otel_host
  • otel_port
  • otel_prefix
  • otel_allow_list
  • otel_debug
airflow/dag_processing/manager.py
  • Has code to initialize OpenTelemetry tracer
  • Has OTEL decorators added to produce traces for number of key functions
  • Added extra attributes to existing Stats.* functions to support 'tags' to the metrics being created
airflow/dag_processing/processor.py
  • Added extra attributes to existing Stats.* functions to support 'tags' to the metrics being created
airflow/executors/base_executor.py
  • Has code to initialize OpenTelemetry tracer
  • Has OTEL decorators added to produce traces for number of key functions
  • Added code to parse json of Span stored inside Task instance, and setting up the attributes, and end it.
  • Added code to parse json of Span stored inside DagRun instance, and setting up the attributes, and end it.
airflow/executors/local_executor.py
  • Added code to parse json of Span stored inside DagRun instance, and setting up the attributes, and end it.
airflow/executors/sequential_executor.py
  • Added code to parse json of Span stored inside DagRun instance, and setting up the attributes, and end it.
airflow/jobs/scheduler_job.py
  • Has code to initialize OpenTelemetry tracer
  • Has OTEL decorators added to produce traces for number of key functions
  • Has OTEL instrumentation to create spans on _process_executor_events 
  • Added extra attributes to existing Stats.* functions to support 'tags' to the metrics being created
airflow/models/dag.py
  • Added span_json as a new column in the model
airflow/models/dagrun.py
  • Added span_json as a new column in the model
airflow/models/taskinstance.py
  • Added span_json as a new column in the model
  • Added extra attributes to existing Stats.* functions to support 'tags' to the metrics being created
airflow/models/stats.py
  • Added support for OTEL to be added as another option for statsD related codes ( get_otel_logger )
  • Added classes such as OtelTraceLogger, SafeOtelLogger
  • Added TraceLogger type

Other considerations?

Using OpenTelemetry will introduce new python package dependencies. For the basic auto-instrumentation to work, the following packages will be required. However, due to the evolving nature of Open Telemetry, managing dependencies for opentelemetry within the realm of airflow should be more thoroughly considered and tested.

  • opentelemetry-api
  • opentelemetry-sdk
  • opentelemetry-distro

What defines this AIP as "done"?

  • Implementation of OpenTelemetry (python) inside Airflow to instrument metrics, traces, and logs using its SDK.
  • Refactor existing metrics - introduce new metrics and tags as needed, and also decide whether to reuse existing name convention or not.
  • Implement Distributed tracing on DAG runs - this can be optionally turned on/off. The details and dependencies on DAG runs can now be emitted as distributed traces and can be monitored using tools like Jaeger, zipkin, etc.
  • Make OpenTelemetry and StatsD optional and interchangeable. Configurations should also be able to control the amount of cardinality via controlling tags being emitted. StatsD implementation will undergo a gradual DEPRECATION at the completion of OpenTelemetry implementation.
  • Enabling and disabling opentelemetry should be done on airflow.cfg file, under the section [metrics]. A new entry that starts with the prefix opentelemetry_ will be introduced, that will enable/disable it, to settings related to configuring its cardinality and destination (otel agent) point, etc.
  • By cardinality, it should be designed in such a way that can enable / disable high cardinality tags from being emitted. This needs to be instrumented within the airflow code for optimal performance.
  • Documentation of the new feature, along with how-to’s and instructions of setting up OTel collection agents.
  • Provide Grafana dashboards or other popular monitoring tool dashboards in monitoring Airflow.

Scope of the initial implementation

PHASE 1

Metrics

Although it may be too early to completely predict how exactly the implementation should be done in this proposal phase, there is a high-level scope which the proposal would like to mention that can be subjected to the initial design and implementation. Since the development of Logging in OpenTelemetry is in BETA stage, the initial scope of work should be focused on designing and implementing Metrics of current Airflow. The current Airflow's metrics (based on StatsD) will be migrated, but with certain changes such as support for the additional tags, and renaming certain metrics as necessary. All the detailed metrics names that should be supported by the new OpenTelemetry implementation should also be documented with detail as part of the work in scope.

Traces

OpenTelemetry Traces will be the next one in scope, having its code base stable and mature. The traces should also have a section in documentation to describe what components the traces will be generated from, and what kind of tags it will contain to make it useful.

PHASE 2

Logs

In the initial implementation of OpenTelemetry, there will not be any significant work done in Logging area. The current logging of airflow will have no change. As Logging support of OpenTelemetry mature over time, however, naturally this will be included as an additional scope as one of the key goal of OpenTelemetry implementation would be to unify all the data needed for observability. Log messages going through the OpenTelemetry will contain request context (such as dag_id, task_id, run_id) as well as resource text (host, pod, instance) of where the event is happening from.

Custom

Lastly, there will be documentation and examples for creating user's 'custom metrics and traces' using OpenTelemetry's SDK in case users are willing to add it inside their DAGs. Comprehensive documentation should be provided so that users can develop, test, deploy, and maintain their own instrumentation.

Dashboards and Alerts

Since it may be counterproductive for users to create and supply their own dashboards or visualizations of various products (grafana, DataDog, Tanzu Observability, etc.), there will be a reference dashboards and alerts that can be used to help users to set-it up if the dashboards are not part of the integration options that monitoring tools have. The following reference dashboards will be provided in a separate 'contribution repo' that may be separately maintained. Airflow certainly do not have the full control over how the product's visualizations are implemented, and therefore, would have to primarily rely on the open community's participation. All of the details should be outlined in the Airflow's contribute readme.

  • Grafana
  • DataDog
  • Tanzu Observability