Status

This AIP is part of AIP-63, which aims to add DAG Versioning to Airflow.

This AIP supersedes AIP-5 and AIP-20.

Motivation

Today, Airflow always executes tasks using the latest DAG code. While in some cases this may be what users want, in others it may be desirable to complete a whole DAG run on a single version of the DAG's code, or, when clearing tasks in older runs, to let those tasks run on the DAG code used at the time of the original run.

We also continuously parse DAG files, as we have no way of knowing whether the DAGs from a given file have changed. This is clearly inefficient and means you either wait a potentially long time for changes to be reflected or needlessly burn resources. More control over parsing behavior is desirable.

Goals:

  • Allow tasks to run using a specific version of DAG code
  • Allow more control of Airflow parsing behavior
  • DAG code can come from many sources

In order to support this, we will introduce the following concepts:

  • DAG Bundle - a collection of DAGs and other files. Think of today's DAG folder.
  • DAG Bundle Manifest - metadata about the DAG bundle.

DAG Bundles

Airflow will support one or more DAG Bundles (a collection of DAGs and other files, like today's DAG folder). They will support versioning, which allows Airflow to control the version of the DAG code used for a given task try. This means that a DAG run could continue running on the same DAG code for the entire run, even if the DAG is changed mid-way through, as the worker can retrieve a specific DAG bundle when running a task.

This will require Airflow to support a different way of retrieving DAGs - it can no longer simply expect to find DAGs on local disk. This will be done in a pluggable way with DAG bundle backends and optional versioning support so the ecosystem can evolve as time goes on.

By default, DAG runs will execute on the same DAG bundle version for the whole run, assuming the DAG bundle backend supports versioning. However, DAGs can opt into the existing behavior of running on the latest DAG code instead.

Any file can be placed in a DAG Bundle, and it will be versioned (if the DAG bundle backend supports versioning, of course). Imagine a YAML config file, for example.

Pluggable DAG bundle backends

DAG bundle backends will allow Airflow to retrieve DAG bundles at a given version at any point in time. Since it’s common that DAGs are not simply contained in a single file, we require a way to version a whole set of files simultaneously. How this is accomplished will depend on the backend. Airflow will only operate on complete DAG bundles - it will never attempt to identify or fetch a subset of a DAG bundle.
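
For illustration only, a backend might expose an interface roughly like the sketch below. The class and method names are hypothetical - the real interface will be defined during implementation.

# Hypothetical sketch of a DAG bundle backend interface - names are
# illustrative only, not the final design.
from abc import ABC, abstractmethod
from typing import Optional


class BaseDagBundleBackend(ABC):
    """A pluggable source of DAG bundles, with optional versioning support."""

    supports_versioning: bool = False

    @abstractmethod
    def get_latest_version(self) -> Optional[str]:
        """Return an identifier for the newest bundle version, or None if unversioned."""

    @abstractmethod
    def fetch(self, version: Optional[str], dest: str) -> None:
        """Materialize the whole bundle (never a subset) at `version` into `dest`."""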

Backends are encouraged to implement local caching in order to reduce the impact on, and reliance on, external systems where possible. The exact mechanism, and whether it falls to the backend or to Airflow itself, will be determined during implementation.

This does introduce the need for potentially large amounts of temporary disk space, but where Airflow places these versioned bundles will be configurable and controlled by the deployment manager as appropriate.

Possible backends

We may not build all of the following backends for Airflow 3.0, and there may be multiple variants for each technology, but this gives you a good idea of what we are thinking and what is possible.

git

Git already versions everything in the repository, so we can simply use a commit hash or tag.

The backend will support both moving branches (e.g. tracking main) and a specific tag/commit that is updated in the DAG bundle backend configuration. This allows for flexibility of deployment strategies.
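
As a rough sketch (assuming the hypothetical interface above and shelling out to the git CLI; class names and configuration fields are illustrative only), a git backend tracking a branch could resolve that branch to a commit hash as the bundle version, and check specific commits out side by side with `git worktree`:

# Hypothetical sketch of a git-based DAG bundle backend - illustrative only.
import subprocess


class GitDagBundleBackend(BaseDagBundleBackend):
    supports_versioning = True

    def __init__(self, clone_path: str, tracking_branch: str = "main"):
        self.clone_path = clone_path          # local clone of the DAG repo
        self.tracking_branch = tracking_branch

    def get_latest_version(self) -> str:
        # The bundle version is simply the commit hash of the tracked branch.
        subprocess.run(["git", "-C", self.clone_path, "fetch", "origin"], check=True)
        result = subprocess.run(
            ["git", "-C", self.clone_path, "rev-parse", f"origin/{self.tracking_branch}"],
            check=True, capture_output=True, text=True,
        )
        return result.stdout.strip()

    def fetch(self, version: str, dest: str) -> None:
        # Check the requested commit out into its own directory so multiple
        # versions can coexist on one machine (a pinned tag/commit works similarly).
        subprocess.run(
            ["git", "-C", self.clone_path, "worktree", "add", "--detach", dest, version],
            check=True,
        )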

Blob storage

Versioning support for blob storage systems varies: some support versioning at the object level, while others support no versioning at all. It is still possible to write a backend that supports versioning, but Airflow will only keep track of a single bundle version, so bridging that gap is up to the backend. For example, you could write a backend that grabs a zipped bundle, giving both the backend and Airflow a single object to version. There are alternative approaches as well (e.g. keeping a "manifest" - not to be confused with the manifest from this AIP - that tracks versions per object), but Airflow itself is shielded from this complexity.
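
As one hedged illustration of the zipped-bundle approach, assuming an S3 bucket with object versioning enabled (class and field names are hypothetical):

# Hypothetical sketch of a blob storage backend that versions a single
# zipped bundle object in S3 - illustrative only.
import io
import zipfile

import boto3


class S3ZipDagBundleBackend(BaseDagBundleBackend):
    supports_versioning = True

    def __init__(self, bucket: str, key: str):
        self.s3 = boto3.client("s3")
        self.bucket = bucket
        self.key = key  # e.g. a single "dags.zip" object holding the whole bundle

    def get_latest_version(self) -> str:
        # With S3 object versioning enabled, every upload of the zip gets a VersionId.
        head = self.s3.head_object(Bucket=self.bucket, Key=self.key)
        return head["VersionId"]

    def fetch(self, version: str, dest: str) -> None:
        obj = self.s3.get_object(Bucket=self.bucket, Key=self.key, VersionId=version)
        with zipfile.ZipFile(io.BytesIO(obj["Body"].read())) as archive:
            archive.extractall(dest)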

Local Disk

Not every environment will need DAGs from an external system (imagine a local development environment or when you bake DAGs into an image). To support such use cases, the existing DAG directory will be exposed via a DAG bundle backend, but one that does not support versioning. Airflow will operate under the “latest only” paradigm to support this.

The local disk backend is something we will build for Airflow 3.0 in order to have a robust local development experience.
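
A minimal sketch of what that unversioned backend could look like under the hypothetical interface above (illustrative only):

# Hypothetical sketch of the unversioned local disk backend - illustrative only.
class LocalDagBundleBackend(BaseDagBundleBackend):
    supports_versioning = False

    def __init__(self, path: str):
        self.path = path  # e.g. today's DAG folder

    def get_latest_version(self) -> None:
        # No versioning - Airflow operates on "latest only".
        return None

    def fetch(self, version, dest: str) -> None:
        # Nothing to fetch: the bundle already lives on local disk at self.path.
        pass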

Configuring DAG bundle backends

DAG bundle backend configuration will be stored in the Airflow DB - this allows for configuring them via the Airflow UI, API, or CLI. We will also support importing from JSON, allowing for easy provisioning by deployment managers.

By default, the local disk backend pointing at today's DAG directory will be the only configured backend.

There will be no restriction on the number of DAG bundle backends that can be configured.

Parsing

Instead of constantly trying to parse DAG files, we will add support for a manifest file in DAG Bundles. This allows Airflow to support more flexible parsing rules, letting users tell us things like:

  • Whether we need to scan for files containing DAGs in any parts of the DAG folder structure.
  • How often we should scan for those files.
  • Or, a list of files we should parse.
  • How often files should be parsed (basically `min_file_process_interval`, but at a folder/file level), or whether they should only be parsed when the file's checksum/mtime changes.

For example, a manifest could look something like this:

# /manifest.yaml

directories_to_scan:
  - path: /glacier/
    scan_interval: 1 day
    min_file_parse_interval: 1 day
  - path: /fast/
    scan_interval: 10 minutes
    min_file_parse_interval: 1 minute
  - path: /static/
    scan_interval: never
    min_file_parse_interval: on change

dag_files:
  - path: /never_scanned/mydag.py
    min_file_parse_interval: on change

bundle_fetch_interval: 5 minutes

We will nail down the exact functionality that will be supported and the file structure during implementation.

A few notes:

  • Manifests will be optional.
  • Parsing will always happen on the latest version of the DAG bundles, as is the case today.
  • Today’s `dag_dir_list_interval` will be replaced with a new configuration option to control how often we look for new versions of a DAG bundle. We will also have UI/API/CLI support for “refreshing” on demand.
  • `.airflowignore` will still be supported.

Scheduler

The scheduler, when creating new DAG runs, will always use the latest DAG version.

While scheduling tasks, the scheduler will specify the DAG bundle version when sending the task to the executor, so that the task can run on that specific DAG bundle version.
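
Purely as an illustration (the real shape of this hand-off is part of AIP-72's task execution interface and is not designed yet), the information sent to the executor might include something like:

# Illustrative only - field names are hypothetical, not a final design.
workload = {
    "dag_id": "my_dag",
    "task_id": "my_task",
    "run_id": "scheduled__2024-01-01T00:00:00+00:00",
    "try_number": 1,
    "bundle_name": "main_dags",   # which configured DAG bundle backend to use
    "bundle_version": "3f9c2ab",  # pinned for the run if the backend supports versioning
}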

Worker

The executor will tell the worker what version of a DAG bundle a task needs to run against. The worker can then use the DAG bundle backend to get the DAG bundle and execute the task.

If the worker is capable of running more than one task (e.g. a LocalExecutor worker or CeleryExecutor worker), it can store the bundle on temp disk and use it for multiple tasks. These bundles can even be kept around when no task is currently using that version of the DAG bundle.
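
A minimal sketch of such per-version caching, assuming the hypothetical backend interface sketched earlier (locking and eviction omitted):

# Hypothetical sketch of per-version bundle caching on a worker - illustrative only.
import os


class WorkerBundleCache:
    def __init__(self, backend, cache_dir: str):
        self.backend = backend      # a DAG bundle backend instance
        self.cache_dir = cache_dir  # configurable temp location for bundles

    def get_bundle_path(self, version) -> str:
        # One directory per bundle version; tasks needing the same version share it,
        # and it can be kept around after those tasks finish.
        dest = os.path.join(self.cache_dir, version or "latest")
        if not os.path.isdir(dest):
            self.backend.fetch(version, dest)
        return dest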

Packaging (Where do DAG Bundle Backends live?)

The packaging decisions (e.g. where will the out-of-the-box DAG bundle backends live) will be deferred until we have a better understanding of how AIP-72 will be packaged and delivered to users. We will have to work closely with the AIP-72 folks to nail this aspect of our AIPs down, as they have some overlap.

Phases

Phase 1: DAG Bundle Backend framework and initial DAG Bundle Backends
Phase 2: Parser / Manifest changes and integrating versioning into AIP-72’s task execution interface

You may be surprised to see that Connections, Variables, and Plugins are not mentioned. They are out of scope for now.

Migration Considerations

There are no direct breaking changes, unless deployment managers or authorized users decide to remove the local DAG bundle backend that will be configured by default.


16 Comments

  1. What if we bring in the idea of Prefect Deployments? This has a bit of overhead (or a huge one, I never counted TBH) when it starts the flow, because it is required to copy the entire source, which could also be remote, into a temporary directory.

    1. I think that's exactly what "bundle" is - there are different names. I named them "snapshots" - but the concept is the same. Because they should contain (obtained this way or that way) sources + variables + other environment - connections - that we can restore in a somewhat reproducible way. And I think "Pluggable DAG bundle" will mean that we will have to find a way to efficiently switch between snapshots/bundles/deployments (whatever you call them) for the execution. What is interesting there is that there are also various "levels" of those snapshots - for example, in some cases you might want to have 'almost' the same bundle (for example, connections will need to be updated as the password changes since the last run, and your connections are generally the same +/- credentials).

      1. Fwiw I prefer "snapshots" over bundle – way more intuitive and conveys the connection to DAG versioning instantly. I was confused about what bundle meant in this context until I read the AIP. 

    2. I've left this intentionally vague. It's much easier to scope it at today's DAG dir only, and that is where this draft draws the line now. But Jarek's idea of bringing in "everything" is definitely on the table too. It raises questions like "should deployment managers be responsible for managing provider versions, or DAG authors?". Definitely a lot to consider when we get closer to this being a reality.

  2. I very much like this as the major "big thing" of the DAG versioning "bundle" of AIPs!

    I like the abstract way of describing a bundle and especially that you see the "bundle" of files together, which also considers cross-file dependencies for utilities and generator stuff.

    I assume the most common backend used for versioning today is GIT. I assume it would be really complex and error-prone to try to re-implement SCM-like logic. Also you need the Python source files available in a file-system-like manner in order to cross-`import` referenced files in the Python module loader.

    Being biased towards GIT, I don't know if you are aware of `git worktree` (https://git-scm.com/docs/git-worktree), which allows a temporary checkout (or many) from one GIT repo. If I think of this and assume I have a LocalExecutor or CeleryWorker, then there might be 16 task slots operating in parallel, all on different task versions and potentially different version hashes. So DAG code must be available in the file system in potentially multiple versions in parallel.
    In such cases you could think of: if a GIT repo is cloned and available, via `git worktree` the individual versioned file tree could be checked out into a TMP folder prior to task execution. Multiple tasks with the same version might even share the file tree to save IO. If the version hash corresponds to the GIT commit hash, then 90% of the complexity could be pushed down to logic in GIT. And you would have the benefit of 100% traceability of "what code was executed", as well as the ability, via the GIT hash, to reproduce the code view on demand in the webserver (if code versions are not kept as blob copies in the DB... though the DB would be needed to avoid GIT having to be available as a clone on the webserver).

    1. Yep, basically exactly what I was thinking for git.

      A key part though is these will be pluggable, so all the git-specific logic and, say, versioned S3 bucket-specific logic can be isolated, and all "core" Airflow needs to know is to ask for version x. The specifics of that interface still need to be ironed out, but I think it's important to have that flexibility long term.

  3. Like the idea of addressing this huge missing piece in Airflow!


    Have you considered taking this as an opportunity to abstract the way DAGs are defined together with the definition of a bundle? For example, this can be achieved by abstracting away how the DagBag is populated - for a given bundle it would defer the logic of presenting a DAG of a certain version to a pluggable bundle provider. The trivial and default implementation of a bundle provider - read the /dags directory and parse it. For a Git-based one - check out a version with worktree and parse it. For a custom backend that doesn't use Python but, for example, YAML files to define workflows - create DAGs from these directly (instead of having a special fake DAG file that reads the YAMLs to create DAGs).


    Time and time again I see unsuccessful attempts to contribute visual DAG editors (like the one Pinterest was proposing some time ago) and users implementing their own hacky solutions for not-Python-file-managed DAGs within Airflow's constraints. The right implementation of DAG bundles can reduce the number of workarounds in these solutions and will be a huge step towards the feasibility of visual DAG editing being part of the core Airflow experience one day.


    Looking forward to having this one polished and shipped!

    1. No, I didn't consider changing that. My initial reaction is that that wouldn't need to be coupled with this proposal either. I see that being more of a "pluggable parser" than a "pluggable source of files" that the parser looks at.

      That said, I do understand the desire for more flexibility in how Airflow discovers and parses DAGs. I'll definitely keep it in mind and make sure we don't make that type of thing harder to achieve.

    2. Igor Kholopov  Yup, the Pluggable dag parser is a good step forward. I was planning to get it for 3.1, but if you want to own it, please go for it (smile) 

      I added it as a line item for 3.1: Airflow 3 Workstreams#Othercandidates.1 – feel free to replace my name with yours on it

  4. Referring to the Airflow 3 discussion, I'd say we should be much bolder than FSSPEC or a common abstraction over file system versioning.

    I think we should consider a MUCH more decoupled architecture for Airflow 3 that might simplify a lot and be even more opinionated in how we actually implement versioning internally. If we decide to break with Airflow 2 DAG compatibility, I think we should consider a world where the Worker and Triggerer do not need DAG folder access at all (warning). If we consider "Task execution Isolation" by splitting off DB access and communication, we should also seriously consider sending not only "secrets", but also "DAG code" to the tasks. If we sign and verify the payload that we send to the Worker and Triggerer, and if we are able to extract only the files that are needed to parse and execute a given DAG (either automatically or via annotations), we should be able to store versions of DAGs locally on the Scheduler/DAG file processor side and send the code over to the task to execute. In this case we do not need to focus on FSSpec, or the underlying file system capability for the worker or triggerer, and all the complexity of using a distributed file system as part of the deployment. We could mandate (and be opinionated) on how the DAG file processor/scheduler stores the history (here I would side with Jens Scheffler that GIT is a good solution). And we could simply treat it as an internal implementation detail of how the scheduler/DAG file processor stores the DAG folder history (we could use GIT for storing the history, independently of how DAG files get delivered to the DAG folder available to the DAG file processor).

    That would be my bold proposal if we limit DAG versioning to Airflow 3 and break some compatibilities.


    1. Bold indeed. I like the general idea, but my initial reaction is that I'm a bit hesitant of Airflow trying to build the "payload" itself. I think having users (maybe with the help of Airflow somehow) build the "payload" themselves makes for a much clearer line. Say in docker "create a container with this image" vs "create a container from this single Dockerfile in a sea of Dockerfiles" (not a great analogy, but yeah). I think one benefit of that is, coupled with "clear interface for Task Execution", it opens the door for different DAGs to use different dependencies too.

      Anyways, lots to think through for this now that breaking changes are possible.

      1. > Anyways, lots to think through for this now that breaking changes are possible.

        Precisely. I think (and I had the same "revelation" during the AIP-67 multi-team discussions and voting) that IF we allow for breaking changes, then we can make very different decisions and make a number of simplifications or abstractions that would not be possible otherwise but would be way more future-proof. That's why I really started the discussion on Airflow 3, because it might be a HUGE difference if we decide this one to be part of Airflow 2 or Airflow 3 or both.

  5. I think generally the "per dag execution environment" should be clarified as an optional feature that might help with solving particular dependency conflicts, but we should not treat it as a "default" approach that will be used by many users.

    IMHO it introduces a number of problems with managing multiple environments and making sure that they are actually portable, it's not defined who should manage them and how, and I think they actually do not solve Airflow's "conflicts" problems but rather multiply them (because basically people will have to solve conflicts on their own). Also it adds quite a bit of complexity when it comes to managing changes in such environments. For example, when you upgrade to a new (security) version of a dependency - should that mean that all past DAG runs that used the past version of the environment should use the new one? Or when you add a backwards-incompatible dependency change to your DAG (and change the environment) - how does that impact the previous version of the environment (and how does it play with the above "security update")? We seem to delegate those problems to users, and that's fine, as long as it's going to be a "Power user" feature and we document and explain some of the scenarios involved there. And we have to be very careful as this is a bit of uncharted territory and I am quite sure pip install --target is just the tip of the iceberg (smile)

    I think though that it might be a good idea to add such an execution environment as an "optional" way of handling situations where particular tasks or DAGs should be run with a different set of dependencies (and for that, in very specific cases, it might be a great escape hatch), but I have huge doubts whether this will ever become a "standard" way of running Airflow tasks. It will come with a big implementation cost, but I have a feeling it will be used very rarely. Still, I am very optimistic about it as a feature, and I think it is a good idea (for Airflow 3.1+ as marked in the document). But we should not make it an "expectation" that a bundle **should** contain an execution environment. This should be "exceptional" - IMHO a "common" execution environment will be used by 9X% of tasks, because the overhead of users managing multiple execution environments will be huge.

    But also I am curious to see how it might work out. While sceptical in general about making it a "first class citizen" in Airflow and expecting people to do that "in general", I think it is fine if we treat it as "yet another way of dealing with conflicting dependencies" on top of what we already have: https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#handling-conflicting-complex-python-dependencies. And this should also be added on top of the "multi-team" approach where such environments will be defined at the "team level". I think we should give our users all those options.

    In short: I think we should be rather clear that such "per-DAG" execution environments are an additional "Power user" feature and document very well some of the limitations and constraints they might bring: dependency on the exact same libraries and OS, the time/traffic it will take to install such environments, security implications, and difficulties in running, testing, and especially maintaining those environments.

    1. 100%. I don't think I did a good job of articulating it, but I definitely view this as an optional power user feature only!

    2. I'm going to remove execution environments from this AIP. It's big enough without it. A little more detail in this comment and thread.

      1. Okay, execution environments are gone now!