Status
Abstract
This proposal aims to enhance Apache Airflow's DAG loading mechanism to support reading DAGs from ephemeral storage, superseding AIP-5 Remote DAG Fetcher. The objective is to generalize the DAG loader and processor to load DAGs from pathlib.Path rather than assuming direct access to the OS filesystem. This includes implementing a custom module loader that supports loading from ObjectStoragePath locations and other Path locations with caching capabilities provided by fsspec. Additionally, while this AIP does not implement DAG versioning directly, the proposed loader can be easily extended to support DAG versioning, laying the foundational layer for AIP-63.
Motivation
Currently, Airflow's DAG loading mechanism is tightly coupled to the local filesystem, which limits its flexibility and scalability. As organizations increasingly adopt cloud-native architectures and ephemeral storage, there is a growing need to read DAGs from other storage backends, including object storage services such as S3, GCS, and Azure Blob Storage. Traditionally, Airflow has relied on third-party syncing mechanisms such as git-sync, which make deployments more complex and have drawbacks in keeping DAG files in sync. This proposal addresses the need by abstracting the DAG loading process so that it can accommodate different storage backends.
Proposal
Goals
- Generalize the DAG loader and processor to use pathlib.Path for file operations.
- Implement a custom module loader to support ObjectStoragePath and other Path-like abstractions.
- Integrate fsspec for caching mechanisms within the loader to optimize performance and reduce redundant data retrievals from remote storage.
- Lay the foundational layer for supporting DAG versioning as described in AIP-63.
Implementation Details
1. Generalize DAG Loader and Processor
Refactor DAG Loader:
- Modify the DAG loader to use pathlib.Path for all file operations.
- Ensure backward compatibility with existing DAG loading processes.
Update DAG Processor:
- Adapt the DAG processor to handle pathlib.Path objects.
- Abstract any direct filesystem access to use pathlib.Path methods.
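As a rough illustration of what "use pathlib.Path for all file operations" could look like, the sketch below loads a module by calling only methods on the path object, so any Path-like object with a compatible read_text() would work in its place. The function name load_module_from_path and the module name demo_dag_module are hypothetical and not part of Airflow; a local temporary directory stands in for remote storage.

```python
import sys
import tempfile
import types
from pathlib import Path


def load_module_from_path(path: Path, module_name: str) -> types.ModuleType:
    """Hypothetical helper: load a module using only Path methods.

    No open() or os.path calls are made, so the same code could accept any
    Path-like object that implements read_text().
    """
    source = path.read_text(encoding="utf-8")
    code = compile(source, str(path), "exec")
    module = types.ModuleType(module_name)
    module.__file__ = str(path)
    sys.modules[module_name] = module
    try:
        exec(code, module.__dict__)
    except BaseException:
        # Do not leave a half-initialized module registered.
        sys.modules.pop(module_name, None)
        raise
    return module


# Demo: a temporary directory stands in for a remote location.
with tempfile.TemporaryDirectory() as tmp:
    dag_file = Path(tmp) / "example_dag.py"
    dag_file.write_text("DAG_ID = 'example'\n", encoding="utf-8")
    loaded = load_module_from_path(dag_file, "demo_dag_module")
    print(loaded.DAG_ID)
```

Because the helper never touches the OS filesystem directly, swapping the local path for an object-storage path would not require changes to the loading logic itself, under the assumption that the path class mirrors the pathlib interface.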
2. ObjectStorageLoader & ObjectStorageFinder
This component supports loading Python modules from cloud storage. It consists of a Finder, which locates arbitrary modules, and a Loader, which loads them.
- Utilize the existing ObjectStoragePath class, which extends pathlib.Path and uses fsspec as the underlying implementation.
- Ensure ObjectStoragePath integrates seamlessly with Airflow's DAG discovery process by adding the Finder.
- Ensure caching for compiled py files and for downloaded DAGs and modules.
- Allow users to define cache size, eviction policies, and storage locations via Airflow configuration.
- Optimize the custom module loader for minimal latency and efficient data retrieval.
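The Finder and Loader pair described above could be sketched with the standard importlib machinery as follows. This is only an illustration under stated assumptions: the class names PathSourceLoader and RemoteRootFinder and the module name aip_demo_shared_helpers are hypothetical, only top-level modules are handled, no caching is shown, and a plain pathlib.Path in a temporary directory stands in for an ObjectStoragePath root.

```python
import importlib
import importlib.abc
import importlib.util
import sys
import tempfile
from pathlib import Path


class PathSourceLoader(importlib.abc.SourceLoader):
    """Hypothetical loader that reads source through a Path-like object."""

    def __init__(self, path):
        self.path = path

    def get_filename(self, fullname):
        return str(self.path)

    def get_data(self, path):
        # Only the source file itself is served; anything else (for example
        # a bytecode cache path) is reported as missing.
        if path != str(self.path):
            raise OSError(f"unsupported path: {path}")
        return self.path.read_bytes()


class RemoteRootFinder(importlib.abc.MetaPathFinder):
    """Hypothetical finder that looks for top-level modules under a root."""

    def __init__(self, root):
        self.root = root

    def find_spec(self, fullname, path=None, target=None):
        if "." in fullname:
            return None  # packages and submodules are out of scope here
        candidate = self.root / f"{fullname}.py"
        if not candidate.is_file():
            return None
        return importlib.util.spec_from_loader(fullname, PathSourceLoader(candidate))


# Demo: a temporary directory stands in for an object storage location.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "aip_demo_shared_helpers.py").write_text(
        "def greet():\n    return 'hello from remote root'\n", encoding="utf-8"
    )
    finder = RemoteRootFinder(root)
    sys.meta_path.append(finder)
    try:
        helpers = importlib.import_module("aip_demo_shared_helpers")
        greeting = helpers.greet()
    finally:
        sys.meta_path.remove(finder)
        sys.modules.pop("aip_demo_shared_helpers", None)

print(greeting)
```

The loader deliberately leaves set_data() at its default, which does nothing, so no bytecode is written back next to the source. That matches the read-only treatment of storage stated under Security Considerations; a cache for compiled files would need a separate local location.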
3. Support for DAG Versioning (Foundational Layer)
- DAG Versioning Integration:
  - Design the loader to support versioning mechanisms provided by fsspec or cloud storage backends.
  - Ensure the architecture allows easy extension to include versioning logic as specified in AIP-63.
  - Provide configuration options to enable versioning features, either leveraging built-in capabilities of fsspec and cloud storage or through an additional versioning layer.
Security Considerations
Confidentiality
We rely on the mechanisms provided by Airflow's Connections and ObjectStoragePath to ensure secure access to object storage. Storage is treated as read-only: no cache files will be written to object storage.
Availability
Because DAGs and modules must be downloaded before Airflow can parse them, DAG processing performance can be affected. Configurable caching is implemented to reduce the impact.
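To make the caching idea concrete, here is a minimal sketch of a local download cache that fetches a file again only when its size or modification time has changed. It uses only the standard library and is not fsspec's caching implementation, which the proposal intends to rely on. The class name LocalFileCache and its downloads counter are hypothetical, eviction and size limits are omitted, and a local directory stands in for remote storage.

```python
import hashlib
import tempfile
from pathlib import Path


class LocalFileCache:
    """Hypothetical cache: copy a remote file locally, keyed by its stat."""

    def __init__(self, cache_dir: Path):
        self.cache_dir = cache_dir
        self.downloads = 0  # counts actual transfers, for illustration

    def fetch(self, remote: Path) -> Path:
        stat = remote.stat()
        # Path, size, and mtime together identify one revision of the file.
        fingerprint = f"{remote}:{stat.st_size}:{stat.st_mtime_ns}"
        key = hashlib.sha256(fingerprint.encode("utf-8")).hexdigest()
        local = self.cache_dir / f"{key}.py"
        if not local.exists():
            local.write_bytes(remote.read_bytes())
            self.downloads += 1
        return local


# Demo: the "remote" directory stands in for object storage.
with tempfile.TemporaryDirectory() as tmp:
    remote_root = Path(tmp) / "remote"
    cache_root = Path(tmp) / "cache"
    remote_root.mkdir()
    cache_root.mkdir()

    dag_file = remote_root / "example_dag.py"
    dag_file.write_text("DAG_ID = 'v1'\n", encoding="utf-8")

    cache = LocalFileCache(cache_root)
    first = cache.fetch(dag_file)   # downloads
    second = cache.fetch(dag_file)  # served from the cache

    dag_file.write_text("DAG_ID = 'version-2'\n", encoding="utf-8")
    third = cache.fetch(dag_file)   # changed file, downloads again

    downloads = cache.downloads
    third_text = third.read_text(encoding="utf-8")

print(downloads)
```

The cache writes only to the local cache directory and never to the source location, in line with the read-only requirement stated under Confidentiality.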
Integrity
Currently, no additional mechanisms are planned to verify the integrity of the DAGs and modules. We rely on object storage semantics and fsspec to provide this.
Documentation
- Update Airflow's documentation to include detailed instructions on configuring and using the new DAG loading mechanism.
- Provide examples and best practices for leveraging ephemeral storage in Airflow.
3 Comments
Jens Scheffler
Thanks Bolke de Bruin for the AIP - compared to others in Reply-to-all I'll try to comment in Confluence. I have some general questions which might turn into concern but am "open to be convinced":
__file__?
Overall I like the idea of de-coupling the core from IO/file access though of course!
Bolke de Bruin
Jens Scheffler
Regarding 5) I expect nothing explicit - just wanted to ask for clarification on whether the deployment is in scope for the AIP, such that a git-sync-free deployment via ObjectStorage will also be added as a configuration option to the helm chart. (I am not certain whether this is easily possible.)