Status

State: Draft
Discussion Thread: https://lists.apache.org/thread/qz6ckld9jsv6grsc02qnp3sf1jlrto87
Vote Thread:
Progress Tracking (PR/GitHub Project/Issue Label): https://github.com/apache/airflow/pull/39647
Created: 2024-05-26
In Release:
Authors: Bolke de Bruin

Abstract

This proposal aims to enhance Apache Airflow's DAG loading mechanism to support reading DAGs from ephemeral storage, superseding AIP-5 Remote DAG Fetcher. The objective is to generalize the DAG loader and processor to load DAGs via pathlib.Path rather than assuming direct access to the local filesystem. This includes implementing a custom module loader that supports loading from ObjectStoragePath locations and other Path locations, with caching capabilities provided by fsspec. Additionally, while this AIP does not implement DAG versioning directly, the proposed loader can easily be extended to support it, laying the foundational layer for AIP-63.

Motivation

Currently, Airflow's DAG loading mechanism is tightly coupled to the local filesystem, limiting its flexibility and scalability. As organizations increasingly adopt cloud-native architectures and ephemeral storage solutions, there is a growing need to support reading DAGs from various storage backends, including object storage services such as S3, GCS, and Azure Blob Storage. Traditionally, Airflow has relied on third-party syncing mechanisms like git-sync, which complicate deployments and have drawbacks in staying in sync. This proposal addresses that need by abstracting the DAG loading process to accommodate different storage backends seamlessly.

Proposal

Goals

  1. Generalize the DAG loader and processor to use pathlib.Path for file operations.
  2. Implement a custom module loader to support ObjectStoragePath and other Path-like abstractions.
  3. Integrate fsspec for caching mechanisms within the loader to optimize performance and reduce redundant data retrievals from remote storage.
  4. Lay the foundational layer for supporting DAG versioning as described in AIP-63.

Implementation Details

1. Generalize DAG Loader and Processor

  • Refactor DAG Loader:

    • Modify the DAG loader to use pathlib.Path for all file operations.
    • Ensure backward compatibility with existing DAG loading processes.
  • Update DAG Processor:

    • Adapt the DAG processor to handle pathlib.Path objects.
    • Abstract any direct filesystem access to use pathlib.Path methods (see the sketch after this list).
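
As a rough illustration of the refactor, the sketch below shows a hypothetical list_py_file_paths helper that discovers DAG files using only pathlib.Path methods, assuming ObjectStoragePath exposes the usual pathlib API (rglob, is_file); the paths and connection id are placeholders, not the actual implementation.

```python
from __future__ import annotations

from pathlib import Path

from airflow.io.path import ObjectStoragePath


def list_py_file_paths(dag_folder: Path) -> list[Path]:
    """Return all Python files below the DAG folder.

    Only pathlib.Path methods are used, so the same code works for a
    local directory and for object storage behind ObjectStoragePath.
    """
    return [p for p in dag_folder.rglob("*.py") if p.is_file()]


# The same discovery code path handles local and remote DAG folders:
local_dags = list_py_file_paths(Path("/opt/airflow/dags"))
remote_dags = list_py_file_paths(ObjectStoragePath("s3://my-bucket/dags/", conn_id="aws_default"))
```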

2. ObjectStorageLoader & ObjectStorageFinder

This supports loading Python modules from cloud storage and consists of a Finder, which locates arbitrary modules, and a Loader, which imports them.

    • Utilize the existing ObjectStoragePath class, which extends pathlib.Path and uses fsspec as the underlying implementation.
    • Ensure ObjectStoragePath integrates seamlessly with Airflow's DAG discovery process by adding the Finder.
    • Ensure caching for compiled py files and for downloaded DAGs and modules. 
    • Allow users to define cache size, eviction policies, and storage locations via Airflow configuration.
    • Optimize the custom module loader for minimal latency and efficient data retrieval (a sketch of the Finder and Loader follows this list).
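
A minimal sketch of what such a Finder and Loader pair could look like, built on Python's importlib machinery; the class layout is illustrative rather than the final design, and the bucket path and connection id are placeholders.

```python
import importlib.abc
import importlib.util
import sys

from airflow.io.path import ObjectStoragePath


class ObjectStorageLoader(importlib.abc.SourceLoader):
    """Loads module source from an ObjectStoragePath."""

    def __init__(self, fullname: str, path: ObjectStoragePath):
        self._fullname = fullname
        self._path = path

    def get_filename(self, fullname: str) -> str:
        return str(self._path)

    def get_data(self, path: str) -> bytes:
        # This is where an fsspec-backed cache would avoid repeated
        # downloads of unchanged files.
        return self._path.read_bytes()


class ObjectStorageFinder(importlib.abc.MetaPathFinder):
    """Locates modules below a remote base path."""

    def __init__(self, base: ObjectStoragePath):
        self._base = base

    def find_spec(self, fullname, path=None, target=None):
        candidate = self._base / (fullname.replace(".", "/") + ".py")
        if not candidate.exists():
            return None
        loader = ObjectStorageLoader(fullname, candidate)
        return importlib.util.spec_from_loader(fullname, loader, origin=str(candidate))


# Registering the finder makes modules stored under the bucket importable:
sys.meta_path.append(
    ObjectStorageFinder(ObjectStoragePath("s3://my-bucket/dags/", conn_id="aws_default"))
)
```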

3. Support for DAG Versioning (Foundational Layer)

  • DAG Versioning Integration:
    • Design the loader to support versioning mechanisms provided by fsspec or cloud storage backends.
    • Ensure the architecture allows easy extension to include versioning logic as specified in AIP-63.
    • Provide configuration options to enable versioning features, either leveraging built-in capabilities of fsspec and cloud storage or through an additional versioning layer (see the illustration after this list).
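
As an illustration of how a backend's built-in versioning could be leveraged, s3fs (through fsspec) can read a pinned object version when bucket versioning is enabled; the bucket, key, and version id below are placeholders.

```python
import fsspec

# Version-aware access requires support in the backend, e.g. s3fs against
# a bucket with versioning enabled.
fs = fsspec.filesystem("s3", version_aware=True)

# Read a pinned revision of a DAG file ("abc123" stands in for a real version id).
with fs.open("my-bucket/dags/my_dag.py", version_id="abc123") as f:
    source = f.read()
```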

Security Considerations

Confidentiality

We rely on the mechanisms provided by Airflow's Connections and ObjectStoragePath to ensure secure access to object storage. Storage will be read-only (no cache files will be written to object storage).
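
For illustration, remote access is typically configured through an Airflow connection that ObjectStoragePath resolves by conn_id; the bucket and connection id below are placeholders, and the path is only read, never written.

```python
from airflow.io.path import ObjectStoragePath

# Credentials come from the referenced Airflow connection; the DAG folder
# is only ever read from, never written to.
dag_folder = ObjectStoragePath("s3://my-dags-bucket/dags/", conn_id="aws_default")
source = (dag_folder / "my_dag.py").read_text()
```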

Availability

As DAGs and modules will need to be downloaded before Airflow can parse them, this can impact the performance of DAG processing. Configurable caching is implemented to reduce the impact.
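
As a sketch of the kind of caching involved, fsspec's chained "filecache" protocol downloads a file once into a local cache directory and serves subsequent reads from there; the paths below are placeholders.

```python
import fsspec

# The first access downloads the file into the local cache directory;
# later parses read the cached copy until it is invalidated or evicted.
with fsspec.open(
    "filecache::s3://my-bucket/dags/my_dag.py",
    filecache={"cache_storage": "/tmp/airflow-dag-cache"},
) as f:
    source = f.read()
```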

Integrity

Currently, no additional mechanisms are planned to verify the integrity of DAGs and modules; we rely on object storage semantics and fsspec to provide this.

Documentation

  • Update Airflow's documentation to include detailed instructions on configuring and using the new DAG loading mechanism.
  • Provide examples and best practices for leveraging ephemeral storage in Airflow.



Comments

  1. Thanks Bolke de Bruin for the AIP - unlike others replying-to-all on the list, I'll try to comment in Confluence. I have some general questions which might turn into concerns, but I am "open to be convinced":

    1. Object storage is (usually) charged per IOPS. As all nodes need continuous access to DAGs, is there a way to estimate the amount of IOPS needed, and therefore the costs generated, compared to a git-sync approach? A Git server might also be rate-limited.
    2. On the other hand, understanding that caching will compensate, how will I (as an admin) be able to flush all caches consistently in case of inconsistencies?
    3. I have doubts about performance, as local files will always be fast and present, but remote calls will introduce latency in DAG parsing. If we had parallel parsing and asyncio, that would compensate for the IO latency, but as long as that is not a mandatory option, this is to be investigated; I foresee problems in this area. I am not sure whether this could scale to thousands of DAGs.
    4. I also see (like other emails in the discussion) a challenge with non-Python resources. How about Dag-Factory or other generators using other files via __file__?
    5. Would the AIP also include a standard deployment option in the Helm chart?

    Overall, of course, I like the idea of de-coupling the core from IO/file access!

      1. The short answer is "not that I can think of right now, except for measuring it once available". The long answer is ymmv. Long-lived nodes with slowly changing DAGs will use the cache, so they will barely hit the object storage apart from the initial load. Checking for changes on S3 currently costs $0.0004 per thousand requests. git-sync will need to do the same and will sync all changes everywhere rather than selectively when needed.
      2. The cache eviction policy can be set, or you can simply remove the cached files.
      3. With file caching, files will be pulled locally first and parsed from there. If you have caching turned off, this will indeed introduce latency; however, there is no real downside to caching unless you change your DAGs extremely often.
      4. First and foremost, if you do not switch to remote storage, there is no effect on Dag-Factory or other generators. Relying on __file__ is, imho, a bad habit that only works coincidentally. Strategies to overcome/mitigate this have been outlined on the mailing list, but for reference: 1) allow the module loader to load resources by convention (e.g. everything under __module__/resources), 2) use a manifest file, 3) maybe use a zip file.
      5. Can you explain what you would expect?
      1. Regarding 5), I expect nothing explicit; I just wanted to ask for clarification whether the deployment is included in the AIP (in scope), such that a git-sync-free deployment via object storage will be added as a configuration option to the Helm chart as well. (I am not certain whether this is easily possible.)