Status
Motivation
Currently, Airflow discovers DAGs by traversing all the files under $AIRFLOW_HOME/dags and looking for files that contain the strings "airflow" and "DAG", which is not efficient. We need a better way for Airflow to discover DAGs.
Considerations
Is there anything special to consider about this AIP? Downsides? Difficulty in implementation or rollout, etc.?
What change do you propose to make?
I am proposing to introduce the DAG manifest, an easier and more efficient way for Airflow to discover DAGs. The DAG manifest would be composed of manifest entries, where each entry represents a single DAG and contains information about where to find it.
Format:
```
dag_manifest_entry:
    dag_id: the DAG ID
    uri: where the DAG can be found; locations are given via URI,
         e.g. s3://my-bucket/dag1.zip, local://dags/dag1.zip
    conn_id: connection ID used to interact with the remote location
```
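For illustration, a concrete entry in this format might look like the following; the DAG ID and connection ID here are placeholder values:

```
dag_manifest_entry:
    dag_id: hello_world
    uri: s3://my-bucket/dag1.zip
    conn_id: aws_default
```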
File-based DAG manifest
Airflow services will look at $AIRFLOW_HOME/manifest.json for the DAG manifest. The manifest.json file contains all the DAG entries. We would expect a manifest.json like:
```
{
    "dag_1": { "uri": "local://dags/hello.py" },
    "dag_2": { "uri": "s3://dags/superhero.py" }
}
```
Custom DAG manifest
The manifest can also be generated by a callable supplied in airflow.cfg that returns a list of entries when called, e.g.:
```
[core]
# callable to fetch dag manifest list
dag_manifest_entries = my_config.get_dag_manifest_entries
```
For example, the DAG manifest can be stored on S3, and my_config.get_dag_manifest_entries would read the manifest from S3.
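As a sketch of such a callable, assuming boto3 is available and using a placeholder bucket and key, my_config.py could look like:

```python
# my_config.py -- illustrative sketch of a manifest callable backed by S3.
# The bucket name and key are placeholders; boto3 is assumed to be available.
import json

import boto3


def get_dag_manifest_entries():
    """Fetch manifest.json from S3 and return a list of manifest entries."""
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket="my-bucket", Key="manifest.json")
    manifest = json.loads(obj["Body"].read())
    # Flatten {"dag_1": {"uri": ...}} into entries carrying an explicit dag_id.
    return [{"dag_id": dag_id, **entry} for dag_id, entry in manifest.items()]
```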
What problem does it solve?
It provides an easier and more efficient approach to Airflow DAG discovery.
Why is it needed?
- With the manifest, users can explicitly declare which DAGs Airflow should look at.
- Airflow no longer has to crawl through a directory and import arbitrary files, which can cause problems.
- Users are not forced to provide a way to crawl various remote sources.
- We can get rid of the heuristic described in AIRFLOW-97, which requires strings such as "airflow" and "DAG" to be present in a DAG file.
Are there any downsides to this change?
- An extra step is required to add a new DAG, i.e. adding an entry to the DAG manifest.
- Migration is required for upgrade. We may need to create a script that traverses all the files under $AIRFLOW_HOME/dags or in remote storage (assuming we have a remote DAG fetcher) and populates entries in the DAG manifest; a sketch of such a script follows this list.
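The following is an illustrative sketch of the local case only. It reuses the old content heuristic once, at migration time, and deriving the DAG ID from the file name is a simplification (real DAG IDs would come from parsing the file):

```python
# Illustrative migration sketch: walk $AIRFLOW_HOME/dags once and emit a
# file-based manifest matching the format above.
import json
import os
from pathlib import Path

airflow_home = Path(os.environ.get("AIRFLOW_HOME", "~/airflow")).expanduser()
dags_root = airflow_home / "dags"

manifest = {}
for path in dags_root.rglob("*.py"):
    content = path.read_text(errors="ignore")
    # Reuse the old "airflow" and "DAG" heuristic once, at migration time,
    # instead of on every scheduler scan.
    if "airflow" in content and "DAG" in content:
        dag_id = path.stem  # simplification; not necessarily the real DAG ID
        manifest[dag_id] = {"uri": f"local://{path.relative_to(dags_root)}"}

(airflow_home / "manifest.json").write_text(json.dumps(manifest, indent=2))
```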
Which users are affected by the change?
All users attempting to upgrade to the latest version.
How are users affected by the change? (e.g. DB upgrade required?)
Users need to run a migration script to populate the DAG manifest.
Other considerations?
N/A
What defines this AIP as "done"?
Airflow discovers DAGs by looking at the DAG manifest, without traversing all the files in the filesystem.