Video of current problem: https://www.youtube.com/watch?v=sNrBruPS3r4
We propose to persist all information from a DAG file in the Airflow metastore and have the webserver read state from the metastore instead of parsing the DAG files in the webserver process itself. This has several advantages:
Webserver Gunicorn processes read state from the DB instead of each webserver process managing its own state by parsing the DAGs itself. The current behaviour causes DAGs to weirdly disappear and reappear for several minutes after you make a change (e.g. add a new DAG), because each worker refreshes its own DagBag at a different time. The DB will be the single source of truth.
Once the scheduler has processed the DAG, it will be visible to all Gunicorn workers. The webserver will no longer scan the DAG files, and will only read them for specific actions:
Trigger: this requires the DAG to be parsed once. The result does not need to live in memory and can be produced on the fly.
Code: viewing the source only requires reading the file; executing the DAG is not needed here.
The webserver should not use the DagBag but the metastore DB as the single source of truth. This prevents the DagBags in different Gunicorn workers from going out of sync with each other, and the webserver no longer needs to process the DAG files at all.
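As an illustrative sketch of this read path (the table and column names here are hypothetical, not the actual Airflow metastore schema), the scheduler-side processor would write the parsed DAG once, and every Gunicorn worker would read the same row instead of building its own DagBag:

```python
import json
import sqlite3

# Hypothetical serialized-DAG table standing in for the Airflow metastore.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE serialized_dag (dag_id TEXT PRIMARY KEY, data TEXT)")

# The scheduler's DAG file processor persists the parsed DAG once...
dag = {"dag_id": "example_dag", "tasks": ["extract", "load"]}
conn.execute("INSERT INTO serialized_dag VALUES (?, ?)",
             (dag["dag_id"], json.dumps(dag)))
conn.commit()

# ...and every webserver worker reads that single row, so all workers
# present the same state at the same time.
def get_dag(dag_id):
    row = conn.execute("SELECT data FROM serialized_dag WHERE dag_id = ?",
                       (dag_id,)).fetchone()
    return json.loads(row[0]) if row else None

print(get_dag("example_dag")["tasks"])  # ['extract', 'load']
```

Because all workers query the same table, a newly added DAG becomes visible to every worker as soon as the scheduler has written it, eliminating the disappearing/reappearing behaviour.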
Related JIRA issue: https://jira.apache.org/jira/browse/AIRFLOW-3562
To achieve this, all DAG components must be stored in the database. This is currently not the case, so the following components need to be persisted:
Biggest database change: DagEdges: https://jira.apache.org/jira/browse/AIRFLOW-3585
Other missing information will be identified while working on this.
Edges in the database
Adding edges to the database is needed to be able to visualize the graph of a DAG.
There are 3 options to do this:
Persist a single version of the DAG edges (latest only).
The problem is that the stored edges are overwritten whenever the DAG is updated, so only the latest version is kept and history is lost.
Add edges for each DagRun.
In this implementation history is maintained.
However, each DagRun stores its own complete copy of the DAG graph in the database, which is rather redundant.
To get the best of both, I suggest storing each version of a DAG graph under its own ID, to which every DagRun is linked. This preserves DAG history while not storing redundant data.
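The deduplication idea above can be sketched as follows. This is a minimal illustration, not the proposed implementation: the in-memory dicts stand in for metastore tables, and using a content hash of the edge list as the version ID is one possible choice of ID, assumed here for simplicity. Identical graphs map to the same version, so unchanged DAGs add no new rows, while an updated graph gets a new version that its DagRuns link to:

```python
import hashlib
import json

# Hypothetical in-memory stores standing in for metastore tables.
graph_versions = {}   # version_id -> edge list (stored once per unique graph)
dag_runs = []         # (dag_id, execution_date, version_id)

def store_graph_version(edges):
    """Store a DAG's edge list once per unique structure; return its version ID."""
    # A content hash of the sorted edge list serves as the version ID, so an
    # unchanged DAG resolves to the version that is already stored.
    key = hashlib.sha256(json.dumps(sorted(edges)).encode()).hexdigest()
    graph_versions.setdefault(key, edges)
    return key

def create_dag_run(dag_id, execution_date, edges):
    # Each DagRun only records a reference to its graph version.
    version_id = store_graph_version(edges)
    dag_runs.append((dag_id, execution_date, version_id))
    return version_id

v1 = create_dag_run("etl", "2019-01-01", [["extract", "load"]])
v2 = create_dag_run("etl", "2019-01-02", [["extract", "load"]])      # unchanged DAG
v3 = create_dag_run("etl", "2019-01-03", [["extract", "transform"],
                                          ["transform", "load"]])    # updated DAG

assert v1 == v2                  # identical graphs share one stored version
assert len(graph_versions) == 2  # history preserved without redundant copies
```

Three DagRuns exist, but only two graph versions are stored: history is preserved for the graph view of old runs without duplicating the edge list per DagRun.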