Comment: Abandoned in favour of AIP-24


Page properties

State: Abandoned in favour of AIP-24
Discussion Thread: TBD



As a follow-up on AIP-19 (making the webserver stateless), I suggest storing the DAG structure and metadata in the database. Where AIP-19 targets the webserver reading only from the DB, this AIP targets Airflow core reading and storing all information in the DB. Currently, this is done only partially, and the webserver processes need to read the DAG files in order to fetch missing information, which is very costly. This AIP should:

  1. Simplify architecture (disentangle processes)
  2. Make tasks fetch-able from the DB without having to read the file
  3. Improve performance by not having to read the full file when running a single task

What problem does this solve?

Currently, DAG files are parsed by multiple processes (webserver, scheduler & workers). This is:

  1. Unclear from a code perspective
  2. Unnecessary, because we assume the DAG schedule interval does not change every few seconds, so files can be parsed only when required instead of being re-parsed all the time. This problem is especially pronounced with larger or slow-to-parse files. For example, a DAG file with a sleep(60) at the top level will take at least a minute to parse.

I suggest making a "DagParser" that processes files only when necessary and stores the retrieved metadata in the database. Other processes should read only from the database.

Why is this change needed?

In the current architecture the parsing is entangled in multiple processes and executed too often, leading to out-of-sync state and performance issues.

Suggested implementation

I see Airflow being used in two ways:

  1. Static DAGs - the structure (tasks + dependencies) is static, does not change over time
  2. Dynamic DAGs - the structure changes over time, e.g. by reading the structure from a YAML file instead of defining it in the Python file itself. This makes it impossible to parse the file only once; it must be re-parsed at the time it is executed.

With only static DAGs, we could parse at creation time and from then on only read the metadata from the database. However, with dynamic DAGs this is not possible: the code itself does not change, so we can only detect structural changes by evaluating the file.

So, I suggest the following (simplified) architecture:

The DagParser responsibilities:

  • Read files:
    • When a file is added
    • When a file is edited (detect with file listeners)
    • With dynamic DAGs the file will not change, so possibly re-parse periodically?
    • When a DagRun starts, to fetch the current situation
  • Extract metadata (name, start_date and schedule_interval most importantly) and DAG structure (tasks + dependencies)
    • Tasks should be serialized with dill (currently also used by Airflow), which also supports e.g. lambda expressions
    • For fast comparison, compute and store a hash of each serialized byte stream. Whenever an already-known file is parsed, we can look up the hashes and store a new task only if needed.
  • Store metadata and DAG structure in the database tables DAG, Task and Edge

The scheduler responsibilities:

  • Check if current datetime == next execution datetime for DAG
  • If so, kick off the parsing process, to fetch the structure of a DAG at that point in time. This is required because we want to support dynamic DAGs, which must be reprocessed at the time of execution.
  • Based on the fetched information, create a DagRun, possibly new tasks and edges, and the corresponding TaskInstances

Other things not included in this picture:

  • Currently, the webserver parses DAG files itself. This should be removed (out of scope for this AIP); the webserver should read ONLY from the database, or via an API, and NOT handle parsing itself.