Status

State: Abandoned in favour of AIP-24
Discussion Thread
JIRA


Created

2019-03-15

Motivation

** Note: this AIP replaces AIP-12 with a better defined scope that should help in the discussion surrounding the AIP. **

Issue

In the current implementation, Gunicorn webserver processes each maintain their own DAG representation using their own `DagBag` instance. Because there is no synchronization between the different webserver processes, these `DagBag` instances can contain different states, especially in situations when DAGs have just been added or modified.

As a result, users can see different DAGs in the webserver, depending on the process that handles their request. This results in various stability issues, where DAGs seem to randomly appear and disappear between refreshes until the webserver processes are stabilized.

An example of these issues is illustrated in the following video:



Proposal

To avoid the issues outlined above, we aim to make the webserver endpoints stateless, so that queries to the webserver return the same result regardless of the process that is handling the request. We propose to do so by modifying the endpoints to use a single, shared source-of-truth for displaying DAG/task related information.

The most obvious solution for maintaining a single source-of-truth for DAG-related information is the database, as this is already where Airflow persists DAG-related metadata. To make the webserver stateless, we then simply need to make sure that all required information is available in the database and can be queried by the webserver.

To keep this AIP tractable, we propose to leverage the existing ORM models for storing and querying DAG metadata from the database in the webserver. Following this approach, we should be able to achieve our objective by adding required fields/methods to the `DagModel` ORM class, which will serve as our entrypoint for querying DAG-related metadata from the database.
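As a rough illustration of the idea, the sketch below uses the standard-library `sqlite3` module rather than the actual Airflow ORM. The table layout, the `list_dag_ids` helper and the example DAG ids are assumptions made for this illustration only. The handler keeps no DAG state of its own, so every call made after a change sees the same set of DAGs.

```python
import sqlite3


def list_dag_ids(conn):
    """Hypothetical request handler: reads DAG ids from the shared table on every call."""
    rows = conn.execute("SELECT dag_id FROM dag ORDER BY dag_id").fetchall()
    return [row[0] for row in rows]


# An in-memory database stands in for the shared metadata database.
db = sqlite3.connect(":memory:")
# Simplified, assumed schema; the real DagModel table has more columns.
db.execute("CREATE TABLE dag (dag_id TEXT PRIMARY KEY, is_paused INTEGER NOT NULL)")
db.executemany(
    "INSERT INTO dag VALUES (?, ?)",
    [("example_etl", 0), ("example_report", 1)],
)
db.commit()

before = list_dag_ids(db)

# A new DAG is registered in the database (in Airflow, by whatever parses DAG files).
db.execute("INSERT INTO dag VALUES (?, ?)", ("example_new", 0))
db.commit()

# Two later requests, possibly served by different workers, read the same state.
second = list_dag_ids(db)
third = list_dag_ids(db)
```

Because nothing is cached in the handler, consistency between workers follows from the database alone; the price is one query per request.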

Required changes

To achieve our goal, we need to implement the following changes:

  • All references to the shared `DagBag` instance need to be removed from the webserver endpoints.
  • Functionality that is independent of the DAG file needs to be moved to the `DagModel` class, rather than the `DAG` class. One example is the `following_schedule` method. This way we can use this functionality without loading the DAG file. Backwards compatibility can be maintained if needed by forwarding calls on `DAG` instances to the backing `DagModel` instance.
  • The webserver endpoints need to be modified to reference the required methods/attributes on the `DagModel` class rather than the `DAG` class.
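The forwarding mentioned in the second item could look roughly like the sketch below. Both classes are simplified stand-ins written for this illustration, not the real Airflow classes: the `_model` attribute is an assumed name, and `following_schedule` here only handles a `timedelta` interval, whereas the real method covers more cases.

```python
from datetime import datetime, timedelta


class DagModel:
    """Stand-in for the ORM class; holds only values that would live in the database."""

    def __init__(self, dag_id, schedule_interval=None):
        self.dag_id = dag_id
        self.schedule_interval = schedule_interval  # simplified: timedelta or None

    def following_schedule(self, dttm):
        """Return the next schedule time after dttm, or None if the DAG is unscheduled."""
        if self.schedule_interval is None:
            return None
        return dttm + self.schedule_interval


class DAG:
    """Stand-in for the file-defined DAG; keeps its old API by forwarding."""

    def __init__(self, dag_id, schedule_interval=None):
        self.dag_id = dag_id
        # Assumed attribute name for the backing model instance.
        self._model = DagModel(dag_id, schedule_interval)

    def following_schedule(self, dttm):
        # Existing callers keep working; the logic now lives on DagModel.
        return self._model.following_schedule(dttm)


dag = DAG("example_etl", schedule_interval=timedelta(days=1))
next_run = dag.following_schedule(datetime(2019, 3, 15))
```

With this shape, the webserver can call `DagModel.following_schedule` directly, while code that still holds a `DAG` instance is unaffected.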

Note that some endpoints may need to be modified to read data not present in the database directly from the corresponding DAG file (but only if really needed). This will for example be required in the DAG graph view, as graph edges are currently not stored in the database. This can only be avoided by adding this missing data to the database.
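If the edges were added to the database instead, one possible shape is sketched below with the standard-library `sqlite3` module. The `dag_edge` table, its columns and the two helper functions are hypothetical and not part of the existing Airflow schema. The sketch keeps only the edges of the most recently parsed version of a DAG and deliberately leaves out any history.

```python
import sqlite3


def replace_edges(conn, dag_id, edges):
    """Overwrite the stored edges of one DAG with those of the latest parse."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM dag_edge WHERE dag_id = ?", (dag_id,))
        conn.executemany(
            "INSERT INTO dag_edge (dag_id, upstream_task_id, downstream_task_id) "
            "VALUES (?, ?, ?)",
            [(dag_id, upstream, downstream) for upstream, downstream in edges],
        )


def load_edges(conn, dag_id):
    """Return (upstream, downstream) pairs for the graph view, without parsing the DAG file."""
    rows = conn.execute(
        "SELECT upstream_task_id, downstream_task_id FROM dag_edge "
        "WHERE dag_id = ? ORDER BY upstream_task_id, downstream_task_id",
        (dag_id,),
    ).fetchall()
    return [tuple(row) for row in rows]


conn = sqlite3.connect(":memory:")
# Hypothetical table; one row per edge of the current DAG structure.
conn.execute(
    "CREATE TABLE dag_edge ("
    "dag_id TEXT NOT NULL, "
    "upstream_task_id TEXT NOT NULL, "
    "downstream_task_id TEXT NOT NULL, "
    "PRIMARY KEY (dag_id, upstream_task_id, downstream_task_id))"
)

replace_edges(conn, "example_etl", [("extract", "transform"), ("transform", "load")])
first_parse = load_edges(conn, "example_etl")

# The DAG file changes; the next parse overwrites the stored structure.
replace_edges(conn, "example_etl", [("extract", "load")])
second_parse = load_edges(conn, "example_etl")
```

Replacing the rows inside a single transaction means a reader never sees a half-written edge set.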

Considerations

Larger discussions concerning serialization formats for DAGs and DAG versioning are not part of this AIP, as we only intend to make a few small changes to the existing classes to address the problem at hand. These other discussions involve larger (architectural) changes which are outside of the scope of this AIP.

7 Comments

  1. Do we have a list of fields that we will need to persist into DB to support stateless webserver? And I would imagine that we would also need to model the tasks in the DAG. For instance, the Airflow UI supports showing rendered template fields in task instance.

  2. Good point Chao-Han Tsai

    I think we should be thoughtful about this: we need to go through the Task Instance Details page and see what we actually need. There is a lot of data there, including data about the DAG, that nobody looks at. The Rendered Templates are going to be the toughest I think, but when we create the dag-run we can store this in the database.

    Currently, for me it feels awkward that we need the actual DagBag in the overview page. There is nothing that we should not be able to get from the database itself.

  3. The biggest issues are the endpoints that require task edge information (graph and tree), as this data is not (yet) stored in the database as far as I know. Most other information is already in the database or can be added to the DagModel or TaskInstance models.

    For the edge information, we essentially have two choices as far as I can see:

    • We can parse the DAG file on every view and use the edge information from the DAG file to build up the view. The advantage of this approach is that it is very easy to implement; however, it does incur a performance penalty as DAGs need to be parsed for every view (which can be problematic for large DAGs).
    • We can add edges to the database by either
      • Adding the current state of the DAG in the database, so that edges reflect the most recent version of DAG as it was parsed. This reflects how Airflow currently handles changes to the DAG structure, as the parsed version of the current DAG is used to build up the current view.
      • Maintaining a history of DAG edges in the database, so that edges reflect the DAG as it was at the time of the DagRun. This would require some careful thought about how to manage versioning (also how this would work when backfilling etc).

    For this AIP, I would prefer pursuing option 1 or option 2a, as I think that option 2b involves extra work that is outside of the scope of this AIP. Moreover, options 1 or 2a should be sufficient for keeping in line with the current behaviour of Airflow.

    1. In the first option, we would need to load the whole DagBag on every view, which can take an enormous amount of time if there are a lot of DAGs in the repository. I would prefer option 2, which persists the edge information in the DB.

  4. Here is a summary of the changes that I see per endpoint:

    • graph
      • Tasks + task edges need to be parsed from the DAG file or stored in the database.
        The most practical way forward would be to store the edges of the current
        DAG file in the database, as this mimics the existing behaviour of Airflow
        and avoids having to maintain DAG history in the database (which is outside
        of the scope of this AIP IMO).
    • tree
      • Task + task edges need to be parsed from the DAG file or stored in the DB
        (see above).
      • SubDags need to be handled on the DagModel class.
      • DagRuns should be queryable via a relationship on DagModel (nice to have).
      • TaskInstances should be queryable via a relationship on DagModel (nice to have).
    • duration
      • SubDags need to be handled on the DagModel class.
      • Helper methods such as `date_range` need to be moved to the DagModel class.
      • TaskInstances should be queried from the DagModel.
      • All other information seems to be present in the `TaskInstance` model.
    • tries
      • SubDags need to be handled on the DagModel class.
      • Helper methods such as `date_range` need to be moved to the DagModel class.
      • TaskInstances should be queried from the DagModel.
    • landing_times
      • SubDags need to be handled on the DagModel class.
      • Helper methods such as `date_range` need to be moved to the DagModel class.
      • TaskInstances should be queried from the DagModel.
    • gantt
      • SubDags need to be handled on the DagModel class.
    • dag_details
      • Make TaskInstance/DagRun queryable on DagModel (nice to have).
    • code
      • Nothing to do?
    • task
      • Move `Task` attributes from `Task` to `TaskInstance` if they are required
        for the view. As Fokko already indicated, there are many attributes in here
        that may not be crucial to show. There are also some duplications, such as
        `dag_id` which is being shown for both `Task` and `TaskInstance`.
      • Templated content would also need to be either stored on the `TaskInstance`
        or otherwise be resolvable from `TaskInstance`.
    • rendered
      • Check if templates can be rendered from `TaskInstance` without needing the
        `Task` objects from the DAG.
    • log
      • Make TaskInstance queryable on DagModel (nice to have).
    • xcom
      • Nothing to do?


    Feel free to add if I missed something.

    1. I would be interested in what the tables, especially DagModel, TaskModel, etc., would look like and how exactly you would model the edges and subdags.

  5. I've just commented on this AIP's sibling (AIP-18) with a slightly different perspective on the need for this, which pertains to lineage and traceability: ensuring that the Webserver renders a correct view of what happened when a historic DagRun ran, rather than building that view from the Dag code as it is "today."

    Not sure if this AIP considers such a thing as part of its remit, but it's worth a quick read I hope. I can't seem to find a way to link to a comment directly, so here's a link to the AIP; it's presently the only comment.

    https://cwiki.apache.org/confluence/x/vCclBg