Status
Motivation
** Note: this AIP replaces AIP-12 with a better-defined scope, which should help focus the discussion surrounding the AIP. **
Issue
In the current implementation, Gunicorn webserver processes each maintain their own DAG representation using their own `DagBag` instance. Because there is no synchronization between the different webserver processes, these `DagBag` instances can contain different states, especially in situations when DAGs have just been added or modified.
As a result, users can see different DAGs in the webserver, depending on the process that handles their request. This results in various stability issues, where DAGs seem to randomly appear and disappear between refreshes until the webserver processes are stabilized.
An example of these issues is illustrated in the following video:
Proposal
To avoid the issues outlined above, we aim to make the webserver endpoints stateless, so that queries to the webserver return the same result regardless of which webserver process handles the request. We propose to do so by modifying the endpoints to use a single, shared source-of-truth for displaying DAG/task related information.
The most obvious solution for maintaining a single source-of-truth for DAG-related information is the database, as this is already where Airflow persists DAG-related metadata. To make the webserver stateless, we then simply need to make sure that all required information is available in the database and can be queried by the webserver.
To keep this AIP tractable, we propose to leverage the existing ORM models for storing and querying DAG metadata from the database in the webserver. Following this approach, we should be able to achieve our objective by adding required fields/methods to the `DagModel` ORM class, which will serve as our entrypoint for querying DAG-related metadata from the database.
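As a rough illustration of the idea, the sketch below shows two request handlers that both read DAG metadata from the same database instead of from a per-process `DagBag`. It uses the standard-library `sqlite3` module rather than Airflow's ORM, and the `dag` table, its columns and the `list_dags` helper are simplified stand-ins invented for this example, not the actual `DagModel` schema.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical, simplified stand-in for the table behind `DagModel`.
    conn.execute(
        "CREATE TABLE dag ("
        "dag_id TEXT PRIMARY KEY, is_paused INTEGER NOT NULL, fileloc TEXT)"
    )


def list_dags(conn):
    """Query the database on every call; nothing is cached in the process."""
    rows = conn.execute(
        "SELECT dag_id, is_paused FROM dag ORDER BY dag_id"
    ).fetchall()
    return [
        {"dag_id": dag_id, "is_paused": bool(is_paused)}
        for dag_id, is_paused in rows
    ]


conn = sqlite3.connect(":memory:")
create_schema(conn)
conn.execute(
    "INSERT INTO dag VALUES (?, ?, ?)", ("example_dag", 0, "/dags/example.py")
)
conn.commit()

# Two simulated webserver workers answer the same request. Both read the
# shared database, so they cannot disagree about which DAGs exist.
worker_a = list_dags(conn)
worker_b = list_dags(conn)
```

Because neither handler keeps state of its own, a newly added DAG becomes visible to every worker as soon as its row is committed.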
Required changes
To achieve our goal, we need to implement the following changes:
- All references to the shared `DagBag` instance need to be removed from the webserver endpoints.
- Functionality that is independent of the DAG file needs to be moved to the `DagModel` class, rather than the `DAG` class. One example is the `following_schedule` method. This way we can use this functionality independently of the DAG file. Backwards compatibility can be maintained if needed by forwarding calls on `DAG` instances to the backing `DagModel` instance.
- The webserver endpoints need to be modified to reference the required methods/attributes on the `DagModel` class rather than the `DAG` class.
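The forwarding approach mentioned in the second item could look roughly like the sketch below. `DagModelSketch` and `DagSketch` are hypothetical stand-ins for `DagModel` and `DAG`, and the schedule logic is reduced to a fixed `timedelta` interval for brevity (cron-style schedules are not handled here).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DagModelSketch:
    """Hypothetical stand-in for `DagModel`; fields are illustrative only."""

    dag_id: str
    schedule_interval: timedelta

    def following_schedule(self, dttm: datetime) -> datetime:
        # Needs only persisted metadata, so it works without the DAG file.
        return dttm + self.schedule_interval


class DagSketch:
    """Hypothetical stand-in for `DAG` that keeps its old public method."""

    def __init__(self, model: DagModelSketch):
        self._model = model

    def following_schedule(self, dttm: datetime) -> datetime:
        # Forward to the backing model so existing callers keep working.
        return self._model.following_schedule(dttm)


model = DagModelSketch(dag_id="example_dag", schedule_interval=timedelta(days=1))
dag = DagSketch(model)
start = datetime(2019, 1, 1)
next_run = dag.following_schedule(start)
```

With this arrangement the webserver can call the method on the model directly, while code that still holds a `DAG` instance gets the same answer through the forwarding method.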
Note that some endpoints may need to be modified to read data not present in the database directly from the corresponding DAG file (but only if really needed). This will for example be required in the DAG graph view, as graph edges are currently not stored in the database. This can only be avoided by adding this missing data to the database.
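If the edge data were added to the database, one possible shape is a table with one row per upstream/downstream pair that is overwritten whenever the DAG file is parsed. The sketch below uses the standard-library `sqlite3` module; the `dag_edge` table and the `replace_edges` and `load_edges` helpers are assumptions made for illustration and are not part of Airflow.

```python
import sqlite3


def create_edge_table(conn):
    # Hypothetical table: one row per (upstream, downstream) task pair.
    conn.execute(
        "CREATE TABLE dag_edge ("
        "dag_id TEXT NOT NULL, "
        "upstream_task_id TEXT NOT NULL, "
        "downstream_task_id TEXT NOT NULL, "
        "PRIMARY KEY (dag_id, upstream_task_id, downstream_task_id))"
    )


def replace_edges(conn, dag_id, edges):
    """Store the edges of the current DAG file, discarding the old ones."""
    with conn:  # one transaction: readers never see a half-written graph
        conn.execute("DELETE FROM dag_edge WHERE dag_id = ?", (dag_id,))
        conn.executemany(
            "INSERT INTO dag_edge VALUES (?, ?, ?)",
            [(dag_id, upstream, downstream) for upstream, downstream in edges],
        )


def load_edges(conn, dag_id):
    """What a graph view could query instead of parsing the DAG file."""
    rows = conn.execute(
        "SELECT upstream_task_id, downstream_task_id FROM dag_edge "
        "WHERE dag_id = ? ORDER BY upstream_task_id, downstream_task_id",
        (dag_id,),
    )
    return list(rows)


conn = sqlite3.connect(":memory:")
create_edge_table(conn)

replace_edges(conn, "example_dag", [("extract", "transform"), ("transform", "load")])
edges_v1 = load_edges(conn, "example_dag")

# The DAG file changes: only the latest edges are kept, no history.
replace_edges(conn, "example_dag", [("extract", "load")])
edges_v2 = load_edges(conn, "example_dag")
```

Only the edges of the most recently parsed DAG file are kept, which mirrors the current behaviour of showing the latest DAG structure and deliberately leaves DAG history out.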
Considerations
Larger discussions concerning serialization formats for DAGs and DAG versioning are not part of this AIP, as we only intend to make a few small changes to the existing classes to address the problem at hand. These other discussions involve larger (architectural) changes which are outside of the scope of this AIP.
7 Comments
Chao-Han Tsai
Do we have a list of fields that we will need to persist into DB to support stateless webserver? And I would imagine that we would also need to model the tasks in the DAG. For instance, the Airflow UI supports showing rendered template fields in task instance.
Fokko Driesprong
Good point Chao-Han Tsai
I think we should be thoughtful about this: we need to go through the Task Instance Details page and see what we actually need. There is a lot of data there, also about the DAG, that nobody looks at. The Rendered Templates are going to be the toughest, I think, but when we create the dag-run we can store them in the database.
Currently, for me it feels awkward that we need the actual DagBag in the overview page. There is nothing that we should not be able to get from the database itself.
Julian de Ruiter
The biggest issues are the endpoints that require task edge information (graph and tree), as this data is not (yet) stored in the database as far as I know. Most other information is already in the database or can be added to the DagModel or TaskInstances models.
For the edge information, we essentially have two choices as far as I can see:
For this AIP, I would prefer pursuing option 1 or option 2a, as I think that option 2b involves extra work that is outside of the scope of this AIP. Moreover, options 1 or 2a should be sufficient for keeping in line with the current behaviour of Airflow.
Chao-Han Tsai
In the first option, we would need to load the whole DagBag on every view, which can take an enormous amount of time if there are a lot of DAGs in the repository. I would prefer option 2, which persists the edge information in the DB.
Julian de Ruiter
Here is a summary of the changes that I see per endpoint:
The most practical way forward would be to store the edges of the current
DAG file in the database, as this mimics the existing behaviour of Airflow
and avoids having to maintain DAG history in the database (which is outside
of the scope of this AIP IMO).
(see above).
for the view. As Fokko already indicated, there are many attributes in here
that may not be crucial to show. There are also some duplications, such as
`dag_id` which is being shown for both `Task` and `TaskInstance`.
or otherwise be resolvable from `TaskInstance`.
`Task` objects from the DAG.
Feel free to add if I missed something.
Chao-Han Tsai
I would be interested in what the tables, especially DagModel, TaskModel, etc., would look like and how exactly you model the edges and subdags.
Tony Brookes
I've just commented on this AIP's sibling (AIP-18) with a slightly different perspective on the need for this, which pertains to lineage and traceability: ensuring that the Webserver renders a correct view of what happened when a historic DagRun ran, rather than building that view from the Dag code as it is "today."
Not sure if this AIP considers such a thing as part of its remit, but it's worth a quick read I hope. Can't seem to find a way to link to a comment directly, so here's a link to the AIP, and it's presently the only comment.
https://cwiki.apache.org/confluence/x/vCclBg