Status

State: Completed

Discussion Thread: https://lists.apache.org/thread/tc37cgrx2tojv3zgzokz06f5ypk0y2hg

Created:
Motivation

To run a task, Airflow needs to parse the DAG file twice: once in the `airflow run local` process and once in `airflow run raw` (this is partially mitigated when no `run_as_user` is specified). The double parsing roughly doubles the memory needed to run a task. During peak hours, the CPU can spike for a long time due to the double parsing, slowing down task start-up and even causing DAG parsing to time out. This is especially true when there are large, complex DAGs in the Airflow cluster.

Removing the double parsing saves the memory required to run the task and reduces the task bootstrap time.

What change do you propose to make?

Rely on the deserialized DAG in the `airflow run local` process to do the dependency check.

In `utils/cli.py`, add another method:

from airflow.exceptions import AirflowException
from airflow.models.serialized_dag import SerializedDagModel


def get_serialized_dag(dag_id, task_id):
    """Fetch the serialized DAG, reduced to the single task being run."""
    dag_model = SerializedDagModel.get(dag_id)
    if dag_model is None:
        raise AirflowException("Serialized dag_id could not be found: {}".format(dag_id))
    return dag_model.get_dag_with_task_id(dag_id, task_id)


The `get_dag_with_task_id` method takes a dag_id and a task_id, and deserializes the JSON data from the serialized_dag table in a streaming way into a DAG containing only the single task matching the task_id (this saves a lot of memory).
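A minimal sketch of what the proposed method could do, written here as a free function over a SerializedDagModel row. This is an illustration, not an existing Airflow API: it assumes each entry in the serialized "tasks" list carries a top-level "task_id" key, and a real implementation would parse the JSON in a streaming way (e.g. with a library such as ijson) rather than pruning an already-loaded dict as done here.

from airflow.exceptions import AirflowException
from airflow.serialization.serialized_objects import SerializedDAG


def get_dag_with_task_id(dag_model, dag_id, task_id):
    # dag_model.data is the JSON dict stored in the serialized_dag table,
    # shaped like {"__version": ..., "dag": {..., "tasks": [...]}}.
    data = dag_model.data
    tasks = data["dag"]["tasks"]
    # Keep only the task we are about to run; everything else is dropped
    # before deserialization so the full DAG is never built in memory.
    wanted = [t for t in tasks if t.get("task_id") == task_id]
    if not wanted:
        raise AirflowException(
            "Task {} could not be found in serialized dag {}".format(task_id, dag_id)
        )
    data["dag"]["tasks"] = wanted
    return SerializedDAG.from_dict(data)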


Callbacks Discussion

This requires moving the callback execution from `airflow run local` back to `airflow run raw`, since callbacks require the Python objects from the parsed DAG.

Since the process owners of `airflow run local` and `airflow run raw` can differ, and callbacks are meant to run as the same user as `airflow run raw`, it might be better to run callbacks in the `airflow run raw` process.

Houqp (sorry, I don't know your username in Confluence), see https://github.com/apache/airflow/pull/10917#issuecomment-694567832 and let me know your thoughts. Having the callbacks in `airflow run raw` will make the `airflow run local` process simpler and more lightweight.
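To illustrate the idea (a hypothetical sketch, not the actual implementation): the raw process holds the real parsed task object, so it can invoke the task's own on_success_callback/on_failure_callback around execution.

def run_task_with_callbacks(ti, task):
    # `task` is the real parsed operator, available only in the raw
    # process; `ti` is the TaskInstance being run.
    context = ti.get_template_context()
    try:
        task.execute(context=context)
    except Exception:
        if task.on_failure_callback:
            task.on_failure_callback(context)
        raise
    if task.on_success_callback:
        task.on_success_callback(context)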


For the case where `airflow run raw` is force-killed, we can leverage the CallbackManager mentioned in AIP-43 DAG Processor separation: the `airflow run local` process creates an entry in the `callbacks` table and the CallbackManager decides where/how to run it. Alternatively, we delegate this to the scheduler, which marks the task instance as a zombie and creates/runs the callbacks.
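A rough sketch of that fallback path, assuming a `callbacks` table along the lines of AIP-43; the CallbackRecord model and its columns are illustrative only, not an existing schema.

from sqlalchemy import Boolean, Column, Integer, String

from airflow.models.base import Base
from airflow.utils.session import create_session


class CallbackRecord(Base):
    # Hypothetical ORM model for the `callbacks` table discussed in AIP-43.
    __tablename__ = "callbacks"
    id = Column(Integer, primary_key=True, autoincrement=True)
    dag_id = Column(String(250), nullable=False)
    task_id = Column(String(250), nullable=False)
    run_id = Column(String(250), nullable=False)
    is_failure_callback = Column(Boolean, default=True)


def record_pending_callback(dag_id, task_id, run_id, is_failure=True):
    # Called by `airflow run local` when the raw process disappears without
    # reporting back; the CallbackManager (or the scheduler) later picks the
    # row up and executes the callback where the DAG code is available.
    with create_session() as session:
        session.add(CallbackRecord(dag_id=dag_id, task_id=task_id,
                                   run_id=run_id,
                                   is_failure_callback=is_failure))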

Options:

We can use a feature flag to control this behavior, or make it the default behavior.
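As an example, the flag could be a boolean config option (a sketch only; the section/key name below is illustrative and does not exist yet):

from airflow.configuration import conf

# Hypothetical option; when False, tasks keep the current double-parsing path.
use_serialized_dag_for_run = conf.getboolean(
    "core", "run_task_from_serialized_dag", fallback=False
)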

What defines this AIP as "done"?

The `airflow run --local` process does not parse the DAG file.