Status

State: Completed
Discussion Thread:
Vote Thread: https://lists.apache.org/thread/cdrxd4tsq982gjmbbl32vp2ygt9dxgpk
Vote Result Thread: https://lists.apache.org/thread/s8vgjfkhz8z8t27vdvofc7w22jwzvz0y
Progress Tracking (PR/GitHub Project/Issue Label):
Umbrella AIP: AIP-63: DAG Versioning
Date Created: 2024-03-05
Version Released: 2.10
Authors:

This AIP is part of AIP-63, which aims to add DAG Versioning to Airflow.

Motivation

Today TaskInstance doesn’t maintain any history for prior tries of a task - when a retry starts, it resets everything about the prior try. This means one is unable to know even simple things about a prior try without looking at the task logs. This also means it’s impractical to show this level of detail in the UI for users.

Considerations

To start, we will take this opportunity to create a synthetic primary key on TaskInstance to make it easier to reference a specific task instance. We will also adjust the foreign key on TaskInstance to use the DagRun synthetic key (id) instead of the logical key (dag_id, run_id).
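
A minimal sketch of what these two key changes could look like, using SQLite from Python's standard library purely for illustration. The integer key type, the reduced column set, and the constraint layout are assumptions made for the example, not the schema this AIP commits to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript(
    """
    CREATE TABLE dag_run (
        id     INTEGER PRIMARY KEY,            -- DagRun synthetic key
        dag_id TEXT NOT NULL,
        run_id TEXT NOT NULL,
        UNIQUE (dag_id, run_id)                -- logical key
    );
    CREATE TABLE task_instance (
        id         INTEGER PRIMARY KEY,        -- new synthetic key (type is assumed)
        dag_run_id INTEGER NOT NULL REFERENCES dag_run (id),
        dag_id     TEXT NOT NULL,
        task_id    TEXT NOT NULL,
        run_id     TEXT NOT NULL,
        map_index  INTEGER NOT NULL DEFAULT -1,
        state      TEXT,
        UNIQUE (dag_id, task_id, run_id, map_index)
    );
    """
)

run_pk = conn.execute(
    "INSERT INTO dag_run (dag_id, run_id) VALUES (?, ?)",
    ("example_dag", "manual__1"),
).lastrowid
ti_pk = conn.execute(
    "INSERT INTO task_instance (dag_run_id, dag_id, task_id, run_id, state) "
    "VALUES (?, ?, ?, ?, ?)",
    (run_pk, "example_dag", "extract", "manual__1", "running"),
).lastrowid

# One value is now enough to reference a specific task instance.
row = conn.execute(
    "SELECT dag_id, run_id, task_id FROM task_instance WHERE id = ?", (ti_pk,)
).fetchone()

# The FK points at dag_run.id, so a dangling reference is rejected.
try:
    conn.execute(
        "INSERT INTO task_instance (dag_run_id, dag_id, task_id, run_id) "
        "VALUES (999, 'x', 'y', 'z')"
    )
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True
```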

At a high level, Airflow will need to store the details of a TaskInstance try. We have a few options on how to achieve this, and we will determine the best approach as we roll up our sleeves during implementation. A few options include:

  • Track the complete try details in another table, something like `task_instance_tries`. Optionally keep the logical key portion of the existing `task_instance` table in that table, with select denormalized columns (like `state`), and possibly a FK to the tries row.
  • Add `try_number` to the logical key of `task_instance`. This might have query complexity concerns, however.

One thing to keep in mind is that every column on a task instance other than dag_id, task_id, run_id, and map_index can change between tries, so we have to keep all of them for a complete history.
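
To make the first option concrete, here is a rough sketch that archives the complete row when a retry starts. It uses SQLite for illustration only; the column subset, the `start_retry` helper, and the exact shape of `task_instance_tries` are hypothetical, since the AIP leaves the final design to implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.executescript(
    """
    CREATE TABLE task_instance (
        dag_id TEXT, task_id TEXT, run_id TEXT, map_index INTEGER,
        try_number INTEGER, state TEXT, hostname TEXT, duration REAL,
        PRIMARY KEY (dag_id, task_id, run_id, map_index)
    );
    -- Hypothetical history table: same columns, try_number joins the key.
    CREATE TABLE task_instance_tries (
        dag_id TEXT, task_id TEXT, run_id TEXT, map_index INTEGER,
        try_number INTEGER, state TEXT, hostname TEXT, duration REAL,
        PRIMARY KEY (dag_id, task_id, run_id, map_index, try_number)
    );
    """
)

KEY = ("example_dag", "extract", "manual__1", -1)
WHERE = "dag_id = ? AND task_id = ? AND run_id = ? AND map_index = ?"


def start_retry(conn, key):
    """Copy the finished try into the history table, then reset the live row."""
    with conn:  # both statements commit together or not at all
        conn.execute(
            f"INSERT INTO task_instance_tries SELECT * FROM task_instance WHERE {WHERE}",
            key,
        )
        conn.execute(
            "UPDATE task_instance SET try_number = try_number + 1, "
            f"state = 'queued', hostname = NULL, duration = NULL WHERE {WHERE}",
            key,
        )


conn.execute(
    "INSERT INTO task_instance VALUES (?, ?, ?, ?, 1, 'failed', 'worker-a', 12.5)", KEY
)
start_retry(conn, KEY)

live = conn.execute(f"SELECT * FROM task_instance WHERE {WHERE}", KEY).fetchone()
tries = conn.execute(
    "SELECT try_number, state, hostname FROM task_instance_tries"
).fetchall()
```

Copying the whole row, rather than a hand-picked set of columns, reflects the point that nearly every column can change between tries. The live table keeps its current key, so a query for the latest state of each task instance does not have to filter by `try_number`.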

A focal point will be ensuring that the query the scheduler uses to decide which tasks are ready remains performant. We will, however, consider other common access patterns as well, such as keeping the UI dashboard performant enough.

TaskInstance is a widely used entity in Airflow, so backward compatibility will be maintained for places where user code interacts directly with TIs (e.g. task context).

In the TI views, the UI will reuse the familiar try buttons in use today for task logs.

The REST API endpoints will also be updated as appropriate. For example, the `dags/{dag_id}/dagRuns/{dag_run_id}/taskInstances/{task_id}` endpoint will return the latest Task Instance. An additional `dags/{dag_id}/dagRuns/{dag_run_id}/taskInstances/{task_id}/tries/{try_number}` endpoint will be added to allow retrieval of earlier Task Instances.
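
As a small illustration of how a client might address the two endpoints, the helper below builds the relative paths from the templates above. The function name `task_instance_path` is hypothetical, and the API base URL and authentication are deliberately left out because they depend on the deployment.

```python
from urllib.parse import quote


def task_instance_path(dag_id, dag_run_id, task_id, try_number=None):
    """Build the relative path for the latest TI, or for one specific try."""
    parts = ["dags", dag_id, "dagRuns", dag_run_id, "taskInstances", task_id]
    if try_number is not None:
        parts += ["tries", str(try_number)]
    # Percent-encode each segment so run ids containing ':' or '+' stay valid.
    return "/".join(quote(part, safe="") for part in parts)


latest = task_instance_path("example_dag", "manual__1", "extract")
first_try = task_instance_path("example_dag", "manual__1", "extract", try_number=1)
```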

This AIP will be considered done when TaskInstance history is tracked, and basic UI/API/CLI functionality exposes that information to users.

8 Comments

  1. What if we create a synthetic/surrogate key for the entire task instance, e.g. a UUID5/UUID3 of dag_id, task_id, run_id, and map_index? This might reduce storage in the linked DB tables, because they would not need to store the entire key or build such a complex index / PK.


    Database | UUID size (constant)              | dag_id + task_id (flexible)
    PG       | 128 bit / 16 bytes (native)       | 1 + (1..250) + 1 + (1..250) + 1 + (1..250) + 4 bytes, or from 10 bytes to 757 (without data aligning overhead)
    MySQL    | 36 (size of string repr) + 1 byte | 1 + (1..250) + 1 + (1..250) + 1 + (1..250) + 4 bytes, or from 10 bytes to 757


    The UUID could be calculated on the client side before sending the request to Postgres:

    • It should reduce select time for an exact task instance when accessed by primary key (task_instance table) or by composite keys (linked tables)
    • It should increase insert/update/delete time on the task_instance table
    • It should reduce insert/update/delete time on linked tables, such as task_instance_role, task_instance_reschedule, task_map, rendered_task_instance_field, task_fail and maybe xcom
    • Traffic between the DB and Airflow components would be slightly reduced
    • It requires a migration (and another each time we extend the task instance key)
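
The client-side calculation and the size comparison above can be sketched as follows. The namespace value, the field separator, and both helper names are assumptions for illustration; the AIP itself does not commit to UUID5 or to any particular key type.

```python
import uuid

# Hypothetical namespace: any fixed UUID works as long as all clients share it.
TI_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "airflow/task_instance")


def task_instance_uuid(dag_id, task_id, run_id, map_index):
    """Derive a deterministic UUID5 from the logical task instance key."""
    # The unit separator keeps ("a", "bc") and ("ab", "c") from colliding.
    name = "\x1f".join([dag_id, task_id, run_id, str(map_index)])
    return uuid.uuid5(TI_NAMESPACE, name)


def composite_key_bytes(dag_id, task_id, run_id):
    """Size estimate from the table: 1 length byte per string + 4-byte map_index."""
    strings = (dag_id, task_id, run_id)
    return sum(1 + len(s.encode("utf-8")) for s in strings) + 4


ti_uuid = task_instance_uuid("example_dag", "extract", "manual__1", -1)
smallest = composite_key_bytes("a", "b", "c")
largest = composite_key_bytes("d" * 250, "t" * 250, "r" * 250)
```

With these inputs the composite key ranges from 10 to 757 bytes, while the UUID is always 16 bytes in binary form and 36 characters as a string.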
    1. I intentionally didn't specify what the synthetic key would be so we have flexibility (smile). I want to think through UUID4 vs. UUID5 vs. something else some more before we commit to something long term.

  2. +1 for this nice feature. However, it will increase the database size, and many users do not care about previous retries. So I'm wondering if we can add a configuration option to determine how many retries to keep in the database, with 1 as the default value (we could also set it to -1 to enable the feature by default and add documentation explaining how to configure Airflow to get the previous behavior).

    1. I am very much +1 for this, because other than parsing logs or creating external data points there is no good "structured" way to analyze failures in the metadata DB.

      I would assume the main pain points regarding sizing have been described. Nevertheless, I am looking forward to the proposed target model. As somebody with a DBA background, maybe I can also contribute a few nits. Let me know if it helps. (For example, there are a couple of other string fields which, if normalized, could dramatically reduce storage overhead.)

      In terms of sizing and retries, I would hope and assume that in a high-traffic setup fewer than 10% of tasks need a retry; otherwise you should think about operational stability. In other (hopefully rare) cases where retries are misused for calculated side effects, I hope traffic is rather low to medium. (I remember the recent UI fix after somebody reported that the UI is unreadable with more than 128 retries (big grin), but at least we now have a drop-down instead of buttons (big grin))

    2. I am against new configurations by default, but I have to agree with Hussein on this one. Users should have the freedom to control this, as DB size becomes especially important when you're upgrading in place to new Airflow versions or upgrading your DB engine.

    3. I am with Hussein on this one too. Users should have the flexibility to control this number, either through configuration in the airflow instance or we can set up the necessary tables by default and use them only if configured to maintain backward compatibility and reduce complexities.

    4. I also like configurability a lot - but in this case I do not feel this helps. If you make the number of versions configurable then with every retry you need to check & purge records.

      In that case I'd rather propose adding a new purge option to the `airflow db clean` command to offload this to a nightly job.

      Do we have any figures at hand from running installations? How many tasks are we talking about? What is your "in the wild" number of additional rows across all retries?

      On our side: at peak times we have had 1.2 million tasks per day. Without explicitly counting, I'd assume we had around 500-1000 retries in total. (Luckily we are down to 100k tasks/day currently; even with 500-1000 tasks in retry per day the overhead is negligible.)

    5. I'm in agreement with Jens (and I went and crunched some numbers to confirm): the vast majority of task runs are the first try, and if you have DB size pressure from the rows for subsequent tries, you likely have bigger problems to solve first.

      I think the current behavior of "throw away the old try" is bad and definitely shouldn't be the default, if we do end up making it configurable at all.

      I'm not convinced we should add extra complexity (code wise for us, or config wise for deployment managers) for this. TI is such a key entity in Airflow too, I'm just not sure having try level retention rules makes sense in the big picture. I think there are much bigger foot-guns we could look at first for db size concerns (cough, xcom, log).

      I'm in favor of keeping all try history and using existing tooling to handle cleanup (e.g. db clean).