Status
This AIP is part of AIP-63, which aims to add DAG Versioning to Airflow.
Motivation
Today TaskInstance doesn’t maintain any history for prior tries of a task - when a retry starts, it resets everything about the prior try. This means one is unable to know even simple things about a prior try without looking at the task logs. This also means it’s impractical to show this level of detail in the UI for users.
Considerations
To start, we will take this opportunity to create a synthetic primary key on TaskInstance to make it easier to reference a specific task instance. We will also adjust the foreign key on TaskInstance to use the DagRun synthetic key (id) instead of the logical key (dag_id, run_id).
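For illustration only, a minimal SQLAlchemy sketch of what these key changes could look like (the column names, types, and the choice of integer vs. UUID are assumptions, not the final schema):

```python
from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class TaskInstance(Base):
    """Sketch only: a synthetic primary key plus an FK to DagRun's synthetic id."""
    __tablename__ = "task_instance"

    id = Column(Integer, primary_key=True, autoincrement=True)  # synthetic key; could also be a UUID
    dag_run_id = Column(Integer, ForeignKey("dag_run.id"), nullable=False)  # replaces the (dag_id, run_id) FK
    task_id = Column(String(250), nullable=False)
    map_index = Column(Integer, nullable=False, default=-1)
    try_number = Column(Integer, nullable=False, default=0)
    state = Column(String(20))
```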
At a high level, Airflow will need to store the details of a TaskInstance try. We have a few options for how to achieve this, and we will determine the best approach as we roll up our sleeves during implementation. The options include:
- Track the complete try details in another table, something like `task_instance_tries`. Optionally keep the logical key portion of the existing `task_instance` table in that table, with select denormalized columns (like `state`), and possibly a FK to the tries row (see the sketch after this list).
- Add `try_number` to the logical key of `task_instance`. This might have query complexity concerns, however.
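Continuing the illustration, a hedged sketch of the first option, a separate `task_instance_tries` table (all names and columns below are assumptions; in practice every mutable TaskInstance column would be copied per try):

```python
from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class TaskInstanceTry(Base):
    """Sketch of the first option: one row per try of a task instance."""
    __tablename__ = "task_instance_tries"

    id = Column(Integer, primary_key=True, autoincrement=True)
    # Assumes the synthetic task_instance.id sketched earlier; the logical key
    # (dag_id, task_id, run_id, map_index) could be carried here instead or as well.
    task_instance_id = Column(Integer, ForeignKey("task_instance.id"), nullable=False)
    try_number = Column(Integer, nullable=False)
    # Every per-try column (state, start_date, end_date, hostname, ...) would live
    # here; only a couple are shown. `state` could stay denormalized on
    # task_instance as well, possibly alongside an FK to the latest tries row.
    state = Column(String(20))
    hostname = Column(String(1000))
```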
One thing to keep in mind is that every column on a task instance other than dag_id, task_id, run_id, and map_index can change between tries, so we have to keep all of them for a complete history.
A focal point will be ensuring that the query the scheduler uses to decide which tasks are ready remains performant. We will, however, consider other common access patterns as well, like ensuring the UI dashboard stays performant enough.
TaskInstance is a widely used entity in Airflow, so backward compatibility will be maintained for places where user code interacts directly with TIs (e.g. task context).
The UI will utilize the familiar try buttons, already used today for task logs, in the TI views.
The REST API endpoints will also be updated as appropriate. For example, the `dags/{dag_id}/dagRuns/{dag_run_id}/taskInstances/{task_id}` endpoint will return the latest Task Instance. An additional `dags/{dag_id}/dagRuns/{dag_run_id}/taskInstances/{task_id}/tries/{try_number}` endpoint will be added to allow retrieval of earlier tries.
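As a usage illustration only, fetching a specific earlier try via the proposed endpoint might look like the sketch below (the base URL, credentials, identifiers, and response shape are all assumptions):

```python
import requests

BASE_URL = "http://localhost:8080/api/v1"  # assumed webserver URL

# Hypothetical identifiers, for illustration only.
dag_id, run_id, task_id, try_number = "example_dag", "manual_run_1", "extract", 1

# Latest try, via the existing endpoint shape:
latest = requests.get(
    f"{BASE_URL}/dags/{dag_id}/dagRuns/{run_id}/taskInstances/{task_id}",
    auth=("admin", "admin"),  # assumed basic-auth credentials
)

# A specific earlier try, via the proposed tries endpoint:
earlier = requests.get(
    f"{BASE_URL}/dags/{dag_id}/dagRuns/{run_id}/taskInstances/{task_id}/tries/{try_number}",
    auth=("admin", "admin"),
)
earlier.raise_for_status()
print(earlier.json().get("state"))  # response shape is an assumption
```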
This AIP will be considered done when TaskInstance history is tracked, and basic UI/API/CLI functionality exposes that information to users.
8 Comments
Andrey Anshin
What if we create a synthetic/surrogate key for the entire task instance? E.g. a UUID5/UUID3 of (dag_id, task_id, run_id, map_index). This might give a storage benefit in the DB tables that link to it, because we would not need to store the entire key or build such a complex index/PK.
The composite key takes 1 + (1..250) + 1 + (1..250) + 1 + (1..250) + 4 bytes, i.e. from 10 to 757 bytes (without data alignment overhead).
A UUID could be calculated on the client side before sending the request to Postgres:
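For illustration, a minimal sketch of such a client-side UUID5 calculation (the namespace UUID and the key serialization are assumptions, not anything Airflow defines today):

```python
import uuid

# Hypothetical namespace for TaskInstance keys (an assumption for this sketch).
TI_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://airflow.apache.org/task-instance")

def ti_uuid(dag_id: str, task_id: str, run_id: str, map_index: int = -1) -> uuid.UUID:
    """Derive a deterministic UUID5 from the TaskInstance logical key."""
    return uuid.uuid5(TI_NAMESPACE, f"{dag_id}.{task_id}.{run_id}.{map_index}")

print(ti_uuid("example_dag", "extract", "manual_run_1"))
```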
Jedidiah Cunningham
I intentionally didn't specify what the synthetic key would be so we have flexibility. I want to think through UUID4 vs UUID5 vs something else more before we commit to something long term.
Hussein Awala
+1 for this nice feature. However, this will cause the database size to increase, and many users do not care about previous retries. So I'm wondering if we can add a configuration option to determine how many retries to keep in the database, with 1 as the default value (we could also set it to -1 to enable the feature by default and add a doc explaining how to configure Airflow to achieve the previous behavior).
Jens Scheffler
I am very much +1 for this, because other than parsing logs or collecting external data points there is no good "structured" way to analyze failed tries in the metadata DB.
I would assume the main pain points around sizing are already described. Nevertheless, I'm looking forward to the proposed target model. As somebody with a DBA background, maybe I can also contribute a few nits. Let me know if it helps. (For example, there are a couple of other string fields which, if normalized, could dramatically reduce storage overhead.)
From the point of view of sizing and retries, I would hope and assume that in a high-traffic setup fewer than 10% of tasks need a retry; otherwise you should think about operational stability. In the other (hopefully?) rare cases where a retry is somewhat misused for calculated side effects, I hope traffic is rather low to medium. (I remember the recent UI fix where somebody reported the UI is unreadable with more than 128 retries - but at least now we have a drop-down to substitute the buttons.)
Shubham Mehta
I am against new configurations by default, but I have to agree with Hussein on this one. Users should have the freedom to control this, as DB size becomes especially important when you're in-place upgrading to new Airflow versions or upgrading your DB engine.
Amogh Desai
I am with Hussein on this one too. Users should have the flexibility to control this number, either through configuration in the Airflow instance, or we can set up the necessary tables by default and use them only if configured, to maintain backward compatibility and reduce complexity.
Jens Scheffler
I also like configurability a lot - but in this case I do not feel this helps. If you make the number of versions configurable then with every retry you need to check & purge records.
In that case I'd rather propose adding a new purge option to the `airflow db clean` command to off-load this to a nightly job.
Do we have any figures at hand from running installations? How many tasks are we talking about? What is your "in the wild" number of rows it would grow by with all retries?
On our side: we have had, at peak times, 1.2 million tasks per day. Without explicit counting I'd assume we might have around 500-1000 retries in total. (Luckily we are down to 100k tasks/day currently; still, with 500-1000 tasks in retry per day the overhead is negligible.)
Jedidiah Cunningham
I'm in agreement with Jens (and I went and crunched some numbers to confirm) - the vast majority of task runs are the first try, and if you have DB size pressure from rows for subsequent tries, you likely have bigger problems to solve first.
I think the current behavior of "throw away the old try" is bad and definitely shouldn't be the default, if we do end up making it configurable at all.
I'm not convinced we should add extra complexity (code-wise for us, or config-wise for deployment managers) for this. TI is such a key entity in Airflow too; I'm just not sure having try-level retention rules makes sense in the big picture. I think there are much bigger foot-guns we could look at first for DB size concerns (cough, xcom, log).
I'm in favor of keeping all try history and using existing tooling to handle cleanup (e.g. db clean).