Status

Motivation

Backfill is currently CLI only

Backfill can only be invoked via the CLI, but users don’t always have access to the CLI.  And since backfill is a synchronous, ad hoc process, if the CLI process dies, the backfill job dies with it.  It would be better if we could trigger backfills via the API, let the scheduler handle the job, and poll for status as desired.  It would also be nice if we could observe the progress of a backfill via the scheduler.

Backfill is a second scheduler

The CLI backfill command is in effect a second Airflow scheduler.  This has a few negative consequences.  Of course, maintaining two things is more effort than maintaining one.  But also, as Airflow evolves, having a second scheduler makes it hard to ensure that the behavior of the two does not diverge.  There is also potential upside to consolidation: bringing all scheduling decisions under a single Airflow scheduler could allow users to set different priorities for backfill and normally-scheduled runs.

Simplify other Airflow 3.0 work

This partly falls under “backfill is a second scheduler”, but I think it merits a specific call-out.  A number of initiatives coming in Airflow 3.0 may require changes to the scheduler.  Consolidating scheduling logic into one component may reduce the effort required for those changes.

Considerations

What change do you propose to make?

  • Remove the backfill scheduler
  • Make a “backfill job” a first class citizen
  • CLI backfill will no longer run any tasks; it will create a backfill job, and the scheduler will manage the job, creating the dag runs and scheduling the tasks
  • Backfill jobs are to be triggered asynchronously by API and managed by the scheduler
  • We should be able to view backfill jobs in the webserver and observe progress and status, and cancel or pause them


As elaborated on below, I propose to remove many parameters and do not intend to preserve all functionality presently available in backfill.  In particular, behavioral aspects that are specific to the old locally-run approach will be removed.

Example ORM model for managing backfill runs


Specific table design is TBD, but below are some examples.

Model to track each backfill job at high level:

# one record for each backfill job
class BackfillRun:
    id: int  # primary key
    dag_id: str

    # generic config dict to hold params such as
    # start_date, end_date, and run_backwards.
    # intended to be flexible enough to accommodate
    # potential changes to how backfill works,
    # e.g. changes in data completeness behavior
    config: dict
    dag_run_conf: dict
    clear_failed_tasks: bool
    clear_dag_run: bool
    is_paused: bool

Model to associate each backfill job with specific dag runs.

# mapping table to associate a backfill run with dag runs
# this may not be necessary if we make dag runs "immutable" 
# i.e. if we stop reusing them, but that's out of scope
class BackfillRunDagRun:
    id: int
    backfill_run_id: int
    dag_run_id: int
    is_deleted: bool
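As a sketch of how this mapping table could be used, here is a hypothetical progress calculation for a backfill job.  The tuple layout and the state names are assumptions for illustration, not a decided design.

```python
# Hypothetical progress calculation over the BackfillRunDagRun mapping.
# Row layout and dag run state names are illustrative only.

def backfill_progress(mapping_rows, dag_run_states):
    """mapping_rows: (backfill_run_id, dag_run_id, is_deleted) tuples for one job.
    dag_run_states: dag_run_id -> state ("success", "failed", "running", ...).
    Returns the fraction of non-deleted dag runs that have finished."""
    live = [row for row in mapping_rows if not row[2]]
    if not live:
        return 0.0
    done = sum(
        1 for row in live if dag_run_states.get(row[1]) in ("success", "failed")
    )
    return done / len(live)
```

The webserver could surface a fraction like this when a user observes a running backfill job.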

Scheduler implementation notes

The scheduler will need to manage creation of backfill dag runs in accordance with any concurrency limitations and in the order specified.  E.g. if depends_on_past is true, it's important that older runs be created before newer ones.  There is also a "reverse" option for backfill that I do not propose to remove.  Once a dag run is created, the tasks in it should be handled by the normal scheduler processes, and those processes should remain more or less unchanged by this AIP.
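To illustrate the ordering and concurrency constraints described above, here is a minimal sketch of how the scheduler might pick which backfill dag runs to create on a given pass.  The function name, the daily interval, and the parameter names are assumptions for illustration only.

```python
from datetime import date, timedelta

def dates_to_create(start, end, already_created, active_count,
                    max_active_runs, reverse=False):
    """Return the logical dates the scheduler may create dag runs for
    on this pass, without exceeding max_active_runs.
    Assumes a daily interval for simplicity."""
    all_dates = []
    d = start
    while d <= end:
        all_dates.append(d)
        d += timedelta(days=1)
    if reverse:
        all_dates.reverse()  # the "reverse" backfill option: newest first
    # preserve ordering so depends_on_past works: earlier runs created first
    pending = [d for d in all_dates if d not in already_created]
    open_slots = max(max_active_runs - active_count, 0)
    return pending[:open_slots]
```

Calling this on each scheduler loop keeps dag run creation incremental, which matters because a backfill may span a long period of time.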

Are there any downsides to this change?

No more local run

The old backfill provided a mechanism for local ad hoc runs: you could set the executor to the local executor via an env var, trigger a backfill, and it would run locally.  This will go away; all tasks will be handled by the scheduler with the configured executor.

Scheduler load increase

This will result in increased load on the scheduler.  It won’t necessarily tank scheduler performance, but there will simply be more for the scheduler to do when backfill is used.  The scheduler’s job will be more complicated because it now has to make decisions about backfill tasks too.  It will also have to handle characteristics unique to backfills, like creating the dag runs sequentially over a potentially long period of time.  Ensuring that we do this in a performant manner is non-trivial.

Which users are affected by the change?

Any user who uses backfill.  The behavior will change somewhat.  The most significant difference is that execution will no longer be local.

How are users affected by the change? (e.g. DB upgrade required?)

DB migrations will be required to create the tables necessary for tracking backfill jobs.

Some backfill parameters will go away.

UI work

Some UI work will be required to allow users to create, observe, and manage backfill jobs.

In addition to creating jobs, we should be able to pause and cancel jobs, and view backfill job progress and history.
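Pausing and cancelling imply a small lifecycle for a backfill job.  As a sketch, the allowed transitions might look like the following; the state names are assumptions, not a decided design.

```python
# Hypothetical backfill-job lifecycle; state names are illustrative only.
ALLOWED_TRANSITIONS = {
    "queued": {"running", "cancelled"},
    "running": {"paused", "cancelled", "complete", "failed"},
    "paused": {"running", "cancelled"},
    # complete, failed, and cancelled are terminal
}

def transition(state: str, target: str) -> str:
    """Move a backfill job to a new state, rejecting illegal transitions."""
    if target not in ALLOWED_TRANSITIONS.get(state, set()):
        raise ValueError(f"cannot move backfill job from {state!r} to {target!r}")
    return target
```

The API and UI pause/cancel actions would both route through checks like this, so the scheduler never sees an inconsistent job state.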

Other considerations / difficulties

Some features and aspects of backfill are not well understood.

For example, why and how does a backfill job become deadlocked?  I suspect that deadlocking is more likely when users elect to run a partial subset of the dag (with the “task_regex” option).  How will this be affected by moving management to the scheduler?  What changes to the scheduler code will be required to appropriately catch this type of occurrence?

Changes to params

Remove

  • continue_on_failures
    • If true: after each batch of dag runs, evaluate whether there are any task errors; if so, do not continue with the next dag runs.
    • It seems like a pretty marginal feature, so I am inclined to leave it out initially.
  • ignore_first_depends_on_past
    • (already deprecated) is True always
  • treat_dag_as_regex
    • (already deprecated in favor of treat_dag_id_as_regex)
  • treat_dag_id_as_regex
    • Since backfill is now about creating distinct, per-dag jobs, matching multiple dags with a regex no longer fits
  • task_regex
    • this runs backfill on a portion of the dag, i.e. the subset of tasks that match the regex.  I am skeptical of how useful / important it is, so I am inclined to propose removing it and see if there’s pushback.
  • ignore_dependencies
    • this is a confusing parameter. we should get rid of it.
  • local 
    • this tries to override the configured executor and use the local executor
    • this will no longer be supported
  • donot_pickle
    • this should be removed in 3.0 generally
  • delay_on_limit_secs
    • not applicable since will now be handled by scheduler
  • run_at_least_once: even if no execution date falls in range, still run once
    • This kind of only makes sense as an ad hoc run.  There is no reason to use backfill for this; use trigger dag run instead.
  • verbose
    • Not running locally anymore so doesn’t really apply
  • subdir
    • Don’t think this makes sense anymore

Add

  • config
    • this is an attempt at being more generic so that we don’t need to change the CLI command to add new types of backfill or approaches to backfill, as may be delivered in other AIPs

example: {"type": "schedule", "start_date": "2024-01-01", "end_date": "2024-02-01"}
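A sketch of how such a config might be validated server-side.  The accepted keys and the single "schedule" type are assumptions; future AIPs could extend both.

```python
from datetime import date

# Hypothetical set of backfill types; other AIPs may add more,
# e.g. partition-driven backfills that are not date-range based.
KNOWN_TYPES = {"schedule"}

def validate_backfill_config(config: dict) -> dict:
    """Check a generic backfill config dict and normalize its dates."""
    if config.get("type") not in KNOWN_TYPES:
        raise ValueError(f"unknown backfill type: {config.get('type')!r}")
    start = date.fromisoformat(config["start_date"])
    end = date.fromisoformat(config["end_date"])
    if start > end:
        raise ValueError("start_date must not be after end_date")
    return {"type": config["type"], "start_date": start, "end_date": end}
```

Keeping validation behind the "type" key means a new backfill approach can add its own keys without changing the CLI or API surface.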


Out of scope

“Other kinds” of backfill

This AIP is focused on how backfill is invoked, observed, and managed, but not what backfill means.  Other AIPs are looking at evolving Airflow’s data awareness, so that backfill could mean, for example, creating one task run for a larger date range to “backfill” data (instead of, say, a million task runs with 5 minute intervals).  Or perhaps, partitions that are not date-driven.  We are not directly dealing with that here; we’re just moving execution from CLI to scheduler.  But since these initiatives are running in parallel, in order to remain flexible, for now I am proposing to use a generic json config object to create the backfill run, as discussed above.

Task priority

It’s been floated that we should be able to set different priorities for backfill and normally-scheduled dag runs.  We probably should.  But the current AIP is focused on moving backfill from a synchronous CLI process to an asynchronously-submitted, scheduler-managed process.

Interactions with other AIPs

Other AIPs may have some impact on how this AIP shakes out.  For example, currently the dag run table is the store of data-completeness state in Airflow, and you can only have one dag run per execution date.  So, for example, a dag run can start its life as a “scheduled” run, then get manually cleared and become a “manual” dag run.  If that changes, it affects how backfill works, and perhaps the interface or the table structure we might want to use.  But given the timeline for 3.0, we need to work in parallel, so we have to accept some uncertainty.

What defines this AIP as "done"?

  • There are API endpoints that allow CRUD operations for backfill runs
  • We can create a backfill run for a particular time range and scheduler will create and manage the dag runs
  • There is a CLI interface that allows basic operations for backfill
  • There is a way in the web UI to create, monitor and manage backfill jobs.
  • Docs have been updated appropriately

Other ideas under consideration

  • Add extra concurrency control on dag run
  • Apply max active dag runs separately for backfill
  • Override any dag param in creating the backfill job and it’s only applied in that scope