Status
Motivation
Airflow uses synchronous programming in all of its components except the Triggerer, which manages and executes triggers asynchronously. The other components contain many blocking I/O operations, particularly the SQLAlchemy queries issued by the Scheduler, API, Webserver, and Executor, as well as the methods remote executors use to execute and monitor Airflow tasks in external systems. By migrating these methods and queries to asynchronous programming, Airflow can become significantly faster and consume fewer resources, keeping the CPU busy instead of idling while waiting on I/O.
Considerations
One requirement of this migration is to integrate asynchronous SQLAlchemy into the Airflow core and provide an asynchronous session for use in asyncio coroutines. This integration can reduce the execution time of workloads composed of multiple independent queries and let Airflow run them concurrently with low resource consumption.
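To make the benefit concrete, here is a minimal stdlib-only sketch of the concurrency pattern the async session would enable. The coroutines below are stand-ins: `asyncio.sleep` simulates the I/O wait of an independent SQLAlchemy query, and the table names are illustrative only.

```python
import asyncio
import time

async def fake_query(name: str, delay: float) -> str:
    # Stand-in for an awaitable database query; while this coroutine waits,
    # the event loop is free to advance the other "queries".
    await asyncio.sleep(delay)
    return name

async def main() -> float:
    start = time.monotonic()
    # Three independent "queries" overlap their I/O waits instead of
    # running one after another.
    results = await asyncio.gather(
        fake_query("dag_runs", 0.1),
        fake_query("task_instances", 0.1),
        fake_query("pools", 0.1),
    )
    assert results == ["dag_runs", "task_instances", "pools"]
    return time.monotonic() - start

elapsed = asyncio.run(main())
# Wall time is close to the longest single wait (~0.1 s), not the 0.3 s sum.
```

The same shape applies with an async SQLAlchemy session: each concurrent task uses its own session, and only queries with no data dependency on each other are gathered.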
What change do you propose to make?
This AIP proposes:
- Migrating the API and WebServer to asynchronous programming
- Migrating the Scheduler to asynchronous programming
- Migrating the Executors to asynchronous programming
Steps to make it happen include:
- Adding support for asynchronous SQLAlchemy in the Airflow core by deriving an asynchronous connection (SQLAlchemy engine) from the metadata database connection configuration
- Implementing an asynchronous version of the secrets backend to retrieve connections and variables asynchronously
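The second step could be shaped roughly as follows. This is a hypothetical sketch, not the actual Airflow API: the class and method names (`AsyncSecretsBackend`, `get_connection`, `get_variable`) are assumptions modeled on the existing synchronous secrets backend interface, and the in-memory implementation exists only to make the sketch runnable.

```python
import asyncio
from typing import Optional

class AsyncSecretsBackend:
    """Hypothetical async counterpart to the synchronous secrets backend
    (illustrative names, not Airflow's real interface)."""

    async def get_connection(self, conn_id: str) -> Optional[str]:
        raise NotImplementedError

    async def get_variable(self, key: str) -> Optional[str]:
        raise NotImplementedError

class InMemoryAsyncBackend(AsyncSecretsBackend):
    def __init__(self, connections: dict, variables: dict) -> None:
        self._connections = connections
        self._variables = variables

    async def get_connection(self, conn_id: str) -> Optional[str]:
        await asyncio.sleep(0)  # yield to the loop, as a real network call would
        return self._connections.get(conn_id)

    async def get_variable(self, key: str) -> Optional[str]:
        await asyncio.sleep(0)
        return self._variables.get(key)

async def main():
    backend = InMemoryAsyncBackend(
        {"pg_default": "postgresql://localhost/airflow"},
        {"env": "prod"},
    )
    # Independent lookups can be issued concurrently instead of serially.
    return await asyncio.gather(
        backend.get_connection("pg_default"),
        backend.get_variable("env"),
    )

conn, var = asyncio.run(main())
```

A real backend (Vault, AWS Secrets Manager, etc.) would replace the dictionary lookups with awaitable HTTP calls, which is where the concurrency actually pays off.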
What problem does it solve?
- Reduces latency and improves throughput for I/O-bound operations (DB queries, remote calls, and secret fetches)
- Better utilization of multi-core systems via true asynchronous execution paths
- Better code decoupling through making independent sub-components run as separate tasks
Why is it needed?
Airflow’s execution model currently relies heavily on synchronous blocking calls that serialize I/O operations, resulting in inefficient CPU utilization, slower response times, and scalability bottlenecks in high-load environments. Modern workloads—especially those with many concurrent DAG runs or external API interactions—require non-blocking concurrency to manage hundreds or thousands of parallel tasks efficiently. By adopting asynchronous programming using asyncio and asynchronous SQLAlchemy sessions, Airflow can improve performance, reduce resource overhead per process, and achieve more consistent task scheduling under load. This evolution aligns Airflow’s architecture with industry trends toward async-native systems.
In addition, async allows components to be divided into sub-components in a natural way. For example, the scheduler's loop consists of largely unrelated routines that can and should be decoupled to make the code easier to maintain.
Are there any downsides to this change?
- Increased complexity: Introducing asynchronous code paths can make debugging, logging, tracing, and testing more complex compared to synchronous flows
- Compatibility risk: Custom plugins, operators, or executors that rely on synchronous hooks or blocking DB operations may need updates to support async behavior
- Gradual migration: Full migration of all components won't happen at once, so hybrid async/sync operation modes must be supported temporarily
Which users are affected by the change?
- Those running at scale (e.g., high DAG concurrency, heavy task monitoring) will benefit from the change
- Core developers will benefit from easier maintenance, as independent sub-components run as separate asynchronous tasks
Implementation details
Scheduler
The scheduler consists of multiple components that are mostly independent from one another:
- DagRun Creation Routine
- DagRun "Starting" Routine
- Task Instance Scheduling Routine
- Task Enqueuing Routine (Critical Section)
Running them sequentially leads to throughput inefficiencies and forces careful tuning of ceiling parameters to ensure no section runs for too long. Running them as separate concurrent tasks would remove this inefficiency by allowing each routine to advance at its own pace.
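The decoupling described above can be sketched with plain asyncio. The routine names and periods below are placeholders for the four scheduler routines; each task loops at its own cadence instead of waiting its turn in one sequential loop.

```python
import asyncio

async def routine(name: str, period: float, ticks: int, log: list) -> None:
    # Each scheduler routine advances at its own pace; awaiting here hands
    # control back to the event loop so the other routines can progress.
    for _ in range(ticks):
        await asyncio.sleep(period)
        log.append(name)

async def main() -> list:
    log: list = []
    # Placeholder routines standing in for DagRun creation, DagRun starting,
    # and task instance scheduling, each with a different natural period.
    await asyncio.gather(
        routine("create_dagruns", 0.01, 3, log),
        routine("start_dagruns", 0.02, 2, log),
        routine("schedule_tis", 0.03, 1, log),
    )
    return log

log = asyncio.run(main())
# The log interleaves: the fast routine is never blocked behind the slow ones.
```

In a real implementation each routine would also need its own database session and careful attention to shared state, as the comments below point out.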
4 Comments
Theo S.
An asynchronous scheduler implementation would go a long way toward decoupling its several independent components (roughly, the routines listed above).
Since there is no logical coupling between these steps in the scheduling loop, it is natural to make them fully independent.
natanel
Hello, is this AIP still active? May I contribute to it?
I think supporting async in Airflow is a good idea. As Theo S. said, the scheduler would probably see the biggest performance gains. Making the DAG processor async could also save time: most of its time is spent waiting on I/O while the CPU sits idle, so async would reduce resource usage as well.
Theo S., as for the scheduler, I would put every "critical section" in an async runtime. Beyond what you wrote, that also includes task creation, DAG callback scheduling, task adoption, and more, in addition to executor synchronization.
Theo S.
There is enormous value to be had from an async scheduler, considering the number of SQL queries involved, including the heavy ones.
Jarek Potiuk
Just to comment on that - MAYBE - SOME queries. Async is not magic; it does not mean things suddenly run 10x faster. It means that if you manage to make the calling code behave "concurrently" (not in parallel, but concurrently), then you can get some gains. The main reason async is perceived as faster is not because it is faster on its own, but because it allows concurrent execution of I/O-heavy operations (i.e. fast context switching while waiting for I/O). This only helps if you can actually run some I/O operations at the same time. If one I/O operation (a SQL query in this case) has to wait for another I/O operation (another query) to finish before it can start, the gains are 0 (nil).
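This distinction - that async only pays off when I/O waits can actually overlap - can be illustrated with a toy timing experiment. This is purely illustrative: `asyncio.sleep` stands in for query round-trip latency, and the "queries" are placeholders.

```python
import asyncio
import time

async def query(delay: float, result: int) -> int:
    await asyncio.sleep(delay)  # stands in for SQL round-trip latency
    return result

async def dependent_chain() -> float:
    # The second "query" needs the first one's result, so the waits
    # cannot overlap: async brings no speedup here.
    start = time.monotonic()
    first = await query(0.1, 1)
    await query(0.1, first + 1)
    return time.monotonic() - start

async def independent_pair() -> float:
    # Two unrelated "queries" can overlap their waits.
    start = time.monotonic()
    await asyncio.gather(query(0.1, 1), query(0.1, 2))
    return time.monotonic() - start

seq = asyncio.run(dependent_chain())   # ~0.2 s: waits are serialized
par = asyncio.run(independent_pair())  # ~0.1 s: waits overlap
```

The dependent chain takes the full sum of the delays no matter how "async" the code is, which is exactly the situation in today's scheduler loop where each routine consumes the previous one's output.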
We currently have a single, synchronous scheduler loop in Airflow. Its routines are sync - but more importantly - they **really** depend on each other. The implementation details say they are largely independent, but currently this is not true. DagRun starting actually depends on DagRun creation. Task instance scheduling depends on DagRun starting, etc. Each of those runs a different set of queries, with different indexes, access patterns, and locking patterns, and it is more than likely that trying to run them "in parallel" or "concurrently" via async routines will simply shift the sequential behaviour from client to server - and the async change will bring zero or net-negative effects. For example, async DagRun creation will take an exclusive lock on the DagRun table that might make it impossible to start new DagRuns - and the async query to do so will hang on the lock anyway, even though it is async. It might also introduce hard-to-debug issues, locks, and contention that could even make such a change lose performance rather than gain it.
So what we are **really** talking about here is not mapping the current scheduling algorithm to use async for particular queries, but redesigning the scheduling algorithm to be concurrency-friendly.
That's how I see it. As usual with these kinds of changes, benchmarking and a POC would be a good start to see the actual benefits of "async" and "concurrency". I am not saying it's not possible; I am just saying that simply switching "sync" to "async" is not at all a guarantee of any improvement.