Status

State: Draft
Discussion Thread: https://lists.apache.org/thread/bhz76jzsb8thwy89sz77fcywhgkj63sw
Vote Thread:
Vote Result Thread:
Progress Tracking (PR/GitHub Project/Issue Label):
Date Created:
Version Released:
Authors:
Motivation

Airflow uses synchronous programming in all of its components except the Triggerer, which manages and executes triggers asynchronously. The other components perform many blocking I/O operations, particularly the SQLAlchemy queries issued by the Scheduler, API, Webserver, and Executor, as well as the methods remote executors use to execute and monitor Airflow tasks in external systems. Migrating these methods and queries to asynchronous programming will make Airflow faster, reduce its resource consumption, and let it make fuller use of available CPU cores.

Considerations

One of the requirements of this migration is to integrate asynchronous SQLAlchemy into the Airflow core and provide an asynchronous session for use in asyncio coroutines. This integration can reduce execution time wherever multiple independent queries can run concurrently, and enables Airflow to handle concurrent work with lower resource consumption.
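The concurrency win described above can be sketched without Airflow or SQLAlchemy at all: the example below simulates two independent metadata-DB queries with `asyncio.sleep` standing in for I/O waits (the function names and delays are illustrative, not Airflow code). In real code each `await` would be an `AsyncSession.execute(...)` call.

```python
import asyncio
import time

# Simulated independent metadata-DB queries. The names and delays are
# placeholders; in real code these would await an async SQLAlchemy session.
async def fake_query(name: str, delay: float) -> str:
    await asyncio.sleep(delay)  # stands in for the blocking I/O wait
    return f"{name}-result"

async def run_sequential() -> list[str]:
    # The current synchronous style: each query blocks the next one.
    return [await fake_query("q1", 0.1), await fake_query("q2", 0.1)]

async def run_concurrent() -> list[str]:
    # With an async session, independent queries overlap their I/O waits.
    return list(await asyncio.gather(fake_query("q1", 0.1), fake_query("q2", 0.1)))

start = time.perf_counter()
seq = asyncio.run(run_sequential())
seq_elapsed = time.perf_counter() - start

start = time.perf_counter()
conc = asyncio.run(run_concurrent())
conc_elapsed = time.perf_counter() - start
```

The sequential version takes roughly the sum of the two waits, while the concurrent version takes roughly the longest single wait, which is exactly the saving an asynchronous session offers for independent queries.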

What change do you propose to make?

This AIP proposes:

  • Migrating the API and WebServer to asynchronous programming
  • Migrating the Scheduler to asynchronous programming
  • Migrating the Executors to asynchronous programming

Steps to make it happen include:

  • Adding support for asynchronous SQLAlchemy in Airflow core by inferring asynchronous connection (SQLAlchemy engine) from metadata connection configuration
  • Implementing an asynchronous version of the secrets backend to retrieve connections and variables asynchronously
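One possible shape for the first step above is to derive the async engine URL from the existing synchronous metadata connection string. The helper below is purely a hypothetical sketch (no such function exists in Airflow today); the async driver names (`asyncpg`, `aiomysql`, `aiosqlite`) are real SQLAlchemy-compatible drivers, but the inference rule itself is an assumption about how this could work.

```python
# Hypothetical helper: rewrite a synchronous SQLAlchemy URL's dialect/driver
# to an asyncio-capable one. The dialect->driver mapping is an assumption.
ASYNC_DRIVERS = {
    "postgresql": "postgresql+asyncpg",
    "mysql": "mysql+aiomysql",
    "sqlite": "sqlite+aiosqlite",
}

def infer_async_url(sync_url: str) -> str:
    """Return the async counterpart of a sync SQLAlchemy connection URL."""
    scheme, sep, rest = sync_url.partition("://")
    # Drop any explicit sync driver, e.g. "postgresql+psycopg2" -> "postgresql".
    dialect = scheme.split("+", 1)[0]
    try:
        return f"{ASYNC_DRIVERS[dialect]}{sep}{rest}"
    except KeyError:
        raise ValueError(f"No known async driver for dialect {dialect!r}")
```

For example, `infer_async_url("postgresql+psycopg2://user:pw@host/db")` would yield `postgresql+asyncpg://user:pw@host/db`, which can then be passed to SQLAlchemy's `create_async_engine`.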

What problem does it solve?

  • Reduces latency and improves throughput for I/O-bound operations (DB queries, remote calls, and secret fetches)

  • Better utilization of multi-core systems via true asynchronous execution paths

  • Better code decoupling through making independent sub-components run as separate tasks

Why is it needed?

Airflow’s execution model currently relies heavily on synchronous blocking calls that serialize I/O operations, resulting in inefficient CPU utilization, slower response times, and scalability bottlenecks in high-load environments. Modern workloads—especially those with many concurrent DAG runs or external API interactions—require non-blocking concurrency to manage hundreds or thousands of parallel tasks efficiently. By adopting asynchronous programming using asyncio and asynchronous SQLAlchemy sessions, Airflow can improve performance, reduce resource overhead per process, and achieve more consistent task scheduling under load. This evolution aligns Airflow’s architecture with industry trends toward async-native systems.

In addition, async programming allows components to be divided into independent sub-components in a natural way. For example, the scheduler's loop consists of largely unrelated routines that can and should be decoupled for easier maintenance of the code.

Are there any downsides to this change?

  • Increased complexity: Introducing asynchronous code paths can make debugging, logging, tracing, and testing more complex compared to synchronous flows

  • Compatibility risk: Custom plugins, operators, or executors that rely on synchronous hooks or blocking DB operations may need updates to support async behavior

  • Gradual migration: Full migration of all components won’t happen at once, so hybrid async/sync operation modes must be supported temporarily

Which users are affected by the change?

  • Those running at scale (e.g., high DAG concurrency, heavy task monitoring) will benefit from the change
  • Core developers will benefit from easier maintenance, as independent sub-components run as separate, decoupled tasks

Implementation details

Scheduler

The scheduler consists of multiple components that are mostly independent from one another:

  • DagRun Creation Routine
  • DagRun "Starting" Routine
  • Task Instance Scheduling Routine
  • Task Enqueuing Routine (Critical Section)

Running them sequentially limits throughput and requires carefully tuned ceiling parameters to ensure the loop does not stay in any one section for too long. Running them as separate concurrent tasks removes this inefficiency automatically, by allowing every routine to advance at its own pace.
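The decoupling described above can be sketched with plain asyncio: each routine becomes its own task looping at its own interval, instead of all four being serialized in one loop. The routine names mirror the list above, but the bodies, intervals, and stop mechanism are illustrative placeholders, not Airflow's actual scheduler code.

```python
import asyncio

# Sketch: each scheduler routine runs as its own asyncio task at its own
# pace. Names mirror the routine list above; bodies are placeholders.
async def routine(name: str, interval: float, counts: dict, stop: asyncio.Event):
    while not stop.is_set():
        counts[name] = counts.get(name, 0) + 1  # one pass of this routine
        await asyncio.sleep(interval)           # yield so other routines advance

async def scheduler_loop(runtime: float) -> dict:
    counts: dict[str, int] = {}
    stop = asyncio.Event()
    tasks = [
        asyncio.create_task(routine("dagrun_creation", 0.01, counts, stop)),
        asyncio.create_task(routine("dagrun_starting", 0.02, counts, stop)),
        asyncio.create_task(routine("ti_scheduling", 0.01, counts, stop)),
        asyncio.create_task(routine("task_enqueuing", 0.03, counts, stop)),
    ]
    await asyncio.sleep(runtime)  # let all routines run concurrently
    stop.set()
    await asyncio.gather(*tasks)
    return counts

counts = asyncio.run(scheduler_loop(0.2))
```

Note that a fast routine (e.g. DagRun creation) completes many more passes than a slow one in the same wall-clock window; no routine has to wait for the others, which is the efficiency the paragraph above describes.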

How are users affected by the change? (e.g. DB upgrade required?)


Other considerations?


What defines this AIP as "done"?