Status

State: Draft

Discussion Thread:

Vote Result Thread:

Progress Tracking (PR/GitHub Project/Issue Label): https://github.com/apache/airflow/pull/54721

Date Created: Aug 20, 2025

Version Released:

Authors:

Motivation

What problem does it solve? / Why is it needed?

Since Airflow came into existence, it has always connected to the master node of the metadata database in read-write mode. This works well for small to mid-sized installations, but organizations running Airflow at a huge scale, with thousands of workers and multiple schedulers, web servers, and dag-processors, would love to make use of their database read replicas so that the overall load on their master DB node stays within limits. It would also help keep the resource requirements of the master node in check and put to use the database read replicas that many organizations already have.

So in this AIP, we propose the ability to specify SQLAlchemy connection details for a read-only metadata database. This setting would be purely optional, and in its absence Airflow would fall back to the regular master DB connection, thus ensuring 100% backward compatibility.

In the first phase, we would make the code changes to let users specify a read-only connection string. In subsequent phases, we can migrate the functions that do not need write connectivity and can generally tolerate a data-freshness lag of up to a couple of seconds.


Considerations

What change do you propose to make?

Phase 1

We make changes in airflow/utils/session.py to support read-only sessions as decorators and context managers, and we provide the same interface for async DB sessions. If the read-only configuration is missing, there is a fallback that uses the existing master DB session in place of the read-only session. For now, we would not take separate SQLAlchemy configs (args, encoding, etc.) for this new read-only session; it would reuse the configs provided for the master session. In the future, we may add separate configs for this if needed.
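A minimal sketch of what the Phase 1 plumbing could look like, assuming a synchronous SQLAlchemy session (the RW_URL/RO_URL names and the fallback logic are illustrative placeholders, not final config names):

```python
# Sketch only: a provide_readonly_session decorator plus context manager
# that fall back to the master connection when no read-only connection
# string is configured. RW_URL/RO_URL stand in for the real settings.
import contextlib
import functools

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

RW_URL = "sqlite://"   # stand-in for the existing master connection string
RO_URL = None          # stand-in for the proposed optional read-only string


def _readonly_engine():
    # Fallback: when no read replica is configured, reuse the master URL so
    # behaviour stays 100% backward compatible.
    return create_engine(RO_URL or RW_URL)


@contextlib.contextmanager
def create_readonly_session():
    session = sessionmaker(bind=_readonly_engine())()
    try:
        yield session
        session.rollback()  # never commit: this session is read-only by intent
    finally:
        session.close()


def provide_readonly_session(func):
    @functools.wraps(func)
    def wrapper(*args, session=None, **kwargs):
        if session is not None:
            return func(*args, session=session, **kwargs)
        with create_readonly_session() as ro_session:
            return func(*args, session=ro_session, **kwargs)

    return wrapper
```

The same pattern would be mirrored for async DB sessions using SQLAlchemy's async engine and session factory.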

Phase 2

In this phase, we would find functions in the Airflow code where a master session isn't really needed and a couple of seconds of replication lag can be tolerated. Once identified, we can migrate them by changing the decorator from provide_session to provide_readonly_session while keeping the rest of the function unchanged.


Something like in the example below:
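For illustration, a sketch of such a migration, using simplified stand-in decorators (the real Airflow implementations inject actual DB sessions; only the decorator swap is the point here):

```python
# Illustrative sketch: the only change to a migrated function is the
# decorator. Both decorators below are simplified stand-ins for the real
# Airflow implementations, returning marker strings instead of sessions.
import functools


def provide_session(func):
    """Stand-in: injects a read-write session when none is passed."""
    @functools.wraps(func)
    def wrapper(*args, session=None, **kwargs):
        return func(*args, session=session or "rw-session", **kwargs)
    return wrapper


def provide_readonly_session(func):
    """Stand-in: injects a replica-backed session when none is passed."""
    @functools.wraps(func)
    def wrapper(*args, session=None, **kwargs):
        return func(*args, session=session or "ro-session", **kwargs)
    return wrapper


# Before: a read-only query served by the master DB
@provide_session
def get_latest_serialized_dags_before(*, session=None):
    return session  # imagine: session.query(SerializedDagModel)...


# After: identical body, only the decorator changed
@provide_readonly_session
def get_latest_serialized_dags(*, session=None):
    return session
```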


Alternate Architecture

If we do not wish to change the code for such methods, we can expose another config: a list of function names (empty by default) that users can populate. In other words, in Phase 1 we add support for connecting to a read replica, and then users decide for themselves, function by function, whether to use RO sessions in their setup.

Something like this in their configuration; we then change the code in provide_session to connect to an RO session only for the methods specified in this config:

functions_to_use_ro_session:
	- airflow.models.serialized_dag.get_latest_serialized_dags
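Under this alternate architecture, provide_session itself could consult the configured list; a sketch, with illustrative stand-in session factories and a hard-coded config set in place of the real config lookup:

```python
# Sketch of the alternate architecture: provide_session checks the
# configured function list and hands out a read-only session only for
# matching functions. The session factories and the config set are
# illustrative stand-ins, not the real Airflow implementation.
import functools

# Would be read from the Airflow configuration; empty by default.
FUNCTIONS_TO_USE_RO_SESSION = {
    "airflow.models.serialized_dag.get_latest_serialized_dags",
}


def make_rw_session():
    return "rw-session"  # stand-in for the master session factory


def make_ro_session():
    return "ro-session"  # stand-in for the replica session factory


def provide_session(func):
    qualname = f"{func.__module__}.{func.__name__}"
    use_ro = qualname in FUNCTIONS_TO_USE_RO_SESSION

    @functools.wraps(func)
    def wrapper(*args, session=None, **kwargs):
        if session is None:
            session = make_ro_session() if use_ro else make_rw_session()
        return func(*args, session=session, **kwargs)

    return wrapper
```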


This approach gives users full freedom to choose which functions to migrate as per their needs. However, the downside is that it requires deep knowledge of the Airflow codebase and its workings; since we are targeting this feature mainly at power users, that should be fine.


Compatibility / Which users are affected by the change?

It won't impact any user unless they set the read-only SQLAlchemy config value. Users who want to reduce load on their master database node can opt in to this feature, provided they have read replicas set up for their DBs. Most cloud vendors offer out-of-the-box support for setting up read replicas and connecting to them.


Maintainability

Nothing specific to maintain feature-wise. Users who opt in to the feature would need to make sure that their read replicas are not lagging too much. These power users would have to invest in monitoring and alerting on replication lag, along the following lines:


https://cloud.google.com/sql/docs/mysql/replication/replication-lag

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_MySQL.Replication.ReadReplicas.Monitor.html


Execution dependencies

Nothing I can think of. 


Are there any downsides to this change?

It's completely optional, so there won't be any downside. Users who want to use it have to be aware of how to set up read replicas and of the impact of replication lag.


How are users affected by the change? (e.g. DB upgrade required?)

None, as it's a completely optional and backward-compatible change. 

What is the level of migration effort (manual and automated) needed for the users to adapt to the breaking changes? (especially in context of Airflow 3)

None for those who don't want this feature. Users who want it just need to make sure that Airflow can connect to their read replicas. However, as maintainers of Airflow, we might have to invest in testing infra for this feature.

What defines this AIP as "done"?

Once we have migrated all the methods that use provide_session but do not write to the DB to the provide_readonly_session decorator. Of course, we would also add relevant documentation covering the positives and negatives of using this feature.



13 Comments

  1. For me this proposal sounds reasonable. Some questions come to mind:

    • I think we do not document any replica DB type support today. Regarding complexity and testing, should we maybe limit this feature to only one DB engine? Would it be Postgres?
    • Can we ensure that the connection string in the RO decorator is also changed in the "normal" setup to limit SQLAlchemy to RO operations somehow? That would remove any burden to test that not by accident the "wrong" decorator is used in a write operation which then is only detected in the hybrid deployment.
    • Do we have an estimate of how many of the DB transactions could be directed to an RO mirror, to gauge how much workload could potentially be off-loaded? Due to the latency of replication/delay, it must be read-only, non-critical information where no follow-up write activity adds inconsistencies.
    • In the "completion" I assume deployment documentation must be added to make such feature usable.
      1. We've planned to add full documentation on this feature. In regard to unit testing, we might just have to use the master DB details, and for integration tests we can invest in setting up proper infra. I think we should support MySQL and Postgres.
      2. I could not understand this point. 
      3. In our org, which is running an Airflow fork at a big scale, the total QPS is around 80k; out of that, roughly 10k goes to the master node.
      4. Yeah, we'll do that. 
      1. Thanks for the response!

        Yeah, sorry, (2) was hard to read... spaghetti sentence. Let me try to re-phrase:

        • Assume you have code that consumes @provide_readonly_session
        • Can we also pass an RO session to the consuming code (even if it is an RW session, by configuring SQLAlchemy to remove the W flag)?
        • Because this would make testing the feature much easier: "normal" functional use would then directly show a leak where an RO connection is used to write. No specific testing would be needed to ensure that the DB-split capability works without error. That would remove a lot of concerns about additional testing complexity.
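One way to get that guarantee (a sketch: the "before_flush" session event is a real SQLAlchemy API, while make_readonly_session is a hypothetical helper, not an existing Airflow function) is to make the injected session raise on any attempted write, even when it is backed by a regular RW database:

```python
# Sketch: force any session handed out by the proposed
# provide_readonly_session to reject writes, even when backed by a
# read-write database, so a misplaced decorator fails immediately.
from sqlalchemy import event
from sqlalchemy.orm import Session


class ReadOnlySessionError(RuntimeError):
    """Raised when a write is attempted on a read-only session."""


def make_readonly_session(engine) -> Session:
    session = Session(bind=engine)

    @event.listens_for(session, "before_flush")
    def _block_writes(sess, flush_context, instances):
        # Any pending INSERT/UPDATE/DELETE reaches flush; fail fast here so
        # the "wrong decorator" mistake surfaces even against a plain RW DB.
        raise ReadOnlySessionError("write attempted on a read-only session")

    return session
```

With this in place, ordinary functional tests against a single database would already catch any function that was wrongly marked read-only.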
    • Do we have an estimate of how many of the DB transactions could be directed to an RO mirror, to gauge how much workload could potentially be off-loaded? Due to the latency of replication/delay, it must be read-only, non-critical information where no follow-up write activity adds inconsistencies.

    I think this is the most important question.

    First of all, my intuition tells me there is a very small number of operations that would really benefit from this. If we are talking about speeding up 5% of queries that are currently insignificant in terms of load, it is not worth it. If it's 50%, likely yes; anything in between, maybe yes, maybe no.

    I would say we need to reverse the phases:

    • first, find which database operations can really benefit from it
    • measure the benefits of such replication (including some measurements of how replication to a read-only replica actually impacts regular access)
    • then decide whether it's worth optimising

    Also, I do not agree that we have "no impact", "no dependencies", and "completely backwards compatible". This downplays the effect such an implementation might have on maintenance, testing, and the support cost it might incur. I believe such a change adds significant complexity to the software development and testing process of Airflow, if you ask me.

    Yes, it might be compatible, but it has an impact on the complexity of the code and testing. You have to remember that when we release Airflow with this feature on, we take responsibility for saying that it "works" in that combination. This means, for example, that we have to make sure not only that we properly found and marked the read-only parts, but also that it stays correct in the future and does not slow down development. It basically means we need a test setup covering the scenarios: for both Postgres and MySQL, we would need a test setup with replication, we would have to test everything, and we would need to be able to replicate any kind of problem users might have. I do not see this being considered as a "challenge" here.

    Also - do we have any data showing how much replication of data can actually slow down regular database access? Remember that we heavily base Airflow on using "skip locked" in the database. Do we have any measurements showing that having read-only replicas actually helps with performance? There are a lot of assumptions here that it does, but do we have actual numbers showing it?

    My engineering experience tells me that, no matter the experience, we are very poor at assessing the performance impact of changes like this. With modern complex software + hardware + virtualisation, no performance implications can be taken on "faith", and they are often counter-intuitive. We need quite hard data showing that adding replication and read-only replicas actually speeds things up more than the slow-downs those replicas introduce. Also, we need to know whether such speedups are bigger than if we just increase machine size, speed, and number of CPUs, and have more memory for cache and possibly faster networking.

    About 10 years ago I would have agreed that adding read replicas is the right choice for a number of problems - but does it still hold? In modern infrastructure you can have machines with TBs of memory, 60+ CPUs, super-fast local SSD disks, and so on. Are we sure that putting two machines next to one another, where one is a read replica, will bring more benefits than having 2x more memory and networking? The Airflow database is "small" - not petabytes but terabytes of data (at most) - which means the whole database can generally be cached in memory, and all your read queries will use that in-memory data. With read replicas we introduce extra networking, synchronisation, CPU, etc. that might or might not make things faster and more scalable than just increasing memory.

    I would be a very hard no on such a change unless we see some realistic data showing that what we are going to implement actually gives any benefit to anyone.

    1. I do agree with the points related to testing and maintaining the feature. However, I feel it's quite a useful feature if you are running Airflow at scale. Most of the big orgs using Airflow must already have invested in running read replicas, so why not use them? Also, it's not just 2 nodes (1 master and 1 replica); for example, we are running 1 MySQL master node and dozens of replicas, with the read-only load distributed across them. I'm not sure if I can share the internal metrics here, but without read-replica support, the maintenance of our Airflow infra would have been a nightmare.

      1. > In our org, which is running an Airflow fork at a big scale, the total QPS is around 80k; out of that, roughly 10k goes to the master node.

        Did you do any checks including the completely changed DB access architecture of Airflow 3?

        While I perfectly understand that this could have been the case for Airflow 2, Airflow 3 quite completely changed the way distributed database access works. Almost all the characteristics of Airflow DB access have changed in Airflow 3 - dramatically. There are no longer 1000s of workers accessing the DB directly; we deliberately centralized DB access in the api_server. This means there are far more benefits from SQLAlchemy connection pooling and from caching of read-only data available in the api_server. The numbers from Airflow 2 are in no way comparable to Airflow 3. While I understand that distributing 1000s of workers talking directly to the DB among replicas might make sense, it does not seem nearly as powerful to have a single api_server talking to both master and replicas at the same time.

        > 1 MySQL master node and dozens of replicas, with the read-only load distributed across them

        Each replica is not "cost free": every time you add a replica, you impact the master. It does not "magically" solve the problem of scalability; it shifts it from accessing the DB to replicating.

        Question: did you actually make any measurements to assess its impact in the Airflow 3 DB access scenario, or did you assume that if it worked for Airflow 2, it will work the same for Airflow 3, with its completely changed DB access mechanism and distribution of DB access? Did you do any analysis that includes the changes implemented in Airflow 3? What were the results?

        > the maintenance of Airflow infra would have been a nightmare. 

        What kind of nightmare? Do you have a tangible explanation of what kind of problems it solved? I can imagine, for example, that it solved the problem of having multiple connections open to a single DB server. But that problem has already been solved by the Airflow 3 architecture change. I think "nightmare" bears absolutely no meaning without knowing what kind of nightmare it is.

        1. > Did you do any checks, including the completely changed DB access architecture of Airflow 3?

          Not yet, but just removing workers' direct DB access doesn't mean those interactions aren't needed anymore. They'll go through API routes instead of a direct DB route, but the QPS would be more or less the same. I'm not sure what kind of DB caching we have in the API server at the moment, so I'll have to look into that.

          > Each replica is not "cost free": every time you add a replica, you impact the master. It does not "magically" solve the problem of scalability; it shifts it from accessing the DB to replicating.

          Well, I'm not a DBA expert, but I don't think any replica does a read call to the master to update itself; they must be updated by applying binlogs or something similar in the background. Also, we have this many replicas for various reasons, such as zonal and/or regional disaster recovery.

          > What kind of nightmare? 

          Recently, when our infra was running with a 50-50 traffic split between master and replicas, the master node was maxed out on the cores we could provide and still choked. Every now and then we would get stuck queries, which would spike the load to 100% and make everything stuck. It may be a MySQL-world problem, but past a certain point, adding more cores didn't work out for us.

          1. The access scenarios are **vastly** different for Airflow 3.

            • In the case of Airflow 2, we only sent the task_id to the worker, and it had to **read** the data using a completely new session and connection.
            • In the case of Airflow 3, we retrieve the information about the task in the scheduler during scheduling and pass it to the task via the Task API before the task even starts.

            That on its own means saving at least one or a few read DB calls (or more) per task run.

            I think there is absolutely no way to take Airflow 2 DB access patterns and compare them to Airflow 3; those are an apples-to-pears comparison if you ask me.

            And I think it's quite a "reckless" approach not to see how this is going to increase the testing and maintenance burden on the maintainers. I find it a bit uninformed to repeat multiple times and stress that this is backwards compatible and has no influence on anything, when we are talking about essentially doubling our DB tests, a far more complex test infrastructure, the need to manually test various scenarios connected to replication failures and replication delays, and having to test all kinds of scenarios. Contrary to all the softening in the AIP, this impacts our development velocity and maintenance overhead significantly, and there is absolutely no way we can agree to it without knowing whether it makes any sense at all.

            I would say - unless there are some good arguments to convince me - the only approach I see here is for a company like yours, which has such a need, to again fork Airflow 3 and test it in real scenarios. Keep the fork rebased, apply those changes in your staging environment, and make comparison tests on real-life Dag scenarios: implement the change, identify all the places that can be improved by applying the read-only decorator, and compare "with replicas" and "without" (including resource usage and such).

            Also - I would be very, very strongly against having something like that without a complete test suite that allows testing whether everything works in the replica scenarios.

            Google is about to contribute a performance test framework (first results were shared at the last dev calls); that framework could be extended to get some numbers out as well, including some synthetic Dags.

            I'd say these are the two things that are absolutely necessary to even discuss whether the increased maintenance complexity is worth it:

            • seeing comparison results on real-life Dags and infrastructure from your company's POC
            • backed also by performance results from the performance framework showing the impact


            1. Well, if we are so worried about testing and maintaining this feature, then we could say it is an experimental feature, and users who use it need to be cognizant of that fact. But anyway, I'm out of words now and won't be pursuing this AIP anymore. In my current org, we've been using this feature for years, and we will continue to do so with Airflow 3 as well, because I don't think that just removing one read call from workers to the scheduler will make a drastic impact on the overall DB QPS for us. Maybe down the line, once we've migrated our workloads completely to Airflow 3, I or someone else from the team can revive this AIP.

              1. This is exactly what I suggested. Check it in your org. See what impact it has. Make a comparison and real measurements showing the results and achieved savings. Then we can actually discuss it, knowing the impact it will have, and balance it against the increased maintenance and testing cost. Branding it as experimental does not change anything; we still have to test it and its different scenarios. If a feature is experimental, it means we might **change** how it works without warning or deprecation, but it still means that if someone has a problem and raises an issue, we have to be able to reproduce it, test it, fix it, and have a way to prevent regressions.

                What you proposed now is exactly how it should be:

                • implement it in your copy of Airflow 3
                • measure the impact compared to "no read-only"
                • bring it here with numbers

                Yes, I also do not know whether the architecture changes altered the characteristics. Same as you. I suspect they did. But I do not know.

                And all I am asking for is to change "I do not know" into "I know" before we impact the development effort and test infrastructure of Airflow as a product.


                The "I do not know" and "I follow it anyway" is the "reckless" part of the proposal, not the proposal itself.

                1. Also, just a bit more context, Sumit Maheshwari. Please don't take it personally.

                  What I say now is really based on the experience of AIP-44. That was a huge mistake (of mine): I got it approved and started implementing it. It was a kinda similar story (though its goal was not improving scalability and performance, just security). I vastly underestimated the complexity it brought to everyone else's daily work. It started with a similar "let's slap a decorator on top of the methods and all things will be solved". That turned out to be just the tip of an iceberg. It ended with us having to artificially split certain DB methods into "internal" and "external" parts because the code was simply not prepared and not properly modularized to accommodate AIP-44's expectations. What we ended up with was an "almost working solution", where it turned out that a few of the places we could not really implement were disabled when AIP-44 was enabled.

                  That was ONLY possible because, from the very beginning, we had AIP-44-dedicated test jobs, and our CI was more complex because of that. Luckily it was only a "little" more complex, but the complexity of replicated database tests is about an order of magnitude bigger.

                  Even though it was a relatively small scope, it made a huge (negative) impact on development. Ask all the maintainers who were developing Airflow 2. This was a major pain, mostly because most people were developing against something they did not test locally, and it basically meant kinda duplicating the test efforts.

                  It was not all bad: it allowed Jens Scheffler to implement the Edge Executor and make it work for Bosch a year before Airflow 3 was out. And we learned a ton and identified a lot of places that were problematic (and still are in Airflow 3, like callbacks) where we knew we had to find better solutions. That was a great learning experience. But the cost was huge, both time-wise and mentally.

                  And yeah, it was personally my own mistake that we went that route. I admit it. I was reckless. But I've learned.

                  So my ask is very simple: let's not repeat the same mistake again. If your company is willing to go that route, I would really love for them to do it. Fix all the problems, realise what the "real" scope of changes is and what impact it has. Finally, assess whether it was worth it.

                  The best way to learn is to learn from others' mistakes. And in this case, my preference for the Airflow maintainers is that we learn from the mistakes (and learnings) of Uber without bearing the pain of making those mistakes ourselves. Plain and simple and very straightforward and blunt.


                  1. I do understand, Jarek Potiuk, and value your inputs. In fact, it takes huge effort to review in detail and say NO to proposals, so thanks for doing that. Regarding this AIP, as I said, we'll (hopefully) revive it in the future, but you know that nothing is certain, right?

                    1. I am not so sure whether the proposal has a high risk of problems. If my question (2) ^^^ can be implemented, then I'd assume that, besides docs and the responsibility of a deployment manager, the change is rather lightweight and non-intrusive. If an RO connection is really RO, then testing is minimal because you see it break immediately.

                      Whereas I agree we might need to check what level of RO "savings" for the master DB could actually be achieved.