Background and motivation

The logical key of a table tends to tell you what it means, i.e. what a record represents.

Up to this point, the logical key of the dag run table was logical_date + dag_id.  

(I will call it logical date rather than execution date, just to keep things simple.)

In other words, the combination of logical_date + dag_id is what defined a dag run, and what uniquely identified it.

What does the logical date represent?  I think there's really no better way to describe it than the word "partition".  So that is what I'll use here.

From the beginning, essentially the default assumption of Airflow was that the user was supporting a date-partition-driven workflow.  The evidence for this is quite widespread.

  • the dag is not scheduled until the end of the partition
  • there can only be one running dag run per logical date
  • there can only be one dag run record per logical date
  • every dag run must have a logical date. it's assumed that this date is meaningful for all dags / dag runs
  • backfill is conceptually about "filling in the gaps" -- it's in the name. gaps of what? partitions.
  • catchup: ensure that each partition is run -- each logical date has meaning for the dag.
  • there are places in the code base where we get "the prior run" -- and we sort not by "when it ran" but by logical date.  e.g. depends on past and xcom pull.

In this way, the dag run table functioned as the store of data completeness.  It told you, in effect, which partitions were fulfilled and which were not.  Airflow could answer the questions "which is the dag run for date X?", "what's the dag run state for date X?", and "give me the run prior to run X".  Indeed, Airflow needed to function in this way in order for backfill to work, and to prevent concurrent runs of the same partition.  The grid view was also a data completeness view: every vertical bar represented a single logical date.

I think we can all recognize that not everyone uses this canonical logical-date-driven dag pattern.  I would guess it's actually used by a minority of dags, but there isn't really great data on this.

Personally, in my former life as a data engineer, I almost never used this design pattern.  I never worked at a Hive shop; I mostly did things either incrementally or as full refreshes.  I had virtually no use for execution_date.  Instead I wrote my own interface for storing task state, and a watermark operator to make use of it for incremental loads.

That having been said, I know that other engineers and teams do use logical date, and do write their dags in that classic Airflow partition-driven style.

Recently, in AIP-83, we removed the uniqueness constraint on (dag_id, logical_date).  This means that there can now be many dag runs with the same logical date for the same dag.  It also means that there's no longer a well-defined answer to "is this date fulfilled or not?" or "give me the prior dag run".  It introduces questions like: if there are three runs for date X, and the user backfills this range, do we create one new run, or clear and rerun all 3?  Or, how do we ascertain the state of date X (e.g. when the user backfills with "rerun failed" or "rerun completed") when there are multiple runs for that date?

In other words, this blows up the traditional Airflow dag run semantics.

Even though I personally almost never used this design pattern, I know it's a valid pattern. And fundamentally, since the beginning of time, Airflow has assumed all dags follow this pattern. And while most of us probably just ignore it or look the other way, I would guess some folks probably like the pattern or in any case are used to it.

Fundamentally, my contention is that we should continue to support this pattern, but that it should no longer be the default.  The partition-driven style should be one available mode.  A more sensible default would probably be that the dag just runs when it is scheduled, and is not connected to any partition or data interval.

Problems with current state

The blanket removal of uniqueness constitutes a breaking change.  The behavior is simply different.   

Where previously we could point to the uniqueness constraint of dag_run to understand what a dag run means, now we cannot.

Where previously there could only ever be one dag run per logical date, now there can be many, and that introduces some ambiguities with regard to inferring the state of the partition:

  • Users are not protected against concurrent runs of the same logical date
  • Suppose you backfill a range containing 10 runs with date X
    • What if you ask for "rerun failed"? How do you determine the state of that logical date?
    • Suppose they all failed.  Should we rerun all of them?  Only the one that started most recently?
      • What if there is more than one "latest" one? I.e. with same start date?
      • Suppose that their start dates differ by a few milliseconds, so there is no such ambiguity – do you really want your workflow to be driven by that randomness / race condition?
    • In the old Airflow semantics, backfilling a range would never result in more than one run per logical date.
  • If a task is marked "depends on past", and there are 3 runs for the prior logical date, which run should we look at?
  • If you try to retrieve xcom of the task from the prior run, which run should it choose?
  • We've lost our view of data completeness.  Previously the grid view functioned as a data completeness view, because every partition's state was visible there.  So if you saw a red bar you knew that partition did not load properly.  But now, if that date was later re-run successfully, the red bar would no longer be telling the truth.

Side problem: data intervals

Data intervals are not a specific target of this proposal.  But in any of these solutions we have to think about what the implications are for data intervals and what to do about them.  And in some of the solutions considered, the proposal is to not have data intervals defined in some scenarios.  Generally speaking, I think they are confusing and misleading, and we would probably do well, as is planned for assets, to move away from them.

With that, let's briefly discuss some of the issues.

One problem is that they don't tend to tell the truth.  If a dag's behavior is not driven by logical_date, then the data interval object has no meaning.  In other words, it is actively misleading.  It tells you the data interval for this dag is A→B, but in reality that has nothing to do with what the dag did.

Data intervals were layered onto the existing (albeit implicit) partitioning concept represented by execution_date / logical_date.  They assume more about what the user is doing than they really need to.  E.g. rather than saying this execution_date is simply the partition key for this run (whatever that may mean), they say this logical date represents an interval.  But what if a Monday-through-Friday dag really does process only the Monday-through-Friday data, i.e. really only deals with one date at a time, and doesn't care about weekends?

It also creates trouble when trying to define what the data interval should be for a dataset-triggered run, or a manually triggered run. With dataset-triggered runs, we take the min and max of the dag runs that created the dataset events that triggered the run.  In practice, I suspect this generally does not agree with the actual work being done.

What to do / options considered

We have a number of options.

  1. restore uniqueness and keep old dag run semantics
  2. restore uniqueness but make logical_date nullable
  3. add a partition-esque concept
  4. leave it sorta fuzzy and ambiguous
  5. distribute the responsibility for what backfill means etc

Let's discuss each one.

1. Restore uniqueness and keep old dag run semantics

We could say, for now, let's keep logical date unique and defer the removal of the uniqueness constraint until 3.1+.  With this approach, it's possible we could design a solution without the time pressure of Airflow 3.  The cost here is that we would likely spend more time on the effort overall, and we would not have the option to break backcompat that we do with 3.0.  Though I don't think we really need to break compat – indeed, that's sort of the point of this amendment proposal.

2. Restore uniqueness but make logical_date nullable and / or optional

To me this is a very interesting approach that, on one hand, lets us keep the dag run semantics unchanged.

But it also makes it possible to just fully dispense with logical dates and data intervals for those dags which don't use them.

With this approach we sort of have our cake and eat it too.

2.a. timetables that don't use logical date

If you don't care about logical date or data intervals, then maybe indicate that with the timetable you use:

DAG(dag_id="blah", schedule=MyNoLogicalDateNonDataIntervalTimetable(...))

Or, if we want to let users continue to use a cron expression, we could add a DAG-level param:

DAG(dag_id="blah", schedule="0 0 * * *", logical_dates=False)

2.b. don't add timetables with no logical date, but let user manually trigger with no such date

A more limited version of this one, nullable logical date, is that we don't actually add timetables that don't use logical date, but we allow users to trigger runs with no logical date.  These would be unaffected by the uniqueness constraint on logical date.
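As a rough illustration of what this might look like from the client side (the endpoint shape follows the existing stable REST API; accepting a null logical date is exactly the change this option proposes):

import requests

# Hypothetical trigger call under option 2.b (illustrative only): the client
# explicitly requests a run with no logical date. Whether the field is
# omitted or sent as null is an API-design detail this option would settle.
resp = requests.post(
    "http://localhost:8080/api/v1/dags/blah/dagRuns",
    json={"logical_date": None},  # assumption: null accepted under this option
    auth=("admin", "admin"),      # assumes basic auth is enabled
)
print(resp.json())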

Technical note

Null values are allowed under a uniqueness constraint, so the unique constraint on logical date in dag_run is not a problem here.
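A minimal demonstration (using SQLite here purely for illustration; the real dag_run table has many more columns):

import sqlite3

# Unique constraints treat NULLs as distinct, so any number of runs with a
# NULL logical_date can coexist while non-null dates stay unique.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dag_run_demo (
        dag_id TEXT NOT NULL,
        logical_date TIMESTAMP,
        UNIQUE (dag_id, logical_date)
    )
""")
conn.execute("INSERT INTO dag_run_demo VALUES ('blah', NULL)")
conn.execute("INSERT INTO dag_run_demo VALUES ('blah', NULL)")  # allowed
conn.execute("INSERT INTO dag_run_demo VALUES ('blah', '2024-01-01')")
try:
    conn.execute("INSERT INTO dag_run_demo VALUES ('blah', '2024-01-01')")
except sqlite3.IntegrityError:
    print("duplicate (dag_id, logical_date) rejected")  # still enforced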

Other note

I think with option 2, it might make sense to introduce another date field, run_date, which would be populated for all runs.  It would be the date after which the dag run should be free to be scheduled.  This word is generic enough to work for both schedule-driven dags and event-driven dags.  It would always mean the earliest time that the dag should run.  For a manually triggered run, e.g., it would be "right now".  When the scheduler is not overloaded, it should always be close to start_time.
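A sketch of how run_date might be derived per run type.  The function name and rules here are assumptions of this proposal, not existing Airflow fields or APIs:

from datetime import datetime, timezone
from typing import Optional

def compute_run_date(run_type: str, data_interval_end: Optional[datetime] = None) -> datetime:
    # Hypothetical: the earliest moment the run is free to be scheduled.
    if run_type == "scheduled" and data_interval_end is not None:
        # Classic interval-driven run: eligible once its interval closes.
        return data_interval_end
    # Manual / asset-triggered / API-triggered runs are eligible immediately.
    return datetime.now(timezone.utc)

print(compute_run_date("manual"))  # "right now"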


3. Add a partition concept, or a partition-esque concept

We could allow users to declare whether the dag should use old semantics or new semantics.

This old vs new distinction is just shorthand; in reality we should not describe it as old vs new, since this implies a judgment about it, whereas I think we should recognize the old way as a valid way of designing pipelines, even as we make it just one option instead of the only option.

With this approach, we propose to add a partition param to the dag object.  At this time, we would propose three available behaviors.

Full backcompat – legacy partition-driven semantics

E.g. we could declare partition="implicit" to say that the dag should have partitions defined by the schedule:

DAG(dag_id="blah", partition="implicit")

With implicit partitioning, we keep the old Airflow semantics.  When the scheduler schedules a dag run, it creates a record in a dag_partition table.  This table would ensure uniqueness of partition keys within the scope of the dag, and each partition key would be associated with only one dag run.  So the questions "what is the state for this partition?" and "what is the run prior to this run?" would be well defined.  Partition info would be made available in the execution context.  Partition date would always equal logical date.  Data interval would behave the same.
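To make the shape of this concrete, here is a minimal sketch of what such a dag_partition table might look like.  This is illustrative only; the table name, columns, and constraint are assumptions of this option, not an implemented schema.

import sqlite3

# Hypothetical dag_partition table: one row per (dag_id, partition_key),
# each pointing at the single dag run that fulfills that partition.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dag_partition (
        dag_id        TEXT NOT NULL,
        partition_key TIMESTAMP NOT NULL,  -- equals logical_date in implicit mode
        dag_run_id    INTEGER NOT NULL,    -- the one run fulfilling this partition
        state         TEXT,                -- e.g. 'success', 'failed'
        UNIQUE (dag_id, partition_key)
    )
""")
conn.execute("INSERT INTO dag_partition VALUES ('blah', '2024-01-01', 1, 'success')")
# "What is the state for this partition?" becomes a well-defined lookup:
state = conn.execute(
    "SELECT state FROM dag_partition WHERE dag_id = ? AND partition_key = ?",
    ("blah", "2024-01-01"),
).fetchone()[0]
print(state)  # success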

No partitioning

To disable partitioning for a dag, the user can set partition=None.

DAG(dag_id="blah", partition=None)

With partition=None, no partition records would be created in the dag_partition table.  No uniqueness would be enforced w.r.t. logical date.  

Other modes

We have explored adding other configurability to the partition parameter, e.g. something to allow us to not have data intervals, or to change when the partition is run – e.g. on the partition date rather than at the start of the next partition.

Aside: I think making dag partitioning an explicit thing in Airflow would be an overall good thing, beyond the whole uniqueness issue.  It clarifies what exactly is going on, i.e. why the run for partition X does not start until the next partition – because the partition would not be complete otherwise.  It also makes it possible for the user to say, every day, rerun the last N partitions.  This is essentially a partition-driven workflow with a configurable lookback.

Notes on conversion to and from partition-driven

One thing to consider is what happens when the user changes the partition scheme or schedule.  People will obviously do this, either intentionally or by accident.

One option is to only track partition fulfillment when the dag is in partition mode.  If the dag is later changed to non-partition mode, we would just leave the old partition records there and ignore them.  Any runs created while the dag is not in partition mode would not update the partition records.  If the dag is switched back to partition-driven, the partition store would not have knowledge of runs that happened while the dag ran in non-partition mode.  If the user wants to migrate the history, they would have to do so manually or via the API.

Another option would be that when we convert modes, we could try to infer the current partition state.

Another option would be to disallow conversion.

4. Do nothing

Here's where things stand right now:

  • there's no logical date uniqueness constraint on dag_run
  • users could end up with multiple runs for a given logical date when runs are triggered manually, via the API, or through the trigger dag run operator
  • in part because of this, backfill currently will always add new dag runs and never reuse existing dag runs
  • as a result, use of backfill will always result in multiple runs per logical date
  • when there are multiple runs for a given logical date, they will all be shown in the UI.  E.g. a red run might falsely indicate partition non-fulfillment
  • there is nothing to prevent concurrent runs of a given logical date.  E.g. if there's a backfill running, and then a user clears an old failed dag run in the same range, or if two users or API calls trigger a dag run at around the same time, nothing will prevent that.
  • what backfill does now is look at the state of the latest dag run (by start date) to ascertain the state of the partition; and for any date for which there's currently a running dag run, backfill will not create a dag run for that date.

5. Provide configurability at backfill run time or more "depends on past" options

Here's one phrasing of the question:

> Why is this not something that you choose when doing a backfill or when setting depends on past?

Here are my concerns with it:

  • these options would not maintain these aspects of old behavior
    • concurrent runs of one logical date (a.k.a. partition) are disallowed
    • grid view is a partition-fulfillment view
  • "choose when doing a backfill"
    • what you're talking about, I think, is: let the user choose, at the time of backfill, whether to rerun all runs or, say, only the latest.  But "the latest" is a fraught concept as a partition inference strategy, because (1) there could be a tie, and (2) even if there isn't a tie, if say the user created 3 runs at the same time, it would be subject to randomness which one is the lucky one that we use.

Proposal

I would be in favor of 2 or 3.

Additionally, if we do something like 3, we should make non-partition-driven dags the default, with no data interval.





Comments

  1. Thanks for writing this up Daniel Standish

    I think I have a relatively strong preference for (3) above, with a stronger bias towards backwards compatibility. 

    I agree with the motivation section you described above and with the concept of backfills there as well.
    However, in my prior life working in data (I originally came into it from a data replication and data warehouse creation angle), I did not really think of it as "partitions" but in terms of "time intervals". Again, that was purely a terminology preference, because of the way we were processing data.

    However, I do believe that the word "partition" is causing a lot of confusion here, because of the Asset Partitions AIP. 

    I try to usually stay away from naming debates, but using something like "data-interval" vs. "non-data-interval" Dags may make this easier to understand.

    1. Yeah, the reason it makes sense as partitions is because, fundamentally, that's all that is being controlled and constrained, e.g. in the db.  All you have a unique constraint on is the partition key – historically that was execution_date, then logical date.

      Imagine if we tried to use "intervals" as the uniqueness key....

      Then interval (Jan 1, Jan 5) is different from (Jan 1, Jan 3).  From a database perspective, it's hard to tell that these intervals overlap; if we simply add a uniqueness constraint, they would be distinct and allowed to coexist! But it's counter to the definition of a partition to have the same data in multiple partitions; a partition is always a split.

      This is why in Hive or any other DB system you define a partition key, not an interval.  And since Airflow was designed to support precisely those data jobs that dealt with time partitions, it's the same way in Airflow.  We've since layered the data interval concept on top of this, but data intervals don't drive the behavior like logical date does.  The data intervals are allowed to be anything; the logical dates are constrained, and are what drive behavior.

      But yeah, maybe the community does not like this way of thinking about it.  Just calling it like I see it.

  2. This is a very interesting write-up, and kudos to you for writing this on a Friday!

    I do think backfills are important, and I know a lot of users treat logical_date as the way to split up their runs.
    Even then, personally, I am against option 3. If the DAG acts differently depending on the partition setting, that's just going to make troubleshooting a pain. DEs will have to remember to check that partition mode for every DAG, which is extra work. Plus, it makes things harder for users having to learn all these different modes. We should definitely keep it simpler and more predictable.

    With that in mind, I see that the main drawback of removing the current system where each logical_date is assumed to have only one run (which complicates backfills). Is that correct? If so, then I think moving forward with the AIP-83 idea – letting logical_date not be unique so DAG runs are more flexible, which opens up possibilities for dynamically triggered DAGs and other use cases – and then figuring out a new way to handle re-running past Dag Runs makes sense. Like, instead of backfilling using a logical_date range, we could just target the specific DAG runs you want to re-run. Maybe, along with targeting specific run_ids, we could also allow users to re-run DAGs based on a logical_date range still. But instead of assuming each date has only one run, we'd have to figure out what to do if there are multiple runs for that date – like, re-run all of them, or maybe just the failed ones, or the latest one. Users can pick the backfill behavior with an API parameter.

    1. If the DAG acts differently depending on the partition setting, that's just going to make troubleshooting a pain. DEs will have to remember to check that partition mode for every DAG, which is extra work. Plus, it makes things harder for users having to learn all these different modes

      Respectfully I think this is a bit of a straw man.  Any feature that controls dag behavior, you could say the same thing.  According to this logic we should not have features (wink).  But srsly, every feature adds complexity; question is sorta, is it worth it. Mind you, what I'm trying to do here is reduce ambiguity and support the existing paradigm.  And, is it really the case that "preserving the semantics that have always existed" is what will cause troubleshooting pain? Is it not rather "blowing up the semantics that have always existed" that is the more proximate cause? (smile)

      I don't want to be too forceful about it because I could of course be wrong in the way I think about it, or could just have tunnel vision or something.  We must get to the right answer, not the answer that has the loudest supporter.  Nonetheless I'll share my thoughts.  I think making this explicit could actually allow us to make Airflow behavior more intelligible.  To my thinking, by recognizing partition-driven dags explicitly, Airflow would, essentially, finally tell it like it is.  In my mind, we would disambiguate this longstanding source of confusion.

      Airflow has forever been wishy-washy about this and has tried to have it both ways, dancing around what it's really doing: fundamentally assuming a partition-driven workflow, but not coming out and saying it.  Take for example the catchup param.  This is a workaround that essentially exists to hack non-partition-driven workflows into the Airflow paradigm.  It tells Airflow to skip the partitions that it did not run in time.  This param is really a partition concept, and is a bit nonsensical when you just want something to run on a schedule.  And CronTriggerTimetable, which says, run the dag for a logical date at that logical date, is another sort of hack to get away from the partition-driven nature of Airflow.  This should be the default DAG behavior, and adding the partition concept could actually help with this because we could allow users to choose non-partition-driven as the default mode.  All the sturm und drang over the years about what execution_date means, and whether we should allow users to run at the left-hand side of the interval, etc. – basically it's about partition-driven workloads vs. simple cron, and that's it.

      But anyway, I do think that there is def room for improvement on the interface.  I think if we agree to maintain support for the partition-driven style, I'm sure we can somehow collectively figure out a good interface.  And of course if people have better naming, delightful.

      I see that the main drawback of removing the current system where each logical_date is assumed to have only one run (which complicates backfills). Is that correct?

      This statement is a bit incomplete.  I think you meant to say: the main drawback of removing uniqueness is its impact on backfills.  That's sorta true.  Backfill is a part of it.  But it's not all of it.  There is also depends on past, xcom pull, and prevention of concurrent execution of the same logical date within a dag – which is actually a feature if you're using the canonical Airflow dag pattern.  Also, grid as data completeness view.  Before AIP-83, grid view was both execution history view and data completeness view.  If you saw a red bar, you knew that that partition was unfulfilled.  But now, you might see a red bar, but if you subsequently reran that date manually, that red bar would give you the wrong impression!  So we've also essentially lost a well-defined view of data completeness.

      Like, instead of backfilling using a logical_date range, we could just target the specific DAG runs you want to re-run. Maybe, along with targeting specific run_ids, we could also allow users to re-run DAGs based on a logical_date range still. But instead of assuming each date has only one run, we'd have to figure out what to do if there are multiple runs for that date – like, re-run all of them, or maybe just the failed ones, or the latest one. Users can pick the backfill behavior with an API parameter.

      Yeah this falls into option 5: distribute the logic elsewhere.  So we could force the user to deal with this when backfilling, when setting depends on past, when doing xcom pull, perhaps also when triggering runs (if we want to continue to support prevention of concurrent runs), annnd we could add options in the grid view to apply some fuzzy logic to give the "current state" of each partition.  But to me this looks like scattering a bunch of noise and complexity around, and it seems simpler to just make it a dag attr (or timetable attr) to have partitions or something, but that's me 🤷.

      Thanks for the engagement.

      1. Before AIP-83, grid view was both execution history view and data completeness view.  If you saw a red bar, you knew that that partition was unfulfilled.

        Ah, if catchup=False and no backfill is run, then the gaps are also not visible in the grid view. So this is also a missing feature if users expect that grid should give an impression of completeness.

      2. Thanks for the detailed response and for walking through your perspective. There’s a lot to consider here, and I appreciate the thoughtfulness you’ve put into disambiguating the historical ambiguity around partition-driven workflows in Airflow.

        Any feature that controls DAG behavior, you could say the same thing. According to this logic we should not have features (wink). But srsly, every feature adds complexity; question is sorta, is it worth it.

        I didn’t initially think of this as introducing a new feature but rather as a way to preserve the old paradigm for users while enabling clarity for those transitioning to event-driven pipelines. That said, I recognize the value in your approach to continue supporting the old way while making this distinction explicit, especially for users who rely on partitioned DAGs. 

        You’re absolutely right that the drawbacks aren’t limited to backfill. I did notice others, such as the impact on depends_on_past and grid view semantics. For me, backfill stands out as particularly significant because it’s widely used, and any inconsistencies could break the mental model many users have about how Airflow operates. Not having backfills could also complicate the management of data pipelines.

        While I don’t have a clean solution, I want to highlight my main concern: the bifurcation this might create across the broader feature set. For example, grid view bars might represent different semantics depending on whether a DAG is partitioned. Similarly, the applicability of catchup or how backfills behave could differ based on the DAG configuration. While I agree with you that making this explicit would improve clarity, I wonder if there’s an opportunity to go further. Would it make sense to introduce a distinct workflow class—e.g., users could import a PartitionedDAG instead of the default DAG? (just an example of making it more explicit) By completely isolating this behavior, it becomes easier for users to understand and reason about differences.

        Additionally, I think this distinction should be reflected in the UI. If a DAG is partitioned, the grid view could clearly indicate its partition-driven nature with distinct visual cues or labels. This is something we could explore further based on user feedback for v3.0.

        Regarding the naming, I’m slightly more in favor of your suggestion to use "partition" instead of "data-interval." To me, "partition" makes the concept easier to grasp. I think "data-interval" as a term is applicable for non-partitioned DAGs as well, which could make it less intuitive for the old "partition-driven" workflows.

        1. All good points.  It's interesting: Jarek also floated the distinct dag class idea.  I think that the way in which that could be useful would be through its ability to change defaults, perhaps.  But I wonder whether there's enough difference here to merit fragmenting the DAG concept in this way.  I agree that the visual component is important.

        2. P.S. I expanded a bit on option 2.  It feels a bit like the dark horse.


  3. Thanks for writing this up Daniel Standish. Sorry for dropping so many comments – please take them as positive/constructive.

    Anyway I'd favor 3. Because:

    I see mainly 3 use cases for how Airflow in general can be used:

    1. Schedule driven - that is the main purpose of the discussion and the history where Airflow comes from. I see this as a valid use case, and a user base of ours is using Airflow exactly for this. I would not drop this use case. The schedule-driven cases can be divided (in my view) into some "relaxed" cases where you just want to have a cron but uniqueness is not a problem, and the cases which this paper is about, which need some logic about the partitioning and have a strong tie to a time range.
    2. External/Event triggered (e.g. via REST API) - that is why we have AIP-82 - and we use this in our environment also very often. The data-driven (==Asset) use cases I'd also put in here. They also might generate cases with non-unique dates.
    3. Manually triggered - that is why I contributed AIP-50. We have this very often and I assume this is an important use case.

    ...and as use case (1) is just a subset, I think it demands a dedicated feature but should not tie all other use cases into the uniqueness problem.

    And I am with Vikram that instead of "partitions" the term "intervals" might be a better wording. Also "partitions" are differently interpreted by DBAs as a kind of database-internal storage optimization, which is not what we are talking about.

    1. Sorry for dropping so many comments

      I appreciate the comments!  It's great to have the dialogue.  It helps to clarify things.  This is not a simple issue, and it's hard to get across all the nuance and complexity in a doc, and the question / answer / pushback / response etc helps clarify things I think.  And I really appreciate the serious engagement from everyone no matter what conclusions they come to.

      And I am with Vikram that instead of "partitions" the term "intervals" might be a better wording. Also "partitions" are differently interpreted by DBAs as a kind of database-internal storage optimization, which is not what we are talking about.

      Not sure if you were able to see my response to Vikram's comment.  But I try to explain there why I think partition is more accurate and why interval is sorta problematic.  Partitions are defined by scalar values that split a range.  Intervals have the problem that they can be distinct and overlap at the same time.


  4. I reviewed it in detail and spent about 2 hours last week talking to Daniel about it (smile) ...  I think it's a very important decision to make. I looked at the possible options and weighed all of them - I understand some of the constraints and where different people come from.

    And similarly to Daniel Standish (looking at the last comment) 2a. grew on me as a simple and very straightforward way to do it, and for now I see no drawbacks.

    • logical date can be made nullable and part of the unique constraint - this is already verified (and we are lucky we dropped MSSQL, because we would have had to implement some filtered indexes there to handle it, and there were many issues I saw with filtered indexes - including performance).
    • this is a very natural and simple way to indicate that a DAG is not "interval" bound - we do not need a new field or property (and we do not have to discuss the "partition" vs "interval" name for it) - simply the presence of a schedule which is non-interval bound (or simply set to None) is an indication that we can have logical_date null, and its behaviour is non-interval driven. Sounds like a very straightforward way of "marking" a DAG as non-interval bound.
    • we keep backwards compatibility and familiar behaviour, including a way to see completeness of the data in the UI. As Daniel indicated in his description, not having uniqueness does not show UI users completeness easily - we probably **could** add some extra UI layer on top to make up for it, but that would be a band-aid IMHO and would have many edge cases.


  5. Update. I looked a bit more closely at 2a. and I **think** I have a slight modification proposal (possibly 2c.) - rather than having "logical_date" be NULL based on the schedule, decide at DagRun creation time, depending on how the DAG run got triggered. If by schedule → set logical_date. If manual or by asset → do not set logical_date (optionally allow logical_date to be specified in the manual API trigger if the user wants to do it).

    I think that actually solves all the problems, and nicely allows having "mixed" schedules - both time and asset.

    1. Upvote for 2c.

      I was thinking of something hybrid between 2 and 3, with partition being one of the options auto|strict|none. We could use auto for warnings, strict for the traditional way, and none for the new way.
      But I like Jarek Potiuk's 2c idea of not fragmenting DAG behavior with too many options.

  6. So 2.b. is just to optionally allow users to trigger with no logical date. It sounds like 2.c. expands on that by making manual runs have no logical date and no data interval by default, and same for asset-triggered. Is that right?
    Note, there is nothing in 2.b. or 2.c. which would prevent us from adding timetables that eschew logical dates and data intervals – it still opens the door to it. It just doesn't also immediately add them. Is that all consistent with what you are saying?

    1. It's a bit more. It's also about a "different" kind of scheduling (asset scheduling particularly).

      • 2a) did not explain what happens when you have a "combined" schedule (Time and Asset); actually it implies that this combination should not be used.
      • 2b) did not clarify what happens when you trigger via asset, and the sentence was not completed (smile) "A more limited version of this one, nullable logical date, is that we don't actually add timetables that don't use logical date, but we"

      So maybe my 2c) is the same as 2b) - and I am perfectly fine if that is incorporated as clarification (also including the "combined schedules" case).

  7. Some emails and comments have piled up after my initial review and I (attempted to) digest all of them... I also would upvote 2c now. Maybe 2b for Airflow 3.0 and, if we see we need more features, then develop it in the direction of 2c. Especially UI guidance would be great. This would also include for me a kind of "completeness view", which might be something like the Grid View and Calendar View in the legacy UI. Maybe even the option to restrict the user from running something manual/irregular if the DAG is "strictly" tied to an interval (start/end).

  8. Daniel and I talked a bit earlier and we agree 2c should work. Jarek Potiuk you mentioned in a comment that a combination of 2b and 2c may be possible… what exactly would the combination look like?

    I’ll summarise briefly my understanding of the options so everyone’s on the same page:

    2c is to make logical date null if the DAG run is created manually (API, trigger from UI, or TriggerDagRunOperator), or triggered by an asset event. A run scheduled by time (either “normally” or via backfill/catchup) still has a logical date like it does now.

    2b is to allow the user to add a flag (on the DAG?) to indicate that logical date should always be null for all runs of that DAG.

    From what I can tell, the two do not logically conflict and can be trivially combined. However, we must do 2c in 3.0 since it is a breaking change – manual runs currently do have a logical date (although not a very useful one), but they won’t after the change. 2b is the one we can do in a minor release since it uses a flag and does not change existing behaviour if you do not use the flag (which existing DAGs wouldn’t).

    One thing to note is that making the logical date null means we’ll need to make some changes to the UI; this value is used in many places to represent a DAG run, such as the last run and next run of a DAG. If we make the value null, we’ll need another way to represent that run. I don’t think we can use the run ID since that value does not exist (yet) for the next-run use case. The run_date value proposed above is a reasonable alternative, but:

    • Using it unconditionally for all DAGs is a breaking change (it is a different date from logical date)
    • Using it only if logical date is null may be quite confusing since you’d need to consider how the run was run to make sense of the value

    Since we are building a new UI now, maybe one possibility would be to unconditionally use run_date in the new UI? The old UI would keep using logical date if available, and use run_date otherwise—slightly confusing, but maybe acceptable since it’s going away anyway.


    Another thing I mentioned elsewhere in a comment, but want to repeat here for visibility, is that I think eventually we should totally remove logical date from the DagRun table. The value is currently only used to logically represent a DAG run, and would be entirely useless if we can move to use run_date completely, and tell users to always use the data interval fields instead (or partitions—when we have them).

    1. That's pretty much exactly what I had in mind. 2c is as much breaking as I would like to see in 3.0 - because the breakage is "logical" and it's mostly about the operational side of things. No need to modify your DAGs; they will behave the same (though their representation in the DB will differ from 2.0) - but it does not break the "regular" behaviour and the uniqueness guarantees. It will also naturally make things happen. And yes - marking a dag as "not having logical date at all" could indeed be added later as a feature.

        1. If we make null the default for manual runs, but the user references logical date (or its derivatives such as data interval start, etc.) in the task (e.g. a template), then users' dags will break.  So users in some cases would have to change either their dags or their behavior (e.g. specify a logical date when triggering a manual run).

        Similar story for asset-triggered.  

        So I guess we can understand 2.b. and 2.c. to be very similar; only, 2.c. is more aggressive in that it changes the default.  2.b. would default to utcnow() and 2.c. would default to null.

        That sound right to folks? 

        What is the right thing to do?

        It is possible, though I hesitate to even suggest it, that we could, when the logical date is null on the Python object, default to run_after – just for backcompat. But... that doesn't seem great.

        1. Yeah if we do it for 3.0 I’d much prefer we just make logical date None, or even not exist at all (raising KeyError) when it’s accessed as context variable. Maybe we have a way to set logical date to now() when you manually trigger a run from the web form, but I don’t feel even that is particularly necessary—you can just pick a value explicitly in the datepicker.

  9. Another issue similar to the UI representation one is how we generate run_id. Currently it uses logical_date, but we need another value if it’s null. Do we switch to always use run_date instead? Or do we still use logical date if it’s available, and only use run_date if needed? Either way can be potentially confusing. Personally I’m leaning toward always using run_date since it’s cleaner, but this is another technically breaking change. (There’s also a compatibility issue to consider since run ID generation logic can be overridden in a timetable.)

    1. One problem with using a timestamp in run_id is that the run_id still must be unique!  So, if we use another date, then a user who creates too many dag runs at the same time could still bump against a constraint issue.

      1. Couple of ideas:

        For the run_id → How about LOGICAL_DATE_<UUID>
          - That will keep most compatibility (even if someone parses it now, it will mostly **work** I guess)
        For the logical date - yes, IMHO we should only keep the interval. Maybe even replace logical_date with a string field of the form `<START_INTERVAL>-<END_INTERVAL>` - we could name it interval and alias it with logical_date.
        I think also what is important is how we sort stuff in the grid - because when a Dag is interval-bound, what you should see is sorting by intervals, not something else - and I think we should agree how we sort stuff for non-interval dags and how we show manual runs. If we keep the logical date in <START_DATE>-<END_DATE> format, that would sort pretty much naturally by run_id and will be stable (which will be <START_INTERVAL>_<END_INTERVAL>_UUID).


      2. Yeah that’s a problem too, although I’m less worried about that since we can always just append a hash to it.

      3. Actually… do we really need run_id to be human-readable? Or do we need it even at all?

        1. YES. We NEED run_id as a primary key to access any run. It must be unique.

          I favor having a date in it and having it "some kind of human understandable" because it is added to all URLs and might appear in logs as the identifier of a run. If it is just a hash, nobody can remember it. Today you can manually set it.

          Semantics are not critical, though I assume many people in the wild made qualifiers around this because it starts with `scheduled__` or `manual__` if not set.

          GitHub just uses an incremented integer; even this would be sufficient to make it "manual__123355345".

          1. DagRun already has id. This is the actual primary key in the database, and can be used to uniquely select a run. For URL, we can do something like /dag_runs/123/manual_xxxxx, where 123 is the actual unique key, and the part after is simply a human-readable representation and not actually a selector. This is the approach used by many modern platforms, such as StackOverflow and Discourse. So the question here really is if we need a human readable value to be unique—technically the answer is NO if you design the system carefully.

            Anyway, this is probably too much a side topic to be discussed here. The best approach in the scope of this AIP (amendment) is likely to just add a suffix (hash or whatever) to our existing scheme when needed. Whether we can do away with run_id altogether should be a discussion for a much later time, after 3.0.

            1. Yes - I also prefer to turn run_id into a human-readable representation only. For APIs generally it's good if we use "real" unique ids. And possibly we can then make them overridable by the user - effectively becoming the "name" of the run - which is a nice feature.

              1. Shouldn't an identifier, i.e. an "id", be a unique identifier?  If we want to add a name feature, shouldn't we call it something else?


                1. Yes. We should likely alias it if we do not want to break too many existing DAGs.

  10. OK Tzu-ping Chung Jarek Potiuk Jens Scheffler – we need to start finalizing the proposal.

    I will make a new document that clarifies exactly the proposal we will attempt to move forward with.  I will note the points of agreement and the points where there are open questions.  Hopefully each of you can review and weigh in to ensure that we get something we can all agree on.

    1. OK Jens Scheffler Jarek Potiuk Tzu-ping Chung I have attempted to formalize the "option 2" proposal here: Option 2 clarification doc WIP

      I identified 6 questions of controversy on that document.   Please review and comment.  If I missed any questions that need addressing, feel free to add or comment.