Status

State: Draft
Discussion Thread

Vote Thread
Vote Result Thread
Progress Tracking (PR/GitHub Project/Issue Label)
Date Created

Created

Version Released
Authors



Motivation

This is a proposal to add rule-based calendaring capabilities to provide complex job scheduling.

Examples:

  • First three week days in each month, respecting US holidays;
  • Exclude Saturday in each week, except if it's on the first day of the month;
  • Last weekday of the month.

Considerations

What change do you propose to make?

Add a new timetable which works like CronDataIntervalTimetable, except that the cron-expression and timezone arguments are replaced with a compositional interface that builds on three functions:

  • cron – e.g. cron("0 3 * * *", "Europe/Berlin"), nothing surprising here;
  • days – e.g. days("D1", "D2", "THU-SAT", "4>", "L1"); the first two weekdays, Thursday through Saturday, day 4 of the month or else the next calendar day, and finally the last calendar day of the month;
  • holidays – e.g. ~holidays("US"); every day except US holidays (based on the holidays Python package).

The "D" modifier counts calendar days (see below) forwards from the start of the month, while "L" counts calendar days backwards from the end of the month.
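To make the token forms above concrete, here is a minimal sketch of how the string arguments to days might be classified. The names parse_day_token and Token, and the kind labels, are hypothetical and not part of the proposal; the exact grammar is an assumption inferred from the examples "D1", "L1", "4>" and "THU-SAT".

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    kind: str      # hypothetical label for the token form
    value: tuple   # numeric payload extracted from the token


WEEKDAYS = ("MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN")


def parse_day_token(text: str) -> Token:
    """Classify one days(...) argument (assumed grammar, see lead-in)."""
    # "D<n>" counts forwards from the start of the month, "L<n>" backwards.
    m = re.fullmatch(r"([DL])(\d+)", text)
    if m:
        kind = "nth_from_start" if m.group(1) == "D" else "nth_from_end"
        return Token(kind, (int(m.group(2)),))

    # "<n>>" means day n of the month, or else the next calendar day.
    m = re.fullmatch(r"(\d+)>", text)
    if m:
        return Token("day_or_next", (int(m.group(1)),))

    # "THU" or "THU-SAT": a single weekday or an inclusive weekday range.
    m = re.fullmatch(r"([A-Z]{3})(?:-([A-Z]{3}))?", text)
    if m and m.group(1) in WEEKDAYS and m.group(2) in (None, *WEEKDAYS):
        start = WEEKDAYS.index(m.group(1))
        end = WEEKDAYS.index(m.group(2) or m.group(1))
        return Token("weekday_range", (start, end))

    raise ValueError(f"unrecognized day token: {text!r}")
```

Weekday indices follow Python's datetime.date.weekday() convention (Monday is 0), which keeps a later evaluation step simple.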

For example, days("MON-FRI") & ~(holidays("US") | days("L1")) runs every weekday at midnight, except on US holidays and on the last calendar day of the month.
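The composition with "&", "|" and "~" could be modelled as predicates over dates that overload Python's bitwise operators. The sketch below is illustrative only: DayRule, weekdays_mon_fri, last_calendar_day and fixed_holidays are hypothetical stand-ins for the proposed days and holidays functions, the holiday table is hard-coded rather than taken from the holidays package, and the "L1" stand-in uses the plain last day of the month for brevity.

```python
import calendar
from datetime import date


class DayRule:
    """A predicate over dates that composes with &, | and ~."""

    def __init__(self, predicate):
        self._predicate = predicate

    def __call__(self, day: date) -> bool:
        return self._predicate(day)

    def __and__(self, other):
        return DayRule(lambda d: self(d) and other(d))

    def __or__(self, other):
        return DayRule(lambda d: self(d) or other(d))

    def __invert__(self):
        return DayRule(lambda d: not self(d))


def weekdays_mon_fri() -> DayRule:
    # Stand-in for days("MON-FRI"): Monday is 0, Friday is 4.
    return DayRule(lambda d: d.weekday() < 5)


def last_calendar_day() -> DayRule:
    # Simplified stand-in for days("L1"): the plain last day of the month.
    return DayRule(lambda d: d.day == calendar.monthrange(d.year, d.month)[1])


def fixed_holidays(dates) -> DayRule:
    # Stand-in for holidays("US"), using a hard-coded table (assumption).
    table = frozenset(dates)
    return DayRule(lambda d: d in table)


us = fixed_holidays({date(2024, 7, 4)})
rule = weekdays_mon_fri() & ~(us | last_calendar_day())
```

With this rule, rule(date(2024, 7, 3)) accepts an ordinary Wednesday, while the holiday on 4 July, the Saturday on 6 July and the month end on 31 July are all rejected.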

Importantly, a calendar day is a weekday by default, but every "&" narrows this definition. That is, ~holidays("US") & days("D1", "D2") means the first two weekdays in the month that are not US holidays.
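One way to read this narrowing: "D<n>" and "L<n>" index into the list of days that survive every other constraint, rather than into the raw days of the month. The sketch below illustrates that reading under stated assumptions; eligible_days, nth_days and the HOLIDAYS table are hypothetical names, and the single holiday is hard-coded instead of coming from the holidays package.

```python
import calendar
from datetime import date


def eligible_days(year, month, is_eligible):
    """Return the days of the month accepted by is_eligible, in order."""
    last = calendar.monthrange(year, month)[1]
    days = (date(year, month, n) for n in range(1, last + 1))
    return [d for d in days if is_eligible(d)]


def nth_days(year, month, is_eligible, forward=(), backward=()):
    """Pick 1-based "D<n>" (forward) and "L<n>" (backward) positions."""
    pool = eligible_days(year, month, is_eligible)
    picked = {pool[n - 1] for n in forward if 1 <= n <= len(pool)}
    picked |= {pool[-n] for n in backward if 1 <= n <= len(pool)}
    return sorted(picked)


HOLIDAYS = {date(2024, 1, 1)}  # hypothetical stand-in for holidays("US")


def weekday(d):
    # Default meaning of "calendar day": Monday through Friday.
    return d.weekday() < 5


def weekday_non_holiday(d):
    # Narrowed meaning after "& ~holidays(...)".
    return weekday(d) and d not in HOLIDAYS


default_first_two = nth_days(2024, 1, weekday, forward=(1, 2))
narrowed_first_two = nth_days(2024, 1, weekday_non_holiday, forward=(1, 2))
```

Since 1 January 2024 is a Monday, the default selection is 1 and 2 January, while the narrowed selection shifts to 2 and 3 January.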

There is some overlapping functionality between cron and days; for example, both can restrict runs to particular days of the week. Also, if multiple cron-expressions are provided, in terms of the implied data interval, they combine like vertical lines on a timeline.
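One plausible reading of "vertical lines on a timeline" is that every fire time from every cron-expression marks a boundary, and each data interval spans two consecutive boundaries. The sketch below assumes that reading; it is not confirmed by the proposal. The helper data_intervals is hypothetical, and the fire times are written out by hand rather than computed from cron-expressions.

```python
from datetime import datetime, timezone


def data_intervals(*fire_time_lists):
    """Merge fire times from several schedules into consecutive intervals."""
    # Every fire time is a boundary; duplicates collapse into one.
    boundaries = sorted(set().union(*fire_time_lists))
    # Each interval runs from one boundary to the next.
    return list(zip(boundaries, boundaries[1:]))


def utc(day, hour):
    return datetime(2024, 3, day, hour, tzinfo=timezone.utc)


# Hand-written fire times standing in for "0 3 * * *" and "0 15 * * *".
at_three = [utc(1, 3), utc(2, 3)]
at_fifteen = [utc(1, 15)]

intervals = data_intervals(at_three, at_fifteen)
```

Here the two schedules yield two intervals: 03:00 to 15:00 on 1 March, then 15:00 on 1 March to 03:00 on 2 March.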

Omitting a cron-expression means "midnight UTC", with "every day" implied as the calendar.

What problem does it solve?

The timetables currently shipped with Airflow probably work fine for most data-related workloads, but more business-related scheduling often requires calendaring capabilities.

Why is it needed?

Cron-expressions can only take us so far. Being able to use composition to form a timetable is arguably much more flexible, and built-in support for holidays can help meet a lot of real-world scheduling needs, perhaps mostly relevant to business-related workflows (e.g. don't send a low-importance email out on Christmas Day).

To some extent, this scheduling can be implemented using Python logic, skipping a task if it falls on a holiday. But we don't want to run DAGs on days when they're not supposed to run: it clutters the UI and ruins the statistics.

Are there any downsides to this change?


Which users are affected by the change?


How are users affected by the change? (e.g. DB upgrade required?)


Other considerations?


What defines this AIP as "done"?

The proposal as outlined through the examples above is implemented.