Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Overview

Airflow has a lot of great features and is a fast moving project. As such, there are some common pitfalls that are worth noting.

...

Details

  • We do support more than one DAG definition per python file, but it is not recommended as we would like better isolation between DAGs from a fault and deployment perspective and multiple DAGs in the same file goes against that. For now, make sure that the dag object is in the global namespace : you can use the globals dict as in globals()[dag_id] = DAG(...)

  • Configuring parallelism in airflow.cfg

    • parallelism = number of physical python processes the scheduler can run
    • dag_concurrency = the number of TIs to be allowed to run PER-dag at once
    • max_active_runs_per_dag = number of dag runs (per-DAG) to allow running at once
  • Understanding the execution date

    • Airflow was developed as a solution for ETL needs. In the ETL world, you typically summarize data. So, if I want to summarize data for 2016-02-19, I would do it at 2016-02-20 midnight GMT, which would be right after all data for 2016-02-19 becomes available.
    • This date is available to you in both Jinja and a Python callable's context in many forms as documented here. As a note ds refers to date_string, not date start as may be confusing to some.
  • Run your entire Airflow infrastructure in UTC. Airflow was developed at Airbnb, where every system runs on UTC (GMT). As a result, various parts of Airflow assume that the system (and database) timezone is UTC (GMT). This includes:

    • Webserver
    • Metadata DB
    • Scheduler
    • Workers (possibly)
  • When setting a schedule, align the start date with the schedule. If a schedule is to run at 2am UTC, the start-date should also be at 2am UTC

  • Bash Operator - Jinja templating and the bash commands

    • Described here : see below. You need to add a space after the script name in cases where you are directly calling a bash scripts in the bash_command attribute ofBashOperator - this is because the Airflow tries to apply a Jinja template to it, which will fail.

...