Overview
Airflow has many great features and is a fast-moving project. As a result, there are some common pitfalls worth noting.
...
Details
We do support more than one DAG definition per Python file, but it is not recommended: we would like better isolation between DAGs from a fault and deployment perspective, and multiple DAGs in the same file work against that. For now, make sure that each DAG object is in the global namespace. You can use the globals() dict, as in:
globals()[dag_id] = DAG(...)
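As a minimal sketch of this pattern, the loop below binds several generated objects to their own top-level names. So that the snippet runs without an Airflow installation, StandInDAG, register_dags, and the DAG ids are hypothetical names introduced here. In a real DAG file the factory would return airflow.DAG objects and the namespace passed in would be globals().

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass
class StandInDAG:
    """Hypothetical stand-in for airflow.DAG so the sketch runs without Airflow."""
    dag_id: str


def register_dags(namespace: Dict[str, object],
                  dag_ids: Iterable[str],
                  make_dag: Callable[[str], object]) -> None:
    """Bind each generated DAG to a top-level name in the given namespace."""
    for dag_id in dag_ids:
        # Same idea as: globals()[dag_id] = DAG(...)
        namespace[dag_id] = make_dag(dag_id)


# In a real DAG file, pass globals() instead of this plain dict.
module_namespace: Dict[str, object] = {}
register_dags(module_namespace, ["sales_daily", "sales_weekly"], StandInDAG)
print(sorted(module_namespace))
```

Passing the namespace in explicitly keeps the helper easy to test. The effect on the DAG file is the same as assigning into globals() directly.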
Configuring parallelism in airflow.cfg
- parallelism = the number of physical Python processes the scheduler can run
- dag_concurrency = the number of task instances (TIs) allowed to run at once per DAG
- max_active_runs_per_dag = the number of DAG runs allowed to run at once per DAG
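To make the relationship between these settings concrete, here is a small sketch that parses a sample configuration fragment with Python's standard configparser. The values are illustrative only, and placing the three options in the [core] section is an assumption about the airflow.cfg layout, so check your own file.

```python
import configparser

# Illustrative values only; tune them for your own deployment.
# Assumption: the three options live in the [core] section of airflow.cfg.
SAMPLE_CFG = """
[core]
parallelism = 32
dag_concurrency = 16
max_active_runs_per_dag = 16
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE_CFG)
core = parser["core"]

parallelism = core.getint("parallelism")
dag_concurrency = core.getint("dag_concurrency")
max_active_runs_per_dag = core.getint("max_active_runs_per_dag")

# A single DAG cannot run more tasks at once than the installation-wide limit,
# so the smaller of the two values is the effective per-DAG ceiling.
effective_per_dag_limit = min(parallelism, dag_concurrency)
print(effective_per_dag_limit)
```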
Understanding the execution date
- Airflow was developed as a solution for ETL needs. In the ETL world, you typically summarize data. So, if I want to summarize data for 2016-02-19, I would do it at 2016-02-20 midnight GMT, which would be right after all data for 2016-02-19 becomes available.
- This date is available to you in both Jinja and a Python callable's context in many forms, as documented here. Note that ds refers to date_string, not date start, as may be confusing to some.
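The arithmetic behind the 2016-02-19 example can be sketched with the standard library alone. The daily interval and the variable names are assumptions chosen for illustration. Inside Airflow you would read these values from the template context or the callable's context rather than compute them yourself.

```python
from datetime import datetime, timedelta, timezone

# Execution date: the day whose data is being summarized.
execution_date = datetime(2016, 2, 19, tzinfo=timezone.utc)

# Assumption: a daily schedule, so the run can start once that day is over.
schedule_interval = timedelta(days=1)
earliest_run_time = execution_date + schedule_interval

# ds is the execution date rendered as a YYYY-MM-DD date string.
ds = execution_date.strftime("%Y-%m-%d")

print(ds)
print(earliest_run_time.isoformat())
```

The run that processes 2016-02-19 therefore starts no earlier than midnight on 2016-02-20, while ds still reads 2016-02-19.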
Run your entire Airflow infrastructure in UTC. Airflow was developed at Airbnb, where every system runs on UTC (GMT). As a result, various parts of Airflow assume that the system (and database) timezone is UTC (GMT). This includes:
- Webserver
- Metadata DB
- Scheduler
- Workers (possibly)
When setting a schedule, align the start date with the schedule. If a schedule is to run at 2am UTC, the start_date should also be at 2am UTC.
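A quick way to sanity-check this alignment before deploying a DAG is sketched below. The is_aligned helper and the sample dates are hypothetical and not part of Airflow. The check assumes a once-a-day schedule at a fixed UTC time.

```python
from datetime import datetime, time, timezone


def is_aligned(start_date: datetime, scheduled_time: time) -> bool:
    """Return True when start_date falls exactly on the daily schedule time."""
    # datetime.time() drops the tzinfo, so compare against a naive time.
    return start_date.time() == scheduled_time


# The schedule fires at 02:00 UTC every day.
daily_run_time = time(hour=2)

aligned_start = datetime(2016, 2, 19, 2, 0, tzinfo=timezone.utc)
misaligned_start = datetime(2016, 2, 19, 0, 0, tzinfo=timezone.utc)

print(is_aligned(aligned_start, daily_run_time))
print(is_aligned(misaligned_start, daily_run_time))
```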
Bash Operator - Jinja templating and the bash commands
- You need to add a space after the script name when you directly call a Bash script in the bash_command attribute of BashOperator. This is because Airflow otherwise tries to load the script as a Jinja template, which will fail.
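The trailing-space rule can be captured in a small helper, sketched here without importing Airflow. The safe_bash_command function and the script path are hypothetical, and the extension list is an assumption about which suffixes BashOperator treats as template files.

```python
# Assumption: these are the extensions BashOperator treats as template files.
TEMPLATE_EXTENSIONS = (".sh", ".bash")


def safe_bash_command(command: str) -> str:
    """Append a trailing space when the command ends in a templated extension."""
    if command.endswith(TEMPLATE_EXTENSIONS):
        # The trailing space stops the value from looking like a template file.
        return command + " "
    return command


# Hypothetical script path, used only for illustration.
print(repr(safe_bash_command("/opt/scripts/cleanup.sh")))
print(repr(safe_bash_command("echo hello")))
```

The result would then be passed as bash_command when constructing the operator. Commands that do not end in a script extension are returned unchanged.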
...