Status

State: Draft

Discussion Thread (GitHub): https://github.com/apache/airflow/issues/13364

Created: 2021-01-08

Motivation

Once you reach the point where you want to run different types of Python tasks on one cluster, with multiple teams working on it, you need to start splitting the business logic code into separate Python packages, to allow better version control and unit testing outside of the Airflow scope.

The issue with having a single virtual environment (or globally installed packages) is that you run into many version conflicts between packages, and every upgrade requires an Airflow restart.

Considerations

What change do you propose to make?

In order to solve this issue, I suggest introducing venv management as part of the Celery and Local executors' lifecycle.

Option A - make the venv config part of the DAGs code:

Each task can include a venv configuration inside executor_config, consisting of the venv name and a list of packages.

Once the executor receives the task, it checks whether the venv exists on the machine, takes a lock to make sure it is the only process currently updating the venv, installs the packages, and then runs the airflow run command from inside the venv.
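The steps above could be sketched as follows. This is a minimal illustration, not an agreed-upon schema: the "venv" key inside executor_config and the helper name are assumptions. The idea is to derive a stable on-disk path from the venv name and the sorted package list, so two tasks requesting the same combination reuse one environment, while any change in packages yields a fresh path.

```python
# Illustrative sketch of Option A (the executor_config "venv" schema is an
# assumption made for this example, not part of the proposal text).
import hashlib


def resolve_venv_path(base_dir: str, executor_config: dict) -> str:
    cfg = executor_config["venv"]
    # Hash the sorted requirements so reordering packages does not
    # produce a duplicate environment for the same set of packages.
    digest = hashlib.sha256(
        "\n".join(sorted(cfg["packages"])).encode()
    ).hexdigest()[:12]
    return f"{base_dir}/{cfg['name']}-{digest}"


executor_config = {
    "venv": {"name": "etl", "packages": ["pandas==1.3.0", "requests"]}
}
path = resolve_venv_path("/opt/airflow/venvs", executor_config)
```

Before running a task, the executor would take a lock on this path, create the venv and install the packages if the path does not exist yet, and then invoke the task with the venv's interpreter.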

Option B - manage venv as an Airflow DB model

Using the UI, CLI, and API, the user will be able to create a venv by providing a name and a list of requirements. Each task can then point to a venv id, and the executor will create the venv if needed and run the task in it.
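The execution step of Option B might look like the sketch below. The venv path and the helper name are illustrative assumptions; only `airflow tasks run` is an existing CLI command. Once the executor has resolved the venv record for a task, it can launch the task with that venv's interpreter, so the task imports the venv's site-packages.

```python
# Hypothetical sketch: building the command an executor could use to run a
# task inside a resolved venv (paths and names are assumptions).
import shlex


def build_run_command(venv_path: str, dag_id: str, task_id: str, run_id: str) -> str:
    # Using the venv's own interpreter makes the task see the venv's packages.
    return shlex.join([
        f"{venv_path}/bin/python", "-m", "airflow",
        "tasks", "run", dag_id, task_id, run_id,
    ])


cmd = build_run_command(
    "/opt/airflow/venvs/etl", "my_dag", "extract", "manual__2021-01-08"
)
```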

What problem does it solve?

  1. Allows users to easily build a complex Airflow cluster with multiple packages.
  2. Creates a standard best practice for how utility functions should be written and how to keep business logic outside of the Airflow DAG.

Why is it needed?

It's currently almost impossible to achieve this without complex deployment scripts that update all the workers and reload the Airflow processes (and with the LocalExecutor this means running tasks need to be killed).

While it is possible to achieve this by using the CeleryExecutor with execute_tasks_new_python_interpreter=True and creating a Celery worker per venv, in a big cluster this can lead to a large number of Celery workers (which consume system resources), and it still forces the user to use their own deployment methods to update all the venvs in the cluster.

Are there any downsides to this change?

For both options:

  1. This solution can make task startup a bit slower.
  2. The venv must contain the Airflow packages (which can be solved by using system-site-packages or by copying a venv).

For option A:

  1. There is no easy way to resolve conflicts: multiple tasks can override each other's package versions, which can create confusion and make debugging harder.

Which users are affected by the change?

No one, as long as they don't configure tasks with venvs.

How are users affected by the change? (e.g. DB upgrade required?)

A DB upgrade will be required to create the venv table.

Other considerations?

N/A

What defines this AIP as "done"?

Users can create venvs and run tasks inside them using Airflow.


4 Comments

  1. This is a great idea.

    It's actually one of the downsides of the latest Airflow versions when compared with other platforms (e.g. Prefect).

    I have a use case for this. Having the ability to rely on docker images with custom python dependencies helps a lot and promotes isolation between DAGs.

    Currently, if one has a custom/internal Python/pip package that they want to use in multiple DAGs (with different versions), it's not possible. Why? Not every organization centralizes their DAGs in a single repo, so each repo with DAGs might move at a different pace.

    The current (Airflow 2.2) solution is to use the DockerOperator, but that's only applicable to tasks, not to generic Python code that generates a DAG dynamically.
    For more context and the use case, see this Slack thread: https://apache-airflow.slack.com/archives/CSS36QQS1/p1633542812490400

  2. This is not what you are looking for: DAG parsing happens independently of task execution. If you want to use a different env at parsing time, the changes proposed in AIP-43 DAG Processor separation are your friend, not this one.

  3. I think the problem statement described here was (or soon will be) handled in an "incremental" way, by implementing a few other changes that make it possible to address it differently. I wonder if we should abandon this AIP?

    • Part of the problem stated here has been addressed and will be released in Airflow 2.4.0 as the ExternalPythonOperator (https://github.com/apache/airflow/pull/25780). With this operator (and the TaskFlow API especially) you will be able to utilize multiple Python virtual environments pre-created in the environment (while you can keep the PythonVirtualenvOperator for dynamic creation of venvs). While this does not address complete, dynamic environment re-creation, it does provide a way to avoid a ballooning number of Celery workers when you have many environments, which you can use via @task.external_python tasks without the overhead of dynamically creating virtual environments.
    • The new TaskFlow APIs, combined with using Hooks rather than operators, allow both the PythonVirtualenvOperator and the ExternalPythonOperator to use most of the integrations (via TaskFlow-decorated methods) by using the right Hooks. The basic approach has been described in https://medium.com/apache-airflow/generic-airflow-transfers-made-easy-5fe8e5e7d2c2, and while it describes generic transfers, it is equally applicable to running a custom integration/Hook in @task.virtualenv or @task.external_python decorated functions, which are now fully-fledged tasks.
    • Also, the (near future, I hope) AIP-46 Runtime isolation for Airflow tasks and DAG parsing addresses the problem of both parsing and executing DAGs using the same (different for each DAG) execution environment (based on Docker images). And AIP-48 Data Dependency Management and Data Driven Scheduling should help users break their DAGs into smaller, inter-dependent workflows, which will make the AIP-46 runtime isolation model far more attractive (a big workflow might contain multiple DAGs, and each DAG might run in a different environment/container).

    While none of the changes above addresses the problem fully, I feel that when you combine the possibilities they create, the need for the dynamically created venvs described here (not in too much detail, but still) is greatly diminished.

    Shall we move this into "abandoned"? Does anyone have a good reason why not?

  4. Moving it to abandoned then. And we can always bring it back.