This proposal was deprecated by AIP-10 Multi-layered and multi-stage official Airflow CI image, where the CI image is embedded into the main Airflow repository.

Right now, every time we pull our CI images from DockerHub, we then need to pip install all dependencies on our local/Travis machine. This can take anywhere from 3 to 5 minutes. It is also extremely wasteful, as these packages have to be redownloaded every time the CI script is run.

Proposal: Pre-bake all Airflow dependencies into the airflow-ci image (the CI image is part of the main repo per AIP-10)
Step 1: Every time dependencies change, the PR owner creates a feature branch in the airflow-ci repo

We can set up the system so that a user who wants to change the Airflow CI image creates a feature branch in https://github.com/apache/incubator-airflow-ci. This will generate a Docker image tagged with the name of that feature branch. We specify a DockerHub rule so that only the master branch can create numbered tags; when the PR is accepted, a new numbered tag is created. This ensures that old builds are not broken, since they all reference the old tag.
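The intended tagging behaviour could look roughly like the following sketch; the image name apache/airflow-ci, the branch name, and the numbered tag are purely illustrative assumptions, not actual build rules.

```bash
# Hypothetical sketch only: image name, branch name and tag number are made up.

# A feature branch in airflow-ci is built and pushed under the branch name,
# so the author (and reviewers) can test against it:
docker build -t apache/airflow-ci:bump-pandas .
docker push apache/airflow-ci:bump-pandas

# Only a merge to master produces a new numbered tag; older builds keep
# working because they still reference the previous number:
docker build -t apache/airflow-ci:42 .
docker push apache/airflow-ci:42
```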

Step 2: Update the airflow-ci tag in the airflow repo

With the airflow-ci tag changed, users will be able to test their code with the newest dependencies. While the PR is pending, they can test with their feature-branch image, but PRs will not be accepted until the new official tag has been created.
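For illustration, the two phases might look like this; the image name and the tag values are assumptions.

```bash
# Hypothetical sketch: image name and tag values are assumptions.

# While the airflow-ci PR is pending, contributors test against the
# feature-branch image:
docker pull apache/airflow-ci:bump-pandas

# Once the airflow-ci PR is merged and a new numbered tag exists, the airflow
# repo's CI configuration is updated to reference that tag instead:
docker pull apache/airflow-ci:42
```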

Benefits:

Faster Iterative testing:

Once the image has been downloaded once, users will be able to re-run it as many times as they like without needing to redownload all Airflow dependencies from scratch. (Already provided by the CI image from AIP-10.)
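A sketch of that iterative loop, assuming a hypothetical image tag, mount path, and test command: after the first pull the image is cached locally, so re-runs never reinstall dependencies.

```bash
# Hypothetical sketch: image tag, mount path and test command are assumptions.
docker pull apache/airflow-ci:42          # dependencies downloaded once, then cached
docker run -it --rm \
    -v "$PWD":/opt/airflow -w /opt/airflow \
    apache/airflow-ci:42 pytest tests/    # re-run as often as needed, no reinstall
```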

Faster CI:

If we can split the dependencies into multiple image layers and keep individual layers small, these layers can be downloaded in parallel, making for fast loading into Travis. (Already provided by the CI image from AIP-10.)
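One rough way to check that the layer split stays reasonable, assuming the same hypothetical image tag: `docker pull` fetches independent layers in parallel, and `docker history` shows how much each layer contributes.

```bash
# Hypothetical image tag; both commands are standard Docker CLI.
docker pull apache/airflow-ci:42      # layers are downloaded in parallel
docker history apache/airflow-ci:42   # lists each layer and its size
```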


7 Comments

  1. As explained in AIP-4 Automation of System Tests [Deps: AIP-47] and in the devlist thread, there is an already existing environment: https://github.com/PolideaInternal/airflow-breeze . It is an alternative development/CI environment which already implements a lot of what is described here.

    However, it is closely tied to Google Cloud Platform, as it uses Google Cloud Build for running System Tests and for storing images in the registry, so we might consider it as an inspiration or a base for further discussions.

    It already implements an Airflow image that can be used both locally and for CI testing and has the full development cycle covered, as explained in the Design Doc.


  2. Daniel Imberman Looks good to me. I'm only worried that, even though it removes friction regarding the time it takes for a developer to test their changes, it adds friction in the number of repositories that need to be modified (1 PR becomes at least 2 PRs). But I think it's worth it. 

  3. How about simplifying it even more and embedding the Dockerfile (and the Travis build scripts to build it) into the main incubator-airflow repo?

    You can do `docker pull` followed by `docker build --cache-from` and then `docker push`; that way you effectively get very good caching support. This is something I do in https://github.com/PolideaInternal/airflow-breeze/blob/master/cloudbuild.yaml to build my "airflow-breeze" image. I could easily imagine a workflow where the first thing the Travis CI yaml does is pull + build with --cache-from + push the image. The image can then be used to perform further steps and parallel test execution. You can even specify multiple --cache-from images, which will be useful in our case for branch support.

    If the Dockerfile does not change (nor its ADDed files/dirs), such a flow is very fast (build and push do pretty much nothing) and it is equivalent to just doing a docker pull.

    What I imagine here is that the Travis build could first do:

    • docker pull <airflow-ci-image>:master
    • docker pull <airflow-ci-image>:BRANCH
    • docker build --cache-from <airflow-ci-image>:master --cache-from <airflow-ci-image>:BRANCH -t <airflow-ci-image>:BRANCH .
    • docker push <airflow-ci-image>:BRANCH

    Then Travis CI uses <airflow-ci-image>:BRANCH to run the tests.

    This way any change to the branch Dockerfile will be cached between commits in that BRANCH.

    This way we avoid a complex workflow: a change that requires some new requirement will simply be part of the same PR, and it will be automatically built and pushed to the :master image when it is merged to master (a concrete sketch of this flow follows this comment).
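A minimal sketch of the pull/build/push flow described in the comment above, as it might run at the start of a Travis job. The image repository name is an assumption; TRAVIS_BRANCH is the branch name Travis exposes to builds.

```bash
# Sketch only: the image repository name is an assumption, not an agreed-upon name.
IMAGE=apache/airflow-ci
BRANCH="${TRAVIS_BRANCH:-master}"

docker pull "$IMAGE:master" || true      # warm the cache; tolerate a missing tag
docker pull "$IMAGE:$BRANCH" || true
docker build \
    --cache-from "$IMAGE:master" \
    --cache-from "$IMAGE:$BRANCH" \
    -t "$IMAGE:$BRANCH" .
docker push "$IMAGE:$BRANCH"
# Later Travis stages then run the tests inside "$IMAGE:$BRANCH".
```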

  4. While implementing AIP-10 Multi-layered and multi-stage official Airflow CI image, I have a working proposal (that I am going to share soon with a request for comments) for such a Dockerfile in the main repo that can also be used for CI integration - it implements most of the things mentioned here:

    • The Dockerfile is part of the main Airflow repo
    • It is layered (and multi-stage) and optimised for fast building in many cases (including building it very quickly during the CI build, using the Docker registry as a cache)
    • You can build the whole pipeline to modify and test everything using your own DockerHub account and use it for your development (I am actually using my own now for testing).
    • Your changes in Dockerfile can be bundled in the same commit with your changes to Airflow
    • No need to have separate airflow-ci repository
    • Last but not least - it also has a rather user-friendly local development environment that automatically takes care of local image management (i.e. it detects when important files change and triggers an incremental rebuild of the images when needed)
    • The environment mirrors the Travis CI setup using Docker Compose
    • Tox is removed

    https://github.com/apache/airflow/pull/4543/files

  5. Daniel Imberman → Hey Daniel. What do you think about the approach I propose in AIP-10 Multi-layered and multi-stage official Airflow CI image, using a Dockerfile that is part of the main "airflow" project? I have been trying and testing it for quite some time and it seems that it might work pretty well (while also addressing some other issues mentioned above). I think it supports very well the possibility of having your own Dockerfile modification (locally or in your own DockerHub account) and working on it together with your feature implementation. It seems easy and natural. Let me know what you think. I will soon start advocating for the work I've done on AIP-10 to get more community feedback/support, but I would love to hear whether it addresses your needs.

  6. Gerardo Curiel → It also addresses your concern about 2 PRs instead of 1 PR. The way I propose it, there will be one PR addressing both changes together: the Dockerfile change and the code change.