Background

In August 2019, the Flink community started discussing different approaches for reducing Flink's build time. Some of the problems we faced were:

A lack of resources and long build times forced developers to sacrifice test coverage to stay within the limits of Travis: instead of testing Flink against different Scala or Hadoop versions, or running the end to end tests for each new push, we had to defer these tests to nightly build jobs.

Why Azure Pipelines?

Based on these considerations, we started evaluating Azure Pipelines (AZP) as a potential replacement for Travis. We saw the following compelling reasons for Azure:

We considered other CI providers as well. GitHub Actions (which appears to be based on Azure Pipelines) is likely a great alternative, but it was still in closed beta at the time of the PoC. It offers essentially the same features as Azure Pipelines, but is more tightly integrated with GitHub.

Current State

As of March 2020, Azure Pipelines for Flink is set up as follows:

There are two pipeline definition files in the Flink repository: /azure-pipelines.yml and /tools/azure-pipelines/build-apache-repo.yml.

To avoid configuration duplication, both files parameterize a shared template that defines the actual compile and test jobs: /tools/azure-pipelines/jobs-template.yml.
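
As a rough illustration of how a definition file can parameterize a shared template in Azure Pipelines, consider the sketch below. Only the file paths come from the text above; the parameter names (stage_name, run_e2e) and the echo steps are hypothetical placeholders and do not reflect Flink's actual configuration. The two files are shown as two YAML documents separated by "---".

```yaml
# Definition file (for example /azure-pipelines.yml); illustrative sketch only.
jobs:
  - template: tools/azure-pipelines/jobs-template.yml
    parameters:
      stage_name: ci   # hypothetical parameter name
      run_e2e: true    # hypothetical parameter name
---
# Template file (/tools/azure-pipelines/jobs-template.yml); illustrative sketch only.
parameters:
  - name: stage_name
    type: string
  - name: run_e2e
    type: boolean
    default: false

jobs:
  - job: compile_${{ parameters.stage_name }}
    steps:
      - script: echo "compile step goes here"   # placeholder, not the real build command
  - job: test_${{ parameters.stage_name }}
    dependsOn: compile_${{ parameters.stage_name }}
    steps:
      - script: echo "test step goes here"      # placeholder
  # Emit the end to end job only when the caller asks for it.
  - ${{ if eq(parameters.run_e2e, true) }}:
      - job: e2e_${{ parameters.stage_name }}
        steps:
          - script: echo "end to end tests go here"   # placeholder
```

Because the job names are derived from a parameter, the same template can be included several times in one pipeline without name clashes.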

From a high level point of view, the template accepts the following parameters:


Based on these parameters, the template defines the following jobs:


Putting all this together, using the /azure-pipelines.yml definition in a free Azure Pipelines account will run the end to end tests as well as the compile stage and all “test_$name” stages on virtual machines provided by Azure.

On the apache/flink repository, the builds are a bit more sophisticated: We include /tools/azure-pipelines/jobs-template.yml multiple times in the definition file (/tools/azure-pipelines/build-apache-repo.yml) for the different testing profiles in the nightly tests.

Also, we execute the “compile” and “test” stages on community provided build machines, while the “e2e” stage is executed on virtual machines provided by Azure. We follow this approach because our end to end tests can modify system resources, and they sometimes fail to properly clean up after a failure (for example, by leaving Docker containers or JVMs running).
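
The following sketch shows one way such a setup could look: the template is included twice with different settings, and each inclusion passes a self-hosted pool for the tests and a Microsoft-hosted image for the end to end job. This is an assumption-laden illustration, not the contents of build-apache-repo.yml: the parameter names (stage_name, test_pool, e2e_pool), the stage names, and the pool name "Default" are all hypothetical.

```yaml
# Definition file in /tools/azure-pipelines/ (illustrative sketch only).
jobs:
  - template: jobs-template.yml
    parameters:
      stage_name: nightly_profile_a   # hypothetical profile name
      test_pool:
        name: Default                 # hypothetical self-hosted agent pool
      e2e_pool:
        vmImage: 'ubuntu-latest'      # Microsoft-hosted virtual machine
  - template: jobs-template.yml
    parameters:
      stage_name: nightly_profile_b   # hypothetical profile name
      test_pool:
        name: Default
      e2e_pool:
        vmImage: 'ubuntu-latest'
---
# Template side (illustrative sketch only): the pools arrive as object parameters.
parameters:
  - name: stage_name
    type: string
  - name: test_pool
    type: object
  - name: e2e_pool
    type: object

jobs:
  - job: test_${{ parameters.stage_name }}
    pool: ${{ parameters.test_pool }}
    steps:
      - script: echo "tests run on a community provided machine"   # placeholder
  - job: e2e_${{ parameters.stage_name }}
    pool: ${{ parameters.e2e_pool }}
    steps:
      - script: echo "end to end tests run on a fresh Azure VM"    # placeholder
```

Running the end to end job on a hosted image means every run starts from a clean machine, so leftover containers or processes from a failed run cannot affect later builds.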

Lessons Learned: From Travis to Azure

What is different?

Besides the benefits mentioned already, what are some positive lessons learned?

Conclusion

With Azure Pipelines, the Flink project has migrated its CI infrastructure to a mature platform, backed by a major vendor.
During the testing period, we had very few outages caused by Azure itself.
We currently have enough available resources on Azure to execute even the end to end tests with each pull request and push to master, further improving the testing coverage of Flink. This leads to higher developer productivity, as issues are detected earlier.

Flink developers are not the only ones to benefit from this effort: we will also stop using the Travis resources shared with all other Apache Software Foundation projects. Apache Flink used to be the heaviest user of Travis CI resources at the foundation.

This effort has not (yet) reduced the time of a single CI run, but the overall queueing time has been massively reduced. It also laid the foundation for scaling the project further, since we can add powerful build machines as the testing load increases.

In the future, we are planning to work on simplifying Flink’s build infrastructure, in particular the test scripts, which have collected a lot of technical debt over the years.