Background

In August 2019, the Flink community started discussing different approaches for reducing Flink's build time. Some of the problems we faced were:

Lack of build machine resources at Travis, leading to long wait times for pull requests to be validated
Build times exceeding the limits of the free Travis plan for open source projects (mostly relevant for contributors wanting to test their branches in their own Travis accounts)
Frequent instabilities in Travis’ caching mechanism, which we had abused to split and parallelize the build into a compile stage and multiple subsequent test stages

The lack of resources / build time has forced developers to make sacrifices in terms of test coverage to keep the build time within the limitations of Travis: Instead of testing Flink against different Scala or Hadoop versions, or running the end to end tests for each new push, we had to defer these tests to nightly build jobs.

Why Azure Pipelines?

Based on these considerations, we started evaluating Azure Pipelines (AZP) as a potential replacement for Travis. We saw the following compelling reasons to choose Azure:

AZP offers 10 parallel test slots with a runtime of 6 hours in their free accounts. This allows Flink contributors to use the official Flink CI setup on their own, free Azure Pipelines account. (Travis has 5 parallel builds with a 50-minute timeout.)
Azure Pipelines is very mature software (based on Team Foundation Server, first released in 2005), offering additional features over Travis.
- Most importantly, passing build artifacts between stages in a build pipeline. This allows us to stick to the “split & parallelize” approach of the build.
- AZP also parses the JUnit output, immediately showing developers the failed tests (without browsing the logs).
- Templates allow for better structuring of build definitions.
AZP allows users to add their own machines as builders into the hosted service. This means that Microsoft is operating the service itself, the artifact storage, the build logs, the build job triggering and scheduling etc., while users can provide additional hardware resources to speed up testing.
In Flink’s case, Alibaba Cloud is generously offering the community powerful testing machines for building all pull requests and pushes to master.
Using a CI service that allows custom machines has several benefits:

CI resource scalability: The Flink project can ask the community for hardware / server donations, instead of monetary donations (and we have received offers from two additional companies already).
This allows the Flink community to think more about “what needs to be tested” instead of “what can we still test”.
CI resource variety: There have been some discussions around supporting Flink on the ARM platform. With custom machines, we can also add ARM-based builders to Azure. Through a donation from Huawei, we already have ARM-based builders (which are not yet properly integrated into the testing).

The company behind Travis was acquired in early 2019, leading to questions about the future of their product.

We considered other CI options / providers as well. GitHub Actions (which appears to be based on Azure Pipelines) is probably a great alternative, but it was still in closed beta at the time of the PoC. It offers basically the same features as Azure Pipelines, but it is more tightly integrated with GitHub.

Current State

As of March 2020, Azure Pipelines for Flink is set up as follows:

There are two pipeline definition files in the Flink repository:

/azure-pipelines.yml
This file is meant to be used by people with their free AZP accounts. It defines a simple build pipeline that is basically the same as the one we use on Flink for validating pull requests and pushes.
/tools/azure-pipelines/build-apache-repo.yml
This file is meant to be used only by the “apache/flink” repository. In addition to the simple push and pull request validation, it also defines the nightly builds, which test Flink on JDK11, with Hadoop 2.4.1 and without Hadoop (the default profile tests Flink with Hadoop 2.8.3).

To avoid configuration duplication, both files parameterize a template that defines the actual compile and test jobs: /tools/azure-pipelines/jobs-template.yml.

At a high level, the template accepts the following parameters:

pool definitions: These allow us to configure Azure-provided or self-hosted testing machines
environment variables: The Flink compilation is controlled through environment variables, most notably the PROFILE variable, which defines things like the Hadoop or Scala version
jdk: accepts “jdk8” and “jdk11”
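
As a rough illustration, a definition file might hand these parameters to the template as sketched below. Only the template path comes from the text above; the parameter names (`test_pool_definition`, `environment`, `jdk`) and the PROFILE value are assumptions made for this example and may not match the actual Flink files.

```yaml
# Hypothetical definition file for a free AZP account.
# Parameter names and values are illustrative, not copied from Flink.
jobs:
  - template: tools/azure-pipelines/jobs-template.yml
    parameters:
      # Pool definition: here a Microsoft-hosted VM image.
      test_pool_definition:
        vmImage: 'ubuntu-latest'
      # Value the template would expose to the build as PROFILE (assumed value).
      environment: '-Dhadoop.version=2.8.3'
      # Either jdk8 or jdk11.
      jdk: jdk8
```

Because the pool is just another parameter, the same template can target hosted or self-hosted machines without any change to the job definitions.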

Based on these parameters, the template defines the following jobs:

“compile”: builds Flink according to the provided parameters and publishes the result as a “pipeline artifact”
“test_$name” stages (currently core, python, libraries, blink_planner, connectors, kafka/gelly, tests, legacy_scheduler_core, legacy_scheduler_tests, misc). These stages download the pipeline artifact from the previous stage, so that they don’t have to compile Flink and can run their tests right away.
JUnit test metadata is parsed by Azure Pipelines for analysis in the Web UI.
The full logs of all the Flink tests are published as pipeline artifacts. These and all the other artifacts are available for download through the AZP UI.
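
The "compile once, test in parallel" pattern described above can be sketched roughly as follows. This is a simplified sketch and not the actual Flink template: the job names, the artifact name, the parameter names and the Maven commands are placeholders. The three tasks (`PublishPipelineArtifact@1`, `DownloadPipelineArtifact@2`, `PublishTestResults@2`) are standard Azure Pipelines tasks.

```yaml
# Hypothetical excerpt of a jobs template; names, paths and commands are placeholders.
parameters:
  test_pool_definition: {}
  environment: ''
  jdk: 'jdk8'

jobs:
  - job: compile
    pool: ${{ parameters.test_pool_definition }}
    steps:
      # Placeholder build command; the shell expands $PROFILE from the env block.
      - script: mvn clean install -DskipTests $PROFILE
        displayName: Build Flink
        env:
          PROFILE: ${{ parameters.environment }}
      # Publish the build output so that later jobs can reuse it.
      - task: PublishPipelineArtifact@1
        inputs:
          targetPath: $(Build.SourcesDirectory)
          artifact: FlinkCompileArtifact

  - job: test_core
    dependsOn: compile
    pool: ${{ parameters.test_pool_definition }}
    steps:
      # Fetch the compiled build instead of compiling again.
      - task: DownloadPipelineArtifact@2
        inputs:
          artifact: FlinkCompileArtifact
          path: $(Build.SourcesDirectory)
      # Placeholder test command that runs the tests without the compile phase.
      - script: mvn surefire:test
        displayName: Run tests
      # Let AZP parse the JUnit XML reports, even when tests have failed.
      - task: PublishTestResults@2
        condition: succeededOrFailed()
        inputs:
          testResultsFormat: 'JUnit'
          testResultsFiles: '**/TEST-*.xml'
```

Each additional test job would follow the shape of `test_core`, depending only on `compile`, so that all test jobs can run in parallel once the artifact is available.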

Note: Both the compile and test_ stages are executed in a Docker container, providing a consistent testing environment with all the required tools installed (such as Maven 3.2.5, jdk8 and jdk11).
This also simplifies reproducing environment-related build issues locally.
The “e2e” stage compiles Flink and runs all bash and Java end to end tests defined in /flink-end-to-end-tests/run-nightly-tests.sh.
The e2e stage does not run in a Docker container, but directly on the build worker. We followed this approach because some end to end tests use Docker containers themselves, and executing them within containers would require a lot of changes to make the Docker networking and file mounting flexible enough for both “docker-in-docker” and “docker-free” scenarios.
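
The difference between containerized and non-containerized jobs could look roughly like the sketch below. The container alias, the image name and the job contents are hypothetical; only the path of the end to end test script is taken from the text, and the compile step of the e2e stage is omitted for brevity.

```yaml
# Hypothetical sketch; image name and job contents are placeholders.
resources:
  containers:
    - container: flink-build-container
      # Assumed image with Maven and both JDKs preinstalled.
      image: example/flink-ci:latest

jobs:
  - job: compile
    pool:
      vmImage: 'ubuntu-latest'
    # All steps of this job run inside the container declared above.
    container: flink-build-container
    steps:
      - script: mvn -version

  - job: e2e
    pool:
      vmImage: 'ubuntu-latest'
    # No "container:" key, so the steps run directly on the build worker,
    # where the tests can start their own Docker containers.
    steps:
      - script: ./flink-end-to-end-tests/run-nightly-tests.sh
```

Running the same image locally is what makes environment-related failures of the containerized jobs reproducible outside of CI.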

Putting all this together, using the /azure-pipelines.yml definition in a free Azure Pipelines account will run the end to end tests as well as a compile and all “test_$name” stages on virtual machines provided by Azure.

On the apache/flink repository, the builds are a bit more sophisticated: We include /tools/azure-pipelines/jobs-template.yml multiple times in the definition file (/tools/azure-pipelines/build-apache-repo.yml) for the different testing profiles in the nightly tests.

Also, we execute the “compile” and “test” stages on community-provided build machines. The “e2e” stage is executed on virtual machines provided by Azure. We follow this approach because our end to end tests can modify system resources, and they sometimes fail to properly clean up after a failure (such as keeping Docker containers or JVMs running).
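
A definition file that reuses the template for several profiles and mixes self-hosted with Azure-hosted pools might be structured as in the following sketch. The stage names, the pool name, the parameter names and the PROFILE values are assumptions for illustration, not the contents of the real build-apache-repo.yml.

```yaml
# Hypothetical sketch of a definition file that includes the template several times.
# Stage names, pool names, parameter names and values are illustrative.
schedules:
  - cron: "0 0 * * *"
    displayName: Nightly build
    branches:
      include:
        - master
    always: true

stages:
  # Regular validation stage with the default profile.
  - stage: ci_build
    jobs:
      - template: jobs-template.yml
        parameters:
          test_pool_definition:
            name: Default              # self-hosted, community-provided machines
          e2e_pool_definition:
            vmImage: 'ubuntu-latest'   # Azure-hosted VM, discarded after the run
          environment: '-Dhadoop.version=2.8.3'
          jdk: jdk8

  # Additional profile that only runs for scheduled (nightly) builds.
  - stage: nightly_jdk11
    dependsOn: []
    condition: eq(variables['Build.Reason'], 'Schedule')
    jobs:
      - template: jobs-template.yml
        parameters:
          test_pool_definition:
            name: Default
          e2e_pool_definition:
            vmImage: 'ubuntu-latest'
          environment: '-Dhadoop.version=2.8.3'
          jdk: jdk11
```

Putting each inclusion into its own stage keeps the job names generated by the template from clashing, and using a fresh Azure-hosted VM for the e2e jobs means that leftover containers or JVMs disappear together with the machine.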

Lessons Learned: From Travis to Azure

What is different?

To connect AZP with a GitHub repository, AZP requires write access to the repo (it has a feature for editing the build definition in its UI). Directly connecting AZP with the Flink repository would show the build status in the GitHub UI.
However, the Apache Software Foundation does not allow third parties to get write access to their GitHub repositories. Instead, we use flinkbot and a mirror script to trigger pull request and push builds.
Poor support: For Travis, you could just send them an email and get a competent response quickly.
For Azure, the main channels seem to be the public feedback forum and StackOverflow. In cases relating to service stability or bugs, the support was not satisfactory.
For the open source parts of AZP, opening a GitHub ticket there was more successful (but that’s not possible for the service itself).
During the PoC, we spent a lot of time fine-tuning the custom build machines. What turned out to be important was that the build service does not run with root permissions. This repository contains all the instructions needed for setting up a new worker.

Besides the benefits mentioned already, what are some positive lessons learned?

The development speed of the product seems to be high, and many parts of the software (build agents, VM definitions, YAML parsing, build tasks) are open source. The product management also seems to be good: they look for user feedback and regularly communicate new and upcoming features.
The close relationship between AZP and GitHub Actions would allow us to migrate to GH Actions quite easily if needed.

Conclusion

With Azure Pipelines, the Flink project has migrated its CI infrastructure to a mature platform, backed by a major vendor.
During the testing period, we had very few outages caused by Azure itself.
We currently have enough available resources on Azure to even execute the end to end tests with each pull request and push to master, further improving the test coverage of Flink. This leads to higher developer productivity, as issues are detected earlier.

Flink developers are not the only ones who benefit from this effort: we will also stop using the Travis resources shared with all the other Apache Software Foundation projects. Apache Flink used to be the heaviest user of Travis CI resources at the foundation.

The outcome of this effort is not a reduction of the time for one CI run (yet); however, the overall queueing time has been massively reduced. This effort also laid the foundation for scaling the project further, by being able to add powerful build machines as the testing load increases.

In the future, we are planning to work on simplifying Flink’s build infrastructure, in particular the test scripts, which have collected a lot of technical debt over the years.