Problem: Flink's build is too slow
We want to reduce the local and CI build times of Flink. This page collects the options under consideration.
§1 Optimize current setup
We currently use Maven + Travis CI + custom scripts. These proposals keep this setup but refine it.
Enable JVM reuse for IT cases in more modules (Solution 1)
Benefits:
- Speedups in blink planner (7 minutes saved)
- Considered easy to implement
Problems:
- Not all tests are doing a proper cleanup
Custom differential build scripts (Solution 2)
Benefits:
- Only build & test affected modules
Problems:
- Needs a defensive/pessimistic design to catch all potential issues
- Development and maintenance of "homegrown" scripts that work around Maven limitations
- Reinventing the wheel to compensate for the limitations of a bad build tool (Maven)
- Complex, non-standard build system
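The core of such a differential build can be sketched as follows. This is a minimal illustration, not an existing Flink script: the module names, the `DEPENDENCIES` graph, and the rule that a file belongs to the module named by its top-level directory are all assumptions. It shows the pessimistic design mentioned above: any changed file that cannot be mapped to a module triggers a full build.

```python
from collections import defaultdict, deque

# Hypothetical module graph: module -> modules it depends on.
# The names are illustrative and do not reflect the real Flink module layout.
DEPENDENCIES = {
    "core": [],
    "runtime": ["core"],
    "streaming": ["runtime"],
    "connector-kafka": ["streaming"],
    "docs": [],
}


def module_of(path, modules):
    """Map a changed file to the module named by its top-level directory, or None."""
    top_level = path.split("/", 1)[0]
    return top_level if top_level in modules else None


def affected_modules(changed_files, dependencies):
    """Return the changed modules plus everything that transitively depends on them."""
    # Invert the graph: module -> modules that depend on it.
    dependents = defaultdict(set)
    for module, deps in dependencies.items():
        for dep in deps:
            dependents[dep].add(module)

    changed = set()
    for path in changed_files:
        module = module_of(path, dependencies)
        if module is None:
            # Pessimistic fallback: an unmapped file (e.g. a root pom.xml)
            # could affect anything, so rebuild all modules.
            return set(dependencies)
        changed.add(module)

    # Breadth-first walk over the inverted graph to collect downstream modules.
    affected = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in dependents[current]:
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected
```

With this graph, a change under `runtime/` selects `runtime`, `streaming`, and `connector-kafka`, while a change to a file outside any known module selects everything. A real script would additionally have to derive the graph from the Maven POMs, which is where most of the maintenance cost listed above would come from.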
Only run smoke tests when PR is opened, run heavy tests on demand (Solution 3)
Benefits:
- Execute fewer tests, heavy tests on demand
Problems:
- Custom implementation with "ci-bot" likely
- Committers need to know which test runs to request / run
Move more tests into cron builds (Solution 4)
Benefits:
- Almost no custom implementation needed (a cheap version of Solution 3)
Problems:
- Poor developer experience: People expect to get fast feedback on their changes
- Failures in cron builds potentially go unnoticed for quite some time (months)
- Potentially lower long-term build quality
Work towards parallelizing the build better
Benefits:
- Moving to a build infrastructure with more CPU cores will allow us to run more build / test workloads concurrently
Targets:
- Maven checkstyle plugin
- Kafka tests (30 minutes of sequential execution)
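Why the sequential Kafka tests are a target can be illustrated with a small scheduling estimate. The sketch below assigns test groups to workers greedily (longest first) and reports the resulting wall-clock time. Only the 30-minute Kafka figure comes from this page; the other durations and the function name `estimate_makespan` are made up for illustration.

```python
import heapq


def estimate_makespan(durations_minutes, workers):
    """Estimate wall-clock minutes using greedy longest-first assignment."""
    if workers < 1:
        raise ValueError("need at least one worker")
    # Min-heap of the total minutes currently assigned to each worker.
    loads = [0] * workers
    heapq.heapify(loads)
    for duration in sorted(durations_minutes, reverse=True):
        lightest = heapq.heappop(loads)  # worker with the least work so far
        heapq.heappush(loads, lightest + duration)
    return max(loads)


# 30 = sequential Kafka tests (from this page); the rest are hypothetical groups.
groups = [30, 10, 10, 5, 5]
one_worker = estimate_makespan(groups, 1)
two_workers = estimate_makespan(groups, 2)
eight_workers = estimate_makespan(groups, 8)

# Same total work, but with the Kafka tests split into three 10-minute groups.
split_groups = [10, 10, 10, 10, 10, 5, 5]
four_workers_split = estimate_makespan(split_groups, 4)
```

Under these assumed numbers, going from two to eight workers does not help, because the single 30-minute group bounds the build. Only after that group is split does adding cores reduce the total time further.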
Use Gradle Enterprise Global Build Cache
Gradle Enterprise provides a Maven plugin for global build caches.
Benefits:
- Incremental build benefits on a per-module basis (presumably; not yet verified)
- Low effort, because it plugs into the existing Maven environment
- Save money on Travis plan (faster builds)
- Improves local and CI builds
Problems:
- Relies on a proprietary product
- Unclear if it works for anonymous Flink contributors
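The general idea behind a build cache can be sketched as a store keyed by a hash of a module's inputs. This is a generic illustration of content-addressed caching and makes no claim about how Gradle Enterprise implements it; `cache_key`, `BuildCache`, and `compile_fn` are hypothetical names.

```python
import hashlib


def cache_key(inputs):
    """Hash a mapping of relative path -> file content (bytes) into a stable key."""
    digest = hashlib.sha256()
    for path in sorted(inputs):  # sorted so the key does not depend on dict order
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")
        digest.update(hashlib.sha256(inputs[path]).digest())
    return digest.hexdigest()


class BuildCache:
    """In-memory stand-in for a shared (remote) build cache."""

    def __init__(self):
        self._store = {}

    def build(self, module, inputs, compile_fn):
        """Return (output, cache_hit); compile only if the inputs were not seen before."""
        key = (module, cache_key(inputs))
        if key in self._store:
            return self._store[key], True
        output = compile_fn(inputs)
        self._store[key] = output
        return output, False


cache = BuildCache()
sources = {"src/A.java": b"class A {}"}
compile_fn = lambda inputs: "jar-with-%d-files" % len(inputs)  # placeholder build step

_, first_hit = cache.build("core", sources, compile_fn)
_, second_hit = cache.build("core", sources, compile_fn)
_, changed_hit = cache.build("core", {"src/A.java": b"class A { int x; }"}, compile_fn)
```

Here the second build of unchanged inputs is served from the cache, and the modified source misses. A real cache would also need to fold compiler version, dependencies, and build flags into the key.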
Remove as much shading from Flink as possible (into flink-shaded)
§2 Switch Build System
We currently use Maven + custom scripts
Use Gradle (Solution 5)
Benefits:
- Supports incremental builds and tests
- Supports a remote build cache, which allows an incremental build without having built the earlier increments locally (through "Gradle Enterprise")
- All build tasks can be solved in code, instead of Maven+scripts
Problems:
- MAJOR effort to change entire build system
- All Flink developers need to learn a new build system
Notes:
- Apache Kafka is using Gradle
- Apache Beam migrated from Maven to Gradle by having both build systems side-by-side during the transition
- Gradle supports Kotlin (as an alternative to Groovy) for the build scripts, but Kotlin support is new and has potential limitations
- Arvid Heise is willing to support a POC
- ~1 week for PoC (some modules only, not all problems solved)
- POC must cover CI as well
- Problems to solve
- Shading & layered shading
- Inclusion of NOTICE files into the final build (producing valid Apache releases in general)
- Support for mixed scala / java projects
- Javadocs for mixed scala / java projects
- Java 9+ support
- API compatibility checks
- checkstyle
- ensuring dependency convergence
- unclear whether we can use Gradle Enterprise build cache for free as open source, and how it works over the public internet (in a secure way)
Use Bazel
Benefits:
- Supports incremental builds
Problems:
- MAJOR effort to change entire build system
- Not widely adopted in the Java ecosystem
Notes:
- A quick search for shading with Bazel did not turn up promising results
§3 Switch Build Infrastructure
We currently use Travis CI
Benefits of moving away from Travis:
- The future of Travis is uncertain due to changes in company ownership
- Travis build caches are unreliable / used in a hacky way
- Travis only provides a build environment with 2 CPUs and 7.5 GB of memory, where a build currently takes 3.5 hours. Other vendors provide bigger build instances, where the build can finish in ~1.3 hours
- Travis provides bigger build environments in paid plans.
Move to another hosted CI service (Solution 6)
Benefits:
- Low maintenance overhead of a hosted service
- Similar experience to the current setup
Problems:
- Hosted CI services often have resource-limited build environments
Free for open source options:
- Azure Pipelines (recommended by community)
- 10 instances (6 hours each on a 2-core, 7 GB machine)
- Open source projects can add an unlimited number of self-hosted "worker" machines
- Artefact caching is in preview only: https://docs.microsoft.com/en-us/azure/devops/pipelines/caching/?view=azure-devops
- Requires write access to the apache/flink GH repo, which Apache does not allow: INFRA-17030
- GitHub CI
- Closed Beta
- Seems to be based on Azure Pipelines
- Circle CI
Paid options:
- Google Cloud Build
- 32-core builders, at a high price (almost 4x the compute instances' price)
Move to a self-hosted CI service
Benefits:
- Lower costs compared to hosted CI service
Problems:
- How do we support building private branches from outside contributors?
- Somebody needs to maintain the infrastructure to provide a similar experience
Options for software:
- drone.io
- Jenkins
- operated by Apache Infra
- Jetbrains TeamCity
- offered for free to the ASF
- Apache Ignite is using TeamCity: https://ci.ignite.apache.org/
- Potentially Azure DevOps Server
- Offered for free to Apache Committers
Options for machines:
- Cloud providers
- Google: $1500/month for 2x 32-core machines
- Dedicated Servers
§4 Split Repository (TODO)
See separate page (wip)