Problem: Flink's build is too slow

We want to reduce Flink's local and CI build times. This page collects the options.

§1 Optimize current setup

We currently use Maven + Travis CI + custom scripts. These proposals keep this setup but refine it.

Enable JVM reuse for IT cases in more modules (Solution 1)

Benefits:

  • Speedups in blink planner (7 minutes saved)
  • Considered easy to implement

Problems:

  • Not all tests are doing a proper cleanup

Custom differential build scripts (Solution 2)

Benefits:

  • Only build & test affected modules

Problems:

  • Needs a defensive/pessimistic design to catch all potential issues
  • Development and maintenance of "homegrown" scripts working around Maven limitations
  • Reinventing the wheel to compensate for the limitations of a bad build tool (Maven)
  • Complex, non-standard build system
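The core of such a differential build can be sketched as follows. This is a minimal illustration, not Flink's actual tooling: the module names and the dependency map are hypothetical, and a real script would need to derive the graph from the build files. Any changed file that cannot be mapped to a known module triggers a full build, which reflects the defensive/pessimistic design mentioned above.

```python
from collections import deque

# Hypothetical module graph (assumption): module -> modules it depends on.
DEPENDS_ON = {
    "core": [],
    "runtime": ["core"],
    "streaming": ["runtime"],
    "table": ["streaming"],
    "connector-kafka": ["streaming"],
}


def module_of(path):
    """Map a changed file to its top-level module directory, or None if unknown."""
    top = path.split("/", 1)[0]
    return top if top in DEPENDS_ON else None


def affected_modules(changed_files):
    """Return the changed modules plus everything that transitively depends on them."""
    changed = set()
    for path in changed_files:
        module = module_of(path)
        if module is None:
            # Pessimistic fallback: an unmapped file (e.g. a root build file)
            # could affect anything, so build and test every module.
            return set(DEPENDS_ON)
        changed.add(module)

    # Invert the graph: module -> modules that depend on it directly.
    dependents = {m: set() for m in DEPENDS_ON}
    for module, deps in DEPENDS_ON.items():
        for dep in deps:
            dependents[dep].add(module)

    # Breadth-first walk over the reverse edges.
    affected = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in dependents[current]:
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected
```

For example, `affected_modules(["streaming/src/Foo.java"])` yields `streaming`, `table` and `connector-kafka`, while a change to an unmapped file such as `pom.xml` yields every module.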

Only run smoke tests when PR is opened, run heavy tests on demand (Solution 3)

Benefits:

  • Execute fewer tests, heavy tests on demand

Problems:

  • Custom implementation with "ci-bot" likely
  • Committers need to know which test runs to request / run
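The selection logic behind this idea could look roughly like the sketch below. The event names, the tier names and the bot command are all hypothetical placeholders chosen for illustration; the page does not define how "ci-bot" would actually be triggered.

```python
SMOKE = "smoke"
HEAVY = "heavy"

# Hypothetical trigger phrase (assumption); not an existing ci-bot command.
RUN_HEAVY_COMMAND = "@ci-bot run heavy"


def select_tiers(event, comment=""):
    """Decide which test tiers to run for a pull request event."""
    if event == "pr_opened":
        # Fast feedback only when the PR is opened.
        return [SMOKE]
    if event == "pr_comment" and comment.strip().lower() == RUN_HEAVY_COMMAND:
        # Heavy tests run only when someone explicitly asks for them.
        return [HEAVY]
    # Any other event or comment triggers nothing.
    return []
```

The second problem listed above shows up directly in this sketch: somebody has to know that the heavy tier exists and remember to request it.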

Move more tests into cron builds (Solution 4)

Benefits:

  • Almost no custom implementation needed (cheap version of 'Solution 3')

Problems:

  • Poor developer experience: People expect to get fast feedback on their changes
  • Failures in cron builds potentially go unnoticed for quite some time (months)
  • Potential of lower long-term build quality

Work towards parallelizing the build better

Benefits:

  • Moving to a build infrastructure with more CPU cores will allow us to run more build / test workloads concurrently

Targets:

  • Maven checkstyle plugin
  • Kafka tests (30 minutes of sequential execution)

Use Gradle Enterprise Global Build Cache

Gradle Enterprise provides a Maven plugin for global build caches.

Benefits:

  • Incremental build benefits on a per-module basis (presumably; not yet verified)
  • Low effort, because it is used in the existing environment
  • Save money on Travis plan (faster builds)
  • Improves local and CI builds

Problems:

  • Relies on a proprietary product
  • Unclear if it works for anonymous Flink contributors


§2 Switch Build System

We currently use Maven + custom scripts

Use Gradle (Solution 5)

Benefits:

  • Supports incremental builds and tests
  • Supports remote build cache to do an incremental build w/o having earlier increments (through "Gradle Enterprise")
  • All build tasks can be solved in code, instead of Maven+scripts

Problems:

  • MAJOR effort to change entire build system
  • All Flink developers need to learn a new build system

Notes:

  • Apache Kafka is using Gradle
  • Apache Beam migrated from Maven to Gradle by having both build systems side-by-side during the transition
  • Gradle supports Kotlin (as an alternative to Groovy) for the build scripts, but Kotlin support is new and has potential limitations
  • Arvid Heise is willing to support a PoC
    • ~1 week for PoC (some modules only, not all problems solved)
    • PoC must cover CI as well
  • Problems to solve
    • Shading & layered shading
    • Inclusion of NOTICE files into the final build (producing valid Apache releases in general)
    • Support for mixed Scala / Java projects
    • Javadocs for mixed Scala / Java projects
    • Java 9+ support
    • API compatibility checks
    • Checkstyle
    • Ensuring dependency convergence
  • Unclear whether we can use the Gradle Enterprise build cache for free as an open source project, and how it works over the public internet (in a secure way)

Use Bazel

Benefits:

  • Supports incremental builds

Problems:

  • MAJOR effort to change entire build system
  • Not widely adopted in Javaland

Notes:

  • A quick search for shading with Bazel didn't reveal promising results

§3 Switch Build Infrastructure

We currently use Travis CI

Benefits of moving away from Travis:

  • Travis's future is uncertain due to company ownership changes
  • Travis build caches are unreliable / used in a hacky way
  • Travis only provides a build environment with 2 CPUs and 7.5 GB of memory (where a build currently takes 3.5 hours). Other vendors provide bigger build instances, where the build can finish in ~1.3 hours
    • Travis provides bigger build environments in paid plans.
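To put the two build times quoted above into perspective, the arithmetic can be sketched as follows. The inputs are the 3.5 hours and ~1.3 hours from this page; the variable names are only illustrative.

```python
travis_hours = 3.5   # current build time on Travis (from this page)
larger_hours = 1.3   # approximate build time on a bigger instance (from this page)

speedup = travis_hours / larger_hours
saved_hours = travis_hours - larger_hours

print(f"speedup: {speedup:.1f}x, saved per build: {saved_hours:.1f} h")
# prints "speedup: 2.7x, saved per build: 2.2 h"
```

In other words, the bigger instances would cut roughly 2.2 hours from every full build.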

Move to another hosted CI service (Solution 6)

Benefits:

  • Low maintenance overhead of a hosted service
  • Similar experience to current setup

Problems:

  • Hosted CI services often have resource limited build environments

Free for open source options:

  • Azure Pipelines (recommended by community)
  • GitHub CI
    • Closed Beta
    • Seems to be based on Azure Pipelines
  • Circle CI

Paid options:

  • Google Cloud Build
    • 32-core builders, at a high price (almost 4x the price of the underlying compute instances)

Move to a self-hosted CI service

Benefits:

  • Lower costs compared to hosted CI service

Problems:

  • How do we support building private branches from outside contributors?
  • Somebody needs to maintain the infrastructure to provide a similar experience

Options for software:

Options for machines:

  • Cloud providers
    • Google: $1500/mo for 2x 32-core machines
  • Dedicated Servers
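For comparing machine options, the Google quote above can be broken down per machine and per core. This is plain arithmetic on the numbers from this page; it makes no assumption about actual list prices.

```python
monthly_cost_usd = 1500   # quoted total per month (from this page)
machines = 2
cores_per_machine = 32

total_cores = machines * cores_per_machine
cost_per_machine = monthly_cost_usd / machines
cost_per_core = monthly_cost_usd / total_cores

print(f"{total_cores} cores, ${cost_per_machine:.0f} per machine, "
      f"${cost_per_core:.2f} per core per month")
```

That works out to 64 cores in total at roughly $23 per core per month, which gives a baseline for comparing dedicated server offers.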


§4 Split Repository (TODO)

See separate page (WIP)

