

Problem: Flink's build is too slow

We want to reduce Flink's local and CI build times. This page collects the options under consideration.

§1 Optimize current setup

We currently use Maven + Travis CI + custom scripts. These proposals keep this setup but refine it.

Enable JVM reuse for IT cases in more modules (Solution 1)

Benefits:

  • Speedups in blink planner (7 minutes saved)
  • Considered easy to implement

Problems:

  • Not all tests are doing a proper cleanup

Custom differential build scripts (Solution 2)

Benefits:

  • Only build & test affected modules

Problems:

  • Needs a defensive/pessimistic design to catch all potential issues
  • Development and maintenance of "homegrown" scripts working around Maven limitations
  • Reinventing the wheel to compensate for the limitations of a bad build tool (Maven)
  • Complex, non-standard build system
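
A minimal sketch of the idea behind such a differential build, assuming a hypothetical module dependency map and assuming that every changed file path starts with the directory of its module. The module names and dependency edges below are for illustration only and are not taken from Flink's actual POMs. In line with the pessimistic design mentioned above, a change that cannot be mapped to a module triggers a full build.

```python
from collections import deque

# Hypothetical dependency map: module -> modules it depends on.
# The edges are illustrative and not taken from Flink's actual POMs.
DEPENDS_ON = {
    "flink-core": [],
    "flink-runtime": ["flink-core"],
    "flink-streaming-java": ["flink-runtime"],
    "flink-table": ["flink-streaming-java"],
    "flink-docs": [],
}


def affected_modules(changed_files, depends_on):
    """Return the modules to build and test for a set of changed files."""
    # Invert the map: module -> modules that depend on it.
    dependents = {module: set() for module in depends_on}
    for module, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(module)

    # Map each changed file to its module via the top-level directory.
    changed = set()
    for path in changed_files:
        top = path.split("/", 1)[0]
        if top not in depends_on:
            # Pessimistic fallback: unknown files (e.g. the root pom.xml)
            # may affect everything, so rebuild all modules.
            return set(depends_on)
        changed.add(top)

    # Breadth-first search over dependents to collect the transitive closure.
    affected = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in dependents[current]:
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected
```

The resulting set could then be handed to Maven's `-pl` (`--projects`) option so that only those modules are built and tested.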

Only run smoke tests when PR is opened, run heavy tests on demand (Solution 3)

Benefits:

  • Execute fewer tests, heavy tests on demand

Problems:

  • Custom implementation with "ci-bot" likely
  • Committers need to know which test runs to request / run
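
As a rough illustration of what the "ci-bot" part could involve, the sketch below extracts requested test runs from a PR comment. The command syntax (`@ci-bot run ...`) and the profile names are assumptions made up for this example; they do not describe an existing bot.

```python
import re

# Hypothetical test profiles; the names are placeholders, not real CI jobs.
KNOWN_PROFILES = {"smoke", "kafka", "e2e"}

# Hypothetical command syntax: "@ci-bot run <profile> [<profile> ...]",
# written on a line of its own inside the PR comment.
COMMAND = re.compile(r"^@ci-bot\s+run\s+(.+)$", re.MULTILINE)


def requested_profiles(comment):
    """Extract the known test profiles requested in a PR comment."""
    requested = set()
    for match in COMMAND.finditer(comment):
        for name in match.group(1).split():
            # Silently ignore names the bot does not know about.
            if name in KNOWN_PROFILES:
                requested.add(name)
    return requested
```

Ignoring unknown names keeps the sketch short; a real bot would probably reply with the list of valid runs, which would also help committers find out what they can request.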

Move more tests into cron builds (Solution 4)

Benefits:

  • Almost no custom implementation needed (cheap version of 'Solution 3')

Problems:

  • Poor developer experience: People expect to get fast feedback on their changes
  • Failures in cron builds potentially go unnoticed for quite some time (months)
  • Potential of lower long-term build quality

Work towards parallelizing the build better

Benefits:

  • Moving to a build infrastructure with more CPU cores will allow us to run more build / test workloads concurrently

Targets:

  • Maven checkstyle plugin
  • Kafka tests (30 minutes of sequential execution)
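
To get a feeling for how much a sequential test suite could gain from more cores, the sketch below estimates the wall-clock time when test classes are spread over parallel workers. It uses a simple greedy heuristic (longest task first, always onto the least loaded worker), which is not guaranteed to find the best split. The per-class durations are hypothetical; only their sum of 30 minutes matches the Kafka figure above.

```python
import heapq


def makespan(durations, workers):
    """Estimate wall-clock time when tasks are spread over parallel workers.

    Greedy longest-processing-time heuristic: sort tasks by duration
    (longest first) and always give the next task to the worker that is
    currently least loaded.
    """
    loads = [0] * workers
    heapq.heapify(loads)
    for duration in sorted(durations, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + duration)
    return max(loads)


# Hypothetical split of 30 minutes of sequential tests into six classes.
test_minutes = [10, 6, 5, 4, 3, 2]
```

With these made-up numbers, the estimate drops from 30 minutes on one worker to 16 minutes on two and 10 minutes on four. Beyond that point the longest single test class becomes the limit, so splitting large test classes matters as much as adding cores.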

Use Gradle Enterprise Global Build Cache

Gradle Enterprise provides a maven plugin for global build caches.

Benefits:

  • Incremental build benefits on a module basis (presumably)
  • Low effort, because it is used in the existing environment
  • Save money on Travis plan (faster builds)
  • Improves local and CI builds

Problems:

  • Relies on a proprietary product
  • Unclear if it works for anonymous Flink contributors
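
The general idea of a build cache can be sketched as follows: the output of a module is stored under a key derived from its inputs, so an unchanged module is not built again. This is a conceptual illustration only and does not describe how Gradle Enterprise implements its cache. The class and function names are made up for this example, and an in-memory dictionary stands in for the shared remote store.

```python
import hashlib


def cache_key(module, input_files):
    """Derive a cache key from a module name and the contents of its inputs.

    input_files maps a relative path to the file's bytes.
    """
    digest = hashlib.sha256()
    digest.update(module.encode("utf-8"))
    # Sort paths so the key does not depend on dictionary ordering.
    for path in sorted(input_files):
        digest.update(b"\0" + path.encode("utf-8") + b"\0")
        digest.update(hashlib.sha256(input_files[path]).digest())
    return digest.hexdigest()


class BuildCache:
    """In-memory stand-in for a shared (remote) build cache."""

    def __init__(self):
        self._store = {}
        self.builds = 0  # number of cache misses, i.e. real builds

    def build(self, module, input_files, compile_fn):
        key = cache_key(module, input_files)
        if key not in self._store:
            # Cache miss: do the expensive work and remember the output.
            self._store[key] = compile_fn(input_files)
            self.builds += 1
        return self._store[key]
```

Because the key depends only on the inputs, a contributor or CI worker that has never built a module before can still reuse an output that someone else produced from identical sources.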


§2 Switch Build System

We currently use Maven + custom scripts

Use Gradle (Solution 5)

Benefits:

  • Supports incremental builds and tests
  • Supports a remote build cache, which allows an incremental build without having built the earlier increments locally (through "Gradle Enterprise")
  • All build tasks can be solved in code, instead of Maven+scripts

Problems:

  • MAJOR effort to change entire build system
  • All Flink developers need to learn a new build system

Notes:

  • Apache Kafka is using Gradle
  • Apache Beam migrated from Maven to Gradle by having both build systems side-by-side during the transition
  • Gradle supports Kotlin (as an alternative to Groovy) for the build scripts, but Kotlin support is new and has potential limitations
  • Arvid Heise is willing to support a PoC
    • ~1 week for the PoC (some modules only, not all problems solved)
    • The PoC must cover CI as well
  • Problems to solve
    • Shading & layered shading
    • Inclusion of NOTICE files into the final build (producing valid Apache releases in general)
    • Support for mixed scala / java projects
    • Javadocs for mixed scala / java projects
    • Java 9+ support
    • API compatibility checks
    • Checkstyle
    • Ensuring dependency convergence
  • Unclear whether we can use the Gradle Enterprise build cache for free as an open source project, and how it works securely over the public internet

Use Bazel

Benefits:

  • Supports incremental builds

Problems:

  • MAJOR effort to change entire build system
  • Not widely adopted in Javaland

Notes:

  • A quick search for shading with bazel didn't reveal promising results

§3 Switch Build Infrastructure

We currently use Travis CI

Benefits of moving away from Travis:

  • Travis's future is uncertain due to company ownership changes
  • Travis build caches are unreliable
  • Travis only provides a build environment with 2 CPUs and 7.5 GB of memory, where a build currently takes 3.5 hours. Other vendors provide bigger build instances, where the build can finish in ~1.3 hours
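
For reference, the two build times quoted above translate into the following speedup. This is plain arithmetic on the figures from this page; the variable names are only for illustration.

```python
# Build times quoted above, in hours.
travis_hours = 3.5
bigger_instance_hours = 1.3

# Relative speedup and absolute time saved per build.
speedup = travis_hours / bigger_instance_hours
saved_hours = travis_hours - bigger_instance_hours

print(f"speedup: {speedup:.1f}x, saved per build: {saved_hours:.1f} h")
```

So a bigger build instance would make each full build roughly 2.7 times faster, saving about 2.2 hours per run.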

Move to another hosted CI service (Solution 6)

Benefits:

  • Low maintenance overhead of a hosted service
  • Similar experience to current setup

Problems:

  • Hosted CI services often have resource-limited build environments

Free for open source options:

Paid options:

  • Google Cloud Build
    • 32-core builders (at a high price, almost 4x the price of the corresponding compute instances)

Move to a self-hosted CI service

Benefits:

  • Lower costs compared to hosted CI service

Problems:

  • How do we support building private branches from outside contributors?
  • Somebody needs to maintain the infrastructure to provide a similar experience

Options for software:

Options for machines:

  • Cloud providers
    • Google: $1500/month for 2x 32-core machines
  • Dedicated Servers


§4 Split Repository (TODO)

See separate page (work in progress)

