Status
Current state: Under Discussion
Discussion thread: here
JIRA: here
Authors: Vince Rose vrose@confluent.io Farid Zakaria fzakaria@confluent.io
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
Apache Kafka has historically had long build times. In September 2024, pull request builds on Jenkins took 3h48m on average and 5h35m at the 95th percentile. The migration to GitHub Actions [ref] has cut the average to 1h50m. While this is an improvement, build times remain a bottleneck for developer productivity.
At Confluent, we operate a monorepo that includes Kafka and other services. Our build times were even higher, averaging nearly 3 hours. Internally, we’ve migrated the build system to Bazel and now see an average pull request build time (including tests) of 30 minutes, while maintaining the same build output structure and targets as Gradle. By leveraging Bazel’s aggressive caching and finer-grained build targets, we have identified opportunities to reduce these build times much further.
Background
Kafka’s current Gradle build produces a non-deterministic dependency graph because it has neither a dependency lockfile nor a toolchain definition. Additionally, Gradle resolves a separate dependency tree per module, so the final tar.gz and Docker images can end up with multiple versions of the same jar, which can lead to unexpected behavior.
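The duplicate-jar problem described above can be checked mechanically. The following is a minimal Python sketch, not part of the Kafka or Bazel tooling: the function name `find_duplicate_jars`, the `<artifact>-<version>.jar` naming assumption, and the jar names in the demo are all hypothetical.

```python
import re
from collections import defaultdict

# Assumption: jars are named "<artifact>-<version>.jar" and versions start with a digit.
JAR_NAME = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def find_duplicate_jars(jar_names):
    """Return {artifact: sorted versions} for artifacts that appear in more than one version."""
    versions = defaultdict(set)
    for name in jar_names:
        match = JAR_NAME.match(name)
        if match:  # names that do not fit the assumed pattern are skipped
            versions[match.group("artifact")].add(match.group("version"))
    return {artifact: sorted(found) for artifact, found in versions.items() if len(found) > 1}


if __name__ == "__main__":
    # Hypothetical contents of a distribution's libs/ directory.
    libs = ["example-lib-1.0.0.jar", "example-lib-1.2.0.jar", "other-lib-2.0.jar"]
    print(find_duplicate_jars(libs))  # {'example-lib': ['1.0.0', '1.2.0']}
```

In practice the input could come from listing the libs directory of an unpacked release archive. A single resolved dependency set for the whole build should make the result empty.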
Apache Kafka builds against multiple JVMs that must be hand-managed by the developer and whose settings may differ across development platforms. With Bazel, the JDK and Scala versions are integrated into the build as toolchains, guaranteeing that developer machines and CI are using the same JDK down to the same patch version.
While Gradle nominally offers many of the advantages we highlight for Bazel, in practice we’ve found that the broad imperative style of Gradle files written in Kotlin/Groovy is error-prone. It has made it surprisingly easy to introduce build-system bugs that either break reproducibility or undermine the ability to cache subsequent actions.
Bazel’s view of a build graph goes beyond the language level and provides a unified framework that supports builds across languages and domains. In our experience, migrating OCI (Docker) image builds to Bazel has yielded significant improvements by leveraging Bazel’s caching and granular build strategies. Using Bazel rules for Docker builds, we observed Docker build times decrease from 25 minutes to 5 minutes. We expect further improvements as we continue to refine the graph.
Demo
For those interested in test-driving what the Bazel experience might look like, we’ve built kafka-bazel, a GitHub repository that overlays a Bazel build system atop the Apache Kafka source.
This repository is modeled after a similar setup done for LLVM (llvm-bazel) prior to Bazel’s inclusion in their main repository. Compilation takes 3 minutes on average. We run the CI on VMs of roughly the same size as the Apache Kafka GitHub runners: the overlay build has 4 cores with 8GB of RAM, while the GitHub runners have 4 cores with 16GB of RAM.
For now, the repository includes a submodule for Apache Kafka at a relatively recent commit with no modifications to demonstrate the build system.
Getting started is straightforward, because the repository encodes all the necessary toolchain information. Simply clone the repository and run bazel build @kafka//... to build the complete codebase.
Many of the Bazel BUILD files are analogous to those we’ve written internally at Confluent, with proprietary information removed.
Benchmarks
The following measurements were taken on an Apple M3 Pro and cover ~4300 Bazel targets.
The benchmark only considers build times (bazel build @kafka//... and ./gradlew build -x test), not test execution. Bazel sets itself apart from Gradle by caching test results by default, keyed on each test’s dependencies, which has significantly reduced CI times at Confluent. This benchmark does not capture those gains.
Code Structure
Bazel promotes fine-grained targets, allowing each Java package, or even individual files within a package, to have distinct BUILD targets. This approach enables more effective early cutoff optimization and enhances code organization by enforcing explicit dependency modeling. Unlike Gradle and Maven, which favor modular structures where multiple Java packages are bundled within a single JAR, Bazel encourages a more granular approach. With large modules, as in Gradle/Maven, developers often adopt extensive dependencies wholesale, increasing the likelihood of circular dependencies across package scopes. Circular dependencies not only introduce complexity but also obscure the flow of dependencies, making code harder to maintain.
Bazel’s emphasis on smaller build targets aligns with ongoing efforts within Apache Kafka to break down large modules, such as its core module, into smaller, manageable components [ref]. This decomposition is not just a technical choice but also an alignment with Bazel’s philosophy of optimizing build performance and ensuring clear, manageable dependency structures.
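Because Bazel rejects cycles in the build graph, package-level targets only work if the package dependencies are acyclic. The following is a minimal Python sketch of such a check, not Bazel’s implementation: the helper `find_cycle` and the package names are hypothetical, and the dependency map is assumed to have been extracted by some other tool.

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of nodes, or None if the graph is acyclic.

    `deps` maps each package to the packages it depends on.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, fully explored
    color = {}
    stack = []  # the current DFS path, used to reconstruct the cycle

    def visit(node):
        color[node] = GRAY
        stack.append(node)
        for dep in deps.get(node, ()):
            state = color.get(dep, WHITE)
            if state == GRAY:  # back edge: dep is already on the current path
                return stack[stack.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        stack.pop()
        color[node] = BLACK
        return None

    for node in deps:
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    # Hypothetical package-level dependencies inside one large module.
    packages = {
        "org.example.clients": ["org.example.common"],
        "org.example.common": ["org.example.utils"],
        "org.example.utils": ["org.example.clients"],
    }
    print(find_cycle(packages))
```

A cycle like the one in the demo is invisible while all three packages live in one module, but it has to be broken before each package can become its own build target.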
Public Interfaces
Developers will need to adapt to using Bazel commands for building Kafka, replacing existing Gradle and Docker commands. This might also involve adjusting their Integrated Development Environment (IDE) configurations.
To consumers of Kafka artifacts, there should be no change. The Bazel build will continue to publish jars and pom files to the Maven repository.
Proposed Changes
Confluent has already done the heavy lifting to convert the Kafka repository to use Bazel. We have pushed an overlay repository that allows a developer to try out building and testing with Bazel using a symlinked Apache Kafka directory.
Compatibility, Deprecation, and Migration Plan
Phase 1
Get full build/test green on the overlay repository while gathering early usability feedback from developers.
Phase 2
Merge the BUILD.bazel files directly into the Apache Kafka repository and run the Bazel build in parallel with the Gradle build in a non-blocking manner. During this time, close any remaining gaps between the two build systems. We can add “comparison tests” to check the artifacts produced by the Gradle build against the Bazel build.
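One possible shape for such a comparison test is sketched below in Python. This is an illustration under stated assumptions, not an agreed design: the function names `jar_digest` and `compare_jars` are hypothetical, and ignoring META-INF/MANIFEST.MF is an assumption based on manifests commonly differing between build tools.

```python
import hashlib
import zipfile


def jar_digest(jar, ignore=("META-INF/MANIFEST.MF",)):
    """Map each file entry in a jar to the SHA-256 of its uncompressed content."""
    digests = {}
    with zipfile.ZipFile(jar) as archive:
        for info in archive.infolist():
            # Skip directory entries and entries assumed to differ legitimately.
            if info.is_dir() or info.filename in ignore:
                continue
            digests[info.filename] = hashlib.sha256(archive.read(info)).hexdigest()
    return digests


def compare_jars(gradle_jar, bazel_jar):
    """Report entries that are missing on either side or whose content differs."""
    left, right = jar_digest(gradle_jar), jar_digest(bazel_jar)
    return {
        "only_in_gradle": sorted(left.keys() - right.keys()),
        "only_in_bazel": sorted(right.keys() - left.keys()),
        "content_differs": sorted(
            name for name in left.keys() & right.keys() if left[name] != right[name]
        ),
    }
```

Both arguments can be file paths or file-like objects. Comparing uncompressed content rather than the raw jar bytes keeps the check independent of timestamps and compression settings, which are likely to differ between the two build systems.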
Phase 3
Require the Bazel build to be green to merge and allow developers some time to ramp up with the new system.
Phase 4
Begin publishing artifacts with Bazel.
Phase 5
Turn off the Gradle build.
Rejected Alternatives
The current solution for breaking down Kafka into smaller cacheable pieces involves separating the code out into additional submodules [ref], which then also get published as additional Maven modules. This is a heavyweight solution that adds network and storage costs on the package registries and requires moving code around into many “*-common” modules. With Bazel, we can break down code into much smaller cacheable pieces with low overhead, without adding published Maven modules, while keeping a more manageable file structure. Bazel’s use of package-level build artifacts enforces a non-circular build graph, making it easier to continue to evolve and break down the build targets over time.
Bazel caches test results down to the test class level, while Gradle caches them at the module level.
Bazel’s incremental builds are ingrained in the tool itself. All task inputs and outputs are tracked explicitly, and build outputs are put in a sandbox. With Gradle, input and output declarations are less strict, leaving it to developers to verify that their tasks are actually deterministic. Gradle also allows tasks to write directly to source directories, while Bazel enforces a build sandbox.
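The idea of explicitly tracked inputs can be illustrated with a toy model. The Python sketch below is not how Bazel is implemented. It is a simplified illustration in which an action is re-executed only when the hash of its command line and declared inputs changes, and the names `action_key` and `ActionCache` are hypothetical.

```python
import hashlib


def action_key(command, inputs):
    """Hash the command line plus the name and content of every declared input."""
    digest = hashlib.sha256(command.encode())
    for name in sorted(inputs):  # sorted so the key does not depend on dict order
        digest.update(b"\0" + name.encode() + b"\0")
        digest.update(hashlib.sha256(inputs[name]).digest())
    return digest.hexdigest()


class ActionCache:
    """Toy cache: run an action only if its key has not been seen before."""

    def __init__(self):
        self._store = {}
        self.executions = 0

    def run(self, command, inputs, action):
        key = action_key(command, inputs)
        if key not in self._store:
            self.executions += 1
            self._store[key] = action(inputs)
        return self._store[key]


if __name__ == "__main__":
    cache = ActionCache()
    compile_fn = lambda inputs: b"".join(inputs[n] for n in sorted(inputs))
    sources = {"A.java": b"class A {}", "B.java": b"class B {}"}

    cache.run("javac", sources, compile_fn)
    cache.run("javac", sources, compile_fn)  # identical inputs: served from the cache
    cache.run("javac", {**sources, "B.java": b"class B { int x; }"}, compile_fn)
    print(cache.executions)  # 2
```

The model only works because every input is declared. An action that reads an undeclared file, or writes into the source tree, would invalidate the key without the cache noticing, which is the class of bug a sandbox is meant to prevent.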
If any parts of the Kafka repository need to branch out to other languages, Bazel is well equipped to handle it. Gradle has some support for building other languages via plugins, but we found that they were mostly abandoned or unsupported.