Flaky tests are the ever-looming enemy of any software project. A flaky test is any test which does not consistently pass or fail. In Apache Kafka, we have adopted a proactive strategy for managing and eliminating flaky tests. This document describes that process.
Build Scan Tags
Our CI workflow is split into three parallel jobs: one for "new" tests, one for "flaky" tests, and one for the remaining tests (i.e., mostly everything). The build scans include tags to let us easily filter for different results on Develocity.
- github: the CI was run on GitHub Actions
- trunk: the build was run on trunk
- new: the test has been added to trunk within the last 7 days
- flaky: the test is explicitly marked as flaky
Using combinations of these tags and their negations (e.g., "not:flaky"), we can learn a lot about the state of our tests and builds.
Identify Flaky Tests
There are a few ways to identify flaky tests. The most systematic approach is to query Develocity directly.
Using combinations of the tags in the Develocity search, you can see many useful things. Here are a few links to Develocity's reports:
- Most flaky tests on "trunk" (excluding "flaky" and "new" tests)
- Most flaky new tests on "trunk" (excluding "flaky" and including "new" tests)
- Problematic flaky tests ("flaky" tests which have a lot of failures)
We also have a nightly report that is built using Develocity data. This is an aggregation of build data and might offer some insights that are difficult to see directly in Develocity.
Finally, you can also simply look at recent trunk builds in GitHub Actions and see which tests have been failing. This is not a very systematic approach, but it is pretty fast to do.
Triaging Flaky Tests
Once we have identified some flaky tests, we need to take action. Here is an example of flaky tests on trunk from Mar 19, 2025.
Typically, there will be a long tail of flaky tests. We expect some natural amount of "background flakiness" due to the CI environment, esoteric VM behavior, cosmic rays, etc. We want to focus on the tests that are contributing the most to flakiness as these are the most likely to cause a build failure. From the above screenshot, we can see that CoordinatorRequestManagerTest is flaky 34% of the time. Click on the class to drill down to the actual test cases.
Here we can see that a single test case is the cause of the flakiness. Now that we have identified a flaky test, we must take some action on it.
Create Jira ticket
Once a flaky test case has been found, search the Apache Kafka JIRA and create a ticket if necessary. The ticket should use the issue type Test, and the priority should be set according to the flakiness over the last 7 days:
- CRITICAL 50% or more (these tests may also be disabled, if necessary)
- MAJOR 10-49%
- MINOR less than 10%
Also include the "flaky-test" label on the issue.
Include links to the Develocity report in the ticket. If you plan on working on the fix, assign the ticket to yourself. If an existing issue is found and the flakiness has changed, adjust the priority accordingly. Also feel free to re-open any closed flaky test issues if the problem has re-occurred.
Mark the test as @Flaky and open a Pull Request
Once the ticket has been created, we must explicitly mark the test as flaky in the code. This should be done as quickly as possible to reduce the impact on trunk.
The Pull Request can use "MINOR" in the title or reference the KAFKA ticket number.
If it is obvious who can work on the test, tag them on the Pull Request. Pull Requests which mark tests as flaky should be merged quickly to avoid duplicated effort and to minimize the time that trunk is vulnerable.
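As a rough illustration, here is a hedged sketch of what marking a test might look like. The package of the @Flaky annotation and the form of its attribute are assumptions and should be checked against the current codebase; the class, method, and ticket number below are hypothetical.
// Sketch only: the import path and attribute of @Flaky are assumptions; confirm the
// annotation's real package and signature in the Kafka codebase before copying this.
import org.apache.kafka.common.test.api.Flaky;   // assumed location
import org.junit.jupiter.api.Test;

public class SomeCoordinatorTest {               // hypothetical class

    @Flaky("KAFKA-XXXXX")                        // hypothetical ticket; use the Jira created above
    @Test
    public void testOccasionallyTimesOut() {
        // test body unchanged; only the annotation is added
    }
}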
Reproducing Flaky Tests
The simplest way to reproduce a flaky test is to run it several times locally. Usually, you will want to increase the log4j level to INFO or DEBUG. This is done by modifying the corresponding log4j.properties in a module's "test" source directory. E.g., for ":core" tests, you would modify "core/src/test/resources/log4j.properties".
To run a single test class from the command line, use the "--tests" flag for gradlew.
./gradlew :clients:test --tests "*KafkaConsumerTest*"
To run this repeatedly, you can use a Bash loop (as mentioned in the project README):
I=0; while ./gradlew clients:test --tests "*KafkaConsumerTest*" --rerun --fail-fast; do (( I=$I+1 )); echo "Completed run: $I"; sleep 1; done
Additionally, you may pass in Gradle properties to utilize the Develocity test retry behavior:
./gradlew -PmaxTestRetries=10 :clients:test --tests "*KafkaConsumerTest*"
With KAFKA-17433, a new custom Gradle property was added that lets us repeat integration tests. This is primarily intended for use in the "deflake" action (documented below), but can also be used locally.
./gradlew -Pkafka.cluster.test.repeat=2 :core:test --tests *ZkMigrationIntegrationTest*
IntelliJ IDEA
Sometimes, running the tests repeatedly with Gradle does not produce enough load on the system to expose the flakiness. If you are using IntelliJ IDEA, there is an option to run a single test N times or until failure. To enable this, first configure IntelliJ to run the tests directly rather than delegating to Gradle.
Then, create a Run Configuration for the flaky test. Under Modify Options, find the Repeat section. On older versions of IntelliJ, this dialog may look different, but the option is there somewhere.
GitHub "deflake" Action
There is a special "deflake" action on GitHub that allows us to run a single integration test repeatedly. Currently, it only supports tests written using the ClusterTestExtensions (i.e., @ClusterTest, @ClusterTests, and @ClusterTemplate). It combines the Gradle test selector "--tests" with the new property added in KAFKA-17433. See the Gradle docs on test filtering for valid patterns: https://docs.gradle.org/current/userguide/java_testing.html#simple_name_pattern
The deflake action is located at https://github.com/apache/kafka/actions/workflows/deflake.yml
The Action is limited to 60 minutes of execution time, so take care not to run too many tests or use too many repetitions. Committers using this action can push their branch to apache/kafka in order to see a Gradle Build Scan of the job.
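For reference, the sketch below shows the shape of a test the deflake action can target. The import paths and the waitForReadyBrokers call are assumptions about the ClusterTestExtensions framework and should be verified against the codebase; the class and method names are hypothetical.
// Sketch only: import paths and helper methods are assumptions; verify against the codebase.
import org.apache.kafka.common.test.ClusterInstance;   // assumed location
import org.apache.kafka.common.test.api.ClusterTest;   // assumed location

public class ExampleDeflakeTargetTest {                // hypothetical

    // A @ClusterTest method is eligible for the deflake action. In the workflow inputs,
    // a Gradle selector such as "*ExampleDeflakeTargetTest*" would match this class.
    @ClusterTest
    public void testClusterBehavior(ClusterInstance cluster) throws InterruptedException {
        cluster.waitForReadyBrokers();   // assumed helper; exercise the cluster here
    }
}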
Fixing the flakes
As time permits, contributors will work on fixes for flaky tests. We should generally prioritize fixing flaky tests over new feature development, but we must also be practical. If a subject matter expert is needed to fix a given test, it may take some time. When subject matter experts are available, we should expect them to address flaky tests in their area before adding new features or new tests.
Some common causes of flaky tests include (see the sketch below the list for a typical remedy):
- Timeouts
- Operations not being retried
- Race conditions (in test code, or in production code)
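As a sketch of a typical remedy for the first two causes, the example below replaces a fixed sleep with a polling wait. It assumes the TestUtils.waitForCondition helper from Kafka's shared test utilities; the surrounding names are hypothetical.
// Sketch only: illustrates polling for a condition instead of sleeping for a fixed time.
// TestUtils.waitForCondition is assumed to be available from Kafka's shared test utilities.
import org.apache.kafka.test.TestUtils;

import java.util.Set;

public class ExampleAwaitHelper {                      // hypothetical

    // Fragile pattern: Thread.sleep(5000) followed by a single assertion.
    // More robust: poll until the condition holds, or fail with a descriptive message.
    public static void awaitAssignment(ExampleConsumer consumer) throws InterruptedException {
        TestUtils.waitForCondition(
            () -> !consumer.assignment().isEmpty(),
            15_000L,
            "Consumer did not receive an assignment in time"
        );
    }

    // Hypothetical interface, only here to keep the sketch self-contained.
    public interface ExampleConsumer {
        Set<String> assignment();
    }
}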
When it comes to merging test fixes, we should not remove the @Flaky annotation in the same Pull Request that aims to fix the flakiness. The annotation should remain until we have sufficient data to prove that the test has actually stabilized.
In other words: a test is flaky until the data says otherwise!