Flaky tests are the ever-present looming enemy of any software project. A flaky test is any test which does not consistently pass or fail. These tests reduce confidence in our build. The presence of flaky failures also means that we rarely see "green builds" on Pull Requests or trunk builds. A small negative signal (one flaky failure) is amplified into a large negative signal (a failing build). This means that committers become accustomed to ignoring build failures. On several occasions, this has allowed other negative signals, such as checkstyle or compile errors, to be ignored.
We should strive to reduce, if not eliminate, flaky tests from our build.
The Develocity build scans offer very useful insight into the Apache Kafka builds. Under the Tests view, we can look at the most flaky test classes across a given period of time.
Here is an example of the top 6 flaky tests from Aug 29, 2024.
In Apache Kafka, we have adopted a proactive strategy for managing and eliminating flaky tests. This document describes that process.
Build Scan Tags
Our CI workflow is split into three parallel jobs: one for "new" tests, one for "flaky" tests, and one for the remaining tests (i.e., mostly everything). The build scans include tags to let us easily filter for different results on Develocity.
- github: the CI was run on GitHub Actions
- trunk: the build was run on trunk
- new: the test has been added to trunk within the last 7 days
- flaky: the test is explicitly marked as flaky
Using combinations of these tags and their negations (e.g., "not:flaky"), we can learn a lot about the state of our tests and builds.
Identify Flaky Tests
There are a few ways to identify flaky tests. The most systematic and direct approach is to look at Develocity directly.
Using combinations of the tags in the Develocity search, you can see many useful things. Here are a few links to Develocity's reports:
- Most flaky tests on "trunk" (excluding "flaky" and "new" tests)
- Most flaky new tests on "trunk" (excluding "flaky" and including "new" tests)
- Problematic flaky tests ("flaky" tests which have a lot of failures)
We also have a nightly report that is built using Develocity data. This is an aggregation of build data and might offer some insights that are difficult to see directly in Develocity.
Finally, you can also simply look at recent trunk builds in GitHub Actions and see which tests have been failing. This is not a very systematic approach, but it is pretty fast to do.
Triaging Flaky Tests
Once we have identified some flaky tests, we need to take action. Here is an example of flaky tests on trunk from Mar 19, 2025.
If we browse through the results, there is likely to be a long tail of flaky tests. That is to say, a small number of tests are responsible for the majority of the flaky failures. These are the ones to prioritize. We expect some natural amount of "background flakiness" due to the CI environment, esoteric VM behavior, cosmic rays, etc. We want to focus on the tests that contribute the most to flakiness, as these are the most likely to cause a build failure. From the above screenshot, we can see that CoordinatorRequestManagerTest is flaky 34% of the time. Click on the class to drill down to the actual test cases.
https://develocity.apache.org/scans/tests?search.rootProjectNames=kafka&search.tags=github%2Ctrunk%2Cnot:flaky%2Cnot:new&search.tasks=test&search.timeZoneId=UTC&tests.container=org.apache.kafka.clients.consumer.internals.CoordinatorRequestManagerTest&tests.sortField=FLAKY
Here we can see that a single test case is the cause of the flakiness. Now that we have identified a flaky test, we must take some action on it.
Create Jira ticket
Once a flaky test case has been found, search the Apache Kafka JIRA and create a ticket if necessary. The ticket should be for the type Test and the severity should be set according to the flakiness over the last 7 days.
- CRITICAL 50% or more (these tests may also be disabled, if necessary)
- MAJOR 10-49%
- MINOR less than 10%
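The severity thresholds above can be sketched as a small helper function. This is illustrative only; `flaky_severity` is a made-up name, not part of the Kafka tooling:

```shell
# Map a flakiness percentage (0-100) over the last 7 days to a Jira severity,
# following the thresholds above. "flaky_severity" is an illustrative name.
flaky_severity() {
  if [ "$1" -ge 50 ]; then echo CRITICAL
  elif [ "$1" -ge 10 ]; then echo MAJOR
  else echo MINOR
  fi
}

flaky_severity 34   # prints MAJOR
```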
...
Include links to the Develocity report in the test ticket. If you plan on working on the fix, assign the ticket to yourself. If an existing issue is found and the flakiness has changed, adjust the severity accordingly. Also, feel free to re-open any closed flaky test issues if the problem has re-occurred.
Mark the test as @Flaky and open a Pull Request
| Info |
|---|
| Do not skip this step! Even if you think you have solved the flakiness for a given test, we should still mark it as flaky. We cannot know for sure if a test has stabilized until we have collected data over several days. |
Once the ticket has been created, we must explicitly mark the test as flaky in the code. This should be done as quickly as possible to reduce the impact on trunk.
The Pull Request title can use "MINOR" or the KAFKA ticket number.
If it is obvious who can work on the test, tag them in the Pull Request. Pull Requests which mark tests as flaky should be merged quickly to avoid duplicated effort and to minimize the time that trunk is vulnerable.
Reproducing Flaky Tests
The simplest way to reproduce a flaky test is to run it several times locally. Usually, you will want to increase the log4j level to INFO or DEBUG. This is done by modifying the corresponding log4j.properties in a module's "test" source directory. E.g., for ":core" tests, you would modify "core/src/test/resources/log4j.properties".
...
./gradlew -Pkafka.cluster.test.repeat=2 :core:test --tests *ZkMigrationIntegrationTest*
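To get a rough local failure rate, you can wrap a command like the one above in a small loop. This is a sketch; `count_failures` is an illustrative helper, not part of the Kafka build:

```shell
# Run a command N times and report how many runs failed. Useful for getting a
# rough local failure rate for a flaky test.
count_failures() {
  cmd=$1; n=$2; fails=0
  for i in $(seq 1 "$n"); do
    sh -c "$cmd" >/dev/null 2>&1 || fails=$((fails+1))
  done
  echo "$fails"
}

# Example (substitute your own flaky test and repetition count):
# count_failures "./gradlew :core:test --tests '*ZkMigrationIntegrationTest*'" 10
```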
IntelliJ IDEA
Sometimes, running the tests repeatedly with Gradle does not produce enough load on the system to expose the flakiness. If you are using IntelliJ IDEA, there is an option to run a single test N times or until failure. To enable this, first modify IntelliJ to run the tests directly rather than calling out to Gradle:
...
Then, create a Run Configuration for the flaky test. Under Modify Options, find the Repeat section. On older versions of IntelliJ, this dialog may look different, but the option is there somewhere.
GitHub "deflake" Action
There is a special "deflake" action on GitHub that allows us to run a single integration test repeatedly. Currently, it only supports tests written using the ClusterTestExtensions (i.e., @ClusterTest, @ClusterTests, and @ClusterTemplate). It combines Gradle test selectors ("--tests") with the new property added in KAFKA-17433. See the Gradle docs on test filtering for valid patterns: https://docs.gradle.org/current/userguide/java_testing.html#simple_name_pattern.
...
The Action is limited to 60 minutes of execution time, so take care not to run too many tests or use too many repetitions. Committers using this action can push their branch to apache/kafka in order to see a Gradle Build Scan of the job.
Root Cause Analysis
Fixing the flakes
As time permits, contributors will work on fixes for flaky tests. We should generally prioritize fixing flaky tests over new feature development, but we must also be practical. If a subject matter expert is needed to fix a given test, it may take a bit of time. If subject matter experts are available, we should expect them to address flaky tests in their area before adding new features or new tests. Finding the root cause of a flaky test is usually the hardest part. After all, if the problem was obvious, someone would probably have fixed it already.
Some common causes of flaky tests include:
- Timeouts
- Operations not being retried
- Race conditions (in test code, or in production code)
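For example, a fixed sleep before an assertion is a classic race: on a loaded CI machine, the operation may not have completed by the time the sleep expires. Polling for the condition with a timeout (similar in spirit to Kafka's TestUtils.waitForCondition in Java) is more robust. A shell sketch of the same idea, where `wait_until` is an illustrative helper:

```shell
# Poll for a condition instead of sleeping for a fixed time. Fixed sleeps are
# a common source of timing-dependent flakiness; polling with a timeout is
# more robust. "wait_until" is an illustrative helper.
wait_until() {
  cmd=$1; timeout_secs=$2; waited=0
  until sh -c "$cmd" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout_secs" ]; then
      echo "condition not met within ${timeout_secs}s" >&2
      return 1
    fi
    sleep 1
    waited=$((waited+1))
  done
}

# Example: instead of "sleep 5 && test -f /tmp/ready", poll for the file:
# wait_until "test -f /tmp/ready" 30
```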
When it comes to merging test fixes, we should not remove the @Flaky annotation in the same Pull Request that aims to fix the flakiness. The annotation should remain until we have sufficient data to prove that the test has actually stabilized.
In other words: a test is flaky until the data says otherwise!






