Current state: Accepted

Discussion thread: https://lists.apache.org/thread/d4r4g9hqfmwpckdk9j01n7dh5y25j11n

JIRA: KAFKA-17629

Motivation

Flaky tests are an ever-present problem in the Apache Kafka build. The presence of flaky tests erodes confidence in our test results, which leads to many problems, both social and technical. When developers are used to seeing "red" results from the CI system, they are more likely to miss or ignore a significant test failure. The constant need to re-run the test suite in order to gain reasonable confidence puts a heavy burden on our test infrastructure, leading to very real costs.

The aim of this KIP is to address some of the technical problems of flaky tests, which will hopefully alleviate some of the social problems.

Some of our technical problems include:

  • Small amounts of flakiness compound, resulting in frequent build failures
  • Verifying fixes for flaky tests is tedious
  • Re-running all tests up to 3 times can mask problems
  • New tests are unknowns and frequently the source of flaky failures

Public Interfaces

No public interfaces are affected by this KIP. This is a development process change.

Proposed Changes

The main idea presented here is the introduction of a test "quarantine". This is a means of isolating flaky tests so they do not affect our build outcomes. Tests that have been placed into quarantine will be run as part of the CI builds, but their results will be reported separately. Once a test is placed into quarantine, it will be evaluated for a predetermined amount of time. A test will be removed from the quarantine once it passes the exit criteria.

This idea may also be thought of as test isolation or test cordoning. However, the name quarantine seems appropriate because, like a medical quarantine, this test quarantine is designed to be temporary. 

We can break this down into a few key components:

  • Isolation mechanism
  • Historical data
  • Automated reporting


Here is a diagram of the proposed workflow:

Test Isolation

There are three types of tests that we will exclude from the main test suite:

  • Tests explicitly marked as flaky
  • Tests that were recently, but are no longer, marked as flaky
  • Newly added integration tests

For explicitly marked flaky tests, we will use the JUnit tagging system. A new @Flaky tag will include a required String field that refers to a Jira ticket. Explicit tagging of flaky tests makes it more obvious to developers which tests need attention, since the tags live in the source code alongside the test.
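As a concrete sketch, the annotation might look like the following. The exact package, retention, and the meta-annotation with JUnit's @Tag are assumptions for illustration; the KIP only specifies a tag with a required String field referencing a Jira ticket.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

import org.junit.jupiter.api.Tag;

// Hypothetical sketch of the proposed @Flaky annotation. Meta-annotating it
// with JUnit's @Tag means quarantined tests can be included in or excluded
// from a test run using standard JUnit 5 tag filtering.
@Target({ElementType.TYPE, ElementType.METHOD})
@Retention(RetentionPolicy.RUNTIME)
@Tag("flaky")
public @interface Flaky {
    /** The Jira ticket tracking this flaky test, e.g., "KAFKA-17629". */
    String value();
}
```

A flaky test would then be quarantined in place:

```java
@Flaky("KAFKA-12345")  // hypothetical ticket number, for illustration
@Test
public void testGroupRebalanceCompletes() {
    // ... test body unchanged; only the annotation is added
}
```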

Once tests have been fixed, the tag will be removed by a developer (most likely, in the Pull Request that fixes the test). However, we don't want to include these tests in the main suite just yet. They will remain in the quarantine for 7 days to determine if they are indeed no longer flaky.

Quarantining New Tests

According to Google [2], flaky tests are added at approximately the same rate as they are fixed. Anecdotally, this seems to be the case in Apache Kafka as well. This phenomenon results in a non-zero steady-state failure rate in the test suite, which is not desirable.

One solution to this problem is to require a certain number of successful test runs on a Pull Request prior to merging. This is basically how we verify new tests today. The problem with this approach is that it is very ad-hoc and difficult to enforce. It also relies on a committer being heavily involved in the PR because CI runs can only be re-triggered by a committer (this is true for Jenkins and GitHub Actions). Another side effect of this approach is that it reduces our development velocity. Pull Requests stay open longer in order to produce sufficient test data. This leads to divergence from trunk which brings its own set of problems.

Automatically placing a new test into the quarantine will allow us to observe its behavior without the risk of failing builds. After a few days of successful runs, these tests will graduate from the quarantine and be run as part of the main suite. 

Historical Data

One challenge with fixing flaky tests is knowing when a test has actually been fixed. In some cases, there are multiple sources of flakiness within a single test, and it may not be obvious that all of them have been addressed. There have been several occasions where Apache Kafka developers have committed a fix for a flaky test only to find that they have merely reduced the flakiness rather than eliminated it. By collecting historical test data, we can increase our confidence that a fix was successful.

Here is a sample timeline for a quarantined test:

  • Day 0: Flaky test identified, test is marked as quarantined. Builds no longer impacted by this test
  • Day 1: Developer commits fix for flaky test
  • Days 1-N: CI builds continue to run with the flaky test in quarantine, collecting data
  • Day N: After several days with no failures, the test is removed from quarantine

For some months now, Apache Kafka has been producing Gradle Build Scans, which are stored in an instance of Develocity hosted by the ASF. This application can serve as a historical record of our test execution. Develocity also has an API which can be used to export this data and query a broad range of information about the test suite.

In addition to the historical test execution data, we also need to know the history of which tests have been quarantined. This information can be obtained from the Develocity API and the Git history.
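For illustration, here is a minimal sketch of pulling recent build data from the Develocity REST API using only the JDK's HTTP client. The /api/builds endpoint and bearer-token authentication follow the Develocity Builds API, but the host name, the query parameters, and the follow-up queries for per-test results should be treated as assumptions to verify against the API version of the instance in use.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DevelocityExport {
    public static void main(String[] args) throws Exception {
        // ASF-hosted Develocity instance (assumed host name)
        String host = "https://ge.apache.org";
        // API access token, provisioned out of band
        String token = System.getenv("DEVELOCITY_TOKEN");

        // Fetch metadata for the 100 most recent builds
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(host + "/api/builds?maxBuilds=100"))
                .header("Authorization", "Bearer " + token)
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // The response is JSON build metadata; per-build test results can then
        // be fetched and aggregated into the historical record described above.
        System.out.println(response.body());
    }
}
```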

Reporting

Using the Git history and the Develocity API, we can create automated reports to help us identify tests that should be added or removed from the quarantine. 

New Tests

This report will query Develocity for tests which only have recent history (i.e., newly committed tests). This will let us see the flakiness of newly added tests. New tests that exhibit flakiness will need to be promptly fixed or manually marked as flaky.

Flaky Test Regressions

This report will query Develocity for flaky tests which have an established history and have recently started failing. This will indicate that some code (or test) change has impacted the stability of the test, or that a bug was introduced that only causes occasional test failures.

Cleared Tests

This report will query Develocity for passing tests that are on the quarantined list. If a test has had passing builds above some threshold, it will be reported as "cleared" from the quarantine.

Defective Tests

This report will list tests that have been quarantined for an excessive number of days. Tests in this report should be considered for removal or rewriting. This report will help surface tests that have lingered in the quarantine and have perhaps been ignored or forgotten.

Exit Criteria

If a test has no flaky failures on trunk for 7 days with a minimum of 10 builds, we can consider it "cleared" and remove it from quarantine.
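As a sketch, the clearance check is simple to express. The BuildRecord type and the way runs are loaded are hypothetical stand-ins for data exported from Develocity; the thresholds are the criteria stated above.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

public class ExitCriteria {
    static final int MIN_BUILDS = 10;
    static final Duration WINDOW = Duration.ofDays(7);

    // Hypothetical stand-in for one trunk run of a quarantined test; real
    // data would come from the Develocity API.
    record BuildRecord(Instant timestamp, boolean flakyFailure) {}

    // A test is cleared if it ran at least MIN_BUILDS times on trunk within
    // the window and none of those runs were flaky failures.
    static boolean isCleared(List<BuildRecord> runs, Instant now) {
        Instant cutoff = now.minus(WINDOW);
        List<BuildRecord> recent = runs.stream()
                .filter(r -> r.timestamp().isAfter(cutoff))
                .toList();
        return recent.size() >= MIN_BUILDS
                && recent.stream().noneMatch(BuildRecord::flakyFailure);
    }
}
```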

Long Tail Failures

As with most statistical phenomena, there is a long tail of test failures: a relatively large number of tests that each fail only very occasionally. These failures are extremely hard to diagnose due to their rare nature. They can come from infrastructure problems, rare timing race conditions, cosmic rays, etc. Quarantining these tests would likely be ineffective because they would end up staying in the quarantine for too long. Instead, we can use the Develocity test retry feature to help with these. Rather than our existing permissive and broad application of retries, we can use a stricter and more targeted approach.


References

  1. https://junit.org/junit5/docs/current/user-guide/#extensions-conditions
  2. https://testing.googleblog.com/2016/05/flaky-tests-at-google-and-how-we.html
  3. https://medium.com/@tom.tikkle/how-to-manage-flaky-tests-quarantine-a34c86993354
  4. https://medium.com/scopedev/how-can-we-peacefully-co-exist-with-flaky-tests-3c8f94fba166

Compatibility, Deprecation, and Migration Plan

N/A

Test Plan

N/A

Rejected Alternatives

Automatic Quarantining

Using the proposed test retry policy and data from Develocity, we could fully automate quarantine. The problem with this is that it makes it far too easy to ignore flaky tests. By only allowing new tests and recently fixed tests to automatically enter the quarantine, we are reducing developer toil but not making it too easy to ignore the problem.


Quarantined Test File

Instead of tags, we could place a text file of quarantined tests in the source tree. This has the benefit of showing all the quarantined tests in one place, along with an easy-to-inspect history of when tests were added to and removed from the quarantine. However, the downside of this approach is that the flakiness of a test is not immediately visible to developers looking at the test source code.
