Unstable tests fail non-deterministically, so they can sneak into the main codebase if they happen to pass during the initial PR. Once merged, their failures make it difficult to tell whether a new contribution introduced an issue or whether the failure is entirely unrelated, which increases the burden of the review process. To maintain a healthy test suite, the Flink community tracks tests that fail in CI via JIRA Bug tickets with the test-stability label, "Critical" severity, and the affected versions.

Finding Test Instability Tickets

  1. Go to the Apache JIRA site and click "View all issues and filters"
  2. In the "Advanced" search mode, enter a query like:

    project = FLINK AND resolution = unresolved AND labels in (test-stability) ORDER BY createdDate DESC

    This lists all open test instability issues, newest first. It can also be helpful to narrow the results to specific components of the project with an additional clause:

    project = FLINK AND component in ("Runtime / Coordination", "Runtime / REST", "Runtime / Queryable State", "Runtime / Metrics", "Deployment / Mesos", "Deployment / Kubernetes", "Deployment / YARN", "Build System", "Release System", "BuildSystem / Shaded", flink-docker) AND resolution = unresolved AND labels in (test-stability) ORDER BY createdDate DESC
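
If you run these searches often, the query can also be assembled by a small script. The following is a minimal sketch in Python; the function name build_test_stability_jql and its parameters are hypothetical and not part of any Flink or JIRA tooling. It only produces the query string, which you then paste into the "Advanced" search field.

```python
def build_test_stability_jql(components=None, project="FLINK", label="test-stability"):
    """Build the JQL query for unresolved test instability tickets.

    components: optional list of JIRA component names used to narrow the search.
    """
    clauses = [f"project = {project}"]
    if components:
        # Component names contain spaces and slashes, so quote each one.
        quoted = ", ".join(f'"{name}"' for name in components)
        clauses.append(f"component in ({quoted})")
    clauses.append("resolution = unresolved")
    clauses.append(f"labels in ({label})")
    return " AND ".join(clauses) + " ORDER BY createdDate DESC"


# Query restricted to two of the components listed above.
print(build_test_stability_jql(["Runtime / Coordination", "Deployment / YARN"]))
```

Called without arguments, the function returns the first query shown above.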

Triaging Techniques

Finding and Downloading Logs in CI

CI runs the tests with a logging configuration that sends no test output to standard out, which is the only output visible in the web interface. The full log statements are written to disk instead. The steps below describe how to retrieve these logs from the testing machines:

  1. Navigate to the Azure Pipelines build of the failure, which should be linked in the ticket
    1. If it is not, please comment on the ticket asking the reporter to provide those details
  2. Find the failing job and open its "x artifact produced" link
    1. NOTE: remember the name of the failing job before clicking the link
  3. On the linked page, there should be an artifact with a name similar to logs-ci-JOB_NAME_COMPONENT-12345678
    1. Example: for the job name test_ci kafka_gelly, there should be a log artifact named logs-ci-kafkagelly-12345678
  4. Download the log artifact via the inline three-dot menu (on the right)
  5. Unarchive the log artifact
    1. The artifact should be a zip containing a tarball
  6. In the archive, look for the mvn-x.log files
    1. There may be more than one, depending on how many threads Maven is running tests in
  7. Open the log files in your editor; IntelliJ displays them fairly well
  8. In each log file, search for the test name and look for details pointing to the failure cause. If no log line for the test appears, make sure that the test extends TestLogger, a utility class that prints a log statement at the beginning and end of each test. We are happy to accept "hotfix" pull requests to address this.
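
Steps 5 to 8 can also be scripted. The following is a minimal sketch in Python, assuming the layout described above: a zip archive containing a tarball, which in turn contains the mvn-x.log files. The function name find_test_lines and the file suffixes it checks for the inner tarball are assumptions; adjust them if the artifact is packaged differently.

```python
import fnmatch
import io
import tarfile
import zipfile


def find_test_lines(artifact_zip, test_name):
    """Yield (log_name, line_number, line) for every mvn-*.log line containing test_name.

    artifact_zip: path to the downloaded artifact, or a binary file-like object.
    """
    with zipfile.ZipFile(artifact_zip) as outer:
        for entry in outer.namelist():
            # Assumption: the inner tarball uses one of these common suffixes.
            if not entry.endswith((".tar.gz", ".tgz", ".tar")):
                continue
            # Load the tarball into memory; "r:*" auto-detects the compression.
            data = io.BytesIO(outer.read(entry))
            with tarfile.open(fileobj=data, mode="r:*") as tar:
                for member in tar.getmembers():
                    base = member.name.rsplit("/", 1)[-1]
                    if not (member.isfile() and fnmatch.fnmatch(base, "mvn-*.log")):
                        continue
                    log = tar.extractfile(member)
                    for number, raw in enumerate(log, start=1):
                        line = raw.decode("utf-8", errors="replace").rstrip("\n")
                        if test_name in line:
                            yield member.name, number, line
```

For example, iterating over find_test_lines("logs-ci-kafkagelly-12345678.zip", "MyFlakyITCase") lists every matching line together with its log file and line number. MyFlakyITCase is a placeholder test name. The sketch reads each tarball fully into memory, which is convenient for a one-off search but may be unsuitable for very large artifacts.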

Reproducing Locally

The easiest way to see how a test is failing is to reproduce it in your IDE. This is not always possible, but it is a good first step if the cause is not obvious from the initial failure logs.

  1. Set up your Flink development environment in IntelliJ, if you have not already, by following the Setting up a Flink development environment guide
  2. Navigate to the failing test or test suite and run it via the IntelliJ JUnit integration
    1. The integration offers ways to configure the test runner to run the test repeatedly, either a set number of times or until failure, which is helpful for flaky tests
    2. https://www.jetbrains.com/help/idea/run-debug-configuration-junit.html#configTab
  3. If this does not yield a failure in a reasonable amount of time, the next step is to repeat this tactic using CI with debug logs
  4. (optional) If you can reproduce the issue locally, remember to set the logging level in "test/resources/log4j-test.properties" from "OFF" to "INFO" or "DEBUG" to get more information.

For flink-yarn-tests there is one specific point to consider: these tests spin up a MiniYARNCluster, which lives outside of the JVM. You have to specify JAVA_HOME in your IDE's run configuration for some of these tests to succeed. See flink-yarn-tests/README.md for further details.

Enabling CI Debug Logging and Test Repetition for bash-based e2e tests

There is currently no way to configure this from the Azure Pipelines UI, so you have to patch the test-running script so that it runs just the failing test, and commit the change.

  1. Patch the test configuration
    1. Set the tools/ci/log4j.properties rootLogger level to DEBUG
    2. Edit the tools/ci/test_controller.sh script to run a test repeatedly
    3. An example commit can be seen here: https://github.com/XComp/flink/commit/4360ed402859fd5d3359b323d5d92e5ed5b1ea31
  2. Ensure you have an Azure DevOps account, which you can sign up for using GitHub
    1. You can then set up your Flink fork to run in your personal Azure pipelines
  3. Once a pipeline has run and reproduced the error, find the logs in the same way as above, and cross your fingers that something indicative of the failure is in them this time
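
The logging part of step 1 can be applied with a small helper instead of a manual edit. The following is a minimal sketch in Python. It assumes that the properties file sets the root logger level on a line of the form rootLogger.level = LEVEL; that key name is an assumption about the file's contents, so check tools/ci/log4j.properties before relying on it. The function name set_root_logger_level is hypothetical.

```python
import re


def set_root_logger_level(properties_text, level="DEBUG"):
    """Return properties_text with the root logger level replaced by level.

    Assumes an entry of the form "rootLogger.level = LEVEL" (an assumption
    about the file format, not something confirmed for every Flink version).
    """
    pattern = re.compile(r"^([ \t]*rootLogger\.level[ \t]*=[ \t]*)\S+", flags=re.MULTILINE)
    updated, count = pattern.subn(lambda match: match.group(1) + level, properties_text)
    if count == 0:
        # Fail loudly rather than silently leaving the file unchanged.
        raise ValueError("no rootLogger.level entry found")
    return updated
```

To patch the file, read tools/ci/log4j.properties, pass its contents through set_root_logger_level, write the result back, and commit it together with the change to tools/ci/test_controller.sh.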
