The flaky test bot is a system for checking PRs for flakiness by detecting new or modified tests and running these tests a large number of times. Further details of the design can be found here, and the implementation is in this PR. In order to deploy the flaky test bot, we plan to create a separate Jenkins job that can run the flaky test check on PRs.
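The core idea of running a test many times can be sketched as follows. This is a minimal illustration, not the bot's actual implementation: the function name `count_failures` and the sample test are hypothetical, and the real bot's detection of new or modified tests is not shown.

```python
import random
from typing import Callable


def count_failures(test: Callable[[], None], runs: int) -> int:
    """Run `test` `runs` times and return how many runs raised an exception.

    Hypothetical helper for illustration; any failure at all would mark
    the test as suspect.
    """
    failures = 0
    for _ in range(runs):
        try:
            test()
        except Exception:
            # A failed assertion or any other error counts as one failed run.
            failures += 1
    return failures


if __name__ == "__main__":
    rng = random.Random(0)

    def sometimes_fails() -> None:
        # Hypothetical flaky test: fails roughly 1% of the time.
        assert rng.random() >= 0.01

    total_runs = 1000
    failed = count_failures(sometimes_fails, total_runs)
    print(f"{failed} of {total_runs} runs failed")
```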

The flaky test check will run as a separate Jenkins job that is triggered on PRs, like our current PR checks. During this first stage of the deployment, we will evaluate the bot's performance and fix any issues. The plan is then to deploy the bot as part of the normal PR checks.

I, Carl Tsai, will be in charge of overseeing the deployment: monitoring performance and ensuring the bot is ready to move into production. Once deployed, normal monitoring and maintenance will be owned by contributors who are working on the CI system. Any bug fixes, improvements, or design changes will be the responsibility of the authors of the bot, Carl Tsai and Hao Jin.

  • Why can't this be part of the existing PR check pipeline?
    • The current plan is to use this bot to check PRs for flakiness. However, to achieve high confidence in a test's robustness, the bot needs to run for a significant amount of time. Doing this as part of the current PR check pipeline could as much as double the time that pipeline takes. Doing this in a separate pipeline, as proposed, would allow the bot to run for a reasonably long time without delaying the existing checks.
  • Why should we run this check on every PR?
    • The main reason for running this job on PRs rather than on nightly builds or other less frequent builds is to enforce high-quality tests. One of the main reasons that flaky tests are a problem is that they cause builds to fail on otherwise healthy PRs; by running this as a PR check, we ensure that most flaky tests never enter the code base and cause other builds to fail. Another benefit of running this as a PR check is that if a PR is marked as flaky, the person who made the change takes ownership of fixing the flakiness, which is more efficient than having another developer look into the issue.
    • It is worth noting that if we later decide that it is not worth running this on PR verification builds (if, for example, the cost is too high or we are not achieving a high enough detection rate), we will consider deploying this as part of nightly builds as a backup option. The tool can also be added to nightly tests or other triggered builds with low effort.
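To give a rough sense of why high confidence requires a long run, the arithmetic can be sketched as below. This is an illustration under stated assumptions, not the bot's actual method: it assumes each run fails independently with a fixed probability `failure_rate`, so that `runs` repetitions catch at least one failure with probability 1 - (1 - failure_rate)^runs. The function names are hypothetical.

```python
import math


def detection_probability(failure_rate: float, runs: int) -> float:
    """Probability of seeing at least one failure in `runs` independent runs."""
    return 1.0 - (1.0 - failure_rate) ** runs


def runs_needed(failure_rate: float, confidence: float) -> int:
    """Smallest number of runs whose detection probability reaches `confidence`.

    Assumes independent runs with a fixed per-run failure rate (an
    assumption made for this sketch).
    """
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Solve 1 - (1 - p)^n >= c for n.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    for rate in (0.1, 0.01, 0.001):
        print(f"failure rate {rate}: {runs_needed(rate, 0.99)} runs for 99% confidence")
```

Under these assumptions, a test that fails 1% of the time needs 459 runs to be caught with 99% probability, and rarer failures need proportionally more, which is why the check is costly to fold into the existing pipeline.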


