See dev mailing list thread here.

The Build Lead role is inspired by the "Build Baron" role used at MongoDB (see whitepaper section 3.2 here). While their role began as performance regression and change point analysis triage, ours starts from triaging test failures and database correctness, and may evolve to cover performance regression and change point triage in the future.

Rotation

The Build Lead role is a volunteer role with weekly rotations.

Date Range (reversed) | Week | Name | Email | #cassandra-dev slack
- | 21 | | |
- | 20 | | |
- | 19 | | |
- | 18 | | |
- | 17 | Mick Semb Wever | mck@apache.org | mck
- | 16 | Dan Jatnieks | djatnieks@gmail.com | Dan Jatnieks
- | 15 | | |
- | 14 | Josh McKenzie | jmckenzie@apache.org | jmckenzie
- | 13 | Josh McKenzie | jmckenzie@apache.org | jmckenzie
- | 12 | Mick Semb Wever | mck@apache.org | mck
- | 11 | maxwellguo | cclive1601@gmail.com | maxwellguo
- | 10 | Derek Chen-Becker | apache@chen-becker.org | dchenbecker
- | 9 | Derek Chen-Becker | apache@chen-becker.org | dchenbecker
- | 8 | Mick Semb Wever | mck@apache.org | mck
- | 7 | maxwellguo | cclive1601@gmail.com | maxwellguo
- | 6 | German Eichberger | geeichbe@microsoft.com | xgerman
- | 5 | Dan Jatnieks | djatnieks@gmail.com | Dan Jatnieks
- | 4 | Claude Warren | claude.warren@aiven.io | Claude Warren
- | 3 | Caleb Rackliffe | calebrackliffe@gmail.com | Caleb Rackliffe
- | 2 | Mick Semb Wever | mck@apache.org | mck

See child pages for past years.

Tools

Butler: dashboard of historical test failures and per-test build history failure details w/JIRA links (see trunk here)

OpenTestFailures kanban board: board showing all labeled test failure JIRA tickets

ASF Jenkins C* CI: source data pulled by Butler

CircleCI: optional, paid testing infrastructure (you pay for parallelism; see .circleci/generate.sh for details on profiles and usage)

ASF Infra:

Workflow

Weekly:

  • Enter: handoff w/previous build lead
  • Exit: handoff to next build lead
  • Coordinate with release manager if any releases are happening that week

Daily:

  • Check if there are new test failures in Butler that don't yet exist in JIRA (i.e. butler test failures w/out a JIRA link)
  • Create JIRA tickets for new failures and link them to the failure entries in Butler
  • Keep an eye out for major build infra issues; if any show up, raise them in #cassandra-dev
  • Optional:
    • Assign the test failure JIRA to whoever introduced a new failing test or, if clear, broke an existing stable test
    • Run a hires config against trunk / other desired branches on CircleCI, confirm tickets exist for any failures, and create tickets where none exist
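The first daily step amounts to filtering Butler's failure list for entries with no linked ticket. A minimal sketch, assuming the failures have been copied out of Butler by hand into a list of records; the record shape (`test` and `jira` keys) and the function name are hypothetical and not part of Butler:

```python
from typing import Dict, List, Optional


def failures_needing_tickets(failures: List[Dict[str, Optional[str]]]) -> List[str]:
    """Return the names of failing tests that have no linked JIRA ticket.

    Each record is a hypothetical dict with a "test" name and an optional
    "jira" key; a missing, None, or empty "jira" value means "not linked yet".
    """
    unlinked = {f["test"] for f in failures if not f.get("jira")}
    return sorted(unlinked)


# Illustrative data only; the test names and ticket key are made up.
example = [
    {"test": "SuiteA.test_one", "jira": "CASSANDRA-00000"},
    {"test": "SuiteB.test_two", "jira": None},
    {"test": "SuiteB.test_two", "jira": None},
    {"test": "SuiteC.test_three"},
]
todo = failures_needing_tickets(example)
```

Each name in `todo` is a candidate for a new ticket, which is then linked back to its failure entry in Butler.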

Details

Creating JIRAs

  • Create a JIRA ticket with summary: "Test Failures: <suite> <class_name>"
  • Set component to the matching "Test/<suite>" component
  • Fill out description w/mention of class name and number of failures at time of ticket creation

  • In comments, add details of the failure w/link to the failing run, plus the test output captured in a JIRA {code} block, as CI results aren't preserved forever

  • After creation, update the ticket to Bug Category "Correctness", "Test Failure"

When we close out all failures for a test class across all branches, we close out the JIRA. If another failure comes up on that class, we can re-open.
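The ticket conventions above can be sketched as a small helper that produces the text to paste into JIRA. This is only an illustration: the `FailureTicket` type, the function names, and the example suite name are assumptions, and nothing here calls the JIRA API.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class FailureTicket:
    summary: str
    component: str
    description: str
    bug_category: Tuple[str, str]


def build_ticket(suite: str, class_name: str, failure_count: int) -> FailureTicket:
    """Fill in the ticket fields following the conventions listed above."""
    return FailureTicket(
        summary=f"Test Failures: {suite} {class_name}",
        component=f"Test/{suite}",
        description=(
            f"{class_name}: {failure_count} failure(s) at time of ticket creation."
        ),
        # Set after creation, per the last step of the list above.
        bug_category=("Correctness", "Test Failure"),
    )


def failure_comment(run_url: str, test_output: str) -> str:
    """Format a comment with the failing run link and the output in a {code} block."""
    return f"Failing run: {run_url}\n{{code}}\n{test_output}\n{{code}}"


# "unit" and "ExampleTest" are placeholder values for illustration.
ticket = build_ticket("unit", "ExampleTest", 3)
comment = failure_comment("https://example.invalid/run/1", "AssertionError: boom")
```

Keeping the output capture inside the comment matters because the linked CI run may no longer exist when someone picks the ticket up.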

Using Butler:

Currently Butler functionality is limited to viewing the current test results and linking failures to existing JIRA tickets; the "Report selected failures" functionality does not currently work with the Apache JIRA project (as of  ). The recommended workflow as Build Lead is as follows:

  1. Check for new failures on the details page for each branch, in the bottom right under "detailed history"
  2. Look for failing tests without a JIRA link (for example, a test such as "TestCQLNodes2RF1_Upgrade_current_4_0_x_To_indev_trunk" shown with no linked ticket)
  3. For failing tests without a linked item, there are a couple of workflows depending on where the commit occurred and what type of failure it is:
    1. Single commit on trunk:
      1. If intermittent, create a new JIRA ticket w/"intermittent failure" in the summary for the failure and link it in Butler
      2. If consistent, git revert the SHA that introduced the failure, re-open the original JIRA ticket, and leave a note for the original assignee about the breakage they introduced.
    2. Commit on older LTS branch w/merge commits:
      1. If intermittent, create a new JIRA ticket w/"intermittent failure" in the summary for the failure and link it in Butler
      2. If consistent, create a new JIRA ticket for the failure, link it in Butler, and set assignee to the individual that introduced the failure and notify them in the comments in the JIRA ticket
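The decision tree in step 3 can be written down as a lookup from (branch kind, failure kind) to a list of actions. The sketch below only restates the steps above; the enum and function names are hypothetical, and the function returns descriptions rather than performing any action.

```python
from enum import Enum
from typing import List


class BranchKind(Enum):
    TRUNK_SINGLE_COMMIT = "single commit on trunk"
    OLDER_WITH_MERGES = "commit on older branch w/merge commits"


def triage_actions(branch: BranchKind, consistent: bool) -> List[str]:
    """Return the Build Lead actions for an unlinked failing test."""
    if not consistent:
        # Intermittent failures are handled the same way on either branch kind.
        return [
            'create a new JIRA ticket w/"intermittent failure" in the summary',
            "link the ticket in Butler",
        ]
    if branch is BranchKind.TRUNK_SINGLE_COMMIT:
        return [
            "git revert the SHA that introduced the failure",
            "re-open the original JIRA ticket",
            "leave a note for the original assignee",
        ]
    return [
        "create a new JIRA ticket for the failure",
        "link the ticket in Butler",
        "assign the ticket to the individual who introduced the failure",
        "notify them in the ticket comments",
    ]


actions = triage_actions(BranchKind.TRUNK_SINGLE_COMMIT, consistent=True)
```

Note that only the consistent-on-trunk case involves a revert; every other case ends with a ticket linked in Butler.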

Build infra:

  • If there are any build failures due to infra issues (say, running out of disk space on Jenkins), whether seen in the weekly CircleCI run or while checking Butler, find the existing JIRA ticket or file a new one

Notes:

  • Link failures to JIRA via the "Link selected failures" button
  • Create new failure tickets in the ASF C* JIRA.
  • CI on Jenkins runs on every commit, so for consistently failing tests (more than one failed run in Butler) it should be immediately clear which commit introduced the failure.
  • Failures reporting "Timeout occurred. Please note the time in the report does not reflect the time until the timeout" can be ignored, as they are considered test-infrastructure failures; CASSANDRA-18137 is already addressing this kind of failure.

  • [Optional] Loop failing tests locally for a number of iterations using tools/dev/ci-test-loop (PENDING CONTRIBUTION), which relies on tools/dev/ci-test (PENDING CONTRIBUTION), to determine whether the failure is consistent or intermittent. If intermittent, reflect that in the summary of the JIRA ticket created for the failure.
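As a rough illustration of how looped results might be read, the sketch below classifies a list of per-iteration pass/fail outcomes. It does not invoke tools/dev/ci-test-loop, and the thresholds are assumptions made for this example: every iteration failing counts as "consistent", a mix of passes and failures counts as "intermittent".

```python
from typing import Sequence


def classify_failure(passed: Sequence[bool]) -> str:
    """Classify looped test runs; passed[i] is True when iteration i passed."""
    if not passed:
        raise ValueError("need at least one iteration")
    failures = sum(1 for ok in passed if not ok)
    if failures == 0:
        return "not reproduced"
    if failures == len(passed):
        return "consistent"
    return "intermittent"


def ticket_summary(base_summary: str, classification: str) -> str:
    """Reflect an intermittent result in the ticket summary, as noted above."""
    if classification == "intermittent":
        return f"{base_summary} (intermittent failure)"
    return base_summary


kind = classify_failure([True, False, True, True, False])
summary = ticket_summary("Test Failures: unit ExampleTest", kind)
```

A "not reproduced" result from a short local loop does not prove the test is healthy; more iterations, or an environment closer to CI, may still expose the failure.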