The Build Lead role is inspired by the "Build Baron" role used at MongoDB (see whitepaper section 3.2 here). While their role began as a triage role for performance regressions and change point analysis, ours starts from triaging test failures and database correctness, and may evolve into a performance regression and change point triage role in the future.
Rotation
The Build Lead role is a volunteer role with weekly rotations.
Note: you will need a C* committer to open tickets with ASF Infra. If you need one, ask in #cassandra-dev on https://the-asf.slack.com.
Workflow
Weekly:
Enter: handoff w/previous build lead
Exit: handoff to next build lead
Coordinate with release manager if any releases are happening that week
Daily:
Check if there are new test failures in Butler that don't yet exist in JIRA (i.e. butler test failures w/out a JIRA link)
Create JIRA tickets for new failures and link them to the failure entries in Butler
Keep an eye out for major build infra issues; if any show up, raise them in #cassandra-dev
Optional:
Assign test failure JIRA to whomever introduced a new failing test or, if clear, broke an existing stable test
Run a hires config against trunk / other desired branches on CircleCI, confirm that tickets exist for any failures, and create tickets where none exist
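The daily check above amounts to filtering the failures shown in Butler down to those with no linked JIRA ticket. A minimal sketch of that filter follows. The ButlerFailure record and its fields are hypothetical (this page does not describe any Butler export format), and CASSANDRA-00000 and ExampleLinkedTest are placeholders rather than a real ticket or test.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ButlerFailure:
    """Hypothetical record for one failing test as shown in Butler."""
    branch: str
    test_name: str
    jira_key: Optional[str] = None  # None means no JIRA ticket is linked yet


def failures_needing_tickets(failures):
    """Return the failures that have no linked JIRA ticket."""
    return [f for f in failures if not f.jira_key]


failures = [
    ButlerFailure("trunk", "TestCQLNodes2RF1_Upgrade_current_4_0_x_To_indev_trunk"),
    # Placeholder entry: already linked, so it is filtered out.
    ButlerFailure("trunk", "ExampleLinkedTest", jira_key="CASSANDRA-00000"),
]

for f in failures_needing_tickets(failures):
    print(f"{f.branch}: {f.test_name} needs a JIRA ticket")
```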
Details
Creating JIRAs
Create a JIRA ticket with summary: "Test Failures: <suite> <class_name>"
Set component to the matching "Test/<suite>" component
Fill out description w/mention of class name and number of failures at time of ticket creation
In comments, add details of the failure w/link to the failing run, plus the output of the test captured in a JIRA {code} block, as CI results aren't preserved forever
After creation, update the ticket to Bug Category "Correctness", "Test Failure"
When we close out all failures for a test class across all branches, we close out the JIRA. If another failure comes up on that class, we can re-open.
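The ticket fields described above follow a fixed pattern, so they can be assembled mechanically. The sketch below only builds the text; it does not talk to JIRA. The function name, the dictionary keys, and the example values (suite "unit", class ExampleTest, the example.invalid URL) are illustrative assumptions, not part of any existing tooling.

```python
def build_test_failure_ticket(suite, class_name, failure_count, run_url, test_output):
    """Assemble the text fields for a test-failure ticket (illustrative only)."""
    summary = f"Test Failures: {suite} {class_name}"
    component = f"Test/{suite}"
    description = (
        f"{class_name} has {failure_count} failure(s) at the time of ticket creation."
    )
    # Wrap the captured output in a JIRA {code} block so it is still
    # readable after the CI results have expired.
    comment = f"Failing run: {run_url}\n{{code}}\n{test_output}\n{{code}}"
    return {
        "summary": summary,
        "component": component,
        "description": description,
        "comment": comment,
    }


ticket = build_test_failure_ticket(
    suite="unit",
    class_name="ExampleTest",
    failure_count=2,
    run_url="https://ci.example.invalid/run/1",  # placeholder URL
    test_output="java.lang.AssertionError: expected 1 but was 2",
)
print(ticket["summary"])
print(ticket["comment"])
```

The Bug Category ("Correctness", "Test Failure") is left out because, per the step above, it is set after the ticket has been created.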
Using Butler
Currently butler functionality is limited to viewing the current test results and linking failures to existing JIRA tickets; the "Report selected failures" functionality does not currently work with the Apache JIRA project (as of ). The recommended workflow as Build Lead is as follows:
Check for new failures on the details page for each branch in the bottom right where it says detailed history:
Look for failing tests without a JIRA link; in the following example see the top test "TestCQLNodes2RF1_Upgrade_current_4_0_x_To_indev_trunk":
For failing tests without a linked item we have a couple workflows depending on where the commit occurred as well as what type of failure it is:
Single commit on trunk:
If intermittent, create a new JIRA ticket w/"intermittent failure" in the summary for the failure and link it in Butler
If consistent, git revert the SHA that introduced the failure, re-open the original JIRA ticket, and leave a note for the original assignee about the breakage they introduced.
Commit on older LTS branch w/merge commits:
If intermittent, create a new JIRA ticket w/"intermittent failure" in the summary for the failure and link it in Butler
If consistent, create a new JIRA ticket for the failure, link it in Butler, and set assignee to the individual that introduced the failure and notify them in the comments in the JIRA ticket
Build infra:
If there are build failures due to infra issues (for example, Jenkins running out of disk space), whether seen in the weekly CircleCI run or while checking Butler, find the existing JIRA ticket or file a new one
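The commit-related workflows above reduce to a two-way decision: where the commit landed, and whether the failure is intermittent or consistent. The sketch below encodes that decision as a lookup that returns the steps to follow. The enum and function names are made up for illustration, and the step strings paraphrase the list above.

```python
from enum import Enum


class Branch(Enum):
    TRUNK_SINGLE_COMMIT = "single commit on trunk"
    OLDER_WITH_MERGES = "older LTS branch w/merge commits"


def triage_steps(branch, intermittent):
    """Return the Build Lead steps for a failing test with no linked JIRA."""
    if intermittent:
        # Same handling on every branch type.
        return [
            'create a new JIRA ticket w/"intermittent failure" in the summary',
            "link the ticket in Butler",
        ]
    if branch is Branch.TRUNK_SINGLE_COMMIT:
        return [
            "git revert the SHA that introduced the failure",
            "re-open the original JIRA ticket",
            "leave a note for the original assignee about the breakage",
        ]
    return [
        "create a new JIRA ticket for the failure",
        "link the ticket in Butler",
        "assign the ticket to the individual that introduced the failure",
        "notify them in the comments of the JIRA ticket",
    ]


for step in triage_steps(Branch.TRUNK_SINGLE_COMMIT, intermittent=False):
    print("-", step)
```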
Notes:
Link failures to JIRA via the "Link selected failures" button:
CI on Jenkins runs on every commit, so for consistently failing tests (more than one failed run in Butler) it should be immediately clear which commit introduced the failure.
Failures with "Timeout occurred. Please note the time in the report does not reflect the time until the timeout" can be ignored, as they are considered test-infrastructure failures. CASSANDRA-18137 is already tracking this kind of failure.
[Optional] Loop failing tests locally for a number of iterations using tools/dev/ci-test-loop (PENDING CONTRIBUTION), which relies on tools/dev/ci-test (PENDING CONTRIBUTION), to determine whether the failure is consistent or intermittent. If intermittent, reflect that in the summary of the JIRA ticket created for the failure.
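Since the loop scripts are still pending contribution, the idea can be approximated with a generic loop around any test command. This is a rough sketch and not the behaviour of tools/dev/ci-test-loop: the function name, the default iteration count, and the three labels are assumptions, and the example command is a placeholder that always exits non-zero rather than a real test invocation.

```python
import subprocess
import sys


def classify_failure(command, iterations=10):
    """Run `command` repeatedly and classify it by how often it fails."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    failed_runs = 0
    for _ in range(iterations):
        # A non-zero exit code is treated as a test failure.
        result = subprocess.run(command, capture_output=True)
        if result.returncode != 0:
            failed_runs += 1
    if failed_runs == 0:
        return "not reproduced"
    if failed_runs == iterations:
        return "consistent"
    return "intermittent"


# Placeholder command standing in for a real test run.
always_fails = [sys.executable, "-c", "import sys; sys.exit(1)"]
print(classify_failure(always_fails, iterations=3))
```

A "not reproduced" result after a handful of local iterations does not rule out an intermittent failure; it only means the failure did not show up in that sample.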