Build Lead

See dev mailing list thread here.

The Build Lead role is inspired by the "Build Baron" role used at MongoDB (see whitepaper section 3.2 here). Their role began as a performance regression and change point analysis triage role. Ours starts from triaging test failures and database correctness, and may evolve into a performance regression and change point triage role in the future.

Rotation

The Build Lead role is a volunteer role with weekly rotations.

Date Range (reversed)	Week	Name	Email	#cassandra-dev slack
22 May 2023 - 26 May 2023	21
15 May 2023 - 19 May 2023	20
08 May 2023 - 12 May 2023	19
01 May 2023 - 05 May 2023	18
24 Apr 2023 - 28 Apr 2023	17	Mick Semb Wever	mck@apache.org	mck
17 Apr 2023 - 21 Apr 2023	16	Dan Jatnieks	djatnieks@gmail.com	Dan Jatnieks
10 Apr 2023 - 14 Apr 2023	15
04 Apr 2023 - 07 Apr 2023	14	Josh McKenzie	jmckenzie@apache.org	jmckenzie
27 Mar 2023 - 31 Mar 2023	13	Josh McKenzie	jmckenzie@apache.org	jmckenzie
20 Mar 2023 - 24 Mar 2023	12	Mick Semb Wever	mck@apache.org	mck
13 Mar 2023 - 17 Mar 2023	11	maxwellguo	cclive1601@gmail.com	maxwellguo
06 Mar 2023 - 10 Mar 2023	10	Derek Chen-Becker	apache@chen-becker.org	dchenbecker
27 Feb 2023 - 03 Mar 2023	9	Derek Chen-Becker	apache@chen-becker.org	dchenbecker
20 Feb 2023 - 24 Feb 2023	8	Mick Semb Wever	mck@apache.org	mck
13 Feb 2023 - 17 Feb 2023	7	maxwellguo	cclive1601@gmail.com	maxwellguo
06 Feb 2023 - 10 Feb 2023	6	German Eichberger	geeichbe@microsoft.com	xgerman
30 Jan 2023 - 03 Feb 2023	5	Dan Jatnieks	djatnieks@gmail.com	Dan Jatnieks
23 Jan 2023 - 27 Jan 2023	4	Claude Warren	claude.warren@aiven.io	Claude Warren
16 Jan 2023 - 20 Jan 2023	3	Caleb Rackliffe	calebrackliffe@gmail.com	Caleb Rackliffe
09 Jan 2023 - 14 Jan 2023	2	Mick Semb Wever	mck@apache.org	mck
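
When adding rows for upcoming weeks, the date range can be derived from the week number. The sketch below assumes the Week column holds ISO week numbers and that each rotation runs Monday to Friday, which matches the rows above; the function name is hypothetical.

```python
from datetime import date, timedelta


def rotation_range(year: int, week: int) -> str:
    """Return the Monday-Friday range of an ISO week, formatted like the table."""
    monday = date.fromisocalendar(year, week, 1)  # day 1 of an ISO week is Monday
    friday = monday + timedelta(days=4)
    fmt = "%d %b %Y"  # month abbreviations assume the default C/English locale
    return f"{monday.strftime(fmt)} - {friday.strftime(fmt)}"


print(rotation_range(2023, 12))  # 20 Mar 2023 - 24 Mar 2023
print(rotation_range(2023, 21))  # 22 May 2023 - 26 May 2023
```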

See child pages for past years.

Tools

Butler: dashboard of historical test failures and per-test build history failure details w/JIRA links (see trunk here)

OpenTestFailures kanban board: board showing all labeled test failure JIRA tickets

ASF Jenkins C* CI: source data pulled by Butler

CircleCI: testing infrastructure with an optional paid tier (paying buys more parallelism; see .circleci/generate.sh for details on profiles and usage)

ASF Infra:

Status page
JIRA
Note: you will need a C* committer in order to open tickets in ASF Infra. Ping in #cassandra-dev on https://the-asf.slack.com for one if needed.

Workflow

Weekly:

Enter: handoff w/previous build lead
Exit: handoff to next build lead
Coordinate with release manager if any releases are happening that week

Daily:

Check if there are new test failures in Butler that don't yet exist in JIRA (i.e. butler test failures w/out a JIRA link)
Create JIRA tickets for new failures and link them to the failure entries in Butler
Keep an eye out for major build infra issues; if any show up, raise them in #cassandra-dev
Optional:
- Assign the test failure JIRA to whoever introduced a new failing test or, if clear, broke an existing stable test
- Run a highres config against trunk / other desired branches on CircleCI, confirm that tickets exist for any failures, and create tickets where none exist
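
The first daily check amounts to filtering failures that have no JIRA link. A minimal sketch, assuming the failures have been copied by hand from Butler into records; the `Failure` type, its fields, and the sample values are hypothetical and are not part of Butler.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Failure:
    branch: str
    test_name: str
    jira_key: Optional[str] = None  # None means Butler shows no JIRA link


def needs_ticket(failures: List[Failure]) -> List[Failure]:
    """Return the failures that still need a JIRA ticket created and linked."""
    return [f for f in failures if not f.jira_key]


# Sample data is made up for illustration only.
seen_today = [
    Failure("trunk", "ExampleSuite.testA", "CASSANDRA-00000"),
    Failure("trunk", "ExampleSuite.testB"),
]
for f in needs_ticket(seen_today):
    print(f"{f.branch}: {f.test_name} has no JIRA link")
```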

Details

Creating JIRAs

Create a JIRA ticket with summary: "Test Failures: <suite> <class_name>"
Set component to the matching "Test/<suite>" component
Fill out description w/mention of class name and number of failures at time of ticket creation
In comments, add details of the failure with a link to the failing run, plus the test output captured in a JIRA {code} block, as CI results aren't preserved forever
After creation, update the ticket to Bug Category "Correctness", "Test Failure"
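
The ticket fields above follow a fixed pattern, so they can be prepared with a few string helpers before pasting into JIRA. This is a sketch only: the helper names, the example suite and class name, and the run URL are placeholders, and nothing here calls the JIRA API.

```python
def ticket_summary(suite: str, class_name: str) -> str:
    """Summary in the form 'Test Failures: <suite> <class_name>'."""
    return f"Test Failures: {suite} {class_name}"


def ticket_component(suite: str) -> str:
    """Component in the form 'Test/<suite>'."""
    return f"Test/{suite}"


def failure_comment(run_url: str, output: str) -> str:
    """Comment body: link to the failing run plus output in a JIRA {code} block."""
    return f"Failing run: {run_url}\n{{code}}\n{output.strip()}\n{{code}}"


# Placeholder values for illustration only.
print(ticket_summary("unit", "org.example.SomeTest"))
print(ticket_component("unit"))
print(failure_comment("https://ci.example.invalid/run/1", "AssertionError: expected 1\n"))
```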

When we close out all failures for a test class across all branches, we close out the JIRA. If another failure comes up on that class, we can re-open.

Using butler:

Currently Butler functionality is limited to viewing the current test results and linking failures to existing JIRA tickets; the "Report selected failures" functionality does not work with the Apache JIRA project (as of 15 Dec 2021). The recommended workflow as Build Lead is as follows:

Check for new failures on the details page for each branch in the bottom right where it says detailed history:
Look for failing tests without a JIRA link; in the following example, see the top test "TestCQLNodes2RF1_Upgrade_current_4_0_x_To_indev_trunk":
For failing tests without a linked item we have a couple workflows depending on where the commit occurred as well as what type of failure it is:
1. Single commit on trunk:
  1. If intermittent, create a new JIRA ticket w/"intermittent failure" in the summary for the failure and link it in Butler
  2. If consistent, git revert the SHA that introduced the failure, re-open the original JIRA ticket, and leave a note for the original assignee about the breakage they introduced.
2. Commit on older LTS branch w/merge commits:
  1. If intermittent, create a new JIRA ticket w/"intermittent failure" in the summary for the failure and link it in Butler
  2. If consistent, create a new JIRA ticket for the failure, link it in Butler, and set assignee to the individual that introduced the failure and notify them in the comments in the JIRA ticket
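
The two workflows above reduce to a small decision table. The sketch below only restates those rules as a function; the function name and its boolean parameters are hypothetical.

```python
def next_action(single_commit_on_trunk: bool, consistent: bool) -> str:
    """Map (where the commit landed, failure type) to the Build Lead's next step."""
    if not consistent:
        # Intermittent failures are handled the same way in both workflows.
        return 'create a JIRA with "intermittent failure" in the summary and link it in Butler'
    if single_commit_on_trunk:
        return ("git revert the SHA that introduced the failure, re-open the "
                "original JIRA, and leave a note for the original assignee")
    return ("create a JIRA, link it in Butler, assign it to whoever introduced "
            "the failure, and notify them in the ticket comments")


print(next_action(single_commit_on_trunk=True, consistent=True))
print(next_action(single_commit_on_trunk=False, consistent=False))
```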

Build infra:

If there are any build failures due to infra issues (say, running out of disk space on Jenkins), whether seen in the weekly CircleCI run or while checking Butler, find the existing JIRA ticket or file a new one

Notes:

Link failures to JIRA via the "Link selected failures" button:
Create new failure tickets in the ASF C* JIRA.
CI on Jenkins is run on every commit, so for consistently failing tests (> 1 run failed on Butler) it should be immediately clear which commit introduced the failure.
Failures with "Timeout occurred. Please note the time in the report does not reflect the time until the timeout" can be ignored, as they are considered test-infrastructure failures. CASSANDRA-18137 is already addressing this kind of failure.
[Optional] Loop failing tests locally using tools/dev/ci-test-loop (PENDING CONTRIBUTION), which runs tools/dev/ci-test (PENDING CONTRIBUTION) for a number of iterations, to determine whether the failure is consistent or intermittent. If intermittent, reflect that in the subject of the JIRA ticket created for the failure.
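
Until those scripts are contributed, the same idea can be approximated by hand: run one test command repeatedly and count the non-zero exits. This is a rough sketch and not the behaviour of ci-test-loop; the function names are hypothetical, and the command shown is a stand-in that always fails, to be replaced with a real test invocation.

```python
import subprocess
import sys
from typing import List


def loop_test(cmd: List[str], iterations: int) -> int:
    """Run cmd `iterations` times and return how many runs exited non-zero."""
    failures = 0
    for _ in range(iterations):
        if subprocess.run(cmd, capture_output=True).returncode != 0:
            failures += 1
    return failures


def classify(failures: int, iterations: int) -> str:
    """Label the result so it can be reflected in the JIRA summary."""
    if failures == 0:
        return "not reproduced"
    return "consistent" if failures == iterations else "intermittent"


if __name__ == "__main__":
    # Stand-in command: a Python process that always exits with status 1.
    cmd = [sys.executable, "-c", "raise SystemExit(1)"]
    runs = 3
    print(classify(loop_test(cmd, runs), runs))
```

A small number of iterations can miss rare failures, so "not reproduced" locally does not show that the test is stable.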
