Status

Current state: Accepted

Discussion thread: https://lists.apache.org/thread/c446mf4bg0ocjgs6do9lltoc2f13r71m

JIRA: KAFKA-18789

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

As the Kafka community grows, so does the number of committers and the overall rate of change to our source code. In order to facilitate a high velocity of change while maintaining safety and correctness, this KIP proposes to serialize all changes made to Kafka into a single build queue. This will ensure that no commit is able to break our mainline branches and that concurrent changes do not have bad interactions with each other. The mechanism for implementing this serial build queue is GitHub Merge Queues.

Public Interfaces

No public interfaces are affected by this KIP. This is a development process change.

Proposed Changes

Current Workflow

Our current change management workflow is that contributors open change requests known as Pull Requests (PRs). These changes are then tested, approved by a committer, and finally squash merged into the base branch. When we test our PRs, we use a special "merge" reference that is provided by GitHub. This reference is equivalent to the base branch with the PR merged into it. This means the test results of our PRs should reflect what the results would be if the PR was merged.

After a PR has been tested and approved, it can be merged by a committer. One major problem with our current approach is that we do not prevent out-of-date PRs from being merged. As long as the PR is approved and "mergeable" (i.e., no conflicts), it can be merged. This indeterminate delay between testing, approval, and merging can result in problems on our mainline branchs.

In fact, problems may occur even with no delay between testing and merging. Consider two PRs which have different and non-conflicting changes that would interact badly with each other. It is possible that these PRs have a successful test run, but when merged concurrently would break the base branch.

A separate class of problem exists where an author makes a small innocuous change after a PR has been tested and approved. Technically, any change should be fully retested, but there have been circumstances when this has been skipped (intentionally or otherwise).

Merge Queue

Rather than chasing down problems in our mainline branches after the fact, this KIP proposes to leverage the Merge Queue feature of GitHub.

The merge queue effectively acts as a gatekeeper for trunk. It does so by placing all PRs into a queue rather than merging them directly. Once in the queue, a Continuous Integration (CI) job will run that applies the PR on top of the base branch and then runs a custom job. If successful, the PR is merged to the base branch and the next PR in the queue is built. If our custom job is not successful, the PR remains open and the author is notified of the failure. This process continues as long as there are pending PRs to be merged.

This controlled and sequential workflow will allow us to ensure that no change made to Kafka will break our mainline branches.

GitHub Details

When enabled, the merge queue replaces the "Squash and Merge" button that we use today. We can still do a squash merge, but the merge button becomes "Merge when ready". Clicking this button is tantamount to merging the PR to trunk. Similar to today, only committers are authorized to click the button, and only PRs which have been approved should be merged.


Figure 1: The GitHub UI for merging a Pull Request with the Merge Queue



There are a few important configurations of GitHub's Merge Queue that we must understand.

Build concurrency: The maximum number of merge_group webhooks to dispatch (between 1 and 100), throttling the total amount of concurrent CI builds. This affects the velocity of merges that a merge queue can complete.

Only merge non-failing pull requests: This setting determines how a merge queue forms groups of pull requests to be merged.

Merge limits: Select the minimum and maximum number of pull requests to merge into the base branch at the same time (between 1 and 100), and a timeout after which the queue should stop waiting for more entries and merge with fewer than the minimum number.


As Pull Requests (PRs) are enqueued, they can run the CI check concurrently to avoid blocking. These concurrent builds are still serialized in the sense that a later PR in the queue will include the earlier PRs. Setting "Build concurrency" higher than 1 will allow multiple CI runs to happen concurrently.

The Merge Limits determine how many PRs can be batched together in a single CI run. This can be useful if a lot of small changes are landing at the same time and we want to amortize the CI run time. However, increasing this value also increases the risk of a failed CI run (more things being changed at once) and also increases the impact of a failed CI run (more PRs need to be re-submitted to the queue).

The "Only merge non-failing pull requests" option can help mitigate the risks of batching PRs into one merge group assuming a build concurrency of greater than 1.

Rollout Plan

The extent of validation we can perform in the merge queue job is governed by the rate of PRs we expect to merge. Taking data from https://github.com/apache/kafka/graphs/commit-activity, we can see that over the last year we have merged around 15 commits per day. Our "Compile and Check Java" CI step is very consistently taking 11-12 minutes on trunk (with no caching). Our full test suite runs in around 2 hours on trunk (again, no caching). With the current rate of change in Kafka, we cannot reasonably run the full test suite for each PR in the merge queue. However, it would quite reasonable to perform compilation and static checks on each PR.

Since there are many unknowns with the merge queue, this KIP proposes that we enable it in a simple configuration with our "Compile and Check Java" step. This will give us the most immediate benefit of the merge queue without overly complicating things. By disabling batching and concurrency initially, we can simplify the mental model of this new change management. This will give the community a chance to adapt to this new paradigm and really see how it works. 

  • Build Concurrency: 1
  • Merge Limits: 1
  • Workflow: Compile and Check Java

Following this rollout, we can experiment with the merge group batch size and concurrency. Eventually, we may opt to run some or all tests as part of the merge queue. That remains out of scope of this KIP.

Rejected Alternatives

The only other solution to the problems listed above is to require PRs to be up-to-date before merging (a configuration option in GitHub). However, since our test suite runs for around 2 hours, this would lead to a never-ending cycle of testing and updating from the base branch. Enabling the up-to-date requirement for PRs would be a burden on contributors and our CI infrastructure.

  • No labels