Status

Current state: Under Discussion

Discussion thread: here

JIRA: KAFKA-20427

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

When using Apache Kafka's KRaft mode with dynamic quorum (kraft.version = 1, introduced in KIP-853), the cluster stores the controller voter configuration in the metadata log as VotersRecord entries. 

These records contain DNS hostnames and ports for each controller voter, together with node and directory IDs. 

Unlike static quorum mode (which uses the controller.quorum.voters configuration), dynamic quorum persists voter endpoints in the metadata topic, making them the source of truth for controllers to form the quorum. 

This design creates a critical problem when DNS infrastructure changes occur: 

  1. Controllers load the voter set from persisted metadata (snapshot + log) during startup.
  2. The voter set contains the old DNS hostnames (e.g., controller-0.old-domain.com).
  3. The old DNS names no longer resolve in the updated DNS infrastructure.
  4. Controllers cannot communicate with each other to form a quorum.
  5. The UpdateVoterRequest API (used to update voter endpoints) requires an active quorum to function.
  6. The result is a deadlock: the quorum cannot be established without updated endpoints, and the endpoints cannot be updated without a quorum.

Even when the controller.quorum.bootstrap.servers parameter is updated with the new DNS hostnames, on restart the controllers use the voter set from the persisted metadata, and there is no way to rewrite it with the new endpoints.

In such a scenario, it is not possible to recover the KRaft quorum.

The shift to dynamic quorum with KIP-853 makes split brain structurally impossible during membership changes, but on the other hand it creates recovery problems, especially in dynamic environments like Kubernetes.

When running an Apache Kafka cluster in a cloud-native environment based on Kubernetes and deployed by an operator like Strimzi, it's not uncommon to rename DNS domains. When the Kubernetes cluster's DNS configuration changes (e.g., from cluster.local to cluster-new.local, or by changing the default DNS suffix), all pod DNS names are affected. For example, controller FQDNs change from my-cluster-kafka-0.my-cluster-kafka-brokers.kafka-prod.svc.cluster.local to my-cluster-kafka-0.my-cluster-kafka-brokers.kafka-prod.svc.cluster-new.local.

Furthermore, there is no safe recovery from majority loss. For example, if 2 of 3 controllers are permanently gone, the metadata stored on those controllers (including leader epochs and ISR/ELR state) is lost. Kafka cannot recover from such metadata loss, and any re-bootstrap attempt will incur additional data loss beyond the initial metadata loss, as the system lacks the information needed to determine which data is authoritative and which replicas are safely in-sync.

Finally, recovery from a volume snapshot (e.g., a PVC in Kubernetes) is also fragile. It requires preserving at least the cluster name and the DNS domain. If either of them changes, pods get new identities (common in Kubernetes), the VotersRecord in the metadata log no longer matches reality, and there is no config file to simply edit to fix the mismatch.

The same issue doesn't arise in static quorum mode: as soon as the controller.quorum.voters parameter is updated with the new DNS hostnames and the controllers restart, the KRaft quorum is formed, because the controller.quorum.voters list is the source of truth. With dynamic quorum, recovery is still possible only if not all controllers changed their DNS hostnames and those still using the old ones can connect to each other to form the quorum. This is the case when the number of controllers affected by the DNS change is less than a majority of the controllers.

Public Interfaces

Command Line Interface Changes 

The kafka-storage.sh format command has a new option --override-voters.  

kafka-storage.sh format --config <config-file> --cluster-id <cluster-id> --initial-controllers <controller-list> --override-voters 

With the --override-voters option, when the storage is already formatted, the tool creates a new snapshot with a VotersRecord built from the required --initial-controllers option. It only allows endpoint changes (DNS hostnames and/or ports). It's not possible to update the node and directory IDs.

For example, the following command overrides the DNS hostnames of all three controllers within a cluster (it runs against controller 0):

bin/kafka-storage.sh format --cluster-id ${CLUSTER_ID} --initial-controllers "0@controller-0.new-domain.com:9093:${CONTROLLER_0_UUID},1@controller-1.new-domain.com:9094:${CONTROLLER_1_UUID},2@controller-2.new-domain.com:9095:${CONTROLLER_2_UUID}" --config /controller-0.properties --override-voters

Assuming their old hostnames are like controller-0.old-domain.com (and so on), the output will be: 



Storage directory /var/kafka/kraft-logs is already formatted. 

Override voters mode enabled, checking if VoterSet needs updating... 

Persisted VoterSet: 

VoterSet(voters=[VoterNode(voterKey=ReplicaKey(id=0, directoryId=AELswfi2T3Cml5KU3vmujQ), listeners=Endpoints(CONTROLLER://controller-0.old-domain.com:9093), supportedKRaftVersion=SupportedVersionRange(...)), VoterNode(voterKey=ReplicaKey(id=1, directoryId=7jelSzBGQyStp734JHJBgw), listeners=Endpoints(CONTROLLER://controller-1.old-domain.com:9094), supportedKRaftVersion=SupportedVersionRange(...)), VoterNode(voterKey=ReplicaKey(id=2, directoryId=PMev_mDOTDy6icOE3qrPUQ), listeners=Endpoints(CONTROLLER://controller-2.old-domain.com:9095), supportedKRaftVersion=SupportedVersionRange(...))]) 

Provided VoterSet (from --initial-controllers): 

VoterSet(voters=[VoterNode(voterKey=ReplicaKey(id=0, directoryId=AELswfi2T3Cml5KU3vmujQ), listeners=Endpoints(CONTROLLER://controller-0.new-domain.com:9093), supportedKRaftVersion=SupportedVersionRange(...)), VoterNode(voterKey=ReplicaKey(id=1, directoryId=7jelSzBGQyStp734JHJBgw), listeners=Endpoints(CONTROLLER://controller-1.new-domain.com:9094), supportedKRaftVersion=SupportedVersionRange(...)), VoterNode(voterKey=ReplicaKey(id=2, directoryId=PMev_mDOTDy6icOE3qrPUQ), listeners=Endpoints(CONTROLLER://controller-2.new-domain.com:9095), supportedKRaftVersion=SupportedVersionRange(...))]) 

Changes detected: 

VoterSetDiff{hasVoterIdChanges=false, hasDirectoryIdChanges=false, endpointChanges={0=controller-0.new-domain.com:9093, 1=controller-1.new-domain.com:9094, 2=controller-2.new-domain.com:9095}} 

Validation: PASSED (only endpoints changed, safe operation) 

Creating complete snapshot at offset 4601, epoch 9 with updated VotersRecord... 

Snapshot created: 00000000000000004601-0000000009.checkpoint 

Override complete. Kafka will load updated VoterSet on startup. 

If the provided voter set is the same as the persisted one, the output will be:


Storage directory /var/kafka/kraft-logs is already formatted. 

Override voters mode enabled, checking if VoterSet needs updating... 

... 

No changes detected (VoterSets are equivalent). Override operation skipped, already up to date.

Finally, if the changes are rejected because the user is trying to change the voter IDs, the output will be:

Storage directory /var/kafka/kraft-logs is already formatted. 

Override voters mode enabled, checking if VoterSet needs updating... 

... 

Changes detected: 

VoterSetDiff{hasVoterIdChanges=true, hasDirectoryIdChanges=false, endpointChanges={}} 

Error: --override-voters cannot be used for changing node IDs or directory IDs. 

Changes detected: 

VoterSetDiff{hasVoterIdChanges=true, hasDirectoryIdChanges=false, endpointChanges={}} 

When the storage is not formatted yet, the --override-voters option has no additional effect and the formatting runs as usual. It also has no effect if the persisted voter set is the same as the one provided via the --initial-controllers option.
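
The --initial-controllers value in the command above is a comma-separated list of id@host:port:directoryId entries. As an illustration only, the following sketch shows how such a value could be split into its parts. InitialControllersParser and its Controller record are hypothetical names introduced for this example (they are not Kafka classes), and the sketch assumes plain hostnames or IPv4 addresses and Java 16 or later.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Kafka.
final class InitialControllersParser {
    // One parsed entry of the form id@host:port:directoryId.
    record Controller(int nodeId, String host, int port, String directoryId) {}

    static List<Controller> parse(String value) {
        List<Controller> result = new ArrayList<>();
        for (String entry : value.split(",")) {
            int at = entry.indexOf('@');
            // The last colon separates the directory ID, the one before it the port.
            int lastColon = entry.lastIndexOf(':');
            int portColon = entry.lastIndexOf(':', lastColon - 1);
            if (at < 0 || portColon < at) {
                throw new IllegalArgumentException(
                    "Expected id@host:port:directoryId but got: " + entry);
            }
            result.add(new Controller(
                Integer.parseInt(entry.substring(0, at)),
                entry.substring(at + 1, portColon),
                Integer.parseInt(entry.substring(portColon + 1, lastColon)),
                entry.substring(lastColon + 1)));
        }
        return result;
    }
}
```

Parsing from the right-hand side keeps the split unambiguous because the directory ID is the last field and does not contain a colon.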

Proposed Changes

This KIP proposes adding a new --override-voters flag to the kafka-storage.sh format command.  

When the storage is already formatted and this flag is specified with --initial-controllers: 

  1. Read the complete cluster metadata state (from existing snapshots and log segments). 
  2. Compare the persisted voter set with the provided voter set from --initial-controllers. If they are the same, no action is taken. 
  3. If the voter sets are different, validate that only the controllers' endpoints (DNS hostnames and/or ports) are being changed. Changing node and directory IDs is not allowed.
  4. Create a new snapshot at the current log end offset containing: 
    1. All existing cluster metadata (features, topics, configs, ACLs, etc.). 
    2. Updated VotersRecord with new endpoint information. 
    3. All other control records (SnapshotHeader, KRaftVersion, SnapshotFooter).

On next startup, controllers load the new snapshot and use updated endpoints to reach each other and form the quorum. 

This approach: 

  • Preserves all cluster metadata. 
  • Uses standard KRaft snapshot mechanisms. 
  • Provides safety validation (rejects topology changes). 
  • Requires no changes to Kafka server code (offline tool only). 
  • Is idempotent and safe to run multiple times; it skips the formatting if the voter sets already match. 

Validation rules 

When using the --override-voters flag, the following validations are enforced: 

  • Requires --initial-controllers: If --override-voters is specified without --initial-controllers, the command fails. 
  • Only endpoint changes allowed: If node or directory IDs differ between persisted and provided voter sets, the command fails. 
  • Requires dynamic quorum: If controller.quorum.voters is configured (static quorum), the command fails. 
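
The three rules above can be read as an ordered list of checks. The sketch below is only an illustration of that ordering: OverrideVotersValidation and its boolean parameters are hypothetical, and the first two error messages are placeholders invented for the example. Only the last message is taken from the sample tool output shown earlier.

```java
import java.util.Optional;

// Hypothetical sketch of the precondition checks; names are illustrative only.
final class OverrideVotersValidation {
    // Returns an error message if the override must be rejected, empty otherwise.
    static Optional<String> check(
        boolean initialControllersProvided,
        boolean staticQuorumConfigured,
        boolean voterIdsChanged,
        boolean directoryIdsChanged
    ) {
        if (!initialControllersProvided) {
            // Placeholder wording, not actual tool output.
            return Optional.of("--override-voters requires --initial-controllers");
        }
        if (staticQuorumConfigured) {
            // Placeholder wording, not actual tool output.
            return Optional.of("--override-voters requires dynamic quorum");
        }
        if (voterIdsChanged || directoryIdsChanged) {
            return Optional.of(
                "--override-voters cannot be used for changing node IDs or directory IDs.");
        }
        return Optional.empty();
    }
}
```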

Broker considerations 

In KRaft mode, brokers act as observers and they maintain their own metadata, fetched from controllers, with snapshots and log segments containing VotersRecord entries.  

When the broker starts: 

  1. It loads the voter set from its local metadata log (snapshot + replayed segments).
  2. This voter set contains controller endpoints for KRaft quorum discovery. 
  3. It uses these endpoints to connect to the KRaft controller quorum. 
  4. The controller.quorum.bootstrap.servers configuration is not used to reach the controllers.

After DNS changes, the broker's metadata logs contain old controller endpoints. Even though controller.quorum.bootstrap.servers is updated in the configuration file, the broker ignores it and uses the stale endpoints from its metadata log, causing connection failures. 

For this reason, when DNS changes are performed, brokers also need a voter set override, using the same approach described for the controllers. This means running the formatting tool again on each broker with the --initial-controllers option, including the new voter set, together with the --override-voters flag.

This differs from the current official Kafka documentation, which states that --no-initial-controllers should be used when formatting a broker. That guidance, however, doesn't consider the problem this KIP is trying to address.

Implementation overview 

VoterSet comparison 

The comparison between the voter set persisted in the metadata and the one provided through the --initial-controllers option is computed by a new utility class, VoterSetDiff.

public class VoterSetDiff {
    private final boolean hasVoterIdChanges;
    private final boolean hasDirectoryIdChanges;
    private final Map<Integer, InetSocketAddress> endpointChanges;

    /**
     * Compare two VoterSets and return detailed diff.
     *
     * Only compares the specified listener (controller listener) since --initial-controllers
     * only creates one listener per voter.
     *
     * @param persisted The VoterSet currently persisted in the metadata log
     * @param provided The VoterSet provided via --initial-controllers
     * @param controllerListenerName The listener name to use for endpoint comparison
     * @return VoterSetDiff containing detected changes
     */
    public static VoterSetDiff compare(
        VoterSet persisted,
        VoterSet provided,
        String controllerListenerName
    ) {
        // returns the differences between the two VoterSet instances
    }
}

The comparison can detect voter ID changes, directory ID changes and endpoint changes (by comparing hostname and port).
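
To make the three kinds of change concrete, here is a minimal, self-contained sketch of such a comparison. It deliberately does not use Kafka's VoterSet class: VoterSetDiffSketch, Voter and Diff are simplified stand-ins introduced for this example, with each voter keyed by node ID and carrying a single endpoint. It assumes Java 16 or later.

```java
import java.util.Map;
import java.util.TreeMap;

// Simplified stand-in for the proposed VoterSetDiff; not the actual implementation.
final class VoterSetDiffSketch {
    record Voter(String directoryId, String host, int port) {}

    record Diff(boolean hasVoterIdChanges,
                boolean hasDirectoryIdChanges,
                Map<Integer, String> endpointChanges) {
        // True when the override would be accepted and has something to do.
        boolean isEndpointOnlyChange() {
            return !hasVoterIdChanges && !hasDirectoryIdChanges && !endpointChanges.isEmpty();
        }
    }

    static Diff compare(Map<Integer, Voter> persisted, Map<Integer, Voter> provided) {
        // Any added or removed node ID counts as a voter ID change.
        boolean idChanges = !persisted.keySet().equals(provided.keySet());
        boolean directoryChanges = false;
        Map<Integer, String> endpointChanges = new TreeMap<>();
        for (Map.Entry<Integer, Voter> entry : provided.entrySet()) {
            Voter old = persisted.get(entry.getKey());
            if (old == null) {
                continue; // already reported as a voter ID change
            }
            Voter current = entry.getValue();
            if (!old.directoryId().equals(current.directoryId())) {
                directoryChanges = true;
            }
            if (!old.host().equals(current.host()) || old.port() != current.port()) {
                endpointChanges.put(entry.getKey(), current.host() + ":" + current.port());
            }
        }
        return new Diff(idChanges, directoryChanges, endpointChanges);
    }
}
```

A diff with no ID changes and an empty endpointChanges map corresponds to the no-op case, while any ID change corresponds to the rejected case.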

Metadata State Reading 
 

The override logic within the Formatter class must build a complete MetadataImage representing the full cluster state. It goes through the following steps: 

  1. Scan the metadata directory for snapshot files (*.checkpoint). 
  2. Load latest snapshot into a MetadataDelta by using the RecordsSnapshotReader (getting both control and metadata records). 
  3. Replay log segments (*.log) with all metadata records into the MetadataDelta by using the BatchFileReader. 
  4. Together with the MetadataDelta, it tracks the last offset to be used for appending the new VotersRecord later. 
  5. Build the MetadataImage from the MetadataDelta, containing:
    1. Feature levels 
    2. Topic configurations and partitions 
    3. Broker and controller registrations 
    4. ACLs and configs 
    5. Voter set history 
    6. All other metadata records 
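
The first step above relies on the {offset}-{epoch}.checkpoint naming of snapshot files. The following sketch shows one way the latest snapshot could be picked from a directory listing. SnapshotFiles and SnapshotId are hypothetical names for this example, and treating "latest" as "highest offset" is an assumption of the sketch.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Kafka.
final class SnapshotFiles {
    record SnapshotId(long offset, int epoch) {}

    // Matches names such as 00000000000000004601-0000000009.checkpoint.
    private static final Pattern NAME = Pattern.compile("(\\d+)-(\\d+)\\.checkpoint");

    // Assumption: the latest snapshot is the one with the highest offset.
    static Optional<SnapshotId> latest(List<String> fileNames) {
        SnapshotId best = null;
        for (String name : fileNames) {
            Matcher matcher = NAME.matcher(name);
            if (!matcher.matches()) {
                continue; // log segments, lock files and other entries are ignored
            }
            SnapshotId id = new SnapshotId(
                Long.parseLong(matcher.group(1)), Integer.parseInt(matcher.group(2)));
            if (best == null || id.offset() > best.offset()) {
                best = id;
            }
        }
        return Optional.ofNullable(best);
    }
}
```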

The logic that reads the log segments takes transaction records into account as well. When it reads a BEGIN_TRANSACTION_RECORD, it marks the start of a transaction and buffers all subsequent records in memory instead of applying them immediately to the MetadataDelta, until it reads the end or the abort of the transaction:

  • If END_TRANSACTION_RECORD, the logic puts all the buffered records into the MetadataDelta, because they are confirmed and can be part of the snapshot.
  • If ABORT_TRANSACTION_RECORD, the saved records are just discarded. They won’t be part of the snapshot. 

The above is a simplified version of MetadataBatchLoader's transaction handling.
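
The buffering behaviour can be sketched as follows. This is not MetadataBatchLoader or the proposed Formatter code: TransactionReplaySketch, Rec and the OTHER type are hypothetical, and a plain list of strings stands in for the MetadataDelta. Dropping a transaction that is still open at the end of the log is an assumption of this sketch. It assumes Java 16 or later.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of transaction-aware replay; not Kafka code.
final class TransactionReplaySketch {
    enum Type { BEGIN_TRANSACTION_RECORD, END_TRANSACTION_RECORD, ABORT_TRANSACTION_RECORD, OTHER }

    record Rec(Type type, String payload) {}

    // Returns the payloads that end up applied, in order.
    static List<String> replay(List<Rec> log) {
        List<String> applied = new ArrayList<>(); // stands in for the MetadataDelta
        List<String> pending = null;              // non-null while a transaction is open
        for (Rec rec : log) {
            switch (rec.type()) {
                case BEGIN_TRANSACTION_RECORD -> pending = new ArrayList<>();
                case END_TRANSACTION_RECORD -> {
                    if (pending != null) {
                        applied.addAll(pending); // confirmed, safe to include
                    }
                    pending = null;
                }
                case ABORT_TRANSACTION_RECORD -> pending = null; // discard buffered records
                case OTHER -> (pending != null ? pending : applied).add(rec.payload());
            }
        }
        // Assumption: records of a transaction left open at the end are not applied.
        return applied;
    }
}
```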

Snapshot Creation with Updated VoterSet 

After building the MetadataImage, the Formatter class runs the logic to create and write the new snapshot with the updated VotersRecord. It goes through the following steps:

  1. Determine the log end offset and epoch from the latest log segment.
  2. Create the RecordsSnapshotWriter with the control records and add the new VoterSet.
  3. Write all the metadata records from the image.
  4. Flush the image to disk by creating the snapshot as {offset}-{epoch}.checkpoint.
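
The {offset}-{epoch}.checkpoint name in the last step is zero-padded, as the sample output 00000000000000004601-0000000009.checkpoint shows (20 digits for the offset, 10 for the epoch). A small sketch of producing that name follows. SnapshotNames is a hypothetical helper for this example, and the padding widths are inferred from that sample output.

```java
import java.util.Locale;

// Hypothetical helper for illustration; padding widths inferred from the sample output.
final class SnapshotNames {
    static String fileName(long endOffset, int epoch) {
        // Locale.ROOT keeps the digits ASCII regardless of the JVM default locale.
        return String.format(Locale.ROOT, "%020d-%010d.checkpoint", endOffset, epoch);
    }
}
```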

Safety and Locking 

Before running any of the above operations, the override logic within the Formatter acquires an exclusive lock on the .lock file in the metadata directory to prevent concurrent access from running Kafka processes.

It also runs safety validations: 

  • Reject if voter IDs differ (topology change) 
  • Reject if directory IDs differ (data loss risk) 
  • Reject if static quorum is configured 
  • Require controllers to be stopped (lock acquisition fails if running)
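
As an illustration of the locking step, the sketch below takes an exclusive lock on a .lock file with the standard java.nio API. MetadataDirLock is a hypothetical class written for this example and is not the lock utility Kafka itself uses. It is meant for a single acquisition per process, as in a command-line tool, because on some systems closing any channel on a file releases every lock the JVM holds on it.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Optional;

// Hypothetical sketch; not the lock utility Kafka itself uses.
final class MetadataDirLock implements AutoCloseable {
    private final FileChannel channel;
    private final FileLock lock;

    private MetadataDirLock(FileChannel channel, FileLock lock) {
        this.channel = channel;
        this.lock = lock;
    }

    // Returns empty if the lock is already held, for example by a running controller.
    static Optional<MetadataDirLock> tryAcquire(Path metadataDir) throws IOException {
        FileChannel channel = FileChannel.open(
            metadataDir.resolve(".lock"),
            StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        FileLock lock = null;
        try {
            lock = channel.tryLock(); // null when another process holds the lock
        } catch (OverlappingFileLockException e) {
            // The lock is already held inside this JVM.
        } finally {
            if (lock == null) {
                channel.close();
            }
        }
        return lock == null
            ? Optional.empty()
            : Optional.of(new MetadataDirLock(channel, lock));
    }

    @Override
    public void close() throws IOException {
        lock.release();
        channel.close();
    }
}
```

Using tryLock rather than the blocking lock call lets the tool fail fast with a clear error instead of waiting for a running process to exit.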

Compatibility, Deprecation, and Migration Plan

The proposed solution is fully backward compatible: 

  • No changes to Kafka server runtime code. 
  • No changes to network protocols or APIs. 
  • No changes to on-disk data formats (uses standard KRaft snapshots). 
  • No changes to existing kafka-storage.sh format behaviour when --override-voters is not used. 

There is no deprecation, because the proposal is about adding a new option to the formatting tool without removing any existing one. The static quorum remains supported as well.

The migration plan includes:  

  • Upgrade to the Kafka version that includes the new --override-voters option in the formatting tool.
  • In case of DNS infrastructure changes:
    • Stop the controllers one by one.
    • Format the storage of each controller with the voter set override.
    • Restart the controllers one by one.
  • Repeat the same process for the brokers.

Test Plan

New test cases will be added within the FormatterTest class:

  • Read metadata from snapshot and log segments.
  • Handle bootstrap transactions correctly.
  • Handle multiple voter set updates.
  • Verify that the snapshot contains all metadata.
  • Validate that only voter endpoints change, with no change to node and directory IDs.
  • Verify that all the conditions are met when using the --override-voters flag (usage of --initial-controllers and dynamic quorum).
  • Validate idempotency by overriding with the same voter set, which results in a no-op.
  • Verify that the lock mechanism prevents a voter set override while the metadata folder is locked (i.e. a controller is running).

New test cases will be added within StorageToolTest: 

  • Verify that all the conditions are met when using the --override-voters flag (usage of --initial-controllers).

Rejected Alternatives

The following alternatives were considered but rejected: 

  • Append a new VotersRecord directly to the metadata log instead of creating a snapshot. 
    • Architecturally inconsistent: Modifying an append-only log from outside violates KRaft design principles.
    • High risk of breaking metadata: Appending to the metadata log should be done only by the controller logic and not by something external.
  • Create a separate tool specifically for updating voter endpoints: 
    • Redundant tooling: kafka-storage.sh format already handles metadata initialization. 
    • Increased complexity: Operators need to learn/maintain another tool. 
    • Conceptual fit: Updating voter endpoints is essentially "reformatting" with new config. 
  • Auto re-bootstrap with an external discovery mechanism by decoupling the endpoints resolution from the Raft log entirely, so that controllers can discover each other's current addresses without needing a functioning quorum first. This could be based on using the controller.quorum.bootstrap.servers as a fallback when the VoterSet endpoints are unreachable. 
    • Loss of availability: During the rolling update needed to pick up the new controller.quorum.bootstrap.servers, some nodes (controllers and brokers) will still be using the old set of endpoints to communicate while other nodes will already be using the new set.
  • Still use the proposed --override-voters option to override the controllers' DNS hostnames and/or ports, but delete the metadata folder on the brokers so that they fetch the new metadata from the controllers, instead of applying the override to them as well:
    • In an environment where the Kafka cluster is not operated by a human but by a software operator (e.g., on Kubernetes with the Strimzi operator), the format override is the better approach because it is consistent with what is done for the controllers, instead of treating brokers differently and deleting their metadata.