Status

Current state: Discussion

Discussion thread: here 

JIRA: here 

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Apache Kafka supports rack-aware replica assignment (via broker.rack, introduced in KIP-36) and rack-aware consumer fetching (KIP-392). These features allow operators to spread replicas across fault domains such as availability zones or data centers and to reduce cross-zone consumer traffic. However, there is currently no mechanism to enforce rack diversity at produce time. The existing min.insync.replicas configuration guarantees that a minimum number of replicas have acknowledged a write before the leader confirms it to the producer, but it is entirely rack-unaware.

This creates a significant operational problem for clusters that span multiple racks or availability zones. Consider a cluster with 3 racks and a replication factor of 5. An operator who wants to guarantee that every acknowledged message exists on at least 2 different racks must set min.insync.replicas high enough to probabilistically ensure cross-rack coverage. Depending on the replica distribution, this may require min.insync.replicas=3 or even higher. Each additional required in-sync replica increases produce latency, because the leader must wait for more followers to acknowledge the write before responding to the producer.

The core issue is that min.insync.replicas conflates two distinct concerns:

  1. Durability: how many copies of the data exist.

  2. Fault domain coverage: how many distinct failure domains (racks, zones, data centers) those copies span.

Confluent Platform addresses this with observer replicas, which provide asynchronous cross-datacenter replication with a separate replication tier. However, this is a proprietary feature that is unavailable in Apache Kafka and introduces additional complexity through new replica types and replication semantics.

This KIP proposes a simpler, more targeted solution: a new configuration, min.insync.racks, that allows operators to specify the minimum number of distinct racks that must contain in-sync replicas for a produce request to be acknowledged when acks=all. This decouples the rack diversity guarantee from the replica count guarantee, allowing operators to set a lower min.insync.replicas while still ensuring cross-rack durability.

Example

A cluster spans 3 availability zones (configured as racks A, B, and C) with a replication factor of 5. The operator wants to guarantee that every acknowledged message is durable across at least 2 availability zones.

Without this KIP, the operator must set min.insync.replicas=3 (or higher) to probabilistically guarantee that at least one replica on a second rack has acknowledged. This means at least 3 replicas (the leader included) must be in sync and acknowledge each write, increasing produce latency.

With this KIP, the operator sets min.insync.replicas=2 and min.insync.racks=2. The leader only needs 2 replicas to acknowledge, but it additionally verifies that those replicas span at least 2 distinct racks. Produce latency is reduced while the cross-rack durability guarantee is explicit and deterministic rather than probabilistic.
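The replica counts in this example can be checked with a worst-case (pigeonhole) argument: an ISR of a given size is certain to span k racks only if it is larger than the combined replica count of the k-1 most heavily populated racks. The sketch below assumes the 5 replicas are placed 2/2/1 across racks A, B, and C, which the example does not state. The class and method names are hypothetical and not part of Kafka.

```java
import java.util.Arrays;

/** Illustrative helper only; not part of Kafka. */
final class RackCoverage {

    /**
     * Returns the smallest ISR size that is certain to span at least
     * `racks` distinct racks, given the number of replicas on each rack.
     * A result larger than the total replica count means the target
     * cannot be guaranteed with this placement.
     */
    static int minIsrForRacks(int[] replicasPerRack, int racks) {
        if (racks < 1 || racks > replicasPerRack.length) {
            throw new IllegalArgumentException("racks must be between 1 and the number of racks");
        }
        int[] sorted = replicasPerRack.clone();
        Arrays.sort(sorted); // ascending order
        int worstCase = 0;
        // An ISR no larger than the (racks - 1) fullest racks combined
        // could sit entirely inside those racks.
        for (int i = 0; i < racks - 1; i++) {
            worstCase += sorted[sorted.length - 1 - i];
        }
        return worstCase + 1;
    }
}
```

Under the assumed 2/2/1 placement, minIsrForRacks(new int[]{2, 2, 1}, 2) returns 3, matching the min.insync.replicas=3 figure above, while spanning all 3 racks would require 5.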

Public Interfaces

New Broker Configuration

Configuration | Description | Default | Type
--- | --- | --- | ---
min.insync.racks | When a producer sets acks to "all" (or "-1"), this configuration specifies the minimum number of distinct racks that must have in-sync replicas to acknowledge a produce request. If this minimum cannot be met, the broker will respond with NotEnoughRacksException. This works in conjunction with min.insync.replicas: both thresholds must be satisfied for a produce request to succeed. Insufficient replicas will raise NotEnoughReplicasException; insufficient rack diversity will raise NotEnoughRacksException. This ensures that acknowledged messages are durably replicated across multiple fault domains (e.g. availability zones or data centers), reducing the risk of data loss during a full rack failure. Requires broker.rack to be configured on all brokers. A value of 1 (default) disables rack-aware acknowledgement checking. | 1 | int

New Error Message

Error Message | Description
--- | ---
NOT_ENOUGH_RACKS | The number of distinct racks with in-sync replicas for the partition is less than the required minimum (min.insync.racks). The produce request cannot be satisfied until replicas on additional racks rejoin the ISR.

New Monitoring

Metric | Scope | Description
--- | --- | ---
UnderMinRackIsr | Partition | 1 when this partition's ISR spans fewer racks than min.insync.racks
AtMinRackIsr | Partition | 1 when this partition's ISR rack count equals exactly min.insync.racks
UnderMinRackIsrPartitionCount | Broker | Count of leader partitions where isUnderMinRackIsr is true
AtMinRackIsrPartitionCount | Broker | Count of leader partitions where isAtMinRackIsr is true
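As a rough illustration of how the broker-level counts relate to the per-partition conditions, the sketch below aggregates over the ISR rack counts of a broker's leader partitions. It assumes a single min.insync.racks value applies to every partition, which is a simplification. The class and method names are hypothetical and are not Kafka internals.

```java
import java.util.Collection;

/** Illustrative only; mirrors the metric definitions in the table above. */
final class RackIsrMetrics {

    // Per-partition condition behind UnderMinRackIsr.
    static boolean isUnderMinRackIsr(int isrRackCount, int minInsyncRacks) {
        return isrRackCount < minInsyncRacks;
    }

    // Per-partition condition behind AtMinRackIsr.
    static boolean isAtMinRackIsr(int isrRackCount, int minInsyncRacks) {
        return isrRackCount == minInsyncRacks;
    }

    // Broker-level count: leader partitions whose ISR spans too few racks.
    static long underMinRackIsrPartitionCount(Collection<Integer> leaderIsrRackCounts,
                                              int minInsyncRacks) {
        return leaderIsrRackCounts.stream()
                .filter(count -> isUnderMinRackIsr(count, minInsyncRacks))
                .count();
    }

    // Broker-level count: leader partitions with no rack headroom left.
    static long atMinRackIsrPartitionCount(Collection<Integer> leaderIsrRackCounts,
                                           int minInsyncRacks) {
        return leaderIsrRackCounts.stream()
                .filter(count -> isAtMinRackIsr(count, minInsyncRacks))
                .count();
    }
}
```

A partition at the minimum is still writable but has no headroom, so one more rack loss moves it into the under-minimum count.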

Proposed Changes

Overview

The proposed change modifies the produce acknowledgement path in the broker to add a rack diversity check alongside the existing min.insync.replicas check. The change is minimal and confined to the ISR validation logic in the ReplicaManager.

ISR Rack Tracking

The broker already knows the rack assignment of every broker via the broker.rack configuration, and it already maintains the ISR set for every partition. The proposed change adds a lightweight computation on the produce ack path: when evaluating whether a partition’s ISR satisfies the produce requirements, the broker groups the current ISR members by their broker.rack value and counts the number of distinct racks.

This computation is O(n) where n is the size of the ISR (typically 3–5 members) and involves no network calls, no new metadata, and no changes to the ISR protocol. It is a local, in-memory check that is only performed when min.insync.racks is configured above 1, ensuring zero additional overhead on the produce path for clusters that do not use this feature.

Produce Acknowledgement Logic

The current produce acknowledgement logic (simplified) is:

if (acks == ALL) {
    if (isr.size() < min.insync.replicas) {
        throw NotEnoughReplicasException
    }
    // wait for ISR to ack
}



The proposed change extends this to:

if (acks == ALL) {
    if (isr.size() < min.insync.replicas) {
        throw NotEnoughReplicasException
    }
    if (min.insync.racks > 1) {
        int isrRackCount = countDistinctRacks(isr)
        if (isrRackCount < min.insync.racks) {
            throw NotEnoughRacksException
        }
    }
    // wait for ISR to ack
}



The countDistinctRacks function maps each ISR member to its broker.rack and returns the number of unique values. It runs only when min.insync.racks is greater than 1, so clusters that do not enable rack-aware acknowledgement pay no extra cost on the produce path. Any ISR member without broker.rack configured is treated as belonging to a single unnamed rack (empty string), so all such brokers count as one rack for the purposes of this check.
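A minimal Java sketch of the extended check follows. It is not the broker implementation: the Result enum, the checkIsr method, and the broker-ID-to-rack map are hypothetical stand-ins, and the two exceptions are modeled as enum values so that the example is self-contained.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative only; names and types are assumptions, not Kafka internals. */
final class RackAwareAckCheck {

    enum Result { OK, NOT_ENOUGH_REPLICAS, NOT_ENOUGH_RACKS }

    // Maps each ISR member to its rack and counts the unique values.
    // Brokers with no known rack all fall into one unnamed rack ("").
    static int countDistinctRacks(Set<Integer> isr, Map<Integer, String> rackByBroker) {
        Set<String> racks = new HashSet<>();
        for (int brokerId : isr) {
            racks.add(rackByBroker.getOrDefault(brokerId, ""));
        }
        return racks.size();
    }

    // Applies both thresholds in the order used by the pseudocode above.
    static Result checkIsr(Set<Integer> isr, Map<Integer, String> rackByBroker,
                           int minInsyncReplicas, int minInsyncRacks) {
        if (isr.size() < minInsyncReplicas) {
            return Result.NOT_ENOUGH_REPLICAS;
        }
        // The rack count is skipped entirely when the feature is disabled.
        if (minInsyncRacks > 1
                && countDistinctRacks(isr, rackByBroker) < minInsyncRacks) {
            return Result.NOT_ENOUGH_RACKS;
        }
        return Result.OK;
    }
}
```

With min.insync.replicas=2 and min.insync.racks=2, an ISR of two brokers on the same rack passes the replica check but fails the rack check, whereas an ISR of two brokers on different racks passes both.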

Validation

On broker startup and on dynamic configuration changes, the broker will validate:

  • If min.insync.racks > 1, then broker.rack must be configured on the local broker. If not, the broker will log an error and refuse to start (or reject the dynamic config change).

  • min.insync.racks should be <= the number of distinct racks in the cluster as reported by the cluster metadata. If the value exceeds the number of known racks, the broker will log a warning but will still start, as racks may be added later.

  • min.insync.racks must be >= 1.
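The validation rules above could be sketched as follows. The Outcome enum and validate method are hypothetical names. The order in which the rules are evaluated, and skipping the rack checks entirely when the value is 1, are assumptions rather than something this KIP specifies.

```java
import java.util.Optional;

/** Illustrative only; not the broker's actual validation code. */
final class MinInsyncRacksValidation {

    enum Outcome { OK, WARN_EXCEEDS_KNOWN_RACKS, REJECT_BELOW_ONE, REJECT_LOCAL_RACK_MISSING }

    static Outcome validate(int minInsyncRacks, Optional<String> localRack, int knownRackCount) {
        if (minInsyncRacks < 1) {
            return Outcome.REJECT_BELOW_ONE;
        }
        if (minInsyncRacks == 1) {
            // Assumption: the feature is disabled, so no rack checks apply.
            return Outcome.OK;
        }
        boolean rackSet = localRack.filter(rack -> !rack.isEmpty()).isPresent();
        if (!rackSet) {
            // Startup fails, or the dynamic config change is rejected.
            return Outcome.REJECT_LOCAL_RACK_MISSING;
        }
        if (minInsyncRacks > knownRackCount) {
            // Logged as a warning only, since racks may be added later.
            return Outcome.WARN_EXCEEDS_KNOWN_RACKS;
        }
        return Outcome.OK;
    }
}
```

Only the two REJECT outcomes block startup or a dynamic update; the warning leaves the configuration in place.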

Rack Outage Handling

During a rack outage, the ISR for affected partitions will shrink and may no longer span the required number of racks. In this scenario, produce requests with acks=all will be rejected with NotEnoughRacksException. This is distinct from NotEnoughReplicasException, which is returned when the ISR drops below min.insync.replicas. The separate error codes allow operators and monitoring systems to immediately distinguish between these two failure modes and respond accordingly.

Rejecting writes instead of automatically relaxing the rack requirement is a deliberate design choice that maintains consistency with the existing min.insync.replicas behaviour.

The recommended operational procedure during a rack outage is:

  1. Monitoring: Use the proposed UnderMinRackIsr and UnderMinRackIsrPartitionCount metrics to detect when rack diversity drops below the configured threshold. Alert on these metrics.

  2. Assessment: The operator evaluates the outage duration and risk. For transient failures (minutes), no action may be needed. For extended outages, the operator proceeds to step 3.

  3. Dynamic reconfiguration: The operator explicitly lowers min.insync.racks (for example to 1) using the Kafka Admin API or the kafka-configs.sh tool. This is a conscious decision to trade rack-level durability for availability. For example: kafka-configs.sh --bootstrap-server localhost:9092 --alter --topic my-topic --add-config min.insync.racks=1

  4. Recovery: When the rack is restored and ISR membership recovers, the operator restores the original min.insync.racks value.

This approach mirrors the existing operational model for min.insync.replicas (which also does not auto-degrade) and can be automated with external tooling if desired. An automatic timeout-based degradation mechanism was considered and is documented in Rejected Alternatives.

Compatibility, Deprecation, and Migration Plan

This change is fully backward compatible. The new min.insync.racks configuration defaults to 1, which disables rack-aware acknowledgement checking. Existing clusters will see no change in behaviour until an operator explicitly sets min.insync.racks > 1.

A new error code, NOT_ENOUGH_RACKS, is introduced and surfaced to clients as NotEnoughRacksException. This exception extends RetriableException, the same parent class as NotEnoughReplicasException. Existing clients that handle RetriableException or unknown error codes will continue to function correctly. No changes to the produce request format are required. The new error code is returned in the standard produce response error field. Clients that wish to distinguish between insufficient replicas and insufficient rack diversity can add handling for the new error code, but this is optional.

No deprecations are introduced. The existing min.insync.replicas configuration continues to function exactly as it does today. The two configurations are complementary and both must be satisfied for a produce request to succeed.

Test Plan

System Tests

System tests will deploy a multi-broker cluster across 3 simulated racks and verify:

  • Produces with acks=all succeed when ISR spans the required number of racks.

  • Produces with acks=all are rejected with NotEnoughRacksException when an entire rack is shut down and the ISR no longer meets min.insync.racks.

  • Dynamically lowering min.insync.racks to 1 during a rack outage allows produces to resume without a broker restart.

  • Both min.insync.replicas and min.insync.racks must be satisfied independently; failing either threshold rejects the produce with the correct error type.

  • Clusters with min.insync.racks=1 (default) exhibit no change in behaviour compared to a cluster without this feature.

  • Restoring the downed rack allows ISR to recover, and produces succeed again after restoring the original min.insync.racks value.

Rejected Alternatives

Implement Observer Nodes

Confluent Platform’s observer replicas provide a more comprehensive solution for cross-datacenter replication, including asynchronous replication tiers and separate replica types. While powerful, this approach introduces significant complexity (new replica types, new replication semantics, changes to the ISR protocol) and is not available in open-source Apache Kafka. This KIP provides a simpler, more targeted solution that stays within the existing ISR model and addresses the most common use case: ensuring cross-rack durability without inflating min.insync.replicas.

Change Replica Placement Behaviour

Consideration was given to changing replica placement to enforce distribution across configured racks. This was rejected because it is a significant change to existing behaviour and does not address the risk that replicas could later be moved manually into a non-resilient configuration.

