Status

Current state: Under Discussion

Co-author: Satish Duggana

Discussion thread: here

JIRA: here

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

With KIP-405: Kafka Tiered Storage, inactive segments are offloaded to remote object storage.
Currently, consumers still read all data from leader replicas, which causes:

  • High I/O on leader brokers
  • Increased local storage requirements
  • Slower recovery times for rehydrated segments
  • Contention between real-time and historical consumers
  • Broker performance degradation due to extended cold reads (KAFKA-7504)

Goal: Introduce a lightweight broker role, Remote Read Replica (RRR), dedicated to serving historical reads directly from remote storage.
This separates hot-path brokers (leaders/ISR) from cold-path brokers (RRRs), improving scalability, cost efficiency, and performance.

Additionally, RRRs can be deployed across multiple Availability Zones (AZs), allowing consumers to connect to the nearest AZ’s RRR. This reduces network latency and cross-AZ data transfer costs.

Proposal Summary

Remote Read Replica (RRR) Properties

Property              Value
--------------------  ---------------------------------------------------
Replication           Not in ISR or leader election
Local log segments    Minimal, metadata only
Data source           Remote tiered storage only
Use cases             Historical reads, analytics, observability queries
Scaling               Horizontal, stateless, autoscalable
Durability            Read-only, no replication responsibility

RRRs reduce load on main brokers and provide an elastic read tier for older data.

Public Interfaces

Proposed Changes

Architecture Overview 

  • Consumers connect to the nearest AZ’s RRR to minimize latency and cross-AZ cost.
  • Main brokers handle writes and hot reads; RRRs serve historical reads exclusively from remote storage.

Workflow:

  1. Consumer requests recent offsets → served by main brokers.
  2. Consumer requests historical offsets → routed to nearest RRR.
  3. RRR fetches segments from remote storage, optionally caches them, and streams them to consumers (see the fetch sketch after this list).
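
Step 3 maps naturally onto the plugin interface that KIP-405 already defines. The sketch below reads segment bytes through RemoteStorageManager; how the RRR resolves a fetch offset to a RemoteLogSegmentMetadata and a byte position (normally via the RemoteLogMetadataManager and the offset index) is elided, and the class name is illustrative only.

```java
import java.io.InputStream;

import org.apache.kafka.server.log.remote.storage.RemoteLogSegmentMetadata;
import org.apache.kafka.server.log.remote.storage.RemoteStorageException;
import org.apache.kafka.server.log.remote.storage.RemoteStorageManager;

// Illustrative RRR read path on top of the existing KIP-405 plugin API.
public class RemoteReadReplicaFetcher {

    private final RemoteStorageManager rsm;

    public RemoteReadReplicaFetcher(RemoteStorageManager rsm) {
        this.rsm = rsm;
    }

    // startPosition is the byte position within the segment, resolved from the
    // segment's offset index (rsm.fetchIndex(segment, IndexType.OFFSET));
    // that resolution step is omitted here.
    public InputStream read(RemoteLogSegmentMetadata segment, int startPosition)
            throws RemoteStorageException {
        // Stream segment bytes directly from the configured object store;
        // the RRR never materializes the full segment on local disk.
        return rsm.fetchLogSegment(segment, startPosition);
    }
}
```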

How Remote Read Replicas Work

Broker Responsibilities

  • Maintain partition and segment metadata only
  • Fetch segment indexes from remote storage
  • Stream records to consumers
  • Optionally cache frequently accessed segments locally (prefetching segments; see the cache sketch after this list)
  • Do not participate in ISR or become leader
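
One way to realize the optional cache is a bounded LRU map keyed by remote segment id, as sketched below. Names are illustrative; sizing, eviction to disk, and the prefetch policy are deliberately left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative bounded LRU cache for locally materialized remote segments,
// e.g. SegmentCache<RemoteLogSegmentId, Path>.
public class SegmentCache<K, V> extends LinkedHashMap<K, V> {

    private final int maxSegments;

    public SegmentCache(int maxSegments) {
        super(16, 0.75f, true); // access-order, so iteration order is LRU
        this.maxSegments = maxSegments;
    }

    // Evict the least recently used segment once the bound is exceeded.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxSegments;
    }
}
```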

Consumer Routing

Option A: Client-side routing

  • Consumer chooses RRR vs. main broker based on offset age or timestamp (sketched below)
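
A minimal client-side sketch, assuming the RRR fleet is reachable through its own bootstrap endpoint and that "historical" is defined by a deployment-chosen age threshold; both are assumptions of this sketch, not existing client features.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Illustrative client-side routing: pick the bootstrap endpoint based on
// how old the data to be read is.
public class RoutingConsumerFactory {

    // Hypothetical cut-off between "hot" and "historical" reads.
    private static final Duration HISTORICAL_THRESHOLD = Duration.ofHours(24);

    public static KafkaConsumer<byte[], byte[]> create(Instant targetTimestamp,
                                                       String mainBootstrap,
                                                       String rrrBootstrap) {
        boolean historical =
                targetTimestamp.isBefore(Instant.now().minus(HISTORICAL_THRESHOLD));
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,
                historical ? rrrBootstrap : mainBootstrap);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        return new KafkaConsumer<>(props);
    }
}
```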

Option B: Server-side routing (Preferred)

  • Main broker redirects historical read requests to RRRs (see the sketch below)
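
One candidate redirect mechanism, assumed here rather than specified by this KIP, is to reuse the preferred-read-replica hint that KIP-392 added to the fetch response: the main broker answers a historical fetch with the id of a nearby RRR, and the consumer re-issues the fetch against that node. A sketch of the broker-side decision:

```java
import java.util.Optional;

// Illustrative broker-side routing decision. localLogStartOffset is the first
// offset still held on local disk, as tracked under tiered storage (KIP-405).
public class FetchRedirector {

    private final long localLogStartOffset;
    private final int remoteReadReplicaId; // broker id of a nearby RRR (hypothetical wiring)

    public FetchRedirector(long localLogStartOffset, int remoteReadReplicaId) {
        this.localLogStartOffset = localLogStartOffset;
        this.remoteReadReplicaId = remoteReadReplicaId;
    }

    // Empty means "serve locally"; a value means "re-fetch from this node".
    public Optional<Integer> preferredReadReplica(long fetchOffset) {
        return fetchOffset < localLogStartOffset
                ? Optional.of(remoteReadReplicaId)
                : Optional.empty();
    }
}
```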

Pros and Cons

Pros

  • Removes historical read load from main brokers
  • Scalable independently of ISR brokers
  • Lower local storage and instance costs
  • Consumers connect to nearest AZ RRR → reduces latency and cross-AZ cost
  • No risk to replication or leader election

Cons

  • Cold reads may have higher latency due to remote storage
  • Consumer routing adds complexity
  • Increased bandwidth usage for remote storage reads
  • Requires operating a separate fleet of brokers

Failure Scenarios

  • RRR node failure: RRRs are stateless, so clients retry against other RRRs or the main brokers; no ISR impact
  • Remote storage latency spike: slower historical reads; main brokers are unaffected

Compatibility, Deprecation, and Migration Plan

Test Plan

Describe in a few sentences how the KIP will be tested. We are mostly interested in system tests (since unit tests are specific to implementation details). How will we know that the implementation works as expected? How will we know nothing broke?

Rejected Alternatives

If there are alternative ways of accomplishing the same thing, what were they? The purpose of this section is to motivate why the design is the way it is and not some other way.
