Authors: Greg Harris, Ivan Yurchenko, Jorge Quilcate, Giuseppe Lillo, Anatolii Popov, Juha Mynttinen, Josep Prat, Filip Yonov
Status
Current state: "Under Discussion"
Discussion thread: here
JIRA: KAFKA-19161
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
Background
The Apache Kafka protocol has become a successful base for building streaming applications, and has attracted workloads that push the Apache Kafka implementation to new limits. The Apache Kafka implementation is designed around low-durability block storage and direct replication, and provides strong consistency and high durability backed by commodity hardware.
Currently, Apache Kafka is often operated in cloud hyperscaler environments where high-reliability object storage is available and more cost-effective than block storage for equivalent workloads. The existing Tiered Storage feature (KIP-405) provides the capability to use object storage for inactive segments, and has seen widespread adoption. However, Tiered Storage does not remove the need for replication of active segments, which is the most substantial infrastructure cost for Apache Kafka operators on hyperscalers today.
Multiple protocol-compatible alternatives to Apache Kafka now use object storage to fully replace direct replication and substantially lower the cost to operate a cluster on a hyperscaler cloud. These alternatives are finding market success and their adoption is rising, showing a general market interest in this optimization.
Motivating Question
Should the Apache Kafka implementation pursue the object storage optimization present in all of these alternatives?
Yes. The Apache Kafka reference implementation should incorporate this innovation and provide the capability to replace block storage with object storage.
New Capabilities
Diskless Topics will allow Apache Kafka operators on hyperscalers to:
- Eliminate inter-zone data transfer costs from replication
- Eliminate inter-zone ingress and egress costs for data from producers and to consumers
- Permit multi-region active-active topics with automatic failover
Diskless Topics will allow all Apache Kafka operators to:
- Write through to object storage, avoiding local disk usage
- Pick pluggable commodity storage backends based on their environment
- Balance traffic among brokers and eliminate broker hotspots with per-client granularity
- Upgrade and scale clusters without moving active segments or electing leaders
- Trade off cost optimization and latency on a per-topic basis
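As a sketch of what per-topic opt-in might look like: it would likely follow the existing topic-level configuration pattern. The `diskless.enable` property name below is purely illustrative and is not defined by this KIP; the actual configuration surface is specified in the follow-up KIPs.

```shell
# Hypothetical example only -- the "diskless.enable" property name is
# illustrative; actual configs are defined in the follow-up KIPs.
bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic telemetry \
  --config diskless.enable=true
```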
By incorporating this feature within Apache Kafka specifically:
- More operators will have access to this feature under the Apache 2.0 license
- The community can maintain the feature, reducing dependence on vendors
- Protocol changes for further optimizations, such as producer rack-awareness, become possible
- The Apache Kafka implementation can maintain or grow its market share, and avoid obsolescence
With Diskless Topics, Apache Kafka will become a streaming engine that supports a wide spectrum of latencies, balancing cost and performance for an extremely diverse set of workloads.
Disk usage and lack thereof
It's important to clarify what exactly "diskless" means. "Diskless" primarily refers to not using broker disk to store user data; likewise, no index files are kept on broker disk for diskless topics. However, diskless topics still require some broker disk usage, particularly:
- normal topic KRaft metadata;
- batch metadata may be stored on broker disk depending on the batch coordinator implementation (e.g. in a Kafka topic);
- brokers may require some limited amount of disk space to perform certain operations like object compaction;
- caching in the read path may optionally use broker disk instead of memory.
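To make the batch metadata mentioned above concrete, here is a minimal sketch of the kind of record a batch coordinator might keep: a mapping from a topic-partition's logical offsets to a byte range inside a shared object in object storage. All field names here are hypothetical; the real schema belongs to the batch coordinator follow-up KIPs (e.g. KIP-1164).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchMetadata:
    """Illustrative per-batch record a coordinator might track.

    Field names are hypothetical, not the schema defined by the
    follow-up KIPs.
    """
    topic: str
    partition: int
    base_offset: int   # first logical offset assigned to the batch
    record_count: int  # number of records in the batch
    object_key: str    # object-store key of the shared object
    byte_offset: int   # position of the batch inside that object
    byte_length: int   # size of the batch in bytes

    @property
    def last_offset(self) -> int:
        return self.base_offset + self.record_count - 1


# Example: resolving a fetch offset to the object holding its batch.
batches = [
    BatchMetadata("clicks", 0, 0, 100, "obj-a", 0, 4096),
    BatchMetadata("clicks", 0, 100, 50, "obj-b", 512, 2048),
]


def find_batch(offset: int):
    """Return the batch covering the given logical offset, if any."""
    for b in batches:
        if b.base_offset <= offset <= b.last_offset:
            return b
    return None
```

The point of the sketch is that broker disk is no longer in the read path for user data: serving a fetch becomes a metadata lookup followed by a ranged read from object storage.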
Proposed Changes
This KIP will not require any changes to the codebase or documentation upon acceptance. By accepting this KIP, we will come to a consensus on the need for this feature, and its end-user requirements, but not any specific implementation details.
For details on the planned implementation, please see the integral follow-up KIPs:
- KIP-1163: Diskless Core
- KIP-1165: Object Compaction for Diskless
- KIP-1164: Topic Based Batch Coordinator
- KIP-1181: Metadata Rack Awareness for Diskless Topics
- KIP-F: Cache Strategy
- KIP-O: Garbage collection for Diskless objects
KIPs without a number will be published in the following weeks.
Each of these KIPs will have its own discussion and voting. Effort should be focused on this KIP first; only after the community has generally agreed that this feature is wanted should the particular implementation be designed. These KIPs will influence one another, and together they constitute the minimum viable form of this feature.
Further Work
In addition to the minimum viable implementation described in the integral KIPs above, below are some optional follow-ups. These are features which are not critical to the core functionality, but are natural extensions, further optimizations, and new innovations which are unlocked once the core functionality is in place.
- Broker Roles: Specializing brokers between produce/consume/coordination/compaction operations and permitting heterogeneous Kafka clusters
- Parallel Produce Handling: Processing multiple Produce requests concurrently, increasing potential throughput in high-latency environments
- Transactions on Diskless Topics: Including Diskless Topics in Exactly-Once Semantics Workloads
- Iceberg Format: Allowing massively parallel processing of at-rest topic data. This work enables a pluggable storage interface where one can innovate in the log format layer independently
- Dynamically Enabled Diskless: Allowing easy migrations to try out and revert Diskless
- Unification/Relationship with Tiered Storage: Identifying a long-term vision for Diskless and Tiered Storage plugins
These components are less defined and currently don’t have KIPs attached. Contributions are welcome, either to suggest other extensions or to design one of those listed above. Design, discussion, and voting on these are expected to begin after the integral KIPs are complete.
Public Interfaces
A public interface is any change to the following:
- Binary log format
- The network protocol and api behavior
- Any class in the public packages under clients
- Configuration, especially client configuration
- org/apache/kafka/common/serialization
- org/apache/kafka/common
- org/apache/kafka/common/errors
- org/apache/kafka/clients/producer
- org/apache/kafka/clients/consumer (eventually, once stable)
- Monitoring
- Command line tools and arguments
- Anything else that will likely break existing users in some way when they upgrade
This KIP does not propose any new public interfaces.
Compatibility, Deprecation, and Migration Plan
This will be a backwards-compatible upgrade for existing Kafka clusters. Specific compatibility, deprecation, and migration details will be covered in the follow-up KIPs.
Test Plan
There will be no modifications to existing tests for this KIP. Follow-up KIPs will have specific test plans.
Documentation Plan
There will be no modifications to the documentation for this KIP. Follow-up KIPs will have specific plans for changes to documentation.
Rejected Alternatives
Drop support for non-Diskless topics
The current topic implementation is still appropriate for low-latency use cases, and Diskless topics are not always a suitable replacement. Dropping it would also be a backwards-incompatible change, and one that would limit how widely this new feature could be rolled out.
Additionally, all existing functionality in Kafka depends on the existing topic implementation, including KRaft metadata, consumer offsets and group coordination, transactions, and share groups. Dropping support for non-Diskless topics would require significant additional scope to port this existing functionality.
Enable Diskless on a per-cluster basis instead of per-topic
For users, a cluster is an administrative boundary with a unified resource namespace, permissions system, and physical deployment. Users within one administrative boundary may have distinct performance requirements and wish to choose different underlying storage parameters. This mirrors the existing per-topic configurability (retention, segment rolling, etc.).
Do Nothing
As time progresses, this will become the single most substantial feature missing from the upstream implementation. It will drive high-scale and cloud users to Apache Kafka alternatives, which will grow in total market share. This will further fragment the control that Apache Kafka has over the Kafka protocol, and Kafka may lose its mandate over the protocol entirely. That may lead to needing to coordinate with forks for new functionality, a proliferation of hard protocol forks, or re-centralization of the protocol under a standards organization. We should take steps now to avoid or delay this outcome.