This wiki space hosts technical documentation, how-to guides, design documents/RFCs, and roadmap/community pages for Apache Hudi contributors.


If you are looking for documentation on using Apache Hudi, please visit the project site or engage with our community.

Technical documentation

How-to blogs

  1. How to manually register Hudi tables into Hive via Beeline? 
  2. Ingesting Database changes via Sqoop/Hudi
  3. De-Duping Kafka Events With Hudi DeltaStreamer

Design documents/RFCs

RFCs are the way to propose large changes to Hudi, and the RFC Process details how to drive one from proposal to completion. Anyone can initiate an RFC. If you are unsure whether a feature already exists, or whether there is already a plan to implement a similar one, please start a discussion thread on the dev mailing list before initiating an RFC. This gives everyone the right context and makes the best use of everyone's time.

The list of RFCs is maintained as child pages of the RFC Process page.

Community Management

Roadmap

Below is a tentative roadmap for 2021 (in no particular order; ordering is determined by the Release Management process).

Integrations

  1. Spark SQL with Merge/Delete statement support (RFC-25: Spark SQL Extension For Hudi); a syntax sketch follows this list
  2. Trino integration with support for querying/writing Hudi tables using SQL statements
  3. Kinesis/Pulsar integrations with DeltaStreamer
  4. Kafka Connect Sink for Hudi
  5. Dremio integration
  6. Interop with other table formats
  7. ORC support
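
To make item 1 concrete, here is a minimal sketch (Scala) of what RFC-25-style SQL could look like against a Hudi table from Spark. The table/column names are hypothetical, and the session extension class and final syntax are whatever the RFC lands on.

    import org.apache.spark.sql.SparkSession

    // Hypothetical RFC-25 style Spark SQL against a Hudi table; names are made up.
    // The SQL extension class is assumed from the RFC proposal.
    val spark = SparkSession.builder()
      .appName("hudi-merge-sketch")
      .config("spark.sql.extensions",
        "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
      .getOrCreate()

    // Upsert changes from a source table into the Hudi table.
    spark.sql("""
      MERGE INTO hudi_trips AS target
      USING trip_updates AS source
      ON target.trip_id = source.trip_id
      WHEN MATCHED THEN UPDATE SET *
      WHEN NOT MATCHED THEN INSERT *
    """)

    // Deletes would follow the same SQL-first pattern.
    spark.sql("DELETE FROM hudi_trips WHERE trip_status = 'cancelled'")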

Writing

  • Indexing
    • MetadataIndex implementation that serves bloom filters/key ranges from the metadata table, to speed up bloom index on cloud storage
    • Addition of record-level indexes for fast CDC (RFC-08 Record level indexing mechanisms for Hudi datasets)
    • Range index to maintain column/field value ranges, to help file skipping for query performance
    • Addition of more auxiliary indexing structures, e.g. bitmaps
    • Global/hash-based index for faster point-in-time lookups
  • Concurrency Control
    • Addition of optimistic concurrency control, with pluggable locking services (a configuration sketch follows this list)
    • Clustering implementation that is non-blocking w.r.t. updates
    • Multi-writer support with fully non-blocking, log-based concurrency control
    • Multi-table transactions
  • Performance
    • Integrate row writer with all Hudi write operations
  • Self Managing
    • Clustering based on historical workload trends
    • On-the-fly data locality at write time (HUDI-1628)
    • Automatic determination of compression ratio
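
For the optimistic concurrency control item above, a minimal sketch (Scala) of what a multi-writer configuration could look like, using the ZooKeeper-based lock provider. Config keys follow the 0.8.x line; the table name, ZooKeeper host, and paths are placeholders, so verify against the release you run.

    import org.apache.spark.sql.{DataFrame, SaveMode}

    // Sketch: enable optimistic concurrency control with a pluggable lock provider.
    // Keys follow the 0.8.x line; `df`, table name, and hosts/paths are placeholders.
    def writeWithOcc(df: DataFrame, basePath: String): Unit = {
      df.write.format("hudi")
        .option("hoodie.table.name", "hudi_trips")
        .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control")
        // Lazy cleaning, so files from a failed concurrent writer are cleaned later.
        .option("hoodie.cleaner.policy.failed.writes", "LAZY")
        .option("hoodie.write.lock.provider",
          "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider")
        .option("hoodie.write.lock.zookeeper.url", "zk-host")
        .option("hoodie.write.lock.zookeeper.port", "2181")
        .option("hoodie.write.lock.zookeeper.lock_key", "hudi_trips")
        .option("hoodie.write.lock.zookeeper.base_path", "/hudi/locks")
        .mode(SaveMode.Append)
        .save(basePath)
    }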

Querying

  • Performance
    • Complete integration with the metadata table
    • Realtime view performance/memory footprint reduction
  • PrestoDB
    • Incremental query support on Presto
  • Hive
    • Storage handler to leverage the metadata table for partition pruning
  • Spark SQL
    • Hardening incremental pull via the realtime view (an incremental-read sketch follows this list)
    • Spark datasource redesign around the metadata table
    • Streaming ETL via Structured Streaming
  • Flink
    • Support for end-to-end streaming ETL pipelines
    • Materialized view support via Flink/Calcite SQL
  • Mutable, Columnar Cache Service
    • File-group-level caching to enable real-time analytics (backed by Arrow/AresDB)
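
As background for the incremental-pull and datasource items above, here is a minimal sketch (Scala) of today's incremental query path through the Spark datasource, which these items aim to harden and redesign. Option keys are from recent releases; the begin instant and path are placeholders.

    import org.apache.spark.sql.SparkSession

    // Sketch: incremental pull via the Spark datasource. Reads only records
    // committed after the given instant; instant and path are placeholders.
    val spark = SparkSession.builder().appName("hudi-incremental-sketch").getOrCreate()
    val delta = spark.read.format("hudi")
      .option("hoodie.datasource.query.type", "incremental")
      .option("hoodie.datasource.read.begin.instanttime", "20210401000000")
      .load("/data/hudi_trips")
    delta.createOrReplaceTempView("trips_delta")
    spark.sql("SELECT count(*) FROM trips_delta").show()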

Metadata Management

  • Standalone timeline server (a sketch of today's embedded server follows this list)
    • Serves schema, DFS listings, statistics, and timeline requests for interactive query-planning performance
    • High availability/sharding
    • Pluggable backing stores, including RocksDB, Dynamo, Spanner
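
Today the timeline server runs embedded in the writer's driver process; the standalone service above factors it out into a highly available service. As a point of reference, a minimal sketch (Scala) of the existing embedded toggle; config keys are from recent releases and everything else is a placeholder.

    import org.apache.spark.sql.{DataFrame, SaveMode}

    // Sketch: today's embedded timeline server, which the standalone service
    // above generalizes. Keys per recent releases; names/paths are placeholders.
    def writeWithTimelineServer(df: DataFrame, basePath: String): Unit = {
      df.write.format("hudi")
        .option("hoodie.table.name", "hudi_trips")
        .option("hoodie.embed.timeline.server", "true")
        // RocksDB-backed file-system view, in the spirit of pluggable backing stores.
        .option("hoodie.filesystem.view.type", "EMBEDDED_KV_STORE")
        .mode(SaveMode.Append)
        .save(basePath)
    }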