This wiki space hosts
If you are looking for documentation on using Apache Hudi, please visit the project site or engage with our community.
- How to manually register Hudi tables into Hive via Beeline?
- Ingesting Database changes via Sqoop/Hudi
- De-Duping Kafka Events With Hudi DeltaStreamer
RFCs are the way to propose large changes to Hudi, and the RFC Process details how to drive one from proposal to completion. Anyone can initiate an RFC. If you are unsure whether a feature already exists or whether there is already a plan to implement a similar one, always start a discussion thread on the dev mailing list before initiating an RFC. This gives everyone the right context and makes the best use of everyone's time.
Below is a list of RFCs
- Apache Hudi - Release Guide (Pre Graduation)
- Apache Hudi Community Bi-Weekly Sync
- Committer On-boarding Guide
- Community Support
Below is a tentative roadmap for 2021 (in no particular order, since ordering is determined by the Release Management process).
- Spark SQL with Merge/Delete statement support (RFC-25: Spark SQL Extension For Hudi)
- Trino integration with support for querying/writing Hudi tables using SQL statements
- Kinesis/Pulsar integrations with DeltaStreamer
- Kafka Connect Sink for Hudi
- Dremio integration
- Interop with other table formats
- ORC Support
- MetadataIndex implementation that serves bloom filters/key ranges from the metadata table, to speed up the bloom index on cloud storage
- Addition of record-level indexes for fast CDC (RFC-08: Record level indexing mechanisms for Hudi datasets)
- Range index to maintain column/field value ranges, to help file skipping for query performance
- Addition of more auxiliary indexing structures (bitmaps, etc.)
- Global/hash-based index for faster point-in-time lookup
- Addition of optimistic concurrency control, with pluggable locking services.
- Non-blocking clustering implementation w.r.t. updates
- Multi-writer support with fully non-blocking log based concurrency control.
- Multi table transactions
- Integrate row writer with all Hudi writer operations
- Clustering based on historical workload trends
- On-the-fly data locality during write time (HUDI-1628)
- Auto-determination of compression ratio
- Complete integration with metadata table.
- Realtime view performance/memory footprint reduction.
- Incremental query support on Presto
- Storage handler to leverage metadata table for partition pruning
- Spark SQL
- Hardening incremental pull via Realtime view
- Spark Datasource redesign around metadata table
- Streaming ETL via Structured Streaming
- Support for end-to-end streaming ETL pipelines
- Materialized view support via Flink/Calcite SQL
Mutable, Columnar Cache Service
- File group level caching to enable real-time analytics (backed by Arrow/AresDB)
- Standalone timeline server
- Serves interactive query planning performance: schema, DFS listings, statistics, timeline requests
- High availability/sharding
- Pluggable backing stores including RocksDB, Dynamo, Spanner