This wiki space hosts technical documentation, how-to guides, design documents/RFCs, and roadmap/community pages for Apache Hudi contributors.


If you are looking for documentation on using Apache Hudi, please visit the project site or engage with our community.

Technical documentation

How-to blogs

  1. How to manually register Hudi tables into Hive via Beeline? 
  2. Ingesting Database changes via Sqoop/Hudi
  3. De-Duping Kafka Events With Hudi DeltaStreamer

Design documents/RFCs

RFCs are the way to propose large changes to Hudi, and the RFC Process details how to drive one from proposal to completion. Anyone can initiate an RFC. If you are unsure whether a feature already exists, or whether there is already a plan to implement a similar one, please start a discussion thread on the dev mailing list before initiating an RFC. This gives everyone the right context and makes the best use of everyone's time.

The list of RFCs is maintained as child pages of the RFC Process page.

Community Management

Roadmap

Below is a tentative roadmap for 2021 (in no particular order; ordering is determined by the Release Management process).

Integrations

  1. Spark SQL with Merge/Delete statement support (RFC-25: Spark SQL Extension For Hudi); a syntax sketch follows this list
  2. Trino integration with support for querying/writing Hudi tables using SQL statements
  3. Kinesis/Pulsar integrations with DeltaStreamer
  4. Kafka Connect Sink for Hudi
  5. Dremio integration
  6. Interop with other table formats
  7. ORC support
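
To make item 1 concrete, here is a minimal sketch (Scala) of what RFC-25-style SQL could look like against a Hudi table from Spark. The table/column names are hypothetical, and the session extension class and final syntax are whatever the RFC lands on.

    import org.apache.spark.sql.SparkSession

    // Hypothetical RFC-25 style Spark SQL against a Hudi table; names are made up.
    // The SQL extension class is assumed from the RFC proposal.
    val spark = SparkSession.builder()
      .appName("hudi-merge-sketch")
      .config("spark.sql.extensions",
        "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
      .getOrCreate()

    // Upsert changes from a source table into the Hudi table.
    spark.sql("""
      MERGE INTO hudi_trips AS target
      USING trip_updates AS source
      ON target.trip_id = source.trip_id
      WHEN MATCHED THEN UPDATE SET *
      WHEN NOT MATCHED THEN INSERT *
    """)

    // Deletes would follow the same SQL-first pattern.
    spark.sql("DELETE FROM hudi_trips WHERE trip_status = 'cancelled'")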

Writing

  • Indexing
    • MetadataIndex implementation that serves bloom filters/key ranges from the metadata table, to speed up bloom index on cloud storage
    • Addition of record-level indexes for fast CDC (RFC-08 Record level indexing mechanisms for Hudi datasets)
    • Range index to maintain column/field value ranges, to help file skipping for query performance
    • Addition of more auxiliary indexing structures, e.g. bitmaps
    • Global/hash-based index for faster point-in-time lookups
  • Concurrency Control
    • Addition of optimistic concurrency control, with pluggable locking services (a configuration sketch follows this list)
    • Clustering implementation that is non-blocking w.r.t. updates
    • Multi-writer support with fully non-blocking, log-based concurrency control
    • Multi-table transactions
  • Performance
    • Integrate row writer with all Hudi write operations
  • Self Managing
    • Clustering based on historical workload trends
    • On-the-fly data locality at write time (HUDI-1628)
    • Automatic determination of compression ratio
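
For the optimistic concurrency control item above, a minimal sketch (Scala) of what a multi-writer configuration could look like, using the ZooKeeper-based lock provider. Config keys follow the 0.8.x line; the table name, ZooKeeper host, and paths are placeholders, so verify against the release you run.

    import org.apache.spark.sql.{DataFrame, SaveMode}

    // Sketch: enable optimistic concurrency control with a pluggable lock provider.
    // Keys follow the 0.8.x line; `df`, table name, and hosts/paths are placeholders.
    def writeWithOcc(df: DataFrame, basePath: String): Unit = {
      df.write.format("hudi")
        .option("hoodie.table.name", "hudi_trips")
        .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control")
        // Lazy cleaning, so files from a failed concurrent writer are cleaned later.
        .option("hoodie.cleaner.policy.failed.writes", "LAZY")
        .option("hoodie.write.lock.provider",
          "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider")
        .option("hoodie.write.lock.zookeeper.url", "zk-host")
        .option("hoodie.write.lock.zookeeper.port", "2181")
        .option("hoodie.write.lock.zookeeper.lock_key", "hudi_trips")
        .option("hoodie.write.lock.zookeeper.base_path", "/hudi/locks")
        .mode(SaveMode.Append)
        .save(basePath)
    }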

Querying

  • Performance
    • Complete integration with the metadata table
    • Realtime view performance/memory footprint reduction
  • PrestoDB
    • Incremental query support on Presto
  • Hive
    • Storage handler to leverage the metadata table for partition pruning
  • Spark SQL
    • Hardening incremental pull via the realtime view (an incremental-read sketch follows this list)
    • Spark datasource redesign around the metadata table
    • Streaming ETL via Structured Streaming
  • Flink
    • Support for end-to-end streaming ETL pipelines
    • Materialized view support via Flink/Calcite SQL
  • Mutable, Columnar Cache Service
    • File-group-level caching to enable real-time analytics (backed by Arrow/AresDB)
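
As background for the incremental-pull and datasource items above, here is a minimal sketch (Scala) of today's incremental query path through the Spark datasource, which these items aim to harden and redesign. Option keys are from recent releases; the begin instant and path are placeholders.

    import org.apache.spark.sql.SparkSession

    // Sketch: incremental pull via the Spark datasource. Reads only records
    // committed after the given instant; instant and path are placeholders.
    val spark = SparkSession.builder().appName("hudi-incremental-sketch").getOrCreate()
    val delta = spark.read.format("hudi")
      .option("hoodie.datasource.query.type", "incremental")
      .option("hoodie.datasource.read.begin.instanttime", "20210401000000")
      .load("/data/hudi_trips")
    delta.createOrReplaceTempView("trips_delta")
    spark.sql("SELECT count(*) FROM trips_delta").show()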

Metadata Management

  • Standalone timeline server (a sketch of today's embedded server follows this list)
    • Serves schema, DFS listings, statistics, and timeline requests for interactive query-planning performance
    • High availability/sharding
    • Pluggable backing stores, including RocksDB, Dynamo, Spanner
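
Today the timeline server runs embedded in the writer's driver process; the standalone service above factors it out into a highly available service. As a point of reference, a minimal sketch (Scala) of the existing embedded toggle; config keys are from recent releases and everything else is a placeholder.

    import org.apache.spark.sql.{DataFrame, SaveMode}

    // Sketch: today's embedded timeline server, which the standalone service
    // above generalizes. Keys per recent releases; names/paths are placeholders.
    def writeWithTimelineServer(df: DataFrame, basePath: String): Unit = {
      df.write.format("hudi")
        .option("hoodie.table.name", "hudi_trips")
        .option("hoodie.embed.timeline.server", "true")
        // RocksDB-backed file-system view, in the spirit of pluggable backing stores.
        .option("hoodie.filesystem.view.type", "EMBEDDED_KV_STORE")
        .mode(SaveMode.Append)
        .save(basePath)
    }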