Skip to end of metadata
Go to start of metadata

 RFC - 04 : Faster Hive incremental pull queries

Proposers

Approvers

Status

Current state: COMPLETED

Discussion thread: here

JIRA: here

Released: <Hudi Version>

Abstract

The incremental pull tool for Hive (HiveIncrementalPuller & its uber internal equivalent) provides the ability to obtain change streams on ingested tables in Hive. Currently this is achieved by listing all partitions and then evaluating them for incremental changes. For larger Hive tables, this approach can be time consuming with high number of partitions. This proposal aims at speeding up incremental queries by leveraging the commit metadata that Hudi timeline already has even before listing the partitions. This will reduce the planning overhead to just the partitions affected by the time range the data is incrementally pulled for.

Background

Here are some related concepts that might be useful to know.

  • HoodieTimeline - This abstraction provides a timeline of Hudi operations and Hudi metadata that happened on a dataset which is either complete or in inflight.
  • FileSystem View - An abstraction which hierarchically tracks the partitions and files present in each partitions. This abstraction allows for many versions of a file to be present in each partition and identifies the file-versions valid at any snapshot by using a careful naming scheme
  • HoodieCommitMetadata - This has all metadata that gets stored along with a commit.

Implementation

Reimplement the HoodieInputFormat.listStatus() this way -

  1. Parse the InputPaths from JobConf and classify into three categories - paths related to incremental mode queries, paths related to non incremental mode queries, non Hudi Paths.
  2. While Parsing the InputPaths, also create HoodieTableMetadataClient for Hudi tables.
  3. Process each category sequentially
    1. Incremental mode - HoodieTimeline provides commits to check. Read each of those commits (HoodieCommitMetadata) and gather partitions to list. Mutate JobConf to set input paths to this new partitions list. From here on the implementation is same as current - Create ReadOptimizedView of the table and fetch all latest data files which are in the matching commit range.
    2. Non Hudi mode - Mutate jobConf to set these as InputPaths and return FileSatus[] as is.
    3. Non Incremental mode - Mutate Jobconf to set non incremental input paths. List files in these paths, group them based on tables (reuse table metatada created in step 2 above), construct ReadOptimizedView of the table and fetch all latest data files.

Rollout/Adoption Plan

  • Roll out involves replacing the current HoodieInputFormat registered in Hive with this new version.
  • There wont be any user impact since, the behavior of listStatus will remain same. 

Test Plan

Unit tests to cover all operations

Testing using production workloads.

1 Comment

  1. looks good to me, this is a good precursor to hoodie metadata filtering on the queries (smile)