Proposers
Approvers
- Vinoth Chandar : APPROVED
- Balaji Varadarajan : APPROVED
- Nishith Agarwal : APPROVED
Status
Current state: COMPLETED
Discussion thread: here
JIRA: HUDI-344
Released: 0.6.0
Abstract
A feature to snapshot a Hudi dataset and export the latest records to a set of external files (e.g., plain parquet files).
Background
The existing org.apache.hudi.utilities.HoodieSnapshotCopier performs a Hudi-to-Hudi copy for backup purposes. To broaden its usability, the Copier could be extended to export to data formats other than a Hudi dataset, such as plain parquet files.
Implementation
The proposed class is org.apache.hudi.utilities.HoodieSnapshotExporter, which serves as the main entry point for snapshot-related work.
Definition of "Snapshot"
To snapshot is to get the records from a Hudi dataset at a particular point in time. Note that the data exported from MOR (merge-on-read) tables may not be the most up-to-date, because an RO (read-optimized) query is used for retrieval, and it omits the latest data in the log files.
Arguments
Argument | Description | Remark |
---|---|---|
--source-base-path | Base path for the source Hudi dataset to be snapshotted | required |
--target-output-path | Output path for storing a particular snapshot | required |
--output-format | Output format for the exported dataset; accepted values: json, parquet, hudi | required; when "hudi", behaves the same as HoodieSnapshotCopier; more data formats may be supported in the future |
--output-partition-field | A field to be used by Spark repartitioning | optional; ignored when "hudi" or when The output dataset's default partition field is inherited from the source Hudi dataset. When this argument is specified, the provided value is used for both in-memory Spark repartitioning and output file partitioning: df.repartition(df.col(partitionField)).write().partitionBy(partitionField).parquet(outputPath); where partitionField is the String value of this argument. In case more flexibility is needed for repartitioning, use |
--output-partitioner | A class to facilitate custom repartitioning | optional; Ignored when "hudi" |
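
The argument rules in the table can be sketched as a small validation routine. This is only an illustrative sketch in plain Java, not the actual HoodieSnapshotExporter code: the class ExporterArgs and its field names are hypothetical, and treating the format value case-insensitively is an assumption. Only the argument names and the accepted format values come from the table.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical holder for the exporter arguments; not part of Hudi.
final class ExporterArgs {
    // Accepted values for --output-format, as listed in the table.
    static final List<String> FORMATS = Arrays.asList("json", "parquet", "hudi");

    final String sourceBasePath;       // --source-base-path (required)
    final String targetOutputPath;     // --target-output-path (required)
    final String outputFormat;         // --output-format (required)
    final String outputPartitionField; // --output-partition-field (optional, may be null)
    final String outputPartitioner;    // --output-partitioner (optional, may be null)

    ExporterArgs(String source, String target, String format,
                 String partitionField, String partitioner) {
        this.sourceBasePath = requireNonEmpty(source, "--source-base-path");
        this.targetOutputPath = requireNonEmpty(target, "--target-output-path");

        // Assumption: the format is matched case-insensitively.
        String normalized = requireNonEmpty(format, "--output-format").toLowerCase(Locale.ROOT);
        if (!FORMATS.contains(normalized)) {
            throw new IllegalArgumentException("--output-format must be one of " + FORMATS);
        }
        this.outputFormat = normalized;

        // Both partitioning arguments are ignored when the output format is "hudi".
        boolean isHudi = "hudi".equals(normalized);
        this.outputPartitionField = isHudi ? null : partitionField;
        this.outputPartitioner = isHudi ? null : partitioner;
    }

    private static String requireNonEmpty(String value, String name) {
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException(name + " is required");
        }
        return value;
    }
}
```

For example, new ExporterArgs("/src", "/out", "hudi", "year", null) would discard the partition field, while the same call with "parquet" would keep it.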
Steps
- Read
  - Regardless of output format, always leverage org.apache.hudi.common.table.view.HoodieTableFileSystemView to perform an RO query for the read.
  - Specifically, the data to be read comes from the latest version of the columnar files in the source dataset, up to the latest commit time, like what the existing HoodieSnapshotCopier does.
- Transform
  - Output format "parquet"
    - Strip Hudi metadata
    - Allow the user to provide a field for simple Spark repartitioning
    - Allow the user to provide a class for custom repartitioning
  - No transformation is needed for output format "hudi"; just copy the original files, like what the existing HoodieSnapshotCopier does.
- Write
  - Only the output directory needs to be provided; Spark handles the rest.
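
The Read and Transform steps above can be illustrated with a simplified, Spark-free sketch. It does not use HoodieTableFileSystemView or any Hudi API: BaseFile and SnapshotSketch are hypothetical stand-ins, and commit times are assumed to be equal-length timestamp strings that sort lexicographically. The column names are the five `_hoodie_*` metadata fields that Hudi adds to each record.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of a columnar base file; not a Hudi class.
final class BaseFile {
    final String fileGroupId;
    final String commitTime; // assumed sortable as a string, e.g. "20200101120000"
    final String path;

    BaseFile(String fileGroupId, String commitTime, String path) {
        this.fileGroupId = fileGroupId;
        this.commitTime = commitTime;
        this.path = path;
    }
}

final class SnapshotSketch {
    // Metadata columns that Hudi adds to every record.
    static final List<String> HUDI_META_COLUMNS = Arrays.asList(
        "_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key",
        "_hoodie_partition_path", "_hoodie_file_name");

    // Read: per file group, keep the newest base file written at or before latestCommit.
    static List<BaseFile> latestBaseFiles(List<BaseFile> all, String latestCommit) {
        Map<String, BaseFile> newest = new LinkedHashMap<>();
        for (BaseFile f : all) {
            if (f.commitTime.compareTo(latestCommit) > 0) {
                continue; // written after the snapshot point, so not part of the snapshot
            }
            BaseFile current = newest.get(f.fileGroupId);
            if (current == null || f.commitTime.compareTo(current.commitTime) > 0) {
                newest.put(f.fileGroupId, f);
            }
        }
        return new ArrayList<>(newest.values());
    }

    // Transform: drop the Hudi metadata columns, keeping the remaining column order.
    static List<String> stripMetaColumns(List<String> columns) {
        List<String> kept = new ArrayList<>();
        for (String column : columns) {
            if (!HUDI_META_COLUMNS.contains(column)) {
                kept.add(column);
            }
        }
        return kept;
    }
}
```

With a Spark Dataset, the transform would presumably amount to dropping those columns before writing; the sketch only shows which files and columns are selected.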
Rollout/Adoption Plan
- No impact to existing users as this is a new independent utility tool.
- Once this feature is GA, we can mark HoodieSnapshotCopier as deprecated and suggest that users switch to this tool, which provides equivalent copying features.
Test Plan
- Write tests similar to those for HoodieSnapshotCopier
- When testing end-to-end, verify that
  - the numbers of records match
  - a later snapshot reflects the latest data from the original dataset
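
The two end-to-end checks can be sketched as plain assertions over in-memory data. This is a hypothetical illustration rather than the project's test code: SnapshotChecks is an invented name, and records are modelled as a map from record key to value instead of real Hudi or Spark datasets.

```java
import java.util.Map;

// Hypothetical helper for the end-to-end checks; not part of Hudi's test suite.
final class SnapshotChecks {
    // Check 1: the export holds the same number of records as the source.
    static boolean countsMatch(Map<String, String> source, Map<String, String> exported) {
        return source.size() == exported.size();
    }

    // Check 2: every source record appears in the export with its current value.
    static boolean reflectsLatest(Map<String, String> source, Map<String, String> exported) {
        return exported.entrySet().containsAll(source.entrySet());
    }
}
```

A snapshot taken before an update to the source should pass the count check but fail the second check, whereas a snapshot taken after the update should pass both.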