RFC - 34: Hudi BigQuery Integration

Proposers

Approvers

Status

Current state: UNDER DISCUSSION

JIRA: here

Released: Next version

Abstract

BigQuery is Google Cloud's fully managed, petabyte-scale, and cost-effective analytics data warehouse that lets you run analytics over vast amounts of data in near real-time. BigQuery currently doesn't support the Apache Hudi table format, but it does support the Parquet file format. The proposal is to implement a BigQuerySync, similar to HiveSync, that syncs a Hudi table to BigQuery as an external Parquet table so that users can query Hudi tables using BigQuery. Uber is already syncing some of its Hudi tables to a BigQuery data mart; this feature will help them write, sync, and query.

Background

Hudi table types define how data is indexed and laid out on the DFS and how Hudi's primitives and timeline activities are implemented on top of that organization (i.e., how data is written). In turn, query types define how the underlying data is exposed to queries (i.e., how data is read).

Hudi supports the following table types:

  • Copy On Write: Stores data using exclusively columnar file formats (e.g., Parquet). Updates simply version & rewrite the files by performing a synchronous merge during write.
  • Merge On Read: Stores data using a combination of columnar (e.g., Parquet) and row-based (e.g., Avro) file formats. Updates are logged to delta files and later compacted to produce new versions of the columnar files, synchronously or asynchronously.

For Copy-on-Write (CoW) tables, Hudi maintains multiple versions of the Parquet files and tracks the latest version using Hudi metadata. Since BigQuery doesn't support Hudi yet, if you sync Hudi's Parquet files to BigQuery and query them without Hudi's metadata layer, the query will read all versions of the Parquet files, which can produce duplicate rows.

To avoid the above scenario, this proposal is to implement a BigQuery sync tool that uses the Hudi metadata to determine which files are the latest and syncs only the latest version of the Parquet files to a BigQuery external table, so that users can query Hudi tables without any duplicate records.
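
To make this concrete, the following is a minimal sketch (not the proposed implementation) of how Hudi's timeline and file-system view can be used to resolve only the latest base (Parquet) files of a CoW table. The base path and partition value are illustrative, and exact API signatures may differ slightly across Hudi versions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hudi.common.table.HoodieTableMetaClient;
    import org.apache.hudi.common.table.view.HoodieTableFileSystemView;

    public class LatestBaseFilesExample {
      public static void main(String[] args) {
        // Illustrative base path; any Hudi CoW table on GCS/DFS works the same way.
        String basePath = "gs://my-bucket/my_hudi_table";

        HoodieTableMetaClient metaClient = HoodieTableMetaClient.builder()
            .setConf(new Configuration())
            .setBasePath(basePath)
            .build();

        // The file-system view narrows each file group down to its latest base (Parquet)
        // file, using only completed commits on the timeline.
        HoodieTableFileSystemView fsView = new HoodieTableFileSystemView(
            metaClient,
            metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants());

        // Only these paths would be exposed to BigQuery; older file versions are skipped,
        // which avoids the duplicate-row problem described above.
        fsView.getLatestBaseFiles("partition=2021-08-01")
            .forEach(baseFile -> System.out.println(baseFile.getPath()));
      }
    }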

Implementation

This new feature will add a BigQuerySyncTool that extends AbstractSyncTool (similar to HiveSyncTool), with sync methods for CoW tables. The sync implementation will identify the latest Parquet files for each .commit file and keep this manifest synced with the BigQuery external table. Spark Datasource & DeltaStreamer can already take a list of such sync classes to keep the table synced on every write.
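
As an illustration only, a Spark datasource write could register the new tool alongside HiveSyncTool via the existing meta-sync options. The option keys and the org.apache.hudi.gcp.bigquery.BigQuerySyncTool class name below are assumptions, not the final design:

    // df is a Dataset<Row> of incoming records. The option keys are the meta-sync options
    // in recent Hudi releases and, like the BigQuerySyncTool class name, are assumptions.
    df.write()
      .format("hudi")
      .option("hoodie.table.name", "my_hudi_table")
      .option("hoodie.datasource.write.recordkey.field", "uuid")
      .option("hoodie.datasource.write.precombine.field", "ts")
      .option("hoodie.datasource.meta.sync.enable", "true")
      .option("hoodie.meta.sync.client.tool.class",
          "org.apache.hudi.hive.HiveSyncTool,org.apache.hudi.gcp.bigquery.BigQuerySyncTool")
      .mode("append")
      .save("gs://my-bucket/my_hudi_table");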

BigQueryConfigs Class Design (TBD)

These are the new BigQuery configs used to sync the table:

    String datasetName = "MY_DATASET_NAME";
    String tableName = "MY_TABLE_NAME";
    String sourceUri = "gs://cloud-samples-data/bigquery/us-states/us-states.csv";
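
A possible shape for the config class, mirroring how HiveSyncConfig exposes JCommander parameters, is sketched below. The field and flag names are assumptions and are subject to the final design (TBD):

    import com.beust.jcommander.Parameter;
    import java.io.Serializable;

    public class BigQuerySyncConfig implements Serializable {

      @Parameter(names = {"--dataset-name"}, description = "BigQuery dataset to sync into", required = true)
      public String datasetName;

      @Parameter(names = {"--table-name"}, description = "Name of the BigQuery external table", required = true)
      public String tableName;

      @Parameter(names = {"--source-uri"}, description = "GCS URI of the source Parquet files", required = true)
      public String sourceUri;

      @Parameter(names = {"--base-path"}, description = "Base path of the Hudi table", required = true)
      public String basePath;
    }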

BigQuerySyncTool Class Design (TBD)
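
While the class design is still TBD, a hypothetical skeleton could look like the following. The AbstractSyncTool constructor signature, the property keys, and the BigQuery client calls in syncHoodieTable are assumptions meant only to outline the flow:

    import java.util.Properties;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hudi.sync.common.AbstractSyncTool;

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.ExternalTableDefinition;
    import com.google.cloud.bigquery.FormatOptions;
    import com.google.cloud.bigquery.TableId;
    import com.google.cloud.bigquery.TableInfo;

    public class BigQuerySyncTool extends AbstractSyncTool {

      private final String datasetName;
      private final String tableName;
      private final String sourceUri;

      // Constructor signature is an assumption; it varies across Hudi versions.
      // Property keys below are placeholders, pending the config class design.
      public BigQuerySyncTool(Properties props, FileSystem fs) {
        super(props, fs);
        this.datasetName = props.getProperty("hoodie.bigquery.sync.dataset_name");
        this.tableName = props.getProperty("hoodie.bigquery.sync.table_name");
        this.sourceUri = props.getProperty("hoodie.bigquery.sync.source_uri");
      }

      @Override
      public void syncHoodieTable() {
        // 1. Read the Hudi timeline and resolve the latest base (Parquet) files per
        //    partition (see the file-system view sketch in the Background section).
        // 2. Register/refresh a BigQuery external table over those files. The call below
        //    uses the google-cloud-bigquery client and is illustrative only.
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        ExternalTableDefinition definition =
            ExternalTableDefinition.newBuilder(sourceUri, FormatOptions.parquet())
                .setAutodetect(true)
                .build();
        bigquery.create(TableInfo.of(TableId.of(datasetName, tableName), definition));
      }
    }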

Rollout/Adoption Plan

There is no impact on existing users since this is an entirely new feature supporting a new use case; hence, no migrations or behavior changes are required.

After the BigQuerySyncTool has been implemented, I will reach out to Uber's Hudi/BigQuery team to roll out this feature for their BigQuery ingestion service.

Test Plan

This RFC aims to implement a new SyncTool to sync Hudi tables to BigQuery. To test this feature, test tables will be created and updated in BigQuery, along with unit tests for the code. Since this is an entirely new feature, I am confident that it will not cause any regressions during or after rollout.

Future Plan

After this feature has been rolled out, the same model can be applied to sync the Hudi tables to other external data warehouses like Snowflake.