Summary

Earlier versions of Hudi had the Maven group ID com.uber.hoodie and package names starting with com.uber.hoodie. The first Apache version of Hudi (0.5.0) has both bundle and package names following Apache conventions (org.apache.hudi). This document is intended for engineers who build and operate Hudi datasets and need to migrate from a pre-0.5.0 Hudi version to 0.5.0 (the first Apache release).

Implications of package renames:

Changing the package names in Hudi has a few implications:

  1. Custom Hudi hooks written by users (like a custom payload class or partition-value extractor) need to be modified to use the new base classes/interfaces (see below).
  2. Hudi’s integration with query engines like Hive and Presto relies on Hudi input format classes (e.g., com.uber.hoodie.hadoop.HoodieInputFormat), which are registered in the Hive metastore. The input format is part of the Hive table definition. As the namespace for these input-format classes has changed, the upgrade to 0.5.0 has to be done carefully, in a specific order, to avoid compatibility issues between query engines and Hudi writers.
  3. Some of Hudi’s metadata (compaction plans, and actions like clean/rollback) is stored in Avro format with the “com.uber.hoodie” namespace. Also, the record payload class name (which could be com.uber.hoodie.xxx) is tracked in the hoodie.properties file. Again, the upgrade needs to be carefully planned to avoid any interoperability issues related to this.

Upgrading your Hudi extensions: 

In some cases, you may have written custom hooks for merging records, transforming upstream data sources, or other purposes. You need to be aware of the following changes to the base interfaces for these hooks (a short example follows the table).

| Hook | Older base-class/interface | New base-class/interface |
| --- | --- | --- |
| Own record payload to perform custom merging semantics | com.uber.hoodie.common.model.HoodieRecordPayload | org.apache.hudi.common.model.HoodieRecordPayload |
| Custom partition value extractor for syncing to the Hive metastore | com.uber.hoodie.hive.PartitionValueExtractor | org.apache.hudi.hive.PartitionValueExtractor |
| Custom Hoodie key generator from a record | com.uber.hoodie.KeyGenerator | org.apache.hudi.KeyGenerator |
| Custom upstream source for HoodieDeltaStreamer | com.uber.hoodie.utilities.sources.Source | org.apache.hudi.utilities.sources.Source |
| Custom datasource transformer for HoodieDeltaStreamer | com.uber.hoodie.utilities.transform.Transformer | org.apache.hudi.utilities.transform.Transformer |
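
For example, a custom key generator needs only its imports and base class switched to the new namespace. The sketch below is hypothetical (the class name and record fields are illustrative) and assumes the 0.5.0 KeyGenerator base class takes a TypedProperties config and exposes getKey(GenericRecord), as the bundled SimpleKeyGenerator does:

    import org.apache.avro.generic.GenericRecord
    import org.apache.hudi.KeyGenerator                    // was com.uber.hoodie.KeyGenerator
    import org.apache.hudi.common.model.HoodieKey
    import org.apache.hudi.common.util.TypedProperties

    // Hypothetical key generator: record key from an "id" field, partition path from "datestr".
    class MyKeyGenerator(props: TypedProperties) extends KeyGenerator(props) {
      override def getKey(record: GenericRecord): HoodieKey =
        new HoodieKey(String.valueOf(record.get("id")), String.valueOf(record.get("datestr")))
    }

The same class written against the old com.uber.hoodie base class only needs the imports updated and a recompile against the new bundle; your writer configuration keeps pointing at your own class name.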


New Hudi Bundle Packages:

The table below maps the old bundle coordinates to the new bundle coordinates (an example dependency update follows the table).

| S.No | Old Bundle Name | New Bundle Name |
| --- | --- | --- |
| 1 | com.uber.hoodie:hoodie-hadoop-mr-bundle | org.apache.hudi:hudi-hadoop-mr-bundle |
| 2 | com.uber.hoodie:hoodie-hive-bundle | org.apache.hudi:hudi-hive-bundle |
| 3 | com.uber.hoodie:hoodie-spark-bundle | org.apache.hudi:hudi-spark-bundle |
| 4 | com.uber.hoodie:hoodie-presto-bundle | org.apache.hudi:hudi-presto-bundle |
| 5 | com.uber.hoodie:hoodie-utilities-bundle | org.apache.hudi:hudi-utilities-bundle |
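
If you pull these bundles through a build tool, only the group and artifact IDs change. As a sketch, an sbt dependency update would look like the following (the exact released version string, for example 0.5.0-incubating, should be confirmed against Maven Central):

    // Before (pre-Apache coordinates):
    // libraryDependencies += "com.uber.hoodie" % "hoodie-spark-bundle" % "0.4.7"

    // After (Apache coordinates):
    libraryDependencies += "org.apache.hudi" % "hudi-spark-bundle" % "0.5.0-incubating"

Maven and Gradle users make the equivalent groupId/artifactId change in their build files.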


Changes in Input Format classes for Hive Tables:

Hudi has custom input format implementations to work with Hive tables. These classes are also affected by the change in package namespace. In addition, the input formats have been renamed to indicate that they primarily work on Parquet datasets.

The name changes are listed below.


| View Type | Pre v0.5.0 Input Format Class | v0.5.0 Input Format Class |
| --- | --- | --- |
| Read Optimized View | com.uber.hoodie.hadoop.HoodieInputFormat | org.apache.hudi.hadoop.HoodieParquetInputFormat |
| Realtime View | com.uber.hoodie.hadoop.HoodieRealtimeInputFormat | org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat |

Changes in Spark DataSource Format Name:

With the package renaming, Hudi’s Spark datasource is now accessed for both reading and writing using the format name "org.apache.hudi". A fuller example follows the table below.

| Data Source Type | Pre v0.5.0 Format (e.g. in Scala) | v0.5.0 Format (e.g. in Scala) |
| --- | --- | --- |
| Read | spark.read.format("com.uber.hoodie").xxxx | spark.read.format("org.apache.hudi").xxxx |
| Write | df.write.format("com.uber.hoodie").xxxx | df.write.format("org.apache.hudi").xxxx |
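
As a sketch, assuming an existing SparkSession named spark and a DataFrame df (the base path, table name and field names below are illustrative; the options shown are the standard Hudi datasource write keys):

    // Write df as a Hudi dataset using the new format name.
    df.write.format("org.apache.hudi")
      .option("hoodie.table.name", "my_table")
      .option("hoodie.datasource.write.recordkey.field", "id")
      .option("hoodie.datasource.write.partitionpath.field", "datestr")
      .option("hoodie.datasource.write.precombine.field", "ts")
      .mode("append")
      .save("/path/to/hudi/my_table")

    // Read it back with the same format name (the glob depth depends on your partitioning).
    val readDf = spark.read.format("org.apache.hudi").load("/path/to/hudi/my_table/*/*")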

Migrating Existing Hudi Datasets:

  This section describes the steps needed to seamlessly move Hudi writers and readers from the com.uber.hoodie:hoodie-* packages to the org.apache.hudi:hudi-* packages.

Best Practices:

These are general change-management guidelines for any big data system, but they are especially important in the context of a Hudi migration.

As you are responsible for your organization's data in Hudi, we recommend that you have:

  1. Staging Setup: Have a non-production testing environment (staging) with non-trivial datasets. Test any Hudi version upgrade in this environment before rolling it out to production.
  2. Continuous Testing in Staging: With big-data systems, sufficient baking time and traffic have to pass to verify and harden new versions before they can be rolled out to production. The same is true for Hudi upgrades. Make sure due diligence is given to testing in staging.
  3. Backwards Compatibility Testing: Apache Hudi sits in a unique place in the data lake ecosystem. On one hand, it integrates with the data ingestion side (involving processing engines like Spark, upstream sources like Kafka, and storage like HDFS/S3/GCS); on the other hand, it has to work seamlessly with query engines (Spark/Hive/Presto). For large deployments, it may not be possible to stop the world to upgrade all services to a new version of Hudi. In those cases, make sure you perform backwards compatibility testing by upgrading the readers first.
  4. Gradual Rollout: Once the upgrade is properly vetted in staging, have deployment strategies for production in place such that you can deploy any Hudi version upgrade to one or a small subset of datasets first, in one datacenter (if multi-colo), and validate for some amount of time before rolling out to the entire service.

Recommended Migration Steps:

  1. Upgrade Hudi to 0.4.7 first (recommended):
    1. Using the local dockerized environment, we have manually tested the upgrade from com.uber.hoodie:hoodie-xxx-0.4.7 to org.apache.hudi:hudi-xxx-0.5.0. While the upgrade from a pre-0.4.7 release to hudi-0.5.0 should theoretically work, we have not personally tested those migration steps.
  2. Upgrade Readers First:
    1. Hudi 0.5.0 (org.apache.hudi:hudi-xxx) packages have special classes and implementations that allow reading datasets written by 0.4.7 and pre-0.4.7 versions. Upgrading writers first could result in queries from old readers failing.
  3. Upgrade Hudi Writers Next:
    1. Writers will start writing metadata with the new namespace "org.apache.hudi", and the query engines (which have already been upgraded) will be able to handle this change.
  4. Register New HoodieInputFormat for Hive Tables: For existing Hive tables, change the table definition to use the new Hudi input format.
    1. For Read Optimized Tables: ALTER TABLE table_name SET FILEFORMAT org.apache.hudi.hadoop.HoodieParquetInputFormat;
    2. For Realtime Tables : ALTER TABLE table_name SET FILEFORMAT org.apache.hudi.hadoop.HoodieParquetRealtimeInputFormat;
  5. For MOR tables, update the hoodie.properties file to rename the value of hoodie.compaction.payload.class from its "com.uber.hoodie" class name to the corresponding "org.apache.hudi" one. We have a utility script that takes a list of base-paths to be upgraded and performs the rename. In theory, there is a small time window where queries and ingestion could see a partial hoodie.properties file while the utility script is overwriting it. To be really safe, this operation should be performed with downtime, but in practice you will most likely be fine. See below for an example invocation:

    java -cp $HUDI_UTILITIES_BUNDLE:$HADOOP_HOME/share/hadoop/common/hadoop-common-2.8.4.jar org.apache.hudi.utilities.adhoc.UpgradePayloadFromUberToApache --help
    Usage: <main class> [options]
      Options:
      * --datasets_list_path, -sp
           Local File containing list of base-paths for which migration needs to be performed
        --help, -h



 
