Proposers
Approvers
- Vinoth Chandar APPROVED
- lamber-ken APPROVED
- Bhavani Sudha APPROVED
- ...
Status
Current state (one of):
- UNDER DISCUSSION
- IN PROGRESS
- ABANDONED
- COMPLETED
- INACTIVE
Discussion thread: here
Released: <Hudi Version>
Abstract
Currently, reading a Hudi Merge on Read table depends on MapredParquetInputFormat from Hive and RecordReader<NullWritable, ArrayWritable> from the org.apache.hadoop.mapred package, which is the first-generation MapReduce API.
This dependency makes it very difficult for other query engines (e.g. Spark SQL) to reuse the existing record merging code, because they are not compatible with the first-generation MapReduce API.
Hive doesn't support the second-generation MapReduce API at this point: cwiki-proposal, HIVE-3752.
For the Spark datasource, the FileInputFormat and RecordReader are based on org.apache.hadoop.mapreduce, the second-generation API. Based on discussions online, mixing the first- and second-generation APIs leads to unexpected behavior.
So, I am proposing to support both generations of the API and abstract out the Hudi record merging logic. Then we will have the flexibility to add support for other query engines.
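To make the API mismatch concrete, the sketch below contrasts the two reader styles using simplified stand-ins. The real interfaces are org.apache.hadoop.mapred.RecordReader (caller-supplied value holder) and org.apache.hadoop.mapreduce.RecordReader (iterator-style nextKeyValue/getCurrentValue); the names OldStyleReader, NewStyleReader, and the adapter here are hypothetical, for illustration only:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ReaderStyles {

    // Simplified stand-in for the first-generation org.apache.hadoop.mapred.RecordReader:
    // the caller supplies a reusable value holder and next() fills it in.
    interface OldStyleReader<V> {
        boolean next(V holder);   // returns false when exhausted
        V createValue();          // caller obtains the reusable holder
    }

    // Simplified stand-in for the second-generation org.apache.hadoop.mapreduce.RecordReader:
    // the reader owns the current value and exposes an iterator-like protocol.
    interface NewStyleReader<V> {
        boolean nextKeyValue();
        V getCurrentValue();
    }

    // A hypothetical adapter: wraps an old-style reader so second-generation
    // callers (e.g. a Spark datasource) can consume it.
    static <V> NewStyleReader<V> adapt(OldStyleReader<V> old) {
        return new NewStyleReader<V>() {
            private final V current = old.createValue();
            private boolean hasCurrent = false;

            @Override public boolean nextKeyValue() {
                hasCurrent = old.next(current);
                return hasCurrent;
            }

            @Override public V getCurrentValue() {
                return hasCurrent ? current : null;
            }
        };
    }

    // Drive an old-style reader through the adapter and concatenate what it yields.
    static String concatViaAdapter(List<String> input) {
        final Iterator<String> rows = input.iterator();
        OldStyleReader<StringBuilder> oldReader = new OldStyleReader<StringBuilder>() {
            @Override public boolean next(StringBuilder holder) {
                if (!rows.hasNext()) {
                    return false;
                }
                holder.setLength(0);
                holder.append(rows.next());
                return true;
            }
            @Override public StringBuilder createValue() {
                return new StringBuilder();
            }
        };

        NewStyleReader<StringBuilder> newReader = adapt(oldReader);
        StringBuilder out = new StringBuilder();
        while (newReader.nextKeyValue()) {
            out.append(newReader.getCurrentValue());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(concatViaAdapter(Arrays.asList("a", "b", "c"))); // prints "abc"
    }
}
```

Adapting in this direction is straightforward; the reverse (and object reuse semantics of the old API) is where mixing the two generations gets error-prone.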
Background
Problems to solve:
- Decouple the Hudi-specific logic from the existing HoodieParquetInputFormat, HoodieRealtimeInputFormat, HoodieRealtimeRecordReader, etc.
- Create new classes that use the org.apache.hadoop.mapreduce APIs and wrap the Hudi-specific logic into them.
- Wrap the FileInputFormat from the query engine to take advantage of its optimizations. Taking Spark SQL as an example, we can create a HoodieParquetFileFormat by wrapping ParquetFileFormat and ParquetRecordReader<Row> from the Spark codebase with Hudi merging logic, and extend support to OrcFileFormat in the future.
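As a rough illustration of the merging logic the steps above aim to decouple, the toy sketch below merges base-file records with newer log records by record key, using simple last-write-wins semantics (a simplification of Hudi's payload-based merging; the class and method names are hypothetical, not Hudi APIs):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MergeSketch {

    // Merge base-file records with log records: a log record with the same key
    // overrides the base record (last-write-wins), and unmatched log records
    // are appended as inserts. Records are modeled as {key, payload} pairs.
    static List<String[]> merge(List<String[]> base, List<String[]> log) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (String[] r : base) {
            merged.put(r[0], r[1]);
        }
        for (String[] r : log) {
            merged.put(r[0], r[1]); // update in place, or insert at the end
        }
        List<String[]> out = new ArrayList<>();
        for (Map.Entry<String, String> e : merged.entrySet()) {
            out.add(new String[]{e.getKey(), e.getValue()});
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> base = List.of(new String[]{"k1", "v1"}, new String[]{"k2", "v2"});
        List<String[]> log  = List.of(new String[]{"k2", "v2'"}, new String[]{"k3", "v3"});
        for (String[] r : merge(base, log)) {
            System.out.println(r[0] + "=" + r[1]);
        }
        // prints: k1=v1, k2=v2', k3=v3
    }
}
```

Once this kind of logic lives behind an engine-neutral abstraction rather than inside the mapred-based record readers, both the Hive input formats and a wrapped Spark HoodieParquetFileFormat can share it.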
Implementation
https://github.com/apache/incubator-hudi/pull/1592
Rollout/Adoption Plan
- No impact on existing users, because the existing Hive-related InputFormat classes won't be changed, except that some methods were relocated to the HoodieInputFormatUtils class. We will test that this does not impact Hive queries.
- New Spark datasource support for Merge on Read tables will be added.
Test Plan
- Unit tests
- Integration tests
- Test on the cluster for a larger dataset.
4 Comments
Vinoth Chandar
Bhavani Sudha Can you review this RFC, since you are looking at all the query related stuff anyway..
Vinoth Chandar
Gary Li this RFC seems like refactoring/restructuring code to support both old and new IPF classes and uses the new one to do Spark Datasource integration.. IIUC, then I am fine with this.. we can do a more detailed review of the pull request when ready
Gary Li
Vinoth Chandar Let's do the review when the PR is ready; this high-level design might not fit Spark Datasource 100%. So let me get a working version first, then iterate on the refactoring/restructuring.
Vinoth Chandar
Sg. Just ping us all here when the PR is ready..