Proposers

Approvers

Status

Current state: UNDER DISCUSSION

Discussion thread: here

JIRA: here

Released: N/A

Abstract

All Hudi datasets are currently persisted as Parquet on DFS. Since ORC is also widely adopted and supported, the goal is to provide ORC as an alternative serving layer backing Hudi datasets, so that users have more control over the columnar format they wish to use.

Hudi currently uses Parquet as its only storage format for Copy on Write and Merge on Read tables, so users are forced to store and query data in Parquet. This RFC introduces ORC as an alternative underlying storage format for Hudi, exposing ORC-backed Read Optimized views.

Implementation

Schema

Hudi uses an Avro schema for incoming records, and the Parquet writer can consume that Avro schema directly. ORC, however, does not support Avro natively, so the Avro schema (and each record) has to be converted to ORC's own representation.
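A minimal sketch of such a conversion is below, assuming we walk the Avro schema and build the corresponding ORC TypeDescription. The class AvroOrcUtils and its method are hypothetical helpers (not existing Hudi or ORC APIs), and only a few flat, primitive types are covered for illustration; unions, logical types, maps and arrays would need additional cases.

import org.apache.avro.Schema;
import org.apache.orc.TypeDescription;

// Hypothetical helper: map an Avro schema to an ORC TypeDescription.
public class AvroOrcUtils {

  public static TypeDescription createOrcSchema(Schema avroSchema) {
    switch (avroSchema.getType()) {
      case RECORD:
        // Build an ORC struct field-by-field from the Avro record fields.
        TypeDescription struct = TypeDescription.createStruct();
        for (Schema.Field field : avroSchema.getFields()) {
          struct.addField(field.name(), createOrcSchema(field.schema()));
        }
        return struct;
      case STRING:
        return TypeDescription.createString();
      case INT:
        return TypeDescription.createInt();
      case LONG:
        return TypeDescription.createLong();
      case FLOAT:
        return TypeDescription.createFloat();
      case DOUBLE:
        return TypeDescription.createDouble();
      case BOOLEAN:
        return TypeDescription.createBoolean();
      case BYTES:
        return TypeDescription.createBinary();
      default:
        throw new IllegalArgumentException(
            "Type not covered by this sketch: " + avroSchema.getType());
    }
  }
}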

Implementation Steps

  • Add a storage type option (e.g. "hoodie.table.storage.type=ORC") to hoodie.properties.
  • Implement a HoodieOrcWriter, just like HoodieParquetWriter, to write data in ORC (see the writer sketch after this list).
  • Implement an OrcReaderIterator, just like ParquetReaderIterator, to read data in ORC.
  • Change or create a new implementation of HoodieInputFormat to support ORC (as of now it strictly assumes Parquet).
  • Rework HoodieBloomIndex to work with ORC.
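To make the writer step more concrete, here is a rough sketch of what a HoodieOrcWriter could look like, assuming "hoodie.table.storage.type=ORC" has routed writes to it and that we use ORC's core Writer and VectorizedRowBatch API. The class shape is a sketch only (the real writer would plug into the same interfaces HoodieParquetWriter implements and handle all column types and nulls); only string columns are filled here, and the ORC built-in bloom filter on _hoodie_record_key is enabled via writer options so the index rework can use it.

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.CompressionKind;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

// Hypothetical HoodieOrcWriter skeleton mirroring HoodieParquetWriter.
// Records are buffered into a VectorizedRowBatch and flushed when the batch is full.
public class HoodieOrcWriter implements AutoCloseable {

  private final TypeDescription orcSchema;
  private final Writer writer;
  private final VectorizedRowBatch batch;

  public HoodieOrcWriter(Configuration conf, Path path, TypeDescription orcSchema)
      throws IOException {
    this.orcSchema = orcSchema;
    this.writer = OrcFile.createWriter(path,
        OrcFile.writerOptions(conf)
            .setSchema(orcSchema)
            .compress(CompressionKind.ZLIB)
            // Build ORC's built-in bloom filter on the record key so that
            // the index lookup (HoodieBloomIndex rework) can use it later.
            .bloomFilterColumns("_hoodie_record_key"));
    this.batch = orcSchema.createRowBatch();
  }

  public void writeRecord(GenericRecord record) throws IOException {
    int row = batch.size++;
    for (int c = 0; c < orcSchema.getFieldNames().size(); c++) {
      // Sketch only: every field is written as a string; real code would
      // dispatch on the ORC column type and handle nulls.
      String fieldName = orcSchema.getFieldNames().get(c);
      Object value = record.get(fieldName);
      BytesColumnVector vector = (BytesColumnVector) batch.cols[c];
      vector.setVal(row, String.valueOf(value).getBytes(StandardCharsets.UTF_8));
    }
    if (batch.size == batch.getMaxSize()) {
      writer.addRowBatch(batch);
      batch.reset();
    }
  }

  @Override
  public void close() throws IOException {
    if (batch.size != 0) {
      writer.addRowBatch(batch);
      batch.reset();
    }
    writer.close();
  }
}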

Rollout/Adoption Plan

  • What impact (if any) will there be on existing users?

                 None. Only one columnar storage format can be used for each Hudi dataset, controlled by the "hoodie.table.storage.type" option.

  • If we are changing behavior how will we phase out the older behavior?

                Parquet remains the default, so the existing behavior does not need to be phased out.

  • If we need special migration tools, describe them here.
    • NA
  • When will we remove the existing behavior?
    • Not required.

Test Plan

<Describe in few sentences how the HIP will be tested. How will we know that the implementation works as expected? How will we know nothing broke?>

May be reworked in the future (DataFrame)

Hudi uses an Avro schema for input records and stores the schema information in the commit metadata, and the Parquet writer then uses that Avro schema.

However, ORC's schema is incompatible with the Avro schema, so we should store the StructType (provided by Spark) information in the commit metadata instead.
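As a rough illustration of that idea (hypothetical, not an existing Hudi API), Spark's StructType can be serialized to JSON and restored, so the commit metadata could carry it the same way it carries the Avro schema string today:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.StructType;

// Hypothetical sketch: store the Spark schema in the commit metadata as JSON
// and restore it on the read path.
public class CommitSchemaSketch {

  static String schemaToCommitMeta(Dataset<Row> df) {
    // StructType -> JSON string, e.g. stored under a (hypothetical)
    // "spark.schema" key in the commit metadata.
    return df.schema().json();
  }

  static StructType schemaFromCommitMeta(String json) {
    // JSON string -> StructType, read back from the commit metadata.
    return (StructType) DataType.fromJson(json);
  }
}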

11 Comments

  1. Hi Vinoth Chandar, I think I can keep iterating on this step by step.

    Implementation (key steps)

    1. Storage type option: add an option like "hoodie.table.storage.type=ORC" to the hoodie.properties file, used to distinguish between the different storage types.

    2. Schema compatibility: converting Avro types to ORC types and ORC types back to Avro.

    3. Indexes (a very important part): we know ORC provides built-in indexes (min/max statistics and bloom filters);

        add the _hoodie_record_key column to the bloom filter, so that Hudi can look up the index and tag each incoming record (a reading sketch follows at the end of this list).

    • Rework HoodieRangeInfoHandle
    • Rework HoodieBloomIndex

    4. Implement a HoodieOrcWriter like HoodieParquetWriter to write data in ORC.

    5. Rework HoodieMergeHandle, HoodieCreateHandle, HoodieReadHandle etc.

    6. Rework FileSystemViewManager and HoodieTimeline.
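    To make the index rework above more concrete, here is a minimal sketch of reading the min/max record keys from an ORC file's built-in column statistics, which is roughly what HoodieRangeInfoHandle gets from the Parquet footer today. The utility class and method names are hypothetical, and the column-id arithmetic assumes a flat schema.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.orc.ColumnStatistics;
    import org.apache.orc.OrcFile;
    import org.apache.orc.Reader;
    import org.apache.orc.StringColumnStatistics;

    // Hypothetical utility: read min/max of _hoodie_record_key from ORC column statistics.
    public class OrcUtils {

      public static String[] readMinMaxRecordKeys(Configuration conf, Path orcFilePath)
          throws IOException {
        Reader reader = OrcFile.createReader(orcFilePath, OrcFile.readerOptions(conf));
        // Column statistics are indexed by column id; id 0 is the root struct,
        // so for a flat schema the first top-level field has id 1.
        int keyColumnId = reader.getSchema().getFieldNames().indexOf("_hoodie_record_key") + 1;
        ColumnStatistics[] stats = reader.getStatistics();
        StringColumnStatistics keyStats = (StringColumnStatistics) stats[keyColumnId];
        return new String[] {keyStats.getMinimum(), keyStats.getMaximum()};
      }
    }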

    1. yes.. this is a great project! .. I suggest we do some cleanups to properly introduce abstractions.. before we begin the work.. 


      Specifically, some parts of the code may be assuming base file is parquet today.. We need to introduce a `HoodieBaseFileFormat` and make all access go through that.. I plan to work on that piece... Once we do that, with some config, you just need to implement a few classes... 


      ofc we can work closely on this.. or in parallel, you can get a PoC implementation going which irons out integrations with hive/spark/presto etc (hackathon style).. and we can later combine efforts and make the implementation cleaner? 

  2. lamber-ken assigned you as proposer and adjusted status.. Please redo the RFC as you see fit and raise a discuss thread when ready for initial review from everyone 

    1. Ok, will do that, thanks : )

  3. Started working on this RFC, but there is a JIRA [1] that seems to go against it. What I'm thinking is that we will rework this RFC again after HUDI-685.

    So, I'm not sure whether we first need to explore supporting Spark's DataFrame API (replacing the current RDD[HoodieRecord] abstraction) as well, or not?


    [1] https://issues.apache.org/jira/browse/HUDI-685

  4. lamber-ken Please grab the existing ticket .. Let's decouple this effort from spark dataframe api etc.. it should not really be related..

    If you could follow a similar approach to the original PR opened for this and get ORC support working in our docker demo.. 


    We would then know about all the unknown things upfront and have something working end-end.. Later we can rework the code nicely

  5. I am saying, let's do an ORC hackathon and open a WIP PR, where we can query ORC MOR/COW tables in the docker setup (smile) We will go from there

  6. Hi Vinoth Chandar, sorry for the delay; I ran into some difficulties while writing the RFC:

    • Schema compatibility
    • Supporting native row-by-row record writes
    • Spark DataFrame

    1. Schema compatibility


    ORC can't consume Avro directly; I learned about the schemas of both Avro and ORC.
    Also, I found some implementations we can refer to:

    https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-hive-bundle/nifi-hive-processors/src/main/java/org/apache/nifi/processors/hive/ConvertAvroToORC.java

    2. Supporting native row-by-row record writes


    From ORC's API and the write demo, it looked like ORC can only write in batches.
    But as I dug deeper into ORC, I found a writer that supports writing rows one by one:

    https://github.com/apache/orc/blob/master/java/examples/src/java/org/apache/orc/examples/CoreWriter.java
    https://github.com/apache/orc/blob/master/java/mapreduce/src/java/org/apache/orc/mapreduce/OrcMapreduceRecordReader.java

    3. Spark dataframe


    If we use DataFrames, we won't need to care about the differences between ORC and Parquet.
    But as you said, let's decouple this effort from the Spark DataFrame API and just focus on this RFC:

    write.format("orc").save() / write.format("parquet").save()


    Finally, following your suggestion, I will implement it step by step.

  7. Apologies for the large delay (smile) ..


    >ORC can't consume Avro directly; I learned about the schemas of both Avro and ORC.

    yes.. This sadly is still true.. We need to write and own a convertor ourselves.. seems like.. 


    >If we use DataFrames, we won't need to care about the differences between ORC and Parquet.

    Correct.. but that takes all the control away from us in terms of naming the files, small file handling and also we need to send data to log files and so on... These spark datasources are fairly primitive in these respects.. I think given data types are finite, we can take that overhead instead of giving up these benefits.. 


    Yes.. if you can produce an end-end example, that works on the docker demo setup.. We can iterate from there..