Discussion thread: here
All Hudi datasets are currently stored as Parquet on DFS. ORC is also widely adopted and supported, so the goal is to provide ORC as a serving layer backing Hudi datasets, giving users more control over the columnar format they wish to use.
Hoodie uses Parquet as its default storage format for Copy on Write and Merge on Read operations, so users are forced to store and query data in Parquet. This proposal introduces ORC as an underlying storage format for Hoodie, to expose ORC read-optimized views.
Hudi uses an Avro schema for input records, and the Parquet writer writes records using that Avro schema. ORC does not support Avro natively, so a conversion step is needed.
Add a storage type option (e.g. "hoodie.table.storage.type=ORC") to hoodie.properties.
- Implement a HoodieOrcWriter, just like HoodieParquetWriter, to write data in ORC.
- Implement an OrcReaderIterator, just like ParquetReaderIterator, to read data in ORC.
- Change HoodieInputFormat, or create a new implementation of it, to support ORC (as of now it strictly handles Parquet).
- Rework HoodieBloomIndex to work with ORC.
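The storage type option above could be resolved roughly as follows. This is a minimal sketch, not Hudi code: the class `StorageTypeConfig` and its methods are hypothetical, and the only name taken from this proposal is the key `hoodie.table.storage.type`. It assumes Parquet is used whenever the key is absent, so existing tables are unaffected.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Locale;
import java.util.Properties;

// Hypothetical helper for illustration; not part of the Hudi code base.
final class StorageTypeConfig {
    // Key proposed in this RFC.
    static final String STORAGE_TYPE_KEY = "hoodie.table.storage.type";

    enum StorageType { PARQUET, ORC }

    // Falls back to PARQUET when the key is absent (assumed default).
    static StorageType resolve(Properties props) {
        String raw = props.getProperty(STORAGE_TYPE_KEY, StorageType.PARQUET.name());
        // valueOf throws IllegalArgumentException for unknown formats.
        return StorageType.valueOf(raw.trim().toUpperCase(Locale.ROOT));
    }

    // Parses the text of a properties file such as hoodie.properties.
    static Properties parse(String text) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(text));
        return props;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(resolve(parse("hoodie.table.name=trips\n")));
        System.out.println(resolve(parse("hoodie.table.storage.type=ORC\n")));
    }
}
```

Rejecting unknown values early, rather than silently falling back, would surface a misconfigured table before any file is written.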
- What impact (if any) will there be on existing users?
None. Only one columnar storage format can be used for each Hudi dataset, controlled by the "hoodie.table.storage.type" option.
- If we are changing behavior how will we phase out the older behavior?
Parquet remains the default.
- If we need special migration tools, describe them here.
- When will we remove the existing behavior?
- Is not required.
<Describe in few sentences how the HIP will be tested. How will we know that the implementation works as expected? How will we know nothing broke?>
May be reworked in the future (DataFrame).
Hudi uses an Avro schema for input records and stores the schema information in the commit metadata; the Parquet writer uses that Avro schema.
The ORC schema, however, is incompatible with the Avro schema, so we should store StructType (provided by Spark) information in the commit metadata.
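To make the schema gap concrete, here is a minimal sketch of a type mapping from Avro primitive type names to ORC type names, producing an ORC struct type string. The class `AvroToOrcTypeSketch` is hypothetical and deliberately avoids the Avro and ORC libraries; it covers only flat records of primitive fields and ignores nested records, unions, and logical types, which a real converter would have to handle.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical sketch; covers only flat records with primitive fields.
final class AvroToOrcTypeSketch {
    // Maps an Avro primitive type name to the corresponding ORC type name.
    static String toOrcType(String avroPrimitive) {
        switch (avroPrimitive) {
            case "boolean": return "boolean";
            case "int":     return "int";
            case "long":    return "bigint";
            case "float":   return "float";
            case "double":  return "double";
            case "bytes":   return "binary";
            case "string":  return "string";
            default:
                throw new IllegalArgumentException(
                    "Unsupported Avro type in this sketch: " + avroPrimitive);
        }
    }

    // Builds an ORC struct type string from (field name -> Avro type), keeping field order.
    static String toOrcStruct(Map<String, String> avroFields) {
        StringJoiner joiner = new StringJoiner(",", "struct<", ">");
        for (Map.Entry<String, String> field : avroFields.entrySet()) {
            joiner.add(field.getKey() + ":" + toOrcType(field.getValue()));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("_hoodie_record_key", "string");
        fields.put("ts", "long");
        System.out.println(toOrcStruct(fields));
    }
}
```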
Hi Vinoth Chandar, I think I can keep iterating on this step by step.
1. Storage type option: add an option like "hoodie.table.storage.type=ORC" to the hoodie.properties file, used to distinguish between storage types.
2. Schema compatibility: convert Avro types to ORC types and ORC types to Avro types.
3. Indexes (a very important part): ORC provides built-in indexes (min/max and bloom filter),
so we can add the _hoodie_record_key column to the bloom filter, letting Hudi look up the index and tag each incoming record.
4. Implement a HoodieOrcWriter, like HoodieParquetWriter, to write data in ORC.
5. Rework HoodieMergeHandle, HoodieCreateHandle, HoodieReadHandle, etc.
6. Rework FileSystemViewManager and HoodieTimeline.
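The index step in item 3 relies on a bloom filter over record keys: a negative answer means the key is definitely not in a file, so that file can be skipped when tagging incoming records. The following is a minimal, self-contained sketch of that idea. `RecordKeyBloomSketch` is hypothetical; it is neither ORC's built-in bloom filter nor Hudi's, and the hash scheme (seeded FNV-1a with double hashing) is an assumption chosen for brevity.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Hypothetical bloom filter over record keys; not ORC's or Hudi's implementation.
final class RecordKeyBloomSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    RecordKeyBloomSketch(int numBits, int numHashes) {
        if (numBits <= 0 || numHashes <= 0) {
            throw new IllegalArgumentException("numBits and numHashes must be positive");
        }
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Seeded 32-bit FNV-1a hash.
    private static int fnv1a(byte[] data, int seed) {
        int hash = 0x811C9DC5 ^ seed;
        for (byte b : data) {
            hash ^= (b & 0xff);
            hash *= 16777619;
        }
        return hash;
    }

    // Double hashing: the i-th bit position is derived from two base hashes.
    private int position(byte[] key, int i) {
        int h1 = fnv1a(key, 0);
        int h2 = fnv1a(key, 0x9E3779B9) | 1; // force odd so the step is never zero
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(String recordKey) {
        byte[] key = recordKey.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(key, i));
        }
    }

    // false: key is definitely absent; true: key may be present (false positives possible).
    boolean mightContain(String recordKey) {
        byte[] key = recordKey.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }
}
```

Because the filter can return false positives, a positive answer would still need to be confirmed against the actual keys stored in the file.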
yes.. this is a great project! .. I suggest we do some cleanups to properly introduce abstractions.. before we begin the work..
Specifically, some parts of the code may be assuming base file is parquet today.. We need to introduce a `HoodieBaseFileFormat` and make all access go through that.. I plan to work on that piece... Once we do that, with some config, you just need to implement a few classes...
ofc we can work closely on this.. or in parallel, you can get a PoC implementation going which irons out integrations with hive/spark/presto etc (hackathon style).. and we can later combine efforts and make the implementation cleaner?
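For readers following along, the abstraction mentioned above could look roughly like the sketch below. Only the idea of a base file format that all access goes through comes from this thread; the enum `Format`, its file extensions, and `fromFileName` are assumptions made for illustration and do not describe the eventual `HoodieBaseFileFormat` design.

```java
import java.util.Locale;

// Hypothetical sketch of a base-file-format abstraction; names are illustrative only.
final class BaseFileFormatSketch {
    enum Format {
        PARQUET(".parquet"),
        ORC(".orc");

        private final String extension;

        Format(String extension) {
            this.extension = extension;
        }

        String extension() {
            return extension;
        }

        // Resolves the format from a base file name instead of assuming Parquet.
        static Format fromFileName(String fileName) {
            String lower = fileName.toLowerCase(Locale.ROOT);
            for (Format format : values()) {
                if (lower.endsWith(format.extension)) {
                    return format;
                }
            }
            throw new IllegalArgumentException("Unknown base file format: " + fileName);
        }
    }

    public static void main(String[] args) {
        System.out.println(Format.fromFileName("part-0001.orc"));
    }
}
```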
lamber-ken assigned you as proposer and adjusted status.. Please redo the RFC as you see fit and raise a discuss thread when ready for initial review from everyone
Ok, will do that, thanks : )
I started working on this RFC, but there is a JIRA that seems to conflict with it. What I am thinking is that we rework this RFC again after HUDI-685.
So I'm not sure whether we first need to explore support for Spark's DataFrame APIs replacing the current RDD[HoodieRecord] abstraction, or not?
lamber-ken Please grab the existing ticket .. Let's decouple this effort from spark dataframe api etc.. it should not really be related..
If you could follow a similar approach to the original PR opened for this and get ORC support working in our docker demo..
We would then know about all the unknown things upfront and have something working end-end.. Later we can rework the code nicely
I am saying, let's do an ORC hackathon and open a WIP PR, where we can query ORC MOR/COW tables in the docker setup.. We will go from there
Hi Vinoth Chandar, sorry for the delay, I ran into some difficulty writing the RFC.
1. Schema compatibility
ORC can't support the Avro format directly, so I studied the schemas of Avro and ORC.
I also found some implementations we can refer to.
2. Natively support writing records row by row
From ORC's API and the write demo, ORC only supports writing in batches.
But as I got deeper into ORC, I found a new writer that supports writing rows one by one.
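Independent of which writer is eventually used, a row-at-a-time interface can be layered over a batch-only writer by buffering rows and handing over full batches. The sketch below shows that adapter pattern in plain Java. `RowBufferingWriter` and its `batchSink` callback are hypothetical stand-ins, not ORC classes, and this is not the writer referred to above; in a real integration the sink would fill and submit an ORC row batch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical adapter: exposes write(row) on top of a sink that only accepts batches.
final class RowBufferingWriter<R> implements AutoCloseable {
    private final int batchSize;
    private final Consumer<List<R>> batchSink; // stand-in for a batch-oriented writer
    private List<R> buffer;

    RowBufferingWriter(int batchSize, Consumer<List<R>> batchSink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.batchSink = batchSink;
        this.buffer = new ArrayList<>(batchSize);
    }

    // Buffers one row and flushes as soon as the batch is full.
    void write(R row) {
        buffer.add(row);
        if (buffer.size() == batchSize) {
            flush();
        }
    }

    // Hands the buffered rows to the sink; does nothing if the buffer is empty.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        batchSink.accept(buffer);
        buffer = new ArrayList<>(batchSize);
    }

    // Flushes the final, possibly partial, batch.
    @Override
    public void close() {
        flush();
    }
}
```

Callers must close the writer (for example with try-with-resources), otherwise the last partial batch is never handed to the sink.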
3. Spark DataFrame
If we use DataFrames, we won't need to care about the differences between ORC and Parquet.
As you said, let's decouple this effort from the Spark DataFrame API etc. and just focus on this RFC.
Finally, following your suggestion, I will implement it step by step.
Apologies for the large delay ..
>ORC can't support the Avro format directly, so I studied the schemas of Avro and ORC.
yes.. This sadly is still true.. We need to write and own a converter ourselves.. seems like..
>If we use DataFrames, we won't need to care about the differences between ORC and Parquet.
Correct.. but that takes all the control away from us in terms of naming the files, small file handling and also we need to send data to log files and so on... These spark datasources are fairly primitive in these respects.. I think given data types are finite, we can take that overhead instead of giving up these benefits..
Yes.. if you can produce an end-end example, that works on the docker demo setup.. We can iterate from there..
Vinoth Chandar no need to say that, you are always welcome. I replied to you there: https://issues.apache.org/jira/browse/HUDI-57