The Spark DAG for this storage, is relatively simpler. The key goal here is to group the tagged Hudi record RDD, into a series of updates and inserts, by using a partitioner. To achieve the goals of maintaining file sizes, we first sample the input to obtain a workload profile that understands the spread of inserts vs updates, their distribution among the partitions etc. With this information, we bin-pack the records such that Any remaining records after that, are again packed into new file id groups, again meeting the size requirements.
A def~table-type where a def~table's def~commits are fully merged into def~table during a def~write-operation. This can be seen as "imperative ingestion", "compaction" of the happens right away. No def~log-files are written and def~file-slices contain only def~base-file. (e.g a single parquet file constitutes one file slice)
2 Comments
SemanticBeeng
https://hudi.apache.org/docs/docker_demo.html
https://hudi.apache.org/docs/writing_data.html
it seems that there is no such thing as "table type" but more like "merge mode": this mode would have to stay constant until an ingestion ... session ends (a number of updates) but then can be flipped between COW and def~merge-on-read (MOR).
Most of the current documentation depicts Hudi from ingestion point of view and this makes it look like "table type": but it is a mode, not an intrinsic attribute of the table itself.
Delta streamer may see Hudi this way because it is an ingestion tool: but from Hudi's "programming model" point of view there is much more than "ingestion".
Seems to me.but Vinoth Chandar please advise.
What is supposed to happen by design
Is there a design decision in wiki or tests to specify the desired behavior
Vinoth Chandar
>>session ends (a number of updates) but then can be flipped between COW and MOR
We can do an one time conversion from COW to MOR, but reverse won't work as expected..
>> Most of the current documentation depicts Hudi from ingestion point of view and this makes it look like "table type": but it is a mode, not an intrinsic attribute of the table itself.
It does not just affect ingestion. it for example affects how we register tables in Hive and so forth. So, I feel the two are fairly different , not just a mode.