Definition
An item in the `Hudi` ingestion processing timeline (a log of def~instant-actions that are performed on a def~table, ordered by def~instant-time).
Design details
At its core, Hudi maintains a ... def~instant-action performed on the ... def~table at different ... def~table, while also efficiently supporting retrieval of data in the order in which it was written. The timeline is akin to a redo/transaction log found in databases, and consists of a set of def~timeline-instants. Hudi guarantees that the actions performed on the timeline are atomic and timeline-consistent based on the instant time.
Each instant consists of the following components:
- Action type: the type of action performed on the dataset
- Instant time: typically a timestamp (e.g. 20190117010349) that increases monotonically in the order of the actions' begin times
- Instant state: the current state of the instant
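The three components of an instant, and the ordering by instant time, can be sketched as a small data structure. This is an illustrative model only, not Hudi's actual classes: the names `Instant` and `ordered`, and the state strings used below, are assumptions made for this example. The 14-digit timestamp layout is inferred from the example value 20190117010349.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Instant:
    """Hypothetical model of a timeline instant (not Hudi's real class)."""
    instant_time: str  # e.g. "20190117010349", assumed to be yyyyMMddHHmmss
    action: str        # e.g. "commit", "clean"
    state: str         # state label; the values used here are assumed

    def begin_time(self) -> datetime:
        # Parse the fixed-width timestamp into a datetime
        return datetime.strptime(self.instant_time, "%Y%m%d%H%M%S")


def ordered(instants):
    # Fixed-width digit strings sort lexicographically in time order
    return sorted(instants, key=lambda i: i.instant_time)


timeline = ordered([
    Instant("20190117020000", "clean", "completed"),
    Instant("20190117010349", "commit", "completed"),
])
```

Because the instant time is a fixed-width string of digits, sorting on the raw string gives the same order as sorting on the parsed timestamp, which keeps the comparison cheap.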
Key action types performed include:
- COMMITS - A commit denotes an atomic write of a batch of records into a dataset.
- CLEANS - Background activity that gets rid of older versions of files in the dataset that are no longer needed.
- DELTA_COMMIT - A delta commit refers to an atomic write of a batch of records into a Merge On Read (MOR) storage type of dataset, where some or all of the data could be written just to delta logs.
- COMPACTION - Background activity to reconcile differential data structures within Hudi, e.g. moving updates from row-based delta log files to columnar file formats. Internally, compaction manifests as a special commit on the timeline.
- ROLLBACK - Indicates that a commit/delta commit was unsuccessful and rolled back, removing any partial files produced during such a write.
- SAVEPOINT - Marks certain file groups as "saved", such that the cleaner will not delete them. It helps restore the dataset to a point on the timeline in case of disaster/data recovery scenarios.
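Any given instant can be in one of the following instant states:
- REQUESTED - Denotes that an action has been scheduled, but has not yet started
- INFLIGHT - Denotes that the action is currently being performed
- COMPLETED - Denotes completion of an action on the timeline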
The timeline is implemented as a set of files under the `.hoodie` def~metadata-folder directly under the def~table-basepath. Specifically, while the most recent instants are maintained as individual files, the older instants are archived to the def~timeline-archival folder, to bound the number of files listed by writers and queries.
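A minimal sketch of reading recent instants back from the metadata folder follows. It assumes a file-name layout of `<instant_time>.<action>[.<state>]`, where a missing state suffix is treated as completed; that layout, the helper names `parse_instant_file` and `list_active_instants`, and the sample file names are assumptions for illustration, and the actual naming may differ between Hudi versions.

```python
import tempfile
from pathlib import Path


def parse_instant_file(name):
    """Split '<instant_time>.<action>[.<state>]' (assumed layout) into parts."""
    parts = name.split(".")
    if len(parts) < 2 or not parts[0].isdigit():
        return None  # not an instant file under the assumed layout
    state = parts[2] if len(parts) > 2 else "completed"
    return parts[0], parts[1], state


def list_active_instants(base_path):
    """Return (instant_time, action, state) tuples ordered by instant time."""
    meta = Path(base_path) / ".hoodie"
    found = []
    for p in meta.iterdir():
        if p.is_file():  # skips sub-folders such as an archival folder
            parsed = parse_instant_file(p.name)
            if parsed:
                found.append(parsed)
    return sorted(found)


# Build a throwaway table base path with a few sample instant files
with tempfile.TemporaryDirectory() as base:
    meta = Path(base) / ".hoodie"
    meta.mkdir()
    for name in ["20190117020000.clean", "20190117010349.commit",
                 "20190117030000.commit.inflight", "hoodie.properties"]:
        (meta / name).touch()
    instants = list_active_instants(base)
```

Listing only plain files in the metadata folder is what keeps the cost bounded here: anything moved into a sub-folder by archival no longer shows up in the scan.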
Design decisions
- #todo