A log of def~instant-actions that are performed on a def~table, ordered by def~instant-time.

Design details

At its core, Hudi maintains a timeline of all def~instant-actions performed on the def~table at different instants of time. This helps provide instantaneous views of the def~table while also efficiently supporting retrieval of data in the order in which it was written. The timeline is akin to the redo/transaction log found in databases and consists of a set of def~timeline-instants. Hudi guarantees that the actions performed on the timeline are atomic and timeline-consistent based on the instant time. The timeline is implemented as a set of files under the `.hoodie` def~metadata-folder, directly under the def~table-basepath. Specifically, the most recent instants are maintained as individual files, while older instants are archived to the def~timeline-archival folder to bound the number of files listed by writers and queries.
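As a rough illustration of the file-backed timeline described above, the sketch below parses instant file names into (time, action, state) records and orders them by instant time. The `<time>.<action>[.<state>]` naming pattern, the `Instant` class, and the helper names are assumptions made for this example; they are not taken from the Hudi code base.

```python
from dataclasses import dataclass

# Hypothetical state suffixes; a file name without a suffix is assumed COMPLETED.
_PENDING_SUFFIXES = {"requested": "REQUESTED", "inflight": "INFLIGHT"}


@dataclass(frozen=True, order=True)
class Instant:
    time: str    # instant time as a fixed-width timestamp string
    action: str  # e.g. "commit", "clean"
    state: str   # "REQUESTED", "INFLIGHT" or "COMPLETED"


def parse_instant(file_name: str) -> Instant:
    """Parse an assumed '<time>.<action>[.<state>]' file name."""
    parts = file_name.split(".")
    if len(parts) == 2:
        time, action = parts
        return Instant(time, action, "COMPLETED")
    if len(parts) == 3 and parts[2] in _PENDING_SUFFIXES:
        time, action, suffix = parts
        return Instant(time, action, _PENDING_SUFFIXES[suffix])
    raise ValueError(f"unrecognised instant file name: {file_name!r}")


def load_timeline(file_names):
    """Return instants ordered by instant time.

    Fixed-width timestamp strings sort lexicographically in time order,
    so a plain sort is enough for this sketch.
    """
    return sorted(parse_instant(name) for name in file_names)
```

Given a listing of the metadata folder, `load_timeline(names)` would return the instants oldest first, which mirrors the "log ordered by instant time" view of the timeline.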


Key def~instant-action types include:

  • COMMITS - `action type` which denotes an atomic write of a batch of records into a def~table (see def~commit).
  • CLEANS - `action type` which denotes a background activity that gets rid of older versions of files in the def~table that are no longer needed.
  • DELTA_COMMIT - `action type` which denotes an atomic write of a batch of records into a def~merge-on-read (MOR) def~table-type of def~table, where some or all of the data could be written only to delta logs (see def~commit).
  • COMPACTION - `action type` which denotes a background activity to reconcile differential data structures within Hudi, e.g. merging updates from delta log files onto columnar def~base-files. Internally, compaction manifests as a special def~commit on the timeline (see def~timeline).
  • ROLLBACK - `action type` which denotes that a def~instant-action of type commit/delta commit was unsuccessful and was rolled back, removing any partial files produced during such a write.
  • SAVEPOINT - `action type` which marks certain file groups as “saved”, such that the cleaner will not delete them. It helps restore the def~table to a point on the timeline in disaster/data recovery scenarios.
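The action types listed above can be modelled as a small enumeration. The sketch below is illustrative only: the `ActionType` class, the three groupings, and the `category` helper are names invented for this example, and the "recovery" label is an editorial grouping rather than Hudi terminology.

```python
from enum import Enum, auto


class ActionType(Enum):
    COMMITS = auto()
    CLEANS = auto()
    DELTA_COMMIT = auto()
    COMPACTION = auto()
    ROLLBACK = auto()
    SAVEPOINT = auto()


# Grouping assumed for illustration, following the descriptions above:
# atomic writes of record batches, background activities, and
# actions that support undoing or restoring table state.
WRITES = frozenset({ActionType.COMMITS, ActionType.DELTA_COMMIT})
BACKGROUND = frozenset({ActionType.CLEANS, ActionType.COMPACTION})
RECOVERY = frozenset({ActionType.ROLLBACK, ActionType.SAVEPOINT})


def category(action: ActionType) -> str:
    """Return the illustrative category name for an action type."""
    if action in WRITES:
        return "write"
    if action in BACKGROUND:
        return "background"
    return "recovery"
```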

Any given instant can be in one of the following states:

  • REQUESTED - Denotes that an action has been scheduled but has not yet started.
  • INFLIGHT - Denotes that the action is currently being performed.
  • COMPLETED - Denotes completion of an action on the timeline.
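One way to picture the three states is as a forward-only lifecycle. The sketch below assumes that an instant moves from REQUESTED to INFLIGHT to COMPLETED and that only COMPLETED instants are treated as visible, which is one simple reading of the atomicity guarantee; the `State` enum and both helper functions are hypothetical names, not Hudi APIs.

```python
from enum import Enum


class State(Enum):
    REQUESTED = 1
    INFLIGHT = 2
    COMPLETED = 3


# Assumed forward-only transitions; COMPLETED is terminal.
_NEXT_STATE = {
    State.REQUESTED: State.INFLIGHT,
    State.INFLIGHT: State.COMPLETED,
}


def advance(state: State) -> State:
    """Move an instant to its next state, or fail if it is already terminal."""
    if state not in _NEXT_STATE:
        raise ValueError(f"{state.name} is a terminal state")
    return _NEXT_STATE[state]


def completed_instant_times(instants):
    """Keep only the instant times whose action has completed.

    `instants` is an iterable of (instant_time, State) pairs.
    """
    return [time for time, state in instants if state is State.COMPLETED]
```

For example, a reader that consulted only `completed_instant_times(...)` would never observe the partial output of an action that is still REQUESTED or INFLIGHT.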

Design decisions

  1. #todo

Related concepts

  1. def~table
  2. instant state
  3. def~instant-action
  4. def~instant-time
  5. def~commit
  6. file format

Status (draft)

