Page History

Problem statement

When a gobblin job gets killed or canceled, it should shutdown gracefully: the job must end up with a recoverable state (fault tolerant):

Discoverability of a valid last known good state
Kafka audit count exactly once
Lock revocation

Last known good state

A last know good state is the last state where both job state and data were successfully committed. Currently, Gobblin assumes the last known good job state is `current.jst`. While persisting the job state of the running job, a copy will be created to replace `current.jst`. The phases are:

Image Removed Image Added

Phase 1: current last know good state is `20170702.jst`

Phase 2: the running job persisted state `20170703.jst` and made a copy `_tmp_/current.jst`

(copy as a way to achieve file completeness)

Phase 3: delete `current.jst`

Phase 4: rename `_tmp_/current.jst` to `current.jst`

It has 2 problems:

1) If the job exits between phase 3 and 4, it ends up with a non-recoverable state

2) If the job is killed between phase 2 and phase 3, the `current.jst` shadows the real last known good state

We're looking for a straightforward, robust, and efficient way to discover the last known good state in the state store. Here are 4 options:

Option 1: Symlink

Hadoop 2 supports symlink. Instead of copying the actual data, the state store changes the reference of `current.jst`. This approach doesn't help with problem 1. Not all file systems support symlink

...

Page tree

Versions Compared

Old Version 1

New Version 2

Key