Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Problem statement 

When a gobblin job gets killed or canceled, it should shutdown gracefully: the job must end up with a recoverable state (fault tolerant): 

  • Discoverability of a valid last known good state 

  • Kafka audit count exactly once 

  • Lock revocation 

     

Last known good state 

A last know good state is the last state where both job state and data were successfully committed. Currently, Gobblin assumes the last known good job state is `current.jst`. While persisting the job state of the running job, a copy will be created to replace `current.jst`. The phases are: 

  Image Removed  Image Added

Phase 1: current last know good state is `20170702.jst` 

Phase 2: the running job persisted state `20170703.jst` and made a copy `_tmp_/current.jst` 

(copy as a way to achieve file completeness) 

Phase 3: delete `current.jst` 

Phase 4: rename `_tmp_/current.jst` to `current.jst` 

  

It has 2 problems: 

1) If the job exits between phase 3 and 4, it ends up with a non-recoverable state 

2) If the job is killed between phase 2 and phase 3, the `current.jst` shadows the real last known good state 

We're looking for a straightforward, robust, and efficient way to discover the last known good state in the state store. Here are 4 options: 

Option 1: Symlink 

Hadoop 2 supports symlink. Instead of copying the actual data, the state store changes the reference of `current.jst`. This approach doesn't help with problem 1. Not all file systems support symlink 

...