The problem: The JobManager (JM) is a single point of failure. When it crashes, the TaskManagers (TMs) fail all running jobs and try to reconnect to the same JM. A failed JM loses all state and cannot resume the running jobs, even if it recovers and the TMs reconnect.

Solution: Implement JM fault tolerance/high availability by running multiple JM instances, with one as leader and the other(s) on standby. The exact coordination and state update protocol between JMs, TMs, and clients is the topic of this document.
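To make the leader/standby idea concrete, here is a minimal in-memory sketch. It is not the protocol this document specifies and it does not use ZooKeeper's API. It assumes one common election rule (the candidate with the lowest sequence number leads). The class name `LeaderElectionSketch`, its methods, and the IDs `jm-1`/`jm-2` are hypothetical.

```java
import java.util.Optional;
import java.util.TreeMap;

/**
 * Hypothetical in-memory model of leader/standby JobManagers.
 * Assumption: the candidate with the lowest sequence number is the leader.
 */
class LeaderElectionSketch {
    // Candidates ordered by the sequence number assigned when they joined.
    private final TreeMap<Long, String> candidates = new TreeMap<>();
    private long nextSequence = 0;

    /** Registers a JM as a candidate and returns its sequence number. */
    long join(String jobManagerId) {
        long sequence = nextSequence++;
        candidates.put(sequence, jobManagerId);
        return sequence;
    }

    /** Removes a candidate, e.g. after its JM process has crashed. */
    void leave(long sequence) {
        candidates.remove(sequence);
    }

    /** Returns the current leader, or empty if no JM is registered. */
    Optional<String> leader() {
        return candidates.isEmpty()
                ? Optional.empty()
                : Optional.of(candidates.firstEntry().getValue());
    }

    public static void main(String[] args) {
        LeaderElectionSketch election = new LeaderElectionSketch();
        long first = election.join("jm-1");
        election.join("jm-2");
        System.out.println("leader: " + election.leader());
        election.leave(first); // simulate a crash of the current leader
        System.out.println("leader after failover: " + election.leader());
    }
}
```

In this model a standby takes over as soon as the leader's entry disappears. A real deployment would also need TMs and clients to learn about the new leader, which this sketch leaves out.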

JIRA: FLINK-2287

Distributed Coordination with ZooKeeper (FLINK-2288)

Having standby JM instances requires distributed coordination between JM, TM, and clients. For this, we will use ZooKeeper (ZK).

Pros:

ZK dependency

JM/TM/Client configuration

ZK client configuration in Flink

ZK server configuration

  1. ZK managed by Flink: Flink provides a script to start ZK servers
    1. Configuration via zoo.cfg
  2. Dedicated ZK cluster (user sets up ZK)
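For the Flink-managed option, a zoo.cfg for a three-server ZK quorum might look roughly like the fragment below. The keys are standard ZooKeeper settings. The host names, data directory, and timing values are placeholders chosen for illustration, not values prescribed by this document.

```
# Basic time unit in milliseconds, used for heartbeats and timeouts
tickTime=2000
# Ticks a follower may take to connect and sync to the leader initially
initLimit=10
# Ticks a follower may lag behind the leader before being dropped
syncLimit=5
# Snapshot directory (placeholder path)
dataDir=/tmp/zookeeper
# Port on which clients (here: JM, TM, and Flink clients) connect
clientPort=2181
# Quorum members as server.<id>=<host>:<peer-port>:<election-port> (placeholder hosts)
server.1=zk-host-1:2888:3888
server.2=zk-host-2:2888:3888
server.3=zk-host-3:2888:3888
```

With a dedicated ZK cluster, this file is maintained by the user, and Flink only needs the quorum's client addresses.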

Startup scripts

Leader Election