Refactoring of client classes for cluster management

Code that is introduced early can stay in a code base for a long time. This is certainly true for Flink's `Client` class, which has been around since 2011. So far, it has been the central point for all job-related actions. It acts primarily as a proxy between the command-line frontend and the JobManager for job submission, job cancellation, retrieval of accumulator results, and savepoint creation.

When Yarn support was added to Flink, the Client remained agnostic of the cluster setup. The command-line interface (CliFrontend) managed the cluster lifecycle by setting up the Yarn cluster, extracting the cluster's JobManager address, and setting up a Client to submit jobs. On the one hand, this meant that Standalone, Yarn, and soon Mesos cluster setup and management had different code paths, all of which had to be maintained in the CliFrontend. On the other hand, it inherently tied cluster management dependencies (Yarn) to the "flink-clients" module.

FLINK-3667 coupled the Client (now ClusterClient) with cluster-related lifecycle methods. The original Client class became the base class for the cluster modes Flink supports (Standalone, Yarn, and, still to be done, Mesos). This enabled us to delegate all cluster-specific code to the implementation for each cluster mode. In the process, we also created an interface for custom command-line code (CustomCommandLine), which enabled us to clearly separate the general command-line client code from the specific command lines for other cluster managers. The result is a clear separation between the "flink-clients" and the "flink-yarn" modules.
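The structure described above can be illustrated with a minimal sketch. It is not Flink's actual code: the method names (`getClusterIdentifier`, `shutdownCluster`, `isClusterRunning`, `isActive`, `createClient`), the `--yarn-app` flag, and the `CliFrontendSketch` helper are assumptions made for illustration. Only the class names ClusterClient and CustomCommandLine come from the text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified base class: job actions are implemented once,
// lifecycle methods are left to each cluster mode.
abstract class ClusterClient {
    private final List<String> submittedJobs = new ArrayList<>();

    // Job-related action shared by every cluster mode.
    String submitJob(String jobName) {
        submittedJobs.add(jobName);
        return jobName + "@" + getClusterIdentifier();
    }

    List<String> getSubmittedJobs() { return submittedJobs; }

    // Cluster-specific lifecycle methods (names are assumptions).
    abstract String getClusterIdentifier();
    abstract void shutdownCluster();
    abstract boolean isClusterRunning();
}

class StandaloneClusterClient extends ClusterClient {
    private final String jobManagerAddress;

    StandaloneClusterClient(String jobManagerAddress) {
        this.jobManagerAddress = jobManagerAddress;
    }

    @Override String getClusterIdentifier() { return "standalone:" + jobManagerAddress; }

    // Assumption: the client does not own a standalone cluster, so this is a no-op.
    @Override void shutdownCluster() { }

    @Override boolean isClusterRunning() { return true; }
}

class YarnClusterClient extends ClusterClient {
    private final String applicationId;
    private boolean running = true;

    YarnClusterClient(String applicationId) {
        this.applicationId = applicationId;
    }

    @Override String getClusterIdentifier() { return "yarn:" + applicationId; }

    // A real implementation would talk to Yarn here; the sketch only flips a flag.
    @Override void shutdownCluster() { running = false; }

    @Override boolean isClusterRunning() { return running; }
}

// Hypothetical plug-in point that keeps cluster-specific option handling
// out of the general command-line code.
interface CustomCommandLine {
    boolean isActive(String[] args);
    ClusterClient createClient(String[] args);
}

class YarnCommandLine implements CustomCommandLine {
    // "--yarn-app <id>" is an invented flag for this sketch.
    @Override public boolean isActive(String[] args) {
        return args.length >= 2 && "--yarn-app".equals(args[0]);
    }

    @Override public ClusterClient createClient(String[] args) {
        return new YarnClusterClient(args[1]);
    }
}

final class CliFrontendSketch {
    // The general frontend only picks the first active command line and
    // falls back to a standalone client otherwise.
    static ClusterClient createClient(
            List<CustomCommandLine> commandLines, String[] args, String defaultJobManagerAddress) {
        for (CustomCommandLine commandLine : commandLines) {
            if (commandLine.isActive(args)) {
                return commandLine.createClient(args);
            }
        }
        return new StandaloneClusterClient(defaultJobManagerAddress);
    }
}
```

In this sketch the frontend never references Yarn types directly. It only sees the CustomCommandLine interface and the ClusterClient base class, which is the kind of decoupling that would allow the Yarn-specific classes to live in a separate module.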

In addition to unifying the lifecycle management, we saw another shortcoming in the process of resuming a cluster: it wasn't possible to programmatically resume running clusters. A resumed cluster was simply treated as a standalone Flink cluster. In the case of Yarn, it was not possible to shut down a resumed cluster or to resume the cluster with a Yarn-specific attribute, e.g. the Yarn application id.

FLINK-3937 addressed these issues by building on the refactored code to resume a Yarn cluster using either the Yarn properties file or the Yarn application id.
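As a rough illustration of resolving which cluster to resume, the following sketch picks the application id from an explicitly supplied value or, failing that, from a loaded properties file. The class name `YarnResumeSketch`, the property key `applicationID`, and the rule that an explicit id takes precedence are all assumptions, not confirmed details of FLINK-3937.

```java
import java.util.Optional;
import java.util.Properties;

final class YarnResumeSketch {
    // Assumed key name; the actual properties file may use a different one.
    static final String APPLICATION_ID_KEY = "applicationID";

    // Returns the id of the cluster to resume, or empty if neither source provides one.
    // Assumption: an explicitly supplied id wins over the properties file.
    static Optional<String> resolveApplicationId(String explicitId, Properties yarnProperties) {
        if (explicitId != null && !explicitId.isEmpty()) {
            return Optional.of(explicitId);
        }
        return Optional.ofNullable(yarnProperties.getProperty(APPLICATION_ID_KEY));
    }
}
```

Once the id is known, a Yarn-aware client can be created for it, so that the resumed cluster is no longer treated as a plain standalone cluster and can also be shut down.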

Next, we want to move the job-related actions (submit, cancel, ...) from the ClusterClient to a dedicated class that can be used to monitor and control Flink jobs. The ClusterClient class would then only have a job submission method, which returns a "JobClient" with all the job-related methods.
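One possible shape of this split is sketched below. Since the JobClient is only planned at this point, everything here is hypothetical: the `JobState` enum, the method names, and the `SlimClusterClient` class (a stand-in for the reduced ClusterClient) are illustrative assumptions.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical job states for the sketch.
enum JobState { RUNNING, CANCELED }

// Hypothetical handle that carries all job-related methods.
interface JobClient {
    String getJobId();
    JobState getState();
    void cancel();
}

// Stand-in for a ClusterClient whose only job-related method is submission.
class SlimClusterClient {
    private final AtomicInteger nextId = new AtomicInteger(1);

    JobClient submitJob(String jobName) {
        final String jobId = jobName + "-" + nextId.getAndIncrement();
        // A real implementation would submit to the JobManager; the sketch
        // just returns an in-memory handle that tracks its own state.
        return new JobClient() {
            private JobState state = JobState.RUNNING;

            @Override public String getJobId() { return jobId; }
            @Override public JobState getState() { return state; }
            @Override public void cancel() { state = JobState.CANCELED; }
        };
    }
}
```

With this shape, monitoring and control calls go through the returned handle, and the cluster client keeps only cluster lifecycle methods plus submission.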
