Document the state by adding a label to the FLIP page with one of "discussion", "accepted", "released", "rejected".

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

While working with Flink, we identified the need to minimize downtime when redeploying pipelines. The need is most acute when startup times are long and the resulting downtime is unacceptable for certain workloads.

As stateful applications, Flink pipelines require the data stream to be halted during deployment operations. In some cases, stopping this data flow can have a negative impact on downstream consumers. The following figure illustrates this:

[Figure: image.png]

A first, simple phase of this Blue/Green deployment architecture can help overcome this issue by starting a second, identical pipeline alongside the main one. The Blue/Green promotion (switch) can be done as soon as the second deployment is successfully running.


Pros:

  • Existing use cases can immediately benefit from this simple approach
  • Can work as-is with existing Flink jobs

Cons:

  • At-least-once semantics
  • Records/data will be duplicated for as long as both pipelines are running concurrently

A follow-up effort builds on this one, adding coordination between the jobs and aiming for exactly-once semantics.

Public Interfaces

A new Custom Resource Definition (CRD) representing a Blue/Green deployment for the Flink Kubernetes Operator.

This effort targets Application Jobs only (not Session Jobs).

CRD

The FlinkBlueGreenDeploymentSpec → FlinkDeploymentSpec relationship is a “has-a” relationship, analogous to the ReplicaSetSpec → PodSpec relationship.

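public class FlinkBlueGreenDeploymentSpec {
   // Template for the managed FlinkDeployment (the "has-a" relationship)
   private FlinkDeploymentTemplateSpec template;
}

public class FlinkDeploymentTemplateSpec {
   // Metadata to apply to the FlinkDeployment created from this template
   private ObjectMeta metadata;
   // The regular FlinkDeployment spec, embedded unchanged
   private FlinkDeploymentSpec spec;
   private Map<String, Object> additionalProperties = ...
}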
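The nesting above can be sketched in a self-contained form. This is only an illustration of the "has-a" relationship, not the operator's code: ObjectMeta and FlinkDeploymentSpec are replaced by minimal stand-in classes, the fields image and parallelism are illustrative, and the instantiate helper is hypothetical. It assumes a controller would copy the template into a fresh spec for each child deployment.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch only: stand-in types so the example compiles without the operator on the classpath. */
final class BlueGreenSpecSketch {

    /** Stand-in for the Kubernetes ObjectMeta type; only labels are modeled. */
    static final class ObjectMeta {
        final Map<String, String> labels = new LinkedHashMap<>();
    }

    /** Stand-in for FlinkDeploymentSpec; both fields are illustrative. */
    static final class FlinkDeploymentSpec {
        String image;
        int parallelism;
    }

    /** Mirrors the template class proposed above (additionalProperties omitted). */
    static final class FlinkDeploymentTemplateSpec {
        ObjectMeta metadata = new ObjectMeta();
        FlinkDeploymentSpec spec = new FlinkDeploymentSpec();
    }

    /** The Blue/Green spec "has a" template, which in turn "has a" FlinkDeploymentSpec. */
    static final class FlinkBlueGreenDeploymentSpec {
        FlinkDeploymentTemplateSpec template = new FlinkDeploymentTemplateSpec();
    }

    /** Hypothetical helper: copies the template into a fresh spec for one child deployment. */
    static FlinkDeploymentSpec instantiate(FlinkBlueGreenDeploymentSpec parent) {
        FlinkDeploymentSpec child = new FlinkDeploymentSpec();
        child.image = parent.template.spec.image;
        child.parallelism = parent.template.spec.parallelism;
        return child;
    }

    public static void main(String[] args) {
        FlinkBlueGreenDeploymentSpec parent = new FlinkBlueGreenDeploymentSpec();
        parent.template.spec.image = "example/pipeline:2"; // illustrative value
        parent.template.spec.parallelism = 4;              // illustrative value
        FlinkDeploymentSpec green = instantiate(parent);
        System.out.println(green.image + " x" + green.parallelism);
    }
}
```

Copying rather than sharing the template keeps each child spec independent, so a later change to the parent does not silently alter a deployment that is already running.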

Proposed Changes

This functionality will be added as an extension to the existing Flink K8s Operator via a new controller (e.g. FlinkBlueGreenDeploymentController), with the capability of managing the lifecycle of these deployments.

  • A new CRD (e.g. FlinkBlueGreenDeployment) will be introduced
  • The new FlinkBlueGreenDeploymentController will manage this CRD and hide the details of the actual Blue/Green (Active/Standby) jobs from the user.
  • The lifecycle of the actual jobs will be delegated to the existing FlinkDeployment controller.

Controller Reconciliation Logic

Event Sequence for a Blue/Green

(A new deployment is treated as a regular FlinkDeployment.)

Simple sequence of events when transitioning from A → B:

  • NOTES:
    • TODO: Starting B from A’s latest checkpoint is optional
    • Checkpoints need to be externally accessible and they need to exist for the duration of the transition.
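The transition described here (start B alongside A, optionally from A's latest checkpoint, and promote B once it is running) might be sketched as a small state machine. Every type below is hypothetical: the Deployments interface stands in for whatever the controller would use to create and observe child FlinkDeployment resources, and FakeDeployments exists only to exercise the sketch. Deleting the previous deployment right after promotion is an assumption; the proposal only states that duplication lasts while both pipelines run.

```java
import java.util.EnumSet;
import java.util.Optional;
import java.util.Set;

/** Sketch of a Blue/Green transition; all types here are hypothetical. */
final class BlueGreenTransitionSketch {

    enum Color {
        BLUE, GREEN;
        Color other() { return this == BLUE ? GREEN : BLUE; }
    }

    /** Hypothetical view of the child deployments a controller would manage. */
    interface Deployments {
        void start(Color color, Optional<String> restoreFrom);
        boolean isRunning(Color color);
        Optional<String> latestCheckpoint(Color color);
        void delete(Color color);
    }

    private final Deployments deployments;
    private final boolean restoreFromCheckpoint;
    private Color active;
    private boolean transitioning;

    BlueGreenTransitionSketch(Deployments deployments, Color active, boolean restoreFromCheckpoint) {
        this.deployments = deployments;
        this.active = active;
        this.restoreFromCheckpoint = restoreFromCheckpoint;
    }

    Color active() { return active; }
    boolean isTransitioning() { return transitioning; }

    /** Starts the standby deployment alongside the active one. */
    void beginTransition() {
        if (transitioning) return;
        // Restoring from the active job's latest checkpoint is optional.
        Optional<String> restoreFrom =
                restoreFromCheckpoint ? deployments.latestCheckpoint(active) : Optional.empty();
        deployments.start(active.other(), restoreFrom);
        transitioning = true;
    }

    /** One reconcile pass: promote the new deployment once it is running. */
    void reconcile() {
        if (!transitioning) return;
        Color next = active.other();
        if (!deployments.isRunning(next)) return; // keep serving from the active one
        Color previous = active;
        active = next;                // the Blue/Green switch
        deployments.delete(previous); // assumption: tear down the old job after promotion
        transitioning = false;
    }

    /** In-memory fake used only to exercise the sketch. */
    static final class FakeDeployments implements Deployments {
        final Set<Color> started = EnumSet.noneOf(Color.class);
        final Set<Color> running = EnumSet.noneOf(Color.class);
        Optional<String> lastRestore = Optional.empty();

        public void start(Color color, Optional<String> restoreFrom) {
            started.add(color);
            lastRestore = restoreFrom;
        }
        public boolean isRunning(Color color) { return running.contains(color); }
        public Optional<String> latestCheckpoint(Color color) {
            return Optional.of("checkpoint-of-" + color);
        }
        public void delete(Color color) {
            started.remove(color);
            running.remove(color);
        }
    }
}
```

Because reconcile() returns early until the new deployment reports running, the active deployment keeps serving if the new one never starts, which is the behavior the Open Questions section would need to extend for the case where both deployments fail.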

Open Questions

  • How to handle metrics integration in the common components?
  • What to do if both deployments have issues during the transition?

Compatibility, Deprecation, and Migration Plan

This is a new feature; no migration is necessary.

Test Plan

Describe in few sentences how the FLIP will be tested. We are mostly interested in system tests (since unit-tests are specific to implementation details). How will we know that the implementation works as expected? How will we know nothing broke?

Rejected Alternatives

If there are alternative ways of accomplishing the same thing, what were they? The purpose of this section is to motivate why the design is the way it is and not some other way.