Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
As part of our work with Flink, we identified the need to minimize “downtime” when redeploying pipelines. This matters most when startup times are long enough to cause downtime that is unacceptable for certain workloads.
As stateful applications, Flink pipelines require the data stream flow to be halted during deployment operations. In some cases, stopping this data flow can negatively affect downstream consumers. The following figure helps illustrate this:
A simple first phase of this Blue/Green deployment architecture can help overcome this issue by starting a second, identical pipeline alongside the main one. The Blue/Green promotion (switch) can be done as soon as the second deployment is running successfully.
Pros:
- Existing use cases can immediately benefit from this simplistic approach
- Can work as-is with existing Flink jobs
Cons:
- At-least-once semantics only
- Records/data will be duplicated for as long as both pipelines are running concurrently
A follow-up effort builds on this one by adding coordination between the jobs, with the aim of exactly-once semantics.
Public Interfaces
New Custom Resource Definition (CRD) representing a Blue/Green deployment for the Flink Kubernetes Operator
This effort only targets Application Jobs (no Session Jobs)
CRD
The FlinkBlueGreenDeploymentSpec → FlinkDeploymentSpec is a “has-a” relationship, analogous to the ReplicaSetSpec → PodSpec relationship
public class FlinkBlueGreenDeploymentSpec {
    private FlinkDeploymentTemplateSpec template;
}
public class FlinkDeploymentTemplateSpec {
    private ObjectMeta metadata;
    private FlinkDeploymentSpec spec;
    private Map<String, Object> additionalProperties = ...
}
Proposed Changes
This functionality will be added as an extension to the existing Flink K8s Operator via a new controller (e.g. FlinkBlueGreenDeploymentController), with the capability of managing the lifecycle of these deployments.
- A new CRD (e.g. FlinkBlueGreenDeployment) will be introduced
- The new FlinkBlueGreenDeploymentController will manage this CRD and hide from the user the details of the actual Blue/Green (Active/StandBy) jobs.
- The lifecycle of the actual jobs will be delegated to the existing FlinkDeployment controller.
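The delegation described above can be sketched as follows. This is only an illustration under stated assumptions: the types are simplified stand-ins (plain maps instead of ObjectMeta and FlinkDeploymentSpec), and the child naming scheme and the "blue-green/color" label key are hypothetical, not part of this proposal.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Stand-in for FlinkDeploymentTemplateSpec: plain maps replace ObjectMeta and FlinkDeploymentSpec.
class TemplateSketch {
    final Map<String, String> labels;
    final Map<String, Object> spec;

    TemplateSketch(Map<String, String> labels, Map<String, Object> spec) {
        this.labels = labels;
        this.spec = spec;
    }
}

// Stand-in for a child FlinkDeployment handed to the existing FlinkDeployment controller.
class ChildDeploymentSketch {
    final String name;
    final Map<String, String> labels;
    final Map<String, Object> spec;

    ChildDeploymentSketch(String name, Map<String, String> labels, Map<String, Object> spec) {
        this.name = name;
        this.labels = labels;
        this.spec = spec;
    }
}

class BlueGreenSketch {
    enum Color { BLUE, GREEN }

    // Stamps out one child deployment from the parent's template.
    // Hypothetical naming scheme: "<parent>-blue" / "<parent>-green".
    static ChildDeploymentSketch childFor(String parentName, Color color, TemplateSketch template) {
        String suffix = color.name().toLowerCase(Locale.ROOT);
        // Copy the template so the two children never share mutable state.
        Map<String, String> labels = new LinkedHashMap<>(template.labels);
        labels.put("blue-green/color", suffix); // hypothetical label key
        Map<String, Object> spec = new LinkedHashMap<>(template.spec);
        return new ChildDeploymentSketch(parentName + "-" + suffix, labels, spec);
    }
}
```

Because both children are derived from the same template, the user only ever edits the parent resource; which child is currently active stays an internal detail of the new controller.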
Controller Reconciliation Logic
Event Sequence for a Blue/Green Transition
(If a deployment is new it’ll be treated as a regular FlinkDeployment)
Simple sequence of events when transitioning from A → B:
- NOTES:
- TODO: Starting B from A’s latest checkpoint is optional
- Checkpoints need to be externally accessible and they need to exist for the duration of the transition.
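A minimal sketch of how a reconciliation step might choose its next action during an A → B transition is shown below. The state and action names are hypothetical, and the handling of a failed B (keep A serving) is an assumption for illustration only; the proposal leaves failure handling as an open question.

```java
// Hypothetical states and actions; this proposal does not define these names.
class TransitionSketch {
    enum JobState { ABSENT, STARTING, RUNNING, FAILED }

    enum Action { DEPLOY_AS_REGULAR, DEPLOY_B, WAIT_FOR_B, PROMOTE_B, KEEP_A }

    // hasActive: whether a deployment (A) is already serving.
    // next: observed state of the second deployment (B).
    static Action nextAction(boolean hasActive, JobState next) {
        if (!hasActive) {
            // A brand-new deployment is treated like a regular FlinkDeployment.
            return Action.DEPLOY_AS_REGULAR;
        }
        switch (next) {
            case ABSENT:
                // Start B alongside A (optionally from A's latest checkpoint).
                return Action.DEPLOY_B;
            case STARTING:
                // A keeps serving while B starts up.
                return Action.WAIT_FOR_B;
            case RUNNING:
                // The switch can happen as soon as B runs successfully;
                // stopping A afterwards (assumed) ends the duplication window.
                return Action.PROMOTE_B;
            default:
                // Assumption: on failure of B, leave A in place.
                return Action.KEEP_A;
        }
    }
}
```

Keeping the decision a pure function of observed state fits the level-triggered style of Kubernetes reconciliation: the same inputs always yield the same action, so a repeated or retried reconciliation is harmless.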
Open Questions
- How to handle metrics integration in the common components?
- What to do if both deployments have issues during the transition?
Compatibility, Deprecation, and Migration Plan
This is a new feature; no migration is necessary.
Test Plan
Describe in a few sentences how the FLIP will be tested. We are mostly interested in system tests (since unit tests are specific to implementation details). How will we know that the implementation works as expected? How will we know nothing broke?
Rejected Alternatives
If there are alternative ways of accomplishing the same thing, what were they? The purpose of this section is to motivate why the design is the way it is and not some other way.


