Status

Current state: Under Discussion

Discussion thread: https://lists.apache.org/thread/5vw6b6w6lctd2p7vfo9w9fcpzs7834fj

JIRA: here (<- link to https://issues.apache.org/jira/browse/FLINK-XXXX)

Released: <Flink Version>

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

The current Flink autoscaling approach (FLIP-271: Autoscaling) follows a reactive model. As a result, scaling adjustments may lag behind sudden workload changes, leading to temporary inefficiencies. This challenge initially led us to explore ways of incorporating a predicted input rate into autoscaling decisions to improve responsiveness. However, a more flexible and extensible solution is to provide an interface for plugging in custom evaluation logic for scaling metrics. This allows users to tailor autoscaling decisions to their specific workloads, enabling business-aware and proactive scaling strategies.

Proposed Changes

Draft PR: https://github.com/apache/flink-kubernetes-operator/pull/953

Goals

  1. Allow plugging custom evaluation logic into the Autoscaler.
  2. Allow job-level configuration for selecting a specific Custom Evaluator and passing evaluator-specific parameters.
  3. Integrate the custom evaluation plugin into the scaling metrics evaluator.

Non-Goals (for now)

  1. Provide plugins for commonly used predictive algorithms (for example scheduled, forecast-based, or REST-call-based evaluators). This is to be done as part of a different PR.
  2. Allow overriding of internally collected scaling metrics.
  3. Allow plugging in a custom scaling summary computation algorithm.

 

Public Interfaces

Scaling Metric Custom Evaluator plugin

A Scaling Metric Custom Evaluator is a plugin that integrates custom logic into the Autoscaler for evaluating scaling metrics. It provides flexibility in how scaling decisions are made, enabling workload-specific optimisations beyond the default model.

  • The Custom Evaluator is loaded as a plugin.
  • The Custom Evaluator is invoked at the end of evaluateMetrics for each vertex.
  • The Custom Evaluator returns a Map<ScalingMetric, EvaluatedScalingMetric>, which is merged with the current evaluated metrics.
  • An evaluation context is passed to the Custom Evaluator on each call.
  • The topological order of evaluation for the vertices is maintained.
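
The invocation and merge steps above can be sketched as follows. This is a minimal illustration, not the code in the draft PR: String keys stand in for JobVertexID and ScalingMetric, Double stands in for EvaluatedScalingMetric, and the VertexEvaluator interface, the evaluateAll method and the metric names are hypothetical. It also assumes that "merged" means entries returned by the evaluator replace default entries with the same key.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: String keys stand in for JobVertexID and ScalingMetric, Double for EvaluatedScalingMetric. */
final class EvaluatorMergeSketch {

    /** Hypothetical stand-in for the Custom Evaluator plugin. */
    interface VertexEvaluator {
        Map<String, Double> evaluate(
                String vertex,
                Map<String, Double> evaluatedMetrics,
                Map<String, Map<String, Double>> evaluatedVertexMetrics);
    }

    /** Calls the evaluator once per vertex, in the given order, and merges what it returns. */
    static Map<String, Map<String, Double>> evaluateAll(
            List<String> verticesInTopologicalOrder,
            Map<String, Map<String, Double>> defaultMetrics,
            VertexEvaluator evaluator) {
        Map<String, Map<String, Double>> result = new LinkedHashMap<>();
        for (String vertex : verticesInTopologicalOrder) {
            Map<String, Double> merged = new LinkedHashMap<>(defaultMetrics.get(vertex));
            Map<String, Double> returned = evaluator.evaluate(
                    vertex,
                    Collections.unmodifiableMap(merged),
                    Collections.unmodifiableMap(result));
            // Assumption: returned entries replace default entries with the same key.
            merged.putAll(returned);
            result.put(vertex, merged);
        }
        return result;
    }

    /** Two-vertex example: the sink reads the source's already merged target rate. */
    static Map<String, Map<String, Double>> demo() {
        Map<String, Double> source = new LinkedHashMap<>();
        source.put("TARGET_DATA_RATE", 100.0);
        source.put("TRUE_PROCESSING_RATE", 50.0);
        Map<String, Double> sink = new LinkedHashMap<>();
        sink.put("TARGET_DATA_RATE", 100.0);
        Map<String, Map<String, Double>> defaults = new LinkedHashMap<>();
        defaults.put("source", source);
        defaults.put("sink", sink);

        VertexEvaluator evaluator = (vertex, metrics, upstream) -> {
            double target = vertex.equals("source")
                    ? metrics.get("TARGET_DATA_RATE") * 2
                    : upstream.get("source").get("TARGET_DATA_RATE");
            return Collections.singletonMap("TARGET_DATA_RATE", target);
        };
        return evaluateAll(Arrays.asList("source", "sink"), defaults, evaluator);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sketch the source's target rate is doubled to 200.0 while its other metric is left untouched. Because the source is evaluated first, the sink's evaluator can read the source's merged value, which is the practical benefit of keeping the topological order.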

Job Level Config For Custom Evaluator

As part of this FLIP, we propose the following job-level Autoscaler options to control which Custom Evaluator is used, if any, along with evaluator-specific configurations. All named Custom Evaluator configurations will be available to the evaluator.

| Key | Default | Description |
|-----|---------|-------------|
| job.autoscaler.metrics.custom-evaluator.name | (none) | Specifies the name of the custom evaluator to be used. |
| job.autoscaler.metrics.custom-evaluator.<evaluator name>.class | (none) | Specifies the class name of the custom evaluator to be used. Required if job.autoscaler.metrics.custom-evaluator.name is set. |
| job.autoscaler.metrics.custom-evaluator.<evaluator name>.<other evaluator specific config> | (none) | Additional configuration parameters specific to the selected custom evaluator. These settings allow fine-tuning of the custom evaluator behavior based on the model used. |
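
To make the key layout concrete, the sketch below resolves the selected evaluator and collects its options from a job configuration. It is an illustration under assumptions: a plain Map stands in for Flink's Configuration, the evaluator name "trend-adjustor", its class name and its "window" option are made up, and stripping the per-evaluator prefix before handing options to the evaluator is one possible design, not something this FLIP fixes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch only: a plain Map stands in for Flink's Configuration. */
final class EvaluatorConfigSketch {

    static final String PREFIX = "job.autoscaler.metrics.custom-evaluator.";

    /**
     * Returns the selected evaluator's options with the
     * "job.autoscaler.metrics.custom-evaluator.<evaluator name>." prefix stripped,
     * or an empty map when no evaluator name is configured.
     */
    static Map<String, String> selectedEvaluatorOptions(Map<String, String> jobConf) {
        Map<String, String> options = new LinkedHashMap<>();
        String name = jobConf.get(PREFIX + "name");
        if (name == null) {
            return options; // no custom evaluator requested
        }
        String evaluatorPrefix = PREFIX + name + ".";
        if (!jobConf.containsKey(evaluatorPrefix + "class")) {
            throw new IllegalArgumentException(
                    "Missing required option " + evaluatorPrefix + "class");
        }
        for (Map.Entry<String, String> entry : jobConf.entrySet()) {
            if (entry.getKey().startsWith(evaluatorPrefix)) {
                options.put(
                        entry.getKey().substring(evaluatorPrefix.length()),
                        entry.getValue());
            }
        }
        return options;
    }

    /** Example job configuration; the evaluator name, class and "window" option are made up. */
    static Map<String, String> demo() {
        Map<String, String> jobConf = new LinkedHashMap<>();
        jobConf.put(PREFIX + "name", "trend-adjustor");
        jobConf.put(PREFIX + "trend-adjustor.class", "com.example.TrendAdjustor");
        jobConf.put(PREFIX + "trend-adjustor.window", "10 min");
        // Configured but not selected by name, so it is ignored.
        jobConf.put(PREFIX + "other-evaluator.class", "com.example.Other");
        return selectedEvaluatorOptions(jobConf);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Only the keys under the selected evaluator's name are collected here, so several evaluators could be configured side by side and switched by changing the name option alone.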

Information To Be Provided To Custom Evaluator

Custom Evaluator Invocation Signature

Map<ScalingMetric, EvaluatedScalingMetric> evaluateVertexMetrics(
        JobVertexID vertex,
        Map<ScalingMetric, EvaluatedScalingMetric> evaluatedMetrics,
        Context evaluationContext);

The evaluationContext encapsulates the details required for the evaluation process, listed below.

Context {
    Configuration jobConf;
    SortedMap<Instant, CollectedMetrics> metricsHistory;
    Map<JobVertexID, Map<ScalingMetric, EvaluatedScalingMetric>> evaluatedVertexMetrics;
    JobTopology topology;
    boolean processingBacklog;
    Duration restartTime;
    Configuration customEvaluatorConf;
}
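
As a rough illustration of how an evaluator could use this signature and context, consider the sketch below. Every type in it is a simplified local stand-in rather than the Flink class of the same name: String replaces JobVertexID, Context carries only two of the proposed fields, and a Map replaces Configuration. The FixedFactorEvaluator class and its "factor" option are hypothetical, and treating an empty returned map as "no change" is an assumption.

```java
import java.util.Collections;
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;

/** Sketch only: every type here is a simplified local stand-in, not the Flink class of the same name. */
final class CustomEvaluatorSketch {

    enum ScalingMetric { TARGET_DATA_RATE, TRUE_PROCESSING_RATE }

    /** Simplified stand-in that holds a single value. */
    static final class EvaluatedScalingMetric {
        final double value;

        EvaluatedScalingMetric(double value) {
            this.value = value;
        }
    }

    /** Reduced context: two of the proposed fields, with a Map standing in for Configuration. */
    static final class Context {
        final boolean processingBacklog;
        final Map<String, String> customEvaluatorConf;

        Context(boolean processingBacklog, Map<String, String> customEvaluatorConf) {
            this.processingBacklog = processingBacklog;
            this.customEvaluatorConf = customEvaluatorConf;
        }
    }

    /** Mirrors the proposed signature, with String standing in for JobVertexID. */
    interface CustomEvaluator {
        Map<ScalingMetric, EvaluatedScalingMetric> evaluateVertexMetrics(
                String vertex,
                Map<ScalingMetric, EvaluatedScalingMetric> evaluatedMetrics,
                Context evaluationContext);
    }

    /** Hypothetical evaluator: scales TARGET_DATA_RATE by a configured factor unless backlog is being processed. */
    static final class FixedFactorEvaluator implements CustomEvaluator {
        @Override
        public Map<ScalingMetric, EvaluatedScalingMetric> evaluateVertexMetrics(
                String vertex,
                Map<ScalingMetric, EvaluatedScalingMetric> evaluatedMetrics,
                Context evaluationContext) {
            EvaluatedScalingMetric target = evaluatedMetrics.get(ScalingMetric.TARGET_DATA_RATE);
            if (target == null || evaluationContext.processingBacklog) {
                return Collections.emptyMap(); // assumption: empty map means "no change"
            }
            double factor = Double.parseDouble(
                    evaluationContext.customEvaluatorConf.getOrDefault("factor", "1.0"));
            // Return only the metric that changed; the caller merges it with the rest.
            Map<ScalingMetric, EvaluatedScalingMetric> adjusted = new EnumMap<>(ScalingMetric.class);
            adjusted.put(
                    ScalingMetric.TARGET_DATA_RATE,
                    new EvaluatedScalingMetric(target.value * factor));
            return adjusted;
        }
    }

    /** Runs the evaluator on one vertex with factor 1.5 and returns the adjusted target rate. */
    static double demo() {
        Map<ScalingMetric, EvaluatedScalingMetric> evaluated = new EnumMap<>(ScalingMetric.class);
        evaluated.put(ScalingMetric.TARGET_DATA_RATE, new EvaluatedScalingMetric(100.0));
        Map<String, String> conf = new HashMap<>();
        conf.put("factor", "1.5");
        Map<ScalingMetric, EvaluatedScalingMetric> result =
                new FixedFactorEvaluator().evaluateVertexMetrics(
                        "source", Collections.unmodifiableMap(evaluated), new Context(false, conf));
        return result.get(ScalingMetric.TARGET_DATA_RATE).value;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Returning only the adjusted metric keeps the evaluator small and leaves every other evaluated metric exactly as the default evaluation produced it.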

Note:

We propose passing evaluatedMetrics, jobConf, metricsHistory, and evaluatedVertexMetrics to the evaluator as unmodifiable views. The Custom Evaluator must not modify them; any attempt to do so results in an UnsupportedOperationException.
This design choice is optional and can be reconsidered if it proves unnecessary.
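
One detail worth keeping in mind for the nested evaluatedVertexMetrics map: wrapping only the outer map with Collections.unmodifiableMap leaves the inner per-vertex maps writable. The sketch below shows one way to wrap both levels using only the JDK; String and Double are stand-ins for the Flink types, and the readOnly and rejectsPut helpers are hypothetical names.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch only: String and Double stand in for the Flink key and value types. */
final class UnmodifiableViewSketch {

    /**
     * Builds a read-only wrapper: the outer map is a read-only copy of the vertex keys,
     * and each inner map is a read-only view of the original per-vertex map.
     */
    static Map<String, Map<String, Double>> readOnly(Map<String, Map<String, Double>> vertexMetrics) {
        Map<String, Map<String, Double>> wrapped = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, Double>> entry : vertexMetrics.entrySet()) {
            wrapped.put(entry.getKey(), Collections.unmodifiableMap(entry.getValue()));
        }
        return Collections.unmodifiableMap(wrapped);
    }

    /** Returns true if writing to the map is rejected with UnsupportedOperationException. */
    static <V> boolean rejectsPut(Map<String, V> map, V value) {
        try {
            map.put("probe", value);
            return false;
        } catch (UnsupportedOperationException expected) {
            return true;
        }
    }

    /** Checks that both levels reject writes and that the underlying data is unchanged. */
    static boolean demo() {
        Map<String, Double> sourceMetrics = new LinkedHashMap<>();
        sourceMetrics.put("TARGET_DATA_RATE", 100.0);
        Map<String, Map<String, Double>> vertexMetrics = new LinkedHashMap<>();
        vertexMetrics.put("source", sourceMetrics);

        Map<String, Map<String, Double>> view = readOnly(vertexMetrics);
        boolean outerRejected = rejectsPut(view, new LinkedHashMap<String, Double>());
        boolean innerRejected = rejectsPut(view.get("source"), 1.0);
        return outerRejected && innerRejected && sourceMetrics.size() == 1;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```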

Sample Implementation

https://github.com/pchoudhury22/flink-kubernetes-operator/blob/custom-evaluator-plugin/flink-autoscaler/src/test/java/org/apache/flink/autoscaler/metrics/SimpleTrendAdjustor.java

Compatibility, Deprecation, and Migration Plan

This is a new feature, so there are no compatibility issues.

Test Plan

We will provide unit and integration tests to validate and verify the proposed changes.

Rejected Alternatives

If there are alternative ways of accomplishing the same thing, what were they? The purpose of this section is to motivate why the design is the way it is and not some other way.
