...
- Extend jobmanager.scheduler to accept the new value "declarative" in order to activate the declarative scheduler
- Introduce declarative-scheduler.resource-timeout to configure the resource timeout for the "Waiting for resources" state
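As a rough sketch, the two options could appear in the cluster configuration (for example flink-conf.yaml) as shown below. The option names come from the list above. The timeout value and its duration syntax are placeholders chosen for illustration, not defaults proposed by this FLIP.

```yaml
# Activate the declarative scheduler for the whole cluster.
jobmanager.scheduler: declarative

# How long the scheduler stays in the "Waiting for resources" state.
# The value "10 s" is an illustrative assumption, not a proposed default.
declarative-scheduler.resource-timeout: 10 s
```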
Compatibility, Deprecation, and Migration Plan
...
Rescaling happens by restarting the job, so jobs with large state might need a lot of resources and time to rescale. Rescaling causes downtime for the job, but no data loss.
Per-job configuration
It might be useful to select the scheduler on a per-job basis. Within the scope of this FLIP, the scheduler will only be configurable for the whole cluster. Hence, introducing a job-level configuration option for selecting the scheduler could be a good follow-up.
Slow performance when recovering from a fault
Since creating an ExecutionGraph is a costly operation (see FLINK-21110), which can also involve IO operations if certain sources/sinks are used, failover might not be very fast. If this becomes a problem, we will have to think about pulling one-time initialisation tasks out of the ExecutionGraph and speeding up its creation in order to speed up failover.
Test Plan
The new scheduler needs extensive unit, integration (IT), and end-to-end testing because it is a crucial component at the heart of Flink.
...