
Motivation

The Adaptive Scheduler already supports manually adjusting the parallelism of jobs through the REST API introduced in FLIP-291: Externalized Declarative Resource Management. FLIP-495: Support AdaptiveScheduler record and query the rescale history adds the ability to record and query the rescale history. However, there is still no convenient way for users and developers to quickly view the rescale history and the related internal information of the Adaptive Scheduler.

Showing the history of AdaptiveScheduler rescale events in the web UI therefore helps users decide on the next step for their jobs. The goals are:

  • Help users trace the rescale history and make rescale information more transparent
    • Through REST APIs
    • Through Flink Web UI pages
  • Provide users with information for tuning Adaptive Scheduler parameters

Proposed Changes & Public Interfaces

Based on the feature provided by FLIP-495: Support AdaptiveScheduler record and query the rescale history, we can design the following pages to display rescale information.

The Web UI entry-point page location and showing logic

  • Add a new tab page, similar to the Exceptions tab, to show the rescale information. The tab is only available for streaming jobs that run on the AdaptiveScheduler.
    • When should the subpage be shown?
      • The job status is ‘RUNNING’, the SchedulerType of the job is AdaptiveScheduler, and the job type is STREAMING.
    • How is the information for this decision obtained?
      • When users view the Running Jobs, Overview, or Completed pages, the /jobs/overview interface is requested, and its response carries the information (schedulerType and jobType) used to evaluate the condition.
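
The display condition above can be sketched as a small predicate. This is only an illustration in Java, not the actual front-end code; the class name RescalesTabVisibility and the method shouldShow are hypothetical, and the string values are the ones named in the condition.

```java
/** Hypothetical helper; the class and method names are illustrative only. */
final class RescalesTabVisibility {

    // Values taken from the condition described in the text.
    static final String RUNNING = "RUNNING";
    static final String ADAPTIVE_SCHEDULER = "AdaptiveScheduler";
    static final String STREAMING = "STREAMING";

    /**
     * Returns true only when all three conditions hold.
     * A null input (a field missing from the response) hides the tab.
     */
    static boolean shouldShow(String jobStatus, String schedulerType, String jobType) {
        return RUNNING.equals(jobStatus)
                && ADAPTIVE_SCHEDULER.equals(schedulerType)
                && STREAMING.equals(jobType);
    }

    public static void main(String[] args) {
        System.out.println(shouldShow("RUNNING", "AdaptiveScheduler", "STREAMING"));
        System.out.println(shouldShow("RUNNING", "AdaptiveScheduler", "BATCH"));
    }
}
```

Treating missing fields as "do not show" is an assumption of this sketch; it keeps the tab hidden for responses that do not carry the new fields.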

How to decide whether to show the subpage

Extend the response body of the /jobs/overview REST interface with the job's schedulerType and jobType so that the front end can decide whether to show the ‘Rescale’ subpage.

  • URL: /jobs/overview
  • Response body schema:
Extended schema of the response for /jobs/overview
{
 "jobs": [
   { // This sub-JSON maps to org.apache.flink.runtime.messages.webmonitor.JobDetails
     // new fields
     "schedulerType": "AdaptiveScheduler",
     "jobType": "STREAMING",
     // existing fields omitted…
   },
   ...
 ]
}
  • The corresponding internal change to the Java class:
org.apache.flink.runtime.messages.webmonitor.JobDetails
public class JobDetails implements Serializable {
    // Existing members omitted.
    ...

    // Newly introduced fields.
    private static final String FIELD_NAME_JOB_TYPE = "jobType";
    private static final String FIELD_NAME_JOB_SCHEDULER = "schedulerType";

    @Nullable private final JobType jobType;
    @Nullable private final String schedulerType;

    // Other existing members omitted.
    ...
}

The URN description will be updated during development.

The Web UI and REST interfaces

  • All pages shown here are draft sketches; the final style must follow the Flink UI style.
  • The URNs of the introduced or changed schemas will be updated during development.
  • The design of the rescale history UI follows the style of the checkpoint-related pages.
  • The design of the rescale history REST API, however, does not fully follow the style of the checkpoint-related interfaces.
    • The main difference is that this design splits the REST interfaces more clearly and explicitly.
    • Compared to the solution outlined in the "Rejected Alternatives" section, the current design has the following pros and cons:
      • Con: during implementation, the number of XXXHandler classes grows nearly linearly with the number of pages.
      • Pro: the responsibilities of each interface are clearly defined and straightforward.
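
The one-endpoint-per-page split can be summarized as a small catalogue. The sketch below is illustrative only: the enum RescaleEndpoint and its resolve method are hypothetical and not part of the proposal, while the URL patterns are the ones specified in the sections that follow.

```java
/** Hypothetical catalogue of the endpoints proposed in this FLIP. */
enum RescaleEndpoint {
    OVERVIEW("/jobs/:jobid/rescales/overview"),
    HISTORY("/jobs/:jobid/rescales/history"),
    DETAILS("/jobs/:jobid/rescales/details/:rescaleuuid"),
    SUMMARY("/jobs/:jobid/rescales/summary"),
    CONFIG("/jobs/:jobid/rescales/config");

    private final String pattern;

    RescaleEndpoint(String pattern) {
        this.pattern = pattern;
    }

    String pattern() {
        return pattern;
    }

    /** Fills in the path parameters; rescaleUuid is only needed by DETAILS. */
    String resolve(String jobId, String rescaleUuid) {
        String url = pattern.replace(":jobid", jobId);
        if (url.contains(":rescaleuuid")) {
            if (rescaleUuid == null) {
                throw new IllegalArgumentException("rescaleUuid is required for " + this);
            }
            url = url.replace(":rescaleuuid", rescaleUuid);
        }
        return url;
    }

    public static void main(String[] args) {
        // Placeholder IDs, used only to print the resolved paths.
        for (RescaleEndpoint endpoint : values()) {
            System.out.println(endpoint + " -> " + endpoint.resolve("<jobid>", "<rescaleuuid>"));
        }
    }
}
```

Each constant corresponds to one page or sub-view, which is why the number of handlers grows with the number of pages.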

Rescale Overview 


Introduce the rescales overview REST API

  • URL:  "/jobs/:jobid/rescales/overview"
  • METHOD: GET
  • Parameter: N.A
  • The response schema:
Schema of response for /jobs/:jobid/rescales/overview
{
  "rescalesCounts": {
    "ignored": 1,
    "inProgress": 0,
    "completed": 4,
    "failed": 1
  },
  "latest": {
    "completed": {
      "rescaleUuid": "${hexString}",
      "resourceRequirementsUuid": "${UUID}",
      "rescaleAttemptId": 1,
      // Details omitted here to reduce the response size; the fine-grained details can be viewed via ‘/jobs/:jobid/rescales/details/:rescaleuuid’.
      "vertices": {},
      // Details omitted here to reduce the response size; the fine-grained details can be viewed via ‘/jobs/:jobid/rescales/details/:rescaleuuid’.
      "slots": {},
      // Details omitted here to reduce the response size; the fine-grained details can be viewed in the "schedulerStates" field of ‘/jobs/:jobid/rescales/details/:rescaleuuid’.
      "schedulerStates": [],
      "terminalState": "COMPLETED",
      "triggerCause": "xxxxx",
      "terminatedReason": "xxxxx",
      "startTimestampInMillis": 1733279950222, // milliseconds
      "endTimestampInMillis": 1733279950222, // milliseconds
      // "durationInMillis" has been removed from the response. The UI can compute the duration from startTimestampInMillis and endTimestampInMillis.
      // If endTimestampInMillis is null, the UI can subtract startTimestampInMillis from the current timestamp in milliseconds to get the duration.
    },
    "failed":...,
    "ignored":...
  }
}
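
The duration rule described in the schema comments can be sketched as follows. This is a minimal illustration, assuming the two timestamp fields from the response; the class RescaleDuration and its method are hypothetical names, and the current time is passed in as a parameter only to keep the example deterministic.

```java
/** Hypothetical helper that derives the displayed duration from the response fields. */
final class RescaleDuration {

    /**
     * @param startMillis startTimestampInMillis from the response
     * @param endMillis   endTimestampInMillis, or null while the rescale has not ended
     * @param nowMillis   the current time in milliseconds, used when endMillis is null
     */
    static long durationMillis(long startMillis, Long endMillis, long nowMillis) {
        long effectiveEnd = (endMillis != null) ? endMillis : nowMillis;
        return effectiveEnd - startMillis;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // Finished rescale: the duration is end minus start.
        System.out.println(durationMillis(1_000L, 4_500L, now));
        // Unfinished rescale: the duration is measured up to "now".
        System.out.println(durationMillis(now - 2_000L, null, now));
    }
}
```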

Rescale Overview UI

The goal of this page is to align the rescale overview with the checkpoint overview on the UI side.

  • When the Rescales subpage is accessed, it displays the Overview section by default.

  • If rescale events exist for the latest completed, latest ignored, or latest failed entries, the interface automatically requests the details API /jobs/:jobid/rescales/details/:rescaleuuid and displays the detailed information.

  • When displaying string values of the UUID type, we can show only the first eight characters instead of the complete string to simplify presentation and layout. This is similar to the abbreviated display of Git commit IDs.

  • When displaying vertex names in the vertices table, we can show at most 32 characters of the name instead of the complete name, for the same reason. As far as I know, task names can be relatively long in SQL jobs.
  • In the front-end implementation, tooltip explanations need to be added to the header fields of the tables below. The tooltip appears when the mouse hovers over a header field and is dismissed when the mouse moves away.

    • The header attributes of Rescale information:
      • Rescale UUID: The unique ID of the rescale, consisting of 32 hexadecimal characters
      • Attempt ID: The sequence number of the rescale attempt under the same resource requirements
      • Requirements ID: The unique ID of the resource requirements, consisting of 32 hexadecimal characters
      • Trigger Cause: The reason that triggered the rescale
      • Terminal State: The final state of the rescale
      • Terminated Reason: The reason for the completion or termination of the rescale
      • Start Time: The start time of the rescale
      • Duration: The time from the start of the rescale to its completion, or until now if it has not finished
      • End Time: The end time of the rescale
    • The header attributes of Vertices:
      • ID: The unique ID of the JobVertex, consisting of 32 hexadecimal characters
      • Name: The short name of the vertex
      • Slot Sharing Group ID: The unique ID of the slot sharing group, consisting of 32 hexadecimal characters
      • Previous Parallelism: The parallelism of the vertex before the current rescale
      • Acquired Parallelism: The parallelism of the vertex after the current rescale
      • Sufficient Parallelism: The minimal parallelism the vertex needs to run
      • Desired Parallelism: The desired parallelism of the vertex
    • The header attributes of Slots:
      • Slot Sharing Group ID: The ID of the slot sharing group to which the slot belongs, consisting of 32 hexadecimal characters
      • Slot Sharing Group Name: The name of the slot sharing group to which the slot belongs
      • Previous Slot: The number of slots before the rescale
      • Acquired Slot: The number of slots after the rescale
      • Desired Slot: The desired number of slots for the rescale
      • Sufficient Slot: The minimal number of slots needed to deploy tasks in the rescale
      • Required Profile: The required resource profile of the slot sharing group in the rescale
      • Acquired Profile: The acquired resource profile of the slot sharing group in the rescale
    • The header attributes of Scheduler State History:
      • State: The name of the scheduler state
      • Enter Time: The time at which the state was entered
      • Leave Time: The time at which the state was left
      • Duration: The time between entering and leaving the state
      • Exception: The exception information recorded for the current rescale during the state
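
The two shortening rules for displayed values (eight characters for UUID-like strings, at most 32 characters for vertex names) can be sketched as below. The class DisplayText and its method names are hypothetical, and cutting the vertex name at the front without adding an ellipsis is an assumption of this sketch; the real implementation would live in the web UI.

```java
/** Hypothetical display helpers; the limits are the ones proposed in the text. */
final class DisplayText {

    static final int ID_PREFIX_LENGTH = 8;
    static final int MAX_VERTEX_NAME_LENGTH = 32;

    /** Shows only the first eight characters of an ID, similar to short Git commit IDs. */
    static String abbreviateId(String id) {
        return id.length() <= ID_PREFIX_LENGTH ? id : id.substring(0, ID_PREFIX_LENGTH);
    }

    /** Shows at most 32 characters of a vertex name. */
    static String truncateVertexName(String name) {
        return name.length() <= MAX_VERTEX_NAME_LENGTH
                ? name
                : name.substring(0, MAX_VERTEX_NAME_LENGTH);
    }

    public static void main(String[] args) {
        System.out.println(abbreviateId("0123456789abcdef0123456789abcdef"));
        System.out.println(truncateVertexName("Map-1"));
    }
}
```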

Rescale History 


Introduce the rescales history REST API

  • URL:  "/jobs/:jobid/rescales/history"
  • METHOD: GET
  • Parameter: N.A
  • The response schema:
Schema of response for /jobs/:jobid/rescales/history
[
    {
      "rescaleUuid": "${hexString}",
      "resourceRequirementsUuid": "${UUID}",
      "rescaleAttemptId": 1,
      // Details omitted here to reduce the response size; the fine-grained details can be viewed via ‘/jobs/:jobid/rescales/details/:rescaleuuid’.
      "vertices": {},
      // Details omitted here to reduce the response size; the fine-grained details can be viewed via ‘/jobs/:jobid/rescales/details/:rescaleuuid’.
      "slots": {},
      // Details omitted here to reduce the response size; the fine-grained details can be viewed in the "schedulerStates" field of ‘/jobs/:jobid/rescales/details/:rescaleuuid’.
      "schedulerStates": [],
      "terminalState": "COMPLETED",
      "triggerCause": "xxxxx",
      "terminatedReason": "xxxxx",
      "startTimestampInMillis": 1733279950222, // milliseconds
      "endTimestampInMillis": 1733279950222, // milliseconds
      // "durationInMillis" has been removed from the response. The UI can compute the duration from startTimestampInMillis and endTimestampInMillis.
      // If endTimestampInMillis is null, the UI can subtract startTimestampInMillis from the current timestamp in milliseconds to get the duration.
    },
    ...
 ]


Rescale History UI

When accessing the History subpage, the interface will call the API /jobs/:jobid/rescales/history and display a summary of historical rescale events.

  • When displaying string values of the UUID type, we can show only the first eight characters instead of the complete string to simplify presentation and layout. This is similar to the abbreviated display of Git commit IDs.

  • In the front-end implementation, tooltip explanations need to be added to the header fields of the table below. The tooltip appears when the mouse hovers over a header field and is dismissed when the mouse moves away.

    • The header attributes of Rescale information:
      • Rescale UUID: The unique ID of the rescale, consisting of 32 hexadecimal characters
      • Attempt ID: The sequence number of the rescale attempt under the same resource requirements
      • Requirements ID: The unique ID of the resource requirements, consisting of 32 hexadecimal characters
      • Trigger Cause: The reason that triggered the rescale
      • Terminal State: The final state of the rescale
      • Terminated Reason: The reason for the completion or termination of the rescale
      • Start Time: The start time of the rescale
      • Duration: The time from the start of the rescale to its completion, or until now if it has not finished
      • End Time: The end time of the rescale

Introduce the rescale details REST API

  • URL:  "/jobs/:jobid/rescales/details/:rescaleuuid"
  • METHOD: GET
  • The response schema:
Schema of response for /jobs/:jobid/rescales/details/:rescaleuuid
{
  "rescaleUuid": "${hexString}",
  "resourceRequirementsUuid": "${UUID}",
  "rescaleAttemptId": 1,
  "vertices": {
    "jobVertexId": {
      "jobVertexName": "Map-1", // The length may need to be limited here.
      "slotSharingGroupId": "",
      "requiredResourceProfile": { // The keys in this sub-JSON follow the naming style of org.apache.flink.runtime.rest.messages.ResourceProfileInfo so that the class can be reused.
        "cpuCores": 200000.0,
        "taskHeapMemory": 209715,
        "taskOffHeapMemory": 209715,
        "managedMemory": 25,
        "networkMemory": 12,
        "extendedResources": {}
      },
      "acquiredResourceProfile": {
        "cpuCores": 200000.0,
        "taskHeapMemory": 209715,
        "taskOffHeapMemory": 209715,
        "managedMemory": 25,
        "networkMemory": 12,
        "extendedResources": {}
      },
      "preRescaleParallelism": 14,
      "postRescaleParallelism": 19,
      "desiredParallelism": 128,
      "sufficientParallelism": 19
    },
    ...
  },
  "slots": {
    "xxx(hexString, slot sharing group id)": {
      "slotSharingGroupName": "xxx",
      "preRescaleSlots": 10,
      "postRescaleSlots": 10,
      "desiredSlots": 10,
      "minimalRequiredSlots": 10,
      "requestResourceProfile": {
        "cpuCores": 200000.0,
        "taskHeapMemory": 209715,
        "taskOffHeapMemory": 209715,
        "managedMemory": 25,
        "networkMemory": 12,
        "extendedResources": {}
      },
      "acquiredResourceProfile": {
        "cpuCores": 200000.0,
        "taskHeapMemory": 209715,
        "taskOffHeapMemory": 209715,
        "managedMemory": 25,
        "networkMemory": 12,
        "extendedResources": {}
      }
    },
    ...
  },
  "schedulerStates":
    [
      {
        "state": "xxx", // xxx: one of the AdaptiveScheduler states
        "enterTimestampInMillis": 1111111, // milliseconds
        "leaveTimestampInMillis": 2222222, // milliseconds
        "durationInMillis": 11111,
        "stringifiedException": ""
      },
      ...
    ],
  "terminalState": "COMPLETED",
  "triggerCause": "xxxxx",
  "terminatedReason":"xxx",
  "startTimestampInMillis": 1733279950222, // milliseconds
  "endTimestampInMillis": 1733279950222, // milliseconds
  // "durationInMillis" has been removed from the response. The UI can compute the duration from startTimestampInMillis and endTimestampInMillis.
  // If endTimestampInMillis is null, the UI can subtract startTimestampInMillis from the current timestamp in milliseconds to get the duration.
}

Rescale Details UI

When a user clicks on a specific Rescale to view its details, the interface will call the corresponding API /jobs/:jobid/rescales/details/:rescaleuuid and display the details of the selected rescale event.


  • When displaying string values of the UUID type, we can show only the first eight characters instead of the complete string to simplify presentation and layout. This is similar to the abbreviated display of Git commit IDs.

  • When displaying vertex names in the vertices table, we can show at most 32 characters of the name instead of the complete name, for the same reason. As far as I know, task names can be relatively long in SQL jobs.
  • In the front-end implementation, tooltip explanations need to be added to the header fields of the tables below. The tooltip appears when the mouse hovers over a header field and is dismissed when the mouse moves away.

    • The header attributes of Rescale information:
      • Rescale UUID: The unique ID of the rescale, consisting of 32 hexadecimal characters
      • Attempt ID: The sequence number of the rescale attempt under the same resource requirements
      • Requirements ID: The unique ID of the resource requirements, consisting of 32 hexadecimal characters
      • Trigger Cause: The reason that triggered the rescale
      • Terminal State: The final state of the rescale
      • Terminated Reason: The reason for the completion or termination of the rescale
      • Start Time: The start time of the rescale
      • Duration: The time from the start of the rescale to its completion, or until now if it has not finished
      • End Time: The end time of the rescale
    • The header attributes of Vertices:
      • ID: The unique ID of the JobVertex, consisting of 32 hexadecimal characters
      • Name: The short name of the vertex
      • Slot Sharing Group ID: The unique ID of the slot sharing group, consisting of 32 hexadecimal characters
      • Previous Parallelism: The parallelism of the vertex before the current rescale
      • Acquired Parallelism: The parallelism of the vertex after the current rescale
      • Sufficient Parallelism: The minimal parallelism the vertex needs to run
      • Desired Parallelism: The desired parallelism of the vertex
    • The header attributes of Slots:
      • Slot Sharing Group ID: The ID of the slot sharing group to which the slot belongs, consisting of 32 hexadecimal characters
      • Slot Sharing Group Name: The name of the slot sharing group to which the slot belongs
      • Previous Slot: The number of slots before the rescale
      • Acquired Slot: The number of slots after the rescale
      • Desired Slot: The desired number of slots for the rescale
      • Sufficient Slot: The minimal number of slots needed to deploy tasks in the rescale
      • Required Profile: The required resource profile of the slot sharing group in the rescale
      • Acquired Profile: The acquired resource profile of the slot sharing group in the rescale
    • The header attributes of Scheduler State History:
      • State: The name of the scheduler state
      • Enter Time: The time at which the state was entered
      • Leave Time: The time at which the state was left
      • Duration: The time between entering and leaving the state
      • Exception: The exception information recorded for the current rescale during the state

Rescale Summary


Introduce the rescales summary REST API

  • URL:  "/jobs/:jobid/rescales/summary"
  • METHOD: GET
  • Parameter: N.A
  • The response schema:
Schema of response for /jobs/:jobid/rescales/summary
{
    "rescalesCounts": {
        "ignored": 1,
        "inProgress": 0,
        "completed": 4,
        "failed": 1
    },
    "rescalesDurationStatsInMillis": {
        "min": 100,
        "max": 100,
        "avg": 100
    },
    "completedRescalesDurationStatsInMillis":{
        "min": 9620,
        "max": 335627,
        "avg": 176649,
        "p50": 178502.0,
        "p90": 313040.0,
        "p95": 329417.0,
        "p99": 335627.0,
        "p999": 335627.0
      },
    "ignoredRescalesDurationStatsInMillis":{
        "min": 9620,
        "max": 335627,
        "avg": 176649,
        "p50": 178502.0,
        "p90": 313040.0,
        "p95": 329417.0,
        "p99": 335627.0,
        "p999": 335627.0
      },
    "failedRescalesDurationStatsInMillis":{
        "min": 9620,
        "max": 335627,
        "avg": 176649,
        "p50": 178502.0,
        "p90": 313040.0,
        "p95": 329417.0,
        "p99": 335627.0,
        "p999": 335627.0
    }
}
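
The per-category duration statistics in this schema could be derived from the recorded rescale durations roughly as follows. This is a sketch under stated assumptions: the class RescaleDurationStats is hypothetical, and the nearest-rank percentile method is chosen here only for illustration, since this FLIP does not specify how percentiles are calculated.

```java
import java.util.Arrays;
import java.util.LongSummaryStatistics;

/** Hypothetical holder for the min/max/avg and percentile fields of one category. */
final class RescaleDurationStats {

    final long min;
    final long max;
    final long avg;
    final double p50;
    final double p90;
    final double p95;
    final double p99;
    final double p999;

    RescaleDurationStats(long[] durationsInMillis) {
        if (durationsInMillis.length == 0) {
            throw new IllegalArgumentException("at least one duration is required");
        }
        long[] sorted = durationsInMillis.clone();
        Arrays.sort(sorted);
        LongSummaryStatistics summary = Arrays.stream(sorted).summaryStatistics();
        min = summary.getMin();
        max = summary.getMax();
        avg = (long) summary.getAverage(); // truncated to whole milliseconds
        p50 = percentile(sorted, 50.0);
        p90 = percentile(sorted, 90.0);
        p95 = percentile(sorted, 95.0);
        p99 = percentile(sorted, 99.0);
        p999 = percentile(sorted, 99.9);
    }

    /** Nearest-rank percentile over an ascending array (assumed method). */
    static double percentile(long[] sorted, double p) {
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        RescaleDurationStats stats = new RescaleDurationStats(new long[] {400, 100, 300, 200});
        System.out.println(stats.min + " " + stats.max + " " + stats.avg + " " + stats.p50);
    }
}
```

One instance per terminal state (completed, ignored, failed) would correspond to the three percentile blocks in the response.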


Rescale Summary UI

When accessing the summary subpage, the interface /jobs/:jobid/rescales/summary will be called, and the corresponding statistics list will be displayed.



When the user clicks the Rescale Duration Percentile dropdown button, the page displays additional statistical information.


Rescale configuration 

Rescale configuration REST API

Introduce the REST API for the Adaptive Scheduler configuration

  • URL:  "/jobs/:jobid/rescales/config"
  • METHOD: GET
  • Parameter: N.A
  • The response body schema:
Schema of response for /jobs/:jobid/rescales/config
{
 "rescaleHistoryMax": 16,
 "executionMode": "", // string
 "rescaleOnFailedCheckpointCount": 1, // integer
 // The unit(s) of the following items are milliseconds.
 "submissionResourceWaitTimeoutInMillis": 1000,// long
 "submissionResourceStabilizationTimeoutInMillis":232,// long
 "slotIdleTimeoutInMillis":132,// long
 "executingCooldownTimeoutInMillis": 111,// long
 "executingResourceStabilizationTimeoutMillis": 1000,// long
 "maximumDelayForTriggeringRescaleInMillis": 1999 // long
}


Rescale configuration UI


When accessing the configuration subpage, the interface /jobs/:jobid/rescales/config will be called, and the corresponding configuration information will be displayed.


Compatibility, Deprecation, and Migration Plan

This is a new feature, so there is no existing behavior to deprecate or migrate.

Test Plan

The REST endpoints part:

Regarding this part, we plan to test the REST endpoints through the RestHandler framework, similar to the workflow implemented in classes like org.apache.flink.runtime.rest.handler.job.checkpoints.AbstractCheckpointStatsHandlerTest.

The UI part:


The UI will be tested visually through manual testing.

Rejected Alternatives


In this alternative, the Rescale Overview, Rescale History, and Rescale Summary parts share a single REST interface '/jobs/:jobid/rescales' to fetch data.

The goal of this REST design is to align the rescale overview with the checkpoint overview on the REST interface side.

This candidate solution reduces the number of handlers needed during implementation.
However, the drawback is that using a single REST interface to fulfill all of these responsibilities makes the interface's role bloated and less clear.

 

Rescale Overview 


Introduce the rescales REST API

  • URL:  "/jobs/:jobid/rescales"
  • METHOD: GET
  • Parameter: N.A
  • The response schema:
Schema of response for /jobs/:jobid/rescales
{
  "summary": {
    "rescalesCounts": {
        "ignored": 1,
        "inProgress": 0,
        "completed": 4,
        "failed": 1
    },
    "rescalesDurationStatsInMillis": {
        "min": 100,
        "max": 100,
        "avg": 100
    },
    "completedRescalesDurationStatsInMillis":{
        "min": 9620,
        "max": 335627,
        "avg": 176649,
        "p50": 178502.0,
        "p90": 313040.0,
        "p95": 329417.0,
        "p99": 335627.0,
        "p999": 335627.0
      },
    "ignoredRescalesDurationStatsInMillis":{
        "min": 9620,
        "max": 335627,
        "avg": 176649,
        "p50": 178502.0,
        "p90": 313040.0,
        "p95": 329417.0,
        "p99": 335627.0,
        "p999": 335627.0
      },
    "failedRescalesDurationStatsInMillis":{
        "min": 9620,
        "max": 335627,
        "avg": 176649,
        "p50": 178502.0,
        "p90": 313040.0,
        "p95": 329417.0,
        "p99": 335627.0,
        "p999": 335627.0
    }
  },
  "latest": {
    "completed": {
      "rescaleUuid": "${hexString}",
      "resourceRequirementsUuid": "${UUID}",
      "rescaleAttemptId": 1,
      // Details omitted here to reduce the response size; the fine-grained details can be viewed via ‘/jobs/:jobid/rescales/details/:rescaleuuid’.
      "vertices": {},
      // Details omitted here to reduce the response size; the fine-grained details can be viewed via ‘/jobs/:jobid/rescales/details/:rescaleuuid’.
      "slots": {},
      // Details omitted here to reduce the response size; the fine-grained details can be viewed in the "schedulerStates" field of ‘/jobs/:jobid/rescales/details/:rescaleuuid’.
      "schedulerStates": [],
      "terminalState": "COMPLETED",
      "triggerCause": "xxxxx",
      "terminatedReason": "xxxxx",
      "startTimestampInMillis": 1733279950222, // milliseconds
      "endTimestampInMillis": 1733279950222, // milliseconds
      // "durationInMillis" has been removed from the response. The UI can compute the duration from startTimestampInMillis and endTimestampInMillis.
      // If endTimestampInMillis is null, the UI can subtract startTimestampInMillis from the current timestamp in milliseconds to get the duration.
    },
    "failed":...,
    "ignored":...
  },
  "history": [
    {
      "rescaleUuid": "${hexString}",
      "resourceRequirementsUuid": "${UUID}",
      "rescaleAttemptId": 1,
      // Details omitted here to reduce the response size; the fine-grained details can be viewed via ‘/jobs/:jobid/rescales/details/:rescaleuuid’.
      "vertices": {},
      // Details omitted here to reduce the response size; the fine-grained details can be viewed via ‘/jobs/:jobid/rescales/details/:rescaleuuid’.
      "slots": {},
      // Details omitted here to reduce the response size; the fine-grained details can be viewed in the "schedulerStates" field of ‘/jobs/:jobid/rescales/details/:rescaleuuid’.
      "schedulerStates": [],
      "terminalState": "COMPLETED",
      "triggerCause": "xxxxx",
      "terminatedReason": "xxxxx",
      "startTimestampInMillis": 1733279950222, // milliseconds
      "endTimestampInMillis": 1733279950222, // milliseconds
      // "durationInMillis" has been removed from the response. The UI can compute the duration from startTimestampInMillis and endTimestampInMillis.
      // If endTimestampInMillis is null, the UI can subtract startTimestampInMillis from the current timestamp in milliseconds to get the duration.
    },
    ...
  ]
}

Rescale Overview UI

The page aligns the rescale overview with the checkpoint overview, as described in the main design.
The page only uses the 'summary' and 'latest' parts of the response shown in the schema.

Rescale History 


Rescale History UI

The design details are the same as in the main design.

The page only uses the 'history' part of the response shown in the schema.

Introduce the rescale details REST API

The design details are the same as in the main design.

Rescale Summary

Rescale Summary UI

When accessing the summary subpage, the interface /jobs/:jobid/rescales will be called, and the corresponding statistics list will be displayed.

The UI design details are the same as in the main design.


