Status
...
Page properties | |
---|---|
|
...
...
|
...
|
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
...
PlantUML |
---|
@startuml
hide empty description
[*] -> Created
Created --> Waiting : Start scheduling
state "Waiting for resources" as Waiting
Waiting --> Waiting : Resources are not stable yet
Waiting --> Executing : Resources are stable
Waiting --> Finished : Cancel, suspend or not enough \nresources for executing
Executing --> Canceling : Cancel
Executing --> Failing : Unrecoverable fault
Executing --> Finished : Suspend or job reached terminal state
Executing --> Restarting : Recoverable fault
Restarting --> Finished : Suspend
Restarting --> Canceling : Cancel
Restarting --> Waiting : Cancelation complete
Canceling --> Finished : Cancelation complete
Failing --> Finished : Failing complete
Finished -> [*]
@enduml
|
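The diagram above can also be read as a transition table. The following is a minimal sketch of that table in Java, not the scheduler's actual implementation. The class, enum, and method names (`AdaptiveSchedulerStateMachine`, `State`, `Event`, `next`) are hypothetical and chosen only for illustration. The event names are derived from the edge labels in the diagram.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Optional;

/** Illustrative transition table for the state diagram; all names are assumptions. */
final class AdaptiveSchedulerStateMachine {

    enum State { CREATED, WAITING_FOR_RESOURCES, EXECUTING, RESTARTING, CANCELING, FAILING, FINISHED }

    /** Events derived from the edge labels of the diagram. */
    enum Event {
        START_SCHEDULING, RESOURCES_NOT_STABLE, RESOURCES_STABLE, NOT_ENOUGH_RESOURCES,
        CANCEL, SUSPEND, UNRECOVERABLE_FAULT, RECOVERABLE_FAULT,
        JOB_REACHED_TERMINAL_STATE, CANCELATION_COMPLETE, FAILING_COMPLETE
    }

    private static final Map<State, Map<Event, State>> TRANSITIONS = new EnumMap<>(State.class);

    private static void allow(State from, Event on, State to) {
        TRANSITIONS.computeIfAbsent(from, s -> new EnumMap<Event, State>(Event.class)).put(on, to);
    }

    static {
        allow(State.CREATED, Event.START_SCHEDULING, State.WAITING_FOR_RESOURCES);

        allow(State.WAITING_FOR_RESOURCES, Event.RESOURCES_NOT_STABLE, State.WAITING_FOR_RESOURCES);
        allow(State.WAITING_FOR_RESOURCES, Event.RESOURCES_STABLE, State.EXECUTING);
        allow(State.WAITING_FOR_RESOURCES, Event.CANCEL, State.FINISHED);
        allow(State.WAITING_FOR_RESOURCES, Event.SUSPEND, State.FINISHED);
        allow(State.WAITING_FOR_RESOURCES, Event.NOT_ENOUGH_RESOURCES, State.FINISHED);

        allow(State.EXECUTING, Event.CANCEL, State.CANCELING);
        allow(State.EXECUTING, Event.UNRECOVERABLE_FAULT, State.FAILING);
        allow(State.EXECUTING, Event.SUSPEND, State.FINISHED);
        allow(State.EXECUTING, Event.JOB_REACHED_TERMINAL_STATE, State.FINISHED);
        allow(State.EXECUTING, Event.RECOVERABLE_FAULT, State.RESTARTING);

        allow(State.RESTARTING, Event.SUSPEND, State.FINISHED);
        allow(State.RESTARTING, Event.CANCEL, State.CANCELING);
        allow(State.RESTARTING, Event.CANCELATION_COMPLETE, State.WAITING_FOR_RESOURCES);

        allow(State.CANCELING, Event.CANCELATION_COMPLETE, State.FINISHED);
        allow(State.FAILING, Event.FAILING_COMPLETE, State.FINISHED);
        // FINISHED is terminal and has no outgoing transitions.
    }

    /** Returns the successor state, or empty if the diagram has no such edge. */
    static Optional<State> next(State current, Event event) {
        Map<Event, State> outgoing = TRANSITIONS.get(current);
        if (outgoing == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(outgoing.get(event));
    }
}
```

With this sketch, `next(State.EXECUTING, Event.RECOVERABLE_FAULT)` yields `RESTARTING`, while any event in `FINISHED` yields an empty `Optional`. Keeping the edges in one table makes it easy to compare the code against the diagram edge by edge.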
The states have the following semantics:
...
The scheduler consists of the following services, which the different states use to decide on state transitions and to perform certain operations:
PlantUML |
---|
@startuml
package "Adaptive Scheduler" {
  [SlotAllocator]
  [FailureHandler]
  [ScaleUpController]
}
@enduml |
...
Supporting local failovers is another feature we want to add as a follow-up. With local failover support, the whole job would not have to be restarted on every recoverable fault. One idea is to extend the existing state machine with a new state "Restarting locally":
PlantUML |
---|
@startuml
hide empty description
[*] -> Created
Created --> Waiting : Start scheduling
state "Waiting for resources" as Waiting
state "Restarting globally" as RestartingG
state "Restarting locally" as RestartingL
Waiting --> Waiting : Resources are not stable yet
Waiting --> Executing : Resources are stable
Waiting --> Finished : Cancel, suspend or \nnot enough resources for executing
Executing --> Canceling : Cancel
Executing --> Failing : Unrecoverable fault
Executing --> Finished : Suspend or job reached terminal state
Executing --> RestartingG : Recoverable global fault
Executing --> RestartingL : Recoverable local fault
RestartingL --> Executing : Recovered locally
RestartingL --> RestartingL : Recoverable local fault
RestartingL --> RestartingG : Local recovery timeout
RestartingL --> Canceling : Cancel
RestartingL --> Finished : Suspend
RestartingL --> Failing : Unrecoverable fault
RestartingG --> Finished : Suspend
RestartingG --> Canceling : Cancel
RestartingG --> Waiting : Cancelation complete
Canceling --> Finished : Cancelation complete
Failing --> Finished : Failing complete
Finished -> [*]
@enduml |
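To make the proposed extension concrete, the outgoing edges of the new "Restarting locally" state can be sketched as a single dispatch method. This is only an illustration of the diagram above, assuming hypothetical names (`LocalRestartSketch`, `State`, `Event`, `fromRestartingLocally`). It is not part of any existing implementation.

```java
/** Hypothetical sketch of the outgoing transitions of the proposed "Restarting locally" state. */
final class LocalRestartSketch {

    enum State { EXECUTING, RESTARTING_LOCALLY, RESTARTING_GLOBALLY, CANCELING, FAILING, FINISHED }

    /** Events derived from the edge labels leaving "Restarting locally" in the diagram. */
    enum Event {
        RECOVERED_LOCALLY, RECOVERABLE_LOCAL_FAULT, LOCAL_RECOVERY_TIMEOUT,
        CANCEL, SUSPEND, UNRECOVERABLE_FAULT
    }

    /** Maps an event received while restarting locally to the successor state. */
    static State fromRestartingLocally(Event event) {
        switch (event) {
            case RECOVERED_LOCALLY:
                return State.EXECUTING;
            case RECOVERABLE_LOCAL_FAULT:
                // Another local fault keeps the job in local recovery.
                return State.RESTARTING_LOCALLY;
            case LOCAL_RECOVERY_TIMEOUT:
                // Local recovery took too long: fall back to a global restart.
                return State.RESTARTING_GLOBALLY;
            case CANCEL:
                return State.CANCELING;
            case SUSPEND:
                return State.FINISHED;
            case UNRECOVERABLE_FAULT:
                return State.FAILING;
            default:
                throw new IllegalArgumentException("Unhandled event: " + event);
        }
    }
}
```

The timeout edge is the notable design point: a local recovery that does not complete in time escalates to the global restart path, so the job never stays in "Restarting locally" indefinitely.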
...