
Status | System is fully functional | Meaning | Actions possible for technical clients | Actions possible for human clients

Status: OK (could also be GREEN)
System is fully functional: yes
Meaning: Everything is OK.
Actions possible for technical clients:
  • If the system is not actively used yet, a load balancer might decide to take the system into production after receiving this status for the first time
  • Otherwise no action is needed
Actions possible for human clients:
  • Response logs might still tell a human why the system is currently healthy
    • e.g. they might show 30% disk used, which indicates that no action will be required for a long time

Status: WARN (could also be YELLOW or AMBER)
System is fully functional: yes
Meaning: Tendency to CRITICAL. The system is fully functional, but action is needed to avoid a CRITICAL status in the future.
Actions possible for technical clients:
  • Certain actions can be configured for known, actionable warnings
    • e.g. if disk space is low, it could be extended dynamically using infrastructure APIs (if on virtual infrastructure)
  • Pass information on to the monitoring system so that it is available to humans (in other aggregator UIs)
Actions possible for human clients:
  • Any manual steps that a human can perform, based on their knowledge, to keep the system from reaching the CRITICAL state

Status: CRITICAL (could also be RED)
System is fully functional: no
Meaning: The system is not functional and must not be used.
Actions possible for technical clients:
  • Take the system out of load balancing
  • Decommission the system entirely and re-provision it from scratch
  • Possibly in the future: AI could be used to perform corrective steps (self-healing)
Actions possible for human clients:
  • Any manual steps that a human can perform, based on their knowledge, to bring the system back to the OK state

Status: TEMP_UNAVAILABLE *) (could also be YELLOW or AMBER, but then ambiguous with WARN)
System is fully functional: no
Meaning: Tendency to OK. The system is not functional at the moment but is expected to become OK (or at least WARN) without action. A health check using this status is expected to turn CRITICAL if it keeps returning TEMP_UNAVAILABLE beyond a certain period.
Actions possible for technical clients:
  • Wait until the TEMP_UNAVAILABLE status turns into either
    • OK
    • CRITICAL
Actions possible for human clients:
  • Wait and monitor the result logs of the health check returning TEMP_UNAVAILABLE

Status: HEALTH_CHECK_ERROR
System is fully functional: no
Meaning: Actual status unknown: an error occurred while calculating one of the status values above. Like CRITICAL, but with the hint that the health check probe itself might be the problem (and the system could well be in state OK).
Actions possible for technical clients:
  • Treat exactly the same as CRITICAL
Actions possible for human clients:
  • Fix the health check implementation or configuration to ensure a correct status can be calculated
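
The status table can be read as a small decision model for a technical client such as a load balancer. The following is a minimal sketch in Python under stated assumptions, not an existing health check API: the names `Status`, `is_fully_functional` and `load_balancer_action`, as well as the returned action strings, are hypothetical, and the temporary status is assumed to be called `TEMP_UNAVAILABLE`.

```python
from enum import Enum


class Status(Enum):
    """Health check statuses as listed in the table (names are illustrative)."""
    OK = "OK"
    WARN = "WARN"
    CRITICAL = "CRITICAL"
    TEMP_UNAVAILABLE = "TEMP_UNAVAILABLE"
    HEALTH_CHECK_ERROR = "HEALTH_CHECK_ERROR"


# Column "System is fully functional": only OK and WARN answer "yes".
FULLY_FUNCTIONAL = frozenset({Status.OK, Status.WARN})


def is_fully_functional(status: Status) -> bool:
    return status in FULLY_FUNCTIONAL


def load_balancer_action(status: Status, in_rotation: bool) -> str:
    """Hypothetical mapping of a status to a load balancer action."""
    if status is Status.OK:
        # First OK for a system that is not yet used: it may go to production.
        return "none" if in_rotation else "add_to_rotation"
    if status is Status.WARN:
        # Still functional; pass the warning on so humans can act on it.
        return "notify_monitoring"
    if status is Status.TEMP_UNAVAILABLE:
        # Expected to turn into OK or CRITICAL without intervention.
        return "wait"
    # CRITICAL and HEALTH_CHECK_ERROR are treated exactly the same.
    return "remove_from_rotation"
```

For example, `load_balancer_action(Status.HEALTH_CHECK_ERROR, True)` yields the same action as `Status.CRITICAL`, which mirrors the "treat exactly the same as CRITICAL" rule. Whether a system in the temporary status should also be taken out of rotation while waiting is not stated in the table and is left open here.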

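The expectation that a temporary status eventually turns CRITICAL could be enforced on the client side roughly as follows. This is only a sketch: the class name `TempUnavailableEscalator` and the parameter `grace_period_seconds` are hypothetical, the status name `TEMP_UNAVAILABLE` is assumed, and the table does not specify the length of the period.

```python
from typing import Optional


class TempUnavailableEscalator:
    """Turns a long-lasting TEMP_UNAVAILABLE into CRITICAL (illustrative only)."""

    def __init__(self, grace_period_seconds: float) -> None:
        self.grace_period_seconds = grace_period_seconds
        # Time at which the current run of TEMP_UNAVAILABLE results started.
        self._first_seen: Optional[float] = None

    def effective_status(self, reported: str, now: float) -> str:
        """Return the status a client should act on at time `now` (seconds)."""
        if reported != "TEMP_UNAVAILABLE":
            # Any other status ends the temporary phase and resets the timer.
            self._first_seen = None
            return reported
        if self._first_seen is None:
            self._first_seen = now
        if now - self._first_seen >= self.grace_period_seconds:
            return "CRITICAL"
        return reported
```

In practice `now` could come from `time.monotonic()`. Passing it in explicitly keeps the sketch easy to test and avoids depending on wall-clock time.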