Failure policies are a necessary but rarely codified component of deployed software systems. Command and control (C2) requires a strict, codified scheme for failure policies within agents.
The following nomenclature describes the failure-handling capabilities of Apache NiFi MiNiFi C++ agents.
Policy Name | Policy Description |
---|---|
FAIL | Fails the corresponding component. May result in failure and shutdown of the agent. |
RECOVER | Attempts to recover the affected component. If recovery isn't successful, the policy may further define routes that allow the component to transfer ownership to a new object/component. |
RETRY | Retries the failing operation. May leave the agent's component in a retry loop. |
RETRY_RECOVER | Retries the failing operation and attempts recovery if the retry isn't successful. |
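As a rough illustration of how the four policies differ, the sketch below dispatches a failed operation according to the selected policy. All names (`FailurePolicy`, `Outcome`, `applyPolicy`, `max_retries`) are hypothetical and are not part of the MiNiFi C++ API. The retry count is bounded here only to show how an unbounded RETRY could otherwise loop indefinitely, and the hand-off route that RECOVER may define is left out.

```cpp
#include <cassert>
#include <functional>

// Hypothetical policy names mirroring the table; not the MiNiFi C++ API.
enum class FailurePolicy { FAIL, RECOVER, RETRY, RETRY_RECOVER };

enum class Outcome { Succeeded, Recovered, Failed };

// operation: returns true if the component's operation succeeded.
// recover:   returns true if the component was brought back to a healthy state.
// max_retries bounds RETRY so the component cannot loop forever (an assumption).
Outcome applyPolicy(FailurePolicy policy,
                    const std::function<bool()>& operation,
                    const std::function<bool()>& recover,
                    int max_retries) {
  assert(max_retries >= 0);
  if (operation()) return Outcome::Succeeded;

  switch (policy) {
    case FailurePolicy::FAIL:
      // The caller fails the component, which may shut down the agent.
      return Outcome::Failed;
    case FailurePolicy::RECOVER:
      // A fuller policy could route ownership to a new component on failure.
      return recover() ? Outcome::Recovered : Outcome::Failed;
    case FailurePolicy::RETRY:
    case FailurePolicy::RETRY_RECOVER:
      for (int i = 0; i < max_retries; ++i) {
        if (operation()) return Outcome::Succeeded;
      }
      // Only RETRY_RECOVER falls back to recovery once retries are exhausted.
      if (policy == FailurePolicy::RETRY_RECOVER && recover()) {
        return Outcome::Recovered;
      }
      return Outcome::Failed;
  }
  return Outcome::Failed;
}
```

Modelling recovery as a separate callback keeps RETRY_RECOVER a simple composition of the RETRY and RECOVER paths.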
Configuration
Agents can be configured through the flow, which allows the flow to define a set of update policies that control what can be changed at runtime. FailurePolicies are controller services that are inherently
linked with command and control (but do not require it to be running), allowing real-time changes to how we recover from failure if the update policies allow it. Failure recovery can be defined by specific controls
within failure policies that allow us to recover by resetting state or by instantiating new components to take over the operation.
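A minimal sketch of the interaction described above, assuming that an update policy is simply a set of property names that may change at runtime. `UpdatePolicy`, `FailurePolicyService`, and the property names used in the usage example are hypothetical and do not reflect the actual MiNiFi C++ controller service interface.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>

// Hypothetical update policy: the set of properties that may change at runtime.
class UpdatePolicy {
 public:
  explicit UpdatePolicy(std::set<std::string> allowed) : allowed_(std::move(allowed)) {}
  bool canUpdate(const std::string& property) const { return allowed_.count(property) > 0; }

 private:
  std::set<std::string> allowed_;
};

// Hypothetical failure policy controller service. It works with or without C2;
// a C2 update is just one possible caller of update().
class FailurePolicyService {
 public:
  FailurePolicyService(std::map<std::string, std::string> properties, UpdatePolicy update_policy)
      : properties_(std::move(properties)), update_policy_(std::move(update_policy)) {}

  // Applies a runtime change only if the update policy allows it.
  bool update(const std::string& property, const std::string& value) {
    assert(!property.empty());
    if (!update_policy_.canUpdate(property)) return false;  // rejected
    properties_[property] = value;
    return true;
  }

  // Returns the current value, or an empty string if the property is unset.
  std::string get(const std::string& property) const {
    auto it = properties_.find(property);
    return it == properties_.end() ? std::string() : it->second;
  }

 private:
  std::map<std::string, std::string> properties_;
  UpdatePolicy update_policy_;
};
```

For example, a flow might allow the retry count to be tuned remotely while keeping the policy type fixed, so `update("Max Retries", "5")` succeeds while `update("Policy", "FAIL")` is rejected.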
To simplify the configuration, the initial set of FailurePolicies will support recovery from hardware failures affecting repositories and from loss of network connectivity. Network connectivity problems are typically managed via backpressure; however,
if a network failure occurs we may want some operations to continue, so a failure policy can recover by clearing queues or prioritizing connections.
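The two network recovery options can be sketched as follows. This assumes a simplified connection model in which each connection knows whether it feeds a network-bound component. `Connection`, `NetworkRecovery`, and `recoverFromNetworkFailure` are hypothetical names, not MiNiFi C++ types. Note that clearing queues drops the queued data, whereas prioritizing only changes the order in which connections are serviced.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Hypothetical, simplified model of a connection between two components.
struct Connection {
  std::string name;
  bool feeds_network_component;    // e.g. a remote transmit processor
  int priority;                    // higher values are serviced first
  std::deque<std::string> queue;   // queued flow file identifiers
};

enum class NetworkRecovery { ClearQueues, PrioritizeLocal };

// Applies the chosen recovery once a network failure is detected.
// Returns the number of queued flow files that were dropped.
std::size_t recoverFromNetworkFailure(std::vector<Connection>& connections,
                                      NetworkRecovery strategy) {
  assert(!connections.empty());
  std::size_t dropped = 0;
  for (auto& c : connections) {
    if (!c.feeds_network_component) continue;
    if (strategy == NetworkRecovery::ClearQueues) {
      // Lossy: frees queue capacity so backpressure stops stalling upstream work.
      dropped += c.queue.size();
      c.queue.clear();
    } else {
      // Demote network-bound connections so local work continues first.
      c.priority = 0;
    }
  }
  // Keep the connection list ordered by priority, highest first.
  std::stable_sort(connections.begin(), connections.end(),
                   [](const Connection& a, const Connection& b) { return a.priority > b.priority; });
  return dropped;
}
```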
Repositories can recover by flushing write-ahead logs (WALs), replacing WAL data, and/or converting to fully volatile repositories.
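One way to combine these repository options is an escalating chain: try the least disruptive step first and fall back to the next only if the repository is still unhealthy. The ordering is an assumption (the text says "and/or"), and `RecoveryStep` and `recoverRepository` are hypothetical names rather than MiNiFi C++ APIs.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical recovery step for a repository.
struct RecoveryStep {
  std::string name;
  std::function<bool()> attempt;  // returns true if the repository is healthy afterwards
};

// Tries each step in order and stops at the first one that succeeds.
// Returns the name of the successful step, or an empty string if all fail.
std::string recoverRepository(const std::vector<RecoveryStep>& steps) {
  for (const auto& step : steps) {
    assert(step.attempt);  // every step must have an action
    if (step.attempt()) return step.name;
  }
  return "";
}
```

A typical chain would be "flush WAL", then "replace WAL data", then "convert to volatile", so that the agent only gives up durability when the less drastic options have failed.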
Work for this effort will be completed in MINIFICPP-545.