Description
PR oxidecomputer/omicron#6503 implemented automatic restarts of instances in the `Failed` state. This change introduced some additional instance state that should be exposed to users. In particular:
- When a `Failed` instance is automatically restarted, a cooldown timer is started for that instance. If that instance fails again while the cooldown period is still active, it will not be automatically restarted again until the cooldown period has elapsed.
- Some instances may be configured with auto-restart policies that do not permit them to be restarted when they are `Failed`.
New fields were added to the external-API instance message to report state related to automatic restarts. Instances now have an `auto_restart_enabled: boolean` field that indicates whether their auto-restart policy permits restarting the instance, and an `auto_restart_cooldown_expiration: string` field representing the date and time at which the cooldown period will have completed (allowing the instance to be restarted again). See: https://github.com/oxidecomputer/omicron/blob/45813be40b62167eff75333c410515e8bee24211/openapi/nexus.json#L15094-L15104
This data should probably be exposed to users: if an instance is in the `Failed` state, the user will want to know why it has not yet been automatically restarted, whether it will ever be automatically restarted, and if it will, when that will happen. We probably only need to display this information for instances which are `Failed`. If a `Failed` instance has `auto_restart_enabled` set to `false`, we should tell the user that auto-restart is disabled for that instance. Otherwise, if there is an `auto_restart_cooldown_expiration` timestamp, we should tell the user that the instance will be restarted only after that time. If `auto_restart_enabled` is not `false` and there is no `auto_restart_cooldown_expiration` timestamp, then the instance will be automatically restarted; we might want to indicate that as well.
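
The three cases above could be sketched roughly as follows. This is only an illustration, not the console's actual code: the `InstanceAutoRestartView` type, the `autoRestartMessage` helper, the `run_state` field name, the `"failed"` value, and the message wording are all assumptions. Only `auto_restart_enabled` and `auto_restart_cooldown_expiration` come from the API change described here. Treating an already-elapsed cooldown timestamp the same as a missing one is also an assumption.

```typescript
// Hypothetical subset of the instance view. Only the two auto_restart_*
// fields are taken from the issue; `run_state` and its "failed" value are
// assumed, and a generated client may rename fields (e.g. to camelCase).
type InstanceAutoRestartView = {
  run_state: string
  auto_restart_enabled: boolean
  auto_restart_cooldown_expiration?: string | null
}

// Returns the text to show for an instance, or null when nothing should be
// displayed (the instance is not Failed).
function autoRestartMessage(
  instance: InstanceAutoRestartView,
  now: Date = new Date()
): string | null {
  // Only Failed instances need auto-restart information.
  if (instance.run_state !== "failed") return null

  // Case 1: the policy does not permit restarting this instance.
  if (instance.auto_restart_enabled === false) {
    return "Auto-restart is disabled for this instance."
  }

  // Case 2: a cooldown is active, so the restart waits until it expires.
  // Assumption: a timestamp in the past (or unparseable) is ignored.
  const expiration = instance.auto_restart_cooldown_expiration
  if (expiration) {
    const when = new Date(expiration)
    if (!Number.isNaN(when.getTime()) && when.getTime() > now.getTime()) {
      return `This instance will be restarted only after ${when.toISOString()}.`
    }
  }

  // Case 3: enabled and no active cooldown.
  return "This instance will be automatically restarted."
}
```

Passing `now` as a parameter keeps the helper deterministic, so the cooldown case can be exercised without depending on the wall clock.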