Introduce a repair scan to fix failing clusters #304
Conversation
A repair is a sync scan that acts only on those clusters that indicate that the last add, update or sync operation on them has failed. It is supposed to kick in more frequently than the sync scan. The sync scan still remains useful to fix the consequences of external actions (e.g. someone deletes a postgres-related service by mistake) that happen unbeknownst to the operator. The repair scan is controlled by the new repair_period parameter in the operator configuration. It has to run at least twice as frequently as the sync scan to have any effect (a normal sync scan updates both the last-synced and last-repaired attributes of the controller, since a repair is just a sync underneath). A repair scan could be queued for a cluster that is already being synced if the sync period exceeds the interval between repairs. In that case the repair event is discarded once the corresponding worker finds out that the cluster is no longer failing.
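To make the timing relationship concrete, here is a minimal Go sketch of the decision the list function has to make; `decideScan`, its arguments and the scan kinds are hypothetical names for illustration, not the operator's actual API.

```go
package main

import (
	"fmt"
	"time"
)

// scanKind is the type of scan the list function decides to queue.
type scanKind string

const (
	scanNone   scanKind = ""       // neither period has elapsed yet
	scanSync   scanKind = "sync"   // full sync over all clusters
	scanRepair scanKind = "repair" // sync restricted to failing clusters
)

// decideScan picks the scan type based on how much time has passed since
// the last sync and the last repair. A sync also counts as a repair, so a
// full sync resets both timestamps; a repair_period shorter than half the
// resync_period is what makes extra repair runs happen in between syncs.
func decideScan(lastSync, lastRepair time.Time, resyncPeriod, repairPeriod time.Duration, now time.Time) scanKind {
	if now.Sub(lastSync) >= resyncPeriod {
		return scanSync
	}
	if now.Sub(lastRepair) >= repairPeriod {
		return scanRepair
	}
	return scanNone
}

func main() {
	now := time.Now()
	// 10 minutes since the last full sync, 6 minutes since the last repair:
	// with resync_period=30m and repair_period=5m only a repair is due.
	kind := decideScan(now.Add(-10*time.Minute), now.Add(-6*time.Minute), 30*time.Minute, 5*time.Minute, now)
	fmt.Println("scan to queue:", kind) // scan to queue: repair
}
```

With the documented defaults (`resync_period` of 30m, `repair_period` of 5m), roughly every sixth pass is a full sync and the passes in between touch only the failing clusters.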
Show status of the latest operation on the cluster in the logs.
# Conflicts:
#	pkg/controller/postgresql.go
Document this option and the concept of the repair scan.
docs/index.md
Outdated
This is triggered by either the `sync scan`, running every `resync_period`
seconds for every cluster, or by the `repair scan`, coming every
`repair_period` only for those clusters that didn't report success as a
result of the last operation running on them.
This paragraph is better suited for the administrator docs; I wrote the Intro with the intention of providing a very high-level overview of the operator's capabilities without any overly technical details.
Good point, moved to the end of the admin guide
  period between consecutive sync requests. The default is `30m`.

* **repair_period**
  period between consecutive repair requests. The default is `5m`.
it is probably worth mentioning here a shortened version of this PR description
The parameter reference doesn't serve the goal of explaining the underlying concepts behind those parameters.
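For readers who reach the parameter reference first, a hedged sketch of how the two durations might be read from the operator configuration; the struct, function and key handling below are illustrative assumptions, not the operator's actual configuration code. The defaults match the documentation quoted above.

```go
package main

import (
	"fmt"
	"time"
)

// scanConfig holds the two scan intervals discussed above. Field and key
// names here are illustrative, not the operator's real configuration struct.
type scanConfig struct {
	ResyncPeriod time.Duration // full sync over every cluster
	RepairPeriod time.Duration // repair pass over failing clusters only
}

// parseScanConfig reads the two durations from a string map (as they could
// arrive from a ConfigMap), falling back to the documented defaults.
func parseScanConfig(raw map[string]string) (scanConfig, error) {
	cfg := scanConfig{ResyncPeriod: 30 * time.Minute, RepairPeriod: 5 * time.Minute}
	if v, ok := raw["resync_period"]; ok {
		d, err := time.ParseDuration(v)
		if err != nil {
			return cfg, fmt.Errorf("invalid resync_period %q: %v", v, err)
		}
		cfg.ResyncPeriod = d
	}
	if v, ok := raw["repair_period"]; ok {
		d, err := time.ParseDuration(v)
		if err != nil {
			return cfg, fmt.Errorf("invalid repair_period %q: %v", v, err)
		}
		cfg.RepairPeriod = d
	}
	return cfg, nil
}

func main() {
	cfg, _ := parseScanConfig(map[string]string{"repair_period": "2m"})
	fmt.Println(cfg.ResyncPeriod, cfg.RepairPeriod) // 30m0s 2m0s
}
```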
pkg/controller/postgresql.go
Outdated
// TODO: make a separate function to be called from InitSharedInformers
// clusterListFunc obtains a list of all PostgreSQL clusters and runs sync when necessary
// NB: as this function is called directly by the informer, it needs to avoid acquiring locks
// on individual cluster structures. Therefore, it acts on the maifests obtained from Kubernetes
maNifests :)
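Since the comment above describes a constraint rather than showing it, here is a small illustrative Go sketch of a list function that works purely on the manifests returned by the API server and consults only a lightweight failure map under its own mutex; all names (`controller`, `listAndQueue`, `failing`) are hypothetical and not taken from the operator's code.

```go
package main

import (
	"fmt"
	"sync"
)

// clusterManifest stands in for the Postgres manifest returned by the API
// server; only the name matters for queueing.
type clusterManifest struct{ Name string }

// controller keeps a lightweight failure map under its own mutex, so the
// list function never has to lock the heavyweight per-cluster structures
// that workers may be operating on.
type controller struct {
	mu      sync.RWMutex
	failing map[string]bool
	queue   chan string
}

// listAndQueue iterates over the manifests from the API server and, for a
// repair pass, skips every cluster whose last operation did not fail.
func (c *controller) listAndQueue(manifests []clusterManifest, repairOnly bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	for _, m := range manifests {
		if repairOnly && !c.failing[m.Name] {
			continue
		}
		c.queue <- m.Name
	}
}

func main() {
	c := &controller{failing: map[string]bool{"acid-batman": true}, queue: make(chan string, 8)}
	c.listAndQueue([]clusterManifest{{"acid-batman"}, {"acid-robin"}}, true)
	close(c.queue)
	for name := range c.queue {
		fmt.Println("queued repair for", name) // only acid-batman is queued
	}
}
```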
pkg/controller/postgresql.go
Outdated
if event != "" {
	c.queueEvents(&list, event)
} else {
	c.logger.Infof("not enough passed since the last sync (%s seconds) or repair (%s seconds)", timeFromPreviousSync, timeFromPreviousRepair)
not enough time?
	return
}
lg.Debugf("Observed cluster status %s, running sync scan to repair the cluster", lastOperationStatus)
event.EventType = spec.EventSync
so this is the "under-the-hood" point where the repair scan turns into a sync scan?
Right, once the operator verifies that the cluster status held in its memory indicates the need for repair actions, it continues with the sync scan.
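A minimal sketch of that worker-side check, assuming hypothetical event and status names; it only illustrates the discard-or-demote decision described above, not the operator's actual types.

```go
package main

import "fmt"

// eventType mirrors the two event kinds relevant here; names are illustrative.
type eventType string

const (
	eventRepair eventType = "repair"
	eventSync   eventType = "sync"
)

// clusterStatus is the in-memory status of the last operation on a cluster.
type clusterStatus string

const (
	statusRunning    clusterStatus = "Running"
	statusSyncFailed clusterStatus = "SyncFailed"
)

// handleRepair shows the "under-the-hood" point discussed above: a repair
// event is dropped if the cluster is healthy again, otherwise it is demoted
// to an ordinary sync event and processed by the usual sync code path.
func handleRepair(ev eventType, status clusterStatus) (eventType, bool) {
	if ev != eventRepair {
		return ev, true
	}
	if status == statusRunning {
		return ev, false // cluster recovered in the meantime; discard the event
	}
	return eventSync, true // failing cluster: the repair proceeds as a sync
}

func main() {
	ev, process := handleRepair(eventRepair, statusSyncFailed)
	fmt.Println(ev, process) // sync true
	ev, process = handleRepair(eventRepair, statusRunning)
	fmt.Println(ev, process) // repair false
}
```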
Move the repair and sync description into the admin guide. Address typos in the comments and omissions in the error messages.
pkg/controller/postgresql.go
Outdated
	return &list, err
}

// queueSyncEvents adds a sync event for every cluster with the valid manifest to the queue.
leftover comment? afaik there is no queueSyncEvents in this file
thanks, fixed
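For context on what a `queueEvents`-style helper typically does in such a controller, a hedged Go sketch follows; the hashing scheme and all names are assumptions for illustration, not the operator's implementation.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// workerID picks a stable worker index for a cluster name so that all events
// for the same cluster land on the same queue and are processed in order.
// The hashing scheme here is illustrative, not the operator's exact one.
func workerID(clusterName string, numWorkers uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(clusterName))
	return h.Sum32() % numWorkers
}

// queueEvents distributes one event per cluster across the per-worker queues.
func queueEvents(clusters []string, eventType string, queues []chan string) {
	for _, name := range clusters {
		id := workerID(name, uint32(len(queues)))
		queues[id] <- fmt.Sprintf("%s event for %s", eventType, name)
	}
}

func main() {
	queues := []chan string{make(chan string, 4), make(chan string, 4)}
	queueEvents([]string{"acid-batman", "acid-robin"}, "sync", queues)
	for i, q := range queues {
		close(q)
		for msg := range q {
			fmt.Printf("worker %d: %s\n", i, msg)
		}
	}
}
```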