Introduce a repair scan to fix failing clusters #304
Conversation
A repair is a sync scan that acts only on those clusters that indicate that the last add, update or sync operation on them has failed. It is supposed to kick in more frequently than the sync scan. The sync scan still remains useful to fix the consequences of external actions (e.g. someone deletes a postgres-related service by mistake) that happen unbeknownst to the operator. The repair scan is controlled by the new repair_period parameter in the operator configuration. It has to run at least twice as frequently as the sync scan to have any effect (a normal sync scan updates both the last-synced and last-repaired attributes of the controller, since a repair is just a sync underneath). A repair scan could be queued for a cluster that is already being synced if the sync period exceeds the interval between repairs. In that case the repair event is discarded once the corresponding worker finds out that the cluster is no longer failing.
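To make the timing relationship concrete, here is a minimal Go sketch of the decision the list function has to make; `decideScan`, its arguments and the scan kinds are hypothetical names for illustration, not the operator's actual API.

```go
package main

import (
	"fmt"
	"time"
)

// scanKind is the type of scan the list function decides to queue.
type scanKind string

const (
	scanNone   scanKind = ""       // neither period has elapsed yet
	scanSync   scanKind = "sync"   // full sync over all clusters
	scanRepair scanKind = "repair" // sync restricted to failing clusters
)

// decideScan picks the scan type based on how much time has passed since
// the last sync and the last repair. A sync also counts as a repair, so a
// full sync resets both timestamps; a repair_period shorter than half the
// resync_period is what makes extra repair runs happen in between syncs.
func decideScan(lastSync, lastRepair time.Time, resyncPeriod, repairPeriod time.Duration, now time.Time) scanKind {
	if now.Sub(lastSync) >= resyncPeriod {
		return scanSync
	}
	if now.Sub(lastRepair) >= repairPeriod {
		return scanRepair
	}
	return scanNone
}

func main() {
	now := time.Now()
	// 10 minutes since the last full sync, 6 minutes since the last repair:
	// with resync_period=30m and repair_period=5m only a repair is due.
	kind := decideScan(now.Add(-10*time.Minute), now.Add(-6*time.Minute), 30*time.Minute, 5*time.Minute, now)
	fmt.Println("scan to queue:", kind) // scan to queue: repair
}
```

With the documented defaults (`resync_period` of 30m, `repair_period` of 5m), roughly every sixth pass is a full sync and the passes in between touch only the failing clusters.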
Show status of the latest operation on the cluster in the logs.
# Conflicts:
#	pkg/controller/postgresql.go
Document this option and the concept of the repair scan.
docs/index.md
Outdated
This is triggered by either the `sync scan`, running every `resync_period`
seconds for every cluster, or by the `repair scan`, coming every
`repair_period` only for those clusters that didn't report success as a
result of the last operation running on them.
This paragraph is better suited for the administrator docs; I wrote the Intro with the intention of providing a very high-level overview of the operator's capabilities without any overly technical details.
Good point, moved to the end of the admin guide
  period between consecutive sync requests. The default is `30m`.

* **repair_period**
  period between consecutive repair requests. The default is `5m`.
it is probably worth mentioning here a shortened version of this PR description
The parameter reference doesn't serve the goal of explaining the underlying concepts behind those parameters.
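For readers who reach the parameter reference first, a hedged sketch of how the two durations might be read from the operator configuration; the struct, function and key handling below are illustrative assumptions, not the operator's actual configuration code. The defaults match the documentation quoted above.

```go
package main

import (
	"fmt"
	"time"
)

// scanConfig holds the two scan intervals discussed above. Field and key
// names here are illustrative, not the operator's real configuration struct.
type scanConfig struct {
	ResyncPeriod time.Duration // full sync over every cluster
	RepairPeriod time.Duration // repair pass over failing clusters only
}

// parseScanConfig reads the two durations from a string map (as they could
// arrive from a ConfigMap), falling back to the documented defaults.
func parseScanConfig(raw map[string]string) (scanConfig, error) {
	cfg := scanConfig{ResyncPeriod: 30 * time.Minute, RepairPeriod: 5 * time.Minute}
	if v, ok := raw["resync_period"]; ok {
		d, err := time.ParseDuration(v)
		if err != nil {
			return cfg, fmt.Errorf("invalid resync_period %q: %v", v, err)
		}
		cfg.ResyncPeriod = d
	}
	if v, ok := raw["repair_period"]; ok {
		d, err := time.ParseDuration(v)
		if err != nil {
			return cfg, fmt.Errorf("invalid repair_period %q: %v", v, err)
		}
		cfg.RepairPeriod = d
	}
	return cfg, nil
}

func main() {
	cfg, _ := parseScanConfig(map[string]string{"repair_period": "2m"})
	fmt.Println(cfg.ResyncPeriod, cfg.RepairPeriod) // 30m0s 2m0s
}
```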
pkg/controller/postgresql.go
Outdated
// TODO: make a separate function to be called from InitSharedInformers
// clusterListFunc obtains a list of all PostgreSQL clusters and runs sync when necessary
// NB: as this function is called directly by the informer, it needs to avoid acquiring locks
// on individual cluster structures. Therefore, it acts on the maifests obtained from Kubernetes
maNifests :)
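Since the comment above describes a constraint rather than showing it, here is a small illustrative Go sketch of a list function that works purely on the manifests returned by the API server and consults only a lightweight failure map under its own mutex; all names (`controller`, `listAndQueue`, `failing`) are hypothetical and not taken from the operator's code.

```go
package main

import (
	"fmt"
	"sync"
)

// clusterManifest stands in for the Postgres manifest returned by the API
// server; only the name matters for queueing.
type clusterManifest struct{ Name string }

// controller keeps a lightweight failure map under its own mutex, so the
// list function never has to lock the heavyweight per-cluster structures
// that workers may be operating on.
type controller struct {
	mu      sync.RWMutex
	failing map[string]bool
	queue   chan string
}

// listAndQueue iterates over the manifests from the API server and, for a
// repair pass, skips every cluster whose last operation did not fail.
func (c *controller) listAndQueue(manifests []clusterManifest, repairOnly bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	for _, m := range manifests {
		if repairOnly && !c.failing[m.Name] {
			continue
		}
		c.queue <- m.Name
	}
}

func main() {
	c := &controller{failing: map[string]bool{"acid-batman": true}, queue: make(chan string, 8)}
	c.listAndQueue([]clusterManifest{{"acid-batman"}, {"acid-robin"}}, true)
	close(c.queue)
	for name := range c.queue {
		fmt.Println("queued repair for", name) // only acid-batman is queued
	}
}
```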
pkg/controller/postgresql.go
Outdated
if event != "" {
	c.queueEvents(&list, event)
} else {
	c.logger.Infof("not enough passed since the last sync (%s seconds) or repair (%s seconds)", timeFromPreviousSync, timeFromPreviousRepair)
not enough time?
	return
}
lg.Debugf("Observed cluster status %s, running sync scan to repair the cluster", lastOperationStatus)
event.EventType = spec.EventSync
so this is the "under-the-hood" point where the repair scan turns into a sync scan?
Right, once the operator verifies that the cluster status held in its memory indicates the need for repair actions, it continues with the sync scan.
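A minimal sketch of that worker-side check, assuming hypothetical event and status names; it only illustrates the discard-or-demote decision described above, not the operator's actual types.

```go
package main

import "fmt"

// eventType mirrors the two event kinds relevant here; names are illustrative.
type eventType string

const (
	eventRepair eventType = "repair"
	eventSync   eventType = "sync"
)

// clusterStatus is the in-memory status of the last operation on a cluster.
type clusterStatus string

const (
	statusRunning    clusterStatus = "Running"
	statusSyncFailed clusterStatus = "SyncFailed"
)

// handleRepair shows the "under-the-hood" point discussed above: a repair
// event is dropped if the cluster is healthy again, otherwise it is demoted
// to an ordinary sync event and processed by the usual sync code path.
func handleRepair(ev eventType, status clusterStatus) (eventType, bool) {
	if ev != eventRepair {
		return ev, true
	}
	if status == statusRunning {
		return ev, false // cluster recovered in the meantime; discard the event
	}
	return eventSync, true // failing cluster: the repair proceeds as a sync
}

func main() {
	ev, process := handleRepair(eventRepair, statusSyncFailed)
	fmt.Println(ev, process) // sync true
	ev, process = handleRepair(eventRepair, statusRunning)
	fmt.Println(ev, process) // repair false
}
```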
Move the repair and sync description into the admin guide. Address typos in the comments and omissions in the error messages.
pkg/controller/postgresql.go
Outdated
	return &list, err
}

// queueSyncEvents adds a sync event for every cluster with the valid manifest to the queue.
leftover comment? afaik there is no queueSyncEvents in this file
thanks, fixed
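For context on what a `queueEvents`-style helper typically does in such a controller, a hedged Go sketch follows; the hashing scheme and all names are assumptions for illustration, not the operator's implementation.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// workerID picks a stable worker index for a cluster name so that all events
// for the same cluster land on the same queue and are processed in order.
// The hashing scheme here is illustrative, not the operator's exact one.
func workerID(clusterName string, numWorkers uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(clusterName))
	return h.Sum32() % numWorkers
}

// queueEvents distributes one event per cluster across the per-worker queues.
func queueEvents(clusters []string, eventType string, queues []chan string) {
	for _, name := range clusters {
		id := workerID(name, uint32(len(queues)))
		queues[id] <- fmt.Sprintf("%s event for %s", eventType, name)
	}
}

func main() {
	queues := []chan string{make(chan string, 4), make(chan string, 4)}
	queueEvents([]string{"acid-batman", "acid-robin"}, "sync", queues)
	for i, q := range queues {
		close(q)
		for msg := range q {
			fmt.Printf("worker %d: %s\n", i, msg)
		}
	}
}
```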