
Conversation

@alexeyklyukin
Contributor

A repair is a sync scan that acts only on those clusters that indicate
that the last add, update or sync operation on them has failed. It is
supposed to kick in more frequently than the sync scan. The sync scan
still remains useful for fixing the consequences of external actions
(e.g. someone deleting a postgres-related service by mistake) that
happen unbeknownst to the operator.

The repair scan is controlled by the new repair_period parameter in the
operator configuration. It has to run at least twice as frequently as
the sync scan to have any effect (a normal sync scan updates both the
last-synced and last-repaired attributes of the controller, since a
repair is just a sync underneath).
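
For illustration, a minimal Go sketch (not the operator's actual code) of how a scheduler might use the two periods; the defaults below match the documented `resync_period` of `30m` and `repair_period` of `5m`, while the function and argument names are invented for the example.

```go
package main

import (
	"fmt"
	"time"
)

// Documented defaults: resync_period 30m, repair_period 5m.
const (
	resyncPeriod = 30 * time.Minute
	repairPeriod = 5 * time.Minute
)

// nextScan decides which scan, if any, is due. A full sync refreshes both
// the last-synced and last-repaired timestamps, which is why repair_period
// only has an effect when it is noticeably shorter than resync_period.
func nextScan(lastSync, lastRepair time.Time) string {
	now := time.Now()
	switch {
	case now.Sub(lastSync) >= resyncPeriod:
		return "sync" // full sync of every cluster
	case now.Sub(lastRepair) >= repairPeriod:
		return "repair" // acts only on clusters whose last operation failed
	default:
		return "none"
	}
}

func main() {
	// 12 minutes after the last sync and 6 minutes after the last repair,
	// only the repair scan is due.
	now := time.Now()
	fmt.Println(nextScan(now.Add(-12*time.Minute), now.Add(-6*time.Minute))) // repair
}
```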

A repair scan could be queued for a cluster that is already being synced
if the sync period exceeds the interval between repairs. In that case the
repair event will be discarded once the corresponding worker finds out
that the cluster is no longer failing.
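
A hedged sketch of that worker-side check; the status constants and function name are hypothetical, and only the idea that a repair event is dropped for a healthy cluster and otherwise proceeds as a normal sync comes from this PR.

```go
package main

import "fmt"

// Hypothetical in-memory cluster statuses; the operator keeps the outcome of
// the last add, update or sync operation per cluster.
type clusterStatus string

const (
	statusRunning      clusterStatus = "Running"
	statusCreateFailed clusterStatus = "CreateFailed"
	statusUpdateFailed clusterStatus = "UpdateFailed"
	statusSyncFailed   clusterStatus = "SyncFailed"
)

// handleRepairEvent shows what a worker might do with a queued repair event:
// drop it if the cluster is no longer failing (e.g. a regular sync fixed it
// in the meantime), otherwise run it as an ordinary sync scan.
func handleRepairEvent(lastOperationStatus clusterStatus) string {
	switch lastOperationStatus {
	case statusCreateFailed, statusUpdateFailed, statusSyncFailed:
		return "sync" // the repair event proceeds as a sync under the hood
	default:
		return "discard" // nothing to repair anymore
	}
}

func main() {
	fmt.Println(handleRepairEvent(statusSyncFailed)) // sync
	fmt.Println(handleRepairEvent(statusRunning))    // discard
}
```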

alexeyklyukin and others added 4 commits May 24, 2018 18:09
@sdudoladov
Member

👍

@coveralls

coveralls commented May 25, 2018

Coverage Status

Coverage increased (+0.007%) to 4.619% when pulling 1684835 on repairs into 3a9378d on master.

@alexeyklyukin
Contributor Author

👍

docs/index.md Outdated
This is triggered by either the `sync scan`, running every `resync_period`
seconds for every cluster, or by the `repair scan`, coming every
`repair_period` only for those clusters that didn't report success as a
result of the last operation running on them.
Member

This paragraph is a better fit for the administrator docs; I wrote the Intro with the intention of providing a very high-level overview of the operator's capabilities without any overly technical details.

Contributor Author

Good point, moved to the end of the admin guide

period between consecutive sync requests. The default is `30m`.

* **repair_period**
period between consecutive repair requests. The default is `5m`.
Member

it is probably worth mentioning a shortened version of this PR description here

Contributor Author

The parameter reference doesn't serve the goal of explaining the underlying concepts behind those parameters.

// TODO: make a separate function to be called from InitSharedInformers
// clusterListFunc obtains a list of all PostgreSQL clusters and runs sync when necessary
// NB: as this function is called directly by the informer, it needs to avoid acquiring locks
// on individual cluster structures. Therefore, it acts on the maifests obtained from Kubernetes
Member

maNifests :)

if event != "" {
c.queueEvents(&list, event)
} else {
c.logger.Infof("not enough passed since the last sync (%s seconds) or repair (%s seconds)", timeFromPreviousSync, timeFromPreviousRepair)
Member

not enough time ?

return
}
lg.Debugf("Observed cluster status %s, running sync scan to repair the cluster", lastOperationStatus)
event.EventType = spec.EventSync
Member

so this is the "under-the-hood" point where the repair scan turns into a sync scan?

Contributor Author

Right, once the operator verifies that the cluster status in its memory indicates the need for repair actions, it continues with the sync scan.

Move the repair and sync description into the admin guide.
Address typos in the comments and omissions in the error messages.
@alexeyklyukin
Contributor Author

👍

return &list, err
}

// queueSyncEvents adds a sync event for every cluster with the valid manifest to the queue.
Member

leftover comment ? afaik there is no queueSyncEvent in this file

Contributor Author

thanks, fixed

@sdudoladov
Member

👍

@alexeyklyukin
Contributor Author

👍

1 similar comment
@sdudoladov
Member

👍

