Re: Hot Standby Design

Lists: pgsql-hackers
From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Hot Standby Design
Date: 2008-09-23 18:15:34
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Hot Standby design has been growing on the PostgreSQL Wiki for some
weeks now.

I've updated the design to reflect all feedback received so far on
various topics.

http://wiki.postgresql.org/wiki/Hot_Standby

It's not hugely detailed in every area, but it gives a flavour of the
topics and issues related to them.

Comments or questions welcome here, or I will discuss specific areas in
more detail as I tackle those topics.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Hot Standby Design
Date: 2008-09-24 04:30:22
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Tuesday 23 September 2008 14:15:34 Simon Riggs wrote:
> Hot Standby design has been growing on the PostgreSQL Wiki for some
> weeks now.
>
> I've updated the design to reflect all feedback received so far on
> various topics.
>
> http://wiki.postgresql.org/wiki/Hot_Standby
>
> It's not hugely detailed in every area, but it gives a flavour of the
> topics and issues related to them.
>
> Comments or questions welcome here, or I will discuss specific areas in
> more detail as I tackle those topics.
>

very nice work. very in depth. unfortunatly, this means it is long and the
hour is late... so here are some scattered thoughts i had while reading it :

* "However, some WAL redo actions will be for DDL actions. These DDL actions
are repeating actions that have already committed on the primary node, so
they must not fail on the standby node. These DDL locks take priority and
will automatically cancel any read-only transactions that get in their way."

Some people will want the option to stack-up ddl transactions behind
long-running queries (one of the main use cases of a hot slave is reporting
stye and other long running queries)

* Actions not allowed on Standby are:
DML - Insert, Update, Delete, COPY FROM, Truncate

copy from suprised me a bit, since it is something i could see people wanting
to do... did you mean COPY TO in this case?

* Statspack functions should work OK, so tools such as pgAdminIII should work.
pgAgent will not.

I presume this means the backend kill function would work? Also, can you go
into why pgAgent would not work? (I presume it's because you can't update
information that needs to be changed when jobs run, if thats the case it
might be worth thinking about making pgAgent work across different clusters)

* Looking for suggestions about what monitoring capability will be required.

one place to start might be to think about which checks in check_nagios might
still apply. Possibly also looking to systems like slony for some
guidence... for example everyone will want some way to check how far the lag
is on a stnadby machine.

* The following commands will not be accepted during recovery mode
GRANT, REVOKE, REASSIGN

How is user management done on a standby? can you have users that dont exist
on the primary (it would seem not).

... more to come i'm sure, but fading out... thanks again for the work so far
Simon.

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Hot Standby Design
Date: 2008-09-24 07:27:44
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers


On Wed, 2008-09-24 at 00:30 -0400, Robert Treat wrote:

> here are some scattered thoughts i had while reading it :

Thanks for your comments.

It is very detailed, so further feedback is going to be very beneficial
in making this all work in the way we hope and expect.

> * "However, some WAL redo actions will be for DDL actions. These DDL actions
> are repeating actions that have already committed on the primary node, so
> they must not fail on the standby node. These DDL locks take priority and
> will automatically cancel any read-only transactions that get in their way."
>
> Some people will want the option to stack-up ddl transactions behind
> long-running queries (one of the main use cases of a hot slave is reporting
> stye and other long running queries)

Scheduling tools can help here. Run set of queries on Standby, then when
complete run DDL on Primary and have its changes filter through.

> * Actions not allowed on Standby are:
> DML - Insert, Update, Delete, COPY FROM, Truncate
>
> copy from suprised me a bit, since it is something i could see people wanting
> to do... did you mean COPY TO in this case?

I checked...

COPY TO is allowed, since it extracts data. So pg_dump will run.
COPY FROM will not be allowed since it loads data. So pg_restore will
not run.

> * Statspack functions should work OK, so tools such as pgAdminIII should work.
> pgAgent will not.
>
> I presume this means the backend kill function would work?

Yes it will.

> Also, can you go
> into why pgAgent would not work? (I presume it's because you can't update
> information that needs to be changed when jobs run, if thats the case it
> might be worth thinking about making pgAgent work across different clusters)

Yes

> * Looking for suggestions about what monitoring capability will be required.
>
> one place to start might be to think about which checks in check_nagios might
> still apply. Possibly also looking to systems like slony for some
> guidence... for example

check_pgsql will work, but its very simple.

check_postgres will also work, though many some actions could give
different or confusing results. e.g. last vacuum time will not be
maintained for example, since no vacuum occurs on the standby.

slony won't run on the standby either, so those checks won't work
either.

> everyone will want some way to check how far the lag
> is on a stnadby machine.

Agreed

> * The following commands will not be accepted during recovery mode
> GRANT, REVOKE, REASSIGN
>
> How is user management done on a standby? can you have users that dont
> exist
> on the primary (it would seem not).

No you can't have different users. (In time, I see this as a good thing
because it will allow us to move queries around to different nodes for
execution so that a single database acts like a "hive mind".)

Nearly everything must be set via the Primary. Users, passwords.
Changes to .conf files will be possible. There is no mechanism to auto
synchronise the .conf files between nodes.

The Standby is a Clone and not a Slave. A Slave is a separate database
that is forced to duplicate the actions of the Master. The Standby is an
exact copy, in every detail that matters.

I can see it might be desirable to have it work that way, but that's not
going to happen in the first release.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: "Dave Page" <dpage(at)pgadmin(dot)org>
To: "Robert Treat" <xzilla(at)users(dot)sourceforge(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org, "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Subject: Re: Hot Standby Design
Date: 2008-09-24 07:28:21
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Sep 24, 2008 at 5:30 AM, Robert Treat
<xzilla(at)users(dot)sourceforge(dot)net> wrote:

> I presume this means the backend kill function would work? Also, can you go
> into why pgAgent would not work? (I presume it's because you can't update
> information that needs to be changed when jobs run, if thats the case it
> might be worth thinking about making pgAgent work across different clusters)

Probably - it needs to be able to update its catalogs to record job
results and ensure that only a single node runs any given job
instance.

We could probably update the code to optionally accept a connection
string instead of a database name, which would allow us to operate
against servers other than the one with the catalogs installed.

--
Dave Page
EnterpriseDB UK: http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Dave Page <dpage(at)pgadmin(dot)org>
Cc: Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Hot Standby Design
Date: 2008-09-24 07:37:36
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers


On Wed, 2008-09-24 at 08:28 +0100, Dave Page wrote:
> On Wed, Sep 24, 2008 at 5:30 AM, Robert Treat
> <xzilla(at)users(dot)sourceforge(dot)net> wrote:
>
> > I presume this means the backend kill function would work? Also, can you go
> > into why pgAgent would not work? (I presume it's because you can't update
> > information that needs to be changed when jobs run, if thats the case it
> > might be worth thinking about making pgAgent work across different clusters)
>
> Probably - it needs to be able to update its catalogs to record job
> results and ensure that only a single node runs any given job
> instance.
>
> We could probably update the code to optionally accept a connection
> string instead of a database name, which would allow us to operate
> against servers other than the one with the catalogs installed.

It would be useful to be able to schedule jobs against multiple nodes.

Not much need for maintenance actions on a Standby node, but I think if
I say "there aren't any needed at all" that would likely be wrong.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Hot Standby Design
Date: 2008-09-24 16:35:28
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Wednesday 24 September 2008 03:27:44 Simon Riggs wrote:
> On Wed, 2008-09-24 at 00:30 -0400, Robert Treat wrote:
> > here are some scattered thoughts i had while reading it :
>
> Thanks for your comments.
>
> It is very detailed, so further feedback is going to be very beneficial
> in making this all work in the way we hope and expect.
>
> > * "However, some WAL redo actions will be for DDL actions. These DDL
> > actions are repeating actions that have already committed on the primary
> > node, so they must not fail on the standby node. These DDL locks take
> > priority and will automatically cancel any read-only transactions that
> > get in their way."
> >
> > Some people will want the option to stack-up ddl transactions behind
> > long-running queries (one of the main use cases of a hot slave is
> > reporting stye and other long running queries)
>
> Scheduling tools can help here. Run set of queries on Standby, then when
> complete run DDL on Primary and have its changes filter through.
>

true... i am just reminded of Oracle 8's log replay, where you had to stop
replay to run any queries... given that was a usefull feature, I suspect
we'll hear complaints about it not going that way. I think one could argue
that we might be able to provide such an option in the future, if not in the
first release.

> > * Actions not allowed on Standby are:
> > DML - Insert, Update, Delete, COPY FROM, Truncate
> >
> > copy from suprised me a bit, since it is something i could see people
> > wanting to do... did you mean COPY TO in this case?
>
> I checked...
>
> COPY TO is allowed, since it extracts data. So pg_dump will run.
> COPY FROM will not be allowed since it loads data. So pg_restore will
> not run.
>

ah, of course... the late hour had me thinking backwards. pg_dump on the
clone will be a big plus.

<snip>
> > * Looking for suggestions about what monitoring capability will be
> > required.
> >
> > one place to start might be to think about which checks in check_nagios
> > might still apply. Possibly also looking to systems like slony for some
> > guidence... for example
>
> check_pgsql will work, but its very simple.
>
> check_postgres will also work, though many some actions could give
> different or confusing results. e.g. last vacuum time will not be
> maintained for example, since no vacuum occurs on the standby.
>

yep.

> slony won't run on the standby either, so those checks won't work
> either.
>
> > everyone will want some way to check how far the lag
> > is on a stnadby machine.
>
> Agreed
>

right... my thought with slony was, what do people monitor on thier slony
slaves? (there is actually a nagios script for that iirc)

<snip>
> The Standby is a Clone and not a Slave. A Slave is a separate database
> that is forced to duplicate the actions of the Master. The Standby is an
> exact copy, in every detail that matters.
>

This is an interesting clarification.

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Hot Standby Design
Date: 2008-09-24 16:51:40
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers


On Wed, 2008-09-24 at 12:35 -0400, Robert Treat wrote:
> On Wednesday 24 September 2008 03:27:44 Simon Riggs wrote:
> > On Wed, 2008-09-24 at 00:30 -0400, Robert Treat wrote:
> > > * "However, some WAL redo actions will be for DDL actions. These DDL
> > > actions are repeating actions that have already committed on the primary
> > > node, so they must not fail on the standby node. These DDL locks take
> > > priority and will automatically cancel any read-only transactions that
> > > get in their way."
> > >
> > > Some people will want the option to stack-up ddl transactions behind
> > > long-running queries (one of the main use cases of a hot slave is
> > > reporting stye and other long running queries)
> >
> > Scheduling tools can help here. Run set of queries on Standby, then when
> > complete run DDL on Primary and have its changes filter through.
> >
>
> true... i am just reminded of Oracle 8's log replay, where you had to stop
> replay to run any queries... given that was a usefull feature, I suspect
> we'll hear complaints about it not going that way. I think one could argue
> that we might be able to provide such an option in the future, if not in the
> first release.

It probably sounds worse than it is. If you drop a table on the Primary,
then somebody running a query against it on the Standby is in for a
shock. But on the other hand, why are you dropping a table that people
still consider worth reading? DROP TABLE should be carefully researched
before use, so I don't think its a big problem.

We could make it wait for a while before cancelling, as an option, if
you think its important?

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support