From 62f54780fdcfafc1ccf694a32bc6ab24f2508222 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 12 May 2014 15:48:44 +0300
Subject: [PATCH] Update readme. Fix locking in GetOldestSnapshotLSN

---
 src/backend/access/transam/README   | 143 +++++++++++-----------------
 src/backend/storage/ipc/procarray.c |   6 +-
 2 files changed, 62 insertions(+), 87 deletions(-)

diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 3a32471e95..77cab9fcc2 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -244,41 +244,37 @@ transaction Y as committed, then snapshot A must consider transaction Y as
 committed".
 
 What we actually enforce is strict serialization of commits and rollbacks
-with snapshot-taking: we do not allow any transaction to exit the set of
-running transactions while a snapshot is being taken.  (This rule is
-stronger than necessary for consistency, but is relatively simple to
-enforce, and it assists with some other issues as explained below.)  The
-implementation of this is that GetSnapshotData takes the ProcArrayLock in
-shared mode (so that multiple backends can take snapshots in parallel),
-but ProcArrayEndTransaction must take the ProcArrayLock in exclusive mode
-while clearing MyPgXact->xid at transaction end (either commit or abort).
-
-ProcArrayEndTransaction also holds the lock while advancing the shared
-latestCompletedXid variable.  This allows GetSnapshotData to use
-latestCompletedXid + 1 as xmax for its snapshot: there can be no
-transaction >= this xid value that the snapshot needs to consider as
-completed.
-
-In short, then, the rule is that no transaction may exit the set of
-currently-running transactions between the time we fetch latestCompletedXid
-and the time we finish building our snapshot.  However, this restriction
-only applies to transactions that have an XID --- read-only transactions
-can end without acquiring ProcArrayLock, since they don't affect anyone
-else's snapshot nor latestCompletedXid.
-
-Transaction start, per se, doesn't have any interlocking with these
-considerations, since we no longer assign an XID immediately at transaction
-start.  But when we do decide to allocate an XID, GetNewTransactionId must
-store the new XID into the shared ProcArray before releasing XidGenLock.
-This ensures that all top-level XIDs <= latestCompletedXid are either
-present in the ProcArray, or not running anymore.  (This guarantee doesn't
-apply to subtransaction XIDs, because of the possibility that there's not
-room for them in the subxid array; instead we guarantee that they are
-present or the overflow flag is set.)  If a backend released XidGenLock
-before storing its XID into MyPgXact, then it would be possible for another
-backend to allocate and commit a later XID, causing latestCompletedXid to
-pass the first backend's XID, before that value became visible in the
-ProcArray.  That would break GetOldestXmin, as discussed below.
+with snapshot-taking.  We use the LSNs generated by write-ahead logging as
+a convenient monotonically increasing counter to serialize commits with
+snapshots.  Each commit is naturally assigned an LSN: the LSN of its commit
+WAL record.  A snapshot is also represented by an LSN; all commits whose
+commit record's LSN is <= the snapshot's LSN are considered visible to the
+snapshot.  Therefore, acquiring a snapshot is a matter of reading the
+current WAL insert location.
+
+That means that we need to be able to look up the commit LSN of each
+transaction, by XID. For that purpose, we store the commit LSN of each
+transaction in the commit log (clog). However, storing the LSN in the
+clog is not atomic with writing the WAL record, hence it's possible that
+another backend takes a snapshot right after the commit, but sees the
+transaction as in-progress in the clog, even though the commit record was
+written before the snapshot was taken. To close that race condition, just
+before writing the commit WAL record, the committing backend sets the
+clog entry to a special value, COMMITLSN_COMMITTING. It is replaced with
+the commit record's LSN after the WAL record has been written. When a
+backend looks up a transaction's commit LSN in the clog and sees
+COMMITLSN_COMMITTING, it must wait for the commit to finish, by calling
+XactLockTableWait(). That's quite heavy-weight, but the race should
+happen rarely.
+
+So, a snapshot is simply an LSN, such that all transactions that committed
+before that LSN are visible, and everything later is still considered
+as in-progress. However, to avoid consulting the clog every time the
+visibility of a tuple is checked, we also record a lower and upper bound of
+the XIDs considered visible by the snapshot, in SnapshotData. When a snapshot
+is taken, xmax is set to the current nextXid value; any transaction that
+begins after the snapshot is surely still running. The xmin is tracked
+lazily in shared memory, by AdvanceRecentGlobalXmin().
 
 We allow GetNewTransactionId to store the XID into MyPgXact->xid (or the
 subxid array) without taking ProcArrayLock.  This was once necessary to
@@ -290,43 +286,29 @@ once, rather than assume they can read it multiple times and get the same
 answer each time.  (Use volatile-qualified pointers when doing this, to
 ensure that the C compiler does exactly what you tell it to.)
 
-Another important activity that uses the shared ProcArray is GetOldestXmin,
-which must determine a lower bound for the oldest xmin of any active MVCC
-snapshot, system-wide.  Each individual backend advertises the smallest
-xmin of its own snapshots in MyPgXact->xmin, or zero if it currently has no
+Another important activity that uses the shared ProcArray is
+GetOldestSnapshotLSN, which must determine a lower bound for the oldest active
+MVCC snapshot, system-wide.  Each individual backend advertises the earliest
+of its own snapshots in MyPgXact->snapshotlsn, or zero if it currently has no
 live snapshots (eg, if it's between transactions or hasn't yet set a
-snapshot for a new transaction).  GetOldestXmin takes the MIN() of the
-valid xmin fields.  It does this with only shared lock on ProcArrayLock,
-which means there is a potential race condition against other backends
-doing GetSnapshotData concurrently: we must be certain that a concurrent
-backend that is about to set its xmin does not compute an xmin less than
-what GetOldestXmin returns.  We ensure that by including all the active
-XIDs into the MIN() calculation, along with the valid xmins.  The rule that
-transactions can't exit without taking exclusive ProcArrayLock ensures that
-concurrent holders of shared ProcArrayLock will compute the same minimum of
-currently-active XIDs: no xact, in particular not the oldest, can exit
-while we hold shared ProcArrayLock.  So GetOldestXmin's view of the minimum
-active XID will be the same as that of any concurrent GetSnapshotData, and
-so it can't produce an overestimate.  If there is no active transaction at
-all, GetOldestXmin returns latestCompletedXid + 1, which is a lower bound
-for the xmin that might be computed by concurrent or later GetSnapshotData
-calls.  (We know that no XID less than this could be about to appear in
-the ProcArray, because of the XidGenLock interlock discussed above.)
-
-GetSnapshotData also performs an oldest-xmin calculation (which had better
-match GetOldestXmin's) and stores that into RecentGlobalXmin, which is used
-for some tuple age cutoff checks where a fresh call of GetOldestXmin seems
-too expensive.  Note that while it is certain that two concurrent
-executions of GetSnapshotData will compute the same xmin for their own
-snapshots, as argued above, it is not certain that they will arrive at the
-same estimate of RecentGlobalXmin.  This is because we allow XID-less
-transactions to clear their MyPgXact->xmin asynchronously (without taking
-ProcArrayLock), so one execution might see what had been the oldest xmin,
-and another not.  This is OK since RecentGlobalXmin need only be a valid
-lower bound.  As noted above, we are already assuming that fetch/store
-of the xid fields is atomic, so assuming it for xmin as well is no extra
-risk.
-
+snapshot for a new transaction).  GetOldestSnapshotLSN takes the MIN() of
+the advertised snapshot LSNs.
+
+For freezing tuples, vacuum needs to know the oldest XID that is still
+considered running by any active transaction. That is, the oldest XID still
+considered running by the oldest active snapshot, as returned by
+GetOldestSnapshotLSN(). This value is somewhat expensive to calculate, so
+the most recently calculated value is kept in shared memory
+(ShmemVariableCache->recentXmin), and is recalculated lazily by the
+AdvanceRecentGlobalXmin() function. AdvanceRecentGlobalXmin() first scans
+the proc array, and makes note of the oldest active XID. That XID - 1 will
+become the new xmin. It then waits until all currently active snapshots have
+finished. Any snapshot that begins later will see the xmin as finished, so
+after all the active snapshots have finished, xmin will be visible to
+everyone. However, AdvanceRecentGlobalXmin() does not actually block waiting
+for anything; instead it contains a state machine that advances, if possible,
+each time the function is called. AdvanceRecentGlobalXmin() is called
+periodically by the WAL writer, so that the value doesn't get very stale.
 
 pg_clog and pg_subtrans
 -----------------------
@@ -340,21 +322,10 @@ from disk.  They also allow information to be permanent across server restarts.
 
 pg_clog records the commit status for each transaction that has been assigned
 an XID.  A transaction can be in progress, committed, aborted, or
-"sub-committed".  This last state means that it's a subtransaction that's no
-longer running, but its parent has not updated its state yet.  It is not
-necessary to update a subtransaction's transaction status to subcommit, so we
-can just defer it until main transaction commit.  The main role of marking
-transactions as sub-committed is to provide an atomic commit protocol when
-transaction status is spread across multiple clog pages. As a result, whenever
-transaction status spreads across multiple pages we must use a two-phase commit
-protocol: the first phase is to mark the subtransactions as sub-committed, then
-we mark the top level transaction and all its subtransactions committed (in
-that order).  Thus, subtransactions that have not aborted appear as in-progress
-even when they have already finished, and the subcommit status appears as a
-very short transitory state during main transaction commit.  Subtransaction
-abort is always marked in clog as soon as it occurs.  When the transaction
-status all fit in a single CLOG page, we atomically mark them all as committed
-without bothering with the intermediate sub-commit state.
+"committing". For committed transactions, the clog stores the commit WAL
+record's LSN. The "committing" state means that the transaction is just
+about to write its commit WAL record, or just did so, but it hasn't yet
+updated the clog with the record's LSN.
 
 Savepoints are implemented using subtransactions.  A subtransaction is a
 transaction inside a transaction; its commit or abort status is not only
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index f254095f21..15a433be0d 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -662,7 +662,11 @@ GetOldestSnapshotLSN(Relation rel, bool ignoreVacuum)
 
 	result = GetXLogInsertRecPtr();
 
-	LWLockAcquire(ProcArrayLock, LW_SHARED);
+	/*
+	 * Take an exclusive lock to ensure that no-one is in the process of
+	 * taking a snapshot while we scan the array.
+	 */
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
-- 
2.39.5