git.postgresql.org Git - postgresql.git/log

Enhance nbtree ScalarArrayOp execution.

Commit 9e8da0f7 taught nbtree to handle ScalarArrayOpExpr quals
natively.  This works by pushing down the full context (the array keys)
to the nbtree index AM, enabling it to execute multiple primitive index
scans that the planner treats as one continuous index scan/index path.
This earlier enhancement enabled nbtree ScalarArrayOp index-only scans.
It also allowed scans with ScalarArrayOp quals to return ordered results
(with some notable restrictions, described further down).

Take this general approach a lot further: teach nbtree SAOP index scans
to decide how to execute ScalarArrayOp scans (when and where to start
the next primitive index scan) based on physical index characteristics.
This can be far more efficient.  All SAOP scans will now reliably avoid
duplicative leaf page accesses (just like any other nbtree index scan).
SAOP scans whose array keys are naturally clustered together now require
far fewer index descents, since we'll reliably avoid starting a new
primitive scan just to get to a later offset from the same leaf page.

The scan's arrays now advance using binary searches for the array
element that best matches the next tuple's attribute value.  Required
scan key arrays (i.e. arrays from scan keys that can terminate the scan)
ratchet forward in lockstep with the index scan.  Non-required arrays
(i.e. arrays from scan keys that can only exclude non-matching tuples)
"advance" without the process ever rolling over to a higher-order array.

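The array advancement described above can be pictured with a small
standalone sketch.  This is not nbtree code: array_advance() and the
plain int array are hypothetical stand-ins for a sorted SAOP array and
a tuple attribute value, and only a forward scan is considered.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not taken from nbtree): given a sorted array of
 * SAOP constants and the next tuple's attribute value, return the index
 * of the first element >= tupval, or nelems when every element is
 * smaller than tupval.
 */
static size_t
array_advance(const int *elems, size_t nelems, int tupval)
{
    size_t      lo = 0;
    size_t      hi = nelems;

    assert(elems != NULL || nelems == 0);

    while (lo < hi)
    {
        size_t      mid = lo + (hi - lo) / 2;

        if (elems[mid] < tupval)
            lo = mid + 1;       /* best match lies to the right */
        else
            hi = mid;           /* elems[mid] is still a candidate */
    }
    return lo;
}
```

In this sketch, a result equal to nelems means the array is exhausted.
For a required array that is the point where a scan could roll over to
a higher-order array or end; a non-required array would simply fail to
match the current tuple.
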
Naturally, only required SAOP scan keys trigger skipping over leaf pages
(non-required arrays cannot safely end or start primitive index scans).
Consequently, even index scans of a composite index with a high-order
inequality scan key (which we'll mark required) and a low-order SAOP
scan key (which we won't mark required) now avoid repeating leaf page
accesses -- that benefit isn't limited to simpler equality-only cases.
In general, all nbtree index scans now output tuples as if they were one
continuous index scan -- even scans that mix a high-order inequality
with lower-order SAOP equalities reliably output tuples in index order.
This allows us to remove a couple of special cases that were applied
when building index paths with SAOP clauses during planning.

Bugfix commit 807a40c5 taught the planner to avoid generating unsafe
path keys: path keys on a multicolumn index path, with a SAOP clause on
any attribute beyond the first/most significant attribute.  These cases
are now all safe, so we go back to generating path keys without regard
for the presence of SAOP clauses (just like with any other clause type).
Affected queries can now exploit scan output order in all the usual ways
(e.g., certain "ORDER BY ... LIMIT n" queries can now terminate early).

Also undo changes from follow-up bugfix commit a4523c5a, which taught
the planner to produce alternative index paths, with path keys, but
without low-order SAOP index quals (filter quals were used instead).
We'll no longer generate these alternative paths, since they can no
longer offer any meaningful advantages over standard index qual paths.
Affected queries thereby avoid all of the disadvantages that come from
using filter quals within index scan nodes.  They can avoid extra heap
page accesses from using filter quals to exclude non-matching tuples
(index quals will never have that problem).  They can also skip over
irrelevant sections of the index in more cases (though only when nbtree
determines that starting another primitive scan actually makes sense).

There is a theoretical risk that removing restrictions on SAOP index
paths from the planner will break compatibility with amcanorder-based
index AMs maintained as extensions.  Such an index AM could have the
same limitations around ordered SAOP scans as nbtree had up until now.
Adding a pro forma incompatibility item about the issue to the Postgres
17 release notes seems like a good idea.

Author: Peter Geoghegan <[email protected]>
Author: Matthias van de Meent <[email protected]>
Reviewed-By: Heikki Linnakangas <[email protected]>
Reviewed-By: Matthias van de Meent <[email protected]>
Reviewed-By: Tomas Vondra <[email protected]>
Discussion: https://postgr.es/m/CAH2-Wz=ksvN_sjcnD1+Bt-WtifRA5ok48aDYnq3pkKhxgMQpcw@mail.gmail.com

Remove obsolete comment in CopyReadLineText().

When this bit of commentary was written, it was alluding to the
fact that we looked for newlines and EOD markers in the raw
(not yet encoding-converted) input data. We don't do that anymore,
preferring to batch the conversion of larger chunks of input and
split it into lines later. Hence there's no longer any need for
assumptions about the relevant characters being encoding-invariant,
and we should remove this comment saying we assume that.

Discussion: https://postgr.es/m/1461688.1712347668@sss.pgh.pa.us

Speed up tail processing when hashing aligned C strings, take two

After encountering the NUL terminator, the word-at-a-time loop exits
and we must hash the remaining bytes. Previously we calculated
the terminator's position and re-loaded the remaining bytes from
the input string. This was slower than the unaligned case for very
short strings. We already have all the data we need in a register,
so let's just mask off the bytes we need and hash them immediately.
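
As a rough illustration of masking the tail in a register, here is a
standalone sketch.  It is not the committed code: load_word(),
zero_byte_flags() and mask_before_nul() are hypothetical names, the
zero-byte detector is a generic bit trick rather than whatever the
patch uses, and the byte-order test and __builtin_ctzll/clzll assume
GCC or Clang.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Load 8 bytes in platform byte order (stand-in for an aligned load). */
static uint64_t
load_word(const void *p)
{
    uint64_t    w;

    memcpy(&w, p, sizeof(w));
    return w;
}

/* Set 0x80 in exactly those bytes of v that are zero. */
static uint64_t
zero_byte_flags(uint64_t v)
{
    const uint64_t low7 = UINT64_C(0x7F7F7F7F7F7F7F7F);

    return ~(((v & low7) + low7) | v | low7);
}

/* Keep only the bytes that precede the first NUL in memory order. */
static uint64_t
mask_before_nul(uint64_t chunk)
{
    uint64_t    flags = zero_byte_flags(chunk);
    int         nbytes;

    if (flags == 0)
        return chunk;           /* no terminator in this word */

#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    nbytes = __builtin_clzll(flags) / 8;
    assert(nbytes >= 0 && nbytes < 8);
    return nbytes == 0 ? 0 : chunk & (~UINT64_C(0) << (64 - 8 * nbytes));
#else
    nbytes = __builtin_ctzll(flags) / 8;
    assert(nbytes >= 0 && nbytes < 8);
    return nbytes == 0 ? 0 : chunk & (~UINT64_C(0) >> (64 - 8 * nbytes));
#endif
}
```

The masked word equals what a byte-at-a-time load of the remaining
bytes into a zeroed word would produce, so it can be fed straight to
the hash without re-reading the input.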

In addition to endianness issues, the previous attempt upset valgrind
in the way it computed the mask. Whether by accident or by wisdom,
the author's proposed method passes locally with valgrind 3.22.

Ants Aasma, with cosmetic adjustments by me

Discussion: https://postgr.es/m/CANwKhkP7pCiW_5fAswLhs71-JKGEz1c1%2BPC0a_w1fwY4iGMqUA%40mail.gmail.com

Teach fasthash_accum to use platform endianness for bytewise loads

This function previously used a mix of word-wise loads and bytewise
loads. The bytewise loads happened to be little-endian regardless of
platform. This in itself is not a problem. However, a future commit
will require the same result whether A) the input is loaded as a
word with the relevant bytes masked off, or B) the input is loaded
one byte at a time.

While at it, improve debuggability of the internal hash state.

Discussion: https://postgr.es/m/CANWCAZZpuV1mES1mtSpAq8tWJewbrv4gEz6R_k4gzNG8GZ5gag%40mail.gmail.com

Increase default vacuum_buffer_usage_limit to 2MB.

The BAS_VACUUM ring size has been 256kB since commit d526575f introduced
the mechanism 17 years ago.  Commit 1cbbee03 recently made it
configurable but retained the traditional default.  The correct default
size has been debated for years, but 256kB is certainly very small.
VACUUM soon needs to write back data it dirtied only 32 blocks ago,
which usually requires flushing the WAL.  New experiments in prefetching
pages for VACUUM exacerbated the problem by crashing into dirty data
even sooner.  Let's make the default 2MB.  That's 1.6% of the default
toy buffer pool size, and 0.2% of 1GB, which would be considered a
small shared_buffers setting for a real system these days.  Users are
still free to set the GUC to a different value.
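
The figures above can be checked with a little arithmetic.  The sketch
below assumes the default 8kB block size and a 128MB default
shared_buffers; ring_blocks() and ring_percent() are hypothetical
helpers, not PostgreSQL functions.

```c
#include <assert.h>

#define BLCKSZ 8192             /* assumed default block size */

/* Number of buffers in a ring of the given size (in kB). */
static int
ring_blocks(int ring_kb)
{
    assert(ring_kb > 0);
    return ring_kb * 1024 / BLCKSZ;
}

/* Ring size as a percentage of shared_buffers (given in MB). */
static double
ring_percent(int ring_kb, int shared_buffers_mb)
{
    assert(shared_buffers_mb > 0);
    return 100.0 * ring_kb / (shared_buffers_mb * 1024.0);
}
```

With these assumptions a 256kB ring holds 32 blocks and a 2MB ring
holds 256, and 2MB works out to about 1.56% of 128MB and about 0.2% of
1GB.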

Reviewed-by: Andres Freund <[email protected]>
Discussion: https://postgr.es/m/20240403221257.md4gfki3z75cdyf6%40awork3.anarazel.de
Discussion: https://postgr.es/m/CA%2BhUKGLY4Q4ZY4f1rvnFtv6%2BPkjNf8MejdPkcju3Qii9DYqqcQ%40mail.gmail.com

Allow BufferAccessStrategy to limit pin count.

While pinning extra buffers to look ahead, users of strategies are in
danger of using too many buffers. For some strategies, that means
"escaping" from the ring, and in others it means forcing dirty data to
disk very frequently with associated WAL flushing. Since external code
has no insight into any of that, allow individual strategy types to
expose a clamp that should be applied when deciding how many buffers to
pin at once.

Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Melanie Plageman <[email protected]>
Discussion: https://postgr.es/m/CAAKRu_aJXnqsyZt6HwFLnxYEBgE17oypkxbKbT1t1geE_wvH2Q%40mail.gmail.com

Convert uses of hash_string_pointer to fasthash equivalent

Remove duplicate hash_string_pointer() function definitions by creating
a new inline function hash_string() for this purpose.

This has the added advantage of avoiding strlen() calls when doing hash
lookup. It's not clear how many of these are performance-sensitive
enough to benefit from that, but the simplification is worth it on
its own.

Reviewed by Jeff Davis

Discussion: https://postgr.es/m/CANWCAZbg_XeSeY0a_PqWmWqeRATvzTzUNYRLeT%2Bbzs%2BYQdC92g%40mail.gmail.com

Add macro to disable address safety instrumentation

fasthash_accum_cstring_aligned() uses a technique, found in various
strlen() implementations, to detect a string's NUL terminator by
reading a word at a time. That triggers failures when testing with
"-fsanitize=address", at least with frontend code. To enable using
this function anywhere, add a function attribute macro to disable
such testing.

Reviewed by Jeff Davis

Discussion: https://postgr.es/m/CANWCAZbwvp7oUEkbw-xP4L0_S_WNKq-J-ucP4RCNDPJnrakUPw%40mail.gmail.com

Fix incorrect return type

fasthash32() calculates a 32-bit hashcode, but the return
type was uint64. Change to uint32.

Noted by Jeff Davis

Discussion: https://postgr.es/m/b16c93e6c736a422d4de668343515375664eb05d.camel%40j-davis.com

Improve read_stream.c's fast path.

The "fast path" for well cached scans that don't do any I/O was
accidentally coded in a way that could only be triggered by pg_prewarm's
usage pattern, which starts out with a higher distance because of the
flags it passes in. We want it to work for streaming sequential scans
too, once that patch is committed. Adjust.

Reviewed-by: Melanie Plageman <[email protected]>
Discussion: https://postgr.es/m/CA%2BhUKGKXZALJ%3D6aArUsXRJzBm%3Dqvc4AWp7%3DiJNXJQqpbRLnD_w%40mail.gmail.com

Fix headerscheck violation introduced in f8ce4ed78ca

Per ci.

Silence some compiler warnings in commit 3311ea86ed

Per report from Nathan Bossart

Fix incorrect calculation in BlockRefTableEntryGetBlocks.

The previous formula was incorrect in the case where the function's
nblocks argument was a multiple of BLOCKS_PER_CHUNK, which happens
whenever a relation segment file is exactly 512MB or exactly 1GB in
length. In such cases, the formula would calculate a stop_offset of
0 rather than 65536, resulting in modified blocks in the second half
of a 1GB file, or all the modified blocks in a 512MB file, being
omitted from the incremental backup.
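
One way such an off-by-a-chunk error can arise is a plain modulo on the
block count.  The sketch below is an assumption-laden illustration, not
the actual source: both function names are hypothetical, and
BLOCKS_PER_CHUNK is taken to be 65536 as implied by the numbers above.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCKS_PER_CHUNK 65536u /* value implied by the text above */

/* Naive stop offset within the last chunk: wraps to 0 on exact multiples. */
static uint32_t
last_chunk_stop_modulo(uint32_t nblocks)
{
    return nblocks % BLOCKS_PER_CHUNK;
}

/* Stop offset in the range 1..BLOCKS_PER_CHUNK for any nblocks > 0. */
static uint32_t
last_chunk_stop_fixed(uint32_t nblocks)
{
    assert(nblocks > 0);
    return (nblocks - 1) % BLOCKS_PER_CHUNK + 1;
}
```

With 8kB blocks a 512MB segment has 65536 blocks and a 1GB segment has
131072, so the modulo form yields 0 for both while the second form
yields 65536.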

Reported off-list by Tomas Vondra and Jakub Wartak.

Discussion: http://postgr.es/m/CA+TgmoYwy_KHp1-5GYNmVa=zdeJWhNH1T0SBmEuvqQNJEHj1Lw@mail.gmail.com

Check HAVE_COPY_FILE_RANGE before calling copy_file_range

Fix a mistake in ac8110155132 - write_reconstructed_file() called
copy_file_range() without properly checking HAVE_COPY_FILE_RANGE.

Reported by several macOS machines. Also reported by cfbot, but I missed
that issue before commit.
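
The general shape of such a guard might look like the sketch below.
copy_range() is a hypothetical helper, not write_reconstructed_file();
the pread()/pwrite() fallback is chosen here only so the example builds
where HAVE_COPY_FILE_RANGE is not defined.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy len bytes from src at src_off to dst at
 * dst_off.  Returns 0 on success, -1 on error or short input.
 */
static int
copy_range(int src, off_t src_off, int dst, off_t dst_off, size_t len)
{
    assert(src >= 0 && dst >= 0);

    while (len > 0)
    {
        ssize_t     n;

#if defined(HAVE_COPY_FILE_RANGE)
        /* The kernel advances both offsets through the pointers. */
        n = copy_file_range(src, &src_off, dst, &dst_off, len, 0);
        if (n <= 0)
            return -1;
#else
        char        buf[8192];
        size_t      want = len < sizeof(buf) ? len : sizeof(buf);

        n = pread(src, buf, want, src_off);
        if (n <= 0)
            return -1;
        if (pwrite(dst, buf, (size_t) n, dst_off) != n)
            return -1;
        src_off += n;
        dst_off += n;
#endif
        len -= (size_t) n;
    }
    return 0;
}
```

Keeping the call inside the #if block means platforms without
copy_file_range() never see a reference to it at compile time.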

Allow using copy_file_range in write_reconstructed_file

This commit allows using copy_file_range() for efficient combining of
data from multiple files, instead of simply reading/writing the blocks.
Depending on the filesystem and other factors (size of the increment,
distribution of modified blocks etc.) this may be faster than the
block-by-block copy, but more importantly it enables various features
provided by CoW filesystems.

If a checksum needs to be calculated for the file, the same strategy as
when copying whole files is used - copy_file_range is used to copy the
blocks, but the file is also read for the checksum calculation.

While the checksum calculation is rarely needed when cloning whole
files, when reconstructing the files from multiple backups it needs to
happen almost always (the only exception is when the user specified
--no-manifest).

Author: Tomas Vondra
Reviewed-by: Thomas Munro, Jakub Wartak, Robert Haas
Discussion: https://postgr.es/m/3024283a-7491-4240-80d0-421575f6bb23%40enterprisedb.com

Make libpqsrv_cancel's return const char *, not char *

Per headerscheck's C++ check.

Discussion: https://postgr.es/m/372769.1712179784@sss.pgh.pa.us

Remove unused variable in checksum_file()

The 'offset' variable was set but otherwise unused.

Per buildfarm animals with clang, e.g. sifaka and longfin.

Allow copying files using clone/copy_file_range

Adds --clone/--copy-file-range options to pg_combinebackup, to allow
copying files using file cloning or copy_file_range(). These methods may
be faster than the standard block-by-block copy, but the main advantage
is that they enable various features provided by CoW filesystems.

This commit only uses these copy methods for files that did not change
and can be copied as a whole from a single backup.

These new copy methods may not be available on all platforms, in which
case the command throws an error (immediately, even if no files would be
copied as a whole). This early failure seems better than failing later
when trying to copy the first file, after performing a lot of work on
earlier files.

If the requested copy method is available, but a checksum needs to be
recalculated (e.g. because of a different checksum type), the file is
still copied using the requested method, but it is also read for the
checksum calculation. Depending on the filesystem this may be more
expensive than just performing the simple copy, but it does enable the
CoW benefits.

Initial patch by Jakub Wartak, various reworks and improvements by me.

Author: Tomas Vondra, Jakub Wartak
Reviewed-by: Thomas Munro, Jakub Wartak, Robert Haas
Discussion: https://postgr.es/m/3024283a-7491-4240-80d0-421575f6bb23%40enterprisedb.com

Suppress "variable may be used uninitialized" warning.

Buildfarm member caiman is showing this, which surprises me because
it's very late-model gcc (14.0.1) and ought to be smart enough to
know that elog(ERROR) doesn't return. But we're likely to see the
same from stupider compilers too, so add a dummy initialization in
our usual style.

docs: Merge separate chapters on built-in index AMs into one.

The documentation index is getting very long, which makes it hard
to find things. Since these chapters are all very similar in structure
and content, merging them is a natural way of reducing the size of
the toplevel index.

Rather than actually combining all of the SGML into a single file,
keep one file per <sect1>, and add a glue file that includes all
of them.

Discussion: http://postgr.es/m/CA+Tgmob7_uoYuS2=rVwpVXaRwP-UXz+++saYTC-BCZ42QzSNKQ@mail.gmail.com

Align blocks in incremental backups to BLCKSZ

Align blocks stored in incremental files to BLCKSZ, so that the
incremental backups work well with CoW filesystems.

The header of the incremental file is padded with \0 to a multiple of
BLCKSZ, so that the block data (also BLCKSZ) is aligned to BLCKSZ. The
padding is added only to files containing block data, so files with just
the header remain small. This adds a bit of extra space, but as the
number of blocks increases the overhead gets negligible very quickly.
And as the padding is \0 bytes, it does compress extremely well.
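
The padding rule can be sketched as a round-up to the next multiple of
BLCKSZ.  padded_header_len() is a hypothetical name, and BLCKSZ is
assumed to be the default 8192.

```c
#include <assert.h>
#include <stddef.h>

#define BLCKSZ 8192             /* assumed default block size */

/*
 * Hypothetical helper: length of the incremental file header after
 * padding.  Files without block data keep their unpadded header.
 */
static size_t
padded_header_len(size_t header_len, unsigned num_blocks)
{
    assert(header_len > 0);

    if (num_blocks == 0)
        return header_len;

    /* Round up to the next multiple of BLCKSZ. */
    return (header_len + BLCKSZ - 1) / BLCKSZ * BLCKSZ;
}
```

Under this rule the padding never exceeds BLCKSZ - 1 bytes per file, so
it shrinks relative to the file size as the number of stored blocks
grows.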

The alignment is important for CoW filesystems that usually require the
blocks to be aligned to filesystem page size for features like block
sharing, deduplication etc. to work well. With the variable sized header
the blocks in the increments were not aligned at all, negating the
benefits of the CoW filesystems.

This matters even for non-CoW filesystems, for example when placed on a
RAID array. If the block is not aligned, it may easily span multiple
devices, causing read and write amplification.

It might be better to align the blocks to the filesystem page, not
BLCKSZ, but we have no good way to determine that. Even if we determine
the page size at the time of taking the backup, the backup may move. For
now the BLCKSZ seems sufficient - the filesystem page is usually 4K, so
the default BLCKSZ (8K) is aligned to that.

Author: Tomas Vondra
Reviewed-by: Robert Haas, Jakub Wartak
Discussion: https://postgr.es/m/3024283a-7491-4240-80d0-421575f6bb23%40enterprisedb.com

Operate XLogCtl->log{Write,Flush}Result with atomics

This removes the need to hold both the info_lck spinlock and
WALWriteLock to update them.  We use stock atomic write instead, with
WALWriteLock held.  Readers can use atomic read, without any locking.
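
The single-writer, lock-free-reader pattern can be sketched with C11
atomics.  This is only an analogy: PostgreSQL has its own atomics API,
and WalProgress and the wal_progress_*() functions below are
hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef struct WalProgress
{
    _Atomic uint64_t write_result;  /* last position written */
    _Atomic uint64_t flush_result;  /* last position flushed */
} WalProgress;

/*
 * Writer side.  The caller is assumed to hold the lock that serializes
 * writers, so plain atomic stores are enough.
 */
static void
wal_progress_update(WalProgress *p, uint64_t write_pos, uint64_t flush_pos)
{
    assert(flush_pos <= write_pos);
    atomic_store(&p->write_result, write_pos);
    atomic_store(&p->flush_result, flush_pos);
}

/* Reader side: no lock is taken. */
static uint64_t
wal_progress_written(WalProgress *p)
{
    return atomic_load(&p->write_result);
}

static uint64_t
wal_progress_flushed(WalProgress *p)
{
    return atomic_load(&p->flush_result);
}
```

Each value is read independently here, so a reader that needs a
consistent pair has to take care over the order in which it reads them.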

This allows for some code to be reordered: some places were a bit
contorted to avoid repeated spinlock acquisition, but that's no longer a
concern, so we can turn them to more natural coding.  Some further
changes are possible (maybe yielding performance wins), but in this commit I
did rather minimal ones only, to avoid increasing the blast radius.

Reviewed-by: Bharath Rupireddy <[email protected]>
Reviewed-by: Jeff Davis <[email protected]>
Reviewed-by: Andres Freund <[email protected]> (earlier versions)
Discussion: https://postgr.es/m/20200831182156 [email protected]

Allow synced slots to have their inactive_since.

This commit does two things:
1) Maintains inactive_since for sync slots whenever the slot is released
just like any other regular slot.

2) Ensures the value is set to the current timestamp during the promotion
of standby to help correctly interpret the time after promotion. We don't
want the slots to appear inactive for a long time after promotion if they
haven't been synchronized recently. This would also avoid the invalidation
of such slots immediately after promotion if tomorrow we have a feature
that invalidates slots based on their inactivity time. Whoever acquires
the slot (i.e. makes the slot active) will reset it to NULL.

Author: Bharath Rupireddy
Reviewed-by: Bertrand Drouvot, Amit Kapila, Shveta Malik, Masahiko Sawada
Discussion: https://postgr.es/m/CAA4eK1KrPGwfZV9LYGidjxHeW+rxJ=E2ThjXvwRGLO=iLNuo=Q@mail.gmail.com
Discussion: https://postgr.es/m/CALj2ACW4aUe-_uFQOjdWCEN-xXoLGhmvRFnL8SNw_TZ5nJe+aw@mail.gmail.com
Discussion: https://postgr.es/m/CA+Tgmob_Ta-t2ty8QrKHBGnNLrf4ZYcwhGHGFsuUoFrAEDw4sA@mail.gmail.com

Add "ABI_compatibility" regions to wait_event_names.txt

The current design behind the automatic generation of the C code and
documentation related to wait events introduced in fa88928470b5 does not
offer a way to attach new wait events without breaking ABI
compatibility, as all the events are forcibly reordered for each section
in the input file wait_event_names.txt.  Adding new wait events to
stable branches is something that has happened in the past, 0b6517a3b79a
being a recent example of that with VERSION_FILE_SYNC, so we need a way
to generate any C code for wait events while maintaining compatibility
on stable branches already released.

This commit solves this issue by adding a new region called
"ABI_compatibility" (keyword could be updated to something else if
someone had a better idea) to each section of wait_event_names.txt, so
that one can add new wait events to stable branches in
wait_event_names.txt while keeping the code ABI-compatible.
"ABI_compatibility" has no impact on the documentation generated: all
the wait events of one section are still alphabetically ordered.  LWLock
and Lock sections generate their C code elsewhere, so they do not need
an "ABI_compatibility" region.

For example, let's imagine a wait_event_names.txt like this:
Section: ClassName - Foo
FOO_1 "Waiting in Foo 1"
FOO_2 "Waiting in Foo 2"
ABI_compatibility:
NEW_FOO_1 "Waiting in New Foo 1"
NEW_BAR_1 "Waiting in New Bar 1"

This results in the following enum, where the events in the ABI region
are listed last with the same ordering as in wait_event_names.txt:
typedef enum
{
    WAIT_EVENT_FOO_1,
    WAIT_EVENT_FOO_2,
    WAIT_EVENT_NEW_FOO_1,
    WAIT_EVENT_NEW_BAR_1
} WaitEventFoo;

New wait events added in stable branches should be added at the end of
each ABI_compatibility region, and ABI_compatibility should remain empty
on HEAD and unreleased stable branches.

This design has been suggested by Noah Misch and me.

Reported-by: Noah Misch
Author: Bertrand Drouvot
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/20240317183114 [email protected]

Fix test failures when language environment is not UTF-8.

For tests that depend on UTF-8 encoding, force LC_COLLATE=C and
LC_CTYPE=C to avoid an encoding mismatch.

Reported-by: Thomas Munro
Discussion: https://postgr.es/m/CA+hUKGK-ZqV1njkG_=xcCqXh2fcMkz85FTMnhS2opm4ZerH=xw@mail.gmail.com

Fix old, misleading comment for PGRES_POLLING_ACTIVE.

The comment implies that we can eventually remove this, but per
discussion, we actually don't want to do that ever, in order to
maintain compatibility.

Jelte Fennema-Nio, reviewed by Tristan Partin

Discussion: http://postgr.es/m/CAGECzQTO72jKed5461W8cytV2Msh_e+WUZjOyX_RUQCbjk4LRA@mail.gmail.com

Remove reachable call to pg_unreachable().

The loop just before this uses break, not return, so this line
is reachable. Commit cafe1056558fe07cdc52b95205588fcd80870362
introduced this issue.

Jelte Fennema-Nio, reviewed by Tristan Partin

Discussion: http://postgr.es/m/CAGECzQTO72jKed5461W8cytV2Msh_e+WUZjOyX_RUQCbjk4LRA@mail.gmail.com

Fix ecpg's mechanism for detecting unsupported cases in the grammar.

ecpg wants to emit a warning if it parses a SQL construct that the
backend can parse but will immediately throw a FEATURE_NOT_SUPPORTED
error for.  The way it was testing for this was to see if the string
ERRCODE_FEATURE_NOT_SUPPORTED appeared anywhere in the gram.y code.
This is, of course, not nearly good enough, as there are plenty of
rules in gram.y that throw that error only conditionally.  There was
a hack dating to 2008 to suppress the warning in one rule that
doesn't even exist anymore, but nothing for other cases we've created
since then.  End result was that you could get "unsupported feature
will be passed to server" warnings while compiling perfectly good SQL
code in ecpg.  Somehow we'd not heard complaints about this, but
it was exposed by the recent addition of an ecpg test for a SQL/JSON
construct.

To fix, suppress the warning if the rule contains any "if" statement.
Manual comparison of gram.y with the generated preproc.y file shows
that the warning is now emitted only in rules where it's sensible.

This problem has existed for a long time, so back-patch to all
supported branches.

Discussion: https://postgr.es/m/603615.1712245382@sss.pgh.pa.us

Further cleanup for recent JSON-related commits.

The link commands in test_json_parser/Makefile were a long way
shy of a load, as evidenced by buildfarm failures.  Model them
on pgxs.mk's PROGRAM rule.  (Probably we should have put these
two test programs in different subdirectories so we could
actually use the PROGRAM rule.  But I won't question that
decision today.)

Further cleanup for recent JSON-related commits.

Add overlooked .gitignore entries.

Fix test_json_parser/Makefile to use the pgxs.mk clean rule
instead of fighting it. Suppresses a warning from make,
at least for me.

Tidy up after incremental JSON parser patch

Remove junk left over from non-vpath builds.

Try to remedy gettext error on some platforms.

Fix warnings re typedef redefinition in ea7b4e9a2a and 3311ea86ed

Per gripe from Tom Lane and the buildfarm

Add missing initialization in transformJsonFuncExpr()

de3600452b added some code for the new JSON_TABLE_OP to that function
but failed to initialize the default_format variable.

Reported-by: Erik Rijkers <[email protected]>
Discussion: https://postgr.es/m/254b2fa2-2f6b-a30a-20ee-21f8a2c12a50@xs4all.nl

Fix typo introduced in 6185c9737

Reported-by: Jian He <[email protected]>
Discussion: https://postgr.es/m/CACJufxGHiU0p0usjh5hnR0_ByZn4tq1FC3eKAtrQgJeKU6W9kw@mail.gmail.com

Add basic JSON_TABLE() functionality

JSON_TABLE() allows JSON data to be converted into a relational view
and thus used, for example, in a FROM clause, like other tabular
data. Data to show in the view is selected from a source JSON object
using a JSON path expression to get a sequence of JSON objects that's
called a "row pattern", which becomes the source to compute the
SQL/JSON values that populate the view's output columns. Column
values themselves are computed using JSON path expressions applied to
each of the JSON objects comprising the "row pattern", for which the
SQL/JSON query functions added in 6185c9737cf4 are used.

To implement JSON_TABLE() as a table function, this augments the
TableFunc and TableFuncScanState nodes that are currently used to
support XMLTABLE() with some JSON_TABLE()-specific fields.

Note that the JSON_TABLE() spec includes NESTED COLUMNS and PLAN
clauses, which are required to provide more flexibility to extract
data out of nested JSON objects, but they are not implemented here
to keep this commit of manageable size.

Author: Nikita Glukhov <[email protected]>
Author: Teodor Sigaev <[email protected]>
Author: Oleg Bartunov <[email protected]>
Author: Alexander Korotkov <[email protected]>
Author: Andrew Dunstan <[email protected]>
Author: Amit Langote <[email protected]>
Author: Jian He <[email protected]>

Reviewers have included (in no particular order):

Andres Freund, Alexander Korotkov, Pavel Stehule, Andrew Alsup,
Erik Rijkers, Zihong Yu, Himanshu Upadhyaya, Daniel Gustafsson,
Justin Pryzby, Álvaro Herrera, Jian He

Discussion: https://postgr.es/m/cd0bb935-0158-78a7-08b5-904886deac4b@postgrespro.ru
Discussion: https://postgr.es/m/20220616233130 [email protected]
Discussion: https://postgr.es/m/abd9b83b-aa66-f230-3d6d-734817f0995d%40postgresql.org
Discussion: https://postgr.es/m/CA+HiwqE4XTdfb1nW=Ojoy_tQSRhYt-q_kb6i5d4xcKyrLC1Nbg@mail.gmail.com

pg_upgrade: Fix typo in message

Use incremental parsing of backup manifests.

This changes the three callers to json_parse_manifest() to use
json_parse_manifest_incremental_chunk() if appropriate. In the case of
the backend caller, since we don't know the size of the manifest in
advance we always call the incremental parser.

Author: Andrew Dunstan
Reviewed-By: Jacob Champion
Discussion: https://postgr.es/m/7b0a51d6-0d9d-7366-3a1a-f74397a02f55@dunslane.net

Add support for incrementally parsing backup manifests

This adds the infrastructure for using the new non-recursive JSON parser
in processing manifests. It's important that callers make sure that the
last piece of json handed to the incremental manifest parser contains
the entire last few lines of the manifest, including the checksum.

Author: Andrew Dunstan
Reviewed-By: Jacob Champion
Discussion: https://postgr.es/m/7b0a51d6-0d9d-7366-3a1a-f74397a02f55@dunslane.net

Introduce a non-recursive JSON parser

This parser uses an explicit prediction stack, unlike the present
recursive descent parser where the parser state is represented on the
call stack. This difference makes the new parser suitable for use in
incremental parsing of huge JSON documents that cannot be conveniently
handled piece-wise by the recursive descent parser. One potential use
for this will be in parsing large backup manifests associated with
incremental backups.
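
To show why an explicit stack suits chunked input, here is a toy
nesting checker.  It is far simpler than the parser in this commit and
shares no code with it: NestState and the nest_*() functions are
hypothetical, and brackets inside JSON strings are not handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEPTH 64

typedef struct NestState
{
    char        stack[MAX_DEPTH];   /* closing brackets still expected */
    int         depth;
} NestState;

static void
nest_init(NestState *st)
{
    st->depth = 0;
}

/*
 * Feed one chunk of input.  All state lives in *st rather than on the
 * C call stack, so the input can be split at any byte boundary.
 * Returns false on a mismatched bracket or on overflow.
 */
static bool
nest_feed(NestState *st, const char *chunk, size_t len)
{
    assert(st->depth >= 0 && st->depth <= MAX_DEPTH);

    for (size_t i = 0; i < len; i++)
    {
        char        c = chunk[i];

        if (c == '[' || c == '{')
        {
            if (st->depth == MAX_DEPTH)
                return false;
            st->stack[st->depth++] = (c == '[') ? ']' : '}';
        }
        else if (c == ']' || c == '}')
        {
            if (st->depth == 0 || st->stack[--st->depth] != c)
                return false;
        }
    }
    return true;
}

/* True once every opened bracket has been closed. */
static bool
nest_done(const NestState *st)
{
    return st->depth == 0;
}
```

A recursive descent version would have to unwind its call stack at the
end of each chunk, which is what makes piece-wise input awkward for it.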

Because this parser is somewhat slower than the recursive descent
parser, it is not replacing that parser, but is an additional parser
available to callers.

For testing purposes, if the build is done with -DFORCE_JSON_PSTACK, all
JSON parsing is done with the non-recursive parser, in which case only
trivial regression differences in error messages should be observed.

Author: Andrew Dunstan
Reviewed-By: Jacob Champion
Discussion: https://postgr.es/m/7b0a51d6-0d9d-7366-3a1a-f74397a02f55@dunslane.net

Silence meson warning

Commit 619bc23a1a introduced

WARNING: Project targets '>=0.54' but uses feature introduced in '0.55.0': Passing executable/found program object to script parameter of add_dist_script

Work around that by wrapping the offending line in a meson version check.

Author: Tristan Partin <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/D096Q3NFFVH1.1T5RE4MOO9ZFH%40neon.tech

postgres_fdw: Remove useless ternary expression.

There is no case where we would call pgfdw_exec_cleanup_query or
pgfdw_exec_cleanup_query_{begin,end} with a NULL query string, so this
expression is pointless; remove it and instead add to the latter
functions an assertion ensuring the given query string is not NULL.

Thinko in commit 815d61fcd.

Discussion: https://postgr.es/m/CAPmGK14mm%2B%3DUjyjoWj_Hu7c%2BQqX-058RFfF%2BqOkcMZ_Nj52v-A%40mail.gmail.com

Secondary refactor of heap scanning functions

Similar to 44086b097, refactor heap scanning functions to be more
suitable for the read stream API.

Author: Melanie Plageman
Discussion: https://postgr.es/m/CAAKRu_YtXJiYKQvb5JsA2SkwrsizYLugs4sSOZh3EAjKUg=gEQ@mail.gmail.com

Coordinate emit_log_hook and all log destinations to share the same timeval

Without this coordination, the timestamp values used by emit_log_hook
and all the other log destinations could differ, because the timestamps
are reset before sending the logs to the server and after calling the
hook.

This change matters for emit_log_hook when generating log information
with 'n' or 'm' in log_line_prefix through log_status_format(), or when
doing direct calls to get_formatted_log_time() like in the JSON or CSV
log formats.

While at it, this commit fixes a couple of comments related to the
formatted timestamps where the JSON format was not mentioned. Oversight
in dc686681e079, that I have noticed while reviewing this patch.

Author: Kambam Vinay, Michael Paquier
Discussion: https://postgr.es/m/CANiRfmsK36A0i8mnQtzaxhSm3CUCimPwJPp4WQNq53OdSNkgWg@mail.gmail.com

Preliminary refactor of heap scanning functions

To allow the use of the read stream API added in b5a9b18cd for
sequential scans on heap tables, here we make some adjustments to make
that change less invasive and perhaps make the code easier to follow in
the process.

Here heapgetpage() gets broken into two functions:

1) The part which reads the block has now been moved into a function
   named heapfetchbuf().
2) The part which performed pruning and populated the scan's
   rs_vistuples[] array is now moved into a new function named
   heap_prepare_pagescan().

The functionality provided by heap_prepare_pagescan() was only ever
required by SO_ALLOW_PAGEMODE scans, so the branching that was
previously done in heapgetpage() is no longer needed as we simply just
don't call heap_prepare_pagescan() from heapgettup() in the refactored
code.

Author: Melanie Plageman
Discussion: https://postgr.es/m/CAAKRu_YtXJiYKQvb5JsA2SkwrsizYLugs4sSOZh3EAjKUg=gEQ@mail.gmail.com

pg_regress: Save errno in emit_tap_output_v() and switch to %m

emit_tap_output_v() includes some fprintf() calls for some output
related to the TAP protocol, that may clobber errno and break %m. This
commit makes the logging of pg_regress smarter by saving errno up front
and restoring it before the vfprintf() call where the input strings are
used, removing the need for strerror(). All logs are switched to %m
rather than strerror(), shaving some code.
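
The save-and-restore pattern might look like the sketch below.
emit_diag() is a hypothetical stand-in for emit_tap_output_v(), and the
example relies on the C library's printf family understanding %m, which
is an extension (glibc and musl provide it) and not standard C.

```c
#include <assert.h>
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>

/*
 * Hypothetical TAP-style emitter.  The prefix fprintf() may clobber
 * errno, so capture it on entry and restore it just before the caller's
 * format string, which may contain %m, is expanded.
 */
static void
emit_diag(FILE *out, const char *fmt, ...)
{
    int         save_errno = errno;
    va_list     ap;

    assert(out != NULL && fmt != NULL);

    fprintf(out, "# ");

    errno = save_errno;
    va_start(ap, fmt);
    vfprintf(out, fmt, ap);
    va_end(ap);

    fputc('\n', out);
}
```

A caller can then write emit_diag(stderr, "could not open file: %m")
immediately after the failing call, with no strerror() in sight.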

This was not a problem until now as pg_regress.c did not use %m, but the
change is simple enough that we have no reason to not support this
placeholder, and that will avoid future mistakes if new logs that
include %m are added.

Author: Dagfinn Ilmari Mannsåker
Reviewed-by: Peter Eisentraut, Michael Paquier
Discussion: https://postgr.es/m/[email protected]

CREATE INDEX: do not update stats during binary upgrade.

During binary upgrade, indexes are created before the data is moved
into place, so the stats computed at that point will always be zero.

This is not currently a major problem, but will be when we try to
preserve statistics during upgrade.

Author: Corey Huinker
Discussion: https://postgr.es/m/CADkLM=daPdFB8V0tgFxK-dLowFsAEzWRWJHyxij7BG3kBjcouA@mail.gmail.com

Invent SERIALIZE option for EXPLAIN.

EXPLAIN (ANALYZE, SERIALIZE) allows collection of statistics about
the volume of data emitted by a query, as well as the time taken
to convert the data to the on-the-wire format. Previously there
was no way to investigate this without actually sending the data
to the client, in which case network transmission costs might
swamp what you wanted to see. In particular this feature allows
investigating the costs of de-TOASTing compressed or out-of-line
data during formatting.

Stepan Rutz and Matthias van de Meent,
reviewed by Tomas Vondra and myself

Discussion: https://postgr.es/m/ca0adb0e-fa4e-c37e-1cd7-91170b18cae1@gmx.de

Fix the parameters order for TableAmRoutine.relation_copy_for_cluster()

Specify OldTable first, NewTable second as used by
table_relation_copy_for_cluster() and as implemented in
heapam_relation_copy_for_cluster().

Backpatch to PostgreSQL 12, where TableAmRoutine was introduced.

Discussion: https://postgr.es/m/ME3P282MB3166860D4911AE82F92DF7C5B63F2%40ME3P282MB3166.AUSP282.PROD.OUTLOOK.COM
Author: Japin Li
Reviewed-by: Pavel Borisov
Backpatch-through: 12

docs: Demote "Monitoring Disk Usage" from chapter to section.

This chapter is very short, and the immediately preceding chapter is
called "Monitoring Database Activity". So, instead of having a
separate chapter for this, make it the last section of the preceding
chapter instead.

Discussion: http://postgr.es/m/CA+Tgmob7_uoYuS2=rVwpVXaRwP-UXz+++saYTC-BCZ42QzSNKQ@mail.gmail.com

Split XLogCtl->LogwrtResult into separate struct members

After this change we have XLogCtl->logWriteResult and ->logFlushResult.
There's no functional change, other than the fact that the assignment
from shared memory to local is no longer done via struct assignment, but
instead using a macro that copies each member separately.
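
A minimal sketch of the member-by-member copy described above. The type
names, the macro name and the helper are hypothetical and only illustrate
the shape of the change; the locking that protects the shared values is
omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Backend-local copy: still a single struct with two members. */
typedef struct LocalWriteResult
{
    XLogRecPtr  Write;
    XLogRecPtr  Flush;
} LocalWriteResult;

/* Shared state: two separate members instead of one embedded struct. */
typedef struct SharedCtl
{
    XLogRecPtr  logWriteResult;
    XLogRecPtr  logFlushResult;
} SharedCtl;

/* Hypothetical macro: copies each member separately, no struct assignment. */
#define REFRESH_LOCAL_RESULT(local, shared) \
    do { \
        (local).Write = (shared)->logWriteResult; \
        (local).Flush = (shared)->logFlushResult; \
    } while (0)

/* Hypothetical helper that returns a fresh local copy of the shared values. */
static LocalWriteResult
read_shared_result(const SharedCtl *shared)
{
    LocalWriteResult local;

    assert(shared != NULL);
    REFRESH_LOCAL_RESULT(local, shared);
    return local;
}
```

Because each member is copied on its own line, a member could later be
switched to an atomic read without touching the other one.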

The current representation is inconvenient going forward; notably, we
would like to add a new member "Copy" (to keep track of the last
position copied into WAL buffers), so the symmetry between the values in
shared memory vs. those in local would be lost.

This also gives us freedom to later change the concurrency model for the
values in shared memory: we can make them use atomics instead of relying
on the info_lck spinlock.

Reviewed-by: Bharath Rupireddy <[email protected]>
Discussion: https://postgr.es/m/202404031119 [email protected]

Inline pg_popcount() for small buffers.

If there aren't many bytes to process, the function call overhead
of the optimized implementation isn't worth taking, so instead we
inline a loop that consults pg_number_of_ones in that case. If
there are many bytes to process, we accept the function call
overhead because the optimized versions are likely to be faster.
The threshold at which we use the optimized implementation is set
to the smallest amount of data required to use special popcount
instructions.
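
The dispatch described above can be sketched roughly as follows. This is
an illustration only: the threshold value, the function names and the
16-entry nibble table (a small stand-in for the 256-entry
pg_number_of_ones array) are assumptions, and the "optimized" path here
is a portable bit-trick loop rather than special popcount instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cutoff; the real one depends on the popcount instructions used. */
#define SMALL_BUF_THRESHOLD 8

/* Ones per 4-bit value; stand-in for a 256-entry per-byte lookup table. */
static const uint8_t nibble_ones[16] = {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};

static uint64_t
ones_in_byte(uint8_t b)
{
    return (uint64_t) nibble_ones[b & 0x0f] + nibble_ones[b >> 4];
}

/* Stand-in for the out-of-line optimized path: counts 8 bytes per step. */
static uint64_t
popcount_optimized(const uint8_t *buf, size_t bytes)
{
    uint64_t    total = 0;

    while (bytes >= sizeof(uint64_t))
    {
        uint64_t    x;

        memcpy(&x, buf, sizeof(x));
        x = x - ((x >> 1) & UINT64_C(0x5555555555555555));
        x = (x & UINT64_C(0x3333333333333333)) + ((x >> 2) & UINT64_C(0x3333333333333333));
        x = (x + (x >> 4)) & UINT64_C(0x0f0f0f0f0f0f0f0f);
        total += (x * UINT64_C(0x0101010101010101)) >> 56;
        buf += sizeof(x);
        bytes -= sizeof(x);
    }
    while (bytes-- > 0)
        total += ones_in_byte(*buf++);
    return total;
}

/* Inline wrapper: small buffers use the table loop and skip the function call. */
static inline uint64_t
popcount_buf(const uint8_t *buf, size_t bytes)
{
    assert(buf != NULL || bytes == 0);
    if (bytes < SMALL_BUF_THRESHOLD)
    {
        uint64_t    total = 0;

        while (bytes-- > 0)
            total += ones_in_byte(*buf++);
        return total;
    }
    return popcount_optimized(buf, bytes);
}
```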

Reviewed-by: Alvaro Herrera, Tom Lane
Discussion: https://postgr.es/m/20240402155301.GA2750455%40nathanxps13

Combine freezing and pruning steps in VACUUM

Execute both freezing and pruning of tuples in the same
heap_page_prune() function, now called heap_page_prune_and_freeze(),
and emit a single WAL record containing all changes. That reduces the
overall amount of WAL generated.

This moves the freezing logic from vacuumlazy.c to the
heap_page_prune_and_freeze() function. The main difference in the
coding is that in vacuumlazy.c, we looked at the tuples after the
pruning had already happened, but in heap_page_prune_and_freeze() we
operate on the tuples before pruning. The heap_prepare_freeze_tuple()
function is now invoked after we have determined that a tuple is not
going to be pruned away.

VACUUM no longer needs to loop through the items on the page after
pruning. heap_page_prune_and_freeze() does all the work. It now
returns the list of dead offsets, including existing LP_DEAD items, to
the caller. Similarly it's now responsible for tracking 'all_visible',
'all_frozen', and 'hastup' on the caller's behalf.

Author: Melanie Plageman <[email protected]>
Discussion: https://www.postgresql.org/message-id/20240330055710.kqg6ii2cdojsxgje@liskov

Refactor how heap_prune_chain() updates prunable_xid

In preparation for freezing and counting tuples which are not
candidates for pruning, split heap_prune_record_unchanged() into
multiple functions, depending on the kind of line pointer. That's not too
interesting right now, but makes the next commit smaller.

Recording the lowest soon-to-be prunable xid is one of the actions we
take for unchanged LP_NORMAL item pointers but not for others, so move
that to the new heap_prune_record_unchanged_lp_normal() function. The
next commit will add more actions to these functions.

Author: Melanie Plageman <[email protected]>
Discussion: https://www.postgresql.org/message-id/20240330055710.kqg6ii2cdojsxgje@liskov

Fix zeroing of pg_serial page without SLRU bank lock

Bug in commit 53c2a97a9266: we failed to acquire the correct SLRU bank
lock when iterating to zero-out intermediate pages in predicate.c.
Rewrite the code block so that we follow the locking protocol correctly.

Also update an outdated comment in the same file -- SerialSLRULock
exists no more.

Reported-by: Alexander Lakhin <[email protected]>
Reviewed-by: Dilip Kumar <[email protected]>
Discussion: https://postgr.es/m/2a25eaf4-a3a4-5fd1-6241-9d7c73142085@gmail.com

Use the pairing heap instead of a flat array for LSN replay waiters

06c418e163 introduced the pg_wal_replay_wait() procedure, which allows waiting
for a particular LSN to be replayed on a standby.  The waiters were stored in
a flat array.  Even though scanning small arrays is fast, that might be a
problem at scale (a lot of waiting processes).

This commit replaces the flat shared memory array with the pairing heap,
which holds the waiter with the least LSN at the top.  This gives us O(log N)
complexity for both inserting and removing waiters.

Reported-by: Alvaro Herrera
Discussion: https://postgr.es/m/202404030658.hhj3vfxeyhft%40alvherre.pgsql

Drop global objects after completed test

Project policy is to not leave global objects behind after a regress
test run. This was found as a result of the development of a patch
to make pg_regress detect such leftovers automatically, which in the
end was withdrawn due to issues with parallel runs.

Discussion: https://postgr.es/m/[email protected]

Ensure that the sync slots reach a consistent state after promotion without losing data.

We were directly copying the LSN locations while syncing the slots on the
standby. Now, it is possible that at some particular restart_lsn there are
some running xacts, which means if we start reading the WAL from that
location after promotion, we won't reach a consistent snapshot state at
that point. However, on the primary, we would have already been in a
consistent snapshot state at that restart_lsn so we would have just
serialized the existing snapshot.

To avoid this problem we will use the advance_slot functionality unless
the snapshot already exists at the synced restart_lsn location. This will
help us to ensure that snapbuilder/slot statuses are updated properly
without generating any changes. Note that the synced slot will remain as
RS_TEMPORARY till the decoding from corresponding restart_lsn can reach a
consistent snapshot state after which they will be marked as
RS_PERSISTENT.

Per buildfarm

Author: Hou Zhijie
Reviewed-by: Bertrand Drouvot, Shveta Malik, Bharath Rupireddy, Amit Kapila
Discussion: https://postgr.es/m/OS0PR01MB5716B3942AE49F3F725ACA92943B2@OS0PR01MB5716.jpnprd01.prod.outlook.com

Minor improvements for waitlsn.c

* Remove extra includes
* Fill 'cur' in addLSNWaiter() before taking the spinlock
* Initialize 'endtime' with zero in WaitForLSN() to avoid compiler warning

Reported-by: Alvaro Herrera, Masahiko Sawada, Daniel Gustafsson
Discussion: https://postgr.es/m/202404030658.hhj3vfxeyhft%40alvherre.pgsql
Discussion: https://postgr.es/m/CAD21AoAx7irptnPH1OkkkNh9E0M6X-phfX7sYZfwoMsc1qV1sQ%40mail.gmail.com

Fix indentation from cafe1056558f

Per buildfarm animal koel

Add error codes to some PANIC/FATAL error reports

This adds errcodes to a set of PANIC and FATAL errors in xlog.c
and relcache.c, which previously had no errcode at all set, in
order to make fleetwide analysis of errorlogs easier. There are
many more ereport/elogs left which could benefit from having an
errcode but this at least makes a dent in the issue.

Author: Nazir Bilal Yavuz <[email protected]>
Discussion: https://postgr.es/m/CAN55FZ1k8LgLEqncPGmz_fWnrobV6bjABOTH4tOWta6xNcPQig@mail.gmail.com

Add built-in ERROR handling for archive callbacks.

Presently, the archiver process restarts when an archive callback
ERRORs.  To avoid this, archive module authors can use sigsetjmp(),
manage a memory context, etc., but that requires a lot of extra
code that will likely look roughly the same between modules.  This
commit adds basic archive callback ERROR handling to pgarch.c so
that module authors won't ordinarily need to worry about this.
While this built-in handler attempts to clean up anything that an
archive module could conceivably have left behind, it is possible
that some modules are doing unexpected things that require
additional cleanup.  Module authors should be sure to do any extra
required cleanup in a PG_CATCH block within the archiving callback.

The archiving callback is now called in a short-lived memory
context that the archiver process resets between invocations.  If a
module requires longer-lived storage, it must maintain its own
memory context.

Thanks to these changes, the basic_archive module can be greatly
simplified.

Suggested-by: Andres Freund
Reviewed-by: Andres Freund, Yong Li
Discussion: https://postgr.es/m/20230217215624.GA3131134%40nathanxps13

Improve eviction algorithm in ReorderBuffer using max-heap for many subtransactions.

Previously, when selecting the transaction to evict during logical
decoding, we check all transactions to find the largest
transaction. This could lead to a significant replication lag
especially in the case where there are many subtransactions.

This commit improves the eviction algorithm in ReorderBuffer using the
max-heap with transaction size as the key to efficiently find the
largest transaction.

The max-heap starts out empty. While the max-heap is empty, we don't
do anything for the max-heap when updating the memory
counter. Therefore, we get the largest transaction in O(N) time, where
N is the number of transactions including top-level transactions and
subtransactions.

We build the max-heap just before selecting the largest transactions
if the number of transactions being decoded is higher than the
threshold, MAX_HEAP_TXN_COUNT_THRESHOLD. After building the max-heap,
we also update the max-heap when updating the memory counter. The
intention is to efficiently find the largest transaction in O(1) time
instead of incurring the cost of memory counter updates (O(log
N)). Once the number of transactions drops below the threshold, we
reset the max-heap.

The performance benchmark results showed a significant speedup (more
than 30x on my machine) in decoding a transaction with 100k
subtransactions, whereas there is no visible overhead in other cases.

Reviewed-by: Amit Kapila, Hayato Kuroda, Vignesh C, Ajin Cherian,
Tomas Vondra, Shubham Khanna, Peter Smith, Álvaro Herrera,
Euler Taveira
Discussion: https://postgr.es/m/CAD21AoAfKTgrBrLq96GcTv9d6k97zaQcDM-rxfKEt4GSe0qnaQ%40mail.gmail.com

Don't adjust ressortgroupref in generate_setop_child_grouplist()

This is already done inside assignSortGroupRef(), therefore is
redundant.

Oversight from 66c0185a3.

Reported-by: Tom Lane
Discussion: https://postgr.es/m/3703023.1711654574@sss.pgh.pa.us

Add functions to binaryheap for efficient key removal and update.

Previously, binaryheap didn't support updating a key and removing a
node in an efficient way. For example, in order to remove a node from
the binaryheap, the caller had to pass the node's position within the
array that the binaryheap internally has. Removing a node from the
binaryheap is done in O(log n) but searching for the key's position is
done in O(n).

This commit adds a hash table to binaryheap in order to track the
position of each node in the binaryheap. That way, by using newly
added functions such as binaryheap_update_up() etc., both updating a
key and removing a node can be done in O(1) on average and O(log n)
in the worst case. This is known as the indexed binary heap. The caller
can specify to use the indexed binaryheap by passing indexed = true.

The current code does not use the new indexing logic, but it will be
used by an upcoming patch.

Reviewed-by: Vignesh C, Peter Smith, Hayato Kuroda, Ajin Cherian,
Tomas Vondra, Shubham Khanna
Discussion: https://postgr.es/m/CAD21AoDffo37RC-eUuyHJKVEr017V2YYDLyn1xF_00ofptWbkg%40mail.gmail.com

Make binaryheap enlargeable.

The node array space of the binaryheap is doubled when there is no
available space.
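
A minimal sketch of the doubling strategy, using a toy max-heap of ints.
The IntHeap type and its functions are hypothetical and are not the
binaryheap API; they only show growth on insert when the node array is
full.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy max-heap of ints with a growable node array. */
typedef struct IntHeap
{
    int        *nodes;
    size_t      size;       /* number of nodes in use */
    size_t      space;      /* number of nodes allocated */
} IntHeap;

static IntHeap *
int_heap_create(size_t space)
{
    IntHeap    *heap;

    assert(space > 0);
    heap = malloc(sizeof(IntHeap));
    if (heap == NULL)
        return NULL;
    heap->nodes = malloc(space * sizeof(int));
    if (heap->nodes == NULL)
    {
        free(heap);
        return NULL;
    }
    heap->size = 0;
    heap->space = space;
    return heap;
}

/* Adds a value; returns 0 on success, -1 if the array could not grow. */
static int
int_heap_add(IntHeap *heap, int value)
{
    size_t      i;

    if (heap->size >= heap->space)
    {
        /* No room left: double the node array. */
        size_t      newspace = heap->space * 2;
        int        *grown = realloc(heap->nodes, newspace * sizeof(int));

        if (grown == NULL)
            return -1;
        heap->nodes = grown;
        heap->space = newspace;
    }

    /* Sift the new value up to restore the max-heap property. */
    i = heap->size++;
    while (i > 0)
    {
        size_t      parent = (i - 1) / 2;

        if (heap->nodes[parent] >= value)
            break;
        heap->nodes[i] = heap->nodes[parent];
        i = parent;
    }
    heap->nodes[i] = value;
    return 0;
}
```

Doubling keeps the amortized cost of an insert at O(log n), since each
node is copied by realloc only a constant number of times on average.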

Reviewed-by: Vignesh C, Peter Smith, Hayato Kuroda, Ajin Cherian,
Tomas Vondra, Shubham Khanna
Discussion: https://postgr.es/m/CAD21AoDffo37RC-eUuyHJKVEr017V2YYDLyn1xF_00ofptWbkg%40mail.gmail.com

Move WaitLSNShmemInit() to CreateOrAttachShmemStructs()

Thanks to Andres Freund, Thomas Munro and David Rowley for investigating
this issue.

Discussion: https://postgr.es/m/CAPpHfdvap5mMLikt8CUjA0osAvCJHT0qnYeR3f84EJ_Kvse0mg%40mail.gmail.com

Don't zero tuple_fraction when planning UNIONs with ORDER BYs

Since 66c0185a3, the planner is able to use Merge Append -> Unique to
implement UNION queries and each subquery is prompted to produce Paths
correctly sorted by the UNION's targetlist.

Here we remove some now redundant code which was zeroing the
tuple_fraction at the parent level. This will allow the planner to
consider cheap startup paths when planning the UNION's subqueries.

EXCEPT and INTERSECT set operations still have the tuple_fraction zeroed
in generate_nonunion_paths(). These operations currently always read
all of their subqueries' tuples.

Reported-by: Tom Lane
Discussion: https://postgr.es/m/3703023.1711654574@sss.pgh.pa.us

Implement pg_wal_replay_wait() stored procedure

pg_wal_replay_wait() is to be used on standby and specifies waiting for
the specific WAL location to be replayed before starting the transaction.
This option is useful when the user makes some data changes on primary and
needs a guarantee to see these changes on standby.

The queue of waiters is stored in the shared memory array sorted by LSN.
During replay of WAL, waiters whose LSNs have already been replayed are
deleted from the shared memory array and woken up by setting their latches.

pg_wal_replay_wait() needs to wait without any snapshot held. Otherwise,
the snapshot could prevent the replay of WAL records implying a kind of
self-deadlock. This is why it is only possible to implement
pg_wal_replay_wait() as a procedure working in a non-atomic context,
not a function.

Catversion is bumped.

Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan, Alexander Korotkov
Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
Reviewed-by: Alexander Lakhin, Bharath Rupireddy, Euler Taveira

Avoid deadlock during orphan temp table removal.

If temp tables have dependencies (such as sequences) then it's
possible for autovacuum's cleanup of orphan temp tables to deadlock
against an incoming backend that's trying to clean out the temp
namespace for its own use.  That can happen because RemoveTempRelations'
performDeletion call can visit objects within the namespace in
an order different from the order in which a per-table deletion
will visit them.

To fix, observe that performDeletion will begin by taking an exclusive
lock on the temp namespace (even though it won't actually delete it).
So, if we can get a shared lock on the namespace, we can be sure we're
not running concurrently with RemoveTempRelations, while also not
conflicting with ordinary use of the namespace.  This requires
introducing a conditional version of LockDatabaseObject, but that's no
big deal.  (It's surprising we've got along without that this long.)

Report and patch by Mikhail Zhilin.  Back-patch to all supported
branches.

Discussion: https://postgr.es/m/c43ce028-2bc2-4865-9b89-3f706246eed5@postgrespro.ru

Avoid function call overhead of pg_popcount() in syslogger.c.

Instead of calling the pg_popcount() function for a single byte, we
can look up the value in the pg_number_of_ones array.

Discussion: https://postgr.es/m/20240401221117.GB2362108%40nathanxps13

Refactor code for setting pg_popcount* function pointers.

Presently, there are three copies of this code, and a proposed
follow-up patch would add more code to each copy. This commit
introduces a new inline function for this code and makes use of it
in the pg_popcount*_choose functions, thereby reducing code
duplication.

Author: Paul Amonson
Discussion: https://postgr.es/m/BL1PR11MB5304097DF7EA81D04C33F3D1DCA6A%40BL1PR11MB5304.namprd11.prod.outlook.com

Unwind #if spaghetti in hmac_openssl.c a bit.

Make this code a little less confusing by defining a separate macro
that controls whether we'll use ResourceOwner facilities to track
the existence of a pg_hmac_ctx context.

The proximate reason to touch this is that since b8bff07da, we got
"unused function" warnings if building with older OpenSSL, because
the #if guards around the ResourceOwner wrapper function definitions
were different from those around the calls of those functions.
Pulling the ResourceOwner machinations outside of the #ifdef HAVE_xxx
guards fixes that and makes the code clearer too.

Discussion: https://postgr.es/m/1394271.1712016101@sss.pgh.pa.us

Allow SIGINT to cancel psql database reconnections.

After installing the SIGINT handler in psql, SIGINT can no longer cancel
database reconnections. For instance, if the user starts a reconnection
and then needs to do some form of interaction (i.e. psql is polling),
there is no way to cancel the reconnection process currently.

Use PQconnectStartParams() in order to insert a cancel_pressed check
into the polling loop.

Tristan Partin, reviewed by Gurjeet Singh, Heikki Linnakangas, Jelte
Fennema-Nio, and me.

Discussion: http://postgr.es/m/[email protected]

Expose PQsocketPoll via libpq

This is useful when connecting to a database asynchronously via
PQconnectStart(), since it handles deciding between poll() and
select(), and some of the required boilerplate.

Tristan Partin, reviewed by Gurjeet Singh, Heikki Linnakangas, Jelte
Fennema-Nio, and me.

Discussion: http://postgr.es/m/[email protected]

Use streaming I/O in pg_prewarm.

Instead of calling ReadBuffer() repeatedly, use the new streaming
interface. This commit provides a very simple example of such a
transformation.

Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com

Provide API for streaming relation data.

Introduce an abstraction allowing relation data to be accessed as a
stream of buffers, with an implementation that is more efficient than
the equivalent sequence of ReadBuffer() calls.

Client code supplies a callback that can say which block number it wants
next, and then consumes individual buffers one at a time from the
stream. This division puts read_stream.c in control of how far ahead it
can see and allows it to read clusters of neighboring blocks with
StartReadBuffers(). It also issues POSIX_FADV_WILLNEED advice ahead of
time when random access is detected.
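
The callback-driven division of labor can be illustrated with a toy
model. Everything here (ToyStream, LOOKAHEAD, the range callback) is
hypothetical and performs no I/O; it only shows the stream, not the
caller, deciding how far ahead to ask for block numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_BLOCK UINT32_MAX
#define LOOKAHEAD 4             /* hypothetical look-ahead distance */

/* Client-supplied callback: returns the next wanted block, or INVALID_BLOCK. */
typedef uint32_t (*next_block_cb) (void *private_data);

typedef struct ToyStream
{
    next_block_cb callback;
    void       *private_data;
    uint32_t    queue[LOOKAHEAD];   /* ring buffer of looked-ahead blocks */
    size_t      head;
    size_t      count;
    int         exhausted;
} ToyStream;

static void
toy_stream_init(ToyStream *stream, next_block_cb callback, void *private_data)
{
    assert(callback != NULL);
    stream->callback = callback;
    stream->private_data = private_data;
    stream->head = 0;
    stream->count = 0;
    stream->exhausted = 0;
}

/* Hands out one block at a time, refilling the look-ahead queue first. */
static uint32_t
toy_stream_next(ToyStream *stream)
{
    uint32_t    block;

    while (!stream->exhausted && stream->count < LOOKAHEAD)
    {
        uint32_t    wanted = stream->callback(stream->private_data);

        if (wanted == INVALID_BLOCK)
            stream->exhausted = 1;
        else
        {
            stream->queue[(stream->head + stream->count) % LOOKAHEAD] = wanted;
            stream->count++;
        }
    }
    if (stream->count == 0)
        return INVALID_BLOCK;
    block = stream->queue[stream->head];
    stream->head = (stream->head + 1) % LOOKAHEAD;
    stream->count--;
    return block;
}

/* Example callback state: every block in [next, end). */
typedef struct RangeState
{
    uint32_t    next;
    uint32_t    end;
} RangeState;

static uint32_t
range_next_block(void *private_data)
{
    RangeState *range = private_data;

    return (range->next < range->end) ? range->next++ : INVALID_BLOCK;
}
```

In this model the consumer would call toy_stream_next() in a loop until
it returns INVALID_BLOCK. A real implementation would use the queued
block numbers to start reads or issue advice before the consumer asks.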

Other variants of I/O stream will be proposed in future work (for
example to support recovery, whose LsnReadQueue device in
xlogprefetcher.c is a distant cousin of this code and should eventually
be replaced by this), but this basic API is sufficient for many common
executor usage patterns involving predictable access to a single fork of
a single relation.

Several patches using this API are proposed separately.

This stream concept is loosely based on ideas from Andres Freund on how
we should pave the way for later work on asynchronous I/O.

Author: Thomas Munro <[email protected]>
Author: Heikki Linnakangas <[email protected]> (contributions)
Author: Melanie Plageman <[email protected]> (contributions)
Suggested-by: Andres Freund <[email protected]>
Reviewed-by: Heikki Linnakangas <[email protected]>
Reviewed-by: Melanie Plageman <[email protected]>
Reviewed-by: Nazir Bilal Yavuz <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Tested-by: Tomas Vondra <[email protected]>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com

Provide vectored variant of ReadBuffer().

Break ReadBuffer() up into two steps.  StartReadBuffers() and
WaitReadBuffers() give us two main advantages:

1.  Multiple consecutive blocks can be read with one system call.
2.  Advice (hints of future reads) can optionally be issued to the
kernel ahead of time.
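
To make the first advantage concrete, the sketch below counts how many
reads a sequence of block numbers would need if runs of consecutive
blocks are merged, up to a limit. The function name and the
combine_limit parameter are hypothetical; this models the idea only and
does not call StartReadBuffers().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Counts the reads needed for blocks[0..nblocks-1] when up to
 * combine_limit consecutive block numbers are merged into one read.
 */
static size_t
count_combined_reads(const uint32_t *blocks, size_t nblocks, size_t combine_limit)
{
    size_t      reads = 0;
    size_t      i = 0;

    assert(combine_limit > 0);
    while (i < nblocks)
    {
        size_t      run = 1;

        /* Extend the run while the next block is physically adjacent. */
        while (i + run < nblocks &&
               run < combine_limit &&
               blocks[i + run] == blocks[i] + (uint32_t) run)
            run++;

        reads++;                /* one system call covers the whole run */
        i += run;
    }
    return reads;
}
```

With blocks 10, 11, 12, 13, 20, 21, 40 and a limit of 16 this gives three
reads; with a limit of 1 it degenerates to one read per block.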

The traditional ReadBuffer() function is now implemented in terms of
those functions, to avoid duplication.

A new GUC io_combine_limit is defined, and the functions for limiting
per-backend pin counts are made into public APIs.  Those are provided
for use by callers of StartReadBuffers(), when deciding how many buffers
to read at once.  The following commit will add a higher level mechanism
for doing that automatically with a practical interface.

With some more infrastructure in later work, StartReadBuffers() could
be extended to start real asynchronous I/O instead of just issuing
advice and leaving WaitReadBuffers() to do the work synchronously.

Author: Thomas Munro <[email protected]>
Author: Andres Freund <[email protected]> (some optimization tweaks)
Reviewed-by: Melanie Plageman <[email protected]>
Reviewed-by: Heikki Linnakangas <[email protected]>
Reviewed-by: Nazir Bilal Yavuz <[email protected]>
Reviewed-by: Dilip Kumar <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Tested-by: Tomas Vondra <[email protected]>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com

Don't use the pg_am system catalog in new test

This causes deadlocks because it's a highly trafficked catalog. Use a
regular table created by the same test instead.

Discussion: https://postgr.es/m/f3e61e27-19d0-5e40-3eb2-53282fa0532a@gmail.com

Revert "Custom reloptions for table AM"

This reverts commit c95c25f9af4bc77f2f66a587735c50da08c12b37 due to multiple
design issues spotted after commit.

Reported-by: Jeff Davis
Discussion: https://postgr.es/m/11550b536211d5748bb2865ed6cb3502ff073bf7.camel%40j-davis.com

Use TidStore for dead tuple TIDs storage during lazy vacuum.

Previously, we used a simple array for storing dead tuple IDs during
lazy vacuum, which had a number of problems:

* The array used a single allocation and so was limited to 1GB.
* The allocation was pessimistically sized according to table size.
* Lookup with binary search was slow because of poor CPU cache and
branch prediction behavior.

This commit replaces that array with the TID store from commit
30e144287a.

Since the backing radix tree makes small allocations as needed, the
1GB limit is now gone. Further, the total memory used is now often
smaller by an order of magnitude or more, depending on the
distribution of blocks and offsets. These two features should make
multiple rounds of heap scanning and index cleanup an extremely rare
event. TID lookup during index cleanup is also several times faster,
even more so when index order is correlated with heap tuple order.

Since there is no longer a predictable relationship between the number
of dead tuples vacuumed and the space taken up by their TIDs, the
number of tuples no longer provides any meaningful insights for users,
nor is the maximum number predictable. For that reason this commit
also changes to byte-based progress reporting, with the relevant
columns of pg_stat_progress_vacuum renamed accordingly to
max_dead_tuple_bytes and dead_tuple_bytes.

For parallel vacuum, both the TID store and supplemental information
specific to vacuum are shared among the parallel vacuum workers. As
with the previous array, we don't take any locks on TidStore during
parallel vacuum since writes are still only done by the leader
process.

Bump catalog version.

Reviewed-by: John Naylor, (in an earlier version) Dilip Kumar
Discussion: https://postgr.es/m/CAD21AoAfOZvmfR0j8VmZorZjL7RhTiQdVttNuC4W-Shdc2a-AA%40mail.gmail.com

Fix assert failure when planning setop subqueries with CTEs

66c0185a3 adjusted the UNION planner to request that union child queries
produce Paths correctly ordered to implement the UNION by way of
MergeAppend followed by Unique.  The code there made a bad assumption
that if the root->parent_root->parse had setOperations set that the
query must be the child subquery of a set operation.  That's not true
when it comes to planning a non-inlined CTE which is parented by a set
operation.  This causes issues as the CTE's targetlist has no
requirement to match up to the SetOperationStmt's groupClauses.

Fix this by adding a new parameter to both subquery_planner() and
grouping_planner() to explicitly pass the SetOperationStmt only when
planning set operation child subqueries.

Thank you to Tom Lane for helping to rationalize the decision on the
best function signature for subquery_planner().

Reported-by: Alexander Lakhin
Discussion: https://postgr.es/m/242fc7c6-a8aa-2daf-ac4c-0a231e2619c1@gmail.com

Avoid "unused variable" warning on non-USE_SSL_ENGINE platforms.

If we are building with openssl but USE_SSL_ENGINE didn't get set,
initialize_SSL's variable "pkey" is declared but used nowhere.
Apparently this combination hasn't been exercised in the buildfarm
before now, because I've not seen this warning before, even though
the code has been like this a long time. Move the declaration
to silence the warning (and remove its useless initialization).

Per buildfarm member sawshark. Back-patch to all supported branches.

Introduce 'options' argument to heap_page_prune()

Currently there is only one option, HEAP_PAGE_PRUNE_MARK_UNUSED_NOW
which replaces the old boolean argument, but upcoming patches will
introduce at least one more. Having a lot of boolean arguments makes
it hard to see at the call sites what the arguments mean, so prefer a
bitmask of options with human-readable names.
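
A small sketch of the readability argument. The PRUNE_* names and the
helper are hypothetical stand-ins, not the heap_page_prune() signature;
the second flag exists only to show how a later option would combine.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical option bits, modeled on HEAP_PAGE_PRUNE_MARK_UNUSED_NOW. */
#define PRUNE_MARK_UNUSED_NOW   (1 << 0)
#define PRUNE_SECOND_OPTION     (1 << 1)    /* placeholder for a future option */

#define PRUNE_ALL_OPTIONS       (PRUNE_MARK_UNUSED_NOW | PRUNE_SECOND_OPTION)

/* Reports whether the caller asked for the "mark unused now" behavior. */
static bool
prune_marks_unused_now(int options)
{
    /* Reject bits that no known option defines. */
    assert((options & ~PRUNE_ALL_OPTIONS) == 0);
    return (options & PRUNE_MARK_UNUSED_NOW) != 0;
}
```

A call site then reads prune_marks_unused_now(PRUNE_MARK_UNUSED_NOW)
rather than passing a bare true, and further options are OR-ed in
without changing the function signature.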

Author: Melanie Plageman <[email protected]>
Author: Heikki Linnakangas <[email protected]>
Discussion: https://www.postgresql.org/message-id/20240401172219.fngjosaqdgqqvg4e@liskov

Invent --transaction-size option for pg_restore.

This patch allows pg_restore to wrap its commands into transaction
blocks, somewhat like --single-transaction, except that we commit
and start a new block after every N objects.  Using this mode
with a size limit of 1000 or so objects greatly reduces the number
of transactions consumed by the restore, while preventing any
one transaction from taking enough locks to overrun the receiving
server's shared lock table.

(A value of 1000 works well with the default lock table size of
around 6400 locks.  Higher --transaction-size values can be used
if one has increased the receiving server's lock table size.)

Excessive consumption of XIDs has been reported as a problem for
pg_upgrade in particular, but it could be bad for any restore; and the
change also reduces the number of fsyncs and amount of WAL generated,
so it should provide speed benefits too.

This patch does not try to make parallel workers batch the SQL
commands they issue.  The trouble with doing that is that other
workers may need to see the objects a worker creates right away.
Possibly this can be improved later.

In this patch I have hard-wired pg_upgrade to use a transaction size
of 1000 divided by the number of parallel restore jobs allowed
(without that, we'd still be at risk of overrunning the shared lock
table).  Perhaps there would be value in adding another pg_upgrade
option to allow user control of that, but I'm unsure that it's worth
the trouble; I think few users would use it, and any who did would see
not that much benefit compared to the default.
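
The batching arithmetic can be sketched as below. Both functions are
hypothetical illustrations, not pg_restore or pg_upgrade code; plain
integer division by the job count is an assumption based on the
description above.

```c
#include <assert.h>

/*
 * Simulates a restore that commits and starts a new transaction block
 * after every txn_size objects; returns the number of transactions used.
 */
static long
count_restore_transactions(long nobjects, long txn_size)
{
    long        commits = 0;
    long        in_txn = 0;

    assert(nobjects >= 0 && txn_size > 0);
    for (long i = 0; i < nobjects; i++)
    {
        in_txn++;               /* "restore" one object in the open block */
        if (in_txn == txn_size)
        {
            commits++;          /* commit, then start a new block */
            in_txn = 0;
        }
    }
    if (in_txn > 0)
        commits++;              /* commit the final partial block */
    return commits;
}

/* Per-job transaction size when 1000 objects are split across parallel jobs. */
static long
upgrade_transaction_size(int jobs)
{
    assert(jobs > 0);
    return 1000 / jobs;
}
```

For example, 2500 objects with a size of 1000 use three transactions
instead of 2500, and four parallel jobs would each get a size of 250.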

Patch by me, but the original idea to batch SQL commands during
restore is due to Robins Tharakan.

Discussion: https://postgr.es/m/a9f9376f1c3343a6bb319dce294e20ac@EX13D05UWC001.ant.amazon.com

Rearrange pg_dump's handling of large objects for better efficiency.

Commit c0d5be5d6 caused pg_dump to create a separate BLOB metadata TOC
entry for each large object (blob), but it did not touch the ancient
decision to put all the blobs' data into a single "BLOBS" TOC entry.
This is bad for a few reasons: for databases with millions of blobs,
the TOC becomes unreasonably large, causing performance issues;
selective restore of just some blobs is quite impossible; and we
cannot parallelize either dump or restore of the blob data, since our
architecture for that relies on farming out whole TOC entries to
worker processes.

To improve matters, let's group multiple blobs into each blob metadata
TOC entry, and then make corresponding per-group blob data TOC entries.
Selective restore using pg_restore's -l/-L switches is then possible,
though only at the group level.  (Perhaps we should provide a switch
to allow forcing one-blob-per-group for users who need precise
selective restore and don't have huge numbers of blobs.  This patch
doesn't do that, instead just hard-wiring the maximum number of blobs
per entry at 1000.)

The blobs in a group must all have the same owner, since the TOC entry
format only allows one owner to be named.  In this implementation
we also require them to all share the same ACL (grants); the archive
format wouldn't require that, but pg_dump's representation of
DumpableObjects does.  It seems unlikely that either restriction
will be problematic for databases with huge numbers of blobs.
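
The grouping rule (same owner, same ACL, at most 1000 blobs per entry)
can be sketched as follows. BlobInfo and count_blob_groups() are
hypothetical; owners and ACLs are reduced to integer ids, and the input
is assumed to be ordered so that blobs with equal owner and ACL are
adjacent.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BLOBS_PER_GROUP 1000

/* Hypothetical, simplified blob descriptor. */
typedef struct BlobInfo
{
    unsigned int oid;
    int         owner_id;
    int         acl_id;         /* stand-in for the blob's ACL text */
} BlobInfo;

/* Counts the metadata groups needed for the given blobs. */
static size_t
count_blob_groups(const BlobInfo *blobs, size_t nblobs)
{
    size_t      groups = 0;
    size_t      in_group = 0;

    assert(blobs != NULL || nblobs == 0);
    for (size_t i = 0; i < nblobs; i++)
    {
        /* Start a new group on the first blob, a full group, or a mismatch. */
        if (i == 0 ||
            in_group >= MAX_BLOBS_PER_GROUP ||
            blobs[i].owner_id != blobs[i - 1].owner_id ||
            blobs[i].acl_id != blobs[i - 1].acl_id)
        {
            groups++;
            in_group = 0;
        }
        in_group++;
    }
    return groups;
}
```

Under these assumptions a database with a million blobs that share one
owner and one ACL would need about a thousand metadata entries rather
than a million.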

The metadata TOC entries now have a "desc" string of "BLOB METADATA",
and their "defn" string is just a newline-separated list of blob OIDs.
The restore code has to generate creation commands, ALTER OWNER
commands, and drop commands (for --clean mode) from that.  We would
need special-case code for ALTER OWNER and drop in any case, so the
alternative of keeping the "defn" as directly executable SQL code
for creation wouldn't buy much, and it seems like it'd bloat the
archive to little purpose.

Since we require the blobs of a metadata group to share the same ACL,
we can furthermore store only one copy of that ACL, and then make
pg_restore regenerate the appropriate commands for each blob.  This
saves space in the dump file not only by removing duplicative SQL
command strings, but by not needing a separate TOC entry for each
blob's ACL.  In turn, that reduces client-side memory requirements for
handling many blobs.

ACL TOC entries that need this special processing are labeled as
"ACL"/"LARGE OBJECTS nnn..nnn".  If we have a blob with a unique ACL,
continue to label it as "ACL"/"LARGE OBJECT nnn".  We don't actually
have to make such a distinction, but it saves a few cycles during
restore for the easy case, and it seems like a good idea to not change
the TOC contents unnecessarily.

The data TOC entries ("BLOBS") are exactly the same as before,
except that now there can be more than one, so we'd better give them
identifying tag strings.

Also, commit c0d5be5d6 put the new BLOB metadata TOC entries into
SECTION_PRE_DATA, which perhaps is defensible in some ways, but
it's a rather odd choice considering that we go out of our way to
treat blobs as data.  Moreover, because parallel restore handles
the PRE_DATA section serially, this means we'd only get part of the
parallelism speedup we could hope for.  Move these entries into
SECTION_DATA, letting us parallelize the lo_create calls, not just the
data loading, when there are many blobs.  Add dependencies to ensure
that we won't try to load data for a blob we've not yet created.

As this stands, we still generate a separate TOC entry for any comment
or security label attached to a blob.  I feel comfortable in believing
that comments and security labels on blobs are rare, so this patch
should be enough to get most of the useful TOC compression for blobs.

We have to bump the archive file format version number, since existing
versions of pg_restore wouldn't know they need to do something special
for BLOB METADATA, plus they aren't going to work correctly with
multiple BLOBS entries or multiple-large-object ACL entries.

The directory and tar-file format handlers need some work
for multiple BLOBS entries: they used to hard-wire the file name
as "blobs.toc", which is replaced here with "blobs_<dumpid>.toc".
The 002_pg_dump.pl test script also knows about that and requires
minor updates.  (I had to drop the test for manually-compressed
blobs.toc files with LZ4, because lz4's obtuse command line
design requires explicit specification of the output file name,
which seems impractical here.  I don't think we're losing any
useful test coverage thereby; that test stanza seems completely
duplicative with the gzip and zstd cases anyway.)

In passing, centralize management of the lo_buf used to hold data
while restoring blobs.  The code previously had each format handler
create lo_buf, which seems rather pointless given that the format
handlers all make it the same way.  Moreover, the format handlers
never use lo_buf directly, making this setup a failure from a
separation-of-concerns standpoint.  Let's move the responsibility into
pg_backup_archiver.c, which is the only module concerned with lo_buf.
The reason to do this in this patch is that it allows a centralized
fix for the now-false assumption that we never restore blobs in
parallel.  Also, get rid of dead code in DropLOIfExists: it's been a
long time since we had any need to be able to restore to a pre-9.0
server.

Discussion: https://postgr.es/m/a9f9376f1c3343a6bb319dce294e20ac@EX13D05UWC001.ant.amazon.com

Avoid possible longjmp-induced logic error in PLy_trigger_build_args.

The "pltargs" variable wasn't marked volatile, which makes it unsafe
to change its value within the PG_TRY block.  It looks like the worst
outcome would be to fail to release a refcount on Py_None during an
(improbable) error exit, which would likely go unnoticed in the field.
Still, it's a bug.  A one-liner fix could be to mark pltargs volatile,
but on the whole it seems cleaner to arrange things so that we don't
change its value within PG_TRY.

Per report from Xing Guo.  This has been there for quite a while,
so back-patch to all supported branches.

Discussion: https://postgr.es/m/CACpMh+DLrk=fDv07MNpBT4J413fDAm+gmMXgi8cjPONE+jvzuw@mail.gmail.com

Fix assorted resource leaks in new pg_createsubscriber code.

Various error paths did not release resources before returning.
While it's likely that the program would just exit shortly later,
none of the functions in question have summary exit(1) calls,
so they should not be assuming that.

Ranier Vilela and Tom Lane, per reports from Coverity

Discussion: https://postgr.es/m/CAEudQAr2_SZFxB4kXJiL4+2UaNZxUk5UBJtj0oXyJYMGZu-03g@mail.gmail.com

Handle non-chain tuples outside of heap_prune_chain()

Handle dead branches of aborted HOT chains outside heap_prune_chain()
as a separate phase. This simplifies the logic in heap_prune_chain(),
as well as allowing us to clean up more RECENTLY_DEAD -> DEAD chains.

To accomplish this efficiently, partition tuples into HOT and non-HOT
while first collecting visibility information for each tuple in
heap_page_prune(). Then call heap_prune_chain() only on potential
chain members. Then mop up the leftover HOT tuples afterwards.

As part of this, keep track of which items on the page have already
been processed, in a 'processed' array. This replaces the 'marked'
array, which was only set for tuples marked for removal or redirection.
The 'processed' array is also updated for items that we conclude can
be left unchanged. At the end of pruning, every item on the page
should be marked as processed in the array; an assertion is added for
that.

Author: Melanie Plageman
Author: Heikki Linnakangas
Discussion: https://www.postgresql.org/message-id/20240330055710.kqg6ii2cdojsxgje@liskov

Refactor heap_prune_chain()

Keep track of the number of deleted tuples in PruneState and record this
information when recording a tuple dead, unused or redirected. This
removes a special case from the traversal and chain processing logic as
well as setting a precedent of recording the impact of prune actions in
the record functions themselves. This paradigm will be used in future
commits which move tracking of additional statistics on pruning actions
from lazy_scan_prune() to heap_prune_chain().

Simplify heap_prune_chain()'s chain traversal logic by handling each
case explicitly. That is, do not attempt to share code when processing
different types of chains. For each category of chain, process it
specifically and procedurally: first handling the root, then any
intervening tuples, and, finally, the end of the chain.

While we are at it, add a few new comments to heap_prune_chain()
clarifying some special cases involving RECENTLY_DEAD tuples.

Author: Melanie Plageman
Discussion: https://www.postgresql.org/message-id/20240330055710.kqg6ii2cdojsxgje@liskov

Minor refactoring in heap_page_prune

Pass 'page', 'blockno' and 'maxoff' to heap_prune_chain() as
arguments, so that it doesn't need to fetch them from the buffer. This
saves a few cycles per chain.

Remove the "if (off_loc != NULL)" checks, and require the caller to
pass a non-NULL 'off_loc'. Pass a pointer to a dummy local variable
when it's not needed. Those checks are cheap, but it's still better to
avoid them in the per-chain loops when we can do so easily.

The CPU time savings from these changes are hardly measurable, but
fewer instructions is good anyway, so why not. I spotted the potential
for these while reviewing Melanie Plageman's patch set to combine
prune and freeze records.

Discussion: https://www.postgresql.org/message-id/CAAKRu_abm2tHhrc0QSQa%3D%3DsHe%3DVA1%3Doz1dJMQYUOKuHmu%2B9Xrg%40mail.gmail.com

Add new COPY option LOG_VERBOSITY.

This commit adds a new COPY option LOG_VERBOSITY, which controls the
amount of messages emitted during processing. Valid values are
'default' and 'verbose'.

This is currently used in COPY FROM when the ON_ERROR option is set to
ignore. If 'verbose' is specified, a NOTICE message is emitted for
each discarded row, providing additional information such as line
number, column name, and the malformed value. This helps users to
identify problematic rows that failed to load.
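
The behavior described above can be modeled with a small Python sketch. It only
imitates the semantics (skip bad rows, optionally report each one); copy_from,
its parameters, and the notice wording are invented for this example and are not
the server's implementation or message text.

```python
def copy_from(lines, columns, on_error="stop", log_verbosity="default"):
    """Toy model of COPY FROM with ON_ERROR and LOG_VERBOSITY.

    lines: comma-separated input lines.
    columns: list of (name, converter) pairs.
    Returns (loaded_rows, notices).
    """
    rows, notices = [], []
    for lineno, line in enumerate(lines, start=1):
        row, bad = [], None
        for (name, convert), raw in zip(columns, line.split(",")):
            try:
                row.append(convert(raw))
            except ValueError:
                bad = (name, raw)  # remember the first malformed field
                break
        if bad is None:
            rows.append(tuple(row))
        elif on_error != "ignore":
            raise ValueError(f"line {lineno}, column {bad[0]}: {bad[1]!r}")
        elif log_verbosity == "verbose":
            # Hypothetical notice text: line number, column name, bad value.
            notices.append(f'skipping row at line {lineno}: column {bad[0]}: "{bad[1]}"')
    return rows, notices


cols = [("a", int), ("b", int)]
data = ["1,2", "x,3", "4,5"]
rows, notices = copy_from(data, cols, on_error="ignore", log_verbosity="verbose")
quiet_rows, quiet_notices = copy_from(data, cols, on_error="ignore")
```

With 'verbose', the second input line produces one notice naming line 2 and
column a; with 'default', the same row is discarded silently.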

Author: Bharath Rupireddy
Reviewed-by: Michael Paquier, Atsushi Torikoshi, Masahiko Sawada
Discussion: https://www.postgresql.org/message-id/CALj2ACUk700cYhx1ATRQyRw-fBM%2BaRo6auRAitKGff7XNmYfqQ%40mail.gmail.com

Revert "Speed up tail processing when hashing aligned C strings"

This reverts commit 07f0f6abfc7f6c55cede528d9689dedecefc734a.

This has shown failures on both Valgrind and big-endian machines,
per members skink and pike.

Speed up tail processing when hashing aligned C strings

After encountering the NUL terminator, the word-at-a-time loop exits
and we must hash the remaining bytes. Previously we calculated the
terminator's position and re-loaded the remaining bytes from the input
string. We already have all the data we need in a register, so let's
just mask off the bytes we need and hash them immediately. The mask can
be cheaply computed without knowing the terminator's position. We still
need that position for the length calculation, but the CPU can now
do that in parallel with other work, shortening the dependency chain.
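
The masking idea can be sketched in Python as follows. This is an illustration
only, not the code of this (subsequently reverted) commit: it assumes a 64-bit
word loaded in little-endian byte order, and the names haszero64,
tail_bytes_mask, and tail_length are chosen for this example.

```python
ONES = 0x0101010101010101
HIGHS = 0x8080808080808080


def haszero64(v):
    """Return nonzero iff some byte of the 64-bit word v is zero.

    The lowest set bit of the result is the high bit of the first zero byte;
    higher bits may be set spuriously and must not be relied on.
    """
    return (v - ONES) & ~v & HIGHS


def tail_bytes_mask(zero_bits):
    """Mask selecting the bytes that precede the first zero byte.

    Computed from the haszero64() result alone, without first working out
    the terminator's position.
    """
    return (zero_bits ^ (zero_bits - 1)) >> 8


def tail_length(zero_bits):
    """Index of the first zero byte, needed separately for the length."""
    lowest = zero_bits & -zero_bits
    return (lowest.bit_length() - 1) // 8


# An 8-byte chunk whose string data ends after "abc"; the rest is garbage.
chunk = int.from_bytes(b"abc\x00JUNK", "little")
z = haszero64(chunk)
tail = chunk & tail_bytes_mask(z)
```

The masked value contains only the bytes "abc", ready to be hashed, while
tail_length(z) yields 3 independently of the masking step.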

Ants Aasma and John Naylor

Discussion: https://postgr.es/m/CANwKhkP7pCiW_5fAswLhs71-JKGEz1c1%2BPC0a_w1fwY4iGMqUA%40mail.gmail.com

Let table AM insertion methods control index insertion

Previously, the executor did index inserts unconditionally after calling the
table AM interface methods tuple_insert() and multi_insert().  This commit
introduces the new parameter insert_indexes for these two methods.  Setting
'*insert_indexes' to true preserves the current behavior.  Setting it to false
indicates that the table AM takes care of index inserts itself and doesn't
want the caller to do that.

Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
Reviewed-by: Pavel Borisov, Matthias van de Meent, Mark Dilger

Custom reloptions for table AM

Let a table AM define custom reloptions for its tables. This allows
AM-specific parameters to be specified via the WITH clause when creating a
table.

The code may use some parts from prior work by Hao Wu.

Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
Discussion: https://postgr.es/m/AMUA1wBBBxfc3tKRLLdU64rb.1.1683276279979.Hmail.wuhao%40hashdata.cn
Reviewed-by: Pavel Borisov, Matthias van de Meent

Generalize relation analyze in table AM interface

Currently, there is just one algorithm for sampling tuples from a table,
written in acquire_sample_rows().  A custom table AM can only redefine the way
to get the next block/tuple by implementing the scan_analyze_next_block() and
scan_analyze_next_tuple() API functions.

This approach doesn't seem general enough.  For instance, it's unclear how to
sample index-organized tables this way.  This commit allows a table AM to
encapsulate the whole sampling algorithm (currently implemented in
acquire_sample_rows()) into the relation_analyze() API function.

Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
Reviewed-by: Pavel Borisov, Matthias van de Meent

Add pg_basetype() function to extract a domain's base type.

This SQL-callable function behaves much like our internal utility
function getBaseType(), except it returns NULL rather than failing for
an invalid type OID. (That behavior is modeled on our experience with
other catalog-inquiry functions such as the ACL checking functions.)
The key advantage over doing a join to pg_type is that it will loop
as needed to find the bottom base type of a nest of domains.
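
The looping behavior can be modeled with a short Python sketch. It is a toy
model, not the server's implementation: the catalog dictionary stands in for
pg_type (with its typtype and typbasetype columns), and the two domain OIDs are
made up for the example.

```python
def pg_basetype(catalog, type_oid):
    """Follow a chain of domains down to the bottom base type.

    catalog maps type OID -> (typtype, typbasetype); typtype 'd' marks a
    domain.  Returns None, rather than raising, for an unknown OID.
    """
    while True:
        entry = catalog.get(type_oid)
        if entry is None:
            return None          # invalid type OID
        typtype, typbasetype = entry
        if typtype != "d":
            return type_oid      # not a domain: this is the base type
        type_oid = typbasetype   # step down one level of the domain nest


# Toy catalog: int4 (OID 23) plus two hypothetical nested domains.
catalog = {
    23: ("b", 0),
    90001: ("d", 23),      # domain over int4
    90002: ("d", 90001),   # domain over the first domain
}
base = pg_basetype(catalog, 90002)
missing = pg_basetype(catalog, 12345)
```

A single join to pg_type would resolve 90002 only to 90001; the loop continues
until it reaches 23.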

Steve Chavez, reviewed by jian he and others

Discussion: https://postgr.es/m/CAGRrpzZSX8j=MQcbCSEisFA=ic=K3bknVfnFjAv1diVJxFHJvg@mail.gmail.com

Stabilize postgres_fdw test

The test fails when RESET statement_timeout takes longer than 10ms.
Avoid the problem by using SET LOCAL instead.

Overall, this test is not ideal: 10ms could be shorter than the time to
have sent the query to the "remote" server, so it's possible that on
some machines this test doesn't actually witness a remote query being
cancelled. We may want to improve on this someday by using some other
testing technique, but for now it's better than nothing. I verified
manually that one round of remote cancellation occurs when this runs on
my machine.

Discussion: https://postgr.es/m/CAGECzQRsdWnj=YaaPCnA8d7E1AdbxRPBYmyBQRMPUijR2MpM_w@mail.gmail.com

doc: Improve "Partition Maintenance" section

This adds some reference links and clarifies the wording a bit.

Author: Robert Treat
Reviewed-by: Ashutosh Bapat
Discussion: https://postgr.es/m/CABV9wwNGn-pweak6_pvL5PJ1mivDNPKfg0Tck_1oTUETv5Y=dg@mail.gmail.com

Add support for MERGE ... WHEN NOT MATCHED BY SOURCE.

This allows MERGE commands to include WHEN NOT MATCHED BY SOURCE
actions, which operate on rows that exist in the target relation, but
not in the data source. These actions can execute UPDATE, DELETE, or
DO NOTHING sub-commands.

This is in contrast to already-supported WHEN NOT MATCHED actions,
which operate on rows that exist in the data source, but not in the
target relation. To make this distinction clearer, such actions may
now be written as WHEN NOT MATCHED BY TARGET.

Writing WHEN NOT MATCHED without specifying BY SOURCE or BY TARGET is
equivalent to writing WHEN NOT MATCHED BY TARGET.
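
The three categories of rows can be illustrated with a simplified Python model.
It assumes unique join keys on both sides and ignores per-action conditions;
classify_merge_rows is a name invented for this sketch and says nothing about
how the executor implements MERGE.

```python
def classify_merge_rows(target_keys, source_keys):
    """Classify join keys the way MERGE's WHEN clauses see them.

    Simplification: keys are unique on both sides, so set operations
    are enough to model the join.
    """
    target, source = set(target_keys), set(source_keys)
    return {
        # Rows present on both sides.
        "MATCHED": sorted(target & source),
        # Source rows with no target row (plain WHEN NOT MATCHED).
        "NOT MATCHED BY TARGET": sorted(source - target),
        # Target rows with no source row (the new clause).
        "NOT MATCHED BY SOURCE": sorted(target - source),
    }


result = classify_merge_rows(target_keys=[1, 2, 3], source_keys=[2, 3, 4])
```

In this example key 4 exists only in the source, so it falls under NOT MATCHED
BY TARGET, while key 1 exists only in the target and falls under NOT MATCHED BY
SOURCE, where it could be updated or deleted.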

Dean Rasheed, reviewed by Alvaro Herrera, Ted Yu and Vik Fearing.

Discussion: https://postgr.es/m/CAEZATCWqnKGc57Y_JanUBHQXNKcXd7r=0R4NEZUVwP+syRkWbA@mail.gmail.com