
fix(pageserver): frozen->L0 flush failure causes data loss #12043


Merged
merged 2 commits into main from problame/frozen-l0-flush-failure-data-loss
May 30, 2025

Conversation

problame
Contributor

Problem

Summary of changes

This patch is a fixup for
- #6788

Background
----------
That PR (#6788) added artificial advancement of `disk_consistent_lsn`
and `remote_consistent_lsn` for shards that weren't written to
while other shards _were_ written to.
See that PR's description for more context.

At the time of that PR, Pageserver shards were doing WAL filtering.
Nowadays, the WAL filtering happens in Safekeepers.
Shards learn about the WAL gaps via `InterpretedWalRecords::next_record_lsn`.
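
As a rough illustration of the mechanism (hypothetical names and a simplified
stand-in type; the real structures differ), a shard whose records were all
filtered out still sees the batch's end LSN and can treat itself as caught up
to it:

```rust
/// Hypothetical, simplified stand-in for a batch of interpreted WAL as a
/// shard receives it from the safekeeper.
struct InterpretedBatchSketch {
    /// Records that actually concern this shard (may be empty if the WAL
    /// range only touched other shards).
    records: Vec<(u64, Vec<u8>)>, // (lsn, payload)
    /// LSN right after the last record of the original WAL range, even if
    /// every record in that range was filtered out for this shard.
    next_record_lsn: u64,
}

fn advance_last_record_lsn(current: u64, batch: &InterpretedBatchSketch) -> u64 {
    // Even if `batch.records` is empty for this shard, `next_record_lsn`
    // tells it the WAL has progressed; the mechanism from #6788 then
    // artificially advances disk_consistent_lsn / remote_consistent_lsn
    // to match.
    current.max(batch.next_record_lsn)
}
```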

The Bug
-------

That artificial advancement code also runs if the flush failed.
So, we advance the `disk_consistent_lsn` / `remote_consistent_lsn`
without having added the corresponding L0 to the `index_part.json`.
The frozen layer remains in the layer map until detach,
so we continue to serve data correctly.
We're not advancing the flush loop variable `flushed_to_lsn` either,
so subsequent flush requests will retry the flush and repair the
situation if they succeed.
But if there aren't any successful retries, eventually the tenant
will be detached, and when it is attached somewhere else, the
`index_part.json`, and therefore the layer map, ...
1. ... does not contain the frozen layer that failed to flush and
2. ... won't re-ingest that WAL either, because the walreceiver
   starts up with the advanced `disk_consistent_lsn`/`remote_consistent_lsn`.

The result is that the read path will have a gap in the reconstruct
data for the keys whose modifications were lost, resulting in either
a) a walredo failure, or
b) an incorrect page@lsn image if walredo doesn't error.
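
In rough pseudo-Rust, the problematic ordering looked like this (a simplified
sketch with made-up names, not the actual flush-loop code):

```rust
// Simplified sketch of the buggy flush-loop step; all names are hypothetical.
struct FlushLoopSketch {
    disk_consistent_lsn: u64,
    remote_consistent_lsn: u64,
    flushed_to_lsn: u64,
}

impl FlushLoopSketch {
    fn flush_one_frozen_layer(&mut self, frozen_to_lsn: u64) {
        // Write the frozen in-memory layer out as an L0 and schedule the
        // index_part.json update; this can fail.
        let result = self.write_l0_and_schedule_upload(frozen_to_lsn);

        // BUG: this ran even when `result` was an error, so the consistent
        // LSNs moved past an L0 that never reached index_part.json.
        self.disk_consistent_lsn = self.disk_consistent_lsn.max(frozen_to_lsn);
        self.remote_consistent_lsn = self.remote_consistent_lsn.max(frozen_to_lsn);

        match result {
            // Only advanced on success, which is why later flush requests
            // retry the failed flush and can repair the situation.
            Ok(()) => self.flushed_to_lsn = frozen_to_lsn,
            Err(err) => eprintln!("flush failed, will be retried: {err}"),
        }
    }

    fn write_l0_and_schedule_upload(&mut self, _frozen_to_lsn: u64) -> Result<(), String> {
        // Placeholder for the real flush + upload scheduling.
        Ok(())
    }
}
```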

The Fix
-------

The fix is to only do the artificial advancement if `result.is_ok()`.
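
Applied to the sketch above, the change amounts to gating the advancement on
the flush result (again, illustrative only):

```rust
// Only do the artificial advancement if the flush actually succeeded.
if result.is_ok() {
    self.disk_consistent_lsn = self.disk_consistent_lsn.max(frozen_to_lsn);
    self.remote_consistent_lsn = self.remote_consistent_lsn.max(frozen_to_lsn);
}
```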

Misc
----

As an aside, I took some time to re-review the flush loop and its callers.
I found one more bug related to error handling that I filed here:
- #12025
@problame problame requested a review from a team as a code owner May 28, 2025 15:03
@problame problame requested review from arpad-m and VladLazar May 28, 2025 15:03

github-actions bot commented May 28, 2025

8525 tests run: 7942 passed, 0 failed, 583 skipped (full report)


Flaky tests (1)

Postgres 14

Code coverage* (full report)

  • functions: 32.3% (9065 of 28076 functions)
  • lines: 48.6% (79951 of 164552 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
a3e3ffd at 2025-05-30T11:34:20.680Z :recycle:

@problame problame enabled auto-merge May 30, 2025 10:33
Member

@skyzh skyzh left a comment


LGTM but I'd also want to see a test case that guarantees that such a flush will eventually succeed and persist the data; the current test case only tests the behavior of flush failure.

@problame problame added this pull request to the merge queue May 30, 2025
Merged via the queue into main with commit 4a4a457 May 30, 2025
100 checks passed
@problame problame deleted the problame/frozen-l0-flush-failure-data-loss branch May 30, 2025 11:31