fix(pageserver): frozen->L0 flush failure causes data loss #12043
This patch is a fixup for #6788.

## Background
That PR (#6788) added artificial advancement of `disk_consistent_lsn` and `remote_consistent_lsn` for shards that weren't written to while other shards were written to. See the PR description for more context.
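To illustrate the mechanism, here is a minimal sketch with hypothetical names (`ShardFlushState`, `advance_artificially`), not the actual pageserver code:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for a shard's flush-side state.
struct ShardFlushState {
    /// Highest LSN whose effects are durable in local layer files.
    disk_consistent_lsn: AtomicU64,
}

impl ShardFlushState {
    /// After WAL up to `up_to_lsn` has been processed, a shard that
    /// received no records in that range (the WAL touched only other
    /// shards) still advances its consistent LSN "artificially", so
    /// that WAL can eventually be trimmed upstream.
    fn advance_artificially(&self, up_to_lsn: u64) {
        self.disk_consistent_lsn.fetch_max(up_to_lsn, Ordering::SeqCst);
    }
}
```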
At the time of that PR, Pageserver shards were doing the WAL filtering themselves. Nowadays, the WAL filtering happens in Safekeepers; shards learn about the WAL gaps via `InterpretedWalRecords::next_record_lsn`.
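As a rough illustration of that gap-learning (simplified stand-in types, not the actual `InterpretedWalRecords` definition), a shard advances its record pointer past filtered-out WAL using the batch's `next_record_lsn`:

```rust
/// Simplified stand-in for the interpreted-WAL batch a Safekeeper sends;
/// everything except the `next_record_lsn` concept is hypothetical.
struct InterpretedWalRecords {
    /// Records relevant to this shard (may be empty if the WAL in this
    /// range touched only other shards).
    records: Vec<InterpretedWalRecord>,
    /// LSN the shard may consider processed once the batch is applied,
    /// even if `records` is empty. This is how a shard learns about gaps.
    next_record_lsn: u64,
}

struct InterpretedWalRecord {
    lsn: u64,
    // ... decoded payload elided ...
}

fn apply_batch(batch: &InterpretedWalRecords, last_record_lsn: &mut u64) {
    for rec in &batch.records {
        // ... apply the decoded record to this shard's open layer ...
        *last_record_lsn = rec.lsn;
    }
    // Even if `records` is empty, advance past the gap: WAL in this range
    // was filtered out upstream because it touched only other shards.
    *last_record_lsn = (*last_record_lsn).max(batch.next_record_lsn);
}
```

## The Bug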
That artificial advancement code also runs if the flush failed. So, we advance the `disk_consistent_lsn` / `remote_consistent_lsn` without having added the corresponding L0 to the `index_part.json`.

The frozen layer remains in the layer map until detach, so we continue to serve data correctly.
We're also not advancing the flush loop variable `flushed_to_lsn`, so subsequent flush requests will retry the flush and repair the situation if they succeed.
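Schematically, the flush loop before this patch behaved like the following (a condensed, self-contained sketch; all names and types are stand-ins for the real pageserver code):

```rust
struct Timeline {
    disk_consistent_lsn: u64,
    flushed_to_lsn: u64,
}

impl Timeline {
    /// Write the frozen in-memory layer to an L0 delta layer on disk
    /// and schedule its upload; may fail, e.g. on a local IO error.
    fn flush_frozen_layer(&mut self, _frozen_to_lsn: u64) -> Result<u64, String> {
        Err("simulated flush failure".into())
    }

    fn flush_loop_iteration(&mut self, frozen_to_lsn: u64) {
        let result = self.flush_frozen_layer(frozen_to_lsn);

        // BUG: the artificial advancement ran unconditionally, so a failed
        // flush still advanced disk_consistent_lsn (and, after upload,
        // remote_consistent_lsn) past an L0 that was never persisted.
        self.disk_consistent_lsn = self.disk_consistent_lsn.max(frozen_to_lsn);

        match result {
            // flushed_to_lsn only advances on success, so a later flush
            // request retries the flush and can repair the situation.
            Ok(lsn) => self.flushed_to_lsn = lsn,
            Err(e) => eprintln!("flush failed, will retry: {e}"),
        }
    }
}
```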
But if there aren't any successful retries, eventually the tenant will be detached. When it is attached somewhere else, it starts up from the `index_part.json`, and therefore from a layer map, that has the advanced `disk_consistent_lsn`/`remote_consistent_lsn` but is missing the lost L0. The result is that the read path will have a gap in the reconstruct data for the keys whose modifications were lost, resulting in

a) either a walredo failure,
b) or an incorrect page@lsn image if walredo doesn't error.
## The Fix

The fix is to only do the artificial advancement if `result.is_ok()`.
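In terms of the sketch above (reusing its hypothetical `Timeline` type, again not the literal diff), the advancement is now gated on the flush result:

```rust
impl Timeline {
    fn flush_loop_iteration_fixed(&mut self, frozen_to_lsn: u64) {
        let result = self.flush_frozen_layer(frozen_to_lsn);

        // FIX: only do the artificial advancement if the flush succeeded,
        // i.e. the L0 exists on disk and is queued for upload.
        if result.is_ok() {
            self.disk_consistent_lsn = self.disk_consistent_lsn.max(frozen_to_lsn);
        }

        match result {
            Ok(lsn) => self.flushed_to_lsn = lsn,
            Err(e) => eprintln!("flush failed, will retry: {e}"),
        }
    }
}
```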
## Misc
As an aside, I took some time to re-review the flush loop and its callers.
I found one more bug related to error handling that I filed here:
`Layer::finish_creating` fails (#12025)