Fix gang write late_arrival bug #17824
Open
+9
−1
Sponsored by: [Klara, Inc.; Wasabi Technology, Inc.]
Motivation and Context
When a write comes in via dmu_sync_late_arrival, its TXG is equal to the open TXG. If that write gangs, and we have not yet activated the new gang header feature, and the gang header we pick can store a larger gang header, we will try to schedule the upgrade for the open TXG + 1. In debug mode, this causes an assertion to trip in `txg_verify`. I can pretty reliably reproduce this on a performance test setup I have.
Description
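A toy model of the off-by-one may make the failure and the fix easier to follow. This is only a sketch, not the OpenZFS code: `toy_pool_t`, `toy_txg_ok`, `activation_txg_old`, and `activation_txg_new` are hypothetical names, the check models only the assumed "no larger than the open TXG" bound, and the assumption that the upgrade is scheduled for the write's TXG + 1 is inferred from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the pool state; holds only the open TXG. */
typedef struct {
	uint64_t open_txg;
} toy_pool_t;

/* Toy upper-bound check (assumption): a TXG may not exceed the open TXG. */
static bool
toy_txg_ok(const toy_pool_t *p, uint64_t txg)
{
	return (txg <= p->open_txg);
}

/* Assumed old behavior: schedule activation for the write's TXG + 1. */
static uint64_t
activation_txg_old(uint64_t write_txg)
{
	return (write_txg + 1);
}

/* Sketch of the fix: clamp the activation TXG to the current open TXG. */
static uint64_t
activation_txg_new(const toy_pool_t *p, uint64_t write_txg)
{
	uint64_t want = write_txg + 1;
	uint64_t open = p->open_txg;	/* read once; it can only grow later */

	return (want < open ? want : open);
}
```

With an open TXG of 100, a late-arrival write at TXG 100 yields 101 from `activation_txg_old`, which fails the toy check, while `activation_txg_new` returns 100. A write at an older TXG such as 98 is unaffected and still yields 99.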
This PR sets the TXG for activating the feature to be at most the current open TXG. Activating the feature a TXG early shouldn't cause any problems, since we don't use the activation TXG directly. I believe this method of accessing the current open TXG is safe: the open TXG could increase while the comparison/replacement is happening, but I don't believe a value larger than the open TXG can get into `txg_verify` with this code. We don't use atomics anywhere else to access this field, so they shouldn't be necessary here either.
How Has This Been Tested?
Manual testing with the workload that originally triggered the problem.
Unfortunately I haven't been able to find a small reproducer for this. I've tried a few things, but we need the first thing that gangs to be a `dmu_sync_late_arrival` call, which is not trivial to orchestrate. If anyone has ideas for a test that would work, I'm happy to try them out.
Types of changes
Checklist:
`Signed-off-by`.