-
Notifications
You must be signed in to change notification settings - Fork 25.2k
Do not apply further shard snapshot status updates after shard snapshot is complete #127250
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not apply further shard snapshot status updates after shard snapshot is complete #127250
Conversation
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination) |
Hi @DiannaHohensee, I've created a changelog YAML for you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some name refactoring and commenting. I put comments on any functional changes to make them obvious.
// For example, a delayed/retried PAUSED update should not override a completed shard snapshot. | ||
iterator.remove(); | ||
return; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actual bug fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great stuff.
final ShardSnapshotStatus updatedState; | ||
final var newShardSnapshotStatusesBuilder = newShardSnapshotStatusesSupplier.get(); | ||
final var newShardSnapshotStatus = newShardSnapshotStatusesBuilder.get(shardSnapshotId); | ||
if (newShardSnapshotStatus != null && newShardSnapshotStatus.state().completed()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should always be non-null - the builder starts with a copy of existingShardSnapshotStatuses
and just overwrites the bits that get updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohh, that took an unexpected turn. Thanks, updated.
), | ||
ActionTestUtils.assertNoFailureListener(t -> {}) | ||
), | ||
// Snapshot 2 will apply PAUSED and then SUCCESS, and the final status will be SUCCESS. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not realistic as a single batch of updates for a single shard - snapshot2
won't start snapshotting this shard until snapshot1
has applied a state update which completes it.
Can we instead randomize the order of the updates? Or randomly add a PAUSED_FOR_NODE_REMOVAL
before and/or after the SUCCESS
one and verify that the outcome is SUCCESS
in every case (and the next snapshot moves from QUEUED
to INIT
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've randomized the order of PAUSED and COMPLETE updates, and randomly add a third PAUSED/COMPLETE update (with a different nodeId to verify it's ignored). And checked the QUEUED -> INIT transition.
server/src/test/java/org/elasticsearch/snapshots/SnapshotsServiceTests.java
Outdated
Show resolved
Hide resolved
snapshot1, | ||
shardId, | ||
null, | ||
SnapshotsInProgress.ShardSnapshotStatus.success( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should also apply to PAUSED_FOR_NODE_REMOVAL
either side of a FAILED
right? Can we randomly choose between SUCCESS
and FAILED
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, done 👍
…iceTests.java Co-authored-by: David Turner <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review, applied the changes 👍
final ShardSnapshotStatus updatedState; | ||
final var newShardSnapshotStatusesBuilder = newShardSnapshotStatusesSupplier.get(); | ||
final var newShardSnapshotStatus = newShardSnapshotStatusesBuilder.get(shardSnapshotId); | ||
if (newShardSnapshotStatus != null && newShardSnapshotStatus.state().completed()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohh, that took an unexpected turn. Thanks, updated.
snapshot1, | ||
shardId, | ||
null, | ||
SnapshotsInProgress.ShardSnapshotStatus.success( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, done 👍
), | ||
ActionTestUtils.assertNoFailureListener(t -> {}) | ||
), | ||
// Snapshot 2 will apply PAUSED and then SUCCESS, and the final status will be SUCCESS. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've randomized the order of PAUSED and COMPLETE updates, and randomly add a third PAUSED/COMPLETE update (with a different nodeId to verify it's ignored). And checked the QUEUED -> INIT transition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (one optional request on the tests)
// Randomly add another update that will be ignored because the shard snapshot is complete. | ||
// Note: the originalNodeId is used for this update, so we can verify afterward that the update is not applied. | ||
randomBoolean() ? completedUpdateOnOriginalNode : pausedUpdateOnOriginalNode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally I'd like us to only sometimes have this third update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I've if-else'd it, for lack of a better notion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry forgot to press the LGTM button
I think this failure matches pre-existing failures. I can't fathom how the other failure is related: |
Relates ES-11375