Skip to content

Do not apply further shard snapshot status updates after shard snapshot is complete #127250

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

DiannaHohensee
Copy link
Contributor

@DiannaHohensee DiannaHohensee commented Apr 23, 2025

Relates ES-11375

@DiannaHohensee DiannaHohensee self-assigned this Apr 23, 2025
@DiannaHohensee DiannaHohensee added :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed Coordination Meta label for Distributed Coordination team >bug labels Apr 23, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine
Copy link
Collaborator

Hi @DiannaHohensee, I've created a changelog YAML for you.

Copy link
Contributor Author

@DiannaHohensee DiannaHohensee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some name refactoring and commenting. I put comments on any functional changes to make them obvious.

// For example, a delayed/retried PAUSED update should not override a completed shard snapshot.
iterator.remove();
return;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actual bug fix

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great stuff.

final ShardSnapshotStatus updatedState;
final var newShardSnapshotStatusesBuilder = newShardSnapshotStatusesSupplier.get();
final var newShardSnapshotStatus = newShardSnapshotStatusesBuilder.get(shardSnapshotId);
if (newShardSnapshotStatus != null && newShardSnapshotStatus.state().completed()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should always be non-null - the builder starts with a copy of existingShardSnapshotStatuses and just overwrites the bits that get updated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh, that took an unexpected turn. Thanks, updated.

),
ActionTestUtils.assertNoFailureListener(t -> {})
),
// Snapshot 2 will apply PAUSED and then SUCCESS, and the final status will be SUCCESS.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not realistic as a single batch of updates for a single shard - snapshot2 won't start snapshotting this shard until snapshot1 has applied a state update which completes it.

Can we instead randomize the order of the updates? Or randomly add a PAUSED_FOR_NODE_REMOVAL before and/or after the SUCCESS one and verify that the outcome is SUCCESS in every case (and the next snapshot moves from QUEUED to INIT)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've randomized the order of PAUSED and COMPLETE updates, and randomly add a third PAUSED/COMPLETE update (with a different nodeId to verify it's ignored). And checked the QUEUED -> INIT transition.

snapshot1,
shardId,
null,
SnapshotsInProgress.ShardSnapshotStatus.success(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also apply to PAUSED_FOR_NODE_REMOVAL either side of a FAILED right? Can we randomly choose between SUCCESS and FAILED?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, done 👍

Copy link
Contributor Author

@DiannaHohensee DiannaHohensee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review, applied the changes 👍

final ShardSnapshotStatus updatedState;
final var newShardSnapshotStatusesBuilder = newShardSnapshotStatusesSupplier.get();
final var newShardSnapshotStatus = newShardSnapshotStatusesBuilder.get(shardSnapshotId);
if (newShardSnapshotStatus != null && newShardSnapshotStatus.state().completed()) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh, that took an unexpected turn. Thanks, updated.

snapshot1,
shardId,
null,
SnapshotsInProgress.ShardSnapshotStatus.success(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, done 👍

),
ActionTestUtils.assertNoFailureListener(t -> {})
),
// Snapshot 2 will apply PAUSED and then SUCCESS, and the final status will be SUCCESS.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've randomized the order of PAUSED and COMPLETE updates, and randomly add a third PAUSED/COMPLETE update (with a different nodeId to verify it's ignored). And checked the QUEUED -> INIT transition.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (one optional request on the tests)

Comment on lines 604 to 606
// Randomly add another update that will be ignored because the shard snapshot is complete.
// Note: the originalNodeId is used for this update, so we can verify afterward that the update is not applied.
randomBoolean() ? completedUpdateOnOriginalNode : pausedUpdateOnOriginalNode
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally I'd like us to only sometimes have this third update.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I've if-else'd it, for lack of a better notion.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry forgot to press the LGTM button

@DiannaHohensee
Copy link
Contributor Author

DiannaHohensee commented Apr 24, 2025

I think this failure matches pre-existing failures.

I can't fathom how the other failure is related: Accessing unreadable inputs or outputs is not supported.; Failed to create MD5 hash for file

@DiannaHohensee DiannaHohensee merged commit 1dfb70e into elastic:main Apr 24, 2025
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed Coordination Meta label for Distributed Coordination team v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants