Do not apply further shard snapshot status updates after shard snapshot is complete #127250

DiannaHohensee · 2025-04-23T14:02:24Z

Relates ES-11375

elasticsearchmachine · 2025-04-23T14:03:09Z

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

elasticsearchmachine · 2025-04-23T14:03:33Z

Hi @DiannaHohensee, I've created a changelog YAML for you.

DiannaHohensee

I did some name refactoring and commenting. I put comments on any functional changes to make them obvious.

DiannaHohensee · 2025-04-23T14:05:15Z

server/src/main/java/org/elasticsearch/snapshots/SnapshotsService.java

+                    // For example, a delayed/retried PAUSED update should not override a completed shard snapshot.
+                    iterator.remove();
+                    return;
+                }


This is the actual bug fix.
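The guard in the hunk above can be illustrated with a minimal, self-contained sketch. It assumes a simplified model: `CompletedShardGuardSketch`, its `ShardState` enum, `ShardUpdate`, and `applyUpdates` are hypothetical names invented for this illustration and are not the real `SnapshotsService` types. The only idea carried over from the PR is that an update is dropped once the shard snapshot is already in a completed state.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model; none of these types are the real Elasticsearch classes.
final class CompletedShardGuardSketch {

    enum ShardState {
        INIT, PAUSED_FOR_NODE_REMOVAL, SUCCESS, FAILED;

        // SUCCESS and FAILED are treated as completed states in this sketch.
        boolean completed() {
            return this == SUCCESS || this == FAILED;
        }
    }

    record ShardUpdate(String shardId, ShardState newState) {}

    // Applies the updates in order, dropping any update for a shard whose snapshot is already complete.
    static Map<String, ShardState> applyUpdates(Map<String, ShardState> existing, List<ShardUpdate> updates) {
        Map<String, ShardState> statuses = new LinkedHashMap<>(existing);
        for (ShardUpdate update : updates) {
            ShardState current = statuses.get(update.shardId());
            if (current != null && current.completed()) {
                // For example, a delayed or retried PAUSED update arriving after SUCCESS is ignored.
                continue;
            }
            statuses.put(update.shardId(), update.newState());
        }
        return statuses;
    }

    public static void main(String[] args) {
        Map<String, ShardState> result = applyUpdates(
            Map.of("shard-0", ShardState.INIT),
            List.of(
                new ShardUpdate("shard-0", ShardState.SUCCESS),
                new ShardUpdate("shard-0", ShardState.PAUSED_FOR_NODE_REMOVAL)
            )
        );
        System.out.println(result);
    }
}
```

In this model the late `PAUSED_FOR_NODE_REMOVAL` update leaves the shard at `SUCCESS`, which is the behavior the fix is meant to guarantee.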

DaveCTurner

Great stuff.

DaveCTurner · 2025-04-23T14:12:12Z

server/src/main/java/org/elasticsearch/snapshots/SnapshotsService.java

-                final ShardSnapshotStatus updatedState;
+                final var newShardSnapshotStatusesBuilder = newShardSnapshotStatusesSupplier.get();
+                final var newShardSnapshotStatus = newShardSnapshotStatusesBuilder.get(shardSnapshotId);
+                if (newShardSnapshotStatus != null && newShardSnapshotStatus.state().completed()) {


This should always be non-null - the builder starts with a copy of existingShardSnapshotStatuses and just overwrites the bits that get updated.

DiannaHohensee · 2025-04-23T14:33:18Z

Ohh, that took an unexpected turn. Thanks, updated.
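The non-null argument can be sketched as follows. This is an illustration only: `StatusesBuilderSketch` and its methods are hypothetical and use plain strings in place of the real status types. It assumes, as the review comment states, that the builder starts from a copy of the existing statuses and only overwrites the entries being updated, so a lookup for an existing shard snapshot id cannot come back null.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for a statuses builder; not the Elasticsearch implementation.
final class StatusesBuilderSketch {

    private final Map<String, String> statuses;

    // The builder starts as a copy of the existing statuses.
    StatusesBuilderSketch(Map<String, String> existingStatuses) {
        this.statuses = new HashMap<>(existingStatuses);
    }

    // Overwrites only the entry being updated; every other entry keeps its copied value.
    StatusesBuilderSketch put(String shardSnapshotId, String status) {
        statuses.put(shardSnapshotId, status);
        return this;
    }

    // For any id present in the existing statuses this lookup is non-null,
    // so the sketch fails fast instead of silently tolerating null.
    String get(String shardSnapshotId) {
        return Objects.requireNonNull(statuses.get(shardSnapshotId), "unknown shard snapshot id");
    }

    public static void main(String[] args) {
        StatusesBuilderSketch builder = new StatusesBuilderSketch(Map.of("shard-0", "INIT", "shard-1", "QUEUED"));
        builder.put("shard-0", "SUCCESS");
        System.out.println(builder.get("shard-0") + " " + builder.get("shard-1"));
    }
}
```

Failing fast on a missing entry is one way to express the expectation; an assertion would be another reasonable choice.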

DaveCTurner · 2025-04-23T14:15:45Z

server/src/test/java/org/elasticsearch/snapshots/SnapshotsServiceTests.java

+                ),
+                ActionTestUtils.assertNoFailureListener(t -> {})
+            ),
+            // Snapshot 2 will apply PAUSED and then SUCCESS, and the final status will be SUCCESS.


This is not realistic as a single batch of updates for a single shard - snapshot2 won't start snapshotting this shard until snapshot1 has applied a state update which completes it.

Can we instead randomize the order of the updates? Or randomly add a PAUSED_FOR_NODE_REMOVAL before and/or after the SUCCESS one and verify that the outcome is SUCCESS in every case (and the next snapshot moves from QUEUED to INIT)?

DiannaHohensee · 2025-04-23T16:52:48Z

I've randomized the order of the PAUSED and COMPLETE updates, and randomly added a third PAUSED/COMPLETE update (with a different nodeId to verify it's ignored). I also checked the QUEUED -> INIT transition.

DaveCTurner · 2025-04-23T14:18:18Z

server/src/test/java/org/elasticsearch/snapshots/SnapshotsServiceTests.java

+                snapshot1,
+                shardId,
+                null,
+                SnapshotsInProgress.ShardSnapshotStatus.success(


This should also apply to PAUSED_FOR_NODE_REMOVAL either side of a FAILED right? Can we randomly choose between SUCCESS and FAILED?

DiannaHohensee · 2025-04-23T15:57:52Z

Good idea, done 👍

DiannaHohensee

Thanks for the review, applied the changes 👍

DaveCTurner

LGTM (one optional request on the tests)

DaveCTurner · 2025-04-23T17:07:27Z

server/src/test/java/org/elasticsearch/snapshots/SnapshotsServiceTests.java

+            // Randomly add another update that will be ignored because the shard snapshot is complete.
+            // Note: the originalNodeId is used for this update, so we can verify afterward that the update is not applied.
+            randomBoolean() ? completedUpdateOnOriginalNode : pausedUpdateOnOriginalNode


Ideally I'd like us to only sometimes have this third update.

Sure. I've if-else'd it, for lack of a better notion.
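The randomization discussed in the test threads above can be sketched roughly as below. This is not the code in `SnapshotsServiceTests`: `RandomizedUpdateOrderSketch`, `State`, `Outcome`, and `runOnce` are hypothetical, and `java.util.Random` stands in for the test framework's `randomBoolean()`. The sketch shuffles a PAUSED update and a completing update (randomly SUCCESS or FAILED), only sometimes appends a third update that should be ignored, and then checks that the first snapshot ends in the chosen completed state while the next snapshot moves from QUEUED to INIT.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical model of one shard in two consecutive snapshots; not the real test code.
final class RandomizedUpdateOrderSketch {

    enum State {
        QUEUED, INIT, PAUSED_FOR_NODE_REMOVAL, SUCCESS, FAILED;

        boolean completed() {
            return this == SUCCESS || this == FAILED;
        }
    }

    record Outcome(State expected, State firstSnapshot, State secondSnapshot) {}

    static Outcome runOnce(Random random) {
        // Randomly choose which completed state the first snapshot should end in.
        State terminal = random.nextBoolean() ? State.SUCCESS : State.FAILED;
        State otherTerminal = terminal == State.SUCCESS ? State.FAILED : State.SUCCESS;

        // Randomize the order of the PAUSED update and the completing update.
        List<State> updates = new ArrayList<>(List.of(State.PAUSED_FOR_NODE_REMOVAL, terminal));
        Collections.shuffle(updates, random);

        // Only sometimes add a third update; it must be ignored because the shard is already complete.
        if (random.nextBoolean()) {
            updates.add(random.nextBoolean() ? otherTerminal : State.PAUSED_FOR_NODE_REMOVAL);
        }

        State first = State.INIT;    // shard status in the first snapshot
        State second = State.QUEUED; // same shard, queued in the second snapshot
        for (State update : updates) {
            if (first.completed()) {
                continue; // no further updates once the shard snapshot is complete
            }
            first = update;
            if (first.completed()) {
                second = State.INIT; // the next snapshot can now start on this shard
            }
        }
        return new Outcome(terminal, first, second);
    }

    public static void main(String[] args) {
        Random random = new Random(42L);
        for (int i = 0; i < 100; i++) {
            Outcome outcome = runOnce(random);
            if (outcome.firstSnapshot() != outcome.expected() || outcome.secondSnapshot() != State.INIT) {
                throw new AssertionError(outcome);
            }
        }
        System.out.println("all iterations passed");
    }
}
```

Using the opposite completed state for the optional third update makes an accidental overwrite visible, which plays the same role as the different nodeId in the real test.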

DaveCTurner

Sorry, I forgot to press the LGTM button.

DiannaHohensee · 2025-04-24T15:10:35Z

I think this failure matches pre-existing failures.

I can't fathom how the other failure is related: "Accessing unreadable inputs or outputs is not supported" and "Failed to create MD5 hash for file".

Do not apply further shard snapshot status updates after shard snapshot is complete

aba1716

DiannaHohensee self-assigned this Apr 23, 2025

DiannaHohensee added the :Distributed Coordination/Snapshot/Restore, Team:Distributed Coordination, and >bug labels Apr 23, 2025

elasticsearchmachine added the v9.1.0 label Apr 23, 2025

Update docs/changelog/127250.yaml

d97d047

DiannaHohensee commented Apr 23, 2025

DiannaHohensee requested a review from DaveCTurner April 23, 2025 14:09

DaveCTurner reviewed Apr 23, 2025

DiannaHohensee and others added 5 commits April 23, 2025 10:33

expect non-null builder entries

114411e

Update server/src/test/java/org/elasticsearch/snapshots/SnapshotsServiceTests.java (Co-authored-by: David Turner <[email protected]>)

fe54ad4

randomize testing

f7332bd

verify queued

0c9b6fb

Merge branch 'main' into 2025/04/23/ES-11375-snapshot-update-fix

dcd9807

DiannaHohensee commented Apr 23, 2025

DaveCTurner reviewed Apr 23, 2025

DaveCTurner approved these changes Apr 23, 2025

DiannaHohensee and others added 7 commits April 23, 2025 13:55

randomize last entry

ce684dc

Merge branch 'main' into 2025/04/23/ES-11375-snapshot-update-fix

25b1151

[CI] Auto commit changes from spotless

e28879d

undo version comment out

ba7dded

another typo...

e02802d

Merge branch 'main' into 2025/04/23/ES-11375-snapshot-update-fix

c7f2024

Merge branch 'main' into 2025/04/23/ES-11375-snapshot-update-fix

0679c2d

DiannaHohensee merged commit 1dfb70e into elastic:main Apr 24, 2025
17 checks passed
