
Fix shard size of initializing restored shard #126783


Conversation

DaveCTurner
Contributor

For shards being restored from a snapshot we use `SnapshotShardSizeInfo`
to track their sizes while they're unassigned, and then use
`ShardRouting#expectedShardSize` when they start to recover. However we
were incorrectly ignoring the `ShardRouting#expectedShardSize` value
when accounting for the movements of shards in the
`ClusterInfoSimulator`, which would sometimes cause us to assign more
shards to a node than its disk space should have allowed.

Closes #105331
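The fallback behaviour at the heart of this bug can be sketched as follows. This is a simplified, hypothetical model (class and method names invented for illustration), not the actual Elasticsearch code: the point is that when no size is known for a shard, the simulator uses whatever fallback it is handed, so passing the "unavailable" sentinel instead of the routing entry's expected size makes a restoring shard look free to place.

```java
import java.util.Map;

public class SimulatorSketch {
    // Sentinel meaning "no size information", as in the real code's constant.
    static final long UNAVAILABLE_EXPECTED_SHARD_SIZE = -1L;

    // Simplified stand-in for getExpectedShardSize: use the size already known
    // to the simulator if present, otherwise fall back to the supplied default.
    static long expectedShardSize(String shardId, long fallback, Map<String, Long> knownSizes) {
        Long known = knownSizes.get(shardId);
        return known != null ? known : fallback;
    }

    public static void main(String[] args) {
        // A shard being restored from a snapshot: nothing in ClusterInfo yet.
        Map<String, Long> knownSizes = Map.of();
        // The size recorded on the ShardRouting when the restore was allocated.
        long routingExpectedSize = 5_000_000_000L;

        // Before the fix: the fallback was the sentinel, so the simulator
        // reserved no disk space for the incoming restore.
        long before = expectedShardSize("[idx][0]", UNAVAILABLE_EXPECTED_SHARD_SIZE, knownSizes);

        // After the fix: the ShardRouting's expected size is the fallback, so
        // the simulated disk usage accounts for the restore.
        long after = expectedShardSize("[idx][0]", routingExpectedSize, knownSizes);

        System.out.println(before); // -1
        System.out.println(after);  // 5000000000
    }
}
```

With the sentinel fallback, every concurrently-initializing restored shard appears to occupy no space, which is how the allocator could pack more shards onto a node than its watermarks should allow.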

@DaveCTurner DaveCTurner added >bug :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) auto-backport Automatically create backport pull requests when merged v8.18.1 v8.19.0 v9.0.1 v9.1.0 labels Apr 14, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Apr 14, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine
Collaborator

Hi @DaveCTurner, I've created a changelog YAML for you.

@@ -230,6 +224,110 @@ public void testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShard
assertBusyWithDiskUsageRefresh(dataNode0Id, indexName, contains(in(shardSizes.getShardIdsWithSizeSmallerOrEqual(usableSpace))));
}

public void testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleRestores() {
Contributor Author


testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards was flaky because of this bug, but only failing once every few hundred iterations. This test fails more reliably for the same reason, although still not all that reliably (after around 20-30 iterations on my laptop). I could make it exercise the exact path that hits the bug every time, but it'd be very specific to this one bug and I'd rather have something a little more general to look out for related bugs too.

Contributor


So testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards does a little more disk usage checking as well as the shard assignment. This new test checks that shards are assigned correctly, not so much disk usage.

They feel a little redundant: might we add a couple more touches to the new one and delete the old?

Contributor Author


Yeah I was sort of inclined to keep them both but you're right, we're not really testing anything different in the old test.

Contributor


Great, thanks for updating

@@ -92,7 +92,7 @@ public void simulateShardStarted(ShardRouting shard) {
         var project = allocation.metadata().projectFor(shard.index());
         var size = getExpectedShardSize(
             shard,
-            UNAVAILABLE_EXPECTED_SHARD_SIZE,
+            shard.getExpectedShardSize(),
Contributor Author


One-line fix \o/

Contributor

@JeremyDahlgren JeremyDahlgren left a comment


LGTM

DaveCTurner and others added 2 commits April 14, 2025 22:30
…routing/allocation/decider/DiskThresholdDeciderIT.java

Co-authored-by: Jeremy Dahlgren <[email protected]>
…tRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards
Contributor

@DiannaHohensee DiannaHohensee left a comment


I think the changes look good, I'm good with shipping it as is. Though I do wonder if we could delete the old test in favor of the new -- per comment.

// set up a listener that explicitly forbids more than one shard to be assigned to the tiny node
final String dataNodeId = internalCluster().getInstance(NodeEnvironment.class, dataNodeName).nodeId();
final var allShardsActiveListener = ClusterServiceUtils.addTemporaryStateListener(cs -> {
assertThat(cs.getRoutingNodes().toString(), cs.getRoutingNodes().node(dataNodeId).size(), lessThanOrEqualTo(1));
Contributor


Oooph, RoutingNode#size() is an unhelpful method name 😓

Contributor Author


Naming things is hard, but yeah this is not good. At least it has Javadocs :)

clusterAdmin().prepareRestoreSnapshot(TEST_REQUEST_TIMEOUT, "repo", "snap")
.setWaitForCompletion(true)
.setRenamePattern(indexName)
.setRenameReplacement(indexName + "-copy")
Contributor


Huh, is this purely a test-only feature? Doesn't look like it's used anyplace else.

(not actionable, I'm just surprised)

Contributor Author


We use the feature in production, see here:

We just don't use the RestoreSnapshotRequestBuilder to build a RestoreSnapshotRequest anywhere, instead building the request directly since it's all mutable anyway. Not a pattern I like, but one that is going to take a long time to completely eliminate.

Contributor


Ohh 🤔 Got it 👍


…tRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards
@DaveCTurner DaveCTurner added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Apr 17, 2025
…tRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards
…tRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards
@elasticsearchmachine elasticsearchmachine merged commit a5f935a into elastic:main Apr 22, 2025
17 checks passed
@DaveCTurner DaveCTurner deleted the 2025/04/11/105331-DiskThresholdDeciderIT-testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards branch April 22, 2025 17:08
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Apr 22, 2025
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Apr 22, 2025
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Apr 22, 2025
@elasticsearchmachine
Collaborator

💚 Backport successful

Branches: 8.18, 8.x, 9.0 (all successful)

elasticsearchmachine pushed a commit that referenced this pull request Apr 23, 2025
* Fix shard size of initializing restored shard (#126783)


* Backport utils from 4009599
elasticsearchmachine pushed a commit that referenced this pull request Apr 23, 2025
* Fix shard size of initializing restored shard (#126783)


* Missing throws
elasticsearchmachine pushed a commit that referenced this pull request Apr 23, 2025
* Fix shard size of initializing restored shard (#126783)


* Backport utils from 4009599

* Missing throws

Successfully merging this pull request may close these issues.

[CI] DiskThresholdDeciderIT testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards failing
4 participants