[CI] DiskThresholdDeciderIT testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards failing #105331
Pinging @elastic/es-distributed (Team:Distributed)
Oh, interesting, according to:
it was stuck waiting at this line: Lines 202 to 204 in a7f2e2d
Output also contains:
so the actual failure happened at: Lines 173 to 178 in a7f2e2d
According to the recent failure info, the test created the following shards:
During the first allocation round, only 5 shards had a computed balance and were allocated accordingly:
Notice that the smallest one is ignored, possibly because its size had not yet been computed from the repository. Later the shard balance was computed as follows:
This suggests that the computation failed to take into account another non-empty shard (still?) initializing on the same node.
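As an illustration of that suspicion, here is a self-contained sketch in plain Java (the map, the shard labels, and the sizes are invented for the example; this is not the actual allocator code): a restored shard whose size has not yet been fetched from the repository cannot be weighed, so an allocation round simply skips it until the size arrives.

```java
import java.util.List;
import java.util.Map;

public class SnapshotSizeFetchSketch {
    public static void main(String[] args) {
        // Sizes fetched (asynchronously in the real system) from the snapshot repository.
        // The smallest shard's size has not arrived yet.
        Map<String, Long> fetchedSnapshotSizes = Map.of(
            "[idx][0]", 80L, "[idx][1]", 70L, "[idx][2]", 65L, "[idx][3]", 60L, "[idx][4]", 55L
        );
        List<String> shardsToRestore = List.of(
            "[idx][0]", "[idx][1]", "[idx][2]", "[idx][3]", "[idx][4]", "[idx][5]"
        );

        for (String shard : shardsToRestore) {
            Long size = fetchedSnapshotSizes.get(shard);
            if (size == null) {
                // No size yet: the shard cannot be weighed, so this round skips it.
                System.out.println(shard + " -> ignored (size not yet fetched from repository)");
            } else {
                System.out.println(shard + " -> allocated using size " + size);
            }
        }
    }
}
```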
I wonder if in this case the shard had started, but the corresponding information was not yet available in ClusterInfo, so the second allocation round was happening as if that shard's size were unknown.
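To make that concrete, here is a minimal sketch (plain Java; the `publishedShardSizes` map, the labels, and the fallback behaviour are assumptions for illustration, not the real `ClusterInfo` API) of how falling back to 0 for a shard that is missing from the cluster info under-counts the node's usage, while falling back to the shard's expected size does not.

```java
import java.util.Map;

public class StaleClusterInfoSketch {
    public static void main(String[] args) {
        // Shard sizes as last published in the (possibly stale) cluster info for one node.
        Map<String, Long> publishedShardSizes = Map.of("[idx][0]", 50L, "[idx][1]", 40L);
        long publishedUsage = publishedShardSizes.values().stream().mapToLong(Long::longValue).sum();

        // A shard that has just started on the node but is not yet reflected in the cluster info.
        String justStartedShard = "[idx][2]";
        long expectedShardSize = 45L; // size known from the snapshot it is being restored from

        // Falling back to 0 for the missing entry under-counts the node's usage...
        long usageAssumingZero = publishedUsage + publishedShardSizes.getOrDefault(justStartedShard, 0L);
        // ...whereas falling back to the expected size keeps the estimate honest.
        long usageWithExpected = publishedUsage + publishedShardSizes.getOrDefault(justStartedShard, expectedShardSize);

        System.out.println("usage assuming 0 for the started shard: " + usageAssumingZero);  // 90
        System.out.println("usage using the expected shard size:    " + usageWithExpected);  // 135
    }
}
```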
I checked the latest failure logs and they contain:
According to the logs from the latest failure:
Allocation of all shards happens in two rounds.
Nothing out of the ordinary here. Round 2:
The round-2 balance is calculated incorrectly because it is based on incorrect cluster info:
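For reference, a toy version of the watermark check involved (the 90% figure and the byte values are invented; the real decision lives in `DiskThresholdDecider` and is considerably more involved): when the cluster info under-reports a node's used bytes, an allocation that should be rejected can pass the check.

```java
public class WatermarkSketch {
    /** Allow the allocation only if the node stays below the high watermark afterwards. */
    static boolean canAllocate(long usedBytes, long shardSize, long totalBytes, double highWatermark) {
        return (double) (usedBytes + shardSize) / totalBytes < highWatermark;
    }

    public static void main(String[] args) {
        long totalBytes = 1000;
        long shardSize = 300;

        // With the bytes that are really on the node, the allocation is rejected...
        System.out.println(canAllocate(700, shardSize, totalBytes, 0.90)); // false

        // ...but if the cluster info under-counts usage (an initializing shard treated as
        // 0 bytes, say), the same allocation looks safe and is wrongly permitted.
        System.out.println(canAllocate(400, shardSize, totalBytes, 0.90)); // true
    }
}
```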
I have not found any particular reason why the balance was calculated in two rounds here; however, I consider this to be a valid scenario.
The build scan failure above shows the following in its logs:
My debug logging...
The matcher has what we want, and the shard is on the node that we want. Maybe this 🤔
Okay, so I fixed the Matcher problem. Replace this line with the following:
But now I've uncovered something else, either a test bug or a code bug. These are my debug logs, limited to this line of code where the snapshot restore occurs and allocates shards out to the nodes -- note that rebalancing is otherwise disabled, so we won't rebalance after the initial allocation, IIUC. A potentially interesting bit, before allocation occurs: somehow allocation thinks the available space on the node is zero?
Again later:
The disk usage stays at 100% until the end (last occurrence of the log message).
These are the logs written while these lines of code, which change the disk size, executed. It looks like
The interesting bit is
We changed the total_bytes, but the used_bytes match, even though there are no shards allocated.
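A self-contained sketch of the arithmetic suspected here (all values are invented, and the clamping behaviour is an assumption about how a restricted file store might report free space, not a statement about the actual test implementation): if only the reported total is shrunk while the used bytes still reflect the real filesystem, free space bottoms out at zero and the node reads as 100% full.

```java
public class DiskUsageSketch {
    public static void main(String[] args) {
        long realUsedOnPath = 200L * 1024 * 1024;  // bytes already on the data path (translog, state, ...)
        long overriddenTotal = 100L * 1024 * 1024; // much smaller total injected by the test

        // If only the total is overridden while used bytes still reflect the real filesystem,
        // the reported free space bottoms out at zero and the node appears 100% full,
        // even though no shards of the restored index are allocated to it yet.
        long reportedFree = Math.max(0, overriddenTotal - realUsedOnPath);
        long reportedUsed = overriddenTotal - reportedFree;

        System.out.printf("total=%d free=%d used=%d (%.0f%% full)%n",
            overriddenTotal, reportedFree, reportedUsed,
            100.0 * reportedUsed / overriddenTotal);
    }
}
```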
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
For shards being restored from a snapshot we use `SnapshotShardSizeInfo` to track their sizes while they're unassigned, and then use `ShardRouting#expectedShardSize` when they start to recover. However we were incorrectly ignoring the `ShardRouting#expectedShardSize` value when accounting for the movements of shards in the `ClusterInfoSimulator`, which would sometimes cause us to assign more shards to a node than its disk space should have allowed. Closes #105331
* Fix shard size of initializing restored shard (#126783)
For shards being restored from a snapshot we use `SnapshotShardSizeInfo` to track their sizes while they're unassigned, and then use `ShardRouting#expectedShardSize` when they start to recover. However we were incorrectly ignoring the `ShardRouting#expectedShardSize` value when accounting for the movements of shards in the `ClusterInfoSimulator`, which would sometimes cause us to assign more shards to a node than its disk space should have allowed. Closes #105331
* Backport utils from 4009599
* Missing throws
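For readers landing here from the linked fix, a minimal sketch of the size-resolution order the commit message describes (the `Shard` record, the sentinel constant, and `resolveSize` below are illustrative stand-ins, not the actual `ClusterInfoSimulator` code): prefer a size already published in the cluster info, then the expected shard size recorded on the routing entry when a restored shard starts recovering, and only report "unknown" when neither is available.

```java
import java.util.Map;
import java.util.OptionalLong;

// Simplified stand-in for a routing entry; the real ShardRouting carries far more state.
// UNAVAILABLE mirrors the idea of an "expected size not set" sentinel.
record Shard(String id, long expectedShardSize) {
    static final long UNAVAILABLE = -1;
}

public class ShardSizeResolutionSketch {

    // Resolve the size to account for when simulating the start of an initializing shard:
    // use the size already published in the cluster info if present, otherwise the expected
    // size recorded on the routing entry (set from the snapshot for restored shards).
    static OptionalLong resolveSize(Map<String, Long> clusterInfoShardSizes, Shard shard) {
        Long known = clusterInfoShardSizes.get(shard.id());
        if (known != null) {
            return OptionalLong.of(known);
        }
        if (shard.expectedShardSize() != Shard.UNAVAILABLE) {
            return OptionalLong.of(shard.expectedShardSize());
        }
        return OptionalLong.empty(); // genuinely unknown: better than silently treating it as 0
    }

    public static void main(String[] args) {
        Map<String, Long> clusterInfo = Map.of("[idx][0]", 100L);
        System.out.println(resolveSize(clusterInfo, new Shard("[idx][0]", Shard.UNAVAILABLE))); // 100, from cluster info
        System.out.println(resolveSize(clusterInfo, new Shard("[idx][1]", 45L)));               // 45, expected size
        System.out.println(resolveSize(clusterInfo, new Shard("[idx][2]", Shard.UNAVAILABLE))); // empty
    }
}
```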
Build scan:
https://gradle-enterprise.elastic.co/s/kxdqmnytyenuq/tests/:server:internalClusterTest/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT/testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards
Reproduction line:
Applicable branches:
main
Reproduces locally?:
No
Failure history:
Failure dashboard for org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT#testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards
Failure excerpt: