[CI] DiskThresholdDeciderIT testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards failing #105331

Closed
iverase opened this issue Feb 9, 2024 · 14 comments · Fixed by #126783
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs medium-risk An open issue or test failure that is a medium risk to future releases Team:Distributed Coordination Meta label for Distributed Coordination team >test-failure Triaged test failures from CI

Comments

@iverase
Contributor

iverase commented Feb 9, 2024

Build scan:
https://gradle-enterprise.elastic.co/s/kxdqmnytyenuq/tests/:server:internalClusterTest/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT/testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards

Reproduction line:

./gradlew ':server:internalClusterTest' --tests "org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards" -Dtests.seed=ED5C019BB52E4B53 -Dtests.locale=is -Dtests.timezone=America/Anchorage -Druntime.java=21

Applicable branches:
main

Reproduces locally?:
No

Failure history:
Failure dashboard for org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT#testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards

Failure excerpt:

java.lang.Exception: Test abandoned because suite timeout was reached.

  at __randomizedtesting.SeedInfo.seed([ED5C019BB52E4B53]:0)

@iverase iverase added :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI labels Feb 9, 2024
@elasticsearchmachine elasticsearchmachine added blocker Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. labels Feb 9, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@idegtiarenko
Contributor

Oh, interesting, according to:

"TEST-DiskThresholdDeciderIT.testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards-seed#[ED5C019BB52E4B53]" ID=2402 WAITING on org.elasticsearch.action.support.PlainActionFuture$Sync@dcadc57
	at [email protected]/jdk.internal.misc.Unsafe.park(Native Method)
	- waiting on org.elasticsearch.action.support.PlainActionFuture$Sync@dcadc57
	at [email protected]/java.util.concurrent.locks.LockSupport.park(LockSupport.java:221)
	at [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)
	at [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1099)
	at app//org.elasticsearch.action.support.PlainActionFuture$Sync.get(PlainActionFuture.java:273)
	at app//org.elasticsearch.action.support.PlainActionFuture.get(PlainActionFuture.java:93)
	at app//org.elasticsearch.client.internal.support.AbstractClient$RefCountedFuture.get(AbstractClient.java:1534)
	at app//org.elasticsearch.client.internal.support.AbstractClient$RefCountedFuture.get(AbstractClient.java:1514)
	at app//org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:45)
	at app//org.elasticsearch.action.support.PlainActionFuture.actionGet(PlainActionFuture.java:152)
	at app//org.elasticsearch.action.ActionRequestBuilder.get(ActionRequestBuilder.java:43)
	at app//org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards(DiskThresholdDeciderIT.java:204)
	at [email protected]/java.lang.invoke.LambdaForm$DMH/0x00007f2a3c300000.invokeVirtual(LambdaForm$DMH)
	at [email protected]/java.lang.invoke.LambdaForm$MH/0x00007f2a3c0a0800.invoke(LambdaForm$MH)

it was stuck waiting at this line:

final RestoreSnapshotResponse restoreSnapshotResponse = clusterAdmin().prepareRestoreSnapshot("repo", "snap")
    .setWaitForCompletion(true)
    .get();

@idegtiarenko
Contributor

Output also contains:

WARNING: Uncaught exception in thread: Thread[#2574,elasticsearch[node_t0][clusterApplierService#updateTask][T#1],5,TGRP-DiskThresholdDeciderIT]
java.lang.AssertionError: 
Expected: <0>
     but: was <1>
	at __randomizedtesting.SeedInfo.seed([ED5C019BB52E4B53]:0)
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6)
	at org.elasticsearch.test.ESTestCase.assertThat(ESTestCase.java:2119)
	at org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.lambda$testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards$2(DiskThresholdDeciderIT.java:176)
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListener(ClusterApplierService.java:560)
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListeners(ClusterApplierService.java:546)
	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:505)
	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:429)
	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:154)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:917)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:217)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:183)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1583)

so the actual failure happened at:

internalCluster().getCurrentMasterNodeInstance(ClusterService.class).addListener(event -> {
    ClusterInfoServiceUtils.refresh(clusterInfoService);
    if (allowRelocations.get() == false) {
        assertThat(numberOfShardsWithState(event.state().getRoutingNodes(), ShardRoutingState.RELOCATING), equalTo(0));
    }
});
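
Presumably this is also why the failure surfaced as a suite timeout rather than an ordinary test failure: the AssertionError escaped on the cluster-applier thread (hence the uncaught-exception warning), so the wait-for-completion future in the restore call above never completed and the test sat in .get() until the suite timed out.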

@arteam arteam self-assigned this Feb 9, 2024
DaveCTurner added a commit that referenced this issue Feb 12, 2024
@arteam arteam assigned idegtiarenko and unassigned arteam Feb 13, 2024
@arteam arteam added medium-risk An open issue or test failure that is a medium risk to future releases and removed blocker labels Feb 15, 2024
@idegtiarenko
Contributor

idegtiarenko commented Mar 4, 2024

According to the recent failure info, the test created the following shards:

ShardSize[shardId=[btydusdduwrc][3], size=16919]
ShardSize[shardId=[btydusdduwrc][0], size=17886]
ShardSize[shardId=[btydusdduwrc][5], size=17972]
ShardSize[shardId=[btydusdduwrc][1], size=18195]
ShardSize[shardId=[btydusdduwrc][4], size=18198]
ShardSize[shardId=[btydusdduwrc][2], size=18472]

During the first allocation round, only 5 shards had a computed balance and were allocated accordingly:

[2024-03-03T17:22:30,358][TRACE][o.e.c.r.a.a.DesiredBalanceShardsAllocator] [node_t0] Reconciling desired balance: DesiredBalance[lastConvergedIndex=13, assignments={
[btydusdduwrc][0]=ShardAssignment[nodeIds=[CTSLnDyTQj2xDmatjmYTtQ], total=1, unassigned=0, ignored=0],
[btydusdduwrc][1]=ShardAssignment[nodeIds=[suT6FRFCTBiMDfiBCAU_nQ], total=1, unassigned=0, ignored=0],
[btydusdduwrc][4]=ShardAssignment[nodeIds=[suT6FRFCTBiMDfiBCAU_nQ], total=1, unassigned=0, ignored=0],
[btydusdduwrc][5]=ShardAssignment[nodeIds=[suT6FRFCTBiMDfiBCAU_nQ], total=1, unassigned=0, ignored=0],
[btydusdduwrc][2]=ShardAssignment[nodeIds=[suT6FRFCTBiMDfiBCAU_nQ], total=1, unassigned=0, ignored=0],
[btydusdduwrc][3]=ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1]
}]

Notice that the smallest one is ignored, possibly because its size had not yet been fetched from the repository.

Later the shard balance was computed as follows:

[2024-03-03T17:22:30,400][TRACE][o.e.c.r.a.a.DesiredBalanceShardsAllocator] [node_t0] Desired balance updated: DesiredBalance[lastConvergedIndex=14, assignments={[btydusdduwrc][0]=ShardAssignment[nodeIds=[CTSLnDyTQj2xDmatjmYTtQ], total=1, unassigned=0, ignored=0], [btydusdduwrc][1]=ShardAssignment[nodeIds=[suT6FRFCTBiMDfiBCAU_nQ], total=1, unassigned=0, ignored=0], [btydusdduwrc][4]=ShardAssignment[nodeIds=[suT6FRFCTBiMDfiBCAU_nQ], total=1, unassigned=0, ignored=0], [btydusdduwrc][5]=ShardAssignment[nodeIds=[suT6FRFCTBiMDfiBCAU_nQ], total=1, unassigned=0, ignored=0], [btydusdduwrc][2]=ShardAssignment[nodeIds=[suT6FRFCTBiMDfiBCAU_nQ], total=1, unassigned=0, ignored=0], [btydusdduwrc][3]=ShardAssignment[nodeIds=[CTSLnDyTQj2xDmatjmYTtQ], total=1, unassigned=0, ignored=0]}]. Diff: 
[btydusdduwrc][3]: ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1] -> ShardAssignment[nodeIds=[CTSLnDyTQj2xDmatjmYTtQ], total=1, unassigned=0, ignored=0]

This suggests that the computation failed to take into account another non-empty shard (still?) initializing on the same node.
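
For context, here is a minimal, illustrative sketch of the accounting this points at (not the actual DiskThresholdDecider/DesiredBalanceComputer code; the names are made up for the example): a correct headroom check has to subtract the expected sizes of shards that are still initializing on the node, not only the sizes already reported in ClusterInfo.

// Illustrative sketch only; assumes the rule "projected free bytes must stay above the high watermark".
static long projectedFreeBytes(long reportedFreeBytes, long[] initializingShardSizes, long candidateShardSize) {
    long reserved = 0;
    for (long size : initializingShardSizes) { // expected sizes of shards still recovering onto the node
        reserved += size;
    }
    return reportedFreeBytes - reserved - candidateShardSize;
}
// A placement is only safe if projectedFreeBytes(...) >= highWatermarkFreeBytes; dropping the
// "reserved" term is effectively the behaviour the trace above suggests.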

@idegtiarenko
Contributor

I wonder if in this case the shard had started, but the corresponding information was not yet available in ClusterInfo, so the second allocation round ran as if CTSLnDyTQj2xDmatjmYTtQ were empty.

@idegtiarenko
Contributor

I checked the latest failure logs and they contain:

[2024-03-11T18:25:16,539][TRACE][o.e.c.r.a.a.DesiredBalanceComputer] [node_t0] Recomputing desired balance for [15]: DesiredBalance[lastConvergedIndex=14, assignments={[wggatuinx][2]=ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1], [wggatuinx][3]=ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1], [wggatuinx][4]=ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1], [wggatuinx][5]=ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1], [wggatuinx][0]=ShardAssignment[nodeIds=[CwlFGBJYSXywcA9Qqt25xg], total=1, unassigned=0, ignored=0], [wggatuinx][1]=ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1]}], routing_nodes:
-----node_id[CwlFGBJYSXywcA9Qqt25xg][V]
--------[wggatuinx][0], node[CwlFGBJYSXywcA9Qqt25xg], [P], recovery_source[snapshot recovery [Us0WnflFQ-eNYAoK_hOaBQ] from repo:snap/gLsztBgmR1G2j8MufCwrSQ], s[INITIALIZING], a[id=lHmaALapQ1qElJRtpQAgug], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-03-11T12:25:16.245Z], delayed=false, details[restore_source[repo/snap]], allocation_status[no_attempt]], failed_attempts[0], expected_shard_size[11386]
-----node_id[B-MnLxupT2qkXpWstJb6mA][V]
---- unassigned
--------[wggatuinx][1], node[null], [P], recovery_source[snapshot recovery [Us0WnflFQ-eNYAoK_hOaBQ] from repo:snap/gLsztBgmR1G2j8MufCwrSQ], s[UNASSIGNED], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-03-11T12:25:16.245Z], delayed=false, details[restore_source[repo/snap]], allocation_status[no_attempt]], failed_attempts[0]
--------[wggatuinx][2], node[null], [P], recovery_source[snapshot recovery [Us0WnflFQ-eNYAoK_hOaBQ] from repo:snap/gLsztBgmR1G2j8MufCwrSQ], s[UNASSIGNED], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-03-11T12:25:16.245Z], delayed=false, details[restore_source[repo/snap]], allocation_status[no_attempt]], failed_attempts[0]
--------[wggatuinx][3], node[null], [P], recovery_source[snapshot recovery [Us0WnflFQ-eNYAoK_hOaBQ] from repo:snap/gLsztBgmR1G2j8MufCwrSQ], s[UNASSIGNED], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-03-11T12:25:16.245Z], delayed=false, details[restore_source[repo/snap]], allocation_status[no_attempt]], failed_attempts[0]
--------[wggatuinx][4], node[null], [P], recovery_source[snapshot recovery [Us0WnflFQ-eNYAoK_hOaBQ] from repo:snap/gLsztBgmR1G2j8MufCwrSQ], s[UNASSIGNED], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-03-11T12:25:16.245Z], delayed=false, details[restore_source[repo/snap]], allocation_status[no_attempt]], failed_attempts[0]
--------[wggatuinx][5], node[null], [P], recovery_source[snapshot recovery [Us0WnflFQ-eNYAoK_hOaBQ] from repo:snap/gLsztBgmR1G2j8MufCwrSQ], s[UNASSIGNED], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-03-11T12:25:16.245Z], delayed=false, details[restore_source[repo/snap]], allocation_status[no_attempt]], failed_attempts[0]
, 


{
  "nodes" : {
    "CwlFGBJYSXywcA9Qqt25xg" : {
      "node_name" : "node_t2",
      "least_available" : {
        "path" : "/dev/shm/bk/bk-agent-prod-gcp-1710158680974142795/elastic/elasticsearch-periodic/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_7FA2BD7AB6A11F7E-001/tempDir-003/node-2",
        "total_bytes" : 21632,
        "used_bytes" : 0,
        "free_bytes" : 21632,
        "free_disk_percent" : 100.0,
        "used_disk_percent" : 0.0
      },
      "most_available" : {
        "path" : "/dev/shm/bk/bk-agent-prod-gcp-1710158680974142795/elastic/elasticsearch-periodic/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_7FA2BD7AB6A11F7E-001/tempDir-003/node-2",
        "total_bytes" : 21632,
        "used_bytes" : 0,
        "free_bytes" : 21632,
        "free_disk_percent" : 100.0,
        "used_disk_percent" : 0.0
      }
    },
    "B-MnLxupT2qkXpWstJb6mA" : {
      "node_name" : "node_t1",
      "least_available" : {
        "path" : "/dev/shm/bk/bk-agent-prod-gcp-1710158680974142795/elastic/elasticsearch-periodic/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_7FA2BD7AB6A11F7E-001/tempDir-003/node-1",
        "total_bytes" : 63302156288,
        "used_bytes" : 9738104832,
        "free_bytes" : 53564051456,
        "free_disk_percent" : 84.6,
        "used_disk_percent" : 15.4
      },
      "most_available" : {
        "path" : "/dev/shm/bk/bk-agent-prod-gcp-1710158680974142795/elastic/elasticsearch-periodic/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_7FA2BD7AB6A11F7E-001/tempDir-003/node-1",
        "total_bytes" : 63302156288,
        "used_bytes" : 9738104832,
        "free_bytes" : 53564051456,
        "free_disk_percent" : 84.6,
        "used_disk_percent" : 15.4
      }
    }
  },
  "shard_sizes" : {
    "[wggatuinx][0][p]_bytes" : 11386
  },
  "shard_data_set_sizes" : {
    "[wggatuinx][0]_bytes" : 11386
  },
  "shard_paths" : {
    "NodeAndShard[nodeId=CwlFGBJYSXywcA9Qqt25xg, shardId=[wggatuinx][0]]" : "/dev/shm/bk/bk-agent-prod-gcp-1710158680974142795/elastic/elasticsearch-periodic/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_7FA2BD7AB6A11F7E-001/tempDir-003/node-2"
  },
  "reserved_sizes" : [
    {
      "node_id" : "CwlFGBJYSXywcA9Qqt25xg",
      "path" : "/dev/shm/bk/bk-agent-prod-gcp-1710158680974142795/elastic/elasticsearch-periodic/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_7FA2BD7AB6A11F7E-001/tempDir-003/node-2",
      "total" : 0,
      "shards" : [
        "[wggatuinx][0]"
      ]
    }
  ]
}, 
^^^ cluster info is complete and correct

SnapshotShardSizeInfo{snapshotShardSizes=[[snapshot=repo:snap/gLsztBgmR1G2j8MufCwrSQ, index=[wggatuinx/neUKsrfBR2iP0GAWlUpV2g], shard=[wggatuinx][2]]=>11469, [snapshot=repo:snap/gLsztBgmR1G2j8MufCwrSQ, index=[wggatuinx/neUKsrfBR2iP0GAWlUpV2g], shard=[wggatuinx][3]]=>11869, [snapshot=repo:snap/gLsztBgmR1G2j8MufCwrSQ, index=[wggatuinx/neUKsrfBR2iP0GAWlUpV2g], shard=[wggatuinx][4]]=>11392, [snapshot=repo:snap/gLsztBgmR1G2j8MufCwrSQ, index=[wggatuinx/neUKsrfBR2iP0GAWlUpV2g], shard=[wggatuinx][5]]=>12075, [snapshot=repo:snap/gLsztBgmR1G2j8MufCwrSQ, index=[wggatuinx/neUKsrfBR2iP0GAWlUpV2g], shard=[wggatuinx][1]]=>12389]}
^^^ all snapshot sizes are known and available. They were also available during the prior allocation round

[2024-03-11T18:25:16,540][TRACE][o.e.c.r.a.a.DesiredBalanceComputer] [node_t0] [wggatuinx][0] marked shard as started (routing: [wggatuinx][0], node[CwlFGBJYSXywcA9Qqt25xg], [P], recovery_source[snapshot recovery [Us0WnflFQ-eNYAoK_hOaBQ] from repo:snap/gLsztBgmR1G2j8MufCwrSQ], s[INITIALIZING], a[id=lHmaALapQ1qElJRtpQAgug], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-03-11T12:25:16.245Z], delayed=false, details[restore_source[repo/snap]], allocation_status[no_attempt]], failed_attempts[0], expected_shard_size[11386])
^^^ allocated correctly in the prior round

[2024-03-11T18:25:16,540][TRACE][o.e.c.r.a.a.DesiredBalanceComputer] [node_t0] running delegate allocator
[2024-03-11T18:25:16,542][TRACE][o.e.c.r.a.a.DesiredBalanceComputer] [node_t0] [wggatuinx][4] marked shard as started (routing: [wggatuinx][4], node[CwlFGBJYSXywcA9Qqt25xg], [P], recovery_source[snapshot recovery [Us0WnflFQ-eNYAoK_hOaBQ] from repo:snap/gLsztBgmR1G2j8MufCwrSQ], s[INITIALIZING], a[id=-gFVMSNgRoeo1Fv53EIBhQ], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-03-11T12:25:16.245Z], delayed=false, details[restore_source[repo/snap]], allocation_status[no_attempt]], failed_attempts[0], expected_shard_size[11392])
^^^ allocated incorrectly by the balanced allocator, even though all of the information above was available

[2024-03-11T18:25:16,542][TRACE][o.e.c.r.a.a.DesiredBalanceComputer] [node_t0] [wggatuinx][1] marked shard as started (routing: [wggatuinx][1], node[B-MnLxupT2qkXpWstJb6mA], [P], recovery_source[snapshot recovery [Us0WnflFQ-eNYAoK_hOaBQ] from repo:snap/gLsztBgmR1G2j8MufCwrSQ], s[INITIALIZING], a[id=Ehs2RXxxTaaYu-puiIjWOg], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-03-11T12:25:16.245Z], delayed=false, details[restore_source[repo/snap]], allocation_status[no_attempt]], failed_attempts[0], expected_shard_size[12389])
[2024-03-11T18:25:16,543][TRACE][o.e.c.r.a.a.DesiredBalanceComputer] [node_t0] [wggatuinx][2] marked shard as started (routing: [wggatuinx][2], node[B-MnLxupT2qkXpWstJb6mA], [P], recovery_source[snapshot recovery [Us0WnflFQ-eNYAoK_hOaBQ] from repo:snap/gLsztBgmR1G2j8MufCwrSQ], s[INITIALIZING], a[id=v3Yh-Ie1SEWjezhmdIiwfg], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-03-11T12:25:16.245Z], delayed=false, details[restore_source[repo/snap]], allocation_status[no_attempt]], failed_attempts[0], expected_shard_size[11469])
[2024-03-11T18:25:16,543][TRACE][o.e.c.r.a.a.DesiredBalanceComputer] [node_t0] [wggatuinx][3] marked shard as started (routing: [wggatuinx][3], node[B-MnLxupT2qkXpWstJb6mA], [P], recovery_source[snapshot recovery [Us0WnflFQ-eNYAoK_hOaBQ] from repo:snap/gLsztBgmR1G2j8MufCwrSQ], s[INITIALIZING], a[id=wz8am3sZSGm7LteXF6GN2Q], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-03-11T12:25:16.245Z], delayed=false, details[restore_source[repo/snap]], allocation_status[no_attempt]], failed_attempts[0], expected_shard_size[11869])
[2024-03-11T18:25:16,543][TRACE][o.e.c.r.a.a.DesiredBalanceComputer] [node_t0] [wggatuinx][5] marked shard as started (routing: [wggatuinx][5], node[B-MnLxupT2qkXpWstJb6mA], [P], recovery_source[snapshot recovery [Us0WnflFQ-eNYAoK_hOaBQ] from repo:snap/gLsztBgmR1G2j8MufCwrSQ], s[INITIALIZING], a[id=BJ7H_lu6QKyV8WvVZsaHRA], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-03-11T12:25:16.245Z], delayed=false, details[restore_source[repo/snap]], allocation_status[no_attempt]], failed_attempts[0], expected_shard_size[12075])
[2024-03-11T18:25:16,543][TRACE][o.e.c.r.a.a.DesiredBalanceComputer] [node_t0] Desired balance computation for [15] is still not converged after [0s] and [1] iterations
[2024-03-11T18:25:16,543][TRACE][o.e.c.r.a.a.DesiredBalanceComputer] [node_t0] running delegate allocator
[2024-03-11T18:25:16,549][DEBUG][o.e.c.r.a.a.DesiredBalanceComputer] [node_t0] Desired balance computation for [15] converged after [0s] and [2] iterations
[2024-03-11T18:25:16,550][TRACE][o.e.c.r.a.a.DesiredBalanceShardsAllocator] [node_t0] Desired balance updated: DesiredBalance[lastConvergedIndex=15, assignments={[wggatuinx][2]=ShardAssignment[nodeIds=[B-MnLxupT2qkXpWstJb6mA], total=1, unassigned=0, ignored=0], [wggatuinx][3]=ShardAssignment[nodeIds=[B-MnLxupT2qkXpWstJb6mA], total=1, unassigned=0, ignored=0], [wggatuinx][4]=ShardAssignment[nodeIds=[CwlFGBJYSXywcA9Qqt25xg], total=1, unassigned=0, ignored=0], [wggatuinx][5]=ShardAssignment[nodeIds=[B-MnLxupT2qkXpWstJb6mA], total=1, unassigned=0, ignored=0], [wggatuinx][0]=ShardAssignment[nodeIds=[CwlFGBJYSXywcA9Qqt25xg], total=1, unassigned=0, ignored=0], [wggatuinx][1]=ShardAssignment[nodeIds=[B-MnLxupT2qkXpWstJb6mA], total=1, unassigned=0, ignored=0]}]. Diff: 
[wggatuinx][2]: ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1] -> ShardAssignment[nodeIds=[B-MnLxupT2qkXpWstJb6mA], total=1, unassigned=0, ignored=0]
[wggatuinx][3]: ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1] -> ShardAssignment[nodeIds=[B-MnLxupT2qkXpWstJb6mA], total=1, unassigned=0, ignored=0]
[wggatuinx][4]: ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1] -> ShardAssignment[nodeIds=[CwlFGBJYSXywcA9Qqt25xg], total=1, unassigned=0, ignored=0]
[wggatuinx][5]: ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1] -> ShardAssignment[nodeIds=[B-MnLxupT2qkXpWstJb6mA], total=1, unassigned=0, ignored=0]
[wggatuinx][1]: ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1] -> ShardAssignment[nodeIds=[B-MnLxupT2qkXpWstJb6mA], total=1, unassigned=0, ignored=0]

@idegtiarenko
Contributor

According to the logs from the latest failure:

Shard sizes
shard=[zoxdkcqpuouh][0]=>12166
shard=[zoxdkcqpuouh][3]=>12451
shard=[zoxdkcqpuouh][2]=>12535
shard=[zoxdkcqpuouh][4]=>12594
shard=[zoxdkcqpuouh][5]=>13006

Nodes
"r4ludfGvTh2wZ7UGP-SjWA" / "node_t1" <-- big node
"zIF6uDFXRYKPVDXFSMTRQQ" / "node_t2" <-- small node

Allocation of all shards happens in 2 rounds.
Round 1:

routing_nodes:
-----node_id[zIF6uDFXRYKPVDXFSMTRQQ][V]
-----node_id[r4ludfGvTh2wZ7UGP-SjWA][V]
---- unassigned
--------[zoxdkcqpuouh][0], node[null], [P], recovery_source[snapshot recovery [ArTTnq9aTsel6jxW7feZ6Q] from repo:snap/qM6kXprETqqT0R_Qb0Wx-g], s[UNASSIGNED], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-04-06T20:17:35.294Z], delayed=false, details[restore_source[repo/snap]], allocation_status[fetching_shard_data]], failed_attempts[0]
--------[zoxdkcqpuouh][1], node[null], [P], recovery_source[snapshot recovery [ArTTnq9aTsel6jxW7feZ6Q] from repo:snap/qM6kXprETqqT0R_Qb0Wx-g], s[UNASSIGNED], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-04-06T20:17:35.294Z], delayed=false, details[restore_source[repo/snap]], allocation_status[fetching_shard_data]], failed_attempts[0]
--------[zoxdkcqpuouh][2], node[null], [P], recovery_source[snapshot recovery [ArTTnq9aTsel6jxW7feZ6Q] from repo:snap/qM6kXprETqqT0R_Qb0Wx-g], s[UNASSIGNED], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-04-06T20:17:35.294Z], delayed=false, details[restore_source[repo/snap]], allocation_status[fetching_shard_data]], failed_attempts[0]
--------[zoxdkcqpuouh][3], node[null], [P], recovery_source[snapshot recovery [ArTTnq9aTsel6jxW7feZ6Q] from repo:snap/qM6kXprETqqT0R_Qb0Wx-g], s[UNASSIGNED], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-04-06T20:17:35.294Z], delayed=false, details[restore_source[repo/snap]], allocation_status[fetching_shard_data]], failed_attempts[0]
--------[zoxdkcqpuouh][4], node[null], [P], recovery_source[snapshot recovery [ArTTnq9aTsel6jxW7feZ6Q] from repo:snap/qM6kXprETqqT0R_Qb0Wx-g], s[UNASSIGNED], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-04-06T20:17:35.294Z], delayed=false, details[restore_source[repo/snap]], allocation_status[fetching_shard_data]], failed_attempts[0]
--------[zoxdkcqpuouh][5], node[null], [P], recovery_source[snapshot recovery [ArTTnq9aTsel6jxW7feZ6Q] from repo:snap/qM6kXprETqqT0R_Qb0Wx-g], s[UNASSIGNED], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-04-06T20:17:35.294Z], delayed=false, details[restore_source[repo/snap]], allocation_status[fetching_shard_data]], failed_attempts[0]

  "shard_sizes" : { },
  "reserved_sizes" : [ ]

SnapshotShardSizeInfo{snapshotShardSizes=[
[snapshot=repo:snap/qM6kXprETqqT0R_Qb0Wx-g, index=[zoxdkcqpuouh/ZsO0w7BsQRy3hAFwsIpZ2g], shard=[zoxdkcqpuouh][0]]=>12166, 
[snapshot=repo:snap/qM6kXprETqqT0R_Qb0Wx-g, index=[zoxdkcqpuouh/ZsO0w7BsQRy3hAFwsIpZ2g], shard=[zoxdkcqpuouh][1]]=>13187, 
[snapshot=repo:snap/qM6kXprETqqT0R_Qb0Wx-g, index=[zoxdkcqpuouh/ZsO0w7BsQRy3hAFwsIpZ2g], shard=[zoxdkcqpuouh][2]]=>12535, 
[snapshot=repo:snap/qM6kXprETqqT0R_Qb0Wx-g, index=[zoxdkcqpuouh/ZsO0w7BsQRy3hAFwsIpZ2g], shard=[zoxdkcqpuouh][5]]=>13006, 
[snapshot=repo:snap/qM6kXprETqqT0R_Qb0Wx-g, index=[zoxdkcqpuouh/ZsO0w7BsQRy3hAFwsIpZ2g], shard=[zoxdkcqpuouh][4]]=>12594, 
[snapshot=repo:snap/qM6kXprETqqT0R_Qb0Wx-g, index=[zoxdkcqpuouh/ZsO0w7BsQRy3hAFwsIpZ2g], shard=[zoxdkcqpuouh][3]]=>12451
]}

[2024-04-06T16:17:35,503][TRACE][o.e.c.r.a.a.DesiredBalanceShardsAllocator] [node_t0] Desired balance updated: DesiredBalance[lastConvergedIndex=16, assignments={
[zoxdkcqpuouh][1]=ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1],
[zoxdkcqpuouh][0]=ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1],
[zoxdkcqpuouh][5]=ShardAssignment[nodeIds=[r4ludfGvTh2wZ7UGP-SjWA], total=1, unassigned=0, ignored=0],
[zoxdkcqpuouh][4]=ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1],
[zoxdkcqpuouh][3]=ShardAssignment[nodeIds=[zIF6uDFXRYKPVDXFSMTRQQ], total=1, unassigned=0, ignored=0],
[zoxdkcqpuouh][2]=ShardAssignment[nodeIds=[r4ludfGvTh2wZ7UGP-SjWA], total=1, unassigned=0, ignored=0]
}]. Diff:
[zoxdkcqpuouh][5]: ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1] -> ShardAssignment[nodeIds=[r4ludfGvTh2wZ7UGP-SjWA], total=1, unassigned=0, ignored=0]
[zoxdkcqpuouh][3]: ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1] -> ShardAssignment[nodeIds=[zIF6uDFXRYKPVDXFSMTRQQ], total=1, unassigned=0, ignored=0]
[zoxdkcqpuouh][2]: ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1] -> ShardAssignment[nodeIds=[r4ludfGvTh2wZ7UGP-SjWA], total=1, unassigned=0, ignored=0]

Nothing out of the ordinary here.

Round 2:

routing_nodes:
-----node_id[zIF6uDFXRYKPVDXFSMTRQQ][V]
--------[zoxdkcqpuouh][3], node[zIF6uDFXRYKPVDXFSMTRQQ], [P], recovery_source[snapshot recovery [ArTTnq9aTsel6jxW7feZ6Q] from repo:snap/qM6kXprETqqT0R_Qb0Wx-g], s[INITIALIZING], a[id=092Hbr2cSEakrjwApnkZow], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-04-06T20:17:35.294Z], delayed=false, details[restore_source[repo/snap]], allocation_status[no_attempt]], failed_attempts[0], expected_shard_size[12451]
-----node_id[r4ludfGvTh2wZ7UGP-SjWA][V]
--------[zoxdkcqpuouh][2], node[r4ludfGvTh2wZ7UGP-SjWA], [P], recovery_source[snapshot recovery [ArTTnq9aTsel6jxW7feZ6Q] from repo:snap/qM6kXprETqqT0R_Qb0Wx-g], s[INITIALIZING], a[id=TpE9Bw8VTAy1CJz-pDXquQ], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-04-06T20:17:35.294Z], delayed=false, details[restore_source[repo/snap]], allocation_status[no_attempt]], failed_attempts[0], expected_shard_size[12535]
--------[zoxdkcqpuouh][5], node[r4ludfGvTh2wZ7UGP-SjWA], [P], recovery_source[snapshot recovery [ArTTnq9aTsel6jxW7feZ6Q] from repo:snap/qM6kXprETqqT0R_Qb0Wx-g], s[INITIALIZING], a[id=2n_rRR9XQteq_585F1FliQ], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-04-06T20:17:35.294Z], delayed=false, details[restore_source[repo/snap]], allocation_status[no_attempt]], failed_attempts[0], expected_shard_size[13006]
---- unassigned
--------[zoxdkcqpuouh][0], node[null], [P], recovery_source[snapshot recovery [ArTTnq9aTsel6jxW7feZ6Q] from repo:snap/qM6kXprETqqT0R_Qb0Wx-g], s[UNASSIGNED], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-04-06T20:17:35.294Z], delayed=false, details[restore_source[repo/snap]], allocation_status[no_attempt]], failed_attempts[0]
--------[zoxdkcqpuouh][1], node[null], [P], recovery_source[snapshot recovery [ArTTnq9aTsel6jxW7feZ6Q] from repo:snap/qM6kXprETqqT0R_Qb0Wx-g], s[UNASSIGNED], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-04-06T20:17:35.294Z], delayed=false, details[restore_source[repo/snap]], allocation_status[no_attempt]], failed_attempts[0]
--------[zoxdkcqpuouh][4], node[null], [P], recovery_source[snapshot recovery [ArTTnq9aTsel6jxW7feZ6Q] from repo:snap/qM6kXprETqqT0R_Qb0Wx-g], s[UNASSIGNED], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2024-04-06T20:17:35.294Z], delayed=false, details[restore_source[repo/snap]], allocation_status[no_attempt]], failed_attempts[0]

  "shard_sizes" : {
    "[zoxdkcqpuouh][2][p]_bytes" : 0, <=== has reserved 12535 bytes
    "[zoxdkcqpuouh][3][p]_bytes" : 12451, <=== has reserved 0 bytes
    "[zoxdkcqpuouh][5][p]_bytes" : 0 <=== has no reserved entry
  },
  "reserved_sizes" : [
    {
      "node_id" : "zIF6uDFXRYKPVDXFSMTRQQ",
      "path" : "/dev/shm/bk/bk-agent-prod-gcp-1712433772034280571/elastic/elasticsearch-periodic/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_E34CC9AAB7634065-001/tempDir-005/node-2",
      "total" : 0,
      "shards" : [
        "[zoxdkcqpuouh][3]"
      ]
    },
    {
      "node_id" : "r4ludfGvTh2wZ7UGP-SjWA",
      "path" : "/dev/shm/bk/bk-agent-prod-gcp-1712433772034280571/elastic/elasticsearch-periodic/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_E34CC9AAB7634065-001/tempDir-005/node-1",
      "total" : 12535,
      "shards" : [
        "[zoxdkcqpuouh][2]"
      ]
    }
  ]
    "zIF6uDFXRYKPVDXFSMTRQQ" : {
      "node_name" : "node_t2",
      "least_available" : {
        "path" : "/dev/shm/bk/bk-agent-prod-gcp-1712433772034280571/elastic/elasticsearch-periodic/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_E34CC9AAB7634065-001/tempDir-005/node-2",
        "total_bytes" : 22691,
        "used_bytes" : 0,
        "free_bytes" : 22691,
        "free_disk_percent" : 100.0,
        "used_disk_percent" : 0.0
      }
    },
    "r4ludfGvTh2wZ7UGP-SjWA" : {
      "node_name" : "node_t1",
      "least_available" : {
        "path" : "/dev/shm/bk/bk-agent-prod-gcp-1712433772034280571/elastic/elasticsearch-periodic/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_E34CC9AAB7634065-001/tempDir-005/node-1",
        "total_bytes" : 63302148096,
        "used_bytes" : 9228197888,
        "free_bytes" : 54073950208,
        "free_disk_percent" : 85.4,
        "used_disk_percent" : 14.6
      },

SnapshotShardSizeInfo{snapshotShardSizes=[
[snapshot=repo:snap/qM6kXprETqqT0R_Qb0Wx-g, index=[zoxdkcqpuouh/ZsO0w7BsQRy3hAFwsIpZ2g], shard=[zoxdkcqpuouh][1]]=>13187, 
[snapshot=repo:snap/qM6kXprETqqT0R_Qb0Wx-g, index=[zoxdkcqpuouh/ZsO0w7BsQRy3hAFwsIpZ2g], shard=[zoxdkcqpuouh][4]]=>12594,
[snapshot=repo:snap/qM6kXprETqqT0R_Qb0Wx-g, index=[zoxdkcqpuouh/ZsO0w7BsQRy3hAFwsIpZ2g], shard=[zoxdkcqpuouh][0]]=>12166
]}

Desired balance updated: DesiredBalance[lastConvergedIndex=17, assignments={
[zoxdkcqpuouh][5]=ShardAssignment[nodeIds=[r4ludfGvTh2wZ7UGP-SjWA], total=1, unassigned=0, ignored=0], 
[zoxdkcqpuouh][4]=ShardAssignment[nodeIds=[r4ludfGvTh2wZ7UGP-SjWA], total=1, unassigned=0, ignored=0], 
[zoxdkcqpuouh][3]=ShardAssignment[nodeIds=[zIF6uDFXRYKPVDXFSMTRQQ], total=1, unassigned=0, ignored=0], 
[zoxdkcqpuouh][2]=ShardAssignment[nodeIds=[r4ludfGvTh2wZ7UGP-SjWA], total=1, unassigned=0, ignored=0], 
[zoxdkcqpuouh][1]=ShardAssignment[nodeIds=[r4ludfGvTh2wZ7UGP-SjWA], total=1, unassigned=0, ignored=0], 
[zoxdkcqpuouh][0]=ShardAssignment[nodeIds=[zIF6uDFXRYKPVDXFSMTRQQ], total=1, unassigned=0, ignored=0]
}]. Diff: 
[zoxdkcqpuouh][4]: ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1] -> ShardAssignment[nodeIds=[r4ludfGvTh2wZ7UGP-SjWA], total=1, unassigned=0, ignored=0]
[zoxdkcqpuouh][1]: ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1] -> ShardAssignment[nodeIds=[r4ludfGvTh2wZ7UGP-SjWA], total=1, unassigned=0, ignored=0]
[zoxdkcqpuouh][0]: ShardAssignment[nodeIds=[], total=1, unassigned=1, ignored=1] -> ShardAssignment[nodeIds=[zIF6uDFXRYKPVDXFSMTRQQ], total=1, unassigned=0, ignored=0]

The round 2 balance is calculated incorrectly because it is based on incorrect cluster info (see the arithmetic note after this list):

  • [zoxdkcqpuouh][5][p]_bytes has neither a size nor a reserved entry
  • zIF6uDFXRYKPVDXFSMTRQQ reports "used_bytes" : 0 even though one shard is already on it (could this be related to TestFileStore and how we mock it?)
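
For scale, plugging in the numbers above: the small node zIF6uDFXRYKPVDXFSMTRQQ has total_bytes 22691 and already hosts [zoxdkcqpuouh][3] with an expected size of 12451 bytes, leaving at most 22691 - 12451 = 10240 bytes free (which matches the 10240-byte high watermark seen in later logs, presumably the same constant). Additionally assigning [zoxdkcqpuouh][0] (12166 bytes in the round 1 SnapshotShardSizeInfo) would need about 24617 bytes in total, so it necessarily breaches the watermark.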

@idegtiarenko
Contributor

I have not found any particular reason why the balance was calculated in two rounds here, but I consider this to be a valid scenario.
I also do not think we have any guarantee of a point-in-time ClusterInfo snapshot that would explain the value the test observed.

@idegtiarenko idegtiarenko removed their assignment Aug 6, 2024
@DiannaHohensee DiannaHohensee self-assigned this Aug 22, 2024
@DiannaHohensee
Contributor

The build scan failure above shows the following in its logs:

Maj 17, 2024 5:37:20 PD com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
WARNING: Uncaught exception in thread: Thread[#2582,elasticsearch[node_t0][clusterApplierService#updateTask][T#1],5,TGRP-DiskThresholdDeciderIT]
java.lang.AssertionError: Expects no relocating shards but got: routing_nodes:
-----node_id[2_l39Q7SSiWyhgjDJv2lBg][V]
--------[fvyruaxlr][0], node[2_l39Q7SSiWyhgjDJv2lBg], [P], s[STARTED], a[id=SGQo56cyQOuyPEmhwMgeuA], failed_attempts[0]
--------[fvyruaxlr][1], node[2_l39Q7SSiWyhgjDJv2lBg], relocating [s_UbSUETQ-GejfrGhwmCfw], [P], recovery_source[peer recovery], s[INITIALIZING], a[id=5-Mwd2VLSoWPJ5EcXzlrHA, rId=x11KnGalToKb1w59JaOxcQ], failed_attempts[0], expected_shard_size[21221]
--------[fvyruaxlr][3], node[2_l39Q7SSiWyhgjDJv2lBg], [P], s[STARTED], a[id=N1UYVQ_aThyoOxBMlQdyGQ], failed_attempts[0]
--------[fvyruaxlr][4], node[2_l39Q7SSiWyhgjDJv2lBg], [P], s[STARTED], a[id=09TJ-r24QQK5MTD-GaJMcQ], failed_attempts[0]
--------[fvyruaxlr][5], node[2_l39Q7SSiWyhgjDJv2lBg], [P], s[STARTED], a[id=7sbAsKiUQBWSwlk2fdFDlQ], failed_attempts[0]
-----node_id[s_UbSUETQ-GejfrGhwmCfw][V]
--------[fvyruaxlr][1], node[s_UbSUETQ-GejfrGhwmCfw], relocating [2_l39Q7SSiWyhgjDJv2lBg], [P], s[RELOCATING], a[id=x11KnGalToKb1w59JaOxcQ, rId=5-Mwd2VLSoWPJ5EcXzlrHA], failed_attempts[0], expected_shard_size[21221]
--------[fvyruaxlr][2], node[s_UbSUETQ-GejfrGhwmCfw], [P], s[STARTED], a[id=4Htw-f0yStK2qMjoNC7LNA], failed_attempts[0]
---- unassigned

Expected: <0>
     but: was <1>
	at __randomizedtesting.SeedInfo.seed([360E9317DD5AD10]:0)
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
	at org.elasticsearch.test.ESTestCase.assertThat(ESTestCase.java:2123)
	at org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.lambda$testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards$2(DiskThresholdDeciderIT.java:186)
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListener(ClusterApplierService.java:560)
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListeners(ClusterApplierService.java:546)
	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:505)
	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:429)
	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:154)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:917)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:217)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:183)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1583)

@DiannaHohensee
Contributor

DiannaHohensee commented Aug 29, 2024

My debug logging...


REPRODUCE WITH: ./gradlew ":server:internalClusterTest" --tests "org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards {seed=[360E9317DD5AD10:9C6621EC0B9A256A]}" -Dtests.seed=360E9317DD5AD10 -Dtests.filter=@awaitsfix -Dtests.locale=sq -Dtests.timezone=America/Phoenix -Druntime.java=21

DiskThresholdDeciderIT > testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards {seed=[360E9317DD5AD10:9C6621EC0B9A256A]} FAILED
    java.lang.AssertionError: Mismatching shard routings: []
    Expected: iterable containing [one of {<[sqikodzgne][3]>, <[sqikodzgne][2]>}]
         but: no item was one of {<[sqikodzgne][3]>, <[sqikodzgne][2]>}
        at __randomizedtesting.SeedInfo.seed([360E9317DD5AD10:9C6621EC0B9A256A]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.elasticsearch.test.ESTestCase.assertThat(ESTestCase.java:2470)
        at org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.lambda$assertBusyWithDiskUsageRefresh$7(DiskThresholdDeciderIT.java:360)
        at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1408)
        at org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.assertBusyWithDiskUsageRefresh(DiskThresholdDeciderIT.java:350)
        at org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards(DiskThresholdDeciderIT.java:243)


  1> [2024-08-29T08:44:58,238][INFO ][o.e.c.r.a.d.DiskThresholdDeciderIT] [testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards] ~~~test done creating data for shards, shardSizes: [
  ShardSize[shardId=[uaxlrfmucfxo][0], size=22976],
  ShardSize[shardId=[uaxlrfmucfxo][2], size=23049],
  ShardSize[shardId=[uaxlrfmucfxo][4], size=23122],
  ShardSize[shardId=[uaxlrfmucfxo][3], size=23271],
  ShardSize[shardId=[uaxlrfmucfxo][1], size=23948],
  ShardSize[shardId=[uaxlrfmucfxo][5], size=24030]]



  1> [2024-08-29T08:44:58,462][INFO ][o.e.c.r.a.d.DiskThresholdDeciderIT] [testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards] ~~~test decreased disk space and refreshed it, WATERMARK_BYTES: 10240, usableSpace: 23049


  1> [2024-08-29T08:44:58,658][INFO ][o.e.c.r.a.d.DiskThresholdDeciderIT] [testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards] ~~~test matcher shard set: [[uaxlrfmucfxo][0], [uaxlrfmucfxo][2]]


  1> [2024-08-29T08:44:58,666][INFO ][o.e.c.r.a.d.DiskThresholdDeciderIT] [testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards] ~~~assertBusyWithDiskUsageRefresh refreshDiskUsage complete, checking, shardRouting: [[uaxlrfmucfxo][2]]

The matcher has what we want, and the shard is on the node that we want.

Maybe this 🤔

@DiannaHohensee
Contributor

DiannaHohensee commented Aug 29, 2024

Okay, so I fixed the Matcher problem. Replace this line with the following:

assertBusyWithDiskUsageRefresh(dataNode0Id, indexName, hasItem(in(shardSizes.getShardIdsWithSizeSmallerOrEqual(usableSpace))));
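
(If I read the Hamcrest descriptions right, the earlier matcher, reported as "iterable containing [one of {…}]", required the node's routing list to consist of exactly one of the expected shards, whereas hasItem(in(…)) only requires that at least one routed shard is in the set of shards small enough to fit.)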

But now I've uncovered something else, either a test bug or a code bug. These are my debug logs limited to this line of code, where the snapshot restore occurs and allocates shards out to the nodes. Note that rebalancing is otherwise disabled, so we won't rebalance after the initial allocation, IIUC.

A potentially interesting bit from before allocation occurs: somehow allocation thinks the available space on the node is zero?

1> [2024-08-29T12:07:08,531][WARN ][o.e.c.r.a.DiskThresholdMonitor] [node_t0] high disk watermark [10240b] exceeded on [q24VcceoTY-cq6vgafXpOQ][node_t2][/home/dianna/elasticsearch/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_360E9317DD5AD10-019/tempDir-035/node-2] free: 0b[0%], shards will be relocated away from this node; currently relocating away shards totalling [0] bytes; the node is expected to continue to exceed the high disk watermark when these relocations are complete

Again later,

  1> [2024-08-29T12:07:08,538][TRACE][o.e.c.r.a.d.DiskThresholdDecider] [node_t0] getDiskUsage(subtractLeavingShards=false) returning DiskUsageWithRelocations[diskUsage=[q24VcceoTY-cq6vgafXpOQ][node_t2][/home/dianna/elasticsearch/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_360E9317DD5AD10-019/tempDir-035/node-2] free: 0b[0%], relocatingShardSize=0]
  1> [2024-08-29T12:07:08,538][TRACE][o.e.c.r.a.d.DiskThresholdDecider] [node_t0] node [q24VcceoTY-cq6vgafXpOQ] has 100.0% used disk
  1> [2024-08-29T12:07:08,538][DEBUG][o.e.c.r.a.d.DiskThresholdDecider] [node_t0] less than the required 10240 free bytes threshold (0b free) on node q24VcceoTY-cq6vgafXpOQ, preventing allocation

100% disk usage persists until the end (this is the last occurrence of the log message):

  1> [2024-08-29T12:07:08,559][TRACE][o.e.c.r.a.d.DiskThresholdDecider] [node_t0] node [q24VcceoTY-cq6vgafXpOQ] has 100.0% used disk

@DiannaHohensee
Contributor

These are the logs written while these lines of code, which change the disk size, were executing. It looks like this:

  1> [2024-08-29T12:07:08,514][DEBUG][o.e.c.r.a.a.DesiredBalanceShardsAllocator] [node_t0] Executing allocate for [11]
  1> [2024-08-29T12:07:08,514][INFO ][o.e.c.r.a.d.DiskThresholdDeciderIT] [testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleShards] Refreshed cluster info: {
  1>   "nodes" : {
  1>     "Hh_cFSDBQxSRw0flPCpXYw" : {
  1>       "node_name" : "node_t1",
  1>       "least_available" : {
  1>         "path" : "/home/dianna/elasticsearch/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_360E9317DD5AD10-019/tempDir-035/node-1",
  1>         "total_bytes" : 364024733696,
  1>         "used_bytes" : 78748274688,
  1>         "free_bytes" : 285276459008,
  1>         "free_disk_percent" : 78.4,
  1>         "used_disk_percent" : 21.6
  1>       },
  1>       "most_available" : {
  1>         "path" : "/home/dianna/elasticsearch/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_360E9317DD5AD10-019/tempDir-035/node-1",
  1>         "total_bytes" : 364024733696,
  1>         "used_bytes" : 78748274688,
  1>         "free_bytes" : 285276459008,
  1>         "free_disk_percent" : 78.4,
  1>         "used_disk_percent" : 21.6
  1>       }
  1>     },
  1>     "q24VcceoTY-cq6vgafXpOQ" : {
  1>       "node_name" : "node_t2",
  1>       "least_available" : {
  1>         "path" : "/home/dianna/elasticsearch/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_360E9317DD5AD10-019/tempDir-035/node-2",
  1>         "total_bytes" : 29975,
  1>         "used_bytes" : 29975,
  1>         "free_bytes" : 0,
  1>         "free_disk_percent" : 0.0,
  1>         "used_disk_percent" : 100.0
  1>       },
  1>       "most_available" : {
  1>         "path" : "/home/dianna/elasticsearch/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_360E9317DD5AD10-019/tempDir-035/node-2",
  1>         "total_bytes" : 29975,
  1>         "used_bytes" : 29975,
  1>         "free_bytes" : 0,
  1>         "free_disk_percent" : 0.0,
  1>         "used_disk_percent" : 100.0
  1>       }
  1>     }
  1>   },
  1>   "shard_sizes" : { },
  1>   "shard_data_set_sizes" : { },
  1>   "shard_paths" : { },
  1>   "reserved_sizes" : [ ]
  1> }
  1> [2024-08-29T12:07:08,514][DEBUG][o.e.c.r.a.a.DesiredBalanceShardsAllocator] [node_t0] Starting desired balance computation for [11]
  1> [2024-08-29T12:07:08,514][TRACE][o.e.c.r.a.a.DesiredBalanceShardsAllocator] [node_t0] Reconciling desired balance: DesiredBalance[lastConvergedIndex=10, assignments={}]
  1> [2024-08-29T12:07:08,514][DEBUG][o.e.c.r.a.a.DesiredBalanceReconciler] [node_t0] Reconciling desired balance for [10]
  1> [2024-08-29T12:07:08,514][TRACE][o.e.c.r.a.a.DesiredBalanceShardsAllocator] [node_t0] Reconciled desired balance: DesiredBalance[lastConvergedIndex=10, assignments={}]
  1> [2024-08-29T12:07:08,514][TRACE][o.e.c.r.a.a.DesiredBalanceComputer] [node_t0] Recomputing desired balance for [11]: DesiredBalance[lastConvergedIndex=10, assignments={}], routing_nodes:
  1> -----node_id[q24VcceoTY-cq6vgafXpOQ][V]
  1> -----node_id[Hh_cFSDBQxSRw0flPCpXYw][V]
  1> ---- unassigned
  1> , {
  1>   "nodes" : {
  1>     "Hh_cFSDBQxSRw0flPCpXYw" : {
  1>       "node_name" : "node_t1",
  1>       "least_available" : {
  1>         "path" : "/home/dianna/elasticsearch/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_360E9317DD5AD10-019/tempDir-035/node-1",
  1>         "total_bytes" : 364024733696,
  1>         "used_bytes" : 78748274688,
  1>         "free_bytes" : 285276459008,
  1>         "free_disk_percent" : 78.4,
  1>         "used_disk_percent" : 21.6
  1>       },
  1>       "most_available" : {
  1>         "path" : "/home/dianna/elasticsearch/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_360E9317DD5AD10-019/tempDir-035/node-1",
  1>         "total_bytes" : 364024733696,
  1>         "used_bytes" : 78748274688,
  1>         "free_bytes" : 285276459008,
  1>         "free_disk_percent" : 78.4,
  1>         "used_disk_percent" : 21.6
  1>       }
  1>     },
  1>     "q24VcceoTY-cq6vgafXpOQ" : {
  1>       "node_name" : "node_t2",
  1>       "least_available" : {
  1>         "path" : "/home/dianna/elasticsearch/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_360E9317DD5AD10-019/tempDir-035/node-2",
  1>         "total_bytes" : 29975,
  1>         "used_bytes" : 29975,
  1>         "free_bytes" : 0,
  1>         "free_disk_percent" : 0.0,
  1>         "used_disk_percent" : 100.0
  1>       },
  1>       "most_available" : {
  1>         "path" : "/home/dianna/elasticsearch/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_360E9317DD5AD10-019/tempDir-035/node-2",
  1>         "total_bytes" : 29975,
  1>         "used_bytes" : 29975,
  1>         "free_bytes" : 0,
  1>         "free_disk_percent" : 0.0,
  1>         "used_disk_percent" : 100.0
  1>       }
  1>     }
  1>   },
  1>   "shard_sizes" : { },
  1>   "shard_data_set_sizes" : { },
  1>   "shard_paths" : { },
  1>   "reserved_sizes" : [ ]
  1> }, SnapshotShardSizeInfo{snapshotShardSizes=[]}
  1> [2024-08-29T12:07:08,514][TRACE][o.e.c.r.a.a.DesiredBalanceComputer] [node_t0] running delegate allocator
  1> [2024-08-29T12:07:08,514][TRACE][o.e.c.r.a.a.BalancedShardsAllocator] [node_t0] Start allocating unassigned shards
  1> [2024-08-29T12:07:08,515][TRACE][o.e.c.r.a.a.BalancedShardsAllocator] [node_t0] Start balancing cluster
  1> [2024-08-29T12:07:08,515][DEBUG][o.e.c.r.a.a.DesiredBalanceComputer] [node_t0] Desired balance computation for [11] converged after [0s] and [1] iterations
  1> [2024-08-29T12:07:08,515][TRACE][o.e.c.r.a.a.DesiredBalanceShardsAllocator] [node_t0] Desired balance updated: DesiredBalance[lastConvergedIndex=11, assignments={}]. No changes
  1> [2024-08-29T12:07:08,515][DEBUG][o.e.c.r.a.a.DesiredBalanceShardsAllocator] [node_t0] Desired balance computation for [11] is completed, scheduling reconciliation
  1> [2024-08-29T12:07:08,515][TRACE][o.e.c.r.a.a.DesiredBalanceShardsAllocator] [node_t0] Reconciling desired balance: DesiredBalance[lastConvergedIndex=11, assignments={}]
  1> [2024-08-29T12:07:08,515][DEBUG][o.e.c.r.a.a.DesiredBalanceReconciler] [node_t0] Reconciling desired balance for [11]
  1> [2024-08-29T12:07:08,515][TRACE][o.e.c.r.a.a.DesiredBalanceShardsAllocator] [node_t0] Reconciled desired balance: DesiredBalance[lastConvergedIndex=11, assignments={}]
  1> [2024-08-29T12:07:08,515][WARN ][o.e.c.r.a.DiskThresholdMonitor] [node_t0] high disk watermark [10240b] exceeded on [q24VcceoTY-cq6vgafXpOQ][node_t2][/home/dianna/elasticsearch/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_360E9317DD5AD10-019/tempDir-035/node-2] free: 0b[0%], shards will be relocated away from this node; currently relocating away shards totalling [0] bytes; the node is expected to continue to exceed the high disk watermark when these relocations are complete

The interesting bit is

  1>     "q24VcceoTY-cq6vgafXpOQ" : {
  1>       "node_name" : "node_t2",
  1>       "least_available" : {
  1>         "path" : "/home/dianna/elasticsearch/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_360E9317DD5AD10-019/tempDir-035/node-2",
  1>         "total_bytes" : 29975,
  1>         "used_bytes" : 29975,
  1>         "free_bytes" : 0,
  1>         "free_disk_percent" : 0.0,
  1>         "used_disk_percent" : 100.0
  1>       },
  1>       "most_available" : {
  1>         "path" : "/home/dianna/elasticsearch/server/build/testrun/internalClusterTest/temp/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT_360E9317DD5AD10-019/tempDir-035/node-2",
  1>         "total_bytes" : 29975,
  1>         "used_bytes" : 29975,
  1>         "free_bytes" : 0,
  1>         "free_disk_percent" : 0.0,
  1>         "used_disk_percent" : 100.0
  1>       }
  1>     }

We changed the total_bytes, but the used_bytes matches it exactly, even though there are no shards allocated yet.
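
One way the observed numbers could arise, assuming (I have not checked this against the actual TestFileStore) that the mock only caps the reported total while usage is still derived from the real filesystem:

// Hypothetical sketch, not the real TestFileStore: if only the total is capped and the
// used space still reflects the real filesystem, free space collapses to zero as soon
// as real usage reaches the cap, even with no shards allocated to the node.
static long freeBytes(long cappedTotalBytes, long realUsedBytes) {
    return Math.max(0, cappedTotalBytes - realUsedBytes);
}
// e.g. cappedTotalBytes = 29975 and realUsedBytes >= 29975 gives freeBytes == 0, which
// matches the "free_bytes" : 0 / "used_disk_percent" : 100.0 reported above (with
// used_bytes then shown as total minus free, i.e. 29975).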

@repantis repantis added Team:Distributed Coordination Meta label for Distributed Coordination team and removed Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. labels Nov 5, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Apr 14, 2025
For shards being restored from a snapshot we use `SnapshotShardSizeInfo`
to track their sizes while they're unassigned, and then use
`ShardRouting#expectedShardSize` when they start to recover. However we
were incorrectly ignoring the `ShardRouting#expectedShardSize` value
when accounting for the movements of shards in the
`ClusterInfoSimulator`, which would sometimes cause us to assign more
shards to a node than its disk space should have allowed.

Closes elastic#105331
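
A minimal sketch of the idea described above, not the actual ClusterInfoSimulator change from #126783 (method names are approximations): when simulating the start of a shard restored from a snapshot whose size is not yet in ClusterInfo, fall back to the expected size carried on the routing entry.

// Hedged sketch of the fix idea only, not the real implementation.
static long simulatedShardSize(ShardRouting shard, ClusterInfo clusterInfo) {
    Long known = clusterInfo.getShardSize(shard); // null until the hosting node reports a real size
    if (known != null) {
        return known;
    }
    // for a shard restored from a snapshot this was populated from SnapshotShardSizeInfo
    return shard.getExpectedShardSize();
}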
elasticsearchmachine pushed a commit that referenced this issue Apr 22, 2025
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Apr 22, 2025
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Apr 22, 2025
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Apr 22, 2025
elasticsearchmachine pushed a commit that referenced this issue Apr 23, 2025
* Fix shard size of initializing restored shard (#126783)
* Backport utils from 4009599
elasticsearchmachine pushed a commit that referenced this issue Apr 23, 2025
* Fix shard size of initializing restored shard (#126783)
* Missing throws
elasticsearchmachine pushed a commit that referenced this issue Apr 23, 2025
* Fix shard size of initializing restored shard (#126783)
* Backport utils from 4009599
* Missing throws