
[CI] ILMDownsampleDisruptionIT testILMDownsampleRollingRestart failing #126495


Closed
elasticsearchmachine opened this issue Apr 8, 2025 · 4 comments · Fixed by #126692
Assignees
Labels:

  • :Data Management/Indices APIs (APIs to create and manage indices and templates)
  • low-risk (An open issue or test failure that is a low risk to future releases)
  • Team:Data Management (Meta label for data/management team)
  • >test-failure (Triaged test failures from CI)

Comments

@elasticsearchmachine (Collaborator)

Build Scans:

Reproduction Line:

./gradlew ":x-pack:plugin:downsample:internalClusterTest" --tests "org.elasticsearch.xpack.downsample.ILMDownsampleDisruptionIT.testILMDownsampleRollingRestart" -Dtests.seed=DADEFAB78F2B7B70 -Dtests.jvm.argline="-Des.entitlements.enabled=true" -Dtests.locale=kl -Dtests.timezone=America/Noronha -Druntime.java=21

Applicable branches:
main

Reproduces locally?:
N/A

Failure History:
See dashboard

Failure Message:

org.elasticsearch.index.IndexNotFoundException: no such index [downsample-1h-jmuazrphnw]

Issue Reasons:

  • [main] 2 failures in test testILMDownsampleRollingRestart (0.3% fail rate in 776 executions)

Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.

@elasticsearchmachine added the `:Data Management/Indices APIs` and `>test-failure` labels Apr 8, 2025
elasticsearchmachine added a commit that referenced this issue Apr 8, 2025
@elasticsearchmachine (Collaborator, Author)

This has been muted on branch main

Mute Reasons:

  • [main] 2 failures in test testILMDownsampleRollingRestart (0.3% fail rate in 776 executions)

Build Scans:

@elasticsearchmachine added the `Team:Data Management` and `needs:risk` labels Apr 8, 2025
@elasticsearchmachine (Collaborator, Author)

Pinging @elastic/es-data-management (Team:Data Management)

@parkertimmins (Contributor)

It looks like this test still has the issue of the `@After` method running too early, as was the case in #114233.

In the logs there's an instance of `cleaning up after test`, and then we start seeing errors:

 [2025-04-08T19:14:02,442][INFO ][o.e.x.i.IndexLifecycleTransition][node_t0][masterService#updateTask][T#1] moving index [jmuazrphnw] from [{"phase":"warm","action":"downsample","name":"generate-downsampled-index-name"}] to [{"phase":"warm","action":"downsample","name":"rollup"}] in policy [mypolicy]	
  1> [2025-04-08T19:14:02,676][INFO ][o.e.x.d.ILMDownsampleDisruptionIT][testILMDownsampleRollingRestart] [ILMDownsampleDisruptionIT#testILMDownsampleRollingRestart]: cleaning up after test	
  1> [2025-04-08T19:14:03,046][INFO ][o.e.c.r.a.AllocationService][node_t0][masterService#updateTask][T#1] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[downsample-1h-jmuazrphnw][0]]])." previous.health="YELLOW" reason="shards started [[downsample-1h-jmuazrphnw][0]]"	
  1> [2025-04-08T19:14:03,152][INFO ][o.e.c.m.MetadataDeleteIndexService][node_t0][masterService#updateTask][T#1] [jmuazrphnw/jiSlSDAeQO6fqoRaCDbFfw] deleting index	
  1> [2025-04-08T19:14:03,152][INFO ][o.e.c.m.MetadataDeleteIndexService][node_t0][masterService#updateTask][T#1] [downsample-1h-jmuazrphnw/IyrR_gMhQYqZyNl-lRZ1Aw] deleting index	
  1> [2025-04-08T19:14:03,279][INFO ][o.e.n.Node               ][testILMDownsampleRollingRestart] stopping ...	
  1> [2025-04-08T19:14:03,280][INFO ][o.e.c.f.AbstractFileWatchingService][[elasticsearch[file-watcher[/dev/shm/bk/bk-agent-prod-gcp-1744146337862119520/elastic/elasticsearch-periodic/x-pack/plugin/downsample/build/testrun/internalClusterTest/temp/org.elasticsearch.xpack.downsample.ILMDownsampleDisruptionIT_DADEFAB78F2B7B70-001/tempDir-002/config/operator]]]] shutting down watcher thread	
  1> [2025-04-08T19:14:03,281][INFO ][o.e.c.f.AbstractFileWatchingService][testILMDownsampleRollingRestart] watcher service stopped	
  1> [2025-04-08T19:14:03,284][ERROR][o.e.x.d.TransportDownsampleAction][testILMDownsampleRollingRestart] error while waiting for downsampling persistent task	
  1> org.elasticsearch.node.NodeClosedException: node closed {node_t0}{MW5bWJaET1iJiwssiM0xgg}{oeinVuTHQcuwlsnFxLl62Q}{node_t0}{127.0.0.1}{127.0.0.1:18581}{m}{9.1.0}{8000099-9021000}{xpack.installed=true}
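
The failure mode above, where test cleanup tears down nodes while the downsample task is still in flight, can be sketched in plain Java with no Elasticsearch dependencies. This is a hypothetical illustration, not the test's actual code: the executor stands in for a node, and `shutdownNow()` stands in for the early `@After` cleanup:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class EarlyCleanup {
    public static void main(String[] args) throws Exception {
        ExecutorService node = Executors.newSingleThreadExecutor();
        CountDownLatch taskStarted = new CountDownLatch(1);

        // Long-running background work, like the downsampling persistent task.
        Future<?> downsampleTask = node.submit(() -> {
            taskStarted.countDown();
            try {
                Thread.sleep(10_000); // still "downsampling" when cleanup begins
            } catch (InterruptedException e) {
                // The node was stopped underneath the task,
                // analogous to the NodeClosedException in the log above.
                throw new RuntimeException("node closed", e);
            }
        });

        taskStarted.await();
        node.shutdownNow(); // @After-style cleanup runs while the task is in flight

        try {
            downsampleTask.get();
            System.out.println("task completed");
        } catch (ExecutionException e) {
            System.out.println("task failed after cleanup: " + e.getCause().getMessage());
        }
    }
}
```

As in the log excerpt, the background task only observes the shutdown after cleanup has already begun, so the error surfaces during teardown rather than during the test body.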

@nielsbauman (Contributor)

My guess is that something different went wrong here. The errors after `cleaning up after test` are expected, because downsampling cannot complete on an index that has been removed (the cleanup deletes it). The exception that caused the test to fail is:

org.elasticsearch.index.IndexNotFoundException: no such index [downsample-1h-etxskwxmta]
	at __randomizedtesting.SeedInfo.seed([EF1F23CEBCC22294:5970946792D83C81]:0)
	at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.notFoundException(IndexNameExpressionResolver.java:786)	
	at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.ensureAliasOrIndexExists(IndexNameExpressionResolver.java:1569)	
	...	
	at org.elasticsearch.client.internal.IndicesAdminClient.getSettings(IndicesAdminClient.java:445)	
	at org.elasticsearch.xpack.downsample.ILMDownsampleDisruptionIT.lambda$startDownsampleTaskViaIlm$4(ILMDownsampleDisruptionIT.java:199)	
	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1507)	
	at org.elasticsearch.xpack.downsample.ILMDownsampleDisruptionIT.startDownsampleTaskViaIlm(ILMDownsampleDisruptionIT.java:195)	
	at org.elasticsearch.xpack.downsample.ILMDownsampleDisruptionIT.testILMDownsampleRollingRestart(ILMDownsampleDisruptionIT.java:172)
    ...

Meaning that this settings retrieval API call is failing:

var getSettingsResponse = client().admin()
    .indices()
    .getSettings(new GetSettingsRequest(TEST_REQUEST_TIMEOUT).indices(targetIndex))
    .actionGet();

The reason this happens, even though we check for index existence on the line above, is that these two API calls can hit different nodes: since #126051 and #125652, both of these actions run on the local node that receives the request, each answering from its own copy of the cluster state. We've seen cases before where one node sees cluster state version x while another still sees version x - 1, resulting in similar errors. I'll put up a fix later.
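
The race can be illustrated with a self-contained Java sketch (no Elasticsearch dependencies; the `Node` record and its `indexExists`/`getSettings` methods are hypothetical stand-ins for reads against each node's local cluster state):

```java
import java.util.Map;
import java.util.Set;

public class StaleStateRace {
    // A toy "node": each node answers requests from its own copy of the cluster state.
    record Node(String name, Set<String> knownIndices) {
        boolean indexExists(String index) {
            return knownIndices.contains(index);
        }

        Map<String, String> getSettings(String index) {
            if (!knownIndices.contains(index)) {
                throw new IllegalStateException("no such index [" + index + "]");
            }
            return Map.of("index.mode", "time_series");
        }
    }

    public static void main(String[] args) {
        String target = "downsample-1h-jmuazrphnw";
        Node current = new Node("node_t1", Set.of(target)); // has cluster state x
        Node stale = new Node("node_t0", Set.of());         // still at version x - 1

        // Racy pattern: the existence check and the settings call hit different nodes.
        if (current.indexExists(target)) {
            try {
                stale.getSettings(target); // fails despite the check above
            } catch (IllegalStateException e) {
                System.out.println("racy: " + e.getMessage());
            }
        }

        // Fixed pattern: route both calls to the same node (e.g. the master),
        // so the check and the read see one consistent view of the state.
        if (current.indexExists(target)) {
            System.out.println("pinned: " + current.getSettings(target));
        }
    }
}
```

The "pinned" pattern mirrors the eventual fix: waiting for the index to exist on the master node guarantees every node has seen at least that cluster-state version before the test proceeds.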

@nielsbauman added the `low-risk` label and removed the `needs:risk` label Apr 11, 2025
@nielsbauman nielsbauman self-assigned this Apr 11, 2025
nielsbauman added a commit to nielsbauman/elasticsearch that referenced this issue Apr 11, 2025
Wait for the index to exist on the master node to ensure all nodes have
the latest cluster state.

Fixes elastic#126495
nielsbauman added a commit that referenced this issue Apr 11, 2025
Wait for the index to exist on the master node to ensure all nodes have
the latest cluster state.

Fixes #126495