
[CI] ILMDownsampleDisruptionIT testILMDownsampleRollingRestart failing #126495


Closed
elasticsearchmachine opened this issue Apr 8, 2025 · 4 comments · Fixed by #126692
Assignees
Labels:

  • :Data Management/Indices APIs (APIs to create and manage indices and templates)
  • low-risk (An open issue or test failure that is a low risk to future releases)
  • Team:Data Management (Meta label for data/management team)
  • >test-failure (Triaged test failures from CI)

Comments

@elasticsearchmachine (Collaborator)

Build Scans:

Reproduction Line:

./gradlew ":x-pack:plugin:downsample:internalClusterTest" --tests "org.elasticsearch.xpack.downsample.ILMDownsampleDisruptionIT.testILMDownsampleRollingRestart" -Dtests.seed=DADEFAB78F2B7B70 -Dtests.jvm.argline="-Des.entitlements.enabled=true" -Dtests.locale=kl -Dtests.timezone=America/Noronha -Druntime.java=21

Applicable branches:
main

Reproduces locally?:
N/A

Failure History:
See dashboard

Failure Message:

org.elasticsearch.index.IndexNotFoundException: no such index [downsample-1h-jmuazrphnw]

Issue Reasons:

  • [main] 2 failures in test testILMDownsampleRollingRestart (0.3% fail rate in 776 executions)

Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.

@elasticsearchmachine added the `:Data Management/Indices APIs` and `>test-failure` labels Apr 8, 2025
elasticsearchmachine added a commit that referenced this issue Apr 8, 2025
@elasticsearchmachine (Collaborator, Author)

This has been muted on branch main

Mute Reasons:

  • [main] 2 failures in test testILMDownsampleRollingRestart (0.3% fail rate in 776 executions)

Build Scans:

@elasticsearchmachine added the `Team:Data Management` and `needs:risk` labels Apr 8, 2025
@elasticsearchmachine (Collaborator, Author)

Pinging @elastic/es-data-management (Team:Data Management)

@parkertimmins (Contributor)

It looks like this test still has the issue of the `@After` method running too early, as was the case in #114233.

In the logs there's an instance of `cleaning up after test`, and then we start seeing errors:

 [2025-04-08T19:14:02,442][INFO ][o.e.x.i.IndexLifecycleTransition][node_t0][masterService#updateTask][T#1] moving index [jmuazrphnw] from [{"phase":"warm","action":"downsample","name":"generate-downsampled-index-name"}] to [{"phase":"warm","action":"downsample","name":"rollup"}] in policy [mypolicy]	
  1> [2025-04-08T19:14:02,676][INFO ][o.e.x.d.ILMDownsampleDisruptionIT][testILMDownsampleRollingRestart] [ILMDownsampleDisruptionIT#testILMDownsampleRollingRestart]: cleaning up after test	
  1> [2025-04-08T19:14:03,046][INFO ][o.e.c.r.a.AllocationService][node_t0][masterService#updateTask][T#1] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[downsample-1h-jmuazrphnw][0]]])." previous.health="YELLOW" reason="shards started [[downsample-1h-jmuazrphnw][0]]"	
  1> [2025-04-08T19:14:03,152][INFO ][o.e.c.m.MetadataDeleteIndexService][node_t0][masterService#updateTask][T#1] [jmuazrphnw/jiSlSDAeQO6fqoRaCDbFfw] deleting index	
  1> [2025-04-08T19:14:03,152][INFO ][o.e.c.m.MetadataDeleteIndexService][node_t0][masterService#updateTask][T#1] [downsample-1h-jmuazrphnw/IyrR_gMhQYqZyNl-lRZ1Aw] deleting index	
  1> [2025-04-08T19:14:03,279][INFO ][o.e.n.Node               ][testILMDownsampleRollingRestart] stopping ...	
  1> [2025-04-08T19:14:03,280][INFO ][o.e.c.f.AbstractFileWatchingService][[elasticsearch[file-watcher[/dev/shm/bk/bk-agent-prod-gcp-1744146337862119520/elastic/elasticsearch-periodic/x-pack/plugin/downsample/build/testrun/internalClusterTest/temp/org.elasticsearch.xpack.downsample.ILMDownsampleDisruptionIT_DADEFAB78F2B7B70-001/tempDir-002/config/operator]]]] shutting down watcher thread	
  1> [2025-04-08T19:14:03,281][INFO ][o.e.c.f.AbstractFileWatchingService][testILMDownsampleRollingRestart] watcher service stopped	
  1> [2025-04-08T19:14:03,284][ERROR][o.e.x.d.TransportDownsampleAction][testILMDownsampleRollingRestart] error while waiting for downsampling persistent task	
  1> org.elasticsearch.node.NodeClosedException: node closed {node_t0}{MW5bWJaET1iJiwssiM0xgg}{oeinVuTHQcuwlsnFxLl62Q}{node_t0}{127.0.0.1}{127.0.0.1:18581}{m}{9.1.0}{8000099-9021000}{xpack.installed=true}
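
The failure mode above, where test cleanup tears down nodes while the downsample task is still in flight, can be sketched in plain Java with no Elasticsearch dependencies. This is a hypothetical illustration, not the test's actual code: the executor stands in for a node, and `shutdownNow()` stands in for the early `@After` cleanup:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class EarlyCleanup {
    public static void main(String[] args) throws Exception {
        ExecutorService node = Executors.newSingleThreadExecutor();
        CountDownLatch taskStarted = new CountDownLatch(1);

        // Long-running background work, like the downsampling persistent task.
        Future<?> downsampleTask = node.submit(() -> {
            taskStarted.countDown();
            try {
                Thread.sleep(10_000); // still "downsampling" when cleanup begins
            } catch (InterruptedException e) {
                // The node was stopped underneath the task,
                // analogous to the NodeClosedException in the log above.
                throw new RuntimeException("node closed", e);
            }
        });

        taskStarted.await();
        node.shutdownNow(); // @After-style cleanup runs while the task is in flight

        try {
            downsampleTask.get();
            System.out.println("task completed");
        } catch (ExecutionException e) {
            System.out.println("task failed after cleanup: " + e.getCause().getMessage());
        }
    }
}
```

As in the log excerpt, the background task only observes the shutdown after cleanup has already begun, so the error surfaces during teardown rather than during the test body.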

@nielsbauman (Contributor)

My guess is that something different went wrong here. The errors after `cleaning up after test` are expected, because downsampling cannot complete on an index that has been removed (the cleanup deletes it). The exception that caused the test to fail is:

org.elasticsearch.index.IndexNotFoundException: no such index [downsample-1h-etxskwxmta]
	at __randomizedtesting.SeedInfo.seed([EF1F23CEBCC22294:5970946792D83C81]:0)
	at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.notFoundException(IndexNameExpressionResolver.java:786)	
	at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.ensureAliasOrIndexExists(IndexNameExpressionResolver.java:1569)	
	...	
	at org.elasticsearch.client.internal.IndicesAdminClient.getSettings(IndicesAdminClient.java:445)	
	at org.elasticsearch.xpack.downsample.ILMDownsampleDisruptionIT.lambda$startDownsampleTaskViaIlm$4(ILMDownsampleDisruptionIT.java:199)	
	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1507)	
	at org.elasticsearch.xpack.downsample.ILMDownsampleDisruptionIT.startDownsampleTaskViaIlm(ILMDownsampleDisruptionIT.java:195)	
	at org.elasticsearch.xpack.downsample.ILMDownsampleDisruptionIT.testILMDownsampleRollingRestart(ILMDownsampleDisruptionIT.java:172)
    ...

Meaning that this settings retrieval API call is failing:

var getSettingsResponse = client().admin()
    .indices()
    .getSettings(new GetSettingsRequest(TEST_REQUEST_TIMEOUT).indices(targetIndex))
    .actionGet();

The reason this happens, even though we check for index existence on the line above, is that these two API calls can hit different nodes: since #126051 and #125652, both of these actions run on the local node that receives the request, each answering from its own copy of the cluster state. We've seen cases before where one node sees cluster state version x while another still sees version x - 1, resulting in similar errors. I'll put up a fix later.
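
The race can be illustrated with a self-contained Java sketch (no Elasticsearch dependencies; the `Node` record and its `indexExists`/`getSettings` methods are hypothetical stand-ins for reads against each node's local cluster state):

```java
import java.util.Map;
import java.util.Set;

public class StaleStateRace {
    // A toy "node": each node answers requests from its own copy of the cluster state.
    record Node(String name, Set<String> knownIndices) {
        boolean indexExists(String index) {
            return knownIndices.contains(index);
        }

        Map<String, String> getSettings(String index) {
            if (!knownIndices.contains(index)) {
                throw new IllegalStateException("no such index [" + index + "]");
            }
            return Map.of("index.mode", "time_series");
        }
    }

    public static void main(String[] args) {
        String target = "downsample-1h-jmuazrphnw";
        Node current = new Node("node_t1", Set.of(target)); // has cluster state x
        Node stale = new Node("node_t0", Set.of());         // still at version x - 1

        // Racy pattern: the existence check and the settings call hit different nodes.
        if (current.indexExists(target)) {
            try {
                stale.getSettings(target); // fails despite the check above
            } catch (IllegalStateException e) {
                System.out.println("racy: " + e.getMessage());
            }
        }

        // Fixed pattern: route both calls to the same node (e.g. the master),
        // so the check and the read see one consistent view of the state.
        if (current.indexExists(target)) {
            System.out.println("pinned: " + current.getSettings(target));
        }
    }
}
```

The "pinned" pattern mirrors the eventual fix: waiting for the index to exist on the master node guarantees every node has seen at least that cluster-state version before the test proceeds.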

@nielsbauman added the `low-risk` label and removed the `needs:risk` label Apr 11, 2025
@nielsbauman nielsbauman self-assigned this Apr 11, 2025
nielsbauman added a commit to nielsbauman/elasticsearch that referenced this issue Apr 11, 2025
Wait for the index to exist on the master node to ensure all nodes have
the latest cluster state.

Fixes elastic#126495
nielsbauman added a commit that referenced this issue Apr 11, 2025
Wait for the index to exist on the master node to ensure all nodes have
the latest cluster state.

Fixes #126495