[CI] FrozenSearchableSnapshotsIntegTests testCreateAndRestorePartialSearchableSnapshot failing #123773

Open
elasticsearchmachine opened this issue Feb 28, 2025 · 4 comments
Labels
  • :Distributed Indexing/Distributed (A catch all label for anything in the Distributed Indexing Area. Please avoid if you can.)
  • low-risk (An open issue or test failure that is a low risk to future releases)
  • Team:Distributed Indexing (Meta label for Distributed Indexing team)
  • >test-failure (Triaged test failures from CI)

Comments

@elasticsearchmachine
Collaborator

Build Scans:

Reproduction Line:

./gradlew ":x-pack:plugin:searchable-snapshots:internalClusterTest" --tests "org.elasticsearch.xpack.searchablesnapshots.FrozenSearchableSnapshotsIntegTests.testCreateAndRestorePartialSearchableSnapshot" -Dtests.seed=2CDF29528A255FC4 -Dtests.locale=hu-Latn-HU -Dtests.timezone=ROK -Druntime.java=23

Applicable branches:
main

Reproduces locally?:
N/A

Failure History:
See dashboard

Failure Message:

java.lang.AssertionError: Searchable snapshot directory does not support the operation [createOutput(corrupted_gTNu35_ERpCZ9TsjuuE44w, IOContext[context=DEFAULT, mergeInfo=null, flushInfo=null, readAdvice=RANDOM])], current directory files: _c.cfe,_c.cfs,_c.si,_c_h.fnm,_c_h_Asserting_0.dvd,_c_h_Asserting_0.dvm,_m.cfe,_m.cfs,_m.si,_m_8.fnm,_m_8_Asserting_0.dvd,_m_8_Asserting_0.dvm,_t.cfe,_t.cfs,_t.si,segments_4

Issue Reasons:

  • [main] 2 failures in test testCreateAndRestorePartialSearchableSnapshot (0.4% fail rate in 464 executions)
  • [main] 2 failures in step part2 (3.8% fail rate in 53 executions)
  • [main] 2 failures in pipeline elasticsearch-intake (3.8% fail rate in 53 executions)

Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.

@elasticsearchmachine elasticsearchmachine added :Distributed Indexing/Distributed A catch all label for anything in the Distributed Indexing Area. Please avoid if you can. >test-failure Triaged test failures from CI labels Feb 28, 2025
elasticsearchmachine added a commit that referenced this issue Feb 28, 2025
…shotsIntegTests testCreateAndRestorePartialSearchableSnapshot #123773
@elasticsearchmachine
Collaborator Author

This has been muted on branch main

Mute Reasons:

  • [main] 2 failures in test testCreateAndRestorePartialSearchableSnapshot (0.4% fail rate in 464 executions)
  • [main] 2 failures in step part2 (3.8% fail rate in 53 executions)
  • [main] 2 failures in pipeline elasticsearch-intake (3.8% fail rate in 53 executions)

Build Scans:

@elasticsearchmachine elasticsearchmachine added needs:risk Requires assignment of a risk label (low, medium, blocker) Team:Distributed Indexing Meta label for Distributed Indexing team labels Feb 28, 2025
@elasticsearchmachine
Collaborator Author

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

@tlrx tlrx added the low-risk An open issue or test failure that is a low risk to future releases label Mar 5, 2025
@elasticsearchmachine elasticsearchmachine removed the needs:risk Requires assignment of a risk label (low, medium, blocker) label Mar 5, 2025
@bcully
Contributor

bcully commented Apr 15, 2025

This does reproduce locally; it just took 1000 iterations.

@bcully
Contributor

bcully commented Apr 15, 2025

This is most likely another shutdown race.

[2025-03-01T07:51:08,554][INFO ][o.e.n.Node               ] [testCreateAndRestorePartialSearchableSnapshot] closed
...
márc. 01, 2025 7:51:08 DE. com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException

The stack trace suggests we are doing a shard recovery (probably an automatic one) that is prepared to handle an AlreadyClosedException. Instead we are getting an IOException, which causes us to try to mark the store as corrupted.
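To make the suspected failure mode concrete, here is a minimal, dependency-free sketch of that classification. It is not the Elasticsearch recovery code: `ShutdownRaceSketch`, `ChecksumCheck`, `Outcome` and `verify` are hypothetical names, and the nested `AlreadyClosedException` is only a stand-in for Lucene's class of the same name. The assumption is that a "closed" signal is tolerated during shutdown, while any other I/O failure is interpreted as corruption.

```java
import java.io.IOException;

public class ShutdownRaceSketch {

    /** Stand-in for Lucene's AlreadyClosedException, redefined so the sketch has no dependencies. */
    static class AlreadyClosedException extends IllegalStateException {
        AlreadyClosedException(String message) {
            super(message);
        }
    }

    /** Hypothetical outcomes of a checksum verification pass. */
    enum Outcome { OK, IGNORED_CLOSED, MARKED_CORRUPTED }

    /** Hypothetical checksum step that may fail while reading a file. */
    interface ChecksumCheck {
        void run() throws IOException;
    }

    /**
     * Runs the check and classifies the failure. A "closed" signal is treated
     * as a benign shutdown; any IOException is treated as corruption, even if
     * its real cause is that the node was stopping underneath the reader.
     */
    static Outcome verify(ChecksumCheck check) {
        try {
            check.run();
            return Outcome.OK;
        } catch (AlreadyClosedException e) {
            return Outcome.IGNORED_CLOSED;
        } catch (IOException e) {
            return Outcome.MARKED_CORRUPTED;
        }
    }

    public static void main(String[] args) {
        System.out.println(verify(() -> { }));
        System.out.println(verify(() -> { throw new AlreadyClosedException("store closed"); }));
        // A cache read that fails during shutdown surfaces as a plain IOException.
        System.out.println(verify(() -> { throw new IOException("failed to read data from cache"); }));
    }
}
```

Under this assumption, the third case would take the corruption path, which is consistent with the `createOutput(corrupted_...)` call reported in the failure message above.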

We are in the middle of a full cluster restart triggered here. In this run, we're also doing expensive checksum validation at shard startup.

[2025-03-01T07:51:08,468][INFO ][o.e.t.InternalTestCluster] [testCreateAndRestorePartialSearchableSnapshot] Stopping and resetting node [node_s1] 
...
[2025-03-01T07:51:08,554][INFO ][o.e.n.Node               ] [testCreateAndRestorePartialSearchableSnapshot] closed	
[2025-03-01T07:51:08,554][INFO ][o.e.i.s.IndexShard       ] [node_s1] [tvttlliquc][0] check index [ok]: checksum check passed on [_c_h.fnm]	
[2025-03-01T07:51:08,555][INFO ][o.e.i.s.IndexShard       ] [node_s1] [tvttlliquc][0] check index [ok]: checksum check passed on [_m_8.fnm]	
[2025-03-01T07:51:08,555][INFO ][o.e.i.s.IndexShard       ] [node_s1] [tvttlliquc][0] check index [ok]: checksum check passed on [_m.cfs]	
[2025-03-01T07:51:08,555][INFO ][o.e.i.s.IndexShard       ] [node_s1] [tvttlliquc][0] check index [ok]: checksum check passed on [_m_8_Asserting_0.dvd]	
[2025-03-01T07:51:08,555][INFO ][o.e.i.s.IndexShard       ] [node_s1] [tvttlliquc][0] check index [ok]: checksum check passed on [_t.cfs]	
[2025-03-01T07:51:08,556][WARN ][o.e.i.s.IndexShard       ] [node_s1] [tvttlliquc][0] check index [failure]: checksum failed on [_c.cfs]	
java.io.IOException: failed to read data from cache for [CacheFileReference{cacheKey='CacheKey[snapshotUUID=Ex9qOi2mQkGxeAS37EhaaQ, snapshotIndexName=bpynpmvnoc, shardId=[tvttlliquc][0], fileName=_c.cfs]', fileLength=481290, acquired=false}]

It looks like we are running validation on this shard (while opening it?) after the node has been stopped.

3 participants