[CI] HdfsTests class failing #127290

Closed
elasticsearchmachine opened this issue Apr 23, 2025 · 5 comments · Fixed by #127534
Assignees
Labels
:Distributed Coordination/Snapshot/Restore – Anything directly related to the `_snapshot/*` APIs
low-risk – An open issue or test failure that is a low risk to future releases
Team:Distributed Coordination – Meta label for Distributed Coordination team
>test-failure – Triaged test failures from CI

Comments

@elasticsearchmachine
Collaborator

elasticsearchmachine commented Apr 23, 2025

Build Scans:

Reproduction Line:

undefined

Applicable branches:
8.19

Reproduces locally?:
N/A

Failure History:
See dashboard

Failure Message:

undefined

Issue Reasons:

  • [8.19] 5 consecutive failures in class org.elasticsearch.repositories.hdfs.HdfsTests
  • [8.19] 2 consecutive failures in step openjdk22_checkpart1_java-matrix
  • [8.19] 17 consecutive failures in step openjdk17_checkpart1_java-matrix
  • [8.19] 18 consecutive failures in step openjdk17_checkpart1_java-fips-matrix
  • [8.19] 19 consecutive failures in step openjdk21_checkpart1_java-matrix
  • [8.19] 18 consecutive failures in step graalvm-ce17_checkpart1_java-matrix
  • [8.19] 87 failures in class org.elasticsearch.repositories.hdfs.HdfsTests (14.5% fail rate in 602 executions)
  • [8.19] 15 failures in step openjdk22_checkpart1_java-matrix (88.2% fail rate in 17 executions)
  • [8.19] 17 failures in step openjdk17_checkpart1_java-matrix (100.0% fail rate in 17 executions)
  • [8.19] 18 failures in step openjdk17_checkpart1_java-fips-matrix (100.0% fail rate in 18 executions)
  • [8.19] 19 failures in step openjdk21_checkpart1_java-matrix (100.0% fail rate in 19 executions)
  • [8.19] 18 failures in step graalvm-ce17_checkpart1_java-matrix (100.0% fail rate in 18 executions)
  • [8.19] 19 failures in pipeline elasticsearch-periodic (100.0% fail rate in 19 executions)

Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.

@elasticsearchmachine added the :Core/Infra/Core (Core issues without another label) and >test-failure (Triaged test failures from CI) labels on Apr 23, 2025
@elasticsearchmachine
Collaborator Author

Pinging @elastic/es-core-infra (Team:Core/Infra)

@elasticsearchmachine added the Team:Core/Infra (Meta label for core/infra team) and needs:risk (Requires assignment of a risk label (low, medium, blocker)) labels on Apr 23, 2025
@rjernst added the :Data Management/HDFS (HDFS repository issues) label and removed the :Core/Infra/Core (Core issues without another label) label on Apr 23, 2025
@elasticsearchmachine added the Team:Data Management (Meta label for data/management team) label and removed the Team:Core/Infra (Meta label for core/infra team) label on Apr 23, 2025
@elasticsearchmachine
Collaborator Author

Pinging @elastic/es-data-management (Team:Data Management)

@nielsbauman
Contributor

This test class is part of snapshot/restore; rerouting to Distributed. #127287, #127288, and #127289 have already been correctly assigned to Distributed.

@nielsbauman added the :Distributed Coordination/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs) and Team:Distributed Coordination (Meta label for Distributed Coordination team) labels and removed the Team:Data Management (Meta label for data/management team) and :Data Management/HDFS (HDFS repository issues) labels on Apr 24, 2025
@elasticsearchmachine
Collaborator Author

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@nielsbauman
Contributor

Forgot to post the stack trace here for traceability:

Apr 24, 2025 5:29:09 AM com.carrotsearch.randomizedtesting.ThreadLeakControl checkThreadLeaks	
WARNING: Will linger awaiting termination of 4 leaked thread(s).	
Apr 24, 2025 5:29:14 AM com.carrotsearch.randomizedtesting.ThreadLeakControl checkThreadLeaks	
SEVERE: 4 threads leaked from SUITE scope at org.elasticsearch.repositories.hdfs.HdfsTests: 	
   1) Thread[id=79, name=ForkJoinPool.commonPool-worker-1, state=WAITING, group=TGRP-HdfsTests]	
        at java.base/jdk.internal.misc.Unsafe.park(Native Method)	
        at java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:371)	
        at java.base/java.util.concurrent.ForkJoinPool.awaitWork(ForkJoinPool.java:1893)	
        at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1809)	
        at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:188)	
   2) Thread[id=82, name=ForkJoinPool.commonPool-worker-4, state=WAITING, group=TGRP-HdfsTests]	
        at java.base/jdk.internal.misc.Unsafe.park(Native Method)	
        at java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:371)	
        at java.base/java.util.concurrent.ForkJoinPool.awaitWork(ForkJoinPool.java:1893)	
        at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1809)	
        at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:188)	
   3) Thread[id=81, name=ForkJoinPool.commonPool-worker-3, state=TIMED_WAITING, group=TGRP-HdfsTests]	
        at java.base/jdk.internal.misc.Unsafe.park(Native Method)	
        at java.base/java.util.concurrent.locks.LockSupport.parkUntil(LockSupport.java:449)	
        at java.base/java.util.concurrent.ForkJoinPool.awaitWork(ForkJoinPool.java:1891)	
        at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1809)	
        at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:188)	
   4) Thread[id=80, name=ForkJoinPool.commonPool-worker-2, state=WAITING, group=TGRP-HdfsTests]	
        at java.base/jdk.internal.misc.Unsafe.park(Native Method)	
        at java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:371)	
        at java.base/java.util.concurrent.ForkJoinPool.awaitWork(ForkJoinPool.java:1893)	
        at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1809)	
        at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:188)	
Apr 24, 2025 5:29:14 AM com.carrotsearch.randomizedtesting.ThreadLeakControl tryToInterruptAll	
INFO: Starting to interrupt leaked threads:	
   1) Thread[id=79, name=ForkJoinPool.commonPool-worker-1, state=WAITING, group=TGRP-HdfsTests]	
   2) Thread[id=82, name=ForkJoinPool.commonPool-worker-4, state=WAITING, group=TGRP-HdfsTests]	
   3) Thread[id=81, name=ForkJoinPool.commonPool-worker-3, state=TIMED_WAITING, group=TGRP-HdfsTests]	
   4) Thread[id=80, name=ForkJoinPool.commonPool-worker-2, state=WAITING, group=TGRP-HdfsTests]	
Apr 24, 2025 5:29:17 AM com.carrotsearch.randomizedtesting.ThreadLeakControl tryToInterruptAll	
SEVERE: There are still zombie threads that couldn't be terminated:	
   1) Thread[id=79, name=ForkJoinPool.commonPool-worker-1, state=WAITING, group=TGRP-HdfsTests]	
        at java.base/jdk.internal.misc.Unsafe.park(Native Method)	
        at java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:371)	
        at java.base/java.util.concurrent.ForkJoinPool.awaitWork(ForkJoinPool.java:1893)	
        at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1809)	
        at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:188)	
   2) Thread[id=82, name=ForkJoinPool.commonPool-worker-4, state=WAITING, group=TGRP-HdfsTests]	
        at java.base/jdk.internal.misc.Unsafe.park(Native Method)	
        at java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:371)	
        at java.base/java.util.concurrent.ForkJoinPool.awaitWork(ForkJoinPool.java:1893)	
        at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1809)	
        at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:188)	
   3) Thread[id=81, name=ForkJoinPool.commonPool-worker-3, state=TIMED_WAITING, group=TGRP-HdfsTests]	
        at java.base/jdk.internal.misc.Unsafe.park(Native Method)	
        at java.base/java.util.concurrent.locks.LockSupport.parkUntil(LockSupport.java:449)	
        at java.base/java.util.concurrent.ForkJoinPool.awaitWork(ForkJoinPool.java:1891)	
        at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1809)	
        at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:188)	
   4) Thread[id=80, name=ForkJoinPool.commonPool-worker-2, state=WAITING, group=TGRP-HdfsTests]	
        at java.base/jdk.internal.misc.Unsafe.park(Native Method)	
        at java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:371)	
        at java.base/java.util.concurrent.ForkJoinPool.awaitWork(ForkJoinPool.java:1893)	
        at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1809)	
        at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:188)

@JeremyDahlgren self-assigned this on Apr 29, 2025
JeremyDahlgren added a commit to JeremyDahlgren/elasticsearch that referenced this issue Apr 29, 2025
Changes "ForkJoinPool-" to "ForkJoinPool." in the
Thread getName().startsWith() checks in
HdfsClientThreadLeakFilter.  This resolves the
"There are still zombie threads that couldn't be terminated"
errors in the Hdfs IT tests.

Closes elastic#127290
Closes elastic#127289
Closes elastic#127288
Closes elastic#127287
@JeremyDahlgren added the low-risk (An open issue or test failure that is a low risk to future releases) label and removed the needs:risk (Requires assignment of a risk label (low, medium, blocker)) label on Apr 29, 2025
JeremyDahlgren added a commit that referenced this issue May 2, 2025
Adds the ForkJoinPool.commonPool-worker- prefix to the
Thread getName().startsWith() checks in HdfsClientThreadLeakFilter.
This resolves the
"There are still zombie threads that couldn't be terminated"
errors in the Hdfs IT tests.

Closes #127290
Closes #127289
Closes #127288
Closes #127287
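
For illustration, here is a minimal sketch of the kind of check the commit describes, written against the randomizedtesting `ThreadFilter` interface; the class name below is a placeholder, and the actual `HdfsClientThreadLeakFilter` in the repository may excuse additional thread names:

```java
import com.carrotsearch.randomizedtesting.ThreadFilter;

// Sketch only (hypothetical class name): the real HdfsClientThreadLeakFilter
// may also excuse other HDFS client threads.
public class CommonPoolLeakFilterSketch implements ThreadFilter {
    @Override
    public boolean reject(Thread t) {
        // JDK common-pool workers are named "ForkJoinPool.commonPool-worker-N",
        // while workers of a dedicated ForkJoinPool instance are named
        // "ForkJoinPool-N-worker-M". A check against the "ForkJoinPool-" prefix
        // therefore never matches the common pool, which is why the lingering
        // workers in the stack trace above were reported as zombie threads.
        return t.getName().startsWith("ForkJoinPool.commonPool-worker-");
    }
}
```

Returning `true` from `reject` tells the test framework to ignore that thread during leak detection; such a filter is typically registered on the test class via the `@ThreadLeakFilters` annotation.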
JeremyDahlgren added a commit to JeremyDahlgren/elasticsearch that referenced this issue May 4, 2025
…#127534)

Adds the ForkJoinPool.commonPool-worker- prefix to the
Thread getName().startsWith() checks in HdfsClientThreadLeakFilter.
This resolves the
"There are still zombie threads that couldn't be terminated"
errors in the Hdfs IT tests.

Closes elastic#127290
Closes elastic#127289
Closes elastic#127288
Closes elastic#127287

(cherry picked from commit 4408e38)