[CI] S3BlobStoreRepositoryTests testMetrics failing #101608

Closed
cbuescher opened this issue Oct 31, 2023 · 16 comments
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs low-risk An open issue or test failure that is a low risk to future releases stalled Team:Distributed Coordination Meta label for Distributed Coordination team >test-failure Triaged test failures from CI

Comments

@cbuescher
Member

Build scan:
https://gradle-enterprise.elastic.co/s/cmzydsjar4s3c/tests/:modules:repository-s3:internalClusterTest/org.elasticsearch.repositories.s3.S3BlobStoreRepositoryTests/testMetrics
Reproduction line:

./gradlew ':modules:repository-s3:internalClusterTest' --tests "org.elasticsearch.repositories.s3.S3BlobStoreRepositoryTests.testMetrics" -Dtests.seed=56B8E75E80922D01 -Dbuild.snapshot=false -Dtests.jvm.argline="-Dbuild.snapshot=false" -Dtests.locale=bg-BG -Dtests.timezone=America/Kentucky/Louisville -Druntime.java=21

Applicable branches:
main

Reproduces locally?:
No

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.repositories.s3.S3BlobStoreRepositoryTests&tests.test=testMetrics

Failure excerpt:

java.lang.AssertionError: 
Expected: <10L>
     but: was <9L>

  at __randomizedtesting.SeedInfo.seed([56B8E75E80922D01:A8AA35FC9C985AB2]:0)
  at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
  at org.junit.Assert.assertThat(Assert.java:956)
  at org.junit.Assert.assertThat(Assert.java:923)
  at org.elasticsearch.repositories.s3.S3BlobStoreRepositoryTests.lambda$testMetrics$3(S3BlobStoreRepositoryTests.java:285)
  at java.util.ArrayList.forEach(ArrayList.java:1596)
  at org.elasticsearch.repositories.s3.S3BlobStoreRepositoryTests.testMetrics(S3BlobStoreRepositoryTests.java:278)
  at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
  at java.lang.reflect.Method.invoke(Method.java:580)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:1583)

@cbuescher cbuescher added :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI labels Oct 31, 2023
@elasticsearchmachine elasticsearchmachine added blocker Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. labels Oct 31, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@cbuescher
Member Author

And another one from today

@ywangd ywangd self-assigned this Oct 31, 2023
@ywangd ywangd added low-risk An open issue or test failure that is a low risk to future releases and removed blocker labels Oct 31, 2023
@ywangd
Member

ywangd commented Oct 31, 2023

Relabeling this to low-risk since it is an off-by-one error in a metric count comparison, which is not on a critical path.

@fcofdez
Contributor

fcofdez commented Nov 8, 2023

I think this is likely a duplicate of #88841

@ywangd ywangd removed their assignment Nov 20, 2023
@DaveCTurner DaveCTurner self-assigned this Nov 20, 2023
elasticsearchmachine pushed a commit that referenced this issue Nov 20, 2023
In #101608 we saw one of these assertions fail, but it's impossible to
know which one without some more details. This commit adds descriptions
to the assertions in the loop.
@DaveCTurner
Contributor

Improved the test output on failure in #102386 and #102387, now waiting on another failure to confirm.
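
For readers unfamiliar with the idea of adding descriptions to assertions in a loop, here is a minimal sketch of the technique. It is not the actual test code: the class, the helper names, and the operation names (`GetObject`, `PutObject`) are made up for illustration, and only the 10-versus-9 mismatch is taken from the failure excerpt above. Hamcrest's `assertThat` also has an overload that takes a reason string as its first argument; the sketch uses a plain `AssertionError` instead so that it stays self-contained.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DescribedAssertions {

    // Hypothetical helper: compares two counts and puts a description in the failure message.
    static void assertCountEquals(String description, long expected, long actual) {
        if (expected != actual) {
            throw new AssertionError(description + ": expected <" + expected + "> but was <" + actual + ">");
        }
    }

    // Checks each operation in turn; the description names the operation being compared,
    // so a failure inside the loop identifies which comparison went wrong.
    static void assertAllCounts(Map<String, Long> expectedCounts, Map<String, Long> actualCounts) {
        expectedCounts.forEach((operation, expected) -> {
            long actual = actualCounts.getOrDefault(operation, 0L);
            assertCountEquals("request count for operation [" + operation + "]", expected, actual);
        });
    }

    public static void main(String[] args) {
        // Illustrative data only; the operation names are assumptions, not taken from the issue.
        Map<String, Long> expected = new LinkedHashMap<>();
        expected.put("GetObject", 10L);
        expected.put("PutObject", 4L);

        Map<String, Long> actual = new LinkedHashMap<>();
        actual.put("GetObject", 9L);
        actual.put("PutObject", 4L);

        try {
            assertAllCounts(expected, actual);
        } catch (AssertionError e) {
            // Without the description, the message would not say which operation mismatched.
            System.out.println(e.getMessage());
        }
    }
}
```

Without the description, a failure inside the loop reports only the two numbers, which is the situation the original stack trace left us in.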

@DaveCTurner DaveCTurner removed their assignment Nov 22, 2023
DaveCTurner added a commit that referenced this issue May 27, 2024
ywangd added a commit to ywangd/elasticsearch that referenced this issue May 27, 2024
With logging restriction (elastic#105020), the networkTrace flag needs to be
set for AWS request debug logging.

Relates: elastic#101608
@ywangd
Member

ywangd commented May 27, 2024

Unfortunately the AWS debug logging was disabled due to #105020. I raised #109068 to reenable it. I'll ask core-infra whether it is possible to skip logger checking for tests.

@ywangd ywangd added the stalled label May 27, 2024
elasticsearchmachine pushed a commit that referenced this issue May 28, 2024
With logging restriction (#105020), the networkTrace flag needs to be
set for AWS request debug logging.

Relates: #101608
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jun 17, 2024
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jun 17, 2024
In elastic#101608 we saw one of these assertions fail, but it's impossible to
know which one without some more details. This commit adds descriptions
to the assertions in the loop.
@ywangd
Member

ywangd commented Aug 6, 2024

It still has not failed yet since May 28.

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Oct 4, 2024
We're awaiting more information about failures of this test, so we need
to actually run it occasionally...

Relates elastic#101608
@DaveCTurner
Contributor

> It still has not failed yet since May 28.

Perhaps because it has been muted since then, see 520a159 🤦 I opened #114129 to start running the test again.

elasticsearchmachine pushed a commit that referenced this issue Oct 4, 2024
We're awaiting more information about failures of this test, so we need
to actually run it occasionally...

Relates #101608
matthewabbott pushed a commit to matthewabbott/elasticsearch that referenced this issue Oct 10, 2024
We're awaiting more information about failures of this test, so we need
to actually run it occasionally...

Relates elastic#101608
@pxsalehi
Member

I've been running this test over the past couple of days with stress-ng on and off randomly: over 20k runs and no failure. IMO, we can close it since it doesn't reproduce.

@DaveCTurner
Contributor

I also couldn't reproduce it on repeated runs, but it was failing very rarely in CI even before we muted it. I still think it's an issue though.

@repantis repantis added Team:Distributed Coordination Meta label for Distributed Coordination team and removed Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. labels Nov 5, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

ywangd added a commit to ywangd/elasticsearch that referenced this issue Nov 11, 2024
If sending request fails locally without reaching the server, the
retryable exception is logged differently. This PR enables the logging
for this scenario.

Relates: elastic#88841
Relates: elastic#101608
elasticsearchmachine pushed a commit that referenced this issue Nov 13, 2024
If sending request fails locally without reaching the server, the
retryable exception is logged differently. This PR enables the logging
for this scenario.

Relates: #88841 Relates: #101608
ywangd added a commit to ywangd/elasticsearch that referenced this issue Nov 13, 2024
If sending request fails locally without reaching the server, the
retryable exception is logged differently. This PR enables the logging
for this scenario.

Relates: elastic#88841 Relates: elastic#101608
(cherry picked from commit 5204902)

# Conflicts:
#	modules/repository-s3/src/internalClusterTest/java/org/elasticsearch/repositories/s3/S3BlobStoreRepositoryTests.java
elasticsearchmachine pushed a commit that referenced this issue Nov 13, 2024
If sending request fails locally without reaching the server, the
retryable exception is logged differently. This PR enables the logging
for this scenario.

Relates: #88841 Relates: #101608
(cherry picked from commit 5204902)

# Conflicts:
#	modules/repository-s3/src/internalClusterTest/java/org/elasticsearch/repositories/s3/S3BlobStoreRepositoryTests.java
jozala pushed a commit that referenced this issue Nov 13, 2024
If sending request fails locally without reaching the server, the
retryable exception is logged differently. This PR enables the logging
for this scenario.

Relates: #88841 Relates: #101608
smalyshev pushed a commit to smalyshev/elasticsearch that referenced this issue Nov 13, 2024
If sending request fails locally without reaching the server, the
retryable exception is logged differently. This PR enables the logging
for this scenario.

Relates: elastic#88841 Relates: elastic#101608
afoucret pushed a commit to afoucret/elasticsearch that referenced this issue Nov 14, 2024
If sending request fails locally without reaching the server, the
retryable exception is logged differently. This PR enables the logging
for this scenario.

Relates: elastic#88841 Relates: elastic#101608
alexey-ivanov-es pushed a commit to alexey-ivanov-es/elasticsearch that referenced this issue Nov 28, 2024
If sending request fails locally without reaching the server, the
retryable exception is logged differently. This PR enables the logging
for this scenario.

Relates: elastic#88841 Relates: elastic#101608
@ywangd
Member

ywangd commented Jan 13, 2025

Since the last update (2 months ago), there has not been any actual CI failure. The only failure is unrelated and due to testing of upload checksum.

@ywangd
Member

ywangd commented Feb 17, 2025

No new failure. Keep waiting ...

@nicktindall
Contributor

This hasn't failed in the last month; I think that means it's been 4 months since it failed. I'm inclined to close this.

Failures in last month: https://es-delivery-stats.elastic.dev/app/r/s/uOoHS

@DaveCTurner
Contributor

With the AWS SDK now upgraded to v2 in #126843 I believe this test failure is now either gone or, at the very least, changed beyond all recognition, so I'm closing it.
