[CI] CohereServiceUpgradeIT class failing #121537

Open
elasticsearchmachine opened this issue Feb 3, 2025 · 4 comments
Labels
low-risk An open issue or test failure that is a low risk to future releases :ml Machine learning Team:ML Meta label for the ML team >test-failure Triaged test failures from CI

Comments

@elasticsearchmachine
Collaborator

elasticsearchmachine commented Feb 3, 2025

Build Scans:

Reproduction Line:

undefined

Applicable branches:
8.18

Reproduces locally?:
N/A

Failure History:
See dashboard

Failure Message:

undefined

Issue Reasons:

  • [8.18] 3 failures in class org.elasticsearch.xpack.application.CohereServiceUpgradeIT (1.0% fail rate in 291 executions)
  • [8.18] 2 failures in step openjdk17_8.17.6_java-fips-matrix-bwc (13.3% fail rate in 15 executions)
  • [8.18] 3 failures in pipeline elasticsearch-periodic (20.0% fail rate in 15 executions)
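The percentages in these lists are consistent with failures divided by executions, rounded to one decimal place. A minimal sketch of that arithmetic follows; the class and method names are hypothetical, and this is not the triage automation's actual code.

```java
import java.util.Locale;

public class FailRate {
    // Percentage of failed executions, formatted with one decimal place.
    // Hypothetical helper for illustration only.
    static String percent(int failures, int executions) {
        if (executions <= 0) {
            throw new IllegalArgumentException("executions must be positive");
        }
        return String.format(Locale.ROOT, "%.1f%%", 100.0 * failures / executions);
    }

    public static void main(String[] args) {
        System.out.println(percent(3, 291)); // prints 1.0%
        System.out.println(percent(2, 15));  // prints 13.3%
        System.out.println(percent(3, 15));  // prints 20.0%
    }
}
```

Note that the class-level rate (1.0%) is much lower than the step and pipeline rates because it is averaged over far more executions.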

Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.

@elasticsearchmachine elasticsearchmachine added :ml Machine learning >test-failure Triaged test failures from CI labels Feb 3, 2025
elasticsearchmachine added a commit that referenced this issue Feb 3, 2025
…lasticsearch.xpack.application.CohereServiceUpgradeIT #121537
@elasticsearchmachine
Collaborator Author

This has been muted on branch 8.x

Mute Reasons:

  • [8.x] 9 failures in class org.elasticsearch.xpack.application.CohereServiceUpgradeIT (5.6% fail rate in 160 executions)
  • [8.x] 2 failures in step 8.16.4_bwc-snapshots (16.7% fail rate in 12 executions)
  • [8.x] 2 failures in step 8.17.2_bwc-snapshots (16.7% fail rate in 12 executions)
  • [8.x] 2 failures in step 8.18.0_bwc-snapshots (16.7% fail rate in 12 executions)
  • [8.x] 2 failures in step 8.19.0_bwc-snapshots (15.4% fail rate in 13 executions)
  • [8.x] 2 failures in pipeline elasticsearch-pull-request (25.0% fail rate in 8 executions)

Build Scans:

@elasticsearchmachine
Collaborator Author

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine elasticsearchmachine added Team:ML Meta label for the ML team needs:risk Requires assignment of a risk label (low, medium, blocker) labels Feb 3, 2025
@davidkyle
Member

The PR build failures all come from the same PR, #121493, and are all NPEs, probably caused by the changes in that PR. #121493 has since been merged into the 8.x branch.

CohereServiceUpgradeIT > testCohereEmbeddings {upgradedNodes=3} FAILED
    org.elasticsearch.client.ResponseException: method [DELETE], host [http://[::1]:36237], URI [_inference/upgraded-cluster-embeddings-byte], status line [HTTP/1.1 500 Internal Server Error]
    {"error":{"root_cause":[],"type":"search_phase_execution_exception","reason":"Phase failed","phase":"fetch","grouped":true,"failed_shards":[],"caused_by":{"type":"null_pointer_exception","reason":"Cannot invoke \"java.util.List.isEmpty()\" because \"toReduce\" is null"}},"status":500}
        at __randomizedtesting.SeedInfo.seed([12FB1BE175EC8D8:1ADBF121000835B8]:0)
        at app//org.elasticsearch.client.RestClient.convertResponse(RestClient.java:351)
        at app//org.elasticsearch.client.RestClient.performRequest(RestClient.java:317)
        at app//org.elasticsearch.client.RestClient.performRequest(RestClient.java:292)
        at app//org.elasticsearch.xpack.application.InferenceUpgradeTestCase.delete(InferenceUpgradeTestCase.java:62)
        at app//org.elasticsearch.xpack.application.CohereServiceUpgradeIT.testCohereEmbeddings(CohereServiceUpgradeIT.java:145)
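For readers unfamiliar with this error shape: the `caused_by` reason says `isEmpty()` was called on a `List` reference named `toReduce` that was null. The sketch below reproduces that kind of failure in isolation and shows a null guard next to it. Everything here is hypothetical (class name, method names, the summing logic); it is not the Elasticsearch code path, and whether a null guard is the right fix for the change in #121493 is not established in this thread.

```java
import java.util.List;

public class NullReduceSketch {
    // Hypothetical stand-in for a reduce step; NOT the Elasticsearch implementation.
    static int reduceUnsafe(List<Integer> toReduce) {
        if (toReduce.isEmpty()) { // throws NullPointerException when toReduce is null
            return 0;
        }
        return toReduce.stream().mapToInt(Integer::intValue).sum();
    }

    // Same step with an explicit guard: a null list is treated like an empty one.
    static int reduceGuarded(List<Integer> toReduce) {
        if (toReduce == null || toReduce.isEmpty()) {
            return 0;
        }
        return toReduce.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        try {
            reduceUnsafe(null);
        } catch (NullPointerException e) {
            // The exact message text depends on the JVM version and compiler flags.
            System.out.println("NPE: " + e.getMessage());
        }
        System.out.println(reduceGuarded(null));
        System.out.println(reduceGuarded(List.of(1, 2, 3)));
    }
}
```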

That leaves one unexplained failure: elasticsearch-periodic #6172 / openjdk23_8.16.4_java-matrix-bwc

Caused by:
    java.lang.RuntimeException: Timed out after PT3M waiting for ports files for: { cluster: 'test-cluster', node: 'test-cluster-2' }
        at org.elasticsearch.test.cluster.local.AbstractLocalClusterFactory$Node.waitUntilReady(AbstractLocalClusterFactory.java:311)
        at org.elasticsearch.test.cluster.local.AbstractLocalClusterFactory$Node.getTransportEndpoint(AbstractLocalClusterFactory.java:216)

There's nothing in the logs to indicate why the node did not start up. Possibly it was just very slow: the build timeline shows very high CPU usage while the test was running.
https://gradle-enterprise.elastic.co/s/6hpshewbj34de/timeline
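As background on the message format: `PT3M` is how `java.time.Duration` prints three minutes, so the node had three minutes to write its ports files. A rough sketch of a poll-until-file-exists wait with that kind of timeout is below. The helper, its parameters, and the file name are hypothetical; this is not the `AbstractLocalClusterFactory` implementation, only an illustration of why a slow node start on a CPU-starved machine would surface as this exception.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;

public class WaitForFileSketch {
    // Polls until the file exists or the timeout elapses.
    // Hypothetical helper; NOT the AbstractLocalClusterFactory implementation.
    static void waitForFile(Path file, Duration timeout, Duration pollInterval) {
        Instant deadline = Instant.now().plus(timeout);
        while (Files.notExists(file)) {
            if (Instant.now().isAfter(deadline)) {
                throw new RuntimeException("Timed out after " + timeout + " waiting for " + file);
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                throw new RuntimeException("Interrupted while waiting for " + file, e);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(Duration.ofMinutes(3)); // prints PT3M
        Path missing = Path.of("does-not-exist.ports"); // hypothetical file name
        try {
            // Short timeout so the sketch finishes quickly.
            waitForFile(missing, Duration.ofMillis(200), Duration.ofMillis(50));
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
        }
    }
}
```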

@davidkyle davidkyle added low-risk An open issue or test failure that is a low risk to future releases and removed needs:risk Requires assignment of a risk label (low, medium, blocker) labels Feb 4, 2025
@elasticsearchmachine
Collaborator Author

This has been muted on branch main

Mute Reasons:

  • [main] 19 failures in class org.elasticsearch.xpack.application.CohereServiceUpgradeIT (2.5% fail rate in 751 executions)
  • [main] 6 failures in step 8.19.0_bwc-snapshots (2.7% fail rate in 224 executions)
  • [main] 12 failures in step 9.0.0_bwc-snapshots (5.4% fail rate in 223 executions)
  • [main] 14 failures in pipeline elasticsearch-pull-request (7.7% fail rate in 181 executions)

Build Scans:

elasticsearchmachine added a commit that referenced this issue Feb 7, 2025
…lasticsearch.xpack.application.CohereServiceUpgradeIT #121537