[CI] CohereServiceUpgradeIT class failing #121537

elasticsearchmachine · 2025-02-03T14:58:31Z

Build Scans:

Reproduction Line:

undefined

Applicable branches:
8.18

Reproduces locally?:
N/A

Failure History:
See dashboard

Failure Message:

undefined

Issue Reasons:

[8.18] 3 failures in class org.elasticsearch.xpack.application.CohereServiceUpgradeIT (1.0% fail rate in 291 executions)
[8.18] 2 failures in step openjdk17_8.17.6_java-fips-matrix-bwc (13.3% fail rate in 15 executions)
[8.18] 3 failures in pipeline elasticsearch-periodic (20.0% fail rate in 15 executions)
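The percentages above are consistent with failures divided by executions, rounded to one decimal place. A minimal sketch of that arithmetic, assuming this is how the figures are derived (the `FailRate` class and `percent` method are hypothetical names, not part of the triage automation):

```java
import java.util.Locale;

// Hypothetical helper; the real triage automation may compute this differently.
final class FailRate {
    // Share of failed executions as a percentage with one decimal place.
    static String percent(int failures, int executions) {
        if (executions <= 0) {
            throw new IllegalArgumentException("executions must be positive");
        }
        return String.format(Locale.ROOT, "%.1f%%", 100.0 * failures / executions);
    }

    public static void main(String[] args) {
        System.out.println(percent(3, 291)); // 1.0%
        System.out.println(percent(2, 15));  // 13.3%
        System.out.println(percent(3, 15));  // 20.0%
    }
}
```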

Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.

elasticsearchmachine · 2025-02-03T14:58:34Z

This has been muted on branch 8.x

Mute Reasons:

[8.x] 9 failures in class org.elasticsearch.xpack.application.CohereServiceUpgradeIT (5.6% fail rate in 160 executions)
[8.x] 2 failures in step 8.16.4_bwc-snapshots (16.7% fail rate in 12 executions)
[8.x] 2 failures in step 8.17.2_bwc-snapshots (16.7% fail rate in 12 executions)
[8.x] 2 failures in step 8.18.0_bwc-snapshots (16.7% fail rate in 12 executions)
[8.x] 2 failures in step 8.19.0_bwc-snapshots (15.4% fail rate in 13 executions)
[8.x] 2 failures in pipeline elasticsearch-pull-request (25.0% fail rate in 8 executions)

Build Scans:

elasticsearchmachine · 2025-02-03T14:58:55Z

Pinging @elastic/ml-core (Team:ML)

davidkyle · 2025-02-04T13:44:32Z

The PR build failures all come from the same PR, #121493, and are all NPEs, probably caused by the changes in that PR. #121493 has now been merged into the 8.x branch.

CohereServiceUpgradeIT > testCohereEmbeddings {upgradedNodes=3} FAILED
org.elasticsearch.client.ResponseException: method [DELETE], host [http://[::1]:36237], URI [_inference/upgraded-cluster-embeddings-byte], status line [HTTP/1.1 500 Internal Server Error]
{"error":{"root_cause":[],"type":"search_phase_execution_exception","reason":"Phase failed","phase":"fetch","grouped":true,"failed_shards":[],"caused_by":{"type":"null_pointer_exception","reason":"Cannot invoke \"java.util.List.isEmpty()\" because \"toReduce\" is null"}},"status":500}
    at __randomizedtesting.SeedInfo.seed([12FB1BE175EC8D8:1ADBF121000835B8]:0)
    at app//org.elasticsearch.client.RestClient.convertResponse(RestClient.java:351)
    at app//org.elasticsearch.client.RestClient.performRequest(RestClient.java:317)
    at app//org.elasticsearch.client.RestClient.performRequest(RestClient.java:292)
    at app//org.elasticsearch.xpack.application.InferenceUpgradeTestCase.delete(InferenceUpgradeTestCase.java:62)
    at app//org.elasticsearch.xpack.application.CohereServiceUpgradeIT.testCohereEmbeddings(CohereServiceUpgradeIT.java:145)
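
The `null_pointer_exception` in the response above is what Java reports when `isEmpty()` is invoked on a null list. A minimal sketch of that failure mode and of a null guard, for illustration only: the `NullSafeReduce` class and both methods are hypothetical and are not taken from Elasticsearch or from #121493, whose actual cause and fix are not shown in this issue.

```java
import java.util.List;

// Hypothetical stand-in for a reduce step; not the Elasticsearch implementation.
final class NullSafeReduce {
    // Throws NullPointerException when toReduce is null, because
    // isEmpty() is invoked on the null reference.
    static int reduceUnguarded(List<Integer> toReduce) {
        if (toReduce.isEmpty()) {
            return 0;
        }
        return sum(toReduce);
    }

    // Treats a null list the same as an empty one instead of throwing.
    static int reduceGuarded(List<Integer> toReduce) {
        if (toReduce == null || toReduce.isEmpty()) {
            return 0;
        }
        return sum(toReduce);
    }

    private static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }
}
```

Whether a null `toReduce` should be tolerated or is itself a bug upstream cannot be decided from the stack trace alone.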

That leaves one unexplained failure: elasticsearch-periodic #6172 / openjdk23_8.16.4_java-matrix-bwc

Caused by: java.lang.RuntimeException: Timed out after PT3M waiting for ports files for: { cluster: 'test-cluster', node: 'test-cluster-2' }
    at org.elasticsearch.test.cluster.local.AbstractLocalClusterFactory$Node.waitUntilReady(AbstractLocalClusterFactory.java:311)
    at org.elasticsearch.test.cluster.local.AbstractLocalClusterFactory$Node.getTransportEndpoint(AbstractLocalClusterFactory.java:216)

There's nothing in the logs to indicate why the node did not start up. Possibly it was just very slow: the build timeline shows very high CPU usage while the test was running.
https://gradle-enterprise.elastic.co/s/6hpshewbj34de/timeline
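
For context, `PT3M` in the message above is the ISO-8601 form that `java.time.Duration` prints for three minutes. A rough sketch of how a wait-for-file loop can produce this kind of timeout on a slow machine, assuming simple polling; the `PortsFileWait` class and `waitForFile` method are hypothetical and do not reproduce `AbstractLocalClusterFactory`:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;

// Hypothetical polling helper; not the Elasticsearch test-cluster code.
final class PortsFileWait {
    // Polls until the file exists, or throws once the timeout has elapsed.
    static void waitForFile(Path file, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (!Files.exists(file)) {
            if (System.nanoTime() - deadline >= 0) {
                // Duration.toString() yields the ISO-8601 form, e.g. PT3M
                throw new RuntimeException("Timed out after " + timeout + " waiting for file: " + file);
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException("Interrupted while waiting for file: " + file, e);
            }
        }
    }
}
```

With a loop like this, a node that starts slowly under heavy CPU load fails in the same way as a node that never starts, which fits the lack of any other error in the logs.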

elasticsearchmachine · 2025-02-07T14:57:08Z

This has been muted on branch main

Mute Reasons:

[main] 19 failures in class org.elasticsearch.xpack.application.CohereServiceUpgradeIT (2.5% fail rate in 751 executions)
[main] 6 failures in step 8.19.0_bwc-snapshots (2.7% fail rate in 224 executions)
[main] 12 failures in step 9.0.0_bwc-snapshots (5.4% fail rate in 223 executions)
[main] 14 failures in pipeline elasticsearch-pull-request (7.7% fail rate in 181 executions)

Build Scans:

elasticsearchmachine added labels :ml ("Machine learning") and >test-failure ("Triaged test failures from CI") on Feb 3, 2025

elasticsearchmachine added a commit that referenced this issue Feb 3, 2025

Mute org.elasticsearch.xpack.application.CohereServiceUpgradeIT org.e…

d773184

…lasticsearch.xpack.application.CohereServiceUpgradeIT #121537

elasticsearchmachine added labels Team:ML ("Meta label for the ML team") and needs:risk ("Requires assignment of a risk label (low, medium, blocker)") on Feb 3, 2025

davidkyle added label low-risk ("An open issue or test failure that is a low risk to future releases") and removed label needs:risk ("Requires assignment of a risk label (low, medium, blocker)") on Feb 4, 2025

elasticsearchmachine added a commit that referenced this issue Feb 7, 2025

Mute org.elasticsearch.xpack.application.CohereServiceUpgradeIT org.e…

f4ee8c6

…lasticsearch.xpack.application.CohereServiceUpgradeIT #121537
