[CI] JsonLogsFormatAndParseIT testElementsPresentOnAllLinesOfLog failing #111662

Closed
elasticsearchmachine opened this issue Aug 6, 2024 · 7 comments
Labels
:Data Management/Health low-risk An open issue or test failure that is a low risk to future releases Team:Data Management Meta label for data/management team >test-failure Triaged test failures from CI

Comments

elasticsearchmachine (Collaborator) commented Aug 6, 2024

Build Scans:

Reproduction Line:

./gradlew ":qa:unconfigured-node-name:javaRestTest" --tests "org.elasticsearch.unconfigured_node_name.JsonLogsFormatAndParseIT.testElementsPresentOnAllLinesOfLog" -Dtests.seed=C4137AF58BE5F61F -Dtests.locale=vai-Latn -Dtests.timezone=Australia/Currie -Druntime.java=23

Applicable branches:
8.x

Reproduces locally?:
N/A

Failure History:
See dashboard

Failure Message:

org.elasticsearch.client.ResponseException: method [GET], host [http://127.0.0.1:43369], URI [_cluster/health?wait_for_events=languid], status line [HTTP/1.1 408 Request Timeout]
{"cluster_name":"javaRestTest","status":"green","timed_out":true,"number_of_nodes":1,"number_of_data_nodes":1,"active_primary_shards":0,"active_shards":0,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"unassigned_primary_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":61,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":22811,"active_shards_percent_as_number":100.0}

Issue Reasons:

  • [8.x] 2 failures in test testElementsPresentOnAllLinesOfLog (2.2% fail rate in 90 executions)
  • [8.x] 2 failures in pipeline elasticsearch-periodic-platform-support (50.0% fail rate in 4 executions)

Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.

@elasticsearchmachine elasticsearchmachine added :Delivery/Build Build or test infrastructure >test-failure Triaged test failures from CI Team:Delivery Meta label for Delivery team needs:risk Requires assignment of a risk label (low, medium, blocker) labels Aug 6, 2024
elasticsearchmachine (Collaborator, Author) commented:

Pinging @elastic/es-delivery (Team:Delivery)

@mark-vieira mark-vieira added :Core/Infra/Logging Log management and logging utilities and removed :Delivery/Build Build or test infrastructure labels Aug 7, 2024
mark-vieira (Contributor) commented:

Interestingly, this looks to be failing with a similar timeout to #111632, and on the same platform (Amazon Linux 2023). Related?

@elasticsearchmachine elasticsearchmachine added Team:Core/Infra Meta label for core/infra team and removed Team:Delivery Meta label for Delivery team labels Aug 7, 2024
elasticsearchmachine (Collaborator, Author) commented:

Pinging @elastic/es-core-infra (Team:Core/Infra)

@ldematte ldematte self-assigned this Aug 9, 2024
rjernst (Member) commented Aug 9, 2024

The failure here has nothing to do with the test itself. The log test passed; the failure happened during cleanup, when the health API timed out. So I am reassigning to data management for investigation.

@rjernst rjernst added :Data Management/Health and removed :Core/Infra/Logging Log management and logging utilities labels Aug 9, 2024
@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Aug 9, 2024
elasticsearchmachine (Collaborator, Author) commented:

Pinging @elastic/es-data-management (Team:Data Management)

masseyke (Member) commented:

It looks like something in a task queue is just not getting cleared out. One of the failures reports a task_max_waiting_in_queue_millis of 32.8s, the other 40s. I haven't been able to reproduce this locally (yet), and I don't see any logging that might help.

@dakrone dakrone added low-risk An open issue or test failure that is a low risk to future releases and removed needs:risk Requires assignment of a risk label (low, medium, blocker) labels Oct 29, 2024
nielsbauman added a commit to nielsbauman/elasticsearch that referenced this issue Feb 12, 2025
In addition to logging the pending cluster tasks after the cluster
health request times out during cluster cleanup in REST tests, we should
log the hot threads to help identify any issues that could cause tasks
to get stuck.

Follow-up of elastic#119186

Relates elastic#111632
Relates elastic#111431
Relates elastic#111662
nielsbauman added a commit that referenced this issue Feb 13, 2025 (same commit message as above)
nielsbauman added two commits to nielsbauman/elasticsearch that referenced this issue Feb 13, 2025 (same commit message as above)
elasticsearchmachine pushed two commits that referenced this issue Feb 13, 2025 (same commit message as above)
@nielsbauman nielsbauman self-assigned this Mar 26, 2025
nielsbauman (Contributor) commented:

Closing, see #111431 (comment)


7 participants