Skip to content

Fix timeout for awaiting index existence #126773

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Apr 18, 2025

Conversation

nielsbauman
Copy link
Contributor

#126692 allowed consumers to specify a timeout to awaitIndexExists, but that timeout did not get propagated correctly to all the required places.

elastic#126692 allowed consumers to specify a timeout to `awaitIndexExists`,
but that timeout did not get propagated correctly to all the required
places.
@nielsbauman nielsbauman requested a review from DaveCTurner April 14, 2025 11:11
@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label v9.1.0 labels Apr 14, 2025
// NB private because tests should be designed not to need to wait for longer than SAFE_AWAIT_TIMEOUT.
private static <T> T safeGet(Future<T> future, TimeValue timeout) {
public static <T> T safeGet(Future<T> future, TimeValue timeout) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can simply require tests to not have to wait for more than 10 seconds. In this case, downsampling is not consistently finished within 10 seconds, we really need more time. I'm open to other suggestions, but I think it's acceptable to allow consumers to specify a higher timeout.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather we didn't make it so easy to abuse these utilities like this - experience shows that it just leads to lazy testing that waits for far too long and slows everything down. If it's just for a health request, you can pick your own timeout and use ElasticsearchAssertions#assertNoTimeout.

Relatedly, why is downsampling so slow in tests? Are we just sending it unreasonably large amounts of data, or is it slow for some other reason?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I'm not sure I feel like not exposing this method will make much of a difference when it comes to encouraging people to write tests that don't need to wait more than 10 seconds - as there are other alternatives.

As to these downsampling tests (but I'm sure there are other places where we need a higher timeout), DataStreamLifecycleDownsampleDisruptionIT.testDataStreamLifecycleDownsampleRollingRestart (which failed a few times #122056 (comment) and #123769 (comment)) and ILMDownsampleDisruptionIT.testILMDownsampleRollingRestart both need longer timeouts because we do a rolling restart of the nodes. We saw both timeouts on the downsampling itself and of the downsampled index becoming green afterwards (i.e. shard relocation took too long).

We could wait for the cluster to complete the rolling restart (although we'd probably need a timeout higher than 10 seconds for that) and only then check for index existence within 10 seconds, but that probably only makes the tests longer as they could also complete downsampling before the rolling restart is finished.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides that, I feel like the difference between TEST_REQUEST_TIMEOUT (30s) and SAFE_AWAIT_TIMEOUT (10s) is also confusing. If we truly believe tests shouldn't have to wait for more than 10 seconds, we should drop the TEST_REQUEST_TIMEOUT too - I don't think we should, but that would be consistent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DaveCTurner any more thoughts on this? Otherwise I'll ask someone from Data Management to review as well to get this through.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delayed response. I'd still rather we used ElasticsearchAssertions#assertNoTimeout. I disagree that this doesn't make much difference, I think it is important to discourage writing badly-designed tests that have to wait for so long. It's clearly possible to do whatever waiting you want to do in any case, it's just a question of whether that should be easy to do without much thought.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I switched to assertNoTimeout in c2a30b0.

@nielsbauman nielsbauman added >test Issues or PRs that are addressing/adding tests Team:Distributed Coordination Meta label for Distributed Coordination team :Distributed Coordination/Distributed A catch all label for anything in the Distributed Coordination area. Please avoid if you can. and removed needs:triage Requires assignment of a team area label labels Apr 14, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Niels LGTM

@nielsbauman nielsbauman enabled auto-merge (squash) April 18, 2025 06:59
@nielsbauman nielsbauman merged commit a81c449 into elastic:main Apr 18, 2025
17 of 18 checks passed
@nielsbauman nielsbauman deleted the fix-timeout branch April 18, 2025 09:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Distributed A catch all label for anything in the Distributed Coordination area. Please avoid if you can. Team:Distributed Coordination Meta label for Distributed Coordination team >test Issues or PRs that are addressing/adding tests v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants