ES|QL cross-cluster searches honor the skip_unavailable cluster setting #112886
Pinging @elastic/es-analytical-engine (Team:Analytics)
None of this sounds like it's actively against the ESQL philosophy. It's a little weird that you can request an index explicitly (
It's been quite a while since I looked at that code, but I expect we can rig up something. I'm sure testing it's going to be fun though. As with _search, we could generate wildly inaccurate results if there's an error while the thing is running, but that's what you ask for when you set skip_unavailable=true.
@quux00, first let me lay out how ES|QL behaves in the different scenarios where indices are missing or not.
Existing behavior for different index patterns:
The way I see this in the context of CCS and skip_unavailable:
skip_unavailable = true
skip_unavailable = false
Note: for cases where the local cluster is involved and the overall (remote and local) resolution of indices ends up with no indices, this should be an error, because the user specified a missing index pattern for the local cluster; even if the user cannot control how the remote clusters behave, the user is in control of the index pattern for the local cluster.
Thanks @astefan for the careful review and analysis! One nit to start, when you say:
Can we change that to say "query time" rather than "…"? Second, I think there are some errors in your table notes.
This is not how ESQL currently behaves. It will ignore the local "inexistent" and search "remote:existent". That's true as long as you only search for one concrete index per cluster and at least one cluster has a matching index; in that case the search succeeds. And:
That is how ESQL behaves now (meaning it currently does NOT throw a failure) for the same reason as above, so that is not a change. Where ESQL is different (inconsistent?) is when you search for two concrete indexes on the same cluster and one exists and the other does not. Both of these fail with an index_not_found exception.
In my view we should make ESQL behavior consistent here AND tailor it to be specific to whether the cluster is skip_unavailable=true or false. In other words, a missing concrete index should either fail the query (when the cluster has skip_unavailable=false) or cause that cluster to be marked as skipped (when it has skip_unavailable=true).
And that means we need to determine what the skip_unavailable built-in "setting" for the local cluster is. Third, from this example you gave:
I believe you are proposing that the local cluster should be treated like skip_unavailable=false. Is that right? That would be consistent with how _search behaves (see below), so I'm not opposed to it, but I did propose the opposite in my write-up because skip_unavailable=true is now the default for remotes (that was changed recently, in 8.15, I think) and it is not a changeable setting by end users/admins. Example showing that in _search, the local cluster is treated like skip_unavailable=false:
For the rest of your write-up examples, I tested them against my skip_unavailable branch in progress and they match the behavior I've implemented. So I think we come down to two open questions:
UPDATE: Note also that ESQL is not currently consistent around whether to fail queries when two concrete indices are given for the same cluster, but one doesn't exist. This fails (with an index_not_found exception):
But this succeeds:
The reason is that the latter is a coordinator-only operation and the index never needs to be used again in the data node phase of ESQL processing. My vote would be that both should behave the same, and whether it fails or not depends on the skip_unavailable setting. In my current PR that's the behavior I've been working towards, so please let me know if anyone doesn't agree.
You are right, sorry about abusing the
Imo, this is inconsistent and incorrect. If
I think it should (throw an error). This changes if we add the
I think this behavior is consistent: both queries should fail because one concrete index does not exist. It doesn't matter if the index is local or not. Of course, this can change depending on
From what I understand, better said, I regard the CCS handling of ES|QL as having as a "baseline" comparison the current (non-CCS) behavior, which has no knowledge of skip_unavailable. What we consider as new behavior in ES|QL with regards to CCS should be in addition to existing behavior and not an override of current behavior.
Good catch. And this is a bug.
I respectfully disagree based on the logic I described above:
Thanks @astefan. Reading through your feedback, I think the behavior you are proposing can be captured in one sentence: if a user requests a concrete index that is not found, the query should fail with a standard exception and HTTP status code, unless the query was done against a remote cluster with the setting skip_unavailable=true.
Discussion / further elucidation of this principle:
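As a non-authoritative sketch, the proposed principle reduces to a single predicate (the helper name is mine, not Elasticsearch code):

```python
# Hypothetical helper sketching the one-sentence principle above; not actual
# Elasticsearch/ES|QL code.

def missing_concrete_index_is_fatal(is_remote: bool, skip_unavailable: bool) -> bool:
    """True if a concrete (non-wildcard) index expression that matches nothing
    should fail the whole query with a standard exception and HTTP status code."""
    # Only a remote cluster with skip_unavailable=true may be skipped;
    # everything else (including the local cluster, if it ends up being
    # treated like skip_unavailable=false) is a hard failure.
    return not (is_remote and skip_unavailable)
```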
Open questions:
Yes.
Just to make sure I understand what
I don't think this is breaking. I consider the three queries above bugs.
I don't think ES itself makes a distinction between an alias and an index name. When writing an ESQL query or an ES
The query could be syntactically correct (indices exist, field names and types are correct and compatible), or syntactically incorrect (wrong index names), but there are other issues that prevent the remote cluster from fulfilling the search. Thus, I consider that this edge case should return no data, and if to-be-added-parameter-that-exposes-ccs-metadata is set to
What I'm trying to point out is that ES|QL could behave differently for the exact same query depending on several factors (connectivity to remote cluster(s), shards being active or not, etc.) and that is probably acceptable:
Right now, ES|QL has two flavors of "index not found" errors, as you mentioned:
I think we should be consistent here and, imo, this should be a 400 error code (bad request): one that indicates a user error, where the query is effectively invalid because it references a nonexistent index.
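A tiny sketch of the consistency being argued for (the class and function names are illustrative, not Elasticsearch's real exception hierarchy): both "index not found" flavors collapse into one client-error path that surfaces as HTTP 400:

```python
# Illustrative only: not Elasticsearch's actual exception classes.

class IndexNotFound(Exception):
    """Proposed: one error flavor, surfaced as HTTP 400 (bad request),
    because the query references a nonexistent index."""
    http_status = 400

def require_index(known_indices: set, requested: str) -> None:
    # Same check regardless of whether it runs in pre-analysis
    # or later during execution.
    if requested not in known_indices:
        raise IndexNotFound(f"no such index [{requested}]")
```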
I think @costin referred to my statement in the previous comment: I regard the CCS handling of ES|QL as having as a "baseline" comparison the current (non-CCS) behavior, which has no knowledge of skip_unavailable. What we consider as new behavior in ES|QL with regards to CCS should be in addition to existing behavior and not an override of current behavior.
Yes, adding
Yes. "Fatal" means an exception is thrown and results in a 4xx/5xx HTTP status response. I use this term because in _search some errors/exceptions are not fatal: they are just recorded either in logs or response metadata, but the response still has a 2xx status. That will be true here in ESQL CCS as well for skip_unavailable=true clusters.
Thanks for your detailed answers to the questions. I think we are on the same page now. I will create a new issue with a write-up of the forthcoming PRs for adding skip_unavailable support.
All the work for skip_unavailable is complete now; the rest is continuing in #122802.
UPDATE: This issue had a lot of discussion to work through the approach. A new issue has been created that summarizes the proposed approaches to handling skip_unavailable in ES|QL.
Description
Overview
The skip_unavailable remote cluster setting is intended to allow ES admins to specify whether a cross-cluster search should fail or return partial data in the face of errors on a remote cluster during a cross-cluster search.
For _search, if skip_unavailable is true, a cross-cluster search does not fail when that remote has errors or is unreachable; instead the cluster is marked as skipped in the response metadata and partial results may be returned.
ESQL cross-cluster searches should also respect this setting, but we need to define exactly how it should work.
Proposed Implementation in ES|QL. Phase 1: field-caps and enrich policy-resolve APIs
To start, support for skip_unavailable should be implemented in both the field-caps and enrich policy-resolve APIs, which occur as part of the "pre-analysis" phase of ES|QL processing.
When a remote cluster cannot be connected to during the field-caps or enrich policy-resolve steps:
- If skip_unavailable=true (the default setting) for the remote cluster, the cluster will be marked as SKIPPED in the EsqlExecutionInfo metadata object for that search, reported as skipped in the _clusters/details metadata section of the ES|QL response, and a failure reason will be provided (see examples section below).
- If skip_unavailable=false, then a 500 HTTP status code is returned along with a single top-level error, as _search does.

If the index expression provided does not match any indices on a cluster, how should we handle that? I propose that we follow the pattern in _search:
- If skip_unavailable=true (the default setting) for the remote cluster, the cluster will be marked as SKIPPED along with an "index_not_found" failure message.
- If skip_unavailable=false for the remote cluster, the cluster will be marked as SKIPPED along with an "index_not_found" failure message ONLY IF the index expression was specified with a wildcard (lenient handling) - see example below.
- If skip_unavailable=false for the remote cluster and a concrete index was specified by the client (no wildcards), the error is fatal and an HTTP 500 status will be returned with an "index_not_found" failure message (see example below).

An additional consideration is how to treat the local cluster. It does not have an explicit skip_unavailable setting. Since it is the coordinating cluster, it will never be unavailable, but we need to decide how to handle the case when the index expression provided matches no indices: is it a fatal error, or should we just mark the local cluster as skipped?
I propose that we treat the local cluster like skip_unavailable=true for this case, for three reasons:
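Taken together, the Phase 1 proposal (including treating the local cluster like skip_unavailable=true for unmatched index expressions) could be sketched as a per-cluster decision table. This is only a sketch of the proposal in this issue, with invented helper and value names, not the actual field-caps/enrich code:

```python
# Sketch of the proposed Phase 1 per-cluster outcomes; names are
# assumptions, not Elasticsearch identifiers.

def preanalysis_outcome(problem: str, *, is_local: bool = False,
                        skip_unavailable: bool = True,
                        has_wildcard: bool = False) -> str:
    """problem is 'unreachable' or 'no_matching_index'.
    Returns 'skipped' (record failure in _clusters/details, keep 2xx)
    or 'fatal' (return a top-level HTTP error)."""
    if problem == "unreachable":
        # The local cluster is the coordinator and is never unreachable.
        return "skipped" if skip_unavailable else "fatal"
    if problem == "no_matching_index":
        if is_local:
            return "skipped"  # proposal: treat local like skip_unavailable=true
        if skip_unavailable:
            return "skipped"
        # skip_unavailable=false: lenient only for wildcard expressions
        return "skipped" if has_wildcard else "fatal"
    raise ValueError(f"unknown problem kind: {problem}")
```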
Reference
I have documented how _search and ES|QL currently behave with respect to indices not matching here: https://gist.github.com/quux00/a1256fd43947421e3a6993f982d065e8
Examples of proposed ES|QL response outputs
In these examples:
- remote1 has skip_unavailable=true
- remote2 has skip_unavailable=false
- Fatal error (404) when index not found and wildcard NOT used in remote2
- Skipped clusters when index not found and wildcard used in remote2
- Skipped cluster remote1 when not available
- Fatal error since remote2 not available
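The collapsed example bodies are not reproduced in this copy of the issue. As a rough, assumed illustration only (field names modeled on _search's _clusters/details metadata section; the final ES|QL output may differ), the skipped-remote1 case might look roughly like:

```python
# Assumed response shape only; NOT a verbatim ES|QL response. Modeled on
# the _clusters section that _search returns for cross-cluster searches.
skipped_remote1 = {
    "_clusters": {
        "total": 3,
        "successful": 2,
        "skipped": 1,
        "details": {
            "remote1": {
                "status": "skipped",
                "failures": [
                    {"reason": {"type": "connect_transport_exception"}}
                ],
            },
        },
    },
}
```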
Proposed Implementation in ES|QL. Phase 2: trapping errors during ES|QL operations (after planning)
To be fully compliant with the skip_unavailable model, we will also need to add in error handling during ES|QL processing. If shard errors (or other fatal errors) occur during ES|QL processing on a remote cluster and that cluster is marked as skip_unavailable=true, we will need to trap those errors, avoid returning a 4xx/5xx error to the user, and instead mark the cluster either as skipped or partial (depending on whether we can use the partial data that came back) in the EsqlExecutionInfo, along with failure info, as we do in _search.
Since ES|QL currently treats failures during ES|QL processing as fatal, I do not know how hard adding this feature will be. I would like feedback from the ES|QL team on how feasible this is and how it could be done.
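A minimal sketch of the Phase 2 idea, assuming a per-cluster execution hook (function names are mine, not the actual ES|QL driver code): failures are trapped and downgraded to a cluster status instead of failing the whole response:

```python
# Hypothetical sketch; not actual ES|QL driver code.

def run_on_cluster(execute, *, skip_unavailable: bool):
    """Run the per-cluster portion of the plan; return (status, data).
    Fatal errors re-raise and surface as a 4xx/5xx to the caller."""
    try:
        return "successful", execute()
    except Exception:
        if not skip_unavailable:
            raise  # skip_unavailable=false: the whole query fails
        # skip_unavailable=true: record the failure, keep the 2xx response.
        # A fuller implementation would return "partial" instead when some
        # usable pages of data had already arrived from this cluster.
        return "skipped", None
```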