Skip to content

ESQL: Quotations in index patterns allow for inconsistent pattern composition #122651

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
jbaiera opened this issue Feb 14, 2025 · 18 comments · May be fixed by #127636
Open

ESQL: Quotations in index patterns allow for inconsistent pattern composition #122651

jbaiera opened this issue Feb 14, 2025 · 18 comments · May be fixed by #127636
Labels
:Analytics/ES|QL AKA ESQL >bug Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)

Comments

@jbaiera
Copy link
Member

jbaiera commented Feb 14, 2025

ES|QL allows for using quotations in index pattern names to escape characters that might otherwise break the syntax. An example of this might be something like

FROM "this=that", """this[that"""

Additionally, we allow users to specify their entire index pattern in a string literal:

FROM "index1,index2,index3"

I've found recently that you can currently mix and match this syntax with cluster id syntax (and eventually index selector logic #120660) to create index patterns that probably should not be allowed.

All of the following patterns are valid in ES|QL, and parse to their unquoted equivalents when executing:

FROM cluster:"index1,index2"
# resolves to [cluster:index1, index2]

FROM cluster1:"index1,cluster2:index2"
# resolves to [cluster1:index1, cluster2:index2]

FROM "clusterA":"index1,index2,index3"
# resolves to [clusterA:index1, index2, index3]

When combining this existing logic with #120660 we end up with valid patterns like the following:

FROM "clusterA":"index1,index2,index3"::"data"
# resolves to [clusterA:index1, index2, index3::data]

FROM "cluster*":"index*,index*,index*::failures,*"::failures
# resolves to [cluster*:index*, index*, index*::failures, *::failures]

FROM "index1,index2,*"::failures
# resolves to [index1, index2, *::failures]

This seems like invalid pattern parsing to me. In my mind cluster:"index1,index2" implies that the comma is escaped, or that both indices should be resolved from the remote cluster provided. Since this incompatibility is really only possible under CCS, I would make a case to disallow commas in the quoted portions of ESQL that include an unquoted cluster id prefix (and in the future, an index selector suffix).

@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Feb 14, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@nik9000
Copy link
Member

nik9000 commented Feb 14, 2025

AKA "jank combinations that make Nik cry"

@alex-spies
Copy link
Contributor

This seems like invalid pattern parsing to me. In my mind cluster:"index1,index2" implies that the comma is escaped

++, esp. since multiple remotes are possible via from "cluster_one":*, "cluster_two":* .

I wonder if there's a good reason to interpret commas inside quotes as separators - but otherwise, I feel like this should be a bug.

@alex-spies
Copy link
Contributor

There is related ongoing work for CCQ index pattern validation here: #122497

@jbaiera
Copy link
Member Author

jbaiera commented Feb 19, 2025

A suggestion for how to handle this going forward: It would be beneficial if we allowed quoted index patterns to be used with commas in two mutually exclusive ways. When searching multiple index patterns:

  1. Each index pattern can be quoted individually to escape characters
    (like FROM "index1", "index2", "index3")
  2. Or a list of indices can be provided in quotes, as long as there are no more expressions after the quoted list
    (like FROM "index1,index2,index3")

As mentioned, these cases should be mutually exclusive to one another.

Unfortunately, I don't think we'll be able to make this approach hold universally in ESQL without getting approval for a breaking change. Today we tolerate index patterns like FROM "index1,index2", index3, "index4". Changing this to throw an exception would break backwards compatibility. This is jank, but it's less janky than when we combine this kind of syntax with cluster names and selectors.

As such, I propose that we continue to support the previous GA usages where they occur, but throw a validation exception when that previous use is combined with cluster names and selectors outside of the quoted pattern:

Usage Today's Result Suggested Result
FROM "index1","index2" [index1, index2] [index1, index2]
FROM "index1,index2" [index1, index2] [index1, index2]
FROM "index1,index2", index3 [index1, index2, index3] [index1, index2, index3] (Breaking change, should maybe disallow)
FROM cluster:"index1" [cluster:index1] [cluster:index1] (should maybe disallow)
FROM "cluster:index1" [cluster:index1] [cluster:index1]
FROM "cluster:index1", index2 [cluster:index1, index2] [cluster:index1, index2]
FROM "cluster:index1", "index2" [cluster:index1, index2] [cluster:index1, index2]
FROM "cluster:index1,index2" [cluster:index1, index2] [cluster:index1, index2]
FROM cluster:"index1,index2" [cluster:index1, index2] ❌ disallow comma usage in index1,index2 with cluster name
FROM cluster:"index1,cluster:index2" [cluster:index1, cluster:index2] ❌ disallow comma usage in index1,cluster:index2 with cluster name

This would allow later operations with index selectors to mimic this behavior:

Usage Without validation change With validation change
FROM "index1", "index2::data" [index1, index2::data] [index1, index2::data]
FROM "index1,index2::data" [index1, index2::data] [index1, index2::data]
FROM "index1,index2::data", index3 [index1, index2::data, index3] [index1, index2::data, index3]
FROM "index1,index2"::data [index1, index2::data] ❌ disallow comma usage in index1,index2 with selector provided
FROM "index1::data,index2"::data [index1::data, index2::data] ❌ disallow comma usage in index1::data,index2 with selector provided

Once CCS supports selectors, the patterns would be unified:

Usage Without validation change With validation change
FROM "cluster:index1", "index2::data" [cluster:index1, index2::data] [cluster:index1, index2::data]
FROM "cluster:index1,index2::data" [cluster:index1, index2::data] [cluster:index1, index2::data]
FROM "cluster:index1,index2::data", index3 [cluster:index1, index2::data, index3] [cluster:index1, index2::data, index3]
FROM cluster:"index1,index2"::data [cluster:index1, index2::data] ❌ disallow comma usage in index1,index2 with cluster name and selector provided
FROM cluster:"index1::data,cluster:index2"::data [cluster:index1::data, cluster:index2::data] ❌ disallow comma usage in index1::data,cluster:index2 with cluster name and selector provided

@smalyshev
Copy link
Contributor

I think we should either support "one quote to rule them all" (option 2) or "quote between commas" (option 1) but not other things, and not commas inside quotes. E.g. things like cluster:"index1" or "index1,index2", index3 shouldn't be allowed. Same for "index1,index2::data", index3, "cluster:index1,index2::data", index3 etc. It may be a bit of BC break, but it's better to take a bit of pain now than having to support increasingly complex code to deal with complications and inconsistencies for years to come, IMHO.

@jbaiera
Copy link
Member Author

jbaiera commented Feb 20, 2025

It was also brought to my attention by @alex-spies that we should probably disallow cluster:"index" as a format since that is also covered under the non-GA nature of CCS and thus would be reasonable to make a case for breaking that tolerance.

As for disallowing commas in quotation marks at all, I'm more on the fence about it. The documentation I can find lists quotation marks as the way to escape characters that are not accepted here by the ESQL parser, but there are plenty of special characters that shouldn't be escaped from index pattern parsing (like < and > for date math, : and :: for the different parts of the pattern, etc). I think it might be valuable to take a stance on what "escaping" means here: Does "index[1,index[2-<@now>" mean "two indices are resolved each named index[1 and index[2-2015-..." or does it mean "one index is trying to be resolved literally named index[1,index[2-<@now>? I don't think it's as simple to say we should escape commas inside of quotes when we don't escape other index syntax characters. Seems much easier to justify disallowing half-in-quotes, half-out-of-quotes index patterns.

@smalyshev
Copy link
Contributor

As for disallowing commas in quotation marks at all, I'm more on the fence about it.

Maybe I went too far there - I don't necessarily mean banning commas in all cases, just not allowing it as semantically valid separator. I.e. "..." should be a single unit that is parsed as single index name, if we go option 1 route. I'm not sure if commas are necessary for datemath (they don't seem to have any special meaning and banned in index name so they can't be part of static string, and even if local date-time format has commas, I don't think you can have an index name that includes them?) but whatever is parsed by datemath should be happening inside quotes and it should be treated as single unit. I.e. index1, "index2, index3" or cluster:"index" shouldn't be a valid thing.

Seems much easier to justify disallowing half-in-quotes, half-out-of-quotes index patterns.

Agreed.

@jbaiera
Copy link
Member Author

jbaiera commented Feb 20, 2025

I updated my suggestion table to include the examples that are unclear if we should allow or not

@alex-spies
Copy link
Contributor

alex-spies commented Apr 30, 2025

@smalyshev , I think we should make a definitive decision on whether we should go forward with @jbaiera 's suggestion (I think we should), and to what extent.

@pawankartik-elastic 's PR #122497 seems to follow a different validation strategy, which treats FROM remote:"index, other_remote:other_index" equivalently to FROM "remote:index", "other_remote:other_index" and FROM remote:"index, other_index" equivalently to FROM "remote:index", "other_index" (if I'm not mistaken).

I think now would be a good time to settle on one strategy, so that #122497 can already follow it and we minimize potentially risky changes to the validation code of FROM.

@smalyshev
Copy link
Contributor

smalyshev commented Apr 30, 2025

I've checked the current version of the table above and my main concern there I still can't figure out the general rule by which it works. The rule for "individual quotes or quote everything" makes sense, but the examples in the table don't seem to follow it. Can we formulate a generic rule that would allow me to know how it's parsed without looking at the table? I think we need to have this rule recorded somewhere before we start coding it, otherwise we get "our code is our docs" which is not a good situation.

The examples individually make sense to me, and I do think FROM remote:"index, other_remote:other_index" should not be allowed and the same with FROM remote:"index, other_index".

@alex-spies
Copy link
Contributor

alex-spies commented Apr 30, 2025

Here's my attempt at formulating a rule that fits the table:

  1. The FROM command accepts a comma-separated list of index patterns.
  2. Each such index pattern can be enclosed in quotes, or it can be unquoted.
  3. Unquoted index patterns can consist only of 1 wildcard pattern or 1 specific index; it can be a remote index/pattern and we can also select the corresponding failure store. (Unquoted index patterns cannot contain commas).
  4. Quoted index patterns can contain 1 or several index patterns as described in 3.; here, commas separate different index patterns.

If I'm correct, the grammar can be simplified and that would automatically take care of most of the validation. We can restrict clusterString and selectorString to UNQUOTED_SOURCE and force the index pattern to be UNQUOTED_SOURCE if the clusterString or selectorString is present.

There's still the issue that https://www.elastic.co/docs/reference/elasticsearch/rest-apis/api-conventions#api-date-math-index-names mentions the comma having special meaning inside date math patterns, but I don't know specific examples that demonstrate this.

@smalyshev
Copy link
Contributor

comma having special meaning inside date math patterns, but I don't know specific examples that demonstrate this.

I'm not sure what is the story there. Comma does not have AFAIK any meaning for datemath syntax itself, and does not have any special meaning in Java date formatting. Some national date formats can include commas, but if you use such a format in datemath expression, you'd get invalid index name since comma is not valid in the index name? So I can't think of any legit way of actually using a comma in this context.

@smalyshev
Copy link
Contributor

it can be a remote index/pattern

I'd say here "remote index expression, e.g. cluster:index", where both cluster and index components can include wildcards. The word "pattern" seems to have too many meanings here so let's use it consistently. Basically, as I understand, unquoted pattern is always one thing - either one concrete index (including datemath expression) or one wildcard expression, but no commas, quotes, etc.

Other than the above, I think it makes sense.

@smalyshev
Copy link
Contributor

I have one more concern here. With all this quoting going on, on which stage the quotes are going to be eliminated? The reason I ask is that the planning stage expects a single IndexPattern (it's a list but it always size 1 now) and IndexPattern is basically a string. So, do we have to preserve all those quotes as they were in the original pattern down to the analysis stage? It looks like the quotes are removed pretty early, but that means quotes probably could never be used for something like quoting commas, right?

@pawankartik-elastic
Copy link
Contributor

With all this quoting going on, on which stage the quotes are going to be eliminated?

Quoting is eliminated within the IdentifierBuilder class. The visit*() methods invoke unquote() if a string is quoted.

@alex-spies
Copy link
Contributor

alex-spies commented May 2, 2025

So, here's my suggestion for going forward:

  1. Let's make a PR that simplifies the grammar according to ESQL: Quotations in index patterns allow for inconsistent pattern composition #122651 (comment). This should exclude a swathe of annoying edge cases because we only have to consider the fully quoted case, as in FROM "remote_expr1:index_expr1, index_expr2::selector2", ... or the fully unquoted case, as in FROM remote_expr1:index_expr1, index_expr2:selector2, ....
  2. Then, let's update Pawan's PR, which will now have to take fewer cases into account. In particular, it's clearer what the cluster string/remote expression is in every case, which currently seems to add complexity to the validation.

What do we think?

@pawankartik-elastic
Copy link
Contributor

@alex-spies Your suggestion looks right to me. If we can agree with the rules, I believe we could go for it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL >bug Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants