-
Notifications
You must be signed in to change notification settings - Fork 25.2k
ES|QL: Check if cluster aliases and index patterns are valid before executing query #122497
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
ES|QL: Check if cluster aliases and index patterns are valid before executing query #122497
Conversation
Hi @pawankartik-elastic, I've created a changelog YAML for you. |
Things unclear to me at the moment:
|
I only very briefly looked at the solution, but the approach appears correct to me. Maybe it's not so nice to validate in multiple places for CCQ-purposes, as CCQ needs to take into consideration more than just lexikographic correctness. However, I believe that any validation that is possible based on the parser, only, should happen in the IdentifierBuilder (or otherwise close to the parsing step), like in this PR. This allows early rejections without risks, that e.g. we start big fat field_caps calls to try and validate queries that the parser could rightfully reject. And it means that whenever we have a index pattern inside ESQL, we know that it's at least sane from a lexikographic point of view. The possibility of index patterns containing : slipping through is not ideal at the moment. These are clearly invalid and shouldn't make it into the rest of the system.
Update: Please ignore, there's a much simpler explanation, see below.
Update: It's much simpler than that. For local
Quoting is required due to index patterns that can contain funny characters. Like Allowed quoting:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heya, thanks @pawankartik-elastic !
- This approach is unfortunately insufficient, but not by much.
This PR improves the situation for the buggy queries that initially triggered the issue. But it doesn't solve the problem that the index pattern may be otherwise invalid, like in FROM valid_remote:"invalid<index_name"
- which still just returns
<no-fields>
---------------
The reason is that in IdentifierBuilder.visitIndexPattern
, we just never call validateIndexPattern
in case that clusterString
is not null. Additionally, inside validateIndexPattern
, we explicitly skip the validation in case that the pattern contains a :
.
I believe the correct solution should try to split the indexPattern
at the first :
in case that there is no clusterString
and this is not a date math index pattern. And then, the remaining, actual index pattern should be sent to validateIndexPattern
, which needs additionally to consider the :
character as invalid rather than just skipping validation.
- I think there is a separate problem where cluster names can be invalid, which we don't reject at parse time, but in this case we just throw
no_such_remote_cluster_exception
.
This may be cheap enough to not be a problem - although we could improve validateIndexPattern
too check for correct remote name syntax, too.
- On a separate note: here https://github.com/elastic/elasticsearch/pull/120355/files#diff-ef7f022d55c8068ca9c7c26cc5bac2c4cc3cf5689a9bfc20e931cdf17541556eL83-L84 Ievgen added randomly generated cluster names to our parser tests.
I think it'd be nice to go and check if they use all possible characters that can be used in remote names? If there are more characters than that, we should go and make the identifier generator use that, so this is included in our tests. For instance, what about unicode? (I don't know :) )
expectError("FROM remote:\"foo:bar\"", "Unexpected index separator in index patter"); | ||
expectError("FROM \"remote:foo:bar:baz\"", "Unexpected index separator in index patter"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's move these into their own test method testInvalidRemoteClusterPattern
, the test case testInvalidQuotingAsFromIndexPattern
is about invalid quoting not remotes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we decide to add more validation for the remote names, I think we need to add more tests for all kinds of illegal characters that (right now) can go into remote names, using all kinds of possible quoting.
For instance, all of the following currently don't cause a parse time exception, but instead raise no_such_remote_cluster_exception
:
from "cluster|one:foo"
from "cluster<one:foo"
from "cluster🦄one:foo"
var maxSeparators = clusterString == null ? 1 : 0; | ||
if (patternExceedsMaxIndexSeparators(indexPattern, maxSeparators)) { | ||
throw new ParsingException(source(c), "Unexpected index separator in index pattern"); | ||
} | ||
// skip validating index on remote cluster, because the behavior of remote cluster is not consistent with local cluster | ||
// For example, invalid#index is an invalid index name, however FROM *:invalid#index does not return an error | ||
if (clusterString == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be an if-else, the index pattern should always be validated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CC @fang-xing-esql. Maybe you have more context behind this decision.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, we don't validate index names in IdentifierBuilder
for remote clusters, when I implemented the validation of index names, I found the behaviors of invalid index names between remote and local cluster are not quite consistent, so I decide to skip the validation of index names on remote cluster.
For example, these queries below all run to completion, they don't error out although the remote index names are invalid
+ curl -u elastic:password -X POST 'localhost:9200/_query?format=txt&pretty' -H 'Content-Type: application/json' '-d
{
"query": "FROM existing_index, *:\"invalid:index\""
}
'
`* | `*.keyword | author | author.keyword | b`a`r | bar | page_count | release_date
------------------+------------------+-----------------+-----------------+---------------+---------------+---------------+------------------------
Snow Crash |Snow Crash |Neal Stephenson |Neal Stephenson |5 |1 |470 |1992-06-01T00:00:00.000Z
Revelation Space |Revelation Space |Alastair Reynolds|Alastair Reynolds|4 |2 |585 |2000-03-15T00:00:00.000Z
1984 |1984 |George Orwell |George Orwell |3 |3 |328 |1985-06-01T00:00:00.000Z
Fahrenheit 451 |Fahrenheit 451 |Ray Bradbury |Ray Bradbury |2 |4 |227 |1953-10-15T00:00:00.000Z
Brave New World |Brave New World |Aldous Huxley |Aldous Huxley |1 |5 |268 |1932-06-01T00:00:00.000Z
The Handmaids Tale|The Handmaids Tale|Margaret Atwood |Margaret Atwood |0 |0 |311 |1985-06-01T00:00:00.000Z
+ curl -u elastic:password -X POST 'localhost:9200/_query?format=txt&pretty' -H 'Content-Type: application/json' '-d
{
"query": "FROM existing_index, \"*:invalid:index\""
}
'
`* | `*.keyword | author | author.keyword | b`a`r | bar | page_count | release_date
------------------+------------------+-----------------+-----------------+---------------+---------------+---------------+------------------------
Snow Crash |Snow Crash |Neal Stephenson |Neal Stephenson |5 |1 |470 |1992-06-01T00:00:00.000Z
Revelation Space |Revelation Space |Alastair Reynolds|Alastair Reynolds|4 |2 |585 |2000-03-15T00:00:00.000Z
1984 |1984 |George Orwell |George Orwell |3 |3 |328 |1985-06-01T00:00:00.000Z
Fahrenheit 451 |Fahrenheit 451 |Ray Bradbury |Ray Bradbury |2 |4 |227 |1953-10-15T00:00:00.000Z
Brave New World |Brave New World |Aldous Huxley |Aldous Huxley |1 |5 |268 |1932-06-01T00:00:00.000Z
The Handmaids Tale|The Handmaids Tale|Margaret Atwood |Margaret Atwood |0 |0 |311 |1985-06-01T00:00:00.000Z
+ curl -u elastic:password -X POST 'localhost:9200/_query?format=txt&pretty' -H 'Content-Type: application/json' '-d
{
"query": "FROM existing_index, *:\"-invalid:index\""
}
'
`* | `*.keyword | author | author.keyword | b`a`r | bar | page_count | release_date
------------------+------------------+-----------------+-----------------+---------------+---------------+---------------+------------------------
Snow Crash |Snow Crash |Neal Stephenson |Neal Stephenson |5 |1 |470 |1992-06-01T00:00:00.000Z
Revelation Space |Revelation Space |Alastair Reynolds|Alastair Reynolds|4 |2 |585 |2000-03-15T00:00:00.000Z
1984 |1984 |George Orwell |George Orwell |3 |3 |328 |1985-06-01T00:00:00.000Z
Fahrenheit 451 |Fahrenheit 451 |Ray Bradbury |Ray Bradbury |2 |4 |227 |1953-10-15T00:00:00.000Z
Brave New World |Brave New World |Aldous Huxley |Aldous Huxley |1 |5 |268 |1932-06-01T00:00:00.000Z
The Handmaids Tale|The Handmaids Tale|Margaret Atwood |Margaret Atwood |0 |0 |311 |1985-06-01T00:00:00.000Z
+ curl -u elastic:password -X POST 'localhost:9200/_query?format=txt&pretty' -H 'Content-Type: application/json' '-d
{
"query": "FROM existing_index, \"*:-invalid:index\""
}
'
`* | `*.keyword | author | author.keyword | b`a`r | bar | page_count | release_date
------------------+------------------+-----------------+-----------------+---------------+---------------+---------------+------------------------
Snow Crash |Snow Crash |Neal Stephenson |Neal Stephenson |5 |1 |470 |1992-06-01T00:00:00.000Z
Revelation Space |Revelation Space |Alastair Reynolds|Alastair Reynolds|4 |2 |585 |2000-03-15T00:00:00.000Z
1984 |1984 |George Orwell |George Orwell |3 |3 |328 |1985-06-01T00:00:00.000Z
Fahrenheit 451 |Fahrenheit 451 |Ray Bradbury |Ray Bradbury |2 |4 |227 |1953-10-15T00:00:00.000Z
Brave New World |Brave New World |Aldous Huxley |Aldous Huxley |1 |5 |268 |1932-06-01T00:00:00.000Z
The Handmaids Tale|The Handmaids Tale|Margaret Atwood |Margaret Atwood |0 |0 |311 |1985-06-01T00:00:00.000Z
If we decide to validate remote index names, it does require more tests so that we are clear about the scope of change. :
is not the only invalid character, and also be careful about *
and -
in the (remote)index patterns IndentifierBuilder.validateIndexPattern
calls some existing APIs to validate index names, and remote cluster index name validation can take advantage of it. It could change the current behaviors related to (remote) index patterns potentially, we need to justify if the behavior changes make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need a bunch more tests for invalid index pattern in all the cases where there is a remote prefix some_remote:...
. Ideally, randomly generated invalid index patterns.
Thank you @alex-spies. I agree that this fix doesn't fully address the issue. I was simply trying to:
I'll now work on ensuring that we fully address the issue. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from what @alex-spies awesomely mentioned already, please add many many more tests. For one, there is no test with date math expressions (something you explicitly consider in your code).
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/parser/IdentifierBuilder.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/parser/IdentifierBuilder.java
Outdated
Show resolved
Hide resolved
var maxSeparators = clusterString == null ? 1 : 0; | ||
if (patternExceedsMaxIndexSeparators(indexPattern, maxSeparators)) { | ||
throw new ParsingException(source(c), "Unexpected index separator in index pattern"); | ||
} | ||
// skip validating index on remote cluster, because the behavior of remote cluster is not consistent with local cluster | ||
// For example, invalid#index is an invalid index name, however FROM *:invalid#index does not return an error | ||
if (clusterString == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CC @fang-xing-esql. Maybe you have more context behind this decision.
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/parser/IdentifierBuilder.java
Outdated
Show resolved
Hide resolved
Of course! I'd definitely do that. I did the bare minimums to kick start some discussions. Over the next few commits, I'll address all your concerns. |
Pinging @elastic/es-analytical-engine (Team:Analytics) |
@@ -784,6 +781,8 @@ public void testInvalidQuotingAsFromIndexPattern() { | |||
|
|||
expectError("FROM \"\"\"foo\"\"\"bar\"\"\"", ": mismatched input 'bar' expecting {<EOF>, '|', ',', 'metadata'}"); | |||
expectError("FROM \"\"\"foo\"\"\"\"\"\"bar\"\"\"", ": mismatched input '\"bar\"' expecting {<EOF>, '|', ',', 'metadata'}"); | |||
expectError("FROM remote:\"foo:bar\"", "Index pattern [foo:bar] contains a cluster alias despite specifying one [remote]"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ignore these 2 test cases. They're already tested below; will drop them soon.
if (selectorString != null) { | ||
throwOnMixingSelectorWithCluster(reassembleIndexName(clusterString, indexPattern, selectorString), ctx); | ||
} | ||
validateClusterString(clusterString, ctx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we allow unicode chars in the cluster name? Are the naming rules documented anywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @pawankartik-elastic! I added some thoughts.
} | ||
} | ||
|
||
validateClusterAndIndexPatterns(indexPattern, c, clusterString, selectorString); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indexPattern
, clusterString
and selectorString
can be extracted from an IndexPatternContext
, providing all four to validateClusterAndIndexPatterns
seems redundant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
validateClusterAndIndexPatterns()
is a static
method currently. If I were to pass the IndexPatternContext
, I'd have to make it non-static
since the visitor methods (such as visitIndexString()
) are non-static
and are overridden. Would that be alright with you?
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/parser/IdentifierBuilder.java
Outdated
Show resolved
Hide resolved
String[] patterns = indexPattern.split(","); | ||
boolean isFirstPattern = true; | ||
|
||
for (String pattern : patterns) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this is hard for me to follow. The old logic that we check clusterString
first and then check indexString
and selectorString
, the same sequence as the index patterns are coded in the query, is easier for me to follow. I wonder if we can restore to the previous sequence, does it look messier? It seems to me that the old way is easier to understand, maintain, and less error prone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't that exactly what we're doing in this loop? At this point, we loop over each pattern and then towards the end we validate the clusterString
(if set), then call the validateIndicesForCluster()
to validate the indices specified, and then we finally validate the selectorString
via validateIndexSelectorString()
—all within this loop, at the end of the loop.
I maintained the same order. Had I deviated from this order, wrong error messages would have taken precedence and caused the tests to fail due to incorrect error strings (which indeed happened when I was working on this PR—when I deviated from the previous validation ordering).
I agree that the new logic may be complex. It's due to whole new scenarios that are "unlocked" for validation because it applies to remotes too now.
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/parser/IdentifierBuilder.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/parser/IdentifierBuilder.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/parser/IdentifierBuilder.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/parser/IdentifierBuilder.java
Show resolved
Hide resolved
@@ -488,15 +489,12 @@ public void testStringAsIndexPattern() { | |||
clusterAndIndexAsIndexPattern(command, "cluster:index"); | |||
clusterAndIndexAsIndexPattern(command, "cluster:.index"); | |||
clusterAndIndexAsIndexPattern(command, "cluster*:index*"); | |||
clusterAndIndexAsIndexPattern(command, "cluster*:<logstash-{now/D}>*");// this is not a valid pattern, * should be inside <> | |||
clusterAndIndexAsIndexPattern(command, "cluster*:<logstash-{now/D}*>"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this test is removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, this would not error out even if the logstash part was invalid because it's a remote index and we use to skip validation for remote indices. Now that we validate even remote indices, these cases should error out with a message along the lines of 'D' is an invalid unit. Such an exact test case is already present so these are duplicates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pawankartik-elastic , it's hard for me to validate that the case FROM + cluster*:<logstash-{now/D}>*
and cluster*:<logstash-{now/D}*>
is still tested. Could you point me to where exactly this is tested, or just add these back in with an appropriate expectError
just so human readers can easily find this case covered? (Maybe it's covered below in testInvalidPatternsWithIntermittentQuotes
, but that's harder to parse for humans.)
Actually, I just tested these two expressions and except for the fact that it should be a lower case d
, these come through as valid just fine. So I think these cases can be left in place but need to be slightly fixed:
clusterAndIndexAsIndexPattern(command, "cluster*:<logstash-{now/d}>*");// this is not a valid pattern, * should be inside <>
clusterAndIndexAsIndexPattern(command, "cluster*:<logstash-{now/d}*>");
clustersAndIndices(command, "index*", "-index#pattern"); | ||
clustersAndIndices(command, "*", "-<--logstash-{now/M{yyyy.MM}}>"); | ||
clustersAndIndices(command, "index*", "-<--logstash#-{now/M{yyyy.MM}}>"); | ||
expectInvalidIndexNameErrorWithLineNumber(command, "*, index#pattern", lineNumber, "index#pattern", "must not contain '#'"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This changes the current behavior. Today, if a *
exists in index patterns, invalid index names are ignored, they don't error out as far as the index patterns match at least one index(valid and existing). The same applies to the tests a few lines below that have selectors. However, it looks reasonable to me that as soon as there is an index pattern with invalid character found, we error out. In this situation we have a *
, that means there is a high chance that this whole index pattern in the from
command matches an existing index, and it doesn't look completely wrong to tolerate invalid character(could be a typo) either. There could be a reason that why this was allowed originally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood. Should we have a slightly detailed discussion internally with the ES|QL and the CCS team regarding whether we should tolerate such erroneous patterns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It changes the current behaviors, it worths bringing up to a broader group to discuss further in my opinion, and also add ES|QL-ui
label to inform kibana team.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This behaviour is about excluding certain patterns as long as there's a preceding pattern(s) with a *
(and does not require any valid and existing indices). Since this functionality existed as a way to exclude arbitrary patterns (but is unfortunately not documented anywhere?), I've reverted some of my changes based on our conversation earlier today to retain this behaviour and added back these test cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fang-xing-esql , are you sure this is a behavior change? I just ran this against main and got an invalid index name exception with a 400:
curl -u elastic:password -X POST 'localhost:9200/_query?format=txt' -H 'Content-Type: application/json' '-d 16:36
{
"query": "from *, index#pattern"
}
'
{"error":{"root_cause":[{"type":"invalid_index_name_exception","reason":"Invalid index name [index#pattern], must not contain '#'","index_uuid":"_na_","index":"index#pattern"}],"type":"parsing_exception","reason":"line 1:9: Invalid index name [index#pattern], must not contain '#'","caused_by":{"type":"invalid_index_name_exception","reason":"Invalid index name [index#pattern], must not contain '#'","index_uuid":"_na_","index":"index#pattern"}},"status":400}%
Both added tests here seem to behave the same way as on main in my short manual test.
The code is non-trivial so I may not have read this correctly, but it appears to me that the behavior of the validation for non-remote index patterns remains consistent with this change.
1. Wrapped `splitIndexName()` in try catch and ensured it throws parse exception. 2. Relaxed the validation to allow excluding garbage. 3. Validation no longer breaks patterns on pipe char as it's an invalid char.
Ignore the CI failures. It's mostly BWC failures because branch is out of date. I will update it later. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot @pawankartik-elastic ! This is looking much better already.
I reviewed the production code, but not yet the new test that was added. The reason is that there have been important discussions on what should and should not be allowed in FROM
commands, especially the comments by @jbaiera here and here, which are not reflected in this PR.
If we go with the suggestions by @jbaiera , and I believe we should, then the present PR could be simplified.
Changing the validation of the FROM
command is high risk, as any potentially introduced new bug that we don't happen to already have tests for can and will break customer queries immediately as this sees the light of day in Serverless.
Therefore, I'd like to double check if we can go with a simpler approach based on @jbaiera 's suggestion. That would IMO likely lead to simpler validation code and thus require less paranoid double checking during review to have good confidence in the change.
(I'm not saying the current code likely has bugs, just that I, as a human, am likely to overlook bugs in code that requires multiple nested loops handling quoted and unquoted index patterns and mutating the current cluster string and so on.)
} | ||
|
||
hasSeenStar.set(hasSeenStar.get() || indexPattern.contains(WILDCARD)); | ||
validateClusterAndIndexPatterns(indexPattern, c, clusterString, selectorString, hasSeenStar.get()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
validateClusterAndIndexPatterns
gets handed indexPattern, c, clusterString, selectorString
. That's a bit redundant, as c
contains the other 3.
I think c
is only really needed for source(ctx)
inside validateClusterAndIndexPatterns
, so I'd change the signature of validateClusterAndIndexPatterns
to only require the Source
, not the whole thing.
* Just because there was no clusterString before this index pattern does not mean that the indices | ||
* are local indices. Patterns can be clubbed with remote names within quotes such as: | ||
* "remote_one:remote_index,local_index". In this case, clusterString will be null. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the nice comment!
* Handle scenarios like remote_one:"index1,remote_two:index2". The clusterString here is | ||
* remote_one and is associated with index1 and not index2. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heya, in a discussion with @jbaiera , we ended up suggesting that remote_one:"index1,remote_two:index2"
should simply be disallowed; more generally, when there is a cluster string prefix, we should interpret the index pattern as being a single actual index pattern and not split by commas - commas should just be invalid.
That's not only simpler, but also less surprising to users. C.f. the table of suggested validations here.
To take this even further, I also brought up simply disallowing half quoted patterns like remote:"index"
altogether; more specifically, quotes and the cluster string would be mutually exclusive. More specifically, this would mean that either
- there is an unquoted cluster string and an unquoted index pattern and the index pattern is interpreted as a singular index pattern (no splitting at commas)
- XOR
- there is a single, quoted string and cluster string is null; then we first have to split at commas to get individual index patterns, resp. cluster string + index pattern pairs.
This more extreme suggestion would get rid of a bunch of edge cases that we need to cover here. E.g.remote":index, other_remote:other_index"
would simply not occur, ever.
Both suggestions would avoid ambiguity from e.g. remote":index,other_index"
as to whether the remote
is supposed to apply to the first or both indices. They should also cut down on the finicky edge case handling that this PR currently has to implement. (Sorry about that, this looks super painful!)
I think we should settle whether we want to go with these suggestions or not before moving forward, as they would also affect the necessary testing and associated headaches. Since ESQL+CCS is not yet GA, we can only put these restrictions in place right now. Lifting them later is possible, but adding them later would be breaking.
index = splitPattern.v1(); | ||
|
||
if (clusterString != null) { | ||
if (selectorString != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heya, this is a bit out of this PR's scope, but the selectorString is treated as if it applied to every comma-split pattern.
Not a big deal - in #122651 (comment) James suggested that comma-split patterns in quotes just shouldn't mix selectors, like FROM "index1,index2"::data
. This should be easy to guard against in a follow-up at a later time. /cc @jbaiera .
// throws exception if the selector expression is invalid. Selector resolution does not complain about exclusions | ||
throw new ParsingException(e, source(ctx), e.getMessage()); | ||
} | ||
hasSeenStar = hasSeenStar || index.contains(WILDCARD); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit and out of scope: This is pre-existing, but I think this may be redundant, as we scan the whole index pattern for wildcards already up here.
clusterAndIndexAsIndexPattern(command, "cluster*:*"); | ||
clusterAndIndexAsIndexPattern(command, "*:index*"); | ||
clusterAndIndexAsIndexPattern(command, "*:*"); | ||
expectError("FROM \"cluster:index|pattern\"", "Invalid index name [index|pattern], must not contain the following characters"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be
expectError("FROM \"cluster:index|pattern\"", "Invalid index name [index|pattern], must not contain the following characters"); | |
expectError(command + " \"cluster:index|pattern\"", "Invalid index name [index|pattern], must not contain the following characters"); |
to also include TS
in the tests.
Same below.
assertStringAsIndexPattern("cluster:index|pattern", command + " cluster:\"index|pattern\""); | ||
assertStringAsIndexPattern("*:index|pattern", command + " \"*:index|pattern\""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you added these with an expectError
below, I think you changed the quoting. Let's add these test cases back in with the original quoting, in addition to the ones you added.
Additionally, these cases are only invalid because of the |
character in the index pattern. Let's add these test cases back in with valid index patterns, e.g.
assertStringAsIndexPattern("*:index", command + " \"*:index\"");
assertStringAsIndexPattern("*:index*pattern", command + " \"*:index*pattern\"");
assertStringAsIndexPattern("*:index", command + " \"*:index\"");
assertStringAsIndexPattern("*:index*pattern", command + " \"*:index*pattern\"");
@@ -488,15 +489,12 @@ public void testStringAsIndexPattern() { | |||
clusterAndIndexAsIndexPattern(command, "cluster:index"); | |||
clusterAndIndexAsIndexPattern(command, "cluster:.index"); | |||
clusterAndIndexAsIndexPattern(command, "cluster*:index*"); | |||
clusterAndIndexAsIndexPattern(command, "cluster*:<logstash-{now/D}>*");// this is not a valid pattern, * should be inside <> | |||
clusterAndIndexAsIndexPattern(command, "cluster*:<logstash-{now/D}*>"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pawankartik-elastic , it's hard for me to validate that the case FROM + cluster*:<logstash-{now/D}>*
and cluster*:<logstash-{now/D}*>
is still tested. Could you point me to where exactly this is tested, or just add these back in with an appropriate expectError
just so human readers can easily find this case covered? (Maybe it's covered below in testInvalidPatternsWithIntermittentQuotes
, but that's harder to parse for humans.)
Actually, I just tested these two expressions and except for the fact that it should be a lower case d
, these come through as valid just fine. So I think these cases can be left in place but need to be slightly fixed:
clusterAndIndexAsIndexPattern(command, "cluster*:<logstash-{now/d}>*");// this is not a valid pattern, * should be inside <>
clusterAndIndexAsIndexPattern(command, "cluster*:<logstash-{now/d}*>");
clustersAndIndices(command, "index*", "-index#pattern"); | ||
clustersAndIndices(command, "*", "-<--logstash-{now/M{yyyy.MM}}>"); | ||
clustersAndIndices(command, "index*", "-<--logstash#-{now/M{yyyy.MM}}>"); | ||
expectInvalidIndexNameErrorWithLineNumber(command, "*, index#pattern", lineNumber, "index#pattern", "must not contain '#'"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fang-xing-esql , are you sure this is a behavior change? I just ran this against main and got an invalid index name exception with a 400:
curl -u elastic:password -X POST 'localhost:9200/_query?format=txt' -H 'Content-Type: application/json' '-d 16:36
{
"query": "from *, index#pattern"
}
'
{"error":{"root_cause":[{"type":"invalid_index_name_exception","reason":"Invalid index name [index#pattern], must not contain '#'","index_uuid":"_na_","index":"index#pattern"}],"type":"parsing_exception","reason":"line 1:9: Invalid index name [index#pattern], must not contain '#'","caused_by":{"type":"invalid_index_name_exception","reason":"Invalid index name [index#pattern], must not contain '#'","index_uuid":"_na_","index":"index#pattern"}},"status":400}%
Both added tests here seem to behave the same way as on main in my short manual test.
The code is non-trivial so I may not have read this correctly, but it appears to me that the behavior of the validation for non-remote index patterns remains consistent with this change.
clustersAndIndices(command, "*", "-index#pattern::data"); | ||
clustersAndIndices(command, "*", "-index#pattern::data"); | ||
clustersAndIndices(command, "index*", "-index#pattern::data"); | ||
clustersAndIndices(command, "*", "-<--logstash-{now/M{yyyy.MM}}>::data"); | ||
clustersAndIndices(command, "index*", "-<--logstash#-{now/M{yyyy.MM}}>::data"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this being removed? It doesn't fail for me when adding it back in.
We should probably fail this due to the #
, I believe, but I'd rather add a comment like "this should actually fail due to the #
", so that @jbaiera can maybe come back to this as part of the selectors work.
Another good thing to do would be to add a correct positive version of these cases, e.g.
clustersAndIndices(command, "*", "-index::data");
clustersAndIndices(command, "*", "-index*pattern::data");
...
Consider the queries with more than 1 index seperator (not valid as per our current spec):
FROM remote:"foo:bar"
, and,FROM "remote:foo:bar:baz"
As of now, both the queries return empty results when
POST
ing to/_query?format=txt
. This happens because they're syntactically valid (but not semantically) and the subsequent validation makes no attempt to discard them or error out. This PR fixes this behaviour by failing such queries at the parsing stage. There was an initial concern surrounding the licensing and security aspects but since we're failing fast at the parsing stage and aren't allowing the queries "through", we should be fine. Once malformed queries are discarded, subsequent queries should go through the validation steps fine.Addresses #121901.
Considerations:
Should theChanges have been pulled into my branch and accommodated.::
operator be accounted for (because the respective PR is yet to be merged)?This is a bug fix. How far should it be backported (keeping in mind theLooking good to go back as far as 8.19.::
operator)?What's the right team to tag? SF/CCS or Analytics/ES|QL?