Skip to content

ES|QL: pushdown LOOKUP JOIN past SORT #121884

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
luigidellaquila opened this issue Feb 6, 2025 · 7 comments
Open

ES|QL: pushdown LOOKUP JOIN past SORT #121884

luigidellaquila opened this issue Feb 6, 2025 · 7 comments
Labels
:Analytics/ES|QL AKA ESQL >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)

Comments

@luigidellaquila
Copy link
Contributor

(and potentially other commands)

Some queries cannot be executed due to unbounded sort.

Eg.

languages mapping:

{
  "mappings": {        
        "properties": {"language_code": {"type": "integer"}, "name": {"type": "keyword"}}}, "settings": {"index": {"mode": "lookup"}}
}
row language_code = 1, name = "foo" 
| sort name 
| lookup join languages on language_code 
| where name == "English"

will result in a verification_exception.

Some cases though, could be fixed by pushing LOOKUP JOIN past SORT.

When there is attribute masking it's particularly challenging, as we'll have to keep a copy of the original column with a different name and then discard it, eg. the above query could be translated to

row language_code = 1, name = "foo" 
| RENAME name AS $$name 
| lookup join languages on language_code 
| sort $$name 
| DROP $$name
| where name == "English"
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Feb 6, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@alex-spies
Copy link
Contributor

alex-spies commented Feb 7, 2025

Oh no, I realized there's a problem with this approach.

That is: Currently, after the JOIN/MV_EXPAND, the rows that come from multiple matches/expanding a multivalue are consecutive. Is this part of the contract for JOIN/MV_EXPAND?

If it is, then the approach of pushing down past a sort cannot be correct, as sorting after will break up consecutive rows that belonged to the same match/MV.

Otherwise, this approach is correct but has another problem - in addition to requiring to be well documented: Some users already remarked that with LOOKUP JOIN there's currently no way to get the behavior that ENRICH has, where multiple matches create a multivalue. And for some use cases that could be important. Which could be solved by a MV_COLLAPSE BY field1, field2 command, that would collapse multiple consecutive rows with the same field1, field2 values into a single row.

If we wanted this MV_COLLAPSE, then the proposed optimization would break it:

...
| SORT x
| MV_EXPAND y
| MV_COLLAPSE BY y // together with the MV_EXPAND before, this will be (mostly) a no-op.

not equivalent to

...
| MV_EXPAND y
| SORT x
| MV_COLLAPSE BY y

Similarly:

...
| SORT x
| LOOKUP JOIN idx ON y
| MV_COLLAPSE BY y // together with the LOOKUP JOIN before, this would achieve nearly ENRICH-like behavior

Not equivalent, though:

...
| LOOKUP JOIN idx ON y
| SORT x
| MV_COLLAPSE BY y

@luigidellaquila
Copy link
Contributor Author

Good catch

Is this part of the contract for JOIN/MV_EXPAND?

I don't think we have a documented behavior about this, but yeah, it seems a natural implicit behavior. IMHO we should not break it, unless there is a real reason

@alex-spies
Copy link
Contributor

I think we need to decide on this contract before LOOKUP JOIN goes GA. I'll take note of it.

@luigidellaquila
Copy link
Contributor Author

Thinking more about this, while for MV_EXPAND it seems natural (one record in, N records out, as they flow), for JOIN this constraint could be a big limitation, especially if we implement generalized JOIN between indexes.
Guaranteeing this order in a distributed join is not trivial. And expensive.

@alex-spies
Copy link
Contributor

@nik9000 pointed out that another solution would be if our TopN implementation was stable. Then, I think pushing down LOOKUP JOIN past a SORT would be correct.

@costin
Copy link
Member

costin commented Apr 30, 2025

For the record, for the near future there will be NO ordering guarantees for a various reasons (many already stated in this thread):

a. it’s a two-way door, meaning we can add ordering guarantees later if we want to. the reverse is not possible
b. sorting in a distributed system is expensive for questionable gains (e.g. how to disable it if not needed)
c. multi-values semantics are not guarantees in ES and while in practice there is ordering, enforcing a loose implementation has no immediate upside.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)
Projects
None yet
Development

No branches or pull requests

4 participants