
ESQL: Compute engine support for tagged queries #128521


Merged: 6 commits into elastic:main on May 29, 2025

Conversation

nik9000 (Member) commented May 27, 2025

Begins adding support for running "tagged queries" to the compute engine. Here, it's just the `LuceneSourceOperator` because that's useful and contained.

Example time! Say you are running:

```
FROM foo
| STATS MAX(v) BY ROUND_TO(g, 0, 100, 1000, 100000)
```

It's *often* faster to run this as four queries:

* The docs that round to `0`
* The docs that round to `100`
* The docs that round to `1000`
* The docs that round to `100000`

This creates an ESQL operator that can run these queries, one after the other, and attach those tags to the rows each query returns.

Aggs use this trick, and it's *way* faster when they can push down count queries, but it's still faster even when they only push the doc-loading parts. This implementation in `LuceneSourceOperator` is quite similar to the doc-loading version in `_search`.

I don't have performance measurements yet because I haven't plugged this into the language. In `_search` we call this `filter-by-filter` and enable it when each group averages more than 5000 documents and when there isn't a `_doc_count` field; outside those cases it's faster not to push. I expect we'll land somewhere pretty similar.
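To make the mechanics concrete, here is a minimal, self-contained sketch of the idea in plain Java. It is *not* the operator added in this PR and it doesn't touch Lucene; `TaggedQuery`, `Row`, and the bucket predicates are hypothetical stand-ins, chosen only to show queries running one after the other with their tags attached to every row they emit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongPredicate;

public class TaggedQueriesSketch {
    // Hypothetical stand-in for a query plus the tag values it contributes.
    record TaggedQuery(LongPredicate matches, List<Object> tags) {}

    // Hypothetical stand-in for a row flowing to the downstream STATS.
    record Row(long v, List<Object> tags) {}

    public static void main(String[] args) {
        long[] g = { 3, 120, 950, 4_200, 250_000 }; // grouping field
        long[] v = { 10, 20, 30, 40, 50 };          // value that MAX(v) aggregates

        // One query per ROUND_TO(g, 0, 100, 1000, 100000) bucket, each carrying its tag.
        List<TaggedQuery> queries = List.of(
            new TaggedQuery(x -> x < 100, List.<Object>of(0L)),
            new TaggedQuery(x -> x >= 100 && x < 1_000, List.<Object>of(100L)),
            new TaggedQuery(x -> x >= 1_000 && x < 100_000, List.<Object>of(1_000L)),
            new TaggedQuery(x -> x >= 100_000, List.<Object>of(100_000L))
        );

        // Run the queries one after the other, attaching each query's tag to its rows,
        // so STATS can group by the tag instead of evaluating ROUND_TO per document.
        List<Row> rows = new ArrayList<>();
        for (TaggedQuery query : queries) {
            for (int i = 0; i < g.length; i++) {
                if (query.matches().test(g[i])) {
                    rows.add(new Row(v[i], query.tags()));
                }
            }
        }
        rows.forEach(System.out::println);
    }
}
```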

nik9000 added the >non-issue, auto-backport, :Analytics/ES|QL, v8.19.0, and v9.1.0 labels on May 27, 2025
elasticsearchmachine (Collaborator)

Pinging @elastic/es-analytical-engine (Team:Analytics)

elasticsearchmachine added the Team:Analytics label on May 27, 2025
nik9000 (Member Author) commented May 27, 2025

This should also work well for things like:

```
FROM foo
| STATS MAX(v) BY a > 10
```

With an extension to this PR that enables this behavior for MAX and COUNT and friends, we could push really simple queries like the one above all the way to Lucene. The trick is to figure out exactly what that should look like from an execution standpoint. This PR was "easier" to model.
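As a hedged illustration of what that could start from (assuming `a` is indexed as a Lucene long point and ignoring the null group for docs missing `a`; the class and method names here are illustrative, not this PR's API), the boolean grouping splits into two tagged Lucene queries, one per group:

```java
import java.util.List;
import java.util.Map;

import org.apache.lucene.document.LongPoint;
import org.apache.lucene.search.Query;

public class TaggedBooleanGroups {
    // Hypothetical helper: one Lucene query per group of `STATS ... BY a > 10`,
    // keyed by the tag row that would be attached to its results.
    public static Map<List<Object>, Query> taggedQueries() {
        Query gt10 = LongPoint.newRangeQuery("a", 11L, Long.MAX_VALUE); // a > 10
        Query le10 = LongPoint.newRangeQuery("a", Long.MIN_VALUE, 10L); // a <= 10
        return Map.of(
            List.<Object>of(Boolean.TRUE), gt10,
            List.<Object>of(Boolean.FALSE), le10
        );
    }
}
```

With queries like these, `MAX(v)` or `COUNT(*)` per group becomes a per-query problem that Lucene can often answer without visiting every document.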

nik9000 (Member Author) commented May 27, 2025

The test failure was quite real - it came from attempting to reuse the scorer from one slice with a different query. I'll push a fix.

dnhatn self-requested a review on May 27, 2025 at 22:11
```diff
-if (currentScorer == null || currentScorer.leafReaderContext() != leaf) {
+if (currentScorer == null // First time
+    || currentScorer.leafReaderContext() != leaf // Moved to a new leaf
+    || currentScorer.weight != currentSlice.weight() // Moved to a new query
```
Member Author

It took most of a day to figure out that I needed this last bit of the if statement, but the tests caught it.

```java
/**
 * Tags to add to the data returned by this query.
 */
List<Object> tags() {
```
Member Author

I'm not entirely sure `Object` is the right thing. It works, but we might want `Supplier<Block>` or something more specific. But for now this is good enough.

Member

Yes, the object list can provide a better debugging message, but the block supplier might be better; otherwise, we would need to provide the exact boxed type for numeric values.

Member Author

Yeah, getting the boxed type perfect could be tricky. Suppliers are quite explicit. Let's keep it as is for now and rework when we find a rough edge.
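For what it's worth, here is a tiny plain-Java sketch of the boxed-type concern above (this is not the compute engine's `Block` API): a numeric tag stored as `Object` has to be boxed as exactly the type the consumer expects.

```java
import java.util.List;

public class BoxedTagTypes {
    public static void main(String[] args) {
        List<Object> intTag = List.<Object>of(100);   // the int literal 100 boxes to Integer
        List<Object> longTag = List.<Object>of(100L); // 100L boxes to Long

        // A consumer building a long-valued column accepts only Long tags,
        // so the Integer-boxed tag would fail a (Long) cast at runtime.
        System.out.println(intTag.get(0) instanceof Long);  // false
        System.out.println(longTag.get(0) instanceof Long); // true
    }
}
```

A `Supplier<Block>` would sidestep this by handing over a ready-made block instead of a value whose boxing has to be exact.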

dnhatn (Member) left a comment

I thought it was big, so I delayed the review until the end of my day, but it's just the first part. Sorry about that. LGTM! Thanks, Nik.

```diff
@@ -121,6 +120,9 @@ protected Page getCheckedOutput() throws IOException {
         if (scorer == null) {
             remainingDocs = 0;
         } else {
+            if (scorer.tags().isEmpty() == false) {
```
Member

I think we can leverage this and min/max later too.

Member Author

++
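As a rough sketch of what leveraging counts per tagged query could look like (plain Lucene, not this PR's operator; `taggedQueries` is a hypothetical map from a tag row to the Lucene query for that group), each group's count can be answered with `IndexSearcher#count` without loading any documents:

```java
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class TaggedCountSketch {
    // Hypothetical: answer COUNT per tagged query instead of loading documents.
    static Map<List<Object>, Long> countPerTag(
        IndexSearcher searcher,
        Map<List<Object>, Query> taggedQueries
    ) throws IOException {
        Map<List<Object>, Long> counts = new LinkedHashMap<>();
        for (Map.Entry<List<Object>, Query> entry : taggedQueries.entrySet()) {
            counts.put(entry.getKey(), (long) searcher.count(entry.getValue()));
        }
        return counts;
    }
}
```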

nik9000 (Member Author) commented May 29, 2025

> I thought it was big, so I delayed the review until the end of my day, but it's just the first part. Sorry about that. LGTM! Thanks, Nik.

Right! I tried to do the next bit and it got big so I put that down.

nik9000 merged commit 1b151ed into elastic:main on May 29, 2025
18 checks passed
elasticsearchmachine (Collaborator)

💔 Backport failed

Branch 8.19: Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running `backport --upstream elastic/elasticsearch --pr 128521`.

nik9000 added a commit to nik9000/elasticsearch that referenced this pull request May 29, 2025
nik9000 (Member Author) commented May 29, 2025

backport: #128638

nik9000 (Member Author) commented Jun 3, 2025

Backported with #128638

joshua-adams-1 pushed a commit to joshua-adams-1/elasticsearch that referenced this pull request Jun 3, 2025
Samiul-TheSoccerFan pushed a commit to Samiul-TheSoccerFan/elasticsearch that referenced this pull request Jun 5, 2025
nik9000 added a commit that referenced this pull request Jun 11, 2025