ESQL: Compute engine support for tagged queries #128521
Conversation
Begins adding support for running "tagged queries" to the compute engine. Here, it's just the `LuceneSourceOperator` because that's useful and contained.

Example time! Say you are running:

```
FROM foo | STATS MAX(v) BY ROUND_TO(g, 0, 100, 1000, 100000)
```

It's *often* faster to run this as four queries:

* The docs that round to `0`
* The docs that round to `100`
* The docs that round to `1000`
* The docs that round to `100000`

This creates an ESQL operator that can run these queries, one after the other, and attach those tags. Aggs uses this trick and it's *way* faster when it can push down count queries, but it's still faster even when it only pushes doc loading. This implementation in `LuceneSourceOperator` is quite similar to the doc loading version in _search.

I don't have performance measurements yet because I haven't plugged this into the language. In _search we call this `filter-by-filter` and enable it when each group averages more than 5000 documents and when there isn't a `_doc_count` field. In the other cases it's faster not to push. I expect we'll be pretty similar.
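To make the example concrete, here is a rough sketch (not the operator's actual code) of how those four queries could be expressed as plain Lucene range queries, assuming `ROUND_TO` rounds each value of `g` down to the nearest listed point and that `g` is a long field. The class and method names below are made up for illustration:

```java
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.search.Query;

import java.util.LinkedHashMap;
import java.util.Map;

class TaggedQueriesSketch {
    /**
     * One Lucene query per ROUND_TO bucket. The map key is the rounding
     * point, which doubles as the tag attached to every page of documents
     * that query produces. Values below the first point are ignored in
     * this sketch.
     */
    static Map<Long, Query> perBucketQueries(String field, long... points) {
        Map<Long, Query> queries = new LinkedHashMap<>();
        for (int i = 0; i < points.length; i++) {
            long lower = points[i];
            // newRangeQuery bounds are inclusive, so stop just short of the next point.
            long upper = i + 1 < points.length ? points[i + 1] - 1 : Long.MAX_VALUE;
            queries.put(lower, LongPoint.newRangeQuery(field, lower, upper));
        }
        return queries;
    }
}
```

Calling `perBucketQueries("g", 0, 100, 1000, 100000)` yields the four queries from the example; the idea is then to run them one after the other instead of running one query grouped four ways.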
Pinging @elastic/es-analytical-engine (Team:Analytics)
This should also work well for things like:
With an extension to this PR that enables this behavior for
The test failure was quite real: it came from attempting to reuse the scorer from one slice with a different query. I'll push a fix.
```diff
-if (currentScorer == null || currentScorer.leafReaderContext() != leaf) {
+if (currentScorer == null // First time
+    || currentScorer.leafReaderContext() != leaf // Moved to a new leaf
+    || currentScorer.weight != currentSlice.weight() // Moved to a new query
```
This last bit of the `if` statement took most of a day to figure out I needed. But tests caught it.
```java
/**
 * Tags to add to the data returned by this query.
 */
List<Object> tags() {
```
I'm not entirely sure `Object` is the right thing. It works, but we might want `Supplier<Block>` or something more specific. But for now this is good enough.
Yes, the object list can provide a better debugging message, but the block supplier might be better; otherwise, we would need to provide the exact boxed type for numeric values.
Yeah, getting the boxed type perfect could be tricky. Suppliers are quite explicit. Let's keep it as is for now and rework when we find a rough edge.
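For context on the boxed-type concern, here is a hypothetical illustration (not the compute engine's actual block API): with `List<Object>` tags, the consumer has to infer a column type from each tag's runtime class, so an `Integer` and a `Long` holding the same value land in different block types, which is the sharp edge a `Supplier<Block>` would avoid.

```java
import org.apache.lucene.util.BytesRef;

class TagTypeSketch {
    /**
     * Hypothetical mapping from a tag's runtime class to a column type name.
     * With a Supplier of blocks the producer would choose the block type
     * explicitly instead of relying on the exact boxed type.
     */
    static String blockTypeFor(Object tag) {
        if (tag instanceof Integer) {
            return "int";
        }
        if (tag instanceof Long) {
            return "long";
        }
        if (tag instanceof BytesRef || tag instanceof String) {
            return "bytes_ref";
        }
        throw new IllegalArgumentException("unsupported tag type: " + tag.getClass());
    }
}
```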
I thought it was big, so I delayed the review until the end of my day, but it's just the first part. Sorry about that. LGTM! Thanks, Nik.
```
@@ -121,6 +120,9 @@ protected Page getCheckedOutput() throws IOException {
    if (scorer == null) {
        remainingDocs = 0;
    } else {
        if (scorer.tags().isEmpty() == false) {
```
I think we can leverage this and min/max later too.
++
Right! I tried to do the next bit and it got big, so I put that down.
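To make the "attach those tags" step from the diff above concrete, here is a hypothetical sketch; the `SketchPage` and `ConstantColumn` records are simplified stand-ins, not the compute engine's real `Page` and `Block` classes. When the current query carries tags, each emitted page grows one constant-valued column per tag:

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for a column holding the same value at every position. */
record ConstantColumn(Object value, int positionCount) {}

/** Simplified stand-in for a page of results; columns are left as plain objects. */
record SketchPage(int positionCount, List<Object> columns) {
    /** Append one constant column per tag so downstream STATS can group by it. */
    SketchPage withTags(List<Object> tags) {
        List<Object> out = new ArrayList<>(columns);
        for (Object tag : tags) {
            out.add(new ConstantColumn(tag, positionCount));
        }
        return new SketchPage(positionCount, out);
    }
}
```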
💔 Backport failed
You can use sqren/backport to manually backport by running
backport: #128638
Backported with #128638 |