
ESQL: Heuristics to pick efficient partitioning #125739


Merged

nik9000 merged 32 commits into elastic:main on Apr 11, 2025

Conversation

nik9000
Member

@nik9000 nik9000 commented Mar 26, 2025

Adds heuristics to pick an efficient partitioning strategy based on the index and rewritten query. This speeds up some queries by throwing more cores at the problem:

```
FROM test | STATS SUM(b)

Before: took: 31 CPU: 222.3%
 After: took: 15 CPU: 806.9%
```

It also lowers the overhead of simpler queries by throwing fewer cores at the problem when it won't really speed anything up:

```
FROM test

Before: took: 1 CPU: 48.5%
 After: took: 1 CPU: 70.4%
```

We have had a `pragma` to control our data partitioning for a long time; this change just looks at the query to pick a partitioning scheme. The partitioning options:

* `shard`: use one core per shard
* `segment`: use one core per large segment
* `doc`: break each shard into as many slices of documents as there are cores

`doc` is the fastest, but has a lot of overhead, especially for complex Lucene queries. `segment` is fast, but doesn't make the most of the CPUs when there are few segments. `shard` has the lowest overhead.
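
To make the tradeoff concrete, here is a rough, hypothetical sketch (not the actual Elasticsearch code) of how many parallel slices one shard might yield under each strategy; the `SMALL_SEGMENT_DOCS` cutoff and all names are invented for illustration:

```
import java.util.List;

enum Partitioning { SHARD, SEGMENT, DOC }

class SliceCountSketch {
    // Hypothetical cutoff: segments smaller than this share a single slice.
    static final int SMALL_SEGMENT_DOCS = 100_000;

    /** Roughly how many parallel slices one shard yields under each strategy. */
    static int sliceCount(Partitioning strategy, List<Integer> segmentDocCounts, int cpus) {
        return switch (strategy) {
            case SHARD -> 1; // the whole shard is one unit of work
            case SEGMENT -> {
                long large = segmentDocCounts.stream().filter(d -> d >= SMALL_SEGMENT_DOCS).count();
                long small = segmentDocCounts.size() - large;
                yield (int) large + (small > 0 ? 1 : 0); // big segments get their own slice, small ones share
            }
            case DOC -> cpus; // split the shard's documents into one range per core
        };
    }

    public static void main(String[] args) {
        List<Integer> segments = List.of(5_000_000, 1_200_000, 40_000, 12_000);
        for (Partitioning p : Partitioning.values()) {
            System.out.println(p + " -> " + sliceCount(p, segments, 8) + " slices");
        }
    }
}
```

On an 8-core box with those four segments the sketch prints 1, 3, and 8 slices - which is the shape of the tradeoff described above.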

Previously we always used `segment` partitioning because it avoids the worst of the overhead while still being fast. With this change we use `doc` when the top-level query matches all documents - those queries have very, very low overhead even with `doc` partitioning. That's the twice-as-fast example above.

This also uses `shard` partitioning for queries that don't have to do much work, like `FROM foo` or `FROM foo | LIMIT 1` or `FROM foo | SORT a`. That's the lower-CPU example above.

This choice is made very late, on the data node. So queries like this:

```
FROM test | WHERE @timestamp > "2025-01-01T00:00:00Z" | STATS SUM(b)
```

can also use `doc` partitioning when all documents are after the timestamp and all documents have `b`.
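
Putting the pieces together, here is a minimal, hypothetical sketch of the heuristic as described above - the names (`chooseStrategy`, `pipelineDoesRealWork`) are invented and this is not the PR's actual code, just the decision order it describes:

```
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.MatchNoDocsQuery;
import org.apache.lucene.search.Query;

enum Partitioning { SHARD, SEGMENT, DOC }

class AutoPartitioningSketch {
    /**
     * Pick a partitioning strategy from the rewritten Lucene query and whether
     * the rest of the pipeline does real per-document work (aggregations, etc.).
     */
    static Partitioning chooseStrategy(Query rewritten, boolean pipelineDoesRealWork) {
        if (pipelineDoesRealWork == false) {
            return Partitioning.SHARD;   // e.g. FROM foo or FROM foo | LIMIT 1: extra cores don't help
        }
        if (rewritten instanceof MatchNoDocsQuery) {
            return Partitioning.SHARD;   // nothing to find, so don't fork at all
        }
        if (rewritten instanceof MatchAllDocsQuery) {
            return Partitioning.DOC;     // match-all stays cheap even when split per core
        }
        return Partitioning.SEGMENT;     // the old default for everything else
    }
}
```

Because the choice happens after rewriting, the timestamp filter above can collapse to a match-all query by the time this runs.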

@nik9000 nik9000 requested a review from dnhatn March 26, 2025 22:58
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Mar 26, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine
Collaborator

Hi @nik9000, I've created a changelog YAML for you.

@nik9000
Member Author

nik9000 commented Mar 26, 2025

I think this needs more testing, but I'm going to open this so folks can have a look at it in the meantime. I think I'd like to see more examples of queries and how they get rewritten - I think real integration tests.

@nik9000
Member Author

nik9000 commented Mar 26, 2025

I use a fixed limit of 250,000 for "small" indices and have them stick with shard partitioning. Those small indices used to use segment partitioning but the segments were all stuck into one partition. Honestly either of those two would get the same outcome for small indices - but I wanted to make sure little indices didn't get doc partitioning and fork into a zillion threads.
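
As a rough sketch of that guard (the 250,000 figure is from the comment above; the names are invented):

```
class SmallShardGuardSketch {
    // Shards with fewer documents than this stick with SHARD partitioning.
    static final long SMALL_SHARD_DOC_LIMIT = 250_000;

    static boolean forceShardPartitioning(long shardDocCount) {
        // Forking a tiny shard into per-core slices costs more than it saves.
        return shardDocCount < SMALL_SHARD_DOC_LIMIT;
    }
}
```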

@nik9000
Member Author

nik9000 commented Mar 27, 2025

The way I got the performance numbers I mentioned was on my laptop:

```
curl -XDELETE -uelastic:password -HContent-Type:application/json localhost:9200/test
rm -rf /tmp/bulk
for a in {1..100}; do
  for b in {1..1000}; do
    echo '{"index": {}}' >> /tmp/bulk
    echo '{"@timestamp": "2025-01-01T00:00:00.'$a'Z", "b": '$b'}' >> /tmp/bulk
  done
done
echo >> /tmp/bulk
for i in {1..100}; do
  echo -n $i:
  curl -s -XPOST -uelastic:password -HContent-Type:application/json localhost:9200/test/_bulk?pretty --data-binary @/tmp/bulk | grep errors
done
curl -XPOST -uelastic:password -HContent-Type:application/json localhost:9200/test/_refresh

while curl -s -XPOST -uelastic:password -HContent-Type:application/json localhost:9200/_query?pretty -d'{
  "query": "FROM test"
}' | grep took; do echo ok; done
```

It's not the most scientific measure, but it only takes a few seconds.

You can reproduce the old way with the new code by sending:

```
while curl -s -XPOST -uelastic:password -HContent-Type:application/json localhost:9200/_query?pretty -d'{
  "query": "FROM test",
  "pragma": { "data_partitioning": "segment" }
}' | grep took; do echo ok; done
```

If you are in a production release you'd need to send `accept_pragma_risks` as well.

I'm going to have more testing thoughts today. And more profile thoughts. The change can certainly bring some advantages, but there's a risk that it'll make ESQL slower in some cases and add more overhead in others. On a fully loaded cluster, throwing more CPUs at the problem isn't useful. Also - if we don't record these decisions somewhere it'll be hard to reverse engineer them when debugging.

@nik9000
Member Author

nik9000 commented Mar 27, 2025

Initial reports from rally on the http_logs track, still just messing around with bash:

```
FROM logs* | STATS SUM(size)
Without: "took" : 751
With: "took" : 459
```

Edit: _search is very fast at this one - clocking in at around the same speed as ESQL with this change.

@nik9000
Member Author

nik9000 commented Mar 27, 2025

Here's another one:

```
FROM logs* | STATS SUM(size) BY status
_search:  "took" : 3976
without:  "took" : 2326
with:     "took" : 1517
```

@nik9000
Member Author

nik9000 commented Mar 27, 2025

This is looking pretty good and I think I have the tests I was hoping for now. I'll merge main back into this and shop it around for reviews. The only person who has loaded into their head the state needed to review this is @dnhatn, but others should look too and have thoughts.

Member

@dnhatn dnhatn left a comment

Looks great! I really like how we handle partitioning per target shard with this PR instead of using the same partitioning for all target shards. Thanks, Nik!


```
/**
 * How we partition the data across {@link Driver}s. Each request forks into
 * {@code min(cpus, partition_count)} threads on the data node. More partitions
```

Member

Unfortunately, we currently have to use min(1.5 * cpus, partition_count) because drivers consist of both I/O and CPU-bound operations.
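
For illustration, that formula works out to something like the following (a hedged sketch; the method name is invented):

```
class DriverCountSketch {
    /**
     * Drivers mix I/O-bound and CPU-bound operations, so a request forks into
     * a few more drivers than there are cores: min(1.5 * cpus, partition_count).
     */
    static int driverCount(int cpus, int partitionCount) {
        int taskConcurrency = (int) (1.5 * cpus);
        return Math.min(taskConcurrency, partitionCount);
    }
}
```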

Member Author

I'll update the comment.

```
 * <ul>
 * <li>
 * If the {@linkplain Query} matches <strong>no</strong> documents then this will
 * use the {@link LuceneSliceQueue.PartitioningStrategy#SEGMENT} strategy so we
```

@dnhatn dnhatn Mar 28, 2025

I think you meant PartitioningStrategy#SHARD for MatchNoDocsQuery.

Member Author

👍

```
return LuceneSliceQueue.PartitioningStrategy.SEGMENT;
}
LuceneSliceQueue.PartitioningStrategy forClause = highSpeedAutoStrategy(c.query());
if (forClause == LuceneSliceQueue.PartitioningStrategy.SHARD) {
```

@dnhatn dnhatn Mar 28, 2025

nit: maybe replace these two if statements with:

```
if (forClause != LuceneSliceQueue.PartitioningStrategy.DOC) {
    return forClause;
}
```

@dnhatn
Member

dnhatn commented Mar 28, 2025

@nik9000 I benchmarked this change with the nyc_taxis track. Some tasks are much faster, while some are slower. I haven't looked into the details yet but wanted to share it with you here.

auto_partitioning.txt

@nik9000
Member Author

nik9000 commented Mar 28, 2025

Off topic: I'd really like to add a cluster setting to set the default value. Escape hatch.

> @nik9000 I benchmarked this change with the nyc_taxis track. Some tasks are much faster, while some are slower. I haven't looked into the details yet but wanted to share it with you here.
>
> auto_partitioning.txt

@dnhatn, thanks so much for running that! I was looking around for stuff to run and kept looking at noaa. I've totally spaced on nyc_taxis.

```
|                                                                     | Before ms | After ms  |  Diff ms  |   Diff %  |
Yay!
| <this one isn't esql>               avg_passenger_count_aggregation |  643.085  |  212.301  | -430.784  |   -66.99% |
|                       avg_passenger_count_esql_segment_partitioning |  653.07   |  320.15   | -332.92   |   -50.98% |
|                           avg_passenger_count_esql_doc_partitioning |  680.792  |  285.572  | -395.22   |   -58.05% |
|                    count_group_by_keyword_esql_segment_partitioning | 1758.65   | 1168.6    | -590.05   |   -33.55% |
|                                sort_by_ts_esql_segment_partitioning |  142.969  |  110.627  |  -32.342  |   -22.62% |
Oof
|              avg_passenger_count_filtered_esql_segment_partitioning |   39.7524 |   50.6895 |   10.937  |   +27.51% |
|                           avg_tip_percent_esql_segment_partitioning | 2198.35   | 2553.63   |  355.279  |   +16.16% |
| <this one isn't esql>       avg_amount_group_by_integer_aggregation | 2696.23   | 3012.64   |  316.411  |   +11.74% |
| <this one isn't esql>                           multi_terms-keyword |  357.956  |  728.421  |  370.464  |  +103.49% |
|                       multi_terms-keyword_esql_segment_partitioning |  811.621  | 1284.6    |  472.983  |   +58.28% |
| <this one isn't esql>                       composite_terms-keyword |  132.631  |  212.529  |   79.898  |   +60.24% |
|date_histogram_fixed_interval_with_metrics_esql_segment_partitioning | 1749.35   | 1767.31   |   17.961  |    +1.03% |
|                                                 esql_enrich_control |   10.2438 |   12.1602 |    1.9163 |   +18.71% |
|                                           esql_stats_enrich_control |   12.3096 |   14.5346 |    2.225  |   +18.08% |
```

Everything is the 90th percentile service time.

IMPORTANT: When the name says segment_partitioning that actually means default_partitioning. This PR changes the
default from segment_partitioning to auto. The name made sense when we wrote the perf thing, but a while after
I merge this we should probably rename them. Not right when we merge this, though, because it'd screw up the nice
graphs and I want to see them.

Maybe the most interesting thing is: why the changes in avg_passenger_count_aggregation, multi_terms-keyword
and composite_terms-keyword. Those aren't even ESQL. avg_passenger_count_aggregation got faster. Why?
Spooky JVM action at a distance? Different inline choices? Different data layout? Ghosts?

Why did avg_passenger_count_esql_doc_partitioning change? The path through ESQL shouldn't have changed?

Here's a 100th percentile service time that looks important:

```
| esql_enrich_payments_fares |   21.9655      |   62.9121      |   40.9466  |     ms |  +186.41% |
```

What's up with that?

I think I'll need to spin up a cluster with this change, run the benchmarks, and leave the cluster up. Then do
another one without the change. And just poke them until knowledge comes out.

@dnhatn
Member

dnhatn commented Mar 28, 2025

> Off topic: I'd really like to add a cluster setting to set the default value. Escape hatch.

+1.

@nik9000
Member Author

nik9000 commented Mar 28, 2025

I've reproduced this and got fairly different numbers. Let's check one:

```
| avg_passenger_count_esql_segment_partitioning | 292.798 | 267.412 | -25.3858 | -8.67% |
```

@dnhatn got a very believable 50% speed boost.

I let the cluster keep running after the tests finished. So I looked up avg_passenger_count_esql_segment_partitioning and ran it in bash:

```
while time curl -s -HContent-Type:application/json -u<redacted> -k https://elasticsearch-0:9200/_query?pretty -d'{
    "query": "FROM nyc_taxis | STATS AVG(passenger_count)"
}' | grep took; do echo ok; done
```

on them.

Before:

  "took" : 557,

real	0m0.576s
user	0m0.012s
sys	0m0.003s

After:

  "took" : 328,

real	0m0.344s
user	0m0.011s
sys	0m0.003s

Now I get the performance boost. More ghosts?

@nik9000
Member Author

nik9000 commented Mar 28, 2025

Checking one that got a lot worse for Nhat, multi_terms-keyword_esql_segment_partitioning. Here it is for me:

```
| 389.288 | 381.213 | -8.07479 | ms | -2.07% |
```

I think that's just measurement noise from the cloud environments. When I hit the new code with a profile I get:

              "partitioning_strategies" : {
                "nyc_taxis:0" : "SEGMENT"
              }

Which is telling me that the heuristics picked the old partitioning scheme.

But! When I run it in bash the new code is faster. What wild magic is this?

@nik9000
Member Author

nik9000 commented Mar 28, 2025

The disk layouts are different...
My "after" cluster:

```
~$ curl -s -HContent-Type:application/json -u<redacted> -k https://elasticsearch-0:9200/_cat/segments
nyc_rate_codes    0 p 10.128.0.106 _0    0        7 0     5kb 0 true true 10.1.0 true
nyc_vendors       0 p 10.128.0.106 _0    0        2 0   4.3kb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.106 _ga 586  1105477 0   178mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.106 _gg 592  1094567 0 176.9mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.106 _h1 613  1107542 0 177.4mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.106 _h4 616  1974330 0 307.5mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.106 _ht 641   186747 0  28.9mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.106 _hx 645  2375604 0 377.9mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.106 _ih 665  1202197 0 188.1mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.106 _iv 679   172274 0  27.7mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.106 _j6 690   711170 0 112.1mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.106 _jj 703   297384 0  46.9mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.106 _jk 704   301975 0  47.7mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.106 _jn 707   316334 0  49.7mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.106 _jo 708   224202 0  35.9mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.106 _jt 713   241810 0  36.1mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.106 _ju 714   210260 0  31.6mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.106 _jv 715   254153 0    38mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.106 _jw 716   139831 0  20.7mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.106 _jx 717   605755 0  93.3mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.106 _jy 718  3155898 0 496.6mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.106 _k0 720  9334816 0   1.4gb 0 true true 10.1.0 false
nyc_taxis         0 p 10.128.0.106 _k1 721 12459971 0   1.9gb 0 true true 10.1.0 false
nyc_taxis         0 p 10.128.0.106 _k2 722 31490469 0     5gb 0 true true 10.1.0 false
nyc_taxis         0 p 10.128.0.106 _k3 723 32135920 0   5.1gb 0 true true 10.1.0 false
nyc_taxis         0 p 10.128.0.106 _k4 724 32074874 0   5.1gb 0 true true 10.1.0 false
nyc_taxis         0 p 10.128.0.106 _k5 725 32173132 0   5.1gb 0 true true 10.1.0 false
nyc_payment_types 0 p 10.128.0.106 _0    0        5 0   4.8kb 0 true true 10.1.0 true
nyc_trip_types    0 p 10.128.0.106 _0    0        2 0   4.4kb 0 true true 10.1.0 true
```

My "before" cluster:

```
$ curl -s -HContent-Type:application/json -u<redacted> -k https://elasticsearch-0:9200/_cat/segments
nyc_rate_codes    0 p 10.128.0.108 _0    0        7 0     5kb 0 true true 10.1.0 true
nyc_vendors       0 p 10.128.0.108 _0    0        2 0   4.3kb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.108 _5p 205  3316178 0 535.9mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.108 _7f 267  4543740 0 731.7mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.108 _8c 300  2755025 0 443.4mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.108 _ay 394  5124240 0 825.3mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.108 _bw 428  5471534 0 878.8mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.108 _c5 437  4369680 0 700.7mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.108 _dn 491  5517900 0 881.6mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.108 _ff 555  6392014 0     1gb 0 true true 10.1.0 false
nyc_taxis         0 p 10.128.0.108 _go 600  8006979 0   1.2gb 0 true true 10.1.0 false
nyc_taxis         0 p 10.128.0.108 _ie 662   795955 0 126.9mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.108 _j7 691   912523 0 144.5mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.108 _jo 708  4712197 0 756.8mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.108 _jv 715   992713 0 158.8mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.108 _k1 721   626723 0  98.8mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.108 _k8 728  5763942 0 924.6mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.108 _km 742   566944 0  90.1mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.108 _kw 752  8017925 0   1.2gb 0 true true 10.1.0 false
nyc_taxis         0 p 10.128.0.108 _lh 773   558428 0  85.6mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.108 _lj 775    24118 0   3.8mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.108 _ll 777    15775 0   2.5mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.108 _lv 787   905816 0 143.3mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.108 _ly 790  1222604 0 189.8mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.108 _mc 804    16899 0   2.6mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.108 _md 805   958221 0 150.2mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.108 _mf 807    26683 0   4.2mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.108 _mi 810   908162 0 146.2mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.108 _mm 814   205676 0  33.5mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.108 _mn 815  2984847 0 478.6mb 0 true true 10.1.0 true
nyc_taxis         0 p 10.128.0.108 _mq 818 10048537 0   1.5gb 0 true true 10.1.0 false
nyc_taxis         0 p 10.128.0.108 _mr 819 15944323 0   2.4gb 0 true true 10.1.0 false
nyc_taxis         0 p 10.128.0.108 _ms 820 31785914 0   5.1gb 0 true true 10.1.0 false
nyc_taxis         0 p 10.128.0.108 _mt 821 31854477 0     5gb 0 true true 10.1.0 false
nyc_payment_types 0 p 10.128.0.108 _0    0        5 0   4.8kb 0 true true 10.1.0 true
nyc_trip_types    0 p 10.128.0.108 _0    0        2 0   4.4kb 0 true true 10.1.0 true
```

@dnhatn
Member

dnhatn commented Apr 1, 2025

> I'm going to make a control for this so folks can disable it. I'd forgotten about this yesterday.

Yes, our lives will be easier with that cluster setting.

@nik9000
Member Author

nik9000 commented Apr 1, 2025

I've pushed a change that lets you override the default using a cluster setting.

I also rejuggled the bool handling - there are a few more cases where we can use DOC and SHARD.
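
A minimal sketch of how a per-request pragma could override a cluster-wide default (all names here are assumptions for illustration, not the PR's actual wiring):

```
import java.util.Locale;

enum Partitioning { SHARD, SEGMENT, DOC, AUTO }

class PartitioningDefaultSketch {
    /**
     * Resolve the partitioning for one request: an explicit data_partitioning
     * pragma wins, otherwise the cluster-level default applies.
     */
    static Partitioning resolve(String pragmaValue, Partitioning clusterDefault) {
        if (pragmaValue != null) {
            return Partitioning.valueOf(pragmaValue.toUpperCase(Locale.ROOT));
        }
        return clusterDefault;  // AUTO after this PR; set it to SEGMENT to restore the old behavior
    }
}
```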

@ChrisHegarty
Contributor

Overall this looks like a great change. Once merged, we should circle back and add some scenarios with full-text search to ensure that they pick the optimal partitioning strategy.

@nik9000
Member Author

nik9000 commented Apr 8, 2025

Robots, I will make you happy with this change!

@nik9000 nik9000 enabled auto-merge (squash) April 9, 2025 21:26
auto-merge was automatically disabled April 9, 2025 22:23

Pull Request is not mergeable

@nik9000
Member Author

nik9000 commented Apr 10, 2025

I had a test bug that seemed to only kick in for serverless and didn't realize it for a few days. In the last build unrelated failures were blocking this. I've merged main again.

@nik9000 nik9000 merged commit 5689dfa into elastic:main Apr 11, 2025
17 checks passed
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request May 6, 2025
Claims a transport version to backport elastic#125739.
Labels
:Analytics/ES|QL AKA ESQL >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v9.1.0