ES|QL random sampling #125570

jan-elastic · 2025-03-25T10:02:31Z

No description provided.

cla-checker-service · 2025-03-25T10:02:36Z

💚 CLA has been signed

elasticsearchmachine · 2025-03-25T10:04:21Z

Hi @jan-elastic, I've created a changelog YAML for you.

bpintea

Can we not reuse the SurrogateExpression interface?
We could have SampledCount and SampledSum extending the existing corresponding aggs, which already implement that interface. These subclasses would take the sample as an argument and adjust the result of their superclasses in the way the current proposal does.
The rule swapping the Count and/or Sum nodes for their sampled- equivalents would need to be inserted above SubstituteSurrogates -- it'd only execute once; as PropagateSampleFrequencyToAggs does now.
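
A minimal sketch of the idea in this comment, under stated assumptions: every type below (`Expression`, `Surrogate`, `Count`, `Sum`, `Div`, `SampledCount`, `SampledSum`) is a simplified, hypothetical stand-in and not the real ES|QL class of the same name. It only illustrates how a sampled subclass could carry the probability and rewrite itself into "original aggregate divided by probability" through a surrogate-style hook.

```java
// Hypothetical, simplified stand-ins for illustration; these are NOT the real ES|QL classes.
final class SurrogateSketch {
    /** An expression evaluated over the rows that survived sampling. */
    interface Expression {
        double eval(long[] sampledValues);
    }

    /** Assumed shape of a surrogate contract: an expression that rewrites itself into another one. */
    interface Surrogate {
        Expression surrogate();
    }

    static class Count implements Expression {
        @Override
        public double eval(long[] sampledValues) {
            return sampledValues.length;
        }
    }

    static class Sum implements Expression {
        @Override
        public double eval(long[] sampledValues) {
            long total = 0;
            for (long v : sampledValues) {
                total += v;
            }
            return total;
        }
    }

    /** Divides the result of another expression by a constant. */
    record Div(Expression left, double right) implements Expression {
        @Override
        public double eval(long[] sampledValues) {
            return left.eval(sampledValues) / right;
        }
    }

    /** Count that knows the sample probability and scales the plain count by 1/probability. */
    static final class SampledCount extends Count implements Surrogate {
        private final double probability;

        SampledCount(double probability) {
            this.probability = probability;
        }

        @Override
        public Expression surrogate() {
            return new Div(new Count(), probability);
        }
    }

    /** Sum that knows the sample probability and scales the plain sum by 1/probability. */
    static final class SampledSum extends Sum implements Surrogate {
        private final double probability;

        SampledSum(double probability) {
            this.probability = probability;
        }

        @Override
        public Expression surrogate() {
            return new Div(new Sum(), probability);
        }
    }

    /** Stand-in for a substitution pass: replaces a surrogate with its rewritten form. */
    static Expression substituteSurrogates(Expression e) {
        return e instanceof Surrogate s ? s.surrogate() : e;
    }
}
```

In this sketch a separate, earlier rule would swap `Count` for `SampledCount` exactly once, and the substitution pass would then produce the scaled expression.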

...ava/org/elasticsearch/xpack/esql/optimizer/rules/logical/PropagateSampleFrequencyToAggs.java

bpintea

The new proposal works, but what stands out to me is the fact that the agg functions that need sampling corrections become aware of this, also when that's not needed, i.e. the correcting new agg function's implementation leaks into them.
An alternative would be to make the ApplySampleCorrections rule the "repository of knowledge" about which agg functions need correction and do the substitution there directly, without the use of an interface (HasSampleCorrection), along the lines of:

            if (plan instanceof Aggregate && sampleProbability.get() != null) {
                plan = plan.transformExpressionsOnly(e -> switch (e) {
                    case Count count -> new CountSampleCorrection(count.source(), count.field(), count.filter(), sampleProbability.get());
                    case Sum sum -> new SumSampleCorrection(sum.source(), sum.field(), sum.filter(), sampleProbability.get());
                    default -> e;
                });
                sampleProbability.set(null);
            }

...c/main/java/org/elasticsearch/xpack/esql/optimizer/rules/logical/ApplySampleCorrections.java

nik9000 · 2025-04-11T13:37:36Z

the correcting new agg function's implementation leaks into them.
An alternative would be to make the ApplySampleCorrections rule the "repository of knowledge" about which agg functions need correction and do the substitution there directly, without the use of an interface (HasSampleCorrection), along the lines of:

Aggs themselves are the best place to put "how to correct this agg for sampling", I think. I'm not picky on how we do it, but that's a per-agg decision if I've ever seen one.
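
For contrast with the switch-based rule quoted earlier, here is a minimal sketch of the per-agg approach argued for here. It assumes a hypothetical method `sampleCorrection(double)` on `HasSampleCorrection` (the interface name comes from the discussion, its real signature is not shown in this thread). `Agg`, `Count`, `Max`, and `Scaled` are simplified stand-ins, and `Max` is used purely as an illustration of an aggregate that opts out.

```java
import java.util.List;

// Hypothetical stand-ins; the method name sampleCorrection is an assumption, not the real API.
final class PerAggCorrectionSketch {
    interface Agg {
        String describe();
    }

    /** Implemented only by aggregates that need their result adjusted after sampling. */
    interface HasSampleCorrection {
        Agg sampleCorrection(double probability);
    }

    /** COUNT decides for itself that it must be scaled by 1/probability. */
    record Count(String field) implements Agg, HasSampleCorrection {
        @Override
        public String describe() {
            return "COUNT(" + field + ")";
        }

        @Override
        public Agg sampleCorrection(double probability) {
            return new Scaled(this, probability);
        }
    }

    /** Illustrative aggregate that does not implement the interface and is left untouched. */
    record Max(String field) implements Agg {
        @Override
        public String describe() {
            return "MAX(" + field + ")";
        }
    }

    /** Wraps an aggregate whose result is divided by the sample probability. */
    record Scaled(Agg inner, double probability) implements Agg {
        @Override
        public String describe() {
            return inner.describe() + " / " + probability;
        }
    }

    /** The rule stays generic: it asks each aggregate whether it has a correction. */
    static List<Agg> applyCorrections(List<Agg> aggs, double probability) {
        return aggs.stream()
            .map(a -> a instanceof HasSampleCorrection c ? c.sampleCorrection(probability) : a)
            .toList();
    }
}
```

The trade-off matches the discussion: the rule needs no list of agg types, but each correctable agg becomes aware of sampling.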

elasticsearchmachine · 2025-04-14T08:14:41Z

Pinging @elastic/ml-core (Team:ML)

bpintea

LGTM so far, maybe save for the to-be-decided seed handling.

...in/esql/qa/server/src/main/java/org/elasticsearch/xpack/esql/qa/rest/RestSampleTestCase.java

x-pack/plugin/esql/compute/src/main/java/org/elasticsearch/compute/operator/SampleOperator.java

...gin/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/aggregate/Count.java

...c/main/java/org/elasticsearch/xpack/esql/optimizer/rules/logical/ApplySampleCorrections.java

bpintea · 2025-04-14T18:24:02Z

...main/java/org/elasticsearch/xpack/esql/optimizer/rules/logical/PushDownAndCombineSample.java

+            var probability = combinedProbability(context, sample, rsChild);
+            var seed = combinedSeed(context, sample, rsChild);
+            plan = new Sample(sample.source(), probability, seed, rsChild.child());
+        } else if (child instanceof Enrich


These should be replaced with non-SampleBreaking (and instance of UnaryPlan, though this check should maybe throw, even if in the future we might have Nary nodes that could allow Sample to slide by).

jan-elastic · 2025-04-15

Fixed the first part; let's fix the second part once we have them.

jan-elastic · 2025-04-15

After thinking about it more, we have concluded there are three types of commands:

those that can be swapped (and propagate probability)

those that only propagate probability

those that break sampling

I have updated the code reflecting that, and removed the SampleBreaking interface in doing so. PTAL
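
The three categories above can be sketched as follows. This is an illustration under assumptions, not the code in PushDownAndCombineSample: the `Category` enum, the `Command` record, and all method names are hypothetical, and which real command falls into which category is left to the caller. The probability combination relies only on the fact that two independent samples with probabilities p1 and p2 keep a row with probability p1 * p2.

```java
import java.util.List;

// Hypothetical model of the three command categories; names are illustrative only.
final class SamplePushdownSketch {
    enum Category {
        SWAPPABLE,        // SAMPLE may move in front of the command; probability still propagates
        PROPAGATES_ONLY,  // SAMPLE stays where it is, but the probability passes through
        BREAKS_SAMPLING   // the probability must not be used for corrections past this command
    }

    record Command(String name, Category category) {}

    /**
     * Given the commands that precede SAMPLE (source first), returns the index at which
     * SAMPLE ends up after being pushed toward the source past swappable commands.
     */
    static int pushDown(List<Command> commandsBeforeSample) {
        int position = commandsBeforeSample.size();
        while (position > 0 && commandsBeforeSample.get(position - 1).category() == Category.SWAPPABLE) {
            position--;
        }
        return position;
    }

    /**
     * Given the commands between SAMPLE and a later aggregation, tells whether the
     * sample probability still reaches that aggregation.
     */
    static boolean probabilityReachesAggregate(List<Command> commandsAfterSample) {
        return commandsAfterSample.stream().noneMatch(c -> c.category() == Category.BREAKS_SAMPLING);
    }

    /** Two stacked independent samples behave like one sample with the product probability. */
    static double combinedProbability(double first, double second) {
        return first * second;
    }
}
```

For example, with a hypothetical pipeline whose last two commands before SAMPLE are swappable and the one before them is not, `pushDown` returns the index just after the non-swappable command.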

bpintea · 2025-04-17

It'd be great to add a planning test, if possible, that iterates over all available commands, discovered dynamically (not sure if we have anything like that already), and checks that each command is swapped with SAMPLE unless it belongs to an allow-filter; something as simple as FROM .. | COMMAND .. | SAMPLE ... This would let us detect when a new command is added and the implementer forgot to add it to this list.
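
The shape of such a check might look like the sketch below. Everything in it is hypothetical: how commands are discovered and how a plan is inspected are abstracted into a plain collection and a `Predicate`, since the thread does not say whether such tooling exists.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical helper; command discovery and plan inspection are supplied by the caller.
final class SampleCoverageCheckSketch {
    /**
     * Returns the commands that are neither on the allow-list nor swapped with SAMPLE
     * when planning FROM .. | COMMAND .. | SAMPLE ...; a non-empty result means a
     * command was added without deciding how it interacts with sampling.
     */
    static List<String> unclassifiedCommands(
        Collection<String> allCommands,
        Set<String> allowList,
        Predicate<String> isSwappedWithSample
    ) {
        List<String> offenders = new ArrayList<>();
        for (String command : allCommands) {
            if (!allowList.contains(command) && !isSwappedWithSample.test(command)) {
                offenders.add(command);
            }
        }
        return offenders;
    }
}
```

A test built on this would fail with the list of offending command names, which points the implementer of a new command at the missing classification.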

...main/java/org/elasticsearch/xpack/esql/optimizer/rules/logical/PushDownAndCombineSample.java

jan-elastic marked this pull request as draft March 25, 2025 10:02

elasticsearchmachine added the v9.1.0 label Mar 25, 2025

github-actions bot deployed to docs-preview March 25, 2025 10:03

jan-elastic added the >feature, :ml (Machine learning), Team:Analytics (Meta label for analytical engine team (ESQL/Aggs/Geo)), and Team:ML (Meta label for the ML team) labels Mar 25, 2025

bpintea reviewed Mar 25, 2025

...ava/org/elasticsearch/xpack/esql/optimizer/rules/logical/PropagateSampleFrequencyToAggs.java (outdated)

jan-elastic force-pushed the feat/random_sample branch from 42f3b79 to 1a6bf4d Compare March 26, 2025 11:49

bpintea reviewed Mar 26, 2025

...c/main/java/org/elasticsearch/xpack/esql/optimizer/rules/logical/ApplySampleCorrections.java (outdated)

jan-elastic force-pushed the feat/random_sample branch 5 times, most recently from dd90a3b to 4e7959c Compare March 27, 2025 17:02

jan-elastic force-pushed the feat/random_sample branch 4 times, most recently from c42e568 to 53eff0f Compare April 10, 2025 10:30

jan-elastic requested review from bpintea, nik9000 and alex-spies April 10, 2025 10:53

stratoula mentioned this pull request Apr 11, 2025

[ES|QL] Supports SAMPLE command elastic/kibana#217977

Open

5 tasks

jan-elastic marked this pull request as ready for review April 14, 2025 08:14

elasticsearchmachine removed the Team:Analytics label Apr 14, 2025

bpintea reviewed Apr 14, 2025

jan-elastic and others added 23 commits April 23, 2025 14:37

push down through where and sort

473217d

rename RANDOM_SAMPLE -> SAMPLE

92949fa

[CI] Auto commit changes from spotless

e449ae4

SampleOperaetorTests + fix status

83c0e07

Test accuracy of sampling operator

c98391a

polish code

0e5d497

error on seed in sampling operator

2e32a72

Don't push sample correction through limit

cad45d7

Don't push sample correction through mv_expand

01cc677

CSV tests

011e612

propagate multiple sample probabilities

27f4294

REST test

875c2d5

enable all csv tests

eb1f728

Fix CSV test with sample+limit

ccc7179

add SampleBreaking interface

a5ef3bd

comments

ac41e9c

linkedlist -> arraydeque for efficiencyu

12df3c2

use samplebreaking in pushdown

6051ec0

different operator categories wrt sampling. Remove SampleBreaking interface

c7cbf7e

sample metrics

0a5085b

[CI] Auto commit changes from spotless

ba917cd

fix esql metrics test

29fbc39

delete unused file

af95b37

jan-elastic force-pushed the feat/random_sample branch from ba58c0c to af95b37 Compare April 23, 2025 12:39

Merge branch 'main' into feat/random_sample

f13a5c9

jan-elastic added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Apr 23, 2025

elasticsearchmachine merged commit bd1a638 into main Apr 23, 2025
18 checks passed

elasticsearchmachine deleted the feat/random_sample branch April 23, 2025 15:48

bpintea mentioned this pull request Apr 23, 2025

ESQL: Add a random sample command #123879

Closed

ChrisHegarty mentioned this pull request Apr 25, 2025

test: check ES|QL SAMPLE capability before running analyzer/parser tests #127382

Merged
