-
Notifications
You must be signed in to change notification settings - Fork 25.2k
ES|QL random sampling #125570
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ES|QL random sampling #125570
Conversation
💚 CLA has been signed |
Hi @jan-elastic, I've created a changelog YAML for you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we not reuse the SurrogateExpression
if'ace?
We could have SampledCount
and SampledSum
extending the existing corresponding aggs, which already implement that interface. These subclasses would take the sample as an argument and adjust the result of their superclasses in the way the current proposal does.
The rule swapping the Count
and/or Sum
nodes for their sampled- equivalents would need to be inserted above SubstituteSurrogates
-- it'd only execute once; as PropagateSampleFrequencyToAggs
does now.
...ava/org/elasticsearch/xpack/esql/optimizer/rules/logical/PropagateSampleFrequencyToAggs.java
Outdated
Show resolved
Hide resolved
...ava/org/elasticsearch/xpack/esql/optimizer/rules/logical/PropagateSampleFrequencyToAggs.java
Outdated
Show resolved
Hide resolved
42f3b79
to
1a6bf4d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new proposal works, but what stands out to me is the fact that the agg functions that need sampling corrections become aware of this, also when that's not needed, i.e. the correcting new agg function's implementation leaks into them.
An alternative would be make the ApplySampleCorrections
rule be the "repository of knowledge" about which agg function needs correction and do the substitution there directly, w/o the use of an interface (HasSampleCorrection
), in the lines of:
if (plan instanceof Aggregate && sampleProbability.get() != null) {
plan = plan.transformExpressionsOnly(e -> switch (e) {
case Count count -> new CountSampleCorrection(count.source(), count.field(), count.filter(), sampleProbability.get());
case Sum sum -> new SumSampleCorrection(sum.source(), sum.field(), sum.filter(), sampleProbability.get());
default -> e;
});
sampleProbability.set(null);
}
...c/main/java/org/elasticsearch/xpack/esql/optimizer/rules/logical/ApplySampleCorrections.java
Outdated
Show resolved
Hide resolved
dd90a3b
to
4e7959c
Compare
c42e568
to
53eff0f
Compare
Aggs themselves are the best place to put "how to correct this agg for sampling", I think. I'm not picky on how we do it, but that's a per-agg decision if I've ever seen one. |
Pinging @elastic/ml-core (Team:ML) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM so far, maybe safe for the to-be-decided seed handling.
...in/esql/qa/server/src/main/java/org/elasticsearch/xpack/esql/qa/rest/RestSampleTestCase.java
Show resolved
Hide resolved
...in/esql/qa/server/src/main/java/org/elasticsearch/xpack/esql/qa/rest/RestSampleTestCase.java
Show resolved
Hide resolved
...in/esql/qa/server/src/main/java/org/elasticsearch/xpack/esql/qa/rest/RestSampleTestCase.java
Show resolved
Hide resolved
...in/esql/qa/server/src/main/java/org/elasticsearch/xpack/esql/qa/rest/RestSampleTestCase.java
Show resolved
Hide resolved
x-pack/plugin/esql/compute/src/main/java/org/elasticsearch/compute/operator/SampleOperator.java
Show resolved
Hide resolved
...gin/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/aggregate/Count.java
Show resolved
Hide resolved
...c/main/java/org/elasticsearch/xpack/esql/optimizer/rules/logical/ApplySampleCorrections.java
Outdated
Show resolved
Hide resolved
...c/main/java/org/elasticsearch/xpack/esql/optimizer/rules/logical/ApplySampleCorrections.java
Show resolved
Hide resolved
var probability = combinedProbability(context, sample, rsChild); | ||
var seed = combinedSeed(context, sample, rsChild); | ||
plan = new Sample(sample.source(), probability, seed, rsChild.child()); | ||
} else if (child instanceof Enrich |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should be replaced with non-SampleBreaking
(and instance of UnaryPlan, though this check should maybe throw, even if in the future we might have Nary nodes that could allow Sample to slide by).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed the first part; let's fix the second part once we have them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After thinking about it more, we have concluded there are three types of commands:
- those that can be swapped (as propagate probability)
- those that only propagate probability
- those that break sampling
I have updated the code reflecting that, and removed the SampleBreaking interface in doing so. PTAL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be great to add a planning test if possible that iterates over all available commands, dynamically discovered (not sure if we have anything like that already), and checks that unless each command belongs to an allow-filter, it'll be swapped with SAMPLE
; as simple as FROM .. | COMMAND .. | SAMPLE ..
. This would allow us to detect if a new command is added and the implementer forgot to add it to this list.
...main/java/org/elasticsearch/xpack/esql/optimizer/rules/logical/PushDownAndCombineSample.java
Outdated
Show resolved
Hide resolved
ba58c0c
to
af95b37
Compare
No description provided.