Skip to content

ES|QL random sampling #125570

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 42 commits into from
Apr 23, 2025
Merged

ES|QL random sampling #125570

merged 42 commits into from
Apr 23, 2025

Conversation

jan-elastic
Copy link
Contributor

@jan-elastic jan-elastic commented Mar 25, 2025

No description provided.

Copy link

cla-checker-service bot commented Mar 25, 2025

💚 CLA has been signed

@jan-elastic jan-elastic marked this pull request as draft March 25, 2025 10:02
@jan-elastic jan-elastic added >feature :ml Machine learning Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:ML Meta label for the ML team labels Mar 25, 2025
@elasticsearchmachine
Copy link
Collaborator

Hi @jan-elastic, I've created a changelog YAML for you.

Copy link
Contributor

@bpintea bpintea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not reuse the SurrogateExpression if'ace?
We could have SampledCount and SampledSum extending the existing corresponding aggs, which already implement that interface. These subclasses would take the sample as an argument and adjust the result of their superclasses in the way the current proposal does.
The rule swapping the Count and/or Sum nodes for their sampled- equivalents would need to be inserted above SubstituteSurrogates -- it'd only execute once; as PropagateSampleFrequencyToAggs does now.

Copy link
Contributor

@bpintea bpintea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new proposal works, but what stands out to me is the fact that the agg functions that need sampling corrections become aware of this, also when that's not needed, i.e. the correcting new agg function's implementation leaks into them.
An alternative would be make the ApplySampleCorrections rule be the "repository of knowledge" about which agg function needs correction and do the substitution there directly, w/o the use of an interface (HasSampleCorrection), in the lines of:

            if (plan instanceof Aggregate && sampleProbability.get() != null) {
                plan = plan.transformExpressionsOnly(e -> switch (e) {
                    case Count count -> new CountSampleCorrection(count.source(), count.field(), count.filter(), sampleProbability.get());
                    case Sum sum -> new SumSampleCorrection(sum.source(), sum.field(), sum.filter(), sampleProbability.get());
                    default -> e;
                });
                sampleProbability.set(null);
            }

@jan-elastic jan-elastic force-pushed the feat/random_sample branch 5 times, most recently from dd90a3b to 4e7959c Compare March 27, 2025 17:02
@jan-elastic jan-elastic force-pushed the feat/random_sample branch 4 times, most recently from c42e568 to 53eff0f Compare April 10, 2025 10:30
@nik9000
Copy link
Member

nik9000 commented Apr 11, 2025

the correcting new agg function's implementation leaks into them.
An alternative would be make the ApplySampleCorrections rule be the "repository of knowledge" about which agg function needs correction and do the substitution there directly, w/o the use of an interface (HasSampleCorrection), in the lines of:

Aggs themselves are the best place to put "how to correct this agg for sampling", I think. I'm not picky on how we do it, but that's a per-agg decision if I've ever seen one.

@jan-elastic jan-elastic marked this pull request as ready for review April 14, 2025 08:14
@elasticsearchmachine elasticsearchmachine removed the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Apr 14, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

Copy link
Contributor

@bpintea bpintea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM so far, maybe safe for the to-be-decided seed handling.

var probability = combinedProbability(context, sample, rsChild);
var seed = combinedSeed(context, sample, rsChild);
plan = new Sample(sample.source(), probability, seed, rsChild.child());
} else if (child instanceof Enrich
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be replaced with non-SampleBreaking (and instance of UnaryPlan, though this check should maybe throw, even if in the future we might have Nary nodes that could allow Sample to slide by).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed the first part; let's fix the second part once we have them

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After thinking about it more, we have concluded there are three types of commands:

  • those that can be swapped (as propagate probability)
  • those that only propagate probability
  • those that break sampling

I have updated the code reflecting that, and removed the SampleBreaking interface in doing so. PTAL

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be great to add a planning test if possible that iterates over all available commands, dynamically discovered (not sure if we have anything like that already), and checks that unless each command belongs to an allow-filter, it'll be swapped with SAMPLE; as simple as FROM .. | COMMAND .. | SAMPLE ... This would allow us to detect if a new command is added and the implementer forgot to add it to this list.

@jan-elastic jan-elastic added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Apr 23, 2025
@elasticsearchmachine elasticsearchmachine merged commit bd1a638 into main Apr 23, 2025
18 checks passed
@elasticsearchmachine elasticsearchmachine deleted the feat/random_sample branch April 23, 2025 15:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >feature :ml Machine learning Team:ML Meta label for the ML team v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants