
ES|QL SAMPLE aggregation function #127629


Open
wants to merge 5 commits into
base: main
from

Conversation

jan-elastic
Contributor

No description provided.

@elasticsearchmachine added the needs:triage and v9.1.0 labels on May 2, 2025
@jan-elastic force-pushed the esql-sample-agg-2 branch from 55fe96f to 62de767 on May 2, 2025, 13:23
@jan-elastic added the >feature, :ml, and Team:ML labels on May 2, 2025
@elasticsearchmachine removed the needs:triage label on May 2, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine
Collaborator

Hi @jan-elastic, I've created a changelog YAML for you.


public void add(int groupId, boolean value) {
    try (BreakingBytesRefBuilder builder = new BreakingBytesRefBuilder(breaker, "sample")) {
        ENCODER.encodeLong(new SplittableRandom().nextLong(), builder);
Member


Is it more correct to put the new SplittableRandom into the ctor so we just keep calling nextLong on it?

Same for the BreakingBytesRefBuilder - you could clear it before every call here.
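One way to read this suggestion, as a minimal sketch rather than the actual change: create the random source once in the constructor and reset a scratch buffer on every call instead of allocating a new one. The class `ReusingSampleKeyEncoder` and its `encode` method are hypothetical names for illustration, and a plain `java.nio.ByteBuffer` stands in for `BreakingBytesRefBuilder`, whose API is not shown in this thread.

```java
import java.nio.ByteBuffer;
import java.util.SplittableRandom;

// Hypothetical stand-in for the aggregator state; this is not the Elasticsearch class.
final class ReusingSampleKeyEncoder {
    // Created once in the constructor and reused for every encode() call.
    private final SplittableRandom random;
    // Reusable scratch buffer: 8 bytes of random key plus 1 byte for the boolean.
    private final ByteBuffer scratch = ByteBuffer.allocate(Long.BYTES + 1);

    ReusingSampleKeyEncoder() {
        this.random = new SplittableRandom();
    }

    // Seeded variant, handy for deterministic tests.
    ReusingSampleKeyEncoder(long seed) {
        this.random = new SplittableRandom(seed);
    }

    /** Encodes a fresh random key followed by the value and returns a copy of the bytes. */
    byte[] encode(boolean value) {
        scratch.clear();                      // reset the buffer instead of allocating a new one
        scratch.putLong(random.nextLong());   // next random sort key from the shared generator
        scratch.put((byte) (value ? 1 : 0));  // payload byte
        byte[] out = new byte[scratch.position()];
        scratch.flip();
        scratch.get(out);
        return out;
    }
}
```

Successive `encode` calls on one instance draw from the same generator, so the per-call cost is one `nextLong()` and a buffer reset rather than two allocations.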

"version" },
description = "Collects sample values for a field.",
type = FunctionType.AGGREGATE,
examples = @Example(file = "stats_sample", tag = "doc")
Member


I think we should have an example of the output in the docs. I'm not entirely sure the right way to hack that one up because it's non-deterministic. Maybe it's hand rolled.

I think we want that example because my first question when reading this is "can I get duplicates or do those count as distinct samples?" Mostly because I'm not good at statistics.

I do think it's interesting that SAMPLE(bool) is strictly more work than VALUES(bool). It feels like sampling shouldn't be, but it makes some sense.

Contributor Author


I think it makes sense that SAMPLE(bool) is more work. VALUES(bool) just keeps track of two boolean values: does true exist and does false exist. SAMPLE(bool) does more.
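To make the difference in work concrete, here is a rough sketch under stated assumptions. `BoolValuesSketch` and `BoolSampleSketch` are hypothetical, simplified stand-ins, not the Elasticsearch aggregators. The sampling side assumes each value is tagged with a random key and only the `limit` smallest keys are kept, which is what the random-long encoding in the reviewed snippet appears to suggest.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.SplittableRandom;

// Illustrative only: VALUES on booleans needs two flags, whatever the input size.
final class BoolValuesSketch {
    private boolean seenTrue;
    private boolean seenFalse;

    void add(boolean v) {
        if (v) {
            seenTrue = true;
        } else {
            seenFalse = true;
        }
    }

    List<Boolean> result() {
        List<Boolean> out = new ArrayList<>();
        if (seenFalse) out.add(false);
        if (seenTrue) out.add(true);
        return out;
    }
}

// Illustrative only: sampling tags every value with a random key and keeps a bounded heap.
final class BoolSampleSketch {
    private static final class Tagged {
        final long key;
        final boolean value;

        Tagged(long key, boolean value) {
            this.key = key;
            this.value = value;
        }
    }

    private final int limit;
    private final SplittableRandom random;
    // Max-heap on key: the largest key is evicted first, so the `limit` smallest keys survive.
    private final PriorityQueue<Tagged> heap =
        new PriorityQueue<>((a, b) -> Long.compare(b.key, a.key));

    BoolSampleSketch(int limit, long seed) {
        this.limit = limit;
        this.random = new SplittableRandom(seed);
    }

    void add(boolean v) {
        heap.add(new Tagged(random.nextLong(), v));
        if (heap.size() > limit) {
            heap.poll(); // drop the entry with the largest random key
        }
    }

    List<Boolean> result() {
        List<Boolean> out = new ArrayList<>();
        for (Tagged t : heap) {
            out.add(t.value);
        }
        return out;
    }
}
```

In this sketch, feeding ten `true` values gives a single `true` from the values side and `limit` copies of `true` from the sampling side, because each input row is a separate candidate. Whether the real SAMPLE function treats duplicates this way is the open question raised above and is not settled in this thread.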

Contributor Author

@jan-elastic, May 6, 2025


Obv, I prefer some example output too. I didn't know how to achieve that, but I'll think of something. Should've left a TODO.

Labels
>feature, :ml, Team:ML, v9.1.0

3 participants