ES|QL SAMPLE aggregation function #127629

jan-elastic · 2025-05-02T10:16:58Z

No description provided.

elasticsearchmachine · 2025-05-02T13:32:37Z

Pinging @elastic/ml-core (Team:ML)

elasticsearchmachine · 2025-05-02T13:32:38Z

Hi @jan-elastic, I've created a changelog YAML for you.

nik9000 · 2025-05-02T15:39:51Z

...te/src/main/generated-src/org/elasticsearch/compute/aggregation/SampleBooleanAggregator.java

+
+        public void add(int groupId, boolean value) {
+            try (BreakingBytesRefBuilder builder = new BreakingBytesRefBuilder(breaker, "sample")) {
+                ENCODER.encodeLong(new SplittableRandom().nextLong(), builder);


Is it more correct to put the new SplittableRandom into the ctor so we just keep calling nextLong on it?

Same for the BreakingBytesRefBuilder - you could keep one instance and clear it before every call here.
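A minimal sketch of that suggestion, not the actual generated aggregator. The class name `ReusingSampleState`, the seed parameter, and the `key:value` encoding are hypothetical, and a plain `StringBuilder` stands in for `BreakingBytesRefBuilder`, which needs a circuit breaker and is not reproduced here:

```java
import java.util.SplittableRandom;

// Hypothetical stand-in for the aggregator state; not the real Elasticsearch class.
class ReusingSampleState {
    // Created once in the constructor and reused, instead of `new SplittableRandom()` per add().
    private final SplittableRandom random;
    // Scratch buffer reused across calls; StringBuilder stands in for BreakingBytesRefBuilder.
    private final StringBuilder scratch = new StringBuilder();

    // The seed parameter is an assumption made so the sketch is reproducible.
    ReusingSampleState(long seed) {
        this.random = new SplittableRandom(seed);
    }

    // Encodes "<random key>:<value>" for one input value and returns the encoded form.
    String add(boolean value) {
        scratch.setLength(0); // clear before every call rather than allocating a new builder
        scratch.append(random.nextLong()).append(':').append(value);
        return scratch.toString();
    }
}
```

Each call draws the next key from the same generator, and the buffer is reset rather than reallocated, so the per-row allocations in the hunk above go away.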

nik9000 · 2025-05-02T15:44:11Z

...in/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/aggregate/Sample.java

+            "version" },
+        description = "Collects sample values for a field.",
+        type = FunctionType.AGGREGATE,
+        examples = @Example(file = "stats_sample", tag = "doc")


I think we should have an example of the output in the docs. I'm not entirely sure of the right way to hack that one up because the output is non-deterministic. Maybe it's hand-rolled.

I think we want that example because my first question when reading this is "can I get duplicates or do those count as distinct samples?" Mostly because I'm not good at statistics.

I do think it's interesting that SAMPLE(bool) is strictly more work than VALUES(bool). It feels like sampling shouldn't be, but it makes some sense.

jan-elastic · 2025-05-06

I think it makes sense that SAMPLE(bool) is more work. VALUES(bool) just keeps track of two boolean flags: does true exist and does false exist. SAMPLE(bool) does more.
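To make the contrast concrete, here is a rough sketch with hypothetical class names; neither class is the actual generated aggregator. It assumes SAMPLE attaches a random key to each row and keeps the `limit` rows with the smallest keys, which the hunk above hints at but this thread does not confirm. Under that assumption duplicates can appear in the output, because rows are sampled rather than distinct values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.SplittableRandom;

// VALUES(bool)-style state: two flags are enough, whatever the input size.
class BooleanValuesState {
    boolean seenTrue;
    boolean seenFalse;

    void add(boolean v) {
        if (v) seenTrue = true; else seenFalse = true;
    }
}

// SAMPLE(bool)-style state (assumed design): keep the `limit` rows with the smallest random keys.
class BooleanSampleState {
    private static final class Keyed {
        final long key;
        final boolean value;

        Keyed(long key, boolean value) {
            this.key = key;
            this.value = value;
        }
    }

    private final int limit;
    private final SplittableRandom random;
    // Max-heap on key, so the row with the largest retained key is evicted first.
    private final PriorityQueue<Keyed> heap =
        new PriorityQueue<>((a, b) -> Long.compare(b.key, a.key));

    BooleanSampleState(int limit, long seed) {
        this.limit = limit;
        this.random = new SplittableRandom(seed);
    }

    void add(boolean v) {
        heap.add(new Keyed(random.nextLong(), v));
        if (heap.size() > limit) {
            heap.poll(); // drop the row with the largest key
        }
    }

    // Returns the retained values in heap iteration order (not sorted by key).
    List<Boolean> sample() {
        List<Boolean> out = new ArrayList<>();
        for (Keyed k : heap) {
            out.add(k.value);
        }
        return out;
    }
}
```

Feeding ten `true` rows into both leaves the VALUES-style state with a single flag set, while the SAMPLE-style state with a limit of 3 holds three `true` entries plus a random key for each.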

jan-elastic · 2025-05-06

Obviously, I'd prefer some example output too. I didn't know how to achieve that, but I'll think of something. I should've left a TODO.

ES|QL SAMPLE aggregation function

8ef9b0d

elasticsearchmachine added the needs:triage and v9.1.0 labels May 2, 2025

elasticsearchmachine and others added 2 commits May 2, 2025 10:23

[CI] Auto commit changes from spotless

1d2697e

ThreadLocalRandom -> SplittableRandom

62de767

jan-elastic force-pushed the esql-sample-agg-2 branch from 55fe96f to 62de767 Compare May 2, 2025 13:23

jan-elastic added the >feature, :ml, and Team:ML labels May 2, 2025

elasticsearchmachine removed the needs:triage label May 2, 2025

Update docs/changelog/127629.yaml

629fc2e

github-actions bot deployed to docs-preview May 2, 2025 13:33

fix yaml test

549ff83

github-actions bot deployed to docs-preview May 2, 2025 15:35

jan-elastic requested a review from alex-spies May 2, 2025 15:36
