ESQL: Wrong STATS result in case of many distinct groups of 2 fields

`STATS COUNT() BY field1, field2` can produce wrong results if the cardinality of `field1` and `field2` is very large.

Steps to repro:
1.  create files to bulk ingest

```
for i in {1..10}; do
  for j in {1..100000}; do                                                                                                             
    echo "{\"index\": {\"_index\": \"test2\"}}" >> bulk$i.ndjson
    echo "{\"field\": \"$i\", \"otherfield\": \"$j\", \"concat\": \"$i$j\"}" >> bulk$i.ndjson
  done
done
```

Note how the `concat` field is just the concatenation of the `i` and `j` counters.

2. run ES locally via gradle, but increase heap size

```
./gradlew run -Dtests.es.logger.org.elasticsearch.xpack.esql=TRACE -Dtests.jvm.argline="-Xmx16000M"
```

3. bulk ingest

```
for i in {1..10}; do
  curl -u elastic:password -H "Content-Type: application/json" "127.0.0.1:9200/test2/_bulk?refresh" -XPOST --data-binary @bulk$i.ndjson
done
```

4. run queries

```
curl -u elastic:password -X POST 'localhost:9200/_query?format=txt' -H 'Content-Type: application/json' '-d
{                                                                                                                                      
  "query": "from test2 | stats count() by concat.keyword, otherfield.keyword| where concat.keyword == \"11\""}
'

Empty! Although not fully deterministic.

    count()    |concat.keyword |otherfield.keyword
---------------+---------------+------------------

Correct if we filter before!
curl -u elastic:password -X POST 'localhost:9200/_query?format=txt' -H 'Content-Type: application/json' '-d
{
  "query": "from test2 | where concat.keyword == \"11\" | stats count() by concat.keyword, otherfield.keyword"}
'
    count()    |concat.keyword |otherfield.keyword
---------------+---------------+------------------
1              |11             |1 
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ESQL: Wrong STATS result in case of many distinct groups of 2 fields #130644

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

ESQL: Wrong STATS result in case of many distinct groups of 2 fields #130644

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions