Open
Description
STATS COUNT() BY field1, field2
can produce wrong results if the cardinality of field1
and field2
is very large.
Steps to repro:
- create files to bulk ingest
for i in {1..10}; do
for j in {1..100000}; do
echo "{\"index\": {\"_index\": \"test2\"}}" >> bulk$i.ndjson
echo "{\"field\": \"$i\", \"otherfield\": \"$j\", \"concat\": \"$i$j\"}" >> bulk$i.ndjson
done
done
Note how the concat
field is just the concatenation of the i
and j
counters.
- run ES locally via gradle, but increase heap size
./gradlew run -Dtests.es.logger.org.elasticsearch.xpack.esql=TRACE -Dtests.jvm.argline="-Xmx16000M"
- bulk ingest
for i in {1..10}; do
curl -u elastic:password -H "Content-Type: application/json" "127.0.0.1:9200/test2/_bulk?refresh" -XPOST --data-binary @bulk$i.ndjson
done
- run queries
curl -u elastic:password -X POST 'localhost:9200/_query?format=txt' -H 'Content-Type: application/json' '-d
{
"query": "from test2 | stats count() by concat.keyword, otherfield.keyword| where concat.keyword == \"11\""}
'
Empty! Although not fully deterministic.
count() |concat.keyword |otherfield.keyword
---------------+---------------+------------------
Correct if we filter before!
curl -u elastic:password -X POST 'localhost:9200/_query?format=txt' -H 'Content-Type: application/json' '-d
{
"query": "from test2 | where concat.keyword == \"11\" | stats count() by concat.keyword, otherfield.keyword"}
'
count() |concat.keyword |otherfield.keyword
---------------+---------------+------------------
1 |11 |1