
CAT API documents count incorrect #127354


Open
mrklaney opened this issue Apr 24, 2025 · 6 comments
Labels
>bug :SearchOrg/Relevance Label for the Search (solution/org) Relevance team

Comments

@mrklaney
Contributor

Elasticsearch Version

8.18.0

Installed Plugins

No response

Java Version

JVM home [.../elasticsearch-8.18.0/jdk.app/Contents/Home], using bundled JDK [true]

OS Version

Darwin Marks-MacBook-Pro.local 24.4.0 Darwin Kernel Version 24.4.0: Fri Apr 11 18:33:39 PDT 2025; root:xnu-11417.101.15~117/RELEASE_ARM64_T6020 arm64

Problem Description

The _cat/indices API reports an incorrect docs.count for indices that have a field of type semantic_text.

The _count and _search APIs agree with each other and return what appears to be the correct document count.

Steps to Reproduce

Reindex into a destination index that has a semantic_text field, so that embeddings are created at index time.
Tested with both the .elser-2-elasticsearch and .multilingual-e5-small models.
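
The steps above can be sketched as a sequence of REST calls. This is only a sketch under stated assumptions, not the reporter's actual commands: the index names `source-index` and `dest-index`, the field name `content`, and the helper `build_repro_requests` are hypothetical, and the source documents are assumed to already carry a `content` text field. The endpoints themselves (index creation, `_reindex`, `_refresh`, `_cat/indices`, `_count`) are standard Elasticsearch APIs.

```python
import json


def build_repro_requests(source, dest, inference_id, field="content"):
    """Return (method, path, body) tuples that reproduce the comparison.

    All names passed in are placeholders chosen for illustration.
    """
    # Destination mapping with a semantic_text field bound to an inference endpoint.
    mapping = {
        "mappings": {
            "properties": {
                field: {"type": "semantic_text", "inference_id": inference_id}
            }
        }
    }
    reindex = {"source": {"index": source}, "dest": {"index": dest}}
    return [
        ("PUT", f"/{dest}", mapping),
        ("POST", "/_reindex?wait_for_completion=true", reindex),
        ("POST", f"/{dest}/_refresh", None),
        # Count reported by _cat/indices.
        ("GET", f"/_cat/indices/{dest}?h=index,docs.count&format=json", None),
        # Count reported by the _count API, for comparison.
        ("GET", f"/{dest}/_count", None),
    ]


if __name__ == "__main__":
    for method, path, body in build_repro_requests(
        "source-index", "dest-index", ".elser-2-elasticsearch"
    ):
        print(method, path)
        if body is not None:
            print(json.dumps(body, indent=2))
```

Sending these requests in order against a test cluster and comparing the last two responses should show whether docs.count and count diverge.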

Logs (if relevant)

No response

@mrklaney mrklaney added >bug needs:triage Requires assignment of a team area label labels Apr 24, 2025
@benwtrent
Member

semantic_text utilizes nested documents internally. I am pretty sure that _cat will count the literal number of Lucene docs, not the individual document abstraction in Elasticsearch.

@benwtrent benwtrent added :SearchOrg/Relevance Label for the Search (solution/org) Relevance team and removed needs:triage Requires assignment of a team area label labels Apr 25, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/search-eng (Team:SearchOrg)

@elasticsearchmachine
Collaborator

Pinging @elastic/search-relevance (Team:Search - Relevance)

@kderusso
Member

@benwtrent is correct, this is expected behavior as semantic_text uses nested fields under the hood. I don't think we'd want to hide that because it could lead to issues unexpectedly tripping the max Lucene doc count. Perhaps we should update our docs to make this callout a bit clearer.
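
To make the concern about the Lucene document limit concrete, here is a rough back-of-the-envelope sketch. It assumes a single semantic_text field per document, one nested Lucene document per chunk, and no deleted documents; the function name `max_top_level_docs` is hypothetical. The limit used is Lucene's per-index (that is, per-shard) maximum of `Integer.MAX_VALUE - 128` documents.

```python
# Lucene's hard limit on documents in one index (one Elasticsearch shard):
# Integer.MAX_VALUE - 128.
LUCENE_MAX_DOCS = 2**31 - 1 - 128


def max_top_level_docs(chunks_per_doc):
    """Estimate how many Elasticsearch documents fit in one shard.

    Assumption: each document costs 1 parent Lucene doc plus
    `chunks_per_doc` nested Lucene docs.
    """
    if chunks_per_doc < 0:
        raise ValueError("chunks_per_doc must be non-negative")
    return LUCENE_MAX_DOCS // (1 + chunks_per_doc)


if __name__ == "__main__":
    for chunks in (0, 4, 20):
        print(chunks, max_top_level_docs(chunks))
```

Under these assumptions, a shard whose documents average four chunks each reaches the limit at roughly a fifth of the document count that _count would suggest, which is why the raw Lucene number is worth keeping visible.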

@mrklaney
Contributor Author

mrklaney commented Apr 30, 2025

GET _cat/count/<index_name> does return a count consistent with the _count and _search endpoints.

@Mikep86
Contributor

Mikep86 commented Apr 30, 2025

@mrklaney That's because the _cat/count API uses a search query to get document counts, which does not consider nested documents. In contrast, the _cat API "indices" option counts the number of documents in Lucene, which does include nested documents. These are different operations.
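
The difference between the two operations can be illustrated with a toy model. This is a simplified sketch, not Elasticsearch code: it assumes each chunk of a semantic_text field is stored as one nested Lucene document alongside the parent, and the function names are hypothetical.

```python
def search_doc_count(chunks_per_doc):
    """Count as _count / _cat/count would: top-level documents only."""
    return len(chunks_per_doc)


def lucene_doc_count(chunks_per_doc):
    """Count as _cat/indices docs.count would: parents plus nested docs."""
    return sum(1 + chunks for chunks in chunks_per_doc)


if __name__ == "__main__":
    # Three indexed documents whose semantic_text field was split
    # into 3, 1, and 5 chunks respectively.
    chunks = [3, 1, 5]
    print(search_doc_count(chunks))  # 3
    print(lucene_doc_count(chunks))  # 12
```

In this model the two numbers match only when no document has any nested chunks, which is consistent with the discrepancy appearing only on indices with a semantic_text field.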

++ for @kderusso's suggestion to make this behavior clearer in the documentation.

5 participants