Track vector disk usage by vectorReader.getOffHeapByteSize #128326

Conversation

mayya-sharipova
Contributor

@mayya-sharipova mayya-sharipova commented May 22, 2025

Currently, IndexDiskUsageAnalyzer reports the disk usage of vectors by:

  • Iterating through document values to access vector data
  • Performing sample searches to force loading of the index structures
  • Using a sampling approach (only visiting a subset of documents based on a log scale)
  • Tracking all bytes read during these operations

One problem with this approach is that it is very slow.
Another problem is that modifications to search algorithms and different encodings make it difficult to write tests that assert exact expected results, which leads to test failures such as #127689.

This PR modifies IndexDiskUsageAnalyzer for vectors to use vectorReader.getOffHeapByteSize, a method newly introduced in Lucene 10.3. As all vector files are off-heap, we can rely on this method to report the precise disk usage.
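
As a rough illustration of the idea, the sketch below totals a per-file breakdown of off-heap bytes. It assumes that getOffHeapByteSize returns a map from vector file extension (for example "vec" or "vex") to a byte count; the class and method names here are hypothetical and are not part of Elasticsearch or Lucene.

```java
import java.util.Map;

/** Hypothetical helper for illustration only; not Elasticsearch or Lucene code. */
final class OffHeapSizeSketch {

    /**
     * Sums the byte counts of a per-extension breakdown.
     * The map shape (extension -> bytes) is an assumption about what
     * vectorReader.getOffHeapByteSize reports.
     */
    static long totalBytes(Map<String, Long> offHeapByExtension) {
        long total = 0;
        for (long bytes : offHeapByExtension.values()) {
            // addExact throws on overflow instead of silently wrapping
            total = Math.addExact(total, bytes);
        }
        return total;
    }
}
```

Unlike the sampling approach, a total of this kind needs no document iteration or sample searches, so it should be both fast and stable across changes to search algorithms.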

Closes #127689

@mayya-sharipova
Contributor Author

mayya-sharipova commented May 22, 2025

@ChrisHegarty This is a draft PR to test the correctness of this statement: "can we rely on vectorReader.getOffHeapByteSize to report the correct disk usage?"

@ChrisHegarty
Contributor

> @ChrisHegarty This is a draft PR to test the correctness of this statement: "can we rely on vectorReader.getOffHeapByteSize to report the correct disk usage?"

Absolutely. The numbers reported by getOffHeapByteSize omit the file format headers and footers, but those are minimal and are not accounted for in the current implementation anyway. So I think you got it right here @mayya-sharipova. 👍

Contributor

@ChrisHegarty ChrisHegarty left a comment

LGTM

@mayya-sharipova mayya-sharipova force-pushed the IndexDiskUsageAnalyzer_vectors branch from 0aa43af to 8051f0d Compare May 26, 2025 19:33
@mayya-sharipova mayya-sharipova marked this pull request as ready for review May 26, 2025 19:34
@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label May 26, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@mayya-sharipova
Contributor Author

@elasticsearchmachine "Check labels"

@mayya-sharipova
Contributor Author

@ChrisHegarty I would like your input on this:

With DirectIO, the getOffHeapByteSize method no longer reports .vec. Should we modify this PR as follows?

  • if a reader is a direct IO reader (how easily can we detect this?), use the old method
  • for all other readers, use getOffHeapByteSize

@ChrisHegarty
Contributor

ChrisHegarty commented Jun 5, 2025

> @ChrisHegarty I would like your input on this:
>
> With DirectIO, the getOffHeapByteSize method no longer reports .vec. Should we modify this PR as follows?
>
>   • if a reader is a direct IO reader (how easily can we detect this?), use the old method
>   • for all other readers, use getOffHeapByteSize

I still need to port #128697 to the main branch. I've not done so yet, since we've had to disable direct IO temporarily.

Once #128697 is ported then we should be good to use getOffHeapByteSize always, even for the direct IO case - since the reader will know whether to report the raw vec usage or not. So no special handling required. Make sense?

@mayya-sharipova
Contributor Author

mayya-sharipova commented Jun 5, 2025

We had a discussion offline, and the approach we've decided to take is:

  • Get the vectorReader.getOffHeapByteSize output and check the .vec size it reports.
  • If the .vec size is 0, calculate the size of this file manually, based on the element type.
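
A minimal sketch of that fallback, under stated assumptions: the reported sizes arrive as a map keyed by file extension, "vec" is the key for raw vectors, and the raw size is simply vector count × dimensions × bytes per element (4 for float, 1 for byte). The class, enum, and method names are hypothetical, and other element types or any per-vector overhead are ignored.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the .vec fallback; not the code merged in this PR. */
final class VecFallbackSketch {

    /** Illustrative element types with assumed per-dimension widths in bytes. */
    enum ElementType {
        FLOAT(4),
        BYTE(1);

        final int bytesPerDim;

        ElementType(int bytesPerDim) {
            this.bytesPerDim = bytesPerDim;
        }
    }

    /** Estimated raw vector bytes: vectorCount * dims * bytesPerDim. */
    static long rawVectorBytes(ElementType type, int dims, long vectorCount) {
        long bytesPerVector = Math.multiplyExact((long) dims, (long) type.bytesPerDim);
        return Math.multiplyExact(vectorCount, bytesPerVector);
    }

    /**
     * Returns a copy of the reported sizes; if "vec" is missing or 0,
     * it is replaced by the manual estimate.
     */
    static Map<String, Long> withVecFallback(
            Map<String, Long> reported, ElementType type, int dims, long vectorCount) {
        Map<String, Long> result = new HashMap<>(reported);
        if (result.getOrDefault("vec", 0L) == 0L) {
            result.put("vec", rawVectorBytes(type, dims, vectorCount));
        }
        return result;
    }
}
```

For example, 10 float vectors of 128 dimensions would be estimated at 10 × 128 × 4 bytes when the reader reports no .vec usage, while a non-zero reported value is left untouched.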

@mayya-sharipova mayya-sharipova force-pushed the IndexDiskUsageAnalyzer_vectors branch from 563b330 to 8bbe9bf Compare June 6, 2025 15:43
@mayya-sharipova
Contributor Author

@ChrisHegarty Could you please review the recent changes? Thanks a lot.

@mayya-sharipova mayya-sharipova merged commit 0b66ad6 into elastic:lucene_snapshot Jun 10, 2025
17 checks passed
Labels
>refactoring :Search Relevance/Vectors Vector search Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch v9.1.0
3 participants