Track vector disk usage by vectorReader.getOffHeapByteSize #128326
@ChrisHegarty This is a draft PR to test the correctness of this statement: "can we rely on vectorReader.getOffHeapByteSize to report the correct disk usage?"
Absolutely. The numbers reported by getOffHeapByteSize omit the file format headers, footers, etc., but those are minimal and are not accounted for in the current implementation anyway. So I think you got it right here @mayya-sharipova. 👍
LGTM
0aa43af to 8051f0d
Pinging @elastic/es-search-relevance (Team:Search Relevance)
@elasticsearchmachine "Check labels"
@ChrisHegarty I would like your input on this: Now with DirectIO
I still need to port #128697 to the main branch. I've not done so yet, since we've had to disable direct IO temporarily.
We had a discussion offline, and the approach we've decided to take is:
Currently IndexDiskUsageAnalyzer reports disk usage of vectors by:
- iterating through document values to access vector data
- performing sample searches to force loading of the index structures
- using a sampling approach (only visiting a subset of documents based on a log scale)
- tracking all bytes read during these operations

One problem with this approach is that it is very slow. Another is that modifications to search algorithms and different encodings make it difficult to write definitive tests and assert expected results, hence test failures such as elastic#127689.

This modifies IndexDiskUsageAnalyzer for vectors to use vectorReader.getOffHeapByteSize, a method newly introduced in Lucene 10.3. As all vector files are off-heap, we can rely on this method to report the precise disk usage.

Closes elastic#127689
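For readers skimming the thread, the new accounting described above can be sketched roughly as follows. This is not the code from this PR. It assumes that getOffHeapByteSize reports, for each vector field, a map from file extension to byte count, and it replaces the Lucene reader with a hypothetical `OffHeapSizeSource` interface so the example runs without Lucene on the classpath. The field names, extensions, and sizes are invented for illustration.

```java
import java.util.List;
import java.util.Map;

public class VectorDiskUsageSketch {

    /**
     * Hypothetical stand-in for the vector reader. Assumption: the real
     * method returns byte counts keyed by vector file extension per field.
     */
    public interface OffHeapSizeSource {
        Map<String, Long> offHeapByteSize(String fieldName);
    }

    /** Sums the reported bytes over all files of all given vector fields. */
    public static long totalVectorBytes(OffHeapSizeSource source, Iterable<String> vectorFields) {
        long total = 0;
        for (String field : vectorFields) {
            for (long bytes : source.offHeapByteSize(field).values()) {
                // addExact fails loudly on overflow instead of wrapping silently
                total = Math.addExact(total, bytes);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Invented sizes for two hypothetical vector fields.
        Map<String, Map<String, Long>> fake = Map.of(
            "embedding", Map.of("vec", 4096L, "vex", 512L),
            "title_vector", Map.of("vec", 1024L));
        OffHeapSizeSource source = field -> fake.getOrDefault(field, Map.of());
        System.out.println(totalVectorBytes(source, List.of("embedding", "title_vector")));
    }
}
```

Unlike the sampling approach, nothing here reads vector data or runs a search: the total is a plain sum of sizes the reader already knows, which is why it is both fast and deterministic enough to assert on in tests.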
563b330 to 8bbe9bf
@ChrisHegarty Can you please review the recent changes? Thanks a lot.