Track vector disk usage by vectorReader.getOffHeapByteSize #128326

mayya-sharipova · 2025-05-22T16:45:57Z

Currently IndexDiskUsageAnalyzer reports disk usage of vectors by:

- iterating through document values to access vector data
- performing sample searches to force loading of the index structures
- using a sampling approach (only visiting a subset of documents based on a log scale)
- tracking all bytes read during these operations

One problem with this approach is that it is very slow.
Another problem is that modifications to search algorithms and different encodings make it difficult to write definitive tests and assert expected results, hence test failures such as #127689.

This PR modifies IndexDiskUsageAnalyzer for vectors to use vectorReader.getOffHeapByteSize, a method newly introduced in Lucene 10.3. As all vector files are off-heap, we can rely on this method to report the precise disk usage.
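
As a rough illustration of the idea (not the code in this PR), the per-field disk usage can be obtained by summing the sizes the reader reports. The sketch below assumes getOffHeapByteSize returns a map from file extension to byte count; a plain Map stands in for the reader's result, and the class name, method name, and sample values are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OffHeapUsageSketch {

    /**
     * Sums the per-extension byte counts into a single total.
     * The map is a stand-in (assumption) for what
     * vectorReader.getOffHeapByteSize reports for one field.
     */
    static long totalBytes(Map<String, Long> offHeapByExtension) {
        long total = 0;
        for (long bytes : offHeapByExtension.values()) {
            total = Math.addExact(total, bytes); // fail loudly on overflow
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up sizes, for illustration only.
        Map<String, Long> sizes = new LinkedHashMap<>();
        sizes.put("vec", 4_096L); // raw vectors
        sizes.put("vex", 1_024L); // graph/index structure
        System.out.println(totalBytes(sizes));
    }
}
```

No documents are visited and no searches are run, which is why this is much faster than tracking bytes read.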

Closes #127689

mayya-sharipova · 2025-05-22T16:48:56Z

@ChrisHegarty This is a draft PR to test the correctness of this statement: "can we rely on vectorReader.getOffHeapByteSize to report the correct disk usage?"

ChrisHegarty · 2025-05-23T17:00:43Z

@ChrisHegarty This is a draft PR to test the correctness of this statement: "can we rely on vectorReader.getOffHeapByteSize to report the correct disk usage?"

Absolutely. The numbers reported by getOffHeapByteSize neglect the file format headers/footers, etc., but these are minimal and are not accounted for in the current implementation anyway. So I think you got it right here @mayya-sharipova. 👍

ChrisHegarty

LGTM

elasticsearchmachine · 2025-05-26T19:34:27Z

Pinging @elastic/es-search-relevance (Team:Search Relevance)

mayya-sharipova · 2025-05-26T19:40:56Z

@elasticsearchmachine "Check labels"

mayya-sharipova · 2025-06-05T13:25:30Z

@ChrisHegarty I would like your input on this:

Now with DirectIO, the getOffHeapByteSize method will not report .vec anymore. Should we modify this PR as follows:

- if a reader is a direct IO reader (how easy is this to detect?), use the old method
- for all other readers, use getOffHeapByteSize

ChrisHegarty · 2025-06-05T13:36:00Z

@ChrisHegarty I would like your input on this:

Now with DirectIO, the getOffHeapByteSize method will not report .vec anymore. Should we modify this PR as follows:

- if a reader is a direct IO reader (how easy is this to detect?), use the old method
- for all other readers, use getOffHeapByteSize

I still need to port #128697 to the main branch. I've not done so yet, since we've had to disable direct IO temporarily.

Once #128697 is ported then we should be good to use getOffHeapByteSize always, even for the direct IO case - since the reader will know whether to report the raw vec usage or not. So no special handling required. Make sense?

mayya-sharipova · 2025-06-05T14:34:24Z

We had a discussion offline, and the approach we've decided to take is:

- get the vectorReader.getOffHeapByteSize output and check its .vec size
- if the .vec size is 0, calculate the size of this file manually based on the element type
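
A minimal sketch of that fallback, under stated assumptions: the reported sizes are modeled as a plain map keyed by file extension, and ElementType, rawVectorBytes, and the parameters are hypothetical names for illustration, not the PR's actual code. The manual estimate here is simply vector count × dimensions × bytes per dimension.

```java
import java.util.Map;

public class VecFallbackSketch {

    /** Hypothetical element types with their per-dimension storage size. */
    enum ElementType {
        FLOAT(4),
        BYTE(1);

        final int bytesPerDimension;

        ElementType(int bytesPerDimension) {
            this.bytesPerDimension = bytesPerDimension;
        }
    }

    /**
     * Returns the .vec size if the reader reported one; otherwise estimates
     * it from the number of vectors, their dimension, and the element type.
     */
    static long rawVectorBytes(Map<String, Long> offHeapByExtension,
                               long vectorCount, int dims, ElementType type) {
        long reported = offHeapByExtension.getOrDefault("vec", 0L);
        if (reported > 0) {
            return reported; // trust the reader when it reports raw vectors
        }
        // Fallback: e.g. the direct IO case, where .vec is reported as 0.
        long bytesPerVector = (long) dims * type.bytesPerDimension;
        return Math.multiplyExact(vectorCount, bytesPerVector);
    }

    public static void main(String[] args) {
        // Reader reports only the graph file; .vec is estimated manually.
        long size = rawVectorBytes(Map.of("vex", 100L), 1_000L, 128, ElementType.FLOAT);
        System.out.println(size);
    }
}
```

Like getOffHeapByteSize itself, this estimate ignores file headers and footers.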

mayya-sharipova · 2025-06-06T16:46:53Z

@ChrisHegarty Can you please review the recent changes? Thanks a lot.

mayya-sharipova added the :Search Relevance/Vectors and >enhancement labels May 22, 2025

mayya-sharipova requested a review from ChrisHegarty May 22, 2025 16:46

mayya-sharipova mentioned this pull request May 22, 2025

[CI] IndexDiskUsageAnalyzerTests testKnnVectors failing #127689

Open

ChrisHegarty approved these changes May 23, 2025

mayya-sharipova force-pushed the IndexDiskUsageAnalyzer_vectors branch from 0aa43af to 8051f0d Compare May 26, 2025 19:33

mayya-sharipova marked this pull request as ready for review May 26, 2025 19:34

elasticsearchmachine added the Team:Search Relevance label May 26, 2025

mayya-sharipova added >refactoring and removed >enhancement labels May 26, 2025

mayya-sharipova added 3 commits June 6, 2025 07:05

Adjust tests for index usage

214a76a

Iteration

8bbe9bf

mayya-sharipova force-pushed the IndexDiskUsageAnalyzer_vectors branch from 563b330 to 8bbe9bf Compare June 6, 2025 15:43

mayya-sharipova added the v9.1.0 label Jun 6, 2025

ChrisHegarty approved these changes Jun 7, 2025

Merge branch 'lucene_snapshot' into IndexDiskUsageAnalyzer_vectors

d22b238

mayya-sharipova merged commit 0b66ad6 into elastic:lucene_snapshot Jun 10, 2025
17 checks passed
