[IVF] Improve the format of the tmp file written during merging #129828

Merged
iverase merged 2 commits into elastic:main from ivfwriter_tmpfile
Jun 23, 2025

Conversation

iverase
Contributor

@iverase iverase commented Jun 23, 2025

During merging, we need random access to the vectors in order to build the clusters. To achieve that, we write our vectors and docIds together to a temporary file. While testing on a memory-constrained node, I noticed in the flamegraph that we were spending a lot of time reading docIds:

[image: flamegraph]

Looking at this process, I noticed we can do much better because:

  1. If the segment is dense, i.e. all documents have a vector, we don't need to write the docIds, as the docId is the ordinal of the vector.
  2. If the segment is not dense, we can write the docIds to a separate file, as they are accessed independently of the vectors.

This commit just adds the logic above, which improves performance on memory-constrained nodes.
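
The two cases above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `TmpLayoutSketch`, `TmpData`, `write`, `readVector` and `docId` are hypothetical names, in-memory `ByteBuffer`s stand in for the temporary files, and the density check assumes docIds are unique and lie in `[0, maxDoc)`.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: in-memory buffers stand in for the temporary files.
class TmpLayoutSketch {

    static final class TmpData {
        final ByteBuffer vectors; // stand-in for the vectors tmp file
        final ByteBuffer docIds;  // stand-in for the docIds tmp file; null when dense
        final int dims;

        TmpData(ByteBuffer vectors, ByteBuffer docIds, int dims) {
            this.vectors = vectors;
            this.docIds = docIds;
            this.dims = dims;
        }
    }

    // Writes vectors contiguously; writes docIds separately and only if the segment is sparse.
    static TmpData write(float[][] vectors, int[] docIds, int maxDoc) {
        int dims = vectors.length == 0 ? 0 : vectors[0].length;
        ByteBuffer vecBuf = ByteBuffer.allocate(vectors.length * dims * Float.BYTES);
        for (float[] vector : vectors) {
            for (float value : vector) {
                vecBuf.putFloat(value);
            }
        }
        // Assumption: docIds are unique and in [0, maxDoc), so an equal count
        // means every document has a vector and docId == ordinal.
        boolean dense = vectors.length == maxDoc;
        ByteBuffer docBuf = null;
        if (dense == false) {
            docBuf = ByteBuffer.allocate(docIds.length * Integer.BYTES);
            for (int docId : docIds) {
                docBuf.putInt(docId);
            }
        }
        return new TmpData(vecBuf, docBuf, dims);
    }

    // Random access to a vector by ordinal; never touches the docIds.
    static float[] readVector(TmpData data, int ord) {
        float[] out = new float[data.dims];
        int base = ord * data.dims * Float.BYTES;
        for (int i = 0; i < data.dims; i++) {
            out[i] = data.vectors.getFloat(base + i * Float.BYTES);
        }
        return out;
    }

    // Dense segments need no lookup: the ordinal is the docId.
    static int docId(TmpData data, int ord) {
        return data.docIds == null ? ord : data.docIds.getInt(ord * Integer.BYTES);
    }
}
```

With this layout, reading a vector is a single offset computation (`ord * dims * Float.BYTES`), and docIds are read only when they are actually needed and the segment is sparse.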

@elasticsearchmachine added the Team:Search Relevance label (Meta label for the Search Relevance team in Elasticsearch) on Jun 23, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

Member

@benwtrent benwtrent left a comment

I think separating out the doc ids vs the vectors is great!

@iverase iverase merged commit 72b488c into elastic:main Jun 23, 2025
27 checks passed
@iverase iverase deleted the ivfwriter_tmpfile branch June 23, 2025 12:44
kderusso pushed a commit to kderusso/elasticsearch that referenced this pull request Jun 23, 2025
mridula-s109 pushed a commit to mridula-s109/elasticsearch that referenced this pull request Jun 25, 2025
Labels
>non-issue · :Search Relevance/Vectors (Vector search) · Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch) · v9.1.0
3 participants