Add max num_candidates as a dynamic index setting #125065

Open · weizijun wants to merge 5 commits into main

Conversation

weizijun (Contributor)

Currently num_candidates must be lower than 10000; in some cases, users want to retrieve more vectors.
I think num_candidates could be governed by a parameter like max_window_size, so users can change the default value.
I added a new setting named index.max_knn_num_candidates, which can be modified dynamically.
The old PR (#125001) set it as an index_option parameter in the mappings; this PR changes it into an index setting.
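
For illustration, a minimal sketch of how such a dynamic, index-scoped setting could be declared with Elasticsearch's Setting API. The class name, default value (the current 10000 hard limit), and lower bound are assumptions, not the PR's actual code:

import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.common.settings.Setting.Property;

public final class KnnSearchSettings {
    // Property.Dynamic makes the value updatable at runtime via the update
    // settings API; Property.IndexScope makes it a per-index setting.
    public static final Setting<Integer> MAX_KNN_NUM_CANDIDATES_SETTING = Setting.intSetting(
        "index.max_knn_num_candidates", // setting key proposed by this PR
        10_000,                         // assumed default: today's hard limit
        1,                              // assumed minimum value
        Property.Dynamic,
        Property.IndexScope
    );
}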

@weizijun weizijun requested a review from a team as a code owner March 18, 2025 07:20
@elasticsearchmachine elasticsearchmachine added v9.1.0 needs:triage Requires assignment of a team area label external-contributor Pull request authored by a developer outside the Elasticsearch team labels Mar 18, 2025
benwtrent (Member) left a comment


Much nicer! Could you also add some documentation in docs/reference/elasticsearch/index-settings/index-modules.md?

@@ -177,7 +178,7 @@ private static Timer maybeStartTimer(DfsProfiler profiler, DfsTimingType dtt) {
         return null;
     };

-    private static void executeKnnVectorQuery(SearchContext context) throws IOException {
+    static void executeKnnVectorQuery(SearchContext context) throws IOException {
benwtrent (Member)


I think eagerly validating like this is OK. However, KnnVectorQueryBuilder#doToQuery should also validate, as it's possible to provide a knn query that isn't executed through DFS.

weizijun (Contributor, author)


Yeah, I think the check in KnnVectorQueryBuilder#doToQuery is better; I will change the check code.
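
For reference, a rough sketch of what the check in KnnVectorQueryBuilder#doToQuery could look like. The KnnSearchSettings constant, the validateNumCandidates helper, and the way the per-index settings are read are assumptions, not the PR's actual code:

import org.elasticsearch.index.query.SearchExecutionContext;

// Hypothetical helper that doToQuery could call before building the Lucene query.
static void validateNumCandidates(int numCands, SearchExecutionContext context) {
    int maxNumCandidates = KnnSearchSettings.MAX_KNN_NUM_CANDIDATES_SETTING.get(
        context.getIndexSettings().getSettings()
    );
    if (numCands > maxNumCandidates) {
        throw new IllegalArgumentException(
            "[num_candidates] cannot be greater than [" + maxNumCandidates
                + "]; this limit can be changed with the [index.max_knn_num_candidates] index setting"
        );
    }
}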

weizijun (Contributor, author)


done

@benwtrent benwtrent added the :Search Relevance/Vectors Vector search label Mar 18, 2025
@benwtrent benwtrent self-assigned this Mar 18, 2025
@elasticsearchmachine elasticsearchmachine added Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch and removed needs:triage Requires assignment of a team area label labels Mar 18, 2025
elasticsearchmachine (Collaborator)

Pinging @elastic/es-search-relevance (Team:Search Relevance)

tteofili (Contributor)

I wonder about the use case here. It sounds like the ask is to consider a vast number of candidates; I wonder if this wouldn't be better served by an explicit brute-force search?

weizijun (Contributor, author)

> I wonder about the use case here. It sounds like the ask is to consider a vast number of candidates; I wonder if this wouldn't be better served by an explicit brute-force search?

If the amount of data in a shard is too large, a brute-force search will be slow. You can set a larger number of candidates to obtain more vector data.

weizijun (Contributor, author)

> Much nicer! Could you also add some documentation in docs/reference/elasticsearch/index-settings/index-modules.md?

done
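
For context, the documentation entry would presumably follow the pattern of other dynamic index settings in index-modules.md; a sketch of what it might look like (the wording and the stated default are assumptions):

index.max_knn_num_candidates
:   The maximum value allowed for the num_candidates parameter of a knn search against the index. Defaults to 10000. Dynamic.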

benwtrent (Member)

@weizijun is the use-case here aggregating a large number of vector matches? Or an attempt to have very very accurate results from large graphs?

HNSW performance gets worse and worse the more vectors you search and wish to return.

I just want to better understand the reason for this change.

weizijun (Contributor, author) commented Apr 3, 2025

> HNSW performance gets worse and worse the more vectors you search and wish to return.
>
> I just want to better understand the reason for this change.

10000 is a hard limit. When the vector data is large, in some data-mining cases users want to get more docs, and 10000 docs is not enough. Self-driving customers, for example, want to find more images that are similar to the query image, which means raising the limit on num_candidates.

And I think there is another possible solution: FloatVectorSimilarityQuery in Lucene might work, since it collects all docs whose similarity meets a minimum similarity. The min-similarity parameter in Lucene's FloatVectorSimilarityQuery is not the same as the one defined in the Elasticsearch knn query.
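
For context, a minimal sketch of how Lucene's FloatVectorSimilarityQuery is constructed; the field name, thresholds, and the helper method are illustrative assumptions:

import org.apache.lucene.search.FloatVectorSimilarityQuery;
import org.apache.lucene.search.Query;

// Matches every document whose vector similarity to the target is at least
// resultSimilarity, rather than returning a fixed number of nearest neighbors.
static Query similarityQuery(float[] target) {
    float traversalSimilarity = 0.90f; // how far the HNSW graph search keeps expanding
    float resultSimilarity = 0.95f;    // minimum similarity a document must reach to be collected
    return new FloatVectorSimilarityQuery("image_vector", target, traversalSimilarity, resultSimilarity);
}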

benwtrent (Member)

> Self-driving customers, for example, want to find more images that are similar to the query image, which means raising the limit on num_candidates.

Increasing num_candidates doesn't fix this. You still won't be able to get more than 10k total nearest neighbors in the final result set due to the limitation of k.

So the actual use case is wanting to search over more than 10k total hits?

This also proves difficult as the maximum docs allowed (e.g. the total allowed to search over with from+size) is 10k.

I don't see how you can actually search over more than 10k nearest neighbors without significant changes to other parts of Elasticsearch.

> The min-similarity parameter in Lucene's FloatVectorSimilarityQuery is not the same as the one defined in the Elasticsearch knn query.

Correct, this is because the similarity query in Lucene as it is currently designed is syntactic sugar for "Please do a complete brute force query and filter on similarity". Very often it reverts to searching the entire index as HNSW is not designed for this type of query.

weizijun (Contributor, author) commented Apr 7, 2025

> Increasing num_candidates doesn't fix this. You still won't be able to get more than 10k total nearest neighbors in the final result set due to the limitation of k.
>
> So the actual use case is wanting to search over more than 10k total hits?
>
> This also proves difficult as the maximum docs allowed (e.g. the total allowed to search over with from+size) is 10k.
>
> I don't see how you can actually search over more than 10k nearest neighbors without significant changes to other parts of Elasticsearch.

When num_candidates is raised, max_window_size can also be raised, or scroll can be used to get more docs.

benwtrent (Member)

> When num_candidates is raised, max_window_size can also be raised, or scroll can be used to get more docs.

max_window_size: if you are referring to rescore parameters or the retrievers framework, again, those are all limited by the actual results you can get, which is 10k.

The only way I could see somebody getting MORE than 10k total docs is via scroll.

The only usages I see (without significant work elsewhere) for increasing num_candidates above 10,000 are:

  • Aggregation results
  • Hyper accurate results on very large graphs.

However, neither of these seem to be the reason for this change.

I am possibly confused here :(.

Labels: >enhancement, external-contributor, :Search Relevance/Vectors, Team:Search Relevance, v9.1.0