Add max num_candidates as a dynamic index setting #125065

Open · weizijun wants to merge 5 commits into main

Conversation

weizijun (Contributor)

Currently num_candidates must be lower than 10000; in some cases, users want to retrieve more vectors.
I think num_candidates could be governed by a parameter like max_window_size, so users can change the default value.
I added a new setting named index.max_knn_num_candidates, which can be modified dynamically.
The old PR (#125001) set it as an index_option parameter in the mappings; this PR changes it into an index setting.
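
For illustration, a minimal sketch of how such a dynamic, index-scoped setting could be declared with Elasticsearch's Setting API. The class name, default value (the current 10000 hard limit), and lower bound are assumptions, not the PR's actual code:

import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.common.settings.Setting.Property;

public final class KnnSearchSettings {
    // Property.Dynamic makes the value updatable at runtime via the update
    // settings API; Property.IndexScope makes it a per-index setting.
    public static final Setting<Integer> MAX_KNN_NUM_CANDIDATES_SETTING = Setting.intSetting(
        "index.max_knn_num_candidates", // setting key proposed by this PR
        10_000,                         // assumed default: today's hard limit
        1,                              // assumed minimum value
        Property.Dynamic,
        Property.IndexScope
    );
}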

@weizijun weizijun requested a review from a team as a code owner March 18, 2025 07:20
@elasticsearchmachine elasticsearchmachine added v9.1.0 needs:triage Requires assignment of a team area label external-contributor Pull request authored by a developer outside the Elasticsearch team labels Mar 18, 2025
benwtrent (Member) left a comment


Much nicer! Could you also add some documentation in docs/reference/elasticsearch/index-settings/index-modules.md?

@@ -177,7 +178,7 @@ private static Timer maybeStartTimer(DfsProfiler profiler, DfsTimingType dtt) {
         return null;
     };

-    private static void executeKnnVectorQuery(SearchContext context) throws IOException {
+    static void executeKnnVectorQuery(SearchContext context) throws IOException {
benwtrent (Member)


I think eagerly validating like this is OK. However, KnnVectorQueryBuilder#doToQuery should also validate, as it's possible to provide a knn query that isn't executed through DFS.

weizijun (Contributor, author)


Yeah, I think the check in KnnVectorQueryBuilder#doToQuery is better; I will change the check code.
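
For reference, a rough sketch of what the check in KnnVectorQueryBuilder#doToQuery could look like. The KnnSearchSettings constant, the validateNumCandidates helper, and the way the per-index settings are read are assumptions, not the PR's actual code:

import org.elasticsearch.index.query.SearchExecutionContext;

// Hypothetical helper that doToQuery could call before building the Lucene query.
static void validateNumCandidates(int numCands, SearchExecutionContext context) {
    int maxNumCandidates = KnnSearchSettings.MAX_KNN_NUM_CANDIDATES_SETTING.get(
        context.getIndexSettings().getSettings()
    );
    if (numCands > maxNumCandidates) {
        throw new IllegalArgumentException(
            "[num_candidates] cannot be greater than [" + maxNumCandidates
                + "]; this limit can be changed with the [index.max_knn_num_candidates] index setting"
        );
    }
}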

weizijun (Contributor, author)


done

@benwtrent benwtrent added the :Search Relevance/Vectors Vector search label Mar 18, 2025
@benwtrent benwtrent self-assigned this Mar 18, 2025
@elasticsearchmachine elasticsearchmachine added Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch and removed needs:triage Requires assignment of a team area label labels Mar 18, 2025
elasticsearchmachine (Collaborator)

Pinging @elastic/es-search-relevance (Team:Search Relevance)

tteofili (Contributor)

I wonder about the use case here. It sounds like the ask is to consider a vast number of candidates; I wonder if this wouldn't be better served by an explicit brute-force search?

weizijun (Contributor, author)

> I wonder about the use case here. It sounds like the ask is to consider a vast number of candidates; I wonder if this wouldn't be better served by an explicit brute-force search?

If the amount of data in a shard is too large, a brute-force search will be slow. You can set a larger number of candidates to obtain more vector data.

weizijun (Contributor, author)

> Much nicer! Could you also add some documentation in docs/reference/elasticsearch/index-settings/index-modules.md?

done
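
For context, the documentation entry would presumably follow the pattern of other dynamic index settings in index-modules.md; a sketch of what it might look like (the wording and the stated default are assumptions):

index.max_knn_num_candidates
:   The maximum value allowed for the num_candidates parameter of a knn search against the index. Defaults to 10000. Dynamic.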

benwtrent (Member)

@weizijun is the use-case here aggregating a large number of vector matches? Or an attempt to have very very accurate results from large graphs?

HNSW performance gets worse and worse the more vectors you search and wish to return.

I just want to better understand the reason for this change.

weizijun (Contributor, author) commented Apr 3, 2025

> HNSW performance gets worse and worse the more vectors you search and wish to return.
>
> I just want to better understand the reason for this change.

10000 is a hard limit. When the vector data is large, in some data-mining cases users want to get more docs, and 10000 docs is not enough. Self-driving customers, for example, want to find more images that are similar to the query image, which means raising the limit on num_candidates.

And I think there is another possible solution: FloatVectorSimilarityQuery in Lucene might work, since it collects all docs whose similarity meets a minimum similarity. The min-similarity parameter in Lucene's FloatVectorSimilarityQuery is not the same as the one defined in the Elasticsearch knn query.
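
For context, a minimal sketch of how Lucene's FloatVectorSimilarityQuery is constructed; the field name, thresholds, and the helper method are illustrative assumptions:

import org.apache.lucene.search.FloatVectorSimilarityQuery;
import org.apache.lucene.search.Query;

// Matches every document whose vector similarity to the target is at least
// resultSimilarity, rather than returning a fixed number of nearest neighbors.
static Query similarityQuery(float[] target) {
    float traversalSimilarity = 0.90f; // how far the HNSW graph search keeps expanding
    float resultSimilarity = 0.95f;    // minimum similarity a document must reach to be collected
    return new FloatVectorSimilarityQuery("image_vector", target, traversalSimilarity, resultSimilarity);
}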

benwtrent (Member)

> Self-driving customers, for example, want to find more images that are similar to the query image, which means raising the limit on num_candidates.

Increasing num_candidates doesn't fix this. You still won't be able to get more than 10k total nearest neighbors in the final result set due to the limitation of k.

So the actual use case is wanting to search over more than 10k total hits?

This also proves difficult as the maximum docs allowed (e.g. the total allowed to search over with from+size) is 10k.

I don't see how you can actually search over more than 10k nearest neighbors without significant changes to other parts of Elasticsearch.

> The min-similarity parameter in Lucene's FloatVectorSimilarityQuery is not the same as the one defined in the Elasticsearch knn query.

Correct, this is because the similarity query in Lucene as it is currently designed is syntactic sugar for "Please do a complete brute force query and filter on similarity". Very often it reverts to searching the entire index as HNSW is not designed for this type of query.

weizijun (Contributor, author) commented Apr 7, 2025

> Increasing num_candidates doesn't fix this. You still won't be able to get more than 10k total nearest neighbors in the final result set due to the limitation of k.
>
> So the actual use case is wanting to search over more than 10k total hits?
>
> This also proves difficult as the maximum docs allowed (e.g. the total allowed to search over with from+size) is 10k.
>
> I don't see how you can actually search over more than 10k nearest neighbors without significant changes to other parts of Elasticsearch.

When num_candidates is raised, max_window_size can also be raised, or scroll can be used to get more docs.

benwtrent (Member)

> When num_candidates is raised, max_window_size can also be raised, or scroll can be used to get more docs.

max_window_size: if you are referring to rescore parameters or the retrievers framework, again, those are all limited by the actual results you can get, which is 10k.

The only way I could see somebody getting MORE than 10k total docs is via scroll.

The only usages I see (without significant work elsewhere) for increasing num_candidates above 10,000 are:

  • Aggregation results
  • Hyper accurate results on very large graphs.

However, neither of these seem to be the reason for this change.

I am possibly confused here :(.

Labels: >enhancement, external-contributor, :Search Relevance/Vectors, Team:Search Relevance, v9.1.0