Add max num_candidates as a dynamic index settings #125065
Much nicer! Could you also add some documentation in docs/reference/elasticsearch/index-settings/index-modules.md?
server/src/test/java/org/elasticsearch/search/dfs/DfsPhaseTests.java
@@ -177,7 +178,7 @@ private static Timer maybeStartTimer(DfsProfiler profiler, DfsTimingType dtt) {
         return null;
     };

-    private static void executeKnnVectorQuery(SearchContext context) throws IOException {
+    static void executeKnnVectorQuery(SearchContext context) throws IOException {
I think eagerly validating like this is OK. However, KnnVectorQueryBuilder#doToQuery should also validate, as it's possible to provide a knn query that isn't executed through DFS.
Yeah, I think the check in KnnVectorQueryBuilder#doToQuery is better. I will change the check code.
done
Pinging @elastic/es-search-relevance (Team:Search Relevance)
I wonder about the use case here. It sounds like the ask is to consider a vast number of candidates. I wonder if this wouldn't be best served by an explicit brute force search?
If the amount of shard data is too large, brute force search will be slow. You can set a larger number of candidates to obtain more vector data.
done
@weizijun is the use-case here aggregating a large number of vector matches? Or an attempt to have very very accurate results from large graphs? HNSW performance gets worse and worse the more vectors you search and wish to return. I am just wanting to better understand the reason for this change.
10000 is a hard limit. When the vector data is large, in some data mining cases users want to get more docs, and 10000 docs is not enough. Self-driving customers want to find more images that are similar to the query image, so they need to raise the limit of num_candidates. I think there is another possible solution: FloatVectorSimilarityQuery in Lucene, which collects all docs whose similarity is above the min similarity. The min similarity parameter in Lucene's FloatVectorSimilarityQuery is not the same as the one defined in the Elasticsearch knn query.
Increasing num_candidates doesn't fix this. You still won't be able to get more than 10k total nearest neighbors in the final result set due to the limitation of So the actual use case is wanting to search over more than 10k total hits? This also proves difficult as the maximum docs allowed (e.g. the total allowed to search over with I don't see how you can actually search over more than 10k nearest neighbors without significant changes to other parts of Elasticsearch.
Correct, this is because the similarity query in Lucene as it is currently designed is syntactic sugar for "Please do a complete brute force query and filter on similarity". Very often it reverts to searching the entire index as HNSW is not designed for this type of query.
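To make the "brute force query and filter on similarity" description concrete, here is a minimal sketch of that behaviour. It is not Lucene's implementation: the class and method names are hypothetical, and cosine similarity is assumed only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not Lucene or Elasticsearch code.
final class BruteForceSimilarityFilter {

    // Cosine similarity between two vectors of equal length (assumed metric).
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Scans every stored vector and keeps the doc ids whose similarity to the
    // query is at least minSimilarity. The cost grows linearly with index size.
    static List<Integer> matchingDocs(float[][] vectors, float[] query, double minSimilarity) {
        List<Integer> hits = new ArrayList<>();
        for (int doc = 0; doc < vectors.length; doc++) {
            if (cosine(vectors[doc], query) >= minSimilarity) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        float[][] vectors = { { 1f, 0f }, { 0f, 1f }, { 1f, 1f } };
        System.out.println(matchingDocs(vectors, new float[] { 1f, 0f }, 0.7));
    }
}
```

Every vector is scored regardless of how many docs match, which is why this style of query is slow on large shards.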
When num_candidates is raised, users can also raise max_window_size, or use scroll to get more docs.
The only way I could see somebody getting MORE than 10k total docs is via The only usages I see (without significant work elsewhere) for increasing
However, neither of these seems to be the reason for this change. I am possibly confused here :(.
Currently num_candidates is capped at 10000. In some cases, users want to get more vectors.
I think the num_candidates limit can be a configurable parameter, like max_window_size, so that users can change the default value.
I added a new setting named index.max_knn_num_candidates. It can be modified dynamically.
The old PR (#125001) set it as an index_option parameter in the mappings.
The new PR changes it into an index setting.
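
As a rough illustration of the idea, the sketch below models a per-index limit that can be updated at runtime and is checked when a knn query is built. The class and method names are hypothetical and are not the code in this PR. The default of 10000 comes from the limit described above, and the `k` versus `num_candidates` check is an assumption added for completeness.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a dynamic index setting; not Elasticsearch code.
final class KnnCandidateLimit {

    // Default mirrors the 10000 limit discussed in this PR.
    static final int DEFAULT_MAX_NUM_CANDIDATES = 10_000;

    // AtomicInteger so that an update is visible to concurrent searches.
    private final AtomicInteger maxNumCandidates = new AtomicInteger(DEFAULT_MAX_NUM_CANDIDATES);

    // Called when the (assumed) setting value changes at runtime.
    void update(int newMax) {
        if (newMax < 1) {
            throw new IllegalArgumentException("max num_candidates must be positive, got [" + newMax + "]");
        }
        maxNumCandidates.set(newMax);
    }

    // Validation at query-build time, so it applies whether or not the query
    // goes through the DFS phase.
    void validate(int k, int numCandidates) {
        int max = maxNumCandidates.get();
        if (numCandidates > max) {
            throw new IllegalArgumentException("[num_candidates] cannot exceed [" + max + "]");
        }
        if (k > numCandidates) {
            throw new IllegalArgumentException("[num_candidates] cannot be less than [k]");
        }
    }

    public static void main(String[] args) {
        KnnCandidateLimit limit = new KnnCandidateLimit();
        limit.update(50_000);        // simulate a dynamic settings update
        limit.validate(100, 20_000); // accepted under the raised limit
        System.out.println("num_candidates=20000 accepted");
    }
}
```

Checking in the query builder rather than only in the DFS phase follows the review suggestion above about KnnVectorQueryBuilder#doToQuery.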