-
Notifications
You must be signed in to change notification settings - Fork 25.3k
[IVF] Simplify how we build CentroidAssignments #130709
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Pinging @elastic/es-search-relevance (Team:Search Relevance) |
this.cachedCentroids = cachedCentroids; | ||
this.assignmentsByCluster = assignmentsByCluster; | ||
} | ||
record CentroidAssignments(int numCentroids, float[][] centroids, int[][] assignmentsByCluster) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what I started with originally when I did this hkmeans work but if I remember correctly @benwtrent had concerns about the centroids being in memory on merge and had also originally in the code prior to hkmeans dropped the centroids from in memory after write and then read them in one at a time with what is now the CentroidSupplier. I like the simplification but I lack the intuition or the evaluations to suggest that leaving the centroids in memory on merge is ok, which I think is happening here because we'll hold onto the CentroidAssignments object after doing the calculateAndWriteCentroids call which will keep the centroids loaded. May also be worth considering refactoring the CentroidSupplier logic if we think it's ok to keep the centroids loaded in memory. I may be misunderstanding too so please feel free to correct me here. But i'll defer to Ben here largely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my other comment, we are moving how we use CentroidAssignments during merging so it is only on heap when it is needed.
try { | ||
centroidTemp = mergeState.segmentInfo.dir.createTempOutput(mergeState.segmentInfo.name, "civf_", IOContext.DEFAULT); | ||
centroidTempName = centroidTemp.getName(); | ||
centroidAssignments = calculateAndWriteCentroids( | ||
CentroidAssignments centroidAssignments = calculateAndWriteCentroids( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@john-wagster Having the CentroidAssignments here will prevent having the centroids in memory during merge
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah ok yes sorry I missed the subtle change there. I can get behind this. +1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
This commit removes the flag for cache centroids when building cetroid assigments as I don't think is needed. It makes CentroidAssignments a record too.