-
Notifications
You must be signed in to change notification settings - Fork 25.3k
Leverage One Layer of Hierarchical KMeans Structure on Read #129950
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…veral optimizations / clean up before we can eval
…earch into ivf_hkmeans_struc2
…earch into ivf_hkmeans_struc2
…earch into ivf_hkmeans_struc2
…earch into ivf_hkmeans_struc2
interesting assert tripped in these PR tests: #130472 (comment) (assuming its been fixed) |
I get the feeling I know which assert it was and I do believe it's been fixed. I had no idea draft PR's could mute tests on main though. That's a little surprising to me. I'll have to try to be more careful about what I push up in terms of asserts in the future. |
@john-wagster no, I wouldn't. Your PR is your PR. Add all the assertions you want. I think the automation is completely mistaken in taking your PR into account when it comes to a test failure. |
int centroidOrdinal = centroidQueue.pop(); | ||
scorer.resetPostingsScorer(centroidOrdinal, centroidQueryScorer.centroid(centroidOrdinal)); | ||
if (priorVisitedCentroids.getAndSet(centroidOrdinal)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be wrong, the centroid ordinals are always unique so we might be visiting the same centroid more than once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The centroids can be duplicated across parent partitions because of the SOAR logic now. So this bitset is meant to deduplicate when a new parent partition is included and it may cause us to re-explore centroids unnecessarily. Are you observing that this logic doesn't work? I can double check it.
@@ -325,9 +342,12 @@ private void mergeOneFieldIVF(FieldInfo fieldInfo, MergeState mergeState) throws | |||
|
|||
CentroidSupplier centroidSupplier = createCentroidSupplier( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It took me a while to understand this, I think we should not reuse the written centroids now that we have a complex structure but write them in the temp file in order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I've been confused about this code too. I'm not sure I fully understood why we have the temp file. I think it's been like this for awhile? It looks like it just gets written and then immediately copied into ivfCentroids. I think that hasn't changed. Do you happen to know if there is some reason we were doing this previously?
And are you suggesting that above we should be passing in both centroidTemp
and ivfCentroids
in calculateAndWriteCentroids
and then writing the in order centroids to centroidTemp. If so I like this idea and am happy to refactor to do so. If I'm misunderstanding please let me know.
Leverages the hkmeans algorithms structure to create an additional layer of high level centroids (parent partitions) that get queries prior to the full list of centroids themselves.