optimize OptimizedScalarQuantizer#scalarQuantize when destination can be an integer array #129874


Merged: 13 commits into elastic:main on Jul 2, 2025

Conversation

@iverase (Contributor) commented Jun 23, 2025

It is possible to optimize this method when the destination array is an integer array. In that case it is easy to panamize (vectorize with the Panama Vector API) the following loop:

float nSteps = (1 << bits) - 1;
float step = (upperInterval - lowInterval) / nSteps;
int sumQuery = 0;
for (int h = 0; h < vector.length; h++) {
    float xi = Math.min(Math.max(vector[h], lowInterval), upperInterval);
    int assignment = Math.round((xi - lowInterval) / step);
    sumQuery += assignment;
    destination[h] = assignment;
}
return sumQuery;
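For reference, here is a minimal sketch of what a panamized version of this loop could look like with jdk.incubator.vector. This is an illustration, not the PR's actual implementation: the class and method names are made up, and rounding is approximated by adding 0.5f before the float-to-int conversion, which matches Math.round for the nonnegative values produced after clamping.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical sketch of the vectorized quantization loop; not the PR's code.
public class QuantizeSketch {
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static int quantizeToInts(float[] vector, int[] destination, byte bits,
                              float lowInterval, float upperInterval) {
        float nSteps = (1 << bits) - 1;
        float step = (upperInterval - lowInterval) / nSteps;
        float invStep = 1f / step;
        int sumQuery = 0;
        int i = 0;
        int bound = SPECIES.loopBound(vector.length);
        // Vectorized main loop: clamp, scale, round, store, and accumulate per lane.
        for (; i < bound; i += SPECIES.length()) {
            FloatVector v = FloatVector.fromArray(SPECIES, vector, i)
                    .max(lowInterval)   // clamp below
                    .min(upperInterval) // clamp above
                    .sub(lowInterval)
                    .mul(invStep)
                    .add(0.5f);         // round-to-nearest via truncation (values are >= 0 here)
            IntVector assignment = (IntVector) v.convert(VectorOperators.F2I, 0);
            assignment.intoArray(destination, i);
            sumQuery += assignment.reduceLanes(VectorOperators.ADD);
        }
        // Scalar tail for the remaining elements.
        for (; i < vector.length; i++) {
            float xi = Math.min(Math.max(vector[i], lowInterval), upperInterval);
            int assignment = Math.round((xi - lowInterval) * invStep);
            sumQuery += assignment;
            destination[i] = assignment;
        }
        return sumQuery;
    }
}
```

Note that this needs to be compiled and run with `--add-modules=jdk.incubator.vector`, and that the per-iteration `intoArray` store and `reduceLanes` reduction are exactly the operations discussed below as potential bottlenecks.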

@elasticsearchmachine elasticsearchmachine added v9.1.0 needs:triage Requires assignment of a team area label labels Jun 23, 2025
@iverase iverase added >non-issue :Search Relevance/Search Catch all for Search Relevance and removed needs:triage Requires assignment of a team area label labels Jun 23, 2025
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Jun 23, 2025
@@ -141,6 +141,36 @@ public QuantizationResult scalarQuantize(float[] vector, byte[] destination, byt
);
}

public QuantizationResult scalarQuantizeToInts(float[] vector, int[] destination, byte bits, float[] centroid) {
Member

If we are confident in the speed improvements, I would argue that we shouldn't bother with this new method and should simply adjust the APIs.

The on-disk format is unchanged, so any older formats that rely on the scalarQuantize signature can be adjusted.

@benwtrent (Member)

I ran this on my MacBook; there was no significant performance improvement.

You saw an improvement on AVX256?

    @Benchmark
    @Fork(jvmArgsPrepend = { "--add-modules=jdk.incubator.vector" })
    public int[] quantizeIntervalVector() {
        osq.scalarQuantizeToInts(vector, intDestination, bits, centroid);
        return intDestination;
    }

    @Benchmark
    @Fork(jvmArgsPrepend = { "--add-modules=jdk.incubator.vector" })
    public byte[] quantizeIntervalScalar() {
        osq.scalarQuantize(vector, destination, bits, centroid);
        return destination;
    }
./gradlew -p benchmarks run --args 'OptimizedScalarQuantizerBenchmark.quantizeInterval*'
Benchmark                                                 (bits)  (dims)   Mode  Cnt    Score    Error   Units
OptimizedScalarQuantizerBenchmark.quantizeIntervalScalar       1     768  thrpt   15  223.539 ± 21.878  ops/ms
OptimizedScalarQuantizerBenchmark.quantizeIntervalScalar       4     768  thrpt   15  216.348 ± 15.598  ops/ms
OptimizedScalarQuantizerBenchmark.quantizeIntervalScalar       7     768  thrpt   15  249.156 ± 19.012  ops/ms
OptimizedScalarQuantizerBenchmark.quantizeIntervalVector       1     768  thrpt   15  285.095 ± 39.949  ops/ms
OptimizedScalarQuantizerBenchmark.quantizeIntervalVector       4     768  thrpt   15  264.729 ± 56.267  ops/ms
OptimizedScalarQuantizerBenchmark.quantizeIntervalVector       7     768  thrpt   15  223.167 ± 96.211  ops/ms

@iverase (Author) commented Jun 23, 2025

I saw the same running locally. I will see if I can speed it up on the Mac tomorrow.

@iverase (Author) commented Jun 24, 2025

Running the benchmarks on AVX512 shows a nice improvement:

Benchmark                                      (bits)  (dims)   Mode  Cnt    Score    Error   Units
OptimizedScalarQuantizerBenchmark.vector            1     384  thrpt   15  171.379 ± 13.438  ops/ms
OptimizedScalarQuantizerBenchmark.vector            1     702  thrpt   15   86.122 ± 12.200  ops/ms
OptimizedScalarQuantizerBenchmark.vector            1    1024  thrpt   15   66.933 ±  5.120  ops/ms
OptimizedScalarQuantizerBenchmark.vector            4     384  thrpt   15  164.831 ± 13.994  ops/ms
OptimizedScalarQuantizerBenchmark.vector            4     702  thrpt   15   77.198 ±  3.074  ops/ms
OptimizedScalarQuantizerBenchmark.vector            4    1024  thrpt   15   60.467 ±  2.358  ops/ms
OptimizedScalarQuantizerBenchmark.vector            7     384  thrpt   15  170.618 ± 10.339  ops/ms
OptimizedScalarQuantizerBenchmark.vector            7     702  thrpt   15   88.564 ±  5.674  ops/ms
OptimizedScalarQuantizerBenchmark.vector            7    1024  thrpt   15   65.541 ±  3.965  ops/ms
OptimizedScalarQuantizerBenchmark.vectorToInt       1     384  thrpt   15  375.294 ± 64.289  ops/ms
OptimizedScalarQuantizerBenchmark.vectorToInt       1     702  thrpt   15  172.473 ± 34.064  ops/ms
OptimizedScalarQuantizerBenchmark.vectorToInt       1    1024  thrpt   15  141.787 ± 10.931  ops/ms
OptimizedScalarQuantizerBenchmark.vectorToInt       4     384  thrpt   15  345.374 ± 54.182  ops/ms
OptimizedScalarQuantizerBenchmark.vectorToInt       4     702  thrpt   15  162.475 ± 32.259  ops/ms
OptimizedScalarQuantizerBenchmark.vectorToInt       4    1024  thrpt   15  141.852 ± 21.408  ops/ms
OptimizedScalarQuantizerBenchmark.vectorToInt       7     384  thrpt   15  389.065 ± 38.826  ops/ms
OptimizedScalarQuantizerBenchmark.vectorToInt       7     702  thrpt   15  168.443 ± 13.488  ops/ms
OptimizedScalarQuantizerBenchmark.vectorToInt       7    1024  thrpt   15  153.517 ± 23.609  ops/ms

Running the benchmarks on AVX2 still shows an improvement, though a bit smaller than on AVX512:

Benchmark                                      (bits)  (dims)   Mode  Cnt    Score    Error   Units
OptimizedScalarQuantizerBenchmark.vector            1     384  thrpt   15  321.188 ± 44.707  ops/ms
OptimizedScalarQuantizerBenchmark.vector            1     702  thrpt   15  182.116 ± 25.225  ops/ms
OptimizedScalarQuantizerBenchmark.vector            1    1024  thrpt   15  120.325 ± 10.075  ops/ms
OptimizedScalarQuantizerBenchmark.vector            4     384  thrpt   15  302.188 ± 30.618  ops/ms
OptimizedScalarQuantizerBenchmark.vector            4     702  thrpt   15  163.192 ± 15.973  ops/ms
OptimizedScalarQuantizerBenchmark.vector            4    1024  thrpt   15  112.773 ±  7.515  ops/ms
OptimizedScalarQuantizerBenchmark.vector            7     384  thrpt   15  338.509 ± 36.895  ops/ms
OptimizedScalarQuantizerBenchmark.vector            7     702  thrpt   15  170.923 ±  9.042  ops/ms
OptimizedScalarQuantizerBenchmark.vector            7    1024  thrpt   15  126.368 ± 10.726  ops/ms
OptimizedScalarQuantizerBenchmark.vectorToInt       1     384  thrpt   15  457.541 ± 72.812  ops/ms
OptimizedScalarQuantizerBenchmark.vectorToInt       1     702  thrpt   15  240.875 ± 28.577  ops/ms
OptimizedScalarQuantizerBenchmark.vectorToInt       1    1024  thrpt   15  172.829 ± 20.372  ops/ms
OptimizedScalarQuantizerBenchmark.vectorToInt       4     384  thrpt   15  457.395 ± 65.854  ops/ms
OptimizedScalarQuantizerBenchmark.vectorToInt       4     702  thrpt   15  185.427 ± 55.006  ops/ms
OptimizedScalarQuantizerBenchmark.vectorToInt       4    1024  thrpt   15  155.515 ± 17.608  ops/ms
OptimizedScalarQuantizerBenchmark.vectorToInt       7     384  thrpt   15  455.067 ± 49.820  ops/ms
OptimizedScalarQuantizerBenchmark.vectorToInt       7     702  thrpt   15  280.439 ± 34.467  ops/ms
OptimizedScalarQuantizerBenchmark.vectorToInt       7    1024  thrpt   15  162.441 ±  7.979  ops/ms

I suspect that the expensive part of the algorithm is the intoArray call. The larger the vector register size, the fewer calls we need to make to that method, so the faster it goes. On the Mac, with a 128-bit register size, we just don't see any real improvement.

@benwtrent (Member)

@iverase Do you still want to make this change? I can review when you are ready.

@iverase (Author) commented Jul 1, 2025

I couldn't make it any faster on the Mac, so we are not getting a speedup there, but it is clearly faster on AVX2 and AVX512, so I would say it is a net win. I am good to push it as it is.

@benwtrent (Member) left a comment

It's good to me! Being faster on 256/512 is a good win and it's no slower on NEON-128.

iverase added 3 commits July 2, 2025 07:29
# Conflicts:
#	server/src/main/java/org/elasticsearch/index/codec/vectors/DefaultIVFVectorsWriter.java
@iverase iverase merged commit f81d355 into elastic:main Jul 2, 2025
32 checks passed
@iverase iverase deleted the quantizeVectorWithIntervals branch July 2, 2025 13:58
mridula-s109 pushed a commit to mridula-s109/elasticsearch that referenced this pull request Jul 3, 2025
optimize OptimizedScalarQuantizer#scalarQuantize when destination can be an integer array
Labels
>non-issue :Search Relevance/Search Catch all for Search Relevance Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch v9.2.0