Unroll sketch increment #653

Merged: 6 commits, Jan 13, 2025

bitfaster
Owner

@bitfaster bitfaster commented Jan 7, 2025

Port of ben-manes/caffeine@91a36fb.

  • Unrolled outperforms looped on tested platforms. It may be better than using AVX on x64, but Neon instructions are still better on a Mac M2.
  • Increased the C# language version to 11, enabling use of the unsigned right shift operator (`>>>`).
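
For readers unfamiliar with the change, here is a minimal sketch of the difference between a looped and an unrolled increment on a block-based count-min sketch. It is written in Java, the language of the Caffeine commit being ported, and is a simplified illustration rather than the code in this PR. The class name `SketchIncrementDemo`, the method names, and the test hashes are all hypothetical; the real implementations derive `blockHash` and `counterHash` from the item's hash code. Java's `>>>` corresponds to the unsigned right shift operator that C# 11 introduced.

```java
import java.util.Arrays;

/** Simplified illustration only; not the code from this PR or from Caffeine. */
public class SketchIncrementDemo {
    // Each long packs sixteen 4-bit counters; 8 longs form one 64-byte block.
    public final long[] table;
    public final int blockMask;

    public SketchIncrementDemo(int blocks) { // assumption: blocks is a power of two
        this.table = new long[blocks * 8];
        this.blockMask = blocks - 1;
    }

    /** Adds 1 to the 4-bit counter at (slot, index) unless it is saturated at 15. */
    boolean incrementAt(int slot, int index) {
        int offset = index << 2;
        long mask = 0xfL << offset;
        if ((table[slot] & mask) != mask) {
            table[slot] += 1L << offset;
            return true;
        }
        return false;
    }

    /** Looped form: one iteration per counter, each using one byte of counterHash. */
    public boolean incrementLooped(int blockHash, int counterHash) {
        int block = (blockHash & blockMask) << 3;
        boolean added = false;
        for (int i = 0; i < 4; i++) {
            int h = counterHash >>> (i << 3);
            int index = (h >>> 1) & 15;            // which of the 16 counters in the long
            int slot = block + (h & 1) + (i << 1); // which long within the block
            added |= incrementAt(slot, index);
        }
        return added;
    }

    /** Unrolled form: the same four updates written out without a loop. */
    public boolean incrementUnrolled(int blockHash, int counterHash) {
        int block = (blockHash & blockMask) << 3;
        int h0 = counterHash;
        int h1 = counterHash >>> 8;
        int h2 = counterHash >>> 16;
        int h3 = counterHash >>> 24;
        // Non-short-circuit | so that all four counters are always updated.
        return incrementAt(block + (h0 & 1), (h0 >>> 1) & 15)
             | incrementAt(block + (h1 & 1) + 2, (h1 >>> 1) & 15)
             | incrementAt(block + (h2 & 1) + 4, (h2 >>> 1) & 15)
             | incrementAt(block + (h3 & 1) + 6, (h3 >>> 1) & 15);
    }

    public static void main(String[] args) {
        SketchIncrementDemo looped = new SketchIncrementDemo(16);
        SketchIncrementDemo unrolled = new SketchIncrementDemo(16);
        for (int i = 0; i < 10_000; i++) {
            int blockHash = i * 0x9E3779B9;               // arbitrary test hashes,
            int counterHash = Integer.reverse(blockHash); // not the library's hash functions
            looped.incrementLooped(blockHash, counterHash);
            unrolled.incrementUnrolled(blockHash, counterHash);
        }
        System.out.println(Arrays.equals(looped.table, unrolled.table)); // prints "true"
    }
}
```

Both forms touch the same four longs within a single block, so they produce identical tables; the unrolled form only removes the loop counter and lets the four updates be scheduled independently.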

On a Skylake CPU, the unrolled block increment is faster than AVX on .NET 8/9:
[Chart: BitFaster Caching Benchmarks Lfu SketchIncrement column chart]

Frequency is still faster on the AVX code path:
[Chart: BitFaster Caching Benchmarks Lfu SketchFrequency column chart]
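
For context on what the Frequency benchmarks measure, here is a minimal Java sketch of an unrolled frequency estimate on a block-based count-min sketch: read four 4-bit counters from one block and return the minimum. It is a simplified illustration under stated assumptions, not the code in this PR. The class name `SketchFrequencyDemo`, the method names, and the hand-filled table are hypothetical, and the hashes are passed in directly rather than derived from an item.

```java
/** Simplified illustration only; not the code from this PR or from Caffeine. */
public class SketchFrequencyDemo {
    // Each long packs sixteen 4-bit counters; 8 longs form one 64-byte block.
    public final long[] table;
    public final int blockMask;

    public SketchFrequencyDemo(int blocks) { // assumption: blocks is a power of two
        this.table = new long[blocks * 8];
        this.blockMask = blocks - 1;
    }

    /** Reads the 4-bit counter at position index within table[slot]. */
    int counterAt(int slot, int index) {
        return (int) ((table[slot] >>> (index << 2)) & 0xfL);
    }

    /** Unrolled estimate: four counter reads, one per byte of counterHash, then the minimum. */
    public int frequencyUnrolled(int blockHash, int counterHash) {
        int block = (blockHash & blockMask) << 3;
        int h0 = counterHash;
        int h1 = counterHash >>> 8;
        int h2 = counterHash >>> 16;
        int h3 = counterHash >>> 24;
        int c0 = counterAt(block + (h0 & 1), (h0 >>> 1) & 15);
        int c1 = counterAt(block + (h1 & 1) + 2, (h1 >>> 1) & 15);
        int c2 = counterAt(block + (h2 & 1) + 4, (h2 >>> 1) & 15);
        int c3 = counterAt(block + (h3 & 1) + 6, (h3 >>> 1) & 15);
        return Math.min(Math.min(c0, c1), Math.min(c2, c3));
    }

    public static void main(String[] args) {
        SketchFrequencyDemo sketch = new SketchFrequencyDemo(1);
        // counterHash = 0 selects counter 0 in longs 0, 2, 4 and 6 of the block.
        sketch.table[0] = 3;
        sketch.table[2] = 5;
        sketch.table[4] = 2;
        sketch.table[6] = 7;
        System.out.println(sketch.frequencyUnrolled(0, 0)); // prints 2
    }
}
```

The four reads have no data dependencies on each other, which is one plausible reason a scalar unrolled version can compete with a vectorized one when the block is already in cache.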

@coveralls

coveralls commented Jan 7, 2025

Coverage Status

coverage: 99.128% (-0.1%) from 99.228%
when pulling b99f98f on users/alexpeck/unroll
into b34b12c on main.

@bitfaster
Owner Author

bitfaster commented Jan 7, 2025

Skylake CPU:

Method Runtime Size Mean Error StdDev Ratio Allocated
IncFlat .NET 6.0 32768 21.68 ns 0.090 ns 0.080 ns 1.00 -
IncFlatAvx .NET 6.0 32768 19.61 ns 0.137 ns 0.114 ns 0.90 -
IncBlock .NET 6.0 32768 22.14 ns 0.070 ns 0.062 ns 1.02 -
IncBlockUnroll .NET 6.0 32768 18.50 ns 0.101 ns 0.090 ns 0.85 -
IncBlockAvxNotPinned .NET 6.0 32768 18.25 ns 0.045 ns 0.038 ns 0.84 -
IncBlockAvxPinned .NET 6.0 32768 17.18 ns 0.053 ns 0.042 ns 0.79 -
IncFlat .NET 8.0 32768 11.85 ns 0.030 ns 0.027 ns 1.00 -
IncFlatAvx .NET 8.0 32768 16.45 ns 0.326 ns 0.570 ns 1.39 -
IncBlock .NET 8.0 32768 15.69 ns 0.129 ns 0.114 ns 1.32 -
IncBlockUnroll .NET 8.0 32768 11.60 ns 0.094 ns 0.088 ns 0.98 -
IncBlockAvxNotPinned .NET 8.0 32768 15.92 ns 0.114 ns 0.107 ns 1.34 -
IncBlockAvxPinned .NET 8.0 32768 14.06 ns 0.277 ns 0.388 ns 1.19 -
IncFlat .NET 9.0 32768 11.50 ns 0.057 ns 0.051 ns 1.00 -
IncFlatAvx .NET 9.0 32768 15.45 ns 0.084 ns 0.075 ns 1.34 -
IncBlock .NET 9.0 32768 16.04 ns 0.162 ns 0.143 ns 1.40 -
IncBlockUnroll .NET 9.0 32768 12.16 ns 0.080 ns 0.071 ns 1.06 -
IncBlockAvxNotPinned .NET 9.0 32768 17.88 ns 0.154 ns 0.136 ns 1.56 -
IncBlockAvxPinned .NET 9.0 32768 13.66 ns 0.071 ns 0.059 ns 1.19 -
IncFlat .NET 6.0 524288 36.46 ns 0.727 ns 0.747 ns 1.00 -
IncFlatAvx .NET 6.0 524288 29.16 ns 0.459 ns 0.383 ns 0.80 -
IncBlock .NET 6.0 524288 32.79 ns 0.464 ns 0.411 ns 0.90 -
IncBlockUnroll .NET 6.0 524288 30.27 ns 0.239 ns 0.212 ns 0.83 -
IncBlockAvxNotPinned .NET 6.0 524288 27.65 ns 0.227 ns 0.190 ns 0.76 -
IncBlockAvxPinned .NET 6.0 524288 29.93 ns 1.333 ns 3.826 ns 0.82 -
IncFlat .NET 8.0 524288 23.89 ns 0.696 ns 1.894 ns 1.01 -
IncFlatAvx .NET 8.0 524288 29.05 ns 0.900 ns 2.509 ns 1.22 -
IncBlock .NET 8.0 524288 31.46 ns 0.783 ns 2.258 ns 1.32 -
IncBlockUnroll .NET 8.0 524288 21.09 ns 0.432 ns 1.197 ns 0.89 -
IncBlockAvxNotPinned .NET 8.0 524288 25.53 ns 0.479 ns 0.400 ns 1.07 -
IncBlockAvxPinned .NET 8.0 524288 23.57 ns 0.289 ns 0.256 ns 0.99 -
IncFlat .NET 9.0 524288 23.18 ns 0.435 ns 1.066 ns 1.00 -
IncFlatAvx .NET 9.0 524288 27.58 ns 0.664 ns 1.894 ns 1.19 -
IncBlock .NET 9.0 524288 32.62 ns 1.126 ns 3.229 ns 1.41 -
IncBlockUnroll .NET 9.0 524288 23.35 ns 0.863 ns 2.476 ns 1.01 -
IncBlockAvxNotPinned .NET 9.0 524288 29.34 ns 0.678 ns 1.933 ns 1.27 -
IncBlockAvxPinned .NET 9.0 524288 29.52 ns 1.075 ns 3.118 ns 1.28 -
IncFlat .NET 6.0 8388608 125.21 ns 2.423 ns 3.235 ns 1.00 -
IncFlatAvx .NET 6.0 8388608 90.96 ns 1.747 ns 1.634 ns 0.73 -
IncBlock .NET 6.0 8388608 109.40 ns 2.180 ns 2.834 ns 0.87 -
IncBlockUnroll .NET 6.0 8388608 73.37 ns 1.221 ns 1.082 ns 0.59 -
IncBlockAvxNotPinned .NET 6.0 8388608 79.76 ns 1.580 ns 4.189 ns 0.64 -
IncBlockAvxPinned .NET 6.0 8388608 71.31 ns 1.421 ns 3.643 ns 0.57 -
IncFlat .NET 8.0 8388608 67.96 ns 1.030 ns 0.913 ns 1.00 -
IncFlatAvx .NET 8.0 8388608 84.05 ns 1.039 ns 0.867 ns 1.24 -
IncBlock .NET 8.0 8388608 71.60 ns 0.617 ns 0.515 ns 1.05 -
IncBlockUnroll .NET 8.0 8388608 55.93 ns 0.570 ns 0.505 ns 0.82 -
IncBlockAvxNotPinned .NET 8.0 8388608 65.83 ns 0.492 ns 0.411 ns 0.97 -
IncBlockAvxPinned .NET 8.0 8388608 64.71 ns 0.889 ns 0.788 ns 0.95 -
IncFlat .NET 9.0 8388608 68.09 ns 1.171 ns 1.438 ns 1.00 -
IncFlatAvx .NET 9.0 8388608 83.79 ns 1.625 ns 1.806 ns 1.23 -
IncBlock .NET 9.0 8388608 71.59 ns 0.914 ns 0.714 ns 1.05 -
IncBlockUnroll .NET 9.0 8388608 55.51 ns 0.714 ns 0.596 ns 0.82 -
IncBlockAvxNotPinned .NET 9.0 8388608 64.74 ns 0.673 ns 0.630 ns 0.95 -
IncBlockAvxPinned .NET 9.0 8388608 63.89 ns 0.551 ns 0.489 ns 0.94 -
IncFlat .NET 6.0 134217728 152.93 ns 2.777 ns 4.071 ns 1.00 -
IncFlatAvx .NET 6.0 134217728 111.15 ns 2.185 ns 2.917 ns 0.73 -
IncBlock .NET 6.0 134217728 161.83 ns 3.547 ns 10.291 ns 1.06 -
IncBlockUnroll .NET 6.0 134217728 90.55 ns 1.689 ns 1.945 ns 0.59 -
IncBlockAvxNotPinned .NET 6.0 134217728 91.74 ns 2.454 ns 7.119 ns 0.60 -
IncBlockAvxPinned .NET 6.0 134217728 84.47 ns 1.688 ns 1.410 ns 0.55 -
IncFlat .NET 8.0 134217728 87.87 ns 1.544 ns 1.444 ns 1.00 -
IncFlatAvx .NET 8.0 134217728 106.61 ns 2.110 ns 3.406 ns 1.21 -
IncBlock .NET 8.0 134217728 86.86 ns 1.681 ns 1.936 ns 0.99 -
IncBlockUnroll .NET 8.0 134217728 79.42 ns 1.899 ns 5.325 ns 0.90 -
IncBlockAvxNotPinned .NET 8.0 134217728 86.72 ns 1.694 ns 1.584 ns 0.99 -
IncBlockAvxPinned .NET 8.0 134217728 89.37 ns 1.782 ns 4.129 ns 1.02 -
IncFlat .NET 9.0 134217728 90.63 ns 1.792 ns 3.496 ns 1.00 -
IncFlatAvx .NET 9.0 134217728 107.70 ns 2.148 ns 3.280 ns 1.19 -
IncBlock .NET 9.0 134217728 94.09 ns 1.869 ns 5.332 ns 1.04 -
IncBlockUnroll .NET 9.0 134217728 75.27 ns 1.898 ns 5.444 ns 0.83 -
IncBlockAvxNotPinned .NET 9.0 134217728 80.50 ns 1.239 ns 1.159 ns 0.89 -
IncBlockAvxPinned .NET 9.0 134217728 84.75 ns 1.680 ns 4.683 ns 0.94 -
Method Runtime Size Mean Error StdDev Ratio Allocated
FrequencyFlat .NET 6.0 32768 33.05 ns 0.301 ns 0.282 ns 1.00 -
FrequencyFlatAvx .NET 6.0 32768 35.95 ns 0.207 ns 0.184 ns 1.09 -
FrequencyBlock .NET 6.0 32768 30.08 ns 0.208 ns 0.195 ns 0.91 -
FrequencyBlockUnroll .NET 6.0 32768 22.32 ns 0.119 ns 0.111 ns 0.68 -
FrequencyBlockAvxNotPinned .NET 6.0 32768 31.92 ns 0.161 ns 0.135 ns 0.97 -
FrequencyBlockAvxPinned .NET 6.0 32768 30.86 ns 0.237 ns 0.221 ns 0.93 -
FrequencyFlat .NET 8.0 32768 22.37 ns 0.123 ns 0.115 ns 1.00 -
FrequencyFlatAvx .NET 8.0 32768 31.36 ns 0.056 ns 0.044 ns 1.40 -
FrequencyBlock .NET 8.0 32768 22.82 ns 0.097 ns 0.091 ns 1.02 -
FrequencyBlockUnroll .NET 8.0 32768 17.92 ns 0.128 ns 0.113 ns 0.80 -
FrequencyBlockAvxNotPinned .NET 8.0 32768 25.98 ns 0.085 ns 0.071 ns 1.16 -
FrequencyBlockAvxPinned .NET 8.0 32768 24.67 ns 0.088 ns 0.078 ns 1.10 -
FrequencyFlat .NET 9.0 32768 17.14 ns 0.045 ns 0.040 ns 1.00 -
FrequencyFlatAvx .NET 9.0 32768 30.24 ns 0.103 ns 0.097 ns 1.76 -
FrequencyBlock .NET 9.0 32768 20.44 ns 0.076 ns 0.067 ns 1.19 -
FrequencyBlockUnroll .NET 9.0 32768 18.18 ns 0.063 ns 0.055 ns 1.06 -
FrequencyBlockAvxNotPinned .NET 9.0 32768 25.75 ns 0.098 ns 0.092 ns 1.50 -
FrequencyBlockAvxPinned .NET 9.0 32768 24.79 ns 0.105 ns 0.093 ns 1.45 -
FrequencyFlat .NET 6.0 524288 59.29 ns 2.726 ns 8.036 ns 1.02 -
FrequencyFlatAvx .NET 6.0 524288 50.30 ns 0.988 ns 0.971 ns 0.86 -
FrequencyBlock .NET 6.0 524288 48.76 ns 3.475 ns 9.915 ns 0.84 -
FrequencyBlockUnroll .NET 6.0 524288 32.61 ns 0.611 ns 0.511 ns 0.56 -
FrequencyBlockAvxNotPinned .NET 6.0 524288 66.50 ns 1.760 ns 5.021 ns 1.14 -
FrequencyBlockAvxPinned .NET 6.0 524288 32.17 ns 0.368 ns 0.326 ns 0.55 -
FrequencyFlat .NET 8.0 524288 32.80 ns 0.445 ns 0.437 ns 1.00 -
FrequencyFlatAvx .NET 8.0 524288 43.35 ns 0.867 ns 0.964 ns 1.32 -
FrequencyBlock .NET 8.0 524288 31.68 ns 0.354 ns 0.296 ns 0.97 -
FrequencyBlockUnroll .NET 8.0 524288 27.10 ns 0.417 ns 0.390 ns 0.83 -
FrequencyBlockAvxNotPinned .NET 8.0 524288 43.28 ns 0.604 ns 0.536 ns 1.32 -
FrequencyBlockAvxPinned .NET 8.0 524288 25.70 ns 0.249 ns 0.194 ns 0.78 -
FrequencyFlat .NET 9.0 524288 31.67 ns 0.332 ns 0.260 ns 1.00 -
FrequencyFlatAvx .NET 9.0 524288 37.77 ns 0.610 ns 0.509 ns 1.19 -
FrequencyBlock .NET 9.0 524288 30.99 ns 0.523 ns 0.464 ns 0.98 -
FrequencyBlockUnroll .NET 9.0 524288 28.14 ns 0.547 ns 0.784 ns 0.89 -
FrequencyBlockAvxNotPinned .NET 9.0 524288 44.71 ns 0.475 ns 0.421 ns 1.41 -
FrequencyBlockAvxPinned .NET 9.0 524288 25.50 ns 0.269 ns 0.251 ns 0.81 -
FrequencyFlat .NET 6.0 8388608 137.52 ns 0.887 ns 0.786 ns 1.00 -
FrequencyFlatAvx .NET 6.0 8388608 144.48 ns 2.076 ns 1.840 ns 1.05 -
FrequencyBlock .NET 6.0 8388608 122.62 ns 0.711 ns 0.631 ns 0.89 -
FrequencyBlockUnroll .NET 6.0 8388608 109.87 ns 0.736 ns 0.689 ns 0.80 -
FrequencyBlockAvxNotPinned .NET 6.0 8388608 122.48 ns 0.950 ns 0.888 ns 0.89 -
FrequencyBlockAvxPinned .NET 6.0 8388608 71.91 ns 0.411 ns 0.384 ns 0.52 -
FrequencyFlat .NET 8.0 8388608 120.26 ns 0.751 ns 0.627 ns 1.00 -
FrequencyFlatAvx .NET 8.0 8388608 136.33 ns 0.555 ns 0.519 ns 1.13 -
FrequencyBlock .NET 8.0 8388608 107.74 ns 0.462 ns 0.386 ns 0.90 -
FrequencyBlockUnroll .NET 8.0 8388608 103.82 ns 0.679 ns 0.602 ns 0.86 -
FrequencyBlockAvxNotPinned .NET 8.0 8388608 88.08 ns 0.631 ns 0.527 ns 0.73 -
FrequencyBlockAvxPinned .NET 8.0 8388608 65.27 ns 0.288 ns 0.255 ns 0.54 -
FrequencyFlat .NET 9.0 8388608 117.87 ns 0.524 ns 0.490 ns 1.00 -
FrequencyFlatAvx .NET 9.0 8388608 133.52 ns 1.696 ns 1.587 ns 1.13 -
FrequencyBlock .NET 9.0 8388608 110.22 ns 1.746 ns 1.547 ns 0.94 -
FrequencyBlockUnroll .NET 9.0 8388608 106.63 ns 0.734 ns 0.651 ns 0.90 -
FrequencyBlockAvxNotPinned .NET 9.0 8388608 126.33 ns 11.730 ns 34.585 ns 1.07 -
FrequencyBlockAvxPinned .NET 9.0 8388608 64.92 ns 0.507 ns 0.449 ns 0.55 -
FrequencyFlat .NET 6.0 134217728 182.69 ns 3.641 ns 5.879 ns 1.00 -
FrequencyFlatAvx .NET 6.0 134217728 181.38 ns 0.948 ns 0.887 ns 0.99 -
FrequencyBlock .NET 6.0 134217728 147.83 ns 1.185 ns 1.050 ns 0.81 -
FrequencyBlockUnroll .NET 6.0 134217728 139.65 ns 2.546 ns 3.220 ns 0.77 -
FrequencyBlockAvxNotPinned .NET 6.0 134217728 161.53 ns 2.966 ns 2.774 ns 0.89 -
FrequencyBlockAvxPinned .NET 6.0 134217728 85.93 ns 1.488 ns 1.882 ns 0.47 -
FrequencyFlat .NET 8.0 134217728 154.33 ns 2.509 ns 2.684 ns 1.00 -
FrequencyFlatAvx .NET 8.0 134217728 173.88 ns 0.834 ns 0.781 ns 1.13 -
FrequencyBlock .NET 8.0 134217728 141.76 ns 3.763 ns 10.676 ns 0.92 -
FrequencyBlockUnroll .NET 8.0 134217728 130.79 ns 1.742 ns 1.544 ns 0.85 -
FrequencyBlockAvxNotPinned .NET 8.0 134217728 116.15 ns 1.484 ns 1.388 ns 0.75 -
FrequencyBlockAvxPinned .NET 8.0 134217728 78.65 ns 1.151 ns 1.077 ns 0.51 -
FrequencyFlat .NET 9.0 134217728 147.43 ns 1.259 ns 1.051 ns 1.00 -
FrequencyFlatAvx .NET 9.0 134217728 168.82 ns 1.739 ns 1.542 ns 1.15 -
FrequencyBlock .NET 9.0 134217728 135.06 ns 2.079 ns 1.843 ns 0.92 -
FrequencyBlockUnroll .NET 9.0 134217728 131.24 ns 1.302 ns 1.154 ns 0.89 -
FrequencyBlockAvxNotPinned .NET 9.0 134217728 113.74 ns 1.176 ns 1.042 ns 0.77 -
FrequencyBlockAvxPinned .NET 9.0 134217728 77.74 ns 0.995 ns 0.930 ns 0.53 -

@ben-manes

It’s super finicky: the more complex lookup can be slower than the data access. I think with CPU caches being so large, observing the memory penalty in a benchmark is very difficult without enormous sizes to force the misses. Instead the pipeline bypass, fewer data dependencies, etc. appear faster. I thought maybe I’d revisit when Java gets a vector API.

I’d probably say staying on the original sketch would have been the pragmatic right choice for a lower hash collision error rate, but since it was all for our fun then I don’t know if it matters.

@bitfaster
Owner Author

Totally - all this is more for my education, and I feel at worst it is noise in a practical setting.

There were significant improvements to the JIT in .NET 8/9. With your unrolled change, the non-vectorized code is faster than my vectorized code in these benchmarks on the newer JITs (I will verify this on more CPUs).

.NET 6 is now out of support, so it is likely irrelevant. I expect it is better to use your non-vectorized code on x64 in all cases. Using ARM intrinsics may still be a touch faster on Apple M-series CPUs - I saw greater gains there even on .NET 9.

The AVX code path was my first attempt at using hardware intrinsics. The way the AVX increment method is written, I suspect there is a penalty for mixing vector and non-vector registers when the result is written back to the array. Currently it does a single GatherVector256 load from the 512-bit cache line into one 256-bit register, followed by non-vectorized element stores. I want to try instead doing two 256-bit loads and stores, working directly in the 256-bit registers end to end. This requires completely rewriting the code, which makes my head hurt, so I haven't tried it yet.

@ben-manes

Well, it's really cool stuff. I've not had the fun of trying to write vectorized code, so I can only imagine the headache from reading this and blogs about it. It's also got to be a lot of fun mental gymnastics, though. The upcoming Java API is going to handle all the size selection itself, so I'm hoping to be a little less overwhelmed as the code is more generic.

I'm really surprised that neither JIT unrolled the loop itself. I even tried the Azul commercial JVM, which swapped in LLVM, but its GC algorithm injects memory barriers on every pointer read, which killed throughput, so it was a net loss. I thought our code was quite neat and understandable for a compiler to optimize.

The headache I probably had too much fun forcing on myself was optimizing the cache's memory barriers, which is also purely noise performance-wise. I had to refresh my memory when a false flag was reported by TSAN. I got a little scared that it actually made sense to me and I didn't need to review references to reason about it, which, given the complexity of coherency, makes me feel I'm more likely going insane than becoming an expert...

@bitfaster
Owner Author

Gymnastics for sure - it will be good to see what's possible with the Java API, how your code turns out, and how smart the Java compiler/JIT is. I'm convinced an expert assembly programmer would know all manner of tricks to make it faster. I need the magic combination of focus time + mental energy to try some more experiments.

It's a shame about the memory barriers in Azul - it's a smart idea to plug Java into LLVM. I searched and found a similar project for .NET, but it's not as mature.

Your memory barrier optimization tricks were all new to me, cool stuff. It took me some time to fully grasp and I made stress tests I ran for weeks on different machines because I am paranoid I made a dumb mistake (and I had).

@bitfaster
Owner Author

BenchmarkDotNet v0.14.0, macOS Sonoma 14.5 (23F79) [Darwin 23.5.0]
Apple M2, 1 CPU, 8 logical and 8 physical cores
.NET SDK 9.0.100
  [Host]   : .NET 6.0.30 (6.0.3024.21525), Arm64 RyuJIT AdvSIMD
  .NET 6.0 : .NET 6.0.30 (6.0.3024.21525), Arm64 RyuJIT AdvSIMD
  .NET 8.0 : .NET 8.0.0 (8.0.23.53103), Arm64 RyuJIT AdvSIMD
  .NET 9.0 : .NET 9.0.0 (9.0.24.52809), Arm64 RyuJIT AdvSIMD
Method Runtime Size Mean Error StdDev Ratio Allocated
IncFlat .NET 6.0 32768 13.873 ns 0.0031 ns 0.0024 ns 1.00 -
IncBlock .NET 6.0 32768 13.476 ns 0.0097 ns 0.0076 ns 0.97 -
IncBlockUnroll .NET 6.0 32768 11.212 ns 0.0042 ns 0.0035 ns 0.81 -
IncBlockNeonNotPinned .NET 6.0 32768 7.470 ns 0.0598 ns 0.0559 ns 0.54 -
IncBlockNeonPinned .NET 6.0 32768 7.828 ns 0.0876 ns 0.0820 ns 0.56 -
IncFlat .NET 8.0 32768 7.070 ns 0.0826 ns 0.0773 ns 1.00 -
IncBlock .NET 8.0 32768 10.257 ns 0.0288 ns 0.0270 ns 1.45 -
IncBlockUnroll .NET 8.0 32768 7.198 ns 0.0094 ns 0.0074 ns 1.02 -
IncBlockNeonNotPinned .NET 8.0 32768 5.981 ns 0.0188 ns 0.0167 ns 0.85 -
IncBlockNeonPinned .NET 8.0 32768 5.823 ns 0.0150 ns 0.0125 ns 0.82 -
IncFlat .NET 9.0 32768 6.900 ns 0.0572 ns 0.0535 ns 1.00 -
IncBlock .NET 9.0 32768 10.329 ns 0.0088 ns 0.0073 ns 1.50 -
IncBlockUnroll .NET 9.0 32768 7.140 ns 0.0098 ns 0.0076 ns 1.03 -
IncBlockNeonNotPinned .NET 9.0 32768 5.860 ns 0.0460 ns 0.0430 ns 0.85 -
IncBlockNeonPinned .NET 9.0 32768 5.697 ns 0.0575 ns 0.0538 ns 0.83 -
IncFlat .NET 6.0 524288 14.551 ns 0.1516 ns 0.1418 ns 1.00 -
IncBlock .NET 6.0 524288 16.228 ns 0.2122 ns 0.1985 ns 1.12 -
IncBlockUnroll .NET 6.0 524288 14.003 ns 0.1293 ns 0.1210 ns 0.96 -
IncBlockNeonNotPinned .NET 6.0 524288 7.616 ns 0.0222 ns 0.0197 ns 0.52 -
IncBlockNeonPinned .NET 6.0 524288 8.120 ns 0.0243 ns 0.0190 ns 0.56 -
IncFlat .NET 8.0 524288 7.607 ns 0.0515 ns 0.0482 ns 1.00 -
IncBlock .NET 8.0 524288 13.038 ns 0.0523 ns 0.0489 ns 1.71 -
IncBlockUnroll .NET 8.0 524288 9.802 ns 0.0406 ns 0.0379 ns 1.29 -
IncBlockNeonNotPinned .NET 8.0 524288 6.156 ns 0.0224 ns 0.0209 ns 0.81 -
IncBlockNeonPinned .NET 8.0 524288 6.023 ns 0.0308 ns 0.0288 ns 0.79 -
IncFlat .NET 9.0 524288 7.512 ns 0.0718 ns 0.0672 ns 1.00 -
IncBlock .NET 9.0 524288 12.726 ns 0.0767 ns 0.0718 ns 1.69 -
IncBlockUnroll .NET 9.0 524288 9.490 ns 0.0362 ns 0.0321 ns 1.26 -
IncBlockNeonNotPinned .NET 9.0 524288 6.023 ns 0.0452 ns 0.0423 ns 0.80 -
IncBlockNeonPinned .NET 9.0 524288 5.805 ns 0.0765 ns 0.0679 ns 0.77 -
IncFlat .NET 6.0 8388608 60.465 ns 0.0517 ns 0.0432 ns 1.00 -
IncBlock .NET 6.0 8388608 52.754 ns 0.0790 ns 0.0660 ns 0.87 -
IncBlockUnroll .NET 6.0 8388608 39.341 ns 0.0235 ns 0.0196 ns 0.65 -
IncBlockNeonNotPinned .NET 6.0 8388608 30.337 ns 0.0221 ns 0.0184 ns 0.50 -
IncBlockNeonPinned .NET 6.0 8388608 31.744 ns 0.0241 ns 0.0201 ns 0.52 -
IncFlat .NET 8.0 8388608 33.806 ns 0.0958 ns 0.0748 ns 1.00 -
IncBlock .NET 8.0 8388608 38.344 ns 0.0621 ns 0.0581 ns 1.13 -
IncBlockUnroll .NET 8.0 8388608 24.419 ns 0.0174 ns 0.0136 ns 0.72 -
IncBlockNeonNotPinned .NET 8.0 8388608 27.023 ns 0.4288 ns 0.4403 ns 0.80 -
IncBlockNeonPinned .NET 8.0 8388608 25.277 ns 0.0275 ns 0.0215 ns 0.75 -
IncFlat .NET 9.0 8388608 33.798 ns 0.0351 ns 0.0293 ns 1.00 -
IncBlock .NET 9.0 8388608 38.280 ns 0.0535 ns 0.0447 ns 1.13 -
IncBlockUnroll .NET 9.0 8388608 23.897 ns 0.0276 ns 0.0231 ns 0.71 -
IncBlockNeonNotPinned .NET 9.0 8388608 24.447 ns 0.0282 ns 0.0235 ns 0.72 -
IncBlockNeonPinned .NET 9.0 8388608 21.171 ns 0.0407 ns 0.0340 ns 0.63 -
IncFlat .NET 6.0 134217728 68.167 ns 0.1060 ns 0.0885 ns 1.00 -
IncBlock .NET 6.0 134217728 63.206 ns 0.0587 ns 0.0459 ns 0.93 -
IncBlockUnroll .NET 6.0 134217728 45.198 ns 0.0844 ns 0.0705 ns 0.66 -
IncBlockNeonNotPinned .NET 6.0 134217728 34.952 ns 0.0280 ns 0.0233 ns 0.51 -
IncBlockNeonPinned .NET 6.0 134217728 36.763 ns 0.0408 ns 0.0318 ns 0.54 -
IncFlat .NET 8.0 134217728 39.319 ns 0.0602 ns 0.0563 ns 1.00 -
IncBlock .NET 8.0 134217728 44.119 ns 0.0477 ns 0.0399 ns 1.12 -
IncBlockUnroll .NET 8.0 134217728 32.090 ns 0.0168 ns 0.0131 ns 0.82 -
IncBlockNeonNotPinned .NET 8.0 134217728 29.249 ns 0.0541 ns 0.0451 ns 0.74 -
IncBlockNeonPinned .NET 8.0 134217728 29.478 ns 0.0860 ns 0.0762 ns 0.75 -
IncFlat .NET 9.0 134217728 39.088 ns 0.0738 ns 0.0654 ns 1.00 -
IncBlock .NET 9.0 134217728 43.989 ns 0.0324 ns 0.0271 ns 1.13 -
IncBlockUnroll .NET 9.0 134217728 31.795 ns 0.0518 ns 0.0433 ns 0.81 -
IncBlockNeonNotPinned .NET 9.0 134217728 28.688 ns 0.0172 ns 0.0134 ns 0.73 -
IncBlockNeonPinned .NET 9.0 134217728 28.059 ns 0.0202 ns 0.0158 ns 0.72 -
Method Runtime Size Mean Error StdDev Ratio Allocated
FrequencyFlat .NET 6.0 32768 22.101 ns 0.1293 ns 0.1210 ns 1.00 -
FrequencyBlock .NET 6.0 32768 19.216 ns 0.0361 ns 0.0302 ns 0.87 -
FrequencyBlockUnroll .NET 6.0 32768 13.473 ns 0.0155 ns 0.0130 ns 0.61 -
FrequencyBlockNeonNotPinned .NET 6.0 32768 12.304 ns 0.0387 ns 0.0302 ns 0.56 -
FrequencyBlockNeonPinned .NET 6.0 32768 12.669 ns 0.1726 ns 0.1530 ns 0.57 -
FrequencyFlat .NET 8.0 32768 11.343 ns 0.0101 ns 0.0079 ns 1.00 -
FrequencyBlock .NET 8.0 32768 13.619 ns 0.0363 ns 0.0303 ns 1.20 -
FrequencyBlockUnroll .NET 8.0 32768 10.595 ns 0.0100 ns 0.0084 ns 0.93 -
FrequencyBlockNeonNotPinned .NET 8.0 32768 7.864 ns 0.0882 ns 0.0825 ns 0.69 -
FrequencyBlockNeonPinned .NET 8.0 32768 7.422 ns 0.0956 ns 0.0894 ns 0.65 -
FrequencyFlat .NET 9.0 32768 10.713 ns 0.0845 ns 0.0790 ns 1.00 -
FrequencyBlock .NET 9.0 32768 13.948 ns 0.0218 ns 0.0170 ns 1.30 -
FrequencyBlockUnroll .NET 9.0 32768 10.296 ns 0.0106 ns 0.0082 ns 0.96 -
FrequencyBlockNeonNotPinned .NET 9.0 32768 7.576 ns 0.0032 ns 0.0025 ns 0.71 -
FrequencyBlockNeonPinned .NET 9.0 32768 6.596 ns 0.0672 ns 0.0628 ns 0.62 -
FrequencyFlat .NET 6.0 524288 22.136 ns 0.0995 ns 0.0931 ns 1.00 -
FrequencyBlock .NET 6.0 524288 19.456 ns 0.0440 ns 0.0367 ns 0.88 -
FrequencyBlockUnroll .NET 6.0 524288 13.892 ns 0.0214 ns 0.0167 ns 0.63 -
FrequencyBlockNeonNotPinned .NET 6.0 524288 13.086 ns 0.1231 ns 0.1152 ns 0.59 -
FrequencyBlockNeonPinned .NET 6.0 524288 13.014 ns 0.1580 ns 0.1478 ns 0.59 -
FrequencyFlat .NET 8.0 524288 11.799 ns 0.0191 ns 0.0149 ns 1.00 -
FrequencyBlock .NET 8.0 524288 14.074 ns 0.0697 ns 0.0652 ns 1.19 -
FrequencyBlockUnroll .NET 8.0 524288 11.013 ns 0.0230 ns 0.0192 ns 0.93 -
FrequencyBlockNeonNotPinned .NET 8.0 524288 8.747 ns 0.0413 ns 0.0386 ns 0.74 -
FrequencyBlockNeonPinned .NET 8.0 524288 7.712 ns 0.0238 ns 0.0211 ns 0.65 -
FrequencyFlat .NET 9.0 524288 11.053 ns 0.0153 ns 0.0128 ns 1.00 -
FrequencyBlock .NET 9.0 524288 14.404 ns 0.1163 ns 0.1088 ns 1.30 -
FrequencyBlockUnroll .NET 9.0 524288 10.767 ns 0.0935 ns 0.0874 ns 0.97 -
FrequencyBlockNeonNotPinned .NET 9.0 524288 8.299 ns 0.1055 ns 0.0987 ns 0.75 -
FrequencyBlockNeonPinned .NET 9.0 524288 6.848 ns 0.0487 ns 0.0407 ns 0.62 -
FrequencyFlat .NET 6.0 8388608 103.143 ns 0.5006 ns 0.3908 ns 1.00 -
FrequencyBlock .NET 6.0 8388608 62.377 ns 0.3781 ns 0.3537 ns 0.60 -
FrequencyBlockUnroll .NET 6.0 8388608 56.975 ns 0.8338 ns 0.7800 ns 0.55 -
FrequencyBlockNeonNotPinned .NET 6.0 8388608 56.322 ns 0.1819 ns 0.1701 ns 0.55 -
FrequencyBlockNeonPinned .NET 6.0 8388608 55.027 ns 0.2942 ns 0.2752 ns 0.53 -
FrequencyFlat .NET 8.0 8388608 60.363 ns 0.0909 ns 0.0806 ns 1.00 -
FrequencyBlock .NET 8.0 8388608 58.007 ns 0.1571 ns 0.1227 ns 0.96 -
FrequencyBlockUnroll .NET 8.0 8388608 42.998 ns 0.0600 ns 0.0532 ns 0.71 -
FrequencyBlockNeonNotPinned .NET 8.0 8388608 43.520 ns 0.0390 ns 0.0346 ns 0.72 -
FrequencyBlockNeonPinned .NET 8.0 8388608 36.577 ns 0.0906 ns 0.0803 ns 0.61 -
FrequencyFlat .NET 9.0 8388608 52.367 ns 0.1089 ns 0.0909 ns 1.00 -
FrequencyBlock .NET 9.0 8388608 58.023 ns 0.0160 ns 0.0125 ns 1.11 -
FrequencyBlockUnroll .NET 9.0 8388608 40.052 ns 0.0289 ns 0.0226 ns 0.76 -
FrequencyBlockNeonNotPinned .NET 9.0 8388608 41.057 ns 0.0457 ns 0.0382 ns 0.78 -
FrequencyBlockNeonPinned .NET 9.0 8388608 26.974 ns 0.0071 ns 0.0055 ns 0.52 -
FrequencyFlat .NET 6.0 134217728 119.670 ns 0.0768 ns 0.0642 ns 1.00 -
FrequencyBlock .NET 6.0 134217728 72.488 ns 0.0787 ns 0.0698 ns 0.61 -
FrequencyBlockUnroll .NET 6.0 134217728 66.339 ns 0.1132 ns 0.1004 ns 0.55 -
FrequencyBlockNeonNotPinned .NET 6.0 134217728 59.022 ns 0.1104 ns 0.0922 ns 0.49 -
FrequencyBlockNeonPinned .NET 6.0 134217728 61.710 ns 1.2108 ns 1.5312 ns 0.52 -
FrequencyFlat .NET 8.0 134217728 66.547 ns 0.0826 ns 0.0690 ns 1.00 -
FrequencyBlock .NET 8.0 134217728 68.635 ns 0.0777 ns 0.0689 ns 1.03 -
FrequencyBlockUnroll .NET 8.0 134217728 50.909 ns 0.0646 ns 0.0572 ns 0.77 -
FrequencyBlockNeonNotPinned .NET 8.0 134217728 45.308 ns 0.0151 ns 0.0118 ns 0.68 -
FrequencyBlockNeonPinned .NET 8.0 134217728 40.602 ns 0.0282 ns 0.0220 ns 0.61 -
FrequencyFlat .NET 9.0 134217728 60.316 ns 0.0179 ns 0.0140 ns 1.00 -
FrequencyBlock .NET 9.0 134217728 68.726 ns 0.0241 ns 0.0188 ns 1.14 -
FrequencyBlockUnroll .NET 9.0 134217728 45.896 ns 0.0196 ns 0.0163 ns 0.76 -
FrequencyBlockNeonNotPinned .NET 9.0 134217728 41.303 ns 0.0310 ns 0.0275 ns 0.68 -
FrequencyBlockNeonPinned .NET 9.0 134217728 33.761 ns 0.0133 ns 0.0111 ns 0.56 -

@bitfaster bitfaster marked this pull request as ready for review January 13, 2025 01:31
@bitfaster bitfaster merged commit 410dae2 into main Jan 13, 2025
13 checks passed
@bitfaster bitfaster deleted the users/alexpeck/unroll branch January 13, 2025 01:46