Unroll sketch increment #653
Conversation
Skylake CPU:
It’s super finicky: the more complex lookup can be slower than the data access. With CPU caches being so large, observing the memory penalty in a benchmark is very difficult without enormous sizes to force the misses. Instead, the pipeline bypass, fewer data dependencies, etc. appear faster. I thought maybe I’d revisit this when Java gets a vector API. I’d probably say staying on the original sketch would have been the pragmatic choice for its lower hash collision error rate, but since it was all for our fun I don’t know if it matters.
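For reference, a rough sketch of the shape of the unrolled block increment being discussed - illustrative only, not the actual ported code; the hash mixing and block layout below are made up. The point is that all index/offset math is hoisted up front, so the four counter updates run back to back with fewer data dependencies between the address computation and the memory writes.

```csharp
// Illustrative sketch only (not the actual ported code): the hash mixing and
// block layout here are hypothetical. All offsets are computed first, then the
// four 4-bit counter updates are applied back to back.
private static void IncrementUnrolled(long[] table, int block, uint hash)
{
    // Derive four sub-hashes (hypothetical mixing).
    uint h0 = hash;
    uint h1 = hash >> 8;
    uint h2 = hash >> 16;
    uint h3 = hash >> 24;

    // Pick the counter index within each word and the word within the block.
    int index0 = (int)(h0 >> 1) & 15;
    int index1 = (int)(h1 >> 1) & 15;
    int index2 = (int)(h2 >> 1) & 15;
    int index3 = (int)(h3 >> 1) & 15;

    int slot0 = block + (int)(h0 & 1);
    int slot1 = block + (int)(h1 & 1) + 2;
    int slot2 = block + (int)(h2 & 1) + 4;
    int slot3 = block + (int)(h3 & 1) + 6;

    IncrementAt(table, slot0, index0);
    IncrementAt(table, slot1, index1);
    IncrementAt(table, slot2, index2);
    IncrementAt(table, slot3, index3);
}

// Bump one 4-bit counter inside table[i], saturating at 15.
private static void IncrementAt(long[] table, int i, int counterIndex)
{
    int offset = counterIndex << 2;
    long mask = 0xFL << offset;
    if ((table[i] & mask) != mask)
    {
        table[i] += 1L << offset;
    }
}
```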
Totally - all this is more for my education, and at worst I feel it is noise in a practical setting. There were significant improvements to the JIT in .NET 8/9: with your unrolled change, the non-vectorized code is faster than my vectorized code in these benchmarks on the newer JITs (I will verify this on more CPUs). .NET 6 is now out of support, so it is likely irrelevant, and I expect it is better to use your non-vectorized code on x64 in all cases. ARM intrinsics may still be a touch faster on Apple M-series CPUs - I saw greater gains there even on .NET 9. The AVX code path was my first attempt at using hardware intrinsics. The way the AVX increment method is written, I suspect there is a penalty for mixing vector and non-vector registers when the result is written back to the array. I want to try replacing the current approach - a single GatherVector256 load from the 512-bit cache line into one 256-bit register, followed by non-vectorized element stores - with two 256-bit loads and stores, working directly in 256-bit registers end to end. That requires completely rewriting the code, which makes my head hurt, so I haven't tried it yet.
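A minimal sketch of that rewrite, under the assumption that the counters live in a long[] with eight longs per 64-byte block and that the per-counter increments (incLo/incHi) have already been built with the saturation checks applied. The names and structure are hypothetical; the point is only the data movement - two contiguous 256-bit loads and stores instead of a gather plus scalar element write-backs.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical sketch: keep the whole block update in vector registers end to end.
public static unsafe void IncrementBlock(long* table, int block, Vector256<long> incLo, Vector256<long> incHi)
{
    long* p = table + block;

    // Two contiguous 256-bit loads cover the whole 512-bit (8 x long) block.
    Vector256<long> lo = Avx.LoadVector256(p);
    Vector256<long> hi = Avx.LoadVector256(p + 4);

    // Apply the precomputed, saturation-checked counter increments in-register.
    lo = Avx2.Add(lo, incLo);
    hi = Avx2.Add(hi, incHi);

    // Vector stores instead of writing elements back one at a time.
    Avx.Store(p, lo);
    Avx.Store(p + 4, hi);
}
```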
Well, it's really cool stuff. I've not had the fun of trying to write vectorized code, so I can only imagine the headache from what I've read here and in blogs about it. It's also got to be a lot of fun mental gymnastics, though. The upcoming Java API is going to handle all the vector size selection itself, so I'm hoping to be a little less overwhelmed since the code is more generic. I'm really surprised that neither JIT performs the loop unrolling itself. I even tried the Azul commercial JVM, which swaps in LLVM, but its GC algorithm injects memory barriers on every pointer read, which killed throughput, so it was a net loss. I thought our code was quite neat and understandable for a compiler to optimize. The headache I probably had too much fun forcing on myself was optimizing the cache's memory barriers, which is also purely noise performance-wise. I had to refresh myself when a false positive was reported by TSAN. I got a little scared that it was actually making sense to me and I didn't need to review references to reason about it, which given the complexity of coherency makes me feel more likely to be going insane than becoming an expert...
Gymnastics for sure - it will be good to see what's possible with the Java API, how your code turns out, and how smart the Java compiler/JIT is. I'm convinced an expert assembly programmer would know all manner of tricks to make it faster. I need the magic combination of focus time and mental energy to try some more experiments. It's a shame about the memory barriers in Azul - it's a smart idea to plug Java into LLVM. I searched and found a similar project for .NET, but it's not as mature. Your memory barrier optimization tricks were all new to me, cool stuff. It took me some time to fully grasp them, and I wrote stress tests that I ran for weeks on different machines because I was paranoid I had made a dumb mistake (and I had).
Port of ben-manes/caffeine@91a36fb.
On a Skylake CPU, the unrolled block increment is faster than AVX on .NET 8/9:

Frequency estimation is still faster on the AVX code path:

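For context on the read side, a hedged sketch of what a vectorized frequency estimate can look like - this is not the library's actual code, and the table/indexes/shifts parameters are assumed to be precomputed elsewhere. One gather pulls the four candidate counter words, and the estimate is the minimum of the four extracted 4-bit counts.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical sketch of a vectorized count-min style read: gather, extract, take the min.
public static unsafe int EstimateFrequencyAvx(long* table, Vector256<long> indexes, Vector256<long> shifts)
{
    // Gather the four counter words in a single instruction (scale 8 for long elements).
    Vector256<long> words = Avx2.GatherVector256(table, indexes, 8);

    // Shift each word so its 4-bit counter sits in the low bits, then mask it out.
    Vector256<long> counts = Avx2.And(
        Avx2.ShiftRightLogicalVariable(words, shifts.AsUInt64()),
        Vector256.Create(0xFL));

    // Reduce to the minimum of the four counts.
    long c0 = counts.GetElement(0);
    long c1 = counts.GetElement(1);
    long c2 = counts.GetElement(2);
    long c3 = counts.GetElement(3);
    return (int)Math.Min(Math.Min(c0, c1), Math.Min(c2, c3));
}
```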