8343689: AArch64: Optimize MulReduction implementation #23181
base: master
Conversation
Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. It multiplies halves of the source vector using SVE instructions to get to a 128-bit long vector that fits into a SIMD&FP register. After that point, the existing ASIMD implementation is used.

Benchmark results for an AArch64 CPU with SVE support and a 256-bit vector length:

| Benchmark                | (size) | Mode  | Old     | New       | Units  |
|--------------------------|--------|-------|---------|-----------|--------|
| Byte256Vector.MULLanes   | 1024   | thrpt | 502.498 | 10222.717 | ops/ms |
| Double256Vector.MULLanes | 1024   | thrpt | 172.116 | 3130.997  | ops/ms |
| Float256Vector.MULLanes  | 1024   | thrpt | 291.612 | 4164.138  | ops/ms |
| Int256Vector.MULLanes    | 1024   | thrpt | 362.276 | 3717.213  | ops/ms |
| Long256Vector.MULLanes   | 1024   | thrpt | 184.826 | 2054.345  | ops/ms |
| Short256Vector.MULLanes  | 1024   | thrpt | 379.231 | 5716.223  | ops/ms |

Benchmark results for an AArch64 CPU with SVE support and a 512-bit vector length:

| Benchmark                | (size) | Mode  | Old     | New      | Units  |
|--------------------------|--------|-------|---------|----------|--------|
| Byte512Vector.MULLanes   | 1024   | thrpt | 160.129 | 2630.600 | ops/ms |
| Double512Vector.MULLanes | 1024   | thrpt | 51.229  | 1033.284 | ops/ms |
| Float512Vector.MULLanes  | 1024   | thrpt | 84.617  | 1658.400 | ops/ms |
| Int512Vector.MULLanes    | 1024   | thrpt | 109.419 | 1180.310 | ops/ms |
| Long512Vector.MULLanes   | 1024   | thrpt | 69.036  | 704.144  | ops/ms |
| Short512Vector.MULLanes  | 1024   | thrpt | 131.029 | 1629.632 | ops/ms |
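For illustration only (not the patch's HotSpot code): a scalar Java sketch of the idea described above, repeatedly multiplying the upper half of a wide lane array into the lower half until the data fits a 128-bit register's worth of lanes, then finishing with a linear multiply as the existing ASIMD path does. Lane counts and names are hypothetical.

```java
// Illustrative sketch only; not the macro-assembler implementation.
public class MulReduceSketch {
    // Reduce a "wide vector" (e.g. 8 int lanes for 256-bit SVE) to the product of all lanes.
    static int mulReduce(int[] lanes) {
        int[] work = lanes.clone();         // keep the source lanes unmodified
        int width = work.length;
        while (width > 4) {                 // 4 int lanes fit a 128-bit SIMD&FP register
            width /= 2;
            for (int i = 0; i < width; i++) {
                work[i] *= work[i + width]; // multiply halves (the SVE part of the patch)
            }
        }
        int acc = 1;
        for (int i = 0; i < width; i++) {
            acc *= work[i];                 // final 128-bit reduction (existing ASIMD part)
        }
        return acc;
    }

    public static void main(String[] args) {
        System.out.println(mulReduce(new int[]{1, 2, 3, 4, 5, 6, 7, 8})); // prints 40320
    }
}
```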
👋 Welcome back mablakatov! A progress list of the required criteria for merging this PR into master will be added to the body of this pull request.
❗ This change is not yet ready to be integrated.
@mikabl-arm The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
Webrevs
Please provide info about which CPUs are benchmarked. How does this compare with Graviton 4?
This could also be a relevant benchmark:
// Vector reduction multiply for floating-point type with SVE instructions. Multiplies halves of the
// source vector to get to a 128b vector that fits into a SIMD&FP register. After that point ASIMD
// instructions are used.
void C2_MacroAssembler::reduce_mul_fp_gt128b(FloatRegister dst, BasicType bt, FloatRegister fsrc,
Drive-by question:
This is recursive folding: halve the vector and combine the two halves, repeating until done.
What about the linear reduction, is that also implemented somewhere? We need that for vector reductions that come from SuperWord and have a strict-order requirement, to avoid rounding divergences.
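To make the rounding concern concrete, here is a tiny standalone Java example (not from the patch or the discussion) comparing strict left-to-right FP multiplication with the pairwise order produced by recursive folding; the two groupings round at different points, so the results are not guaranteed to be bit-identical.

```java
// Minimal illustration of why strict-order and recursively-folded FP reductions can diverge.
public class FpReductionOrder {
    public static void main(String[] args) {
        float[] v = {0.1f, 0.3f, 0.7f, 1.7f};

        // Strict order: ((v0 * v1) * v2) * v3 -- required after SuperWord auto-vectorization.
        float strict = ((v[0] * v[1]) * v[2]) * v[3];

        // Pairwise (recursive folding): (v0 * v1) * (v2 * v3) -- acceptable for the Vector API.
        float pairwise = (v[0] * v[1]) * (v[2] * v[3]);

        // Rounding happens in a different order, so the last bits may differ.
        System.out.println(strict == pairwise ? "identical for this input" : strict + " vs " + pairwise);
    }
}
```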
We have strictly-ordered intrinsics for add reduction, e.g. jdk/src/hotspot/cpu/aarch64/aarch64_vector.ad, line 2903 in 19399d2:
instruct reduce_addF_sve(vRegF dst_src1, vReg src2) %{
None of Arm64 NEON/SVE/SVE2 has a dedicated mul reduction instruction, so it's implemented recursively where strict ordering isn't required (i.e. for the Vector API). For auto-vectorization we impose _requires_strict_order on MulReductionVFNode and MulReductionVDNode. Although I suspect that we might have missed something, as I see a speedup for VectorReduction2.WithSuperword.doubleMulBig / floatMulBig, which I didn't expect to be the case.
I have the same concern about the order issue as @eme64.
Should we enable this only for the VectorAPI case, which doesn't require strict order?
FP reductions have been disabled for auto-vectorization, please see the following comment: https://github.com/openjdk/jdk/pull/23181/files#diff-edf6d70f65d81dc12a483088e0610f4e059bd40697f242aedfed5c2da7475f1aR130 . You can also check #23181 (comment) to see how the patch affects auto-vectorization performance. The only benchmark that saw a performance uplift on a 256b SVE platform is VectorReduction2.WithSuperword.intMulBig (which is fine since it's an integer benchmark).
Yes, these operations are disabled for SLP. But maybe we could add an assertion in the match rules to check the strict-order flag.
Benchmark results:

Neoverse-V1 (SVE 256-bit)

| Benchmark                | (size) | Mode  | master   | PR        | Units  |
|--------------------------|--------|-------|----------|-----------|--------|
| ByteMaxVector.MULLanes   | 1024   | thrpt | 5447.643 | 11455.535 | ops/ms |
| ShortMaxVector.MULLanes  | 1024   | thrpt | 3388.183 | 7144.301  | ops/ms |
| IntMaxVector.MULLanes    | 1024   | thrpt | 3010.974 | 4911.485  | ops/ms |
| LongMaxVector.MULLanes   | 1024   | thrpt | 1539.137 | 2562.835  | ops/ms |
| FloatMaxVector.MULLanes  | 1024   | thrpt | 1355.551 | 4158.128  | ops/ms |
| DoubleMaxVector.MULLanes | 1024   | thrpt | 1715.854 | 3284.189  | ops/ms |

Fujitsu A64FX (SVE 512-bit)

| Benchmark                | (size) | Mode  | master   | PR       | Units  |
|--------------------------|--------|-------|----------|----------|--------|
| ByteMaxVector.MULLanes   | 1024   | thrpt | 1091.692 | 2887.798 | ops/ms |
| ShortMaxVector.MULLanes  | 1024   | thrpt | 597.008  | 1863.338 | ops/ms |
| IntMaxVector.MULLanes    | 1024   | thrpt | 510.642  | 1348.651 | ops/ms |
| LongMaxVector.MULLanes   | 1024   | thrpt | 468.878  | 878.620  | ops/ms |
| FloatMaxVector.MULLanes  | 1024   | thrpt | 376.284  | 2237.564 | ops/ms |
| DoubleMaxVector.MULLanes | 1024   | thrpt | 431.343  | 1646.792 | ops/ms |
Hi @theRealAph, I've updated the description to reflect the former. As for the latter, nothing changes for Graviton 4: for 128-bit vectors we keep using the existing NEON implementation.
Thank you for pointing this out. I didn't take the effect on auto-vectorization into consideration, though I should have. I'll get back to you on this in a bit.
  %}
  ins_pipe(pipe_slow);
%}

instruct reduce_mulD(vRegD dst, vRegD dsrc, vReg vsrc, vReg tmp) %{
Please consider that reduce_mulF_gt128b and reduce_mulD_gt128b might be similar enough that they should be combined in the same way as other patterns in this file.
Fix reduce_mul_integral_gt128b() so it doesn't modify vsrc. With this change, the result of recursive folding is held in vtmp1. To be able to pass this intermediate result to reduce_mul_integral_le128b(), we would have to use another temporary FloatRegister, as vtmp1 would essentially act as vsrc. It's possible to get around this, however: reduce_mul_integral_le128b() is modified so that matching vsrc and vtmp2 arguments can be passed. By doing this, we save a temporary register in rules that match to reduce_mul_integral_gt128b().
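A plain-Java analogy of the aliasing trick described above (all names hypothetical, arrays standing in for vector registers; this is not the HotSpot macro-assembler code): the narrow helper is written so its scratch buffer may alias its source, which lets the wide routine hand over its own folding buffer for both roles instead of needing an extra temporary.

```java
// Hypothetical analogy only.
public class AliasingSketch {
    // "le128b"-style helper: written so that tmp may alias src.
    static int reduceNarrow(int[] src, int[] tmp) {
        if (tmp != src) System.arraycopy(src, 0, tmp, 0, 4); // copy only when they don't alias
        return tmp[0] * tmp[1] * tmp[2] * tmp[3];
    }

    public static void main(String[] args) {
        int[] vtmp1 = {5, 12, 21, 32}; // stands in for the folded intermediate of the wide step
        // The "gt128b"-style caller can pass its intermediate as both "vsrc" and "vtmp2",
        // saving one temporary register in the rules that match to it.
        System.out.println(reduceNarrow(vtmp1, vtmp1)); // prints 40320
    }
}
```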
Apologies to all of the reviewers, but it seems that I won't have time to address the highlighted issues until May/June. I'm converting the PR to draft for now.
@mikabl-arm this pull request can not be integrated into master due to merge conflicts. To resolve them and update this pull request, run the following commands in the local repository for your personal fork:

git checkout 8343689
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push
@mikabl-arm This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply issue a
- fixup: disable FP mul reduction auto-vectorization for all targets
- fixup: add a tmp vReg to reduce_mul_integral_gt128b and reduce_non_strict_order_mul_fp_gt128b to keep vsrc unmodified
- cleanup: replace a complex lambda in the above methods with a loop
- cleanup: rename symbols to follow the existing naming convention
- cleanup: add asserts to SVE-only instructions
- split mul FP reduction instructions into strictly-ordered (default) and explicitly non strictly-ordered
- remove redundant conditions in TestVectorFPReduction.java

Benchmark results: Neoverse-V1 (SVE 256-bit)

| Benchmark                 | Before   | After    | Units  | Diff  |
|---------------------------|----------|----------|--------|-------|
| ByteMaxVector.MULLanes    | 619.156  | 9884.578 | ops/ms | 1496% |
| DoubleMaxVector.MULLanes  | 184.693  | 2712.051 | ops/ms | 1368% |
| FloatMaxVector.MULLanes   | 277.818  | 3388.038 | ops/ms | 1119% |
| IntMaxVector.MULLanes     | 371.225  | 4765.434 | ops/ms | 1183% |
| LongMaxVector.MULLanes    | 205.149  | 2672.975 | ops/ms | 1203% |
| ShortMaxVector.MULLanes   | 472.804  | 5122.917 | ops/ms | 984%  |
This patch improves mul reduction VectorAPI performance on SVE targets with 256-bit or wider vectors. This comment also provides performance numbers for NEON / SVE 128-bit platforms that aren't expected to benefit from these implementations, and for auto-vectorization benchmarks, on:
- Neoverse N1 (NEON): auto-vectorization, VectorAPI
- Neoverse V1 (SVE 256-bit): auto-vectorization, VectorAPI
- Neoverse V2 (SVE 128-bit): auto-vectorization, VectorAPI
Thank you for the review! There are a couple more nits I've missed, I'll submit an update to resolve them shortly.
Co-authored-by: Hao Sun <[email protected]>
Hi. This PR involves changes to {Int Mul Reduction, FP Mul Reduction} x {auto-vectorization, VectorAPI}. After an offline discussion with @XiaohongGong, we have one question about the impact of this PR on FP Mul Reduction + auto-vectorization. Here is the change before and after this PR in whether FP Mul Reduction + auto-vectorization is on or off.

case-1 and case-2

Background: case-1 was set off after @fg1417's patch 8275275: AArch64: Fix performance regression after auto-vectorization on NEON. But case-2 was not touched. Limitation of SLP as mentioned in @fg1417's patch
Performance data in this PR on case-2: From your provided test data on
Discussion: should we enable case-1 and case-2?
It would be appreciated if @eme64 or @fg1417 could provide more inputs.

case-3

Status: this PR adds rules
Our suggestion: we're not sure if it's profitable to enable case-3. Could you help do more test on
Expected result:
@mikabl-arm @XiaohongGong I'm a little busy these weeks before going on vacation, so I won't have time to look into this more deeply. However, I do plan to remove the auto-vectorization restrictions for simple reductions. You can already now disable the (bad) reduction heuristic, using

It would also be nice to actually find a benchmark where float add/mul reductions lead to a speedup with vectorization. So far I have not seen any example in my benchmarks: #25387. If you find any such example, please let me know ;)

I don't have access to any SVE machines, so I cannot help you there, unfortunately. Is this helpful to you?
Thanks for your input @eme64! It's really helpful to me. And it would be the right direction to use the cost model to guide whether vectorizing an FP mul reduction is profitable or not. With this, I think the backend check of auto-vectorization for such operations can be removed safely. We can rely on the SLP's analysis. BTW, the current profitability heuristics can help disable auto-vectorization for the simple cases while enabling the complex ones. This is also helpful to us. I tested the performance of
This shifts MulReduction performance on Neoverse V1 a bit. Here, Before is before this specific commit (ebad6dd) and After is this commit.

| Benchmark                | Before (ops/ms) | After (ops/ms) | Diff (%) |
| ------------------------ | --------------- | -------------- | -------- |
| ByteMaxVector.MULLanes   | 9883.151        | 9093.557       | -7.99%   |
| DoubleMaxVector.MULLanes | 2712.674        | 2607.367       | -3.89%   |
| FloatMaxVector.MULLanes  | 3388.811        | 3291.429       | -2.88%   |
| IntMaxVector.MULLanes    | 4765.554        | 5031.741       | +5.58%   |
| LongMaxVector.MULLanes   | 2685.228        | 2896.445       | +7.88%   |
| ShortMaxVector.MULLanes  | 5128.185        | 5197.656       | +1.35%   |
@XiaohongGong , @shqking , @eme64 , Thank you all for the insightful and detailed comments! I really appreciate the effort to explore the performance implications of auto-vectorization cases. I agree it would be helpful if @fg1417 could join this discussion. However, before diving deeper, I’d like to clarify the problem statement as we see it. I've also updated the JBS ticket accordingly, and I’m citing the key part here for visibility:
This PR does not target improvements in auto-vectorization. In the context of auto-vectorization, the scope of this PR is limited to maintaining correctness and avoiding regressions.

@shqking, regarding the case-2 that you highlighted - I believe this change is incidental. Prior to the patch,

This PR cannot leave
Here are performance numbers for Neoverse V1 with the auto-vectorization restriction in

| Benchmark                                      | Before (ns/op) | After (ns/op) | Diff (%) |
|:-----------------------------------------------|---------------:|--------------:|:---------|
| VectorReduction.WithSuperword.mulRedD          |        401.679 |       401.704 | ~        |
| VectorReduction2.WithSuperword.doubleMulBig    |       2365.554 |      7294.706 | +208.37% |
| VectorReduction2.WithSuperword.doubleMulSimple |       2321.154 |      2321.207 | ~        |
| VectorReduction2.WithSuperword.floatMulBig     |       2356.006 |      2648.334 | +12.41%  |
| VectorReduction2.WithSuperword.floatMulSimple  |       2321.018 |      2321.135 | ~        |

Given that:

I suggest:
@shqking Sorry for joining the discussion a bit late.

The patch 8275275: AArch64: Fix performance regression after auto-vectorization on NEON was intended to fix a regression on

The reason I mentioned this SLP limitation in my previous patch was to clarify why the benchmark cases were written the way they were, and why I chose more complex cases instead of simpler reductions like

@XiaohongGong thanks for testing on

I'm looking forward to having a cost model in place, so we can safely remove these restrictions and enable SLP to handle these scenarios more flexibly.
I'm fine with removing the strictly-ordered rules and disabling these operations for SLP, since it does not benefit performance. Thanks for your testing and updating!
Node* post_loop_reduction = ReductionNode::make(sopc, nullptr, init, last_accumulator, bt);
Node* post_loop_reduction = ReductionNode::make(sopc, nullptr, init, last_accumulator, bt,
                                                /* requires_strict_order */ false);
Why do you change this? Before it required strict order, but now it is false.
IIUC, it's a correction here. As noted by the function name move_unordered_reduction_out_of_loop() and the comment before this function, an unordered reduction is expected to be generated. Hence, we should specify /* requires_strict_order */ false.
@@ -1992,15 +1992,18 @@ void C2_MacroAssembler::neon_reduce_add_integral(Register dst, BasicType bt,

// Vector reduction multiply for integral type with ASIMD instructions.
// Note: temporary registers vtmp1 and vtmp2 are not used in some cases.
// Note: vsrc and vtmp2 may match.
I left a comment in this "resolved comment thread" several days ago. See https://github.com/openjdk/jdk/pull/23181/files#r2179185158. It might be overlooked since the whole conversation was marked as resolved already.
I personally think we should not allow vsrc and vtmp2 to match.
@eme64 Thanks for your input. It's very helpful to us.
I agree with your suggestion.
Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. It multiplies halves of the source vector using SVE instructions to get to a 128-bit long vector that fits into a SIMD&FP register. After that point, existing ASIMD implementation is used.
Nothing changes for <= 128-bit vectors; for those, the existing ASIMD implementation is still used directly.
The benchmarks below are from panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation. To the best of my knowledge, openjdk/jdk is missing VectorAPI reduction micro-benchmarks.
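Roughly, the MULLanes benchmarks exercise a Vector API mul reduction of the following shape. This is a minimal sketch assuming the standard jdk.incubator.vector API; the class and method names here are illustrative, not the actual panama-vector benchmark source.

```java
// Compile/run with --add-modules jdk.incubator.vector (incubator module).
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class MulLanesSketch {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float mulLanes(float[] a) {
        float product = 1.0f;
        int i = 0;
        for (; i <= a.length - SPECIES.length(); i += SPECIES.length()) {
            // reduceLanes(MUL) is the operation that maps to the reduce_mul rules on SVE targets.
            product *= FloatVector.fromArray(SPECIES, a, i).reduceLanes(VectorOperators.MUL);
        }
        for (; i < a.length; i++) {
            product *= a[i]; // scalar tail
        }
        return product;
    }

    public static void main(String[] args) {
        float[] a = new float[1024];
        java.util.Arrays.fill(a, 1.0001f);
        System.out.println(mulLanes(a));
    }
}
```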
Benchmark results:
Neoverse-V1 (SVE 256-bit)
Fujitsu A64FX (SVE 512-bit)
Progress
Issue
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/23181/head:pull/23181
$ git checkout pull/23181
Update a local copy of the PR:
$ git checkout pull/23181
$ git pull https://git.openjdk.org/jdk.git pull/23181/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 23181
View PR using the GUI difftool:
$ git pr show -t 23181
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/23181.diff
Using Webrev
Link to Webrev Comment