
8343689: AArch64: Optimize MulReduction implementation #23181


Open
wants to merge 13 commits into master

Conversation

@mikabl-arm (Contributor) commented Jan 17, 2025

Add an SVE specialization of the reduce_mul intrinsic for vectors that are 256 bits or longer. It multiplies the halves of the source vector using SVE instructions until the intermediate result fits into a 128-bit SIMD&FP register; from that point on, the existing ASIMD implementation is used.

Nothing changes for vectors of 128 bits or shorter: for those, the existing ASIMD implementation is still used directly.
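As an illustration of the strategy, here is a scalar Java model only (the real change is SVE/ASIMD code emitted by C2; the lane counts are assumptions chosen for the example):

  // Fold the upper half into the lower half until the data fits a 128-bit register,
  // then finish with a plain reduction, mirroring the existing ASIMD step.
  static float mulReduceModel(float[] lanes) {     // lanes.length: total vector lanes, power of two
      int n = lanes.length;
      final int lanesPer128b = 4;                  // four 32-bit floats per SIMD&FP register
      float[] v = lanes.clone();
      while (n > lanesPer128b) {                   // the new SVE part: pairwise multiply the halves
          n /= 2;
          for (int i = 0; i < n; i++) {
              v[i] *= v[i + n];
          }
      }
      float acc = 1.0f;                            // the existing 128-bit (ASIMD) part
      for (int i = 0; i < n; i++) {
          acc *= v[i];
      }
      return acc;
  }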

The benchmarks below are from panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation. To the best of my knowledge, openjdk/jdk is missing VectorAPI reduction micro-benchmarks.
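For reference, the MULLanes kernels in those benchmarks boil down to roughly the following shape (a simplified sketch, not the exact benchmark code; the harness, array sizes and accumulation details differ, and the float variant stands in for all element types):

  import jdk.incubator.vector.FloatVector;
  import jdk.incubator.vector.VectorOperators;
  import jdk.incubator.vector.VectorSpecies;

  class MulLanesSketch {
      static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_MAX;

      // Assumes a.length is a multiple of SPECIES.length().
      static float mulLanes(float[] a) {
          float r = 1f;
          for (int i = 0; i < a.length; i += SPECIES.length()) {
              FloatVector av = FloatVector.fromArray(SPECIES, a, i);
              r *= av.reduceLanes(VectorOperators.MUL);   // the reduction this PR accelerates on SVE
          }
          return r;
      }
  }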

Benchmarks results:

Neoverse-V1 (SVE 256-bit)

  Benchmark                 (size)   Mode   master         PR  Units
  ByteMaxVector.MULLanes      1024  thrpt 5447.643  11455.535 ops/ms
  ShortMaxVector.MULLanes     1024  thrpt 3388.183   7144.301 ops/ms
  IntMaxVector.MULLanes       1024  thrpt 3010.974   4911.485 ops/ms
  LongMaxVector.MULLanes      1024  thrpt 1539.137   2562.835 ops/ms
  FloatMaxVector.MULLanes     1024  thrpt 1355.551   4158.128 ops/ms
  DoubleMaxVector.MULLanes    1024  thrpt 1715.854   3284.189 ops/ms

Fujitsu A64FX (SVE 512-bit):

  Benchmark                 (size)   Mode   master         PR  Units
  ByteMaxVector.MULLanes      1024  thrpt 1091.692   2887.798 ops/ms
  ShortMaxVector.MULLanes     1024  thrpt  597.008   1863.338 ops/ms
  IntMaxVector.MULLanes       1024  thrpt  510.642   1348.651 ops/ms
  LongMaxVector.MULLanes      1024  thrpt  468.878    878.620 ops/ms
  FloatMaxVector.MULLanes     1024  thrpt  376.284   2237.564 ops/ms
  DoubleMaxVector.MULLanes    1024  thrpt  431.343   1646.792 ops/ms

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8343689: AArch64: Optimize MulReduction implementation (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/23181/head:pull/23181
$ git checkout pull/23181

Update a local copy of the PR:
$ git checkout pull/23181
$ git pull https://git.openjdk.org/jdk.git pull/23181/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 23181

View PR using the GUI difftool:
$ git pr show -t 23181

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/23181.diff

Using Webrev

Link to Webrev Comment

Add a reduce_mul intrinsic SVE specialization for >= 256-bit long
vectors. It multiplies halves of the source vector using SVE
instructions to get to a 128-bit long vector that fits into a SIMD&FP
register. After that point, existing ASIMD implementation is used.

Benchmarks results for an AArch64 CPU with support for SVE with 256-bit
vector length:

  Benchmark                 (size)   Mode      Old        New  Units
  Byte256Vector.MULLanes      1024  thrpt  502.498  10222.717 ops/ms
  Double256Vector.MULLanes    1024  thrpt  172.116   3130.997 ops/ms
  Float256Vector.MULLanes     1024  thrpt  291.612   4164.138 ops/ms
  Int256Vector.MULLanes       1024  thrpt  362.276   3717.213 ops/ms
  Long256Vector.MULLanes      1024  thrpt  184.826   2054.345 ops/ms
  Short256Vector.MULLanes     1024  thrpt  379.231   5716.223 ops/ms

Benchmarks results for an AArch64 CPU with support for SVE with 512-bit
vector length:

  Benchmark                 (size)   Mode      Old       New   Units
  Byte512Vector.MULLanes      1024  thrpt  160.129  2630.600  ops/ms
  Double512Vector.MULLanes    1024  thrpt   51.229  1033.284  ops/ms
  Float512Vector.MULLanes     1024  thrpt   84.617  1658.400  ops/ms
  Int512Vector.MULLanes       1024  thrpt  109.419  1180.310  ops/ms
  Long512Vector.MULLanes      1024  thrpt   69.036   704.144  ops/ms
  Short512Vector.MULLanes     1024  thrpt  131.029  1629.632  ops/ms
@bridgekeeper (bot) commented Jan 17, 2025

👋 Welcome back mablakatov! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk (bot) commented Jan 17, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk bot added the rfr (Pull request is ready for review) label Jan 17, 2025
@openjdk (bot) commented Jan 17, 2025

@mikabl-arm The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@mlbridge (bot) commented Jan 17, 2025

@theRealAph (Contributor)
Please provide info about which CPUs are benchmarked. How does this compare with Graviton 4?

@eme64 (Contributor) commented Feb 4, 2025

This could also be a relevant benchmark:
./test/micro/org/openjdk/bench/vm/compiler/VectorReduction2.java

Comment on lines 2136 to 2139
// Vector reduction multiply for floating-point type with SVE instructions. Multiplies halves of the
// source vector to get to a 128b vector that fits into a SIMD&FP register. After that point ASIMD
// instructions are used.
void C2_MacroAssembler::reduce_mul_fp_gt128b(FloatRegister dst, BasicType bt, FloatRegister fsrc,
Contributor:
Drive-by question:
This is recursive folding: halve the vector and combine the halves that way.

What about the linear reduction, is that also implemented somewhere? We need that for vector reduction when we come from SuperWord and have a strict-order requirement, to avoid rounding divergences.

Contributor Author:

We have strictly-ordered intrinsics for add reduction:

instruct reduce_addF_sve(vRegF dst_src1, vReg src2) %{

None of Arm64 NEON/SVE/SVE2 has a dedicated mul reduction instruction, so it's implemented recursively where strict ordering isn't required (i.e., for the Vector API). For auto-vectorization we impose _requires_strict_order on MulReductionVFNode and MulReductionVDNode. Although I suspect we might have missed something, as I see a speedup for VectorReduction2.WithSuperword.doubleMulBig / floatMulBig, which I didn't expect.
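To make the ordering point concrete, here is an illustration in plain Java (not HotSpot code): the two evaluation orders discussed in this thread can produce different floating-point results, which is why the strictly-ordered form matters for SuperWord. An extreme case, where an intermediate product overflows:

  static void orderDemo() {
      float[] lanes = {1e30f, 1e30f, 1e-30f, 1e-30f};

      float strict = 1f;                     // strict (source) order, required for auto-vectorized FP reductions
      for (float v : lanes) strict *= v;     // 1e30f * 1e30f overflows to Infinity

      // recursive folding (upper half multiplied into the lower half), as used for the Vector API
      float folded = (lanes[0] * lanes[2]) * (lanes[1] * lanes[3]);   // roughly 1.0f

      System.out.println(strict + " vs " + folded);   // Infinity vs ~1.0
  }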

Contributor:

I have the same concern as @eme64 about the ordering issue.
Should we enable this only for the VectorAPI case, which doesn't require strict order?

Contributor Author:

FP reductions have been disabled for auto-vectorization, please see the following comment: https://github.com/openjdk/jdk/pull/23181/files#diff-edf6d70f65d81dc12a483088e0610f4e059bd40697f242aedfed5c2da7475f1aR130 . You can also check #23181 (comment) to see how the patch affects auto-vectorization performance. The only benchmark that saw a performance uplift on a 256-bit SVE platform is VectorReduction2.WithSuperword.intMulBig (which is fine since it's an integer benchmark).


Yes, these operations are disabled for SLP. But maybe we could add an assertion in the match rules to check the strict-order flag.

@mikabl-arm (Contributor Author)

Please provide info about which CPUs are benchmarked. How does this compare with Graviton 4?

Hi @theRealAph, I've updated the description to reflect the former.

As for the latter, nothing changes for Graviton 4: for 128-bit vectors we keep using the existing NEON implementation.
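For completeness, the vector width the Vector API sees on a given part can be checked from Java; a minimal sketch (assuming the jdk.incubator.vector module is added at launch):

  import jdk.incubator.vector.FloatVector;

  public class VectorWidthProbe {
      public static void main(String[] args) {
          // Bit size of the widest float species the Vector API offers on this CPU,
          // e.g. 128 on NEON-only or 128-bit SVE parts such as Graviton 4, 256 on Neoverse V1.
          System.out.println(FloatVector.SPECIES_MAX.vectorBitSize());
      }
  }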

@mikabl-arm (Contributor Author)

This could also be a relevant Benchmark:
./test/micro/org/openjdk/bench/vm/compiler/VectorReduction2.java

Thank you for pointing this out. I didn't take the effect on auto-vectorization into consideration, though I should have. I'll follow up on this in a bit.

%}
ins_pipe(pipe_slow);
%}

instruct reduce_mulD(vRegD dst, vRegD dsrc, vReg vsrc, vReg tmp) %{
Contributor:

Please consider that reduce_mulF_gt128b and reduce_mulD_gt128b might be similar enough that they should be combined in the same way as other patterns in this file.

mikabl-arm and others added 2 commits February 25, 2025 15:26
Fix reduce_mul_integral_gt128b() so it doesn't modify vsrc. With this
change, the result of recursive folding is held in vtmp1. To be able to
pass this intermediate result to reduce_mul_integral_le128b(), we would
have to use another temporary FloatRegister, as vtmp1 would essentially
act as vsrc. It's possible to get around this however:
reduce_mul_integral_le128b() is modified so it's possible to pass
matching vsrc and vtmp2 arguments. By doing this, we save ourselves a
temporary register in rules that match to reduce_mul_integral_gt128b().

@mikabl-arm (Contributor Author)

Apologies to all of the reviewers, but it seems that I won't have time to address the highlighted issues until May/June. I'm converting the PR to draft for now.

@mikabl-arm marked this pull request as draft March 21, 2025 13:03
@openjdk (bot) commented Mar 21, 2025

@mikabl-arm this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout 8343689
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk bot added the merge-conflict (Pull request has merge conflict with target branch) label and removed the rfr (Pull request is ready for review) label Mar 21, 2025
@bridgekeeper (bot) commented May 16, 2025

@mikabl-arm This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

mikabl-arm and others added 2 commits June 30, 2025 11:39
- fixup: disable FP mul reduction auto-vectorization for all targets
- fixup: add a tmp vReg to reduce_mul_integral_gt128b and
  reduce_non_strict_order_mul_fp_gt128b to keep vsrc unmodified
- cleanup: replace a complex lambda in the above methods with a loop
- cleanup: rename symbols to follow the existing naming convention
- cleanup: add asserts to SVE only instructions
- split mul FP reduction instructions into strictly-ordered (default)
  and explicitly non strictly-ordered
- remove redundant conditions in TestVectorFPReduction.java

Benchmarks results:

Neoverse-V1 (SVE 256-bit)

| Benchmark                 | Before   | After    | Units  | Diff  |
|---------------------------|----------|----------|--------|-------|
| ByteMaxVector.MULLanes    | 619.156  | 9884.578 | ops/ms | 1496% |
| DoubleMaxVector.MULLanes  | 184.693  | 2712.051 | ops/ms | 1368% |
| FloatMaxVector.MULLanes   | 277.818  | 3388.038 | ops/ms | 1119% |
| IntMaxVector.MULLanes     | 371.225  | 4765.434 | ops/ms | 1183% |
| LongMaxVector.MULLanes    | 205.149  | 2672.975 | ops/ms | 1203% |
| ShortMaxVector.MULLanes   | 472.804  | 5122.917 | ops/ms |  984% |
@openjdk bot removed the merge-conflict (Pull request has merge conflict with target branch) label Jun 30, 2025
@mikabl-arm (Contributor Author)

This patch improves the performance of mul reduction VectorAPI operations on SVE targets with 256-bit or wider vectors. This comment also provides performance numbers for NEON / SVE 128-bit platforms, which aren't expected to benefit from these implementations, and for auto-vectorization benchmarks.

Neoverse N1 (NEON)

Auto-vectorization

| Benchmark       | Before   | After    | Units | Diff |
|-----------------|----------|----------|-------|------|
| mulRedD         | 739.699  | 740.884  | ns/op | ~    |
| byteAddBig      | 2670.248 | 2670.562 | ns/op | ~    |
| byteAddSimple   | 1639.796 | 1639.940 | ns/op | ~    |
| byteMulBig      | 2707.900 | 2708.063 | ns/op | ~    |
| byteMulSimple   | 2452.939 | 2452.906 | ns/op | ~    |
| charAddBig      | 2772.363 | 2772.269 | ns/op | ~    |
| charAddSimple   | 1639.867 | 1639.751 | ns/op | ~    |
| charMulBig      | 2796.533 | 2796.375 | ns/op | ~    |
| charMulSimple   | 2453.034 | 2453.004 | ns/op | ~    |
| doubleAddBig    | 2943.613 | 2936.897 | ns/op | ~    |
| doubleAddSimple | 1635.031 | 1634.797 | ns/op | ~    |
| doubleMulBig    | 3001.937 | 3003.240 | ns/op | ~    |
| doubleMulSimple | 2448.154 | 2448.117 | ns/op | ~    |
| floatAddBig     | 2963.086 | 2962.215 | ns/op | ~    |
| floatAddSimple  | 1634.987 | 1634.798 | ns/op | ~    |
| floatMulBig     | 3022.442 | 3021.356 | ns/op | ~    |
| floatMulSimple  | 2447.976 | 2448.091 | ns/op | ~    |
| intAddBig       | 832.346  | 832.382  | ns/op | ~    |
| intAddSimple    | 841.276  | 841.287  | ns/op | ~    |
| intMulBig       | 1245.155 | 1245.095 | ns/op | ~    |
| intMulSimple    | 1638.762 | 1638.826 | ns/op | ~    |
| longAddBig      | 4924.541 | 4924.328 | ns/op | ~    |
| longAddSimple   | 841.623  | 841.625  | ns/op | ~    |
| longMulBig      | 9848.954 | 9848.807 | ns/op | ~    |
| longMulSimple   | 3427.169 | 3427.279 | ns/op | ~    |
| shortAddBig     | 2670.027 | 2670.345 | ns/op | ~    |
| shortAddSimple  | 1639.869 | 1639.876 | ns/op | ~    |
| shortMulBig     | 2750.812 | 2750.562 | ns/op | ~    |
| shortMulSimple  | 2453.030 | 2452.937 | ns/op | ~    |
VectorAPI

| Benchmark                | Before   | After    | Units  | Diff |
|--------------------------|----------|----------|--------|------|
| ByteMaxVector.MULLanes   | 3935.178 | 3935.776 | ops/ms | ~    |
| DoubleMaxVector.MULLanes | 971.911  | 973.142  | ops/ms | ~    |
| FloatMaxVector.MULLanes  | 1196.405 | 1196.222 | ops/ms | ~    |
| IntMaxVector.MULLanes    | 1218.301 | 1218.224 | ops/ms | ~    |
| LongMaxVector.MULLanes   | 541.793  | 541.805  | ops/ms | ~    |
| ShortMaxVector.MULLanes  | 2332.916 | 2428.970 | ops/ms | 4%   |

Neoverse V1 (SVE 256-bit)

Auto-vectorization

| Benchmark       | Before   | After    | Units | Diff |
|-----------------|----------|----------|-------|------|
| mulRedD         | 401.696  | 401.699  | ns/op | ~    |
| byteAddBig      | 2365.921 | 2365.726 | ns/op | ~    |
| byteAddSimple   | 1569.524 | 1583.595 | ns/op | ~    |
| byteMulBig      | 2368.362 | 2369.144 | ns/op | ~    |
| byteMulSimple   | 2357.183 | 2356.961 | ns/op | ~    |
| charAddBig      | 2262.944 | 2262.851 | ns/op | ~    |
| charAddSimple   | 1569.399 | 1568.549 | ns/op | ~    |
| charMulBig      | 2365.594 | 2365.540 | ns/op | ~    |
| charMulSimple   | 2353.000 | 2356.285 | ns/op | ~    |
| doubleAddBig    | 1640.613 | 1640.747 | ns/op | ~    |
| doubleAddSimple | 1549.028 | 1549.056 | ns/op | ~    |
| doubleMulBig    | 2352.374 | 2365.366 | ns/op | ~    |
| doubleMulSimple | 2321.318 | 2321.273 | ns/op | ~    |
| floatAddBig     | 1078.672 | 1078.641 | ns/op | ~    |
| floatAddSimple  | 1549.075 | 1549.028 | ns/op | ~    |
| floatMulBig     | 2351.251 | 2355.657 | ns/op | ~    |
| floatMulSimple  | 2321.234 | 2321.205 | ns/op | ~    |
| intAddBig       | 225.647  | 225.631  | ns/op | ~    |
| intAddSimple    | 789.430  | 789.409  | ns/op | ~    |
| intMulBig       | 785.971  | 403.520  | ns/op | -49% |
| intMulSimple    | 1569.131 | 1569.542 | ns/op | ~    |
| longAddBig      | 819.702  | 819.898  | ns/op | ~    |
| longAddSimple   | 789.597  | 789.573  | ns/op | ~    |
| longMulBig      | 2460.433 | 2465.883 | ns/op | ~    |
| longMulSimple   | 1560.933 | 1560.738 | ns/op | ~    |
| shortAddBig     | 2268.769 | 2268.879 | ns/op | ~    |
| shortAddSimple  | 1569.829 | 1577.502 | ns/op | ~    |
| shortMulBig     | 2368.849 | 2369.381 | ns/op | ~    |
| shortMulSimple  | 2353.986 | 2353.620 | ns/op | ~    |

VectorAPI

| Benchmark                | Before  | After    | Units  | Diff  |
|--------------------------|---------|----------|--------|-------|
| ByteMaxVector.MULLanes   | 619.156 | 9884.578 | ops/ms | 1496% |
| DoubleMaxVector.MULLanes | 184.693 | 2712.051 | ops/ms | 1368% |
| FloatMaxVector.MULLanes  | 277.818 | 3388.038 | ops/ms | 1119% |
| IntMaxVector.MULLanes    | 371.225 | 4765.434 | ops/ms | 1183% |
| LongMaxVector.MULLanes   | 205.149 | 2672.975 | ops/ms | 1203% |
| ShortMaxVector.MULLanes  | 472.804 | 5122.917 | ops/ms | 984%  |

Neoverse V2 (SVE 128-bit)

Auto-vectorization

| Benchmark       | Before   | After    | Units | Diff |
|-----------------|----------|----------|-------|------|
| mulRedD         | 326.590  | 326.367  | ns/op | ~    |
| byteAddBig      | 1889.745 | 1894.973 | ns/op | ~    |
| byteAddSimple   | 1251.112 | 1255.026 | ns/op | ~    |
| byteMulBig      | 1891.615 | 1896.814 | ns/op | ~    |
| byteMulSimple   | 1871.912 | 1873.334 | ns/op | ~    |
| charAddBig      | 1892.921 | 1894.729 | ns/op | ~    |
| charAddSimple   | 1260.088 | 1260.200 | ns/op | ~    |
| charMulBig      | 1895.881 | 1892.268 | ns/op | ~    |
| charMulSimple   | 1871.443 | 1877.403 | ns/op | ~    |
| doubleAddBig    | 1325.652 | 1323.546 | ns/op | ~    |
| doubleAddSimple | 1229.101 | 1232.291 | ns/op | ~    |
| doubleMulBig    | 1872.655 | 1873.624 | ns/op | ~    |
| doubleMulSimple | 1843.787 | 1842.049 | ns/op | ~    |
| floatAddBig     | 1093.144 | 1093.687 | ns/op | ~    |
| floatAddSimple  | 1229.396 | 1229.058 | ns/op | ~    |
| floatMulBig     | 1862.449 | 1873.624 | ns/op | ~    |
| floatMulSimple  | 1841.839 | 1846.539 | ns/op | ~    |
| intAddBig       | 316.076  | 316.111  | ns/op | ~    |
| intAddSimple    | 629.235  | 630.857  | ns/op | ~    |
| intMulBig       | 615.185  | 616.652  | ns/op | ~    |
| intMulSimple    | 1258.883 | 1262.365 | ns/op | ~    |
| longAddBig      | 1145.601 | 1146.965 | ns/op | ~    |
| longAddSimple   | 633.978  | 634.034  | ns/op | ~    |
| longMulBig      | 1834.331 | 1854.264 | ns/op | ~    |
| longMulSimple   | 1264.152 | 1261.659 | ns/op | ~    |
| shortAddBig     | 1889.645 | 1890.173 | ns/op | ~    |
| shortAddSimple  | 1251.094 | 1250.808 | ns/op | ~    |
| shortMulBig     | 1893.699 | 1895.171 | ns/op | ~    |
| shortMulSimple  | 1871.791 | 1876.445 | ns/op | ~    |
VectorAPI

| Benchmark                | Before   | After    | Units  | Diff |
|--------------------------|----------|----------|--------|------|
| ByteMaxVector.MULLanes   | 7210.809 | 7229.774 | ops/ms | ~    |
| DoubleMaxVector.MULLanes | 1333.230 | 1330.399 | ops/ms | ~    |
| FloatMaxVector.MULLanes  | 1762.932 | 1767.859 | ops/ms | ~    |
| IntMaxVector.MULLanes    | 3690.901 | 3699.748 | ops/ms | ~    |
| LongMaxVector.MULLanes   | 1994.725 | 1991.539 | ops/ms | ~    |
| ShortMaxVector.MULLanes  | 4648.878 | 4669.074 | ops/ms | ~    |

@mikabl-arm (Contributor Author) left a comment

Thank you for the review! There are a couple more nits I've missed; I'll submit an update to resolve them shortly.

@mikabl-arm marked this pull request as ready for review June 30, 2025 13:21
@openjdk bot added the rfr (Pull request is ready for review) label Jun 30, 2025
@shqking (Contributor) commented Jul 3, 2025

Hi. This PR involves changes to {Int Mul Reduction, FP Mul Reduction} x {auto-vectorization, VectorAPI}. After the offline discussion with @XiaohongGong, we have one question about the impact of this PR on FP Mul Reduction + auto-vectorization.

The table below lists whether FP Mul Reduction + auto-vectorization is enabled before and after this PR.

| Check  | Condition                              | Before | After |
|--------|----------------------------------------|--------|-------|
| case-1 | UseSVE=0                               | off    | off   |
| case-2 | UseSVE>0 and length_in_bytes = 8 or 16 | on     | off   |
| case-3 | UseSVE>0 and length_in_bytes > 16      | off    | off   |

case-1 and case-2

Background: case-1 was set off after @fg1417's patch 8275275: AArch64: Fix performance regression after auto-vectorization on NEON. But case-2 was not touched.
We are not sure about the reason. Was there no 128-bit SVE machine then? Or was there some limitation of SLP on reductions?

Limitation of SLP as mentioned in @fg1417's patch:

Because superword doesn't vectorize reductions unconnected with other vector packs.

Performance data in this PR on case-2: from your test data on Neoverse V2 (SVE 128-bit), Auto-vectorization section, there is no obvious performance change on the FP Mul Reduction benchmarks (float|double)Mul(Big|Simple).
When we checked the generated code of floatMul(Big|Simple) on an Nvidia Grace machine (128-bit SVE2), we found that before this PR:

  • floatMulBig is vectorized.
  • floatMulSimple is not vectorized because SLP determines that there is no profit.

Discussion: should we enable case-1 and case-2?

  • Should they be enabled once the SLP limitation on reductions is fixed?
  • If there is no such limitation, we may consider enabling case-1 and case-2 because a) there is no perf regression, at least based on the current performance results, and b) it may provide more auto-vectorization opportunities for other packs inside the loop.

It would be appreciated if @eme64 or @fg1417 could provide more inputs.

case-3

Status: this PR adds rules reduce_mulF_gt128b and reduce_mulD_gt128b but these two rules are not selected. See the comment from Xiaohong.

Our suggestion: we're not sure if it's profitable to enable case-3. Could you help do more testing on Neoverse V1 (SVE 256-bit)? Note that a local change should be made to enable case-3, e.g. removing these lines.

Expected result:

  • If there is performance gain, we may consider enabling case-3 for auto-vectorization.
  • If there is no performance gain, we suggest removing these two match rules because they are dead code.

@eme64 (Contributor) commented Jul 3, 2025

@mikabl-arm @XiaohongGong I'm a little busy these weeks before going on vacation, so I won't have time to look into this more deeply.

However, I do plan to remove the auto-vectorization restrictions for simple reductions.
https://bugs.openjdk.org/browse/JDK-8307516

You can already now disable the (bad) reduction heuristic, using AutoVectorizationOverrideProfitability.
https://bugs.openjdk.org/browse/JDK-8357530
I published benchmark results there:
#25387
You can see that enabling simple reductions is in most cases actually profitable now. But float/double add and mul have strict reduction order, and that usually prevents vectorization from being profitable. The strict-order vector reduction is quite expensive, and it only becomes beneficial if there is a lot of other code in the loop that can be vectorized. Soon, I plan to add a cost-model, so that we can predict if vectorization is profitable.

It would also be nice to actually find a benchmark where float add/mul reductions lead to a speedup with vectorization. So far I have not seen any example in my benchmarks: #25387. If you find any such example, please let me know ;)

I don't have access to any SVE machines, so I cannot help you there, unfortunately.

Is this helpful to you?

@XiaohongGong

You can see that enabling simple reductions is in most cases actually profitable now. But float/double add and mul have strict reduction order, and that usually prevents vectorization from being profitable. The strict-order vector reduction is quite expensive, and it only becomes beneficial if there is a lot of other code in the loop that can be vectorized. Soon, I plan to add a cost-model, so that we can predict if vectorization is profitable.

It would also be nice to actually find a benchmark where float add/mul reductions lead to a speedup with vectorization. So far I have not seen any example in my benchmarks: #25387 If you find any such example, please let me know ;)

I don't have access to any SVE machines, so I cannot help you there, unfortunately.

Is this helpful to you?

Thanks for your input @eme64! It's really helpful to me. Using the cost model to guide whether vectorizing FP mul reduction is profitable would be the right direction. With this, I think the backend auto-vectorization check for such operations can be removed safely. We can rely on SLP's analysis.

BTW, the current profitability heuristic can help by disabling auto-vectorization for the simple cases while enabling the complex ones. This is also helpful to us.

I tested the performance of VectorReduction2 with/without auto-vectorization for FP mul reductions on my SVE 128-bit machine. The performance difference is not very significant for either floatMulSimple or floatMulBig. But I guess the performance change would be different with auto-vectorization on HW with larger vector sizes. As we do not have SVE machines with larger vector sizes either, we may need help from @mikabl-arm! If the performance of floatMulBig is improved with auto-vectorization, I think we can remove the limitation on such reductions for auto-vectorization on AArch64.

This shifts MulReduction performance on Neoverse V1 a bit. Here, Before is before this specific commit (ebad6dd) and After is this commit.

| Benchmark                | Before (ops/ms) | After (ops/ms) | Diff (%) |
| ------------------------ | --------------- | -------------- | -------- |
| ByteMaxVector.MULLanes   | 9883.151        | 9093.557       | -7.99%   |
| DoubleMaxVector.MULLanes | 2712.674        | 2607.367       | -3.89%   |
| FloatMaxVector.MULLanes  | 3388.811        | 3291.429       | -2.88%   |
| IntMaxVector.MULLanes    | 4765.554        | 5031.741       | +5.58%   |
| LongMaxVector.MULLanes   | 2685.228        | 2896.445       | +7.88%   |
| ShortMaxVector.MULLanes  | 5128.185        | 5197.656       | +1.35%   |
@mikabl-arm (Contributor Author)

@XiaohongGong , @shqking , @eme64 ,

Thank you all for the insightful and detailed comments! I really appreciate the effort to explore the performance implications of auto-vectorization cases. I agree it would be helpful if @fg1417 could join this discussion. However, before diving deeper, I’d like to clarify the problem statement as we see it. I've also updated the JBS ticket accordingly, and I’m citing the key part here for visibility:

To clarify, the goal of this ticket is to improve the performance of mul reduction VectorAPI operations on SVE-capable platforms with vector lengths greater than 128 bits (e.g., Neoverse V1). The core issue is that these APIs are not being lowered to any AArch64 implementation at all on such platforms. Instead, the fallback Java implementation is used.

This PR does not target improvements in auto-vectorization. In the context of auto-vectorization, the scope of this PR is limited to maintaining correctness and avoiding regressions.

@shqking , regarding the case-2 that you highlighted - I believe this change is incidental. Prior to the patch, Matcher::match_rule_supported_auto_vectorization() returned false for NEON platforms (as expected) and true for 128-bit SVE. This behavior is misleading because HotSpot currently uses the same scalar mul reduction implementation for both NEON and SVE platforms. Since this implementation is unprofitable on both, it should have been disabled across the board. @fg1417, please correct me if I’m mistaken.

This PR cannot leave Matcher::match_rule_supported_auto_vectorization() unchanged. If it did, HotSpot would select the strictly-ordered FP vector reduction implementation, which is not performant. A more efficient SVE-based implementation can't be used due to the strict ordering requirement.

@XiaohongGong ,

But I guess the performance change would be different with auto-vectorization on HWs with larger vector size. As we do not have the SVE machines with larger vector size as well, we may need help from @mikabl-arm !

Here are performance numbers for Neoverse V1 with the auto-vectorization restriction in Matcher::match_rule_supported_auto_vectorization() lifted (After). The linear strictly-ordered SVE implementation matched this way was later removed by 4593a5d.

| Benchmark                                      |   Before (ns/op) |   After (ns/op) | Diff (%)   |
|:-----------------------------------------------|-----------------:|----------------:|:-----------|
| VectorReduction.WithSuperword.mulRedD          |          401.679 |         401.704 | ~          |
| VectorReduction2.WithSuperword.doubleMulBig    |         2365.554 |        7294.706 | +208.37%   |
| VectorReduction2.WithSuperword.doubleMulSimple |         2321.154 |        2321.207 | ~          |
| VectorReduction2.WithSuperword.floatMulBig     |         2356.006 |        2648.334 | +12.41%    |
| VectorReduction2.WithSuperword.floatMulSimple  |         2321.018 |        2321.135 | ~          |

Given that:

  • this PR focuses on VectorAPI and not on auto-vectorization,
  • and it does not introduce regressions in auto-vectorization performance,

I suggest:

  • continuing the discussion on auto-vectorization separately on hotspot-dev, including @fg1417 in the loop;
  • moving forward with resolving the remaining VectorAPI issues and merging this PR.

@fg1417 commented Jul 10, 2025

Background: case-1 was set off after @fg1417 's patch 8275275: AArch64: Fix performance regression after auto-vectorization on NEON. But case-2 was not touched. We are not sure about the reason. There was no 128b SVE machine then? Or there was some limitation of SLP on reduction?

Limitation of SLP as mentioned in @fg1417 's patch

Because superword doesn't vectorize reductions unconnected with other vector packs,

Performance data in this PR on case-2: From your provided test data on Neoverse V2 (SVE 128-bit). Auto-vectorization section, there is no obvious performance change on FP Mul Reduction benchmarks (float|double)Mul(Big|Simple). As we checked the generated code of floatMul(Big|Simple) on Nvidia Grace machine(128b SVE2), we found that before this PR:

  • floatMulBig is vectorized.
  • floatMulSimple is not vectorized because SLP determines that there is no profit.

Discussion: should we enable case-1 and case-2?

  • if the SLP limitation on reductions is fixed?
  • If there is no such limitation, we may consider enable case-1 and case-2 because a) there is perf regression at least based on current performance results and b) it may provide more auto-vectorization opportunities for other packs inside the loop.

It would be appreciated if @eme64 or @fg1417 could provide more inputs.

@shqking Sorry for joining the discussion a bit late.

The patch 8275275: AArch64: Fix performance regression after auto-vectorization on NEON was intended to fix a regression on NEON machines, while keeping the behaviour unchanged on SVE machines, which may be a source of confusion now.

The reason I mentioned this SLP limitation in my previous patch was to clarify why the benchmark cases were written the way they were, and why I chose more complex cases instead of simpler reductions like floatMulSimple.
The rationale was that if a case like floatMulBig doesn’t show any performance gain, then a simpler case like floatMulSimple is even less likely to benefit. In general, more complex reduction cases are more likely to benefit from auto-vectorization.
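For readers without the benchmark at hand, the distinction is roughly the following (a simplified sketch, not the exact VectorReduction2.java code; method and array names are made up for the example):

  // "Simple": the reduction is the only work in the loop.
  static float floatMulSimpleSketch(float[] a) {
      float acc = 1f;
      for (int i = 0; i < a.length; i++) {
          acc *= a[i];
      }
      return acc;
  }

  // "Big": extra element-wise work per iteration that can be vectorized,
  // making it more likely that vectorizing the whole loop, reduction included, pays off.
  static float floatMulBigSketch(float[] a, float[] b, float[] c) {
      float acc = 1f;
      for (int i = 0; i < a.length; i++) {
          float t = a[i] * b[i] + c[i];
          acc *= t;
      }
      return acc;
  }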

@XiaohongGong thanks for testing on a 128-bit SVE machine. Since the performance difference is not significant for either floatMulSimple or floatMulBig with/without auto-vectorization, and there is a performance drop with auto-vectorization on the 256-bit SVE machine reported by @mikabl-arm, it seems reasonable that it should also be disabled on SVE.

I'm looking forward to having a cost model in place, so we can safely remove these restrictions and enable SLP to handle these scenarios more flexibly.

@XiaohongGong

@XiaohongGong , @shqking , @eme64 ,

Thank you all for the insightful and detailed comments! I really appreciate the effort to explore the performance implications of auto-vectorization cases. I agree it would be helpful if @fg1417 could join this discussion. However, before diving deeper, I’d like to clarify the problem statement as we see it. I've also updated the JBS ticket accordingly, and I’m citing the key part here for visibility:

To clarify, the goal of this ticket is to improve the performance of mul reduction VectorAPI operations on SVE-capable platforms with vector lengths greater than 128 bits (e.g., Neoverse V1). The core issue is that these APIs are not being lowered to any AArch64 implementation at all on such platforms. Instead, the fallback Java implementation is used.

This PR does not target improvements in auto-vectorization. In the context of auto-vectorization, the scope of this PR is limited to maintaining correctness and avoiding regressions.

@shqking , regarding the case-2 that you highlighted - I believe this change is incidental. Prior to the patch, Matcher::match_rule_supported_auto_vectorization() returned false for NEON platforms (as expected) and true for 128-bit SVE. This behavior is misleading because HotSpot currently uses the same scalar mul reduction implementation for both NEON and SVE platforms. Since this implementation is unprofitable on both, it should have been disabled across the board. @fg1417, please correct me if I’m mistaken.

This PR cannot leave Matcher::match_rule_supported_auto_vectorization() unchanged. If we do, HotSpot will select the strictly-ordered FP vector reduction implementation, which is not performant. A more efficient SVE-based implementation can't be used due to the strict ordering requirement.

@XiaohongGong ,

But I guess the performance change would be different with auto-vectorization on HWs with larger vector size. As we do not have the SVE machines with larger vector size as well, we may need help from @mikabl-arm !

Here are performance numbers for Neoverse V1 with the auto-vectorization restriction in Matcher::match_rule_supported_auto_vectorization() lifted (After). The linear strictly-ordered SVE implementation matched this way was later removed by 4593a5d.

| Benchmark                                      |   Before (ns/op) |   After (ns/op) | Diff (%)   |
|:-----------------------------------------------|-----------------:|----------------:|:-----------|
| VectorReduction.WithSuperword.mulRedD          |          401.679 |         401.704 | ~          |
| VectorReduction2.WithSuperword.doubleMulBig    |         2365.554 |        7294.706 | +208.37%   |
| VectorReduction2.WithSuperword.doubleMulSimple |         2321.154 |        2321.207 | ~          |
| VectorReduction2.WithSuperword.floatMulBig     |         2356.006 |        2648.334 | +12.41%    |
| VectorReduction2.WithSuperword.floatMulSimple  |         2321.018 |        2321.135 | ~          |

Given that:

  • this PR focuses on VectorAPI and not on auto-vectorization,
  • and it does not introduce regressions in auto-vectorization performance,

I suggest:

  • continuing the discussion on auto-vectorization separately on hotspot-dev, including @fg1417 in the loop;
  • moving forward with resolving the remaining VectorAPI issues and merging this PR.

I'm fine with removing the strict-ordered rules and disabling these operations for SLP since they do not benefit performance. Thanks for your testing and updating!

Comment on lines -4714 to +4715
Node* post_loop_reduction = ReductionNode::make(sopc, nullptr, init, last_accumulator, bt);
Node* post_loop_reduction = ReductionNode::make(sopc, nullptr, init, last_accumulator, bt,
/* requires_strict_order */ false);

Why did you change this? Before, it required strict order, but now it is false.

Contributor:

IIUC, it's a correction here.

As indicated by the function name move_unordered_reduction_out_of_loop() and the comment before this function, an unordered reduction is expected to be generated. Hence, we should specify /* requires_strict_order */ false.


@@ -1992,15 +1992,18 @@ void C2_MacroAssembler::neon_reduce_add_integral(Register dst, BasicType bt,

// Vector reduction multiply for integral type with ASIMD instructions.
// Note: temporary registers vtmp1 and vtmp2 are not used in some cases.
// Note: vsrc and vtmp2 may match.
Contributor:

I left a comment in this "resolved comment thread" several days ago. See https://github.com/openjdk/jdk/pull/23181/files#r2179185158. It might have been overlooked since the whole conversation was marked as resolved already.

I personally think we should not allow vsrc and vtmp2 to match.

@shqking (Contributor) commented Jul 11, 2025

@eme64 Thanks for your input. It's very helpful to us.
@fg1417 Thanks for your clarification on case-2 as I mentioned earlier.
@mikabl-arm Thanks for providing the performance data on the Neoverse-V1 machine.

Given that:

  • this PR focuses on VectorAPI and not on auto-vectorization,
  • and it does not introduce regressions in auto-vectorization performance,

I suggest:

  • continuing the discussion on auto-vectorization separately on hotspot-dev, including @fg1417 in the loop;
  • moving forward with resolving the remaining VectorAPI issues and merging this PR.

I agree with your suggestion.

Labels: hotspot-compiler, rfr (Pull request is ready for review)