
8343689: AArch64: Optimize MulReduction implementation #23181


Open
wants to merge 13 commits into master

Conversation

@mikabl-arm (Contributor) commented Jan 17, 2025

Add an SVE specialization of the reduce_mul intrinsic for vectors that are 256 bits or longer. It multiplies the halves of the source vector using SVE instructions until the intermediate result fits into a 128-bit SIMD&FP register; from that point on, the existing ASIMD implementation is used.

Nothing changes for vectors of 128 bits or shorter: for those, the existing ASIMD implementation is still used directly.
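As an illustration of the strategy, here is a scalar Java model only (the real change is SVE/ASIMD code emitted by C2; the lane counts are assumptions chosen for the example):

  // Fold the upper half into the lower half until the data fits a 128-bit register,
  // then finish with a plain reduction, mirroring the existing ASIMD step.
  static float mulReduceModel(float[] lanes) {     // lanes.length: total vector lanes, power of two
      int n = lanes.length;
      final int lanesPer128b = 4;                  // four 32-bit floats per SIMD&FP register
      float[] v = lanes.clone();
      while (n > lanesPer128b) {                   // the new SVE part: pairwise multiply the halves
          n /= 2;
          for (int i = 0; i < n; i++) {
              v[i] *= v[i + n];
          }
      }
      float acc = 1.0f;                            // the existing 128-bit (ASIMD) part
      for (int i = 0; i < n; i++) {
          acc *= v[i];
      }
      return acc;
  }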

The benchmarks below are from panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation. To the best of my knowledge, openjdk/jdk is missing VectorAPI reduction micro-benchmarks.
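For reference, the MULLanes kernels in those benchmarks boil down to roughly the following shape (a simplified sketch, not the exact benchmark code; the harness, array sizes and accumulation details differ, and the float variant stands in for all element types):

  import jdk.incubator.vector.FloatVector;
  import jdk.incubator.vector.VectorOperators;
  import jdk.incubator.vector.VectorSpecies;

  class MulLanesSketch {
      static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_MAX;

      // Assumes a.length is a multiple of SPECIES.length().
      static float mulLanes(float[] a) {
          float r = 1f;
          for (int i = 0; i < a.length; i += SPECIES.length()) {
              FloatVector av = FloatVector.fromArray(SPECIES, a, i);
              r *= av.reduceLanes(VectorOperators.MUL);   // the reduction this PR accelerates on SVE
          }
          return r;
      }
  }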

Benchmarks results:

Neoverse-V1 (SVE 256-bit)

  Benchmark                 (size)   Mode   master         PR  Units
  ByteMaxVector.MULLanes      1024  thrpt 5447.643  11455.535 ops/ms
  ShortMaxVector.MULLanes     1024  thrpt 3388.183   7144.301 ops/ms
  IntMaxVector.MULLanes       1024  thrpt 3010.974   4911.485 ops/ms
  LongMaxVector.MULLanes      1024  thrpt 1539.137   2562.835 ops/ms
  FloatMaxVector.MULLanes     1024  thrpt 1355.551   4158.128 ops/ms
  DoubleMaxVector.MULLanes    1024  thrpt 1715.854   3284.189 ops/ms

Fujitsu A64FX (SVE 512-bit):

  Benchmark                 (size)   Mode   master         PR  Units
  ByteMaxVector.MULLanes      1024  thrpt 1091.692   2887.798 ops/ms
  ShortMaxVector.MULLanes     1024  thrpt  597.008   1863.338 ops/ms
  IntMaxVector.MULLanes       1024  thrpt  510.642   1348.651 ops/ms
  LongMaxVector.MULLanes      1024  thrpt  468.878    878.620 ops/ms
  FloatMaxVector.MULLanes     1024  thrpt  376.284   2237.564 ops/ms
  DoubleMaxVector.MULLanes    1024  thrpt  431.343   1646.792 ops/ms

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8343689: AArch64: Optimize MulReduction implementation (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/23181/head:pull/23181
$ git checkout pull/23181

Update a local copy of the PR:
$ git checkout pull/23181
$ git pull https://git.openjdk.org/jdk.git pull/23181/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 23181

View PR using the GUI difftool:
$ git pr show -t 23181

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/23181.diff

Using Webrev

Link to Webrev Comment

Add a reduce_mul intrinsic SVE specialization for >= 256-bit long
vectors. It multiplies halves of the source vector using SVE
instructions to get to a 128-bit long vector that fits into a SIMD&FP
register. After that point, existing ASIMD implementation is used.

Benchmarks results for an AArch64 CPU with support for SVE with 256-bit
vector length:

  Benchmark                 (size)   Mode      Old        New  Units
  Byte256Vector.MULLanes      1024  thrpt  502.498  10222.717 ops/ms
  Double256Vector.MULLanes    1024  thrpt  172.116   3130.997 ops/ms
  Float256Vector.MULLanes     1024  thrpt  291.612   4164.138 ops/ms
  Int256Vector.MULLanes       1024  thrpt  362.276   3717.213 ops/ms
  Long256Vector.MULLanes      1024  thrpt  184.826   2054.345 ops/ms
  Short256Vector.MULLanes     1024  thrpt  379.231   5716.223 ops/ms

Benchmarks results for an AArch64 CPU with support for SVE with 512-bit
vector length:

  Benchmark                 (size)   Mode      Old       New   Units
  Byte512Vector.MULLanes      1024  thrpt  160.129  2630.600  ops/ms
  Double512Vector.MULLanes    1024  thrpt   51.229  1033.284  ops/ms
  Float512Vector.MULLanes     1024  thrpt   84.617  1658.400  ops/ms
  Int512Vector.MULLanes       1024  thrpt  109.419  1180.310  ops/ms
  Long512Vector.MULLanes      1024  thrpt   69.036   704.144  ops/ms
  Short512Vector.MULLanes     1024  thrpt  131.029  1629.632  ops/ms
@bridgekeeper (bot) commented Jan 17, 2025

👋 Welcome back mablakatov! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk (bot) commented Jan 17, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk bot added the rfr (Pull request is ready for review) label Jan 17, 2025
@openjdk (bot) commented Jan 17, 2025

@mikabl-arm The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@mlbridge (bot) commented Jan 17, 2025

@theRealAph (Contributor)
Please provide info about which CPUs are benchmarked. How does this compare with Graviton 4?

@eme64 (Contributor) commented Feb 4, 2025

This could also be a relevant benchmark:
./test/micro/org/openjdk/bench/vm/compiler/VectorReduction2.java

Comment on lines 2136 to 2139
// Vector reduction multiply for floating-point type with SVE instructions. Multiplies halves of the
// source vector to get to a 128b vector that fits into a SIMD&FP register. After that point ASIMD
// instructions are used.
void C2_MacroAssembler::reduce_mul_fp_gt128b(FloatRegister dst, BasicType bt, FloatRegister fsrc,
Contributor:
Drive-by question:
This is recursive folding: halve the vector and combine the halves that way.

What about the linear reduction, is that also implemented somewhere? We need that for vector reduction when we come from SuperWord and have a strict-order requirement, to avoid rounding divergences.

Contributor Author:

We have strictly-ordered intrinsics for add reduction:

instruct reduce_addF_sve(vRegF dst_src1, vReg src2) %{

None of Arm64 NEON/SVE/SVE2 has a dedicated mul reduction instruction, so it's implemented recursively where strict ordering isn't required (i.e., for the Vector API). For auto-vectorization we impose _requires_strict_order on MulReductionVFNode and MulReductionVDNode. Although I suspect we might have missed something, as I see a speedup for VectorReduction2.WithSuperword.doubleMulBig / floatMulBig, which I didn't expect.
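To make the ordering point concrete, here is an illustration in plain Java (not HotSpot code): the two evaluation orders discussed in this thread can produce different floating-point results, which is why the strictly-ordered form matters for SuperWord. An extreme case, where an intermediate product overflows:

  static void orderDemo() {
      float[] lanes = {1e30f, 1e30f, 1e-30f, 1e-30f};

      float strict = 1f;                     // strict (source) order, required for auto-vectorized FP reductions
      for (float v : lanes) strict *= v;     // 1e30f * 1e30f overflows to Infinity

      // recursive folding (upper half multiplied into the lower half), as used for the Vector API
      float folded = (lanes[0] * lanes[2]) * (lanes[1] * lanes[3]);   // roughly 1.0f

      System.out.println(strict + " vs " + folded);   // Infinity vs ~1.0
  }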

Contributor:

I have the same concern as @eme64 about the ordering issue.
Should we enable this only for the VectorAPI case, which doesn't require strict order?

Contributor Author:

FP reductions have been disabled for auto-vectorization, please see the following comment: https://github.com/openjdk/jdk/pull/23181/files#diff-edf6d70f65d81dc12a483088e0610f4e059bd40697f242aedfed5c2da7475f1aR130 . You can also check #23181 (comment) to see how the patch affects auto-vectorization performance. The only benchmark that saw a performance uplift on a 256-bit SVE platform is VectorReduction2.WithSuperword.intMulBig (which is fine since it's an integer benchmark).


Yes, these operations are disabled for SLP. But maybe we could add an assertion in the match rules to check the strict-order flag.

@mikabl-arm (Contributor Author)

Please provide info about which CPUs are benchmarked. How does this compare with Graviton 4?

Hi @theRealAph, I've updated the description to reflect the former.

As for the latter, nothing changes for Graviton 4: for 128-bit vectors we keep using the existing NEON implementation.
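For completeness, the vector width the Vector API sees on a given part can be checked from Java; a minimal sketch (assuming the jdk.incubator.vector module is added at launch):

  import jdk.incubator.vector.FloatVector;

  public class VectorWidthProbe {
      public static void main(String[] args) {
          // Bit size of the widest float species the Vector API offers on this CPU,
          // e.g. 128 on NEON-only or 128-bit SVE parts such as Graviton 4, 256 on Neoverse V1.
          System.out.println(FloatVector.SPECIES_MAX.vectorBitSize());
      }
  }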

@mikabl-arm (Contributor Author)

This could also be a relevant Benchmark:
./test/micro/org/openjdk/bench/vm/compiler/VectorReduction2.java

Thank you for pointing this out. I didn't take the effect on auto-vectorization into consideration, though I should have. I'll follow up on this in a bit.

%}
ins_pipe(pipe_slow);
%}

instruct reduce_mulD(vRegD dst, vRegD dsrc, vReg vsrc, vReg tmp) %{
Contributor:

Please consider that reduce_mulF_gt128b and reduce_mulD_gt128b might be similar enough that they should be combined in the same way as other patterns in this file.

mikabl-arm and others added 2 commits February 25, 2025 15:26
Fix reduce_mul_integral_gt128b() so it doesn't modify vsrc. With this
change, the result of recursive folding is held in vtmp1. To be able to
pass this intermediate result to reduce_mul_integral_le128b(), we would
have to use another temporary FloatRegister, as vtmp1 would essentially
act as vsrc. It's possible to get around this however:
reduce_mul_integral_le128b() is modified so it's possible to pass
matching vsrc and vtmp2 arguments. By doing this, we save ourselves a
temporary register in rules that match to reduce_mul_integral_gt128b().

@mikabl-arm (Contributor Author)

Apologies to all of the reviewers, but it seems that I won't have time to address the highlighted issues until May/June. I'm converting the PR to draft for now.

@mikabl-arm marked this pull request as draft March 21, 2025 13:03
@openjdk (bot) commented Mar 21, 2025

@mikabl-arm this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout 8343689
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk bot added the merge-conflict (Pull request has merge conflict with target branch) label and removed the rfr (Pull request is ready for review) label Mar 21, 2025
@bridgekeeper (bot) commented May 16, 2025

@mikabl-arm This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

mikabl-arm and others added 2 commits June 30, 2025 11:39
- fixup: disable FP mul reduction auto-vectorization for all targets
- fixup: add a tmp vReg to reduce_mul_integral_gt128b and
  reduce_non_strict_order_mul_fp_gt128b to keep vsrc unmodified
- cleanup: replace a complex lambda in the above methods with a loop
- cleanup: rename symbols to follow the existing naming convention
- cleanup: add asserts to SVE only instructions
- split mul FP reduction instructions into strictly-ordered (default)
  and explicitly non strictly-ordered
- remove redundant conditions in TestVectorFPReduction.java

Benchmarks results:

Neoverse-V1 (SVE 256-bit)

| Benchmark                 | Before   | After    | Units  | Diff  |
|---------------------------|----------|----------|--------|-------|
| ByteMaxVector.MULLanes    | 619.156  | 9884.578 | ops/ms | 1496% |
| DoubleMaxVector.MULLanes  | 184.693  | 2712.051 | ops/ms | 1368% |
| FloatMaxVector.MULLanes   | 277.818  | 3388.038 | ops/ms | 1119% |
| IntMaxVector.MULLanes     | 371.225  | 4765.434 | ops/ms | 1183% |
| LongMaxVector.MULLanes    | 205.149  | 2672.975 | ops/ms | 1203% |
| ShortMaxVector.MULLanes   | 472.804  | 5122.917 | ops/ms |  984% |
@openjdk bot removed the merge-conflict (Pull request has merge conflict with target branch) label Jun 30, 2025
@mikabl-arm (Contributor Author)

This patch improves the performance of mul reduction VectorAPI operations on SVE targets with 256-bit or wider vectors. This comment also provides performance numbers for NEON / SVE 128-bit platforms, which aren't expected to benefit from these implementations, and for auto-vectorization benchmarks.

Neoverse N1 (NEON)

Auto-vectorization

| Benchmark       | Before   | After    | Units | Diff |
|-----------------|----------|----------|-------|------|
| mulRedD         | 739.699  | 740.884  | ns/op | ~    |
| byteAddBig      | 2670.248 | 2670.562 | ns/op | ~    |
| byteAddSimple   | 1639.796 | 1639.940 | ns/op | ~    |
| byteMulBig      | 2707.900 | 2708.063 | ns/op | ~    |
| byteMulSimple   | 2452.939 | 2452.906 | ns/op | ~    |
| charAddBig      | 2772.363 | 2772.269 | ns/op | ~    |
| charAddSimple   | 1639.867 | 1639.751 | ns/op | ~    |
| charMulBig      | 2796.533 | 2796.375 | ns/op | ~    |
| charMulSimple   | 2453.034 | 2453.004 | ns/op | ~    |
| doubleAddBig    | 2943.613 | 2936.897 | ns/op | ~    |
| doubleAddSimple | 1635.031 | 1634.797 | ns/op | ~    |
| doubleMulBig    | 3001.937 | 3003.240 | ns/op | ~    |
| doubleMulSimple | 2448.154 | 2448.117 | ns/op | ~    |
| floatAddBig     | 2963.086 | 2962.215 | ns/op | ~    |
| floatAddSimple  | 1634.987 | 1634.798 | ns/op | ~    |
| floatMulBig     | 3022.442 | 3021.356 | ns/op | ~    |
| floatMulSimple  | 2447.976 | 2448.091 | ns/op | ~    |
| intAddBig       | 832.346  | 832.382  | ns/op | ~    |
| intAddSimple    | 841.276  | 841.287  | ns/op | ~    |
| intMulBig       | 1245.155 | 1245.095 | ns/op | ~    |
| intMulSimple    | 1638.762 | 1638.826 | ns/op | ~    |
| longAddBig      | 4924.541 | 4924.328 | ns/op | ~    |
| longAddSimple   | 841.623  | 841.625  | ns/op | ~    |
| longMulBig      | 9848.954 | 9848.807 | ns/op | ~    |
| longMulSimple   | 3427.169 | 3427.279 | ns/op | ~    |
| shortAddBig     | 2670.027 | 2670.345 | ns/op | ~    |
| shortAddSimple  | 1639.869 | 1639.876 | ns/op | ~    |
| shortMulBig     | 2750.812 | 2750.562 | ns/op | ~    |
| shortMulSimple  | 2453.030 | 2452.937 | ns/op | ~    |
VectorAPI

| Benchmark                | Before   | After    | Units  | Diff |
|--------------------------|----------|----------|--------|------|
| ByteMaxVector.MULLanes   | 3935.178 | 3935.776 | ops/ms | ~    |
| DoubleMaxVector.MULLanes | 971.911  | 973.142  | ops/ms | ~    |
| FloatMaxVector.MULLanes  | 1196.405 | 1196.222 | ops/ms | ~    |
| IntMaxVector.MULLanes    | 1218.301 | 1218.224 | ops/ms | ~    |
| LongMaxVector.MULLanes   | 541.793  | 541.805  | ops/ms | ~    |
| ShortMaxVector.MULLanes  | 2332.916 | 2428.970 | ops/ms | 4%   |

Neoverse V1 (SVE 256-bit)

Auto-vectorization

| Benchmark       | Before   | After    | Units | Diff |
|-----------------|----------|----------|-------|------|
| mulRedD         | 401.696  | 401.699  | ns/op | ~    |
| byteAddBig      | 2365.921 | 2365.726 | ns/op | ~    |
| byteAddSimple   | 1569.524 | 1583.595 | ns/op | ~    |
| byteMulBig      | 2368.362 | 2369.144 | ns/op | ~    |
| byteMulSimple   | 2357.183 | 2356.961 | ns/op | ~    |
| charAddBig      | 2262.944 | 2262.851 | ns/op | ~    |
| charAddSimple   | 1569.399 | 1568.549 | ns/op | ~    |
| charMulBig      | 2365.594 | 2365.540 | ns/op | ~    |
| charMulSimple   | 2353.000 | 2356.285 | ns/op | ~    |
| doubleAddBig    | 1640.613 | 1640.747 | ns/op | ~    |
| doubleAddSimple | 1549.028 | 1549.056 | ns/op | ~    |
| doubleMulBig    | 2352.374 | 2365.366 | ns/op | ~    |
| doubleMulSimple | 2321.318 | 2321.273 | ns/op | ~    |
| floatAddBig     | 1078.672 | 1078.641 | ns/op | ~    |
| floatAddSimple  | 1549.075 | 1549.028 | ns/op | ~    |
| floatMulBig     | 2351.251 | 2355.657 | ns/op | ~    |
| floatMulSimple  | 2321.234 | 2321.205 | ns/op | ~    |
| intAddBig       | 225.647  | 225.631  | ns/op | ~    |
| intAddSimple    | 789.430  | 789.409  | ns/op | ~    |
| intMulBig       | 785.971  | 403.520  | ns/op | -49% |
| intMulSimple    | 1569.131 | 1569.542 | ns/op | ~    |
| longAddBig      | 819.702  | 819.898  | ns/op | ~    |
| longAddSimple   | 789.597  | 789.573  | ns/op | ~    |
| longMulBig      | 2460.433 | 2465.883 | ns/op | ~    |
| longMulSimple   | 1560.933 | 1560.738 | ns/op | ~    |
| shortAddBig     | 2268.769 | 2268.879 | ns/op | ~    |
| shortAddSimple  | 1569.829 | 1577.502 | ns/op | ~    |
| shortMulBig     | 2368.849 | 2369.381 | ns/op | ~    |
| shortMulSimple  | 2353.986 | 2353.620 | ns/op | ~    |

VectorAPI

| Benchmark                | Before  | After    | Units  | Diff  |
|--------------------------|---------|----------|--------|-------|
| ByteMaxVector.MULLanes   | 619.156 | 9884.578 | ops/ms | 1496% |
| DoubleMaxVector.MULLanes | 184.693 | 2712.051 | ops/ms | 1368% |
| FloatMaxVector.MULLanes  | 277.818 | 3388.038 | ops/ms | 1119% |
| IntMaxVector.MULLanes    | 371.225 | 4765.434 | ops/ms | 1183% |
| LongMaxVector.MULLanes   | 205.149 | 2672.975 | ops/ms | 1203% |
| ShortMaxVector.MULLanes  | 472.804 | 5122.917 | ops/ms | 984%  |

Neoverse V2 (SVE 128-bit)

Auto-vectorization

| Benchmark       | Before   | After    | Units | Diff |
|-----------------|----------|----------|-------|------|
| mulRedD         | 326.590  | 326.367  | ns/op | ~    |
| byteAddBig      | 1889.745 | 1894.973 | ns/op | ~    |
| byteAddSimple   | 1251.112 | 1255.026 | ns/op | ~    |
| byteMulBig      | 1891.615 | 1896.814 | ns/op | ~    |
| byteMulSimple   | 1871.912 | 1873.334 | ns/op | ~    |
| charAddBig      | 1892.921 | 1894.729 | ns/op | ~    |
| charAddSimple   | 1260.088 | 1260.200 | ns/op | ~    |
| charMulBig      | 1895.881 | 1892.268 | ns/op | ~    |
| charMulSimple   | 1871.443 | 1877.403 | ns/op | ~    |
| doubleAddBig    | 1325.652 | 1323.546 | ns/op | ~    |
| doubleAddSimple | 1229.101 | 1232.291 | ns/op | ~    |
| doubleMulBig    | 1872.655 | 1873.624 | ns/op | ~    |
| doubleMulSimple | 1843.787 | 1842.049 | ns/op | ~    |
| floatAddBig     | 1093.144 | 1093.687 | ns/op | ~    |
| floatAddSimple  | 1229.396 | 1229.058 | ns/op | ~    |
| floatMulBig     | 1862.449 | 1873.624 | ns/op | ~    |
| floatMulSimple  | 1841.839 | 1846.539 | ns/op | ~    |
| intAddBig       | 316.076  | 316.111  | ns/op | ~    |
| intAddSimple    | 629.235  | 630.857  | ns/op | ~    |
| intMulBig       | 615.185  | 616.652  | ns/op | ~    |
| intMulSimple    | 1258.883 | 1262.365 | ns/op | ~    |
| longAddBig      | 1145.601 | 1146.965 | ns/op | ~    |
| longAddSimple   | 633.978  | 634.034  | ns/op | ~    |
| longMulBig      | 1834.331 | 1854.264 | ns/op | ~    |
| longMulSimple   | 1264.152 | 1261.659 | ns/op | ~    |
| shortAddBig     | 1889.645 | 1890.173 | ns/op | ~    |
| shortAddSimple  | 1251.094 | 1250.808 | ns/op | ~    |
| shortMulBig     | 1893.699 | 1895.171 | ns/op | ~    |
| shortMulSimple  | 1871.791 | 1876.445 | ns/op | ~    |
VectorAPI

| Benchmark                | Before   | After    | Units  | Diff |
|--------------------------|----------|----------|--------|------|
| ByteMaxVector.MULLanes   | 7210.809 | 7229.774 | ops/ms | ~    |
| DoubleMaxVector.MULLanes | 1333.230 | 1330.399 | ops/ms | ~    |
| FloatMaxVector.MULLanes  | 1762.932 | 1767.859 | ops/ms | ~    |
| IntMaxVector.MULLanes    | 3690.901 | 3699.748 | ops/ms | ~    |
| LongMaxVector.MULLanes   | 1994.725 | 1991.539 | ops/ms | ~    |
| ShortMaxVector.MULLanes  | 4648.878 | 4669.074 | ops/ms | ~    |

@mikabl-arm (Contributor Author) left a comment

Thank you for the review! There are a couple more nits I've missed; I'll submit an update to resolve them shortly.

@mikabl-arm marked this pull request as ready for review June 30, 2025 13:21
@openjdk bot added the rfr (Pull request is ready for review) label Jun 30, 2025
@shqking (Contributor) commented Jul 3, 2025

Hi. This PR involves changes to {Int Mul Reduction, FP Mul Reduction} x {auto-vectorization, VectorAPI}. After the offline discussion with @XiaohongGong, we have one question about the impact of this PR on FP Mul Reduction + auto-vectorization.

The table below lists whether FP Mul Reduction + auto-vectorization is enabled before and after this PR.

| Check  | Condition                              | Before | After |
|--------|----------------------------------------|--------|-------|
| case-1 | UseSVE=0                               | off    | off   |
| case-2 | UseSVE>0 and length_in_bytes = 8 or 16 | on     | off   |
| case-3 | UseSVE>0 and length_in_bytes > 16      | off    | off   |

case-1 and case-2

Background: case-1 was set off after @fg1417's patch 8275275: AArch64: Fix performance regression after auto-vectorization on NEON. But case-2 was not touched.
We are not sure about the reason. Was there no 128-bit SVE machine then? Or was there some limitation of SLP on reductions?

Limitation of SLP as mentioned in @fg1417's patch:

Because superword doesn't vectorize reductions unconnected with other vector packs.

Performance data in this PR on case-2: from your test data on Neoverse V2 (SVE 128-bit), Auto-vectorization section, there is no obvious performance change on the FP Mul Reduction benchmarks (float|double)Mul(Big|Simple).
When we checked the generated code of floatMul(Big|Simple) on an Nvidia Grace machine (128-bit SVE2), we found that before this PR:

  • floatMulBig is vectorized.
  • floatMulSimple is not vectorized because SLP determines that there is no profit.

Discussion: should we enable case-1 and case-2?

  • Should they be enabled once the SLP limitation on reductions is fixed?
  • If there is no such limitation, we may consider enabling case-1 and case-2 because a) there is no perf regression, at least based on the current performance results, and b) it may provide more auto-vectorization opportunities for other packs inside the loop.

It would be appreciated if @eme64 or @fg1417 could provide more inputs.

case-3

Status: this PR adds rules reduce_mulF_gt128b and reduce_mulD_gt128b but these two rules are not selected. See the comment from Xiaohong.

Our suggestion: we're not sure if it's profitable to enable case-3. Could you help do more testing on Neoverse V1 (SVE 256-bit)? Note that a local change should be made to enable case-3, e.g. removing these lines.

Expected result:

  • If there is performance gain, we may consider enabling case-3 for auto-vectorization.
  • If there is no performance gain, we suggest removing these two match rules because they are dead code.

@eme64 (Contributor) commented Jul 3, 2025

@mikabl-arm @XiaohongGong I'm a little busy these weeks before going on vacation, so I won't have time to look into this more deeply.

However, I do plan to remove the auto-vectorization restrictions for simple reductions.
https://bugs.openjdk.org/browse/JDK-8307516

You can already now disable the (bad) reduction heuristic, using AutoVectorizationOverrideProfitability.
https://bugs.openjdk.org/browse/JDK-8357530
I published benchmark results there:
#25387
You can see that enabling simple reductions is in most cases actually profitable now. But float/double add and mul have strict reduction order, and that usually prevents vectorization from being profitable. The strict-order vector reduction is quite expensive, and it only becomes beneficial if there is a lot of other code in the loop that can be vectorized. Soon, I plan to add a cost-model, so that we can predict if vectorization is profitable.

It would also be nice to actually find a benchmark where float add/mul reductions lead to a speedup with vectorization. So far I have not seen any example in my benchmarks: #25387. If you find any such example, please let me know ;)

I don't have access to any SVE machines, so I cannot help you there, unfortunately.

Is this helpful to you?

@XiaohongGong

You can see that enabling simple reductions is in most cases actually profitable now. But float/double add and mul have strict reduction order, and that usually prevents vectorization from being profitable. The strict-order vector reduction is quite expensive, and it only becomes beneficial if there is a lot of other code in the loop that can be vectorized. Soon, I plan to add a cost-model, so that we can predict if vectorization is profitable.

It would also be nice to actually find a benchmark where float add/mul reductions lead to a speedup with vectorization. So far I have not seen any example in my benchmarks: #25387 If you find any such example, please let me know ;)

I don't have access to any SVE machines, so I cannot help you there, unfortunately.

Is this helpful to you?

Thanks for your input @eme64! It's really helpful to me. Using the cost model to guide whether vectorizing FP mul reduction is profitable would be the right direction. With this, I think the backend auto-vectorization check for such operations can be removed safely. We can rely on SLP's analysis.

BTW, the current profitability heuristic can help by disabling auto-vectorization for the simple cases while enabling the complex ones. This is also helpful to us.

I tested the performance of VectorReduction2 with/without auto-vectorization for FP mul reductions on my SVE 128-bit machine. The performance difference is not very significant for either floatMulSimple or floatMulBig. But I guess the performance change would be different with auto-vectorization on HW with larger vector sizes. As we do not have SVE machines with larger vector sizes either, we may need help from @mikabl-arm! If the performance of floatMulBig is improved with auto-vectorization, I think we can remove the limitation on such reductions for auto-vectorization on AArch64.

This shifts MulReduction performance on Neoverse V1 a bit. Here, Before is before this specific commit (ebad6dd) and After is this commit.

| Benchmark                | Before (ops/ms) | After (ops/ms) | Diff (%) |
| ------------------------ | --------------- | -------------- | -------- |
| ByteMaxVector.MULLanes   | 9883.151        | 9093.557       | -7.99%   |
| DoubleMaxVector.MULLanes | 2712.674        | 2607.367       | -3.89%   |
| FloatMaxVector.MULLanes  | 3388.811        | 3291.429       | -2.88%   |
| IntMaxVector.MULLanes    | 4765.554        | 5031.741       | +5.58%   |
| LongMaxVector.MULLanes   | 2685.228        | 2896.445       | +7.88%   |
| ShortMaxVector.MULLanes  | 5128.185        | 5197.656       | +1.35%   |
@mikabl-arm (Contributor Author)

@XiaohongGong , @shqking , @eme64 ,

Thank you all for the insightful and detailed comments! I really appreciate the effort to explore the performance implications of auto-vectorization cases. I agree it would be helpful if @fg1417 could join this discussion. However, before diving deeper, I’d like to clarify the problem statement as we see it. I've also updated the JBS ticket accordingly, and I’m citing the key part here for visibility:

To clarify, the goal of this ticket is to improve the performance of mul reduction VectorAPI operations on SVE-capable platforms with vector lengths greater than 128 bits (e.g., Neoverse V1). The core issue is that these APIs are not being lowered to any AArch64 implementation at all on such platforms. Instead, the fallback Java implementation is used.

This PR does not target improvements in auto-vectorization. In the context of auto-vectorization, the scope of this PR is limited to maintaining correctness and avoiding regressions.

@shqking , regarding the case-2 that you highlighted - I believe this change is incidental. Prior to the patch, Matcher::match_rule_supported_auto_vectorization() returned false for NEON platforms (as expected) and true for 128-bit SVE. This behavior is misleading because HotSpot currently uses the same scalar mul reduction implementation for both NEON and SVE platforms. Since this implementation is unprofitable on both, it should have been disabled across the board. @fg1417, please correct me if I’m mistaken.

This PR cannot leave Matcher::match_rule_supported_auto_vectorization() unchanged. If it did, HotSpot would select the strictly-ordered FP vector reduction implementation, which is not performant. A more efficient SVE-based implementation can't be used due to the strict ordering requirement.

@XiaohongGong ,

But I guess the performance change would be different with auto-vectorization on HWs with larger vector size. As we do not have the SVE machines with larger vector size as well, we may need help from @mikabl-arm !

Here are performance numbers for Neoverse V1 with the auto-vectorization restriction in Matcher::match_rule_supported_auto_vectorization() lifted (After). The linear strictly-ordered SVE implementation matched this way was later removed by 4593a5d.

| Benchmark                                      |   Before (ns/op) |   After (ns/op) | Diff (%)   |
|:-----------------------------------------------|-----------------:|----------------:|:-----------|
| VectorReduction.WithSuperword.mulRedD          |          401.679 |         401.704 | ~          |
| VectorReduction2.WithSuperword.doubleMulBig    |         2365.554 |        7294.706 | +208.37%   |
| VectorReduction2.WithSuperword.doubleMulSimple |         2321.154 |        2321.207 | ~          |
| VectorReduction2.WithSuperword.floatMulBig     |         2356.006 |        2648.334 | +12.41%    |
| VectorReduction2.WithSuperword.floatMulSimple  |         2321.018 |        2321.135 | ~          |

Given that:

  • this PR focuses on VectorAPI and not on auto-vectorization,
  • and it does not introduce regressions in auto-vectorization performance,

I suggest:

  • continuing the discussion on auto-vectorization separately on hotspot-dev, including @fg1417 in the loop;
  • moving forward with resolving the remaining VectorAPI issues and merging this PR.

@fg1417 commented Jul 10, 2025

Background: case-1 was set off after @fg1417 's patch 8275275: AArch64: Fix performance regression after auto-vectorization on NEON. But case-2 was not touched. We are not sure about the reason. There was no 128b SVE machine then? Or there was some limitation of SLP on reduction?

Limitation of SLP as mentioned in @fg1417 's patch

Because superword doesn't vectorize reductions unconnected with other vector packs,

Performance data in this PR on case-2: From your provided test data on Neoverse V2 (SVE 128-bit). Auto-vectorization section, there is no obvious performance change on FP Mul Reduction benchmarks (float|double)Mul(Big|Simple). As we checked the generated code of floatMul(Big|Simple) on Nvidia Grace machine(128b SVE2), we found that before this PR:

  • floatMulBig is vectorized.
  • floatMulSimple is not vectorized because SLP determines that there is no profit.

Discussion: should we enable case-1 and case-2?

  • if the SLP limitation on reductions is fixed?
  • If there is no such limitation, we may consider enable case-1 and case-2 because a) there is perf regression at least based on current performance results and b) it may provide more auto-vectorization opportunities for other packs inside the loop.

It would be appreciated if @eme64 or @fg1417 could provide more inputs.

@shqking Sorry for joining the discussion a bit late.

The patch 8275275: AArch64: Fix performance regression after auto-vectorization on NEON was intended to fix a regression on NEON machines, while keeping the behaviour unchanged on SVE machines, which may be a source of confusion now.

The reason I mentioned this SLP limitation in my previous patch was to clarify why the benchmark cases were written the way they were, and why I chose more complex cases instead of simpler reductions like floatMulSimple.
The rationale was that if a case like floatMulBig doesn’t show any performance gain, then a simpler case like floatMulSimple is even less likely to benefit. In general, more complex reduction cases are more likely to benefit from auto-vectorization.
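For readers without the benchmark at hand, the distinction is roughly the following (a simplified sketch, not the exact VectorReduction2.java code; method and array names are made up for the example):

  // "Simple": the reduction is the only work in the loop.
  static float floatMulSimpleSketch(float[] a) {
      float acc = 1f;
      for (int i = 0; i < a.length; i++) {
          acc *= a[i];
      }
      return acc;
  }

  // "Big": extra element-wise work per iteration that can be vectorized,
  // making it more likely that vectorizing the whole loop, reduction included, pays off.
  static float floatMulBigSketch(float[] a, float[] b, float[] c) {
      float acc = 1f;
      for (int i = 0; i < a.length; i++) {
          float t = a[i] * b[i] + c[i];
          acc *= t;
      }
      return acc;
  }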

@XiaohongGong thanks for testing on a 128-bit SVE machine. Since the performance difference is not significant for either floatMulSimple or floatMulBig with/without auto-vectorization, and there is a performance drop with auto-vectorization on the 256-bit SVE machine reported by @mikabl-arm, it seems reasonable that it should also be disabled on SVE.

I'm looking forward to having a cost model in place, so we can safely remove these restrictions and enable SLP to handle these scenarios more flexibly.

@XiaohongGong

@XiaohongGong , @shqking , @eme64 ,

Thank you all for the insightful and detailed comments! I really appreciate the effort to explore the performance implications of auto-vectorization cases. I agree it would be helpful if @fg1417 could join this discussion. However, before diving deeper, I’d like to clarify the problem statement as we see it. I've also updated the JBS ticket accordingly, and I’m citing the key part here for visibility:

To clarify, the goal of this ticket is to improve the performance of mul reduction VectorAPI operations on SVE-capable platforms with vector lengths greater than 128 bits (e.g., Neoverse V1). The core issue is that these APIs are not being lowered to any AArch64 implementation at all on such platforms. Instead, the fallback Java implementation is used.

This PR does not target improvements in auto-vectorization. In the context of auto-vectorization, the scope of this PR is limited to maintaining correctness and avoiding regressions.

@shqking , regarding the case-2 that you highlighted - I believe this change is incidental. Prior to the patch, Matcher::match_rule_supported_auto_vectorization() returned false for NEON platforms (as expected) and true for 128-bit SVE. This behavior is misleading because HotSpot currently uses the same scalar mul reduction implementation for both NEON and SVE platforms. Since this implementation is unprofitable on both, it should have been disabled across the board. @fg1417, please correct me if I’m mistaken.

This PR cannot leave Matcher::match_rule_supported_auto_vectorization() unchanged. If we do, HotSpot will select the strictly-ordered FP vector reduction implementation, which is not performant. A more efficient SVE-based implementation can't be used due to the strict ordering requirement.

@XiaohongGong ,

But I guess the performance change would be different with auto-vectorization on HWs with larger vector size. As we do not have the SVE machines with larger vector size as well, we may need help from @mikabl-arm !

Here are performance numbers for Neoverse V1 with the auto-vectorization restriction in Matcher::match_rule_supported_auto_vectorization() lifted (After). The linear strictly-ordered SVE implementation matched this way was later removed by 4593a5d.

| Benchmark                                      |   Before (ns/op) |   After (ns/op) | Diff (%)   |
|:-----------------------------------------------|-----------------:|----------------:|:-----------|
| VectorReduction.WithSuperword.mulRedD          |          401.679 |         401.704 | ~          |
| VectorReduction2.WithSuperword.doubleMulBig    |         2365.554 |        7294.706 | +208.37%   |
| VectorReduction2.WithSuperword.doubleMulSimple |         2321.154 |        2321.207 | ~          |
| VectorReduction2.WithSuperword.floatMulBig     |         2356.006 |        2648.334 | +12.41%    |
| VectorReduction2.WithSuperword.floatMulSimple  |         2321.018 |        2321.135 | ~          |

Given that:

  • this PR focuses on VectorAPI and not on auto-vectorization,
  • and it does not introduce regressions in auto-vectorization performance,

I suggest:

  • continuing the discussion on auto-vectorization separately on hotspot-dev, including @fg1417 in the loop;
  • moving forward with resolving the remaining VectorAPI issues and merging this PR.

I'm fine with removing the strict-ordered rules and disabling these operations for SLP since they do not benefit performance. Thanks for your testing and updating!

Comment on lines -4714 to +4715
Node* post_loop_reduction = ReductionNode::make(sopc, nullptr, init, last_accumulator, bt);
Node* post_loop_reduction = ReductionNode::make(sopc, nullptr, init, last_accumulator, bt,
/* requires_strict_order */ false);

Why did you change this? Before, it required strict order, but now it is false.

Contributor:

IIUC, it's a correction here.

As indicated by the function name move_unordered_reduction_out_of_loop() and the comment before this function, an unordered reduction is expected to be generated. Hence, we should specify /* requires_strict_order */ false.


@@ -1992,15 +1992,18 @@ void C2_MacroAssembler::neon_reduce_add_integral(Register dst, BasicType bt,

// Vector reduction multiply for integral type with ASIMD instructions.
// Note: temporary registers vtmp1 and vtmp2 are not used in some cases.
// Note: vsrc and vtmp2 may match.
Contributor:

I left a comment in this "resolved comment thread" several days ago. See https://github.com/openjdk/jdk/pull/23181/files#r2179185158. It might have been overlooked since the whole conversation was marked as resolved already.

I personally think we should not allow vsrc and vtmp2 to match.

@shqking (Contributor) commented Jul 11, 2025

@eme64 Thanks for your input. It's very helpful to us.
@fg1417 Thanks for your clarification on case-2 as I mentioned earlier.
@mikabl-arm Thanks for providing the performance data on the Neoverse-V1 machine.

Given that:

  • this PR focuses on VectorAPI and not on auto-vectorization,
  • and it does not introduce regressions in auto-vectorization performance,

I suggest:

  • continuing the discussion on auto-vectorization separately on hotspot-dev, including @fg1417 in the loop;
  • moving forward with resolving the remaining VectorAPI issues and merging this PR.

I agree with your suggestion.

Labels: hotspot-compiler, rfr (Pull request is ready for review)