8275275: AArch64: Fix performance regression after auto-vectorization on NEON #10175

fg1417 · 2022-09-06T03:13:25Z

Some vector opcodes have no corresponding AArch64 NEON
instructions, but supporting them benefits the Vector API. Some
of these opcodes are also used by superword for
auto-vectorization. Here is the list:

VectorCastD2I, VectorCastL2F
MulVL
AddReductionVI/L/F/D
MulReductionVI/L/F/D
AndReductionV, OrReductionV, XorReductionV

We ran micro-benchmark performance tests on NEON and found
that some of the listed opcodes hurt the performance of loops
after auto-vectorization, while others do not.

This patch disables, for superword only, the opcodes that show
obvious performance regressions after auto-vectorization on
NEON. In addition, the patch adds one jtreg test case that checks
IR nodes, to guard against accidental changes in the future.

Here is the performance data before and after the patch on NEON.

Benchmark       length  Mode   Cnt  Before   After    Units
AddReductionVD  1024    thrpt  15   450.830  548.001  ops/ms
AddReductionVF  1024    thrpt  15   514.468  548.013  ops/ms
MulReductionVD  1024    thrpt  15   405.613  499.531  ops/ms
MulReductionVF  1024    thrpt  15   451.292  495.061  ops/ms

Note:
Because superword doesn't vectorize reductions that are not
connected to other vector packs, the benchmark functions for
Add/Mul reductions look like this:

//  private double[] da, db;
//  private double dresult;
  public void AddReductionVD() {
    double result = 1;
    for (int i = startIndex; i < length; i++) {
      result += (da[i] + db[i]);
    }
    dresult += result;
  }
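
To try the loop shape outside the benchmark harness, here is a minimal standalone sketch. It is not the JMH benchmark from the patch: the class name, the method signature and the input data are assumptions made up for illustration. Only the loop body follows the function above, where the element-wise add `da[i] + db[i]` is the vector pack that the reduction is connected to.

```java
// Standalone sketch of the reduction kernel; hypothetical names, made-up data.
public class AddReductionSketch {

    // Same loop shape as the benchmark function: an element-wise add feeds
    // the scalar accumulator, so the reduction is connected to a vector pack.
    static double addReduction(double[] da, double[] db, int startIndex, int length) {
        double result = 1;
        for (int i = startIndex; i < length; i++) {
            result += (da[i] + db[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        int length = 1024; // matches the "length" column of the table
        double[] da = new double[length];
        double[] db = new double[length];
        for (int i = 0; i < length; i++) {
            da[i] = i;    // assumption: simple, exactly representable inputs
            db[i] = 1.0;
        }
        System.out.println(addReduction(da, db, 0, length));
    }
}
```

With these inputs every partial sum is an exact integer in double precision, so the result does not depend on whether the loop was compiled to scalar or vector code.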

Notably, vector multiply long has been implemented but was
disabled for both the Vector API and superword. For the same
reason, the patch re-enables MulVL on NEON for the Vector API but
keeps it disabled for superword. The performance uplift for the
Vector API is ~12.8x on my local machine.

Benchmark          length  Mode   Cnt  Before   After    Units
Long128Vector.MUL  1024    thrpt  10   55.015   760.593  ops/ms
MulVL(superword)   1024    thrpt  10   907.788  907.805  ops/ms
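
One way to read the ~12.8x figure against this table is as a relative gain, (After - Before) / Before, rather than the plain After / Before ratio. That reading is an assumption, not something the description states. The sketch below computes both for the two rows; the class and method names are hypothetical.

```java
// Arithmetic sketch for the table above; names are made up for illustration.
public class UpliftSketch {

    // Relative gain: how many multiples of the "before" throughput were added.
    static double relativeGain(double before, double after) {
        return (after - before) / before;
    }

    // Plain throughput ratio, after divided by before.
    static double ratio(double before, double after) {
        return after / before;
    }

    public static void main(String[] args) {
        // Values copied from the table (ops/ms).
        System.out.printf("Long128Vector.MUL: gain=%.3f ratio=%.3f%n",
                relativeGain(55.015, 760.593), ratio(55.015, 760.593));
        System.out.printf("MulVL(superword):  gain=%.5f ratio=%.5f%n",
                relativeGain(907.788, 907.805), ratio(907.788, 907.805));
    }
}
```

Under this reading the Vector API row gives a gain of about 12.8 (a ratio of about 13.8), while the superword row is effectively unchanged, which matches MulVL staying disabled for superword.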

Note:
The superword benchmark function is:

//  private long[] in1, in2, res;
  public void MulVL() {
    for (int i = 0; i < length; i++) {
      res[i] = in1[i] * in2[i];
    }
  }

The Vector API benchmark case is from:
https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Long128Vector.java#L190

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Commit message must refer to an issue

Issue

JDK-8275275: AArch64: Fix performance regression after auto-vectorization on NEON

Reviewers

Andrew Haley (@theRealAph - Reviewer) ⚠️ Review applies to d02cd800
Xiaohong Gong (@XiaohongGong - Committer)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/10175/head:pull/10175
$ git checkout pull/10175

Update a local copy of the PR:
$ git checkout pull/10175
$ git pull https://git.openjdk.org/jdk pull/10175/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 10175

View PR using the GUI difftool:
$ git pr show -t 10175

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/10175.diff

Change-Id: Ie9133e4010f98b26f97969c02fbf992b11e7edbb

bridgekeeper · 2022-09-06T03:15:10Z

👋 Welcome back fgao! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2022-09-06T03:17:18Z

@fg1417 The following label will be automatically applied to this pull request:

hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

mlbridge · 2022-09-06T03:20:31Z

Webrevs

01: Full - Incremental (fad1cc2f)
00: Full (d02cd800)

theRealAph

That all makes very good sense. Thanks.

openjdk · 2022-09-06T09:41:41Z

@fg1417 This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8275275: AArch64: Fix performance regression after auto-vectorization on NEON

Reviewed-by: aph, xgong

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 59 new commits pushed to the master branch:

68645eb: 8293566: RISC-V: Clean up push and pop registers
526eb54: 8293669: SA: Remove unnecssary "InstanceStackChunkKlass: InstanceStackChunkKlass" output when scanning heap
41ce658: 8292225: Rename ArchiveBuilder APIs related to source and buffered addresses
155b10a: 8293329: x86: Improve handling of constants in AES/GHASH stubs
d3f7e3b: 8293339: vm/jvmti/StopThread/stop001/stop00103 crashes with SIGSEGV in Continuation::is_continuation_mounted
524af94: 8283627: Outdated comment in MachineDescriptionTwosComplement.isLP64
cea409c: 8292738: JInternalFrame backgroundShadowBorder & foregroundShadowBorder line is longer in Mac Look and Feel
9ef6c09: 8287908: Use non-cloning reflection methods where acceptable
0c61bf1: 8293282: LoadLibraryUnloadTest.java fails with "Too few cleared WeakReferences"
91c9091: 8293343: sun/management/jmxremote/bootstrap/RmiSslNoKeyStoreTest.java failed with "Agent communication error: java.io.EOFException"
... and 49 more: https://git.openjdk.org/jdk/compare/710a14347344f3cc136f3b7f41aad231fbe43625...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@theRealAph, @XiaohongGong) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

XiaohongGong · 2022-09-07T02:23:03Z

src/hotspot/cpu/aarch64/aarch64_vector_ad.m4

@@ -143,7 +146,6 @@ source %{
    // Check whether specific Op is supported.
    // Fail fast, otherwise fall through to common vector_size_supported() check.
    switch (opcode) {
-      case Op_MulVL:


Enabling MulVL for the Vector API is great. Thanks for doing this! However, this might break several match rules like https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_vector.ad#L2025 and the vmls rule. The assertion on line 2035 might fail if this rule is matched for a long vector and runs on hardware that does not support SVE. One way to fix this is to add a predicate to these rules to skip the long vector type for NEON. Thanks!

Thanks for your kind reminder. I'll fix these related rules and add corresponding vector api regression tests in this PR.

Done. Thanks!

XiaohongGong

LGTM! Thanks!

fg1417 · 2022-09-09T01:29:26Z

The patch touches AArch64 only, so I suppose the GHA failure is not caused by this PR.

TobiHartmann · 2022-09-09T07:59:02Z

I tested this in our CI. All tests passed.

fg1417 · 2022-09-13T02:10:58Z

Thanks for your effort, @TobiHartmann.

fg1417 · 2022-09-13T02:13:38Z

/integrate

openjdk · 2022-09-13T02:14:47Z

@fg1417
Your change (at version fad1cc2) is now ready to be sponsored by a Committer.

pfustc · 2022-09-13T03:12:40Z

/sponsor

openjdk · 2022-09-13T03:13:54Z

Going to push as commit ec2629c.
Since your change was applied there have been 60 commits pushed to the master branch:

cbee0bc: 8292587: AArch64: Support SVE fabd instruction
68645eb: 8293566: RISC-V: Clean up push and pop registers
526eb54: 8293669: SA: Remove unnecssary "InstanceStackChunkKlass: InstanceStackChunkKlass" output when scanning heap
41ce658: 8292225: Rename ArchiveBuilder APIs related to source and buffered addresses
155b10a: 8293329: x86: Improve handling of constants in AES/GHASH stubs
d3f7e3b: 8293339: vm/jvmti/StopThread/stop001/stop00103 crashes with SIGSEGV in Continuation::is_continuation_mounted
524af94: 8283627: Outdated comment in MachineDescriptionTwosComplement.isLP64
cea409c: 8292738: JInternalFrame backgroundShadowBorder & foregroundShadowBorder line is longer in Mac Look and Feel
9ef6c09: 8287908: Use non-cloning reflection methods where acceptable
0c61bf1: 8293282: LoadLibraryUnloadTest.java fails with "Too few cleared WeakReferences"
... and 50 more: https://git.openjdk.org/jdk/compare/710a14347344f3cc136f3b7f41aad231fbe43625...master

Your commit was automatically rebased without conflicts.

openjdk · 2022-09-13T03:14:11Z

@pfustc @fg1417 Pushed as commit ec2629c.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

openjdk bot added the rfr Pull request is ready for review label Sep 6, 2022

openjdk bot added the hotspot-compiler label Sep 6, 2022

theRealAph approved these changes Sep 6, 2022

openjdk bot added the ready Pull request is ready to be integrated label Sep 6, 2022

XiaohongGong reviewed Sep 7, 2022

Fei Gao added 2 commits September 8, 2022 06:44

Merge branch 'master' into fg8275275

5b4021c

Fix match rules for mla/mls and add a vector API regression testcase

fad1cc2

XiaohongGong approved these changes Sep 8, 2022

openjdk bot added the sponsor Pull request is ready to be sponsored label Sep 13, 2022

openjdk bot added the integrated Pull request has been integrated label Sep 13, 2022

openjdk bot closed this Sep 13, 2022

openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review sponsor Pull request is ready to be sponsored labels Sep 13, 2022

shqking mentioned this pull request Jul 3, 2025

8343689: AArch64: Optimize MulReduction implementation #23181
