-
Notifications
You must be signed in to change notification settings - Fork 6.1k
8348561: Add aarch64 intrinsics for ML-DSA #23300
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
👋 Welcome back ferakocz! A progress list of the required criteria for merging this PR into |
@ferakocz This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 49 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@adinn) but any other Committer may sponsor as well. ➡️ To flag this PR as ready for integration with the above commit message, type |
Webrevs
|
@ferakocz this pull request can not be integrated into git checkout mldsa-aarch64-intrinsics
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push |
@ferakocz I'm afraid you lucked out on getting your change committed before my reorganization of the stub generation code. If you are unsure of how to do the merge so your new stub is declared and generated following the new model (see the doc comments in stubDeclarations.hpp for details) let me know and I'll be happy to help you sort it out. |
@adinn I think I managed to figure it out. Please take a look at the PR and let me know if I should have done anything differently. |
@ferakocz Yes, the stub declaration part of it looks to be correct. The rest of the patch will need at least two reviewers (@theRealAph? @martinuy? @franferrax) and may take some time to review, given that they will probably need to read up on the maths and algorithms. As an aid for reviewers and maintainers it would be good to insert a comment into the generator file linking the implementations to the relevant maths and algorithm. I found the FIPS-204 spec and the CRYSTALS-Dilithium Algorithm Specifications and Supporting Documentation paper, Shi Bai, Léo Ducas et al, 2021 - are they the best ones to look at? |
The Java implementation of ML-DSA is based on the FIPS-204 standard and the intrinsics' implementations are based on the corresponding Java methods, except that the montMul() calls in them are inlined. The rest of the transformation from Java code to intrinsic code is pretty straightforward, so a reviewer need not necessarily understand the whole mathematics of the ML-DSA algorithms, just that the Java and the corresponding intrinsic code do the same thing. |
Yes, I located the relevant Java implementations in SHA3.java (keccak) and ML_DSA.java (dilithiumXXX) plus also SHA3Parallel.java (doubleKeccak). The first file does at least mention FIPS-202. The second does not include any reference, in particular does not mention FIPS-204 (correction it mentions it in passing rather than citing it as the basis for the derived code). I still think it would be helpful for reviewers and maintainers if you were to add a comment in front of the generator routines that 1) notes that these routines are based on the relevant Java sources and 2) mentions that the Java code is in turn based on the FIPS-202 and FIPS-204 standards. While I agree that a reviewer or maintainer could simply check the generated code against the Java code I believe access to the underlying theory will be of aid when it comes to understanding what each variant is doing and verifying the equivalence of the two. That's why I'd also prefer to have two reviews to be sure that more than one of us who may be tasked with maintaining this code can be happy that we understand, at least, the equivalence in question. |
Some measurements:
|
@@ -2609,6 +2615,8 @@ template<typename R, typename... Rx> | |||
INSN(minv, 0, 0b011011, false); // accepted arrangements: T8B, T16B, T4H, T8H, T2S, T4S | |||
INSN(smaxp, 0, 0b101001, false); // accepted arrangements: T8B, T16B, T4H, T8H, T2S, T4S | |||
INSN(sminp, 0, 0b101011, false); // accepted arrangements: T8B, T16B, T4H, T8H, T2S, T4S | |||
INSN(sqdmulh,0, 0b101101, false); // accepted arrangements: T4H, T8H, T2S, T4S |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, not a comment on the algorithm itself but you might have to add these new instructions in the gtest for aarch64 here - test/hotspot/gtest/aarch64/aarch64-asmtest.py and use this file to generate test/hotspot/gtest/aarch64/asmtest.out.h which would contain these newly added instructions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have tried that, but the python script (actually the as command that it started) threw error messages:
aarch64ops.s:338:24: error: index must be a multiple of 8 in range [0, 32760].
prfm PLDL1KEEP, [x15, 43]
^
aarch64ops.s:357:20: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4]
sub x1, x10, x23, sxth #2
^
aarch64ops.s:359:20: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4]
add x11, x21, x5, uxtb #3
^
aarch64ops.s:360:22: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4]
adds x11, x17, x17, uxtw #1
^
aarch64ops.s:361:20: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4]
sub x11, x0, x15, uxtb #1
^
aarch64ops.s:362:19: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4]
subs x7, x1, x0, sxth #2
^
This is without any modifications from what is in the master branch currently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might have to use an assembler from the latest binutils build (if the system default isn't the latest) and add the path to the assembler in the "AS" variable. Also you can run it something like - python aarch64-asmtest.py | expand > asmtest.out.h
. Please let me know if you still face problems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ferakocz This also really needs addressing before committing the patch. Perhaps @theRealAph can advise on how to circumvent the problems you found when trying to update the python script?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might have to use an assembler from the latest binutils build (if the system default isn't the latest) and add the path to the assembler in the "AS" variable. Also you can run it something like -
python aarch64-asmtest.py | expand > asmtest.out.h
. Please let me know if you still face problems.
People have been running this script for a decade now.
Let's look at just one of these:
aarch64ops.s:357:20: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4]
sub x1, x10, x23, sxth #2
From the AArch64 manual:
SUB (extended register)
SUB <Xd|SP>, <Xn|SP>, {, {#}}
It thinks this is a SUB (shifted register), bit it's really a SUB (extended register).
fedora:aarch64 $ cat t.s
sub x1, x10, x23, sxth #2
fedora:aarch64 $ as t.s
fedora:aarch64 $ objdump -D a.out
Disassembly of section .text:
0000000000000000 <.text>:
0: cb37a941 sub x1, x10, w23, sxth #2
So perhaps binutils expects w23 here, not x23. But the manual (ARM DDI 0487K.a) says x23 should be just fine, and, what's more, gives the x form preferred status.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just tried it with top-of trunk latest binutils:
fedora:aarch64 $ ~/binutils-gdb-install/bin/as -march=armv9-a+sha3+sve2-bitperm aarch64ops.s
fedora:aarch64 $ ~/binutils-gdb-install/bin/as --version
GNU assembler (GNU Binutils) 2.44.50.20250225
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha!
aph@Andrews-MacBook-Pro ~ % as t.s
t.s:1:19: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4]
sub x1, x10, x23, sxth #2
^
aph@Andrews-MacBook-Pro ~ % as --version
Apple clang version 16.0.0 (clang-1600.0.26.6)
Target: arm64-apple-darwin24.3.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so GNU as is more forgiving than Apple as...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did my patch to aarch64-asmtest.py solve the problem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't tried, I just used GNU as.
@@ -2586,6 +2591,7 @@ template<typename R, typename... Rx> | |||
void NAME(FloatRegister Vd, SIMD_Arrangement T, FloatRegister Vn, FloatRegister Vm) { \ | |||
guarantee(T != T1Q && T != T1D, "incorrect arrangement"); \ | |||
if (!acceptT2D) guarantee(T != T2D, "incorrect arrangement"); \ | |||
if (strcmp(#NAME, "sqdmulh") == 0) guarantee(T != T8B && T != T16B, "incorrect arrangement"); \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion:
I think it might be better to change this test from a strcmp call to (opc2 == 0b101101). The strcmp test is clearer to a reader of the code but the call may not be guaranteed to be compiled out at build time while the latter will.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed as suggested.
@@ -4063,6 +4063,93 @@ class StubGenerator: public StubCodeGenerator { | |||
return start; | |||
} | |||
|
|||
// Execute on round of keccak of two computations in parallel. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion:
It would be helpful to add comments that relate the register and instruction selection to the original Java source code. e.g. change the header as follows
// Performs 2 keccak round transformations using vector parallelism
//
// Two sets of 25 * 64-bit input states a0[lo:hi]...a24[lo:hi] are passed in
// the lower/upper halves of registers v0...v24 and the transformed states
// are returned in the same registers. Intermediate 64-bit pairs
// c0...c5 and d0...d5 are computed in registers v25...v30. v31 is
// loaded with the required pair of 64 bit rounding constants.
// During computation of the output states some intermediate results are
// shuffled around registers v0...v30. Comments on each line indicate
// how the values in registers correspond to variables ai, ci, di in
// the Java source code, likewise how the generated machine instructions
// correspond to Java source operations (n.b. rol means rotate left).
The annotate the generation steps as follows:
__ eor3(v29, __ T16B, v4, v9, v14); // c4 = a4 ^ a9 ^ a14
__ eor3(v26, __ T16B, v1, v6, v11); // c1 = a1 ^ a16 ^ a11
__ eor3(v28, __ T16B, v3, v8, v13); // c3 = a3 ^ a8 ^a13
__ eor3(v25, __ T16B, v0, v5, v10); // c0 = a0 ^ a5 ^ a10
__ eor3(v27, __ T16B, v2, v7, v12); // c2 = a2 ^ a7 ^ a12
__ eor3(v29, __ T16B, v29, v19, v24); // c4 ^= a19 ^ a24
__ eor3(v26, __ T16B, v26, v16, v21); // c1 ^= a16 ^ a21
__ eor3(v28, __ T16B, v28, v18, v23); // c3 ^= a18 ^ a23
__ eor3(v25, __ T16B, v25, v15, v20); // c0 ^= a15 ^ a20
__ eor3(v27, __ T16B, v27, v17, v22); // c2 ^= a17 ^ a22
__ rax1(v30, __ T2D, v29, v26); // d0 = c4 ^ rol(c1, 1)
__ rax1(v26, __ T2D, v26, v28); // d2 = c1 ^ rol(c3, 1)
__ rax1(v28, __ T2D, v28, v25); // d4 = c3 ^ rol(c0, 1)
__ rax1(v25, __ T2D, v25, v27); // d1 = c0 ^ rol(c2, 1)
__ rax1(v27, __ T2D, v27, v29); // d3 = c2 ^ rol(c4, 1)
__ eor(v0, __ T16B, v0, v30); // a0 = a0 ^ d0
__ xar(v29, __ T2D, v1, v25, (64 - 1)); // a10' = rol((a1^d1), 1)
__ xar(v1, __ T2D, v6, v25, (64 - 44)); // a1 = rol(a6^d1), 44)
__ xar(v6, __ T2D, v9, v28, (64 - 20)); // a6 = rol((a9^d4), 20)
__ xar(v9, __ T2D, v22, v26, (64 - 61)); // a9 = rol((a22^d2), 61)
__ xar(v22, __ T2D, v14, v28, (64 - 39)); // a22 = rol((a14^d4), 39)
__ xar(v14, __ T2D, v20, v30, (64 - 18)); // a14 = rol((a20^d0), 18)
__ xar(v31, __ T2D, v2, v26, (64 - 62)); // a20' = rol((a2^d2), 62)
__ xar(v2, __ T2D, v12, v26, (64 - 43)); // a2 = rol((a12^d2), 43)
__ xar(v12, __ T2D, v13, v27, (64 - 25)); // a12 = rol((a13^d3), 25)
__ xar(v13, __ T2D, v19, v28, (64 - 8)); // a13 = rol((a19^d4), 8)
__ xar(v19, __ T2D, v23, v27, (64 - 56)); // a19 = rol((a23^d3), 56)
__ xar(v23, __ T2D, v15, v30, (64 - 41)); // a23 = rol((a15^d0), 41)
__ xar(v15, __ T2D, v4, v28, (64 - 27)); // a15 = rol((a4^d4), 27)
__ xar(v28, __ T2D, v24, v28, (64 - 14)); // a4' = rol((a24^d4), 14)
__ xar(v24, __ T2D, v21, v25, (64 - 2)); // a24 = rol((a21^d1), 2)
__ xar(v8, __ T2D, v8, v27, (64 - 55)); // a21' = rol((a8^d3), 55)
__ xar(v4, __ T2D, v16, v25, (64 - 45)); // a8' = rol((a16^d1), 45)
__ xar(v16, __ T2D, v5, v30, (64 - 36)); // a16 = rol((a5^d0), 36)
__ xar(v5, __ T2D, v3, v27, (64 - 28)); // a5 = rol((a3^d3), 28)
__ xar(v27, __ T2D, v18, v27, (64 - 21)); // a3' = rol((a18^d3), 21)
__ xar(v3, __ T2D, v17, v26, (64 - 15)); // a18' = rol((a17^d2), 15)
__ xar(v25, __ T2D, v11, v25, (64 - 10)); // a17' = rol((a11^d1), 10)
__ xar(v26, __ T2D, v7, v26, (64 - 6)); // a11' = rol((a7^d2), 6)
__ xar(v30, __ T2D, v10, v30, (64 - 3)); // a7' = rol((a10^d0), 3)
__ bcax(v20, __ T16B, v31, v22, v8); // a20 = a20' ^ (~a21 & a22')
__ bcax(v21, __ T16B, v8, v23, v22); // a21 = a21' ^ (~a22 & a23)
__ bcax(v22, __ T16B, v22, v24, v23); // a22 = a22 ^ (~a23 & a24)
__ bcax(v23, __ T16B, v23, v31, v24); // a23 = a23 ^ (~a24 & a20')
__ bcax(v24, __ T16B, v24, v8, v31); // a24 = a24 ^ (~a20' & a21')
__ ld1r(v31, __ T2D, __ post(rscratch1, 8)); // rc = round_constants[i]
__ bcax(v17, __ T16B, v25, v19, v3); // a17 = a17' ^ (~a18' & a19)
__ bcax(v18, __ T16B, v3, v15, v19); // a18 = a18' ^ (~a19 & a15')
__ bcax(v19, __ T16B, v19, v16, v15); // a19 = a19 ^ (~a15 & a16)
__ bcax(v15, __ T16B, v15, v25, v16); // a15 = a15 ^ (~a16 & a17')
__ bcax(v16, __ T16B, v16, v3, v25); // a16 = a16 ^ (~a17' & a18')
__ bcax(v10, __ T16B, v29, v12, v26); // a10 = a10' ^ (~a11' & a12)
__ bcax(v11, __ T16B, v26, v13, v12); // a11 = a11' ^ (~a12 & a13)
__ bcax(v12, __ T16B, v12, v14, v13); // a12 = a12 ^ (~a13 & a14)
__ bcax(v13, __ T16B, v13, v29, v14); // a13 = a13 ^ (~a14 & a10')
__ bcax(v14, __ T16B, v14, v26, v29); // a14 = a14 ^ (~a10' & a11')
__ bcax(v7, __ T16B, v30, v9, v4); // a7 = a7' ^ (~a8' & a9)
__ bcax(v8, __ T16B, v4, v5, v9); // a8 = a8' ^ (~a9 & a5)
__ bcax(v9, __ T16B, v9, v6, v5); // a9 = a9 ^ (~a5 & a6)
__ bcax(v5, __ T16B, v5, v30, v6); // a5 = a5 ^ (~a6 & a7)
__ bcax(v6, __ T16B, v6, v4, v30); // a6 = a6 ^ (~a7 & a8')
__ bcax(v3, __ T16B, v27, v0, v28); // a3 = a3' ^ (~a4' & a0)
__ bcax(v4, __ T16B, v28, v1, v0); // a4 = a4' ^ (~a0 & a1)
__ bcax(v0, __ T16B, v0, v2, v1); // a0 = a0 ^ (~a1 & a2)
__ bcax(v1, __ T16B, v1, v27, v2); // a1 = a1 ^ (~a2 & a3)
__ bcax(v2, __ T16B, v2, v28, v27); // a2 = a2 ^ (~a3 & a4')
__ eor(v0, __ T16B, v0, v31); // a0 = a0 ^ rc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although this piece of code is not new, and I don't really think that this level of commenting is necessary, especially in code that is very unlikely to change, I added the comments.
Hi. Here is the test result of our CI. copyright yearthe following files should update the copyright year to 2025.
cross-build failureCross build for riscv64/s390/ppc64 failed. Here shows the error msg for ppc64 === Output from failing command(s) repeated here ===
* For target support_interim-jmods_support__create_java.base.jmod_exec:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (/tmp/jdk-src/src/hotspot/share/asm/codeBuffer.hpp:200), pid=72752, tid=72769
# assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000e85cb03dc620 <= 0x0000e85cb03e8ab4 <= 0x0000e85cb03e8ab0
#
# JRE version: OpenJDK Runtime Environment (25.0) (fastdebug build 25-internal-git-1e01c6deec3)
# Java VM: OpenJDK 64-Bit Server VM (fastdebug 25-internal-git-1e01c6deec3, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)
# Problematic frame:
# V [libjvm.so+0x3b391c] Instruction_aarch64::~Instruction_aarch64()+0xbc
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /tmp/ci-scripts/jdk-src/make/
#
# An error report file with more information is saved as:
# /tmp/jdk-src/make/hs_err_pid72752.log
... (rest of output omitted)
* All command lines available in /sysroot/ppc64el/tmp/build-ppc64el/make-support/failure-logs.
=== End of repeated output === I suppose we should make the similar update at file |
@ferakocz Apologies for the delays in reviewing and the limited feedback up to now. The code clearly does the job well but I think it would be made clearer and easier to maintain by tweaking/extending some of the generator methods and adding more detailed commenting. I am afraid I may take a few days to provide the relevant details because of other commitments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add comments as indicated to relate generated code to original Java source. Otherwise good to go.
I raised JDK-8350589 to cover investigation of an alternative implementation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, still good
@ferakocz Apologies for raising yet another resolve conflict. You will need to make a further adjustment to the compiler blob declaration to accommodate a fix I just pushed to resolve a problem with cross-compilation. Your patch should now specify
|
Oops. sorry - cut and paste error -- the new setting should be
|
@adinn, I have done this change, but that erased your approval. Could you reapprove? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still good.
@ferakocz Feel free to integrate and I will sponsor |
/integrate |
I think it would be nice to add a release note for this describing the approximate performance improvement. |
/sponsor |
Going to push as commit 3230894.
Your commit was automatically rebased without conflicts. |
By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled.
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/23300/head:pull/23300
$ git checkout pull/23300
Update a local copy of the PR:
$ git checkout pull/23300
$ git pull https://git.openjdk.org/jdk.git pull/23300/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 23300
View PR using the GUI difftool:
$ git pr show -t 23300
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/23300.diff
Using Webrev
Link to Webrev Comment