8322174: RISC-V: C2 VectorizedHashCode RVV Version #17413


Open: wants to merge 9 commits into master

Conversation


@ygaevsky ygaevsky commented Jan 13, 2024

The patch adds the ability to use RVV instructions for faster vectorizedHashCode calculations on RVV v1.0.0 capable hardware.

Testing: hotspot/jtreg/compiler/ under QEMU-8.1 with RVV v1.0.0.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8322174: RISC-V: C2 VectorizedHashCode RVV Version (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/17413/head:pull/17413
$ git checkout pull/17413

Update a local copy of the PR:
$ git checkout pull/17413
$ git pull https://git.openjdk.org/jdk.git pull/17413/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 17413

View PR using the GUI difftool:
$ git pr show -t 17413

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/17413.diff

Using Webrev

Link to Webrev Comment


bridgekeeper bot commented Jan 13, 2024

👋 Welcome back ygaevsky! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.


ygaevsky commented Jan 13, 2024

NB: I have no access to RVV v1.0.0 hardware, so to estimate performance improvements
I adapted the patch to the RVV v0.7.1 ISA [1] under OpenJDK-21 and ran the JMH test
org.openjdk.bench.java.lang.ArraysHashCode on a LicheePi-4A TH1520, which does support
RVV v0.7.1.

[1] https://mail.openjdk.org/pipermail/riscv-port-dev/2024-January/001220.html

The results are below. Hopefully they will be similar on RVV v1.0.0 hardware.

Legend: UseVHI ==> UseVectorizedHashCodeIntrinsic

----------------------------------------------------------------------------------------------------------------------------------------------
                                [-XX:-UseVHI -XX:-UseRVV] [-XX:-UseVHI -XX:+UseRVV] [-XX:+UseVHI -XX:-UseRVV] [-XX:+UseVHI -XX:+UseRVV]
----------------------------------------------------------------------------------------------------------------------------------------------
Benchmark    (size)  Mode  Cnt |       Score      Error  |       Score      Error  |       Score      Error  |       Score      Error  |Units|
----------------------------------------------------------------------------------------------------------------------------------------------
bytes             1  avgt   10 |      20.292 ±    0.524  |      20.693 ±    1.706  |      20.458 ±    0.718  |      20.276 ±    0.525  |ns/op|
bytes            10  avgt   10 |      35.107 ±    0.180  |      35.054 ±    0.029  |      30.898 ±    0.109  |      31.033 ±    0.132  |ns/op|
bytes           100  avgt   10 |     188.190 ±    4.192  |     188.805 ±    4.345  |     152.324 ±    2.205  |      97.673 ±    3.145  |ns/op|
bytes          1000  avgt   10 |    1664.569 ±    1.662  |    1663.711 ±    2.229  |    1184.224 ±    0.731  |     656.340 ±    1.908  |ns/op|
bytes         10000  avgt   10 |   16419.434 ±   68.995  |   16407.357 ±   43.737  |   11599.876 ±   23.574  |    6171.500 ±   16.633  |ns/op|
bytes        100000  avgt   10 |  167738.927 ± 3313.255  |  166577.887 ± 1552.963  |  119475.413 ± 1358.363  |   62061.873 ±  130.268  |ns/op|
chars             1  avgt   10 |      20.420 ±    1.031  |      20.294 ±    0.527  |      20.402 ±    0.992  |      21.267 ±    0.027  |ns/op|
chars            10  avgt   10 |      35.800 ±    0.032  |      35.778 ±    0.049  |      31.170 ±    0.199  |      31.744 ±    0.169  |ns/op|
chars           100  avgt   10 |     185.715 ±    0.674  |     184.531 ±    1.152  |     143.918 ±    1.147  |      90.613 ±    0.092  |ns/op|
chars          1000  avgt   10 |    1683.711 ±   46.493  |    1668.926 ±    6.850  |    1120.730 ±    3.017  |     652.677 ±    2.026  |ns/op|
chars         10000  avgt   10 |   16402.007 ±   16.654  |   16468.497 ±  136.411  |   10939.505 ±   72.647  |    6174.555 ±   28.879  |ns/op|
chars        100000  avgt   10 |  164826.072 ±  381.240  |  165807.663 ± 4328.908  |  114787.826 ± 4217.557  |   61724.436 ±   45.819  |ns/op|
ints              1  avgt   10 |      20.730 ±    2.375  |      20.506 ±    1.458  |      20.277 ±    0.517  |      20.169 ±    0.015  |ns/op|
ints             10  avgt   10 |      36.878 ±    0.059  |      36.162 ±    1.033  |      31.338 ±    0.243  |      32.511 ±    0.165  |ns/op|
ints            100  avgt   10 |     184.288 ±    0.790  |     184.939 ±    0.624  |     143.794 ±    0.708  |      80.406 ±    6.987  |ns/op|
ints           1000  avgt   10 |    1669.219 ±    3.559  |    1670.992 ±   13.830  |    1118.856 ±    1.086  |     486.305 ±    4.471  |ns/op|
ints          10000  avgt   10 |   16432.730 ±   62.326  |   16710.540 ±   68.028  |   11128.766 ±   57.448  |    5232.062 ±  291.835  |ns/op|
ints         100000  avgt   10 |  165387.705 ±  431.814  |  165597.050 ±  278.567  |  115605.648 ± 8245.853  |   45468.032 ± 1793.979  |ns/op|
multibytes        1  avgt   10 |       3.459 ±    0.020  |       3.473 ±    0.055  |       3.477 ±    0.145  |       3.480 ±    0.043  |ns/op|
multibytes       10  avgt   10 |      16.983 ±    0.264  |      17.526 ±    0.375  |      12.325 ±    0.117  |      13.415 ±    0.136  |ns/op|
multibytes      100  avgt   10 |     105.251 ±    0.250  |     105.032 ±    0.180  |      78.795 ±    0.260  |      53.210 ±    1.024  |ns/op|
multibytes     1000  avgt   10 |     948.171 ±    5.950  |     957.757 ±   12.117  |     700.407 ±    1.928  |     440.352 ±    2.248  |ns/op|
multibytes    10000  avgt   10 |    8829.949 ±   64.161  |    9007.879 ±  510.217  |    6406.776 ±   17.982  |    3430.480 ±   35.108  |ns/op|
multibytes   100000  avgt   10 |   89545.793 ± 6151.064  |   88335.319 ±   51.310  |   64236.061 ±   46.572  |   33380.485 ±   56.708  |ns/op|
multichars        1  avgt   10 |       3.475 ±    0.054  |       3.453 ±    0.066  |       3.492 ±    0.122  |       3.495 ±    0.047  |ns/op|
multichars       10  avgt   10 |      17.719 ±    0.645  |      17.201 ±    0.152  |      12.318 ±    0.141  |      13.093 ±    0.147  |ns/op|
multichars      100  avgt   10 |     106.735 ±    0.283  |     106.625 ±    0.177  |      77.695 ±    0.212  |      51.495 ±    0.166  |ns/op|
multichars     1000  avgt   10 |     927.573 ±    6.839  |     932.211 ±    3.445  |     696.374 ±    1.757  |     471.226 ±    1.499  |ns/op|
multichars    10000  avgt   10 |    9846.872 ±   20.840  |    9909.611 ±  188.165  |    6392.901 ±    4.849  |    3978.730 ±  180.130  |ns/op|
multichars   100000  avgt   10 |   88110.303 ±   41.764  |   88892.543 ± 2534.299  |   60615.033 ±   94.002  |   33956.859 ±  199.178  |ns/op|
multiints         1  avgt   10 |       3.450 ±    0.328  |       3.382 ±    0.150  |       3.345 ±    0.024  |       3.380 ±    0.040  |ns/op|
multiints        10  avgt   10 |      18.265 ±    0.424  |      18.644 ±    1.433  |      12.036 ±    0.041  |      13.773 ±    0.114  |ns/op|
multiints       100  avgt   10 |     107.500 ±    0.636  |     107.318 ±    0.466  |      77.971 ±    0.296  |      47.700 ±    0.408  |ns/op|
multiints      1000  avgt   10 |     924.920 ±    9.106  |     937.609 ±   44.303  |     695.427 ±    2.075  |     449.475 ±    2.061  |ns/op|
multiints     10000  avgt   10 |    9322.880 ±   49.589  |    9277.425 ±   91.828  |    7009.704 ±  297.983  |    6196.819 ±  367.531  |ns/op|
multiints    100000  avgt   10 |   88154.281 ±  279.258  |   88272.818 ±  103.608  |   64118.963 ± 6445.702  |   55317.212 ±  916.179  |ns/op|
multishorts       1  avgt   10 |       3.488 ±    0.034  |       3.531 ±    0.227  |       3.521 ±    0.051  |       3.512 ±    0.054  |ns/op|
multishorts      10  avgt   10 |      17.907 ±    0.380  |      17.408 ±    0.659  |      12.252 ±    0.110  |      13.445 ±    0.102  |ns/op|
multishorts     100  avgt   10 |     106.588 ±    0.188  |     107.500 ±    0.531  |      79.630 ±    0.428  |      53.886 ±    3.243  |ns/op|
multishorts    1000  avgt   10 |     931.732 ±    6.891  |     923.814 ±   11.836  |     701.534 ±    1.742  |     470.312 ±    2.117  |ns/op|
multishorts   10000  avgt   10 |    9663.105 ± 1017.387  |    9859.034 ±   66.672  |    6422.864 ±    7.486  |    3785.710 ±   37.656  |ns/op|
multishorts  100000  avgt   10 |   88799.262 ± 2363.672  |   88015.545 ±   52.795  |   60541.966 ±  155.521  |   33888.677 ±  127.071  |ns/op|
shorts            1  avgt   10 |      20.199 ±    0.083  |      20.190 ±    0.027  |      21.389 ±    0.600  |      21.250 ±    0.024  |ns/op|
shorts           10  avgt   10 |      35.842 ±    0.189  |      35.806 ±    0.167  |      30.960 ±    0.186  |      31.451 ±    0.182  |ns/op|
shorts          100  avgt   10 |     184.323 ±    0.488  |     185.318 ±    0.776  |     143.652 ±    1.057  |      90.657 ±    0.052  |ns/op|
shorts         1000  avgt   10 |    1664.583 ±    2.016  |    1666.803 ±    3.100  |    1118.623 ±    0.661  |     652.112 ±    0.346  |ns/op|
shorts        10000  avgt   10 |   16395.042 ±   39.388  |   16426.231 ±   75.461  |   10933.090 ±   16.165  |    6200.135 ±  116.218  |ns/op|
shorts       100000  avgt   10 |  165037.332 ±  226.003  |  167782.156 ± 8844.288  |  114329.012 ± 4326.851  |   61693.056 ±   93.278  |ns/op|
----------------------------------------------------------------------------------------------------------------------------------------------

@openjdk openjdk bot added the rfr Pull request is ready for review label Jan 13, 2024

openjdk bot commented Jan 13, 2024

@ygaevsky The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.


@RealFYang RealFYang (Member) left a comment

Some initial comments from a brief look.

// 31^^(MaxVectorSize-1)...31^^0 ==> vector registers
la(pows31, ExternalAddress(adr_pows31));
mv(t1, num_8b_elems_in_vec);
vsetvli(t0, t1, Assembler::e32, Assembler::m4);
@RealFYang RealFYang (Member) commented Jan 17, 2024

I wonder if the scalar code for handling WIDE_TAIL could be eliminated with RVV's stripmining design [1]? It looks like the current code doesn't take advantage of this design, since the new vl returned by vsetvli is not checked or used.

[1] https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#sec-vector-config

One of the common approaches to handling a large number of elements is "stripmining", where each iteration of a loop handles some number of elements, and the iterations continue until all elements have been processed. The RISC-V vector specification provides direct, portable support for this approach. The application specifies the total number of elements to be processed (the application vector length, or AVL) as a candidate value for vl, and the hardware responds via a general-purpose register with the (frequently smaller) number of elements that the hardware will handle per iteration (stored in vl), based on the microarchitectural implementation and the vtype setting. A straightforward loop structure, shown in [Example of stripmining and changes to SEW](https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#example-stripmine-sew), depicts the ease with which the code keeps track of the remaining number of elements and the amount per iteration handled by hardware.
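
For illustration, here is a minimal stripmining sketch in the C2_MacroAssembler style used elsewhere in this PR (register names and the per-iteration computation are placeholders, not the actual patch). The key point is that the vl value written by vsetvli drives the pointer and counter updates, so the element count never has to be a multiple of the vector length:

  Label STRIP_LOOP;
  bind(STRIP_LOOP);
  vsetvli(t0, cnt, Assembler::e32, Assembler::m4); // t0 = vl, elements handled this pass
  vle32_v(v_src, ary);                             // load vl elements
  // ... per-iteration hash computation on v_src would go here ...
  slli(t1, t0, 2);                                 // vl elements * 4 bytes (e32)
  add(ary, ary, t1);                               // advance the array pointer
  sub(cnt, cnt, t0);                               // cnt -= vl
  bnez(cnt, STRIP_LOOP);                           // iterate until all elements are consumed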

@ygaevsky (Contributor Author)

Thank you for your comments, @RealFYang. I have tried to use vector instructions (m4 ==> m2) for the tail calculations, but that only makes the performance numbers worse. :-(

I've made additional measurements with more granularity:

                                            [ -XX:-UseRVV ]  [ -XX:+UseRVV ]
ArraysHashCode.multiints      10  avgt   30  12.460 ± 0.155  13.836 ± 0.054  ns/op
ArraysHashCode.multiints      11  avgt   30  14.541 ± 0.140  14.613 ± 0.084  ns/op
ArraysHashCode.multiints      12  avgt   30  15.097 ± 0.052  15.517 ± 0.097  ns/op
ArraysHashCode.multiints      13  avgt   30  13.632 ± 0.137  14.486 ± 0.181  ns/op
ArraysHashCode.multiints      14  avgt   30  15.771 ± 0.108  16.153 ± 0.092  ns/op
ArraysHashCode.multiints      15  avgt   30  14.726 ± 0.088  15.930 ± 0.077  ns/op
ArraysHashCode.multiints      16  avgt   30  15.533 ± 0.067  15.496 ± 0.083  ns/op
ArraysHashCode.multiints      17  avgt   30  15.875 ± 0.173  16.878 ± 0.172  ns/op
ArraysHashCode.multiints      18  avgt   30  15.740 ± 0.114  16.465 ± 0.089  ns/op
ArraysHashCode.multiints      19  avgt   30  17.252 ± 0.051  17.628 ± 0.155  ns/op
ArraysHashCode.multiints      20  avgt   30  20.193 ± 0.282  19.039 ± 0.441  ns/op
ArraysHashCode.multiints      25  avgt   30  20.209 ± 0.070  20.513 ± 0.071  ns/op 
ArraysHashCode.multiints      30  avgt   30  23.157 ± 0.068  23.290 ± 0.165  ns/op
ArraysHashCode.multiints      35  avgt   30  28.671 ± 0.116  26.198 ± 0.127  ns/op <---
ArraysHashCode.multiints      40  avgt   30  30.992 ± 0.068  27.342 ± 0.072  ns/op
ArraysHashCode.multiints      45  avgt   30  39.408 ± 1.428  32.170 ± 0.230  ns/op
ArraysHashCode.multiints      50  avgt   30  41.976 ± 0.442  33.103 ± 0.090  ns/op
ArraysHashCode.multiints      55  avgt   30  45.379 ± 0.236  35.899 ± 0.692  ns/op
ArraysHashCode.multiints      60  avgt   30  48.615 ± 0.249  35.709 ± 0.477  ns/op
ArraysHashCode.multiints      65  avgt   30  51.455 ± 0.213  38.275 ± 0.266  ns/op
ArraysHashCode.multiints      70  avgt   30  54.032 ± 0.324  37.985 ± 0.264  ns/op
ArraysHashCode.multiints      75  avgt   30  56.759 ± 0.164  39.446 ± 0.425  ns/op
ArraysHashCode.multiints      80  avgt   30  61.334 ± 0.267  41.521 ± 0.310  ns/op
ArraysHashCode.multiints      85  avgt   30  66.177 ± 0.299  44.136 ± 0.407  ns/op
ArraysHashCode.multiints      90  avgt   30  67.444 ± 0.282  42.909 ± 0.275  ns/op
ArraysHashCode.multiints      95  avgt   30  77.312 ± 0.303  49.078 ± 1.166  ns/op
ArraysHashCode.multiints     100  avgt   30  78.405 ± 0.220  47.499 ± 0.553  ns/op
ArraysHashCode.multiints     105  avgt   30  75.706 ± 0.265  46.029 ± 0.579  ns/op

As you can see, the numbers become better with +UseRVV only for lengths >= 30, which perhaps explains why my attempt to improve the tail with RVV instructions was unsuccessful: the cost of setting up the vector unit for small lengths is too high. :-(

@RealFYang RealFYang (Member) commented Jan 26, 2024

Hi, I don't quite understand why there is a need to change LMUL from m4 to m2 if we are switching to the stripmining approach. The tail calculation should normally share the code for VEC_LOOP, which also means we need to use some vector mask instructions to filter out the active elements for each loop iteration, especially the iteration handling the tail elements. And the vl returned by vsetvli tells us the number of elements that can be processed in parallel in a given iteration ([1] is one example). I am not sure whether that is what you tried. Do you have more details or code changes to share? Thanks.

[1] https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#example-stripmine-sew

@ygaevsky ygaevsky (Contributor Author) commented Jan 30, 2024

I used the m4->m2 change to process 8 elements in the tail with vector instructions after the main vector loop. IIUC, changing m4->m2 at runtime is very costly, so I've created another patch with the same goal but without the m4->m2 change:

void C2_MacroAssembler::arrays_hashcode_v(Register ary, Register cnt, Register result,
                                          Register tmp1, Register tmp2, Register tmp3,
                                          Register tmp4, Register tmp5, Register tmp6,
                                          BasicType eltype)
{
...
  const int nof_vec_elems = MaxVectorSize;
  const int hof_vec_elems = nof_vec_elems >> 1;
  const int elsize_bytes = arrays_hashcode_elsize(eltype);
  const int elsize_shift = exact_log2(elsize_bytes);
  const int vec_step_bytes = nof_vec_elems << elsize_shift;
  const int half_vec_step_bytes = vec_step_bytes >> 1;
  const address adr_pows31 = StubRoutines::riscv::arrays_hashcode_powers_of_31()
                           + sizeof(jint);
 
...

  const Register chunks = tmp1;
  const Register chunks_end = chunks;
  const Register pows31 = tmp2;
  const Register powmax = tmp3;

  const VectorRegister v_coeffs =  v4;
  const VectorRegister v_src    =  v8;
  const VectorRegister v_sum    = v12;
  const VectorRegister v_powmax = v16;
  const VectorRegister v_result = v20;
  const VectorRegister v_tmp    = v24;
  const VectorRegister v_zred   = v28;

  Label DONE, TAIL, TAIL_LOOP, PRE_TAIL, SAVE_VRESULT, WIDE_TAIL, VEC_LOOP;

  // result has a value initially

  beqz(cnt, DONE);

  andi(chunks, cnt, ~(hof_vec_elems-1));
  beqz(chunks, TAIL);

  // load pre-calculated powers of 31
  la(pows31, ExternalAddress(adr_pows31));
  mv(t1, nof_vec_elems);
  vsetvli(t0, t1, Assembler::e32, Assembler::m4);
  vle32_v(v_coeffs, pows31);
  // clear vector registers used in intermediate calculations
  vmv_v_i(v_sum, 0);
  vmv_v_i(v_powmax, 0);
  vmv_v_i(v_result, 0);
  // set initial values
  vmv_s_x(v_result, result);
  vmv_s_x(v_zred, x0);

  andi(chunks, cnt, ~(nof_vec_elems-1));
  beqz(chunks, WIDE_TAIL);

  subw(cnt, cnt, chunks);
  slli(chunks_end, chunks, elsize_shift);
  add(chunks_end, ary, chunks_end);
  // get value of 31^^nof_vec_elems
  lw(powmax, Address(pows31, -1 * sizeof(jint)));
  vmv_s_x(v_powmax, powmax);

  bind(VEC_LOOP);
  // result = result * 31^^(hof_vec_elems) + v_src[0] * 31^^(hof_vec_elems-1)
  //                                + ...  + v_src[hof_vec_elems-1] * 31^^(0)
  vmul_vv(v_result, v_result, v_powmax);
  arrays_hashcode_vec_elload(v_src, v_tmp, ary, eltype);
  vmul_vv(v_src, v_src, v_coeffs);
  vredsum_vs(v_sum, v_src, v_zred);
  vadd_vv(v_result, v_result, v_sum);
  addi(ary, ary, vec_step_bytes); // bump array pointer
  bne(ary, chunks_end, VEC_LOOP); // reached the end of chunks?
  beqz(cnt, SAVE_VRESULT);

  bind(WIDE_TAIL);
  andi(chunks, cnt, ~(hof_vec_elems-1));
  beqz(chunks, PRE_TAIL);

  mv(t1, hof_vec_elems);
  subw(cnt, cnt, t1);
  vslidedown_vx(v_coeffs, v_coeffs, t1);
  // get value of 31^^hof_vec_elems
  lw(powmax, Address(pows31, sizeof(jint)*(hof_vec_elems - 1)));
  vmv_s_x(v_powmax, powmax);
  vsetvli(t0, t1, Assembler::e32, Assembler::m4);
  // result = result * 31^^(hof_vec_elems) + v_src[0] * 31^^(hof_vec_elems-1)
  //                                + ...  + v_src[hof_vec_elems-1] * 31^^(0)
  vmul_vv(v_result, v_result, v_powmax);
  arrays_hashcode_vec_elload(v_src, v_tmp, ary, eltype);
  vmul_vv(v_src, v_src, v_coeffs);
  vredsum_vs(v_sum, v_src, v_zred);
  vadd_vv(v_result, v_result, v_sum);
  beqz(cnt, SAVE_VRESULT);
  addi(ary, ary, half_vec_step_bytes); // bump array pointer

  bind(PRE_TAIL);
  vmv_x_s(result, v_result);

  bind(TAIL);
  slli(chunks_end, cnt, elsize_shift);
  add(chunks_end, ary, chunks_end);

  bind(TAIL_LOOP);
  arrays_hashcode_elload(t0, Address(ary), eltype);
  slli(t1, result, 5);           // optimize 31 * result
  subw(result, t1, result);      // with result<<5 - result
  addw(result, result, t0);
  addi(ary, ary, elsize_bytes);
  bne(ary, chunks_end, TAIL_LOOP);
  j(DONE);

  bind(SAVE_VRESULT);
  vmv_x_s(result, v_result);

  bind(DONE);
...
}

and got the following numbers:

[ -XX:+UseVectorizedHashCodeIntrinsic -XX:-UseRVV ]
Benchmark                  (size)  Mode  Cnt   Score   Error  Units
ArraysHashCode.multibytes       8  avgt   10  11.020 ± 0.225  ns/op
ArraysHashCode.multibytes       9  avgt   10  12.578 ± 0.117  ns/op
ArraysHashCode.multibytes      16  avgt   10  15.505 ± 0.273  ns/op
ArraysHashCode.multibytes      17  avgt   10  16.603 ± 0.164  ns/op
ArraysHashCode.multibytes      24  avgt   10  21.005 ± 0.271  ns/op
ArraysHashCode.multibytes      25  avgt   10  21.428 ± 0.227  ns/op
ArraysHashCode.multibytes      32  avgt   10  27.985 ± 0.356  ns/op
ArraysHashCode.multibytes      33  avgt   10  29.669 ± 0.145  ns/op
ArraysHashCode.multibytes      48  avgt   10  37.575 ± 0.318  ns/op
ArraysHashCode.multibytes      49  avgt   10  40.121 ± 0.229  ns/op
ArraysHashCode.multibytes      56  avgt   10  48.637 ± 0.274  ns/op
ArraysHashCode.multibytes      57  avgt   10  45.931 ± 0.305  ns/op
ArraysHashCode.multibytes      64  avgt   10  48.362 ± 0.315  ns/op
ArraysHashCode.multibytes      65  avgt   10  52.228 ± 0.320  ns/op
ArraysHashCode.multibytes      72  avgt   10  49.523 ± 0.287  ns/op
ArraysHashCode.multibytes      73  avgt   10  54.788 ± 0.437  ns/op
ArraysHashCode.multibytes      80  avgt   10  62.087 ± 0.289  ns/op
ArraysHashCode.multibytes      81  avgt   10  62.570 ± 0.211  ns/op
[ -XX:+UseVectorizedHashCodeIntrinsic -XX:+UseRVV ]
Benchmark                  (size)  Mode  Cnt   Score   Error  Units
ArraysHashCode.multibytes       8  avgt   10  15.700 ± 0.181  ns/op
ArraysHashCode.multibytes       9  avgt   10  20.743 ± 0.419  ns/op
ArraysHashCode.multibytes      16  avgt   10  30.189 ± 0.301  ns/op
ArraysHashCode.multibytes      17  avgt   10  32.639 ± 0.601  ns/op
ArraysHashCode.multibytes      24  avgt   10  36.358 ± 0.628  ns/op
ArraysHashCode.multibytes      25  avgt   10  34.486 ± 0.563  ns/op
ArraysHashCode.multibytes      32  avgt   10  42.667 ± 0.473  ns/op
ArraysHashCode.multibytes      33  avgt   10  44.858 ± 0.413  ns/op
ArraysHashCode.multibytes      48  avgt   10  47.132 ± 0.443  ns/op
ArraysHashCode.multibytes      49  avgt   10  51.528 ± 0.519  ns/op
ArraysHashCode.multibytes      56  avgt   10  52.133 ± 0.225  ns/op
ArraysHashCode.multibytes      57  avgt   10  48.549 ± 0.411  ns/op
ArraysHashCode.multibytes      64  avgt   10  57.399 ± 0.546  ns/op
ArraysHashCode.multibytes      65  avgt   10  57.680 ± 0.158  ns/op
ArraysHashCode.multibytes      72  avgt   10  50.890 ± 0.327  ns/op
ArraysHashCode.multibytes      73  avgt   10  54.338 ± 0.378  ns/op
ArraysHashCode.multibytes      80  avgt   10  59.218 ± 0.301  ns/op
ArraysHashCode.multibytes      81  avgt   10  63.889 ± 0.344  ns/op

As you can see, the numbers are worse even in cases where the scalar code is not used at all, i.e. for lengths 16, 24, 32, 48, 56, 64, etc. It seems possible to change the code so that it contains no scalar code at all, e.g. use the vslidedown instruction to shift the pre-calculated powers of 31 in v_coeffs according to the count of remaining elements, and perform the calculation:

  vmul_vv(v_result, v_result, v_powmax);
  arrays_hashcode_vec_elload(v_src, v_tmp, ary, eltype);
  vmul_vv(v_src, v_src, v_coeffs);
  vredsum_vs(v_sum, v_src, v_zred);
  vadd_vv(v_result, v_result, v_sum);

for them in one pass. However, as I pointed out above in the notes about lengths 24/36/..., that is unlikely to change the performance numbers.

@ygaevsky (Contributor Author)

Of course, any ideas for improving the code are very welcome.

@RealFYang RealFYang (Member) commented Feb 5, 2024

Of course, any ideas for improving the code are very welcome.

Hi, I am afraid that the local changes you posted are still not in a stripmining form. Normally I would expect a single loop with masked vector instructions to handle all cases, including the tail. See my previous comment [1]. Note that I am not saying that stripmining is the best approach here in terms of performance, but we will need the numbers to evaluate the different approaches.

[1] #17413 (comment)

@VladimirKempik VladimirKempik mentioned this pull request Feb 7, 2024

bridgekeeper bot commented Mar 4, 2024

@ygaevsky This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!


ygaevsky commented Mar 4, 2024

"Please keep me active" comment.


openjdk bot commented Mar 13, 2024

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.


openjdk bot commented Apr 10, 2024

@ygaevsky this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout JDK-8322174
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk openjdk bot added merge-conflict Pull request has merge conflict with target branch and removed rfr Pull request is ready for review labels Apr 10, 2024
@openjdk openjdk bot added rfr Pull request is ready for review and removed merge-conflict Pull request has merge conflict with target branch labels Apr 24, 2025
@ygaevsky (Contributor Author)

JFTR: the numbers after the above merge, on real RVV 1.0 hardware (a BPI-F3 16 GB box), are below:

Legend: UseVHI ==> UseVectorizedHashCodeIntrinsic
------------------------------------------------------------------------------------
                                        (baseline)              (patch)
------------------------------------------------------------------------------------
                               [-XX:-UseVHI -XX:+UseRVV] [-XX:+UseVHI -XX:+UseRVV]
------------------------------------------------------------------------------------
Benchmark    (size)  Mode  Cnt |     Score      Error  |    Score      Error |Units|
------------------------------------------------------------------------------------
bytes             1  avgt   10      11.281 ±    0.005      11.279 ±    0.001  ns/op
bytes            10  avgt   10      35.096 ±    0.027      35.730 ±    0.032  ns/op
bytes           100  avgt   10     246.627 ±    0.144     132.879 ±    0.150  ns/op
bytes          1000  avgt   10    2368.472 ±    2.174     914.207 ±    0.931  ns/op
bytes         10000  avgt   10   23548.070 ±    3.285    8707.273 ±    5.666  ns/op
bytes        100000  avgt   10  236725.770 ±  591.357   86590.456 ±  173.544  ns/op
chars             1  avgt   10      11.282 ±    0.006      11.281 ±    0.005  ns/op
chars            10  avgt   10      35.726 ±    0.013      36.978 ±    0.015  ns/op
chars           100  avgt   10     246.913 ±    0.152     134.704 ±    0.036  ns/op
chars          1000  avgt   10    2370.329 ±   10.804     935.244 ±    0.385  ns/op
chars         10000  avgt   10   23596.177 ±   19.305    9495.412 ±    6.368  ns/op
chars        100000  avgt   10  384796.824 ± 3353.051  155220.554 ± 1753.764  ns/op
ints              1  avgt   10      11.280 ±    0.002      11.280 ±    0.002  ns/op
ints             10  avgt   10      35.774 ±    0.220      36.357 ±    0.014  ns/op
ints            100  avgt   10     246.935 ±    0.112     126.494 ±    0.159  ns/op
ints           1000  avgt   10    2372.602 ±    0.753     818.846 ±    0.253  ns/op
ints          10000  avgt   10   25309.538 ±  106.280    8942.238 ±   65.537  ns/op
ints         100000  avgt   10  409074.598 ± 4049.489   87796.390 ±  545.247  ns/op
multibytes        1  avgt   10       5.137 ±    0.006       5.138 ±    0.003  ns/op
multibytes       10  avgt   10      18.361 ±    0.022      19.618 ±    0.006  ns/op
multibytes      100  avgt   10     132.814 ±    0.543      96.236 ±    0.236  ns/op
multibytes     1000  avgt   10    2160.723 ±   22.749     596.236 ±    1.166  ns/op
multibytes    10000  avgt   10   22195.062 ±  300.592    5749.928 ±    5.748  ns/op
multibytes   100000  avgt   10  205825.738 ± 1340.919   47757.644 ±   80.729  ns/op
multichars        1  avgt   10       4.995 ±    0.003       5.003 ±    0.002  ns/op
multichars       10  avgt   10      18.512 ±    0.015      19.430 ±    0.011  ns/op
multichars      100  avgt   10     230.563 ±    1.320     101.515 ±    0.353  ns/op
multichars     1000  avgt   10    1396.042 ±   16.038     634.955 ±    0.824  ns/op
multichars    10000  avgt   10   13445.146 ±    8.403    5838.638 ±    9.851  ns/op
multichars   100000  avgt   10  127475.457 ±   81.919   50308.336 ±   26.640  ns/op
multiints         1  avgt   10       6.980 ±    2.561       5.017 ±    0.007  ns/op
multiints        10  avgt   10      29.743 ±    6.021      19.479 ±    0.008  ns/op
multiints       100  avgt   10     149.720 ±    0.516     110.728 ±    0.280  ns/op
multiints      1000  avgt   10    1442.903 ±   30.199    1023.673 ±   16.614  ns/op
multiints     10000  avgt   10   22702.792 ±  286.336    5941.205 ±   30.079  ns/op
multiints    100000  avgt   10  127134.718 ±  117.502   48745.440 ±   69.036  ns/op
multishorts       1  avgt   10       5.145 ±    0.009       5.140 ±    0.004  ns/op
multishorts      10  avgt   10      18.506 ±    0.006      19.419 ±    0.006  ns/op
multishorts     100  avgt   10     232.937 ±    2.433     100.298 ±    0.318  ns/op
multishorts    1000  avgt   10    1388.111 ±   16.740     657.008 ±    4.362  ns/op
multishorts   10000  avgt   10   13458.090 ±   10.975    5860.546 ±    8.239  ns/op
multishorts  100000  avgt   10  127463.240 ±  102.736   50534.548 ±   34.661  ns/op
shorts            1  avgt   10      11.280 ±    0.007      11.280 ±    0.004  ns/op
shorts           10  avgt   10      35.721 ±    0.011      62.661 ±    0.533  ns/op
shorts          100  avgt   10     246.942 ±    0.165     135.960 ±    0.029  ns/op
shorts         1000  avgt   10    2368.908 ±    0.955     935.607 ±    0.678  ns/op
shorts        10000  avgt   10   23608.074 ±   22.901    8913.395 ±    5.318  ns/op
shorts       100000  avgt   10  237055.625 ±  532.713   94719.177 ±  352.058  ns/op
------------------------------------------------------------------------------------


robehn commented Apr 30, 2025

Hey, certainly an improvement.

Can you resolve the conflict by merging with master?


ygaevsky commented May 1, 2025

Can you resolve the conflict by merging with master?

Fixed.

@robehn robehn (Contributor) left a comment

I think what @RealFYang is saying:

You don't need to know the vector size, i.e.:

  const int nof_vec_elems = MaxVectorSize;
....
  mv(t1, nof_vec_elems);
  vsetvli(t0, t1, Assembler::e32, Assembler::m4);

You can set vsetvli to cnt rounded down to the nearest 4 bytes,
and let vsetvli process as much as it can per iteration.
It will never process more than vlen, so in the last iteration it may process only 4 bytes.

Here is an example of a memcpy:
https://github.com/riscvarchive/riscv-v-spec/blob/master/example/memcpy.s

This means the main loop is vector register length agnostic.

Now you have 3 or fewer bytes left to process with normal scalar ops.


ygaevsky commented May 5, 2025

As you might expect, I am trying to implement the following code with RVV:

for (; i + (N-1) < cnt; i += N) {
   h =   31^^N     * h 
       + 31^^(N-1) * val[i + 0] 
       + 31^^(N-2) * val[i + 1] 
	   ...
       + 31^^1 * val[i + (N-2)] 
       + 31^^0 * val[i + (N-1)];
}
for (; i < cnt; i++) {
   h = 31 * h + val[i];
}

where N is the number of array elements processed per "chunk".
IIUC, the main issue with your approach is the "reverse" order of array elements versus the preloaded 31^^X coefficients WHEN the remaining number of elements is less than N, say M = N-1:

   h =   31^^M     * h
       + 31^^(M-1) * val[i + 0]
       + 31^^(M-2) * val[i + 1]
       ...
       + 31^^1 * val[i + (M-2)]
       + 31^^0 * val[i + (M-1)];

or returning to our N for clarity

   h =   31^^(N-1)     * h 
       + 31^^(N-2) * val[i + 0] 
       + 31^^(N-3) * val[i + 1] 
	   ...
       + 31^^1 * val[i + (N-3)] 
       + 31^^0 * val[i + (N-2)];

Now we need to "slide down" the preloaded multiplier coefficients in the designated vector register by one (as M = N-1) to keep them in "sync" with val[i + X] (maybe moving them into a temporary VR in the process), and moreover, DO this operation IFF the remaining cnt is less than N (==> an additional check on every iteration). That's probably acceptable as a one-time operation at the tail phase, but NOT inside the main loop...
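
For concreteness, a tiny hypothetical sketch of that one-time tail realignment, reusing the vslidedown_vx and vsetvli helpers from the patch above (the register holding N - M is an assumption, named t1 here only for illustration):

  // Hypothetical: drop the leading (N - M) coefficients so that 31^^(M-1)
  // lines up with val[i + 0]; done once, before the final vector pass.
  vslidedown_vx(v_coeffs, v_coeffs, t1);            // t1 = N - M
  vsetvli(t0, cnt, Assembler::e32, Assembler::m4);  // then process the remaining M elements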


robehn commented May 6, 2025

Hey, I'm sorry for not explaining this properly, maybe this helps:

You have four coefficients, and you want to process a batch of four, OR a multiple of four.
This batch of four we call a lane:

            int lane = array[currentIndex] * m_pow_3 + array[currentIndex + 1] * m_pow_2 + array[currentIndex + 2] * m_pow_1 + array[currentIndex + 3] * m_pow_0;
            hashCode = hashCode * m_pow_4 + lane;

You can process multiple lanes by doing:

            int lane_1 = array[currentIndex  ] * m_pow_3 + array[currentIndex + 1] * m_pow_2 + array[currentIndex + 2] * m_pow_1 + array[currentIndex + 3] * m_pow_0;
            int lane_2 = array[currentIndex+4] * m_pow_3 + array[currentIndex + 5] * m_pow_2 + array[currentIndex + 6] * m_pow_1 + array[currentIndex + 7] * m_pow_0;
            hashCode = hashCode * m_pow_4 + lane_1;
            hashCode = hashCode * m_pow_4 + lane_2;

So, for example, you could lay out the data as below using vlse32.v (strided load).

v2 = array[currentIndex]   | array[currentIndex+4] | .... | array[currentIndex+n*4]
v4 = array[currentIndex+1] | array[currentIndex+5] | .... | array[currentIndex+1+n*4]
v6 = array[currentIndex+2] | array[currentIndex+6] | .... | array[currentIndex+2+n*4]
v8 = array[currentIndex+3] | array[currentIndex+7] | .... | array[currentIndex+3+n*4]
v10 = sum lane 1           | sum lane 2            | .... | sum lane n

Now you can multiply every element in v2 with m_pow_3 without knowing the length of v2 (i.e. LMUL can be 1 or 8).
Then sum each lane into v10, and finally, for each lane, multiply hashCode by m_pow_4 and add that lane sum.

When this is done, you have 0-3 elements left that you can process with scalar ops.

Clarification:

int number_elements_unprocessed_in_4 = count/4;
loop:
vsetvli vl_processing, number_elements_unprocessed_in_4, emul, lmul
...
number_elements_unprocessed_in_4 -= vl_processing;
bnz number_elements_unprocessed_in_4, loop;

vl_processing == number of lanes. There is no need to know the length of the vector registers.

NOTE: I'm not saying this is better or faster than your version - it's hopefully an example of a vector length agnostic approach.
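
To make that layout more concrete, here is a rough MacroAssembler-style sketch (untested; it assumes the assembler exposes a vlse32_v strided load and a vmul_vx vector-scalar multiply, and every register name below is purely illustrative):

  // Gather one slot of each 4-int lane into its own vector register, using a
  // 16-byte stride; t0 receives vl = the number of lanes handled this pass.
  mv(t1, 16);                                        // byte stride between consecutive lanes
  vsetvli(t0, lanes, Assembler::e32, Assembler::m1);
  vlse32_v(v2, ary, t1);                             // array[i+0], array[i+4], ...
  addi(ary, ary, 4);
  vlse32_v(v4, ary, t1);                             // array[i+1], array[i+5], ...
  addi(ary, ary, 4);
  vlse32_v(v6, ary, t1);                             // array[i+2], array[i+6], ...
  addi(ary, ary, 4);
  vlse32_v(v8, ary, t1);                             // array[i+3], array[i+7], ...
  // v10[k] = lane_k = v2[k]*31^3 + v4[k]*31^2 + v6[k]*31 + v8[k]
  vmul_vx(v10, v2, pow3);                            // pow3..pow1 hold 31^3, 31^2, 31
  vmul_vx(v12, v4, pow2);
  vadd_vv(v10, v10, v12);
  vmul_vx(v12, v6, pow1);
  vadd_vv(v10, v10, v12);
  vadd_vv(v10, v10, v8);
  // hashCode is then folded lane by lane: hashCode = hashCode * 31^4 + v10[k]
  // (pointer bookkeeping for the next pass is omitted here)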



robehn commented Jun 4, 2025

@ygaevsky @RealFYang how can we proceed?


ygaevsky commented Jun 5, 2025

@ygaevsky @RealFYang how can we proceed?

My apologies, just busy at the moment with other things, going to update the patch soon.

