
reflect change in rotate! from LinearAlgebra.jl #603


Open
wants to merge 6 commits into master

Conversation

@tam724 (Contributor) commented Jul 1, 2025

This reflects a change in LinearAlgebra.jl (JuliaLang/LinearAlgebra.jl#1323). The previous implementation of rotate! did not respect false as a strong zero. For consistency, the implementation in GPUArrays.jl should also be adapted.

Essentially, this covers the edge case:

julia> using GPUArrays, JLArrays

julia> v1 = JLArray([NaN, NaN]);
julia> v2 = JLArray([NaN, NaN]);

julia> GPUArrays.rotate!(v1, v2, false, false) # before PR
([0.0, 0.0], [NaN, NaN])

julia> GPUArrays.rotate!(v1, v2, false, false) # with PR
([0.0, 0.0], [0.0, 0.0])
  • Should the NaN / false behaviour be covered by tests?
  • Should this wait until the new LinearAlgebra.jl version is released?
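
For context (an addition for illustration, not part of the original description): the edge case relies on Julia treating the Bool false as a strong zero in multiplication, which an ordinary floating-point zero is not:

julia> false * NaN
0.0

julia> 0.0 * NaN
NaN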

github-actions bot (Contributor) commented Jul 1, 2025

Your PR requires formatting changes to meet the project's style guidelines.
Please consider running Runic (git runic master) to apply these changes.

Suggested changes:
diff --git a/src/host/linalg.jl b/src/host/linalg.jl
index 4933839..a21d694 100644
--- a/src/host/linalg.jl
+++ b/src/host/linalg.jl
@@ -687,8 +687,8 @@ function LinearAlgebra.rotate!(x::AbstractGPUArray, y::AbstractGPUArray, c::Numb
         i = @index(Global, Linear)
         @inbounds xi = x[i]
         @inbounds yi = y[i]
-        @inbounds x[i] = s*yi +      c *xi
-        @inbounds y[i] = c*yi - conj(s)*xi 
+        @inbounds x[i] = s * yi + c * xi
+        @inbounds y[i] = c * yi - conj(s) * xi
     end
     rotate_kernel!(get_backend(x))(x, y, c, s; ndrange = size(x))
     return x, y
diff --git a/test/testsuite.jl b/test/testsuite.jl
index e7c8d9d..171eeed 100644
--- a/test/testsuite.jl
+++ b/test/testsuite.jl
@@ -63,7 +63,7 @@ function out_has_NaNs(f, AT::Type{<:AbstractGPUArray}, xs...)
     arg_out = f(arg_in...)
     return has_NaNs(arg_out)
 end
-   
+
 # element types that are supported by the array type
 supported_eltypes(AT, test) = supported_eltypes(AT)
 supported_eltypes(AT) = supported_eltypes()

@maleadt (Member) commented Jul 2, 2025

Thanks! Mind adding those examples as a test?

@tam724 (Contributor, Author) commented Jul 3, 2025

Not sure if this is the right approach. Should we also compare the "strong zero"/NaN behaviour against LinearAlgebra.jl, or should it be guaranteed only by GPUArrays.jl? In other words: should the tests use compare, or should we test for !any(isnan, ...) in the output arrays?

For consistency I added similar tests for lmul, rmul, axp{b}y, rotate, reflect and mul.
To have them pass locally I had to exclude rotate (which is fixed on LinearAlgebra.jl master) and mul for Float16 (another inconsistency in LinearAlgebra).

Essentially, this now tests whether multiplication with false stops the propagation of NaNs: if any NaNs remained in the output arrays of both GPUArrays and LinearAlgebra, the tests would still fail, because NaN ≈ NaN gives false. If we wanted to compare the NaN treatment itself, we would have to compare with isequal (see: https://docs.julialang.org/en/v1/base/numbers/#Base.NaN).
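
For reference (added for illustration), the two comparison modes in a plain REPL session:

julia> NaN ≈ NaN
false

julia> isequal(NaN, NaN)
true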

@maleadt (Member) commented Jul 4, 2025

Ideally we test compatibility against Array/stdlibs, but if that behavior is clearly buggy it's fine to include VERSION checks to gate tests. So the most complete solution would be to use isnan on older versions and compare on newer ones. I'd normally suggest to only do the latter, but since the new behaviour isn't part of any released Julia version, we should at least have some tests checking isnan.

@tam724 (Contributor, Author) commented Jul 5, 2025

This should do it. The compare tests that currently fail because of LinearAlgebra.jl bugs are skipped, and only the implementation for AbstractGPUArrays is tested. Once the fix is released in LinearAlgebra, the compare calls can be gated with VERSION checks and the !out_has_NaNs checks can be removed.
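
A minimal sketch of what such a gated test could look like, assuming JLArray as the array type under test and a placeholder VERSION bound (the LinearAlgebra.jl release containing the fix is not pinned down here); illustrative only, not the actual testsuite code:

using Test, LinearAlgebra, JLArrays

@testset "rotate! respects false as a strong zero" begin
    x, y = JLArray([NaN, NaN]), JLArray([NaN, NaN])
    LinearAlgebra.rotate!(x, y, false, false)
    # always check that no NaNs survive multiplication with false
    @test !any(isnan, Array(x)) && !any(isnan, Array(y))
    # once a LinearAlgebra.jl release contains the fix, also compare against
    # the CPU reference (v"1.13.0-" is a placeholder bound, not the real one)
    if VERSION >= v"1.13.0-"
        xr, yr = [NaN, NaN], [NaN, NaN]
        LinearAlgebra.rotate!(xr, yr, false, false)
        @test isequal(Array(x), xr) && isequal(Array(y), yr)
    end
end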

@maleadt (Member) commented Jul 7, 2025

Thanks. CUDA.jl and Metal.jl CI failures look related though.

@tam724 (Contributor, Author) commented Jul 7, 2025

Sure, they are related (oneAPI.jl as well). GPUArrays.jl now tests for the false/NaN behaviour, and CUDA.jl uses the GPUArrays.jl testsuite but overwrites the LinearAlgebra.* routines with CUBLAS calls, which do not respect false/NaN, so the tests fail.

This probably brings me back to JuliaGPU/CUDA.jl#2607 (for CUDA.jl).

I see two solutions (CUDA.jl for now, this is the only library I can test locally):

  • Guard the CUBLAS calls with dispatches on Bool and implement the Base-consistent false/NaN treatment (a sketch follows this list).
  • Delete the LinearAlgebra.* overwrites in CUBLAS and fall back to the generic implementation in GPUArrays.jl.
  • (Or don't test this behaviour at all; LinearAlgebra.jl also only does so to some extent.)
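
A rough sketch of the first option for CUDA.jl (only the both-Bool case shown): dispatch on Bool coefficients and fall back to the generic GPUArrays.jl kernel, which honours false as a strong zero. The method below is illustrative and reuses the invoke pattern from the benchmarks; in CUDA.jl proper the signature would have to match the existing CUBLAS methods exactly to avoid ambiguities.

using LinearAlgebra, CUDA, GPUArrays

# Hypothetical guard: when both coefficients are Bool, skip CUBLAS and call
# the generic GPUArrays.jl implementation instead.
function LinearAlgebra.rotate!(x::CUDA.StridedCuArray, y::CUDA.StridedCuArray,
                               c::Bool, s::Bool)
    return invoke(LinearAlgebra.rotate!,
                  Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number},
                  x, y, c, s)
end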

To me, using the GPUArrays.jl implementation sounds like the cleanest option, if performance is comparable. I ran a few benchmarks (not really representative, on an A2000 Laptop GPU).

Benchmark Results
julia> using CUDA, GPUArrays, LinearAlgebra, BenchmarkTools

julia> CUDA.versioninfo()
CUDA runtime 12.9, artifact installation
CUDA driver 12.9
NVIDIA driver 575.64.0

CUDA libraries: 
- CUBLAS: 12.9.1
- CURAND: 10.3.10
- CUFFT: 11.4.1
- CUSOLVER: 11.7.5
- CUSPARSE: 12.5.10
- CUPTI: 2025.2.1 (API 28.0.0)
- NVML: 12.0.0+575.64

Julia packages: 
- CUDA: 5.8.2
- CUDA_Driver_jll: 0.13.1+0
- CUDA_Runtime_jll: 0.17.1+0

Toolchain:
- Julia: 1.11.5
- LLVM: 16.0.6

1 device:
  0: NVIDIA RTX A2000 Laptop GPU (sm_86, 3.669 GiB / 4.000 GiB available)

julia> x_10, y_10 = CUDA.rand(10), CUDA.rand(10);
julia> x_1K, y_1K = CUDA.rand(1000), CUDA.rand(1000);
julia> x_100K, y_100K = CUDA.rand(100000), CUDA.rand(100000);
julia> x_1M, y_1M = CUDA.rand(1000000), CUDA.rand(1000000);
julia> x_10M, y_10M = CUDA.rand(10000000), CUDA.rand(10000000);
julia> x_100M, y_100M = CUDA.rand(100000000), CUDA.rand(100000000);
julia> a, b = rand(), rand();


       # rmul!
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rmul!), $(Tuple{GPUArrays.AbstractGPUArray, Number}), $(x_10), $(a));
  9.394 μs (49 allocations: 1.25 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rmul!), $(Tuple{CUDA.StridedCuArray{Float32}, Number}), $(x_10), $(a));
  9.844 μs (25 allocations: 464 bytes)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.rmul!), $(Tuple{GPUArrays.AbstractGPUArray, Number}), $(x_1K), $(a));
  10.384 μs (55 allocations: 1.34 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rmul!), $(Tuple{CUDA.StridedCuArray{Float32}, Number}), $(x_1K), $(a));
  9.922 μs (25 allocations: 464 bytes)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.rmul!), $(Tuple{GPUArrays.AbstractGPUArray, Number}), $(x_100K), $(a));
  15.100 μs (55 allocations: 1.34 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rmul!), $(Tuple{CUDA.StridedCuArray{Float32}, Number}), $(x_100K), $(a));
  12.278 μs (25 allocations: 464 bytes)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.rmul!), $(Tuple{GPUArrays.AbstractGPUArray, Number}), $(x_1M), $(a));
  406.892 μs (56 allocations: 1.36 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rmul!), $(Tuple{CUDA.StridedCuArray{Float32}, Number}), $(x_1M), $(a));
  405.526 μs (25 allocations: 464 bytes)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.rmul!), $(Tuple{GPUArrays.AbstractGPUArray, Number}), $(x_10M), $(a));
  3.910 ms (57 allocations: 1.38 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rmul!), $(Tuple{CUDA.StridedCuArray{Float32}, Number}), $(x_10M), $(a));
  3.914 ms (26 allocations: 480 bytes)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.rmul!), $(Tuple{GPUArrays.AbstractGPUArray, Number}), $(x_100M), $(a));
  39.106 ms (57 allocations: 1.38 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rmul!), $(Tuple{CUDA.StridedCuArray{Float32}, Number}), $(x_100M), $(a));
  39.055 ms (26 allocations: 480 bytes)


       # axpby!
julia> @btime CUDA.@sync invoke($(LinearAlgebra.axpby!), $(Tuple{Number, GPUArrays.AbstractGPUArray, Number, GPUArrays.AbstractGPUArray}), $(a), $(x_10), $(b), $(y_10));
  13.757 μs (92 allocations: 2.47 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.axpby!), $(Tuple{Number, CUDA.StridedCuArray{Float32}, Number, CUDA.StridedCuArray{Float32}}), $(a), $(x_10), $(b), $(y_10));
  17.939 μs (55 allocations: 1008 bytes)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.axpby!), $(Tuple{Number, GPUArrays.AbstractGPUArray, Number, GPUArrays.AbstractGPUArray}), $(a), $(x_1K), $(b), $(y_1K));
  15.993 μs (98 allocations: 2.56 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.axpby!), $(Tuple{Number, CUDA.StridedCuArray{Float32}, Number, CUDA.StridedCuArray{Float32}}), $(a), $(x_1K), $(b), $(y_1K));
  18.613 μs (55 allocations: 1008 bytes)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.axpby!), $(Tuple{Number, GPUArrays.AbstractGPUArray, Number, GPUArrays.AbstractGPUArray}), $(a), $(x_100K), $(b), $(y_100K));
  24.866 μs (98 allocations: 2.56 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.axpby!), $(Tuple{Number, CUDA.StridedCuArray{Float32}, Number, CUDA.StridedCuArray{Float32}}), $(a), $(x_100K), $(b), $(y_100K));
  21.617 μs (55 allocations: 1008 bytes)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.axpby!), $(Tuple{Number, GPUArrays.AbstractGPUArray, Number, GPUArrays.AbstractGPUArray}), $(a), $(x_1M), $(b), $(y_1M));
  605.886 μs (99 allocations: 2.58 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.axpby!), $(Tuple{Number, CUDA.StridedCuArray{Float32}, Number, CUDA.StridedCuArray{Float32}}), $(a), $(x_1M), $(b), $(y_1M));
  999.222 μs (56 allocations: 1.00 KiB)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.axpby!), $(Tuple{Number, GPUArrays.AbstractGPUArray, Number, GPUArrays.AbstractGPUArray}), $(a), $(x_10M), $(b), $(y_10M));
  5.839 ms (100 allocations: 2.59 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.axpby!), $(Tuple{Number, CUDA.StridedCuArray{Float32}, Number, CUDA.StridedCuArray{Float32}}), $(a), $(x_10M), $(b), $(y_10M));
  9.702 ms (56 allocations: 1.00 KiB)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.axpby!), $(Tuple{Number, GPUArrays.AbstractGPUArray, Number, GPUArrays.AbstractGPUArray}), $(a), $(x_100M), $(b), $(y_100M));
  57.918 ms (100 allocations: 2.59 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.axpby!), $(Tuple{Number, CUDA.StridedCuArray{Float32}, Number, CUDA.StridedCuArray{Float32}}), $(a), $(x_100M), $(b), $(y_100M));
  97.206 ms (56 allocations: 1.00 KiB)


       # reflect!
julia> a, b = rand() .|> (sin, cos);

julia> @btime CUDA.@sync invoke($(LinearAlgebra.reflect!), $(Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number}), $(x_10), $(y_10), $(a), $(b));
  11.870 μs (65 allocations: 1.61 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.reflect!), $(Tuple{CUDA.StridedCuArray{Float32}, CUDA.StridedCuArray{Float32}, Number, Number}), $(x_10), $(y_10), $(a), $(b));
  22.192 μs (69 allocations: 1.28 KiB)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.reflect!), $(Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number}), $(x_1K), $(y_1K), $(a), $(b));
  14.273 μs (71 allocations: 1.70 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.reflect!), $(Tuple{CUDA.StridedCuArray{Float32}, CUDA.StridedCuArray{Float32}, Number, Number}), $(x_1K), $(y_1K), $(a), $(b));
  22.958 μs (69 allocations: 1.28 KiB)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.reflect!), $(Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number}), $(x_100K), $(y_100K), $(a), $(b));
  27.170 μs (71 allocations: 1.70 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.reflect!), $(Tuple{CUDA.StridedCuArray{Float32}, CUDA.StridedCuArray{Float32}, Number, Number}), $(x_100K), $(y_100K), $(a), $(b));
  25.417 μs (69 allocations: 1.28 KiB)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.reflect!), $(Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number}), $(x_10M), $(y_10M), $(a), $(b));
  7.886 ms (73 allocations: 1.73 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.reflect!), $(Tuple{CUDA.StridedCuArray{Float32}, CUDA.StridedCuArray{Float32}, Number, Number}), $(x_10M), $(y_10M), $(a), $(b));
  11.817 ms (70 allocations: 1.30 KiB)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.reflect!), $(Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number}), $(x_1M), $(y_1M), $(a), $(b));
  806.112 μs (73 allocations: 1.73 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.reflect!), $(Tuple{CUDA.StridedCuArray{Float32}, CUDA.StridedCuArray{Float32}, Number, Number}), $(x_1M), $(y_1M), $(a), $(b));
  1.209 ms (70 allocations: 1.30 KiB)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.reflect!), $(Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number}), $(x_100M), $(y_100M), $(a), $(b));
  78.336 ms (73 allocations: 1.73 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.reflect!), $(Tuple{CUDA.StridedCuArray{Float32}, CUDA.StridedCuArray{Float32}, Number, Number}), $(x_100M), $(y_100M), $(a), $(b));
  117.388 ms (70 allocations: 1.30 KiB)

       # rotate!
julia> a, b = rand() .|> (sin, cos);

julia> @btime CUDA.@sync invoke($(LinearAlgebra.rotate!), $(Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number}), $(x_10), $(y_10), $(a), $(b));
  11.561 μs (65 allocations: 1.61 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rotate!), $(Tuple{CUDA.StridedCuArray{Float32}, CUDA.StridedCuArray{Float32}, Number, Number}), $(x_10), $(y_10), $(a), $(b));
  15.658 μs (45 allocations: 864 bytes)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.rotate!), $(Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number}), $(x_1K), $(y_1K), $(a), $(b));
  14.496 μs (71 allocations: 1.70 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rotate!), $(Tuple{CUDA.StridedCuArray{Float32}, CUDA.StridedCuArray{Float32}, Number, Number}), $(x_1K), $(y_1K), $(a), $(b));
  15.932 μs (45 allocations: 864 bytes)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.rotate!), $(Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number}), $(x_100K), $(y_100K), $(a), $(b));
  27.289 μs (71 allocations: 1.70 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rotate!), $(Tuple{CUDA.StridedCuArray{Float32}, CUDA.StridedCuArray{Float32}, Number, Number}), $(x_100K), $(y_100K), $(a), $(b));
  20.839 μs (45 allocations: 864 bytes)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.rotate!), $(Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number}), $(x_1M), $(y_1M), $(a), $(b));
  806.077 μs (73 allocations: 1.73 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rotate!), $(Tuple{CUDA.StridedCuArray{Float32}, CUDA.StridedCuArray{Float32}, Number, Number}), $(x_1M), $(y_1M), $(a), $(b));
  811.260 μs (46 allocations: 880 bytes)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.rotate!), $(Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number}), $(x_10M), $(y_10M), $(a), $(b));
  7.900 ms (73 allocations: 1.73 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rotate!), $(Tuple{CUDA.StridedCuArray{Float32}, CUDA.StridedCuArray{Float32}, Number, Number}), $(x_10M), $(y_10M), $(a), $(b));
  7.895 ms (46 allocations: 880 bytes)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.rotate!), $(Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number}), $(x_100M), $(y_100M), $(a), $(b));
  78.331 ms (73 allocations: 1.73 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rotate!), $(Tuple{CUDA.StridedCuArray{Float32}, CUDA.StridedCuArray{Float32}, Number, Number}), $(x_100M), $(y_100M), $(a), $(b));
  78.374 ms (46 allocations: 880 bytes)

Main observations:

  • For comparable kernels there is no real runtime difference between the GPUArrays.jl-generated and the cuBLAS versions here, but the GPUArrays.jl path has more CPU-side allocations.
  • The "fused" kernels in GPUArrays.jl (axpby! and reflect!) are faster than the CUDA.jl implementations, which use two cuBLAS kernel calls each.

It seems we could simply delete the overwrites for rmul!, axpby!, rotate! and reflect! in CUDA.jl and rely on the generic GPUArrays.jl implementation. What do you think?

@tam724 requested a review from maleadt July 8, 2025 07:23

@tam724 (Contributor, Author) commented Jul 8, 2025

Sorry, review request was a misclick.
