
reflect change in rotate! from LinearAlgebra.jl #603


Open
wants to merge 6 commits into master

Conversation

@tam724 (Contributor) commented Jul 1, 2025

This reflects a change in LinearAlgebra.jl (JuliaLang/LinearAlgebra.jl#1323). The previous implementation of rotate! did not respect false as a strong zero. For consistency, the implementation in GPUArrays.jl should also be adapted.

Essentially, this covers the edge case:

julia> using GPUArrays, JLArrays

julia> v1 = JLArray([NaN, NaN]);
julia> v2 = JLArray([NaN, NaN]);

julia> GPUArrays.rotate!(v1, v2, false, false) # before PR
([0.0, 0.0], [NaN, NaN])

julia> GPUArrays.rotate!(v1, v2, false, false) # with PR
([0.0, 0.0], [0.0, 0.0])
  • Should the NaN / false behaviour be covered by tests?
  • Should this wait until the new LinearAlgebra.jl version is released?
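
For context (an addition for illustration, not part of the original description): the edge case relies on Julia treating the Bool false as a strong zero in multiplication, which an ordinary floating-point zero is not:

julia> false * NaN
0.0

julia> 0.0 * NaN
NaN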

github-actions bot (Contributor) commented Jul 1, 2025

Your PR requires formatting changes to meet the project's style guidelines.
Please consider running Runic (git runic master) to apply these changes.

Suggested changes:
diff --git a/src/host/linalg.jl b/src/host/linalg.jl
index 4933839..a21d694 100644
--- a/src/host/linalg.jl
+++ b/src/host/linalg.jl
@@ -687,8 +687,8 @@ function LinearAlgebra.rotate!(x::AbstractGPUArray, y::AbstractGPUArray, c::Numb
         i = @index(Global, Linear)
         @inbounds xi = x[i]
         @inbounds yi = y[i]
-        @inbounds x[i] = s*yi +      c *xi
-        @inbounds y[i] = c*yi - conj(s)*xi 
+        @inbounds x[i] = s * yi + c * xi
+        @inbounds y[i] = c * yi - conj(s) * xi
     end
     rotate_kernel!(get_backend(x))(x, y, c, s; ndrange = size(x))
     return x, y
diff --git a/test/testsuite.jl b/test/testsuite.jl
index e7c8d9d..171eeed 100644
--- a/test/testsuite.jl
+++ b/test/testsuite.jl
@@ -63,7 +63,7 @@ function out_has_NaNs(f, AT::Type{<:AbstractGPUArray}, xs...)
     arg_out = f(arg_in...)
     return has_NaNs(arg_out)
 end
-   
+
 # element types that are supported by the array type
 supported_eltypes(AT, test) = supported_eltypes(AT)
 supported_eltypes(AT) = supported_eltypes()

@maleadt (Member) commented Jul 2, 2025

Thanks! Mind adding those examples as a test?

@tam724 (Contributor, Author) commented Jul 3, 2025

Not sure if this is the right approach. Should we also compare the "strong zero"/NaN behaviour against LinearAlgebra.jl, or should it be guaranteed only by GPUArrays.jl? In other words: should the tests use compare, or should we test for !any(isnan, ...) in the output arrays?

For consistency I added similar tests for lmul, rmul, axp{b}y, rotate, reflect and mul.
To have them pass locally I had to exclude rotate (which is fixed on LinearAlgebra.jl master) and mul for Float16 (another inconsistency in LinearAlgebra).

Essentially, this now tests whether multiplication with false stops the propagation of NaNs: if any NaNs remained in the output arrays of both GPUArrays and LinearAlgebra, the tests would still fail, because NaN ≈ NaN gives false. If we wanted to compare the NaN treatment itself, we would have to compare with isequal (see: https://docs.julialang.org/en/v1/base/numbers/#Base.NaN).
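
For reference (added for illustration), the two comparison modes in a plain REPL session:

julia> NaN ≈ NaN
false

julia> isequal(NaN, NaN)
true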

@maleadt (Member) commented Jul 4, 2025

Ideally we test compatibility against Array/stdlibs, but if that behavior is clearly buggy it's fine to include VERSION checks to gate tests. So the most complete solution would be to use isnan on older versions and compare on newer ones. I'd normally suggest to only do the latter, but since the new behaviour isn't part of any released Julia version, we should at least have some tests checking isnan.

@tam724 (Contributor, Author) commented Jul 5, 2025

This should do it. The compare tests that currently fail because of LinearAlgebra.jl bugs are skipped, and only the implementation for AbstractGPUArrays is tested. Once the fix is released in LinearAlgebra, the compare calls can be gated with VERSION checks and the !out_has_NaNs checks can be removed.
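
A minimal sketch of what such a gated test could look like, assuming JLArray as the array type under test and a placeholder VERSION bound (the LinearAlgebra.jl release containing the fix is not pinned down here); illustrative only, not the actual testsuite code:

using Test, LinearAlgebra, JLArrays

@testset "rotate! respects false as a strong zero" begin
    x, y = JLArray([NaN, NaN]), JLArray([NaN, NaN])
    LinearAlgebra.rotate!(x, y, false, false)
    # always check that no NaNs survive multiplication with false
    @test !any(isnan, Array(x)) && !any(isnan, Array(y))
    # once a LinearAlgebra.jl release contains the fix, also compare against
    # the CPU reference (v"1.13.0-" is a placeholder bound, not the real one)
    if VERSION >= v"1.13.0-"
        xr, yr = [NaN, NaN], [NaN, NaN]
        LinearAlgebra.rotate!(xr, yr, false, false)
        @test isequal(Array(x), xr) && isequal(Array(y), yr)
    end
end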

@maleadt (Member) commented Jul 7, 2025

Thanks. CUDA.jl and Metal.jl CI failures look related though.

@tam724 (Contributor, Author) commented Jul 7, 2025

Sure, they are related (oneAPI.jl as well). GPUArrays.jl now tests for the false/NaN behaviour, and CUDA.jl uses the GPUArrays.jl testsuite but overwrites the LinearAlgebra.* routines with CUBLAS calls, which do not respect false/NaN, so the tests fail.

This probably brings me back to JuliaGPU/CUDA.jl#2607 (for CUDA.jl).

I see two solutions (CUDA.jl for now, this is the only library I can test locally):

  • Guard the CUBLAS calls with dispatches on Bool and implement the Base-consistent false/NaN treatment (a sketch follows this list).
  • Delete the LinearAlgebra.* overwrites in CUBLAS and fall back to the generic implementation in GPUArrays.jl.
  • (Or don't test this behaviour at all; LinearAlgebra.jl also only does so to some extent.)
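
A rough sketch of the first option for CUDA.jl (only the both-Bool case shown): dispatch on Bool coefficients and fall back to the generic GPUArrays.jl kernel, which honours false as a strong zero. The method below is illustrative and reuses the invoke pattern from the benchmarks; in CUDA.jl proper the signature would have to match the existing CUBLAS methods exactly to avoid ambiguities.

using LinearAlgebra, CUDA, GPUArrays

# Hypothetical guard: when both coefficients are Bool, skip CUBLAS and call
# the generic GPUArrays.jl implementation instead.
function LinearAlgebra.rotate!(x::CUDA.StridedCuArray, y::CUDA.StridedCuArray,
                               c::Bool, s::Bool)
    return invoke(LinearAlgebra.rotate!,
                  Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number},
                  x, y, c, s)
end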

To me, using the GPUArrays.jl implementation sounds like the cleanest option, if performance is comparable. I ran a few benchmarks (not really representative, on an A2000 Laptop GPU).

Benchmark Results
julia> using CUDA, GPUArrays, LinearAlgebra, BenchmarkTools

julia> CUDA.versioninfo()
CUDA runtime 12.9, artifact installation
CUDA driver 12.9
NVIDIA driver 575.64.0

CUDA libraries: 
- CUBLAS: 12.9.1
- CURAND: 10.3.10
- CUFFT: 11.4.1
- CUSOLVER: 11.7.5
- CUSPARSE: 12.5.10
- CUPTI: 2025.2.1 (API 28.0.0)
- NVML: 12.0.0+575.64

Julia packages: 
- CUDA: 5.8.2
- CUDA_Driver_jll: 0.13.1+0
- CUDA_Runtime_jll: 0.17.1+0

Toolchain:
- Julia: 1.11.5
- LLVM: 16.0.6

1 device:
  0: NVIDIA RTX A2000 Laptop GPU (sm_86, 3.669 GiB / 4.000 GiB available)

julia> x_10, y_10 = CUDA.rand(10), CUDA.rand(10);
julia> x_1K, y_1K = CUDA.rand(1000), CUDA.rand(1000);
julia> x_100K, y_100K = CUDA.rand(100000), CUDA.rand(100000);
julia> x_1M, y_1M = CUDA.rand(1000000), CUDA.rand(1000000);
julia> x_10M, y_10M = CUDA.rand(10000000), CUDA.rand(10000000);
julia> x_100M, y_100M = CUDA.rand(100000000), CUDA.rand(100000000);
julia> a, b = rand(), rand();


       # rmul!
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rmul!), $(Tuple{GPUArrays.AbstractGPUArray, Number}), $(x_10), $(a));
  9.394 μs (49 allocations: 1.25 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rmul!), $(Tuple{CUDA.StridedCuArray{Float32}, Number}), $(x_10), $(a));
  9.844 μs (25 allocations: 464 bytes)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.rmul!), $(Tuple{GPUArrays.AbstractGPUArray, Number}), $(x_1K), $(a));
  10.384 μs (55 allocations: 1.34 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rmul!), $(Tuple{CUDA.StridedCuArray{Float32}, Number}), $(x_1K), $(a));
  9.922 μs (25 allocations: 464 bytes)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.rmul!), $(Tuple{GPUArrays.AbstractGPUArray, Number}), $(x_100K), $(a));
  15.100 μs (55 allocations: 1.34 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rmul!), $(Tuple{CUDA.StridedCuArray{Float32}, Number}), $(x_100K), $(a));
  12.278 μs (25 allocations: 464 bytes)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.rmul!), $(Tuple{GPUArrays.AbstractGPUArray, Number}), $(x_1M), $(a));
  406.892 μs (56 allocations: 1.36 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rmul!), $(Tuple{CUDA.StridedCuArray{Float32}, Number}), $(x_1M), $(a));
  405.526 μs (25 allocations: 464 bytes)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.rmul!), $(Tuple{GPUArrays.AbstractGPUArray, Number}), $(x_10M), $(a));
  3.910 ms (57 allocations: 1.38 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rmul!), $(Tuple{CUDA.StridedCuArray{Float32}, Number}), $(x_10M), $(a));
  3.914 ms (26 allocations: 480 bytes)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.rmul!), $(Tuple{GPUArrays.AbstractGPUArray, Number}), $(x_100M), $(a));
  39.106 ms (57 allocations: 1.38 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rmul!), $(Tuple{CUDA.StridedCuArray{Float32}, Number}), $(x_100M), $(a));
  39.055 ms (26 allocations: 480 bytes)


       # axpby!
julia> @btime CUDA.@sync invoke($(LinearAlgebra.axpby!), $(Tuple{Number, GPUArrays.AbstractGPUArray, Number, GPUArrays.AbstractGPUArray}), $(a), $(x_10), $(b), $(y_10));
  13.757 μs (92 allocations: 2.47 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.axpby!), $(Tuple{Number, CUDA.StridedCuArray{Float32}, Number, CUDA.StridedCuArray{Float32}}), $(a), $(x_10), $(b), $(y_10));
  17.939 μs (55 allocations: 1008 bytes)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.axpby!), $(Tuple{Number, GPUArrays.AbstractGPUArray, Number, GPUArrays.AbstractGPUArray}), $(a), $(x_1K), $(b), $(y_1K));
  15.993 μs (98 allocations: 2.56 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.axpby!), $(Tuple{Number, CUDA.StridedCuArray{Float32}, Number, CUDA.StridedCuArray{Float32}}), $(a), $(x_1K), $(b), $(y_1K));
  18.613 μs (55 allocations: 1008 bytes)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.axpby!), $(Tuple{Number, GPUArrays.AbstractGPUArray, Number, GPUArrays.AbstractGPUArray}), $(a), $(x_100K), $(b), $(y_100K));
  24.866 μs (98 allocations: 2.56 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.axpby!), $(Tuple{Number, CUDA.StridedCuArray{Float32}, Number, CUDA.StridedCuArray{Float32}}), $(a), $(x_100K), $(b), $(y_100K));
  21.617 μs (55 allocations: 1008 bytes)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.axpby!), $(Tuple{Number, GPUArrays.AbstractGPUArray, Number, GPUArrays.AbstractGPUArray}), $(a), $(x_1M), $(b), $(y_1M));
  605.886 μs (99 allocations: 2.58 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.axpby!), $(Tuple{Number, CUDA.StridedCuArray{Float32}, Number, CUDA.StridedCuArray{Float32}}), $(a), $(x_1M), $(b), $(y_1M));
  999.222 μs (56 allocations: 1.00 KiB)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.axpby!), $(Tuple{Number, GPUArrays.AbstractGPUArray, Number, GPUArrays.AbstractGPUArray}), $(a), $(x_10M), $(b), $(y_10M));
  5.839 ms (100 allocations: 2.59 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.axpby!), $(Tuple{Number, CUDA.StridedCuArray{Float32}, Number, CUDA.StridedCuArray{Float32}}), $(a), $(x_10M), $(b), $(y_10M));
  9.702 ms (56 allocations: 1.00 KiB)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.axpby!), $(Tuple{Number, GPUArrays.AbstractGPUArray, Number, GPUArrays.AbstractGPUArray}), $(a), $(x_100M), $(b), $(y_100M));
  57.918 ms (100 allocations: 2.59 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.axpby!), $(Tuple{Number, CUDA.StridedCuArray{Float32}, Number, CUDA.StridedCuArray{Float32}}), $(a), $(x_100M), $(b), $(y_100M));
  97.206 ms (56 allocations: 1.00 KiB)


       # reflect!
julia> a, b = rand() .|> (sin, cos);

julia> @btime CUDA.@sync invoke($(LinearAlgebra.reflect!), $(Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number}), $(x_10), $(y_10), $(a), $(b));
  11.870 μs (65 allocations: 1.61 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.reflect!), $(Tuple{CUDA.StridedCuArray{Float32}, CUDA.StridedCuArray{Float32}, Number, Number}), $(x_10), $(y_10), $(a), $(b));
  22.192 μs (69 allocations: 1.28 KiB)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.reflect!), $(Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number}), $(x_1K), $(y_1K), $(a), $(b));
  14.273 μs (71 allocations: 1.70 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.reflect!), $(Tuple{CUDA.StridedCuArray{Float32}, CUDA.StridedCuArray{Float32}, Number, Number}), $(x_1K), $(y_1K), $(a), $(b));
  22.958 μs (69 allocations: 1.28 KiB)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.reflect!), $(Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number}), $(x_100K), $(y_100K), $(a), $(b));
  27.170 μs (71 allocations: 1.70 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.reflect!), $(Tuple{CUDA.StridedCuArray{Float32}, CUDA.StridedCuArray{Float32}, Number, Number}), $(x_100K), $(y_100K), $(a), $(b));
  25.417 μs (69 allocations: 1.28 KiB)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.reflect!), $(Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number}), $(x_10M), $(y_10M), $(a), $(b));
  7.886 ms (73 allocations: 1.73 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.reflect!), $(Tuple{CUDA.StridedCuArray{Float32}, CUDA.StridedCuArray{Float32}, Number, Number}), $(x_10M), $(y_10M), $(a), $(b));
  11.817 ms (70 allocations: 1.30 KiB)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.reflect!), $(Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number}), $(x_1M), $(y_1M), $(a), $(b));
  806.112 μs (73 allocations: 1.73 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.reflect!), $(Tuple{CUDA.StridedCuArray{Float32}, CUDA.StridedCuArray{Float32}, Number, Number}), $(x_1M), $(y_1M), $(a), $(b));
  1.209 ms (70 allocations: 1.30 KiB)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.reflect!), $(Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number}), $(x_100M), $(y_100M), $(a), $(b));
  78.336 ms (73 allocations: 1.73 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.reflect!), $(Tuple{CUDA.StridedCuArray{Float32}, CUDA.StridedCuArray{Float32}, Number, Number}), $(x_100M), $(y_100M), $(a), $(b));
  117.388 ms (70 allocations: 1.30 KiB)

       # rotate!
julia> a, b = rand() .|> (sin, cos);

julia> @btime CUDA.@sync invoke($(LinearAlgebra.rotate!), $(Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number}), $(x_10), $(y_10), $(a), $(b));
  11.561 μs (65 allocations: 1.61 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rotate!), $(Tuple{CUDA.StridedCuArray{Float32}, CUDA.StridedCuArray{Float32}, Number, Number}), $(x_10), $(y_10), $(a), $(b));
  15.658 μs (45 allocations: 864 bytes)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.rotate!), $(Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number}), $(x_1K), $(y_1K), $(a), $(b));
  14.496 μs (71 allocations: 1.70 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rotate!), $(Tuple{CUDA.StridedCuArray{Float32}, CUDA.StridedCuArray{Float32}, Number, Number}), $(x_1K), $(y_1K), $(a), $(b));
  15.932 μs (45 allocations: 864 bytes)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.rotate!), $(Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number}), $(x_100K), $(y_100K), $(a), $(b));
  27.289 μs (71 allocations: 1.70 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rotate!), $(Tuple{CUDA.StridedCuArray{Float32}, CUDA.StridedCuArray{Float32}, Number, Number}), $(x_100K), $(y_100K), $(a), $(b));
  20.839 μs (45 allocations: 864 bytes)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.rotate!), $(Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number}), $(x_1M), $(y_1M), $(a), $(b));
  806.077 μs (73 allocations: 1.73 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rotate!), $(Tuple{CUDA.StridedCuArray{Float32}, CUDA.StridedCuArray{Float32}, Number, Number}), $(x_1M), $(y_1M), $(a), $(b));
  811.260 μs (46 allocations: 880 bytes)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.rotate!), $(Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number}), $(x_10M), $(y_10M), $(a), $(b));
  7.900 ms (73 allocations: 1.73 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rotate!), $(Tuple{CUDA.StridedCuArray{Float32}, CUDA.StridedCuArray{Float32}, Number, Number}), $(x_10M), $(y_10M), $(a), $(b));
  7.895 ms (46 allocations: 880 bytes)

julia> @btime CUDA.@sync invoke($(LinearAlgebra.rotate!), $(Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number}), $(x_100M), $(y_100M), $(a), $(b));
  78.331 ms (73 allocations: 1.73 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rotate!), $(Tuple{CUDA.StridedCuArray{Float32}, CUDA.StridedCuArray{Float32}, Number, Number}), $(x_100M), $(y_100M), $(a), $(b));
  78.374 ms (46 allocations: 880 bytes)

Main observations:

  • For comparable kernels there is no real runtime difference between the GPUArrays.jl-generated and the cuBLAS versions here, but the GPUArrays.jl path has more CPU-side allocations.
  • The "fused" kernels in GPUArrays.jl (axpby! and reflect!) are faster than the CUDA.jl implementations, which use two cuBLAS kernel calls each.

It seems we could simply delete the overwrites for rmul!, axpby!, rotate! and reflect! in CUDA.jl and rely on the generic GPUArrays.jl implementation. What do you think?

@tam724 requested a review from maleadt July 8, 2025 07:23

@tam724 (Contributor, Author) commented Jul 8, 2025

Sorry, review request was a misclick.
