reflect change in rotate! from LinearAlgebra.jl #603
Conversation
Your PR requires formatting changes to meet the project's style guidelines. Suggested changes:

diff --git a/src/host/linalg.jl b/src/host/linalg.jl
index 4933839..a21d694 100644
--- a/src/host/linalg.jl
+++ b/src/host/linalg.jl
@@ -687,8 +687,8 @@ function LinearAlgebra.rotate!(x::AbstractGPUArray, y::AbstractGPUArray, c::Numb
i = @index(Global, Linear)
@inbounds xi = x[i]
@inbounds yi = y[i]
- @inbounds x[i] = s*yi + c *xi
- @inbounds y[i] = c*yi - conj(s)*xi
+ @inbounds x[i] = s * yi + c * xi
+ @inbounds y[i] = c * yi - conj(s) * xi
end
rotate_kernel!(get_backend(x))(x, y, c, s; ndrange = size(x))
return x, y
diff --git a/test/testsuite.jl b/test/testsuite.jl
index e7c8d9d..171eeed 100644
--- a/test/testsuite.jl
+++ b/test/testsuite.jl
@@ -63,7 +63,7 @@ function out_has_NaNs(f, AT::Type{<:AbstractGPUArray}, xs...)
arg_out = f(arg_in...)
return has_NaNs(arg_out)
end
-
+
# element types that are supported by the array type
supported_eltypes(AT, test) = supported_eltypes(AT)
supported_eltypes(AT) = supported_eltypes()
Thanks! Mind adding those examples as a test?
Not sure if this is the right approach. Should we also compare the "strong zero"/NaN behaviour against LinearAlgebra.jl, or should this be guaranteed only by GPUArrays.jl? In other words: should the tests compare against the CPU implementation, or check explicit expected values? For consistency I added similar tests for reflect!. Essentially, this now tests whether multiplication with false acts as a strong zero.
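For concreteness, a minimal sketch of what such a strong-zero test could look like (assuming, as elsewhere in the testsuite, that AT is the array type under test; the Float32 eltype and names here are illustrative):

using LinearAlgebra, Test

# Sketch: s == false must act as a strong zero in rotate!.
x = AT(rand(Float32, 4))
y = AT(fill(NaN32, 4))
x0 = Array(x)
# x[i] = false*y[i] + true*x[i]: since false*NaN32 == 0.0f0, x stays unchanged.
LinearAlgebra.rotate!(x, y, true, false)
@test Array(x) == x0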
Ideally we test compatibility against Array/stdlibs, but if that behavior is clearly buggy it's fine to encode the expected values explicitly instead.
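The compatibility-style alternative would look roughly like this (sketch; AT again denotes the tested array type):

using LinearAlgebra, Test

# Sketch: use the stdlib method on plain Arrays as the reference result.
x, y = rand(Float32, 4), rand(Float32, 4)
xg, yg = AT(x), AT(y)
c, s = cos(0.3f0), sin(0.3f0)
LinearAlgebra.rotate!(x, y, c, s)    # reference (stdlib on Array)
LinearAlgebra.rotate!(xg, yg, c, s)  # GPUArrays.jl kernel
@test Array(xg) ≈ x
@test Array(yg) ≈ y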
This should do it. Some problematic cases are currently skipped.
Thanks. CUDA.jl and Metal.jl CI failures look related though.
Sure, they are related (also oneAPI.jl). If GPUArrays.jl tests for the strong-zero/NaN behaviour, the vendor-library overrides used by those backends fail it. This probably brings me back to JuliaGPU/CUDA.jl#2607 (for CUDA.jl). I see two solutions (CUDA.jl for now, this is the only library I can test locally): either delete the CUBLAS-backed overrides so dispatch falls back to the GPUArrays.jl kernels, or keep the overrides and handle the strong-zero edge cases there explicitly.
To me, using the GPUArrays.jl implementation sounds like the cleanest option, if performance is comparable. I ran a few benchmarks (not really representative, on an A2000 Laptop GPU).

Benchmark Results

julia> using CUDA, GPUArrays, LinearAlgebra, BenchmarkTools
julia> CUDA.versioninfo()
CUDA runtime 12.9, artifact installation
CUDA driver 12.9
NVIDIA driver 575.64.0
CUDA libraries:
- CUBLAS: 12.9.1
- CURAND: 10.3.10
- CUFFT: 11.4.1
- CUSOLVER: 11.7.5
- CUSPARSE: 12.5.10
- CUPTI: 2025.2.1 (API 28.0.0)
- NVML: 12.0.0+575.64
Julia packages:
- CUDA: 5.8.2
- CUDA_Driver_jll: 0.13.1+0
- CUDA_Runtime_jll: 0.17.1+0
Toolchain:
- Julia: 1.11.5
- LLVM: 16.0.6
1 device:
0: NVIDIA RTX A2000 Laptop GPU (sm_86, 3.669 GiB / 4.000 GiB available)
julia> x_10, y_10 = CUDA.rand(10), CUDA.rand(10);
julia> x_1K, y_1K = CUDA.rand(1000), CUDA.rand(1000);
julia> x_100K, y_100K = CUDA.rand(100000), CUDA.rand(100000);
julia> x_1M, y_1M = CUDA.rand(1000000), CUDA.rand(1000000);
julia> x_10M, y_10M = CUDA.rand(10000000), CUDA.rand(10000000);
julia> x_100M, y_100M = CUDA.rand(100000000), CUDA.rand(100000000);
julia> a, b = rand(), rand();
# rmul!
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rmul!), $(Tuple{GPUArrays.AbstractGPUArray, Number}), $(x_10), $(a));
9.394 μs (49 allocations: 1.25 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rmul!), $(Tuple{CUDA.StridedCuArray{Float32}, Number}), $(x_10), $(a));
9.844 μs (25 allocations: 464 bytes)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rmul!), $(Tuple{GPUArrays.AbstractGPUArray, Number}), $(x_1K), $(a));
10.384 μs (55 allocations: 1.34 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rmul!), $(Tuple{CUDA.StridedCuArray{Float32}, Number}), $(x_1K), $(a));
9.922 μs (25 allocations: 464 bytes)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rmul!), $(Tuple{GPUArrays.AbstractGPUArray, Number}), $(x_100K), $(a));
15.100 μs (55 allocations: 1.34 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rmul!), $(Tuple{CUDA.StridedCuArray{Float32}, Number}), $(x_100K), $(a));
12.278 μs (25 allocations: 464 bytes)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rmul!), $(Tuple{GPUArrays.AbstractGPUArray, Number}), $(x_1M), $(a));
406.892 μs (56 allocations: 1.36 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rmul!), $(Tuple{CUDA.StridedCuArray{Float32}, Number}), $(x_1M), $(a));
405.526 μs (25 allocations: 464 bytes)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rmul!), $(Tuple{GPUArrays.AbstractGPUArray, Number}), $(x_10M), $(a));
3.910 ms (57 allocations: 1.38 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rmul!), $(Tuple{CUDA.StridedCuArray{Float32}, Number}), $(x_10M), $(a));
3.914 ms (26 allocations: 480 bytes)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rmul!), $(Tuple{GPUArrays.AbstractGPUArray, Number}), $(x_100M), $(a));
39.106 ms (57 allocations: 1.38 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rmul!), $(Tuple{CUDA.StridedCuArray{Float32}, Number}), $(x_100M), $(a));
39.055 ms (26 allocations: 480 bytes)
# axpby!
julia> @btime CUDA.@sync invoke($(LinearAlgebra.axpby!), $(Tuple{Number, GPUArrays.AbstractGPUArray, Number, GPUArrays.AbstractGPUArray}), $(a), $(x_10), $(b), $(y_10));
13.757 μs (92 allocations: 2.47 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.axpby!), $(Tuple{Number, CUDA.StridedCuArray{Float32}, Number, CUDA.StridedCuArray{Float32}}), $(a), $(x_10), $(b), $(y_10));
17.939 μs (55 allocations: 1008 bytes)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.axpby!), $(Tuple{Number, GPUArrays.AbstractGPUArray, Number, GPUArrays.AbstractGPUArray}), $(a), $(x_1K), $(b), $(y_1K));
15.993 μs (98 allocations: 2.56 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.axpby!), $(Tuple{Number, CUDA.StridedCuArray{Float32}, Number, CUDA.StridedCuArray{Float32}}), $(a), $(x_1K), $(b), $(y_1K));
18.613 μs (55 allocations: 1008 bytes)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.axpby!), $(Tuple{Number, GPUArrays.AbstractGPUArray, Number, GPUArrays.AbstractGPUArray}), $(a), $(x_100K), $(b), $(y_100K));
24.866 μs (98 allocations: 2.56 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.axpby!), $(Tuple{Number, CUDA.StridedCuArray{Float32}, Number, CUDA.StridedCuArray{Float32}}), $(a), $(x_100K), $(b), $(y_100K));
21.617 μs (55 allocations: 1008 bytes)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.axpby!), $(Tuple{Number, GPUArrays.AbstractGPUArray, Number, GPUArrays.AbstractGPUArray}), $(a), $(x_1M), $(b), $(y_1M));
605.886 μs (99 allocations: 2.58 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.axpby!), $(Tuple{Number, CUDA.StridedCuArray{Float32}, Number, CUDA.StridedCuArray{Float32}}), $(a), $(x_1M), $(b), $(y_1M));
999.222 μs (56 allocations: 1.00 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.axpby!), $(Tuple{Number, GPUArrays.AbstractGPUArray, Number, GPUArrays.AbstractGPUArray}), $(a), $(x_10M), $(b), $(y_10M));
5.839 ms (100 allocations: 2.59 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.axpby!), $(Tuple{Number, CUDA.StridedCuArray{Float32}, Number, CUDA.StridedCuArray{Float32}}), $(a), $(x_10M), $(b), $(y_10M));
9.702 ms (56 allocations: 1.00 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.axpby!), $(Tuple{Number, GPUArrays.AbstractGPUArray, Number, GPUArrays.AbstractGPUArray}), $(a), $(x_100M), $(b), $(y_100M));
57.918 ms (100 allocations: 2.59 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.axpby!), $(Tuple{Number, CUDA.StridedCuArray{Float32}, Number, CUDA.StridedCuArray{Float32}}), $(a), $(x_100M), $(b), $(y_100M));
97.206 ms (56 allocations: 1.00 KiB)
# reflect!
julia> a, b = rand() .|> (sin, cos);
julia> @btime CUDA.@sync invoke($(LinearAlgebra.reflect!), $(Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number}), $(x_10), $(y_10), $(a), $(b));
11.870 μs (65 allocations: 1.61 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.reflect!), $(Tuple{CUDA.StridedCuArray{Float32}, CUDA.StridedCuArray{Float32}, Number, Number}), $(x_10), $(y_10), $(a), $(b));
22.192 μs (69 allocations: 1.28 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.reflect!), $(Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number}), $(x_1K), $(y_1K), $(a), $(b));
14.273 μs (71 allocations: 1.70 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.reflect!), $(Tuple{CUDA.StridedCuArray{Float32}, CUDA.StridedCuArray{Float32}, Number, Number}), $(x_1K), $(y_1K), $(a), $(b));
22.958 μs (69 allocations: 1.28 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.reflect!), $(Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number}), $(x_100K), $(y_100K), $(a), $(b));
27.170 μs (71 allocations: 1.70 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.reflect!), $(Tuple{CUDA.StridedCuArray{Float32}, CUDA.StridedCuArray{Float32}, Number, Number}), $(x_100K), $(y_100K), $(a), $(b));
25.417 μs (69 allocations: 1.28 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.reflect!), $(Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number}), $(x_10M), $(y_10M), $(a), $(b));
7.886 ms (73 allocations: 1.73 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.reflect!), $(Tuple{CUDA.StridedCuArray{Float32}, CUDA.StridedCuArray{Float32}, Number, Number}), $(x_10M), $(y_10M), $(a), $(b));
11.817 ms (70 allocations: 1.30 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.reflect!), $(Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number}), $(x_1M), $(y_1M), $(a), $(b));
806.112 μs (73 allocations: 1.73 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.reflect!), $(Tuple{CUDA.StridedCuArray{Float32}, CUDA.StridedCuArray{Float32}, Number, Number}), $(x_1M), $(y_1M), $(a), $(b));
1.209 ms (70 allocations: 1.30 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.reflect!), $(Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number}), $(x_100M), $(y_100M), $(a), $(b));
78.336 ms (73 allocations: 1.73 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.reflect!), $(Tuple{CUDA.StridedCuArray{Float32}, CUDA.StridedCuArray{Float32}, Number, Number}), $(x_100M), $(y_100M), $(a), $(b));
117.388 ms (70 allocations: 1.30 KiB)
# rotate!
julia> a, b = rand() .|> (sin, cos);
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rotate!), $(Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number}), $(x_10), $(y_10), $(a), $(b));
11.561 μs (65 allocations: 1.61 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rotate!), $(Tuple{CUDA.StridedCuArray{Float32}, CUDA.StridedCuArray{Float32}, Number, Number}), $(x_10), $(y_10), $(a), $(b));
15.658 μs (45 allocations: 864 bytes)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rotate!), $(Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number}), $(x_1K), $(y_1K), $(a), $(b));
14.496 μs (71 allocations: 1.70 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rotate!), $(Tuple{CUDA.StridedCuArray{Float32}, CUDA.StridedCuArray{Float32}, Number, Number}), $(x_1K), $(y_1K), $(a), $(b));
15.932 μs (45 allocations: 864 bytes)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rotate!), $(Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number}), $(x_100K), $(y_100K), $(a), $(b));
27.289 μs (71 allocations: 1.70 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rotate!), $(Tuple{CUDA.StridedCuArray{Float32}, CUDA.StridedCuArray{Float32}, Number, Number}), $(x_100K), $(y_100K), $(a), $(b));
20.839 μs (45 allocations: 864 bytes)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rotate!), $(Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number}), $(x_1M), $(y_1M), $(a), $(b));
806.077 μs (73 allocations: 1.73 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rotate!), $(Tuple{CUDA.StridedCuArray{Float32}, CUDA.StridedCuArray{Float32}, Number, Number}), $(x_1M), $(y_1M), $(a), $(b));
811.260 μs (46 allocations: 880 bytes)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rotate!), $(Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number}), $(x_10M), $(y_10M), $(a), $(b));
7.900 ms (73 allocations: 1.73 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rotate!), $(Tuple{CUDA.StridedCuArray{Float32}, CUDA.StridedCuArray{Float32}, Number, Number}), $(x_10M), $(y_10M), $(a), $(b));
7.895 ms (46 allocations: 880 bytes)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rotate!), $(Tuple{GPUArrays.AbstractGPUArray, GPUArrays.AbstractGPUArray, Number, Number}), $(x_100M), $(y_100M), $(a), $(b));
78.331 ms (73 allocations: 1.73 KiB)
julia> @btime CUDA.@sync invoke($(LinearAlgebra.rotate!), $(Tuple{CUDA.StridedCuArray{Float32}, CUDA.StridedCuArray{Float32}, Number, Number}), $(x_100M), $(y_100M), $(a), $(b));
78.374 ms (46 allocations: 880 bytes)
Main observations:
- rmul! and rotate!: the generic GPUArrays.jl kernels match the CUBLAS-backed methods at every size (e.g. ~39 ms vs ~39 ms for rmul!, ~78 ms vs ~78 ms for rotate! at 100M elements).
- axpby! and reflect!: the generic kernels are clearly faster at large sizes (~58 ms vs ~97 ms for axpby!, ~78 ms vs ~117 ms for reflect! at 100M elements).
- At small sizes the generic kernels are comparable or slightly faster.
Feels as if we can simply delete the overwrites for these functions in CUDA.jl.
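If those overrides were deleted, dispatch should fall through to the generic GPUArrays.jl methods; a quick (hypothetical) way to check, e.g. for rotate!:

julia> using CUDA, LinearAlgebra

julia> x, y = CUDA.rand(10), CUDA.rand(10);

julia> @which LinearAlgebra.rotate!(x, y, 0.5f0, 0.5f0)  # should point at GPUArrays' src/host/linalg.jl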
Sorry, review request was a misclick.
This reflects a change in LinearAlgebra.jl (JuliaLang/LinearAlgebra.jl#1323). The previous implementation of rotate! did not respect false as a strong zero. For consistency, the implementation in GPUArrays.jl should also be adapted. Essentially, this covers the NaN/false edge case (see the sketch below). Open question: what should the NaN/false behaviour be tested against?
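A minimal illustration of the edge case, shown here with the post-#1323 stdlib semantics on plain Arrays (the adapted GPUArrays.jl kernel should behave the same on device arrays):

using LinearAlgebra

x = [1.0, 2.0]
y = [NaN, NaN]
# s == false is a strong zero: x[i] = false*y[i] + true*x[i] stays finite,
# since false * NaN == 0.0 rather than NaN.
LinearAlgebra.rotate!(x, y, true, false)
x  # expected: [1.0, 2.0]; an implementation without strong-zero handling would yield NaN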