Commit 0661395

Merge pull request #41 from arhik/main
rearrange docs
2 parents (7c86d55 + 6fbeb13), commit 0661395

File tree

4 files changed: +228 -228 lines changed

docs/make.jl

Lines changed: 2 additions & 2 deletions

````diff
@@ -15,8 +15,8 @@ makedocs(;
         assets=String[],
     ),
     pages=[
-        "Home" => "intro.md",
-        "API" => "index.md"
+        "Home" => "index.md",
+        "API" => "api.md"
     ],
 )
````

docs/src/api.md

Lines changed: 14 additions & 0 deletions

````diff
@@ -0,0 +1,14 @@
+```@meta
+CurrentModule = WGPUCompute
+```
+
+# WGPUCompute
+
+Documentation for [WGPUCompute](https://github.com/JuliaWGPU/WGPUCompute.jl).
+
+```@index
+```
+
+```@autodocs
+Modules = [WGPUCompute]
+```
````

docs/src/index.md

Lines changed: 212 additions & 7 deletions

````diff
@@ -1,14 +1,219 @@
-```@meta
-CurrentModule = WGPUCompute
+# WGPUCompute
+
+[![Stable](https://img.shields.io/badge/docs-stable-blue.svg)](https://JuliaWGPU.github.io/WGPUCompute.jl/stable/)
+[![Dev](https://img.shields.io/badge/docs-dev-blue.svg)](https://JuliaWGPU.github.io/WGPUCompute.jl/dev/)
+[![Build Status](https://github.com/JuliaWGPU/WGPUCompute.jl/actions/workflows/CI.yml/badge.svg?branch=main)](https://github.com/JuliaWGPU/WGPUCompute.jl/actions/workflows/CI.yml?query=branch%3Amain)
+[![Coverage](https://codecov.io/gh/JuliaWGPU/WGPUCompute.jl/branch/main/graph/badge.svg)](https://codecov.io/gh/JuliaWGPU/WGPUCompute.jl)
+
+:warning: This repo is under heavy development.
+
+`WGPUCompute` is a `WGPU` compute shader utility library for Julia. Using this library, one can define compute shader kernels in regular Julia. For example:
+
+```julia
+
+using BenchmarkTools
+using WGPUCompute
+
+# Kernel definition
+function cast_kernel(x::WgpuArray{T, N}, out::WgpuArray{S, N}) where {T, S, N}
+    xdim = workgroupDims.x
+    ydim = workgroupDims.y
+    gIdx = workgroupId.x*xdim + localId.x
+    gIdy = workgroupId.y*ydim + localId.y
+    gId = xDims.x*gIdy + gIdx
+    out[gId] = S(ceil(x[gId]))
+end
+
+# Wrapper function
+function cast(S::DataType, x::WgpuArray{T, N}) where {T, N}
+    y = WgpuArray{S}(undef, size(x))
+    @wgpukernel launch=true workgroupSizes=(4, 4) workgroupCount=(2, 2) shmem=() cast_kernel(x, y)
+    return y
+end
+
+x = WgpuArray{Float32}(rand(Float32, 8, 8) .- 0.5f0)
+z = cast(UInt32, x)
+
 ```
 
-# WGPUCompute
+In the above example, a single generalized kernel is used for casting between datatypes. The type parameters `S`, `T`, and `N` are inferred and replaced with their actual type information internally.
+
````
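Because `S` is an ordinary type parameter, the same wrapper covers other target types as well. A hypothetical continuation of the session above (a sketch, assuming the backend supports these element types):

```julia
# Hypothetical continuation: each target type S compiles its own
# specialization of cast_kernel.
zU = cast(UInt32, x)   # Float32 -> UInt32
zF = cast(Float32, x)  # Float32 -> Float32 (ceil only)
```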
````diff
+Compute kernels also support shared memory, which provides the means to implement kernels like matmul. For example:
+
+
+```julia
+function tiled_matmul_kernel(x::WgpuArray{T, N}, y::WgpuArray{T, N}, out::WgpuArray{T, N}) where {T, N}
+    # set out matrix to zero
+    gId = xDims.x*globalId.y + globalId.x
+    out[gId] = 0.0
+
+    # set local variable = 0.0
+    sum = 0.0
+
+    for tileId in 0:numWorkgroups.y
+        # copy block from x to shared memory
+        xId = workgroupId.x*workgroupDims.x + localId.x
+        yId = tileId*workgroupDims.y + localId.y
+        sId = localId.y*workgroupDims.x + localId.x
+        shmem1[sId] = x[yId*xDims.x + xId]
+
+        # copy block from y to shared memory
+        xId = tileId*workgroupDims.x + localId.x
+        yId = workgroupId.y*workgroupDims.y + localId.y
+        shmem2[sId] = y[yId*yDims.x + xId]
+        synchronize()
+
+        # block sums for each tid
+        for i in 0:xDims.y/numWorkgroups.y
+            sum = sum + shmem1[i*workgroupDims.x + localId.x]*shmem2[localId.y*workgroupDims.x + i]
+        end
+        synchronize()
+    end
+
+    out[gId] = sum
+end
+
+# For now, valid only for square matrices whose size is a power of 2 (base size 16), to keep it simple.
+function tiled_matmul_heuristics(x::WgpuArray{T, N}, y::WgpuArray{T, N}) where {T, N}
+    aSize = size(x)
+    bSize = size(y)
+    @assert last(aSize) == first(bSize)
+    outSize = (first(aSize), last(bSize))
+    @assert eltype(x) == eltype(y)
+    wgSize = (16, 16) # This can be fixed for now
+    wgCount = div.((outSize[1], outSize[2]), 16, RoundUp)
+    return (outSize, wgSize, wgCount)
+end
+
+function tiled_matmul(x::WgpuArray{T, N}, y::WgpuArray{T, N}) where {T, N}
+    (outSize, wgSize, wgCount) = tiled_matmul_heuristics(x, y)
+    out = WgpuArray{eltype(x), ndims(x)}(undef, outSize)
+    @wgpukernel(
+        launch=true,
+        workgroupSizes=wgSize,
+        workgroupCount=wgCount,
+        shmem=(:shmem1=>(Float32, wgSize), :shmem2=>(Float32, wgSize)),
+        tiled_matmul_kernel(x, y, out)
+    )
+    return out
+end
+
+Base.:*(x::WgpuArray{T, N}, y::WgpuArray{T, N}) where {T, N} = tiled_matmul(x, y)
+
+x = WgpuArray{Float32, 2}(rand(2048, 2048));
+y = WgpuArray{Float32, 2}(rand(2048, 2048));
+
+z = x*y
+
+z_cpu = (x |> collect)*(y |> collect)
+
+@test z_cpu ≈ (z |> collect)
+
 
````
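The launch heuristic above is worth a quick worked example; this arithmetic runs on the CPU only (no GPU needed) and mirrors `tiled_matmul_heuristics`:

```julia
# Worked example of the launch heuristic for the 2048x2048 case:
outSize = (2048, 2048)
wgSize  = (16, 16)                    # fixed tile / workgroup size
wgCount = div.(outSize, 16, RoundUp)  # (128, 128) workgroups cover the output
```

So the 2048×2048 product launches a 128×128 grid of 16×16 workgroups, each accumulating its tile products through `shmem1` and `shmem2`.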

````diff
-Documentation for [WGPUCompute](https://github.com/JuliaWGPU/WGPUCompute.jl).
 
-```@index
 ```
 
-```@autodocs
-Modules = [WGPUCompute]
+There is limited support for the GPUArrays interface, and work is underway to make it complete.
+
+```julia
+using WGPUCompute
+using BenchmarkTools
+
+aArray = WgpuArray{Float32}(undef, (1024, 1024, 100))
+bArray = WgpuArray{Float32}(rand(Float32, (1024, 1024, 100)))
+
+@benchmark copyto!(aArray, 1, bArray, 1, prod(size(aArray)))
+
+```
 ```
+BenchmarkTools.Trial: 10000 samples with 1 evaluation.
+ Range (min … max):  62.900 μs … 1.885 ms  ┊ GC (min … max): 0.00% … 0.00%
+ Time  (median):     70.100 μs             ┊ GC (median):    0.00%
+ Time  (mean ± σ):   95.964 μs ± 80.628 μs ┊ GC (mean ± σ):  0.00% ± 0.00%
+
+  ▇█▄▃▁▁▃▃▂▂▂▂▂▂▁▂▂▁▁ ▁▂▃▂ ▁▁▂▃▃▂ ▁▂▁▂▁                      ▂
+  █████████████████████████████████████████▇▆▆▅▅▅▇█▇▆▆▇▇▇▆▅▆▆ █
+  62.9 μs      Histogram: log(frequency) by time       208 μs <
+
+Memory estimate: 1.01 KiB, allocs estimate: 37.
+```
````
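For context, a minimal host/device round trip, using only constructors and calls that already appear in this README (a sketch, not additional API):

```julia
# Minimal sketch of a device round trip with the APIs shown above:
a = WgpuArray{Float32}(undef, (16, 16))          # uninitialized device array
b = WgpuArray{Float32}(rand(Float32, (16, 16)))  # upload host data
copyto!(a, 1, b, 1, prod(size(a)))               # copy between device arrays
aHost = a |> collect                             # download back to the host
```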
````diff
+
+Basic ML kernels can be defined:
+
+A very simplified kernel example of the ML primitive `relu`:
+
+```julia
+using WGPUCompute
+
+y = WgpuArray((rand(4, 4) .- 0.5) .|> Float32)
+
+function relu_kernel(x::WgpuArray{T, N}, out::WgpuArray{T, N}) where {T, N}
+    gId = xDims.x*globalId.y + globalId.x
+    value = x[gId]
+    out[gId] = max(value, 0.0)
+end
+
+function relu(x::WgpuArray{T, N}) where {T, N}
+    y = similar(x)
+    @wgpukernel launch=true workgroupSizes=(4,4) workgroupCount=(1,1) shmem=() relu_kernel(x, y)
+    return y
+end
+
+relu(y)
+
+```
````
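The result can be cross-checked against a CPU reference in the same style as the matmul `@test` above (a hypothetical check; assumes the `Test` stdlib is loaded):

```julia
# Hypothetical CPU cross-check for the relu example:
using Test
@test (relu(y) |> collect) ≈ max.(y |> collect, 0.0f0)
```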
````diff
+
+The above kernel undergoes two transformations:
+1. First, the `@wgpukernel` macro takes the kernel function and transforms it into a custom AST and intermediate representation; this transformation is carried out by `WGPUTranspiler`. The AST is then transpiled into the format below, which is very close to `WGSL` but with Julia IR semantics. For a more detailed explanation, see [WGPUTranspiler](https://github.com/JuliaWGPU/WGPUTranspiler.jl).
+```
+┌ Info: begin
+│  @const workgroupDims = Vec3{UInt32}(0x00000004, 0x00000004, 0x00000001)
+│  @const xDims = Vec3{UInt32}(0x00000004, 0x00000004, 0x00000001)
+│  @const outDims = Vec3{UInt32}(0x00000004, 0x00000004, 0x00000001)
+│  @var StorageReadWrite 0 0 x::Array{Float32, 16}
+│  @var StorageReadWrite 0 1 out::Array{Float32, 16}
+│  @compute @workgroupSize(4, 4, 1) function relu_kernel(@builtin(global_invocation_id, globalId::Vec3{UInt32}), @builtin(local_invocation_id, localId::Vec3{UInt32}), @builtin(num_workgroups, numWorkgroups::Vec3{UInt32}), @builtin(workgroup_id, workgroupId::Vec3{UInt32}))
+│      @let gId = xDims.x * globalId.y + globalId.x
+│      @let value = x[gId]
+│      out[gId] = max(value, 0.0f0)
+│  end
+└ end
+```
+2. Then this representation is compiled to WebGPU's shading language, `WGSL`; this step is carried out by another package, `WGSLTypes`.
+
+```
+┌ Info: const workgroupDims = vec3<u32>(4u, 4u, 1u);
+│ const xDims = vec3<u32>(4u, 4u, 1u);
+│ const outDims = vec3<u32>(4u, 4u, 1u);
+│ @group(0) @binding(0) var<storage, read_write> x:array<f32, 16> ;
+│ @group(0) @binding(1) var<storage, read_write> out:array<f32, 16> ;
+│ @compute @workgroup_size(4, 4, 1)
+│ fn relu_kernel(@builtin(global_invocation_id) globalId:vec3<u32>, @builtin(local_invocation_id) localId:vec3<u32>, @builtin(num_workgroups) numWorkgroups:vec3<u32>, @builtin(workgroup_id) workgroupId:vec3<u32>) {
+│     let gId = xDims.x * globalId.y + globalId.x;
+│     let value = x[gId];
+│     out[gId] = max(value, 0.0);
+│ }
+
+```
+
+This final shader code is compiled with `naga`, `WGPU-native`'s shader compiler.
+
+## Conventions
+
+1. Input arguments are converted into `storage` variables and placed at the top of the shader code.
+2. The sizes of input arguments are exposed as `const` variables, also placed at the top of the shader code; users can probe them for an input array's size. The size variable's name is the array's name followed by "Dims": for a variable `x`, `xDims` holds its size.
+3. Launch parameters like `workgroupDims` are likewise placed at the top of the shader code and can be used as variables inside kernel code. Eventually these will be probed with Julia's `size` function; until then, this convention applies.
+4. Shared memory can be declared in the `@wgpukernel` macro via the `shmem` kwarg. `shmem` expects a tuple of pairs, each pair mapping a name to the (type, size) of a shared-memory block, e.g. `shmem = ("xShared"=>(Float32, 16))`. A sketch combining these conventions follows below.
````
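To make the conventions concrete, here is a hypothetical kernel sketch (`scale_kernel`, `xIn`, and `xOut` are illustrative names, not part of the package):

```julia
# Hypothetical kernel illustrating the conventions above.
# Convention 1: xIn/xOut become storage variables in the shader.
# Convention 2: xInDims is auto-generated and probes xIn's size.
# Convention 3: globalId/workgroupDims are available inside the kernel.
function scale_kernel(xIn::WgpuArray{T, N}, xOut::WgpuArray{T, N}) where {T, N}
    gId = xInDims.x*globalId.y + globalId.x
    xOut[gId] = 2.0*xIn[gId]
end

xIn = WgpuArray{Float32}(rand(Float32, 4, 4))
xOut = similar(xIn)
# Convention 4: no shared memory is needed here, so shmem=() is passed.
@wgpukernel launch=true workgroupSizes=(4,4) workgroupCount=(1,1) shmem=() scale_kernel(xIn, xOut)
```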
````diff
+
+## Known issues
+
+- Jupyter notebooks are not tested yet and might need some work; the same goes for Pluto compatibility.
+
+## TODO
+
+- [ ] Atomics support is under development.
+- [ ] Explore serving the generated WGSL code in a web app via JSServe.
+- [ ] Complete the SPIR-V version.
+- [ ] Explore binary generation and eventually adopt it.
````
