Each register is specified by 4 bits, meaning that there are 16 total registers.

Each thread within each core follows the above execution path to perform computations on the data in its dedicated register file.
This resembles a standard CPU diagram, and is quite similar in functionality as well. The main difference is that the `%blockIdx`, `%blockDim`, and `%threadIdx` values lie in the read-only registers for each thread, enabling SIMD functionality.
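
As a rough illustration (plain Python, not part of the repo), here is the indexing pattern this enables, assuming the 2-core, 4-threads-per-core configuration used by the kernels below — every thread runs the same code but arrives at a different global element index:

```python
# Illustrative software model of the SIMD indexing pattern: every thread runs
# the same instructions, but reads different %blockIdx / %blockDim / %threadIdx
# values, so each one lands on a different element of the data.
NUM_CORES = 2          # each core executes one block
THREADS_PER_BLOCK = 4  # %blockDim, the same for every thread

for block_idx in range(NUM_CORES):
    for thread_idx in range(THREADS_PER_BLOCK):
        # i = %blockIdx * %blockDim + %threadIdx -- the global element index
        i = block_idx * THREADS_PER_BLOCK + thread_idx
        print(f"core {block_idx}, thread {thread_idx} -> element {i}")
```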
# Kernels
I wrote matrix addition and matrix multiplication kernels using my ISA as a proof of concept to demonstrate SIMD programming and execution with my GPU. The test files in this repository can fully simulate the execution of these kernels on the GPU, producing data memory states and a complete execution trace.
### Matrix Addition
This matrix addition kernel adds two 1 x 8 matrices by performing 8 element-wise additions in separate threads.
This demonstration makes use of the `%blockIdx`, `%blockDim`, and `%threadIdx` registers to show SIMD programming on this GPU. It also uses the `LDR` and `STR` instructions, which require asynchronous memory management.
`matadd.asm`
```asm
...
RET ; end of kernel
```
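
As a reference for what this kernel computes, here is a rough Python model of the per-thread work; the base addresses (A at 0, B at 8, C at 16) and the sample data are illustrative assumptions, not taken from the listing:

```python
# Rough software model of matadd (layout assumption: A at 0, B at 8, C at 16).
A_BASE, B_BASE, C_BASE = 0, 8, 16
N = 8  # 1 x 8 matrices -> 8 elements, one thread each

def matadd_thread(memory, i):
    """What a single thread does: one load-add-store for its element i."""
    a = memory[A_BASE + i]      # LDR A[i]
    b = memory[B_BASE + i]      # LDR B[i]
    memory[C_BASE + i] = a + b  # STR C[i]

memory = list(range(8)) + list(range(8)) + [0] * 8  # example data, illustrative
for i in range(N):  # the GPU runs these 8 "threads" in parallel
    matadd_thread(memory, i)
print(memory[C_BASE:C_BASE + N])  # -> [0, 2, 4, 6, 8, 10, 12, 14]
```
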
### Matrix Multiplication
The matrix multiplication kernel multiplies two 2 x 2 matrices. Each thread computes one element of the result as the dot product of the relevant row and column, and the kernel uses the `CMP` and `BRnzp` instructions to demonstrate branching within the threads (notably, all branches converge, so this kernel works on the current tiny-gpu implementation).
`matmul.asm`
```asm
...
RET ; end of kernel
```
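
Again as a rough Python reference (the row-major layout and base addresses are illustrative assumptions): each of the 4 threads derives its row and column from its global index and accumulates a 2-element dot product — the loop the kernel expresses with `CMP` and `BRnzp`:

```python
# Rough software model of matmul (illustrative layout: A at 0, B at 4, C at 8,
# both matrices stored row-major).
N = 2  # 2 x 2 matrices
A_BASE, B_BASE, C_BASE = 0, 4, 8

def matmul_thread(memory, i):
    """One thread: compute C[row][col] as a 2-element dot product."""
    row, col = i // N, i % N  # derived from the global thread index
    acc = 0
    for k in range(N):  # the loop the kernel implements with CMP / BRnzp
        acc += memory[A_BASE + row * N + k] * memory[B_BASE + k * N + col]
    memory[C_BASE + row * N + col] = acc

memory = [1, 2, 3, 4] + [1, 0, 0, 1] + [0] * 4  # A, B = identity (example data)
for i in range(N * N):  # 4 threads, one result element each
    matmul_thread(memory, i)
print(memory[C_BASE:C_BASE + N * N])  # -> [1, 2, 3, 4]
```
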
# Simulation
tiny-gpu is set up to simulate the execution of both of the above kernels. Before simulating, you'll need to install [iverilog](https://steveicarus.github.io/iverilog/usage/installation.html) and [cocotb](https://docs.cocotb.org/en/stable/install.html).
Once you've installed the prerequisites, you can run the kernel simulations with `make test_matadd` and `make test_matmul`.
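
Under the hood, these targets run cocotb testbenches against the Verilog with iverilog as the simulator. A minimal sketch of the shape of such a cocotb test (the signal and test names here are hypothetical, not the repo's actual test code):

```python
# Hypothetical sketch of a cocotb testbench -- signal/test names are
# illustrative, not the actual tiny-gpu test files.
import cocotb
from cocotb.clock import Clock
from cocotb.triggers import RisingEdge

@cocotb.test()
async def test_kernel(dut):
    # Drive a 10 ns clock and apply a short reset
    cocotb.start_soon(Clock(dut.clk, 10, units="ns").start())
    dut.reset.value = 1
    for _ in range(3):
        await RisingEdge(dut.clk)
    dut.reset.value = 0

    # Kick off the kernel, then wait for the GPU's done flag
    dut.start.value = 1
    while int(dut.done.value) != 1:
        await RisingEdge(dut.clk)
```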
Executing the simulations will output a log file in `test/logs` with the initial data memory state, the complete execution trace of the kernel, and the final data memory state.
The `matadd` kernel adds two 1 x 8 matrices across 8 threads running on 2 cores, and the `matmul` kernel multiplies two 2 x 2 matrices across 4 threads.
If you look at the initial data memory for each, you should see the two input matrices for the calculation, and in the final data memory you should also see the resultant matrix.
Below is a sample of the execution traces, showing on each cycle the execution of every thread within every core, including the current instruction, PC, register values, states, etc.
## Notes
Notes on design decisions made for simplicity that could be optimized away:
- Many things that could be wires are registers to make things explicitly synchronous and for code simplicity and clarity.
- State management does some things in many cycles that could be done in 1 to make control flow explicit.
## Next Steps
Updates I want to make in the future to improve the design; anyone else is welcome to contribute as well:
- [ ] Build an adapter to use GPU with Tiny Tapeout 7
- [ ] Add support for branch divergence
- [ ] Optimize control flow and use of registers to improve cycle time
- [ ] Add basic memory coalescing
- [ ] Add basic pipelining
- [ ] Write a basic graphics kernel or add simple graphics hardware to demonstrate graphics functionality