## GPU
tiny-gpu is built to execute a single kernel at a time.

In order to launch a kernel, we need to do the following (a minimal host-side sketch follows the list):

1. Load global program memory with the kernel code
2. Load data memory with the necessary data
3. Specify the number of threads to launch in the device control register
4. Launch the kernel by setting the start signal to high.
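
For intuition, here's that sequence as a minimal Python sketch. The `GPU` class and its field names are purely illustrative - they are not the actual module's ports.

```python
# Illustrative host-side model only; field names are assumptions,
# not the real hardware interface.
class GPU:
    def __init__(self):
        self.program_mem = [0] * 256  # 16-bit instructions
        self.data_mem = [0] * 256     # 8-bit values
        self.thread_count = 0         # device control register
        self.start = False            # start signal

def launch_kernel(gpu, kernel_code, data, thread_count):
    gpu.program_mem[:len(kernel_code)] = kernel_code  # 1. load program memory
    gpu.data_mem[:len(data)] = data                   # 2. load data memory
    gpu.thread_count = thread_count                   # 3. set thread count
    gpu.start = True                                  # 4. set start high
```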
The GPU itself consists of the following units:

1. Device control register
2. Dispatcher
3. Variable number of compute cores
4. Memory controllers for data memory & program memory
5. Cache

**Device Control Register:**

The device control register usually stores metadata specifying how kernels should be executed on the GPU.

In this case, the device control register just stores the `thread_count` - the total number of threads to launch for the active kernel.

**Dispatcher:**

Once a kernel is launched, the dispatcher is the unit that actually manages the distribution of threads to different compute cores.

The dispatcher organizes threads into groups called **blocks** that can be executed in parallel on a single core, and sends these blocks off to be processed by available cores.

Once all blocks have been processed, the dispatcher reports back that the kernel execution is done.
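
A rough Python sketch of that dispatch loop - the block size, the round-robin core choice, and the `Core.execute_block` interface are assumptions made for illustration:

```python
import math

class Core:
    def execute_block(self, block_id, num_threads):
        print(f"core running block {block_id} with {num_threads} threads")

def dispatch(thread_count, threads_per_block, cores):
    # Group threads into fixed-size blocks; the last block may be partial.
    num_blocks = math.ceil(thread_count / threads_per_block)
    for block_id in range(num_blocks):
        n = min(threads_per_block, thread_count - block_id * threads_per_block)
        cores[block_id % len(cores)].execute_block(block_id, n)  # simplified assignment
    return True  # report back: kernel execution is done

dispatch(thread_count=8, threads_per_block=4, cores=[Core(), Core()])
```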
## Memory

The GPU is built to interface with an external global memory. Here, data memory and program memory are separated out for simplicity.

**Global Memory:**

tiny-gpu data memory has the following specifications:

- 8 bit addressability (256 total rows of data memory)
- 8 bit data (stores values of <256 for each row)

tiny-gpu program memory has the following specifications:

- 8 bit addressability (256 rows of program memory)
- 16 bit data (each instruction is 16 bits as specified by the ISA)
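
Put concretely (a sketch of the two memories' shapes and value ranges):

```python
# Data memory: 8-bit addresses, 8-bit values.
data_mem = [0] * (2**8)                          # 256 rows
assert all(0 <= v < 2**8 for v in data_mem)      # each row holds a value < 256

# Program memory: 8-bit addresses, 16-bit instructions.
program_mem = [0] * (2**8)                       # 256 rows
assert all(0 <= i < 2**16 for i in program_mem)  # each instruction fits in 16 bits
```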
**Memory Controllers:**

Global memory has fixed read/write bandwidth, but there may be far more incoming requests across all cores to access data from memory than the external memory is actually able to handle.

The memory controllers keep track of all the outgoing requests to memory from the compute cores, throttle requests based on actual external memory bandwidth, and relay responses from external memory back to the proper resources.

Each memory controller has a fixed number of channels based on the bandwidth of global memory.
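
A behavioral sketch of that throttling, assuming a simple FIFO of pending requests and a fixed channel count (both assumptions for illustration):

```python
from collections import deque

class MemoryController:
    def __init__(self, num_channels, memory):
        self.num_channels = num_channels  # fixed by global memory bandwidth
        self.memory = memory
        self.pending = deque()            # requests waiting for a free channel

    def request(self, core_id, addr):
        self.pending.append((core_id, addr))

    def tick(self):
        # Throttle: at most num_channels requests reach memory per cycle;
        # everything else stays queued. Responses are relayed back per core.
        responses = []
        for _ in range(min(self.num_channels, len(self.pending))):
            core_id, addr = self.pending.popleft()
            responses.append((core_id, self.memory[addr]))
        return responses

mc = MemoryController(num_channels=2, memory=[0] * 256)
for core_id in range(4):
    mc.request(core_id, addr=core_id)
print(mc.tick())  # only 2 of the 4 requests are served this cycle
```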
**Cache:**

The same data is often requested from global memory by multiple cores. Constantly accessing global memory is expensive, and since the data has already been fetched once, it would be more efficient to store it on-device in SRAM to be retrieved much more quickly on later requests.

This is exactly what the cache is used for. Data retrieved from external memory is stored in cache and can be retrieved from there on later requests, freeing up memory bandwidth to be used for new data.
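
A minimal sketch of that behavior - modeled here as a simple fully-associative lookup, since the section doesn't pin down the actual cache organization:

```python
class Cache:
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}  # addr -> value; real caches use fixed-size SRAM lines

    def read(self, addr):
        if addr in self.lines:     # hit: served on-device, no memory traffic
            return self.lines[addr]
        value = self.memory[addr]  # miss: fetch from global memory once...
        self.lines[addr] = value   # ...and keep it for later requests
        return value

cache = Cache(memory=[5] * 256)
cache.read(10)         # miss: goes to global memory
print(cache.read(10))  # hit: served from cache -> 5
```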
## Core

Each core has a number of compute resources, often built around a certain number of threads it can support. In order to maximize parallelization, these resources need to be managed optimally to maximize resource utilization.

In this simplified GPU, each core processes one **block** at a time, and for each thread in a block, the core has a dedicated ALU, LSU, PC, and register file. Managing the execution of thread instructions on these resources is one of the most challenging problems in GPUs.

**Scheduler:**

Each core has a single scheduler that manages the execution of threads.

The main constraint the scheduler has to work around is the latency associated with loading & storing data from global memory. While most instructions can be executed synchronously, these load-store operations are asynchronous, meaning the rest of the instruction execution has to be built around these long wait times.
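
A toy sketch of a scheduler that executes synchronous instructions in lockstep but stalls the block until every thread's pending memory request resolves - the instruction names, the fixed 3-cycle latency, and the LSU interface are all illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class LSU:
    cycles_left: int = 0
    def issue(self, op): self.cycles_left = 3  # pretend memory takes 3 cycles
    def tick(self): self.cycles_left = max(0, self.cycles_left - 1)
    @property
    def busy(self): return self.cycles_left > 0

@dataclass
class Thread:
    lsu: LSU = field(default_factory=LSU)
    def execute(self, op): pass                      # synchronous ALU work

def run_block(instructions, threads):
    for op in instructions:
        if op in ("LDR", "STR"):                     # asynchronous path
            for t in threads:
                t.lsu.issue(op)
            while any(t.lsu.busy for t in threads):  # stall until all requests resolve
                for t in threads:
                    t.lsu.tick()
        else:
            for t in threads:                        # lockstep execution across threads
                t.execute(op)

run_block(["ADD", "LDR", "MUL"], [Thread() for _ in range(4)])
```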
**Fetcher:**

Asynchronously fetches the instruction at the current program counter from program memory (in practice, most fetches should hit the cache after a single block has been executed).

**Decoder:**

Decodes the fetched instruction into control signals for thread execution.
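
For example, a 16-bit instruction might be split into fields like this - the particular layout below (a 4-bit opcode plus three 4-bit register fields) is an assumption; the real one is defined by the ISA:

```python
def decode(instr):
    # Assumed layout: | opcode (4) | rd (4) | rs (4) | rt (4) |
    opcode = (instr >> 12) & 0xF
    rd = (instr >> 8) & 0xF
    rs = (instr >> 4) & 0xF
    rt = instr & 0xF
    # In hardware these fields become control signals, not a dict.
    return {"opcode": opcode, "rd": rd, "rs": rs, "rt": rt}

print(decode(0b0011_0001_0010_0011))  # fields 3, 1, 2, 3 under the assumed layout
```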
**Register Files:**

Each thread has its own dedicated set of register files. The register files hold the data that each thread is performing computations on, which enables the same-instruction multiple-data (SIMD) pattern.

Importantly, each register file contains a few read-only registers holding data about the current block & thread being executed locally, enabling kernels to be executed with different data based on the local thread id.
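
A sketch of how those read-only registers let every thread run identical code on different data - mapping them to the last three registers of a 16-register file is an assumption here:

```python
def make_register_file(block_idx, block_dim, thread_idx):
    regs = [0] * 16
    # Read-only registers describing this thread's position in the launch
    # (the register numbers below are illustrative):
    regs[13] = block_idx   # block id
    regs[14] = block_dim   # threads per block
    regs[15] = thread_idx  # local thread id
    return regs

# Same instructions, different data: each thread derives a unique global id.
for thread_idx in range(4):
    regs = make_register_file(block_idx=1, block_dim=4, thread_idx=thread_idx)
    print(regs[13] * regs[14] + regs[15])  # 4, 5, 6, 7
```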
**ALUs:**

Dedicated arithmetic-logic unit for each thread to perform computations. Handles the `ADD`, `SUB`, `MUL`, `DIV` arithmetic instructions.

Also handles the `CMP` comparison instruction, which outputs whether the difference between two registers is negative, zero or positive - and stores the result in the `NZP` register in the PC unit.
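
A sketch of `CMP`'s output, assuming the three outcomes are encoded as one-hot N, Z, and P bits (the encoding is an assumption):

```python
def cmp_nzp(a, b):
    diff = a - b
    n, z, p = int(diff < 0), int(diff == 0), int(diff > 0)
    return (n << 2) | (z << 1) | p  # destined for the NZP register in the PC unit

print(bin(cmp_nzp(3, 5)))  # 0b100: negative
print(bin(cmp_nzp(4, 4)))  # 0b10:  zero
print(bin(cmp_nzp(5, 4)))  # 0b1:   positive
```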
**LSUs:**

Dedicated load-store unit for each thread to access global data memory.

Handles the `LDR` & `STR` instructions - and manages the async wait times for memory requests to be processed and relayed by the memory controller.
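
A behavioral sketch of that async handshake as a small state machine - the state names and the callback interface are assumptions:

```python
class LSU:
    def __init__(self):
        self.state = "IDLE"    # IDLE -> WAITING -> DONE (simplified states)
        self.result = None

    def ldr(self, controller, addr):
        self.state = "WAITING"          # the core stalls this thread until DONE
        controller.request(self, addr)  # may be queued/throttled by the controller

    def on_response(self, value):       # called back by the memory controller
        self.result = value
        self.state = "DONE"

class InstantController:                # demo stand-in that responds immediately
    def __init__(self, memory): self.memory = memory
    def request(self, lsu, addr): lsu.on_response(self.memory[addr])

lsu = LSU()
lsu.ldr(InstantController(memory=[7] * 256), addr=3)
print(lsu.state, lsu.result)  # DONE 7
```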
**PCs:**

Dedicated program-counter for each thread to determine the next instruction to execute.

By default, the PC increments by 1 after every instruction.

With the `BRnzp` instruction, the PC unit checks whether the `NZP` register (set by a previous `CMP` instruction) matches some case - and if it does, it branches to a specific line of program memory. _This is how loops and conditionals are implemented._
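
A sketch of that next-PC decision (the instruction encoding here is illustrative):

```python
def next_pc(pc, instr, nzp_reg):
    # BRnzp branches when its condition bits overlap the stored NZP bits;
    # otherwise the PC just increments by 1.
    if instr["op"] == "BRnzp" and (instr["nzp"] & nzp_reg) != 0:
        return instr["imm"]  # jump to a specific line of program memory
    return pc + 1

loop_back = {"op": "BRnzp", "nzp": 0b001, "imm": 2}  # "branch if positive"
print(next_pc(5, loop_back, nzp_reg=0b001))  # 2: CMP result was positive, loop
print(next_pc(5, loop_back, nzp_reg=0b010))  # 6: CMP result was zero, fall through
```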
Since threads are processed in parallel, tiny-gpu assumes that all threads "converge" to the same program counter after each instruction - which is a naive assumption for the sake of simplicity.

In real GPUs, individual threads can branch to different PCs, causing **branch divergence**, where a group of threads initially being processed together has to split out into separate execution.