GPU_Programming_slides_2
Welcome
Figure: CPU vs. GPU design. The CPU spends its transistors on a few large ALUs with substantial Control logic and Cache; the GPU spends them on many small ALUs arranged in rows, each row with its own small Control and Cache; both are backed by DRAM.
CUDA Capable GPU Architecture
Figure: the Host and Input Assembler feed an array of streaming multiprocessors (SMs), each built from streaming processors (SPs), all sharing Global Memory.
Host:
In CUDA terminology, the host refers to the central processing unit
(CPU) and its associated memory.
The host is responsible for managing the overall execution of a
CUDA program, including:
a) Memory Management: Allocating and transferring data between
the host (CPU) memory and the device (GPU) memory.
b) Kernel Launching: Initiating functions (kernels) that execute on
the GPU.
c) Synchronization: Coordinating the execution flow between the
CPU and GPU to ensure correct program behavior.
The host prepares data, transfers it to the device, launches kernels
on the GPU, and retrieves the results after computation.
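A minimal sketch of that workflow, with each of the three host responsibilities marked; the kernel, array names, and sizes below are illustrative, not taken from these slides:

#include <cuda_runtime.h>

// hypothetical kernel, used only to make the host's three jobs concrete
__global__ void doubleElements(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 2.0f * data[i];
}

int main() {
    int N = 1024;
    float* h_data = new float[N];                        // host (CPU) memory
    for (int i = 0; i < N; i++) h_data[i] = (float) i;
    float* d_data;
    cudaMalloc((void**) &d_data, N * sizeof(float));     // (a) memory management
    cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);
    doubleElements<<<(N + 255) / 256, 256>>>(d_data, N); // (b) kernel launching
    cudaDeviceSynchronize();                             // (c) synchronization
    cudaMemcpy(h_data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    delete[] h_data;
    return 0;
}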
CUDA Capable GPU Architecture
Input Assembler:
The input assembler is a component of the GPU's
graphics pipeline (see Reference), primarily involved in
the initial stages of rendering graphics.
Its main functions include:
Vertex Fetching: Retrieving vertex data from memory.
Primitive Assembly: Organizing vertices into geometric
primitives such as points, lines, and triangles.
In the context of general-purpose computing with CUDA,
the input assembler is not directly utilized, as CUDA
focuses on computation rather than graphics rendering.
CUDA Capable GPU Architecture
Figure: the compute side of the same diagram: the array of streaming multiprocessors (SMs), each containing many streaming processors (SPs), connected to Global Memory.
Data Parallelism
Task Parallelism vs Data Parallelism
* Vector processing units and GPUs take care of data parallelism, where the same operation or set of operations needs to be performed over a large set of data.
* Most parallel programs utilize data parallelism to achieve performance improvements and scalability.
Task Parallelism vs Data Parallelism
Figure: the same '+' operation applied element by element across the whole data set.
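Vector addition is the standard example of data parallelism: the same '+' is applied at every index, and no iteration depends on another, so each iteration could become its own GPU thread. A plain CPU version of that loop (the names here are illustrative):

// data parallelism: identical, independent work per element
void vecAddSerial(const float* A, const float* B, float* C, int n) {
    for (int i = 0; i < n; i++) {
        C[i] = A[i] + B[i];   // iteration i reads and writes only element i
    }
}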
The CUDA compiler (nvcc, nvc++) is essentially a wrapper around another C/C++ compiler (gcc, g++, clang/LLVM): the CUDA compiler compiles the parts that run on the GPU, and the regular compiler compiles the parts that run on the CPU:
Figure: NVCC compilation flow. The NVCC Compiler splits the source into Host Code, which goes through the host C preprocessor, compiler, and linker, and Device Code, which goes through the device just-in-time compiler.
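A rough sketch of that split for a single .cu file (the file and kernel names are made up); building it with something like nvcc example.cu -o example lets nvcc route each part to the right tool:

// example.cu: one source file, two compilation paths
__global__ void scaleKernel(float* data, float s) {  // device code: compiled by nvcc for the GPU
    data[threadIdx.x] *= s;
}

int main() {                                         // host code: handed to the host C/C++ compiler
    float* d_data;
    cudaMalloc((void**) &d_data, 32 * sizeof(float));
    scaleKernel<<<1, 32>>>(d_data, 2.0f);            // the <<<...>>> launch syntax is rewritten by nvcc
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}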
The program will block waiting for the CUDA kernel to complete executing all of its parallel threads (the launch itself returns immediately; the host actually waits at a synchronization point such as cudaDeviceSynchronize() or a blocking cudaMemcpy() from the device).
int main() {
float* h_A = new float[100];
float* h_B = new float[100];
float* h_C = new float[100];
…
//assign elements into h_A, h_B
…
vecAdd(h_A, h_B, h_C, 100);
}
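Here vecAdd is an ordinary host function; its body is not shown on this slide, but a plausible sketch of what it does (an assumption that anticipates the cudaMalloc/cudaMemcpy/kernel-launch steps on the following slides) is:

// hypothetical host-side wrapper: allocate on the GPU, copy in, launch, copy out
void vecAdd(float* h_A, float* h_B, float* h_C, int n) {
    size_t bytes = n * sizeof(float);
    float *d_A, *d_B, *d_C;
    cudaMalloc((void**) &d_A, bytes);
    cudaMalloc((void**) &d_B, bytes);
    cudaMalloc((void**) &d_C, bytes);
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);
    vecAddKernel<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n);  // asynchronous launch
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);       // blocks until all threads finish
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}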
Syntax of cudaMemcpy:
cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, cudaMemcpyKind kind);
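Like most CUDA runtime calls, cudaMemcpy returns a cudaError_t that should be checked. For example (reusing the d_A/h_A names from the skeleton):

cudaError_t err = cudaMemcpy(d_A, h_A, size * sizeof(float), cudaMemcpyHostToDevice);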
if (err != cudaSuccess) {
    printf("%s in file %s at line %d\n",
           cudaGetErrorString(err),
           __FILE__, __LINE__);
    exit(EXIT_FAILURE);
}
Skeleton Vector Add in CUDA
int main() {
int size = 100;
float* h_A = new float[size];
float* h_B = new float[size];
float* h_C = new float[size];
//assign elements into h_A, h_B
…
//allocate enough memory for h_A, h_B, h_C as d_A, d_B, d_C
//on the GPU
float *d_A, *d_B, *d_C;
cudaMalloc((void**) &d_A, size * sizeof(float));
cudaMalloc((void**) &d_B, size * sizeof(float));
cudaMalloc((void**) &d_C, size * sizeof(float));
//copy the memory of h_A and h_B onto the GPU
cudaMemcpy(d_A, h_A, size * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size * sizeof(float), cudaMemcpyHostToDevice);
//... launch the vector add kernel here (shown below) ...
//copy the memory of d_C on the GPU back into h_C on the CPU
cudaMemcpy(h_C, d_C, size * sizeof(float), cudaMemcpyDeviceToHost);
For our GPU vector add, this will work something like this:
int main() {
…
//this will create ceil(size/256.0) blocks, each with 256 threads
//that will each run the vecAddKernel.
vecAddKernel<<<ceil(size/256.0), 256>>>(d_A, d_B, d_C, size);
…
}
Vector Add Kernel
blockDim: this variable is constant across all blocks and stores the number of threads along each dimension of the block. For a one-dimensional block, we refer only to blockDim.x. (blockIdx gives each block its index within the grid, and threadIdx gives each thread its index within its block.)
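The kernel body itself is not reproduced above, so the following is a minimal sketch of a vector add kernel consistent with the launch configuration used earlier: each thread combines blockIdx.x, blockDim.x, and threadIdx.x into the index of the one element it is responsible for, and the bounds check guards the extra threads in the last block.

// one thread per element; extra threads in the last block do nothing
__global__ void vecAddKernel(float* A, float* B, float* C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index of this thread
    if (i < n) {
        C[i] = A[i] + B[i];
    }
}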
1) Write a CUDA program that takes an array of integers and negates each
element.
2) Write a CUDA program to initialize an array with a constant value. Pass the
value from the CPU code.
Watch List:
https://www.youtube.com/watch?v=usY0643pYs8
https://www.youtube.com/watch?v=cRY5utouJzQ&t=60s
https://www.youtube.com/watch?v=OJuA3DZNfz8
https://www.youtube.com/watch?v=uTyYNPU4mGQ