GPU_Programming_slides_2

The document outlines a GPU Programming course (CSGG3018) led by instructor Amit Gurung, covering CUDA architecture, data parallelism, and program structure. It explains the differences between CPUs and GPUs, details CUDA memory operations, and provides examples of vector addition in both C and CUDA. The document also discusses kernel functions, threading, and the use of CUDA keywords for efficient GPU programming.


GPU Programming

Course Code: CSGG3018


Instructor: AMIT GURUNG
Email: [email protected]

Jan – June, 2025


CUDA
Overview
1. GPU Architecture
2. Data Parallelism
3. CUDA Program Structure
4. Vector Addition
5. Device Global Memory and Data Transfer
6. Kernel Functions and Threading
GPU Architecture

Some general terminology:


A device refers to a GPU in the computer.
A host refers to a CPU in the computer.
GPUs vs CPUs

[Figure: a CPU with a few large ALUs, substantial control logic, and a large cache, next to a GPU with many rows of small ALUs that share control logic and cache; each chip is attached to its own DRAM.]

CPUs are designed to minimize the execution time of single threads.

GPUs are designed to maximize the throughput of many identical threads working on different memory.
CUDA Capable GPU Architecture

In a modern CUDA capable GPU, multiple streaming multiprocessors (SMs) contain multiple streaming processors (SPs) that share the same control logic and instruction cache.

[Figure: the host feeds an Input Assembler and a Thread Execution Manager; an array of SMs, each holding many SPs with shared cache, texture, and load/store units, sits above a shared Global Memory.]
CUDA Capable GPU Architecture
Host:

In CUDA terminology, the host refers to the central processing unit
(CPU) and its associated memory.

The host is responsible for managing the overall execution of a
CUDA program, including:
a) Memory Management: Allocating and transferring data between
the host (CPU) memory and the device (GPU) memory.
b) Kernel Launching: Initiating functions (kernels) that execute on
the GPU.
c) Synchronization: Coordinating the execution flow between the
CPU and GPU to ensure correct program behavior.

The host prepares data, transfers it to the device, launches kernels
on the GPU, and retrieves the results after computation.
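
The list above maps directly onto a handful of CUDA runtime calls. The following is a minimal sketch of that host-side flow, not taken from the slides; scaleKernel and the variable names are illustrative:

#include <cuda_runtime.h>

// Illustrative kernel used only to show the host's three responsibilities.
__global__ void scaleKernel(float* d_data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] *= 2.0f;
}

int main() {
    const int n = 1024;
    const size_t size = n * sizeof(float);
    float* h_data = new float[n];
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    // a) Memory management: allocate device memory and transfer the input to the GPU.
    float* d_data;
    cudaMalloc((void**)&d_data, size);
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);

    // b) Kernel launching: start the computation on the GPU.
    scaleKernel<<<(n + 255) / 256, 256>>>(d_data, n);

    // c) Synchronization: wait for the GPU before using its results.
    cudaDeviceSynchronize();

    // Retrieve the results and release device memory.
    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    delete[] h_data;
    return 0;
}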
CUDA Capable GPU Architecture
Input Assembler:

The input assembler is a component of the GPU's
graphics pipeline (see Reference), primarily involved in
the initial stages of rendering graphics.

Its main functions include:

Vertex Fetching: Retrieving vertex data from memory.

Primitive Assembly: Organizing vertices into geometric
primitives such as points, lines, and triangles.

In the context of general-purpose computing with CUDA,
the input assembler is not directly utilized, as CUDA
focuses on computation rather than graphics rendering.
CUDA Capable GPU Architecture
Thread Execution Manager:



The thread execution manager is a crucial component of
the GPU architecture that oversees the scheduling and
execution of threads on the GPU.

Its responsibilities include:

Thread Scheduling: Managing the concurrent execution of
thousands of threads across multiple streaming multiprocessors
(SMs).

Resource Allocation: Distributing computational resources such
as registers and shared memory among active threads.

Latency Hiding: Switching between threads to mask memory
access latencies, ensuring efficient utilization of the GPU's
computational units.
CUDA Capable GPU Architecture

CUDA programs define kernels, which are loaded into the instruction caches on each SM. The kernel allows for hundreds or thousands of threads to perform the same instructions over different data.

[Figure: the same SM/SP architecture diagram as before, with the kernel's instructions resident in each SM's instruction cache.]
Data Parallelism
Task Parallelism vs Data Parallelism
* Vector processing units and GPUs take care of data parallelism, where the same operation or set of operations needs to be performed over a large set of data.
* Most parallel programs utilize data parallelism to achieve performance improvements and scalability.
Task Parallelism vs Data Parallelism

Task parallelism, however, is when a large


set of independent (or mostly independent)
tasks need to be performed.
The tasks are generally much more complex
than the operations performed by data
parallel programs.
Task Parallelism vs Data Parallelism

It is worth mentioning that both can be combined: the tasks can contain data parallel elements.

Further, in many ways systems are getting closer to doing both. GPU hardware and programming languages like CUDA are allowing more operations to be performed in parallel for data parallel tasks.
Data Processing Example

For example, vector addition is data parallel:

  Vector A:  A[0]  A[1]  A[2]  A[3]  ...  A[N]
               +     +     +     +          +
  Vector B:  B[0]  B[1]  B[2]  B[3]  ...  B[N]
               =     =     =     =          =
  Vector C:  C[0]  C[1]  C[2]  C[3]  ...  C[N]

Each element C[i] is computed independently as A[i] + B[i].
CUDA Program Structure
CUDA Program Structure

CUDA programs are C/C++ programs containing code that uses CUDA extensions (.cu).

The CUDA compiler (nvcc) is essentially a wrapper around another C/C++ compiler (gcc, g++, llvm). The CUDA compiler compiles the parts for the GPU and the regular compiler compiles the parts for the CPU:

[Figure: integrated C programs with CUDA extensions are fed to the NVCC compiler, which separates host code (handled by the host C preprocessor, compiler, and linker) from device code (handled by the device just-in-time compiler), producing a program for a heterogeneous computing platform with CPUs and GPUs.]
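As a minimal sketch of this split (assuming the standard nvcc toolchain; the file and kernel names are illustrative, not from the slides), a single .cu file mixes host and device code and nvcc separates the two:

// example.cu -- illustrative; compile with: nvcc example.cu -o example
#include <cstdio>
#include <cuda_runtime.h>

// Device code: compiled by nvcc for the GPU.
__global__ void helloKernel() {
    printf("Hello from GPU thread %d\n", (int)threadIdx.x);
}

// Host code: handed to the regular C/C++ compiler for the CPU.
int main() {
    helloKernel<<<1, 4>>>();   // launch 1 block of 4 threads on the device
    cudaDeviceSynchronize();   // wait so the device-side printf output appears
    return 0;
}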


CUDA Program Structure

When CUDA programs run, they alternate between CPU serial code and GPU parallel kernels (which execute many threads in parallel).

The program blocks waiting for the CUDA kernel to complete executing all its parallel threads before the serial code continues (kernel launches themselves are asynchronous, so this wait happens at the next synchronization point, such as a blocking memory copy or an explicit synchronization call).

[Figure: execution alternates between CPU serial code and GPU parallel kernels; each launch, KernelA<<<nBlocks, nThreads>>>(args), spawns many blocks of threads.]
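
A small self-contained sketch of this alternation (illustrative names, not from the slides); the host resumes its serial code only after synchronizing with the device:

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: each thread increments one element.
__global__ void incKernel(int* d_v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_v[i] += 1;
}

int main() {
    const int n = 8;
    int h_v[n] = {0};
    int* d_v;
    cudaMalloc((void**)&d_v, n * sizeof(int));
    cudaMemcpy(d_v, h_v, n * sizeof(int), cudaMemcpyHostToDevice);

    // CPU serial code ... then a GPU parallel kernel:
    incKernel<<<1, n>>>(d_v, n);
    cudaDeviceSynchronize();   // host blocks here until every thread of the kernel finishes

    // CPU serial code resumes ... then another GPU parallel kernel:
    incKernel<<<1, n>>>(d_v, n);
    cudaMemcpy(h_v, d_v, n * sizeof(int), cudaMemcpyDeviceToHost);   // a blocking copy also waits for the kernel

    printf("h_v[0] = %d\n", h_v[0]);   // prints 2 after the two launches
    cudaFree(d_v);
    return 0;
}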
Vector Addition
Vector Addition in C
//Compute vector sum h_C = h_A + h_B
void vecAdd(float* h_A, float* h_B, float* h_C, int n) {
for (int i = 0; i < n; i++) {
h_C[i] = h_A[i] + h_B[i];
}
}

int main() {
float* h_A = new float[100];
float* h_B = new float[100];
float* h_C = new float[100];

//assign elements into h_A, h_B

vecAdd(h_A, h_B, h_C, 100);
}

The above is a simple C/C++ program which performs vector addition. The arrays h_A and h_B are added, with the results being stored into h_C.
Skeleton Vector Add in CUDA
It's important to note that GPUs and CPUs do not share the same memory, so in a CUDA program you have to move the memory back and forth.

[Figure: the host (CPU) with its host memory (RAM) alongside the device (GPU) with its global memory; data must be copied between the two.]
Skeleton Vector Add in CUDA
//We want to parallelize this!
//Compute vector sum h_C = h_A + h_B
void vecAdd(float* h_A, float* h_B, float* h_C, int n) {
for (int i = 0; i < n; i++) {
h_C[i] = h_A[i] + h_B[i];
}
}

int main() {
float* h_A = new float[100];
float* h_B = new float[100];
float* h_C = new float[100];

//assign elements into h_A, h_B

//allocate enough memory for h_A, h_B, h_C on the GPU


//copy the memory of h_A and h_B onto the GPU
vecAddCUDA(h_A, h_B, h_C, 100); //perform this on the GPU!
//copy the memory of h_C on the GPU back into h_C on the CPU
}
Device Global Memory and Data Transfer

Memory Operations in CUDA
CUDA provides three functions for allocating memory on, freeing memory on, and moving memory to and from GPUs:

cudaMalloc(); //allocates memory on the GPU (like malloc)

cudaFree(); //frees memory on the GPU (like free)

cudaMemcpy(); //copies memory between the host and the device (like memcpy)
Memory Operations in CUDA

cudaMalloc(); //allocates memory on the GPU (like malloc)
cudaFree(); //frees memory on the GPU (like free)
cudaMemcpy(); //copies memory between the host and the device (like memcpy)

All three functions return a cudaError_t type, which can be used to test for error conditions, e.g.:

cudaError_t err = cudaMalloc((void**) &d_A, size);

(void**)&d_A is the address of the pointer that will point to the allocated device memory.

size is the total size in bytes to allocate.
Memory Operations in CUDA

Syntax of cudaMemcpy:
cudaError_t cudaMemcpy(void* dst, const void* src, size_t count,
cudaMemcpyKind kind);

dst: Destination pointer (d_A in device memory).

src: Source pointer (h_A in host memory).

count: Number of bytes to copy (size).

kind: Direction of copy (e.g., cudaMemcpyHostToDevice indicates a host to device transfer).
Memory Operations in CUDA

All three functions return a cudaError_t type, which can be used to test for error conditions, e.g.:

cudaError_t err = cudaMalloc((void**) &d_A, size);

if (err != cudaSuccess) {
    printf("%s in file %s at line %d\n",
           cudaGetErrorString(err),
           __FILE__, __LINE__);
    exit(EXIT_FAILURE);
}
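
Because every one of these calls returns a cudaError_t, a common convenience (not shown in the slides) is to wrap them in a small checking macro; CUDA_CHECK is an illustrative name:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative error-checking macro: wraps any CUDA runtime call that returns cudaError_t.
#define CUDA_CHECK(call)                                             \
    do {                                                             \
        cudaError_t err = (call);                                    \
        if (err != cudaSuccess) {                                    \
            printf("%s in file %s at line %d\n",                     \
                   cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Usage:
// CUDA_CHECK(cudaMalloc((void**) &d_A, size));
// CUDA_CHECK(cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice));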
Skeleton Vector Add in CUDA

int main() {
int n = 100;
size_t size = n * sizeof(float);
float* h_A = new float[n];
float* h_B = new float[n];
float* h_C = new float[n];
//assign elements into h_A, h_B

//allocate enough memory for h_A, h_B, h_C as d_A, d_B, d_C
//on the GPU
float *d_A, *d_B, *d_C;
cudaMalloc((void**) &d_A, size);
cudaMalloc((void**) &d_B, size);
cudaMalloc((void**) &d_C, size);

//copy the memory of h_A and h_B onto the GPU
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

vecAddCUDA(d_A, d_B, d_C, n); //perform this on the GPU!

//copy the memory of d_C on the GPU back into h_C on the CPU
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

//free the memory on the device
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
Kernel Functions and Threading
CUDA Grids

CUDA assigns groups of threads to blocks, and groups of blocks to grids. For now, we'll just worry about blocks of threads.

Using the blockIdx, blockDim, and threadIdx keywords, you can determine which thread is being run in the kernel, and from which block.

For our GPU vector add, this will work something like this:

Block 0:    threads 0, 1, 2, ..., 255
Block 1:    threads 0, 1, 2, ..., 255
...
Block N-1:  threads 0, 1, 2, ..., 255

Within each block, every thread computes:
    i = (blockIdx.x * blockDim.x) + threadIdx.x;
    d_C[i] = d_A[i] + d_B[i];
Vector Add Kernel

//Computes the vector sum on the GPU.


__global__ void vecAddKernel(float* A, float* B, float* C, int n) {
int i = threadIdx.x + (blockDim.x * blockIdx.x);
if (i < n) C[i] = A[i] + B[i];
}

int main() {

//this will create ceil(size/256.0) blocks, each with 256 threads
//that will each run the vecAddKernel.
vecAddKernel<<<ceil(size/256.0), 256>>>(d_A, d_B, d_C, size);

}
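
To see why the launch configuration uses ceil(size/256.0), here is a worked example with illustrative numbers (not from the slides):

// Suppose size = 1000 and each block has 256 threads:
//   ceil(1000 / 256.0) = ceil(3.906...) = 4 blocks
//   4 blocks * 256 threads = 1024 threads launched
// Threads with i >= 1000 fail the "if (i < n)" test and do nothing,
// so exactly the first 1000 threads each compute one element of C.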
Vector Add Kernel

blockIdx: The index of the current block within the grid. For a one-dimensional grid, we refer only to blockIdx.x.

blockDim: The dimensions of the thread block (the number of threads along each dimension); this variable is a constant for all blocks and is accessible within the kernel. For a one-dimensional block, we refer only to blockDim.x.

threadIdx: The index of the current thread within its block.
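
For example (a worked sketch with illustrative numbers), a thread with threadIdx.x == 5 in block 2 of a grid whose blocks hold 256 threads computes:

// blockDim.x = 256, blockIdx.x = 2, threadIdx.x = 5
// i = (blockIdx.x * blockDim.x) + threadIdx.x
//   = (2 * 256) + 5
//   = 517   -> this thread handles element 517 of the vectors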


CUDA keywords

Function Qualifier                 Executed on the:    Only callable from the:
__device__ float deviceFunc()      device              device
__global__ void KernelFunc()       device              host
__host__ float HostFunc()          host                host

By default, all functions are __host__ functions, so you typically do not see this qualifier frequently in CUDA code.

__device__ functions are for use within kernel functions; they can only be called and used on the GPU.

__global__ functions are callable from the CPU, but they are executed on the GPU. These functions cannot return a value and so must be declared void.
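
A short sketch (not from the slides) showing the three qualifiers together; the function names are illustrative:

#include <cuda_runtime.h>

// __device__ function: callable only from code already running on the GPU.
__device__ float square(float x) {
    return x * x;
}

// __global__ kernel: launched from the host, executed on the device, must return void.
__global__ void squareKernel(float* d_data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] = square(d_data[i]);
}

// __host__ function (the default): ordinary CPU code that launches the kernel.
__host__ void runSquare(float* d_data, int n) {
    squareKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();
}

int main() {
    const int n = 16;
    float* d_data;
    cudaMalloc((void**)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));
    runSquare(d_data, n);
    cudaFree(d_data);
    return 0;
}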
Code for Vector Addition in GPU

#include <iostream>
#include <cuda_runtime.h>

// CUDA kernel for vector addition
__global__ void vecAddKernel(float* d_A, float* d_B, float* d_C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        d_C[i] = d_A[i] + d_B[i];
    }
}

int main() {
    int n = 4;
    size_t size = n * sizeof(float);

    // Allocate host memory
    float* h_A = new float[n];
    float* h_B = new float[n];
    float* h_C = new float[n];

    // Initialize host arrays
    for (int i = 0; i < n; ++i) {
        h_A[i] = 5.0f;
        h_B[i] = 5.0f;
    }

    // Allocate device memory
    float *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);

    // Copy data from host to device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Define block and grid sizes
    int blockSize = 256;
    int gridSize = (n + blockSize - 1) / blockSize;

    // Launch the vector addition kernel
    vecAddKernel<<<gridSize, blockSize>>>(d_A, d_B, d_C, n);

    // Copy result from device to host
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Display the result
    for (int i = 0; i < n; ++i) {
        std::cout << "h_C[" << i << "] = " << h_C[i] << std::endl;
    }

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    // Free host memory
    delete[] h_A;
    delete[] h_B;
    delete[] h_C;

    return 0;
}
A few simple exercises

1) Write a CUDA program that takes an array of integers and negates each element.

2) Write a CUDA program to initialize an array with a constant value. Pass the value from the CPU code.

3) Write a CUDA program that computes the element-wise square of a given array.
References

1. GPU Programming, IBM Publication.
2. https://learn.microsoft.com/en-us/windows/win32/direct3d11/overviews-direct3d-11-graphics-pipeline

Watch List:
https://www.youtube.com/watch?v=usY0643pYs8
https://www.youtube.com/watch?v=cRY5utouJzQ&t=60s
https://www.youtube.com/watch?v=OJuA3DZNfz8
https://www.youtube.com/watch?v=uTyYNPU4mGQ
