Lecture 25
Over the past three lectures, we’ve been talking about the architecture of the CPU and how it affects the performance of machine learning models. However, the CPU is not the only type of hardware that machine learning models are trained or run on. In fact, most modern DNN training happens not on CPUs but on GPUs. In this lecture, we’ll look at why GPUs became dominant for machine learning training, and we’ll explore what makes their architecture uniquely well-suited to large-scale numerical computation.
A brief history of GPU computing. GPUs were originally designed to support the 3D graphics pipeline, much of which was driven by demand for video games with ever-increasing graphical fidelity. Important properties of 3D graphics rendering:
• Lots of opportunities for parallelism—rendering different pixels/objects in the scene can be done si-
multaneously.
The first era of GPUs ran a fixed-function graphics pipeline. They weren’t programmed, but instead just configured to use a set of fixed functions designed for specific graphics tasks: mostly drawing and shading polygons in 3D space. In the early 2000s, there was a shift towards programmable GPUs. These programmable GPUs let developers customize certain stages of the graphics pipeline by writing small programs called shaders, which process the vertices and pixels of the polygons to be rendered in custom ways. These shaders were capable of very high-throughput parallel processing, since they needed to process and render very large numbers of polygons in a single frame of a 3D animation.
The GPU supported a much higher degree of parallelism than the CPU. But unlike multi-threaded CPUs, which supported computing different functions at the same time, the GPU focused on computing the same function simultaneously on multiple elements of data. That is, the GPU could run the same function on a bunch of triangles in parallel, but couldn’t easily compute a different function for each triangle. This illustrates a distinction between two types of parallelism (a code sketch follows the list below):
• Data parallelism involves the same operations being computed in parallel on many different data
elements.
• Task parallelism involves different operations being computed in parallel (on either the same data or
different data).
How are the different types of parallelism we’ve discussed categorized according to this distinction?
• SIMD/Vector parallelism?
• Multi-core/multi-thread parallelism?
• Distributed computing?
The general-purpose GPU. Eventually, people started to use GPUs for tasks other than graphics rendering. However, working within the structure of the GPU’s graphics pipeline placed limits on this. To better support general-purpose GPU programming, in 2007 NVIDIA released CUDA, a parallel programming language and computing platform for general-purpose computation on the GPU. (Other companies such as Intel and AMD have competing products as well.) Now, programmers no longer needed to go through the 3D graphics API to access the parallel computing capabilities of the GPU. This led to a revolution in GPU computing, with several major applications, including:
• Cryptocurrencies
A function executed on the GPU in a CUDA program is called a kernel. An illustration of this from the CUDA
C programming guide:
// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    ...
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
    ...
}
This syntax launches N threads, each of which performs a single addition. Importantly, spinning up many threads like this can be reasonably fast on a GPU, whereas on a CPU it would be far too slow due to the overhead of creating new threads. Additionally, GPUs support running far more parallel threads than a CPU does.
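For concreteness, here is a separate, self-contained sketch of a complete vector-add program (the value N = 256 and the host-side setup are illustrative assumptions, not part of the guide’s excerpt): it allocates device memory, copies the inputs to the GPU, launches the kernel, and copies the result back.

// Sketch of a complete vector-add program. N = 256 fits in a single
// thread block; larger N would need multiple blocks.
#include <cuda_runtime.h>
#include <cstdio>

#define N 256

// Same kernel as in the excerpt above.
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    float hA[N], hB[N], hC[N];
    for (int i = 0; i < N; i++) { hA[i] = i; hB[i] = 2.0f * i; }

    // Allocate device memory and copy the inputs to the GPU.
    float *dA, *dB, *dC;
    cudaMalloc(&dA, N * sizeof(float));
    cudaMalloc(&dB, N * sizeof(float));
    cudaMalloc(&dC, N * sizeof(float));
    cudaMemcpy(dA, hA, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, N * sizeof(float), cudaMemcpyHostToDevice);

    // Launch one block of N threads; each thread adds one element.
    VecAdd<<<1, N>>>(dA, dB, dC);

    // Copy the result back to the host and print a sample entry.
    cudaMemcpy(hC, dC, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("C[10] = %f\n", hC[10]);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}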
An important downside of GPU threads: they’re data-parallel only. You can’t direct each individual thread to “do its own thing” or run its own independent computation (although GPUs do support task parallelism to a limited extent).
GPUs in machine learning. Because of their large amount of data parallelism, GPUs provide an ideal substrate for large-scale numerical computation. In particular, GPUs can perform matrix multiplies very fast. Just like BLAS on the CPU, NVIDIA provides an optimized library, cuBLAS, that does matrix multiplies efficiently on its GPUs. There’s even a specialized library of primitives designed for deep learning: cuDNN.
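As a rough sketch of what calling cuBLAS looks like (the matrix size and values below are made up for illustration), a single-precision matrix multiply C = A * B can be issued with cublasSgemm; note that cuBLAS assumes column-major storage.

// Sketch: multiplying two n-by-n matrices with cuBLAS (column-major layout).
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

int main()
{
    const int n = 512;  // example size (assumption)
    std::vector<float> hA(n * n, 1.0f), hB(n * n, 2.0f), hC(n * n, 0.0f);

    // Allocate device memory and copy the input matrices to the GPU.
    float *dA, *dB, *dC;
    cudaMalloc(&dA, n * n * sizeof(float));
    cudaMalloc(&dB, n * n * sizeof(float));
    cudaMalloc(&dC, n * n * sizeof(float));
    cudaMemcpy(dA, hA.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // C = alpha * A * B + beta * C (single precision, no transposes).
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

    // Copy the result back and check one entry: expect 512 * 1 * 2 = 1024.
    cudaMemcpy(hC.data(), dC, n * n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("C[0] = %f\n", hC[0]);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}

Compiling this requires linking against the cuBLAS library, e.g. nvcc gemm.cu -lcublas.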
Machine learning frameworks, such as TensorFlow, are designed to support computation on GPUs, and moving the training of a deep net onto a GPU can decrease training time by an order of magnitude. How do machine learning frameworks support GPU computing?