CUDA NVIDIA Guide, Part 1
1. Introduction
1.1. The Benefits of Using GPUs
The Graphics Processing Unit (GPU) [1] provides much higher instruction throughput
and memory bandwidth than the CPU within a similar price and power envelope. Many
applications leverage these higher capabilities to run faster on the GPU than on the
CPU (see GPU Applications). Other computing devices, like FPGAs, are also very
energy efficient, but offer much less programming flexibility than GPUs.
This difference in capabilities between the GPU and the CPU exists because they are
designed with different goals in mind. While the CPU is designed to excel at executing
a sequence of operations, called a thread, as fast as possible and can execute a few
tens of these threads in parallel, the GPU is designed to excel at executing thousands
of them in parallel (amortizing the slower single-thread performance to achieve
greater throughput).
The GPU is specialized for highly parallel computations and therefore designed such
that more transistors are devoted to data processing rather than data caching and
flow control. The schematic Figure 1 shows an example distribution of chip resources
for a CPU versus a GPU.
Figure 1: The GPU Devotes More Transistors to Data Processing
In general, an application has a mix of parallel parts and sequential parts, so systems
are designed with a mix of GPUs and CPUs in order to maximize overall performance.
Applications with a high degree of parallelism can exploit this massively parallel nature
of the GPU to achieve higher performance than on the CPU.
CUDA comes with a software environment that allows developers to use C++ as a
high-level programming language. As illustrated by Figure 2, other languages,
application programming interfaces, or directives-based approaches are supported,
such as Fortran, DirectCompute, and OpenACC.
Figure 2: GPU Computing Applications. CUDA is designed to support various languages
and application programming interfaces.
The CUDA parallel programming model is designed to overcome the challenge of
transparently scaling parallelism across GPUs with widely varying numbers of cores,
while maintaining a low learning curve for programmers familiar with standard
programming languages such as C.
At its core are three key abstractions — a hierarchy of thread groups, shared
memories, and barrier synchronization — that are simply exposed to the programmer
as a minimal set of language extensions.
This scalable programming model allows the GPU architecture to span a wide market
range by simply scaling the number of multiprocessors and memory partitions: from
the high-performance enthusiast GeForce GPUs and professional Quadro and Tesla
computing products to a variety of inexpensive, mainstream GeForce GPUs (see
CUDA-Enabled GPUs for a list of all CUDA-enabled GPUs).
Note
[1]: The graphics qualifier comes from the fact that when the GPU was originally created,
two decades ago, it was designed as a specialized processor to accelerate graphics
rendering. Driven by the insatiable market demand for real-time, high-definition, 3D
graphics, it has evolved into a general processor used for many more workloads than
just graphics rendering.
2. Programming Model
This chapter introduces the main concepts behind the CUDA programming model by
outlining how they are exposed in C++.
Full code for the vector addition example used in this chapter and the next can be
found in the vectorAdd CUDA sample.
2.1. Kernels
CUDA C++ extends C++ by allowing the programmer to define C++ functions, called
kernels, that, when called, are executed N times in parallel by N different CUDA threads,
as opposed to only once like regular C++ functions.
A kernel is defined using the __global__ declaration specifier and the number of
CUDA threads that execute that kernel for a given kernel call is specified using a new
<<<...>>> execution configuration syntax (see Execution Configuration ). Each thread
that executes the kernel is given a unique thread ID that is accessible within the kernel
through built-in variables.
As an illustration, the following sample code, using the built-in variable threadIdx,
adds two vectors A and B of size N and stores the result into vector C.
// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    ...
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
    ...
}
Here, each of the N threads that execute VecAdd() performs one pair-wise addition.
The index of a thread and its thread ID relate to each other in a straightforward way:
For a one-dimensional block, they are the same; for a two-dimensional block of size
(Dx, Dy), the thread ID of a thread of index (x, y) is (x + y Dx); for a three-dimensional
block of size (Dx, Dy, Dz), the thread ID of a thread of index (x, y, z) is (x + y Dx + z Dx Dy).
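For illustration, the following minimal sketch (not part of the original samples; the kernel name and output array are hypothetical) computes this flattened thread ID from the built-in variables, with blockDim supplying (Dx, Dy, Dz):

// Sketch: flattened thread ID for a three-dimensional block,
// following the formula x + y Dx + z Dx Dy given above.
__global__ void FlattenedId(int* ids)
{
    int tid = threadIdx.x
            + threadIdx.y * blockDim.x
            + threadIdx.z * blockDim.x * blockDim.y;
    ids[tid] = tid;  // each thread writes its own thread ID
}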
As an example, the following code adds two matrices A and B of size NxN and stores
the result into matrix C.
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N],
                       float C[N][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ...
    // Kernel invocation with one block of N * N * 1 threads
    int numBlocks = 1;
    dim3 threadsPerBlock(N, N);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}
There is a limit to the number of threads per block, since all threads of a block are
expected to reside on the same streaming multiprocessor core and must share the
limited memory resources of that core. On current GPUs, a thread block may contain
up to 1024 threads.
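The exact limits for a given device can be queried at runtime. As a small sketch (assuming device 0), cudaGetDeviceProperties exposes them through the maxThreadsPerBlock and maxThreadsDim fields:

#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max block dimensions: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    return 0;
}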
The number of threads per block and the number of blocks per grid specified in the
<<<...>>> syntax can be of type int or dim3 . Two-dimensional blocks or grids can be
specified as in the example above.
Each block within the grid can be identified by a one-dimensional, two-dimensional, or
three-dimensional unique index accessible within the kernel through the built-in
blockIdx variable. The dimension of the thread block is accessible within the kernel
through the built-in blockDim variable.
Extending the previous MatAdd() example to handle multiple blocks, the code
becomes as follows.
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N],
                       float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ...
    // Kernel invocation
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}
A thread block size of 16x16 (256 threads), although arbitrary in this case, is a
common choice. The grid is created with enough blocks to have one thread per matrix
element as before. For simplicity, this example assumes that the number of threads
per grid in each dimension is evenly divisible by the number of threads per block in
that dimension, although that need not be the case.
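When N is not evenly divisible by the block size, a common pattern, sketched below rather than taken from the original sample, is to round the number of blocks up and rely on the bounds check inside the kernel:

// Sketch: round the grid size up so every element is covered even when N is
// not a multiple of the block size; the kernel's (i < N && j < N) check
// discards the extra threads.
dim3 threadsPerBlock(16, 16);
dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
               (N + threadsPerBlock.y - 1) / threadsPerBlock.y);
MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);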
Threads within a block can cooperate by sharing data through some shared memory
and by synchronizing their execution to coordinate memory accesses. More precisely,
one can specify synchronization points in the kernel by calling the __syncthreads()
intrinsic function; __syncthreads() acts as a barrier at which all threads in the block
must wait before any is allowed to proceed. Shared Memory gives an example of using
shared memory. In addition to __syncthreads() , the Cooperative Groups API provides a
rich set of thread-synchronization primitives.
For efficient cooperation, shared memory is expected to be a low-latency memory
near each processor core (much like an L1 cache) and __syncthreads() is expected to
be lightweight.
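As a simplified sketch of this cooperation (not the Shared Memory example referenced above; the kernel name is hypothetical and a launch with 256 threads per block is assumed), a block might stage data in shared memory and synchronize before reading values written by other threads:

// Sketch: each thread stages one element into shared memory, the block
// synchronizes, and each thread then reads an element written by another thread.
__global__ void ReverseInBlock(float* data)
{
    __shared__ float tile[256];              // shared by all threads of the block
    int i = threadIdx.x;
    tile[i] = data[blockIdx.x * blockDim.x + i];
    __syncthreads();                         // all writes to tile are visible past this barrier
    data[blockIdx.x * blockDim.x + i] = tile[blockDim.x - 1 - i];
}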
With compute capability 9.0, the CUDA programming model introduces an optional level
of the thread hierarchy called thread block clusters, which are made up of thread
blocks. Similar to thread blocks, clusters are organized into a one-dimensional,
two-dimensional, or three-dimensional grid of thread block clusters, as illustrated by
Figure 5. The number of thread blocks in a cluster can be user-defined, and a maximum
of 8 thread blocks in a cluster is supported as a portable cluster size in CUDA. Note
that on GPU hardware or MIG configurations that are too small to support 8
multiprocessors, the maximum cluster size is reduced accordingly. Identification of
these smaller configurations, as well as of larger configurations supporting a thread
block cluster size beyond 8, is architecture-specific and can be queried using the
cudaOccupancyMaxPotentialClusterSize API.
Note
In a kernel launched using cluster support, the gridDim variable still denotes the
size in terms of number of thread blocks, for compatibility purposes. The rank of a
block in a cluster can be found using the Cluster Group API.
A thread block cluster can be enabled for a kernel either with the compile-time kernel
attribute __cluster_dims__(X,Y,Z) or with the CUDA kernel launch API
cudaLaunchKernelEx. The example below shows how to launch a cluster using the
compile-time kernel attribute. A cluster size set via the kernel attribute is fixed at
compile time, and the kernel can then be launched with the classical <<<...>>> syntax.
If a kernel uses a compile-time cluster size, the cluster size cannot be modified when
launching the kernel.
// Kernel definition
// Compile time cluster size 2 in X-dimension and 1 in Y and Z dimension
__global__ void __cluster_dims__(2, 1, 1) cluster_kernel(float *input, float* output)
{

}

int main()
{
    float *input, *output;
    // Kernel invocation with compile time cluster size
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);

    // The grid dimension is not affected by cluster launch, and is still enumerated
    // using number of blocks.
    // The grid dimension must be a multiple of cluster size.
    cluster_kernel<<<numBlocks, threadsPerBlock>>>(input, output);
}
A thread block cluster size can also be set at runtime and the kernel can be launched
using the CUDA kernel launch API cudaLaunchKernelEx . The code example below shows
how to launch a cluster kernel using the extensible API.
// Kernel definition
// No compile time attribute attached to the kernel
__global__ void cluster_kernel(float *input, float* output)
{

}

int main()
{
    float *input, *output;
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);

    // Kernel invocation with runtime cluster size
    cudaLaunchConfig_t config = {0};
    // The grid dimension is not affected by cluster launch, and is still
    // enumerated using number of blocks.
    // The grid dimension must be a multiple of cluster size.
    config.gridDim = numBlocks;
    config.blockDim = threadsPerBlock;

    cudaLaunchAttribute attribute[1];
    attribute[0].id = cudaLaunchAttributeClusterDimension;
    attribute[0].val.clusterDim.x = 2; // Cluster size in X-dimension
    attribute[0].val.clusterDim.y = 1;
    attribute[0].val.clusterDim.z = 1;
    config.attrs = attribute;
    config.numAttrs = 1;

    cudaLaunchKernelEx(&config, cluster_kernel, input, output);
}
On GPUs with compute capability 9.0, all the thread blocks in a cluster are guaranteed
to be co-scheduled on a single GPU Processing Cluster (GPC), which allows thread
blocks in the cluster to perform hardware-supported synchronization using the
Cluster Group API cluster.sync(). The cluster group also provides member functions to
query the cluster group size in terms of number of threads or number of blocks using
the num_threads() and num_blocks() APIs respectively. The rank of a thread or block
within the cluster group can be queried using the thread_rank() and block_rank() APIs,
while dim_threads() and dim_blocks() return the cluster dimensions in units of threads
and blocks respectively.
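As a hedged sketch of these queries (assuming a compute capability 9.0 device, the cooperative_groups header, a hypothetical kernel name, and a one-dimensional grid for indexing):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Sketch: querying and synchronizing a cluster group from inside a kernel.
__global__ void __cluster_dims__(2, 1, 1) cluster_query_kernel(unsigned int* ranks,
                                                               unsigned int* sizes)
{
    cg::cluster_group cluster = cg::this_cluster();
    cluster.sync();                                 // hardware-supported barrier across the whole cluster
    if (threadIdx.x == 0) {
        ranks[blockIdx.x] = cluster.block_rank();   // rank of this block within its cluster
        sizes[blockIdx.x] = cluster.num_blocks();   // number of thread blocks in the cluster
    }
}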
Thread blocks that belong to a cluster have access to the Distributed Shared Memory.
Thread blocks in a cluster have the ability to read, write, and perform atomics to any
address in the distributed shared memory. Distributed Shared Memory gives an
example of performing histograms in distributed shared memory.
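A minimal sketch of such access (assuming the cooperative_groups cluster API and its map_shared_rank member; the kernel name and output array are hypothetical, and the full histogram example lives in the Distributed Shared Memory section):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Sketch: each block writes into its own shared memory, then reads a value from
// block 0's shared memory through the distributed shared memory address space.
__global__ void __cluster_dims__(2, 1, 1) dsmem_kernel(unsigned int* out)
{
    __shared__ unsigned int local[1];
    cg::cluster_group cluster = cg::this_cluster();
    if (threadIdx.x == 0)
        local[0] = cluster.block_rank();
    cluster.sync();                                  // make every block's shared-memory write visible
    unsigned int* remote = cluster.map_shared_rank(local, 0);  // block 0's copy of 'local'
    if (threadIdx.x == 0)
        out[blockIdx.x] = remote[0];                 // one-dimensional grid assumed
    cluster.sync();                                  // keep remote shared memory alive until all reads finish
}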
Each thread has private local memory, each thread block has shared memory visible to
all threads of the block, and all threads have access to the same global memory. There
are also two additional read-only memory spaces accessible by all threads: the
constant and texture memory spaces. The global, constant, and texture memory
spaces are optimized for different memory usages (see Device Memory Accesses).
Texture memory also offers different addressing modes, as well as data filtering, for
some specific data formats (see Texture and Surface Memory).
The global, constant, and texture memory spaces are persistent across kernel
launches by the same application.
The CUDA programming model also assumes that both the host and the device
maintain their own separate memory spaces in DRAM, referred to as host memory and
device memory, respectively. Therefore, a program manages the global, constant, and
texture memory spaces visible to kernels through calls to the CUDA runtime
(described in Programming Interface). This includes device memory allocation and
deallocation as well as data transfer between host and device memory.
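For instance, a hedged sketch of this management pattern around the earlier VecAdd kernel (host arrays h_A, h_B, h_C and the size N are assumed to exist):

// Sketch: explicit device memory management for the VecAdd kernel.
float *d_A, *d_B, *d_C;
size_t size = N * sizeof(float);
cudaMalloc(&d_A, size);                              // allocate device memory
cudaMalloc(&d_B, size);
cudaMalloc(&d_C, size);
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);  // copy inputs host -> device
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
VecAdd<<<1, N>>>(d_A, d_B, d_C);                     // launch the kernel on the device
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);  // copy result device -> host
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);         // free device memory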
Unified Memory provides managed memory to bridge the host and device memory
spaces. Managed memory is accessible from all CPUs and GPUs in the system as a
single, coherent memory image with a common address space. This capability enables
oversubscription of device memory and can greatly simplify the task of porting
applications by eliminating the need to explicitly mirror data on host and device. See
Unified Memory Programming for an introduction to Unified Memory.
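A brief sketch of what this looks like in practice (assuming the VecAdd kernel from above; not a complete program):

// Sketch: managed allocations are directly accessible from both host and device.
float *A, *B, *C;
cudaMallocManaged(&A, N * sizeof(float));
cudaMallocManaged(&B, N * sizeof(float));
cudaMallocManaged(&C, N * sizeof(float));
for (int i = 0; i < N; ++i) { A[i] = 1.0f; B[i] = 2.0f; }  // initialize on the host
VecAdd<<<1, N>>>(A, B, C);                                 // no explicit copies needed
cudaDeviceSynchronize();                                   // wait before the host reads C
cudaFree(A); cudaFree(B); cudaFree(C);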
Note
Serial code executes on the host while parallel code executes on the device.
2.5. Asynchronous SIMT Programming Model
In the CUDA programming model a thread is the lowest level of abstraction for doing a
computation or a memory operation. Starting with devices based on the NVIDIA
Ampere GPU architecture, the CUDA programming model provides acceleration to
memory operations via the asynchronous programming model. The asynchronous
programming model defines the behavior of asynchronous operations with respect to
CUDA threads.
An asynchronous operation is one that is initiated by a CUDA thread and executed
asynchronously, as if by another thread. Such an asynchronous thread (an as-if thread)
is always associated with the CUDA thread that initiated the operation. An
asynchronous operation uses a synchronization object to synchronize the completion
of the operation. Such a synchronization object can be explicitly managed by a user
(e.g., cuda::memcpy_async) or implicitly managed within a library (e.g.,
cooperative_groups::memcpy_async).
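As a hedged sketch of the library-managed variant (the kernel name and the 256-element tile size are assumptions for illustration), cooperative_groups::memcpy_async can stage global memory into shared memory asynchronously, and cooperative_groups::wait then synchronizes on its completion:

#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

// Sketch: asynchronous copy into shared memory with an implicitly managed
// synchronization object.
__global__ void StageAndScale(const float* in, float* out)
{
    __shared__ float tile[256];                       // one tile per block, 256 threads assumed
    cg::thread_block block = cg::this_thread_block();
    cg::memcpy_async(block, tile, in + blockIdx.x * 256, sizeof(float) * 256);
    cg::wait(block);                                  // wait for the copy (includes a block-wide barrier)
    int i = threadIdx.x;
    out[blockIdx.x * 256 + i] = tile[i] * 2.0f;
}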
A synchronization object can be used by threads in different thread scopes; a thread
scope defines the set of threads that may use the synchronization object to
synchronize with the asynchronous operation. These thread scopes are implemented as
extensions to standard C++ in the CUDA Standard C++ library.
2.6. Compute Capability
The compute capability of a device is represented by a version number comprising a
major revision number X and a minor revision number Y, and is denoted X.Y.
Devices with the same major revision number are of the same core architecture. The
major revision number is 9 for devices based on the NVIDIA Hopper GPU architecture, 8
for devices based on the NVIDIA Ampere GPU architecture, 7 for devices based on the
Volta architecture, 6 for devices based on the Pascal architecture, 5 for devices based
on the Maxwell architecture, and 3 for devices based on the Kepler architecture.
Turing is the architecture for devices of compute capability 7.5, and is an incremental
update based on the Volta architecture.
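A small sketch of how an application might query the compute capability at runtime (using the major and minor fields of cudaDeviceProp):

#include <cstdio>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // prop.major and prop.minor form the compute capability X.Y
        printf("Device %d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}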