
CUDA C++ Programming Guide


The programming guide to the CUDA model and interface.

Changes in Version 12.8

- Added section TMA Swizzle

1. Introduction
1.1. The Benefits of Using GPUs
The Graphics Processing Unit (GPU)[1] provides much higher instruction throughput
and memory bandwidth than the CPU within a similar price and power envelope. Many
applications leverage these higher capabilities to run faster on the GPU than on the
CPU (see GPU Applications ). Other computing devices, like FPGAs, are also very
energy efficient, but offer much less programming flexibility than GPUs.

This difference in capabilities between the GPU and the CPU exists because they are
designed with different goals in mind. While the CPU is designed to excel at executing
a sequence of operations, called a thread, as fast as possible and can execute a few
tens of these threads in parallel, the GPU is designed to excel at executing thousands
of them in parallel (amortizing the slower single-thread performance to achieve
greater throughput).

The GPU is specialized for highly parallel computations and therefore designed such
that more transistors are devoted to data processing rather than data caching and
flow control. The schematic Figure 1 shows an example distribution of chip resources
for a CPU versus a GPU.
Figure 1: The GPU Devotes More Transistors to Data Processing

Devoting more transistors to data processing, for example, floating-point
computations, is beneficial for highly parallel computations; the GPU can hide memory
access latencies with computation, instead of relying on large data caches and
complex flow control to avoid long memory access latencies, both of which are
expensive in terms of transistors.

In general, an application has a mix of parallel parts and sequential parts, so systems
are designed with a mix of GPUs and CPUs in order to maximize overall performance.
Applications with a high degree of parallelism can exploit this massively parallel nature
of the GPU to achieve higher performance than on the CPU.

1.2. CUDA®: A General-Purpose Parallel Computing Platform and Programming Model
In November 2006, NVIDIA® introduced CUDA®, a general purpose parallel computing
platform and programming model that leverages the parallel compute engine in
NVIDIA GPUs to solve many complex computational problems in a more efficient way
than on a CPU.

CUDA comes with a software environment that allows developers to use C++ as a
high-level programming language. As illustrated by Figure 2, other languages,
application programming interfaces, or directives-based approaches are supported,
such as FORTRAN, DirectCompute, OpenACC.
Figure 2: GPU Computing Applications. CUDA is designed to support various languages
and application programming interfaces.

1.3. A Scalable Programming Model


The advent of multicore CPUs and manycore GPUs means that mainstream processor
chips are now parallel systems. The challenge is to develop application software that
transparently scales its parallelism to leverage the increasing number of processor
cores, much as 3D graphics applications transparently scale their parallelism to
manycore GPUs with widely varying numbers of cores.

The CUDA parallel programming model is designed to overcome this challenge while
maintaining a low learning curve for programmers familiar with standard programming
languages such as C.

At its core are three key abstractions — a hierarchy of thread groups, shared
memories, and barrier synchronization — that are simply exposed to the programmer
as a minimal set of language extensions.

These abstractions provide fine-grained data parallelism and thread parallelism,
nested within coarse-grained data parallelism and task parallelism. They guide the
programmer to partition the problem into coarse sub-problems that can be solved
independently in parallel by blocks of threads, and each sub-problem into finer pieces
that can be solved cooperatively in parallel by all threads within the block.

This decomposition preserves language expressivity by allowing threads to cooperate
when solving each sub-problem, and at the same time enables automatic scalability.
Indeed, each block of threads can be scheduled on any of the available
multiprocessors within a GPU, in any order, concurrently or sequentially, so that a
compiled CUDA program can execute on any number of multiprocessors as illustrated
by Figure 3, and only the runtime system needs to know the physical multiprocessor
count.

This scalable programming model allows the GPU architecture to span a wide market
range by simply scaling the number of multiprocessors and memory partitions: from
the high-performance enthusiast GeForce GPUs and professional Quadro and Tesla
computing products to a variety of inexpensive, mainstream GeForce GPUs (see
CUDA-Enabled GPUs for a list of all CUDA-enabled GPUs).

Figure 3: Automatic Scalability

Note

A GPU is built around an array of Streaming Multiprocessors (SMs) (see Hardware
Implementation for more details). A multithreaded program is partitioned into
blocks of threads that execute independently from each other, so that a GPU with
more multiprocessors will automatically execute the program in less time than a
GPU with fewer multiprocessors.

1.4. Document Structure


This document is organized into the following sections:

- Introduction is a general introduction to CUDA.
- Programming Model outlines the CUDA programming model.
- Programming Interface describes the programming interface.
- Hardware Implementation describes the hardware implementation.
- Performance Guidelines gives some guidance on how to achieve maximum performance.
- CUDA-Enabled GPUs lists all CUDA-enabled devices.
- C++ Language Extensions is a detailed description of all extensions to the C++ language.
- Cooperative Groups describes synchronization primitives for various groups of CUDA threads.
- CUDA Dynamic Parallelism describes how to launch and synchronize one kernel from another.
- Virtual Memory Management describes how to manage the unified virtual address space.
- Stream Ordered Memory Allocator describes how applications can order memory allocation and deallocation.
- Graph Memory Nodes describes how graphs can create and own memory allocations.
- Mathematical Functions lists the mathematical functions supported in CUDA.
- C++ Language Support lists the C++ features supported in device code.
- Texture Fetching gives more details on texture fetching.
- Compute Capabilities gives the technical specifications of various devices, as well as more architectural details.
- Driver API introduces the low-level driver API.
- CUDA Environment Variables lists all the CUDA environment variables.
- Unified Memory Programming introduces the Unified Memory programming model.

[1] : The graphics qualifier comes from the fact that when the GPU was originally created,
two decades ago, it was designed as a specialized processor to accelerate graphics
rendering. Driven by the insatiable market demand for real-time, high-definition, 3D
graphics, it has evolved into a general processor used for many more workloads than
just graphics rendering.

2. Programming Model
This chapter introduces the main concepts behind the CUDA programming model by
outlining how they are exposed in C++.

An extensive description of CUDA C++ is given in Programming Interface.

Full code for the vector addition example used in this chapter and the next can be
found in the vectorAdd CUDA sample .
2.1. Kernels
CUDA C++ extends C++ by allowing the programmer to define C++ functions, called
kernels, that, when called, are executed N times in parallel by N different CUDA threads,
as opposed to only once like regular C++ functions.

A kernel is defined using the __global__ declaration specifier and the number of
CUDA threads that execute that kernel for a given kernel call is specified using a new
<<<...>>> execution configuration syntax (see Execution Configuration ). Each thread
that executes the kernel is given a unique thread ID that is accessible within the kernel
through built-in variables.

As an illustration, the following sample code, using the built-in variable threadIdx ,
adds two vectors A and B of size N and stores the result into vector C.

// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    ...
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
    ...
}

Here, each of the N threads that execute VecAdd() performs one pair-wise addition.
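For completeness, the sketch below shows one way the elided host code could look: allocating device memory, copying the inputs, launching the kernel, and copying the result back. It is an illustrative sketch only (the variable names h_A, h_B, h_C are mine); the reference version is in the vectorAdd CUDA sample, and the runtime calls it uses are described in Programming Interface.

// Illustrative host-side sketch (error checking omitted).
int main()
{
    const int N = 256;              // single-block launch, so N must not exceed the per-block limit
    size_t size = N * sizeof(float);

    // Allocate and initialize host vectors (initialization omitted).
    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C = (float*)malloc(size);

    // Allocate device vectors.
    float *A, *B, *C;
    cudaMalloc(&A, size);
    cudaMalloc(&B, size);
    cudaMalloc(&C, size);

    // Copy the inputs from host memory to device memory.
    cudaMemcpy(A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(B, h_B, size, cudaMemcpyHostToDevice);

    // Kernel invocation with N threads.
    VecAdd<<<1, N>>>(A, B, C);

    // Copy the result back and release memory.
    cudaMemcpy(h_C, C, size, cudaMemcpyDeviceToHost);
    cudaFree(A); cudaFree(B); cudaFree(C);
    free(h_A); free(h_B); free(h_C);
}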

2.2. Thread Hierarchy


For convenience, threadIdx is a 3-component vector, so that threads can be identified
using a one-dimensional, two-dimensional, or three-dimensional thread index, forming
a one-dimensional, two-dimensional, or three-dimensional block of threads, called a
thread block. This provides a natural way to invoke computation across the elements in
a domain such as a vector, matrix, or volume.

The index of a thread and its thread ID relate to each other in a straightforward way:
For a one-dimensional block, they are the same; for a two-dimensional block of size
(Dx, Dy), the thread ID of a thread of index (x, y) is (x + y Dx); for a three-dimensional
block of size (Dx, Dy, Dz), the thread ID of a thread of index (x, y, z) is (x + y Dx + z Dx Dy).
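As an illustrative aside, the formula maps directly onto the built-in variables; the helper below is a hypothetical sketch, not a CUDA built-in.

// Illustrative device helper: the linear thread ID of a thread at index
// (x, y, z) in a block of size (Dx, Dy, Dz).
__device__ unsigned int linearThreadId()
{
    return threadIdx.x                               // x
         + threadIdx.y * blockDim.x                  // + y * Dx
         + threadIdx.z * blockDim.x * blockDim.y;    // + z * Dx * Dy
}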

As an example, the following code adds two matrices A and B of size NxN and stores
the result into matrix C.
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N],
                       float C[N][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ...
    // Kernel invocation with one block of N * N * 1 threads
    int numBlocks = 1;
    dim3 threadsPerBlock(N, N);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}

There is a limit to the number of threads per block, since all threads of a block are
expected to reside on the same streaming multiprocessor core and must share the
limited memory resources of that core. On current GPUs, a thread block may contain
up to 1024 threads.

However, a kernel can be executed by multiple equally-shaped thread blocks, so that
the total number of threads is equal to the number of threads per block times the
number of blocks.

Blocks are organized into a one-dimensional, two-dimensional, or three-dimensional
grid of thread blocks as illustrated by Figure 4. The number of thread blocks in a grid is
usually dictated by the size of the data being processed, which typically exceeds the
number of processors in the system.

Figure 4: Grid of Thread Blocks

The number of threads per block and the number of blocks per grid specified in the
<<<...>>> syntax can be of type int or dim3 . Two-dimensional blocks or grids can be
specified as in the example above.
Each block within the grid can be identified by a one-dimensional, two-dimensional, or
three-dimensional unique index accessible within the kernel through the built-in
blockIdx variable. The dimension of the thread block is accessible within the kernel
through the built-in blockDim variable.

Extending the previous MatAdd() example to handle multiple blocks, the code
becomes as follows.

// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N],
                       float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ...
    // Kernel invocation
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}

A thread block size of 16x16 (256 threads), although arbitrary in this case, is a
common choice. The grid is created with enough blocks to have one thread per matrix
element as before. For simplicity, this example assumes that the number of threads
per grid in each dimension is evenly divisible by the number of threads per block in
that dimension, although that need not be the case.
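When N is not a multiple of the block size, a common pattern is to round the block count up and rely on the bounds check already present in MatAdd(). The following is an illustrative sketch, not the guide's reference code:

// Round up so the grid covers all N x N elements even when N is not a
// multiple of the block size; the if (i < N && j < N) check in MatAdd()
// discards the extra threads.
dim3 threadsPerBlock(16, 16);
dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
               (N + threadsPerBlock.y - 1) / threadsPerBlock.y);
MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);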

Thread blocks are required to execute independently. It must be possible to execute
blocks in any order, in parallel or in series. This independence requirement allows
thread blocks to be scheduled in any order and across any number of cores as
illustrated by Figure 3, enabling programmers to write code that scales with the
number of cores.

Threads within a block can cooperate by sharing data through some shared memory
and by synchronizing their execution to coordinate memory accesses. More precisely,
one can specify synchronization points in the kernel by calling the __syncthreads()
intrinsic function; __syncthreads() acts as a barrier at which all threads in the block
must wait before any is allowed to proceed. Shared Memory gives an example of using
shared memory. In addition to __syncthreads() , the Cooperative Groups API provides a
rich set of thread-synchronization primitives.
For efficient cooperation, shared memory is expected to be a low-latency memory
near each processor core (much like an L1 cache) and __syncthreads() is expected to
be lightweight.
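As an illustrative sketch of such cooperation (not one of the guide's examples), the kernel below loads a tile into shared memory, waits at the barrier, and then reads elements written by other threads of the block. It assumes the kernel is launched with exactly BLOCK_SIZE threads per block and that the grid covers the data exactly.

// Threads of a block cooperate through shared memory to reverse a tile
// of BLOCK_SIZE elements.
#define BLOCK_SIZE 256

__global__ void ReverseTile(const float* in, float* out)
{
    __shared__ float tile[BLOCK_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = in[i];
    __syncthreads();   // wait until every thread of the block has written its element

    out[i] = tile[blockDim.x - 1 - threadIdx.x];
}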

2.2.1. Thread Block Clusters


With the introduction of NVIDIA Compute Capability 9.0, the CUDA programming
model introduces an optional level of hierarchy called Thread Block Clusters that are
made up of thread blocks. Similar to how threads in a thread block are guaranteed to
be co-scheduled on a streaming multiprocessor, thread blocks in a cluster are also
guaranteed to be co-scheduled on a GPU Processing Cluster (GPC) in the GPU.

Similar to thread blocks, clusters are also organized into a one-dimensional, two-
dimensional, or three-dimensional grid of thread block clusters as illustrated by Figure
5. The number of thread blocks in a cluster can be user-defined, and a maximum of
8 thread blocks in a cluster is supported as a portable cluster size in CUDA. Note that
on GPU hardware or MIG configurations which are too small to support 8
multiprocessors, the maximum cluster size will be reduced accordingly. Identification
of these smaller configurations, as well as of larger configurations supporting a thread
block cluster size beyond 8, is architecture-specific and can be queried using the
cudaOccupancyMaxPotentialClusterSize API.

Figure 5: Grid of Thread Block Clusters

Note

In a kernel launched using cluster support, the gridDim variable still denotes the
size in terms of number of thread blocks, for compatibility purposes. The rank of a
block in a cluster can be found using the Cluster Group API.

A thread block cluster can be enabled in a kernel either with the compile-time kernel
attribute __cluster_dims__(X,Y,Z) or with the CUDA kernel launch API
cudaLaunchKernelEx. The example below shows how to launch a cluster using the
compile-time kernel attribute. A cluster size set via the kernel attribute is fixed at
compile time, and the kernel can then be launched using the classical <<<...>>>
syntax. If a kernel uses a compile-time cluster size, the cluster size cannot be modified
when launching the kernel.

// Kernel definition
// Compile time cluster size 2 in X-dimension and 1 in Y and Z dimension
__global__ void __cluster_dims__(2, 1, 1) cluster_kernel(float *input, float* output)
{

}

int main()
{
    float *input, *output;
    // Kernel invocation with compile time cluster size
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);

    // The grid dimension is not affected by cluster launch, and is still enumerated
    // using number of blocks.
    // The grid dimension must be a multiple of cluster size.
    cluster_kernel<<<numBlocks, threadsPerBlock>>>(input, output);
}

A thread block cluster size can also be set at runtime and the kernel can be launched
using the CUDA kernel launch API cudaLaunchKernelEx . The code example below shows
how to launch a cluster kernel using the extensible API.
// Kernel definition
// No compile time attribute attached to the kernel
__global__ void cluster_kernel(float *input, float* output)
{

}

int main()
{
    float *input, *output;
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);

    // Kernel invocation with runtime cluster size
    {
        cudaLaunchConfig_t config = {0};
        // The grid dimension is not affected by cluster launch, and is still enumerated
        // using number of blocks.
        // The grid dimension should be a multiple of cluster size.
        config.gridDim = numBlocks;
        config.blockDim = threadsPerBlock;

        cudaLaunchAttribute attribute[1];
        attribute[0].id = cudaLaunchAttributeClusterDimension;
        attribute[0].val.clusterDim.x = 2; // Cluster size in X-dimension
        attribute[0].val.clusterDim.y = 1;
        attribute[0].val.clusterDim.z = 1;
        config.attrs = attribute;
        config.numAttrs = 1;

        cudaLaunchKernelEx(&config, cluster_kernel, input, output);
    }
}

In GPUs with compute capability 9.0, all the thread blocks in the cluster are
guaranteed to be co-scheduled on a single GPU Processing Cluster (GPC), allowing
thread blocks in the cluster to perform hardware-supported synchronization using the
Cluster Group API cluster.sync(). The cluster group also provides member functions to
query the cluster group size in terms of number of threads or number of blocks using
the num_threads() and num_blocks() APIs respectively. The rank of a thread or block in
the cluster group can be queried using the thread_rank() and block_rank() APIs
respectively.
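The following is a minimal sketch, assuming a compute capability 9.0 device and a kernel launched with a cluster configuration as shown above, of how these Cluster Group APIs can be reached through Cooperative Groups; it is illustrative, not the guide's reference example.

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Every block records its rank within the cluster, then all blocks of the
// cluster synchronize at a hardware-supported barrier.
__global__ void __cluster_dims__(2, 1, 1) cluster_query_kernel(unsigned int* block_ranks)
{
    cg::cluster_group cluster = cg::this_cluster();

    if (threadIdx.x == 0)
        block_ranks[blockIdx.x] = cluster.block_rank();   // rank of this block in its cluster

    cluster.sync();   // barrier across all threads of all blocks in the cluster

    // cluster.num_threads() and cluster.num_blocks() report the cluster size
    // in threads and blocks respectively.
}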

Thread blocks that belong to a cluster have access to the Distributed Shared Memory.
Thread blocks in a cluster have the ability to read, write, and perform atomics to any
address in the distributed shared memory. Distributed Shared Memory gives an
example of performing histograms in distributed shared memory.

2.3. Memory Hierarchy


CUDA threads may access data from multiple memory spaces during their execution
as illustrated by Figure 6. Each thread has private local memory. Each thread block has
shared memory visible to all threads of the block and with the same lifetime as the
block. Thread blocks in a thread block cluster can perform read, write, and atomic
operations on each other’s shared memory. All threads have access to the same global
memory.

There are also two additional read-only memory spaces accessible by all threads: the
constant and texture memory spaces. The global, constant, and texture memory
spaces are optimized for different memory usages (see Device Memory Accesses).
Texture memory also offers different addressing modes, as well as data filtering, for
some specific data formats (see Texture and Surface Memory).

The global, constant, and texture memory spaces are persistent across kernel
launches by the same application.
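As an illustrative sketch (constant memory is covered in detail in Programming Interface), a constant memory variable is typically populated from the host with cudaMemcpyToSymbol and then read by all threads of a kernel:

// A small read-only coefficient table in constant memory.
__constant__ float coeffs[16];

__global__ void Scale(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * coeffs[i % 16];   // every thread reads the same constant table
}

// Host side: copy the table into constant memory before launching the kernel.
// cudaMemcpyToSymbol(coeffs, hostCoeffs, sizeof(coeffs));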

Figure 6: Memory Hierarchy

2.4. Heterogeneous Programming


As illustrated by Figure 7, the CUDA programming model assumes that the CUDA
threads execute on a physically separate device that operates as a coprocessor to the
host running the C++ program. This is the case, for example, when the kernels execute
on a GPU and the rest of the C++ program executes on a CPU.

The CUDA programming model also assumes that both the host and the device
maintain their own separate memory spaces in DRAM, referred to as host memory and
device memory, respectively. Therefore, a program manages the global, constant, and
texture memory spaces visible to kernels through calls to the CUDA runtime
(described in Programming Interface). This includes device memory allocation and
deallocation as well as data transfer between host and device memory.

Unified Memory provides managed memory to bridge the host and device memory
spaces. Managed memory is accessible from all CPUs and GPUs in the system as a
single, coherent memory image with a common address space. This capability enables
oversubscription of device memory and can greatly simplify the task of porting
applications by eliminating the need to explicitly mirror data on host and device. See
Unified Memory Programming for an introduction to Unified Memory.
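As an illustrative sketch (the details are covered in Unified Memory Programming), managed memory can replace the explicit allocate-and-copy pattern of the earlier VecAdd() sketch:

// The VecAdd() kernel from above, using managed memory so that the same
// pointers are valid on both host and device.
int main()
{
    const int N = 256;
    float *A, *B, *C;
    cudaMallocManaged(&A, N * sizeof(float));
    cudaMallocManaged(&B, N * sizeof(float));
    cudaMallocManaged(&C, N * sizeof(float));

    for (int i = 0; i < N; ++i) { A[i] = 1.0f; B[i] = 2.0f; }  // host writes directly

    VecAdd<<<1, N>>>(A, B, C);
    cudaDeviceSynchronize();   // wait for the kernel before the host reads C

    cudaFree(A); cudaFree(B); cudaFree(C);
}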

Figure 7: Heterogeneous Programming

 Note

Serial code executes on the host while parallel code executes on the device.
2.5. Asynchronous SIMT Programming Model
In the CUDA programming model a thread is the lowest level of abstraction for doing a
computation or a memory operation. Starting with devices based on the NVIDIA
Ampere GPU architecture, the CUDA programming model provides acceleration to
memory operations via the asynchronous programming model. The asynchronous
programming model defines the behavior of asynchronous operations with respect to
CUDA threads.

The asynchronous programming model defines the behavior of Asynchronous Barrier
for synchronization between CUDA threads. The model also explains and defines how
cuda::memcpy_async can be used to move data asynchronously from global memory
while computing in the GPU.
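The following is a minimal sketch, assuming the libcu++ cuda::barrier and cuda::memcpy_async facilities are available (hardware acceleration of the copy requires compute capability 8.0 or higher); it is not the guide's reference example.

#include <cooperative_groups.h>
#include <cuda/barrier>

// Copy a tile from global to shared memory asynchronously, then wait on a
// block-scoped barrier before computing on it.
__global__ void compute(const int* global_in, int* global_out)
{
    extern __shared__ int shared[];                 // one int per thread
    auto block = cooperative_groups::this_thread_block();

    __shared__ cuda::barrier<cuda::thread_scope_block> barrier;
    if (block.thread_rank() == 0)
        init(&barrier, block.size());               // one expected arrival per thread
    block.sync();

    // Collective asynchronous copy; completion is tracked by the barrier.
    cuda::memcpy_async(block, shared,
                       global_in + blockIdx.x * blockDim.x,
                       sizeof(int) * blockDim.x, barrier);

    barrier.arrive_and_wait();                      // wait for the copy to finish

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    global_out[i] = shared[threadIdx.x] * 2;        // compute on the shared tile
}

The kernel would be launched with blockDim.x * sizeof(int) bytes of dynamic shared memory as the third execution configuration parameter.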

2.5.1. Asynchronous Operations


An asynchronous operation is defined as an operation that is initiated by a CUDA
thread and is executed asynchronously as-if by another thread. In a well formed
program one or more CUDA threads synchronize with the asynchronous operation.
The CUDA thread that initiated the asynchronous operation is not required to be
among the synchronizing threads.

Such an asynchronous thread (an as-if thread) is always associated with the CUDA
thread that initiated the asynchronous operation. An asynchronous operation uses a
synchronization object to synchronize the completion of the operation. Such a
synchronization object can be explicitly managed by a user (e.g., cuda::memcpy_async ) or
implicitly managed within a library (e.g., cooperative_groups::memcpy_async ).

A synchronization object could be a cuda::barrier or a cuda::pipeline. These objects
are explained in detail in Asynchronous Barrier and Asynchronous Data Copies using
cuda::pipeline. These synchronization objects can be used at different thread scopes. A
scope defines the set of threads that may use the synchronization object to
synchronize with the asynchronous operation. The following table defines the thread
scopes available in CUDA C++ and the threads that can be synchronized with each.

Thread Scope                                  Description
cuda::thread_scope::thread_scope_thread       Only the CUDA thread which initiated the asynchronous operation synchronizes.
cuda::thread_scope::thread_scope_block        All or any CUDA threads within the same thread block as the initiating thread synchronize.
cuda::thread_scope::thread_scope_device       All or any CUDA threads in the same GPU device as the initiating thread synchronize.
cuda::thread_scope::thread_scope_system       All or any CUDA or CPU threads in the same system as the initiating thread synchronize.

These thread scopes are implemented as extensions to standard C++ in the CUDA
Standard C++ library.
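As a minimal sketch of these extensions (assuming the libcu++ <cuda/atomic> header; this is not an example from the guide), an atomic can be declared with an explicit thread scope so that the synchronization it provides is no wider than necessary:

#include <cuda/atomic>

// A device-scoped atomic counter: cuda::thread_scope_device documents that
// only threads on the same GPU synchronize through this object.
__global__ void CountMatches(const int* data, int n, int target,
                             cuda::atomic<int, cuda::thread_scope_device>* counter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] == target)
        counter->fetch_add(1);   // sequentially consistent by default
}

Here counter is assumed to point to an object that was allocated in device memory and initialized to zero before the launch.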

2.6. Compute Capability


The compute capability of a device is represented by a version number, also sometimes
called its “SM version”. This version number identifies the features supported by the
GPU hardware and is used by applications at runtime to determine which hardware
features and/or instructions are available on the present GPU.

The compute capability comprises a major revision number X and a minor revision
number Y and is denoted by X.Y.
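A runtime query is a common way for an application to read this X.Y pair; the sketch below uses the CUDA runtime API and is illustrative rather than part of this chapter.

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Report the compute capability of device 0.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    std::printf("%s has compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}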

Devices with the same major revision number are of the same core architecture. The
major revision number is 9 for devices based on the NVIDIA Hopper GPU architecture, 8
for devices based on the NVIDIA Ampere GPU architecture, 7 for devices based on the
Volta architecture, 6 for devices based on the Pascal architecture, 5 for devices based
on the Maxwell architecture, and 3 for devices based on the Kepler architecture.

The minor revision number corresponds to an incremental improvement to the core
architecture, possibly including new features.

Turing is the architecture for devices of compute capability 7.5, and is an incremental
update based on the Volta architecture.
