cuda-on-cl
A compiler and runtime for running
NVIDIA® CUDA™ C++11 applications on
OpenCL™ 1.2 devices
Hugh Perkins (ASAPP)
Demo: CUDA™ on Intel HD5500
__global__ void setValue(float *data, int idx, float value) {
    if(threadIdx.x == 0) {
        data[idx] = value;
    }
}

int main(int argc, char *argv[]) {
    ...
    cudaMalloc((void **)(&gpuFloats), N * sizeof(float));
    setValue<<<dim3(32, 1, 1), dim3(32, 1, 1)>>>(gpuFloats, 2, 123.0f);
    cudaMemcpy(hostFloats, gpuFloats, 4 * sizeof(float), cudaMemcpyDeviceToHost);
    cout << "hostFloats[2] " << hostFloats[2] << endl;
    ...
}
Background: why?
- NVIDIA® CUDA™ is the language of choice for machine learning libraries:
- TensorFlow
- Caffe
- Torch
- Theano
- …
- Ports to OpenCL by hand include
- Caffe (Tschopp; Gu et al; Engel)
- Torch (Perkins) <= me :-)
- Dedicated OpenCL libraries are few:
- DeepCL (Perkins)
Why not port by hand?
- Maintenance nightmare
- Need to fork the code
- The Caffe forks are separate from the core CUDA Caffe codebase
- The Torch fork is a separate repo from the core CUDA Torch codebase
- Feature-incomplete
- Frozen at February 2016
Concept: leave the code in NVIDIA® CUDA™
- Leave the code in NVIDIA® CUDA™
- Compile into OpenCL
NVIDIA® CUDA™ compiler ecosystem

What          Who        Input           Portable?           Compile/run NVIDIA® CUDA™?
HIP           AMD        HIP             No (AMD only)       Almost (rename API calls)
ComputeCpp    Codeplay   SYCL            Yes (SPIR)          No (different API)
triSYCL       Keryell    SYCL            Yes (SPIR)          No (different API)
NVIDIA CUDA   NVIDIA     NVIDIA® CUDA™   No (NVIDIA only)    Yes
cuda-on-cl    Perkins    NVIDIA® CUDA™   Yes (OpenCL 1.2)    Yes
Portability vs speed

Speed vs portability: pick one
- CUDA-on-CL: portable (OpenCL 1.2 is widely supported)
- triSYCL, ComputeCpp: portable in principle, but SPIR 1.2 is not widely supported
- HIP: one single hardware type (AMD)
- NVIDIA CUDA: one single hardware type, but fast; and it's NVIDIA :-)
So how fast is cuda-on-cl?
NVIDIA K80 GPU
Batch size: ~500MB
So how fast is cuda-on-cl?
Execution times are comparable to native CUDA for:
- unary ops
- binary ops
- single-axis reduction
But:
- full reduction is slow
- (results are for large batch sizes)
NVIDIA K80 GPU
Batch size: ~500MB
Effect of batch size on execution time, unary ops
- execution times similar for batch sizes >= 1MB
- the constant per-batch overhead is higher
Effect of batch size on execution time, full reduction
- full reduction is 14 times slower
- an open opportunity to analyze why
Key design decisions
- We want to compile C++11 kernels. How?
Use CLANG C++11 parser
- How to feed bytecode to the GPU driver?
Convert to OpenCL 1.2
- How to handle NVIDIA® CUDA™ API calls?
Implement NVIDIA® CUDA™ API, in OpenCL
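As a rough sketch of that last point (illustrative only; the g_queue and lookupClMem names are hypothetical, not the actual cuda-on-cl source), a CUDA API call such as cudaMemcpy can be layered on top of OpenCL 1.2:

#include <CL/cl.h>
#include <cstddef>

enum cudaMemcpyKind { cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost };

extern cl_command_queue g_queue;                 // assumed global OpenCL state
extern cl_mem lookupClMem(const void *devPtr);   // hypothetical device-pointer lookup

// Sketch: forward cudaMemcpy onto the equivalent blocking OpenCL buffer copies.
// (Returns the raw OpenCL error code rather than a cudaError_t, for brevity.)
int cudaMemcpy(void *dst, const void *src, size_t count, cudaMemcpyKind kind) {
    if (kind == cudaMemcpyHostToDevice) {
        cl_mem buf = lookupClMem(dst);
        return clEnqueueWriteBuffer(g_queue, buf, CL_TRUE, 0, count, src, 0, NULL, NULL);
    } else {
        cl_mem buf = lookupClMem(src);
        return clEnqueueReadBuffer(g_queue, buf, CL_TRUE, 0, count, dst, 0, NULL, NULL);
    }
}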
Kernel compilation
[Diagram] Compile-time: CLANG parses the CUDA kernel source; host bytecode and device bytecode are linked into the executable. Run-time: the CUDA-on-CL runtime feeds the device bytecode to the OpenCL generator, which emits OpenCL 1.2 for the device.
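As a hedged sketch of the run-time half (not the actual cuda-on-cl code), the generated OpenCL 1.2 source is compiled through the standard OpenCL entry points; the string below is a hand-written guess at what generated OpenCL for the setValue demo kernel might look like:

#include <CL/cl.h>

void buildGeneratedKernel(cl_context context, cl_device_id device) {
    // Hand-written approximation of generator output for the setValue kernel.
    const char *generatedSource =
        "kernel void setValue(global float *data, int idx, float value) {\n"
        "    if (get_local_id(0) == 0) {\n"
        "        data[idx] = value;\n"
        "    }\n"
        "}\n";
    cl_int err;
    cl_program prog = clCreateProgramWithSource(context, 1, &generatedSource, NULL, &err);
    clBuildProgram(prog, 1, &device, "", NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "setValue", &err);
    // kernel is now ready for clSetKernelArg / clEnqueueNDRangeKernel
    (void)kernel;
}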
NVIDIA® CUDA™ API partial implementation
[Diagram] Kernel launch path: user code in the executable calls the CUDA API implementation, which resolves buffers through the virtual-memory layer and launches onto the GPU via the OpenCL context.
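A sketch of what the launch step can look like underneath (an assumed mapping, not the actual cuda-on-cl internals): a <<<gridDim, blockDim>>> launch becomes clSetKernelArg calls plus an NDRange enqueue, where the local size is blockDim and the global size is gridDim * blockDim:

#include <CL/cl.h>

// Launch the demo kernel with the equivalent of
// setValue<<<dim3(32, 1, 1), dim3(32, 1, 1)>>>(gpuFloats, 2, 123.0f);
void launchSetValue(cl_command_queue queue, cl_kernel kernel,
                    cl_mem data, cl_int idx, cl_float value) {
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &data);
    clSetKernelArg(kernel, 1, sizeof(cl_int), &idx);
    clSetKernelArg(kernel, 2, sizeof(cl_float), &value);

    size_t local[1]  = {32};       // blockDim.x
    size_t global[1] = {32 * 32};  // gridDim.x * blockDim.x
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global, local, 0, NULL, NULL);
}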
Edge/not-so-edge cases
OpenCL 1.2:
- does not allow hostside GPU buffer offsets <= we will look at this
- requires address-spaces to be statically declared (global/local/private…)
- … including function parameters
- … which might be called with diverse address-space combinations
- forbids by-value structs as kernel parameters
- forbids pointers in kernel parameter structs
- lacks many hardware operations, e.g. __shfl (see the emulation sketch below)
CUDA-on-CL handles all of the above
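For example, one way __shfl can be emulated under OpenCL 1.2 (an illustrative assumption, not necessarily what cuda-on-cl emits) is to stage values through local memory:

// Work-group-wide emulation of __shfl(value, srcLane); the real __shfl operates
// per warp, so this is a simplification. scratch needs one float per work-item.
float shfl_emulated(float value, int srcLane, local float *scratch) {
    int lane = get_local_id(0);
    scratch[lane] = value;
    barrier(CLK_LOCAL_MEM_FENCE);
    float result = scratch[srcLane];
    barrier(CLK_LOCAL_MEM_FENCE);   // make scratch safe to reuse afterwards
    return result;
}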
Case-study: hostside GPU buffer offsets
CUDA lets you do things like:
float *buf;
cudaMalloc((void **)&buf, 1024);
someKernel<<<... >>>(buf + 128);
Case-study: hostside GPU buffer offsets
OpenCL 1.2 doesn’t allow this:
cl_mem buf = clCreateBuffer(..., 1024, ...);
clSetKernelArg(kernel, 0, sizeof(cl_mem), buf + 128);
Not allowed
Case-study: hostside GPU buffer offsets
CUDA-on-CL solution:
1. Implement virtual memory, and
2. Rewrite the kernel
Part 1: virtual memory
[Diagram: sequence between Executable, Virtual memory, and Kernel launch]
1. Request buffer
2. Allocate cl_mem
3. Return virtual address
4. Launch kernel
5. Look up virtual address, get:
   - cl_mem
   - offset
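A minimal sketch of these steps (illustrative only; the struct and function names are hypothetical, not the actual cuda-on-cl implementation):

#include <CL/cl.h>
#include <cstddef>
#include <map>

struct Allocation {
    cl_mem clmem;
    size_t virtualStart;
    size_t bytes;
};

static std::map<size_t, Allocation> g_allocations;   // keyed by virtual start address
static size_t g_nextVirtualAddress = 4096;            // never hand out address 0

// Steps 1-3: request buffer, allocate cl_mem, return a fake "virtual" address
void *virtualMalloc(cl_context context, size_t bytes) {
    Allocation alloc;
    alloc.clmem = clCreateBuffer(context, CL_MEM_READ_WRITE, bytes, NULL, NULL);
    alloc.virtualStart = g_nextVirtualAddress;
    alloc.bytes = bytes;
    g_allocations[alloc.virtualStart] = alloc;
    g_nextVirtualAddress += bytes;
    return (void *)alloc.virtualStart;
}

// Step 5: at kernel launch, resolve a virtual address into (cl_mem, byte offset)
bool lookupVirtualAddress(const void *virtualPtr, cl_mem *clmem, size_t *offset) {
    size_t addr = (size_t)virtualPtr;
    std::map<size_t, Allocation>::iterator it = g_allocations.upper_bound(addr);
    if (it == g_allocations.begin()) return false;
    --it;   // last allocation starting at or before addr
    if (addr >= it->second.virtualStart + it->second.bytes) return false;
    *clmem = it->second.clmem;
    *offset = addr - it->second.virtualStart;
    return true;
}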
Part 2: rewrite kernel
Before:
kernel void someKernel(global float *buf, …) {
    ...
}
After:
kernel void someKernel(global float *buf_data, int buf_offset, …) {
    global float *buf = buf_data + buf_offset;
    ...
}
Transparent: no changes required to user source-code
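On the launch side, the runtime can then pass the cl_mem and the element offset as two separate kernel arguments (a sketch, reusing the hypothetical lookupVirtualAddress helper from the virtual-memory sketch above):

#include <CL/cl.h>
#include <cstddef>

// From the virtual-memory sketch above (hypothetical helper).
bool lookupVirtualAddress(const void *virtualPtr, cl_mem *clmem, size_t *offset);

// Bind one rewritten buffer parameter: argIndex gets the cl_mem,
// argIndex + 1 gets the element offset.
void setBufferArg(cl_kernel kernel, cl_uint argIndex, const float *virtualPtr) {
    cl_mem clmem;
    size_t byteOffset;
    if (!lookupVirtualAddress(virtualPtr, &clmem, &byteOffset)) {
        return;   // not a known device pointer
    }
    cl_int elementOffset = (cl_int)(byteOffset / sizeof(float));
    clSetKernelArg(kernel, argIndex, sizeof(cl_mem), &clmem);
    clSetKernelArg(kernel, argIndex + 1, sizeof(cl_int), &elementOffset);
}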
Open issues
- Execution speed:
  - NVIDIA's compiler optimizations are really good
  - OpenCL 1.2 compatibility boilerplate increases launch overhead
  - Intrinsic OpenCL kernel launch time is high
  - Missing hardware operations (__shfl etc.)
- Portability:
  - Each vendor's driver has different quirks
  - Need to test case-by-case
  - CUDA-on-CL stresses the drivers in unusual ways
Overall
- CUDA-on-CL actually works, on some fairly complex kernels
- Runs on multiple vendors’ GPUs
- Execution speed can be at parity with native NVIDIA® CUDA™
- Much more general solution than porting by hand
Open source, Apache 2.0 license:
https://github.com/hughperkins/cuda-on-cl
Thank you to Andy Maheshwari (ASAPP) for awesome help reviewing the presentation.