
cuda-on-cl

A compiler and runtime for running NVIDIA® CUDA™ C++11 applications on OpenCL™ 1.2 devices

Hugh Perkins (ASAPP)


Demo: CUDA™ on Intel HD5500
__global__ void setValue(float *data, int idx, float value) {
    if (threadIdx.x == 0) {
        data[idx] = value;
    }
}

int main(int argc, char *argv[]) {
    ...
    cudaMalloc((void **)(&gpuFloats), N * sizeof(float));
    setValue<<<dim3(32, 1, 1), dim3(32, 1, 1)>>>(gpuFloats, 2, 123.0f);
    cudaMemcpy(hostFloats, gpuFloats, 4 * sizeof(float), cudaMemcpyDeviceToHost);
    cout << "hostFloats[2] " << hostFloats[2] << endl;
    ...
}
Background: why?

- NVIDIA® CUDA™ is the language of choice for machine learning libraries:
  - Tensorflow
  - Caffe
  - Torch
  - Theano
  - …
- Ports to OpenCL by hand include:
  - Caffe (Tschopp; Gu et al.; Engel)
  - Torch (Perkins) <= me :-)
- Dedicated OpenCL libraries are few:
  - DeepCL (Perkins)
Why not port by hand?

- Maintenance nightmare:
  - need to fork the code
  - the Caffe forks are separate from the core CUDA codebase
  - the Torch fork is a separate repo from the core CUDA Torch codebase
- Feature-incomplete
- Frozen at February 2016
Concept: leave the code in NVIDIA® CUDA™

- Leave the code in NVIDIA® CUDA™
- Compile into OpenCL
NVIDIA® CUDA™ compiler ecosystem

What        | Who      | Input         | Portable?        | Compile/run NVIDIA® CUDA™?
HIP         | AMD      | HIP           | No (AMD only)    | Almost (rename API calls)
ComputeCpp  | Codeplay | SYCL          | Yes (SPIR)       | No (different API)
triSYCL     | Keryell  | SYCL          | Yes (SPIR)       | No (different API)
NVIDIA CUDA | NVIDIA   | NVIDIA® CUDA™ | No (NVIDIA only) | Yes


NVIDIA® CUDA™ compiler ecosystem

What        | Who      | Input         | Portable?        | Compile/run NVIDIA® CUDA™?
HIP         | AMD      | HIP           | No (AMD only)    | Almost (rename API calls)
ComputeCpp  | Codeplay | SYCL          | Yes (SPIR)       | No (different API)
triSYCL     | Keryell  | SYCL          | Yes (SPIR)       | No (different API)
NVIDIA CUDA | NVIDIA   | NVIDIA® CUDA™ | No (NVIDIA only) | Yes
cuda-on-cl  | Perkins  | NVIDIA® CUDA™ | Yes (OpenCL 1.2) | Yes


Portability vs speed

Speed vs portability: pick one.

Portability <----------------------------------------> Speed

  CUDA-on-CL      triSYCL / ComputeCpp      HIP      NVIDIA CUDA

- CUDA-on-CL: OpenCL 1.2 is widely supported
- triSYCL, ComputeCpp: SPIR 1.2 is not widely supported
- HIP: one single hardware type, so fast, but not portable
- NVIDIA CUDA: one single hardware type, and it's NVIDIA :-)
So how fast is cuda-on-cl?

[Benchmark chart: NVIDIA K80 GPU, batch size ~500 MB]
So how fast is cuda-on-cl?
Execution times are comparable for:
- unary ops
- binary ops
- single-axis reductions

But:
- full reduction is slow
- comparable only for large batch sizes

[Benchmark chart: NVIDIA K80 GPU, batch size ~500 MB]
Effect of batch size on execution time, unary ops

- similar for batch sizes >= 1 MB
- constant per-batch overhead is higher
Effect of batch size on execution time, full reduction

- full reduction is 14 times slower
- open opportunity to analyze why
Key design decisions

- We want to compile C++11 kernels. How?
  => Use the CLANG C++11 parser

- How do we feed bytecode to the GPU driver?
  => Convert it to OpenCL 1.2

- How do we handle NVIDIA® CUDA™ API calls?
  => Implement the NVIDIA® CUDA™ API in OpenCL (sketched below)
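To make the third decision concrete, here is a minimal, hedged sketch of what "implement the CUDA API in OpenCL" can look like: a cudaMemcpy-style call backed by OpenCL 1.2 buffer reads and writes. The function and enum names are illustrative assumptions, not cuda-on-cl's actual implementation, and the device side is assumed to already be resolved to a cl_mem plus byte offset (the virtual-memory layer later in the talk does that resolution).

#include <CL/cl.h>
#include <cstddef>

// Hedged sketch, not cuda-on-cl's actual code: a cudaMemcpy-style call
// implemented on top of OpenCL 1.2.
enum MemcpyKind { HostToDevice, DeviceToHost };

cl_int memcpyLikeCuda(cl_command_queue queue, MemcpyKind kind,
                      void *hostPtr, cl_mem deviceMem, size_t deviceOffset,
                      size_t count) {
    if (kind == DeviceToHost) {
        // blocking read: device buffer -> host memory
        return clEnqueueReadBuffer(queue, deviceMem, CL_TRUE, deviceOffset,
                                   count, hostPtr, 0, nullptr, nullptr);
    }
    // blocking write: host memory -> device buffer
    return clEnqueueWriteBuffer(queue, deviceMem, CL_TRUE, deviceOffset,
                                count, hostPtr, 0, nullptr, nullptr);
}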


Kernel compilation

[Diagram 1]
Compile-time: CUDA kernel --(CLANG)--> bytecode
Run-time:     bytecode --(OpenCL generator)--> OpenCL 1.2

[Diagram 2]
Compile-time: CUDA source code --(CLANG)--> host bytecode + device bytecode --> executable
Run-time:     device bytecode --(CUDA-on-CL runtime)--> device
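For a feel of the compilation target, here is a hand-written OpenCL 1.2 kernel equivalent to the setValue demo kernel from the start of the talk. This is illustrative only; the actual generator works from CLANG bytecode rather than source, so its output looks more mechanical.

// Hand-written OpenCL 1.2 equivalent of the demo's setValue CUDA kernel
// (illustrative only; not the generator's actual output)
kernel void setValue(global float *data, int idx, float value) {
    if (get_local_id(0) == 0) {      // threadIdx.x == 0 in CUDA terms
        data[idx] = value;
    }
}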
NVIDIA® CUDA™ API partial implementation

[Diagram: kernel-launch path: the user's executable calls the CUDA API, which goes through the virtual-memory layer and the OpenCL context to the GPU]
Edge/not-so-edge cases

OpenCL 1.2:

- does not allow hostside GPU buffer offsets <= we will look at this below
- requires address spaces to be statically declared (global/local/private…)
  - … including on function parameters
  - … which might be called with diverse address-space combinations
- forbids by-value structs as kernel parameters
- forbids pointers in kernel-parameter structs
- lacks many hardware operations, e.g. __shfl

CUDA-on-CL handles all of the above; a small kernel that trips several of these restrictions is sketched below.
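To make the constraints concrete, here is a small, hypothetical CUDA kernel (names are illustrative, not taken from any of the libraries above) that is perfectly ordinary CUDA but hits several OpenCL 1.2 restrictions at once: a by-value struct kernel parameter, a pointer inside that struct, and a warp-shuffle intrinsic.

// Hypothetical CUDA kernel: legal CUDA, but it trips several OpenCL 1.2
// restrictions (by-value struct parameter, pointer inside the struct,
// and a warp shuffle with no OpenCL 1.2 equivalent).
struct Params {
    float *data;   // pointer inside a kernel-parameter struct
    int n;
};

__global__ void warpSum(Params p) {                  // struct passed by value
    float v = (threadIdx.x < p.n) ? p.data[threadIdx.x] : 0.0f;
    // warp-level reduction via __shfl_down (pre-CUDA 9 spelling)
    for (int offset = 16; offset > 0; offset >>= 1) {
        v += __shfl_down(v, offset);
    }
    if (threadIdx.x == 0) {
        p.data[0] = v;
    }
}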


Case-study: hostside GPU buffer offsets

CUDA lets you do things like:

float *buf;
cudaMalloc((void **)&buf, 1024);

someKernel<<<...>>>(buf + 128);
Case-study: hostside GPU buffer offsets

OpenCL 1.2 doesn’t allow this:

cl_mem buf = clCreateBuffer(..., 1024, ...);

clEnqueueNDRangeKernel(..., buf + 128, ...);

Not allowed: cl_mem is an opaque handle, not a pointer, so buf + 128 is meaningless.
Case-study: hostside GPU buffer offsets

CUDA-on-CL solution:

1. Implement virtual memory, and
2. Rewrite the kernel
Part 1: virtual memory

[Sequence diagram: Executable <-> Virtual memory <-> Kernel launch]

1. Request buffer
2. Allocate cl_mem
3. Return virtual address
4. Launch kernel
5. Look up virtual address, get:
   - cl_mem
   - offset
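A minimal sketch of how such a virtual-memory table can work, assuming a simple map from virtual base addresses to allocations. Class and function names are illustrative, not cuda-on-cl's actual types.

#include <CL/cl.h>
#include <cstdint>
#include <map>

// Illustrative sketch: cudaMalloc hands back a fake "device pointer";
// at kernel-launch time, any pointer derived from it (e.g. buf + 128)
// is resolved back to its cl_mem plus a byte offset.
struct Allocation {
    cl_mem mem;     // real OpenCL buffer
    size_t bytes;   // allocation size
};

class VirtualMemory {
    std::map<uintptr_t, Allocation> allocations_;  // keyed by virtual base address
    uintptr_t nextAddress_ = 0x1000;               // arbitrary non-zero start

public:
    // steps 1-3: request buffer, allocate cl_mem, return virtual address
    void *alloc(cl_context ctx, size_t bytes) {
        cl_int err = CL_SUCCESS;
        cl_mem mem = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, nullptr, &err);
        if (err != CL_SUCCESS) return nullptr;
        uintptr_t addr = nextAddress_;
        nextAddress_ += bytes;
        allocations_[addr] = {mem, bytes};
        return reinterpret_cast<void *>(addr);
    }

    // step 5: look up a virtual address, recover (cl_mem, offset)
    bool lookup(const void *virt, cl_mem *mem, size_t *offset) const {
        uintptr_t addr = reinterpret_cast<uintptr_t>(virt);
        auto it = allocations_.upper_bound(addr);    // first allocation past addr
        if (it == allocations_.begin()) return false;
        --it;                                        // allocation containing addr, if any
        if (addr >= it->first + it->second.bytes) return false;
        *mem = it->second.mem;
        *offset = addr - it->first;
        return true;
    }
};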
Part 2: rewrite kernel
Before:

kernel void someKernel(global float *buf, …) {
    ...
}

After:

kernel void someKernel(global float *buf_data, int buf_offset, …) {
    global float *buf = buf_data + buf_offset;
    ...
}

Transparent: no changes are required to user source code; the host-side launch is sketched below.
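On the host side, the launcher then passes the resolved cl_mem and offset as the two rewritten parameters, using standard OpenCL 1.2 calls. The function name and the float element type are assumptions for this example, not cuda-on-cl's actual launch code.

#include <CL/cl.h>

// Illustrative launcher for the rewritten signature
//   kernel void someKernel(global float *buf_data, int buf_offset, ...)
void launchSomeKernel(cl_command_queue queue, cl_kernel kernel,
                      cl_mem bufMem, size_t bufByteOffset,
                      size_t globalSize, size_t localSize) {
    int bufOffset = (int)(bufByteOffset / sizeof(float));   // offset in elements
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufMem);     // global float *buf_data
    clSetKernelArg(kernel, 1, sizeof(int), &bufOffset);     // int buf_offset
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr,
                           &globalSize, &localSize, 0, nullptr, nullptr);
}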


Open issues

- Execution speed:
  - NVIDIA's compiler optimizations are really good
  - OpenCL 1.2 compatibility boilerplate increases launch overhead
  - OpenCL's intrinsic kernel-launch time is high
  - missing hardware implementations (shfl etc.)

- Portability:
  - each vendor's driver has different quirks
  - need to test case-by-case
  - CUDA-on-CL stresses the drivers in unusual ways
Overall

- CUDA-on-CL actually works, on some fairly complex kernels
- Runs on multiple vendors’ GPUs
- Execution speed can be at parity with native NVIDIA® CUDA™
- Much more general solution than porting by hand

Open source, Apache 2.0 license:

https://github.com/hughperkins/cuda-on-cl

Thank you to Andy Maheshwari (ASAPP) for awesome help reviewing the presentation.
