Maybe you hope to take advantage of multiple GPUs to make inference even faster. Here are a few tips to help you with that, using **YOLO V4** as an example.
## 1. Make custom plugins (e.g. the YOLO layer and Mish layer for YOLO V4) run asynchronously.
To do this, we need to use the `cudaStream_t` parameter in the kernels of all custom layers and use asynchronous functions.
For example, in the function `forwardGpu()` of **yololayer.cu**, you need to make the following changes so that the engine runs on a specific CUDA stream.
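A minimal sketch of the idea (member names such as `mOutputSize` and `mThreadCount`, and the exact `enqueue()` signature, are assumptions that may differ between yololayer.cu revisions and TensorRT versions): forward the `cudaStream_t` that TensorRT passes to `enqueue()` into `forwardGpu()`, and use it for every kernel launch and asynchronous memory operation instead of the default stream.

```
// Sketch only -- mOutputSize, mThreadCount and the exact enqueue() signature
// are assumptions; adapt them to your version of yololayer.cu / TensorRT.
__global__ void CalDetection(const float* input, float* output, int total);

void YoloLayerPlugin::forwardGpu(const float* const* inputs, float* output,
                                 cudaStream_t stream, int batchSize)
{
    int total = mOutputSize * batchSize;

    // Asynchronous memset queued on the caller's stream instead of a blocking cudaMemset().
    cudaMemsetAsync(output, 0, sizeof(float) * total, stream);

    // The 4th launch parameter queues the kernel on the same stream,
    // so work submitted on different streams/devices does not serialize.
    int blocks = (total + mThreadCount - 1) / mThreadCount;
    CalDetection<<<blocks, mThreadCount, 0, stream>>>(inputs[0], output, total);
}

int YoloLayerPlugin::enqueue(int batchSize, const void* const* inputs,
                             void** outputs, void* workspace, cudaStream_t stream)
{
    // TensorRT already hands us the stream here; just pass it down.
    forwardGpu(reinterpret_cast<const float* const*>(inputs),
               reinterpret_cast<float*>(outputs[0]), stream, batchSize);
    return 0;
}
```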
## 2. Create an engine for each device you want to use.
It might be a good idea to create a struct that stores the engine, context, and buffers for each device individually. For example:
```
struct Plan {
    IRuntime* runtime;
    ICudaEngine* engine;
    IExecutionContext* context;
    void* buffers[2];
    cudaStream_t stream;
};
```
Then use `cudaSetDevice()` so that each engine you create runs on a specific device. Moreover, to maximize performance, make sure that the engine file you deserialize is the one TensorRT optimized for that device.
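A minimal per-device setup might look like the sketch below (the engine file names, `gLogger`, `inputSize`, and `outputSize` are illustrative assumptions, not values from this repo):

```
// Sketch: one Plan per GPU, each deserializing an engine built on that GPU.
// File names, gLogger, inputSize and outputSize are illustrative assumptions.
std::vector<Plan> plans(nDevices);
for (int dev = 0; dev < nDevices; ++dev) {
    cudaSetDevice(dev);  // everything created below lives on this device

    std::ifstream file("yolov4_gpu" + std::to_string(dev) + ".engine", std::ios::binary);
    std::string blob((std::istreambuf_iterator<char>(file)), std::istreambuf_iterator<char>());

    plans[dev].runtime = createInferRuntime(gLogger);
    plans[dev].engine  = plans[dev].runtime->deserializeCudaEngine(blob.data(), blob.size());
    plans[dev].context = plans[dev].engine->createExecutionContext();

    cudaMalloc(&plans[dev].buffers[0], inputSize);   // device input buffer
    cudaMalloc(&plans[dev].buffers[1], outputSize);  // device output buffer
    cudaStreamCreate(&plans[dev].stream);
}
```

Remember that the current device is a per-thread setting, so the thread that later drives a given plan should also call `cudaSetDevice()` for that device before enqueueing work.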
## 3. Use functions wisely
Here are some things I learned while trying to parallelize the inference.
1) Do not use synchronous functions, such as `cudaFree()`, during inference.
2) Use `cudaMallocHost()` instead of `malloc()` when allocating memory on the host side, as in the sketch below.
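A minimal sketch of that pattern, reusing the `Plan` struct above (the buffer sizes and the implicit-batch `enqueue()` call are assumptions; adapt them to your engine's bindings):

```
// Pinned (page-locked) host memory lets cudaMemcpyAsync truly overlap with compute.
float* hostInput  = nullptr;
float* hostOutput = nullptr;
cudaMallocHost(reinterpret_cast<void**>(&hostInput), inputSize);   // instead of malloc()
cudaMallocHost(reinterpret_cast<void**>(&hostOutput), outputSize);

// Fill hostInput with the preprocessed image, then queue everything on the plan's stream.
cudaMemcpyAsync(plan.buffers[0], hostInput, inputSize, cudaMemcpyHostToDevice, plan.stream);
plan.context->enqueue(batchSize, plan.buffers, plan.stream, nullptr);
cudaMemcpyAsync(hostOutput, plan.buffers[1], outputSize, cudaMemcpyDeviceToHost, plan.stream);

// Synchronize only once, at the end; avoid blocking calls such as cudaFree() in this loop.
cudaStreamSynchronize(plan.stream);
```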