|
| 1 | +# `ISO3DFD OpenMP Offload` Sample |
| 2 | + |
| 3 | +The ISO3DFD sample implements three-dimensional finite-difference wave propagation in isotropic media. It is a 3D stencil that simulates a wave propagating in a 3D isotropic medium, and it shows some of the more common challenges and techniques for achieving good performance when targeting OpenMP Offload devices (GPUs) in more complex applications. |
| 4 | + |
| 5 | +| Optimized for | Description |
| 6 | +|:--- |:--- |
| 7 | +| OS | Linux* Ubuntu* 18.04 |
| 8 | +| Hardware | Skylake with GEN9 or newer |
| 9 | +| Software | Intel® oneAPI DPC++/C++ Compiler |
| 10 | +| What you will learn | How to offload the computation to a GPU using the Intel® oneAPI DPC++/C++ Compiler |
| 11 | +| Time to complete | 15 minutes |
| 12 | + |
| 13 | +Performance of each version relative to the default baseline version: |
| 14 | + |
| 15 | +| iso3dfd_omp_offload sample | Speedup over baseline |
| 16 | +|:--- |:--- |
| 17 | +| Default Baseline version | 1.0x |
| 18 | +| Optimized version 1 | 1.11x |
| 19 | +| Optimized version 2 | 1.48x |
| 20 | +| Optimized version 3 | 1.60x |
| 21 | + |
| 22 | + |
| 23 | +## Purpose |
| 24 | + |
| 25 | +ISO3DFD is a finite-difference stencil kernel for solving the 3D acoustic isotropic wave equation, which can be used as a proxy for propagating a seismic wave. Kernels in this sample are implemented as 16th order in space, with symmetric coefficients, and 2nd order in time, without boundary conditions. Using OpenMP Offload, the sample can explicitly run on the GPU to propagate a seismic wave, which is a compute-intensive task. |
| 26 | + |
| 27 | +The code attempts to find an available GPU or OpenMP Offload capable device and exits if no compatible device is detected. By default, the output prints the name of the device where the OpenMP Offload code ran, along with the grid computation metrics: flops and effective throughput. To validate the results, an OpenMP CPU-only version of the application is run on the host, and its results are compared with those of the OpenMP Offload version. |
| 28 | + |
| 29 | +The code also demonstrates some of the common optimization techniques which can be used to improve performance of 3D-stencil code running on a GPU device. |
| 30 | + |
| 31 | +## Key Implementation Details |
| 32 | + |
| 33 | +The basic OpenMP Offload implementation explained in the code includes the use of the following: |
| 34 | +* The OpenMP offload target data map construct |
| 35 | +* The default baseline version demonstrates the use of the OpenMP offload target parallel for construct with collapse |
| 36 | +* Optimized version 1 demonstrates the use of the OpenMP offload teams distribute construct and the num_teams and thread_limit clauses |
| 37 | +* Incremental optimized version 2 demonstrates the use of the OpenMP offload teams distribute construct with an improved data-access pattern |
| 38 | +* Incremental optimized version 3 demonstrates the use of OpenMP CPU threads along with the OpenMP offload target construct |
| 39 | + |
| 40 | + |
| 41 | +## License |
| 42 | + |
| 43 | +This code sample is licensed under the MIT license. |
| 44 | + |
| 45 | + |
| 46 | +## Building the `ISO3DFD` Program for GPU |
| 47 | + |
| 48 | +### Running Samples In DevCloud |
| 49 | +If you run a sample in the Intel DevCloud, remember that you must specify the compute node (CPU, GPU) as well as whether to run in batch or interactive mode. For more information, see the Intel® oneAPI Base Toolkit Get Started Guide (https://devcloud.intel.com/oneapi/get-started/base-toolkit/) and the Intel® oneAPI HPC Toolkit Get Started Guide (https://devcloud.intel.com/oneapi/get-started/hpc-toolkit/). |
| 50 | + |
| 51 | +### On a Linux* System |
| 52 | +Perform the following steps: |
| 53 | +1. Build the program using the following `cmake` commands. |
| 54 | +``` |
| 55 | +$ mkdir build |
| 56 | +$ cd build |
| 57 | +$ cmake .. |
| 58 | +$ make -j |
| 59 | +``` |
| 60 | + |
| 61 | +> Note: By default, the executable is built with the default baseline version. To build the kernel with one of the optimized versions, use one of the following: |
| 62 | +``` |
| 63 | +cmake -DUSE_OPT1=1 .. |
| 64 | +make -j |
| 65 | +``` |
| 66 | +``` |
| 67 | +cmake -DUSE_OPT2=1 .. |
| 68 | +make -j |
| 69 | +``` |
| 70 | +``` |
| 71 | +cmake -DUSE_OPT3=1 .. |
| 72 | +make -j |
| 73 | +``` |
| 74 | + |
| 75 | +2. Run the program: |
| 76 | + ``` |
| 77 | + make run |
| 78 | + ``` |
| 79 | + |
| 80 | +3. Clean the program using: |
| 81 | + ``` |
| 82 | + make clean |
| 83 | + ``` |
| 84 | + |
| 85 | +## Running the Sample |
| 86 | +``` |
| 87 | +make run |
| 88 | +``` |
| 89 | + |
| 90 | +### Application Parameters |
| 91 | +You can modify the ISO3DFD parameters from the command line. |
| 92 | + * Configurable Application Parameters |
| 93 | + |
| 94 | + Usage: src/iso3dfd n1 n2 n3 n1_block n2_block n3_block Iterations |
| 95 | +
|
| 96 | + n1 n2 n3 : Grid sizes for the stencil |
| 97 | + n1_block n2_block n3_block : cache block sizes for CPU |
| 98 | + : OR TILE sizes for OMP Offload |
| 99 | + Iterations : No. of timesteps. |
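The seven positional parameters in the usage line above could be read along the following lines. This is only a sketch: the struct `Iso3dfdParams`, the function `parse_params`, and the rule that all seven values must be present and positive are assumptions for illustration, and the sample's own argument handling may differ (for example, it may accept defaults).

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical container for the parameters named in the usage line.
struct Iso3dfdParams {
  long n1, n2, n3;                    // grid sizes for the stencil
  long n1_block, n2_block, n3_block;  // cache block or tile sizes
  long iterations;                    // number of time steps
};

// Returns true and fills `out` only when exactly seven positive integers
// follow the program name; otherwise returns false and leaves `out` alone.
bool parse_params(int argc, const char* const* argv, Iso3dfdParams& out) {
  assert(argv != nullptr);
  if (argc != 8) return false;
  long v[7];
  for (int i = 0; i < 7; ++i) {
    char* end = nullptr;
    v[i] = std::strtol(argv[i + 1], &end, 10);
    // Reject empty strings, trailing garbage, and non-positive values.
    if (end == argv[i + 1] || *end != '\0' || v[i] <= 0) return false;
  }
  out = Iso3dfdParams{v[0], v[1], v[2], v[3], v[4], v[5], v[6]};
  return true;
}
```

Under these assumptions, an invocation such as `src/iso3dfd 256 256 256 16 8 64 100` would yield a 256 x 256 x 256 grid with tile sizes 16, 8, and 64 and 100 time steps.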
| 100 | + |
| 101 | +### Example of Output with default baseline version |
| 102 | +``` |
| 103 | +Grid Sizes: 256 256 256 |
| 104 | +Tile sizes ignored for OMP Offload |
| 105 | +--Using Baseline version with omp target with collapse |
| 106 | +Memory Usage (MBytes): 230 |
| 107 | +-------------------------------------- |
| 108 | +time : 4.827 secs |
| 109 | +throughput : 347.57 Mpts/s |
| 110 | +flops : 21.2018 GFlops |
| 111 | +bytes : 4.17084 GBytes/s |
| 112 | + |
| 113 | +-------------------------------------- |
| 114 | + |
| 115 | +-------------------------------------- |
| 116 | +Checking Results ... |
| 117 | +Final wavefields from OMP Offload device and CPU are equivalent: Success |
| 118 | +-------------------------------------- |
| 119 | +``` |
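The metrics printed above are consistent with a simple model, sketched below. The constants are assumptions inferred from the printed numbers, not taken from the source code: the output matches 100 iterations over the full 256 x 256 x 256 grid, about 61 floating-point operations per grid point, and 12 bytes moved per grid point. The names `Metrics` and `compute_metrics` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type for the three derived metrics.
struct Metrics {
  double mpoints_per_s;  // millions of grid points updated per second
  double gflops;         // billions of floating-point operations per second
  double gbytes_per_s;   // billions of bytes moved per second
};

// Assumed per-point costs, inferred from the sample output above.
constexpr double kFlopsPerPoint = 61.0;
constexpr double kBytesPerPoint = 12.0;

Metrics compute_metrics(std::size_t n1, std::size_t n2, std::size_t n3,
                        unsigned iterations, double seconds) {
  assert(seconds > 0.0);
  // Total grid-point updates across all time steps.
  const double points = static_cast<double>(n1) * static_cast<double>(n2) *
                        static_cast<double>(n3) * iterations;
  const double mpts = points / seconds / 1e6;
  return Metrics{mpts, mpts * kFlopsPerPoint / 1e3,
                 mpts * kBytesPerPoint / 1e3};
}
```

With these assumptions, `compute_metrics(256, 256, 256, 100, 4.827)` gives roughly 347.57 Mpts/s, 21.20 GFlops, and 4.17 GBytes/s, in line with the baseline output.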
| 120 | + |
| 121 | +### Example of Output with Optimized version 3 |
| 122 | +``` |
| 123 | +Grid Sizes: 256 256 256 |
| 124 | +Tile sizes: 16 8 64 |
| 125 | +Using Optimized target code - version 3: |
| 126 | +--OMP Threads + OMP_Offload with Tiling and Z Window |
| 127 | +Memory Usage (MBytes): 230 |
| 128 | +-------------------------------------- |
| 129 | +time : 3.014 secs |
| 130 | +throughput : 556.643 Mpts/s |
| 131 | +flops : 33.9552 GFlops |
| 132 | +bytes : 6.67971 GBytes/s |
| 133 | + |
| 134 | +-------------------------------------- |
| 135 | + |
| 136 | +-------------------------------------- |
| 137 | +Checking Results ... |
| 138 | +Final wavefields from OMP Offload device and CPU are equivalent: Success |
| 139 | + |
| 140 | +``` |