File: DirectProgramming/DPC++FPGA/ReferenceDesigns/qrd/README.md (44 additions, 31 deletions)
# QR Decomposition of Matrices
This DPC++ reference design demonstrates high performance QR decomposition of complex/real matrices on FPGA.
***Documentation***: The [DPC++ FPGA Code Samples Guide](https://software.intel.com/content/www/us/en/develop/articles/explore-dpcpp-through-intel-fpga-code-samples.html) helps you to navigate the samples and build your knowledge of DPC++ for FPGA. <br>
The [oneAPI DPC++ FPGA Optimization Guide](https://software.intel.com/content/www/us/en/develop/documentation/oneapi-fpga-optimization-guide) is the reference manual for targeting FPGAs through DPC++. <br>
Please refer to the performance disclaimer at the end of this README.
| Device | Throughput
|:--- |:---
| Intel® PAC with Intel Arria® 10 GX FPGA | 24k matrices/s for complex matrices of size 128 * 128
| Intel® FPGA PAC D5005 (with Intel Stratix® 10 SX) | 7k matrices/s for complex matrices of size 256 * 256
This FPGA reference design demonstrates QR decomposition of matrices of complex/real numbers, a common operation employed in linear algebra. Matrix _A_ (input) is decomposed into a product of an orthogonal matrix _Q_ and an upper triangular matrix _R_.
The algorithms employed by the reference design are the Gram-Schmidt QR decomposition algorithm and the thin QR factorization method. Background information on these algorithms can be found in Wikipedia's [QR decomposition](https://en.wikipedia.org/wiki/QR_decomposition) article. The original algorithm has been modified and optimized for performance on FPGAs in this implementation.
QR decomposition is used extensively in signal processing applications.
### Matrix dimensions and FPGA resources
The QR decomposition algorithm factors a complex _m_ × _n_ matrix, where _m_ ≥ _n_. The algorithm computes the vector dot product of two columns of the matrix. In our FPGA implementation, the dot product is computed in a loop over the column's _m_ elements. The loop is fully unrolled to maximize throughput. As a result, _m_ complex multiplication operations are performed in parallel on the FPGA, followed by sequential additions to compute the dot product result.
We use the compiler flag `-fp-relaxed`, which permits the compiler to reorder floating point additions (i.e. to assume that floating point addition is commutative). The compiler uses this freedom to reorder the additions so that the dot product arithmetic can be optimally implemented using the FPGA's specialized floating point DSP (Digital Signal Processing) hardware.
With this optimization, our FPGA implementation requires 4 × _m_ DSPs to compute the complex floating point dot product, or 2 × _m_ DSPs for the real case. Thus, the matrix size is constrained by the total FPGA DSP resources available.
By default, the design is parameterized to process 128 × 128 matrices when compiled targeting Intel® PAC with Intel Arria® 10 GX FPGA. It is parameterized to process 256 × 256 matrices when compiled targeting Intel® FPGA PAC D5005 (with Intel Stratix® 10 SX), a larger device. However, the design can process matrices ranging from 4 × 4 to 512 × 512.
## Key Implementation Details
| Kernel | Description
|:--- |:---

To optimize the performance-critical loop in its algorithm, the design leverages concepts discussed in the following FPGA tutorials:

* **Unrolling Loops** (loop_unroll)
The key optimization techniques used are as follows:
1. Refactoring the original Gram-Schmidt algorithm to merge two dot products into one, reducing the total number of dot products needed from three to two. This helps us reduce the DSPs required for the implementation.
2. Converting the nested loop into a single merged loop and applying Triangular Loop optimizations. This allows us to generate a design that is very well pipelined.
3. Fully vectorizing the dot products using loop unrolling.
4. Using the compiler flag `-Xsfp-relaxed` to reorder floating point operations, allowing the inference of a specialized dot-product DSP. This further reduces the number of DSP blocks needed by the implementation, the overall latency, and the pipeline depth.
* An FPGA hardware target is not provided on Windows*.
*Note:* The Intel® PAC with Intel Arria® 10 GX FPGA and Intel® FPGA PAC D5005 (with Intel Stratix® 10 SX) do not yet support Windows*. Compiling to FPGA hardware on Windows* requires a third-party or custom Board Support Package (BSP) with Windows* support.<br>
*Note:* If you encounter any issues with long paths when compiling under Windows*, you may have to create your ‘build’ directory in a shorter path, for example c:\samples\build. You can then run cmake from that directory, and provide cmake with the full path to your sample directory.
### In Third-Party Integrated Development Environments (IDEs)
You can compile and run this Reference Design in the Eclipse* IDE (in Linux*) and the Visual Studio* IDE (in Windows*). For instructions, refer to the following link: [Intel® oneAPI DPC++ FPGA Workflows on Third-Party IDEs](https://software.intel.com/en-us/articles/intel-oneapi-dpcpp-fpga-workflow-on-ide)
## Running the Reference Design
You can apply QR decomposition to a number of matrices, as shown below. This step performs the following:
* Generates the number of random matrices specified as the command line argument (defaults to 128).
* Computes QR decomposition on all matrices.
* Evaluates performance.
NOTE: The design is optimized to perform best when run on a large number of matrices, where the total number of matrices is a power of 2.
```
qrd.fpga_emu.exe (Windows)
```
2. Run the sample on the FPGA device. It is recommended to pass in an optional argument when invoking the sample on hardware; otherwise, the performance will not be representative of the design's throughput. The throughput is measured as the number of matrices decomposed divided by the total kernel execution time. However, transferring the matrices between the host and the device also takes time. This memory transfer is performed in chunks of matrices, in parallel with the compute kernel, so during the first and last chunk transfers the computation kernel is idle. Thus, the higher the number of matrices to be decomposed, the more accurate the throughput result will be.
```
./qrd.fpga (Linux)
```
### Application Parameters
| Argument | Description
--- |---
| `<num>` | Optional argument that specifies the number of times to repeat the decomposition of 8 matrices. Its default value is `16` for the emulation flow and `819200` for the FPGA flow.
### Example of Output
Example output when running on Intel® PAC with Intel Arria® 10 GX FPGA for 8 matrices 819200 times (each matrix consisting of 128*128 complex numbers):
```
Generating 8 random complex matrices of size 128x128
Running QR decomposition of 8 matrices 819200 times
Total duration: 268.733 s
Throughput: 24.387k matrices/s
Verifying results on matrix 0
1
2
3
4
5
6
7
PASSED
```
Example output when running on Intel® FPGA PAC D5005 (with Intel Stratix® 10 SX) for the decomposition of 8 matrices 819200 times (each matrix consisting of 256*256 complex numbers):
```
Generating 8 random complex matrices of size 256x256
Running QR decomposition of 8 matrices 819200 times
Total duration: 888.077 s
Throughput: 7.37954k matrices/s
Verifying results on matrix 0
1
2
3
4
5
6
7
PASSED
```
`-DROWS_COMPONENT` | Specifies the number of rows of the matrix
`-DCOLS_COMPONENT` | Specifies the number of columns of the matrix
`-DFIXED_ITERATIONS` | Used to set the ivdep safelen attribute for the performance critical triangular loop
`-DCOMPLEX` | Used to select between the complex and real QR decomposition (complex is the default)
NOTE: The values for `seed`, `FIXED_ITERATIONS`, `ROWS_COMPONENT`, `COLS_COMPONENT` are set according to the board being targeted.
### Performance disclaimers
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit [www.intel.com/benchmarks](www.intel.com/benchmarks).
The performance was measured by Intel on July 29, 2020.
Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.