Commit 5e5e162

Fixing functional issue in DB query 11 (oneapi-src#841)
* Fixing bug in DB query 11
* Updated cache depth
* Fixed typo :/
* Changes to MapJoin to try and fix issue :(
* Reset old loop structure for compute loop in Q11
* Printing kernel time as well as total time (including memory transfer)
* Printing more performance results
* Changes to Q11 to improve performance
  std::endl for some messages to ensure buffer is flushed
  Updated README with new output example
* Creating new, more useful, CachedMemory
  More explicit messages between kernels to try and address hang
* Removed debug prints
* Removed ArrayMap from MapJoin in favour of separate 'data' and 'valid' map arrays passed in by caller. This allows the compiler to create a more optimal memory system.
  Updated q9 and q11 to use this new MapJoin
* Removing ivdep from map join loops in query9 and query11 since it is no longer needed with the change to loop induction variable indexing.
* Samples JSON, adding SF_SMALL=1 to Windows CMake generation
1 parent 9c7d508 commit 5e5e162

11 files changed: +210 -216 lines changed

DirectProgramming/DPC++FPGA/ReferenceDesigns/db/README.md

Lines changed: 4 additions & 11 deletions
@@ -16,16 +16,7 @@ The [oneAPI Programming Guide](https://software.intel.com/en-us/oneapi-programmi
 _Notice: This example design is only officially supported for the Intel® FPGA PAC D5005 (with Intel Stratix® 10 SX)_

 **Performance**
-In this design, we accelerate four database queries as *offload accelerators*. In an offload accelerator scheme, the queries are performed by transferring the relevant data from the CPU host to the FPGA, starting the query kernel on the FPGA, and copying the results back. This means that the relevant performance number is the latency (i.e., the wall clock time) from when the query is requested to the time the output data is accessible by the host. This includes the time to transfer data between the CPU and FPGA over PCIe (with an approximate read and write bandwidth of 6877 and 6582 MB/s, respectively). As shown in the table below, most of the total query time is spent transferring the data between the CPU and FPGA, and the query kernels themselves are a small portion of the total latency.
-
-The performance data below was gathered using the Intel® FPGA PAC D5005 (with Intel Stratix® 10 SX) with a database scale factor (SF) of 1. Please see the [Database files](#database-files) section for more information on generating data for a scale factor of 1.
-
-| Query | Approximate Data Transfer Time (ms) | Measured Total Query Processing Time (ms)
-|:--- |:--- |:---
-| 1 | 35 | 39
-| 9 | 37 | 43
-| 11 | 5 | 11
-| 12 | 16 | 26
+In this design, we accelerate four database queries as *offload accelerators*. In an offload accelerator scheme, the queries are performed by transferring the relevant data from the CPU host to the FPGA, starting the query kernel on the FPGA, and copying the results back. This means that the relevant performance number is the processing time (i.e., the wall clock time) from when the query is requested to the time the output data is accessible by the host. This includes the time to transfer data between the CPU and FPGA over PCIe (with an approximate read and write bandwidth of 6877 and 6582 MB/s, respectively). As shown in the table below, most of the total query time is spent transferring the data between the CPU and FPGA, and the query kernels themselves are a small portion of the total latency.

 ## Purpose
 The database in this tutorial has 8-tables and a set of 21 business-oriented queries with broad industry-wide relevance. This reference design shows how four queries can be accelerated using the Intel® FPGA PAC D5005 (with Intel Stratix® 10 SX) and oneAPI. To do so, we create a set of common database operators (found in the `src/db_utils/` directory) that are are combined in different ways to build the four queries.
@@ -232,7 +223,9 @@ You should see the following output in the console:
 Validating query 1 test results
 Running Q1 within 90 days of 1998-12-1
 Validating query 1 test results
-Processing time: 40.2986 ms
+Total processing time: 34.389 ms
+Kernel processing time: 3.16621 ms
+Throughput: 315.835 queries/s
 PASSED
 ```
 NOTE: the scale factor 1 (SF=1) database files (`../data/sf1`) are **not** shipped with this reference design. Please refer to the [Database files](#database-files) section for information on how to generate these files yourself.

DirectProgramming/DPC++FPGA/ReferenceDesigns/db/sample.json

Lines changed: 3 additions & 3 deletions
@@ -145,7 +145,7 @@
 "cd ../..",
 "mkdir build-q1",
 "cd build-q1",
-"cmake -G \"NMake Makefiles\" ../ReferenceDesigns/db -DQUERY=1",
+"cmake -G \"NMake Makefiles\" ../ReferenceDesigns/db -DQUERY=1 -DSF_SMALL=1",
 "nmake report"
 ]
 },
@@ -156,7 +156,7 @@
 "cd ../..",
 "mkdir build-q11",
 "cd build-q11",
-"cmake -G \"NMake Makefiles\" ../ReferenceDesigns/db -DQUERY=11",
+"cmake -G \"NMake Makefiles\" ../ReferenceDesigns/db -DQUERY=11 -DSF_SMALL=1",
 "nmake report"
 ]
 },
@@ -167,7 +167,7 @@
 "cd ../..",
 "mkdir build-q12",
 "cd build-q12",
-"cmake -G \"NMake Makefiles\" ../ReferenceDesigns/db -DQUERY=12",
+"cmake -G \"NMake Makefiles\" ../ReferenceDesigns/db -DQUERY=12 -DSF_SMALL=1",
 "nmake report"
 ]
 }

DirectProgramming/DPC++FPGA/ReferenceDesigns/db/src/db.cpp

Lines changed: 12 additions & 4 deletions
@@ -268,8 +268,15 @@ int main(int argc, char* argv[]) {
 std::accumulate(total_latency.begin() + 1, total_latency.end(), 0.0) /
 (double)(runs - 1);

+double kernel_latency_avg =
+std::accumulate(kernel_latency.begin() + 1, kernel_latency.end(), 0.0) /
+(double)(runs - 1);
+
 // print the performance results
 std::cout << "Processing time: " << total_latency_avg << " ms\n";
+std::cout << "Kernel time: " << kernel_latency_avg << " ms\n";
+std::cout << "Throughput: " << ((1 / kernel_latency_avg) * 1e3)
+<< " queries/s\n";
 #endif

 std::cout << "PASSED\n";
@@ -325,7 +332,7 @@ bool DoQuery1(queue& q, Database& dbinfo, std::string& db_root_dir,
 unsigned int low_date_compact = low_date.ToCompact();

 std::cout << "Running Q1 within " << DELTA << " days of " << date.year << "-"
-<< date.month << "-" << date.day << "\n";
+<< date.month << "-" << date.day << std::endl;

 // the query output data
 std::array<DBDecimal, kQuery1OutSize> sum_qty = {0}, sum_base_price = {0},
@@ -378,7 +385,7 @@ bool DoQuery9(queue& q, Database& dbinfo, std::string& db_root_dir,
 // convert the colour regex to uppercase characters (convention)
 transform(colour.begin(), colour.end(), colour.begin(), ::toupper);

-std::cout << "Running Q9 with colour regex: " << colour << "\n";
+std::cout << "Running Q9 with colour regex: " << colour << std::endl;

 // the output of the query
 std::array<DBDecimal, 25 * 2020> sum_profit;
@@ -424,7 +431,8 @@ bool DoQuery11(queue& q, Database& dbinfo, std::string& db_root_dir,
 transform(nation.begin(), nation.end(), nation.begin(), ::toupper);

 std::cout << "Running Q11 for nation " << nation.c_str()
-<< " (key=" << (int)(dbinfo.n.name_key_map[nation]) << ")\n";
+<< " (key=" << (int)(dbinfo.n.name_key_map[nation]) << ")"
+<< std::endl;

 // the query output
 std::vector<DBIdentifier> partkeys(kPartTableSize);
@@ -492,7 +500,7 @@ bool DoQuery12(queue& q, Database& dbinfo, std::string& db_root_dir,

 std::cout << "Running Q12 between years " << low_date.year << " and "
 << high_date.year << " for SHIPMODES " << shipmode1 << " and "
-<< shipmode2 << "\n";
+<< shipmode2 << std::endl;;

 // the output of the query
 std::array<DBDecimal, 2> high_line_count, low_line_count;
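
The averaging and throughput arithmetic added to `main` in this file can be sketched in isolation. The helper names `AverageExcludingFirst` and `QueriesPerSecond` are hypothetical and do not appear in the commit, and the reading that the first run is skipped as a warm-up is an assumption; the diff only shows that accumulation starts at `begin() + 1` and divides by `runs - 1`.

```cpp
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Mean of all samples except the first, following the
// std::accumulate(begin() + 1, end(), 0.0) / (runs - 1) pattern in the hunk.
// Assumption: the first run is treated as a warm-up. Needs >= 2 samples.
inline double AverageExcludingFirst(const std::vector<double>& samples_ms) {
  assert(samples_ms.size() >= 2);
  const double sum =
      std::accumulate(samples_ms.begin() + 1, samples_ms.end(), 0.0);
  return sum / static_cast<double>(samples_ms.size() - 1);
}

// Convert an average latency in milliseconds into queries per second,
// i.e. (1 / latency_ms) * 1e3 as printed by the new "Throughput" line.
inline double QueriesPerSecond(double latency_ms) {
  return (1.0 / latency_ms) * 1e3;
}
```

Note that the commit derives throughput from the kernel latency, not the total latency: with the README's sample kernel time of 3.16621 ms, `QueriesPerSecond` gives roughly 315.835 queries/s, which matches the sample output.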

DirectProgramming/DPC++FPGA/ReferenceDesigns/db/src/db_utils/Accumulator.hpp

Lines changed: 10 additions & 11 deletions
@@ -89,13 +89,13 @@ class BRAMAccumulator {
 // initialize the memory entries
 void Init() {
 // initialize the memory entries
-for (IndexType i = 0; i < size; i++) {
+for (int i = 0; i < size; i++) {
 mem[i] = 0;
 }

 // initialize the cache
 #pragma unroll
-for (IndexType i = 0; i < cache_size + 1; i++) {
+for (int i = 0; i < cache_size + 1; i++) {
 cache_value[i] = 0;
 cache_tag[i] = 0;
 }
@@ -104,34 +104,33 @@ class BRAMAccumulator {
 // accumulate 'value' into register 'index' (i.e. registers[index] += value)
 void Accumulate(IndexType index, StorageType value) {
 // get value from memory
-StorageType currVal = mem[index];
+StorageType curr_val = mem[index];

 // check if value is in cache
 #pragma unroll
-for (IndexType i = 0; i < cache_size + 1; i++) {
+for (int i = 0; i < cache_size + 1; i++) {
 if (cache_tag[i] == index) {
-currVal = cache_value[i];
+curr_val = cache_value[i];
 }
 }

 // write the new value to both the shift register cache and the local mem
-const StorageType newVal = currVal + value;
-mem[index] = cache_value[cache_size] = newVal;
+StorageType new_val = curr_val + value;
+mem[index] = new_val;
+cache_value[cache_size] = new_val;
 cache_tag[cache_size] = index;

 // Cache is just a shift register, so shift it
 // pushing into back of the shift register done above
 #pragma unroll
-for (IndexType i = 0; i < cache_size; i++) {
+for (int i = 0; i < cache_size; i++) {
 cache_value[i] = cache_value[i + 1];
 cache_tag[i] = cache_tag[i + 1];
 }
 }

 // get the value of memory at 'index'
-StorageType Get(IndexType index) {
-return mem[index];
-}
+StorageType Get(IndexType index) { return mem[index]; }

 // internal storage
 StorageType mem[size];
Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
+#ifndef __CACHED_MEMORY_HPP__
+#define __CACHED_MEMORY_HPP__
+
+template <typename StorageType, int n, int cache_n,
+typename IndexType = int>
+class CachedMemory {
+// static asserts
+static_assert(n > 0);
+static_assert(cache_n >= 0);
+static_assert(std::is_arithmetic<StorageType>::value,
+"StorageType must be arithmetic to support accumulation");
+static_assert(std::is_integral<IndexType>::value,
+"IndexType must be an integral type");
+static_assert(std::numeric_limits<IndexType>::max() >= (n - 1),
+"IndexType must be large enough to index the entire array");
+
+public:
+CachedMemory() {}
+
+void Init(StorageType init_val = 0) {
+for (int i = 0; i < n; i++) {
+mem[i] = init_val;
+}
+#pragma unroll
+for (int i = 0; i < cache_n + 1; i++) {
+cache_value[i] = init_val;
+cache_tag[i] = 0;
+}
+}
+
+auto Get(IndexType idx) {
+// grab the value from memory
+StorageType ret = mem[idx];
+
+// check for this value in the cache as well
+#pragma unroll
+for (int i = 0; i < cache_n + 1; i++) {
+if (cache_tag[i] == idx) {
+ret = cache_value[i];
+}
+}
+
+return ret;
+}
+
+void Set(IndexType idx, StorageType val) {
+// store the new value in the actual memory, and the start of the shift
+// register cache
+mem[idx] = val;
+cache_value[cache_n] = val;
+cache_tag[cache_n] = idx;
+
+// shift the shift register cache
+#pragma unroll
+for (int i = 0; i < cache_n; i++) {
+cache_value[i] = cache_value[i + 1];
+cache_tag[i] = cache_tag[i + 1];
+}
+}
+
+private:
+// internal storage
+StorageType mem[n];
+
+// internal cache for hiding write latency
+[[intel::fpga_register]]
+StorageType cache_value[cache_n + 1];
+
+[[intel::fpga_register]]
+int cache_tag[cache_n + 1];
+};
+
+#endif /* __CACHED_MEMORY_HPP__ */
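
To make the `Get`/`Set` contract of the new class easier to follow, here is a host-only model of the same logic with a read-modify-write loop on top. `HostCachedMemory` and `CountKeys` are hypothetical names introduced for this sketch, and the FPGA attributes and static asserts are dropped. The source comment says the cache is for "hiding write latency"; the interpretation that `Get` forwards recent writes which may not yet be visible in `mem` on hardware is an assumption, since on a CPU `mem` is always current.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Host-only model of the CachedMemory logic in this commit (hypothetical
// name; FPGA attributes and static_asserts intentionally omitted).
template <typename StorageType, int n, int cache_n>
class HostCachedMemory {
 public:
  void Init(StorageType init_val = 0) {
    for (int i = 0; i < n; i++) mem[i] = init_val;
    for (int i = 0; i < cache_n + 1; i++) {
      cache_value[i] = init_val;
      cache_tag[i] = 0;
    }
  }

  StorageType Get(int idx) const {
    StorageType ret = mem[idx];
    // Higher slots hold newer writes, so the last matching tag wins.
    for (int i = 0; i < cache_n + 1; i++) {
      if (cache_tag[i] == idx) ret = cache_value[i];
    }
    return ret;
  }

  void Set(int idx, StorageType val) {
    // Write memory and the back of the shift register, then shift.
    mem[idx] = val;
    cache_value[cache_n] = val;
    cache_tag[cache_n] = idx;
    for (int i = 0; i < cache_n; i++) {
      cache_value[i] = cache_value[i + 1];
      cache_tag[i] = cache_tag[i + 1];
    }
  }

 private:
  StorageType mem[n];
  StorageType cache_value[cache_n + 1];
  int cache_tag[cache_n + 1];
};

// Count occurrences of each key with back-to-back read-modify-write
// operations, the access pattern the cache is meant to keep correct.
template <int n, int cache_n>
std::array<int, n> CountKeys(const std::vector<int>& keys) {
  HostCachedMemory<int, n, cache_n> counts;
  counts.Init(0);
  for (int key : keys) counts.Set(key, counts.Get(key) + 1);
  std::array<int, n> out{};
  for (int i = 0; i < n; i++) out[i] = counts.Get(i);
  return out;
}
```

Compared with `BRAMAccumulator::Accumulate`, which fuses the lookup and the update, splitting the interface into `Get` and `Set` lets the caller decide what to do between the read and the write.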

DirectProgramming/DPC++FPGA/ReferenceDesigns/db/src/db_utils/MapJoin.hpp

Lines changed: 3 additions & 37 deletions
@@ -11,45 +11,13 @@
 #include "Unroller.hpp"
 #include "Tuple.hpp"
 #include "StreamingData.hpp"
-#include "ShannonIterator.hpp"
-
-//
-// ArrayMap class
-//
-template <typename Type, int size>
-class ArrayMap {
-// static asserts
-static_assert(size > 0,
-"size must be positive and non-zero");
-static_assert(std::is_same<bool, decltype(Type().valid)>::value,
-"Type must have a 'valid' boolean member");
-
-public:
-void Init() {
-for (unsigned int i = 0; i < size; i++) {
-valid[i] = false;
-}
-}
-
-std::pair<bool, Type> Get(unsigned int key) {
-return {valid[key], map[key]};
-}
-
-void Set(unsigned int key, Type data) {
-map[key] = data;
-valid[key] = true;
-}
-
-Type map[size];
-bool valid[size];
-};

 //
 // MapJoin implementation
 //
 template<typename MapType, typename T2Pipe, typename T2Data, int t2_win_size,
 typename JoinPipe, typename JoinType>
-void MapJoin(MapType& map) {
+void MapJoin(MapType map_data[], bool map_valid[]) {
 //////////////////////////////////////////////////////////////////////////////
 // static asserts
 static_assert(t2_win_size > 0,
@@ -89,12 +57,10 @@ void MapJoin(MapType& map) {
 const unsigned int t2_key =
 in_data.data.template get<j>().PrimaryKey();

-auto [data_valid, map_data] = map.Get(t2_key);
-
-if (t2_win_valid && data_valid) {
+if (t2_win_valid && map_valid[t2_key]) {
 // NOTE: order below important if Join() overrides valid
 join_data.data.template get<j>().valid = true;
-join_data.data.template get<j>().Join(map_data,
+join_data.data.template get<j>().Join(map_data[t2_key],
 in_data.data.template get<j>());
 }
 });
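
The new `MapJoin` signature takes the map as two caller-owned parallel arrays instead of an `ArrayMap` object. A simplified host-side sketch of that lookup pattern follows. The row types, the function name `MapJoinSketch`, and the choice of `suppkey` as the join key are all hypothetical; the real kernel reads from and writes to pipes and sets `valid` flags on output rows rather than filtering into a vector.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified row types for this sketch (not the repo's classes).
struct PartSupplierRowSketch {
  unsigned int partkey;
  unsigned int suppkey;
  unsigned int PrimaryKey() const { return suppkey; }
};

struct JoinedRowSketch {
  unsigned int partkey;
  unsigned char nationkey;
};

// Join each input row against a map stored as two parallel arrays:
// map_valid[key] says whether an entry exists, map_data[key] is its payload.
inline std::vector<JoinedRowSketch> MapJoinSketch(
    const unsigned char map_data[], const bool map_valid[],
    const std::vector<PartSupplierRowSketch>& rows) {
  std::vector<JoinedRowSketch> out;
  for (const auto& row : rows) {
    const unsigned int key = row.PrimaryKey();
    if (map_valid[key]) {
      // Only rows whose key is present in the map produce a joined row.
      out.push_back({row.partkey, map_data[key]});
    }
  }
  return out;
}
```

Per the commit message, keeping `data` and `valid` as separate arrays lets the compiler create a more optimal memory system; it also allows the payload to shrink to a single `unsigned char` nation key, as the `Join` signature change in `query11/pipe_types.hpp` shows.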

DirectProgramming/DPC++FPGA/ReferenceDesigns/db/src/dbdata.cpp

Lines changed: 3 additions & 3 deletions
@@ -507,7 +507,7 @@ bool Database::ValidateQ1(std::string db_root_dir,
 std::array<DBDecimal, 3 * 2>& avg_price,
 std::array<DBDecimal, 3 * 2>& avg_discount,
 std::array<DBDecimal, 3 * 2>& count) {
-std::cout << "Validating query 1 test results\n";
+std::cout << "Validating query 1 test results" << std::endl;

 // populate date row by row (as presented in the file)
 std::string path(db_root_dir + kSeparator + "answers" + kSeparator + "q1.out");
@@ -630,7 +630,7 @@ bool Database::ValidateQ1(std::string db_root_dir,
 //
 bool Database::ValidateQ9(std::string db_root_dir,
 std::array<DBDecimal, 25 * 2020>& sum_profit) {
-std::cout << "Validating query 9 test results\n";
+std::cout << "Validating query 9 test results" << std::endl;

 // populate date row by row (as presented in the file)
 std::string path(db_root_dir + kSeparator + "answers" + kSeparator + "q9.out");
@@ -684,7 +684,7 @@ bool Database::ValidateQ9(std::string db_root_dir,
 bool Database::ValidateQ11(std::string db_root_dir,
 std::vector<DBIdentifier>& partkeys,
 std::vector<DBDecimal>& partkey_values) {
-std::cout << "Validating query 11 test results\n";
+std::cout << "Validating query 11 test results" << std::endl;

 // populate date row by row (as presented in the file)
 std::string path(db_root_dir + kSeparator + "answers" + kSeparator + "q11.out");

DirectProgramming/DPC++FPGA/ReferenceDesigns/db/src/query11/pipe_types.hpp

Lines changed: 2 additions & 20 deletions
@@ -10,24 +10,6 @@

 using namespace sycl;

-//
-// A single row of the SUPPLIER table
-// with a subset of the columns (needed for this query)
-//
-class SupplierRow {
-public:
-// SupplierRow() : valid(false), suppkey(0), nationkey(0) {}
-SupplierRow() {}
-SupplierRow(bool v_valid, DBIdentifier v_suppkey, unsigned char v_nationkey)
-: valid(v_valid), suppkey(v_suppkey), nationkey(v_nationkey) {}
-
-DBIdentifier PrimaryKey() const { return suppkey; }
-
-bool valid;
-DBIdentifier suppkey;
-unsigned char nationkey;
-};
-
 //
 // A single row of the PARTSUPPLIER table
 // with a subset of the columns (needed for this query)
@@ -71,11 +53,11 @@ class SupplierPartSupplierJoined {

 DBIdentifier PrimaryKey() const { return partkey; }

-void Join(const SupplierRow& s_row, const PartSupplierRow& ps_row) {
+void Join(const unsigned char nation_key, const PartSupplierRow& ps_row) {
 partkey = ps_row.partkey;
 availqty = ps_row.availqty;
 supplycost = ps_row.supplycost;
-nationkey = s_row.nationkey;
+nationkey = nation_key;
 }

 bool valid;
