
Commit d241a8b

terdner and JoeOster authored
Updated sample to increase memory size on CPU runs (oneapi-src#136)
* initial commit of openMP example. Signed-off-by: todd.erdner <[email protected]>
* Initial commit of the dpc_reduce. Signed-off-by: todd.erdner <[email protected]>
* added guid to sample.json. Signed-off-by: todd.erdner <[email protected]>
* fixed sample.json files. Signed-off-by: todd.erdner <[email protected]>
* fixed the include files. Somehow I copied a slightly old repo and it still had <chrono> and the omp_common.hpp file. They have been removed. Signed-off-by: todd.erdner <[email protected]>
* added license.txt file, ran through formatting tool one more time, removed all calls to "std::endl" and replaced with " \n". Signed-off-by: todd.erdner <[email protected]>
* renamed license.txt to License.txt. Signed-off-by: todd.erdner <[email protected]>
* added "ciTests" to the sample.json file. It passed the check. Signed-off-by: todd.erdner <[email protected]>
* fixed make error. Signed-off-by: todd.erdner <[email protected]>
* fixed sample.json. Signed-off-by: todd.erdner <[email protected]>
* removed "2020" from the License.txt file due to updated guidelines. Signed-off-by: todd.erdner <[email protected]>
* added comment regarding where you can find dpc_common in both files per Paul's comments. Signed-off-by: todd.erdner <[email protected]>
* Modified names of the functions to represent what they do (i.e. calc_pi_*) per suggestion from Paul. Signed-off-by: todd.erdner <[email protected]>
* initial check-in to the C++ repo. Signed-off-by: todd.erdner <[email protected]>
* put correct comment on dpc_common.hpp. Signed-off-by: todd.erdner <[email protected]>
* added commenting indicating where they can find corresponding include files. Signed-off-by: todd.erdner <[email protected]>
* added comment line. Signed-off-by: todd.erdner <[email protected]>
* removed openMP repo from DPC++ as it will be moved to C++ directory
* Update README.md
* Update README.md
* Update README.md
* Update README.md
* fixed category line in sample.json to match exact text expected.
* removing openMP from the DPC directory. It has been moved to C++ directory.
* fixed tf_init call. Signed-off-by: todd.erdner <[email protected]>
* removed all calls into PSTL internal logic. This is what was causing fails between beta08 and beta09. Signed-off-by: todd.erdner <[email protected]>
* fixed env variable to run on CPU. Signed-off-by: todd.erdner <[email protected]>
* updated Readme file to include information about setting an env variable to allocate more memory for any runs on the CPU. Signed-off-by: todd.erdner <[email protected]>

Co-authored-by: JoeOster <[email protected]>
1 parent 966dd85 commit d241a8b

File tree

4 files changed (+117 −94 lines)


DirectProgramming/DPC++/ParallelPatterns/dpc_reduce/CMakeLists.txt

Lines changed: 1 addition & 1 deletion
@@ -19,6 +19,6 @@ set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -ltbb -lsycl")
 add_executable (dpc_reduce src/main.cpp)

 add_custom_target (run
-    COMMAND dpc_reduce
+    COMMAND CL_CONFIG_CPU_FORCE_PRIVATE_MEM_SIZE=16MB dpc_reduce
     WORKING_DIRECTORY ${CMAKE_PROJECT_DIR}
 )

DirectProgramming/DPC++/ParallelPatterns/dpc_reduce/README.md

Lines changed: 4 additions & 0 deletions
@@ -36,6 +36,10 @@ Using Data Parallel C++, the code sample runs multiple MPI ranks to distribute the
 calculation of the number Pi. Each rank offloads the computation to an accelerator
 (GPU/CPU) using Intel DPC++ compiler to compute a partial compution of the number Pi.

+If you run the sample on a CPU as your default device, you may need to increase
+the memory allocation for openCL. You can do this by setting an environment variable,
+"CL_CONFIG_CPU_FORCE_PRIVATE_MEM_SIZE=16MB"
+

 ## Key Implementation Details
 The basic DPC++ implementation explained in the code includes accessor,
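The environment variable described in the README change can be set from a shell as follows. This is a minimal sketch assuming a bash-compatible shell; the `./dpc_reduce` invocation is shown only as a comment, since the binary's path depends on your build, and a subshell stands in for it here:

```shell
# Make the Intel OpenCL CPU runtime allocate 16 MB of private memory
# for everything launched from this shell session:
export CL_CONFIG_CPU_FORCE_PRIVATE_MEM_SIZE=16MB

# The variable is inherited by child processes, e.g. (hypothetical path):
#   CL_CONFIG_CPU_FORCE_PRIVATE_MEM_SIZE=16MB ./dpc_reduce
# Here a subshell stands in for the sample binary:
sh -c 'echo "private mem size: $CL_CONFIG_CPU_FORCE_PRIVATE_MEM_SIZE"'
# prints: private mem size: 16MB
```

Setting the variable per invocation (the `VAR=value command` form used by the CMake `run` target) keeps the larger allocation scoped to that single run.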

DirectProgramming/DPC++/ParallelPatterns/dpc_reduce/sample.json

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@
         "cd build",
         "cmake ..",
         "make",
-        "./dpc_reduce"
+        "CL_CONFIG_CPU_FORCE_PRIVATE_MEM_SIZE=16MB ./src/dpc_reduce"
       ]
     }
   ]

DirectProgramming/DPC++/ParallelPatterns/dpc_reduce/src/main.cpp

Lines changed: 111 additions & 92 deletions
@@ -167,108 +167,126 @@ struct slice_area {
 };
 };

-// a way to get value_type from both accessors and USM that is needed for
-// transform_init
+
+// a way to get value_type from both accessors and USM that is needed for transform_init
 template <typename Unknown>
-struct accessor_traits {};
-
-template <typename T, int Dim, sycl::access::mode AccMode,
-          sycl::access::target AccTarget, sycl::access::placeholder Placeholder>
-struct accessor_traits<
-    sycl::accessor<T, Dim, AccMode, AccTarget, Placeholder>> {
-  using value_type = typename sycl::accessor<T, Dim, AccMode, AccTarget,
-                                             Placeholder>::value_type;
+struct accessor_traits
+{
+};
+
+template <typename T, int Dim, sycl::access::mode AccMode, sycl::access::target AccTarget,
+          sycl::access::placeholder Placeholder>
+struct accessor_traits<sycl::accessor<T, Dim, AccMode, AccTarget, Placeholder>>
+{
+    using value_type = typename sycl::accessor<T, Dim, AccMode, AccTarget, Placeholder>::value_type;
 };

 template <typename RawArrayValueType>
-struct accessor_traits<RawArrayValueType*> {
-  using value_type = RawArrayValueType;
+struct accessor_traits<RawArrayValueType*>
+{
+    using value_type = RawArrayValueType;
 };

 // calculate shift where we should start processing on current item
-template <typename NDItemId, typename GlobalIdx, typename SizeNIter,
-          typename SizeN>
-SizeN calc_shift(const NDItemId item_id, const GlobalIdx global_idx,
-                 SizeNIter& n_iter, const SizeN n) {
-  auto global_range_size = item_id.get_global_range().size();
-
-  auto start = n_iter * global_idx;
-  auto global_shift = global_idx + n_iter * global_range_size;
-  if (n_iter > 0 && global_shift > n) {
-    start += n % global_range_size - global_idx;
-  } else if (global_shift < n) {
-    n_iter++;
-  }
-  return start;
-}
-
-template <typename ExecutionPolicy, typename Operation1, typename Operation2>
-struct transform_init {
-  Operation1 binary_op;
-  Operation2 unary_op;
-
-  template <typename NDItemId, typename GlobalIdx, typename Size,
-            typename AccLocal, typename... Acc>
-  void operator()(const NDItemId item_id, const GlobalIdx global_idx, Size n,
-                  AccLocal& local_mem, const Acc&... acc) {
-    auto local_idx = item_id.get_local_id(0);
+template <typename NDItemId, typename GlobalIdx, typename SizeNIter, typename SizeN>
+SizeN
+calc_shift(const NDItemId item_id, const GlobalIdx global_idx, SizeNIter& n_iter, const SizeN n)
+{
     auto global_range_size = item_id.get_global_range().size();
-    auto n_iter = n / global_range_size;
-    auto start = calc_shift(item_id, global_idx, n_iter, n);
-    auto shifted_global_idx = global_idx + start;

-    typename accessor_traits<AccLocal>::value_type res;
-    if (global_idx < n) {
-      res = unary_op(shifted_global_idx, acc...);
+    auto start = n_iter * global_idx;
+    auto global_shift = global_idx + n_iter * global_range_size;
+    if (n_iter > 0 && global_shift > n)
+    {
+        start += n % global_range_size - global_idx;
     }
-    // Add neighbour to the current local_mem
-    for (decltype(n_iter) i = 1; i < n_iter; ++i) {
-      res = binary_op(res, unary_op(shifted_global_idx + i, acc...));
+    else if (global_shift < n)
+    {
+        n_iter++;
     }
-    if (global_idx < n) {
-      local_mem[local_idx] = res;
+    return start;
+}
+
+
+template <typename ExecutionPolicy, typename Operation1, typename Operation2>
+struct transform_init
+{
+    Operation1 binary_op;
+    Operation2 unary_op;
+
+    template <typename NDItemId, typename GlobalIdx, typename Size, typename AccLocal, typename... Acc>
+    void
+    operator()(const NDItemId item_id, const GlobalIdx global_idx, Size n, AccLocal& local_mem,
+               const Acc&... acc)
+    {
+        auto local_idx = item_id.get_local_id(0);
+        auto global_range_size = item_id.get_global_range().size();
+        auto n_iter = n / global_range_size;
+        auto start = calc_shift(item_id, global_idx, n_iter, n);
+        auto shifted_global_idx = global_idx + start;
+
+        typename accessor_traits<AccLocal>::value_type res;
+        if (global_idx < n)
+        {
+            res = unary_op(shifted_global_idx, acc...);
+        }
+        // Add neighbour to the current local_mem
+        for (decltype(n_iter) i = 1; i < n_iter; ++i)
+        {
+            res = binary_op(res, unary_op(shifted_global_idx + i, acc...));
+        }
+        if (global_idx < n)
+        {
+            local_mem[local_idx] = res;
+        }
     }
-  }
 };

+
 // Reduce on local memory
 template <typename ExecutionPolicy, typename BinaryOperation1, typename Tp>
-struct reduce {
-  BinaryOperation1 bin_op1;
-
-  template <typename NDItemId, typename GlobalIdx, typename Size,
-            typename AccLocal>
-  Tp operator()(const NDItemId item_id, const GlobalIdx global_idx,
-                const Size n, AccLocal& local_mem) {
-    auto local_idx = item_id.get_local_id(0);
-    auto group_size = item_id.get_local_range().size();
-
-    auto k = 1;
-    do {
-      item_id.barrier(sycl::access::fence_space::local_space);
-      if (local_idx % (2 * k) == 0 && local_idx + k < group_size &&
-          global_idx < n && global_idx + k < n) {
-        local_mem[local_idx] =
-            bin_op1(local_mem[local_idx], local_mem[local_idx + k]);
-      }
-      k *= 2;
-    } while (k < group_size);
-    return local_mem[local_idx];
-  }
+struct reduce
+{
+    BinaryOperation1 bin_op1;
+
+    template <typename NDItemId, typename GlobalIdx, typename Size, typename AccLocal>
+    Tp
+    operator()(const NDItemId item_id, const GlobalIdx global_idx, const Size n, AccLocal& local_mem)
+    {
+        auto local_idx = item_id.get_local_id(0);
+        auto group_size = item_id.get_local_range().size();
+
+        auto k = 1;
+        do
+        {
+            item_id.barrier(sycl::access::fence_space::local_space);
+            if (local_idx % (2 * k) == 0 && local_idx + k < group_size && global_idx < n &&
+                global_idx + k < n)
+            {
+                local_mem[local_idx] = bin_op1(local_mem[local_idx], local_mem[local_idx + k]);
+            }
+            k *= 2;
+        } while (k < group_size);
+        return local_mem[local_idx];
+    }
 };

+
 // walk through the data
 template <typename ExecutionPolicy, typename F>
-struct walk_n {
-  F f;
-
-  template <typename ItemId, typename... Ranges>
-  auto operator()(const ItemId idx, Ranges&&... rngs)
-      -> decltype(f(rngs[idx]...)) {
-    return f(rngs[idx]...);
-  }
+struct walk_n
+{
+    F f;
+
+    template <typename ItemId, typename... Ranges>
+    auto
+    operator()(const ItemId idx, Ranges&&... rngs) -> decltype(f(rngs[idx]...))
+    {
+        return f(rngs[idx]...);
+    }
 };

+
 // This option uses a parallel for to fill the buffer and then
 // uses a tranform_init with plus/no_op and then
 // a local reduction then global reduction.
@@ -301,12 +319,12 @@ float calc_pi_dpstd_native3(size_t num_steps, int groups, Policy&& policy) {
   // In this example we have done the calculation and filled the buffer above
   // The way transform_init works is that you need to have the value already
   // populated in the buffer.
-  auto tf_init = transform_init<Policy, std::plus<float>, Functor>{
-      std::plus<float>(), Functor{my_no_op()}};
+  auto tf_init = transform_init<Policy, std::plus<float>,
+                                Functor>{std::plus<float>(), Functor{my_no_op()}};

   auto combine = std::plus<float>();
-  auto brick_reduce =
-      reduce<Policy, std::plus<float>, float>{std::plus<float>()};
+  auto brick_reduce = reduce<Policy, std::plus<float>, float>{
+      std::plus<float>()};
   auto workgroup_size =
       policy.queue()
           .get_device()
@@ -336,8 +354,8 @@ float calc_pi_dpstd_native3(size_t num_steps, int groups, Policy&& policy) {
         [=](nd_item<1> item_id) mutable {
           auto global_idx = item_id.get_global_id(0);
           // 1. Initialization (transform part).
-          tf_init(item_id, global_idx, num_steps, temp_buf_local,
-                  access_buf);
+          tf_init(item_id, global_idx, num_steps,
+                  temp_buf_local, access_buf);
           // 2. Reduce within work group
           float local_result = brick_reduce(
               item_id, global_idx, num_steps, temp_buf_local);
@@ -402,12 +420,13 @@ float calc_pi_dpstd_native4(size_t num_steps, int groups, Policy&& policy) {
   // The buffer has 1...num it at and now we will use that as an input
   // to the slice structue which will calculate the area of each
   // rectangle.
-  auto tf_init = transform_init<Policy, std::plus<float>, Functor2>{
-      std::plus<float>(), Functor2{slice_area(num_steps)}};
+  auto tf_init = transform_init<Policy, std::plus<float>,
+                                Functor2>{
+      std::plus<float>(), Functor2{slice_area(num_steps)}};

   auto combine = std::plus<float>();
-  auto brick_reduce =
-      reduce<Policy, std::plus<float>, float>{std::plus<float>()};
+  auto brick_reduce = reduce<Policy, std::plus<float>, float>{
+      std::plus<float>()};

   // get workgroup_size from the device
   auto workgroup_size =
@@ -446,8 +465,8 @@ float calc_pi_dpstd_native4(size_t num_steps, int groups, Policy&& policy) {
           auto global_idx = item_id.get_global_id(0);
           // 1. Initialization (transform part). Fill local
           // memory
-          tf_init(item_id, global_idx, num_steps, temp_buf_local,
-                  access_buf);
+          tf_init(item_id, global_idx, num_steps,
+                  temp_buf_local, access_buf);
           // 2. Reduce within work group
           float local_result = brick_reduce(
               item_id, global_idx, num_steps, temp_buf_local);
