Commit f3c2428

Add Intrinsics Code Sample (oneapi-src#17)
* added source files for intrinsics sample
  Signed-off-by: Ethan Hirsch <[email protected]>
* added sample.json
  Signed-off-by: Ethan Hirsch <[email protected]>
* updated sample.json to comply with guidelines
  Signed-off-by: Ethan Hirsch <[email protected]>
* added implementation deets to readme
  Signed-off-by: Ethan Hirsch <[email protected]>
* added sample output to readme
  Signed-off-by: Ethan Hirsch <[email protected]>
* removed old readme
  Signed-off-by: Ethan Hirsch <[email protected]>
* renamed .c files to .cpp, including renaming all file references
  Signed-off-by: Ethan Hirsch <[email protected]>
* added debug ci config and documentation
  Signed-off-by: Ethan Hirsch <[email protected]>
* styled src files with clang-format
  Signed-off-by: Ethan Hirsch <[email protected]>
* update license.txt
  Signed-off-by: Ethan Hirsch <[email protected]>
1 parent d2c48ea commit f3c2428

7 files changed

+710
-0
lines changed

Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
1+
#==============================================================
2+
#
3+
# SAMPLE SOURCE CODE - SUBJECT TO THE TERMS OF SAMPLE CODE LICENSE AGREEMENT,
4+
# http://software.intel.com/en-us/articles/intel-sample-source-code-license-agreement/
5+
#
6+
# Copyright Intel Corporation
7+
#
8+
# THIS FILE IS PROVIDED "AS IS" WITH NO WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT
9+
# NOT LIMITED TO ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
10+
# PURPOSE, NON-INFRINGEMENT OF INTELLECTUAL PROPERTY RIGHTS.
11+
#
12+
# =============================================================
13+
CC = icc
14+
EXECS=intrin_dot_sample.exe intrin_double_sample.exe intrin_ftz_sample.exe
15+
DBG_EXECS=intrin_dot_sample_dbg.exe intrin_double_sample_dbg.exe intrin_ftz_sample_dbg.exe
16+
17+
release: $(EXECS)
18+
19+
debug: $(DBG_EXECS)
20+
21+
run: release
22+
@for i in $(EXECS); do ./$$i; done
23+
24+
debug_run: debug
25+
@for i in $(DBG_EXECS); do ./$$i; done
26+
27+
intrin_dot_sample.exe: intrin_dot_sample.o
28+
$(CC) -O2 $^ -o $@
29+
30+
intrin_double_sample.exe: intrin_double_sample.o
31+
$(CC) -O2 $^ -o $@
32+
33+
intrin_ftz_sample.exe: intrin_ftz_sample.o
34+
$(CC) -O2 $^ -o $@
35+
36+
intrin_dot_sample_dbg.exe: intrin_dot_sample_dbg.o
37+
$(CC) -O0 -g $^ -o $@
38+
39+
intrin_double_sample_dbg.exe: intrin_double_sample_dbg.o
40+
$(CC) -O0 -g $^ -o $@
41+
42+
intrin_ftz_sample_dbg.exe: intrin_ftz_sample_dbg.o
43+
$(CC) -O0 -g $^ -o $@
44+
45+
%.o: src/%.cpp
46+
$(CC) -O2 -c -o $@ $<
47+
48+
%_dbg.o: src/%.cpp
49+
$(CC) -O0 -g -c -o $@ $<
50+
51+
clean:
52+
/bin/rm -f core.* *.o *.exe
Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
1+
# `Intrinsics` Sample
2+
3+
The intrinsics samples show how to use the intrinsics supported by the Intel&reg; C++ Compiler in a variety of applications. The `src` folder contains three .cpp source files, each demonstrating a different use of intrinsics: vector operations, complex number computations, and the FTZ/DAZ flags.
4+
5+
| Optimized for | Description
6+
|:--- |:---
7+
| OS | Linux* Ubuntu* 18.04; macOS* Catalina* or newer
8+
| Hardware | Skylake with GEN9 or newer
9+
| Software | Intel&reg; C++ Compiler 2021.1 or newer
10+
| What you will learn | How to utilize intrinsics supported by the Intel&reg; C++ Compiler
11+
| Time to complete | 15 minutes
12+
13+
14+
## Purpose
15+
16+
Intrinsics are assembly-coded functions that allow you to use C++ function calls and variables in place of assembly instructions. Intrinsics are expanded inline, eliminating function call overhead. While providing the same benefits as using inline assembly, intrinsics improve code readability, assist instruction scheduling, and help when debugging. They provide access to instructions that cannot be generated using the standard constructs of the C and C++ languages, and allow code to leverage performance enhancing features unique to specific processors.
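
As a minimal illustration of the paragraph above, the sketch below adds two groups of four floats with one packed SSE add issued through the `_mm_add_ps` intrinsic. The helper name `add4` is hypothetical and is not part of this sample; only the `_mm_*` calls are real intrinsics declared in `<xmmintrin.h>`.

```cpp
#include <array>
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps

// Hypothetical helper (not from the sample): element-wise sum of four floats.
std::array<float, 4> add4(const std::array<float, 4> &a,
                          const std::array<float, 4> &b) {
  __m128 va = _mm_loadu_ps(a.data());  // load four unaligned floats from a
  __m128 vb = _mm_loadu_ps(b.data());  // load four unaligned floats from b
  __m128 vsum = _mm_add_ps(va, vb);    // one packed add handles all four lanes
  std::array<float, 4> out;
  _mm_storeu_ps(out.data(), vsum);     // write the four sums back to memory
  return out;
}
```

Calling `add4` with `{1, 2, 3, 4}` and `{10, 20, 30, 40}` returns `{11, 22, 33, 44}`. The call reads like ordinary C++, yet the compiler can map `_mm_add_ps` to a single `addps` instruction.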
17+
18+
Further information on intrinsics can be found here: https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics.html#intrinsics_GUID-D70F9A9A-BAE1-4242-963E-C3A12DE296A1
19+
20+
## Key Implementation Details
21+
22+
This sample makes use of intrinsic functions to perform common mathematical operations including:
23+
- Computing a dot product of two vectors
24+
- Computing the product of two complex numbers
25+
The implementations include multiple functions to accomplish these tasks, each one leveraging a different set of intrinsics available to Intel&reg; processors.
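
To make the complex-product case concrete, here is a hedged sketch of one way to multiply two complex numbers with SSE2 intrinsics. It is not the code from `intrin_double_sample`, which is not reproduced in this view; the `Complex` struct and the function names `multiply_scalar` and `multiply_sse2` are assumptions made for illustration. It relies on the identity (a + bi)(c + di) = (ac - bd) + (ad + bc)i.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics: __m128d and the _mm_*_pd family

// Hypothetical type for illustration; the sample's own type may differ.
struct Complex {
  double real;
  double img;
};

// Scalar reference: (a + bi)(c + di) = (ac - bd) + (ad + bc)i
Complex multiply_scalar(Complex x, Complex y) {
  return {x.real * y.real - x.img * y.img, x.real * y.img + x.img * y.real};
}

// SSE2 version; x = a + bi, y = c + di.
Complex multiply_sse2(Complex x, Complex y) {
  __m128d xv = _mm_set_pd(x.img, x.real);         // high = b, low = a
  __m128d x_swapped = _mm_shuffle_pd(xv, xv, 1);  // high = a, low = b
  __m128d t1 = _mm_mul_pd(xv, _mm_set1_pd(y.real));        // high = b*c, low = a*c
  __m128d t2 = _mm_mul_pd(x_swapped, _mm_set1_pd(y.img));  // high = a*d, low = b*d
  __m128d sign = _mm_set_pd(0.0, -0.0);  // sign bit set in the low lane only
  t2 = _mm_xor_pd(t2, sign);             // high = a*d, low = -(b*d)
  __m128d prod = _mm_add_pd(t1, t2);     // high = b*c + a*d, low = a*c - b*d
  double out[2];
  _mm_storeu_pd(out, prod);  // out[0] = real part, out[1] = imaginary part
  return {out[0], out[1]};
}
```

For example, `multiply_sse2({1.0, 2.0}, {3.0, 4.0})` gives -5 + 10i, matching `multiply_scalar`. SSE3 offers `_mm_addsub_pd`, which could replace the sign-flip and add steps; the SSE2-only form is used here so the sketch builds without extra target flags on x86-64.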
26+
27+
28+
## License
29+
30+
This code sample is licensed under the MIT license.
31+
32+
33+
## Building the `Intrinsics` Program
34+
35+
Perform the following steps:
36+
1. Build the program using the following `make` commands.
37+
```
38+
$ make (or "make debug" to compile with the -g flag)
39+
```
40+
41+
2. Run the program:
42+
```
43+
make run (or "make debug_run" to run the debug version)
44+
```
45+
46+
3. Clean the program using:
47+
```
48+
make clean
49+
```
50+
51+
52+
### Application Parameters
53+
54+
These intrinsics samples have relatively few modifiable parameters. However, certain options are available to the user:
55+
56+
1. intrin_dot_sample: Line 35 defines the size of the vectors used in the dot product computation.
57+
58+
2. intrin_double_sample: Lines 244-247 define the values of the two complex numbers used in the computation.
59+
60+
3. intrin_ftz_sample: This sample has no modifiable parameters.
61+
62+
63+
```
64+
Dot Product computed by C: 4324.000000
65+
Dot Product computed by Intel(R) SSE3 intrinsics: 4324.000000
66+
Dot Product computed by Intel(R) AVX2 intrinsics: 4324.000000
67+
Dot Product computed by Intel(R) AVX intrinsics: 4324.000000
68+
Dot Product computed by Intel(R) MMX(TM) intrinsics: 4324
69+
Complex Product(C): 23.00+ -2.00i
70+
Complex Product(Intel(R) AVX2): 23.00+ -2.00i
71+
Complex Product(Intel(R) AVX): 23.00+ -2.00i
72+
Complex Product(Intel(R) SSE3): 23.00+ -2.00i
73+
Complex Product(Intel(R) SSE2): 23.00+ -2.00i
74+
FTZ is set.
75+
DAZ is set.
76+
```
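
The `FTZ is set.` and `DAZ is set.` lines in the output above refer to the flush-to-zero and denormals-are-zero bits of the MXCSR register. The sketch below shows one possible way to set and query those bits with the standard macros from `<xmmintrin.h>` and `<pmmintrin.h>`. The helper names are hypothetical, the code is not taken from `intrin_ftz_sample`, and it assumes an x86-64 build where `float` arithmetic uses SSE instructions.

```cpp
#include <cfloat>       // FLT_MIN
#include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE, _MM_GET_DENORMALS_ZERO_MODE
#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE, _MM_GET_FLUSH_ZERO_MODE

// Hypothetical helpers for illustration; not taken from intrin_ftz_sample.
void enable_ftz_daz() {
  _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);          // denormal results become zero
  _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);  // denormal inputs are read as zero
}

bool ftz_is_set() { return _MM_GET_FLUSH_ZERO_MODE() == _MM_FLUSH_ZERO_ON; }

bool daz_is_set() {
  return _MM_GET_DENORMALS_ZERO_MODE() == _MM_DENORMALS_ZERO_ON;
}

// Halving the smallest normal float yields a denormal, or zero when FTZ is on.
float half_of_smallest_normal() {
  volatile float tiny = FLT_MIN;  // volatile keeps the division at run time
  volatile float half = tiny / 2.0f;
  return half;
}
```

After `enable_ftz_daz()` both query helpers return `true`, and `half_of_smallest_normal()` returns `0.0f` instead of a denormal value. These modes trade strict IEEE 754 behavior for speed on hardware where denormal arithmetic is slow.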
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
1+
Copyright 2020 Intel Corporation
2+
3+
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
4+
5+
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
6+
7+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
8+
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
1+
{
2+
"name": "Intrinsics C++",
3+
"description": "Demonstrates the intrinsic functions of the Intel® C++ Compiler",
4+
"categories": ["Toolkit/Intel® oneAPI HPC Toolkit"],
5+
"os": ["linux", "darwin"],
6+
"builder": ["cmake"],
7+
"languages": [{"cpp":{}}],
8+
"toolchain": ["icc"],
9+
"guid": "ACD0E89E-67CC-4CB4-87AB-B12B84962EAF",
10+
"ciTests": {
11+
"linux": [
12+
{ "id": "standard", "steps": [ "make", "make run", "make clean" ] },
13+
{ "id": "debug", "steps": [ "make debug", "make debug_run", "make clean" ] }
14+
],
15+
"darwin": [
16+
{ "id": "standard", "steps": [ "make", "make run", "make clean" ] },
17+
{ "id": "debug", "steps": [ "make debug", "make debug_run", "make clean" ] }
18+
]
19+
}
20+
}
Lines changed: 236 additions & 0 deletions
@@ -0,0 +1,236 @@
1+
//==============================================================
2+
//
3+
// SAMPLE SOURCE CODE - SUBJECT TO THE TERMS OF SAMPLE CODE LICENSE AGREEMENT,
4+
// http://software.intel.com/en-us/articles/intel-sample-source-code-license-agreement/
5+
//
6+
// Copyright 2016 Intel Corporation
7+
//
8+
// THIS FILE IS PROVIDED "AS IS" WITH NO WARRANTIES, EXPRESS OR IMPLIED,
9+
// INCLUDING BUT NOT LIMITED TO ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS
10+
// FOR A PARTICULAR PURPOSE, NON-INFRINGEMENT OF INTELLECTUAL PROPERTY RIGHTS.
11+
//
12+
// =============================================================
13+
/* [DESCRIPTION]
14+
* This C code sample demonstrates how to use C, Intel(R) MMX(TM),
15+
* Intel(R) Streaming SIMD Extensions 3 (Intel(R) SSE3),
16+
* Intel(R) Advanced Vector Extensions (Intel(R) AVX), and
17+
* Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2)
18+
* intrinsics to calculate the dot product of two vectors.
19+
*
20+
* Do not run the sample on systems using processors that do
21+
* not support Intel(R) MMX(TM) and Intel(R) SSE3; the application
22+
* will fail.
23+
*
24+
* [Output]
25+
* Dot Product computed by C: 4324.000000
26+
* Dot Product computed by Intel(R) SSE3 intrinsics: 4324.000000
27+
* Dot Product computed by Intel(R) AVX2 intrinsics: 4324.000000
28+
* Dot Product computed by Intel(R) AVX intrinsics: 4324.000000
29+
* Dot Product computed by Intel(R) MMX(TM) intrinsics: 4324
30+
*
31+
*/
32+
#include <immintrin.h>
33+
#include <pmmintrin.h>
34+
#include <stdio.h>
35+
#define SIZE 24 // assumes size is a multiple of 8 because
36+
// Intel(R) AVX registers will store 8, 32bit elements.
37+
38+
// Computes dot product using C
39+
float dot_product(float *a, float *b);
40+
// Computes dot product using Intel(R) SSE3 intrinsics
41+
float dot_product_intrin(float *a, float *b);
42+
// Computes dot product using Intel(R) AVX intrinsics
43+
float AVX_dot_product(float *a, float *b);
44+
float AVX2_dot_product(float *a, float *b);
45+
// Computes dot product using Intel(R) MMX(TM) intrinsics
46+
short MMX_dot_product(short *a, short *b);
47+
48+
#define MMX_DOT_PROD_ENABLED (__INTEL_COMPILER || (_MSC_VER && !_WIN64))
49+
50+
int main() {
51+
float x[SIZE], y[SIZE];
52+
short a[SIZE], b[SIZE];
53+
int i;
54+
float product;
55+
short mmx_product;
56+
for (i = 0; i < SIZE; i++) {
57+
x[i] = i;
58+
y[i] = i;
59+
a[i] = i;
60+
b[i] = i;
61+
}
62+
product = dot_product(x, y);
63+
printf("Dot Product computed by C: %f\n", product);
64+
65+
product = dot_product_intrin(x, y);
66+
printf("Dot Product computed by Intel(R) SSE3 intrinsics: %f\n", product);
67+
68+
// The Visual Studio* editor will show the following section as disabled as it
69+
// does not know that __INTEL_COMPILER is defined by the Intel(R) Compiler
70+
#if __INTEL_COMPILER
71+
if (_may_i_use_cpu_feature(_FEATURE_AVX2)) {
72+
product = AVX2_dot_product(x, y);
73+
printf("Dot Product computed by Intel(R) AVX2 intrinsics: %f\n", product);
74+
} else
75+
printf("Your Processor does not support AVX2 intrinsics.\n");
76+
if (_may_i_use_cpu_feature(_FEATURE_AVX)) {
77+
product = AVX_dot_product(x, y);
78+
printf("Dot Product computed by Intel(R) AVX intrinsics: %f\n", product);
79+
} else
80+
printf("Your Processor does not support AVX intrinsics.\n");
81+
#else
82+
printf("Use Intel(R) Compiler to compute with Intel(R) AVX intrinsics\n");
83+
#endif
84+
85+
#if MMX_DOT_PROD_ENABLED
86+
mmx_product = MMX_dot_product(a, b);
87+
_mm_empty();
88+
printf("Dot Product computed by Intel(R) MMX(TM) intrinsics: %d\n",
89+
mmx_product);
90+
91+
#else
92+
printf(
93+
"Use Intel(R) compiler in order to calculate dot product using Intel(R) "
94+
"MMX(TM) intrinsics\n");
95+
#endif
96+
97+
return 0;
98+
}
99+
100+
float dot_product(float *a, float *b) {
101+
int i;
102+
float sum = 0.0f;  // accumulate in float so fractional products are not truncated
103+
for (i = 0; i < SIZE; i++) {
104+
sum += a[i] * b[i];
105+
}
106+
return sum;
107+
}
108+
109+
// The Visual Studio* editor will show the following section as disabled as it
110+
// does not know that __INTEL_COMPILER is defined by the Intel(R) Compiler
111+
#if __INTEL_COMPILER
112+
113+
float AVX2_dot_product(float *a, float *b) {
114+
float total;
115+
int i;
116+
__m256 num1, num2, num3;
117+
__m128 top, bot;
118+
num3 = _mm256_setzero_ps(); // sets sum to zero
119+
for (i = 0; i < SIZE; i += 8) {
120+
num1 = _mm256_loadu_ps(a + i); // loads unaligned array a into num1
121+
// num1= a[7] a[6] a[5] a[4] a[3] a[2] a[1] a[0]
122+
num2 = _mm256_loadu_ps(b + i); // loads unaligned array b into num2
123+
// num2= b[7] b[6] b[5] b[4] b[3] b[2] b[1] b[0]
124+
num3 = _mm256_fmadd_ps(
125+
num1, num2, num3); // performs multiplication and vertical addition
126+
// num3 = a[7]*b[7]+num3[7] a[6]*b[6]+num3[6] a[5]*b[5]+num3[5]
127+
// a[4]*b[4]+num3[4]
128+
// a[3]*b[3]+num3[3] a[2]*b[2]+num3[2] a[1]*b[1]+num3[1]
129+
// a[0]*b[0]+num3[0]
130+
}
131+
num3 = _mm256_hadd_ps(num3, num3); // performs horizontal addition
132+
// For example, if num3 is filled with: 7 6 5 4 3 2 1 0
133+
// then num3 = 13 9 13 9 5 1 5 1
134+
135+
// extracting the __m128 from the __m256 datatype
136+
top = _mm256_extractf128_ps(num3, 1); // top = 13 9 13 9
137+
bot = _mm256_extractf128_ps(num3, 0); // bot = 5 1 5 1
138+
139+
// completing the reduction
140+
top = _mm_add_ps(top, bot); // top = 18 10 18 10
141+
top = _mm_hadd_ps(top, top); // top = 28 28 28 28
142+
143+
_mm_store_ss(&total, top); // Storing the result in total
144+
145+
return total;
146+
}
147+
148+
float AVX_dot_product(float *a, float *b) {
149+
float total;
150+
int i;
151+
__m256 num1, num2, num3, num4;
152+
__m128 top, bot;
153+
num4 = _mm256_setzero_ps(); // sets sum to zero
154+
for (i = 0; i < SIZE; i += 8) {
155+
num1 = _mm256_loadu_ps(a + i); // loads unaligned array a into num1
156+
// num1= a[7] a[6] a[5] a[4] a[3] a[2] a[1] a[0]
157+
num2 = _mm256_loadu_ps(b + i); // loads unaligned array b into num2
158+
// num2= b[7] b[6] b[5] b[4] b[3] b[2] b[1] b[0]
159+
num3 = _mm256_mul_ps(num1, num2); // performs multiplication
160+
// num3 = a[7]*b[7] a[6]*b[6] a[5]*b[5] a[4]*b[4] a[3]*b[3] a[2]*b[2]
161+
// a[1]*b[1] a[0]*b[0]
162+
num4 = _mm256_add_ps(num4, num3); // performs vertical addition
163+
}
164+
num4 = _mm256_hadd_ps(num4, num4); // performs horizontal addition
165+
// For example, if num4 is filled with: 7 6 5 4 3 2 1 0
166+
// then num4 = 13 9 13 9 5 1 5 1
167+
168+
// extracting the __m128 from the __m256 datatype
169+
top = _mm256_extractf128_ps(num4, 1); // top = 13 9 13 9
170+
bot = _mm256_extractf128_ps(num4, 0); // bot = 5 1 5 1
171+
172+
// completing the reduction
173+
top = _mm_add_ps(top, bot); // top = 18 10 18 10
174+
top = _mm_hadd_ps(top, top); // top = 28 28 28 28
175+
176+
_mm_store_ss(&total, top); // Storing the result in total
177+
178+
return total;
179+
}
180+
#endif
181+
182+
float dot_product_intrin(float *a, float *b) {
183+
float total;
184+
int i;
185+
__m128 num1, num2, num3, num4;
186+
__m128 num5;
187+
num4 = _mm_setzero_ps(); // sets sum to zero
188+
for (i = 0; i < SIZE; i += 4) {
189+
num1 = _mm_loadu_ps(
190+
a +
191+
i); // loads unaligned array a into num1 num1= a[3] a[2] a[1] a[0]
192+
num2 = _mm_loadu_ps(
193+
b +
194+
i); // loads unaligned array b into num2 num2= b[3] b[2] b[1] b[0]
195+
num3 = _mm_mul_ps(num1, num2); // performs multiplication num3 =
196+
// a[3]*b[3] a[2]*b[2] a[1]*b[1] a[0]*b[0]
197+
num3 = _mm_hadd_ps(num3, num3); // performs horizontal addition
198+
// num3= a[3]*b[3]+ a[2]*b[2] a[1]*b[1]+a[0]*b[0] a[3]*b[3]+ a[2]*b[2]
199+
// a[1]*b[1]+a[0]*b[0]
200+
num4 = _mm_add_ps(num4, num3); // performs vertical addition
201+
}
202+
203+
num4 = _mm_hadd_ps(num4, num4);
204+
_mm_store_ss(&total, num4);
205+
return total;
206+
}
207+
208+
// Intel(R) MMX(TM) technology cannot handle single precision floats
209+
#if MMX_DOT_PROD_ENABLED
210+
short MMX_dot_product(short *a, short *b) {
211+
int i;
212+
short result, data;
213+
__m64 num3, sum;
214+
__m64 *ptr1, *ptr2;
215+
_m_empty();
216+
sum = _mm_setzero_si64(); // sets sum to zero
217+
for (i = 0; i < SIZE; i += 4) {
218+
ptr1 = (__m64 *)&a[i]; // Converts array a to a pointer of type
219+
//__m64 and stores four elements into
220+
// Intel(R) MMX(TM) registers
221+
ptr2 = (__m64 *)&b[i];
222+
num3 = _m_pmaddwd(*ptr1, *ptr2); // multiplies elements and adds lower
223+
// elements with lower element and
224+
// higher elements with higher
225+
sum = _m_paddd(sum, num3); // _m_pmaddwd yields two 32-bit sums, so add as doublewords
226+
}
227+
228+
data = _m_to_int(sum); // converts __m64 data type to an int
229+
sum = _m_psrlqi(sum, 32); // shifts sum
230+
result = _m_to_int(sum);
231+
result = result + data;
232+
_mm_empty(); // clears the Intel(R) MMX(TM) registers and
233+
// Intel(R) MMX(TM) state.
234+
return result;
235+
}
236+
#endif
