\section{Evaluation}

In this section we present experiments\footnote{Related sources: \url{!!!}} with the Brahma.FSharp platform, aimed at demonstrating its main features.

We evaluated Brahma.FSharp on the two cases listed below, each described in its respective subsection.
\begin{enumerate}
\item Image convolution, which demonstrates asynchronous composition of computations.
\item Matrix multiplication, which demonstrates generic kernels and support for local and private memory, and is evaluated on different devices.
\end{enumerate}

The experiments were run on devices of several vendors:
\begin{itemize}
  \item Intel
  \item NVIDIA
  \item Imagination Technologies PowerVR (RISC-V)
  \item Qualcomm and ARM Mali (ARM)
\end{itemize}

\subsection{Image Convolution}

This case is intensive in both data reading and writing.
F\# MailboxProcessor is used to compose data reading, data processing on the GPGPU, and data processing on the CPU.
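The idea of such a composition can be illustrated by a minimal sketch; note that \texttt{readChunk}, \texttt{gpuConvolve}, \texttt{saveResult}, and \texttt{imagePaths} below are hypothetical placeholders, not the actual Brahma.FSharp API.

\begin{listing}[h]
  \begin{minted}[linenos]{fsharp}
// Hypothetical sketch: readChunk, gpuConvolve, and saveResult
// are placeholders, not the actual Brahma.FSharp API.
let gpuWorker =
    MailboxProcessor.Start(fun inbox ->
        let rec loop () =
            async {
                let! image = inbox.Receive()
                let result = gpuConvolve image // offload to GPGPU
                do! saveResult result          // post-process on CPU
                return! loop ()
            }
        loop ())

// The reader feeds chunks to the worker without waiting for the GPGPU
for path in imagePaths do
    gpuWorker.Post(readChunk path)
  \end{minted}
  \caption{A sketch of pipeline composition with MailboxProcessor}
  \label{lst:mbox_sketch}
\end{listing}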

Measured performance for this case is reported below with plots and tables.

A similar asynchronous composition is discussed in~\cite{aleaGPUasync}.

The convolution is applied to gray-scale images.

The proposed composition also allows one to utilize multiple GPUs.
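A possible multi-GPU configuration can be sketched as follows; this is a hypothetical fragment in which \texttt{devices}, \texttt{processOn}, and \texttt{imageChunks} are assumed to be defined elsewhere.

\begin{listing}[h]
  \begin{minted}[linenos]{fsharp}
// Hypothetical sketch: one MailboxProcessor per device,
// chunks are dispatched round-robin
let workers =
    devices
    |> List.map (fun device -> MailboxProcessor.Start(processOn device))

imageChunks
|> Seq.iteri (fun i chunk ->
    workers.[i % workers.Length].Post chunk)
  \end{minted}
  \caption{A sketch of a multi-GPU configuration}
  \label{lst:multi_gpu_sketch}
\end{listing}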

\subsection{Matrix Multiplication}
|
45 | | -\begin{itemize} |
46 | | -\item OS: Microsoft Windows 8.1 Pro |
47 | | -\item System Type: x64-based PC |
48 | | -\item Processor: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz, 3601 Mhz, 4 Core(s), 8 Logical Processor(s) |
49 | | -\item RAM: 16.0 GB |
50 | | -\item HDD for test: |
51 | | -\begin{itemize} |
52 | | - \item Model: ST3250410AS |
53 | | - \item Size: 232.88 GB |
54 | | - \item 7200 rpm |
55 | | -\end{itemize} |
56 | | - |
57 | | -\item SSD for test |
58 | | -\begin{itemize} |
59 | | - \item Model: INTEL SSDSC2BW240A4 |
60 | | - \item Size: 223.57 GB |
61 | | - \item Max read speed: 540 Mb/sec |
62 | | -\end{itemize} |
63 | | - |
64 | | -\item GPGPU: |
65 | | -\begin{itemize} |
66 | | - \item NVIDIA GeForce GTX 560 Ti |
67 | | - \item CUDA Cores: 384 |
68 | | - \item Core clock: 822 MHz |
69 | | - \item Shader clock: 1645 MHz |
70 | | - \item Memory data rate: 4008 MHz |
71 | | - \item Memory interface: 256-bit |
72 | | - \item Memory bandwidth: 128.26 GB/s |
73 | | - \item Total available graphics memory: 4095 MB |
74 | | - \item Dedicated video memory: 2048 MB GDDR5 |
75 | | - \item Shared system memory: 2047 MB |
76 | | -\end{itemize} |
77 | | -\end{itemize} |
78 | | - |
79 | | -Tables below present results of tests. |
80 | | -``buffers for data'' --- a number of arrays to fill by disc reader for each MailboxProcessor which communicate with GPGPU. |
81 | | -``threads'' --- a number of MailboxProcessors which communicate with GPGPU. |
82 | | -In current configuration we have only one GPGU, so all MailboxProcessors use it. |
83 | | -For multi-GPGPU systems we can configure k MailboxProcessors for each GPGPU. |
84 | | - |
85 | | -In each cell --- total time and GPGPU loading graph. |
86 | | - |
87 | | -\begin{table*}[ht] |
88 | | -\caption{WEWEW} |
89 | | -\label{tbl:eval1} |
90 | | -\begin{center} |
91 | | - \begin{tabular}{ l | c | r } |
92 | | - \hline |
93 | | - 1 & 2 & 3 \\ \hline |
94 | | - 4 & 5 & 6 \\ \hline |
95 | | - 7 & 8 & 9 \\ |
96 | | - \hline |
97 | | - \end{tabular} |
98 | | -\end{center} |
99 | | -\end{table*} |
100 | | - |
101 | | -Conclusion: |
102 | | -Data reading bufferization can sufficiently increase performance. |
103 | | -Especially for HDD, where speed of reading is low. |
104 | | -For SSD processing with multi-GPGPU systems may be useful. |
105 | | -Data reading is not so critical as for HDD and more than one GPGPU can be fully loaded by using flexible MailboxProcessors configuration. |
106 | | -Configuration with two MailboxProcessors and two buffers for each of them can fully load one GPGPU. |
| 35 | +Classical task for GPGPU. |
107 | 36 |

We implemented several optimized versions of the kernel, including generic kernels parametrized by types and operations; code examples are provided below.

The sequence of applied optimizations is inspired by !!!\footnote{\url{!!!}}.
We applied not all of them, but those related to memory usage.
Square matrices are used in the evaluation.

To demonstrate flexibility, kernels are parametrized by operations and the identity element.
For example, one can get an unsafe min-plus multiplication over matrices of \texttt{'e} by using the maximal value of the respective type as the identity, or a more accurate min-plus version using option types.

\begin{listing}[h]
  \begin{minted}[linenos]{fsharp}
let mXmKernel
    (opAdd: Quotations.Expr<'a -> 'b -> 'a>)
    (opMult: Quotations.Expr<'e -> 'f -> 'b>)
    (zero: Quotations.Expr<'a>) ... (* other parameters *) =
    ... // Supplementary code
    let kernel = <@ fun range2d m1 m2 res -> // Quoted code
        ...
        let mutable acc = %zero // Embedded identity value
        let lBuf = localArray lws // Captured from context
        ...
        acc <- (%opAdd) acc ((%opMult) x y) // Embedded operations
        ... @>
    ... // Supplementary code

let intArithmeticKernel = mXmKernel <@ (+) @> <@ ( * ) @> <@ 0 @>
let intMinPlusKernel =
    mXmKernel <@ min @> <@ (+) @> <@ Int32.MaxValue @>
  \end{minted}
  \caption{A generic matrix multiplication kernel and its instantiations}
  \label{lst:mXm_kernels}
\end{listing}

Performance results for this case are reported with plots and tables.
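The option-based min-plus instantiation can be sketched as follows, assuming the \texttt{mXmKernel} from Listing~\ref{lst:mXm_kernels}; here \texttt{None} plays the role of infinity, which avoids the overflow of adding to \texttt{Int32.MaxValue}.

\begin{listing}[h]
  \begin{minted}[linenos]{fsharp}
// min on options: None represents "no value" (infinity)
let optMin =
    <@ fun a b ->
        match a, b with
        | Some x, Some y -> Some(min x y)
        | Some x, None | None, Some x -> Some x
        | None, None -> None @>

// + on options: None is absorbing
let optPlus =
    <@ fun a b ->
        match a, b with
        | Some x, Some y -> Some(x + y)
        | _ -> None @>

let optMinPlusKernel = mXmKernel optMin optPlus <@ None @>
  \end{minted}
  \caption{A sketch of the option-based min-plus instantiation}
  \label{lst:opt_min_plus}
\end{listing}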