
Commit 2deb9f7

committed
[WIP] Brahma.FSharp.
1 parent cef3485 commit 2deb9f7


8 files changed (+247, -392 lines)

Lines changed: 22 additions & 41 deletions
@@ -1,43 +1,24 @@
11
\section{Brahma.FSharp}
22

3-
In this section we present our platform for GPGPU programming in F\#.
4-
5-
Research project: SPbU and JetBrains Research.
6-
7-
Blah-Blah-Blah!!!!
8-
9-
\subsection{Architecture}
10-
11-
Based on F\# code quotation to OpenCL translator.
12-
13-
GPGPU driver is OpenCL.NET~\footnote{OpenCL.NET --- low level .NET bindings for OpenCL. Project site [visited: 20.06.2017]: \url{https://openclnet.codeplex.com/}.}
14-
15-
Picture.
16-
17-
Workflow:
18-
19-
Detailed description of some blocks are provided below.
20-
21-
\subsection{Translator}
22-
23-
Subset of F\# to openCL.
24-
More details are available here: \url{wwww}.
25-
Classical techniques: var scopes,
26-
27-
Strongly typed: exactly same signature, but logic is changed.
28-
29-
Structs, tuples, etc
30-
31-
\subsection{OpenCL specific operations}
32-
33-
Atomic functions.
34-
For kernal code only.
35-
36-
Memory transferring.
37-
38-
\subsection{OpenCL type provider}
39-
40-
It is necessary to provide mechanism for existing kernels reusing.
41-
OpenCL kernels destribution --- source code
42-
43-
Create strongly typed functions from existing OpenCL kernels code.
3+
Brahma.FSharp is a tool that allows one to utilize GPGPUs in .NET applications and to write kernels and all supplementary code in pure F\#~\cite{fsharp}, a functional-first, multiparadigm programming language for the .NET platform.
4+
This language combines functional programming (first-class functions, generics, strong static typing with automatic type inference) with transparent integration into a mature platform for business application development.
5+
At the same time, F\# provides the ability to write imperative code, which is natural for kernel programming.
6+
7+
The core of the tool is a translator from a subset of F\# to OpenCL C. It is based on \emph{code quotations}~\cite{FSharpQuotations}, a mechanism that provides access to the annotated syntax tree of F\# code and allows one to transform it during program execution.
8+
This tree can be transformed using regular F\# functions: for example, it can be translated to another language, which is what we do to generate OpenCL C code for kernels.
9+
In other words, code quotations are a run-time metaprogramming feature that allows us to create kernels configurable at run time.
10+
For example, in contrast to compile-time metaprogramming, it is possible to configure work-group-size-dependent parts of a kernel (e.g. the local buffer size) without recompiling the whole program (see line 9 of listing~\ref{lst:mXm_kernels}).
11+
The main feature is that everything is strongly and statically typed: there is no unsafe code that uses strings, pointers, objects, etc.
12+
On the user side, a compiled quotation (a compiled kernel) has a signature that requires parameters of types that agree with the initial quotation.
13+
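To make the mechanism concrete, the following is a minimal plain-F\# sketch of typed quotations and splicing; it uses only standard F\# quotation features and does not involve the Brahma.FSharp translator itself.

\begin{minted}{fsharp}
open FSharp.Quotations

// A quotation captures the typed syntax tree of the wrapped code.
let inc : Expr<int -> int> = <@ fun x -> x + 1 @>

// Quotations compose: %e splices a typed quotation into another one,
// so code trees can be assembled from parts at run time.
let twice (f: Expr<int -> int>) : Expr<int -> int> =
    <@ fun x -> (%f) ((%f) x) @>

let incTwice = twice inc // tree for: fun x -> (x + 1) + 1
\end{minted}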
14+
An example of quoted code (actually, part of the generalized mXm kernel) is presented in listing~\ref{lst:mXm_kernels} (lines 6--12).
15+
This code also demonstrates typed composition of quotations: the operations \verb|opAdd| and \verb|opMult| and the identity element \verb|zero| have agreeing types and can be specified outside the kernel at run time.
16+
Thus, we can write a highly configurable kernel generator and instantiate specific kernels later, as shown in lines 15--16.
17+
18+
The translator supports not only an imperative subset of F\# and primitive types, but also F\#-specific features such as structs, tuples, discriminated unions, pattern matching, and nested bindings.
19+
It also supports OpenCL-specific features such as atomic functions, barriers, and allocation of local and private (thread-local) arrays.
20+
For data transfer and manipulation, Brahma.FSharp provides dedicated memory primitives (\verb|ClArray<'t>| and \verb|ClCell<'t>|), which are F\#-array-friendly wrappers around \texttt{ClBuffer}.
21+
22+
Brahma.FSharp provides a typical workflow to run kernels and implements the respective typed wrappers for it\footnote{The path to \texttt{libopencl} is configurable, which makes the solution portable.}.
23+
It is worth noting that F\# is friendly to asynchronous programming and provides a rich set of parallel and asynchronous programming primitives~\cite{FSharpAsync}.
24+
Using a \emph{mailbox processor}, an F\#-native message-passing primitive, to wrap the command queue allows us to make communication with the GPGPU friendly to asynchronous programming in F\#.
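As a minimal sketch of this agent pattern (plain F\#; the message type and the wrapped work function are hypothetical placeholders, not the actual Brahma.FSharp API), an agent that owns a queue and serves requests asynchronously may look as follows.

\begin{minted}{fsharp}
// Hypothetical message type: a unit of work plus a typed reply channel.
type Msg = Run of work: (unit -> int) * reply: AsyncReplyChannel<int>

// The agent owns the (here simulated) command queue and serializes
// all requests to it.
let agent = MailboxProcessor.Start(fun inbox ->
    let rec loop () = async {
        let! (Run (work, reply)) = inbox.Receive()
        reply.Reply(work ()) // executed sequentially by the agent
        return! loop () }
    loop ())

// Client code communicates with the agent asynchronously:
async {
    let! r = agent.PostAndAsyncReply(fun ch -> Run ((fun () -> 42), ch))
    printfn "%d" r
} |> Async.RunSynchronously
\end{minted}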
Lines changed: 12 additions & 7 deletions
@@ -1,15 +1,20 @@
11
\section{Conclusion and Future Work}
22

3-
Platform presented.
3+
We presented Brahma.FSharp --- a tool to create cross-platform GPGPU-enabled .NET applications.
4+
We demonstrated the portability of applications by evaluating them on a set of platforms including RISC-V with a PowerVR GPGPU and ARM with a Mali GPGPU.
45

5-
Education. Metaprogramming, translators development, GPGPU programming, etc.
6+
While the work is still in progress, Brahma.FSharp already allows one to create linear-algebra-related kernels performant enough to be integrated into libraries like Math.NET Numerics to transparently offload generic linear algebra to the GPGPU.
7+
Such an integration is planned for the near future.
68

7-
Graph parsing.
9+
Also, as part of translator improvements, it is necessary to improve the performance of data transfer between managed and native memory for complex types such as discriminated unions.
810

9-
Geterogenious porgramming generalization. Hopac is better then MBP~\footnote{\url{https://vasily-kirichenko.github.io/fsharpblog/actors}}.
11+
While an agent-based approach to communication is natural for both OpenCL and F\#, the mailbox processor may not be the best choice for it, especially in cases with high-frequency CPU--GPU communication.
12+
It may be better to use more performant libraries like Hopac\footnote{Hopac and mailbox processor performance comparison: \url{https://vasily-kirichenko.github.io/fsharpblog/actors}} or even to provide a lightweight wrapper for direct access to the command queue for latency-critical code.
1013

11-
Research: Automatic memory management.
14+
One nontrivial problem for future research is automatic memory management.
15+
For now, GPGPU-related memory must be freed manually, while .NET provides an automatic garbage collector.
16+
The question is how to offload buffer management to the garbage collector while retaining the ability to switch to manual control if required.
1217

13-
Data to code translation (automata can be translated into code instead of data structures in memory)
18+
%Other technical improvements: IDE support, runtime extensions, etc.
1419

15-
Other technical improvements: IDE support, type provider improvements, new OpenCL standard support, runtime extension, etc.
20+
%Education. Metaprogramming, translators development, GPGPU programming, etc.
Lines changed: 60 additions & 93 deletions
@@ -1,110 +1,77 @@
11
\section{Evaluation}
22

3-
In this section we provide results of some experiments with Brahma.FSharp platform which are aimed to demonstarte its main features.
3+
In this section we present experiments\footnote{Related sources: \url{!!!}} with the Brahma.FSharp platform which are aimed at demonstrating its main features.
44

5-
Sources of Brahma.FSharp is available here: \url{https://github.com/YaccConstructor/Brahma.FSharp}
6-
Binary package available here: \url{https://www.nuget.org/packages/Brahma.FSharp/}
7-
Examples of usage: \url{https://github.com/YaccConstructor/Brahma.FSharp.Examples}
5+
We evaluated Brahma.FSharp in the two cases listed below, each described in the respective section.
6+
\begin{enumerate}
7+
\item The first one is image convolution, which demonstrates asynchronous composition.
8+
\item The second one is matrix multiplication, which demonstrates generics and support for local and private memory, and is evaluated on different devices.
9+
\end{enumerate}
810

9-
\subsection{Matrix multiplication}
10-
11-
Classical task for GPGPU.
12-
13-
Naive, optimized in F\#, optimized via type provider.
14-
15-
Code examples.
16-
17-
And with type providers too
18-
19-
\subsection{Substring matching}
11+
The evaluation was performed on several platforms.
12+
\begin{itemize}
13+
\item Intel
14+
\item NVIDIA
15+
\item ImTech, PowerVR, RISC-V
16+
\item Qualcomm, Mali, ARM
17+
\end{itemize}
2018

21-
Data recovery.
19+
\subsection{Image Convolution}
2220

23-
CPU vs GPGPU.
21+
Reading and writing.
22+
An F\# MailboxProcessor is used to compose data reading, data processing on the GPGPU, and data processing on the CPU.
2423

25-
Algorithm is not important for real data. Data transferring is bottleneck.
24+
Graphics, tables.
2625

27-
\subsection{Substring matching with agents}
2826

29-
Agents forever~\cite{BrahmaStringMatching}~\cite{aleaGPUasync}
27+
~\cite{aleaGPUasync}
3028

31-
Geterogenious, multi-GPGPU platforms
29+
Gray scale.
3230

33-
Substring matchng from previous section.
31+
Multiple GPUs.
3432

35-
Results of performance test of GPGPU calculation using Brahma.FSharp and MailboxProcessor composition are presented.
36-
Problem to solve is substring matching for data carving.
37-
Rabin-Karp algorithm was implemented using Brahma.FSharp for substring matching.
38-
F\# MailboxProcessor used for composing of data reading, data processing on GPGPU, and data processing on CPU.
39-
Library for fast and flexible configuration of MailboxProcessors was created.
40-
Set of templates for search was fixed.
41-
Tests were performed for HDD and SSD storages.
42-
Low level sequential reading was implemented.
43-
First 16.5 Mb was processed.
33+
\subsection{Matrix Multiplication}
4434

45-
\begin{itemize}
46-
\item OS: Microsoft Windows 8.1 Pro
47-
\item System Type: x64-based PC
48-
\item Processor: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz, 3601 Mhz, 4 Core(s), 8 Logical Processor(s)
49-
\item RAM: 16.0 GB
50-
\item HDD for test:
51-
\begin{itemize}
52-
\item Model: ST3250410AS
53-
\item Size: 232.88 GB
54-
\item 7200 rpm
55-
\end{itemize}
56-
57-
\item SSD for test
58-
\begin{itemize}
59-
\item Model: INTEL SSDSC2BW240A4
60-
\item Size: 223.57 GB
61-
\item Max read speed: 540 Mb/sec
62-
\end{itemize}
63-
64-
\item GPGPU:
65-
\begin{itemize}
66-
\item NVIDIA GeForce GTX 560 Ti
67-
\item CUDA Cores: 384
68-
\item Core clock: 822 MHz
69-
\item Shader clock: 1645 MHz
70-
\item Memory data rate: 4008 MHz
71-
\item Memory interface: 256-bit
72-
\item Memory bandwidth: 128.26 GB/s
73-
\item Total available graphics memory: 4095 MB
74-
\item Dedicated video memory: 2048 MB GDDR5
75-
\item Shared system memory: 2047 MB
76-
\end{itemize}
77-
\end{itemize}
78-
79-
Tables below present results of tests.
80-
``buffers for data'' --- a number of arrays to fill by disc reader for each MailboxProcessor which communicate with GPGPU.
81-
``threads'' --- a number of MailboxProcessors which communicate with GPGPU.
82-
In current configuration we have only one GPGU, so all MailboxProcessors use it.
83-
For multi-GPGPU systems we can configure k MailboxProcessors for each GPGPU.
84-
85-
In each cell --- total time and GPGPU loading graph.
86-
87-
\begin{table*}[ht]
88-
\caption{WEWEW}
89-
\label{tbl:eval1}
90-
\begin{center}
91-
\begin{tabular}{ l | c | r }
92-
\hline
93-
1 & 2 & 3 \\ \hline
94-
4 & 5 & 6 \\ \hline
95-
7 & 8 & 9 \\
96-
\hline
97-
\end{tabular}
98-
\end{center}
99-
\end{table*}
100-
101-
Conclusion:
102-
Data reading bufferization can sufficiently increase performance.
103-
Especially for HDD, where speed of reading is low.
104-
For SSD processing with multi-GPGPU systems may be useful.
105-
Data reading is not so critical as for HDD and more than one GPGPU can be fully loaded by using flexible MailboxProcessors configuration.
106-
Configuration with two MailboxProcessors and two buffers for each of them can fully load one GPGPU.
35+
Matrix multiplication is a classical task for GPGPU.
10736

37+
Several optimizations.
10838

39+
Generic kernels parametrized by types and operations.
10940

41+
Code examples.
11042

43+
Sequence of optimizations inspired by !!!\footnote{\url{!!!}}.
44+
Not all optimizations are covered; we focus on memory-related ones.
45+
Square matrices are used.
46+
47+
Flexibility: kernels are parametrized by operations and the identity element.
48+
An unsafe min-plus version uses the maximum value as the identity.
49+
Matrices are generic over the element type \verb|'e|.
50+
51+
A more accurate min-plus version uses options.
52+
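The option-based min-plus instantiation can be sketched as follows; this is a sketch assuming the \verb|mXmKernel| generator from listing~\ref{lst:mXm_kernels}, with \verb|None| playing the role of the identity element.

\begin{minted}{fsharp}
// Option-based min-plus: None is the identity ("infinity"), so no
// unsafe sentinel such as the maximum integer value is needed.
let optAdd : Quotations.Expr<int option -> int option -> int option> =
    <@ fun a b ->
        match a, b with
        | Some x, Some y -> Some (min x y)
        | Some x, None | None, Some x -> Some x
        | None, None -> None @>

let optMult : Quotations.Expr<int option -> int option -> int option> =
    <@ fun a b ->
        match a, b with
        | Some x, Some y -> Some (x + y)
        | _ -> None @>

// Hypothetical instantiation, mirroring lines 15--16 of the listing:
let intOptMinPlusKernel = mXmKernel optAdd optMult <@ None @>
\end{minted}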
53+
\begin{listing}[h]
54+
\begin{minted}[linenos]{fsharp}
55+
let mXmKernel
56+
(opAdd: Quotations.Expr<'a -> 'b -> 'a>)
57+
(opMult: Quotations.Expr<'e -> 'f -> 'b>)
58+
(zero: Quotations.Expr<'a>) ... (* other parameters *) =
59+
... // Supplementary code
60+
let kernel = <@ fun ndRange m1 m2 res -> // Quoted code
61+
...
62+
let acc = %zero // Embedded identity value
63+
let lBuf = localArray lws // captured from context
64+
...
65+
acc <- (%opAdd) acc ((%opMult) x y) // Embedded operations
66+
... @>
67+
... // Supplementary code
68+
69+
let intArithmeticKernel = mXmKernel <@ (+) @> <@ ( * ) @> <@ 0 @>
70+
let intMinPlusKernel =
71+
  mXmKernel <@ (min) @> <@ (+) @> <@ Int32.MaxValue @>
72+
\end{minted}
73+
\caption{An example of a generalized matrix multiplication kernel definition}
74+
\label{lst:mXm_kernels}
75+
\end{listing}
76+
77+
Graphics, tables.
Lines changed: 16 additions & 20 deletions
@@ -1,22 +1,18 @@
11
\section{Introduction}
22

3-
GPGPU is popular technique for....
4-
5-
Tools and languages are low level.
6-
It is ood for high performance, but bad for developers.
7-
8-
OpenCL~\cite{OpenCL}, CUDA~\cite{CUDA} etc.
9-
10-
Complex problems, geterogenious platforms: multicore, multi GPGPU etc.
11-
Special tools, libs required for development simlification.
12-
High level languages and platforms are used for application development.
13-
14-
F\# primitives are helpful for metaprogramming and parallel/asyncronious programming.
15-
16-
General reqirenments: highlevel languge, existing code/dlls/other stuff reusing
17-
18-
Existing solutions, auch as Alea.GPU, FCSL, are not good enough. Why?
19-
Many different attempts for high level platform, such as JVM~\cite{rootbeer, HaskellGPU, jcuda, ScalaGPU, RustGPU}
20-
21-
Brahma.FSharp --- the best platform for GPGPU programming!!!!
22-
Quotations to OpenCL translator with many cool features.
3+
In the last decades, utilization of GPGPUs not only in scientific or dedicated applications, but also in regular business applications, has become more and more popular.
4+
In such cases, it is not peak performance but transparent offloading of computations to an accelerator that comes into focus.
5+
As a result, respective tools for the integration of GPGPUs into such platforms as JVM~\cite{rootbeer,jcuda,ScalaGPU} or .NET~\cite{FSCLPhD,aleaGPUasync} are being developed.
6+
Note that in a real-world application the problem is not only to offload some computations to a GPGPU, but also to orchestrate a heterogeneous asynchronous application that may involve computations on several GPGPUs.
7+
8+
At the same time, the utilization of existing functional languages, and the creation of new ones, for GPGPU programming looks promising due to their safety, flexibility, and ability to use advanced optimization techniques and to create high-level abstractions.
9+
This has led to such projects as Futhark~\cite{10.1145/3140587.3062354}, Lift~\cite{10.5555/3049832.3049841}, AnyDSL~\cite{10.1145/3276489}, and Accelerate~\cite{10.1145/1926354.1926358}.
10+
11+
Nowadays there are very few combinations of a mature business application development platform and a functional programming language.
12+
One of them is the .NET platform with the F\# programming language.
13+
There are several tools, such as Alea.GPU~\cite{aleaGPUasync}, FSCL~\cite{FSCLPhD}, and ILGPU\footnote{ILGPU project web page: \url{https://ilgpu.net/}}, that allow one to integrate GPGPUs into a .NET application without using such unsafe and low-level mechanisms as string-level kernel creation.
14+
While FSCL and Alea.GPU use F\# to create kernels, ILGPU works on IL level that limits ability to use high-level features and nontrivial optimizations.
15+
16+
In this work we propose \textbf{Brahma.FSharp}\footnote{
17+
Sources of Brahma.FSharp: \url{https://github.com/YaccConstructor/Brahma.FSharp}.
18+
} --- a tool for the development of portable GPGPU-enabled .NET applications that provides transparent and safe integration with accelerators --- and demonstrate its portability across a variety of platforms and devices.
154 KB
Binary file not shown.
