We sincerely thank the anonymous reviewers for their thorough evaluation, insightful comments, and valuable suggestions that helped improve this work.
This research was supported by St. Petersburg State University, grant id 116636233.
\section{Brahma.FSharp}
Brahma.FSharp is a tool that enables GPGPU utilization in .NET applications, allowing developers to write both kernels and supporting code in pure F\#~\cite{fsharp}, a functional-first, multiparadigm programming language for the .NET platform.
This language combines functional programming features (including first-class functions, generics, and static strong typing with automatic type inference) with seamless integration into a mature business application development platform.
Additionally, F\# supports imperative coding patterns that are natural for kernel programming.
The tool's core is an F\#-subset-to-OpenCL-C translator based on \emph{code quotations}~\cite{FSharpQuotations}, a language feature that provides access to an annotated syntax tree of F\# code and allows transforming it during program execution.
This tree can be processed using standard F\# functions: for instance, we transform it to generate OpenCL C kernel code.
In other words, code quotations enable runtime metaprogramming for creating configurable kernels during program execution.
Unlike compile-time metaprogramming, this approach enables dynamic configuration of work-group-size-dependent kernel aspects (such as local buffer sizes) without full program recompilation (see line 9 in Listing~\ref{lst:mXm_kernels}).
Crucially, all operations remain strongly and statically typed, eliminating the need for unsafe code involving strings, pointers, or object manipulation.
From the user's perspective, a compiled quotation (compiled kernel) exposes a type signature that enforces parameter consistency with the original quotation.
In other words, the compiled kernel retains full type information about its arguments, enabling the compiler to verify parameter binding correctness at compile-time without needing additional type annotations.
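To make the mechanism concrete, the following fragment is plain F\# (independent of Brahma.FSharp's API): a quotation written with \verb|<@ ... @>| yields a typed \texttt{Expr}, and the splicing operator \verb|%| composes such trees at run time while preserving types.
\begin{minted}{fsharp}
open Microsoft.FSharp.Quotations

// A typed quotation: <@ ... @> yields an Expr<int -> int>,
// an annotated syntax tree rather than compiled code.
let inc : Expr<int -> int> = <@ fun x -> x + 1 @>

// Quotations compose via the splicing operator %: the tree of `op`
// is inserted into the enclosing tree, and the types must agree.
let twice (op: Expr<int -> int>) : Expr<int -> int> =
    <@ fun x -> (%op) ((%op) x) @>

let incTwice = twice inc // assembled at run time, still fully typed
\end{minted}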
In practice, creating a truly generic \texttt{compile} function that converts high-level code to compiled kernels is a nontrivial challenge.
As a result, many other tools either erase all type information or require manual type specifications.
For example, ILGPU requires manual specification of the compiled kernel type.
Listing~\ref{lst:mXm_kernels} (lines 6--12) shows an example of quoted code (part of a generalized matrix multiplication kernel).
This example demonstrates typed quotation composition: the operations \verb|opAdd| and \verb|opMult|, along with the identity element \verb|zero|, maintain type consistency and can be specified externally at runtime.
Thus, we can create a highly configurable kernel generator and instantiate specific kernels as needed (lines 15--16).
\begin{listing}
\begin{minted}[linenos]{fsharp}
  // ...
\end{minted}
\label{lst:mXm_kernels}
\end{listing}
The translator supports not only the imperative subset of F\# and primitive types, but also F\#-specific features including structs, tuples, discriminated unions, pattern matching, and nested bindings.
Additionally, it implements OpenCL-specific functionality such as atomic operations, memory barriers, and allocation of local/thread-local arrays.
For data transfer and manipulation, Brahma.FSharp provides two key primitives, \verb|ClArray<'t>| and \verb|ClCell<'t>|, which are F\#-array-friendly wrappers for \texttt{ClBuffer}.
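As an illustration only (we do not claim this exact fragment lies within the supported subset), the kind of quoted F\# the translator is described as handling looks as follows, combining a discriminated union, pattern matching, and a nested binding.
\begin{minted}{fsharp}
// Illustrative sketch: quoted F# using a discriminated union,
// pattern matching, and a nested binding. Whether this exact
// fragment translates is an assumption, not a claim about the tool.
type Shape =
    | Circle of radius: float32
    | Rect of w: float32 * h: float32

let areaKernel =
    <@ fun (shapes: Shape[]) (areas: float32[]) (i: int) ->
        let area =
            match shapes.[i] with
            | Circle r -> 3.14159265f * r * r
            | Rect (w, h) -> w * h
        areas.[i] <- area @>
\end{minted}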
The framework implements a standard kernel execution workflow with typed wrappers\footnote{Portability is achieved through configurable \texttt{libopencl} path specification.}.
It is worth noting that F\# is well suited to asynchronous programming and provides a rich set of parallel and asynchronous programming primitives~\cite{FSharpAsync}.
By utilizing \emph{MailboxProcessor}, F\#'s built-in message-passing primitive, to wrap command queues, we achieve GPGPU communication patterns that naturally complement F\#'s asynchronous programming model.
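The following sketch illustrates the agent pattern itself rather than Brahma.FSharp's actual message types or API; the \texttt{Msg} type and the CPU stand-in for kernel execution are hypothetical.
\begin{minted}{fsharp}
// Agent-pattern sketch only; Msg and the kernel stand-in are
// hypothetical and do not reproduce Brahma.FSharp's actual API.
type Msg =
    | RunKernel of input: float32[] * reply: AsyncReplyChannel<float32[]>

let gpuAgent =
    MailboxProcessor.Start(fun inbox ->
        let rec loop () = async {
            let! (RunKernel (input, reply)) = inbox.Receive()
            // A real agent would enqueue the kernel on its OpenCL
            // command queue; here a CPU computation stands in.
            reply.Reply(Array.map (fun x -> x * 2.0f) input)
            return! loop () }
        loop ())

// Callers stay within ordinary F# async workflows:
let doubled =
    gpuAgent.PostAndAsyncReply(fun ch -> RunKernel ([| 1.0f; 2.0f |], ch))
    |> Async.RunSynchronously
\end{minted}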
Brahma.FSharp---a tool for developing cross-platform, GPGPU-accelerated .NET applications---is presented.
We demonstrated application portability by evaluating performance across multiple platforms, including RISC-V with a PowerVR GPGPU and embedded Intel GPUs.
While the work remains in progress, Brahma.FSharp already enables creation of linear algebra kernels sufficiently performant for integration into libraries like Math.NET Numerics, allowing transparent offloading of generic linear algebra operations to GPGPUs.
Such integration is planned for the near future.
%For translator improvements, optimizing data transfer performance between managed and native memory for complex types (e.g., discriminated unions) requires further development.
Although the agent-based communication approach aligns naturally with both OpenCL and F\#, MailboxProcessor may not be optimal for high-frequency CPU-GPU communication.
Alternative solutions like Hopac\footnote{Hopac and MailboxProcessor performance comparison: \url{https://vasily-kirichenko.github.io/fsharpblog/actors}} or lightweight command queue wrappers could provide better performance for latency-critical code.
A significant challenge for future research involves automatic memory management.
Currently, GPGPU-related memory must be freed manually, while .NET otherwise relies on automatic garbage collection.
An open question is how to delegate buffer management to the garbage collector while retaining the ability to switch to manual control where required.
In this section, we present experimental evaluations\footnote{Benchmarking automation infrastructure sources: \url{https://github.com/vkutuev/matrix-benchmark}.} of the Brahma.FSharp platform, demonstrating its core capabilities on conventional (non-HPC) devices\footnote{This configuration is particularly relevant for business applications built on .NET.}.
We assess Brahma.FSharp performance in two representative use cases, detailed in the following sections\footnote{
Respective code is available on GitHub: \url{https://github.com/gsvgit/ImageProcessing/tree/matrix_multiplication}.
}.
\begin{enumerate}
\item \textbf{Image convolution}: showcases multi-GPU utilization through F\#'s MailboxProcessor for efficient task distribution.
\item \textbf{Matrix multiplication}: demonstrates the creation of generic, statically and strongly typed kernels, the use of local and private memory for performance optimization, and portability across different devices.
\end{enumerate}
\subsection{Image Convolution}
We implemented image convolution as a demonstration of multi-GPU utilization.
As established in~\cite{aleaGPUasync}, F\#'s native asynchronous programming model significantly simplifies the creation of complex nonlinear computational workflows combining GPU computations with CPU processing and I/O operations.
Our implementation leverages F\#'s MailboxProcessor, which Brahma.FSharp exposes as the primary interface for GPU communication.
The kernel is simply wrapped as illustrated in Listing~\ref{lst:img_conv}.
For workload distribution, we developed a basic load balancer that dynamically routes each new image to the GPU agent with the fewest pending messages in its input queue.
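Assuming one such MailboxProcessor-based agent per device, the balancing rule itself reduces to a few lines (a sketch; the message type is left generic).
\begin{minted}{fsharp}
// Sketch of the balancing rule: route each new image to the agent
// with the fewest pending messages in its input queue.
let postToLeastLoaded (agents: MailboxProcessor<'msg>[]) (image: 'msg) =
    let target = agents |> Array.minBy (fun a -> a.CurrentQueueLength)
    target.Post image
\end{minted}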
We evaluate this solution on a \textbf{Lenovo} platform with two GPUs: NVIDIA GeForce MX150 and Intel(R) UHD Graphics 620.
We assume all images are loaded into RAM and converted to grayscale.
A typical sequence of filters is applied: three Gaussian blur operations ($5\times5$ kernel) followed by edge detection ($5\times5$ kernel).
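For reference, each filter application is a plain $5\times5$ weighted sum per pixel; the following CPU sketch (our illustration, not the benchmarked kernel) shows the computation the GPU performs with one work item per pixel.
\begin{minted}{fsharp}
// CPU reference sketch of a 5x5 convolution over a row-major
// grayscale image; out-of-bounds neighbours are treated as zero.
let applyFilter (filter: float32[]) (w: int) (h: int) (img: float32[]) =
    let d = 2 // half-size of the 5x5 kernel
    Array.init (w * h) (fun p ->
        let px, py = p % w, p / w
        let mutable acc = 0.0f
        for dy in -d .. d do
            for dx in -d .. d do
                let x, y = px + dx, py + dy
                if x >= 0 && x < w && y >= 0 && y < h then
                    acc <- acc + filter.[(dy + d) * 5 + (dx + d)] * img.[y * w + x]
        acc)
\end{minted}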
420 images (1~GB of data) were processed in 40 seconds with two GPUs, in 64 seconds using the NVIDIA GPU only, and in 97 seconds using the Intel GPU only.
These results demonstrate that even a naive multi-GPU workflow reduces processing time from 64 to 40 seconds, a speedup of roughly $1.6\times$ over the fastest single GPU (NVIDIA) in the system.
\subsection{Matrix Multiplication}
We evaluate a generic kernel parametrized by types and operations (see Listing~\ref{lst:mXm_kernels}), implemented in F\#.
Several basic optimizations, inspired by ``Tutorial: OpenCL SGEMM tuning for Kepler'' by Cedric Nugteren\footnote{``Tutorial: OpenCL SGEMM tuning for Kepler'': \url{https://cnugteren.github.io/tutorial/pages/page1.html}}, were applied.
Namely, we use tiling in local and private memory.
However, the current version supports only square matrices and square tiles.
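To make ``parametrized by types and operations'' concrete, the following CPU sketch (our illustration; the quotation-based GPU kernel in Listing~\ref{lst:mXm_kernels} follows the same parametrization) shows multiplication over an arbitrary semiring.
\begin{minted}{fsharp}
// CPU sketch of semiring-generic matrix multiplication:
// opAdd, opMult and zero play the roles of (+), (*) and 0.
let mXm (opAdd: 'a -> 'a -> 'a) (opMult: 'a -> 'a -> 'a) (zero: 'a)
        (a: 'a[,]) (b: 'a[,]) =
    let n = Array2D.length1 a
    Array2D.init n n (fun i j ->
        let mutable acc = zero
        for k in 0 .. n - 1 do
            acc <- opAdd acc (opMult a.[i, k] b.[k, j])
        acc)

// The standard arithmetic semiring used in the benchmarks:
let mXmFloat32 = mXm (+) (*) 0.0f
\end{minted}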
We selected two competitors for evaluation.
The first is CLBlast\footnote{CLBlast source code: \url{https://github.com/CNugteren/CLBlast}}~\cite{10.1145/3204919.3204924}, a highly optimized OpenCL-based BLAS implementation tuned even for low-power mobile GPUs.
The second is OpenBLAS\footnote{OpenBLAS source code: \url{https://github.com/OpenMathLib/OpenBLAS}}, a highly optimized CPU-based BLAS implementation.
Additionally, we executed OpenCL-based solutions on CPUs using POCL~\cite{Jskelinen2014}.
All competitors were compiled and run with their default configurations.
We evaluate all competitors on several platforms listed below.
We generate random square matrices with elements of type \texttt{float32} and use the standard arithmetic semiring because our competitors do not provide generic kernels.
Time is measured as an average of 10 runs.
We measure the execution time of the client function, so it includes data transfer.
Results of the evaluation are represented in Figure~\ref{fig:mxm_perf}, where we show both time and relative speedup calculated as the ratio of corresponding average execution times.
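To spell out the definition we use (the notation $\overline{t}$ and $S$ is ours): with $\overline{t}_a(n)$ the execution time of solution $a$ on $n \times n$ matrices averaged over the 10 runs, the speedup of solution $a$ over solution $b$ is
\[
S_{a/b}(n) = \frac{\overline{t}_b(n)}{\overline{t}_a(n)}.
\]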
First, we show that Brahma.FSharp allows one to create portable solutions.
While our kernel is not as optimized as the kernel from CLBlast, relative speedup analysis shows that there is room for further tuning: in many cases, the performance gap decreases as data size increases (\textbf{Lenovo}, especially the Intel GPU; \textbf{Zen}).
However, in some cases the behavior is more complex: for \textbf{MILK-V}, our solution on CPU using POCL demonstrates better performance than CLBlast, but on the respective GPU the performance gap slightly increases with data size increase.
This behavior can be explained by differences in kernel tuning rather than by a fundamental problem of Brahma.FSharp itself.
Thus, while it is unlikely that the .NET overhead can be fully hidden, it appears possible to reduce it to a level comparable with competitors on large matrices.
Following the methodology proposed in~\cite{10.1145/3204919.3204924}, we plan to support additional kernel parameters enabling more precise tuning of performance-critical aspects.