
Commit 2deb9f7

committed
[WIP] Brahma.FSharp.
1 parent cef3485 commit 2deb9f7


8 files changed (+247, -392 lines)

Lines changed: 22 additions & 41 deletions
@@ -1,43 +1,24 @@
11
\section{Brahma.FSharp}
22

3-
In this section we present our platform for GPGPU programming in F\#.
4-
5-
Research project: SPbU and JetBrains Research.
6-
7-
Blah-Blah-Blah!!!!
8-
9-
\subsection{Architecture}
10-
11-
Based on F\# code quotation to OpenCL translator.
12-
13-
GPGPU driver is OpenCL.NET~\footnote{OpenCL.NET --- low level .NET bindings for OpenCL. Project site [visited: 20.06.2017]: \url{https://openclnet.codeplex.com/}.}
14-
15-
Picture.
16-
17-
Workflow:
18-
19-
Detailed description of some blocks are provided below.
20-
21-
\subsection{Translator}
22-
23-
Subset of F\# to openCL.
24-
More details are available here: \url{wwww}.
25-
Classical techniques: var scopes,
26-
27-
Strongly typed: exactly same signature, but logic is changed.
28-
29-
Structs, tuples, etc
30-
31-
\subsection{OpenCL specific operations}
32-
33-
Atomic functions.
34-
For kernal code only.
35-
36-
Memory transferring.
37-
38-
\subsection{OpenCL type provider}
39-
40-
It is necessary to provide mechanism for existing kernels reusing.
41-
OpenCL kernels destribution --- source code
42-
43-
Create strongly typed functions from existing OpenCL kernels code.
3+
Brahma.FSharp is a tool that allows one to utilize GPGPUs in .NET applications and to write kernels and all supplementary code in pure F\#~\cite{fsharp}, a functional-first, multiparadigm programming language for the .NET platform.
4+
This language combines functional programming (first-class functions, generics, strong static typing with automatic type inference) with transparent integration into a mature platform for business application development.
5+
At the same time, F\# provides the ability to write imperative code, which is natural for kernel programming.
6+
7+
The core of the tool is a translator from a subset of F\# to OpenCL C. It is based on \emph{code quotations}~\cite{FSharpQuotations}, a mechanism that provides access to the annotated syntax tree of F\# code and allows one to transform it during program execution.
8+
This tree can be transformed using regular F\# functions: for example, it can be translated to another language, which is what we do to generate OpenCL C code for kernels.
9+
In other words, code quotations are a run-time metaprogramming feature that allows us to create kernels configurable at run time.
10+
For example, in contrast to compile-time metaprogramming, it is possible to configure work-group-size-dependent parts of a kernel (e.g. the local buffer size) without recompiling the whole program (see line 9 of listing~\ref{lst:mXm_kernels}).
11+
The main feature is that everything is strongly and statically typed: there is no unsafe code that uses strings, pointers, objects, etc.
12+
On the user side, a compiled quotation (a compiled kernel) has a signature that requires parameters of types that agree with the initial quotation.
13+
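To make the mechanism concrete, the following is a minimal plain-F\# sketch of typed quotations and splicing; it uses only standard F\# quotation features and does not involve the Brahma.FSharp translator itself.

\begin{minted}{fsharp}
open FSharp.Quotations

// A quotation captures the typed syntax tree of the wrapped code.
let inc : Expr<int -> int> = <@ fun x -> x + 1 @>

// Quotations compose: %e splices a typed quotation into another one,
// so code trees can be assembled from parts at run time.
let twice (f: Expr<int -> int>) : Expr<int -> int> =
    <@ fun x -> (%f) ((%f) x) @>

let incTwice = twice inc // tree for: fun x -> (x + 1) + 1
\end{minted}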
14+
An example of quoted code (actually, part of the generalized mXm kernel) is presented in listing~\ref{lst:mXm_kernels} (lines 6--12).
15+
This code also demonstrates typed composition of quotations: the operations \verb|opAdd| and \verb|opMult| and the identity element \verb|zero| have agreeing types and can be specified outside the kernel at run time.
16+
Thus, we can write a highly configurable kernel generator and instantiate specific kernels later, as shown in lines 15--16.
17+
18+
The translator supports not only an imperative subset of F\# and primitive types, but also F\#-specific features such as structs, tuples, discriminated unions, pattern matching, and nested bindings.
19+
It also supports OpenCL-specific features such as atomic functions, barriers, and allocation of local and private (thread-local) arrays.
20+
For data transfer and manipulation, Brahma.FSharp provides dedicated memory primitives (\verb|ClArray<'t>| and \verb|ClCell<'t>|), which are F\#-array-friendly wrappers around \texttt{ClBuffer}.
21+
22+
Brahma.FSharp provides a typical workflow to run kernels and implements the respective typed wrappers for it\footnote{The path to \texttt{libopencl} is configurable, which makes the solution portable.}.
23+
It is worth noting that F\# is friendly to asynchronous programming and provides a rich set of parallel and asynchronous programming primitives~\cite{FSharpAsync}.
24+
Using a \emph{mailbox processor}, an F\#-native message-passing primitive, to wrap the command queue allows us to make communication with the GPGPU friendly to asynchronous programming in F\#.
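As a minimal sketch of this agent pattern (plain F\#; the message type and the wrapped work function are hypothetical placeholders, not the actual Brahma.FSharp API), an agent that owns a queue and serves requests asynchronously may look as follows.

\begin{minted}{fsharp}
// Hypothetical message type: a unit of work plus a typed reply channel.
type Msg = Run of work: (unit -> int) * reply: AsyncReplyChannel<int>

// The agent owns the (here simulated) command queue and serializes
// all requests to it.
let agent = MailboxProcessor.Start(fun inbox ->
    let rec loop () = async {
        let! (Run (work, reply)) = inbox.Receive()
        reply.Reply(work ()) // executed sequentially by the agent
        return! loop () }
    loop ())

// Client code communicates with the agent asynchronously:
async {
    let! r = agent.PostAndAsyncReply(fun ch -> Run ((fun () -> 42), ch))
    printfn "%d" r
} |> Async.RunSynchronously
\end{minted}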
Lines changed: 12 additions & 7 deletions
@@ -1,15 +1,20 @@
11
\section{Conclusion and Future Work}
22

3-
Platform presented.
3+
We presented Brahma.FSharp --- a tool to create cross-platform GPGPU-enabled .NET applications.
4+
We demonstrated the portability of applications by evaluating them on a set of platforms including RISC-V with a PowerVR GPGPU and ARM with a Mali GPGPU.
45

5-
Education. Metaprogramming, translators development, GPGPU programming, etc.
6+
While the work is still in progress, Brahma.FSharp already allows one to create linear-algebra-related kernels performant enough to be integrated into libraries like Math.NET Numerics to transparently offload generic linear algebra to the GPGPU.
7+
Such an integration is planned for the near future.
68

7-
Graph parsing.
9+
Also, as part of translator improvements, it is necessary to improve the performance of data transfer between managed and native memory for complex types such as discriminated unions.
810

9-
Geterogenious porgramming generalization. Hopac is better then MBP~\footnote{\url{https://vasily-kirichenko.github.io/fsharpblog/actors}}.
11+
While an agent-based approach to communication is natural for both OpenCL and F\#, the mailbox processor may not be the best choice for it, especially in cases with high-frequency CPU--GPU communication.
12+
It may be better to use more performant libraries like Hopac\footnote{Hopac and mailbox processor performance comparison: \url{https://vasily-kirichenko.github.io/fsharpblog/actors}} or even to provide a lightweight wrapper for direct access to the command queue for latency-critical code.
1013

11-
Research: Automatic memory management.
14+
One nontrivial problem for future research is automatic memory management.
15+
For now, GPGPU-related memory must be freed manually, while .NET provides an automatic garbage collector.
16+
The question is how to offload buffer management to the garbage collector while retaining the ability to switch to manual control if required.
1217

13-
Data to code translation (automata can be translated into code instead of data structures in memory)
18+
%Other technical improvements: IDE support, runtime extensions, etc.
1419

15-
Other technical improvements: IDE support, type provider improvements, new OpenCL standard support, runtime extension, etc.
20+
%Education. Metaprogramming, translators development, GPGPU programming, etc.
Lines changed: 60 additions & 93 deletions
@@ -1,110 +1,77 @@
11
\section{Evaluation}
22

3-
In this section we provide results of some experiments with Brahma.FSharp platform which are aimed to demonstarte its main features.
3+
In this section we present experiments\footnote{Related sources: \url{!!!}} with the Brahma.FSharp platform which are aimed at demonstrating its main features.
44

5-
Sources of Brahma.FSharp is available here: \url{https://github.com/YaccConstructor/Brahma.FSharp}
6-
Binary package available here: \url{https://www.nuget.org/packages/Brahma.FSharp/}
7-
Examples of usage: \url{https://github.com/YaccConstructor/Brahma.FSharp.Examples}
5+
We evaluated Brahma.FSharp in the two cases listed below, each described in the respective section.
6+
\begin{enumerate}
7+
\item The first one is image convolution, which demonstrates asynchronous composition.
8+
\item The second one is matrix multiplication, which demonstrates generics and support for local and private memory, and is evaluated on different devices.
9+
\end{enumerate}
810

9-
\subsection{Matrix multiplication}
10-
11-
Classical task for GPGPU.
12-
13-
Naive, optimized in F\#, optimized via type provider.
14-
15-
Code examples.
16-
17-
And with type providers too
18-
19-
\subsection{Substring matching}
11+
The evaluation was performed on several platforms.
12+
\begin{itemize}
13+
\item Intel
14+
\item NVIDIA
15+
\item ImTech, PowerVR, RISC-V
16+
\item Qualcomm, Mali, ARM
17+
\end{itemize}
2018

21-
Data recovery.
19+
\subsection{Image Convolution}
2220

23-
CPU vs GPGPU.
21+
Reading and writing.
22+
An F\# MailboxProcessor is used to compose data reading, data processing on the GPGPU, and data processing on the CPU.
2423

25-
Algorithm is not important for real data. Data transferring is bottleneck.
24+
Graphics, tables.
2625

27-
\subsection{Substring matching with agents}
2826

29-
Agents forever~\cite{BrahmaStringMatching}~\cite{aleaGPUasync}
27+
~\cite{aleaGPUasync}
3028

31-
Geterogenious, multi-GPGPU platforms
29+
Gray scale.
3230

33-
Substring matchng from previous section.
31+
Multiple GPUs.
3432

35-
Results of performance test of GPGPU calculation using Brahma.FSharp and MailboxProcessor composition are presented.
36-
Problem to solve is substring matching for data carving.
37-
Rabin-Karp algorithm was implemented using Brahma.FSharp for substring matching.
38-
F\# MailboxProcessor used for composing of data reading, data processing on GPGPU, and data processing on CPU.
39-
Library for fast and flexible configuration of MailboxProcessors was created.
40-
Set of templates for search was fixed.
41-
Tests were performed for HDD and SSD storages.
42-
Low level sequential reading was implemented.
43-
First 16.5 Mb was processed.
33+
\subsection{Matrix Multiplication}
4434

45-
\begin{itemize}
46-
\item OS: Microsoft Windows 8.1 Pro
47-
\item System Type: x64-based PC
48-
\item Processor: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz, 3601 Mhz, 4 Core(s), 8 Logical Processor(s)
49-
\item RAM: 16.0 GB
50-
\item HDD for test:
51-
\begin{itemize}
52-
\item Model: ST3250410AS
53-
\item Size: 232.88 GB
54-
\item 7200 rpm
55-
\end{itemize}
56-
57-
\item SSD for test
58-
\begin{itemize}
59-
\item Model: INTEL SSDSC2BW240A4
60-
\item Size: 223.57 GB
61-
\item Max read speed: 540 Mb/sec
62-
\end{itemize}
63-
64-
\item GPGPU:
65-
\begin{itemize}
66-
\item NVIDIA GeForce GTX 560 Ti
67-
\item CUDA Cores: 384
68-
\item Core clock: 822 MHz
69-
\item Shader clock: 1645 MHz
70-
\item Memory data rate: 4008 MHz
71-
\item Memory interface: 256-bit
72-
\item Memory bandwidth: 128.26 GB/s
73-
\item Total available graphics memory: 4095 MB
74-
\item Dedicated video memory: 2048 MB GDDR5
75-
\item Shared system memory: 2047 MB
76-
\end{itemize}
77-
\end{itemize}
78-
79-
Tables below present results of tests.
80-
``buffers for data'' --- a number of arrays to fill by disc reader for each MailboxProcessor which communicate with GPGPU.
81-
``threads'' --- a number of MailboxProcessors which communicate with GPGPU.
82-
In current configuration we have only one GPGU, so all MailboxProcessors use it.
83-
For multi-GPGPU systems we can configure k MailboxProcessors for each GPGPU.
84-
85-
In each cell --- total time and GPGPU loading graph.
86-
87-
\begin{table*}[ht]
88-
\caption{WEWEW}
89-
\label{tbl:eval1}
90-
\begin{center}
91-
\begin{tabular}{ l | c | r }
92-
\hline
93-
1 & 2 & 3 \\ \hline
94-
4 & 5 & 6 \\ \hline
95-
7 & 8 & 9 \\
96-
\hline
97-
\end{tabular}
98-
\end{center}
99-
\end{table*}
100-
101-
Conclusion:
102-
Data reading bufferization can sufficiently increase performance.
103-
Especially for HDD, where speed of reading is low.
104-
For SSD processing with multi-GPGPU systems may be useful.
105-
Data reading is not so critical as for HDD and more than one GPGPU can be fully loaded by using flexible MailboxProcessors configuration.
106-
Configuration with two MailboxProcessors and two buffers for each of them can fully load one GPGPU.
35+
Matrix multiplication is a classical task for GPGPU.
10736

37+
Several optimizations.
10838

39+
Generic kernels parametrized by types and operations.
10940

41+
Code examples.
11042

43+
Sequence of optimizations inspired by !!!\footnote{\url{!!!}}.
44+
Not all optimizations are covered; we focus on memory-related ones.
45+
Square matrices are used.
46+
47+
Flexibility: kernels are parametrized by operations and the identity element.
48+
An unsafe min-plus version uses the maximum value as the identity.
49+
Matrices are generic over the element type \verb|'e|.
50+
51+
A more accurate min-plus version uses options.
52+
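The option-based min-plus instantiation can be sketched as follows; this is a sketch assuming the \verb|mXmKernel| generator from listing~\ref{lst:mXm_kernels}, with \verb|None| playing the role of the identity element.

\begin{minted}{fsharp}
// Option-based min-plus: None is the identity ("infinity"), so no
// unsafe sentinel such as the maximum integer value is needed.
let optAdd : Quotations.Expr<int option -> int option -> int option> =
    <@ fun a b ->
        match a, b with
        | Some x, Some y -> Some (min x y)
        | Some x, None | None, Some x -> Some x
        | None, None -> None @>

let optMult : Quotations.Expr<int option -> int option -> int option> =
    <@ fun a b ->
        match a, b with
        | Some x, Some y -> Some (x + y)
        | _ -> None @>

// Hypothetical instantiation, mirroring lines 15--16 of the listing:
let intOptMinPlusKernel = mXmKernel optAdd optMult <@ None @>
\end{minted}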
53+
\begin{listing}[h]
54+
\begin{minted}[linenos]{fsharp}
55+
let mXmKernel
56+
(opAdd: Quotations.Expr<'a -> 'b -> 'a>)
57+
(opMult: Quotations.Expr<'e -> 'f -> 'b>)
58+
(zero: Quotations.Expr<'a>) ... (* other parameters *) =
59+
... // Supplementary code
60+
let kernel = <@ fun ndRange m1 m2 res -> // Quoted code
61+
...
62+
let acc = %zero // Embedded identity value
63+
let lBuf = localArray lws // captured from context
64+
...
65+
acc <- (%opAdd) acc ((%opMult) x y) // Embedded operations
66+
... @>
67+
... // Supplementary code
68+
69+
let intArithmeticKernel = mXmKernel <@ (+) @> <@ ( * ) @> <@ 0 @>
70+
let intMinPlusKernel =
71+
  mXmKernel <@ (min) @> <@ (+) @> <@ Int32.MaxValue @>
72+
\end{minted}
73+
\caption{An example of a generalized matrix multiplication kernel definition}
74+
\label{lst:mXm_kernels}
75+
\end{listing}
76+
77+
Graphics, tables.
Lines changed: 16 additions & 20 deletions
@@ -1,22 +1,18 @@
11
\section{Introduction}
22

3-
GPGPU is popular technique for....
4-
5-
Tools and languages are low level.
6-
It is ood for high performance, but bad for developers.
7-
8-
OpenCL~\cite{OpenCL}, CUDA~\cite{CUDA} etc.
9-
10-
Complex problems, geterogenious platforms: multicore, multi GPGPU etc.
11-
Special tools, libs required for development simlification.
12-
High level languages and platforms are used for application development.
13-
14-
F\# primitives are helpful for metaprogramming and parallel/asyncronious programming.
15-
16-
General reqirenments: highlevel languge, existing code/dlls/other stuff reusing
17-
18-
Existing solutions, auch as Alea.GPU, FCSL, are not good enough. Why?
19-
Many different attempts for high level platform, such as JVM~\cite{rootbeer, HaskellGPU, jcuda, ScalaGPU, RustGPU}
20-
21-
Brahma.FSharp --- the best platform for GPGPU programming!!!!
22-
Quotations to OpenCL translator with many cool features.
3+
In the last decades, utilization of GPGPUs not only in scientific or dedicated applications, but also in regular business applications, has become more and more popular.
4+
In such cases, it is not peak performance but transparent offloading of computations to an accelerator that comes into focus.
5+
As a result, respective tools for the integration of GPGPUs into such platforms as JVM~\cite{rootbeer,jcuda,ScalaGPU} or .NET~\cite{FSCLPhD,aleaGPUasync} are being developed.
6+
Note that in a real-world application the problem is not only to offload some computations to a GPGPU, but also to orchestrate a heterogeneous asynchronous application that may involve computations on several GPGPUs.
7+
8+
At the same time, the utilization of existing functional languages, and the creation of new ones, for GPGPU programming looks promising due to their safety, flexibility, and ability to use advanced optimization techniques and to create high-level abstractions.
9+
This has led to such projects as Futhark~\cite{10.1145/3140587.3062354}, Lift~\cite{10.5555/3049832.3049841}, AnyDSL~\cite{10.1145/3276489}, and Accelerate~\cite{10.1145/1926354.1926358}.
10+
11+
Nowadays there are very few combinations of a mature business application development platform and a functional programming language.
12+
One of them is the .NET platform with the F\# programming language.
13+
There are several tools, such as Alea.GPU~\cite{aleaGPUasync}, FSCL~\cite{FSCLPhD}, and ILGPU\footnote{ILGPU project web page: \url{https://ilgpu.net/}}, that allow one to integrate GPGPUs into a .NET application without using such unsafe and low-level mechanisms as string-level kernel creation.
14+
While FSCL and Alea.GPU use F\# to create kernels, ILGPU works on IL level that limits ability to use high-level features and nontrivial optimizations.
15+
16+
In this work we propose \textbf{Brahma.FSharp}\footnote{
17+
Sources of Brahma.FSharp: \url{https://github.com/YaccConstructor/Brahma.FSharp}.
18+
} --- a tool for the development of portable GPGPU-enabled .NET applications that provides transparent and safe integration with accelerators --- and demonstrate its portability across a variety of platforms and devices.
154 KB
Binary file not shown.
