iccsa-21-guile

Functional Programming Interface for Parallel and Distributed Computing
git clone https://git.igankevich.com/iccsa-21-guile.git

commit 8e4bb0f0596ce1f5c49b38500cd8c48d2ec2bd01
parent e63b58b7c43577db938051f053f45d8e507088cb
Author: Ivan Gankevich <i.gankevich@spbu.ru>
Date:   Mon, 14 Jun 2021 14:24:13 +0300

camera-ready version

Diffstat:
Makefile | 3+--
copyright.txt | 2++
main.tex | 187++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------------
3 files changed, 135 insertions(+), 57 deletions(-)

diff --git a/Makefile b/Makefile
@@ -40,8 +40,7 @@ build/gnuplot/%.svg: gnuplot/%.gnuplot
 build/%.eps: build/%.svg
 	inkscape -z --export-eps=$@ $<
 
-#build/main.zip: build/gnuplot/*.eps main.tex
-build/main.zip: main.tex
+build/main.zip: main.tex main.bib llncs.cls splncs04.bst build/gnuplot/results.eps
 	@mkdir -p build
 	zip --filesync build/main.zip $^
 
diff --git a/copyright.txt b/copyright.txt
@@ -4,3 +4,5 @@ Ivan Petriakov, Ivan Gankevich
 There are many programming frameworks and languages for parallel and distributed computing which are successful both in industry and academia, however, all of them are isolated and self-contained. We believe that the main reason that there is no common denominator between them is that there is no intermediate language for distributed computations. For sequential computations we have bytecode that is used as an intermediate, portable and universal representation of a programme written in any language, but one important feature, that bytecode lacks, is an ability to describe process communication. If we add this feature, we get low-level language on top of which all the frameworks and languages for parallel and distributed computations can be built. In this paper we explore how such intermediate language can be made, how it can reduce programming effort and how it may simplify internal structure of existing frameworks. We also demonstrate how high-level interface can be build for functional language that completely hides all the difficulties that a programmer encounters when he or she works with distributed systems. Keywords: API, intermediate language, C++, Guile.
+
+Ivan Gankevich, Saint Petersburg State University, 13B Universitetskaya Emb., St Petersburg 199034, Russia, i.gankevich@spbu.ru
diff --git a/main.tex b/main.tex
@@ -9,6 +9,7 @@
 \usepackage{listings}
 \usepackage{tikz}
 \usepackage{textcomp}
+\usepackage{textcomp}
 
 % https://github.com/stuhlmueller/scheme-listings/blob/master/lstlang0.sty
 \lstdefinelanguage{scheme}{
@@ -40,6 +41,17 @@
   showstringspaces=false
 }
 
+\lstdefinelanguage{cpp}{
+  morekeywords=[1]{class,struct,enum,public,private,protected,virtual,override,const,
+    void,int,new,delete},
+  morecomment=[l]{//},
+  basicstyle=\small\rmfamily,
+  keywordstyle={\bf\color[HTML]{4081ec}},
+  commentstyle=\color[rgb]{0.33,0.33,0.33},
+  stringstyle={\color[HTML]{00a000}},
+  escapeinside={LATEX}{END},
+}
+
 \begin{document}
 
 \title{Functional programming interface for parallel and distributed computing%
@@ -66,22 +78,22 @@
 There are many programming frameworks and languages for parallel and
 distributed computing which are successful both in industry and academia,
 however, all of them are isolated and self-contained. We believe that the main
 reason that there is no common denominator between them is that there is no
-intermediate language for distributed computations. For sequential
+intermediate representation for distributed computations. For sequential
 computations we have bytecode that is used as an intermediate, portable and
-universal representation of a programme written in any language, but one
-important feature, that bytecode lacks, is an ability to describe process
-communication. If we add this feature, we get low-level language on top of
+universal representation of a programme written in any language, but bytecode
+lacks an ability to describe process
+communication. If we add this feature, we get low-level representation on top of
 which all the frameworks and languages for parallel and distributed
 computations can be built.
 In this paper we explore how such intermediate
-language can be made, how it can reduce programming effort and how it may
+representation can be made, how it can reduce programming effort and how it may
 simplify internal structure of existing frameworks. We also demonstrate how
-high-level interface can be build for functional language that completely hides
+high-level interface can be build for a functional language that completely hides
 all the difficulties that a programmer encounters when he or she works with
 distributed systems.
 
 \keywords{%
 \and API
-\and intermediate language
+\and intermediate representation
 \and C++
 \and Guile.
 }
@@ -95,7 +107,7 @@ computing~\cite{spark2016,fault-tolerant-distributed-haskell,wilde2011swift,fett
 which are successful both in industry and academia, however, all of them are
 isolated and self-contained. We believe that the main reason that there is no
 common denominator between these frameworks and languages is that there is no
-protocol or low-level language for distributed computations. For sequential
+common protocol or low-level representation for distributed computations. For sequential
 computations we have bytecode (e.g.~LLVM~\cite{llvm}, Java bytecode, Guile
 bytecode) that is used as an intermediate, portable and universal
 representation of a programme written in any language; also we have assembler
@@ -104,9 +116,9 @@ important feature, that bytecode and assembler lack, is an ability to
 communicate between parallel processes. This communication is the common
 denominator on top of which all the frameworks and languages for parallel and
 distributed computations can be built, however, there is no universal low-level
-protocol or language that describes communication.
+representation that describes communication.
 
-Why common low-level language exists for sequential computations, but does not
+Why common low-level representation exists for sequential computations, but does not
 exist for parallel and distributed ones? One of the reasons, which applies to
 both distributed and parallel computations, is the fact that people still think
 about programmes as sequences of steps~--- the same way as people themselves
@@ -129,7 +141,7 @@ realised this potential to get all their advantages; people realised the full
 potential of imperative languages, but do not know how to get rid of their
 disadvantages.
 
-In this paper we describe low-level language and protocol based on \emph{kernels}
+In this paper we describe low-level representation based on \emph{kernels}
 which is suitable for distributed and parallel computations. Kernels provide
 automatic fault tolerance and can be used to exchange the data between
 programmes written in different languages. We implement kernels in C++ and
@@ -144,7 +156,10 @@ that use several processor cores of single cluster node, the term
 \emph{distributed} to describe computations that use several cluster nodes and
 any number of cores on each node, and term \emph{cluster} to describe anything
 that refers to local cluster (as opposed to geographically distributed clusters
-which are not studied in this paper).
+which are not studied in this paper). \emph{Intermediate representation} in
+our paper is a particular form of abstract syntax tree, e.g.~in functional
+languages \emph{continuation passing style} is popular intermediate
+representation of the code.
 
 %TODO \cite{lang-virt}
 %\cite{fetterly2009dryadlinq}
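
As a minimal illustration of the continuation passing style mentioned in the
hunk above (a sketch written for this page, not code from the repository), the
same addition can be expressed in direct style and in CPS, where the rest of
the computation is passed around as an explicit value:

#include <functional>
#include <iostream>

// Direct style: the result is returned to the caller implicitly
// through the call stack.
int add_direct(int a, int b) { return a + b; }

// Continuation-passing style: the result is handed to an explicit
// continuation `k`, so the "rest of the computation" becomes a value
// that a compiler or a scheduler can store, move and resume later.
void add_cps(int a, int b, const std::function<void(int)>& k) { k(a + b); }

int main() {
    std::cout << add_direct(1, 2) << '\n';                 // prints 3
    add_cps(1, 2, [](int r) { std::cout << r << '\n'; });  // prints 3
}

Making the continuation explicit is loosely analogous to what kernels do below:
the rest of the computation is reified so that it can be stored, sent and
resumed outside the ordinary call stack.
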
@@ -160,16 +175,16 @@ unified system}
 In order to write parallel and distributed programmes the same way as we write
 sequential programmes, we need the following components.
 \begin{itemize}
-  \item Low-level language that acts as an intermediate portable representation of
+  \item Portable low-level representation of
   the code and the data and includes means of decomposition of the code and the
   data into parts that can be computed in parallel. The closest sequential
   counterpart is LLVM.
   \item Cluster scheduler that executes
-  parallel and distributed applications and uses the low-level language to implement
+  parallel and distributed applications and uses the low-level representation to implement
   communication between these applications running on different cluster nodes.
   The closest single-node counterpart is operating system kernel that executes
   user processes.
-  \item High-level interface that wraps the low-level language for existing
+  \item High-level interface that wraps the low-level representation for existing
   popular programming languages in a form of a framework or a library. This
   interface uses cluster scheduler, if it is available and node parallelism is
   needed by the application, otherwise the code is executed on the local node and parallelism
@@ -179,7 +194,7 @@ sequential programmes, we need the following components.
 \end{itemize}
 These three components are built on top of each other as in classical object-oriented
 programming approach, and all the complexity is pushed down to the lowest layer:
-low-level language is responsible for providing parallelism and fault tolerance to
+low-level representation is responsible for providing parallelism and fault tolerance to
 the applications, cluster scheduler uses these facilities to provide transparent
 execution of the applications on multiple cluster nodes, and high-level interface
 maps the underlying system to the target language to simplify the work for
@@ -209,20 +224,20 @@ control over communication of the applications that are run on the cluster, but
 is used as resource allocator instead.
 
 The situation in newer big data technologies is different: there are the same
-three components with hierarchical structure, but the low-level language is
+three components with hierarchical structure, but the low-level representation is
 integrated in the scheduler. There is YARN cluster
 scheduler~\cite{vavilapalli2013yarn} with API that is used as a low-level
-language for parallel and distributed computing, and there are many high-level
+representation for parallel and distributed computing, and there are many high-level
 libraries that are built on top of YARN~\cite{hadoop,oozie,spark2016,storm}.
 The scheduler has more control over job execution as jobs are decomposed into
 tasks and execution of tasks is controlled by the scheduler. Unfortunately, the
-lack of common low-level language made all high-level frameworks that are built
+lack of common low-level representation made all high-level frameworks that are built
 on top of YARN API use their own custom protocol for communication, shift
 responsibility of providing fault tolerance to the scheduler and shift
 responsibility of data decomposition to higher level frameworks.
 
 To summarise, the current state-of-the-art technologies for parallel and
-distributed computing can be divided into three classes: low-level languages,
+distributed computing can be divided into three classes: low-level representations,
 cluster schedulers and high-level interfaces; however, responsibilities of
 each class are not clearly separated by the developers of these technologies.
 Although, the structure of the components resembles the operating system kernel
@@ -231,22 +246,22 @@ each other, but integrated horizontally, and as a result the complexity of the
 parallel and distributed computations is sometimes visible on the highest
 levels of abstraction.
 
-Our proposal is to design a low-level language and a protocol for data exchange
-that provides fault tolerance, means of data and code decomposition and means
-of communication for parallel and distributed applications. Having such a
-language at your disposal makes it easy to build higher level components,
-because the complexity of cluster systems is hidden from the programmer, the
-duplicated effort of implementing the same facilities in higher level
-interfaces is reduced, and cluster scheduler has full control of the programme
-execution as it speaks the same protocol and uses the same low-level language
-internally: the language is general enough to write any distributed programme
-including the scheduler itself.
+Our proposal is to design a low-level representation that provides fault
+tolerance, means of data and code decomposition and means of communication for
+parallel and distributed applications. Having such a representation at your disposal
+makes it easy to build higher level components, because the complexity of
+cluster systems is hidden from the programmer, the duplicated effort of
+implementing the same facilities in higher level interfaces is reduced, and
+cluster scheduler has full control of the programme execution as it speaks the
+same protocol and uses the same low-level representation internally: the representation is
+general enough to describe any distributed programme including the scheduler
+itself.
 
 \subsection{Kernels as objects that control the programme flow}
 \label{sec-kernels}
 
-In order to create low-level language for parallel and distributed computing we
-borrow familiar features from sequential low-level languages and augment them
+In order to create low-level representation for parallel and distributed computing we
+borrow familiar features from sequential low-level representations and augment them
 with asynchronous function calls and an ability to read and write call stack
 frames.
@@ -302,14 +317,75 @@ this data in object methods and the output data (the result of the computation)
 also in object fields. The programmer decides which data is input and output.
 To reduce network usage the programmer may delete input data when the kernel
 enters \textit{downstream} state: that way only output data is copied back to
-the parent kernel over the network.
+the parent kernel over the network. The example programme written using kernels
+and using regular function call stack is shown in table~\ref{tab-call-stack}.
 
-This low-level language can be seen as an adaptation of classic function call
+\begin{table}
+  \centering
+  \begin{minipage}[t]{0.39\textwidth}
+\begin{lstlisting}[language=cpp]
+int nested(int a) {
+  return 123 + a;
+}
+\end{lstlisting}
+  \end{minipage}
+  \begin{minipage}[t]{0.59\textwidth}
+\begin{lstlisting}[language=cpp]
+struct Nested: public Kernel {
+  int result;
+  int a;
+  Nested(int a): a(a) {}
+  void act() override {
+    result = a + 123;
+    LATEX{\color[HTML]{ac4040}\bf{}async\_return}END();
+  }
+};
+\end{lstlisting}
+  \end{minipage}
+  \begin{minipage}[t]{0.39\textwidth}
+\begin{lstlisting}[language=cpp]
+void main() {
+  // code before
+  int result = nested();
+  // code after
+  print(result);
+}
+\end{lstlisting}
+  \end{minipage}
+  \begin{minipage}[t]{0.59\textwidth}
+\begin{lstlisting}[language=cpp]
+struct Main: public Kernel {
+  void act() override {
+    // code before
+    LATEX{\color[HTML]{ac4040}\bf{}async\_call}END(new Nested);
+  }
+  void react(Kernel* child) override {
+    int result = ((Nested*)child)->result;
+    // code after
+    print(result);
+    LATEX{\color[HTML]{ac4040}\bf{}async\_return}END();
+  }
+};
+
+void main() {
+  LATEX{\color[HTML]{ac4040}\bf{}async\_call}END(new Main);
+  wait();
+}
+\end{lstlisting}
+  \end{minipage}
+  \caption{The same programme written using regular function call stack
+  (left) and kernels (right). Here \texttt{async\_call} performs asynchronous
+  function call by pushing the child kernel to the queue, \texttt{async\_return}
+  performs asynchronous return from the function call by pushing the current
+  kernel to the queue.\label{tab-call-stack}}
+\end{table}
+
+This low-level representation can be seen as an adaptation of classic function call
 stack, but with asynchronous function calls and an ability to read and write
-stack frames. This differences give kernels the following advantages.
+stack frames. These differences give kernels the following advantages.
 \begin{itemize}
   \item Kernels define dependencies between function calls, but do not define
-  the order of computation. This gives natural way of expressing parallel
+  the order of the computation. This gives natural way of expressing parallel
   computations on the lowest possible level.
   \item Kernels can be written to and read from any medium: files, network
   connections, serial connections etc. This allows to implement fault
@@ -376,7 +452,7 @@ concern that the scheduler has to deal with (see~figure~\ref{fig-local-routing})
   \item network queue for sending kernels to and receiving from other cluster nodes,
   \item process queue for creating child processes and sending kernels to and
   receiving from them, and
-  \item the main queue for processing kernels in parallel using multiple processor cores.
+  \item the main processor queue for processing kernels in parallel using multiple processor cores.
 \end{itemize}
 This separation of concerns allows us to overlap data transfer and data
 processing: while the main queue processes kernels in parallel, process and network queues
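
The following single-threaded sketch was written for this page and is not the
repository's API: the names Kernel, Queue, async_call and async_return are
illustrative, act() and react() take the queue as a parameter for brevity, and
the real main queue is served by a pool of threads, one per processor core. It
shows the mechanics described in the hunks above: act() spawns child kernels,
async_return() sends a kernel back, and the queue delivers a finished child to
its parent's react().

#include <deque>
#include <iostream>
#include <utility>

struct Kernel;

// The "main queue" holds kernels that either still have to act or are
// returning to their parent after finishing.
struct Queue {
    enum class op { act, ret };
    std::deque<std::pair<op, Kernel*>> tasks;
    void async_call(Kernel* parent, Kernel* child);
    void async_return(Kernel* k);
    void run();
};

struct Kernel {
    Kernel* parent = nullptr;
    virtual ~Kernel() = default;
    virtual void act(Queue&) {}
    virtual void react(Queue&, Kernel* child) {}
};

void Queue::async_call(Kernel* parent, Kernel* child) {
    child->parent = parent;
    tasks.emplace_back(op::act, child);   // schedule the child for execution
}

void Queue::async_return(Kernel* k) { tasks.emplace_back(op::ret, k); }

void Queue::run() {
    while (!tasks.empty()) {
        auto [what, k] = tasks.front();
        tasks.pop_front();
        if (what == op::act) {
            k->act(*this);                // may call async_call/async_return
        } else {
            if (k->parent) k->parent->react(*this, k);
            delete k;                     // the returned kernel is consumed
        }
    }
}

// Mirrors the Nested/Main kernels from the table above.
struct Nested : Kernel {
    int a, result = 0;
    explicit Nested(int a): a(a) {}
    void act(Queue& q) override { result = a + 123; q.async_return(this); }
};

struct Main : Kernel {
    void act(Queue& q) override { q.async_call(this, new Nested(7)); }
    void react(Queue& q, Kernel* child) override {
        std::cout << static_cast<Nested*>(child)->result << '\n';  // prints 130
        q.async_return(this);
    }
};

int main() {
    Queue q;
    q.async_call(nullptr, new Main);
    q.run();
}
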
@@ -455,7 +531,7 @@ introduce weight field which tells how many threads will be used by the kernel.
 The default is one thread for ordinary kernels and nought threads for cluster
 scheduler kernels. Process queue tracks the total weight of the kernels that
 were sent to child processes and queues incoming kernels if the weight reaches
-the number of processor cores.
+the number of processor cores. Also, each cluster node reports this information to
 other nodes for better load balancing decisions.
 
 The scheduler acts as a router for the kernels: when the kernel is received
@@ -464,7 +540,7 @@ cluster node it can be sent. If the kernel has \textit{downstream} or
 \textit{point-to-point} state, the kernel is sent to the node where the target
 kernel resides; if the kernel has \textit{upstream} state, load balancing
 algorithm decides which node to send the kernel to. Load balancing algorithm
-tracks the total weight of the kernels that were sent to specified node and
+tracks the total weight of the kernels that were sent to the specified node and
 also receives the same information from the node (in case other nodes also send
 there their kernels), then it chooses the node with the lowest weight and sends
 the kernel to this node. If all nodes are full, the kernel is retained in the
@@ -479,18 +555,18 @@ The last but not the least responsibility of the scheduler is to discover and
 maintain a list of cluster nodes and establish persistent network connections
 to neighbours. Cluster scheduler does this automatically by scanning the
 network using efficient algorithm: the nodes in the network are organised in
-artificial tree topology with specified fan-out value and each node try to
+artificial tree topology with the specified fan-out value and each node tries to
 communicate with the nodes which are closer to the root of the tree. This
 approach significantly reduces the number of data that needs to be sent over
 the network to find all cluster nodes: in ideal case only one kernel is sent to
 and received from the parent node. The algorithm is described
 in~\cite{gankevich2015subord}. After the connections are established, all the
 \textit{upstream} kernels that are received from the applications' child
-processes are routed to neighbouring nodes in the tree topology (both parent
+processes are routed to neighbour nodes in the tree topology (both parent
 and child nodes). This creates a problem because the number of nodes ``behind''
 the parent node is generally different than the number of nodes behind the
 child nodes. In order to solve this problem we track not only the total weight
-of all kernels of the neighbouring node, but the total weight of each node in
+of all kernels of the neighbour node, but the total weight of each node in
 the cluster and sum the weight of all nodes that are behind the node \(A\) to
 compute the total weight of node \(A\) for load balancing. Also, we apply load
 balancing recursively: when the kernel arrives at the node, load
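
A sketch of the weight-based routing described above, with illustrative types
and names rather than the scheduler's actual code: every node reports the total
weight of its kernels, each neighbour in the tree topology hides a whole
subtree behind it, and an upstream kernel goes to the neighbour with the
smallest accumulated subtree weight, or stays in the local queue when every
subtree is full.

#include <cstdint>
#include <iostream>
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

using Weight = std::uint64_t;

struct Cluster {
    // total weight of the kernels currently running on each node
    std::unordered_map<std::string, Weight> node_weight;
    // the nodes reachable through each neighbour (including the neighbour itself)
    std::unordered_map<std::string, std::vector<std::string>> behind;

    // Sum the reported weights of all nodes "behind" a neighbour.
    Weight subtree_weight(const std::string& neighbour) const {
        Weight total = 0;
        for (const auto& node : behind.at(neighbour)) {
            auto it = node_weight.find(node);
            if (it != node_weight.end()) total += it->second;
        }
        return total;
    }

    // Pick the neighbour with the smallest accumulated weight, or return an
    // empty string when every subtree has reached its capacity.
    std::string choose_neighbour(Weight capacity_per_subtree) const {
        std::string best;
        Weight best_weight = std::numeric_limits<Weight>::max();
        for (const auto& kv : behind) {
            Weight w = subtree_weight(kv.first);
            if (w < capacity_per_subtree && w < best_weight) {
                best = kv.first;
                best_weight = w;
            }
        }
        return best;  // empty: retain the kernel in the local queue
    }
};

int main() {
    Cluster c;
    c.node_weight = {{"a", 4}, {"b", 1}, {"c", 2}, {"d", 7}};
    c.behind = {{"a", {"a", "b"}}, {"c", {"c"}}, {"d", {"d"}}};
    std::cout << c.choose_neighbour(8) << '\n';  // prints "c": 2 < 5 (a+b) < 7 (d)
}
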
@@ -506,16 +582,17 @@ To summarise, cluster scheduler uses kernels as unit of scheduling and as
 communication protocol between its daemon processes running on different
 cluster nodes. Daemon process acts as an intermediary between application
 processes running on different cluster nodes, and all application kernels are
-sent through this process. Kernels that are sent through the scheduler are
-heavy-weight: they have more fields than local kernels and the routing through
-the scheduler introduces multiple overheads compared to direct communication.
-However, using cluster scheduler hugely simplifies application development, as
-application developer does not need to worry about networking, fault tolerance,
-load balancing and ``how many parallel processes are enough for my
-application'': this is now handled by the scheduler. For maximum efficiency and
-embedded applications the application can be linked directly to the scheduler
-to be able to run in the same daemon process, that way the overhead of the
-scheduler is minimal.
+sent through this process to other cluster nodes. Kernels that are sent through
+the scheduler are heavy-weight: they have more fields than local kernels and
+the routing through the scheduler introduces multiple overheads compared to
+direct communication. However, using cluster scheduler hugely simplifies
+application development, as application developer does not need to worry about
+networking, fault tolerance, load balancing and ``how many parallel processes
+are enough for my application'': this is now handled by the scheduler. For
+maximum efficiency and embedded applications the application can be linked
+directly to the scheduler to be able to run in the same daemon process, that
+way application kernels are no longer sent though daemon process and the
+overhead of the scheduler is minimal.
 
 \subsection{Parallel and distributed evaluation of Guile expressions using
 kernels}
@@ -531,11 +608,11 @@ functions that are overridden in subclasses. This direct mapping is natural for
 a mixed-paradigm language like C++, but functional languages may benefit from
 implementing the same ideas in the compiler or interpreter.
 
-\begin{lstlisting}[language=C++,%
+\begin{lstlisting}[language=cpp,%
   caption={Public interface of the kernel and the queue classes in C++ (simplified for clarity).},%
   captionpos=b,
   label={lst-kernel-api}]
-enum class states {upstream, downstream};
+enum class states {upstream, downstream, point_to_point};
 
 class kernel {
 public:
@@ -559,7 +636,7 @@ language~\cite{galassi2002guile}.
 Guile is a dialect of Scheme~\cite{sussman1998} which in turn is a dialect of
 LISP~\cite{mccarthy1960}. The distinct feature of LISP-like languages is
 homoiconicity, i.e.~the code and the data is represented by tree-like
-structure (lists that may contain other lists as elements). This feature makes
+structure (lists that contain atoms or other lists as elements). This feature makes
 it possible to express parallelism directly in the language: every list element
 can be computed independently and it can be sent to other cluster nodes for
 parallel computation. To implement parallelism we created a Guile interpreter