commit 8e4bb0f0596ce1f5c49b38500cd8c48d2ec2bd01
parent e63b58b7c43577db938051f053f45d8e507088cb
Author: Ivan Gankevich <i.gankevich@spbu.ru>
Date: Mon, 14 Jun 2021 14:24:13 +0300
camera-ready version
Diffstat:
3 files changed, 135 insertions(+), 57 deletions(-)
diff --git a/Makefile b/Makefile
@@ -40,8 +40,7 @@ build/gnuplot/%.svg: gnuplot/%.gnuplot
build/%.eps: build/%.svg
inkscape -z --export-eps=$@ $<
-#build/main.zip: build/gnuplot/*.eps main.tex
-build/main.zip: main.tex
+build/main.zip: main.tex main.bib llncs.cls splncs04.bst build/gnuplot/results.eps
@mkdir -p build
zip --filesync build/main.zip $^
diff --git a/copyright.txt b/copyright.txt
@@ -4,3 +4,5 @@ Ivan Petriakov, Ivan Gankevich
There are many programming frameworks and languages for parallel and distributed computing which are successful both in industry and academia, however, all of them are isolated and self-contained. We believe that the main reason that there is no common denominator between them is that there is no intermediate language for distributed computations. For sequential computations we have bytecode that is used as an intermediate, portable and universal representation of a programme written in any language, but one important feature, that bytecode lacks, is an ability to describe process communication. If we add this feature, we get a low-level language on top of which all the frameworks and languages for parallel and distributed computations can be built. In this paper we explore how such intermediate language can be made, how it can reduce programming effort and how it may simplify internal structure of existing frameworks. We also demonstrate how high-level interface can be built for functional language that completely hides all the difficulties that a programmer encounters when he or she works with distributed systems.
Keywords: API, intermediate language, C++, Guile.
+
+Ivan Gankevich, Saint Petersburg State University, 13B Universitetskaya Emb., St Petersburg 199034, Russia, i.gankevich@spbu.ru
diff --git a/main.tex b/main.tex
@@ -9,6 +9,7 @@
\usepackage{listings}
\usepackage{tikz}
\usepackage{textcomp}
% https://github.com/stuhlmueller/scheme-listings/blob/master/lstlang0.sty
\lstdefinelanguage{scheme}{
@@ -40,6 +41,17 @@
showstringspaces=false
}
+\lstdefinelanguage{cpp}{
+ morekeywords=[1]{class,struct,enum,public,private,protected,virtual,override,const,
+ void,int,new,delete},
+ morecomment=[l]{//},
+ basicstyle=\small\rmfamily,
+ keywordstyle={\bf\color[HTML]{4081ec}},
+ commentstyle=\color[rgb]{0.33,0.33,0.33},
+ stringstyle={\color[HTML]{00a000}},
+ escapeinside={LATEX}{END},
+}
+
\begin{document}
\title{Functional programming interface for parallel and distributed computing%
@@ -66,22 +78,22 @@ There are many programming frameworks and languages for parallel and
distributed computing which are successful both in industry and academia,
however, all of them are isolated and self-contained. We believe that the main
reason that there is no common denominator between them is that there is no
-intermediate language for distributed computations. For sequential
+intermediate representation for distributed computations. For sequential
computations we have bytecode that is used as an intermediate, portable and
-universal representation of a programme written in any language, but one
-important feature, that bytecode lacks, is an ability to describe process
-communication. If we add this feature, we get low-level language on top of
+universal representation of a programme written in any language, but bytecode
+lacks an ability to describe process
+communication. If we add this feature, we get a low-level representation on top of
which all the frameworks and languages for parallel and distributed
computations can be built. In this paper we explore how such intermediate
-language can be made, how it can reduce programming effort and how it may
+representation can be made, how it can reduce programming effort and how it may
simplify internal structure of existing frameworks. We also demonstrate how
-high-level interface can be build for functional language that completely hides
+a high-level interface can be built for a functional language that completely hides
all the difficulties that a programmer encounters when he or she works with
distributed systems.
\keywords{%
\and API
-\and intermediate language
+\and intermediate representation
\and C++
\and Guile.
}
@@ -95,7 +107,7 @@ computing~\cite{spark2016,fault-tolerant-distributed-haskell,wilde2011swift,fett
which are successful both in industry and academia, however, all of them are
isolated and self-contained. We believe that the main reason that there is no
common denominator between these frameworks and languages is that there is no
-protocol or low-level language for distributed computations. For sequential
+common protocol or low-level representation for distributed computations. For sequential
computations we have bytecode (e.g.~LLVM~\cite{llvm}, Java bytecode, Guile
bytecode) that is used as an intermediate, portable and universal
representation of a programme written in any language; also we have assembler
@@ -104,9 +116,9 @@ important feature, that bytecode and assembler lack, is an ability to
communicate between parallel processes. This communication is the common
denominator on top of which all the frameworks and languages for parallel and
distributed computations can be built, however, there is no universal low-level
-protocol or language that describes communication.
+representation that describes communication.
-Why common low-level language exists for sequential computations, but does not
+Why a common low-level representation exists for sequential computations, but does not
exist for parallel and distributed ones? One of the reasons, which applies to
both distributed and parallel computations, is the fact that people still think
about programmes as sequences of steps~--- the same way as people themselves
@@ -129,7 +141,7 @@ realised this potential to get all their advantages; people realised the full
potential of imperative languages, but do not know how to get rid of their
disadvantages.
-In this paper we describe low-level language and protocol based on \emph{kernels}
+In this paper we describe a low-level representation based on \emph{kernels}
which is suitable for distributed and parallel computations. Kernels provide
automatic fault tolerance and can be used to exchange the data between
programmes written in different languages. We implement kernels in C++ and
@@ -144,7 +156,10 @@ that use several processor cores of single cluster node, the term
\emph{distributed} to describe computations that use several cluster nodes and
any number of cores on each node, and term \emph{cluster} to describe anything
that refers to local cluster (as opposed to geographically distributed clusters
-which are not studied in this paper).
+which are not studied in this paper). \emph{Intermediate representation} in
+our paper is a particular form of abstract syntax tree; in functional
+languages, for example, \emph{continuation passing style} is a popular
+intermediate representation of the code.
%TODO \cite{lang-virt}
%\cite{fetterly2009dryadlinq}
@@ -160,16 +175,16 @@ unified system}
In order to write parallel and distributed programmes the same way as we write
sequential programmes, we need the following components.
\begin{itemize}
- \item Low-level language that acts as an intermediate portable representation of
+ \item Portable low-level representation of
the code and the data and includes means of decomposition of the code and the data into
parts that can be computed in parallel. The closest sequential
counterpart is LLVM.
\item Cluster scheduler that executes
- parallel and distributed applications and uses the low-level language to implement
+ parallel and distributed applications and uses the low-level representation to implement
communication between these applications running on different cluster nodes.
The closest single-node counterpart is operating system kernel that executes
user processes.
- \item High-level interface that wraps the low-level language for existing
+ \item High-level interface that wraps the low-level representation for existing
popular programming languages in a form of a framework or a library. This interface
uses cluster scheduler, if it is available and node parallelism is needed by
the application, otherwise the code is executed on the local node and parallelism
@@ -179,7 +194,7 @@ sequential programmes, we need the following components.
\end{itemize}
These three components are built on top of each other as in classical object-oriented
programming approach, and all the complexity is pushed down to the lowest layer:
-low-level language is responsible for providing parallelism and fault tolerance to
+the low-level representation is responsible for providing parallelism and fault tolerance to
the applications, cluster scheduler uses these facilities to provide transparent execution of
the applications on multiple cluster nodes, and high-level interface maps
the underlying system to the target language to simplify the work for
@@ -209,20 +224,20 @@ control over communication of the applications that are run on the cluster, but
is used as resource allocator instead.
The situation in newer big data technologies is different: there are the same
-three components with hierarchical structure, but the low-level language is
+three components with hierarchical structure, but the low-level representation is
integrated in the scheduler. There is YARN cluster
scheduler~\cite{vavilapalli2013yarn} with API that is used as a low-level
-language for parallel and distributed computing, and there are many high-level
+representation for parallel and distributed computing, and there are many high-level
libraries that are built on top of YARN~\cite{hadoop,oozie,spark2016,storm}.
The scheduler has more control over job execution as jobs are decomposed into
tasks and execution of tasks is controlled by the scheduler. Unfortunately, the
-lack of common low-level language made all high-level frameworks that are built
+lack of a common low-level representation made all high-level frameworks that are built
on top of YARN API use their own custom protocol for communication, shift
responsibility of providing fault tolerance to the scheduler and shift
responsibility of data decomposition to higher level frameworks.
To summarise, the current state-of-the-art technologies for parallel and
-distributed computing can be divided into three classes: low-level languages,
+distributed computing can be divided into three classes: low-level representations,
cluster schedulers and high-level interfaces; however, responsibilities of each
class are not clearly separated by the developers of these technologies.
Although, the structure of the components resembles the operating system kernel
@@ -231,22 +246,22 @@ each other, but integrated horizontally, and as a result the complexity of the
parallel and distributed computations is sometimes visible on the highest
levels of abstraction.
-Our proposal is to design a low-level language and a protocol for data exchange
-that provides fault tolerance, means of data and code decomposition and means
-of communication for parallel and distributed applications. Having such a
-language at your disposal makes it easy to build higher level components,
-because the complexity of cluster systems is hidden from the programmer, the
-duplicated effort of implementing the same facilities in higher level
-interfaces is reduced, and cluster scheduler has full control of the programme
-execution as it speaks the same protocol and uses the same low-level language
-internally: the language is general enough to write any distributed programme
-including the scheduler itself.
+Our proposal is to design a low-level representation that provides fault
+tolerance, means of data and code decomposition and means of communication for
+parallel and distributed applications. Having such a representation at your disposal
+makes it easy to build higher level components, because the complexity of
+cluster systems is hidden from the programmer, the duplicated effort of
+implementing the same facilities in higher level interfaces is reduced, and
+the cluster scheduler has full control of the programme execution as it speaks the
+same protocol and uses the same low-level representation internally: the representation is
+general enough to describe any distributed programme including the scheduler
+itself.
\subsection{Kernels as objects that control the programme flow}
\label{sec-kernels}
-In order to create low-level language for parallel and distributed computing we
-borrow familiar features from sequential low-level languages and augment them
+In order to create a low-level representation for parallel and distributed computing we
+borrow familiar features from sequential low-level representations and augment them
with asynchronous function calls and an ability to read and write call stack
frames.
@@ -302,14 +317,75 @@ this data in object methods and the output data (the result of the computation)
also in object fields. The programmer decides which data is input and output.
To reduce network usage the programmer may delete input data when the kernel
enters \textit{downstream} state: that way only output data is copied back to
-the parent kernel over the network.
+the parent kernel over the network. An example programme written using kernels
+and using the regular function call stack is shown in table~\ref{tab-call-stack}.
-This low-level language can be seen as an adaptation of classic function call
+\begin{table}
+ \centering
+ \begin{minipage}[t]{0.39\textwidth}
+\begin{lstlisting}[language=cpp]
+int nested(int a) {
+ return 123 + a;
+}
+\end{lstlisting}
+ \end{minipage}
+ \begin{minipage}[t]{0.59\textwidth}
+\begin{lstlisting}[language=cpp]
+struct Nested: public Kernel {
+ int result;
+ int a;
+ Nested(int a): a(a) {}
+ void act() override {
+ result = a + 123;
+ LATEX{\color[HTML]{ac4040}\bf{}async\_return}END();
+ }
+};
+\end{lstlisting}
+ \end{minipage}
+ \begin{minipage}[t]{0.39\textwidth}
+\begin{lstlisting}[language=cpp]
+int main() {
+ // code before
+ int result = nested(5);
+ // code after
+ print(result);
+}
+\end{lstlisting}
+ \end{minipage}
+ \begin{minipage}[t]{0.59\textwidth}
+\begin{lstlisting}[language=cpp]
+struct Main: public Kernel {
+ void act() override {
+ // code before
+ LATEX{\color[HTML]{ac4040}\bf{}async\_call}END(new Nested(5));
+ }
+ void react(Kernel* child) override {
+ int result = ((Nested*)child)->result;
+ // code after
+ print(result);
+ LATEX{\color[HTML]{ac4040}\bf{}async\_return}END();
+ }
+};
+
+int main() {
+ LATEX{\color[HTML]{ac4040}\bf{}async\_call}END(new Main);
+ wait();
+}
+\end{lstlisting}
+ \end{minipage}
+ \caption{The same programme written using the regular function call stack
+ (left) and kernels (right). Here \texttt{async\_call} performs an asynchronous
+ function call by pushing the child kernel to the queue, and \texttt{async\_return}
+ performs an asynchronous return from the function call by pushing the current
+ kernel to the queue.\label{tab-call-stack}}
+\end{table}
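+
+The table example relies on a handful of primitives. The listing below is a
+minimal single-threaded sketch of how they may be implemented (illustrative
+only and simplified compared to the actual implementation):
+\texttt{async\_call} records the calling kernel as the parent of the child
+and pushes the child to the queue, \texttt{async\_return} pushes the current
+kernel to the queue again, and \texttt{wait} drains the queue, calling
+\texttt{act} for new kernels and the parent's \texttt{react} for returned
+kernels.
+
+\begin{lstlisting}[language=cpp]
+#include <queue>
+
+// Minimal sketch of the kernel queue; all names are illustrative.
+struct Kernel {
+    Kernel* parent = nullptr;
+    bool returned = false;
+    virtual ~Kernel() {}
+    virtual void act() {}
+    virtual void react(Kernel* child) {}
+};
+
+std::queue<Kernel*> the_queue;   // global kernel queue
+Kernel* current = nullptr;       // kernel that is being processed
+
+void async_call(Kernel* child) { // asynchronous function call
+    child->parent = current;
+    the_queue.push(child);
+}
+
+void async_return() {            // asynchronous return to the parent
+    current->returned = true;
+    the_queue.push(current);
+}
+
+void wait() {                    // process kernels until none are left
+    while (!the_queue.empty()) {
+        Kernel* k = the_queue.front(); the_queue.pop();
+        if (!k->returned) { current = k; k->act(); }
+        else if (k->parent) { current = k->parent; k->parent->react(k); delete k; }
+        else { delete k; }       // the root kernel has returned
+    }
+}
+\end{lstlisting}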
+
+This low-level representation can be seen as an adaptation of classic function call
stack, but with asynchronous function calls and an ability to read and write
-stack frames. This differences give kernels the following advantages.
+stack frames. These differences give kernels the following advantages.
\begin{itemize}
\item Kernels define dependencies between function calls, but do not define
- the order of computation. This gives natural way of expressing parallel
+ the order of the computation. This gives a natural way of expressing parallel
computations on the lowest possible level.
\item Kernels can be written to and read from any medium: files, network
connections, serial connections etc. This allows us to implement fault
@@ -376,7 +452,7 @@ concern that the scheduler has to deal with (see~figure~\ref{fig-local-routing})
\item network queue for sending kernels to and receiving from other cluster nodes,
\item process queue for creating child processes and sending kernels to and receiving
from them, and
- \item the main queue for processing kernels in parallel using multiple processor cores.
+ \item the main processor queue for processing kernels in parallel using multiple processor cores.
\end{itemize}
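+
+These queues can be implemented as instances of one generic task queue, each
+owning its worker threads: one thread is usually enough for the network and
+process queues, while the processor queue gets one thread per core. The
+sketch below is illustrative only; class and member names are not the ones
+used in the actual scheduler.
+
+\begin{lstlisting}[language=cpp]
+#include <algorithm>
+#include <condition_variable>
+#include <deque>
+#include <functional>
+#include <mutex>
+#include <thread>
+#include <vector>
+
+// Generic task queue with its own worker threads (a sketch).
+class task_queue {
+public:
+    explicit task_queue(unsigned nthreads) {
+        for (unsigned i=0; i<nthreads; ++i) {
+            _threads.emplace_back([this] { loop(); });
+        }
+    }
+    ~task_queue() {
+        { std::lock_guard<std::mutex> lock(_mutex); _stopped = true; }
+        _semaphore.notify_all();
+        for (auto& t : _threads) { t.join(); }
+    }
+    void push(std::function<void()> task) {  // e.g. act() or react() of a kernel
+        { std::lock_guard<std::mutex> lock(_mutex);
+          _tasks.push_back(std::move(task)); }
+        _semaphore.notify_one();
+    }
+private:
+    void loop() {
+        for (;;) {
+            std::unique_lock<std::mutex> lock(_mutex);
+            _semaphore.wait(lock,
+                [this] { return _stopped || !_tasks.empty(); });
+            if (_tasks.empty()) { return; }  // stopped and drained
+            auto task = std::move(_tasks.front());
+            _tasks.pop_front();
+            lock.unlock();
+            task();                          // process one kernel
+        }
+    }
+    std::deque<std::function<void()>> _tasks;
+    std::mutex _mutex;
+    std::condition_variable _semaphore;
+    std::vector<std::thread> _threads;
+    bool _stopped = false;
+};
+
+int main() {
+    task_queue network_queue(1), process_queue(1),
+        processor_queue(std::max(1u, std::thread::hardware_concurrency()));
+    processor_queue.push([] { /* act() of some kernel */ });
+}
+\end{lstlisting}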
This separation of concerns allows us to overlap data transfer and data processing:
while the main queue processes kernels in parallel, process and network queues
@@ -455,7 +531,7 @@ introduce weight field which tells how many threads will be used by the kernel.
The default is one thread for ordinary kernels and nought threads for cluster
scheduler kernels. Process queue tracks the total weight of the kernels that
were sent to child processes and queues incoming kernels if the weight reaches
-the number of processor cores. Also, each cluster reports this information to
+the number of processor cores. Also, each cluster node reports this information to
other nodes for better load balancing decisions.
The scheduler acts as a router for the kernels: when the kernel is received
@@ -464,7 +540,7 @@ cluster node it can be sent. If the kernel has \textit{downstream} or
\textit{point-to-point} state, the kernel is sent to the node where the target
kernel resides; if the kernel has \textit{upstream} state, load balancing
algorithm decides which node to send the kernel to. Load balancing algorithm
-tracks the total weight of the kernels that were sent to specified node and
+tracks the total weight of the kernels that were sent to the specified node and
also receives the same information from the node (in case other nodes also send
there their kernels), then it chooses the node with the lowest weight and sends
the kernel to this node. If all nodes are full, the kernel is retained in the
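+
+A possible implementation of this load balancing step is sketched below; node
+identifiers, fields and the weight bookkeeping are placeholders rather than
+the actual scheduler data structures.
+
+\begin{lstlisting}[language=cpp]
+#include <cstdint>
+#include <limits>
+#include <unordered_map>
+
+using node_id = std::uint64_t;
+
+struct node_state {
+    int total_weight = 0; // weight of the kernels sent to / reported by the node
+    int num_cores = 0;    // number of processor cores of the node
+};
+
+// Choose the node with the lowest total weight that still has free cores;
+// return the local node if every node is full (the kernel is then queued locally).
+node_id choose_node(node_id local,
+                    const std::unordered_map<node_id,node_state>& nodes) {
+    node_id best = local;
+    int best_weight = std::numeric_limits<int>::max();
+    for (const auto& [id, state] : nodes) {
+        if (state.total_weight >= state.num_cores) { continue; } // node is full
+        if (state.total_weight < best_weight) {
+            best_weight = state.total_weight;
+            best = id;
+        }
+    }
+    return best;
+}
+\end{lstlisting}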
@@ -479,18 +555,18 @@ The last but not the least responsibility of the scheduler is to discover and
maintain a list of cluster nodes and establish persistent network connections
to neighbours. Cluster scheduler does this automatically by scanning the
network using efficient algorithm: the nodes in the network are organised in
-artificial tree topology with specified fan-out value and each node try to
+artificial tree topology with the specified fan-out value and each node tries to
communicate with the nodes which are closer to the root of the tree. This
approach significantly reduces the amount of data that needs to be sent over
the network to find all cluster nodes: in the ideal case only one kernel is sent to
and received from the parent node. The algorithm is described
in~\cite{gankevich2015subord}. After the connections are established, all the
\textit{upstream} kernels that are received from the applications' child
-processes are routed to neighbouring nodes in the tree topology (both parent
+processes are routed to neighbour nodes in the tree topology (both parent
and child nodes). This creates a problem because the number of nodes ``behind''
the parent node is generally different than the number of nodes behind the
child nodes. In order to solve this problem we track not only the total weight
-of all kernels of the neighbouring node, but the total weight of each node in
+of all kernels of the neighbour node, but also the total weight of each node in
the cluster and sum the weight of all nodes that are behind the node \(A\) to
compute the total weight of node \(A\) for load balancing. Also, we
apply load balancing recursively: when the kernel arrives at the node, load
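+
+The total weight of node \(A\) described above can be computed recursively,
+for example as in the following sketch (types and names are illustrative, not
+the actual implementation).
+
+\begin{lstlisting}[language=cpp]
+#include <cstdint>
+#include <unordered_map>
+#include <vector>
+
+using node_id = std::uint64_t;
+
+struct node_info {
+    int weight = 0;              // total weight of the kernels on this node
+    std::vector<node_id> behind; // nodes reachable only through this node
+};
+
+// Weight of the node itself plus the weight of every node "behind" it
+// in the tree topology.
+int effective_weight(node_id id,
+                     const std::unordered_map<node_id,node_info>& cluster) {
+    const node_info& node = cluster.at(id);
+    int sum = node.weight;
+    for (node_id n : node.behind) {
+        sum += effective_weight(n, cluster); // recurse into the subtree
+    }
+    return sum;
+}
+\end{lstlisting}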
@@ -506,16 +582,17 @@ To summarise, cluster scheduler uses kernels as unit of scheduling and as
communication protocol between its daemon processes running on different
cluster nodes. Daemon process acts as an intermediary between application
processes running on different cluster nodes, and all application kernels are
-sent through this process. Kernels that are sent through the scheduler are
-heavy-weight: they have more fields than local kernels and the routing through
-the scheduler introduces multiple overheads compared to direct communication.
-However, using cluster scheduler hugely simplifies application development, as
-application developer does not need to worry about networking, fault tolerance,
-load balancing and ``how many parallel processes are enough for my
-application'': this is now handled by the scheduler. For maximum efficiency and
-embedded applications the application can be linked directly to the scheduler
-to be able to run in the same daemon process, that way the overhead of the
-scheduler is minimal.
+sent through this process to other cluster nodes. Kernels that are sent through
+the scheduler are heavy-weight: they have more fields than local kernels and
+the routing through the scheduler introduces multiple overheads compared to
+direct communication. However, using the cluster scheduler hugely simplifies
+application development, as the application developer does not need to worry about
+networking, fault tolerance, load balancing and ``how many parallel processes
+are enough for my application'': this is now handled by the scheduler. For
+maximum efficiency and embedded applications the application can be linked
+directly to the scheduler to be able to run in the same daemon process; that
+way application kernels are no longer sent through the daemon process and the
+overhead of the scheduler is minimal.
\subsection{Parallel and distributed evaluation of Guile expressions using kernels}
@@ -531,11 +608,11 @@ functions that are overridden in subclasses. This direct mapping is natural for
a mixed-paradigm language like C++, but functional languages may benefit from
implementing the same ideas in the compiler or interpreter.
-\begin{lstlisting}[language=C++,%
+\begin{lstlisting}[language=cpp,%
caption={Public interface of the kernel and the queue classes in C++ (simplified for clarity).},%
captionpos=b,
label={lst-kernel-api}]
-enum class states {upstream, downstream};
+enum class states {upstream, downstream, point_to_point};
class kernel {
public:
@@ -559,7 +636,7 @@ language~\cite{galassi2002guile}. Guile is a dialect of
Scheme~\cite{sussman1998} which in turn is a dialect of
LISP~\cite{mccarthy1960}. The distinct feature of LISP-like languages is
homoiconicity, i.e.~the code and the data is represented by tree-like
-structure (lists that may contain other lists as elements). This feature makes
+structure (lists that contain atoms or other lists as elements). This feature makes
it possible to express parallelism directly in the language: every list element
can be computed independently and it can be sent to other cluster nodes for
parallel computation. To implement parallelism we created a Guile interpreter