commit 8e4bb0f0596ce1f5c49b38500cd8c48d2ec2bd01
parent e63b58b7c43577db938051f053f45d8e507088cb
Author: Ivan Gankevich <i.gankevich@spbu.ru>
Date: Mon, 14 Jun 2021 14:24:13 +0300
camera-ready version
Diffstat:
3 files changed, 135 insertions(+), 57 deletions(-)
diff --git a/Makefile b/Makefile
@@ -40,8 +40,7 @@ build/gnuplot/%.svg: gnuplot/%.gnuplot
build/%.eps: build/%.svg
inkscape -z --export-eps=$@ $<
-#build/main.zip: build/gnuplot/*.eps main.tex
-build/main.zip: main.tex
+build/main.zip: main.tex main.bib llncs.cls splncs04.bst build/gnuplot/results.eps
@mkdir -p build
zip --filesync build/main.zip $^
diff --git a/copyright.txt b/copyright.txt
@@ -4,3 +4,5 @@ Ivan Petriakov, Ivan Gankevich
There are many programming frameworks and languages for parallel and distributed computing which are successful both in industry and academia, however, all of them are isolated and self-contained. We believe that the main reason that there is no common denominator between them is that there is no intermediate language for distributed computations. For sequential computations we have bytecode that is used as an intermediate, portable and universal representation of a programme written in any language, but one important feature, that bytecode lacks, is an ability to describe process communication. If we add this feature, we get a low-level language on top of which all the frameworks and languages for parallel and distributed computations can be built. In this paper we explore how such intermediate language can be made, how it can reduce programming effort and how it may simplify internal structure of existing frameworks. We also demonstrate how high-level interface can be built for functional language that completely hides all the difficulties that a programmer encounters when he or she works with distributed systems.
Keywords: API, intermediate language, C++, Guile.
+
+Ivan Gankevich, Saint Petersburg State University, 13B Universitetskaya Emb., St Petersburg 199034, Russia, i.gankevich@spbu.ru
diff --git a/main.tex b/main.tex
@@ -9,6 +9,7 @@
\usepackage{listings}
\usepackage{tikz}
\usepackage{textcomp}
% https://github.com/stuhlmueller/scheme-listings/blob/master/lstlang0.sty
\lstdefinelanguage{scheme}{
@@ -40,6 +41,17 @@
showstringspaces=false
}
+\lstdefinelanguage{cpp}{
+ morekeywords=[1]{class,struct,enum,public,private,protected,virtual,override,const,
+ void,int,new,delete},
+ morecomment=[l]{//},
+ basicstyle=\small\rmfamily,
+ keywordstyle={\bf\color[HTML]{4081ec}},
+ commentstyle=\color[rgb]{0.33,0.33,0.33},
+ stringstyle={\color[HTML]{00a000}},
+ escapeinside={LATEX}{END},
+}
+
\begin{document}
\title{Functional programming interface for parallel and distributed computing%
@@ -66,22 +78,22 @@ There are many programming frameworks and languages for parallel and
distributed computing which are successful both in industry and academia,
however, all of them are isolated and self-contained. We believe that the main
reason that there is no common denominator between them is that there is no
-intermediate language for distributed computations. For sequential
+intermediate representation for distributed computations. For sequential
computations we have bytecode that is used as an intermediate, portable and
-universal representation of a programme written in any language, but one
-important feature, that bytecode lacks, is an ability to describe process
-communication. If we add this feature, we get low-level language on top of
+universal representation of a programme written in any language, but bytecode
+lacks an ability to describe process
+communication. If we add this feature, we get a low-level representation on top of
which all the frameworks and languages for parallel and distributed
computations can be built. In this paper we explore how such intermediate
-language can be made, how it can reduce programming effort and how it may
+representation can be made, how it can reduce programming effort and how it may
simplify internal structure of existing frameworks. We also demonstrate how
-high-level interface can be build for functional language that completely hides
+a high-level interface can be built for a functional language that completely hides
all the difficulties that a programmer encounters when he or she works with
distributed systems.
\keywords{%
\and API
-\and intermediate language
+\and intermediate representation
\and C++
\and Guile.
}
@@ -95,7 +107,7 @@ computing~\cite{spark2016,fault-tolerant-distributed-haskell,wilde2011swift,fett
which are successful both in industry and academia, however, all of them are
isolated and self-contained. We believe that the main reason that there is no
common denominator between these frameworks and languages is that there is no
-protocol or low-level language for distributed computations. For sequential
+common protocol or low-level representation for distributed computations. For sequential
computations we have bytecode (e.g.~LLVM~\cite{llvm}, Java bytecode, Guile
bytecode) that is used as an intermediate, portable and universal
representation of a programme written in any language; also we have assembler
@@ -104,9 +116,9 @@ important feature, that bytecode and assembler lack, is an ability to
communicate between parallel processes. This communication is the common
denominator on top of which all the frameworks and languages for parallel and
distributed computations can be built, however, there is no universal low-level
-protocol or language that describes communication.
+representation that describes communication.
-Why common low-level language exists for sequential computations, but does not
+Why a common low-level representation exists for sequential computations, but does not
exist for parallel and distributed ones? One of the reasons, which applies to
both distributed and parallel computations, is the fact that people still think
about programmes as sequences of steps~--- the same way as people themselves
@@ -129,7 +141,7 @@ realised this potential to get all their advantages; people realised the full
potential of imperative languages, but do not know how to get rid of their
disadvantages.
-In this paper we describe low-level language and protocol based on \emph{kernels}
+In this paper we describe a low-level representation based on \emph{kernels}
which is suitable for distributed and parallel computations. Kernels provide
automatic fault tolerance and can be used to exchange the data between
programmes written in different languages. We implement kernels in C++ and
@@ -144,7 +156,10 @@ that use several processor cores of single cluster node, the term
\emph{distributed} to describe computations that use several cluster nodes and
any number of cores on each node, and term \emph{cluster} to describe anything
that refers to local cluster (as opposed to geographically distributed clusters
-which are not studied in this paper).
+which are not studied in this paper). \emph{Intermediate representation} in
+our paper is a particular form of abstract syntax tree; in functional
+languages, for example, \emph{continuation passing style} is a popular
+intermediate representation of the code.
%TODO \cite{lang-virt}
%\cite{fetterly2009dryadlinq}
@@ -160,16 +175,16 @@ unified system}
In order to write parallel and distributed programmes the same way as we write
sequential programmes, we need the following components.
\begin{itemize}
- \item Low-level language that acts as an intermediate portable representation of
+ \item Portable low-level representation of
the code and the data and includes means of decomposition of the code and the data into
parts that can be computed in parallel. The closest sequential
counterpart is LLVM.
\item Cluster scheduler that executes
- parallel and distributed applications and uses the low-level language to implement
+ parallel and distributed applications and uses the low-level representation to implement
communication between these applications running on different cluster nodes.
The closest single-node counterpart is operating system kernel that executes
user processes.
- \item High-level interface that wraps the low-level language for existing
+ \item High-level interface that wraps the low-level representation for existing
popular programming languages in a form of a framework or a library. This interface
uses cluster scheduler, if it is available and node parallelism is needed by
the application, otherwise the code is executed on the local node and parallelism
@@ -179,7 +194,7 @@ sequential programmes, we need the following components.
\end{itemize}
These three components are built on top of each other as in classical object-oriented
programming approach, and all the complexity is pushed down to the lowest layer:
-low-level language is responsible for providing parallelism and fault tolerance to
+the low-level representation is responsible for providing parallelism and fault tolerance to
the applications, cluster scheduler uses these facilities to provide transparent execution of
the applications on multiple cluster nodes, and high-level interface maps
the underlying system to the target language to simplify the work for
@@ -209,20 +224,20 @@ control over communication of the applications that are run on the cluster, but
is used as resource allocator instead.
The situation in newer big data technologies is different: there are the same
-three components with hierarchical structure, but the low-level language is
+three components with hierarchical structure, but the low-level representation is
integrated in the scheduler. There is YARN cluster
scheduler~\cite{vavilapalli2013yarn} with API that is used as a low-level
-language for parallel and distributed computing, and there are many high-level
+representation for parallel and distributed computing, and there are many high-level
libraries that are built on top of YARN~\cite{hadoop,oozie,spark2016,storm}.
The scheduler has more control over job execution as jobs are decomposed into
tasks and execution of tasks is controlled by the scheduler. Unfortunately, the
-lack of common low-level language made all high-level frameworks that are built
+lack of a common low-level representation made all high-level frameworks that are built
on top of YARN API use their own custom protocol for communication, shift
responsibility of providing fault tolerance to the scheduler and shift
responsibility of data decomposition to higher level frameworks.
To summarise, the current state-of-the-art technologies for parallel and
-distributed computing can be divided into three classes: low-level languages,
+distributed computing can be divided into three classes: low-level representations,
cluster schedulers and high-level interfaces; however, responsibilities of each
class are not clearly separated by the developers of these technologies.
Although, the structure of the components resembles the operating system kernel
@@ -231,22 +246,22 @@ each other, but integrated horizontally, and as a result the complexity of the
parallel and distributed computations is sometimes visible on the highest
levels of abstraction.
-Our proposal is to design a low-level language and a protocol for data exchange
-that provides fault tolerance, means of data and code decomposition and means
-of communication for parallel and distributed applications. Having such a
-language at your disposal makes it easy to build higher level components,
-because the complexity of cluster systems is hidden from the programmer, the
-duplicated effort of implementing the same facilities in higher level
-interfaces is reduced, and cluster scheduler has full control of the programme
-execution as it speaks the same protocol and uses the same low-level language
-internally: the language is general enough to write any distributed programme
-including the scheduler itself.
+Our proposal is to design a low-level representation that provides fault
+tolerance, means of data and code decomposition and means of communication for
+parallel and distributed applications. Having such a representation at your disposal
+makes it easy to build higher level components, because the complexity of
+cluster systems is hidden from the programmer, the duplicated effort of
+implementing the same facilities in higher level interfaces is reduced, and
+the cluster scheduler has full control of the programme execution as it speaks the
+same protocol and uses the same low-level representation internally: the representation is
+general enough to describe any distributed programme including the scheduler
+itself.
\subsection{Kernels as objects that control the programme flow}
\label{sec-kernels}
-In order to create low-level language for parallel and distributed computing we
-borrow familiar features from sequential low-level languages and augment them
+In order to create a low-level representation for parallel and distributed computing we
+borrow familiar features from sequential low-level representations and augment them
with asynchronous function calls and an ability to read and write call stack
frames.
@@ -302,14 +317,75 @@ this data in object methods and the output data (the result of the computation)
also in object fields. The programmer decides which data is input and output.
To reduce network usage the programmer may delete input data when the kernel
enters \textit{downstream} state: that way only output data is copied back to
-the parent kernel over the network.
+the parent kernel over the network. An example programme written using kernels
+and using the regular function call stack is shown in table~\ref{tab-call-stack}.
-This low-level language can be seen as an adaptation of classic function call
+\begin{table}
+ \centering
+ \begin{minipage}[t]{0.39\textwidth}
+\begin{lstlisting}[language=cpp]
+int nested(int a) {
+ return 123 + a;
+}
+\end{lstlisting}
+ \end{minipage}
+ \begin{minipage}[t]{0.59\textwidth}
+\begin{lstlisting}[language=cpp]
+struct Nested: public Kernel {
+ int result;
+ int a;
+ Nested(int a): a(a) {}
+ void act() override {
+ result = a + 123;
+ LATEX{\color[HTML]{ac4040}\bf{}async\_return}END();
+ }
+};
+\end{lstlisting}
+ \end{minipage}
+ \begin{minipage}[t]{0.39\textwidth}
+\begin{lstlisting}[language=cpp]
+int main() {
+ // code before
+ int result = nested(5);
+ // code after
+ print(result);
+}
+\end{lstlisting}
+ \end{minipage}
+ \begin{minipage}[t]{0.59\textwidth}
+\begin{lstlisting}[language=cpp]
+struct Main: public Kernel {
+ void act() override {
+ // code before
+ LATEX{\color[HTML]{ac4040}\bf{}async\_call}END(new Nested(5));
+ }
+ void react(Kernel* child) override {
+ int result = ((Nested*)child)->result;
+ // code after
+ print(result);
+ LATEX{\color[HTML]{ac4040}\bf{}async\_return}END();
+ }
+};
+
+int main() {
+ LATEX{\color[HTML]{ac4040}\bf{}async\_call}END(new Main);
+ wait();
+}
+\end{lstlisting}
+ \end{minipage}
+ \caption{The same programme written using the regular function call stack
+ (left) and kernels (right). Here \texttt{async\_call} performs an asynchronous
+ function call by pushing the child kernel to the queue, and \texttt{async\_return}
+ performs an asynchronous return from the function call by pushing the current
+ kernel to the queue.\label{tab-call-stack}}
+\end{table}
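+
+The table example relies on a handful of primitives. The listing below is a
+minimal single-threaded sketch of how they may be implemented (illustrative
+only and simplified compared to the actual implementation):
+\texttt{async\_call} records the calling kernel as the parent of the child
+and pushes the child to the queue, \texttt{async\_return} pushes the current
+kernel to the queue again, and \texttt{wait} drains the queue, calling
+\texttt{act} for new kernels and the parent's \texttt{react} for returned
+kernels.
+
+\begin{lstlisting}[language=cpp]
+#include <queue>
+
+// Minimal sketch of the kernel queue; all names are illustrative.
+struct Kernel {
+    Kernel* parent = nullptr;
+    bool returned = false;
+    virtual ~Kernel() {}
+    virtual void act() {}
+    virtual void react(Kernel* child) {}
+};
+
+std::queue<Kernel*> the_queue;   // global kernel queue
+Kernel* current = nullptr;       // kernel that is being processed
+
+void async_call(Kernel* child) { // asynchronous function call
+    child->parent = current;
+    the_queue.push(child);
+}
+
+void async_return() {            // asynchronous return to the parent
+    current->returned = true;
+    the_queue.push(current);
+}
+
+void wait() {                    // process kernels until none are left
+    while (!the_queue.empty()) {
+        Kernel* k = the_queue.front(); the_queue.pop();
+        if (!k->returned) { current = k; k->act(); }
+        else if (k->parent) { current = k->parent; k->parent->react(k); delete k; }
+        else { delete k; }       // the root kernel has returned
+    }
+}
+\end{lstlisting}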
+
+This low-level representation can be seen as an adaptation of classic function call
stack, but with asynchronous function calls and an ability to read and write
-stack frames. This differences give kernels the following advantages.
+stack frames. These differences give kernels the following advantages.
\begin{itemize}
\item Kernels define dependencies between function calls, but do not define
- the order of computation. This gives natural way of expressing parallel
+ the order of the computation. This gives a natural way of expressing parallel
computations on the lowest possible level.
\item Kernels can be written to and read from any medium: files, network
connections, serial connections etc. This allows us to implement fault
@@ -376,7 +452,7 @@ concern that the scheduler has to deal with (see~figure~\ref{fig-local-routing})
\item network queue for sending kernels to and receiving from other cluster nodes,
\item process queue for creating child processes and sending kernels to and receiving
from them, and
- \item the main queue for processing kernels in parallel using multiple processor cores.
+ \item the main processor queue for processing kernels in parallel using multiple processor cores.
\end{itemize}
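+
+These queues can be implemented as instances of one generic task queue, each
+owning its worker threads: one thread is usually enough for the network and
+process queues, while the processor queue gets one thread per core. The
+sketch below is illustrative only; class and member names are not the ones
+used in the actual scheduler.
+
+\begin{lstlisting}[language=cpp]
+#include <algorithm>
+#include <condition_variable>
+#include <deque>
+#include <functional>
+#include <mutex>
+#include <thread>
+#include <vector>
+
+// Generic task queue with its own worker threads (a sketch).
+class task_queue {
+public:
+    explicit task_queue(unsigned nthreads) {
+        for (unsigned i=0; i<nthreads; ++i) {
+            _threads.emplace_back([this] { loop(); });
+        }
+    }
+    ~task_queue() {
+        { std::lock_guard<std::mutex> lock(_mutex); _stopped = true; }
+        _semaphore.notify_all();
+        for (auto& t : _threads) { t.join(); }
+    }
+    void push(std::function<void()> task) {  // e.g. act() or react() of a kernel
+        { std::lock_guard<std::mutex> lock(_mutex);
+          _tasks.push_back(std::move(task)); }
+        _semaphore.notify_one();
+    }
+private:
+    void loop() {
+        for (;;) {
+            std::unique_lock<std::mutex> lock(_mutex);
+            _semaphore.wait(lock,
+                [this] { return _stopped || !_tasks.empty(); });
+            if (_tasks.empty()) { return; }  // stopped and drained
+            auto task = std::move(_tasks.front());
+            _tasks.pop_front();
+            lock.unlock();
+            task();                          // process one kernel
+        }
+    }
+    std::deque<std::function<void()>> _tasks;
+    std::mutex _mutex;
+    std::condition_variable _semaphore;
+    std::vector<std::thread> _threads;
+    bool _stopped = false;
+};
+
+int main() {
+    task_queue network_queue(1), process_queue(1),
+        processor_queue(std::max(1u, std::thread::hardware_concurrency()));
+    processor_queue.push([] { /* act() of some kernel */ });
+}
+\end{lstlisting}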
This separation of concerns allows us to overlap data transfer and data processing:
while the main queue processes kernels in parallel, process and network queues
@@ -455,7 +531,7 @@ introduce weight field which tells how many threads will be used by the kernel.
The default is one thread for ordinary kernels and nought threads for cluster
scheduler kernels. Process queue tracks the total weight of the kernels that
were sent to child processes and queues incoming kernels if the weight reaches
-the number of processor cores. Also, each cluster reports this information to
+the number of processor cores. Also, each cluster node reports this information to
other nodes for better load balancing decisions.
The scheduler acts as a router for the kernels: when the kernel is received
@@ -464,7 +540,7 @@ cluster node it can be sent. If the kernel has \textit{downstream} or
\textit{point-to-point} state, the kernel is sent to the node where the target
kernel resides; if the kernel has \textit{upstream} state, load balancing
algorithm decides which node to send the kernel to. Load balancing algorithm
-tracks the total weight of the kernels that were sent to specified node and
+tracks the total weight of the kernels that were sent to the specified node and
also receives the same information from the node (in case other nodes also send
there their kernels), then it chooses the node with the lowest weight and sends
the kernel to this node. If all nodes are full, the kernel is retained in the
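+
+A possible implementation of this load balancing step is sketched below; node
+identifiers, fields and the weight bookkeeping are placeholders rather than
+the actual scheduler data structures.
+
+\begin{lstlisting}[language=cpp]
+#include <cstdint>
+#include <limits>
+#include <unordered_map>
+
+using node_id = std::uint64_t;
+
+struct node_state {
+    int total_weight = 0; // weight of the kernels sent to / reported by the node
+    int num_cores = 0;    // number of processor cores of the node
+};
+
+// Choose the node with the lowest total weight that still has free cores;
+// return the local node if every node is full (the kernel is then queued locally).
+node_id choose_node(node_id local,
+                    const std::unordered_map<node_id,node_state>& nodes) {
+    node_id best = local;
+    int best_weight = std::numeric_limits<int>::max();
+    for (const auto& [id, state] : nodes) {
+        if (state.total_weight >= state.num_cores) { continue; } // node is full
+        if (state.total_weight < best_weight) {
+            best_weight = state.total_weight;
+            best = id;
+        }
+    }
+    return best;
+}
+\end{lstlisting}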
@@ -479,18 +555,18 @@ The last but not the least responsibility of the scheduler is to discover and
maintain a list of cluster nodes and establish persistent network connections
to neighbours. Cluster scheduler does this automatically by scanning the
network using efficient algorithm: the nodes in the network are organised in
-artificial tree topology with specified fan-out value and each node try to
+artificial tree topology with the specified fan-out value and each node tries to
communicate with the nodes which are closer to the root of the tree. This
approach significantly reduces the amount of data that needs to be sent over
the network to find all cluster nodes: in the ideal case only one kernel is sent to
and received from the parent node. The algorithm is described
in~\cite{gankevich2015subord}. After the connections are established, all the
\textit{upstream} kernels that are received from the applications' child
-processes are routed to neighbouring nodes in the tree topology (both parent
+processes are routed to neighbour nodes in the tree topology (both parent
and child nodes). This creates a problem because the number of nodes ``behind''
the parent node is generally different than the number of nodes behind the
child nodes. In order to solve this problem we track not only the total weight
-of all kernels of the neighbouring node, but the total weight of each node in
+of all kernels of the neighbour node, but also the total weight of each node in
the cluster and sum the weight of all nodes that are behind the node \(A\) to
compute the total weight of node \(A\) for load balancing. Also, we
apply load balancing recursively: when the kernel arrives at the node, load
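+
+The total weight of node \(A\) described above can be computed recursively,
+for example as in the following sketch (types and names are illustrative, not
+the actual implementation).
+
+\begin{lstlisting}[language=cpp]
+#include <cstdint>
+#include <unordered_map>
+#include <vector>
+
+using node_id = std::uint64_t;
+
+struct node_info {
+    int weight = 0;              // total weight of the kernels on this node
+    std::vector<node_id> behind; // nodes reachable only through this node
+};
+
+// Weight of the node itself plus the weight of every node "behind" it
+// in the tree topology.
+int effective_weight(node_id id,
+                     const std::unordered_map<node_id,node_info>& cluster) {
+    const node_info& node = cluster.at(id);
+    int sum = node.weight;
+    for (node_id n : node.behind) {
+        sum += effective_weight(n, cluster); // recurse into the subtree
+    }
+    return sum;
+}
+\end{lstlisting}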
@@ -506,16 +582,17 @@ To summarise, cluster scheduler uses kernels as unit of scheduling and as
communication protocol between its daemon processes running on different
cluster nodes. Daemon process acts as an intermediary between application
processes running on different cluster nodes, and all application kernels are
-sent through this process. Kernels that are sent through the scheduler are
-heavy-weight: they have more fields than local kernels and the routing through
-the scheduler introduces multiple overheads compared to direct communication.
-However, using cluster scheduler hugely simplifies application development, as
-application developer does not need to worry about networking, fault tolerance,
-load balancing and ``how many parallel processes are enough for my
-application'': this is now handled by the scheduler. For maximum efficiency and
-embedded applications the application can be linked directly to the scheduler
-to be able to run in the same daemon process, that way the overhead of the
-scheduler is minimal.
+sent through this process to other cluster nodes. Kernels that are sent through
+the scheduler are heavy-weight: they have more fields than local kernels and
+the routing through the scheduler introduces multiple overheads compared to
+direct communication. However, using the cluster scheduler hugely simplifies
+application development, as the application developer does not need to worry about
+networking, fault tolerance, load balancing and ``how many parallel processes
+are enough for my application'': this is now handled by the scheduler. For
+maximum efficiency and embedded applications the application can be linked
+directly to the scheduler to be able to run in the same daemon process; that
+way application kernels are no longer sent through the daemon process and the
+overhead of the scheduler is minimal.
\subsection{Parallel and distributed evaluation of Guile expressions using kernels}
@@ -531,11 +608,11 @@ functions that are overridden in subclasses. This direct mapping is natural for
a mixed-paradigm language like C++, but functional languages may benefit from
implementing the same ideas in the compiler or interpreter.
-\begin{lstlisting}[language=C++,%
+\begin{lstlisting}[language=cpp,%
caption={Public interface of the kernel and the queue classes in C++ (simplified for clarity).},%
captionpos=b,
label={lst-kernel-api}]
-enum class states {upstream, downstream};
+enum class states {upstream, downstream, point_to_point};
class kernel {
public:
@@ -559,7 +636,7 @@ language~\cite{galassi2002guile}. Guile is a dialect of
Scheme~\cite{sussman1998} which in turn is a dialect of
LISP~\cite{mccarthy1960}. The distinct feature of LISP-like languages is
homoiconicity, i.e.~the code and the data is represented by tree-like
-structure (lists that may contain other lists as elements). This feature makes
+structure (lists that contain atoms or other lists as elements). This feature makes
it possible to express parallelism directly in the language: every list element
can be computed independently and it can be sent to other cluster nodes for
parallel computation. To implement parallelism we created a Guile interpreter