iccsa-21-guile

git clone https://git.igankevich.com/iccsa-21-guile.git

commit 221f6d9dbcffc4c237cc010da7a0f5253fc123ed
parent 9f4f35cd7c7dd78545e043c1c9192af47383c1e4
Author: Ivan Gankevich <i.gankevich@spbu.ru>
Date:   Thu,  6 May 2021 17:26:39 +0300

Parallel fold, conclusion, spell-check.

Diffstat:
main.tex | 118+++++++++++++++++++++++++++++++++++++++++++++++++++++--------------------------
1 file changed, 80 insertions(+), 38 deletions(-)

diff --git a/main.tex b/main.tex

@@ -8,6 +8,7 @@
 \usepackage{url}
 \usepackage{listings}
 \usepackage{tikz}
+\usepackage{textcomp}
 % https://github.com/stuhlmueller/scheme-listings/blob/master/lstlang0.sty
 \lstdefinelanguage{scheme}{

@@ -20,17 +21,18 @@
 let-values, let*-values, and, or, not, delay, force,
 quasiquote, quote, unquote, unquote-splicing,
- syntax, syntax-rules, eval, environment, query },
+ syntax, syntax-rules, eval, environment, query,
+ car, cdr, cons},
 morekeywords=[3]{import, export},
 alsodigit=!\$\%&*+-./:<=>?@^_~,
 sensitive=true,
 morecomment=[l]{;},
 morecomment=[s]{\#|}{|\#},
 morestring=[b]",
- basicstyle=\small\ttfamily,
- keywordstyle=\bf\ttfamily,
+ basicstyle=\small\rmfamily,
+ keywordstyle={\bf\rmfamily\color[HTML]{4081ec}},
 commentstyle=\color[rgb]{0.33,0.33,0.33},
- stringstyle={\color[rgb]{0.75,0.49,0.07}},
+ stringstyle={\color[HTML]{00a000}},
 upquote=true,
 breaklines=true,
 breakatwhitespace=true,

@@ -84,7 +86,7 @@
 which is non-portable, but still popular intermediate representation. One
 important feature, that bytecode and assembler lack, is an ability to
 communicate between parallel processes. This communication is the common
 denominator on top of which all the frameworks and languages for parallel and
-distributed computations can be built, and there is no universal low-level
+distributed computations can be built, however, there is no universal low-level
 protocol or language that describes communication.

@@ -97,7 +99,7 @@
 Why common low-level language exists for sequential computations, but does not
 contrast to unpopular functional languages in which programmes are written as
 compositions of functions with no implied order of computation. Another reason
 which applies to distributed computations is the fact that these computations
 are inherently unreliable and there is no universal approach for handling
-cluster node failures.
 While imperative languages produce more efficient
 programmes, they do not provide safety from deadlocks and fault tolerance
 guarantees. Also, they are much more difficult to write, as a human have to
 work with mutable state (local and global variables, objects etc.) and it is

@@ -151,8 +153,8 @@
 sequential programmes, we need the following components. The closest
 single-node counterpart is operating system kernel that executes user
 processes.
 \item High-level interface that wraps the low-level language for existing
- programming languages in a form of a framework or a library. This interface
- uses cluster scheduler if it is available and node parallelism is needed by
+ popular programming languages in a form of a framework or a library. This interface
+ uses cluster scheduler, if it is available and node parallelism is needed by
 the application, otherwise the code is executed on the local node and
 parallelism is limited to the parallelism of the node. The closest
 single-node counterpart is C library that provides an interface to system

@@ -187,7 +189,7 @@
 from the scheduler. Although, high-level interface is built on top of the
 low-level interface, batch job scheduler is fully integrated with neither of
 them: the cluster-wide counterpart of operating system kernel does not have
 control over communication of the applications that are run on the cluster, but
-is used as resource allocator.
+is used as resource allocator instead.
 The situation in newer big data technologies is different: there are the same
 three components with hierarchical structure, but the low-level language is

@@ -196,7 +198,7 @@
 scheduler~\cite{vavilapalli2013yarn} with API that is used as a low-level
 language for parallel and distributed computing, and there are many high-level
 libraries that are built on top of YARN~\cite{hadoop,oozie,spark2016,storm}.
 The scheduler has more control over job execution as jobs are decomposed into
-tasks and execution of tasks is controled by the scheduler.
 Unfortunately, the
+tasks and execution of tasks is controlled by the scheduler. Unfortunately, the
 lack of common low-level language made all high-level frameworks that are
 built on top of YARN API use their own custom protocol for communication,
 shift responsibility of providing fault tolerance to the scheduler and shift

@@ -246,7 +248,7 @@
 the stack can not be copied between cluster nodes or saved to and read from
 the file, because they often contain pointers to memory blocks that may be
 invalid on another cluster node or in the process that reads the stack from
 the file. Second, there is no natural way to express parallelism in this
 language: all
-function calls are syncrhonous and all instructions are executed in the
+function calls are synchronous and all instructions are executed in the
 specified order. In order to solve these problems we use object-oriented
 techniques.

@@ -261,7 +263,7 @@
 call returns (this method takes the corresponding object as an argument). The
 object also has \texttt{read} and \texttt{write} methods that are used to read
 and write its fields to and from file or to copy the object to another cluster
 node. In this model each object contains enough information to perform the
-corresponding funciton call, and we can make these calls in any order we like.
+corresponding function call, and we can make these calls in any order we like.
 Also, the object is self-contained, and we can ask another cluster node to
 perform the function call or save the object to disk to perform the call when
 the user wants to resume the computation (e.g.~after the computer is upgraded

@@ -289,7 +291,7 @@
 This low-level language can be seen as an adaptation of classic function call
 stack, but with asynchronous function calls and an ability to read and write
 stack frames. This differences give kernels the following advantages.
 \begin{itemize}
- \item Kernels define depedencies between function calls, but do not define
+ \item Kernels define dependencies between function calls, but do not define
 the order of computation. This gives natural way of expressing parallel
 computations on the lowest possible level.
 \item Kernels can be written to and read from any medium: files, network

@@ -310,7 +312,7 @@ stack frames. This differences give kernels the following advantages.
 abstraction is higher, therefore, compiler modification is possible
 only for languages that use high-level intermediate representation
 (e.g.~LISP-like languages and purely functional languages that have
- natural way of expressing parallelism by computing arguments to
+ natural way of expressing parallelism by computing arguments of
 functions in parallel).
 \end{itemize}

@@ -320,7 +322,7 @@
 Kernels are general enough to write any programme, and the first programme that
 we wrote using them was cluster scheduler that uses kernels to implement its
 internal logic and to run applications spanning multiple cluster nodes.
 Single-node version of the scheduler is as simple as thread pool attached to
-kernel queue decribed in section~\ref{sec-kernels}. The programme starts with
+kernel queue described in section~\ref{sec-kernels}. The programme starts with
 pushing the first (or \emph{main}) kernel to the queue and ends when the main
 kernel changes its state to \textit{downstream} and pushes itself to the queue.
 The number of threads in the pool equals the number of processor cores, but can

@@ -330,7 +332,7 @@
 scheduler is more involved and uses kernels to implement its logic. Cluster
 scheduler runs in a separate daemon process on each cluster node, and
 processes communicate with each other using kernels: process on node \(A\)
 writes some kernel to network connection with node \(B\) and process on node
-\(B\) read the kernel and performs useful operation with it.
Here kernels are +\(B\) reads the kernel and performs useful operation with it. Here kernels are used like messages rather than stack frames: kernel that always resides in node \(A\) creates child message kernel and sends it to the kernel that always resides in node \(B\). In order to implement this logic we added @@ -344,7 +346,7 @@ the location of the target kernel. The first tuple is also used by \textit{downstream} kernels that return back to their parents, but the second tuple is used only by \textit{point-to-point} kernels. -There several responsibilities of cluster scheduler: +There are several responsibilities of cluster scheduler: \begin{itemize} \item run applications in child processes, \item route kernels between application processes running on different cluster nodes, @@ -354,13 +356,13 @@ In order to implement them we created a kernel queue and a thread pool for each concern that the scheduler has to deal with (see~figure~\ref{fig-local-routing}): we have \begin{itemize} \item timer queue for scheduled and periodic tasks, - \item network queue for sending to and receiving kernels from other cluster nodes, - \item process queue for creating child processes and sending to and receiving - kernels from them, and + \item network queue for sending kernels to and receiving from other cluster nodes, + \item process queue for creating child processes and sending kernels to and receiving + from them, and \item the main queue for processing kernels in parallel using multiple processor cores. \end{itemize} This separation of concerns allows us to overlap data transfer and data processing: -while the main queue processes kernels in parallel, process queue and network queue +while the main queue processes kernels in parallel, process and network queues send and receive other kernels. 
This approach leads to small amount of oversubscription as separate threads are used to send and receive kernels, but benchmarks showed that this is not a big problem as most of the time these threads wait for the operating @@ -394,10 +396,10 @@ system kernel to transfer the data. \path[draw,thick] (process) -- (parent); \path[draw,thick] (process) -- (child); \end{tikzpicture} - \caption{Default kernel queues for each concern.\label{fig-local-routing}} + \caption{Default kernel queues for each cluster scheduler concern.\label{fig-local-routing}} \end{figure} -Cluster scheduler runs applications in child proecesses; this approach is +Cluster scheduler runs applications in child processes; this approach is natural for UNIX-like operating systems as the parent process has full control of its children: the amount of resources can be limited (the number of processor cores, the amount of memory etc.) and the process can be terminated @@ -406,7 +408,7 @@ cluster-wide source (target) application identifier field that uniquely identifies the source (the target) application from which the kernel originated (to which the kernel was sent). Also each kernel carries an application descriptor that specifies how to run the application (command line arguments, -environment variable etc.) and if the application is not running, it is +environment variables etc.) and if the corresponding process is not running, it is launched automatically by the scheduler. Child processes are needed only as means of controlling resources usage: a process is a scheduling unit for operating system kernel, but in cluster scheduler a child process performs @@ -419,7 +421,7 @@ received. This behaviour allows us to implement dynamic parallelism: we do not need to specify the number of parallel processes on application launch, the scheduler will automatically create them. 
 To reduce memory consumption stale
 processes, that have not received any kernel for a long period of time, may be
-terminated (they will be launched automatically, when the kernel arrives
+terminated (new processes will be created automatically, when the kernel arrives
 anyway). Kernels can be sent from one application to another by specifying
 different application descriptor.

@@ -473,7 +475,7 @@
 the parent node is generally different than the number of nodes behind the
 child nodes. In order to solve this problem we track not only the total weight
 of all kernels of the neighbouring node, but the total weight of each node in
 the cluster and sum the weight of all nodes that are behind the node \(A\) to
-compute the total weight of node \(A\) for load balancing. Also, we introduced
+compute the total weight of node \(A\) for load balancing. Also, we apply
 load balancing recursively: when the kernel arrives at the node, load
 balancing algorithm is executed once more to decide whether the kernel can be
 sent locally or to another cluster node (except the source node). This approach

@@ -488,10 +490,10 @@
 communication protocol between its daemon processes running on different
 cluster nodes. Daemon process acts as an intermediary between application
 processes running on different cluster nodes, and all application kernels are
 sent through this process. Kernels that are sent through the scheduler are
-hwavy-weight: they have more fields than local kernels and the routing through
+heavy-weight: they have more fields than local kernels and the routing through
 the scheduler introduces multiple overheads compared to direct communication.
 However, using cluster scheduler hugely simplifies application development, as
-application developer does not need worry about networking, fault tolerance,
+application developer does not need to worry about networking, fault tolerance,
 load balancing and ``how many parallel processes are enough for my
 application'': this is now handled by the scheduler. For maximum efficiency
 and embedded applications the application can be linked directly to the scheduler

@@ -539,14 +541,14 @@
 We made a reference implementation of kernels for Guile
 language~\cite{galassi2002guile}. Guile is a dialect of
 Scheme~\cite{sussman1998} which in turn is a dialect of
 LISP~\cite{mccarthy1960}. The distinct feature of LISP-like languages is
-homoiconicity, i.e.~the code and the data is reprensented by tree-like
+homoiconicity, i.e.~the code and the data is represented by tree-like
 structure (lists that may contain other lists as elements). This feature makes
 it possible to express parallelism directly in the language: every list element
 can be computed independently and it can be sent to other cluster nodes for
 parallel computation. To implement parallelism we created a Guile interpreter
 that evaluates every list element in parallel using kernels. In practice this
-means that every argument of a function call (a function call is also a list
-with the first element being the function name) is computed in parallel. This
+means that every argument of a procedure call (a procedure call is also a list
+with the first element being the procedure name) is computed in parallel.
 This
 interpreter is able to run any existing Guile programme (provided that it does
 not use threads, locks and semaphores explicitly) and the output will be the
 same as with the original interpreter, the programme will automatically use

@@ -563,22 +565,45 @@
 Inside \textit{eval} we use hybrid approach for parallelism: we use kernels to
 evaluate arguments of procedure calls and arguments of \texttt{cons} primitive
 asynchronously only if these arguments contain other procedure calls. This
 means that all simple arguments (variables, symbols, other primitives etc.) are
-computed sequentially without creating child kernels. Evaluating procedure
-calls and \texttt{cons} using kernels is enough to make \texttt{map} and
-\texttt{fold} forms parallel, and on these forms many other forms are
-based~\cite{hutton-fold}. Our implementations of these forms are shown in
-listing~\ref{lst-parallel-forms}.
+computed sequentially without creating child kernels.
+
+Evaluating procedure calls and \texttt{cons} using kernels is enough to make
+\texttt{map} form parallel, but we had to rewrite \texttt{fold} form to make it
+parallel. Our parallelism is based on the fact that procedure arguments can be
+evaluated in parallel without affecting the correctness of the procedure,
+however, evaluating arguments in parallel in \texttt{fold} does not give
+speedup because of the nested \texttt{fold} and \texttt{proc} calls: the next
+recursive call to \texttt{fold} waits until the call to \texttt{proc}
+completes. Alternative procedure \texttt{fold-pairwise} does not have this
+deficiency, but is only correct for \texttt{proc} that does not care about the
+order of the arguments (\texttt{+}, \texttt{*} operators etc.). In this
+procedure we apply \texttt{proc} to successive pairs of elements from the
+initial list, after that we recursively call \texttt{fold-pairwise} for the
+resulting list.
 The iteration is continued until only one element is left in
+the list, then we return this element as the result of the procedure call. This
+new procedure is also iterative, but parallel inside each iteration. We choose
+\texttt{map} and \texttt{fold} forms to illustrate automatic parallelism
+because many other forms are based on them~\cite{hutton-fold}. Our
+implementation is shown in listing~\ref{lst-parallel-forms}.

 \begin{lstlisting}[language=Scheme,%
   caption={Parallel \texttt{map} and \texttt{fold} forms in Guile.},
   captionpos=b,
   label={lst-parallel-forms}]
-(define (map proc lst)
+(define (map proc lst) "Parallel map."
   (if (null? lst) lst
     (cons (proc (car lst)) (map proc (cdr lst)))))
-(define (fold proc init lst)
+(define (fold proc init lst) "Sequential fold."
   (if (null? lst) init
     (fold proc (proc (car lst) init) (cdr lst))))
+(define (do-fold-pairwise proc lst)
+  (if (null? lst) '()
+    (if (null? (cdr lst)) lst
+      (do-fold-pairwise proc
+        (cons (proc (car lst) (car (cdr lst)))
+          (do-fold-pairwise proc (cdr (cdr lst))))))))
+(define (fold-pairwise proc lst) "Parallel pairwise fold."
+  (car (do-fold-pairwise proc lst)))
 \end{lstlisting}

 \section{Results}

@@ -598,6 +623,23 @@
 in Guile with kernels and in Guile without kernels.

 \section{Conclusion}

+Using procedure arguments to define parallel programme parts gives new
+perspective on writing parallel programmes. In imperative languages programmers
+are used to rearranging loops to optimise memory access patterns and help the
+compiler vectorise the code, but with parallel-arguments approach in functional
+languages they can rearrange the forms to help the interpreter to extract more
+parallelism. This parallelism is automatic and does not affect the correctness
+of the programme (of course, you need to serialise access and modification of
+the global variables). With help of kernels these parallel computations are
+extended to distributed computations.
 Kernels provide standard way of
+expressing parallel and distributed programme parts, automatic fault tolerance
+for master and worker nodes and automatic load balancing via cluster scheduler.
+Together kernels and arguments-based parallelism provide low- and high-level
+programming interface for clusters and single nodes that conveniently hide
+the shortcomings of parallel and distributed computations allowing the
+programmer to focus on the actual problem being solved rather than fixing
+bugs in his or her parallel and distributed code.
+
 \subsubsection*{Acknowledgements.}
 Research work is supported by Council for grants of the President of the
 Russian Federation (grant no.~MK-383.2020.9).
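
The central addition of this patch is the pairwise fold: combine adjacent pairs of list elements repeatedly until one value remains, which is correct only for order-insensitive `proc`. As a quick sanity check of that algorithm, here is a small Python translation (illustrative only; the patch's reference implementation is the Guile Scheme in the listing above, and the names below are ours). Each pass over the pairs is the part that the patched interpreter would evaluate in parallel:

```python
import operator
from functools import reduce

def fold_pairwise(proc, lst):
    """Iterative equivalent of the patch's fold-pairwise: reduce adjacent
    pairs until a single value remains. Requires a non-empty list and a
    proc that is associative and commutative (e.g. + or *)."""
    if not lst:
        raise ValueError("fold-pairwise needs a non-empty list")
    while len(lst) > 1:
        # Combine successive pairs; a leftover odd element is carried over
        # unchanged to the next round, as in the Scheme version.
        pairs = [proc(lst[i], lst[i + 1]) for i in range(0, len(lst) - 1, 2)]
        if len(lst) % 2 == 1:
            pairs.append(lst[-1])
        lst = pairs
    return lst[0]

if __name__ == "__main__":
    print(fold_pairwise(operator.add, list(range(10))))  # 45
    print(fold_pairwise(operator.mul, [1, 2, 3, 4, 5]))  # 120
```

For associative and commutative `proc` the result matches a left fold (`reduce`), while the pair rounds remove the sequential dependency between successive `proc` calls that makes the ordinary `fold` inherently serial.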