iccsa-21-guile

git clone https://git.igankevich.com/iccsa-21-guile.git

commit 221f6d9dbcffc4c237cc010da7a0f5253fc123ed
parent 9f4f35cd7c7dd78545e043c1c9192af47383c1e4
Author: Ivan Gankevich <i.gankevich@spbu.ru>
Date:   Thu,  6 May 2021 17:26:39 +0300

Parallel fold, conclusion, spell-check.

Diffstat:
main.tex | 118+++++++++++++++++++++++++++++++++++++++++++++++++++++--------------------------
1 file changed, 80 insertions(+), 38 deletions(-)

diff --git a/main.tex b/main.tex

@@ -8,6 +8,7 @@
 \usepackage{url}
 \usepackage{listings}
 \usepackage{tikz}
+\usepackage{textcomp}
 % https://github.com/stuhlmueller/scheme-listings/blob/master/lstlang0.sty
 \lstdefinelanguage{scheme}{

@@ -20,17 +21,18 @@
 let-values, let*-values, and, or, not, delay, force,
 quasiquote, quote, unquote, unquote-splicing,
- syntax, syntax-rules, eval, environment, query },
+ syntax, syntax-rules, eval, environment, query,
+ car, cdr, cons},
 morekeywords=[3]{import, export},
 alsodigit=!\$\%&*+-./:<=>?@^_~,
 sensitive=true,
 morecomment=[l]{;},
 morecomment=[s]{\#|}{|\#},
 morestring=[b]",
- basicstyle=\small\ttfamily,
- keywordstyle=\bf\ttfamily,
+ basicstyle=\small\rmfamily,
+ keywordstyle={\bf\rmfamily\color[HTML]{4081ec}},
 commentstyle=\color[rgb]{0.33,0.33,0.33},
- stringstyle={\color[rgb]{0.75,0.49,0.07}},
+ stringstyle={\color[HTML]{00a000}},
 upquote=true,
 breaklines=true,
 breakatwhitespace=true,

@@ -84,7 +86,7 @@
 which is non-portable, but still popular intermediate representation. One
 important feature, that bytecode and assembler lack, is an ability to
 communicate between parallel processes. This communication is the common
 denominator on top of which all the frameworks and languages for parallel and
-distributed computations can be built, and there is no universal low-level
+distributed computations can be built, however, there is no universal low-level
 protocol or language that describes communication.

@@ -97,7 +99,7 @@
 Why common low-level language exists for sequential computations, but does not
 contrast to unpopular functional languages in which programmes are written as
 compositions of functions with no implied order of computation. Another reason
 which applies to distributed computations is the fact that these computations
 are inherently unreliable and there is no universal approach for handling
-cluster node failures.
 While imperative languages produce more efficient
 programmes, they do not provide safety from deadlocks and fault tolerance
 guarantees. Also, they are much more difficult to write, as a human have to
 work with mutable state (local and global variables, objects etc.) and it is

@@ -151,8 +153,8 @@
 sequential programmes, we need the following components. The closest
 single-node counterpart is operating system kernel that executes user
 processes.
 \item High-level interface that wraps the low-level language for existing
- programming languages in a form of a framework or a library. This interface
- uses cluster scheduler if it is available and node parallelism is needed by
+ popular programming languages in a form of a framework or a library. This interface
+ uses cluster scheduler, if it is available and node parallelism is needed by
 the application, otherwise the code is executed on the local node and
 parallelism is limited to the parallelism of the node. The closest
 single-node counterpart is C library that provides an interface to system

@@ -187,7 +189,7 @@
 from the scheduler. Although, high-level interface is built on top of the
 low-level interface, batch job scheduler is fully integrated with neither of
 them: the cluster-wide counterpart of operating system kernel does not have
 control over communication of the applications that are run on the cluster, but
-is used as resource allocator.
+is used as resource allocator instead.
 The situation in newer big data technologies is different: there are the same
 three components with hierarchical structure, but the low-level language is

@@ -196,7 +198,7 @@
 scheduler~\cite{vavilapalli2013yarn} with API that is used as a low-level
 language for parallel and distributed computing, and there are many high-level
 libraries that are built on top of YARN~\cite{hadoop,oozie,spark2016,storm}.
 The scheduler has more control over job execution as jobs are decomposed into
-tasks and execution of tasks is controled by the scheduler.
 Unfortunately, the
+tasks and execution of tasks is controlled by the scheduler. Unfortunately, the
 lack of common low-level language made all high-level frameworks that are
 built on top of YARN API use their own custom protocol for communication,
 shift responsibility of providing fault tolerance to the scheduler and shift

@@ -246,7 +248,7 @@
 the stack can not be copied between cluster nodes or saved to and read from
 the file, because they often contain pointers to memory blocks that may be
 invalid on another cluster node or in the process that reads the stack from
 the file. Second, there is no natural way to express parallelism in this
 language: all
-function calls are syncrhonous and all instructions are executed in the
+function calls are synchronous and all instructions are executed in the
 specified order. In order to solve these problems we use object-oriented
 techniques.

@@ -261,7 +263,7 @@
 call returns (this method takes the corresponding object as an argument). The
 object also has \texttt{read} and \texttt{write} methods that are used to read
 and write its fields to and from file or to copy the object to another cluster
 node. In this model each object contains enough information to perform the
-corresponding funciton call, and we can make these calls in any order we like.
+corresponding function call, and we can make these calls in any order we like.
 Also, the object is self-contained, and we can ask another cluster node to
 perform the function call or save the object to disk to perform the call when
 the user wants to resume the computation (e.g.~after the computer is upgraded

@@ -289,7 +291,7 @@
 This low-level language can be seen as an adaptation of classic function call
 stack, but with asynchronous function calls and an ability to read and write
 stack frames. This differences give kernels the following advantages.
 \begin{itemize}
- \item Kernels define depedencies between function calls, but do not define
+ \item Kernels define dependencies between function calls, but do not define
 the order of computation. This gives natural way of expressing parallel
 computations on the lowest possible level.
 \item Kernels can be written to and read from any medium: files, network

@@ -310,7 +312,7 @@ stack frames. This differences give kernels the following advantages.
 abstraction is higher, therefore, compiler modification is possible
 only for languages that use high-level intermediate representation
 (e.g.~LISP-like languages and purely functional languages that have
- natural way of expressing parallelism by computing arguments to
+ natural way of expressing parallelism by computing arguments of
 functions in parallel).
 \end{itemize}

@@ -320,7 +322,7 @@
 Kernels are general enough to write any programme, and the first programme that
 we wrote using them was cluster scheduler that uses kernels to implement its
 internal logic and to run applications spanning multiple cluster nodes.
 Single-node version of the scheduler is as simple as thread pool attached to
-kernel queue decribed in section~\ref{sec-kernels}. The programme starts with
+kernel queue described in section~\ref{sec-kernels}. The programme starts with
 pushing the first (or \emph{main}) kernel to the queue and ends when the main
 kernel changes its state to \textit{downstream} and pushes itself to the queue.
 The number of threads in the pool equals the number of processor cores, but can

@@ -330,7 +332,7 @@
 scheduler is more involved and uses kernels to implement its logic. Cluster
 scheduler runs in a separate daemon process on each cluster node, and
 processes communicate with each other using kernels: process on node \(A\)
 writes some kernel to network connection with node \(B\) and process on node
-\(B\) read the kernel and performs useful operation with it.
Here kernels are +\(B\) reads the kernel and performs useful operation with it. Here kernels are used like messages rather than stack frames: kernel that always resides in node \(A\) creates child message kernel and sends it to the kernel that always resides in node \(B\). In order to implement this logic we added @@ -344,7 +346,7 @@ the location of the target kernel. The first tuple is also used by \textit{downstream} kernels that return back to their parents, but the second tuple is used only by \textit{point-to-point} kernels. -There several responsibilities of cluster scheduler: +There are several responsibilities of cluster scheduler: \begin{itemize} \item run applications in child processes, \item route kernels between application processes running on different cluster nodes, @@ -354,13 +356,13 @@ In order to implement them we created a kernel queue and a thread pool for each concern that the scheduler has to deal with (see~figure~\ref{fig-local-routing}): we have \begin{itemize} \item timer queue for scheduled and periodic tasks, - \item network queue for sending to and receiving kernels from other cluster nodes, - \item process queue for creating child processes and sending to and receiving - kernels from them, and + \item network queue for sending kernels to and receiving from other cluster nodes, + \item process queue for creating child processes and sending kernels to and receiving + from them, and \item the main queue for processing kernels in parallel using multiple processor cores. \end{itemize} This separation of concerns allows us to overlap data transfer and data processing: -while the main queue processes kernels in parallel, process queue and network queue +while the main queue processes kernels in parallel, process and network queues send and receive other kernels. 
This approach leads to small amount of oversubscription as separate threads are used to send and receive kernels, but benchmarks showed that this is not a big problem as most of the time these threads wait for the operating @@ -394,10 +396,10 @@ system kernel to transfer the data. \path[draw,thick] (process) -- (parent); \path[draw,thick] (process) -- (child); \end{tikzpicture} - \caption{Default kernel queues for each concern.\label{fig-local-routing}} + \caption{Default kernel queues for each cluster scheduler concern.\label{fig-local-routing}} \end{figure} -Cluster scheduler runs applications in child proecesses; this approach is +Cluster scheduler runs applications in child processes; this approach is natural for UNIX-like operating systems as the parent process has full control of its children: the amount of resources can be limited (the number of processor cores, the amount of memory etc.) and the process can be terminated @@ -406,7 +408,7 @@ cluster-wide source (target) application identifier field that uniquely identifies the source (the target) application from which the kernel originated (to which the kernel was sent). Also each kernel carries an application descriptor that specifies how to run the application (command line arguments, -environment variable etc.) and if the application is not running, it is +environment variables etc.) and if the corresponding process is not running, it is launched automatically by the scheduler. Child processes are needed only as means of controlling resources usage: a process is a scheduling unit for operating system kernel, but in cluster scheduler a child process performs @@ -419,7 +421,7 @@ received. This behaviour allows us to implement dynamic parallelism: we do not need to specify the number of parallel processes on application launch, the scheduler will automatically create them. 
 To reduce memory consumption stale
 processes, that have not received any kernel for a long period of time, may be
-terminated (they will be launched automatically, when the kernel arrives
+terminated (new processes will be created automatically, when the kernel arrives
 anyway). Kernels can be sent from one application to another by specifying
 different application descriptor.

@@ -473,7 +475,7 @@
 the parent node is generally different than the number of nodes behind the
 child nodes. In order to solve this problem we track not only the total weight
 of all kernels of the neighbouring node, but the total weight of each node in
 the cluster and sum the weight of all nodes that are behind the node \(A\) to
-compute the total weight of node \(A\) for load balancing. Also, we introduced
+compute the total weight of node \(A\) for load balancing. Also, we apply
 load balancing recursively: when the kernel arrives at the node, load
 balancing algorithm is executed once more to decide whether the kernel can be
 sent locally or to another cluster node (except the source node). This approach

@@ -488,10 +490,10 @@
 communication protocol between its daemon processes running on different
 cluster nodes. Daemon process acts as an intermediary between application
 processes running on different cluster nodes, and all application kernels are
 sent through this process. Kernels that are sent through the scheduler are
-hwavy-weight: they have more fields than local kernels and the routing through
+heavy-weight: they have more fields than local kernels and the routing through
 the scheduler introduces multiple overheads compared to direct communication.
 However, using cluster scheduler hugely simplifies application development, as
-application developer does not need worry about networking, fault tolerance,
+application developer does not need to worry about networking, fault tolerance,
 load balancing and ``how many parallel processes are enough for my
 application'': this is now handled by the scheduler. For maximum efficiency
 and embedded applications the application can be linked directly to the scheduler

@@ -539,14 +541,14 @@
 We made a reference implementation of kernels for Guile
 language~\cite{galassi2002guile}. Guile is a dialect of
 Scheme~\cite{sussman1998} which in turn is a dialect of
 LISP~\cite{mccarthy1960}. The distinct feature of LISP-like languages is
-homoiconicity, i.e.~the code and the data is reprensented by tree-like
+homoiconicity, i.e.~the code and the data is represented by tree-like
 structure (lists that may contain other lists as elements). This feature makes
 it possible to express parallelism directly in the language: every list element
 can be computed independently and it can be sent to other cluster nodes for
 parallel computation. To implement parallelism we created a Guile interpreter
 that evaluates every list element in parallel using kernels. In practice this
-means that every argument of a function call (a function call is also a list
-with the first element being the function name) is computed in parallel. This
+means that every argument of a procedure call (a procedure call is also a list
+with the first element being the procedure name) is computed in parallel.
 This
 interpreter is able to run any existing Guile programme (provided that it does
 not use threads, locks and semaphores explicitly) and the output will be the
 same as with the original interpreter, the programme will automatically use

@@ -563,22 +565,45 @@
 Inside \textit{eval} we use hybrid approach for parallelism: we use kernels to
 evaluate arguments of procedure calls and arguments of \texttt{cons} primitive
 asynchronously only if these arguments contain other procedure calls. This
 means that all simple arguments (variables, symbols, other primitives etc.) are
-computed sequentially without creating child kernels. Evaluating procedure
-calls and \texttt{cons} using kernels is enough to make \texttt{map} and
-\texttt{fold} forms parallel, and on these forms many other forms are
-based~\cite{hutton-fold}. Our implementations of these forms are shown in
-listing~\ref{lst-parallel-forms}.
+computed sequentially without creating child kernels.
+
+Evaluating procedure calls and \texttt{cons} using kernels is enough to make
+\texttt{map} form parallel, but we had to rewrite \texttt{fold} form to make it
+parallel. Our parallelism is based on the fact that procedure arguments can be
+evaluated in parallel without affecting the correctness of the procedure,
+however, evaluating arguments in parallel in \texttt{fold} does not give
+speedup because of the nested \texttt{fold} and \texttt{proc} calls: the next
+recursive call to \texttt{fold} waits until the call to \texttt{proc}
+completes. Alternative procedure \texttt{fold-pairwise} does not have this
+deficiency, but is only correct for \texttt{proc} that does not care about the
+order of the arguments (\texttt{+}, \texttt{*} operators etc.). In this
+procedure we apply \texttt{proc} to successive pairs of elements from the
+initial list, after that we recursively call \texttt{fold-pairwise} for the
+resulting list.
 The iteration is continued until only one element is left in
+the list, then we return this element as the result of the procedure call. This
+new procedure is also iterative, but parallel inside each iteration. We choose
+\texttt{map} and \texttt{fold} forms to illustrate automatic parallelism
+because many other forms are based on them~\cite{hutton-fold}. Our
+implementation is shown in listing~\ref{lst-parallel-forms}.

 \begin{lstlisting}[language=Scheme,%
   caption={Parallel \texttt{map} and \texttt{fold} forms in Guile.},
   captionpos=b,
   label={lst-parallel-forms}]
-(define (map proc lst)
+(define (map proc lst) "Parallel map."
   (if (null? lst) lst
     (cons (proc (car lst)) (map proc (cdr lst)))))
-(define (fold proc init lst)
+(define (fold proc init lst) "Sequential fold."
   (if (null? lst) init
     (fold proc (proc (car lst) init) (cdr lst))))
+(define (do-fold-pairwise proc lst)
+  (if (null? lst) '()
+    (if (null? (cdr lst)) lst
+      (do-fold-pairwise proc
+        (cons (proc (car lst) (car (cdr lst)))
+          (do-fold-pairwise proc (cdr (cdr lst))))))))
+(define (fold-pairwise proc lst) "Parallel pairwise fold."
+  (car (do-fold-pairwise proc lst)))
 \end{lstlisting}

 \section{Results}

@@ -598,6 +623,23 @@
 in Guile with kernels and in Guile without kernels.

 \section{Conclusion}

+Using procedure arguments to define parallel programme parts gives new
+perspective on writing parallel programmes. In imperative languages programmers
+are used to rearranging loops to optimise memory access patterns and help the
+compiler vectorise the code, but with parallel-arguments approach in functional
+languages they can rearrange the forms to help the interpreter to extract more
+parallelism. This parallelism is automatic and does not affect the correctness
+of the programme (of course, you need to serialise access and modification of
+the global variables). With help of kernels these parallel computations are
+extended to distributed computations.
 Kernels provide standard way of
+expressing parallel and distributed programme parts, automatic fault tolerance
+for master and worker nodes and automatic load balancing via cluster scheduler.
+Together kernels and arguments-based parallelism provide low- and high-level
+programming interface for clusters and single nodes that conveniently hide
+the shortcomings of parallel and distributed computations allowing the
+programmer to focus on the actual problem being solved rather than fixing
+bugs in his or her parallel and distributed code.
+
 \subsubsection*{Acknowledgements.}
 Research work is supported by Council for grants of the President of the
 Russian Federation (grant no.~MK-383.2020.9).
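
The central addition of this patch is the pairwise fold: combine adjacent pairs of list elements repeatedly until one value remains, which is correct only for order-insensitive `proc`. As a quick sanity check of that algorithm, here is a small Python translation (illustrative only; the patch's reference implementation is the Guile Scheme in the listing above, and the names below are ours). Each pass over the pairs is the part that the patched interpreter would evaluate in parallel:

```python
import operator
from functools import reduce

def fold_pairwise(proc, lst):
    """Iterative equivalent of the patch's fold-pairwise: reduce adjacent
    pairs until a single value remains. Requires a non-empty list and a
    proc that is associative and commutative (e.g. + or *)."""
    if not lst:
        raise ValueError("fold-pairwise needs a non-empty list")
    while len(lst) > 1:
        # Combine successive pairs; a leftover odd element is carried over
        # unchanged to the next round, as in the Scheme version.
        pairs = [proc(lst[i], lst[i + 1]) for i in range(0, len(lst) - 1, 2)]
        if len(lst) % 2 == 1:
            pairs.append(lst[-1])
        lst = pairs
    return lst[0]

if __name__ == "__main__":
    print(fold_pairwise(operator.add, list(range(10))))  # 45
    print(fold_pairwise(operator.mul, [1, 2, 3, 4, 5]))  # 120
```

For associative and commutative `proc` the result matches a left fold (`reduce`), while the pair rounds remove the sequential dependency between successive `proc` calls that makes the ordinary `fold` inherently serial.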