commit a92f0288aafa6c7fef1ff8537b431f3e1eda82c7
parent 24b076991cca5be8a2da5f949176f525ae94fed5
Author: Ivan Gankevich <i.gankevich@spbu.ru>
Date: Wed, 14 Apr 2021 13:52:08 +0300
Methods 1.
Diffstat:
main.tex | 99 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 97 insertions(+), 2 deletions(-)
diff --git a/main.tex b/main.tex
@@ -79,7 +79,7 @@ realised this potential to get all their advantages; people realised the full
potential of imperative languages, but do not know how to get rid of their
disadvantages.
-In this paper we describe low-level language and protocol called \emph{kernels}
+In this paper we describe a low-level language and protocol based on \emph{kernels}
which is suitable for distributed and parallel computations. Kernels provide
automatic fault tolerance and can be used to exchange the data between
programmes written in different languages. We implement kernels in C++ and
@@ -89,7 +89,7 @@ intermediate representation for Guile programming language, run benchmarks
using the scheduler and compare the performance of different implementations of
the same programme.
-\cite{lang-virt}
+TODO \cite{lang-virt}
%\cite{fetterly2009dryadlinq}
%\cite{wilde2011swift}
%\cite{pinho2014oopp}
@@ -97,6 +97,101 @@ the same programme.
\section{Methods}
+\subsection{Parallel and distributed computing technologies as components of a
+cluster operating system}
+
+In order to write parallel and distributed programmes the same way as we write
+sequential programmes, we need the following components.
+\begin{itemize}
+ \item A low-level language that acts as an intermediate portable representation of
+ the code and the data and includes the means to decompose both into
+ parts that can be computed in parallel. The closest sequential
+ counterpart is LLVM.
+ \item A cluster scheduler that executes
+ parallel and distributed applications and uses the low-level language to implement
+ communication between these applications running on different cluster nodes.
+ The closest single-node counterpart is the operating system kernel that executes
+ user processes.
+ \item A high-level interface that wraps the low-level language for existing
+ programming languages in the form of a framework or a library. This interface
+ uses the cluster scheduler if it is available and the application needs
+ multiple nodes; otherwise the code is executed on the local node and
+ parallelism is limited to that node. The closest
+ single-node counterpart is the C library that provides an interface to the
+ system calls of the operating system kernel.
+\end{itemize}
+These three components are built on top of each other as in the classical
+object-oriented programming approach, and all the complexity is pushed to the
+lowest layer. The low-level language is responsible for providing parallelism
+and fault tolerance to the applications, the cluster scheduler uses these
+facilities to provide transparent execution of the applications on multiple
+cluster nodes, and the high-level interface maps the underlying system to the
+target language to simplify the work of application programmers.
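+
+To make the single-node analogy concrete, the following sketch shows how the C
+library hides the raw entry point of the operating system kernel; in the same
+way the high-level interface is meant to hide the low-level language from the
+application programmer. The snippet is illustrative only and Linux-specific.
+\begin{verbatim}
+#include <unistd.h>       // C library wrapper: write()
+#include <sys/syscall.h>  // raw kernel entry point: SYS_write
+
+int main() {
+    // high-level interface: the C library function...
+    write(1, "hello\n", 6);
+    // ...wraps the low-level kernel interface:
+    syscall(SYS_write, 1, "hello\n", 6);
+    return 0;
+}
+\end{verbatim}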
+
+High-performance computing technologies have the same three-component
+structure: the message passing library (MPI) is widely considered the low-level
+language of parallel computing, batch job schedulers are used to allocate
+resources, and the high-level interface is some library that is built on top of
+MPI; however, the responsibilities of the components are not clearly separated
+and the hierarchical structure is not maintained. MPI provides means of
+communication between parallel processes, but does not provide data
+decomposition facilities and fault tolerance guarantees: data decomposition is
+done either in the high-level language or manually, and fault tolerance is
+provided by the batch job scheduler. Batch job schedulers provide means to
+allocate resources (cluster nodes, processor cores, memory etc.) and launch
+parallel MPI processes, but have no control over the messages that are sent
+between these processes and do not control the actual number of resources used
+by the programme (all resources are exclusively owned by the programme),
+i.e.~cluster schedulers and MPI programmes do not talk to each other after the
+parallel processes have been launched. Consequently, the high-level interface
+is also separated from the scheduler. Although the high-level interface is
+built on top of the low-level one, the batch job scheduler is fully integrated
+with neither of them: the cluster-wide counterpart of the operating system
+kernel has no control over the communication of the applications that run on
+the cluster, but is used only as a resource allocator.
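+
+As an illustration of manual data decomposition, the following minimal MPI
+programme sums the integers from $0$ to $n-1$: the mapping of indices to
+processes is written by hand, and the failure of any single process typically
+aborts the whole job.
+\begin{verbatim}
+#include <mpi.h>
+#include <cstdio>
+
+int main(int argc, char** argv) {
+    MPI_Init(&argc, &argv);
+    int rank = 0, size = 1;
+    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
+    MPI_Comm_size(MPI_COMM_WORLD, &size);
+    // manual decomposition: each process picks its own slice
+    const long n = 1000000;
+    long begin = rank*n/size, end = (rank+1)*n/size;
+    double sum = 0;
+    for (long i = begin; i < end; ++i) { sum += i; }
+    // no fault tolerance: if any process dies, the job aborts
+    double total = 0;
+    MPI_Reduce(&sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0,
+               MPI_COMM_WORLD);
+    if (rank == 0) { std::printf("sum = %.0f\n", total); }
+    MPI_Finalize();
+    return 0;
+}
+\end{verbatim}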
+
+The situation in newer big data technologies is different: the same three
+components with a hierarchical structure are present, but the low-level
+language is missing. There are many high-level libraries that are integrated
+with the YARN cluster scheduler (TODO cite). The scheduler has more control
+over job execution, as jobs are decomposed into tasks and the execution of
+tasks is controlled by the scheduler. Unfortunately, the lack of a common
+low-level language forces every high-level framework built on top of the YARN
+API to use its own custom communication protocol, to shift the responsibility
+for fault tolerance to the scheduler and the responsibility for data
+decomposition to the higher-level frameworks.
+
+To summarise, the current state-of-the-art technologies for parallel and
+distributed computing can be divided into three classes: low-level languages,
+cluster schedulers and high-level interfaces; however, responsibilities of each
+class are not clearly separated by the developers of these technologies.
+Although the structure of the components resembles the operating system kernel
+and its application interface, the components are sometimes not built on top of
+each other but integrated horizontally, and as a result the complexity of
+parallel and distributed computations is sometimes visible at the highest
+levels of abstraction.
+
+Our proposal is to design a low-level language and a protocol for data exchange
+that provide fault tolerance and the means of data and code decomposition and
+of communication for parallel and distributed applications. With such a
+language at hand it is easy to build higher-level components, because the
+complexity of cluster systems is hidden from the programmer, the duplicated
+effort of implementing the same facilities in higher-level interfaces is
+reduced, and the cluster scheduler has full control over programme execution,
+as it speaks the same protocol and uses the same low-level language internally:
+the language is general enough to write any distributed programme, including
+the scheduler itself.
+
+\subsection{Kernels as objects that control the programme flow}
+
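+As a first approximation, a kernel in such a language can be thought of as an
+object with two entry points: one that is called when the scheduler executes
+the kernel, and one that is called when a child kernel returns with its result.
+The following C++ sketch is illustrative only; the class and method names are
+our assumptions, not the final interface.
+\begin{verbatim}
+// Illustrative sketch; all names are hypothetical.
+class Kernel {
+public:
+    virtual ~Kernel() = default;
+    // Called by the scheduler to run this kernel; here the
+    // kernel decomposes its work into child kernels.
+    virtual void act() {}
+    // Called when a child kernel completes; here the parent
+    // combines the partial results of its children.
+    virtual void react(Kernel* child) {}
+    Kernel* parent = nullptr;  // where to send the result
+};
+\end{verbatim}
+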
+\subsection{Reference cluster scheduler based on kernels}
+
+TODO
+
+\subsection{Kernels as intermediate representation for Guile language}
+
+TODO
+
\section{Results}
\section{Discussion}