iccsa-21-guile

git clone https://git.igankevich.com/iccsa-21-guile.git

commit e7356e59fe198f916c18176b3cc6f576d039a796
parent 160fa5db54c2d3f7d9fee7941fcd42375a981037
Author: Ivan Gankevich <i.gankevich@spbu.ru>
Date:   Fri, 16 Apr 2021 15:42:41 +0300

cluster scheduler

Diffstat:
main.bib | 42++++++++++++++++++++++++++++++++++++++++++
main.tex | 165+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
2 files changed, 200 insertions(+), 7 deletions(-)

diff --git a/main.bib b/main.bib
@@ -177,3 +177,45 @@
  version = {2.2.0},
  date    = {2021-04-15}
}

@InProceedings{gankevich2015subord,
  author    = {Gankevich, Ivan and Tipikin, Yuri and Gaiduchok, Vladimir},
  booktitle = {International Conference on High Performance Computing Simulation (HPCS)},
  title     = {Subordination: Cluster management without distributed consensus},
  year      = {2015},
  pages     = {639--642},
  keywords  = {Clustering algorithms; Computers; Heuristic algorithms; IP networks;
               Network topology; Nominations and elections; Topology; cluster accounting;
               cluster management; cluster monitoring; job scheduling; leader election},
  doi       = {10.1109/HPCSim.2015.7237106}
}

@InProceedings{gankevich2016factory,
  author    = {I. Gankevich and Y. Tipikin and V. Korkhov and V. Gaiduchok},
  booktitle = {International Conference on High Performance Computing Simulation (HPCS'16)},
  title     = {Factory: Non-stop batch jobs without checkpointing},
  year      = {2016},
  month     = {July},
  pages     = {979--984},
  doi       = {10.1109/HPCSim.2016.7568441},
  isbn      = {978-1-5090-2088-1}
}

@InProceedings{gankevich2017subord,
  author    = {I. Gankevich and Y. Tipikin and V. Korkhov},
  booktitle = {Proceedings of International Conference on High Performance Computing
               Simulation (HPCS'17)},
  title     = {Subordination: Providing Resilience to Simultaneous Failure of Multiple
               Cluster Nodes},
  year      = {2017},
  month     = {July},
  pages     = {832--838},
  publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
  address   = {NJ, USA},
  doi       = {10.1109/HPCS.2017.126},
  isbn      = {978-1-5386-3250-5}
}

diff --git a/main.tex b/main.tex
@@ -186,6 +186,7 @@
internally: the language is general enough to write any distributed programme
including the scheduler itself.

\subsection{Kernels as objects that control the programme flow}
\label{sec-kernels}

In order to create a low-level language for parallel and distributed computing
we borrow familiar features from sequential low-level languages and augment them

@@ -228,7 +229,7 @@
perform the function call or save the object to disk to perform the call when
the user wants to resume the computation (e.g.~after the computer is upgraded
and rebooted).

The function calls are made asynchronous with the help of a thread pool. Each
thread pool consists of an object queue and an array of threads. When an object
is placed in the queue, one of the threads extracts it and calls its
\texttt{act} or \texttt{react} method depending on the state of the object.
There are two

@@ -248,7 +249,7 @@
the parent kernel over the network.

This low-level language can be seen as an adaptation of the classic function
call stack, but with asynchronous function calls and an ability to read and
write stack frames. These differences give kernels the following advantages.
\begin{itemize}
    \item Kernels define dependencies between function calls, but do not
    define the order of computation. This gives a natural way of expressing
    parallel

@@ -267,16 +268,166 @@
    \item Finally, kernels are simple enough to be used as an intermediate
    representation for high-level languages: either via a compiler
    modification, or via a wrapper library that calls the low-level
    implementation directly. Kernels cannot replace LLVM or assembler,
    because their level of abstraction is higher; therefore, compiler
    modification is possible only for languages that use a high-level
    intermediate representation (e.g.~LISP-like languages and purely
    functional languages that have a natural way of expressing parallelism
    by computing arguments to functions in parallel).
\end{itemize}
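To make the kernel interface concrete, the following sketch shows what a kernel
with \texttt{act} and \texttt{react} methods and \textit{upstream} and
\textit{downstream} states might look like in C++. The class and field names
are illustrative only; they are not taken from the actual scheduler source code.

\begin{verbatim}
// Illustrative sketch only: the names are hypothetical, not the
// actual API of the scheduler described in this paper.
enum class kernel_state { upstream, downstream };

class kernel {
public:
    virtual ~kernel() = default;
    // called by a worker thread when the kernel is first
    // extracted from the queue
    virtual void act() {}
    // called on the parent when a child kernel returns
    // in the downstream state
    virtual void react(kernel* /*child*/) {}

    kernel_state state() const { return state_; }
    void state(kernel_state s) { state_ = s; }
    kernel* parent() const { return parent_; }
    void parent(kernel* p) { parent_ = p; }

private:
    kernel_state state_ = kernel_state::upstream;
    kernel* parent_ = nullptr;
};

// A worker thread dispatches on the state of the kernel:
// new (upstream) kernels are acted upon, returning (downstream)
// kernels are delivered to their parent's react method.
inline void dispatch(kernel* k) {
    if (k->state() == kernel_state::downstream && k->parent()) {
        k->parent()->react(k);
    } else {
        k->act();
    }
}
\end{verbatim}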
\subsection{Reference cluster scheduler based on kernels}

Kernels are general enough to write any programme, and the first programme that
we wrote using them was a cluster scheduler that uses kernels to implement its
internal logic and to run applications spanning multiple cluster nodes. The
single-node version of the scheduler is as simple as a thread pool attached to
the kernel queue described in section~\ref{sec-kernels}. The programme starts
with pushing the first (or \emph{main}) kernel to the queue and ends when the
main kernel changes its state to \textit{downstream} and pushes itself to the
queue. The number of threads in the pool equals the number of processor cores,
but can be set manually to limit the amount of parallelism. The cluster version
of the scheduler is more involved and uses kernels to implement its logic.
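Before turning to the cluster version, the single-node case can be sketched as
follows, continuing the hypothetical C++ interface above and using a single
thread instead of a pool for brevity: the programme ends when the main kernel
reappears in the queue in the \textit{downstream} state.

\begin{verbatim}
#include <deque>

// the kernel queue of the single-node scheduler (sketch)
std::deque<kernel*> queue;

void run(kernel* main_kernel) {
    queue.push_back(main_kernel);
    while (!queue.empty()) {
        kernel* k = queue.front();
        queue.pop_front();
        // the programme ends when the main kernel comes back
        // to the queue in the downstream state
        if (k == main_kernel && k->state() == kernel_state::downstream)
            break;
        dispatch(k);  // calls act() or parent()->react(), see above
    }
    delete main_kernel;
}

// a trivial main kernel: do the work in act(), then go downstream
struct main_kernel_t : kernel {
    void act() override {
        // ... create and enqueue child kernels here ...
        state(kernel_state::downstream);
        queue.push_back(this);
    }
};

int main() { run(new main_kernel_t); }
\end{verbatim}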
The cluster scheduler runs as a separate daemon process on each cluster node,
and these processes communicate with each other using kernels: the process on
node \(A\) writes a kernel to the network connection with node \(B\), and the
process on node \(B\) reads the kernel and performs some useful operation with
it. Here kernels are used as messages rather than stack frames: a kernel that
always resides on node \(A\) creates a child message kernel and sends it to a
kernel that always resides on node \(B\). In order to implement this logic we
added a \textit{point-to-point} state and a field that specifies the identifier
of the target kernel. In addition, we added source and destination address
fields to be able to route the kernel to the target cluster node and return it
back to the source node: the \((\text{parent-kernel},\text{source-address})\)
tuple uniquely identifies the location of the parent kernel, and the
\((\text{target-kernel},\text{destination-address})\) tuple uniquely identifies
the location of the target kernel. The first tuple is also used by
\textit{downstream} kernels that return to their parents, whereas the second
tuple is used only by \textit{point-to-point} kernels.

The cluster scheduler has several responsibilities:
\begin{itemize}
    \item run applications in child processes,
    \item route kernels between application processes running on different
    cluster nodes,
    \item maintain a list of available cluster nodes.
\end{itemize}
In order to implement them we created a kernel queue and a thread pool for each
concern that the scheduler has to deal with:
\begin{itemize}
    \item a timer queue for scheduled and periodic tasks,
    \item a network queue for sending kernels to and receiving kernels from
    other cluster nodes,
    \item a process queue for creating child processes and sending kernels to
    and receiving kernels from them, and
    \item the main queue for processing kernels in parallel using multiple
    processor cores.
\end{itemize}
This separation of concerns allows us to overlap data transfer and data
processing: while the main queue processes kernels in parallel, the process and
network queues send and receive other kernels. This approach leads to a small
amount of oversubscription, because separate threads are used to send and
receive kernels, but benchmarks showed that this is not a big problem: most of
the time these threads wait for the operating system kernel to transfer the
data.

The cluster scheduler runs applications in child processes; this approach is
natural for UNIX-like operating systems, as the parent process has full control
of its children: the amount of resources can be limited (the number of
processor cores, the amount of memory, etc.) and the process can be terminated
at any time. After introducing child processes into the scheduler we added
cluster-wide source and target application identifier fields that uniquely
identify the application from which the kernel originated and the application
to which the kernel was sent. Each kernel also carries an application
descriptor that specifies how to run the application (command line arguments,
environment variables, etc.); if the application is not running, it is launched
automatically by the scheduler. Child processes are needed only as a means of
controlling resource usage: a process is the scheduling unit of the operating
system kernel, but in the cluster scheduler a child process performs something
useful only when a kernel (the scheduling unit of our scheduler) is sent to the
corresponding application, which is launched automatically if it is not yet
running. An application spans multiple cluster nodes and may have any number of
child processes (but no more than one process per node). These processes are
launched on demand and do nothing until a kernel is received. This behaviour
allows us to implement dynamic parallelism: we do not need to specify the
number of parallel processes when the application is launched, the scheduler
creates them automatically. To reduce memory consumption, stale processes that
have not received any kernel for a long period of time may be terminated (they
are launched again automatically when a kernel arrives). Kernels can be sent
from one application to another by specifying a different application
descriptor.

A child process communicates with its parent using an optimised child process
queue. If the parent process does not want to communicate, the child process
continues execution on the local node: applications written using the cluster
scheduler interface work correctly even when the scheduler is not available,
they simply use the local node instead of the cluster.

Since a node may have multiple child processes, we may have a situation where
all of them try to use all processor cores, which leads to oversubscription and
suboptimal performance. In order to solve this problem, we introduced a weight
field which tells how many threads will be used by the kernel. The default is
one thread for ordinary kernels and zero threads for cluster scheduler kernels.
The process queue tracks the total weight of the kernels that were sent to
child processes and queues incoming kernels when this weight reaches the number
of processor cores. Each cluster node also reports this information to the
other nodes for better load balancing decisions.
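The weight accounting described above can be sketched as follows, again with
hypothetical names and continuing the kernel sketch from before: the process
queue admits a kernel to a child process only while the total weight of
in-flight kernels stays below the number of processor cores, otherwise the
kernel waits.

\begin{verbatim}
#include <deque>
#include <utility>

class process_queue_sketch {
public:
    explicit process_queue_sketch(unsigned cores): cores_(cores) {}

    // called when a kernel is sent to a child process;
    // weight is the number of threads the kernel will use
    void push(kernel* k, unsigned weight) {
        if (total_weight_ + weight > cores_) {
            waiting_.push_back({k, weight}); // hold until cores free up
        } else {
            total_weight_ += weight;
            send_to_child_process(k);        // hypothetical helper
        }
    }

    // called when a kernel returns from a child process
    void on_return(unsigned weight) {
        total_weight_ -= weight;
        // admit waiting kernels while there is room
        while (!waiting_.empty() &&
               total_weight_ + waiting_.front().second <= cores_) {
            auto next = waiting_.front();
            waiting_.pop_front();
            total_weight_ += next.second;
            send_to_child_process(next.first);
        }
    }

private:
    void send_to_child_process(kernel*) { /* write to the child's pipe */ }

    unsigned cores_;
    unsigned total_weight_ = 0;
    std::deque<std::pair<kernel*, unsigned>> waiting_;
};
\end{verbatim}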
The scheduler acts as a router for the kernels: when a kernel is received from
an application, the scheduler analyses its fields and decides to which cluster
node it can be sent. If the kernel has the \textit{downstream} or
\textit{point-to-point} state, it is sent to the node where the target kernel
resides; if the kernel has the \textit{upstream} state, a load balancing
algorithm decides which node to send it to. The load balancing algorithm tracks
the total weight of the kernels that were sent to each node and also receives
the same information from that node (in case other nodes send their kernels
there as well); it then chooses the node with the lowest weight and sends the
kernel to this node. If all nodes are full, the kernel is retained in the queue
until enough processor cores become available. The algorithm is deliberately
conservative and does not use work stealing to improve performance; however,
fault tolerance is easier to
implement~\cite{gankevich2016factory,gankevich2017subord} when the target and
source node fields do not change during the kernel lifetime, which is not the
case in a work-stealing scenario.
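The node selection step of this algorithm can be sketched as follows; the names
are hypothetical, and the real scheduler additionally uses the recursive
per-subtree weights described below.

\begin{verbatim}
#include <cstdint>
#include <optional>
#include <vector>

struct node_info {
    std::uint32_t address;    // placeholder for the node address
    unsigned total_weight;    // weight of kernels already sent there
    unsigned num_cores;
};

// choose the neighbour with the lowest total weight that still has
// room for the kernel; return nothing if all nodes are full, in
// which case the kernel is retained in the queue
std::optional<std::uint32_t>
choose_node(const std::vector<node_info>& nodes, unsigned kernel_weight) {
    const node_info* best = nullptr;
    for (const auto& n : nodes) {
        if (n.total_weight + kernel_weight > n.num_cores)
            continue;                               // node is full
        if (best == nullptr || n.total_weight < best->total_weight)
            best = &n;
    }
    if (best == nullptr)
        return std::nullopt;
    return best->address;
}
\end{verbatim}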
The last but not least responsibility of the scheduler is to discover and
maintain a list of cluster nodes and to establish persistent network
connections to neighbouring nodes. The cluster scheduler does this
automatically by scanning the network using an efficient algorithm: the nodes
in the network are organised into an artificial tree topology with a specified
fan-out value, and each node tries to communicate with the nodes that are
closer to the root of the tree. This approach significantly reduces the amount
of data that needs to be sent over the network to find all cluster nodes: in
the ideal case only one kernel is sent to and received from the parent node.
The algorithm is described in~\cite{gankevich2015subord}. After the connections
are established, all the \textit{upstream} kernels that are received from the
applications' child processes are routed to neighbouring nodes in the tree
topology (both parent and child nodes). This creates a problem because the
number of nodes ``behind'' the parent node is generally different from the
number of nodes behind the child nodes. In order to solve this problem we track
not only the total kernel weight of each neighbouring node, but the total
weight of every node in the cluster, and we sum the weights of all nodes that
are behind node \(A\) to compute the total weight of node \(A\) for load
balancing. We also apply load balancing recursively: when the kernel arrives at
a node, the load balancing algorithm is executed once more to decide whether
the kernel can be processed locally or should be sent to another cluster node
(except the source node). This approach solves the problem, and now
applications can be launched not only on the root node, but on any node,
without load balancing problems. It adds a small overhead, as the kernel goes
through an intermediate node, but if the overhead is undesirable, the
application can be launched on the root node. Node discovery and node state
updates are implemented using \textit{point-to-point} kernels.

To summarise, the cluster scheduler uses kernels as the unit of scheduling and
as the communication protocol between its daemon processes running on different
cluster nodes. The daemon process acts as an intermediary between application
processes running on different cluster nodes, and all application kernels are
sent through this process. Kernels that are sent through the scheduler are
heavy-weight: they have more fields than local kernels, and routing through the
scheduler introduces additional overhead compared to direct communication.
However, using the cluster scheduler greatly simplifies application
development, as the application developer does not need to worry about
networking, fault tolerance, load balancing or ``how many parallel processes
are enough for my application'': all of this is handled by the scheduler. For
maximum efficiency, or for embedded use, the application can be linked directly
to the scheduler and run in the same daemon process; that way the overhead of
the scheduler is minimal.

\subsection{Kernels as intermediate representation for Guile language}