hpcs-17-subord

git clone https://git.igankevich.com/hpcs-17-subord.git

commit 430b7998dc7d713f031351d3cad6d57b15c8bdc5
parent eb2c73c6f6ef54e4bc257fe8de765e84515e15e6
Author: Ivan Gankevich <igankevich@ya.ru>
Date:   Tue, 16 May 2017 11:33:39 +0300

Revise the text.

Diffstat:
src/body.tex | 131+++++++++++++++++++++++++++++++++++++++----------------------------------------
src/head.tex | 4++--
src/tail.tex | 16++++++++--------
3 files changed, 75 insertions(+), 76 deletions(-)

diff --git a/src/body.tex b/src/body.tex @@ -3,14 +3,22 @@ Our model of a computer system has a layered architecture (fig.~\ref{fig:pipeline}): +\begin{figure}% + \centering% + \includegraphics{ppl}% + \caption{Mapping of parent and child process pipelines to compute devices. + Solid lines denote aggregation, dashed lines denote mapping between + logical and physical entities.\label{fig:pipeline}} +\end{figure}% + \paragraph{Physical layer} Consists of nodes and direct/routed physical network links. On this layer full network connectivity, i.e. the ability to send a packet from one cluster node to any other, is assumed. \paragraph{Daemon layer} Consists of daemon processes residing on cluster nodes and hierarchical (master/slave) logical links between them. Master and slave -roles are dynamically assigned to daemon processes, any physical cluster node -may become a master or a slave. Dynamic reassignment uses leader election +roles are dynamically assigned to daemon processes, i.e.~any physical cluster +node may become a master or a slave. Dynamic reassignment uses a leader election algorithm that does not require periodic broadcasting of messages, and the role is derived from the node's IP address. A detailed explanation of the algorithm is provided in~\cite{gankevich2015subordination}. Its strengths are scalability to @@ -25,33 +33,34 @@ hierarchy is used to distribute the load from the current node to its neighbours by simply iterating over all directly connected daemons. Upon reconfiguration due to node failure or due to a new node joining the cluster, daemons exchange messages telling each other how many daemons are ``behind'' -them in the hierarchy. This information is used to distribute the load evenly, -even if a parallel programme is launched on a slave node. In addition, this -topology reduces the number of simultaneous connections, thus preventing -network overload. +the corresponding link in the hierarchy. 
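The role-from-IP-address rule mentioned above can be illustrated with a small Python sketch. This is only a plausible variant, not the algorithm from the cited paper: the function name `hierarchy`, the `fanout` parameter, and the exact ranking rule are our assumptions. The point it demonstrates is that every daemon can compute the same hierarchy independently from the sorted address list, so no broadcasting is needed.

```python
import ipaddress

def hierarchy(addrs, fanout=2):
    """Illustrative sketch (not the framework's exact rule): rank daemons
    by IP address and derive each daemon's master from its rank alone.
    Every node runs the same computation and obtains the same result."""
    ranked = sorted(addrs, key=lambda a: ipaddress.ip_address(a))
    masters = {}
    for i, addr in enumerate(ranked):
        if i == 0:
            masters[addr] = None              # lowest address: cluster master
        else:
            masters[addr] = ranked[(i - 1) // fanout]  # deterministic master
    return masters
```

With four nodes and `fanout=2`, `10.0.0.1` becomes the master and the remaining nodes attach to it in a tree of width two; reassignment after a failure is just a recomputation over the surviving addresses.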
This information is used to distribute +the load evenly, even if a parallel programme is launched on a slave node. In +addition, this topology reduces the number of simultaneous connections, thus +preventing network overload. Load balancing is implemented as follows. When daemon $A$ tries to become a subordinate of daemon $B$, it sends a message to a corresponding IP address -telling how many slave daemons are already connected to it (including itself). -If there are no slaves, then it counts itself only. After all links between +telling how many daemons are already connected to it (including itself). If +there are no connections, then it counts itself only. After all links between daemons in the cluster are established, every daemon has enough information to tell how many nodes exist behind each link. If the link is between a slave and a master, and the slave wants to know how many nodes are behind the master, -then it simply subtracts the total number of nodes behind its slave nodes from -the total number of nodes behind the master to get the correct count. To -distribute kernels across nodes we use simple round-robin algorithm, +then it simply subtracts the total number of nodes behind all of its slave +nodes from the total number of nodes behind the master to get the correct +amount. To distribute kernels across nodes we use a simple round-robin algorithm, i.e.~iterate over all links of the current daemon (including the link to its master) taking into account how many nodes are behind each link: the pointer advances to the next link only when a sufficient number of kernels has been sent through the current link. That way even if an application is launched on a slave node in the bottom of the hierarchy, the kernels will be distributed evenly across -all cluster nodes. +all cluster nodes. A kernel cannot be sent through the link from which it was +received. 
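The weighted round-robin rule described above can be sketched as follows. This is an illustrative Python fragment, not code from the framework: the names `Link` and `distribute` and the exact pointer-advancement bookkeeping are our assumptions; node counts are assumed positive, and `origin` models the link a kernel was received from, through which it must not be sent back.

```python
class Link:
    """One outgoing link of a daemon (illustrative)."""
    def __init__(self, name, nodes_behind):
        self.name = name                  # link identifier
        self.nodes_behind = nodes_behind  # no. of nodes behind this link (> 0)
        self.sent = 0                     # kernels sent in the current round

def distribute(kernels, links, origin=None):
    """Assign kernels to links proportionally to the number of nodes
    behind each link; the pointer advances to the next link only after
    the current link has received its share of the round."""
    candidates = [l for l in links if l is not origin]  # never send back
    assignment = []
    i = 0
    for kernel in kernels:
        # advance the pointer when the current link's share is exhausted
        while candidates[i % len(candidates)].sent >= \
                candidates[i % len(candidates)].nodes_behind:
            candidates[i % len(candidates)].sent = 0  # reset for next round
            i += 1
        link = candidates[i % len(candidates)]
        link.sent += 1
        assignment.append((kernel, link.name))
    return assignment
```

For two links with two and one nodes behind them, six kernels are split 4:2, matching the per-link node counts.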
The advantage of this approach is that it can be extended to include sophisticated logic into load distribution policy. Any metric that is required to implement such a policy can be sent from the directly linked daemon during the link initialisation. As of now, only the number of nodes behind the link is -sent. The disadvantage of the approach is that update of the metric happens +sent. The disadvantage of the approach is that an update of the metric happens only when a change in the hierarchy occurs: if a metric changes periodically, then periodically sending update messages is also required for implementing the policy, and too frequent updates may consume a considerable amount of network @@ -59,14 +68,14 @@ bandwidth. The other disadvantage is that when reconfiguration of the hierarchy occurs due to a node failure or a new node joining the cluster, the kernels that are already executed on the nodes are not taken into account in the load distribution, so frequent updates to the hierarchy may cause uneven load -distribution, which, however, balances over time. Uneven load distribution does +distribution (which, however, balances over time). Uneven load distribution does not cause node overload, since there is a kernel pool on each node that queues the kernels prior to execution. Dynamic master/slave role distribution coupled with kernel distribution makes -overall distributed system architecture homogeneous within single cluster. On -every node the same daemon is run, and no configuration is needed to make a -hierarchy of daemons~--- it happens automatically. +overall system architecture homogeneous within a single cluster. On every node +the same daemon is run, and no configuration is needed to make a hierarchy of +daemons~--- it happens automatically. \paragraph{Kernel layer} Consists of kernels and hierarchical (parent/child) logical links between them. 
The only purpose of kernel hierarchy is to provide @@ -102,7 +111,7 @@ platform (no.~of cores per node and no.~of nodes in a cluster). A~pipeline consists of a kernel pool, which contains all the subordinate kernels sent by their parents, and a thread pool that processes kernels in accordance with rules outlined in the previous paragraph. A~separate pipeline exists for each -compute device: There are pipelines for parallel processing, schedule-based +compute device: there are pipelines for parallel processing, schedule-based processing (periodic and delayed tasks), and a proxy pipeline for processing of kernels on other cluster nodes (see~fig.~\ref{fig:pipeline}). @@ -118,14 +127,6 @@ are necessary because calls are asynchronous and one must wait until subordinate kernels complete their work. Pipelines allow circumventing active wait, and calling correct kernel methods by analysing their internal state. -\begin{figure}% - \centering% - \includegraphics{ppl}% - \caption{Mapping of parent and child process pipelines to compute devices. - Solid lines denote aggregation, dashed lines denote mapping between - logical and physical entities.\label{fig:pipeline}} -\end{figure}% - \section{Resilience to multiple node failures} To disambiguate hierarchical links between daemon processes and kernels and to @@ -184,12 +185,12 @@ discuss this algorithm and the rule in detail. The main purpose of the system is to provide continuous execution of kernels in the presence of node failures. There are three types of such failures. -\begin{itemize} +\begin{enumerate} \item Simultaneous failure of at most one node. \item Simultaneous failure of more than one node but less than the total number of nodes. \item Simultaneous failure of all nodes (electricity outage). -\end{itemize} +\end{enumerate} For the sake of simplicity, it is assumed that the parallel programme runs on all cluster nodes. Our system provides resilience to node failures for the first and second scenarios. 
@@ -205,14 +206,14 @@ a node failure, or as a consequence of the daemon hierarchy change, all kernels which reside on the corresponding cluster node are considered failed, and the recovery process is triggered in accordance with the following scenarios. -\paragraph*{Scenario~1} With respect to kernel hierarchy, there are two -possible variants of this failure: when a principal fails and when a +\paragraph*{Scenario~1 \& 2} With respect to kernel hierarchy, there are three +possible variants of this failure: when a principal fails, when a subordinate fails (and both of them may or may not reside on the same cluster -node). +node) and when any combination of a principal and its subordinates fails. When a subordinate fails, its copy is simply restored from the outbound buffer on the node where its principal is located. When the corresponding network -connection closes all kernels from the list are automatically distributed +connection closes, all kernels from the buffer are automatically distributed across available nodes, or executed locally if there are no network connections. @@ -226,12 +227,14 @@ and deletes itself. This algorithm is executed on every node, to which a copy of the principal was sent, and the guarantee that only one copy of the principal is restored is provided by the implied hierarchy of IP addresses: every subordinate of the principal has the list of nodes to which only -\emph{previously created} subordinates were sent, and communication originating -from previously created subordinate to the newer subordinate is possible (only -the other way round). Subordinate deletion is necessary, because the whole -computational step, modelled by the principal, is re-executed from the initial -state, and there is no simple and reliable way of taking into account partial -results which were produced so far by the subordinates. 
+\emph{previously created} subordinates were sent, and no communication +originating from a previously created subordinate to a newer subordinate is +possible (only the other way round). Subordinate deletion is necessary, because +the whole computational step, modelled by the principal, is re-executed from +the initial state, and there is no simple and reliable way of taking into +account partial results which were produced so far by the subordinates. +Simultaneous failure of a combination of a principal and a number of its +subordinates is handled the same way. %\begin{algorithm} % \KwData{$s$ --- subordinate kernel, $result$ --- \texttt{send} status.} @@ -264,12 +267,9 @@ results which were produced so far by the subordinates. % \label{fig:subordinate-and-principal-fail} %\end{figure} -\paragraph*{Scenario~2} In comparison to the first scenario, the second one is -more complicated, but also more frequent. While on kernel hierarchy the system -acts the same as in the first scenario, when we move to daemon hierarchy one -more possible variant is added. In deep kernel hierarchy a kernel may act as a -subordinate and as a principal at the same time. Thus, we need to copy not -only direct principal of each subordinate kernel, but also all principals +\paragraph*{Deep kernel hierarchies} In a deep kernel hierarchy a kernel may act +as a subordinate and as a principal at the same time. Thus, we need to copy +not only the direct principal of each subordinate kernel, but also all principals higher in the hierarchy recursively. So, the additional variant is a generalisation of the two previous ones for deep kernel hierarchies. @@ -380,11 +380,11 @@ sequential application step all parallel application processes except one were shut down with a small delay to give the principal kernel time to distribute its subordinates between cluster nodes. The experiment was repeated 12 times with a different surviving process each time. 
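The restoration rule described earlier (only one copy of a failed principal is restored, decided by the implied order of subordinate creation) can be reduced to a one-line predicate. This Python sketch is our illustration, not framework code: `should_restore` and its arguments are hypothetical names, and the rule shown is only the "oldest surviving subordinate wins" consequence of the ordering described in the text.

```python
def should_restore(older_subordinate_nodes, alive_nodes):
    """Illustrative: a subordinate restores its principal's copy only if
    none of the nodes that received *previously created* subordinates is
    still alive; otherwise it deletes its copy. Since every surviving
    subordinate applies the same rule, exactly one copy is restored."""
    return not any(n in alive_nodes for n in older_subordinate_nodes)
```

The subordinate with no surviving older siblings answers `True` and revives the principal; all younger ones see an alive older node in their lists and delete their copies.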
For each run the total application running -time was measured and compared to each other. In this experiment the principal -kernel was executed on the first node, and subordinate kernels are evenly -distributed across all node including the first one. The result of the -experiment is the overhead of recovery from a failure of a specific kernel in -the hierarchy, which should be different for principal and subordinate kernel. +time was measured. In this experiment the principal kernel was executed on the +first node, and subordinate kernels were evenly distributed across all nodes +including the first one. The result of the experiment is the overhead of +recovery from a failure of a specific kernel in the hierarchy, which should be +different for principal and subordinate kernels. In the second experiment we benchmarked the overhead of the multiple node failure handling code by instrumenting it with calls to time measuring routines. For @@ -434,19 +434,18 @@ to recover, and only the failed kernel is executed on one of the remaining nodes. \end{figure} The second experiment showed that the overhead of multiple node failure handling -code increases linearly with the number of nodes -(fig.~\ref{fig:test-2-phys}), however, its absolute value is small -compared to the programme run time. Linear increase in overhead is attributed -to the fact that for each subordinate kernel linear search algorithms are used -when sending or receiving it from other node to maintain an array of its -neighbours. When subordinate kernel is sent to remote node, all its left -neighbours IP addresses are added to the neighbours array without duplication, -and the kernel itself is appended to the global internal map which stores -principal kernels and theirs subordinates; when subordinate kernel returns from -the remote node, it is removed from the array of its principal subordinates -(retrieved from the internal map), which also requires linear search. 
So, the -second experiment showed that for real-world programme overhead of multiple -node failure handling is small. +code increases linearly with the number of nodes (fig.~\ref{fig:test-2-phys}), +however, its absolute value is small compared to the programme run time. Linear +increase in overhead is attributed to the fact that for each subordinate kernel +linear search algorithms are used when sending or receiving it from another node +to maintain an array of its neighbours. When a subordinate kernel is sent to a +remote node, all of its previously created neighbours' IP addresses are added to +the neighbours array without duplication, and the kernel itself is appended to +the global internal map which stores principal kernels and their subordinates; +when the subordinate kernel returns from the remote node, it is removed from the +array of its principal's subordinates (retrieved from the internal map), which +also requires linear search. So, the second experiment showed that for +a real-world programme the overhead of multiple node failure handling is small. \begin{figure} \centering @@ -505,16 +504,16 @@ hierarchy, which makes this optimisation equivalent to the transformation. There are essentially two scenarios of failures: failure of more than one node at a time, and electricity outage. In the first scenario the failure is handled by -sending a list previous IP addresses to the subsequent kernels in the batch. +sending a list of previous IP addresses to the subsequent kernels in the batch. Then if a subordinate node and its master fail simultaneously, the surviving subordinate nodes scan all of the IP addresses they received until they find -alive node and parent is revived on this node. +an alive node, and the parent is revived on this node. We believe that having kernel states and their inter-dependencies is enough to mitigate any combination of node failures: given that at least one node survives, all programmes continue their execution in a possibly degraded state. 
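The linear-search bookkeeping blamed for the linear overhead growth above can be sketched in a few lines. This is an illustrative Python fragment under our own names (`neighbours`, `subordinates`, `on_send`, `on_return`), not the framework's data structures; it only shows why both the duplicate check and the removal cost a linear scan per kernel.

```python
# Illustrative bookkeeping: an array of neighbour addresses per
# subordinate kernel, and a map from principals to their subordinates.
neighbours = {}    # subordinate id -> list of neighbour IP addresses
subordinates = {}  # principal id -> list of subordinate ids

def on_send(principal, sub, dest_ips):
    """Record previously used addresses without duplication (the
    membership test is a linear search over the array)."""
    arr = neighbours.setdefault(sub, [])
    for ip in dest_ips:
        if ip not in arr:          # linear search
            arr.append(ip)
    subordinates.setdefault(principal, []).append(sub)

def on_return(principal, sub):
    """Remove the returned kernel from its principal's subordinate
    array (another linear search)."""
    subs = subordinates.get(principal, [])
    if sub in subs:                # linear search
        subs.remove(sub)
```

Both operations scan a per-kernel array whose size grows with the number of nodes, which is consistent with the linear overhead reported in the experiment.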
-However it requires recursively duplicating principals and sending the along -with the subordinates. Only electricity outage requires writing data to disk +However, it requires recursively duplicating principals and sending them along +with the subordinates. Only electricity outage requires writing data to disk; other failures can be mitigated by duplicating kernels in memory. The framework has not been compared to other similar approaches, because to the diff --git a/src/head.tex b/src/head.tex @@ -25,7 +25,7 @@ papers~\cite{gankevich2015subordination,gankevich2016factory}, where only one node failure at a time is guaranteed to not interrupt programme execution. In this paper failure detection methods are not studied, and node failure is -assumed if the corresponding network connection prematurely closes. Node +assumed if the corresponding network connection abruptly closes. Node failure handling, provided by our algorithm, is transparent for a programmer: there is no need to explicitly specify which kernels should be copied to other cluster nodes. However, its implementation cannot be used to provide fault @@ -33,4 +33,4 @@ tolerance to existing parallel programmes based on MPI or other libraries: the purpose of the software framework developed here is to seamlessly provide fault tolerance for new parallel applications. If a failure is detected by some external programme, then removing this node from the cluster is as simple as -killing the daemon process. +killing the daemon process, which is an integral part of the framework. diff --git a/src/tail.tex b/src/tail.tex @@ -34,13 +34,13 @@ described in this paper to provide fault-tolerance: upon a failure we re-execute subordinate kernels and copy principal kernels to be able to re-execute them as well. Our approach blends checkpoint/restart and message logging: each kernel which is sent to other cluster node is saved (logged) in -memory of the sender, and removed from the log upon return. 
Since subordinate -kernels are allowed to communicate only with their principals (all other -communication may happen only when physical location of the kernel is known, if -the communication fails, then the kernel also fails to trigger recovery by the -principal), a collection of all logs on each cluster nodes constitutes the -current state of programme execution, which is used to restart failed kernels -on the surviving nodes. +the outbound buffer of the sender, and removed from the buffer upon return. +Since subordinate kernels are allowed to communicate only with their principals +(all other communication may happen only when the physical location of the kernel +is known; if the communication fails, then the kernel also fails to trigger +recovery by the principal), a collection of all logs on all cluster nodes +constitutes the current state of programme execution, which is used to restart +failed kernels on the surviving nodes. To summarise, the feature that distinguishes our model from models proposed for improving parallel programme fault-tolerance is the use of kernel @@ -58,7 +58,7 @@ work~\cite{gankevich2015subordination}), hence, the use of several switches and routers is possible within a single cluster. Second, our approach does not require the use of standby servers to provide high availability of a master node: we provide fault tolerance on the kernel layer instead. As the computation -progresses, kernels copy themselves on nodes that are directly connected to the +progresses, kernels copy themselves on nodes that are logically connected to the current one, and these can be any nodes from the cluster. Finally, high-availability cluster projects do not deal with parallel programme failures, they aim to provide high availability for services running on master
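The outbound-buffer logging that the tail.tex hunk describes (log a kernel on send, drop it on return, redistribute everything pending when the connection closes) can be summarised in a short sketch. This is illustrative Python under assumed names (`OutboundBuffer` and its methods), not the framework's implementation.

```python
class OutboundBuffer:
    """Illustrative sketch of the sender-side log described above."""
    def __init__(self):
        self.buffer = {}              # kernel id -> kernel payload

    def on_send(self, kid, kernel):
        self.buffer[kid] = kernel     # log the kernel before sending it

    def on_return(self, kid):
        self.buffer.pop(kid, None)    # remove from the log upon return

    def on_connection_close(self):
        """The peer failed: hand every in-flight kernel back for
        redistribution to surviving nodes (or local execution)."""
        pending, self.buffer = list(self.buffer.values()), {}
        return pending
```

The union of these per-node buffers is what the text calls the current state of programme execution: it is exactly the set of kernels that must be re-executed somewhere after a failure.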