hpcs-17-subord

git clone https://git.igankevich.com/hpcs-17-subord.git

commit f7cf8392bd2040e2f4c93ffb3eef7c10e71af73a
parent e7932b0622313f64c15e623ae065e732b921f46d
Author: Ivan Gankevich <igankevich@ya.ru>
Date:   Fri, 24 Mar 2017 20:30:08 +0300

Revise the second and the third failure scenarios.

Diffstat:
src/body.tex | 150++++++++++++++++++++++++++++++++++++++++++++-----------------------------------
1 file changed, 83 insertions(+), 67 deletions(-)

diff --git a/src/body.tex b/src/body.tex
@@ -103,21 +103,21 @@ one node to another (local kernels are not considered).
 \subsection{Failure scenarios}
 \label{sec:failure-scenarios}
-To disambiguate hierarchical links between daemon processes and kernels, we
-will use the following naming conventions throughout the text. If the link is
-between two daemon processes, the relationship is \emph{master-slave}. If the
-link is between two kernels, then the relationship is
-\emph{principal-subordinate} (or \emph{parent-child}). Two hierarchies are
-orthogonal to each other in a sense that no kernel may have a link to a daemon,
-and vice versa. Since daemon hierarchy is used to distribute the load on the
-cluster, kernel hierarchy is mapped onto it, and this mapping can be arbitrary.
-It is common to have principal kernel on a slave node with its subordinate
-kernels distributed evenly between all cluster nodes (including the node where
-the principal is located). Both hierarchies can be arbitrarily deep, but
-``shallow'' ones are preferred for highly parallel programmes, as there are
-less number of hops when kernels are distributed between cluster nodes. Since
-there is one-to-one correspondence between daemons and cluster nodes, they are
-used interchangeably in the paper.
+To disambiguate hierarchical links between daemon processes and kernels and to
+simplify the discussion, we will use the following naming conventions
+throughout the text. If the link is between two daemon processes, the
+relationship is \emph{master-slave}. If the link is between two kernels, then
+the relationship is \emph{principal-subordinate} (or \emph{parent-child}). The
+two hierarchies are orthogonal to each other in the sense that no kernel may
+have a link to a daemon, and vice versa. Since the daemon hierarchy is used to
+distribute the load on the cluster, the kernel hierarchy is mapped onto it,
+and this mapping can be arbitrary. It is common to have a principal kernel on
+a slave node with its subordinate kernels distributed evenly between all
+cluster nodes (including the node where the principal is located). Both
+hierarchies can be arbitrarily deep, but ``shallow'' ones are preferred for
+highly parallel programmes, as fewer hops are needed when kernels are
+distributed between cluster nodes. Since there is a one-to-one correspondence
+between daemons and cluster nodes, they are used interchangeably in the paper.
 
 The main purpose of the system is to provide continuous execution of kernels in
 the presence of daemon (and consequently node) failures. There are three types
@@ -168,74 +168,90 @@ simple and reliable way of taking into account the results which was produced
 so far by the subordinates.
 
 \begin{figure}
-  \centering
-  \includegraphics{sc12}
-  \caption{Restoration after only one subordinate fails.}
-  \label{fig:subordinate-fails}
+  \centering
+  \includegraphics{sc12}
+  \caption{First failure scenario. Recovery of a subordinate.}
+  \label{fig:subordinate-fails}
 \end{figure}
 
 \begin{figure}
-  \centering
-  \includegraphics{sc1}
-  \caption{First scenario of restoration after principle fails.}
-  \label{fig:principal-fails}
+  \centering
+  \includegraphics{sc1}
+  \caption{First failure scenario. Recovery of a principal.}
+  \label{fig:principal-fails}
 \end{figure}
 
 \paragraph*{Scenario~2} In comparison to the first scenario, the second one is
-more complicate yet frequent. While on principal-to-subordinate layer scheduler
-act same~\ref{fig:subordinate-fails}, then we move to daemons layer one more
-variant added. In kernel hierarchy principal kernels mostly a dual kernel. For
-a higher level kernels it seems like a subordinate, for rest lower kernels it
-is a principal. Thus, we need to add to our restoration scope only a state of
-principals principle. As a result, we add to variants from first scenario
-situation,a one where principals principal also is gone. Since scheduler
-through daemons knew all kernels state before it begin a restoration process,
-first it will check state of principals principle. If it is
-gone~\ref{fig:subordinate-fails}, all subordinates will be started accordingly
-to hierarchy once again, despite their states.
+more complicated, but also more frequent. In the kernel hierarchy the system
+acts the same as in the first scenario, but in the daemon hierarchy one more
+possible variant is added. In a deep kernel hierarchy a kernel may act as a
+subordinate and as a principal at the same time. Thus, we need to copy not
+only the direct principal of each subordinate kernel, but also all principals
+higher in the hierarchy, recursively (Figures~\ref{fig:sc2} and~\ref{fig:sc3}).
+So, the additional variant is a generalisation of the two previous ones for
+deep kernel hierarchies.
+
+Handling principal failure in a deep kernel hierarchy may involve a lot of
+overhead, because the failure is detected only when a subordinate finishes its
+execution. So, for a sufficiently large number of subordinates, there can be a
+situation in which some of them have finished their execution and triggered
+principal recovery, whereas others continue their execution in parallel with
+the newly created subordinates of the recovered principal. This behaviour may
+be undesirable for programmes with sophisticated logic that interact with
+external databases, as it may lead to deadlocks or data corruption in the
+corresponding database. For batch processing jobs this means that writing to
+files by multiple subordinates is not reliable; to avoid data loss, programme
+logic should be changed so that only one (principal) kernel writes to a file,
+whereas subordinates only process separate parts of the dataset.
 
 \begin{figure}
-  \centering
-  \includegraphics{sc2}
-  \caption{Simultaneous failure of a subordinate and its principal.}
-  \label{fig:sc2}
+  \centering
+  \includegraphics{sc2}
+  \caption{Simultaneous failure of a subordinate and its principal.}
+  \label{fig:sc2}
 \end{figure}
 
 \begin{figure}
-  \centering
-  \includegraphics{sc3}
-  \caption{The condition for restarting an execution subtree.}
-  \label{fig:sc3}
+  \centering
+  \includegraphics{sc3}
+  \caption{Simultaneous failure of two principals.}
+  \label{fig:sc3}
 \end{figure}
 
-This two scenarios imply cases in runtime, that means scheduler operates kernels
-in memory and will not stop execution of whole task if some part of it was
-placed on failed node. But occasionally, all nodes of cluster may fail at same
-time. That case is describe in third scenario. The main difference of this case
-is a log usage. Log is stored on trusted storage and contains kernel states at a
-beginning of execution and each ``updated'' state. By term ``updated'' state we
-define principal state after subordinates \Method{react} calls. Files of
-execution log is individual for each daemon, but have replicas on selected
-number of nodes to provide hardware redundancy. Scheduler at startup have empty
-memory, so we develop a procedure of state restoration from log as follows:
-
+\paragraph*{Scenario~3} Both failure scenarios are handled at runtime: the
+system does not stop execution of a programme when some of its kernels are
+placed on a failed node, unless all nodes on which the programme runs fail
+simultaneously. This scenario commonly occurs as a result of an electricity
+outage, and its main difference from the other scenarios is the use of a
+kernel log. The kernel log is stored on reliable storage and contains kernel
+initial states, recorded at the beginning of their execution, and each
+``update'' to these states, recorded after a subordinate returns to its
+principal (a call to \Method{react}). Each daemon maintains its own kernel log
+file, which is replicated on a selected number of nodes to provide resilience.
+Replication is configured externally by means of a parallel file system, a
+RAID array or any other suitable technology.
+
+When a daemon starts, recovery from the failure of all cluster nodes is
+handled as follows.
 \begin{itemize}
-  \item First, scheduler will take a defined timeout before restoration process
-  begins to ensure nodes startup.
-  \item Next, scheduler will build a sequential, virtually unified log for every
-  task. Log parts distributed over nodes by architecture.
-  \item After unified log will build, we detect kernels latest states and when
-  decide how to rerun execution with knowledge of failure scenarios.
+  \item First, the system waits until a defined timeout elapses before
+  starting the recovery process, to ensure that as many nodes as possible are
+  bootstrapped.
+  \item Next, the system builds a sequential unified log from all log files
+  for every programme that was running on the cluster when the failure
+  occurred.
+  \item After the unified log is ready, the system detects the latest states
+  of all alive kernels and re-executes them.
 \end{itemize}
-Problems, which may occur while restoration process also need a discussion.
-Thus, all nodes with log parts of current task would be excluded from cluster
-completely for some reason, restart kernels from a known hierarchy level is a
-only available choice. Also, while process of restoration after electricity
-outage, if node will gone offline again, scheduler would not continue execution
-of kernels lies next to kernels on that node in hierarchy. In that case,
-restoration process will rerun again from place in log before newly appeared
-gap.
+
+Recovery from a failure of all nodes is the least efficient, because it
+involves the use of persistent storage and there is no reliable way to ensure
+that all cluster nodes have been bootstrapped. If some nodes were not
+bootstrapped properly, missing kernels are considered failed in accordance
+with the first and the second scenarios. This may lead to re-execution of a
+considerable portion of parallel programme kernels, especially when multiple
+principal kernels in the same hierarchy branch have failed. If a node fails in
+the middle of the recovery process, the whole process is restarted from the
+beginning.
 
 \section{Evaluation}
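The recursive copying of principals described in scenario 2 of the revised text
can be illustrated with a short C++ sketch. This is an illustration only,
assuming a hypothetical Kernel type; the identifiers (Kernel, principal,
saved_chain, save_principal_chain, restore_principals) are not taken from the
scheduler's actual source code.

// Illustrative sketch only: the Kernel type and all identifiers here are
// hypothetical and are not taken from the scheduler's actual source code.
#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

struct Kernel {
    int id = 0;
    Kernel* principal = nullptr;  // direct principal (parent) kernel
    // Copies of the direct principal and of every principal above it,
    // ordered from the nearest one to the most distant one.
    std::vector<std::shared_ptr<Kernel>> saved_chain;

    virtual ~Kernel() = default;
    virtual std::shared_ptr<Kernel> clone() const {
        return std::make_shared<Kernel>(*this);
    }
};

// Before a subordinate is sent to another node, copy not only its direct
// principal, but all principals higher in the hierarchy (scenario 2).
void save_principal_chain(Kernel& subordinate) {
    for (Kernel* p = subordinate.principal; p != nullptr; p = p->principal) {
        subordinate.saved_chain.push_back(p->clone());
    }
}

// When the nodes of the first num_failed principals above a surviving
// subordinate are known to have failed, rebuild that part of the chain from
// the survivor's copies.  The returned vector owns the restored kernels,
// ordered from the most distant principal to the nearest one.
std::vector<std::shared_ptr<Kernel>>
restore_principals(const Kernel& survivor, std::size_t num_failed) {
    std::vector<std::shared_ptr<Kernel>> restored;
    std::size_t n = std::min(num_failed, survivor.saved_chain.size());
    for (std::size_t i = n; i > 0; --i) {
        auto copy = survivor.saved_chain[i - 1]->clone();
        if (!restored.empty()) {
            copy->principal = restored.back().get(); // re-link to the kernel above
        }
        restored.push_back(copy);
    }
    return restored;
}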
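The three recovery steps listed in scenario 3 can be sketched in the same
spirit. The record layout below (LogRecord with a timestamp, a kernel id and a
serialised state) is an assumption made for illustration, not the scheduler's
actual on-disk log format; step 1 (waiting for a bootstrap timeout) is omitted
because it is a plain wait.

// Illustrative sketch only: the log record format is an assumption.
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct LogRecord {
    std::uint64_t timestamp = 0;  // when the state was written
    std::uint64_t kernel_id = 0;  // which kernel the state belongs to
    std::string   state;          // serialised kernel state (initial or "update")
};

// Step 2: merge per-daemon log files into one sequential, time-ordered log.
std::vector<LogRecord>
build_unified_log(const std::vector<std::vector<LogRecord>>& per_daemon_logs) {
    std::vector<LogRecord> unified;
    for (const auto& log : per_daemon_logs) {
        unified.insert(unified.end(), log.begin(), log.end());
    }
    std::sort(unified.begin(), unified.end(),
              [](const LogRecord& a, const LogRecord& b) {
                  return a.timestamp < b.timestamp;
              });
    return unified;
}

// Step 3: for every kernel keep only its latest recorded state; these are the
// states from which the surviving kernels are re-executed.
std::map<std::uint64_t, LogRecord>
latest_states(const std::vector<LogRecord>& unified) {
    std::map<std::uint64_t, LogRecord> latest;
    for (const auto& rec : unified) {
        latest[rec.kernel_id] = rec; // records are time-ordered, so the last one wins
    }
    return latest;
}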