git clone https://git.igankevich.com/hpcs-17-subord.git

commit 5f27624c5dcf93e3854ef227ef247beff062cb12
parent 8d16b2b3249cbe96592843be388781caa4985cb1
Author: Ivan Gankevich <igankevich@ya.ru>
Date:   Fri, 24 Mar 2017 18:23:47 +0300

Add introductory paragraphs to failure scenarios.

 src/body.tex | 33 ++++++++++++++++++++++-----------
 src/tail.tex |  2 +-
 2 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/src/body.tex b/src/body.tex
@@ -79,17 +79,28 @@ are necessary because calls are asynchronous and one must wait before
 subordinate kernels complete their work. Pipelines allow circumventing active
 wait, and call correct kernel methods by analysing their internal state.
 
-\subsection{Fault tolerance and high availability}
-
-The scheduler has fault tolerance and high availability built into its
-low-level core API. Every failed node kernels is restarted on healthy nodes or on its
-parent nodes, however, failure is detected only for kernels that are sent from
-one node to another (local kernels are not considered). High availability is
-provided by replicating master kernel to a subordinate node. When any of the
-replicas fails, another one is used in place. Detailed explanation of the fail
-over algorithm is provided in Section~\ref{sec:failure-scenarios}.
-
-\section{Failure scenarios}
+\section{Resilience to multiple node failures}
+
+In our system a node is considered failed if the corresponding network
+connection is abruptly closed. Developing more sophisticated failure techniques
+is out of scope of this paper. For the purpose of studying recovery procedures
+upon node failure this simple approach is sufficient.
+
+Consequently, any kernel which resided on the failed node is considered failed,
+and failure recovery procedure is triggered. Depending on the position of the
+kernel in kernel hierarchy recovery is carried out on the node where parent or
+one of the subordinate kernels resides. Recovery procedure for failed
+subordinate kernel is re-execution of this kernel on a healthy node, which is
+triggered automatically by the node where its parent kernel is located. If the
+subordinate communicates with other subordinates of the same parent kernel and
+one of them fails, all kernels as well as their parent are considered failed,
+and a copy of the parent is re-executed on a healthy node. If parent kernel
+fails, then its copy, which is sent along with every subordinate on other
+cluster nodes, is re-executed on the node where the first survived subordinate
+kernel resides. Kernel failure is detected only for kernels that are sent from
+one node to another (local kernels are not considered).
+
+\subsection{Failure scenarios}
 \label{sec:failure-scenarios}
 
 Now we discuss failure scenarios and how scheduler can handle it. First, define
diff --git a/src/tail.tex b/src/tail.tex
@@ -1,4 +1,4 @@
-\subsection{Related work}
+\section{Related work}
 
 The feature that distinguishes our research with respect to some others, is the
 use of hierarchy as the only possible way of defining dependencies between