hpcs-17-subord

git clone https://git.igankevich.com/hpcs-17-subord.git

commit dca290b508fbf2d47453733a8d52c1dc2b260db5
parent 483d1dde46c1cf5bd1afe67d74f96fd505cc0538
Author: Yuri Tipikin <yuriitipikin@gmail.com>
Date:   Thu, 16 Feb 2017 19:19:46 +0300

Added only first scenario. No spell or grammar check.

Also added some definitions on node and kernel hierarchies.

Diffstat:
src/body.tex | 46++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 44 insertions(+), 2 deletions(-)

diff --git a/src/body.tex b/src/body.tex
@@ -71,8 +71,8 @@ of the algorithm is provided in [hpcs-2015-paper].
 \subsection{Fault tolerance and high availability}
 The scheduler has fault tolerance and high availability built into its
-low-level core API. Every failed kernel is restarted on healthy node or on its
-parent node, however, failure is detected only for kernels that are sent from
+low-level core API. Every kernel of a failed node is restarted on a healthy node
+or on its parent node; however, failure is detected only for kernels sent from
 one node to another (local kernels are not considered). High availability is
 provided by replicating master kernel to a subordinate node. When any of the
 replicas fails, another one is used in place. Detailed explanation of the fail
@@ -94,11 +94,53 @@ Java.
 \section{Failure scenarios}
 \label{sec:failure-scenoarios}
+Now we discuss failure scenarios and how the scheduler handles them. First, we
+clearly define the relations between the sets of daemons and kernels. We name
+these relations differently to avoid confusion, because the ordering rule
+itself is the same. There are two intersecting hierarchies: the horizontal
+daemon hierarchy and the vertical kernel hierarchy. In the horizontal
+daemon-to-daemon hierarchy the relation is master-slave: the node (and,
+accordingly, its daemon) whose IP address is nearest to the gateway becomes
+the master, and every other node becomes a slave. This master-slave hierarchy
+is introduced into the scheduler for better kernel distribution. The vertical
+hierarchy of kernels is organized in principal-to-subordinate order: a
+principal kernel produces subordinates and thereby splits a task into parts
+to achieve fault tolerance.
+
+The main purpose of the scheduler is to continue or restore execution when
+failures occur in the daemon hierarchy. There are three types of such
+failures.
+
 \begin{itemize}
 \item Failure of at most one node.
 \item Failure of more than one node but less than total number of node.
 \item Failure of all nodes (electricity outage).
 \end{itemize}
+
+By dividing kernels into principals and subordinates we create restore points.
+Each principal is, mainly, a control unit with a goal. To achieve it, the
+principal partitions the task and delegates the parts to subordinates. With
+each such delegation the principal copies itself to the subordinate, in order
+of appearance. To ensure correct restoration, when a new partition is ready to
+be deployed as a new subordinate, the principal includes in that kernel
+information about all previously generated subordinates, expressed as an
+ordered list of the daemon addresses to which the subordinates were
+transferred. So, when we speak about failures, we mean that a daemon is gone
+and all kernels of all types on that node stop executing their tasks. To
+resolve the failed state, the scheduler restores kernels using existing or
+newly appeared daemons according to each of the scenarios mentioned above.
+
+Consider the first scenario. In accordance with the principal-to-subordinate
+hierarchy, there are two variants of this failure: the principal is gone, or
+some subordinate is gone. A subordinate itself is not a valuable part of the
+execution; it is a simple worker. Our scheduler does not store the state of
+any subordinate, only the principal state. Thus, to restore execution, the
+scheduler uses the principal to simply recreate the failed subordinate on the
+most appropriate daemon. When the principal is gone, we need to restore it
+only once and only on one node.
+To enforce this restriction, each subordinate tries to find an available
+daemon from the address list, in reverse order. If such a daemon exists and is
+available, the search stops, and the current subordinate kernel assumes that
+the kernel found there will take over the principal restoration process.
+
+\begin{itemize}
+  \item
 \section{Evaluation}
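
The added text defines the daemon hierarchy by a single rule: the node (and,
accordingly, its daemon) whose IP address is nearest to the gateway becomes
the master. A minimal C++ sketch of that rule follows. Treating "nearest" as
the smallest numeric distance between IPv4 addresses is an assumption (the
commit does not define the metric), and parse_ipv4 and elect_master are
illustrative names, not the scheduler's real API.

    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <vector>

    // Parse a dotted-quad IPv4 address into a 32-bit integer.
    std::uint32_t parse_ipv4(const std::string& s) {
        std::uint32_t result = 0;
        std::size_t pos = 0;
        for (int i = 0; i < 4; ++i) {
            std::size_t next = s.find('.', pos);
            result = (result << 8) | std::stoul(s.substr(pos, next - pos));
            pos = next + 1;
        }
        return result;
    }

    // Assumed rule: the node with the smallest absolute numeric distance
    // to the gateway address is elected master; every other node is a slave.
    std::string elect_master(const std::vector<std::string>& nodes,
                             const std::string& gateway) {
        const std::uint32_t gw = parse_ipv4(gateway);
        auto dist = [gw](const std::string& addr) {
            std::uint32_t x = parse_ipv4(addr);
            return x > gw ? x - gw : gw - x;
        };
        return *std::min_element(nodes.begin(), nodes.end(),
            [&dist](const std::string& a, const std::string& b) {
                return dist(a) < dist(b);
            });
    }

    int main() {
        std::vector<std::string> nodes{"10.0.0.7", "10.0.0.2", "10.0.0.5"};
        std::cout << "master: " << elect_master(nodes, "10.0.0.1") << '\n';
    }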
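
The delegation paragraph states that each newly deployed subordinate carries
an ordered list of the daemon addresses of all previously generated
subordinates. A sketch of that bookkeeping, with hypothetical Principal and
Subordinate types standing in for the scheduler's kernel classes:

    #include <iostream>
    #include <string>
    #include <vector>

    struct Subordinate {
        std::string daemon_addr;              // daemon this kernel was sent to
        std::vector<std::string> prev_addrs;  // daemons of earlier siblings,
                                              // in order of appearance
    };

    struct Principal {
        std::vector<std::string> sent_to;  // addresses used so far, in order

        // When a new partition of the task is ready, the principal embeds
        // every address used before it, so the subordinate can later locate
        // its siblings during principal restoration.
        Subordinate spawn(const std::string& daemon_addr) {
            Subordinate s{daemon_addr, sent_to};
            sent_to.push_back(daemon_addr);
            return s;
        }
    };

    int main() {
        Principal p;
        for (std::string addr : {"10.0.0.2", "10.0.0.3", "10.0.0.4"}) {
            Subordinate s = p.spawn(addr);
            std::cout << addr << " knows " << s.prev_addrs.size()
                      << " earlier subordinate(s)\n";
        }
    }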
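
Finally, the principal-restoration rule of the first scenario: each surviving
subordinate scans its address list in reverse order and stops at the first
daemon that is still reachable, assuming the kernel there takes over the
restoration. The liveness test below is a stand-in for the scheduler's
failure detection, which this commit does not show.

    #include <iostream>
    #include <set>
    #include <string>
    #include <vector>

    // Stand-in liveness check: a daemon is alive if it is not known failed.
    bool is_alive(const std::string& addr,
                  const std::set<std::string>& failed) {
        return failed.count(addr) == 0;
    }

    // Returns the daemon that should restore the principal, or an empty
    // string if this subordinate found no live predecessor and must do the
    // restoration itself.
    std::string find_restorer(const std::vector<std::string>& prev_addrs,
                              const std::set<std::string>& failed) {
        for (auto it = prev_addrs.rbegin(); it != prev_addrs.rend(); ++it) {
            if (is_alive(*it, failed)) {
                return *it;  // first live daemon found: defer to it
            }
        }
        return {};
    }

    int main() {
        // A subordinate on 10.0.0.4 knows about two earlier subordinates,
        // one of which sits on a failed daemon.
        std::vector<std::string> prev{"10.0.0.2", "10.0.0.3"};
        std::set<std::string> failed{"10.0.0.3"};
        std::cout << "restorer: " << find_restorer(prev, failed) << '\n';
    }

On this reading, every surviving subordinate except the earliest one finds a
live predecessor in its list and defers, while the earliest survivor finds
none and performs the restoration itself, which is what keeps the principal
restoration to exactly one node.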