hpcs-17-subord

git clone https://git.igankevich.com/hpcs-17-subord.git

commit 19eb3b8d0e67c7bf11ba3904ec4fa0d75f6b0c5f
parent b5c438a22224a80e8121776800a3a40dad36ec3e
Author: Ivan Gankevich <igankevich@ya.ru>
Date:   Sat, 18 Feb 2017 19:31:24 +0300

Merge branch 'master' of bitbucket.org:igankevich-latex/hpcs-17-subord

Diffstat:
src/body.tex | 61+++++++++++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 49 insertions(+), 12 deletions(-)

diff --git a/src/body.tex b/src/body.tex
@@ -115,22 +115,59 @@
 daemon is gone, and all kernels of all types at this node break their tasks
 execution process. To resolve failed states scheduler restore kernels using
 existing or newly appeared daemons accordingly to each mentioned scenarios.
-Consider first scenario. In accordance to principal-to-subordinate hierarchy,
-there are two variants of this failure: then principal was gone and then any
-subordinate was gone. Subordinate itself is not a valuable part of execution, it
-is a simple worker. Our scheduler not stored any of subordinate, but only
-principle state. Thus, to restore execution, scheduler use principle to simply
-recreate failed subordinate on most appropriate daemon. When principle is gone we
-need to restore it only once and only on one node. To archive this limitation,
-each subordinate will try to find any available daemon from addresses list in
-reverse order. If such daemon exists and available, finding process will stop,
-as current subordinate kernel will assume the found kernel will take principal
-restoration process.
+Consider the first scenario. In accordance with the principal-to-subordinate
+hierarchy, there are two variants of this failure: either the principal is
+gone, or one of the subordinates is gone. A subordinate itself is not a
+valuable part of the execution, it is a simple worker. Our scheduler does not
+store subordinate states, only the principal state. Thus, to restore the
+execution, the scheduler finds the last valid principal state and simply
+recreates the failed subordinate on the most appropriate daemon. When the
+principal is gone, we need to restore it only once and only on one node. To
+achieve this, each subordinate tries to find an available daemon from its
+address list in reverse order. If such a daemon exists and is available, the
+search stops, as the current subordinate kernel assumes that the kernel found
+there will take over the principal restoration process.
+
+Compared to the first scenario, the second one is more complicated, yet more
+frequent. On the principal-to-subordinate layer the scheduler acts the same,
+but when we move to the daemon layer one more variant is added. In the kernel
+hierarchy a principal kernel is usually a dual kernel: to higher-level kernels
+it appears as a subordinate, while to lower-level kernels it is a principal.
+Thus, we need to add to our restoration scope the state of the principal's
+principal. As a result, to the variants from the first scenario we add the one
+where the principal's principal is also gone. Since the scheduler, through the
+daemons, knows the state of all kernels before it begins a restoration
+process, it first checks the state of the principal's principal. If it is
+gone, all subordinates are started again according to the hierarchy,
+regardless of their states.
+
+These two scenarios cover failures at run time: the scheduler operates on
+kernels in memory and does not stop the execution of the whole task if some
+part of it was placed on a failed node. Occasionally, however, all cluster
+nodes may fail at the same time. That case is described in the third scenario.
+The main difference of this case is the use of a log. The log is stored on
+trusted storage and contains kernel states at the beginning of the execution
+and each `updated' state. By the term `updated' state we mean the principal
+state after the subordinates' \Method{React} calls. Execution log files are
+individual for each daemon, but are replicated on a selected number of nodes
+to provide hardware redundancy.
+At startup the scheduler has empty memory, so we develop a procedure of state
+restoration from the log as follows:
 \begin{itemize}
- \item
+  \item First, the scheduler waits for a defined timeout before the
+    restoration process begins, to ensure that the nodes have started up.
+  \item Next, the scheduler builds a sequential, virtually unified log for
+    every task; log parts are distributed over the nodes by the architecture.
+  \item After the unified log is built, we determine the latest state of each
+    kernel and then decide how to rerun the execution using the knowledge of
+    the failure scenarios.
 \end{itemize}
+Problems that may occur during the restoration process also need a discussion.
+If all nodes holding log parts of the current task are excluded from the
+cluster completely for some reason, restarting kernels from the last known
+hierarchy level is the only available choice. Also, if during restoration
+after a power outage a node goes offline again, the scheduler cannot continue
+the execution of the kernels that lie next to the kernels of that node in the
+hierarchy. In that case, the restoration process is rerun again from the place
+in the log before the newly appeared gap.
+
 \section{Evaluation}

 \section{Discussion}
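
The reverse-order search from the first scenario is easy to misread in prose, so a short sketch may help. The C++ fragment below is not taken from this repository: the Daemon structure, the find_restorer function and the address values are hypothetical, and liveness information is assumed to come from the scheduler's discovery layer. It only illustrates how walking a subordinate's address list in reverse order picks a single live daemon, which is then assumed to take over the principal restoration.

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <optional>
#include <vector>

// Hypothetical daemon descriptor: an address and a liveness flag that a real
// scheduler would obtain from its discovery layer.
struct Daemon {
    std::uint32_t address; // node address (illustrative; real code would use network endpoints)
    bool alive;            // whether the daemon currently responds
};

// Walk the subordinate's address list in reverse order and return the first
// daemon that is still alive. The subordinate assumes that the kernel found on
// this daemon takes over the principal restoration, so the principal is
// restored only once and only on one node.
std::optional<Daemon>
find_restorer(const std::vector<Daemon>& addresses) {
    auto it = std::find_if(addresses.rbegin(), addresses.rend(),
                           [](const Daemon& d) { return d.alive; });
    if (it == addresses.rend()) { return std::nullopt; }
    return *it;
}

int main() {
    // Example address list of one subordinate; the daemon at 0x0a000003 has failed.
    std::vector<Daemon> addresses{
        {0x0a000001, true}, {0x0a000002, true}, {0x0a000003, false}};
    if (auto d = find_restorer(addresses)) {
        std::cout << "principal is restored on node 0x"
                  << std::hex << d->address << '\n';
    } else {
        std::cout << "no live daemon left, fall back to log-based restoration\n";
    }
    return 0;
}

Assuming all subordinates of the failed principal share a consistently ordered address list, they all stop at the same surviving daemon, which is what restricts the principal restoration to exactly one node.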
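
The log-based restoration of the third scenario can be sketched in the same spirit. The LogRecord structure, its fields and the latest_states function are assumptions made for illustration, not the project's actual log format. The fragment merges the log parts gathered from the surviving nodes into one virtually unified per-task log and keeps only the latest recorded state of every kernel, which is the input the scheduler needs before deciding how to rerun the execution.

#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

// Hypothetical log record: each daemon appends the initial state of a kernel
// and every `updated' state (the principal state after a subordinate's React
// call). Field names are illustrative only.
struct LogRecord {
    std::uint64_t task_id;      // task the kernel belongs to
    std::uint64_t kernel_id;    // kernel identifier
    std::uint64_t sequence_no;  // monotonically increasing per kernel
    std::vector<char> state;    // serialised kernel state
};

// Merge the log parts collected from surviving nodes into one virtually
// unified log for the given task and keep only the latest known state of
// every kernel.
std::map<std::uint64_t, LogRecord>
latest_states(const std::vector<std::vector<LogRecord>>& parts,
              std::uint64_t task_id) {
    std::map<std::uint64_t, LogRecord> latest;
    for (const auto& part : parts) {
        for (const auto& rec : part) {
            if (rec.task_id != task_id) { continue; }
            auto [it, inserted] = latest.emplace(rec.kernel_id, rec);
            if (!inserted && rec.sequence_no > it->second.sequence_no) {
                it->second = rec;
            }
        }
    }
    return latest;
}

int main() {
    // Two log parts from two surviving daemons, both containing records of task 1.
    std::vector<std::vector<LogRecord>> parts{
        {{1, 10, 0, {}}, {1, 11, 2, {}}},
        {{1, 10, 3, {}}, {2, 20, 1, {}}}};
    auto states = latest_states(parts, 1);
    // Kernel 10 is recovered at sequence_no 3, kernel 11 at sequence_no 2.
    std::cout << "kernels recovered for task 1: " << states.size() << '\n';
    return 0;
}

The startup timeout from the first item of the restoration procedure would simply elapse before such a merge is performed, so that as many nodes as possible have contributed their log parts.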