Translate fail over algorithm in action.

commit 0aaef4d58fac391ca5b57f2742363185243c5085
parent 27ae655707692f07d1b7bff331aff8d4e5274090
Author: Ivan Gankevich <igankevich@ya.ru>
Date:   Mon, 27 Feb 2017 16:29:21 +0300

Translate fail over algorithm in action.

Diffstat:
phd-diss-ru.org  | 45 ++++++++++++++++++++++++++++++++++++---------
phd-diss.org  | 50 ++++++++++++++++++++++----------------------------

2 files changed, 58 insertions(+), 37 deletions(-)
diff --git a/phd-diss-ru.org b/phd-diss-ru.org
@@ -2900,28 +2900,55 @@ Keepalived\nbsp{}cite:cassen2002keepalived.
 кластера за один шаг вычислений или произвольного количества подчиненных узлов в
 любой момент работы программы.
 
-TODO translate
+Далее следует пример работы алгоритма восстановления после сбоев
+(рис.\nbsp{}[[fig-fail-over-example]]).
+1. Исходное состояние. На начальном этапе вычислительный кластер не требует
+   никакой настройки за исключением настройки сети. Алгоритм предполагает полную
+   связность узлов кластера и лучше всего работает с древовидными топологиями, в
+   которых все узлы кластера соединены несколькими коммутаторами.
+2. Построение иерархии узлов. При первичной загрузке на всех узлах кластера
+   запускаются процессы-сервисы, которые совместно строят иерархию таких же
+   процессов поверх топологии сети кластера. Положение процесса-сервиса в
+   иерархии определяется позицией IP-адреса его узла в диапазоне IP-адресов
+   сети. Для установления связи каждый из процессов соединяется только с
+   предполагаемым руководящим процессом. В данном случае процесс на узле \(A\)
+   становится руководящим процессом для всех остальных. Иерархия может
+   измениться, только если новый узел присоединяется к кластеру или какой-либо из
+   узлов выходит из строя.
+3. Запуск главного управляющего объекта. Первый управляющий объект запускается
+   на одном из подчиненных узлов (узел \(B\)). Главный объект может иметь только
+   один подчиненный объект в каждый момент времени, и резервная копия главного
+   объекта посылается вместе с этим подчиненным объектом \(T_1\) на руководящий узел
+   \(A\). \(T_1\) представляет собой последовательный шаг программы. В программе
+   может быть произвольное количество последовательных шагов, и, когда узел
+   \(A\) выходит из строя, текущий шаг перезапускается с начала.
+4. Запуск подчиненных управляющих объектов. Управляющие объекты \(S_1\), \(S_2\),
+   \(S_3\) запускаются на подчиненных узлах кластера. Когда узел \(B\), \(C\)
+   или \(D\) выходит из строя, соответствующий руководящий управляющий объект
+   перезапускает завершившиеся некорректно подчиненные объекты (\(T_1\)
+   перезапускает \(S_1\), главный объект перезапускает \(T_1\) и т.д.). Когда
+   выходит из строя узел \(B\), главный объект восстанавливается из резервной копии.
 
 #+name: fig-fail-over-example
 #+header: :headers '("\\input{preamble}\\setdefaultlanguage{russian}")
 #+begin_src latex :file build/fail-over-example-ru.pdf :exports results :results raw
 \input{tex/preamble}
 \newcommand*{\spbuInsertFigure}[1]{%
-%\flushright%
 \vspace{2\baselineskip}%
-\begin{minipage}{0.5\textwidth}%
+\begin{minipage}[b]{0.5\linewidth}%
     \Large%
     \input{#1}%
 \end{minipage}%
 }%
 \noindent%
-\spbuInsertFigure{tex/cluster-0}~\spbuInsertFigure{tex/frame-0}
-\spbuInsertFigure{tex/frame-3}~\spbuInsertFigure{tex/frame-4}
+\spbuInsertFigure{tex/cluster-0}~\spbuInsertFigure{tex/frame-0}\newline
+\spbuInsertFigure{tex/frame-3}~\spbuInsertFigure{tex/frame-4}\newline
 \spbuInsertFigure{tex/legend-ru}
 #+end_src
 
 #+caption: Пример работы алгоритма восстановления после сбоев.
 #+label: fig-fail-over-example
+#+attr_latex: :width \textwidth
 #+RESULTS: fig-fail-over-example
 [[file:build/fail-over-example-ru.pdf]]
 
@@ -2958,11 +2985,11 @@ TODO translate
 В ряде экспериментов была измерена производительность новой версии программы при
 выходе из строя различных типов узлов во время выполнения программы (номера
 пунктов соответствуют номерам графиков рис.\nbsp{}[[fig-benchmark]]):
-1. без выхода из строя узлов,
-2. выход из строя подчиненного узла (на котором генерируется часть взволнованной
+1) без выхода из строя узлов,
+2) выход из строя подчиненного узла (на котором генерируется часть взволнованной
    поверхности),
-3. выход из строя главного узла (на котором запускается программа),
-4. выход из строя резервного узла (на который копируется главный объект
+3) выход из строя главного узла (на котором запускается программа),
+4) выход из строя резервного узла (на который копируется главный объект
    программы).
 Древовидная иерархия узлов со значением ветвления равного 64 использовалась в
 экспериментах, для того чтобы удостовериться, что все подчиненные узлы кластера
diff --git a/phd-diss.org b/phd-diss.org
@@ -2723,37 +2723,30 @@ An example of fail over algorithm follows (fig.\nbsp{}[[fig-fail-over-example]])
    except setting up local network. The algorithm assumes full connectivity of
    cluster nodes, and works best with tree topologies in which several network
    switches connect all cluster nodes.
-2. Node hierarchy. When the cluster is bootstrapped, each node starts a
-   /daemon/ process, whose responsibility is to establish hierarchy of such
-   processes superimposed on the topology of cluster nodes. Hierarchical links
-   are solely defined by the position of node's IP address in the local network
-   IP address range eliminating the need for complex distributed consensus
-   algorithm. A node may act as a subordinate or a principal simultaneously thus
-   multiple hierarchy layers may be created. The hierarchy is changed only when
-   a new node joins or leaves the cluster, and is reused by every application
-   running on top of it. In an event of node failure its role is reassigned to
-   another node, and tasks that were executing on this node are restarted on
-   healthy ones.
-3. Launch main kernel. HPC application is decomposed into computational
-   kernels with hierarchical dependence. The first, or /main/ kernel, is
-   started on the leaf node. Main kernel may have only one subordinate at a
-   time, and /backup/ copy of the main kernel is sent along with the
-   subordinate kernel \(T_1\) to the root node. \(T_1\) represents one
-   sequential step of a programme (a superstep in Bulk Synchronous Parallel
-   model). There can be any number of sequential steps in a programme, and when
-   node \(B\) fails, the current step is restarted from the beginning.
+2. Build node hierarchy. When the cluster is bootstrapped, daemon processes
+   start on all cluster nodes and collectively build a hierarchy of such
+   processes superimposed on the topology of the cluster network. The position
+   of a daemon process in the hierarchy is defined by the position of its node's
+   IP address in the network IP address range. To establish a hierarchical link,
+   each process connects to its assumed principal process. The hierarchy changes
+   only when a new node joins the cluster or a node fails.
+3. Launch main kernel. The first kernel launches on one of the subordinate nodes
+   (node \(B\)). The main kernel may have only one subordinate at a time, and a
+   backup copy of the main kernel is sent along with the subordinate kernel
+   \(T_1\) to the principal node \(A\). \(T_1\) represents one sequential step of
+   a programme. There can be any number of sequential steps in a programme, and
+   when node \(A\) fails, the current step is restarted from the beginning.
 4. Launch subordinate kernels. Kernels \(S_1\), \(S_2\), \(S_3\) are launched on
-   the leaf nodes. When node \(B\), \(C\) or \(D\) fails, corresponding master
-   kernel restarts failed subordinates (\(T_1\) restarts \(S_1\), master kernel
-   restarts \(T_1\) etc.). When node \(A\) fails, master kernel is recovered
-   from backup.
+   subordinate cluster nodes. When node \(B\), \(C\) or \(D\) fails, the
+   corresponding principal kernel restarts failed subordinates (\(T_1\) restarts
+   \(S_1\), main kernel restarts \(T_1\) etc.). When node \(B\) fails, the main
+   kernel is recovered from backup.
 
 #+name: fig-fail-over-example
 #+header: :headers '("\\input{preamble}")
 #+begin_src latex :file build/fail-over-example.pdf :exports results :results raw
 \input{tex/preamble}
 \newcommand*{\spbuInsertFigure}[1]{%
-%\flushright%
 \vspace{2\baselineskip}%
 \begin{minipage}{0.5\textwidth}%
     \Large%
@@ -2768,6 +2761,7 @@ An example of fail over algorithm follows (fig.\nbsp{}[[fig-fail-over-example]])
 
 #+caption: An example of fail over algorithm in action.
 #+label: fig-fail-over-example
+#+attr_latex: :width \textwidth
 #+RESULTS: fig-fail-over-example
 [[file:build/fail-over-example.pdf]]
 
@@ -2802,11 +2796,11 @@ which only demands explicit marking of replicated kernels.
 In a series of experiments performance of the new version of the application in
 the presence of different types of failures was benchmarked (numbers correspond
 to the graphs in fig.\nbsp{}[[fig-benchmark]]):
-1. no failures,
-2. failure of a subordinate node (a node where a part of wavy surface is
+1) no failures,
+2) failure of a subordinate node (a node where a part of wavy surface is
    generated),
-3. failure of a principal node (a node where the main kernel is run),
-4. failure of a backup node (a node where a copy of the main kernel is stored).
+3) failure of a principal node (a node where the main kernel is run),
+4) failure of a backup node (a node where a copy of the main kernel is stored).
 A tree hierarchy with fan-out value of 64 was chosen to make all subordinate
 cluster nodes connect directly to the one having the first IP-address in the
 network IP address range. A victim node was made offline after a fixed amount of
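
The two rules this patch describes (hierarchy position derived from the node's IP address, and recovery of kernels after a node failure) can be illustrated with a minimal Python sketch. It is not taken from the thesis code. The rank-to-principal formula, the function names `principal_of` and `recovery_actions`, and the placement of \(S_1\), \(S_2\), \(S_3\) on nodes \(B\), \(C\), \(D\) are assumptions made for illustration; the text only states that position in the IP address range defines the hierarchy and that a fan-out of 64 makes all subordinates connect to the first address.

```python
import ipaddress


def principal_of(address, network, fanout):
    """Return the assumed principal's IP address, or None for the root.

    Assumption: nodes are ranked by offset from the first host address,
    and the node with rank r > 0 connects to rank (r - 1) // fanout.
    """
    net = ipaddress.ip_network(network)
    addr = ipaddress.ip_address(address)
    first = net.network_address + 1  # first host address in the range
    rank = int(addr) - int(first)
    if addr not in net or rank < 0:
        raise ValueError(f"{address} is not a host address in {network}")
    if rank == 0:
        return None  # the first address is the root of the hierarchy
    return str(first + (rank - 1) // fanout)


# Hypothetical placement following the four-node example in the text.
PLACEMENT = {"main": "B", "T1": "A", "S1": "B", "S2": "C", "S3": "D"}
PRINCIPAL_KERNEL = {"T1": "main", "S1": "T1", "S2": "T1", "S3": "T1"}
BACKUP_NODE = "A"  # the backup copy of the main kernel travels with T1


def recovery_actions(failed_node):
    """List the recovery steps triggered by the failure of one node."""
    actions = []
    if PLACEMENT["main"] == failed_node:
        actions.append(f"restore main from backup on node {BACKUP_NODE}")
    for kernel, parent in PRINCIPAL_KERNEL.items():
        if PLACEMENT[kernel] == failed_node:
            # The principal kernel restarts its failed subordinate.
            actions.append(f"{parent} restarts {kernel}")
    return actions


if __name__ == "__main__":
    print(principal_of("10.0.0.65", "10.0.0.0/24", 64))
    for node in "ABCD":
        print(node, recovery_actions(node))
```

Under these assumptions a fan-out of 64 gives a flat hierarchy for up to 65 nodes, which matches the benchmark setup where every subordinate connects directly to the node with the first IP address. The sketch handles a single failure only, the case walked through in the example.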

	arma-thesis
	git clone https://git.igankevich.com/arma-thesis.git