\section{System architecture}

Our model of a computer system has a layered architecture
(fig.~\ref{fig:pipeline}).

\begin{figure}%
  \centering%
  \includegraphics{ppl}%
  \caption{Mapping of parent and child process pipelines to compute devices.
    Solid lines denote aggregation, dashed lines denote mapping between
    logical and physical entities.\label{fig:pipeline}}
\end{figure}%

\paragraph{Physical layer} Consists of nodes and direct/routed physical
network links. On this layer full network connectivity, i.e.~the ability to
send a packet from one cluster node to any other, is assumed.

\paragraph{Daemon layer} Consists of daemon processes residing on cluster
nodes and hierarchical (master/slave) logical links between them. Master and
slave roles are dynamically assigned to daemon processes, i.e.~any physical
cluster node may become a master or a slave. Dynamic reassignment uses a
leader election algorithm that does not require periodic broadcasting of
messages; instead, the role is derived from the node's IP address. A detailed
explanation of the algorithm is provided
in~\cite{gankevich2015subordination}. Its strengths are scalability to a
large number of nodes and low overhead, which are essential for large-scale
high-performance computations; its weakness is the artificial dependence of a
node's position in the hierarchy on its IP address, which may not be
desirable in virtual environments, where nodes' IP addresses may change
without notice.

The only purpose of the daemon hierarchy is to provide load balancing and an
automatically reconfigurable logical tree hierarchy of cluster nodes. This
hierarchy is used to distribute the load from the current node to its
neighbours by simply iterating over all directly connected daemons. Upon
reconfiguration due to node failure or due to a new node joining the cluster,
daemons exchange messages telling each other how many daemons are ``behind''
the corresponding link in the hierarchy. This information is used to
distribute the load evenly, even if a parallel programme is launched on a
slave node. In addition, this topology reduces the number of simultaneous
connections, thus preventing network overload.

Load balancing is implemented as follows. When daemon $A$ tries to become a
subordinate of daemon $B$, it sends a message to the corresponding IP address
telling how many daemons are already connected to it (including itself). If
there are no connections, then it counts itself only. After all links between
daemons in the cluster are established, every daemon has enough information
to tell how many nodes exist behind each link. If the link is between a slave
and a master, and the slave wants to know how many nodes are behind the
master, then it simply subtracts the total number of nodes behind all of its
slave nodes from the total number of nodes behind the master to get the
correct amount. To distribute kernels across nodes we use a simple
round-robin algorithm, i.e.~we iterate over all links of the current daemon
(including the link to its master) taking into account how many nodes are
behind each link: the pointer advances to the next link only when enough
kernels have been sent through the current link. That way, even if an
application is launched on a slave node at the bottom of the hierarchy, the
kernels will be distributed evenly across all cluster nodes. A kernel cannot
be sent back through the link from which it was received.
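A minimal sketch of this weighted round-robin selection is given below. It is
illustrative only: the \texttt{Link} and \texttt{RoundRobin} names are ours,
not the framework's, and the exclusion of the link through which a kernel
arrived is assumed to happen before the candidate links are collected.

\begin{verbatim}
#include <cstddef>
#include <vector>

// One entry per directly connected daemon (including the master).
struct Link {
    std::size_t nodes_behind;  // number of nodes behind this link, >= 1
};

// Weighted round-robin: a link receives kernels in proportion to the
// number of nodes behind it before the pointer advances.
class RoundRobin {
public:
    explicit RoundRobin(std::vector<Link> links)
        : links_(std::move(links)) {}

    // Index of the link through which to send the next kernel.
    std::size_t next() {
        std::size_t link = current_;
        if (++sent_ >= links_[current_].nodes_behind) {
            sent_ = 0;
            current_ = (current_ + 1) % links_.size();
        }
        return link;
    }

private:
    std::vector<Link> links_;
    std::size_t current_ = 0;  // link the pointer is at
    std::size_t sent_ = 0;     // kernels sent through it so far
};
\end{verbatim}

With two links weighted $\{2,1\}$ the selection sequence is
$0,0,1,0,0,1,\dots$, so every node behind either link receives the same share
of kernels.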
The advantage of this approach is that it can be extended to include
sophisticated logic in the load distribution policy. Any metric that is
required to implement such a policy can be sent from the directly linked
daemon during the link initialisation. As of now, only the number of nodes
behind the link is sent. The disadvantage of the approach is that an update
of the metric happens only when a change in the hierarchy occurs: if a metric
changes periodically, then periodically sending update messages is also
required to implement the policy, and too frequent updates may consume a
considerable amount of network bandwidth. The other disadvantage is that when
reconfiguration of the hierarchy occurs due to a node failure or a new node
joining the cluster, the kernels that are already being executed on the nodes
are not taken into account in the load distribution, so frequent updates to
the hierarchy may cause uneven load distribution (which, however, balances
over time). Uneven load distribution does not cause node overload, since
there is a kernel pool on each node that queues the kernels prior to
execution.

Dynamic master/slave role distribution coupled with kernel distribution makes
the overall system architecture homogeneous within a single cluster. On every
node the same daemon is run, and no configuration is needed to make a
hierarchy of daemons~--- it happens automatically.

\paragraph{Kernel layer} Consists of kernels and hierarchical (parent/child)
logical links between them. The only purpose of the kernel hierarchy is to
provide failover for kernels.

The framework provides classes and methods to simplify development of
distributed applications and middleware. The focus is to make a distributed
application resilient to failures, i.e.~make it fault-tolerant and highly
available, and to do so transparently to a programmer. All classes are
divided into two layers: the lower layer consists of classes for single-node
applications, and the upper layer consists of classes for applications that
run on an arbitrary number of nodes. There are two kinds of tightly coupled
entities in the framework~--- \emph{kernels} and \emph{pipelines}~--- which
are used together to compose a~programme.

Kernels implement control flow logic in their \Method{act} and \Method{react}
methods and store the state of the current control flow branch.
Domain-specific logic and state are implemented by a programmer. In
the~\Method{act} method some function is either sequentially computed or
decomposed into subtasks (represented by another set of kernels) which are
subsequently sent to a pipeline. In the~\Method{react} method subordinate
kernels that returned from the pipeline are processed by their parent. Calls
to \Method{act} and \Method{react} methods are asynchronous and are made
within threads spawned by a pipeline. For each kernel \Method{act} is called
only once, and for multiple kernels the calls are done in parallel to each
other, whereas the \Method{react} method is called once for each subordinate
kernel, and all the calls are made in the same thread to prevent race
conditions (for different parent kernels different threads may be used).
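The following C++ sketch shows how these entities might look. Only the
\Method{act}/\Method{react} pair is taken from the text; the class names, the
pipeline interface and the raw-pointer ownership are our simplifications, not
the framework's actual API.

\begin{verbatim}
#include <cstddef>

struct Kernel;

// Minimal pipeline interface assumed for this sketch.
struct Pipeline {
    virtual ~Pipeline() = default;
    virtual void send(Kernel* k) = 0;    // queue for async execution
    virtual void commit(Kernel* k) = 0;  // return finished kernel to parent
};

// A kernel stores the state of one control flow branch in its fields.
struct Kernel {
    virtual ~Kernel() = default;
    virtual void act(Pipeline&) = 0;           // called once per kernel
    virtual void react(Pipeline&, Kernel&) {}  // once per subordinate
    Kernel* parent = nullptr;
};

// Subordinate: sums one slice of an array, then returns to its parent.
struct PartSum : Kernel {
    const long* first = nullptr;
    std::size_t count = 0;
    long result = 0;
    void act(Pipeline& ppl) override {
        for (std::size_t i = 0; i < count; ++i) result += first[i];
        ppl.commit(this);  // the pipeline calls parent->react(*this)
    }
};

// Principal: decomposes the task in act(), accumulates in react().
struct TotalSum : Kernel {
    const long* data = nullptr;
    std::size_t size = 0, nparts = 0;
    long total = 0;            // plays the role of a stack variable
    std::size_t returned = 0;  // subordinates that have come back
    void act(Pipeline& ppl) override {
        std::size_t chunk = size / nparts;
        for (std::size_t i = 0; i < nparts; ++i) {
            PartSum* s = new PartSum;
            s->parent = this;
            s->first = data + i * chunk;
            s->count = (i + 1 == nparts) ? size - i * chunk : chunk;
            ppl.send(s);  // may be executed on any cluster node
        }
    }
    void react(Pipeline& ppl, Kernel& sub) override {
        total += static_cast<PartSum&>(sub).result;
        if (++returned == nparts) ppl.commit(this);  // all parts done
    }
};
\end{verbatim}

Because all calls to \Method{react} of the same \texttt{TotalSum} instance
are made in one thread, the fields \texttt{total} and \texttt{returned} need
no synchronisation.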
Pipelines implement asynchronous calls to \Method{act} and \Method{react},
and try to make as many parallel calls as possible considering the
concurrency of the platform (no.~of cores per node and no.~of nodes in a
cluster). A~pipeline consists of a kernel pool, which contains all the
subordinate kernels sent by their parents, and a thread pool that processes
kernels in accordance with the rules outlined in the previous paragraph.
A~separate pipeline exists for each compute device: there are pipelines for
parallel processing, schedule-based processing (periodic and delayed tasks),
and a proxy pipeline for processing of kernels on other cluster nodes
(see~fig.~\ref{fig:pipeline}).

In principle, the kernel and pipeline machinery mirrors that of procedures
and call stacks, with the advantage that kernel methods are called
asynchronously and in parallel to each other. The stack, which ordinarily
stores local variables, is modelled by fields of a kernel. The sequence of
processor instructions before nested procedure calls is modelled by
the~\Method{act} method, and the sequence of processor instructions after the
calls is modelled by the~\Method{react} method. The procedure calls
themselves are modelled by~constructing and sending subordinate kernels to
the pipeline. Two methods are necessary because the calls are asynchronous
and one must wait until subordinate kernels complete their work. Pipelines
allow circumventing active waiting, and call the correct kernel methods by
analysing their internal state.
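For comparison, here is the ordinary call-stack version of the computation
sketched earlier; the comments mark which parts of the procedure correspond
to \Method{act}, \Method{react} and kernel fields.

\begin{verbatim}
#include <cstddef>

// Sequential analogue of the kernel pair above: local variables
// correspond to kernel fields, the code before the nested calls
// to act(), and the code after them to react().
long sum(const long* data, std::size_t n) {
    if (n <= 1) return n ? data[0] : 0;   // leaf work: PartSum::act()
    std::size_t half = n / 2;             // decomposition: TotalSum::act()
    long a = sum(data, half);             // "subordinate" 1
    long b = sum(data + half, n - half);  // "subordinate" 2
    return a + b;                         // combination: TotalSum::react()
}
\end{verbatim}

The recursive calls block the caller, whereas subordinate kernels run
asynchronously; this is exactly why the combining code must live in a
separate method.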
\section{Resilience to multiple node failures}

To disambiguate hierarchical links between daemon processes and kernels and
to simplify the discussion, we will use the following naming conventions
throughout the text. If the link is between two daemon processes, the
relationship is \emph{master-slave}. If the link is between two kernels, then
the relationship is \emph{principal-subordinate} (or \emph{parent-child}).
The two hierarchies are orthogonal to each other in the sense that no kernel
may have a link to a daemon, and vice versa. Since the daemon hierarchy is
used to distribute the load on the cluster, the kernel hierarchy is mapped
onto it, and this mapping can be arbitrary. It is common to have a principal
kernel on a slave node with its subordinate kernels distributed evenly
between all cluster nodes (including the node where the principal is
located). Both hierarchies can be arbitrarily deep, but ``shallow'' ones are
preferred for highly parallel programmes, as there are fewer hops when
kernels are distributed between cluster nodes. Since there is a one-to-one
correspondence between daemons and cluster nodes, they are used
interchangeably in the paper.

In our system a node is considered failed if the corresponding network
connection is abruptly closed. Developing more sophisticated failure
detection techniques is beyond the scope of this paper. For the purpose of
studying recovery procedures upon node failure this simple approach is
sufficient.

Consequently, any kernel which resided on the failed node is considered
failed, and the failure recovery procedure is triggered. Depending on the
position of the kernel in the kernel hierarchy, recovery is carried out on
the node where the parent or one of the subordinate kernels resides. The
recovery procedure for a failed subordinate kernel is re-execution of this
kernel on a healthy node, which is triggered automatically by the node where
its parent kernel is located. If the subordinate communicates with other
subordinates of the same parent kernel and one of them fails, all these
kernels as well as their parent are considered failed, and a copy of the
parent is re-executed on a healthy node. If a parent kernel fails, then its
copy, which is sent along with every subordinate to other cluster nodes, is
re-executed on the node where the first surviving subordinate kernel resides.
Kernel failure is detected only for kernels that are sent from one node to
another (local kernels are not considered). A healthy node does not need to
be a new one; any already loaded node will do: recovery does not overload the
node, because each node has its own pool of kernels in which they wait before
being executed by a pipeline.

When a kernel is sent to another node, its copy is saved in the outbound
buffer (a list of kernels that were sent to a particular node), from which it
is removed only when the kernel returns to its parent. If the corresponding
connection closes, all kernels from this buffer are retrieved and distributed
across available nodes, including the current node. The failover algorithm is
straightforward for a subordinate, but for a principal it is more involved.
Whereas a subordinate is implicitly copied to another node as a consequence
of load distribution, a principal remains on one node. In order to implement
resilience to a principal failure, one needs to copy it along with each of
its subordinates to other nodes, and to provide a rule that determines from
which copy the principal is restored upon the node failure. The following
paragraphs discuss this algorithm and the rule in detail.

\subsection{Failure scenarios}
\label{sec:failure-scenarios}

The main purpose of the system is to provide continuous execution of kernels
in the presence of node failures. There are three types of such failures.
\begin{enumerate}
  \item Simultaneous failure of at most one node.
  \item Simultaneous failure of more than one node but less than the total
    number of nodes.
  \item Simultaneous failure of all nodes (electricity outage).
\end{enumerate}
For the sake of simplicity, it is assumed that the parallel programme runs on
all cluster nodes. Our system provides resilience to node failures for the
first and the second scenarios.

By dividing kernels into principals and subordinates we create recovery
points. Each principal is mainly a control unit with a goal. To achieve it,
the principal divides the task into parts and creates a subordinate to
compute each of them. The principal copies itself to each subordinate in the
order of their creation, and includes in each subordinate a list of all node
IP addresses to which previously created subordinates were sent (a list of
\emph{neighbours}), as sketched below. When a connection from a master node
to a slave node closes, either as a result of a node failure or as a
consequence of a daemon hierarchy change, all kernels which reside on the
corresponding cluster node are considered failed, and the recovery process is
triggered in accordance with the following scenarios.
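A sketch of this neighbour-list propagation is shown below;
\texttt{send\_to\_node} stands for the load-balancing send described earlier,
and all names and the serialised representation are assumptions made for
illustration.

\begin{verbatim}
#include <cstddef>
#include <cstdint>
#include <vector>

using IPv4Address = std::uint32_t;  // simplified address representation

// What travels with the i-th subordinate: a copy of the principal
// and the addresses of the nodes that received subordinates 0..i-1.
struct SubordinatePayload {
    std::vector<char> principal_copy;     // serialised principal state
    std::vector<IPv4Address> neighbours;  // previously used nodes
};

// Assumed helper: sends the payload and reports which node the
// load balancer actually chose.
IPv4Address send_to_node(const SubordinatePayload& p);

void distribute(std::size_t nparts,
                const std::vector<char>& principal_state) {
    std::vector<IPv4Address> used;
    for (std::size_t i = 0; i < nparts; ++i) {
        SubordinatePayload p{principal_state, used};
        used.push_back(send_to_node(p));  // later subordinates see it
    }
}
\end{verbatim}

Because \texttt{used} grows as subordinates are created, the $i$-th
subordinate knows about all earlier subordinates but none of the later ones,
which is exactly what the recovery rule below relies on.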
\paragraph*{Scenarios~1 \&~2} With respect to the kernel hierarchy, there are
three possible variants of this failure: when a principal fails, when a
subordinate fails (and both of them may or may not reside on the same cluster
node), and when any combination of a principal and its subordinates fail.

When a subordinate fails, its copy is simply restored from the outbound
buffer on the node where its principal is located. When the corresponding
network connection closes, all kernels from the buffer are automatically
distributed across available nodes, or executed locally if there are no
network connections.

When a principal fails, every subordinate has its copy, but we need to
restore it only once and only on one node to correctly continue programme
execution. To ensure that the principal is restored only once, each
subordinate tries to find the first surviving node from the IP address list
of neighbours. If such a node is online, the search stops and the subordinate
is deleted. If the node is not found, the subordinate restores the principal
from the copy on the current node and deletes itself. This algorithm is
executed on every node to which a copy of the principal was sent, and the
guarantee that only one copy of the principal is restored is provided by the
implied hierarchy of IP addresses: every subordinate of the principal has the
list of nodes to which only \emph{previously created} subordinates were sent,
and no communication originating from a previously created subordinate to a
newer subordinate is possible (only the other way round). Subordinate
deletion is necessary, because the whole computational step, modelled by the
principal, is re-executed from the initial state, and there is no simple and
reliable way of taking into account partial results which were produced so
far by the subordinates. Simultaneous failure of a combination of a principal
and a number of its subordinates is handled the same way.
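The recovery rule can be sketched as follows; the names and helper functions
are assumptions, and \texttt{try\_send\_to} is presumed to fail when the
target node is offline.

\begin{verbatim}
#include <cstdint>
#include <vector>

using IPv4Address = std::uint32_t;

// Assumed helpers: try_send_to() returns true if the node is alive
// and accepted the message; restore_principal_locally() re-executes
// the principal from the copy carried by this subordinate.
bool try_send_to(IPv4Address node);
void restore_principal_locally();

// Executed by every surviving subordinate when its principal fails.
void on_principal_failure(const std::vector<IPv4Address>& neighbours) {
    for (IPv4Address node : neighbours) {
        if (try_send_to(node)) {
            return;  // an older subordinate survives; it (or a yet
                     // older one) will restore the principal
        }
    }
    // No previously created subordinate is alive: this subordinate
    // is first in line, so the principal is restored here.
    restore_principal_locally();
}
// In both branches the subordinate then deletes itself, since the
// whole computational step is re-executed from the initial state.
\end{verbatim}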
\paragraph*{Deep kernel hierarchies} In a deep kernel hierarchy a kernel may
act as a subordinate and as a principal at the same time. Thus, we need to
copy not only the direct principal of each subordinate kernel, but also all
principals higher in the hierarchy, recursively. So, the additional variant
is a generalisation of the two previous ones for deep kernel hierarchies.

Handling principal failure in a deep kernel hierarchy may involve significant
overhead, because the failure is detected only when a subordinate finishes
its execution. So, for a sufficiently large number of subordinates, there can
be a situation in which some of them have finished their execution and
triggered principal recovery, whereas others continue their execution in
parallel with the newly created subordinates from the recovered principal.
This behaviour may not be desirable for programmes with sophisticated logic
which interact with external databases, as it may lead to deadlocks or
information corruption in the corresponding database. For batch processing
jobs this means that writing to files by multiple subordinates is not
reliable, and to avoid data loss the programme logic should be changed so
that only one (principal) kernel writes to a file, whereas subordinates only
process separate dataset parts.

\paragraph*{Scenario~3} The first two failure scenarios are handled at
runtime: the system will not stop execution of a programme if some of its
kernels are placed on a failed node, unless all nodes on which the programme
runs fail simultaneously. This last scenario commonly occurs as a result of
an electricity outage, and its main difference is the use of a kernel log.
The kernel log is stored on reliable storage and contains kernel initial
states, recorded at the beginning of their execution, and each ``update'' to
this state, recorded after a subordinate returns to its principal (a call to
\Method{react}). Each daemon maintains its own kernel log file, which is
replicated on the selected number of nodes to provide resilience. Replication
is configured externally by means of a parallel file system, a RAID array or
any other suitable technology.

When a daemon starts, recovery from the failure of all cluster nodes is
handled as follows.
\begin{itemize}
  \item First, the system waits until a defined timeout elapses before
    starting the recovery process, to ensure that as many nodes as possible
    are bootstrapped.
  \item Next, the system builds a sequential unified log from all log files
    for every programme that was run on the cluster when the failure
    occurred.
  \item After the unified log is ready, the system detects the latest states
    of all surviving kernels and re-executes them.
\end{itemize}

Recovery from a failure of all nodes is the most inefficient, because it
involves the use of persistent storage and there is no reliable way to ensure
that all cluster nodes have been bootstrapped. If some nodes were not
bootstrapped properly, the missing kernels are considered failed in
accordance with the first and the second scenarios. This may lead to
re-execution of a considerable portion of a parallel programme, especially
when multiple principal kernels in the same hierarchy branch have failed. If
a node fails in the middle of the recovery process, the whole process is
restarted from the beginning.
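The log records and the replay step can be sketched as follows; the actual
on-disk format is not specified here, so the field layout and the global
sequence number are assumptions.

\begin{verbatim}
#include <cstdint>
#include <map>
#include <vector>

// One record of the kernel log (layout assumed for illustration).
struct LogRecord {
    enum class Type : std::uint8_t {
        InitialState,  // written when a kernel starts executing
        StateUpdate    // written after each call to react()
    };
    Type type;
    std::uint64_t kernel_id;
    std::uint64_t seq;        // position in the unified log (assumed)
    std::vector<char> state;  // serialised kernel fields
};

void re_execute(const LogRecord& latest_state);  // assumed helper

// Replay: keep only the newest state of every kernel that was
// alive when the failure occurred, then re-execute those kernels.
void recover(const std::vector<std::vector<LogRecord>>& per_daemon_logs) {
    std::map<std::uint64_t, const LogRecord*> latest;
    for (const auto& log : per_daemon_logs)
        for (const auto& rec : log) {
            const LogRecord*& slot = latest[rec.kernel_id];
            if (slot == nullptr || rec.seq > slot->seq) slot = &rec;
        }
    for (const auto& entry : latest) re_execute(*entry.second);
}
\end{verbatim}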
\section{Evaluation}

The proposed node failure handling approach was evaluated on a real-world
application~\cite{factoryGithub}. The application generates an ocean wavy
surface in parallel with a specified frequency-directional spectrum. There
are two sequential steps in the programme. The first step is to compute model
coefficients by solving a system of linear algebraic equations. The system is
solved in parallel on all cores of the principal node. The second step is to
generate the wavy surface, parts of which are generated in parallel on all
cluster nodes including the principal one. All generated parts are written in
parallel to individual files. So, from a computational point of view the
programme is embarrassingly parallel with little synchronisation between
concurrent processes; the corresponding kernel hierarchy has one principal
and $N$ subordinates.

All experiments were run on a physical computer cluster consisting of 12
nodes. Wavy ocean surface parts were written to a Network File System (NFS)
located on a dedicated server. The input data was read from NFS mounted on
each node, as it is small in size and does not cause significant overhead.
The platform configuration is presented in
Table~\ref{tab:platform-configuration}. A dry run of each experiment~--- a
run in which all expensive computations (wavy surface generation and
coefficient computation) were disabled, but memory allocations and
communication between processes were retained~--- was performed on the
virtual cluster.

\begin{table}
  \centering
  \caption{Test platform configuration.\label{tab:platform-configuration}}
  \begin{tabular}{ll}
    \toprule
    CPU & Intel Xeon E5440, 2.83\,GHz \\
    RAM & 4\,GB \\
    HDD & ST3250310NS, 7200\,rpm \\
    No. of nodes & 12 \\
    No. of CPU cores per node & 8 \\
    Interconnect & 1\,Gbit Ethernet \\
    \bottomrule
  \end{tabular}
\end{table}

The first failure scenario (see Section~\ref{sec:failure-scenarios}) was
evaluated in the following experiment. At the beginning of the second
sequential application step all parallel application processes except one
were shut down with a small delay to give the principal kernel time to
distribute its subordinates between cluster nodes. The experiment was
repeated 12 times with a different surviving process each time. For each run
the total application running time was measured. In this experiment the
principal kernel was executed on the first node, and subordinate kernels were
evenly distributed across all nodes including the first one. The result of
the experiment is the overhead of recovery from a failure of a specific
kernel in the hierarchy, which should be different for principal and
subordinate kernels.

In the second experiment we benchmarked the overhead of the multiple node
failure handling code by instrumenting it with calls to time measuring
routines. For this experiment all logging and output was disabled to exclude
its time from the measurements. This test was repeated for different numbers
of cluster nodes.
The purpose of the experiment is to measure precisely the overhead of the
multiple node failure handling code and to investigate how failure handling
overhead affects scalability of the application to a large number of nodes.

\section{Results}

The first experiment showed that in terms of performance there are three
possible outcomes when all nodes except one fail
(fig.~\ref{fig:test-1-phys}). The first case is failure of all kernels except
the principal and its first subordinate. There is no communication with other
nodes to find the survivor and no recomputation of the current sequential
step of the application, so it takes the least time to recover from the
failure. The second case is failure of all kernels except any subordinate
kernel other than the first one. Here the survivor tries to communicate with
all subordinates that were created before it, so the overhead of recovery is
larger. The third case is failure of all kernels except the last subordinate.
Here performance differs only in the test environment, because this is the
node to which standard output and error streams from each parallel process
are copied over the network. So, the overhead is smaller, because there is no
communication over the network for streaming the output. The same effect does
not occur on the virtual cluster (fig.~\ref{fig:test-1-virt}). To summarise,
performance degradation is larger when the principal kernel fails, because
the survivor needs to recover the initial principal state from the backup and
start the current sequential application step again on the surviving node;
performance degradation is smaller when a subordinate kernel fails, because
there is no state to recover, and only the failed kernel is re-executed on
one of the remaining nodes.

\begin{figure}
  \centering
  \includegraphics{test-1-phys}
  \caption{Application running time in the presence of a failure of all
    physical cluster nodes except one for different surviving cluster
    nodes.\label{fig:test-1-phys}}
\end{figure}

\begin{figure}
  \centering
  \includegraphics{test-1-virt}
  \caption{Application running time in the presence of a failure of all
    virtual cluster nodes except one for different surviving cluster
    nodes.\label{fig:test-1-virt}}
\end{figure}

The second experiment showed that the overhead of the multiple node failure
handling code increases linearly with the number of nodes
(fig.~\ref{fig:test-2-phys}); however, its absolute value is small compared
to the programme run time. The linear increase in overhead is attributed to
the fact that for each subordinate kernel linear search algorithms are used
when sending it to or receiving it from another node, in order to maintain an
array of its neighbours. When a subordinate kernel is sent to a remote node,
the IP addresses of all of its previously created neighbours are added to the
neighbours array without duplication, and the kernel itself is appended to
the global internal map which stores principal kernels and their
subordinates; when a subordinate kernel returns from the remote node, it is
removed from the array of its principal's subordinates (retrieved from the
internal map), which also requires a linear search. So, the second experiment
showed that for a real-world programme the overhead of multiple node failure
handling is small.
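The bookkeeping responsible for this linear growth can be sketched as follows
(illustrative only; the container choice mirrors the description above):

\begin{verbatim}
#include <algorithm>
#include <cstdint>
#include <vector>

using IPv4Address = std::uint32_t;

// Adding a neighbour address deduplicates with a linear scan,
// hence the cost grows with the number of nodes.
void add_neighbour(std::vector<IPv4Address>& neighbours, IPv4Address a) {
    if (std::find(neighbours.begin(), neighbours.end(), a)
            == neighbours.end())
        neighbours.push_back(a);
}

// Removing a returned subordinate from its principal's array is a
// linear search as well.
template <class Kernel>
void remove_subordinate(std::vector<Kernel*>& subs, Kernel* s) {
    subs.erase(std::remove(subs.begin(), subs.end(), s), subs.end());
}
\end{verbatim}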
\begin{figure}
  \centering
  \includegraphics{test-2-phys}
  \caption{Overhead of failure handling code for different numbers of
    physical cluster nodes.\label{fig:test-2-phys}}
\end{figure}

\begin{figure}
  \centering
  \includegraphics{test-2-virt}
  \caption{Overhead of failure handling code for different numbers of
    virtual cluster nodes.\label{fig:test-2-virt}}
\end{figure}

\section{Discussion}

Linear complexity in the multiple node failure handling code can be avoided
by replacing arrays with sets or maps, but the total overhead is small, so we
deemed this optimisation an unnecessary complication of the source code.
Moreover, in a real-world scenario it is probably impractical to copy
principal kernel state to each subordinate node, and a minimal number of
copies may be configured in the programme instead. In this case using maps
and sets instead of arrays may incur more overhead, as they require a certain
number of elements to make searching for an element more efficient than in
arrays~\cite{alexandrescu2001modern,stroustrup2012software}. There is no such
thing as a minimal number of object copies that ensures fault tolerance in
HPC, but for parallel file systems there is a number of replicas. This number
is typically set to 2 or 3 depending on the particular site. We believe that
there is no need to set the number of object copies higher than that, as this
allows tolerating simultaneous failure of 2 and 3 nodes respectively: it
should be more than enough to tolerate node failures, which are common at
large scale~\cite{schroeder2007understanding}. So, using arrays with linear
search complexity is more efficient than maps and sets, because the number of
elements in them is small, and a linear search takes less time than a
constant-time hash-based lookup.

Transmitting IP addresses of previous nodes is an optimisation over mapping
to only linear hierarchies, that is, hierarchies where only one subordinate
is allowed at any given point in time. For a hierarchy consisting of a
principal kernel with multiple subordinates there is a unique mapping that
transforms it to a linear hierarchy: the principal creates and sends to the
pipeline only the first subordinate; after that the first subordinate creates
and sends the second subordinate to the pipeline, and so on. This approach is
inefficient because creation of subordinates is sequential and each
subordinate is created only after the previous one has been sent to a cluster
node. Moreover, each subordinate carries a copy of its parent to be able to
continue programme execution when the parent fails. Instead of transforming
the initial hierarchy to a linear one, one can copy the IP addresses of all
previously created subordinates along with the next subordinate to the
cluster node. The number of copies may be adjusted in the programme or a
configuration file. When the principal kernel fails, each subordinate looks
for a surviving subordinate kernel starting from the first address in the
list. If no such kernel is found, execution proceeds on the current node. The
sequence of IP addresses in the list implicitly forms a linear hierarchy,
which makes this optimisation equivalent to the transformation.

There are essentially two failure scenarios: failure of more than one node at
a time, and an electricity outage.
In the first scenario the failure is handled by sending a list of previous IP
addresses to the subsequent kernels in the batch. Then, if a subordinate node
and its master fail simultaneously, the surviving subordinate nodes scan all
of the IP addresses they received until they find a surviving node, and the
parent is revived on this node.

We believe that having kernel states and their inter-dependencies is enough
to mitigate any combination of node failures: given that at least one node
survives, all programmes continue their execution, possibly in a degraded
state. However, this requires recursively duplicating principals and sending
them along with the subordinates. Only an electricity outage requires writing
data to disk; other failures can be mitigated by duplicating kernels in
memory.

The framework has not been compared to other similar approaches, because to
the best of our knowledge there is no library or framework that provides
resilience to simultaneous failure of more than one node (including the
master node), and a comparison to the checkpoint/restart approach would be
unfair, as we do not stop all parallel processes of an application and dump a
RAM image to stable storage, but only copy kernels into the memory of another
node. This approach is far more efficient than checkpoint/restart, as no data
is written to disk, and only a small fraction of the whole memory occupied by
the application is copied to the other node.