\section{System architecture}

Our model of a computer system has a layered architecture
(fig.~\ref{fig:pipeline}).

\begin{figure}%
  \centering%
  \includegraphics{ppl}%
  \caption{Mapping of parent and child process pipelines to compute devices.
    Solid lines denote aggregation, dashed lines denote mapping between
    logical and physical entities.\label{fig:pipeline}}
\end{figure}%

\paragraph{Physical layer} Consists of nodes and direct/routed physical
network links. On this layer full network connectivity, i.e.~the ability to
send a packet from one cluster node to any other, is assumed.

\paragraph{Daemon layer} Consists of daemon processes residing on cluster
nodes and hierarchical (master/slave) logical links between them. Master and
slave roles are dynamically assigned to daemon processes, i.e.~any physical
cluster node may become a master or a slave. Dynamic reassignment uses a
leader election algorithm that does not require periodic broadcasting of
messages; instead, the role is derived from the node's IP address. A detailed
explanation of the algorithm is provided
in~\cite{gankevich2015subordination}. Its strengths are scalability to a
large number of nodes and low overhead, which are essential for large-scale
high-performance computations; its weakness is the artificial dependence of a
node's position in the hierarchy on its IP address, which may not be
desirable in virtual environments, where nodes' IP addresses may change
without notice.

The only purpose of the daemon hierarchy is to provide load balancing and an
automatically reconfigurable logical tree hierarchy of cluster nodes. This
hierarchy is used to distribute the load from the current node to its
neighbours by simply iterating over all directly connected daemons. Upon
reconfiguration due to node failure or due to a new node joining the cluster,
daemons exchange messages telling each other how many daemons are ``behind''
the corresponding link in the hierarchy. This information is used to
distribute the load evenly, even if a parallel programme is launched on a
slave node. In addition, this topology reduces the number of simultaneous
connections, thus preventing network overload.

Load balancing is implemented as follows. When daemon $A$ tries to become a
subordinate of daemon $B$, it sends a message to the corresponding IP address
telling how many daemons are already connected to it (including itself). If
there are no connections, then it counts itself only. After all links between
daemons in the cluster are established, every daemon has enough information
to tell how many nodes exist behind each link. If the link is between a slave
and a master, and the slave wants to know how many nodes are behind the
master, then it simply subtracts the total number of nodes behind all of its
slave nodes from the total number of nodes behind the master to get the
correct amount. To distribute kernels across nodes we use a simple
round-robin algorithm, i.e.~we iterate over all links of the current daemon
(including the link to its master) taking into account how many nodes are
behind each link: the pointer advances to the next link only when enough
kernels have been sent through the current link. That way, even if an
application is launched on a slave node at the bottom of the hierarchy, the
kernels will be distributed evenly across all cluster nodes. A kernel cannot
be sent back through the link from which it was received.
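A minimal sketch of this weighted round-robin selection is given below. It is
illustrative only: the \texttt{Link} and \texttt{RoundRobin} names are ours,
not the framework's, and the exclusion of the link through which a kernel
arrived is assumed to happen before the candidate links are collected.

\begin{verbatim}
#include <cstddef>
#include <vector>

// One entry per directly connected daemon (including the master).
struct Link {
    std::size_t nodes_behind;  // number of nodes behind this link, >= 1
};

// Weighted round-robin: a link receives kernels in proportion to the
// number of nodes behind it before the pointer advances.
class RoundRobin {
public:
    explicit RoundRobin(std::vector<Link> links)
        : links_(std::move(links)) {}

    // Index of the link through which to send the next kernel.
    std::size_t next() {
        std::size_t link = current_;
        if (++sent_ >= links_[current_].nodes_behind) {
            sent_ = 0;
            current_ = (current_ + 1) % links_.size();
        }
        return link;
    }

private:
    std::vector<Link> links_;
    std::size_t current_ = 0;  // link the pointer is at
    std::size_t sent_ = 0;     // kernels sent through it so far
};
\end{verbatim}

With two links weighted $\{2,1\}$ the selection sequence is
$0,0,1,0,0,1,\dots$, so every node behind either link receives the same share
of kernels.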
The advantage of this approach is that it can be extended to include
sophisticated logic in the load distribution policy. Any metric that is
required to implement such a policy can be sent from the directly linked
daemon during the link initialisation. As of now, only the number of nodes
behind the link is sent. The disadvantage of the approach is that an update
of the metric happens only when a change in the hierarchy occurs: if a metric
changes periodically, then periodically sending update messages is also
required to implement the policy, and too frequent updates may consume a
considerable amount of network bandwidth. The other disadvantage is that when
reconfiguration of the hierarchy occurs due to a node failure or a new node
joining the cluster, the kernels that are already being executed on the nodes
are not taken into account in the load distribution, so frequent updates to
the hierarchy may cause uneven load distribution (which, however, balances
over time). Uneven load distribution does not cause node overload, since
there is a kernel pool on each node that queues the kernels prior to
execution.

Dynamic master/slave role distribution coupled with kernel distribution makes
the overall system architecture homogeneous within a single cluster. On every
node the same daemon is run, and no configuration is needed to make a
hierarchy of daemons~--- it happens automatically.

\paragraph{Kernel layer} Consists of kernels and hierarchical (parent/child)
logical links between them. The only purpose of the kernel hierarchy is to
provide failover for kernels.

The framework provides classes and methods to simplify development of
distributed applications and middleware. The focus is to make a distributed
application resilient to failures, i.e.~make it fault-tolerant and highly
available, and to do so transparently to a programmer. All classes are
divided into two layers: the lower layer consists of classes for single-node
applications, and the upper layer consists of classes for applications that
run on an arbitrary number of nodes. There are two kinds of tightly coupled
entities in the framework~--- \emph{kernels} and \emph{pipelines}~--- which
are used together to compose a~programme.

Kernels implement control flow logic in their \Method{act} and \Method{react}
methods and store the state of the current control flow branch.
Domain-specific logic and state are implemented by a programmer. In
the~\Method{act} method some function is either sequentially computed or
decomposed into subtasks (represented by another set of kernels) which are
subsequently sent to a pipeline. In the~\Method{react} method subordinate
kernels that returned from the pipeline are processed by their parent. Calls
to \Method{act} and \Method{react} methods are asynchronous and are made
within threads spawned by a pipeline. For each kernel \Method{act} is called
only once, and for multiple kernels the calls are done in parallel to each
other, whereas the \Method{react} method is called once for each subordinate
kernel, and all the calls are made in the same thread to prevent race
conditions (for different parent kernels different threads may be used).
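The following C++ sketch shows how these entities might look. Only the
\Method{act}/\Method{react} pair is taken from the text; the class names, the
pipeline interface and the raw-pointer ownership are our simplifications, not
the framework's actual API.

\begin{verbatim}
#include <cstddef>

struct Kernel;

// Minimal pipeline interface assumed for this sketch.
struct Pipeline {
    virtual ~Pipeline() = default;
    virtual void send(Kernel* k) = 0;    // queue for async execution
    virtual void commit(Kernel* k) = 0;  // return finished kernel to parent
};

// A kernel stores the state of one control flow branch in its fields.
struct Kernel {
    virtual ~Kernel() = default;
    virtual void act(Pipeline&) = 0;           // called once per kernel
    virtual void react(Pipeline&, Kernel&) {}  // once per subordinate
    Kernel* parent = nullptr;
};

// Subordinate: sums one slice of an array, then returns to its parent.
struct PartSum : Kernel {
    const long* first = nullptr;
    std::size_t count = 0;
    long result = 0;
    void act(Pipeline& ppl) override {
        for (std::size_t i = 0; i < count; ++i) result += first[i];
        ppl.commit(this);  // the pipeline calls parent->react(*this)
    }
};

// Principal: decomposes the task in act(), accumulates in react().
struct TotalSum : Kernel {
    const long* data = nullptr;
    std::size_t size = 0, nparts = 0;
    long total = 0;            // plays the role of a stack variable
    std::size_t returned = 0;  // subordinates that have come back
    void act(Pipeline& ppl) override {
        std::size_t chunk = size / nparts;
        for (std::size_t i = 0; i < nparts; ++i) {
            PartSum* s = new PartSum;
            s->parent = this;
            s->first = data + i * chunk;
            s->count = (i + 1 == nparts) ? size - i * chunk : chunk;
            ppl.send(s);  // may be executed on any cluster node
        }
    }
    void react(Pipeline& ppl, Kernel& sub) override {
        total += static_cast<PartSum&>(sub).result;
        if (++returned == nparts) ppl.commit(this);  // all parts done
    }
};
\end{verbatim}

Because all calls to \Method{react} of the same \texttt{TotalSum} instance
are made in one thread, the fields \texttt{total} and \texttt{returned} need
no synchronisation.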
Pipelines implement asynchronous calls to \Method{act} and \Method{react},
and try to make as many parallel calls as possible considering the
concurrency of the platform (no.~of cores per node and no.~of nodes in a
cluster). A~pipeline consists of a kernel pool, which contains all the
subordinate kernels sent by their parents, and a thread pool that processes
kernels in accordance with the rules outlined in the previous paragraph.
A~separate pipeline exists for each compute device: there are pipelines for
parallel processing, schedule-based processing (periodic and delayed tasks),
and a proxy pipeline for processing of kernels on other cluster nodes
(see~fig.~\ref{fig:pipeline}).

In principle, the kernel and pipeline machinery mirrors that of procedures
and call stacks, with the advantage that kernel methods are called
asynchronously and in parallel to each other. The stack, which ordinarily
stores local variables, is modelled by fields of a kernel. The sequence of
processor instructions before nested procedure calls is modelled by
the~\Method{act} method, and the sequence of processor instructions after the
calls is modelled by the~\Method{react} method. The procedure calls
themselves are modelled by~constructing and sending subordinate kernels to
the pipeline. Two methods are necessary because the calls are asynchronous
and one must wait until subordinate kernels complete their work. Pipelines
allow circumventing active waiting, and call the correct kernel methods by
analysing their internal state.
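For comparison, here is the ordinary call-stack version of the computation
sketched earlier; the comments mark which parts of the procedure correspond
to \Method{act}, \Method{react} and kernel fields.

\begin{verbatim}
#include <cstddef>

// Sequential analogue of the kernel pair above: local variables
// correspond to kernel fields, the code before the nested calls
// to act(), and the code after them to react().
long sum(const long* data, std::size_t n) {
    if (n <= 1) return n ? data[0] : 0;   // leaf work: PartSum::act()
    std::size_t half = n / 2;             // decomposition: TotalSum::act()
    long a = sum(data, half);             // "subordinate" 1
    long b = sum(data + half, n - half);  // "subordinate" 2
    return a + b;                         // combination: TotalSum::react()
}
\end{verbatim}

The recursive calls block the caller, whereas subordinate kernels run
asynchronously; this is exactly why the combining code must live in a
separate method.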
\section{Resilience to multiple node failures}

To disambiguate hierarchical links between daemon processes and kernels and
to simplify the discussion, we will use the following naming conventions
throughout the text. If the link is between two daemon processes, the
relationship is \emph{master-slave}. If the link is between two kernels, then
the relationship is \emph{principal-subordinate} (or \emph{parent-child}).
The two hierarchies are orthogonal to each other in the sense that no kernel
may have a link to a daemon, and vice versa. Since the daemon hierarchy is
used to distribute the load on the cluster, the kernel hierarchy is mapped
onto it, and this mapping can be arbitrary. It is common to have a principal
kernel on a slave node with its subordinate kernels distributed evenly
between all cluster nodes (including the node where the principal is
located). Both hierarchies can be arbitrarily deep, but ``shallow'' ones are
preferred for highly parallel programmes, as there are fewer hops when
kernels are distributed between cluster nodes. Since there is a one-to-one
correspondence between daemons and cluster nodes, they are used
interchangeably in the paper.

In our system a node is considered failed if the corresponding network
connection is abruptly closed. Developing more sophisticated failure
detection techniques is beyond the scope of this paper. For the purpose of
studying recovery procedures upon node failure this simple approach is
sufficient.

Consequently, any kernel which resided on the failed node is considered
failed, and the failure recovery procedure is triggered. Depending on the
position of the kernel in the kernel hierarchy, recovery is carried out on
the node where the parent or one of the subordinate kernels resides. The
recovery procedure for a failed subordinate kernel is re-execution of this
kernel on a healthy node, which is triggered automatically by the node where
its parent kernel is located. If the subordinate communicates with other
subordinates of the same parent kernel and one of them fails, all these
kernels as well as their parent are considered failed, and a copy of the
parent is re-executed on a healthy node. If a parent kernel fails, then its
copy, which is sent along with every subordinate to other cluster nodes, is
re-executed on the node where the first surviving subordinate kernel resides.
Kernel failure is detected only for kernels that are sent from one node to
another (local kernels are not considered). A healthy node does not need to
be a new one; any already loaded node will do: recovery does not overload the
node, because each node has its own pool of kernels in which they wait before
being executed by a pipeline.

When a kernel is sent to another node, its copy is saved in the outbound
buffer (a list of kernels that were sent to a particular node), from which it
is removed only when the kernel returns to its parent. If the corresponding
connection closes, all kernels from this buffer are retrieved and distributed
across available nodes, including the current node. The failover algorithm is
straightforward for a subordinate, but for a principal it is more involved.
Whereas a subordinate is implicitly copied to another node as a consequence
of load distribution, a principal remains on one node. In order to implement
resilience to a principal failure, one needs to copy it along with each of
its subordinates to other nodes, and to provide a rule that determines from
which copy the principal is restored upon the node failure. The following
paragraphs discuss this algorithm and the rule in detail.

\subsection{Failure scenarios}
\label{sec:failure-scenarios}

The main purpose of the system is to provide continuous execution of kernels
in the presence of node failures. There are three types of such failures.
\begin{enumerate}
  \item Simultaneous failure of at most one node.
  \item Simultaneous failure of more than one node but less than the total
    number of nodes.
  \item Simultaneous failure of all nodes (electricity outage).
\end{enumerate}
For the sake of simplicity, it is assumed that the parallel programme runs on
all cluster nodes. Our system provides resilience to node failures for the
first and the second scenarios.

By dividing kernels into principals and subordinates we create recovery
points. Each principal is mainly a control unit with a goal. To achieve it,
the principal divides the task into parts and creates a subordinate to
compute each of them. The principal copies itself to each subordinate in the
order of their creation, and includes in each subordinate a list of all node
IP addresses to which previously created subordinates were sent (a list of
\emph{neighbours}), as sketched below. When a connection from a master node
to a slave node closes, either as a result of a node failure or as a
consequence of a daemon hierarchy change, all kernels which reside on the
corresponding cluster node are considered failed, and the recovery process is
triggered in accordance with the following scenarios.
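A sketch of this neighbour-list propagation is shown below;
\texttt{send\_to\_node} stands for the load-balancing send described earlier,
and all names and the serialised representation are assumptions made for
illustration.

\begin{verbatim}
#include <cstddef>
#include <cstdint>
#include <vector>

using IPv4Address = std::uint32_t;  // simplified address representation

// What travels with the i-th subordinate: a copy of the principal
// and the addresses of the nodes that received subordinates 0..i-1.
struct SubordinatePayload {
    std::vector<char> principal_copy;     // serialised principal state
    std::vector<IPv4Address> neighbours;  // previously used nodes
};

// Assumed helper: sends the payload and reports which node the
// load balancer actually chose.
IPv4Address send_to_node(const SubordinatePayload& p);

void distribute(std::size_t nparts,
                const std::vector<char>& principal_state) {
    std::vector<IPv4Address> used;
    for (std::size_t i = 0; i < nparts; ++i) {
        SubordinatePayload p{principal_state, used};
        used.push_back(send_to_node(p));  // later subordinates see it
    }
}
\end{verbatim}

Because \texttt{used} grows as subordinates are created, the $i$-th
subordinate knows about all earlier subordinates but none of the later ones,
which is exactly what the recovery rule below relies on.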
\paragraph*{Scenarios~1 \&~2} With respect to the kernel hierarchy, there are
three possible variants of this failure: when a principal fails, when a
subordinate fails (and both of them may or may not reside on the same cluster
node), and when any combination of a principal and its subordinates fail.

When a subordinate fails, its copy is simply restored from the outbound
buffer on the node where its principal is located. When the corresponding
network connection closes, all kernels from the buffer are automatically
distributed across available nodes, or executed locally if there are no
network connections.

When a principal fails, every subordinate has its copy, but we need to
restore it only once and only on one node to correctly continue programme
execution. To ensure that the principal is restored only once, each
subordinate tries to find the first surviving node from the IP address list
of neighbours. If such a node is online, the search stops and the subordinate
is deleted. If the node is not found, the subordinate restores the principal
from the copy on the current node and deletes itself. This algorithm is
executed on every node to which a copy of the principal was sent, and the
guarantee that only one copy of the principal is restored is provided by the
implied hierarchy of IP addresses: every subordinate of the principal has the
list of nodes to which only \emph{previously created} subordinates were sent,
and no communication originating from a previously created subordinate to a
newer subordinate is possible (only the other way round). Subordinate
deletion is necessary, because the whole computational step, modelled by the
principal, is re-executed from the initial state, and there is no simple and
reliable way of taking into account partial results which were produced so
far by the subordinates. Simultaneous failure of a combination of a principal
and a number of its subordinates is handled the same way.
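The recovery rule can be sketched as follows; the names and helper functions
are assumptions, and \texttt{try\_send\_to} is presumed to fail when the
target node is offline.

\begin{verbatim}
#include <cstdint>
#include <vector>

using IPv4Address = std::uint32_t;

// Assumed helpers: try_send_to() returns true if the node is alive
// and accepted the message; restore_principal_locally() re-executes
// the principal from the copy carried by this subordinate.
bool try_send_to(IPv4Address node);
void restore_principal_locally();

// Executed by every surviving subordinate when its principal fails.
void on_principal_failure(const std::vector<IPv4Address>& neighbours) {
    for (IPv4Address node : neighbours) {
        if (try_send_to(node)) {
            return;  // an older subordinate survives; it (or a yet
                     // older one) will restore the principal
        }
    }
    // No previously created subordinate is alive: this subordinate
    // is first in line, so the principal is restored here.
    restore_principal_locally();
}
// In both branches the subordinate then deletes itself, since the
// whole computational step is re-executed from the initial state.
\end{verbatim}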
\paragraph*{Deep kernel hierarchies} In a deep kernel hierarchy a kernel may
act as a subordinate and as a principal at the same time. Thus, we need to
copy not only the direct principal of each subordinate kernel, but also all
principals higher in the hierarchy, recursively. So, the additional variant
is a generalisation of the two previous ones for deep kernel hierarchies.

Handling principal failure in a deep kernel hierarchy may involve significant
overhead, because the failure is detected only when a subordinate finishes
its execution. So, for a sufficiently large number of subordinates, there can
be a situation in which some of them have finished their execution and
triggered principal recovery, whereas others continue their execution in
parallel with the newly created subordinates from the recovered principal.
This behaviour may not be desirable for programmes with sophisticated logic
which interact with external databases, as it may lead to deadlocks or
information corruption in the corresponding database. For batch processing
jobs this means that writing to files by multiple subordinates is not
reliable, and to avoid data loss the programme logic should be changed so
that only one (principal) kernel writes to a file, whereas subordinates only
process separate dataset parts.

\paragraph*{Scenario~3} The first two failure scenarios are handled at
runtime: the system will not stop execution of a programme if some of its
kernels are placed on a failed node, unless all nodes on which the programme
runs fail simultaneously. This last scenario commonly occurs as a result of
an electricity outage, and its main difference is the use of a kernel log.
The kernel log is stored on reliable storage and contains kernel initial
states, recorded at the beginning of their execution, and each ``update'' to
this state, recorded after a subordinate returns to its principal (a call to
\Method{react}). Each daemon maintains its own kernel log file, which is
replicated on the selected number of nodes to provide resilience. Replication
is configured externally by means of a parallel file system, a RAID array or
any other suitable technology.

When a daemon starts, recovery from the failure of all cluster nodes is
handled as follows.
\begin{itemize}
  \item First, the system waits until a defined timeout elapses before
    starting the recovery process, to ensure that as many nodes as possible
    are bootstrapped.
  \item Next, the system builds a sequential unified log from all log files
    for every programme that was run on the cluster when the failure
    occurred.
  \item After the unified log is ready, the system detects the latest states
    of all surviving kernels and re-executes them.
\end{itemize}

Recovery from a failure of all nodes is the most inefficient, because it
involves the use of persistent storage and there is no reliable way to ensure
that all cluster nodes have been bootstrapped. If some nodes were not
bootstrapped properly, the missing kernels are considered failed in
accordance with the first and the second scenarios. This may lead to
re-execution of a considerable portion of a parallel programme, especially
when multiple principal kernels in the same hierarchy branch have failed. If
a node fails in the middle of the recovery process, the whole process is
restarted from the beginning.
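The log records and the replay step can be sketched as follows; the actual
on-disk format is not specified here, so the field layout and the global
sequence number are assumptions.

\begin{verbatim}
#include <cstdint>
#include <map>
#include <vector>

// One record of the kernel log (layout assumed for illustration).
struct LogRecord {
    enum class Type : std::uint8_t {
        InitialState,  // written when a kernel starts executing
        StateUpdate    // written after each call to react()
    };
    Type type;
    std::uint64_t kernel_id;
    std::uint64_t seq;        // position in the unified log (assumed)
    std::vector<char> state;  // serialised kernel fields
};

void re_execute(const LogRecord& latest_state);  // assumed helper

// Replay: keep only the newest state of every kernel that was
// alive when the failure occurred, then re-execute those kernels.
void recover(const std::vector<std::vector<LogRecord>>& per_daemon_logs) {
    std::map<std::uint64_t, const LogRecord*> latest;
    for (const auto& log : per_daemon_logs)
        for (const auto& rec : log) {
            const LogRecord*& slot = latest[rec.kernel_id];
            if (slot == nullptr || rec.seq > slot->seq) slot = &rec;
        }
    for (const auto& entry : latest) re_execute(*entry.second);
}
\end{verbatim}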
\section{Evaluation}

The proposed node failure handling approach was evaluated on a real-world
application~\cite{factoryGithub}. The application generates an ocean wavy
surface in parallel with a specified frequency-directional spectrum. There
are two sequential steps in the programme. The first step is to compute model
coefficients by solving a system of linear algebraic equations. The system is
solved in parallel on all cores of the principal node. The second step is to
generate the wavy surface, parts of which are generated in parallel on all
cluster nodes including the principal one. All generated parts are written in
parallel to individual files. So, from a computational point of view the
programme is embarrassingly parallel with little synchronisation between
concurrent processes; the corresponding kernel hierarchy has one principal
and $N$ subordinates.

All experiments were run on a physical computer cluster consisting of 12
nodes. Wavy ocean surface parts were written to a Network File System (NFS)
located on a dedicated server. The input data was read from NFS mounted on
each node, as it is small in size and does not cause significant overhead.
The platform configuration is presented in
Table~\ref{tab:platform-configuration}. A dry run of each experiment~--- a
run in which all expensive computations (wavy surface generation and
coefficient computation) were disabled, but memory allocations and
communication between processes were retained~--- was performed on the
virtual cluster.

\begin{table}
  \centering
  \caption{Test platform configuration.\label{tab:platform-configuration}}
  \begin{tabular}{ll}
    \toprule
    CPU & Intel Xeon E5440, 2.83\,GHz \\
    RAM & 4\,GB \\
    HDD & ST3250310NS, 7200\,rpm \\
    No. of nodes & 12 \\
    No. of CPU cores per node & 8 \\
    Interconnect & 1\,Gbit Ethernet \\
    \bottomrule
  \end{tabular}
\end{table}

The first failure scenario (see Section~\ref{sec:failure-scenarios}) was
evaluated in the following experiment. At the beginning of the second
sequential application step all parallel application processes except one
were shut down with a small delay to give the principal kernel time to
distribute its subordinates between cluster nodes. The experiment was
repeated 12 times with a different surviving process each time. For each run
the total application running time was measured. In this experiment the
principal kernel was executed on the first node, and subordinate kernels were
evenly distributed across all nodes including the first one. The result of
the experiment is the overhead of recovery from a failure of a specific
kernel in the hierarchy, which should be different for principal and
subordinate kernels.

In the second experiment we benchmarked the overhead of the multiple node
failure handling code by instrumenting it with calls to time measuring
routines. For this experiment all logging and output was disabled to exclude
its time from the measurements. This test was repeated for different numbers
of cluster nodes.
The purpose of the experiment is to measure precisely the overhead of the
multiple node failure handling code and to investigate how failure handling
overhead affects scalability of the application to a large number of nodes.

\section{Results}

The first experiment showed that in terms of performance there are three
possible outcomes when all nodes except one fail
(fig.~\ref{fig:test-1-phys}). The first case is failure of all kernels except
the principal and its first subordinate. There is no communication with other
nodes to find the survivor and no recomputation of the current sequential
step of the application, so it takes the least time to recover from the
failure. The second case is failure of all kernels except any subordinate
kernel other than the first one. Here the survivor tries to communicate with
all subordinates that were created before it, so the overhead of recovery is
larger. The third case is failure of all kernels except the last subordinate.
Here performance differs only in the test environment, because this is the
node to which standard output and error streams from each parallel process
are copied over the network. So, the overhead is smaller, because there is no
communication over the network for streaming the output. The same effect does
not occur on the virtual cluster (fig.~\ref{fig:test-1-virt}). To summarise,
performance degradation is larger when the principal kernel fails, because
the survivor needs to recover the initial principal state from the backup and
start the current sequential application step again on the surviving node;
performance degradation is smaller when a subordinate kernel fails, because
there is no state to recover, and only the failed kernel is re-executed on
one of the remaining nodes.

\begin{figure}
  \centering
  \includegraphics{test-1-phys}
  \caption{Application running time in the presence of a failure of all
    physical cluster nodes except one for different surviving cluster
    nodes.\label{fig:test-1-phys}}
\end{figure}

\begin{figure}
  \centering
  \includegraphics{test-1-virt}
  \caption{Application running time in the presence of a failure of all
    virtual cluster nodes except one for different surviving cluster
    nodes.\label{fig:test-1-virt}}
\end{figure}

The second experiment showed that the overhead of the multiple node failure
handling code increases linearly with the number of nodes
(fig.~\ref{fig:test-2-phys}); however, its absolute value is small compared
to the programme run time. The linear increase in overhead is attributed to
the fact that for each subordinate kernel linear search algorithms are used
when sending it to or receiving it from another node, in order to maintain an
array of its neighbours. When a subordinate kernel is sent to a remote node,
the IP addresses of all of its previously created neighbours are added to the
neighbours array without duplication, and the kernel itself is appended to
the global internal map which stores principal kernels and their
subordinates; when a subordinate kernel returns from the remote node, it is
removed from the array of its principal's subordinates (retrieved from the
internal map), which also requires a linear search. So, the second experiment
showed that for a real-world programme the overhead of multiple node failure
handling is small.
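The bookkeeping responsible for this linear growth can be sketched as follows
(illustrative only; the container choice mirrors the description above):

\begin{verbatim}
#include <algorithm>
#include <cstdint>
#include <vector>

using IPv4Address = std::uint32_t;

// Adding a neighbour address deduplicates with a linear scan,
// hence the cost grows with the number of nodes.
void add_neighbour(std::vector<IPv4Address>& neighbours, IPv4Address a) {
    if (std::find(neighbours.begin(), neighbours.end(), a)
            == neighbours.end())
        neighbours.push_back(a);
}

// Removing a returned subordinate from its principal's array is a
// linear search as well.
template <class Kernel>
void remove_subordinate(std::vector<Kernel*>& subs, Kernel* s) {
    subs.erase(std::remove(subs.begin(), subs.end(), s), subs.end());
}
\end{verbatim}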
\begin{figure}
  \centering
  \includegraphics{test-2-phys}
  \caption{Overhead of failure handling code for different numbers of
    physical cluster nodes.\label{fig:test-2-phys}}
\end{figure}

\begin{figure}
  \centering
  \includegraphics{test-2-virt}
  \caption{Overhead of failure handling code for different numbers of
    virtual cluster nodes.\label{fig:test-2-virt}}
\end{figure}

\section{Discussion}

Linear complexity in the multiple node failure handling code can be avoided
by replacing arrays with sets or maps, but the total overhead is small, so we
deemed this optimisation an unnecessary complication of the source code.
Moreover, in a real-world scenario it is probably impractical to copy
principal kernel state to each subordinate node, and a minimal number of
copies may be configured in the programme instead. In this case using maps
and sets instead of arrays may incur more overhead, as they require a certain
number of elements to make searching for an element more efficient than in
arrays~\cite{alexandrescu2001modern,stroustrup2012software}. There is no such
thing as a minimal number of object copies that ensures fault tolerance in
HPC, but for parallel file systems there is a number of replicas. This number
is typically set to 2 or 3 depending on the particular site. We believe that
there is no need to set the number of object copies higher than that, as this
allows tolerating simultaneous failure of 2 and 3 nodes respectively: it
should be more than enough to tolerate node failures, which are common at
large scale~\cite{schroeder2007understanding}. So, using arrays with linear
search complexity is more efficient than maps and sets, because the number of
elements in them is small, and a linear search takes less time than a
constant-time hash-based lookup.

Transmitting IP addresses of previous nodes is an optimisation over mapping
to only linear hierarchies, that is, hierarchies where only one subordinate
is allowed at any given point in time. For a hierarchy consisting of a
principal kernel with multiple subordinates there is a unique mapping that
transforms it to a linear hierarchy: the principal creates and sends to the
pipeline only the first subordinate; after that the first subordinate creates
and sends the second subordinate to the pipeline, and so on. This approach is
inefficient because creation of subordinates is sequential and each
subordinate is created only after the previous one has been sent to a cluster
node. Moreover, each subordinate carries a copy of its parent to be able to
continue programme execution when the parent fails. Instead of transforming
the initial hierarchy to a linear one, one can copy the IP addresses of all
previously created subordinates along with the next subordinate to the
cluster node. The number of copies may be adjusted in the programme or a
configuration file. When the principal kernel fails, each subordinate looks
for a surviving subordinate kernel starting from the first address in the
list. If no such kernel is found, execution proceeds on the current node. The
sequence of IP addresses in the list implicitly forms a linear hierarchy,
which makes this optimisation equivalent to the transformation.

There are essentially two failure scenarios: failure of more than one node at
a time, and an electricity outage.
In the first scenario the failure is handled by sending a list of previous IP
addresses to the subsequent kernels in the batch. Then, if a subordinate node
and its master fail simultaneously, the surviving subordinate nodes scan all
of the IP addresses they received until they find a surviving node, and the
parent is revived on this node.

We believe that having kernel states and their inter-dependencies is enough
to mitigate any combination of node failures: given that at least one node
survives, all programmes continue their execution, possibly in a degraded
state. However, this requires recursively duplicating principals and sending
them along with the subordinates. Only an electricity outage requires writing
data to disk; other failures can be mitigated by duplicating kernels in
memory.

The framework has not been compared to other similar approaches, because to
the best of our knowledge there is no library or framework that provides
resilience to simultaneous failure of more than one node (including the
master node), and a comparison to the checkpoint/restart approach would be
unfair, as we do not stop all parallel processes of an application and dump a
RAM image to stable storage, but only copy kernels into the memory of another
node. This approach is far more efficient than checkpoint/restart, as no data
is written to disk, and only a small fraction of the whole memory occupied by
the application is copied to the other node.