Subordination: Providing Resilience to Simultaneous Failure of Multiple Cluster Nodes

\section{Related work}

The feature that distinguishes our research from other work is the use of
hierarchy as the only possible way of defining dependencies between the
objects into which a programme is decomposed. The main advantage of such a
hierarchy is that it makes handling object failures trivial.

In~\cite{zuckerman2011using} the authors describe the codelet model for
exascale machines. This model breaks a programme into small bits of
functionality, called codelets, and dependencies between them. The programme
dataflow is represented by a directed graph, which is called well-behaved if
forward progress of the programme is guaranteed. In contrast to our model, the
codelet model does not enforce hierarchical dependencies, and resilience to
failures is provided by object migration and relies on hardware fault
detection mechanisms. Furthermore, execution of kernel hierarchies in our
model resembles stack-based execution of ordinary programmes: the programme
finishes only when all subordinate kernels of the main kernel finish. Hence
there is no need to define a well-behaved graph to guarantee programme
termination.

In~\cite{meneses2015using} the authors describe the migratable objects model
for parallel programmes. In this model a programme is decomposed into objects
that communicate with each other by sending messages and can be migrated to
any cluster node if desired. The authors propose several ways in which this
model may enhance fault-tolerance techniques for Charm++/AMPI programmes:
proactive fault detection, checkpoint/restart and message logging. In
contrast to our model, migratable objects do not form a hierarchy; an object
may exchange messages with any other object whose address is known to the
sender. A spanning tree of nodes is used to orchestrate collective operations
between objects. This tree is similar to the tree hierarchy of nodes which we
use to distribute kernels between available cluster nodes, but we use our
hierarchy for any operation that requires distribution of work, not only for
collective operations. Our model does not use the techniques described in
that paper to provide fault tolerance: upon a failure we re-execute
subordinate kernels, and we copy principal kernels so that they can be
re-executed as well. Our approach blends checkpoint/restart and message
logging: each kernel that is sent to another cluster node is saved (logged)
in the outbound buffer of the sender and removed from the buffer upon return.
Since subordinate kernels are allowed to communicate only with their
principals (any other communication is possible only when the physical
location of the target kernel is known, and if such communication fails, the
kernel itself fails, which triggers recovery by its principal), the
collection of logs on all cluster nodes constitutes the current state of
programme execution, which is used to restart failed kernels on the
surviving nodes.
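
The logging scheme can be sketched as follows (the container and function
names below are chosen for illustration only and differ from the real
implementation): a copy of every kernel sent to another node is kept in the
sender's outbound buffer until the kernel returns, and when a node fails, the
kernels logged for that node are re-executed on surviving nodes.
\begin{verbatim}
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct Kernel { std::uint64_t id; /* payload omitted */ };

// Per-node outbound buffer: kernels sent to other nodes are logged here
// and erased only when they return to their principal.
class OutboundBuffer {
    // kernel id -> (destination node, copy of the kernel)
    std::unordered_map<std::uint64_t, std::pair<int, Kernel>> log_;
public:
    void on_send(int node, const Kernel& k) { log_[k.id] = {node, k}; }
    void on_return(const Kernel& k) { log_.erase(k.id); }

    // On failure of a node, every logged kernel that was sent to it
    // is collected for re-execution on one of the surviving nodes.
    std::vector<Kernel> on_node_failure(int node) {
        std::vector<Kernel> to_resend;
        for (const auto& [id, entry] : log_) {
            if (entry.first == node) { to_resend.push_back(entry.second); }
        }
        return to_resend;
    }
};
\end{verbatim}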

To summarise, the feature that distinguishes our model from other models
proposed to improve the fault tolerance of parallel programmes is the use of
a kernel hierarchy: an abstraction which defines a strict total order on a
set of kernels (their execution order) and, consequently, defines for each
kernel a principal kernel, whose responsibility is to re-execute failed
subordinate kernels upon a failure.

Compared to various high-availability cluster
projects~\cite{robertson2000linux,haddad2003ha,leangsuksun2005achieving}, our
approach has the following advantages. First, it scales to a large number of
nodes, because only point-to-point communication between slave and master
nodes is used instead of broadcast messages (as shown in our previous
work~\cite{gankevich2015subordination}); hence several switches and routers
can be used within a single cluster. Second, our approach does not require
standby servers to provide high availability of a master node: we provide
fault tolerance on the kernel layer instead. As the computation progresses,
kernels copy themselves to nodes that are logically connected to the current
one, and these can be any nodes of the cluster. Finally, high-availability
cluster projects do not deal with parallel programme failures: they aim to
provide high availability for services running on the master node (NFS, SMB,
DHCP, etc.), whereas our approach specifically targets continuous execution
of parallel applications.
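
A minimal sketch of this idea is given below (the class is illustrative and
the actual interface differs): each node keeps copies of principal kernels
received from its logically connected neighbours, so that no dedicated
standby server is needed, and on failure of a neighbour the stored copies are
simply re-executed locally.
\begin{verbatim}
#include <cstdint>
#include <map>

struct Kernel { std::uint64_t id; /* state omitted */ };

// Illustrative only: copies of principal kernels that run on a logically
// connected (neighbour) node are stored here and restarted if that
// neighbour fails.
class NeighbourCopies {
    std::map<std::uint64_t, Kernel> copies_;
public:
    void store(const Kernel& k) { copies_[k.id] = k; }    // copy received
    void discard(std::uint64_t id) { copies_.erase(id); } // kernel finished
    template <class Restart>
    void on_neighbour_failure(Restart restart) {
        for (auto& [id, k] : copies_) { restart(k); }     // re-execute copies
    }
};
\end{verbatim}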

\section{Conclusion}

In this paper we propose a system architecture consisting of two tree
hierarchies of entities, mapped onto each other, that simplifies providing
resilience to failures for parallel programmes. Resilience is provided solely
by the hierarchical dependencies between entities, independently on each
layer of the system. To optimise handling of failures of multiple cluster
nodes, we use the hierarchy implied by the order of creation of subordinate
entities. The hierarchical approach to fault tolerance is efficient, scales
to a large number of cluster nodes, and requires slow I/O operations only in
the most disastrous scenario: simultaneous failure of all cluster nodes.

Future work is to standardise the application programming interface of the
system and to investigate load-balancing techniques that are optimal for a
programme composed of many computational kernels.

\section*{Acknowledgement}
The research was carried out using computational resources of the Resource
Centre ``Computational Centre of Saint Petersburg State University''
(\mbox{T-EDGE96} \mbox{HPC-0011828-001}) within the framework of grants of
the Russian Foundation for Basic Research (projects no.~\mbox{16-07-01111},
\mbox{16-07-00886}, \mbox{16-07-01113}).