hpcs-17-subord

Subordination: Providing Resilience to Simultaneous Failure of Multiple Cluster Nodes
git clone https://git.igankevich.com/hpcs-17-subord.git

head.tex


\section{Introduction}

In large-scale cluster environments node failures are common. In general a
failure does not lead to global cluster malfunction, but it has a huge impact
on the jobs running on the faulty resources. Classical MPI programmes fail if
any one of the cluster nodes on which the programme runs fails. Existing
solutions mainly focus on application checkpointing, but with the increasing
size of supercomputers and HPC clusters this approach becomes less efficient.
Our approach to making cluster computations reliable and efficient is to use a
special framework that structures a parallel programme as a strict hierarchy
of parallel and sequential parts. Using different fault-tolerance scenarios
based on interactions within this hierarchy, the framework provides continuous
execution of a parallel programme in the event of hardware errors or power
outages.

The aim of the research reported here is to investigate how continuous
execution of parallel programmes in the presence of node failures can be
provided at the level of a software framework. This framework replaces both
the MPI library and the batch job scheduler by introducing the notion of a
kernel, a unit of work which can be copied between cluster nodes and
re-executed any number of times whenever this is required to provide
resilience to node failures. In this paper we present an algorithm that
guarantees continuous execution of a parallel programme upon failure of all
nodes except one. This algorithm is based on the one developed in previous
papers~\cite{gankevich2015subordination,gankevich2016factory}, where only one
node failure at a time is guaranteed not to interrupt programme execution.
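
To make the notion of a kernel concrete, the following C++ sketch shows a
minimal interface such a unit of work might expose. It is an illustration
under our own naming assumptions (\texttt{act}, \texttt{react},
\texttt{write} and \texttt{read} are hypothetical), not the framework's
actual API:

\begin{verbatim}
#include <iostream>

// Sketch of a kernel: a unit of work that can be serialised,
// copied to another cluster node and re-executed any number
// of times. All names here are illustrative.
struct Kernel {
    virtual ~Kernel() = default;
    // Perform the unit of work; may spawn subordinate kernels.
    virtual void act() = 0;
    // Collect the result of a finished subordinate kernel.
    virtual void react(Kernel* subordinate) = 0;
    // (De)serialise kernel state so that a copy can be sent
    // to another node or re-executed after a failure.
    virtual void write(std::ostream& out) const = 0;
    virtual void read(std::istream& in) = 0;
};
\end{verbatim}

Because the kernel state is fully captured by \texttt{write} and
\texttt{read}, the framework is free to keep copies of a kernel on several
nodes and re-execute one of them when the node running the original fails.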
     26 
     27 In this paper failure detection methods are not studied, and node failure is
     28 assumed if the corresponding network connection abruptly closes. Node
     29 failure handling, provided by our algorithm, is transparent for a programmer:
     30 there is no need explicitly specify which kernels should be copied to other
     31 cluster nodes. However, its implementation cannot be used to provide fault
     32 tolerance to existing parallel programmes based on MPI or other libraries: the
     33 purpose of software framework developed here is to seamlessly provide fault
     34 tolerance for new parallel applications. If a failure is detected by some
     35 external programme, then removing this node from the cluster is as simple as
     36 killing the daemon process which is integral part of the framework.