\section{Introduction}

In large-scale cluster environments node failures are common. In general, a
failure does not bring down the whole cluster, but it has a severe impact on
the jobs running on the faulty resources. Classical MPI programmes fail if any
one of the cluster nodes on which they run fails. Existing solutions mainly
focus on application checkpointing, but this approach becomes less efficient as
supercomputers and HPC clusters grow in size. Our approach to making cluster
computations reliable and efficient is to use a special framework that
structures a parallel programme as a strict hierarchy of parallel and
sequential parts. Using different fault tolerance scenarios based on
interactions within the hierarchy, this framework provides continuous execution
of a parallel programme in case of hardware errors or power outages.

The aim of the research reported here is to investigate how continuous
execution of parallel programmes in the presence of node failures can be
provided at the level of a software framework. This framework replaces both the
MPI library and the batch job scheduler by introducing the notion of a kernel:
a unit of work that can be copied between cluster nodes and re-executed any
number of times when this is required to provide resilience to node failures.
In this paper we present an algorithm that guarantees continuous execution of a
parallel programme upon failure of all nodes except one. This algorithm is
based on the one developed in previous
papers~\cite{gankevich2015subordination,gankevich2016factory}, where only one
node failure at a time is guaranteed not to interrupt programme execution.

In this paper failure detection methods are not studied, and node failure is
assumed if the corresponding network connection abruptly closes.
Node failure handling, provided by our algorithm, is transparent to the
programmer: there is no need to explicitly specify which kernels should be
copied to other cluster nodes. However, its implementation cannot be used to
provide fault tolerance to existing parallel programmes based on MPI or other
libraries: the purpose of the software framework developed here is to
seamlessly provide fault tolerance for new parallel applications. If a failure
is detected by some external programme, then removing the failed node from the
cluster is as simple as killing the daemon process which is an integral part of
the framework.