hpcs-17-subord

git clone https://git.igankevich.com/hpcs-17-subord.git

commit e0481e7d595d587a7dd04044977d32a94a30a434
parent 3b692627a5155b9deb2e7173d892bbc781ea6aa8
Author: Ivan Gankevich <igankevich@ya.ru>
Date:   Mon, 15 May 2017 13:34:28 +0300

Failure detection.

Diffstat:
src/head.tex | 19 +++++++++++--------
1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/src/head.tex b/src/head.tex
@@ -13,13 +13,14 @@ on hierarchy interactions, this framework provides continuous execution of a
 parallel programme in case of hardware errors or electricity outages.
 
 The aim of the research reported here is to investigate how continuous
-execution of parallel programmes can be provided on the level of software
-framework. This framework replaces both MPI library and batch job scheduler by
-intoriducing the notion of a kernel~--- a unit of work which can be copied
-between cluster nodes and re-executed any number of times~--- if it is required
-to provide resilience to node failures. In this paper we present an algorithm
-that guarantees continuous execution of a parallel programme upon failure of
-all nodes except one. This algorithm is based on the one developed in previous
+execution of parallel programmes in the presence of node failures can be
+provided on the level of software framework. This framework replaces both MPI
+library and batch job scheduler by introducing the notion of a kernel~--- a
+unit of work which can be copied between cluster nodes and re-executed any
+number of times~--- if it is required to provide resilience to node failures.
+In this paper we present an algorithm that guarantees continuous execution of a
+parallel programme upon failure of all nodes except one. This algorithm is
+based on the one developed in previous
 papers~\cite{gankevich2015subordination,gankevich2016factory}, where only one
 node failure at a time is guaranteed to not interrupt programme execution.
 
@@ -30,4 +31,6 @@ there is no need explicitly specify which kernels should be copied to other
 cluster nodes. However, its implementation cannot be used to provide fault
 tolerance to existing parallel programmes based on MPI or other libraries: the
 purpose of software framework developed here is to seamlessly provide fault
-tolerance for new parallel applications.
+tolerance for new parallel applications. If a failure is detected by some
+external programme, then removing this node from the cluster is as simple as
+killing the daemon process.