\begin{abstract}

In this paper we describe a new framework for creating distributed programmes
that are resilient to cluster node failures. Our main goal is to create a
simple and reliable model that ensures continuous execution of parallel
programmes without creating checkpoints, memory dumps or other I/O-intensive
activities. To achieve this we introduce a multi-layered system
architecture, each layer of which consists of unified entities organised into
hierarchies, and then show how this system handles different node failure
scenarios. We benchmark our system using a real-world HPC application
on both physical and virtual clusters. The results of the experiments show that
our approach has low overhead and scales to a large number of cluster nodes.

\end{abstract}