Subordination: Providing Resilience to Simultaneous Failure of Multiple Cluster Nodes

\begin{abstract}

In this paper we describe a new framework for creating distributed programmes
that are resilient to cluster node failures. Our main goal is to create a
simple and reliable model that ensures continuous execution of parallel
programmes without creating checkpoints, memory dumps, or other I/O-intensive
activities. To achieve this, we introduce a multi-layered system architecture,
each layer of which consists of unified entities organised into hierarchies,
and then show how this system handles different node failure scenarios. We
benchmark our system using a real-world HPC application on both physical and
virtual clusters. The results of the experiments show that our approach has
low overhead and scales to a large number of cluster nodes.

\end{abstract}