hpcs-17-subord

git clone https://git.igankevich.com/hpcs-17-subord.git

commit 5f06a229c6a651b5b0d6e8c1bd5d6c7bc6227e1e
parent 79068920b16525da87c0f55fcd3349e95311b03e
Author: Ivan Gankevich <igankevich@ya.ru>
Date:   Fri, 24 Mar 2017 20:44:05 +0300

Revise abstract.

Diffstat:
src/abstract.tex | 20 ++++++++++++--------
1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/src/abstract.tex b/src/abstract.tex
@@ -1,10 +1,14 @@
 \begin{abstract}
- In this paper we describe a new framework for creating a reliable to hardware
- errors distributed programs. Our main goal was to create a simply yet powerful
- tool to archiving fault tolerance without creation of checkpoints, memory
- dumps and other highly disk usage activities. To archive this we first
- introduce a strong hierarchy of program components (or parts) and then discuss
- about scenarios for continue computations. The programs parts hierarchy based
- on Actor model by C. Hewitt, failure scenarios cover most common hardware
- errors; software error handling are not covered by this article.
+
+In this paper we describe a new framework for creating distributed programmes
+which are resilient to cluster node failures. Our main goal is to create a
+simple and reliable model, that ensures continuous execution of parallel
+programmes without creation of checkpoints, memory dumps and other I/O
+intensive activities. To achieve this we introduce multi-layered system
+architecture, each layer of which consists of unified entities organised into
+hierarchies, and then show how this system handles different node failure
+scenarios. We benchmark our system on the example of real-world HPC application
+on both physical and virtual clusters. The results of the experiments show that
+our approach has low overhead and scales to a large number of cluster nodes.
+
 \end{abstract}