grid-21-mpi

git clone https://git.igankevich.com/grid-21-mpi.git

commit 88a7005cec64e444836bf5703d9c6a374d88167d
Author: Ivan Gankevich <i.gankevich@spbu.ru>
Date:   Mon, 31 May 2021 18:09:16 +0300

Initial.

Diffstat:
.gitignore | 40++++++++++++++++++++++++++++++++++++++++
abstract.txt | 17+++++++++++++++++
2 files changed, 57 insertions(+), 0 deletions(-)

diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,40 @@
+
+# Created by https://www.toptal.com/developers/gitignore/api/vim,linux
+# Edit at https://www.toptal.com/developers/gitignore?templates=vim,linux
+
+### Linux ###
+*~
+
+# temporary files which can be created if a process still has a handle open of a deleted file
+.fuse_hidden*
+
+# KDE directory preferences
+.directory
+
+# Linux trash folder which might appear on any partition or disk
+.Trash-*
+
+# .nfs files are created when an open file is removed but is still being accessed
+.nfs*
+
+### Vim ###
+# Swap
+[._]*.s[a-v][a-z]
+!*.svg # comment out if you don't need vector files
+[._]*.sw[a-p]
+[._]s[a-rt-v][a-z]
+[._]ss[a-gi-z]
+[._]sw[a-p]
+
+# Session
+Session.vim
+Sessionx.vim
+
+# Temporary
+.netrwhist
+# Auto-generated tag files
+tags
+# Persistent undo
+[._]*.un~
+
+# End of https://www.toptal.com/developers/gitignore/api/vim,linux
diff --git a/abstract.txt b/abstract.txt
@@ -0,0 +1,17 @@
+Verifiable application-level checkpoint and restart framework for parallel computing
+
+Ivan Gankevich, Ivan Petriakov, Anton Gavrikov, Dmitry Tereschenko, Gleb Mozhaiskii
+
+Fault tolerance of parallel and distributed applications is one of the concerns
+that becomes topical for large computer clusters and large distributed systems.
+For a long time the common solution to this problem was checkpoint and restart
+mechanisms implemented on operating system level, however, they are inefficient
+for large systems and now application-level checkpoint and restart is considered
+as a more efficient alternative. In this paper we implement application-level
+checkpoint and restart manually for the well-known parallel computing benchmarks
+to evaluate this alternative approach. We measure the overheads introduced
+by creating and restarting from a checkpoint, and the amount of effort
+that is needed to implement and verify the correctness of the resulting programme.
+Based on the results we propose generic framework for application-level checkpointing
+that simplifies the process and allows to verify that the application
+gives correct output when restarted from any checkpoint.
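
Note on the abstract: the verification idea it describes (the application gives correct output when restarted from any checkpoint) can be illustrated with a small sketch. This is not code from this repository and not the proposed framework. The toy workload and the names step, run and verify_restart are hypothetical, and JSON strings stand in for checkpoint files; a real MPI application would write per-process state to storage instead.

```python
import json


def step(state):
    """Advance the toy computation by one iteration (hypothetical workload)."""
    i = state["i"]
    return {"i": i + 1, "acc": state["acc"] + i * i}


def run(state, n_steps, checkpoints=None):
    """Run until iteration n_steps and return the final result.

    If a list is passed as `checkpoints`, a serialised copy of the state
    is appended to it before every step (the application-level checkpoint).
    """
    while state["i"] < n_steps:
        if checkpoints is not None:
            checkpoints.append(json.dumps(state))
        state = step(state)
    return state["acc"]


def verify_restart(n_steps):
    """Check that restarting from every checkpoint reproduces the reference output."""
    checkpoints = []
    expected = run({"i": 0, "acc": 0}, n_steps, checkpoints)
    for blob in checkpoints:
        restored = json.loads(blob)  # restart: rebuild the state from the checkpoint
        if run(restored, n_steps) != expected:
            return False
    return True
```

Calling verify_restart(10) performs one uninterrupted reference run and then one restart per checkpoint, comparing each final result with the reference. The overhead measurements mentioned in the abstract are outside the scope of this sketch.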