grid-21-mpi

Verifiable Application-Level Checkpoint and Restart Framework for Parallel Computing
git clone https://git.igankevich.com/grid-21-mpi.git

abstract.txt


Verifiable application-level checkpoint and restart framework for parallel computing

Ivan Gankevich, Ivan Petriakov, Anton Gavrikov, Dmitry Tereschenko, Gleb Mozhaiskii

Fault tolerance of parallel and distributed applications is one of the concerns
that become topical for large computer clusters and large distributed systems.
For a long time the common solution to this problem was checkpoint and restart
mechanisms implemented at the operating system level; however, they are
inefficient for large systems, and application-level checkpoint and restart is
now considered a more efficient alternative. In this paper we implement
application-level checkpoint and restart manually for well-known parallel
computing benchmarks to evaluate this alternative approach. We measure the
overheads introduced by creating a checkpoint and restarting from it, and the
amount of effort needed to implement and verify the correctness of the
resulting program. Based on the results, we propose a generic framework for
application-level checkpointing that simplifies the process and makes it
possible to verify that the application gives correct output when restarted
from any checkpoint.