Verifiable application-level checkpoint and restart framework for parallel computing

Ivan Gankevich, Ivan Petriakov, Anton Gavrikov, Dmitry Tereschenko, Gleb Mozhaiskii

Fault tolerance of parallel and distributed applications is a growing concern for large computer clusters and large distributed systems. For a long time the common solution was checkpoint and restart mechanisms implemented at the operating system level; however, these are inefficient for large systems, and application-level checkpoint and restart is now considered a more efficient alternative. In this paper we manually implement application-level checkpoint and restart for well-known parallel computing benchmarks to evaluate this alternative approach. We measure the overhead introduced by creating a checkpoint and restarting from it, and the amount of effort needed to implement the mechanism and verify the correctness of the resulting programme. Based on the results, we propose a generic framework for application-level checkpointing that simplifies the process and makes it possible to verify that the application gives correct output when restarted from any checkpoint.
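
As an illustration only, the idea of verifying restarts can be sketched on a toy serial computation. The sketch below is not the framework proposed in the paper. The names run, write_checkpoint and verify_restarts, the pickle-based file format and the sum-of-squares workload are all assumptions made for this example. The programme saves its state after every step, then restarts from each saved checkpoint and compares the result with that of an uninterrupted run.

```python
import os
import pickle
import tempfile


def write_checkpoint(state, checkpoint_dir):
    """Save the application state to a new checkpoint file (hypothetical format)."""
    path = os.path.join(checkpoint_dir, f"step-{state['step']:06d}.ckpt")
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    # Rename into place so a crash never leaves a half-written checkpoint
    # under the final name.
    os.replace(tmp, path)
    return path


def run(steps, checkpoint_dir=None, restart_from=None):
    """Toy workload: sum of squares of 0..steps-1, with optional checkpointing."""
    if restart_from is not None:
        # Restart: load the state saved by an earlier run.
        with open(restart_from, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"step": 0, "total": 0}
    while state["step"] < steps:
        state["total"] += state["step"] ** 2
        state["step"] += 1
        if checkpoint_dir is not None:
            write_checkpoint(state, checkpoint_dir)
    return state["total"]


def verify_restarts(steps):
    """Check that restarting from every checkpoint reproduces the reference output."""
    with tempfile.TemporaryDirectory() as d:
        expected = run(steps, checkpoint_dir=d)
        checkpoints = sorted(os.listdir(d))
        for name in checkpoints:
            result = run(steps, restart_from=os.path.join(d, name))
            if result != expected:
                return False
        # One checkpoint is expected per completed step.
        return len(checkpoints) == steps


if __name__ == "__main__":
    print("all restarts correct:", verify_restarts(10))
```

In a real parallel application the state would be distributed across processes and checkpoints would be taken far less often than once per step. The sketch only shows the shape of the verification loop.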