Factory: Non-stop batch jobs without checkpointing
\begin{abstract}
	Nowadays, many job schedulers rely on checkpoint mechanisms to make
	long-running batch jobs resilient to node failures. At large scale, stopping
	a job and creating its image consumes a considerable amount of time. The aim
	of this study is to propose a method that eliminates this overhead. For
	this purpose we decompose the problem being solved into computational
	micro-kernels which have a strict hierarchical dependence on each other. When
	a kernel abruptly stops its execution due to a node failure, it is the
	responsibility of its principal to restart the computation on a healthy node.
	In the course of experiments we successfully applied this method to make
	a hydrodynamics HPC application run on a constantly changing number of nodes.
	We believe that this technique can be generalised to other types of
	scientific applications as well.
\end{abstract}

\vspace{0.1in}
\begin{IEEEkeywords}
job scheduling, parallel computing, cluster computing, distributed computing, fault tolerance
\end{IEEEkeywords}

\IEEEpeerreviewmaketitle