hpcs-16-factory

Factory: Non-stop batch jobs without checkpointing
git clone https://git.igankevich.com/hpcs-16-factory.git

intro.tex (3357B)


\section{INTRODUCTION}

There are three types of node failures that may occur in a computer cluster---failure of a subordinate node, failure of a master node, and an unplanned electricity outage---and each type of failure is handled differently. The usual way of handling a failure of a subordinate node is to periodically checkpoint each long-running parallel job, i.e.~temporarily suspend it, dump its memory image to stable storage, and resume it on healthy nodes upon a failure. To handle a failure of a master node, its state is usually replicated continuously to a backup node, which takes over the master's role upon a failure. Similarly, to overcome an unplanned outage, the state of a master node can be replicated to a geographically distant backup node connected to another cluster, but the execution state of all jobs would probably be lost.

Considerable effort is being put into making the dump of a job's memory image to disk less costly~\cite{egwutuoha2013survey}, while approaches alternative to checkpoint-based fault tolerance attract little attention in this area. Why does this happen? HPC applications typically use message passing for communication of parallel processes and store their state in a global memory space, so there is no way to restart a failed process from its current state without writing the whole memory image to disk. Usually the total number of processes is fixed by the job scheduler, and all parallel processes restart upon a failure. There is an ongoing effort to make it possible to restart only the failed process~\cite{meyer2012radic}, at the cost of overloading a healthy node or maintaining a number of spare nodes. Although it would be more practical to continue execution of a failed application in a degraded state (without the failed node), the message passing library does not allow the number of processes to change at runtime, and most programmes use this number to distribute the load. So there is no practical way to implement fault tolerance in a message passing library other than restarting all parallel processes from a checkpoint or restarting the failed process on a spare healthy node.

There is, however, a possibility to implement fault tolerance so that a job continues execution on a smaller number of nodes than was initially requested. In this case the load is dynamically redistributed among the available nodes. Although dynamic load balancing has been implemented in a number of projects~\cite{bhandarkar2001adaptive,lusk2010more} based on a message passing library, it has not been used to implement fault tolerance.

We do not deal with failure detection in this work, and conservatively assume that a node fails when the corresponding network connection closes prematurely. This allows us to concentrate on the logic of fail over, which can then be incorporated into any existing failure detection framework, or vice versa.

In this paper we deal with failures of subordinate and master nodes, and give a hint on how an unplanned outage can be handled without losing much of the execution state of jobs. We show how to use well-established object-oriented programming techniques to store execution state in a hierarchy of objects rather than in hard-to-manage global and local variables. Finally, we show how to implement fault tolerance on top of the message passing library with some restrictions on process restarts.