% Factory: Non-stop batch jobs without checkpointing

\section{RELATED WORK}

In~\cite{lusk2010more} the authors describe a master-slave programming model suitable for dynamic load balancing. In this model, multiple master nodes arranged in a ring distribute the load among the other nodes, and the state of the master nodes is synchronised by sending the work queue around the ring. Although this model does not provide fault tolerance, from a computational point of view it is similar to our tree hierarchy of nodes with an infinite maximal fan-out value. The tree hierarchy can therefore be seen as a generalisation of the master-slave model to an arbitrary number of levels.
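
As a minimal illustration of this correspondence, assume a complete tree in which every non-leaf node has exactly $f \geq 2$ children; the symbols $f$, $h$ and $n$ are introduced here for illustration only and do not come from the cited work. A tree with $h$ levels then contains
\[
  n = \sum_{i=0}^{h-1} f^{i} = \frac{f^{h} - 1}{f - 1}
\]
nodes. For $h = 2$ this gives $n = 1 + f$, so a hierarchy of $n$ nodes whose fan-out satisfies $f \geq n - 1$ needs only two levels: one root acting as the master and $n - 1$ leaves acting as slaves.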

In~\cite{bala2012fault} the authors describe popular fault tolerance approaches employed in cloud computing, the most popular being load balancing through a highly available proxy server. They report that recovery from the failure of a single node takes several milliseconds. Although this result was obtained outside the HPC domain, it is broadly consistent with our finding that the running time of a parallel application that experiences a slave node failure roughly equals the running time without failures and without this node participating in the computation.
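
Stated in symbols, and using notation that is assumed here purely for illustration, let $T(n)$ denote the running time of the application on $n$ nodes without failures and $T_{\mathrm{fail}}(n)$ the running time on $n$ nodes when one slave node fails during execution. Our finding then reads
\[
  T_{\mathrm{fail}}(n) \approx T(n - 1).
\]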

In~\cite{egwutuoha2013survey} the authors compare various checkpoint/restart implementations used in HPC. They note that one drawback of the checkpoint/restart mechanism is its lack of portability: not every implementation supports restoring network socket state, and not every implementation is fully compatible with every operating system kernel version. Application-level fault tolerance, the kind provided by the tree hierarchy, has neither of these disadvantages, but it cannot provide fault tolerance to existing message-passing parallel programmes. The two technologies thus offer different trade-offs.

In~\cite{guermouche2011uncoordinated} the authors describe an optimised checkpointing algorithm for send-deterministic MPI applications. In a series of tests they show that the algorithm halves the number of parallel processes that have to restart from a checkpoint. This is achieved by carefully tracking causal dependencies between the messages sent by every process and by grouping messages into epochs, that is, sequential steps of programme execution. The algorithm looks promising, but it still requires creating checkpoints for each process.