hpcs-16-factory

Factory: Non-stop batch jobs without checkpointing
git clone https://git.igankevich.com/hpcs-16-factory.git

\section{CONCLUSION}

The proposed master node fault-tolerance approach works only for kernels that have no parent and only one subordinate at a time; such kernels act similarly to manually triggered checkpoints. Their advantage is that they
\begin{itemize}
    \item save results after each sequential step, when the memory footprint of a programme is low, so that only relevant data is saved,
    \item and use the memory of a subordinate node instead of stable storage.
\end{itemize}
This allows them to be much faster than traditional checkpoints, at the cost of using a small amount of memory on a subordinate node to store the execution state of a sequential step of the programme.

Although recovering the current execution state takes more time after a backup node failure, such a failure is not dangerous and requires only a simple restart. In contrast, a master node failure may lead to a full programme stop if the backup node fails before master node recovery completes. One way to mitigate this is to make multiple copies of the first kernel and send them synchronously to different subordinate nodes. This approach requires more complicated logic to recover from a master node failure, but may increase the number of nodes that can fail simultaneously. This is one direction of future research.
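
As a rough illustration of this trade-off, assume (a simplifying assumption that is not verified in this work) that the first kernel is copied to $k$ distinct subordinate nodes and that each of them fails independently with probability $p$ before master node recovery completes. A full programme stop after a master node failure then requires all $k$ copies to be lost:
\[
    P_{\mathrm{stop}} = p^{k}.
\]
Under these assumptions the single-copy scheme corresponds to $k = 1$, and each additional copy reduces the stop probability by a factor of $p$; for example, $p = 0.01$ and $k = 2$ give $P_{\mathrm{stop}} = 10^{-4}$.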

Hierarchical dependence between computational kernels, coupled with a tree hierarchy of nodes, simplifies the implementation of application-level fault tolerance. Provided with a reasonably large number of nodes, an application can survive the failure of any node during a single programme run. The other direction of future work is to ``daemonise'' the framework to make it possible to benchmark multiple applications on the same cluster.