hpcs-16-factory

Factory: Non-stop batch jobs without checkpointing
git clone https://git.igankevich.com/hpcs-16-factory.git

discussion.tex (3210B)


\section{DISCUSSION}

The benchmarks from the previous section show that it is essential for a
parallel application to have multiple sequential steps to make it resilient to
cluster node failures. Although the probability of a master node failure is
lower than the probability of failure of any of the slave nodes, it does not
justify losing all the data when the programme run is near completion. In
general, the more sequential steps an HPC application has, the smaller the
performance penalty in the event of a master node failure, and the more
parallel parts each step has, the smaller the penalty in the event of a slave
node failure. In other words, \emph{the more scalable an application is, the
more resilient to node failures it becomes}.
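
As a rough illustration of this trade-off (a back-of-the-envelope estimate,
not one of the measured results), consider an application whose total running
time $T$ is split into $s$ sequential steps of equal length, and assume that a
master node failure occurs at a uniformly distributed moment of the run and
forces recomputation of the current step only. The expected amount of lost
work is then
\[
  \frac{T}{2s},
\]
which decreases as the number of sequential steps grows; analogously, the work
lost after a slave node failure shrinks as the number of parallel parts per
step grows.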

In our experiments we manually specified the node on which the programme
starts its execution to make the mapping of the hierarchy of computational
kernels to the tree hierarchy of nodes optimal; however, this does not seem
practical for a real-world cluster. The framework may perform such tasks
automatically and distribute the load efficiently no matter whether the master
node of the application is located at the root or at a leaf of the tree
hierarchy: allocating the same node for the first kernel of every application
deteriorates fault-tolerance.
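
One way to avoid concentrating first kernels on a single node is to derive the
starting node from the application identifier. The following fragment is only
a minimal sketch of such a placement policy; the node type, its fields and the
function name are hypothetical and are not part of the Factory API.

\begin{verbatim}
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical node descriptor used only for this illustration.
struct Node { std::string hostname; };

// Choose the node that receives the first (principal) kernel of an
// application.  Hashing the application identifier spreads principal
// kernels over the whole node hierarchy instead of always placing
// them on the root node, which would otherwise become a single point
// of failure for every running application.
const Node& select_first_node(const std::vector<Node>& nodes,
                              std::uint64_t application_id) {
    return nodes[std::hash<std::uint64_t>{}(application_id)
                 % nodes.size()];
}
\end{verbatim}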

Although it may not be clear from the benchmarks, Factory provides more than
tolerance to node failures: new nodes automatically join the cluster and
receive their portion of the load as soon as possible. This is a trivial
process, since it involves neither restarting failed kernels nor managing
their state, so it is not presented in this work.

In theory, hierarchy-based fault-tolerance can be implemented on top of a
message-passing library without loss of generality. Although it would be
complicated to reuse free nodes instead of failed ones, as the number of nodes
is often fixed in such libraries, allocating a reasonably large number of
nodes for the application would be enough to make it fault-tolerant. However,
implementing hierarchy-based fault-tolerance ``below'' the message-passing
library does not seem beneficial, because it would require saving the state of
the parallel application, which equals the total amount of memory it occupies
on each host, and hence would be no more efficient than checkpoints.

The weak point of the proposed technology is the period of time that starts
with a failure of the master node and ends when the failure is detected, the
first kernel is restored, and a new subordinate kernel with the parent's copy
is received by a subordinate node. If the backup node fails during this
period, the execution state of the application is completely lost, and there
is no way to recover it other than restarting the application from the
beginning. The length of this dangerous period can be minimised, but the
possibility of an abrupt programme stop cannot be fully eliminated. This
result is consistent with the studies of ``impossibility theory'', within
which it is proved that distributed consensus is impossible with even one
faulty process~\cite{fischer1985impossibility} and that reliable communication
is impossible in the presence of node
failures~\cite{fekete1993impossibility}.
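
The magnitude of this risk admits a simple back-of-the-envelope estimate; the
failure model and the symbols $\lambda$ and $\delta$ are assumptions made for
illustration, not quantities measured in the benchmarks. If node failures are
independent and occur with rate $\lambda$ per node, and the dangerous period
has length $\delta$, then the probability that the backup node also fails
within this period is
\[
  1 - e^{-\lambda\delta} \approx \lambda\delta
  \quad\text{for}\quad \lambda\delta \ll 1,
\]
so shortening the period drives the probability of a complete loss towards
zero but never makes it exactly zero, in agreement with the impossibility
results cited above.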