\section{DISCUSSION}

The benchmarks from the previous section show that it is essential for a
parallel application to have multiple sequential steps in order to be
resilient to cluster node failures. Although the probability of a master
node failure is lower than the probability of failure of any of the slave
nodes, it does not justify losing all the data when the programme run is
near completion. In general, the more sequential steps an HPC application
has, the smaller the performance penalty in the event of a master node
failure; and the more parallel parts each step has, the smaller the penalty
in the event of a slave node failure. In other words, \emph{the more
scalable an application is, the more resilient to node failures it becomes}
(this observation is quantified at the end of the section).

In our experiments we manually specified the node on which the programme
starts its execution, so that the mapping of the hierarchy of computational
kernels to the tree hierarchy of nodes is optimal; however, this is not
practical for a real-world cluster. The framework should perform this task
automatically and distribute the load efficiently regardless of whether the
master node of the application is located at the root or at a leaf of the
tree hierarchy: allocating the same node for the first kernel of every
application degrades fault tolerance, because that node becomes a single
point of failure for all of them.

Although it may not be evident from the benchmarks, Factory provides more
than tolerance to node failures: new nodes automatically join the cluster
and receive their portion of the load as soon as possible. This is a
trivial process, as it involves neither restarting failed kernels nor
managing their state, so it is not presented in this work.

In theory, hierarchy-based fault tolerance can be implemented on top of a
message-passing library without loss of generality. Although it would be
complicated to substitute free nodes for failed ones, as the number of
nodes is often fixed in such libraries, allocating a reasonably large
number of nodes for the application would be enough to make it
fault-tolerant (a sketch of this approach is given at the end of the
section). However, implementing hierarchy-based fault tolerance ``below''
the message-passing library does not seem beneficial, because it would
require saving the state of a parallel application, which equals the total
amount of memory it occupies on each host, and would therefore be no more
efficient than checkpointing.

The weak point of the proposed technology is the period of time between the
failure of the master node and the moment when the failure is detected, the
first kernel is restored, and a new subordinate kernel with the parent's
copy is received by a subordinate node. If the backup node fails during
this period, the execution state of the application is completely lost, and
there is no way to recover it other than by fully restarting the
application. The length of this dangerous period can be minimised, but the
possibility of an abrupt programme stop cannot be fully eliminated. This
result is consistent with the well-known impossibility results of
distributed computing, which prove the impossibility of distributed
consensus with even one faulty process~\cite{fischer1985impossibility} and
the impossibility of reliable communication in the presence of node
failures~\cite{fekete1993impossibility}.
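To give a rough sense of the magnitude of this risk, consider a simple
estimate that is not part of the benchmarks above. Assume that node
failures are independent and exponentially distributed with rate
$\lambda$, and let $t_d$, $t_r$ and $t_c$ denote the times needed to
detect the failure, restore the first kernel, and deliver the parent's
copy to a subordinate node (all four symbols are introduced here for
illustration only). The probability of losing the execution state to a
second failure within the dangerous period is then
\[
  P_{\text{loss}}
  = 1 - e^{-\lambda\,(t_d + t_r + t_c)}
  \approx \lambda\,(t_d + t_r + t_c)
  \quad\text{when}\quad \lambda\,(t_d + t_r + t_c) \ll 1,
\]
so shortening any of the three phases reduces the risk proportionally,
but no finite effort drives it to zero.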
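The spare-node approach mentioned above can be illustrated with a minimal
sketch on top of plain MPI. This is not Factory's implementation: the
worker-to-spare ratio and the promotion protocol are placeholders, and
only the detection of communication errors uses standard MPI facilities.

\begin{verbatim}
// Sketch: over-allocate MPI processes and keep the excess idle as
// spares that could be promoted when a worker fails.
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    // Return error codes instead of aborting the whole job, so a
    // failed communication with a dead peer can at least be detected.
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    const int nworkers = (size > 1) ? (3 * size) / 4 : 1; // placeholder ratio
    if (rank < nworkers) {
        // ... the real computation runs on worker ranks ...
        if (rank == 0) {
            // No failure happened in this toy run: release the spares.
            const int shutdown = 0;
            for (int s = nworkers; s < size; ++s)
                MPI_Send(&shutdown, 1, MPI_INT, s, 0, MPI_COMM_WORLD);
        }
    } else {
        // Spare rank: wait either for promotion to a worker role or
        // for the shutdown message (promotion protocol omitted).
        int role = -1;
        MPI_Recv(&role, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}
\end{verbatim}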
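Finally, the observation that opened this section can be quantified in the
same spirit. As a sketch under idealised assumptions (equal-sized steps and
parts, and no second failure during recovery), suppose the total amount of
work $W$ is divided into $s$ sequential steps, each split into $p$ parallel
parts. A master node failure then costs at most the current step, whereas a
slave node failure costs at most one part of one step:
\[
  \Delta W_{\text{master}} \le \frac{W}{s},
  \qquad
  \Delta W_{\text{slave}} \le \frac{W}{s\,p}.
\]
Both bounds shrink as $s$ and $p$ grow, which is exactly the sense in which
scalability implies resilience.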