hpcs-17-subord

git clone https://git.igankevich.com/hpcs-17-subord.git

commit 6f1c630d1205273e506ecc2ca81c3d809f1c2fea
parent 84c6d4c3e7baf081a48097f2d4f91866d9b5fc70
Author: Ivan Gankevich <igankevich@ya.ru>
Date:   Fri, 12 May 2017 20:51:38 +0300

WIP.

Diffstat:
src/body.tex | 19 +++++++++++--------
1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/src/body.tex b/src/body.tex
@@ -91,9 +91,9 @@ wait, and call correct kernel methods by analysing their internal state.
 \section{Resilience to multiple node failures}
 
 In our system a node is considered failed if the corresponding network
-connection is abruptly closed. Developing more sophisticated failure techniques
-is out of scope of this paper. For the purpose of studying recovery procedures
-upon node failure this simple approach is sufficient.
+connection is abruptly closed. Developing more sophisticated failure detection
+techniques is out of scope of this paper. For the purpose of studying recovery
+procedures upon node failure this simple approach is sufficient.
 
 Consequently, any kernel which resided on the failed node is considered failed,
 and failure recovery procedure is triggered. Depending on the position of the
@@ -107,7 +107,10 @@ and a copy of the parent is re-executed on a healthy node. If parent kernel
 fails, then its copy, which is sent along with every subordinate on other
 cluster nodes, is re-executed on the node where the first survived subordinate
 kernel resides. Kernel failure is detected only for kernels that are sent from
-one node to another (local kernels are not considered).
+one node to another (local kernels are not considered). Healthy node does not
+need to be a new one, any already loaded node will do: recovery does not
+overload the node, because each node has its own pool of kernels in which they
+wait before being executed by a pipeline.
 
 \subsection{Failure scenarios}
 \label{sec:failure-scenarios}
@@ -204,7 +207,7 @@ acts the same as in the first scenario, when we move to daemon hierarchy one
 more possible variant is added. In deep kernel hierarchy a kernel may act as a
 subordinate and as a principal at the same time. Thus, we need to copy not
 only direct principal of each subordinate kernel, but also all principals
-higher in the hierarchy recursively (Figure~\ref{fig:sc3}). So, the additional
+higher in the hierarchy recursively (fig.~\ref{fig:sc3}). So, the additional
 variant is a generalisation of the two previous ones for deep kernel
 hierarchies.
 
@@ -333,7 +336,7 @@ affects scalability of the application to a large number of nodes.
 
 The first experiment showed that in terms of performance there are three
 possible outcomes when all nodes except one fail
-(Figure~\ref{fig:test-1-phys}). The first case is failure of all kernels except
+(fig.~\ref{fig:test-1-phys}). The first case is failure of all kernels except
 the principal and its first subordinate. There is no communication with other
 nodes to find the survivor and no recomputation of the current sequential step
 of the application, so it takes the least time to recover from the failure. The
@@ -345,7 +348,7 @@ performance is different only in the test environment, because this is the node
 to which standard output and error streams from each parallel process are
 copied over the network. So, the overhead is smaller, because there is no
 communication over the network for streaming the output. The same effect does
-not occur on virtual cluster (Figure~\ref{fig:test-1-virt}). To summarise,
+not occur on virtual cluster (fig.~\ref{fig:test-1-virt}). To summarise,
 performance degradation is larger when principal kernel fails, because the
 survivor needs to recover initial principal state from the backup and start
 the current sequential application step again on the surviving node; performance
@@ -370,7 +373,7 @@ to recover, and only failed kernel is executed on one of the remaining nodes.
 
 The second experiment showed that overhead of multiple node failure handling
 code increases linearly with the number of nodes
-(Figure~\ref{fig:test-2-phys}), however, its absolute value is small
+(fig.~\ref{fig:test-2-phys}), however, its absolute value is small
 compared to the programme run time. Linear increase in overhead is attributed
 to the fact that for each subordinate kernel linear search algorithms are used
 when sending or receiving it from other node to maintain an array of its
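The lines added in the second hunk hinge on each node keeping its own pool of kernels in which they wait before being executed by a pipeline, which is why recovery does not overload an already loaded node. The following is a minimal C++ sketch of that idea only; the names (Kernel, kernel_pool) are hypothetical and do not come from this repository.

#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

struct Kernel {
    std::function<void()> act; // work done when the pipeline picks the kernel up
};

class kernel_pool {
    std::queue<Kernel> _queue;
    std::mutex _mutex;
    std::condition_variable _cv;
    bool _stopped = false;

public:
    // Locally created and recovered kernels alike are simply appended to the
    // pool; extra kernels from recovery just wait for their turn instead of
    // overloading the node.
    void submit(Kernel k) {
        { std::lock_guard<std::mutex> lock(_mutex); _queue.push(std::move(k)); }
        _cv.notify_one();
    }

    void stop() {
        { std::lock_guard<std::mutex> lock(_mutex); _stopped = true; }
        _cv.notify_all();
    }

    // Pipeline loop: take kernels from the pool one by one and execute them,
    // draining the queue before exiting.
    void run() {
        for (;;) {
            std::unique_lock<std::mutex> lock(_mutex);
            _cv.wait(lock, [this] { return _stopped || !_queue.empty(); });
            if (_queue.empty()) return;
            Kernel k = std::move(_queue.front());
            _queue.pop();
            lock.unlock();
            k.act();
        }
    }
};

int main() {
    kernel_pool pool;
    std::thread pipeline([&pool] { pool.run(); });
    for (int i = 0; i < 3; ++i) {
        pool.submit(Kernel{[i] { std::cout << "kernel " << i << " executed\n"; }});
    }
    pool.stop();
    pipeline.join();
}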
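The last hunk attributes the linear growth of the failure-handling overhead to linear search over an array of subordinate kernels, performed whenever a subordinate is sent to or received from another node. The sketch below shows that bookkeeping pattern under assumed names (subordinate_record, principal_state); it is an illustration of the cost, not the repository's actual code.

#include <algorithm>
#include <cstdint>
#include <vector>

struct subordinate_record {
    uint64_t id = 0;        // identifier of the subordinate kernel
    bool delivered = false; // whether its result has come back
};

class principal_state {
    std::vector<subordinate_record> _subordinates;

public:
    void register_subordinate(uint64_t id) {
        _subordinates.push_back({id, false});
    }

    // Called when a subordinate kernel returns from another node. The
    // std::find_if scan is O(n) in the number of subordinates, which is one
    // way the measured overhead can grow linearly with the number of nodes.
    void mark_delivered(uint64_t id) {
        auto it = std::find_if(
            _subordinates.begin(), _subordinates.end(),
            [id](const subordinate_record& r) { return r.id == id; });
        if (it != _subordinates.end()) {
            it->delivered = true;
        }
    }

    bool all_delivered() const {
        return std::all_of(
            _subordinates.begin(), _subordinates.end(),
            [](const subordinate_record& r) { return r.delivered; });
    }
};

Keyed storage (for example a hash map indexed by kernel id) would make each lookup constant time, at the cost of a heavier data structure per principal; the paper only states that the absolute overhead of the linear scans is small compared to the programme run time.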