commit 4ef0cdd588879c06a763d176f251690e90c9b821
parent 6cbd9920b3f6a0c112e99881932f818fa7ddb19e
Author: Ivan Gankevich <igankevich@ya.ru>
Date: Tue, 21 Mar 2017 19:06:30 +0300
Improve results description.
Diffstat:
src/body.tex | 77 +++++++++++++++++++++++++++++++++++++++++------------------------------------
1 file changed, 41 insertions(+), 36 deletions(-)
diff --git a/src/body.tex b/src/body.tex
@@ -220,16 +220,16 @@ configuration is presented in Table~\ref{tab:platform-configuration}.
\begin{table}
\centering
\caption{Test platform configuration.\label{tab:platform-configuration}}
- \begin{tabular}{ll}
- \toprule
- CPU & Intel Xeon E5440, 2.83GHz \\
- RAM & 4Gb \\
- HDD & ST3250310NS, 7200rpm \\
- No. of nodes & 12 \\
- No. of CPU cores per node & 8 \\
- Interconnect & 100Mbit ethernet \\
- \bottomrule
- \end{tabular}
+ \begin{tabular}{ll}
+ \toprule
+ CPU & Intel Xeon E5440, 2.83GHz \\
+ RAM & 4GB \\
+ HDD & ST3250310NS, 7200rpm \\
+ No. of nodes & 12 \\
+ No. of CPU cores per node & 8 \\
+ Interconnect & 100Mbit Ethernet \\
+ \bottomrule
+ \end{tabular}
\end{table}
The first failure scenario (see Section~\ref{sec:failure-scenarios}) was
@@ -254,17 +254,22 @@ of the application to a large number of nodes.
\section{Results}
The first experiment showed that in terms of performance there are three
-possible outcomes when all nodes except one fail. The first case is failure of
-all kernels except the principal and its first subordinate. There is no
-communication with other nodes to find the survivor, so it takes the least time
-to recover from the failure. The second case is failure of all kernels except
-any subordinate kernel other than the first one. Here the survivor try to
-communicate with all subordinates that were created before the survivor, so the
-overhead of recovery is larger. The third case is failure of all kernels except
-the last subordinate. Here performance is different only in the test
-environment, because this is the node where output data and logs are gathered.
-So, the overhead is smaller, because there is no communication over the network
-for storing output.
+possible outcomes when all nodes except one fail (Figure~\ref{fig:test-1}). The
+first case is failure of all kernels except the principal and its first
+subordinate. There is no communication with other nodes to find the survivor
+and no recomputation of the current sequential step of the application, so it
+takes the least time to recover from the failure. The second case is failure of
+all kernels except any subordinate kernel other than the first one. Here the
+survivor tries to communicate with all subordinates that were created before
+the survivor, so the overhead of recovery is larger. The third case is failure
+of all kernels except the last subordinate. Here performance is different only
+in the test environment, because this is the node to which standard output and
+error streams from each parallel process are copied over the network. So, the
+overhead is smaller, because there is no communication over the network for
+streaming the output. To summarise, performance degradation is larger when the
+principal kernel fails, because the survivor needs to recover the initial
+principal state from the backup and start the current sequential application
+step again on the surviving node.
\begin{figure}
\centering
@@ -296,22 +301,22 @@ for storing output.
% run).\label{fig:test-2-dryrun-virt}}
%\end{figure}
-\begin{figure}
- \centering
- \includegraphics{test-2-dryrun-virt-overhead}
- \caption{Application running time with failure handling code for
- different number of virtual cluster nodes (dry
- run, only overhead was measured).\label{fig:test-2-dryrun-virt-overhead}}
-\end{figure}
+%\begin{figure}
+% \centering
+% \includegraphics{test-2-dryrun-virt-overhead}
+% \caption{Application running time with failure handling code for
+% different number of virtual cluster nodes (dry
+% run, only overhead was measured).\label{fig:test-2-dryrun-virt-overhead}}
+%\end{figure}
-\begin{figure}
- \centering
- \includegraphics{test-2-dryrun-virt-overhead-ndebug}
- \caption{Application running time with failure handling code for
- different number of virtual cluster nodes (dry
- run, only overhead was measured, no debug
- output).\label{fig:test-2-dryrun-virt-overhead-ndebug}}
-\end{figure}
+%\begin{figure}
+% \centering
+% \includegraphics{test-2-dryrun-virt-overhead-ndebug}
+% \caption{Application running time with failure handling code for
+% different number of virtual cluster nodes (dry
+% run, only overhead was measured, no debug
+% output).\label{fig:test-2-dryrun-virt-overhead-ndebug}}
+%\end{figure}
\begin{figure}
\centering