hpcs-17-subord

git clone https://git.igankevich.com/hpcs-17-subord.git

commit 4ef0cdd588879c06a763d176f251690e90c9b821
parent 6cbd9920b3f6a0c112e99881932f818fa7ddb19e
Author: Ivan Gankevich <igankevich@ya.ru>
Date:   Tue, 21 Mar 2017 19:06:30 +0300

Improve results description.

Diffstat:
src/body.tex | 77+++++++++++++++++++++++++++++++++++++++++------------------------------------
1 file changed, 41 insertions(+), 36 deletions(-)

diff --git a/src/body.tex b/src/body.tex
@@ -220,16 +220,16 @@ configuration is presented in Table~\ref{tab:platform-configuration}.
 \begin{table}
 \centering
 \caption{Test platform configuration.\label{tab:platform-configuration}}
- \begin{tabular}{ll}
- \toprule
- CPU & Intel Xeon E5440, 2.83GHz \\
- RAM & 4Gb \\
- HDD & ST3250310NS, 7200rpm \\
- No. of nodes & 12 \\
- No. of CPU cores per node & 8 \\
- Interconnect & 100Mbit ethernet \\
- \bottomrule
- \end{tabular}
+ \begin{tabular}{ll}
+ \toprule
+ CPU & Intel Xeon E5440, 2.83GHz \\
+ RAM & 4Gb \\
+ HDD & ST3250310NS, 7200rpm \\
+ No. of nodes & 12 \\
+ No. of CPU cores per node & 8 \\
+ Interconnect & 100Mbit Ethernet \\
+ \bottomrule
+ \end{tabular}
 \end{table}
 
 The first failure scenario (see Section~\ref{sec:failure-scenarios}) was
@@ -254,17 +254,22 @@ of the application to a large number of nodes.
 \section{Results}
 
 The first experiment showed that in terms of performance there are three
-possible outcomes when all nodes except one fail. The first case is failure of
-all kernels except the principal and its first subordinate. There is no
-communication with other nodes to find the survivor, so it takes the least time
-to recover from the failure. The second case is failure of all kernels except
-any subordinate kernel other than the first one. Here the survivor try to
-communicate with all subordinates that were created before the survivor, so the
-overhead of recovery is larger. The third case is failure of all kernels except
-the last subordinate. Here performance is different only in the test
-environment, because this is the node where output data and logs are gathered.
-So, the overhead is smaller, because there is no communication over the network
-for storing output.
+possible outcomes when all nodes except one fail (Figure~\ref{fig:test-1}). The
+first case is failure of all kernels except the principal and its first
+subordinate. There is no communication with other nodes to find the survivor
+and no recomputation of the current sequential step of the application, so it
+takes the least time to recover from the failure. The second case is failure of
+all kernels except any subordinate kernel other than the first one. Here the
+survivor tries to communicate with all subordinates that were created before
+the survivor, so the overhead of recovery is larger. The third case is failure
+of all kernels except the last subordinate. Here performance is different only
+in the test environment, because this is the node to which standard output and
+error streams from each parallel process is copied over the network. So, the
+overhead is smaller, because there is no communication over the network for
+streaming the output. To summarise, performance degradation is larger when
+principal kernel fails, because the survivor needs to recover initial principal
+state from the backup and start the current sequential application step again
+on the surviving node.
 
 \begin{figure}
 \centering
@@ -296,22 +301,22 @@ for storing output.
 % run).\label{fig:test-2-dryrun-virt}}
 %\end{figure}
 
-\begin{figure}
- \centering
- \includegraphics{test-2-dryrun-virt-overhead}
- \caption{Application running time with failure handling code for
- different number of virtual cluster nodes (dry
- run, only overhead was measured).\label{fig:test-2-dryrun-virt-overhead}}
-\end{figure}
+%\begin{figure}
+% \centering
+% \includegraphics{test-2-dryrun-virt-overhead}
+% \caption{Application running time with failure handling code for
+% different number of virtual cluster nodes (dry
+% run, only overhead was measured).\label{fig:test-2-dryrun-virt-overhead}}
+%\end{figure}
 
-\begin{figure}
- \centering
- \includegraphics{test-2-dryrun-virt-overhead-ndebug}
- \caption{Application running time with failure handling code for
- different number of virtual cluster nodes (dry
- run, only overhead was measured, no debug
- output).\label{fig:test-2-dryrun-virt-overhead-ndebug}}
-\end{figure}
+%\begin{figure}
+% \centering
+% \includegraphics{test-2-dryrun-virt-overhead-ndebug}
+% \caption{Application running time with failure handling code for
+% different number of virtual cluster nodes (dry
+% run, only overhead was measured, no debug
+% output).\label{fig:test-2-dryrun-virt-overhead-ndebug}}
+%\end{figure}
 
 \begin{figure}
 \centering