hpcs-17-subord

git clone https://git.igankevich.com/hpcs-17-subord.git

commit 057520512da8fd8b139cf2046dd52000303b37cc
parent 4ef0cdd588879c06a763d176f251690e90c9b821
Author: Ivan Gankevich <igankevich@ya.ru>
Date:   Wed, 22 Mar 2017 10:43:13 +0300

Describe the third experiment.

Diffstat:
src/body.tex | 59++++++++++++++++++++++++++++++++++++++---------------------
1 file changed, 38 insertions(+), 21 deletions(-)

diff --git a/src/body.tex b/src/body.tex
@@ -238,18 +238,31 @@ sequential application step all parallel application processes except one were
 shutdown with a small delay to give principal kernel time to distribute its
 subordinates between cluster nodes. The experiment was repeated 12 times with
 a different surviving process each time. For each run total application running
-time was measured and compared to each other. The result of the experiment is
-the overhead of recovery from a failure of a specific kernel in the hierarchy
-(the overhead of recovering from failure of a principal kernel is different
-from the failure of a subordinate kernel).
+time was measured and compared to each other. In this experiment the principal
+kernel was executed on the first node, and subordinate kernels are evenly
+distributed across all node including the first one. The result of the
+experiment is the overhead of recovery from a failure of a specific kernel in
+the hierarchy, which should be different for principal and subordinate kernel.
 
 In the second experiment we compared the time to generate ocean wavy surface
-without process failures and with/without failure handling code in
-the programme. This test was repeated for different number of cluster nodes.
-Apart from physical cluster the test was run on virtual cluster with a large
-number of nodes all launched on the same physical node. The purpose of the
-experiment is to investigate how failure handling overhead affects scalability
-of the application to a large number of nodes.
+without process failures and with/without failure handling code in the
+programme. This test was repeated for different number of cluster nodes. Apart
+from physical cluster the test was run on virtual cluster with a large number
+of nodes all launched on the same physical node. Since only one physical node
+is used for the virtual cluster, only a dry run of the programme was performed:
+all expensive computations (wavy surface generation and coefficient
+computation) were disabled to reduce the load on the node, but memory
+allocations and communication between processes were retained. The purpose of
+the experiment is to investigate how failure handling overhead affects
+scalability of the application to a large number of nodes.
+
+In the final experiment we benchmarked overhead of the multiple node failure
+handling code by instrumenting it with calls to time measuring routines. For
+this experiment all logging and output was disabled to exclude its time from the
+measurements. A dry run was performed on virtual cluster and real run on the
+physical cluster. The purpose of the experiment is to complement results of the
+previous one with precisely measured overhead of multiple node failure handling
+code.
 
 \section{Results}
 
@@ -264,12 +277,14 @@ survivor tries to communicate with all subordinates that were created before
 the survivor, so the overhead of recovery is larger. The third case is failure
 of all kernels except the last subordinate. Here performance is different only
 in the test environment, because this is the node to which standard output and
-error streams from each parallel process is copied over the network. So, the
+error streams from each parallel process are copied over the network. So, the
 overhead is smaller, because there is no communication over the network for
 streaming the output. To summarise, performance degradation is larger when
 principal kernel fails, because the survivor needs to recover initial principal
 state from the backup and start the current sequential application step again
-on the surviving node.
+on the surviving node; performance degradation is smaller when subordinate
+kernel fails, because there is no state to recover, and failed kernel is
+executed on one of the remaining nodes.
 
 \begin{figure}
   \centering
@@ -279,11 +294,13 @@ on the surviving node.
   nodes.\label{fig:test-1}}
 \end{figure}
 
+
+
 \begin{figure}
   \centering
   \includegraphics{test-2}
-  \caption{Application running time ratio with/without failure handling code for
-  different number of cluster nodes.\label{fig:test-2}}
+  \caption{Application running time ratio with/without failure handling code
+  for different number of cluster nodes.\label{fig:test-2}}
 \end{figure}
 
 %\begin{figure}
@@ -293,13 +310,13 @@ on the surviving node.
 %  different number of virtual cluster nodes.\label{fig:test-2-virt}}
 %\end{figure}
 
-%\begin{figure}
-%  \centering
-%  \includegraphics{test-2-dryrun-virt}
-%  \caption{Application running time with/without failure handling code for
-%  different number of virtual cluster nodes (dry
-%  run).\label{fig:test-2-dryrun-virt}}
-%\end{figure}
+\begin{figure}
+  \centering
+  \includegraphics{test-2-dryrun-virt}
+  \caption{Application running time with/without failure handling code for
+  different number of virtual cluster nodes (dry
+  run).\label{fig:test-2-dryrun-virt}}
+\end{figure}
 
 %\begin{figure}
 %  \centering
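The dry run described in the diff above disables the expensive computations (wavy surface generation and coefficient computation) while keeping memory allocations and inter-process communication. A minimal sketch of such a switch, assuming a C++ code base; the DRY_RUN environment variable and the function names are illustrative, not the project's actual interface:

#include <cstdlib>
#include <vector>

// Hypothetical dry-run switch: when enabled, skip the expensive numerical
// work but keep the memory allocation, so the buffer is still created and
// can be sent between processes as in a real run.
bool dry_run() {
    const char* value = std::getenv("DRY_RUN");
    return value != nullptr && value[0] == '1';
}

std::vector<double> generate_wavy_surface(std::size_t size) {
    std::vector<double> surface(size); // allocation happens in both modes
    if (!dry_run()) {
        for (std::size_t i = 0; i < size; ++i) {
            surface[i] = 0.0; // placeholder for the actual wave model computation
        }
    }
    return surface; // the buffer is still communicated between processes
}

int main() {
    auto surface = generate_wavy_surface(1 << 20);
    return surface.empty() ? 1 : 0;
}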
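The final experiment instruments the multiple node failure handling code with calls to time measuring routines and disables logging so that output does not enter the measurements. A minimal sketch of this kind of instrumentation, assuming C++ and std::chrono; the helper name and the timed block are illustrative, not the project's actual code:

#include <chrono>
#include <iostream>

// Hypothetical helper: measure wall-clock time spent in a block of code and
// print it. Other logging is assumed to be disabled so that output does not
// distort the measurements.
template <class Func>
void measure(const char* label, Func&& block) {
    using clock = std::chrono::steady_clock;
    const auto t0 = clock::now();
    block();
    const auto t1 = clock::now();
    const auto us =
        std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    std::cout << label << ' ' << us << "us\n";
}

int main() {
    // Stand-in for the multiple node failure handling routine being timed.
    measure("handle-node-failure", [] {
        // ... restore principal state from backup, redistribute kernels ...
    });
    return 0;
}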