hpcs-16-factory

Factory: Non-stop batch jobs without checkpointing
git clone https://git.igankevich.com/hpcs-16-factory.git

results.tex


\section{RESULTS}

The Factory framework is evaluated on a physical cluster
(Table~\ref{tab:cluster}) using as an example the hydrodynamics HPC application
developed in~\cite{autoreg-stab,autoreg2011csit,autoreg1,autoreg2}. This
programme generates a wavy ocean surface using the ARMA model; its output is a
set of files representing different parts of the realisation. From a computer
scientist's point of view the application consists of a series of filters, each
applied to the result of the previous one. Some of the filters are parallel, so
the programme is written as a sequence of big steps, some of which are made
internally parallel to get better performance. Only the most compute-intensive
step (the surface generation) is executed in parallel across all cluster nodes;
the other steps are executed in parallel across all cores of the master node.

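To make this structure concrete, the pipeline can be sketched as follows. This
is a simplified illustration with names of our own choosing (\texttt{Part},
\texttt{generate\_surface}, \texttt{transform}); it is not taken from the
application's source code.
\begin{verbatim}
// Simplified sketch of the filter pipeline described above; all names are
// illustrative and not taken from the actual application.
#include <vector>

using Part = std::vector<double>;            // one part of the realisation

// The most compute-intensive filter: runs in parallel on all cluster nodes.
std::vector<Part> generate_surface() {
    return std::vector<Part>(16, Part(1024));
}

// Subsequent filters apply to the previous filter's output and run in
// parallel only across the cores of the master node.
std::vector<Part> transform(std::vector<Part> parts) {
    return parts;
}

int main() {
    auto parts = transform(generate_surface());
    // each part is written to its own output file (omitted)
    return parts.empty() ? 1 : 0;
}
\end{verbatim}
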
\begin{table}
    \centering
    \caption{Test platform configuration.}
    \begin{tabular}{ll}
         \toprule
         CPU & Intel Xeon E5440, 2.83 GHz \\
         RAM & 4 GB \\
         HDD & ST3250310NS, 7200 rpm \\
         No. of nodes & 12 \\
         No. of CPU cores per node & 8 \\
         \bottomrule
    \end{tabular}
    \label{tab:cluster}
\end{table}

The application was rewritten for the new version of the framework, which
required only slight modifications to handle failure of the node running the
first kernel: this kernel was flagged so that the framework makes a replica of
it and sends the replica to some subordinate node. There were no additional
code changes other than modifying some parts to match the new API. So, the tree
hierarchy of kernels is a mostly non-intrusive model for providing fault
tolerance which demands only explicit marking of the kernels to be replicated.

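In code the change amounts to something like the following sketch. The kernel
base class and the member names (\texttt{Kernel}, \texttt{carries\_parent},
\texttt{act}) are assumptions made for the sake of illustration and are not
taken from the framework's actual API.
\begin{verbatim}
// Hedged sketch: the only fault-tolerance-specific change is flagging the
// first (principal) kernel so that the framework keeps a replica of it on
// a subordinate node. All names are illustrative.
struct Kernel {
    bool carries_parent = false;   // if true, the framework replicates this
                                   // kernel to a subordinate node
    virtual void act() {}
    virtual ~Kernel() = default;
};

struct AutoregPrincipal: public Kernel {
    AutoregPrincipal() {
        carries_parent = true;     // the one flag that enables replication
    }
    void act() override {
        // generate parts of the wavy surface and distribute them to
        // subordinate kernels (omitted)
    }
};

int main() {
    AutoregPrincipal first_kernel;
    first_kernel.act();
}
\end{verbatim}
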
In a series of experiments we benchmarked the performance of the new version of
the application in the presence of different types of failures (the numbers
correspond to the graphs in Figure~\ref{fig:benchmark}):
\begin{enumerate}
    \item no failures,
    \item failure of a slave node (a node where a part of the wavy surface is
      generated),
    \item failure of a master node (the node where the first kernel is run),
    \item failure of a backup node (the node where a copy of the first kernel
      is stored).
\end{enumerate}
A tree hierarchy with a fan-out value of 64 was chosen so that all cluster
nodes connect directly to the first one. In each run the first kernel was
launched on a different node to make the mapping of the kernel hierarchy to the
tree hierarchy optimal. A victim node was taken offline a fixed amount of time
after the programme start, equivalent to approximately $1/3$ of the total run
time without failures on a single node. All relevant parameters are summarised
in Table~\ref{tab:benchmark} (here ``root'' and ``leaf'' refer to a node's
position in the tree hierarchy). The results of these runs were compared to the
run without node failures (Figures~\ref{fig:benchmark}--\ref{fig:slowdown}).

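The failure injection itself is straightforward to sketch. The exact mechanism
is not specified here, so the host name, the 10-second delay taken from
Table~\ref{tab:benchmark} and the shutdown command below are placeholders, not
the actual benchmark scripts.
\begin{verbatim}
// Hypothetical failure-injection driver: take the victim node offline a
// fixed time after the programme start (10 s, as in the benchmark
// parameters table, roughly 1/3 of the single-node run time without
// failures). Host name and command are placeholders.
#include <chrono>
#include <cstdlib>
#include <thread>

int main() {
    std::this_thread::sleep_for(std::chrono::seconds(10));
    // Power off the victim node; any out-of-band mechanism would do here.
    return std::system("ssh victim-node sudo poweroff");
}
\end{verbatim}
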
There is a considerable difference in net performance for different types of
failures. Graphs 2 and 3 in Figure~\ref{fig:benchmark} show that performance in
case of master or slave node failure is the same. In case of master node
failure the backup node stores a copy of the first kernel and uses this copy
when it fails to connect to the master node. In case of slave node failure, the
master node redistributes the load across the remaining slave nodes. In both
cases the execution state is not lost and no time is spent restoring it, which
is why performance is the same. Graph 4 in Figure~\ref{fig:benchmark} shows
that performance in case of a backup node failure is much lower. This happens
because the master node stores only the current step of the computation plus
some additional fixed amount of data, whereas the backup node not only stores a
copy of this information but also executes this step in parallel with the other
subordinate nodes. So, when the backup node fails, the master node executes the
whole step once again on an arbitrarily chosen healthy node.

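The three recovery paths can be summarised in code form as follows. This is a
restatement of the behaviour described above, not the framework's actual
implementation; the enumeration and function names are ours.
\begin{verbatim}
// Illustrative summary of the recovery paths; not the framework's code.
enum class Failure { slave, master, backup };

void recover(Failure f) {
    switch (f) {
        case Failure::slave:
            // The master redistributes the failed node's portion of the
            // surface across the remaining slave nodes; no state is lost.
            break;
        case Failure::master:
            // The backup node already holds a replica of the first kernel
            // and resumes from it when the master becomes unreachable.
            break;
        case Failure::backup:
            // Only the master's copy of the current step survives, so the
            // whole step is executed again on a healthy node.
            break;
    }
}

int main() { recover(Failure::backup); }
\end{verbatim}
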
\begin{table}
    \centering
    \caption{Benchmark parameters.}
    \begin{tabular}{llll}
         \toprule
         Experiment no. & Master node & Victim node & Time to offline, s \\
         \midrule
         1 & root & n/a  & n/a \\
         2 & root & leaf & 10 \\
         3 & leaf & leaf & 10 \\
         4 & leaf & root & 10 \\
         \bottomrule
    \end{tabular}
    \label{tab:benchmark}
\end{table}

Finally, to measure how much time is lost due to a failure, we divide the total
execution time with a failure by the total execution time without a failure but
with one node less.
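In symbols, writing $T_{\mathrm{fail}}(n)$ for the total run time on $n$ nodes
with a single failure and $T_{\mathrm{nofail}}(n)$ for the total run time on
$n$ nodes without failures (the notation is ours), the slowdown is
\[
    S(n) = \frac{T_{\mathrm{fail}}(n)}{T_{\mathrm{nofail}}(n-1)}.
\]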
The results of this calculation, obtained from the same benchmark, are
presented in Figure~\ref{fig:slowdown}. The difference in performance in case
of master and slave node failures lies within a 5\% margin, and in case of a
backup node failure within a 50\% margin for fewer than~6
nodes\footnote{Measuring this margin for a higher number of nodes is not
  meaningful, since the time before the failure becomes greater than the total
  execution time with that many nodes, and the programme finishes before the
  failure occurs.}. An increase in execution time of 50\% is more than the
$1/3$ of the execution time after which the failure occurs, because a backup
node failure takes some time to be discovered: it is detected only when the
subordinate kernel carrying the copy of the first kernel finishes its execution
and tries to reach its parent. Instant detection would require abruptly
stopping the subordinate kernel, which may be undesirable for programmes with
complicated logic.

\begin{figure}
    \centering
    \includegraphics{factory-3000}
    \caption{Performance of the hydrodynamics HPC application in the presence of node failures.}
    \label{fig:benchmark}
\end{figure}

To summarise, the benchmark showed that \emph{no matter whether a master or a
  slave node fails, the resulting performance roughly equals the performance
  without failures but with one node less}; however, when a backup node fails
the performance penalty is much higher.

\begin{figure}
    \centering
    \includegraphics{slowdown-3000}
    \caption{Slowdown of the hydrodynamics HPC application in the presence of different types of node failures compared to execution without failures but with the number of nodes minus one.}
    \label{fig:slowdown}
\end{figure}