
Factory: Master Node High-Availability for Big Data Applications and Beyond
git clone https://git.igankevich.com/iccsa-16-factory.git



\section{Methods}

\subsection{Model of computation}

To infer a fault tolerance model suitable for big data applications we use the
bulk-synchronous parallel model~\cite{valiant1990bridging} as the basis. This
model assumes that a parallel programme is composed of several sequential steps
that are internally parallel, and global synchronisation of all parallel
processes occurs after each step. In our model all sequential steps are
pipelined where possible. The evolution of the computational model is described
as follows.

Given a programme that is sequential and large enough to be decomposed into
several sequential steps, the simplest way to make it run faster is to exploit
data parallelism. Usually this means finding multi-dimensional arrays and the
loops that access their elements, and making these loops parallel. After
transforming several loops the programme still has the same number of
sequential steps, but every step is (ideally) internally parallel.
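
For illustration, a minimal C++ sketch of this first transformation; the array
and the loop are hypothetical, not taken from the application itself:

\begin{verbatim}
#include <cstddef>
#include <vector>

// A sequential step made data-parallel: each row's variance is independent
// of the others, so the outer loop can be distributed across processor
// cores (here with an OpenMP pragma).
std::vector<double>
row_variance(const std::vector<std::vector<double>>& data) {
    std::vector<double> variance(data.size());
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < (std::ptrdiff_t)data.size(); ++i) {
        double mean = 0;
        for (double x : data[i]) mean += x;
        mean /= data[i].size();
        double var = 0;
        for (double x : data[i]) var += (x - mean) * (x - mean);
        variance[i] = var / data[i].size();
    }
    return variance;
}
\end{verbatim}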

After that the only remaining possibility to speed up the programme is to
overlap execution of code blocks that work with different hardware devices. The
most common pattern is to overlap computation with network I/O or disk I/O.
This approach makes sense because these devices operate with little
synchronisation, and issuing commands to them in parallel makes the whole
programme perform better. This behaviour can be achieved by allocating a
separate task queue for each device and submitting tasks to these queues
asynchronously with respect to the main thread. So, after this optimisation,
the programme is composed of several steps chained into a pipeline, each step
being implemented as a task queue for a particular device.
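
A minimal sketch of such a per-device task queue follows; the class name and
interface are illustrative, not the actual API of our framework:

\begin{verbatim}
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// One queue per device (CPU, disk, network); tasks are submitted
// asynchronously and executed by the queue's own worker thread.
class task_queue {
public:
    task_queue(): worker([this] { loop(); }) {}
    ~task_queue() {
        { std::lock_guard<std::mutex> lock(m); stopped = true; }
        cv.notify_one();
        worker.join();
    }
    // Submit a task; the caller does not wait for its completion.
    void submit(std::function<void()> task) {
        { std::lock_guard<std::mutex> lock(m); tasks.push(std::move(task)); }
        cv.notify_one();
    }
private:
    void loop() {
        for (;;) {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [this] { return stopped || !tasks.empty(); });
            if (stopped && tasks.empty()) return;
            std::function<void()> task = std::move(tasks.front());
            tasks.pop();
            lock.unlock();
            task();  // run the task outside the lock
        }
    }
    std::queue<std::function<void()>> tasks;
    std::mutex m;
    std::condition_variable cv;
    bool stopped = false;
    std::thread worker;  // declared last so it starts after the other members
};
\end{verbatim}

A pipeline is then a set of such queues, one per device, with each task
submitting its successor to the next queue when it completes.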

Pipelining otherwise sequential steps is beneficial not only for code that
accesses different devices, but also for code whose branches are suitable for
execution by multiple hardware threads of the same processor core, i.e.
branches accessing different regions of memory or performing mixed arithmetic
(floating point and integer). In other words, code branches which use different
processor modules are good candidates to run in parallel on a processor core
with multiple hardware threads.

Even though pipelining may not add parallelism for a programme that uses only
one input file (or a set of input parameters), it adds parallelism when the
programme can process multiple input files: each input generates tasks which
travel through the whole pipeline in parallel with tasks generated by other
inputs. With a pipeline an array of files is processed in parallel by the same
set of resources allocated for a batch job, and possibly with greater efficiency
for busy HPC clusters compared to executing a separate job for each input file,
because the time that each subsequent job after the first spends in a queue is
eliminated.

A computational model with a pipeline can be seen as a \emph{bulk-asynchronous
  model}, because of the parallel nature of otherwise sequential execution
steps. This model is the basis of the fault-tolerance model developed here.

\subsection{Fail over model}

Although fault tolerance and high availability are different terms, in essence
they describe the same property---the ability of a system to switch processing
from a failed component to its live spare or backup component. In the case of
fault tolerance it is the ability to switch from a failed slave node to a spare
one, i.e. to repeat a computation step on a healthy slave node. In the case of
high availability it is the ability to switch from a failed master node to a
backup node with full restoration of execution state. These are the core
abilities that constitute a distributed system's ability to \emph{fail over}.

The key feature that is missing in current parallel programming and big data
processing technologies is the ability to specify hierarchical dependencies
between parallel tasks. When one has such a dependency, it is trivial to
determine which task should be responsible for re-executing a failed task on a
healthy node. To re-execute the root of the hierarchy, a backup root task is
created and executed on a different node. There exist a number of engines that
are capable of executing directed acyclic graphs of tasks in
parallel~\cite{acun2014charmpp,islam2012oozie}, but graphs are ill-suited for
inferring master-slave relationships between tasks, because a node in a graph
may have multiple parent nodes.

\subsection{Programming model}

This work is based on the results of previous research.
In~\cite{gankevich2015subordination,gankevich2015iccsa} we developed an
algorithm that builds a tree hierarchy from a strictly ordered set of cluster
nodes. The sole purpose of this hierarchy is to make a cluster more
fault-tolerant by introducing multiple master nodes. If a master node fails,
then its subordinates try to connect to another node from the same or a higher
level of the hierarchy. If there is no such node, one of the subordinates
becomes the master. In~\cite{gankevich2015spec} we developed a framework for
big data processing without fault tolerance, and here this framework is
combined with the fault-tolerance techniques described in this paper.
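
The hierarchy is defined precisely in the cited papers; the following is only a
minimal sketch of the idea, assuming a fixed fan-out and nodes identified by
their position in the strictly ordered set:

\begin{verbatim}
// For a node with zero-based position i in the ordered set of cluster
// nodes and a tree with the given fan-out, the position of its master
// (parent) is (i - 1) / fanout; node 0 is the root and has no master.
long master_position(long i, long fanout) {
    return i == 0 ? -1 : (i - 1) / fanout;
}
\end{verbatim}

The sketch only illustrates why the mapping is deterministic: every node can
compute its master from the ordering alone, without global coordination.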

Each programme that runs on top of the tree hierarchy is composed of
computational kernels---objects that contain data and the code to process it.
To exploit parallelism a kernel may create an arbitrary number of subordinate
kernels, which are automatically spread first across available processor cores
and then across subordinate nodes in the tree hierarchy. The programme is
itself a kernel (without a parent, as it is executed by a user), which either
solves the problem sequentially on its own or creates subordinate kernels to
solve it in parallel.
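
A minimal sketch of what a kernel might look like; the interface shown here is
illustrative, not the exact API of the framework:

\begin{verbatim}
#include <vector>

// A kernel carries data and the code to process it.  The scheduler calls
// act() when the kernel is executed and react() when one of its
// subordinates returns with a result.
struct kernel {
    kernel* parent = nullptr;
    virtual ~kernel() {}
    virtual void act() {}                 // do work, possibly spawn subordinates
    virtual void react(kernel* child) {}  // collect a subordinate's result
};

// The main kernel has no parent: it either solves the problem itself in
// act() or creates one subordinate kernel per input file.
struct main_kernel: kernel {
    std::vector<const char*> input_files;
    void act() override {
        // for each file: create a subordinate kernel and submit it to the
        // scheduler, which spreads kernels across cores and cluster nodes
    }
    void react(kernel* child) override {
        // merge the subordinate's result; terminate when all have returned
    }
};
\end{verbatim}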

In contrast to HPC applications, in big data applications it is inefficient to
run computational kernels on arbitrarily chosen nodes. A more practical
approach is to bind every kernel to a file location in a parallel file system
and to transfer the kernel to that location before processing the file. That
way expensive data transfer is eliminated, and the file is always read from a
local drive. This approach is more deterministic than existing ones: e.g. the
MapReduce framework runs jobs on nodes that are ``close'' to the file location,
but not necessarily on the exact node where the file is
located~\cite{dean2008mapreduce}. However, this approach does not come without
disadvantages: the scalability of a big data application is limited by the
strategy that was employed to distribute its input files across cluster nodes.
The more nodes are used to store input files, the higher the read performance
that can be achieved. The advantage of our approach is that the I/O performance
is more predictable than that of a hybrid approach which streams files over the
network.
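
A minimal sketch of this binding, assuming the parallel file system can be
queried for the node that stores a given file; the lookup function and the
kernel fields are hypothetical:

\begin{verbatim}
#include <string>

// Hypothetical lookup: ask the parallel file system which node stores
// (the first replica of) a given file.
std::string node_storing(const std::string& path);

struct spectrum_kernel {
    std::string input_file;   // file with buoy measurements
    std::string target_node;  // node the kernel is sent to before it runs
};

// Bind the kernel to the location of its input file, so that the file is
// always read from a local drive once the kernel arrives at that node.
void bind_to_data(spectrum_kernel& k) {
    k.target_node = node_storing(k.input_file);
}
\end{verbatim}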

% From mathematical point of view kernel $K$ can be described as a vector-valued functional which recursively maps a kernel to $n$-component vector of kernels:
% \begin{equation*}
%     K(f): \mathbb{K} \rightarrow \mathbb{K}^n
%     \qquad
%     \mathbb{K}^n = \left\{ f: \mathbb{K} \rightarrow \mathbb{K}^n \right\}.
% \end{equation*}
% Dummy kernel $\mathbb{O}: \mathbb{K} \rightarrow \mathbb{K}^0$, which stops recursion, is used to call the first kernel and finish execution of the programme. An argument to each kernel is interpreted using the following rules.
% \begin{enumerate}
%     \item If a kernel is a new kernel, then its argument is its parent kernel.
%     \item If a kernel is a parent of the kernel that produced it or some other existing kernel, then the argument is the kernel that produced it.
% \end{enumerate}

% Engine that executes kernels is implemented as a simple loop. It starts with calling the first kernel with dummy kernel as an argument, then calls each kernel that was produced by this call and so forth. The loop finishes when a dummy kernel is returned as a result of the call.

% Since kernel call may return multiple kernels they are executed in parallel. Parallel execution quickly produces a pool of kernels which permit execution in an unspecified order. Several threads concurrently retrieve kernels from the pool and may ``spill'' remaining kernels to neighbouring cluster nodes.

% Kernels are implemented as closures---function objects containing all their arguments, a reference to parent kernel and user-supplied data. The data is either processed upon kernel call, or subordinate kernels are created to process it in parallel. When the processing is complete a parent kernel closure with its subordinate kernel as an argument is called to collect data from it.

\subsection{Handling master node failures}

A possible way of handling the failure of the node where the first kernel is
located (the master node) is to replicate this kernel to a backup node, and to
make all updates to its state propagate to the backup node by means of a
distributed transaction. This approach requires synchronisation between all
nodes that execute subordinates of the first kernel and the node with the first
kernel itself. When the node with the first kernel goes offline, the nodes with
subordinate kernels must know which node is the backup one. However, if the
backup node also goes offline in the middle of execution of some subordinate
kernel, then it is impossible for this kernel to discover the next backup node
to return to, because it has not yet discovered the unavailability of the
master node. One can think of a consensus-based algorithm to ensure that
subordinate kernels always know where the backup node is, but distributed
consensus algorithms do not scale well to a large number of nodes and they are
not reliable~\cite{fischer1985impossibility}. So, a consensus-based approach
does not play well with the asynchronous nature of computational kernels, as it
may inhibit the scalability of a parallel programme.

Fortunately, the first kernel usually does not perform operations in parallel;
rather, it launches execution steps sequentially one by one, so it has only one
subordinate at a time. Such behaviour is described by the bulk-synchronous
parallel programming model, in the framework of which a programme consists of
sequential supersteps which are internally parallel~\cite{valiant1990bridging}.
Keeping this in mind, we can simplify the synchronisation of its state: we send
the first kernel along with its subordinate to the subordinate node. When the
node with the first kernel fails, its copy receives its subordinate, and no
execution time is lost. When the node with the copy fails, its subordinate is
rescheduled on some other node, and in the worst case a whole step of
computation is lost.

The described approach works only for kernels that do not have a parent and
have only one subordinate at a time; such kernels act similarly to manually
triggered checkpoints. The advantage is that they
\begin{itemize}
    \item save results after each sequential step, when the memory footprint of
      the programme is low,
    \item save only relevant data,
    \item and use the memory of a subordinate node instead of stable storage.
\end{itemize}
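
A minimal sketch of this checkpoint-like behaviour, with a hypothetical
\texttt{send} primitive standing in for the framework's kernel scheduler:

\begin{verbatim}
struct kernel;  // as sketched above: act()/react(), pointer to parent

// Hypothetical primitive: serialise the kernels and deliver them to the
// node chosen to execute the subordinate.
void send(const kernel* first, const kernel* subordinate,
          const char* target_node);

// One sequential step: the first kernel travels together with its only
// subordinate, so the subordinate node also holds a copy of the state of
// the current step.
//  - if the master node fails, the copy already has the subordinate and
//    no execution time is lost;
//  - if the subordinate (backup) node fails, the subordinate is
//    rescheduled elsewhere and at most one step is repeated.
void launch_step(const kernel* first, const kernel* subordinate,
                 const char* target_node) {
    send(first, subordinate, target_node);
}
\end{verbatim}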

\section{Results}

The master node fail over technique is evaluated on the example of a wave
energy spectra processing application. This programme uses the NDBC
dataset~\cite{ndbc-dataset} to reconstruct frequency-directional spectra from
wave rider buoy measurements and to compute their variance. Each spectrum is
reconstructed from five variables using the following
formula~\cite{earle1996nondirectional}.
\begin{equation*}
    S(\omega, \theta) = \frac{1}{\pi}
    \left[
        \frac{1}{2} +
        r_1 \cos \left( \theta - \alpha_1 \right) +
        r_2 \sin \left( 2 \left( \theta - \alpha_2 \right) \right)
    \right]
    S_0(\omega).
\end{equation*}
Here $\omega$ denotes frequency, $\theta$ is wave direction, $r_{1,2}$ and
$\alpha_{1,2}$ are parameters of the spectrum decomposition and $S_0$ is the
non-directional spectrum; $r_{1,2}$, $\alpha_{1,2}$ and $S_0$ are acquired
through measurements. Properties of the dataset used in the evaluation are
listed in Table~\ref{tab:ndbc-dataset}.
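
A direct translation of this formula into C++, shown only to make the
per-spectrum computation concrete; the variable names are illustrative:

\begin{verbatim}
#include <cmath>

// Reconstruct the frequency-directional spectrum S(omega, theta) from the
// five measured variables: r1, r2, alpha1, alpha2 and the non-directional
// spectrum s0 = S_0(omega).
double directional_spectrum(double theta, double s0, double r1, double r2,
                            double alpha1, double alpha2) {
    const double pi = 4.0 * std::atan(1.0);
    return (1.0 / pi)
        * (0.5 + r1 * std::cos(theta - alpha1)
               + r2 * std::sin(2.0 * (theta - alpha2)))
        * s0;
}
\end{verbatim}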

\begin{table}
    \centering
    \caption{NDBC dataset properties.}
    \begin{tabular}{ll}
        \toprule
        Dataset size & 144 MB \\
        Dataset size (uncompressed) & 770 MB \\
        No. of wave stations & 24 \\
        Time span & 3 years (2010--2012) \\
        Total no. of spectra & 445422 \\
        \bottomrule
    \end{tabular}
    \label{tab:ndbc-dataset}
\end{table}

The algorithm for processing the spectra is as follows. First, the current
directory is recursively scanned for input files. Data for all buoys is
distributed across cluster nodes, and each buoy's data processing is
distributed across the processor cores of a node. Processing begins with
joining the corresponding measurements for each spectrum variable into a tuple;
then for each tuple the frequency-directional spectrum is reconstructed and its
variance is computed. Results are gradually copied back to the machine where
the application was launched, and when the processing is complete the programme
terminates.
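
The steps above can be summarised in the following sketch; the function names
and types are illustrative, and in the real application each step is expressed
as a kernel:

\begin{verbatim}
#include <map>
#include <string>
#include <vector>

struct tuple5 { double r1, r2, alpha1, alpha2, s0; }; // one spectrum's inputs
struct spectrum { std::vector<double> values; double variance; };

// Step 1: recursively scan the current directory for input files.
std::vector<std::string> scan_input_files(const std::string& directory);

// Step 2: join the measurements of the five variables into per-spectrum
// tuples, grouped by wave station.
std::map<std::string, std::vector<tuple5>>
join_measurements(const std::vector<std::string>& files);

// Step 3: reconstruct one spectrum with the formula above and compute its
// variance.  Stations are distributed across cluster nodes; within a node
// the tuples of one station are processed in parallel on the cores.
spectrum reconstruct(const tuple5& t);
\end{verbatim}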

\begin{table}
    \centering
    \caption{Test platform configuration.}
    \begin{tabular}{ll}
         \toprule
         CPU & Intel Xeon E5440, 2.83 GHz \\
         RAM & 4 GB \\
         HDD & ST3250310NS, 7200 rpm \\
         No. of nodes & 12 \\
         No. of CPU cores per node & 8 \\
         \bottomrule
    \end{tabular}
    \label{tab:cluster}
\end{table}

In a series of test runs we benchmarked the performance of the application in
the presence of different types of failures:
\begin{itemize}
    \item failure of a master node (the node where the first kernel runs),
    \item failure of a slave node (a node where spectra from a particular
      station are reconstructed) and
    \item failure of a backup node (the node to which the first kernel is
      copied).
\end{itemize}
A tree hierarchy with a sufficiently large fan-out value was chosen to make all
cluster nodes connect directly to the first one, so that only one master node
exists in the cluster. In each run the first kernel was launched on a different
node to make the mapping of the kernel hierarchy to the tree hierarchy optimal.
A victim node was taken offline a fixed amount of time after the programme
start. To make up for the node failure, all data files have replicas stored on
different cluster nodes. All relevant parameters are summarised in
Table~\ref{tab:benchmark} (here ``root'' and ``leaf'' refer to a node's
position in the tree hierarchy). The results of these runs were compared to a
run without node failures (Figure~\ref{fig:benchmark-bigdata}).

\begin{table}
    \centering
    \caption{Benchmark parameters.}
    \begin{tabular}{llll}
         \toprule
         Experiment no. & Master node & Victim node & Time to offline, s \\
         \midrule
         1 & root &      & \\
         2 & root & leaf & 30 \\
         3 & leaf & leaf & 30 \\
         4 & leaf & root & 30 \\
         \bottomrule
    \end{tabular}
    \label{tab:benchmark}
\end{table}

The benchmark showed that only a backup node failure results in a significant
performance penalty; in all other cases the performance is roughly equal to
that of the run without failures, but with one node less. This happens because
a backup node not only stores the copy of the state of the current computation
step, but also executes this step in parallel with the other subordinate nodes.
So, when a backup node fails, the master node executes the whole step once
again on an arbitrarily chosen healthy subordinate node.

% \begin{figure}
%     \centering
%     \includegraphics{factory-3000}
%     \caption{Performance of hydrodynamics HPC application in the presence of node failures.}
%     \label{fig:benchmark-hpc}
% \end{figure}

\begin{figure}
    \centering
    \includegraphics{spec}
    \caption{Performance of the spectrum processing application in the presence of different types of node failures.}
    \label{fig:benchmark-bigdata}
\end{figure}

\section{Discussion}

The described algorithm is guaranteed to handle one failure per computational
step; more failures can be tolerated if they do not affect the master node. The
system handles simultaneous failure of all subordinate nodes; however, if both
the master and the backup node fail, there is no chance for the application to
survive. In this case the state of the current computation step is lost, and
the only way to restore it is to restart the application.

Computational kernels are a means of abstraction that decouples a distributed
application from the physical hardware: it does not matter how many nodes are
online for the application to run successfully. Computational kernels eliminate
the need to allocate a physical backup node to make the master node highly
available: with the computational kernel approach any node can act as a backup
one. Finally, computational kernels can handle subordinate node failures in a
way that is transparent to the programmer.

The disadvantage of this approach is evident: there is no way of making
existing middleware highly available without rewriting its source code.
Although our programming framework is lightweight, it is not easy to map the
architecture of existing middleware systems to it: most systems are developed
with a static assignment of server/client roles in mind, which is not easy to
make dynamic. Hopefully, our approach will simplify the design of future
middleware systems.