\section{Methods}

\subsection{Model of computation}

To infer a fault tolerance model suitable for big data applications we use the
bulk-synchronous parallel model~\cite{valiant1990bridging} as the basis. This
model assumes that a parallel programme is composed of several sequential steps
that are internally parallel, and global synchronisation of all parallel
processes occurs after each step. In our model all sequential steps are
pipelined where possible. The evolution of the computational model is described
as follows.

Given a programme that is sequential and large enough to be decomposed into
several sequential steps, the simplest way to make it run faster is to exploit
data parallelism. Usually this means finding multi-dimensional arrays and loops
that access their elements and trying to make them parallel. After transforming
several loops the programme will still have the same number of sequential
steps, but every step will (ideally) be internally parallel.

After that the only remaining possibility to speed up the programme is to
overlap execution of code blocks that work with different hardware devices. The
most common pattern is to overlap computation with network I/O or disk I/O.
This approach makes sense because all devices operate with little
synchronisation, and issuing commands in parallel makes the whole programme
perform better. This behaviour can be achieved by allocating a separate task
queue for each device and submitting tasks to these queues asynchronously with
respect to the main thread. After this optimisation the programme is composed
of several steps chained into a pipeline, with each step implemented as a task
queue for a particular device.
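A minimal C++ sketch of such a per-device task queue follows; the names are
ours for exposition and do not come from any particular library.

\begin{verbatim}
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Illustrative sketch: one queue per device (processor, disk,
// network card). Tasks are submitted asynchronously and executed
// by a dedicated worker thread, so queues of different devices
// run in parallel.
class task_queue {
public:
    task_queue(): worker_([this] { loop(); }) {}
    ~task_queue() {
        { std::lock_guard<std::mutex> lock(mutex_); stopped_ = true; }
        not_empty_.notify_one();
        worker_.join();
    }
    void submit(std::function<void()> task) {
        { std::lock_guard<std::mutex> lock(mutex_);
          tasks_.push(std::move(task)); }
        not_empty_.notify_one();
    }
private:
    void loop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                not_empty_.wait(lock,
                    [this] { return stopped_ || !tasks_.empty(); });
                if (tasks_.empty()) return; // stopped and drained
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task(); // runs in parallel with other queues
        }
    }
    std::queue<std::function<void()>> tasks_;
    std::mutex mutex_;
    std::condition_variable not_empty_;
    bool stopped_ = false;
    std::thread worker_;
};
\end{verbatim}

A pipeline is then formed by letting each task submit its continuation to the
queue of the next device.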
Pipelining of otherwise sequential steps is beneficial not only for code that
accesses different devices, but also for code whose branches are suitable for
execution by multiple hardware threads of the same processor core, i.e.\
branches accessing different regions of memory or performing mixed arithmetic
(floating point and integer). In other words, code branches that use different
modules of the processor are good candidates to run in parallel on a processor
core with multiple hardware threads.

Even though pipelining may not add parallelism for a programme that uses only
one input file (or a set of input parameters), it adds parallelism when the
programme can process multiple input files: each input generates tasks which
travel through the whole pipeline in parallel with tasks generated by other
inputs. With a pipeline an array of files is processed in parallel by the same
set of resources allocated for a batch job, and possibly with greater
efficiency on busy HPC clusters compared to executing a separate job for each
input file, because the time that each subsequent job after the first spends in
a queue is eliminated.

The computational model with a pipeline can be seen as a
\emph{bulk-asynchronous model}, because of the parallel nature of otherwise
sequential execution steps. This model is the basis of the fault tolerance
model developed here.

\subsection{Fail over model}

Although fault tolerance and high availability are different terms, in essence
they describe the same property---the ability of a system to switch processing
from a failed component to its live spare or backup component. In the case of
fault tolerance it is the ability to switch from a failed slave node to a spare
one, i.e.\ to repeat a computation step on a healthy slave node. In the case of
high availability it is the ability to switch from a failed master node to a
backup node with full restoration of execution state. These are the core
abilities that constitute a distributed system's ability to \emph{fail over}.

The key feature that is missing in current parallel programming and big data
processing technologies is the ability to specify hierarchical dependencies
between parallel tasks. When such a dependency exists, it is trivial to
determine which task should be responsible for re-executing a failed task on a
healthy node. To re-execute the root of the hierarchy, a backup root task is
created and executed on a different node. There exist a number of engines that
are capable of executing directed acyclic graphs of tasks in
parallel~\cite{acun2014charmpp,islam2012oozie}, but graphs are ill-suited for
inferring master-slave relationships between tasks, because a node in a graph
may have multiple parent nodes.

\subsection{Programming model}

This work is based on the results of previous research.
In~\cite{gankevich2015subordination,gankevich2015iccsa} we developed an
algorithm that builds a tree hierarchy from a strictly ordered set of cluster
nodes. The sole purpose of this hierarchy is to make a cluster more
fault-tolerant by introducing multiple master nodes. If a master node fails,
then its subordinates try to connect to another node from the same or a higher
level of the hierarchy. If there is no such node, one of the subordinates
becomes the master. In~\cite{gankevich2015spec} we developed a framework for
big data processing without fault tolerance, and here this framework is
combined with the fault tolerance techniques described in this paper.
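For illustration, if nodes are numbered consecutively by their position in the
ordered set, each node can determine its master with simple arithmetic; the
sketch below assumes this numbering and a fixed fan-out $f$ (the exact mapping
used in~\cite{gankevich2015subordination} may differ).

\begin{verbatim}
// Illustrative sketch (our assumption, not the exact mapping from
// the cited papers): nodes are numbered 0..n-1 by their position
// in the ordered set; with fan-out f the master of node i is node
// (i-1)/f and node 0 is the root. Every node computes its place in
// the hierarchy locally, without distributed consensus.
int master_of(int i, int f) { return (i - 1) / f; }

int level_of(int i, int f) {
    int level = 0;
    while (i > 0) { i = master_of(i, f); ++level; }
    return level;
}
\end{verbatim}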
Each programme that runs on top of the tree hierarchy is composed of
computational kernels---objects that contain data and the code to process it.
To exploit parallelism a kernel may create an arbitrary number of subordinate
kernels, which are automatically spread first across available processor cores
and second across subordinate nodes in the tree hierarchy. The programme is
itself a kernel (without a parent, as it is executed by a user), which either
solves the problem sequentially on its own or creates subordinate kernels to
solve it in parallel.
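The following C++ sketch renders this idea schematically; the class and method
names (\texttt{act}, \texttt{react}, \texttt{upstream}, \texttt{commit}) are
ours for exposition, and the inline scheduler merely mimics what the framework
from~\cite{gankevich2015spec} does asynchronously across cores and nodes.

\begin{verbatim}
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Schematic rendering; all names are placeholders. A kernel couples
// data with the code that processes it: act() is called when the
// kernel is scheduled, react() each time a subordinate completes.
struct kernel {
    virtual ~kernel() {}
    virtual void act() {}
    virtual void react(kernel*) {}
    kernel* parent = nullptr;
    bool done = false;
};

void upstream(kernel* parent, kernel* child); // schedule a subordinate
void commit(kernel* k) { k->done = true; }    // signal completion

// Subordinate kernel: processes one input file.
struct process_station: kernel {
    explicit process_station(std::string path): path_(std::move(path)) {}
    void act() override { /* process the file at path_ */ commit(this); }
    std::string path_;
};

// Main kernel: has no parent, spawns one subordinate per input file.
struct spectra_job: kernel {
    explicit spectra_job(std::vector<std::string> files):
        files_(std::move(files)) {}
    void act() override {
        for (const auto& f: files_)
            upstream(this, new process_station(f));
    }
    void react(kernel* subordinate) override {
        delete subordinate;
        if (++completed_ == files_.size()) commit(this);
    }
    std::vector<std::string> files_;
    std::size_t completed_ = 0;
};

// Stand-in scheduler: the real framework executes the child on some
// core or node and calls the parent's react() asynchronously.
void upstream(kernel* parent, kernel* child) {
    child->parent = parent;
    child->act();
    parent->react(child);
}
\end{verbatim}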
In contrast to HPC applications, in big data applications it is inefficient to
run computational kernels on arbitrarily chosen nodes. A more practical
approach is to bind every kernel to a file location in a parallel file system
and transfer the kernel to that location before processing the file. That way
expensive data transfer is eliminated, and the file is always read from a local
drive. This approach is more deterministic than existing ones: e.g.\ the
MapReduce framework runs jobs on nodes that are ``close'' to the file location,
but not necessarily on the exact node where the file is
located~\cite{dean2008mapreduce}. However, this approach does not come without
disadvantages: scalability of a big data application is limited by the strategy
that was employed to distribute its input files across cluster nodes. The more
nodes are used to store input files, the higher the read performance that is
achieved. The advantage of our approach is that the I/O performance is more
predictable than that of a hybrid approach which streams files over the
network.

% From a mathematical point of view kernel $K$ can be described as a
% vector-valued functional which recursively maps a kernel to an $n$-component
% vector of kernels:
% \begin{equation*}
%   K(f): \mathbb{K} \rightarrow \mathbb{K}^n
%   \qquad
%   \mathbb{K}^n = \left\{ f: \mathbb{K} \rightarrow \mathbb{K}^n \right\}.
% \end{equation*}
% Dummy kernel $\mathbb{O}: \mathbb{K} \rightarrow \mathbb{K}^0$, which stops
% the recursion, is used to call the first kernel and to finish execution of
% the programme. An argument to each kernel is interpreted using the following
% rules.
% \begin{enumerate}
%   \item If a kernel is a new kernel, then its argument is its parent kernel.
%   \item If a kernel is a parent of the kernel that produced it or of some
%     other existing kernel, then the argument is the kernel that produced it.
% \end{enumerate}

% The engine that executes kernels is implemented as a simple loop. It starts
% with calling the first kernel with a dummy kernel as an argument, then calls
% each kernel that was produced by this call and so forth. The loop finishes
% when a dummy kernel is returned as a result of the call.

% Since a kernel call may return multiple kernels, they are executed in
% parallel. Parallel execution quickly produces a pool of kernels which permit
% execution in an unspecified order. Several threads concurrently retrieve
% kernels from the pool and may ``spill'' remaining kernels to neighbouring
% cluster nodes.

% Kernels are implemented as closures---function objects containing all their
% arguments, a reference to the parent kernel and user-supplied data. The data
% is either processed upon the kernel call, or subordinate kernels are created
% to process it in parallel. When the processing is complete a parent kernel
% closure with its subordinate kernel as an argument is called to collect data
% from it.

\subsection{Handling master node failures}

A possible way of handling a failure of the node where the first kernel is
located (the master node) is to replicate this kernel to a backup node, and to
make all updates to its state propagate to the backup node by means of a
distributed transaction. This approach requires synchronisation between all
nodes that execute subordinates of the first kernel and the node with the first
kernel itself. When the node with the first kernel goes offline, the nodes with
subordinate kernels must know which node is the backup one. However, if the
backup node also goes offline in the middle of execution of some subordinate
kernel, then it is impossible for this kernel to discover the next backup node
to return to, because this kernel has not discovered the unavailability of the
master node yet. One can think of a consensus-based algorithm to ensure that
subordinate kernels always know where the backup node is, but distributed
consensus algorithms do not scale well to a large number of nodes and they are
not reliable~\cite{fischer1985impossibility}. So a consensus-based approach
does not play well with the asynchronous nature of computational kernels, as it
may inhibit scalability of a parallel programme.

Fortunately, the first kernel usually does not perform operations in parallel;
rather, it launches execution steps sequentially one by one, so it has only one
subordinate at a time. Such behaviour is described by the bulk-synchronous
parallel programming model, in the framework of which a programme consists of
sequential supersteps which are internally
parallel~\cite{valiant1990bridging}. Keeping this in mind, we can simplify
synchronisation of its state: we can send the first kernel along with its
subordinate to the subordinate node. When the node with the first kernel fails,
its copy receives its subordinate, and no execution time is lost. When the node
with the copy fails, its subordinate is rescheduled on some other node, and in
the worst case a whole step of computation is lost.
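The following self-contained sketch simulates this rule; all names are ours,
and the state of the first kernel is reduced to a step counter (the real
framework sends the serialised kernel over the network together with its
subordinate).

\begin{verbatim}
#include <iostream>
#include <optional>

// Illustrative simulation: the whole state of the first kernel is
// reduced to a step counter.
struct first_kernel_state { int current_step = 0; };

struct subordinate_node {
    std::optional<first_kernel_state> backup;
    // The subordinate kernel arrives together with a copy of its
    // parent, so this node doubles as a backup of the master.
    void receive_step(const first_kernel_state& parent_copy) {
        backup = parent_copy;
    }
    // Called when a master failure is detected: the local copy
    // resumes from the current step and no time is lost.
    first_kernel_state promote() { return *backup; }
};

int main() {
    first_kernel_state master{3};   // the master executes step 3
    subordinate_node node;
    node.receive_step(master);      // subordinate + parent copy sent
    // ... the master node goes offline during step 3 ...
    first_kernel_state resumed = node.promote();
    std::cout << "resuming from step " << resumed.current_step << "\n";
}
\end{verbatim}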
The described approach works only for kernels that do not have a parent and
have only one subordinate at a time, and it acts similarly to manually
triggered checkpoints. Its advantages are that it
\begin{itemize}
\item saves results after each sequential step, when the memory footprint of a
  programme is low,
\item saves only the relevant data,
\item and uses the memory of a subordinate node instead of stable storage.
\end{itemize}

\section{Results}

The master node fail over technique is evaluated on the example of a wave
energy spectra processing application. This programme uses the NDBC
dataset~\cite{ndbc-dataset} to reconstruct frequency-directional spectra from
wave rider buoy measurements and to compute their variance. Each spectrum is
reconstructed from five variables using the following
formula~\cite{earle1996nondirectional}:
\begin{equation*}
  S(\omega, \theta) = \frac{1}{\pi}
  \left[
    \frac{1}{2} +
    r_1 \cos \left( \theta - \alpha_1 \right) +
    r_2 \cos \left( 2 \left( \theta - \alpha_2 \right) \right)
  \right]
  S_0(\omega).
\end{equation*}
Here $\omega$ denotes frequency, $\theta$ is wave direction, $r_{1,2}$ and
$\alpha_{1,2}$ are parameters of the spectrum decomposition and $S_0$ is the
non-directional spectrum; $r_{1,2}$, $\alpha_{1,2}$ and $S_0$ are acquired
through measurements.
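In code the reconstruction of a single spectrum value is a direct transcription
of this formula (a sketch; the actual implementation evaluates it over a whole
grid of frequencies and directions):

\begin{verbatim}
#include <cmath>

// Directional spectrum value S(omega, theta) computed from the
// five measured inputs: r1, r2, alpha1, alpha2 and the
// non-directional spectrum value s0 = S0(omega).
double spectrum(double s0, double theta,
                double r1, double alpha1,
                double r2, double alpha2) {
    const double pi = 3.14159265358979323846;
    return (1.0 / pi)
        * (0.5 + r1 * std::cos(theta - alpha1)
               + r2 * std::cos(2.0 * (theta - alpha2)))
        * s0;
}
\end{verbatim}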
Properties of the dataset used in the evaluation are listed in
Table~\ref{tab:ndbc-dataset}.

\begin{table}
  \centering
  \caption{NDBC dataset properties.}
  \begin{tabular}{ll}
    \toprule
    Dataset size & 144~MB \\
    Dataset size (uncompressed) & 770~MB \\
    No.\ of wave stations & 24 \\
    Time span & 3 years (2010--2012) \\
    Total no.\ of spectra & 445\,422 \\
    \bottomrule
  \end{tabular}
  \label{tab:ndbc-dataset}
\end{table}

The algorithm of processing spectra is as follows. First, the current directory
is recursively scanned for input files. Data for all buoys are distributed
across cluster nodes, and each buoy's data processing is distributed across the
processor cores of a node. Processing begins with joining corresponding
measurements of the five spectrum variables into tuples; then for each tuple
the frequency-directional spectrum is reconstructed and its variance is
computed. Results are gradually copied back to the machine where the
application was executed, and when the processing is complete the programme
terminates.

\begin{table}
  \centering
  \caption{Test platform configuration.}
  \begin{tabular}{ll}
    \toprule
    CPU & Intel Xeon E5440, 2.83~GHz \\
    RAM & 4~GB \\
    HDD & ST3250310NS, 7200~rpm \\
    No.\ of nodes & 12 \\
    No.\ of CPU cores per node & 8 \\
    \bottomrule
  \end{tabular}
  \label{tab:cluster}
\end{table}

In a series of test runs we benchmarked the performance of the application in
the presence of different types of failures:
\begin{itemize}
\item failure of a master node (the node where the first kernel is run),
\item failure of a slave node (a node where spectra from a particular station
  are reconstructed) and
\item failure of a backup node (the node where the first kernel is copied).
\end{itemize}
A tree hierarchy with a sufficiently large fan-out value was chosen to make all
cluster nodes connect directly to the first one, so that only one master node
exists in the cluster. In each run the first kernel was launched on a different
node to make the mapping of the kernel hierarchy to the tree hierarchy optimal.
A victim node was taken offline a fixed amount of time after the programme
start. To make up for the node failure, all data files have replicas stored on
different cluster nodes. All relevant parameters are summarised in
Table~\ref{tab:benchmark} (here ``root'' and ``leaf'' refer to a node's
position in the tree hierarchy). The results of these runs were compared to the
run without node failures (Figure~\ref{fig:benchmark-bigdata}).

\begin{table}
  \centering
  \caption{Benchmark parameters.}
  \begin{tabular}{llll}
    \toprule
    Experiment no. & Master node & Victim node & Time to offline, s \\
    \midrule
    1 & root & --- & --- \\
    2 & root & leaf & 30 \\
    3 & leaf & leaf & 30 \\
    4 & leaf & root & 30 \\
    \bottomrule
  \end{tabular}
  \label{tab:benchmark}
\end{table}

The benchmark showed that only a backup node failure results in a significant
performance penalty; in all other cases the performance roughly equals that of
the run without failures on one node fewer. This happens because the backup
node not only stores a copy of the state of the current computation step, but
also executes this step in parallel with the other subordinate nodes. So, when
the backup node fails, the master node executes the whole step once again on an
arbitrarily chosen healthy subordinate node.

% \begin{figure}
%   \centering
%   \includegraphics{factory-3000}
%   \caption{Performance of hydrodynamics HPC application in the presence of
%     node failures.}
%   \label{fig:benchmark-hpc}
% \end{figure}

\begin{figure}
  \centering
  \includegraphics{spec}
  \caption{Performance of the spectrum processing application in the presence
    of different types of node failures.}
  \label{fig:benchmark-bigdata}
\end{figure}

\section{Discussion}

The described algorithm guarantees handling of one failure per computational
step; more failures can be tolerated if they do not affect the master node. The
system handles the simultaneous failure of all subordinate nodes; however, if
both the master and the backup node fail, there is no chance for the
application to survive. In this case the state of the current computation step
is lost, and the only way to restore it is to restart the application.

Computational kernels are a means of abstraction that decouples a distributed
application from the physical hardware: it does not matter how many nodes are
online for an application to run successfully. Computational kernels eliminate
the need to allocate a physical backup node to make the master node highly
available: with the computational kernels approach any node can act as a
backup. Finally, computational kernels can handle subordinate node failures in
a way that is transparent to the programmer.

The disadvantage of this approach is evident: there is no way of making
existing middleware highly available without rewriting its source code.
Although our programming framework is lightweight, it is not easy to map the
architecture of existing middleware systems onto it: most systems are developed
with static assignment of server and client roles in mind, which is not easy to
make dynamic. Hopefully, our approach will simplify the design of future
middleware systems.