\section{Methods}

\subsection{Model of computation}

To infer a fault tolerance model suitable for big data applications we use the
bulk-synchronous parallel model~\cite{valiant1990bridging} as the basis. This
model assumes that a parallel programme is composed of several sequential steps
that are internally parallel, and global synchronisation of all parallel
processes occurs after each step. In our model all sequential steps are
pipelined where possible. The evolution of the computational model is described
as follows.

Given a programme that is sequential and large enough to be decomposed into
several sequential steps, the simplest way to make it run faster is to exploit
data parallelism. Usually this means finding multi-dimensional arrays and loops
that access their elements and trying to make them parallel. After transforming
several loops the programme will still have the same number of sequential
steps, but every step will (ideally) be internally parallel.

After that the only remaining possibility to speed up the programme is to
overlap execution of code blocks that work with different hardware devices. The
most common pattern is to overlap computation with network I/O or disk I/O.
This approach makes sense because all devices operate with little
synchronisation, and issuing commands in parallel makes the whole programme
perform better. This behaviour can be achieved by allocating a separate task
queue for each device and submitting tasks to these queues asynchronously with
respect to the main thread. After this optimisation the programme is composed
of several steps chained into a pipeline, with each step implemented as a task
queue for a particular device.
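A minimal C++ sketch of such a per-device task queue follows; the names are
ours for exposition and do not come from any particular library.

\begin{verbatim}
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Illustrative sketch: one queue per device (processor, disk,
// network card). Tasks are submitted asynchronously and executed
// by a dedicated worker thread, so queues of different devices
// run in parallel.
class task_queue {
public:
    task_queue(): worker_([this] { loop(); }) {}
    ~task_queue() {
        { std::lock_guard<std::mutex> lock(mutex_); stopped_ = true; }
        not_empty_.notify_one();
        worker_.join();
    }
    void submit(std::function<void()> task) {
        { std::lock_guard<std::mutex> lock(mutex_);
          tasks_.push(std::move(task)); }
        not_empty_.notify_one();
    }
private:
    void loop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                not_empty_.wait(lock,
                    [this] { return stopped_ || !tasks_.empty(); });
                if (tasks_.empty()) return; // stopped and drained
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task(); // runs in parallel with other queues
        }
    }
    std::queue<std::function<void()>> tasks_;
    std::mutex mutex_;
    std::condition_variable not_empty_;
    bool stopped_ = false;
    std::thread worker_;
};
\end{verbatim}

A pipeline is then formed by letting each task submit its continuation to the
queue of the next device.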
Pipelining of otherwise sequential steps is beneficial not only for code that
accesses different devices, but also for code whose branches are suitable for
execution by multiple hardware threads of the same processor core, i.e.\
branches accessing different regions of memory or performing mixed arithmetic
(floating point and integer). In other words, code branches that use different
modules of the processor are good candidates to run in parallel on a processor
core with multiple hardware threads.

Even though pipelining may not add parallelism for a programme that uses only
one input file (or a set of input parameters), it adds parallelism when the
programme can process multiple input files: each input generates tasks which
travel through the whole pipeline in parallel with tasks generated by other
inputs. With a pipeline an array of files is processed in parallel by the same
set of resources allocated for a batch job, and possibly with greater
efficiency on busy HPC clusters compared to executing a separate job for each
input file, because the time that each subsequent job after the first spends in
a queue is eliminated.

The computational model with a pipeline can be seen as a
\emph{bulk-asynchronous model}, because of the parallel nature of otherwise
sequential execution steps. This model is the basis of the fault tolerance
model developed here.

\subsection{Fail over model}

Although fault tolerance and high availability are different terms, in essence
they describe the same property---the ability of a system to switch processing
from a failed component to its live spare or backup component. In the case of
fault tolerance it is the ability to switch from a failed slave node to a spare
one, i.e.\ to repeat a computation step on a healthy slave node. In the case of
high availability it is the ability to switch from a failed master node to a
backup node with full restoration of execution state. These are the core
abilities that constitute a distributed system's ability to \emph{fail over}.

The key feature that is missing in current parallel programming and big data
processing technologies is the ability to specify hierarchical dependencies
between parallel tasks. When such a dependency exists, it is trivial to
determine which task should be responsible for re-executing a failed task on a
healthy node. To re-execute the root of the hierarchy, a backup root task is
created and executed on a different node. There exist a number of engines that
are capable of executing directed acyclic graphs of tasks in
parallel~\cite{acun2014charmpp,islam2012oozie}, but graphs are ill-suited for
inferring master-slave relationships between tasks, because a node in a graph
may have multiple parent nodes.

\subsection{Programming model}

This work is based on the results of previous research.
In~\cite{gankevich2015subordination,gankevich2015iccsa} we developed an
algorithm that builds a tree hierarchy from a strictly ordered set of cluster
nodes. The sole purpose of this hierarchy is to make a cluster more
fault-tolerant by introducing multiple master nodes. If a master node fails,
then its subordinates try to connect to another node from the same or a higher
level of the hierarchy. If there is no such node, one of the subordinates
becomes the master. In~\cite{gankevich2015spec} we developed a framework for
big data processing without fault tolerance, and here this framework is
combined with the fault tolerance techniques described in this paper.
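For illustration, if nodes are numbered consecutively by their position in the
ordered set, each node can determine its master with simple arithmetic; the
sketch below assumes this numbering and a fixed fan-out $f$ (the exact mapping
used in~\cite{gankevich2015subordination} may differ).

\begin{verbatim}
// Illustrative sketch (our assumption, not the exact mapping from
// the cited papers): nodes are numbered 0..n-1 by their position
// in the ordered set; with fan-out f the master of node i is node
// (i-1)/f and node 0 is the root. Every node computes its place in
// the hierarchy locally, without distributed consensus.
int master_of(int i, int f) { return (i - 1) / f; }

int level_of(int i, int f) {
    int level = 0;
    while (i > 0) { i = master_of(i, f); ++level; }
    return level;
}
\end{verbatim}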
Each programme that runs on top of the tree hierarchy is composed of
computational kernels---objects that contain data and the code to process it.
To exploit parallelism a kernel may create an arbitrary number of subordinate
kernels, which are automatically spread first across available processor cores
and second across subordinate nodes in the tree hierarchy. The programme is
itself a kernel (without a parent, as it is executed by a user), which either
solves the problem sequentially on its own or creates subordinate kernels to
solve it in parallel.
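The following C++ sketch renders this idea schematically; the class and method
names (\texttt{act}, \texttt{react}, \texttt{upstream}, \texttt{commit}) are
ours for exposition, and the inline scheduler merely mimics what the framework
from~\cite{gankevich2015spec} does asynchronously across cores and nodes.

\begin{verbatim}
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Schematic rendering; all names are placeholders. A kernel couples
// data with the code that processes it: act() is called when the
// kernel is scheduled, react() each time a subordinate completes.
struct kernel {
    virtual ~kernel() {}
    virtual void act() {}
    virtual void react(kernel*) {}
    kernel* parent = nullptr;
    bool done = false;
};

void upstream(kernel* parent, kernel* child); // schedule a subordinate
void commit(kernel* k) { k->done = true; }    // signal completion

// Subordinate kernel: processes one input file.
struct process_station: kernel {
    explicit process_station(std::string path): path_(std::move(path)) {}
    void act() override { /* process the file at path_ */ commit(this); }
    std::string path_;
};

// Main kernel: has no parent, spawns one subordinate per input file.
struct spectra_job: kernel {
    explicit spectra_job(std::vector<std::string> files):
        files_(std::move(files)) {}
    void act() override {
        for (const auto& f: files_)
            upstream(this, new process_station(f));
    }
    void react(kernel* subordinate) override {
        delete subordinate;
        if (++completed_ == files_.size()) commit(this);
    }
    std::vector<std::string> files_;
    std::size_t completed_ = 0;
};

// Stand-in scheduler: the real framework executes the child on some
// core or node and calls the parent's react() asynchronously.
void upstream(kernel* parent, kernel* child) {
    child->parent = parent;
    child->act();
    parent->react(child);
}
\end{verbatim}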
In contrast to HPC applications, in big data applications it is inefficient to
run computational kernels on arbitrarily chosen nodes. A more practical
approach is to bind every kernel to a file location in a parallel file system
and transfer the kernel to that location before processing the file. That way
expensive data transfer is eliminated, and the file is always read from a local
drive. This approach is more deterministic than existing ones: e.g.\ the
MapReduce framework runs jobs on nodes that are ``close'' to the file location,
but not necessarily on the exact node where the file is
located~\cite{dean2008mapreduce}. However, this approach does not come without
disadvantages: scalability of a big data application is limited by the strategy
that was employed to distribute its input files across cluster nodes. The more
nodes are used to store input files, the higher the read performance that is
achieved. The advantage of our approach is that the I/O performance is more
predictable than that of a hybrid approach which streams files over the
network.

% From a mathematical point of view kernel $K$ can be described as a
% vector-valued functional which recursively maps a kernel to an $n$-component
% vector of kernels:
% \begin{equation*}
%   K(f): \mathbb{K} \rightarrow \mathbb{K}^n
%   \qquad
%   \mathbb{K}^n = \left\{ f: \mathbb{K} \rightarrow \mathbb{K}^n \right\}.
% \end{equation*}
% Dummy kernel $\mathbb{O}: \mathbb{K} \rightarrow \mathbb{K}^0$, which stops
% the recursion, is used to call the first kernel and to finish execution of
% the programme. An argument to each kernel is interpreted using the following
% rules.
% \begin{enumerate}
%   \item If a kernel is a new kernel, then its argument is its parent kernel.
%   \item If a kernel is a parent of the kernel that produced it or of some
%     other existing kernel, then the argument is the kernel that produced it.
% \end{enumerate}

% The engine that executes kernels is implemented as a simple loop. It starts
% with calling the first kernel with a dummy kernel as an argument, then calls
% each kernel that was produced by this call and so forth. The loop finishes
% when a dummy kernel is returned as a result of the call.

% Since a kernel call may return multiple kernels, they are executed in
% parallel. Parallel execution quickly produces a pool of kernels which permit
% execution in an unspecified order. Several threads concurrently retrieve
% kernels from the pool and may ``spill'' remaining kernels to neighbouring
% cluster nodes.

% Kernels are implemented as closures---function objects containing all their
% arguments, a reference to the parent kernel and user-supplied data. The data
% is either processed upon the kernel call, or subordinate kernels are created
% to process it in parallel. When the processing is complete a parent kernel
% closure with its subordinate kernel as an argument is called to collect data
% from it.

\subsection{Handling master node failures}

A possible way of handling a failure of the node where the first kernel is
located (the master node) is to replicate this kernel to a backup node, and to
make all updates to its state propagate to the backup node by means of a
distributed transaction. This approach requires synchronisation between all
nodes that execute subordinates of the first kernel and the node with the first
kernel itself. When the node with the first kernel goes offline, the nodes with
subordinate kernels must know which node is the backup one. However, if the
backup node also goes offline in the middle of execution of some subordinate
kernel, then it is impossible for this kernel to discover the next backup node
to return to, because this kernel has not discovered the unavailability of the
master node yet. One can think of a consensus-based algorithm to ensure that
subordinate kernels always know where the backup node is, but distributed
consensus algorithms do not scale well to a large number of nodes and they are
not reliable~\cite{fischer1985impossibility}. So a consensus-based approach
does not play well with the asynchronous nature of computational kernels, as it
may inhibit scalability of a parallel programme.

Fortunately, the first kernel usually does not perform operations in parallel;
rather, it launches execution steps sequentially one by one, so it has only one
subordinate at a time. Such behaviour is described by the bulk-synchronous
parallel programming model, in the framework of which a programme consists of
sequential supersteps which are internally
parallel~\cite{valiant1990bridging}. Keeping this in mind, we can simplify
synchronisation of its state: we can send the first kernel along with its
subordinate to the subordinate node. When the node with the first kernel fails,
its copy receives its subordinate, and no execution time is lost. When the node
with the copy fails, its subordinate is rescheduled on some other node, and in
the worst case a whole step of computation is lost.
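The following self-contained sketch simulates this rule; all names are ours,
and the state of the first kernel is reduced to a step counter (the real
framework sends the serialised kernel over the network together with its
subordinate).

\begin{verbatim}
#include <iostream>
#include <optional>

// Illustrative simulation: the whole state of the first kernel is
// reduced to a step counter.
struct first_kernel_state { int current_step = 0; };

struct subordinate_node {
    std::optional<first_kernel_state> backup;
    // The subordinate kernel arrives together with a copy of its
    // parent, so this node doubles as a backup of the master.
    void receive_step(const first_kernel_state& parent_copy) {
        backup = parent_copy;
    }
    // Called when a master failure is detected: the local copy
    // resumes from the current step and no time is lost.
    first_kernel_state promote() { return *backup; }
};

int main() {
    first_kernel_state master{3};   // the master executes step 3
    subordinate_node node;
    node.receive_step(master);      // subordinate + parent copy sent
    // ... the master node goes offline during step 3 ...
    first_kernel_state resumed = node.promote();
    std::cout << "resuming from step " << resumed.current_step << "\n";
}
\end{verbatim}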
The described approach works only for kernels that do not have a parent and
have only one subordinate at a time, and it acts similarly to manually
triggered checkpoints. Its advantages are that it
\begin{itemize}
\item saves results after each sequential step, when the memory footprint of a
  programme is low,
\item saves only the relevant data,
\item and uses the memory of a subordinate node instead of stable storage.
\end{itemize}

\section{Results}

The master node fail over technique is evaluated on the example of a wave
energy spectra processing application. This programme uses the NDBC
dataset~\cite{ndbc-dataset} to reconstruct frequency-directional spectra from
wave rider buoy measurements and to compute their variance. Each spectrum is
reconstructed from five variables using the following
formula~\cite{earle1996nondirectional}:
\begin{equation*}
  S(\omega, \theta) = \frac{1}{\pi}
  \left[
    \frac{1}{2} +
    r_1 \cos \left( \theta - \alpha_1 \right) +
    r_2 \cos \left( 2 \left( \theta - \alpha_2 \right) \right)
  \right]
  S_0(\omega).
\end{equation*}
Here $\omega$ denotes frequency, $\theta$ is wave direction, $r_{1,2}$ and
$\alpha_{1,2}$ are parameters of the spectrum decomposition and $S_0$ is the
non-directional spectrum; $r_{1,2}$, $\alpha_{1,2}$ and $S_0$ are acquired
through measurements.
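In code the reconstruction of a single spectrum value is a direct transcription
of this formula (a sketch; the actual implementation evaluates it over a whole
grid of frequencies and directions):

\begin{verbatim}
#include <cmath>

// Directional spectrum value S(omega, theta) computed from the
// five measured inputs: r1, r2, alpha1, alpha2 and the
// non-directional spectrum value s0 = S0(omega).
double spectrum(double s0, double theta,
                double r1, double alpha1,
                double r2, double alpha2) {
    const double pi = 3.14159265358979323846;
    return (1.0 / pi)
        * (0.5 + r1 * std::cos(theta - alpha1)
               + r2 * std::cos(2.0 * (theta - alpha2)))
        * s0;
}
\end{verbatim}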
Properties of the dataset used in the evaluation are listed in
Table~\ref{tab:ndbc-dataset}.

\begin{table}
  \centering
  \caption{NDBC dataset properties.}
  \begin{tabular}{ll}
    \toprule
    Dataset size & 144~MB \\
    Dataset size (uncompressed) & 770~MB \\
    No.\ of wave stations & 24 \\
    Time span & 3 years (2010--2012) \\
    Total no.\ of spectra & 445\,422 \\
    \bottomrule
  \end{tabular}
  \label{tab:ndbc-dataset}
\end{table}

The algorithm of processing spectra is as follows. First, the current directory
is recursively scanned for input files. Data for all buoys are distributed
across cluster nodes, and each buoy's data processing is distributed across the
processor cores of a node. Processing begins with joining corresponding
measurements of the five spectrum variables into tuples; then for each tuple
the frequency-directional spectrum is reconstructed and its variance is
computed. Results are gradually copied back to the machine where the
application was executed, and when the processing is complete the programme
terminates.

\begin{table}
  \centering
  \caption{Test platform configuration.}
  \begin{tabular}{ll}
    \toprule
    CPU & Intel Xeon E5440, 2.83~GHz \\
    RAM & 4~GB \\
    HDD & ST3250310NS, 7200~rpm \\
    No.\ of nodes & 12 \\
    No.\ of CPU cores per node & 8 \\
    \bottomrule
  \end{tabular}
  \label{tab:cluster}
\end{table}

In a series of test runs we benchmarked the performance of the application in
the presence of different types of failures:
\begin{itemize}
\item failure of a master node (the node where the first kernel is run),
\item failure of a slave node (a node where spectra from a particular station
  are reconstructed) and
\item failure of a backup node (the node where the first kernel is copied).
\end{itemize}
A tree hierarchy with a sufficiently large fan-out value was chosen to make all
cluster nodes connect directly to the first one, so that only one master node
exists in the cluster. In each run the first kernel was launched on a different
node to make the mapping of the kernel hierarchy to the tree hierarchy optimal.
A victim node was taken offline a fixed amount of time after the programme
start. To make up for the node failure, all data files have replicas stored on
different cluster nodes. All relevant parameters are summarised in
Table~\ref{tab:benchmark} (here ``root'' and ``leaf'' refer to a node's
position in the tree hierarchy). The results of these runs were compared to the
run without node failures (Figure~\ref{fig:benchmark-bigdata}).

\begin{table}
  \centering
  \caption{Benchmark parameters.}
  \begin{tabular}{llll}
    \toprule
    Experiment no. & Master node & Victim node & Time to offline, s \\
    \midrule
    1 & root & --- & --- \\
    2 & root & leaf & 30 \\
    3 & leaf & leaf & 30 \\
    4 & leaf & root & 30 \\
    \bottomrule
  \end{tabular}
  \label{tab:benchmark}
\end{table}

The benchmark showed that only a backup node failure results in a significant
performance penalty; in all other cases the performance roughly equals that of
the run without failures on one node fewer. This happens because the backup
node not only stores a copy of the state of the current computation step, but
also executes this step in parallel with the other subordinate nodes. So, when
the backup node fails, the master node executes the whole step once again on an
arbitrarily chosen healthy subordinate node.

% \begin{figure}
%   \centering
%   \includegraphics{factory-3000}
%   \caption{Performance of hydrodynamics HPC application in the presence of
%     node failures.}
%   \label{fig:benchmark-hpc}
% \end{figure}

\begin{figure}
  \centering
  \includegraphics{spec}
  \caption{Performance of the spectrum processing application in the presence
    of different types of node failures.}
  \label{fig:benchmark-bigdata}
\end{figure}

\section{Discussion}

The described algorithm guarantees handling of one failure per computational
step; more failures can be tolerated if they do not affect the master node. The
system handles the simultaneous failure of all subordinate nodes; however, if
both the master and the backup node fail, there is no chance for the
application to survive. In this case the state of the current computation step
is lost, and the only way to restore it is to restart the application.

Computational kernels are a means of abstraction that decouples a distributed
application from the physical hardware: it does not matter how many nodes are
online for an application to run successfully. Computational kernels eliminate
the need to allocate a physical backup node to make the master node highly
available: with the computational kernels approach any node can act as a
backup. Finally, computational kernels can handle subordinate node failures in
a way that is transparent to the programmer.

The disadvantage of this approach is evident: there is no way of making
existing middleware highly available without rewriting its source code.
Although our programming framework is lightweight, it is not easy to map the
architecture of existing middleware systems onto it: most systems are developed
with static assignment of server and client roles in mind, which is not easy to
make dynamic. Hopefully, our approach will simplify the design of future
middleware systems.