iccsa-16-factory

git clone https://git.igankevich.com/iccsa-16-factory.git

commit 10f29124925205d659ce22671ca80ea3c84f33d0
parent fc60d2879b5dc18ed6597ccee3ff90a87443f7a3
Author: Ivan Gankevich <igankevich@ya.ru>
Date:   Tue, 24 Jan 2017 10:56:59 +0300

Reformat sections.

Diffstat:
src/sections.tex | 231++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 194 insertions(+), 37 deletions(-)

diff --git a/src/sections.tex b/src/sections.tex
@@ -2,31 +2,111 @@
\subsection{Model of computation}

To infer a fault tolerance model suitable for big data applications we use the
bulk-synchronous parallel model~\cite{valiant1990bridging} as the basis. This
model assumes that a parallel programme is composed of several sequential steps
that are internally parallel, and that global synchronisation of all parallel
processes occurs after each step. In our model all sequential steps are
pipelined where possible. The evolution of the computational model is described
as follows.

Given a programme that is sequential and large enough to be decomposed into
several sequential steps, the simplest way to make it run faster is to exploit
data parallelism. Usually this means finding multi-dimensional arrays and the
loops that access their elements, and making those loops parallel. After
transforming several loops the programme still has the same number of
sequential steps, but every step is (ideally) internally parallel.

After that, the only remaining way to speed up the programme is to overlap the
execution of code blocks that work with different hardware devices. The most
common pattern is to overlap computation with network I/O or disk I/O. This
approach makes sense because the devices operate with little synchronisation,
and issuing commands to them in parallel makes the whole programme perform
better. This behaviour can be achieved by allocating a separate task queue for
each device and submitting tasks to these queues asynchronously with respect to
the main thread. After this optimisation the programme is composed of several
steps chained into a pipeline, where each step is implemented as a task queue
for a particular device.
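To illustrate, a per-device task queue can be sketched in a few dozen lines of
C++; the \texttt{task\_queue} class below is our own minimal example, not the
framework's actual code.

\begin{verbatim}
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// One queue per device: tasks are submitted asynchronously with respect
// to the main thread and are executed by the queue's own worker thread.
class task_queue {
public:
    task_queue(): worker_([this] { loop(); }) {}
    ~task_queue() {
        { std::lock_guard<std::mutex> lock(mtx_); done_ = true; }
        cv_.notify_one();
        worker_.join();
    }
    void submit(std::function<void()> task) {
        { std::lock_guard<std::mutex> lock(mtx_); tasks_.push(std::move(task)); }
        cv_.notify_one();
    }
private:
    void loop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mtx_);
                cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
                if (done_ && tasks_.empty()) return;
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task(); // run outside the lock, so new submissions are not blocked
        }
    }
    std::mutex mtx_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    bool done_ = false;
    std::thread worker_; // declared last: started after the other members
};

int main() {
    task_queue disk, network; // one queue per device
    disk.submit([] { /* read the next input block from a local drive */ });
    network.submit([] { /* send the previous result to the next node */ });
    // the main thread continues computing while both I/O operations proceed
}
\end{verbatim}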
Pipelining otherwise sequential steps is beneficial not only for code that
accesses different devices, but also for code whose branches are suitable for
execution by multiple hardware threads of the same processor core, i.e.\
branches that access different regions of memory or perform mixed arithmetic
(floating-point and integer). In other words, code branches that use different
modules of the processor are good candidates to run in parallel on a processor
core with multiple hardware threads.

Even though pipelining may not add parallelism for a programme that uses only
one input file (or a single set of input parameters), it does add parallelism
when the programme can process multiple input files: each input generates tasks
which travel through the whole pipeline in parallel with the tasks generated by
the other inputs. With a pipeline, an array of files is processed in parallel
by the same set of resources allocated for a batch job, and possibly with
greater efficiency on busy HPC clusters compared to executing a separate job
for each input file, because the time that each job after the first spends in
the queue is eliminated.

A computational model with a pipeline can be seen as a \emph{bulk-asynchronous
model}, because of the parallel nature of otherwise sequential execution steps.
This model is the basis of the fault tolerance model developed here.

\subsection{Fail over model}

Although fault tolerance and high availability are different terms, in essence
they describe the same property---the ability of a system to switch processing
from a failed component to its live spare or backup component. In the case of
fault tolerance it is the ability to switch from a failed slave node to a spare
one, i.e.\ to repeat a computation step on a healthy slave node. In the case of
high availability it is the ability to switch from a failed master node to a
backup node with full restoration of the execution state. These are the core
abilities that constitute a distributed system's ability to \emph{fail over}.

The key feature that is missing from current parallel programming and big data
processing technologies is the possibility to specify hierarchical dependencies
between parallel tasks. When such a dependency exists, it is trivial to
determine which task should be responsible for re-executing a failed task on a
healthy node. To re-execute the root of the hierarchy, a backup root task is
created and executed on a different node. There exist a number of engines that
are capable of executing directed acyclic graphs of tasks in
parallel~\cite{acun2014charmpp,islam2012oozie}, but graphs are ill-suited for
inferring master--slave relationships between tasks, because a node in a graph
may have multiple parent nodes.
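A minimal C++ sketch of such a single-parent task structure follows; the type
and function names (\texttt{task}, \texttt{on\_node\_failure}) are invented for
the example.

\begin{verbatim}
#include <functional>
#include <iostream>
#include <vector>

// A task in the hierarchy has at most one principal (parent), so on
// failure it is always unambiguous who restarts it -- unlike in a DAG,
// where a task may have several parents.
struct task {
    std::function<void()> work;
    task* principal = nullptr;       // null only for the root task
    std::vector<task*> subordinates; // tasks created by this task
};

// Invoked by the runtime when the node executing `failed` goes offline.
void on_node_failure(task& failed,
                     const std::function<void(task&)>& run_on_healthy_node) {
    if (failed.principal) {
        // The principal re-executes its failed subordinate elsewhere.
        run_on_healthy_node(failed);
    } else {
        // The root has no principal; its backup copy on another node
        // takes over instead (see "Handling master node failures").
    }
}

int main() {
    task root, child;
    child.work = [] { std::cout << "re-executing the lost sub-task\n"; };
    child.principal = &root;
    root.subordinates.push_back(&child);
    on_node_failure(child, [](task& t) { t.work(); });
}
\end{verbatim}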
\subsection{Programming model}

This work is based on the results of previous research. In
\cite{gankevich2015subordination,gankevich2015iccsa} we developed an algorithm
that builds a tree hierarchy from a strictly ordered set of cluster nodes. The
sole purpose of this hierarchy is to make a cluster more fault-tolerant by
introducing multiple master nodes. If a master node fails, its subordinates try
to connect to another node from the same or a higher level of the hierarchy; if
there is no such node, one of the subordinates becomes the master. In
\cite{gankevich2015spec} we developed a framework for big data processing
without fault tolerance; here this framework is combined with the fault
tolerance techniques described in this paper.

Each programme that runs on top of the tree hierarchy is composed of
computational kernels---objects that contain data and the code to process it.
To exploit parallelism a kernel may create an arbitrary number of subordinate
kernels, which are automatically spread first across available processor cores
and then across subordinate nodes in the tree hierarchy. The programme is
itself a kernel (without a parent, as it is executed by a user), which either
solves the problem sequentially on its own or creates subordinate kernels to
solve it in parallel.
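In code, the kernel abstraction can be sketched as follows; the method names
(\texttt{act}, \texttt{react}) mirror the description above and are
illustrative, not the framework's verbatim API.

\begin{verbatim}
#include <cstddef>

// A kernel carries both the data and the code that processes it.
struct kernel {
    kernel* parent = nullptr;
    virtual ~kernel() = default;
    // act() runs the kernel's code; to exploit parallelism it may create
    // subordinate kernels and hand them to the scheduler, which spreads
    // them first across processor cores, then across subordinate nodes.
    virtual void act() = 0;
    // react() is called on the parent for every subordinate that returns.
    virtual void react(kernel* subordinate) = 0;
};

// The programme itself is a kernel without a parent: it either solves
// the problem on its own in act(), or splits it into n subordinate
// kernels and collects their results in react().
struct programme: kernel {
    explicit programme(std::size_t n): remaining(n) {}
    void act() override {
        // for (std::size_t i = 0; i < remaining; ++i)
        //     schedule(new part_kernel(i, this)); // hypothetical call
    }
    void react(kernel*) override {
        if (--remaining == 0) { /* all parts done: publish the result */ }
    }
    std::size_t remaining;
};

int main() { programme p(4); p.act(); } // wiring to a scheduler omitted
\end{verbatim}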
In contrast to HPC applications, in big data applications it is inefficient to
run computational kernels on arbitrarily chosen nodes. A more practical
approach is to bind every kernel to a file location in a parallel file system
and to transfer the kernel to that location before processing the file. That
way expensive data transfer is eliminated, and the file is always read from a
local drive. This approach is more deterministic than existing ones; e.g.\ the
MapReduce framework runs jobs on nodes that are ``close'' to the file location,
but not necessarily on the exact node where the file is
located~\cite{dean2008mapreduce}. However, the approach does not come without
disadvantages: the scalability of a big data application is limited by the
strategy that was employed to distribute its input files across cluster nodes.
The more nodes are used to store the input files, the higher the read
performance that is achieved. The advantage of our approach is that its I/O
performance is more predictable than that of a hybrid approach with streaming
files over the network.
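A sketch of this ``move the code to the data'' submission step; both helper
functions are hypothetical stubs, since a real implementation would query the
parallel file system's metadata service.

\begin{verbatim}
#include <functional>
#include <iostream>
#include <string>

// Stub: ask the parallel file system which node stores the file.
std::string node_hosting(const std::string& path) {
    return "node-" + std::to_string(std::hash<std::string>{}(path) % 4);
}

// Stub: serialise the kernel and run it on the given node.
void send_to(const std::string& node, const std::string& kernel_name) {
    std::cout << "sending " << kernel_name << " to " << node << "\n";
}

// Bind the kernel to the file location: the kernel travels to the node
// that stores the file, so the file is always read from a local drive.
void submit_near_data(const std::string& path,
                      const std::string& kernel_name) {
    send_to(node_hosting(path), kernel_name);
}

int main() { submit_near_data("/pfs/buoy-41001.txt", "spectrum_kernel"); }
\end{verbatim}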
% From mathematical point of view kernel $K$ can be described as a vector-valued
% functional which recursively maps a kernel to $n$-component vector of kernels:
% \begin{equation*}

@@ -48,20 +128,53 @@
\subsection{Handling master node failures}

A possible way of handling the failure of the node where the first kernel is
located (the master node) is to replicate this kernel to a backup node and to
propagate all updates to its state to the backup node by means of a distributed
transaction. This approach requires synchronisation between the node with the
first kernel and all nodes that execute its subordinates. When the node with
the first kernel goes offline, the nodes with subordinate kernels must know
which node is the backup one. However, if the backup node also goes offline in
the middle of the execution of some subordinate kernel, then it is impossible
for this kernel to discover the next backup node to return to, because this
kernel has not yet discovered the unavailability of the master node. One can
think of a consensus-based algorithm to ensure that subordinate kernels always
know where the backup node is, but distributed consensus algorithms do not
scale well to a large number of nodes and they are not
reliable~\cite{fischer1985impossibility}. So a consensus-based approach does
not play well with the asynchronous nature of computational kernels, as it may
inhibit the scalability of a parallel programme.

Fortunately, the first kernel usually does not perform operations in parallel;
rather, it launches execution steps sequentially one by one, so it has only one
subordinate at a time. Such behaviour is described by the bulk-synchronous
parallel programming model, in the framework of which a programme consists of
sequential supersteps that are internally parallel~\cite{valiant1990bridging}.
Keeping this in mind, we can simplify the synchronisation of its state: we send
the first kernel along with its subordinate to the subordinate node. When the
node with the first kernel fails, its copy receives the subordinate, and no
execution time is lost. When the node with the copy fails, the subordinate is
rescheduled on some other node, and in the worst case a whole step of
computation is lost.
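The following sketch illustrates what travels over the wire in this scheme; the
message layout and names are ours.

\begin{verbatim}
#include <vector>

// Illustrative message layout: each superstep's subordinate kernel is
// sent to the subordinate node together with a copy of the first
// (principal) kernel's state.
struct serialised_kernel { std::vector<char> bytes; };

struct step_message {
    serialised_kernel subordinate;    // the work for this superstep
    serialised_kernel principal_copy; // backup of the first kernel's state
};

// On the subordinate node: run the subordinate as usual; if the master
// cannot be reached when the result is ready, restore the principal from
// the carried copy and deliver the result to it locally, so no execution
// time is lost.
void on_result_ready(const step_message& msg, bool master_alive) {
    if (master_alive) {
        // send the result back to the master node
    } else {
        // deserialise msg.principal_copy and continue as the new master
    }
}

int main() {}
\end{verbatim}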
The described approach works only for kernels that do not have a parent and
have only one subordinate at a time; such kernels act similarly to manually
triggered checkpoints. The advantage is that they
\begin{itemize}
    \item save results after each sequential step, when the memory footprint
        of the programme is low,
    \item save only the relevant data,
    \item and use the memory of a subordinate node instead of stable storage.
\end{itemize}

\section{Results}

The master node fail over technique is evaluated on the example of a wave
energy spectra processing application. This programme uses the NDBC
dataset~\cite{ndbc-dataset} to reconstruct frequency-directional spectra from
wave rider buoy measurements and to compute their variance. Each spectrum is
reconstructed from five variables using the following
formula~\cite{earle1996nondirectional}:
\begin{equation*}
    S(\omega, \theta) = \frac{1}{\pi} \left[
        \frac{1}{2}
        + r_1 \cos\left(\theta - \alpha_1\right)
        + r_2 \cos 2\left(\theta - \alpha_2\right)
    \right] S_0(\omega).
\end{equation*}
Here $\omega$ denotes frequency, $\theta$ is wave direction, $r_{1,2}$ and
$\alpha_{1,2}$ are parameters of the spectrum decomposition, and $S_0$ is the
non-directional spectrum; $r_{1,2}$, $\alpha_{1,2}$ and $S_0$ are acquired
through measurements.
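Transcribed into code, reconstructing one spectrum value is a direct
transcription of the formula above (a sketch; the function and parameter names
are ours):

\begin{verbatim}
#include <cmath>
#include <cstdio>

// S(w, theta) = (1/pi) * [1/2 + r1*cos(theta - a1)
//                             + r2*cos(2*(theta - a2))] * S0(w)
double spectrum(double theta, double r1, double a1,
                double r2, double a2, double s0) {
    const double pi = 3.14159265358979323846;
    return (0.5 + r1 * std::cos(theta - a1)
                + r2 * std::cos(2.0 * (theta - a2))) * s0 / pi;
}

int main() {
    // one point of a reconstructed spectrum, with made-up inputs
    std::printf("%g\n", spectrum(0.3, 0.5, 0.1, 0.25, 0.2, 1.2));
}
\end{verbatim}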
Properties of the dataset used in the evaluation are listed in
Table~\ref{tab:ndbc-dataset}.

\begin{table}
    \centering
@@ -88,7 +205,14 @@
    \label{tab:ndbc-dataset}
\end{table}

The algorithm of processing the spectra is as follows. First, the current
directory is recursively scanned for input files. The data for all buoys is
distributed across cluster nodes, and each buoy's data processing is
distributed across the processor cores of a node. Processing begins with
joining the corresponding measurements of the five spectrum variables into
tuples; then for each tuple the frequency-directional spectrum is reconstructed
and its variance is computed. Results are gradually copied back to the machine
where the application was executed, and when the processing is complete the
programme terminates.
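The outline of this algorithm in code (stubs only; in the real application each
stage runs as computational kernels distributed across nodes and processor
cores):

\begin{verbatim}
#include <filesystem>
#include <string>
#include <vector>

// Recursively scan the given directory for input files.
std::vector<std::string> scan_input_files(const std::string& dir) {
    std::vector<std::string> files;
    for (const auto& entry :
             std::filesystem::recursive_directory_iterator(dir))
        if (entry.is_regular_file())
            files.push_back(entry.path().string());
    return files;
}

void process_buoy(const std::string& /*file*/) {
    // 1. join corresponding measurements of the five variables into tuples
    // 2. reconstruct the frequency-directional spectrum of each tuple
    // 3. compute the variance of each spectrum
    // 4. gradually copy the results back to the submitting machine
}

int main() {
    for (const auto& f : scan_input_files("."))  // buoys spread across nodes
        process_buoy(f);                         // and across cores per node
}
\end{verbatim}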
\begin{table}
    \centering
@@ -105,13 +229,24 @@
    \label{tab:cluster}
\end{table}

In a series of test runs we benchmarked the performance of the application in
the presence of different types of failures:
\begin{itemize}
    \item failure of a master node (the node where the first kernel runs),
    \item failure of a slave node (a node where spectra from a particular
        station are reconstructed), and
    \item failure of a backup node (the node where the first kernel is
        copied).
\end{itemize}

A tree hierarchy with a sufficiently large fan-out value was chosen to make all
cluster nodes connect directly to the first one, so that only one master node
exists in the cluster. In each run the first kernel was launched on a different
node to make the mapping of the kernel hierarchy onto the tree hierarchy
optimal. A victim node was taken offline a fixed, short time after the
programme start. To make up for the node failure, all data files have replicas
stored on different cluster nodes. All relevant parameters are summarised in
Table~\ref{tab:benchmark} (here ``root'' and ``leaf'' refer to a node's
position in the tree hierarchy). The results of these runs were compared to a
run without node failures (Figure~\ref{fig:benchmark-bigdata}).

\begin{table}
    \centering
@@ -129,7 +264,13 @@
    \label{tab:benchmark}
\end{table}

The benchmark showed that only a backup node failure results in a significant
performance penalty; in all other cases the performance is roughly equal to
that of the run without failures, but with one node fewer. This happens because
the backup node not only stores a copy of the state of the current computation
step but also executes this step in parallel with the other subordinate nodes.
So, when the backup node fails, the master node executes the whole step once
again on an arbitrarily chosen healthy subordinate node.

% \begin{figure}
%     \centering

@@ -147,8 +288,24 @@
\section{Discussion}

The described algorithm is guaranteed to handle one failure per computational
step; more failures can be tolerated if they do not affect the master node. The
system handles simultaneous failure of all subordinate nodes; however, if both
the master and the backup node fail, the application cannot survive. In this
case the state of the current computation step is lost, and the only way to
restore it is to restart the application.
Computational kernels are a means of abstraction that decouples a distributed
application from the physical hardware: it does not matter how many nodes are
online for an application to run successfully. Computational kernels eliminate
the need to allocate a physical backup node to make the master node highly
available: with the computational kernel approach any node can act as a backup.
Finally, computational kernels handle subordinate node failures in a way that
is transparent to the programmer.

The disadvantage of this approach is evident: there is no way of making
existing middleware highly available without rewriting its source code.
Although our programming framework is lightweight, it is not easy to map the
architecture of existing middleware systems onto it: most systems are designed
with a static assignment of server/client roles in mind, which is not easy to
make dynamic. We hope that our approach will simplify the design of future
middleware systems.