hpcs-16-factory

git clone https://git.igankevich.com/hpcs-16-factory.git

commit e0253294bbe77ef52aa274c1a1e6edc4985ba272
parent b0f61471f06c6d111408e9b578ef5264c2e55a60
Author: Ivan Gankevich <igankevich@ya.ru>
Date:   Tue, 24 Jan 2017 11:16:12 +0300

Reformat.

Diffstat:
src/methods.tex | 130+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------
src/results.tex | 77++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------
2 files changed, 175 insertions(+), 32 deletions(-)

diff --git a/src/methods.tex b/src/methods.tex

\subsection{HIERARCHY OF NODES}

This work is based on the results of previous research:
in~\cite{gankevich2015subordination} we developed an algorithm that builds a
tree hierarchy from a strictly ordered set of cluster nodes. The sole purpose
of this hierarchy is to make a cluster more fault-tolerant by introducing
multiple master nodes. If a master node fails, its subordinates try to connect
to another node from the same or a higher level of the hierarchy. If there is
no such node, one of the subordinates becomes the master.

The position of a node in the hierarchy is determined by mapping the position
of its IP address in the network to a layer and an offset in the tree; the
number of layers is controlled by a fan-out value. As nodes' IP addresses
change infrequently, this mapping is mostly static and is affected only by
node failures. Thus, with the help of the tree hierarchy we can determine the
IP address of a master node precisely, without resorting to the costly leader
election algorithms which are commonly used for this purpose.

We use this hierarchy to perform load balancing across neighbouring cluster
nodes (nodes that are adjacent in the hierarchy), i.e.~if a job is launched on
a subordinate node, its principal node also receives a fraction of the load.
This rule makes the system symmetric: each node runs the same software, and
switching from a failed master node to a backup node is just a matter of
changing the node's role. A similar design choice is applied in distributed
key-value stores~\cite{anderson2010couchdb,lakshman2010cassandra} to handle
failure of a master node, but we are not aware of job schedulers that use it
to distribute the load on a cluster with multiple master nodes.
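To make the mapping concrete, the following sketch computes a node's layer,
offset and principal from its rank in the ordered set of IP addresses. It is a
minimal illustration under an assumed numbering scheme (an implicit complete
tree enumerated in level order), not the algorithm
of~\cite{gankevich2015subordination} itself.

\begin{verbatim}
#include <cstdio>

// Hypothetical illustration: nodes are ranked 0..N-1 by the order of their
// IP addresses and arranged into an implicit complete tree with the given
// fan-out; layer, offset and principal then follow from integer arithmetic,
// so no leader election is needed.
struct TreePosition { int layer, offset; };

TreePosition position(int rank, int fanout) {
    int layer = 0, first = 0, width = 1;   // first rank and width of a layer
    while (rank >= first + width) {        // walk down the layers
        first += width;
        width *= fanout;
        ++layer;
    }
    return {layer, rank - first};
}

int principal(int rank, int fanout) {
    return rank == 0 ? -1 : (rank - 1) / fanout;   // -1: the root has none
}

int main() {
    const int fanout = 2;
    for (int rank = 0; rank < 7; ++rank) {
        TreePosition p = position(rank, fanout);
        std::printf("node %d: layer %d, offset %d, principal %d\n",
                    rank, p.layer, p.offset, principal(rank, fanout));
    }
}
\end{verbatim}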
\subsection{HIERARCHY OF COMPUTATIONAL KERNELS}

Each programme that runs on top of the tree hierarchy is composed of
computational kernels---objects that contain data and the code to process it.
To exploit parallelism, a kernel may create an arbitrary number of subordinate
kernels, which are automatically spread first across available processor cores
and then across subordinate nodes in the tree hierarchy. The programme is
itself a kernel (without a parent, as it is executed by a user), which either
solves the problem sequentially on its own or creates subordinate kernels to
solve it in parallel.

Unlike the main function in programmes based on a message passing library, the
first computational kernel is initially run on only one node, and remote nodes
are used only when the local queue overflows with kernels. This design choice
allows a programme to use an arbitrary number of nodes throughout its
execution and to take more nodes for highly parallel parts of the code. A
somewhat similar choice was made in the design of the MapReduce
framework~\cite{dean2008mapreduce,vavilapalli2013yarn}---a user submitting a
job does not specify the number of hosts to run the job on, and the effective
hosts are the hosts where the input files are located.

From a mathematical point of view a kernel $K$ can be described as a
vector-valued functional which recursively maps a kernel to an $n$-component
vector of kernels:
\begin{equation*}
    K(f): \mathbb{K} \rightarrow \mathbb{K}^n
    \qquad
    \mathbb{K}^n = \left\{ f: \mathbb{K} \rightarrow \mathbb{K}^n \right\}.
\end{equation*}
A dummy kernel $\mathbb{O}: \mathbb{K} \rightarrow \mathbb{K}^0$, which stops
the recursion, is used to call the first kernel and to finish the execution of
the programme. The argument to each kernel is interpreted using the following
rules.
\begin{enumerate}
    \item If a kernel is a new kernel, then its argument is its parent kernel.
    \item If a kernel is the parent of the kernel that produced it or some
    other existing kernel, then the argument is the kernel that produced it.
\end{enumerate}

The engine that executes kernels is implemented as a simple loop. It starts by
calling the first kernel with a dummy kernel as an argument, then calls each
kernel that was produced by this call, and so forth. The loop finishes when a
dummy kernel is returned as the result of a call.
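The kernel interface and the engine loop just described can be sketched in a
few dozen lines. This is a minimal single-threaded model of the stated
behaviour (a dummy argument for the first call, the parent called back with a
finished subordinate); the class and function names are assumptions made for
illustration, not the Factory API.

\begin{verbatim}
#include <cstdio>
#include <memory>
#include <queue>
#include <vector>

// Minimal single-threaded sketch of the kernel/engine model described above.
// All names and signatures are illustrative assumptions, not the Factory API.
struct Kernel;
using KernelPtr = std::shared_ptr<Kernel>;

struct Kernel {
    virtual ~Kernel() = default;
    // Maps the argument (the parent for a new kernel, a finished subordinate
    // for an existing one) to the kernels to execute next; an empty vector
    // plays the role of the dummy kernel that stops the recursion.
    virtual std::vector<KernelPtr> call(const KernelPtr& argument) = 0;
    KernelPtr parent;  // set by the engine when a subordinate is produced
};

// The engine: call the first kernel with a dummy argument, execute every
// kernel it produces, and call a parent back whenever a subordinate finishes.
void run(const KernelPtr& first) {
    std::queue<std::pair<KernelPtr, KernelPtr>> pool;  // (kernel, argument)
    pool.push({first, nullptr});
    while (!pool.empty()) {
        auto [kernel, argument] = pool.front();
        pool.pop();
        auto produced = kernel->call(argument);
        if (produced.empty()) {            // the dummy kernel was returned
            if (kernel->parent) pool.push({kernel->parent, kernel});
        } else {
            for (auto& sub : produced) {
                sub->parent = kernel;
                pool.push({sub, kernel});
            }
        }
    }
}

// Example: a principal kernel creates two subordinates and collects their
// results before finishing.
struct PartKernel : Kernel {
    int id;
    explicit PartKernel(int i) : id(i) {}
    std::vector<KernelPtr> call(const KernelPtr&) override {
        std::printf("part %d done\n", id);
        return {};
    }
};

struct MainKernel : Kernel {
    int remaining = 0;
    std::vector<KernelPtr> call(const KernelPtr& argument) override {
        if (!argument) {                   // first call: create subordinates
            remaining = 2;
            return {std::make_shared<PartKernel>(1),
                    std::make_shared<PartKernel>(2)};
        }
        if (--remaining == 0) std::puts("all parts collected");
        return {};
    }
};

int main() { run(std::make_shared<MainKernel>()); }
\end{verbatim}

In the real framework the pool is drained by several threads and may spill
kernels to neighbouring nodes, as described next.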
Since a kernel call may return multiple kernels, they are executed in
parallel. Parallel execution quickly produces a pool of kernels which permit
execution in an unspecified order. Several threads concurrently retrieve
kernels from the pool and may ``spill'' remaining kernels to neighbouring
cluster nodes.

Kernels are implemented as closures---function objects containing all of their
arguments, a reference to the parent kernel and user-supplied data. The data
is either processed upon a kernel call, or subordinate kernels are created to
process it in parallel. When the processing is complete, the parent kernel
closure is called with its subordinate kernel as an argument to collect the
data from it.

\subsection{HANDLING SINGLE NODE FAILURES}

The basic strategy to overcome a failure of a subordinate node is to restart
the corresponding kernels on a healthy node---a strategy employed in the
Erlang language to restart failed subordinate
processes~\cite{armstrong2003thesis}. To implement this, we record every
kernel that is sent to a remote cluster node, and in the event of a node
failure these kernels are simply rescheduled to other subordinate nodes with
no special handling from the programmer. If there are no nodes to send kernels
to, they are scheduled locally. So, in contrast to heavy-weight
checkpoint/restart machinery, the tree hierarchy allows automatic and
transparent handling of subordinate node failures without restarting parallel
processes on every node.
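The record-and-reschedule mechanism can be pictured as a small registry of
in-flight kernels per node. The sketch below uses assumed names and a string
node identifier purely for illustration; it is not the framework's
implementation.

\begin{verbatim}
#include <algorithm>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Illustrative bookkeeping (hypothetical names, not the framework's code):
// every kernel sent to a remote node is recorded, so that the kernels of a
// failed node can be rescheduled without any action from the programmer.
struct Kernel {};  // stands in for the kernel type of the previous sketch
using KernelPtr = std::shared_ptr<Kernel>;

class SentKernelRegistry {
public:
    // Called when a kernel is sent to a remote node.
    void record(const std::string& node, KernelPtr k) {
        in_flight_[node].push_back(std::move(k));
    }

    // Called when the kernel (or its result) comes back from that node.
    void acknowledge(const std::string& node, const KernelPtr& k) {
        auto& v = in_flight_[node];
        v.erase(std::remove(v.begin(), v.end(), k), v.end());
    }

    // Called when a node failure is detected: the recorded kernels are handed
    // back to the scheduler, which sends them to other subordinate nodes or,
    // if there are none, runs them locally.
    std::vector<KernelPtr> take_failed(const std::string& node) {
        std::vector<KernelPtr> kernels = std::move(in_flight_[node]);
        in_flight_.erase(node);
        return kernels;
    }

private:
    std::map<std::string, std::vector<KernelPtr>> in_flight_;
};
\end{verbatim}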
A possible way of handling the failure of the node where the first kernel is
located is to replicate this kernel to a backup node and to propagate all
updates to its state to the backup node by means of a distributed transaction.
However, this approach does not play well with the asynchronous nature of
computational kernels. Fortunately, the first kernel usually does not perform
operations in parallel; rather, it launches execution steps sequentially, one
by one, so it has only one subordinate at a time. Keeping this in mind, we can
simplify the synchronisation of its state: we send the first kernel along with
its subordinate to the subordinate node. When the node with the first kernel
fails, its copy receives the subordinate, and no execution time is lost. When
the node with the copy fails, the subordinate is rescheduled on some other
node, and in the worst case a whole step of computation is lost.

The described approach works only for kernels that do not have a parent and
have only one subordinate at a time, which means that they act as optimised
checkpoints. The advantage is that they save results after each sequential
step, when the memory footprint of the programme is low; they save only the
relevant data; and they use the memory of a subordinate node instead of stable
storage.
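One way to picture this scheme is that the message carrying the first kernel's
only subordinate also carries a copy of the first kernel, and the result of
the subordinate is delivered to whichever copy is still reachable. The
fragment below is purely illustrative, with hypothetical type names; it is not
the framework's wire format.

\begin{verbatim}
#include <memory>

// All names here are hypothetical; the fragment only illustrates the scheme
// described above, not the framework's actual types.
struct Kernel {
    virtual ~Kernel() = default;
};
using KernelPtr = std::shared_ptr<Kernel>;

// The message that carries the only subordinate of the first kernel to a
// remote node also carries a copy of the first kernel itself.
struct SubordinateMessage {
    KernelPtr subordinate;      // the kernel to execute on the remote node
    KernelPtr principal_copy;   // backup copy of the first kernel
};

// When the subordinate finishes, its result is delivered either to the
// original first kernel or, if its node has failed, to the local copy, so
// that no execution time is lost in that case.
KernelPtr choose_collector(const SubordinateMessage& msg,
                           bool principal_node_alive,
                           const KernelPtr& principal_on_its_node) {
    return principal_node_alive ? principal_on_its_node : msg.principal_copy;
}
\end{verbatim}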
\subsection{HANDLING OUTAGES}

An electricity outage is a serious failure, so if there is no geographically
distant cluster that can share the load, the only choice is to hope that no
important data is lost and to restart every batch job after full site
recovery. To reduce the restart time, the first kernel of each job may save
its state (which is small compared to the full state of a job) to some stable
storage. Such a scenario complicates the design of a distributed system, so it
was not considered in this paper.

\subsection{IMPLEMENTATION}

For efficiency reasons the fault tolerance techniques described above are
implemented in a C++ framework: from the authors' perspective the C language
is too low-level for distributed programmes, and Java incurs too much overhead
and is not popular in the HPC community. To use the framework without a job
scheduler, we need to implement a daemon that maintains the state of the
hierarchy of nodes and exposes an API to interact with it. As of now, the
framework runs in the same process as the parallel application that uses it.
The framework is called Factory; it is now in the proof-of-concept development
stage.

diff --git a/src/results.tex b/src/results.tex

\section{RESULTS}

The Factory framework is evaluated on a physical cluster
(Table~\ref{tab:cluster}) on the example of a hydrodynamics HPC application
which was developed in~\cite{autoreg-stab,autoreg2011csit,autoreg1,autoreg2}.
This programme generates a wavy ocean surface using the ARMA model; its output
is a set of files representing different parts of the realisation. From a
computer scientist's point of view the application consists of a series of
filters, each applied to the result of the previous one. Some of the filters
are parallel, so the programme is written as a sequence of big steps, and some
steps are made internally parallel to get better performance. In the programme
only the most compute-intensive step (the surface generation) is executed in
parallel across all cluster nodes, while the other steps are executed in
parallel across all cores of the master node.
\begin{table}
    \centering
    % table body elided from the diff context
    \label{tab:cluster}
\end{table}

The application was rewritten for the new version of the framework, which
required only slight modifications to handle the failure of the node with the
first kernel: the kernel was flagged so that the framework makes a replica and
sends it to some subordinate node. There were no additional code changes other
than modifying some parts to match the new API. So, the tree hierarchy of
kernels is a mostly non-intrusive model for providing fault tolerance which
demands only explicit marking of the replicated kernels.
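In code, such ``flagging'' could look like the following; the field name is
hypothetical, since the actual Factory API is not shown in this text, and the
fragment only illustrates how small the fault-tolerance-specific change is.

\begin{verbatim}
// Hypothetical illustration of marking the first kernel so that the framework
// replicates it to a subordinate (backup) node; the names are assumptions.
struct Kernel {
    bool replicate_to_subordinate = false;
};

int main() {
    Kernel first;
    first.replicate_to_subordinate = true;  // the only fault-tolerance change
}
\end{verbatim}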
In a series of experiments we benchmarked the performance of the new version
of the application in the presence of different types of failures (the numbers
correspond to the graphs in Figure~\ref{fig:benchmark}):
\begin{enumerate}
    \item no failures,
    \item failure of a slave node (a node where a part of the wavy surface is
    generated),
    \item failure of a master node (the node where the first kernel is run),
    \item failure of a backup node (the node where a copy of the first kernel
    is stored).
\end{enumerate}

A tree hierarchy with a fan-out value of 64 was chosen to make all cluster
nodes connect directly to the first one. In each run the first kernel was
launched on a different node to make the mapping of the kernel hierarchy to
the tree hierarchy optimal. A victim node was taken offline a fixed amount of
time after the programme start, equivalent to approximately $1/3$ of the total
run time without failures on a single node. All relevant parameters are
summarised in Table~\ref{tab:benchmark} (here ``root'' and ``leaf'' refer to a
node's position in the tree hierarchy). The results of these runs were
compared to the run without node failures
(Figures~\ref{fig:benchmark}--\ref{fig:slowdown}).

There is a considerable difference in net performance for different types of
failures. Graphs 2 and 3 in Figure~\ref{fig:benchmark} show that performance
in the case of master or slave node failure is the same. In the case of master
node failure a backup node stores a copy of the first kernel and uses this
copy when it fails to connect to the master node. In the case of slave node
failure, the master node redistributes the load across the remaining slave
nodes. In both cases the execution state is not lost and no time is spent to
restore it, which is why the performance is the same. Graph 4 in
Figure~\ref{fig:benchmark} shows that performance in the case of a backup node
failure is much lower. This happens because the master node stores only the
current step of the computation plus some additional fixed amount of data,
whereas the backup node not only stores a copy of this information but also
executes this step in parallel with the other subordinate nodes. So, when a
backup node fails, the master node executes the whole step once again on an
arbitrarily chosen healthy node.

\begin{table}
    \centering
    % table body elided from the diff context
    \label{tab:benchmark}
\end{table}

Finally, to measure how much time is lost due to a failure, we divide the
total execution time with a failure by the total execution time without the
failure but with one node less. The results of this calculation are obtained
from the same benchmark and are presented in Figure~\ref{fig:slowdown}. The
difference in performance in the case of master and slave node failures lies
within a 5\% margin, and in the case of backup node failure within a 50\%
margin for fewer than~6 nodes\footnote{Measuring this margin for a higher
number of nodes does not make sense, since the time before the failure is
greater than the total execution time with these numbers of nodes, and the
programme finishes before the failure occurs.}. An increase in execution time
of 50\% is more than the $1/3$ of execution time after which the failure
occurs, but a backup node failure needs some time to be discovered: it is
detected only when the subordinate kernel carrying the copy of the first
kernel finishes its execution and tries to reach its parent. Instant detection
would require abruptly stopping the subordinate kernel, which may be
undesirable for programmes with complicated logic.
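Written explicitly (the notation is ours), the quantity plotted in
Figure~\ref{fig:slowdown} is the ratio of the run time with a failure on $n$
nodes to the failure-free run time on $n-1$ nodes:
\begin{equation*}
    \text{slowdown}(n) = \frac{T_{\text{failure}}(n)}{T(n-1)},
\end{equation*}
so a value close to one means that a failure costs roughly the same as having
one node fewer from the start.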
\begin{figure}
    \centering
    % figure body elided from the diff context
    \label{fig:benchmark}
\end{figure}

To summarise, the benchmark showed that \emph{no matter whether a master or a
slave node fails, the resulting performance roughly equals the performance
without failures on one node less}; however, when a backup node fails the
performance penalty is much higher.

\begin{figure}
    \centering