hpcs-17-subord

git clone https://git.igankevich.com/hpcs-17-subord.git

commit 6d37f4a8981c24de5d5b071c6adb17e8e4819cff
parent c2cbbb629f57be37420dd3b6cc127fa990161e4e
Author: Ivan Gankevich <igankevich@ya.ru>
Date:   Fri, 24 Mar 2017 22:52:44 +0300

Finalise the paper.

- Fix figure references.
- Spell-check.
- Add missing reference.

Diffstat:
 bib/refs.bib | 12 ++++++++++++
 main.tex     |  7 +++----
 src/body.tex | 49 ++++++++++++++++++++++++-------------------------
 src/tail.tex | 10 +++++-----
4 files changed, 44 insertions(+), 34 deletions(-)

diff --git a/bib/refs.bib b/bib/refs.bib
@@ -61,3 +61,15 @@
 year={2016},
 organization={IEEE}
 }
+
+
+@inproceedings{schroeder2007understanding,
+  title={Understanding failures in petascale computers},
+  author={Schroeder, Bianca and Gibson, Garth A},
+  booktitle={Journal of Physics: Conference Series},
+  volume={78},
+  number={1},
+  pages={012022},
+  year={2007},
+  organization={IOP Publishing}
+}
diff --git a/main.tex b/main.tex
@@ -16,12 +16,12 @@
 \title{Subordination: a framework for fault-tolerant systems}
 \author{%
-  \IEEEauthorblockN{Yuri Tipikin \quad Ivan Gankevich \quad Vladimir Korkhov}
+  \IEEEauthorblockN{Ivan Gankevich \quad Yuri Tipikin \quad Vladimir Korkhov}
   \IEEEauthorblockA{%
     Dept. of Computer Modeling and Multiprocessor Systems\\
     Saint Petersburg State University\\
     Saint-Petersburg, Russia\\
-    Email: y.tipikin@spbu.ru, i.gankevich@spbu.ru, v.korkhov@spbu.ru%
+    Email: i.gankevich@spbu.ru, y.tipikin@spbu.ru, v.korkhov@spbu.ru%
   }%
 }%
@@ -41,8 +41,7 @@
 \input{src/tail}
 
 % delete this reference
-\nocite{*}
-
+%\nocite{*}
 % trigger a \newpage just before the given reference
 % number - used to balance the columns on the last page
diff --git a/src/body.tex b/src/body.tex
@@ -14,7 +14,7 @@
 algorithm that does not require periodic broadcasting of messages, and the
 role is derived from node's IP address. Detailed explanation of the algorithm
 is provided in~\cite{gankevich2015subordination}. Its strengths is scalability
 to a large number of nodes and low overhead, which are essential for large-scale
-high-performance compuations, and its weakness is in artificial dependence of
+high-performance computations, and its weakness is in artificial dependence of
 node's position in the hierarchy on its IP address, which is not desirable in
 virtual environments, where nodes' IP addresses may change without a notice.
@@ -23,9 +23,9 @@
 automatically reconfigurable logical tree hierarchy of cluster nodes.
 This hierarchy is used to distribute the load from the current node to its
 neighbours by simply iterating over all directly connected daemons. Upon
 reconfiguration due to node failure or due to new node joining the cluster,
-daemons exchange messages telling each other how mant daemons are ``behind''
+daemons exchange messages telling each other how many daemons are ``behind''
 them in the hierarchy. This information is used to distribute the load evenly,
-even if a parallel prgoramme is launched on a slave node. In addition, this
+even if a parallel programme is launched on a slave node. In addition, this
 topology reduces the number of simultaneous connections, thus preventing
 network overload.
@@ -83,7 +83,7 @@ wait, and call correct kernel methods by analysing their internal state.
 
 In our system a node is considered failed if the corresponding network
 connection is abruptly closed. Developing more sophisticated failure techniques
-is out of scope of this paper. For the purpose of studying recovery procedeures
+is out of scope of this paper. For the purpose of studying recovery procedures
 upon node failure this simple approach is sufficient.
 
 Consequently, any kernel which resided on the failed node is considered failed,
@@ -91,8 +91,8 @@ and failure recovery procedure is triggered. Depending on the position of the
 kernel in kernel hierarchy recovery is carried out on the node where parent or
 one of the subordinate kernels resides. Recovery procedure for failed
 subordinate kernel is re-execution of this kernel on a healthy node, which is
-triggered automatically by the node where ots parent kernel is located. If the
-subordinate communiates with other subordinates of the same parent kernel and
+triggered automatically by the node where its parent kernel is located. If the
+subordinate communicates with other subordinates of the same parent kernel and
 one of them fails, all kernels as well as their parent are considered failed,
 and a copy of the parent is re-executed on a healthy node. If parent kernel
 fails, then its copy, which is sent along with every subordinate on other
@@ -159,9 +159,10 @@ When a principal fails every subordinate has its copy, but we need to restore
 it only once and only on one node to correctly continue programme execution.
 To ensure that the principal is restored only once, each subordinate tries to
 find the first surviving node from the IP address list of neighbours
-(Figure~\ref{fig:principal-fails}). If such node is online, the search stops
-and the subordinate kernel is deleted. If the node is not found, the
-subordinate restores the principal from the copy and deletes itself. Kernel
+(Figure~\ref{fig:principal-fails}
+and~\ref{fig:subordinate-and-principal-fail}). If such node is online, the
+search stops and the subordinate kernel is deleted. If the node is not found,
+the subordinate restores the principal from the copy and deletes itself. Kernel
 deletion is necessary, because the whole computational step, modelled by the
 principal, is re-executed from the principal initial state, and there is no
 simple and reliable way of taking into account the results which was produced
@@ -181,15 +182,22 @@ so far by the subordinates.
   \label{fig:principal-fails}
 \end{figure}
 
+\begin{figure}
+  \centering
+  \includegraphics{sc2}
+  \caption{Simultaneous failure of a subordinate and its principal.}
+  \label{fig:subordinate-and-principal-fail}
+\end{figure}
+
 \paragraph*{Scenario~2}
 In comparison to the first scenario, the second one is more complicated, but
 also more frequent. While on kernel hierarchy the system acts the same as in
 the first scenario, when we move to daemon hierarchy one more possible variant
 is added. In deep kernel hierarchy a kernel may act as a subordinate and as a
 principal at the same time. Thus, we need to copy not only direct principal of
 each subordinate kernel, but also all principals
-higher in the hierarchy recursively (Figure~\ref{fig:sc2} and~\ref{fig:sc3}).
-So, the additional variant is a generalisation of the two previous ones for
-deep kernel hierarchies.
+higher in the hierarchy recursively (Figure~\ref{fig:sc3}). So, the additional
+variant is a generalisation of the two previous ones for deep kernel
+hierarchies.
 
 Handling principal failure in a deep kernel hierarchy may involve a lot of
 overhead, because its failure is detected only when a subordinate finishes its
@@ -206,13 +214,6 @@ a file, whereas subordinates only process separate dataset parts.
 
 \begin{figure}
   \centering
-  \includegraphics{sc2}
-  \caption{Simultaneous failure of a subordinate and its principal.}
-  \label{fig:sc2}
-\end{figure}
-
-\begin{figure}
-  \centering
   \includegraphics{sc3}
   \caption{Simultaneous failure of two principals.}
   \label{fig:sc3}
@@ -241,8 +242,6 @@ as the follow.
   alive kernels and re-executes them.
 \end{itemize}
-
-
 Recovery from a failure of all nodes is the most inefficient, because it
 involves the use of persistent storage and there is no reliable way to ensure
 that all cluster nodes have been bootstrapped. If some nodes were not
@@ -399,10 +398,10 @@ typically set to 2 or 3 depending on the particular site.
 We believe that there is no need to set number of object copies more than
 that, as it allows to tolerate simultaneous failure of 2 and 3 nodes
 respectively: it should be more than enough to tolerate node failures which
 are common at large
-scale~\cite{mean-time-between-failures-darpa-TODO}. So, using arrays with
-linear search complexity is more efficient than maps and sets, because the
-number of elements in them is small, and linear search takes less time than
-fixed time hash-based lookup.
+scale~\cite{schroeder2007understanding}. So, using arrays with linear search
+complexity is more efficient than maps and sets, because the number of elements
+in them is small, and linear search takes less time than fixed time hash-based
+lookup.
 
 Transmitting IP addresses of previous nodes is an optimisation over mapping to
 only linear hierarchies, that is hierarchies where only one subordinate is
diff --git a/src/tail.tex b/src/tail.tex
@@ -52,21 +52,21 @@ kernels upon a failure.
 
 \section{Conclusion}
 In the paper we propose a system architecture consisting of two tree
-hierarchies of entitites, mapped on each other, that simplifies provision of
+hierarchies of entities, mapped on each other, that simplifies provision of
 resilience to failures for parallel programmes. The resilience is solely
-provided by the use of hierarchical dependencies between entitites, and is
+provided by the use of hierarchical dependencies between entities, and is
 independent on each layer of the system. To optimise handling failure of
 multiple cluster nodes, we use the hierarchy implied by the order of creation
-of subordinate entitities. The hierarchical approach to fault tolerance is
+of subordinate entities. The hierarchical approach to fault tolerance is
 efficient, scales to a large number of cluster nodes, and requires slow I/O
 operations only for the most disastrous scenario~--- simultaneous failure of
 all cluster nodes.
 
-The future work is to standardase application programming interface of the
+The future work is to standardise application programming interface of the
 system and investigate load-balancing techniques, which are optimal for a
 programme composed of many computational kernels.
 
-\section*{Acknowledgment}
+\section*{Acknowledgement}
 The research was carried out using computational resources of Resource Centre
 ``Computational Centre of Saint Petersburg State University'' (\mbox{T-EDGE96}
 \mbox{HPC-0011828-001}) within frameworks of grants of Russian Foundation for