hpcs-17-subord

git clone https://git.igankevich.com/hpcs-17-subord.git

commit 6d37f4a8981c24de5d5b071c6adb17e8e4819cff
parent c2cbbb629f57be37420dd3b6cc127fa990161e4e
Author: Ivan Gankevich <igankevich@ya.ru>
Date:   Fri, 24 Mar 2017 22:52:44 +0300

Finalise the paper.

- Fix figure references.
- Spell-check.
- Add missing reference.

Diffstat:
 bib/refs.bib | 12 ++++++++++++
 main.tex     |  7 +++----
 src/body.tex | 49 ++++++++++++++++++++++++-------------------------
 src/tail.tex | 10 +++++-----
4 files changed, 44 insertions(+), 34 deletions(-)

diff --git a/bib/refs.bib b/bib/refs.bib
@@ -61,3 +61,15 @@
 year={2016},
 organization={IEEE}
 }
+
+
+@inproceedings{schroeder2007understanding,
+  title={Understanding failures in petascale computers},
+  author={Schroeder, Bianca and Gibson, Garth A},
+  booktitle={Journal of Physics: Conference Series},
+  volume={78},
+  number={1},
+  pages={012022},
+  year={2007},
+  organization={IOP Publishing}
+}
diff --git a/main.tex b/main.tex
@@ -16,12 +16,12 @@
 \title{Subordination: a framework for fault-tolerant systems}
 \author{%
-  \IEEEauthorblockN{Yuri Tipikin \quad Ivan Gankevich \quad Vladimir Korkhov}
+  \IEEEauthorblockN{Ivan Gankevich \quad Yuri Tipikin \quad Vladimir Korkhov}
   \IEEEauthorblockA{%
     Dept. of Computer Modeling and Multiprocessor Systems\\
     Saint Petersburg State University\\
     Saint-Petersburg, Russia\\
-    Email: y.tipikin@spbu.ru, i.gankevich@spbu.ru, v.korkhov@spbu.ru%
+    Email: i.gankevich@spbu.ru, y.tipikin@spbu.ru, v.korkhov@spbu.ru%
   }%
 }%
@@ -41,8 +41,7 @@
 \input{src/tail}
 
 % delete this reference
-\nocite{*}
-
+%\nocite{*}
 % trigger a \newpage just before the given reference
 % number - used to balance the columns on the last page
diff --git a/src/body.tex b/src/body.tex
@@ -14,7 +14,7 @@
 algorithm that does not require periodic broadcasting of messages, and the
 role is derived from node's IP address. Detailed explanation of the algorithm
 is provided in~\cite{gankevich2015subordination}. Its strengths is scalability
 to a large number of nodes and low overhead, which are essential for large-scale
-high-performance compuations, and its weakness is in artificial dependence of
+high-performance computations, and its weakness is in artificial dependence of
 node's position in the hierarchy on its IP address, which is not desirable in
 virtual environments, where nodes' IP addresses may change without a notice.
@@ -23,9 +23,9 @@
 automatically reconfigurable logical tree hierarchy of cluster nodes.
 This hierarchy is used to distribute the load from the current node to its
 neighbours by simply iterating over all directly connected daemons. Upon
 reconfiguration due to node failure or due to new node joining the cluster,
-daemons exchange messages telling each other how mant daemons are ``behind''
+daemons exchange messages telling each other how many daemons are ``behind''
 them in the hierarchy. This information is used to distribute the load evenly,
-even if a parallel prgoramme is launched on a slave node. In addition, this
+even if a parallel programme is launched on a slave node. In addition, this
 topology reduces the number of simultaneous connections, thus preventing
 network overload.
@@ -83,7 +83,7 @@ wait, and call correct kernel methods by analysing their internal state.
 
 In our system a node is considered failed if the corresponding network
 connection is abruptly closed. Developing more sophisticated failure techniques
-is out of scope of this paper. For the purpose of studying recovery procedeures
+is out of scope of this paper. For the purpose of studying recovery procedures
 upon node failure this simple approach is sufficient.
 
 Consequently, any kernel which resided on the failed node is considered failed,
@@ -91,8 +91,8 @@ and failure recovery procedure is triggered. Depending on the position of the
 kernel in kernel hierarchy recovery is carried out on the node where parent or
 one of the subordinate kernels resides. Recovery procedure for failed
 subordinate kernel is re-execution of this kernel on a healthy node, which is
-triggered automatically by the node where ots parent kernel is located. If the
-subordinate communiates with other subordinates of the same parent kernel and
+triggered automatically by the node where its parent kernel is located. If the
+subordinate communicates with other subordinates of the same parent kernel and
 one of them fails, all kernels as well as their parent are considered failed,
 and a copy of the parent is re-executed on a healthy node. If parent kernel
 fails, then its copy, which is sent along with every subordinate on other
@@ -159,9 +159,10 @@ When a principal fails every subordinate has its copy, but we need to restore
 it only once and only on one node to correctly continue programme execution.
 To ensure that the principal is restored only once, each subordinate tries to
 find the first surviving node from the IP address list of neighbours
-(Figure~\ref{fig:principal-fails}). If such node is online, the search stops
-and the subordinate kernel is deleted. If the node is not found, the
-subordinate restores the principal from the copy and deletes itself. Kernel
+(Figure~\ref{fig:principal-fails}
+and~\ref{fig:subordinate-and-principal-fail}). If such node is online, the
+search stops and the subordinate kernel is deleted. If the node is not found,
+the subordinate restores the principal from the copy and deletes itself. Kernel
 deletion is necessary, because the whole computational step, modelled by the
 principal, is re-executed from the principal initial state, and there is no
 simple and reliable way of taking into account the results which was produced
@@ -181,15 +182,22 @@ so far by the subordinates.
   \label{fig:principal-fails}
 \end{figure}
 
+\begin{figure}
+  \centering
+  \includegraphics{sc2}
+  \caption{Simultaneous failure of a subordinate and its principal.}
+  \label{fig:subordinate-and-principal-fail}
+\end{figure}
+
 \paragraph*{Scenario~2}
 In comparison to the first scenario, the second one is more complicated, but
 also more frequent. While on kernel hierarchy the system acts the same as in
 the first scenario, when we move to daemon hierarchy one more possible variant
 is added. In deep kernel hierarchy a kernel may act as a subordinate and as a
 principal at the same time. Thus, we need to copy not only direct principal of
 each subordinate kernel, but also all principals
-higher in the hierarchy recursively (Figure~\ref{fig:sc2} and~\ref{fig:sc3}).
-So, the additional variant is a generalisation of the two previous ones for
-deep kernel hierarchies.
+higher in the hierarchy recursively (Figure~\ref{fig:sc3}). So, the additional
+variant is a generalisation of the two previous ones for deep kernel
+hierarchies.
 
 Handling principal failure in a deep kernel hierarchy may involve a lot of
 overhead, because its failure is detected only when a subordinate finishes its
@@ -206,13 +214,6 @@ a file, whereas subordinates only process separate dataset parts.
 
 \begin{figure}
   \centering
-  \includegraphics{sc2}
-  \caption{Simultaneous failure of a subordinate and its principal.}
-  \label{fig:sc2}
-\end{figure}
-
-\begin{figure}
-  \centering
   \includegraphics{sc3}
   \caption{Simultaneous failure of two principals.}
   \label{fig:sc3}
@@ -241,8 +242,6 @@ as the follow.
   alive kernels and re-executes them.
 \end{itemize}
-
-
 Recovery from a failure of all nodes is the most inefficient, because it
 involves the use of persistent storage and there is no reliable way to ensure
 that all cluster nodes have been bootstrapped. If some nodes were not
@@ -399,10 +398,10 @@ typically set to 2 or 3 depending on the particular site.
 We believe that there is no need to set number of object copies more than
 that, as it allows to tolerate simultaneous failure of 2 and 3 nodes
 respectively: it should be more than enough to tolerate node failures which
 are common at large
-scale~\cite{mean-time-between-failures-darpa-TODO}. So, using arrays with
-linear search complexity is more efficient than maps and sets, because the
-number of elements in them is small, and linear search takes less time than
-fixed time hash-based lookup.
+scale~\cite{schroeder2007understanding}. So, using arrays with linear search
+complexity is more efficient than maps and sets, because the number of elements
+in them is small, and linear search takes less time than fixed time hash-based
+lookup.
 
 Transmitting IP addresses of previous nodes is an optimisation over mapping to
 only linear hierarchies, that is hierarchies where only one subordinate is
diff --git a/src/tail.tex b/src/tail.tex
@@ -52,21 +52,21 @@ kernels upon a failure.
 
 \section{Conclusion}
 In the paper we propose a system architecture consisting of two tree
-hierarchies of entitites, mapped on each other, that simplifies provision of
+hierarchies of entities, mapped on each other, that simplifies provision of
 resilience to failures for parallel programmes. The resilience is solely
-provided by the use of hierarchical dependencies between entitites, and is
+provided by the use of hierarchical dependencies between entities, and is
 independent on each layer of the system. To optimise handling failure of
 multiple cluster nodes, we use the hierarchy implied by the order of creation
-of subordinate entitities. The hierarchical approach to fault tolerance is
+of subordinate entities. The hierarchical approach to fault tolerance is
 efficient, scales to a large number of cluster nodes, and requires slow I/O
 operations only for the most disastrous scenario~--- simultaneous failure of
 all cluster nodes.
 
-The future work is to standardase application programming interface of the
+The future work is to standardise application programming interface of the
 system and investigate load-balancing techniques, which are optimal for a
 programme composed of many computational kernels.
 
-\section*{Acknowledgment}
+\section*{Acknowledgement}
 The research was carried out using computational resources of Resource Centre
 ``Computational Centre of Saint Petersburg State University'' (\mbox{T-EDGE96}
 \mbox{HPC-0011828-001}) within frameworks of grants of Russian Foundation for