commit 8d16b2b3249cbe96592843be388781caa4985cb1
parent f717550ed77b75bce20fd9f00898b0809167c5e6
Author: Ivan Gankevich <igankevich@ya.ru>
Date: Fri, 24 Mar 2017 17:58:18 +0300
Reorder sections.
- Move related work to the end of the paper.
- Move computational kernel hierarchy to the main part.
- Move paragraphs from discussion to the main part.
Diffstat:
 src/body.tex | 109 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------
 src/head.tex |  99 -------------------------------------------------------------------------------
 src/tail.tex |  51 +++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 131 insertions(+), 128 deletions(-)
diff --git a/src/body.tex b/src/body.tex
@@ -1,25 +1,83 @@
-\section{Cluster scheduler architecture}
-
-\subsection{Overview}
-Our framework has layered architecture:
-\begin{itemize}
-
- \item \textit{Physical layer.} Consists of nodes and direct/routed network
- links.
-
- \item \textit{Daemon layer.} Consists of daemon processes residing on
- cluster nodes and hierarchical (master/slave) links between them.
-
- \item \textit{Kernel layer.} Consists of kernels and hierarchical
- (parent/child) links between them.
-
-\end{itemize}
-
-Master and slave roles are dynamically assigned to daemon processes, any
-physical cluster node may become master or slave. Dynamic reassignment uses
-leader election algorithm that does not require periodic broadcasting of
-messages, and the role is derived from node's IP address. Detailed explanation
-of the algorithm is provided in [hpcs-2015-paper].
+\section{System architecture}
+
+Our model of a computer system has a layered architecture:
+
+\paragraph{Physical layer} Consists of nodes and direct/routed physical
+network links. On this layer full network connectivity, i.e.~the ability to
+send a packet from one cluster node to any other node, is assumed.
+
+\paragraph{Daemon layer} Consists of daemon processes residing on cluster nodes
+and hierarchical (master/slave) logical links between them. Master and slave
+roles are dynamically assigned to daemon processes: any physical cluster node
+may become a master or a slave. Dynamic reassignment uses a leader election
+algorithm that does not require periodic broadcasting of messages; instead, the
+role is derived from the node's IP address. A detailed explanation of the
+algorithm is provided in~\cite{gankevich2015subordination}. Its strengths are
+scalability to a large number of nodes and low overhead, which are essential
+for large-scale high-performance computations; its weakness is the artificial
+dependence of a node's position in the hierarchy on its IP address, which is
+not desirable in virtual environments, where nodes' IP addresses may change
+without notice.
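+
+The following sketch illustrates how a daemon may determine its principal from
+its own position in the hierarchy; the enumeration of addresses and the fixed
+fan-out are simplifications made for illustration only, the actual algorithm is
+described in~\cite{gankevich2015subordination}.
+\begin{verbatim}
+// position --- index of the node's IP address in the sorted list
+// of subnetwork addresses; fanout --- max. no. of subordinates
+// per daemon. Returns the index of the principal node; the node
+// with the first address (index 0) is the master.
+unsigned principal(unsigned position, unsigned fanout) {
+    return position == 0 ? 0 : (position - 1) / fanout;
+}
+\end{verbatim}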
+
+The only purpose of the daemon hierarchy is to provide load balancing and an
+automatically reconfigurable logical tree hierarchy of cluster nodes. This
+hierarchy is used to distribute the load from the current node to its
+neighbours by simply iterating over all directly connected daemons. Upon
+reconfiguration due to node failure or due to a new node joining the cluster,
+daemons exchange messages telling each other how many daemons are ``behind''
+them in the hierarchy. This information is used to distribute the load evenly,
+even if a parallel programme is launched on a slave node. In addition, this
+topology reduces the number of simultaneous connections, thus preventing
+network overload.
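+
+The following sketch shows one way the ``behind'' counts may be used to choose
+a neighbour for the next kernel; the data structures and their names are
+simplifications made for illustration and do not correspond to the actual
+implementation.
+\begin{verbatim}
+#include <cstddef>
+#include <vector>
+
+struct Neighbour {
+    std::size_t weight = 1; // no. of daemons behind this link
+    std::size_t sent = 0;   // kernels sent over this link so far
+};
+
+// Pick the neighbour whose share of the sent kernels is the
+// smallest relative to the no. of daemons behind it, i.e. the
+// one with the smallest sent/weight ratio.
+std::size_t pick_neighbour(const std::vector<Neighbour>& links) {
+    std::size_t best = 0;
+    for (std::size_t i = 1; i < links.size(); ++i) {
+        if (links[i].sent * links[best].weight <
+            links[best].sent * links[i].weight) {
+            best = i;
+        }
+    }
+    return best;
+}
+\end{verbatim}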
+
+\paragraph{Kernel layer} Consists of kernels and hierarchical (parent/child)
+logical links between them. The only purpose of the kernel hierarchy is to
+provide fail-over for kernels.
+
+The framework provides classes and methods to simplify development of
+distributed applications and middleware. The focus is to make a distributed
+application resilient to failures, i.e.~make it fault tolerant and highly
+available, and to do it transparently to the programmer. All classes are
+divided into two layers: the lower layer consists of classes for single-node
+applications, and the upper layer consists of classes for applications that run
+on an arbitrary number of nodes. There are two kinds of tightly coupled
+entities in the framework~--- \emph{kernels} and \emph{pipelines}~--- which are
+used together to compose a~programme.
+
+Kernels implement control flow logic in their \Method{act} and \Method{react}
+methods and store the state of the current control flow branch. Domain-specific
+logic and state are implemented by a programmer. In the \Method{act} method
+some function is either sequentially computed or decomposed into subtasks
+(represented by another set of kernels) which are subsequently sent to a
+pipeline. In the \Method{react} method subordinate kernels that returned from
+the pipeline are processed by their parent. Calls to the \Method{act} and
+\Method{react} methods are asynchronous and are made within threads spawned by
+a pipeline. For each kernel \Method{act} is called only once, and for multiple
+kernels the calls are done in parallel to each other, whereas the \Method{react}
+method is called once for each subordinate kernel, and all the calls are made
+in the same thread to prevent race conditions (for different parent kernels
+different threads may be used).
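+
+The following C++ sketch shows the general shape of a kernel that sums an array
+by splitting it into two parts; the base class and the helper functions
+\texttt{upstream} and \texttt{commit} are simplified stand-ins introduced for
+illustration and do not reproduce the framework's actual API.
+\begin{verbatim}
+#include <vector>
+
+// Minimal stand-in for the framework's kernel base class.
+struct Kernel {
+    Kernel* parent = nullptr;
+    virtual ~Kernel() = default;
+    virtual void act() {}
+    virtual void react(Kernel*) {}
+};
+
+// Stand-ins for the framework's calls: here they run the child
+// synchronously to keep the sketch self-contained, whereas the
+// real calls send the kernel to a pipeline asynchronously.
+void upstream(Kernel* parent, Kernel* child) {
+    child->parent = parent;
+    child->act();
+}
+void commit(Kernel* k) {
+    if (k->parent) { k->parent->react(k); }
+}
+
+// Subordinate kernel: sums its own slice of the data in act().
+struct PartialSum: public Kernel {
+    std::vector<double> slice;
+    double result = 0;
+    explicit PartialSum(std::vector<double> s): slice(s) {}
+    void act() override {
+        for (double x : slice) { result += x; }
+        commit(this);
+    }
+};
+
+// Parent kernel: decomposes the work in act() and combines the
+// results of returned subordinates in react().
+struct Sum: public Kernel {
+    std::vector<double> data;
+    double result = 0;
+    int nparts = 0;
+    void act() override {
+        const auto half = data.begin() + data.size()/2;
+        upstream(this, new PartialSum({data.begin(), half}));
+        upstream(this, new PartialSum({half, data.end()}));
+        nparts = 2;
+    }
+    void react(Kernel* child) override {
+        result += static_cast<PartialSum*>(child)->result;
+        if (--nparts == 0) { commit(this); } // all parts returned
+    }
+};
+\end{verbatim}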
+
+Pipelines implement asynchronous calls to \Method{act} and \Method{react}, and
+try to make as many parallel calls as possible given the concurrency of the
+platform (no.~of cores per node and no.~of nodes in a cluster). A~pipeline
+consists of a kernel pool, which contains all the subordinate kernels sent by
+their parents, and a thread pool that processes kernels in accordance with the
+rules outlined in the previous paragraph. A~separate pipeline exists for each
+compute device: there are pipelines for parallel processing, schedule-based
+processing (periodic and delayed tasks), and a proxy pipeline for processing of
+kernels on other cluster nodes.
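+
+A much simplified sketch of a parallel pipeline is shown below (assuming the
+\texttt{Kernel} class from the previous sketch). It keeps only the kernel pool
+and the thread pool, and omits routing of returning kernels, the per-parent
+single-thread rule for \Method{react} and the other pipeline kinds; it is an
+illustration, not the actual implementation.
+\begin{verbatim}
+#include <condition_variable>
+#include <deque>
+#include <mutex>
+#include <thread>
+#include <vector>
+
+// Simplified parallel pipeline: a kernel pool protected by a
+// mutex and a thread pool that pops kernels and calls act().
+class Pipeline {
+public:
+    explicit Pipeline(unsigned nthreads) {
+        for (unsigned i = 0; i < nthreads; ++i) {
+            _threads.emplace_back([this] { this->loop(); });
+        }
+    }
+    ~Pipeline() {
+        {
+            std::lock_guard<std::mutex> lock(_mutex);
+            _stopped = true;
+        }
+        _cv.notify_all();
+        for (auto& t : _threads) { t.join(); }
+    }
+    void send(Kernel* k) {
+        {
+            std::lock_guard<std::mutex> lock(_mutex);
+            _kernels.push_back(k);
+        }
+        _cv.notify_one();
+    }
+private:
+    void loop() {
+        for (;;) {
+            std::unique_lock<std::mutex> lock(_mutex);
+            _cv.wait(lock, [this] {
+                return _stopped || !_kernels.empty(); });
+            if (_kernels.empty()) { return; } // stopped and drained
+            Kernel* k = _kernels.front();
+            _kernels.pop_front();
+            lock.unlock();
+            k->act(); // asynchronous call to the kernel's act()
+        }
+    }
+    std::deque<Kernel*> _kernels;      // the kernel pool
+    std::vector<std::thread> _threads; // the thread pool
+    std::mutex _mutex;
+    std::condition_variable _cv;
+    bool _stopped = false;
+};
+\end{verbatim}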
+
+In principle, the kernel and pipeline machinery mirrors that of procedures and
+call stacks, with the advantage that kernel methods are called asynchronously
+and in parallel to each other. The stack, which ordinarily stores local
+variables, is modelled by the fields of a kernel. The sequence of processor
+instructions before nested procedure calls is modelled by the \Method{act}
+method, and the sequence of processor instructions after the calls is modelled
+by the \Method{react} method. The procedure calls themselves are modelled by
+constructing and sending subordinate kernels to the pipeline. Two methods are
+necessary because the calls are asynchronous and one must wait until
+subordinate kernels complete their work. Pipelines make it possible to avoid
+active waiting, and to call the correct kernel methods by analysing the
+kernels' internal state.
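+
+For comparison, the summation from the earlier sketch written as an ordinary
+procedure is shown below; the comments mark which parts of it are modelled by
+\Method{act} and which by \Method{react}. The \texttt{partial\_sum} helper is a
+plain sequential function introduced only for this illustration.
+\begin{verbatim}
+#include <cstddef>
+#include <numeric>
+#include <vector>
+
+double partial_sum(const std::vector<double>& v,
+                   std::size_t first, std::size_t last) {
+    return std::accumulate(v.begin() + first,
+                           v.begin() + last, 0.0);
+}
+
+double sum(const std::vector<double>& data) {
+    const auto half = data.size() / 2;
+    // act(): decompose and "send" the subordinate calls
+    double a = partial_sum(data, 0, half);
+    double b = partial_sum(data, half, data.size());
+    // react(): combine the results of the returned calls
+    return a + b;
+}
+\end{verbatim}
+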
\subsection{Fault tolerance and high availability}
@@ -335,10 +393,3 @@ recursively duplicating parents on the subordinate node.
Only an electricity outage requires writing data to disk; other failures can be
mitigated by duplicating kernels in memory.
-
-The only purpose of kernel hierarchy is to provide fail over for kernels. The
-only purpose of daemon hierarchy is to provide load balancing and automatically
-reconfigurable topology and to reduce the number of simultaneous connections.
-This topology reduces the number of simultaneous connections, thus preventing
-network overload. This topology is used to distribute the load from the current
-node to its neighbours by simply iterating over all directly connected daemons.
diff --git a/src/head.tex b/src/head.tex
@@ -12,105 +12,6 @@ parallel and sequential parts. Using different fault tolerant scenarios based
on hierarchy interactions, this framework provides continuous execution of a
parallel programme in case of hardware errors or electricity outages.
-\subsection{Computational kernel hierarchy}
-
-The framework provides classes and methods to simplify development of
-distributed applications and middleware. The focus is to make distributed
-application resilient to failures, i.e.~make it fault tolerant and highly
-available, and do it transparently to a programmer. All classes are divided
-into two layers: the lower layer consists of classes for single node
-applications, and the upper layer consists of classes for applications that run
-on an arbitrary number of nodes. There are two kinds of tightly coupled
-entities in the framework~--- \emph{kernels} and \emph{pipelines}~--- which are
-used together to compose a~programme.
-
-Kernels implement control flow logic in theirs \Method{act} and \Method{react}
-methods and store the state of the current control flow branch. Domain-specific
-logic and state are implemented by a programmer. In~\Method{act} method some
-function is either sequentially computed or decomposed into subtasks
-(represented by another set of kernels) which are subsequently sent to a
-pipeline. In~\Method{react} method subordinate kernels that returned from the
-pipeline are processed by their parent. Calls to \Method{act} and
-\Method{react} methods are asynchronous and are made within threads spawned by
-a pipeline. For each kernel \Method{act} is called only once, and for multiple
-kernels the calls are done in parallel to each other, whereas \Method{react}
-method is called once for each subordinate kernel, and all the calls are made
-in the same thread to prevent race conditions (for different parent kernels
-different threads may be used).
-
-Pipelines implement asynchronous calls to \Method{act} and \Method{react}, and
-try to make as many parallel calls as possible considering concurrency of the
-platform (no.~of cores per node and no.~of nodes in a cluster). A~pipeline
-consists of a kernel pool, which contains all the subordinate kernels sent by
-their parents, and a thread pool that processes kernels in accordance with
-rules outlined in the previous paragraph. A~separate pipeline exists for each
-compute device: There are pipelines for parallel processing, schedule-based
-processing (periodic and delayed tasks), and a proxy pipeline for processing of
-kernels on other cluster nodes.
-
-In principle, kernels and pipelines machinery reflect the one of procedures and
-call stacks, with the advantage that kernel methods are called asynchronously
-and in parallel to each other. The stack, which ordinarily stores local
-variables, is modelled by fields of a kernel. The sequence of processor
-instructions before nested procedure calls is modelled by~\Method{act} method,
-and sequence of processor instructions after the calls is modelled
-by~\Method{react} method. The procedure calls themselves are modelled
-by~constructing and sending subordinate kernels to the pipeline. Two methods
-are necessary because calls are asynchronous and one must wait before
-subordinate kernels complete their work. Pipelines allow circumventing active
-wait, and call correct kernel methods by analysing their internal state.
-
-\subsection{Related work}
-
-The feature that distinguishes our research with respect to some others, is the
-use of hierarchy as the only possible way of defining dependencies between
-objects, into which a programme is decomposed. The main advantage of hierarchy
-is trivial handling of object failures.
-
-In~\cite{zuckerman2011using} the authors describe codelet model for exascale
-machines. This model breaks a programme into small bits of functionality,
-called codelets, and dependencies between them. The programme dataflow
-represents directed graph, which is called well-behaved if forward progress of
-the programme is guaranteed. In contrast to our model, in codelet model
-hierarchical dependencies are not enforced, and resilience to failures is
-provided by object migration and relies on hardware fault detection mechanisms.
-Furthermore, execution of kernel hierarchies in our model resembles
-stack-based execution of ordinary programmes: the programme finishes only when
-all subordinate kernels of the main kernel finish. So, there is no need to
-define well-behaved graph to guarantee programme termination.
-
-In~\cite{meneses2015using} the authors describe migratable objects model for
-parallel programmes. In the framework of this model a programme is decomposed
-into objects that may communicate with each other by sending messages, and can
-be migrated to any cluster node if desired. The authors propose several
-possibilities, how this model may enhance fault-tolerance techniques for
-Charm++/AMPI programmes: proactive fault detection, checkpoint/restart and
-message logging. In contrast to our model, migratable objects do not compose a
-hierarchy, but may exchange messages with any object address of which is known
-to the sender. A spanning tree of nodes is used to orchestrate collective
-operations between objects. This tree is similar to tree hierarchy of nodes,
-which is used in our work to distribute kernels between available cluster
-nodes, but we use this hierarchy for any operations that require distribution
-of work, rather than collective ones. Our model does not use techniques
-described in this paper to provide fault-tolerance: upon a failure we
-re-execute subordinate kernels and copy principal kernels to be able to
-re-execute them as well. Our approach blends checkpoint/restart and message
-logging: each kernel which is sent to other cluster node is saved (logged) in
-memory of the sender, and removed from the log upon return. Since subordinate
-kernels are allowed to communicate only with their principals (all other
-communication may happen only when physical location of the kernel is known, if
-the communication fails, then the kernel also fails to trigger recovery by the
-principal), a collection of all logs on each cluster nodes constitutes the
-current state of programme execution, which is used to restart failed kernels
-on the surviving nodes.
-
-To summarise, the feature that distinguishes our model with respect to models
-proposed for improving parallel programme fault-tolerance is the use of kernel
-hierarchy~--- an abstraction which defines strict total order on a set of
-kernels (their execution order) and, consequently, defines for each kernel a
-principal kernel, responsibility of which is to re-execute failed subordinate
-kernels upon a failure.
-
In this paper we present an algorithm that guarantees continuous execution of a
parallel programme upon failure of all nodes except one. This algorithm is
based on the one developed in previous
diff --git a/src/tail.tex b/src/tail.tex
@@ -1,3 +1,54 @@
+\subsection{Related work}
+
+The feature that distinguishes our research from other approaches is the use of
+hierarchy as the only possible way of defining dependencies between the objects
+into which a programme is decomposed. The main advantage of the hierarchy is
+trivial handling of object failures.
+
+In~\cite{zuckerman2011using} the authors describe the codelet model for
+exascale machines. This model breaks a programme into small bits of
+functionality, called codelets, and dependencies between them. The programme
+dataflow forms a directed graph, which is called well-behaved if forward
+progress of the programme is guaranteed. In contrast to our model, in the
+codelet model hierarchical dependencies are not enforced, and resilience to
+failures is provided by object migration and relies on hardware fault detection
+mechanisms. Furthermore, execution of kernel hierarchies in our model resembles
+stack-based execution of ordinary programmes: the programme finishes only when
+all subordinate kernels of the main kernel finish. So, there is no need to
+define a well-behaved graph to guarantee programme termination.
+
+In~\cite{meneses2015using} the authors describe the migratable objects model
+for parallel programmes. In the framework of this model a programme is
+decomposed into objects that may communicate with each other by sending
+messages, and can be migrated to any cluster node if desired. The authors
+propose several ways in which this model may enhance fault-tolerance techniques
+for Charm++/AMPI programmes: proactive fault detection, checkpoint/restart and
+message logging. In contrast to our model, migratable objects do not compose a
+hierarchy, but may exchange messages with any object whose address is known to
+the sender. A spanning tree of nodes is used to orchestrate collective
+operations between objects. This tree is similar to the tree hierarchy of
+nodes, which is used in our work to distribute kernels between available
+cluster nodes, but we use this hierarchy for any operations that require
+distribution of work, rather than for collective ones only. Our model does not
+use the techniques described in that paper to provide fault-tolerance: upon a
+failure we re-execute subordinate kernels and copy principal kernels to be able
+to re-execute them as well. Our approach blends checkpoint/restart and message
+logging: each kernel which is sent to another cluster node is saved (logged) in
+the memory of the sender, and removed from the log upon return. Since
+subordinate kernels are allowed to communicate only with their principals (all
+other communication may happen only when the physical location of the kernel is
+known; if the communication fails, then the kernel also fails to trigger
+recovery by the principal), the collection of all logs on all cluster nodes
+constitutes the current state of programme execution, which is used to restart
+failed kernels on the surviving nodes.
+
+To summarise, the feature that distinguishes our model from models proposed for
+improving the fault tolerance of parallel programmes is the use of a kernel
+hierarchy~--- an abstraction which defines a strict total order on a set of
+kernels (their execution order) and, consequently, defines for each kernel a
+principal kernel, whose responsibility is to re-execute failed subordinate
+kernels upon a failure.
+
\section{Conclusion}
\section*{Acknowledgment}