hpcs-17-subord

git clone https://git.igankevich.com/hpcs-17-subord.git

commit 8d16b2b3249cbe96592843be388781caa4985cb1
parent f717550ed77b75bce20fd9f00898b0809167c5e6
Author: Ivan Gankevich <igankevich@ya.ru>
Date:   Fri, 24 Mar 2017 17:58:18 +0300

Reorder sections.

- Move related work to the end of the paper.
- Move computational kernel hierarchy to the main part.
- Move paragraphs from discussion to the main part.

Diffstat:
src/body.tex | 109++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------
src/head.tex | 99-------------------------------------------------------------------------------
src/tail.tex | 51+++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 131 insertions(+), 128 deletions(-)

diff --git a/src/body.tex b/src/body.tex
@@ -1,25 +1,83 @@
-\section{Cluster scheduler architecture}
-
-\subsection{Overview}
-Our framework has layered architecture:
-\begin{itemize}
-
- \item \textit{Physical layer.} Consists of nodes and direct/routed network
- links.
-
- \item \textit{Daemon layer.} Consists of daemon processes residing on
- cluster nodes and hierarchical (master/slave) links between them.
-
- \item \textit{Kernel layer.} Consists of kernels and hierarchical
- (parent/child) links between them.
-
-\end{itemize}
-
-Master and slave roles are dynamically assigned to daemon processes, any
-physical cluster node may become master or slave. Dynamic reassignment uses
-leader election algorithm that does not require periodic broadcasting of
-messages, and the role is derived from node's IP address. Detailed explanation
-of the algorithm is provided in [hpcs-2015-paper].
+\section{System architecture}
+
+Our model of a computer system has a layered architecture:
+
+\paragraph{Physical layer} Consists of nodes and direct/routed physical
+network links. On this layer full network connectivity, i.e.~the ability to send a
+packet from one cluster node to any other, is assumed.
+
+\paragraph{Daemon layer} Consists of daemon processes residing on cluster nodes
+and hierarchical (master/slave) logical links between them. Master and slave
+roles are dynamically assigned to daemon processes; any physical cluster node
+may become a master or a slave. Dynamic reassignment uses a leader election
+algorithm that does not require periodic broadcasting of messages, and the role
+is derived from the node's IP address. A detailed explanation of the algorithm is
+provided in~\cite{gankevich2015subordination}. Its strengths are scalability to
+a large number of nodes and low overhead, which are essential for large-scale
+high-performance computations, and its weakness is the artificial dependence of a
+node's position in the hierarchy on its IP address, which is not desirable in
+virtual environments, where nodes' IP addresses may change without notice.
+
+The only purpose of the daemon hierarchy is to provide load balancing and an
+automatically reconfigurable logical tree hierarchy of cluster nodes. This
+hierarchy is used to distribute the load from the current node to its
+neighbours by simply iterating over all directly connected daemons. Upon
+reconfiguration due to a node failure or a new node joining the cluster,
+daemons exchange messages telling each other how many daemons are ``behind''
+them in the hierarchy. This information is used to distribute the load evenly,
+even if a parallel programme is launched on a slave node. In addition, this
+topology reduces the number of simultaneous connections, thus preventing
+network overload.
+
+\paragraph{Kernel layer} Consists of kernels and hierarchical (parent/child)
+logical links between them. The only purpose of the kernel hierarchy is to provide
+failover for kernels.
+
+The framework provides classes and methods to simplify development of
+distributed applications and middleware. The focus is to make a distributed
+application resilient to failures, i.e.~make it fault tolerant and highly
+available, and do it transparently to a programmer. All classes are divided
+into two layers: the lower layer consists of classes for single-node
+applications, and the upper layer consists of classes for applications that run
+on an arbitrary number of nodes. There are two kinds of tightly coupled
+entities in the framework~--- \emph{kernels} and \emph{pipelines}~--- which are
+used together to compose a~programme.
+
+Kernels implement control flow logic in their \Method{act} and \Method{react}
+methods and store the state of the current control flow branch. Domain-specific
+logic and state are implemented by a programmer. In the \Method{act} method some
+function is either sequentially computed or decomposed into subtasks
+(represented by another set of kernels) which are subsequently sent to a
+pipeline. In the \Method{react} method subordinate kernels that returned from the
+pipeline are processed by their parent. Calls to \Method{act} and
+\Method{react} methods are asynchronous and are made within threads spawned by
+a pipeline. For each kernel \Method{act} is called only once, and for multiple
+kernels the calls are done in parallel to each other, whereas the \Method{react}
+method is called once for each subordinate kernel, and all the calls are made
+in the same thread to prevent race conditions (for different parent kernels
+different threads may be used).
+
+Pipelines implement asynchronous calls to \Method{act} and \Method{react}, and
+try to make as many parallel calls as possible considering the concurrency of the
+platform (no.~of cores per node and no.~of nodes in a cluster). A~pipeline
+consists of a kernel pool, which contains all the subordinate kernels sent by
+their parents, and a thread pool that processes kernels in accordance with the
+rules outlined in the previous paragraph. A~separate pipeline exists for each
+compute device: there are pipelines for parallel processing, schedule-based
+processing (periodic and delayed tasks), and a proxy pipeline for processing of
+kernels on other cluster nodes.
+
+In principle, the kernel and pipeline machinery mirrors that of procedures and
+call stacks, with the advantage that kernel methods are called asynchronously
+and in parallel to each other. The stack, which ordinarily stores local
+variables, is modelled by fields of a kernel. The sequence of processor
+instructions before nested procedure calls is modelled by the \Method{act} method,
+and the sequence of processor instructions after the calls is modelled
+by the \Method{react} method. The procedure calls themselves are modelled
+by~constructing and sending subordinate kernels to the pipeline. Two methods
+are necessary because calls are asynchronous and one must wait until
+subordinate kernels complete their work. Pipelines allow circumventing active
+waiting, and call the correct kernel methods by analysing their internal state.

\subsection{Fault tolerance and high availability}

@@ -335,10 +393,3 @@ recursively duplicating parents on the subordinate node.

Only electricity outage requires writing data to disk
other failures can be mitigated by duplicating kernels in memory.
-
-The only purpose of kernel hierarchy is to provide fail over for kernels. The
-only purpose of daemon hierarchy is to provide load balancing and automatically
-reconfigurable topology and to reduce the number of simultaneous connections.
-This topology reduces the number of simultaneous connections, thus preventing
-network overload. This topology is used to distribute the load from the current
-node to its neighbours by simply iterating over all directly connected daemons.
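The daemon-layer text added above derives a node's role from its IP address, with the full election algorithm given in the cited paper. As a rough C++ illustration only (the enumeration of addresses and the fixed fan-out below are assumptions made for this sketch, not the published algorithm), the mapping from address to tree position could look like this:

// Illustration only: the exact mapping from IP address to tree position is
// defined in the cited paper; the enumeration and the fixed fan-out used here
// are assumptions made for this sketch.
#include <cstdint>
#include <iostream>

// Position of a host inside its subnet when the usable addresses are
// enumerated in increasing order (the network address itself is skipped).
std::uint32_t position(std::uint32_t addr, std::uint32_t network) {
    return addr - network - 1;
}

// With a fixed fan-out, position 0 acts as the master of the subnet and the
// principal of any other position p sits at (p - 1) / fanout.
std::uint32_t principal(std::uint32_t p, std::uint32_t fanout) {
    return p == 0 ? 0 : (p - 1) / fanout;
}

std::uint32_t ipv4(unsigned a, unsigned b, unsigned c, unsigned d) {
    return (a << 24) | (b << 16) | (c << 8) | d;
}

int main() {
    const std::uint32_t net = ipv4(192, 168, 0, 0);
    const std::uint32_t fanout = 2;
    for (unsigned host = 1; host <= 7; ++host) {
        std::uint32_t p = position(ipv4(192, 168, 0, host), net);
        std::cout << "192.168.0." << host << " -> position " << p
                  << ", principal at position " << principal(p, fanout) << '\n';
    }
    return 0;
}

Because every daemon evaluates the same pure function of addresses it already knows, no election messages need to be broadcast; the trade-off noted in the text is that a node's place in the tree is tied to its address.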
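The same hunk describes daemons exchanging counts of how many daemons are "behind" each neighbour and using these counts to spread the load evenly. Below is a minimal sketch of weight-proportional routing over the directly connected links; only the weights come from the text, the selection policy is an assumption:

// The weights ("how many daemons are behind a neighbour") come from the text;
// the deterministic proportional selection below is an assumed policy, not
// necessarily the one the framework implements.
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

struct Neighbour {
    std::string name;
    std::size_t behind;    // daemons reachable through this link
    std::size_t sent = 0;  // kernels already routed through this link
};

// Pick the link whose traffic lags most behind its weight, i.e. the one with
// the smallest sent/behind ratio (compared without floating point).
std::size_t pick(std::vector<Neighbour>& links) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < links.size(); ++i) {
        if (links[i].sent * links[best].behind <
            links[best].sent * links[i].behind) {
            best = i;
        }
    }
    ++links[best].sent;
    return best;
}

int main() {
    // The local node participates as a "link" with a single daemon behind it.
    std::vector<Neighbour> links = {{"local", 1}, {"slave-a", 3}, {"slave-b", 6}};
    for (int k = 0; k < 10; ++k) {
        std::cout << "kernel " << k << " -> " << links[pick(links)].name << '\n';
    }
    // Out of every 10 kernels, 1 stays local, 3 go to slave-a and 6 to slave-b.
    return 0;
}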
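The kernel and pipeline paragraphs moved into src/body.tex describe act() decomposing work into subordinate kernels and react() collecting the results. The following self-contained, single-threaded toy mimics that control flow; only the method names act() and react() come from the text, the class names, send() and the ownership scheme are invented here, and the real framework runs the calls asynchronously on thread pools with a separate pipeline per compute device:

// Single-threaded toy of the kernel/pipeline control flow described above.
#include <iostream>
#include <memory>
#include <queue>
#include <utility>

class Pipeline;

class Kernel {
public:
    virtual ~Kernel() = default;
    virtual void act(Pipeline& ppl) = 0;          // decompose or compute
    virtual void react(Pipeline&, Kernel&) {}     // collect a finished child
    Kernel* parent() const { return parent_; }
    void set_parent(Kernel* p) { parent_ = p; }
private:
    Kernel* parent_ = nullptr;   // the principal that spawned this kernel
};

class Pipeline {
public:
    void send(std::unique_ptr<Kernel> k) { pool_.push(std::move(k)); }
    void run() {
        while (!pool_.empty()) {
            std::unique_ptr<Kernel> k = std::move(pool_.front());
            pool_.pop();
            k->act(*this);
            if (k->parent() != nullptr) {
                k->parent()->react(*this, *k);   // "return" the child upstream
            }
        }
    }
private:
    std::queue<std::unique_ptr<Kernel>> pool_;   // the kernel pool
};

// A subordinate kernel: computes one partial sum sequentially in act().
class PartialSum : public Kernel {
public:
    PartialSum(int from, int to) : from_(from), to_(to) {}
    void act(Pipeline&) override {
        for (int i = from_; i < to_; ++i) { result_ += i; }
    }
    int result() const { return result_; }
private:
    int from_;
    int to_;
    int result_ = 0;
};

// The principal kernel: act() splits the work, react() gathers the pieces.
class Main : public Kernel {
public:
    void act(Pipeline& ppl) override {
        for (int i = 0; i < 4; ++i) {
            auto child = std::make_unique<PartialSum>(i * 25, (i + 1) * 25);
            child->set_parent(this);
            ++outstanding_;
            ppl.send(std::move(child));
        }
    }
    void react(Pipeline&, Kernel& child) override {
        total_ += static_cast<PartialSum&>(child).result();
        if (--outstanding_ == 0) {
            std::cout << "sum of 0..99 = " << total_ << '\n';  // prints 4950
        }
    }
private:
    int outstanding_ = 0;
    int total_ = 0;
};

int main() {
    Pipeline ppl;
    Main root;       // kept alive by main(), so subordinates can call back safely
    root.act(ppl);   // in the framework the pipeline itself would invoke act()
    ppl.run();
    return 0;
}

Main::act plays the role of the downstream decomposition step and Main::react the upstream collection step; the pipeline, much like a call stack, decides when each method may run.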
diff --git a/src/head.tex b/src/head.tex
@@ -12,105 +12,6 @@ parallel and sequential parts.
Using different fault tolerant scenarios based on hierarchy interactions, this
framework provides continuous execution of a parallel programme in case of
hardware errors or electricity outages.
-\subsection{Computational kernel hierarchy}
-
-The framework provides classes and methods to simplify development of
-distributed applications and middleware. The focus is to make distributed
-application resilient to failures, i.e.~make it fault tolerant and highly
-available, and do it transparently to a programmer. All classes are divided
-into two layers: the lower layer consists of classes for single node
-applications, and the upper layer consists of classes for applications that run
-on an arbitrary number of nodes. There are two kinds of tightly coupled
-entities in the framework~--- \emph{kernels} and \emph{pipelines}~--- which are
-used together to compose a~programme.
-
-Kernels implement control flow logic in theirs \Method{act} and \Method{react}
-methods and store the state of the current control flow branch. Domain-specific
-logic and state are implemented by a programmer. In~\Method{act} method some
-function is either sequentially computed or decomposed into subtasks
-(represented by another set of kernels) which are subsequently sent to a
-pipeline. In~\Method{react} method subordinate kernels that returned from the
-pipeline are processed by their parent. Calls to \Method{act} and
-\Method{react} methods are asynchronous and are made within threads spawned by
-a pipeline. For each kernel \Method{act} is called only once, and for multiple
-kernels the calls are done in parallel to each other, whereas \Method{react}
-method is called once for each subordinate kernel, and all the calls are made
-in the same thread to prevent race conditions (for different parent kernels
-different threads may be used).
-
-Pipelines implement asynchronous calls to \Method{act} and \Method{react}, and
-try to make as many parallel calls as possible considering concurrency of the
-platform (no.~of cores per node and no.~of nodes in a cluster). A~pipeline
-consists of a kernel pool, which contains all the subordinate kernels sent by
-their parents, and a thread pool that processes kernels in accordance with
-rules outlined in the previous paragraph. A~separate pipeline exists for each
-compute device: There are pipelines for parallel processing, schedule-based
-processing (periodic and delayed tasks), and a proxy pipeline for processing of
-kernels on other cluster nodes.
-
-In principle, kernels and pipelines machinery reflect the one of procedures and
-call stacks, with the advantage that kernel methods are called asynchronously
-and in parallel to each other. The stack, which ordinarily stores local
-variables, is modelled by fields of a kernel. The sequence of processor
-instructions before nested procedure calls is modelled by~\Method{act} method,
-and sequence of processor instructions after the calls is modelled
-by~\Method{react} method. The procedure calls themselves are modelled
-by~constructing and sending subordinate kernels to the pipeline. Two methods
-are necessary because calls are asynchronous and one must wait before
-subordinate kernels complete their work. Pipelines allow circumventing active
-wait, and call correct kernel methods by analysing their internal state.
-
-\subsection{Related work}
-
-The feature that distinguishes our research with respect to some others, is the
-use of hierarchy as the only possible way of defining dependencies between
-objects, into which a programme is decomposed. The main advantage of hierarchy
-is trivial handling of object failures.
-
-In~\cite{zuckerman2011using} the authors describe codelet model for exascale
-machines. This model breaks a programme into small bits of functionality,
-called codelets, and dependencies between them. The programme dataflow
-represents directed graph, which is called well-behaved if forward progress of
-the programme is guaranteed. In contrast to our model, in codelet model
-hierarchical dependencies are not enforced, and resilience to failures is
-provided by object migration and relies on hardware fault detection mechanisms.
-Furthermore, execution of kernel hierarchies in our model resembles
-stack-based execution of ordinary programmes: the programme finishes only when
-all subordinate kernels of the main kernel finish. So, there is no need to
-define well-behaved graph to guarantee programme termination.
-
-In~\cite{meneses2015using} the authors describe migratable objects model for
-parallel programmes. In the framework of this model a programme is decomposed
-into objects that may communicate with each other by sending messages, and can
-be migrated to any cluster node if desired. The authors propose several
-possibilities, how this model may enhance fault-tolerance techniques for
-Charm++/AMPI programmes: proactive fault detection, checkpoint/restart and
-message logging. In contrast to our model, migratable objects do not compose a
-hierarchy, but may exchange messages with any object address of which is known
-to the sender. A spanning tree of nodes is used to orchestrate collective
-operations between objects. This tree is similar to tree hierarchy of nodes,
-which is used in our work to distribute kernels between available cluster
-nodes, but we use this hierarchy for any operations that require distribution
-of work, rather than collective ones. Our model does not use techniques
-described in this paper to provide fault-tolerance: upon a failure we
-re-execute subordinate kernels and copy principal kernels to be able to
-re-execute them as well. Our approach blends checkpoint/restart and message
-logging: each kernel which is sent to other cluster node is saved (logged) in
-memory of the sender, and removed from the log upon return. Since subordinate
-kernels are allowed to communicate only with their principals (all other
-communication may happen only when physical location of the kernel is known, if
-the communication fails, then the kernel also fails to trigger recovery by the
-principal), a collection of all logs on each cluster nodes constitutes the
-current state of programme execution, which is used to restart failed kernels
-on the surviving nodes.
-
-To summarise, the feature that distinguishes our model with respect to models
-proposed for improving parallel programme fault-tolerance is the use of kernel
-hierarchy~--- an abstraction which defines strict total order on a set of
-kernels (their execution order) and, consequently, defines for each kernel a
-principal kernel, responsibility of which is to re-execute failed subordinate
-kernels upon a failure.
-
In this paper we present an algorithm that guarantees continuous execution of
a parallel programme upon failure of all nodes except one.
This algorithm is based on the one developed in previous
diff --git a/src/tail.tex b/src/tail.tex
@@ -1,3 +1,54 @@
+\subsection{Related work}
+
+The feature that distinguishes our research from some others is the
+use of hierarchy as the only possible way of defining dependencies between
+objects, into which a programme is decomposed. The main advantage of a hierarchy
+is trivial handling of object failures.
+
+In~\cite{zuckerman2011using} the authors describe the codelet model for exascale
+machines. This model breaks a programme into small bits of functionality,
+called codelets, and dependencies between them. The programme dataflow
+forms a directed graph, which is called well-behaved if forward progress of
+the programme is guaranteed. In contrast to our model, in the codelet model
+hierarchical dependencies are not enforced, and resilience to failures is
+provided by object migration and relies on hardware fault detection mechanisms.
+Furthermore, execution of kernel hierarchies in our model resembles
+stack-based execution of ordinary programmes: the programme finishes only when
+all subordinate kernels of the main kernel finish. So, there is no need to
+define a well-behaved graph to guarantee programme termination.
+
+In~\cite{meneses2015using} the authors describe the migratable objects model for
+parallel programmes. In the framework of this model a programme is decomposed
+into objects that may communicate with each other by sending messages, and can
+be migrated to any cluster node if desired. The authors propose several
+ways in which this model may enhance fault-tolerance techniques for
+Charm++/AMPI programmes: proactive fault detection, checkpoint/restart and
+message logging. In contrast to our model, migratable objects do not compose a
+hierarchy, but may exchange messages with any object whose address is known
+to the sender. A spanning tree of nodes is used to orchestrate collective
+operations between objects. This tree is similar to the tree hierarchy of nodes,
+which is used in our work to distribute kernels between available cluster
+nodes, but we use this hierarchy for any operations that require distribution
+of work, rather than collective ones. Our model does not use the techniques
+described in that paper to provide fault tolerance: upon a failure we
+re-execute subordinate kernels and copy principal kernels to be able to
+re-execute them as well. Our approach blends checkpoint/restart and message
+logging: each kernel which is sent to another cluster node is saved (logged) in
+memory of the sender, and removed from the log upon return. Since subordinate
+kernels are allowed to communicate only with their principals (all other
+communication may happen only when the physical location of the kernel is known; if
+the communication fails, then the kernel also fails to trigger recovery by the
+principal), the collection of all logs on all cluster nodes constitutes the
+current state of programme execution, which is used to restart failed kernels
+on the surviving nodes.
+
+To summarise, the feature that distinguishes our model from the models
+proposed for improving parallel programme fault-tolerance is the use of the kernel
+hierarchy~--- an abstraction which defines a strict total order on a set of
+kernels (their execution order) and, consequently, defines for each kernel a
+principal kernel, whose responsibility is to re-execute failed subordinate
+kernels upon a failure.
+
\section{Conclusion}

\section*{Acknowledgment}
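The related-work text added to src/tail.tex characterises the fault-tolerance scheme as a blend of checkpointing and message logging: a kernel sent to another node stays in the sender's memory until it returns. A hypothetical sketch of such a send-side log (all names and data structures here are invented for illustration):

// Sketch of the send-side log described above: a copy of every kernel sent to
// another node is kept in the sender's memory and erased when the subordinate
// returns; on node failure the logged kernels are simply re-sent.
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>

struct RemoteKernel {
    std::uint64_t id;
    std::string payload;   // stand-in for the serialised kernel state
};

class ProxyPipeline {
public:
    void send(const RemoteKernel& k, const std::string& node) {
        log_[k.id] = {k, node};
        std::cout << "send kernel " << k.id << " to " << node << '\n';
    }
    // Called when the subordinate kernel comes back from the remote node.
    void returned(std::uint64_t id) { log_.erase(id); }
    // Called when a node is detected to have failed: everything that was in
    // flight to that node is re-sent (here simply to a surviving node).
    void node_failed(const std::string& node, const std::string& survivor) {
        for (auto& [id, entry] : log_) {
            if (entry.node == node) {
                entry.node = survivor;
                std::cout << "re-send kernel " << id << " to " << survivor << '\n';
            }
        }
    }
private:
    struct Entry { RemoteKernel kernel; std::string node; };
    std::unordered_map<std::uint64_t, Entry> log_;
};

int main() {
    ProxyPipeline ppl;
    ppl.send({1, "subtask A"}, "node-2");
    ppl.send({2, "subtask B"}, "node-3");
    ppl.returned(1);                      // kernel 1 came back: drop it from the log
    ppl.node_failed("node-3", "node-2");  // kernel 2 is re-sent to a survivor
    return 0;
}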