hpcs-17-subord

git clone https://git.igankevich.com/hpcs-17-subord.git

commit 27d412fd40a3ccc4322f412eec7f80ccfb85e8f2
parent 26fdcbce93b4664a873637f91265954650101a3d
Author: Ivan Gankevich <igankevich@ya.ru>
Date:   Fri, 24 Mar 2017 17:17:13 +0300

Revise and proof-read introduction and related work.

Diffstat:
src/head.tex | 63+++++++++++++++++++++++++++++++--------------------------------
1 file changed, 31 insertions(+), 32 deletions(-)

diff --git a/src/head.tex b/src/head.tex
@@ -1,15 +1,16 @@
 \section{Introduction}
 
-In large scale cluster environments node faults are common. In general this do
-not lead to global cluster malfunction, but it have huge impact on job running
-on faulty resources. Classical MPI programs will fail if any one of used nodes
-will broke. Today existed solutions mainly focused on making node checkpoints,
-but with increasing speed of computations it became less efficient. Our approach
-to make cluster computations reliable and efficient again is to use special
-framework focused on structuring source algorithm in strong hierarchy of
-parallel and sequential parts. Using different fault tolerant scenarios based on
-hierarchy interactions framework can provide continuous computations in case of
-hardware errors or electricity outages.
+In large scale cluster environments node failures are common. In general this
+does not lead to global cluster malfunction, but it has huge impact on job
+running on faulty resources. Classical MPI programmes fail, if any one of
+cluster nodes on which the programme is running fails. Today existing solutions
+mainly focus on making application checkpoints, but with increasing size of
+supercomputers and HPC clusters this approach becomes less efficient. Our
+approach to make cluster computations reliable and efficient is to use special
+framework focused on structuring parallel programme in strict hierarchy of
+parallel and sequential parts. Using different fault tolerant scenarios based
+on hierarchy interactions, this framework provides continuous execution of a
+parallel programme in case of hardware errors or electricity outages.
 
 The framework provides classes and methods to simplify development of
 distributed applications and middleware. The focus is to make distributed
@@ -59,8 +60,8 @@ wait, and call correct kernel methods by analysing their internal state.
 
 \section{Related work}
 
-The feature that distingueshes our research with respect to some others, is the
-use of hierarchy as the only possible way of defining depedencies between
+The feature that distinguishes our research with respect to some others, is the
+use of hierarchy as the only possible way of defining dependencies between
 objects, into which a programme is decomposed. The main advantage of hierarchy
 is trivial handling of object failures.
 
@@ -69,9 +70,9 @@ machines. This model breaks a programme into small bits of functionality,
 called codelets, and dependencies between them. The programme dataflow
 represents directed graph, which is called well-behaved if forward progress of
 the programme is guaranteed. In contrast to our model, in codelet model
-hierarchical depedencies are not enforced, and resilience to failures is
+hierarchical dependencies are not enforced, and resilience to failures is
 provided by object migration and relies on hardware fault detection mechanisms.
-Furthermore, execution of kernel hierarchiies in our model resembles
+Furthermore, execution of kernel hierarchies in our model resembles
 stack-based execution of ordinary programmes: the programme finishes only when
 all subordinate kernels of the main kernel finish. So, there is no need to
 define well-behaved graph to guarantee programme termination.
@@ -81,31 +82,29 @@ parallel programmes. In the framework of this model a programme is
 decomposed into objects that may communicate with each other by sending
 messages, and can be migrated to any cluster node if desired. The authors
 propose several possibilities, how this model may enhance fault-tolerance techniques for
-Charm++ programmes: proactive fault detection, checkpoint/restart and message
-logging. In contrast to our model, migratable objects do not compose a
+Charm++/AMPI programmes: proactive fault detection, checkpoint/restart and
+message logging. In contrast to our model, migratable objects do not compose a
 hierarchy, but may exchange messages with any object address of which is known
 to the sender. A spanning tree of nodes is used to orchestrate collective
 operations between objects. This tree is similar to tree hierarchy of nodes,
 which is used in our work to distribute kernels between available cluster
 nodes, but we use this hierarchy for any operations that require distribution
-of work, rather than collective ones, and collective operations are typically
-implemented as point-to-point communication between kernels address of which is
-known to each other. Our model does not use techniques described in this paper
-to provide fault-tolerance: upon a failure we re-execute subordinate kernels
-and copy principal kernels to be able to re-execute them as well. Our approach
-blends checkpoint/restart and message logging: each kernel which is sent to
-other cluster node is saved (logged) in memory of the sender, and removed from
-the log upon return. Since subordinate kernels are allowed to communicate only
-with their principals (all other communication may happen only when physical
-location of the kernel is known, if the communicaton fails, then the kernel
-also fails to trigger recovery by the principal), a collection of all logs on
-each cluster nodes consitutes the current state of programme execution, which
-is used to restart failed kernels on the surviving nodes.
+of work, rather than collective ones. Our model does not use techniques
+described in this paper to provide fault-tolerance: upon a failure we
+re-execute subordinate kernels and copy principal kernels to be able to
+re-execute them as well. Our approach blends checkpoint/restart and message
+logging: each kernel which is sent to other cluster node is saved (logged) in
+memory of the sender, and removed from the log upon return. Since subordinate
+kernels are allowed to communicate only with their principals (all other
+communication may happen only when physical location of the kernel is known, if
+the communication fails, then the kernel also fails to trigger recovery by the
+principal), a collection of all logs on each cluster nodes constitutes the
+current state of programme execution, which is used to restart failed kernels
+on the surviving nodes.
 
-To summarise, the feature that distiguishes our model with respect to models
+To summarise, the feature that distinguishes our model with respect to models
 proposed for improving parallel programme fault-tolerance is the use of kernel
 hierarchy~--- an abstraction which defines strict total order on a set of
 kernels (their execution order) and, consequently, defines for each kernel a
-principal kernel, responsibility of which is to re-executed failed subordinate
+principal kernel, responsibility of which is to re-execute failed subordinate
 kernels upon a failure.
-
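The last changed paragraph of the diff describes the sender-side log: every kernel sent to another cluster node is saved in the memory of the sender, erased from the log when the subordinate returns, and re-executed when the node that held it fails. The C++ sketch below is a minimal illustration of that bookkeeping only; it does not use the framework's real classes or method names (Kernel, Principal, send, collect and on_node_failure are hypothetical names introduced here).

// fault_tolerance_sketch.cpp -- illustrative sketch, not the framework's API.
// Every kernel "sent" to a node is logged in the principal's memory, removed
// when the subordinate returns, and re-executed if the node fails.
#include <iostream>
#include <map>
#include <memory>
#include <utility>
#include <vector>

// A subordinate kernel: a unit of work created and collected by its principal.
struct Kernel {
    int id = 0;
    long result = 0;
    void act() { result = static_cast<long>(id) * id; } // the kernel's payload
};

class Principal {
public:
    // "Send" a kernel to a node: log it locally; a real implementation would
    // also serialise it over the network.
    void send(int node, std::unique_ptr<Kernel> k) {
        log_[node].push_back(std::move(k));
    }

    // A subordinate returned from a node: remove it from the log.
    void collect(int node, int kernel_id) {
        auto& sent = log_[node];
        for (auto it = sent.begin(); it != sent.end(); ++it) {
            if ((*it)->id == kernel_id) { sent.erase(it); break; }
        }
    }

    // A node failed: re-execute every kernel still logged for that node
    // (here locally; a real run would re-send to a surviving node).
    void on_node_failure(int node) {
        for (auto& k : log_[node]) {
            k->act();
            std::cout << "re-executed kernel " << k->id
                      << " after failure of node " << node
                      << ", result = " << k->result << '\n';
        }
        log_.erase(node);
    }

private:
    // node id -> kernels sent to that node and not yet returned
    std::map<int, std::vector<std::unique_ptr<Kernel>>> log_;
};

int main() {
    Principal principal;
    // Send three subordinate kernels to two nodes.
    for (int i = 1; i <= 3; ++i) {
        auto k = std::make_unique<Kernel>();
        k->id = i;
        principal.send(/*node=*/i % 2, std::move(k));
    }
    principal.collect(/*node=*/1, /*kernel_id=*/1); // kernel 1 returned normally
    principal.on_node_failure(0);  // node 0 failed: its logged kernel is re-run
    return 0;
}

Because a subordinate kernel communicates only with its principal, the union of these per-sender logs is, as the diff puts it, the current state of programme execution, which is what allows failed kernels to be restarted on the surviving nodes.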