arma-thesis

git clone https://git.igankevich.com/arma-thesis.git

commit 64f58b56ab6144ab22f09d93f029012208c62aeb
parent d3526b62b74b34f33697b3274c1f2d6e99cd4500
Author: Ivan Gankevich <igankevich@ya.ru>
Date:   Tue, 24 Jan 2017 11:32:03 +0300

Copy English text.

Diffstat:
phd-diss-ru.org | 24++++++++++++------------
phd-diss.org | 481+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 493 insertions(+), 12 deletions(-)

diff --git a/phd-diss-ru.org b/phd-diss-ru.org

@@ -2494,7 +2494,7 @@ cite:lantz2010network,handigol2012reproducible,heller2013reproducible, where
#+RESULTS: fig:tree-hierarchy-11

*** Fail over algorithm
-**** Providing fault tolerance.
+**** Introduction.
Failures of the nodes of a distributed system can be divided into three types: failure of a subordinate node, failure of the master node, and simultaneous failure of all nodes (a power outage). For a job running on the cluster to

@@ -2741,17 +2741,17 @@ Java entails overhead and is not
To estimate the amount of time that is lost when one of the nodes fails, the total execution time of the programme with a failure can be divided by the execution time of the programme without failures but with one node less. This ratio
-is presented in [[fig:slowdown]]. The difference in performance in the case of a master node failure and a subordinate node failure is within 5\%, and in the case of a backup node failure within 50\% for fewer than 6 nodes\footnote{Measuring the difference for a larger number of nodes is pointless, because the programme finishes before the failure occurs.}. The 50\% difference is greater than $1/3$ of the execution time after which the failure occurs, but a backup node failure takes some time to be discovered by the other nodes. A node failure is detected only when a subordinate object finishes its execution and tries to return to its parent on the original node. Instant detection of a node failure requires stopping the execution of objects, which may not be applicable to programmes with complex logic.
+is presented in [[fig:slowdown]]. The difference in performance in the case of a master node failure and a subordinate node failure is within 5\%, and in the case of a backup node failure within 50\% for fewer than 6 nodes[fn::Measuring the difference for a larger number of nodes is pointless, because the programme finishes before the failure occurs.]. The 50\% difference is greater than $1/3$ of the execution time after which the failure occurs, but a backup node failure takes some time to be discovered by the other nodes. A node failure is detected only when a subordinate object finishes its execution and tries to return to its parent on the original node. Instant detection of a node failure requires stopping the execution of objects, which may not be applicable to programmes with complex logic.

#+name: fig:benchmark
#+begin_src R

diff --git a/phd-diss.org b/phd-diss.org

@@ -1956,9 +1956,490 @@ arma.plot_factory_vs_openmp_overlap(
** MPP implementation
*** Overview of distributed system architectures
*** Cluster node discovery algorithm
**** Introduction.
*** Fail over algorithm
**** Introduction.
Fault tolerance of data processing pipelines is one of the top concerns in the development of job schedulers for big data processing; however, most schedulers provide fault tolerance for subordinate nodes only. These types of failures are routinely mitigated by restarting the failed job or its part on healthy nodes, whereas failure of a master node is often considered either improbable or too complicated to handle and configure on the target platform. System administrators often find alternatives to application-level fault tolerance: they isolate the master node from the rest of the cluster by placing it on a dedicated machine, or they use virtualisation technologies instead. All of these alternatives complicate configuration and maintenance and, by decreasing the probability that a machine failure results in a whole-system failure, they increase the probability of a human error.

From this point of view it seems more practical to implement master node fault tolerance at the application level, but there is no generic implementation: most implementations are too closely tied to a particular application to be universally applicable. We believe that this happens because of the habit of thinking of a cluster as a collection of individual machines, each of which can be either a master or a slave, rather than as a whole, with master and slave roles assigned dynamically to particular physical machines.

This shift in thinking makes it possible to implement middleware that manages master and slave roles automatically and handles node failures in a generic way. Such software provides an API to distribute parallel tasks across the pool of available nodes. Using this API one can write an application that runs on a cluster without knowing the exact number of online nodes. The middleware works as a cluster operating system overlay on top of which distributed applications are written.

**** Related work.
Dynamic role assignment is an emerging trend in the design of distributed systems cite:ostrovsky2015couchbase,divya2013elasticsearch,boyer2012glusterfs,anderson2010couchdb,lakshman2010cassandra; however, it is still not used in big data job schedulers. For example, in the popular YARN job scheduler cite:vavilapalli2013yarn, which is used by the Hadoop and Spark big data analysis frameworks, master and slave roles are static. Failure of a slave node is tolerated by restarting a part of the job on a healthy node, and failure of a master node is tolerated by setting up a standby reserve server cite:murthy2011architecture. Both master servers are coordinated by the Zookeeper service, which itself uses dynamic role assignment to ensure its fault tolerance cite:okorafor2012zookeeper. So the whole setup is complicated because the Hadoop scheduler lacks dynamic roles: if dynamic roles were available, Zookeeper would be redundant in this setup. Moreover, this setup does not guarantee continuous operation of the master node, because the standby server needs time to recover the current state after a failure.

The same problem occurs in high-performance computing, where the master node of a job scheduler is a single point of failure. In cite:uhlemann2006joshua,engelmann2006symmetric the authors use replication to make the master node highly available, but the backup server role is assigned statically and cannot be delegated to a healthy worker node. This solution is closer to fully dynamic role assignment than the high-availability solutions of big data schedulers, because it does not rely on an external service to store the configuration, which would itself have to be highly available; however, it is far from the ideal solution in which roles are completely decoupled from physical servers.

Finally, the simplest form of master node high availability is implemented in the Virtual Router Redundancy Protocol (VRRP) cite:knight1998rfc2338,hinden2004virtual,nadas2010rfc5798. Although VRRP does provide master and backup node roles that are dynamically assigned to available routers, it works on top of the IPv4 and IPv6 protocols and is designed to be used by routers and reverse proxy servers. Such servers lack state that needs to be restored upon a failure (i.e.~there is no job queue in web servers), so it is easier for them to provide high availability. In Linux this protocol is implemented in the Keepalived routing daemon cite:cassen2002keepalived.

In contrast to web servers and to HPC and big data job schedulers, some distributed key-value stores and parallel file systems have a symmetric architecture in which master and slave roles are assigned dynamically, so that any node can act as a master when the current master node fails cite:ostrovsky2015couchbase,divya2013elasticsearch,boyer2012glusterfs,anderson2010couchdb,lakshman2010cassandra. This design decision simplifies management of and interaction with a distributed system. From the system administrator's point of view it is much simpler to install the same software stack on each node than to configure master and slave nodes manually; it is also much easier to bootstrap new nodes into the cluster and to decommission old ones. From the user's point of view, it is much simpler to provide web service high availability and load balancing when there are multiple backup nodes to connect to.

Dynamic role assignment would be beneficial for big data job schedulers because it decouples distributed services from physical nodes, which is the first step towards building a highly available distributed service. The reason there is no general solution to this problem is that there is no generic programming environment in which to write and execute distributed programmes. The aim of this work is to propose such an environment and to describe its internal structure.

The programming model used in this work is partly based on the well-known actor model of concurrent computation cite:agha1985actors,hewitt1973universal. Our model borrows the concept of an actor---an object that stores data and the methods to process it; this object can react to external events by changing its state or by producing more actors. We call these objects /computational kernels/. Their distinctive feature is a hierarchical dependence on the parent kernel that created each of them, which makes it possible to implement fault tolerance based on a simple restart of a failed subordinate kernel.
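
A minimal C++ sketch of this concept is given below. The names (=Kernel=, =act=, =react=, =run=) are illustrative and are not taken from the framework's API; the toy =run= function executes kernels sequentially in a single process, whereas the real framework spreads them across processor cores and cluster nodes.

#+begin_src C++
// Illustrative sketch of a computational kernel: an object that carries data
// and the code to process it, remembers the parent that created it, and
// reacts when one of its subordinates finishes. All names are hypothetical.
#include <iostream>
#include <memory>
#include <vector>

struct Kernel {
    Kernel* parent = nullptr;
    virtual ~Kernel() = default;
    virtual void act() = 0;          // called when the kernel is scheduled
    virtual void react(Kernel&) {}   // called when a subordinate returns
};

// Toy scheduler: run a kernel, then return it to its parent.
void run(Kernel& k) {
    k.act();
    if (k.parent) { k.parent->react(k); }
}

struct Partial_sum: public Kernel {
    int from, to, result = 0;
    Partial_sum(int a, int b): from(a), to(b) {}
    void act() override { for (int i = from; i < to; ++i) { result += i; } }
};

struct Main_kernel: public Kernel {
    int total = 0;
    std::vector<std::unique_ptr<Partial_sum>> subordinates;
    void act() override {
        // create subordinate kernels to solve the problem in parallel
        for (int i = 0; i < 4; ++i) {
            subordinates.emplace_back(new Partial_sum(i*100, (i+1)*100));
            subordinates.back()->parent = this;
            run(*subordinates.back());
        }
    }
    void react(Kernel& subordinate) override {
        total += static_cast<Partial_sum&>(subordinate).result;
    }
};

int main() {
    Main_kernel m;
    run(m);
    std::cout << "sum = " << m.total << '\n';
}
#+end_src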

Hierarchical dependence alone, however, is not enough to provide high availability of the master kernel---the first kernel of a parallel programme. To solve this problem, the other part of our programming model is based on the bulk-synchronous parallel model cite:valiant1990bridging. It borrows the concept of a superstep---a sequential step of a parallel programme; at any time a programme executes only one superstep, which makes it possible to provide high availability of the first kernel (under the assumption that it has only one subordinate at a time) by sending it along with its subordinate to a different cluster node, thus creating a distributed copy of it. Since the first kernel has only one subordinate at a time, its copy is always consistent with the original kernel. This eliminates the need for complex distributed transactions and distributed consensus algorithms, and it guarantees protection from at most one master node failure per superstep.

To summarise, the framework developed in this work protects a parallel programme from the failure of any number of subordinate nodes and from one failure of a master node per superstep. It does not answer the question of how to determine whether a node has failed; it assumes a failure when the network connection to a node is prematurely closed. In general, the presented research is in line with further development of the virtual supercomputer concept coined and evaluated in cite:vsc-csit2013,vsc-iccsa2014,vsc-nova.

**** Pipelining.
To infer a fault tolerance model suitable for big data applications we use the bulk-synchronous parallel model cite:valiant1990bridging as the basis. This model assumes that a parallel programme is composed of several sequential steps that are internally parallel, and that global synchronisation of all parallel processes occurs after each step. In our model all sequential steps are pipelined where possible. The evolution of the computational model is described as follows.

Given a programme that is sequential and large enough to be decomposed into several sequential steps, the simplest way to make it run faster is to exploit data parallelism. Usually this means finding multi-dimensional arrays and the loops that access their elements, and trying to parallelise those loops. After transforming several loops the programme still has the same number of sequential steps, but every step is (ideally) internally parallel.

After that, the only remaining way to speed up the programme is to overlap the execution of code blocks that work with different hardware devices. The most common pattern is to overlap computation with network I/O or disk I/O. This approach makes sense because the devices operate with little synchronisation, and issuing commands to them in parallel makes the whole programme perform better. This behaviour can be achieved by allocating a separate task queue for each device and submitting tasks to these queues asynchronously with respect to the main thread. After this optimisation the programme is composed of several steps chained into a pipeline, where each step is implemented as a task queue for a particular device.
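
The sketch below illustrates this idea: one queue per device, each served by its own thread, so that the main thread only submits tasks and continues execution. The class name =Device_queue= is made up for the illustration and does not refer to the framework's actual task queues.

#+begin_src C++
// One task queue per device (processor, disk, network), each drained by its
// own worker thread; commands to different devices are issued in parallel.
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

class Device_queue {
    std::queue<std::function<void()>> tasks;
    std::mutex mtx;
    std::condition_variable cv;
    bool done = false;
    std::thread worker{[this] { loop(); }};

    void loop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mtx);
                cv.wait(lock, [this] { return done || !tasks.empty(); });
                if (tasks.empty()) { return; } // finished and drained
                task = std::move(tasks.front());
                tasks.pop();
            }
            task();
        }
    }

public:
    void submit(std::function<void()> task) {
        { std::lock_guard<std::mutex> lock(mtx); tasks.push(std::move(task)); }
        cv.notify_one();
    }
    ~Device_queue() {
        { std::lock_guard<std::mutex> lock(mtx); done = true; }
        cv.notify_one();
        worker.join();
    }
};

int main() {
    Device_queue processor, disk, network;
    // the main thread submits tasks asynchronously and goes on
    processor.submit([] { std::cout << "generate next surface part\n"; });
    disk.submit([] { std::cout << "write previous part to disk\n"; });
    network.submit([] { std::cout << "send a kernel to another node\n"; });
}
#+end_src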

Pipelining of otherwise sequential steps is beneficial not only for code that accesses different devices, but also for code whose different branches are suitable for execution by multiple hardware threads of the same core, i.e. branches accessing different regions of memory or performing mixed (floating-point and integer) arithmetic. In other words, code branches that use different modules of the processor are good candidates to run in parallel on a processor core with multiple hardware threads.

Even though pipelining may not add parallelism for a programme that uses only one input file (or one set of input parameters), it does add parallelism when the programme processes multiple input files: each input generates tasks that travel through the whole pipeline in parallel with the tasks generated by other inputs. With a pipeline, an array of files is processed in parallel by the same set of resources allocated for a batch job, and possibly with greater efficiency on busy HPC clusters than executing a separate job for each input file, because the time that every job after the first spends in the queue is eliminated.

A computational model with a pipeline can be seen as a /bulk-asynchronous model/, because of the parallel nature of otherwise sequential execution steps. This model is the basis of the fault tolerance model developed here.

**** Hierarchy.
Although fault tolerance and high availability are different terms, in essence they describe the same property---the ability of a system to switch processing from a failed component to a live spare or backup component. In the case of fault tolerance it is the ability to switch from a failed slave node to a spare one, i.e. to repeat a computation step on a healthy slave node. In the case of high availability it is the ability to switch from a failed master node to a backup node with full restoration of the execution state. These are the core abilities that constitute a distributed system's ability to /fail over/.

The key feature that is missing in current parallel programming and big data processing technologies is the possibility to specify hierarchical dependencies between parallel tasks. When such a dependency exists, it is trivial to determine which task is responsible for re-executing a failed task on a healthy node. To re-execute the root of the hierarchy, a backup root task is created and executed on a different node. A number of engines are capable of executing directed acyclic graphs of tasks in parallel cite:acun2014charmpp,islam2012oozie, but graphs are ill-suited for inferring master-slave relationships between tasks, because a node in a graph may have multiple parent nodes.

This work is based on the results of previous research. In cite:gankevich2015subordination,gankevich2015iccsa we developed an algorithm that builds a tree hierarchy from a strictly ordered set of cluster nodes. The sole purpose of this hierarchy is to make a cluster more fault-tolerant by introducing multiple master nodes. If a master node fails, its subordinates try to connect to another node from the same or a higher level of the hierarchy; if there is no such node, one of the subordinates becomes the master. In cite:gankevich2015spec we developed a framework for big data processing without fault tolerance, and here this framework is combined with the fault tolerance techniques described in this work.
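
For illustration only: assuming that the nodes are numbered consecutively according to the strict order and that every level of the hierarchy is filled up to a fixed fan-out value, the principal (master) of a node can be computed from its position as in the sketch below. The exact algorithm is the one described in cite:gankevich2015subordination,gankevich2015iccsa.

#+begin_src C++
// Sketch: node 0 is the root; with fan-out f the principal of node i > 0 is
// node (i-1)/f. The function name and the numbering scheme are assumptions
// made for this illustration.
#include <iostream>

int principal(int node_index, int fanout) {
    return node_index == 0 ? -1 : (node_index - 1) / fanout;
}

int main() {
    const int fanout = 2, nodes = 7;
    for (int i = 0; i < nodes; ++i) {
        std::cout << "node " << i << " -> principal "
                  << principal(i, fanout) << '\n';
    }
}
#+end_src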

Each programme that runs on top of the tree hierarchy is composed of computational kernels---objects that contain data and the code to process it. To exploit parallelism a kernel may create an arbitrary number of subordinate kernels, which are automatically spread first across available processor cores and then across subordinate nodes in the tree hierarchy. The programme is itself a kernel (without a parent, as it is executed by the user), which either solves the problem sequentially on its own or creates subordinate kernels to solve it in parallel.

In contrast to HPC applications, in big data applications it is inefficient to run computational kernels on arbitrarily chosen nodes. A more practical approach is to bind every kernel to a file location in a parallel file system and to transfer the kernel to that location before processing the file. That way expensive data transfer is eliminated, and the file is always read from a local drive. This approach is more deterministic than existing ones; e.g. the MapReduce framework runs jobs on nodes that are ``close'' to the file location, but not necessarily on the exact node where the file is located cite:dean2008mapreduce. However, this approach is not without disadvantages: the scalability of a big data application is limited by the strategy employed to distribute its input files across cluster nodes, and the more nodes store input files, the higher the read performance. The advantage of our approach is that I/O performance is more predictable than that of a hybrid approach with streaming files over the network.

**** Fault tolerance.
**** Handling single node failures.
The basic strategy to overcome a failure of a subordinate node is to restart the corresponding kernels on a healthy node---the strategy employed in the Erlang language to restart failed subordinate processes cite:armstrong2003thesis. To implement it we record every kernel that is sent to a remote cluster node, and in the event of a node failure these kernels are simply rescheduled to other subordinate nodes with no special handling by the programmer. If there are no nodes to send kernels to, they are scheduled locally. So, in contrast to heavyweight checkpoint/restart machinery, the tree hierarchy allows automatic and transparent handling of subordinate node failures without restarting parallel processes on every node.
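
The essence of this bookkeeping is sketched below: every kernel sent to a remote node is recorded, and when the connection to that node is prematurely closed the recorded kernels are resent to the remaining nodes, or executed locally if no remote nodes are left. The names (=Remote_scheduler=, =on_node_failure=) are hypothetical, and the real implementation additionally routes finished kernels back to their parents.

#+begin_src C++
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Kernel { std::uint64_t id; };

class Remote_scheduler {
    std::map<std::string, std::vector<Kernel>> in_flight; // kernels per node
    std::vector<std::string> healthy_nodes;
    std::size_t next = 0;
public:
    explicit Remote_scheduler(std::vector<std::string> nodes)
        : healthy_nodes(std::move(nodes)) {}

    void send(const Kernel& k) {
        if (healthy_nodes.empty()) { run_locally(k); return; }
        const std::string node = healthy_nodes[next++ % healthy_nodes.size()];
        in_flight[node].push_back(k); // record before sending
        std::cout << "kernel " << k.id << " -> " << node << '\n';
    }

    // called when the connection to a node is prematurely closed
    void on_node_failure(const std::string& node) {
        auto it = in_flight.find(node);
        if (it == in_flight.end()) { return; }
        std::vector<Kernel> lost = std::move(it->second);
        in_flight.erase(it);
        healthy_nodes.erase(
            std::remove(healthy_nodes.begin(), healthy_nodes.end(), node),
            healthy_nodes.end());
        for (const Kernel& k : lost) { send(k); } // transparent rescheduling
    }

    void run_locally(const Kernel& k) {
        std::cout << "kernel " << k.id << " runs locally\n";
    }
};

int main() {
    Remote_scheduler scheduler({"node1", "node2"});
    for (std::uint64_t i = 1; i <= 4; ++i) { scheduler.send({i}); }
    scheduler.on_node_failure("node1"); // its kernels go to node2
}
#+end_src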

**** High availability.
A possible way of handling the failure of the node where the first kernel is located (the master node) is to replicate this kernel to a backup node and to make all updates to its state propagate to the backup node by means of a distributed transaction. This approach requires synchronisation between all nodes that execute subordinates of the first kernel and the node that holds the first kernel itself. When the node with the first kernel goes offline, the nodes with subordinate kernels must know which node is the backup one. However, if the backup node also goes offline in the middle of the execution of some subordinate kernel, it is impossible for this kernel to discover the next backup node to return to, because it has not yet discovered the unavailability of the master node. One can think of a consensus-based algorithm to ensure that subordinate kernels always know where the backup node is, but distributed consensus algorithms do not scale well to a large number of nodes and are not reliable cite:fischer1985impossibility. So a consensus-based approach does not play well with the asynchronous nature of computational kernels, as it may inhibit the scalability of a parallel programme.

Fortunately, the first kernel usually does not perform operations in parallel; rather, it launches execution steps sequentially one after another, so it has only one subordinate at a time. Such behaviour is described by the bulk-synchronous parallel programming model, in which a programme consists of sequential supersteps that are internally parallel cite:valiant1990bridging. Keeping this in mind, we can simplify the synchronisation of the first kernel's state: we send the first kernel along with its subordinate to the subordinate node. When the node with the first kernel fails, its copy receives the subordinate, and no execution time is lost. When the node with the copy fails, the subordinate is rescheduled on some other node, and in the worst case a whole step of computation is lost.

The described approach works only for kernels that do not have a parent and have only one subordinate at a time; such kernels act similarly to manually triggered checkpoints. The advantage is that they
- save results after each sequential step, when the memory footprint of the programme is low,
- save only the relevant data,
- and use the memory of a subordinate node instead of stable storage.
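
A schematic illustration of this mechanism is shown below: the subordinate of the current step travels to another node together with a copy of the first kernel, so that the receiving node can resume the programme from the current step if the master node fails. The types and field names are invented for the illustration and do not reflect the framework's actual message format.

#+begin_src C++
// Sketch: the first kernel has exactly one subordinate per step, so a
// consistent copy of it can be sent along with that subordinate. If the
// master node fails, the node holding the copy continues from the current
// step; if the node holding the copy fails, only the current step is lost.
#include <iostream>
#include <string>

struct First_kernel {
    int current_step = 0;
    std::string state = "initial";
};

struct Subordinate {
    int step;
    std::string result;
};

struct Message {
    Subordinate subordinate;
    First_kernel parent_copy; // replica of the first kernel
};

// executed on the node that received the message (the backup node)
void handle_master_failure(const Message& m) {
    First_kernel restored = m.parent_copy;
    std::cout << "master failed; resuming from step " << restored.current_step
              << " with state '" << restored.state << "'\n";
}

int main() {
    First_kernel first;
    first.current_step = 3;
    first.state = "surface parts 1-3 written";
    // the subordinate of step 3 is sent together with a copy of the first
    // kernel, so the two nodes always hold a consistent pair
    Message m{Subordinate{3, ""}, first};
    handle_master_failure(m);
}
#+end_src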

**** Implementation.
For efficiency reasons the fault tolerance techniques described above are implemented in a C++ framework: from the authors' perspective the C language is too low-level for distributed programmes, and Java incurs too much overhead and is not popular in the HPC community. To use the framework without a job scheduler, we need to implement a daemon that maintains the state of the hierarchy of nodes and exposes an API to interact with it. As of now, the framework runs in the same process as the parallel application that uses it. The framework is called Factory; it is currently in the proof-of-concept development stage.

**** Evaluation.
The Factory framework is evaluated on a physical cluster (Table [[tab:cluster]]) using the example of a hydrodynamics HPC application developed in cite:autoreg-stab,autoreg2011csit,autoreg1,autoreg2. This programme generates a wavy ocean surface using the ARMA model; its output is a set of files representing different parts of the realisation. From a computer scientist's point of view the application consists of a series of filters, each applied to the result of the previous one. Some of the filters are parallel, so the programme is written as a sequence of big steps, and some steps are made internally parallel to get better performance. In the programme only the most compute-intensive step (the surface generation) is executed in parallel across all cluster nodes; the other steps are executed in parallel across all cores of the master node.

#+name: tab:cluster
#+caption: Test platform configuration.
#+attr_latex: :booktabs t
| CPU                       | Intel Xeon E5440, 2.83 GHz |
| RAM                       | 4 GB                       |
| HDD                       | ST3250310NS, 7200 rpm      |
| No. of nodes              | 12                         |
| No. of CPU cores per node | 8                          |

The application was rewritten for the new version of the framework, which required only slight modifications to handle the failure of the node with the first kernel: the kernel was flagged so that the framework makes a replica and sends it to some subordinate node. There were no additional code changes other than modifying some parts to match the new API. So the tree hierarchy of kernels is a mostly non-intrusive model for providing fault tolerance, which demands only explicit marking of replicated kernels.

In a series of experiments we benchmarked the performance of the new version of the application in the presence of different types of failures (the numbers correspond to the graphs in Figure [[fig:benchmark]]):
- no failures,
- failure of a slave node (a node where a part of the wavy surface is generated),
- failure of a master node (the node where the first kernel is run),
- failure of a backup node (the node where a copy of the first kernel is stored).

A tree hierarchy with a fan-out value of 64 was chosen to make all cluster nodes connect directly to the first one. In each run the first kernel was launched on a different node to make the mapping of the kernel hierarchy to the tree hierarchy optimal. A victim node was taken offline a fixed amount of time after the programme start, equivalent to approximately $1/3$ of the total run time without failures on a single node. All relevant parameters are summarised in Table [[tab:benchmark]] (here ``root'' and ``leaf'' refer to a node's position in the tree hierarchy). The results of these runs were compared to the run without node failures (Figures [[fig:benchmark]]-[[fig:slowdown]]).

There is a considerable difference in performance between the different types of failures. Graphs 2 and 3 in Figure [[fig:benchmark]] show that performance in the case of a master or a slave node failure is the same. In the case of a master node failure the backup node stores a copy of the first kernel and uses this copy when it fails to connect to the master node. In the case of a slave node failure, the master node redistributes the load across the remaining slave nodes. In both cases the execution state is not lost and no time is spent restoring it, which is why performance is the same. Graph 4 in Figure [[fig:benchmark]] shows that performance in the case of a backup node failure is much lower. This happens because the master node stores only the current step of the computation plus some additional fixed amount of data, whereas the backup node not only stores a copy of this information but also executes this step in parallel with the other subordinate nodes. So, when the backup node fails, the master node executes the whole step once again on an arbitrarily chosen healthy node.

#+name: tab:benchmark
#+caption: Benchmark parameters.
#+attr_latex: :booktabs t
| Experiment no. | Master node | Victim node | Time to offline, s |
|----------------+-------------+-------------+--------------------|
|              1 | root        |             |                    |
|              2 | root        | leaf        |                 10 |
|              3 | leaf        | leaf        |                 10 |
|              4 | leaf        | root        |                 10 |

Finally, to measure how much time is lost due to a failure, we divide the total execution time with a failure by the total execution time without the failure but with one node less. The results of this calculation are obtained from the same benchmark and are presented in Figure [[fig:slowdown]]. The difference in performance in the case of master and slave node failures lies within a 5\% margin, and in the case of a backup node failure within a 50\% margin for fewer than 6 nodes[fn::Measuring this margin for a higher number of nodes does not make sense, since the time before the failure is greater than the total execution time with these numbers of nodes, and the programme finishes before the failure occurs.]. An increase in execution time of 50\% is more than $1/3$ of the execution time after which the failure occurs, but a backup node failure needs some time to be discovered: it is detected only when the subordinate kernel carrying the copy of the first kernel finishes its execution and tries to reach its parent. Instant detection would require abruptly stopping the subordinate kernel, which may be undesirable for programmes with complicated logic.
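
In symbols, the slowdown plotted in Figure [[fig:slowdown]] is defined as follows (the notation $T_{\mathrm{failure}}$ and $T_{\mathrm{nofail}}$ is introduced here only to make the definition explicit):
\[
  \mathrm{slowdown}(N) = \frac{T_{\mathrm{failure}}(N)}{T_{\mathrm{nofail}}(N-1)},
\]
where $T_{\mathrm{failure}}(N)$ is the total execution time on $N$ nodes in the presence of the failure and $T_{\mathrm{nofail}}(N-1)$ is the total execution time on $N-1$ nodes without failures.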

#+name: fig:benchmark
#+begin_src R
# TODO
#+end_src

#+caption: Performance of the hydrodynamics HPC application in the presence of node failures.
#+RESULTS: fig:benchmark

To summarise, the benchmark showed that /no matter whether a master or a slave node fails, the resulting performance roughly equals that of a failure-free run with one node less/; when a backup node fails, however, the performance penalty is much higher.

#+name: fig:slowdown
#+begin_src R
# TODO
#+end_src

#+caption: Slowdown of the hydrodynamics HPC application in the presence of different types of node failures compared to execution without failures but with the number of nodes minus one.
#+RESULTS: fig:slowdown

**** Discussion.
The described algorithm guarantees handling of one failure per computational step; more failures can be tolerated if they do not affect the master node. The system handles the simultaneous failure of all subordinate nodes; however, if the master and the backup node both fail, the application cannot survive. In this case the state of the current computation step is lost, and the only way to restore it is to restart the application.

Computational kernels are a means of abstraction that decouples a distributed application from the physical hardware: it does not matter how many nodes are online for the application to run successfully. Computational kernels eliminate the need to allocate a physical backup node to make the master node highly available: with the computational kernel approach any node can act as a backup. Finally, computational kernels handle subordinate node failures in a way that is transparent to the programmer.

The disadvantage of this approach is evident: there is no way to make existing middleware highly available without rewriting its source code. Although our programming framework is lightweight, it is not easy to map the architecture of existing middleware systems onto it: most systems are developed with static assignment of server/client roles in mind, which is not easy to make dynamic. Hopefully, our approach will simplify the design of future middleware systems.

The benchmark from the previous section shows that it is essential for a parallel application to have multiple sequential steps in order to be resilient to cluster node failures. Although the probability of a master node failure is lower than the probability of a failure of any of the slave nodes, this does not justify losing all the data when the programme run is near completion. In general, the more sequential steps an HPC application has, the smaller the performance penalty in the event of a master node failure, and the more parallel parts each step has, the smaller the penalty in the case of a slave node failure. In other words, /the more scalable an application is, the more resilient to node failures it becomes/.

In our experiments we manually specified where the programme starts its execution to make the mapping of the hierarchy of computational kernels onto the tree hierarchy of nodes optimal; however, this is not practical for a real-world cluster. The framework should perform this task automatically and distribute the load efficiently no matter whether the master node of an application is located at the root or at a leaf of the tree hierarchy: allocating the same node for the first kernel of every application deteriorates fault tolerance.

Although it may not be clear from the benchmarks, Factory does not only provide tolerance to node failures: new nodes automatically join the cluster and receive their portion of the load as soon as possible. This is a trivial process, as it does not involve restarting failed kernels or managing their state, so it is not presented in this work.

In theory, hierarchy-based fault tolerance can be implemented on top of a message-passing library without loss of generality. Although it would be complicated to reuse free nodes instead of failed ones, as the number of nodes is often fixed in such libraries, allocating a reasonably large number of nodes for the application would be enough to make it fault-tolerant. However, implementing hierarchy-based fault tolerance ``below'' the message-passing library does not seem beneficial, because it would require saving the state of the parallel application, which equals the total amount of memory it occupies on each host, and that would not make it more efficient than checkpoints.

The weak point of the proposed technology is the length of the period from the failure of the master node up to the moment when the failure is detected, the first kernel is restored, and a new subordinate kernel carrying the parent's copy is received by a subordinate node. If the backup node fails during this period, the execution state of the application is completely lost, and there is no way to recover it other than restarting the application from the beginning. The length of this dangerous period can be minimised, but the possibility of an abrupt programme stop cannot be fully eliminated. This result is consistent with the ``impossibility theory'' results, which prove the impossibility of distributed consensus with even one faulty process cite:fischer1985impossibility and the impossibility of reliable communication in the presence of node failures cite:fekete1993impossibility.

* Conclusion
* Acknowledgements
* List of acronyms and symbols