git clone https://git.igankevich.com/hpcs-17-subord.git

commit bfd5b5e81514bb8357f0071868845c2fcd0ed714
parent 5fa5a12c1a5a10b39bb508d998732d50e68be6c5
Author: Ivan Gankevich <igankevich@ya.ru>
Date:   Mon, 15 May 2017 16:05:12 +0300

Add comparison to checkpoints/restart.

src/body.tex | 11 +++++++++++
1 file changed, 11 insertions(+), 0 deletions(-)

diff --git a/src/body.tex b/src/body.tex
@@ -493,3 +493,14 @@ survives, all programmes continue their execution in possibly degraded state.
 However it requires recursively duplicating principals and sending the along
 with the subordinates. Only electricity outage requires writing data to disk
 other failures can be mitigated by duplicating kernels in memory.
+
+The framework has not been compared to other similar approaches, because to the
+best of our knowledge there is no library/framework that provides resilience to
+simultaneous failure of more than one node (including master node), and
+comparison to checkpoint/restart approach would be unfair, as we do not stop
+all parallel processes of an application and dump RAM image to stable storage,
+but only copy kernels into memory of another node. This approach is far more
+efficient than checkpoint/restart as no data is written to disk, and only a
+small fraction of the whole memory occupied by the application is copied to the
+other node.
+