\section{Introduction}

Fault tolerance of data processing pipelines is one of the top concerns in the
development of job schedulers for big data processing; however, most schedulers
provide fault tolerance for subordinate nodes only. These types of failures are
routinely mitigated by restarting the failed job, or a part of it, on healthy
nodes, while failure of a master node is often considered either improbable or
too complicated to handle and configure on the target platform. System
administrators often find alternatives to application-level fault tolerance:
they isolate the master node from the rest of the cluster by placing it on a
dedicated machine, or they use virtualisation technologies instead. All of
these alternatives complicate configuration and maintenance, and by decreasing
the probability that a machine failure results in a whole-system failure, they
increase the probability of a human error.

From this point of view it seems more practical to implement master node fault
tolerance at the application level, but there is no generic implementation of
it. Most implementations are too tied to a particular application to become
universally applicable. We believe that this happens because of the habit of
thinking of a cluster as a collection of individual machines, each of which can
be either master or slave, rather than of a cluster as a whole, with master and
slave roles being \emph{dynamically} assigned to a particular physical machine.

This shift in thinking makes it possible to implement middleware that manages
master and slave roles automatically and handles node failures in a generic
way. The middleware provides an API to distribute parallel tasks over the pool
of available nodes (a schematic sketch of such an API is given at the end of
this section). Using this API one can write an application that runs on a
cluster without knowing the exact number of online nodes. The middleware is
implemented as a daemon, running on each cluster node, which acts as an
intermediate point of communication for distributed applications and
transparently routes application messages between operating system processes
running on different cluster nodes.

The rest of the paper is organised as follows. In Section 2 we discuss how
dynamic node role assignment is implemented in state-of-the-art middleware
systems and how hierarchical dependencies between kernels help establish rules
for restarting failed parts of a parallel application. In Section 3 we describe
the programming model based on kernels~--- small independent units of work~---
and an algorithm for handling master node failure. In Sections 4 and 5 we
present and discuss the results of experiments that show the validity of the
proposed approach, and we conclude the paper in Section 6.

The search for new fault-tolerance methods is motivated by the fact that the
size of large-scale computing systems (clusters and supercomputers) approaches
a critical point, where the number of nodes is so large that the probability of
all nodes working simultaneously without a failure tends to nought. To
illustrate, if a single node survives a long-running job with probability
$0.9999$, then for a system of $10^4$ nodes the probability that no node fails
during the job is only $0.9999^{10^4} \approx e^{-1} \approx 0.37$. In other
words, in future large-scale systems it is highly probable that a parallel
application experiences a node failure during its execution, and tolerating
this failure transparently and without checkpoints will increase the
performance of future parallel applications.
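To make the programming model concrete, the following is a minimal
single-process sketch of what such a kernel API might look like. All names
(\texttt{Kernel}, \texttt{upstream}, \texttt{commit}) and the in-memory task
queue are illustrative assumptions, not the middleware's actual interface; the
real system routes kernels between processes on different cluster nodes.

\begin{verbatim}
#include <iostream>
#include <queue>

// Hypothetical kernel base class: act() starts the computation,
// react() collects the results of subordinate kernels.
struct Kernel {
    Kernel* parent = nullptr;
    virtual ~Kernel() = default;
    virtual void act() {}
    virtual void react(Kernel*) {}
};

// In the real middleware the task queue is distributed across
// cluster nodes; a single in-memory queue stands in for it here.
static std::queue<Kernel*> task_queue;

// Schedule a subordinate kernel on behalf of its parent.
void upstream(Kernel* parent, Kernel* child) {
    child->parent = parent;
    task_queue.push(child);
}

// Return a finished kernel to its parent and destroy it.
void commit(Kernel* k) {
    if (k->parent) { k->parent->react(k); }
    delete k;
}

// Subordinate kernel: sums a range of integers.
struct PartialSum : Kernel {
    int from, to, result = 0;
    PartialSum(int a, int b) : from(a), to(b) {}
    void act() override {
        for (int i = from; i < to; ++i) { result += i; }
        commit(this);
    }
};

// Principal kernel: splits the work into parts without knowing
// the number of online nodes and combines the partial results.
struct Main : Kernel {
    int total = 0, pending = 4;
    void act() override {
        for (int i = 0; i < 4; ++i) {
            upstream(this, new PartialSum(i * 25, (i + 1) * 25));
        }
    }
    void react(Kernel* child) override {
        total += static_cast<PartialSum*>(child)->result;
        if (--pending == 0) {
            std::cout << "sum = " << total << "\n";  // sum = 4950
        }
    }
};

int main() {
    Main m;
    m.act();
    while (!task_queue.empty()) {
        Kernel* k = task_queue.front();
        task_queue.pop();
        k->act();
    }
    return 0;
}
\end{verbatim}

The sketch prints \texttt{sum = 4950}; the point is that \texttt{Main} never
refers to the number of nodes, which is exactly the property the middleware
API provides to applications.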
\section{Related work}

Dynamic role assignment is an emerging trend in the design of distributed
systems~\citep{ostrovsky2015couchbase,divya2013elasticsearch,boyer2012glusterfs,anderson2010couchdb,lakshman2010cassandra};
however, it is still not used in big data job schedulers. For example, in the
popular YARN~\citep{vavilapalli2013yarn} and Spark~\citep{zaharia2012resilient}
job schedulers, which are used by the Hadoop and Spark big data analysis
frameworks, master and slave roles are static. Failure of a slave node is
tolerated by restarting a part of a job on a healthy node, and failure of a
master node is tolerated by setting up a reserve standby
server~\citep{murthy2011architecture}. Both master servers are coordinated by
the Zookeeper service, which itself uses dynamic role assignment to ensure its
fault tolerance~\citep{okorafor2012zookeeper}. The whole setup is thus
complicated by the Hadoop scheduler lacking dynamic roles: if dynamic roles
were available, Zookeeper would be redundant in this setup. Moreover, this
setup does not guarantee continuous operation of the master node, because the
standby server needs time to recover the current state after a failure.

The same problem occurs in high-performance computing, where the master node of
a job scheduler is a single point of failure.
In~\citep{uhlemann2006joshua,engelmann2006symmetric} the authors use
replication to make the master node highly available, but the backup server
role is assigned statically and cannot be delegated to a healthy worker node.
This solution is closer to fully dynamic role assignment than the
high-availability solution for big data schedulers, because it does not rely on
an external configuration-storage service which itself must be highly
available; however, it is still far from an ideal solution in which roles are
completely decoupled from physical servers.

Finally, the simplest form of master node high availability is implemented in
the Virtual Router Redundancy Protocol
(VRRP)~\citep{knight1998rfc2338,hinden2004virtual,nadas2010rfc5798}. Although
VRRP does provide master and backup node roles that are dynamically assigned to
available routers, it works on top of the IPv4 and IPv6 protocols and is
designed for routers and reverse proxy servers. Such servers lack state that
needs to be restored upon a failure (i.e.~there is no job queue in web
servers), so it is easier for them to provide high availability. On Linux the
protocol is implemented in the Keepalived routing
daemon~\citep{cassen2002keepalived}.

In contrast to web servers and HPC and Big Data job schedulers, some
distributed key-value stores and parallel file systems have a symmetric
architecture, in which master and slave roles are assigned dynamically, so that
any node can act as a master when the current master node
fails~\citep{ostrovsky2015couchbase,divya2013elasticsearch,boyer2012glusterfs,anderson2010couchdb,lakshman2010cassandra}.
This design decision simplifies management of and interaction with a
distributed system. From a system administrator's point of view, it is much
simpler to install the same software stack on each node than to configure
master and slave nodes manually. Additionally, it is much easier to bootstrap
new nodes into the cluster and decommission old ones.
From a user's point of view, it is much simpler to provide web service high
availability and load balancing when there are multiple backup nodes to
connect to.

Dynamic role assignment would be beneficial for Big Data job schedulers because
it decouples distributed services from physical nodes, which is the first step
towards building a highly available distributed service. The reason that there
is no general solution to this problem is that there is no generic programming
environment in which to write and execute distributed programmes. The aim of
this work is to propose such an environment and to describe its internal
structure.

The programming model used in this work is partly based on the well-known actor
model of concurrent computation~\citep{agha1985actors,hewitt1973universal}. Our
model borrows the concept of an actor~--- an object that stores data and
methods to process it; this object can react to external events by either
changing its state or producing more actors. We call these objects
\emph{computational kernels}. Their distinct feature is a hierarchical
dependence on the parent kernel that created them, which makes it possible to
implement fault tolerance by restarting a failed subordinate kernel.

However, hierarchical dependence alone is not enough to provide high
availability of the master kernel~--- the first kernel of a parallel programme.
To solve this problem, the other part of our programming model is based on the
bulk-synchronous parallel model~\citep{valiant1990bridging}. It borrows the
concept of a superstep~--- a sequential step of a parallel programme; at any
time a programme executes only one superstep. This makes it possible to provide
high availability of the first kernel (under the assumption that it has only
one subordinate at a time) by sending it along with its subordinate to a
different cluster node, thus making a distributed copy of it. Since the first
kernel has only one subordinate at a time, its copy is always consistent with
the original kernel. This eliminates the need for complex distributed
transactions and distributed consensus algorithms and guarantees protection
from at most one master node failure per superstep.

To summarise, the framework developed in this paper protects a parallel
programme from failure of any number of subordinate nodes and from one failure
of a master node per superstep (the sketch at the end of this section
illustrates the copy mechanism behind the latter guarantee). The paper does not
answer the question of how to determine whether a node has failed; it assumes
that a node has failed when the network connection to it is prematurely closed.
In general, the presented research goes in line with the further development of
the virtual supercomputer concept coined and evaluated
in~\citep{vsc-csit2013,vsc-iccsa2014,vsc-nova}.
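The following is a minimal single-process sketch of the distributed-copy
mechanism described above. The types and function names (\texttt{Envelope},
\texttt{send\_to\_remote\_node}, \texttt{recover\_principal}) are illustrative
assumptions rather than the framework's API, and network transport and
serialisation are deliberately omitted.

\begin{verbatim}
#include <iostream>
#include <memory>
#include <utility>

// Illustrative kernel base class; clone() stands in for
// serialisation of the kernel state.
struct Kernel {
    virtual ~Kernel() = default;
    virtual std::unique_ptr<Kernel> clone() const = 0;
};

// A subordinate sent to a remote node carries a copy of the
// principal (first) kernel, forming a distributed copy of it.
struct Envelope {
    std::unique_ptr<Kernel> subordinate;
    std::unique_ptr<Kernel> principal_copy;
};

// Because the principal kernel has only one subordinate per
// superstep, the copy taken here is always consistent with
// the original.
Envelope send_to_remote_node(const Kernel& principal,
                             std::unique_ptr<Kernel> subordinate) {
    return Envelope{std::move(subordinate), principal.clone()};
}

// If the master node fails, the remote node that holds the
// subordinate restores the principal from the carried copy,
// without checkpoints or distributed consensus.
std::unique_ptr<Kernel> recover_principal(Envelope& e) {
    return std::move(e.principal_copy);
}

// Concrete principal kernel whose only state is the number
// of the current superstep.
struct Principal : Kernel {
    int superstep = 0;
    std::unique_ptr<Kernel> clone() const override {
        return std::make_unique<Principal>(*this);
    }
};

int main() {
    Principal master;
    master.superstep = 3;
    Envelope e = send_to_remote_node(
        master, std::make_unique<Principal>());
    // Simulate failure of the master node: promote the copy.
    std::unique_ptr<Kernel> restored = recover_principal(e);
    std::cout << "restored superstep = "
              << static_cast<Principal*>(restored.get())->superstep
              << "\n";  // prints: restored superstep = 3
    return 0;
}
\end{verbatim}

The sketch only demonstrates the invariant that makes the recovery safe: the
copy travelling with the sole subordinate can never lag behind the original by
more than the current superstep.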