\section{Introduction}

Fault tolerance of data processing pipelines is one of the top concerns in the
development of job schedulers for big data processing; however, most schedulers
provide fault tolerance for subordinate nodes only. These failures are
routinely mitigated by restarting the failed job, or a part of it, on healthy
nodes, while failure of a master node is often considered either improbable or
too complicated to handle and configure on the target platform. System
administrators often resort to alternatives to application-level fault
tolerance: they isolate the master node from the rest of the cluster by placing
it on a dedicated machine, or they use virtualisation technologies instead. All
these alternatives complicate configuration and maintenance, and while they
decrease the probability that a single machine failure results in a whole
system failure, they increase the probability of a human error.

From this point of view it seems more practical to implement master node fault
tolerance at the application level, but there is no generic implementation:
most implementations are too tied to a particular application to be universally
applicable. We believe that this happens because of the habit of thinking of a
cluster as a collection of individual machines, each of which can be either
master or slave, rather than of a cluster as a whole, with master and slave
roles assigned dynamically to particular physical machines.

This shift in thinking makes it possible to implement middleware that manages
master and slave roles automatically and handles node failures in a generic
way. Such software provides an API to distribute parallel tasks over the pool
of available nodes. Using this API one can write an application that runs on a
cluster without knowing the exact number of online nodes. The middleware works
as a cluster operating system overlay on top of which distributed applications
are written.
\section{Related work}

Dynamic role assignment is an emerging trend in the design of distributed
systems~\cite{ostrovsky2015couchbase,divya2013elasticsearch,boyer2012glusterfs,anderson2010couchdb,lakshman2010cassandra};
however, it is not yet used in big data job schedulers. For example, in the
popular YARN job scheduler~\cite{vavilapalli2013yarn}, which is used by the
Hadoop and Spark big data analysis frameworks, master and slave roles are
static. Failure of a slave node is tolerated by restarting a part of a job on a
healthy node, and failure of a master node is tolerated by setting up a standby
reserve server~\cite{murthy2011architecture}. Both master servers are
coordinated by the ZooKeeper service, which itself uses dynamic role assignment
to ensure its fault tolerance~\cite{okorafor2012zookeeper}. The whole setup is
thus complicated by the Hadoop scheduler's lack of dynamic roles: if dynamic
roles were available, ZooKeeper would be redundant in this setup. Moreover,
this setup does not guarantee continuous operation of the master node, because
the standby server needs time to recover the current state after a failure.

The same problem occurs in high-performance computing, where the master node of
a job scheduler is a single point of failure.
In~\cite{uhlemann2006joshua,engelmann2006symmetric} the authors use replication
to make the master node highly available, but the backup server role is
assigned statically and cannot be delegated to a healthy worker node. This
solution is closer to fully dynamic role assignment than the high-availability
solutions of big data schedulers, because it does not rely on an external
configuration-storing service that would itself have to be highly available;
however, it is still far from an ideal solution in which roles are completely
decoupled from physical servers.
Finally, the simplest form of master node high availability is implemented in
the Virtual Router Redundancy Protocol
(VRRP)~\cite{knight1998rfc2338,hinden2004virtual,nadas2010rfc5798}. Although
VRRP does provide master and backup node roles that are dynamically assigned to
available routers, the protocol works on top of IPv4 and IPv6 and is designed
to be used by routers and reverse proxy servers. Such servers lack state that
needs to be restored upon a failure (i.e.~there is no job queue in web
servers), so it is easier for them to provide high availability. In Linux, VRRP
is implemented in the Keepalived routing daemon~\cite{cassen2002keepalived}.

In contrast to web servers and to HPC and big data job schedulers, some
distributed key-value stores and parallel file systems have a symmetric
architecture in which master and slave roles are assigned dynamically, so that
any node can act as a master when the current master node
fails~\cite{ostrovsky2015couchbase,divya2013elasticsearch,boyer2012glusterfs,anderson2010couchdb,lakshman2010cassandra}.
This design decision simplifies management of and interaction with a
distributed system. From a system administrator's point of view, it is much
simpler to install the same software stack on each node than to configure
master and slave nodes manually; it is also much easier to bootstrap new nodes
into the cluster and to decommission old ones. From a user's point of view, it
is much simpler to provide web service high availability and load balancing
when there are multiple backup nodes to connect to.

Dynamic role assignment would be beneficial for big data job schedulers because
it decouples distributed services from physical nodes, which is the first step
towards building a highly available distributed service.
The reason that there is no general solution to this problem is that there is
no generic programming environment in which to write and execute distributed
programmes. The aim of this work is to propose such an environment and to
describe its internal structure.

The programming model used in this work is partly based on the well-known actor
model of concurrent computation~\cite{agha1985actors,hewitt1973universal}. Our
model borrows the concept of an actor: an object that stores data and the
methods to process it, and that can react to external events by either changing
its state or producing more actors. We call these objects \emph{computational
kernels}. Their distinctive feature is a hierarchical dependence on the parent
kernel that created them, which makes it possible to implement fault tolerance
by simply restarting a failed subordinate kernel.

However, hierarchical dependence alone is not enough to make the master kernel,
the first kernel of a parallel programme, highly available. To solve this
problem, the other part of our programming model is based on the
bulk-synchronous parallel model~\cite{valiant1990bridging}. It borrows the
concept of a superstep, a sequential step of a parallel programme: at any time
a programme executes only one superstep, which makes it possible to provide
high availability of the first kernel (under the assumption that it has only
one subordinate at a time) by sending it along with its subordinate to a
different cluster node, thus creating a distributed copy of it. Since the first
kernel has only one subordinate at a time, its copy is always consistent with
the original kernel. This eliminates the need for complex distributed
transactions and distributed consensus algorithms, and guarantees protection
from at most one master node failure per superstep.
To summarise, the framework developed in this paper protects a parallel
programme from failure of any number of subordinate nodes and from one failure
of a master node per superstep. The paper does not address how to determine
that a node has failed: a failure is assumed when the network connection to a
node is prematurely closed. In general, the presented research is in line with
the further development of the virtual supercomputer concept coined and
evaluated in~\cite{vsc-csit2013,vsc-iccsa2014,vsc-nova}.