iccsa-16-factory

Factory: Master Node High-Availability for Big Data Applications and Beyond
git clone https://git.igankevich.com/iccsa-16-factory.git

intro.tex (7775B)


\section{Introduction}

Fault tolerance of data processing pipelines is one of the top concerns in the
development of job schedulers for big data processing; however, most schedulers
provide fault tolerance for subordinate nodes only. These types of failures are
routinely mitigated by restarting the failed job, or a part of it, on healthy
nodes, whereas failure of a master node is often considered either improbable
or too complicated to handle and configure on the target platform. System
administrators often resort to alternatives to application-level fault
tolerance: they isolate the master node from the rest of the cluster by placing
it on a dedicated machine, or they use virtualisation technologies instead. All
of these alternatives complicate configuration and maintenance, and while they
decrease the probability that a machine failure results in a whole-system
failure, they increase the probability of human error.

From this point of view it seems more practical to implement master node fault
tolerance at the application level, but there is no generic implementation of
it. Most implementations are too tied to a particular application to become
universally applicable. We believe that this happens because of the habit of
thinking of a cluster as a collection of individual machines, each of which can
be either a master or a slave, rather than as a whole, with master and slave
roles being dynamically assigned to particular physical machines.
     23 
     24 This evolution in thinking allows to implement middleware that manages master
     25 and slave roles automatically and handles node failures in a generic way. This
     26 software provides an API to distribute parallel tasks on the pool of available
     27 nodes and among them. Using this API one can write an application that runs on a
     28 cluster without knowing the exact number of online nodes. The middleware works
     29 as a cluster operating system overlay allowing to write distributed
     30 applications.
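
To make this idea more concrete, the listing below sketches what application
code written against such an API might look like. It is an illustration only:
the helper \texttt{distribute} is invented for this sketch (it is not the
actual middleware API) and here simply runs the tasks in local threads. The
point it conveys is that the application enumerates units of work, while the
mapping of those units to whatever nodes are currently online is left entirely
to the middleware; the node count appears nowhere in the application code.

\begin{verbatim}
#include <future>
#include <iostream>
#include <vector>

// Hypothetical stand-in for a middleware call that distributes tasks
// over the pool of available nodes.  In this sketch it runs the tasks
// in local threads, which is enough to show that the application never
// mentions how many nodes are online.
template <class Func>
void distribute(int ntasks, Func func) {
    std::vector<std::future<void>> results;
    for (int i = 0; i < ntasks; ++i) {
        results.push_back(std::async(std::launch::async, func, i));
    }
    for (auto& r : results) { r.get(); }
}

int main() {
    distribute(16, [](int task_no) {
        // Process one part of the data set; the middleware decides
        // on which node this code actually runs.
        std::cout << "task " << task_no << " done\n";
    });
}
\end{verbatim}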

\section{Related work}

Dynamic role assignment is an emerging trend in the design of distributed
systems~\cite{ostrovsky2015couchbase,divya2013elasticsearch,boyer2012glusterfs,anderson2010couchdb,lakshman2010cassandra};
however, it is still not used in big data job schedulers. For example, in the
popular YARN job scheduler~\cite{vavilapalli2013yarn}, which is used by the
Hadoop and Spark big data analysis frameworks, master and slave roles are
static. Failure of a slave node is tolerated by restarting a part of a job on a
healthy node, and failure of a master node is tolerated by setting up a standby
reserve server~\cite{murthy2011architecture}. Both master servers are
coordinated by the ZooKeeper service, which itself uses dynamic role assignment
to ensure its fault tolerance~\cite{okorafor2012zookeeper}. So the whole setup
is complicated because the Hadoop scheduler lacks dynamic roles: if dynamic
roles were available, ZooKeeper would be redundant in this setup. Moreover, this
setup does not guarantee continuous operation of the master node, because the
standby server needs time to recover the current state after a failure.

The same problem occurs in high-performance computing, where the master node of
a job scheduler is a single point of failure.
In~\cite{uhlemann2006joshua,engelmann2006symmetric} the authors use replication
to make the master node highly available, but the backup server role is
assigned statically and cannot be delegated to a healthy worker node. This
solution is closer to fully dynamic role assignment than the high-availability
solutions for big data schedulers, because it does not rely on an external
service to store the configuration, which would itself have to be highly
available; however, it is still far from an ideal solution in which roles are
completely decoupled from physical servers.

Finally, the simplest form of master node high-availability is implemented in
the Virtual Router Redundancy Protocol
(VRRP)~\cite{knight1998rfc2338,hinden2004virtual,nadas2010rfc5798}. Although
VRRP does provide master and backup node roles, which are dynamically assigned
to available routers, the protocol works on top of IPv4 and IPv6 and is designed
to be used by routers and reverse proxy servers. Such servers lack the state
that needs to be restored upon a failure (i.e.~there is no job queue in web
servers), so it is easier for them to provide high availability. In Linux this
protocol is implemented in the Keepalived routing
daemon~\cite{cassen2002keepalived}.

In contrast to web servers and HPC and big data job schedulers, some distributed
key-value stores and parallel file systems have a symmetric architecture, in
which master and slave roles are assigned dynamically, so that any node can act
as a master when the current master node
fails~\cite{ostrovsky2015couchbase,divya2013elasticsearch,boyer2012glusterfs,anderson2010couchdb,lakshman2010cassandra}.
This design decision simplifies management of and interaction with a
distributed system. From the system administrator's point of view, it is much
simpler to install the same software stack on each node than to manually
configure master and slave nodes. Additionally, it is much easier to bootstrap
new nodes into the cluster and to decommission old ones. From the user's point
of view, it is much simpler to provide web service high availability and load
balancing when there are multiple backup nodes to connect to.

Dynamic role assignment would be beneficial for big data job schedulers because
it decouples distributed services from physical nodes, which is the first step
towards building a highly available distributed service. The reason that there
is no general solution to this problem is that there is no generic programming
environment in which to write and execute distributed programmes. The aim of
this work is to propose such an environment and to describe its internal
structure.

The programming model used in this work is partly based on the well-known actor
model of concurrent computation~\cite{agha1985actors,hewitt1973universal}. Our
model borrows the concept of an actor: an object that stores data and the
methods to process it, and that can react to external events by either changing
its state or producing more actors. We call these objects \emph{computational
  kernels}. Their distinct feature is a hierarchical dependence on the parent
kernel that created each of them, which makes it possible to implement fault
tolerance based on a simple restart of a failed subordinate kernel.
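
A minimal sketch of this idea in C++ is given below. The class and member names
are chosen for illustration only and are not the actual framework API; the
essential points are that each kernel keeps a pointer to its parent, does its
own work in \texttt{act}, and handles returned subordinates in \texttt{react},
where a failed subordinate can simply be resubmitted.

\begin{verbatim}
// Illustrative base class for a computational kernel: it stores a
// pointer to the parent kernel that created it, carries out its own
// work in act(), and reacts to a returned subordinate in react().
struct Kernel {
    virtual ~Kernel() = default;
    virtual void act() {}                        // do own work, may create subordinates
    virtual void react(Kernel* subordinate) {}   // handle a returned subordinate
    Kernel* parent = nullptr;
    bool failed = false;
};

// A parent kernel tolerates the failure of a subordinate by simply
// restarting it; this is what makes the hierarchy fault tolerant.
struct ParentKernel : Kernel {
    void react(Kernel* subordinate) override {
        if (subordinate->failed) {
            subordinate->failed = false;
            subordinate->act();  // in real middleware: resend to a healthy node
        }
    }
};
\end{verbatim}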

However, hierarchical dependence alone is not enough to provide high
availability of the master kernel, the first kernel of a parallel programme. To
solve this problem, the other part of our programming model is based on the
bulk-synchronous parallel model~\cite{valiant1990bridging}. It borrows the
concept of a superstep: a sequential step of a parallel programme. At any time a
programme executes only one superstep, which makes it possible to implement
high availability of the first kernel (under the assumption that it has only
one subordinate at a time) by sending it along with its subordinate to a
different cluster node, thus making a distributed copy of it. Since the first
kernel has only one subordinate at a time, its copy is always consistent with
the original kernel. This eliminates the need for complex distributed
transactions and distributed consensus algorithms and guarantees protection
from at most one master node failure per superstep.
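
The following self-contained toy programme imitates this scheme in a single
process, purely as an illustration of why one subordinate per superstep keeps
the copy trivially consistent: the state of the first kernel is copied whenever
its subordinate is sent out, and a simulated master failure is survived by
resuming from that copy. All names here are ours; in the real middleware the
kernels would be serialised and transmitted to another cluster node.

\begin{verbatim}
#include <iostream>
#include <optional>
#include <string>

// Toy stand-in for the first (master) kernel; its "state" is a string.
struct MasterKernel {
    int superstep = 0;
    std::string state = "initial";
};

// Slot on the node that runs the subordinate: it holds the copy of the
// master kernel for the duration of one superstep.
struct BackupSlot {
    std::optional<MasterKernel> copy;
};

int main() {
    MasterKernel master;
    BackupSlot backup;
    for (; master.superstep < 3; ++master.superstep) {
        // Sending the subordinate: a copy of the master travels with it.
        backup.copy = master;
        // ... the subordinate would run on another node here ...
        bool master_failed = (master.superstep == 1);  // simulated failure
        if (master_failed) {
            // The node holding the copy becomes the new master and
            // resumes from the beginning of the failed superstep.
            master = *backup.copy;
            std::cout << "recovered master at superstep "
                      << master.superstep << "\n";
        }
        master.state = "after superstep " + std::to_string(master.superstep);
    }
}
\end{verbatim}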

To summarise, the framework developed in this paper protects a parallel
programme from the failure of any number of subordinate nodes and from one
failure of a master node per superstep. The paper does not answer the question
of how to determine whether a node has failed; it assumes a failure when the
network connection to a node is prematurely closed. In general, the presented
research goes in line with the further development of the virtual supercomputer
concept coined and evaluated in~\cite{vsc-csit2013,vsc-iccsa2014,vsc-nova}.
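
For completeness, the sketch below shows the failure-detection rule mentioned
above in isolation: a node is assumed to have failed when the connection to it
is closed prematurely, i.e.\ before the kernels sent to it have returned. It
uses plain POSIX sockets, and the surrounding helper names are invented; it is
only meant to show where the logic that restarts the lost kernels would hook in.

\begin{verbatim}
#include <sys/types.h>
#include <sys/socket.h>   // recv
#include <cerrno>

// Returns true if the peer node should be considered failed.  A node is
// assumed to have failed when the connection to it is closed (recv
// returns 0) or breaks with an error while kernels sent to that node
// have not yet come back.
bool connection_indicates_failure(int socket_fd, bool kernels_pending) {
    char buffer[4096];
    ssize_t n = recv(socket_fd, buffer, sizeof(buffer), 0);
    if (n > 0) {
        return false;  // normal traffic: hand the bytes to the deserialiser
    }
    if (n == 0 || (n < 0 && errno != EINTR)) {
        // Peer closed the connection or the connection broke.  If kernels
        // are still pending on that node, treat this as a node failure and
        // let the parent kernels restart them elsewhere.
        return kernels_pending;
    }
    return false;  // interrupted system call: try again later
}
\end{verbatim}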