Master node fault tolerance in distributed big data processing clusters

\section{Introduction}

Fault tolerance of data processing pipelines is one of the top concerns in the
development of job schedulers for big data processing; however, most schedulers
provide fault tolerance for subordinate nodes only. These failures are
routinely mitigated by restarting the failed job, or a part of it, on healthy
nodes, while failure of a master node is often considered either improbable or
too complicated to handle and configure on the target platform. System
administrators often resort to alternatives to application-level fault
tolerance: they isolate the master node from the rest of the cluster by placing
it on a dedicated machine, or they use virtualisation technologies instead. All
of these alternatives complicate configuration and maintenance and, by
decreasing the probability that a machine failure results in a whole-system
failure, increase the probability of a human error.

From this point of view it seems more practical to implement master node fault
tolerance at the application level, but there is no generic implementation:
most existing implementations are too tied to a particular application to
become universally applicable. We believe that this happens because people are
accustomed to thinking of a cluster as a collection of individual machines,
each of which can be either a master or a slave, rather than as a whole, with
master and slave roles being \emph{dynamically} assigned to particular
physical machines.

This shift in thinking makes it possible to implement middleware that manages
master and slave roles automatically and handles node failures in a generic
way. Such software provides an API to distribute parallel tasks across the
pool of available nodes. Using this API one can write an application that runs
on a cluster without knowing the exact number of online nodes. The middleware
is implemented as a daemon running on each cluster node which acts as an
intermediate point of communication for distributed applications and
transparently routes application messages between operating system processes
running on different cluster nodes.

The rest of the paper is organised as follows. In Section 2 we discuss how
dynamic node role assignment is implemented in state-of-the-art middleware
systems and how hierarchical dependencies between kernels help establish rules
for restarting failed parts of a parallel application. In Section 3 we describe
a programming model based on kernels~--- small independent units of work~---
and an algorithm for handling master node failure. In Sections 4 and 5 we
present and discuss results of experiments that show the validity of the
proposed approach, and we conclude the paper in Section 6.

The research into new fault-tolerance methods is motivated by the fact that the
size of large-scale computing systems (clusters and supercomputers) approaches
a critical point, where the number of nodes is so large that the probability of
all nodes working simultaneously without a failure tends to nought. In other
words, in future large-scale systems it is highly probable that a parallel
application experiences a node failure during its execution, and tolerating
this failure transparently and without checkpoints will increase the
performance of future parallel applications.

\section{Related work}

Dynamic role assignment is an emerging trend in the design of distributed
systems~\citep{ostrovsky2015couchbase,divya2013elasticsearch,boyer2012glusterfs,anderson2010couchdb,lakshman2010cassandra},
however, it is still not used in big data job schedulers. For example, in the
popular YARN~\citep{vavilapalli2013yarn} and Spark~\citep{zaharia2012resilient}
job schedulers, which are used by the Hadoop and Spark big data analysis
frameworks, master and slave roles are static. Failure of a slave node is
tolerated by restarting a part of a job on a healthy node, and failure of a
master node is tolerated by setting up a standby reserved
server~\citep{murthy2011architecture}. Both master servers are coordinated by
the ZooKeeper service, which itself uses dynamic role assignment to ensure its
fault tolerance~\citep{okorafor2012zookeeper}. So, the whole setup is
complicated because the Hadoop scheduler lacks dynamic roles: if dynamic roles
were available, ZooKeeper would be redundant in this setup. Moreover, this
setup does not guarantee continuous operation of the master node because the
standby server needs time to recover the current state after a failure.

The same problem occurs in high-performance computing, where the master node of
a job scheduler is the single point of failure.
In~\citep{uhlemann2006joshua,engelmann2006symmetric} the authors use replication
to make the master node highly available, but the backup server role is
assigned statically and cannot be delegated to a healthy worker node. This
solution is closer to fully dynamic role assignment than the high-availability
solutions for big data schedulers, because it does not involve an external
service to store configuration (which would itself have to be highly
available); however, it is far from an ideal solution in which roles are
completely decoupled from physical servers.

Finally, the simplest form of master node high availability is implemented in
the Virtual Router Redundancy Protocol
(VRRP)~\citep{knight1998rfc2338,hinden2004virtual,nadas2010rfc5798}. Although
VRRP does provide master and backup node roles, which are dynamically assigned
to available routers, the protocol works on top of the IPv4 and IPv6 protocols
and is designed to be used by routers and reverse proxy servers. Such servers
lack the state that needs to be restored upon a failure (i.e.~there is no job
queue in web servers), so it is easier for them to provide high availability.
On Linux it is implemented in the Keepalived routing
daemon~\citep{cassen2002keepalived}.

In contrast to web servers and HPC and big data job schedulers, some
distributed key-value stores and parallel file systems have a symmetric
architecture, in which master and slave roles are assigned dynamically, so that
any node can act as a master when the current master node
fails~\citep{ostrovsky2015couchbase,divya2013elasticsearch,boyer2012glusterfs,anderson2010couchdb,lakshman2010cassandra}.
This design decision simplifies management of, and interaction with, a
distributed system. From a system administrator's point of view it is much
simpler to install the same software stack on each node than to manually
configure master and slave nodes. Additionally, it is much easier to bootstrap
new nodes into the cluster and decommission old ones. From a user's point of
view, it is much simpler to provide web service high availability and load
balancing when there are multiple backup nodes to connect to.

Dynamic role assignment would be beneficial for big data job schedulers because
it decouples distributed services from physical nodes, which is the first step
towards building a highly available distributed service. The reason that there
is no general solution to this problem is that there is no generic programming
environment in which to write and execute distributed programmes. The aim of
this work is to propose such an environment and to describe its internal
structure.

The programming model used in this work is partly based on the well-known actor
model of concurrent computation~\citep{agha1985actors,hewitt1973universal}. Our
model borrows the concept of an actor --- an object that stores data and the
methods to process it; this object can react to external events by either
changing its state or producing more actors. We call these objects
\emph{computational kernels}. Their distinct feature is hierarchical dependence
on the parent kernel that created each of them, which allows implementing fault
tolerance by restarting failed subordinate kernels.

However, hierarchical dependence alone is not enough to provide high
availability for the master kernel --- the first kernel of a parallel
programme. To solve this problem, the other part of our programming model is
based on the bulk-synchronous parallel
model~\citep{valiant1990bridging}. It borrows the concept of a superstep --- a
sequential step of a parallel programme; at any time a programme executes only
one superstep, which allows implementing high availability of the first kernel
(under the assumption that it has only one subordinate at a time) by sending it
along with its subordinate to a different cluster node, thus making a
distributed copy of it. Since the first kernel has only one subordinate at a
time, its copy is always consistent with the original kernel. This eliminates
the need for complex distributed transactions and distributed consensus
algorithms and guarantees protection from at most one master node failure per
superstep.

    135 To summarise, the framework developed in this paper protects a parallel
    136 programme from failure of any number of subordinate nodes and from one failure
    137 of a master node per superstep. The paper does not answer the question of how to
    138 determine if a node has failed, it assumes node failure when the network
    139 connection to a node is prematurely closed. In general, the presented research
    140 goes in line with further development of the virtual supercomputer concept
    141 coined and evaluated in~\citep{vsc-csit2013,vsc-iccsa2014,vsc-nova}.