\section{Introduction}

Fault tolerance of data processing pipelines is one of the top concerns in the
development of job schedulers for big data processing; however, most schedulers
provide fault tolerance for subordinate nodes only. These failures are
routinely mitigated by restarting the failed job, or a part of it, on healthy
nodes, while failure of a master node is often considered either improbable or
too complicated to handle and configure on the target platform. System
administrators often resort to alternatives to application-level fault
tolerance: they isolate the master node from the rest of the cluster by placing
it on a dedicated machine, or they use virtualisation technologies instead. All
these alternatives complicate configuration and maintenance, and while they
decrease the probability that a single machine failure results in a whole
system failure, they increase the probability of a human error.

From this point of view it seems more practical to implement master node fault
tolerance at the application level, but there is no generic implementation:
most implementations are too tied to a particular application to be universally
applicable. We believe that this happens because of the habit of thinking of a
cluster as a collection of individual machines, each of which can be either
master or slave, rather than of a cluster as a whole, with master and slave
roles assigned dynamically to particular physical machines.

This shift in thinking makes it possible to implement middleware that manages
master and slave roles automatically and handles node failures in a generic
way. Such software provides an API to distribute parallel tasks over the pool
of available nodes. Using this API one can write an application that runs on a
cluster without knowing the exact number of online nodes. The middleware works
as a cluster operating system overlay on top of which distributed applications
are written.
\section{Related work}

Dynamic role assignment is an emerging trend in the design of distributed
systems~\cite{ostrovsky2015couchbase,divya2013elasticsearch,boyer2012glusterfs,anderson2010couchdb,lakshman2010cassandra};
however, it is not yet used in big data job schedulers. For example, in the
popular YARN job scheduler~\cite{vavilapalli2013yarn}, which is used by the
Hadoop and Spark big data analysis frameworks, master and slave roles are
static. Failure of a slave node is tolerated by restarting a part of a job on a
healthy node, and failure of a master node is tolerated by setting up a standby
reserve server~\cite{murthy2011architecture}. Both master servers are
coordinated by the ZooKeeper service, which itself uses dynamic role assignment
to ensure its fault tolerance~\cite{okorafor2012zookeeper}. The whole setup is
thus complicated by the Hadoop scheduler's lack of dynamic roles: if dynamic
roles were available, ZooKeeper would be redundant in this setup. Moreover,
this setup does not guarantee continuous operation of the master node, because
the standby server needs time to recover the current state after a failure.

The same problem occurs in high-performance computing, where the master node of
a job scheduler is a single point of failure.
In~\cite{uhlemann2006joshua,engelmann2006symmetric} the authors use replication
to make the master node highly available, but the backup server role is
assigned statically and cannot be delegated to a healthy worker node. This
solution is closer to fully dynamic role assignment than the high-availability
solutions of big data schedulers, because it does not rely on an external
configuration-storing service that would itself have to be highly available;
however, it is still far from an ideal solution in which roles are completely
decoupled from physical servers.
Finally, the simplest form of master node high availability is implemented in
the Virtual Router Redundancy Protocol
(VRRP)~\cite{knight1998rfc2338,hinden2004virtual,nadas2010rfc5798}. Although
VRRP does provide master and backup node roles that are dynamically assigned to
available routers, the protocol works on top of IPv4 and IPv6 and is designed
to be used by routers and reverse proxy servers. Such servers lack state that
needs to be restored upon a failure (i.e.~there is no job queue in web
servers), so it is easier for them to provide high availability. In Linux, VRRP
is implemented in the Keepalived routing daemon~\cite{cassen2002keepalived}.

In contrast to web servers and to HPC and big data job schedulers, some
distributed key-value stores and parallel file systems have a symmetric
architecture in which master and slave roles are assigned dynamically, so that
any node can act as a master when the current master node
fails~\cite{ostrovsky2015couchbase,divya2013elasticsearch,boyer2012glusterfs,anderson2010couchdb,lakshman2010cassandra}.
This design decision simplifies management of and interaction with a
distributed system. From a system administrator's point of view, it is much
simpler to install the same software stack on each node than to configure
master and slave nodes manually; it is also much easier to bootstrap new nodes
into the cluster and to decommission old ones. From a user's point of view, it
is much simpler to provide web service high availability and load balancing
when there are multiple backup nodes to connect to.

Dynamic role assignment would be beneficial for big data job schedulers because
it decouples distributed services from physical nodes, which is the first step
towards building a highly available distributed service.
The reason that there is no general solution to this problem is that there is
no generic programming environment in which to write and execute distributed
programmes. The aim of this work is to propose such an environment and to
describe its internal structure.

The programming model used in this work is partly based on the well-known actor
model of concurrent computation~\cite{agha1985actors,hewitt1973universal}. Our
model borrows the concept of an actor: an object that stores data and the
methods to process it, and that can react to external events by either changing
its state or producing more actors. We call these objects \emph{computational
kernels}. Their distinctive feature is a hierarchical dependence on the parent
kernel that created them, which makes it possible to implement fault tolerance
by simply restarting a failed subordinate kernel.

However, hierarchical dependence alone is not enough to make the master kernel,
the first kernel of a parallel programme, highly available. To solve this
problem, the other part of our programming model is based on the
bulk-synchronous parallel model~\cite{valiant1990bridging}. It borrows the
concept of a superstep, a sequential step of a parallel programme: at any time
a programme executes only one superstep, which makes it possible to provide
high availability of the first kernel (under the assumption that it has only
one subordinate at a time) by sending it along with its subordinate to a
different cluster node, thus creating a distributed copy of it. Since the first
kernel has only one subordinate at a time, its copy is always consistent with
the original kernel. This eliminates the need for complex distributed
transactions and distributed consensus algorithms, and guarantees protection
from at most one master node failure per superstep.
To summarise, the framework developed in this paper protects a parallel
programme from failure of any number of subordinate nodes and from one failure
of a master node per superstep. The paper does not address how to determine
that a node has failed: a failure is assumed when the network connection to a
node is prematurely closed. In general, the presented research is in line with
the further development of the virtual supercomputer concept coined and
evaluated in~\cite{vsc-csit2013,vsc-iccsa2014,vsc-nova}.