commit fc60d2879b5dc18ed6597ccee3ff90a87443f7a3
parent 5546c270ceb6400fef5c8e8109c7aad1a3b812e4
Author: Ivan Gankevich <igankevich@ya.ru>
Date: Sun, 22 Jan 2017 18:27:17 +0300
Reformat.
Diffstat:
2 files changed, 122 insertions(+), 15 deletions(-)
diff --git a/src/abstract.tex b/src/abstract.tex
@@ -1,5 +1,19 @@
\begin{abstract}
-Master node fault-tolerance is the topic that is often dimmed in the discussion of big data processing technologies. Although failure of a master node can take down the whole data processing pipeline, this is considered either improbable or too difficult to encounter. The aim of the studies reported here is to propose rather simple technique to deal with master-node failures. This technique is based on temporary delegation of master role to one of the slave nodes and transferring updated state back to the master when one step of computation is complete. That way the state is duplicated and computation can proceed to the next step regardless of a failure of a delegate or the master (but not both). We run benchmarks to show that a failure of a master is almost ``invisible'' to other nodes, and failure of a delegate results in recomputation of only one step of data processing pipeline. We believe that the technique can be used not only in Big Data processing but in other types of applications.
+Master node fault tolerance is a topic that is often glossed over in
+discussions of big data processing technologies. Although failure of a master
+node can take down the whole data processing pipeline, this is considered
+either improbable or too difficult to handle. The aim of the studies reported
+here is to propose a rather simple technique to deal with master node
+failures. This technique is based on temporary delegation of the master role
+to one of the slave nodes and transferring the updated state back to the
+master when one step of computation is complete. That way the state is
+duplicated and computation can proceed to the next step regardless of a
+failure of the delegate or the master (but not both). We run benchmarks to
+show that a failure of the master is almost ``invisible'' to other nodes, and
+that a failure of the delegate results in recomputation of only one step of
+the data processing pipeline. We believe that the technique can be used not
+only in Big Data processing but also in other types of applications.
-\keywords{parallel computing $\cdot$ Big Data processing $\cdot$ distributed computing $\cdot$ backup node $\cdot$ state transfer $\cdot$ delegation $\cdot$ cluster computing $\cdot$ fault-tolerance}
-\end{abstract}-
\ No newline at end of file
+\keywords{parallel computing $\cdot$ Big Data processing $\cdot$ distributed
+ computing $\cdot$ backup node $\cdot$ state transfer $\cdot$ delegation
+ $\cdot$ cluster computing $\cdot$ fault-tolerance}
+\end{abstract}
diff --git a/src/intro.tex b/src/intro.tex
@@ -1,25 +1,119 @@
\section{Introduction}
-Fault tolerance of data processing pipelines is one of the top concerns in development of job schedulers for big data processing, however, most schedulers provide fault tolerance for subordinate nodes only. These types of failures are routinely mitigated by restarting the failed job or its part on healthy nodes, and failure of a master node is often considered either improbable, or too complicated to handle and configure on the target platform. System administrators often find alternatives to application level fault tolerance: they isolate master node from the rest of the cluster by placing it on a dedicated machine, or use virtualisation technologies instead. All these alternatives complexify configuration and maintenance, and by decreasing probability of a machine failure resulting in a whole system failure, they increase probability of a human error.
+Fault tolerance of data processing pipelines is one of the top concerns in the
+development of job schedulers for big data processing; however, most
+schedulers provide fault tolerance for subordinate nodes only. These types of
+failures are routinely mitigated by restarting the failed job, or a part of
+it, on healthy nodes, while failure of a master node is often considered
+either improbable or too complicated to handle and configure on the target
+platform. System administrators often resort to alternatives to
+application-level fault tolerance: they isolate the master node from the rest
+of the cluster by placing it on a dedicated machine, or they use
+virtualisation technologies instead. All these alternatives complicate
+configuration and maintenance, and while they decrease the probability that a
+machine failure results in a whole system failure, they increase the
+probability of human error.
-From such point of view it seems more practical to implement master node fault tolerance at application level, however, there is no generic implementation. Most implementations are too tied to a particular application to become universally acceptable. We believe that this happens due to people's habit to think of a cluster as a collection of individual machines each of which can be either master or slave, rather than to think of a cluster as a whole with master and slave roles being dynamically assigned to a particular physical machine.
+From this point of view it seems more practical to implement master node fault
+tolerance at the application level; however, there is no generic
+implementation. Most implementations are too tied to a particular application
+to become universally applicable. We believe that this happens because of the
+habit of thinking of a cluster as a collection of individual machines, each of
+which can be either master or slave, rather than as a whole, with master and
+slave roles dynamically assigned to particular physical machines.
-This evolution in thinking allows to implement middleware that manages master and slave roles automatically and handles node failures in a generic way. This software provides an API to distribute parallel tasks on the pool of available nodes and among them. Using this API one can write an application that runs on a cluster without knowing the exact number of online nodes. The middleware works as a cluster operating system overlay allowing to write distributed applications.
+This evolution in thinking makes it possible to implement middleware that
+manages master and slave roles automatically and handles node failures in a
+generic way. This software provides an API to distribute parallel tasks over
+the pool of available nodes. Using this API one can write an application that
+runs on a cluster without knowing the exact number of online nodes. The
+middleware works as a cluster operating system overlay that allows writing
+distributed applications.
\section{Related work}
-Dynamic role assignment is an emerging trend in design of distributed systems~\cite{ostrovsky2015couchbase,divya2013elasticsearch,boyer2012glusterfs,anderson2010couchdb,lakshman2010cassandra}, however, it is still not used in big data job schedulers. For example, in popular YARN job scheduler~\cite{vavilapalli2013yarn}, which is used by Hadoop and Spark big data analysis frameworks, master and slave roles are static. Failure of a slave node is tolerated by restarting a part of a job on a healthy node, and failure of a master node is tolerated by setting up standby reserved server~\cite{murthy2011architecture}. Both master servers are coordinated by Zookeeper service which itself uses dynamic role assignment to ensure its fault-tolerance~\cite{okorafor2012zookeeper}. So, the whole setup is complicated due to Hadoop scheduler lacking dynamic roles: if dynamic roles were available, Zookeeper would be redundant in this setup. Moreover, this setup does not guarantee continuous operation of master node because standby server needs time to recover current state after a failure.
+Dynamic role assignment is an emerging trend in the design of distributed
+systems~\cite{ostrovsky2015couchbase,divya2013elasticsearch,boyer2012glusterfs,anderson2010couchdb,lakshman2010cassandra};
+however, it is still not used in big data job schedulers. For example, in the
+popular YARN job scheduler~\cite{vavilapalli2013yarn}, which is used by the
+Hadoop and Spark big data analysis frameworks, master and slave roles are
+static. Failure of a slave node is tolerated by restarting a part of the job
+on a healthy node, and failure of the master node is tolerated by setting up a
+standby reserve server~\cite{murthy2011architecture}. Both master servers are
+coordinated by the Zookeeper service, which itself uses dynamic role
+assignment to ensure its own fault tolerance~\cite{okorafor2012zookeeper}. The
+whole setup is thus complicated by the Hadoop scheduler's lack of dynamic
+roles: if dynamic roles were available, Zookeeper would be redundant in this
+setup. Moreover, this setup does not guarantee continuous operation of the
+master node because the standby server needs time to recover the current state
+after a failure.
-The same problem occurs in high-performance computing where master node of a job scheduler is the single point of failure. In~\cite{uhlemann2006joshua,engelmann2006symmetric} the authors use replication to make the master node highly-available, but backup server role is assigned statically and cannot be delegated to a healthy worker node. This solution is closer to fully dynamic role assignment than high-availability solution for big data schedulers, because it does not involve using external service to store configuration which should also be highly-available, however, it is far from ideal solution where roles are completely decoupled from physical servers.
+The same problem occurs in high-performance computing, where the master node
+of a job scheduler is a single point of failure.
+In~\cite{uhlemann2006joshua,engelmann2006symmetric} the authors use
+replication to make the master node highly available, but the backup server
+role is assigned statically and cannot be delegated to a healthy worker node.
+This solution is closer to fully dynamic role assignment than the
+high-availability solutions for big data schedulers, because it does not rely
+on an external configuration-storage service that itself has to be highly
+available; however, it is still far from the ideal solution in which roles are
+completely decoupled from physical servers.
-Finally, the simplest master node high-availability is implemented in Virtual Router Redundancy Protocol (VRRP)~\cite{knight1998rfc2338,hinden2004virtual,nadas2010rfc5798}. Although VRRP protocol does provide master and backup node roles, which are dynamically assigned to available routers, this protocol works on top of the IPv4 and IPv6 protocols and is designed to be used by routers and reverse proxy servers. Such servers lack the state that needs to be restored upon a failure (i.e.~there is no job queue in web servers), so it is easier for them to provide high-availability. In Linux it is implemented in Keepalived routing daemon~\cite{cassen2002keepalived}.
+Finally, the simplest form of master node high availability is implemented in
+the Virtual Router Redundancy Protocol
+(VRRP)~\cite{knight1998rfc2338,hinden2004virtual,nadas2010rfc5798}. Although
+VRRP does provide master and backup node roles that are dynamically assigned
+to available routers, it works on top of the IPv4 and IPv6 protocols and is
+designed to be used by routers and reverse proxy servers. Such servers lack
+state that needs to be restored upon a failure (e.g.~there is no job queue in
+web servers), so it is easier for them to provide high availability. On Linux,
+VRRP is implemented in the Keepalived routing
+daemon~\cite{cassen2002keepalived}.
-In contrast to web servers and HPC and Big Data job schedulers, some distributed key-value stores and parallel file systems have symmetric architecture, where master and slave roles are assigned dynamically, so that any node can act as a master when the current master node fails~\cite{ostrovsky2015couchbase,divya2013elasticsearch,boyer2012glusterfs,anderson2010couchdb,lakshman2010cassandra}. This design decision simplifies management and interaction with a distributed system. From system administrator point of view it is much simpler to install the same software stack on each node than to manually configure master and slave nodes. Additionally, it is much easier to bootstrap new nodes into the cluster and decommission old ones. From user point of view, it is much simpler to provide web service high-availability and load-balancing when you have multiple backup nodes to connect to.
+In contrast to web servers and to HPC and Big Data job schedulers, some
+distributed key-value stores and parallel file systems have a symmetric
+architecture, in which master and slave roles are assigned dynamically, so
+that any node can act as a master when the current master node
+fails~\cite{ostrovsky2015couchbase,divya2013elasticsearch,boyer2012glusterfs,anderson2010couchdb,lakshman2010cassandra}.
+This design decision simplifies management of, and interaction with, a
+distributed system. From the system administrator's point of view it is much
+simpler to install the same software stack on each node than to manually
+configure master and slave nodes. Additionally, it is much easier to bootstrap
+new nodes into the cluster and to decommission old ones. From the user's point
+of view, it is much simpler to provide web service high availability and load
+balancing when there are multiple backup nodes to connect to.
-Dynamic role assignment would be beneficial for Big Data job schedulers because it allows to decouple distributed services from physical nodes, which is the first step to build highly-available distributed service. The reason that there is no general solution to this problem is that there is no generic programming environment to write and execute distributed programmes. The aim of this work is to propose such an environment and to describe its internal structure.
+Dynamic role assignment would be beneficial for Big Data job schedulers
+because it decouples distributed services from physical nodes, which is the
+first step towards building a highly available distributed service. The reason
+there is no general solution to this problem is the lack of a generic
+programming environment for writing and executing distributed programmes. The
+aim of this work is to propose such an environment and to describe its
+internal structure.
-The programming model used in this work is partly based on well-known actor model of concurrent computation~\cite{agha1985actors,hewitt1973universal}. Our model borrows the concept of actor---an object that stores data and methods to process it; this object can react to external events by either changing its state or producing more actors. We call this objects \emph{computational kernels}. Their distinct feature is hierarchical dependence on parent kernel that created each of them, which allows to implement fault-tolerance based on simple restart of a failed subordinate kernel.
+The programming model used in this work is partly based on the well-known
+actor model of concurrent
+computation~\cite{agha1985actors,hewitt1973universal}. Our model borrows the
+concept of an actor---an object that stores data and the methods to process
+it; such an object can react to external events by either changing its state
+or producing more actors. We call these objects \emph{computational kernels}.
+Their distinctive feature is a hierarchical dependence on the parent kernel
+that created them, which makes it possible to implement fault tolerance by
+simply restarting a failed subordinate kernel.
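+
+To make the concept more tangible, the following C++ sketch shows one possible
+shape of such a kernel interface; all class and method names here are
+illustrative assumptions rather than the actual interface of our framework.
+\begin{verbatim}
+// Illustrative sketch of the "computational kernel" abstraction
+// (class and method names are hypothetical, not the framework API).
+#include <cstdio>
+
+class Kernel {
+public:
+    virtual ~Kernel() = default;
+    virtual void act() = 0;                // process stored data, spawn subordinates
+    virtual void react(Kernel& child) = 0; // collect a subordinate's result
+    Kernel* parent = nullptr;              // the kernel that created this one
+};
+
+// A subordinate kernel: if its node fails, the runtime simply restarts it,
+// because its parent still holds everything needed to recreate it.
+class PartialSum : public Kernel {
+public:
+    PartialSum(int from, int to): from_(from), to_(to) {}
+    void act() override {
+        for (int i = from_; i < to_; ++i) sum_ += i;
+    }
+    void react(Kernel&) override {}        // leaf kernel: no subordinates
+    int sum() const { return sum_; }
+private:
+    int from_, to_, sum_ = 0;
+};
+
+int main() {
+    PartialSum k(0, 10);
+    k.act();                               // normally invoked by the scheduler
+    std::printf("partial sum = %d\n", k.sum());
+}
+\end{verbatim}
+The hierarchical link to the parent is what lets the runtime restart a failed
+subordinate from the data its parent still holds.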
-However, using hierarchical dependence alone is not enough to develop high-availability of a master kernel---the first kernel in a parallel programme. To solve the problem the other part of our programming model is based on bulk-synchronous parallel model~\cite{valiant1990bridging}. It borrows the concept of superstep---a sequential step of a parallel programme; at any time a programme executes only one superstep, which allows to implement high-availability of the first kernel (under assumption that it has only one subordinate at a time) by sending it along its subordinate to a different cluster node thus making a distributed copy of it. Since the first kernel has only one subordinate at a time, its copy is always consistent with the original kernel. This eliminates the need for complex distributed transactions and distributed consensus algorithms and guarantees protection from at most one master node failure per superstep.
+However, hierarchical dependence alone is not enough to provide high
+availability of the master kernel---the first kernel in a parallel programme.
+To solve this problem, the other part of our programming model is based on the
+bulk-synchronous parallel model~\cite{valiant1990bridging}. It borrows the
+concept of a superstep---a sequential step of a parallel programme; at any
+time the programme executes only one superstep, which makes it possible to
+provide high availability of the first kernel (under the assumption that it
+has only one subordinate at a time) by sending it along with its subordinate
+to a different cluster node, thus making a distributed copy of it. Since the
+first kernel has only one subordinate at a time, its copy is always consistent
+with the original kernel. This eliminates the need for complex distributed
+transactions and distributed consensus algorithms and guarantees protection
+from at most one master node failure per superstep.
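+
+A minimal, self-contained C++ sketch of this copying scheme is given below;
+the types and function names are assumptions made for illustration and do not
+reflect the actual implementation.
+\begin{verbatim}
+// Illustrative sketch: keeping a distributed copy of the first kernel
+// during one superstep (all names are hypothetical).
+#include <cstdio>
+#include <string>
+
+struct Buffer { std::string bytes; };    // serialised kernel state
+
+struct DelegateNode {
+    Buffer subordinate;                  // the task to execute remotely
+    Buffer master_copy;                  // piggy-backed copy of the first kernel
+
+    // The first kernel is sent along with its only subordinate, so the
+    // stored copy is always consistent with the original.
+    void receive(const Buffer& task, const Buffer& master_state) {
+        subordinate = task;
+        master_copy = master_state;
+    }
+
+    // If the master node fails before the superstep completes, the
+    // delegate resumes the next superstep from its stored copy.
+    void on_master_failure() {
+        std::printf("restoring first kernel from copy (%zu bytes)\n",
+                    master_copy.bytes.size());
+    }
+};
+
+int main() {
+    DelegateNode delegate;
+    delegate.receive(Buffer{"subordinate kernel"}, Buffer{"first kernel"});
+    delegate.on_master_failure();        // simulate a master node failure
+}
+\end{verbatim}
+When the subordinate returns to the master, the copy is discarded; as stated
+above, this protects against at most one failure of the master node per
+superstep.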
-To summarise, the framework developed in this paper protects a parallel programme from failure of any number of subordinate nodes and from one failure of a master node per superstep. The paper does not answer the question of how to determine if a node failed, it assumes a failure when the network connection to a node is prematurely closed. In general, the presented research goes in line with further development of the virtual supercomputer concept coined and evaluated in~\cite{vsc-csit2013,vsc-iccsa2014,vsc-nova}.
+To summarise, the framework developed in this paper protects a parallel
+programme from the failure of any number of subordinate nodes and from one
+failure of the master node per superstep. The paper does not address the
+question of how to determine whether a node has failed; a failure is assumed
+when the network connection to the node is closed prematurely. In general, the
+presented research is in line with the further development of the virtual
+supercomputer concept coined and evaluated
+in~\cite{vsc-csit2013,vsc-iccsa2014,vsc-nova}.