hpcs-16-factory

Factory: Non-stop batch jobs without checkpointing
git clone https://git.igankevich.com/hpcs-16-factory.git

methods.tex (7953B)


\section{METHODS}

\subsection{HIERARCHY OF NODES}

This work is based on the results of previous research:
in~\cite{gankevich2015subordination} we developed an algorithm that builds a
tree hierarchy from a strictly ordered set of cluster nodes. The sole purpose
of this hierarchy is to make a cluster more fault-tolerant by introducing
multiple master nodes. If a master node fails, its subordinates try to
connect to another node from the same or a higher level of the hierarchy. If
there is no such node, one of the subordinates becomes the master.

The position of a node in the hierarchy is determined by mapping the position
of its IP address in the network to a layer and an offset in the tree. The
number of layers is controlled by a fan-out value. Since nodes' IP addresses
change infrequently, this mapping is mostly static and is affected only by
node failures. Thus, with the help of the tree hierarchy we can determine the
IP address of a master node precisely, without resorting to the costly leader
election algorithms that are commonly used for this purpose.

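To illustrate the mapping, the following C++ sketch computes the layer and
offset of a node from its position in the ordered set of IP addresses,
assuming a complete tree with the given fan-out; the names and formulae here
are a simplified illustration of the idea rather than the exact ones
from~\cite{gankevich2015subordination}.

\begin{verbatim}
#include <cstdint>

// Layer and offset of a node in a complete tree with the given fan-out.
struct Tree_position { std::uint64_t layer, offset; };

Tree_position map_position(std::uint64_t index, std::uint64_t fanout) {
    std::uint64_t layer = 0, first = 0, count = 1;
    while (index >= first + count) {
        first += count;   // skip the current layer
        count *= fanout;  // the next layer is `fanout' times larger
        ++layer;
    }
    return {layer, index - first};
}

// Index of the principal (parent) node in the same layout.
std::uint64_t principal_index(std::uint64_t index, std::uint64_t fanout) {
    return index == 0 ? 0 : (index - 1) / fanout;
}
\end{verbatim}
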
We use this hierarchy to perform load balancing across neighbouring cluster
nodes (nodes that are adjacent in the hierarchy), i.e.~if a job is launched
on a subordinate node, its principal node also receives a fraction of the
load. This rule makes the system symmetric: each node runs the same software,
and switching from a failed master node to a backup node is just a matter of
changing the node's role. A similar design choice is applied in distributed
key-value stores~\cite{anderson2010couchdb,lakshman2010cassandra} to handle
the failure of a master node, but we are not aware of job schedulers that use
this approach to distribute the load on a cluster with multiple master nodes.

\subsection{HIERARCHY OF COMPUTATIONAL KERNELS}

Each programme that runs on top of the tree hierarchy is composed of
computational kernels---objects that contain data and the code to process it.
To exploit parallelism a kernel may create an arbitrary number of subordinate
kernels, which are automatically spread first across available processor
cores and then across subordinate nodes in the tree hierarchy. The programme
is itself a kernel (without a parent, as it is executed by a user), which
either solves the problem sequentially on its own or creates subordinate
kernels to solve it in parallel.

Unlike the main function in programmes based on a message passing library,
the first computational kernel is initially run on one node only, and remote
nodes are used only when the local queue is overflowed with kernels. This
design choice allows the number of nodes to vary throughout the execution of
a programme, and more nodes to be taken for highly parallel parts of the
code. A somewhat similar choice was made in the design of the MapReduce
framework~\cite{dean2008mapreduce,vavilapalli2013yarn}---a user submitting a
job does not specify the number of hosts to run the job on, and the hosts
actually used are those where the input files are located.

From a mathematical point of view a kernel $K$ can be described as a
vector-valued functional which recursively maps a kernel to an $n$-component
vector of kernels:
\begin{equation*}
    K(f): \mathbb{K} \rightarrow \mathbb{K}^n
    \qquad
    \mathbb{K}^n = \left\{ f: \mathbb{K} \rightarrow \mathbb{K}^n \right\}.
\end{equation*}
The dummy kernel $\mathbb{O}: \mathbb{K} \rightarrow \mathbb{K}^0$, which
stops the recursion, is used to call the first kernel and to finish the
execution of the programme. The argument of each kernel call is interpreted
using the following rules (illustrated by the example below).
\begin{enumerate}
    \item If a kernel is new, then its argument is its parent kernel.
    \item If a kernel is the parent of the kernel that produced it, or some
      other existing kernel, then its argument is the kernel that produced
      it.
\end{enumerate}

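For example, suppose the first kernel $K_0$ is called with the dummy kernel
$\mathbb{O}$ as its argument and produces two subordinate kernels $K_1$ and
$K_2$. Each subordinate is new, so by the first rule it is called with its
parent $K_0$ as the argument. When $K_1$ finishes, it produces its parent,
and by the second rule $K_0$ is called with $K_1$ as the argument to collect
its data; the same happens with $K_2$. When $K_0$ has collected data from
both subordinates, it returns the dummy kernel and the programme finishes.
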
The engine that executes kernels is implemented as a simple loop. It starts
by calling the first kernel with a dummy kernel as the argument, then calls
each kernel that was produced by this call, and so forth. The loop finishes
when a dummy kernel is returned as the result of a call.

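A minimal sketch of this loop in C++ is given below; it is sequential, and
the \texttt{Kernel} type and its \texttt{call} method are hypothetical names
standing in for the framework's real interface, with a null argument and an
empty result vector playing the role of the dummy kernel.

\begin{verbatim}
#include <queue>
#include <utility>
#include <vector>

struct Kernel {
    virtual ~Kernel() = default;
    // Invoking a kernel with its argument produces zero or more kernels;
    // an empty vector plays the role of the dummy kernel.
    virtual std::vector<Kernel*> call(Kernel* argument) = 0;
};

void run(Kernel* first) {
    std::queue<std::pair<Kernel*, Kernel*>> pending;
    pending.push({first, nullptr});  // dummy argument for the first kernel
    while (!pending.empty()) {
        auto [kernel, argument] = pending.front();
        pending.pop();
        // By the rules above, the argument of every produced kernel is the
        // kernel that produced it.
        for (Kernel* produced : kernel->call(argument))
            pending.push({produced, kernel});
    }
}
\end{verbatim}
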
Since a kernel call may return multiple kernels, they are executed in
parallel. Parallel execution quickly produces a pool of kernels which permit
execution in an unspecified order. Several threads concurrently retrieve
kernels from the pool and may ``spill'' the remaining kernels to neighbouring
cluster nodes.

Kernels are implemented as closures---function objects containing all their
arguments, a reference to the parent kernel and user-supplied data. The data
is either processed upon the kernel call, or subordinate kernels are created
to process it in parallel. When the processing is complete, the parent kernel
closure is called with its subordinate kernel as the argument to collect data
from it.

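Continuing the earlier sketch of the \texttt{Kernel} interface, a
hypothetical kernel that sums an array could look as follows: on its first
call it either processes its data sequentially or creates two subordinate
kernels, and on later calls it collects the partial result carried by a
finished subordinate (all names here are ours, not the framework's).

\begin{verbatim}
#include <cstddef>

struct Sum_kernel : public Kernel {
    Kernel* parent = nullptr;   // reference to the parent kernel
    std::vector<double> data;   // user-supplied data (the closure state)
    double result = 0;
    int pending = 0;            // subordinates that have not reported yet

    std::vector<Kernel*> call(Kernel* argument) override {
        if (argument == parent) {              // rule 1: the kernel is new
            if (data.size() < 1024) {          // small data: process it here
                for (double x : data) result += x;
                return parent ? std::vector<Kernel*>{parent}
                              : std::vector<Kernel*>{};
            }
            auto* left  = new Sum_kernel;      // otherwise subdivide the data
            auto* right = new Sum_kernel;
            left->parent = right->parent = this;
            std::size_t half = data.size() / 2;
            left->data.assign(data.begin(), data.begin() + half);
            right->data.assign(data.begin() + half, data.end());
            pending = 2;
            return {left, right};
        }
        // rule 2: a subordinate has finished -- collect its data
        auto* child = static_cast<Sum_kernel*>(argument);
        result += child->result;
        delete child;
        if (--pending == 0 && parent) return {parent};  // report upwards
        return {};
    }
};
\end{verbatim}
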
\subsection{HANDLING SINGLE NODE FAILURES}

The basic strategy to overcome a failure of a subordinate node is to restart
the corresponding kernels on a healthy node---a strategy employed in the
Erlang language to restart failed subordinate
processes~\cite{armstrong2003thesis}. To implement it we record every kernel
that is sent to a remote cluster node, and in the event of a node failure
these kernels are simply rescheduled to other subordinate nodes with no
special handling from the programmer. If there are no nodes to send kernels
to, they are scheduled locally. So, in contrast to heavy-weight
checkpoint/restart machinery, the tree hierarchy allows automatic and
transparent handling of subordinate node failures without restarting parallel
processes on every node.

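A minimal sketch of this bookkeeping is shown below; the \texttt{Node}
identifier, the interface and the local queue are hypothetical placeholders
for the framework's real data structures.

\begin{verbatim}
#include <map>
#include <queue>
#include <set>

struct Kernel;      // the kernel type from the sketches above

using Node = int;   // placeholder for a node identifier

struct Remote_log {
    std::map<Node, std::set<Kernel*>> in_flight;  // kernels sent to each node
    std::queue<Kernel*> local_queue;              // kernels scheduled locally

    void sent(Node node, Kernel* k)      { in_flight[node].insert(k); }
    void completed(Node node, Kernel* k) { in_flight[node].erase(k); }

    // On failure the recorded kernels are rescheduled: here they go back to
    // the local queue, from which they may be sent to other healthy nodes.
    void node_failed(Node node) {
        for (Kernel* k : in_flight[node]) local_queue.push(k);
        in_flight.erase(node);
    }
};
\end{verbatim}
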
A possible way of handling a failure of the node where the first kernel is
located is to replicate this kernel to a backup node, and make all updates to
its state propagate to the backup node by means of a distributed transaction.
However, this approach does not play well with the asynchronous nature of
computational kernels. Fortunately, the first kernel usually does not perform
operations in parallel; rather, it launches execution steps sequentially one
by one, so it has only one subordinate at a time. Keeping this in mind, we
can simplify the synchronisation of its state: we send the first kernel along
with its subordinate to the subordinate node. When the node with the first
kernel fails, its copy receives its subordinate, and no execution time is
lost. When the node with the copy fails, its subordinate is rescheduled on
some other node, and a whole step of computation is lost in the worst case.

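The following sketch illustrates this piggy-backing; the message layout and
the \texttt{serialise} function are hypothetical, the point being that a copy
of the parentless first kernel travels in the same message as its only
subordinate.

\begin{verbatim}
#include <vector>

struct Kernel;                         // the kernel type from above
std::vector<char> serialise(Kernel*);  // provided elsewhere (hypothetical)

struct Message {
    std::vector<char> subordinate;     // the subordinate kernel being sent
    std::vector<char> first_kernel;    // copy of the first kernel, if any
};

Message make_message(Kernel* subordinate, Kernel* parent,
                     bool parent_is_first) {
    Message m;
    m.subordinate = serialise(subordinate);
    if (parent_is_first)                    // only the parentless first
        m.first_kernel = serialise(parent); // kernel is replicated
    return m;
}
\end{verbatim}
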
The described approach works only for kernels that do not have a parent and
have only one subordinate at a time, which means that they act as optimised
checkpoints. The advantage is that they save results after each sequential
step, when the memory footprint of a programme is low, they save only the
relevant data, and they use the memory of a subordinate node instead of
stable storage.

\subsection{HANDLING OUTAGES}

An electricity outage is a serious failure, so if there is no geographically
distant cluster that can share the load, the only choice is to hope that no
important data is lost and to restart every batch job after full site
recovery. To reduce the restart time, the first kernel of each job may save
its state (which is small compared to the full state of a job) to some stable
storage. Such a scenario complicates the design of a distributed system, so
it was not considered in this paper.

\subsection{IMPLEMENTATION}

For efficiency reasons the fault tolerance techniques described above are
implemented in a C++ framework: from the authors' perspective the C language
is deemed too low-level for distributed programmes, and Java incurs too much
overhead and is not popular in the HPC community. To use the framework
without a job scheduler, we need to implement a daemon that maintains the
state of the hierarchy of nodes and exposes an API to interact with it. As of
now, the framework runs in the same process as the parallel application that
uses it. The framework is called Factory; it is currently in the
proof-of-concept development stage.