\section{METHODS}

\subsection{HIERARCHY OF NODES}

This work is based on the results of previous research: in~\cite{gankevich2015subordination} we developed an algorithm that builds a tree hierarchy from a strictly ordered set of cluster nodes. The sole purpose of this hierarchy is to make a cluster more fault-tolerant by introducing multiple master nodes. If a master node fails, its subordinates try to connect to another node on the same or a higher level of the hierarchy. If there is no such node, one of the subordinates becomes the master.

The position of a node in the hierarchy is determined by mapping the position of its IP address in the network to a layer and an offset in the tree. The number of layers is controlled by a fan-out value. Since nodes' IP addresses change infrequently, this mapping is mostly static and is affected only by node failures. Thus, with the help of the tree hierarchy we can determine the IP address of a master node precisely, without resorting to the costly leader election algorithms which are commonly used for this purpose.

We use this hierarchy to perform load balancing across neighbouring cluster nodes (nodes that are adjacent in the hierarchy), i.e.~if a job is launched on a subordinate node, its principal node also receives a fraction of the load. This rule makes the system symmetric: each node runs the same software, and switching from a failed master node to a backup node is just a matter of changing the node's role. A similar design choice is applied in distributed key-value stores~\cite{anderson2010couchdb,lakshman2010cassandra} to handle the failure of a master node, but we are not aware of job schedulers that use it to distribute the load on a cluster with multiple master nodes.
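To make the mapping described above concrete, the sketch below computes the layer and the offset of a node from its position in the ordered set of nodes and from the fan-out value, together with the position of its principal node. The zero-based enumeration and all names are illustrative assumptions, not the interface of the algorithm from~\cite{gankevich2015subordination}.
\begin{verbatim}
#include <cstdint>

// Position of a node in a complete tree with the given fan-out.
// Nodes are assumed to be enumerated from zero in the order of
// their IP addresses (illustrative assumption).
struct Tree_position {
    std::uint64_t layer;   // distance from the root node
    std::uint64_t offset;  // index of the node within its layer
};

Tree_position tree_position(std::uint64_t node, std::uint64_t fanout) {
    std::uint64_t layer = 0, layer_size = 1, first_in_layer = 0;
    while (node >= first_in_layer + layer_size) {
        first_in_layer += layer_size;  // skip the current layer
        layer_size *= fanout;          // the next layer is wider
        ++layer;
    }
    return {layer, node - first_in_layer};
}

// The principal (master) of any node other than the root is found
// without communication: it is the node's parent in the tree.
std::uint64_t principal(std::uint64_t node, std::uint64_t fanout) {
    return (node - 1) / fanout;
}
\end{verbatim}
With a fan-out of 2, for example, node 5 is placed on layer 2 with offset 2, and its principal is node 2; no messages are exchanged to establish this relation.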
\subsection{HIERARCHY OF COMPUTATIONAL KERNELS}

Each programme that runs on top of the tree hierarchy is composed of computational kernels---objects that contain data and the code to process it. To exploit parallelism, a kernel may create an arbitrary number of subordinate kernels, which are automatically spread first across available processor cores and then across subordinate nodes in the tree hierarchy. The programme is itself a kernel (without a parent, as it is executed by a user), which either solves the problem sequentially on its own or creates subordinate kernels to solve it in parallel.

Unlike the main function in programmes based on a message passing library, the first computational kernel is initially run on one node only, and remote nodes are used only when the local queue is overflown by kernels. This design choice allows a programme to use an arbitrary number of nodes throughout its execution and to take more nodes for highly parallel parts of the code. A somewhat similar choice was made in the design of the MapReduce framework~\cite{dean2008mapreduce,vavilapalli2013yarn}: a user submitting a job does not specify the number of hosts to run the job on, and the effective hosts are the hosts where the input files are located.

From a mathematical point of view, a kernel $K$ can be described as a vector-valued functional which recursively maps a kernel to an $n$-component vector of kernels:
\begin{equation*}
  K(f): \mathbb{K} \rightarrow \mathbb{K}^n
  \qquad
  \mathbb{K}^n = \left\{ f: \mathbb{K} \rightarrow \mathbb{K}^n \right\}.
\end{equation*}
A dummy kernel $\mathbb{O}: \mathbb{K} \rightarrow \mathbb{K}^0$, which stops the recursion, is used to call the first kernel and to finish the execution of the programme. The argument of each kernel call is interpreted using the following rules.
\begin{enumerate}
\item If a kernel is a new kernel, then its argument is its parent kernel.
\item If a kernel is the parent of the kernel that produced it, or some other existing kernel, then the argument is the kernel that produced it.
\end{enumerate}

The engine that executes kernels is implemented as a simple loop. It starts by calling the first kernel with a dummy kernel as the argument, then calls each kernel that was produced by this call, and so forth. The loop finishes when a dummy kernel is returned as the result of a call.

Since a kernel call may return multiple kernels, they are executed in parallel. Parallel execution quickly produces a pool of kernels which permit execution in an unspecified order. Several threads concurrently retrieve kernels from the pool and may ``spill'' the remaining kernels to neighbouring cluster nodes.

Kernels are implemented as closures---function objects containing all their arguments, a reference to the parent kernel and user-supplied data. The data is either processed upon the kernel call, or subordinate kernels are created to process it in parallel. When the processing is complete, the parent kernel closure is called with its subordinate kernel as the argument to collect data from it.
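A minimal sequential model of the kernel call convention and of the engine loop described above is sketched below. The class names and the single-threaded queue are illustrative simplifications, not the framework's API: as noted above, the real engine calls kernels from several threads and may spill them to neighbouring nodes.
\begin{verbatim}
#include <deque>
#include <memory>
#include <utility>
#include <vector>

// A computational kernel maps its argument to a vector of kernels.
struct Kernel {
    virtual ~Kernel() = default;
    virtual std::vector<std::shared_ptr<Kernel>>
    call(std::shared_ptr<Kernel> argument) = 0;
};

// The dummy kernel stops the recursion: it produces no kernels.
struct Dummy_kernel : Kernel {
    std::vector<std::shared_ptr<Kernel>>
    call(std::shared_ptr<Kernel>) override { return {}; }
};

// Call the first kernel with a dummy kernel as the argument, then
// call every kernel produced by the previous calls; finish when a
// dummy kernel is returned. By rules 1 and 2 above, the argument of
// a produced kernel is always the kernel that produced it.
void run(std::shared_ptr<Kernel> first) {
    auto dummy = std::make_shared<Dummy_kernel>();
    std::deque<std::pair<std::shared_ptr<Kernel>,
                         std::shared_ptr<Kernel>>> queue;
    queue.emplace_back(first, dummy);
    while (!queue.empty()) {
        auto [kernel, argument] = queue.front();
        queue.pop_front();
        for (auto& produced : kernel->call(argument)) {
            if (std::dynamic_pointer_cast<Dummy_kernel>(produced)) {
                return;  // the programme has finished
            }
            queue.emplace_back(produced, kernel);
        }
    }
}
\end{verbatim}
A parallel version of this sketch would replace the deque with a thread-safe pool from which several threads retrieve kernels, as described above.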
\subsection{HANDLING SINGLE NODE FAILURES}

The basic strategy to overcome a failure of a subordinate node is to restart the corresponding kernels on a healthy node---a strategy employed in the Erlang language to restart failed subordinate processes~\cite{armstrong2003thesis}. To implement it, we record every kernel that is sent to remote cluster nodes, and in the event of a node failure these kernels are simply rescheduled to other subordinate nodes with no special handling from the programmer. If there are no nodes to send kernels to, they are scheduled locally. So, in contrast to heavy-weight checkpoint/restart machinery, the tree hierarchy allows automatic and transparent handling of subordinate node failures without restarting parallel processes on every node.

A possible way of handling a failure of the node where the first kernel is located is to replicate this kernel to a backup node and to propagate all updates to its state to the backup node by means of a distributed transaction. However, this approach does not fit the asynchronous nature of computational kernels. Fortunately, the first kernel usually does not perform operations in parallel; rather, it launches execution steps sequentially one by one, so it has only one subordinate at a time. Keeping this in mind, we can simplify the synchronisation of its state: we send the first kernel along with its subordinate to the subordinate node. When the node with the first kernel fails, its copy receives its subordinate, and no execution time is lost. When the node with the copy fails, its subordinate is rescheduled on some other node, and in the worst case a whole step of the computation is lost.

The described approach works only for kernels that do not have a parent and have only one subordinate at a time, which means that they act as optimised checkpoints. The advantage is that they save results after each sequential step, when the memory footprint of the programme is low; they save only the relevant data; and they use the memory of a subordinate node instead of stable storage.

\subsection{HANDLING OUTAGES}

An electricity outage is a serious failure, so if there is no geographically distant cluster that can share the load, the only choice is to hope that no important data is lost and to restart every batch job after full site recovery. To reduce the restart time, the first kernel of each job may save its state (which is small compared to the full state of the job) to some stable storage. Such a scenario complicates the design of a distributed system, so it is not considered in this paper.

\subsection{IMPLEMENTATION}

For efficiency reasons, the fault tolerance techniques described above are implemented in a C++ framework: from the authors' perspective, the C language is too low-level for distributed programmes, and Java incurs too much overhead and is not popular in the HPC community. To use the framework without a job scheduler, we need to implement a daemon that maintains the state of the hierarchy of nodes and exposes an API to interact with it. As of now, the framework runs in the same process as the parallel application that uses it. The framework is called Factory; it is currently in the proof-of-concept development stage.
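As an illustration of the single node failure handling described above, the hypothetical sketch below records every kernel sent to a remote node and re-sends the recorded kernels when that node fails, falling back to the local queue when no subordinate nodes remain. The names, the round-robin choice of destination and the failure notification mechanism are assumptions made for the purpose of illustration, not the actual Factory implementation.
\begin{verbatim}
#include <cstddef>
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <utility>
#include <vector>

struct Kernel;  // kernel type as in the previous sketch

class Remote_scheduler {
public:
    // Record the kernel before transmitting it, so that it can be
    // re-sent if the destination node fails.
    void send(std::uint64_t node, std::shared_ptr<Kernel> k) {
        in_flight_[node].push_back(k);
        // ... serialise the kernel and transmit it to the node ...
    }

    // Called when a subordinate node is reported as failed:
    // reschedule its recorded kernels on the remaining subordinate
    // nodes, or locally if there are none left.
    void handle_failure(std::uint64_t failed_node,
                        const std::vector<std::uint64_t>& healthy) {
        std::vector<std::shared_ptr<Kernel>> kernels =
            std::move(in_flight_[failed_node]);
        in_flight_.erase(failed_node);
        std::size_t i = 0;
        for (auto& k : kernels) {
            if (healthy.empty()) {
                schedule_locally(k);  // no remote nodes left
            } else {
                send(healthy[i++ % healthy.size()], k);
            }
        }
    }

private:
    void schedule_locally(std::shared_ptr<Kernel> k) {
        // push the kernel to the local queue (omitted in this sketch)
        (void)k;
    }
    // kernels currently executing on remote nodes, grouped per node
    std::unordered_map<std::uint64_t,
                       std::vector<std::shared_ptr<Kernel>>> in_flight_;
};
\end{verbatim}
Because the framework runs in the same process as the application, such a record can be kept as a simple in-memory table; no special handling is required from the programmer.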