\documentclass[conference]{HPCSTrans}
\IEEEoverridecommandlockouts
\usepackage{amsmath,amssymb,enumerate}
\usepackage{cite,graphics,graphicx}

% correct bad hyphenation here
%\hyphenation{op-tical net-works semi-conduc-tor}
\renewcommand{\labelitemi}{$\bullet$}
\renewcommand{\thefootnote}{\alph{footnote}}
\begin{document}
%
% paper title
% can use linebreaks \\ within to get better formatting as desired
\title{Subordination:\\ {\huge Cluster Management without Distributed Consensus}}


% author names and affiliations
% use a multiple column layout for up to three different
% affiliations
\author{\IEEEauthorblockN{Ivan Gankevich, Yuri Tipikin}
\IEEEauthorblockA{Dept. of Computer Modeling and Multiprocessor Systems\\
Saint Petersburg State University\\
Saint Petersburg, Russia\\
igankevich@yandex.com, yuriitipikin@gmail.com
}
\and
\IEEEauthorblockN{Vladimir Gaiduchok}
\IEEEauthorblockA{Dept. of Computer Science and Engineering\\
Saint Petersburg Electrotechnical University ``LETI''\\
Saint Petersburg, Russia\\
gvladimiru@gmail.com
}
}

% conference papers do not typically use \thanks and this command
% is locked out in conference mode. If really needed, such as for
% the acknowledgment of grants, issue a \IEEEoverridecommandlockouts
% after \documentclass

% for over three affiliations, or if they all won't fit within the width
% of the page, use this alternative format:
%
%\author{\IEEEauthorblockN{Michael Shell\IEEEauthorrefmark{1},
%Homer Simpson\IEEEauthorrefmark{2},
%James Kirk\IEEEauthorrefmark{3},
%Montgomery Scott\IEEEauthorrefmark{3} and
%Eldon Tyrell\IEEEauthorrefmark{4}}
%\IEEEauthorblockA{\IEEEauthorrefmark{1}School of Electrical and Computer Engineering\\
%Georgia Institute of Technology,
%Atlanta, Georgia 30332--0250\\ Email: see http://www.michaelshell.org/contact.html}
%\IEEEauthorblockA{\IEEEauthorrefmark{2}Twentieth Century Fox, Springfield, USA\\
%Email: homer@thesimpsons.com}
%\IEEEauthorblockA{\IEEEauthorrefmark{3}Starfleet Academy, San Francisco, California 96678-2391\\
%Telephone: (800) 555--1212, Fax: (888) 555--1212}
%\IEEEauthorblockA{\IEEEauthorrefmark{4}Tyrell Inc., 123 Replicant Street, Los Angeles, California 90210--4321}}


% use for special paper notices
\IEEEspecialpapernotice{\large POSTER PAPER}

% make the title area
\maketitle

\begin{abstract}
Nowadays, many cluster management systems rely on distributed consensus
algorithms to elect a leader that orchestrates subordinate nodes. Contrary to
these studies, we propose a consensus-free algorithm that arranges cluster nodes
into multiple levels of subordination. The algorithm structures the IP address
range of the cluster network so that each node has a ranked list of candidates,
from which it chooses a leader. The results show that this approach easily
scales to a large number of nodes due to its asynchronous nature, and enables
fast recovery from node failures, as each failure affects only one level of the
hierarchy. Multiple levels of subordination are useful for efficiently
collecting monitoring and accounting data from a large number of nodes, and for
scheduling general-purpose tasks on a cluster.
\end{abstract}

\vspace{0.1in}
\begin{IEEEkeywords}
job scheduling, leader election, cluster monitoring, cluster accounting, cluster management
\end{IEEEkeywords}

\IEEEpeerreviewmaketitle

\section{INTRODUCTION}

Many distributed systems rely on a leader that manages a computer cluster and
schedules jobs on other nodes. The leader role can be assigned either statically
-- to a particular physical node -- or dynamically through an election process.
In the first case fault tolerance is achieved by duplicating leader
functionality on a reserved node, which is used as a substitute in case the
current leader fails. In the second case, fault tolerance is guaranteed by
re-electing a failed leader from the surviving nodes. Although dynamic election
requires a dedicated distributed algorithm, this approach is becoming more and
more popular because it does not require spare nodes to survive a leader
failure.

Leader election algorithms (which are sometimes referred to as \emph{distributed
consensus} algorithms) are special cases of wave algorithms.
In~\cite{tel2000introduction} Tel defines them as algorithms in which the
termination event is preceded by at least one event occurring in each parallel
process. Wave algorithms are not defined for anonymous networks, that is, they
do not apply to processes that cannot uniquely identify themselves. However, the
number of processes can be determined in the course of the algorithm. In a
distributed system this means that wave algorithms work for clusters with a
dynamically changing number of nodes, where nodes can go online and offline.

Contrary to this, our approach does not use wave algorithms, and hence does not
require communicating with each node in the cluster to determine a leader.
Instead, each node enumerates all nodes in the network it is part of, and
converts this enumeration to a tree with a user-defined maximal fan-out value.
Then the node determines its level of the hierarchy and tries to communicate
with nodes from higher levels to become their subordinate. First, it checks the
closest ones and then goes all the way to the top. If it cannot find any node on
a higher level, it becomes the root node.

Enumeration of all hosts in a network defines a strict total order on the set of
cluster nodes. Although technically any function can be chosen to map a node to
a number, in practice this function should be sufficiently smooth and may have
only infrequent jumps. For example, high-frequency oscillations
(e.g.~representing measurement error) may result in rapid changes in the
hierarchy of nodes, which makes the hierarchy useless for end users. Thus, any
such peculiarities should be smoothed out to make the mapping useful. The ideal
mapping function is a line, which represents a static node mapping.

Smoothness of the mapping function ensures stability of the hierarchy of nodes
and minimises the overhead of hierarchy changes. The simplest such function is
the position of an IP address in the network IP address range: a line which
jumps only when the network configuration is changed (which is a rare occasion).

So, the main feature of this algorithm is the relation of subordination. It
makes hierarchy updates local to a particular level and branch of the hierarchy,
and allows precise determination of each node's leader.
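
For illustration, the leader search described above can be sketched in Python as
follows. This is only a minimal sketch under simplifying assumptions, not the
implementation evaluated in this paper; \texttt{try\_connect} is a hypothetical
helper that reports whether a candidate node responds.
\begin{verbatim}
# Minimal sketch of the leader search: probe
# higher-ranked candidates from the closest one
# upwards; become the root if none responds.
def find_leader(my_rank, all_ranks, try_connect):
    candidates = sorted(
        r for r in all_ranks if r < my_rank)
    for candidate in reversed(candidates):
        if try_connect(candidate):
            return candidate  # subordinate to it
    return None  # no candidate: act as root
\end{verbatim}
In the actual algorithm the ranks are positions in the subordination tree
described in Section~\ref{sec:tree}, so in a dense network a node typically
probes its leader only.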

\section{RELATED WORK}

One point that distinguishes our approach from some existing
proposals~\cite{brunekreef1996design,aguilera2001stable,romano2014design} is
that our algorithm elects multiple leaders, thus creating a \emph{leaders'
framework} with multiple levels of subordination. The topology of this framework
reflects the topology of the underlying physical network as much as possible;
however, if the number of nodes connected to a single switch is large,
additional levels of subordination may be introduced.

In contrast to many leader election algorithms, our algorithm is not intended to
manage updates to some distributed database. The main application of the
leaders' framework is to distribute workload across a large number of cluster
nodes. Typically one cluster is managed by one master server (possibly by two
servers to improve fault tolerance), which collects monitoring and accounting
data, issues cluster-wide configuration commands, and launches jobs. When the
cluster becomes large, the master server may not cope with the load. In this
case, introducing subordinate servers solves the issue.

The method described in the following sections relies on node performance and
node latency (for geographically distributed clusters) being stable. Otherwise
the method produces a randomly changing framework of leaders, which does not
allow the load to be distributed efficiently. For local clusters the method is
always stable, as the leader is determined by its position in the network IP
address range, which rarely changes.

To summarise, the method described in the following sections may not be suitable
for managing updates to a distributed database, or for environments where IP
addresses change frequently. Its sole purpose is to distribute the load on a
cluster with a large number of nodes.

\section{METHODS}

\subsection{NODE MAPPING}

The relation of subordination can be defined as follows. Let $\mathcal{N}$ be
the set of cluster nodes connected to a network, then
\begin{align*}
\forall n_1 \forall n_2 &\in \mathcal{N}, \forall f \colon \mathcal{N} \rightarrow \mathcal{R}^n \\
&\Rightarrow (f(n_1) < f(n_2) \Leftrightarrow
\neg (f(n_1) \geq f(n_2))),
\end{align*}
where $f$ maps a node to a number and the operator $<$ defines a strict total
order on $\mathcal{R}^n$. Thus, $f$ is the function that defines a node's rank,
allowing nodes to be compared to each other, and $<$ is the binary relation of
subordination, which ensures that the mapping is bijective.

The simplest function $f$ maps each node to the position of its IP address in
the network IP address range. Without conversion to a tree -- when only
\emph{one} leader is allowed in the network -- the node with the lowest position
in this range becomes the leader. If a node occupies the first position in the
range, then there is no leader for it. Although the IP address mapping is simple
to implement, it introduces an artificial dependence of the leader role on the
address of a node. Still, it is useful for the initial configuration of a
cluster when more complex mappings are not possible.

A more sophisticated mapping may use performance to rank nodes in a network.
Sometimes it is difficult to determine the performance of a computer: the same
computers may show varying performance for different applications, and the same
applications may show varying performance for different computers. In this work
performance is estimated as the number of jobs completed per unit of time (which
makes it dependent on the type of workload).

Several nodes may have the same performance, so using it alone as the function
$f$ may violate strict total order. To implement performance-wise ranking, the
two mappings can be combined into a compound one:
\begin{equation*}
f(n) = \langle 1/\text{perf}(n), \text{ip}(n) \rangle.
\end{equation*}
Here \textit{perf} is the performance of a node estimated as the number of jobs
completed per unit of time, and \textit{ip} is the mapping of a node to the
position of its IP address in the network IP address range. So, with the
compound mapping each node is characterised by both its performance and its
position in the network address range. When performance data is available, it
supersedes the node's position in the network for ranking.

For a cluster with a \emph{linear} topology (all computers connected to a single
switch) knowing performance is enough to rank nodes; however, for a distributed
system network latency may become a more important metric. For example, a
high-performance node in a distant cluster may not be the best choice for a
leader if there are intermediate nodes on the way to it. Moreover, network
latency is a function of at least two nodes, and using it in the mapping makes
the mapping dependent on the node which produced it. Thus, each node in the
cluster has its own ranked list of candidates, and a node may occupy a different
position in each list. Since each node has its own ranked list, multiple leaders
are possible within the network. Finally, measuring network latency for each
node introduces substantial overhead, which can be minimised by conversion to a
tree (see Section~\ref{sec:tree}).

Although putting network latency into the mapping along with other metrics seems
feasible, some works suggest that programme speedup depends on the ratio of
network latency to node performance rather than on each metric separately. For
example, in~\cite{deg2003,soshmina2007,andrianov2007} the authors suggest a
generalisation of Amdahl's law for computer clusters. The formula
\begin{equation*}
S_N = \frac{N}{1 - \alpha + \alpha N + \beta\gamma N^3}
\end{equation*}
shows the speedup of a parallel programme on a cluster, taking into account
communication overhead. Here $N$ is the number of nodes, $\alpha$ is the
parallel portion of a programme, $\beta$ is the diameter of the system (the
maximal number of intermediate nodes a message passes through when any two nodes
communicate), and $\gamma$ is the ratio of node performance to network link
performance. Speedup reaches its maximal value at $N =
\sqrt[3]{(1-\alpha)/(2\beta\gamma)}$, so the higher the link performance, the
higher the speedup that can be achieved. This particular statement justifies the
use of the ratio of network latency to node performance in the mapping, so that
the final mapping is the following:
\begin{equation*}
f(n) = \langle \text{lat}(n)/\text{perf}(n), \text{ip}(n) \rangle,
\end{equation*}
where \textit{lat} is the measured network latency between the node which
composes the ranked list and the current node in the list.
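
A minimal Python sketch of this compound mapping is given below. It is an
illustration only, not the implementation used in the experiments; the network
address range and the \texttt{perf} and \texttt{lat} values are hypothetical
inputs.
\begin{verbatim}
# Sketch of the compound mapping <lat/perf, ip>.
import ipaddress

def ip_position(addr, network):
    # Position of an IP address in the
    # network IP address range.
    net = ipaddress.ip_network(network)
    return (int(ipaddress.ip_address(addr))
            - int(net.network_address))

def rank_key(addr, network, perf, lat):
    # Lower keys rank higher as candidates.
    return (lat / perf,
            ip_position(addr, network))

nodes = [("192.168.1.10", 200.0, 0.5),
         ("192.168.1.2", 100.0, 0.5)]
ranked = sorted(nodes, key=lambda n: rank_key(
    n[0], "192.168.1.0/24", n[1], n[2]))
print([n[0] for n in ranked])  # faster first
\end{verbatim}
With such a key the faster node wins even though it occupies a later position in
the address range; the IP position only breaks ties between nodes with equal
latency-to-performance ratios.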

So, putting network latency into the mapping unnecessarily complicates the
configuration of a local cluster, but can be beneficial for a cluster
federation. In the case of a local homogeneous cluster, the IP address mapping
should be enough to determine the leader.

\subsection{SUBORDINATION TREE}
\label{sec:tree}

To make the leader election algorithm scale to a large number of nodes, the
enumeration of nodes is converted to a tree. In the tree each node is uniquely
identified by its level $l$ and offset $o$, which are computed as follows:
\begin{align*}
l(n) &= \lfloor \log_p n \rfloor, \\
o(n) &= n - p^{l(n)},
\end{align*}
where $n$ is the position of the node's IP address in the network IP address
range, and $p$ is the maximal number of subordinate nodes. The leader of a node
with level $l$ and offset $o$ has level $l-1$ and offset $\lfloor o/p \rfloor$.
The distance between any two nodes in the tree with network positions $i$ and
$j$ is computed as
\begin{align*}
& \langle
\text{lsub}(l(j), l(i)),
\left| o(j) - o(i) \right|
\rangle,\\
& \text{lsub}(l_1, l_2) =
\begin{cases}
\infty & \quad \text{if } l_1 \geq l_2, \\
l_1 - l_2 & \quad \text{otherwise}.
\end{cases}
\end{align*}
The distance is a compound value so that the level component dominates.

To determine its leader, a node ranks all nodes in the network according to the
mapping $\langle l(\text{ip}(n)), o(\text{ip}(n)) \rangle$, and, using the
distance formula, chooses the node which is closest to the computed leader's
position and has a lower position in the ranking. This way offline nodes are
skipped; however, for sparse networks (i.e.~networks where nodes have
non-contiguous IP addresses) a perfect tree is not guaranteed.

Ideally the number of messages sent over the network by a node is constant. This
means that when the network is dense (e.g.~there are no offline nodes and no
gaps in the IP addresses), the node communicates with its leader only. Hence, in
contrast to a full network scan, our election algorithm scales well to a large
number of nodes (Section~\ref{sec:results}).

Level and offset are only useful for building the subordination tree for linear
topologies (all computers connected to a single switch). For topologies where
links between nodes are not homogeneous, the tree can be built by adding network
latency to the mapping. Typically a distributed system runs on top of a
\emph{tree} physical topology (i.e.~nodes connected to multiple switches), but
geographically distributed clusters may form a topology with a few cycles.
Measuring network latency helps a node to find the leader which is closest to it
in the physical topology. So, the algorithm for linear and non-linear topologies
differs only in the mapping.

In the second phase of the algorithm -- when a node has found its leader -- the
node measures network latency to the subordinates of this leader and to its own
leader. So, only $p$ nodes are probed by each node, but if the node changes its
leader, the process repeats. It is difficult to estimate the total number of
repetitions for a given topology, as it depends on the IP addresses of the
nodes, so there is no guarantee that this number will be smaller than for
probing each node in the network.
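
The level, offset, and leader computations above can be illustrated with a short
Python sketch (an illustration of the formulas only, not the code used in the
experiments):
\begin{verbatim}
# Level, offset and leader position in the
# subordination tree with fan-out p.
def level(n, p):
    # l(n) = floor(log_p(n)), integers only.
    l = 0
    while p ** (l + 1) <= n:
        l += 1
    return l

def offset(n, p):
    # o(n) = n - p^l(n)
    return n - p ** level(n, p)

def leader(n, p):
    # Leader: level l-1, offset floor(o/p).
    return (level(n, p) - 1,
            offset(n, p) // p)

# Node at network position 5 with p = 2:
# level 2, offset 1, leader at level 1,
# offset 0, i.e. network position 2**1 + 0 = 2.
print(level(5, 2), offset(5, 2), leader(5, 2))
\end{verbatim}
The distance formula is then used to pick, among the online nodes, the one whose
level and offset are closest to the computed leader's position.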

The main goal of this algorithm is to minimise the number of packets transmitted
over the network per unit of time while finding the leader when the number of
nodes is unknown, rather than to maximise performance. Since changes in the
network topology are infrequent, the algorithm needs to be run only on the
initial cluster installation, and the resulting hierarchy of nodes can be saved
to persistent storage for later retrieval in case of a node restart.

So, for a distributed system the shape of the subordination tree is close to
that of the physical topology, and for a cluster with a single switch the shape
is determined by the maximal number of subordinates $p$ a node can have. The
goal of this tree is to optimise the performance of cluster management tasks,
which are discussed in Section~\ref{sec:disc}.

\subsection{EVALUATION ON VIRTUAL NETWORK}

The test platform consisted of a multi-processor node, and Linux network
namespaces were used to consolidate a virtual cluster with a varying number of
nodes on a single physical node. A similar approach was used in a number of
works~\cite{lantz2010network,handigol2012reproducible,heller2013reproducible}.
Its advantage is that the tests can be performed on a single machine, without
the need for a physical cluster. Tests were repeated multiple times to reduce
the influence of processes running in the background. Each subsequent test run
was separated from the previous one by a delay to give the operating system time
to release resources, clean up files and flush buffers.

The performance test was designed to compare the subordination tree build time
in two cases. In the first case each node performed a full network scan to
determine online nodes, chose the leader, and then sent a confirmation message
to it. In the second case each node used the IP mapping to determine the leader
without a full network scan, and then sent a confirmation message to it. So,
this test measured the effect of using the IP mapping, and only one leader was
chosen for all nodes.

The subordination tree test was designed to check that the resulting trees are
stable for different maximal numbers of subordinate nodes. For that purpose
every change in the tree topology was recorded in a log file, and after 30
seconds every test run was forcibly stopped. The test was performed for 100--500
nodes. For this test additional physical nodes were used to accommodate the
large number of parallel processes (one node per 100 processes).

\section{RESULTS}
\label{sec:results}

The performance test showed that using the IP mapping can speed up subordination
tree build time by up to 200\% for 40 nodes (Fig.~\ref{fig:discovery}), and this
number increases with the number of nodes. This is expected behaviour, since the
overhead of sending messages to each node is eliminated, and the predefined
mapping is used to find the leader. So, our approach is more efficient than a
full network scan. The absolute time of this test is expected to increase when
executed on a real network, and thus the performance gain may increase as well.

The subordination tree test showed that for each number of nodes a stable state
is reached well before 30 seconds (Fig.~\ref{fig:stab}). The absolute time of
this test may increase when executed on a real network. The resulting
subordination tree for 11 nodes is presented in Fig.~\ref{fig:tree}.

% \begin{table}
%   \centering
%   \caption{Time needed to reach stable subordination tree state.}
%   \begin{tabular}{ll}
%     \hline
%     No. of nodes & Time, s \\
%     \hline
%     3 & 0.8 \\
%     11 & 2.0 \\
%     33 & 3.4 \\
%     77 & 5.4 \\
%     \hline
%   \end{tabular}
%   \label{tab:stab}
% \end{table}

\section{DISCUSSION}
\label{sec:disc}

Multiple levels of subordination are beneficial for a range of management tasks,
especially for resource monitoring and usage accounting. A typical monitoring
task involves probing each cluster node to determine its state (offline or
online, in need of service, etc.). Usually probing is done from one node, and in
the case of a large cluster, introducing intermediate nodes that collect the
data and send it to the master helps distribute the load. In the subordination
tree each level of the hierarchy adds another layer of intermediate nodes, so
the data can be collected efficiently.

The same data collection (or distribution) pattern occurs when retrieving
accounting data, and in distributed configuration systems, when configuration
files need to be distributed across all cluster nodes. Subordination trees cover
all these use cases.

\section{CONCLUSION}

Preliminary tests showed that multiple levels of subordination can be easily
built for different numbers of nodes with the fast IP mapping. The approach is
more efficient than a full network scan. The resulting subordination tree can
optimise a range of typical cluster management tasks. Future work is to test how
the latency-based mapping works for geographically distributed systems.

\section*{ACKNOWLEDGEMENTS}

The research was carried out using computational resources of the Resource
Centre ``Computational Centre of Saint Petersburg State University'' (T-EDGE96
HPC-0011828-001) within the framework of grants from the Russian Foundation for
Basic Research (project no. 13-07-00747) and Saint Petersburg State University
(projects no. 9.38.674.2013 and 0.37.155.2014).

\bibliography{references}{}
\bibliographystyle{IEEEtran}

\begin{figure}
  \centering
  \includegraphics[width=0.7\columnwidth]{tree-11}
  \caption{Resulting subordination tree for 11 nodes.}
  \label{fig:tree}
\end{figure}

\begin{figure}
  \centering
  \includegraphics[width=0.8\columnwidth]{startup}
  \caption{Time needed to build the initial subordination tree with a full scan of each IP address in the network and with the IP mapping.}
  \label{fig:discovery}
\end{figure}

\begin{figure}
  \centering
  \includegraphics[width=0.8\columnwidth]{discovery}
  \caption{Time needed to reach a stable subordination tree state for a large number of nodes.}
  \label{fig:stab}
\end{figure}

% that's all folks
\end{document}