iccsa-15-novel-appr.tex (7162B)
1 \documentclass[runningheads]{llncs} 2 3 \usepackage{amssymb} 4 \usepackage{amsmath} 5 \setcounter{tocdepth}{3} 6 \usepackage{graphicx} 7 8 \newcommand{\keywords}[1]{\par\addvspace\baselineskip 9 \noindent\keywordname\enspace\ignorespaces#1} 10 11 \begin{document} 12 13 \mainmatter 14 15 \title{Novel approaches for distributing workload on~commodity computer systems} 16 \titlerunning{Novel approaches for distributing workload} 17 18 \author{Ivan Gankevich \and Yuri Tipikin \and Alexander Degtyarev \and Vladimir Korkhov} 19 \authorrunning{I.~Gankevich \and Yu.~Tipikin \and A.~Degtyarev \and V.~Korkhov} 20 21 \institute{Saint Petersburg State University,\\ 22 Universitetskii 35, Petergof, 198504, Saint Petersburg, Russia\\ 23 igankevich@yandex.com, yuriitipikin@gmail.com 24 } 25 26 \toctitle{Novel approaches for distributing workload on commodity computer systems} 27 \tocauthor{Ivan Gankevich, Yuri Tipikin, Alexander Degtyarev, Vladimir Korkhov} 28 \maketitle 29 30 \graphicspath{ {figures/} } 31 32 \begin{abstract} 33 Efficient management of a distributed system is a common problem for university's and commercial computer centres, and handling node failures is a major aspect of it. Failures which are rare in a small commodity cluster, at large scale become common, and there should be a way to overcome them without restarting all parallel processes of an application. The efficiency of existing methods can be improved by forming a hierarchy of distributed processes. That way only lower levels of the hierarchy need to be restarted in case of a leaf node failure, and only root node needs special treatment. Process hierarchy changes in real time and the workload is dynamically rebalanced across online nodes. This approach makes it possible to implement efficient partial restart of a parallel application, and transactional behaviour for computer centre service tasks. 34 35 \keywords{long-lived transactions, distributed pipeline, node discovery, software engineering, distributed computing, cluster computing} 36 \end{abstract} 37 38 \section*{Introduction} 39 40 There are two main tasks for a computer centre: to run users' parallel applications, and to service users. Often these tasks are delegated to some distributed systems, so that users can service themselves, and often these systems are heterogeneous and there are many of them in a computer centre. Yet another system is introduced to make them consistent, but this system is usually not distributed. Moreover, the system usually does not allow rolling back a distributed transaction spanning its subordinate systems. So, the use of reliable distributed systems under control of an unreliable one makes the whole system poorly orchestrated, and absence of automatic error-recovery mechanism increases amount of manual work to bring the system up after a failure. 41 42 To orchestrate computations and perform service tasks in distributed environment the system should be decomposed into two levels. The top level of this system is occupied by transaction manager which executes long-lived transactions (a transaction consisting of nested subordinate ones which spans a long period of time and provides relaxed consistency guarantees). Transactions are distributed across cluster nodes and are retried or rolled back in case of a system failure. On the second level a distributed pipeline is formed from cluster nodes to process compute-intensive workloads with automatic restart of failed processes. The goal of this decomposition is to separate service tasks which need transactional behaviour from parallel applications which need only restart of failed processes (without a roll back). On both levels of the system a hierarchy is used to achieve these goals. 43 44 The first level of the system is capable of reliably executing ancillary and configuration tasks which are typical to university's computer centres. Long-lived transactions allow for a task to run days, months and years, until the transaction is complete. For example, a typical task of registering a user in a computer centre for a fixed period of time (the time span of his/her research grant) starts after the user submits registration form and ends when the research is complete. Additional tasks (e.g.~allocation of computational resources, changing quotas and custom configuration) are executed as subordinate of the main one. Upon completion of the work tasks are executed in reverse, reconfiguring the system to its initial state and erasing or archiving old data. 45 46 The rest of the paper describes the structure of two levels of the system and investigates their performance in a number of tests. Long-lived transactions are discussed in Section~\ref{sec:trans} and distributed pipeline (lower level) in Section~\ref{sec:pipe}. 47 48 \input{sec-1} 49 \input{sec-2} 50 51 \section*{Related work} 52 53 In~\cite{lifflander2014scalable} the authors discuss the use of message logging to implement efficient recovery from a failure. They observed that some messages written to log are commutative (can be reordered without changing the output of a program) and used this assumption to optimise recovery process. In our system objects in sent buffer are used instead of messages in a log, they are commutative within bounds of the buffer but do not represent history of all messages sent to another node. They are deleted upon receiving a reply, and deleted object can be safely ignored when recovering from a hard fault. So, when an object depends on parallel execution of its subordinates, results can be collected in any order, and it is often desirable to log only execution of principal to reduce recovery time. 54 55 In~\cite{tel2000introduction} the author introduces general distributed wave algorithms for leader election. In contrast to distributed pipeline, these algorithms assume that there can be only one leader and as a result do not take into account link speed between nodes. The goal of discovery algorithm is to create a framework of leaders to distribute workload on a cluster, and each leader is found by recursively repeating election process for the new level of a hierarchy. Using hierarchy of leaders simplifies election algorithm, and taking into account links between nodes allows efficient mapping of resulting hierarchy to physical topology of a network. 56 57 \section*{Conclusions and future work} 58 59 Distributed pipeline is a general method for distributing workload on a commodity cluster which relies on dynamically reconfigurable virtual topology and reliable data transfer. It primarily focuses on service tasks but can be used for high-performance computing as well. The future work is to find a simple way to make distributed pipeline resilient to the root node failures, and test applicability of this approach to high-performance computing applications. 60 61 \subsubsection*{\ackname} The research was carried out using computational resources of Resource Centre ``Computational Centre of Saint Petersburg State University'' (T-EDGE96 HPC-0011828-001) within frameworks of grants of Russian Foundation for Basic Research (project no. 13-07-00747) and Saint Petersburg State University (projects no. 9.38.674.2013 and 0.37.155.2014). 62 63 \bibliography{references}{} 64 \bibliographystyle{ieeetr} 65 66 \end{document}