\begin{abstract}
Distributed computing clusters are often built from commodity hardware, whose
relatively low reliability leads to periodic failures of processing nodes.
While worker node fault tolerance is straightforward, fault tolerance of the
master node poses a bigger challenge. In this paper, master node failure
handling is based on the concept of master and worker roles that can be
dynamically re-assigned to cluster nodes, combined with maintaining a backup
of the master node state on one of the worker nodes. In this case, no special
component is needed to monitor the health of the cluster, and master node
failures can be resolved except in the case of simultaneous failure of the
master and the backup. We present an experimental evaluation of an
implementation of the technique and show benchmarks demonstrating that a
failure of the master does not affect a running job, and that a failure of the
backup results in re-computation of only the last job step.
\end{abstract}

\KEYWORD{parallel computing; Big Data processing; distributed computing; backup node; state transfer; delegation; cluster computing; fault-tolerance; high-availability; hierarchy}

\begin{bio}

\noindent Ivan Gankevich is a research assistant in computer science at the
Dept. of Computer Modelling and Multiprocessor Systems, Saint Petersburg State
University. His research interests include middleware for high-performance
computing, parallel programming, and large-scale ocean wave simulations.\vs{9}

\noindent Yuri Tipikin is a PhD student in computer science at the Dept. of
Computer Modelling and Multiprocessor Systems, Saint Petersburg State
University. His research interests include middleware for high-performance
computing, task scheduling and resource allocation algorithms.\vs{9}

\noindent Dr.
Vladimir Korkhov is an associate professor at the Computer
Modeling and Multiprocessor Systems department, Faculty of Applied Mathematics
and Control Processes, St. Petersburg State University, Russia. He received his
PhD degree from the University of Amsterdam in 2009 with a thesis on
hierarchical resource management in Grid computing. He has participated in a
number of national and international projects on distributed and grid
computing, and in a number of RFBR-funded projects. As a post-doctoral
researcher he worked at the Academic Medical Center of the University of
Amsterdam and at Charit\'e --- Medical University of Berlin on applying grid
technology to bioinformatics and medical applications. His research interests
include parallel, distributed, grid and cloud computing, resource management,
and workflows. Dr. Korkhov has published around 70 scientific papers in
international journals and conference proceedings.\vs{9}

\noindent Vladimir Gaiduchok is a PhD student in computer science at the Dept.
of Computer Science and Engineering, Saint Petersburg Electrotechnical
University ``LETI''. His research interests include parallel programming, cloud
computing and optimisation.\vs{9}

\noindent Alexander Degtyarev graduated from Leningrad Shipbuilding Institute
in 1985 and defended his PhD thesis in the field of computational fluid
dynamics in 1991. He has held research positions at the Institute for High
Performance Computing and Data Bases and at different universities. He has been
teaching since 1993 and has been a professor in computational sciences since
2005. He currently teaches courses in high performance computing and
intelligence systems at St.~Petersburg State University. His areas of research
are the development of mathematical and computer models for complex dynamic
systems, application of high performance computing technology, and on-board
intelligence systems.
He is the author of more than 100 papers.\vs{9}

\noindent The main activities of Prof. Bogdanov are related to mathematical
methods in physics, computational methods and mathematical modeling. Lately he
has concentrated on the creation of data and knowledge bases for a number of
applied fields and on the creation of algorithms for high-performance
computing. At the same time, he is involved with the problems of building
computational clusters on various computer platforms, i.e. system integration
problems.\vs{9}

\noindent This paper is a revised and expanded version of a paper entitled
``Factory: Master node high-availability for Big Data applications and beyond''
presented at the~16\textsuperscript{th} International Conference on
Computational Science and its Applications (ICCSA\textquotesingle16), Beijing,
China, July 4--7.

\end{bio}