abstract.tex - iccsa-16-factory-extended - Master node fault tolerance in distributed big data processing clusters


\begin{abstract}
Distributed computing clusters are often built from commodity hardware, whose
relatively low reliability leads to periodic failures of processing nodes.
While worker node fault tolerance is straightforward, fault tolerance of the
master node poses a bigger challenge. In this paper, master node failure
handling is based on the concept of master and worker roles that can be
dynamically re-assigned to cluster nodes, combined with maintaining a backup of
the master node state on one of the worker nodes. In this case no special
component is needed to monitor the health of the cluster, and master node
failures can be resolved except in the case of simultaneous failure of the
master and the backup. We present an experimental evaluation of an
implementation of the technique and show benchmarks demonstrating that a
failure of the master does not affect a running job, and that a failure of the
backup results in re-computation of only the last job step.
\end{abstract}

\KEYWORD{parallel computing; Big Data processing; distributed computing; backup node; state transfer; delegation; cluster computing; fault-tolerance; high-availability; hierarchy}

\begin{bio}

\noindent Ivan Gankevich is a research assistant in computer science at the
Dept. of Computer Modelling and Multiprocessor Systems, Saint Petersburg State
University. His research interests include middleware for high-performance
computing, parallel programming, and large-scale ocean wave simulations.\vs{9}

\noindent Yuri Tipikin is a PhD student in computer science at the Dept. of
Computer Modelling and Multiprocessor Systems, Saint Petersburg State
University. His research interests include middleware for high-performance
computing, task scheduling, and resource allocation algorithms.\vs{9}

\noindent Dr. Vladimir Korkhov is an associate professor at the Computer
Modeling and Multiprocessor Systems department, Faculty of Applied Mathematics
and Control Processes, St. Petersburg State University, Russia. He received his
PhD degree from the University of Amsterdam in 2009 with a thesis on
hierarchical resource management in Grid computing. He has participated in a
number of national and international projects on distributed and grid
computing, and in a number of RFBR-funded projects. As a post-doctoral
researcher he worked at the Academic Medical Center of the University of
Amsterdam and at Charit\'e --- Medical University of Berlin on applying grid
technology to bioinformatics and medical applications. His research interests
include parallel, distributed, grid and cloud computing, resource management,
and workflows. Dr. Korkhov has published around 70 scientific papers in
international journals and conference proceedings.\vs{9}

\noindent Vladimir Gaiduchok is a PhD student in computer science at the Dept.
of Computer Science and Engineering, Saint Petersburg Electrotechnical
University ``LETI''. His research interests include parallel programming, cloud
computing, and optimisation.\vs{9}

\noindent Alexander Degtyarev graduated from Leningrad Shipbuilding Institute
in 1985 and defended his PhD thesis in the field of computational fluid
dynamics in 1991. He has held research positions at the Institute for High
Performance Computing and Data Bases and at different universities. He has been
teaching since 1993 and has been a professor in computational sciences since
2005. He currently teaches courses in high performance computing and
intelligence systems at St.~Petersburg State University. His areas of research
are the development of mathematical and computer models for complex dynamic
systems, application of high performance computing technology, and on-board
intelligence systems. He is the author of more than 100 papers.\vs{9}

\noindent The main activities of Prof. Bogdanov are related to mathematical
methods in physics, computational methods and mathematical modeling. Lately he
has concentrated on the creation of data and knowledge bases for a number of
applied fields and on the creation of algorithms for high-performance
computing. At the same time, he is involved with the problems of building
computational clusters on various computer platforms, i.e. system integration
problems.\vs{9}

\noindent This paper is a revised and expanded version of a paper entitled
``Factory: Master node high-availability for Big Data applications and beyond''
presented at the~16\textsuperscript{th} International Conference on
Computational Science and its Applications (ICCSA\textquotesingle16), Beijing,
China, July 4--7, 2016.

\end{bio}
