iccsa-16-factory-extended

Master node fault tolerance in distributed big data processing clusters
git clone https://git.igankevich.com/iccsa-16-factory-extended.git

abstract.tex (4438B)


      1 \begin{abstract}
      2 Distributed computing clusters are often built with commodity hardware, which
      3 leads to periodic failures of processing nodes due to the relatively low
      4 reliability of such hardware. While worker node fault tolerance is
      5 straightforward, fault tolerance of the master node poses a bigger challenge. In
      6 this paper, master node failure handling is based on the concept of master and
      7 worker roles that can be dynamically re-assigned to cluster nodes, along with
      8 maintaining a backup of the master node state on one of the worker nodes. In this
      9 case, no special component is needed to monitor the health of the cluster, and
     10 master node failures can be resolved except in the case of simultaneous
     11 failure of the master and the backup. We present an experimental evaluation of an
     12 implementation of the technique and show benchmarks demonstrating that a failure
     13 of the master does not affect the running job, and a failure of the backup results in
     14 re-computation of only the last job step.
     15 \end{abstract}
     16 
     17 \KEYWORD{parallel computing; Big Data processing; distributed computing; backup node; state transfer; delegation; cluster computing; fault-tolerance; high-availability; hierarchy}
     18 
     19 \begin{bio}
     20 	
     21 \noindent Ivan Gankevich is a research assistant in computer science at the Dept.
     22 of Computer Modelling and Multiprocessor Systems, Saint Petersburg State
     23 University. His research interests include middleware for high-performance
     24 computing, parallel programming, and large-scale ocean wave simulations.\vs{9}
     25 
     26 \noindent Yuri Tipikin is a PhD student in computer science at the Dept. of
     27 Computer Modelling and Multiprocessor Systems, Saint Petersburg State
     28 University. His research interests include middleware for high-performance
     29 computing, task scheduling and resource allocation algorithms.\vs{9}
     30 
     31 \noindent Dr. Vladimir Korkhov is an associate professor at the Computer
     32 Modeling and Multiprocessor Systems department, Faculty of Applied Mathematics
     33 and Control Processes, St. Petersburg State University, Russia. He received his PhD
     34 degree from the University of Amsterdam in 2009 with a thesis on hierarchical
     35 resource management in Grid computing; he participated in a number of national
     36 and international projects on distributed and grid computing, and a number of
     37 RFBR-funded projects. As a post-doctoral researcher he worked at the Academic
     38 Medical Center of the University of Amsterdam and at Charit\'e --- Medical
     39 University of Berlin on applying grid technology to bioinformatics and medical
     40 applications. His research interests include parallel, distributed, grid and cloud
     41 computing, resource management, and workflows. Dr. Korkhov has published around 70
     42 scientific papers in international journals and conference proceedings.\vs{9}
     43 
     44 \noindent Vladimir Gaiduchok is a PhD student in computer science at the Dept. of
     45 Computer Science and Engineering, Saint Petersburg Electrotechnical University
     46 ``LETI''. His research interests include parallel programming, cloud computing
     47 and optimisation.\vs{9}
     48 
     49 \noindent Alexander Degtyarev graduated from Leningrad Shipbuilding Institute
     50 in 1985. He defended his PhD thesis in the field of computational fluid dynamics
     51 in 1991. He has held research positions at the Institute for High Performance
     52 Computing and Data Bases and at various universities. He has been teaching since
     53 1993 and has been a professor in computational sciences since 2005. He currently
     54 teaches courses in high performance computing and intelligence systems at
     55 St.~Petersburg State University. His areas of research are the development of
     56 mathematical and computer models for complex dynamic systems, the application of
     57 high performance computing technology, and on-board intelligence systems. He is
     58 the author of more than 100 papers.\vs{9}
     59 
     60 \noindent The main activities of Prof. Bogdanov are related to mathematical methods
     61 in physics, computational methods and mathematical modeling. Recently he has
     62 concentrated on the creation of data and knowledge bases for a number of applied
     63 fields and on the creation of algorithms for high-performance computing. At the
     64 same time, he is involved in the problems of building computational clusters on
     65 various computer platforms, i.e. system integration problems.\vs{9}
     66 
     67 \noindent This paper is a revised and expanded version of a paper entitled
     68 ``Factory: Master node high-availability for Big Data applications and beyond''
     69 presented at the~16\textsuperscript{th} International Conference on
     70 Computational Science and its Applications (ICCSA\textquotesingle16), Beijing,
     71 China, July 4--7, 2016.
     72 
     73 \end{bio}