arma-thesis.org (198906B)
1 #+TITLE: Simulation modelling of irregular waves for marine object dynamics programmes 2 #+AUTHOR: Ivan Gennadevich Gankevich 3 #+DATE: St. Petersburg, 2017 4 #+LANGUAGE: en 5 #+SETUPFILE: setup.org 6 #+LATEX_HEADER_EXTRA: \organization{Saint Petersburg State University} 7 #+LATEX_HEADER_EXTRA: \manuscript{a manuscript} 8 #+LATEX_HEADER_EXTRA: \degree{thesis for candidate of sciences degree} 9 #+LATEX_HEADER_EXTRA: \speciality{Speciality 05.13.18\\Mathematical modelling, numerical methods and programme complexes} 10 #+LATEX_HEADER_EXTRA: \supervisor{Supervisor\\Alexander Borisovich Degtyarev, ScD} 11 #+LATEX_HEADER_EXTRA: \newcites{published}{Publications on the subject of thesis} 12 #+LATEX_HEADER_EXTRA: \AtBeginDocument{\setcounter{page}{146}} 13 14 #+latex: \maketitlepage 15 16 * Introduction 17 **** Topic relevance. 18 Software programmes, which simulate ship behaviour in sea waves, are widely used 19 to model ship motion, estimate impact of external forces on floating platform or 20 other marine object, and estimate capsize probability under given weather 21 conditions; however, to model sea waves most of them use linear wave 22 theory\nbsp{}cite:shin2003nonlinear,van2007forensic,kat2001prediction,van2002development, 23 in the framework of which it is difficult to reproduce certain peculiarities of 24 wind wave climate. Among them are transition between normal and storm weather, 25 and sea composed of multiple wave systems\nbsp{}--- both wind waves and 26 swell\nbsp{}--- heading from multiple directions. Another shortcoming of linear 27 wave theory is an assumption, that wave amplitude is small compared to wave 28 length. This makes calculations inaccurate when modelling ship motion in 29 irregular waves, for which the assumption does not hold. So, studying new and 30 more advanced models and methods for sea simulation software would increase 31 number of its application scenarios and foster studying ship motion in extreme 32 conditions in particular. 33 34 **** State of the art. 35 Autoregressive moving average (ARMA) model is a response to difficulties 36 encountered by practitioners who used wave simulation models developed in the 37 framework of linear wave theory. The problems they have encountered with 38 Longuet---Higgins model (a model which is entirely based on linear wave theory) 39 are the following. 40 1. /Periodicity/. Linear wave theory approximates waves by a sum of harmonics, 41 hence period of the whole wavy surface realisation depends on the number of 42 harmonics in the model. The more realisation size is, the more coefficients 43 are required to eliminate periodicity, therefore, generation time grows 44 non-linearly with realisation size. This in turn results in overall low 45 efficiency of any model based on this theory, no matter how optimised the 46 software implementation is. 47 2. /Linearity/. Linear wave theory gives mathematical definition for sea waves 48 which have small amplitudes compared to their lengths. Waves of this type 49 occur mostly in open ocean, so near-shore waves as well as storm waves, for 50 which this assumption does not hold, are inaccurately modelled in the 51 framework of linear theory. 52 3. /Probabilistic convergence/. Phase of a wave, which is generated by pseudo 53 random number generator (PRNG), has uniform distribution, which leads to low 54 convergence rate of wavy surface integral characteristics (average wave 55 height, wave period, wave length etc.). 56 57 These difficulties became a starting point in search for a new model which is 58 not based on linear wave theory, and ARMA process studies were found to have all 59 the required mathematical apparatus. 60 1. ARMA process takes auto-covariate function (ACF) as an input parameter, and 61 this function can be directly obtained from wave energy or 62 frequency-directional spectrum (which is the input parameter for 63 Longuet---Higgins model). So, inputs for one model can easily be converted to 64 each other. 65 2. There is no small-amplitude waves assumption. Wave may have any amplitude, 66 and can be generated as steep as it is possible with real sea wave ACF. 67 3. Period of the realisation equals the period of PRNG, so generation time grows 68 linearly with the realisation size. 69 4. White noise\nbsp{}--- the only probabilistic term in ARMA process\nbsp{}--- 70 has Gaussian distribution, hence convergence rate is not probabilistic. 71 72 **** Goals and objectives. 73 ARMA process is the basis for ARMA sea simulation model, but due to its 74 non-physical nature the model needs to be adapted to be used for wavy surface 75 generation. 76 1. Investigate how different ACF shapes affect the choice of ARMA parameters 77 (the number of moving average and autoregressive processes coefficients). 78 2. Investigate a possibility to generate waves of arbitrary profiles, not only 79 cosines (which means taking into account asymmetric distribution of wavy 80 surface elevation). 81 3. Develop a method to determine pressure field under discretely given wavy 82 surface. Usually, such formulae are derived for a particular model by 83 substituting wave profile formula into the system of equations for pressure 84 determination\nbsp{}eqref:eq-problem, however, ARMA process does not provide 85 explicit wave profile formula, so this problem has to be solved for general 86 wavy surface (which is not defined by a mathematical formula), without 87 linearisation of boundaries and assumption of small-amplitude waves. 88 4. Verify that wavy surface integral characteristics match the ones of real sea 89 waves. 90 5. Develop software programme that implements ARMA model and pressure 91 calculation method, and allows to run simulations on both shared and 92 distributed memory computer systems. 93 94 **** Scientific novelty. 95 ARMA model, as opposed to other sea simulation models, does not use linear 96 wave theory. This makes it capable of 97 - generating waves with arbitrary amplitudes by adjusting wave steepness via 98 ACF; 99 - generating waves with arbitrary profiles by adjusting asymmetry of wave 100 elevation distribution via non-linear inertialess transform (NIT). 101 This makes it possible to use ARMA process to model transition between normal 102 and storm weather taking into account climate spectra and assimilation data of a 103 particular ocean region, which is not possible with models based on linear wave 104 theory. 105 106 The distinct feature of this work is 107 - the use of /three-dimensional/ AR and MA models in all experiments, 108 - the use of velocity potential field calculation method suitable for discretely 109 given wavy surface, and 110 - the development of software programme that implements sea wavy surface 111 generation and pressure field computation on both shared memory (SMP) and 112 distributed memory (MPP) computer systems, but also hybrid systems (GPGPU) 113 which use graphical coprocessors to accelerate computations. 114 115 **** Theoretical and practical significance. 116 Implementing ARMA model, that does not use assumptions of linear wave theory, 117 will increase quality of ship motion and marine object behaviour simulation 118 software. 119 120 1. Since pressure field method is developed for discrete wavy surface and 121 without assumptions about wave amplitudes, it is applicable to any wavy 122 surface of incompressible inviscid fluid (in particular, it is applicable to 123 wavy surface generated by LH model). This allows to use this method without 124 being tied to ARMA model. 125 2. From computational point of view this method is more efficient than the 126 corresponding method for LH model, because integrals in its formula are 127 reduced to Fourier transforms, for which there is fast Fourier transform 128 (FFT) family of algorithms, optimised for different processor architectures. 129 3. Since the formula which is used in the method is explicit, there is no need 130 in data exchange between parallel processes, which allows to scale 131 performance to a large number of processor cores. 132 4. Finally, ARMA model is itself more efficient than LH model due to absence of 133 trigonometric functions in its formula: In fact, wavy surface is computed as 134 a sum of large number of polynomials, for which there is low-level FMA (Fused 135 Multiply-Add) assembly instruction, and memory access pattern allows for 136 efficient use of CPU cache. 137 138 **** Methodology and research methods. 139 Software implementation of ARMA model and pressure field calculation method was 140 created incrementally: a prototype written in high-level engineering languages 141 (Mathematica\nbsp{}cite:mathematica10 and Octave\nbsp{}cite:octave2015) was 142 rewritten in lower level language (C++). Due to usage of different abstractions 143 and language primitives, implementation of the same algorithms and methods in 144 languages of varying levels allows to correct errors, which would left 145 unnoticed, if only one language was used. Wavy surface, generated by ARMA model, 146 as well as all input parameters (ACF, distribution of wave elevation etc.) were 147 inspected via graphical means built into the programming language. 148 149 **** Theses for the defence. 150 - Sea wave simulation model which allows to generate wavy surface realisations 151 with large period and consisting of waves of arbitrary amplitudes; 152 - Pressure field calculation method derived for this model without assumptions 153 of linear wave theory; 154 - Software implementation of the simulation model and the method for shared and 155 distributed memory computer systems. 156 **** Results verification and approbation. 157 ARMA model is verified by comparing generated wavy surface integral 158 characteristics (distribution of wave elevation, wave heights and lengths etc.) 159 to the ones of real sea waves. Pressure field calculation method was developed 160 using Mathematica language, where resulting formulae are verified by built-in 161 graphical means. 162 163 ARMA model and pressure field calculation method were incorporated into Large 164 Amplitude Motion Programme (LAMP)\nbsp{}--- an ship motion simulation software 165 programme\nbsp{}--- where they were compared to previously used LH model. 166 Numerical experiments showed higher computational efficiency of ARMA model. 167 168 * Problem statement 169 The aim of the study reported here is to apply ARMA process mathematical 170 apparatus to sea wave modelling and to develop a method to compute pressure 171 field under discretely given wavy surface for inviscid incompressible fluid 172 without assumptions of linear wave theory. 173 - In case of small-amplitude waves the resulting pressure field must correspond 174 to the one obtained via formula from linear wave theory; in all other cases 175 field values must not tend to infinity. 176 - Integral characteristics of generated wavy surface must match the ones of real 177 sea waves. 178 - Software implementation of ARMA model and pressure field computation method 179 must work on shared and distributed memory systems. 180 181 **** Pressure field formula. 182 The problem of finding pressure field under wavy sea surface represents inverse 183 problem of hydrodynamics for incompressible inviscid fluid. System of equations 184 for it in general case is written as\nbsp{}cite:kochin1966theoretical 185 \begin{align} 186 & \nabla^2\phi = 0,\nonumber\\ 187 & \phi_t+\frac{1}{2} |\vec{\upsilon}|^2 + g\zeta=-\frac{p}{\rho}, & \text{at }z=\zeta(x,y,t),\label{eq-problem}\\ 188 & D\zeta = \nabla \phi \cdot \vec{n}, & \text{at }z=\zeta(x,y,t),\nonumber 189 \end{align} 190 where \(\phi\) is velocity potential, \(\zeta\)\nbsp{}--- elevation (\(z\) 191 coordinate) of wavy surface, \(p\)\nbsp{}--- wave pressure, \(\rho\)\nbsp{}--- 192 fluid density, \(\vec{\upsilon}=(\phi_x,\phi_y,\phi_z)\)\nbsp{}--- velocity 193 vector, \(g\)\nbsp{}--- acceleration of gravity, and \(D\)\nbsp{}--- substantial 194 (Lagrange) derivative. The first equation is called continuity (Laplace) 195 equation, the second one is the conservation of momentum law (the so called 196 dynamic boundary condition); the third one is kinematic boundary condition for 197 free wavy surface, which states that a rate of change of wavy surface elevation 198 (\(D\zeta\)) equals to the change of velocity potential derivative in the 199 direction of wavy surface normal (\(\nabla\phi\cdot\vec{n}\), see 200 section\nbsp{}[[#directional-derivative]]). 201 202 Inverse problem of hydrodynamics consists in solving this system of equations 203 for \(\phi\). In this formulation dynamic boundary condition becomes explicit 204 formula to determine pressure field using velocity potential derivatives 205 obtained from the remaining equations. From mathematical point of view inverse 206 problem of hydrodynamics reduces to Laplace equation with mixed boundary 207 condition\nbsp{}--- Robin problem. 208 209 Inverse problem is feasible because ARMA model generate hydrodynamically 210 adequate sea wavy surface: distributions of integral characteristics and 211 dispersion relation match the ones of real 212 waves\nbsp{}cite:boukhanovsky1997thesis,degtyarev2011modelling. 213 214 * ARMA model for sea wave simulation 215 ** Sea wave models analysis 216 Pressure computation is only possible when the shape of wavy surface is known. 217 It is defined either at discrete grid points, or continuously via some analytic 218 formula. As will be shown in section\nbsp{}[[#linearisation]], such formula may 219 simplify pressure computation by effectively reducing the task to pressure field 220 generation, instead of wavy surface generation. 221 222 **** Longuet---Higgins model. 223 The simplest model, formula of which is derived in the framework of linear wave 224 theory (see\nbsp{}sec.\nbsp{}[[#longuet-higgins-derivation]]), is Longuet---Higgins 225 (LH) model\nbsp{}cite:longuet1957statistical. In-depth comparative analysis of 226 this model and ARMA model is done 227 in\nbsp{}cite:degtyarev2011modelling,boukhanovsky1997thesis. 228 229 LH model represents sea wavy surface as a superposition of 230 sine waves with random amplitudes \(c_n\) and phases \(\epsilon_n\), uniformly 231 distributed on interval \([0,2\pi]\). Wavy surface elevation (\(z\) coordinate) is 232 defined by 233 \begin{equation} 234 \zeta(x,y,t) = \sum\limits_n c_n \cos(u_n x + v_n y - \omega_n t + \epsilon_n). 235 \label{eq-longuet-higgins} 236 \end{equation} 237 Here wave numbers \((u_n,v_n)\) are continuously distributed on plane \((u,v)\), 238 i.e.\nbsp{}area \(du\times{}dv\) contains infinite quantity of wave numbers. 239 Frequency is related to wave numbers via dispersion relation 240 \(\omega_n=\omega(u_n,v_n)\). Function \(\zeta(x,y,t)\) is a three-dimensional 241 ergodic stationary homogeneous Gaussian process defined by 242 \begin{equation*} 243 2E_\zeta(u,v)\, du\, dv = \sum\limits_n c_n^2, 244 \end{equation*} 245 where \(E_\zeta(u,v)\)\nbsp{}--- two-dimensional wave energy spectral density. 246 Coefficients \(c_n\) are derived from wave energy spectrum \(S(\omega)\) via 247 \begin{equation*} 248 c_n = \sqrt{ \textstyle\int\limits_{\omega_n}^{\omega_{n+1}} S(\omega) d\omega}. 249 \end{equation*} 250 251 **** Disadvantages of Longuet-Higgins model. 252 Despite simplicity of LH model numerical algorithm, in practice it has several 253 disadvantages. 254 1. The model simulates only stationary Gaussian field. This is a consequence of 255 central limit theorem (CLT): sum of large number of sines with random 256 amplitudes and phases has normal distribution, no matter what spectrum is 257 used as the model input. Using lower number of coefficients may solve the 258 problem, but also make realisation period smaller. Using LH model to simulate 259 waves with non-Gaussian distribution of elevation\nbsp{}--- a distribution 260 which real sea waves have\nbsp{}cite:huang1980experimental,rozhkov1996theory 261 \nbsp{}--- is impractical. 262 2. From computational point of view, the deficiency of the model is non-linear 263 increase of wavy surface generation time with the increase of realisation 264 size. The larger the size of the realisation, the higher number of 265 coefficients (discrete points of frequency-directional spectrum) is needed to 266 eliminate periodicity. This makes LH model inefficient for long-time 267 simulations. 268 3. Finally, there are peculiarities which make LH model unsuitable as a base for 269 building more advanced simulation models. 270 - In software implementation convergence rate 271 of\nbsp{}[[eqref:eq-longuet-higgins]] is low due to uniform distribution of 272 phases \(\epsilon_n\). 273 - It is difficult to generalise LH model for non-Gaussian processes as it 274 involves incorporating non-linear terms in\nbsp{}[[eqref:eq-longuet-higgins]] 275 for which there is no known formula to determine 276 coefficients\nbsp{}cite:rozhkov1990stochastic. 277 278 To summarise, LH model is applicable to generating sea wavy surface only in the 279 framework of linear wave theory, inefficient for long-time simulations, and has 280 a number of deficiencies which do not allow to use it as a base for more 281 advanced models. 282 283 **** ARMA model. 284 In\nbsp{}cite:spanos1982arma ARMA model is used to generate time series spectrum 285 of which is compatible with Pierson---Moskowitz (PM) spectrum. The authors carry 286 out experiments for one-dimensional AR, MA and ARMA models. They mention 287 excellent agreement between target and initial spectra and higher performance of 288 ARMA model compared to models based on summing large number of harmonic 289 components with random phases. The also mention that in order to reach agreement 290 between target and initial spectrum MA model require lesser number of 291 coefficients than AR model. In\nbsp{}cite:spanos1996efficient the authors 292 generalise ARMA model coefficients determination formulae for multi-variate 293 (vector) case. 294 295 One thing that distinguishes present work with respect to afore-mentioned ones 296 is the study of three-dimensional (2D in space and 1D in time) ARMA model, which 297 is mostly a different problem. 298 1. Yule---Walker system of equations, which are used to determine AR 299 coefficients, has complex block-block structure. 300 2. Optimal model order (in a sense that target spectrum agrees with initial) is 301 determined manually. 302 3. Instead of PM spectrum, analytic formulae for standing and propagating 303 waves ACF are used as the model input. 304 4. Three-dimensional wavy surface should be compatible with real sea surface 305 not only in terms of spectral characteristics, but also in the shape of wave 306 profiles. So, model verification includes distributions of various parameters 307 of generated waves (lengths, heights, periods etc.). 308 Multi-dimensionality of investigated model not only complexifies the task, but 309 also allows to carry out visual validation of generated wavy surface. It is the 310 opportunity to visualise output of the programme that allows to ensure that 311 generated surface is compatible with real sea surface, and is not abstract 312 multi-dimensional stochastic process that corresponds to the real one only 313 statistically. 314 315 In\nbsp{}cite:fusco2010short AR model is used to predict swell waves to control 316 wave-energy converters (WEC) in real-time. In order to make WEC more efficient 317 its internal oscillator frequency should match the one of sea waves. The authors 318 treat wave elevation as time series and compare performance of AR model, neural 319 networks and cyclical models in forecasting time series future values. AR model 320 gives the most accurate prediction of low-frequency swell waves for up to two 321 typical wave periods. It is an example of successful application of AR process 322 to sea wave modelling. 323 324 ** Governing equations for 3-dimensional ARMA process 325 ARMA sea simulation model defines sea wavy surface as three-dimensional (two 326 dimensions in space and one in time) autoregressive moving average process: 327 every surface point is represented as a weighted sum of previous in time and 328 space points plus weighted sum of previous in time and space normally 329 distributed random impulses. The governing equation for 3-D ARMA process is 330 \begin{equation} 331 \zeta_{\vec i} 332 = 333 \sum\limits_{\vec j = \vec 0}^{\vec N} 334 \Phi_{\vec j} \zeta_{\vec i - \vec j} 335 + 336 \sum\limits_{\vec j = \vec 0}^{\vec M} 337 \Theta_{\vec j} \epsilon_{\vec i - \vec j} 338 , 339 \label{eq-arma-process} 340 \end{equation} 341 where \(\zeta\)\nbsp{}--- wave elevation, \(\Phi\)\nbsp{}--- AR process 342 coefficients, \(\Theta\)\nbsp{}--- MA process coefficients, 343 \(\epsilon\)\nbsp{}--- white noise with Gaussian distribution, 344 \(\vec{N}\)\nbsp{}--- AR process order, \(\vec{M}\)\nbsp{}--- MA process order, 345 and \(\Phi_{\vec{0}}\equiv{0}\), \(\Theta_{\vec{0}}\equiv{0}\). Here arrows 346 denote multi-component indices with a component for each dimension. In general, 347 any scalar quantity can be a component (temperature, salinity, concentration of 348 some substance in water etc.). Equation parameters are AR and MA process 349 coefficients and order. 350 351 Sea simulation model is used to simulate realistic realisations of wind induced 352 wave field and is suitable to use in ship dynamics calculations. Stationarity 353 and invertibility properties are the main criteria for choosing one or another 354 process to simulate waves with different profiles, which are discussed in 355 section\nbsp{}[[#sec-process-selection]]. 356 357 **** Autoregressive (AR) process. 358 AR process is ARMA process with only one random impulse instead of theirs 359 weighted sum: 360 \begin{equation} 361 \zeta_{\vec i} 362 = 363 \sum\limits_{\vec j = \vec 0}^{\vec N} 364 \Phi_{\vec j} \zeta_{\vec i - \vec j} 365 + 366 \epsilon_{i,j,k} 367 . 368 \label{eq-ar-process} 369 \end{equation} 370 The coefficients \(\Phi\) are calculated from ACF via three-dimensional 371 Yule---Walker (YW) equations, which are obtained after multiplying both parts of 372 the previous equation by \(\zeta_{\vec{i}-\vec{k}}\) and computing the expected 373 value. Generic form of YW equations is 374 \begin{equation} 375 \label{eq-yule-walker} 376 \gamma_{\vec k} 377 = 378 \sum\limits_{\vec j = \vec 0}^{\vec N} 379 \Phi_{\vec j} 380 \text{ }\gamma_{\vec{k}-\vec{j}} 381 + 382 \Var{\epsilon} \delta_{\vec{k}}, 383 \qquad 384 \delta_{\vec{k}} = 385 \begin{cases} 386 1, \quad \text{if } \vec{k}=0 \\ 387 0, \quad \text{if } \vec{k}\neq0, 388 \end{cases} 389 \end{equation} 390 where \(\gamma\)\nbsp{}--- ACF of process \(\zeta\), 391 \(\Var{\epsilon}\)\nbsp{}--- white noise variance. Matrix form of 392 three-dimensional YW equations, which is used in the present work, is 393 \begin{equation*} 394 \Gamma 395 \left[ 396 \begin{array}{l} 397 \Phi_{\vec 0}\\ 398 \Phi_{0,0,1}\\ 399 \vdotswithin{\Phi_{\vec 0}}\\ 400 \Phi_{\vec N} 401 \end{array} 402 \right] 403 = 404 \left[ 405 \begin{array}{l} 406 \gamma_{0,0,0}-\Var{\epsilon}\\ 407 \gamma_{0,0,1}\\ 408 \vdotswithin{\gamma_{\vec 0}}\\ 409 \gamma_{\vec N} 410 \end{array} 411 \right], 412 \qquad 413 \Gamma= 414 \left[ 415 \begin{array}{llll} 416 \Gamma_0 & \Gamma_1 & \cdots & \Gamma_{N_1} \\ 417 \Gamma_1 & \Gamma_0 & \ddots & \vdotswithin{\Gamma_0} \\ 418 \vdotswithin{\Gamma_0} & \ddots & \ddots & \Gamma_1 \\ 419 \Gamma_{N_1} & \cdots & \Gamma_1 & \Gamma_0 420 \end{array} 421 \right], 422 \end{equation*} 423 where \(\vec N = \left( p_1, p_2, p_3 \right)\) and 424 \begin{equation*} 425 \Gamma_i = 426 \left[ 427 \begin{array}{llll} 428 \Gamma^0_i & \Gamma^1_i & \cdots & \Gamma^{N_2}_i \\ 429 \Gamma^1_i & \Gamma^0_i & \ddots & \vdotswithin{\Gamma^0_i} \\ 430 \vdotswithin{\Gamma^0_i} & \ddots & \ddots & \Gamma^1_i \\ 431 \Gamma^{N_2}_i & \cdots & \Gamma^1_i & \Gamma^0_i 432 \end{array} 433 \right] 434 \qquad 435 \Gamma_i^j= 436 \left[ 437 \begin{array}{llll} 438 \gamma_{i,j,0} & \gamma_{i,j,1} & \cdots & \gamma_{i,j,N_3} \\ 439 \gamma_{i,j,1} & \gamma_{i,j,0} & \ddots & \vdotswithin{\gamma_{i,j,0}} \\ 440 \vdotswithin{\gamma_{i,j,0}} & \ddots & \ddots & \gamma_{i,j,1} \\ 441 \gamma_{i,j,N_3} & \cdots & \gamma_{i,j,1} & \gamma_{i,j,0} 442 \end{array} 443 \right], 444 \end{equation*} 445 Since \(\Phi_{\vec 0}\equiv0\), the first row and column of \(\Gamma\) can be 446 eliminated. Matrix \(\Gamma\) is block-block-toeplitz, positive definite and 447 symmetric, hence the system is efficiently solved by Cholesky decomposition, 448 which is particularly suitable for these types of matrices. 449 450 After solving this system of equations white noise variance is estimated 451 from\nbsp{}eqref:eq-yule-walker by plugging \(\vec k = \vec 0\): 452 \begin{equation*} 453 \Var{\epsilon} = 454 \Var{\zeta} 455 - 456 \sum\limits_{\vec j = \vec 0}^{\vec N} 457 \Phi_{\vec j} 458 \text{ }\gamma_{\vec{j}}. 459 \end{equation*} 460 461 **** Moving average (MA) process. 462 MA process is ARMA process with \(\Phi\equiv0\): 463 \begin{equation} 464 \zeta_{\vec i} 465 = 466 \sum\limits_{\vec j = \vec 0}^{\vec M} 467 \Theta_{\vec j} \epsilon_{\vec i - \vec j} 468 . 469 \label{eq-ma-process} 470 \end{equation} 471 MA coefficients \(\Theta\) are defined implicitly via the following non-linear 472 system of equations: 473 \begin{equation*} 474 \gamma_{\vec i} = 475 \left[ 476 \displaystyle 477 \sum\limits_{\vec j = \vec i}^{\vec M} 478 \Theta_{\vec j}\Theta_{\vec j - \vec i} 479 \right] 480 \Var{\epsilon}. 481 \end{equation*} 482 The system is solved numerically by fixed-point iteration method via the 483 following formulae 484 \begin{equation*} 485 \Theta_{\vec i} = 486 -\frac{\gamma_{\vec 0}}{\Var{\epsilon}} 487 + 488 \sum\limits_{\vec j = \vec i}^{\vec M} 489 \Theta_{\vec j} \Theta_{\vec j - \vec i}. 490 \end{equation*} 491 Here coefficients \(\Theta\) are calculated from back to front: from 492 \(\vec{i}=\vec{M}\) to \(\vec{i}=\vec{0}\). White noise variance is estimated by 493 \begin{equation*} 494 \Var{\epsilon} = \frac{\gamma_{\vec 0}}{ 495 1 496 + 497 \sum\limits_{\vec j = \vec 0}^{\vec M} 498 \Theta_{\vec j}^2 499 }. 500 \end{equation*} 501 Authors of\nbsp{}cite:box1976time suggest using Newton---Raphson method to solve 502 this equation with higher precision, however, in this case it is difficult to 503 generalise the method for three dimensions. Using slower method does not have 504 dramatic effect on the overall programme performance, because the number of 505 coefficients is small and most of the time is spent generating wavy surface. 506 507 **** Stationarity and invertibility of AR and MA processes. 508 In order for modelled wavy surface to represent physical phenomena, the 509 corresponding process must be stationary and invertible. If the process is 510 /invertible/, then there is a reasonable connection of current events with the 511 events in the past, and if the process is /stationary/, the modelled physical 512 signal amplitude does not increase infinitely in time and space. 513 514 AR process is always invertible, and for stationarity it is necessary for roots 515 of characteristic equation 516 \begin{equation*} 517 1 - \Phi_{0,0,1} z - \Phi_{0,0,2} z^2 518 - \cdots 519 - \Phi_{\vec N} z^{N_0 N_1 N_2} = 0, 520 \end{equation*} 521 to lie \emph{outside} the unit circle. Here \(\vec{N}\) is AR process order 522 and \(\Phi\) are coefficients. 523 524 MA process is always stationary, and for invertibility it is necessary for roots 525 of characteristic equation 526 \begin{equation*} 527 1 - \Theta_{0,0,1} z - \Theta_{0,0,2} z^2 528 - \cdots 529 - \Theta_{\vec M} z^{M_0 M_1 M_2} = 0, 530 \end{equation*} 531 to lie \emph{outside} the unit circle. Here \(\vec{M}\) is 532 three-dimensional MA process order and \(\Theta\) are coefficients. 533 534 Stationarity and invertibility properties are the main criteria in selection of 535 the process to model different wave profiles, which are discussed in 536 section\nbsp{}[[#sec-process-selection]]. 537 538 **** Mixed autoregressive moving average (ARMA) process. 539 :PROPERTIES: 540 :CUSTOM_ID: sec-how-to-mix-arma 541 :END: 542 543 Generally speaking, ARMA process is obtained by plugging MA generated wavy 544 surface as random impulse to AR process, however, in order to get the process 545 with desired ACF one should re-compute AR coefficients before plugging. There 546 are several approaches to "mix" AR and MA processes. 547 - The approach proposed in\nbsp{}cite:box1976time which involves dividing ACF 548 into MA and AR part along each dimension is not applicable here, because in 549 three dimensions such division is not possible: there always be parts of the 550 ACF that are not taken into account by AR and MA process. 551 - The alternative approach is to use the same (undivided) ACF for both AR and MA 552 processes but use different process order, however, then realisation 553 characteristics (mean, variance etc.) become skewed: these are characteristics 554 of the two overlapped processes. 555 For the first approach the authors of\nbsp{}cite:box1976time propose a formula 556 to correct AR process coefficients, but there is no such formula for the second 557 approach. So, the only proven solution for now is to simply use AR and MA 558 process exclusively. 559 560 **** Process selection criteria. 561 :PROPERTIES: 562 :CUSTOM_ID: sec-process-selection 563 :END: 564 565 One problem of ARMA model application to sea wave generation is that for 566 different types of wave profiles different processes /must/ be used: standing 567 waves are modelled by AR process, and propagating waves by MA process. This 568 statement comes from practice: if one tries to use the processes the other way 569 round, the resulting realisation either diverges or does not correspond to real 570 sea waves. (The latter happens for non-invertible MA process, as it is always 571 stationary.) So, the best way to apply ARMA model to sea wave generation is to 572 use AR process for standing waves and MA process for progressive waves. 573 574 The other problem is difficulty of determination of optimal number of 575 coefficients for three-dimensional AR and MA processes. For one-dimensional 576 processes there are easy to implement iterative methods\nbsp{}cite:box1976time, 577 but for three-dimensional case there are analogous 578 methods\nbsp{}cite:byoungseon1999arma3d,liew2005arma3d which are difficult to 579 implement, hence they were not used in this work. Manual choice of model's order 580 is trivial: choosing knowingly higher order than needed affects performance, but 581 not realisation quality, because processes' periodicity does not depend on the 582 number of coefficients. Software implementation of iterative methods of finding 583 the optimal model order would improve automation level of wavy surface 584 generator, but not the quality of the experiments being conducted. 585 586 In practice some statements made for AR and MA processes 587 in\nbsp{}cite:box1976time should be flipped for three-dimensional case. For 588 example, the authors say that ACF of MA process cuts at \(q\) and ACF of AR 589 process decays to nought infinitely, but in practice making ACF of 3-dimensional 590 MA process not decay results in it being non-invertible and producing 591 realisation that does not look like real sea waves, whereas doing the same for 592 ACF of AR process results in stationary process and adequate realisation. Also, 593 the authors say that one should allocate the first \(q\) points of ACF to MA 594 process (as it often needed to describe the peaks in ACF) and leave the rest 595 points to AR process, but in practice in case of ACF of a propagating wave AR 596 process is stationary only for the first time slice of the ACF, and the rest is 597 left to MA process. 598 599 To summarise, the only established scenario of applying ARMA model to sea wave 600 generation is to use AR process for standing waves and MA process for 601 propagating waves. Using mixed ARMA process for both types of waves might 602 increase model precision, given that coefficient correction formulae for three 603 dimensions become available, which is the objective of the future research. 604 605 **** Verification of wavy surface integral characteristics. 606 In\nbsp{}cite:degtyarev2011modelling,degtyarev2013synoptic,boukhanovsky1997thesis 607 AR model the following items are verified experimentally: 608 - probability distributions of different wave characteristics (heights, lengths, 609 crests, periods, slopes, three-dimensionality), 610 - dispersion relation, 611 - retention of integral characteristics for mixed wave sea state. 612 In this work both AR and MA model are verified by comparing probability 613 distributions of different wave characteristics. 614 615 In\nbsp{}cite:rozhkov1990stochastic the authors show that several sea wave 616 characteristics (listed in table\nbsp{}[[tab-weibull-shape]]) have Weibull 617 distribution, and wavy surface elevation has Gaussian distribution. In order to 618 verify that distributions corresponding to generated realisation are correct, 619 quantile-quantile plots are used (plots where analytic quantile values are used 620 for \(OX\) axis and estimated quantile values for \(OY\) axis). If the estimated 621 distribution matches analytic then the graph has the form of the straight line. 622 Tails of the graph may diverge from the straight line, because they can not be 623 reliably estimated from the finite-size realisation. 624 625 #+name: tab-weibull-shape 626 #+caption[Values of Weibull shape parameter]: 627 #+caption: Values of Weibull shape parameter for different wave characteristics. 628 #+attr_latex: :booktabs t 629 | Characteristic | Weibull shape (\(k\)) | 630 |----------------------+-----------------------| 631 | | <l> | 632 | Wave height | 2 | 633 | Wave length | 2.3 | 634 | Crest length | 2.3 | 635 | Wave period | 3 | 636 | Wave slope | 2.5 | 637 | Three-dimensionality | 2.5 | 638 639 Verification was performed for standing and propagating waves. The corresponding 640 ACFs and quantile-quantile plots of wave characteristics distributions are shown 641 in 642 fig.\nbsp{}[[propagating-wave-distributions]],\nbsp{}[[standing-wave-distributions]],\nbsp{}[[acf-slices]]. 643 644 #+name: propagating-wave-distributions 645 #+begin_src R :file build/propagating-wave-qqplots.pdf 646 source(file.path("R", "common.R")) 647 par(pty="s", mfrow=c(2, 2), family="serif") 648 arma.qqplot_grid( 649 file.path("build", "arma-benchmarks", "verification-orig", "propagating_wave"), 650 c("elevation", "heights_y", "lengths_y", "periods"), 651 c("elevation", "height Y", "length Y", "period"), 652 xlab="x", 653 ylab="y" 654 ) 655 #+end_src 656 657 #+caption[Quantile-quantile plots for propagating waves]: 658 #+caption: Quantile-quantile plots for propagating waves. 659 #+name: propagating-wave-distributions 660 #+RESULTS: propagating-wave-distributions 661 [[file:build/propagating-wave-qqplots.pdf]] 662 663 #+name: standing-wave-distributions 664 #+begin_src R :file build/standing-wave-qqplots.pdf 665 source(file.path("R", "common.R")) 666 par(pty="s", mfrow=c(2, 2), family="serif") 667 arma.qqplot_grid( 668 file.path("build", "arma-benchmarks", "verification-orig", "standing_wave"), 669 c("elevation", "heights_y", "lengths_y", "periods"), 670 c("elevation", "height Y", "length Y", "period"), 671 xlab="x", 672 ylab="y" 673 ) 674 #+end_src 675 676 #+caption[Quantile-quantile plots for standing waves]: 677 #+caption: Quantile-quantile plots for standing waves. 678 #+name: standing-wave-distributions 679 #+RESULTS: standing-wave-distributions 680 [[file:build/standing-wave-qqplots.pdf]] 681 682 #+name: acf-slices 683 #+header: :width 6 :height 9 684 #+begin_src R :file build/acf-slices.pdf 685 source(file.path("R", "common.R")) 686 propagating_acf <- read.csv(file.path("build", "arma-benchmarks", "verification-orig", "propagating_wave", "acf.csv")) 687 standing_acf <- read.csv(file.path("build", "arma-benchmarks", "verification-orig", "standing_wave", "acf.csv")) 688 par(mfrow=c(5, 2), mar=c(0,0,0,0), family="serif") 689 for (i in seq(0, 4)) { 690 arma.wavy_plot(standing_acf, i, zlim=c(-5,5)) 691 arma.wavy_plot(propagating_acf, i, zlim=c(-5,5)) 692 } 693 #+end_src 694 695 #+caption[ACF time slices for standing and propagating waves]: 696 #+caption: ACF time slices for standing (left column) and propagating waves 697 #+caption: (right column). 698 #+name: acf-slices 699 #+RESULTS: acf-slices 700 [[file:build/acf-slices.pdf]] 701 702 Graph tails in fig.\nbsp{}[[propagating-wave-distributions]] deviate from original 703 distribution for individual wave characteristics, because every wave have to be 704 extracted from the resulting wavy surface to measure its length, period and 705 height. There is no algorithm that guarantees correct extraction of all waves, 706 because they may and often overlap each other. Weibull distribution right tail 707 represents infrequently occurring waves, so it deviates more than left tail. 708 709 Correspondence rate for standing waves (fig.\nbsp{}[[standing-wave-distributions]]) 710 is lower for height and length, roughly the same for surface elevation and 711 higher for wave period distribution tails. Lower correspondence degree for 712 length and height may be attributed to the fact that Weibull distributions were 713 obtained empirically for sea waves which are typically propagating, and 714 distributions may be different for standings waves. Higher correspondence degree 715 for wave periods is attributed to the fact that wave periods of standing waves 716 are extracted more precisely as the waves do not move outside simulated wavy 717 surface region. The same correspondence degree for wave elevation is obtained, 718 because this is the characteristic of the wavy surface (and corresponding AR or 719 MA process) and is not affected by the type of waves. 720 721 To summarise, software implementation of AR and MA models (which is described in 722 detail in sec.\nbsp{}[[#sec-hpc]]) generates wavy surface, distributions of 723 individual waves' characteristics of which correspond to /in situ/ data. 724 725 ** Modelling non-linearity of sea waves 726 ARMA model allows to model asymmetry of wave elevation distribution, 727 i.e.\nbsp{}generate sea waves, distribution of \(z\)-coordinate of which has 728 non-nought kurtosis and asymmetry. Such distribution is inherent to real sea 729 waves\nbsp{}cite:longuet1963nonlinear and given by either polynomial 730 approximation of /in situ/ data or analytic formula. Wave asymmetry is modelled 731 by non-linear inertialess transform (NIT) of stochastic process, however, 732 transforming resulting wavy surface means transforming initial ACF. In order to 733 alleviate this, ACF must be preliminary transformed as shown 734 in\nbsp{}cite:boukhanovsky1997thesis. 735 736 **** Wavy surface transformation. 737 Explicit formula \(z=f(y)\) that transforms wavy surface to desired 738 one-dimensional distribution \(F(z)\) is the solution of non-linear 739 transcendental equation \(F(z)=\Phi(y)\), where \(\Phi(y)\)\nbsp{}--- 740 one-dimensional Gaussian distribution. Since distribution of wave elevation is 741 often given by some approximation based on /in situ/ data, this equation is 742 solved numerically with respect to \(z_k\) in each grid point \(y_k|_{k=0}^N\) 743 of generated wavy surface. In this case equation is rewritten as 744 \begin{equation} 745 \label{eq-distribution-transformation} 746 F(z_k) 747 = 748 \frac{1}{\sqrt{2\pi}} 749 \int\limits_0^{y_k} \exp\left[ -\frac{t^2}{2} \right] dt 750 . 751 \end{equation} 752 Since, distribution functions are monotonic, the simplest interval halving 753 (bisection) numerical method is used to solve this equation. 754 755 **** Preliminary ACF transformation. 756 In order to transform ACF \(\gamma_z\) of the process, it is expanded as a 757 series of Hermite polynomials (Gram---Charlier series) 758 \begin{equation*} 759 \gamma_z \left( \vec u \right) 760 = 761 \sum\limits_{m=0}^{\infty} 762 C_m^2 \frac{\gamma_y^m \left( \vec u \right)}{m!}, 763 \end{equation*} 764 where 765 \begin{equation*} 766 C_m = \frac{1}{\sqrt{2\pi}} 767 \int\limits_{0}^\infty 768 f(y) H_m(y) \exp\left[ -\frac{y^2}{2} \right], 769 \end{equation*} 770 \(H_m\)\nbsp{}--- Hermite polynomial, and \(f(y)\)\nbsp{}--- solution to 771 equation\nbsp{}eqref:eq-distribution-transformation. Plugging polynomial 772 approximation \(f(y)\approx\sum\limits_{i}d_{i}y^i\) and analytic formulae for 773 Hermite polynomial yields 774 \begin{equation*} 775 \frac{1}{\sqrt{2\pi}} 776 \int\limits_\infty^\infty 777 y^k \exp\left[ -\frac{y^2}{2} \right] 778 = 779 \begin{cases} 780 (k-1)!! & \text{if }k\text{ is even},\\ 781 0 & \text{if }k\text{ is odd}, 782 \end{cases} 783 \end{equation*} 784 which simplifies the former equation. Optimal number of coefficients \(C_m\) is 785 determined by computing them sequentially and stopping when variances of both 786 fields become equal with desired accuracy \(\epsilon\): 787 \begin{equation} 788 \label{eq-nit-error} 789 \left| \Var{z} - \sum\limits_{k=0}^m 790 \frac{C_k^2}{k!} \right| \leq \epsilon. 791 \end{equation} 792 793 In\nbsp{}cite:boukhanovsky1997thesis the author suggests using polynomial 794 approximation \(f(y)\) also for wavy surface transformation, however, in 795 practice sea surface realisation often contains points, in which 796 \(z\)-coordinate is beyond the limits of the approximation, leading to sharp 797 precision decrease. In these points it is more efficient to solve 798 equation\nbsp{}eqref:eq-distribution-transformation by bisection method. Using 799 the same approximation in Gram---Charlier series does not lead to such errors. 800 801 **** Gram---Charlier series expansion. 802 In\nbsp{}cite:huang1980experimental the authors experimentally show, that PDF of 803 sea surface elevation is distinguished from normal distribution by non-nought 804 kurtosis and skewness. In\nbsp{}cite:rozhkov1996theory the authors show, that 805 this type of PDF expands as a Gram---Charlier series (GCS): 806 \begin{align} 807 \label{eq-skew-normal-1} 808 & F(z; \mu=0, \sigma=1, \gamma_1, \gamma_2) \approx 809 \Phi(z; \mu, \sigma) \frac{2 + \gamma_2}{2} 810 - \frac{2}{3} \phi(z; \mu, \sigma) 811 \left(\gamma_2 z^3+\gamma_1 812 \left(2 z^2+1\right)\right) \nonumber 813 \\ 814 & f(z; \mu, \sigma, \gamma_1, \gamma_2) \approx 815 \phi(z; \mu, \sigma) 816 \left[ 817 1+ 818 \frac{1}{3!} \gamma_1 H_3 \left(\frac{z-\mu}{\sigma}\right) 819 + \frac{1}{4!} \gamma_2 H_4 \left(\frac{z-\mu}{\sigma}\right) 820 \right], 821 \end{align} 822 where \(\Phi(z)\)\nbsp{}--- cumulative density function (CDF) of normal 823 distribution, \(\phi\)\nbsp{}--- PDF of normal distribution, 824 \(\gamma_1\)\nbsp{}--- skewness, \(\gamma_2\)\nbsp{}--- kurtosis, 825 \(f\)\nbsp{}--- PDF, \(F\)\nbsp{}--- cumulative distribution function (CDF). 826 According to\nbsp{}cite:rozhkov1990stochastic for sea waves skewness is selected 827 from interval \(0.1\leq\gamma_1\leq{0.52}]\) and kurtosis from interval 828 \(0.1\leq\gamma_2\leq{0.7}\). Family of probability density functions for 829 different parameters is shown in fig.\nbsp{}[[fig-skew-normal-1]]. 830 831 #+NAME: fig-skew-normal-1 832 #+begin_src R :file build/skew-normal-1.pdf 833 source(file.path("R", "common.R")) 834 par(family="serif") 835 x <- seq(-3, 3, length.out=100) 836 params <- data.frame( 837 skewness = c(0.00, 0.52, 0.00, 0.52), 838 kurtosis = c(0.00, 0.00, 0.70, 0.70), 839 linetypes = c("solid", "dashed", "dotdash", "dotted") 840 ) 841 arma.skew_normal_1_plot(x, params) 842 legend( 843 "topleft", 844 mapply( 845 function (s, k) { 846 as.expression(bquote(list( 847 gamma[1] == .(arma.fmt(s, 2)), 848 gamma[2] == .(arma.fmt(k, 2)) 849 ))) 850 }, 851 params$skewness, 852 params$kurtosis 853 ), 854 lty = paste(params$linetypes) 855 ) 856 #+end_src 857 858 #+caption[Graph of PDF of GCS-based distribution]: Graph of probability density function of GCS-based distribution law for different values of skewness \(\gamma_1\) and kurtosis \(\gamma_2\). 859 #+name: fig-skew-normal-1 860 #+RESULTS: fig-skew-normal-1 861 [[file:build/skew-normal-1.pdf]] 862 863 **** Skew-normal distribution. 864 Alternative approach is to approximate distribution of sea wavy surface 865 elevation by skew-normal (SN) distribution: 866 \begin{align} 867 \label{eq-skew-normal-2} 868 F(z; \alpha) & = \frac{1}{2} 869 \mathrm{erfc}\left[-\frac{z}{\sqrt{2}}\right]-2 T(z,\alpha ), \nonumber \\ 870 f(z; \alpha) & = \frac{e^{-\frac{z^2}{2}}}{\sqrt{2 \pi }} 871 \mathrm{erfc}\left[-\frac{\alpha z}{\sqrt{2}}\right], 872 \end{align} 873 where \(T\)\nbsp{}--- Owen \(T\)-function\nbsp{}cite:owen1956tables. Using this 874 formula it is impossible to specify skewness and kurtosis separately\nbsp{}--- 875 both values are adjusted via \(\alpha\) parameter. The only advantage of the 876 formula is its relative computational simplicity: this function is available in 877 some programmes and mathematical libraries. Its graph for different values of 878 \(\alpha\) is shown in fig.\nbsp{}[[fig-skew-normal-2]]. 879 880 #+name: fig-skew-normal-2 881 #+begin_src R :file build/skew-normal-2.pdf 882 source(file.path("R", "common.R")) 883 par(family="serif") 884 x <- seq(-3, 3, length.out=100) 885 alpha <- c(0.00, 0.87, 2.25, 4.90) 886 params <- data.frame( 887 alpha = alpha, 888 skewness = arma.bits.skewness_2(alpha), 889 kurtosis = arma.bits.kurtosis_2(alpha), 890 linetypes = c("solid", "dashed", "dotdash", "dotted") 891 ) 892 arma.skew_normal_2_plot(x, params) 893 legend( 894 "topleft", 895 mapply( 896 function (a, s, k) { 897 as.expression(bquote(list( 898 alpha == .(arma.fmt(a, 2)), 899 gamma[1] == .(arma.fmt(s, 2)), 900 gamma[2] == .(arma.fmt(k, 2)) 901 ))) 902 }, 903 params$alpha, 904 params$skewness, 905 params$kurtosis 906 ), 907 lty = paste(params$linetypes) 908 ) 909 #+end_src 910 911 #+caption[Graph of PDF of SN distribution]: Graph of PDF of skew-normal distribution for different values of skewness coefficient \(\alpha\). 912 #+name: fig-skew-normal-2 913 #+RESULTS: fig-skew-normal-2 914 [[file:build/skew-normal-2.pdf]] 915 916 **** Evaluation. 917 In order to measure the effect of NIT on the shape of the resulting wavy 918 surface, three realisations were generated: 919 - realisation with Gaussian distribution (without NIT), 920 - realisation with Gram---Charlier series (GCS) based distribution, and 921 - realisation with skew normal distribution. 922 The seed of PRNG was set to be the same for all programme executions to make ARMA 923 model produce the same values for each realisation. There we two experiments: 924 for standing and propagating waves with ACFs given by formulae from 925 section\nbsp{}[[#sec-wave-acfs]]. 926 927 While the experiment showed that applying NIT with GCS-based distribution 928 increases wave steepness, the same is not true for skew normal distribution 929 (fig.\nbsp{}[[fig-nit]]). Using this distribution results in wavy surface each 930 \(z\)-coordinate of which is always greater or equal to nought. So, skew normal 931 distribution is unsuitable for NIT. NIT increases the wave height and steepness 932 of both standing and propagating waves. Increasing either skewness or kurtosis 933 of GCS-based distribution increases both wave height and steepness. The error of 934 ACF approximation (eq.\nbsp{}eqref:eq-nit-error) ranges from 0.20 for GCS-based 935 distribution to 0.70 for skew normal distribution (table\nbsp{}[[tab-nit-error]]). 936 937 #+name: fig-nit 938 #+header: :width 7 :height 7 939 #+begin_src R :file build/nit.pdf 940 source(file.path("R", "nonlinear.R")) 941 par(mfrow=c(2, 1), mar=c(4,4,4,0.5), family='serif') 942 args <- list( 943 graphs=c('Gaussian', 'Gram—Charlier', 'Skew normal'), 944 linetypes=c('solid', 'dashed', 'dotted') 945 ) 946 args$title <- 'Propagating waves' 947 arma.plot_nonlinear(file.path("build", "arma-benchmarks", "verification-orig", "nit-propagating"), args) 948 args$title <- 'Standing waves' 949 arma.plot_nonlinear(file.path("build", "arma-benchmarks", "verification-orig", "nit-standing"), args) 950 #+end_src 951 952 #+name: fig-nit 953 #+caption[Surface slices with different distributions of applicates]: 954 #+caption: Wavy surface slices with different distributions 955 #+caption: of wave elevation (Gaussian, GCS-based and SN). 956 #+RESULTS: fig-nit 957 [[file:build/nit.pdf]] 958 959 #+name: tab-nit-error 960 #+caption[Errors of ACF approximations]: 961 #+caption: Errors of ACF approximations (the difference of variances) for 962 #+caption: different wave elevation distributions. \(N\) is the number of 963 #+caption: coefficients of ACF approximation. 964 #+attr_latex: :booktabs t 965 | Wave type | Distribution | \(\gamma_1\) | \(\gamma_2\) | \(\alpha\) | Error | \(N\) | Wave height | 966 |-------------+--------------+--------------+--------------+------------+-------+-------+-------------| 967 | propagating | Gaussian | | | | | | 2.41 | 968 | propagating | GCS-based | 2.25 | 0.4 | | 0.20 | 2 | 2.75 | 969 | propagating | skew normal | | | 1 | 0.70 | 3 | 1.37 | 970 | standing | Gaussian | | | | | | 1.73 | 971 | standing | GCS-based | 2.25 | 0.4 | | 0.26 | 2 | 1.96 | 972 | standing | skew normal | | | 1 | 0.70 | 3 | 0.94 | 973 974 To summarise, the only test case that showed acceptable results is realisation 975 with GCS-based distribution for both standing and propagating waves. Skew normal 976 distribution warps wavy surface for both types of waves. GCS-based realisations 977 have large error of ACF approximation, which results in increase of wave height. 978 The reason for the large error is that GCS approximations are not accurate as 979 they do not converge for all possible 980 functions\nbsp{}cite:wallace1958asymptotic. Despite the large error, the change 981 in wave height is small (table\nbsp{}[[tab-nit-error]]). 982 983 ***** Wave height :noexport: 984 :PROPERTIES: 985 :header-args:R: :results output org 986 :END: 987 988 #+header: 989 #+begin_src R :results output org 990 source(file.path("R", "nonlinear.R")) 991 propagating <- arma.print_wave_height(file.path("build", "arma-benchmarks", "verification-orig", "nit-propagating")) 992 standing <- arma.print_wave_height(file.path("build", "arma-benchmarks", "verification-orig", "nit-standing")) 993 result <- data.frame( 994 h1=c(propagating$h1, standing$h1), 995 h2=c(propagating$h2, standing$h2), 996 h3=c(propagating$h3, standing$h3) 997 ) 998 rownames(result) <- c('propagating', 'standing') 999 colnames(result) <- c('none', 'gcs', 'sn') 1000 ascii(result) 1001 #+end_src 1002 1003 #+RESULTS: 1004 #+BEGIN_SRC org 1005 | | none | gcs | sn | 1006 |-------------+------+------+------| 1007 | propagating | 2.41 | 2.75 | 1.37 | 1008 | standing | 1.73 | 1.96 | 0.94 | 1009 #+END_SRC 1010 1011 ** The shape of ACF for different types of waves 1012 :PROPERTIES: 1013 :CUSTOM_ID: sec-wave-acfs 1014 :END: 1015 1016 **** Analytic method of finding the ACF. 1017 The straightforward way to find ACF for a given sea wave profile is to apply 1018 Wiener---Khinchin theorem. According to this theorem the autocorrelation \(K\) of 1019 a function \(\zeta\) is given by the Fourier transform of the absolute square of 1020 the function: 1021 \begin{equation} 1022 K(t) = \Fourier{\left| \zeta(t) \right|^2}. 1023 \label{eq-wiener-khinchin} 1024 \end{equation} 1025 When \(\zeta\) is replaced with actual wave profile, this formula gives you 1026 analytic formula for the corresponding ACF. 1027 1028 For three-dimensional wave profile (2D in space and 1D in time) analytic formula 1029 is a polynomial of high order and is best obtained via symbolic computation 1030 programme. Then for practical usage it can be approximated by superposition of 1031 exponentially decaying cosines (which is how ACF of a stationary ARMA process 1032 looks like\nbsp{}cite:box1976time). 1033 1034 **** Empirical method of finding the ACF. 1035 However, for three-dimensional case there exists simpler empirical method which 1036 does not require sophisticated software to determine shape of the ACF. It is 1037 known that ACF represented by exponentially decaying cosines satisfies first 1038 order Stokes' equations for gravity waves\nbsp{}cite:boccotti1983wind. So, if 1039 the shape of the wave profile is the only concern in the simulation, then one 1040 can simply multiply it by a decaying exponent to get appropriate ACF. This ACF 1041 does not reflect other wave profile parameters, such as wave height and period, 1042 but opens possibility to simulate waves of a particular shape by defining their 1043 profile with discretely given function, then multiplying it by an exponent and 1044 using the resulting function as ACF. So, this empirical method is imprecise but 1045 offers simpler alternative to Wiener---Khinchin theorem approach; it is mainly 1046 useful to test ARMA model. 1047 1048 **** Standing wave ACF. 1049 For three-dimensional plain standing wave the profile is given by 1050 \begin{equation} 1051 \zeta(t, x, y) = A \sin (k_x x + k_y y) \sin (\sigma t). 1052 \label{eq-standing-wave} 1053 \end{equation} 1054 Find ACF via analytic method. Multiplying the formula by a decaying exponent 1055 (because Fourier transform is defined for a function \(f\) that 1056 \(f\underset{x\rightarrow\pm\infty}{\longrightarrow}0\)) yields 1057 \begin{equation} 1058 \zeta(t, x, y) = 1059 A 1060 \exp\left[-\alpha (|t|+|x|+|y|) \right] 1061 \sin (k_x x + k_y y) \sin (\sigma t). 1062 \label{eq-decaying-standing-wave} 1063 \end{equation} 1064 Then, apply three-dimensional Fourier transform to both sides of the equation 1065 via symbolic computation programme, fit the resulting polynomial to the 1066 following approximation: 1067 \begin{equation} 1068 K(t,x,y) = 1069 \gamma 1070 \exp\left[-\alpha (|t|+|x|+|y|) \right] 1071 \cos \beta t 1072 \cos \left[ \beta x + \beta y \right]. 1073 \label{eq-standing-wave-acf} 1074 \end{equation} 1075 So, after applying Wiener---Khinchin theorem we get initial formula but with 1076 cosines instead of sines. This difference is important because the value of ACF 1077 at \((0,0,0)\) equals to the ARMA process variance, and if one used sines the 1078 value would be wrong. 1079 1080 If one tries to replicate the same formula via empirical method, the usual way 1081 is to adapt\nbsp{}eqref:eq-decaying-standing-wave to 1082 match\nbsp{}eqref:eq-standing-wave-acf. This can be done either by changing the 1083 phase of the sine, or by substituting sine with cosine to move the maximum of 1084 the function to the origin of coordinates. 1085 1086 **** Propagating wave ACF. 1087 Three-dimensional profile of plain propagating wave is given by 1088 \begin{equation} 1089 \zeta(t, x, y) = A \cos (\sigma t + k_x x + k_y y). 1090 \label{eq-propagating-wave} 1091 \end{equation} 1092 For the analytic method repeating steps from the previous two paragraphs yields 1093 \begin{equation} 1094 K(t,x,y) = 1095 \gamma 1096 \exp\left[-\alpha (|t|+|x|+|y|) \right] 1097 \cos\left[\beta (t+x+y) \right]. 1098 \label{eq-propagating-wave-acf} 1099 \end{equation} 1100 For the empirical method the wave profile is simply multiplied by a decaying 1101 exponent without need to adapt the maximum value of ACF (as it is required for 1102 standing wave). 1103 1104 **** Comparison of studied methods. 1105 To summarise, the analytic method of finding sea wave's ACF reduces to the 1106 following steps. 1107 - Make wave profile decay when approaching \(\pm\infty\) by multiplying it by 1108 a decaying exponent. 1109 - Apply Fourier transform to the absolute square of the resulting equation using 1110 symbolic computation programme. 1111 - Fit the resulting polynomial to the appropriate ACF approximation. 1112 1113 Two examples in this section showed that in case of standing and propagating 1114 waves their decaying profiles resemble the corresponding ACFs with the exception 1115 that the ACF's maximum should be moved to the origin to preserve simulated 1116 process variance. Empirical method of finding ACF reduces to the following 1117 steps. 1118 - Make wave profile decay when approaching \(\pm\infty\) by multiplying it by 1119 a decaying exponent. 1120 - Move maximum value of the resulting function to the origin by using 1121 trigonometric identities to shift the phase. 1122 1123 ** Summary 1124 ARMA model, owing to its non-physical nature, does not have the notion of sea 1125 wave; it simulates wavy surface as a whole instead. Motions of individual waves 1126 and their shape are often rough, and the total number of waves can not be 1127 determined precisely. However, integral characteristics of wavy surface match 1128 the ones of real sea waves. 1129 1130 Theoretically, sea waves themselves can be chosen as ACFs, the only 1131 pre-processing step is to make them decay exponentially. This may allow to 1132 generate waves of arbitrary profiles without using NIT, and is one of the 1133 directions of future work. 1134 1135 * Pressure field under discretely given wavy surface 1136 ** Known pressure field determination formulae 1137 **** Small amplitude waves theory. 1138 In\nbsp{}cite:stab2012,degtyarev1998modelling,degtyarev1997analysis the authors 1139 propose a solution for inverse problem of hydrodynamics of potential flow in the 1140 framework of small-amplitude wave theory (under assumption that wave length is 1141 much larger than height: \(\lambda \gg h\)). In that case inverse problem is 1142 linear and reduces to Laplace equation with mixed boundary conditions, and 1143 equation of motion is solely used to determine pressures for calculated velocity 1144 potential derivatives. The assumption of small amplitudes means the slow decay 1145 of wind wave coherence function, i.e. small change of local wave number in time 1146 and space compared to the wavy surface elevation (\(z\) coordinate). This 1147 assumption allows to calculate elevation \(z\) derivative as \(\zeta_z=k\zeta\), 1148 where \(k\) is wave number. In two-dimensional case the solution is written 1149 explicitly as 1150 \begin{align} 1151 \left.\frac{\partial\phi}{\partial x}\right|_{x,t}= & 1152 -\frac{1}{\sqrt{1+\alpha^{2}}}e^{-I(x)} 1153 \int\limits_{0}^x\frac{\partial\dot{\zeta}/\partial 1154 z+\alpha\dot{\alpha}}{\sqrt{1+\alpha^{2}}}e^{I(x)}dx,\label{eq-old-sol-2d}\\ 1155 I(x)= & \int\limits_{0}^x\frac{\partial\alpha/\partial z}{1+\alpha^{2}}dx,\nonumber 1156 \end{align} 1157 where \(\alpha\) is wave slope. In three-dimensional case the solution is 1158 written in the form of elliptic partial differential equation (PDE): 1159 \begin{align*} 1160 & \frac{\partial^2 \phi}{\partial x^2} \left( 1 + \alpha_x^2 \right) + 1161 \frac{\partial^2 \phi}{\partial y^2} \left( 1 + \alpha_y^2 \right) + 1162 2\alpha_x\alpha_y \frac{\partial^2 \phi}{\partial x \partial y} + \\ 1163 & \left( 1164 \frac{\partial \alpha_x}{\partial z} + 1165 \alpha_x \frac{\partial \alpha_x}{\partial x} + 1166 \alpha_y \frac{\partial \alpha_x}{\partial y} 1167 \right) \frac{\partial \phi}{\partial x} + \\ 1168 & \left( 1169 \frac{\partial \alpha_y}{\partial z} + 1170 \alpha_x \frac{\partial \alpha_y}{\partial x} + 1171 \alpha_y \frac{\partial \alpha_y}{\partial y} 1172 \right) \frac{\partial \phi}{\partial y} + \\ 1173 & \frac{\partial \dot{\zeta}}{\partial z} + 1174 \alpha_x \dot{\alpha_x} + \alpha_y \dot{\alpha_y} = 0. 1175 \end{align*} 1176 The authors suggest transforming this equation to finite differences and solve 1177 it numerically. 1178 1179 As will be shown in section\nbsp{}[[#sec-compare-formulae]], 1180 formula\nbsp{}eqref:eq-old-sol-2d diverges when attempted to calculate velocity 1181 field for large-amplitude waves, and this is the reason that it can not be used 1182 in conjunction with a model, that generates arbitrary-amplitude waves. 1183 1184 **** Linearisation of boundary condition. 1185 :PROPERTIES: 1186 :CUSTOM_ID: linearisation 1187 :END: 1188 1189 LH model allows to derive an explicit formula for velocity field by linearising 1190 kinematic boundary condition. Velocity potential formula is written as 1191 \begin{equation*} 1192 \phi(x,y,z,t) = \sum_n \frac{c_n g}{\omega_n} 1193 e^{\sqrt{u_n^2+v_n^2} z} 1194 \sin(u_n x + v_n y - \omega_n t + \epsilon_n). 1195 \end{equation*} 1196 This formula is differentiated to obtain velocity potential derivatives, which 1197 are plugged to dynamic boundary condition to determine pressure field. 1198 1199 ** Determining wave pressures for discretely given wavy surface 1200 Analytic solutions to boundary problems in classical equations are often used to 1201 study different properties of the solution, and for that purpose general 1202 solution formula is too difficult to study, as it contains integrals of unknown 1203 functions. Fourier method is one of the methods to find analytic solutions to 1204 PDE. It is based on Fourier transform, applying of which to some PDEs reduces 1205 them to algebraic equations, and the solution is written as inverse Fourier 1206 transform of some function (which may contain Fourier transforms of other 1207 functions). Since it is not possible to write analytic forms of these Fourier 1208 transforms in some cases, unique solutions are found and their behaviour is 1209 studied in different domains instead. At the same time, computing discrete 1210 Fourier transforms numerically is possible for any discretely defined function 1211 using FFT algorithms. These algorithms use symmetry of complex exponentials to 1212 decrease asymptotic complexity from \(\mathcal{O}(n^2)\) to 1213 \(\mathcal{O}(n\log_{2}n)\). So, even if general solution contains Fourier 1214 transforms of unknown functions, they still can be computed numerically, and FFT 1215 family of algorithms makes this approach efficient. 1216 1217 Alternative approach to solve PDEs is to reduce them to difference equations, 1218 which are solved by constructing various numerical schemes. This approach leads 1219 to approximate solution, and asymptotic complexity of corresponding algorithms 1220 is comparable to that of FFT. For example, for stationary elliptic PDE an 1221 implicit numerical scheme is constructed; the scheme is computed by iterative 1222 method on each step of which a tridiagonal or five-diagonal system of algebraic 1223 equations is solved by Thomas algorithm. Asymptotic complexity of this approach 1224 is \(\mathcal{O}({n}{m})\), where \(n\) is the number of wavy surface grid 1225 points and \(m\) is the number of iterations. 1226 1227 Despite their wide spread, iterative algorithms are inefficient on parallel 1228 computer architectures due to inevitable process synchronisation after each 1229 iteration; in particular, their mapping to co-processors may involve copying 1230 data in and out of the co-processor on each iteration, which negatively affects 1231 their performance. At the same time, high number of Fourier transforms in the 1232 solution is an advantage, rather than a disadvantage. First, solutions obtained 1233 by Fourier method are explicit, hence their implementations scale with the large 1234 number of parallel computer cores. Second, there are implementations of FFT 1235 optimised for different processor architectures as well as co-processors (GPU, 1236 MIC) which makes it easy to get high performance on any computing platform. 1237 These advantages substantiate the choice of Fourier method to obtain explicit 1238 solution to the problem of determining pressures under wavy sea surface. 1239 1240 *** Two-dimensional velocity potential field 1241 :PROPERTIES: 1242 :CUSTOM_ID: sec-pressure-2d 1243 :END: 1244 1245 **** Formula for infinite depth fluid. 1246 Two-dimensional Laplace equation with Robin boundary condition is written as 1247 \begin{align} 1248 \label{eq-problem-2d} 1249 & \phi_{xx}+\phi_{zz}=0,\\ 1250 & \zeta_t + \zeta_x\phi_x 1251 = \FracSqrtZetaX{\zeta_x}\phi_x 1252 - \FracSqrtZetaX{1}\phi_z, 1253 & \text{at }z=\zeta(x,t).\nonumber 1254 \end{align} 1255 Use Fourier method to solve this problem. Applying Fourier transform to both 1256 sides of the equation yields 1257 \begin{equation*} 1258 -4 \pi^2 \left( u^2 + v^2 \right) 1259 \FourierY{\phi(x,z)}{u,v} = 0, 1260 \end{equation*} 1261 hence \(v = \pm i u\). Hereinafter we use the following symmetric form of 1262 Fourier transform: 1263 \begin{equation*} 1264 \FourierY{f(x,y)}{u,v} = 1265 \iint\limits_{-\infty}^{\phantom{--}\infty} 1266 f(x,y) 1267 e^{-2\pi i (x u + y v)} 1268 dx dy. 1269 \end{equation*} 1270 We seek solution in the form of inverse Fourier transform 1271 \(\phi(x,z)=\InverseFourierY{E(u,v)}{x,z}\). Plugging[fn::\(v={-i}{u}\) is not 1272 applicable because velocity potential must go to nought when depth goes to 1273 infinity.] \(v={i}{u}\) into the formula yields 1274 \begin{equation} 1275 \label{eq-guessed-sol-2d} 1276 \phi(x,z) = \InverseFourierY{e^{2\pi u z}E(u)}{x}. 1277 \end{equation} 1278 In order to make substitution \(z=\zeta(x,t)\) not interfere with Fourier 1279 transforms, we rewrite\nbsp{}eqref:eq-guessed-sol-2d as a convolution: 1280 \begin{equation*} 1281 \phi(x,z) 1282 = 1283 \Fun{z} 1284 \ast 1285 \InverseFourierY{E(u)}{x}, 1286 \end{equation*} 1287 where \(\Fun{z}\)\nbsp{}--- a function, form of which is defined in 1288 section\nbsp{}[[#sec-compute-delta]] and which satisfies equation 1289 \(\FourierY{\Fun{z}}{u}=e^{2\pi{u}{z}}\). Plugging formula \(\phi\) into the 1290 boundary condition yields 1291 \begin{equation*} 1292 \zeta_t 1293 = 1294 \left( i f(x) - 1/\SqrtZetaX \right) 1295 \left[ 1296 \Fun{z} 1297 \ast 1298 \InverseFourierY{2\pi u E(u)}{x} 1299 \right], 1300 \end{equation*} 1301 where \(f(x)={\zeta_x}/{\sqrt{1+\zeta_x^2}}-\zeta_x\). Applying Fourier transform 1302 to both sides of this equation yields formula for coefficients \(E\): 1303 \begin{equation*} 1304 E(u) = 1305 \frac{1}{2\pi u} 1306 \frac{ 1307 \FourierY{\zeta_t / \left(i f(x) - 1/\SqrtZetaX\right)}{u} 1308 }{ 1309 \FourierY{\Fun{z}}{u} 1310 } 1311 \end{equation*} 1312 Finally, substituting \(z\) for \(\zeta(x,t)\) and plugging resulting equation 1313 into\nbsp{}eqref:eq-guessed-sol-2d yields formula for \(\phi(x,z)\): 1314 \begin{equation} 1315 \label{eq-solution-2d} 1316 \boxed{ 1317 \phi(x,z) 1318 = 1319 \InverseFourierY{ 1320 \frac{e^{2\pi u z}}{2\pi u} 1321 \frac{ 1322 \FourierY{ \zeta_t / \left(i f(x) - 1/\SqrtZetaX\right) }{u} 1323 }{ 1324 \FourierY{ \Fun{\zeta(x,t)} }{u} 1325 } 1326 }{x}. 1327 } 1328 \end{equation} 1329 1330 Multiplier \(e^{2\pi{u}{z}}/(2\pi{u})\) makes graph of a function to which 1331 Fourier transform is applied asymmetric with respect to \(OY\) axis. This makes 1332 it difficult to use FFT, because it expects periodic function which takes nought 1333 values on both ends of the interval. At the same time, using numerical 1334 integration instead of FFT is not faster than solving the initial system of 1335 equations with numerical schemes. This problem is alleviated by using 1336 formula\nbsp{}eqref:eq-solution-2d-full for finite depth fluid with knowingly 1337 large depth \(h\). This formula is derived in the following section. 1338 1339 **** Formula for finite depth fluid. 1340 On the sea bottom vertical fluid velocity component equals nought, 1341 i.e.\nbsp{}\(\phi_z=0\) on \(z=-h\), where \(h\)\nbsp{}--- water depth. In this 1342 case equation \(v=-{i}{u}\), which came from Laplace equation, can not be 1343 neglected, hence the solution is sought in the following form: 1344 \begin{equation} 1345 \phi(x,z) 1346 = 1347 \InverseFourierY{ 1348 \left( C_1 e^{2\pi u z} + C_2 e^{-2\pi u z} \right) 1349 E(u) 1350 }{x}. 1351 \label{eq-guessed-sol-2d-full} 1352 \end{equation} 1353 Plugging \(\phi\) into the boundary condition on the sea bottom yields 1354 \begin{equation*} 1355 C_1 e^{-2\pi u h} - C_2 e^{2\pi u h} = 0, 1356 \end{equation*} 1357 hence \(C_1=\frac{1}{2}C{e}^{2\pi{u}{h}}\) and 1358 \(C_2=-\frac{1}{2}C{e}^{-2\pi{u}{h}}\). Constant \(C\) may take arbitrary values 1359 here, because after plugging it becomes part of unknown coefficients \(E(u)\). 1360 Plugging formulae for \(C_1\) and \(C_2\) 1361 into\nbsp{}eqref:eq-guessed-sol-2d-full yields 1362 \begin{equation*} 1363 \phi(x,z) = \InverseFourierY{ \Sinh{2\pi u (z+h)} E(u) }{x}. 1364 \end{equation*} 1365 Plugging \(\phi\) into the boundary condition on the free surface yields 1366 \begin{align*} 1367 \zeta_t & = f(x) \InverseFourierY{ 2\pi i u \Sinh{2\pi u (z+h)} E(u) }{x} \\ 1368 & - \frac{1}{\SqrtZetaX} 1369 \InverseFourierY{ 2\pi u \SinhX{2\pi u (z+h)} E(u) }{x}. 1370 \end{align*} 1371 Here \(\sinh\) and \(\cosh\) give similar results near free surface, and since this 1372 is the main area of interest in practical applications, we assume that 1373 \(\Sinh{2\pi{u}(z+h)}\approx\SinhX{2\pi{u}(z+h)}\). Performing analogous to the 1374 previous section transformations yields final formula for \(\phi(x,z)\): 1375 \begin{equation} 1376 \boxed{ 1377 \phi(x,z,t) 1378 = 1379 \InverseFourierY{ 1380 \frac{\Sinh{2\pi u (z+h)}}{2\pi u} 1381 \frac{ 1382 \FourierY{ \zeta_t / \left(i f(x) - 1/\SqrtZetaX\right) }{u} 1383 }{ 1384 \FourierY{ \FunSecond{\zeta(x,t)} }{u} 1385 } 1386 }{x}, 1387 } 1388 \label{eq-solution-2d-full} 1389 \end{equation} 1390 where \(\FunSecond{z}\) is a function, a form of which is defined in 1391 section\nbsp{}[[#sec-compute-delta]] and which satisfies equation 1392 \(\FourierY{\FunSecond{z}}{u}=\Sinh{2\pi{u}{z}}\). 1393 1394 **** Reduction to the formulae from linear wave theory. 1395 Check the validity of derived formulae by substituting \(\zeta(x,t)\) with known 1396 analytic formula for plain waves. Symbolic computation of Fourier transforms in 1397 this section were performed in Mathematica\nbsp{}cite:mathematica10. In the 1398 framework of linear wave theory assume that waves have small amplitude compared 1399 to their lengths, which allows us to simplify initial system of 1400 equations\nbsp{}eqref:eq-problem-2d to 1401 \begin{align*} 1402 & \phi_{xx}+\phi_{zz}=0,\\ 1403 & \zeta_t = -\phi_z & \text{at }z=\zeta(x,t), 1404 \end{align*} 1405 solution to which is written as 1406 \begin{equation*} 1407 \phi(x,z,t) 1408 = 1409 -\InverseFourierY{ 1410 \frac{e^{2\pi u z}}{2\pi u} 1411 \FourierY{\zeta_t}{u} 1412 }{x} 1413 . 1414 \end{equation*} 1415 Propagating wave profile is defined as \(\zeta(x,t)=A\cos(2\pi(kx-t))\). 1416 Plugging this formula into\nbsp{}eqref:eq-solution-2d yields 1417 \(\phi(x,z,t)=-\frac{A}{k}\sin(2\pi(kx-t))\Sinh{2\pi{k}{z}}\). In order to 1418 reduce it to the formula from linear wave theory, rewrite hyperbolic cosine in 1419 exponential form and discard the term containing \(e^{-2\pi{k}{z}}\) as 1420 contradicting condition 1421 \(\phi\underset{z\rightarrow-\infty}{\longrightarrow}0\). Taking real part of 1422 the resulting formula yields 1423 \(\phi(x,z,t)=\frac{A}{k}e^{2\pi{k}{z}}\sin(2\pi(kx-t))\), which corresponds to 1424 the known formula from linear wave theory. Similarly, under small-amplitude 1425 waves assumption the formula for finite depth 1426 fluid\nbsp{}eqref:eq-solution-2d-full is reduced to 1427 \begin{equation*} 1428 \phi(x,z,t) 1429 = 1430 -\InverseFourierY{ 1431 \frac{\Sinh{2\pi u (z+h)}}{2\pi u \Sinh{2\pi u h}} 1432 \FourierY{\zeta_t}{u} 1433 }{x}. 1434 \end{equation*} 1435 Substituting \(\zeta(x,t)\) with propagating plain wave profile formula yields 1436 \begin{equation} 1437 \label{eq-solution-2d-linear} 1438 \phi(x,z,t)=\frac{A}{k} 1439 \frac{\Sinh{2 \pi k (z+h)}}{ \Sinh{2 \pi k h} } 1440 \sin(2 \pi (k x-t)), 1441 \end{equation} 1442 which corresponds to the formula from linear wave theory for finite depth fluid. 1443 1444 Different forms of Laplace equation solutions, in which decaying exponent is 1445 written with either "+" or "-" signs, may cause incompatibilities between 1446 formulae from linear wave theory and formulae derived in this work, where 1447 \(\sinh\) is used instead of \(\cosh\). Equality 1448 \(\frac{\Sinh{2\pi{k}(z+h)}}{\Sinh{2\pi{k}{h}}}\approx\frac{\sinh(2\pi{k}(z+h))}{\sinh(2\pi{k}{h})}\) 1449 becomes strict on the free surface, and difference between left-hand and 1450 right-hand sides increases when approaching sea bottom (for sufficiently large 1451 depth difference near free surface is negligible). So, for sufficiently large 1452 depth any function (\(\cosh\) or \(\sinh\)) may be used for velocity potential 1453 computation near free surface. 1454 1455 Reducing\nbsp{}eqref:eq-solution-2d and\nbsp{}eqref:eq-solution-2d-full to the 1456 known formulae from linear wave theory shows, that formula for infinite 1457 depth\nbsp{}eqref:eq-solution-2d is not suitable to compute velocity potentials 1458 with Fourier method, because it does not have symmetry, which is required for 1459 Fourier transform. However, formula for finite depth can be used instead by 1460 setting \(h\) to some characteristic water depth. For standing wave reducing to 1461 linear wave theory formulae is made under the same assumptions. 1462 *** Three-dimensional velocity potential field 1463 Three-dimensional version of\nbsp{}eqref:eq-problem is written as 1464 \begin{align} 1465 \label{eq-problem-3d} 1466 & \phi_{xx} + \phi_{yy} + \phi_{zz} = 0,\\ 1467 & \zeta_t + \zeta_x\phi_x + \zeta_y\phi_y 1468 = \FracSqrtZetaY{\zeta_x} \phi_x 1469 + \FracSqrtZetaY{\zeta_y} \phi_y 1470 - \FracSqrtZetaY{1} \phi_z, \nonumber\\ 1471 & \text{at }z=\zeta(x,y,t).\nonumber 1472 \end{align} 1473 Again, use Fourier method to solve it. Applying Fourier transform to both sides 1474 of Laplace equation yields 1475 \begin{equation*} 1476 -4 \pi^2 \left( u^2 + v^2 + w^2 \right) 1477 \FourierY{\phi(x,y,z)}{u,v,w} = 0, 1478 \end{equation*} 1479 hence \(w=\pm{i}\sqrt{u^2+v^2}\). We seek solution in the form of inverse Fourier 1480 transform \(\phi(x,y,z)=\InverseFourierY{E(u,v,w)}{x,y,z}\). Plugging 1481 \(w=i\sqrt{u^2+v^2}=i\Kveclen\) into the formula yields 1482 \begin{equation*} 1483 \phi(x,y,z) = \InverseFourierY{ 1484 \left( 1485 C_1 e^{2\pi \Kveclen z} 1486 -C_2 e^{-2\pi \Kveclen z} 1487 \right) 1488 E(u,v) 1489 }{x,y}. 1490 \end{equation*} 1491 Plugging \(\phi\) into the boundary condition on the sea bottom (analogous to 1492 two-dimensional case) yields 1493 \begin{equation} 1494 \label{eq-guessed-sol-3d} 1495 \phi(x,y,z) = \InverseFourierY{ 1496 \Sinh{2\pi \Kveclen (z+h)} E(u,v) 1497 }{x,y}. 1498 \end{equation} 1499 Plugging \(\phi\) into the boundary condition on the free surface yields 1500 \begin{equation*} 1501 \arraycolsep=1.4pt 1502 \begin{array}{rl} 1503 \zeta_t = & i f_1(x,y) \InverseFourierY{2 \pi u \Sinh{2\pi \Kveclen (z+h)}E(u,v)}{x,y} \\ 1504 + & i f_2(x,y) \InverseFourierY{2 \pi v \Sinh{2\pi \Kveclen (z+h)}E(u,v)}{x,y} \\ 1505 - & f_3(x,y) \InverseFourierY{2 \pi \Kveclen \SinhX{2\pi \Kveclen (z+h)}E(u,v)}{x,y} 1506 \end{array} 1507 \end{equation*} 1508 where \(f_1(x,y)={\zeta_x}/{\SqrtZetaY}-\zeta_x\), 1509 \(f_2(x,y)={\zeta_y}/{\SqrtZetaY}-\zeta_y\) and \(f_3(x,y)=1/\SqrtZetaY\). 1510 1511 Like in section\nbsp{}[[#sec-pressure-2d]] we assume that 1512 \(\Sinh{2\pi{u}(z+h)}\approx\SinhX{2\pi{u}(z+h)}\) near free surface, but in 1513 three-dimensional case this is not enough to solve the problem. In order to get 1514 explicit formula for coefficients \(E\) we need to assume, that all Fourier 1515 transforms in the equation have radially symmetric kernels, i.e.\nbsp{}replace 1516 \(u\) and \(v\) with \(\Kveclen\). There are two points supporting this 1517 assumption. First, in numerical implementation integration is done over positive 1518 wave numbers, so the sign of \(u\) and \(v\) does not affect the solution. 1519 Second, the growth rate of \(\cosh\) term of the integral kernel is much higher 1520 than the one of \(u\) or \(\Kveclen\), so the substitution has small effect on 1521 the magnitude of the solution. Despite these two points, a use of more 1522 mathematically rigorous approach would be preferable. 1523 1524 Making the replacement, applying Fourier transform to both sides of the equation 1525 and plugging the result into\nbsp{}eqref:eq-guessed-sol-3d yields formula for 1526 \(\phi\): 1527 \begin{equation} 1528 \label{eq-phi-3d} 1529 \phi(x,y,z,t) = \InverseFourierY{ 1530 \frac{ \Sinh{\smash{2\pi \Kveclen (z+h)}} }{ 2\pi\Kveclen } 1531 \frac{ \FourierY{ \zeta_t / \left( i f_1(x,y) + i f_2(x,y) - f_3(x,y) \right)}{u,v} } 1532 { \FourierY{\mathcal{D}_3\left( x,y,\zeta\left(x,y\right) \right)}{u,v} } 1533 }{x,y}, 1534 \end{equation} 1535 where 1536 \(\FourierY{\mathcal{D}_3\left(x,y,z\right)}{u,v}=\Sinh{\smash{2\pi\Kveclen{}z}}\). 1537 1538 *** Velocity potential normalisation formulae 1539 :PROPERTIES: 1540 :CUSTOM_ID: sec-compute-delta 1541 :END: 1542 1543 In solutions\nbsp{}eqref:eq-solution-2d and\nbsp{}eqref:eq-solution-2d-full to 1544 two-dimensional pressure determination problem there are functions 1545 \(\Fun{z}=\InverseFourierY{e^{2\pi{u}{z}}}{x}\) and 1546 \(\FunSecond{z}=\InverseFourierY{\Sinh{2\pi{u}{z}}}{x}\) which has multiple 1547 analytic representations and are difficult to compute. Each function is a 1548 Fourier transform of linear combination of exponents which reduces to poorly 1549 defined Dirac delta function of a complex argument (see 1550 table\nbsp{}[[tab-delta-functions]]). The usual way of handling this type of 1551 functions is to write them as multiplication of Dirac delta functions of real 1552 and imaginary part, however, this approach does not work here, because applying 1553 inverse Fourier transform to this representation does not produce exponent, 1554 which severely warp resulting velocity field. In order to get unique analytic 1555 definition normalisation factor \(1/\Sinh{2\pi{u}{h}}\) (which is also included 1556 in formula for \(E(u)\)) may be used. Despite the fact that normalisation allows 1557 to obtain adequate velocity potential field, numerical experiments show that 1558 there is little difference between this field and the one produced by formulae 1559 from linear wave theory, in which terms with \(\zeta\) are omitted. As a result, 1560 the formula for three-dimensional case was not derived. 1561 1562 #+name: tab-delta-functions 1563 #+caption[Formulae for computing \(\Fun{z}\) and \(\FunSecond{z}\)]: 1564 #+caption: Formulae for computing \(\Fun{z}\) and \(\FunSecond{z}\) from 1565 #+caption: sec.\nbsp{}[[#sec-pressure-2d]], that use normalisation to eliminate 1566 #+caption: uncertainty from definition of Dirac delta function of complex argument. 1567 #+attr_latex: :booktabs t 1568 | Function | Without normalisation | Normalised | 1569 |-------------------+--------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------| 1570 | \(\Fun{z}\) | \(\delta (x+i z)\) | \(\frac{1}{2 h}\mathrm{sech}\left(\frac{\pi (x-i (h+z))}{2 h}\right)\) | 1571 | \(\FunSecond{z}\) | \(\frac{1}{2}\left[\delta (x-i z) + \delta (x+i z) \right]\) | \(\frac{1}{4 h}\left[\text{sech}\left(\frac{\pi (x-i (h+z))}{2 h}\right)+\text{sech}\left(\frac{\pi (x+i(h+z))}{2 h}\right)\right]\) | 1572 1573 ** Verification of velocity potential fields 1574 :PROPERTIES: 1575 :CUSTOM_ID: sec-compare-formulae 1576 :END: 1577 1578 Comparing obtained generic formulae\nbsp{}eqref:eq-solution-2d 1579 and\nbsp{}eqref:eq-solution-2d-full to the known formulae from linear wave 1580 theory allows to see the difference between velocity fields for both large and 1581 small amplitude waves. In general case analytic formula for velocity potential 1582 in not known, even for plain waves, so comparison is done numerically. Taking 1583 into account conclusions of section\nbsp{}[[#sec-pressure-2d]], only finite depth 1584 formulae are compared. 1585 1586 **** The difference with linear wave theory formulae. 1587 In order to obtain velocity potential fields, plain waves of varying amplitude 1588 were generated. Wave numbers in Fourier transforms were chosen on the interval 1589 from \(0\) to the maximal wave number determined numerically from the resulting 1590 wavy surface. Experiments were conducted for waves of both small and large 1591 amplitudes. 1592 1593 The experiment showed that velocity potential fields for large-amplitude waves 1594 produced by formula\nbsp{}eqref:eq-solution-2d-full for finite depth fluid and 1595 formula\nbsp{}eqref:eq-solution-2d-linear from linear wave theory are 1596 qualitatively different (fig.\nbsp{}[[fig-potential-field-nonlinear]]). First, 1597 velocity potential contours have sinusoidal shape, which is different from oval 1598 shape described by linear wave theory. Second, velocity potential decays more 1599 rapidly than in linear wave theory as getting closer to the bottom, and the 1600 region where the majority of wave energy is concentrated is closer to the wave 1601 crest. Similar numerical experiment, in which all terms 1602 of\nbsp{}eqref:eq-solution-2d-full that are neglected in the framework of linear 1603 wave theory are eliminated, shows no difference (as much as machine precision 1604 allows) in resulting velocity potential fields. 1605 1606 #+name: fig-potential-field-nonlinear 1607 #+header: :width 8 :height 11 1608 #+begin_src R :file build/plain-wave-velocity-field-comparison.pdf 1609 source(file.path("R", "velocity-potentials.R")) 1610 par(pty="s", family="serif") 1611 nlevels <- 41 1612 levels <- pretty(c(-200,200), nlevels) 1613 palette <- colorRampPalette(c("blue", "lightyellow", "red")) 1614 col <- palette(nlevels-1) 1615 1616 # linear solver 1617 par(fig=c(0,0.95,0,0.5),new=TRUE) 1618 arma.plot_velocity_potential_field( 1619 file.path("build", "arma-benchmarks", "verification-orig", "plain_wave_linear_solver"), 1620 levels=levels, 1621 col=col 1622 ) 1623 1624 # high-amplitude solver 1625 par(fig=c(0,0.95,0.5,1),new=TRUE) 1626 arma.plot_velocity_potential_field( 1627 file.path("build", "arma-benchmarks", "verification-orig", "plain_wave_high_amplitude_solver"), 1628 levels=levels, 1629 col=col 1630 ) 1631 1632 # legend 1 1633 par(pty="m",fig=c(0.80,1,0.5,1), new=TRUE) 1634 arma.plot_velocity_potential_field_legend( 1635 levels=levels, 1636 col=col 1637 ) 1638 1639 # legend 2 1640 par(pty="m",fig=c(0.80,1,0,0.5), new=TRUE) 1641 arma.plot_velocity_potential_field_legend( 1642 levels=levels, 1643 col=col 1644 ) 1645 #+end_src 1646 1647 #+caption[Velocity potential fields for propagating wave]: 1648 #+caption: Comparison of velocity potential fields for propagating wave 1649 #+caption: \(\zeta(x,y,t) = \cos(2\pi x - t/2)\). Field produced by generic 1650 #+caption: formula (top) and linear wave theory formula (bottom). 1651 #+name: fig-potential-field-nonlinear 1652 #+attr_latex: :width \textwidth 1653 #+RESULTS: fig-potential-field-nonlinear 1654 [[file:build/plain-wave-velocity-field-comparison.pdf]] 1655 1656 **** The difference with small-amplitude wave theory. 1657 The experiment, in which velocity fields produced numerically by different 1658 formulae were compared, shows that velocity fields produced by 1659 formula\nbsp{}eqref:eq-solution-2d-full and\nbsp{}eqref:eq-old-sol-2d correspond 1660 to each other for small-amplitude waves. Two sea wavy surface realisations were 1661 made by AR model: one containing small-amplitude waves, other containing 1662 large-amplitude waves. Integration in formula\nbsp{}eqref:eq-solution-2d-full 1663 was done over wave numbers range extracted from the generated wavy surface. For 1664 small-amplitude waves both formulae showed comparable results (the difference in 1665 the velocity is attributed to the stochastic nature of AR model), whereas for 1666 large-amplitude waves stable velocity field was produced only by 1667 formula\nbsp{}eqref:eq-solution-2d-full (fig.\nbsp{}[[fig-velocity-field-2d]]). 1668 So, generic formula\nbsp{}eqref:eq-solution-2d-full gives satisfactory results 1669 without restriction on wave amplitudes. 1670 1671 #+name: fig-velocity-field-2d 1672 #+header: :width 8 :height 11 1673 #+begin_src R :file build/large-and-small-amplitude-velocity-field-comparison.pdf 1674 source(file.path("R", "velocity-potentials.R")) 1675 linetypes = c("solid", "dashed") 1676 par(mfrow=c(2, 1), family="serif") 1677 arma.plot_velocity( 1678 file.path("data", "velocity", "low-amp"), 1679 file.path("data", "velocity", "low-amp-0"), 1680 linetypes=linetypes, 1681 ylim=c(-2,2) 1682 ) 1683 arma.plot_velocity( 1684 file.path("data", "velocity", "high-amp"), 1685 file.path("data", "velocity", "high-amp-0"), 1686 linetypes=linetypes, 1687 ylim=c(-2,2) 1688 ) 1689 #+end_src 1690 1691 #+name: fig-velocity-field-2d 1692 #+caption[Velocity potential fields for small and large amplitude waves]: 1693 #+caption: Comparison of velocity potential fields obtained by generic formula 1694 #+caption: (\(u_1\)) and formula for small-amplitude waves (\(u_2\)): velocity 1695 #+caption: field for realisations containing small-amplitude (top) and 1696 #+caption: large-amplitude (bottom) waves. 1697 #+attr_latex: :width \textwidth 1698 #+RESULTS: fig-velocity-field-2d 1699 [[file:build/large-and-small-amplitude-velocity-field-comparison.pdf]] 1700 1701 ** Summary 1702 Formulae derived in this section allow to numerically compute velocity potential 1703 field (and hence pressure field) near discretely or mathematically given wavy 1704 sea surface, bypassing assumptions of linear wavy theory and theory of small 1705 amplitude waves. For small amplitude waves new formulae produce the same 1706 velocity potential field, as formulae from linear wave theory. For 1707 large-amplitude waves the usage of new formulae results in a shift of a region 1708 where the majority of wave energy is concentrated closer to the wave crest. 1709 1710 * High-performance software implementation of sea wave simulation 1711 :PROPERTIES: 1712 :CUSTOM_ID: sec-hpc 1713 :END: 1714 1715 ** SMP implementation 1716 *** Wavy surface generation 1717 **** Parallel AR, MA and LH model algorithms. 1718 :PROPERTIES: 1719 :CUSTOM_ID: sec-parallel 1720 :END: 1721 1722 Although, AR and MA models are part of the single mixed model they have 1723 disparate parallel algorithms, which are different from trivial one of LH model. 1724 AR algorithm consists in partitioning wavy surface into equally-sized parts in 1725 each dimension and computing them in parallel taking into account causal 1726 constraints imposed by autoregressive dependencies between surface points. There 1727 are no such dependencies in MA model, and its formula represents convolution of 1728 white noise with model coefficients, which is reduced to computation of three 1729 Fourier transforms via convolution theorem. So, MA algorithm consists in 1730 parallel computation of the convolution which is based on FFT computation. 1731 Finally, LH algorithm is made parallel by simply computing each wavy surface 1732 point in parallel in several threads. So, parallel implementation of ARMA model 1733 consists of two parallel algorithms, one for each part of the model, which are 1734 more sophisticated than the one for LH model. 1735 1736 AR model's formula main feature is autoregressive dependencies between wavy 1737 surface points in each dimension which prevent computing each surface point in 1738 parallel. Instead the surface is partitioned along each dimension into equal 1739 three-dimensional parts, and for each part information dependencies, which 1740 define computation order, are established. Figure\nbsp{}[[fig-ar-cubes]] shows these 1741 dependencies. An arrow denotes dependency of one part on availability of 1742 another, i.e.\nbsp{}computation of a part may start only when all parts on which 1743 it depends were computed. Here part \(A\) does not have dependencies, parts 1744 \(B\) and \(D\) depend only on \(A\), and part \(E\) depends on \(A\), \(B\) and 1745 \(C\). In general, each part depends on all parts that have previous index in at 1746 least one dimension (if such parts exist). The first part does not have any 1747 dependencies; and the size of each part along each dimension is made greater or 1748 equal to the corresponding number of coefficients along the dimension to 1749 consider only adjacent parts in dependency resolution. 1750 1751 #+name: fig-ar-cubes 1752 #+header: :width 5 :height 5 1753 #+begin_src R :file build/ar-cubes.pdf 1754 source(file.path("R", "common.R")) 1755 par(family="serif") 1756 arma.plot_ar_cubes_2d(3, 3, xlabel="Part index (X)", ylabel="Part index (Y)") 1757 #+end_src 1758 1759 #+name: fig-ar-cubes 1760 #+caption[Autoregressive dependencies between surface parts]: 1761 #+caption: Autoregressive dependencies between wavy surface parts. 1762 #+RESULTS: fig-ar-cubes 1763 [[file:build/ar-cubes.pdf]] 1764 1765 Each part has an three-dimensional index and a completion status. The algorithm 1766 starts by submitting objects containing this information into a queue. After 1767 that parallel threads start, each thread finds the first part for which all 1768 dependencies are satisfied (by checking the completion status of each part), 1769 removes the part from the queue, generates wavy surface for this part and sets 1770 completion status. The algorithm ends when the queue becomes empty. Access to 1771 the queue from different threads is synchronised by locks. The algorithm is 1772 suitable for SMP machines; for MPP all parts on which the current one depends, 1773 dependent parts needs to be copied to the node where computation is carried out. 1774 1775 So, parallel AR model algorithm reduces to implementing minimalistic job 1776 scheduler, in which 1777 - each job corresponds to a wavy surface part, 1778 - the order of execution of jobs is defined by autoregressive dependencies, and 1779 - job queue is processed by a simple thread pool in which each thread in a loop 1780 removes the first job from the queue for which all dependent jobs have 1781 completed and executes it. 1782 1783 In contrast to AR model, MA model does not have autoregressive dependencies 1784 between points; instead, each surface point depends on previous in time and 1785 space white noise values. MA model's formula allows for rewriting it as a 1786 convolution of white noise with the coefficients as a kernel. Using convolution 1787 theorem, the convolution is rewritten as inverse Fourier transform of the 1788 product of Fourier transforms of white noise and coefficients. Since the number 1789 of MA coefficients is much smaller than the number of wavy surface points, 1790 parallel FFT implementation is not suitable here, as it requires padding the 1791 coefficients with noughts to match the size of the surface. Instead, the surface 1792 is divided into parts along each dimension which are padded with noughts to 1793 match the number of the coefficients along each dimension multiplied by two. 1794 Then Fourier transform of each part is computed in parallel, multiplied by 1795 previously computed Fourier transform of the coefficients, and inverse Fourier 1796 transform of the result is computed. After that, each part is written to the 1797 output array with overlapping (due to padding) points added to each other. This 1798 algorithm is commonly known in signal processing as 1799 "overlap-add"\nbsp{}cite:oppenheim1989discrete,svoboda2011efficient,pavel2013algorithms. 1800 Padding with noughts is needed to prevent aliasing errors: without it the result 1801 would be circular convolution. 1802 1803 Despite the fact that MA model algorithm partitions the surface into the same 1804 parts (but possibly of different sizes) as AR model algorithm, the absence of 1805 autoregressive dependencies between them allows to compute them in parallel 1806 without the use of specialised job scheduler. However, this algorithm requires 1807 padding parts with noughts to make the result of calculations correspond to the 1808 result obtained with original MA model's formula. So, MA model's algorithm is 1809 more scalable to a large number of nodes as it has no information dependencies 1810 between parts, but the size of the parts is greater than in AR model's 1811 algorithm. 1812 1813 The distinct feature of LH model's algorithm is its simplicity: to make it 1814 parallel, wavy surface is partitioned into parts of equal sizes and each part is 1815 generated in parallel. There are no information dependencies between parts, 1816 which makes this algorithm particularly suitable for computation on GPU: each 1817 hardware thread simply computes its own point. In addition, sine and cosine 1818 functions in the model's formula are slow to compute on CPU, which makes GPU 1819 even more favourable choice. 1820 1821 To summarise, even though AR and MA models are part of the single mixed model, 1822 their parallel algorithms are fundamentally different and are more complicated 1823 than trivial parallel algorithm of LH model. Efficient implementation AR 1824 algorithm requires specialised job scheduler to manage autoregressive 1825 dependencies between wavy surface parts, whereas MA algorithm requires padding 1826 each part with noughts to be able to compute them in parallel. In contrast to 1827 these models, LH model has no information dependencies between parts, but 1828 requires more computational resources (floating point operations per seconds) 1829 for transcendental functions in its formula. 1830 1831 **** Parallel white noise generation algorithm. 1832 In order to eliminate periodicity from wavy surface generated by sea wave model, 1833 it is imperative to use PRNG with sufficiently large period to generate white 1834 noise. Parallel Mersenne Twister\nbsp{}cite:matsumoto1998mersenne with a period 1835 of \(2^{19937}-1\) is used as a generator in this work. It allows to produce 1836 aperiodic sea wavy surface realisations in any practical usage scenarios. 1837 1838 There is no guarantee that multiple PRNGs executed in parallel threads with 1839 distinct initial states produce uncorrelated pseudo-random number sequences, and 1840 algorithm of dynamic creation of Mersenne 1841 Twisters\nbsp{}cite:matsumoto1998dynamic is used to provide such guarantee. The 1842 essence of the algorithm is to find matrices of initial generator states, that 1843 give maximally uncorrelated pseudo-random number sequences when Mersenne 1844 Twisters are executed in parallel with these initial states. Since finding such 1845 initial states consumes considerable amount of processor time, vector of initial 1846 states is created beforehand with knowingly larger number of parallel threads 1847 and saved to a file, which is then read before starting white noise generation. 1848 1849 **** Ramp-up interval elimination. 1850 In AR model value of wavy surface elevation at a particular point depends on 1851 previous in space and time points, as a result the so called /ramp-up interval/ 1852 (see fig.\nbsp{}[[fig-ramp-up-interval]]), in which realisation does not correspond 1853 to specified ACF, forms in the beginning of the realisation. There are several 1854 solutions to this problem which depend on the simulation context. If realisation 1855 is used in the context of ship stability simulation without manoeuvring, ramp-up 1856 interval will not affect results of the simulation, because it is located on the 1857 border (too far away from the studied marine object). Alternative approach is to 1858 generate sea wavy surface on ramp-up interval with LH model and generate the 1859 rest of the realisation with AR model. If ship stability with manoeuvring is 1860 studied, then the interval may be simply discarded from the realisation (the 1861 size of the interval approximately equals the number of AR coefficients). 1862 However, this may lead to a loss of a very large number of points, because 1863 discarding occurs for each of three dimensions. 1864 1865 #+name: fig-ramp-up-interval 1866 #+begin_src R :file build/ramp-up-interval.pdf 1867 source(file.path("R", "common.R")) 1868 par(family="serif") 1869 arma.plot_ramp_up_interval() 1870 #+end_src 1871 1872 #+caption[Ramp-up interval at the beginning of AR process realisation]: 1873 #+caption: Ramp-up interval at the beginning of AR process realisation. 1874 #+name: fig-ramp-up-interval 1875 #+RESULTS: fig-ramp-up-interval 1876 [[file:build/ramp-up-interval.pdf]] 1877 1878 **** Performance of OpenMP and OpenCL implementations. 1879 :PROPERTIES: 1880 :header-args:R: :results output raw :exports results 1881 :END: 1882 1883 Differences in models' parallel algorithms make them efficient on different 1884 processor architectures, and to find the most efficient one, all the models were 1885 benchmarked in both CPU and GPU. 1886 1887 AR and MA models do not require highly optimised codes to be efficient, their 1888 performance is high even without use of co-processors; there are two main causes 1889 of that. First, these models do not use transcendental functions (sines, cosines 1890 and exponents) as opposed to LH model. All calculations (except model 1891 coefficients) are done via polynomials, which can be efficiently computed on 1892 modern processors using FMA instructions. Second, pressure computation is done 1893 via explicit formula using a number of nested FFTs. Since two-dimensional FFT of 1894 the same size is repeatedly applied to every time slice, its coefficients 1895 (complex exponents) are pre-computed one time for all slices, and further 1896 computations involve only a few transcendental functions. In case of MA model, 1897 performance is also increased by doing convolution with FFT. So, high 1898 performance of AR and MA models is due to scarce use of transcendental functions 1899 and heavy use of FFT, not to mention that high convergence rate and 1900 non-existence of periodicity allows to use far fewer coefficients compared to LH 1901 model. 1902 1903 #+name: tab-gpulab 1904 #+caption["Gpulab" system configuration]: 1905 #+caption: "Gpulab" system configuration. 1906 #+attr_latex: :booktabs t 1907 | CPU | AMD FX-8370 | 1908 | RAM | 16Gb | 1909 | GPU | GeForce GTX 1060 | 1910 | GPU memory | 6GB | 1911 | HDD | WDC WD40EZRZ, 5400rpm | 1912 | No. of CPU cores | 4 | 1913 | No. of threads per core | 2 | 1914 1915 Software implementation uses several libraries of mathematical functions, 1916 numerical algorithms and visualisation primitives (listed in 1917 table\nbsp{}[[tab-arma-libs]]), and was implemented using several parallel 1918 programming technologies (OpenMP, OpenCL) to have a possibility to choose the 1919 most efficient one. For each technology and each model an optimised wavy surface 1920 generation was implemented (except for MA model for which only OpenMP 1921 implementation was done). Velocity potential computation was done in OpenMP and 1922 was implemented in OpenCL only for real-time visualisation of wavy surface. For 1923 each technology the programme was recompiled and run multiple times and 1924 performance of each top-level subroutine was measured using system clock. 1925 1926 Results of benchmarks of the technologies are summarised in 1927 table\nbsp{}[[tab-arma-performance]]. All benchmarks were run on a machine equipped 1928 with a GPU, characteristics of which are summarised in table\nbsp{}[[tab-gpulab]]. 1929 All benchmarks were run with the same input parameters for all the models: 1930 realisation length 10000s and output grid size \(40\times40\)m. The only 1931 parameter that was different is the order (the number of coefficients): order of 1932 AR and MA model was \(7,7,7\) and order of LH model was \(40,40\). This is due 1933 to higher number of coefficient for LH model to eliminate periodicity. 1934 1935 In all benchmarks wavy surface generation and NIT take the most of the running 1936 time, whereas velocity potential calculation together with other subroutines 1937 only a small fraction of it. 1938 1939 #+name: tab-arma-libs 1940 #+caption[A list of libraries used in software implementation]: 1941 #+caption: A list of libraries used in software implementation. 1942 #+attr_latex: :booktabs t 1943 | Library | What it is used for | 1944 |--------------------------------------------------------------+---------------------------------| 1945 | DCMT\nbsp{}cite:matsumoto1998dynamic | parallel PRNG | 1946 | Blitz\nbsp{}cite:veldhuizen1997will,veldhuizen2000techniques | multidimensional arrays | 1947 | GSL\nbsp{}cite:gsl2008scientific | PDF, CDF, FFT computation | 1948 | | checking process stationarity | 1949 | clFFT\nbsp{}cite:clfft | FFT computation | 1950 | LAPACK, GotoBLAS\nbsp{}cite:goto2008high,goto2008anatomy | finding AR coefficients | 1951 | GL, GLUT\nbsp{}cite:kilgard1996opengl | three-dimensional visualisation | 1952 | CGAL\nbsp{}cite:fabri2009cgal | wave numbers interpolation | 1953 1954 AR model exhibits the highest performance in OpenMP and the lowest performance 1955 in OpenCL implementations, which is also the best and the worst performance 1956 across all model and framework combinations. In the most optimal model and 1957 framework combination AR performance is 4.5 times higher than MA performance, 1958 and 20 times higher than LH performance; in the most suboptimal 1959 combination\nbsp{}--- 137 times slower than MA and two times slower than LH. The 1960 ratio between the best (OpenMP) and the worst (OpenCL) AR model performance is 1961 several hundreds. This is explained by the fact that the model 1962 formula\nbsp{}eqref:eq-ar-process is efficiently mapped on the CPU architecture, 1963 which is distinguished from GPU architecture by having multiple caches, 1964 low-bandwidth memory and small number of floating point units compared to GPU. 1965 - This formula does not contain transcendental mathematical functions (sines, 1966 cosines and exponents), 1967 - all of the multiplications and additions in the formula can be implemented 1968 using FMA processor instructions, and 1969 - efficient use (locality) of cache is achieved by using Blitz library which 1970 implements optimised traversals for multidimensional arrays based on Hilbert 1971 space-filling curve. 1972 In contrast to CPU, GPU has less number of caches, high-bandwidth memory and 1973 large number of floating point units, which is the worst case scenario for AR 1974 model: 1975 - there are no transcendental functions which could compensate high memory 1976 latency, 1977 - there are FMA instructions in GPU but they do not improve performance due to 1978 high latency, and 1979 - optimal traversal of multidimensional arrays was not used due to a lack of 1980 libraries implementing it for a GPU. 1981 Finally, GPU architecture does not contain synchronisation primitives that allow 1982 to implement autoregressive dependencies between distinct wavy surface parts; 1983 instead of this a separate OpenCL kernel is launched for each part, and 1984 information dependency management between them is done on CPU side. So, in AR 1985 model case CPU architecture is superior compared to GPU due to better handling 1986 of complex information dependencies, simple calculations (multiplications and 1987 additions) and complex memory access patterns. 1988 1989 #+header: :results output raw :exports results 1990 #+name: tab-arma-performance 1991 #+begin_src R 1992 source(file.path("R", "benchmarks.R")) 1993 model_names <- list( 1994 ar.x="AR", 1995 ma.x="MA", 1996 lh.x="LH", 1997 ar.y="AR", 1998 ma.y="MA", 1999 lh.y="LH", 2000 Row.names="\\orgcmidrule{2-4}{5-6}Subroutine" 2001 ) 2002 row_names <- list( 2003 determine_coefficients="Determine coefficients", 2004 validate="Validate model", 2005 generate_surface="Generate wavy surface", 2006 nit="NIT", 2007 write_all="Write output to files", 2008 copy_to_host="Copy data from GPU", 2009 velocity="Compute velocity potentials" 2010 ) 2011 arma.print_openmp_vs_opencl(model_names, row_names) 2012 #+end_src 2013 2014 #+name: tab-arma-performance 2015 #+caption[Performance of OpenMP and OpenCL implementations]: 2016 #+caption: Running time (s.) for OpenMP and OpenCL implementations 2017 #+caption: of AR, MA and LH models. 2018 #+attr_latex: :booktabs t 2019 #+RESULTS: tab-arma-performance 2020 | | | | OpenMP | | OpenCL | 2021 | | <r> | <r> | <r> | <r> | <r> | 2022 | \orgcmidrule{2-4}{5-6}Subroutine | AR | MA | LH | AR | LH | 2023 |----------------------------------+------+------+--------+--------+--------| 2024 | Determine coefficients | 0.02 | 0.01 | 0.19 | 0.01 | 1.19 | 2025 | Validate model | 0.08 | 0.10 | | 0.08 | | 2026 | Generate wavy surface | 1.26 | 5.57 | 350.98 | 769.38 | 0.02 | 2027 | NIT | 7.11 | 7.43 | | 0.02 | | 2028 | Copy data from GPU | | | | 5.22 | 25.06 | 2029 | Compute velocity potentials | 0.05 | 0.05 | 0.06 | 0.03 | 0.03 | 2030 | Write output to files | 0.27 | 0.27 | 0.27 | 0.28 | 0.27 | 2031 2032 In contrast to AR model, LH model exhibits the best performance on GPU and the 2033 worst performance on CPU. The reasons for that are 2034 - the large number of transcendental functions in its formula which offset high 2035 memory latency, 2036 - linear memory access pattern which allows to vectorise calculations and 2037 coalesce memory accesses by different hardware threads, and 2038 - no information dependencies between wavy surface points. 2039 Despite the fact that GPU on the test system is more performant than CPU (in 2040 terms of floating point operations per second), the overall performance of LH 2041 model compared to AR model is lower. The reason for that is slow data transfer 2042 between GPU and CPU memory. 2043 2044 MA model is faster than LH model and slower than AR model. As the convolution in 2045 its formula is implemented using FFT, its performance depends on the performance 2046 of FFT implementation: GSL for CPU and clFFT for GPU. In this work 2047 performance of MA model on GPU was not tested due to unavailability of the 2048 three-dimensional FFT in clFFT library; if the transform was available, it could 2049 made the model even faster than AR. 2050 2051 NIT takes less time on GPU and more time on CPU, but taking data transfer 2052 between them into consideration makes their execution time comparable. This is 2053 explained by the large amount of transcendental functions that need to be 2054 computed for each wavy surface point to transform distribution of its 2055 \(z\)-coordinates. For each point a non-linear transcendental 2056 equation\nbsp{}eqref:eq-distribution-transformation is solved using bisection 2057 method. GPU performs this task several hundred times faster than CPU, but spends 2058 a lot of time to transfer the result back to the processor memory. So, the only 2059 possibility to optimise this routine is to use root finding method with 2060 quadratic convergence rate to reduce the number of transcendental functions that 2061 need to be computed. 2062 2063 **** I/O performance. 2064 :PROPERTIES: 2065 :header-args:R: :results output raw :exports results 2066 :CUSTOM_ID: sec-io-perf 2067 :END: 2068 2069 Although, in the benchmarks from the previous section writing data to files does 2070 not consume much of the running time, the use of network-mounted file systems 2071 may slow down this stage of the programme. To optimise it wavy surface parts are 2072 written to a file as soon as generation of the whole time slice is completed 2073 (fig.\nbsp{}[[fig-arma-io-events]]): a separate thread starts writing to files as 2074 soon as the first time slice is available and finishes it after the main thread 2075 group finishes the computation. The total time spent to perform I/O is 2076 increased, but the total programme running time is decreased, because the I/O is 2077 done in parallel to computation (table\nbsp{}[[tab-arma-io-performance]]). Using 2078 this approach with local file system has the same effect, but performance gain 2079 is lower. 2080 2081 #+header: :results output raw :exports results 2082 #+name: tab-arma-io-performance 2083 #+begin_src R 2084 source(file.path("R", "benchmarks.R")) 2085 suffix_names <- list( 2086 xfs.x="XFS", 2087 xfs.y="XFS", 2088 nfs.x="NFS", 2089 nfs.y="NFS", 2090 gfs.x="GlusterFS", 2091 gfs.y="GlusterFS", 2092 Row.names="\\orgcmidrule{2-4}{5-7}Subroutine" 2093 ) 2094 top_names <- c("I", "II") 2095 row_names <- list( 2096 generate_surface="Generate wavy surface", 2097 write_all="Write output to files" 2098 ) 2099 arma.print_sync_vs_async_io(suffix_names, row_names, top_names) 2100 #+end_src 2101 2102 #+name: tab-arma-io-performance 2103 #+caption[I/O performance for XFS, NFS and GlusterFS]: 2104 #+caption: Running time of subroutines (s.) with the use of XFS, NFS and 2105 #+caption: GlusterFS along with sequential (I) and parallel (II) file output. 2106 #+attr_latex: :booktabs t 2107 #+RESULTS: tab-arma-io-performance 2108 | | | | I | | | II | 2109 | | <r> | <r> | <r> | <r> | <r> | <r> | 2110 | \orgcmidrule{2-4}{5-7}Subroutine | XFS | NFS | GlusterFS | XFS | NFS | GlusterFS | 2111 |----------------------------------+------+------+-----------+------+------+-----------| 2112 | Generate wavy surface | 1.26 | 1.26 | 1.33 | 1.33 | 3.30 | 11.06 | 2113 | Write output to files | 0.28 | 2.34 | 10.95 | 0.00 | 0.00 | 0.00 | 2114 2115 #+name: fig-arma-io-events 2116 #+header: :height 9 :results output graphics 2117 #+begin_src R :file build/arma-io-events.pdf 2118 source(file.path("R", "benchmarks.R")) 2119 names <- list(xlab="Time, s", ylab="Thread") 2120 par(mfrow=c(3, 1), mar=c(4,4,4,0.5), family='serif', cex=1) 2121 arma.plot_io_events(names) 2122 #+end_src 2123 2124 #+name: fig-arma-io-events 2125 #+caption[Event graph for XFS, NFS and GlusterFS]: 2126 #+caption: Event graph for XFS, NFS and GlusterFS that shows time intervals 2127 #+caption: spent for I/O (red) and computation (black) by different threads. 2128 #+RESULTS: fig-arma-io-events 2129 [[file:build/arma-io-events.pdf]] 2130 2131 *** Velocity potential field computation 2132 **** Parallel velocity potential field computation. 2133 The benchmarks for AR, MA and LH models showed that velocity potential field 2134 computation consumes only a fraction of total programme execution time, however, 2135 the absolute computation time over a dense \(XY\) grid may be greater. One 2136 application where dense grid is used is real-time simulation and visualisation 2137 of wavy surface. Real-time visualisation allows to 2138 - adjust parameters of the model and ACF function, getting the result of the 2139 changes immediately, and 2140 - compare the size and the shape of regions where the most wave energy is 2141 concentrated. 2142 2143 Since visualisation is done by GPU, doing velocity potential computation on CPU 2144 may cause data transfer between memory of these two devices to become a 2145 bottleneck. To circumvent this, the programme uses OpenCL/OpenGL 2146 interoperability API which allows creating buffers, that are shared between 2147 OpenCL and OpenGL contexts thus eliminating copying of data between devices. In 2148 addition to this, three-dimensional velocity potential field 2149 formula\nbsp{}eqref:eq-phi-3d is particularly suitable for computation by GPUs: 2150 - it contains transcendental functions (hyperbolic cosines and complex 2151 exponents); 2152 - it is computed over large four-dimensional \((t,x,y,z)\) region; 2153 - it is explicit with no information dependencies between individual points in 2154 \(t\) and \(z\) dimensions. 2155 These considerations make velocity potential field computation on GPU 2156 advantageous in the application to real-time simulation and visualisation of 2157 wavy surface. 2158 2159 In order to investigate, how much the use of GPU can speed-up velocity potential 2160 field computation, we benchmarked simplified version of\nbsp{}eqref:eq-phi-3d: 2161 \begin{align} 2162 \label{eq-phi-linear} 2163 \phi(x,y,z,t) &= \InverseFourierY{ 2164 \frac{ \Sinh{\smash{2\pi \Kveclen (z+h)}} } 2165 { 2\pi\Kveclen \Sinh{\smash{2\pi \Kveclen h}} } 2166 \FourierY{ \zeta_t }{u,v} 2167 }{x,y}\nonumber \\ 2168 &= \InverseFourierY{ g_1(u,v) \FourierY{ g_2(x,y) }{u,v} }{x,y}. 2169 \end{align} 2170 Velocity potential computation code was rewritten in OpenCL and its performance 2171 was compared to an existing OpenMP implementation. 2172 2173 For each implementation running time of the corresponding subroutines and time 2174 spent for data transfer between devices was measured. Velocity potential field 2175 was computed for one \(t\) point, for 128 \(z\) points below wavy surface and 2176 for each \(x\) and \(y\) point of four-dimensional \((t,x,y,z)\) grid. Between 2177 programme runs the size of the grid along \(x\) dimension was varied. 2178 2179 A different FFT library was used for each implementation: GNU Scientific Library 2180 (GSL)\nbsp{}cite:galassi2015gnu for OpenMP and clFFT\nbsp{}cite:clfft for 2181 OpenCL. FFT routines from these libraries have the following features: 2182 - The order of frequencies in FFT is different. In case of clFFT library 2183 elements of the resulting array are additionally shifted to make it correspond 2184 to the correct velocity potential field. In case of GSL no shift is needed. 2185 - Discontinuity at \((x,y)=(0,0)\) is handled automatically by clFFT library, 2186 whereas GSL library produce skewed values at this point, thus in case of GSL 2187 these points are interpolated. 2188 Other differences of FFT subroutines, that have impact on performance, were not 2189 discovered. 2190 2191 **** Performance of velocity potential solver. 2192 :PROPERTIES: 2193 :header-args:R: :results output raw :exports results 2194 :END: 2195 2196 The experiments showed that OpenCL outperforms OpenMP implementation by a factor 2197 of 2--6 (fig.\nbsp{}[[fig-arma-realtime-graph]]), however, distribution of running 2198 time between subroutines is different (table\nbsp{}[[tab-arma-realtime]]). For CPU 2199 the most of the time is spent to compute \(g_1\), whereas for GPU time spent to 2200 compute \(g_1\) is comparable to \(g_2\). Copying the resulting velocity 2201 potential field between CPU and GPU consumes \(\approx{}20\%\) of the total 2202 field computation time. Computing \(g_2\) consumes the most of the execution 2203 time for OpenCL and the least of the time for OpenMP. In both implementations 2204 \(g_2\) is computed on CPU, because subroutine for derivative computation in 2205 OpenCL was not found. For OpenCL the result is duplicated for each \(z\) grid 2206 point in order to perform multiplication of all \(XYZ\) planes along \(z\) 2207 dimension in single OpenCL kernel, and, subsequently copied to GPU memory which 2208 has negative impact on performance. All benchmarks were run on a machine 2209 equipped with a GPU, characteristics of which are summarised in 2210 table\nbsp{}[[tab-storm]]. 2211 2212 #+name: tab-storm 2213 #+caption["Storm" system configuration]: 2214 #+caption: "Storm" system configuration. 2215 #+attr_latex: :booktabs t 2216 | CPU | Intel Core 2 Quad Q9550 | 2217 | RAM | 8Gb | 2218 | GPU | AMD Radeon R7 360 | 2219 | GPU memory | 2GB | 2220 | HDD | Seagate Barracuda, 7200 rpm | 2221 | No. of CPU cores | 4 | 2222 2223 #+name: fig-arma-realtime-graph 2224 #+header: :results output graphics 2225 #+begin_src R :file build/realtime-performance.pdf 2226 source(file.path("R", "benchmarks.R")) 2227 par(family="serif") 2228 data <- arma.load_realtime_data() 2229 arma.plot_realtime_data(data) 2230 title(xlab="Wavy surface size", ylab="Time, s") 2231 #+end_src 2232 2233 #+name: fig-arma-realtime-graph 2234 #+caption[Performance of velocity potential code]: 2235 #+caption: Performance comparison of CPU (OpenMP) and GPU (OpenCL) versions of 2236 #+caption: velocity potential calculation code. 2237 #+RESULTS: fig-arma-realtime-graph 2238 [[file:build/realtime-performance.pdf]] 2239 2240 The reason for different distribution of time between OpenCL and OpenMP 2241 subroutines is the same as for different AR model performance on CPU and GPU: 2242 GPU has more floating point units and modules for transcendental mathematical 2243 functions, than CPU, which are needed for computation of \(g_1\), but lacks 2244 caches which are needed to optimise irregular memory access pattern of \(g_2\). 2245 In contrast to AR model, performance of multidimensional derivative computation 2246 on GPU is easier to improve, as there are no information dependencies between 2247 points: in this work optimisation was not done due to unavailability of existing 2248 implementation. Additionally, such library may allow to efficiently compute the 2249 non-simplified formula entirely on GPU, since omitted terms also contain 2250 derivatives. 2251 2252 #+name: tab-arma-realtime 2253 #+begin_src R 2254 source(file.path("R", "benchmarks.R")) 2255 routine_names <- list( 2256 harts_g1="\\(g_1\\) function", 2257 harts_g2="\\(g_2\\) function", 2258 harts_fft="FFT", 2259 harts_copy_to_host="Copy data from GPU" 2260 ) 2261 column_names <- c("Subroutine", "OpenMP", "OpenCL") 2262 data <- arma.load_realtime_data() 2263 arma.print_table_for_realtime_data(data, routine_names, column_names) 2264 #+end_src 2265 2266 #+name: tab-arma-realtime 2267 #+caption[Velocity potential calculation performance]: 2268 #+caption: Running time (s.) of real-time velocity potential calculation subroutines 2269 #+caption: for wavy surface (OX size equals 16384). 2270 #+attr_latex: :booktabs t 2271 #+RESULTS: tab-arma-realtime 2272 | | <r> | <r> | 2273 | Subroutine | OpenMP | OpenCL | 2274 |--------------------+--------+--------| 2275 | \(g_1\) function | 4.6730 | 0.0038 | 2276 | \(g_2\) function | 0.0002 | 0.8253 | 2277 | FFT | 2.8560 | 0.3585 | 2278 | Copy data from GPU | | 2.6357 | 2279 2280 As expected, sharing the same buffer between OpenCL and OpenGL contexts 2281 increases overall solver performance by eliminating data transfer between CPU 2282 and GPU memory, but also requires for the data to be in vertex buffer object 2283 format, that OpenGL can operate on. Conversion to this format is fast, but the 2284 resulting array occupies more memory, since each point now is a vector with 2285 three components. The other disadvantage of using OpenCL and OpenGL together is 2286 the requirement for manual locking of shared buffer: failure to do so results in 2287 appearance of screen image artefacts which can be removed only by rebooting the 2288 computer. 2289 2290 *** Summary 2291 Benchmarks showed that GPU outperforms CPU in arithmetic intensive tasks, 2292 i.e.\nbsp{}tasks requiring large number of floating point operations per second, 2293 however, its performance degrades when the volume of data that needs to be 2294 copied between CPU and GPU memory increases or when memory access pattern 2295 differs from linear. The first problem may be solved by using a co-processor 2296 where high-bandwidth memory is located on the same die together with the 2297 processor and the main memory. This eliminates data transfer bottleneck, but may 2298 also increase execution time due to smaller number of floating point units. The 2299 second problem may be solved programmatically, if OpenCL library that computes 2300 multi-dimensional derivatives were available. 2301 2302 AR and MA models outperform LH model in benchmarks and does not require GPU to 2303 do so. From computational point of view their strengths are 2304 - absence of transcendental mathematical functions, and 2305 - simple algorithm for both AR and MA model, performance of which depends on the 2306 performance of multi-dimensional array library and FFT library. 2307 Providing main functionality via low-level libraries makes performance of the 2308 programme portable: support for new processor architectures can be added by 2309 substituting the libraries. Finally, using explicit formula makes velocity 2310 potential field computation consume only a small fraction of total programme 2311 execution time. If such formula did not exist or did not have all integrals as 2312 Fourier transforms, velocity potential field computation would consume much more 2313 time. 2314 2315 ** Fault-tolerant batch job scheduler 2316 *** System architecture 2317 **** Physical layer. 2318 Consists of nodes and direct/routed physical network links. On this layer full 2319 network connectivity, i.e.\nbsp{}an ability to send network packets between any 2320 pair of cluster nodes. 2321 2322 **** Daemon process layer. 2323 :PROPERTIES: 2324 :CUSTOM_ID: sec-daemon-layer 2325 :END: 2326 2327 Consists of daemon processes running on cluster nodes and hierarchical 2328 (master/slave) logical links between them. Only one daemon process is launched 2329 per node, so these terms are use interchangeably in this work. Master and slave 2330 roles are dynamically assigned to daemon processes, i.e.\nbsp{}any physical 2331 cluster node may become a master or a slave, or both simultaneously. Dynamic 2332 role assignment uses leader election algorithm that does not require periodic 2333 broadcasting of messages to all cluster nodes, instead the role is determined by 2334 node's IP address. Detailed explanation of the algorithm is provided 2335 in\nbsp{}[[#sec-node-discovery]]. Its strengths are scalability to a large number of 2336 nodes and low overhead, which makes it suitable for large-scale high-performance 2337 computations; its weakness is in artificial dependence of node's position in the 2338 hierarchy on its IP address, which makes it unsuitable for virtual environments, 2339 where nodes' IP addresses may change dynamically. 2340 2341 The only purpose of tree hierarchy is to balance the load between cluster nodes. 2342 The load is distributed from the current node to its neighbours in the hierarchy 2343 by simply iterating over them. When new links appear in the hierarchy or old 2344 links change (due to new node joining the cluster or due to node failure), 2345 daemon processes communicate each other the number of processes /behind/ the 2346 corresponding link in the hierarchy. This information is used to distribute the 2347 load evenly, even if distributed programme is launched on slave node. In 2348 addition, this hierarchy reduces the number of simultaneous network connections 2349 between nodes: only one network connection is established and maintained for 2350 each hierarchical link\nbsp{}--- this decreases probability of network 2351 congestion when there is a large number of nodes in the cluster. 2352 2353 Load balancing is implemented as follows. When node \(A\) tries to become a 2354 subordinate of node \(B\), it sends a message to a corresponding IP address 2355 telling how many daemon processes are already linked to it in the hierarchy 2356 (including itself). After all hierarchical links are established, every node has 2357 enough information to tell, how many nodes exist behind each link. If the link 2358 is between a slave and a master, and the slave wants to know, how many nodes are 2359 behind the link to the master, then it subtracts the total number of nodes 2360 behind all of its slave nodes from the total number of nodes behind the master 2361 to get the correct amount. To distribute kernels 2362 (see\nbsp{}sec.\nbsp{}[[#sec-kernel-layer]]) across nodes simple round-robin 2363 algorithm is used, i.e.\nbsp{}iterate over all links of the current node 2364 including the link to its master and taking into account the number of nodes 2365 behind each link. So, even if a programme is launched on a slave node in the 2366 bottom of the hierarchy, the kernels are distributed evenly between all cluster 2367 nodes. 2368 2369 The proposed approach can be extended to include sophisticated logic into load 2370 distribution algorithm. Along with the number of nodes behind the hierarchical 2371 link, any metric, that is required to implement this logic, can be sent. 2372 However, if an update of the metric happens more frequently, than a change in 2373 the hierarchy occurs, or it changes periodically, then status messages will be 2374 also sent more frequently. To ensure that this transfers do not affect 2375 programmes' performance, a separate physical network may be used to transmit the 2376 messages. The other disadvantage is that when reconfiguration of the hierarchy 2377 occurs, the kernels that are already sent to the nodes are not taken into 2378 account in the load distribution, so frequent hierarchy changes may cause uneven 2379 load on cluster nodes (which, however, balances over time). 2380 2381 Dynamic master/slave role assignment coupled with kernel distribution makes 2382 overall system architecture homogeneous within the framework of single cluster. 2383 On every node the same daemon is run, and no prior configuration is needed to 2384 construct a hierarchy of daemons processes running on different nodes. 2385 2386 **** Control flow objects layer. 2387 :PROPERTIES: 2388 :CUSTOM_ID: sec-kernel-layer 2389 :END: 2390 2391 The key feature that is missing in the current parallel programming technologies 2392 is a possibility to specify hierarchical dependencies between tasks executed in 2393 parallel. When such dependency exists, the principal task becomes responsible 2394 for re-executing a failed subordinate task on the survived cluster nodes. To 2395 re-execute a task which does not have a principal, a copy of it is created and 2396 sent to a different node (see\nbsp{}sec.\nbsp{}[[#sec-fail-over]]). There exists a 2397 number of systems that are capable of executing directed acyclic graphs of tasks 2398 in parallel\nbsp{}cite:acun2014charmpp,islam2012oozie, but graphs are not 2399 suitable to determine principal-subordinate relationship between tasks, because 2400 a node in the graph may have multiple incoming edges (and hence multiple 2401 principal nodes). 2402 2403 The main purpose of the proposed approach is to simplify development of 2404 distributed batch processing applications and middleware. The idea is to provide 2405 resilience to failures on the lowest possible level. The implementation is 2406 divided into two layers: the lower layer consists of routines and classes for 2407 single node applications (with no network interactions), and the upper layer for 2408 applications that run on an arbitrary number of nodes. There are two kinds of 2409 tightly coupled entities\nbsp{}--- /control flow objects/ (or /kernels/ for 2410 short) and /pipelines/\nbsp{}--- which form a basis of the programme. 2411 2412 Kernels implement control flow logic in theirs ~act~ and ~react~ methods and 2413 store the state of the current control flow branch. Both logic and state are 2414 implemented by a programmer. In ~act~ method some function is either directly 2415 computed or decomposed into nested function calls (represented by a set of 2416 subordinate kernels) which are subsequently sent to a pipeline. In ~react~ 2417 method subordinate kernels that returned from the pipeline are processed by 2418 their parent. Calls to ~act~ and ~react~ methods are asynchronous and are made 2419 within threads attached to a pipeline. For each kernel ~act~ is called only 2420 once, and for multiple kernels the calls are done in parallel to each other, 2421 whereas ~react~ method is called once for each subordinate kernel, and all the 2422 calls are made in the same thread to prevent race conditions (for different 2423 parent kernels different threads may be used). 2424 2425 Pipelines implement asynchronous calls to ~act~ and ~react~, and try to make as 2426 many parallel calls as possible considering concurrency of the platform (no. of 2427 cores per node and no. of nodes in a cluster). A pipeline consists of a kernel 2428 pool, which contains all the subordinate kernels sent by their parents, and a 2429 thread pool that processes kernels in accordance with rules outlined in the 2430 previous paragraph. A separate pipeline is used for each device: There are 2431 pipelines for parallel processing, schedule-based processing (periodic and 2432 delayed tasks), and a proxy pipeline for processing of kernels on other cluster 2433 nodes (see fig.\nbsp{}[[fig-subord-ppl]]). 2434 2435 In principle, kernels and pipelines machinery reflect the one of procedures and 2436 call stacks, with the advantage that kernel methods are called asynchronously 2437 and in parallel to each other (as much as programme logic allows). Kernel fields 2438 are stack local variables, ~act~ method is a sequence of processor instructions 2439 before nested procedure call, and ~react~ method is a sequence of processor 2440 instructions after the call. Constructing and sending subordinate kernels to the 2441 pipeline is nested procedure call. Two methods are necessary to make calls 2442 asynchronous, and replace active wait for completion of subordinate kernels with 2443 passive one. Pipelines, in turn, allow to implement passive wait, and call 2444 correct kernel methods by analysing their internal state. 2445 2446 #+name: fig-subord-ppl 2447 #+begin_src dot :exports results :file build/subord-ppl.pdf 2448 graph G { 2449 2450 node [fontname="Old Standard",fontsize=14,margin="0.055,0",shape=box] 2451 graph [nodesep="0.25",ranksep="0.25",rankdir="LR",margin=0] 2452 edge [arrowsize=0.66] 2453 2454 subgraph cluster_daemon { 2455 label="Daemon process" 2456 style=filled 2457 color=lightgrey 2458 graph [margin=8] 2459 2460 factory [label="Factory"] 2461 parallel_ppl [label="Parallel\npipeline"] 2462 io_ppl [label="I/O\npipeline"] 2463 sched_ppl [label="Schedule-based\npipeline"] 2464 net_ppl [label="Network\npipeline"] 2465 proc_ppl [label="Process\npipeline"] 2466 2467 upstream [label="Upstream\nthread pool"] 2468 downstream [label="Downstream\nthread pool"] 2469 } 2470 2471 factory--parallel_ppl 2472 factory--io_ppl 2473 factory--sched_ppl 2474 factory--net_ppl 2475 factory--proc_ppl 2476 2477 subgraph cluster_hardware { 2478 label="Compute devices" 2479 style=filled 2480 color=lightgrey 2481 graph [margin=8] 2482 2483 cpu [label="CPU"] 2484 core0 [label="Core 0"] 2485 core1 [label="Core 1"] 2486 core2 [label="Core 2"] 2487 core3 [label="Core 3"] 2488 2489 storage [label="Storage"] 2490 disk0 [label="Disk 0"] 2491 2492 network [label="Network"] 2493 nic0 [label="NIC 0"] 2494 2495 timer [label="Timer"] 2496 2497 } 2498 2499 core0--cpu 2500 core1--cpu 2501 core2--cpu 2502 core3--cpu 2503 2504 disk0--storage 2505 nic0--network 2506 2507 parallel_ppl--upstream 2508 parallel_ppl--downstream 2509 2510 upstream--{core0,core1,core2,core3} [style="dashed"] 2511 downstream--core0 [style="dashed"] 2512 2513 io_ppl--core0 [style="dashed"] 2514 io_ppl--disk0 [style="dashed"] 2515 sched_ppl--core0 [style="dashed"] 2516 sched_ppl--timer [style="dashed"] 2517 net_ppl--core0 [style="dashed"] 2518 net_ppl--nic0 [style="dashed"] 2519 proc_ppl--core0 [style="dashed"] 2520 2521 subgraph cluster_children { 2522 style=filled 2523 color=white 2524 2525 subgraph cluster_child0 { 2526 label="Child process 0" 2527 style=filled 2528 color=lightgrey 2529 labeljust=right 2530 2531 app0_factory [label="Factory"] 2532 app0 [label="Child process\rpipeline"] 2533 } 2534 2535 # subgraph cluster_child1 { 2536 # label="Child process 1" 2537 # style=filled 2538 # color=lightgrey 2539 # labeljust=right 2540 # 2541 # app1_factory [label="Factory"] 2542 # app1 [label="Child process\rpipeline"] 2543 # } 2544 } 2545 2546 proc_ppl--app0 2547 # proc_ppl--app1 2548 2549 app0_factory--app0 [constraint=false] 2550 # app1_factory--app1 [constraint=false] 2551 2552 } 2553 #+end_src 2554 2555 #+caption[Mapping of pipelines to compute devices]: 2556 #+caption: Mapping of parent and child process pipelines to compute devices. 2557 #+caption: Solid lines denote aggregation, dashed lines\nbsp{}--- mapping between 2558 #+caption: logical and physical entities. 2559 #+attr_latex: :width \textwidth 2560 #+name: fig-subord-ppl 2561 #+RESULTS: fig-subord-ppl 2562 [[file:build/subord-ppl.pdf]] 2563 2564 **** Software implementation. 2565 For efficiency reasons object pipeline and fault tolerance techniques (which 2566 will be described later) are implemented in the C++ framework: From the author's 2567 perspective C language is deemed low-level for distributed programmes, and Java 2568 incurs too much overhead and is not popular in HPC. The framework is called 2569 Bscheduler, it is now in proof-of-concept development stage. 2570 2571 **** Application programming interface. 2572 Each kernel has four types of fields (listed in table\nbsp{}[[tab-kernel-fields]]): 2573 - fields related to control flow, 2574 - fields defining the source location of the kernel, 2575 - fields defining the current location of the kernel, and 2576 - fields defining the target location of the kernel. 2577 2578 #+name: tab-kernel-fields 2579 #+caption[Description of kernel fields]: 2580 #+caption: Description of kernel fields. 2581 #+attr_latex: :booktabs t :align lp{0.7\textwidth} 2582 | Field | Description | 2583 |-------------------+------------------------------------------------------------------------------------------------| 2584 | ~process_id~ | Identifier of a cluster-wide process (application) a kernel belongs to. | 2585 | ~id~ | Identifier of a kernel within a process. | 2586 | ~pipeline~ | Identifier of a pipeline a kernel is processed by. | 2587 | | | 2588 | ~exit_code~ | A result of a kernel execution. | 2589 | ~flags~ | Auxiliary bit flags used in scheduling. | 2590 | ~time_point~ | A time point at which a kernel is scheduled to be executed. | 2591 | | | 2592 | ~parent~ | Address/identifier of a parent kernel. | 2593 | ~src_ip~ | IP-address of a source cluster node. | 2594 | ~from_process_id~ | Identifier of a cluster-wide process from which a kernel originated. | 2595 | | | 2596 | ~principal~ | Address/identifier of a target kernel (a kernel to which the current one is sent or returned). | 2597 | ~dst_ip~ | IP-address of a destination cluster node. | 2598 2599 Upon creation each kernel is assigned a parent and a pipeline. If there no other 2600 fields are set, then the kernel is an /upstream/ kernel\nbsp{}--- a kernel that 2601 can be distributed on any node and any processor core to exploit parallelism. If 2602 principal field is set, then the kernel is a /downstream/ kernel\nbsp{}--- a 2603 kernel that can only be sent to its principal, and a processor core to which the 2604 kernel is sent is defined by the principal memory address/identifier. If a 2605 downstream kernel is to be sent to another node, the destination IP-address must 2606 be set, otherwise the system will not find the target kernel. 2607 2608 When kernel execution completes (its ~act~ method finishes), the kernel is 2609 explicitly sent to some other kernel, this directive is explicitly called inside 2610 ~act~ method. Usually, after the execution completes a kernel is sent to its 2611 parent by setting ~principal~ field to the address/identifier of the parent, 2612 destination IP-address field to the source IP-address, and process identifier to 2613 the source process identifier. After that kernel becomes a downstream kernel and 2614 is sent by the system to the node, where its current principal is located 2615 without invoking load balancing algorithm. For downstream kernel ~react~ method 2616 of its parent is called by a pipeline with the kernel as the method argument to 2617 make it possible for a parent to collect the result of the execution. 2618 2619 There is no way to provide fine-grained resilience to cluster node failures, if 2620 there are downstream kernels in the programme, except the ones returning to 2621 their parents. Instead, an exit code of the kernel is checked and a 2622 user-provided recovery subroutine is executed. If there is no error checking, 2623 the system restarts execution from the first parent kernel, which did not 2624 produce any downstream kernels. This means, that if a problem being solved by 2625 the programme has information dependencies between parts that are computed in 2626 parallel, and a node failure occurs during computation of these parts, then this 2627 computation is restarted from the very beginning, discarding any already 2628 computed parts. This does not occur for embarrassingly parallel programmes, 2629 where parallel parts do not have such information dependencies between each 2630 other: in this case only parts from failed nodes are recomputed and all 2631 previously computed parts are retained. 2632 2633 Unlike ~main~ function in programmes based on MPI library, the first (the main) 2634 kernel is initially run only on one node, and other cluster nodes are used 2635 on-demand. This allows to use more nodes for highly parallel parts of the code, 2636 and less nodes for other parts. Similar choice is made in the design of big data 2637 frameworks\nbsp{}cite:dean2008mapreduce,vavilapalli2013yarn --- a user 2638 submitting a job does not specify the number of hosts to run its job on, and 2639 actual hosts are the hosts where input files are located. 2640 2641 **** Mapping of simulation models on system architecture. 2642 Software implementation of AR and MA models works as a computational pipeline, 2643 in which each joint applies some function to the output coming from the pipe of 2644 the previous joint. Joints are distributed across computer cluster nodes to 2645 enable function parallelism, and then data flowing through the joints is 2646 distributed across processor cores to enable data parallelism. 2647 Figure\nbsp{}[[fig-pipeline]] shows a diagram of such pipeline. Here rectangles with 2648 rounded corners denote joints, regular rectangles denote arrays of problem 2649 domain objects flowing from one joint to another, and arrows show flow 2650 direction. Some joints are divided into /sections/ each of which process a 2651 separate part of the array. If joints are connected without a /barrier/ 2652 (horizontal or vertical bar), then transfer of separate objects between them is 2653 done in parallel to computations, as they become available. Sections work in 2654 parallel on each processor core (or node of the cluster). There is surjective 2655 mapping between a set of processor cores, a set of pipeline joint sections and 2656 objects, i.e.\nbsp{}each processor core may run several sections, each of which 2657 may sequentially process several objects, but a section can not work 2658 simultaneously on several processor cores, and an object can not be processed 2659 simultaneously by several sections. So, the programme is a pipeline through 2660 which control objects flow. 2661 2662 #+name: fig-pipeline 2663 #+begin_src dot :exports results :file build/pipeline.pdf 2664 digraph { 2665 2666 node [fontname="Old Standard",fontsize=14,margin="0.055,0"] 2667 graph [nodesep="0.25",ranksep="0.25",rankdir="TB",margin=0] 2668 edge [arrowsize=0.66] 2669 2670 # data 2671 subgraph xcluster_linear { 2672 label="Linear model" 2673 2674 start [label="",shape=circle,style=filled,fillcolor=black,width=0.23] 2675 spectrum [label="S(ω,θ)",shape=box] 2676 acf [label="K(i,j,k)",shape=box] 2677 phi [label="Φ(i,j,k)",shape=box] 2678 2679 # transformations 2680 fourier_transform [label="Fourier transform",shape=box,style=rounded] 2681 solve_yule_walker [label="Solve Yule—Walker\nequations",shape=box,style=rounded] 2682 2683 subgraph cluster_nonlinear_1 { 2684 label="Simulate non-linearity\l" 2685 labeljust=left 2686 style=filled 2687 color=lightgrey 2688 graph [margin=8] 2689 acf2 [label="K*(i,j,k)",shape=box] 2690 transform_acf [label="Transform ACF",shape=box,style=rounded] 2691 } 2692 } 2693 2694 subgraph xcluster_linear2 { 2695 2696 eps_parts [label="<e1> ε₁|<e2> ε₂|<e3> …|<e4> εₙ|<e> ε(t,x,y)",shape=record] 2697 end [label="",shape=doublecircle,style=filled,fillcolor=black,width=0.23] 2698 2699 generate_white_noise [label="<g1> g₁|<g2> g₂|<g3> …|<g4> gₙ|<gen> Generate\lwhite noise",shape=record,style=rounded] 2700 generate_zeta [label="<g1> g₁|<g2> g₂|<g3> …|<g4> gₙ|<gen> Generate sea\lwavy surface parts\l",shape=record,style=rounded] 2701 2702 zeta_parts [label="<g1> ζ₁|<g2> ζ₂|<g3> …|<g4> ζₙ|<gen> Non-crosslinked\lrealisation parts",shape=record] 2703 overlap_add [label="<g1> ζ₁|<g2> ζ₂|<g3> …|<g4> ζₙ|<gen> Crosslink realisation\lparts\l",shape=record,style=rounded] 2704 2705 zeta_parts:g1->overlap_add:g1 2706 zeta_parts:g2->overlap_add:g2 2707 zeta_parts:g3->overlap_add:g3 2708 zeta_parts:g4->overlap_add:g4 2709 2710 zeta_parts:g2->overlap_add:g1 [constraint=false] 2711 zeta_parts:g3->overlap_add:g2 [constraint=false] 2712 zeta_parts:g4->overlap_add:g3 [constraint=false] 2713 2714 overlap_add:g1->zeta2_parts:g1 2715 overlap_add:g2->zeta2_parts:g2 2716 overlap_add:g3->zeta2_parts:g3 2717 overlap_add:g4->zeta2_parts:g4 2718 2719 zeta2_parts:g1->transform_zeta:g1->zeta3_parts:g1->write_zeta:g1->eps_end 2720 zeta2_parts:g2->transform_zeta:g2->zeta3_parts:g2->write_zeta:g2->eps_end 2721 zeta2_parts:g3->transform_zeta:g3->zeta3_parts:g3->write_zeta:g3->eps_end 2722 zeta2_parts:g4->transform_zeta:g4->zeta3_parts:g4->write_zeta:g4->eps_end 2723 2724 } 2725 2726 subgraph part3 { 2727 2728 zeta2_parts [label="<g1> ζ₁|<g2> ζ₂|<g3> …|<g4> ζₙ|<gen> Wavy surface with\lGaussian distribution\l",shape=record] 2729 2730 subgraph cluster_nonlinear_2 { 2731 label="Simulate non-linearity\r" 2732 labeljust=right 2733 style=filled 2734 color=lightgrey 2735 graph [margin=8] 2736 zeta3_parts [label="<g1> ζ₁|<g2> ζ₂|<g3> …|<g4> ζₙ|<gen> ζ(t,x,y)",shape=record] 2737 transform_zeta [label="<g1> g₁|<g2> g₂|<g3> …|<g4> gₙ|<gen> Transform wavy\lsurface elevation\lprobability distribution\l",shape=record,style=rounded] 2738 } 2739 2740 # barriers 2741 eps_start [label="",shape=box,style=filled,fillcolor=black,height=0.05] 2742 eps_end [label="",shape=box,style=filled,fillcolor=black,height=0.05] 2743 2744 write_zeta [label="<g1> g₁|<g2> g₂|<g3> …|<g4> gₙ|<gen> Write finished\lparts to a file\l",shape=record,style=rounded] 2745 } 2746 2747 # edges 2748 start->spectrum->fourier_transform->acf->transform_acf 2749 transform_acf->acf2 2750 acf2->solve_yule_walker 2751 solve_yule_walker->phi 2752 phi->eps_start [constraint=false] 2753 eps_start->generate_white_noise:g1 2754 eps_start->generate_white_noise:g2 2755 eps_start->generate_white_noise:g3 2756 eps_start->generate_white_noise:g4 2757 generate_white_noise:g1->eps_parts:e1->generate_zeta:g1->zeta_parts:g1 2758 generate_white_noise:g2->eps_parts:e2->generate_zeta:g2->zeta_parts:g2 2759 generate_white_noise:g3->eps_parts:e3->generate_zeta:g3->zeta_parts:g3 2760 generate_white_noise:g4->eps_parts:e4->generate_zeta:g4->zeta_parts:g4 2761 2762 eps_end->end 2763 } 2764 #+end_src 2765 2766 #+caption[Diagram of data processing pipeline]: 2767 #+caption: Diagram of data processing pipeline, that implements sea wavy 2768 #+caption: surface generation via AR model. 2769 #+name: fig-pipeline 2770 #+RESULTS: fig-pipeline 2771 [[file:build/pipeline.pdf]] 2772 2773 Object pipeline may be seen as an improvement of Bulk Synchronous Parallel (BSP) 2774 model\nbsp{}cite:valiant1990bridging, which is used in graph 2775 processing\nbsp{}cite:malewicz2010pregel,seo2010hama. Pipeline eliminates global 2776 synchronisation (where it is possible) after each sequential computation step by 2777 doing data transfer between joints in parallel to computations, whereas in BSP 2778 model global synchronisation occurs after each step. 2779 2780 Object pipeline speeds up the programme by parallel execution of code blocks 2781 that work with different compute devices: while the current part of wavy surface 2782 is generated by a processor, the previous part is written to a disk. This 2783 approach allows to get speed-up (see\nbsp{}sec.\nbsp{}[[#sec-io-perf]]) because 2784 compute devices operate asynchronously, and their parallel usage increases the 2785 whole programme performance. 2786 2787 Since data transfer between pipeline joints is done in parallel to computations, 2788 the same pipeline may be used to run several copies of the application but with 2789 different parameters (generate several sea wavy surfaces having different 2790 characteristics). In practise, high-performance applications do not always 2791 consume 100% of processor time spending a portion of time on synchronisation of 2792 parallel processes and writing data to disk. Using pipeline in this case allows 2793 to run several computations on the same set of processes, and use all of the 2794 computer devices at maximal efficiency. For example, when one object writes data 2795 to a file, the other do computations on the processor in parallel. This 2796 minimises downtime of the processor and other computer devices and increases 2797 overall throughput of the computer cluster. 2798 2799 Pipelining of otherwise sequential steps is beneficial not only for code working 2800 with different devices, but for code different branches of which are suitable 2801 for execution by multiple hardware threads of the same processor core, 2802 i.e.\nbsp{}branches accessing different memory blocks or performing mixed 2803 arithmetic (integer and floating point). Code branches which use different CPU 2804 modules are good candidates to run in parallel on a processor core with multiple 2805 hardware threads. 2806 2807 So, computational model with a pipeline can be seen as /bulk-asynchronous 2808 model/, because of the parallel nature of programme steps. This model is the 2809 basis of the fault-tolerance model which will be described later. 2810 2811 *** Cluster node discovery 2812 :PROPERTIES: 2813 :CUSTOM_ID: sec-node-discovery 2814 :END: 2815 2816 **** Leader election algorithms. 2817 Many batch job scheduling systems are built on the principle of /subordination/: 2818 there is master node in each cluster which manages job queue, schedules job 2819 execution on subordinate nodes and monitors their state. Master role is assigned 2820 either /statically/ by an administrator to a particular physical node, or 2821 /dynamically/ by periodically electing one of the cluster nodes as master. In 2822 the former case fault tolerance is provided by reserving additional spare node 2823 which takes master role when current master fails. In the latter case fault 2824 tolerance is provided by electing new master node from survived nodes. Despite 2825 the fact that dynamic role assignment requires leader election algorithm, this 2826 approach becomes more and more popular as it does not require spare reserved 2827 nodes to recover from master node 2828 failure\nbsp{}cite:hunt2010zookeeper,lakshman2010cassandra,divya2013elasticsearch 2829 and generally leads to a symmetric system architecture, in which the same 2830 software stack with the same configuration is installed on every 2831 node\nbsp{}cite:boyer2012glusterfs,ostrovsky2015couchbase. 2832 2833 Leader election algorithms (which sometimes referred to as /distributed 2834 consensus/ algorithms are special cases of wave algorithms. 2835 In\nbsp{}cite:tel2000introduction the author defines them as algorithms in which 2836 termination event is preceded by at least one event occurring in /each/ parallel 2837 process. Wave algorithms are not defined for anonymous networks, that is they 2838 apply only to processes that can uniquely define themselves. However, the number 2839 of processes affected by the "wave" can be determined in the course of an 2840 algorithm. For a distributed system this means that wave algorithms work for 2841 computer clusters with dynamically changing number of nodes, and the algorithm 2842 is unaffected by some nodes going on-line and off-line. 2843 2844 The approach for cluster node discovery does not use wave algorithms, and hence 2845 does not require communicating with each node of the cluster to determine a 2846 leader. Instead, each node enumerates all nodes in the network it is part of, 2847 and converts this list to a /tree hierarchy/ with a user-defined /fan-out/ value 2848 (maximal number of subordinate nodes a node may have). Then the node determines 2849 its hierarchy level and tries to communicate with nodes from higher levels to 2850 become their subordinate. First, it checks the closest ones and then goes all 2851 the way to the top. If there is no top-level nodes or the node cannot connect to 2852 them, then the node itself becomes the master of the whole hierarchy. 2853 2854 Tree hierarchy of all hosts in a network defines strict total order on a set of 2855 cluster nodes. Although, technically any function can be chosen to map a node to 2856 a number, in practise this function should be sufficiently smooth along the time 2857 axis and may have infrequent jumps: high-frequency oscillations (which are often 2858 caused by measurement errors) may result in constant passing of master role from 2859 one node to another, which makes the hierarchy unusable for load balancing. The 2860 simplest such function is the position of an IP address in network IP address 2861 range. 2862 2863 **** Tree hierarchy creation algorithm. 2864 Strict total order on the set \(\mathcal{N}\) of cluster nodes connected to a 2865 network is defined as 2866 \begin{equation*} 2867 \forall n_1 \forall n_2 \in \mathcal{N}, 2868 \forall f \colon \mathcal{N} \rightarrow \mathcal{R}^n 2869 \Rightarrow (f(n_1) < f(n_2) \Leftrightarrow \neg (f(n_1) \geq f(n_2))), 2870 \end{equation*} 2871 where \(f\) maps a node to its level and operator \(<\) defines strict total 2872 order on \(\mathcal{R}^n\). Function \(f\) defines node's sequential number, and 2873 \(<\) makes this number unique. 2874 2875 The simplest function \(f\) maps each node to its IP address position in network 2876 IP address range. Without the use of tree hierarchy a node with the lowest 2877 position in this range becomes the master. If IP-address of a node occupies the 2878 first position in the range, then there is no master for it, and it continues to 2879 be at the top of the hierarchy until it fails. Although, IP address mapping is 2880 simple to implement, it introduces artificial dependence of the master role on 2881 the address of a node. Still, it is useful for initial configuration of a 2882 cluster when more complex mappings are not applicable. 2883 2884 To make node discovery scale to a large number of nodes, IP address range is 2885 mapped to a tree hierarchy. In this hierarchy each node is uniquely identified 2886 by its hierarchy level \(l\), which it occupies, and offset \(o\), which equals 2887 to the sequential number of node on its level. Values of level and offset are 2888 computed from the following optimisation problem. 2889 \begin{align*} 2890 n = \sum\limits_{i=0}^{l(n)} p^i + o(n), \quad 2891 l \rightarrow \min, \quad 2892 o \rightarrow \min, \quad 2893 l \geq 0, \quad 2894 o \geq 0 2895 \end{align*} 2896 where \(n\) is the position of node's IP address in network IP address range and 2897 \(p\) is fan-out value (the maximal number of subordinates, a node can have). 2898 The master of a node with level \(l\) and offset \(o\) has level \(l-1\) and 2899 offset \(\lfloor{o/p}\rfloor\). The distance between any two nodes in the tree 2900 hierarchy with network positions \(i\) and \(j\) is computed as 2901 \begin{align*} 2902 & \langle 2903 \text{lsub}(l(j), l(i)), \quad 2904 \left| o(j) - o(i)/p \right| 2905 \rangle,\\ 2906 & \text{lsub}(l_1, l_2) = 2907 \begin{cases} 2908 \infty & \quad \text{if } l_1 \geq l_2, \\ 2909 l_1 - l_2 & \quad \text{if } l_1 < l_2. 2910 \end{cases} 2911 \end{align*} 2912 The distance is compound to account for level in the first place. 2913 2914 To determine its master, each node ranks all nodes in the network according to 2915 their position \(\langle{l(n),o(n)}\rangle\), and using distance formula chooses 2916 the node which is closest to potential master position and has lower level. That 2917 way IP addresses of off-line nodes are skipped, however, for sparse networks (in 2918 which nodes occupy non-contiguous IP addresses) perfect tree is not guaranteed. 2919 2920 Formula for computing distance can be made arbitrary complex (e.g.\nbsp{}to 2921 account for network latency and throughput or geographical location of the 2922 node), however, for its simplest form it is more efficient to use cluster node 2923 /traversal algorithm/. The algorithm requires less memory as there is no need to 2924 store ranked list of all nodes, but iterates over IP addresses of the network in 2925 the order defined by the fan-out value. The algorithm is as follows. First, the 2926 /base/ node (a node which searches for its master) computes its potential master 2927 address and tries to connect to this node. If the connection fails, the base 2928 node sequentially tries to connect to each node from the higher hierarchy 2929 levels, until it reaches the top of the hierarchy (the root of the tree). If 2930 none of the connections succeed, the base node sequentially connects to all 2931 nodes on its own level having lower position in IP address range. If none of the 2932 nodes respond, the base node becomes the master node of the whole hierarchy, and 2933 the traversal repeats after a set period of time. An example of traversal order 2934 for a cluster of 11 nodes and a tree hierarchy with fan-out value of 2 is shown 2935 in\nbsp{}fig.\nbsp{}[[fig-tree-hierarchy-11]]. 2936 2937 #+name: fig-tree-hierarchy-11 2938 #+begin_src dot :exports results :file build/tree-hierarchy-11.pdf 2939 digraph { 2940 2941 node [fontname="Old Standard",fontsize=14,margin="0.055,0.055",shape=box,style=rounded] 2942 graph [nodesep="0.30",ranksep="0.30",rankdir="BT"] 2943 edge [arrowsize=0.66] 2944 2945 m1 [label="10.0.0.1"] 2946 m2 [label="10.0.0.2"] 2947 m3 [label="10.0.0.3"] 2948 m4 [label="10.0.0.4"] 2949 m5 [label="10.0.0.5"] 2950 m6 [label="10.0.0.6"] 2951 m7 [label="10.0.0.7"] 2952 m8 [label="10.0.0.8"] 2953 m9 [label="10.0.0.9"] 2954 m10 [label="10.0.0.10",fillcolor="#c0c0c0",style="filled,rounded"] 2955 m11 [label="10.0.0.11",shape=Mrecord] 2956 2957 m2->m1 2958 m3->m1 2959 m4->m2 2960 m5->m2 2961 m6->m3 2962 m7->m3 2963 m8->m4 2964 m9->m4 2965 m10->m5 2966 m11->m5 2967 2968 m5->m4->m6 [style="dashed,bold",color="#c00000",constraint=false] 2969 {rank=same; m6->m7 [style="dashed,bold",color="#c00000"]} 2970 m7->m2->m3->m1->m8->m9 [style="dashed,bold",color="#c00000",constraint=false] 2971 2972 } 2973 #+end_src 2974 2975 #+caption[Tree hierarchy for 11 nodes]: 2976 #+caption: Tree hierarchy for 11 nodes with fan-out of 2. 2977 #+caption: Red arrows denote hierarchy traversal order for a node 2978 #+caption: with IP address 10.0.0.10. 2979 #+name: fig-tree-hierarchy-11 2980 #+RESULTS: fig-tree-hierarchy-11 2981 [[file:build/tree-hierarchy-11.pdf]] 2982 2983 **** Evaluation results. 2984 To benchmark performance of traversal algorithm on large number of nodes, 2985 several daemon processes were launched on each physical cluster node, each bound 2986 to its own IP address. The number of processes per physical core varied from 2 2987 to 16. Each process was bound to a particular physical core to reduce overhead 2988 of process migration between cores. The algorithm has low requirements for 2989 processor time and network throughput, so running multiple processes per 2990 physical core is feasible, in contrast to HPC codes, where it often lowers 2991 performance. Test platform configuration is shown in table\nbsp{}[[tab-ant]]. 2992 2993 #+name: tab-ant 2994 #+caption["Ant" system configuration]: 2995 #+caption: "Ant" system configuration. 2996 #+attr_latex: :booktabs t 2997 | CPU | Intel Xeon E5440, 2.83GHz | 2998 | RAM | 4Gb | 2999 | HDD | ST3250310NS, 7200rpm | 3000 | No. of nodes | 12 | 3001 | No. of CPU cores per node | 8 | 3002 3003 Similar approach was used in 3004 in\nbsp{}cite:lantz2010network,handigol2012reproducible,heller2013reproducible 3005 where the authors reproduce various real-world experiments using virtual 3006 clusters, based on Linux namespaces, and compare the results to physical ones. 3007 The advantage of it is that the tests can be performed on a large virtual 3008 cluster using relatively small number of physical nodes. The advantage of the 3009 approach used in this work (which does not use Linux namespaces) is that it is 3010 more lightweight and larger number of daemon processes can be benchmarked on the 3011 same physical cluster. 3012 3013 Traversal algorithm performance was evaluated by measuring time needed for all 3014 nodes of the cluster to discover each other, i.e.\nbsp{}the time needed for the 3015 tree hierarchy of nodes to reach stable state. Each change of the hierarchy, as 3016 seen by each node, was written to a log file and after a set amount of time all 3017 daemon processes (each of which models cluster node) were forcibly terminated. 3018 Daemon processes were launched sequentially with a 100ms delay to ensure that 3019 master nodes are always come online before subordinates and hierarchy does not 3020 change randomly as a result of different start time of each process. This 3021 artificial delay was subsequently subtracted from the results. So, benchmark 3022 results represent discovery time in an "ideal" cluster, in which every daemon 3023 process always finds its master on the first try. 3024 3025 The benchmark was run multiple times varying the number of daemon processes per 3026 cluster node. The experiment showed that discovery of 512 nodes (8 physical 3027 nodes with 64 processes per node) each other takes no more than two seconds 3028 (fig.\nbsp{}[[fig-discovery-benchmark]]). This value does not change significantly 3029 with the increase in the number of physical nodes. Using more than 8 nodes with 3030 64 processes per node causes large variation in discovery time due to a large 3031 total number of processes connecting simultaneously to one master process (a 3032 fan-out value of 10000 was used for all tests), so these results were excluded 3033 from consideration. 3034 3035 #+name: fig-discovery-benchmark 3036 #+header: :width 7 :height 5 3037 #+begin_src R :file build/discovery-benchmark.pdf 3038 source(file.path("R", "discovery.R")) 3039 par(family="serif") 3040 bscheduler.plot_discovery(xlabel="No. of physical nodes",toplabel="ppn") 3041 #+end_src 3042 3043 #+caption[Time needed to discover all daemon processes on the cluster]: 3044 #+caption: Time needed to discover all daemon processes running on the cluster 3045 #+caption: depending on the number of daemon processes. Dashed lines denote 3046 #+caption: minimum and maximum values, "ppn" is the number of daemon 3047 #+caption: processes per node. 3048 #+name: fig-discovery-benchmark 3049 #+RESULTS: fig-discovery-benchmark 3050 [[file:build/discovery-benchmark.pdf]] 3051 3052 **** Discussion. 3053 Traversal algorithm scales to a large number of nodes, because in order to 3054 determine its master, a node is required to communicate to a node address of 3055 which it knows beforehand. Communication with other nodes occurs only when the 3056 current master node fails. So, if cluster nodes occupy contiguous addresses in 3057 network IP address range, each node connects only to its master, and inefficient 3058 scan of the whole network by each node does not occur. 3059 3060 The following key features distinguish the proposed approach with respect to 3061 some existing 3062 approaches\nbsp{}cite:brunekreef1996design,aguilera2001stable,romano2014design. 3063 - *Multi-level hierarchy.* The number of master nodes in a network depends on 3064 the fan-out value. If it is lesser than the number of IP-addresses in the 3065 network, then there are multiple master nodes in the cluster. If it is 3066 greater or equal to the number of IP-addresses in the network, then there is 3067 only one master node. When some node fail, multi-level hierarchy changes 3068 locally, and only nodes that are adjacent to the failed one communicate. 3069 However, node weight changes propagate to every node in the cluster via 3070 hierarchical links. 3071 - *IP-address mapping.* Since hierarchy structure solely depends on the nodes' 3072 IP addresses, there is no election phase in the algorithm. To change the 3073 master each node sends a message to the old master and to the new one. 3074 - *Completely event-based.* The messages are sent only when some node joins the 3075 cluster or fails, so there is no constant load on the network. Since the 3076 algorithm allows to tolerate failure of sending any message, there is no need 3077 in heartbeat packets indicating node serviceability in the network; instead, 3078 all messages play the role of heartbeats and packet send time-out is 3079 adjusted\nbsp{}cite:rfc5482. 3080 - *No manual configuration.* A node does not require any prior knowledge to find 3081 the master: it determines the network it is part of, calculates potential 3082 master IP-address and sends the message. If it fails, the process is repeated 3083 for the next potential master node. So, the algorithm is able to bootstrap a 3084 cluster (make all nodes aware of each other) without prior manual 3085 configuration, the only requirement is to assign an IP address to each node 3086 and start the daemon process on it. 3087 To summarise, the advantage of the algorithm is that it 3088 - scales to a large number of nodes by means of hierarchy with multiple 3089 master nodes, 3090 - does not constantly load the network with node status messages and heartbeat 3091 packets, 3092 - does not require manual configuration to bootstrap a cluster. 3093 3094 The disadvantage of the algorithm is that it requires IP-address to change 3095 infrequently, in order to be useful for load balancing. It is not suitable for 3096 cloud environments in which node DNS name is preserved, but IP-address may 3097 change over time. When IP-address changes, current connections may close, and 3098 node hierarchy is rebuilt. After each change in the hierarchy only newly created 3099 kernels are distributed using the new links, old kernels continue to execute on 3100 whatever nodes they were sent. So, environments where an IP address is not an 3101 unique identifier of the node are not suitable for the algorithm. 3102 3103 The other disadvantage is that the algorithm creates artificial dependence of 3104 node rank on an IP-address: it is difficult to substitute IP-address mapping 3105 with a more sophisticated one. If the mapping uses current node and network load 3106 to infer node ranks, measurement errors may result in unstable hierarchy, and 3107 the algorithm cease to be fully event-based as load levels need to be periodically 3108 collected on every node and propagated to each node in the cluster. 3109 3110 Node discovery algorithm is designed to balance the load on a cluster of 3111 compute nodes (see\nbsp{}sec.\nbsp{}[[#sec-daemon-layer]]), its use in other 3112 applications is not studied here. When distributed or parallel programme starts 3113 on any of cluster nodes, its kernels are distributed to all adjacent nodes in 3114 the hierarchy (including master node if applicable). To distribute the load 3115 evenly when the application is run on a subordinate node, each node maintains 3116 the weight of each adjacent node in the hierarchy. The weight equals to the 3117 number of nodes in the tree "behind" the adjacent node. For example, if the 3118 weight of the first adjacent node is 2, then round-robin load balancing 3119 algorithm distributes two kernels to the first node before moving to the next 3120 one. 3121 3122 To summarise, traversal algorithm is 3123 - designed to ease load balancing on the large number of cluster nodes, 3124 - fully fault-tolerant, because the state of every node can be recomputed at any 3125 time, 3126 - fully event-based as it does not overload the network by periodically sending 3127 state update messages. 3128 3129 *** Fail over algorithm 3130 :PROPERTIES: 3131 :CUSTOM_ID: sec-fail-over 3132 :END: 3133 3134 **** Checkpoints. 3135 Node failures in a distributed system are divided into two types: failure of a 3136 slave node and failure of a master node. In order for a job running on the 3137 cluster to survive slave node failure, job scheduler periodically creates 3138 checkpoints and writes them to a stable (redundant) storage. In order to create 3139 the checkpoint, the scheduler temporarily suspends all parallel processes of the 3140 job, copies all memory pages and all internal operating system kernel structures 3141 allocated for these processes to disk, and resumes execution of the job. In 3142 order to survive master node failure, job scheduler daemon process continuously 3143 copies its internal state to a backup node, which becomes the master after the 3144 failure. 3145 3146 There are many works on improving performance of 3147 checkpoints\nbsp{}cite:egwutuoha2013survey, and alternative approaches do not 3148 receive much attention. Usually HPC applications use message passing for 3149 parallel processes communication and store their state in global memory space, 3150 hence there is no way one can restart a failed process from its current state 3151 without writing the whole memory image to disk. Usually the total number of 3152 processes is fixed by the job scheduler, and all parallel processes restart upon 3153 a failure. There is ongoing effort to make it possible to restart only the 3154 failed process\nbsp{}cite:meyer2012radic by restoring them from a checkpoint on 3155 the surviving nodes, but this may lead to overload if there are other processes 3156 on these nodes. Theoretically, process restart is not needed, if the job can 3157 proceed on the surviving nodes, however, message passing library does not allow 3158 to change the number of processes at runtime, and most programmes assume this 3159 number to be constant. So, there is no reliable way to provide fault tolerance 3160 in message passing library other than restarting all parallel processes from a 3161 checkpoint. 3162 3163 At the same time, there is a possibility to continue execution of a job on 3164 lesser number of nodes than it was initially requested by implementing fault 3165 tolerance on job scheduler level. In this case master and slave roles are 3166 dynamically distributed between scheduler's daemon processes running on each 3167 cluster node, forming a tree hierarchy of cluster nodes, and parallel programme 3168 consists of kernels which use node hierarchy to dynamically distribute the load 3169 and use their own hierarchy to restart kernels upon node failure. 3170 3171 **** Dynamic role distribution. 3172 Fault tolerance of a parallel programme is one of the problems which is solved 3173 by big data and HPC job schedulers, however, most schedulers provide fault 3174 tolerance for slave nodes only. These types of failures are routinely handled by 3175 restarting the affected job (from a checkpoint) or its part on the remaining 3176 nodes, and failure of a principal node is often considered either improbable, or 3177 too complicated to handle and configure on the target platform. System 3178 administrators often find alternatives to application level fault tolerance: 3179 they isolate principal process of the scheduler from the rest of the cluster 3180 nodes by placing it on a dedicated machine, or use virtualisation technologies 3181 instead. All these alternatives complexify configuration and maintenance, and by 3182 decreasing probability of a machine failure resulting in a whole system failure, 3183 they increase probability of a human error. 3184 3185 From such point of view it seems more practical to implement master node fault 3186 tolerance at application level, but there is no proven generic solution. Most 3187 implementations are too tied to a particular application to become universally 3188 applicable. Often a cluster is presented as a machine, which has a distinct 3189 control unit (master node) and a number of identical compute units (slave 3190 nodes). In practice, however, master and slave nodes are physically the same, 3191 and are distinguished only by batch job scheduler processes running on them. 3192 Realisation of the fact that cluster nodes are merely compute units allows to 3193 implement middleware that distributes master and slave roles automatically and 3194 handles node failures in a generic way. This software provides an API to 3195 distribute kernels between currently available nodes. Using this API one can 3196 write a programme that runs on a cluster without knowing the exact number of 3197 working nodes. The middleware works as a cluster operating system in user space, 3198 allowing to execute distributed applications transparently on any number of 3199 nodes. 3200 3201 **** Symmetric architecture. 3202 Many distributed key-value stores and parallel file systems have symmetric 3203 architecture, in which master and slave roles are dynamically distributed, so 3204 that any node can act as a master when the current master node 3205 fails\nbsp{}cite:ostrovsky2015couchbase,divya2013elasticsearch,boyer2012glusterfs,anderson2010couchdb,lakshman2010cassandra. 3206 However, this architecture is still not used in big data and HPC job schedulers. 3207 For example, in YARN scheduler\nbsp{}cite:vavilapalli2013yarn master and slave 3208 roles are static. Failure of a slave node is tolerated by restarting a part of a 3209 job, that worked on it, on one of the surviving nodes, and failure of a master 3210 node is tolerated by setting up standby master 3211 node\nbsp{}cite:murthy2011architecture. Both master nodes are coordinated by 3212 Zookeeper service which uses dynamic role assignment to ensure its own 3213 fault-tolerance\nbsp{}cite:okorafor2012zookeeper. So, the lack of dynamic role 3214 distribution in YARN scheduler complicates the whole cluster configuration: if 3215 dynamic roles were available, Zookeeper would be redundant in this 3216 configuration. 3217 3218 The same problem occurs in HPC job schedulers where master node (where the main 3219 job scheduler process is run) is the single point of failure. 3220 In\nbsp{}cite:uhlemann2006joshua,engelmann2006symmetric the authors replicate 3221 job scheduler state to a backup node to make the master node highly available, 3222 but backup node role is assigned statically. This solution is close to symmetric 3223 architecture, because it does not involve external service to provide high 3224 availability, but far from ideal in which backup node is dynamically chosen. 3225 3226 Finally, the simplest master node high availability is implemented in VRRP 3227 protocol (Virtual Router Redundancy 3228 Protocol)\nbsp{}cite:rfc2338,rfc3768,rfc5798. Although VRRP protocol does 3229 provide dynamic role distribution, it is designed to be used by routers and 3230 reverse proxy servers behind them. Such servers lack the state (a job queue) 3231 that needs to be restored upon node failure, so it is easier for them to provide 3232 high availability. The protocol can be implemented even without routers using 3233 Keepalived daemon\nbsp{}cite:cassen2002keepalived instead. 3234 3235 Symmetric architecture is beneficial for job schedulers because it allows to 3236 - make physical nodes interchangeable, 3237 - implement dynamic distribution of master and slave node roles, and 3238 - implement automatic recovery after a failure of any node. 3239 The following sections describe the components that are required to write 3240 parallel programme and job scheduler, that can tolerate failure of cluster 3241 nodes. 3242 3243 **** Definitions of hierarchies. 3244 To disambiguate hierarchical links between daemon processes and kernels and to 3245 simplify the discussion, the following naming conventions are used throughout 3246 the text. If the link is between two daemon processes, the relationship is 3247 denoted as /master-slave/. If the link is between two kernels, then the 3248 relationship is denoted as either /principal-subordinate/ or /parent-child/. Two 3249 hierarchies are orthogonal to each other in a sense that no kernel may have a 3250 link to a daemon, and vice versa. Since daemon process hierarchy is used to 3251 distribute the load on cluster nodes, kernel hierarchy is mapped onto it, and 3252 this mapping can be arbitrary: It is common to have principal kernel on a slave 3253 node with its subordinate kernels distributed evenly between all cluster nodes 3254 (including the node where the principal is located). Both hierarchies can be 3255 arbitrarily deep, but "shallow" ones are preferred for highly parallel 3256 programmes, as there are less number of hops when kernels are distributed 3257 between cluster nodes. Since there is one-to-one correspondence between daemons 3258 and cluster nodes, they are used interchangeably in the text. 3259 3260 **** Handling nodes failures. 3261 Basic method to overcome a failure of a subordinate node is to restart 3262 corresponding kernels on a healthy node (Erlang language uses the same method to 3263 restart failed subordinate processes\nbsp{}cite:armstrong2003thesis). In order 3264 to implement this method in the framework of kernel hierarchy, sender node saves 3265 every kernel that is sent to remote cluster nodes, and in an event of a failure 3266 of any number of nodes, where kernels were sent, their copies are redistributed 3267 between the remaining nodes without custom handling by a programmer. If there 3268 are no nodes to sent kernels to, they are executed locally. In contrast to 3269 "heavy-weight" checkpoint/restart machinery employed by HPC cluster job 3270 schedulers, tree hierarchy of nodes coupled with hierarchy of kernels allows to 3271 automatically and transparently handle any number of subordinate node failures 3272 without restarting any processes of a parallel programme, and processes act as 3273 resource allocation units. 3274 3275 A possible way of handling failure of the main node (a node where the main 3276 kernel is executed) is to replicate the main kernel to a backup node, and make 3277 all updates to its state propagate to the backup node by means of a distributed 3278 transaction, but this approach does not correlate with asynchronous nature of 3279 kernels and too complex to implement. In practice, however, the main kernel 3280 usually does not perform operations in parallel, it is rather sequentially 3281 executes programme steps one by one, hence it has only one subordinate at a 3282 time. (Each subordinate kernel represents sequential computational step which 3283 may or may not be internally parallel.) Keeping this in mind, one can simplify 3284 synchronisation of the main kernel state: send the main kernel along with its 3285 subordinate to the subordinate node. Then if the main node fails, the copy of 3286 the main kernel receives its subordinate (because both of them are on the same 3287 node) and no time is spent on recovery. When the subordinate node, to which 3288 subordinate kernel together with the copy of the main kernel was sent, fails, 3289 the subordinate kernel is sent to a survived node, and in the worst case the 3290 current computational step is executed again. 3291 3292 The approach described above is designed for kernels that do not have a parent 3293 and have only one subordinate at a time, which means that it functions as 3294 checkpoint mechanism. The advantage of this approach is that it 3295 - saves programme state after each sequential step, when memory footprint of a 3296 programme is low, 3297 - saves only relevant data, and 3298 - uses memory of a subordinate node rather than disk storage. 3299 This simple approach allows to tolerate at most one failure of /any/ cluster node 3300 per computational step or arbitrary number of subordinate nodes at any time 3301 during programme execution. 3302 3303 For a four-node cluster and a hypothetical parallel programme failure recovery 3304 algorithm works as follows (fig.\nbsp{}[[fig-fail-over-example]]). 3305 1. *Initial state.* Initially, computer cluster does not need to be configured 3306 except setting up local network. The algorithm assumes full connectivity of 3307 cluster nodes, and works best with tree network topologies in which several 3308 network switches connect all cluster nodes. 3309 2. *Build node hierarchy.* When the cluster is bootstrapped, daemon processes 3310 start on all cluster nodes and collectively build their hierarchy 3311 superimposed on the topology of cluster network. Position of a daemon process 3312 in the hierarchy is defined by the position of its node IP address in the 3313 network IP address range. To establish hierarchical link each process 3314 connects to its assumed master process. The hierarchy is changed only when a 3315 new node joins the cluster or an existing node fails. 3316 3. *Launch main kernel.* The first kernel launches on one of the subordinate 3317 nodes (node \(B\)). Main kernel may have only one subordinate at a time, and 3318 backup copy of the main kernel is sent along with the subordinate kernel 3319 \(T_1\) to master node \(A\). \(T_1\) represents one sequential step of a 3320 programme. There can be any number of sequential steps in a programme, and 3321 when node \(A\) fails, the current step is restarted from the beginning. 3322 4. *Launch subordinate kernels.* Kernels \(S_1\), \(S_2\), \(S_3\) are launched 3323 on subordinate cluster nodes. When node \(B\), \(C\) or \(D\) fails, 3324 corresponding main kernel restarts failed subordinates (\(T_1\) restarts 3325 \(S_1\), master kernel restarts \(T_1\) etc.). When node \(B\) fails, master 3326 kernel is recovered from backup copy. 3327 3328 #+name: fig-fail-over-example 3329 #+header: :headers '("\\input{preamble}") 3330 #+begin_src latex :file build/fail-over-example.pdf :exports results :results raw 3331 \input{tex/preamble} 3332 \newcommand*{\spbuInsertFigure}[1]{% 3333 \vspace{2\baselineskip}% 3334 \begin{minipage}{0.5\textwidth}% 3335 \Large% 3336 \input{#1}% 3337 \end{minipage}% 3338 }% 3339 \noindent% 3340 \spbuInsertFigure{tex/cluster-0}~\spbuInsertFigure{tex/frame-0}\newline 3341 \spbuInsertFigure{tex/frame-3}~\spbuInsertFigure{tex/frame-4}\newline 3342 \spbuInsertFigure{tex/legend} 3343 #+end_src 3344 3345 #+caption[Working scheme of fail over algorithm]: 3346 #+caption: Working scheme of fail over algorithm. 3347 #+name: fig-fail-over-example 3348 #+attr_latex: :width \textwidth 3349 #+RESULTS: fig-fail-over-example 3350 [[file:build/fail-over-example.pdf]] 3351 3352 **** Evaluation results. 3353 Fail over algorithm was evaluated on physical cluster (table\nbsp{}[[tab-ant]]) on 3354 the example of distributed AR model application, which is described in detail in 3355 section\nbsp{}[[#sec-arma-mpp]]. The application consists of a series of functions, each 3356 of which is applied to the result of the previous one. Some of the functions are 3357 computed in parallel, so the programme is written as a sequence of steps, some 3358 of which are made internally parallel to get better performance. In the 3359 programme only the most compute-intensive step (the surface generation) is 3360 executed in parallel across all cluster nodes, and other steps are executed in 3361 parallel across all cores of the master node. 3362 3363 The application was rewritten for the distributed version of the framework which 3364 required adding read/write methods to each kernel which is sent over the network 3365 and slight modifications to handle failure of a node with the main kernel. The 3366 kernel was marked so that the framework makes a replica and sends it to some 3367 subordinate node along with its subordinate kernel. Other code changes involved 3368 modifying some parts to match the new API. So, providing fault tolerance by 3369 means of kernel hierarchy is mostly transparent to the programmer: it demands 3370 explicit marking of the main kernel and adding code to read and write kernels to 3371 the byte buffer. 3372 3373 In a series of experiments performance of the new version of the application in 3374 the presence of different types of failures was benchmarked: 3375 - no failures, 3376 - failure of a slave node (a node where a copy of the main kernel is stored), 3377 - failure of a master node (a node where the main kernel is run). 3378 Only two directly connected cluster nodes were used for the benchmark. Node 3379 failure was simulated by sending ~SIGKILL~ signal to the daemon process on the 3380 corresponding node right after the copy of the main kernel is made. The 3381 application immediately recognised node as offline, because the corresponding 3382 connection was closed; in real-world scenario, however, the failure is detected 3383 only after a configurable TCP user timeout\nbsp{}cite:rfc5482. The execution 3384 time of these runs were compared to the run without node failures. Results are 3385 presented in fig.\nbsp{}[[fig-master-slave-failure]], and a diagram of how kernels 3386 are distributed between two nodes is shown in 3387 fig.\nbsp{}[[fig-master-slave-backup]]. 3388 3389 #+name: fig-master-slave-backup 3390 #+begin_src dot :exports results :file build/master-slave-backup.pdf 3391 digraph { 3392 3393 node [fontname="Old Standard",fontsize=14,margin="0.055,0.055",shape=box,style=rounded] 3394 graph [nodesep="0.30",ranksep="0.30",rankdir="BT"] 3395 edge [arrowsize=0.66] 3396 3397 m1 [label="{{<master>Master | 10.0.0.1} | {<main>M | <slave1>S₁ | <slave3>S₃}}",shape=record] 3398 m2 [label="{{<slave>Slave | 10.0.0.2} | {M' | <step>N | <slave2>S₂ | <slave4>S₄}}",shape=record] 3399 3400 m2->m1 3401 3402 } 3403 #+end_src 3404 3405 #+name: fig-master-slave-backup 3406 #+caption[Mapping of kernels on master and slave nodes]: 3407 #+caption: Master and slave nodes and mapping of main kernel \(M\), 3408 #+caption: a copy of main kernel \(M'\), current step kernel \(N\) 3409 #+caption: and subordinate kernels \(S_{1,2,3}\) on them. 3410 #+RESULTS: fig-master-slave-backup 3411 [[file:build/master-slave-backup.pdf]] 3412 3413 As expected, there is considerable difference in application performance for 3414 different types of failures. In case of slave node failure the main kernel as 3415 well as some subordinate kernels (that were distributed to the slave node) are 3416 lost, but master node has a copy of the main kernel and uses it to continue the 3417 execution. So, in case of slave node failure nothing is lost except performance 3418 potential of the slave node. In case of master node failure, a copy of the main 3419 kernel as well as the subordinate kernel, which carried the copy, are lost, but 3420 slave node has the original main kernel and uses it to restart execution of the 3421 current sequential step, i.e.\nbsp{}send the subordinate kernel to one of the 3422 survived cluster nodes (in case of two directly connected nodes, it sends the 3423 kernel to itself). So, application performance is different, because the number 3424 of kernels that are lost as a result of a failure as well as their roles are 3425 different. 3426 3427 Slave node failure needs some time to be discovered: it is detected only when 3428 subordinate kernel carrying a copy of the main kernel finishes its execution and 3429 tries to reach its parent. Instant detection requires forcible termination of 3430 the subordinate kernel which may be inapplicable for programmes with complicated 3431 logic. 3432 3433 #+name: fig-master-slave-failure 3434 #+begin_src R :file build/master-slave-failure.pdf 3435 source(file.path("R", "benchmarks.R")) 3436 par(family="serif") 3437 data <- arma.load_master_slave_failure_data() 3438 arma.plot_master_slave_failure_data( 3439 data, 3440 list( 3441 master="Bscheduler (master failure)", 3442 slave="Bscheduler (slave failure)", 3443 nofailures="Bscheduler (no failures)" 3444 ) 3445 ) 3446 title(xlab="Wavy surface size", ylab="Time, s") 3447 #+end_src 3448 3449 #+caption[Performance of a programme in the presence of node failures]: 3450 #+caption: Performance of AR model in the presence of node failures. 3451 #+name: fig-master-slave-failure 3452 #+RESULTS: fig-master-slave-failure 3453 [[file:build/master-slave-failure.pdf]] 3454 3455 To summarise, if failure occurs right after a copy if the main kernel is made, 3456 only a small fraction of performance is lost no matter whether the main kernel 3457 or its copy is lost. 3458 3459 **** Discussion of test results. 3460 Since failure is simulated right after the first subordinate kernel reaches its 3461 destination (a node where it is supposed to be executed), slave node failure 3462 results in a loss of a small fraction of performance; in real-world scenario, 3463 where failure may occur in the middle of wavy surface generation, performance 3464 loss due to /backup/ node failure (a node where a copy of the main kernel is 3465 located) would be higher. Similarly, in real-world scenario the number of 3466 cluster nodes is larger, hence less amount of subordinate kernels is lost due to 3467 master node failure, so performance penalty is lower. In the benchmark the 3468 penalty is higher for the slave node failure, which is the result of absence of 3469 parallelism in the beginning of AR model wavy surface generation: the first part 3470 is computed sequentially, and other parts are computed only when the first one 3471 is available. So, the loss of the first subordinate kernel delays execution of 3472 every dependent kernel in the programme. 3473 3474 Fail over algorithm guarantees to handle one failure per sequential programme 3475 step, more failures can be tolerated if they do not affect the master node. The 3476 algorithm handles simultaneous failure of all subordinate nodes, however, if 3477 both master and backup node fail, there is no chance for a programme to continue 3478 the work. In this case the state of the current computation step is lost, and 3479 the only way to restore it is to restart the application from the beginning 3480 (which is currently not implemented in Bscheduler). 3481 3482 Kernels are means of abstraction that decouple distributed application from 3483 physical hardware: it does not matter how many cluster nodes are currently 3484 available for a programme to run without interruption. Kernels eliminate the 3485 need to allocate a physical live spare node to tolerate master node failures: in 3486 the framework of kernel hierarchy any physical node (except master) can act as a 3487 live spare. Finally, kernels allow to handle failures in a way that is 3488 transparent to a programmer, deriving the order of actions from the internal 3489 state of a kernel. 3490 3491 The experiments show that it is essential for a parallel programme to have 3492 multiple sequential steps to make it resilient to cluster node failures, 3493 otherwise failure of backup node, in fact triggers recovery of the initial state 3494 of the programme. Although, the probability of a master node failure is lower 3495 than the probability of a failure of any of the slave nodes, it does not justify 3496 loosing all the data when the long-running programme is near completion. In 3497 general, the more sequential steps one has in a parallel programme the less time 3498 is lost in an event of a backup node failure, and the more parallel parts each 3499 sequential step has the less time is lost in case of a master or slave node 3500 failure. In other words, the more nodes a programme uses the more resilient to 3501 their failures it becomes. 3502 3503 Although it is not shown in the experiments, Bscheduler does not only provide 3504 tolerance to cluster node failures, but allows for new nodes to automatically 3505 join the cluster and receive their portion of kernels from the already running 3506 programmes. This is trivial process in the context of the framework as it does 3507 not involve restarting failed kernels or copying their state, so it has not been 3508 studied experimentally in this work. 3509 3510 Theoretically, hierarchy-based fault tolerance based can be implemented on top 3511 of the message-passing library without loss of generality. Although it would be 3512 complicated to reuse free nodes instead of failed ones in the framework of this 3513 library, as the number of nodes is often fixed in such libraries, allocating 3514 reasonably large number of nodes for the programme would be enough to make it 3515 fault-tolerant. At the same time, implementing hierarchy-based fault tolerance 3516 inside message-passing library itself is not practical, because it would require 3517 saving the state of a parallel programme which equals to the total amount of 3518 memory it occupies on each cluster node, which in turn would not make it more 3519 efficient than checkpoints. 3520 3521 The weak point of the proposed method is the period of time starting from a 3522 failure of master node up to the moment when the failure is detected, the main 3523 kernel is restored and new subordinate kernel with the parent's copy is received 3524 by a slave node. If backup node fails within this time frame, execution state of 3525 a programme is completely lost, and there is no way to recover it other than 3526 restarting the programme from the beginning. The duration of the dangerous 3527 period can be minimised, but the probability of an abrupt programme termination 3528 can not be fully eliminated. This result is consistent with the scrutiny of 3529 /impossibility theory/, in the framework of which it is proved the impossibility 3530 of the distributed consensus with one faulty 3531 process\nbsp{}cite:fischer1985impossibility and impossibility of reliable 3532 communication in the presence of node 3533 failures\nbsp{}cite:fekete1993impossibility. 3534 3535 *** Summary 3536 Current state-of-the-art approach to developing and running parallel programmes 3537 on the cluster is the use of message passing library and job scheduler, and 3538 despite the fact that this approach is highly efficient in terms of parallel 3539 computation, it is not flexible enough to accommodate dynamic load balancing and 3540 automatic fault-tolerance. Programmes written with message passing library 3541 typically assume 3542 - equal load on each processor, 3543 - non-interruptible and reliable execution of batch jobs, and 3544 - constant number of parallel processes throughout the execution which is equal 3545 to the total number of processors. 3546 The first assumption does not hold for sea wave simulation programme because AR 3547 model requires dynamic load balancing between processors to generate each part 3548 of the surface only when all dependent parts have already been generated. The 3549 last assumption also does not hold, because for the sake of efficiency each part 3550 is written to a file asynchronously by a separate thread. The remaining 3551 assumption is not related to the programme itself, but to the job scheduler, and 3552 does not generally hold for very large computer clusters in which node failures 3553 occur regularly, and job scheduler restores the failed job from the checkpoint 3554 greatly increasing its running time. The idea of the proposed approach is to 3555 give parallel programmes more flexibility: 3556 - provide dynamic load balancing via pipelined execution of sequential, 3557 internally parallel programme steps, 3558 - restart only processes that were affected by node failure, and 3559 - execute the programme on as many compute nodes as are available in the 3560 cluster. 3561 In this section advantages and disadvantages of this approach are discussed. 3562 3563 In comparison to portable batch systems (PBS) the proposed approach uses 3564 lightweight control flow objects instead of heavy-weight parallel jobs to 3565 distribute the load on cluster nodes. First, this allows to have per-node kernel 3566 queues instead of several cluster-wide job queues. The granularity of control 3567 flow objects is much higher than of the batch jobs, and despite the fact that 3568 their execution time cannot be reliably predicted (as is execution time of batch 3569 jobs\nbsp{}cite:zotkin1999job), objects from multiple parallel programmes can be 3570 dynamically distributed between the same set of cluster nodes, thus making the 3571 load more even. The disadvantage is that this requires more RAM to execute many 3572 programmes on the same set of nodes, and execution time of each programme may be 3573 greater because of the shared control flow object queues. Second, the proposed 3574 approach uses dynamic distribution of master and slave roles between cluster 3575 nodes instead of their static assignment to the particular physical nodes. This 3576 makes nodes interchangeable, which is required to provide fault tolerance. 3577 Simultaneous execution of multiple parallel programmes on the same set of nodes 3578 increases throughput of the cluster, but also decreases their performance taken 3579 separately; dynamic role distribution is the base on which resilience to 3580 failures builds. 3581 3582 In comparison to MPI the proposed approach uses lightweight control flow objects 3583 instead of heavy-weight processes to decompose the programme into individual 3584 entities. First, this allows to determine the number of entities computed in 3585 parallel based on the problem being solved, not the computer or cluster 3586 architecture. A programmer is encouraged to create as many objects as needed, 3587 guided by the algorithm or restrictions on the size of data structures from the 3588 problem domain. In sea wave simulation programme the minimal size of each wavy 3589 surface part depends on the number of coefficients along each dimension, and at 3590 the same time the number of parts should be larger than the number of processors 3591 to make the load on each processor more even. Considering these limits the 3592 optimal part size is determined at runtime, and, in general, is not equal the 3593 number of parallel processes. The disadvantage is that the more control flow 3594 objects there are in the programme, the more shared data structures 3595 (e.g.\nbsp{}coefficients) are copied to the same node with subordinate objects; 3596 this problem is solved by introducing another intermediate layer of objects, 3597 which in turn adds more complexity to the programme. Second, hierarchy of 3598 control flow objects together with hierarchy of nodes allows for automatic 3599 recomputation of failed objects on surviving nodes in an event of hardware 3600 failure. It is possible because the state of the programme execution is stored 3601 in objects and not in global variables like in MPI programmes. By duplicating 3602 the state to a subordinate node, the system recomputes only objects from 3603 affected processes instead of the whole programme. Transition from processes to 3604 control flow objects increases performance of a parallel programme via dynamic 3605 load balancing, but may inhibit its scalability for a large number of nodes and 3606 large amount of shared structures due to duplication of these structures. 3607 3608 Decomposition of a parallel programme into individual entities is done in 3609 Charm++ framework\nbsp{}cite:kale2012charm (with load 3610 balancing\nbsp{}cite:bhandarkar2001adaptive) and in actor 3611 model\nbsp{}cite:hewitt1973universal,agha1985actors, but none of the approaches 3612 uses hierarchical relationship to restart entity processing after a failure. 3613 Instead of using tree hierarchy of kernels, these approaches allow exchanging 3614 messages between any pair of entities. Such irregular message exchange pattern 3615 makes it impossible to decide which object is responsible for restarting a 3616 failed one, hence non-universal fault tolerance techniques are used 3617 instead\nbsp{}cite:lifflander2014scalable. 3618 3619 Three building blocks of the proposed approach\nbsp{}--- control flow objects, 3620 pipelines and hierarchies\nbsp{}--- complement each other. Without control flow 3621 objects carrying programme state it is impossible to recompute failed 3622 subordinate objects and provide fault tolerance. Without node hierarchy it is 3623 impossible to distribute the load between cluster nodes, because all nodes are 3624 equal without the hierarchy. Without pipelines for each device it is impossible 3625 to execute control flow objects asynchronously and implement dynamic load 3626 balancing. These three entities form a closed system, in which programme logic 3627 is implemented in kernels or pipelines, failure recovery logic\nbsp{}--- in 3628 kernel hierarchy, and data transfer logic\nbsp{}--- in cluster node hierarchy. 3629 3630 ** MPP implementation 3631 :PROPERTIES: 3632 :CUSTOM_ID: sec-arma-mpp 3633 :END: 3634 3635 **** Distributed AR model algorithm. 3636 :PROPERTIES: 3637 :CUSTOM_ID: sec-distributed 3638 :END: 3639 3640 This algorithm, unlike its parallel counterpart, employs copying of data to 3641 execute computation on different cluster nodes, and since network throughput is 3642 much lower than memory bandwidth, the size of data that is sent over the network 3643 have to be optimised to get better performance than on SMP system. One way to 3644 accomplish this is to distribute wavy surface parts between cluster nodes 3645 copying in the coefficients and all the boundary points, and copying out 3646 generated wavy surface part. Autoregressive dependencies prevent from creating 3647 all the parts at once and statically distributing them between cluster nodes, so 3648 the parts are created dynamically on the first node, when points on which they 3649 depend become available. So, distributed AR model algorithm is a master-slave 3650 algorithm\nbsp{}cite:lusk2010more in which the master dynamically creates tasks 3651 for each wavy surface part taking into account autoregressive dependencies 3652 between points and sends them to slaves, whereas slaves compute each wavy 3653 surface part and send them back to the master. 3654 3655 In MPP implementation each task is modelled by a kernel: there is a principal 3656 kernel that creates subordinate kernels on demand, and subordinate kernels that 3657 generate wavy surface parts. In ~act~ method of principal kernel a subordinate 3658 kernel is created for the first part\nbsp{}--- a part that does not depend on 3659 any points. When this kernel returns, the principal kernel in ~react~ method 3660 determines which parts can be computed in turn, creates a subordinate kernel for 3661 each part and sends them to the pipeline. In ~act~ method of subordinate kernel 3662 wavy surface part is generated and then the kernel sends itself back to the 3663 principal. The ~react~ method of subordinate kernel is empty. 3664 3665 Distributed AR algorithm implementation has several advantages over the parallel 3666 one. 3667 - Pipelines automatically distribute subordinate kernels between available 3668 cluster nodes, and the main programme does not have to deal with these 3669 implementation details. 3670 - There is no need to implement minimalistic job scheduler, which determines 3671 execution order of jobs (kernels) taking into account autoregressive 3672 dependencies: the order is fully defined in ~react~ method of the principal 3673 kernel. 3674 - There is no need in separate version of the programme for SMP machine, the 3675 implementation works transparently on any number of nodes, even if job 3676 scheduler is not running. 3677 3678 **** Performance of distributed AR model implementation. 3679 Distributed AR model implementation was benchmarked on the two nodes of "Ant" 3680 system (table\nbsp{}[[tab-ant]]). To increase network throughput these nodes were 3681 directly connected to each other and maximum transmission unit (MTU) was set to 3682 9200 bytes. Two cases were considered: with one Bscheduler daemon process 3683 running on the local node, and with two daemon processes running on each node. 3684 The performance of the programme was compared to the performance of OpenMP 3685 version running on single node. 3686 3687 Bscheduler outperforms OpenMP in both cases 3688 (fig.\nbsp{}[[fig-bscheduler-performance]]). In case of one node the higher 3689 performance is explained by the fact that Bscheduler does not scan the queue for 3690 wavy surface parts for which dependencies are ready (as in parallel version of 3691 the algorithm), but for each part updates a counter of completed parts on which 3692 it depends. The same approach can be used in OpenMP version, but was discovered 3693 only for newer Bscheduler version, as queue scanning can not be performed 3694 efficiently in this framework. In case of two nodes the higher performance is 3695 explained by greater total number of processor cores (16) and high network 3696 throughput of the direct network link. So, Bscheduler implementation of 3697 distributed AR model algorithm is faster on shared memory system due to more 3698 efficient autoregressive dependencies handling and its performance on 3699 distributed memory system scales to a larger number of cores due to small data 3700 transmission overhead of direct network link. 3701 3702 #+name: fig-bscheduler-performance 3703 #+begin_src R :file build/bscheduler-performance.pdf 3704 source(file.path("R", "benchmarks.R")) 3705 par(family="serif") 3706 data <- arma.load_bscheduler_performance_data() 3707 arma.plot_bscheduler_performance_data( 3708 data, 3709 list( 3710 openmp="OpenMP", 3711 bsc1="Bscheduler (single node)", 3712 bsc2="Bscheduler (two nodes)" 3713 ) 3714 ) 3715 title(xlab="Wavy surface size", ylab="Time, s") 3716 #+end_src 3717 3718 #+name: fig-bscheduler-performance 3719 #+caption[Performance of Bscheduler and OpenMP programme versions]: 3720 #+caption: Performance comparison of Bscheduler and OpenMP programme versions for 3721 #+caption: AR model. 3722 #+RESULTS: fig-bscheduler-performance 3723 [[file:build/bscheduler-performance.pdf]] 3724 3725 * Conclusion 3726 In the study of mathematical apparatus for sea wave simulations which goes 3727 beyond linear wave theory the following main results were achieved. 3728 - AR and MA models were applied to simulation of sea waves of arbitrary 3729 amplitudes. Integral characteristics of wavy surface were verified by 3730 comparing to the ones of real sea surface. 3731 - New method was applied to compute velocity potentials under generated surface. 3732 The resulting field was verified by comparing it to the one given by formulae 3733 from linear wave theory for small-amplitude waves. For large-amplitude waves 3734 the new method gives a reasonably different field. The method is 3735 computationally efficient because all the integrals in its formula are written 3736 as Fourier transforms, for which there are high-performance implementations. 3737 - The model and the method were implemented for both shared and distributed 3738 memory systems, and showed near linear scalability for different number of 3739 cores in several benchmarks. AR model is more computationally efficient on CPU 3740 than on GPU, and outperforms LH model. 3741 3742 One of the topic of future research is studying generation of waves of arbitrary 3743 profiles on the basis of mixed ARMA process. Another direction is integration of 3744 the developed model and pressure calculation method into existing application 3745 software packages. 3746 3747 * Summary 3748 A problem of determining pressures under sea surface is solved explicitly 3749 without assumptions of linear and small-amplitude wave theories. This solution 3750 coupled with AR and MA models, that generate sea waves of arbitrary amplitudes, 3751 can be used to determine the impact of wave oscillations on the dynamic marine 3752 object in a sea way, and in case of large-amplitude waves gives more accurate 3753 velocity potential field than analogous solutions obtained in the framework of 3754 linear wave theory and theory of small-amplitude waves. 3755 3756 Numerical experiments show that wavy surface generation as well as pressure 3757 computation are efficiently implemented for both shared and distributed memory 3758 systems, without computing on GPU. High performance in case of wavy surface 3759 generation is provided by the use of specialised job scheduler and a library for 3760 multi-dimensional arrays, and in case of velocity potential 3761 computation\nbsp{}--- by the use of FFT algorithms for computing integrals. 3762 3763 The developed mathematical apparatus and its numerical implementation can become 3764 a base of virtual testbed for marine objects dynamics studies. The use of new 3765 models in virtual testbed would allow to 3766 - conduct long-time simulations without the loss of efficiency, 3767 - obtain more accurate pressure fields due to new velocity potential field 3768 computation method, and 3769 - make software complex more robust due to high convergence rate of the models 3770 and by using fault-tolerant batch job scheduler. 3771 3772 * Acknowledgements 3773 The graphs in this work were prepared using R language for statistical 3774 computing\nbsp{}cite:rlang2016,sarkar2008lattice and Graphviz 3775 software\nbsp{}cite:gansner00anopen. The manuscript was prepared using 3776 Org-mode\nbsp{}cite:schulte2011org2,schulte2011org1,dominik2010org for GNU Emacs 3777 which provides computing environment for reproducible research. This means that 3778 all graphs can be reproduced and corresponding statements verified on different 3779 computer systems by cloning thesis repository[fn:repo], installing Emacs and 3780 exporting the document. 3781 3782 [fn:repo] [[https://github.com/igankevich/arma-thesis]] 3783 3784 * List of acronyms and symbols 3785 - <<<MPP>>> :: Massively Parallel Processing, computers with distributed memory. 3786 - <<<SMP>>> :: Symmetric Multi-Processing, computers with shared memory. 3787 - <<<GPGPU>>> :: General-purpose computing on graphics processing units. 3788 - ACF :: auto-covariate function. 3789 - <<<FFT>>> :: fast Fourier transform. 3790 - <<<PRNG>>> :: pseudo-random number generator. 3791 - <<<BC>>> :: boundary condition. 3792 - <<<PDE>>> :: partial differential equation. 3793 - <<<NIT>>> :: non-linear inertialess transform which allows to specify 3794 arbitrary distribution law for wavy surface elevation without 3795 changing its original auto-covariate function. 3796 - AR :: auto-regressive process. 3797 - ARMA :: auto-regressive moving-average process. 3798 - MA :: moving average process. 3799 - LH :: Longuet---Higgins model, formula of which is derived in the framework of 3800 linear wave theory. 3801 - <<<LAMP>>> :: Large Amplitude Motion Programme, a programme that simulates 3802 ship behaviour in ocean waves. 3803 - <<<CLT>>> :: central limit theorem. 3804 - <<<PM>>> :: Pierson---Moskowitz ocean wave spectrum approximation. 3805 - <<<YW>>> :: Yule---Walker equations which are used to determine autoregressive 3806 model coefficients for a given auto-covariate function. 3807 - <<<LS>>> :: least squares. 3808 - PDF :: probability density function. 3809 - <<<CDF>>> :: cumulative distribution function. 3810 - <<<BSP>>> :: Bulk Synchronous Parallel. 3811 - OpenCL :: Open Computing Language, parallel programming technology for hybrid 3812 systems with GPUs or other co-processors. 3813 - <<<OpenGL>>> :: Open Graphics Library. 3814 - OpenMP :: Open Multi-Processing, parallel programming technology for 3815 multi-processor systems. 3816 - <<<MPI>>> :: Message Passing Interface. 3817 - <<<FMA>>> :: Fused multiply-add. 3818 - <<<DCMT>>> :: Dynamic creation of Mersenne Twisters, an algorithm for creating 3819 pseudo-random number generators that produce uncorrelated 3820 sequences when run in parallel. 3821 - <<<GSL>>> :: GNU Scientific Library. 3822 - <<<BLAS>>> :: Basic Linear Algebra Sub-programmes. 3823 - <<<LAPACK>>> :: Linear Algebra Package. 3824 - <<<DNS>>> :: Dynamic name resolution. 3825 - <<<HPC>>> :: High-performance computing. 3826 - GCS :: Gram---Charlier series. 3827 - SN :: Skew normal distribution. 3828 - <<<PBS>>> :: Portable batch system, a system for allocating and distributing 3829 cluster resources for parallel programmes. 3830 - Transcendental functions :: non-algebraic mathematical functions 3831 (i.e.\nbsp{}logarithmic, trigonometric, exponential etc.). 3832 3833 #+begin_export latex 3834 \input{postamble} 3835 #+end_export 3836 3837 bibliographystyle:ugost2008 3838 bibliography:bib/refs.bib 3839 3840 * Appendix 3841 ** Longuet---Higgins model formula derivation 3842 :PROPERTIES: 3843 :CUSTOM_ID: longuet-higgins-derivation 3844 :END: 3845 3846 In the framework of linear wave theory two-dimensional system of 3847 equations\nbsp{}eqref:eq-problem is written as 3848 \begin{align*} 3849 & \phi_{xx} + \phi_{zz} = 0,\\ 3850 & \zeta(x,t) = -\frac{1}{g} \phi_t, & \text{at }z=\zeta(x,t), 3851 \end{align*} 3852 where \(\frac{p}{\rho}\) includes \(\phi_t\). The solution to the Laplace 3853 equation is sought in a form of Fourier series\nbsp{}cite:kochin1966theoretical: 3854 \begin{equation*} 3855 \phi(x,z,t) = \int\limits_{0}^{\infty} e^{k z} 3856 \left[ A(k, t) \cos(k x) + B(k, t) \sin(k x) \right] dk. 3857 \end{equation*} 3858 Plugging it in the boundary condition yields 3859 \begin{align*} 3860 \zeta(x,t) &= -\frac{1}{g} \int\limits_{0}^{\infty} 3861 \left[ A_t(k, t) \cos(k x) + B_t(k, t) \sin(k x) \right] dk \\ 3862 &= -\frac{1}{g} \int\limits_{0}^{\infty} C_t(k, t) \cos(kx + \epsilon(k, t)). 3863 \end{align*} 3864 Here \(\epsilon\) is white noise and \(C_t\) includes \(dk\). Substituting 3865 integral with infinite sum yields two-dimensional form 3866 of\nbsp{}eqref:eq-longuet-higgins. 3867 3868 ** Derivative in the direction of the surface normal 3869 :PROPERTIES: 3870 :CUSTOM_ID: directional-derivative 3871 :END: 3872 Directional derivative of \(\phi\) in the direction of vector \(\vec{n}\) is 3873 given by \(\nabla_n\phi=\nabla\phi\cdot\frac{\vec{n}}{|\vec{n}|}\). Normal 3874 vector \(\vec{n}\) to the surface \(z=\zeta(x,y)\) at point \((x_0,y_0)\) is 3875 given by 3876 \begin{equation*} 3877 \vec{n} = \begin{bmatrix}\zeta_x(x_0,y_0)\\\zeta_y(x_0,y_0)\\-1\end{bmatrix}. 3878 \end{equation*} 3879 Hence, derivative in the direction of the surface normal is given by 3880 \begin{equation*} 3881 \nabla_n \phi = \phi_x \frac{\zeta_x}{\sqrt{\zeta_x^2+\zeta_y^2+1}} 3882 + \phi_y \frac{\zeta_y}{\sqrt{\zeta_x^2+\zeta_y^2+1}} 3883 + \phi_z \frac{-1}{\sqrt{\zeta_x^2+\zeta_y^2+1}}, 3884 \end{equation*} 3885 where \(\zeta\) derivatives are calculated at \((x_0,y_0)\).