arma-thesis

Simulation modelling of irregular waves for marine object dynamics programmes
git clone https://git.igankevich.com/arma-thesis.git

arma-thesis.org


      1 #+TITLE: Simulation modelling of irregular waves for marine object dynamics programmes
      2 #+AUTHOR: Ivan Gennadevich Gankevich
      3 #+DATE: St. Petersburg, 2017
      4 #+LANGUAGE: en
      5 #+SETUPFILE: setup.org
      6 #+LATEX_HEADER_EXTRA: \organization{Saint Petersburg State University}
      7 #+LATEX_HEADER_EXTRA: \manuscript{a manuscript}
      8 #+LATEX_HEADER_EXTRA: \degree{thesis for candidate of sciences degree}
      9 #+LATEX_HEADER_EXTRA: \speciality{Speciality 05.13.18\\Mathematical modelling, numerical methods and programme complexes}
     10 #+LATEX_HEADER_EXTRA: \supervisor{Supervisor\\Alexander Borisovich Degtyarev, ScD}
     11 #+LATEX_HEADER_EXTRA: \newcites{published}{Publications on the subject of thesis}
     12 #+LATEX_HEADER_EXTRA: \AtBeginDocument{\setcounter{page}{146}}
     13 
     14 #+latex: \maketitlepage
     15 
     16 * Introduction
     17 **** Topic relevance.
Software programmes that simulate ship behaviour in sea waves are widely used
to model ship motion, estimate the impact of external forces on a floating
platform or other marine object, and estimate capsize probability under given
weather conditions; however, to model sea waves most of them use linear wave
theory\nbsp{}cite:shin2003nonlinear,van2007forensic,kat2001prediction,van2002development,
in the framework of which it is difficult to reproduce certain peculiarities of
wind wave climate. Among them are the transition between normal and storm
weather, and a sea composed of multiple wave systems\nbsp{}--- both wind waves
and swell\nbsp{}--- heading from multiple directions. Another shortcoming of
linear wave theory is the assumption that wave amplitude is small compared to
wave length. This makes calculations inaccurate when modelling ship motion in
irregular waves, for which the assumption does not hold. So, studying new and
more advanced models and methods for sea simulation software would increase the
number of its application scenarios and, in particular, foster the study of
ship motion in extreme conditions.
     33 
     34 **** State of the art.
The autoregressive moving average (ARMA) model is a response to difficulties
encountered by practitioners who used wave simulation models developed in the
framework of linear wave theory. The problems they encountered with the
Longuet---Higgins model (a model which is entirely based on linear wave theory)
are the following.
1. /Periodicity/. Linear wave theory approximates waves by a sum of harmonics,
   hence the period of the whole wavy surface realisation depends on the number
   of harmonics in the model. The larger the realisation size, the more
   coefficients are required to eliminate periodicity; therefore, generation
   time grows non-linearly with realisation size. This in turn results in
   overall low efficiency of any model based on this theory, no matter how
   optimised the software implementation is.
     47 2. /Linearity/. Linear wave theory gives mathematical definition for sea waves
     48    which have small amplitudes compared to their lengths. Waves of this type
     49    occur mostly in open ocean, so near-shore waves as well as storm waves, for
     50    which this assumption does not hold, are inaccurately modelled in the
     51    framework of linear theory.
     52 3. /Probabilistic convergence/. Phase of a wave, which is generated by pseudo
     53    random number generator (PRNG), has uniform distribution, which leads to low
     54    convergence rate of wavy surface integral characteristics (average wave
     55    height, wave period, wave length etc.).
     56 
These difficulties became the starting point in the search for a new model not
based on linear wave theory, and studies of ARMA processes were found to
provide all the required mathematical apparatus.
1. ARMA process takes the auto-covariance function (ACF) as an input parameter,
   and this function can be directly obtained from the wave energy or
   frequency-directional spectrum (which is the input parameter for the
   Longuet---Higgins model). So, the input for one model is easily converted
   into the input for the other.
2. There is no small-amplitude wave assumption. A wave may have any amplitude,
   and can be generated as steep as a real sea wave ACF allows.
     67 3. Period of the realisation equals the period of PRNG, so generation time grows
     68    linearly with the realisation size.
     69 4. White noise\nbsp{}--- the only probabilistic term in ARMA process\nbsp{}---
     70    has Gaussian distribution, hence convergence rate is not probabilistic.
     71 
     72 **** Goals and objectives.
ARMA process is the basis for the ARMA sea simulation model, but due to its
non-physical nature the model needs to be adapted for wavy surface generation.
The objectives are the following.
1. Investigate how different ACF shapes affect the choice of ARMA parameters
   (the number of moving average and autoregressive process coefficients).
2. Investigate the possibility of generating waves of arbitrary profiles, not
   only cosines (which means taking into account the asymmetric distribution of
   wavy surface elevation).
3. Develop a method to determine the pressure field under a discretely given
   wavy surface. Usually, such formulae are derived for a particular model by
   substituting the wave profile formula into the system of equations for
   pressure determination\nbsp{}eqref:eq-problem; however, ARMA process does
   not provide an explicit wave profile formula, so this problem has to be
   solved for a general wavy surface (one not defined by a mathematical
   formula), without linearisation of boundaries and the assumption of
   small-amplitude waves.
4. Verify that wavy surface integral characteristics match those of real sea
   waves.
5. Develop a software programme that implements the ARMA model and the pressure
   calculation method, and makes it possible to run simulations on both shared
   and distributed memory computer systems.
     93 
     94 **** Scientific novelty.
     95 ARMA model, as opposed to other sea simulation models, does not use linear
     96 wave theory. This makes it capable of
     97 - generating waves with arbitrary amplitudes by adjusting wave steepness via
     98   ACF;
     99 - generating waves with arbitrary profiles by adjusting asymmetry of wave
    100   elevation distribution via non-linear inertialess transform (NIT).
    101 This makes it possible to use ARMA process to model transition between normal
    102 and storm weather taking into account climate spectra and assimilation data of a
    103 particular ocean region, which is not possible with models based on linear wave
    104 theory.
    105 
The distinct features of this work are
- the use of /three-dimensional/ AR and MA models in all experiments,
- the use of a velocity potential field calculation method suitable for a
  discretely given wavy surface, and
- the development of a software programme that implements sea wavy surface
  generation and pressure field computation not only on shared memory (SMP) and
  distributed memory (MPP) computer systems, but also on hybrid (GPGPU) systems
  which use graphical coprocessors to accelerate computations.
    114 
    115 **** Theoretical and practical significance.
Implementing the ARMA model, which does not use the assumptions of linear wave
theory, will increase the quality of ship motion and marine object behaviour
simulation software.
    119 
1. Since the pressure field method is developed for a discrete wavy surface and
   without assumptions about wave amplitudes, it is applicable to any wavy
   surface of incompressible inviscid fluid (in particular, to a wavy surface
   generated by the LH model). This allows the method to be used without being
   tied to the ARMA model.
2. From a computational point of view this method is more efficient than the
   corresponding method for the LH model, because the integrals in its formula
   reduce to Fourier transforms, for which there is the fast Fourier transform
   (FFT) family of algorithms, optimised for different processor architectures.
3. Since the formula which is used in the method is explicit, there is no need
   for data exchange between parallel processes, which allows performance to
   scale to a large number of processor cores.
4. Finally, the ARMA model is itself more efficient than the LH model due to
   the absence of trigonometric functions in its formula: in fact, the wavy
   surface is computed as a sum of a large number of polynomials, for which
   there is the low-level fused multiply-add (FMA) assembly instruction, and
   the memory access pattern allows for efficient use of CPU cache.
    137 
    138 **** Methodology and research methods.
The software implementation of the ARMA model and the pressure field
calculation method was created incrementally: a prototype written in high-level
engineering languages (Mathematica\nbsp{}cite:mathematica10 and
Octave\nbsp{}cite:octave2015) was rewritten in a lower-level language (C++).
Due to the use of different abstractions and language primitives, implementing
the same algorithms and methods in languages of varying levels makes it
possible to catch errors which would have been left unnoticed if only one
language were used. The wavy surface generated by the ARMA model, as well as
all input parameters (ACF, distribution of wave elevation etc.), were inspected
via graphical means built into the programming language.
    148 
    149 **** Theses for the defence.
- A sea wave simulation model which makes it possible to generate wavy surface
  realisations with a large period, consisting of waves of arbitrary
  amplitudes;
    152 - Pressure field calculation method derived for this model without assumptions
    153   of linear wave theory;
    154 - Software implementation of the simulation model and the method for shared and
    155   distributed memory computer systems.
    156 **** Results verification and approbation.
The ARMA model is verified by comparing integral characteristics of the
generated wavy surface (distribution of wave elevation, wave heights and
lengths etc.) to those of real sea waves. The pressure field calculation method
was developed using the Mathematica language, where the resulting formulae were
verified by built-in graphical means.
    162 
The ARMA model and the pressure field calculation method were incorporated into
the Large Amplitude Motion Programme (LAMP)\nbsp{}--- a ship motion simulation
software programme\nbsp{}--- where they were compared to the previously used LH
model. Numerical experiments showed higher computational efficiency of the ARMA
model.
    167 
    168 * Problem statement
    169 The aim of the study reported here is to apply ARMA process mathematical
    170 apparatus to sea wave modelling and to develop a method to compute pressure
    171 field under discretely given wavy surface for inviscid incompressible fluid
    172 without assumptions of linear wave theory.
- In the case of small-amplitude waves the resulting pressure field must
  correspond to the one obtained via the formula from linear wave theory; in
  all other cases field values must not tend to infinity.
    176 - Integral characteristics of generated wavy surface must match the ones of real
    177   sea waves.
    178 - Software implementation of ARMA model and pressure field computation method
    179   must work on shared and distributed memory systems.
    180 
    181 **** Pressure field formula.
    182 The problem of finding pressure field under wavy sea surface represents inverse
    183 problem of hydrodynamics for incompressible inviscid fluid. System of equations
    184 for it in general case is written as\nbsp{}cite:kochin1966theoretical
    185 \begin{align}
    186     & \nabla^2\phi = 0,\nonumber\\
    187     & \phi_t+\frac{1}{2} |\vec{\upsilon}|^2 + g\zeta=-\frac{p}{\rho}, & \text{at }z=\zeta(x,y,t),\label{eq-problem}\\
    188     & D\zeta = \nabla \phi \cdot \vec{n}, & \text{at }z=\zeta(x,y,t),\nonumber
    189 \end{align}
    190 where \(\phi\) is velocity potential, \(\zeta\)\nbsp{}--- elevation (\(z\)
    191 coordinate) of wavy surface, \(p\)\nbsp{}--- wave pressure, \(\rho\)\nbsp{}---
    192 fluid density, \(\vec{\upsilon}=(\phi_x,\phi_y,\phi_z)\)\nbsp{}--- velocity
    193 vector, \(g\)\nbsp{}--- acceleration of gravity, and \(D\)\nbsp{}--- substantial
(Lagrange) derivative. The first equation is the continuity (Laplace) equation;
the second one is the law of conservation of momentum (the so-called dynamic
boundary condition); the third one is the kinematic boundary condition for the
free wavy surface, which states that the rate of change of wavy surface
elevation (\(D\zeta\)) equals the velocity potential derivative in the
direction of the wavy surface normal (\(\nabla\phi\cdot\vec{n}\), see
section\nbsp{}[[#directional-derivative]]).
    201 
    202 Inverse problem of hydrodynamics consists in solving this system of equations
    203 for \(\phi\). In this formulation dynamic boundary condition becomes explicit
    204 formula to determine pressure field using velocity potential derivatives
    205 obtained from the remaining equations. From mathematical point of view inverse
    206 problem of hydrodynamics reduces to Laplace equation with mixed boundary
    207 condition\nbsp{}--- Robin problem.
    208 
The inverse problem is feasible because the ARMA model generates a
hydrodynamically adequate sea wavy surface: distributions of integral
characteristics and the dispersion relation match those of real
waves\nbsp{}cite:boukhanovsky1997thesis,degtyarev2011modelling.
    213 
    214 * ARMA model for sea wave simulation
    215 ** Sea wave models analysis
Pressure computation is only possible when the shape of the wavy surface is
known. It is defined either at discrete grid points, or continuously via some
analytic formula. As will be shown in section\nbsp{}[[#linearisation]], such a
formula may simplify pressure computation by effectively reducing the task to
pressure field generation instead of wavy surface generation.
    221 
    222 **** Longuet---Higgins model.
The simplest model, the formula of which is derived in the framework of linear
wave theory (see\nbsp{}sec.\nbsp{}[[#longuet-higgins-derivation]]), is the
Longuet---Higgins (LH) model\nbsp{}cite:longuet1957statistical. An in-depth
comparative analysis of this model and ARMA model is done
in\nbsp{}cite:degtyarev2011modelling,boukhanovsky1997thesis.
    228 
    229 LH model represents sea wavy surface as a superposition of
    230 sine waves with random amplitudes \(c_n\) and phases \(\epsilon_n\), uniformly
    231 distributed on interval \([0,2\pi]\). Wavy surface elevation (\(z\) coordinate) is
    232 defined by
    233 \begin{equation}
    234     \zeta(x,y,t) = \sum\limits_n c_n \cos(u_n x + v_n y - \omega_n t + \epsilon_n).
    235     \label{eq-longuet-higgins}
    236 \end{equation}
Here wave numbers \((u_n,v_n)\) are continuously distributed on the plane
\((u,v)\), i.e.\nbsp{}the area \(du\times{}dv\) contains an infinite number of
wave numbers.
    239 Frequency is related to wave numbers via dispersion relation
    240 \(\omega_n=\omega(u_n,v_n)\). Function \(\zeta(x,y,t)\) is a three-dimensional
    241 ergodic stationary homogeneous Gaussian process defined by
    242 \begin{equation*}
    243     2E_\zeta(u,v)\, du\, dv = \sum\limits_n c_n^2,
    244 \end{equation*}
    245 where \(E_\zeta(u,v)\)\nbsp{}--- two-dimensional wave energy spectral density.
    246 Coefficients \(c_n\) are derived from wave energy spectrum \(S(\omega)\) via
    247 \begin{equation*}
    248     c_n = \sqrt{ \textstyle\int\limits_{\omega_n}^{\omega_{n+1}} S(\omega) d\omega}.
    249 \end{equation*}
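
As an illustration, the LH sum is straightforward to sketch in code. The
following Python snippet is an illustration only, not part of the original
work: the spectrum, its parameter values, and the deep-water dispersion
relation \(\omega^2 = gk\) are assumptions chosen for the sketch, and it is
one-dimensional in space.

```python
import math
import random

def spectrum(omega, alpha=8.1e-3, beta=0.74, g=9.81, u=10.0):
    # A Pierson-Moskowitz-type frequency spectrum S(omega); the parameter
    # values here are illustrative placeholders.
    if omega <= 0.0:
        return 0.0
    return (alpha * g * g / omega**5) * math.exp(-beta * (g / (u * omega))**4)

def lh_waves(n=50, omega_max=3.0, g=9.81, seed=1):
    """Amplitudes c_n = sqrt(integral of S over [omega_n, omega_{n+1}])
    (midpoint rule) and random phases, uniform on [0, 2*pi)."""
    rng = random.Random(seed)
    d = omega_max / n
    waves = []
    for i in range(n):
        w = (i + 0.5) * d                  # midpoint of the frequency bin
        c = math.sqrt(spectrum(w) * d)     # amplitude from the spectrum
        k = w * w / g                      # deep-water dispersion relation
        waves.append((c, k, w, rng.uniform(0.0, 2.0 * math.pi)))
    return waves

def zeta(waves, x, t):
    # One-dimensional (in space) analogue of the Longuet-Higgins sum.
    return sum(c * math.cos(k * x - w * t + e) for c, k, w, e in waves)

waves = lh_waves()
elevation = zeta(waves, x=0.0, t=0.0)
```

Note how the realisation is tied to a fixed frequency grid: with a finite
number of harmonics the surface repeats, which is the periodicity problem
discussed below.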
    250 
**** Disadvantages of Longuet---Higgins model.
    252 Despite simplicity of LH model numerical algorithm, in practice it has several
    253 disadvantages.
1. The model simulates only a stationary Gaussian field. This is a consequence
   of the central limit theorem (CLT): a sum of a large number of sines with
   random amplitudes and phases has normal distribution, no matter what
   spectrum is used as the model input. Using a lower number of coefficients
   may solve the problem, but also makes the realisation period smaller. Using
   LH model to simulate waves with non-Gaussian distribution of
   elevation\nbsp{}--- a distribution which real sea waves
   have\nbsp{}cite:huang1980experimental,rozhkov1996theory\nbsp{}--- is
   impractical.
2. From a computational point of view, the deficiency of the model is the
   non-linear increase of wavy surface generation time with realisation size.
   The larger the realisation, the higher the number of coefficients (discrete
   points of the frequency-directional spectrum) needed to eliminate
   periodicity. This makes LH model inefficient for long-time simulations.
    268 3. Finally, there are peculiarities which make LH model unsuitable as a base for
    269    building more advanced simulation models.
    270    - In software implementation convergence rate
    271      of\nbsp{}[[eqref:eq-longuet-higgins]] is low due to uniform distribution of
    272      phases \(\epsilon_n\).
    273    - It is difficult to generalise LH model for non-Gaussian processes as it
    274      involves incorporating non-linear terms in\nbsp{}[[eqref:eq-longuet-higgins]]
    275      for which there is no known formula to determine
    276      coefficients\nbsp{}cite:rozhkov1990stochastic.
    277 
To summarise, LH model is applicable to generating sea wavy surface only in the
framework of linear wave theory, is inefficient for long-time simulations, and
has a number of deficiencies which prevent its use as a base for more advanced
models.
    282 
    283 **** ARMA model.
In\nbsp{}cite:spanos1982arma ARMA model is used to generate time series whose
spectrum is compatible with the Pierson---Moskowitz (PM) spectrum. The authors
carry out experiments for one-dimensional AR, MA and ARMA models. They mention
excellent agreement between target and initial spectra and higher performance
of ARMA model compared to models based on summing a large number of harmonic
components with random phases. They also mention that in order to reach
agreement between target and initial spectra the MA model requires fewer
coefficients than the AR model. In\nbsp{}cite:spanos1996efficient the authors
generalise ARMA model coefficient determination formulae for the multi-variate
(vector) case.
    294 
One thing that distinguishes the present work from the afore-mentioned ones is
the study of a three-dimensional (two dimensions in space and one in time) ARMA
model, which is largely a different problem.
1. The Yule---Walker system of equations, which is used to determine AR
   coefficients, has a complex block-block structure.
2. The optimal model order (in the sense that the target spectrum agrees with
   the initial one) is determined manually.
3. Instead of the PM spectrum, analytic formulae for the ACF of standing and
   propagating waves are used as the model input.
4. The three-dimensional wavy surface should be compatible with the real sea
   surface not only in terms of spectral characteristics, but also in the shape
   of wave profiles. So, model verification includes distributions of various
   parameters of generated waves (lengths, heights, periods etc.).
The multi-dimensionality of the investigated model not only complicates the
task, but also makes visual validation of the generated wavy surface possible.
It is the opportunity to visualise the output of the programme that allows one
to ensure that the generated surface is compatible with a real sea surface, and
is not an abstract multi-dimensional stochastic process that corresponds to the
real one only statistically.
    314 
In\nbsp{}cite:fusco2010short AR model is used to predict swell waves in order
to control wave-energy converters (WEC) in real time. To make a WEC more
efficient, its internal oscillator frequency should match that of the sea
waves. The authors treat wave elevation as a time series and compare the
performance of AR model, neural networks and cyclical models in forecasting
future values of the time series. AR model gives the most accurate prediction
of low-frequency swell waves for up to two typical wave periods. It is an
example of successful application of AR process to sea wave modelling.
    323 
    324 ** Governing equations for 3-dimensional ARMA process
ARMA sea simulation model defines sea wavy surface as a three-dimensional (two
dimensions in space and one in time) autoregressive moving average process:
every surface point is represented as a weighted sum of previous points in time
and space plus a weighted sum of previous normally distributed random impulses
in time and space. The governing equation for the 3-D ARMA process is
    330 \begin{equation}
    331     \zeta_{\vec i}
    332     =
    333     \sum\limits_{\vec j = \vec 0}^{\vec N}
    334     \Phi_{\vec j} \zeta_{\vec i - \vec j}
    335     +
    336     \sum\limits_{\vec j = \vec 0}^{\vec M}
    337     \Theta_{\vec j} \epsilon_{\vec i - \vec j}
    338     ,
    339     \label{eq-arma-process}
    340 \end{equation}
    341 where \(\zeta\)\nbsp{}--- wave elevation, \(\Phi\)\nbsp{}--- AR process
    342 coefficients, \(\Theta\)\nbsp{}--- MA process coefficients,
    343 \(\epsilon\)\nbsp{}--- white noise with Gaussian distribution,
    344 \(\vec{N}\)\nbsp{}--- AR process order, \(\vec{M}\)\nbsp{}--- MA process order,
    345 and \(\Phi_{\vec{0}}\equiv{0}\), \(\Theta_{\vec{0}}\equiv{0}\). Here arrows
    346 denote multi-component indices with a component for each dimension. In general,
    347 any scalar quantity can be a component (temperature, salinity, concentration of
    348 some substance in water etc.). Equation parameters are AR and MA process
    349 coefficients and order.
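
A minimal one-dimensional analogue of this recurrence can be sketched as
follows. This is an illustration only, not the thesis implementation: the
coefficient values are arbitrary, and the current impulse is added explicitly
(as it is in the AR special case below).

```python
import random

def arma_series(phi, theta, n, var_eps=1.0, seed=0):
    """One-dimensional analogue of the governing ARMA equation: each value
    is a weighted sum of previous values (AR part) plus a weighted sum of
    previous Gaussian impulses (MA part) plus the current impulse."""
    rng = random.Random(seed)
    eps = [rng.gauss(0.0, var_eps ** 0.5) for _ in range(n)]
    zeta = [0.0] * n
    for i in range(n):
        ar = sum(p * zeta[i - 1 - j] for j, p in enumerate(phi) if i - 1 - j >= 0)
        ma = sum(t * eps[i - 1 - j] for j, t in enumerate(theta) if i - 1 - j >= 0)
        zeta[i] = ar + ma + eps[i]
    return zeta

# A stationary example: AR order 1, MA order 1, arbitrary coefficients.
z = arma_series(phi=[0.5], theta=[0.3], n=1000)
```

Generation time is linear in the realisation size, which is one of the
advantages over the LH model noted earlier.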
    350 
The sea simulation model is used to produce realistic realisations of a
wind-induced wave field and is suitable for use in ship dynamics calculations.
Stationarity and invertibility properties are the main criteria for choosing
one or another process to simulate waves with different profiles, as discussed
in section\nbsp{}[[#sec-process-selection]].
    356 
    357 **** Autoregressive (AR) process.
AR process is ARMA process with only one random impulse instead of their
weighted sum:
    360 \begin{equation}
    361     \zeta_{\vec i}
    362     =
    363     \sum\limits_{\vec j = \vec 0}^{\vec N}
    364     \Phi_{\vec j} \zeta_{\vec i - \vec j}
    365     +
    \epsilon_{\vec i}
    367     .
    368     \label{eq-ar-process}
    369 \end{equation}
    370 The coefficients \(\Phi\) are calculated from ACF via three-dimensional
    371 Yule---Walker (YW) equations, which are obtained after multiplying both parts of
    372 the previous equation by \(\zeta_{\vec{i}-\vec{k}}\) and computing the expected
    373 value. Generic form of YW equations is
    374 \begin{equation}
    375     \label{eq-yule-walker}
    376     \gamma_{\vec k}
    377     =
    378     \sum\limits_{\vec j = \vec 0}^{\vec N}
    379     \Phi_{\vec j}
    380     \text{ }\gamma_{\vec{k}-\vec{j}}
    381     +
    382     \Var{\epsilon} \delta_{\vec{k}},
    383     \qquad
    384     \delta_{\vec{k}} =
    385     \begin{cases}
    386         1, \quad \text{if } \vec{k}=0 \\
    387         0, \quad \text{if } \vec{k}\neq0,
    388     \end{cases}
    389 \end{equation}
    390 where \(\gamma\)\nbsp{}--- ACF of process \(\zeta\),
    391 \(\Var{\epsilon}\)\nbsp{}--- white noise variance. Matrix form of
    392 three-dimensional YW equations, which is used in the present work, is
    393 \begin{equation*}
    394     \Gamma
    395     \left[
    396         \begin{array}{l}
    397             \Phi_{\vec 0}\\
    398             \Phi_{0,0,1}\\
    399             \vdotswithin{\Phi_{\vec 0}}\\
    400             \Phi_{\vec N}
    401         \end{array}
    402     \right]
    403     =
    404     \left[
    405         \begin{array}{l}
    406             \gamma_{0,0,0}-\Var{\epsilon}\\
    407             \gamma_{0,0,1}\\
    408             \vdotswithin{\gamma_{\vec 0}}\\
    409             \gamma_{\vec N}
    410         \end{array}
    411     \right],
    412     \qquad
    413     \Gamma=
    414     \left[
    415         \begin{array}{llll}
    416             \Gamma_0 & \Gamma_1 & \cdots & \Gamma_{N_1} \\
    417             \Gamma_1 & \Gamma_0 & \ddots & \vdotswithin{\Gamma_0} \\
    418             \vdotswithin{\Gamma_0} & \ddots & \ddots & \Gamma_1 \\
    419             \Gamma_{N_1} & \cdots & \Gamma_1 & \Gamma_0
    420         \end{array}
    421     \right],
    422 \end{equation*}
    423 where \(\vec N = \left( p_1, p_2, p_3 \right)\) and
    424 \begin{equation*}
    425     \Gamma_i =
    426     \left[
    427     \begin{array}{llll}
    428         \Gamma^0_i & \Gamma^1_i & \cdots & \Gamma^{N_2}_i \\
    429         \Gamma^1_i & \Gamma^0_i & \ddots & \vdotswithin{\Gamma^0_i} \\
    430         \vdotswithin{\Gamma^0_i} & \ddots & \ddots & \Gamma^1_i \\
    431         \Gamma^{N_2}_i & \cdots & \Gamma^1_i & \Gamma^0_i
    432     \end{array}
    433     \right]
    434     \qquad
    435     \Gamma_i^j=
    436     \left[
    437     \begin{array}{llll}
    438         \gamma_{i,j,0} & \gamma_{i,j,1} & \cdots & \gamma_{i,j,N_3} \\
    439         \gamma_{i,j,1} & \gamma_{i,j,0} & \ddots & \vdotswithin{\gamma_{i,j,0}} \\
    440         \vdotswithin{\gamma_{i,j,0}} & \ddots & \ddots & \gamma_{i,j,1} \\
    441         \gamma_{i,j,N_3} & \cdots & \gamma_{i,j,1} & \gamma_{i,j,0}
    442     \end{array}
    \right].
    444 \end{equation*}
Since \(\Phi_{\vec 0}\equiv0\), the first row and column of \(\Gamma\) can be
eliminated. Matrix \(\Gamma\) is block-block-Toeplitz, positive definite and
symmetric, hence the system is efficiently solved by Cholesky decomposition,
which is particularly suitable for matrices of this type.
    449 
After solving this system of equations, the white noise variance is estimated
from\nbsp{}eqref:eq-yule-walker by setting \(\vec k = \vec 0\):
    452 \begin{equation*}
    453     \Var{\epsilon} =
    454     \Var{\zeta}
    455     -
    456     \sum\limits_{\vec j = \vec 0}^{\vec N}
    457     \Phi_{\vec j}
    458     \text{ }\gamma_{\vec{j}}.
    459 \end{equation*}
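
The one-dimensional special case of these equations is easy to verify
numerically. The sketch below is an illustration only: it uses plain
Gauss-Jordan elimination instead of the Cholesky decomposition used in this
work, and an ACF constructed to be exactly consistent with a known AR(2)
process, whose coefficients and white noise variance it then recovers.

```python
def solve(a, b):
    # Gauss-Jordan elimination with partial pivoting; a stand-in for the
    # Cholesky solver (the matrix is symmetric positive definite either way).
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [x - f * y for x, y in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def yule_walker(gamma):
    """Solve the one-dimensional Yule-Walker equations for AR coefficients,
    then estimate the white noise variance from the k = 0 equation."""
    p = len(gamma) - 1
    big_gamma = [[gamma[abs(i - j)] for j in range(p)] for i in range(p)]
    phi = solve(big_gamma, gamma[1:])
    var_eps = gamma[0] - sum(f * g for f, g in zip(phi, gamma[1:]))
    return phi, var_eps

# ACF constructed to be exactly consistent with AR(2), phi = (0.5, -0.3):
g0 = 1.0
g1 = 0.5 * g0 / (1.0 - (-0.3))   # from gamma_1 = phi_1 gamma_0 + phi_2 gamma_1
g2 = 0.5 * g1 + (-0.3) * g0      # from gamma_2 = phi_1 gamma_1 + phi_2 gamma_0
phi, var_eps = yule_walker([g0, g1, g2])
```

In three dimensions the matrix \(\Gamma\) acquires the block-block structure
shown above, but the solve-then-estimate-variance structure is the same.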
    460 
    461 **** Moving average (MA) process.
    462 MA process is ARMA process with \(\Phi\equiv0\):
    463 \begin{equation}
    464     \zeta_{\vec i}
    465     =
    466     \sum\limits_{\vec j = \vec 0}^{\vec M}
    467     \Theta_{\vec j} \epsilon_{\vec i - \vec j}
    468     .
    469     \label{eq-ma-process}
    470 \end{equation}
    471 MA coefficients \(\Theta\) are defined implicitly via the following non-linear
    472 system of equations:
    473 \begin{equation*}
    474   \gamma_{\vec i} =
    475 	\left[
    476 		\displaystyle
    477     \sum\limits_{\vec j = \vec i}^{\vec M}
    478     \Theta_{\vec j}\Theta_{\vec j - \vec i}
    479 	\right]
    480   \Var{\epsilon}.
    481 \end{equation*}
The system is solved numerically by the fixed-point iteration method via the
following formula:
\begin{equation*}
  \Theta_{\vec i} =
    -\frac{\gamma_{\vec i}}{\Var{\epsilon}}
    +
    \sum\limits_{\vec j = \vec i}^{\vec M}
    \Theta_{\vec j} \Theta_{\vec j - \vec i}.
\end{equation*}
    491 Here coefficients \(\Theta\) are calculated from back to front: from
    492 \(\vec{i}=\vec{M}\) to \(\vec{i}=\vec{0}\). White noise variance is estimated by
    493 \begin{equation*}
    494     \Var{\epsilon} = \frac{\gamma_{\vec 0}}{
    495 		1
    496 		+
    497     \sum\limits_{\vec j = \vec 0}^{\vec M}
    498     \Theta_{\vec j}^2
    499     }.
    500 \end{equation*}
The authors of\nbsp{}cite:box1976time suggest using the Newton---Raphson method
to solve this equation with higher precision; however, in that case it is
difficult to generalise the method to three dimensions. Using the slower method
does not have a dramatic effect on overall programme performance, because the
number of coefficients is small and most of the time is spent generating the
wavy surface.
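
A scalar MA(1) version of this fixed-point iteration is easy to check. The
sketch below is an illustration only and uses the Box---Jenkins sign convention
\(\zeta_t = \epsilon_t - \theta\epsilon_{t-1}\), for which
\(\gamma_0 = (1+\theta^2)\Var{\epsilon}\) and
\(\gamma_1 = -\theta\,\Var{\epsilon}\).

```python
def ma1_coefficients(gamma0, gamma1, iterations=200):
    """Fixed-point iteration for an MA(1) process: alternately update the
    white noise variance estimate and the coefficient estimate."""
    theta = 0.0
    for _ in range(iterations):
        var_eps = gamma0 / (1.0 + theta * theta)   # variance from gamma_0
        theta = -gamma1 / var_eps                  # coefficient from gamma_1
    return theta, var_eps

# ACF of an MA(1) process with theta = 0.5 and unit white noise variance:
theta, var_eps = ma1_coefficients(gamma0=1.25, gamma1=-0.5)
```

The iteration converges to the invertible root \(\theta = 0.5\) rather than its
non-invertible counterpart \(\theta = 2\), which produces the same ACF.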
    506 
    507 **** Stationarity and invertibility of AR and MA processes.
In order for the modelled wavy surface to represent a physical phenomenon, the
corresponding process must be stationary and invertible. If the process is
/invertible/, there is a reasonable connection between current events and
events in the past, and if the process is /stationary/, the amplitude of the
modelled physical signal does not grow infinitely in time and space.
    513 
    514 AR process is always invertible, and for stationarity it is necessary for roots
    515 of characteristic equation
    516 \begin{equation*}
    517 1 - \Phi_{0,0,1} z - \Phi_{0,0,2} z^2
    518 - \cdots
    519 - \Phi_{\vec N} z^{N_0 N_1 N_2} = 0,
    520 \end{equation*}
    521 to lie \emph{outside} the unit circle. Here \(\vec{N}\) is AR process order
    522 and \(\Phi\) are coefficients.
    523 
    524 MA process is always stationary, and for invertibility it is necessary for roots
    525 of characteristic equation
    526 \begin{equation*}
    527 1 - \Theta_{0,0,1} z - \Theta_{0,0,2} z^2
    528 - \cdots
    529 - \Theta_{\vec M} z^{M_0 M_1 M_2} = 0,
    530 \end{equation*}
    531 to lie \emph{outside} the unit circle. Here \(\vec{M}\) is
    532 three-dimensional MA process order and \(\Theta\) are coefficients.
    537 
    538 **** Mixed autoregressive moving average (ARMA) process.
    539 :PROPERTIES:
    540 :CUSTOM_ID: sec-how-to-mix-arma
    541 :END:
    542 
Generally speaking, an ARMA process is obtained by plugging an MA-generated
wavy surface as the random impulse into an AR process; however, in order to get
a process with the desired ACF one should re-compute the AR coefficients before
plugging. There are several approaches to "mix" AR and MA processes.
- The approach proposed in\nbsp{}cite:box1976time, which involves dividing the
  ACF into MA and AR parts along each dimension, is not applicable here,
  because in three dimensions such a division is not possible: there will
  always be parts of the ACF that are not taken into account by the AR and MA
  processes.
    551 - The alternative approach is to use the same (undivided) ACF for both AR and MA
    552   processes but use different process order, however, then realisation
    553   characteristics (mean, variance etc.) become skewed: these are characteristics
    554   of the two overlapped processes.
    555 For the first approach the authors of\nbsp{}cite:box1976time propose a formula
    556 to correct AR process coefficients, but there is no such formula for the second
    557 approach. So, the only proven solution for now is to simply use AR and MA
    558 process exclusively.
    559 
    560 **** Process selection criteria.
    561 :PROPERTIES:
    562 :CUSTOM_ID: sec-process-selection
    563 :END:
    564 
    565 One problem of ARMA model application to sea wave generation is that for
    566 different types of wave profiles different processes /must/ be used: standing
    567 waves are modelled by AR process, and propagating waves by MA process. This
    568 statement comes from practice: if one tries to use the processes the other way
    569 round, the resulting realisation either diverges or does not correspond to real
    570 sea waves. (The latter happens for non-invertible MA process, as it is always
    571 stationary.) So, the best way to apply ARMA model to sea wave generation is to
    572 use AR process for standing waves and MA process for progressive waves.
    573 
The other problem is the difficulty of determining the optimal number of
coefficients for three-dimensional AR and MA processes. For one-dimensional
processes there are iterative methods that are easy to
implement\nbsp{}cite:box1976time; for the three-dimensional case the analogous
methods\nbsp{}cite:byoungseon1999arma3d,liew2005arma3d are difficult to
implement, hence they were not used in this work. Manual choice of the model's
order is straightforward: choosing a deliberately higher order than needed
affects performance, but not realisation quality, because the processes'
periodicity does not depend on the number of coefficients. A software
implementation of iterative methods of finding the optimal model order would
improve the automation level of the wavy surface generator, but not the quality
of the conducted experiments.
    585 
In practice some statements made for one-dimensional AR and MA processes
in\nbsp{}cite:box1976time should be flipped for the three-dimensional case. For
example, the authors say that the ACF of an MA process cuts off at lag \(q\)
and the ACF of an AR process decays to nought infinitely, but in practice
making the ACF of a three-dimensional MA process not decay results in a
non-invertible process producing a realisation that does not look like real sea
waves, whereas doing the same for the ACF of an AR process results in a
stationary process and an adequate realisation. Also, the authors say that one
should allocate the first \(q\) points of the ACF to the MA process (as it is
often needed to describe the peaks in the ACF) and leave the remaining points
to the AR process, but in practice, in the case of the ACF of a propagating
wave, the AR process is stationary only for the first time slice of the ACF,
and the rest is left to the MA process.
    598 
    599 To summarise, the only established scenario of applying ARMA model to sea wave
    600 generation is to use AR process for standing waves and MA process for
    601 propagating waves. Using mixed ARMA process for both types of waves might
    602 increase model precision, given that coefficient correction formulae for three
    603 dimensions become available, which is the objective of the future research.
    604 
    605 **** Verification of wavy surface integral characteristics.
In\nbsp{}cite:degtyarev2011modelling,degtyarev2013synoptic,boukhanovsky1997thesis
the following properties of the AR model are verified experimentally:
    608 - probability distributions of different wave characteristics (heights, lengths,
    609   crests, periods, slopes, three-dimensionality),
    610 - dispersion relation,
    611 - retention of integral characteristics for mixed wave sea state.
    612 In this work both AR and MA model are verified by comparing probability
    613 distributions of different wave characteristics.
    614 
    615 In\nbsp{}cite:rozhkov1990stochastic the authors show that several sea wave
    616 characteristics (listed in table\nbsp{}[[tab-weibull-shape]]) have Weibull
    617 distribution, and wavy surface elevation has Gaussian distribution. In order to
verify that the distributions corresponding to the generated realisation are
correct, quantile-quantile plots are used (plots where analytic quantile values
are placed on the \(OX\) axis and estimated quantile values on the \(OY\)
axis). If the estimated distribution matches the analytic one, the graph has
the form of a straight line. Tails of the graph may diverge from the straight
line, because they cannot be reliably estimated from a finite-size realisation.
    624 
    625 #+name: tab-weibull-shape
    626 #+caption[Values of Weibull shape parameter]:
    627 #+caption: Values of Weibull shape parameter for different wave characteristics.
    628 #+attr_latex: :booktabs t
    629 | Characteristic       | Weibull shape (\(k\)) |
    630 |----------------------+-----------------------|
    631 |                      | <l>                   |
    632 | Wave height          | 2                     |
    633 | Wave length          | 2.3                   |
    634 | Crest length         | 2.3                   |
    635 | Wave period          | 3                     |
    636 | Wave slope           | 2.5                   |
    637 | Three-dimensionality | 2.5                   |
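The quantile-quantile check behind these plots is simple to reproduce. The
sketch below (Python instead of the R used for the figures; the synthetic
Weibull sample stands in for measured wave heights, and the scale parameter is
an illustrative assumption) pairs analytic Weibull quantiles with empirical
ones and checks that, away from the tails, the points stay near the line
\(y=x\):

```python
import math
import random

def weibull_quantile(p, shape, scale=1.0):
    """Analytic quantile (inverse CDF) of the Weibull distribution."""
    return scale * (-math.log(1.0 - p)) ** (1.0 / shape)

def qq_points(sample, shape, scale=1.0):
    """Pair analytic Weibull quantiles (OX) with empirical quantiles (OY);
    points on the straight line y = x indicate a matching distribution."""
    xs = sorted(sample)
    n = len(xs)
    return [(weibull_quantile((i + 0.5) / n, shape, scale), x)
            for i, x in enumerate(xs)]

random.seed(1)
# Stand-in for measured wave heights: Weibull with shape k = 2 (see table).
heights = [random.weibullvariate(1.0, 2.0) for _ in range(10000)]
points = qq_points(heights, shape=2.0)
# Away from the tails the graph should be close to the line y = x;
# the tails are excluded because they cannot be reliably estimated.
middle = points[2000:-2000]
max_rel_err = max(abs(y - x) / x for x, y in middle)
```

For a realisation whose wave heights really follow the tabulated Weibull law,
`max_rel_err` stays small, while the excluded tails may deviate noticeably, as
in the figures below.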
    638 
    639 Verification was performed for standing and propagating waves. The corresponding
    640 ACFs and quantile-quantile plots of wave characteristics distributions are shown
    641 in
    642 fig.\nbsp{}[[propagating-wave-distributions]],\nbsp{}[[standing-wave-distributions]],\nbsp{}[[acf-slices]].
    643 
    644 #+name: propagating-wave-distributions
    645 #+begin_src R :file build/propagating-wave-qqplots.pdf
    646 source(file.path("R", "common.R"))
    647 par(pty="s", mfrow=c(2, 2), family="serif")
    648 arma.qqplot_grid(
    649   file.path("build", "arma-benchmarks", "verification-orig", "propagating_wave"),
    650   c("elevation", "heights_y", "lengths_y", "periods"),
    651   c("elevation", "height Y", "length Y", "period"),
    652   xlab="x",
    653   ylab="y"
    654 )
    655 #+end_src
    656 
    657 #+caption[Quantile-quantile plots for propagating waves]:
    658 #+caption: Quantile-quantile plots for propagating waves.
    659 #+name: propagating-wave-distributions
    660 #+RESULTS: propagating-wave-distributions
    661 [[file:build/propagating-wave-qqplots.pdf]]
    662 
    663 #+name: standing-wave-distributions
    664 #+begin_src R :file build/standing-wave-qqplots.pdf
    665 source(file.path("R", "common.R"))
    666 par(pty="s", mfrow=c(2, 2), family="serif")
    667 arma.qqplot_grid(
    668   file.path("build", "arma-benchmarks", "verification-orig", "standing_wave"),
    669   c("elevation", "heights_y", "lengths_y", "periods"),
    670   c("elevation", "height Y", "length Y", "period"),
    671   xlab="x",
    672   ylab="y"
    673 )
    674 #+end_src
    675 
    676 #+caption[Quantile-quantile plots for standing waves]:
    677 #+caption: Quantile-quantile plots for standing waves.
    678 #+name: standing-wave-distributions
    679 #+RESULTS: standing-wave-distributions
    680 [[file:build/standing-wave-qqplots.pdf]]
    681 
    682 #+name: acf-slices
    683 #+header: :width 6 :height 9
    684 #+begin_src R :file build/acf-slices.pdf
    685 source(file.path("R", "common.R"))
    686 propagating_acf <- read.csv(file.path("build", "arma-benchmarks", "verification-orig", "propagating_wave", "acf.csv"))
    687 standing_acf <- read.csv(file.path("build", "arma-benchmarks", "verification-orig", "standing_wave", "acf.csv"))
    688 par(mfrow=c(5, 2), mar=c(0,0,0,0), family="serif")
    689 for (i in seq(0, 4)) {
    690   arma.wavy_plot(standing_acf, i, zlim=c(-5,5))
    691   arma.wavy_plot(propagating_acf, i, zlim=c(-5,5))
    692 }
    693 #+end_src
    694 
    695 #+caption[ACF time slices for standing and propagating waves]:
    696 #+caption: ACF time slices for standing (left column) and propagating waves
    697 #+caption: (right column).
    698 #+name: acf-slices
    699 #+RESULTS: acf-slices
    700 [[file:build/acf-slices.pdf]]
    701 
Graph tails in fig.\nbsp{}[[propagating-wave-distributions]] deviate from the
original distribution for individual wave characteristics, because every wave
has to be extracted from the resulting wavy surface to measure its length,
period and height. There is no algorithm that guarantees correct extraction of
all waves, because they may, and often do, overlap each other. The right tail
of the Weibull distribution represents infrequently occurring waves, so it
deviates more than the left tail.
    708 
Correspondence rate for standing waves (fig.\nbsp{}[[standing-wave-distributions]])
is lower for height and length, roughly the same for surface elevation, and
higher for wave period distribution tails. The lower correspondence degree for
length and height may be attributed to the fact that the Weibull distributions
were obtained empirically for sea waves, which are typically propagating, and
the distributions may be different for standing waves. The higher
correspondence degree for wave periods is attributed to the fact that periods
of standing waves are extracted more precisely, as the waves do not move
outside the simulated wavy surface region. The correspondence degree for wave
elevation is the same for both wave types, because elevation is a
characteristic of the wavy surface (and the corresponding AR or MA process) and
is not affected by the type of waves.
    720 
To summarise, the software implementation of AR and MA models (which is
described in detail in sec.\nbsp{}[[#sec-hpc]]) generates a wavy surface, the
distributions of whose individual wave characteristics correspond to /in situ/
data.
    724 
    725 ** Modelling non-linearity of sea waves
ARMA model allows modelling the asymmetry of wave elevation distribution,
i.e.\nbsp{}generating sea waves, the distribution of the \(z\)-coordinate of
which has non-nought kurtosis and skewness. Such a distribution is inherent to
real sea waves\nbsp{}cite:longuet1963nonlinear and is given either by a
polynomial approximation of /in situ/ data or by an analytic formula. Wave
asymmetry is modelled by a non-linear inertialess transform (NIT) of the
stochastic process; however, transforming the resulting wavy surface also
transforms the initial ACF. In order to alleviate this, the ACF must be
preliminarily transformed as shown in\nbsp{}cite:boukhanovsky1997thesis.
    735 
    736 **** Wavy surface transformation.
    737 Explicit formula \(z=f(y)\) that transforms wavy surface to desired
    738 one-dimensional distribution \(F(z)\) is the solution of non-linear
    739 transcendental equation \(F(z)=\Phi(y)\), where \(\Phi(y)\)\nbsp{}---
    740 one-dimensional Gaussian distribution. Since distribution of wave elevation is
    741 often given by some approximation based on /in situ/ data, this equation is
    742 solved numerically with respect to \(z_k\) in each grid point \(y_k|_{k=0}^N\)
    743 of generated wavy surface. In this case equation is rewritten as
\begin{equation}
    \label{eq-distribution-transformation}
    F(z_k)
    =
    \frac{1}{\sqrt{2\pi}}
    \int\limits_{-\infty}^{y_k} \exp\left[ -\frac{t^2}{2} \right] dt
    .
\end{equation}
Since distribution functions are monotonic, the simplest interval halving
(bisection) numerical method is used to solve this equation.
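The bisection step can be sketched as follows (a minimal Python illustration,
not the thesis code; the exponential target CDF is a stand-in for the
approximation of /in situ/ elevation data, chosen only because its exact
inverse is known):

```python
import math

def gaussian_cdf(y):
    """One-dimensional standard Gaussian distribution Phi(y)."""
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))

def transform_elevation(y, target_cdf, lo=-100.0, hi=100.0, eps=1e-12):
    """Solve F(z) = Phi(y) for z by interval halving (bisection),
    relying on the monotonicity of both distribution functions."""
    p = gaussian_cdf(y)
    while hi - lo > eps:
        mid = 0.5 * (lo + hi)
        if target_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Illustrative target: exponential CDF, for which the exact answer
# is z = -log(1 - Phi(y)).
exp_cdf = lambda z: 1.0 - math.exp(-z) if z > 0.0 else 0.0
z0 = transform_elevation(0.0, exp_cdf)
```

Applying `transform_elevation` to every grid point \(y_k\) of the generated
surface yields the realisation with the desired one-dimensional distribution.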
    754 
    755 **** Preliminary ACF transformation.
    756 In order to transform ACF \(\gamma_z\) of the process, it is expanded as a
    757 series of Hermite polynomials (Gram---Charlier series)
    758 \begin{equation*}
    759     \gamma_z \left( \vec u \right)
    760     =
    761     \sum\limits_{m=0}^{\infty}
    762     C_m^2 \frac{\gamma_y^m \left( \vec u \right)}{m!},
    763 \end{equation*}
    764 where
\begin{equation*}
    C_m = \frac{1}{\sqrt{2\pi}}
  \int\limits_{-\infty}^\infty
    f(y) H_m(y) \exp\left[ -\frac{y^2}{2} \right] dy,
\end{equation*}
    770 \(H_m\)\nbsp{}--- Hermite polynomial, and \(f(y)\)\nbsp{}--- solution to
    771 equation\nbsp{}eqref:eq-distribution-transformation. Plugging polynomial
    772 approximation \(f(y)\approx\sum\limits_{i}d_{i}y^i\) and analytic formulae for
    773 Hermite polynomial yields
\begin{equation*}
    \frac{1}{\sqrt{2\pi}}
    \int\limits_{-\infty}^\infty
    y^k \exp\left[ -\frac{y^2}{2} \right] dy
    =
    \begin{cases}
        (k-1)!! & \text{if }k\text{ is even},\\
        0       & \text{if }k\text{ is odd},
    \end{cases}
\end{equation*}
    784 which simplifies the former equation. Optimal number of coefficients \(C_m\) is
    785 determined by computing them sequentially and stopping when variances of both
    786 fields become equal with desired accuracy \(\epsilon\):
    787 \begin{equation}
    788     \label{eq-nit-error}
    789     \left| \Var{z} - \sum\limits_{k=0}^m
    790     \frac{C_k^2}{k!} \right| \leq \epsilon.
    791 \end{equation}
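This stopping rule can be checked on a self-contained example. In the Python
sketch below (illustrative, not the thesis implementation) the coefficients
\(C_m\) are computed with probabilists' Hermite polynomials and trapezoid-rule
integration; the identity transform \(f(y)=y\), an assumption made only for
this check, keeps the field Gaussian, so \(C_1=1\), all other coefficients
vanish and the criterion stops at \(m=1\):

```python
import math

def hermite(m, y):
    """Probabilists' Hermite polynomial He_m(y) by the usual recurrence."""
    h0, h1 = 1.0, y
    if m == 0:
        return h0
    for k in range(1, m):
        h0, h1 = h1, y * h1 - k * h0
    return h1

def gcs_coefficient(f, m, a=-8.0, b=8.0, n=4000):
    """C_m as a trapezoid-rule integral of f(y) He_m(y) exp(-y^2/2)
    divided by sqrt(2 pi); [a, b] truncates the infinite domain."""
    h = (b - a) / n
    total = sum((0.5 if i in (0, n) else 1.0)
                * f(a + i * h) * hermite(m, a + i * h)
                * math.exp(-0.5 * (a + i * h) ** 2)
                for i in range(n + 1))
    return total * h / math.sqrt(2.0 * math.pi)

def optimal_num_coefficients(f, var_z, eps=1e-6, max_m=10):
    """Smallest m for which |Var z - sum_k C_k^2 / k!| <= eps holds."""
    acc = 0.0
    for m in range(max_m + 1):
        c = gcs_coefficient(f, m)
        acc += c * c / math.factorial(m)
        if abs(var_z - acc) <= eps:
            return m
    return max_m
```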
    792 
In\nbsp{}cite:boukhanovsky1997thesis the author suggests using the polynomial
approximation \(f(y)\) also for the wavy surface transformation; however, in
practice a sea surface realisation often contains points whose
\(z\)-coordinate is beyond the limits of the approximation, leading to a sharp
decrease in precision. At these points it is more efficient to solve
equation\nbsp{}eqref:eq-distribution-transformation by the bisection method.
Using the same approximation in the Gram---Charlier series does not lead to
such errors.
    800 
    801 **** Gram---Charlier series expansion.
In\nbsp{}cite:huang1980experimental the authors experimentally show that the
PDF of sea surface elevation differs from the normal distribution by non-nought
kurtosis and skewness. In\nbsp{}cite:rozhkov1996theory the authors show that
this type of PDF expands as a Gram---Charlier series (GCS):
    806 \begin{align}
    807     \label{eq-skew-normal-1}
    808     & F(z; \mu=0, \sigma=1, \gamma_1, \gamma_2) \approx
    809     \Phi(z; \mu, \sigma) \frac{2 + \gamma_2}{2}
    810     - \frac{2}{3} \phi(z; \mu, \sigma)
    811     \left(\gamma_2 z^3+\gamma_1
    812     \left(2 z^2+1\right)\right) \nonumber
    813     \\
    814     & f(z; \mu, \sigma, \gamma_1, \gamma_2) \approx
    815     \phi(z; \mu, \sigma)
    816     \left[
    817         1+
    818         \frac{1}{3!} \gamma_1 H_3 \left(\frac{z-\mu}{\sigma}\right)
    819         + \frac{1}{4!} \gamma_2 H_4 \left(\frac{z-\mu}{\sigma}\right)
    820     \right],
    821 \end{align}
where \(\Phi(z)\)\nbsp{}--- cumulative distribution function (CDF) of normal
distribution, \(\phi\)\nbsp{}--- PDF of normal distribution,
\(\gamma_1\)\nbsp{}--- skewness, \(\gamma_2\)\nbsp{}--- kurtosis,
\(f\)\nbsp{}--- PDF and \(F\)\nbsp{}--- CDF of the resulting distribution.
According to\nbsp{}cite:rozhkov1990stochastic for sea waves skewness is selected
from the interval \(0.1\leq\gamma_1\leq{0.52}\) and kurtosis from the interval
\(0.1\leq\gamma_2\leq{0.7}\). Family of probability density functions for
different parameters is shown in fig.\nbsp{}[[fig-skew-normal-1]].
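The PDF from eq.\nbsp{}eqref:eq-skew-normal-1 is cheap to evaluate directly. A
small Python check (illustrative, not the thesis code; \(\mu=0\), \(\sigma=1\)
and the tabulated interval endpoints \(\gamma_1=0.52\), \(\gamma_2=0.7\) are
used as sample parameters) confirms that the Hermite corrections integrate to
nought and hence do not break normalisation:

```python
import math

def gcs_pdf(z, skewness, kurtosis):
    """PDF of the GCS-based distribution with mu = 0, sigma = 1:
    a Gaussian PDF corrected by Hermite polynomials H3 and H4."""
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    h3 = z ** 3 - 3.0 * z
    h4 = z ** 4 - 6.0 * z * z + 3.0
    return phi * (1.0 + skewness * h3 / 6.0 + kurtosis * h4 / 24.0)

# Trapezoid-rule check that the PDF still integrates to one for
# non-nought skewness and kurtosis.
a, b, n = -8.0, 8.0, 4000
h = (b - a) / n
total = h * sum((0.5 if i in (0, n) else 1.0) * gcs_pdf(a + i * h, 0.52, 0.7)
                for i in range(n + 1))
```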
    830 
    831 #+NAME: fig-skew-normal-1
    832 #+begin_src R :file build/skew-normal-1.pdf
    833 source(file.path("R", "common.R"))
    834 par(family="serif")
    835 x <- seq(-3, 3, length.out=100)
    836 params <- data.frame(
    837   skewness = c(0.00, 0.52, 0.00, 0.52),
    838   kurtosis = c(0.00, 0.00, 0.70, 0.70),
    839   linetypes = c("solid", "dashed", "dotdash", "dotted")
    840 )
    841 arma.skew_normal_1_plot(x, params)
    842 legend(
    843   "topleft",
    844   mapply(
    845     function (s, k) {
    846       as.expression(bquote(list(
    847         gamma[1] == .(arma.fmt(s, 2)),
    848         gamma[2] == .(arma.fmt(k, 2))
    849       )))
    850     },
    851     params$skewness,
    852     params$kurtosis
    853   ),
    854   lty = paste(params$linetypes)
    855 )
    856 #+end_src
    857 
    858 #+caption[Graph of PDF of GCS-based distribution]: Graph of probability density function of GCS-based distribution law for different values of skewness \(\gamma_1\) and kurtosis \(\gamma_2\).
    859 #+name: fig-skew-normal-1
    860 #+RESULTS: fig-skew-normal-1
    861 [[file:build/skew-normal-1.pdf]]
    862 
    863 **** Skew-normal distribution.
An alternative approach is to approximate the distribution of sea wavy surface
elevation by the skew-normal (SN) distribution:
    866 \begin{align}
    867     \label{eq-skew-normal-2}
    868     F(z; \alpha) & = \frac{1}{2}
    869    \mathrm{erfc}\left[-\frac{z}{\sqrt{2}}\right]-2 T(z,\alpha ), \nonumber \\
    870     f(z; \alpha) & = \frac{e^{-\frac{z^2}{2}}}{\sqrt{2 \pi }}
    871    \mathrm{erfc}\left[-\frac{\alpha z}{\sqrt{2}}\right],
    872 \end{align}
where \(T\)\nbsp{}--- Owen \(T\)-function\nbsp{}cite:owen1956tables. Using this
formula it is impossible to specify skewness and kurtosis separately\nbsp{}---
both values are adjusted via the \(\alpha\) parameter. The only advantage of
the formula is its relative computational simplicity: this function is
available in some programmes and mathematical libraries. Its graph for
different values of \(\alpha\) is shown in fig.\nbsp{}[[fig-skew-normal-2]].
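Since both moments hang on the single parameter, a quick numerical check of
eq.\nbsp{}eqref:eq-skew-normal-2 is useful. The Python sketch below
(illustrative; \(\alpha=4.9\) is taken from the figure's parameter set)
verifies that the PDF as written stays normalised for non-nought \(\alpha\):

```python
import math

def sn_pdf(z, alpha):
    """Skew-normal PDF f(z; alpha) as written in eq. (eq-skew-normal-2)."""
    return (math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
            * math.erfc(-alpha * z / math.sqrt(2.0)))

# Trapezoid-rule check: the erfc factor skews the Gaussian but keeps
# the total probability equal to one for every alpha.
a, b, n = -10.0, 10.0, 4000
h = (b - a) / n
total = h * sum((0.5 if i in (0, n) else 1.0) * sn_pdf(a + i * h, 4.9)
                for i in range(n + 1))
```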
    879 
    880 #+name: fig-skew-normal-2
    881 #+begin_src R :file build/skew-normal-2.pdf
    882 source(file.path("R", "common.R"))
    883 par(family="serif")
    884 x <- seq(-3, 3, length.out=100)
    885 alpha <- c(0.00, 0.87, 2.25, 4.90)
    886 params <- data.frame(
    887   alpha = alpha,
    888   skewness = arma.bits.skewness_2(alpha),
    889   kurtosis = arma.bits.kurtosis_2(alpha),
    890   linetypes = c("solid", "dashed", "dotdash", "dotted")
    891 )
    892 arma.skew_normal_2_plot(x, params)
    893 legend(
    894   "topleft",
    895   mapply(
    896     function (a, s, k) {
    897       as.expression(bquote(list(
    898         alpha == .(arma.fmt(a, 2)),
    899         gamma[1] == .(arma.fmt(s, 2)),
    900         gamma[2] == .(arma.fmt(k, 2))
    901       )))
    902     },
    903     params$alpha,
    904     params$skewness,
    905     params$kurtosis
    906   ),
    907   lty = paste(params$linetypes)
    908 )
    909 #+end_src
    910 
    911 #+caption[Graph of PDF of SN distribution]: Graph of PDF of skew-normal distribution for different values of skewness coefficient \(\alpha\).
    912 #+name: fig-skew-normal-2
    913 #+RESULTS: fig-skew-normal-2
    914 [[file:build/skew-normal-2.pdf]]
    915 
    916 **** Evaluation.
    917 In order to measure the effect of NIT on the shape of the resulting wavy
    918 surface, three realisations were generated:
    919 - realisation with Gaussian distribution (without NIT),
    920 - realisation with Gram---Charlier series (GCS) based distribution, and
    921 - realisation with skew normal distribution.
The seed of the PRNG was set to be the same for all programme executions to
make the ARMA model produce the same values for each realisation. There were
two experiments: for standing and propagating waves with ACFs given by the
formulae from section\nbsp{}[[#sec-wave-acfs]].
    926 
While the experiment showed that applying NIT with the GCS-based distribution
increases wave steepness, the same is not true for the skew normal distribution
(fig.\nbsp{}[[fig-nit]]). Using this distribution results in a wavy surface,
each \(z\)-coordinate of which is greater than or equal to nought. So, the skew
normal distribution is unsuitable for NIT. NIT increases the wave height and
steepness of both standing and propagating waves. Increasing either skewness or
kurtosis of the GCS-based distribution increases both wave height and
steepness. The error of ACF approximation (eq.\nbsp{}eqref:eq-nit-error) ranges
from 0.20 for the GCS-based distribution to 0.70 for the skew normal
distribution (table\nbsp{}[[tab-nit-error]]).
    936 
    937 #+name: fig-nit
    938 #+header: :width 7 :height 7
    939 #+begin_src R :file build/nit.pdf
    940 source(file.path("R", "nonlinear.R"))
    941 par(mfrow=c(2, 1), mar=c(4,4,4,0.5), family='serif')
    942 args <- list(
    943   graphs=c('Gaussian', 'Gram—Charlier', 'Skew normal'),
    944   linetypes=c('solid', 'dashed', 'dotted')
    945 )
    946 args$title <- 'Propagating waves'
    947 arma.plot_nonlinear(file.path("build", "arma-benchmarks", "verification-orig", "nit-propagating"), args)
    948 args$title <- 'Standing waves'
    949 arma.plot_nonlinear(file.path("build", "arma-benchmarks", "verification-orig", "nit-standing"), args)
    950 #+end_src
    951 
    952 #+name: fig-nit
    953 #+caption[Surface slices with different distributions of applicates]:
    954 #+caption: Wavy surface slices with different distributions
    955 #+caption: of wave elevation (Gaussian, GCS-based and SN).
    956 #+RESULTS: fig-nit
    957 [[file:build/nit.pdf]]
    958 
    959 #+name: tab-nit-error
    960 #+caption[Errors of ACF approximations]:
    961 #+caption: Errors of ACF approximations (the difference of variances) for
    962 #+caption: different wave elevation distributions. \(N\) is the number of
    963 #+caption: coefficients of ACF approximation.
    964 #+attr_latex: :booktabs t
    965 | Wave type   | Distribution | \(\gamma_1\) | \(\gamma_2\) | \(\alpha\) | Error | \(N\) | Wave height |
    966 |-------------+--------------+--------------+--------------+------------+-------+-------+-------------|
    967 | propagating | Gaussian     |              |              |            |       |       |        2.41 |
    968 | propagating | GCS-based    |         2.25 |          0.4 |            |  0.20 |     2 |        2.75 |
    969 | propagating | skew normal  |              |              |          1 |  0.70 |     3 |        1.37 |
    970 | standing    | Gaussian     |              |              |            |       |       |        1.73 |
    971 | standing    | GCS-based    |         2.25 |          0.4 |            |  0.26 |     2 |        1.96 |
    972 | standing    | skew normal  |              |              |          1 |  0.70 |     3 |        0.94 |
    973 
To summarise, the only test case that showed acceptable results is the
realisation with GCS-based distribution, for both standing and propagating
waves. The skew normal distribution warps the wavy surface for both types of
waves. GCS-based realisations have a large error of ACF approximation, which
results in an increase of wave height. The reason for the large error is that
GCS approximations are inaccurate, as they do not converge for all possible
functions\nbsp{}cite:wallace1958asymptotic. Despite the large error, the change
in wave height is small (table\nbsp{}[[tab-nit-error]]).
    982 
    983 ***** Wave height                                              :noexport:
    984 :PROPERTIES:
    985 :header-args:R: :results output org
    986 :END:
    987 
    988 #+header:
    989 #+begin_src R :results output org
    990 source(file.path("R", "nonlinear.R"))
    991 propagating <- arma.print_wave_height(file.path("build", "arma-benchmarks", "verification-orig", "nit-propagating"))
    992 standing <- arma.print_wave_height(file.path("build", "arma-benchmarks", "verification-orig", "nit-standing"))
    993 result <- data.frame(
    994   h1=c(propagating$h1, standing$h1),
    995   h2=c(propagating$h2, standing$h2),
    996   h3=c(propagating$h3, standing$h3)
    997 )
    998 rownames(result) <- c('propagating', 'standing')
    999 colnames(result) <- c('none', 'gcs', 'sn')
   1000 ascii(result)
   1001 #+end_src
   1002 
   1003 #+RESULTS:
   1004 #+BEGIN_SRC org
   1005 |             | none |  gcs |   sn |
   1006 |-------------+------+------+------|
   1007 | propagating | 2.41 | 2.75 | 1.37 |
   1008 | standing    | 1.73 | 1.96 | 0.94 |
   1009 #+END_SRC
   1010 
   1011 ** The shape of ACF for different types of waves
   1012 :PROPERTIES:
   1013 :CUSTOM_ID: sec-wave-acfs
   1014 :END:
   1015 
   1016 **** Analytic method of finding the ACF.
   1017 The straightforward way to find ACF for a given sea wave profile is to apply
   1018 Wiener---Khinchin theorem. According to this theorem the autocorrelation \(K\) of
   1019 a function \(\zeta\) is given by the Fourier transform of the absolute square of
   1020 the function:
   1021 \begin{equation}
   1022   K(t) = \Fourier{\left| \zeta(t) \right|^2}.
   1023   \label{eq-wiener-khinchin}
   1024 \end{equation}
When \(\zeta\) is replaced with an actual wave profile, this formula gives an
analytic formula for the corresponding ACF.
   1027 
For a three-dimensional wave profile (two-dimensional in space and
one-dimensional in time) the analytic formula is a polynomial of high order,
which is best obtained via a symbolic computation programme. For practical
usage it can then be approximated by a superposition of exponentially decaying
cosines (which is what the ACF of a stationary ARMA process looks
like\nbsp{}cite:box1976time).
   1033 
   1034 **** Empirical method of finding the ACF.
However, for the three-dimensional case there exists a simpler empirical method
which does not require sophisticated software to determine the shape of the
ACF. It is known that an ACF represented by exponentially decaying cosines
satisfies first order Stokes' equations for gravity
waves\nbsp{}cite:boccotti1983wind. So, if the shape of the wave profile is the
only concern in the simulation, one can simply multiply the profile by a
decaying exponent to get an appropriate ACF. This ACF does not reflect other
wave profile parameters, such as wave height and period, but it opens the
possibility to simulate waves of a particular shape by defining their profile
with a discretely given function, multiplying it by a decaying exponent and
using the resulting function as the ACF. So, this empirical method is
imprecise, but offers a simpler alternative to the Wiener---Khinchin theorem
approach; it is mainly useful to test the ARMA model.
   1047 
   1048 **** Standing wave ACF.
   1049 For three-dimensional plain standing wave the profile is given by
   1050 \begin{equation}
   1051   \zeta(t, x, y) = A \sin (k_x x + k_y y) \sin (\sigma t).
   1052   \label{eq-standing-wave}
   1053 \end{equation}
Find the ACF via the analytic method. Multiplying the formula by a decaying
exponent (because the Fourier transform is defined for a function \(f\) such
that \(f\underset{x\rightarrow\pm\infty}{\longrightarrow}0\)) yields
   1057 \begin{equation}
   1058   \zeta(t, x, y) =
   1059   A
   1060   \exp\left[-\alpha (|t|+|x|+|y|) \right]
   1061   \sin (k_x x + k_y y) \sin (\sigma t).
   1062   \label{eq-decaying-standing-wave}
   1063 \end{equation}
Then apply the three-dimensional Fourier transform to both sides of the
equation via a symbolic computation programme and fit the resulting polynomial
to the following approximation:
   1067 \begin{equation}
   1068   K(t,x,y) =
   1069   \gamma
   1070   \exp\left[-\alpha (|t|+|x|+|y|) \right]
   1071   \cos \beta t
   1072   \cos \left[ \beta x + \beta y \right].
   1073   \label{eq-standing-wave-acf}
   1074 \end{equation}
So, after applying the Wiener---Khinchin theorem we get the initial formula,
but with cosines instead of sines. This difference is important, because the
value of the ACF at \((0,0,0)\) equals the ARMA process variance, and with
sines this value would be wrongly nought.
   1079 
   1080 If one tries to replicate the same formula via empirical method, the usual way
   1081 is to adapt\nbsp{}eqref:eq-decaying-standing-wave to
   1082 match\nbsp{}eqref:eq-standing-wave-acf. This can be done either by changing the
   1083 phase of the sine, or by substituting sine with cosine to move the maximum of
   1084 the function to the origin of coordinates.
   1085 
   1086 **** Propagating wave ACF.
   1087 Three-dimensional profile of plain propagating wave is given by
   1088 \begin{equation}
   1089   \zeta(t, x, y) = A \cos (\sigma t + k_x x + k_y y).
   1090   \label{eq-propagating-wave}
   1091 \end{equation}
   1092 For the analytic method repeating steps from the previous two paragraphs yields
   1093 \begin{equation}
   1094   K(t,x,y) =
   1095   \gamma
   1096   \exp\left[-\alpha (|t|+|x|+|y|) \right]
   1097   \cos\left[\beta (t+x+y) \right].
   1098   \label{eq-propagating-wave-acf}
   1099 \end{equation}
For the empirical method the wave profile is simply multiplied by a decaying
exponent, without the need to move the maximum of the ACF to the origin (as is
required for a standing wave).
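Both approximations are cheap to evaluate on a grid. The Python sketch below
(illustrative only; the parameter values \(\gamma\), \(\alpha\), \(\beta\) are
hypothetical placeholders) encodes
eqs.\nbsp{}eqref:eq-standing-wave-acf,\nbsp{}eqref:eq-propagating-wave-acf and
shows why the cosine form matters: at the origin both ACFs equal \(\gamma\),
the process variance, whereas the sine form would give nought there:

```python
import math

def propagating_wave_acf(t, x, y, gamma=1.0, alpha=0.5, beta=2.0):
    """Eq. (eq-propagating-wave-acf): a decaying exponent times a
    cosine of the travelling phase beta*(t + x + y)."""
    return (gamma * math.exp(-alpha * (abs(t) + abs(x) + abs(y)))
            * math.cos(beta * (t + x + y)))

def standing_wave_acf(t, x, y, gamma=1.0, alpha=0.5, beta=2.0):
    """Eq. (eq-standing-wave-acf): time and space phases factor into
    separate cosines, with the maximum moved to the origin."""
    return (gamma * math.exp(-alpha * (abs(t) + abs(x) + abs(y)))
            * math.cos(beta * t) * math.cos(beta * (x + y)))
```

Far from the origin both functions decay under the exponential envelope
\(\gamma\exp\left[-\alpha(|t|+|x|+|y|)\right]\), as a stationary ARMA process
ACF should.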
   1103 
   1104 **** Comparison of studied methods.
   1105 To summarise, the analytic method of finding sea wave's ACF reduces to the
   1106 following steps.
   1107 - Make wave profile decay when approaching \(\pm\infty\) by multiplying it by
   1108   a decaying exponent.
   1109 - Apply Fourier transform to the absolute square of the resulting equation using
   1110   symbolic computation programme.
   1111 - Fit the resulting polynomial to the appropriate ACF approximation.
   1112 
   1113 Two examples in this section showed that in case of standing and propagating
   1114 waves their decaying profiles resemble the corresponding ACFs with the exception
   1115 that the ACF's maximum should be moved to the origin to preserve simulated
   1116 process variance. Empirical method of finding ACF reduces to the following
   1117 steps.
   1118 - Make wave profile decay when approaching \(\pm\infty\) by multiplying it by
   1119   a decaying exponent.
   1120 - Move maximum value of the resulting function to the origin by using
   1121   trigonometric identities to shift the phase.
   1122 
   1123 ** Summary
ARMA model, owing to its non-physical nature, has no notion of an individual
sea wave; it simulates the wavy surface as a whole instead. Motions and shapes
of individual waves are often rough, and the total number of waves cannot be
determined precisely. However, integral characteristics of the wavy surface
match those of real sea waves.
   1129 
Theoretically, sea wave profiles themselves can be chosen as ACFs; the only
pre-processing step is to make them decay exponentially. This may allow
generating waves of arbitrary profiles without using NIT, and is one of the
directions of future work.
   1134 
   1135 * Pressure field under discretely given wavy surface
   1136 ** Known pressure field determination formulae
   1137 **** Small amplitude waves theory.
In\nbsp{}cite:stab2012,degtyarev1998modelling,degtyarev1997analysis the authors
propose a solution for the inverse problem of hydrodynamics of potential flow
in the framework of small-amplitude wave theory (under the assumption that wave
length is much larger than wave height: \(\lambda \gg h\)). In that case the
inverse problem is linear and reduces to a Laplace equation with mixed boundary
conditions, and the equation of motion is used solely to determine pressures
for the calculated velocity potential derivatives. The assumption of small
amplitudes means slow decay of the wind wave coherence function,
i.e.\nbsp{}small change of the local wave number in time and space compared to
the wavy surface elevation (\(z\) coordinate). This assumption allows
calculating the elevation \(z\) derivative as \(\zeta_z=k\zeta\), where \(k\)
is the wave number. In the two-dimensional case the solution is written
explicitly as
   1150 \begin{align}
   1151     \left.\frac{\partial\phi}{\partial x}\right|_{x,t}= &
   1152         -\frac{1}{\sqrt{1+\alpha^{2}}}e^{-I(x)}
   1153             \int\limits_{0}^x\frac{\partial\dot{\zeta}/\partial
   1154                 z+\alpha\dot{\alpha}}{\sqrt{1+\alpha^{2}}}e^{I(x)}dx,\label{eq-old-sol-2d}\\
   1155     I(x)= & \int\limits_{0}^x\frac{\partial\alpha/\partial z}{1+\alpha^{2}}dx,\nonumber
   1156 \end{align}
where \(\alpha\) is the wave slope. In the three-dimensional case the solution
is written in the form of an elliptic partial differential equation (PDE):
   1159 \begin{align*}
   1160     & \frac{\partial^2 \phi}{\partial x^2} \left( 1 + \alpha_x^2 \right) +
   1161     \frac{\partial^2 \phi}{\partial y^2} \left( 1 + \alpha_y^2 \right) +
   1162     2\alpha_x\alpha_y \frac{\partial^2 \phi}{\partial x \partial y} + \\
   1163     & \left(
   1164         \frac{\partial \alpha_x}{\partial z} +
   1165         \alpha_x \frac{\partial \alpha_x}{\partial x} +
   1166         \alpha_y \frac{\partial \alpha_x}{\partial y}
   1167     \right) \frac{\partial \phi}{\partial x} + \\
   1168     & \left(
   1169         \frac{\partial \alpha_y}{\partial z} +
   1170         \alpha_x \frac{\partial \alpha_y}{\partial x} +
   1171         \alpha_y \frac{\partial \alpha_y}{\partial y}
   1172     \right) \frac{\partial \phi}{\partial y} + \\
   1173     & \frac{\partial \dot{\zeta}}{\partial z} +
   1174     \alpha_x \dot{\alpha_x} + \alpha_y \dot{\alpha_y} = 0.
   1175 \end{align*}
The authors suggest transforming this equation to finite differences and
solving it numerically.
   1178 
As will be shown in section\nbsp{}[[#sec-compare-formulae]],
formula\nbsp{}eqref:eq-old-sol-2d diverges when used to calculate the velocity
field for large-amplitude waves, which is why it can not be used in conjunction
with a model that generates waves of arbitrary amplitude.
   1183 
   1184 **** Linearisation of boundary condition.
   1185 :PROPERTIES:
   1186 :CUSTOM_ID: linearisation
   1187 :END:
   1188 
The LH model allows deriving an explicit formula for the velocity field by
linearising the kinematic boundary condition. The velocity potential formula is
written as
   1191 \begin{equation*}
   1192 \phi(x,y,z,t) = \sum_n \frac{c_n g}{\omega_n}
   1193      e^{\sqrt{u_n^2+v_n^2} z}
   1194      \sin(u_n x + v_n y - \omega_n t + \epsilon_n).
   1195 \end{equation*}
This formula is differentiated to obtain velocity potential derivatives, which
are plugged into the dynamic boundary condition to determine the pressure
field.
   1198 
   1199 ** Determining wave pressures for discretely given wavy surface
Analytic solutions to boundary problems for classical equations are often used
to study various properties of the solution; for this purpose the general
solution formula is frequently too unwieldy, as it contains integrals of
unknown functions. The Fourier method is one way of finding analytic solutions
to PDEs. It is based on the Fourier transform, the application of which reduces
some PDEs to algebraic equations, and the solution is written as the inverse
Fourier transform of some function (which may in turn contain Fourier
transforms of other functions). Since in some cases these Fourier transforms
can not be written analytically, the solutions are found and their behaviour is
studied in different domains instead. At the same time, discrete Fourier
transforms can be computed numerically for any discretely defined function
using FFT algorithms. These algorithms use the symmetry of complex exponentials
to decrease asymptotic complexity from \(\mathcal{O}(n^2)\) to
\(\mathcal{O}(n\log_{2}n)\). So, even if the general solution contains Fourier
transforms of unknown functions, they can still be computed numerically, and
the FFT family of algorithms makes this approach efficient.
   1216 
An alternative approach to solving PDEs is to reduce them to difference
equations, which are solved by constructing various numerical schemes. This
approach leads to an approximate solution, and the asymptotic complexity of the
corresponding algorithms is comparable to that of the FFT. For example, for a
stationary elliptic PDE an implicit numerical scheme is constructed; the scheme
is computed by an iterative method, on each step of which a tridiagonal or
five-diagonal system of algebraic equations is solved by the Thomas algorithm.
The asymptotic complexity of this approach is \(\mathcal{O}({n}{m})\), where
\(n\) is the number of wavy surface grid points and \(m\) is the number of
iterations.
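For reference, the per-step tridiagonal solve can be sketched compactly. The
sketch below is a generic Thomas algorithm applied to an arbitrary diagonally
dominant system, not the exact scheme used for the elliptic PDE:

#+begin_src python
import numpy as np

def thomas(a, b, c, d):
    """Solve a tridiagonal system in O(n): a is the sub-diagonal,
    b the main diagonal, c the super-diagonal (a[0], c[-1] unused)."""
    n = len(b)
    cp = np.empty(n)
    dp = np.empty(n)
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):                 # forward sweep
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):        # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# an arbitrary diagonally dominant system for illustration
n = 50
rng = np.random.default_rng(0)
a, b, c = -np.ones(n), 4.0 * np.ones(n), -np.ones(n)
d = rng.standard_normal(n)
x = thomas(a, b, c, d)
#+end_src
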
   1226 
Despite their widespread use, iterative algorithms are inefficient on parallel
computer architectures due to the inevitable process synchronisation after each
iteration; in particular, mapping them to co-processors may involve copying
data in and out of the co-processor on each iteration, which degrades their
performance. At the same time, the large number of Fourier transforms in the
solution is an advantage rather than a disadvantage. First, solutions obtained
by the Fourier method are explicit, hence their implementations scale to a
large number of parallel processor cores. Second, there are implementations of
the FFT optimised for different processor architectures as well as
co-processors (GPU, MIC), which makes it easy to achieve high performance on
any computing platform. These advantages substantiate the choice of the Fourier
method to obtain an explicit solution to the problem of determining pressures
under a wavy sea surface.
   1239 
   1240 *** Two-dimensional velocity potential field
   1241 :PROPERTIES:
   1242 :CUSTOM_ID: sec-pressure-2d
   1243 :END:
   1244 
   1245 **** Formula for infinite depth fluid.
   1246 Two-dimensional Laplace equation with Robin boundary condition is written as
   1247 \begin{align}
   1248     \label{eq-problem-2d}
   1249     & \phi_{xx}+\phi_{zz}=0,\\
   1250     & \zeta_t + \zeta_x\phi_x
   1251     = \FracSqrtZetaX{\zeta_x}\phi_x
   1252     - \FracSqrtZetaX{1}\phi_z,
   1253     & \text{at }z=\zeta(x,t).\nonumber
   1254 \end{align}
We use the Fourier method to solve this problem. Applying the Fourier transform
to both sides of the equation yields
   1257 \begin{equation*}
   1258     -4 \pi^2 \left( u^2 + v^2 \right)
   1259     \FourierY{\phi(x,z)}{u,v} = 0,
   1260 \end{equation*}
   1261 hence \(v = \pm i u\). Hereinafter we use the following symmetric form of
   1262 Fourier transform:
   1263 \begin{equation*}
   1264     \FourierY{f(x,y)}{u,v} =
   1265     \iint\limits_{-\infty}^{\phantom{--}\infty}
   1266     f(x,y)
   1267     e^{-2\pi i (x u + y v)}
   1268     dx dy.
   1269 \end{equation*}
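In numerical computations this convention is approximated by the discrete
transform. The following sketch (grid parameters are illustrative) checks the
one-dimensional analogue on the Gaussian \(e^{-\pi x^2}\), which is its own
transform under this convention:

#+begin_src python
import numpy as np

N, L = 256, 20.0                     # illustrative grid size and extent
dx = L / N
x = -L / 2 + dx * np.arange(N)
f = np.exp(-np.pi * x**2)

# F(u) = \int f(x) exp(-2 pi i x u) dx, sampled at u = k/L;
# the (-1)^k factor accounts for the grid starting at -L/2
k = np.arange(N)
F = dx * (-1.0)**k * np.fft.fft(f)
u = np.fft.fftfreq(N, d=dx)

err = np.max(np.abs(F - np.exp(-np.pi * u**2)))
#+end_src
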
   1270 We seek solution in the form of inverse Fourier transform
   1271 \(\phi(x,z)=\InverseFourierY{E(u,v)}{x,z}\). Plugging[fn::\(v={-i}{u}\) is not
   1272 applicable because velocity potential must go to nought when depth goes to
   1273 infinity.] \(v={i}{u}\) into the formula yields
   1274 \begin{equation}
   1275     \label{eq-guessed-sol-2d}
   1276     \phi(x,z) = \InverseFourierY{e^{2\pi u z}E(u)}{x}.
   1277 \end{equation}
In order to prevent the substitution \(z=\zeta(x,t)\) from interfering with
Fourier transforms, we rewrite\nbsp{}eqref:eq-guessed-sol-2d as a convolution:
   1280 \begin{equation*}
   1281     \phi(x,z)
   1282     =
   1283     \Fun{z}
   1284     \ast
   1285     \InverseFourierY{E(u)}{x},
   1286 \end{equation*}
where \(\Fun{z}\) is a function whose form is defined in
section\nbsp{}[[#sec-compute-delta]] and which satisfies the equation
\(\FourierY{\Fun{z}}{u}=e^{2\pi{u}{z}}\). Plugging the formula for \(\phi\) into
the boundary condition yields
   1291 \begin{equation*}
   1292     \zeta_t
   1293     =
   1294     \left( i f(x) - 1/\SqrtZetaX \right)
   1295     \left[
   1296         \Fun{z}
   1297         \ast
   1298         \InverseFourierY{2\pi u E(u)}{x}
   1299     \right],
   1300 \end{equation*}
where \(f(x)={\zeta_x}/{\sqrt{1+\zeta_x^2}}-\zeta_x\). Applying the Fourier
transform to both sides of this equation yields a formula for the coefficients
\(E\):
   1303 \begin{equation*}
   1304     E(u) =
   1305     \frac{1}{2\pi u}
   1306     \frac{
   1307     \FourierY{\zeta_t / \left(i f(x) - 1/\SqrtZetaX\right)}{u}
   1308     }{
   1309     \FourierY{\Fun{z}}{u}
   1310     }
   1311 \end{equation*}
Finally, substituting \(\zeta(x,t)\) for \(z\) and plugging the resulting
equation into\nbsp{}eqref:eq-guessed-sol-2d yields the formula for \(\phi(x,z)\):
   1314 \begin{equation}
   1315     \label{eq-solution-2d}
   1316     \boxed{
   1317         \phi(x,z)
   1318         =
   1319         \InverseFourierY{
   1320             \frac{e^{2\pi u z}}{2\pi u}
   1321             \frac{
   1322             \FourierY{ \zeta_t / \left(i f(x) - 1/\SqrtZetaX\right) }{u}
   1323             }{
   1324             \FourierY{ \Fun{\zeta(x,t)} }{u}
   1325             }
   1326         }{x}.
   1327     }
   1328 \end{equation}
   1329 
The multiplier \(e^{2\pi{u}{z}}/(2\pi{u})\) makes the graph of the function to
which the Fourier transform is applied asymmetric with respect to the \(OY\)
axis. This makes it difficult to use the FFT, which expects a periodic function
that vanishes at both ends of the interval. At the same time, using numerical
integration instead of the FFT is no faster than solving the initial system of
equations with numerical schemes. This problem is alleviated by using
formula\nbsp{}eqref:eq-solution-2d-full for finite depth fluid with a
deliberately large depth \(h\). This formula is derived in the following
section.
   1338 
   1339 **** Formula for finite depth fluid.
At the sea bottom the vertical fluid velocity component vanishes,
i.e.\nbsp{}\(\phi_z=0\) at \(z=-h\), where \(h\) is the water depth. In this
case the root \(v=-{i}{u}\) of the Laplace equation can not be neglected, hence
the solution is sought in the following form:
   1344 \begin{equation}
   1345     \phi(x,z)
   1346     =
   1347     \InverseFourierY{
   1348         \left( C_1 e^{2\pi u z} + C_2 e^{-2\pi u z} \right)
   1349         E(u)
   1350     }{x}.
   1351     \label{eq-guessed-sol-2d-full}
   1352 \end{equation}
   1353 Plugging \(\phi\) into the boundary condition on the sea bottom yields
   1354 \begin{equation*}
   1355     C_1 e^{-2\pi u h} - C_2 e^{2\pi u h} = 0,
   1356 \end{equation*}
hence \(C_1=\frac{1}{2}C{e}^{2\pi{u}{h}}\) and
\(C_2=-\frac{1}{2}C{e}^{-2\pi{u}{h}}\). The constant \(C\) may take an
arbitrary value here, because after plugging it in it becomes part of the
unknown coefficients \(E(u)\). Plugging the formulae for \(C_1\) and \(C_2\)
into\nbsp{}eqref:eq-guessed-sol-2d-full yields
   1362 \begin{equation*}
   1363     \phi(x,z) = \InverseFourierY{ \Sinh{2\pi u (z+h)} E(u) }{x}.
   1364 \end{equation*}
   1365 Plugging \(\phi\) into the boundary condition on the free surface yields
   1366 \begin{align*}
   1367     \zeta_t & = f(x) \InverseFourierY{ 2\pi i u \Sinh{2\pi u (z+h)} E(u) }{x} \\
   1368             & - \frac{1}{\SqrtZetaX}
   1369               \InverseFourierY{ 2\pi u \SinhX{2\pi u (z+h)} E(u) }{x}.
   1370 \end{align*}
Here \(\sinh\) and \(\cosh\) give similar results near the free surface, and
since this is the main area of interest in practical applications, we assume
that \(\Sinh{2\pi{u}(z+h)}\approx\SinhX{2\pi{u}(z+h)}\). Performing
transformations analogous to those of the previous section yields the final
formula for \(\phi(x,z)\):
   1375 \begin{equation}
   1376 \boxed{
   1377     \phi(x,z,t)
   1378     =
   1379   \InverseFourierY{
   1380         \frac{\Sinh{2\pi u (z+h)}}{2\pi u}
   1381         \frac{
   1382             \FourierY{ \zeta_t / \left(i f(x) - 1/\SqrtZetaX\right) }{u}
   1383         }{
   1384             \FourierY{ \FunSecond{\zeta(x,t)} }{u}
   1385         }
   1386     }{x},
   1387 }
   1388     \label{eq-solution-2d-full}
   1389 \end{equation}
where \(\FunSecond{z}\) is a function whose form is defined in
section\nbsp{}[[#sec-compute-delta]] and which satisfies the equation
\(\FourierY{\FunSecond{z}}{u}=\Sinh{2\pi{u}{z}}\).
   1393 
   1394 **** Reduction to the formulae from linear wave theory.
We check the validity of the derived formulae by substituting \(\zeta(x,t)\)
with a known analytic formula for plain waves. Symbolic computations of Fourier
transforms in this section were performed in Mathematica\nbsp{}cite:mathematica10.
In the framework of linear wave theory we assume that wave amplitude is small
compared to wave length, which allows us to simplify the initial system of
equations\nbsp{}eqref:eq-problem-2d to
   1401 \begin{align*}
   1402     & \phi_{xx}+\phi_{zz}=0,\\
   1403     & \zeta_t = -\phi_z & \text{at }z=\zeta(x,t),
   1404 \end{align*}
   1405 solution to which is written as
   1406 \begin{equation*}
   1407     \phi(x,z,t)
   1408     =
   1409     -\InverseFourierY{
   1410         \frac{e^{2\pi u z}}{2\pi u}
   1411         \FourierY{\zeta_t}{u}
   1412     }{x}
   1413     .
   1414 \end{equation*}
The propagating wave profile is defined as \(\zeta(x,t)=A\cos(2\pi(kx-t))\).
Plugging this formula into\nbsp{}eqref:eq-solution-2d yields
\(\phi(x,z,t)=-\frac{A}{k}\sin(2\pi(kx-t))\Sinh{2\pi{k}{z}}\). In order to
reduce it to the formula from linear wave theory, we rewrite the hyperbolic
cosine in exponential form and discard the term containing \(e^{-2\pi{k}{z}}\)
as contradicting the condition
\(\phi\underset{z\rightarrow-\infty}{\longrightarrow}0\). Taking the real part
of the resulting formula yields
\(\phi(x,z,t)=\frac{A}{k}e^{2\pi{k}{z}}\sin(2\pi(kx-t))\), which corresponds to
the known formula from linear wave theory. Similarly, under the small-amplitude
waves assumption the formula for finite depth
fluid\nbsp{}eqref:eq-solution-2d-full reduces to
   1427 \begin{equation*}
   1428     \phi(x,z,t)
   1429     =
   1430     -\InverseFourierY{
   1431         \frac{\Sinh{2\pi u (z+h)}}{2\pi u \Sinh{2\pi u h}}
   1432         \FourierY{\zeta_t}{u}
   1433     }{x}.
   1434 \end{equation*}
Substituting the propagating plain wave profile formula for \(\zeta(x,t)\)
yields
   1436 \begin{equation}
   1437     \label{eq-solution-2d-linear}
   1438     \phi(x,z,t)=\frac{A}{k}
   1439     \frac{\Sinh{2 \pi k (z+h)}}{ \Sinh{2 \pi k h} }
   1440     \sin(2 \pi (k x-t)),
   1441 \end{equation}
   1442 which corresponds to the formula from linear wave theory for finite depth fluid.
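Formula\nbsp{}eqref:eq-solution-2d-linear can also be double-checked
symbolically. The sketch below uses SymPy (in place of the Mathematica
computations mentioned above) and takes \(\Sinh{\cdot}\) to denote the
hyperbolic cosine, as in the reduction; it verifies the Laplace equation and
the sea bottom boundary condition:

#+begin_src python
import sympy as sp

x, z, t = sp.symbols('x z t', real=True)
A, k, h = sp.symbols('A k h', positive=True)

# velocity potential from the linear wave theory reduction
phi = (A/k) * sp.cosh(2*sp.pi*k*(z + h)) / sp.cosh(2*sp.pi*k*h) \
      * sp.sin(2*sp.pi*(k*x - t))

laplacian = sp.simplify(sp.diff(phi, x, 2) + sp.diff(phi, z, 2))
bottom = sp.simplify(sp.diff(phi, z).subs(z, -h))   # phi_z at z = -h
#+end_src
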
   1443 
Different forms of the Laplace equation solution, in which the decaying
exponent is written with either a "+" or a "-" sign, may cause
incompatibilities between formulae from linear wave theory and the formulae
derived in this work, where \(\sinh\) is used instead of \(\cosh\). The
equality
\(\frac{\Sinh{2\pi{k}(z+h)}}{\Sinh{2\pi{k}{h}}}\approx\frac{\sinh(2\pi{k}(z+h))}{\sinh(2\pi{k}{h})}\)
holds strictly on the free surface, and the difference between the left-hand
and right-hand sides increases when approaching the sea bottom (for
sufficiently large depth the difference near the free surface is negligible).
So, for sufficiently large depth either function (\(\cosh\) or \(\sinh\)) may
be used to compute the velocity potential near the free surface.
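This observation is easy to check numerically. In the sketch below (wave number
and depth are arbitrary) both normalised profiles coincide near the free
surface for a sufficiently large depth:

#+begin_src python
import numpy as np

k, h = 0.5, 50.0                     # arbitrary wave number and large depth
z = np.linspace(-5.0, 0.0, 101)      # region near the free surface

ratio_cosh = np.cosh(2*np.pi*k*(z + h)) / np.cosh(2*np.pi*k*h)
ratio_sinh = np.sinh(2*np.pi*k*(z + h)) / np.sinh(2*np.pi*k*h)
max_diff = np.max(np.abs(ratio_cosh - ratio_sinh))
#+end_src
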
   1454 
Reducing\nbsp{}eqref:eq-solution-2d and\nbsp{}eqref:eq-solution-2d-full to the
known formulae from linear wave theory shows that the formula for infinite
depth\nbsp{}eqref:eq-solution-2d is not suitable for computing velocity
potentials with the Fourier method, because it lacks the symmetry required for
the Fourier transform. However, the formula for finite depth can be used
instead by setting \(h\) to some characteristic water depth. For a standing
wave the reduction to linear wave theory formulae is carried out under the same
assumptions.

   1462 *** Three-dimensional velocity potential field
   1463 Three-dimensional version of\nbsp{}eqref:eq-problem is written as
   1464 \begin{align}
   1465     \label{eq-problem-3d}
   1466     & \phi_{xx} + \phi_{yy} + \phi_{zz} = 0,\\
   1467     & \zeta_t + \zeta_x\phi_x + \zeta_y\phi_y
   1468     = \FracSqrtZetaY{\zeta_x} \phi_x
   1469     + \FracSqrtZetaY{\zeta_y} \phi_y
   1470     - \FracSqrtZetaY{1} \phi_z, \nonumber\\
   1471     & \text{at }z=\zeta(x,y,t).\nonumber
   1472 \end{align}
Again, we use the Fourier method to solve it. Applying the Fourier transform to
both sides of the Laplace equation yields
   1475 \begin{equation*}
   1476     -4 \pi^2 \left( u^2 + v^2 + w^2 \right)
   1477     \FourierY{\phi(x,y,z)}{u,v,w} = 0,
   1478 \end{equation*}
   1479 hence \(w=\pm{i}\sqrt{u^2+v^2}\). We seek solution in the form of inverse Fourier
   1480 transform \(\phi(x,y,z)=\InverseFourierY{E(u,v,w)}{x,y,z}\). Plugging
   1481 \(w=i\sqrt{u^2+v^2}=i\Kveclen\) into the formula yields
   1482 \begin{equation*}
   1483     \phi(x,y,z) = \InverseFourierY{
   1484         \left(
   1485             C_1 e^{2\pi \Kveclen z}
   1486             -C_2 e^{-2\pi \Kveclen z}
   1487         \right)
   1488         E(u,v)
   1489     }{x,y}.
   1490 \end{equation*}
   1491 Plugging \(\phi\) into the boundary condition on the sea bottom (analogous to
   1492 two-dimensional case) yields
   1493 \begin{equation}
   1494     \label{eq-guessed-sol-3d}
   1495     \phi(x,y,z) = \InverseFourierY{
   1496         \Sinh{2\pi \Kveclen (z+h)} E(u,v)
   1497     }{x,y}.
   1498 \end{equation}
   1499 Plugging \(\phi\) into the boundary condition on the free surface yields
   1500 \begin{equation*}
   1501     \arraycolsep=1.4pt
   1502     \begin{array}{rl}
   1503         \zeta_t = & i f_1(x,y) \InverseFourierY{2 \pi u \Sinh{2\pi \Kveclen (z+h)}E(u,v)}{x,y} \\
   1504         + & i f_2(x,y) \InverseFourierY{2 \pi v \Sinh{2\pi \Kveclen (z+h)}E(u,v)}{x,y} \\
        - & f_3(x,y) \InverseFourierY{2 \pi \Kveclen \SinhX{2\pi \Kveclen (z+h)}E(u,v)}{x,y},
    \end{array}
\end{equation*}
   1508 where \(f_1(x,y)={\zeta_x}/{\SqrtZetaY}-\zeta_x\),
   1509 \(f_2(x,y)={\zeta_y}/{\SqrtZetaY}-\zeta_y\) and \(f_3(x,y)=1/\SqrtZetaY\).
   1510 
As in section\nbsp{}[[#sec-pressure-2d]], we assume that
\(\Sinh{2\pi{u}(z+h)}\approx\SinhX{2\pi{u}(z+h)}\) near the free surface, but in
the three-dimensional case this is not enough to solve the problem. In order to
get an explicit formula for the coefficients \(E\) we need to assume that all
Fourier transforms in the equation have radially symmetric kernels,
i.e.\nbsp{}to replace \(u\) and \(v\) with \(\Kveclen\). Two points support this
assumption. First, in the numerical implementation integration is done over
positive wave numbers, so the sign of \(u\) and \(v\) does not affect the
solution. Second, the growth rate of the \(\cosh\) term of the integral kernel
is much higher than that of \(u\) or \(\Kveclen\), so the substitution has a
small effect on the magnitude of the solution. Despite these two points, a more
mathematically rigorous approach would be preferable.
   1523 
   1524 Making the replacement, applying Fourier transform to both sides of the equation
   1525 and plugging the result into\nbsp{}eqref:eq-guessed-sol-3d yields formula for
   1526 \(\phi\):
   1527 \begin{equation}
   1528     \label{eq-phi-3d}
   1529     \phi(x,y,z,t) = \InverseFourierY{
   1530         \frac{ \Sinh{\smash{2\pi \Kveclen (z+h)}} }{ 2\pi\Kveclen }
   1531         \frac{ \FourierY{ \zeta_t / \left( i f_1(x,y) + i f_2(x,y) - f_3(x,y) \right)}{u,v} }
   1532         { \FourierY{\mathcal{D}_3\left( x,y,\zeta\left(x,y\right) \right)}{u,v} }
   1533     }{x,y},
   1534 \end{equation}
   1535 where
   1536 \(\FourierY{\mathcal{D}_3\left(x,y,z\right)}{u,v}=\Sinh{\smash{2\pi\Kveclen{}z}}\).
   1537 
   1538 *** Velocity potential normalisation formulae
   1539 :PROPERTIES:
   1540 :CUSTOM_ID: sec-compute-delta
   1541 :END:
   1542 
In solutions\nbsp{}eqref:eq-solution-2d and\nbsp{}eqref:eq-solution-2d-full to
the two-dimensional pressure determination problem there are functions
\(\Fun{z}=\InverseFourierY{e^{2\pi{u}{z}}}{x}\) and
\(\FunSecond{z}=\InverseFourierY{\Sinh{2\pi{u}{z}}}{x}\) which have multiple
analytic representations and are difficult to compute. Each function is a
Fourier transform of a linear combination of exponents, which reduces to the
poorly defined Dirac delta function of a complex argument (see
table\nbsp{}[[tab-delta-functions]]). The usual way of handling this type of
function is to write it as a product of Dirac delta functions of the real and
imaginary parts; however, this approach does not work here, because applying
the inverse Fourier transform to this representation does not produce an
exponent, which severely warps the resulting velocity field. In order to get a
unique analytic definition, the normalisation factor \(1/\Sinh{2\pi{u}{h}}\)
(which is also included in the formula for \(E(u)\)) may be used. Although
normalisation makes it possible to obtain an adequate velocity potential field,
numerical experiments show that there is little difference between this field
and the one produced by formulae from linear wave theory, in which terms with
\(\zeta\) are omitted. As a result, the formula for the three-dimensional case
was not derived.
   1561 
   1562 #+name: tab-delta-functions
   1563 #+caption[Formulae for computing \(\Fun{z}\) and \(\FunSecond{z}\)]:
   1564 #+caption: Formulae for computing \(\Fun{z}\) and \(\FunSecond{z}\) from
   1565 #+caption: sec.\nbsp{}[[#sec-pressure-2d]], that use normalisation to eliminate
   1566 #+caption: uncertainty from definition of Dirac delta function of complex argument.
   1567 #+attr_latex: :booktabs t
   1568 | Function          | Without normalisation                                        | Normalised                                                                                                                             |
   1569 |-------------------+--------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------|
   1570 | \(\Fun{z}\)       | \(\delta (x+i z)\)                                           | \(\frac{1}{2 h}\mathrm{sech}\left(\frac{\pi  (x-i (h+z))}{2 h}\right)\)                                                                |
   1571 | \(\FunSecond{z}\) | \(\frac{1}{2}\left[\delta (x-i z) + \delta (x+i z) \right]\) | \(\frac{1}{4 h}\left[\text{sech}\left(\frac{\pi  (x-i (h+z))}{2 h}\right)+\text{sech}\left(\frac{\pi  (x+i(h+z))}{2 h}\right)\right]\) |
   1572 
   1573 ** Verification of velocity potential fields
   1574 :PROPERTIES:
   1575 :CUSTOM_ID: sec-compare-formulae
   1576 :END:
   1577 
Comparing the obtained generic formulae\nbsp{}eqref:eq-solution-2d
and\nbsp{}eqref:eq-solution-2d-full to the known formulae from linear wave
theory makes it possible to see the difference between velocity fields for both
large and small amplitude waves. In the general case an analytic formula for
the velocity potential is not known, even for plain waves, so the comparison is
done numerically. Taking into account the conclusions of
section\nbsp{}[[#sec-pressure-2d]], only finite depth formulae are compared.
   1585 
   1586 **** The difference with linear wave theory formulae.
   1587 In order to obtain velocity potential fields, plain waves of varying amplitude
   1588 were generated. Wave numbers in Fourier transforms were chosen on the interval
   1589 from \(0\) to the maximal wave number determined numerically from the resulting
   1590 wavy surface. Experiments were conducted for waves of both small and large
   1591 amplitudes.
   1592 
The experiment showed that the velocity potential fields for large-amplitude
waves produced by formula\nbsp{}eqref:eq-solution-2d-full for finite depth
fluid and by formula\nbsp{}eqref:eq-solution-2d-linear from linear wave theory
are qualitatively different (fig.\nbsp{}[[fig-potential-field-nonlinear]]).
First, the velocity potential contours have a sinusoidal shape, which differs
from the oval shape described by linear wave theory. Second, the velocity
potential decays more rapidly than in linear wave theory when getting closer to
the bottom, and the region where the majority of wave energy is concentrated is
closer to the wave crest. A similar numerical experiment, in which all terms
of\nbsp{}eqref:eq-solution-2d-full that are neglected in the framework of
linear wave theory are eliminated, shows no difference (up to machine
precision) in the resulting velocity potential fields.
   1605 
   1606 #+name: fig-potential-field-nonlinear
   1607 #+header: :width 8 :height 11
   1608 #+begin_src R :file build/plain-wave-velocity-field-comparison.pdf
   1609 source(file.path("R", "velocity-potentials.R"))
   1610 par(pty="s", family="serif")
   1611 nlevels <- 41
   1612 levels <- pretty(c(-200,200), nlevels)
   1613 palette <- colorRampPalette(c("blue", "lightyellow", "red"))
   1614 col <- palette(nlevels-1)
   1615 
   1616 # linear solver
   1617 par(fig=c(0,0.95,0,0.5),new=TRUE)
   1618 arma.plot_velocity_potential_field(
   1619   file.path("build", "arma-benchmarks", "verification-orig", "plain_wave_linear_solver"),
   1620   levels=levels,
   1621   col=col
   1622 )
   1623 
   1624 # high-amplitude solver
   1625 par(fig=c(0,0.95,0.5,1),new=TRUE)
   1626 arma.plot_velocity_potential_field(
   1627   file.path("build", "arma-benchmarks", "verification-orig", "plain_wave_high_amplitude_solver"),
   1628   levels=levels,
   1629   col=col
   1630 )
   1631 
   1632 # legend 1
   1633 par(pty="m",fig=c(0.80,1,0.5,1), new=TRUE)
   1634 arma.plot_velocity_potential_field_legend(
   1635   levels=levels,
   1636   col=col
   1637 )
   1638 
   1639 # legend 2
   1640 par(pty="m",fig=c(0.80,1,0,0.5), new=TRUE)
   1641 arma.plot_velocity_potential_field_legend(
   1642   levels=levels,
   1643   col=col
   1644 )
   1645 #+end_src
   1646 
   1647 #+caption[Velocity potential fields for propagating wave]:
   1648 #+caption: Comparison of velocity potential fields for propagating wave
   1649 #+caption: \(\zeta(x,y,t) = \cos(2\pi x - t/2)\). Field produced by generic
   1650 #+caption: formula (top) and linear wave theory formula (bottom).
   1651 #+name: fig-potential-field-nonlinear
   1652 #+attr_latex: :width \textwidth
   1653 #+RESULTS: fig-potential-field-nonlinear
   1654 [[file:build/plain-wave-velocity-field-comparison.pdf]]
   1655 
   1656 **** The difference with small-amplitude wave theory.
The experiment, in which velocity fields produced numerically by different
formulae were compared, shows that the velocity fields produced by
formulae\nbsp{}eqref:eq-solution-2d-full and\nbsp{}eqref:eq-old-sol-2d
correspond to each other for small-amplitude waves. Two sea wavy surface
realisations were generated by the AR model: one containing small-amplitude
waves, the other containing large-amplitude waves. Integration in
formula\nbsp{}eqref:eq-solution-2d-full was done over the wave number range
extracted from the generated wavy surface. For small-amplitude waves both
formulae showed comparable results (the difference in the velocity is
attributed to the stochastic nature of the AR model), whereas for
large-amplitude waves a stable velocity field was produced only by
formula\nbsp{}eqref:eq-solution-2d-full (fig.\nbsp{}[[fig-velocity-field-2d]]).
So, the generic formula\nbsp{}eqref:eq-solution-2d-full gives satisfactory
results without restrictions on wave amplitude.
   1670 
   1671 #+name: fig-velocity-field-2d
   1672 #+header: :width 8 :height 11
   1673 #+begin_src R :file build/large-and-small-amplitude-velocity-field-comparison.pdf
   1674 source(file.path("R", "velocity-potentials.R"))
   1675 linetypes = c("solid", "dashed")
   1676 par(mfrow=c(2, 1), family="serif")
   1677 arma.plot_velocity(
   1678   file.path("data", "velocity", "low-amp"),
   1679   file.path("data", "velocity", "low-amp-0"),
   1680   linetypes=linetypes,
   1681   ylim=c(-2,2)
   1682 )
   1683 arma.plot_velocity(
   1684   file.path("data", "velocity", "high-amp"),
   1685   file.path("data", "velocity", "high-amp-0"),
   1686   linetypes=linetypes,
   1687   ylim=c(-2,2)
   1688 )
   1689 #+end_src
   1690 
   1691 #+name: fig-velocity-field-2d
   1692 #+caption[Velocity potential fields for small and large amplitude waves]:
   1693 #+caption: Comparison of velocity potential fields obtained by generic formula
   1694 #+caption: (\(u_1\)) and formula for small-amplitude waves (\(u_2\)): velocity
   1695 #+caption: field for realisations containing small-amplitude (top) and
   1696 #+caption: large-amplitude (bottom) waves.
   1697 #+attr_latex: :width \textwidth
   1698 #+RESULTS: fig-velocity-field-2d
   1699 [[file:build/large-and-small-amplitude-velocity-field-comparison.pdf]]
   1700 
   1701 ** Summary
The formulae derived in this section make it possible to numerically compute
the velocity potential field (and hence the pressure field) near a discretely
or mathematically given wavy sea surface, bypassing the assumptions of linear
wave theory and the theory of small-amplitude waves. For small-amplitude waves
the new formulae produce the same velocity potential field as the formulae from
linear wave theory. For large-amplitude waves the usage of the new formulae
results in a shift of the region where the majority of wave energy is
concentrated closer to the wave crest.
   1709 
   1710 * High-performance software implementation of sea wave simulation
   1711 :PROPERTIES:
   1712 :CUSTOM_ID: sec-hpc
   1713 :END:
   1714 
   1715 ** SMP implementation
   1716 *** Wavy surface generation
   1717 **** Parallel AR, MA and LH model algorithms.
   1718 :PROPERTIES:
   1719 :CUSTOM_ID: sec-parallel
   1720 :END:
   1721 
Although the AR and MA models are parts of a single mixed model, they have
disparate parallel algorithms, which differ from the trivial one of the LH
model. The AR algorithm consists in partitioning the wavy surface into
equally-sized parts in each dimension and computing them in parallel, taking
into account the causal constraints imposed by autoregressive dependencies
between surface points. There are no such dependencies in the MA model: its
formula represents a convolution of white noise with model coefficients, which
reduces to the computation of three Fourier transforms via the convolution
theorem. So, the MA algorithm consists in parallel computation of the
convolution, which is based on FFT computation. Finally, the LH algorithm is
made parallel by simply computing each wavy surface point in parallel in
several threads. So, the parallel implementation of the ARMA model consists of
two parallel algorithms, one for each part of the model, both more
sophisticated than the one for the LH model.
   1735 
The main feature of the AR model's formula is autoregressive dependencies
between wavy surface points in each dimension, which prevent computing every
surface point in parallel. Instead, the surface is partitioned along each
dimension into equal three-dimensional parts, and for each part information
dependencies, which define the computation order, are established.
Figure\nbsp{}[[fig-ar-cubes]] shows these dependencies. An arrow denotes the
dependency of one part on the availability of another, i.e.\nbsp{}computation
of a part may start only when all parts on which it depends have been computed.
Here part \(A\) does not have dependencies, parts \(B\) and \(D\) depend only
on \(A\), and part \(E\) depends on \(A\), \(B\) and \(D\). In general, each
part depends on all parts that have a previous index in at least one dimension
(if such parts exist). The first part does not have any dependencies; and the
size of each part along each dimension is made greater than or equal to the
corresponding number of coefficients along that dimension, so that only
adjacent parts need to be considered in dependency resolution.
   1750 
   1751 #+name: fig-ar-cubes
   1752 #+header: :width 5 :height 5
   1753 #+begin_src R :file build/ar-cubes.pdf
   1754 source(file.path("R", "common.R"))
   1755 par(family="serif")
   1756 arma.plot_ar_cubes_2d(3, 3, xlabel="Part index (X)", ylabel="Part index (Y)")
   1757 #+end_src
   1758 
   1759 #+name: fig-ar-cubes
   1760 #+caption[Autoregressive dependencies between surface parts]:
   1761 #+caption: Autoregressive dependencies between wavy surface parts.
   1762 #+RESULTS: fig-ar-cubes
   1763 [[file:build/ar-cubes.pdf]]
   1764 
Each part has a three-dimensional index and a completion status. The algorithm
starts by submitting objects containing this information into a queue. After
that parallel threads start, each thread finds the first part for which all
dependencies are satisfied (by checking the completion status of each part),
removes the part from the queue, generates the wavy surface for this part and
sets its completion status. The algorithm ends when the queue becomes empty.
Access to the queue from different threads is synchronised by locks. The
algorithm is suitable for SMP machines; for MPP systems all parts on which the
current one depends need to be copied to the node where the computation is
carried out.
   1774 
So, the parallel AR model algorithm reduces to implementing a minimalistic job
scheduler, in which
- each job corresponds to a wavy surface part,
- the order of execution of jobs is defined by autoregressive dependencies, and
- the job queue is processed by a simple thread pool in which each thread in a
  loop removes from the queue the first job whose dependencies have all been
  computed, and executes it.
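Such a scheduler may be sketched as follows (a minimal illustration, not the
thesis code; the flat part indexing and the names are made up here), assuming
each part depends on every adjacent part with a previous index in at least one
dimension:

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Sketch of the minimalistic job scheduler described above. A part is
// identified by a flat index into an nx*ny*nz grid. Worker threads repeatedly
// take the first ready part from a lock-protected queue, generate it, and
// mark it completed.
struct PartScheduler {
    int nx, ny, nz;
    std::vector<bool> done;    // completion status of each part
    std::vector<int> pending;  // parts not yet generated
    std::mutex mtx;

    PartScheduler(int nx_, int ny_, int nz_)
        : nx(nx_), ny(ny_), nz(nz_), done(nx_*ny_*nz_, false) {
        for (int p = 0; p < nx*ny*nz; ++p) pending.push_back(p);
    }

    // True if all parts on which part p depends have been generated.
    bool ready(int p) const {
        int i = p/(ny*nz), j = (p/nz)%ny, k = p%nz;
        for (int di = -1; di <= 0; ++di)
        for (int dj = -1; dj <= 0; ++dj)
        for (int dk = -1; dk <= 0; ++dk) {
            if (di == 0 && dj == 0 && dk == 0) continue;
            int a = i+di, b = j+dj, c = k+dk;
            if (a >= 0 && b >= 0 && c >= 0 && !done[(a*ny + b)*nz + c])
                return false;
        }
        return true;
    }

    template <class Generate>
    void run(int nthreads, Generate generate) {
        auto worker = [&]() {
            for (;;) {
                int part = -1;
                {
                    std::lock_guard<std::mutex> lock(mtx);
                    if (pending.empty()) return;
                    for (int q = 0; q < (int)pending.size(); ++q)
                        if (ready(pending[q])) {
                            part = pending[q];
                            pending.erase(pending.begin() + q);
                            break;
                        }
                }
                if (part == -1) { std::this_thread::yield(); continue; }
                generate(part);  // compute the wavy surface part
                std::lock_guard<std::mutex> lock(mtx);
                done[part] = true;
            }
        };
        std::vector<std::thread> threads;
        for (int t = 0; t < nthreads; ++t) threads.emplace_back(worker);
        for (auto& t : threads) t.join();
    }
};
```

In terms of figure\nbsp{}[[fig-ar-cubes]], flat index 0 corresponds to part
\(A\), the only part that is ready initially.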
   1782 
In contrast to the AR model, the MA model does not have autoregressive
dependencies between points; instead, each surface point depends on white noise
values at previous points in time and space. The MA model's formula can be
rewritten as a convolution of white noise with the coefficients as the kernel.
Using the convolution theorem, the convolution is rewritten as the inverse
Fourier transform of the product of the Fourier transforms of white noise and
coefficients. Since the number of MA coefficients is much smaller than the
number of wavy surface points, a parallel FFT implementation is not suitable
here, as it requires padding the coefficients with noughts to match the size of
the surface. Instead, the surface is divided into parts along each dimension,
which are padded with noughts to twice the number of the coefficients along
each dimension. Then the Fourier transform of each part is computed in
parallel, multiplied by the previously computed Fourier transform of the
coefficients, and the inverse Fourier transform of the result is computed.
After that, each part is written to the output array with overlapping (due to
padding) points added to each other. This algorithm is commonly known in signal
processing as
"overlap-add"\nbsp{}cite:oppenheim1989discrete,svoboda2011efficient,pavel2013algorithms.
Padding with noughts is needed to prevent aliasing errors: without it the
result would be a circular convolution.
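A one-dimensional sketch of overlap-add follows (illustrative, not the thesis
code; in the actual implementation the per-block convolution is performed via
FFT using the convolution theorem, whereas here a direct convolution stands in
for it):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Reference full convolution of signal x with kernel h.
std::vector<double> convolve(const std::vector<double>& x,
                             const std::vector<double>& h) {
    std::vector<double> y(x.size() + h.size() - 1, 0.0);
    for (std::size_t i = 0; i < x.size(); ++i)
        for (std::size_t j = 0; j < h.size(); ++j)
            y[i+j] += x[i]*h[j];
    return y;
}

// Overlap-add: each block of x is convolved with h separately; zero-padding
// of a block to length block+|h|-1 produces a tail that overlaps the next
// block's output and is added to it.
std::vector<double> overlap_add(const std::vector<double>& x,
                                const std::vector<double>& h,
                                std::size_t block) {
    std::vector<double> y(x.size() + h.size() - 1, 0.0);
    for (std::size_t start = 0; start < x.size(); start += block) {
        std::size_t len = std::min(block, x.size() - start);
        std::vector<double> part(x.begin() + start, x.begin() + start + len);
        std::vector<double> yp = convolve(part, h);  // per-block convolution
        for (std::size_t i = 0; i < yp.size(); ++i)
            y[start + i] += yp[i];  // add overlapping tails of adjacent blocks
    }
    return y;
}
```

The per-block results summed over the overlaps coincide with the full (linear,
not circular) convolution, which is exactly why the padding is needed.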
   1802 
Despite the fact that the MA model algorithm partitions the surface into the
same parts (but possibly of different sizes) as the AR model algorithm, the
absence of autoregressive dependencies between them allows computing the parts
in parallel without a specialised job scheduler. However, this algorithm
requires padding the parts with noughts to make the result of the calculations
correspond to the result obtained with the original MA model's formula. So, the
MA model's algorithm is more scalable to a large number of nodes as it has no
information dependencies between parts, but the size of the parts is greater
than in the AR model's algorithm.
   1812 
The distinct feature of the LH model's algorithm is its simplicity: to make it
parallel, the wavy surface is partitioned into parts of equal size and each
part is generated in parallel. There are no information dependencies between
parts, which makes this algorithm particularly suitable for computation on a
GPU: each hardware thread simply computes its own point. In addition, the sine
and cosine functions in the model's formula are slow to compute on a CPU, which
makes a GPU an even more favourable choice.
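The trivial parallelism of the LH algorithm can be sketched as follows (the
amplitudes, wave numbers and phases here are illustrative placeholders, not
fitted model coefficients; on a GPU every hardware thread would compute a
single point instead of a range):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <thread>
#include <vector>

// Every surface point is an independent sum of cosines, so the range of
// points is simply split into equal chunks between threads.
std::vector<double> lh_surface(int npoints, int nharmonics, int nthreads) {
    std::vector<double> zeta(npoints, 0.0);
    auto worker = [&](int begin, int end) {
        for (int i = begin; i < end; ++i) {
            double x = 0.1*i, sum = 0.0;
            for (int n = 1; n <= nharmonics; ++n)
                sum += std::cos(0.3*n*x + 0.5*n) / n; // c_n cos(k_n x + eps_n)
            zeta[i] = sum;
        }
    };
    std::vector<std::thread> threads;
    int chunk = (npoints + nthreads - 1) / nthreads;
    for (int t = 0; t < nthreads; ++t)
        threads.emplace_back(worker, t*chunk,
                             std::min(npoints, (t + 1)*chunk));
    for (auto& th : threads) th.join();
    return zeta;
}
```

Since the chunks are disjoint, no synchronisation between threads is needed.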
   1820 
To summarise, even though AR and MA models are parts of a single mixed model,
their parallel algorithms are fundamentally different and are more complicated
than the trivial parallel algorithm of the LH model. An efficient
implementation of the AR algorithm requires a specialised job scheduler to
manage autoregressive dependencies between wavy surface parts, whereas the MA
algorithm requires padding each part with noughts to be able to compute the
parts in parallel. In contrast to these models, the LH model has no information
dependencies between parts, but requires more computational resources (floating
point operations per second) for the transcendental functions in its formula.
   1830 
   1831 **** Parallel white noise generation algorithm.
In order to eliminate periodicity from the wavy surface generated by a sea wave
model, it is imperative to use a PRNG with a sufficiently large period to
generate white noise. The parallel Mersenne
Twister\nbsp{}cite:matsumoto1998mersenne with a period of \(2^{19937}-1\) is
used as the generator in this work. It allows producing aperiodic sea wavy
surface realisations in any practical usage scenario.
   1837 
There is no guarantee that multiple PRNGs executed in parallel threads with
distinct initial states produce uncorrelated pseudo-random number sequences, so
the algorithm of dynamic creation of Mersenne
Twisters\nbsp{}cite:matsumoto1998dynamic is used to provide such a guarantee.
The essence of the algorithm is to find matrices of initial generator states
that give maximally uncorrelated pseudo-random number sequences when Mersenne
Twisters are executed in parallel with these initial states. Since finding such
initial states consumes a considerable amount of processor time, a vector of
initial states is created beforehand for a deliberately larger number of
parallel threads and saved to a file, which is then read before starting white
noise generation.
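The per-thread generator scheme can be sketched as follows
(=std::mt19937= instances with precomputed distinct seeds stand in for the
dynamically created Mersenne Twisters and their saved vector of initial
states; this stand-in does not provide DCMT's correlation guarantees):

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <thread>
#include <vector>

// Each thread owns its generator and fills its own slice of the white noise
// array, so no synchronisation between threads is needed.
std::vector<double> white_noise(int n, int nthreads,
                                const std::vector<unsigned>& seeds) {
    std::vector<double> noise(n);
    auto worker = [&](int t) {
        std::mt19937 rng(seeds[t]);  // thread-local generator
        std::normal_distribution<double> gauss(0.0, 1.0);
        int chunk = (n + nthreads - 1) / nthreads;
        int begin = t*chunk, end = std::min(n, begin + chunk);
        for (int i = begin; i < end; ++i) noise[i] = gauss(rng);
    };
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t) threads.emplace_back(worker, t);
    for (auto& th : threads) th.join();
    return noise;
}
```

With a fixed seed vector the output is reproducible regardless of thread
scheduling, which also simplifies validation of the generated realisations.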
   1848 
   1849 **** Ramp-up interval elimination.
In the AR model the value of wavy surface elevation at a particular point
depends on previous (in space and time) points; as a result, the so called
/ramp-up interval/ (see fig.\nbsp{}[[fig-ramp-up-interval]]), in which the
realisation does not correspond to the specified ACF, forms at the beginning of
the realisation. There are several solutions to this problem, which depend on
the simulation context. If the realisation is used in the context of ship
stability simulation without manoeuvring, the ramp-up interval will not affect
the results of the simulation, because it is located on the border (too far
away from the studied marine object). An alternative approach is to generate
the sea wavy surface on the ramp-up interval with the LH model and generate the
rest of the realisation with the AR model. If ship stability with manoeuvring
is studied, then the interval may simply be discarded from the realisation (the
size of the interval approximately equals the number of AR coefficients).
However, this may lead to a loss of a very large number of points, because
discarding occurs in each of the three dimensions.
   1864 
   1865 #+name: fig-ramp-up-interval
   1866 #+begin_src R :file build/ramp-up-interval.pdf
   1867 source(file.path("R", "common.R"))
   1868 par(family="serif")
   1869 arma.plot_ramp_up_interval()
   1870 #+end_src
   1871 
   1872 #+caption[Ramp-up interval at the beginning of AR process realisation]:
   1873 #+caption: Ramp-up interval at the beginning of AR process realisation.
   1874 #+name: fig-ramp-up-interval
   1875 #+RESULTS: fig-ramp-up-interval
   1876 [[file:build/ramp-up-interval.pdf]]
   1877 
   1878 **** Performance of OpenMP and OpenCL implementations.
   1879 :PROPERTIES:
   1880 :header-args:R: :results output raw :exports results
   1881 :END:
   1882 
Differences in the models' parallel algorithms make them efficient on different
processor architectures, and to find the most efficient one, all the models
were benchmarked on both CPU and GPU.
   1886 
AR and MA models do not require highly optimised code to be efficient: their
performance is high even without the use of co-processors, for two main
reasons. First, these models do not use transcendental functions (sines,
cosines and exponents), as opposed to the LH model. All calculations (except
those for model coefficients) are done via polynomials, which can be computed
efficiently on modern processors using FMA instructions. Second, pressure
computation is done via an explicit formula using a number of nested FFTs.
Since a two-dimensional FFT of the same size is repeatedly applied to every
time slice, its coefficients (complex exponents) are pre-computed once for all
slices, and further computations involve only a few transcendental functions.
In the case of the MA model, performance is also increased by doing the
convolution via FFT. So, the high performance of AR and MA models is due to the
scarce use of transcendental functions and the heavy use of FFT, not to mention
that the high convergence rate and absence of periodicity allow using far fewer
coefficients compared to the LH model.
   1902 
   1903 #+name: tab-gpulab
   1904 #+caption["Gpulab" system configuration]:
   1905 #+caption: "Gpulab" system configuration.
   1906 #+attr_latex: :booktabs t
   1907 | CPU                     | AMD FX-8370           |
| RAM                     | 16GB                  |
   1909 | GPU                     | GeForce GTX 1060      |
   1910 | GPU memory              | 6GB                   |
   1911 | HDD                     | WDC WD40EZRZ, 5400rpm |
   1912 | No. of CPU cores        | 4                     |
   1913 | No. of threads per core | 2                     |
   1914 
The software implementation uses several libraries of mathematical functions,
numerical algorithms and visualisation primitives (listed in
table\nbsp{}[[tab-arma-libs]]), and was implemented using several parallel
programming technologies (OpenMP, OpenCL) to make it possible to choose the
most efficient one. For each technology and each model an optimised wavy
surface generation subroutine was implemented (except for the MA model, for
which only an OpenMP implementation was done). Velocity potential computation
was done in OpenMP and was implemented in OpenCL only for real-time
visualisation of the wavy surface. For each technology the programme was
recompiled and run multiple times, and the performance of each top-level
subroutine was measured using the system clock.
   1925 
The results of the benchmarks are summarised in
table\nbsp{}[[tab-arma-performance]]. All benchmarks were run on a machine
equipped with a GPU, the characteristics of which are summarised in
table\nbsp{}[[tab-gpulab]]. All benchmarks were run with the same input
parameters for all the models: realisation length 10000s and output grid size
\(40\times40\)m. The only parameter that differed is the order (the number of
coefficients): the order of the AR and MA models was \(7,7,7\) and the order of
the LH model was \(40,40\). The LH model requires a higher number of
coefficients to eliminate periodicity.
   1934 
In all benchmarks wavy surface generation and NIT take up most of the running
time, whereas velocity potential calculation together with the other
subroutines takes only a small fraction of it.
   1938 
   1939 #+name: tab-arma-libs
   1940 #+caption[A list of libraries used in software implementation]:
   1941 #+caption: A list of libraries used in software implementation.
   1942 #+attr_latex: :booktabs t
   1943 | Library                                                      | What it is used for             |
   1944 |--------------------------------------------------------------+---------------------------------|
   1945 | DCMT\nbsp{}cite:matsumoto1998dynamic                         | parallel PRNG                   |
   1946 | Blitz\nbsp{}cite:veldhuizen1997will,veldhuizen2000techniques | multidimensional arrays         |
   1947 | GSL\nbsp{}cite:gsl2008scientific                             | PDF, CDF, FFT computation       |
   1948 |                                                              | checking process stationarity   |
   1949 | clFFT\nbsp{}cite:clfft                                       | FFT computation                 |
   1950 | LAPACK, GotoBLAS\nbsp{}cite:goto2008high,goto2008anatomy     | finding AR coefficients         |
   1951 | GL, GLUT\nbsp{}cite:kilgard1996opengl                        | three-dimensional visualisation |
   1952 | CGAL\nbsp{}cite:fabri2009cgal                                | wave numbers interpolation      |
   1953 
The AR model exhibits the highest performance in the OpenMP and the lowest
performance in the OpenCL implementation, which are also the best and the worst
performance across all model and framework combinations. In the most optimal
model and framework combination AR performance is 4.5 times higher than MA
performance, and 20 times higher than LH performance; in the most suboptimal
combination it is 137 times slower than MA and two times slower than LH. The
ratio between the best (OpenMP) and the worst (OpenCL) AR model performance is
several hundred. This is explained by the fact that the model
formula\nbsp{}eqref:eq-ar-process maps efficiently onto the CPU architecture,
which is distinguished from the GPU architecture by multiple cache levels,
lower-bandwidth memory and a smaller number of floating point units.
- This formula does not contain transcendental mathematical functions (sines,
  cosines and exponents),
- all of the multiplications and additions in the formula can be implemented
  using FMA processor instructions, and
- efficient use (locality) of caches is achieved by using the Blitz library,
  which implements optimised traversals of multidimensional arrays based on
  the Hilbert space-filling curve.
In contrast to the CPU, the GPU has fewer caches, high-bandwidth memory and a
large number of floating point units, which is the worst case scenario for the
AR model:
- there are no transcendental functions which could compensate for the high
  memory latency,
- there are FMA instructions on the GPU but they do not improve performance due
  to the high latency, and
- optimal traversal of multidimensional arrays was not used due to a lack of
  libraries implementing it for a GPU.
Finally, the GPU architecture does not provide synchronisation primitives that
would allow implementing autoregressive dependencies between distinct wavy
surface parts; instead, a separate OpenCL kernel is launched for each part, and
information dependency management between them is done on the CPU side. So, in
the AR model case the CPU architecture is superior to the GPU due to better
handling of complex information dependencies, simple calculations
(multiplications and additions) and complex memory access patterns.
   1988 
   1989 #+header: :results output raw :exports results
   1990 #+name: tab-arma-performance
   1991 #+begin_src R
   1992 source(file.path("R", "benchmarks.R"))
   1993 model_names <- list(
   1994 	ar.x="AR",
   1995 	ma.x="MA",
   1996 	lh.x="LH",
   1997 	ar.y="AR",
   1998 	ma.y="MA",
   1999 	lh.y="LH",
   2000   Row.names="\\orgcmidrule{2-4}{5-6}Subroutine"
   2001 )
   2002 row_names <- list(
   2003   determine_coefficients="Determine coefficients",
   2004   validate="Validate model",
   2005   generate_surface="Generate wavy surface",
   2006   nit="NIT",
   2007   write_all="Write output to files",
   2008   copy_to_host="Copy data from GPU",
   2009   velocity="Compute velocity potentials"
   2010 )
   2011 arma.print_openmp_vs_opencl(model_names, row_names)
   2012 #+end_src
   2013 
   2014 #+name: tab-arma-performance
   2015 #+caption[Performance of OpenMP and OpenCL implementations]:
   2016 #+caption: Running time (s.) for OpenMP and OpenCL implementations
   2017 #+caption: of AR, MA and LH models.
   2018 #+attr_latex: :booktabs t
   2019 #+RESULTS: tab-arma-performance
   2020 |                                  |      |      | OpenMP |        | OpenCL |
   2021 |                                  |  <r> |  <r> |    <r> |    <r> |    <r> |
   2022 | \orgcmidrule{2-4}{5-6}Subroutine |   AR |   MA |     LH |     AR |     LH |
   2023 |----------------------------------+------+------+--------+--------+--------|
   2024 | Determine coefficients           | 0.02 | 0.01 |   0.19 |   0.01 |   1.19 |
   2025 | Validate model                   | 0.08 | 0.10 |        |   0.08 |        |
   2026 | Generate wavy surface            | 1.26 | 5.57 | 350.98 | 769.38 |   0.02 |
   2027 | NIT                              | 7.11 | 7.43 |        |   0.02 |        |
   2028 | Copy data from GPU               |      |      |        |   5.22 |  25.06 |
   2029 | Compute velocity potentials      | 0.05 | 0.05 |   0.06 |   0.03 |   0.03 |
   2030 | Write output to files            | 0.27 | 0.27 |   0.27 |   0.28 |   0.27 |
   2031 
In contrast to the AR model, the LH model exhibits the best performance on the
GPU and the worst performance on the CPU. The reasons for that are
- the large number of transcendental functions in its formula, which offset the
  high memory latency,
- the linear memory access pattern, which allows vectorising calculations and
  coalescing memory accesses by different hardware threads, and
- the absence of information dependencies between wavy surface points.
Despite the fact that the GPU of the test system is faster than its CPU (in
terms of floating point operations per second), the overall performance of the
LH model compared to the AR model is lower. The reason is the slow data
transfer between GPU and CPU memory.
   2043 
The MA model is faster than the LH model and slower than the AR model. As the
convolution in its formula is implemented using FFT, its performance depends on
the performance of the FFT implementation: GSL for the CPU and clFFT for the
GPU. In this work the performance of the MA model on the GPU was not tested due
to the unavailability of the three-dimensional FFT in the clFFT library; if the
transform were available, it could make the model even faster than AR.
   2050 
NIT takes less time on the GPU and more time on the CPU, but taking the data
transfer between them into consideration makes their execution times
comparable. This is explained by the large number of transcendental functions
that need to be computed for each wavy surface point to transform the
distribution of its \(z\)-coordinates. For each point a non-linear
transcendental equation\nbsp{}eqref:eq-distribution-transformation is solved
using the bisection method. The GPU performs this task several hundred times
faster than the CPU, but spends a lot of time transferring the result back to
the processor memory. So, the only possibility to optimise this routine is to
use a root finding method with a quadratic convergence rate to reduce the
number of transcendental functions that need to be computed.
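The bisection solve can be sketched as follows (the routine is generic; the
equation used in the usage note below, \(\operatorname{erf}(y)=1/2\), is an
illustrative stand-in for equation\nbsp{}eqref:eq-distribution-transformation,
not the actual equation of the model):

```cpp
#include <cassert>
#include <cmath>

// Bisection root finder: halves the bracketing interval [a,b] until it is
// shorter than eps. Each iteration evaluates f, so a method with a higher
// convergence rate (e.g. Newton's method) would need fewer evaluations of
// the transcendental functions inside f.
template <class F>
double bisect(F f, double a, double b, double eps) {
    // assumes f(a) and f(b) have opposite signs
    while (b - a > eps) {
        double c = 0.5*(a + b);
        if (f(a)*f(c) <= 0.0) b = c; else a = c;
    }
    return 0.5*(a + b);
}
```

For example, =bisect([](double y){ return std::erf(y) - 0.5; }, 0.0, 2.0,
1e-10)= converges to the point where \(\operatorname{erf}(y)=1/2\).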
   2062 
   2063 **** I/O performance.
   2064 :PROPERTIES:
   2065 :header-args:R: :results output raw :exports results
   2066 :CUSTOM_ID: sec-io-perf
   2067 :END:
   2068 
Although in the benchmarks from the previous section writing data to files does
not consume much of the running time, the use of network-mounted file systems
may slow down this stage of the programme. To optimise it, wavy surface parts
are written to a file as soon as generation of a whole time slice is completed
(fig.\nbsp{}[[fig-arma-io-events]]): a separate thread starts writing to files
as soon as the first time slice is available and finishes after the main thread
group completes the computation. The total time spent on I/O increases, but the
total programme running time decreases, because the I/O is done in parallel
with the computation (table\nbsp{}[[tab-arma-io-performance]]). Using this
approach with a local file system has the same effect, but the performance gain
is lower.
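This overlapped output scheme can be sketched as a single-producer,
single-consumer queue (illustrative, not the thesis code; writing to a file is
replaced by appending slice numbers to an in-memory vector to keep the sketch
self-contained):

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// The generator thread pushes each finished time slice; a dedicated writer
// thread drains the queue while the next slices are being computed.
struct SliceWriter {
    std::queue<int> slices;
    std::mutex mtx;
    std::condition_variable cv;
    bool finished = false;
    std::vector<int> written;  // stands in for the output file

    void push(int slice) {  // called by the generator thread
        { std::lock_guard<std::mutex> lock(mtx); slices.push(slice); }
        cv.notify_one();
    }
    void finish() {  // called after the last slice has been pushed
        { std::lock_guard<std::mutex> lock(mtx); finished = true; }
        cv.notify_one();
    }
    void writer_loop() {  // body of the writer thread
        std::unique_lock<std::mutex> lock(mtx);
        for (;;) {
            cv.wait(lock, [this] { return !slices.empty() || finished; });
            while (!slices.empty()) {
                int slice = slices.front();
                slices.pop();
                lock.unlock();
                written.push_back(slice);  // "write" the slice to the file
                lock.lock();
            }
            if (finished) return;
        }
    }
};
```

The lock is released while a slice is being "written", so generation of the
next slices proceeds concurrently with the output.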
   2080 
   2081 #+header: :results output raw :exports results
   2082 #+name: tab-arma-io-performance
   2083 #+begin_src R
   2084 source(file.path("R", "benchmarks.R"))
   2085 suffix_names <- list(
   2086   xfs.x="XFS",
   2087   xfs.y="XFS",
   2088   nfs.x="NFS",
   2089   nfs.y="NFS",
   2090   gfs.x="GlusterFS",
   2091   gfs.y="GlusterFS",
   2092   Row.names="\\orgcmidrule{2-4}{5-7}Subroutine"
   2093 )
   2094 top_names <- c("I", "II")
   2095 row_names <- list(
   2096   generate_surface="Generate wavy surface",
   2097   write_all="Write output to files"
   2098 )
   2099 arma.print_sync_vs_async_io(suffix_names, row_names, top_names)
   2100 #+end_src
   2101 
   2102 #+name: tab-arma-io-performance
   2103 #+caption[I/O performance for XFS, NFS and GlusterFS]:
   2104 #+caption: Running time of subroutines (s.) with the use of XFS, NFS and
   2105 #+caption: GlusterFS along with sequential (I) and parallel (II) file output.
   2106 #+attr_latex: :booktabs t
   2107 #+RESULTS: tab-arma-io-performance
   2108 |                                  |      |      |         I |      |      |        II |
   2109 |                                  |  <r> |  <r> |       <r> |  <r> |  <r> |       <r> |
   2110 | \orgcmidrule{2-4}{5-7}Subroutine |  XFS |  NFS | GlusterFS |  XFS |  NFS | GlusterFS |
   2111 |----------------------------------+------+------+-----------+------+------+-----------|
   2112 | Generate wavy surface            | 1.26 | 1.26 |      1.33 | 1.33 | 3.30 |     11.06 |
   2113 | Write output to files            | 0.28 | 2.34 |     10.95 | 0.00 | 0.00 |      0.00 |
   2114 
   2115 #+name: fig-arma-io-events
   2116 #+header: :height 9 :results output graphics
   2117 #+begin_src R :file build/arma-io-events.pdf
   2118 source(file.path("R", "benchmarks.R"))
   2119 names <- list(xlab="Time, s", ylab="Thread")
   2120 par(mfrow=c(3, 1), mar=c(4,4,4,0.5), family='serif', cex=1)
   2121 arma.plot_io_events(names)
   2122 #+end_src
   2123 
   2124 #+name: fig-arma-io-events
   2125 #+caption[Event graph for XFS, NFS and GlusterFS]:
   2126 #+caption: Event graph for XFS, NFS and GlusterFS that shows time intervals
   2127 #+caption: spent for I/O (red) and computation (black) by different threads.
   2128 #+RESULTS: fig-arma-io-events
   2129 [[file:build/arma-io-events.pdf]]
   2130 
   2131 *** Velocity potential field computation
   2132 **** Parallel velocity potential field computation.
The benchmarks for AR, MA and LH models showed that velocity potential field
computation consumes only a fraction of the total programme execution time;
however, the absolute computation time over a dense \(XY\) grid may still be
large. One application where a dense grid is used is real-time simulation and
visualisation of the wavy surface. Real-time visualisation allows one to
- adjust parameters of the model and the ACF, getting the result of the changes
  immediately, and
- compare the size and the shape of the regions where most of the wave energy
  is concentrated.
   2142 
Since visualisation is done by the GPU, doing velocity potential computation on
the CPU may cause data transfer between the memory of these two devices to
become a bottleneck. To circumvent this, the programme uses the OpenCL/OpenGL
interoperability API, which allows creating buffers that are shared between
OpenCL and OpenGL contexts, thus eliminating copying of data between devices.
In addition to this, the three-dimensional velocity potential field
formula\nbsp{}eqref:eq-phi-3d is particularly suitable for computation by GPUs:
   2150 - it contains transcendental functions (hyperbolic cosines and complex
   2151   exponents);
   2152 - it is computed over large four-dimensional \((t,x,y,z)\) region;
   2153 - it is explicit with no information dependencies between individual points in
   2154   \(t\) and \(z\) dimensions.
These considerations make velocity potential field computation on the GPU
advantageous for real-time simulation and visualisation of the wavy surface.
   2158 
In order to investigate how much the use of the GPU can speed up velocity
potential field computation, we benchmarked a simplified version
of\nbsp{}eqref:eq-phi-3d:
   2161 \begin{align}
   2162     \label{eq-phi-linear}
   2163     \phi(x,y,z,t) &= \InverseFourierY{
   2164         \frac{ \Sinh{\smash{2\pi \Kveclen (z+h)}} }
   2165         { 2\pi\Kveclen \Sinh{\smash{2\pi \Kveclen h}} }
   2166         \FourierY{ \zeta_t }{u,v}
   2167     }{x,y}\nonumber \\
   2168     &= \InverseFourierY{ g_1(u,v) \FourierY{ g_2(x,y) }{u,v} }{x,y}.
   2169 \end{align}
   2170 Velocity potential computation code was rewritten in OpenCL and its performance
   2171 was compared to an existing OpenMP implementation.
   2172 
For each implementation the running time of the corresponding subroutines and
the time spent on data transfer between devices were measured. The velocity
potential field was computed for one \(t\) point, for 128 \(z\) points below
the wavy surface and for each \(x\) and \(y\) point of the four-dimensional
\((t,x,y,z)\) grid. Between programme runs the size of the grid along the
\(x\) dimension was varied.
   2178 
A different FFT library was used for each implementation: GNU Scientific
Library (GSL)\nbsp{}cite:galassi2015gnu for OpenMP and clFFT\nbsp{}cite:clfft
for OpenCL. FFT routines from these libraries have the following features:
- The order of frequencies in the FFT output is different. In the case of the
  clFFT library the elements of the resulting array are additionally shifted to
  make them correspond to the correct velocity potential field. In the case of
  GSL no shift is needed.
- The discontinuity at \((x,y)=(0,0)\) is handled automatically by the clFFT
  library, whereas the GSL library produces skewed values at this point, so in
  the case of GSL these points are interpolated.
No other differences between the FFT subroutines that impact performance were
discovered.
   2190 
   2191 **** Performance of velocity potential solver.
   2192 :PROPERTIES:
   2193 :header-args:R: :results output raw :exports results
   2194 :END:
   2195 
The experiments showed that the OpenCL implementation outperforms the OpenMP
one by a factor of 2--6 (fig.\nbsp{}[[fig-arma-realtime-graph]]); however, the
distribution of running time between subroutines is different
(table\nbsp{}[[tab-arma-realtime]]). On the CPU most of the time is spent
computing \(g_1\), whereas on the GPU the time spent computing \(g_1\) is
comparable to that of \(g_2\). Copying the resulting velocity potential field
between CPU and GPU consumes \(\approx{}20\%\) of the total field computation
time. Computing \(g_2\) consumes most of the execution time for OpenCL and the
least for OpenMP. In both implementations \(g_2\) is computed on the CPU,
because no OpenCL subroutine for derivative computation was found. For OpenCL
the result is duplicated for each \(z\) grid point in order to perform the
multiplication of all \(XYZ\) planes along the \(z\) dimension in a single
OpenCL kernel, and subsequently copied to GPU memory, which has a negative
impact on performance. All benchmarks were run on a machine equipped with a
GPU, the characteristics of which are summarised in table\nbsp{}[[tab-storm]].
   2211 
   2212 #+name: tab-storm
   2213 #+caption["Storm" system configuration]:
   2214 #+caption: "Storm" system configuration.
   2215 #+attr_latex: :booktabs t
   2216 | CPU              | Intel Core 2 Quad Q9550     |
   2217 | RAM              | 8Gb                         |
   2218 | GPU              | AMD Radeon R7 360           |
   2219 | GPU memory       | 2GB                         |
   2220 | HDD              | Seagate Barracuda, 7200 rpm |
   2221 | No. of CPU cores | 4                           |
   2222 
   2223 #+name: fig-arma-realtime-graph
   2224 #+header: :results output graphics
   2225 #+begin_src R :file build/realtime-performance.pdf
   2226 source(file.path("R", "benchmarks.R"))
   2227 par(family="serif")
   2228 data <- arma.load_realtime_data()
   2229 arma.plot_realtime_data(data)
   2230 title(xlab="Wavy surface size", ylab="Time, s")
   2231 #+end_src
   2232 
   2233 #+name: fig-arma-realtime-graph
   2234 #+caption[Performance of velocity potential code]:
   2235 #+caption: Performance comparison of CPU (OpenMP) and GPU (OpenCL) versions of
   2236 #+caption: velocity potential calculation code.
   2237 #+RESULTS: fig-arma-realtime-graph
   2238 [[file:build/realtime-performance.pdf]]
   2239 
The reason for the different distribution of time between OpenCL and OpenMP
subroutines is the same as for the different AR model performance on CPU and
GPU: GPU has more floating point units and modules for transcendental
mathematical functions, which are needed to compute \(g_1\), than CPU, but
lacks caches which are needed to optimise the irregular memory access pattern
of \(g_2\). In contrast to the AR model, performance of multidimensional
derivative computation on GPU is easier to improve, as there are no information
dependencies between points: in this work such optimisation was not done due to
the unavailability of an existing implementation. Additionally, such a library
may allow the non-simplified formula to be computed efficiently entirely on
GPU, since the omitted terms also contain derivatives.
   2251 
   2252 #+name: tab-arma-realtime
   2253 #+begin_src R
   2254 source(file.path("R", "benchmarks.R"))
   2255 routine_names <- list(
   2256   harts_g1="\\(g_1\\) function",
   2257   harts_g2="\\(g_2\\) function",
   2258   harts_fft="FFT",
   2259   harts_copy_to_host="Copy data from GPU"
   2260 )
   2261 column_names <- c("Subroutine", "OpenMP", "OpenCL")
   2262 data <- arma.load_realtime_data()
   2263 arma.print_table_for_realtime_data(data, routine_names, column_names)
   2264 #+end_src
   2265 
   2266 #+name: tab-arma-realtime
   2267 #+caption[Velocity potential calculation performance]:
   2268 #+caption: Running time (s.) of real-time velocity potential calculation subroutines
   2269 #+caption: for wavy surface (OX size equals 16384).
   2270 #+attr_latex: :booktabs t
   2271 #+RESULTS: tab-arma-realtime
   2272 |                    |    <r> |    <r> |
   2273 | Subroutine         | OpenMP | OpenCL |
   2274 |--------------------+--------+--------|
   2275 | \(g_1\) function   | 4.6730 | 0.0038 |
   2276 | \(g_2\) function   | 0.0002 | 0.8253 |
   2277 | FFT                | 2.8560 | 0.3585 |
   2278 | Copy data from GPU |        | 2.6357 |
   2279 
As expected, sharing the same buffer between OpenCL and OpenGL contexts
increases overall solver performance by eliminating data transfer between CPU
and GPU memory, but it also requires the data to be in vertex buffer object
format, which OpenGL can operate on. Conversion to this format is fast, but the
resulting array occupies more memory, since each point is now a vector with
three components. The other disadvantage of using OpenCL and OpenGL together is
the requirement of manual locking of the shared buffer: failure to do so
results in screen image artefacts which can be removed only by rebooting the
computer.
   2289 
   2290 *** Summary
Benchmarks showed that GPU outperforms CPU in arithmetic intensive tasks,
i.e.\nbsp{}tasks requiring a large number of floating point operations per
second; however, its performance degrades when the volume of data that needs to
be copied between CPU and GPU memory increases or when the memory access
pattern differs from linear. The first problem may be solved by using a
co-processor where high-bandwidth memory is located on the same die together
with the processor and the main memory. This eliminates the data transfer
bottleneck, but may also increase execution time due to a smaller number of
floating point units. The second problem may be solved programmatically, if an
OpenCL library that computes multi-dimensional derivatives were available.
   2301 
AR and MA models outperform the LH model in benchmarks and do not require a GPU
to do so. From a computational point of view their strengths are
- absence of transcendental mathematical functions, and
- simple algorithms for both AR and MA models, performance of which depends on
  the performance of the multi-dimensional array library and the FFT library.
Providing the main functionality via low-level libraries makes the performance
of the programme portable: support for new processor architectures can be added
by substituting the libraries. Finally, using an explicit formula makes
velocity potential field computation consume only a small fraction of the total
programme execution time. If such a formula did not exist or did not have all
integrals in the form of Fourier transforms, velocity potential field
computation would consume much more time.
   2314 
   2315 ** Fault-tolerant batch job scheduler
   2316 *** System architecture
   2317 **** Physical layer.
Consists of nodes and direct/routed physical network links. On this layer full
network connectivity is assumed, i.e.\nbsp{}an ability to send network packets
between any pair of cluster nodes.
   2321 
   2322 **** Daemon process layer.
   2323 :PROPERTIES:
   2324 :CUSTOM_ID: sec-daemon-layer
   2325 :END:
   2326 
Consists of daemon processes running on cluster nodes and hierarchical
(master/slave) logical links between them. Only one daemon process is launched
per node, so these terms are used interchangeably in this work. Master and
slave roles are dynamically assigned to daemon processes, i.e.\nbsp{}any
physical cluster node may become a master or a slave, or both simultaneously.
Dynamic role assignment uses a leader election algorithm that does not require
periodic broadcasting of messages to all cluster nodes; instead, the role is
determined by the node's IP address. A detailed explanation of the algorithm is
provided in\nbsp{}[[#sec-node-discovery]]. Its strengths are scalability to a
large number of nodes and low overhead, which make it suitable for large-scale
high-performance computations; its weakness is the artificial dependence of a
node's position in the hierarchy on its IP address, which makes it unsuitable
for virtual environments, where nodes' IP addresses may change dynamically.
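The address-based assignment can be illustrated with a minimal sketch (a
hypothetical illustration, not the actual algorithm
of\nbsp{}[[#sec-node-discovery]]): assuming cluster nodes occupy consecutive
positions \(0,\ldots,N-1\) derived from their IP addresses, each node can
compute its principal in an \(n\)-ary tree hierarchy locally, without
exchanging any messages.

#+begin_src cpp
// Hedged sketch: the principal (master) of the node at position p in an
// n-ary tree is the node at position (p-1)/n; position 0 is the root and
// has no principal (denoted here by -1). The function name and the
// assumption of consecutive positions are illustrative only.
int principal_position(int p, int fanout) {
    return (p == 0) ? -1 : (p - 1) / fanout;
}
#+end_src

With fan-out 2 this yields a binary tree: nodes 1 and 2 become subordinates of
node 0, nodes 3 and 4 of node 1, and so on.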
   2340 
The only purpose of the tree hierarchy is to balance the load between cluster
nodes. The load is distributed from the current node to its neighbours in the
hierarchy by simply iterating over them. When new links appear in the hierarchy
or old links change (due to a new node joining the cluster or due to node
failure), daemon processes communicate to each other the number of processes
/behind/ the corresponding link in the hierarchy. This information is used to
distribute the load evenly, even if a distributed programme is launched on a
slave node. In addition, the hierarchy reduces the number of simultaneous
network connections between nodes: only one network connection is established
and maintained for each hierarchical link\nbsp{}--- this decreases the
probability of network congestion when there is a large number of nodes in the
cluster.
   2352 
Load balancing is implemented as follows. When node \(A\) tries to become a
subordinate of node \(B\), it sends a message to the corresponding IP address
telling how many daemon processes are already linked to it in the hierarchy
(including itself). After all hierarchical links are established, every node
has enough information to tell how many nodes exist behind each link. If the
link is between a slave and a master, and the slave wants to know how many
nodes are behind the link to the master, then it subtracts the total number of
nodes behind all of its slave nodes from the total number of nodes behind the
master to get the correct amount. To distribute kernels
(see\nbsp{}sec.\nbsp{}[[#sec-kernel-layer]]) across nodes a simple round-robin
algorithm is used, i.e.\nbsp{}the system iterates over all links of the current
node, including the link to its master, taking into account the number of nodes
behind each link. So, even if a programme is launched on a slave node at the
bottom of the hierarchy, the kernels are distributed evenly between all cluster
nodes.
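The weighted round-robin can be sketched as follows (a minimal sketch;
~Link~ and ~RoundRobin~ are hypothetical names, not the actual scheduler
code): each link receives, in turn, as many consecutive kernels as there are
nodes behind it.

#+begin_src cpp
#include <cstddef>
#include <utility>
#include <vector>

// A hierarchical link annotated with the number of nodes behind it.
struct Link { int nodes_behind; };

// Weighted round-robin over a non-empty list of links with positive
// weights: link i receives nodes_behind consecutive kernels before the
// iterator moves on to the next link.
class RoundRobin {
public:
    explicit RoundRobin(std::vector<Link> links): links_(std::move(links)) {}

    // Index of the link that receives the next kernel.
    std::size_t next() {
        if (count_ >= links_[current_].nodes_behind) {
            count_ = 0;
            current_ = (current_ + 1) % links_.size();
        }
        ++count_;
        return current_;
    }

private:
    std::vector<Link> links_;
    std::size_t current_ = 0;
    int count_ = 0;
};
#+end_src

For two links with two and one nodes behind them, six consecutive kernels are
routed to links 0, 0, 1, 0, 0, 1, matching the 2:1 ratio.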
   2368 
The proposed approach can be extended to include more sophisticated logic in
the load distribution algorithm: along with the number of nodes behind a
hierarchical link, any metric that is required to implement this logic can be
sent. However, if an update of the metric happens more frequently than a change
in the hierarchy occurs, or if it changes periodically, then status messages
will also be sent more frequently. To ensure that these transfers do not affect
programmes' performance, a separate physical network may be used to transmit
the messages. The other disadvantage is that when reconfiguration of the
hierarchy occurs, the kernels that have already been sent to the nodes are not
taken into account in the load distribution, so frequent hierarchy changes may
cause uneven load on cluster nodes (which, however, balances over time).
   2380 
Dynamic master/slave role assignment coupled with kernel distribution makes the
overall system architecture homogeneous within the framework of a single
cluster. On every node the same daemon is run, and no prior configuration is
needed to construct a hierarchy of daemon processes running on different nodes.
   2385 
   2386 **** Control flow objects layer.
   2387 :PROPERTIES:
   2388 :CUSTOM_ID: sec-kernel-layer
   2389 :END:
   2390 
The key feature that is missing in current parallel programming technologies is
the possibility to specify hierarchical dependencies between tasks executed in
parallel. When such a dependency exists, the principal task becomes responsible
for re-executing a failed subordinate task on the surviving cluster nodes. To
re-execute a task which does not have a principal, a copy of it is created and
sent to a different node (see\nbsp{}sec.\nbsp{}[[#sec-fail-over]]). There
exists a number of systems that are capable of executing directed acyclic
graphs of tasks in parallel\nbsp{}cite:acun2014charmpp,islam2012oozie, but
graphs are not suitable for determining principal-subordinate relationships
between tasks, because a node in the graph may have multiple incoming edges
(and hence multiple principal nodes).
   2402 
The main purpose of the proposed approach is to simplify development of
distributed batch processing applications and middleware. The idea is to
provide resilience to failures at the lowest possible level. The implementation
is divided into two layers: the lower layer consists of routines and classes
for single node applications (with no network interactions), and the upper
layer is for applications that run on an arbitrary number of nodes. There are
two kinds of tightly coupled entities\nbsp{}--- /control flow objects/ (or
/kernels/ for short) and /pipelines/\nbsp{}--- which form the basis of the
programme.
   2411 
Kernels implement control flow logic in their ~act~ and ~react~ methods and
store the state of the current control flow branch. Both logic and state are
implemented by a programmer. In the ~act~ method some function is either
directly computed or decomposed into nested function calls (represented by a
set of subordinate kernels) which are subsequently sent to a pipeline. In the
~react~ method subordinate kernels that returned from the pipeline are
processed by their parent. Calls to ~act~ and ~react~ methods are asynchronous
and are made within threads attached to a pipeline. For each kernel ~act~ is
called only once, and for multiple kernels the calls are done in parallel to
each other, whereas the ~react~ method is called once for each subordinate
kernel, and all the calls are made in the same thread to prevent race
conditions (for different parent kernels different threads may be used).
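The calling convention can be demonstrated with a toy synchronous pipeline (a
sketch under stated assumptions: the class names and the sequential processing
loop are illustrative; real pipelines invoke these methods asynchronously from
thread pools).

#+begin_src cpp
#include <cstddef>
#include <vector>

struct Pipeline;

// Illustrative kernel interface: act() is called once per kernel,
// react() once per returning subordinate.
struct Kernel {
    Kernel* parent = nullptr;
    virtual ~Kernel() = default;
    virtual void act(Pipeline& ppl) = 0;
    virtual void react(Kernel* child) { (void)child; }
};

// Toy pipeline: processes kernels in order and returns each finished
// kernel to its parent via react().
struct Pipeline {
    std::vector<Kernel*> queue;
    void send(Kernel* k) { queue.push_back(k); }
    void run() {
        for (std::size_t i = 0; i < queue.size(); ++i) {
            Kernel* k = queue[i];
            k->act(*this);                      // may enqueue subordinates
            if (k->parent) k->parent->react(k); // downstream phase
        }
    }
};

// Leaf kernel carrying a single term of the sum.
struct Term : Kernel {
    int value;
    explicit Term(int v): value(v) {}
    void act(Pipeline&) override {}
};

// Parent kernel: decomposes a summation into subordinate kernels in
// act() and accumulates their results in react().
struct Sum : Kernel {
    int result = 0;
    void act(Pipeline& ppl) override {
        for (int v : {1, 2, 3}) {
            Term* t = new Term(v);
            t->parent = this;
            ppl.send(t);
        }
    }
    void react(Kernel* child) override {
        result += static_cast<Term*>(child)->value;
        delete child; // the subordinate is no longer needed
    }
};
#+end_src

Sending a ~Sum~ kernel to the pipeline and running it yields ~result == 6~:
the parent's ~react~ is called once per subordinate, mirroring the description
above.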
   2424 
Pipelines implement asynchronous calls to ~act~ and ~react~, and try to make as
many parallel calls as possible considering the concurrency of the platform
(number of cores per node and number of nodes in the cluster). A pipeline
consists of a kernel pool, which contains all the subordinate kernels sent by
their parents, and a thread pool that processes kernels in accordance with the
rules outlined in the previous paragraph. A separate pipeline is used for each
device: there are pipelines for parallel processing, schedule-based processing
(periodic and delayed tasks), and a proxy pipeline for processing of kernels on
other cluster nodes (see fig.\nbsp{}[[fig-subord-ppl]]).
   2434 
In principle, the kernel and pipeline machinery mirrors that of procedures and
call stacks, with the advantage that kernel methods are called asynchronously
and in parallel to each other (as much as programme logic allows). Kernel
fields are stack local variables, the ~act~ method is the sequence of processor
instructions before a nested procedure call, and the ~react~ method is the
sequence of processor instructions after the call. Constructing and sending
subordinate kernels to the pipeline is the nested procedure call itself. Two
methods are necessary to make calls asynchronous, and to replace active waiting
for completion of subordinate kernels with passive waiting. Pipelines, in turn,
implement passive waiting and call the correct kernel methods by analysing
their internal state.
   2445 
   2446 #+name: fig-subord-ppl
   2447 #+begin_src dot :exports results :file build/subord-ppl.pdf
   2448 graph G {
   2449 
   2450   node [fontname="Old Standard",fontsize=14,margin="0.055,0",shape=box]
   2451   graph [nodesep="0.25",ranksep="0.25",rankdir="LR",margin=0]
   2452   edge [arrowsize=0.66]
   2453 
   2454   subgraph cluster_daemon {
   2455     label="Daemon process"
   2456     style=filled
   2457     color=lightgrey
   2458     graph [margin=8]
   2459 
   2460     factory [label="Factory"]
   2461     parallel_ppl [label="Parallel\npipeline"]
   2462     io_ppl [label="I/O\npipeline"]
   2463     sched_ppl [label="Schedule-based\npipeline"]
   2464     net_ppl [label="Network\npipeline"]
   2465     proc_ppl [label="Process\npipeline"]
   2466 
   2467     upstream [label="Upstream\nthread pool"]
   2468     downstream [label="Downstream\nthread pool"]
   2469   }
   2470 
   2471   factory--parallel_ppl
   2472   factory--io_ppl
   2473   factory--sched_ppl
   2474   factory--net_ppl
   2475   factory--proc_ppl
   2476 
   2477   subgraph cluster_hardware {
   2478     label="Compute devices"
   2479     style=filled
   2480     color=lightgrey
   2481     graph [margin=8]
   2482 
   2483     cpu [label="CPU"]
   2484     core0 [label="Core 0"]
   2485     core1 [label="Core 1"]
   2486     core2 [label="Core 2"]
   2487     core3 [label="Core 3"]
   2488 
   2489     storage [label="Storage"]
   2490     disk0 [label="Disk 0"]
   2491 
   2492     network [label="Network"]
   2493     nic0 [label="NIC 0"]
   2494 
   2495     timer [label="Timer"]
   2496 
   2497   }
   2498 
   2499   core0--cpu
   2500   core1--cpu
   2501   core2--cpu
   2502   core3--cpu
   2503 
   2504   disk0--storage
   2505   nic0--network
   2506 
   2507   parallel_ppl--upstream
   2508   parallel_ppl--downstream
   2509 
   2510   upstream--{core0,core1,core2,core3} [style="dashed"]
   2511   downstream--core0 [style="dashed"]
   2512 
   2513   io_ppl--core0 [style="dashed"]
   2514   io_ppl--disk0 [style="dashed"]
   2515   sched_ppl--core0 [style="dashed"]
   2516   sched_ppl--timer [style="dashed"]
   2517   net_ppl--core0 [style="dashed"]
   2518   net_ppl--nic0 [style="dashed"]
   2519   proc_ppl--core0 [style="dashed"]
   2520 
   2521   subgraph cluster_children {
   2522     style=filled
   2523     color=white
   2524 
   2525     subgraph cluster_child0 {
   2526       label="Child process 0"
   2527       style=filled
   2528       color=lightgrey
   2529       labeljust=right
   2530 
   2531       app0_factory [label="Factory"]
   2532       app0 [label="Child process\rpipeline"]
   2533     }
   2534 
   2535 #    subgraph cluster_child1 {
   2536 #      label="Child process 1"
   2537 #      style=filled
   2538 #      color=lightgrey
   2539 #      labeljust=right
   2540 #
   2541 #      app1_factory [label="Factory"]
   2542 #      app1 [label="Child process\rpipeline"]
   2543 #    }
   2544   }
   2545 
   2546   proc_ppl--app0
   2547 #  proc_ppl--app1
   2548 
   2549   app0_factory--app0 [constraint=false]
   2550 #  app1_factory--app1 [constraint=false]
   2551 
   2552 }
   2553 #+end_src
   2554 
   2555 #+caption[Mapping of pipelines to compute devices]:
   2556 #+caption: Mapping of parent and child process pipelines to compute devices.
   2557 #+caption: Solid lines denote aggregation, dashed lines\nbsp{}--- mapping between
   2558 #+caption: logical and physical entities.
   2559 #+attr_latex: :width \textwidth
   2560 #+name: fig-subord-ppl
   2561 #+RESULTS: fig-subord-ppl
   2562 [[file:build/subord-ppl.pdf]]
   2563 
   2564 **** Software implementation.
For efficiency reasons the object pipeline and fault tolerance techniques
(which will be described later) are implemented in a C++ framework: from the
author's perspective the C language is too low-level for distributed
programmes, and Java incurs too much overhead and is not popular in HPC. The
framework is called Bscheduler; it is now in the proof-of-concept development
stage.
   2570 
   2571 **** Application programming interface.
   2572 Each kernel has four types of fields (listed in table\nbsp{}[[tab-kernel-fields]]):
   2573 - fields related to control flow,
   2574 - fields defining the source location of the kernel,
   2575 - fields defining the current location of the kernel, and
   2576 - fields defining the target location of the kernel.
   2577 
   2578 #+name: tab-kernel-fields
   2579 #+caption[Description of kernel fields]:
   2580 #+caption: Description of kernel fields.
   2581 #+attr_latex: :booktabs t :align lp{0.7\textwidth}
   2582 | Field             | Description                                                                                    |
   2583 |-------------------+------------------------------------------------------------------------------------------------|
   2584 | ~process_id~      | Identifier of a cluster-wide process (application) a kernel belongs to.                        |
   2585 | ~id~              | Identifier of a kernel within a process.                                                       |
   2586 | ~pipeline~        | Identifier of a pipeline a kernel is processed by.                                             |
   2587 |                   |                                                                                                |
   2588 | ~exit_code~       | A result of a kernel execution.                                                                |
   2589 | ~flags~           | Auxiliary bit flags used in scheduling.                                                        |
   2590 | ~time_point~      | A time point at which a kernel is scheduled to be executed.                                    |
   2591 |                   |                                                                                                |
   2592 | ~parent~          | Address/identifier of a parent kernel.                                                         |
   2593 | ~src_ip~          | IP-address of a source cluster node.                                                           |
   2594 | ~from_process_id~ | Identifier of a cluster-wide process from which a kernel originated.                           |
   2595 |                   |                                                                                                |
   2596 | ~principal~       | Address/identifier of a target kernel (a kernel to which the current one is sent or returned). |
   2597 | ~dst_ip~          | IP-address of a destination cluster node.                                                      |
   2598 
Upon creation each kernel is assigned a parent and a pipeline. If no other
fields are set, then the kernel is an /upstream/ kernel\nbsp{}--- a kernel that
can be distributed to any node and any processor core to exploit parallelism.
If the principal field is set, then the kernel is a /downstream/
kernel\nbsp{}--- a kernel that can only be sent to its principal, and the
processor core to which the kernel is sent is defined by the principal's memory
address/identifier. If a downstream kernel is to be sent to another node, the
destination IP-address must be set, otherwise the system will not find the
target kernel.
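The distinction can be captured in a small sketch (the field names follow
table\nbsp{}[[tab-kernel-fields]]; the struct and function are illustrative,
not the actual framework API):

#+begin_src cpp
#include <cstdint>

// Minimal view of the kernel fields relevant to routing; zero means
// "not set" here purely for illustration.
struct KernelHeader {
    std::uint64_t principal = 0; // target kernel address/identifier
    std::uint32_t dst_ip    = 0; // required when crossing node boundary
};

enum class Route { any_node, principal_node };

// A kernel with the principal field set is downstream and must go to the
// node hosting its principal; otherwise it is upstream and the load
// balancer may place it on any node.
Route route(const KernelHeader& k) {
    return k.principal != 0 ? Route::principal_node : Route::any_node;
}
#+end_src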
   2607 
When kernel execution completes (its ~act~ method finishes), the kernel is
explicitly sent to some other kernel; this directive is called inside the ~act~
method. Usually, after the execution completes, a kernel is sent to its parent
by setting the ~principal~ field to the address/identifier of the parent, the
destination IP-address field to the source IP-address, and the process
identifier to the source process identifier. After that the kernel becomes a
downstream kernel and is sent by the system to the node where its current
principal is located, without invoking the load balancing algorithm. For a
downstream kernel the ~react~ method of its parent is called by a pipeline with
the kernel as the method argument, to make it possible for the parent to
collect the result of the execution.
   2618 
If the programme contains downstream kernels other than those returning to
their parents, there is no way to provide fine-grained resilience to cluster
node failures. Instead, the exit code of a kernel is checked and a
user-provided recovery subroutine is executed. If there is no error checking,
the system restarts execution from the first parent kernel that did not produce
any downstream kernels. This means that if a problem being solved by the
programme has information dependencies between parts that are computed in
parallel, and a node failure occurs during computation of these parts, then
this computation is restarted from the very beginning, discarding any already
computed parts. This does not occur for embarrassingly parallel programmes,
where parallel parts do not have such information dependencies between each
other: in this case only the parts from failed nodes are recomputed and all
previously computed parts are retained.
   2632 
Unlike the ~main~ function in programmes based on the MPI library, the first
(the main) kernel is initially run only on one node, and other cluster nodes
are used on demand. This makes it possible to use more nodes for highly
parallel parts of the code, and fewer nodes for the other parts. A similar
choice is made in the design of big data
frameworks\nbsp{}cite:dean2008mapreduce,vavilapalli2013yarn --- a user
submitting a job does not specify the number of hosts to run the job on, and
the actual hosts are the hosts where the input files are located.
   2640 
   2641 **** Mapping of simulation models on system architecture.
   2642 Software implementation of AR and MA models works as a computational pipeline,
   2643 in which each joint applies some function to the output coming from the pipe of
   2644 the previous joint. Joints are distributed across computer cluster nodes to
   2645 enable function parallelism, and then data flowing through the joints is
   2646 distributed across processor cores to enable data parallelism.
   2647 Figure\nbsp{}[[fig-pipeline]] shows a diagram of such pipeline. Here rectangles with
   2648 rounded corners denote joints, regular rectangles denote arrays of problem
   2649 domain objects flowing from one joint to another, and arrows show flow
direction. Some joints are divided into /sections/, each of which processes a
separate part of the array. If joints are connected without a /barrier/
(horizontal or vertical bar), then transfer of separate objects between them is
done in parallel to computations, as they become available. Sections work in
parallel on each processor core (or node of the cluster). There is a surjective
mapping between the set of processor cores, the set of pipeline joint sections
and the objects, i.e.\nbsp{}each processor core may run several sections, each
of which may sequentially process several objects, but a section cannot work
simultaneously on several processor cores, and an object cannot be processed
simultaneously by several sections. So, the programme is a pipeline through
which control objects flow.
   2661 
   2662 #+name: fig-pipeline
   2663 #+begin_src dot :exports results :file build/pipeline.pdf
   2664 digraph {
   2665 
   2666   node [fontname="Old Standard",fontsize=14,margin="0.055,0"]
   2667   graph [nodesep="0.25",ranksep="0.25",rankdir="TB",margin=0]
   2668   edge [arrowsize=0.66]
   2669 
   2670   # data
   2671   subgraph xcluster_linear {
   2672     label="Linear model"
   2673 
   2674     start [label="",shape=circle,style=filled,fillcolor=black,width=0.23]
   2675     spectrum [label="S(ω,θ)",shape=box]
   2676     acf [label="K(i,j,k)",shape=box]
   2677     phi [label="Φ(i,j,k)",shape=box]
   2678 
   2679     # transformations
   2680     fourier_transform [label="Fourier transform",shape=box,style=rounded]
   2681     solve_yule_walker [label="Solve Yule—Walker\nequations",shape=box,style=rounded]
   2682 
   2683     subgraph cluster_nonlinear_1 {
   2684       label="Simulate non-linearity\l"
   2685       labeljust=left
   2686       style=filled
   2687       color=lightgrey
   2688       graph [margin=8]
   2689       acf2 [label="K*(i,j,k)",shape=box]
   2690       transform_acf [label="Transform ACF",shape=box,style=rounded]
   2691     }
   2692   }
   2693 
   2694   subgraph xcluster_linear2 {
   2695 
   2696     eps_parts [label="<e1> ε₁|<e2> ε₂|<e3> …|<e4> εₙ|<e> ε(t,x,y)",shape=record]
   2697     end [label="",shape=doublecircle,style=filled,fillcolor=black,width=0.23]
   2698 
   2699     generate_white_noise [label="<g1> g₁|<g2> g₂|<g3> …|<g4> gₙ|<gen> Generate\lwhite noise",shape=record,style=rounded]
   2700     generate_zeta [label="<g1> g₁|<g2> g₂|<g3> …|<g4> gₙ|<gen> Generate sea\lwavy surface parts\l",shape=record,style=rounded]
   2701 
   2702     zeta_parts [label="<g1> ζ₁|<g2> ζ₂|<g3> …|<g4> ζₙ|<gen> Non-crosslinked\lrealisation parts",shape=record]
   2703     overlap_add [label="<g1> ζ₁|<g2> ζ₂|<g3> …|<g4> ζₙ|<gen> Crosslink realisation\lparts\l",shape=record,style=rounded]
   2704 
   2705     zeta_parts:g1->overlap_add:g1
   2706     zeta_parts:g2->overlap_add:g2
   2707     zeta_parts:g3->overlap_add:g3
   2708     zeta_parts:g4->overlap_add:g4
   2709 
   2710     zeta_parts:g2->overlap_add:g1 [constraint=false]
   2711     zeta_parts:g3->overlap_add:g2 [constraint=false]
   2712     zeta_parts:g4->overlap_add:g3 [constraint=false]
   2713 
   2714     overlap_add:g1->zeta2_parts:g1
   2715     overlap_add:g2->zeta2_parts:g2
   2716     overlap_add:g3->zeta2_parts:g3
   2717     overlap_add:g4->zeta2_parts:g4
   2718 
   2719     zeta2_parts:g1->transform_zeta:g1->zeta3_parts:g1->write_zeta:g1->eps_end
   2720     zeta2_parts:g2->transform_zeta:g2->zeta3_parts:g2->write_zeta:g2->eps_end
   2721     zeta2_parts:g3->transform_zeta:g3->zeta3_parts:g3->write_zeta:g3->eps_end
   2722     zeta2_parts:g4->transform_zeta:g4->zeta3_parts:g4->write_zeta:g4->eps_end
   2723 
   2724   }
   2725 
   2726   subgraph part3 {
   2727 
   2728     zeta2_parts [label="<g1> ζ₁|<g2> ζ₂|<g3> …|<g4> ζₙ|<gen> Wavy surface with\lGaussian distribution\l",shape=record]
   2729 
   2730     subgraph cluster_nonlinear_2 {
   2731       label="Simulate non-linearity\r"
   2732       labeljust=right
   2733       style=filled
   2734       color=lightgrey
   2735       graph [margin=8]
   2736       zeta3_parts [label="<g1> ζ₁|<g2> ζ₂|<g3> …|<g4> ζₙ|<gen> ζ(t,x,y)",shape=record]
   2737       transform_zeta [label="<g1> g₁|<g2> g₂|<g3> …|<g4> gₙ|<gen> Transform wavy\lsurface elevation\lprobability distribution\l",shape=record,style=rounded]
   2738     }
   2739 
   2740     # barriers
   2741     eps_start [label="",shape=box,style=filled,fillcolor=black,height=0.05]
   2742     eps_end [label="",shape=box,style=filled,fillcolor=black,height=0.05]
   2743 
   2744     write_zeta [label="<g1> g₁|<g2> g₂|<g3> …|<g4> gₙ|<gen> Write finished\lparts to a file\l",shape=record,style=rounded]
   2745   }
   2746 
   2747   # edges
   2748   start->spectrum->fourier_transform->acf->transform_acf
   2749   transform_acf->acf2
   2750   acf2->solve_yule_walker
   2751   solve_yule_walker->phi
   2752   phi->eps_start [constraint=false]
   2753   eps_start->generate_white_noise:g1
   2754   eps_start->generate_white_noise:g2
   2755   eps_start->generate_white_noise:g3
   2756   eps_start->generate_white_noise:g4
   2757   generate_white_noise:g1->eps_parts:e1->generate_zeta:g1->zeta_parts:g1
   2758   generate_white_noise:g2->eps_parts:e2->generate_zeta:g2->zeta_parts:g2
   2759   generate_white_noise:g3->eps_parts:e3->generate_zeta:g3->zeta_parts:g3
   2760   generate_white_noise:g4->eps_parts:e4->generate_zeta:g4->zeta_parts:g4
   2761 
   2762   eps_end->end
   2763 }
   2764 #+end_src
   2765 
   2766 #+caption[Diagram of data processing pipeline]:
   2767 #+caption: Diagram of data processing pipeline, that implements sea wavy
   2768 #+caption: surface generation via AR model.
   2769 #+name: fig-pipeline
   2770 #+RESULTS: fig-pipeline
   2771 [[file:build/pipeline.pdf]]
   2772 
   2773 Object pipeline may be seen as an improvement of Bulk Synchronous Parallel (BSP)
   2774 model\nbsp{}cite:valiant1990bridging, which is used in graph
   2775 processing\nbsp{}cite:malewicz2010pregel,seo2010hama. Pipeline eliminates global
   2776 synchronisation (where it is possible) after each sequential computation step by
   2777 doing data transfer between joints in parallel to computations, whereas in BSP
   2778 model global synchronisation occurs after each step.
   2779 
Object pipeline speeds up the programme by parallel execution of code blocks
that work with different compute devices: while the current part of the wavy
surface is generated by the processor, the previous part is written to disk.
This approach gives a speed-up (see\nbsp{}sec.\nbsp{}[[#sec-io-perf]]) because
compute devices operate asynchronously, and their parallel usage increases the
whole programme performance.
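The overlap of generation and output can be sketched with two threads (a
hypothetical sketch, not the actual implementation, which uses the pipelines of
fig.\nbsp{}[[fig-subord-ppl]]): the main thread generates part \(i+1\) while a
writer thread stores part \(i\).

#+begin_src cpp
#include <thread>
#include <vector>

// Sketch: generation of the next part overlaps with writing of the
// previous one. Only one writer thread exists at a time and is joined
// before the next write starts, so access to `written` is race-free.
std::vector<int> generate_and_write(int nparts) {
    std::vector<int> written; // stands in for a file on disk
    std::thread writer;
    for (int i = 0; i < nparts; ++i) {
        int part = i * i;                     // "generate" part i
        if (writer.joinable()) writer.join(); // previous part is stored
        writer = std::thread([&written, part] { written.push_back(part); });
    }
    if (writer.joinable()) writer.join();
    return written;
}
#+end_src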
   2786 
Since data transfer between pipeline joints is done in parallel to
computations, the same pipeline may be used to run several copies of the
application, but with different parameters (e.g.\nbsp{}to generate several sea
wavy surfaces having different characteristics). In practice, high-performance
applications do not always consume 100% of processor time, spending a portion
of the time on synchronisation of parallel processes and on writing data to
disk. Using a pipeline in this case allows several computations to be run on
the same set of processes, using all of the computer's devices at maximal
efficiency. For example, when one object writes data to a file, the others do
computations on the processor in parallel. This minimises downtime of the
processor and other computer devices and increases the overall throughput of
the computer cluster.
   2798 
Pipelining of otherwise sequential steps is beneficial not only for code that
works with different devices, but also for code whose branches are suitable for
execution by multiple hardware threads of the same processor core,
i.e.\nbsp{}branches accessing different memory blocks or performing mixed
arithmetic (integer and floating point). Code branches which use different CPU
modules are good candidates to run in parallel on a processor core with
multiple hardware threads.

Thus, a computational model with a pipeline can be seen as a /bulk-asynchronous
model/, because of the parallel nature of programme steps. This model is the
basis of the fault-tolerance model described later.

*** Cluster node discovery
:PROPERTIES:
:CUSTOM_ID: sec-node-discovery
:END:

**** Leader election algorithms.
Many batch job scheduling systems are built on the principle of /subordination/:
there is a master node in each cluster which manages the job queue, schedules
job execution on subordinate nodes and monitors their state. The master role is
assigned either /statically/ by an administrator to a particular physical node,
or /dynamically/ by periodically electing one of the cluster nodes as master.
In the former case fault tolerance is provided by reserving an additional spare
node which takes the master role when the current master fails. In the latter
case fault tolerance is provided by electing a new master node from the
surviving nodes. Although dynamic role assignment requires a leader election
algorithm, this approach becomes more and more popular, as it does not require
spare reserved nodes to recover from master node
failure\nbsp{}cite:hunt2010zookeeper,lakshman2010cassandra,divya2013elasticsearch
and generally leads to a symmetric system architecture, in which the same
software stack with the same configuration is installed on every
node\nbsp{}cite:boyer2012glusterfs,ostrovsky2015couchbase.

Leader election algorithms (which are sometimes referred to as /distributed
consensus/ algorithms) are special cases of wave algorithms.
In\nbsp{}cite:tel2000introduction the author defines them as algorithms in
which a termination event is preceded by at least one event occurring in /each/
parallel process. Wave algorithms are not defined for anonymous networks, that
is, they apply only to processes that can uniquely identify themselves.
However, the number of processes affected by the "wave" can be determined in
the course of the algorithm. For a distributed system this means that wave
algorithms work for computer clusters with a dynamically changing number of
nodes, and that the algorithm is unaffected by nodes going on-line and
off-line.

The approach to cluster node discovery proposed here does not use wave
algorithms, and hence does not require communicating with every node of the
cluster to determine a leader. Instead, each node enumerates all nodes in the
network it is part of and converts this list to a /tree hierarchy/ with a
user-defined /fan-out/ value (the maximal number of subordinate nodes a node
may have). Then the node determines its hierarchy level and tries to
communicate with nodes from higher levels to become their subordinate. First it
checks the closest ones and then goes all the way up to the top. If there are
no top-level nodes, or the node cannot connect to them, then the node itself
becomes the master of the whole hierarchy.

A tree hierarchy of all hosts in a network defines a strict total order on the
set of cluster nodes. Although technically any function can be chosen to map a
node to a number, in practice this function should be sufficiently smooth along
the time axis and may have only infrequent jumps: high-frequency oscillations
(which are often caused by measurement errors) may result in the master role
constantly passing from one node to another, which makes the hierarchy unusable
for load balancing. The simplest such function is the position of an IP address
in the network IP address range.

**** Tree hierarchy creation algorithm.
Strict total order on the set \(\mathcal{N}\) of cluster nodes connected to a
network is defined as
\begin{equation*}
  \forall n_1 \forall n_2 \in \mathcal{N},
  \forall f \colon \mathcal{N} \rightarrow \mathcal{R}^n
  \Rightarrow (f(n_1) < f(n_2) \Leftrightarrow \neg (f(n_1) \geq f(n_2))),
\end{equation*}
where \(f\) maps a node to its level and the operator \(<\) defines strict
total order on \(\mathcal{R}^n\). The function \(f\) defines the node's
sequential number, and \(<\) makes this number unique.

The simplest function \(f\) maps each node to the position of its IP address in
the network IP address range. Without the tree hierarchy, the node with the
lowest position in this range becomes the master. If the IP address of a node
occupies the first position in the range, then there is no master for it, and
it remains at the top of the hierarchy until it fails. Although IP address
mapping is simple to implement, it introduces an artificial dependence of the
master role on the address of a node. Still, it is useful for the initial
configuration of a cluster, when more complex mappings are not applicable.

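Using Python's standard ~ipaddress~ module, such a mapping is a one-liner; the
sketch below (the function name is illustrative) assigns position zero to the
first usable host address:

#+begin_src python
import ipaddress

def position_in_network(address, network):
    # Zero-based position of a host address in its network IP address range;
    # the first usable address (e.g. 10.0.0.1 in 10.0.0.0/24) maps to 0.
    net = ipaddress.ip_network(network)
    return int(ipaddress.ip_address(address)) - int(net.network_address) - 1
#+end_src

With this mapping the node 10.0.0.1 occupies position 0 and, in the absence of
a tree hierarchy, becomes the master of a /24 network.
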
To make node discovery scale to a large number of nodes, the IP address range
is mapped to a tree hierarchy. In this hierarchy each node is uniquely
identified by the hierarchy level \(l\) it occupies and its offset \(o\), which
equals the sequential number of the node on its level. Values of level and
offset are computed from the following optimisation problem:
\begin{align*}
    n = \sum\limits_{i=0}^{l(n)-1} p^i + o(n), \quad
    o \rightarrow \min, \quad
    l \geq 0, \quad
    o \geq 0,
\end{align*}
where \(n\) is the zero-based position of the node's IP address in the network
IP address range and \(p\) is the fan-out value (the maximal number of
subordinates a node can have). The master of the node with level \(l\) and
offset \(o\) has level \(l-1\) and offset \(\lfloor{o/p}\rfloor\). The distance
between any two nodes in the tree hierarchy with network positions \(i\) and
\(j\) is computed as
\begin{align*}
    & \langle
        \text{lsub}(l(j), l(i)), \quad
        \left| o(j) - o(i)/p \right|
    \rangle,\\
    & \text{lsub}(l_1, l_2) =
    \begin{cases}
        \infty & \quad \text{if } l_1 \geq l_2, \\
        l_1 - l_2 & \quad \text{if } l_1 < l_2.
    \end{cases}
\end{align*}
The distance is compound to account for the level in the first place.

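A direct way to solve this problem is to walk down the levels until the
remaining positions fit; the following sketch (zero-based positions,
illustrative function names) computes the level and offset of a node and the
network position of its master:

#+begin_src python
def level_offset(n, p):
    # Decompose zero-based network position n into (level, offset)
    # for a tree hierarchy with fan-out p.
    level, first = 0, 0  # 'first' is the position of the first node on 'level'
    while first + p**level <= n:
        first += p**level  # skip a fully occupied level
        level += 1
    return level, n - first

def master_position(n, p):
    # Network position of the master: level l-1, offset floor(o/p).
    level, offset = level_offset(n, p)
    if level == 0:
        return None  # the root of the hierarchy has no master
    first = sum(p**i for i in range(level - 1))
    return first + offset // p
#+end_src

For the 11-node hierarchy with fan-out 2 shown
in\nbsp{}fig.\nbsp{}[[fig-tree-hierarchy-11]], the node 10.0.0.10 (position 9)
gets level 3 and offset 2, and its master is position 4, i.e.\nbsp{}10.0.0.5.
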
To determine its master, each node ranks all nodes in the network according to
their positions \(\langle{l(n),o(n)}\rangle\) and, using the distance formula,
chooses the node which is closest to the potential master position and has a
lower level. That way IP addresses of off-line nodes are skipped; however, for
sparse networks (in which nodes occupy non-contiguous IP addresses) a perfect
tree is not guaranteed.

The formula for computing the distance can be made arbitrarily complex
(e.g.\nbsp{}to account for network latency and throughput or the geographical
location of the node); however, for its simplest form it is more efficient to
use the cluster node /traversal algorithm/. The algorithm requires less memory,
as there is no need to store a ranked list of all nodes, and iterates over the
IP addresses of the network in the order defined by the fan-out value. The
algorithm is as follows. First, the /base/ node (the node which searches for
its master) computes its potential master address and tries to connect to that
node. If the connection fails, the base node sequentially tries to connect to
each node from the higher hierarchy levels, until it reaches the top of the
hierarchy (the root of the tree). If none of the connections succeed, the base
node sequentially connects to all nodes on its own level that have a lower
position in the IP address range. If none of the nodes respond, the base node
becomes the master node of the whole hierarchy, and the traversal repeats after
a set period of time. An example of the traversal order for a cluster of 11
nodes and a tree hierarchy with a fan-out value of 2 is shown
in\nbsp{}fig.\nbsp{}[[fig-tree-hierarchy-11]].

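Under the assumption of contiguous zero-based positions, the traversal order
can be sketched as follows (a simplified model: the real algorithm probes live
network addresses rather than positions):

#+begin_src python
def traversal_order(n, p):
    # Probe order for the base node at zero-based position n in a tree
    # hierarchy with fan-out p: the assumed master first, then the rest of
    # the higher levels up to the root, then own-level nodes with lower
    # positions.
    level, first = 0, 0
    firsts = [0]  # position of the first node on each level
    while first + p**level <= n:
        first += p**level
        level += 1
        firsts.append(first)
    offset = n - first
    master = firsts[level - 1] + offset // p if level > 0 else None
    order = []
    for lvl in range(level - 1, -1, -1):  # closest level first
        nodes = [firsts[lvl] + i for i in range(p**lvl)]
        if master in nodes:
            nodes.remove(master)
            nodes.insert(0, master)  # the assumed master is probed first
        order.extend(nodes)
    order.extend(first + i for i in range(offset))  # own level, lower positions
    return order
#+end_src

For position 9 (10.0.0.10) and fan-out 2 this reproduces the red arrows
in\nbsp{}fig.\nbsp{}[[fig-tree-hierarchy-11]]: 10.0.0.5, .4, .6, .7, then .2,
.3, .1, and finally .8, .9.
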
#+name: fig-tree-hierarchy-11
#+begin_src dot :exports results :file build/tree-hierarchy-11.pdf
digraph {

  node [fontname="Old Standard",fontsize=14,margin="0.055,0.055",shape=box,style=rounded]
  graph [nodesep="0.30",ranksep="0.30",rankdir="BT"]
  edge [arrowsize=0.66]

  m1 [label="10.0.0.1"]
  m2 [label="10.0.0.2"]
  m3 [label="10.0.0.3"]
  m4 [label="10.0.0.4"]
  m5 [label="10.0.0.5"]
  m6 [label="10.0.0.6"]
  m7 [label="10.0.0.7"]
  m8 [label="10.0.0.8"]
  m9 [label="10.0.0.9"]
  m10 [label="10.0.0.10",fillcolor="#c0c0c0",style="filled,rounded"]
  m11 [label="10.0.0.11",shape=Mrecord]

  m2->m1
  m3->m1
  m4->m2
  m5->m2
  m6->m3
  m7->m3
  m8->m4
  m9->m4
  m10->m5
  m11->m5

  m5->m4->m6 [style="dashed,bold",color="#c00000",constraint=false]
  {rank=same; m6->m7 [style="dashed,bold",color="#c00000"]}
  m7->m2->m3->m1->m8->m9 [style="dashed,bold",color="#c00000",constraint=false]

}
#+end_src

#+caption[Tree hierarchy for 11 nodes]:
#+caption: Tree hierarchy for 11 nodes with a fan-out of 2.
#+caption: Red arrows denote the hierarchy traversal order for the node
#+caption: with IP address 10.0.0.10.
#+name: fig-tree-hierarchy-11
#+RESULTS: fig-tree-hierarchy-11
[[file:build/tree-hierarchy-11.pdf]]

**** Evaluation results.
To benchmark the performance of the traversal algorithm on a large number of
nodes, several daemon processes were launched on each physical cluster node,
each bound to its own IP address. The number of processes per physical core
varied from 2 to 16. Each process was bound to a particular physical core to
reduce the overhead of process migration between cores. The algorithm has low
requirements for processor time and network throughput, so running multiple
processes per physical core is feasible, in contrast to HPC codes, for which
this often lowers performance. The test platform configuration is shown in
table\nbsp{}[[tab-ant]].

#+name: tab-ant
#+caption["Ant" system configuration]:
#+caption: "Ant" system configuration.
#+attr_latex: :booktabs t
| CPU                       | Intel Xeon E5440, 2.83 GHz |
| RAM                       | 4 GB                       |
| HDD                       | ST3250310NS, 7200 rpm      |
| No. of nodes              | 12                         |
| No. of CPU cores per node | 8                          |

A similar approach was used
in\nbsp{}cite:lantz2010network,handigol2012reproducible,heller2013reproducible,
where the authors reproduce various real-world experiments using virtual
clusters based on Linux namespaces and compare the results to those obtained on
physical clusters. The advantage of that approach is that the tests can be
performed on a large virtual cluster using a relatively small number of
physical nodes. The advantage of the approach used in this work (which does not
use Linux namespaces) is that it is more lightweight, and a larger number of
daemon processes can be benchmarked on the same physical cluster.

The performance of the traversal algorithm was evaluated by measuring the time
needed for all nodes of the cluster to discover each other, i.e.\nbsp{}the time
needed for the tree hierarchy of nodes to reach a stable state. Each change of
the hierarchy, as seen by each node, was written to a log file, and after a set
amount of time all daemon processes (each of which models a cluster node) were
forcibly terminated. Daemon processes were launched sequentially with a 100ms
delay to ensure that master nodes always come online before their subordinates
and that the hierarchy does not change randomly as a result of different start
times of each process. This artificial delay was subsequently subtracted from
the results. So, the benchmark results represent the discovery time in an
"ideal" cluster, in which every daemon process finds its master on the first
try.

The benchmark was run multiple times, varying the number of daemon processes
per cluster node. The experiment showed that it takes no more than two seconds
for 512 nodes (8 physical nodes with 64 processes per node) to discover each
other (fig.\nbsp{}[[fig-discovery-benchmark]]). This value does not change
significantly with the increase in the number of physical nodes. Using more
than 8 nodes with 64 processes per node causes large variation in discovery
time due to the large total number of processes connecting simultaneously to
one master process (a fan-out value of 10000 was used for all tests), so these
results were excluded from consideration.

#+name: fig-discovery-benchmark
#+header: :width 7 :height 5
#+begin_src R :file build/discovery-benchmark.pdf
source(file.path("R", "discovery.R"))
par(family="serif")
bscheduler.plot_discovery(xlabel="No. of physical nodes",toplabel="ppn")
#+end_src

#+caption[Time needed to discover all daemon processes on the cluster]:
#+caption: Time needed to discover all daemon processes running on the cluster
#+caption: depending on the number of daemon processes. Dashed lines denote
#+caption: minimum and maximum values, "ppn" is the number of daemon
#+caption: processes per node.
#+name: fig-discovery-benchmark
#+RESULTS: fig-discovery-benchmark
[[file:build/discovery-benchmark.pdf]]

**** Discussion.
The traversal algorithm scales to a large number of nodes, because in order to
determine its master a node only needs to communicate with a node whose address
it knows beforehand. Communication with other nodes occurs only when the
current master node fails. So, if cluster nodes occupy contiguous addresses in
the network IP address range, each node connects only to its master, and an
inefficient scan of the whole network by each node does not occur.

The following key features distinguish the proposed approach from some existing
approaches\nbsp{}cite:brunekreef1996design,aguilera2001stable,romano2014design.
- *Multi-level hierarchy.* The number of master nodes in a network depends on
  the fan-out value. If it is less than the number of IP addresses in the
  network, then there are multiple master nodes in the cluster. If it is
  greater than or equal to the number of IP addresses in the network, then
  there is only one master node. When a node fails, the multi-level hierarchy
  changes locally, and only nodes adjacent to the failed one communicate.
  However, node weight changes propagate to every node in the cluster via
  hierarchical links.
- *IP address mapping.* Since the hierarchy structure depends solely on the
  nodes' IP addresses, there is no election phase in the algorithm. To change
  the master, each node sends a message to the old master and to the new one.
- *Completely event-based.* Messages are sent only when a node joins the
  cluster or fails, so there is no constant load on the network. Since the
  algorithm tolerates the loss of any message, there is no need for heartbeat
  packets indicating node serviceability in the network; instead, all messages
  play the role of heartbeats, and the packet send time-out is
  adjusted\nbsp{}cite:rfc5482.
- *No manual configuration.* A node does not require any prior knowledge to
  find the master: it determines the network it is part of, calculates the
  potential master IP address and sends a message to it. If this fails, the
  process is repeated for the next potential master node. So, the algorithm is
  able to bootstrap a cluster (make all nodes aware of each other) without
  prior manual configuration; the only requirement is to assign an IP address
  to each node and start the daemon process on it.
To summarise, the advantage of the algorithm is that it
- scales to a large number of nodes by means of a hierarchy with multiple
  master nodes,
- does not constantly load the network with node status messages and heartbeat
  packets,
- does not require manual configuration to bootstrap a cluster.

The disadvantage of the algorithm is that it requires IP addresses to change
infrequently in order to be useful for load balancing. It is not suitable for
cloud environments, in which a node's DNS name is preserved but its IP address
may change over time. When the IP address changes, current connections may
close and the node hierarchy is rebuilt. After each change in the hierarchy
only newly created kernels are distributed using the new links; old kernels
continue to execute on whatever nodes they were sent to. So, environments where
an IP address is not a unique identifier of a node are not suitable for the
algorithm.

The other disadvantage is that the algorithm creates an artificial dependence
of node rank on the IP address: it is difficult to substitute the IP address
mapping with a more sophisticated one. If the mapping uses current node and
network load to infer node ranks, measurement errors may result in an unstable
hierarchy, and the algorithm ceases to be fully event-based, as load levels
need to be periodically collected on every node and propagated to each node in
the cluster.

The node discovery algorithm is designed to balance the load on a cluster of
compute nodes (see\nbsp{}sec.\nbsp{}[[#sec-daemon-layer]]); its use in other
applications is not studied here. When a distributed or parallel programme
starts on any of the cluster nodes, its kernels are distributed to all adjacent
nodes in the hierarchy (including the master node, if applicable). To
distribute the load evenly when the application is run on a subordinate node,
each node maintains the weight of each adjacent node in the hierarchy. The
weight equals the number of nodes in the tree "behind" the adjacent node. For
example, if the weight of the first adjacent node is 2, the round-robin load
balancing algorithm distributes two kernels to the first node before moving on
to the next one.

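This weighted round-robin can be sketched as follows (illustrative names; in
the real middleware kernels are distributed on the fly rather than in batches):

#+begin_src python
def distribute(kernels, neighbours):
    # neighbours: list of (node, weight) pairs, where weight is the number
    # of nodes in the tree "behind" that neighbour.  On each round a
    # neighbour receives as many kernels as its weight.
    if not neighbours:
        return {}
    assignment = {node: [] for node, _ in neighbours}
    it = iter(kernels)
    while True:
        for node, weight in neighbours:
            for _ in range(weight):
                kernel = next(it, None)
                if kernel is None:
                    return assignment  # all kernels distributed
                assignment[node].append(kernel)
#+end_src

With neighbours ~[("a", 2), ("b", 1)]~ six kernels are assigned as two to "a",
one to "b", two to "a", one to "b", matching the example above.
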
To summarise, the traversal algorithm is
- designed to ease load balancing on a large number of cluster nodes,
- fully fault-tolerant, because the state of every node can be recomputed at
  any time,
- fully event-based, as it does not overload the network by periodically
  sending state update messages.

*** Fail over algorithm
:PROPERTIES:
:CUSTOM_ID: sec-fail-over
:END:

**** Checkpoints.
Node failures in a distributed system are divided into two types: failure of a
slave node and failure of a master node. In order for a job running on the
cluster to survive a slave node failure, the job scheduler periodically creates
checkpoints and writes them to stable (redundant) storage. To create a
checkpoint, the scheduler temporarily suspends all parallel processes of the
job, copies all memory pages and all internal operating system kernel
structures allocated for these processes to disk, and resumes execution of the
job. To survive a master node failure, the job scheduler daemon process
continuously copies its internal state to a backup node, which becomes the
master after the failure.

There are many works on improving the performance of
checkpoints\nbsp{}cite:egwutuoha2013survey, while alternative approaches do not
receive much attention. HPC applications usually use message passing for
communication between parallel processes and store their state in the global
memory space, hence there is no way to restart a failed process from its
current state without writing the whole memory image to disk. Usually the total
number of processes is fixed by the job scheduler, and all parallel processes
are restarted upon a failure. There is an ongoing effort to make it possible to
restart only the failed process\nbsp{}cite:meyer2012radic by restoring it from
a checkpoint on the surviving nodes, but this may lead to overload if there are
other processes on these nodes. Theoretically, a process restart is not needed
if the job can proceed on the surviving nodes; however, the message passing
library does not allow changing the number of processes at runtime, and most
programmes assume this number to be constant. So, there is no reliable way to
provide fault tolerance in a message passing library other than restarting all
parallel processes from a checkpoint.

At the same time, it is possible to continue execution of a job on a smaller
number of nodes than was initially requested by implementing fault tolerance at
the job scheduler level. In this case master and slave roles are dynamically
distributed between the scheduler's daemon processes running on each cluster
node, forming a tree hierarchy of cluster nodes, and the parallel programme
consists of kernels, which use the node hierarchy to dynamically distribute the
load and use their own hierarchy to restart kernels upon node failure.

**** Dynamic role distribution.
Fault tolerance of a parallel programme is one of the problems solved by big
data and HPC job schedulers; however, most schedulers provide fault tolerance
for slave nodes only. These types of failures are routinely handled by
restarting the affected job (from a checkpoint) or its part on the remaining
nodes, while failure of a principal node is often considered either improbable,
or too complicated to handle and configure on the target platform. System
administrators often find alternatives to application-level fault tolerance:
they isolate the principal process of the scheduler from the rest of the
cluster nodes by placing it on a dedicated machine, or use virtualisation
technologies instead. All of these alternatives complicate configuration and
maintenance and, while decreasing the probability that a machine failure
results in a whole system failure, increase the probability of a human error.

From this point of view it seems more practical to implement master node fault
tolerance at the application level, but there is no proven generic solution.
Most implementations are too tied to a particular application to become
universally applicable. Often a cluster is presented as a machine which has a
distinct control unit (master node) and a number of identical compute units
(slave nodes). In practice, however, master and slave nodes are physically
identical and are distinguished only by the batch job scheduler processes
running on them. The realisation that cluster nodes are merely compute units
makes it possible to implement middleware that distributes master and slave
roles automatically and handles node failures in a generic way. This software
provides an API to distribute kernels between currently available nodes. Using
this API one can write a programme that runs on a cluster without knowing the
exact number of working nodes. The middleware works as a cluster operating
system in user space, allowing distributed applications to be executed
transparently on any number of nodes.

**** Symmetric architecture.
Many distributed key-value stores and parallel file systems have a symmetric
architecture, in which master and slave roles are dynamically distributed, so
that any node can act as a master when the current master node
fails\nbsp{}cite:ostrovsky2015couchbase,divya2013elasticsearch,boyer2012glusterfs,anderson2010couchdb,lakshman2010cassandra.
However, this architecture is still not used in big data and HPC job
schedulers. For example, in the YARN scheduler\nbsp{}cite:vavilapalli2013yarn
master and slave roles are static. Failure of a slave node is tolerated by
restarting the part of a job that worked on it on one of the surviving nodes,
and failure of a master node is tolerated by setting up a standby master
node\nbsp{}cite:murthy2011architecture. Both master nodes are coordinated by
the Zookeeper service, which uses dynamic role assignment to ensure its own
fault tolerance\nbsp{}cite:okorafor2012zookeeper. So, the lack of dynamic role
distribution in the YARN scheduler complicates the whole cluster configuration:
if dynamic roles were available, Zookeeper would be redundant in this
configuration.

The same problem occurs in HPC job schedulers, where the master node (on which
the main job scheduler process runs) is the single point of failure.
In\nbsp{}cite:uhlemann2006joshua,engelmann2006symmetric the authors replicate
job scheduler state to a backup node to make the master node highly available,
but the backup node role is assigned statically. This solution is close to a
symmetric architecture, because it does not involve an external service to
provide high availability, but far from the ideal one, in which the backup node
is chosen dynamically.

Finally, the simplest form of master node high availability is implemented in
the Virtual Router Redundancy Protocol
(VRRP)\nbsp{}cite:rfc2338,rfc3768,rfc5798. Although the VRRP protocol does
provide dynamic role distribution, it is designed to be used by routers and the
reverse proxy servers behind them. Such servers lack the state (a job queue)
that needs to be restored upon node failure, so it is easier for them to
provide high availability. The protocol can even be implemented without
routers, using the Keepalived daemon\nbsp{}cite:cassen2002keepalived instead.

A symmetric architecture is beneficial for job schedulers because it allows
- making physical nodes interchangeable,
- dynamic distribution of master and slave node roles, and
- automatic recovery after a failure of any node.
The following sections describe the components that are required to write a
parallel programme and a job scheduler that can tolerate failures of cluster
nodes.

**** Definitions of hierarchies.
To disambiguate hierarchical links between daemon processes and kernels, and to
simplify the discussion, the following naming conventions are used throughout
the text. If the link is between two daemon processes, the relationship is
denoted as /master-slave/. If the link is between two kernels, the relationship
is denoted as either /principal-subordinate/ or /parent-child/. The two
hierarchies are orthogonal to each other in the sense that no kernel may have a
link to a daemon, and vice versa. Since the daemon process hierarchy is used to
distribute the load on cluster nodes, the kernel hierarchy is mapped onto it,
and this mapping can be arbitrary: it is common to have a principal kernel on a
slave node with its subordinate kernels distributed evenly between all cluster
nodes (including the node where the principal is located). Both hierarchies can
be arbitrarily deep, but "shallow" ones are preferred for highly parallel
programmes, as there are fewer hops when kernels are distributed between
cluster nodes. Since there is a one-to-one correspondence between daemons and
cluster nodes, they are used interchangeably in the text.

**** Handling node failures.
The basic method of overcoming a failure of a subordinate node is to restart
the corresponding kernels on a healthy node (the Erlang language uses the same
method to restart failed subordinate
processes\nbsp{}cite:armstrong2003thesis). To implement this method in the
framework of the kernel hierarchy, the sender node saves every kernel that is
sent to remote cluster nodes, and in the event of a failure of any number of
nodes to which kernels were sent, their copies are redistributed between the
remaining nodes without custom handling by a programmer. If there are no nodes
to send kernels to, they are executed locally. In contrast to the
"heavy-weight" checkpoint/restart machinery employed by HPC cluster job
schedulers, the tree hierarchy of nodes coupled with the hierarchy of kernels
allows handling any number of subordinate node failures automatically and
transparently, without restarting any processes of a parallel programme;
processes act only as resource allocation units.

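The bookkeeping behind this method amounts to keeping a copy of every kernel in
flight; a minimal sketch (class and method names are illustrative, not the
framework's API):

#+begin_src python
class KernelSender:
    # Keeps a copy of every kernel sent to a remote node until its result
    # comes back, so that copies can be resent if the node fails.
    def __init__(self, nodes):
        self.in_flight = {node: [] for node in nodes}

    def send(self, kernel, node):
        self.in_flight[node].append(kernel)  # remember the sent kernel

    def collect(self, kernel, node):
        self.in_flight[node].remove(kernel)  # result received, copy dropped

    def on_node_failure(self, failed):
        # Redistribute copies from the failed node between the remaining
        # nodes; with no nodes left the kernels are executed locally.
        orphans = self.in_flight.pop(failed)
        nodes = list(self.in_flight)
        if not nodes:
            return {"local": orphans}
        moved = {}
        for i, kernel in enumerate(orphans):
            target = nodes[i % len(nodes)]
            self.in_flight[target].append(kernel)
            moved.setdefault(target, []).append(kernel)
        return moved
#+end_src

No programmer intervention is needed: in this scheme redistribution is
triggered by the closed connection to the failed node.
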
   3275 A possible way of handling failure of the main node (a node where the main
   3276 kernel is executed) is to replicate the main kernel to a backup node, and make
   3277 all updates to its state propagate to the backup node by means of a distributed
   3278 transaction, but this approach does not correlate with asynchronous nature of
   3279 kernels and too complex to implement. In practice, however, the main kernel
   3280 usually does not perform operations in parallel, it is rather sequentially
   3281 executes programme steps one by one, hence it has only one subordinate at a
   3282 time. (Each subordinate kernel represents sequential computational step which
   3283 may or may not be internally parallel.) Keeping this in mind, one can simplify
   3284 synchronisation of the main kernel state: send the main kernel along with its
   3285 subordinate to the subordinate node. Then if the main node fails, the copy of
   3286 the main kernel receives its subordinate (because both of them are on the same
node) and no time is spent on recovery. When the subordinate node, to which the
subordinate kernel together with the copy of the main kernel was sent, fails,
the subordinate kernel is sent to a surviving node, and in the worst case the
current computational step is executed again.
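
The resulting recovery rule can be summarised in code (a hypothetical sketch,
assuming the main kernel \(M\) runs on one node while its copy \(M'\) travels
to another node together with the current subordinate kernel; the enumeration
and function names are illustrative):

#+begin_src cpp
#include <string>

// Illustrative recovery rule for the two failure cases described above.
enum class FailedNode { MainNode, SubordinateNode };

std::string recovery_action(FailedNode f) {
    if (f == FailedNode::MainNode) {
        // M is lost, but M' and the subordinate are on the same surviving
        // node: the copy simply receives its subordinate, no recovery time.
        return "copy receives subordinate";
    }
    // M' and the subordinate are lost; the original main kernel re-sends
    // the subordinate, repeating the current sequential step at worst.
    return "re-send subordinate, repeat step";
}
#+end_src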
   3291 
The approach described above is designed for kernels that do not have a parent
and have only one subordinate at a time, which means that it functions as a
checkpoint mechanism. The advantage of this approach is that it
   3295 - saves programme state after each sequential step, when memory footprint of a
   3296   programme is low,
   3297 - saves only relevant data, and
   3298 - uses memory of a subordinate node rather than disk storage.
This simple approach makes it possible to tolerate at most one failure of /any/
cluster node per computational step, or an arbitrary number of subordinate node
failures at any time during programme execution.
   3302 
For a four-node cluster and a hypothetical parallel programme, the failure
recovery algorithm works as follows (fig.\nbsp{}[[fig-fail-over-example]]).
1. *Initial state.* Initially, the computer cluster does not need to be
   configured except for setting up the local network. The algorithm assumes
   full connectivity of cluster nodes, and works best with tree network
   topologies in which several network switches connect all cluster nodes.
   3309 2. *Build node hierarchy.* When the cluster is bootstrapped, daemon processes
   3310    start on all cluster nodes and collectively build their hierarchy
   3311    superimposed on the topology of cluster network. Position of a daemon process
   3312    in the hierarchy is defined by the position of its node IP address in the
   network IP address range. To establish a hierarchical link, each process
   connects to its assumed master process. The hierarchy is changed only when a
   new node joins the cluster or an existing node fails.
3. *Launch main kernel.* The first kernel launches on one of the subordinate
   nodes (node \(B\)). The main kernel may have only one subordinate at a time,
   and a backup copy of the main kernel is sent along with the subordinate
   kernel \(T_1\) to the master node \(A\). \(T_1\) represents one sequential
   step of the programme. There can be any number of sequential steps in a
   programme, and when node \(A\) fails, the current step is restarted from the
   beginning.
4. *Launch subordinate kernels.* Kernels \(S_1\), \(S_2\), \(S_3\) are launched
   on subordinate cluster nodes. When node \(B\), \(C\) or \(D\) fails, the
   corresponding main kernel restarts its failed subordinates (\(T_1\) restarts
   \(S_1\), the main kernel restarts \(T_1\), etc.). When node \(B\) fails, the
   main kernel is recovered from the backup copy.
   3327 
   3328 #+name: fig-fail-over-example
   3329 #+header: :headers '("\\input{preamble}")
   3330 #+begin_src latex :file build/fail-over-example.pdf :exports results :results raw
   3331 \input{tex/preamble}
   3332 \newcommand*{\spbuInsertFigure}[1]{%
   3333 \vspace{2\baselineskip}%
   3334 \begin{minipage}{0.5\textwidth}%
   3335     \Large%
   3336     \input{#1}%
   3337 \end{minipage}%
   3338 }%
   3339 \noindent%
   3340 \spbuInsertFigure{tex/cluster-0}~\spbuInsertFigure{tex/frame-0}\newline
   3341 \spbuInsertFigure{tex/frame-3}~\spbuInsertFigure{tex/frame-4}\newline
   3342 \spbuInsertFigure{tex/legend}
   3343 #+end_src
   3344 
   3345 #+caption[Working scheme of fail over algorithm]:
   3346 #+caption: Working scheme of fail over algorithm.
   3347 #+name: fig-fail-over-example
   3348 #+attr_latex: :width \textwidth
   3349 #+RESULTS: fig-fail-over-example
   3350 [[file:build/fail-over-example.pdf]]
   3351 
   3352 **** Evaluation results.
The fail over algorithm was evaluated on a physical cluster
(table\nbsp{}[[tab-ant]]) on the example of the distributed AR model
application, which is described in detail in section\nbsp{}[[#sec-arma-mpp]].
The application consists of a series of functions, each of which is applied to
the result of the previous one. Some of the functions are computed in parallel,
so the programme is written as a sequence of steps, some of which are made
internally parallel to get better performance. In the programme only the most
compute-intensive step (the surface generation) is executed in parallel across
all cluster nodes, and the other steps are executed in parallel across all
cores of the master node.
   3362 
The application was rewritten for the distributed version of the framework,
which required adding read/write methods to each kernel that is sent over the
network, and slight modifications to handle the failure of the node with the
main kernel. The kernel was marked so that the framework makes a replica and
sends it to some subordinate node along with its subordinate kernel. Other code
changes involved modifying some parts to match the new API. So, providing fault
tolerance by means of kernel hierarchy is mostly transparent to the programmer:
it requires only explicit marking of the main kernel and adding code to read
and write kernels to the byte buffer.
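
The read/write methods mentioned above can be sketched as follows (the kernel
type, its fields and the use of a plain byte stream are assumptions for
illustration; the real framework defines its own buffer type):

#+begin_src cpp
#include <cstdint>
#include <sstream>

// Illustrative kernel that serialises its state field by field into a byte
// buffer before being sent over the network, and restores it on arrival.
struct PartKernel {
    std::int32_t part_index = 0; // which wavy surface part to generate
    std::int32_t part_size  = 0; // number of points in the part

    void write(std::ostream& out) const {
        out.write(reinterpret_cast<const char*>(&part_index), sizeof part_index);
        out.write(reinterpret_cast<const char*>(&part_size), sizeof part_size);
    }

    void read(std::istream& in) {
        in.read(reinterpret_cast<char*>(&part_index), sizeof part_index);
        in.read(reinterpret_cast<char*>(&part_size), sizeof part_size);
    }
};
#+end_src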
   3372 
In a series of experiments, the performance of the new version of the
application was benchmarked in the presence of different types of failures:
- no failures,
- failure of a slave node (a node where a copy of the main kernel is stored),
- failure of a master node (a node where the main kernel is run).
Only two directly connected cluster nodes were used for the benchmark. Node
failure was simulated by sending the ~SIGKILL~ signal to the daemon process on
the corresponding node right after the copy of the main kernel is made. The
application immediately recognised the node as offline, because the
corresponding connection was closed; in a real-world scenario, however, the
failure is detected only after a configurable TCP user
timeout\nbsp{}cite:rfc5482. The execution time of these runs was compared to
the run without node failures. Results are presented in
fig.\nbsp{}[[fig-master-slave-failure]], and a diagram of how kernels are
distributed between two nodes is shown in
fig.\nbsp{}[[fig-master-slave-backup]].
   3388 
   3389 #+name: fig-master-slave-backup
   3390 #+begin_src dot :exports results :file build/master-slave-backup.pdf
   3391 digraph {
   3392 
   3393   node [fontname="Old Standard",fontsize=14,margin="0.055,0.055",shape=box,style=rounded]
   3394   graph [nodesep="0.30",ranksep="0.30",rankdir="BT"]
   3395   edge [arrowsize=0.66]
   3396 
   3397   m1 [label="{{<master>Master | 10.0.0.1} | {<main>M | <slave1>S₁ | <slave3>S₃}}",shape=record]
   3398   m2 [label="{{<slave>Slave | 10.0.0.2} | {M' | <step>N | <slave2>S₂ | <slave4>S₄}}",shape=record]
   3399 
   3400   m2->m1
   3401 
   3402 }
   3403 #+end_src
   3404 
   3405 #+name: fig-master-slave-backup
   3406 #+caption[Mapping of kernels on master and slave nodes]:
   3407 #+caption: Master and slave nodes and mapping of main kernel \(M\),
   3408 #+caption: a copy of main kernel \(M'\), current step kernel \(N\)
   3409 #+caption: and subordinate kernels \(S_{1,2,3}\) on them.
   3410 #+RESULTS: fig-master-slave-backup
   3411 [[file:build/master-slave-backup.pdf]]
   3412 
As expected, there is a considerable difference in application performance for
different types of failures. In case of slave node failure the main kernel as
well as some subordinate kernels (those that were distributed to the slave
node) are lost, but the master node has a copy of the main kernel and uses it
to continue the execution. So, in case of slave node failure nothing is lost
except the performance potential of the slave node. In case of master node
failure, a copy of the main kernel as well as the subordinate kernel which
carried the copy are lost, but the slave node has the original main kernel and
uses it to restart execution of the current sequential step,
i.e.\nbsp{}send the subordinate kernel to one of the surviving cluster nodes
(in case of two directly connected nodes, it sends the kernel to itself). So,
application performance differs because the number of kernels that are lost as
a result of a failure, as well as their roles, differ.
   3426 
Slave node failure needs some time to be discovered: it is detected only when
the subordinate kernel carrying a copy of the main kernel finishes its
execution and tries to reach its parent. Instant detection requires forcible
termination of the subordinate kernel, which may be inapplicable for programmes
with complicated logic.
   3432 
   3433 #+name: fig-master-slave-failure
   3434 #+begin_src R :file build/master-slave-failure.pdf
   3435 source(file.path("R", "benchmarks.R"))
   3436 par(family="serif")
   3437 data <- arma.load_master_slave_failure_data()
   3438 arma.plot_master_slave_failure_data(
   3439   data,
   3440   list(
   3441     master="Bscheduler (master failure)",
   3442     slave="Bscheduler (slave failure)",
   3443     nofailures="Bscheduler (no failures)"
   3444   )
   3445 )
   3446 title(xlab="Wavy surface size", ylab="Time, s")
   3447 #+end_src
   3448 
   3449 #+caption[Performance of a programme in the presence of node failures]:
   3450 #+caption: Performance of AR model in the presence of node failures.
   3451 #+name: fig-master-slave-failure
   3452 #+RESULTS: fig-master-slave-failure
   3453 [[file:build/master-slave-failure.pdf]]
   3454 
To summarise, if failure occurs right after a copy of the main kernel is made,
only a small fraction of performance is lost no matter whether the main kernel
or its copy is lost.
   3458 
   3459 **** Discussion of test results.
Since failure is simulated right after the first subordinate kernel reaches its
destination (the node where it is supposed to be executed), slave node failure
results in a loss of only a small fraction of performance; in a real-world
scenario, where failure may occur in the middle of wavy surface generation,
performance loss due to /backup/ node failure (the node where a copy of the
main kernel is located) would be higher. Similarly, in a real-world scenario
the number of cluster nodes is larger, hence fewer subordinate kernels are lost
due to master node failure, so the performance penalty is lower. In the
benchmark the penalty is higher for the slave node failure, which is the result
of the absence of parallelism in the beginning of AR model wavy surface
generation: the first part is computed sequentially, and other parts are
computed only when the first one is available. So, the loss of the first
subordinate kernel delays execution of every dependent kernel in the programme.
   3473 
The fail over algorithm guarantees handling of one failure per sequential
programme step; more failures can be tolerated if they do not affect the master
node. The algorithm handles simultaneous failure of all subordinate nodes;
however, if both the master and the backup node fail, there is no chance for
the programme to continue the work. In this case the state of the current
computation step is lost, and the only way to restore it is to restart the
application from the beginning (which is currently not implemented in
Bscheduler).
   3481 
Kernels are a means of abstraction that decouples a distributed application
from the physical hardware: it does not matter how many cluster nodes are
currently available for a programme to run without interruption. Kernels
eliminate the need to allocate a physical live spare node to tolerate master
node failures: in the framework of kernel hierarchy, any physical node (except
the master) can act as a live spare. Finally, kernels make it possible to
handle failures in a way that is transparent to the programmer, deriving the
order of actions from the internal state of a kernel.
   3490 
The experiments show that it is essential for a parallel programme to have
multiple sequential steps to make it resilient to cluster node failures;
otherwise failure of the backup node in fact triggers recovery of the initial
state of the programme. Although the probability of a master node failure is
lower than the probability of a failure of any of the slave nodes, it does not
justify losing all the data when a long-running programme is near completion.
In general, the more sequential steps a parallel programme has, the less time
is lost in an event of a backup node failure, and the more parallel parts each
sequential step has, the less time is lost in case of a master or slave node
failure. In other words, the more nodes a programme uses, the more resilient to
their failures it becomes.
   3502 
Although it is not shown in the experiments, Bscheduler not only provides
tolerance to cluster node failures, but also allows new nodes to automatically
join the cluster and receive their portion of kernels from already running
programmes. This is a trivial process in the context of the framework, as it
does not involve restarting failed kernels or copying their state, so it has
not been studied experimentally in this work.
   3509 
Theoretically, hierarchy-based fault tolerance can be implemented on top of a
message-passing library without loss of generality. Although it would be
complicated to reuse free nodes instead of failed ones in the framework of such
a library, as the number of nodes is often fixed there, allocating a reasonably
large number of nodes for the programme would be enough to make it
fault-tolerant. At the same time, implementing hierarchy-based fault tolerance
inside the message-passing library itself is not practical, because it would
require saving the state of a parallel programme, which equals the total amount
of memory it occupies on each cluster node; this in turn would not make it more
efficient than checkpoints.
   3520 
The weak point of the proposed method is the period of time starting from a
failure of the master node up to the moment when the failure is detected, the
main kernel is restored, and a new subordinate kernel with the parent's copy is
received by a slave node. If the backup node fails within this time frame, the
execution state of the programme is completely lost, and there is no way to
recover it other than by restarting the programme from the beginning. The
duration of the dangerous period can be minimised, but the probability of an
abrupt programme termination cannot be fully eliminated. This result is
consistent with the findings of /impossibility theory/, in the framework of
which the impossibility of distributed consensus with one faulty
process\nbsp{}cite:fischer1985impossibility and the impossibility of reliable
communication in the presence of node
failures\nbsp{}cite:fekete1993impossibility are proved.
   3534 
   3535 *** Summary
The current state-of-the-art approach to developing and running parallel
programmes on a cluster is the use of a message passing library and a job
scheduler, and despite the fact that this approach is highly efficient in terms
of parallel computation, it is not flexible enough to accommodate dynamic load
balancing and automatic fault tolerance. Programmes written with a message
passing library typically assume
   3542 - equal load on each processor,
   3543 - non-interruptible and reliable execution of batch jobs, and
   3544 - constant number of parallel processes throughout the execution which is equal
   3545   to the total number of processors.
The first assumption does not hold for the sea wave simulation programme
because the AR model requires dynamic load balancing between processors to
generate each part of the surface only when all parts it depends on have
already been generated. The last assumption also does not hold, because for the
sake of efficiency each part is written to a file asynchronously by a separate
thread. The remaining assumption is not related to the programme itself, but to
the job scheduler, and does not generally hold for very large computer clusters
in which node failures occur regularly, and the job scheduler restores the
failed job from a checkpoint, greatly increasing its running time. The idea of
the proposed approach is to give parallel programmes more flexibility:
   3556 - provide dynamic load balancing via pipelined execution of sequential,
   3557   internally parallel programme steps,
   3558 - restart only processes that were affected by node failure, and
   3559 - execute the programme on as many compute nodes as are available in the
   3560   cluster.
   3561 In this section advantages and disadvantages of this approach are discussed.
   3562 
In comparison to portable batch systems (PBS), the proposed approach uses
lightweight control flow objects instead of heavy-weight parallel jobs to
distribute the load on cluster nodes. First, this makes it possible to have
per-node kernel queues instead of several cluster-wide job queues. The
granularity of control flow objects is much higher than that of batch jobs, and
despite the fact that their execution time cannot be reliably predicted (just
as the execution time of batch jobs\nbsp{}cite:zotkin1999job), objects from
multiple parallel programmes can be dynamically distributed between the same
set of cluster nodes, thus making the load more even. The disadvantage is that
this requires more RAM to execute many programmes on the same set of nodes, and
the execution time of each programme may be greater because of the shared
control flow object queues. Second, the proposed approach uses dynamic
distribution of master and slave roles between cluster nodes instead of their
static assignment to particular physical nodes. This makes nodes
interchangeable, which is required to provide fault tolerance. Simultaneous
execution of multiple parallel programmes on the same set of nodes increases
the throughput of the cluster, but also decreases their performance taken
separately; dynamic role distribution is the base on which resilience to
failures builds.
   3581 
In comparison to MPI, the proposed approach uses lightweight control flow
objects instead of heavy-weight processes to decompose the programme into
individual entities. First, this makes it possible to determine the number of
entities computed in parallel based on the problem being solved, not on the
computer or cluster architecture. A programmer is encouraged to create as many
objects as needed, guided by the algorithm or by restrictions on the size of
data structures from the problem domain. In the sea wave simulation programme
the minimal size of each wavy surface part depends on the number of
coefficients along each dimension, and at the same time the number of parts
should be larger than the number of processors to make the load on each
processor more even. Considering these limits, the optimal part size is
determined at runtime and, in general, is not equal to the number of parallel
processes. The disadvantage is that the more control flow objects there are in
the programme, the more shared data structures (e.g.\nbsp{}coefficients) are
copied to the same node with subordinate objects; this problem is solved by
introducing another intermediate layer of objects, which in turn adds more
complexity to the programme. Second, the hierarchy of control flow objects
together with the hierarchy of nodes allows for automatic recomputation of
failed objects on surviving nodes in an event of hardware failure. This is
possible because the state of the programme execution is stored in objects and
not in global variables as in MPI programmes. By duplicating the state to a
subordinate node, the system recomputes only objects from affected processes
instead of the whole programme. Transition from processes to control flow
objects increases performance of a parallel programme via dynamic load
balancing, but may inhibit its scalability for a large number of nodes and a
large amount of shared structures due to duplication of these structures.
   3607 
Decomposition of a parallel programme into individual entities is done in the
Charm++ framework\nbsp{}cite:kale2012charm (with load
balancing\nbsp{}cite:bhandarkar2001adaptive) and in the actor
model\nbsp{}cite:hewitt1973universal,agha1985actors, but neither approach uses
a hierarchical relationship to restart entity processing after a failure.
Instead of using a tree hierarchy of kernels, these approaches allow exchanging
messages between any pair of entities. Such an irregular message exchange
pattern makes it impossible to decide which object is responsible for
restarting a failed one, hence non-universal fault tolerance techniques are
used instead\nbsp{}cite:lifflander2014scalable.
   3618 
   3619 Three building blocks of the proposed approach\nbsp{}--- control flow objects,
   3620 pipelines and hierarchies\nbsp{}--- complement each other. Without control flow
   3621 objects carrying programme state it is impossible to recompute failed
   3622 subordinate objects and provide fault tolerance. Without node hierarchy it is
   3623 impossible to distribute the load between cluster nodes, because all nodes are
   3624 equal without the hierarchy. Without pipelines for each device it is impossible
   3625 to execute control flow objects asynchronously and implement dynamic load
   3626 balancing. These three entities form a closed system, in which programme logic
   3627 is implemented in kernels or pipelines, failure recovery logic\nbsp{}--- in
   3628 kernel hierarchy, and data transfer logic\nbsp{}--- in cluster node hierarchy.
   3629 
   3630 ** MPP implementation
   3631 :PROPERTIES:
   3632 :CUSTOM_ID: sec-arma-mpp
   3633 :END:
   3634 
   3635 **** Distributed AR model algorithm.
   3636 :PROPERTIES:
   3637 :CUSTOM_ID: sec-distributed
   3638 :END:
   3639 
This algorithm, unlike its parallel counterpart, employs copying of data to
execute computation on different cluster nodes, and since network throughput is
much lower than memory bandwidth, the size of data that is sent over the
network has to be optimised to get better performance than on an SMP system.
One way to accomplish this is to distribute wavy surface parts between cluster
nodes, copying in the coefficients and all the boundary points, and copying out
the generated wavy surface part. Autoregressive dependencies prevent creating
all the parts at once and statically distributing them between cluster nodes,
so the parts are created dynamically on the first node, when the points on
which they depend become available. So, the distributed AR model algorithm is a
master-slave algorithm\nbsp{}cite:lusk2010more in which the master dynamically
creates tasks for each wavy surface part, taking into account autoregressive
dependencies between points, and sends them to slaves, whereas slaves compute
each wavy surface part and send it back to the master.
   3654 
In the MPP implementation each task is modelled by a kernel: there is a
principal kernel that creates subordinate kernels on demand, and subordinate
kernels that generate wavy surface parts. In the ~act~ method of the principal
kernel a subordinate kernel is created for the first part\nbsp{}--- the part
that does not depend on any points. When this kernel returns, the principal
kernel in its ~react~ method determines which parts can be computed next,
creates a subordinate kernel for each part and sends them to the pipeline. In
the ~act~ method of a subordinate kernel the wavy surface part is generated,
and then the kernel sends itself back to the principal. The ~react~ method of a
subordinate kernel is empty.
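
The interplay of ~act~ and ~react~ can be modelled by the following
single-threaded sketch, in which parts form a one-dimensional chain (part \(i\)
depends on part \(i-1\)) to keep the example short; the real dependencies are
two-dimensional, and all names here are illustrative, not the framework API:

#+begin_src cpp
#include <queue>
#include <vector>

// Illustrative principal kernel: act() submits the first part, react()
// submits every part whose dependency became available.
struct Principal {
    int nparts;
    std::vector<bool> done;
    std::queue<int> pipeline; // stands in for the framework pipeline

    explicit Principal(int n): nparts(n), done(n, false) {}

    void act() { pipeline.push(0); } // the first part depends on nothing

    void react(int part) { // called when a subordinate returns
        done[part] = true;
        if (part + 1 < nparts) pipeline.push(part + 1);
    }

    void run() { // drive the pipeline to completion
        act();
        while (!pipeline.empty()) {
            int part = pipeline.front();
            pipeline.pop();
            // a subordinate's act() would generate the part here,
            // then send itself back to the principal:
            react(part);
        }
    }
};
#+end_src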
   3664 
The distributed AR algorithm implementation has several advantages over the
parallel one.
- Pipelines automatically distribute subordinate kernels between available
  cluster nodes, and the main programme does not have to deal with these
  implementation details.
- There is no need to implement a minimalistic job scheduler, which determines
  the execution order of jobs (kernels) taking into account autoregressive
  dependencies: the order is fully defined in the ~react~ method of the
  principal kernel.
- There is no need for a separate version of the programme for an SMP machine:
  the implementation works transparently on any number of nodes, even if the
  job scheduler is not running.
   3677 
   3678 **** Performance of distributed AR model implementation.
The distributed AR model implementation was benchmarked on two nodes of the
"Ant" system (table\nbsp{}[[tab-ant]]). To increase network throughput these
nodes were directly connected to each other and the maximum transmission unit
(MTU) was set to 9200 bytes. Two cases were considered: with one Bscheduler
daemon process running on the local node, and with two daemon processes, one
running on each node. The performance of the programme was compared to the
performance of the OpenMP version running on a single node.
Bscheduler outperforms OpenMP in both cases
(fig.\nbsp{}[[fig-bscheduler-performance]]). In case of one node the higher
performance is explained by the fact that Bscheduler does not scan the queue
for wavy surface parts whose dependencies are ready (as in the parallel version
of the algorithm), but for each part updates a counter of completed parts on
which it depends. The same approach can be used in the OpenMP version, but it
was discovered only while developing the newer Bscheduler version, as queue
scanning cannot be performed efficiently in this framework. In case of two
nodes the higher performance is explained by the greater total number of
processor cores (16) and the high network throughput of the direct network
link. So, the Bscheduler implementation of the distributed AR model algorithm
is faster on a shared memory system due to more efficient handling of
autoregressive dependencies, and its performance on a distributed memory system
scales to a larger number of cores due to the small data transmission overhead
of the direct network link.
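
The dependency counter mentioned above can be sketched as follows (structure
and function names are illustrative assumptions): each part keeps a counter of
its completed dependencies and is submitted exactly once, when the counter
reaches the total, so no queue scanning is needed.

#+begin_src cpp
#include <queue>
#include <vector>

// Illustrative per-part dependency counter.
struct Part {
    int deps_total = 0; // number of parts this part depends on
    int deps_done  = 0; // how many of them have been generated
    bool submitted = false;
};

// Record that one dependency of part `j` finished; submit `j` when all
// of its dependencies are done.
void dependency_done(std::vector<Part>& parts, int j, std::queue<int>& ready) {
    Part& p = parts[j];
    if (++p.deps_done == p.deps_total && !p.submitted) {
        p.submitted = true;
        ready.push(j);
    }
}
#+end_src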
   3701 
   3702 #+name: fig-bscheduler-performance
   3703 #+begin_src R :file build/bscheduler-performance.pdf
   3704 source(file.path("R", "benchmarks.R"))
   3705 par(family="serif")
   3706 data <- arma.load_bscheduler_performance_data()
   3707 arma.plot_bscheduler_performance_data(
   3708   data,
   3709   list(
   3710     openmp="OpenMP",
   3711     bsc1="Bscheduler (single node)",
   3712     bsc2="Bscheduler (two nodes)"
   3713   )
   3714 )
   3715 title(xlab="Wavy surface size", ylab="Time, s")
   3716 #+end_src
   3717 
   3718 #+name: fig-bscheduler-performance
   3719 #+caption[Performance of Bscheduler and OpenMP programme versions]:
   3720 #+caption: Performance comparison of Bscheduler and OpenMP programme versions for
   3721 #+caption: AR model.
   3722 #+RESULTS: fig-bscheduler-performance
   3723 [[file:build/bscheduler-performance.pdf]]
   3724 
   3725 * Conclusion
In the study of the mathematical apparatus for sea wave simulations, which goes
beyond linear wave theory, the following main results were achieved.
- AR and MA models were applied to the simulation of sea waves of arbitrary
  amplitudes. Integral characteristics of the wavy surface were verified by
  comparing them to the ones of the real sea surface.
- A new method was applied to compute velocity potentials under the generated
  surface. The resulting field was verified by comparing it to the one given by
  formulae from linear wave theory for small-amplitude waves. For
  large-amplitude waves the new method gives a reasonably different field. The
  method is computationally efficient because all the integrals in its formula
  are written as Fourier transforms, for which there are high-performance
  implementations.
- The model and the method were implemented for both shared and distributed
  memory systems, and showed near-linear scalability for different numbers of
  cores in several benchmarks. The AR model is more computationally efficient
  on CPU than on GPU, and outperforms the LH model.
   3741 
One of the topics of future research is the generation of waves of arbitrary
profiles on the basis of the mixed ARMA process. Another direction is the
integration of the developed model and the pressure calculation method into
existing application software packages.
   3746 
   3747 * Summary
The problem of determining pressures under the sea surface is solved explicitly
without the assumptions of linear and small-amplitude wave theories. This
solution, coupled with AR and MA models that generate sea waves of arbitrary
amplitudes, can be used to determine the impact of wave oscillations on a
dynamic marine object in a seaway, and in case of large-amplitude waves gives a
more accurate velocity potential field than analogous solutions obtained in the
framework of linear wave theory and the theory of small-amplitude waves.
   3755 
Numerical experiments show that wavy surface generation as well as pressure
computation are efficiently implemented for both shared and distributed memory
systems, without computing on GPU. High performance in case of wavy surface
generation is provided by the use of a specialised job scheduler and a library
for multi-dimensional arrays, and in case of velocity potential
computation\nbsp{}--- by the use of FFT algorithms for computing integrals.
   3762 
The developed mathematical apparatus and its numerical implementation can
become a base of a virtual testbed for studies of marine object dynamics. The
use of the new models in a virtual testbed would make it possible to
- conduct long-time simulations without loss of efficiency,
- obtain more accurate pressure fields due to the new velocity potential field
  computation method, and
- make the software complex more robust due to the high convergence rate of the
  models and the use of a fault-tolerant batch job scheduler.
   3771 
   3772 * Acknowledgements
   3773 The graphs in this work were prepared using R language for statistical
   3774 computing\nbsp{}cite:rlang2016,sarkar2008lattice and Graphviz
   3775 software\nbsp{}cite:gansner00anopen. The manuscript was prepared using
Org-mode\nbsp{}cite:schulte2011org2,schulte2011org1,dominik2010org for GNU
Emacs, which provides a computing environment for reproducible research. This
means that all graphs can be reproduced and the corresponding statements
verified on a different computer system by cloning the thesis
repository[fn:repo], installing Emacs and exporting the document.
   3781 
   3782 [fn:repo] [[https://github.com/igankevich/arma-thesis]]
   3783 
* List of acronyms and symbols
- <<<MPP>>> :: Massively Parallel Processing, computers with distributed memory.
- <<<SMP>>> :: Symmetric Multi-Processing, computers with shared memory.
- <<<GPGPU>>> :: General-purpose computing on graphics processing units.
- ACF :: auto-covariance function.
- <<<FFT>>> :: fast Fourier transform.
- <<<PRNG>>> :: pseudo-random number generator.
- <<<BC>>> :: boundary condition.
- <<<PDE>>> :: partial differential equation.
- <<<NIT>>> :: non-linear inertialess transform, which allows one to specify an
               arbitrary distribution law for wavy surface elevation without
               changing its original auto-covariance function.
- AR :: auto-regressive process.
- ARMA :: auto-regressive moving-average process.
- MA :: moving-average process.
- LH :: Longuet---Higgins model, whose formula is derived in the framework of
        linear wave theory.
- <<<LAMP>>> :: Large Amplitude Motion Programme, a programme that simulates
                ship behaviour in ocean waves.
- <<<CLT>>> :: central limit theorem.
- <<<PM>>> :: Pierson---Moskowitz ocean wave spectrum approximation.
- <<<YW>>> :: Yule---Walker equations, which are used to determine
              auto-regressive model coefficients for a given auto-covariance
              function.
- <<<LS>>> :: least squares.
- PDF :: probability density function.
- <<<CDF>>> :: cumulative distribution function.
- <<<BSP>>> :: Bulk Synchronous Parallel.
- OpenCL :: Open Computing Language, a parallel programming technology for
            hybrid systems with GPUs or other co-processors.
- <<<OpenGL>>> :: Open Graphics Library.
- OpenMP :: Open Multi-Processing, a parallel programming technology for
            multi-processor systems.
- <<<MPI>>> :: Message Passing Interface.
- <<<FMA>>> :: fused multiply-add.
- <<<DCMT>>> :: Dynamic Creation of Mersenne Twisters, an algorithm for creating
                pseudo-random number generators that produce uncorrelated
                sequences when run in parallel.
- <<<GSL>>> :: GNU Scientific Library.
- <<<BLAS>>> :: Basic Linear Algebra Sub-programmes.
- <<<LAPACK>>> :: Linear Algebra Package.
- <<<DNS>>> :: Domain Name System.
- <<<HPC>>> :: high-performance computing.
- GCS :: Gram---Charlier series.
- SN :: skew normal distribution.
- <<<PBS>>> :: Portable Batch System, a system for allocating and distributing
               cluster resources among parallel programmes.
- Transcendental functions :: non-algebraic mathematical functions
     (e.g.\nbsp{}logarithmic, trigonometric, exponential).

#+begin_export latex
\input{postamble}
#+end_export

bibliographystyle:ugost2008
bibliography:bib/refs.bib

* Appendix
** Longuet---Higgins model formula derivation
:PROPERTIES:
:CUSTOM_ID: longuet-higgins-derivation
:END:

In the framework of linear wave theory the two-dimensional system of
equations\nbsp{}eqref:eq-problem is written as
\begin{align*}
    & \phi_{xx} + \phi_{zz} = 0,\\
    & \zeta(x,t) = -\frac{1}{g} \phi_t, & \text{at }z=\zeta(x,t),
\end{align*}
where \(\frac{p}{\rho}\) includes \(\phi_t\). The solution to the Laplace
equation is sought in the form of a Fourier
series\nbsp{}cite:kochin1966theoretical:
\begin{equation*}
    \phi(x,z,t) = \int\limits_{0}^{\infty} e^{k z}
    \left[ A(k, t) \cos(k x) + B(k, t) \sin(k x) \right] dk.
\end{equation*}
Plugging it into the boundary condition yields
\begin{align*}
    \zeta(x,t) &= -\frac{1}{g} \int\limits_{0}^{\infty}
    \left[ A_t(k, t) \cos(k x) + B_t(k, t) \sin(k x) \right] dk \\
    &= -\frac{1}{g} \int\limits_{0}^{\infty} C_t(k, t) \cos(kx + \epsilon(k, t)).
\end{align*}
Here \(\epsilon\) is white noise and \(C_t\) includes \(dk\). Replacing the
integral with an infinite sum yields the two-dimensional form
of\nbsp{}eqref:eq-longuet-higgins.

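In practice the infinite sum is truncated, which can be sketched as a finite sum
of harmonics with amplitudes drawn from a spectrum and uniformly distributed
random phases. In the sketch below the spectrum shape, the number of harmonics
and the wave-number grid are schematic placeholders, not the values used in the
thesis; only the deep-water dispersion relation \(\omega=\sqrt{gk}\) is
standard.

#+begin_src python
import numpy as np

rng = np.random.default_rng(0)
N = 100                                 # number of harmonics (illustrative)
k = np.linspace(0.05, 5.0, N)           # wave-number grid (illustrative)
dk = k[1] - k[0]
g = 9.81
omega = np.sqrt(g * k)                  # deep-water dispersion relation
S = np.exp(-1.0 / k**2) / k**5          # schematic spectrum shape (placeholder)
c = np.sqrt(2.0 * S * dk)               # harmonic amplitudes from the spectrum
eps = rng.uniform(0.0, 2.0 * np.pi, N)  # random phases ("white noise" above)

def zeta(x, t):
    """Wavy surface elevation as a truncated Longuet---Higgins sum."""
    return np.sum(c * np.cos(k * x - omega * t + eps))
#+end_src

The elevation at any point is bounded by the sum of the amplitudes, and its
variance over a long record approximates the integral of the spectrum.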
** Derivative in the direction of the surface normal
:PROPERTIES:
:CUSTOM_ID: directional-derivative
:END:
The directional derivative of \(\phi\) in the direction of a vector \(\vec{n}\)
is given by \(\nabla_n\phi=\nabla\phi\cdot\frac{\vec{n}}{|\vec{n}|}\). The
normal vector \(\vec{n}\) to the surface \(z=\zeta(x,y)\) at the point
\((x_0,y_0)\) is given by
\begin{equation*}
  \vec{n} = \begin{bmatrix}\zeta_x(x_0,y_0)\\\zeta_y(x_0,y_0)\\-1\end{bmatrix}.
\end{equation*}
Hence, the derivative in the direction of the surface normal is given by
\begin{equation*}
\nabla_n \phi = \phi_x \frac{\zeta_x}{\sqrt{\zeta_x^2+\zeta_y^2+1}}
    + \phi_y \frac{\zeta_y}{\sqrt{\zeta_x^2+\zeta_y^2+1}}
    + \phi_z \frac{-1}{\sqrt{\zeta_x^2+\zeta_y^2+1}},
\end{equation*}
where the \(\zeta\) derivatives are evaluated at \((x_0,y_0)\).
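This formula admits a quick numerical sanity check: the value it gives must
match a central difference of \(\phi\) taken along the unit normal. The smooth
test functions \(\phi\) and \(\zeta\) below are arbitrary choices for the check,
not fields from the thesis.

#+begin_src python
import numpy as np

# Arbitrary smooth test fields (illustrative only).
phi = lambda x, y, z: np.sin(x) * np.cos(y) * np.exp(z)
zeta = lambda x, y: 0.1 * np.sin(x + y)

x0, y0 = 0.3, -0.2
z0 = zeta(x0, y0)

# Exact partial derivatives of phi and zeta at the point.
phi_x = np.cos(x0) * np.cos(y0) * np.exp(z0)
phi_y = -np.sin(x0) * np.sin(y0) * np.exp(z0)
phi_z = np.sin(x0) * np.cos(y0) * np.exp(z0)
zeta_x = zeta_y = 0.1 * np.cos(x0 + y0)

# Normal derivative from the formula in the text.
norm = np.sqrt(zeta_x**2 + zeta_y**2 + 1.0)
dn_formula = (phi_x * zeta_x + phi_y * zeta_y - phi_z) / norm

# The same derivative via a central difference along the unit normal.
n = np.array([zeta_x, zeta_y, -1.0]) / norm
h = 1e-5
dn_fd = (phi(x0 + h * n[0], y0 + h * n[1], z0 + h * n[2])
         - phi(x0 - h * n[0], y0 - h * n[1], z0 - h * n[2])) / (2 * h)

assert np.isclose(dn_formula, dn_fd)
#+end_src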