waves-16-arma

git clone https://git.igankevich.com/waves-16-arma.git
Log | Files | Refs

commit 7895d9831db370b26f5a5c2a93cc241cdfbe2d32
parent 04e79c6fff6cc50f761e2053b466eef3d9de71de
Author: Ivan Gankevich <igankevich@ya.ru>
Date:   Mon, 29 May 2017 18:40:31 +0300

Edit HPC section.

Diffstat:
arma.org | 123++++++++++++++++++++++++++++++++++++++++++++-----------------------------------
1 file changed, 68 insertions(+), 55 deletions(-)

diff --git a/arma.org b/arma.org @@ -1212,47 +1212,61 @@ arma.plot_velocity( ** White noise generation In order to eliminate periodicity from generated wavy surface, it is imperative to use PRNG with sufficiently large period to generate white noise. Parallel -Mersenne Twister\nbsp{}cite:matsumoto1998mersenne with a period of \(2^{19937}-1\) is -used as a generator in this work. It allows to produce aperiodic ocean wavy -surface realisations in any practical usage scenarios. +Mersenne Twister\nbsp{}cite:matsumoto1998mersenne with a period of +\(2^{19937}-1\) is used as a generator in this work. It allows to produce +aperiodic ocean wavy surface realisations in any practical usage scenarios. There is no guarantee that multiple Mersenne Twisters executed in parallel threads with distinct initial states produce uncorrelated pseudo-random number -sequences, however, algorithm of dynamic creation of Mersenne Twisters\nbsp{}cite:matsumoto1998dynamic may be used to provide such guarantee. The essence of -the algorithm is to find matrices of initial generator states, that give -maximally uncorrelated pseudo-random number sequences when Mersenne Twisters are -executed in parallel with these initial states. Since finding such initial -states consumes considerable amount of processor time, vector of initial states -is created preliminary with knowingly larger number of parallel threads and -saved to a file, which is then read before starting white noise generation. +sequences, however, algorithm of dynamic creation of Mersenne +Twisters\nbsp{}cite:matsumoto1998dynamic may be used to provide such a +guarantee. The essence of the algorithm is to find matrices of initial generator +states, that give maximally uncorrelated pseudo-random number sequences when +Mersenne Twisters are executed in parallel with these initial states. Since +finding such initial states consumes considerable amount of processor time, +vector of initial states is created preliminary with knowingly larger number of +parallel threads and saved to a file, which is then read before starting white +noise generation. ** Wavy surface generation In ARMA model value of wavy surface elevation at a particular point depends on previous in space and time points, as a result the so called /ramp-up interval/ (see fig.\nbsp{}[[fig-ramp-up-interval]]), in which realisation does not correspond to -specified ACF, forms in the beginning of the realisation. There are several +specified ACF, forms at the beginning of the realisation. There are several solutions to this problem which depend on the simulation context. -If realisation is used in the context of ship stability simulation without +If the realisation is used in the context of ship stability simulation without manoeuvring, ramp-up interval will not affect results of the simulation, because it is located on the border (too far away from the studied marine object). If ship stability with manoeuvring is studied, then the interval may be simply discarded from the realisation (the size of the interval approximately equals -the number of AR coefficients in each dimension). However, this may lead to loss -of a very large number of points, because discarding occurs for each dimension. -Alternative approach is to generate ocean wavy surface on ramp-up interval with -LH model and generate the rest of the realisation with ARMA model. - -Algorithm of wavy surface generation is data-parallel: realisation is divided -into equal parts each of which is generated independently, however, in the -beginning of each realisation there is ramp-up interval. To eliminate it -/overlap-add/ method\nbsp{}cite:oppenheim1989discrete,svoboda2011efficient,pavel2013algorithms (a popular -method in signal processing) is used. The essence of the method is to add -another interval, size of which is equal to the ramp-up interval size, to the -end of each part. Then wavy surface is generated in each point of each part -(including points from the added interval), the interval at the end of part \(N\) -is superimposed on the ramp-up interval at the beginning of the part \(N+1\), and -values in corresponding points are added. +the number of AR coefficients in each dimension). However, this may lead to a +loss of a very large number of points, because discarding is done for each +dimension. Alternative approach is to generate ocean wavy surface on ramp-up +interval with LH model and generate the rest of the realisation with ARMA model. + +Algorithm of wavy surface generation is data-parallel: the realisation is +divided into equal parts along the time axis each of which is generated +independently, however, in the beginning of each realisation there is ramp-up +interval. To eliminate it for MA process, /overlap-add/ +method\nbsp{}cite:oppenheim1989discrete,svoboda2011efficient,pavel2013algorithms +(a popular method in signal processing) is used. The essence of the method is to +add another interval, size of which is equal to the ramp-up interval size, to +the end of each part. Then wavy surface is generated in each point of each part +(including points from the added interval), the interval at the end of part +\(N\) is superimposed on the ramp-up interval at the beginning of the part +\(N+1\), and values in corresponding points are added. To eliminate the ramp-up +interval for AR process, the realisation is divided into part along each +dimension, and each part is computed only when all dependent parts are ready. +For that purpose, an array of current part states is maintained in the +programme, and all the parts are put into a queue. A parallel thread acquires a +shared lock, finds the first part in the queue, for which all dependent parts +have "completed" state, removes this part for the queue, frees the lock and +generates the part. After that a thread updates the state of the part, and +repeats the same steps until the queue becomes empty. This algorithm eliminates +all ramp-up intervals except the one at the beginning of the realisation, and +the size of the parts should be sufficiently small to balance the load on all +processor cores. #+name: fig-ramp-up-interval #+begin_src R :file build/ramp-up-interval.pdf @@ -1282,12 +1296,12 @@ functions is to write them as multiplication of Dirac delta functions of real and imaginary part, however, this approach does not work here, because applying inverse Fourier transform to this representation does not produce exponent, which severely warp resulting velocity field. In order to get unique analytic -definition normalisation factor \(1/\Sinh{2\pi{u}{h}}\) (which is also included -in formula for \(E(u)\)) may be used. Despite the fact that normalisation allows -to obtain adequate velocity potential field, numerical experiments show that -there is little difference between this field and the one produced by formulae -from linear wave theory, in which terms with \(\zeta\) are omitted. As a result, -the formula for three-dimensional case was not derived. +definition, normalisation factor \(1/\Sinh{2\pi{u}{h}}\) (which is also included +in the formula for \(E(u)\)) may be used. Despite the fact that normalisation +allows to obtain adequate velocity potential field, numerical experiments show +that there is little difference between this field and the one produced by +formulae, in which terms with \(\zeta\) are omitted. As a result, we do not use +normalisation factors in the formula. #+name: tab-delta-functions #+caption: Formulae for computing \(\Fun{z}\) and \(\FunSecond{z}\) from [[#sec:pressure-2d]], that use normalisation to eliminate uncertainty from definition of Dirac delta function of complex argument. @@ -1302,22 +1316,22 @@ ARMA model does not require highly optimised software implementation to be efficient, its performance is high even without use of co-processors; there are two main causes of that. First, ARMA model itself does not use transcendental functions (sines, cosines and exponents) as opposed to LH model. All -calculations (except model coefficients) are done via polynomials, which can be -efficiently computed on modern processors using a series of FMA instructions. -Second, pressure computation is done via explicit analytic formula using nested -FFTs. Since two-dimensional FFT of the same size is repeatedly applied to every -time slice, its coefficients (complex exponents) are pre-computed for all -slices, and computations are performed with only a few transcendental functions. -In case of MA model, performance is also increased by doing convolution with FFT -transforms. So, high performance of ARMA model is due to scarce use of -transcendental functions and heavy use of FFT, not to mention that high -convergence rate and non-existence of periodicity allows to use far fewer -coefficients compared to LH model. +calculations, except model coefficients, are done via polynomials, which can be +efficiently computed on modern processors using a series of fused multiply-add +(FMA) instructions. Second, pressure computation is done via explicit analytic +formula using nested FFTs. Since two-dimensional FFT of the same size is +repeatedly applied to every time slice, its coefficients (complex exponents) are +pre-computed for all slices, and computations are performed with only a few +transcendental functions. In case of MA model, performance is also increased by +doing convolution with FFT transforms. So, high performance of ARMA model is due +to scarce use of transcendental functions and heavy use of FFT, not to mention +that high convergence rate and non-existence of periodicity allows to use far +fewer coefficients compared to LH model. ARMA implementation uses several libraries of reusable mathematical functions and numerical algorithms (listed in table\nbsp{}[[tab-arma-libs]]), and was -implemented using several parallel programming technologies (OpenMP, -OpenCL) to find the most efficient one. +implemented using OpenMP and OpenCL parallel programming technologies, that +allow to use the most efficient implementation for a particular algorithm. #+name: tab-arma-libs #+caption: A list of mathematical libraries used in ARMA model implementation. @@ -1358,13 +1372,12 @@ time instant was measured. Velocity field was computed for one $t$ point, for 128 $z$ points below wavy surface and for each $x$ and $y$ point of four-dimensional $(t,x,y,z)$ grid. The only parameter that was varied between subsequent programme runs is the size of the grid along $x$ dimension. A total -of 10 runs were performed and average time of each stage was computed. +of 10 runs were performed and an average time of each stage was computed. A different FFT library was used for each version of the solver. For OpenMP version FFT routines from GNU Scientific Library (GSL)\nbsp{}cite:galassi2015gnu were used, and for OpenCL version clFFT library\nbsp{}cite:clfft was used instead. There are two major differences in the routines from these libraries. - - The order of frequencies in Fourier transforms is different and clFFT library requires reordering the result of\nbsp{}eqref:eq-phi-linear whereas GSL does not. - Discontinuity at $(x,y) = (0,0)$ of velocity potential field grid is handled @@ -1376,7 +1389,7 @@ smooth velocity potential field at these points. We have not spotted other differences in FFT implementations that have impact on the overall performance. In the course of the numerical experiments we have measured how much time each -solver's implementation spends in each computation stage to explain find out how +solver's implementation spends in each computation stage to explain how efficient data copying between host and device is in OpenCL implementation, and how one implementation corresponds to the other in terms of performance. @@ -1385,15 +1398,15 @@ The experiments showed that GPU implementation outperforms CPU implementation by a factor of 10--15 (fig.\nbsp{}[[fig-bench-cpu-gpu]]), however, distribution of time between computation stages is different for each implementation (fig.\nbsp{}[[fig-breakdown-cpu-gpu]]). The major time consumer in CPU -implementation is computation of $g_1$, whereas in GPU implementation its -running time is comparable to computation of $g_2$. GPU computes $g_1$ much +implementation is computation of \(g_1\), whereas in GPU implementation its +running time is comparable to computation of \(g_2\). GPU computes \(g_1\) much faster than CPU due to a large amount of modules for transcendental mathematical -function computation. In both implementations $g_2$ is computed on CPU, but for -GPU implementation the result is duplicated for each $z$ grid point in order to -perform multiplication of all $XYZ$ planes along $z$ dimension in single OpenCL +function computation. In both implementations \(g_2\) is computed on CPU, but for +GPU implementation the result is duplicated for each \(z\) grid point in order to +perform multiplication of all \(XYZ\) planes along \(z\) dimension in single OpenCL kernel, and, subsequently copied to GPU memory which severely hinders overall stage performance. Copying the resulting velocity potential field between CPU -and GPU consumes $\approx{}20\%$ of velocity potential solver execution time. +and GPU consumes \(\approx{}20\%\) of velocity potential solver execution time. #+name: fig-bench-cpu-gpu #+caption: Performance comparison of CPU (OpenMP) and GPU (OpenCL) versions of velocity potential solver.