arma-thesis

git clone https://git.igankevich.com/arma-thesis.git

commit a54f5cd440478e56007034440fc0fd2c1704eae8
parent 16ec5798752395cb936657ef3ca0b1bb973b13b3
Author: Ivan Gankevich <igankevich@ya.ru>
Date:   Wed, 16 Aug 2017 18:53:44 +0300

Move computational model to MPP section. Rename labels.

Diffstat:
arma-thesis.org | 1805++++++++++++++++++++++++++++++++++++++++---------------------------------------
1 file changed, 905 insertions(+), 900 deletions(-)

diff --git a/arma-thesis.org b/arma-thesis.org
@@ -2436,7 +2436,7 @@ legend(
 #+end_src
 
 #+caption: Probability density function eqref:eq-skew-normal-1 of sea wavy surface elevation for different values of skewness \(\gamma_1\) and kurtosis \(\gamma_2\).
-#+label: fig-skew-normal-1
+#+name: fig-skew-normal-1
 #+RESULTS: fig-skew-normal-1
 [[file:build/skew-normal-1.pdf]]
@@ -2488,7 +2488,7 @@ legend(
 #+end_src
 
 #+caption: Probability density function eqref:eq-skew-normal-2 of sea wavy surface for different values of skewness coefficient \(\alpha\).
-#+label: fig-skew-normal-2
+#+name: fig-skew-normal-2
 #+RESULTS: fig-skew-normal-2
 [[file:build/skew-normal-2.pdf]]
@@ -2557,7 +2557,7 @@ arma.plot_ramp_up_interval()
 #+end_src
 
 #+caption: Ramp-up interval at the beginning of the \(OX\) axis of the realisation.
-#+label: fig-ramp-up-interval
+#+name: fig-ramp-up-interval
 #+RESULTS: fig-ramp-up-interval
 [[file:build/ramp-up-interval.pdf]]
@@ -2652,7 +2652,7 @@ arma.qqplot_grid(
 #+end_src
 
 #+caption: Quantile-quantile plots for propagating waves.
-#+label: propagating-wave-distributions
+#+name: propagating-wave-distributions
 #+RESULTS: propagating-wave-distributions
 [[file:build/propagating-wave-qqplots.pdf]]
@@ -2670,7 +2670,7 @@ arma.qqplot_grid(
 #+end_src
 
 #+caption: Quantile-quantile plots for standing waves.
-#+label: standing-wave-distributions
+#+name: standing-wave-distributions
 #+RESULTS: standing-wave-distributions
 [[file:build/standing-wave-qqplots.pdf]]
@@ -2688,7 +2688,7 @@ for (i in seq(0, 4)) {
 #+end_src
 
 #+caption: Time slices of ACF for standing (left column) and propagating waves (right column).
-#+label: acf-slices
+#+name: acf-slices
 #+RESULTS: acf-slices
 [[file:build/acf-slices.pdf]]
@@ -2710,6 +2710,7 @@ periods of standing waves are extracted more precisely as the waves do not move
 outside simulated wavy surface region. The same correspondence degree for wave
 elevation is obtained, because this is the characteristic of the wavy surface
 (and corresponding AR or MA process) and is not affected by the type of waves.
+
 *** Verification of velocity potential fields
 :PROPERTIES:
 :CUSTOM_ID: sec:compare-formulae
@@ -2783,7 +2784,7 @@ arma.plot_velocity_potential_field_legend(
 #+end_src
 
 #+caption: Velocity potential field of propagating wave \(\zeta(x,y,t) = \cos(2\pi x - t/2)\). Field produced by formula eqref:eq-solution-2d-full (top) and linear wave theory formula (bottom).
-#+label: fig-potential-field-nonlinear
+#+name: fig-potential-field-nonlinear
 #+attr_latex: :width \textwidth
 #+RESULTS: fig-potential-field-nonlinear
 [[file:build/plain-wave-velocity-field-comparison.pdf]]
@@ -2823,7 +2824,7 @@ arma.plot_velocity(
 )
 #+end_src
 
-#+label: fig-velocity-field-2d
+#+name: fig-velocity-field-2d
 #+caption: Comparison of velocity field on the sea wavy surface obtained by generic formula (\(u_1\)) and formula for small-amplitude waves (\(u_2\)). Velocity field for realisations containing small-amplitude (top) and large-amplitude (bottom) waves.
 #+attr_latex: :width \textwidth
 #+RESULTS: fig-velocity-field-2d
@@ -2866,7 +2867,7 @@ args$title <- 'Standing waves'
 arma.plot_nonlinear(file.path("build", "nit-standing"), args)
 #+end_src
 
-#+label: fig-nit
+#+name: fig-nit
 #+caption: Wavy surface slices with different distributions of wave elevation (Gaussian, Gram---Charlier series based and skew normal).
 #+RESULTS: fig-nit
 [[file:build/nit.pdf]]
@@ -2933,1002 +2934,1002 @@ to generate waves
 of arbitrary profiles, and is one of the directions of future work.
* High-performance software implementation of sea wave simulation
** SMP implementation
**** Parallel AR, MA and LH model algorithms.
Although AR and MA models are parts of the mixed ARMA model, they have
completely disparate parallel algorithms, both of which differ from the trivial
one of LH model. AR algorithm is based on partitioning the wavy surface into
equal three-dimensional parts along each dimension and computing them in
parallel, taking into account causality constraints imposed by autoregressive
dependencies between surface points. There are no autoregressive dependencies
between points in MA model; its formula is a convolution of white noise with
model coefficients, which is reduced to the computation of three Fourier
transforms via the convolution theorem, which in turn have parallel
implementations; so, MA algorithm is based on parallel FFT. Finally, LH
algorithm is made parallel by simply computing each wavy surface point in
parallel. So, parallel implementation of ARMA model consists of two parallel
algorithms, one for each part of the model, which are more sophisticated than
the one for LH model.
The main feature of AR model's formula is autoregressive dependencies between
wavy surface points along each dimension, which prevent computing each surface
point in parallel. Instead, the surface is partitioned along each dimension
into equal three-dimensional parts, and for each part information dependencies,
which define the execution order, are established.
Figure\nbsp{}[[fig-ar-cubes]] shows these dependencies: an arrow denotes the
dependency of one part on the availability of another. Here part \(A\) does not
have dependencies, parts \(B\) and \(D\) depend on \(A\), and part \(E\)
depends on \(A\), \(B\) and \(D\). In essence, each part depends only on the
parts that have the previous index along each dimension (if such parts exist).
The first part does not have any dependencies, and the size of each part along
each dimension is made greater or equal to the corresponding number of
coefficients along that dimension, so that only adjacent parts need to be
considered.

#+name: fig-ar-cubes
#+header: :width 5 :height 5
#+begin_src R :file build/ar-cubes.pdf
source(file.path("R", "common.R"))
arma.plot_ar_cubes_2d(3, 3, xlabel="Part index (X)", ylabel="Part index (Y)")
#+end_src

#+RESULTS: fig-ar-cubes
[[file:build/ar-cubes.pdf]]

Each part has a three-dimensional index and a completion status. The algorithm
starts by submitting objects containing this information into a queue. After
that, parallel threads start; each thread finds the first part for which all
dependencies are satisfied (by checking the completion status of each part),
removes the part from the queue, generates the wavy surface for this part and
sets its completion status. The algorithm ends when the queue becomes empty.
Access to the queue is synchronised by locks. The algorithm is suitable for SMP
machines; for MPP, copying of dependent parts needs to be done prior to the
computation of each part.

So, the AR model algorithm is made parallel by implementing a minimalistic job
scheduler (sketched below), in which
- each job corresponds to a wavy surface part,
- the order of execution of jobs is defined by autoregressive dependencies,
- and jobs are executed by a simple thread pool in which each thread removes
  the first job for which all dependent jobs have completed and executes it.
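The following is a minimal sketch of such a scheduler built from standard C++
threading primitives. It is an illustration rather than the programme's actual
code: the number of parts, the dependency rule and the ~generate_part~ routine
are placeholder assumptions.

#+begin_src cpp
#include <algorithm>
#include <array>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <set>
#include <thread>
#include <vector>

struct Part { std::array<int,3> idx; };

// Placeholder for the routine that generates wavy surface points of one part.
void generate_part(const Part&) {}

class Scheduler {
public:
    explicit Scheduler(int n) {
        for (int i=0; i<n; ++i)
            for (int j=0; j<n; ++j)
                for (int k=0; k<n; ++k)
                    _queue.push_back(Part{{i,j,k}});
    }
    // Each thread runs this loop until the queue is drained.
    void run() {
        for (;;) {
            Part part{};
            {
                std::unique_lock<std::mutex> lock(_mutex);
                _cv.wait(lock, [this] {
                    return _queue.empty() || ready_part() != _queue.end();
                });
                if (_queue.empty()) break;
                auto result = ready_part();
                part = *result;
                _queue.erase(result);
            }
            generate_part(part);              // compute outside the lock
            {
                std::lock_guard<std::mutex> lock(_mutex);
                _completed.insert(part.idx);  // set completion status
            }
            _cv.notify_all();
        }
        _cv.notify_all();  // wake the remaining threads so that they exit
    }
private:
    // A part is ready when the parts with the previous index along each
    // dimension are completed; diagonal dependencies (e.g. E on A) are then
    // satisfied transitively.
    bool ready(const Part& p) const {
        for (int d=0; d<3; ++d)
            if (p.idx[d] > 0) {
                auto dep = p.idx;
                --dep[d];
                if (_completed.find(dep) == _completed.end()) return false;
            }
        return true;
    }
    std::deque<Part>::iterator ready_part() {
        return std::find_if(_queue.begin(), _queue.end(),
                            [this] (const Part& p) { return ready(p); });
    }
    std::deque<Part> _queue;
    std::set<std::array<int,3>> _completed;
    std::mutex _mutex;
    std::condition_variable _cv;
};

int main() {
    Scheduler scheduler(3);
    std::vector<std::thread> threads;
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    for (unsigned i=0; i<n; ++i)
        threads.emplace_back([&scheduler] { scheduler.run(); });
    for (auto& t : threads) t.join();
}
#+end_src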
In contrast to AR model, MA model does not have autoregressive dependencies
between points; instead, each surface point depends on previous in time and
space white noise values. MA model's formula allows rewriting it as a
convolution of white noise with the coefficients as a kernel. Using the
convolution theorem, the convolution is rewritten as the inverse Fourier
transform of the product of Fourier transforms of white noise and coefficients.
Since the number of MA coefficients is much smaller than the number of wavy
surface points, a parallel FFT implementation is not suitable here, as it
requires padding the coefficients with noughts to match the size of the
surface. Instead, the surface is divided into parts along each dimension, which
are padded with noughts to match twice the number of coefficients along each
dimension. Then the Fourier transform of each part is computed in parallel,
multiplied by the previously computed Fourier transform of the coefficients,
and the inverse Fourier transform of the result is computed. After that, each
part is written to the output array with overlapping points (due to padding)
added to each other. This algorithm is commonly known in signal processing as
"overlap-add"\nbsp{}cite:svoboda2011efficient (a one-dimensional sketch is
given below). Padding with noughts is needed to prevent aliasing errors:
without it the result would be a circular convolution.
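To make the scheme concrete, here is a one-dimensional toy version of the
overlap-add convolution under stated assumptions: a naive recursive
power-of-two FFT stands in for the GSL/clFFT routines used by the real
implementation, the block size is arbitrary, and the three-dimensional case is
analogous. In the parallel algorithm each block is transformed by its own
thread.

#+begin_src cpp
#include <algorithm>
#include <cmath>
#include <complex>
#include <vector>

// Naive recursive radix-2 FFT (size must be a power of two).
void fft(std::vector<std::complex<double>>& a, bool inverse) {
    const std::size_t n = a.size();
    if (n <= 1) return;
    std::vector<std::complex<double>> even(n/2), odd(n/2);
    for (std::size_t i=0; i<n/2; ++i) { even[i] = a[2*i]; odd[i] = a[2*i+1]; }
    fft(even, inverse);
    fft(odd, inverse);
    const double pi = std::acos(-1.0);
    const double sign = inverse ? 1.0 : -1.0;
    for (std::size_t k=0; k<n/2; ++k) {
        const auto t = std::polar(1.0, sign*2.0*pi*double(k)/double(n)) * odd[k];
        a[k] = even[k] + t;
        a[k + n/2] = even[k] - t;
    }
}

// Overlap-add: pad each block and the kernel with noughts, multiply their
// transforms, and add the overlapping tails of the inverse transforms.
std::vector<double> overlap_add(const std::vector<double>& noise,
                                const std::vector<double>& kernel) {
    const std::size_t block = 256;  // block size (an arbitrary choice)
    std::size_t n = 1;
    while (n < block + kernel.size() - 1) n <<= 1;
    std::vector<std::complex<double>> kernel_fft(n);
    std::copy(kernel.begin(), kernel.end(), kernel_fft.begin());
    fft(kernel_fft, false);  // pre-computed once for all blocks
    std::vector<double> out(noise.size() + kernel.size() - 1);
    for (std::size_t offset=0; offset<noise.size(); offset += block) {
        const std::size_t len = std::min(block, noise.size() - offset);
        std::vector<std::complex<double>> b(n);
        std::copy(noise.begin()+offset, noise.begin()+offset+len, b.begin());
        fft(b, false);
        for (std::size_t i=0; i<n; ++i) b[i] *= kernel_fft[i];
        fft(b, true);  // unscaled inverse transform
        for (std::size_t i=0; i<n && offset+i<out.size(); ++i)
            out[offset+i] += b[i].real() / double(n);  // scale and crosslink
    }
    return out;
}
#+end_src

Without the padding to a size of at least block length plus kernel length
minus one, the multiplication of transforms would compute a circular
convolution, which is exactly the aliasing error mentioned above.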
Despite the fact that MA model algorithm partitions the surface into the same
kind of parts (though possibly of different sizes), the absence of
autoregressive dependencies between them allows computing them in parallel
without a specialised job scheduler. However, it requires padding the parts
with noughts to make the result correspond to the original MA model's formula.
So, MA model's algorithm is more scalable to a large number of nodes as it has
fewer dependencies between parts computed in parallel, but the size of the
parts is greater than in AR model, so they are slower to compute.

The distinct feature of LH model's algorithm is its simplicity: to make it
parallel, the surface is partitioned into parts of equal sizes and each part is
computed in parallel. There are no dependencies between parts, which makes this
algorithm particularly suitable for computation on GPU: each hardware thread
simply computes its own point. In addition, the sine and cosine functions in
the model's formula, which are slow to compute on CPU, make GPU an even more
favourable choice.

To summarise, even though AR and MA models are parts of the mixed ARMA model,
their parallel algorithms are fundamentally different and are more complicated
than the trivial parallel algorithm of LH model. An efficient AR algorithm
requires a specialised job scheduler to manage autoregressive dependencies
between wavy surface parts, whereas MA algorithm requires padding the parts
with noughts to be able to compute them in parallel. In contrast to these
models, LH model has no dependencies between parts computed in parallel, but
requires more computational power (floating point operations per second).

**** Performance of OpenMP and OpenCL implementations.
:PROPERTIES:
:header-args:R: :results output raw :exports results
:END:

Differences in the models' parallel algorithms make them efficient on different
processor architectures, and to find the most efficient one all the models were
benchmarked on both CPU and GPU.

ARMA model does not require a highly optimised software implementation to be
efficient: its performance is high even without the use of co-processors; there
are two main reasons for that. First, ARMA model itself does not use
transcendental functions (sines, cosines and exponents) as opposed to LH model.
All calculations (except model coefficients) are done via polynomials, which
can be efficiently computed on modern processors using a series of FMA
instructions. Second, pressure computation is done via an explicit analytic
formula using nested FFTs. Since a two-dimensional FFT of the same size is
repeatedly applied to every time slice, its coefficients (complex exponents)
are pre-computed for all slices, and computations are performed with only a few
transcendental functions. In the case of MA model, performance is also
increased by computing the convolution via FFT. So, the high performance of
ARMA model is due to scarce use of transcendental functions and heavy use of
FFT, not to mention that the high convergence rate and the absence of
periodicity allow using far fewer coefficients compared to LH model.

#+name: tab-gpulab
#+caption: "Gpulab" test platform configuration.
#+attr_latex: :booktabs t
| CPU                     | AMD FX-8370                   |
| RAM                     | 16GB                          |
| GPU                     | GeForce GTX 1060              |
| GPU memory              | 6GB                           |
| HDD                     | WDC WD40EZRZ-00WN9B0, 5400rpm |
| No. of CPU cores        | 4                             |
| No. of threads per core | 2                             |

ARMA implementation uses several libraries of mathematical functions, numerical
algorithms and visualisation primitives (listed in
table\nbsp{}[[tab-arma-libs]]), and was implemented using several parallel
programming technologies (OpenMP, OpenCL) to find the most efficient one. For
each technology and each model an optimised wavy surface generation was
implemented (except for MA model, for which only an OpenMP implementation was
done). Velocity potential computation was done in OpenMP and was implemented in
OpenCL only for real-time visualisation of the surface. For each technology the
programme was recompiled and run multiple times, and the performance of each
top-level subroutine was measured using the system clock. Results of the
benchmarks are summarised in table\nbsp{}[[tab-arma-performance]]. All
benchmarks were run on a machine equipped with a GPU, characteristics of which
are summarised in table\nbsp{}[[tab-gpulab]]. All benchmarks were run with the
same input parameters for all the models: realisation length 10000s and output
grid size \(40\times40\)m. The only parameter that differed was the order (the
number of coefficients): the order of AR and MA models was \(7,7,7\) and the
order of LH model was \(40,40\). This is due to the higher number of
coefficients needed by LH model to eliminate periodicity.
In all benchmarks wavy surface generation and NIT take up most of the running
time, whereas velocity potential calculation together with other subroutines
takes only a small fraction of it.

#+name: tab-arma-libs
#+caption: A list of mathematical libraries used in ARMA model implementation.
#+attr_latex: :booktabs t
| Library                                                      | What it is used for             |
|--------------------------------------------------------------+---------------------------------|
| DCMT\nbsp{}cite:matsumoto1998dynamic                         | parallel PRNG                   |
| Blitz\nbsp{}cite:veldhuizen1997will,veldhuizen2000techniques | multidimensional arrays         |
| GSL\nbsp{}cite:gsl2008scientific                             | PDF, CDF, FFT computation       |
|                                                              | checking process stationarity   |
| clFFT\nbsp{}cite:clfft                                       | FFT computation                 |
| LAPACK, GotoBLAS\nbsp{}cite:goto2008high,goto2008anatomy     | finding AR coefficients         |
| GL, GLUT\nbsp{}cite:kilgard1996opengl                        | three-dimensional visualisation |
| CGAL\nbsp{}cite:fabri2009cgal                                | wave numbers interpolation      |

AR model exhibits the best performance in OpenMP and the worst performance in
OpenCL implementations, which are also the best and the worst performance
across all model and framework combinations. In the best model and framework
combination AR performance is 4.5 times higher than MA performance, and 20
times higher than LH performance; in the worst combination\nbsp{}--- 137 times
slower than MA and 2 times slower than LH. The ratio between the best (OpenMP)
and the worst (OpenCL) AR model performance is several hundred. This is
explained by the fact that the model formula\nbsp{}eqref:eq-ar-process maps
efficiently onto the CPU architecture with its caches, low-bandwidth memory and
small number of floating point units:
- it contains no transcendental mathematical functions (sines, cosines and
  exponents),
- all of the multiplications and additions in the formula can be implemented
  using FMA processor instructions (a sketch follows below),
- and cache locality is achieved by using Blitz library which implements
  optimised traversals for multidimensional arrays based on the Hilbert
  space-filling curve.
In contrast to CPU, GPU has fewer caches, high-bandwidth memory and a large
number of floating point units, which is the worst case scenario for AR model:
- there are no transcendental functions whose computation would compensate for
  the high memory latency,
- there are FMA instructions, but they do not improve performance due to the
  high latency,
- and an optimal traversal was not used due to a lack of libraries implementing
  it for a GPU.
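For illustration, the inner loop of the AR recurrence can be written as
follows. The flat array indexing and parameter names are assumptions (the
real implementation uses Blitz arrays), but the chain of dependent
multiply-adds that compilers turn into FMA instructions is the point.

#+begin_src cpp
#include <cmath>
#include <vector>

// One AR point: a weighted sum of previously computed surface points plus
// white noise. Assumes interior indices (t >= p1, x >= p2, y >= p3);
// boundary handling is omitted. Each step maps onto an FMA instruction.
double ar_point(const std::vector<double>& phi,   // coefficients, p1*p2*p3
                const std::vector<double>& zeta,  // surface, nt*nx*ny
                double eps,                       // white noise value
                int t, int x, int y,
                int p1, int p2, int p3, int nx, int ny) {
    double sum = eps;
    for (int i=0; i<p1; ++i)
        for (int j=0; j<p2; ++j)
            for (int k=0; k<p3; ++k) {
                if (i == 0 && j == 0 && k == 0) continue; // skip the point itself
                sum = std::fma(phi[(i*p2 + j)*p3 + k],
                               zeta[((t-i)*nx + (x-j))*ny + (y-k)],
                               sum);
            }
    return sum;
}
#+end_src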
Finally, GPU does not have synchronisation primitives that would allow
implementing autoregressive dependencies between distinct wavy surface parts in
order to compute them in parallel; instead, the implementation launches a
separate OpenCL kernel for each part, controlling all the dependencies between
them on CPU. So, for AR model the CPU architecture is superior compared to GPU
due to better handling of complex information dependencies, non-intensive
calculations (multiplications and additions) and complex memory access
patterns.

#+header: :results output raw :exports results
#+name: tab-arma-performance
#+begin_src R
source(file.path("R", "benchmarks.R"))
model_names <- list(
  ar.x="AR",
  ma.x="MA",
  lh.x="LH",
  ar.y="AR",
  ma.y="MA",
  lh.y="LH",
  Row.names="\\orgcmidrule{2-4}{5-6}Subroutine"
)
row_names <- list(
  determine_coefficients="Determine coefficients",
  validate="Validate model",
  generate_surface="Generate wavy surface",
  nit="NIT",
  write_all="Write output to files",
  copy_to_host="Copy data from GPU",
  velocity="Compute velocity potentials"
)
arma.print_openmp_vs_opencl(model_names, row_names)
#+end_src

#+name: tab-arma-performance
#+caption: Running time (s.) for OpenMP and OpenCL implementations of AR, MA and LH models.
#+attr_latex: :booktabs t
#+RESULTS: tab-arma-performance
|                                  |      |      | OpenMP |        | OpenCL |
| \orgcmidrule{2-4}{5-6}Subroutine |   AR |   MA |     LH |     AR |     LH |
|----------------------------------+------+------+--------+--------+--------|
| Determine coefficients           | 0.02 | 0.01 |   0.19 |   0.01 |   1.19 |
| Validate model                   | 0.08 | 0.10 |        |   0.08 |        |
| Generate wavy surface            | 1.26 | 5.57 | 350.98 | 769.38 |   0.02 |
| NIT                              | 7.11 | 7.43 |        |   0.02 |        |
| Copy data from GPU               |      |      |        |   5.22 |  25.06 |
| Compute velocity potentials      | 0.05 | 0.05 |   0.06 |   0.03 |   0.03 |
| Write output to files            | 0.27 | 0.27 |   0.27 |   0.28 |   0.27 |

In contrast to AR model, LH model exhibits the best performance on GPU and the
worst performance on CPU. The reasons for that are:
- the large number of transcendental functions in its formula, which help
  offset the high memory latency,
- the linear memory access pattern, which helps vectorise calculations and
  coalesce memory accesses by different hardware threads,
- and no information dependencies between output grid points.
Despite the fact that GPU on the test platform is more performant than CPU (in
terms of floating point operations per second), the overall performance of LH
model compared to AR model is lower. The reason for that is slow data transfer
between GPU and CPU memory.

MA model, in turn, is faster than LH and slower than AR. As the convolution in
its formula is implemented using FFT, its performance depends on the
performance of the underlying FFT implementation: GSL for CPU and clFFT for
GPU. In this work the performance of MA model on GPU was not tested due to the
unavailability of the three-dimensional transform in clFFT library; if the
transform were available, it could have made the model even faster than AR.
NIT takes less time on GPU and more time on CPU, but taking data transfer
between CPU and GPU into consideration makes their execution time comparable.
This is explained by the large number of transcendental mathematical functions
that need to be computed for each wavy surface point to transform the
distribution of its \(z\)-coordinates. For each point a non-linear
transcendental equation\nbsp{}eqref:eq-distribution-transformation is solved
using the bisection method (sketched below). GPU performs this task several
hundred times faster than CPU, but spends a lot of time transferring the result
back to the main memory. So, the only possibility to optimise this routine is
to use a root-finding method with a quadratic convergence rate to reduce the
number of transcendental functions that need to be computed.
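A sketch of the per-point bisection step is shown below; ~F~ stands for the
distribution function of the transformed equation, and the bracket and the
tolerance are assumptions.

#+begin_src cpp
#include <functional>

// Solve F(x) = y for x on [a,b] by bisection, assuming F is monotonically
// increasing. In NIT such an equation is solved for every surface point,
// and every iteration evaluates the transcendental-heavy F once.
double bisect(const std::function<double(double)>& F, double y,
              double a, double b, double eps = 1e-10) {
    while (b - a > eps) {
        const double c = 0.5*(a + b);
        if (F(c) < y) a = c; else b = c;
    }
    return 0.5*(a + b);
}
#+end_src

Bisection halves the bracket once per evaluation of \(F\) (linear
convergence); a method with a quadratic convergence rate, such as Newton's
method, would need far fewer evaluations, which is the optimisation suggested
above.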
**** I/O performance.
:PROPERTIES:
:header-args:R: :results output raw :exports results
:END:

Although in the current benchmarks writing data to files does not consume much
of the running time, the use of network-mounted file systems may slow down this
stage. To optimise it, wavy surface parts were written to file as soon as a
full time slice was available: all completed parts were grouped by the time
slices they belong to and subsequently written to file as soon as the whole
time slice was finished (fig.\nbsp{}[[fig-arma-io-events]]). That way a
separate thread starts writing to files as soon as the first time slice is
available and finishes writing after the main thread group finishes the
computation. The total time needed to perform I/O is slightly increased, but
the I/O is done in parallel to computation, so the total running time is
decreased (table\nbsp{}[[tab-arma-io-performance]]). Using this approach with a
local file system has the same effect, but the total reduction in execution
time is small, because the local file system is faster.

#+header: :results output raw :exports results
#+name: tab-arma-io-performance
#+begin_src R
source(file.path("R", "benchmarks.R"))
suffix_names <- list(
  xfs.x="XFS",
  xfs.y="XFS",
  nfs.x="NFS",
  nfs.y="NFS",
  gfs.x="GlusterFS",
  gfs.y="GlusterFS",
  Row.names="\\orgcmidrule{2-4}{5-7}Subroutine"
)
top_names <- c("Sequential", "Parallel")
row_names <- list(
  generate_surface="Generate wavy surface",
  write_all="Write output to files"
)
arma.print_sync_vs_async_io(suffix_names, row_names, top_names)
#+end_src

#+name: tab-arma-io-performance
#+caption: Running time (s.) for XFS, NFS and GlusterFS with sequential and parallel I/O programme versions.
#+attr_latex: :booktabs t
#+RESULTS: tab-arma-io-performance
|                                  |      |      | Sequential |      |      |  Parallel |
| \orgcmidrule{2-4}{5-7}Subroutine |  XFS |  NFS |  GlusterFS |  XFS |  NFS | GlusterFS |
|----------------------------------+------+------+------------+------+------+-----------|
| Generate wavy surface            | 1.26 | 1.26 |       1.33 | 1.33 | 3.30 |     11.06 |
| Write output to files            | 0.28 | 2.34 |      10.95 | 0.00 | 0.00 |      0.00 |

#+name: fig-arma-io-events
#+header: :width 6 :height 9 :results output graphics
#+begin_src R :file build/arma-io-events.pdf
source(file.path("R", "benchmarks.R"))
fsnames <- list(
  xfs="XFS",
  nfs="NFS",
  gfs="GlusterFS"
)
par(mfrow=c(3,1), family="serif")
arma.plot_io_events(fsnames)
#+end_src

#+name: fig-arma-io-events
#+caption: Event plot for XFS, NFS and GlusterFS that shows time intervals spent for I/O and computation by different threads.
#+RESULTS: fig-arma-io-events
[[file:build/arma-io-events.pdf]]
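The writer thread described above can be sketched with a simple
producer-consumer queue; the binary output format and the names are
placeholders.

#+begin_src cpp
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <vector>

// Compute threads push finished time slices; a single writer thread drains
// the queue, so that I/O overlaps with computation.
class slice_writer {
public:
    void push(std::vector<double> slice) {
        {
            std::lock_guard<std::mutex> lock(_mutex);
            _slices.push(std::move(slice));
        }
        _cv.notify_one();
    }
    void finish() {
        {
            std::lock_guard<std::mutex> lock(_mutex);
            _finished = true;
        }
        _cv.notify_one();
    }
    // Runs in a separate thread until finish() is called and the queue is empty.
    void run(const char* filename) {
        std::ofstream out(filename, std::ios::binary);
        for (;;) {
            std::unique_lock<std::mutex> lock(_mutex);
            _cv.wait(lock, [this] { return !_slices.empty() || _finished; });
            if (_slices.empty()) break;
            std::vector<double> slice = std::move(_slices.front());
            _slices.pop();
            lock.unlock();  // write without holding the lock
            out.write(reinterpret_cast<const char*>(slice.data()),
                      slice.size()*sizeof(double));
        }
    }
private:
    std::queue<std::vector<double>> _slices;
    std::mutex _mutex;
    std::condition_variable _cv;
    bool _finished = false;
};
#+end_src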
**** Parallel velocity potential field computation.
The benchmarks for AR, MA and LH models showed that velocity potential field
computation consumes only a fraction of the total programme execution time;
however, the absolute computation time over a large \(XY\) domain may still be
high. One application where faster computation is needed is real-time
simulation and visualisation of the wavy surface. The purpose of real-time
visualisation is two-fold:
- it allows adjusting the parameters of the model and the ACF in real-time and
  seeing the result of the changes immediately, and
- it allows visually comparing the size and the shape of the regions where most
  of the wave energy is concentrated, which is valuable for education.

Since visualisation is done by GPU, doing velocity potential computation on CPU
may cause a bottleneck in data transfer between RAM and GPU memory. To
circumvent this, we use the OpenCL/OpenGL interoperability API which allows
creating buffers that are shared between OpenCL and OpenGL contexts, thus
eliminating unnecessary copying of data. In addition to this, the
three-dimensional velocity potential field formula\nbsp{}eqref:eq-phi-3d is
particularly suitable for computation by GPUs:
- it contains transcendental mathematical functions (hyperbolic cosines and
  complex exponents);
- it is computed over a large four-dimensional \(t,x,y,z\) region;
- it is analytic with no information dependencies between individual data
  points in \(t\) and \(z\) dimensions.
These considerations make velocity potential field computation on GPU
advantageous in the application to real-time simulation and visualisation of
the wavy surface.

In order to investigate how GPGPU computations can be used to speed up velocity
potential field computation, we benchmarked a simplified version of
eq.\nbsp{}eqref:eq-phi-3d:
\begin{align}
  \label{eq:phi-linear}
  \phi(x,y,z,t) &= \InverseFourierY{
    \frac{ \Sinh{\smash{2\pi \Kveclen (z+h)}} }
         { 2\pi\Kveclen \Sinh{\smash{2\pi \Kveclen h}} }
    \FourierY{ \zeta_t }{u,v}
  }{x,y}\nonumber \\
  &= \InverseFourierY{ g_1(u,v) \FourierY{ g_2(x,y) }{u,v} }{x,y}.
\end{align}
The velocity potential solver was rewritten in OpenCL and its performance was
compared to the existing OpenMP implementation.
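As an illustration of the \(g_1\) factor in the simplified formula above, the
following sketch pre-computes it on CPU for one \(z\) plane. Grid sizes,
spacing and the treatment of the singular \(k=0\) term are assumptions; the
real solver evaluates this factor with the FFT libraries discussed below.

#+begin_src cpp
#include <cmath>
#include <vector>

// g1(u,v) = sinh(2πk(z+h)) / (2πk·sinh(2πkh)), where k = |(u,v)|.
// This is the window applied to the Fourier transform of ζ_t.
std::vector<double> window_g1(int nu, int nv, double du, double dv,
                              double z, double h) {
    const double two_pi = 2.0*std::acos(-1.0);
    std::vector<double> g1(std::size_t(nu)*nv);
    for (int i=0; i<nu; ++i)
        for (int j=0; j<nv; ++j) {
            const double k = two_pi*std::hypot(i*du, j*dv);
            // The k = 0 term is singular (the discontinuity mentioned
            // below); here it is simply set to nought. The real solver
            // interpolates it from neighbouring points (GSL) or relies
            // on clFFT to handle it.
            g1[std::size_t(i)*nv + j] =
                (k == 0.0) ? 0.0 : std::sinh(k*(z+h)) / (k*std::sinh(k*h));
        }
    return g1;
}
#+end_src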
For each implementation the overall performance of the solver for a particular
time instant was measured. The velocity field was computed for one \(t\) point,
for 128 \(z\) points below the wavy surface and for each \(x\) and \(y\) point
of the four-dimensional \((t,x,y,z)\) grid. The only parameter that was varied
between subsequent programme runs was the size of the grid along the \(x\)
dimension.

A different FFT library was used for each version of the solver: GNU Scientific
Library (GSL)\nbsp{}cite:galassi2015gnu for OpenMP and clFFT
library\nbsp{}cite:clfft for OpenCL. There are two major differences in the
routines from these libraries.
- The order of frequencies in the Fourier transforms is different, and clFFT
  library requires reordering the result of\nbsp{}eqref:eq:phi-linear whereas
  GSL does not.
- The discontinuity at \((x,y)=(0,0)\) of the velocity potential field grid is
  handled automatically by clFFT library, whereas GSL library produces skewed
  values at this point.
For GSL library an additional interpolation from neighbouring points was used
to smooth the velocity potential field at these points. We have not spotted
other differences in the FFT implementations that have an impact on the overall
performance.

In the course of the numerical experiments we measured how much time each
solver's implementation spends in each computation stage, to find out how
efficient data copying between host and device is in the OpenCL implementation,
and how the implementations compare in terms of performance.

**** Performance of velocity potential OpenCL solver.
:PROPERTIES:
:header-args:R: :results output raw :exports results
:END:

The experiments showed that the OpenCL implementation outperforms the OpenMP
implementation by a factor of 10--15 (fig.\nbsp{}[[fig-arma-realtime-graph]]);
however, the distribution of time between computation stages is different for
each implementation (table\nbsp{}[[tab-arma-realtime]]). The major time
consumer on CPU is \(g_1\), whereas on GPU its running time is comparable to
\(g_2\). Copying the resulting velocity potential field between CPU and GPU
consumes \(\approx{}20\%\) of the solver execution time. \(g_2\) consumes most
of the execution time of the OpenCL solver, and \(g_1\) of the OpenMP solver.
In both implementations \(g_2\) is computed on CPU, but for the GPU
implementation the result is duplicated for each \(z\) grid point in order to
perform the multiplication of all \(XYZ\) planes along the \(z\) dimension in a
single OpenCL kernel, and is subsequently copied to GPU memory, which severely
hinders the overall performance. All benchmarks were run on a machine equipped
with a GPU, characteristics of which are summarised in
table\nbsp{}[[tab-storm]].
#+name: tab-storm
#+caption: "Storm" test platform configuration.
#+attr_latex: :booktabs t
| CPU              | Intel Core 2 Quad Q9550    |
| RAM              | 8GB                        |
| GPU              | AMD Radeon R7 360          |
| GPU memory       | 2GB                        |
| HDD              | Seagate Barracuda, 7200rpm |
| No. of CPU cores | 4                          |

#+name: fig-arma-realtime-graph
#+header: :results output graphics
#+begin_src R :file build/realtime-performance.pdf
source(file.path("R", "benchmarks.R"))
par(family="serif")
data <- arma.load_realtime_data()
arma.plot_realtime_data(data)
title(xlab="Wavy surface size", ylab="Time, s")
#+end_src

#+name: fig-arma-realtime-graph
#+caption: Performance comparison of CPU (OpenMP) and GPU (OpenCL) versions of velocity potential solver.
#+RESULTS: fig-arma-realtime-graph
[[file:build/realtime-performance.pdf]]

The reason for the different distribution of time between computation stages is
the same as for the different AR model performance on CPU and GPU: GPU has more
floating point units and modules for transcendental mathematical functions,
which are needed for the computation of \(g_1\), but lacks the caches which are
needed to optimise the irregular memory access pattern of \(g_2\).
In contrast to AR model, the performance of multidimensional derivative
computation on GPU is easier to improve, as there are no information
dependencies between points: a multidimensional array library optimised for GPU
may solve the problem; however, due to the unavailability of such a library,
this was not done in this work. Additionally, such a library would make it
possible to efficiently compute the non-simplified formula entirely on GPU,
since the omitted terms also contain derivatives.

#+name: tab-arma-realtime
#+begin_src R
source(file.path("R", "benchmarks.R"))
routine_names <- list(
  harts_g1="\\(g_1\\)",
  harts_g2="\\(g_2\\)",
  harts_fft="FFT",
  harts_copy_to_host="Copy data from GPU"
)
column_names <- c("Subroutine", "OpenMP time, s", "OpenCL time, s")
data <- arma.load_realtime_data()
arma.print_table_for_realtime_data(data, routine_names, column_names)
#+end_src

#+name: tab-arma-realtime
#+caption: Running time of real-time velocity potential solver subroutines for wavy surface \(\text{size}=16384\).
#+attr_latex: :booktabs t
#+RESULTS: tab-arma-realtime
| Subroutine         | OpenMP time, s | OpenCL time, s |
|--------------------+----------------+----------------|
| \(g_1\)            |         4.6730 |         0.0038 |
| \(g_2\)            |         0.0002 |         0.8253 |
| FFT                |         2.8560 |         0.3585 |
| Copy data from GPU |                |         2.6357 |

As expected, sharing the same buffer between OpenCL and OpenGL contexts
increases the overall solver performance by eliminating data transfer between
CPU and GPU memory, but also requires the data to be in the vertex buffer
object format that OpenGL can operate on. Conversion to this format is fast,
but requires more memory to store the velocity potential field for
visualisation, since each point is now a vector with three components. The
other disadvantage of using OpenCL and OpenGL together is the requirement for
manual locking of shared objects (see the sketch below): failure to do so
results in screen artefacts which can be removed only by rebooting the
computer.
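A minimal sketch of this buffer sharing is given below, assuming an OpenCL
context already created with OpenGL-sharing properties; error handling is
omitted and header locations vary between platforms.

#+begin_src cpp
#include <CL/cl.h>
#include <CL/cl_gl.h>
#include <GL/gl.h>

// Run an OpenCL kernel that writes velocity potential values directly
// into an OpenGL vertex buffer object, without copying through the host.
void compute_into_vbo(cl_context context, cl_command_queue queue,
                      cl_kernel kernel, GLuint vbo, size_t npoints) {
    cl_int err = 0;
    cl_mem buffer = clCreateFromGLBuffer(context, CL_MEM_WRITE_ONLY, vbo, &err);
    glFinish();  // make sure OpenGL is done with the buffer
    // Shared objects must be explicitly locked: forgetting the
    // acquire/release pair is what produces the screen artefacts
    // mentioned above.
    clEnqueueAcquireGLObjects(queue, 1, &buffer, 0, nullptr, nullptr);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buffer);
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &npoints, nullptr,
                           0, nullptr, nullptr);
    clEnqueueReleaseGLObjects(queue, 1, &buffer, 0, nullptr, nullptr);
    clFinish(queue);  // make the results visible to subsequent draw calls
    clReleaseMemObject(buffer);
}
#+end_src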
**** Conclusions.
Performance benchmarks showed that GPU outperforms CPU in arithmetically
intensive tasks, i.e.\nbsp{}tasks requiring a high number of floating point
operations per second; however, its performance degrades when the volume of
data that needs to be copied between CPU and GPU memory increases, or when the
memory access pattern of the algorithm is non-linear. The first problem may be
solved by using co-processors where high-bandwidth memory is located on the
same die as the processor and the main memory. This eliminates the data
transfer bottleneck, but may also increase execution time due to the smaller
number of floating point units. The second problem may be solved
programmatically, by using an OpenCL library that optimises multi-dimensional
array traversals for GPUs; due to the unavailability of such a library, this
was not done in the present work.

ARMA model outperforms LH model in the benchmarks and does not require GPU to
do so. Its computational strengths are:
- the absence of transcendental mathematical functions, and
- simple algorithms for both AR and MA models, the performance of which depends
  on the performance of the underlying multi-dimensional array library and FFT
  library.
Providing the main functionality via low-level libraries makes performance
portable: support for new processor architectures can be added by substituting
the libraries. Finally, using an analytic formula for the velocity potential
field made the velocity potential solver consume only a small fraction of the
total programme execution time. If such a formula were not available, or did
not have all integrals in the form of Fourier transforms, the performance of
velocity potential computation would be much lower.

** MPP implementation
*** Computational model
**** Mapping wavy surface generation algorithm on computational model.
Software implementation of ARMA model works as a computational pipeline, in
which each joint applies some function to the output coming from the pipe of
the previous joint. Joints are distributed across computer cluster nodes to
enable function parallelism, and then the data flowing through the joints is
distributed across processor cores to enable data parallelism.
Figure\nbsp{}[[fig-pipeline]] shows a diagram of the data processing pipeline,
in which rectangles with rounded corners denote joints, regular rectangles
denote arrays of problem domain objects flowing from one joint to another, and
arrows show the flow direction. Some joints are divided into /sections/, each
of which processes a separate part of the array. If joints are connected
without a /barrier/ (horizontal or vertical bar), then the transfer of separate
objects between them is done in parallel to computations, as they become
available. Sections work in parallel on each processor core (or node of the
cluster). There is a surjective mapping between the set of processor cores, the
set of pipeline joint sections and the objects, i.e.\nbsp{}each processor core
may run several sections, each of which may sequentially process several
objects, but a section cannot work simultaneously on several processor cores,
and an object cannot be processed simultaneously by several sections. So, the
programme is a pipeline through which control objects flow.
#+name: fig-pipeline
#+begin_src dot :exports results :file build/pipeline.pdf
digraph {

    node [fontsize=14,margin="0.055,0"]
    graph [nodesep="0.25",ranksep="0.25",rankdir="TB"]
    edge [arrowsize=0.66]

    # data
    subgraph xcluster_linear {
        label="Linear model"

        start [label="",shape=circle,style=filled,fillcolor=black,width=0.23]
        spectrum [label="S(ω,θ)",shape=box]
        acf [label="K(i,j,k)",shape=box]
        phi [label="Φ(i,j,k)",shape=box]

        # transformations
        fourier_transform [label="Fourier transform",shape=box,style=rounded]
        solve_yule_walker [label="Solve Yule—Walker\nequations",shape=box,style=rounded]

        subgraph cluster_nonlinear_1 {
            label="Simulate non-linearity\l"
            labeljust=left
            style=filled
            color=lightgrey
            acf2 [label="K*(i,j,k)",shape=box]
            transform_acf [label="Transform ACF",shape=box,style=rounded]
        }
    }

    subgraph xcluster_linear2 {

        eps_parts [label="<e1> ε₁|<e2> ε₂|<e3> …|<e4> εₙ|<e> ε(t,x,y)",shape=record]
        end [label="",shape=doublecircle,style=filled,fillcolor=black,width=0.23]

        generate_white_noise [label="<g1> g₁|<g2> g₂|<g3> …|<g4> gₙ|<gen> Generate\lwhite noise",shape=record,style=rounded]
        generate_zeta [label="<g1> g₁|<g2> g₂|<g3> …|<g4> gₙ|<gen> Generate sea\lwavy surface parts\l",shape=record,style=rounded]

        zeta_parts [label="<g1> ζ₁|<g2> ζ₂|<g3> …|<g4> ζₙ|<gen> Non-crosslinked\lrealisation parts",shape=record]
        overlap_add [label="<g1> ζ₁|<g2> ζ₂|<g3> …|<g4> ζₙ|<gen> Crosslink realisation\lparts\l",shape=record,style=rounded]

        zeta_parts:g1->overlap_add:g1
        zeta_parts:g2->overlap_add:g2
        zeta_parts:g3->overlap_add:g3
        zeta_parts:g4->overlap_add:g4

        zeta_parts:g2->overlap_add:g1 [constraint=false]
        zeta_parts:g3->overlap_add:g2 [constraint=false]
        zeta_parts:g4->overlap_add:g3 [constraint=false]

        overlap_add:g1->zeta2_parts:g1
        overlap_add:g2->zeta2_parts:g2
        overlap_add:g3->zeta2_parts:g3
        overlap_add:g4->zeta2_parts:g4

        zeta2_parts:g1->transform_zeta:g1->zeta3_parts:g1->write_zeta:g1->eps_end
        zeta2_parts:g2->transform_zeta:g2->zeta3_parts:g2->write_zeta:g2->eps_end
        zeta2_parts:g3->transform_zeta:g3->zeta3_parts:g3->write_zeta:g3->eps_end
        zeta2_parts:g4->transform_zeta:g4->zeta3_parts:g4->write_zeta:g4->eps_end

    }

    subgraph part3 {

        zeta2_parts [label="<g1> ζ₁|<g2> ζ₂|<g3> …|<g4> ζₙ|<gen> Wavy surface with\lGaussian distribution\l",shape=record]

        subgraph cluster_nonlinear_2 {
            label="Simulate non-linearity\r"
            labeljust=right
            style=filled
            color=lightgrey
            zeta3_parts [label="<g1> ζ₁|<g2> ζ₂|<g3> …|<g4> ζₙ|<gen> ζ(t,x,y)",shape=record]
            transform_zeta [label="<g1> g₁|<g2> g₂|<g3> …|<g4> gₙ|<gen> Transform wavy\lsurface elevation\lprobability distribution\l",shape=record,style=rounded]
        }

        # barriers
        eps_start [label="",shape=box,style=filled,fillcolor=black,height=0.05]
        eps_end [label="",shape=box,style=filled,fillcolor=black,height=0.05]

        write_zeta [label="<g1> g₁|<g2> g₂|<g3> …|<g4> gₙ|<gen> Write finished\lparts to a file\l",shape=record,style=rounded]
    }

    # edges
    start->spectrum->fourier_transform->acf->transform_acf
    transform_acf->acf2
    acf2->solve_yule_walker
    solve_yule_walker->phi
    phi->eps_start [constraint=false]
    eps_start->generate_white_noise:g1
    eps_start->generate_white_noise:g2
    eps_start->generate_white_noise:g3
    eps_start->generate_white_noise:g4
    generate_white_noise:g1->eps_parts:e1->generate_zeta:g1->zeta_parts:g1
    generate_white_noise:g2->eps_parts:e2->generate_zeta:g2->zeta_parts:g2
    generate_white_noise:g3->eps_parts:e3->generate_zeta:g3->zeta_parts:g3
    generate_white_noise:g4->eps_parts:e4->generate_zeta:g4->zeta_parts:g4

    eps_end->end
}
#+end_src

#+caption: Diagram of the data processing pipeline that implements sea wavy surface generation via AR model.
#+name: fig-pipeline
#+RESULTS: fig-pipeline
[[file:build/pipeline.pdf]]

The object pipeline may be seen as an improvement over the BSP (Bulk
Synchronous Parallel) model\nbsp{}cite:valiant1990bridging, which is used in
graph processing\nbsp{}cite:malewicz2010pregel,seo2010hama. The pipeline
eliminates global synchronisation (where possible) after each sequential
computation step by doing data transfer between joints in parallel to
computations, whereas in the BSP model global synchronisation occurs after each
step.

The object pipeline speeds up the programme by parallel execution of code
blocks that work with different compute devices: while the current part of the
wavy surface is generated by the processor, the previous part is written to
disk. This approach yields a speed-up because compute devices operate
asynchronously, and their parallel use increases the whole programme
performance.

Since data transfer between pipeline joints is done in parallel to
computations, the same pipeline may be used to run several copies of the
application, but with different parameters (generating several sea wavy
surfaces having different characteristics). In practice, high-performance
applications do not always consume 100% of processor time, spending a portion
of it on synchronisation of parallel processes and writing data to disk. Using
a pipeline in this case allows running several computations on the same set of
processes and using all of the computer devices at maximal efficiency. For
example, when one object writes data to a file, the others do computations on
the processor in parallel. This minimises downtime of the processor and other
computer devices and increases the throughput of the computer cluster.

Pipelining of otherwise sequential steps is beneficial not only for code
working with different devices, but also for code whose different branches are
suitable for execution by multiple hardware threads of the same processor core,
i.e.\nbsp{}branches accessing different memory blocks or performing mixed
(integer and floating point) arithmetic. Code branches which use different
modules of the processor are good candidates to run in parallel on a processor
core with multiple hardware threads.

So, a computational model with a pipeline can be seen as a /bulk-asynchronous
model/, because of the parallel nature of programme steps. This model is the
basis of the fault-tolerance model which will be described later.

**** Software implementation.
For efficiency reasons the object pipeline and fault tolerance techniques
(which will be described later) are implemented in a C++ framework: from the
author's perspective the C language is too low-level for distributed
programmes, and Java incurs too much overhead and is not popular in the HPC
community. As of now, the framework runs in the same process as the parallel
application that uses it. The framework is called Factory; it is now in the
proof-of-concept development stage.

**** Computational model overview.
The key feature that is missing in current parallel programming technologies is
the possibility to specify hierarchical dependencies between parallel tasks.
When one has such a dependency, it is trivial to determine which task should be
responsible for re-executing a failed task on one of the surviving nodes. To
re-execute a task at the top of the hierarchy, a backup task is created and
executed on a different node. There exists a number of systems that are capable
of executing directed acyclic graphs of tasks in
parallel\nbsp{}cite:acun2014charmpp,islam2012oozie, but graphs are not suitable
for inferring the principal-subordinate relationship between tasks, because a
node in a graph may have multiple parent nodes.

The main purpose of the model is to simplify the development of distributed
batch processing applications and middleware. The main focus is to make the
application resilient to failures, i.e.\nbsp{}to make it fault-tolerant and
highly available, and to do so transparently to the programmer. The
implementation is divided into two layers: the lower layer consists of routines
and classes for single node applications (with no network interactions), and
the upper layer for applications that run on an arbitrary number of nodes.
There are two kinds of tightly coupled entities in the model\nbsp{}---
/control flow objects/ (or /kernels/ for short) and /pipelines/\nbsp{}--- which
are used together to compose a programme.

Kernels implement control flow logic in their ~act~ and ~react~ methods and
store the state of the current control flow branch. Both the logic and the
state are implemented by the programmer. In the ~act~ method some function is
either directly computed or decomposed into nested functions (represented by a
set of subordinate kernels) which are subsequently sent to a pipeline. In the
~react~ method subordinate kernels that returned from the pipeline are
processed by their parent. Calls to ~act~ and ~react~ methods are asynchronous
and are made within threads attached to a pipeline. For each kernel ~act~ is
called only once, and for multiple kernels the calls are done in parallel to
each other, whereas the ~react~ method is called once for each subordinate
kernel, and all the calls are made in the same thread to prevent race
conditions (for different parent kernels different threads may be used); a
minimal interface sketch is given below.

Pipelines implement asynchronous calls to ~act~ and ~react~, and try to make as
many parallel calls as possible considering the concurrency of the platform
(no. of cores per node and no. of nodes in a cluster). A pipeline consists of a
kernel pool, which contains all the subordinate kernels sent by their parents,
and a thread pool that processes kernels in accordance with the rules outlined
in the previous paragraph. A separate pipeline is used for each device: there
are pipelines for parallel processing, schedule-based processing (periodic and
delayed tasks), and a proxy pipeline for processing of kernels on other cluster
nodes (see fig.\nbsp{}[[fig-subord-ppl]]).

In principle, the kernels and pipelines machinery reflects that of procedures
and call stacks, with the advantage that kernel methods are called
asynchronously and in parallel to each other (as much as programme logic
allows). The kernel fields are the stack, the ~act~ method is the sequence of
processor instructions before a nested procedure call, and the ~react~ method
is the sequence of processor instructions after the call. Constructing and
sending subordinate kernels to the pipeline is the nested procedure call. Two
methods are necessary to make the calls asynchronous and to replace active
waiting for the completion of subordinate kernels with passive waiting.
Pipelines, in turn, allow implementing passive waiting, and call the correct
kernel methods by analysing their internal state.

#+name: fig-subord-ppl
#+begin_src dot :exports results :file build/subord-ppl.pdf
graph G {

    node [fontname="Old Standard",fontsize=14,margin="0.055,0",shape=box]
    graph [nodesep="0.25",ranksep="0.25",rankdir="LR"]
    edge [arrowsize=0.66]

    subgraph cluster_daemon {
        label="Daemon process"
        style=filled
        color=lightgrey

        factory [label="Factory"]
        parallel_ppl [label="Parallel\npipeline"]
        io_ppl [label="I/O\npipeline"]
        sched_ppl [label="Schedule-based\npipeline"]
        net_ppl [label="Network\npipeline"]
        proc_ppl [label="Process\npipeline"]

        upstream [label="Upstream\nthread pool"]
        downstream [label="Downstream\nthread pool"]
    }

    factory--parallel_ppl
    factory--io_ppl
    factory--sched_ppl
    factory--net_ppl
    factory--proc_ppl

    subgraph cluster_hardware {
        label="Compute devices"
        style=filled
        color=lightgrey

        cpu [label="CPU"]
        core0 [label="Core 0"]
        core1 [label="Core 1"]
        core2 [label="Core 2"]
        core3 [label="Core 3"]

        storage [label="Storage"]
        disk0 [label="Disk 0"]

        network [label="Network"]
        nic0 [label="NIC 0"]

        timer [label="Timer"]
    }

    core0--cpu
    core1--cpu
    core2--cpu
    core3--cpu

    disk0--storage
    nic0--network

    parallel_ppl--upstream
    parallel_ppl--downstream

    upstream--{core0,core1,core2,core3} [style="dashed"]
    downstream--core0 [style="dashed"]

    io_ppl--core0 [style="dashed"]
    io_ppl--disk0 [style="dashed"]
    sched_ppl--core0 [style="dashed"]
    sched_ppl--timer [style="dashed"]
    net_ppl--core0 [style="dashed"]
    net_ppl--nic0 [style="dashed"]
    proc_ppl--core0 [style="dashed"]

    subgraph cluster_children {
        style=filled
        color=white

        subgraph cluster_child0 {
            label="Child process 0"
            style=filled
            color=lightgrey
            labeljust=right

            app0_factory [label="Factory"]
            app0 [label="Child process\rpipeline"]
        }
    }

    proc_ppl--app0

    app0_factory--app0 [constraint=false]

}
#+end_src

#+caption: Mapping of parent and child process pipelines to compute devices. Solid lines denote aggregation, dashed lines denote mapping between logical and physical entities.
#+attr_latex: :width \textwidth
#+name: fig-subord-ppl
#+RESULTS: fig-subord-ppl
[[file:build/subord-ppl.pdf]]
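The following sketch illustrates the kernel interface and the act/react
control flow described above. It is a simplified model of the idea, not the
Factory API: the pipeline is replaced by direct synchronous calls, kernel
deallocation is omitted, and the names are illustrative.

#+begin_src cpp
struct Kernel {
    Kernel* parent = nullptr;
    int exit_code = 0;
    virtual ~Kernel() {}
    virtual void act() {}                 // called once, possibly in parallel
    virtual void react(Kernel* child) {}  // called once per subordinate kernel
};

void upstream(Kernel* parent, Kernel* child);  // send a subordinate kernel
void commit(Kernel* kernel);                   // return a kernel to its parent

// A subordinate kernel: computes one wavy surface part.
struct GeneratePart: public Kernel {
    int index;
    explicit GeneratePart(int i): index(i) {}
    void act() override {
        // ... generate wavy surface part `index` ...
        commit(this);  // send itself back to the parent
    }
};

// A principal kernel: decomposes the problem into subordinate kernels.
struct GenerateSurface: public Kernel {
    int nparts, completed = 0;
    explicit GenerateSurface(int n): nparts(n) {}
    void act() override {
        for (int i=0; i<nparts; ++i)
            upstream(this, new GeneratePart(i));  // asynchronous "call"
    }
    void react(Kernel*) override {
        // collect the result; react calls are serialised by the pipeline
        if (++completed == nparts) commit(this);  // all subordinates returned
    }
};

// In the real framework the pipeline schedules act() on a thread pool and
// react() on the parent's thread; here the calls are synchronous.
void upstream(Kernel* parent, Kernel* child) {
    child->parent = parent;
    child->act();
}
void commit(Kernel* kernel) {
    if (kernel->parent) kernel->parent->react(kernel);
}

int main() {
    GenerateSurface root(4);
    root.act();
}
#+end_src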
**** Governing principles.
The data processing pipeline model is based on the following principles,
adherence to which maximises the efficiency of a programme.
- There is no notion of a message in the model: a kernel is itself a message
  that can be sent over the network to another node and directly access any
  kernel on the local node. Only programme logic may guarantee the existence of
  the kernel.
- A kernel is a /cooperative routine/, which is submitted to the kernel pool
  upon the call and is executed asynchronously by a scheduler. There can be any
  number of calls to other subroutines inside the routine body. Every call
  submits the corresponding subroutine to the kernel pool and returns
  immediately. Kernels in the pool can be executed in any order; this fact is
  used by the scheduler to exploit parallelism offered by the computer by
  distributing kernels from the pool across available cluster nodes and
  processor cores.
- Asynchronous execution prevents the use of explicit synchronisation after the
  call to a subroutine is made; the system scheduler returns control flow to
  the routine each time one of its subroutines returns. Such cooperation
  transforms each routine which calls subroutines into an event handler, where
  each event is a subroutine and the handler is the routine that called them.
- A routine may communicate with any number of local kernels whose addresses it
  knows; communication with kernels which are not adjacent in the call stack
  complexifies the control flow, and the call stack loses its tree shape. Only
  programme logic may guarantee the presence of communicating kernels in
  memory. One way to ensure this is to perform communication between
  subroutines which are called from the same routine. Since such communication
  is possible within the hierarchy through the parent routine, it may be
  treated as an optimisation that eliminates the overhead of transferring data
  over an intermediate node. The situation is different for interactive or
  event-based programmes (e.g.\nbsp{}servers and programmes with graphical
  interfaces) in which this is the primary type of communication.
- In addition, communication which does not occur along hierarchical links and
  is executed over the cluster network complexifies the design of resiliency
  algorithms, since it is impossible to ensure that a kernel resides in the
  memory of a neighbouring node: a node may fail in the middle of executing the
  corresponding routine. As a result, upon failure of a routine all of its
  subroutines must be restarted. This encourages a programmer to construct
  - deep tree hierarchies of tightly-coupled kernels (which communicate on the
    same level of the hierarchy) to reduce the overhead of recomputation, and
  - fat tree hierarchies of loosely-coupled kernels, providing a maximal degree
    of parallelism.
  A deep hierarchy is not only a requirement of the technology, it helps
  optimise communication of a large number of cluster nodes by reducing it to
  communication of adjacent nodes.

So, control flow objects (or kernels) possess properties of both cooperative
routines and event handlers.
**** Definitions of hierarchies.
To disambiguate hierarchical links between daemon processes and kernels and to
simplify the discussion, we will use the following naming conventions
throughout the text. If the link is between two daemon processes, the
relationship is denoted as /master-slave/. If the link is between two kernels,
then the relationship is denoted as either /principal-subordinate/ or
/parent-child/. The two hierarchies are orthogonal to each other in the sense
that no kernel may have a link to a daemon, and vice versa. Since the daemon
hierarchy is used to distribute the load on cluster nodes, the kernel hierarchy
is mapped onto it, and this mapping can be arbitrary: it is common to have the
principal kernel on a slave node with its subordinate kernels distributed
evenly between all cluster nodes (including the node where the principal is
located). Both hierarchies can be arbitrarily deep, but "shallow" ones are
preferred for highly parallel programmes, as there are fewer hops when kernels
are distributed between cluster nodes. Since there is a one-to-one
correspondence between daemons and cluster nodes, they are used interchangeably
in this work.
For a
-downstream kernel the ~react~ method of its parent is called by a pipeline with
-the kernel as the method argument, so that the parent can collect the result of
-the execution.
+ }

-There is no way to provide fine-grained resilience to cluster node failures if
-the programme contains downstream kernels other than those returning to their
-parents. Instead, the exit code of a kernel is checked and custom recovery code
-is executed. If there is no error checking, the system restarts execution from
-the first parent kernel that did not produce any downstream kernels. This means
-that if a problem being solved by the programme has information dependencies
-between parts that are computed in parallel, and a node failure occurs during
-computation of these parts, then this computation is restarted from the very
-beginning, discarding any already computed parts. This does not occur for
-embarrassingly parallel programmes, where parallel parts do not have such
-information dependencies between each other: in this case only parts from
-failed nodes are recomputed and all previously computed parts are retained.
+
+ subgraph part3 {

-** SMP implementation
-**** Parallel AR, MA and LH model algorithms.
-Although AR and MA models are parts of the mixed ARMA model, they have
-completely disparate parallel algorithms, both of which differ from the trivial
-one of the LH model. The AR algorithm is based on partitioning the wavy surface
-into equal three-dimensional parts along each dimension and computing them in
-parallel, taking into account causality constraints imposed by autoregressive
-dependencies between surface points. There are no autoregressive dependencies
-between points in the MA model: its formula represents a convolution of white
-noise with model coefficients, which is reduced to the computation of three
-Fourier transforms via the convolution theorem, which in turn have parallel
-implementations; so, the MA algorithm is based on parallel FFT. Finally, the LH
-algorithm is made parallel by simply computing each wavy surface point in
-parallel. So, the parallel implementation of the ARMA model consists of two
-parallel algorithms, one for each part of the model, both of which are more
-sophisticated than the one for the LH model.
+ zeta2_parts [label="<g1> ζ₁|<g2> ζ₂|<g3> …|<g4> ζₙ|<gen> Wavy surface with\lGaussian distribution\l",shape=record]

-The main feature of the AR model's formula is autoregressive dependencies
-between wavy surface points along each dimension, which prevent computing each
-surface point in parallel. Instead, the surface is partitioned along each
-dimension into equal three-dimensional parts, and for each part information
-dependencies, which define the execution order, are established.
-Figure\nbsp{}[[fig-ar-cubes]] shows these dependencies. An arrow denotes a
-dependency of one part on the availability of another: part \(A\) does not have
-dependencies, parts \(B\) and \(D\) depend on \(A\), and part \(E\) depends on
-\(A\), \(B\) and \(D\). In essence, each part depends only on the parts that
-have a previous index along each dimension (if such parts exist). The first
-part does not have any dependencies, and the size of each part along each
-dimension is made greater than or equal to the corresponding number of
-coefficients along that dimension, so that only adjacent parts are considered.
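-This ordering rule can be sketched as follows (a hypothetical illustration,
-not the actual implementation): a part may be generated as soon as every part
-that precedes it along each dimension is complete.
-#+begin_src cpp
-#include <array>
-#include <vector>
-
-// A wavy surface part: a three-dimensional index and a completion status.
-struct Part {
-    std::array<int,3> idx{};
-    bool done = false;
-};
-
-// True if every part that precedes "p" along each dimension is complete.
-// Checking all dominated parts is, by transitivity, equivalent to checking
-// only the adjacent predecessors.
-bool dependencies_satisfied(const Part& p, const std::vector<Part>& parts) {
-    for (const Part& q : parts) {
-        if (!q.done && q.idx != p.idx &&
-            q.idx[0] <= p.idx[0] && q.idx[1] <= p.idx[1] &&
-            q.idx[2] <= p.idx[2]) {
-            return false;
-        }
-    }
-    return true;
-}
-#+end_src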
+ subgraph cluster_nonlinear_2 {
+ label="Simulate non-linearity\r"
+ labeljust=right
+ style=filled
+ color=lightgrey
+ zeta3_parts [label="<g1> ζ₁|<g2> ζ₂|<g3> …|<g4> ζₙ|<gen> ζ(t,x,y)",shape=record]
+ transform_zeta [label="<g1> g₁|<g2> g₂|<g3> …|<g4> gₙ|<gen> Transform wavy\lsurface elevation\lprobability distribution\l",shape=record,style=rounded]
+ }
+
+ # barriers
+ eps_start [label="",shape=box,style=filled,fillcolor=black,height=0.05]
+ eps_end [label="",shape=box,style=filled,fillcolor=black,height=0.05]
+
+ write_zeta [label="<g1> g₁|<g2> g₂|<g3> …|<g4> gₙ|<gen> Write finished\lparts to a file\l",shape=record,style=rounded]
+ }
+
+ # edges
+ start->spectrum->fourier_transform->acf->transform_acf
+ transform_acf->acf2
+ acf2->solve_yule_walker
+ solve_yule_walker->phi
+ phi->eps_start [constraint=false]
+ eps_start->generate_white_noise:g1
+ eps_start->generate_white_noise:g2
+ eps_start->generate_white_noise:g3
+ eps_start->generate_white_noise:g4
+ generate_white_noise:g1->eps_parts:e1->generate_zeta:g1->zeta_parts:g1
+ generate_white_noise:g2->eps_parts:e2->generate_zeta:g2->zeta_parts:g2
+ generate_white_noise:g3->eps_parts:e3->generate_zeta:g3->zeta_parts:g3
+ generate_white_noise:g4->eps_parts:e4->generate_zeta:g4->zeta_parts:g4

-#+name: fig-ar-cubes
-#+header: :width 5 :height 5
-#+begin_src R :file build/ar-cubes.pdf
-source(file.path("R", "common.R"))
-arma.plot_ar_cubes_2d(3, 3, xlabel="Part index (X)", ylabel="Part index (Y)")
+ eps_end->end
+}
 #+end_src

-#+RESULTS: fig-ar-cubes
-[[file:build/ar-cubes.pdf]]
-
-Each part has a three-dimensional index and a completion status. The algorithm
-starts by submitting objects containing this information into a queue. After
-that, parallel threads start; each thread finds the first part for which all
-dependencies are satisfied (by checking the completion status of each part),
-removes the part from the queue, generates the wavy surface for this part and
-sets its completion status. The algorithm ends when the queue becomes empty.
-Access to the queue is synchronised by locks. The algorithm is suitable for SMP
-machines; for MPP, copying of dependent parts needs to be done prior to the
-computation of each part.
-
-So, the AR model algorithm is made parallel by implementing a minimalistic job
-scheduler, in which
-- each job corresponds to a wavy surface part,
-- the order of execution of jobs is defined by autoregressive dependencies,
-- and jobs are executed by a simple thread pool in which each thread removes
- the first job for which all dependent jobs have completed and executes it.
+#+caption: Diagram of the data processing pipeline that implements sea wavy surface generation via the AR model.
+#+name: fig-pipeline
+#+RESULTS: fig-pipeline
+[[file:build/pipeline.pdf]]

-In contrast to the AR model, the MA model does not have autoregressive
-dependencies between points; instead, each surface point depends on previous
-(in time and space) white noise values. The MA model's formula can be rewritten
-as a convolution of white noise with the coefficients as a kernel. Using the
-convolution theorem, the convolution is rewritten as an inverse Fourier
-transform of the product of the Fourier transforms of white noise and
-coefficients. Since the number of MA coefficients is much smaller than the
-number of wavy surface points, a parallel FFT implementation is not suitable
-here, as it requires padding the coefficients with noughts to match the size of
-the surface.
Instead, the
-surface is divided into parts along each dimension, which are padded with
-noughts to match twice the number of coefficients along each dimension. Then
-the Fourier transform of each part is computed in parallel, multiplied by the
-previously computed Fourier transform of the coefficients, and the inverse
-Fourier transform of the result is computed. After that, each part is written
-to the output array with overlapping points (due to padding) added to each
-other. This algorithm is commonly known in signal processing as
-"overlap-add"\nbsp{}cite:svoboda2011efficient. Padding with noughts is needed to
-prevent aliasing errors: without it the result would be a circular convolution.
+Object pipeline may be seen as an improvement of the BSP (Bulk Synchronous
+Parallel) model\nbsp{}cite:valiant1990bridging, which is used in graph
+processing\nbsp{}cite:malewicz2010pregel,seo2010hama. The pipeline eliminates
+global synchronisation (where possible) after each sequential computation step
+by doing data transfer between joints in parallel to computations, whereas in
+the BSP model global synchronisation occurs after each step.

-Despite the fact that the MA model algorithm partitions the surface into the
-same parts (though possibly of different sizes), the absence of autoregressive
-dependencies between them allows computing them in parallel without the use of
-a specialised job scheduler. However, it requires padding with noughts to make
-the result correspond to the original MA model's formula. So, the MA model's
-algorithm is more scalable to a large number of nodes, as it has fewer
-dependencies between parts computed in parallel, but the parts are larger than
-in the AR model, so they are slower to compute.
+Object pipeline speeds up the programme by parallel execution of code blocks
+that work with different compute devices: while the current part of the wavy
+surface is generated by the processor, the previous part is written to disk.
+This approach yields a speed-up because compute devices operate asynchronously,
+and their parallel usage increases the whole programme performance.

-The distinct feature of the LH model's algorithm is its simplicity: to make it
-parallel, the surface is partitioned into parts of equal sizes and each part is
-computed in parallel. There are no dependencies between parts, which makes this
-algorithm particularly suitable for computation on GPU: each hardware thread
-simply computes its own point. In addition, the sine and cosine functions in
-the model's formula, which are slow to compute on CPU, make GPU an even more
-favourable choice.
+Since data transfer between pipeline joints is done in parallel to computations,
+the same pipeline may be used to run several copies of the application, but
+with different parameters (generating several sea wavy surfaces with different
+characteristics). In practice, high-performance applications do not always
+consume 100% of processor time, spending a portion of it on synchronisation of
+parallel processes and on writing data to disk. Using a pipeline in this case
+allows running several computations on the same set of processes and using all
+of the computer's devices at maximal efficiency. For example, when one object
+writes data to a file, the others do computations on the processor in parallel.
+This minimises downtime of the processor and other computer devices and
+increases the throughput of the computer cluster.
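+The gist of this overlap can be illustrated with a minimal sketch (plain C++
+threads rather than the actual pipeline code): the computation of the current
+part runs in parallel to the writing of the previous one.
+#+begin_src cpp
+#include <fstream>
+#include <future>
+#include <vector>
+
+// Toy stand-ins for the real generator and writer.
+std::vector<double> generate_part(int i) {
+    return std::vector<double>(1024, double(i));
+}
+
+void write_part(std::vector<double> part, std::ofstream& out) {
+    out.write(reinterpret_cast<const char*>(part.data()),
+              part.size()*sizeof(double));
+}
+
+int main() {
+    std::ofstream out("zeta.bin", std::ios::binary);
+    std::future<void> io; // pending write of the previous part
+    for (int i=0; i<10; ++i) {
+        auto part = generate_part(i);   // current part: processor
+        if (io.valid()) { io.wait(); }  // previous part must be on disk
+        // write the finished part in the background while the next one
+        // is being generated
+        io = std::async(std::launch::async, write_part,
+                        std::move(part), std::ref(out));
+    }
+    if (io.valid()) { io.wait(); }
+}
+#+end_src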
-To summarise, even though the AR and MA models are parts of the mixed ARMA
-model, their parallel algorithms are fundamentally different and are more
-complicated than the trivial parallel algorithm of the LH model. An efficient
-AR algorithm requires a specialised job scheduler to manage autoregressive
-dependencies between wavy surface parts, whereas the MA algorithm requires
-padding parts with noughts to be able to compute them in parallel. In contrast
-to these models, the LH model has no dependencies between parts computed in
-parallel, but requires more computational power (floating point operations per
-second).
-**** Performance of OpenMP and OpenCL implementations.
-:PROPERTIES:
-:header-args:R: :results output raw :exports results
-:END:

+Pipelining of otherwise sequential steps is beneficial not only for code
+working with different devices, but also for code whose different branches are
+suitable for execution by multiple hardware threads of the same processor core,
+i.e.\nbsp{}branches accessing different memory blocks or performing mixed
+arithmetic (integer and floating point). Code branches that use different
+modules of the processor are good candidates to run in parallel on a processor
+core with multiple hardware threads.

-Differences in the models' parallel algorithms make them efficient on different
-processor architectures, and to find the most efficient one, all the models
-were benchmarked on both CPU and GPU.

+So, the computational model with a pipeline can be seen as a /bulk-asynchronous
+model/, because of the parallel nature of programme steps. This model is the
+basis of the fault-tolerance model which will be described later.

-The ARMA model does not require a highly optimised software implementation to
-be efficient: its performance is high even without the use of co-processors.
-There are two main causes of that. First, the ARMA model itself does not use
-transcendental functions (sines, cosines and exponents) as opposed to the LH
-model. All calculations (except model coefficients) are done via polynomials,
-which can be efficiently computed on modern processors using a series of FMA
-instructions. Second, pressure computation is done via an explicit analytic
-formula using nested FFTs. Since a two-dimensional FFT of the same size is
-repeatedly applied to every time slice, its coefficients (complex exponents)
-are pre-computed for all slices, and computations are performed with only a few
-transcendental functions. In the case of the MA model, performance is also
-increased by doing convolution with FFT transforms. So, the high performance of
-the ARMA model is due to scarce use of transcendental functions and heavy use
-of FFT, not to mention that the high convergence rate and absence of
-periodicity allow using far fewer coefficients compared to the LH model.

+**** Software implementation.
+For efficiency reasons the object pipeline and the fault tolerance techniques
+(which will be described later) are implemented in a C++ framework: from the
+author's perspective, C is too low-level for distributed programmes, and Java
+incurs too much overhead and is not popular in the HPC community. As of now,
+the framework runs in the same process as the parallel application that uses
+it. The framework is called Factory; it is now in the proof-of-concept
+development stage.

-#+name: tab-gpulab
-#+caption: "Gpulab" test platform configuration.
-#+attr_latex: :booktabs t
-| CPU | AMD FX-8370 |
-| RAM | 16Gb |
-| GPU | GeForce GTX 1060 |
-| GPU memory | 6GB |
-| HDD | WDC WD40EZRZ-00WN9B0, 5400rpm |
-| No. of CPU cores | 4 |
-| No. of threads per core | 2 |

-ARMA implementation uses several libraries of mathematical functions, numerical
-algorithms and visualisation primitives (listed in table\nbsp{}[[tab-arma-libs]]),
-and was implemented using several parallel programming technologies (OpenMP,
-OpenCL) to find the most efficient one. For each technology and each model an
-optimised wavy surface generation was implemented (except for the MA model, for
-which only an OpenMP implementation was done). Velocity potential computation
-was done in OpenMP and was implemented in OpenCL only for real-time
-visualisation of the surface. For each technology the programme was recompiled
-and run multiple times, and the performance of each top-level subroutine was
-measured using the system clock. Results of the benchmarks are summarised in
-table\nbsp{}[[tab-arma-performance]]. All benchmarks were run on a machine
-equipped with a GPU, the characteristics of which are summarised in
-table\nbsp{}[[tab-gpulab]]. All benchmarks were run with the same input
-parameters for all the models: realisation length 10000s and output grid size
-\(40\times40\)m. The only parameter that differed is the order (the number of
-coefficients): the order of the AR and MA models was \(7,7,7\) and the order of
-the LH model was \(40,40\). This is due to the higher number of coefficients
-needed by the LH model to eliminate periodicity.

+**** Computational model overview.
+The key feature that is missing in current parallel programming technologies is
+the possibility to specify hierarchical dependencies between parallel tasks.
+When one has such a dependency, it is trivial to determine which task should be
+responsible for re-executing a failed task on one of the surviving nodes. To
+re-execute the task at the top of the hierarchy, a backup task is created and
+executed on a different node. There exist a number of systems that are capable
+of executing directed acyclic graphs of tasks in
+parallel\nbsp{}cite:acun2014charmpp,islam2012oozie, but graphs are not suitable
+for inferring the principal-subordinate relationship between tasks, because a
+node in the graph may have multiple parent nodes.

+The main purpose of the model is to simplify the development of distributed
+batch processing applications and middleware. The main focus is to make the
+application resilient to failures, i.e. make it fault tolerant and highly
+available, and to do it transparently to a programmer. The implementation is
+divided into two layers: the lower layer consists of routines and classes for
+single node applications (with no network interactions), and the upper layer of
+those for applications that run on an arbitrary number of nodes. There are two
+kinds of tightly coupled entities in the model\nbsp{}--- /control flow objects/
+(or /kernels/ for short) and /pipelines/\nbsp{}--- which are used together to
+compose a programme.

-In all benchmarks wavy surface generation and NIT take most of the running
-time, whereas velocity potential calculation together with other subroutines
-takes only a small fraction of it.

+Kernels implement control flow logic in their ~act~ and ~react~ methods and
+store the state of the current control flow branch. Both logic and state are
+implemented by a programmer. In the ~act~ method some function is either
+directly computed or decomposed into nested functions (represented by a set of
+subordinate kernels) which are subsequently sent to a pipeline. In the ~react~
+method subordinate kernels that returned from the pipeline are processed by
+their parent, as illustrated by the sketch below. Calls to ~act~ and ~react~
+are asynchronous and are made within threads attached to a pipeline. For each
+kernel ~act~ is called only once, and for multiple kernels the calls are done
+in parallel to each other, whereas ~react~ is called once for each subordinate
+kernel, and all the calls are made in the same thread to prevent race
+conditions (for different parent kernels different threads may be used).
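+A minimal sketch of this interplay follows. The class and method names are
+assumptions made for illustration; the real Factory API may differ, and the
+toy "pipeline" here is a sequential queue instead of a thread pool.
+#+begin_src cpp
+#include <cstdio>
+#include <queue>
+
+// Hypothetical kernel base class.
+struct Kernel {
+    Kernel* parent = nullptr;
+    virtual ~Kernel() {}
+    virtual void act() {}
+    virtual void react(Kernel*) {}
+};
+
+// A toy single-threaded "pipeline": a queue of kernels. In the real
+// framework kernels are processed by a thread pool, in parallel.
+std::queue<Kernel*> pipeline;
+void submit(Kernel* parent, Kernel* child) {
+    child->parent = parent;
+    pipeline.push(child);
+}
+
+struct Generate_part : Kernel {
+    int part_no;
+    explicit Generate_part(int i): part_no(i) {}
+    void act() override { std::printf("generate part %d\n", part_no); }
+};
+
+struct Generate_surface : Kernel {
+    int remaining = 4;
+    void act() override {                // decompose into subordinate kernels
+        for (int i=0; i<remaining; ++i) { submit(this, new Generate_part(i)); }
+    }
+    void react(Kernel* child) override { // collect each returned subordinate
+        delete child;
+        if (--remaining == 0) { std::puts("all parts are ready"); }
+    }
+};
+
+int main() {
+    Generate_surface root;
+    root.act();
+    while (!pipeline.empty()) {          // drain the queue sequentially
+        Kernel* k = pipeline.front(); pipeline.pop();
+        k->act();
+        if (k->parent) { k->parent->react(k); }
+    }
+}
+#+end_src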
-#+name: tab-arma-libs
-#+caption: A list of mathematical libraries used in the ARMA model implementation.
-#+attr_latex: :booktabs t
-| Library | What it is used for |
-|--------------------------------------------------------------+---------------------------------|
-| DCMT\nbsp{}cite:matsumoto1998dynamic | parallel PRNG |
-| Blitz\nbsp{}cite:veldhuizen1997will,veldhuizen2000techniques | multidimensional arrays |
-| GSL\nbsp{}cite:gsl2008scientific | PDF, CDF, FFT computation |
-| | checking process stationarity |
-| clFFT\nbsp{}cite:clfft | FFT computation |
-| LAPACK, GotoBLAS\nbsp{}cite:goto2008high,goto2008anatomy | finding AR coefficients |
-| GL, GLUT\nbsp{}cite:kilgard1996opengl | three-dimensional visualisation |
-| CGAL\nbsp{}cite:fabri2009cgal | wave numbers interpolation |

+Pipelines implement asynchronous calls to ~act~ and ~react~, and try to make as
+many parallel calls as possible considering the concurrency of the platform
+(number of cores per node and number of nodes in a cluster). A pipeline
+consists of a kernel pool, which contains all the subordinate kernels sent by
+their parents, and a thread pool that processes kernels in accordance with the
+rules outlined in the previous paragraph. A separate pipeline is used for each
+device: there are pipelines for parallel processing, schedule-based processing
+(periodic and delayed tasks), and a proxy pipeline for processing of kernels on
+other cluster nodes (see fig.\nbsp{}[[fig-subord-ppl]]).

-The AR model exhibits the best performance in the OpenMP and the worst
-performance in the OpenCL implementation, which are also the best and the worst
-results across all model and framework combinations. In the best model and
-framework combination AR performance is 4.5 times higher than MA performance,
-and 20 times higher than LH performance; in the worst combination it is 137
-times slower than MA and 2 times slower than LH. The ratio between the best
-(OpenMP) and the worst (OpenCL) AR model performance is several hundred. This
-is explained by the fact that the model formula\nbsp{}eqref:eq-ar-process is
-efficiently mapped onto the CPU architecture with its caches, low-bandwidth
-memory and small number of floating point units:
-- it contains no transcendental mathematical functions (sines, cosines and
- exponents),
-- all of the multiplications and additions in the formula can be implemented
- using FMA processor instructions,
-- and cache locality is achieved by using the Blitz library, which implements
- optimised traversals for multidimensional arrays based on the Hilbert
- space-filling curve.
-In contrast to the CPU, the GPU has fewer caches, high-bandwidth memory and a
-large number of floating point units, which is the worst-case scenario for the
-AR model:
-- there are no transcendental functions to compensate for high memory latency,
-- there are FMA instructions, but they do not improve performance due to high
- latency,
-- and an optimal traversal was not used due to the lack of libraries
- implementing it for a GPU.
-Finally, the GPU does not have synchronisation primitives that would allow
-implementing autoregressive dependencies between distinct wavy surface parts to
-compute them in parallel; instead, the processor launches a separate OpenCL
-kernel for each part, controlling all the dependencies between them on the CPU.
-So, for the AR model the CPU architecture is superior to the GPU due to better
-handling of complex information dependencies, non-intensive calculations
-(multiplications and additions) and complex memory access patterns.
+In principle, the machinery of kernels and pipelines mirrors that of procedures
+and call stacks, with the advantage that kernel methods are called
+asynchronously and in parallel to each other (as much as programme logic
+allows). The kernel fields are the stack, the ~act~ method is the sequence of
+processor instructions before a nested procedure call, and the ~react~ method
+is the sequence of processor instructions after the call. Constructing and
+sending subordinate kernels to the pipeline is the nested procedure call. The
+two methods are necessary to make calls asynchronous and to replace active
+waiting for completion of subordinate kernels with passive waiting. Pipelines,
+in turn, allow implementing passive waiting, and call the correct kernel
+methods by analysing their internal state.

-#+header: :results output raw :exports results
-#+name: tab-arma-performance
-#+begin_src R
-source(file.path("R", "benchmarks.R"))
-model_names <- list(
- ar.x="AR",
- ma.x="MA",
- lh.x="LH",
- ar.y="AR",
- ma.y="MA",
- lh.y="LH",
- Row.names="\\orgcmidrule{2-4}{5-6}Subroutine"
-)
-row_names <- list(
- determine_coefficients="Determine coefficients",
- validate="Validate model",
- generate_surface="Generate wavy surface",
- nit="NIT",
- write_all="Write output to files",
- copy_to_host="Copy data from GPU",
- velocity="Compute velocity potentials"
-)
-arma.print_openmp_vs_opencl(model_names, row_names)
-#+end_src

+#+name: fig-subord-ppl
+#+begin_src dot :exports results :file build/subord-ppl.pdf
+graph G {

-#+name: tab-arma-performance
-#+caption: Running time (s.) for OpenMP and OpenCL implementations of AR, MA and LH models.
-#+attr_latex: :booktabs t
-#+RESULTS: tab-arma-performance
-| | | | OpenMP | | OpenCL |
-| \orgcmidrule{2-4}{5-6}Subroutine | AR | MA | LH | AR | LH |
-|----------------------------------+------+------+--------+--------+--------|
-| Determine coefficients | 0.02 | 0.01 | 0.19 | 0.01 | 1.19 |
-| Validate model | 0.08 | 0.10 | | 0.08 | |
-| Generate wavy surface | 1.26 | 5.57 | 350.98 | 769.38 | 0.02 |
-| NIT | 7.11 | 7.43 | | 0.02 | |
-| Copy data from GPU | | | | 5.22 | 25.06 |
-| Compute velocity potentials | 0.05 | 0.05 | 0.06 | 0.03 | 0.03 |
-| Write output to files | 0.27 | 0.27 | 0.27 | 0.28 | 0.27 |

+ node [fontname="Old Standard",fontsize=14,margin="0.055,0",shape=box]
+ graph [nodesep="0.25",ranksep="0.25",rankdir="LR"]
+ edge [arrowsize=0.66]

-In contrast to the AR model, the LH model exhibits the best performance on GPU
-and the worst performance on CPU. The reasons for that are
-- the large number of transcendental functions in its formula, which help
- offset high memory latency,
-- the linear memory access pattern, which helps vectorise calculations and
- coalesce memory accesses by different hardware threads,
-- and no information dependencies between output grid points.
-Despite the fact that the GPU on the test platform is more performant than the
-CPU (in terms of floating point operations per second), the overall performance
-of the LH model compared to the AR model is lower. The reason for that is the
-slow data transfer between GPU and CPU memory.
+ subgraph cluster_daemon {
+ label="Daemon process"
+ style=filled
+ color=lightgrey

-The MA model is faster than LH and slower than AR. As the convolution in its
-formula is implemented using FFT, its performance depends on the performance of
-the underlying FFT implementation: GSL for CPU and clFFT for GPU. In this work
-the performance of the MA model on GPU was not tested due to the unavailability
-of the three-dimensional transform in the clFFT library; if the transform were
-available, it could have made the model even faster than AR.
+ factory [label="Factory"]
+ parallel_ppl [label="Parallel\npipeline"]
+ io_ppl [label="I/O\npipeline"]
+ sched_ppl [label="Schedule-based\npipeline"]
+ net_ppl [label="Network\npipeline"]
+ proc_ppl [label="Process\npipeline"]

-NIT takes less time on GPU and more time on CPU, but taking data transfer
-between CPU and GPU into consideration makes their execution times comparable.
-This is explained by the large number of transcendental mathematical functions
-that need to be computed for each wavy surface point to transform the
-distribution of its \(z\)-coordinates. For each point a non-linear
-transcendental equation\nbsp{}eqref:eq-distribution-transformation is solved
-using the bisection method. The GPU performs this task several hundred times
-faster than the CPU, but spends a lot of time transferring the result back to
-the main memory. So, the only possibility to optimise this routine is to use a
-root finding method with a quadratic convergence rate to reduce the number of
-transcendental functions that need to be computed.
+ upstream [label="Upstream\nthread pool"]
+ downstream [label="Downstream\nthread pool"]
+ }

-**** I/O performance.
-:PROPERTIES:
-:header-args:R: :results output raw :exports results
-:END:

+ factory--parallel_ppl
+ factory--io_ppl
+ factory--sched_ppl
+ factory--net_ppl
+ factory--proc_ppl

-Although in the current benchmarks writing data to files does not consume much
-of the running time, the use of network-mounted file systems may slow down this
-stage. To optimise it, wavy surface parts were written to the file as soon as a
-full time slice was available: all completed parts were grouped by the time
-slices they belong to and were written to the file as soon as the whole time
-slice was finished (fig.\nbsp{}[[fig-arma-io-events]]). That way a separate
-thread starts writing to files as soon as the first time slice is available and
-finishes after the main thread group finishes the computation. The total time
-needed to perform I/O is slightly increased, but the I/O is done in parallel to
-computation, so the total running time is decreased
-(table\nbsp{}[[tab-arma-io-performance]]). Using this approach with the local
-file system has the same effect, but the total reduction in execution time is
-small, because the local file system is more performant.
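-The gist of this optimisation can be sketched as follows (a simplified
-illustration, not the actual code): compute threads push finished time slices
-into a queue, and a single writer thread drains it in parallel.
-#+begin_src cpp
-#include <condition_variable>
-#include <fstream>
-#include <mutex>
-#include <queue>
-#include <thread>
-#include <vector>
-
-std::queue<std::vector<double>> slices; // finished time slices
-std::mutex mtx;
-std::condition_variable cv;
-bool done = false;
-
-void writer() { // runs in parallel to the computation
-    std::ofstream out("zeta.bin", std::ios::binary);
-    std::unique_lock<std::mutex> lock(mtx);
-    while (!done || !slices.empty()) {
-        cv.wait(lock, []{ return done || !slices.empty(); });
-        while (!slices.empty()) {
-            auto s = std::move(slices.front()); slices.pop();
-            lock.unlock(); // write without holding the lock
-            out.write(reinterpret_cast<const char*>(s.data()),
-                      s.size()*sizeof(double));
-            lock.lock();
-        }
-    }
-}
-
-int main() {
-    std::thread io(writer);
-    for (int t=0; t<100; ++t) { // main thread group: generate slices
-        std::vector<double> slice(64*64, double(t));
-        { std::lock_guard<std::mutex> g(mtx); slices.push(std::move(slice)); }
-        cv.notify_one();
-    }
-    { std::lock_guard<std::mutex> g(mtx); done = true; }
-    cv.notify_one();
-    io.join();
-}
-#+end_src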
+ subgraph cluster_hardware { + label="Compute devices" + style=filled + color=lightgrey -#+header :results output raw :exports results -#+name: tab-arma-io-performance -#+begin_src R -source(file.path("R", "benchmarks.R")) -suffix_names <- list( - xfs.x="XFS", - xfs.y="XFS", - nfs.x="NFS", - nfs.y="NFS", - gfs.x="GlusterFS", - gfs.y="GlusterFS", - Row.names="\\orgcmidrule{2-4}{5-7}Subroutine" -) -top_names <- c("Sequential", "Parallel") -row_names <- list( - generate_surface="Generate wavy surface", - write_all="Write output to files" -) -arma.print_sync_vs_async_io(suffix_names, row_names, top_names) -#+end_src + cpu [label="CPU"] + core0 [label="Core 0"] + core1 [label="Core 1"] + core2 [label="Core 2"] + core3 [label="Core 3"] -#+name: tab-arma-io-performance -#+caption: Running time (s.) for XFS, NFS and GlusterFS with sequential and parallel I/O programme versions. -#+attr_latex: :booktabs t -#+RESULTS: tab-arma-io-performance -| | | | Sequential | | | Parallel | -| \orgcmidrule{2-4}{5-7}Subroutine | XFS | NFS | GlusterFS | XFS | NFS | GlusterFS | -|----------------------------------+------+------+------------+------+------+-----------| -| Generate wavy surface | 1.26 | 1.26 | 1.33 | 1.33 | 3.30 | 11.06 | -| Write output to files | 0.28 | 2.34 | 10.95 | 0.00 | 0.00 | 0.00 | + storage [label="Storage"] + disk0 [label="Disk 0"] -#+name: fig-arma-io-events -#+header: :width 6 :height 9 :results output graphics -#+begin_src R :file build/arma-io-events.pdf -source(file.path("R", "benchmarks.R")) -fsnames <- list( - xfs="XFS", - nfs="NFS", - gfs="GlusterFS" -) -par(mfrow=c(3,1), family="serif") -arma.plot_io_events(fsnames) -#+end_src + network [label="Network"] + nic0 [label="NIC 0"] -#+name: fig-arma-io-events -#+caption: Event plot for XFS, NFS and GlusterFS that shows time intervals spent for I/O and computation by different threads. -#+RESULTS: fig-arma-io-events -[[file:build/arma-io-events.pdf]] + timer [label="Timer"] -**** Parallel velocity potential field computation. -The benchmarks for AR, MA and LH models showed that velocity potential field -computation consume only a fraction of total programme execution time, however, -the absolute computation time over a large \(XY\) domain may still be high. One -application where faster computation is needed is real-time simulation and -visualisation of wavy surface. The purpose of real-time visualisation is -two-fold: -- it allows to adjust parameters of the model and ACF function in - real-time and see the result of the changes immediately, and -- it allows to visually compare the size and the shape of regions where the most - wave energy is concentrated, which is valuable for education. + } -Since visualisation is done by GPU, doing velocity potential computation on CPU -may cause a bottleneck in data transfer between RAM and GPU memory. To -circumvent this, we use OpenCL/OpenGL interoperability API which allows creating -buffers, that are shared between OpenCL and OpenGL contexts thus eliminating -unnecessary copying of data. In addition to this, three-dimensional velocity -potential field formula\nbsp{}eqref:eq-phi-3d is particularly suitable for -computation by GPUs: -- it contains transcendental mathematical functions (hyperbolic cosines and - complex exponents); -- it is computed over large four-dimensional \(t,x,y,z\) region; -- it is analytic with no information dependencies between individual data points - in \(t\) and \(z\) dimensions. 
-These considerations make velocity potential field computation on GPU
-advantageous in the application to real-time simulation and visualisation of
-the wavy surface.
+ }

+ core0--cpu
+ core1--cpu
+ core2--cpu
+ core3--cpu

-In order to investigate how GPGPU computations can be used to speed up velocity
-potential field computation, we benchmarked a simplified version of
-eq.\nbsp{}eqref:eq-phi-3d:
-\begin{align}
- \label{eq:phi-linear}
- \phi(x,y,z,t) &= \InverseFourierY{
- \frac{ \Sinh{\smash{2\pi \Kveclen (z+h)}} }
- { 2\pi\Kveclen \Sinh{\smash{2\pi \Kveclen h}} }
- \FourierY{ \zeta_t }{u,v}
- }{x,y}\nonumber \\
- &= \InverseFourierY{ g_1(u,v) \FourierY{ g_2(x,y) }{u,v} }{x,y}.
-\end{align}
-The velocity potential solver was rewritten in OpenCL and its performance was
-compared to the existing OpenMP implementation.
+ disk0--storage
+ nic0--network

-For each implementation the overall performance of the solver for a particular
-time instant was measured. The velocity field was computed for one \(t\) point,
-for 128 \(z\) points below the wavy surface and for each \(x\) and \(y\) point
-of the four-dimensional \((t,x,y,z)\) grid. The only parameter that was varied
-between subsequent programme runs was the size of the grid along the \(x\)
-dimension.
+ parallel_ppl--upstream
+ parallel_ppl--downstream

-A different FFT library was used for each version of the solver: the GNU
-Scientific Library (GSL)\nbsp{}cite:galassi2015gnu for OpenMP and the clFFT
-library\nbsp{}cite:clfft for OpenCL. There are two major differences in the
-routines from these libraries.
-- The order of frequencies in the Fourier transforms is different, and the
- clFFT library requires reordering the result of\nbsp{}eqref:eq:phi-linear,
- whereas GSL does not.
-- The discontinuity at \((x,y)=(0,0)\) of the velocity potential field grid is
- handled automatically by the clFFT library, whereas the GSL library produces
- skewed values at this point.
-For the GSL library an additional interpolation from neighbouring points was
-used to smooth the velocity potential field at these points. We have not
-spotted other differences in the FFT implementations that impact the overall
-performance.
+ upstream--{core0,core1,core2,core3} [style="dashed"]
+ downstream--core0 [style="dashed"]
+
+ io_ppl--core0 [style="dashed"]
+ io_ppl--disk0 [style="dashed"]
+ sched_ppl--core0 [style="dashed"]
+ sched_ppl--timer [style="dashed"]
+ net_ppl--core0 [style="dashed"]
+ net_ppl--nic0 [style="dashed"]
+ proc_ppl--core0 [style="dashed"]
+
+ subgraph cluster_children {
+ style=filled
+ color=white
+
+ subgraph cluster_child0 {
+ label="Child process 0"
+ style=filled
+ color=lightgrey
+ labeljust=right

-In the course of the numerical experiments we have measured how much time each
-solver's implementation spends in each computation stage, to find out how
-efficient data copying between host and device is in the OpenCL implementation,
-and how the implementations compare in terms of performance.
+ app0_factory [label="Factory"]
+ app0 [label="Child process\rpipeline"]
+ }

-**** Performance of velocity potential OpenCL solver.
-:PROPERTIES:
-:header-args:R: :results output raw :exports results
-:END:
+# subgraph cluster_child1 {
+# label="Child process 1"
+# style=filled
+# color=lightgrey
+# labeljust=right
+#
+# app1_factory [label="Factory"]
+# app1 [label="Child process\rpipeline"]
+# }
+ }

-The experiments showed that the OpenCL implementation outperforms the OpenMP
-one by a factor of 10--15 (fig.\nbsp{}[[fig-arma-realtime-graph]]); however,
-the distribution of time between computation stages is different for each
-implementation (table\nbsp{}[[tab-arma-realtime]]). The major time consumer on
-the CPU is \(g_1\), whereas on the GPU its running time is comparable to that
-of \(g_2\). Copying the resulting velocity potential field between CPU and GPU
-consumes \(\approx{}20\%\) of solver execution time. \(g_2\) consumes most of
-the execution time of the OpenCL solver, and \(g_1\) of the OpenMP solver. In
-both implementations \(g_2\) is computed on the CPU, but in the GPU
-implementation the result is duplicated for each \(z\) grid point in order to
-perform multiplication of all \(XYZ\) planes along the \(z\) dimension in a
-single OpenCL kernel, and is subsequently copied to GPU memory, which severely
-hinders the overall performance. All benchmarks were run on a machine equipped
-with a GPU, the characteristics of which are summarised in
-table\nbsp{}[[tab-storm]].
+ proc_ppl--app0
+# proc_ppl--app1

-#+name: tab-storm
-#+caption: "Storm" test platform configuration.
-#+attr_latex: :booktabs t
-| CPU | Intel Core 2 Quad Q9550 |
-| RAM | 8Gb |
-| GPU | AMD Radeon R7 360 |
-| GPU memory | 2GB |
-| HDD | Seagate Barracuda, 7200 rpm |
-| No. of CPU cores | 4 |
+ app0_factory--app0 [constraint=false]
+# app1_factory--app1 [constraint=false]

-#+name: fig-arma-realtime-graph
-#+header: :results output graphics
-#+begin_src R :file build/realtime-performance.pdf
-source(file.path("R", "benchmarks.R"))
-par(family="serif")
-data <- arma.load_realtime_data()
-arma.plot_realtime_data(data)
-title(xlab="Wavy surface size", ylab="Time, s")
+}
 #+end_src

-#+name: fig-arma-realtime-graph
-#+caption: Performance comparison of CPU (OpenMP) and GPU (OpenCL) versions of the velocity potential solver.
-#+RESULTS: fig-arma-realtime-graph
-[[file:build/realtime-performance.pdf]]
+#+caption: Mapping of parent and child process pipelines to compute devices. Solid lines denote aggregation, dashed lines denote mapping between logical and physical entities.
+#+attr_latex: :width \textwidth
+#+name: fig-subord-ppl
+#+RESULTS: fig-subord-ppl
+[[file:build/subord-ppl.pdf]]

+**** Governing principles.
+The data processing pipeline model is based on the following principles;
+following them maximises the efficiency of a programme.
+- There is no notion of a message in the model: a kernel is itself a message
+ that can be sent over the network to another node, and it can directly access
+ any kernel on the local node. Only programme logic may guarantee the
+ existence of the kernel.
+- A kernel is a /cooperative routine/, which is submitted to the kernel pool
+ upon the call and is executed asynchronously by a scheduler. There can be any
+ number of calls to other subroutines inside the routine body. Every call
+ submits the corresponding subroutine to the kernel pool and returns
+ immediately. Kernels in the pool can be executed in any order; this fact is
+ used by the scheduler to exploit the parallelism offered by the computer by
+ distributing kernels from the pool across available cluster nodes and
+ processor cores.
+- Asynchronous execution prevents the use of explicit synchronisation after the
+ call to a subroutine is made; the system scheduler returns control flow to the
+ routine each time one of its subroutines returns. Such cooperation transforms
+ each routine that calls subroutines into an event handler, where each event is
+ a subroutine and the handler is the routine that called it.
+- The routine may communicate with any number of local kernels whose addresses
+ it knows; communication with kernels that are not adjacent in the call stack
+ complicates control flow, and the call stack loses its tree shape. Only
+ programme logic may guarantee the presence of communicating kernels in memory.
+ One way to ensure this is to perform communication between subroutines that
+ are called from the same routine. Since such communication is possible within
+ the hierarchy through the parent routine, it may be treated as an optimisation
+ that eliminates the overhead of transferring data over an intermediate node.
+ The situation is different for interactive or event-based programmes
+ (e.g. servers and programmes with graphical interfaces) in which this is the
+ primary type of communication.
+- In addition, communication that does not occur along hierarchical links and
+ is executed over the cluster network complicates the design of resiliency
+ algorithms: it is impossible to ensure that a kernel resides in the memory of
+ a neighbouring node, because that node may fail in the middle of executing
+ the corresponding routine. As a result, upon failure of a routine all of its
+ subroutines must be restarted. This encourages a programmer to construct
+ - deep tree hierarchies of tightly-coupled kernels (which communicate on the
+ same level of the hierarchy) to reduce the overhead of recomputation;
+ - fat tree hierarchies of loosely-coupled kernels, providing the maximal
+ degree of parallelism.
+ A deep hierarchy is not only a requirement of the technology: it helps
+ optimise communication of a large number of cluster nodes by reducing it to
+ communication of adjacent nodes.

-The reason for the different distribution of time between computation stages is
-the same as for the different AR model performance on CPU and GPU: the GPU has
-more floating point units and modules for transcendental mathematical
-functions, which are needed for the computation of \(g_1\), but lacks the
-caches which are needed to optimise the irregular memory access pattern of
-\(g_2\). In contrast to the AR model, the performance of multidimensional
-derivative computation on GPU is easier to improve, as there are no information
-dependencies between points: a multidimensional array library optimised for GPU
-may solve the problem; however, due to the unavailability of such a library
-this was not done in this work. Additionally, such a library may allow
-efficient computation of the non-simplified formula entirely on GPU, since the
-omitted terms also contain derivatives.

+So, control flow objects (or kernels) possess properties of both cooperative
+routines and event handlers.
+**** Definitions of hierarchies
+To disambiguate hierarchical links between daemon processes and kernels and to
+simplify the discussion, we will use the following naming conventions throughout
+the text. If the link is between two daemon processes, the relationship is
+denoted as /master-slave/. If the link is between two kernels, then the
+relationship is denoted as either /principal-subordinate/ or /parent-child/. The
+two hierarchies are orthogonal to each other in the sense that no kernel may
+have a link to a daemon, and vice versa.
Since the
+daemon hierarchy is used to distribute the load on cluster nodes, the kernel
+hierarchy is mapped onto it, and this mapping can be arbitrary: it is common to
+have a principal kernel on a slave node with its subordinate kernels
+distributed evenly between all cluster nodes (including the node where the
+principal is located). Both hierarchies can be arbitrarily deep, but "shallow"
+ones are preferred for highly parallel programmes, as fewer hops are needed
+when kernels are distributed between cluster nodes. Since there is a one-to-one
+correspondence between daemons and cluster nodes, they are used interchangeably
+in this work.

-#+name: tab-arma-realtime
-#+begin_src R
-source(file.path("R", "benchmarks.R"))
-routine_names <- list(
- harts_g1="\\(g_1\\)",
- harts_g2="\\(g_2\\)",
- harts_fft="FFT",
- harts_copy_to_host="Copy data from GPU"
-)
-column_names <- c("Subroutine", "OpenMP time, s", "OpenCL time, s")
-data <- arma.load_realtime_data()
-arma.print_table_for_realtime_data(data, routine_names, column_names)
-#+end_src

+**** Kernel structure and algorithms
+Each kernel has four types of fields (listed in table\nbsp{}[[tab-kernel-fields]]):
+- fields related to control flow,
+- fields defining the source location of the kernel,
+- fields defining the current location of the kernel, and
+- fields defining the target location of the kernel.

-#+name: tab-arma-realtime
-#+caption: Running time of real-time velocity potential solver subroutines for wavy surface \(\text{size}=16384\).
-#+attr_latex: :booktabs t
-#+RESULTS: tab-arma-realtime
-| Subroutine | OpenMP time, s | OpenCL time, s |
-|--------------------+----------------+----------------|
-| \(g_1\) | 4.6730 | 0.0038 |
-| \(g_2\) | 0.0002 | 0.8253 |
-| FFT | 2.8560 | 0.3585 |
-| Copy data from GPU | | 2.6357 |

+#+name: tab-kernel-fields
+#+caption: Description of kernel fields.
+#+attr_latex: :booktabs t :align lp{0.7\textwidth}
+| Field | Description |
+|-------------------+------------------------------------------------------------------------------------------------|
+| ~process_id~ | Identifier of a cluster-wide process (application) a kernel belongs to. |
+| ~id~ | Identifier of a kernel within a process. |
+| ~pipeline~ | Identifier of a pipeline a kernel is processed by. |
+| | |
+| ~exit_code~ | A result of a kernel execution. |
+| ~flags~ | Auxiliary bit flags used in scheduling. |
+| ~time_point~ | A time point at which a kernel is scheduled to be executed. |
+| | |
+| ~parent~ | Address/identifier of a parent kernel. |
+| ~src_ip~ | IP-address of a source cluster node. |
+| ~from_process_id~ | Identifier of a cluster-wide process from which a kernel originated. |
+| | |
+| ~principal~ | Address/identifier of a target kernel (a kernel to which the current one is sent or returned). |
+| ~dst_ip~ | IP-address of a destination cluster node. |

-As expected, sharing the same buffer between OpenCL and OpenGL contexts
-increases overall solver performance by eliminating data transfer between CPU
-and GPU memory, but it also requires the data to be in the vertex buffer object
-format that OpenGL can operate on. Conversion to this format is fast, but
-requires more memory to store the velocity potential field for visualisation,
-since each point now is a vector with three components. The other disadvantage
-of using OpenCL and OpenGL together is the requirement for manual locking of
-shared objects: failure to do so results in screen artefacts which can be
-removed only by rebooting the computer.
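+In C++ these fields might look as follows (an illustrative sketch: the types
+are assumptions, only the field names follow table\nbsp{}[[tab-kernel-fields]]):
+#+begin_src cpp
+#include <chrono>
+#include <cstdint>
+
+// Illustrative types; the actual framework types may differ.
+using Id = uint64_t;
+using Ip_address = uint32_t;
+
+struct Kernel_fields {
+    // control flow
+    Id process_id = 0;   // cluster-wide process (application)
+    Id id = 0;           // kernel identifier within a process
+    Id pipeline = 0;     // pipeline that processes the kernel
+    int exit_code = 0;   // result of kernel execution
+    uint32_t flags = 0;  // auxiliary scheduling flags
+    std::chrono::system_clock::time_point time_point; // scheduled time
+    // source location
+    Kernel_fields* parent = nullptr;
+    Ip_address src_ip = 0;
+    Id from_process_id = 0;
+    // target location
+    Kernel_fields* principal = nullptr;
+    Ip_address dst_ip = 0;
+};
+#+end_src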
+Upon creation each kernel is assigned a parent and a pipeline. If no other
+fields are set, then the kernel is an /upstream/ kernel\nbsp{}--- a kernel that
+can be distributed to any node and any processor core to exploit parallelism.
+If the principal field is set, then the kernel is a /downstream/ kernel\nbsp{}---
+a kernel that can only be sent to its principal, and the processor core to
+which the kernel is sent is defined by the principal memory address/identifier.
+If a downstream kernel is to be sent to another node, the destination
+IP-address must be set, otherwise the system will not find the target kernel.

-**** Conclusions.
-Performance benchmarks showed that the GPU outperforms the CPU in
-arithmetic-intensive tasks, i.e.\nbsp{}tasks requiring a high number of floating
-point operations per second; however, its performance degrades when the volume
-of data that needs to be copied between CPU and GPU memory increases or when
-the memory access pattern of the algorithm is non-linear. The first problem may
-be solved by using co-processors where high-bandwidth memory is located on the
-same die as the processor and the main memory. This eliminates the data
-transfer bottleneck, but may also increase execution time due to a smaller
-number of floating point units. The second problem may be solved
-programmatically, by using an OpenCL library that optimises multi-dimensional
-array traversals for GPUs; due to the unavailability of such a library this was
-not done in the present work.

+When kernel execution completes (its ~act~ method finishes), the kernel is
+explicitly sent to some other kernel; this directive is called inside the ~act~
+method. Usually, after the execution completes, a kernel is sent to its parent
+by setting the principal field to the address/identifier of the parent, the
+destination IP-address field to the source IP-address, and the process
+identifier to the source process identifier. After that the kernel becomes a
+downstream kernel and is sent by the system to the node where its current
+principal is located, without invoking the load balancing algorithm. For a
+downstream kernel the ~react~ method of its parent is called by a pipeline with
+the kernel as the method argument, so that the parent can collect the result of
+the execution.

-The ARMA model outperforms the LH model in benchmarks and does not require a
-GPU to do so. Its computational strengths are:
-- the absence of transcendental mathematical functions, and
-- simple algorithms for both AR and MA models, whose performance depends on
- the performance of the multi-dimensional array library and the FFT library.
-Providing the main functionality via low-level libraries makes performance
-portable: support for new processor architectures can be added by substituting
-the libraries. Finally, using an analytic formula for the velocity potential
-field made the velocity potential solver consume only a small fraction of total
-programme execution time. If such a formula were not available or did not have
-all integrals in the form of Fourier transforms, the performance of velocity
-potential computation would be much lower.

+There is no way to provide fine-grained resilience to cluster node failures if
+the programme contains downstream kernels other than those returning to their
+parents. Instead, the exit code of a kernel is checked and custom recovery code
+is executed, as in the sketch below. If there is no error checking, the system
+restarts execution from the first parent kernel that did not produce any
+downstream kernels.
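+A hypothetical example of such error checking in a parent kernel (illustration
+only; the names are not the actual Factory API):
+#+begin_src cpp
+// "exit_code", "submit" and the kernel classes are assumed names.
+struct Kernel {
+    int exit_code = 0; // non-zero means the subordinate failed
+    virtual ~Kernel() {}
+};
+
+void submit(Kernel* k); // resubmit a kernel to the pipeline
+
+struct Parent_kernel : Kernel {
+    void react(Kernel* child) {
+        if (child->exit_code != 0) {
+            // custom recovery: re-create the failed branch and try again
+            submit(make_replacement(child));
+        } else {
+            collect_result(child);
+        }
+    }
+    Kernel* make_replacement(Kernel* failed);
+    void collect_result(Kernel* finished);
+};
+#+end_src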
+This means that if a problem being solved by the programme has information
+dependencies between parts that are computed in parallel, and a node failure
+occurs during computation of these parts, then this computation is restarted
+from the very beginning, discarding any already computed parts. This does not
+occur for embarrassingly parallel programmes, where parallel parts do not have
+such information dependencies between each other: in this case only parts from
+failed nodes are recomputed and all previously computed parts are retained.

-** MPP implementation
*** Cluster node discovery algorithm
:PROPERTIES:
:CUSTOM_ID: sec:node-discovery
:END:
@@ -4166,7 +4167,7 @@ digraph {
#+end_src

#+caption: Tree hierarchy for 11 nodes with fan-out equal to 2.
-#+label: fig-tree-hierarchy-11
+#+name: fig-tree-hierarchy-11
#+RESULTS: fig-tree-hierarchy-11
[[file:build/tree-hierarchy-11.pdf]]

@@ -4418,7 +4419,7 @@ An example of fail over algorithm follows (fig.\nbsp{}[[fig-fail-over-example]])
#+end_src

#+caption: An example of the fail over algorithm in action.
-#+label: fig-fail-over-example
+#+name: fig-fail-over-example
#+attr_latex: :width \textwidth
#+RESULTS: fig-fail-over-example
[[file:build/fail-over-example.pdf]]

@@ -4515,11 +4516,13 @@ inapplicable for programmes with complicated logic.
#+name: fig-benchmark
#+begin_src R :file build/benchmark-xxx.pdf
# TODO
+plot(c(1:10))
#+end_src

#+caption: Performance of the hydrodynamics HPC application in the presence of node failures.
-#+label: fig-benchmark
+#+name: fig-benchmark
#+RESULTS: fig-benchmark
+[[file:build/benchmark-xxx.pdf]]

The results of the benchmark allow one to conclude that no matter whether a
principal or a subordinate node fails, the overall performance of a parallel
programme roughly
@@ -4529,11 +4532,13 @@ when a backup node fails performance penalty is much higher.

#+name: fig-slowdown
#+begin_src R :file build/slowdown-xxx.pdf
# TODO
+plot(c(1:10))
#+end_src

#+caption: Slowdown of the hydrodynamics HPC application in the presence of different types of node failures, compared to execution without failures but with the number of nodes reduced by one.
-#+label: fig-slowdown
+#+name: fig-slowdown
#+RESULTS: fig-slowdown
+[[file:build/slowdown-xxx.pdf]]

**** Discussion of test results.
Fail over algorithm guarantees to handle one failure per sequential programme