\documentclass[runningheads]{llncs}

\usepackage{amsmath}
\usepackage{booktabs}
\usepackage{graphicx}
\usepackage{url}
\usepackage{tikz}
\usetikzlibrary{arrows.meta}
\setlength{\tabcolsep}{4pt}

\definecolor{spbuWhite2}{RGB}{230,231,230}
\definecolor{spbuDarkGray}{HTML}{404040}
\input{preamble}

\begin{document}

\title{Virtual testbed: Ship motion simulation for~personal workstations}
\author{%
Alexander Degtyarev\orcidID{0000-0003-0967-2949} \and\\
Vasily Khramushin\orcidID{0000-0002-3357-169X} \and\\
Ivan Gankevich\textsuperscript{*}\orcidID{0000-0001-7067-6928} \and\\
Ivan Petriakov \and\\
Anton Gavrikov\orcidID{0000-0003-2128-8368} \and\\
Artemii Grigorev
}

\titlerunning{Virtual testbed}
\authorrunning{A.\,Degtyarev et al.}

\institute{Saint Petersburg State University, Russia\\
\email{\{a.degtyarev,v.khramushin,i.gankevich\}@spbu.ru},\\
\email{\{st049350,st047437,st016177\}@student.spbu.ru}\\
\url{https://spbu.ru/}}

\maketitle

\begin{abstract}

Virtual testbed is a computer programme that simulates ocean waves, ship
motions and compartment flooding. One feature of this programme is that it
visualises physical phenomena frame by frame as the simulation progresses. The
aim of the studies reported here was to assess how much performance can be
gained by using graphical accelerators instead of ordinary processors when the
same computations are repeated in a loop. We rewrote the programme's hot spots
in OpenCL to be able to execute them on a graphical accelerator and benchmarked
their performance with a number of real-world ship models. The analysis of the
results showed that copying data in and out of the accelerator's main memory
has a major impact on performance when done in a loop, and that the best
performance is achieved when copying in and out is done outside the loop (so
that data copying inside the loop involves the accelerator's main memory only).
This result is in line with how distributed computations are performed on a set
of cluster nodes, and suggests using similar approaches for a single
heterogeneous node with a graphical accelerator.

\keywords{%
wavy surface
\and pressure field
\and pressure force
\and ship
\and wetted surface
\and OpenCL
\and GPGPU.
}
\end{abstract}


\section{Introduction}

Simulation of ship motion in ocean waves is done in several computer
programmes~\cite{matusiak2013,shin2003nonlinear,ueng2008} that differ in what
physical phenomena they simulate (manoeuvring in waves, compartment flooding,
regular and irregular waves, wind, real-time simulation or batch processing
etc.) and in their application area (scientific studies, education or
entertainment). These programmes are virtual analogues of ship model basins
that are used to simulate the characteristics and behaviour of a ship in
particular sea conditions. The advantage of using a virtual ship model basin
over a physical one is that experiments are performed in real scale (with
real-sized ships and ocean waves) and on a computer, without the need to access
a high-technology facility.

Although all numerical experiments are performed on a computer, one computer is
not powerful enough to perform them fast. Often this problem is solved by using
a cluster of computer nodes or a supercomputer; however, a supercomputer or a
cluster is another high-technology facility that a researcher has to gain
access to. In that case a virtual ship model basin has little advantage over a
physical one: the research is slowed down by approval of official documents and
by time-sharing of computing resources.

One way of removing this barrier is to use a graphical accelerator to speed up
computations. In that case the simulation can be performed on a regular
workstation that has a dedicated graphics card. Most researchers use the GPU
for real-time visualisation, but it is rarely used for speeding up parts of the
simulation, let alone the whole programme. In~\cite{pita2016sph} the authors
use a GPU to speed up the computation of free surface motion inside a tank.
In~\cite{varela2011interactive} the authors rewrite their simulation code using
fast Fourier transforms and propose to use a GPU to gain more performance.
In~\cite{keeler2015integral} the authors use a GPU to simulate ocean waves.
Nevertheless, the most efficient way of using a GPU is to use it for both
computation and visualisation: this minimises data copying between CPU and GPU
memory and allows the use of mathematical models, data structures and numerical
methods that are tailored to graphical accelerators.

The present research proposes a numerical method for computing velocity
potentials and wave pressures on a graphical accelerator, briefly explains
other methods in the programme, and presents benchmarks for asynchronous
visualisation and simulation.


\section{Methods}

Virtual testbed is a computer programme that simulates ocean waves, ship
motions and compartment flooding. One feature that distinguishes it from
existing proposals is the use of graphical accelerators to speed up
computations and the real-time visualisation that these accelerators make
possible.

The programme consists of the following modules: \texttt{vessel} reads the ship
hull model from an input file, \texttt{gui} draws the current state of the
virtual world and \texttt{core} computes each step of the simulation. The
\texttt{core} module consists of components that are linked together in a
pipeline, in which the output of one component is the input of another one.
The computation is carried out in parallel to visualisation, and
synchronisation occurs after each simulation step. This keeps the graphical
user interface responsive even when the workstation is not powerful enough to
compute in real time.

The \texttt{core} module contains the following components: a wavy surface
generator, a velocity potential solver and a pressure force solver.  Each
component in the \texttt{core} module is interchangeable, which means that
different wavy surface generators can be used with the same velocity potential
solver.  Once initialised, these components are executed in a loop in which
each iteration computes the next time step of the simulation.  Although
iterations of the loop are sequential, each component is internally parallel,
i.e.~each component uses OpenMP or OpenCL to perform computations on every
processor or graphical core.  In other words, Virtual testbed follows the BSP
model~\cite{valiant1990bsp} for organising parallel computations, in which a
programme consists of sequential steps each of which is internally parallel
(fig.~\ref{fig:loop}).
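
The structure of this loop can be illustrated with the following minimal C++
sketch (hypothetical \texttt{State}, \texttt{compute\_step} and
\texttt{draw\_frame} stand in for the real components, which are named
differently in the programme): the simulation thread publishes the result of
each completed step, and the visualisation thread always draws the most
recently published state.
\begin{verbatim}
#include <atomic>
#include <mutex>
#include <thread>

// Hypothetical placeholders for the real components.
struct State { /* wavy surface, pressure field, ship position, ... */ };
State compute_step(const State& prev) { return prev; }  // core pipeline stub
void  draw_frame(const State&) {}                       // gui stub

int main() {
    State shared;                  // last completed simulation step
    std::mutex m;
    std::atomic<bool> stop{false};

    // Simulation loop: runs in its own thread; each step is internally
    // parallel (OpenMP or OpenCL) and is published when it is complete.
    std::thread simulation([&] {
        State local;
        while (!stop) {
            local = compute_step(local);
            std::lock_guard<std::mutex> lock(m);
            shared = local;        // synchronisation point after each step
        }
    });

    // Visualisation loop: always draws the most recent completed step, so
    // the interface stays responsive even if a step takes longer than a frame.
    for (int frame = 0; frame < 3600; ++frame) {
        State snapshot;
        {
            std::lock_guard<std::mutex> lock(m);
            snapshot = shared;
        }
        draw_frame(snapshot);
    }
    stop = true;
    simulation.join();
}
\end{verbatim}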

\subsection{Wavy surface generation}

There are three models that are used for wavy surface generation in Virtual
testbed: the autoregressive moving average (ARMA) model, the Stokes wave, and
the plane sine/cosine wave. It is not beneficial in terms of performance to
execute the ARMA model on a graphical
accelerator~\cite{gankevich2018ocean}: its algorithm does not use
transcendental mathematical functions, has a nonlinear memory access pattern
and complex information dependencies. It is much more efficient (even without
serious optimisations) to execute it on a processor. In contrast, the other two
wave models are embarrassingly parallel and easy to rewrite in OpenCL.

Each wave model outputs a three-dimensional (one temporal and two spatial
dimensions) field of wavy surface elevation, and the ARMA model post-processes
this field using the following algorithm. First, the autocovariance function
(ACF) is estimated from the input field using the Wiener---Khinchin theorem.
Then the ACF is used to build the autocovariance matrix and determine the
autoregressive model coefficients. Finally, the coefficients are used to
generate a new wavy surface elevation field.
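
Schematically (in notation introduced here for illustration; see the cited
works for the full model), the last step of this algorithm is the
three-dimensional autoregressive recursion
\begin{equation*}
	\zeta_{i,j,k} =
	\sum_{p=0}^{p_1} \sum_{q=0}^{p_2} \sum_{r=0}^{p_3}
	\Phi_{p,q,r} \, \zeta_{i-p,j-q,k-r} + \epsilon_{i,j,k},
	\qquad \Phi_{0,0,0} = 0,
\end{equation*}
where \(\Phi\) are the autoregressive coefficients determined from the ACF,
\(\epsilon\) is white noise, the first index is temporal and the other two are
spatial, and \((p_1,p_2,p_3)\) is the order of the model (the moving-average
part is omitted for brevity).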

The resulting field is stochastic, but has the same integral characteristics as
the original one. In particular, the probability distributions of wavy surface
elevation, wave height, wave length and wave period are preserved.  Using the
ARMA model for post-processing has several advantages.
\begin{itemize}

	\item It makes the wavy surface aperiodic (its period equals the period of
		the pseudo-random number generator, which can be considered infinite
		for all practical purposes), which allows statistical studies to be
		performed with Virtual testbed.

	\item It is much faster to generate a wavy surface with this model than
		with the original model, because the ARMA model involves only
		multiplications and additions rather than transcendental mathematical
		functions.

	\item This model allows any wavy surface to be used as the input (not only
		plane and Stokes waves). The frequency-directional spectrum of a
		particular ocean region can be used instead.

\end{itemize}
This paper gives only a short description of the model; please refer
to~\cite{gankevich2018thesis,gankevich2018ocean} for an in-depth study.

To summarise, the wavy surface generator produces a wavy surface elevation
field using one of the models described above. For the ARMA model it is
impractical to generate it on a graphical accelerator, and for the other models
the port is too trivial to discuss. This field is the input for the velocity
potential solver.

\subsection{Velocity potential computation}

Since the wavy surface generator produces a discretely given elevation field,
we cannot use the formula from linear wave theory to compute the velocity
potential; instead, we derived a formula for an arbitrary surface and an
inviscid incompressible fluid:
\begin{equation}
    \label{eq:phi}
    \phi(x,y,z,t) = \InverseFourier{
        \frac{ \Cosh{\smash{2\pi \WaveNumber (z+h)}} }{ 2\pi\WaveNumber }
        \frac{ \Fourier{ f(x,y,t) }{u,v} }
        { \Fourier{\mathcal{D}_3\left( x,y,\zeta\left(x,y\right) \right)}{u,v} }
    }{x,y},
\end{equation}
where
\begin{align*}
	f(x,y,t)&=\zeta_t(x,y,t) / \left( i f_1(x,y) + i f_2(x,y) - f_3(x,y) \right),\\
	f_1(x,y)&={\zeta_x}/{\SqrtZetaXY}-\zeta_x,
	\qquad
	\Fourier{\mathcal{D}_3\left(x,y,z\right)}{u,v}=\Cosh{\smash{2\pi\WaveNumber{}z}},\\
	f_2(x,y)&={\zeta_y}/{\SqrtZetaXY}-\zeta_y,
	\qquad
	\WaveNumber = \sqrt{u^2+v^2},\\
	f_3(x,y)&=1/\SqrtZetaXY.
\end{align*}
Here \(\WaveNumber\) is the wave number, \(\zeta\)~--- wavy surface elevation,
\(h\)~--- water depth, \(\mathcal{F}\)~--- Fourier transform, \(\phi\)~---
velocity potential.  The formula is derived as a solution of the continuity
equation with the kinematic boundary condition
\begin{align}
    & \nabla^2\phi = 0,\nonumber\\
    & \phi_t+\frac{1}{2} |\vec{\upsilon}|^2 + g\zeta=-\frac{p}{\rho}, & \text{at }z=\zeta(x,y,t),\label{eq:problem}\\
    & D\zeta = \nabla \phi \cdot \vec{n}, & \text{at }z=\zeta(x,y,t),\nonumber
\end{align}
without the assumptions of linear wave theory (that the wave length is much
larger than the wave height). Hence it can be used for ocean waves of arbitrary
amplitude. Here the first equation is the continuity equation, the second is
the dynamic boundary condition, and the last one is the kinematic boundary
condition; \(p\)~--- pressure, \(\rho\)~--- fluid density,
\(\vec{\upsilon}=(\phi_x,\phi_y,\phi_z)\)~--- velocity vector, \(g\)~---
acceleration of gravity, and \(D\)~--- substantial (Lagrangian) derivative.
Since we solve for \(\phi\), the dynamic boundary condition becomes an explicit
formula for pressure and is used to compute the pressure force acting on a ship
hull (see~sec.~\ref{sec:pressure-force}).

The integral in~\eqref{eq:phi} converges when the summation goes over the range
of wave numbers that are actually present in the discretely given wavy surface.
This range is determined numerically by finding crests and troughs in each
spatial dimension of the wavy surface with polynomial interpolation and using
these values to determine the wave length. For small-amplitude waves this
approach gives the same values of the velocity potential field as direct
application of the formula from linear wave theory.

Formula~\eqref{eq:phi} is particularly suitable for computation on a graphical
accelerator: it contains transcendental mathematical functions (complex
exponents) that help offset slow global memory loads and stores, it is
explicit, which makes it easy to compute in parallel, and it is written in
terms of Fourier transforms, which are efficient to compute on a graphical
accelerator~\cite{volkov2008fft}.
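
For illustration, assuming hypothetical \texttt{fft2}/\texttt{ifft2} helpers
(in the programme the equivalent transforms are OpenCL kernels), the evaluation
of~\eqref{eq:phi} for one depth \(z\) reduces to two forward transforms, an
element-wise operation in the spectral domain and one inverse transform; the
sketch below ignores the details of frequency ordering and of restricting the
summation to the wave numbers actually present in the surface.
\begin{verbatim}
#include <cmath>
#include <complex>
#include <vector>

using grid = std::vector<std::complex<float>>;  // nx*ny values, row-major

// Assumed to be provided by an FFT library (e.g. clFFT); declarations only.
grid fft2(const grid& a, int nx, int ny);
grid ifft2(const grid& a, int nx, int ny);

// Velocity potential at one depth z, following the formula for phi above.
grid velocity_potential(const grid& f, const grid& d3, int nx, int ny,
                        float z, float h, float dku, float dkv) {
    const float pi = 3.14159265f;
    grid F = fft2(f, nx, ny), D = fft2(d3, nx, ny);
    for (int i = 0; i < nx; ++i)
        for (int j = 0; j < ny; ++j) {
            float u = i*dku, v = j*dkv;        // wave numbers of the harmonic
            float k = std::sqrt(u*u + v*v);
            float c = (k == 0.0f) ? 0.0f
                : std::cosh(2*pi*k*(z + h)) / (2*pi*k);
            F[i*ny + j] *= c / D[i*ny + j];    // ratio of transforms, scaled
        }
    return ifft2(F, nx, ny);                   // back to physical space
}
\end{verbatim}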

This paper gives only a short description of the method; please refer
to~\cite{gankevich2018thesis,gankevich2018ocean} for an in-depth study.

\subsection{Pressure force computation}
\label{sec:pressure-force}

There are three stages of pressure force computation: determining the wetted
surface of the ship hull, computing the wave pressure field under the wavy
surface, and computing the pressure force acting on the ship hull.

In order to determine the wetted surface, the ship hull is decomposed into
triangular panels (faces) that approximate its geometry. Then for each panel
the algorithm determines its position relative to the wavy surface. If the
panel is fully submerged, it is considered wetted; if it is partially
submerged, the algorithm computes the intersection points using the bisection
method and wavy surface interpolation, and slices off the part of the panel
which is above the wavy surface (for simplicity the slice is assumed to be a
straight line, as is the case for sufficiently small panels).  Wave pressure at
any point under the wavy surface is computed from the dynamic boundary
condition of~\eqref{eq:problem} as an explicit formula. Then the pressure is
interpolated at the centre of each panel to compute the pressure force acting
on the ship hull.
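
For reference, solving the dynamic boundary condition of~\eqref{eq:problem} for
pressure gives this explicit formula:
\begin{equation*}
	p = -\rho\left(\phi_t + \tfrac{1}{2}|\vec{\upsilon}|^2 + g\zeta\right).
\end{equation*}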

It is straightforward to rewrite the pressure computation for a graphical
accelerator, as its algorithm reduces to looping over a large collection of
panels and performing the same calculations for each of them; however, the
dynamic boundary condition contains temporal and spatial derivatives that have
to be computed. Although computing the derivatives on a processor is fast,
copying the results to the accelerator's main memory proved to be inefficient,
as there are four arrays (one for each dimension) that need to be allocated and
transferred. Simply rewriting the code for OpenCL proved to be even more
inefficient due to the irregular memory access pattern for different array
dimensions. So we resorted to implementing the algorithm described
in~\cite{micikevicius2009derivative}, which stages intermediate results in the
local memory of the accelerator. Using this algorithm allowed us to store the
arrays of derivatives entirely in the graphical accelerator's main memory and
to eliminate the data transfer altogether.
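
The essence of that algorithm, reduced to one dimension for brevity, is
sketched below: each work-group stages a tile of the elevation field (plus a
one-cell halo) in local memory once and then computes central differences from
the fast local copy. The kernel is an illustration written for this paper, not
the programme's actual kernel; the work-group size and the boundary handling
are simplified.
\begin{verbatim}
// One-dimensional illustration of the tiled-derivative idea, embedded as an
// OpenCL C source string for a C++ host.  GROUP must equal the work-group
// size; edges are clamped to the boundary sample.
static const char* derivative_kernel_src = R"(
#define GROUP 256
kernel void derivative_t(global const float* zeta,   /* elevation samples  */
                         global float* zeta_t,       /* output: d(zeta)/dt */
                         const float dt, const int n)
{
    local float tile[GROUP + 2];                      /* tile + 1-cell halo */
    const int i  = get_global_id(0);
    const int li = get_local_id(0) + 1;
    tile[li] = zeta[(i < n) ? i : n-1];               /* one global load    */
    if (li == 1)     tile[0]       = (i > 0)   ? zeta[i-1] : zeta[0];
    if (li == GROUP) tile[GROUP+1] = (i+1 < n) ? zeta[i+1] : zeta[n-1];
    barrier(CLK_LOCAL_MEM_FENCE);
    if (i < n)                     /* central difference from local memory */
        zeta_t[i] = (tile[li+1] - tile[li-1]) / (2.0f*dt);
}
)";
\end{verbatim}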

\subsection{Translational and angular ship motion computation}

In order to compute ship position, translational velocity, angular displacement
and angular velocity for each time step, we solve the equations of
translational and angular motion (adapted from~\cite{matusiak2013}) using the
pressure force computed for each panel:
\begin{equation*}
	\dot{\vec{v}} = \frac{\vec{F}}{m} + g\vec{\tau}
	- \vec{\Omega}\times\vec{v} - \lambda\vec{v},
	\qquad
	\dot{\vec{h}} = \vec{G} - \vec{\Omega}\times\vec{h},
	\qquad
	\vec{h} = \InertiaMatrix \cdot \vec{\Omega}.
\end{equation*}
Here
\(\vec{\tau}=\left(-\sin\theta,\cos\theta\sin\phi,-\cos\theta\cos\phi\right)\)
is the vector that transforms \(g\) into the body-fixed coordinate system,
\(\vec{v}\)~--- translational velocity vector, \(g\)~--- gravitational
acceleration, \(\vec{\Omega}\)~--- angular velocity vector, \(\vec{F}\)~---
vector of external forces, \(m\)~--- ship mass, \(\lambda\)~--- damping
coefficient, \(\vec{h}\)~--- angular momentum, \(\vec{G}\)~--- the moment of
external forces, and \(\InertiaMatrix\)~--- inertia matrix.

We compute the total force \(\vec{F}\) and moment \(\vec{G}\) acting on the
ship hull by adding up the forces acting on each panel. Then we solve the
system of equations using the numerical Runge---Kutta---Fehlberg
method~\cite{mathews2004}. The vector of values determined by the method
consists of all components of the following vectors: ship position,
translational velocity, angular displacement and angular velocity (twelve
variables in total). Since we are not interested in the angular momentum
itself, we use the inertia matrix to obtain it from the angular velocity and
the inverse inertia matrix to convert it back. Twelve vector components are too
few to efficiently execute this method on a graphical accelerator, and there is
no other way of making an iterative method parallel, so we execute it on a
processor.
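
As an illustration, the right-hand side of the equations of motion above can be
written compactly as follows (a sketch written for this paper: the state vector
layout and the mapping of Euler angle rates to \(\vec{\Omega}\) are simplified,
and the actual programme integrates the system with the
Runge---Kutta---Fehlberg method).
\begin{verbatim}
#include <array>
#include <cmath>

using vec3    = std::array<double,3>;
using mat3    = std::array<vec3,3>;
using state12 = std::array<double,12>;  // x(3), v(3), angles(3), Omega(3)

static vec3 cross(const vec3& a, const vec3& b) {
    return {a[1]*b[2]-a[2]*b[1], a[2]*b[0]-a[0]*b[2], a[0]*b[1]-a[1]*b[0]};
}
static vec3 mul(const mat3& A, const vec3& x) {
    vec3 y{};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j) y[i] += A[i][j]*x[j];
    return y;
}

// Right-hand side of the equations of motion.  F and G are the total pressure
// force and moment summed over the wetted panels, I and I_inv the inertia
// matrix and its inverse, m the ship mass, lambda the damping coefficient.
// Euler angle rates are approximated by Omega to keep the sketch short.
state12 rhs(const state12& s, const vec3& F, const vec3& G,
            const mat3& I, const mat3& I_inv, double m, double lambda) {
    const double g = 9.81;
    vec3 v{s[3],s[4],s[5]}, Omega{s[9],s[10],s[11]};
    double phi = s[6], theta = s[7];    // roll and pitch angles
    vec3 tau{-std::sin(theta), std::cos(theta)*std::sin(phi),
             -std::cos(theta)*std::cos(phi)};
    vec3 cOv = cross(Omega, v), dv{};
    for (int i = 0; i < 3; ++i)
        dv[i] = F[i]/m + g*tau[i] - cOv[i] - lambda*v[i];
    vec3 h = mul(I, Omega), cOh = cross(Omega, h), dh{};
    for (int i = 0; i < 3; ++i) dh[i] = G[i] - cOh[i];
    vec3 dOmega = mul(I_inv, dh);  // back from angular momentum to velocity
    return {v[0],v[1],v[2], dv[0],dv[1],dv[2],
            Omega[0],Omega[1],Omega[2], dOmega[0],dOmega[1],dOmega[2]};
}
\end{verbatim}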

\section{Results}

\subsection{Test setup}

Virtual testbed performance was benchmarked in a number of tests. Since we use
both OpenMP and OpenCL technologies for parallel computing, we wanted to know
how performance scales with the number of processor cores and with and without
a graphical accelerator.

Graphical accelerators are divided into two broad categories: those for general
purpose computations and those for visualisation. Accelerators from the first
category typically have more double precision arithmetic units, and
accelerators from the second category are typically optimised for single
precision. The ratio of single to double precision performance can be as high
as 32. Virtual testbed produces correct results in both single and double
precision, but the OpenCL version supports only single precision, and the
graphical accelerators that we used have higher single precision performance
(tab.~\ref{tab:setup}). So we chose single precision in all benchmarks.

\begin{table}
	\centering
	\caption{Hardware configurations for benchmarks. For all benchmarks we
	used the GCC 8.1.1 compiler and the optimisation flags \texttt{-O3
	-march=native}.\label{tab:setup}}
	\begin{tabular}{lllrr}
		\toprule
		     &     &     & \multicolumn{2}{c}{GPU GFLOPS} \\
		\cmidrule(r){4-5}
		Node     & CPU              & GPU            & Single & Double \\
		\midrule
		Storm    & Intel Q9550      & Radeon R7 360  & 1613   & 101 \\
		GPUlab   & AMD FX-8370      & NVIDIA GTX1060 & 4375   & 137 \\
		Capybara & Intel E5-2630 v4 & NVIDIA P5000   & 8873   & 277 \\
		\bottomrule
	\end{tabular}
\end{table}

Double precision was used only for computing the autoregressive model
coefficients, because round-off and truncation errors make the covariance
matrices (from which the coefficients are computed) non-positive definite.
These matrices typically have very large condition numbers, and the linear
systems which they represent cannot be solved by Gaussian elimination or
\(LDL^T\) Cholesky decomposition, as these methods are numerically unstable in
this case (at least in our programme).

Since Virtual testbed does both visualisation and computation in real time, we
measured the performance of each stage of the main loop (fig.~\ref{fig:loop})
together with the parameters that affect it. To assess computational
performance we measured the execution time of each stage in microseconds (wall
clock time) together with the number of wetted panels and the wavy surface
size. To assess visualisation performance we measured the execution time of
each visualisation frame (one iteration of the visualisation main loop) and the
execution time of each computational frame (one iteration of the computational
loop), from which it is easy to compute the usual frames-per-second metric.
The tests were run for one minute and were forcibly stopped after the time ran
out. Wall clock time was reported as the median across all simulation steps (or
visualisation frames).
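
A minimal sketch of how such measurements can be taken is shown below (the
actual instrumentation in Virtual testbed differs; the stage body is a stub
here).
\begin{verbatim}
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

// Wall clock time of one pipeline stage in microseconds.
template <class Func>
long long time_stage(Func&& stage) {
    auto t0 = std::chrono::steady_clock::now();
    stage();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1-t0).count();
}

// Median over all simulation steps (or visualisation frames).
long long median(std::vector<long long> samples) {
    std::sort(samples.begin(), samples.end());
    return samples[samples.size()/2];
}

int main() {
    std::vector<long long> wave_pressure_us;
    for (int step = 0; step < 600; ++step)   // fixed number of steps for the sketch
        wave_pressure_us.push_back(time_stage([] { /* compute wave pressure */ }));
    std::printf("wave pressure: %lld us (median)\n", median(wave_pressure_us));
}
\end{verbatim}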

\begin{figure}
	\centering
	\begin{tikzpicture}[x=2.2cm,y=-1.4cm]
		\node[Block] (s1) at (0,0) {\strut{}Wavy surface};
		\node[Block] (s2) at (1,0) {\strut{}Autoreg. model};
		\node[Block] (s3) at (2,0) {\strut{}Wave numbers};
		\node[Block] (s4) at (3,0) {\strut{}Velocity potential};
		\node[Block] (s5) at (4,0) {\strut{}Wave pressure};
		\node[Block] (s6) at (4,1) {\strut{}Wetted surface};
		\node[Block] (s7) at (3,1) {\strut{}Elevation deriv.};
		\node[Block] (s8) at (2,1) {\strut{}Pressure force};
		\node[Block] (s9) at (1,1) {\strut{}Ship velocities};
		\path[Arrow] (s1) -- (s2);
		\path[Arrow] (s2) -- (s3);
		\path[Arrow] (s3) -- (s4);
		\path[Arrow] (s4) -- (s5);
		\path[Arrow] (s5.east) -- ++(0.2,0) |- (s6.east);
		\path[Arrow] (s6) -- (s7);
		\path[Arrow] (s7) -- (s8);
		\path[Arrow] (s8) -- (s9);
		\path[Arrow] (s9) -| (s1);
	\end{tikzpicture}
	\caption{Virtual testbed main loop.\label{fig:loop}}
\end{figure}

We ran all tests on each node for an increasing number of processor cores and
with and without the graphical accelerator. The code was compiled with the
maximum optimisation level, including processor-specific optimisations, which
enabled auto-vectorisation for further performance improvements.

We ran all tests for each of the three ship hull models: the Aurora cruiser,
MICW (a hull with reduced moments of inertia for the current waterline) and a
sphere.  The first two models represent real-world ships with known
characteristics; we took them from the Vessel database~\cite{vessel2015}
registered by our university, which is managed by the Hull
programme~\cite{hull2010}. Parameters of these ship models are listed in
tab.~\ref{tab:ships}, and the three-dimensional models are shown in
fig.~\ref{fig:models}. The sphere was used as a geometric shape whose wetted
surface area stays close to constant under the impact of ocean waves.

We ran all tests on each workstation from tab.~\ref{tab:setup} to investigate
whether there is a difference in performance between an ordinary workstation
and a computer built for visualisation. Storm is a regular workstation with a
mediocre processor and graphical accelerator, GPUlab is a slightly more
powerful workstation, and Capybara has the most powerful processor and a
professional graphical accelerator optimised for visualisation.

\begin{table}
	\centering
	\caption{Parameters of ship models that were used in the
	benchmarks.\label{tab:ships}}
	\begin{tabular}{lrrr}
		\toprule
					  & Aurora & MICW  & Sphere \\
		\midrule
		Length, m     & 126.5  &   260 &  100   \\
		Beam, m       &  16.8  &    32 &  100   \\
		Depth, m      &  14.5  &    31 &  100   \\
		No. of panels & 29306  & 10912 & 5120   \\
		\bottomrule
	\end{tabular}
\end{table}

\begin{figure}
	\centering
	\includegraphics[width=0.5\textwidth]{build/aurora.eps}\hfill
	\includegraphics[width=0.5\textwidth]{build/micw.eps}
	\caption{Aurora and MICW three-dimensional ship hull
	models.\label{fig:models}}
\end{figure}


\subsection{Benchmark results}

The main result of the benchmarks is that Virtual testbed is capable of running
on a regular workstation, with or without a graphical accelerator, in real time
with a high frame rate and small simulation time steps.

\begin{itemize}

	\item We achieved more than 60 simulation steps per second (SSPS) on each
		of the workstations. SSPS is the same metric as frames per second in
		visualisation, but for simulation. For Storm and GPUlab the most
		performant programme version was the one for the graphical
		accelerator, and for Capybara the most performant version was the one
		for the processor (tab.~\ref{tab:best}).

	\item The most performant node is GPUlab with 104 simulation steps per
		second.  Performance of Capybara is higher than that of Storm, but it
		uses a powerful server-grade processor to achieve it.

	\item Computational speedup for an increasing number of parallel OpenMP
		threads is far from linear: we achieved only a fourfold speedup for
		ten threads (fig.~\ref{fig:openmp}).

	\item Although GPUlab's processor has a higher clock frequency, even one
		core of Capybara's processor achieves slightly higher performance.

	\item The least powerful workstation (Storm) has the largest positive
		difference between graphical accelerator and processor performance
		(fig.~\ref{fig:histogram}). The most powerful workstation (Capybara)
		has a comparable but negative difference.

	\item Usage of the graphical accelerator increases the time needed to
		synchronise the simulation step with the visualisation frame
		(\textit{exchange} stage in fig.~\ref{fig:histogram}).

\end{itemize}

\begin{table}
	\centering
	\caption{Best median performance for each workstation and each ship hull.
	Here \(t\) is the simulation step computation time, \(m\)~--- the number
	of simulation steps per second (SSPS), \(n\)~--- the number of OpenMP
	threads, CL~--- OpenCL, MP~--- OpenMP.\label{tab:best}}
	\begin{tabular}{lrrrlrrrlrrrl}
		\toprule
				 & \multicolumn{4}{c}{Sphere}
				 & \multicolumn{4}{c}{Aurora}
				 & \multicolumn{4}{c}{MICW} \\
		\cmidrule(r){2-5}
		\cmidrule(r){6-9}
		\cmidrule(r){10-13}
		Node     & \(t\), ms & \(m\) & \(n\) & ver.
				 & \(t\), ms & \(m\) & \(n\) & ver.
				 & \(t\), ms & \(m\) & \(n\) & ver. \\
		\midrule
		Storm    &        16 &               64 &     1 & CL
		         &        14 &               72 &     1 & CL
		         &        29 &               34 &     1 & CL \\
		GPUlab   &        10 &              104 &     1 & CL
		         &         9 &              112 &     1 & CL
		         &        18 &               55 &     1 & CL \\
		Capybara &        12 &               85 &    10 & MP
		         &        15 &               66 &    10 & MP
		         &        19 &               51 &    10 & MP \\
		\bottomrule
	\end{tabular}
\end{table}

\begin{figure}
	\centering
	\includegraphics{build/openmp.eps}
	\caption{Median simulation step computation time for different numbers of
	parallel threads (sphere).\label{fig:openmp}}
\end{figure}

\begin{figure}
	\centering
	\includegraphics{build/histogram.eps}
	\caption{Median computation time for each main loop stage, each node and
	the sequential, OpenMP and OpenCL versions (sphere).\label{fig:histogram}}
\end{figure}


\section{Discussion}

Although the graphical accelerator gives a noticeable performance increase only
for the least powerful workstation, we considered only the simplest simulation
scenario (ship motions induced by a plane Stokes wave) in the benchmarks: the
problem that we solve is too small to saturate the graphical accelerator cores.
We tried to eliminate expensive data copying operations between host and
graphical accelerator memory where possible, but we need to simulate more
physical phenomena and at a larger scale (ships with a large number of panels,
a large number of compartments, wind simulation etc.) to verify that the
performance gap increases for powerful workstations. On the bright side, even
if a computer does not have a powerful graphical accelerator (e.g.~a laptop
with integrated graphics), it can still run Virtual testbed with acceptable
performance.

A large SSPS is needed neither for smooth visualisation nor for accurate
simulation; however, it gives a performance reserve for a further increase in
the detail and scale of the simulated physical phenomena. We manually limit the
simulation time step to a minimum of \(1/30\) of a second to prevent
floating-point numerical errors due to small time steps. Also, we limit the
maximum time step so that the wave frequency does not exceed the Nyquist
frequency of the time grid, which is required for precise computation of
partial time derivatives.
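
In code this amounts to a simple clamp of the time step (a sketch;
\texttt{min\_wave\_period} is a hypothetical parameter that stands for the
shortest wave period present in the simulation).
\begin{verbatim}
#include <algorithm>

// Keep the time step between 1/30 s (to avoid floating-point errors from
// tiny steps) and half the shortest wave period (so that the waves stay at
// or below the Nyquist frequency of the time grid).
double clamp_time_step(double dt, double min_wave_period) {
    const double min_dt = 1.0 / 30.0;
    const double max_dt = 0.5 * min_wave_period;
    return std::clamp(dt, min_dt, max_dt);
}
\end{verbatim}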

Real-time simulation is essential not only for educational purposes, but also
for on-board intelligent systems. These systems analyse data coming from a
multitude of sensors the ship is equipped with, calculate the probability of
occurrence of a particular dangerous situation (e.g.~a large roll angle) and
try to prevent it by notifying the ship's crew and an operator on shore. This
is one of the directions of future work.

Overall performance depends on the size of the ship rather than on the number
of panels. The MICW hull has fewer panels than Aurora, but it is twice as large
and its performance is twice as low (tab.~\ref{tab:best}). The size of the hull
affects the size of the grid, at each point of which the velocity potential and
then the pressure are computed.  These routines are much more compute intensive
than wetted surface determination and pressure force computation, whose
performance depends on the number of panels.

Despite the fact that Capybara has the highest floating-point performance
across all workstations in the benchmarks, Virtual testbed runs faster on its
processor than on its graphical accelerator. Routine-by-routine investigation
showed that this graphics card is simply slower at computing even the fully
parallel Stokes wave generator OpenCL kernel. This kernel fills a
three-dimensional array using an explicit formula for the wave profile; it has
a linear memory access pattern and no information dependencies between array
elements.  It seems that the P5000 is not optimised for general purpose
computations. We did not conduct visualisation benchmarks, so we do not know
whether it is more efficient in that case.

Although Capybara's processor has 20 hardware threads (2 threads per core),
OpenMP performance does not scale beyond 10 threads. Parallel threads in our
code do mostly the same operations but with different data, so switching
between different hardware threads running on the same core, in the hope that
the second thread performs useful work while the first one stalls on an
input/output or load/store operation, is not efficient. This problem is usually
solved by turning the main loop into a pipeline in which each stage is executed
in parallel and data constantly flows between subsequent stages. This approach
is easy to implement when the computational grid can be divided into distinct
parts, which is not the case for Virtual testbed: there are too many
dependencies between the parts, and the position and the size of each part can
be different in each stage. Graphical accelerators have more efficient hardware
thread switching, and a pipeline would probably not improve their performance,
so we did not take this approach.

Our approach to performing computations on a heterogeneous node (a node with
both a processor and a graphical accelerator) is similar to the approach
followed by the authors of the Spark distributed data processing
framework~\cite{zaharia2016spark}. In this framework data is first loaded into
the main memory of each cluster node and then processed in a loop. Each
iteration of this loop is run by all nodes in parallel, and synchronisation
occurs at the end of each iteration. This is in contrast to the MapReduce
framework~\cite{dean2008mapreduce}, where after each iteration the data is
written to stable storage and then read back into the main memory to continue
processing. Not interacting with slow stable storage on every iteration allows
Spark to achieve an order of magnitude higher performance than Hadoop (the
open-source version of MapReduce) on iterative algorithms.

For a heterogeneous node, the analogue of stable storage, reads and writes to
which are much slower than accesses to the main memory, is the graphical
accelerator's memory. To minimise interaction with this memory, we do not read
intermediate results of our computations from it, but reuse the arrays that
already reside there. (As a concrete example, we do not copy the pressure field
from the graphical accelerator, only the forces for each panel.) This allows us
to eliminate expensive data transfer between CPU and GPU memory. In early
versions of our programme this copying slowed down the simulation
significantly.
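
To make this concrete, one iteration of the pressure force part of the loop may
look roughly as follows (an illustrative OpenCL host-side sketch written for
this paper, not the programme's actual code; the buffers and kernels are
assumed to be created once, outside the loop, and error handling is omitted).
\begin{verbatim}
#include <CL/cl.h>
#include <vector>

// The pressure field is produced and consumed entirely in device memory;
// the host reads back only the small per-panel force array.
void pressure_force_step(cl_command_queue queue,
                         cl_kernel pressure_kernel, cl_kernel force_kernel,
                         cl_mem phi, cl_mem pressure, cl_mem force_buf,
                         size_t grid_size, size_t npanels,
                         std::vector<float>& forces /* 3*npanels floats */) {
    // Pressure kernel: reads the velocity potential, writes the pressure
    // field; both arrays stay in the accelerator's memory.
    clSetKernelArg(pressure_kernel, 0, sizeof(cl_mem), &phi);
    clSetKernelArg(pressure_kernel, 1, sizeof(cl_mem), &pressure);
    clEnqueueNDRangeKernel(queue, pressure_kernel, 1, nullptr, &grid_size,
                           nullptr, 0, nullptr, nullptr);
    // Force kernel: interpolates the device-resident pressure at the panel
    // centres and accumulates forces, again without any host transfer.
    clSetKernelArg(force_kernel, 0, sizeof(cl_mem), &pressure);
    clSetKernelArg(force_kernel, 1, sizeof(cl_mem), &force_buf);
    clEnqueueNDRangeKernel(queue, force_kernel, 1, nullptr, &npanels,
                           nullptr, 0, nullptr, nullptr);
    // The only host transfer per iteration: the per-panel forces.
    clEnqueueReadBuffer(queue, force_buf, CL_TRUE, 0,
                        forces.size()*sizeof(float), forces.data(),
                        0, nullptr, nullptr);
}
\end{verbatim}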

Although a heterogeneous node is not a cluster, an efficient programme
architecture for such a node is similar to that of distributed data processing
systems: we process data only on the device whose main memory contains the
data, and we never transfer intermediate computation results between devices.
To implement this principle, the whole iteration of the programme's main loop
has to be executed either on the processor or on the graphical accelerator.
Given the time constraints, the future maintenance burden and the programme's
code size, it was difficult to fully follow this approach, but we came to a
reasonable approximation of it. We still have functions in Virtual testbed that
work with intermediate results on the processor (the \textit{clamp} stage in
fig.~\ref{fig:histogram}, which reduces the size of the computational grid to
the points near the ship), but the amount of data that is copied to and from
the graphical accelerator is relatively small.


\section{Conclusion}

We showed that ship motion simulation can be performed on a regular workstation
with or without a graphical accelerator. Our programme includes only the
minimal set of mathematical models needed for ship motion calculation, but it
has a performance reserve for the inclusion of additional models.  We plan to
implement simulation of rudder and propeller, compartment flooding and fire,
wind and trochoidal waves.  Apart from that, the main direction of future
research is the creation of an on-board intelligent system that would include
Virtual testbed as an integral part for simulating and predicting physical
phenomena.

\subsubsection*{Acknowledgements.}
This research work is supported by Saint Petersburg State University (grants
no.~26520170 and~39417213).


\bibliographystyle{splncs04}
\bibliography{references}

\end{document}