\documentclass[runningheads]{llncs}

\usepackage{amsmath}
\usepackage{booktabs}
\usepackage{graphicx}
\usepackage{url}
\usepackage{tikz}
\usetikzlibrary{arrows.meta}
\setlength{\tabcolsep}{4pt}

\definecolor{spbuWhite2}{RGB}{230,231,230}
\definecolor{spbuDarkGray}{HTML}{404040}
\input{preamble}

\begin{document}

\title{Virtual testbed: Ship motion simulation for~personal workstations}
\author{%
Alexander Degtyarev\orcidID{0000-0003-0967-2949} \and\\
Vasily Khramushin\orcidID{0000-0002-3357-169X} \and\\
Ivan Gankevich\textsuperscript{*}\orcidID{0000-0001-7067-6928} \and\\
Ivan Petriakov \and\\
Anton Gavrikov\orcidID{0000-0003-2128-8368} \and\\
Artemii Grigorev
}

\titlerunning{Virtual testbed}
\authorrunning{A.\,Degtyarev et al.}

\institute{Saint Petersburg State University, Russia\\
\email{\{a.degtyarev,v.khramushin,i.gankevich\}@spbu.ru},\\
\email{\{st049350,st047437,st016177\}@student.spbu.ru}\\
\url{https://spbu.ru/}}

\maketitle

\begin{abstract}

Virtual testbed is a computer programme that simulates ocean waves, ship
motions and compartment flooding. One feature of this programme is that it
visualises physical phenomena frame by frame as the simulation progresses. The
aim of the studies reported here was to assess how much performance can be
gained by using graphical accelerators instead of ordinary processors when
repeating the same computations in a loop. We rewrote the programme's hot
spots in OpenCL to be able to execute them on a graphical accelerator and
benchmarked their performance with a number of real-world ship models. The
analysis of the results showed that copying data in and out of the
accelerator's main memory has a major impact on performance when done in a
loop, and that the best performance is achieved when the copying is done
outside the loop (i.e.~when data copying inside the loop involves the
accelerator's main memory only). This result is in line with how distributed
computations are performed on a set of cluster nodes and suggests using
similar approaches for a single heterogeneous node with a graphical
accelerator.

\keywords{%
wavy surface
\and pressure field
\and pressure force
\and ship
\and wetted surface
\and OpenCL
\and GPGPU.
}
\end{abstract}


\section{Introduction}

Simulation of ship motion in ocean waves is implemented in several computer
programmes~\cite{matusiak2013,shin2003nonlinear,ueng2008} that differ in which
physical phenomena they simulate (manoeuvring in waves, compartment flooding,
regular and irregular waves, wind, real-time simulation or batch processing,
etc.) and in their application area (scientific studies, education or
entertainment). These programmes are virtual analogues of ship model basins
that are used to simulate the characteristics and behaviour of a ship in
particular sea conditions. The advantage of using a virtual ship model basin
over a physical one is that experiments are performed at real scale (with
real-sized ships and ocean waves) and on a computer, without the need to
access a high-technology facility.

Although all numerical experiments are performed on a computer, a single
computer is often not powerful enough to perform them quickly. Often this
problem is solved by using a cluster of computer nodes or a supercomputer;
however, a supercomputer or a cluster is another high-technology facility that
a researcher has to gain access to.
In that case a virtual ship model basin has little advantage over a physical
one: the research is slowed down by the approval of official documents and by
time-sharing of computing resources.

One way of removing this barrier is to use a graphical accelerator to speed up
computations. In that case the simulation can be performed on a regular
workstation that has a dedicated graphics card. Most researchers use the GPU
for real-time visualisation, but it is rarely used to speed up parts of the
simulation, let alone the whole programme. In~\cite{pita2016sph} the authors
use a GPU to speed up the computation of free surface motion inside a tank.
In~\cite{varela2011interactive} the authors rewrite their simulation code
using Fast Fourier transforms and propose to use a GPU to gain more
performance. In~\cite{keeler2015integral} the authors use a GPU to simulate
ocean waves. Nevertheless, the most efficient way of using a GPU is to use it
for both computation and visualisation: this allows one to minimise data
copying between CPU and GPU memory and to use mathematical models, data
structures and numerical methods that are tailored to graphical accelerators.

The present research proposes a numerical method for computing velocity
potentials and wave pressures on a graphical accelerator, briefly explains the
other methods in the programme, and presents benchmarks for asynchronous
visualisation and simulation.


\section{Methods}

Virtual testbed is a computer programme that simulates ocean waves, ship
motions and compartment flooding. One feature that distinguishes it from
existing proposals is the use of graphical accelerators to speed up
computations and the real-time visualisation that these accelerators make
possible.

The programme consists of the following modules: \texttt{vessel} reads the
ship hull model from an input file, \texttt{gui} draws the current state of
the virtual world and \texttt{core} computes each step of the simulation. The
\texttt{core} module consists of components that are linked together in a
pipeline, in which the output of one component is the input of another. The
computation is carried out in parallel with the visualisation, and
synchronisation occurs after each simulation step. This makes the graphical
user interface responsive even when the workstation is not powerful enough to
compute in real time.

The \texttt{core} module contains the following components: a wavy surface
generator, a velocity potential solver and a pressure force solver. Each
component in the \texttt{core} module is interchangeable, which means that
different wavy surface generators can be used with the same velocity potential
solver. Once initialised, these components are executed in a loop in which
each iteration computes the next time step of the simulation. Although the
iterations of the loop are sequential, each component is internally parallel,
i.e.~it uses OpenMP or OpenCL to perform computations on all processor or
graphical accelerator cores. In other words, Virtual testbed follows the BSP
model~\cite{valiant1990bsp} for organising parallel computations, in which a
programme consists of sequential steps, each of which is internally parallel
(fig.~\ref{fig:loop}).
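
The following sketch illustrates this loop structure; the names are
illustrative assumptions and do not correspond to the actual Virtual testbed
source code. The solver components run one after another inside the loop, each
of them internally parallel, and the state is handed over to the visualisation
thread once per simulation step.

\begin{verbatim}
// Illustrative sketch of the BSP-style main loop (names are
// assumptions, not the actual Virtual testbed API).
#include <mutex>

struct State { /* surface, pressures, forces, ship state */ };

// Stubs standing in for the real solver components; in the
// real programme each of them is internally parallel.
void generate_wavy_surface(State&) {}
void solve_velocity_potential(State&) {}
void compute_pressure_forces(State&) {}
void integrate_ship_motion(State&) {}
void publish_to_gui(State&) {}

void simulation_loop(State& s, std::mutex& gui, bool& run) {
    while (run) {
        generate_wavy_surface(s);     // internally parallel
        solve_velocity_potential(s);  // internally parallel
        compute_pressure_forces(s);   // internally parallel
        integrate_ship_motion(s);     // sequential (12 unknowns)
        std::lock_guard<std::mutex> lock(gui);
        publish_to_gui(s);            // sync with visualisation
    }
}
\end{verbatim}
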
\subsection{Wavy surface generation}

There are three models that are used for wavy surface generation in Virtual
testbed: the autoregressive moving average (ARMA) model, the Stokes wave, and
a plane sine/cosine wave. It is not beneficial in terms of performance to
execute the ARMA model on a graphical
accelerator~\cite{gankevich2018ocean}: its algorithm does not use
transcendental mathematical functions, and it has a nonlinear memory access
pattern and complex information dependencies. It is much more efficient (even
without serious optimisations) to execute it on a processor. In contrast, the
other two wave models are embarrassingly parallel and easy to rewrite in
OpenCL.

Each wave model outputs a three-dimensional (one temporal and two spatial
dimensions) field of wavy surface elevation, and the ARMA model post-processes
this field using the following algorithm. First, the autocovariance function
(ACF) is estimated from the input field using the Wiener---Khinchin theorem.
Then the ACF is used to build the autocovariance matrix and determine the
autoregressive model coefficients. Finally, the coefficients are used to
generate a new wavy surface elevation field.

The resulting field is stochastic, but has the same integral characteristics
as the original one. In particular, the probability distribution functions of
wavy surface elevation, wave height, wave length and wave period are
preserved. Using the ARMA model for post-processing has several advantages.
\begin{itemize}

\item It makes the wavy surface aperiodic (its period equals the period of
      the pseudo-random number generator, which can be considered infinite
      for all practical purposes), which makes it possible to perform
      statistical studies with Virtual testbed.

\item It is much faster to generate the wavy surface with this model than
      with the original one, because the ARMA model involves only
      multiplications and additions rather than transcendental mathematical
      functions.

\item This model accepts any wavy surface as the input (not only plane and
      Stokes waves); the frequency-directional spectrum of a particular
      ocean region can be used instead.

\end{itemize}
This paper gives only a short description of the model; please refer
to~\cite{gankevich2018thesis,gankevich2018ocean} for an in-depth study.

To summarise, the wavy surface generator produces a wavy surface elevation
field using one of the models described above. For the ARMA model it is
impractical to generate this field on a graphical accelerator, and for the
other models the computation is too trivial to discuss. This field is the
input for the velocity potential solver.
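
To illustrate why the generation step is cheap (it involves only
multiplications and additions), the following sketch shows a one-dimensional
analogue of the autoregressive generation. It is an illustration only: the
real model is three-dimensional and its coefficients come from the
autocovariance matrix described above.

\begin{verbatim}
// One-dimensional analogue of the autoregressive generation
// step (illustrative; the real model is three-dimensional).
#include <cmath>
#include <random>
#include <vector>

std::vector<double>
generate_ar(const std::vector<double>& phi, // AR coefficients
            double noise_variance, std::size_t n) {
    std::mt19937 prng{42};
    std::normal_distribution<double> eps{
        0.0, std::sqrt(noise_variance)};
    std::vector<double> zeta(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        double value = eps(prng);               // white noise
        for (std::size_t k = 1; k <= phi.size() && k <= i; ++k)
            value += phi[k - 1] * zeta[i - k];  // weighted sum
        zeta[i] = value;
    }
    return zeta;
}
\end{verbatim}
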
\subsection{Velocity potential computation}

Since the wavy surface generator produces a discretely given elevation field,
we cannot use the formula from linear wave theory to compute the velocity
potential; instead, we derived a formula for an arbitrary surface of an
inviscid incompressible fluid:
\begin{equation}
\label{eq:phi}
\phi(x,y,z,t) = \InverseFourier{
\frac{ \Cosh{\smash{2\pi \WaveNumber (z+h)}} }{ 2\pi\WaveNumber }
\frac{ \Fourier{ f(x,y,t) }{u,v} }
{ \Fourier{\mathcal{D}_3\left( x,y,\zeta\left(x,y\right) \right)}{u,v} }
}{x,y},
\end{equation}
where
\begin{align*}
f(x,y,t)&=\zeta_t(x,y,t) / \left( i f_1(x,y) + i f_2(x,y) - f_3(x,y) \right),\\
f_1(x,y)&={\zeta_x}/{\SqrtZetaXY}-\zeta_x,
\qquad
\Fourier{\mathcal{D}_3\left(x,y,z\right)}{u,v}=\Cosh{\smash{2\pi\WaveNumber{}z}},\\
f_2(x,y)&={\zeta_y}/{\SqrtZetaXY}-\zeta_y,
\qquad
\WaveNumber = \sqrt{u^2+v^2},\\
f_3(x,y)&=1/\SqrtZetaXY.
\end{align*}
Here \(\WaveNumber\) is the wave number, \(\zeta\)~--- wavy surface elevation,
\(h\)~--- water depth, \(\mathcal{F}\)~--- Fourier transform, \(\phi\)~---
velocity potential. The formula is derived as a solution of the continuity
equation with the kinematic boundary condition
\begin{align}
& \nabla^2\phi = 0,\nonumber\\
& \phi_t+\frac{1}{2} |\vec{\upsilon}|^2 + g\zeta=-\frac{p}{\rho}, & \text{at }z=\zeta(x,y,t),\label{eq:problem}\\
& D\zeta = \nabla \phi \cdot \vec{n}, & \text{at }z=\zeta(x,y,t),\nonumber
\end{align}
without the assumptions of linear wave theory (i.e.~that the wave length is
much larger than the wave height); hence it can be used for ocean waves of
arbitrary amplitude. Here the first equation is the continuity equation, the
second is the dynamic boundary condition, and the last one is the kinematic
boundary condition; \(p\)~--- pressure, \(\rho\)~--- fluid density,
\(\vec{\upsilon}=(\phi_x,\phi_y,\phi_z)\)~--- velocity vector, \(g\)~---
acceleration of gravity, and \(D\)~--- substantial (Lagrange) derivative.
Since we solve for \(\phi\), the dynamic boundary condition becomes an
explicit formula for the pressure and is used to compute the pressure force
acting on the ship hull (see~sec.~\ref{sec:pressure-force}).

The integral in~\eqref{eq:phi} converges when the summation goes over the
range of wave numbers that are actually present in the discretely given wavy
surface. This range is determined numerically by finding crests and troughs
along each spatial dimension of the wavy surface with polynomial interpolation
and using these values to estimate the wave length. For small-amplitude waves
this approach gives the same velocity potential field as the direct
application of the formula from linear wave theory.

Formula~\eqref{eq:phi} is particularly suitable for computation on a graphical
accelerator: it contains transcendental mathematical functions (complex
exponentials) that help offset slow global memory loads and stores; it is
explicit, which makes it easy to compute in parallel; and it is written in
terms of Fourier transforms, which are efficient to compute on a graphical
accelerator~\cite{volkov2008fft}.

This paper gives only a short description of the method; please refer
to~\cite{gankevich2018thesis,gankevich2018ocean} for an in-depth study.
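
A sketch of how formula~\eqref{eq:phi} maps onto code is given below; it is an
illustration under stated assumptions, not the actual Virtual testbed
implementation. The two-dimensional FFT helpers are assumed to wrap an FFT
library (e.g.~clFFT on the accelerator), the frequency normalisation is the
usual one for a uniform grid, and the restriction of wave numbers to the range
actually present in the surface, as well as the treatment of the zero wave
number, are omitted.

\begin{verbatim}
// Spectral evaluation of the velocity potential (sketch).
// fft2/ifft2 are assumed to wrap a 2-D FFT library.
#include <cmath>
#include <complex>
#include <vector>

using field = std::vector<std::complex<double>>; // row-major

field fft2(const field&, int nx, int ny);  // forward (assumed)
field ifft2(const field&, int nx, int ny); // inverse (assumed)

// phi(x,y,z,t) for one time step t and one depth z.
field velocity_potential(const field& f,     // f(x,y,t)
                         const field& denom, // F[D3 at z=zeta]
                         int nx, int ny, double dx, double dy,
                         double z, double h) {
    const double two_pi = 6.283185307179586;
    field spec = fft2(f, nx, ny);
    for (int j = 0; j < ny; ++j)
        for (int i = 0; i < nx; ++i) {
            double u = (i <= nx/2 ? i : i - nx) / (nx*dx);
            double v = (j <= ny/2 ? j : j - ny) / (ny*dy);
            double k = std::sqrt(u*u + v*v); // wave number
            double g = k > 0
                ? std::cosh(two_pi*k*(z + h)) / (two_pi*k)
                : 0.0; // zero wave number treated separately
            spec[j*nx + i] *= g / denom[j*nx + i];
        }
    return ifft2(spec, nx, ny);
}
\end{verbatim}
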
\subsection{Pressure force computation}
\label{sec:pressure-force}

There are three stages of pressure force computation: determining the wetted
surface of the ship hull, computing the wave pressure field under the wavy
surface, and computing the pressure force acting on the ship hull.

In order to determine the wetted surface, the ship hull is decomposed into
triangular panels (faces) that approximate its geometry. Then for each panel
the algorithm determines its position relative to the wavy surface. If a panel
is fully submerged, it is considered wetted; if it is partially submerged, the
algorithm computes the intersection points using the bisection method and wavy
surface interpolation, and cuts off the part of the panel that is above the
wavy surface (for simplicity the cut is assumed to be a straight line, as is
the case for sufficiently small panels). The wave pressure at any point under
the wavy surface is computed from the dynamic boundary condition
of~\eqref{eq:problem}, used as an explicit formula. Then the pressure is
interpolated at the centre of each panel to compute the pressure force acting
on the ship hull.

It is straightforward to rewrite the pressure computation for a graphical
accelerator, as the algorithm reduces to looping over a large collection of
panels and performing the same calculations for each of them; however, the
dynamic boundary condition contains temporal and spatial derivatives that have
to be computed. Although computing the derivatives on a processor is fast,
copying the results to the accelerator's main memory proved to be inefficient,
as there are four arrays (one for each dimension) that need to be allocated
and transferred. Simply rewriting the code in OpenCL proved to be even more
inefficient due to the irregular memory access pattern across array
dimensions. So we resorted to the algorithm described
in~\cite{micikevicius2009derivative}, which stores intermediate results in the
local memory of the accelerator. This algorithm allowed us to keep the arrays
of derivatives entirely in the graphical accelerator's main memory and to
eliminate the data transfer altogether.

\subsection{Translational and angular ship motion computation}

In order to compute the ship position, translational velocity, angular
displacement and angular velocity at each time step, we solve the equations of
translational and angular motion (adapted from~\cite{matusiak2013}) using the
pressure force computed for each panel:
\begin{equation*}
\dot{\vec{v}} = \frac{\vec{F}}{m} + g\vec{\tau}
- \vec{\Omega}\times\vec{v} - \lambda\vec{v},
\qquad
\dot{\vec{h}} = \vec{G} - \vec{\Omega}\times\vec{h},
\qquad
\vec{h} = \InertiaMatrix \cdot \vec{\Omega}.
\end{equation*}
Here
\(\vec{\tau}=\left(-\sin\theta,\cos\theta\sin\phi,-\cos\theta\cos\phi\right)\)
is a vector that transforms \(g\) into the body-fixed coordinate system,
\(\vec{v}\)~--- translational velocity vector, \(g\)~--- gravitational
acceleration, \(\vec{\Omega}\)~--- angular velocity vector, \(\vec{F}\)~---
vector of external forces, \(m\)~--- ship mass, \(\lambda\)~--- damping
coefficient, \(\vec{h}\)~--- angular momentum, \(\vec{G}\)~--- the moment of
external forces, and \(\InertiaMatrix\)~--- inertia matrix.

We compute the total force \(\vec{F}\) and moment \(\vec{G}\) acting on the
ship hull by adding up the forces acting on each panel (a sketch of this
summation is given below).
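
The sketch below illustrates this summation; the panel and vector types are
illustrative assumptions, not the actual Virtual testbed data structures. Each
wetted panel contributes the pressure at its centre times its area along the
panel normal, plus the corresponding moment about the centre of mass.

\begin{verbatim}
// Total pressure force and moment over wetted panels (sketch;
// types and names are illustrative). The loop over panels is
// embarrassingly parallel (OpenMP or the accelerator).
#include <array>
#include <vector>

using vec3 = std::array<double, 3>;

struct Panel {
    vec3 centre;     // centroid in body-fixed coordinates
    vec3 normal;     // unit normal
    double area;
    double pressure; // interpolated at the centroid
    bool wetted;
};

vec3 cross(const vec3& a, const vec3& b) {
    return {a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2],
            a[0]*b[1] - a[1]*b[0]};
}

void total_force_and_moment(const std::vector<Panel>& panels,
                            const vec3& centre_of_mass,
                            vec3& force, vec3& moment) {
    force = {0, 0, 0};
    moment = {0, 0, 0};
    for (const Panel& p : panels) {
        if (!p.wetted) continue;
        vec3 f{p.pressure*p.area*p.normal[0],
               p.pressure*p.area*p.normal[1],
               p.pressure*p.area*p.normal[2]};
        vec3 r{p.centre[0] - centre_of_mass[0],
               p.centre[1] - centre_of_mass[1],
               p.centre[2] - centre_of_mass[2]};
        vec3 m = cross(r, f);
        for (int i = 0; i < 3; ++i) {
            force[i] += f[i];
            moment[i] += m[i];
        }
    }
}
\end{verbatim}
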
Then we solve this system of equations using the numerical
Runge---Kutta---Fehlberg method~\cite{mathews2004}. The vector of values
determined by the method consists of all components of the following vectors:
ship position, translational velocity, angular displacement and angular
velocity (twelve variables in total). Since we are not interested in the
angular momentum, we use the inertia matrix to obtain it from the angular
velocity and the inverse inertia matrix to convert it back. Twelve vector
components are too few to execute this method efficiently on a graphical
accelerator, and there is no other way of making this iterative method
parallel, so we execute it on the processor.

\section{Results}

\subsection{Test setup}

Virtual testbed performance was benchmarked in a number of tests. Since we use
both OpenMP and OpenCL technologies for parallel computing, we wanted to know
how performance scales with the number of processor cores and with and without
a graphical accelerator.

Graphical accelerators are divided into two broad categories: those for
general purpose computations and those for visualisation. Accelerators from
the first category typically have more double precision arithmetic units,
while accelerators from the second category are optimised for single
precision. The ratio of single to double precision performance can be as high
as 32. Virtual testbed produces correct results in both single and double
precision, but the OpenCL version supports only single precision, and the
graphical accelerators that we used have much higher single precision
performance (tab.~\ref{tab:setup}), so we chose single precision in all
benchmarks.

\begin{table}
\centering
\caption{Hardware configurations for benchmarks. For all benchmarks we
used the GCC version 8.1.1 compiler and optimisation flags \texttt{-O3
-march=native}.\label{tab:setup}}
\begin{tabular}{lllrr}
\toprule
& & & \multicolumn{2}{c}{GPU GFLOPS} \\
\cmidrule(r){4-5}
Node & CPU & GPU & Single & Double \\
\midrule
Storm & Intel Q9550 & Radeon R7 360 & 1613 & 101 \\
GPUlab & AMD FX-8370 & NVIDIA GTX1060 & 4375 & 137 \\
Capybara & Intel E5-2630 v4 & NVIDIA P5000 & 8873 & 277 \\
\bottomrule
\end{tabular}
\end{table}

Double precision was used only for computing the autoregressive model
coefficients, because round-off and truncation errors make the covariance
matrices (from which the coefficients are computed) non-positive definite.
These matrices typically have very large condition numbers, and the linear
system which they represent cannot be solved by Gaussian elimination or
\(LDL^T\) Cholesky decomposition, as these methods are numerically unstable in
this case (at least in our programme).

Since Virtual testbed does both visualisation and computation in real time, we
measured the performance of each stage of the main loop (fig.~\ref{fig:loop})
along with the parameters that affect it. To assess computational performance
we measured the execution time of each stage in microseconds (wall clock time)
together with the number of wetted panels and the wavy surface size. To assess
visualisation performance we measured the execution time of each visualisation
frame (one iteration of the visualisation main loop) and the execution time of
each computational frame (one iteration of the computational loop), from which
it is easy to compute the usual frames-per-second metric.
Each test was run for one minute and was forcibly stopped after the time ran
out. Wall clock time was taken as the median across all simulation steps (or
visualisation frames).

\begin{figure}
\centering
\begin{tikzpicture}[x=2.2cm,y=-1.4cm]
\node[Block] (s1) at (0,0) {\strut{}Wavy surface};
\node[Block] (s2) at (1,0) {\strut{}Autoreg. model};
\node[Block] (s3) at (2,0) {\strut{}Wave numbers};
\node[Block] (s4) at (3,0) {\strut{}Velocity potential};
\node[Block] (s5) at (4,0) {\strut{}Wave pressure};
\node[Block] (s6) at (4,1) {\strut{}Wetted surface};
\node[Block] (s7) at (3,1) {\strut{}Elevation deriv.};
\node[Block] (s8) at (2,1) {\strut{}Pressure force};
\node[Block] (s9) at (1,1) {\strut{}Ship velocities};
\path[Arrow] (s1) -- (s2);
\path[Arrow] (s2) -- (s3);
\path[Arrow] (s3) -- (s4);
\path[Arrow] (s4) -- (s5);
\path[Arrow] (s5.east) -- ++(0.2,0) |- (s6.east);
\path[Arrow] (s6) -- (s7);
\path[Arrow] (s7) -- (s8);
\path[Arrow] (s8) -- (s9);
\path[Arrow] (s9) -| (s1);
\end{tikzpicture}
\caption{Virtual testbed main loop.\label{fig:loop}}
\end{figure}

We ran all tests on each node for an increasing number of processor cores and
with and without a graphical accelerator. The code was compiled with the
maximum optimisation level, including processor-specific optimisations, which
enabled auto-vectorisation for further performance improvements.

We ran all tests for each of the three ship hull models: the Aurora cruiser,
MICW (a hull with reduced moments of inertia for the current waterline) and a
sphere. The first two models represent real-world ships with known
characteristics; we took them from the Vessel database~\cite{vessel2015}
registered by our university, which is managed by the Hull
programme~\cite{hull2010}. Parameters of these ship models are listed in
tab.~\ref{tab:ships}; the three-dimensional models are shown in
fig.~\ref{fig:models}. The sphere was used as a geometric shape whose wetted
surface area stays close to constant under the impact of ocean waves.

We ran all tests on each workstation from tab.~\ref{tab:setup} to investigate
whether there is a difference in performance between an ordinary workstation
and a computer built for visualisation. Storm is a regular workstation with a
mediocre processor and graphical accelerator, GPUlab is a slightly more
powerful workstation, and Capybara has the most powerful processor and a
professional graphical accelerator optimised for visualisation.

\begin{table}
\centering
\caption{Parameters of ship models that were used in the
benchmarks.\label{tab:ships}}
\begin{tabular}{lrrr}
\toprule
& Aurora & MICW & Sphere \\
\midrule
Length, m & 126.5 & 260 & 100 \\
Beam, m & 16.8 & 32 & 100 \\
Depth, m & 14.5 & 31 & 100 \\
No.\ of panels & 29306 & 10912 & 5120 \\
\bottomrule
\end{tabular}
\end{table}

\begin{figure}
\centering
\includegraphics[width=0.5\textwidth]{build/aurora.eps}\hfill
\includegraphics[width=0.5\textwidth]{build/micw.eps}
\caption{Aurora and MICW three-dimensional ship hull
models.\label{fig:models}}
\end{figure}


\subsection{Benchmark results}

The main result of the benchmarks is that Virtual testbed is capable of
running on a regular workstation, with or without a graphical accelerator, in
real time with a high frame rate and small simulation time steps.

\begin{itemize}

\item We achieved more than 60 simulation steps per second (SSPS) on each
      of the workstations. SSPS is the analogue of frames per second in
      visualisation, but for the simulation. For Storm and GPUlab the most
      performant programme version was the one for the graphical
      accelerator, and for Capybara it was the one for the processor
      (tab.~\ref{tab:best}).

\item The most performant node is GPUlab with 104 simulation steps per
      second. Capybara's performance is higher than Storm's, but it uses a
      powerful server-grade processor to achieve it.

\item The computational speedup for an increasing number of parallel OpenMP
      threads is far from linear: we achieved only a fourfold speedup with
      ten threads (fig.~\ref{fig:openmp}).

\item Although GPUlab's processor has a higher clock frequency, even one
      core of Capybara's processor achieves slightly higher performance.

\item The least powerful workstation (Storm) has the largest positive
      difference between graphical accelerator and processor performance
      (fig.~\ref{fig:histogram}). The most powerful workstation (Capybara)
      has a comparable but negative difference.

\item Using the graphical accelerator increases the time needed to
      synchronise the simulation step with the visualisation frame (the
      \textit{exchange} stage in fig.~\ref{fig:histogram}).

\end{itemize}

\begin{table}
\centering
\caption{Best median performance for each workstation and each ship hull.
Here \(t\) is the simulation step computation time, \(m\)~--- the number of
simulation steps per second (SSPS), \(n\)~--- the number of OpenMP
threads, CL~--- OpenCL, MP~--- OpenMP.\label{tab:best}}
\begin{tabular}{lrrrlrrrlrrrl}
\toprule
& \multicolumn{4}{c}{Sphere}
& \multicolumn{4}{c}{Aurora}
& \multicolumn{4}{c}{MICW} \\
\cmidrule(r){2-5}
\cmidrule(r){6-9}
\cmidrule(r){10-13}
Node & \(t\), ms & \(m\) & \(n\) & ver.
     & \(t\), ms & \(m\) & \(n\) & ver.
     & \(t\), ms & \(m\) & \(n\) & ver. \\
\midrule
Storm    & 16 &  64 &  1 & CL
         & 14 &  72 &  1 & CL
         & 29 &  34 &  1 & CL \\
GPUlab   & 10 & 104 &  1 & CL
         &  9 & 112 &  1 & CL
         & 18 &  55 &  1 & CL \\
Capybara & 12 &  85 & 10 & MP
         & 15 &  66 & 10 & MP
         & 19 &  51 & 10 & MP \\
\bottomrule
\end{tabular}
\end{table}

\begin{figure}
\centering
\includegraphics{build/openmp.eps}
\caption{Median simulation step computation time for different numbers of
parallel threads (sphere).\label{fig:openmp}}
\end{figure}

\begin{figure}
\centering
\includegraphics{build/histogram.eps}
\caption{Median computation time for each main loop stage, each node and the
sequential, OpenMP and OpenCL versions (sphere).\label{fig:histogram}}
\end{figure}


\section{Discussion}

Although the graphical accelerator gives a noticeable performance increase
only for the least powerful workstation, we considered only the simplest
simulation scenario (ship motions induced by a plane Stokes wave) in the
benchmarks: the problem that we solve is too small to saturate the graphical
accelerator cores. We tried to eliminate expensive data copying operations
between host and graphical accelerator memory where possible, but we need to
simulate more physical phenomena and at a larger scale (ships with a large
number of panels, a large number of compartments, wind simulation, etc.)
to verify that the performance gap increases for powerful workstations. On the
bright side, even if a computer does not have a powerful graphical accelerator
(e.g.~a laptop with integrated graphics), it can still run Virtual testbed
with acceptable performance.

A large SSPS is needed neither for smooth visualisation nor for accurate
simulation; however, it gives a performance reserve for a further increase in
the detail and scale of the simulated physical phenomena. We manually limit
the simulation time step to a minimum of \(1/30\) of a second to prevent
floating-point numerical errors due to small time steps. We also limit the
maximum time step so that the wave frequency does not exceed the Nyquist
frequency, which is required for precise computation of partial time
derivatives.

Real-time simulation is essential not only for educational purposes, but also
for on-board intelligent systems. These systems analyse data coming from the
multitude of sensors the ship is equipped with, calculate the probability of
occurrence of a particular dangerous situation (e.g.~a large roll angle) and
try to prevent it by notifying the ship's crew and an operator on shore. This
is one of the directions of future work.

Overall performance depends on the size of the ship rather than on the number
of panels. The MICW hull has fewer panels than Aurora, but it is two times
larger and its performance is two times worse (tab.~\ref{tab:best}). The size
of the hull determines the size of the grid at each point of which the
velocity potential and then the pressure are computed. These routines are much
more compute-intensive than wetted surface determination and pressure force
computation, whose performance depends on the number of panels.

Despite the fact that Capybara has the highest floating-point performance
across all workstations in the benchmarks, Virtual testbed runs faster on its
processor than on its graphical accelerator. A routine-by-routine
investigation showed that this graphics card is simply slower at computing
even the fully parallel Stokes wave generator OpenCL kernel. This kernel fills
a three-dimensional array using an explicit formula for the wave profile; it
has a linear memory access pattern and no information dependencies between
array elements. It seems that the P5000 is not optimised for general purpose
computations. We did not conduct visualisation benchmarks, so we do not know
whether it is more efficient in that case.

Although Capybara's processor has 20 hardware threads (2 threads per core),
OpenMP performance does not scale beyond 10 threads. Parallel threads in our
code perform mostly the same operations but on different data, so switching
between hardware threads running on the same core, in the hope that the second
thread performs useful work while the first one stalls on an input/output or
load/store operation, is not efficient. This problem is usually solved by
turning the main loop into a pipeline in which each stage is executed in
parallel and data flows continuously between subsequent stages. This approach
is easy to implement when the computational grid can be divided into distinct
parts, which is not the case for Virtual testbed: there are too many
dependencies between the parts, and the position and size of each part can be
different at each stage.
Graphical accelerators have more efficient hardware thread switching, and a
pipeline would probably not improve their performance, so we did not take this
approach.

Our approach to performing computations on a heterogeneous node (a node with
both a processor and a graphical accelerator) is similar to the approach
followed by the authors of the Spark distributed data processing
framework~\cite{zaharia2016spark}. In this framework data is first loaded into
the main memory of each cluster node and then processed in a loop. Each
iteration of this loop is run by all nodes in parallel, and synchronisation
occurs at the end of each iteration. This is in contrast to the MapReduce
framework~\cite{dean2008mapreduce}, where after each iteration the data is
written to stable storage and then read back into main memory to continue
processing. Not interacting with slow stable storage on every iteration allows
Spark to achieve an order of magnitude higher performance than Hadoop (an
open-source implementation of MapReduce) on iterative algorithms.

For a heterogeneous node, the analogue of stable storage, reads and writes to
which are much slower than accesses to main memory, is the graphical
accelerator memory. To minimise interaction with this memory, we do not read
intermediate results of our computations from it, but reuse the arrays that
already reside there. (As a concrete example, we do not copy the pressure
field from the graphical accelerator, only the forces for each panel.) This
allows us to eliminate expensive data transfers between CPU and GPU memory. In
early versions of our programme this copying slowed down the simulation
significantly.

Although a heterogeneous node is not a cluster, an efficient programme
architecture for such a node is similar to that of distributed data processing
systems: we process data only on the device whose main memory contains the
data, and we never transfer intermediate computation results between devices.
To implement this principle, the whole iteration of the programme's main loop
has to be executed either on the processor or on the graphical accelerator.
Given the time constraints, the future maintenance burden and the programme's
code size, it was difficult to follow this approach fully, but we came to a
reasonable approximation of it. We still have functions in Virtual testbed
that work with intermediate results on the processor (the \textit{clamp} stage
in fig.~\ref{fig:histogram}, which reduces the size of the computational grid
to the points near the ship), but the amount of data that is copied to and
from the graphical accelerator is relatively small.


\section{Conclusion}

We showed that ship motion simulation can be performed on a regular
workstation with or without a graphical accelerator. Our programme includes
only the minimal set of mathematical models needed to calculate ship motions,
but it has a performance reserve for the inclusion of additional models. We
plan to implement simulation of rudder and propeller, compartment flooding and
fire, wind and trochoidal waves. Apart from that, the main direction of future
research is the creation of an on-board intelligent system that would include
Virtual testbed as an integral part for simulating and predicting physical
phenomena.

\subsubsection*{Acknowledgements.}
This research is supported by Saint Petersburg State University (grants
no.~26520170 and~39417213).


\bibliographystyle{splncs04}
\bibliography{references}

\end{document}