iccsa-20-waves

git clone https://git.igankevich.com/iccsa-20-waves.git

commit a3c96a9c87c9aa8c92cbd2006814af1529f12dae
parent 7c184cec46392f521e67086de1c2597691ae6a08
Author: Ivan Gankevich <i.gankevich@spbu.ru>
Date:   Mon, 16 Mar 2020 14:17:44 +0300

OpenCL.

Diffstat:
main.tex | 148++++++++++++++++++++++++++++++++++++++++++++++++-------------------------------
1 file changed, 90 insertions(+), 58 deletions(-)

diff --git a/main.tex b/main.tex
@@ -39,7 +39,7 @@
   \email{i.gankevich@spbu.ru},\\
   \email{st047437@student.spbu.ru},\\
   \email{st016177@student.spbu.ru},\\
-  \email{v.khramusin@spbu.ru}\\
+  \email{v.khramushin@spbu.ru}\\
   \url{https://spbu.ru/}}
 
 \maketitle
@@ -359,8 +359,7 @@ We plug this expression into the boundary condition and get
 \exp\left( -i\vec{d}_s\cdot\vec\zeta_0 \right) = 0.
 \end{equation*}
 Here we substitute \(\vec{d}_r\cdot\vec{n}\) with \(-\vec{d}_i\cdot\vec{n}\)
-which is derived from the formula for \(\vec{d}_r\)
-(see~sec.~\ref{sec-formulae}).
+which is derived from the formula for \(\vec{d}_r\).
 Hence, the boundary condition reduces to
 \begin{equation*}
 C_1 - C_2 \exp\left( -i\vec{d}_s\cdot\vec\zeta_0 \right) = 0.
@@ -373,64 +372,77 @@ This solution reduces to the solution for the wall when \(\vec{n}=(0,0,1)\).
 
 %\input{stationary-surface.tex}
 %\input{progressively-moving-surface.tex}
+\section{Results}
 
 \subsection{OpenCL implementation}
 
-Virtual testbed is a program for personal computers.
-Its main feature is to perform all calculations in real time,
-  paying attention to the high accuracy of calculations.
-This is achieved by using graphical accelerator.
-Generating Gerstner waves isn't an exception.
-We implement algorithm for GPU, using OpenCL framework,
-  and regular CPU, with the ability to parallelization, using OpenMP framework.
-
-This algorithm consists of several parts.
-First of all, we calculate wavy surface, according to our approach.
-Then, we compute wetted panels, which are located under the calculated surface.
-Finally, we find the buoyancy force, acting on a ship.
-These steps are repeated in infinity loop, and this is how we get things worked.
-
-Let's consider process of computing wavy surface in more details.
-Since we have an irregular structure of surface
-  (it means, that we store set of points, describing surface),
-  we just need to perform same formulas for each point of surface.
-It is easy to do with C++ for CPU computation, but it takes some effort
-  to effectively run this algorithm with GPU acceleration.
-Our first implementations was quiet slow, when we had about five iterations of global loop,
-  but now it is much more.
-
-Storage order is very important for GPU architecture.
-Those algorithms are efficient, which are with sequential memory access.
-In this way, we store our set of points in sequential order: one by one.
-It is very obviosly statement, but we need it to keep in mind.
-The next feature, that we use to increase performance, was built-in vector functions.
-So, we don't need to implement custom vector functions to work with our large set of vectors,
-  and it leads to decreasing size of code and possible mistakes.
-Besides, these functions are very fast, and that is how we get there acceleration.
-The third feature, is cache managment.
-Unlike CPU, GPU allows programmers to control it's own kind of L3 cache
-  (more precicely -- part of L3 cache), that is called "shared memory".
-Moreover, in most cases, among of any algorithms, we have to manage shared memory to accelerate them.
-A distinctive point of this kind of memory is that this memory has the smallest latency,
-  at the same time sharing data between some others computing unit,
-As far as, memory bandwith remains a bottleneck, this kind of optimization would fit any situations.
-In our case, summation occurs over the surface of the ship,
-  so we copy small pieces of it to shared memory.
-By this action we reduce number of access to global memory, which has a much bigger latency.
-Following these simple rules, we can easily implement efficient algorithm.
-All we have to do is:
-  check storage order;
-  include vector operations, as much, as possible;
-  and finally, manage shared memory.
-
-
-
-
-
-
-
-\section{Results}
-
+Solution for fluid velocity field was implemented in velocity potential solver
+in the framework of Virtual testbed. Virtual testbed is a programme for
+workstations that simulates ship motions in extreme conditions and physical
+phenomena that causes them: ocean waves, wind, compartment flooding etc. The
+main feature of this programme is to perform all calculations nearly in real
+time, paying attention to the high accuracy of calculations, which is partially
+achieved using graphical accelerators.
+
+Virtual testbed uses several solvers to simulate ship motions.
+The algorithm for velocity potential solver is the following.
+\begin{itemize}
+  \item First of all, we generate wavy surface, according to our solution and using
+    wetted ship panels from the previous time step (if any).
+  \item Second, we compute wetted panels for the current time step, which are
+    located under the surface calculated on the previos step.
+  \item Finally, we calculate Froude---Krylov forces, acting on a ship hull.
+\end{itemize}
+These steps are repeated in infinite loop. Consequently, wavy surface is
+always one time step behind the wetted panels. This inconsistency is a result
+of the decision not to solve ship motions and fluid motions in one system of
+equations, which would be too difficult to do.
+
+We implemented velocity potential solver using OpenMP for parallel computations
+on a processor and OpenCL for graphical accelerator. The solver uses single
+precision floating point numbers. Benchmark results are presented in
+tab.~\ref{tab-benchmark}.
+
+Let us consider process of computing wavy surface in more detail. Since wavy
+surface grid is irregular (i.e.~we store a matrix of fluid particle positions
+that describe the surface), we compute the same formula for each point of the
+surface. It is easy to do with C++ for CPU computation, but it takes some
+effort to efficiently run this algorithm with GPU acceleration. Our first
+naive implementation was ineffcient, but the second implementation that used
+local memory to optimise memory loads and stores works proved to be much more
+performant.
+
+First, we optimised storage order of points making it fully sequential.
+Sequential storage order leads to sequential loads and stores from the global
+memory and greatly improves performance of the graphical accelerator. Second,
+we use as many built-in vector functions as we can in our computations, since
+they are much more efficient than manually written ones and compiler knows how
+to optimise them. This also descreases code size and prevents possible mistakes
+in the manual implementation. Finally, we optimised how ship hull panels are
+read from the global memory. One way to think about panels is that they are
+coefficients in our model, as array of coefficients is typically read-only and
+constant. This type of array is best placed in the constant memory of the
+graphical accelerator that provides L2 cache for faster loads by parallel
+threads. However, our panel array is too large to fit in constant memory, so we
+simulated constant memory using local memory: we copied a small block of the
+array into local memory of the multiprocessor, computed sum using this block
+and then proceeded to the next block. This approach allowed to achieve almost
+200-fold speedup over CPU version of the solver.
+
+A distinctive feature of the local memory is that it has the smallest latency,
+at the same time sharing its contents between all computing units of the
+multiprocessor. Using local memory we reduce number of access to global
+memory, which has a much bigger latency. As far as global memory bandwith
+remains a bottleneck, this kind of optimisation would improve performance.
+To summarise, our approach to write code for graphical accelerators is the
+following:
+\begin{itemize}
+  \item make storage order linear,
+  \item use as many built-in vector operations as is possible,
+  \item use local memory of the multiprocessor to optimise global memory
+    load and stores.
+\end{itemize}
+Following these simple rules, we can easily implement efficient algorithms.
 
 \begin{table}
   \centering
@@ -489,6 +501,26 @@ All we have to do is:
 
 
 \section{Discussion}
+
+All the solutions obtained for various boundaries satisfy continuity equation
+and equation of motion, but they are all written for plain surface boundary
+with different orientations. Typical ship hull three-dimensional model is
+represented by triangulated surface, and in the centre of each panel fluid
+particle velocity vector does not depend on the surface normal of the panel and
+not other panels. So, the solution for plain surface boundary is enough
+to compute fluid velocity field \emph{on} the surface boundary.
+
+In order to generalise the solution fluid velocity field \emph{near} the
+surface boundary, we need to calculate weighted average of reflection terms of
+each underwater panel of the surface. Our preliminary tests showed that simple
+average is enough to visualise waves reflecting from the hull, but the approach
+that uses signed panel area or signed tetrahedron volume to account for
+direction of the surface normal relative to wave direction may give more
+accurate results. Nevertheless, only fluid velocity in the centre of each panel
+is used to calculate ship motions, and velocity field near the ship hull is
+used only for visualisation.
+
+
 \section{Conclusion}
 
 \subsubsection*{Acknowledgements.}
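
Note on the block-wise summation described in the added text: the commit message and diff do not show the OpenCL kernel itself, so the following is only a minimal CPU-side C++ sketch of the access pattern (copy a small block of the panel array into a fast buffer, accumulate the sum over that block for every surface point, then proceed to the next block). It is not the solver's code. The names `Vec3`, `panel_term`, `blocked_sum` and the `BlockSize` parameter are hypothetical, and `panel_term` is a placeholder weight rather than the reflection term from the paper. On an accelerator the `block` buffer would presumably live in local memory and be filled cooperatively by a work-group; here it is an ordinary stack array.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical 3-D point type (assumption for illustration only).
struct Vec3 { float x, y, z; };

// Placeholder per-panel contribution at a surface point.
// The real term is defined by the paper's solution; this inverse-square
// weight only gives the example something to sum.
inline float panel_term(const Vec3& panel, const Vec3& point) {
    const float dx = panel.x - point.x;
    const float dy = panel.y - point.y;
    const float dz = panel.z - point.z;
    return 1.0f / (1.0f + dx * dx + dy * dy + dz * dz);
}

// For every surface point, sum the contributions of all panels while
// reading the panel array in blocks of at most BlockSize elements.
// The `block` buffer stands in for the accelerator's local memory:
// each block is copied once and then reused by all surface points.
template <std::size_t BlockSize>
std::vector<float> blocked_sum(const std::vector<Vec3>& panels,
                               const std::vector<Vec3>& points) {
    std::vector<float> result(points.size(), 0.0f);
    std::array<Vec3, BlockSize> block{};
    for (std::size_t first = 0; first < panels.size(); first += BlockSize) {
        // Size of the current block (the last one may be shorter).
        const std::size_t n = std::min(BlockSize, panels.size() - first);
        // One sequential copy from "global" storage into the fast buffer.
        std::copy_n(panels.data() + first, n, block.begin());
        // Accumulate this block's partial sum for every surface point.
        for (std::size_t i = 0; i < points.size(); ++i) {
            float sum = 0.0f;
            for (std::size_t j = 0; j < n; ++j) {
                sum += panel_term(block[j], points[i]);
            }
            result[i] += sum;
        }
    }
    return result;
}
```

Calling `blocked_sum<64>(panels, points)` returns one accumulated value per surface point. The block size changes only the order in which partial sums are added, so results for different block sizes agree up to single-precision rounding.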