commit a3c96a9c87c9aa8c92cbd2006814af1529f12dae
parent 7c184cec46392f521e67086de1c2597691ae6a08
Author: Ivan Gankevich <i.gankevich@spbu.ru>
Date: Mon, 16 Mar 2020 14:17:44 +0300
OpenCL.
Diffstat:
 main.tex | 148 ++++++++++++++++++++++++++++++++++++++++++++++++-------------------------------
1 file changed, 90 insertions(+), 58 deletions(-)
diff --git a/main.tex b/main.tex
@@ -39,7 +39,7 @@
\email{i.gankevich@spbu.ru},\\
\email{st047437@student.spbu.ru},\\
\email{st016177@student.spbu.ru},\\
- \email{v.khramusin@spbu.ru}\\
+ \email{v.khramushin@spbu.ru}\\
\url{https://spbu.ru/}}
\maketitle
@@ -359,8 +359,7 @@ We plug this expression into the boundary condition and get
\exp\left( -i\vec{d}_s\cdot\vec\zeta_0 \right) = 0.
\end{equation*}
Here we substitute \(\vec{d}_r\cdot\vec{n}\) with \(-\vec{d}_i\cdot\vec{n}\)
-which is derived from the formula for \(\vec{d}_r\)
-(see~sec.~\ref{sec-formulae}).
+which is derived from the formula for \(\vec{d}_r\).
Hence, the boundary condition reduces to
\begin{equation*}
C_1 - C_2 \exp\left( -i\vec{d}_s\cdot\vec\zeta_0 \right) = 0.
@@ -373,64 +372,77 @@ This solution reduces to the solution for the wall when \(\vec{n}=(0,0,1)\).
%\input{stationary-surface.tex}
%\input{progressively-moving-surface.tex}
+\section{Results}
\subsection{OpenCL implementation}
-Virtual testbed is a program for personal computers.
-Its main feature is to perform all calculations in real time,
- paying attention to the high accuracy of calculations.
-This is achieved by using graphical accelerator.
-Generating Gerstner waves isn't an exception.
-We implement algorithm for GPU, using OpenCL framework,
- and regular CPU, with the ability to parallelization, using OpenMP framework.
-
-This algorithm consists of several parts.
-First of all, we calculate wavy surface, according to our approach.
-Then, we compute wetted panels, which are located under the calculated surface.
-Finally, we find the buoyancy force, acting on a ship.
-These steps are repeated in infinity loop, and this is how we get things worked.
-
-Let's consider process of computing wavy surface in more details.
-Since we have an irregular structure of surface
- (it means, that we store set of points, describing surface),
- we just need to perform same formulas for each point of surface.
-It is easy to do with C++ for CPU computation, but it takes some effort
- to effectively run this algorithm with GPU acceleration.
-Our first implementations was quiet slow, when we had about five iterations of global loop,
- but now it is much more.
-
-Storage order is very important for GPU architecture.
-Those algorithms are efficient, which are with sequential memory access.
-In this way, we store our set of points in sequential order: one by one.
-It is very obviosly statement, but we need it to keep in mind.
-The next feature, that we use to increase performance, was built-in vector functions.
-So, we don't need to implement custom vector functions to work with our large set of vectors,
- and it leads to decreasing size of code and possible mistakes.
-Besides, these functions are very fast, and that is how we get there acceleration.
-The third feature, is cache managment.
-Unlike CPU, GPU allows programmers to control it's own kind of L3 cache
- (more precicely -- part of L3 cache), that is called "shared memory".
-Moreover, in most cases, among of any algorithms, we have to manage shared memory to accelerate them.
-A distinctive point of this kind of memory is that this memory has the smallest latency,
- at the same time sharing data between some others computing unit,
-As far as, memory bandwith remains a bottleneck, this kind of optimization would fit any situations.
-In our case, summation occurs over the surface of the ship,
- so we copy small pieces of it to shared memory.
-By this action we reduce number of access to global memory, which has a much bigger latency.
-Following these simple rules, we can easily implement efficient algorithm.
-All we have to do is:
- check storage order;
- include vector operations, as much, as possible;
- and finally, manage shared memory.
-
-
-
-
-
-
-
-\section{Results}
-
+The solution for the fluid velocity field was implemented in the velocity
+potential solver of Virtual testbed. Virtual testbed is a programme for
+workstations that simulates ship motions in extreme conditions and the physical
+phenomena that cause them: ocean waves, wind, compartment flooding etc. The
+main feature of this programme is that it performs all calculations in near
+real time while maintaining high accuracy, which is partially
+achieved using graphical accelerators.
+
+Virtual testbed uses several solvers to simulate ship motions.
+The algorithm of the velocity potential solver is the following.
+\begin{itemize}
+  \item First, we generate the wavy surface according to our solution, using
+    wetted ship panels from the previous time step (if any).
+  \item Second, we compute wetted panels for the current time step, which are
+    located under the surface calculated in the previous step.
+  \item Finally, we calculate Froude--Krylov forces acting on the ship hull.
+\end{itemize}
+These steps are repeated in an infinite loop. Consequently, the wavy surface is
+always one time step behind the wetted panels. This inconsistency is a result
+of the decision not to solve ship motions and fluid motions in one system of
+equations, which would be too difficult to do.
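The ordering of the three steps, and the one-step lag it produces, can be sketched in C++ as follows. This is only an illustration under stated assumptions: `SolverState` and `advance` are hypothetical names that do not come from Virtual testbed, and each field merely records the time step it belongs to instead of holding real geometry or forces.

```cpp
#include <cassert>

// Hypothetical bookkeeping structure (not part of Virtual testbed):
// each field stores the time step that the corresponding data belongs to.
struct SolverState {
    int wetted_panels_step = -1;   // -1 means "no wetted panels computed yet"
    int surface_panels_step = -1;  // step of the panels used to build the surface
    int forces_step = -1;          // step for which the forces were computed
};

// One iteration of the loop, in the order given in the list above.
void advance(SolverState& s, int step) {
    assert(step > s.wetted_panels_step);  // time steps must increase
    // 1. The wavy surface is generated from the panels of the previous step.
    s.surface_panels_step = s.wetted_panels_step;
    // 2. Wetted panels for the current step are found under that surface.
    s.wetted_panels_step = step;
    // 3. Froude-Krylov forces are computed for the current step.
    s.forces_step = step;
}
```

After any call to `advance`, `surface_panels_step` is exactly one less than `wetted_panels_step` (or `-1` on the very first step), which is the lag described in the text.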
+
+We implemented the velocity potential solver using OpenMP for parallel
+computations on a processor and OpenCL for a graphical accelerator. The solver
+uses single precision floating point numbers. Benchmark results are presented in
+tab.~\ref{tab-benchmark}.
+
+Let us consider the process of computing the wavy surface in more detail. Since
+the wavy surface grid is irregular (i.e.~we store a matrix of fluid particle
+positions that describe the surface), we compute the same formula for each
+point of the surface. It is easy to do with C++ for CPU computation, but it
+takes some effort to efficiently run this algorithm with GPU acceleration. Our
+first naive implementation was inefficient, but the second implementation,
+which used local memory to optimise memory loads and stores, proved to be much
+more performant.
+
+First, we optimised the storage order of points, making it fully sequential.
+Sequential storage order leads to sequential loads and stores from the global
+memory and greatly improves performance of the graphical accelerator. Second,
+we use as many built-in vector functions as we can in our computations, since
+they are much more efficient than manually written ones and the compiler knows
+how to optimise them. This also decreases code size and prevents possible
+mistakes in the manual implementation. Finally, we optimised how ship hull
+panels are read from the global memory. One way to think about panels is that
+they are coefficients in our model, and an array of coefficients is typically
+read-only and constant. This type of array is best placed in the constant
+memory of the graphical accelerator that provides L2 cache for faster loads by
+parallel threads. However, our panel array is too large to fit in constant
+memory, so we simulated constant memory using local memory: we copied a small
+block of the array into local memory of the multiprocessor, computed the sum
+using this block and then proceeded to the next block. This approach allowed us
+to achieve an almost 200-fold speedup over the CPU version of the solver.
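The block-by-block summation described above can be illustrated with a CPU-side C++ sketch. It is not the OpenCL kernel from the solver and brings no speedup on a CPU: it only mirrors the access pattern, with a small buffer playing the role of local memory. The `Panel` record, the `contribution` function and both summation routines are hypothetical stand-ins introduced for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical panel record; the real solver stores hull geometry instead.
struct Panel {
    float weight;
    float centre;
};

// Hypothetical stand-in for the per-panel term that is summed for each point.
inline float contribution(float point, const Panel& panel) {
    return panel.weight * (point - panel.centre);
}

// Reference version: every point reads every panel from the large array.
std::vector<float> sum_naive(const std::vector<float>& points,
                             const std::vector<Panel>& panels) {
    std::vector<float> sums(points.size(), 0.0f);
    for (std::size_t i = 0; i < points.size(); ++i) {
        for (const Panel& panel : panels) {
            sums[i] += contribution(points[i], panel);
        }
    }
    return sums;
}

// Blocked version: copy a small block of panels into a buffer, let all
// points accumulate over that block, then proceed to the next block.
std::vector<float> sum_blocked(const std::vector<float>& points,
                               const std::vector<Panel>& panels,
                               std::size_t block_size) {
    assert(block_size > 0);
    std::vector<float> sums(points.size(), 0.0f);
    std::vector<Panel> block;  // plays the role of local memory
    block.reserve(block_size);
    for (std::size_t first = 0; first < panels.size(); first += block_size) {
        const std::size_t last = std::min(first + block_size, panels.size());
        // In a kernel the work-group would load this block cooperatively
        // and synchronise before any work-item reads it.
        block.assign(panels.begin() + static_cast<std::ptrdiff_t>(first),
                     panels.begin() + static_cast<std::ptrdiff_t>(last));
        for (std::size_t i = 0; i < points.size(); ++i) {
            for (const Panel& panel : block) {
                sums[i] += contribution(points[i], panel);
            }
        }
    }
    return sums;
}
```

Both routines add the panel terms for a given point in the same order, so they produce the same sums; the blocked version differs only in that each panel is fetched from the large array once per block rather than once per point.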
+
+A distinctive feature of the local memory is that it has the smallest latency
+while sharing its contents between all computing units of the
+multiprocessor. Using local memory we reduce the number of accesses to global
+memory, which has a much bigger latency. Since global memory bandwidth
+remains a bottleneck, this kind of optimisation improves performance.
+To summarise, our approach to writing code for graphical accelerators is the
+following:
+\begin{itemize}
+  \item make storage order linear,
+  \item use as many built-in vector operations as possible,
+  \item use local memory of the multiprocessor to optimise global memory
+    loads and stores.
+\end{itemize}
+Following these simple rules, we can easily implement efficient algorithms.
\begin{table}
\centering
@@ -489,6 +501,26 @@ All we have to do is:
\section{Discussion}
+
+All the solutions obtained for various boundaries satisfy the continuity equation
+and the equation of motion, but they are all written for a plane surface boundary
+with different orientations. A typical three-dimensional ship hull model is
+represented by a triangulated surface, and in the centre of each panel the fluid
+particle velocity vector depends only on the surface normal of that panel and
+not on other panels. So, the solution for a plane surface boundary is enough
+to compute the fluid velocity field \emph{on} the surface boundary.
+
+In order to generalise the solution to the fluid velocity field \emph{near} the
+surface boundary, we need to calculate a weighted average of the reflection
+terms of each underwater panel of the surface. Our preliminary tests showed
+that a simple average is enough to visualise waves reflecting from the hull,
+but an approach that uses signed panel area or signed tetrahedron volume to
+account for the direction of the surface normal relative to the wave direction
+may give more accurate results. Nevertheless, only the fluid velocity in the
+centre of each panel is used to calculate ship motions, and the velocity field
+near the ship hull is used only for visualisation.
+
+
\section{Conclusion}
\subsubsection*{Acknowledgements.}