OpenCL. - iccsa-20-waves

commit a3c96a9c87c9aa8c92cbd2006814af1529f12dae
parent 7c184cec46392f521e67086de1c2597691ae6a08
Author: Ivan Gankevich <i.gankevich@spbu.ru>
Date:   Mon, 16 Mar 2020 14:17:44 +0300

OpenCL.

Diffstat:
main.tex  | 148 ++++++++++++++++++++++++++++++++++++++++++++++++-------------------------------

1 file changed, 90 insertions(+), 58 deletions(-)
diff --git a/main.tex b/main.tex
@@ -39,7 +39,7 @@
     \email{i.gankevich@spbu.ru},\\
     \email{st047437@student.spbu.ru},\\
     \email{st016177@student.spbu.ru},\\
-    \email{v.khramusin@spbu.ru}\\
+    \email{v.khramushin@spbu.ru}\\
     \url{https://spbu.ru/}}
 
 \maketitle
@@ -359,8 +359,7 @@ We plug this expression into the boundary condition and get
 \exp\left( -i\vec{d}_s\cdot\vec\zeta_0 \right) = 0.
 \end{equation*}
 Here we substitute \(\vec{d}_r\cdot\vec{n}\) with \(-\vec{d}_i\cdot\vec{n}\)
-which is derived from the formula for \(\vec{d}_r\)
-(see~sec.~\ref{sec-formulae}).
+which is derived from the formula for \(\vec{d}_r\).
 Hence, the boundary condition reduces to
 \begin{equation*}
 C_1 - C_2 \exp\left( -i\vec{d}_s\cdot\vec\zeta_0 \right) = 0.
@@ -373,64 +372,77 @@ This solution reduces to the solution for the wall when \(\vec{n}=(0,0,1)\).
 %\input{stationary-surface.tex}
 %\input{progressively-moving-surface.tex}
 
+\section{Results}
 
 \subsection{OpenCL implementation}
  
-Virtual testbed is a program for personal computers. 
-Its main feature is to perform all calculations in real time, 
- paying attention to the high accuracy of calculations. 
-This is achieved by using graphical accelerator.
-Generating Gerstner waves isn't an exception. 
-We implement algorithm for GPU, using OpenCL framework, 
- and regular CPU, with the ability to parallelization, using OpenMP framework. 
-
-This algorithm consists of several parts.
-First of all, we calculate wavy surface, according to our approach.
-Then, we compute wetted panels, which are located under the calculated surface.
-Finally, we find the buoyancy force, acting on a ship.
-These steps are repeated in infinity loop, and this is how we get things worked.
-
-Let's consider process of computing wavy surface in more details.
-Since we have an irregular structure of surface 
- (it means, that we store set of points, describing surface), 
- we just need to perform same formulas for each point of surface.
-It is easy to do with C++ for CPU computation, but it takes some effort 
- to effectively run this algorithm with GPU acceleration. 
-Our first implementations was quiet slow, when we had about five iterations of global loop,
- but now it is much more.
-
-Storage order is very important for GPU architecture. 
-Those algorithms are efficient, which are with sequential memory access. 
-In this way, we store our set of points in sequential order: one by one.
-It is very obviosly statement, but we need it to keep in mind.
-The next feature, that we use to increase performance, was built-in vector functions.
-So, we don't need to implement custom vector functions to work with our large set of vectors,
- and it leads to decreasing size of code and possible mistakes.
-Besides, these functions are very fast, and that is how we get there acceleration.
-The third feature, is cache managment. 
-Unlike CPU, GPU allows programmers to control it's own kind of L3 cache 
- (more precicely -- part of L3 cache), that is called "shared memory". 
-Moreover, in most cases, among of any algorithms, we have to manage shared memory to accelerate them.
-A distinctive point of this kind of memory is that this memory has the smallest latency, 
- at the same time sharing data between some others computing unit,
-As far as, memory bandwith remains a bottleneck, this kind of optimization would fit any situations.
-In our case, summation occurs over the surface of the ship,
- so we copy small pieces of it to shared memory. 
-By this action we reduce number of access to global memory, which has a much bigger latency.
-Following these simple rules, we can easily implement efficient algorithm. 
-All we have to do is: 
- check storage order; 
- include vector operations, as much, as possible; 
- and finally, manage shared memory.
-
-
-
-
-
-
-
-\section{Results}
-
+The solution for the fluid velocity field was implemented in the velocity
+potential solver of Virtual testbed.  Virtual testbed is a programme for
+workstations that simulates ship motions in extreme conditions and the physical
+phenomena that cause them: ocean waves, wind, compartment flooding etc.  The
+main feature of this programme is that it performs all calculations in near-real
+time while maintaining high accuracy, which is partially
+achieved using graphical accelerators.
+
+Virtual testbed uses several solvers to simulate ship motions.
+The algorithm for the velocity potential solver is the following.
+\begin{itemize}
+    \item First, we generate the wavy surface according to our solution, using
+        wetted ship panels from the previous time step (if any).
+    \item Second, we compute wetted panels for the current time step, which are
+        located under the surface calculated in the previous step.
+    \item Finally, we calculate Froude--Krylov forces acting on the ship hull.
+\end{itemize}
+These steps are repeated in an infinite loop. Consequently, the wavy surface is
+always one time step behind the wetted panels. This inconsistency is a result
+of the decision not to solve ship motions and fluid motions in one system of
+equations, which would be too difficult to do.
+
+We implemented the velocity potential solver using OpenMP for parallel computations
+on a processor and OpenCL for a graphical accelerator.  The solver uses
+single-precision floating-point numbers. Benchmark results are presented in
+tab.~\ref{tab-benchmark}.
+
+Let us consider the process of computing the wavy surface in more detail.  Since the wavy
+surface grid is irregular (i.e.~we store a matrix of fluid particle positions
+that describe the surface), we compute the same formula for each point of the
+surface.  It is easy to do with C++ for CPU computation, but it takes some
+effort to efficiently run this algorithm with GPU acceleration.  Our first
+naive implementation was inefficient, but the second implementation, which used
+local memory to optimise memory loads and stores, proved to be much more
+performant.
+
+First, we optimised the storage order of points, making it fully sequential.
+Sequential storage order leads to sequential loads and stores from the global
+memory and greatly improves performance of the graphical accelerator.  Second,
+we use as many built-in vector functions as we can in our computations, since
+they are much more efficient than manually written ones and the compiler knows how
+to optimise them. This also decreases code size and prevents possible mistakes
+in the manual implementation. Finally, we optimised how ship hull panels are
+read from the global memory. One way to think about panels is that they are
+coefficients in our model, since an array of coefficients is typically read-only and
+constant. This type of array is best placed in the constant memory of the
+graphical accelerator that provides L2 cache for faster loads by parallel
+threads. However, our panel array is too large to fit in constant memory, so we
+simulated constant memory using local memory: we copied a small block of the
+array into local memory of the multiprocessor, computed the sum using this block
+and then proceeded to the next block. This approach allowed us to achieve an almost
+200-fold speedup over the CPU version of the solver.
+
+A distinctive feature of the local memory is that it has the smallest latency,
+while sharing its contents between all computing units of the
+multiprocessor.  Using local memory we reduce the number of accesses to global
+memory, which has a much higher latency.  Since global memory bandwidth
+remains a bottleneck, this kind of optimisation improves performance.
+To summarise, our approach to writing code for graphical accelerators is the
+following:
+\begin{itemize}
+    \item make storage order linear,
+    \item use as many built-in vector operations as possible,
+    \item use local memory of the multiprocessor to optimise global memory
+        loads and stores.
+\end{itemize}
+Following these simple rules, we can easily implement efficient algorithms.
 
 \begin{table}
     \centering
@@ -489,6 +501,26 @@ All we have to do is:
 
 
 \section{Discussion}
+
+All the solutions obtained for various boundaries satisfy the continuity equation
+and the equation of motion, but they are all written for a plane surface boundary
+with different orientations. A typical three-dimensional ship hull model is
+represented by a triangulated surface, and in the centre of each panel the fluid
+particle velocity vector depends only on the surface normal of that panel and
+not on other panels. So, the solution for a plane surface boundary is enough
+to compute the fluid velocity field \emph{on} the surface boundary.
+
+In order to generalise the solution to the fluid velocity field \emph{near} the
+surface boundary, we need to calculate a weighted average of the reflection terms of
+each underwater panel of the surface. Our preliminary tests showed that a simple
+average is enough to visualise waves reflecting from the hull, but an approach
+that uses signed panel area or signed tetrahedron volume to account for
+the direction of the surface normal relative to the wave direction may give more
+accurate results. Nevertheless, only the fluid velocity in the centre of each panel
+is used to calculate ship motions, and the velocity field near the ship hull is
+used only for visualisation.
+
+
 \section{Conclusion}
 
 \subsubsection*{Acknowledgements.}
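
Note on the blocked summation described in the new text: the commit does not
include the kernel itself, so the following is only a minimal CPU-side sketch of
the idea in C++, not the authors' OpenCL code. It assumes a hypothetical
`Panel` type holding a single coefficient and a hypothetical `panel_term`
function standing in for the per-panel term of the sum. A small buffer `tile`
plays the role of the multiprocessor's local memory: a block of panels is
copied into it once, every surface point accumulates its partial sum from that
block, and only then is the next block loaded.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a hull panel: one coefficient per panel.
struct Panel { float weight; };

// Hypothetical per-panel term; a real solver would evaluate its own formula here.
inline float panel_term(const Panel& p, float x) { return p.weight * x; }

// Reference version: every point reads every panel from the large array.
std::vector<float> sum_naive(const std::vector<float>& points,
                             const std::vector<Panel>& panels) {
    std::vector<float> result(points.size(), 0.0f);
    for (std::size_t i = 0; i < points.size(); ++i)
        for (const Panel& p : panels)
            result[i] += panel_term(p, points[i]);
    return result;
}

// Blocked version: panels are processed in blocks of block_size elements.
// Each block is copied into a small buffer (the analogue of local memory),
// used by all points, and then replaced by the next block.
std::vector<float> sum_blocked(const std::vector<float>& points,
                               const std::vector<Panel>& panels,
                               std::size_t block_size) {
    assert(block_size > 0);
    std::vector<float> result(points.size(), 0.0f);
    std::vector<Panel> tile(block_size);
    for (std::size_t begin = 0; begin < panels.size(); begin += block_size) {
        const std::size_t count = std::min(block_size, panels.size() - begin);
        // One bulk copy per block instead of one far-memory read per (point, panel) pair.
        std::copy_n(panels.begin() + static_cast<std::ptrdiff_t>(begin), count,
                    tile.begin());
        for (std::size_t i = 0; i < points.size(); ++i) {
            float partial = 0.0f;
            for (std::size_t j = 0; j < count; ++j)
                partial += panel_term(tile[j], points[i]);
            result[i] += partial;
        }
    }
    return result;
}
```

Calling `sum_blocked(points, panels, 4)` visits the panels four at a time. On a
graphical accelerator the copy into `tile` would be done cooperatively by the
work-items of a work-group and followed by a barrier; both details are omitted
here because the sketch runs on a single thread.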
