commit 8bc3d35ab6c8259ef49d46bfe7141755a51b8c7c
parent b0f1c56ccea0eaebc1532fdda925c6dd7a75c8cf
Author: Ivan Gankevich <igankevich@ya.ru>
Date: Thu, 18 Apr 2019 18:47:58 +0300
edit and proof-read
Diffstat:
main.tex | 101 ++++++++++++++++++++++++++++++++++++++++---------------------------------------
1 file changed, 51 insertions(+), 50 deletions(-)
diff --git a/main.tex b/main.tex
@@ -118,10 +118,10 @@ to speed up computation of free surface motion inside a tank.
In~\cite{varela2011interactive} the authors rewrite their simulation code using
Fast Fourier transforms and propose to use GPU to gain more performance.
In~\cite{keeler2015integral} the authors use GPU to simulate ocean waves.
-Nevertheless, the most efficient way of using GPU is to use it for the whole
-programme: it allows to minimise data copying between CPU and GPU memory and
-use mathematical models, data structures and numerical methods that are
-tailored to graphical accelerators.
+Nevertheless, the most efficient way of using a GPU is to use it for both
+computation and visualisation: this minimises data copying between CPU and
+GPU memory and allows using mathematical models, data structures and numerical
+methods that are tailored to graphical accelerators.
The present research proposes a numerical method for computing velocity
potentials and wave pressures on a graphical accelerator, briefly explains
@@ -181,7 +181,7 @@ elevation field.
The resulting field is stochastic, but has the same integral characteristics as
the original one. In particular, probability distribution functions of wavy
-surface elevation, wave height, length and period is preserved. Using ARMA
+surface elevation, wave height, length and period are preserved. Using ARMA
model for post-processing has several advantages.
\begin{itemize}
@@ -251,7 +251,7 @@ we solve for \(\phi\), dynamic boundary condition becomes explicit formula for
pressure and is used to compute pressure force acting on a ship hull
(see~sec.~\ref{sec:pressure-force}).
-Formula~\eqref{eq:phi} converges when summation goes over a range of wave
+The integral in~\eqref{eq:phi} converges when summation goes over a range of wave
numbers that are actually present in the discretely given wavy surface. This range
is determined numerically by finding crests and troughs for each spatial
dimension of the wavy surface with polynomial interpolation and using these
@@ -289,7 +289,7 @@ Then the pressure is interpolated in the centre of each panel to compute
pressure force acting on a ship hull.
It is straightforward to rewrite pressure computation for a graphical
-accelerator as its algorithm reduces to looping over large collection of panels
+accelerator as its algorithm reduces to looping over a large collection of panels
and performing the same calculations for each of them; however, the dynamic
boundary condition contains temporal and spatial derivatives that have to be
computed. Although computing derivatives on a processor is fast, copying the
@@ -420,19 +420,20 @@ visualisation frames).
\caption{Virtual testbed main loop.\label{fig:loop}}
\end{figure}
-We ran all tests on the same node for increasing number of processor cores and
+We ran all tests on each node for an increasing number of processor cores and
with and without a graphical accelerator. The code was compiled with the
maximum optimisation level, including processor-specific optimisations, which
enabled auto-vectorisation for further performance improvements.
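+With GCC or Clang, this level of optimisation corresponds to flags such as
+\texttt{-O3 -march=native} (an illustrative equivalent, not the project's
+actual build line):
+\begin{verbatim}
+g++ -O3 -march=native -fopenmp -o vtestbed main.cpp
+\end{verbatim}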
-We ran all tests for each of the three ship models: Aurora cruiser, MICW (a
-hull with reduced moments of inertia for the current waterline) and a sphere.
-The first two models represent real-world ships with known characteristics and
-we took them from Vessel database~\cite{vessel2015} registered by our
-university which is managed by Hull programme~\cite{hull2010}. Parameters of
-these ship models are listed in tab.~\ref{tab:ships}, three-dimensional models
-are shown in fig.~\ref{fig:models}. Sphere was used as a geometrical shape
-wetted surface area of which is close to constant under impact of ocean waves.
+We ran all tests for each of the three ship hull models: Aurora cruiser, MICW
+(a hull with reduced moments of inertia for the current waterline) and a
+sphere. The first two models represent real-world ships with known
+characteristics; we took them from the Vessel database~\cite{vessel2015}
+registered by our university and managed by the Hull
+programme~\cite{hull2010}. Parameters of these ship models are listed in
+tab.~\ref{tab:ships}; three-dimensional models are shown in
+fig.~\ref{fig:models}. The sphere was used as a geometric shape whose wetted
+surface area is close to constant under the impact of ocean waves.
We ran all tests for each workstation from tab.~\ref{tab:setup} to investigate
if there is a difference in performance between an ordinary workstation and a
@@ -482,7 +483,7 @@ with high frame rate and small simulation time steps.
(tab.~\ref{tab:best}).
\item The most performant node is GPUlab with 104 simulation steps per
- second. Performance of Capybara is higher than Storm, but it uses
+ second. Performance of Capybara is higher than that of Storm, but it uses
a powerful server-grade processor to achieve it.
\item Computational speedup for an increasing number of parallel OpenMP
@@ -569,34 +570,34 @@ simulation; however, it gives performance reserve for further increase in
detail and scale of simulated physical phenomena. We manually limit simulation
time step to a minimum of \(1/30\) of a second to prevent floating-point
numerical errors due to small time steps. Also, we limit the maximum time step to
-have frequency greater or equal to Nyquist frequency for precise partial time
-derivatives computation.
+have the Nyquist frequency greater than or equal to the wave frequency for
+precise computation of partial time derivatives.
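+A minimal sketch of this clamping (identifiers are illustrative, not taken
+from the Virtual testbed source): the step never drops below \(1/30\)\,s and
+never exceeds half the shortest wave period.
+\begin{verbatim}
+#include <algorithm>
+
+float clamp_time_step(float dt, float max_wave_frequency) {
+  const float min_step = 1.0f/30.0f;                      // avoid tiny steps
+  const float max_step = 1.0f/(2.0f*max_wave_frequency);  // Nyquist limit
+  return std::clamp(dt, min_step, max_step);
+}
+\end{verbatim}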
Real-time simulation is essential not only for educational purposes, but also
for on-board intelligent systems. These systems analyse data coming from a
multitude of sensors the ship is equipped with, calculate the probability of a
particular dangerous situation (e.g.~a large roll angle) and try to prevent it by
-notifying ship's crew and an operator on the coast. This is one of the topics
-of future work.
+notifying the ship's crew and an operator ashore. This is one of the
+directions of future work.
Overall performance depends on the size of the ship rather than the number of
-panels. MICW hull has less number of panels than Aurora, but larger size and
-exactly two times worse performance (tab.~\ref{tab:best}). The size of the hull
-affects the size of the grid in each point of which velocity potential and then
-pressure is computed. These routines are much more compute intensive in
+panels. The MICW hull has fewer panels than Aurora, but is two times larger
+and shows two times worse performance (tab.~\ref{tab:best}). The size of the
+hull affects the size of the grid, at each point of which velocity potential
+and then pressure are computed. These routines are much more compute intensive in
comparison to wetted surface determination and pressure force computation,
performance of which depends on the number of panels.
Despite the fact that Capybara has the highest floating-point performance
-across all workstations in the benchmarks, Virtual testbed runs faster on the
+across all workstations in the benchmarks, Virtual testbed runs faster on its
processor, not the graphical accelerator. Routine-by-routine investigation
showed that this graphics card is simply slower at computing even fully
-parallel Stokes wave generator kernel. This kernel fills three-dimensional
-array using explicit formula for the wave profile, it has linear memory access
-pattern and no information dependencies between array elements. It seems, that
-P5000 is not optimised for general purpose computations. We did not conduct
-visualisation benchmarks, so we do not know if it is more efficient in that
-case.
+parallel Stokes wave generator OpenCL kernel. This kernel fills a
+three-dimensional array using an explicit formula for the wave profile; it has
+a linear memory access pattern and no information dependencies between array
+elements. It seems that the P5000 is not optimised for general purpose
+computations. We did not conduct visualisation benchmarks, so we do not know if
+it is more efficient in that case.
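+To illustrate why this kernel parallelises so well, consider the following CPU
+analogue in C++ with OpenMP (the second-order Stokes profile here is our
+assumption, not necessarily the exact formula used in the programme): every
+element is computed independently and writes are linear in memory.
+\begin{verbatim}
+#include <cmath>
+
+// fills zeta[nt][nx][ny] from an explicit wave profile formula
+void generate(float* zeta, int nt, int nx, int ny,
+              float a, float k, float omega, float dx, float dt) {
+  #pragma omp parallel for collapse(2)  // no dependencies between elements
+  for (int it = 0; it < nt; ++it)
+    for (int ix = 0; ix < nx; ++ix)
+      for (int iy = 0; iy < ny; ++iy) {
+        float theta = k*ix*dx - omega*it*dt;
+        zeta[(it*nx + ix)*ny + iy] =                 // linear write pattern
+          a*std::cos(theta) + 0.5f*k*a*a*std::cos(2*theta);
+      }
+}
+\end{verbatim}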
Although Capybara's processor has 20 hardware threads (2 threads per core),
OpenMP performance does not scale beyond 10 threads. Parallel threads in our
@@ -608,14 +609,14 @@ solved by creating a pipeline from the main loop in which each stage is
executed in parallel and data constantly flows between subsequent stages. This
approach is easy to implement when the computational grid can be divided into
distinct parts, which is not the case for Virtual testbed: there are too many
-dependencies between parts and in each stage position and size of each part can
-be different. OpenCL does not have these limitations, and pipeline would
-probably not improve graphical accelerator performance, so we did not take this
-approach.
+dependencies between parts, and the position and the size of each part can be
+different in each stage. Graphical accelerators have more efficient hardware
+thread switching, and a pipeline would probably not improve their performance,
+so we did not take this approach.
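+Instead, each stage of the main loop is parallelised internally and the stages
+run one after another; a sketch of this structure (function and type names are
+illustrative, not from the actual source):
+\begin{verbatim}
+struct Grid; struct Ship;                  // hypothetical types
+void generate_wavy_surface(Grid&);         // each stage runs its own
+void compute_velocity_potential(Grid&);    // #pragma omp parallel for
+void compute_pressure(Grid&);              // over grid points
+void compute_pressure_force(Ship&, Grid&); // or over hull panels
+
+void step(Grid& grid, Ship& ship) {        // stages run sequentially
+  generate_wavy_surface(grid);
+  compute_velocity_potential(grid);
+  compute_pressure(grid);
+  compute_pressure_force(ship, grid);
+}
+\end{verbatim}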
Our approach for performing computations on a heterogeneous node (a node with
-both a processor and a graphical accelerator) is similar to approach followed by
-the authors of Spark distributed data processing
+both a processor and a graphical accelerator) is similar to the approach
+followed by the authors of the Spark distributed data processing
framework~\cite{zaharia2016spark}. In this framework data is first loaded into
the main memory of each cluster node and then processed in a loop. Each
iteration of this loop is run by all nodes in parallel and synchronisation occurs
@@ -635,18 +636,18 @@ only the forces for each panel.) This allows us to eliminate expensive data
transfer between CPU and GPU memory. In early versions of our programme this
copying slowed down the simulation significantly.
-Although, heterogeneous node is not a cluster, the approach to programming it is
-similar to distributed data processing systems: we process data only on those
-device main memory of which contains this data and we never transfer
-intermediate computation results between devices. To employ this approach the
-whole iteration of the programme's main loop have to be executed either on a
-processor or a graphical accelerator. Given the time constraints, future
-maintenance burden and programme's code size, it was difficult to fully follow
-this approach, but we came to a reasonable approximation of it. We still have
-functions (\textit{clamp} stage in fig.~\ref{fig:histogram} that reduces the
-size of the computational grid to the points nearby the ship) in Virtual testbed
-that work with intermediate results on a processor, but the amount of data that
-is copied to and from a graphical accelerator is relatively small.
+Although a heterogeneous node is not a cluster, an efficient programme
+architecture for such a node is similar to that of distributed data processing
+systems: we process data only on the device whose main memory contains the
+data, and we never transfer intermediate computation results between devices.
+To implement this principle, the whole iteration of the programme's main loop
+has to be executed either on a processor or on a graphical accelerator. Given
+the time constraints, future maintenance burden and the programme's code size,
+it was difficult to fully follow this approach, but we came to a reasonable
+approximation of it. We still have functions in Virtual testbed that work with
+intermediate results on a processor (the \textit{clamp} stage in
+fig.~\ref{fig:histogram}, which reduces the computational grid to the points
+near the ship), but the amount of data that is copied to and from a graphical
+accelerator is relatively small.
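+A sketch of this principle (types and function names are hypothetical): the
+whole loop iteration runs on one device, and only the small final result, the
+per-panel forces, is copied back from the accelerator.
+\begin{verbatim}
+enum class Device { CPU, GPU };
+struct State { bool running() const; };  // hypothetical simulation state
+void upload_initial_state(State&);       // one-time copy to GPU memory
+void step_on_gpu(State&);                // kernels use GPU buffers only
+void download_panel_forces(State&);      // small result, cheap to copy
+void step_on_cpu(State&);                // all stages work in main memory
+
+void run(Device device, State& state) {
+  if (device == Device::GPU) upload_initial_state(state);
+  while (state.running()) {
+    if (device == Device::GPU) {
+      step_on_gpu(state);                // intermediate results stay on GPU
+      download_panel_forces(state);
+    } else {
+      step_on_cpu(state);                // intermediate results stay in RAM
+    }
+  }
+}
+\end{verbatim}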
\section{Conclusion}