iccsa-19-vtestbed

git clone https://git.igankevich.com/iccsa-19-vtestbed.git

commit fe7e7b70da6d4678d8177800fe8965a0168fd985
parent ff752f8426d71553046f654bf2cc9693c6f30952
Author: Ivan Gankevich <igankevich@ya.ru>
Date:   Sun, 31 Mar 2019 20:20:07 +0300

Re-read.

Diffstat:
main.tex | 199++++++++++++++++++++++++++++++++++++++++---------------------------------------
1 file changed, 100 insertions(+), 99 deletions(-)

diff --git a/main.tex b/main.tex
@@ -41,10 +41,10 @@
 \title{Virtual testbed: Ship motion simulation for~personal workstations}
 \author{%
 Alexander Degtyarev\orcidID{0000-0003-0967-2949} \and\\
-Vasily Khramushin\orcidID{0000-0002-3357-169X} \and
-Ivan Gankevich\textsuperscript{*}\orcidID{0000-0001-7067-6928} \and
-Ivan Petriakov \and
-Anton Gavrikov\orcidID{0000-0003-2128-8368} \and
+Vasily Khramushin\orcidID{0000-0002-3357-169X} \and\\
+Ivan Gankevich\textsuperscript{*}\orcidID{0000-0001-7067-6928} \and\\
+Ivan Petriakov \and\\
+Anton Gavrikov\orcidID{0000-0003-2128-8368} \and\\
 Artemii Grigorev
 }
@@ -60,20 +60,21 @@ Artemii Grigorev
 \begin{abstract}
-Virtual testbed is a computer programme that simulates ocean waves, ship motion
-and compartment flooding. One feature of this programme is that it visualises
-physical phenomena frame by frame as the simulation progresses. The aim of the
-studies reported here was to assess how much performance can be gained using
-graphical accelerators compared to ordinary processors when repeating the same
-computations in a loop. We rewrote programme's hot spots in OpenCL to execute
-them on a graphical accelerator and benchmarked their performance with a number
-of real-world ship models. The analysis of the results showed that data copying
-in and out of accelerator's main memory has major impact on performance when
-done in a loop, and the best performance is achieved when copying in and out is
-done outside the loop (when data copying inside the loop involves accelerator's
-main memory only). This result comes in line with how distributed computations
-are performed on a set of cluster nodes, and suggests using similar
-approaches for single heterogeneous node with a graphical accelerator.
+Virtual testbed is a computer programme that simulates ocean waves, ship
+motions and compartment flooding. One feature of this programme is that it
+visualises physical phenomena frame by frame as the simulation progresses. The
+aim of the studies reported here was to assess how much performance can be
+gained using graphical accelerators compared to ordinary processors when
+repeating the same computations in a loop. We rewrote programme's hot spots in
+OpenCL to be able to execute them on a graphical accelerator and benchmarked
+their performance with a number of real-world ship models. The analysis of the
+results showed that data copying in and out of accelerator's main memory has
+major impact on performance when done in a loop, and the best performance is
+achieved when copying in and out is done outside the loop (when data copying
+inside the loop involves accelerator's main memory only). This result comes in
+line with how distributed computations are performed on a set of cluster nodes,
+and suggests using similar approaches for single heterogeneous node with a
+graphical accelerator.
 \keywords{%
 wavy surface
@@ -110,7 +111,7 @@ and time-sharing of computing resources.
 One way of removing this barrier is to use graphical accelerator to speed up
 computations. In that case simulation can be performed on a regular workstation
-that has a discrete graphics card. Most of the researchers use GPU to make
+that has a dedicated graphics card. Most of the researchers use GPU to make
 visualisation in real-time, but it is rarely used for speeding up simulation
 parts, let alone the whole programme. In~\cite{pita2016sph} the authors use GPU
 to speed up computation of free surface motion inside a tank.
@@ -130,21 +131,20 @@ visualisation and simulation.

 \section{Methods}

-Virtual testbed is a computer programme that simulates ocean waves, ship motion
-and compartment flooding. One feature that distinguishes it with respect to
-existing proposals is the use of graphical accelerators to speed up
+Virtual testbed is a computer programme that simulates ocean waves, ship
+motions and compartment flooding. One feature that distinguishes it with
+respect to existing proposals is the use of graphical accelerators to speed up
 computations and real-time visualisation that was made possible by these
 accelerators.

-The programme consists of the following modules: \texttt{vessel} reads
-threedimensional ship hull model from an input file, \texttt{gui} draws current
-state of the virtual world and \texttt{core} computes each step of the
-simulation. The \texttt{core} module consists of components are linked together
-in a pipeline, in which output of one component is the input of another one.
-The computation is carried out in parallel to visualisation, and
-synchronisation occurs after each simulation step. It makes graphical user
-interface responsive even when workstation is not powerful enough to compute in
-real-time.
+The programme consists of the following modules: \texttt{vessel} reads ship
+hull model from an input file, \texttt{gui} draws current state of the virtual
+world and \texttt{core} computes each step of the simulation. The \texttt{core}
+module consists of components that are linked together in a pipeline, in which
+output of one component is the input of another one. The computation is
+carried out in parallel to visualisation, and synchronisation occurs after each
+simulation step. It makes graphical user interface responsive even when
+workstation is not powerful enough to compute in real-time.

 Inside \texttt{core} module the following components are present: wavy surface
 generator, velocity potential solver, pressure force solver. Each component in
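
For illustration only (this sketch is not part of the commit or of the
repository), a minimal C++ outline of the structure the hunk above describes:
core components chained in a pipeline, advanced one simulation step at a time
on a worker thread, and synchronised with the GUI after every step. All class
and member names below are hypothetical.

// Hypothetical sketch of the core/gui interaction described above; not the
// actual Virtual testbed code.
#include <atomic>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

struct State { /* wavy surface, velocity potential, pressures, forces ... */ };

struct Component {
    virtual ~Component() = default;
    virtual void step(State& s) = 0;  // output of one component feeds the next
};

class Core {
public:
    void add(std::unique_ptr<Component> c) { pipeline_.push_back(std::move(c)); }
    void run() {
        worker_ = std::thread([this] {
            while (running_) {
                State next = state();             // work on a private copy
                for (auto& c : pipeline_) c->step(next);
                std::lock_guard<std::mutex> lock(mtx_);
                state_ = std::move(next);         // one synchronisation per step
            }
        });
    }
    State state() const {
        std::lock_guard<std::mutex> lock(mtx_);
        return state_;                            // GUI draws the last finished step
    }
    void stop() { running_ = false; if (worker_.joinable()) worker_.join(); }
private:
    std::vector<std::unique_ptr<Component>> pipeline_;
    State state_;
    mutable std::mutex mtx_;
    std::atomic<bool> running_{true};
    std::thread worker_;
};

Because the GUI thread only ever reads the last completed step through state(),
the interface stays responsive even when the workstation is not powerful enough
to compute in real-time, which is the behaviour described in the hunk above.
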
@@ -168,9 +168,10 @@ model on a graphical accelerator~\cite{gankevich2018ocean}: its algorithm does
 not use transcendental mathematical functions, has nonlinear memory access
 pattern and complex information dependencies. It is much more efficient (even
 without serious optimisations) to execute it on a processor. In contrast, the
-other two waves are embarrassingly parallel and easy to rewrite in OpenCL.
+other two wave models are embarrassingly parallel and easy to rewrite in
+OpenCL.

-Each wave model outputs threedimensional (one temporal and two spatial
+Each wave model outputs three-dimensional (one temporal and two spatial
 dimensions) field of wavy surface elevation, and ARMA model post-processes this
 field using the following algorithm. First, autocovariance function (ACF) is
 estimated from the input field using Wiener---Khinchin theorem. Then ACF is
@@ -181,7 +182,7 @@ elevation field.
 The resulting field is stochastic, but has the same integral
 characteristics as the original one. In particular, probability distribution
 function of wavy surface elevation, wave height, length and period is
 preserved. Using ARMA
-model for post-processing has the following advantages.
+model for post-processing has several advantages.
 \begin{itemize}
     \item It makes wavy surface aperiodic (its period equals period of
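
For illustration (not from the repository), a minimal C++ sketch of the first
step of the post-processing described above: estimating the autocovariance
function of the elevation field. A direct time-domain estimator over a 1-D
slice is shown for brevity; the paper's approach obtains the same quantity
through the power spectrum (Wiener--Khinchin theorem) on the full
three-dimensional field, and the function name here is an assumption.

// Hypothetical biased autocovariance estimate of a 1-D slice of the elevation
// field; an FFT-based (Wiener--Khinchin) implementation would replace the
// O(n*max_lag) loop in practice.
#include <cstddef>
#include <vector>

std::vector<double> autocovariance(const std::vector<double>& z, std::size_t max_lag) {
    const std::size_t n = z.size();
    double mean = 0.0;
    for (double v : z) mean += v;
    mean /= static_cast<double>(n);
    std::vector<double> acf(max_lag + 1, 0.0);
    for (std::size_t lag = 0; lag <= max_lag; ++lag) {
        double sum = 0.0;
        for (std::size_t t = 0; t + lag < n; ++t)
            sum += (z[t] - mean) * (z[t + lag] - mean);
        acf[lag] = sum / static_cast<double>(n);  // biased estimator
    }
    return acf;
}

The covariance matrices mentioned later in this diff, from which the
autoregressive coefficients are computed, are presumably assembled from
estimates of this kind, which is where the double-precision requirement comes
from.
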
@@ -261,7 +262,7 @@ formula from linear wave theory.
 Formula~\eqref{eq:phi} is particularly suitable for computation on a graphical
 accelerator: it contains transcendental mathematical functions (complex
 exponents) that help offset slow global memory loads and stores, it is explicit
-which makes it easy to compute in parallel and it is written using Fourier
+which makes it easy to compute in parallel, and it is written using Fourier
 transforms that are efficient to compute on a graphical
 accelerator~\cite{volkov2008fft}.
@@ -282,49 +283,46 @@ is considered wetted; if it is partially submerged, the algorithm computes
 intersection points using bisection method and wavy surface interpolation, and
 slices the part of the panel which is above the wavy surface (for simplicity
 the slice is assumed to be straight line, as is the case for sufficiently small
-panels).
-
-Wave pressure at any point under wavy surface is computed using dynamic
-boundary condition from~\eqref{eq:problem} as an explicit formula. Then the
-pressure is interpolated in the centre of each panel out of which the ship hull
-is composed to compute pressure force acting on a ship hull.
+panels). Wave pressure at any point under wavy surface is computed using
+dynamic boundary condition from~\eqref{eq:problem} as an explicit formula.
+Then the pressure is interpolated in the centre of each panel to compute
+pressure force acting on a ship hull.

 It is straightforward to rewrite pressure computation for a graphical
 accelerator as its algorithm reduces to looping over large collection of panels
 and performing the same calculations for each of them; however, dynamic
-boundary condition contains temporal and spatial derivatives that have to
-computed on a graphical accelerator. Although, computing derivatives on a
-processor is fast, copying the results to accelerator's main memory proved to
-be inefficient as there are four arrays (one for each dimension) that need to
-be allocated and transferred. Simply rewriting code for OpenCL proved to be
-even more inefficient due to irregular memory access pattern for different
-array dimensions. So, we resorted to implementing the algorithm described
-in~\cite{micikevicius2009derivative}, that stores intermediate results in the
-local memory of the accelerator. Using this algorithm allowed us to store
-arrays of derivatives entirely in graphical accelerator's main memory and
-eliminate data transfer altogether.
+boundary condition contains temporal and spatial derivatives that have to be
+computed. Although computing derivatives on a processor is fast, copying the
+results to accelerator's main memory proved to be inefficient as there are four
+arrays (one for each dimension) that need to be allocated and transferred.
+Simply rewriting code for OpenCL proved to be even more inefficient due to
+irregular memory access pattern for different array dimensions. So, we resorted
+to implementing the algorithm described in~\cite{micikevicius2009derivative},
+that stores intermediate results in the local memory of the accelerator. Using
+this algorithm allowed us to store arrays of derivatives entirely in graphical
+accelerator's main memory and eliminate data transfer altogether.
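
For illustration, a rough OpenCL sketch (not the commit's actual kernel) of the
local-memory technique cited above (micikevicius2009derivative): each
work-group stages a tile of the field plus a one-cell halo in local memory and
computes a central difference from it, so the derivative array is produced and
kept in accelerator memory. The kernel name, the fixed work-group size of 16,
the one-dimensional layout and the clamped boundaries are all simplifying
assumptions.

// Hypothetical OpenCL kernel embedded in a C++ translation unit; it assumes a
// work-group size of 16 in dimension 0 and is not the Virtual testbed code.
static const char* kDerivativeKernel = R"CLC(
kernel void derivative_x(global const float* phi,  // field, e.g. velocity potential
                         global float* dphi_dx,    // output derivative, stays on device
                         const int nx,
                         const float inv_2dx)      // 1 / (2 * dx)
{
    local float tile[16 + 2];                      // work-group tile plus halo cells
    const int gx = get_global_id(0);
    const int lx = get_local_id(0);

    // Stage the tile and its halo into local memory (edges clamped for brevity).
    tile[lx + 1] = phi[min(gx, nx - 1)];
    if (lx == 0)
        tile[0] = phi[max(gx - 1, 0)];
    if (lx == (int)get_local_size(0) - 1)
        tile[lx + 2] = phi[min(gx + 1, nx - 1)];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Central difference read from local memory; nothing is copied to the host.
    if (gx < nx)
        dphi_dx[gx] = (tile[lx + 2] - tile[lx]) * inv_2dx;
}
)CLC";
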
 \subsection{Translational and angular ship motion computation}

 In order to compute ship position, translational velocity, angular displacement
-and angular velocity each time step we solve equations motion (adapted
-from~\cite{matusiak2013}) using pressure force computed for each panel. The
-equation of translational motion
+and angular velocity for each time step we solve equations of translational and
+angular motion (adapted from~\cite{matusiak2013}) using pressure force computed
+for each panel:
 \begin{equation*}
     \dot{\vec{v}} = \frac{\vec{F}}{m} + g\vec{\tau} - \vec{\Omega}\times\vec{v} - \lambda\vec{v},
     \qquad
     \dot{\vec{h}} = \vec{G} - \vec{\Omega}\times\vec{h},
     \qquad
-    \vec{h} = \InertiaMatrix \cdot \vec{\Omega},
+    \vec{h} = \InertiaMatrix \cdot \vec{\Omega}.
 \end{equation*}
-where
+Here
 \(\vec{\tau}=\left(-\sin\theta,\cos\theta\sin\phi,-\cos\theta\cos\phi\right)\)
 is a vector that transforms \(g\) into body-fixed coordinate system,
 \(\vec{v}\)~--- translational velocity vector, \(g\)~--- gravitational
 acceleration, \(\vec{\Omega}\)~--- angular velocity vector, \(\vec{F}\)~---
 vector of external forces, \(m\)~--- ship mass, \(\lambda\)~--- damping
-coefficient, \(h\)~--- angular momentum and \(\vec{G}\)~--- the moment of
+coefficient, \(h\)~--- angular momentum, \(\vec{G}\)~--- the moment of
 external forces, and \(\InertiaMatrix\)~--- inertia matrix.

 We compute total force \(\vec{F}\) and momentum \(\vec{G}\) acting on a ship
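
For illustration (not the repository's integrator), a minimal C++ sketch of one
explicit time step of the body-fixed equations above; the actual programme may
use a different integration scheme, and every name here is hypothetical.

// Hypothetical explicit-Euler step for
//   dv/dt = F/m + g*tau - Omega x v - lambda*v,
//   dh/dt = G - Omega x h,  with  h = I * Omega.
#include <array>

using Vec3 = std::array<double, 3>;
using Mat3 = std::array<std::array<double, 3>, 3>;

static Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0]};
}

struct BodyState {
    Vec3 v{};  // translational velocity (body frame)
    Vec3 h{};  // angular momentum (body frame)
};

void euler_step(BodyState& s, const Vec3& F, const Vec3& G, const Vec3& tau,
                const Mat3& inertia_inv, double m, double g, double lambda,
                double dt) {
    // Recover the angular velocity from the angular momentum, Omega = I^-1 * h.
    Vec3 omega{};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            omega[i] += inertia_inv[i][j] * s.h[j];

    const Vec3 wxv = cross(omega, s.v);
    const Vec3 wxh = cross(omega, s.h);
    for (int i = 0; i < 3; ++i) {
        s.v[i] += dt * (F[i] / m + g * tau[i] - wxv[i] - lambda * s.v[i]);
        s.h[i] += dt * (G[i] - wxh[i]);
    }
}

Ship position and angular displacement would then be advanced from the updated
velocity and angular velocity in the same explicit manner.
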
@@ -354,7 +352,7 @@ category typically have more double precision arithmetic units and
 accelerators from the second category are typically optimised for single
 precision. The ratio of single to double precision performance can be as high
 as 32. Virtual testbed produces correct results for both single and double
 precision, but
-OpenCL version supports only single precision and graphical accelerators that
+OpenCL version supports only single precision, and graphical accelerators that
 we used have higher single precision performance (tab.~\ref{tab:setup}). So we
 choose single precision in all benchmarks.
@@ -377,15 +375,16 @@ choose single precision in all benchmarks.
 \end{table}

 Double precision was used only for computing autoregressive model coefficients,
-because round-off and truncation numerical errors make covariance matrices (from
-which coefficients are computed) non-positive definite. These matrices
+because round-off and truncation numerical errors make covariance matrices
+(from which coefficients are computed) non-positive definite. These matrices
 typically have very large condition numbers, and linear system which they
 represent cannot be solved by Gaussian elimination or \(LDLT\) Cholesky
-decomposition, as these methods are numerically unstable.
+decomposition, as these methods are numerically unstable (at least in our
+programme).

 Since Virtual testbed does both visualisation and computation in real-time, we
 measured performance of each stage of the main loop (fig.~\ref{fig:loop})
-synchronously with parameters that affect it. To assess computational
+synchronously with the parameters that affect it. To assess computational
 performance we measured execution time of each stage in microseconds (wall
 clock time) together with the number of wetted panels, and wavy surface size.
 To assess visualisation performance we measured the execution time of each
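
For illustration (not the actual benchmarking code), a minimal C++ sketch of
the per-stage wall-clock measurement described above: each stage of the main
loop is timed in microseconds together with the parameters that affect it. The
record layout and names are assumptions.

// Hypothetical per-stage wall-clock measurement.
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <string>

struct Sample {
    std::string stage;          // e.g. "wavy_surface" or "velocity_potential"
    std::int64_t microseconds;  // wall clock time of one call
    std::size_t wetted_panels;  // parameter affecting the pressure force stage
    std::size_t surface_size;   // parameter affecting generator and solver stages
};

template <class Func>
Sample timed(const std::string& stage, std::size_t wetted_panels,
             std::size_t surface_size, Func&& run_stage) {
    using clock = std::chrono::steady_clock;
    const auto t0 = clock::now();
    run_stage();  // execute one stage of the main loop
    const auto t1 = clock::now();
    const auto us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0);
    return Sample{stage, static_cast<std::int64_t>(us.count()),
                  wetted_panels, surface_size};
}
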
@@ -426,13 +425,13 @@ with and without graphical accelerator. The code was compiled with maximum
 optimisation level including processor-specific optimisations which enabled
 auto-vectorisation for further performance improvements.

-We ran all tests for each of the three ship models: Aurora, MICW (a hull with
-reduced moments of inertia for the current waterline) and sphere. The first two
-models represent real-world ships with known characteristics and were taken from
-Vessel database~\cite{vessel2015} registered by our university which was created
-for Hull programme~\cite{hull2010}. Parameters of these ship models are listed
-in tab.~\ref{tab:ships}, threedimensional models are shown in
-fig.~\ref{fig:models}. Spherical ship hull was used as a geometrical shape
+We ran all tests for each of the three ship models: Aurora cruiser, MICW (a
+hull with reduced moments of inertia for the current waterline) and a sphere.
+The first two models represent real-world ships with known characteristics and
+we took them from Vessel database~\cite{vessel2015} registered by our
+university which is managed by Hull programme~\cite{hull2010}. Parameters of
+these ship models are listed in tab.~\ref{tab:ships}, three-dimensional models
+are shown in fig.~\ref{fig:models}. Sphere was used as a geometrical shape
 wetted surface area of which is close to constant under impact of ocean waves.

 We ran all tests for each workstation from tab.~\ref{tab:setup} to investigate
@@ -440,7 +439,7 @@ if there is a difference in performance between ordinary workstation and a
 computer for visualisation. Storm is a regular workstation with mediocre
 processor and graphical accelerator, GPUlab is a slightly more powerful
 workstation, and Capybara has the most powerful processor and professional
-graphical accelerator for visualisation.
+graphical accelerator optimised for visualisation.

 \begin{table}
     \centering
@@ -462,7 +461,7 @@ graphical accelerator for visualisation.
     \centering
     \includegraphics[width=0.5\textwidth]{build/aurora.eps}\hfill
    \includegraphics[width=0.5\textwidth]{build/micw.eps}
-    \caption{Aurora and MICW threedimensional ship hull
+    \caption{Aurora and MICW three-dimensional ship hull
     models.\label{fig:models}}
 \end{figure}
@@ -476,9 +475,11 @@ with high frame rate and small simulation time steps.
 \begin{itemize}
     \item We achieved more than 60 simulation steps per second (SSPS) on each
-        of the workstations. For Storm and GPUlab the most performant programme
-        version was the one for graphical accelerator and for Capybara the most
-        performant version was the one for the processor (tab.~\ref{tab:best}).
+        of the workstations. SSPS is the same metric as frames per second in
+        visualisation, but for simulation. For Storm and GPUlab the most
+        performant programme version was the one for graphical accelerator and
+        for Capybara the most performant version was the one for the processor
+        (tab.~\ref{tab:best}).
     \item The most performant node is GPUlab with 104 simulation steps per
         second. Performance of Capybara is higher than Storm, but it uses
@@ -558,11 +559,10 @@ problem that we solve is too small to saturate graphical accelerator cores.
 We tried to eliminate expensive data copying operations between host and
 graphical accelerator memory, where possible, but we need to simulate more
 physical phenomena and at a larger scale (ships with large number of panels, large
-number of compartments with large number of faces, wind simulation etc.) to
-verify that performance gap increases for powerful workstations. On the bright
-side, even if a computer does not have powerful graphical accelerator (e.g.~a
-laptop with integrated graphics), it still can run Virtual testbed with
-acceptable performance.
+number of compartments, wind simulation etc.) to verify that performance gap
+increases for powerful workstations. On the bright side, even if a computer
+does not have powerful graphical accelerator (e.g.~a laptop with integrated
+graphics), it still can run Virtual testbed with acceptable performance.

 Large SSPS is needed neither for smooth visualisation, nor for accurate
 simulation; however, it gives performance reserve for further increase in
@@ -584,20 +584,21 @@ panels. MICW hull has less number of panels than Aurora, but larger size and
 exactly two times worse performance (tab.~\ref{tab:best}). The size of the hull
 affects the size of the grid in each point of which velocity potential and then
 pressure is computed. These routines are much more compute intensive in
-comparison to wetted surface determination and pressure force computation
-(performance of which depends on the number of panels).
+comparison to wetted surface determination and pressure force computation,
+performance of which depends on the number of panels.

 Despite the fact that Capybara has the highest floating-point performance
-across all workstations in the benchmarks, Virtual testbed runs faster on its
+across all workstations in the benchmarks, Virtual testbed runs faster on the
 processor, not the graphical accelerator. Routine-by-routine investigation
-showed that it simply slower at computing even fully parallel Stokes wave
-generator kernel. This kernel fills threedimensional array elements using
-explicit formula for the wave profile, it has linear memory access pattern and
-no information dependencies between array elements. It seems, that P5000 is not
-optimised for general purpose computations. We did not conduct visualisation
-benchmarks, so we do not know if it is more efficient in that case.
-
-Although, Capybara's processor hash 20 hardware threads (2 threads per core),
+showed that this graphics card is simply slower at computing even fully
+parallel Stokes wave generator kernel. This kernel fills three-dimensional
+array using explicit formula for the wave profile, it has linear memory access
+pattern and no information dependencies between array elements. It seems that
+P5000 is not optimised for general purpose computations. We did not conduct
+visualisation benchmarks, so we do not know if it is more efficient in that
+case.
+
+Although Capybara's processor has 20 hardware threads (2 threads per core),
 OpenMP performance does not scale beyond 10 threads. Parallel threads in our
 code do mostly the same operations but with different data, so switching
 between different hardware threads running on the same core in the hope that
@@ -625,14 +626,14 @@ processing. Not interacting with slow stable storage on every iteration allows
 Spark to achieve an order of magnitude higher performance than Hadoop
 (open-source version of MapReduce) on iterative algorithms.

-On a heterogeneous node an analogue of stable storage, read/writes to which is
+For a heterogeneous node an analogue of stable storage, read/writes to which is
 much slower than accesses to the main memory, is graphical accelerator memory.
 To minimise interaction with this memory, we do not read intermediate results
 of our computations from it, but reuse arrays that already reside there. (As a
 concrete example, we do not copy pressure field from a graphical accelerator,
 only the forces for each panel.) This allows us to eliminate expensive data
 transfer between CPU and GPU memory. In early versions of our programme this
-copying significantly slowed down simulation.
+copying slowed down simulation significantly.

 Although, heterogeneous node is not a cluster, the approach to programming it
 is similar to distributed data processing systems: we process data only on those
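
For illustration (not the commit's code), a C++ host-side sketch of the pattern
described in the hunk above: the pressure field stays in a device buffer
between kernels and only the per-panel forces are read back each step. Apart
from the standard OpenCL calls, every identifier is hypothetical.

// Hypothetical host-side fragment: the intermediate pressure field never
// leaves the device; only per-panel forces are copied back each step.
#include <CL/cl.h>
#include <vector>

void simulation_step(cl_command_queue queue,
                     cl_kernel pressure_kernel, cl_kernel force_kernel,
                     cl_mem pressure_buf,  // device-resident, never read back
                     cl_mem force_buf,     // small: three floats per panel
                     size_t grid_size, size_t num_panels,
                     std::vector<float>& host_forces) {
    // Both kernels consume and produce device buffers; no host copies between them.
    clSetKernelArg(pressure_kernel, 0, sizeof(cl_mem), &pressure_buf);
    clEnqueueNDRangeKernel(queue, pressure_kernel, 1, nullptr,
                           &grid_size, nullptr, 0, nullptr, nullptr);

    clSetKernelArg(force_kernel, 0, sizeof(cl_mem), &pressure_buf);
    clSetKernelArg(force_kernel, 1, sizeof(cl_mem), &force_buf);
    clEnqueueNDRangeKernel(queue, force_kernel, 1, nullptr,
                           &num_panels, nullptr, 0, nullptr, nullptr);

    // The only device-to-host transfer per step: the force acting on each panel.
    clEnqueueReadBuffer(queue, force_buf, CL_TRUE, 0,
                        host_forces.size() * sizeof(float), host_forces.data(),
                        0, nullptr, nullptr);
}
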
@@ -652,12 +653,12 @@ is copied to and from a graphical accelerator is relatively small.
 We showed that ship motion simulation can be performed on a regular workstation
 with or without graphical accelerator. Our programme includes only minimal
-number of mathematical models that allow motion calculation, but has performance
-reserve for inclusion of additional models. We plan to implement wind, rudder
-and propeller, compartment flooding and fire, and trochoidal waves simulation.
-Apart from that, the main direction of future research is creation of on-board
-intelligent system that would include Virtual testbed as an integral part for
-simulating and predicting physical phenomena.
+number of mathematical models that allow ship motions calculation, but has
+performance reserve for inclusion of additional models. We plan to implement
+rudder and propeller, compartment flooding and fire, wind and trochoidal waves
+simulation. Apart from that, the main direction of future research is creation
+of on-board intelligent system that would include Virtual testbed as an
+integral part for simulating and predicting physical phenomena.

 \subsubsection*{Acknowledgements.}
 Research work is supported by Saint Petersburg State University (grant