commit 99f0c70205c00ec15c28e43c8a793e76f2069cff
parent ba1023e37e7aad060ff3507ddb0efa6bea9624d0
Author: Ivan Gankevich <igankevich@ya.ru>
Date: Sat, 23 Mar 2019 17:23:16 +0300
Results.
Diffstat:
main.tex | 53 +++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 51 insertions(+), 2 deletions(-)
diff --git a/main.tex b/main.tex
@@ -1,6 +1,7 @@
\documentclass[runningheads]{llncs}
\usepackage{amsmath}
+\usepackage{booktabs}
\usepackage{graphicx}
\usepackage{tikz}
\usetikzlibrary{arrows.meta}
@@ -300,7 +301,7 @@ local memory of the accelerator. Using this algorithm allowed us to store
arrays of derivatives entirely in graphical accelerator's main memory and
eliminate data transfer altogether.
-\subsection{Translational and angular motion computation}
+\subsection{Translational and angular ship motion computation}
In order to compute ship position, translational velocity, angular displacement
and angular velocity at each time step we solve the equations of motion (adapted
@@ -337,9 +338,53 @@ processor.
\section{Results}
+Virtual testbed performance was benchmarked in a number of tests. Since we use
+both OpenMP and OpenCL technologies for parallel computing, we wanted to know
+how performance scales with the number of processor cores, both with and
+without a graphical accelerator.
+
+Graphical accelerators fall into two broad categories: those for general-purpose
+computations and those for visualisation. Accelerators from the first category
+typically have more double precision arithmetic units, whereas accelerators
+from the second category are typically optimised for single precision. The
+ratio of single to double precision performance can be as high as 32. We ran
+all tests on a node with a Quadro P5000 (tab.~\ref{tab:setup}), which falls into
+the second category, so we chose single precision in all benchmarks.
+
+\begin{table}
+ \centering
+ \caption{Hardware configuration and compiler options for
+ benchmarks.\label{tab:setup}}
+ \begin{tabular}{ll}
+ \toprule
+ Graphical accelerator & NVIDIA Quadro P5000 \\
+ Processor & Intel Xeon CPU E5-2630 v4 \\
+ Compiler & GCC 8.1.1 \\
+ Compiler options & \texttt{-O3 -march=native} \\
+ \bottomrule
+ \end{tabular}
+\end{table}
+
+Double precision was used only for computing autoregressive model coefficients,
+because in single precision roundoff and truncation errors make the covariance
+matrices (from which the coefficients are computed) non-positive definite.
+These matrices typically have very large condition numbers, and the linear
+systems they represent cannot be solved by Gaussian elimination or
+\(LDL^{T}\) (Cholesky) decomposition, as these methods are numerically
+unstable for such ill-conditioned matrices.
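+
+The effect of a large condition number can be sketched with a standard rule of
+thumb (an illustrative estimate, not a measurement from the benchmarks; the
+symbols \(A\), \(x\), \(\hat{x}\), \(\kappa\) and \(u\) are introduced here
+only for this estimate). For a linear system with matrix \(A\), exact solution
+\(x\) and computed solution \(\hat{x}\), a backward-stable solver gives roughly
+\[
+  \frac{\|\hat{x}-x\|}{\|x\|} \lesssim \kappa(A)\,u,
+\]
+where \(\kappa(A)\) is the condition number of \(A\) and \(u\) is the unit
+roundoff: \(u=2^{-24}\approx 6\times10^{-8}\) in single precision and
+\(u=2^{-53}\approx 1.1\times10^{-16}\) in double precision. Under this
+estimate, a condition number of about \(10^{7}\) or more may leave no correct
+digits in single precision, while double precision still retains several.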
+
+Since Virtual testbed does both visualisation and computation in real time, we
+measured the performance of each stage of the main loop (fig.~\ref{fig:loop})
+synchronously with the parameters that affect it. To assess computational
+performance we measured the execution time of each stage in microseconds (wall
+clock time) together with the number of wetted panels and the wavy surface size.
+To assess visualisation performance we measured the execution time of each
+visualisation frame (one iteration of the visualisation main loop) and the
+execution time of each computational frame (one iteration of the computational
+loop), from which it is easy to compute the usual frames-per-second metric.
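+
+The conversion from frame time to frame rate can be written out as follows (a
+minimal sketch; the symbols \(t_i\), \(N\) and \(f\) are notation assumed
+here, not taken from the measurements). If \(t_i\) is the execution time of
+the \(i\)-th frame in microseconds and \(N\) frames are recorded, the average
+frame rate in frames per second is
+\[
+  f = \frac{10^{6}\,N}{\sum_{i=1}^{N} t_i},
+\]
+which reduces to \(f = 10^{6}/t\) for a single frame of duration \(t\). As a
+purely illustrative number, a frame that takes \(16\,667\) microseconds
+corresponds to approximately 60 frames per second.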
+
\begin{figure}
\centering
- \begin{tikzpicture}[x=2.2cm,y=-1.5cm]
+ \begin{tikzpicture}[x=2.2cm,y=-1.4cm]
\node[Block] (s1) at (0,0) {\strut{}Wavy surface};
\node[Block] (s2) at (1,0) {\strut{}Autoreg. model};
\node[Block] (s3) at (2,0) {\strut{}Wave numbers};
@@ -362,6 +407,10 @@ processor.
\caption{Virtual testbed main loop.\label{fig:loop}}
\end{figure}
+We ran all tests on the same node for an increasing number of processor cores,
+both with and without a graphical accelerator. The code was compiled with the
+maximum optimisation level, including processor-specific optimisations, which
+enabled auto-vectorisation for further performance improvement.
\section{Discussion}