\section{Compute and graphics contexts interoperability}

OpenGL itself does not provide any mechanism for interaction between OpenGL and OpenCL. However, it does support data and message exchange between its own contexts, and the same fundamental principles underlie CL/GL interoperability \cite{opengl:orm}. It is therefore worthwhile to dwell on this question, as it represents the basics we need to understand.

First, the graphics card that is intended to be used for computation should be checked for OpenCL shared context mode support. To find this out, the \texttt{clinfo} command line utility should be run on the target machine. If the required functionality is supported, the \texttt{cl\_khr\_gl\_sharing} extension will be listed in the ``Device Extensions'' section \cite{objects2013opencl}. This extension is provided by the Khronos Group and is responsible for the interaction between the two APIs.
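The same check can be performed programmatically. The sketch below (assuming an already selected \texttt{cl\_device\_id}) queries the device extension string and searches it for \texttt{cl\_khr\_gl\_sharing}:

\begin{verbatim}
#include <CL/cl.h>
#include <string>
#include <vector>

// Returns true if the device advertises the cl_khr_gl_sharing extension.
bool supports_gl_sharing(cl_device_id device) {
    size_t size = 0;
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, 0, nullptr, &size);
    std::vector<char> buf(size);
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, size, buf.data(), nullptr);
    return std::string(buf.data(), size).find("cl_khr_gl_sharing")
        != std::string::npos;
}
\end{verbatim}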

This extension provides all the necessary functions for OpenCL, which are defined in the \texttt{cl\_gl.h} header file. In general, they can be divided into three main groups (representative functions from each group are listed below):

\begin{itemize}
    \item Memory broker functions --- acquire and release allocated memory areas represented by OpenGL objects;
    \item Object transformation functions --- create an OpenCL representation of an OpenGL object;
    \item Info functions --- provide various information about the OpenGL context, such as associated devices, object description, etc.
\end{itemize}
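
Representative functions from each group, declared in \texttt{cl\_gl.h} (an illustrative subset, parameter names abbreviated):

\begin{verbatim}
/* Memory broker functions: lock and unlock GL objects for CL use. */
cl_int clEnqueueAcquireGLObjects(cl_command_queue queue,
    cl_uint num_objects, const cl_mem *mem_objects,
    cl_uint num_events_in_wait_list, const cl_event *event_wait_list,
    cl_event *event);
cl_int clEnqueueReleaseGLObjects(cl_command_queue queue,
    cl_uint num_objects, const cl_mem *mem_objects,
    cl_uint num_events_in_wait_list, const cl_event *event_wait_list,
    cl_event *event);

/* Object transformation: wrap a GL buffer object in a cl_mem handle. */
cl_mem clCreateFromGLBuffer(cl_context context, cl_mem_flags flags,
    cl_GLuint bufobj, cl_int *errcode_ret);

/* Info function: query the devices associated with a GL context. */
cl_int clGetGLContextInfoKHR(const cl_context_properties *properties,
    cl_gl_context_info param_name, size_t param_value_size,
    void *param_value, size_t *param_value_size_ret);
\end{verbatim}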

The next thing that should be discussed is the data types that can be handled by the interoperability API. This issue was partially reviewed in~\cite{bogdanov:2016:CDR} in the discussion of graphics API parallelisation. Basically, all OpenGL objects fall into two groups. The first one is the Container Object group: such objects cannot be shared between contexts, since they contain references to other objects and the GL standard disallows transferring objects of that type, i.e.~they are not of interest to us. The other group contains regular Objects, which can be shared between contexts without any limitations. Since we only need to share the data for now, we should take a look at the Buffer Object, which can store an array of unformatted memory allocated by the OpenGL context. Among all Buffer Objects, in our particular case the Vertex Buffer Object is used first to transfer the surface representation data.
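
As a minimal illustration (error handling is omitted; a shared OpenCL context \texttt{ctx} and the surface grid size \texttt{nx}$\times$\texttt{ny} are assumed to exist), a Vertex Buffer Object is allocated by OpenGL and then wrapped into an OpenCL memory object:

\begin{verbatim}
// Allocate a vertex buffer object large enough for one surface slice.
GLuint vbo = 0;
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, nx*ny*sizeof(GLfloat), nullptr,
             GL_DYNAMIC_DRAW);
glBindBuffer(GL_ARRAY_BUFFER, 0);

// Create an OpenCL view of the same memory (ctx is the shared context).
cl_int err = CL_SUCCESS;
cl_mem surface = clCreateFromGLBuffer(ctx, CL_MEM_READ_WRITE, vbo, &err);
\end{verbatim}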

\begin{figure}
    \centering
    \includegraphics[width=.99\linewidth]{gl-cl-interop}
    %\includegraphics{gl-cl-interop}
    \caption{OpenGL/OpenCL interoperability pipeline.\label{fig:gl-cl-interop}}
\end{figure}

Since the graphics unit can work with only a single context at a time, we need to manage the contexts manually by alternating them upon request. Following this approach we can build a pipeline as presented in fig.~\ref{fig:gl-cl-interop}. Here, both contexts, controlled by the main process, perform eight main steps to achieve the goal (a code sketch of steps 5--8 follows the list):

\begin{enumerate}
    \item First, check whether \texttt{cl\_khr\_gl\_sharing} is supported by the target GPGPU.
    \item If the previous step succeeds, initialise both compute and graphics contexts.
    \item Next, the shared object is created by OpenGL. It will be used later on to pass the data back to the graphics context.
    \item Then control passes to the OpenCL context and the object is registered there for sharing.
    \item Apply a lock to the memory area acquired by the object.
    \item Perform all required computations and write the results to the shared memory.
    \item Release the lock and pass control to the graphics API.
    \item Finally, process and draw the geometry.
\end{enumerate}
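
A minimal sketch of steps 5--8 is given below; the command queue \texttt{queue}, the kernel \texttt{kernel} and the shared buffer \texttt{surface} are assumed to be created as described above, and the hypothetical \texttt{draw\_surface()} stands for the application's OpenGL drawing code:

\begin{verbatim}
// 5. Lock the shared buffer for OpenCL (GL must be done with it).
glFinish();
clEnqueueAcquireGLObjects(queue, 1, &surface, 0, nullptr, nullptr);

// 6. Run the computation that writes results into the shared buffer.
clSetKernelArg(kernel, 0, sizeof(cl_mem), &surface);
size_t global[2] = {size_t(nx), size_t(ny)};
clEnqueueNDRangeKernel(queue, kernel, 2, nullptr, global,
                       nullptr, 0, nullptr, nullptr);

// 7. Release the lock and pass control back to the graphics API.
clEnqueueReleaseGLObjects(queue, 1, &surface, 0, nullptr, nullptr);
clFinish(queue);

// 8. Process and draw the geometry stored in the shared buffer.
draw_surface();
\end{verbatim}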

\section{Standing sea waves simulation model}

%\subsection{Introduction}

Our approach to sea wave simulation is based on the autoregressive moving average (ARMA) model of sea waves~\cite{degtyarev2011modelling,degtyarev2013synoptic}. This model was developed as a superior alternative to the existing linear Longuet---Higgins model. The new model simulates sea waves without the assumptions of linear and small-amplitude wave theories, i.e.
\begin{itemize}
    \item the model generates waves of arbitrary amplitudes,
    \item the period of the wavy surface realisation equals the period of the pseudo-random number generator (PRNG), and
    \item it requires fewer coefficients to converge compared to the Longuet---Higgins model.
\end{itemize}
The model allows generating both propagating and standing sea waves via the moving average and autoregressive processes respectively, but for the purpose of this paper we narrow the discussion to standing waves and the autoregressive (AR) process only.

One implication of dropping the assumptions of linear wave theory is that it is not possible to use the linear velocity potential field computation formulae for the new wavy surface, as they were derived under the same assumptions. As a result, a new analytic formula was derived that determines the velocity potential field under an arbitrary wavy sea surface. This formula is particularly suitable for computation on GPUs:
\begin{itemize}
    \item it contains transcendental mathematical functions (hyperbolic cosines and complex exponents);
    \item it is computed over a large four-dimensional ($t$, $x$, $y$, $z$) region;
    \item it is analytic with no information dependencies between individual data points in the $t$ and $z$ dimensions.
\end{itemize}
Moreover, for the purpose of verification of the resulting wavy surface, it is imperative to visualise the surface and the velocity potential field in real time as the computation progresses. Performing two simulations at a time with different velocity potential field formulae allows spotting the difference in the computed fields and visually comparing the size and the shape of the regions where most of the wave energy is concentrated.

Within the framework of the autoregressive model for standing waves we investigate how GPGPU computations can be used to speed up the velocity potential field computation and to visualise the surface in real time as the computation proceeds.

\subsection{Governing equations for 3-dimensional AR process}

The three-dimensional autoregressive process is defined by
\begin{equation*}
	\zeta_{i,j,k} =
	\sum\limits_{l=0}^{p_1}
	\sum\limits_{m=0}^{p_2}
	\sum\limits_{n=0}^{p_3}
	\Phi_{l,m,n} \zeta_{i-l,j-m,k-n}
	+ \epsilon_{i,j,k}
	,
	\label{eq:arma-process}
\end{equation*}
where $\zeta$ --- wave elevation, $\Phi$ --- AR coefficients, $\epsilon$ --- white noise with Gaussian distribution,
$(p_1,p_2,p_3)$ --- AR process order, and
$\Phi_{0,0,0} \equiv 0$. The input parameters are the
AR process coefficients and order.
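
A direct implementation of this recurrence is sketched below (the array names, sizes and boundary handling are hypothetical); it makes explicit that every point depends on previously generated points:

\begin{verbatim}
// zeta and eps are nt x nx x ny arrays; Phi is the
// (p1+1) x (p2+1) x (p3+1) array of AR coefficients, Phi[0][0][0] == 0.
for (int i = 0; i < nt; ++i)
    for (int j = 0; j < nx; ++j)
        for (int k = 0; k < ny; ++k) {
            double sum = eps[i][j][k];            // white noise term
            for (int l = 0; l <= p1 && l <= i; ++l)
                for (int m = 0; m <= p2 && m <= j; ++m)
                    for (int n = 0; n <= p3 && n <= k; ++n)
                        sum += Phi[l][m][n] * zeta[i-l][j-m][k-n];
            zeta[i][j][k] = sum;                  // depends on earlier points
        }
\end{verbatim}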

The coefficients $\Phi$ are calculated from the autocovariance function (ACF) via
the three-dimensional Yule---Walker equations:
\begin{equation*}
    \Gamma
    \left[
        \begin{array}{l}
            \Phi_{0,0,0}\\
            \Phi_{0,0,1}\\
            \vdotswithin{\Phi_{0,0,0}}\\
            \Phi_{p_1,p_2,p_3}
        \end{array}
    \right]
    = 
    \left[
        \begin{array}{l}
            K_{0,0,0}-\Var{\epsilon}\\
            K_{0,0,1}\\
            \vdotswithin{K_{0,0,0}}\\
            K_{p_1,p_2,p_3}
        \end{array}
    \right],
    \qquad
    \Gamma=
    \left[
        \begin{array}{llll}
            \Gamma_0 & \Gamma_1 & \cdots & \Gamma_{p_1} \\
            \Gamma_1 & \Gamma_0 & \ddots & \vdotswithin{\Gamma_0} \\
            \vdotswithin{\Gamma_0} & \ddots & \ddots & \Gamma_1 \\
            \Gamma_{p_1} & \cdots & \Gamma_1 & \Gamma_0
        \end{array}
    \right],
\end{equation*}
where $\vec N = \left( p_1, p_2, p_3 \right)$, $\Var{\epsilon}$ --- white noise
variance, and
\begin{equation*}
    \Gamma_i = 
    \left[
    \begin{array}{llll}
        \Gamma^0_i & \Gamma^1_i & \cdots & \Gamma^{p_2}_i \\
        \Gamma^1_i & \Gamma^0_i & \ddots & \vdotswithin{\Gamma^0_i} \\
        \vdotswithin{\Gamma^0_i} & \ddots & \ddots & \Gamma^1_i \\
        \Gamma^{p_2}_i & \cdots & \Gamma^1_i & \Gamma^0_i
    \end{array}
    \right]
    \qquad
    \Gamma_i^j= 
    \left[
    \begin{array}{llll}
        K_{i,j,0} & K_{i,j,1} & \cdots & K_{i,j,p_3} \\
        K_{i,j,1} & K_{i,j,0} & \ddots & \vdotswithin{K_{i,j,0}} \\
        \vdotswithin{K_{i,j,0}} & \ddots & \ddots & K_{i,j,1} \\
        K_{i,j,p_3} & \cdots & K_{i,j,1} & K_{i,j,0}
    \end{array}
    \right].
\end{equation*}
Since $\Phi_{0,0,0}\equiv0$, the first row and column of $\Gamma$ can be
eliminated. The matrix $\Gamma$ is block-Toeplitz, positive definite and symmetric,
hence the system is solved by Cholesky decomposition. The white noise variance is
estimated by
\begin{equation*}
    \Var{\epsilon} = 
	K_{0,0,0}
    - 
	\sum\limits_{i=0}^{p_1}
	\sum\limits_{j=0}^{p_2}
	\sum\limits_{k=0}^{p_3}
    \Phi_{i,j,k} K_{i,j,k}.
\end{equation*}
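
In code this estimate is a straightforward reduction over the coefficient array; a short sketch (using the same hypothetical storage as above):

\begin{verbatim}
// K holds the ACF values, Phi the AR coefficients; Phi[0][0][0] == 0.
double var_eps = K[0][0][0];
for (int i = 0; i <= p1; ++i)
    for (int j = 0; j <= p2; ++j)
        for (int k = 0; k <= p3; ++k)
            var_eps -= Phi[i][j][k] * K[i][j][k];
\end{verbatim}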

\subsection{Three-dimensional velocity potential field}

The problem of finding the pressure field under a wavy sea surface is the inverse
problem of hydrodynamics for an incompressible inviscid fluid. The system of equations
for it in the general case is written as~\cite{kochin1966theoretical}
\begin{align}
    & \nabla^2\phi = 0,\nonumber\\
    & \phi_t+\frac{1}{2} |\vec{\upsilon}|^2 + g\zeta=-\frac{p}{\rho}, & \text{at }z=\zeta(x,y,t),\label{eq-problem}\\
    & D\zeta = \nabla \phi \cdot \vec{n}, & \text{at }z=\zeta(x,y,t),\nonumber
\end{align}
where \(\phi\)~--- velocity potential, \(\zeta\)~--- elevation (\(z\) coordinate)
of the wavy surface, \(p\)~--- wave pressure, \(\rho\)~--- fluid density,
\(\vec{\upsilon}=(\phi_x,\phi_y,\phi_z)\)~--- velocity vector, \(g\)~---
acceleration of gravity, and \(D\)~--- substantial (Lagrange) derivative. The
first equation is the continuity (Laplace) equation, the second one is the
conservation of momentum law (the so-called dynamic boundary condition); the
third one is the kinematic boundary condition for the free wavy surface, which states
that the rate of change of the wavy surface elevation (\(D\zeta\)) equals the change of the
velocity potential derivative along the wavy surface normal
(\(\nabla\phi\cdot\vec{n}\)).

The inverse problem of hydrodynamics consists in solving this system of equations
for \(\phi\). In this formulation the dynamic boundary condition becomes an explicit
formula to determine the pressure field using the velocity potential derivatives
obtained from the remaining equations. So, from the mathematical point of view, the
inverse problem of hydrodynamics reduces to the Laplace equation with a mixed boundary
condition~--- the Robin problem.

The three-dimensional version of~\eqref{eq-problem} is written as
\begin{align*}
    \label{eq-problem-3d}
    & \phi_{xx} + \phi_{yy} + \phi_{zz} = 0,\\
    & \zeta_t + \zeta_x\phi_x + \zeta_y\phi_y
    =
    \frac{\zeta_x}{\SqrtZeta{1 + \zeta_x^2 + \zeta_y^2}} \phi_x
    +\frac{\zeta_y}{\SqrtZeta{1 + \zeta_x^2 + \zeta_y^2}} \phi_y
    - \phi_z, & \text{at }z=\zeta(x,y,t).\nonumber
\end{align*}
Using the Fourier method under some assumptions, the equation is solved yielding a formula for
\(\phi\):
\begin{equation}
    \label{eq:phi-high-amp}
    \phi(x,y,z,t) = \InverseFourierY{
        \frac{ \Sinh{\smash{2\pi \Kveclen (z+h)}} }
        { 2\pi\Kveclen \Sinh{\smash{2\pi \Kveclen h}} }
        \FourierY{ \frac{ \zeta_t }{ \left( i f_1(x,y) + i f_2(x,y) - 1 \right) } }{u,v}
    }{x,y},
\end{equation}
where \(f_1(x,y)={\zeta_x}/{\SqrtZeta{1+\zeta_x^2+\zeta_y^2}}-\zeta_x\) and
\(f_2(x,y)={\zeta_y}/{\SqrtZeta{1+\zeta_x^2+\zeta_y^2}}-\zeta_y\).


\subsection{Architecture}

Incorporation of the OpenCL version of the velocity potential solver into the existing source code reduced to the addition of two subclasses (fig.~\ref{fig:classes}; a code sketch is given after the list):
\begin{itemize}[leftmargin=4cm]
    \item[\texttt{Realtime\_solver}] a subclass of the abstract solver that implements computation of the velocity potential field on the GPU, and 
    \item[\texttt{ARMA\_realtime\_driver}] a subclass of the control flow object~--- an object that defines the sequence of calls to subroutines~--- that implements real-time visualisation and stores the OpenGL buffer objects that are shared with the solver. These objects are shared with the solver only if the solver is real-time.
\end{itemize}
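
The relationship between these classes can be sketched as follows. Only the names \texttt{Realtime\_solver} and \texttt{ARMA\_realtime\_driver} come from fig.~\ref{fig:classes}; the base class and member names are assumptions made purely for illustration:

\begin{verbatim}
#include <CL/cl.h>
#include <GL/gl.h>

// Abstract interface for velocity potential solvers (illustrative name).
class Velocity_potential_solver {
public:
    virtual ~Velocity_potential_solver() = default;
    virtual void solve() = 0;
};

// GPU solver: writes its result into an OpenCL buffer that wraps
// an OpenGL buffer object owned by the driver.
class Realtime_solver: public Velocity_potential_solver {
    cl_mem _phi;                 // created with clCreateFromGLBuffer
public:
    void solve() override;
};

// Control flow object: owns the shared OpenGL buffers and visualises
// each time slice as soon as it is computed.
class ARMA_realtime_driver {
    GLuint _vbo_phi;             // shared with Realtime_solver
    void visualise_slice(int t); // illustrative member
};
\end{verbatim}
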
The algorithm for the computation and visualisation pipeline is presented in alg.~\ref{alg:pipeline}.

\begin{algorithm}
    Initialise shared OpenCL/OpenGL context.\\
    Generate the first wavy surface realisation time slice.\\
    Compute corresponding velocity potential field.\\
    \While{not exited}{
        Visualise the current slice of wavy surface and velocity field.\\
        Asynchronously compute next wavy surface time slice.\\
        Compute its velocity potential field.
    }
    \caption{Main loop of computation and visualisation pipeline.\label{alg:pipeline}}
\end{algorithm}


\begin{figure}
    \centering
    \includegraphics[width=9cm]{classes}
    \caption{Classes which implement OpenCL/OpenGL interoperability in the simulation code.\label{fig:classes}}
\end{figure}

\section{Evaluation}

For the purpose of evaluation we use a simplified version of \eqref{eq:phi-high-amp}:
\begin{align}
    \label{eq:phi-linear}
    \phi(x,y,z,t) &= \InverseFourierY{
        \frac{ \Sinh{\smash{2\pi \Kveclen (z+h)}} }
        { 2\pi\Kveclen \Sinh{\smash{2\pi \Kveclen h}} }
        \FourierY{ \zeta_t }{u,v}
    }{x,y}\nonumber \\
    &= \InverseFourierY{ g_1(u,v) \FourierY{ g_2(x,y) }{u,v} }{x,y}.
\end{align}
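
Equation \eqref{eq:phi-linear} maps to a forward two-dimensional FFT of $g_2 = \zeta_t$, a point-wise multiplication by $g_1$ for every $z$ grid point, and an inverse FFT for every $z$ grid point. A host-side sketch is given below; the \texttt{fft2d}/\texttt{ifft2d} helpers and the \texttt{g1} multiplier are hypothetical stand-ins for the library calls described next:

\begin{verbatim}
#include <algorithm>
#include <complex>
#include <vector>

// Hypothetical helpers: in-place 2D transforms and the g_1 multiplier.
void fft2d(std::complex<double>* data, int nx, int ny);
void ifft2d(std::complex<double>* data, int nx, int ny);
double g1(int u, int v, int z);

// Compute phi for all z levels from the spectrum of zeta_t.
std::vector<std::complex<double>>
compute_phi(const std::vector<double>& zeta_t, int nx, int ny, int nz) {
    std::vector<std::complex<double>> w(zeta_t.begin(), zeta_t.end());
    fft2d(w.data(), nx, ny);                        // FFT of zeta_t
    std::vector<std::complex<double>> phi(size_t(nx)*ny*nz);
    for (int z = 0; z < nz; ++z) {
        std::vector<std::complex<double>> slice(w); // copy of the spectrum
        for (int u = 0; u < nx; ++u)
            for (int v = 0; v < ny; ++v)
                slice[u*ny + v] *= g1(u, v, z);     // multiply by g_1(u,v)
        ifft2d(slice.data(), nx, ny);               // inverse transform
        std::copy(slice.begin(), slice.end(),
                  phi.begin() + size_t(z)*nx*ny);
    }
    return phi;
}
\end{verbatim}
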
Since the standing sea wave generator does not allow an efficient GPU implementation due to the autoregressive dependencies between wavy surface points, only the velocity potential solver was rewritten in OpenCL and its performance was compared to the existing OpenMP implementation.

For each implementation the overall performance of the solver for a particular time instant was measured. The velocity field was computed for one $t$ point, for 128 $z$ points below the wavy surface and for each $x$ and $y$ point of the four-dimensional $(t,x,y,z)$ grid. The only parameter that was varied between subsequent programme runs is the size of the grid along the $x$ dimension. A total of 10 runs were performed and the average time of each stage was computed.

A different FFT library was used for each version of the solver: for the OpenMP version FFT routines from the GNU Scientific Library (GSL)~\cite{galassi2015gnu} were used, and for the OpenCL version the clFFT library~\cite{clfft} was used instead. There are two major differences in the routines from these libraries.
\begin{enumerate}

    \item The order of frequencies in the Fourier transforms is different, and the clFFT library requires reordering the result of~\eqref{eq:phi-linear} whereas GSL does not (an illustration of such a reordering is given below).
    
    \item The discontinuity at $(x,y) = (0,0)$ of the velocity potential field grid is handled automatically by the clFFT library, whereas the GSL library produces skewed values at this point.
    
\end{enumerate}
For the GSL library an additional interpolation from neighbouring points was used to smooth the velocity potential field at these points. We have not spotted other differences in the FFT implementations that have an impact on the overall performance.
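
The exact layout produced by clFFT depends on the plan parameters, so the following quadrant-swap (fftshift-style) routine is shown only as an illustration of the kind of reordering involved:

\begin{verbatim}
#include <algorithm>
#include <complex>
#include <vector>

// Swap quadrants of an nx x ny spectrum (fftshift-style reordering);
// nx and ny are assumed to be even.
void fftshift(std::vector<std::complex<double>>& a, int nx, int ny) {
    const int hx = nx/2, hy = ny/2;
    for (int i = 0; i < hx; ++i)
        for (int j = 0; j < ny; ++j)
            std::swap(a[i*ny + j], a[(i+hx)*ny + (j+hy) % ny]);
}
\end{verbatim}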

In the course of the numerical experiments we have measured how much time each solver implementation spends in each computation stage in order to find out how efficient the data copying between host and device is in the OpenCL implementation, and how the two implementations compare in terms of performance.

\section{Results}

The experiments showed that the GPU implementation outperforms the CPU implementation by a factor of 10--15 (fig.~\ref{fig:bench-cpu-gpu}); however, the distribution of time between computation stages is different for each implementation (fig.~\ref{fig:breakdown-cpu-gpu}). The major time consumer in the CPU implementation is the computation of $g_1$, whereas in the GPU implementation its running time is comparable to the computation of $g_2$. The GPU computes $g_1$ much faster than the CPU due to the large number of hardware modules for transcendental mathematical function computation. In both implementations $g_2$ is computed on the CPU, but for the GPU implementation the result is duplicated for each $z$ grid point in order to perform the multiplication of all $XYZ$ planes along the $z$ dimension in a single OpenCL kernel, and is subsequently copied to GPU memory, which severely hinders the overall stage performance. Copying the resulting velocity potential field between CPU and GPU consumes $\approx{}20\%$ of the velocity potential solver execution time.

\begin{figure}
    \centering
    \includegraphics{bench-cpu-gpu}
    \caption{Performance comparison of CPU (OpenMP) and GPU (OpenCL) versions of velocity potential solver.\label{fig:bench-cpu-gpu}}
\end{figure}


\begin{figure}
    \centering
    \includegraphics{breakdown-cpu-gpu}
    \caption{Performance breakdown for GPU (OpenCL) and CPU (OpenMP) versions of velocity potential solver.\label{fig:breakdown-cpu-gpu}}
\end{figure}

\section{Discussion and future work}

The simplified velocity potential formula~\eqref{eq:phi-linear} lacks the $f_{1,2}$ functions, which contain spatial derivatives of the wavy surface $\zeta$ that are inefficient to compute on a GPU. The remaining derivative $\zeta_t$ is also computed on the CPU for the sake of efficiency. Future work is to find a high-performance algorithm for multidimensional derivative computation on GPUs.

OpenGL is still a single-threaded library, and in most graphics applications only one thread manages access to the GL context. This can lead to performance drops, thread interoperability issues and a complicated application architecture. Some workarounds exist, such as launching multiple processes and switching contexts between them, but they do not really solve the problems mentioned above. Thus, it makes sense to investigate whether a similar result can be achieved with newer APIs, such as Vulkan and DirectX 12.

Much the same can be said about the OpenCL toolkit. Despite the fact that this general GPGPU computing interface allowed us to accelerate the wave surface computation and made cross-platform execution possible, it still has some disadvantages. For example, each kernel has to be compiled and cached manually, which happens at execution time. Moreover, other compute APIs may well show even better performance.

We can also consider experimenting with dual-chip GPU boards. The main idea here is to give both the computing and the rendering context an exclusive right to use its own dedicated GPU chip at every moment of time. This improvement can help in cases where intensive rendering routines are applied to the data obtained during the computations. This way we will not lose performance and will probably achieve even better results.

The most interesting question here is what should be done when the computing capabilities of a single node are exhausted, or simply how we scale. The part of the application concerned with the computations proper can be handled in the usual way by one of the MPI implementations; this method works for both CUDA- and CPU-driven processes. Concerns appear at the stage of scaling the visualisation.

The first and simplest way to achieve the desired result is to accumulate data from the nodes on the master node after each computation round, join the results and visualise them there. However, if we do it this way, we lose the benefits of using a shared buffer for OpenCL and OpenGL. In addition, as the number of nodes increases, the bandwidth requirements of the channel grow too, and in the end we run into its limitations.

The second possible method is similar to the first one and has the same problems, but involves the use of GPUDirect or DirectGMA technology, depending on the graphics card vendor. The only difference here is that the exchange of data between nodes is based on a peer-to-peer protocol within the CUDA context, which means that the GPU receives messages directly.

The last possible way to solve this problem is to use distributed rendering techniques. They are basically a combination of load distribution and object or image composition algorithms, of which three main ones can be distinguished: sort-first, sort-middle and sort-last~\cite{molnar:sorting}.

\begin{figure}
    \centering
    \includegraphics[width=.9\linewidth]{sort-first}
    \caption{Sort-first image compositing.\label{fig:sort-first}}
\end{figure}

Sort-first compositing (fig.~\ref{fig:sort-first}) assumes that the scene is divided into a number of zones, each processed on its own hardware. In fact, several cameras/viewports are created in the OpenGL context on different nodes, and the resulting frame buffers are then simply merged into a single final frame. Two main problems that can arise here, although they can be worked around, are artefacts at the joints of frame parts when lossy compression is used, and desynchronisation of frame pieces when there is motion in the scene.

\begin{figure}
    \centering
    \includegraphics[width=.9\linewidth]{sort-middle}
    \caption{Sort-middle image compositing.\label{fig:sort-middle}}
\end{figure}

Sort-middle compositing (fig.~\ref{fig:sort-middle}) is not widely used in real-time applications because it takes extra time to produce a frame and brings network overheads, but it is the most effective one. First, each node calculates the geometry, and then the workload is redistributed equally among all computing devices.

Neither of the methods described above really suits us, since each node would process either a single frame or a part of the total surface, and we do not use viewports in such a way, at least for now.

\begin{figure}
    \centering
    \includegraphics[width=.9\linewidth]{sort-last}
    \caption{Sort-last image compositing.\label{fig:sort-last}}
\end{figure}

Actually, we are most interested in the sort-last method (fig.~\ref{fig:sort-last}). Unlike the previous two, the load distribution here is performed based on grouping and segmentation of 3D objects. Later on, alpha blending of the rendered objects is performed to obtain the resulting frame. This approach may not be optimal in some cases, since the nodes have to render parts of objects that will be overlapped at the compositing stage, but it still allows achieving the desired result.

We can even make some optimisations here. For example, to reduce the load on the network channel, we can use $N$-ary compositing: instead of joining all objects on a single node, we first exchange objects between $N$ nodes and combine them there, for example following the principle of a binary tree. We can also use various compression algorithms, which show the greatest performance for colour and numeric data.