% methods.tex --- Speedup of deep neural network learning on the MIC architecture

\section{METHODS}

\subsection{Parallel architecture code optimization}

Each Intel Xeon processor core and each Intel Xeon Phi coprocessor core contains a vector processing unit that can process 16 32-bit or 8 64-bit integers in a single processor cycle. Vectorizing the array-processing code therefore offers significant potential for accelerating the program on these parallel architectures. Vectorization was carried out with the Array Notation extension of Intel Cilk Plus, a C/C++ extension for parallel programming implemented in the Intel compiler.

In Array Notation the construction \verb=array[start_index : length]= is used to operate on an array instead of a \textit{for} loop. For example, the following code adds the i\textsuperscript{th} element of the $W_{\text{delta}}$ array to the i\textsuperscript{th} element of the $W$ array:
\begin{verbatim}
W[0:count] += Wdelta[0:count];
\end{verbatim}
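For comparison, the equivalent conventional scalar loop (a sketch using the same variable names) would be
\begin{verbatim}
for (int i = 0; i < count; ++i)
    W[i] += Wdelta[i];
\end{verbatim}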
With Array Notation it is also possible to vectorize more complex operations. The maximum element of an array section is found with the reducer \verb=__sec_reduce_max=:
\begin{verbatim}
const float max =
    __sec_reduce_max(in_vec[base:ncols]);
\end{verbatim}
The elements of an array section are summed with \verb=__sec_reduce_add=:
\begin{verbatim}
const float sumexp =
    __sec_reduce_add(in_vec[base:ncols]);
\end{verbatim}
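As an illustration of how these reducers combine (this sketch is not taken from the program; \verb=out_vec= is a hypothetical output buffer), a numerically stable soft-max over one row can be written without any explicit loop:
\begin{verbatim}
const float max =
    __sec_reduce_max(in_vec[base:ncols]);
// expf() is mapped element-wise over
// the array section
out_vec[base:ncols] =
    expf(in_vec[base:ncols] - max);
const float sumexp =
    __sec_reduce_add(out_vec[base:ncols]);
out_vec[base:ncols] /= sumexp;
\end{verbatim}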
After vectorization the code was run on a 12-core (24-thread) Intel Xeon processor. Performance increased by a factor of 14.5 compared to running the non-vectorized code on a single core.

\subsection{Porting the code to the MIC architecture}

The offload model of data transfer was used to work with the Intel Xeon Phi. In offload mode the code block marked with the \verb=#pragma offload target(mic)= directive is executed on the coprocessor, while the rest of the code is executed on the host processor. The amount of coprocessor memory to allocate for each transferred variable must be specified. Offload mode supports two data transfer models: explicit and implicit.
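As a minimal illustration (a sketch rather than code from the program; \verb=scale_on_mic=, \verb=vec= and \verb=n= are hypothetical names), offloading a single array looks as follows:
\begin{verbatim}
void scale_on_mic(float* vec, int n)
{
    // vec is copied to the coprocessor
    // before the block and copied back
    // afterwards; the block itself runs
    // on the coprocessor.
    #pragma offload target(mic:0) \
        inout(vec : length(n))
    {
        vec[0:n] *= 2.0f;
    }
}
\end{verbatim}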

\subsubsection{Explicit data transfer model}

In the explicit model the programmer specifies which variables are copied to the coprocessor and in which direction. The advantage of this model is that the code can still be compiled by compilers other than the Intel compiler: the unknown directives are simply ignored without generating errors, and the resulting program runs on the x86 architecture only.

The functions of neural network learning are called within two nested loops. The inner loop was marked for execution on the coprocessor.
     32 \begin{verbatim}
     33 while (FetchOneChunk(cpuArg, chunk)) {
     34     ...
     35     #pragma offload target (mic:0)
     36     while (FetchOneBunch(chunk, bunch)) {
     37         dnnForward (bunch);
     38         dnnBackward(bunch);
     39         dnnUpdate  (bunch);
     40     }
     41 }
     42  \end{verbatim}
     43  
One problem that we faced during optimization was that there is no simple way to transfer two-dimensional arrays to the MIC and back. We managed to do this with the help of preprocessor macros and a careful calculation of array sizes derived from source code analysis. Unfortunately, the explicit data transfer model has a drawback: it supports only a bitwise copy of data, so a structure containing pointer fields cannot be copied directly. In this program all characteristics of the neural network are contained in the \texttt{bunch} structure, which is passed as an argument to the functions sent to the coprocessor for execution, and this structure contains pointer fields. In order to copy the \texttt{bunch} structure to the coprocessor correctly, each of its fields must be copied separately and the structure then reassembled on the coprocessor.
\begin{verbatim}
#define COPY_FLOAT_ARRAY_IN(arr) \
    float* arr ## 0 = bunch.arr[0]; \
    float* arr ## 1 = bunch.arr[1]; \
    float* arr ## 2 = bunch.arr[2]; \
    float* arr ## 3 = bunch.arr[3]; \
    float* arr ## 4 = bunch.arr[4]; \
    float* arr ## 5 = bunch.arr[5]; \
    float* arr ## 6 = bunch.arr[6]

#define COPY_FLOAT_ARRAY_OUT(arr) \
    bunch.arr[0] = arr ## 0; \
    bunch.arr[1] = arr ## 1; \
    bunch.arr[2] = arr ## 2; \
    bunch.arr[3] = arr ## 3; \
    bunch.arr[4] = arr ## 4; \
    bunch.arr[5] = arr ## 5; \
    bunch.arr[6] = arr ## 6

...
// copy the pointer fields of bunch into
// plain variables that can be transferred
COPY_FLOAT_ARRAY_IN(d_W);
COPY_FLOAT_ARRAY_IN(d_B);
COPY_FLOAT_ARRAY_IN(d_Wdelta);
COPY_FLOAT_ARRAY_IN(d_Bdelta);
COPY_FLOAT_ARRAY_IN(d_Y);
COPY_FLOAT_ARRAY_IN(d_E);

#pragma offload target(mic:0) \
    mandatory \
    inout(d_W0: length( \
        bunch.dnnLayerArr[0] * \
        bunch.dnnLayerArr[1])) \
    inout(d_W1: length( \
        bunch.dnnLayerArr[1] * \
        bunch.dnnLayerArr[2])) \
    inout(d_W2: length( \
        bunch.dnnLayerArr[2] * \
        bunch.dnnLayerArr[3])) \
    inout(d_W3: length( \
        bunch.dnnLayerArr[3] * \
        bunch.dnnLayerArr[4])) \
    inout(d_W4: length( \
        bunch.dnnLayerArr[4] * \
        bunch.dnnLayerArr[5])) \
    inout(d_W5: length( \
        bunch.dnnLayerArr[5] * \
        bunch.dnnLayerArr[6])) \
    inout(d_W6: length( \
        bunch.dnnLayerArr[6] * \
        bunch.dnnLayerArr[7]))
    // plus similar inout clauses for
    // d_B, d_Wdelta, d_Bdelta, d_Y, d_E
{
    // reassemble bunch on the coprocessor
    COPY_FLOAT_ARRAY_OUT(d_W);
    // similarly for d_B, d_Wdelta,
    // d_Bdelta, d_Y, d_E

    // training loop (see previous listing)
}
\end{verbatim}
Experiments on the test dataset demonstrated that this data transfer model is not adequate for the task: the program ran only slightly faster than on a single processor core and 12 times slower than on all of the cores (Table~\ref{tab:workload-2}). It was therefore decided to use the implicit data transfer model on the coprocessor.

\subsubsection{Implicit data transfer model}

The basic principle of the implicit model is the use of memory shared between the CPU and the MIC in a common virtual address space. This method makes it possible to transfer complex data types, removing the bitwise-copy limitation of the explicit model. The program was converted as follows (a consolidated sketch of the result is given after the list):
\begin{enumerate}
\item Data was marked with the \verb=_Cilk_shared= keyword, which allows it to be allocated in the shared memory:
\begin{verbatim}
bunch.d_B[i-1] = (_Cilk_shared float*)
    _Offload_shared_malloc(size);
\end{verbatim}
\item Functions used inside the learning loop were marked as shared:
\begin{verbatim}
#pragma offload_attribute(push, \
    _Cilk_shared)
...
#pragma offload_attribute(pop)
\end{verbatim}
\item A separate function was created for the neural network learning loop so that it could be placed in shared memory:
\begin{verbatim}
_Cilk_shared void
dnn(Bunch& bunch, Chunk& chunk)
{
    while (FetchOneBunch(chunk, bunch)) {
        dnnForward (bunch);
        dnnBackward(bunch);
        dnnUpdate  (bunch);
    }
}
\end{verbatim}
\item The call to this function, executed on the coprocessor, was marked with the \verb=_Cilk_offload= keyword:
\begin{verbatim}
_Cilk_offload dnn(bunch, chunk);
\end{verbatim}
\end{enumerate}
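Putting these steps together, the converted program has the following overall structure. This is an illustrative sketch assembled from the fragments above, not a verbatim excerpt: the header name \verb=dnn.h=, the function \verb=train=, the type \verb=CpuArg= and the parameters \verb=nlayers= and \verb=size= are hypothetical.
\begin{verbatim}
// Declarations needed on both the CPU and
// the MIC are marked as shared.
#pragma offload_attribute(push, _Cilk_shared)
#include "dnn.h" // Bunch, Chunk, dnn* functions
#pragma offload_attribute(pop)

// Data shared between the CPU and the MIC.
_Cilk_shared Bunch bunch;
_Cilk_shared Chunk chunk;

// Learning loop, callable on the coprocessor.
_Cilk_shared void
dnn(Bunch& bunch, Chunk& chunk)
{
    while (FetchOneBunch(chunk, bunch)) {
        dnnForward (bunch);
        dnnBackward(bunch);
        dnnUpdate  (bunch);
    }
}

void train(CpuArg& cpuArg, int nlayers,
           size_t size)
{
    // allocate the network buffers in the
    // shared memory (the other arrays are
    // allocated similarly to d_B)
    for (int i = 1; i <= nlayers; ++i)
        bunch.d_B[i-1] = (_Cilk_shared float*)
            _Offload_shared_malloc(size);

    // outer loop on the host, inner loop
    // offloaded to the coprocessor
    while (FetchOneChunk(cpuArg, chunk))
        _Cilk_offload dnn(bunch, chunk);
}
\end{verbatim}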