Edit pp. - arma-thesis

commit 1f281aad868d1ca88fecb334ae489118e9db8937
parent ba6298ef857f9c7189cf79b7a141316c6aacd8be
Author: Ivan Gankevich <igankevich@ya.ru>
Date:   Tue, 31 Oct 2017 19:17:09 +0300

Edit pp.

Diffstat:
arma-thesis-ru.org  | 121 ++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------
arma-thesis.org  | 54 ++++++++++++++++++++++++++++--------------------------

2 files changed, 108 insertions(+), 67 deletions(-)
diff --git a/arma-thesis-ru.org b/arma-thesis-ru.org
@@ -1942,6 +1942,86 @@ MPP части, от которых зависит данная, должны б
 | GL, GLUT\nbsp{}cite:kilgard1996opengl                        | three-dimensional visualisation  |
 | CGAL\nbsp{}cite:fabri2009cgal                                | wave number triangulation        |
 
+The AR model exhibits the highest performance in the OpenMP implementation and
+the lowest in the OpenCL implementation, which are also the highest and the
+lowest performance across all model and framework combinations. In the most
+favourable combination AR performance is 4.5 times higher than MA performance
+and 20 times higher than LH performance; in the least favourable combination
+it is 137 times slower than MA and two times slower than LH. The ratio between
+the highest (OpenMP) and the lowest (OpenCL) performance of the AR model is
+several hundred. This is explained by the fact that the model
+formula\nbsp{}eqref:eq-ar-process maps efficiently onto the CPU architecture,
+which differs from the GPU architecture in having multiple caches,
+low-bandwidth memory and a small number of floating point
+units.
+- This formula does not contain transcendental functions (sines, cosines and
+  exponentials),
+- all multiplications and additions in the formula are implemented using FMA
+  processor instructions, and
+- efficient cache use (locality) is achieved by using the Blitz
+  library, which implements optimised traversal of multidimensional array
+  elements based on the Hilbert space-filling curve.
+In contrast to the CPU, the GPU has fewer caches,
+high-bandwidth memory and a large number of
+floating point units, which is the least favourable scenario for the
+AR model.
+- The AR model formula contains no transcendental functions that could
+  compensate for high memory latency,
+- the GPU has FMA instructions, but they do not improve
+  performance due to high memory latency, and
+- optimal traversal of the multidimensional array was not used due to a lack
+  of libraries implementing it for the GPU.
+Finally, the GPU architecture lacks synchronisation primitives that would allow
+efficient implementation of autoregressive dependencies between distinct parts
+of the wavy surface; instead, a separate OpenCL kernel is
+launched for each part, and the dependencies between them are
+managed on the CPU side. Thus, for the AR
+model the CPU architecture is superior to the GPU
+architecture because it handles complex information
+dependencies, simple calculations (additions and multiplications) and complex
+memory access patterns more efficiently.
+
+#+name: tab-arma-performance
+#+begin_src R :results output org :exports results
+source(file.path("R", "benchmarks.R"))
+model_names <- list(
+	ar.x="AR",
+	ma.x="MA",
+	lh.x="LH",
+	ar.y="AR",
+	ma.y="MA",
+	lh.y="LH",
+  Row.names="\\orgcmidrule{2-4}{5-6}Subroutine"
+)
+row_names <- list(
+  determine_coefficients="Determine coefficients",
+  validate="Validate model",
+  generate_surface="Generate surface",
+  nit="NIT",
+  write_all="Write output to file",
+  copy_to_host="Copy data from GPU",
+  velocity="Compute velocity potentials"
+)
+arma.print_openmp_vs_opencl(model_names, row_names)
+#+end_src
+
+#+name: tab-arma-performance
+#+caption: Running time (s.) of OpenMP and OpenCL implementations for AR, MA and LH models.
+#+attr_latex: :booktabs t
+#+RESULTS: tab-arma-performance
+#+BEGIN_SRC org
+|                                  |      |      | OpenMP |        | OpenCL |
+| \orgcmidrule{2-4}{5-6}Subroutine |   AR |   MA |     LH |     AR |     LH |
+|----------------------------------+------+------+--------+--------+--------|
+| Determine coefficients           | 0.02 | 0.01 |   0.19 |   0.01 |   1.19 |
+| Validate model                   | 0.08 | 0.10 |        |   0.08 |        |
+| Generate surface                 | 1.26 | 5.57 | 350.98 | 769.38 |   0.02 |
+| NIT                              | 7.11 | 7.43 |        |   0.02 |        |
+| Copy data from GPU               |      |      |        |   5.22 |  25.06 |
+| Compute velocity potentials      | 0.05 | 0.05 |   0.06 |   0.03 |   0.03 |
+| Write output to file             | 0.27 | 0.27 |   0.27 |   0.28 |   0.27 |
+#+END_SRC
+
 **** I/O performance.
 **** Parallel computation of the velocity potential field.
 **** Performance of the OpenCL solver that computes the velocity potential field.
@@ -2055,47 +2135,6 @@ Mathematica\nbsp{}cite:mathematica10, а на втором этапе логик
 | 760000 |   1.56 |  76.86 | 61.41 |   3.47 |  0.156 | 0.155 |
 | 800000 |   1.64 |  81.03 | 66.42 |   3.25 |  0.166 | 0.174 |
 
-#+name: tab-arma-performance
-#+begin_src R :results output org :exports results
-source(file.path("R", "benchmarks.R"))
-model_names <- list(
-	ar.x="AR",
-	ma.x="MA",
-	lh.x="LH",
-	ar.y="AR",
-	ma.y="MA",
-	lh.y="LH",
-  Row.names="\\orgcmidrule{2-4}{5-6}Subroutine"
-)
-row_names <- list(
-  determine_coefficients="Determine coefficients",
-  validate="Validate model",
-  generate_surface="Generate surface",
-  nit="NIT",
-  write_all="Write output to file",
-  copy_to_host="Copy data from GPU",
-  velocity="Compute velocity potentials"
-)
-arma.print_openmp_vs_opencl(model_names, row_names)
-#+end_src
-
-#+name: tab-arma-performance
-#+caption: Running time (s.) of OpenMP and OpenCL implementations for AR, MA and LH models.
-#+attr_latex: :booktabs t
-#+RESULTS: tab-arma-performance
-#+BEGIN_SRC org
-|                                  |      |      | OpenMP |        | OpenCL |
-| \orgcmidrule{2-4}{5-6}Subroutine |   AR |   MA |     LH |     AR |     LH |
-|----------------------------------+------+------+--------+--------+--------|
-| Determine coefficients           | 0.02 | 0.01 |   0.19 |   0.01 |   1.19 |
-| Validate model                   | 0.08 | 0.10 |        |   0.08 |        |
-| Generate surface                 | 1.26 | 5.57 | 350.98 | 769.38 |   0.02 |
-| NIT                              | 7.11 | 7.43 |        |   0.02 |        |
-| Copy data from GPU               |      |      |        |   5.22 |  25.06 |
-| Compute velocity potentials      | 0.05 | 0.05 |   0.06 |   0.03 |   0.03 |
-| Write output to file             | 0.27 | 0.27 |   0.27 |   0.28 |   0.27 |
-#+END_SRC
-
 Besides the choice of parallel computing standard, program running time is
 affected by the choice of libraries of standard computational methods, and the
 efficiency of these libraries has been shown by their developers' tests. As a library for
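
Reviewer note on the figures in the paragraph added above: the quoted ratios can be cross-checked against the surface generation row of tab-arma-performance. The sketch below is only an arithmetic check, not code from the repository. The `Timings` struct and the `slowdown` helper are hypothetical names, and pairing the 20-fold figure with the LH OpenCL generation time plus the GPU copy time is an assumption, since the text does not say which cells it compares.

```cpp
#include <cassert>

// Timings in seconds copied from tab-arma-performance.
// The struct and its field names are hypothetical, for illustration only.
struct Timings {
    double ar_openmp = 1.26;    // AR, OpenMP, surface generation
    double ma_openmp = 5.57;    // MA, OpenMP, surface generation
    double lh_openmp = 350.98;  // LH, OpenMP, surface generation
    double ar_opencl = 769.38;  // AR, OpenCL, surface generation
    double lh_opencl = 0.02;    // LH, OpenCL, surface generation
    double lh_copy   = 25.06;   // LH, OpenCL, copying data from GPU
};

// How many times slower `slow` is than `fast`.
inline double slowdown(double slow, double fast) {
    assert(fast > 0.0);
    return slow / fast;
}
```

With `Timings t;`, `slowdown(t.ma_openmp, t.ar_openmp)` is about 4.4, `slowdown(t.lh_opencl + t.lh_copy, t.ar_openmp)` is about 19.9, `slowdown(t.ar_opencl, t.ma_openmp)` is about 138, `slowdown(t.ar_opencl, t.lh_openmp)` is about 2.2, and `slowdown(t.ar_opencl, t.ar_openmp)` is about 611. Under the stated assumption these agree with the rounded 4.5, 20, 137, two and "several hundred" in the text.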
diff --git a/arma-thesis.org b/arma-thesis.org
@@ -1911,38 +1911,40 @@ only a small fraction of it.
 | GL, GLUT\nbsp{}cite:kilgard1996opengl                        | three-dimensional visualisation |
 | CGAL\nbsp{}cite:fabri2009cgal                                | wave numbers interpolation      |
 
-AR model exhibits the best performance in OpenMP and the worst performance in
-OpenCL implementations, which is also the best and the worst performance across
-all model and framework combinations. In the best model and framework
-combination AR performance is 4.5 times higher than MA performance, and 20 times
-higher than LH performance; in the worst combination\nbsp{}--- 137 times slower
-than MA and 2 times slower than LH. The ratio between the best (OpenMP) and the
-worst (OpenCL) AR model performance is several hundreds. This is explained by
-the fact that the model formula\nbsp{}eqref:eq-ar-process is efficiently mapped
-on the CPU architecture with caches, low-bandwidth memory and small number of
-floating point units:
-- it contains no transcendental mathematical functions (sines, cosines and
-  exponents),
+The AR model exhibits the highest performance in the OpenMP implementation and
+the lowest in the OpenCL implementation, which are also the best and the worst
+results across all model and framework combinations. In the best model and
+framework combination AR performance is 4.5 times higher than MA performance,
+and 20 times higher than LH performance; in the worst combination it is
+137 times slower than MA and two times slower than LH. The
+ratio between the best (OpenMP) and the worst (OpenCL) AR model performance is
+several hundred. This is explained by the fact that the model
+formula\nbsp{}eqref:eq-ar-process maps efficiently onto the CPU architecture,
+which is distinguished from the GPU architecture by having multiple caches,
+low-bandwidth memory and a small number of floating point units.
+- This formula does not contain transcendental mathematical functions (sines,
+  cosines and exponentials),
 - all of the multiplications and additions in the formula can be implemented
-  using FMA processor instructions,
-- and cache locality is achieved by using Blitz library which implements
-  optimised traversals for multidimensional arrays based on Hilbert
+  using FMA processor instructions, and
+- efficient cache use (locality) is achieved by using the Blitz library, which
+  implements optimised multidimensional array traversals based on the Hilbert
   space-filling curve.
 In contrast to CPU, GPU has less number of caches, high-bandwidth memory and
 large number of floating point units, which is the worst case scenario for AR
 model:
-- there are no transcendental functions which compensate high memory latency,
-- there are FMA instructions but they do not improve performance due to high
+- there are no transcendental functions which could compensate for high memory
   latency,
-- and optimal traversal was not used due to a lack of libraries implementing it
-  for a GPU.
-Finally, GPU does not have synchronisation primitives that allow to implement
-autoregressive dependencies between distinct wavy surface parts to compute them
-in parallel, and instead of this processor launches a separate OpenCL kernel for
-each part, controlling all the dependencies between them using CPU. So, for AR
-model CPU architecture is superior compared to GPU due to better handling of
-complex information dependencies, non-intensive calculations (multiplications
-and additions) and complex memory access patterns.
+- the GPU has FMA instructions, but they do not improve performance due to
+  high memory latency, and
+- optimal traversal of multidimensional arrays was not used due to a lack of
+  libraries implementing it for a GPU.
+Finally, the GPU architecture lacks synchronisation primitives that would allow
+efficient implementation of autoregressive dependencies between distinct wavy
+surface parts; instead, a separate OpenCL kernel is launched for each part, and
+dependencies between them are managed on the CPU side. So, for the AR model the
+CPU architecture is superior to the GPU due to better handling of complex
+information dependencies, simple calculations (multiplications and additions)
+and complex memory access patterns.
 
 #+header: :results output raw :exports results
 #+name: tab-arma-performance
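
Reviewer note on the FMA and dependency arguments in the paragraph above: a minimal sketch may help show why the AR formula suits a CPU. It assumes a one-dimensional autoregressive recurrence, whereas the thesis formula eq-ar-process is multidimensional, and it is not taken from the repository. The function `ar_generate` and its parameters are hypothetical names. Each coefficient contributes one multiply and one add, which `std::fma` expresses as a single fused operation.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One-dimensional autoregressive recurrence (a simplifying assumption):
//   zeta[i] = eps[i] + sum_{j=1..p} phi[j-1] * zeta[i-j]
// Function and parameter names are hypothetical.
std::vector<double> ar_generate(const std::vector<double>& phi,
                                const std::vector<double>& eps) {
    assert(!phi.empty());
    std::vector<double> zeta(eps.size());
    for (std::size_t i = 0; i < eps.size(); ++i) {
        double sum = eps[i];  // start from the white noise term
        // Points before the start of the series are treated as zero.
        const std::size_t order = std::min(phi.size(), i);
        for (std::size_t j = 1; j <= order; ++j) {
            // One fused multiply-add per coefficient: sum = phi * zeta + sum.
            sum = std::fma(phi[j - 1], zeta[i - j], sum);
        }
        zeta[i] = sum;
    }
    return zeta;
}
```

Every output point reads points computed earlier in the same loop, so the outer loop cannot simply be split across independent threads. This is the kind of dependency that, according to the text, is managed on the CPU side when each surface part is launched as a separate OpenCL kernel.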
