\section{Experience with application containers}
\label{sec:cont}

Only light-weight virtualization technologies can be used to build efficient virtual clusters for large-scale problems. This stems from the fact that at large scale no service overhead is acceptable if it grows with the number of nodes. In the case of virtual clusters such scalable overhead comes from processor virtualization, which means that neither para-virtualized nor fully virtualized machines are suitable for large virtual clusters. This leaves only application container technologies for investigation. The other challenge is to make dynamic creation and deletion of virtual clusters take constant time.

\subsection{System configuration}

The test system comprises mostly standard components that are common in high performance computing: a distributed parallel file system, which stores home directories with experiments' input and output data, and a cluster resource scheduler, which allocates resources for jobs and for client programs that pre- and post-process data; the only non-standard component is a network-attached storage exporting containers' root file systems as directories. Linux Containers (LXC) provide containerization, \mbox{GlusterFS} provides the parallel file system and TORQUE provides task scheduling. The most recent CentOS Linux 7 was chosen because it ships a stable version of LXC (greater than 1.0) and a kernel that supports all container features. Due to the limited number of nodes, each node acts as both a compute and a storage node, and every file in the parallel file system is stored on exactly two nodes. Detailed hardware characteristics and software version numbers are listed in Table~\ref{fig:system-configuration}.
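
As an illustration of this setup, the following sketch shows how the replicated home volume might be created with the \mbox{GlusterFS} command-line tools, driven from Python; the node names, brick path and volume name are placeholders rather than the actual values used in the testbed.

\begin{verbatim}
# Hypothetical helper: create the replicated GlusterFS volume that
# stores home directories (every file kept on exactly two bricks).
import subprocess

nodes  = ['node%02d' % i for i in range(1, 13)]   # 12 compute/storage nodes
bricks = ['%s:/export/home-brick' % n for n in nodes]

# one replicated volume, replica count 2
subprocess.check_call(['gluster', 'volume', 'create', 'home',
                       'replica', '2', 'transport', 'tcp'] + bricks)
subprocess.check_call(['gluster', 'volume', 'start', 'home'])

# each node then mounts the volume, e.g. (run on every node):
#   mount -t glusterfs localhost:/home /home
\end{verbatim}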

%Creating virtual cluster in such environment requires the following steps. First, a client submits a task requesting particular number of cores. Then according to distribution of these cores among compute nodes a container is started on each node from the list with SSH daemon as the only program running inside it. Here there are two options: either start containers with network virtualization (using \textit{macvlan} or \textit{veth} LXC network type) and generate sufficient number of IP addresses for the cluster or use host network name space (\textit{none} LXC network type) and generate only the port number to run ssh daemon on. The next step is to copy (possibly amended) node file from host into the first container and launch submitted script inside it. When the script finishes its work SSH daemon in every container is killed and all containers are destroyed.

\begin{table}[h]

\caption{Hardware and software components of the system.}
\label{fig:system-configuration}

\begin{tabular}{ll}
    \toprule
    Component & Details \\
    \midrule
    CPU model & Intel Xeon E5440 \\
    CPU clock rate (GHz) & 2.83 \\
    No. of cores per CPU & 4 \\
    No. of CPUs per node & 2 \\
    RAM size (GB) & 4 \\
    Disk model & ST3250310NS \\
    Disk speed (rpm) & 7200 \\
    No. of nodes & 12 \\
    Interconnect speed (Gbps) & 1 \\
    \addlinespace
    Operating system & CentOS 7 \\
    Kernel version & 3.10 \\
    LXC version & 1.0.5 \\
    \mbox{GlusterFS} version & 3.5.1 \\
    TORQUE version & 5.0.0 \\
    \mbox{OpenMPI} version & 1.6.4 \\
    IMB version & 4.0 \\
    \mbox{OpenFOAM} version & 2.3.0 \\
    \bottomrule
\end{tabular}
\vspace{10pt}

\end{table}

%For this algorithm to work as intended client's home directory should be bind-mounted inside the container before launching the script. Additionally since some MPI programs require \textit{scratch} directories on each node to work properly, container's root file system should be mounted in copy-on-write mode, so that all changes in files and all the new files are written to host's temporary directory and all unchanged data is read from read-only network-mounted file system; this can be accomplished via UnionFS or similar file system and that way application containers are left untouched by tasks running on the cluster.

To summarize, only standard Linux tools are used to build the system: there are no opaque virtual machine images, no sophisticated full virtualization appliances and no heavy-weight cloud computing stacks in this configuration.

\subsection{Evaluation}

To test the resulting configuration, \mbox{OpenMPI} together with the Intel MPI Benchmarks (IMB) was used to measure network throughput, and \mbox{OpenFOAM} was used to measure overall performance on a real-world application.

The first experiment was to create a virtual cluster, launch an empty MPI program (with \textit{/bin/true} as the executable file) and compare the execution time to that of the ordinary physical cluster. To set this experiment up, the same operating system and the same version of \mbox{OpenMPI} as on the host machine were installed in the container. No network virtualization was used, each run was repeated several times and the average is displayed on the graph (Figure~\ref{fig:ping-1}). The results show that a constant overhead of 1.5 seconds is added to every LXC run after the 8\textsuperscript{th} core: one second is attributed to the absence of a file cache inside the container holding SSH configuration files, key files and libraries, and the other half a second is attributed to the creation of containers, as shown in Figure~\ref{fig:ping-2}. The jump after the 8\textsuperscript{th} core marks the boundary of a single machine, i.e.\ the switch from shared memory to network communication. The creation of containers is a fully parallel task and takes approximately the same time to complete for different numbers of nodes. The overhead of destroying containers was found to be negligible and was combined with the \textit{mpirun} time. So, the usage of Linux containers adds a constant overhead to the launch of a parallel task, which depends on the system's configuration and is split between the creation of containers and the filling of the file cache.
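
The three phases measured here can be expressed as a small driver script. The sketch below is a simplification rather than the actual scripts used in the experiments: the host names, container name and node file are placeholders, and in the real setup \textit{mpirun} reaches the containers through the SSH daemons running inside them.

\begin{verbatim}
# Hypothetical driver that mimics the measured phases: parallel
# container start-up, the MPI run itself and container destruction.
import subprocess, time
from multiprocessing.pool import ThreadPool

nodes = ['node01', 'node02']   # allocated by the batch scheduler
container = 'compute'          # pre-created LXC container on every node

def ssh(node, *cmd):
    subprocess.check_call(['ssh', node] + list(cmd))

def timed(label, fn):
    t0 = time.time()
    fn()
    print('%s: %.2f s' % (label, time.time() - t0))

pool = ThreadPool(len(nodes))
# phase 1: start one container per node (fully parallel)
timed('create', lambda: pool.map(
    lambda n: ssh(n, 'lxc-start', '-d', '-n', container), nodes))
# phase 2: the parallel job itself (an empty program in this experiment)
timed('mpirun', lambda: subprocess.check_call(
    ['mpirun', '--hostfile', 'nodefile', '/bin/true']))
# phase 3: stop the containers (overhead found to be negligible)
timed('destroy', lambda: pool.map(
    lambda n: ssh(n, 'lxc-stop', '-n', container), nodes))
\end{verbatim}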

\begin{figure}
\includegraphics{ping-1}
\caption{Comparison of LXC and physical cluster performance running an empty MPI program.}
\vspace{20pt}
\label{fig:ping-1}
\end{figure}

\begin{figure}
\includegraphics{ping-2}
\caption{Breakdown of LXC empty MPI program run.}
\vspace{50pt}
\label{fig:ping-2}
\end{figure}

%The second experiment was to measure performance of different LXC network types using IMB suite and it was found that the choice of network virtualization greatly affects performance. As in the previous test container was set up with the same operating system and the same IMB executables as the host machine. Network throughput was measure with \textit{exchange} benchmark and displayed on the graph (Figure~\ref{fig:imb-1}). From the graph it is evident that until 214 bytes message size the performance is approximately the same for all network types, however, after this mark there is a dip in performance of virtual ethernet. It is difficult to judge where this overhead comes from: some studies report that under high load performance of bridged networking (\textit{veth} is always connected to the bridge) is decreased~\cite{james2004performance}, but IMB does not have high load on the system. Additionally, the experiment showed that as expected throughput decreases with the number of cores due to synchronization overheads (Figure~\ref{fig:imb-2}).

Another experiment dealt with real-world application performance, and for this role \mbox{OpenFOAM} was chosen as a complex parallel task involving a large amount of network communication, disk I/O and high CPU load. The dam break RAS case was run with different numbers of cores (the total number of cores is the square of the number of cores per node) and different LXC network types, and the average of multiple runs is displayed on the graph (Figure~\ref{fig:openfoam-1}). Measurements for 4 and 9 cores were discarded because the execution time varies considerably for these core counts even on physical machines. The graph shows that the low performance of virtual Ethernet (\textit{veth}), which also exhibited reduced throughput in the IMB runs, decreased the final performance of \mbox{OpenFOAM} by approximately 5--10\%, whereas \textit{macvlan} and \textit{none} performance is close to the performance of the physical cluster (Figure~\ref{fig:openfoam-2}). Thus, the choice of network type is the main factor affecting the performance of parallel applications running on virtual clusters, and its overhead can be eliminated by using the \textit{macvlan} network type or by not using network virtualization at all. More experimental results are presented in~\cite{gankevich-ondemand2015}.
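
The network types compared above are selected in the per-container LXC configuration. The sketch below is illustrative only: it generates the network-related fragment of an LXC~1.x container configuration for each of the three cases, with the host interface and bridge names as placeholders.

\begin{verbatim}
# Hypothetical generator for the network part of an LXC 1.x config.
# 'none' reuses the host network namespace, 'macvlan' attaches to the
# physical interface directly, 'veth' goes through a Linux bridge.
def network_config(net_type, host_iface='eth0', bridge='br0'):
    if net_type == 'none':
        return ['lxc.network.type = none']
    if net_type == 'macvlan':
        return ['lxc.network.type = macvlan',
                'lxc.network.macvlan.mode = bridge',
                'lxc.network.link = ' + host_iface,
                'lxc.network.flags = up']
    if net_type == 'veth':
        return ['lxc.network.type = veth',
                'lxc.network.link = ' + bridge,  # veth is always bridged
                'lxc.network.flags = up']
    raise ValueError('unknown network type: ' + net_type)

if __name__ == '__main__':
    for t in ('none', 'macvlan', 'veth'):
        print('# ' + t)
        print('\n'.join(network_config(t)))
\end{verbatim}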

\begin{figure}
\includegraphics{openfoam-1}
\caption{Average performance of \mbox{OpenFOAM} with different LXC network types.}
\vspace{20pt}
\label{fig:openfoam-1}
\end{figure}

\begin{figure}
\includegraphics{openfoam-2}
\caption{Difference of \mbox{OpenFOAM} performance on physical and virtual clusters. Negative numbers show slowdown of virtual cluster.}
\vspace{50pt}
\label{fig:openfoam-2}
\end{figure}

To summarize, there are two main types of overhead when using a virtual cluster: creation overhead, which is constant and small compared to the average duration of a typical parallel job, and network overhead, which can be eliminated by not using network virtualization at all.

\subsection{Application containers with Docker}
The next step in using containers for building virtual clusters is to apply automation and management tools that ease the deployment and handling of virtual clusters. We investigated the capabilities provided by several modern tools (Docker, Mesos, Mininet) to model and build virtualized computational infrastructure, studied configuration management in the integrated environment and evaluated the performance of the infrastructure tuned to a particular test application. Docker -- a lightweight and powerful open-source container virtualization technology which we use to manage containers -- has a resource management system, so it is possible to test different configurations: from ``slow network and slow CPUs'' to ``fast network and fast CPUs''.

Even though container-based virtualization is easy to run and use, it is often not easy or user-friendly to scale a configuration or to limit resources. This is where Apache Mesos~\cite{mesos} and Mesosphere Marathon~\cite{marathon} were used. Apache Mesos abstracts CPU, memory, storage and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to be built and run effectively. At a high level, Mesos is a cluster management platform that combines servers into a shared pool from which applications or frameworks such as Hadoop, Jenkins, Cassandra, ElasticSearch and others can draw. Marathon is a Mesos framework for long-running services such as web applications, long computations and so on.
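
To illustrate how such a resource-limited configuration can be described, the sketch below submits a Docker-based application to Marathon's REST API with explicit CPU and memory limits; the Marathon address, image, command and limit values are placeholders rather than the settings used in our experiments.

\begin{verbatim}
# Hypothetical submission of a resource-limited Docker application to
# Marathon via its REST API (/v2/apps); all values below are placeholders.
import json
try:
    from urllib.request import Request, urlopen   # Python 3
except ImportError:
    from urllib2 import Request, urlopen          # Python 2

app = {
    'id': '/virtual-cluster/worker',
    'instances': 4,              # one container per allocated slot
    'cpus': 1.0,                 # "slow CPUs" vs "fast CPUs" tuned here
    'mem': 512,                  # MiB
    'container': {
        'type': 'DOCKER',
        'docker': {'image': 'centos:7', 'network': 'HOST'},
    },
    'cmd': '/usr/sbin/sshd -D',  # keep the container running
}

req = Request('http://marathon.example.org:8080/v2/apps',
              data=json.dumps(app).encode('utf-8'),
              headers={'Content-Type': 'application/json'})
print(urlopen(req).read())
\end{verbatim}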

Figure~\ref{fig:npb-tests} shows the experimental results for the execution of the NAS Parallel Benchmarks (NPB) suite on different configurations of the virtual testbed (Class W: workstation size; Classes A and B: standard test problems, with an approximately 4X size increase from one class to the next). The NPB results vary considerably and depend on the benchmark type. For example, in the SP test the smaller system of nonlinear PDEs yielded more Mop/s than the bigger one, whereas in the LU test the results for smaller matrices are worse than for bigger matrices.
\begin{figure}
\centering
\includegraphics[width=8cm]{fig-tests1.pdf}
\caption{Performance of different tests from the NAS Parallel Benchmarks suite on different configurations.}
\vspace{50pt}
\label{fig:npb-tests}
\end{figure}