% Using Virtualisation for Reproducible Research and Code Portability
\section{Introduction}

%Verification of research is one of the pressing issues in modern science. In the natural sciences, reproducing a study often requires lengthy preparation, the purchase of expensive apparatus (if a different group does the work), consumables and so on. In computer science the situation is simpler. If an experiment does not require specific hardware (expensive servers, specific architectures) and only requires a properly built software environment, the environment can be created once, then saved and deployed whenever a new experiment is run.

%the ability to repeat an experiment quickly and easily also raises the popularity of the paper and the level of trust in the author.
%ways to present one's work (algorithm) clearly: pseudocode, diagrams, UML schemes, a link to an open-source repository,
%how it is usually done (a repository with the paper + pictures. how were they obtained????? org-mode, a script to reproduce the picture (for example in R)/makefile)
%how to test on data? the environment has to be reproduced => OS image + platform + required software and its configs

Research reproducibility is an emerging topic in computer science~\cite{leveque2006wave,davison2014sumatra}. Although repeating a research work in computer science is often easier than in other sciences (one needs only a decent computer and the source code), it may take a considerable amount of time to fully configure the platform: set up a virtual or physical cluster, install compatible versions of the operating system, software libraries and tools, and compile and run the source code of the research work. Not only is the source code accompanying a paper rarely published, but it also requires a certain platform configuration to compile and run.

Reproducing a research work involves several stages, and ideally each of them should have a tool that automates it:
\begin{itemize}
    \item hardware stage (finding the required hardware),
    \item operating system stage (installing a compatible operating system),
    \item software stage (compiling and executing the programme),
    \item graphical stage (gathering statistics from programme runs and plotting graphs),
    \item publication stage (writing and publishing the paper with all the data, graphs and the source code included).
\end{itemize}

In this proposal we deal with the operating system and software stages: we automate the creation of an environment in which to compile and run the programme. For this purpose we use lightweight virtualisation technologies (Linux namespaces), taking as an example a distributed batch processing programme that runs on a cluster of nodes and processes data in parallel. Our tool, called \emph{Collector}, creates a root file system with the specified version of a Linux distribution, the compiler and all the dependent packages. Then it compiles and runs the source code inside this virtual environment. The resulting root file system is portable across platforms with the same processor architecture and a compatible kernel version.

Using a raw file system instead of an opaque operating system image has several advantages:
\begin{itemize}
    \item It is portable: it can be stored as is or in an archive, and converted to and from any OS image format.
    \item It can be mounted over the cluster network on any number of cluster nodes, and used concurrently by several parallel processes via a Union/Overlay file system.
    \item It can be patched or upgraded directly by changing the current root directory to the path of the raw file system.
\end{itemize}
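
The first advantage can be illustrated with a short Python sketch. It is not part of Collector: the function names \texttt{pack\_rootfs} and \texttt{unpack\_rootfs} are hypothetical, and the sketch assumes a directory tree of regular files. A real root file system also contains symbolic links, device nodes and ownership information that need extra care when archiving.

```python
import tarfile
from pathlib import Path


def pack_rootfs(rootfs: Path, archive: Path) -> None:
    """Store the directory tree `rootfs` in a gzip-compressed tar archive."""
    with tarfile.open(archive, "w:gz") as tar:
        # arcname="." keeps member paths relative, so the archive can be
        # unpacked into any directory on another machine.
        tar.add(rootfs, arcname=".")


def unpack_rootfs(archive: Path, target: Path) -> None:
    """Recreate the archived directory tree under `target`."""
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(target)
```

The archive should be written outside the directory being packed, otherwise it would be added to itself.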

The objective of the study reported here is to develop a tool that automates
the creation of such portable environments and makes building particular source
code inside them repeatable regardless of the underlying operating system. This is
the first publication on this tool.

%One of the problems in science is the verification of research. In many fields reproducing research requires time, resources, materials and much else. Computer science has an advantage here: we do not need chemical reagents or expensive equipment. The following parts can be distinguished: hardware configuration, software environment and useful data. The hardware can be very complex and unique, but in simpler cases virtualisation can be used to simulate the required platform. The possibility of reproducing research increases interest in it and the level of confidence in it.
%Computer science research is the easiest to reproduce. You can use virtualisation to simulate the required configuration and take the source code from a repository.

%There are traditional ways to present an idea or a development clearly. Pseudocode and flowcharts are good for demonstrating basic algorithms, but they should be short and cannot show details that depend on features of a particular programming language. Diagrams and UML schemes are good for showing relations between parts of a programme, hardware, etc. Yet another option is a link to the source code, but you will also need to configure and run it, and in the process you may run into problems with the configuration of the operating system and software. Graphs, plots and pictures improve the presentation of any research. There are many ways to generate them, for example with external software such as gnuplot. Can you trust these pictures? It is more correct to generate plots with a scripting language, for example R or Python. Using them together with org-mode allows all plots to be reproduced, because they can be generated from the text alone. This is one of the features of org-mode.

%Org-mode is suitable for reproducing plots from scripts, but it does not allow running tests that need special software and a lot of data. We present a solution for this case. Our tool saves the data and the environment (software and its dependencies) in a virtual space. On any similar operating system you can run the programme, which creates a virtual namespace and runs all the tests in it, without root privileges and without leaving a trace. The main environment stays intact.

\section{Related work}

%org-mode
The topic of research reproducibility is active not only in computer science, but also in statistics. For example, in~\cite{schulte2012multi} the authors propose to use Org-mode, a plain text markup language, to embed the source code of the research directly in the text of the paper and execute the code on every document export to produce tables and graphs. Although the system is capable of running arbitrary scripts, it is impractical to include any real C/C++/Fortran source code, as it is generally large compared to the code that produces graphs and requires certain libraries and compilers to build and run. So, Org-mode support for reproducing graphs and tables is limited to relatively small programmes written in high-level languages (R/Python/Graphviz) that take input data produced elsewhere and generate a graph or a table.

%reproducible research and parallel computing
In~\cite{hunold2013state} the authors discuss the importance of research reproducibility in parallel computing for improving the trustworthiness of experiments. The issues that prevent the wide adoption of reproducible research practice include
\begin{itemize}
    \item the impossibility of reproducing research if someone uses unique hardware,
    \item publishing rules and agreements,
    \item the impossibility of obtaining the required version of software.
\end{itemize}
So, external rules and regulations may prevent publishing the whole paper together with the source code, but they may not prevent publishing the gathered data and the source code that produces the numbers.

%common storage of source code and its presentation in HDF5 format
Another idea is that it is not the source code that should be included in a scientific paper; rather, the data, the programme code and the presentation of the research may be stored together in a single file. This approach was explored in~\cite{hinsen2011data}, where the authors suggest using the Java Virtual Machine (JVM) to execute bytecode and the HDF5 file format to store all the experimental data, source code and scripts for generating tables, plots and figures. Potential problems are the use of programming languages that are not supported by the JVM (C/C++/Fortran) and the storage of large datasets in an HDF5 file.

%wave propagation software

\section{Collector tool}

%we take only the OS!!! (well, and the platform too). we rely on the repository of a particular OS
%platform: linux namespaces (create environment), cgroups (configure environment + virtual network)

All computer science research works can be divided into two broad categories. On one side there are experiments with software or algorithms. In this kind of research the most valuable part is the source code and the configuration of the execution environment, which usually consists of an operating system, a processor architecture and software packages. On the other side there are experiments with the configuration of compute nodes and the cluster network. To store and later reproduce the operating system and execution environment we propose to use Collector, a programme that builds C/C++/Fortran source code by downloading and installing system packages in a separate root file system directory without super user privileges.

The task is accomplished by instantiating new mount and user Linux namespaces in which the original user is mapped to the super user. After that a new process is launched that has all super user privileges inside these namespaces and installs the packages specified in the configuration file into the specified root file system directory. Finally, the current file system root is moved to this directory, the directory with the source code is mapped from the original root file system to the new one, and the code is compiled and run inside it.
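
The sequence of steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not Collector's implementation: the function names are hypothetical, \texttt{os.unshare} (available from Python 3.12) stands in for creating the namespaces, the kernel is assumed to permit unprivileged user namespaces, and mapping the source code directory into the new root (a bind mount) is omitted.

```python
import os


def id_map_line(inside_id: int, outside_id: int, count: int = 1) -> str:
    """Format one line of /proc/<pid>/uid_map or gid_map.

    The kernel expects "<id inside namespace> <id outside> <range length>".
    """
    return f"{inside_id} {outside_id} {count}\n"


def enter_root_namespaces() -> None:
    """Become the super user inside new user and mount namespaces.

    Assumes Python 3.12+ and a kernel that allows unprivileged user
    namespaces; raises OSError otherwise.
    """
    uid, gid = os.getuid(), os.getgid()  # ids in the original namespace
    os.unshare(os.CLONE_NEWUSER | os.CLONE_NEWNS)
    # An unprivileged process must disable setgroups before writing gid_map.
    with open("/proc/self/setgroups", "w") as f:
        f.write("deny")
    # Map the original user and group to id 0 inside the namespace.
    with open("/proc/self/uid_map", "w") as f:
        f.write(id_map_line(0, uid))
    with open("/proc/self/gid_map", "w") as f:
        f.write(id_map_line(0, gid))


def run_in_rootfs(rootfs: str, argv: list[str]) -> None:
    """Change the root directory to `rootfs` and execute `argv` there."""
    enter_root_namespaces()
    os.chroot(rootfs)
    os.chdir("/")
    os.execvp(argv[0], argv)  # replaces the current process on success
```

Because \texttt{run\_in\_rootfs} replaces the calling process, a caller that needs to continue afterwards would invoke it in a child process.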

%\footnote{\texttt{CLONE\_NEWNS} and \texttt{CLONE\_NEWUSER} flags of \texttt{clone(2)} system call.}

%We suggest Collector for quickly launching an application in any Linux environment. Lacking root privileges is not a problem. The application creates a virtual namespace, loads the required sources and dependencies (RPM packages) and runs the task. How does it work? We use the \texttt{clone(2)} Linux system call to create a new virtual environment. You pass it the function that will execute in the new namespace. Namespaces are a mechanism of the Linux kernel that supports process isolation. It makes it possible to maintain several independent trees of processes. The \texttt{clone(2)} system call with the \texttt{CLONE\_NEWPID} flag is used to create a new namespace. Use the \texttt{CLONE\_NEWUSER} flag if you want to have root privileges in the new namespace. A separate file system may be created using the \texttt{chroot(2)} system call, which changes the root directory to a directory of your choice.

On the first run Collector downloads and installs the system packages specified in the configuration file from the OS repository via the package manager. The root file system created during the first run is saved, and subsequent runs of the application and code compilation do not cause installation of system packages (unless specified in the configuration file). Collector is a simple prototype that may be improved in the future. For example, using a network Linux namespace it is straightforward to run the application over a virtual network with a specified number of nodes and IP address range. Another improvement is to use Linux control groups to limit the resource usage of each parallel process to make the performance of the virtual network more predictable. In our example we use the CentOS operating system with the RPM package manager, but the procedure can be adapted for other platforms.
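
How Collector decides whether packages have to be installed again is not detailed here. One possible approach, shown purely as an assumption, is to keep a stamp file inside the root file system that records a digest of the requested package list. The file name \texttt{.packages.sha256} and all function names below are hypothetical.

```python
import hashlib
from pathlib import Path

STAMP_NAME = ".packages.sha256"  # hypothetical stamp file name


def package_digest(packages: list[str]) -> str:
    """Hash the package list independently of order and duplicates."""
    joined = "\n".join(sorted(set(packages)))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def needs_install(rootfs: Path, packages: list[str]) -> bool:
    """Return True if `rootfs` was not initialised with these packages."""
    stamp = rootfs / STAMP_NAME
    return not stamp.is_file() or stamp.read_text() != package_digest(packages)


def mark_installed(rootfs: Path, packages: list[str]) -> None:
    """Record the package list after a successful installation."""
    rootfs.mkdir(parents=True, exist_ok=True)
    (rootfs / STAMP_NAME).write_text(package_digest(packages))
```

With such a check the first run would install the packages and call \texttt{mark\_installed}, and later runs would skip the installation until the package list changes.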

%\section{Evaluation}

%on launch a virtual namespace is created in which all the work takes place. when the environment is configured and programmes are installed during the first run you do not need administrator privileges, because inside the virtual namespace you are your own root. the cgroups mechanism is used to create the virtual environment. drawback: a separate container has to be created for each platform. discuss how the virtual namespace is created. subtleties of change_root

In the experiment we compile and run the test programme~\cite{spec-factory}
twice. During the first run Collector downloads and installs all the
dependencies before compiling and running, and during the second run it only
checks that the dependencies are satisfied. After that it compiles the programme
and runs the tests. The experiment showed that initialising a separate root file
system takes a considerable amount of time compared to the execution time of
the tests (548~s versus 723~s), whereas subsequent runs are faster as they use
the already initialised environment (Table~\ref{tab:actions}). Performance-wise
it would be more efficient to store a read-only base image of the operating
system in a cache directory and use a Union/Overlay file system to mount it
under a writable directory to reduce the initialisation time.
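
The overlay idea can be made concrete with a small Python helper that assembles the corresponding \texttt{mount} command for the Linux overlay file system. This is a sketch of the proposed improvement, not something the prototype does: the function name and the directory layout are hypothetical, and whether an unprivileged user may perform the mount depends on the kernel.

```python
from pathlib import Path


def overlay_mount_command(base: Path, upper: Path, work: Path,
                          target: Path) -> list[str]:
    """Build a mount command that layers a writable directory over a base.

    base   -- read-only base image of the operating system (lower layer)
    upper  -- writable directory that receives all modifications
    work   -- empty scratch directory on the same file system as `upper`
    target -- mount point where the merged view appears
    """
    options = f"lowerdir={base},upperdir={upper},workdir={work}"
    return ["mount", "-t", "overlay", "overlay", "-o", options, str(target)]
```

The returned list can be passed to \texttt{subprocess.run}; every experiment would then share one cached base image and keep its own changes in a separate writable directory.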

%give an overview of my experiment
%Containerisation is a good idea, but it also needs a good implementation. One such tool, Docker, is used in many commercial projects. Docker allows any software tool to be put in a container that can run on your system. You just download it and start working!

%TODO: link to sources. Spec-factory is a scheduler for clusters. It contains a test on 3 local nodes. The test example computes wave spectra on hadoop.

\begin{table}
    \centering
    \caption{Performance of root file system initialisation.\label{tab:actions}}
    \begin{tabular}{lrr}
        \toprule
        \multirow{2}[2]{*}{Action}  & \multicolumn{2}{c}{Time, s} \\
        \cmidrule(lr){2-3}          & First run & Second run \\
        \midrule
        Download and install dependencies &  548 &   9 \\
        Execute example                   &  723 & 723 \\
        \addlinespace
        Total                             & 1271 & 732 \\
        \bottomrule
    \end{tabular}
\end{table}
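
As a rough summary of Table~\ref{tab:actions} (the ratios below are computed here from the table entries and rounded; they are not additional measurements), the share of dependency installation in the total time of each run and the ratio of the two totals are
\[
    \frac{548}{1271} \approx 0.43, \qquad
    \frac{9}{732} \approx 0.012, \qquad
    \frac{1271}{732} \approx 1.7,
\]
that is, initialisation takes about 43\% of the first run and about 1\% of the second, and the first run is about 1.7 times longer than the second.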

%\section{Discussion}

There are a number of potential problems related to lightweight virtualisation technologies. First, using unprivileged user namespaces requires a recent version of the Linux kernel (versions 3.10 and later fully support all namespaces), and this version is newer than the one widely used in HPC clusters. For example, the Scientific Linux distribution, which is popular in grid computing, uses kernel version 2.6.32, which is not capable of creating user namespaces. Second, lightweight virtualisation is available on Linux only: there is no compatible version of the technology for other UNIX or POSIX-compliant operating systems.
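
A tool can check the first condition before attempting to create namespaces. The Python sketch below compares a kernel release string with the 3.10 threshold mentioned above; the function names are hypothetical, and the check is only a necessary condition, because a distribution may disable unprivileged user namespaces regardless of the kernel version.

```python
import re


def kernel_version(release: str) -> tuple[int, ...]:
    """Extract the leading numeric components of a kernel release string."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) for part in match.groups() if part is not None)


def supports_user_namespaces(release: str,
                             minimum: tuple[int, int] = (3, 10)) -> bool:
    """Return True if the major and minor version reach the threshold."""
    return kernel_version(release)[:2] >= minimum
```

On a running system the release string can be obtained with \texttt{platform.release()} from the Python standard library.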

%Here we have two potential problems. One of them is related to our technology (virtualisation/containers), the other is broader and related to the whole topic of reproducible research.

%Virtualisation has some disadvantages. Firstly, it is limited by the operating system. If you use Linux, all is well, but Windows and MacOS users cannot use container technology. Next, you should check the kernel version of your system. The namespace mechanism first appeared in version 2.4.19, and, for example, the \texttt{CLONE\_NEWUSER} flag appeared in version 3.8. This is fine if you use a popular distribution such as Ubuntu, CentOS, Red Hat, etc. Scientific organisations often use custom distributions based on an older kernel version (for example, Scientific Linux, based on Red Hat, on the cluster at JINR has kernel version 2.6.32). The problem is diminishing now, but it was very pressing until recently.

%The main idea of reproducible research is sharing solutions. It keeps the scientific community healthy and the quality of research high. On the one hand, the results of any research are intellectual property that has some value, and not everyone will want to give it away. On the other hand, you have a free choice: make your research commercial or share it with the community.
%kernel version
%other OS
%the question of intellectual property

\section{Conclusion}

One of the problems in research reproducibility is the absence of tools to reproduce a specified operating system with specific versions of the software installed. Lightweight virtualisation offers a solution to this problem: unprivileged Linux namespaces are used to create such an execution environment in a separate root file system directory and to package it together with the source code of the programme and its binary form. The solution does not pollute the host operating system with programme dependencies and does not require super user privileges to create the environment. Future work is to investigate how the network Linux namespace and control groups can improve application execution inside the environment.

%\section{References}