\documentclass{llncs}

\usepackage{graphicx}
\usepackage{booktabs}
\usepackage{cite}
\usepackage{subcaption}
\captionsetup{compatibility=false}

\begin{document}
\title{Balancing load on a multiprocessor system with event-driven approach}
%
\titlerunning{Balancing load on a multiprocessor system with event-driven approach}
\author{Alexander Degtyarev \and Ivan Gankevich}
%
\authorrunning{Alexander Degtyarev \and Ivan Gankevich} % abbreviated author list (for running head)
%
\institute{Saint Petersburg State University, Saint Petersburg 199034, Russia\\
\email{deg@csa.ru, i.gankevich@spbu.ru}}

\maketitle

\begin{abstract}
There are many causes of imbalanced load on a multiprocessor system, such as heterogeneity of processors, parallel execution of tasks of varying complexity, and the difficulty of estimating the complexity of a particular task. However, if one treats the computer as an event-driven processing system and tasks as events running through it, the load balancing problem reduces to a well-posed mathematical problem, which further simplifies to solving a single equation. The load balancer measures both the complexity of the task being solved and the performance of the computer running this particular task, so that the load distribution can be adjusted accordingly. Such a load balancer was implemented as a computer program and was shown to balance the load on heterogeneous processors in a number of scenarios.
\keywords{load balance, event-driven architecture, heterogeneous system, multiprocessor computer}
\end{abstract}
\section*{Introduction}
Load balance is maintained by adjusting the distribution of computational tasks among available processors with respect to their performance; the inability to distribute tasks evenly stems not only from technical reasons but also from the peculiarities of the problem being solved. On the one hand, load imbalance can be caused by heterogeneity of the tasks and the inability to estimate how much time it takes to execute one particular task compared to another. Such difficulties arise in fluid mechanics applications involving the solution of a problem with boundary conditions, when the formula used to calculate the boundary layer differs from the formula used for inner points and takes longer to compute; the same problem arises in concurrent algorithms of intelligent systems that have different asymptotic complexities but solve the same problem concurrently, hoping to obtain the result from the fastest algorithm. On the other hand, load imbalance can be caused by heterogeneity of the processors and their differing performance when solving the same problem, which is relevant when tasks are executed on multiple computers in a network or on a single computer equipped with an accelerator. Therefore, load imbalance can be caused by heterogeneity of both tasks and processors, and both of these peculiarities should be taken into account to maintain the load balance of a computer system.

From a mathematical point of view, the load balance condition means equality of the distribution function $F$ of some task metric (e.g.~execution time) to the distribution function $G$ of some processor metric (e.g.~performance), and the problem of balancing the load reduces to solving the equation
\begin{equation} \label{eq:balance}
F(x)=G(n),
\end{equation}
where $x$ is the task metric (or time taken to execute the task) and $n$ is the processor metric (or relative performance of a processor needed to execute this task). Since in the general case it is impossible to know in advance either the time needed to execute the task on a particular processor or the performance of a processor executing a particular task, a stochastic approach should be employed to estimate these values. Empirical distribution functions can be obtained from execution time samples recorded for each task: the task metric is obtained by dividing execution time by the number of tasks, and the processor metric is obtained by dividing the number of tasks by their execution time. Any other suitable metrics can be used instead of the proposed ones, e.g.~the size of the data to be processed can serve as a task metric, and the processor metric can be represented by some fixed number.

It is easy to measure the execution time of each task when the whole system acts as an event-driven system and an event is a single task consisting of program code to be executed and data to be processed. In this interpretation, the load balancer component is connected to a processor recursively via a profiler, forming a feedback control system. The profiler collects execution time samples, and the load balancer estimates empirical distribution functions and distributes new tasks among processors by solving equation~\ref{eq:balance}.

Static load balancing is also possible in this event-driven system: for that purpose a set of different load balancers can be composed into a hierarchy. In such a hierarchy, the distribution function is estimated incrementally from bottom to top, and the hierarchy is used to maintain static load balance. Physical processors are composed into virtual ones by grouping a set of processors and assigning them to a single load balancer, or decomposed by assigning one physical processor to more than one load balancer at once. So static load balancing is orthogonal to dynamic load balancing, and they can be used in conjunction.

To summarize, the recursive load balancing approach targets problems exhibiting not only dynamic but also static imbalance, and the balance can be achieved by solving a single equation.
\section*{Related work}

The main drawback of existing parallel programming technologies is their inability to perform load balancing across different computing devices. Each device is associated with a different type of workload, e.g.~a disk is associated with I/O and a processor with pure computations. Although almost any program involves both computations and reading/writing data to disk, today's standards for multi-core programming (like OpenMP~\cite{dagum1998openmp}) do not allow doing these in parallel --- there are no pipelines in the OpenMP standard or in any emerging technology known to date. Moreover, there are many other computing devices that can benefit from pipelines --- mainly network interface cards and GPUs --- to perform computations and data transfer simultaneously. So, modern parallel programming technologies do not allow different types of workloads to coexist in a single program, but many programs may benefit from it, exploiting additional degrees of parallelism.

In contrast to parallel programming technologies, the event-driven approach makes it possible to use every device in the computer in a unified way and to easily form a pipeline between different devices. Event-driven architecture has been used extensively to create desktop applications with graphical user interfaces since the MVC paradigm~\cite{mvc} was developed, and nowadays it is also used to compose enterprise application components into a unified system with message queues~\cite{amqp,jms}; however, it is rarely implemented in scientific applications. One example of such usage is the GotoBLAS2 library~\cite{gotoblas1,gotoblas2}. Although it is not clear from the referenced papers, analysis of its source code\footnote{Source code is available at \url{https://www.tacc.utexas.edu/tacc-projects/gotoblas2/}.} shows that this library uses a specialized server object to schedule matrix and vector operations' kernels and to compute them in parallel. The total number of CPUs is defined at compile time and they are assumed to be homogeneous. There is a notion of a queue implemented as a linked list of objects, where each object specifies a routine to be executed, the data to be processed, and the number of CPUs to execute it on. The server processes these objects in parallel, and each kernel can be executed in synchronous (blocking) or asynchronous (non-blocking) mode. So, compared to an event-driven system, the GotoBLAS2 server uses static task scheduling, its tasks are not differentiated into production and reduction tasks, and both the tasks and the underlying system are assumed to be homogeneous. The GotoBLAS2 library exhibits competitive performance compared to other BLAS implementations~\cite{gotoblas1,gotoblas2}, and it is a good example of the viability of the event-driven approach in scientific applications. Considering this, an event-driven system can be seen as a generalization of this approach to a broader set of applications.

There are a number of research works in which the authors develop systems borrowing some features from the event-driven approach. For example, in~\cite{kale1993charm,zheng2010hierarchical,pilla2010charm} a concurrent object-oriented system similar to ours is described. In contrast to our system, it uses messages rather than events to transfer data, and it does not allow load balancing across different computing devices. Another example is~\cite{armstrong2003making}, in which the author describes a system that uses \emph{supervisor trees} to organize concurrent processes and manage resources. These structures are similar to our hierarchies, but we have separate hierarchies for servers and tasks, so that system resources and the flow of computations can be managed separately. So, there are works which borrow different parts of a typical event-driven system; however, they either do not put emphasis on load balancing or do not describe their system as event-driven, thus not exploiting its full benefits.

Load balancing is one of the main tasks of an operating system, and in our view, in high-performance computing it should be delegated to some intermediate software layer lying between the operating system and the application. So, in contrast to the described approaches, it would be useful to implement a full event-driven system and hide message passing and synchronization logic inside it.
\section{Implementation of event-driven system}

The whole system was implemented as a collection of C++ classes, and problem-solving classes were separated from utility classes with an event-driven approach. In this approach, the problem solution is represented by a set of executable objects or ``employees'', each implementing a solution of one particular part of the problem. Each executable object can implement two methods. With the first method, an employee either solves its part of the problem or produces child executable objects (``hires'' additional employees) to delegate problem solution to them. Since upon completion of this method no object is destroyed, it is called a ``production'' or ``upstream'' task, as it often delegates problem solution to a hierarchy layer located farther from the root than the current layer. The second method collects execution results from subordinate executable objects and takes such an object as an argument. Upon completion of this method the child object is destroyed, or ``fired'', so that the total number of executable objects is reduced; hence this task is called a ``reduction'' or ``downstream'' task, since the results are sent to a hierarchy layer located closer to the root. Executable objects can send results not only to their parents but also to any number of other executable objects; however, when communication with a parent occurs, the child object is destroyed, and when the root object tries to send results to its nonexistent parent, the program ends. Execution of a particular object is performed by submitting it to a queue corresponding to a particular processor. Child and parent objects are determined implicitly during submission, so that no manual specification is needed. Finally, these objects are never copied and are accessed only via their addresses. In other words, the only thing that is required when constructing an executable object is to implement a specific method to solve a task; the object's lifetime is controlled by the system, and a programmer does not have to manage it manually.
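
A minimal sketch of such an executable object is given below. The class and method names (\verb=Executable=, \verb=act=, \verb=react=, \verb=submit=) are illustrative and do not reproduce the actual source code.

\begin{verbatim}
// Sketch of the executable object interface (names are illustrative).
struct Executable {
    virtual ~Executable() {}
    // Production ("upstream") task: solve a part of the problem
    // or create child objects and submit them to a queue.
    virtual void act() = 0;
    // Reduction ("downstream") task: collect results from a child;
    // the child is destroyed by the system after this call.
    virtual void react(Executable* child) = 0;
};

// Provided by the system: submits an object to a queue and
// implicitly records the parent/child relationship.
void submit(Executable* object);

struct Child: public Executable {
    void act() { /* solve a part of the problem */ }
    void react(Executable*) {}              // no subordinates
};

struct Parent: public Executable {
    void act() { submit(new Child); }       // "hire" an employee
    void react(Executable* child) { /* collect the child's results */ }
};
\end{verbatim}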

Execution of objects is carried out concurrently, and construction of an executable object is separated from its execution with a thread-safe queue. Every message in a queue is an executable object carrying the data and the code needed to process it, and since executable objects are completely independent of each other, they can be executed in any order. A real server object corresponds to each queue in the system; it continuously retrieves objects from the queue and executes their production or reduction tasks in a thread associated with the server. Production tasks can be submitted to any queue, but the queue into which a reduction task can be submitted is determined by the corresponding parent object, so that no race condition can occur. Since each processor works with its own queue only and in its own thread, processing of the queues is carried out concurrently. Each queue in the system also represents a pipeline through which the data flows; however, execution order is completely determined by the objects themselves. So, executable objects and their methods model control flow while queues model data flow, and the two flows are separated from each other.
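
The following sketch shows how a real server might be organized, building on the \verb=Executable= sketch above; the blocking queue is assembled from standard C++11 primitives, and all names are illustrative.

\begin{verbatim}
#include <condition_variable>
#include <mutex>
#include <queue>

// Sketch of a real server: one thread per queue (names illustrative).
class Rserver {
    std::queue<Executable*> tasks;
    std::mutex mtx;
    std::condition_variable cv;
    bool stopped = false;
public:
    void submit(Executable* task) {
        std::lock_guard<std::mutex> lock(mtx);
        tasks.push(task);
        cv.notify_one();
    }
    void stop() {
        std::lock_guard<std::mutex> lock(mtx);
        stopped = true;
        cv.notify_all();
    }
    void serve() {  // runs in the thread associated with this server
        for (;;) {
            std::unique_lock<std::mutex> lock(mtx);
            cv.wait(lock, [this] { return stopped || !tasks.empty(); });
            if (stopped && tasks.empty()) break;
            Executable* task = tasks.front();
            tasks.pop();
            lock.unlock();
            task->act();  // or parent->react(task) for a reduction task
        }
    }
};
\end{verbatim}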

Heterogeneity of executable objects can cause load imbalance among different queues, and this problem can be solved by introducing imaginary (i.e.~proxy) servers and profilers to aid in the distribution of executable objects. An imaginary server is a server tied to a set of other servers, and its only purpose is to choose the right child server on which to execute an object.

In the simplest case, a proper distribution can be achieved with a round-robin algorithm, i.e.~each arriving object is executed on the next server in turn; in the general case, however, some additional information about completed runs is needed to choose the right server, and this information can be collected with pluggable profiler objects. When a new object arrives at an imaginary server, up-to-date profiling information is collected from child servers, and the specified distribution strategy is used to delegate execution of the object to an appropriate server; a static distribution strategy is also possible. So, imaginary servers together with distribution strategies and profilers can be used to distribute executable objects among real servers, taking into account profiling information from completed object executions.
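
A sketch of how an imaginary server might delegate objects to its children through a pluggable strategy is shown below; the types \verb=Server=, \verb=Profiler= and \verb=Strategy= are illustrative, not the actual class names.

\begin{verbatim}
#include <cstddef>
#include <vector>

struct Executable;                   // defined earlier

struct Profiler { /* samples of completed runs for one child */ };

struct Server {                      // common base of both server kinds
    virtual void submit(Executable*) = 0;
};

struct Strategy {                    // pluggable distribution strategy
    virtual std::size_t choose(const std::vector<Profiler>&) = 0;
};

// An imaginary server only routes objects down the hierarchy.
struct Iserver: public Server {
    std::vector<Server*> children;   // real or imaginary servers
    std::vector<Profiler> profilers; // one per child
    Strategy* strategy;
    void submit(Executable* object) {
        std::size_t i = strategy->choose(profilers);
        children[i]->submit(object); // delegate to the chosen child
    }
};
\end{verbatim}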

The class diagram of the whole event-driven system is depicted in Figure~\ref{fig:classes}, and the system works as follows (a sketch of the start-up sequence is given after the list).

\begin{enumerate}
	\item When program execution starts, the hierarchy of imaginary and real servers is composed. All real servers are launched in separate threads, and processing of executable objects starts.
	\item The first object is created and submitted to the imaginary server at the top of the hierarchy. The server employs the specified distribution strategy to choose an appropriate server from the next layer of the hierarchy to send the object to. The profiler gathers measurements of completed runs from subordinate servers and decides where to send the object.
	\item The previous step repeats until the bottom level of the hierarchy is reached, and the real server selected by the distribution strategy starts executing the object.
	\item The object is executed, and measurements are taken by a profiler. If during execution more executable objects are created and submitted to the top imaginary server, the whole algorithm is repeated for each new object; if the root object submits its reduction task, then all servers in the hierarchy are shut down and program execution ends.
\end{enumerate}
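
Building on the previous sketches (and assuming \verb=Rserver= implements the \verb=Server= interface), the start-up sequence might look as follows; \verb=FirstTask= and the composition of the hierarchy are purely illustrative.

\begin{verbatim}
#include <thread>

// Sketch of program start-up (composition is illustrative).
struct FirstTask: public Executable {
    void act() { /* create and submit subordinate objects */ }
    void react(Executable*) { /* collect results */ }
};

struct RoundRobin: public Strategy {      // simplest strategy
    std::size_t next = 0;
    std::size_t choose(const std::vector<Profiler>& p) {
        return next++ % p.size();
    }
};

int main() {
    Rserver r1, r2;                       // real servers
    std::thread t1(&Rserver::serve, &r1);
    std::thread t2(&Rserver::serve, &r2);

    RoundRobin strategy;
    Iserver root;                         // top of the hierarchy
    root.children = { &r1, &r2 };
    root.profilers.resize(2);
    root.strategy = &strategy;

    root.submit(new FirstTask);           // the first object

    // ... the program runs until the root object submits its
    // reduction task, after which all servers are shut down:
    r1.stop(); r2.stop();
    t1.join(); t2.join();
}
\end{verbatim}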

\begin{figure}
	\centering
	\includegraphics{classes.eps}
	\caption{Class diagram for an event-driven system. ``Iserver'' denotes imaginary server and ``Rserver'' denotes real server.}
	\label{fig:classes}
\end{figure}

To sum up, the whole system is composed of two hierarchies: one hierarchy represents tasks, data and their dependencies, employing executable objects; the other represents the processing system, employing imaginary and real server objects. The mapping of the first hierarchy onto the second is implicit and is implemented using message queues. Such a composition allows easy configuration of dynamic and static load distribution strategies and allows programming with simple executable objects.

\section{Implementation of distribution strategy}

Recursive load balancing was implemented as a load distribution strategy; however, equation~\ref{eq:balance} was not solved directly. The first problem with solving this equation directly is that the task metric $x$ cannot be computed before actually running the task, so it was estimated as the average metric of a number of previous runs. The second problem is that even when the task metric is known, the result of a direct solution of equation~\ref{eq:balance} is not an identifier of the processor to execute the task on, but the number $n$, the relative performance of a processor needed to execute the task, which is not particularly useful when determining where to execute the task. Therefore, equation~\ref{eq:balance} was not solved directly, but its main idea was realized in an algorithm similar to round-robin.

The resulting algorithm works as follows.
\begin{enumerate}
\item First, the algorithm collects the samples recorded by profilers of child servers and estimates the task metric and the processor metric using values from previous runs. At this stage, not only averaging but any other suitable prediction technique can be used.
\item Then, the probability of having a task with a metric equal to the computed task metric is determined by counting the samples equal to the computed task metric and dividing this count by the total number of samples.
\item The cursor pointing at the processor to execute the next task on is incremented by a step equal to the product of the computed probability and the computed processor metric.
\item Then, by recursively subtracting the metric of each processor from the cursor, the needed processor is found and the task is executed on it.
\end{enumerate}

The resulting mathematical formula for each step can be written simply as
$$ \mathit{cursor} = \mathit{cursor} + F(\bar x)\,\bar n, $$
where $\bar x$ is the task metric and $\bar n$ is the processor metric. In the case of a fully homogeneous system with all tasks having equal metric, this algorithm is equivalent to round-robin: all processors have metric equal to $1$ and the probability is always $1$, so the cursor is always incremented by $1$.
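
A sketch of one distribution step is given below. For clarity it uses floating-point arithmetic, whereas the actual implementation uses integer arithmetic, as discussed next; all names are illustrative. Each processor $i$ occupies a segment of width \verb=metric[i]= on a ring, the cursor is advanced by $F(\bar x)\,\bar n$, and the segment in which it lands is selected.

\begin{verbatim}
#include <cmath>
#include <cstddef>
#include <vector>

// One step of the recursive load distribution algorithm (sketch).
// `cursor` persists between submissions; `prob` is the estimated
// probability F of the predicted task metric; `nbar` is the
// estimated processor metric.
std::size_t next_processor(double& cursor, double prob, double nbar,
                           const std::vector<double>& metric) {
    double total = 0;
    for (double m : metric) total += m;  // circumference of the ring
    // advance the cursor: cursor = cursor + F(x)*n, wrapped around
    cursor = std::fmod(cursor + prob * nbar, total);
    // recursively subtract each processor's metric from the cursor
    // until it falls inside some processor's segment
    double pos = cursor;
    std::size_t i = 0;
    while (pos >= metric[i]) { pos -= metric[i]; ++i; }
    return i;  // processor to execute the task on
}
\end{verbatim}

With unit metrics and unit probability the function returns processors in cyclic order, reproducing round-robin.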

Although the algorithm is simple, in practice it requires certain modifications and a robust profiler to work properly. Since the algorithm balances reciprocal values of the task metric $t$ (execution time) and the processor metric $1 / t$ (processor throughput), even a slight oscillation of the task metric can greatly affect the resulting distributions. The solution to this problem is to smooth the samples with a logarithm function, and this can be done in a straightforward way, because the algorithm makes no assumptions about the metrics' dimensions and treats them as plain numbers. The second problem is that the algorithm should be implemented with integer arithmetic only, to minimize the overhead of load balancing. This problem can be solved by omitting the mantissa after the logarithm is applied to a sample; in that case the processor metric is equal to the task metric but has the opposite sign. The last major problem is that the distribution of the task metric may change abruptly during program execution, which renders samples collected by a profiler for previous runs useless. This problem is solved by detecting a sharp change in task execution time (more than three standard deviations): when outliers are detected, the profiler is reset to its initial state, in which the distribution is assumed to be uniform. As a result of applying the logarithm to each sample, the algorithm becomes unsuitable for relatively small tasks and for tasks taking too much time; although such tasks are executed, samples are not collected for them, as they often represent mere control flow tasks. To summarize, the modified algorithm is implemented using integer arithmetic only, is suitable for relatively complex tasks, and adapts to a rapid change of the task metric distribution.
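
The following sketch illustrates these profiler modifications; the outlier test and the integer representation are simplified, and all names are illustrative.

\begin{verbatim}
#include <cmath>
#include <cstdint>
#include <vector>

// Sketch of the profiler's sample processing (names illustrative).
struct Profiler {
    std::vector<std::int64_t> samples;  // log-smoothed, integer-only

    void add(double exec_time) {
        double x = std::log(exec_time); // smooth with a logarithm
        // A sharp change of execution time (more than three standard
        // deviations) resets the profiler to its initial state, in
        // which the distribution is assumed to be uniform.
        if (samples.size() > 1 && std::abs(x - mean()) > 3 * stdev())
            samples.clear();
        // Drop the mantissa: the processor metric log(1/t) is then
        // the task metric log(t) with the opposite sign.
        samples.push_back(static_cast<std::int64_t>(x));
    }
    double mean() const {
        double s = 0;
        for (auto v : samples) s += v;
        return s / samples.size();
    }
    double stdev() const {
        double m = mean(), s = 0;
        for (auto v : samples) s += (v - m) * (v - m);
        return std::sqrt(s / samples.size());
    }
};
\end{verbatim}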

One problem of the algorithm that stands apart is that it becomes inefficient when there is a large number of tasks with high metric values. This happens because when a task is assigned to a particular processor, it is not executed directly but rather placed in a queue. If this queue is not empty, the task can reside in it for so long that its assignment to a particular processor no longer matches the actual distribution function. The solution is simple: such stale tasks can easily be detected by recording their arrival time and comparing it with the current time, and when stale tasks are encountered by a queue processor, they can be redistributed to match the current distribution function. However, the existence of stale tasks is also evidence that the computer is not capable of solving a problem fast enough to cope with continuously generated tasks, and it is an opportunity to communicate with other computers to solve the problem together. From a technical point of view, delegation of tasks to other computers is possible because tasks are independent of each other and read/write (serialization) methods are easily implemented for each of them; however, this problem was not addressed herein, and only load redistribution within a single computer was implemented.
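
Stale task detection then reduces to a timestamp comparison, as the following sketch shows (names are illustrative).

\begin{verbatim}
#include <chrono>

// Sketch of stale task detection (names illustrative). The arrival
// time is recorded when the task is placed in a queue; a task that
// waited longer than the average solution time no longer matches
// the distribution it was placed under and is resubmitted.
using Clock = std::chrono::steady_clock;

struct QueuedTask {
    Executable* task;
    Clock::time_point arrived;   // recorded on submission
};

bool is_stale(const QueuedTask& t, Clock::duration avg_solve_time) {
    return Clock::now() - t.arrived > avg_solve_time;
}

// In the server loop:
//   if (is_stale(t, avg)) root->submit(t.task);  // redistribute
//   else                  t.task->act();
\end{verbatim}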

The described algorithm is suitable for distributing production tasks, but a different algorithm is needed to distribute reduction tasks. Indeed, when executable objects come in pairs consisting of a child and its parent, all children of the parent must be executed on the same server so that no race condition takes place; hence it is not possible to distribute such a task to an arbitrary server, and a particular server must be chosen for all of the child tasks. One possible way of choosing a server is to apply a simple hash function to the parent's memory address. Some sophistication of this algorithm is possible, e.g.~predicting the memory allocation and deallocation pattern to distribute reduction tasks uniformly among servers; however, considering that most of the reduction tasks in the tested program were simple (the reason for this is discussed in Section~\ref{evaluation1}), this approach seemed nonviable and was not implemented. So, a simple hashing algorithm was used to distribute reduction tasks among servers.
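
A sketch of such reduction-task placement is given below: hashing the parent's address keeps all children of one parent on the same server (names are illustrative).

\begin{verbatim}
#include <cstddef>
#include <functional>

// All reduction tasks of one parent map to the same server, so
// they are executed sequentially and no race condition can occur.
std::size_t reduction_server(const Executable* parent,
                             std::size_t num_servers) {
    return std::hash<const Executable*>()(parent) % num_servers;
}
\end{verbatim}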

To summarize, the recursive load distribution algorithm by default works as a round-robin algorithm, and when a considerable change of task execution time is detected, it automatically distributes the load in accordance with the task metric distribution. If there is a change in processor performance, it is taken into account by relating the processor's performance to that of the other processors of the computing system. Finally, if a task stays too long in a queue, it is distributed once again to match the current distribution function.

\section{Evaluation}

The event-driven approach was tested on the example of a hydrodynamics simulation program which solves a real-world problem~\cite{csit,autoreg1,autoreg2,stab}. The problem consists of generating a realistic ocean wavy surface and computing pressures under this surface to measure the impact of external excitations on a marine object. The program is well balanced in terms of processor load; for the purpose of evaluation it was implemented with the introduced event-driven approach, and the resulting implementation was compared to an existing non-event-driven one in terms of performance and programming effort.

The event-driven architecture makes it easy to write logs, which in turn can be used to visualize the control flow of a program. Each server maintains its own log file, and when some event occurs, it is logged in this file accompanied by a time stamp and a server identifier. Having such files available, it is straightforward to reconstruct the sequence of events occurring during program execution and to establish connections between these events (i.e.~to dynamically draw the graph of tasks as they are executed). Many such graphs are used in this section to demonstrate the results of the experiments.

Generation of a wavy surface is implemented as a transformation of white noise: an autoregressive model is used to generate ocean waves, and pressures are computed using an analytical formula. The program consists of a preprocessing phase, the main compute-intensive phase and a post-processing phase. The program begins by solving the Yule--Walker equations to determine autoregressive coefficients and the variance of white noise. Then white noise is generated and transformed into a wavy surface. Finally, the surface is trimmed and written to the output stream. Generation of the wavy surface is the most compute-intensive phase, consuming over 80\% of program execution time (Figure~\ref{fig:cases}) for moderate wavy surface sizes, and this share does not change with the surface size. So, the program spends most of the time in the main phase generating the wavy surface (this phase is marked with the $[G_0,G_1]$ interval in the graphs). The hardware used in the experiments is listed in Table~\ref{tab:setup}. The program was tested in a number of experiments and finally compared to other parallel programming techniques.

\subsection{Evaluation of event-driven system}
\label{evaluation1}

The first experiment consisted of measuring stale cycles and discovering the causes of their occurrence. The program source code was instrumented with profiling directives, and every occurrence of stale cycles was written to the log file; the total stale time was measured as well. The obtained results showed that stale cycles prevail in preprocessing and at the end of the main phase but are not present in other parts of the program (Figure~\ref{fig:stale}). The reason for this is the insufficient number of tasks available for solving during these phases, which in turn is caused by global synchronizations occurring multiple times in the preprocessing phase and, naturally, at the end of the program. Stale cycles in the main phase are caused by computation performing faster than writing results to disk: in the program only one thread writes data and no parallel file system is used. Further experiments showed that stale cycles consume at most 20\% of the total execution time on the 4-core system (Table~\ref{tab:overhead}), and although during this time threads are waiting on a mutex, so the time can be consumed by other operating system processes, there is also an opportunity to speed up the program. Considering file output performance, stale cycles can only be reduced with faster storage devices combined with slower processors, or with parallel file systems combined with fast network devices and interconnects. In contrast, the main cause of stale cycles in the preprocessing phase is global synchronization, and to minimize its effect, it should be replaced by incremental synchronization where possible.

\begin{figure}
	\centering
	\includegraphics{stale.eps}
	\caption{Occurrences of stale cycles in preprocessing and at the end of the main computational phase of a program. Range $[G_0, G_1]$ denotes the computationally intensive phase.}
	\label{fig:stale}
\end{figure}

The next experiment consisted of measuring different types of overheads, including profiling, load balancing and queuing overheads, so that the real performance of the event-driven system could be estimated. In this experiment, the same technique was used to obtain measurements: every function causing overhead was instrumented, and the total time spent executing tasks as well as the total program execution time were measured. As a result, the total overhead was estimated to be less than 0.1\% for different numbers of cores (Table~\ref{tab:overhead}). The results also showed that reduction time is smaller than the total time spent solving production tasks in all cases (Table~\ref{tab:overhead}). It is typical of generator programs to spend more time solving data-generating production tasks than data-processing reduction tasks; in a data-centric program specializing in data processing this relation can be different. Finally, it is evident from the results that the more cores are present in the system, the more stale time is introduced into the program. This behavior was explained in the previous experiment and is caused by the imbalance between processor performance and storage device performance for this particular computational problem. To summarize, the experiment showed that the event-driven system and the recursive load distribution strategy do not incur much overhead even on systems with a large number of cores, and that the program is code-centric rather than data-centric, spending most of its execution time solving production tasks.

\begin{table}
	\centering
	\begin{tabular}{p{0.22\textwidth} p{0.33\textwidth} l l l}
	\toprule
	Classifier       & Time consumer              & \multicolumn{3}{l}{Time spent, \%} \\
	\cmidrule(r){3-5}
	                 &                            & 4 cores            & 24 cores           & 48 cores \\
	\midrule
	Problem solution & Production tasks           & 71                 & 33                 & 19 \\
	                 & Reduction tasks            & 13                 & \hphantom{0}4      & \hphantom{0}2 \\
	\addlinespace
	Stale time       & Stale cycles               & 16                 & 63                 & 79 \\
	\addlinespace
	Overhead         & Load distribution overhead & \hphantom{0}0.01   & \hphantom{0}0.0014 & \hphantom{0}0.0017 \\
	                 & Queuing overhead           & \hphantom{0}0.002  & \hphantom{0}0.0007 & \hphantom{0}0.0005 \\
	                 & Profiling overhead         & \hphantom{0}0.0004 & \hphantom{0}0.0004 & \hphantom{0}0.0003 \\
	                 & Other overheads            & \hphantom{0}0.06   & \hphantom{0}0.03   & \hphantom{0}0.02   \\
	\bottomrule
	\end{tabular}
	\caption{Distribution of wall clock time and its main consumers in the event-driven system. Time is shown as a percentage of the total program execution time. Experiments for 4 cores were conducted on system I and experiments for 24 and 48 cores were conducted on system II from Table~\ref{tab:setup}.}
	\label{tab:overhead}
\end{table}

In the third experiment, the total number of production tasks solved by the system was measured along with the total number of task resubmissions, and a high percentage of resubmissions was found. Each resubmission was recorded as a separate event, and then the number of resubmissions for each task was calculated. The experiment showed that on average 35\% of tasks are resubmitted, and analysis of the event log suggested that resubmissions occur mostly during the main computational phase (Figure~\ref{fig:resubmit}). In other words, 35\% of production tasks stayed in a queue for too long (more than the average time needed to solve a task), so the underlying computer was not capable of solving tasks as fast as the program generated them. This result leads to the conclusion that if more than one computer is available to solve a problem, there is a natural way to determine what part of this problem requires multiple computers to be solved. So, a high percentage of resubmissions shows that the machine solves production tasks slower than the program generates them, and hence multiple machines could be used to speed up the problem solution.

\begin{figure}
	\centering
	\includegraphics{resubmit.eps}
	\caption{Event plot of resubmission of production tasks staying in a queue for too long. Range $[G_0, G_1]$ denotes the computationally intensive phase.}
	\label{fig:resubmit}
\end{figure}

In the final experiment the overall performance of the event-driven approach was tested, and it was found to be superior when solving problems producing large volumes of data. In previous research OpenMP was found to be the best performing technology for wavy ocean surface generation~\cite{csit}, so the experiment consisted of comparing its performance to the performance of the event-driven approach on a set of input data. The range of sizes of the wavy surface was the only parameter varied among subsequent program runs. As a result of the experiment, the event-driven approach was found to have higher performance than OpenMP, and the larger the size of the problem, the bigger the performance gap becomes (Figure~\ref{fig:performance}). The event plot in Figure~\ref{fig:overlap} for the run with the largest problem size also shows that the high performance is achieved by overlapping the parallel computation of the wavy surface (interval $[G_0,G_1]$) and the output of resulting wavy surface parts to the storage device (interval $[W_0,W_1]$). It can be seen that there is no such overlap in the OpenMP implementation, and output begins at point $W_0$ right after the generation of the wavy surface ends at point $G_1$. In contrast, there is a significant overlap in the event-driven implementation, and in that case wavy surface generation and data output end almost simultaneously at points $G_1$ and $W_1$ respectively. So, the approach with pipelined execution of parallelized computational steps achieves better performance than sequential execution of the same steps.

\begin{figure}
	\centering
	\includegraphics{performance.eps}
	\caption{Performance comparison of OpenMP and event-driven implementations.}
	\label{fig:performance}
\end{figure}

Although OpenMP allows constructing pipelines, it is not easy to combine a pipeline with parallel execution of tasks. Such a combination is possible if a thread-safe queue is implemented to connect the threads generating the ocean surface with the thread writing data to disk; then the work of each thread can be implemented with an \textit{omp section} directive. However, implementation of parallel execution within an \textit{omp section} requires support for nesting \textit{omp parallel} directives. So, combining a pipeline with parallel execution is complicated in an OpenMP implementation, requiring the use of a thread-safe queue which is not present in the OpenMP standard.
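
A sketch of what such a combination might look like is shown below; \verb=ThreadSafeQueue=, \verb=Part= and the helper functions are assumed to be user-provided, since the OpenMP standard offers no thread-safe queue, and nested parallelism must be enabled for the inner \textit{omp parallel for} to take effect.

\begin{verbatim}
#include <omp.h>

// Sketch: pipeline combined with parallel execution in OpenMP.
// ThreadSafeQueue, Part, generate_part and write_to_disk are
// user-provided (illustrative); OpenMP has no such primitives.
void generate_and_write(ThreadSafeQueue<Part>& queue, int num_parts) {
    omp_set_nested(1);  // allow the nested parallel region
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        {   // producer: generate wavy surface parts in parallel
            #pragma omp parallel for
            for (int i = 0; i < num_parts; ++i)
                queue.push(generate_part(i));
            queue.close();
        }
        #pragma omp section
        {   // consumer: a single thread writes parts to disk
            Part part;
            while (queue.pop(part))
                write_to_disk(part);
        }
    }
}
\end{verbatim}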

\begin{figure}
	\centering
	\includegraphics{overlap.eps}
	\caption{Event plot showing overlap of parallel computation $[G_0,G_1]$ and data output $[W_0,W_1]$ in the event-driven implementation. There is no overlap in the OpenMP implementation.}
	\label{fig:overlap}
\end{figure}

To summarize, the event-driven programming approach was applied to a real-world high-performance application, and it was shown to incur low overhead, although it results in the appearance of stale periods when no problem solving is performed by some threads. The duration of these periods in the main phase can be reduced with faster storage equipment, and the duration of stale periods in the preprocessing phase can be reduced by employing incremental synchronization techniques. The event-driven approach also offers a natural way of determining whether program execution should scale to multiple machines; however, the viability of such a mode of execution was not tested in the present research. Finally, it was shown that the event-driven approach is more efficient than the standard OpenMP technology, especially for large problem sizes, and that a pipeline combined with parallel execution works faster than sequential execution of parallelized steps.

\subsection{Evaluation of load distribution strategy}

The performance of the recursive load distribution algorithm was compared to that of the round-robin algorithm in a number of scenarios with combinations of homogeneous and heterogeneous tasks and homogeneous and heterogeneous processors. In each experiment the total execution time and the distributions of the task metric and the processor metric were measured and compared to the uniform distribution case. All tests were performed on the same system (Table~\ref{tab:setup}), and each scenario was run multiple times to ensure accurate results. Preliminary validation tests were also performed to make sure that the algorithm works as intended. So, the purpose of the evaluation was to demonstrate how the algorithm works in practice and to measure its efficiency on a real problem.

\begin{table}
	\centering
	\begin{tabular}{p{0.34\textwidth} p{0.31\textwidth} p{0.31\textwidth}}
		\toprule
		Component                  & System \\
		\midrule
		Programming language       & C++11 \\
		Threading library          & C++11 STL threads \\
		Atomics library            & C++11 STL atomic \\
		Time measurement routines  & \verb=clock_gettime(CLOCK_MONOTONIC, ...)= \\
		                           & \verb=/usr/bin/time -f %e= \\
		\addlinespace
		Compiler                   & GCC 4.8.2 \\
		Compiler flags             & \verb:-std=c++11 -O2 -march=native: \\
		\cmidrule(r){2-3}
		                           & I                       & II \\
		\cmidrule(r){2-3}
		Operating system           & Debian 3.2.51-1 x86\_64 & CentOS 6.5 x86\_64 \\
		File system                & ext4                    & ext4 \\
		\addlinespace
		Processor                  & Intel Core 2 Quad Q9650 & $2\times$Intel Xeon E5-2695 v2 \\
		Core frequency (GHz)       & 3.00                    & 2.40 \\
		Number of cores            & 4                       & 24 (48 virtual cores)\\
		RAM capacity (GB)          & 8                       & 256 \\
		RAID device                &                         & Dell PERC H710 Mini \\
		RAID configuration         &                         & RAID10 \\
		Storage device             & Seagate ST3250318AS     & $4\times$Seagate ST300MM0006 \\
		Storage device speed (rpm) & 7200                    & $2\times$10000 \\
	\bottomrule
	\end{tabular}
	\caption{Testbed setup.}
	\label{tab:setup}
\end{table}

It has already been shown that the algorithm consumes only a small fraction of the total execution time of a program (Table~\ref{tab:overhead}), so the purpose of the validation test was to show the algorithm's ability to switch between different task metric distributions. The switching is performed when a significant change (more than three standard deviations) of task execution time occurs. The tests have shown that switching events are present in the preprocessing phase and do not occur in the main phase (Figure~\ref{fig:cases}). The cause of the switching is the highly variable task execution time inherent to the preprocessing phase. So, profiler resets occur only when a change of the task execution time distribution is encountered, and no switching is present when this distribution does not change.

%\begin{figure}
%	\centering
%	\includegraphics{resets.eps}
%	\caption{Event plot of profilers' resets during execution. Point $G_0$ indicates start of ocean surface generation phase.}
%	\label{fig:resets}
%\end{figure}

The purpose of the first experiment was to show that the algorithm is capable of balancing homogeneous tasks on a homogeneous computer, in which case it works like the well-known round-robin algorithm. During the experiment, events of task submissions were recorded along with additional profiler data, and an event plot was created. In Figure~\ref{fig:case_0} the relative performance of each processor core is plotted, and all the samples lie on a single line in the computational phase. Since this phase consists of executing tasks of equal metric, the straight line represents the uniform distribution of tasks among processor cores, constituting the round-robin algorithm. So, in the simplest case of homogeneous tasks and processors, the recursive load balancing algorithm works as round-robin.

The purpose of the second experiment was to show that the recursive load balancing algorithm is capable of balancing homogeneous tasks on heterogeneous processors, in which case it distributes the load taking into account the performance of a particular processor. Although the natural application of such load balancing is hybrid computer systems equipped with graphics or other accelerators, the experiment was conducted by emulating such systems with a hierarchy of servers. It was found that the load balancing algorithm can recognize the performance of different components and adapt the distribution of tasks accordingly (Figure~\ref{fig:case_1}): $I_1$'s first and second child servers have relative performance equal to $0.75$ and $0.25$ respectively, whereas all children of the $I_2$ server have relative performance equal to $1 \over 3$. Also, this system setup shows performance similar to that of the homogeneous computer configuration (Figure~\ref{fig:performance2}). So, the recursive load balancing algorithm works on heterogeneous computer configurations, and the performance is similar to the homogeneous system case.

\begin{figure}
	\centering
	\includegraphics{performance2.eps}
	\caption{Performance comparison of different server configurations. Configurations are listed in Figure~\ref{fig:cases}.}
	\label{fig:performance2}
\end{figure}

The purpose of the third experiment was to show that the algorithm is capable of balancing heterogeneous tasks on a homogeneous system; the experiment showed that the performance gain is small. For the experiment, the source code generating a wavy surface was modified so that parts of two different sizes are generated simultaneously. In order to balance such a workload on a homogeneous system, the step should be equal to ${1 \over {2i}} n, i=1,2,\ldots$, where $n$ is the processor metric (instead of being equal to $1$ when all parts have the same size), so that each processor takes two respective parts of the surface. In Figure~\ref{fig:case_3}, which shows the results of the experiment, the step reaches its optimal value of ${1 \over 2}n$ (the $0.125$ mark); however, it takes almost 8 seconds (or 40\% of the total time) to reach this value. The first two cases do not exhibit such behavior, and the step does not change during execution. Also, in the course of the experiment it was found that the step oscillates, and to fix this it was smoothed with a five-point median filter and the number of samples was doubled. Finally, in subsequent experiments it was found that the more unique part sizes are present in the main phase, the more samples should be collected to preserve the accuracy of the step estimation; however, the increase in the number of samples led to slow convergence of the step to its optimal value. In other words, the more heterogeneous the tasks are, the more time is needed to find the optimal step value for them.

The purpose of the fourth and final experiment was to show that the algorithm is capable of balancing heterogeneous tasks on a heterogeneous system, and the results were similar to those of the previous experiment. The system configuration was the same as in the second experiment. Although in Figure~\ref{fig:case_4}, which shows the results, the metrics and steps of both servers reach nearly optimal values, these processes exhibit more disturbances. So, the algorithm works with heterogeneous tasks and a heterogeneous system, but the heterogeneity of a system increases the variability of the step. In other words, heterogeneity of a system also increases the time needed to find the optimal step value.

To summarize, from the experiments one can conclude that the algorithm works on any system configuration and with any task combination, but requires tuning for a particular problem. However, experience obtained in the course of the experiments suggests that not only does heterogeneity of tasks and computers increase the number of samples and the convergence time, but there are also certain task size distributions that cannot be handled efficiently by this algorithm and can extend this time indefinitely. One example of such a distribution is a linearly increasing task size. In this case the step is always equal to $1/m$, where $m$ is the number of samples, and there is no way to tune the algorithm to balance such a workload. So, the downside of the recursive load balancing algorithm is that it is suitable only for closed metric distributions with low variability of the metric, and more general and simpler algorithms can be developed. It is also evident from the experiments of Section~\ref{evaluation1} that in the tested program the dominating performance factor is the balance between the speed of wavy surface generation and the speed of writing it to the storage device. In that case, the load balancing algorithm plays only a secondary role, and any combination of computer and task heterogeneity demonstrates comparable performance, as depicted in Figure~\ref{fig:performance2}.

\begin{figure}
	\centering
	\begin{subfigure}{\textwidth}
		\begin{tabular}{p{0.77\textwidth}p{0.30\textwidth}}
			\vspace{0pt} \includegraphics{case_0.eps} & \vspace{1.33em} \includegraphics{scheme_0.eps} \\
		\end{tabular}
		\caption{Homogeneous tasks and homogeneous computer case.}
		\label{fig:case_0}
	\end{subfigure}
	\begin{subfigure}{\textwidth}
		\begin{tabular}{p{0.77\textwidth}p{0.30\textwidth}}
			\vspace{0pt} \includegraphics{case_1.eps} & \vspace{1.33em} \includegraphics{scheme_1.eps} \\
		\end{tabular}
		\caption{Homogeneous tasks and heterogeneous computer case.}
		\label{fig:case_1}
	\end{subfigure}
	\begin{subfigure}{\textwidth}
		\begin{tabular}{p{0.77\textwidth}p{0.30\textwidth}}
			\vspace{0pt} \includegraphics{case_3.eps} & \vspace{1.33em} \includegraphics{scheme_0.eps} \\
		\end{tabular}
		\caption{Heterogeneous tasks and homogeneous computer case.}
		\label{fig:case_3}
	\end{subfigure}
	\begin{subfigure}{\textwidth}
		\begin{tabular}{p{0.77\textwidth}p{0.30\textwidth}}
			\vspace{0pt} \includegraphics{case_4.eps} & \vspace{1.33em} \includegraphics{scheme_1.eps} \\
		\end{tabular}
		\caption{Heterogeneous tasks and heterogeneous computer case.}
		\label{fig:case_4}
	\end{subfigure}
%	\begin{subfigure}{\textwidth}
%		\begin{tabular}{p{0.77\textwidth}p{0.30\textwidth}}
%			\vspace{0pt} \includegraphics{case_2.eps} & \vspace{1.33em} \includegraphics{scheme_2.eps} \\
%		\end{tabular}
%		\caption{Homogeneous tasks and heterogeneous computer case.}
%		\label{fig:case_2}
%	\end{subfigure}
	\caption{Event plot of task submissions and relative performance of child servers recorded at the time of submissions. $I$ denotes ``Iserver'' and $R$ denotes ``Rserver''. Profiled servers are marked with a dashed line.}
	\label{fig:cases}
\end{figure}

\section*{Conclusions}

The main advantage of the event-driven approach is its applicability to both heterogeneous systems and heterogeneous tasks. It allows a programmer to rely on the technology to distribute the load evenly on the processor cores. Experiments showed that this approach works in a wide range of test cases and in a real-world application. Moreover, in this application it performs better than the popular OpenMP technology.

Apart from being more efficient than OpenMP, the biggest advantage of the event-driven approach is the ease of parallel programming. First of all, all that is needed from a programmer is to develop a class describing each independent task, create objects of that class and submit them to a queue. Programming in such a way does not involve thread and lock management, and the system is flexible enough to execute even the tiniest tasks in parallel. Second, relieving the programmer from thread management makes this system easy to debug. Each thread maintains its own log, both system and user events can be written to it, and the sequence of events can be restored after the execution ends. Finally, with the event-driven approach it is easy to write a load distribution algorithm for a specific problem (or use an existing one). The only thing which is not done automatically is decomposition and composition of tasks; however, this problem requires a higher layer of abstraction to solve.

The future work is to extend the event-driven approach to distributed and hybrid (GPGPU) systems and to see whether it is possible to cover those cases. The other possible direction of research is to create a declarative language which acts as a higher layer of abstraction and performs decomposition into tasks automatically.

\section*{Acknowledgments}

Research was carried out using computational resources provided by the Resource Center ``Computer Center of SPbU''\footnote{Official web site: \url{http://cc.spbu.ru}.} and supported by the Russian Foundation for Basic Research (project N~13-07-00747) and Saint Petersburg State University (projects N~9.38.674.2013, 0.37.155.2014).

\bibliography{balance}
\bibliographystyle{plain}

\end{document}