Towards performance control on the Grid

K. R. Mayes, M. Luján, G. D. Riley, J. Chin, P. V. Coveney, J. R. Gurd

Abstract

Advances in computational Grid technologies are enabling the development of simulations of complex biological and physical systems. Such simulations can be assembled from separate components—separately deployable computation units of well-defined functionality. Such an assemblage can represent an application composed of interacting simulations or might comprise multiple instances of a simulation executing together, each running with different simulation parameters. However, such assemblages need the ability to cope with heterogeneous and dynamically changing execution environments, particularly where such changes can affect performance.

This paper describes the design and implementation of a prototype performance control system (PerCo), which is capable of monitoring the progress of simulations and redeploying them so as to optimize performance. The ability to control performance by redeployment is demonstrated using an assemblage of lattice Boltzmann simulations running with and without control policies. The cost of using PerCo is evaluated and it is shown that PerCo is able to reduce overall execution time.

1. Introduction

This paper is ultimately concerned with the performance control of scientific simulations executing on a computational Grid. It is possible to construct complex applications by composing previously independent or pre-existing units together. For example, a detailed molecular dynamics simulation can be coupled to a general computational fluid dynamics simulation, allowing closer examination of areas of particular interest within the overall simulation. The software engineering challenge for this multi-model coupling approach is to couple previously independent models together as components, facilitating this process by minimizing the changes required to the existing code (Ford et al. in press).

In order to achieve optimal performance of such an assemblage of components, the behaviour of each component must be monitored in the context of overall assemblage performance. Rate-limiting components can then be given more of the resources available from the Grid. However, to allow this degree of control, the assembled application should be composed of loosely coupled components which are redeployable and malleable. A redeployable component is one that can be interrupted on one platform and restarted on another from the point of interruption. A malleable component is one whose configuration can be changed: for example, it can be halted on some number of processors and restarted seamlessly on a different number of processors, or it can change an algorithm being used for some task. Loose coupling between components facilitates reconnection of redeployed components. Such a loosely coupled assemblage of redeployable and malleable components must be underpinned by a redeployment infrastructure capable of reconfiguring the components over the available resources.

Crucially, a performance control system is required in order to monitor performance and detect opportunities for optimizing deployment. The ability to control performance requires the ability to predict if there is some new deployment of components that will give better performance than the current deployment. The performance control system can then utilise this information, taking into account the costs and benefits of redeployment and an estimate of the reliability of the prediction, and make a redeployment decision. When appropriate, the redeployment and reconfiguration is carried out by the redeployment infrastructure. PerCo (Mayes et al. 2004; Armstrong et al. 2005) is a prototype of such a system.

An application consisting of redeployable and malleable components was not available during the development of the PerCo prototype. Instead, a multiple-component application was emulated by running several instances of the same monolithic simulation code, LB3D (Nekovee & Coveney 2001; Nekovee et al. 2001). The LB3D code is both redeployable and malleable; PerCo can monitor the performance of, and redeploy, concurrently running instances of LB3D as if the instances were components. Such a use-case is not entirely spurious. Scientific simulations are executable models of natural phenomena, used to explore the effects of varying experimental parameters. Grid resources can thus be used for the concurrent execution of multiple instances of the same monolithic simulation, where each instance executes with different parameter values (generally termed 'parameter studies').

This paper describes experiments with an assemblage of LB3D simulation instances in the context of a performance control system and associated redeployment architecture. The ability to control performance is demonstrated.

2. Computational science motivation

A significant number of problems in modern condensed matter physics could be described as ‘mesoscale’ or ‘mesoscopic’ problems, which operate at length and time-scales intermediate between the microscopic level, at which discrete atoms or molecules are considered, and the macroscopic level, at which matter can be regarded as a continuum quantity. ‘Complex fluids’ are a particularly interesting class of mesoscopic condensed matter systems, in which both flow (a continuum effect resulting from the correlated motion of many millions of particles) and microscopic interactions between particles are important. Examples of complex fluids include liquid crystals, colloids, biological fluids, such as blood or milk, and many kinds of polymers and novel materials. Complex fluids are of great importance because of their ubiquity, but they also present significant intellectual challenges because of their intrinsically mesoscopic nature.

A variety of computational techniques for modelling complex fluids has emerged in recent years in order to tackle these challenges. Traditional atomistic methods, such as molecular dynamics, are too computationally expensive for mesoscale problems, and traditional continuum computational fluid dynamics (CFD) techniques do not take a sufficiently detailed account of interparticle interactions; hence, novel techniques such as dissipative particle dynamics (Español & Warren 1995; Jury et al. 1999; Flekkøy et al. 2000), lattice gas cellular automata (Rivet & Boon 2001), Malevanets & Kapral's stochastic rotation dynamics (Malevanets & Kapral 1998; Hashimoto et al. 2000; Sakai et al. 2000) and the lattice Boltzmann equation (Succi 2001; Love et al. 2003) have emerged. The computational science work in this article will focus on work with an implementation of the lattice Boltzmann algorithm, but the techniques described generalize to other mesoscale modelling algorithms and many other computational physics tasks in general.

(a) Lattice Boltzmann models of complex fluids

The lattice Boltzmann technique discretizes a fluid on to a lattice containing points $\mathbf{x}$, each joined to its $i$th nearest neighbour by a vector $\mathbf{c}_i$. The fluid is described by the single-particle distribution function $n_i^\sigma(\mathbf{x}, t)$, representing the density of particles of species $\sigma$ occupying lattice site $\mathbf{x}$ and travelling with discrete velocity $\mathbf{c}_i$ at discrete time-step $t$. The density of particles of species $\sigma$ is therefore $\rho^\sigma(\mathbf{x}, t) = \sum_i n_i^\sigma(\mathbf{x}, t)$ and the fluid momentum is $\rho \mathbf{u} = \sum_{\sigma, i} n_i^\sigma(\mathbf{x}, t)\, \mathbf{c}_i$. The fluid is updated through application of the lattice Boltzmann equation
$$ n_i^\sigma(\mathbf{x} + \mathbf{c}_i, t + 1) = n_i^\sigma(\mathbf{x}, t) - \frac{1}{\tau^\sigma}\left[ n_i^\sigma(\mathbf{x}, t) - n_i^{\sigma,\mathrm{eq}}(\mathbf{x}, t) \right], \qquad (2.1) $$
where $n_i^{\sigma,\mathrm{eq}}$ represents the equilibrium particle distribution, which takes account of interparticle interactions through, for example, imposition of a free energy functional (Swift et al. 1995; Lamura et al. 1999) or through the use of phenomenological forcing terms (Shan & Chen 1993). The LB3D program uses a lattice Boltzmann algorithm which has been extended (Chen et al. 2000) to cover fluid mixtures which contain surfactant molecules.
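To make the update rule concrete, the following sketch implements a single-species BGK relaxation and streaming step on a two-dimensional lattice. It is purely illustrative and is not the LB3D implementation, which is a three-dimensional, multi-species Fortran/MPI code with amphiphile coupling; the array names and the D2Q9 stencil are assumptions made for brevity.

    import numpy as np

    # Minimal single-species lattice Boltzmann (BGK) step on a D2Q9 lattice.
    # Illustrative only: LB3D itself is three-dimensional and multi-species.
    c = np.array([[0,0],[1,0],[0,1],[-1,0],[0,-1],[1,1],[-1,1],[-1,-1],[1,-1]])
    w = np.array([4/9] + [1/9]*4 + [1/36]*4)
    tau = 1.0  # relaxation time

    def step(n):
        """One time-step of n[i, x, y]: collide towards equilibrium, then stream."""
        rho = n.sum(axis=0)                               # density per site
        u = np.einsum('id,ixy->dxy', c, n) / rho          # velocity per site
        cu = np.einsum('id,dxy->ixy', c, u)
        usq = (u**2).sum(axis=0)
        n_eq = w[:, None, None] * rho * (1 + 3*cu + 4.5*cu**2 - 1.5*usq)
        n_post = n - (n - n_eq) / tau                     # BGK collision
        # Streaming: each population moves one lattice vector per time-step,
        # so the update of a site depends only on its nearest neighbours.
        for i, ci in enumerate(c):
            n_post[i] = np.roll(n_post[i], shift=tuple(ci), axis=(0, 1))
        return n_post

The purely local streaming step in this sketch is what underlies the scalability discussed below.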

LB3D has been used to examine the dynamics of porous media flow (Harting et al. 2005), hydrodynamic spinodal decomposition (González-Segredo et al. 2003) and the spontaneous assembly of lamellar and bicontinuous cubic surfactant mesophases (Nekovee & Coveney 2001; González-Segredo & Coveney 2004a,b).

In addition to being well-suited to complex fluid modelling, the lattice Boltzmann method has several desirable properties from the point of view of performance. In particular, the interaction scheme is entirely local: the updated state of each point on the lattice depends only on its current state and the states of its nearest neighbours. This locality permits fast and very scalable implementation on distributed-memory parallel processing architectures (Kandhai et al. 1998; Love et al. 2003).

LB3D has been deployed as a component across computational Grids, a technique which allowed the world's largest fluid dynamics simulations of their kind to be performed (Pickles et al. 2004). In this scenario, many simultaneous LB3D calculations can be launched, running on arbitrary pre-configured machines on the Grid. These simulations may then be steered and the large quantities of output data streamed to separate rendering tasks on remote machines that produce human-viewable images.

(b) Performance steering

LB3D stands to benefit from performance steering in a number of scenarios. Firstly, many simulations need to run for fairly long periods of time; if sources of faster or cheaper computing power become available due to, for example, other jobs terminating, then the lattice Boltzmann jobs could be migrated to take advantage of this.

Many of the systems which are simulated by LB3D have extremely complicated phase diagrams, whose structure is the subject of considerable debate and investigation (Seddon & Templer 1995). Traditionally, the parameter space of these systems has been explored by firing off a ‘task farm’ of many simulations corresponding to different points on the phase diagram. Techniques such as computational steering (Chin et al. 2003) have, to an extent, reduced the need for task farms; however, in situations where simulations update too slowly to be computationally steered in a useful timeframe, or when performing an initial set of simulations as starting points for computational steering, task farms may still be used.

When the parameter space of a system is understood, one may wish to calculate certain functional relations: for example, how the surface tension at the interface between two fluids varies with a parameter controlling the interactions between fluid particles. Such a relation can only be determined by spawning at least one simulation for each value of the parameter to be examined and then running each simulation until the fluid interface has equilibrated, so that the surface tension can be calculated.

All of these scenarios involve several lattice Boltzmann jobs of similar size running simultaneously: the faster these jobs can run, the quicker the scientific investigation can move on to the analysis stage. The ability to optimize the performance of such jobs by dynamic migration therefore stands to be of substantial benefit.

3. The PerCo design

The PerCo system is being developed in the context of the RealityGrid project. PerCo is designed to be a lightweight, application-specific performance control system: it is concerned with the performance of an individual application on a set of resources that have been allocated by some external resource scheduler.

The basic structure of an assemblage of components with PerCo is shown in figure 1. Each component executes in the context of a PerCo loader. The set of loaders together comprise the redeployment infrastructure. Each loader is responsible for launching and moving the components. Each component may have two interfaces. A component-coupling interface may be present to implement data exchange functions for inter-component communication. A performance control interface is used to communicate with a component performance steerer (CPS). This interface is used to exchange performance information and performance commands. A CPS is responsible for local control of its associated component. The entity which takes overall control of the assemblage is the application performance steerer (APS). The APS receives performance data from the CPSs and contains a performance repository to hold data about performance history. The APS can invoke a performance predictor to determine improved component configurations.

Figure 1. Basic design of the PerCo system. Arrows indicate message exchange.

This system adapts to changes in the performance of the application. No user interaction is required; redeployment decisions are based not on human judgment or expert knowledge, but on the policy that has been built into the APS. This policy may be based on the estimations of a performance predictor or, in the simple case presented here, on more static rules.
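The control relationships of figure 1 can be sketched as follows. The class and message names (report_timestep, continue, migrate) are illustrative assumptions rather than the actual PerCo interfaces, and are intended only to show how performance data flow up to the APS and commands flow back down.

    # Illustrative sketch of the control loop in figure 1 (names assumed).
    CONTINUE, MIGRATE = "continue", "migrate"

    class PerformanceRepository:
        def __init__(self):
            self.history = {}                    # component id -> [(step, duration)]

        def record(self, cid, step, duration):
            self.history.setdefault(cid, []).append((step, duration))

    class APS:
        """Application performance steerer: holds the history and the policy."""
        def __init__(self, policy):
            self.repository = PerformanceRepository()
            self.policy = policy                 # built-in rules or a predictor

        def on_report(self, cid, step, duration):
            self.repository.record(cid, step, duration)
            return self.policy(cid, self.repository)   # CONTINUE or a migrate command

    class CPS:
        """Component performance steerer: local agent for one component."""
        def __init__(self, cid, aps):
            self.cid, self.aps = cid, aps

        def report_timestep(self, step, duration):
            # Forward performance data to the APS and return its command.
            return self.aps.on_report(self.cid, step, duration)

    # In the component's main loop (in LB3D, a call inserted into the Fortran
    # code at the end of each time-step), the pattern would be:
    #     command = cps.report_timestep(step, duration)
    #     if command is a migrate command: write check-point files and exit.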

4. The LB3D application

LB3D is a Fortran message passing interface (MPI) implementation of a ‘bottom up’ three-dimensional lattice Boltzmann method for the non-equilibrium and hydrodynamic simulation of binary immiscible, and binary and ternary amphiphilic, fluids (Nekovee & Coveney 2001; Nekovee et al. 2001). The LB3D application simulates the physical phenomenon by progressing through a series of time-steps. From the viewpoint of PerCo, each LB3D simulation instance is in a consistent state between each time-step and can be said to be at a safepoint. That is, the LB3D instances can be safely redeployed. A call into the PerCo system is inserted at the end of the main loop of the LB3D source code, after each time-step has completed. If required, the PerCo APS can command the CPSs to redeploy their LB3D instances at these points.

LB3D is redeployable. LB3D provides an internal routine for writing the computational data to check-point files. This routine can be invoked when a redeployment is required, at which point each MPI process generates its own check-point file. Each LB3D simulation has an associated input-file which contains values used to initialize LB3D variables at start-up. There are two critical values here: a flag indicating a restore from check-point files and the time-step at which execution is to recommence. By updating these in the input-file of an LB3D instance, PerCo ensures that the instance continues execution when relaunched on a new platform. Check-point files are written in the external data representation (XDR) format, so that they can be used on any architecture.
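By way of illustration, the input-file update applied before relaunching might resemble the following sketch. The key names restore_from_checkpoint and start_timestep are hypothetical placeholders; the actual LB3D input-file format is not reproduced here.

    # Hypothetical sketch of how a loader might rewrite an LB3D input-file
    # before relaunching a migrated instance (key names are placeholders).
    def prepare_restart(input_path, checkpoint_timestep):
        updated = []
        with open(input_path) as f:
            for line in f:
                key = line.split("=")[0].strip()
                if key == "restore_from_checkpoint":
                    line = "restore_from_checkpoint = .true.\n"
                elif key == "start_timestep":
                    line = f"start_timestep = {checkpoint_timestep}\n"
                updated.append(line)
        with open(input_path, "w") as f:
            f.writelines(updated)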

LB3D is malleable. When LB3D MPI processes write check-point files, extra data is written giving information about the MPI configuration. On restoration, the LB3D executable starts up, reads its input-file and then reads the MPI configuration data from the check-point files. Depending on the change in configuration, the check-point files are read and the data redistributed as necessary.
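The redistribution can be sketched for a simple one-dimensional slab decomposition as follows. The actual LB3D decomposition and check-point layout may differ; this only illustrates how a restart on a different number of processes can map old per-process check-point files on to new ranks.

    # Illustrative sketch: redistributing a 1D slab decomposition when the
    # number of MPI processes changes between check-point and restart.
    def slab_bounds(rank, nprocs, nz):
        # Contiguous block decomposition of nz planes over nprocs ranks.
        base, rem = divmod(nz, nprocs)
        lo = rank * base + min(rank, rem)
        hi = lo + base + (1 if rank < rem else 0)
        return lo, hi

    def planes_to_read(new_rank, new_nprocs, old_nprocs, nz):
        """For each old check-point file, which global planes does new_rank need?"""
        lo, hi = slab_bounds(new_rank, new_nprocs, nz)
        wanted = {}
        for old_rank in range(old_nprocs):
            olo, ohi = slab_bounds(old_rank, old_nprocs, nz)
            overlap = (max(lo, olo), min(hi, ohi))
            if overlap[0] < overlap[1]:
                wanted[old_rank] = overlap
        return wanted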

5. The PerCo prototype

This section describes the basic mechanisms used to implement the design shown in figure 1. It is useful to give a general operational overview of PerCo. The first operation is to launch the PerCo loaders and APS on selected platforms using a bootstrap mechanism. The PerCo loaders start to execute and establish network connections with each other. Next, each PerCo loader must launch a component and associated CPS from a local executable file. When the components start executing, each component calls out to its CPS and all the CPSs connect to and communicate with the APS, informing the APS that the components are starting up. Subsequently the CPSs communicate with the APS at relevant progress points during component execution. At some point determined by APS policy, the APS may instruct the CPS on, say, loader1 to migrate its component to loader2. The CPS causes its component to generate a check-point file. The component process then dies, and loader1 transfers the check-point file, appropriately modified initial data and other information to loader2. This remote loader2 then starts up its local component executable, which reads the transferred state and continues the redeployed execution from the point of interruption. The following paragraphs give a more detailed description of the mechanisms used to implement the prototype in which LB3D instances are treated as ‘components’.

(a) External resource scheduler

The PerCo prototype uses Globus facilities: the external resource scheduler is a Globus resource specification language (RSL) script which is interpreted by the globusrun command. This RSL script specifies which executables are to be run on which remote machines. In the case of PerCo, the executables specified are the PerCo component loaders and the APS. Once booted, the PerCo system and the component-based assemblage it is controlling can run independently of Globus facilities.
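Such a bootstrap script might look roughly like the following multi-request RSL fragment; the host names and paths are placeholders, and the actual script used by PerCo is not reproduced here.

    +
    ( &(resourceManagerContact="origin.example.ac.uk/jobmanager")
       (executable=/home/perco/bin/perco_aps)
       (count=1)
    )
    ( &(resourceManagerContact="origin.example.ac.uk/jobmanager")
       (executable=/home/perco/bin/perco_loader)
       (count=1)
    )
    ( &(resourceManagerContact="sun.example.ac.uk/jobmanager")
       (executable=/home/perco/bin/perco_loader)
       (count=1)
    )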

(b) Component code and component interface

In the experiment described in §6, the Grid ‘application’ is actually an assemblage of components, each of which is a monolithic LB3D instance. Each LB3D instance runs independently of the others, with no direct interaction between them. The LB3D Fortran source code is instrumented with calls to PerCo and is transformed into a PerCo component by wrapping it in interface code. This interface wrapper passes outgoing instrumented calls from the LB3D code to the CPS and interprets incoming commands from the CPS to the LB3D. In the prototype PerCo system, the LB3D, interface and the CPS codes are linked together into a single binary and execute in the context of a single process. Calls between one LB3D instance and its component interface, and between its component interface and its CPS, are conventional procedure calls. When commanded by the CPS to exit for migration purposes, the interface wrapper invokes the LB3D check-pointing routine. Note that the basic design (figure 1) allows components to exchange data between their component coupling interfaces. However, this facility is not required in parameter studies (see §1) and is not used in the experiment.

(c) Component performance steerer

A CPS is associated with a component, in this case with an LB3D instance and its component interface. As already noted, the CPS runs in the same process as the LB3D. Although the design allows the CPS to apply component-specific performance control, in the prototype the CPS acts in two main ways. Firstly, it acts as a conduit for passing information and commands between the component interface and the APS. Secondly, it interacts with the PerCo loader hosting the LB3D instance, taking information from the loader at component startup and passing it to the loader at component termination. Interactions between APS and CPS are expressed as procedure calls which are implemented as message passing stub routines. The actual communications mechanism is hidden at a lower level and can use either globus_io or the socket system calls. In the prototype, message exchanges from CPS to APS are blocking—a choice taken to reduce initial implementation complexity. A non-blocking implementation is to be investigated in the near future. However, as will be described below, the structure of the APS as two processes minimizes the time taken by the APS in replying to the CPS calls. Information must be passed between the CPS and PerCo loader. In the prototype, the mechanism used is a data structure in a memory region shared between the CPS and loader processes. For example, when an LB3D instance eventually exits, the PerCo loader needs to be directed to its next action. This command is supplied by the CPS.

(d) Application performance steerer

In the refinement of the APS design there are three parts: a communicator, a performance history repository and a performance monitor (figure 2). In the prototype, the communicator and monitor are each implemented as a separate process. The repository is implemented using the Berkeley DB library, with a repository cache provided for fast look-up of the most recent records. The APS communicator receives performance information from the CPSs, writes these data to the performance repository and sends commands (normally to continue) to the CPSs. The APS monitor continuously checks for performance imbalance, using information read from the performance repository cache. The numbered steps below describe the operation of the monitor, with reference to figure 2.

Figure 2. Structure of the APS. The deployment generator can rely on a performance predictor or, as in the present case, some in-built redeployment policy.

  1. The detector checks for problems in the performance of the assemblage of components.

  2. When a problem is detected, the deployment generator facility is invoked to generate a better deployment of components.

  3. In the full system, this deployment generator will be based on performance prediction and will use performance history data.

  4. The improved deployment is then output and made available to the APS communicator, which then commands the CPSs to redeploy their components (to migrate).

This redeployment activity uses the PerCo loader infrastructure. Note that generating deployments adds negligible delay to communications between CPS and APS: the APS communicator never waits for the deployment generator, since at any instant either a new deployment is already available to the communicator or it is not.

(e) Component loader

The PerCo loaders are responsible for launching and migrating the components, or, in this case, LB3D instances. At run-time, the loader consists of two core processes. The loader-communicator process is responsible for communicating with other loaders, for example, during component migration. The application launcher process is responsible for launching the LB3D code, awaiting its completion and interpreting the next action to be taken by the loader. Additionally, the need to transfer input data and check-point files between machines requires either gridFTP or PerCo socket file transfer processes. When the loader-communicator has established connections with its peers, the application launcher launches the component executable (this executable contains the LB3D code with the component interface wrapper and the CPS) via mpirun. After launching mpirun, the application launcher blocks until the component processes exit. When the component has exited, the application launcher must determine its next action. It does this by examining the command written by the CPS into the memory region shared between the CPS and the loader. The trigger for the redeployment of a component is a migrate command left in the shared memory by an exiting CPS. The application launcher notes this and instigates two actions: (i) transfer the necessary files to the new host and (ii) send a migrate message to the loader on the new host. This is achieved by the loader-communicator. Incoming messages from remote loaders are processed by the loader-communicator and placed on a queue until the application launcher becomes free.
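The application launcher's cycle can be sketched as follows. The helper names (transfer_files, send_message, next_message) and the shared-memory keys are assumptions for illustration only; in the real loader, peer communication is handled by the separate loader-communicator process and file transfer may go via gridFTP rather than plain sockets.

    import subprocess

    # Illustrative sketch of the application launcher loop (names assumed).
    # The launcher starts the component under mpirun, blocks until it exits,
    # then reads the command left by the CPS in shared memory.
    def launcher_loop(loader, shared):
        while True:
            # Launch LB3D + interface wrapper + CPS and block until it exits.
            subprocess.run(["mpirun", "-np", str(shared["nprocs"])]
                           + shared["component_cmd"], check=False)
            command = shared["command"]               # written by the CPS on exit
            if command == "exit":
                return                                # component finished normally
            if command == "migrate":
                dest = shared["destination"]
                files = shared["checkpoint_files"] + [shared["input_file"]]
                loader.transfer_files(dest, files)    # gridFTP or socket transfer
                loader.send_message(dest, "migrate")  # remote loader takes over
            # Wait for the loader-communicator to deliver the next queued
            # message, e.g. an incoming migrate asking this loader to host
            # a redeployed component.
            msg = loader.next_message()
            if msg.kind != "migrate":
                return
            shared.update(msg.restart_info)           # new input-file, check-points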

6. Performance control experiment

Figures 3 and 4 present typical results that demonstrate the ability of PerCo to influence the behaviour of four instances of LB3D executing 501 time-steps on three machines. Throughput was measured periodically as the cumulative time-steps per second achieved by each instance. In both experiments, the instances made a call into the PerCo system, including the APS, at each time-step. All LB3D instances used two MPI processes and a 32×32×32 data matrix. At the start of the experiments, instances 1 and 4 executed on the same 16-processor (400 MHz) SGI Origin, instance 2 executed on a 4-processor (444 MHz) Sun machine and instance 3 executed on a 2-processor (870 MHz) Linux/P3 machine. Independently of PerCo, all instances wrote to stdout at each time-step and periodically wrote a check-point. The Sun and Linux/P3 machines were on a 100 Mb s⁻¹ LAN with a 1 Gb interconnect to the SGI Origin LAN. The APS ran on the SGI machine. All timings were made by the APS, using a 1 ms resolution timer. All communications between CPSs and APS, between PerCo loaders, and all transfers of LB3D check-point and input files were over TCP sockets. In other situations, where authentication is an issue, globus_io and gridFTP could be used.

Figure 3. The execution rates of four LB3D instances without redeployment by PerCo.

Figure 4. The effect on execution rates of four LB3D instances with redeployments by PerCo.

Figure 3 shows the results of the experiment where the LB3D instances were allowed to run freely with no redeployments (this is the ‘control’ experiment but the term is avoided to prevent ambiguity). In figure 4, the policy implemented by the APS was to redeploy the LB3D instances between machines so that the progress of all instances was similar. In this experiment, the APS policy redeployed the instances between machines on six occasions.

In the redeployment experiment of figure 4, the APS policy did not utilise performance prediction. The APS monitor process detected a ‘problem’ simply by checking the difference between time-steps of the LB3D instances, as read from the performance repository cache. When this difference reached a threshold, a new deployment was generated on the basis of the known relative performance for LB3D on the three platforms. The APS communicator then sent migrate commands to the CPSs. The APS policy did not consider redeployments between the Sun and Linux/P3 machines. As can be seen from the figure, even this simplistic control criterion achieved similar execution rates in the four instances. In the no-redeployment experiment of figure 3, the APS ignored the performance differences and simply sent continue commands.
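This threshold rule can be expressed roughly as the following sketch. The threshold value and the machine speed ranking are illustrative assumptions, not values taken from the APS; as noted above, the real policy also excluded redeployments between the Sun and Linux/P3 machines.

    # Sketch of a threshold-based redeployment rule of the kind used here.
    # Threshold and speed ranking are assumptions for illustration.
    MACHINE_SPEED = {"sgi": 3, "linux_p3": 2, "sun": 1}   # relative LB3D speed
    THRESHOLD = 25                                        # time-steps of imbalance

    def detect_and_redeploy(progress, deployment):
        """progress: instance -> last completed time-step;
           deployment: instance -> machine name.
           Returns a new deployment mapping, or None to continue as before."""
        leader = max(progress, key=progress.get)
        laggard = min(progress, key=progress.get)
        if progress[leader] - progress[laggard] < THRESHOLD:
            return None                        # balanced enough: just continue
        if MACHINE_SPEED[deployment[laggard]] >= MACHINE_SPEED[deployment[leader]]:
            return None                        # laggard already on the faster host
        new_deployment = dict(deployment)      # swap the two placements
        new_deployment[laggard] = deployment[leader]
        new_deployment[leader] = deployment[laggard]
        return new_deployment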

The overheads of migration were relatively high due to a number of factors, including time for data transfer and reading and writing files (two check-point files of 7 929 856 bytes and an input-file of 1995 bytes), synchronization at source and destination loaders, restart of execution, mpirun finalization and initialization and reconnection to the APS at startup. The extent of these overheads can be indicated by comparisons of raw time-step durations. Time-step durations for each LB3D instance were collected by the APS, as the time measured between time-step n and time-step n+1 messages. After redeployment, this time would include the latency due to the migration and a time-step on the destination machine. In the experiment presented in figure 4, the mean resident time-step durations were 1.34 s (SGI), 2.21 s (Linux/P3) and 2.67 s (Sun). The mean migration time-step durations were 14.38 s (SGI to Sun), 10.26 s (SGI to Linux/P3), 10.76 s (Sun to SGI) and 7.32 s (Linux/P3 to SGI).

However, despite these migration overheads, figures 3 and 4 show that the four LB3D instances with redeployment finish earlier than those without redeployment. The base cost of interaction with the PerCo system in these experiments was negligible compared with time-step duration: the round-trip cost of a call from the component interface to the CPS and APS averaged less than 8 ms in the no-redeployment experiment of figure 3. Obviously, latencies on a WAN Grid would be higher, and future work will address the current synchronous implementation of communications between CPS and APS.

7. Current work: performance prediction

The redeployment policy used for this demonstration experiment was effective, yet simple and inflexible. In order to provide sophisticated control in general Grid environments, the APS deployment generator can be based on a performance prediction capability. Current work on the PerCo prototype involves adding performance prediction. Classically, work on performance prediction has concentrated on providing estimates of execution times for well-behaved applications on stable computing resources. For Grid and cluster environments, performance prediction techniques need to address more complex applications on non-dedicated resources, and can use performance history data in order to make a prediction. Such techniques apply mathematical or statistical methods to the history data in order to generate predictions. Time-series techniques from the literature generate a prediction by applying a mathematical formula directly to the history data. Data fitting and statistical techniques, by contrast, are applied to the history data to obtain a formula in several variables; evaluating this formula with actual values (e.g. number of processors, application data size) then generates a prediction.

The performance prediction to be used in PerCo combines both time-series and data fitting techniques. Time-series are used to predict the performance of the next time-step given the current deployment. The data fitting techniques are used to predict the performance of the next time-step given some new deployment. Available data fitting and time-series techniques are grouped into two families, sorted in increasing order of accuracy (and compute-time). At run-time, two adjacent techniques from either the data fitting or time-series family are used simultaneously to produce two predictions. The more accurate of the two predictions is used directly. However, the difference between the two predictions provides an estimate of the prediction accuracy. This gives the ability to determine the required level of accuracy and, therefore, the cost of prediction. That is, when the difference between the predicted and actual execution time of a time-step is larger than the estimated accuracy, then greater accuracy is required. The two techniques selected for prediction are then replaced by their more accurate neighbours in the same family. On the other hand, when the difference between the least accurate prediction and the actual execution time is smaller than the estimated accuracy, then a less accurate technique can be used. Finally, when the time taken to generate the prediction is greater than the actual execution time of the time-step, less accurate prediction techniques are required.
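One possible structure for this adaptive selection is sketched below, under the assumption that each family is a list of predictor objects ordered from least to most accurate, each exposing a predict method; the interface and names are illustrative, not the PerCo implementation.

    # Illustrative sketch of adaptive prediction-accuracy selection. The
    # accuracy estimate is the gap between two adjacent predictions, as
    # described in the text; all names are assumptions.
    class AdaptivePredictor:
        def __init__(self, family, level=0):
            self.family = family   # predictors sorted by increasing accuracy/cost
            self.level = level     # index of the less accurate of the pair in use

        def predict(self, history):
            lo = self.family[self.level].predict(history)
            hi = self.family[self.level + 1].predict(history)
            self.estimated_accuracy = abs(hi - lo)
            self.last_low, self.last_high = lo, hi
            return hi              # use the more accurate of the two predictions

        def update(self, actual, prediction_cost, step_cost):
            # Call after the predicted time-step has actually executed.
            error_high = abs(self.last_high - actual)
            error_low = abs(self.last_low - actual)
            if prediction_cost > step_cost or error_low < self.estimated_accuracy:
                # Prediction too expensive, or even the cheap technique is
                # good enough: step down to less accurate (cheaper) neighbours.
                self.level = max(self.level - 1, 0)
            elif error_high > self.estimated_accuracy:
                # Actual error exceeds the estimated accuracy: step up to the
                # more accurate neighbours in the same family.
                self.level = min(self.level + 1, len(self.family) - 2)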

8. Related work

Process migration can be used to achieve some desired overall system behaviour (such as load balancing) using operating system-level mechanisms (Douglis & Ousterhout 1991) or a check-pointing library (Litzkow et al. 1997). In contrast, many systems which utilise migration emphasize individual application behaviour. PerCo is concerned only with the run-time control of a single component-based application or, in the present work, a single assemblage of coupled LB3D instances. Similarly, many existing systems which support adaptation provide application-specific agents or managers to monitor and control individual applications (Furmento et al. 2001; Kennedy et al. 2002; Berman et al. 2003), though these systems tend to be more heavyweight than PerCo. While Vadhiyar & Dongarra (2003) provided a general check-pointing library for migration and reconfiguration of MPI applications, PerCo, like the system of Huedo et al. (2004), relies on the application to provide a check-pointing routine.

The basis of run-time control of performance is closed loop performance steering (Reed et al. 1996). This involves installing sensors in the application source code and providing actuators to produce the desired adaptations (Ribler et al. 2001). The sensors report performance data to the performance control system, which then enables the actuators to alter the application performance. For example, the actuator may tune the chosen implementation of an application or library (Tapus et al. 2002). Although not yet investigated in the prototype, the major design role of the PerCo CPS is to tune the performance of the local component.

Acknowledgments

This work has been supported by the RealityGrid project (EPSRC GR/R67699) and the Market for Computational Services project (DTI THBB/C/008/00020).

References
