## Abstract

Models of cardiac electrophysiology consist of a system of partial differential equations (PDEs) coupled with a system of ordinary differential equations representing cell membrane dynamics. Current software to solve such models does not provide the required computational speed for practical applications. One reason for this is that little use is made of recent developments in adaptive numerical algorithms for solving systems of PDEs. Studies have suggested that a speedup of up to two orders of magnitude is possible by using adaptive methods. The challenge lies in the efficient implementation of adaptive algorithms on massively parallel computers. The finite-element (FE) method is often used in heart simulators as it can encapsulate the complex geometry and small-scale details of the human heart. An alternative is the spectral element (SE) method, a high-order technique that provides the flexibility and accuracy of FE, but with a reduced number of degrees of freedom. The feasibility of implementing a parallel SE algorithm based on fully unstructured all-hexahedra meshes is discussed. A major computational task is solution of the large algebraic system resulting from FE or SE discretization. Choice of linear solver and preconditioner has a substantial effect on efficiency. A fully parallel implementation based on dynamic partitioning that accounts for load balance, communication and data movement costs is required. Each of these methods must be implemented on next-generation supercomputers in order to realize the necessary speedup. The problems that this may cause, and some of the techniques that are beginning to be developed to overcome these issues, are described.

## 1. Introduction

The bidomain and monodomain equations (Plonsey 1988; Sepulveda *et al*. 1989; Keener & Sneyd 1998) have been widely used for many years to model the propagation of electrical waves across cardiac tissue. The numerical solution of the bidomain equations is usually calculated using standard finite-element (FE), finite-volume (FV) or finite-difference (FD) methods. However, as the discretization of a whole heart with an average nodal spacing of 0.2 mm generates a mesh with many millions of nodes, the linear systems resulting from FD or FE methods are very large. Whole-heart simulation using the bidomain model is therefore a demanding scientific computing problem.

Several advanced problem-solving environments for cardiac simulation already exist (Vigmond *et al*. 2003; Watanabe *et al*. 2004; Pitt-Francis *et al*. 2008; http://www.cmiss.org; http://www.continuity.ucsd.edu; http://www1.pacific.edu/jeason), but none of these currently provide the required computational speed to allow researchers in, for example, the pharmaceutical industry to run simulations quickly enough to make them viable as an everyday research tool. For example, using CARP, 6.4 hours of CPU time is required on a 64-processor machine to run a 200 ms simulation of the bidomain model on a rabbit ventricular mesh containing 862 515 grid points using a modified Beeler–Reuter ionic model (Beeler & Reuter 1977; Plank *et al*. 2007). Extending this to the approximately 30 million grid points that would be required to compute an average (human-sized) heart (250 cm^{3}) at 0.2 mm spatial resolution (Hunter & Borg 2003), it would take more than six weeks to run a 1 s simulation on 64 processors (using an over-simplistic cellular model), even assuming, over-optimistically, linear scaling of the computational time with the number of mesh nodes. Furthermore, it may be the case that an even greater spatial resolution may be required at some critical locations (in areas of complex geometry or high heterogeneity). Clearly, this is not practicable for any real application, which would be likely to require simulations over much longer time scales: minutes, not seconds.

There are a number of projects currently underway around the world to develop petascale-class computer systems with tens or hundreds of thousands of CPU cores. At least some of the required performance improvement to make bidomain simulation more tractable could be realized by running heart simulators on such massively parallel systems. However, the required speedup factor is so large that simply porting existing codes to even the most powerful supercomputers would not be sufficient, even assuming linear scaling. (The scaling is, in fact, almost certain to be well below linear.) So, much of the necessary speedup will need to be found from improved numerical algorithms rather than simply implementing existing methods efficiently.

In this paper, we describe the current procedure for numerical modelling of the bidomain equations (§2) and review recent developments in adaptive numerical algorithms, the development of spectral element (SE) methods as a high-performance alternative to FE, and state-of-the-art parallel linear solvers for large-scale algebraic systems (§3). Furthermore, we discuss the problems that are certain to be encountered in designing simulators that exploit these algorithms and are intended to be run on massively parallel computers (§4). Some preliminary ideas for overcoming these issues and progressing towards a petascale heart simulation code are also introduced.

## 2. Numerical modelling of cardiac electrophysiology

The propagation of a cardiac action potential is a multi-scale phenomenon, with the electrical activity through ion channels in individual, but electrically connected, cells driving a wavefront over the whole heart. This wave can propagate between neighbouring cells via electrically active gap junctions. However, the cell membrane is also electrically active as it contains ion channels and pumps that selectively allow ions to cross it. Hence, a potential gradient is also seen in the extracellular space and the cardiac action potential can propagate in this also. In order to simulate this behaviour numerically, it is necessary to derive a system of governing equations, discretize these into an appropriate form and then solve the discretized equations. The process is outlined in this section.

### (a) Governing equations

It is usual to model cardiac tissue as a two-phase ohmic medium (one phase representing the intracellular space and the other the extracellular space) with the two phases linked by a network of resistors (representing the ion channels) and capacitors (representing a capacitative current driven across the membranes by the potential difference). On taking this approach, the bidomain equations can be derived (Keener & Sneyd 1998) as(2.1)(2.2)(2.3)plus appropriate boundary conditions, where *V*_{m}(** x**,

*t*) is the transmembrane potential;

*ϕ*

_{e}(

**,**

*x**t*) is the extracellular potential;

**(**

*u**t*) is a vector of dependent variables;

*I*

_{ion}is the total ionic current across the cell membrane per unit area;

*Χ*is the cell surface to volume ratio;

*C*

_{m}is the membrane capacitance per unit area;

*σ*

_{i}and

*σ*

_{e}are the intracellular and extracellular conductivity tensors, respectively;

*I*

_{e}is an extracellular stimulus current (e.g. to model an electric shock during defibrillation);

**is a vector-valued function;**

*f***is position; and**

*x**t*is time.

The function ** f** encapsulates the cellular level behaviour and can be represented using one of a wide variety of models. Most common are the Hodgkin–Huxley type ordinary differential equation (ODE) models (Hodgkin & Huxley 1952; FitzHugh 1961; Nagumo

*et al*. 1962; DiFrancesco & Noble 1985; Luo & Rudy 1991, 1994

*a*,

*b*; Noble

*et al*. 1998), but other forms, such as Markov models (Clancy & Rudy 1999; Iyer

*et al*. 2004; Flaim

*et al*. 2007), are beginning to be used. The exact form of the cell level model used is, however, not crucial in the context of high-performance distributed computing since the model for any one cell is independent of that for all other cells. Hence, as long as there are more cellular level models to solve than there are computational cores, the solution of these models will be embarrassingly parallel.

If it is assumed that the two domains are equally anisotropic (*σ*_{e}=*kσ*_{i}) then the extracellular potential can be eliminated from the bidomain model, leading to the monodomain equations (Keener & Sneyd 1998)(2.4)(2.5)where *σ*_{m}=*σ*_{i}(*σ*_{i}+*σ*_{e})^{−1}*σ*_{e} is the monodomain conductivity tensor and *I*_{m}=*I*_{e}/(1+*k*) is the monodomain stimulus current. This simplification is motivated by the reduced computational effort required to solve the single partial differential equation (PDE) of the monodomain model compared with the two coupled PDEs of the bidomain model. However, this gain may be at the cost of reduced accuracy (especially close to the propagating wavefront where *V*_{m} is most likely to be non-constant), hence it is important to verify the accuracy of monodomain simulations compared with the equivalent bidomain simulation.

Potse *et al*. (2006) concluded that the monodomain model is generally sufficient for studying propagation problems, but later work (Sebastian *et al*. 2008) indicates that this is not true on coarse meshes. Nielsen *et al*. (2007) have demonstrated a method for optimally approximating the solution of the bidomain equations using a monodomain model, without ever having to solve the bidomain equations themselves to verify the solution. However, the computational benefits of using the monodomain equations are much less significant for the more realistic cell models that are likely to be used in the future than they are for simpler ones. Sundnes *et al*. (2004) showed that, when using order optimal methods, the theoretical reduction in computational requirements of the monodomain equations when compared with the bidomain is only two (compared with approximately ten for simpler models).

One additional problem with using the monodomain equations is that, since the extracellular potential is eliminated in deriving them, it is considerably more difficult to apply an extracellular stimulus (as might be desired in, for example, a defibrillation study) than when using the bidomain equations.

### (b) Spatial and time discretization

There are many different FD, FV and FE methods that can be used for solving the governing equations (Harrild & Henriquez 1997; Keener & Bogar 1998; Vigmond *et al*. 2002, 2003, 2007; Trew *et al*. 2005*a*,*b*; Whiteley 2006). Given the complex geometry of the domain and the potential need to develop accurate error estimators in order to implement adaptivity, the FE method is preferred in the high-performance computing (HPC) strategy described in this work.

On discretizing in space over a mesh containing *N* nodes and introducing the vectors ** V**=(

*V*

_{m,k}),

**=(**

*Φ**ϕ*

_{e,k}),

*I*_{ion}(

**,**

*u***)=(**

*V**I*

_{ion},

*) and*

_{k}

*I*_{e}=(

*I*

_{e,k}),

*k*=1, …,

*N*, where

*V*

_{m,k},

*ϕ*

_{e,k},

*I*

_{ion,k}and

*I*

_{e,k}represent the values of

*V*

_{m},

*ϕ*

_{e},

*I*

_{ion}and

*I*

_{e}, respectively, at node

*k*, the FE form of the bidomain equations (2.1)–(2.3) is(2.6)(2.7)(2.8)where

*M*is the FE mass matrix and

*A*

_{ξ}, ξ=i, e, are the FE stiffness matrices relating to the intracellular and extracellular conductivity tensors, respectively (i.e.

*A*

_{ξ}is the discretized form of the operator ∇.(

*σ*

_{ξ}∇)). All three of these matrices depend on the shape of the elements in the FE mesh and the basis functions chosen. Most existing heart simulators use linear basis functions on either tetrahedral or hexahedral meshes. The equivalent discretization for the monodomain equations is(2.9)(2.10)where

*A*

_{m}is the stiffness matrix relating to the monodomain conductivity tensor.

At this stage, it is often convenient to decouple the system of ODEs from the two PDEs. As has been observed above, the ODE system is embarrassingly parallel, so its solution on a massively parallel computer architecture is trivial. Furthermore, treating the *I*_{ion} term as a known source term in the PDEs removes the only nonlinear term in the system and allows its solution to be calculated much more quickly. Note, however, that while this is the standard method, it is not clear that it is necessarily the best and there are advantages (such as the ability to use larger time steps) that result from using fully nonlinear solution methods (Lines *et al*. 2003). Various different schemes exist to discretize the time derivatives. In general, an implicit (or semi-implicit) solver is preferred as this is able to deal with the stiff nature of many ODE cardiac cell models. The simplest such method is the implicit Euler method, which leads to the following discretization in time and space of the bidomain PDEs:(2.11)(2.12)where is calculated at each time step from the decoupled solution of the ODEs. A similar discretization can be performed on equation (2.9) for the monodomain model.

## 3. High-performance algorithms

As noted in §1, the solution of the bidomain (or monodomain) equations on a realistic geometry is an extremely computationally demanding task, and advances in computer hardware alone will not be sufficient to achieve the desired performance levels. In this section, we propose three areas of numerical technology whose exploitation could increase the performance of cardiac simulators: mesh adaptivity; the use of SEs; and optimal matrix solution methods.

### (a) Adaptive algorithms

Adaptive schemes in both space and time have been successfully used to reduce the computational workload in various application areas. In particular, there is a large body of work on adaptive methods in computational mechanics that has been developed over many years (George *et al*. 2004) and could be modified for use in cardiac simulation. A similar process has been underway recently in the modelling of oceans and climate, which, similar to heart simulation, is a multi-scale problem (in both space and time) with complex geometry (necessitating the use of unstructured meshes), and where local phenomena can have a significant global impact (Pain *et al*. 2001; Piggott *et al*. 2008). The use of adaptive mesh refinement (AMR) techniques has been particularly successful in dealing with the multi-scale nature of climate models. Cardiac simulations tend to lead to narrow propagating wavefronts, away from which the potential is relatively constant, so it should be possible to use AMR techniques to greatly reduce the computational load away from the wavefront without compromising on accuracy.

Indeed, work has already begun on developing adaptive techniques for use in cardiac simulation. One early method for simulating cardiac dynamics was developed by Cherry *et al*. (2000, 2003). This uses an explicit FD scheme based on an AMR method previously used for hyperbolic PDEs (Berger & Oliger 1984). In tests on a three-dimensional slab incorporating rotational anisotropy and heterogeneity using the simple FitzHugh–Nagumo (FHN) cell model (FitzHugh 1961; Nagumo *et al*. 1962), a speedup factor of 50 and a memory reduction of 30 were observed over a similar non-adaptive method. In another application, a speedup factor of 70 has been observed on a 100 core parallel computer.

Colli Franzone *et al*. (2006) implemented a fully space–time adaptive method using the Kardos software package (Erdmann *et al*. 2002). A nested multilevel FE mesh is used, together with a hierarchical error estimator that controls mesh adaptation. On a thin three-dimensional slab (using either the FHN or Luo–Rudy I (LRI; Luo & Rudy 1991) cell models), the adaptive method required 200 times fewer nodes for the same degree of accuracy over using a uniform mesh in bidomain and monodomain simulations. The algorithm was subsequently used to solve the bidomain equations on the Auckland three-dimensional heart geometry (LeGrice *et al*. 2001) by Deuflhard *et al*. (in press). This appears to be the only fully adaptive algorithm in the literature to have been applied to a whole-heart model. However, on a shared memory machine with eight dual-core AMD Opteron processors, six weeks of CPU time (and approx. 30 GB of memory) were required to simulate 800 ms of electrical activity using the simplified Aliev–Panfilov cell model (Aliev & Panfilov 1996).

Whiteley (2007) uses the underlying physiology to drive a space–time adaptive method, based on a previously derived stable, semi-implicit numerical scheme (Whiteley 2006). The method achieved a speedup of between two and three orders of magnitude over a non-adaptive explicit numerical scheme in two-dimensional simulations of the propagation of an action potential, using the cell model of Noble *et al*. (1998). Another physiologically driven scheme is used in Whiteley (2008). The ionic currents in the cell model are separated into fast and slow components. The slow components are solved on a coarse mesh and the fast ones on a fine mesh, using different time steps on the two meshes to reduce the computational load. Simulations were again conducted in two dimensions and a similar speedup of two orders of magnitude was observed over an explicit scheme.

Pennacchio (2001) has applied the mortar FE method to solve the bidomain equations, demonstrating an efficiency of between 70 and 90 per cent (using the FHN model) when compared with a standard FE method. The method was initially applied to just the elliptic portion of the bidomain equations, but has since been extended (with a semi-implicit time-stepping scheme) to the fully coupled model (Pennacchio 2004*a*,*b*). A similar method (Trangenstein & Kim 2004) uses a second-order Strang operator splitting in time to allow separate schemes to be used for the integration of the reaction and diffusion terms (a singly diagonal implicit Runge–Kutta scheme for the reaction term and Crank–Nicholson for the diffusion term). The diffusion terms are solved using the mortar FE method, allowing AMR with a nested hierarchy of rectangular grids. A stiff diagonally implicit Runge–Kutta scheme is used for the reaction terms. In two-dimensional monodomain simulations (using the LRI model), a speedup of up to 4.35 was observed compared with a uniform mesh. Recently, Belhamadia (2008) developed a time-dependent anisotropic remeshing strategy for simulating action potential propagation. The adaptation strategy seeks to minimize the error gradient everywhere, naturally dealing with anisotropy inherent in the solution of cardiac problems. In two-dimensional tests with re-entrant spiral waves (using the FHN model), a 10 per cent reduction in the computational time was observed over an unrefined structured grid.

One issue that must be overcome if adaptive methods are to be efficiently implemented is the trade-off between the computational savings of solving a smaller system at each time step and the extra cost that may be incurred by performing the adaptation that reduces the size of the problem. This cost may grow for large-scale parallel computers, as the cost of distributing the new mesh among the many processors may be large. The mortar FE method offers one potential solution to this trade-off. It splits the mesh into patches that can be distributed across the available processors and refined independently of one another, greatly reducing the overhead of refinement (Pennacchio 2001; Trangenstein & Kim 2004). Further trade-offs that must be addressed in implementing adaptive methods include the frequency with which any remeshing should be carried out, the size of the adapted zone around the travelling wavefront and whether the adaptivity should be isotropic or anisotropic. Each of these remain open research questions and it seems likely that the results will be strongly problem dependent. These issues and how they relate to the current state of the art are discussed in more depth by Alauzet *et al*. (2007).

### (b) Parallel high-order discretization methods

The accuracy of the FE technique described in §2*b* for the spatial discretization of equations (2.1) and (2.2) depends mainly on grid spacing (*h*) and on the degree of the polynomial basis functions (*p*). The methods described in §3*a* increased accuracy by reducing the distance between the mesh nodes where the solution varied rapidly. This is known as *h-adaptivity*. On the other hand, it is possible to improve the quality of numerical simulations by using polynomial basis functions of higher degree in regions where an increased accuracy is required. This approach is referred to as *p-adaptivity*. Furthermore, the advantages of both approaches can be combined in *hp-adaptivity*, where the balance between using *h*- or *p*-adaptivity is locally tuned, taking into account the problem features (geometry complexity, regularity of the solution, etc.). These are capable, to some extent, of exploiting the flexibility of standard FE methods in dealing with complex structures while also being able to use high-degree polynomial functions to model high-frequency waves (in preference to very highly refined meshes). That is, *hp*-methods are capable of changing the accuracy of the numerical solver either by modifying the computational grid or, where this might be inappropriate (e.g. in parts of a realistic heart mesh where complex geometry makes remeshing highly computationally demanding), by fixing the grid and increasing the order of the FE basis functions.

While, in principle, it is possible to design FE algorithms with basis functions of arbitrary degree, practical constraints usually mean that only low-degree basis functions can be implemented (since choice of basis function alters the number of nodes per element required and this must be known at *compile time*). In an *hp*-adaptive SE method, on the other hand, the degree of the basis functions can be set at *run-time*—allowing polynomials of any degree to be used. The SE method (Casadei *et al*. 2002) has been explicitly designed for engineering applications, so it maximizes flexibility, performance, ease of interaction with CAD tools and grid generators. SE-based algorithms have been successfully applied to problems in many application areas, including acoustics (Maggio & Quarteroni 1994), elastodynamics (Faccioli *et al*. 1997), electromagnetics (Maggio *et al*. 2004) and thermoelasticity (Massidda 2003). The main drawback to using SEs is that they require the use of an all-hexahedra mesh. However, as flexible and powerful tools for hexahedral mesh generation (http://cubit.sandia.gov) have now become widely available, this is no longer a serious limitation. Indeed, a number of existing (FE and FD) heart simulators use hexahedral meshes, hence the required meshes may already be available. Since the meshes generated using these tools would be unstructured, it is likely that they would be capable of matching the capability of tetrahedral meshes to model all regions of the heart, including those with the most challenging geometry (such as the apex of the heart) that existing hexahedral meshes have struggled to deal with.

The SE discretization leads to an algebraic problem that is formally identical to equations (2.6) and (2.7), where the FE mass and stiffness matrices are replaced by their SE counterparts. While the SE mass matrix is naturally diagonal (reducing the temptation to use mass-lumping techniques, as is often done in FE methods), the stiffness matrix is denser than that for the FE method, due to the higher number of degrees of freedom possessed by each element. The SE discretization uses fewer mesh nodes than its FE equivalent (for the same level of accuracy). Therefore, the computational complexity is determined by matrix–vector products with matrices that are smaller and slightly denser for the SE method than for the FE one. Numerical experiments in the application fields mentioned above have shown that the SE method shows a speedup factor in the range of 5–10 for three-dimensional problems when compared with an FE alternative. The exact speedup factor depends on the grid and the spectral degree adopted: numerical experiments suggest that the best compromise between accuracy and performance is provided by spectral degrees between 2 and 5.

Optimal implementation of a parallel SE solver is not likely to be a straightforward task. The higher density of the SE stiffness matrix requires more sophisticated criteria for domain decomposition and message-passing organization. One approach that has been successfully applied in other application areas is to make mesh partitioning a fully automated procedure that divides the computational domain and the mesh among the processors. This process has two main targets: balancing the computational load on each CPU, and minimizing data exchange between the processors. It can be managed using, for example, METIS, a family of multilevel partitioning algorithms (Karypis & Kumar 1999). Although it was designed for partitioning FE meshes, METIS essentially works on graphs, and can be adapted to a mesh-based method such as SE. Graphs can be defined using either the nodes or the elements as vertices. Both of these strategies may be adopted in an element-by-element solution scheme, but previous investigation indicates that the element-based strategy fits better within an SE frame. Experiments with medium-scale problems (approx. 5 million nodes) using up to 128 CPUs exhibited an efficiency that was almost independent of the number of CPUs, with an average value of 74.5 per cent (Maggio *et al*. 2007).

SE can also be implemented within an adaptive frame, with no substantial differences from an FE scheme. However, the tools used for the dynamic domain decomposition must meet more critical requirements than those in the non-adaptive case. In particular, they should account for data movement costs as nodes are created or destroyed, run quickly and use little memory, and run side-by-side with the SE solver. Tools that perform all of these tasks already exist (http://www.cs.sandia.gov/Zoltan/Zoltan.html) and could be adapted for use in cardiac simulation.

### (c) Optimal matrix solution methods

The spatial discretization of the governing equations leads to large sparse linear systems of equations that must be solved at each time step, regardless of the type of numerical method employed. The size of the linear system is equal to twice the number of nodes in the tissue plus any nodes in a surrounding conducting bath for the bidomain equations or to the number of nodes in the tissue for the monodomain equations. The spatial resolutions required to simulate accurately a cardiac action potential means that tens of millions of nodes are typically required in order to simulate a realistic geometry in three dimensions (see §1). Over the course of the simulation of even a single heartbeat, tens of thousands of linear systems may need to be solved (Mardal *et al*. 2007; Plank *et al*. 2007). Solving these systems represents most of the total computational load for the simulation, hence it is necessary to identify efficient numerical techniques to do this.

Although sparse direct methods based on variants of Gaussian elimination are robust and numerically accurate, they may be too memory demanding for solving large problems even for modern parallel computers (Weber dos Santos *et al*. 2004; Plank *et al*. 2007; Busch 2008). In experiments by Plank *et al*. (2007), a linear system with 862 515 unknowns (modelling the induction of arrhythmias in rabbit ventricles) could not be factorized even when 8 GB of memory were available. Iterative solvers that are based on matrix–vector multiplications can solve the memory bottleneck and present a viable alternative (provided they are accelerated with some form of preconditioning).

For the bidomain equations, the large size of the linear system has motivated the frequent use of an operator-splitting technique that decouples the coupled system described by equations (2.11) and (2.12) into separate parabolic and elliptic parts (Vigmond *et al*. 2002)(3.1)(3.2)On taking this approach, the parabolic equation (3.1) can be solved using only a few iterations of a block Jacobi-preconditioned conjugate gradient solver, while the solution of the elliptic equation (3.2) requires the use of a much more robust solver and accounts for well over 90 per cent of the computational time (Plank *et al*. 2007). Whether or not this is the optimal approach to solving the bidomain equations is still an open question. The work by Pennacchio & Simoncini (2002) suggests that, despite the reduced size of the decoupled system, it is still more efficient (for the same level of accuracy) to solve the coupled system. The discussion that follows is equally relevant to the fully coupled bidomain equations or to the elliptic part of the decoupled system.

A substantial amount of work has been devoted to designing robust preconditioners for solving the monodomain or bidomain equations on both sequential and parallel computer architectures (Pennacchio & Simoncini 2002; Sundnes *et al*. 2002; Weber dos Santos *et al*. 2004; Potse *et al*. 2006). For an operator-splitting approach, simple diagonal (Jacobi) preconditioning is sufficient for the monodomain equations or for the parabolic part of the bidomain equations, but does not reduce the number of iterations for the elliptic part of the bidomain equations significantly (Potse *et al*. 2006). A better performance is reported with the symmetric successive over-relaxation and incomplete LU (ILU) factorization methods. Both of these classes of preconditioner show similar levels of performance, but the ILU method can be enhanced at the cost of additional storage to show improved results (Weber dos Santos 2002; Plank *et al*. 2007).

Multigrid-based preconditioners solve a sequence of problems on a hierarchy of grids of different sizes. Geometric multigrid (GMG) preconditioners require the hierarchy of meshes to be explicitly defined in order to set up the principal components of the algorithm (Hackbusch 1985). Therefore, their application is restricted to regular grids (Sundnes *et al*. 2002; Weber dos Santos *et al*. 2004). Algebraic multigrid (AMG) preconditioners, on the other hand, only require information from a single mesh. This enables the use of irregular grids with different distributions of nodes in the computational domain of the type typically seen in cardiac applications. In many practical applications, AMG has proved to be nearly as effective as GMG in reducing the number of iterations of Krylov subspace linear solvers. Indeed, AMG methods have proven to be very efficient for the bidomain equations, both as a preconditioner and as a stand-alone solver (Austin *et al*. 2005; Plank *et al*. 2007; Vigmond *et al*. 2007).

Multigrid algorithms are typically parallelized by performing computations in parallel within each grid, but processing the grids one at a time. A high level of efficiency can be obtained, provided the work per processor on the finest mesh is sufficient and the grid transfers and coarse grid computations are not dominated by communication. For unstructured problems, the degree of parallelism of the standard coarsening procedure of AMG and of traditional smoothers such as Gauss–Seidel can degrade considerably and specialized algorithms may be required (Chan & Tuminaro 1987; McBryan *et al*. 1991; Smith *et al*. 1996; Jones & McCormick 1997; Chow *et al*. 2006). Parallel AMG methods have been successfully integrated into the existing cardiac simulators (Pavarino & Colli Franzone 2003; Plank *et al*. 2007; Busch 2008; Pitt-Francis *et al*. 2008) using standard linear algebra packages such as PETSc (Balay *et al*. 2006).

To date, AMG preconditioners have been used in cardiac simulations on only a moderate number of processors. The number of iterations increases slightly for parallel implementations compared with the sequential case. The efficiency may also degrade on two-dimensional problems. In three dimensions, the relative size of the finest grid problem compared with the coarsest one is much larger and a high parallel efficiency up to 80–90 per cent is reported (e.g. Weber dos Santos *et al*. 2004; Plank *et al*. 2007). Computations on coarser grids are less efficient because the amount of arithmetic becomes smaller and the relative communication overhead increases. Parallel performance may also depend on the hardware: on shared memory machines, very good efficiency is expected with a large number of processors (Vigmond *et al*. 2007), while with a distributed memory cluster the efficiency may vary significantly depending on the latency and bandwidth of the interconnection network (Weber dos Santos *et al*. 2004; Plank *et al*. 2007). Further work is required to optimize the performance of AMG preconditioners and to identify the most effective method of implementing them on very large parallel systems. The *O*(*n*) computational work per unknown of AMG methods and a low computation-to-communication ratio mean that a significant amount of work will be necessary to achieve this.

For SE methods, when the spectral degree is low (say *p*≤4), the solvers and preconditioners designed for use with FE methods also work well for SE. When *p* increases, however, SE-specific techniques should be used. In particular, we refer to (i) domain decomposition-based preconditioners (Pavarino & Widlund 2000) and (ii) preconditioners based on FD and FE (Shapira *et al*. 1999). AMG methods can be used effectively in conjunction with SE methods, just as for FE. For example, in climate modelling, the work of Heys *et al*. (2005) shows how AMG methods can be applied both to the iterative solution of linear systems produced from a high-order discretization and for use as a preconditioner in a conjugate gradient iteration.

## 4. Implementation on petascale computers

The discussion of each of the potential areas of algorithmic improvement in §3 concentrated on existing work that has been conducted primarily on either workstations or small- to medium-sized clusters. However, in order to realize the huge improvement in computational speed that has been identified as being necessary in §1, it will be necessary to run the simulation code (including the algorithmic improvements) on the next generation of HPC facilities such as the Tier-0 PRACE facilities within Europe (http://www.prace-project.eu), Japan's Next-Generation Supercomputer Project (http://www.nsc.riken.jp) or the Blue Waters project of the National Center for Supercomputing Applications in the USA (http://www.ncsa.uiuc.edu/BlueWaters).

Substantial alterations will be required in order to ensure that the code designed for sequential machines or clusters with tens of nodes can make full use of the petascale (10^{15} floating point operations per second (flops)) resources that these facilities will make available to users. The most powerful of the last generation of supercomputers, such as IBM's Blue Gene/L, already contains more than a hundred thousand cores (Gara *et al*. 2005), and this number will increase even further in the next generation. Maintaining good load balancing and minimizing communication overhead become increasingly difficult tasks with the increase in the number of nodes. Furthermore, the design of the processors used in next-generation supercomputers is likely to be radically different from that on which cardiac simulations have been run in the past. Most importantly, they will contain multiple cores that may share the same cache and require a hybrid message-passing/thread parallelism strategy (Bischof *et al*. 2008).

The challenges of scaling applications effectively to evolving architectures is one that is shared by many application areas, so there is a sizeable community that has already begun to address some of the issues. In particular, there is a great deal of interest in new petascale FE algorithms. The use of multilevel methods such as hierarchical hybrid grids (Bergen *et al*. 2006) will be essential, since these show close to linear scaling of computational complexity with problem size, and therefore will continue to scale well even for the very large problems that will necessitate the use of petascale computers. A similar development process is underway in the field of SE modelling, where various numerical algorithms that are expected to scale to the next generation of supercomputers have also been proposed (Hamman *et al*. 2007; Taylor *et al*. 2008).

AMR methods are, theoretically, scalable to petascale architectures, in that the work for each solve can be distributed evenly over the available cores and minimal communication is needed during the solve (Wissink *et al*. 2001, 2003). However, the process of regridding (refining and unrefining appropriate regions as the simulation progresses) is not necessarily scalable, and the overhead incurred by the need to repartition the mesh between the cores after regridding can also have a serious impact on scaling. Recent work (Luitjens *et al*. 2007) has demonstrated that space-filling curves that can efficiently partition the mesh in order to balance the load while minimizing communication overhead can be generated quickly and in a scalable manner in parallel. According to Luitjens *et al*. (2008), it is now ‘possible but challenging to get specific [AMR] codes to scale well’, but the ideas that have been used to achieve this in some limited cases have yet to be extended to a general AMR framework.

The petascale implementation of underlying numerical technologies (such as optimal matrix solution methods), applicable across a wide area of science and engineering, is the focus of many research efforts worldwide (such as the US SciDAC programme). Issues related to scaling of numerical solvers to massively parallel systems, including heterogeneous computer architectures, efficient data decomposition, load balancing, scaling of communications and overcoming the memory bandwidth ‘wall’, are being addressed. Cardiac simulation will benefit from these breakthroughs and from the highly efficient numerical libraries that will result from these research efforts.

Finally, as computer architectures become more complex, it will become increasingly desirable to hide this complexity from the user in the form of highly tuned compilers and machine-specific libraries (Freundl *et al*. 2008; Logg *et al*. 2008). This would reduce the optimization for a specific numerical method (e.g. a linear solver, a preconditioner or the assembler for a given FE or SE algorithm) to an operation that would only have to be performed once on each new machine. Furthermore, this one-off optimization step would become the responsibility of the developers of the compiler and libraries—who have the necessary expertise about the new architecture—rather than of the end users—who generally do not. A change to this coding model would allow the development of portable application code that could be moved seamlessly to new computing architectures as they come online. This is especially attractive as the lifespan of application code is generally much longer than that of a specific compute facility.

## 5. Conclusions

In this paper, we have identified the requirement for substantial improvement in the speed of existing cardiac simulation packages (by several orders of magnitude) if they are to produce the real-time results that will probably be necessary if they are to be used in non-academic application areas (e.g. early stage testing of the arrhythmogenic impact of new drugs). Since existing codes have largely been run on workstations or clusters, one way of improving their performance (which is sure to be pursued) is to convert the codes to run on HPC facilities. However, the computational load of the current technology is so large that real-time simulation of a whole heart at the required resolution would be beyond it, even if current performance scaled linearly to the largest available supercomputers.

Hence, new methods will be required to be developed alongside the conversion to parallel computing. These methods will need to compute the solution of the governing equations faster than those currently used (for the same level of accuracy), while also scaling efficiently when the computational load is distributed across many thousands of processors. In §3, we suggested three potential classes of method, the implementation of which should help to satisfy these requirements.

First, an increase in the use of adaptive methods presents an opportunity to reduce the size of the problem that has to be solved. Here, AMR is a very attractive candidate for use in cardiac simulation. Anisotropic AMR methods of a similar type to those that would be required in heart modelling have proven to scale well on HPC facilities in other application areas (Piggott *et al*. 2008), and early experiments with AMR in heart simulation have shown speedup factors of up to 50 on a sequential machine (Cherry *et al*. 2003), and a reduction in the total computational load by a factor of up to 200 (Colli Franzone *et al*. 2006).

Next, using SE methods may also reduce the computational load, since they lead to linear systems with diagonal mass matrices and fewer degrees of freedom than equivalent FE methods. On the other hand, the stiffness matrix of an SE method is generally denser than its FE counterpart and the method necessitates the use of hexahedral elements (which may not capture the surface geometry of the heart as well as tetrahedra). Results in other fields indicate that the use of SE methods can increase the speed of a code by up to an order of magnitude.

Finally, no matter what discretization method is used, the bulk of the computational time will be spent solving linear systems. While a significant amount of work has gone into speeding up the solution of the equations of cardiac electrophysiology, especially when they are decoupled into an elliptic and a parabolic part, there is still much more to be done if our goal of close-to-real-time simulation is to be met. Recent work has shown that switching from an ILU to an AMG preconditioner results in a speedup by a factor of between 6 and 8 (Plank *et al*. 2007). The possibility of further advances of this type requires investigation. Currently, it is not even clear whether operator-splitting techniques represent the way forward or whether a switch to a fully nonlinear solution of the coupled equations would be better. The adoption of different algorithms (e.g. adaptive methods and SE methods) that have proved successful in other fields and the use of more powerful (but also more complex) computing facilities should help in the search for additional speed, but it will also mean that there will be even more factors that need to be considered when attempting to identify the optimal approach in the future.

In §4, we have outlined some of the methodologies that are beginning to be developed to run numerical simulations on petascale computers. It is evident from this discussion that these are still in their infancy and that, while a unified framework for petascale computing is desirable (and may eventually emerge), most of the work to date has been application specific. So, while there may be general lessons to be learned from results in other application areas (such as computational mechanics or ocean modelling), developers of cardiac simulation code will probably need to optimize their own code on each new architecture that it is to be run on. This emphasizes the continuing need for close cooperation between experimental physiologists, mathematicians and computer scientists in the development of a heart simulator. Since advanced techniques in each of these fields will be required, it is very unlikely that any one person will be able to acquire all of the necessary skills. So, we envisage that continued close cooperation between computational biology, numerical analysis and physiology groups (each of which continues to specialize in their own field) is more likely to produce good results than any one of these groups trying to take the whole project ‘in-house’.

We hope that, as the benefits of these developments come on stream, cardiac modelling will no longer be simply an academic exercise. The goal of our work is to develop an extremely fast heart simulation software package. This would have the potential to be used by industry and clinicians to provide breakthroughs in the development of pharmaceuticals and medical devices, and in the understanding of new diseases. Potentially, this could lead to much more effective medical treatment being available (not just for patients with heart problems, as many drugs that could treat other conditions currently fail to be licensed due to side effects in the heart) and, ultimately, save lives and reduce the cost of healthcare, providing a major benefit for society.

## Acknowledgments

The work of B.C., G.F., F.M., R.N. and J.S., part of the preDiCT project, was supported by a grant from the European Commission DG-INFSO—grant number 224381. R.B. is supported through an EPSRC-funded Life Sciences Interface Doctoral Training Centre studentship (grant reference EP/E501605/1).

## Footnotes

One contribution of 15 to a Theme Issue ‘The virtual physiological human: tools and applications I’.

- © 2009 The Royal Society