## Abstract

We describe a software package designed for the investigation of topological fluid dynamics with a novel algorithm for locating and tracking vortex cores. The package is equipped with modules for generating desired vortex knots and links and evolving them according to the Navier–Stokes equations, while tracking and visualizing them. The package is parallelized using a message passing interface for a multiprocessor environment and makes use of a computational steering library for dynamic user intervention.

## 1. Introduction

The curl of the hydrodynamic velocity field **u** is a divergenceless vector field known as the *vorticity*, **ω** ≡ **∇** × **u**. The line integral of velocity around a closed loop is called the *circulation* enclosed by the loop and, by Stokes' theorem, this is equal to the flux of the vorticity through a surface bounded by the loop. While it is known that vorticity tends to concentrate into tube-like structures called *vortex cores*, these have no mathematically precise and universally accepted definition and their identification in a given turbulent flow field remains an outstanding computational problem (Aref & Zawadzki 1991; Roth & Peikert 1998).

In the important, if somewhat contrived, case where the support of vorticity is restricted to closed and perhaps linking tubular vortex cores, each with fixed circulation, it is known that vorticity may be related to global topological invariants of a fluid. In particular, the volume integral of the scalar product of vorticity and velocity is known as *helicity*. This can be shown to be a quadratic form in the circulations of the component links, with coefficients related to the corresponding linking numbers (Khesin & Arnold 1999).

For an incompressible Euler or inviscid fluid whose time evolution is the unfolding of a volume-preserving diffeomorphism, such vortex cores may not pass through one another and so the helicity is invariant and presents a topological obstacle to the relaxation of the fluid to a steady state. By contrast, vortex cores in a Navier–Stokes or viscous fluid may diffuse through one another and such events are characterized by concomitant large changes in helicity.

The search for vortex cores requires large amounts of computational resources, if only because it is most often conducted for flow fields on large three-dimensional computational grids. The most naive methods focus on a scalar quantity that tends to be large in a vortex core, such as the magnitude of the vorticity |**ω**| or the negative of the pressure −*p*, and define the collection of vortex cores as the union of all regions where this quantity exceeds a certain threshold value. The problem with this approach is that vortex cores may vary dramatically in cross-section. Field lines of vorticity may begin concentrated, fan out and concentrate anew, so that attempts to threshold a scalar magnitude may result in ‘gaps’—abrupt stops and starts that do not conform to our intuitive picture of the behaviour of a divergenceless vector field.

More sophisticated identification strategies try to circumvent this problem by defining vortex cores as the loci of points at which the scalar field reaches a maximum in two principal directions. For example, Kida and co-workers (Kida *et al*. 2002) Taylor expand the scalar quantity about the centre of each computational grid cell, retaining terms to quadratic order. If the critical point of this quadratic form lies within the same grid cell, then the *Hessian matrix* of second derivatives at this point is calculated and diagonalized to identify principal axes. If two of the associated eigenvalues are negative, indicating a maximum in those principal directions, the third eigenvector is taken as the direction of a vortex core passing through the critical point. Although a very significant improvement over thresholding, this method may still result in ‘gaps’.
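As a concrete illustration of this Hessian-based criterion (a minimal sketch, not the authors' code; the finite-difference stencils and array layout are our own choices), one grid cell can be examined as follows:

```python
import numpy as np

def core_direction(scalar, cell, h=1.0):
    """Examine one grid cell of a scalar field (e.g. |vorticity| or -p).

    Compute the Hessian of second derivatives by central differences
    and, if the field has a maximum in two principal directions,
    return the third eigenvector as a candidate core direction.
    """
    i, j, k = cell
    g = scalar  # three-dimensional array of the scalar quantity
    H = np.empty((3, 3))
    for a in range(3):
        for b in range(3):
            e_a = np.zeros(3, dtype=int); e_a[a] = 1
            e_b = np.zeros(3, dtype=int); e_b[b] = 1
            p = np.array([i, j, k])
            H[a, b] = (g[tuple(p + e_a + e_b)] - g[tuple(p + e_a - e_b)]
                       - g[tuple(p - e_a + e_b)] + g[tuple(p - e_a - e_b)]) / (4 * h * h)
    w, v = np.linalg.eigh(H)   # eigenvalues in ascending order
    if w[0] < 0 and w[1] < 0:  # maximum in two principal directions
        return v[:, 2]         # remaining eigenvector: candidate core direction
    return None
```

For a field such as −(*x*^{2} + *y*^{2}), which is maximal along the *z*-axis, the returned direction is the *z* unit vector, as the text describes.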

It is clear that what is required is an identification strategy that takes into account the vector nature of the vorticity field at the very outset. In recent work (Finn & Boghosian in press), we have introduced a definition of a vortex core as a local maximum of the line integral of vorticity around a closed path *C*, *taken in the space of all such closed paths*. To study the characteristics of this new definition, we have constructed a large modular computer code that is capable of generating vorticity fields with desired characteristics, evolving these fields according to the Navier–Stokes equations and applying the new identification algorithm at each time-step. The code allows for dynamic user interaction and feedback to drive the simulation's search for vortex cores in meaningful directions or initiate parallel multiple searches.

In this paper, we provide a detailed description of five modules of this code in the order they are used. In §2, we describe how we generate vorticity fields with easily identifiable vortex knots and links, which provide base-line tests for our identification algorithm. Section 3 describes how the generated vorticity fields are evolved using a multiple-relaxation-time lattice Boltzmann (MRT LB) algorithm. In §4, we describe the new definition of vortex cores in some detail, including the dynamic algorithm that we use to locate vortex cores in snapshots of the vorticity field taken from the LB evolution. Section 5 describes our use of computational steering to guide the vortex core identification process. Finally, §6 details a module that we have built for dynamic lattice remapping which we employ with a distributed three-dimensional Fourier transform to enable dynamic lattice size refinement in our simulations.

## 2. Vorticity field generation

To judge the effectiveness of a proposed vortex core identification algorithm, we need a way to easily generate vorticity fields with specified vortex knots or links. The first module that we describe generates vorticity fields whose principal vortex cores lie near one or more curves specified by the user in a periodic three-dimensional domain. Given this field, a putative vortex core identification algorithm should be able to recover something close to the original curve(s).

Suppose that it is desired to create a vortex core along a closed curve with parametric representation **r** = **R**(*s*), with characteristic width *σ* and circulation *Γ*. We compute the vorticity vector thus generated at an arbitrary point **r** by the prescription

**ω**(**r**) = *Γ* ∮ d*s* (d**R**/d*s*) *δ*_{σ}(**r** − **R**(*s*)), (2.1)

where *δ*_{σ} is a smoothed approximation to the Dirac delta function of characteristic width *σ*. It is straightforward to show that vorticity fields generated in this way are necessarily divergenceless. For vortex links, the contributions to the vorticity field due to each component of the link are simply added together. All of our code modules, including this one, are constructed to work in a three-dimensional periodic domain, so the contributions of the infinite array of image vortex cores are accounted for using an Ewald-like sum.
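A sketch of this prescription, assuming a Gaussian-smoothed delta function of width *σ* and ignoring the Ewald-like image sum, might look like:

```python
import numpy as np

def vorticity_from_curve(r, curve, Gamma=1.0, sigma=0.1):
    """Vorticity at a point r generated by a closed curve R(s).

    curve: (N, 3) array of points sampling the closed curve.
    Each segment contributes its tangent vector, weighted by a
    normalized Gaussian of width sigma in the distance from the
    segment midpoint. Periodic images (the Ewald-like sum) are
    omitted in this sketch.
    """
    r = np.asarray(r, dtype=float)
    seg = np.roll(curve, -1, axis=0) - curve          # dR for each segment
    mid = curve + 0.5 * seg                           # segment midpoints
    d2 = np.sum((r - mid) ** 2, axis=-1)              # squared distances
    w = np.exp(-d2 / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2) ** 1.5
    return Gamma * np.sum(w[:, None] * seg, axis=0)
```

At a point on a circular loop in the *xy*-plane, the resulting vorticity is tangent to the loop, as expected for a vortex ring.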

In cross-sections transverse to the curve **r** = **R**(*s*), the magnitude of the vorticity is maximized precisely on this curve only in the limit *σ* → 0. For *σ* > 0, this is not necessarily true. In practice, however, such precision is unnecessary; the general shape and location of the curve is often sufficient to declare a successful identification. When higher accuracy is required, locations of maximum vorticity magnitude can be found numerically.

Finally, the velocity field may be recovered from the generated vorticity field, since ∇^{2}**u** = −**∇** × **ω**. The solution to this Poisson equation is accomplished using Fourier transforms, yielding **û**(**k**) = i**k** × **ω̂**(**k**)/|**k**|^{2} for each non-zero wavevector **k**.
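A minimal serial sketch of this Fourier-space Poisson solve (NumPy only, not the distributed transform of §6; the box size convention and removal of the mean flow are our own assumptions):

```python
import numpy as np

def velocity_from_vorticity(omega, L=2 * np.pi):
    """Recover u from omega on a periodic n^3 lattice by solving
    del^2 u = -curl(omega) in Fourier space: u_hat = i k x omega_hat / k^2."""
    n = omega.shape[0]
    k1 = 2 * np.pi * np.fft.fftfreq(n, d=L / n)     # integer wavenumbers for L=2*pi
    kx, ky, kz = np.meshgrid(k1, k1, k1, indexing="ij")
    k2 = kx**2 + ky**2 + kz**2
    k2[0, 0, 0] = 1.0                                # avoid 0/0; zero mode handled below
    oh = [np.fft.fftn(omega[..., a]) for a in range(3)]
    kxo = [1j * (ky * oh[2] - kz * oh[1]),           # components of i k x omega_hat
           1j * (kz * oh[0] - kx * oh[2]),
           1j * (kx * oh[1] - ky * oh[0])]
    u = np.stack([np.fft.ifftn(c / k2).real for c in kxo], axis=-1)
    return u - u.mean(axis=(0, 1, 2))                # remove arbitrary mean flow
```

For example, the vorticity field **ω** = (0, cos *z*, 0) should return the velocity **u** = (sin *z*, 0, 0).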

## 3. Lattice Boltzmann algorithm

Our vorticity field datasets are generated by a three-dimensional, fifteen-velocity (D3Q15) MRT LB algorithm (d'Humières *et al*. 2002). This algorithm stores a fifteen-component single-particle distribution function at each site on a three-dimensional lattice. At every time-step, particles propagate to neighbouring grid sites and collide, relaxing the local distribution towards a carefully designed equilibrium form.

Multiprocessor implementations of the LB algorithm employ spatial domain decomposition. Interprocessor communication is required as particles propagate from one subdomain to another. We use a general block decomposition in order to reduce the ratio of surface area to volume; the smaller this ratio, the less the data that needs to be communicated between processors at each time-step. This is illustrated in figure 1.

In practice, to reduce the effects of communication latency, we communicate all the data in a halo region *k* lattice sites deep from the surface of the blocks once every *k* time-steps. We first describe this halo exchange process in two dimensions, where it is easier to visualize; there, each processor must communicate with eight logical neighbours. Each processor pads its array on all sides with extra rows and columns of sites. A processor's block always contains valid distribution data; the halo may or may not contain valid data. Of course, it would be inefficient for each processor to send data to and receive data from eight neighbours, and this problem would only become worse in three dimensions, where each processor would have 26 logical neighbours. As shown in figure 2, we can reduce communication by sending and receiving to only four neighbours in two dimensions. In this two-stage communication, each processor communicates first with its north and south neighbours and then with its east and west neighbours. This indirectly sends the corner entries to the diagonal neighbours. Likewise, in three dimensions, each processor needs to send only six faces of halo data to six logical neighbours. The edges and corners are thereby indirectly communicated to the diagonal neighbours.
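The two-stage exchange can be illustrated on a single periodic block standing in for the MPI sends and receives (a toy sketch with a halo one site deep; `-1` marks halo sites not yet filled):

```python
import numpy as np

def exchange_halo(block):
    """Two-stage halo exchange on a 2-d block with a 1-site halo.

    Stage 1 copies north/south rows (interior columns only); stage 2
    copies east/west columns *including* the rows received in stage 1,
    so corner sites reach diagonal neighbours indirectly. Here all
    'neighbours' are the same periodic block, standing in for the
    MPI sends and receives of the real code.
    """
    a = np.full((block.shape[0] + 2, block.shape[1] + 2), -1.0)
    a[1:-1, 1:-1] = block
    # stage 1: north/south exchange (interior columns only)
    a[0, 1:-1] = a[-2, 1:-1]
    a[-1, 1:-1] = a[1, 1:-1]
    # stage 2: east/west exchange, sending full columns incl. stage-1 rows
    a[:, 0] = a[:, -2]
    a[:, -1] = a[:, 1]
    return a
```

After both stages, every halo site, including the four corners, holds data from the correct (here, periodic) neighbour, even though no diagonal message was ever sent.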

We tested the effect of the halo depth *k* on performance, measured in site updates per second per processor, on a Beowulf cluster of 8 processors in a 2×2×2 block processor layout. Global lattice sizes from 128^{3} to 256^{3} were used and halo thicknesses from *k*=1 site to *k*=10 sites were tested. In all cases, a halo thickness of *k*=1 was found to be the most efficient, indicating that communication latency is not a problem on this cluster. The performance data are summarized in the following table, with entries given in hundreds of thousands of site updates per second per processor.

Using a halo thickness of *k*=1 site, we tested the MRT LB algorithm on a range of lattice sizes and on varying numbers of CPUs on the Pittsburgh Supercomputing Center's machine called Lemieux.^{1} The MRT LB code was found to scale essentially linearly from 64 to 1024 processors, as shown in the following table, with entries again given in hundreds of thousands of site updates per second per processor.

A direct comparison can be made between the 64-CPU-256^{3} and the 512-CPU-512^{3} test cases since they both have the same number of lattice sites per processor. The same comparison holds for the 64-CPU-512^{3} and the 512-CPU-1024^{3} lattices, and the 128-CPU-512^{3} and 1024-CPU-1024^{3} lattice cases. The performance data show virtually no penalty for increasing the number of processors, which is perhaps not surprising since the algorithm is ‘embarrassingly parallel’. No global communication is ever used in this code module.

## 4. Vortex knot identification

As the numerical fluid evolves under the LB algorithm, we take periodic snapshots of the hydrodynamic velocity **u**(**r**). From these snapshots we compute the vorticity field **ω**(**r**) and apply our vortex knot identification (VKI) algorithm. As noted in §1, this algorithm defines a vortex core as a closed curve for which the line integral of vorticity, ∮ **ω** · d**r**, is a local maximum in the space of all closed curves in the domain. We first restate the problem by defining the negative of this quantity as a ‘free energy’ functional in the space of all closed curves,

*F*[**R**] = −∮ **ω**(**R**(*s*)) · (d**R**/d*s*) d*s*, (4.1)

and we seek the minimization of this free energy. Thus, a vortex core in a vorticity field is a closed curve that tends to pass through areas of large vorticity magnitude and whose tangent vectors are aligned with the vorticity field. This form for the free energy has the virtue of being independent of the parameterization used; it depends only on the shape of the closed curve. See Finn & Boghosian (in press) for more details.

The algorithm we use to find local minima of *F* is based on a Ginzburg–Landau (GL) equation that defines a dynamic designed to reduce this free energy by changing the curve in the negative of the direction of the Fréchet derivative (the infinite-dimensional gradient in the continuum) of *F*. The resulting GL evolution equation for **R** is

∂**R**/∂*t* = −δ*F*/δ**R**. (4.2)

Rather than discretize the continuum equation of motion, equation (4.2), we discretize the free energy and then find the resulting equation of motion. This guarantees a monotonic decrease of *F*, even in the discrete case, which is needed for numerical simulations. A natural discretization is a mid-point evaluation of the derivative,

*F* ≈ −∑_{i} **ω**((**R**_{i} + **R**_{i+1})/2) · (**R**_{i+1} − **R**_{i}). (4.3)
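A toy version of the mid-point discretized free energy, with a naive finite-difference gradient descent standing in for the analytic derivative used in the actual code (the rotational test field and the step sizes below are arbitrary choices of ours):

```python
import numpy as np

def free_energy(R, omega):
    """Discrete F = -sum_i omega(midpoint_i) . (R_{i+1} - R_i)
    for a closed curve sampled by the (N, 3) array R."""
    seg = np.roll(R, -1, axis=0) - R
    mid = R + 0.5 * seg
    return -np.sum(np.sum(omega(mid) * seg, axis=1))

def descend(R, omega, dt=1e-2, eps=1e-6):
    """One explicit gradient-descent step, dR/dt = -dF/dR,
    with the gradient estimated by central finite differences."""
    grad = np.zeros_like(R)
    for i in range(R.shape[0]):
        for a in range(3):
            Rp = R.copy(); Rp[i, a] += eps
            Rm = R.copy(); Rm[i, a] -= eps
            grad[i, a] = (free_energy(Rp, omega) - free_energy(Rm, omega)) / (2 * eps)
    return R - dt * grad
```

For the rotational field **ω** = (−*y*, *x*, 0), the free energy of a planar loop is minus twice its enclosed area, so a descent step enlarges the loop and strictly lowers *F*.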

Unfortunately, this choice is prone to numerical instability. We can strongly inhibit such instabilities by adding a second term to the free energy, proportional to the variance of the distance between neighbouring points,

*V* = ∑_{i} (|**R**_{i+1} − **R**_{i}| − ⟨*d*⟩)^{2}, (4.4)

where ⟨*d*⟩ denotes the mean distance between neighbouring points. This has a stabilizing effect on the evolution without changing the location of minima in the space of all curves. This is because any curve can be parameterized so that its nodes are equidistant, causing this variance term to vanish. Therefore, the effect of this term is to change the evolution towards the minima without changing the location of the minima themselves. The resulting free-energy minima are still independent of parameterization and depend only on shape. The GL evolution for **R** is then driven by the functional derivative of *F* + *λV*,

d**R**_{i}/d*t* = −∂(*F* + *λV*)/∂**R**_{i}, (4.5)

where *λ* is a weighting factor chosen to most effectively inhibit numerical instability. This modified dynamic monotonically decreases the free energy, now defined as *F* + *λV*.

Finding local minima reduces to solving a coupled system of ordinary differential equations for the positions of the points in a static three-dimensional vector field. We use fourth-order Runge–Kutta to evolve **R**_{i} and periodic cubic interpolation (Press *et al*. 2002) to evaluate the vorticity field at non-lattice sites.
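The evolution step can be sketched as follows; for brevity this sketch uses trilinear rather than the periodic cubic interpolation of the actual code:

```python
import numpy as np

def trilinear(field, pts, L=1.0):
    """Periodic trilinear interpolation of a vector field sampled on an
    n^3 lattice (field shape (n, n, n, 3)) at points pts (m, 3).
    The actual code uses periodic cubic splines; trilinear keeps the
    sketch short."""
    n = field.shape[0]
    x = (np.asarray(pts) / L * n) % n
    i0 = np.floor(x).astype(int)
    f = x - i0
    out = np.zeros((len(x), 3))
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = (np.where(dx, f[:, 0], 1 - f[:, 0])
                     * np.where(dy, f[:, 1], 1 - f[:, 1])
                     * np.where(dz, f[:, 2], 1 - f[:, 2]))
                idx = (i0 + [dx, dy, dz]) % n
                out += w[:, None] * field[idx[:, 0], idx[:, 1], idx[:, 2]]
    return out

def rk4_step(R, rhs, dt):
    """Classical fourth-order Runge-Kutta step for dR/dt = rhs(R)."""
    k1 = rhs(R)
    k2 = rhs(R + 0.5 * dt * k1)
    k3 = rhs(R + 0.5 * dt * k2)
    k4 = rhs(R + dt * k3)
    return R + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
```

For a uniform field the step is exact: each point simply translates by d*t* times the field value.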

The VKI algorithm has not yet been parallelized to run on multiple processors at once, although it has been run successfully in a task-farming scenario (see §5 on computational steering below). There are two bottlenecks for large-scale runs: the size of the vorticity field and the number of points that make up **R**. Memory constraints become an issue for large lattices of vorticity data; the spline coefficients required for three-dimensional cubic interpolation increase memory requirements by a factor of eight. Computational resources may also become a serious issue for very large numbers of points. Exploration to find local minima typically requires a large number of time-steps, and the calculation of the right-hand side of the GL equation involves a very large number of floating-point operations.

In a parallel implementation, either the points representing **R** or the vorticity field could be spread across processors. In the former case, all processors would contain the entire vorticity field and share neighbouring point data at each iteration. In the latter case, processors would simulate only the points which reside on their section of the lattice. This remains an active area of algorithmic research; the speed and efficiency of these methods have not yet been assessed.

Figure 3 shows the results of such simulations for two initially coaxial circles of vorticity in parallel planes undergoing Navier–Stokes evolution at a Reynolds number of 3200. Both the vortex cores obtained from the GL algorithm and vorticity isosurfaces at 20% of the maximum value are rendered. The vortex core obtained for each snapshot was used as the initial guess for the position of the vortex core in the following snapshot.

It is often the case that a vorticity field will have multiple local free-energy minima. In such situations, the different minima are often individually interesting and may be colour coded by their value of the free energy. One can search for these by making many runs with a wide variety of initial-condition loops. If only the global minimum is desired, a Langevin evolution may be employed, wherein a small amount of random noise is added with slowly decreasing amplitude in order to anneal the simulation to the desired global minimum of the free energy.
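A toy illustration of such Langevin annealing on a one-dimensional double-well ‘free energy’ (the potential, schedule, and step sizes here are arbitrary choices of ours, not taken from the code):

```python
import numpy as np

def langevin_anneal(grad, x0, steps=4000, dt=1e-3, T0=1.0, seed=0):
    """Gradient descent plus random noise whose amplitude decays slowly
    to zero, annealing toward the global minimum rather than the
    nearest local one. With T0 = 0 this is plain gradient descent."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for t in range(steps):
        T = T0 * (1.0 - t / steps)   # slowly decreasing noise amplitude
        x = x - dt * grad(x) + np.sqrt(2 * T * dt) * rng.standard_normal(x.shape)
    return x

# gradient of the tilted double well (x^2 - 1)^2 - x:
# local minimum near x = -0.84, global minimum near x = +1.1
grad = lambda x: 4 * x * (x * x - 1) - 1.0
```

Started in the basin of the local minimum, plain descent stays there, while the annealed evolution typically escapes over the barrier and settles near the global minimum.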

## 5. Computational steering and visualization

Computational steering refers to interaction between the user and a running simulation; it has been essential to the success of our vortex core searches. Without human control, all parameters of a simulation must be known when the program starts. Even simple choices, such as how often to collect data or when to create checkpoints, cannot be changed after execution has begun without some form of steering.

Both the LB and VKI codes are steerable via a TCP/IP socket, presenting the user with a command line interface. The user can connect to a simulation from anywhere in the world and send commands; the simulation sends back any results in plain text format or files. This simple text-based interface could be extended to a graphical style, like the RealityGrid steering client (Brooke *et al*. 2003; Chin *et al*. 2003).

Once connected, a user has several commands available. Commands currently include pausing, resuming or halting a simulation. Users can view and set simulation variables and parameters, and set alarms to pause a simulation when values exit a specified range. Exposed functions, which typically manipulate data, manually load or checkpoint datasets, or create visualizations, can be called with arguments passed from the command line. All of these commands enable a high level of interactivity with a simulation. Commands are written in such a way as to prevent the user from attempting an action that is not supported. For example, although the size of the lattice is a visible parameter, its modification requires a lattice rescaling and so it cannot be set directly; in the future, however, we plan to make even the size of the lattice steerable, as described in §6 below.
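The command handling might be dispatched along these lines (a hypothetical sketch; the command names and parameters are illustrative, and the TCP/IP socket layer is omitted):

```python
import shlex

class Steerer:
    """A minimal text command dispatcher of the kind that might sit
    behind a steering socket. Parameter names are illustrative only."""

    def __init__(self):
        self.params = {"viscosity": 0.01, "snapshot_interval": 100}
        self.paused = False

    def handle(self, line):
        cmd, *args = shlex.split(line)
        if cmd == "pause":
            self.paused = True
            return "paused"
        if cmd == "resume":
            self.paused = False
            return "running"
        if cmd == "get":
            return str(self.params[args[0]])
        if cmd == "set":
            if args[0] not in self.params:
                # refuse actions that are not supported, as the text describes
                return "error: unknown parameter " + args[0]
            self.params[args[0]] = float(args[1])
            return "ok"
        return "error: unknown command " + cmd
```

Replies are plain text, so the same dispatcher serves both a command-line client and, in principle, a graphical front end.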

Currently, there are two modes in which simulations are steered. The MRT LB code is designed to run on *N* processors, with every processor working in conjunction with every other. The VKI code may also run on *N* processors, but each processor may be self-contained, seeking a local minimum for its own curve **R** in its own vorticity field. In this type of simulation, the processors do not communicate with each other, but steering is still extremely useful: a user can search parameter space much more quickly by steering multiple simulations simultaneously with batch steering commands than by steering them individually.

Vortex core identification happens quickly enough that steering combines well with real-time visualization. A checkpoint for the VKI code is extremely small (about 15 kB), so hundreds can be moved from a computing cluster to a rendering machine in essentially no time at all. The static vorticity field can be viewed through isosurfaces while an evolving curve is overlaid. This kind of real-time visualization with user feedback allows the observable effects of steerable parameters to be seen. Commonly observed features include curve exploration, in which **R** expands into areas of large vorticity magnitude, and deceleration, which occurs when the curve approaches a local minimum and the free energy decreases only very slowly. Even the birth of a numerical instability is usually visible for a brief period of time before it crashes the simulation, as shown in figure 4, and an alert steerer may try to vary parameters, such as the weighting factor *λ* described above, in order to mitigate the instability.

Future methods of visualization would be greatly aided by immersive techniques (Marbach & Gruchalla 2004). A two-dimensional projection of a three-dimensional domain—especially one with vector fields—is extremely unwieldy for the conveyance of useful information. Moreover, immersive visualizations allow for things like surgery on evolving curves when components cross, a useful means to test the stability of local minima.

## 6. Lattice remapping

We have developed a remapping library that provides three significant functions: redistributing (remapping) the lattice data across the available processors in a multiprocessor environment, performing a fully three-dimensional Fourier transform of the lattice data and dynamically resizing the lattice. The second and third functions depend heavily on the first, whose implementation we describe below.

First, the layout of the lattice data across the processors is defined by what we term the *lattice decomposition*, which is made up of consecutive, non-overlapping *decomposition nodes*. A node is defined as a rectangular parallelepiped portion of the lattice, the data in which reside on, *at most*, one processor in the multiprocessor cluster. With processors numbered consecutively starting at 0, we denote a node belonging to processor *N* as *N*_{i}, where *i* indexes the nodes of each processor. Note that a node may contain data not residing on any processor (*N*<0, see below), and that a single processor may have data in more than one node (*N*_{1}, *N*_{2}, …). Both of these points are crucial for the resizing function described below.

Upon performing a remap between an old decomposition *A* and a new decomposition *B*, a processor *N* calculates the overlap between each of its nodes in the old decomposition, *N*_{i}∈*A*, and each of the nodes in the new decomposition, *M*_{j}∈*B*, *M*=0, 1, …, *j*=0, 1, …; these intersections *N*_{i}∩*M*_{j} are the blocks to send. The blocks to be received are calculated similarly as the intersections *M*_{j}∩*N*_{i}, with *M*_{j}∈*A*, *N*_{i}∈*B*, *M*, *i*, *j*=0, 1, 2, …. Each processor then sends all of its outgoing blocks and receives all of its incoming blocks via the message passing interface and constructs its new portion of the lattice from the received blocks.
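The overlap calculation amounts to intersecting axis-aligned boxes. A sketch, with nodes given as (low-corner, high-corner) tuples and negative owners marking data to be discarded (this representation is our own, not the package's):

```python
def overlap(node_a, node_b):
    """Intersection of two rectangular decomposition nodes, each given
    as a (lo, hi) pair of corner tuples; None if they do not overlap."""
    lo = tuple(max(a, b) for a, b in zip(node_a[0], node_b[0]))
    hi = tuple(min(a, b) for a, b in zip(node_a[1], node_b[1]))
    if all(l < h for l, h in zip(lo, hi)):
        return (lo, hi)
    return None

def blocks_to_send(my_old_nodes, new_decomposition):
    """For each node this processor holds in the old decomposition,
    intersect with every node of the new decomposition; the non-empty
    overlaps are the blocks to send to each owning processor.
    new_decomposition is a list of (owner, node) pairs."""
    sends = []
    for n in my_old_nodes:
        for owner, m in new_decomposition:
            b = overlap(n, m)
            if b is not None and owner >= 0:   # owner < 0: data to be discarded
                sends.append((owner, b))
    return sends
```

The receive list is built symmetrically by intersecting the other processors' old nodes with this processor's new nodes.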

Performing a Fourier transform on a three-dimensional lattice across multiple processors is involved, but we make use of the fact that such a transform is equivalent to a one-dimensional transform in each of the three dimensions separately. Since highly optimized routines exist to execute transforms on a single processor, we first remap (if necessary) to a so-called *pencil decomposition* so that some dimension is entirely on-processor (see figure 1). We perform the Fourier transform along this dimension for each coordinate pair in the other two dimensions, then remap to a so-called *slab decomposition* (see figure 1) and perform a two-dimensional on-processor transform in the remaining two dimensions.
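On a single processor, the staged transform can be checked against a direct three-dimensional FFT (the remap between the two stages is omitted here):

```python
import numpy as np

def fft3_staged(data):
    """Three-dimensional FFT done as a 1-d transform along z (the
    'pencil' stage) followed by a 2-d transform in x and y (the 'slab'
    stage). In the parallel code a remap sits between the two stages;
    here both act on the same local array."""
    pencil = np.fft.fft(data, axis=2)          # on-processor dimension
    return np.fft.fft2(pencil, axes=(0, 1))    # remaining two dimensions
```

Because the multidimensional discrete Fourier transform factorizes into one-dimensional transforms, the staged result agrees with `np.fft.fftn` to rounding error.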

In order to resize the lattice, we utilise the Fourier transform and complete the size manipulation in Fourier space. To decrease the lattice site resolution, we remove the highest Fourier modes, thus preserving all conserved quantities (which are global sums and therefore functions only of the zeroth Fourier mode) and retaining the vector space characteristics down to the new scale. Similarly, to increase resolution, we need only add higher modes to the existing data, which we initially set at zero. More sophisticated variants of this will employ extrapolations of the power spectrum with a random phase approximation for the added modes.
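A serial sketch of this Fourier-space resizing for a periodic scalar field (mode bookkeeping via `fftshift`; the actual code manipulates the decomposition nodes instead of shifting arrays):

```python
import numpy as np

def resize_lattice(field, n_new):
    """Resize a periodic n^3 scalar field via Fourier space: keep (or
    zero-pad around) the lowest modes. The zeroth mode, and hence every
    global sum, is preserved by the volume rescaling applied below."""
    n = field.shape[0]
    f = np.fft.fftshift(np.fft.fftn(field))      # move highest modes to the edges
    out = np.zeros((n_new,) * 3, dtype=complex)
    m = min(n, n_new)
    lo_s, lo_d = (n - m) // 2, (n_new - m) // 2  # centred windows in source/dest
    out[lo_d:lo_d + m, lo_d:lo_d + m, lo_d:lo_d + m] = \
        f[lo_s:lo_s + m, lo_s:lo_s + m, lo_s:lo_s + m]
    out = np.fft.ifftn(np.fft.ifftshift(out)) * (n_new / n) ** 3
    return out.real
```

Shrinking discards the highest modes, enlarging pads with zero modes, and in both directions the mean of the field (the zeroth Fourier mode) is unchanged; a band-limited field even round-trips exactly.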

To implement the resizing once in Fourier space, we again make use of the remap function. To remove lattice data, we partition the decomposition in such a way that the lattice sites we wish to retain are in separate nodes from those we wish to discard. This may involve splitting existing nodes, resulting in the case where a single processor's data are represented by multiple nodes. Once the decomposition is partitioned, we mark the processor number of the nodes we wish to dispose of with a special negative value and remap to a new decomposition. During the remap, processors ‘know’ that nodes with this particular value are to be ignored and the data are automatically discarded.

Increasing the lattice resolution happens in a similar manner. We partition the decomposition along planes where we will need to add data (many optimized fast Fourier transform routines place the highest Fourier modes in the middle of the array, so we need to partition the lattice through the centre in each dimension). Without yet adding lattice data, we add nodes with a different special negative value to the decomposition in the locations where data are to be added; processors which receive data from one of these nodes fill the incoming block with zeros during the subsequent remap.

All of the above functionality has been implemented in a multiprocessor environment using a message passing interface for interprocessor communication. Grid enablement, so that it may work across geographically distributed computational resources, is work in progress.

## 7. Conclusions

We have described a five-stage process used to test a new algorithm for the identification of vortex core knots and links in Navier–Stokes fluids. These stages have been coded as modules in a larger software package that allows for numerical experiments on vortex core identification and evolution. We are able to generate vorticity fields with easily identifiable vortex cores. Using a three-dimensional Fourier transform and remap, we recover the corresponding velocity field. We evolve the velocity field using an LB algorithm, extracting the vorticity field at regular intervals. For each field, we run the vortex core identification algorithm, clearly identifying the location of the core. Computational steering and visualization throughout the entire process allow the user to interact with the simulation in real time.

## Acknowledgments

We are grateful for helpful conversations with Peter Coveney, Jonathan Chin and Jamie Vicary, and for the hospitality of the Centre for Computational Science at University College London during our visit in the summer of 2003 (B.M.B. and L.I.F.) and 2004 (B.M.B. and C.N.K.). B.M.B. is grateful to the EPSRC for his visiting fellowship at UCL under the RealityGrid programme. B.M.B. also received support during the course of this work from AFOSR grant F49620-01-1-0385. Computer time at the Pittsburgh Supercomputer Center was provided under the NSF PACI and NRAC programmes.

## Footnotes

One contribution of 27 to a Theme ‘Scientific Grid computing’.

- © 2005 The Royal Society