## Abstract

If you seem to be able to do data assimilation with uncertain static parameters then you are probably not working in environmental science. In this field, applications are often characterized by sensitive dependence on initial conditions and attracting sets in the state-space, which, taken together, can be a major challenge to numerical methods, leading to very peaky likelihood functions. Inherently stochastic models and uncertain static parameters increase the challenge.

## 1. Background

Three years ago, I started a collaboration with Michel Crucifix at the Université catholique de Louvain, on glacial cycles. Our intention was to use simple ‘phenomenological’ models [1], which would be calibrated against measurements from ice core and benthic datasets, and used to address questions such as Ruddiman’s Early Anthropogenic hypothesis (reviewed in [2]). I thought we would use an approach suggested by Liu & West [3], which was a sophisticated treatment of data assimilation with uncertain static parameters (defined below), in which the state vector is augmented with the static parameters.

One year later, I was discussing our lack of progress with Prof. Christophe Andrieu, an authority on statistical data assimilation. He was not at all surprised, because in his view the problem was, in general, ‘intractable and unsolved’. Regardless, I simply upped the computational sophistication of our methods, including switching to his own recent development, particle Markov chain Monte Carlo (P-MCMC, [4]) and running on a faster computer. Now, another two years on, we still have no calibration. So, chagrined, I paused to consider why our application was so difficult, and whether the same conditions affect other complex systems, notably many of the systems we study in environmental science. I concluded that our application seems to have all of the features that make data assimilation tricky, and I describe these in this paper.

## 2. Outline of notation and concepts

This is the vanilla version of the problem; some complications will be considered in §5. Consider a dynamical system with state vector , for which we would like to learn for some set of times . If is not countable then in general we do not expect to be able to represent ** X** exactly, but for our purposes it can be approximated by linear interpolation on a fine grid of time points discretizing .

We model ** X** using the stochastic differential equation
2.1a
where

*f*(⋅) will be termed the

*drift model*, and

*Q*

^{T}°d

*W*the

*diffusion*. The drift model is treated as known, but the parameters (

*θ*

_{0},

*Q*) are uncertain. For simplicity, we will focus on

*θ*

_{0}and treat

*Q*as known and time-invariant, although the particular difficulties of an uncertain

*Q*will be mentioned in §4.

Pragmatically, the use of a known functional form and uncertain parameters is a device to constrain functional uncertainty to a well-defined and finite collection of quantities. It is necessary to include both an uncertain *θ*_{0} and the diffusion *Q*^{T}°d*W* to represent uncertainty fully, because *θ* cannot in general be tuned to make the solution based on *f*(⋅) alone an exact match to ** X**. Often this is because equations are missing from the drift model; approaches for replacing truncated equations with stochastic terms are discussed in, for example, Penland [5]. Not knowing

*θ*

_{0}is termed

*parametric uncertainty*, while the diffusion represents

*structural uncertainty*—this is the uncertainty about

**that cannot be tuned away by adjusting the parameters. A third source of uncertainty is**

*X**input uncertainty*, discussed in §5.

In this model for ** X** there will be components of

*X*(

*t*) that are operationally defined, corresponding to the quantities that we wish to learn, but there may also be additional instrumental components whose main purpose is to allow us to express our judgements about the evolution of the operationally defined components more easily. So, in fact it would be more accurate to say, in the opening paragraph, that we wish to learn a subset of

*X*(

*t*) for .

To assist us in learning about ** X** there is a collection of measurements
made at known times , and with known likelihood function (in the vanilla version)
2.1b
which has the standard conditional independence properties in which each measurement is conditionally independent of everything else given the state vector at its measurement time. Figure 1 shows the standard representation of this statistical model in the form of a conditional independence graph.

The model is completed by specifying distributions for *X*_{0}:=*X*(*t*_{0}) and *θ*_{0}, usually in the form of the probability density function
2.1c
as it would be usual for the distribution of *X*_{0} to depend on *θ*_{0}. One standard approach for drift models with attractors (see §3*b*) is to ‘spin-up’ the state vector from a specified or random value *x*(*t*_{0}−*κ*), where *κ* is large, and then the spun-up value at time *t*_{0} is taken to be a realization from *π*(*x*_{0}|*θ*); effectively, one is sampling from the invariant measure of the drift model, which will depend on the value of *θ*. Often a plug-in value, say, will be used in place of the unknown *θ*_{0}, in which case, notionally, , where *δ*(⋅) is the Dirac delta function.

In Bayesian statistical terms, we want to use (2.1) to compute the conditional distribution 2.2 I take it for granted that a Bayesian solution is preferred, because the dominant source of uncertainty in complex systems is epistemic, not aleatory. But I do not for one moment believe that a Bayesian solution can capture all of my uncertainty, and am concerned that even to attempt such a thing gives a false impression [6,7]. I do, however, believe that a Bayesian approach will do better at representing epistemic uncertainty than the rather simpler and tractable non-Bayesian approaches that have been adopted in, for example, palaeoclimate reconstruction (see §5).

Approximating or sampling from (2.2) is the challenge of *data assimilation* (*smoothing*) *with uncertain static parameters.* Integrating over *θ* in (2.2) would give the *smoothing distribution*, *π*(** x**|

*z*

^{obs}), which is used for reconstruction and prediction. Integrating over

**in (2.2) would give the**

*x**calibration distribution*,

*π*(

*θ*|

*z*

^{obs}), which is used for studying the structure of the drift model, and for inference about systems other than

**which are like**

*X***, in that they share common parameters with**

*X***. This type of inference often constrains the choice of drift model, which must be realistic enough to encompass systems that are different from**

*X***in some implementable way.**

*X*For example, Ruddiman’s hypothesis [2] concerns the sensitivity of glacial cycles to the level of atmospheric greenhouse gases in the mid-Holocene. As these are a component of the state vector of a simple dynamical model of glacial cycles (e.g. [8]), we could assimilate (proxy) observations on ice volume and atmospheric *CO*_{2} into such a model, and then consider the marginal distribution *π*(*x*(*t*_{k}),*θ*|*z*^{obs}), where *t*_{k} is a point in the mid-Holocene, such as 8 kyr BCE, somewhere in the middle of the assimilation interval. By simulating forwards in time from this distribution we could quantify the stability of the glacial cycle in the mid-Holocene, to determine whether a small perturbation, such as might have been caused by human activity, could have substantially delayed reglaciation. In this case, the actual trajectory *z*_{k},*z*_{k+1},… would be just one realization of a wide fan of possible outcomes. The width of the fan alone represents the sensitivity of the Earth system at time *k*. Ruddiman’s hypothesis goes further, and asserts that the actual trajectory, which represents human intervention, is an outlier on the fan, with most of the other trajectories heading towards reglaciation.

I also mention, for completeness, the simpler data assimilation approach termed *filtering*, which constructs the sequence of distributions
2.3
This is useful for prediction forward from time *t*_{n}, at which point all of the information in the measurements has been incorporated. However, the probabilistic reconstruction of (*X*(*t*_{i}),*θ*_{0}) at *t*_{i}<*t*_{n} will not include all available relevant information. So, for example, filtering up to time *t*_{k} and then predicting forwards provides a poorer assessment of Ruddiman’s hypothesis than we would derive from using the smoothing distribution, because the joint distribution of (*X*(*t*_{k}),*θ*_{0}) is not conditioned on any of the measurements made after time *t*_{k}. Measurements at the times that immediately follow *t*_{k} are particularly useful for constraining *X*(*t*_{k}), but all measurements are useful for constraining *θ*_{0}.

*Notation.* Here and below, I adopt the standard statistical convention that capital letters indicate uncertain quantities, and small letters indicate particular values. Exceptions are for the parameters, where *θ*_{0} is used in preference to *Θ*, and *Q* which is written as a capital letter because it is a matrix. Generic functions such as the probability density function *π*(⋅) are indexed by their arguments.

## 3. The simplest case is already hard

Where are the difficulties in smoothing and calibration? In this section, suppose that the statistical model (2.1) is appropriate, *θ*_{0} is known, and *Q*=**0**. This is a standard inference, known in meteorology as 4*D-VAR with a strong constraint* (see [9,10]). ‘Strong’ here indicates that the drift model *f*(⋅) is treated as correct. The basic conclusion of this section is that even in this very special case, many environmental applications are already intractable.

Under these restrictions, the only thing to be learnt is *X*_{0}:=*X*(*t*_{0}), as the distribution of *X*_{0} implies a distribution for ** X**. In this special case, write the trajectory function of the drift model as
3.1
and the likelihood function becomes
3.2
Initially, I will just focus on the maximum likelihood (ML) estimator of

*X*

_{0}but the same issues apply in a Bayesian analysis (think of ML as the basis of a Laplace approximation to the posterior distribution, with a locally uniform prior for

*X*

_{0}).

### (a) Sensitivity to initial conditions

As is usual in inference, it is straightforward to estimate *X*_{0} if it is a *location parameter* (or nearly one). What makes location parameters straightforward to estimate is that their Fisher information does not depend on their value; (e.g. Cox [11], p. 98). In our case, this would require that shifts in *X*_{0} move the whole trajectory by the same amount. Technically, this is only completely correct if *x*(*t*_{i}) is the location parameter in the measurement error distribution in *L*_{i}(⋅), but this is the common case (e.g. Gaussian measurements errors), possibly after transformation.

Now *X*_{0} is a location parameter if and only if
3.3
for some vector-valued function *g*(⋅). This model cannot have sensitive dependence on initial conditions. It is always approximately true if *t* is close to *t*_{0} (see §3*c*), and so difficulties only arise when the measurements span a long enough time interval for sensitive dependence to be manifested. This is where the complexity of environmental systems can cause difficulties. Not only do these systems commonly display sensitivity to initial conditions, but the measurements are often noisy, necessitating a long assimilation interval. In palaeoclimate applications, for example, the relationship between the quantity that is measured and the state component that it proxies can be rather indirect, with the intermediate stages each contributing a source of uncertainty (see §5). This difficulty is compounded by the relative crudity of many environmental models, although in this section I am treating the model as known.

So, we now have one explanation for why it is hard to do inference for these systems:

### (b) The presence of an attractor

Another property, though, also contributes: complex systems often have attractors. I am going to disregard rigour in talking of attractors; it suffices to note that attractors tend to be found in dissipative systems, and complex systems are often dissipative. I can present only a heuristic argument for how sensitive dependence on initial conditions plus an attractor makes it doubly hard to estimate *X*_{0}.

If the assimilation interval (*t*_{0},*t*_{n}] is long enough for a trajectory to approach the attractor, then the likelihood function for *X*_{0} will be extremely peaky in the region around *X*_{0}. This is because a trajectory starting in the neighbourhood of *X*_{0} will be drawn towards the attractor just as the true trajectory is drawn towards the attractor, and remain roughly in-phase with it. Therefore, it will ‘hit’ many of the measurements, even though it has the wrong initial value. But sensitive dependence on initial conditions will cause the set of measurements that are ‘hit’ by the perturbed trajectory to vary, even for small perturbations around *X*_{0}. Having more measurements exacerbates the problem. This is my interpretation of the results presented by Berliner [12,13].

An additional complication arises when not all components of the state vector are measured. In this case, the projection of the trajectory into the measured components has higher probability of running into the measurements by chance, because of a somewhat free choice for the values of the unmeasured components. An example of this would be when only the fast components of the system are measured, which can go around their projected attractor many times for one orbit of the whole system.

In general, the presence of an attractor is not a problem for ruling our bad choices of *x*_{0}, which are well away from *X*_{0}. This is because although all trajectories approach the attractor, the ones well away from *X*_{0} will be out-of-phase with the true trajectory, and so ‘miss’ the measurements. There is an exception, however. Quasi-periodic forcing can cause the trajectories to synchronize in oscillating systems, i.e. to move towards each other in the state space, as shown in figure 2. This means that the likelihood function can have pronounced peaks even well away from *X*_{0}. In our glacial cycle application, the orbital forcing, which varies solar insolation in the Northern Hemisphere, has exactly this effect, and hence our likelihood function for *X*_{0} is both extremely peaky, and also not concentrated around *X*_{0}. In fact, the situation can be more complicated still, as there may be several local pullback attractors at any given time.

The presence of many local maxima implies that the basin of attraction of the global maximum of the likelihood function, which will contain *X*_{0} with high probability, is only a small fraction of the volume of . Hence, we are unlikely to find the global maximum of the likelihood using steepest ascent from an arbitrary starting point. If we attempt to address this by using multiple starting points, then we will not succeed unless we can afford sufficient starting points that their nearest-neighbour distances are less than the diameters of the basins of attraction of the maxima; and unless has small dimension this is impractical.

To conclude:

### (c) What about meteorology?

Data assimilation seems to work in meteorology (see [14]), and yet the atmosphere, being a compressible fluid, is modelled by the Navier–Stokes equations, which are sensitive to initial conditions, and dissipative. How to resolve this paradox? The answer is that very short assimilation intervals are used, currently six hours, and also enormous resources going back over many years.

Over a short assimilation interval, we have, expanding around *t*=*t*_{0},
3.4
If *f*_{t0}(*x*;*θ*) is slowly varying in *x* around *x*_{0}, then *X*_{0} is approximately a location parameter. Furthermore, many thousands of previous cycles of data assimilation using very large collections of measurements will give a good starting point for maximizing the likelihood function, putting the first-guess initial value in the basin of attraction of *X*_{0}. The computational expense is greatly reduced by not smoothing, but *filtering*; in other words, no attempt is made to update previous values of the state vector using the latest measurements.

This contrast with meteorology shows that we make difficulties for ourselves when we want to do (i) smoothing over (ii) long assimilation intervals, with (iii) modest resources. Our application to glacial cycles has all three of these difficulties.

Judd & Smith [15] introduce a scheme for maximizing the likelihood function that seems to have the characteristics of many short-interval maximizations. They snip the time-interval up according to the timing of the measurements, and initially treat each of these snips separately, before requiring then to ‘relax’ jointly towards a trajectory in which all of the snips join up. Unfortunately, their method works best (if it works at all) when all components of the state vector are measured at each *t*_{i}, and this is seldom the case for complex systems; definitely not if some of the components of the state vector are instrumental, as described in §2.

### (d) Stochastic models

The diffusion term *Q*^{T}°d*W* in (2.1a) is primarily a representation of the limitations of the drift model as a description of how the state vector propagates from time *t* to time *t*+d*t*. Hence, setting *Q*=**0** is asserting that the drift model has negligible limitations. So relaxing *Q*=**0** is an important step in modelling complex systems, for which we often have a limited understanding of the behaviour of the state vector. In meteorology, *Q*≠**0** would be *4D-VAR with a weak constraint* [9,10]. ‘Weak’ here indicates that the drift model is not treated as correct.

In this case, we have to proceed in an explicitly Bayesian fashion, because (2.1a) is effectively a highly structured prior on ** X**, and the posterior distribution will be a conjunction of the constraints in the likelihood and the prior,
3.5
This is a generalization of the

*Q*=

**0**case in the sense that as

*Q*→

**0**, so

**becomes a deterministic function of**

*x**x*

_{0}, and

*π*(

**|**

*x**x*

_{0}) becomes a Dirac

*δ*-function.

It is easy to see that the problem becomes harder when we allow for structural uncertainty, in situations where there is sensitive dependence on initial conditions and an attractor (now a random attractor). This is implicit in the fact that this case is a generalization of the previous case, but with many more uncertain quantities.

For ease of comparison, continue to focus on learning *X*_{0}, using the integrated likelihood
3.6a
In the simplest treatment, this integral can be approximated with a finite sample of size *m*,
3.6b
For each *x*_{0}, each candidate trajectory *x*^{(j)} will typically approach the attractor, which also contains the measurements. Therefore, each candidate will run into some of the measurements by chance, exactly as before (with the problem being more acute if not all components of the state vector are measured). Because of sensitivity, small fluctuations in the sampled Brownian motion can cause large variation in the *L*(*x*^{(j)}), as different numbers of measurements are ‘hit’ by chance. This implies that the estimator will be inaccurate unless resources allow *m* to be huge. So, an accurate inference about *X*_{0} with structural uncertainty has become *m*≫1 times more expensive than the inference without.

There are of course many treatments of inference with high-dimensional latent processes that are much more sophisticated than Monte Carlo integration (see [4] and references therein), but they make a tractable problem more efficient—they do not make an intractable problem tractable. It is also worth mentioning that a useful computation strategy for the case where *Q*=**0** is to *add* a diffusion to the drift equation, which has the effect of allowing the smoother to downweight distant measurements, and so deconcentrate the likelihood function, perhaps reducing the number of local maxima. But this paper is not about computational strategies, but about the fundamental challenges. On this basis, a stochastic model has to be more challenging than a deterministic model, although, at the same time, some computational approximations will perform less badly with a stochastic model than with a deterministic one.

## 4. Uncertain static parameters

We fully expect *θ*_{0} to be uncertain; worse, it is often ill-defined, except as an instrumental quantity that we introduce in order to construct a prior distribution over ** X**. This makes it hard to specify a prior distribution

*π*(

*θ*), but in fact this is not our main difficulty.

The following approach is sometimes invoked to demonstrate that uncertain static parameters cause no additional difficulties if the data assimilation problem has been solved. (Of course, the point of this paper is that the data assimilation problem is often intractable, but put this to one side.) Simply construct an augmented state vector (*x*(*t*),*θ*(*t*)), where *θ* has the degenerate evolution equation d*θ*/d*t*=0, with *θ*(0)=*θ*_{0}.

(An aside. Gordon *et al.* [16] were the first to propose that in practice a better outcome obtains if *θ* is allowed to follow a random walk in time, of which Liu & West [3] is the most sophisticated implementation, using sequential Monte Carlo methods. A randomly time-varying *θ* implies that the evolving calibration distribution forgets old measurements, and so is unable to rule out values of *θ* that would imply a clear inconsistency between widely separated measurements. Tomassini *et al.* [17] have proposed that letting *θ* vary through time is a way to diagnose structural errors in the drift model, which will show up at times of rapid adjustment in *θ*_{0}.)

This augmentation approach suffices to highlight the difficulties. First, even if *X*_{0} behaves as a location parameter, (*X*_{0},*θ*_{0}) certainly does not. This is a general property, because changes in *θ* will change the shape and location of the attractor of the drift model; this is termed *structural instability* by McWilliams [18], and illustrated in figure 3. But it is most obviously true in the case where *Q* is treated as uncertain: *Q* is definitely not a location parameter. The difficulty is that we have almost no insight about the ‘best’ value of *Q* in a phenomenological stochastic differential equation, and so treating *Q* as uncertain, to be calibrated using measurements, is almost mandatory.

Second, even if all of the components of *X*(*t*) are measured, *θ*_{0} is not measured, and, as already noted, having unmeasured components in the state vector can increase the number of local maxima in the likelihood function.

An additional computational problem occurs when *θ*_{0} is uncertain, which is that the initial state *X*(*t*_{0}) must be spun-up afresh for every different candidate value for *θ*_{0}, when sampling from *π*(*x*(*t*_{0}),*θ*).

Therefore, including uncertain static parameters increases the difficulty of the inference, by introducing or enhancing the complications that have already been identified in §3.

## 5. Not the vanilla version

Here is a brief list of some of the main ways in which many inferences in environmental science do not conform to the vanilla version outlined above: these are all additional complications.

The illustrations are from palaeoclimate reconstruction, for which more details of the measurements and the methods can be found in Jones *et al.* [19]. When these additional complications are added to the intractability of palaeoclimate reconstruction (climate definitely has sensitive dependence on initial conditions and an attractor), that enterprise must be seen as heroic in the extreme, and we must expect the uncertainties to be very large indeed. But, somewhat surprisingly, they are not; e.g. as shown in the celebrated hockey stick, which was used so much in the Third Assessment Report of the IPCC [20]; Montford [21] provides a readable if slightly hair-raising account.

The involvement of statisticians in palaeoclimate reconstructions has increased the smoothing uncertainty, but not as much as I would expect; see Smith [22] and McShane & Wyner [23] and the ensuing discussion. I think the main reason for this is the reversal of the dominant causal direction in the statistical modelling: most reconstructions are found by regressing climate on the proxies, whereas the dominant causal direction is from the climate to the proxies. (The point about the Bayesian approach, for example as outlined in this paper, is that one models in the natural direction *from* the system *to* the measurements, and lets the statistics handle the inversion that is necessary for the reconstruction.) The problems of this ‘inverse’ approach are discussed in Braak [24], and summarized in Rougier [25]. Haslett *et al.* [26] is an example of a Bayesian reconstruction which goes in the right direction, using data assimilation; see also the exploratory study by Li *et al.* [27].

*Measurement times not precisely known*. Many palaeoclimate archives are indexed by depth rather than time, for which the age–depth relationship is uncertain. For example: ice cores, lake and benthic sediments, speleothems, boreholes. This can be handled, to some extent, by including depth as an additional component of the state vector, and allowing depth to evolve stochastically (care must be taken to ensure that it evolves monotonically). Measurements on depth are occasionally available for some archives using carbon-dating of organic material, or recognizable tephra layers that correspond to well-dated volcanic eruptions. Unfortunately, the deposition rate will vary spatially, and so one depth sequence will be required for each set of measurements from a common location. This will involve many additional uncertain quantities.

*Uncertain parameters in the likelihood equations*. Notably when a set of measurements has been made by the same instrument, which may introduce a common but unknown source of bias. A recently identified example is the need to correct sea-surface temperature measurements for a change in the balance between engine temperature readings (often biased warm) and uninsulated bucket readings (often biased cold) at the end of the Second World War [28].

A much more complicated situation is where the effect of the state vector on the measurement is imperfectly known. In palaeoclimate this effect would be captured in a *forward model* from the climate state vector to the representation of the sensor in the archive. For example, the impact of climate on the relative abundance of plant species, the deposition of the pollen of individual members of the various species in lakes (directly or indirectly), the interring of the pollen in lake sediment and bioturbation (mixing of the sediment by plants and animals), and then all the stages that occur in the extraction and processing of the sediment core, up until the point where the palynologist counts pollen fragments on a microscope slide. In this sequence, there will be many uncertainties, and hence many parameters in the likelihood function for pollen assemblages, and their cumulative effect might be considerable.

*Uncertain forcing*. The drift model *f*(⋅) is often a function of external and time-varying forcing, and might better be written in the autonomous (i.e. time-invariant) form
5.1
where *y*(*t*) is the forcing at time *t*. In the simple case, *y*(*t*) is known for all . In this case, it has a degenerate evolution equation, and is in effect simply plugged in. More generally, though, if *Y* (*t*) is unknown then it needs to be added to the state vector, and its measurements added to the likelihood function. For example: in palaeoclimate reconstruction, the effects of varying solar insolation and volcanic activity are only partially known. The forcing measurements are likely to suffer from uncertain measurement times, being extracted from depth-indexed archives, although recent large volcanic eruptions can be accurately dated using other records.

## 6. Final thoughts

First, I should warn against what appears to be a counsel of despair. There are toy examples where data assimilation appears to work well, and where augmenting the state vector with the uncertain static parameters, or doing something more sophisticated such as P-MCMC appears to allow calibration as well as smoothing. Presumably some real-world applications are almost as simple, perhaps in finance, but probably not in environmental science, excepting some special cases.

The challenging applications are those with models having sensitive dependence on initial conditions and an attractor, and a long enough assimilation interval that both of these are a feature. In that case simple data assimilation is already difficult, owing to the many local maxima in the likelihood function. Many of these local maxima will occur in regions of very low posterior probability, but they are a major impediment to numerical methods, which have to fight their way through a thicket of local extrema, with little indication of the best direction for global increase. Unmeasured components in the state vector, quasi-periodic forcing, a stochastic model and uncertain static parameters all make the inference harder.

Unfortunately, our glacial cycles application has every one of the complicating features, plus the difficulties described in §5. This does not mean that we cannot reconstruct glacial cycles using data assimilation, but it does mean that we cannot apply an off-the-shelf method and expect it to work, unless we have access to massive computing resources.

If we are to be successful in our application it will be through applying physical insights to tune the data assimilation method, proceeding sequentially, and focusing initially on those aspects of the system that—we hope—can at least rule out bad candidates for *X*_{0} and *θ*_{0}. This technique has been formalized in the *history matching* approach of Craig *et al.* [29], and has recently been successfully applied in a large astronomical application [30]. A crucial aspect of this approach is the replacement of a formal likelihood function with a set of summary statistics, expertly chosen to be particularly informative about crucial aspects of the system: aspects which are both important in the reconstruction, and also ‘constrainable’ by the measurements. This has been a developing theme in statistics, as represented in the ‘indirect approach’ used in econometrics [31], and also approximate Bayesian computation [32,33].

## Acknowledgements

I would like to thank the following for their generous contributions of time and insight: Michel Crucifix and members of the ITOP project (http://www.uclouvain.be/itop); Christophe Andrieu, Nick Whiteley, and my MCMC-ing colleagues at Bristol; Caitlin Buck, John Haslett, and members of the SUPRANET network; two reviewers, whose insights substantially improved the clarity of my arguments (any obscurity that remains is my responsibility!); and readers of the Bishop Hill blog (http://www.bishop-hill.net/), who corrected some sloppy writing in an earlier draft of the introduction to §5. I would also like to thank the Isaac Newton Institute in Cambridge, and Nordita in Stockholm, for hosting me at various times over 2010–2011, workshop participants at both those institutes, and participants at an IMA conference on the mathematics of the climate system in September 2011, and a SuSTaIn workshop on intractability in Bristol in April 2012. And finally I would like to thank the University of Bristol for a Research Fellowship which made these visits possible.

## Footnotes

One contribution of 13 to a Theme Issue ‘Mathematics applied to the climate system’.

- © 2013 The Author(s) Published by the Royal Society. All rights reserved.