## Abstract

Simple chaotic systems are useful tools for testing methods for use in numerical weather simulations owing to their transparency and computational cheapness. The Lorenz system was used here; the full system was defined as ‘truth’, whereas a truncated version was used as a testbed for parametrization schemes. Several stochastic parametrization schemes were investigated, including additive and multiplicative noise. The forecasts were started from perfect initial conditions, eliminating initial condition uncertainty. The stochastically generated ensembles were compared with perturbed parameter ensembles and deterministic schemes. The stochastic parametrizations showed an improvement in weather and climate forecasting skill over deterministic parametrizations. Including a temporal autocorrelation resulted in a significant improvement over white noise, challenging the standard idea that a parametrization should only represent sub-gridscale variability. The skill of the ensemble at representing model uncertainty was tested; the stochastic ensembles gave better estimates of model uncertainty than the perturbed parameter ensembles. The forecasting skill of the parametrizations was found to be linked to their ability to reproduce the climatology of the full model. This is important in a seamless prediction system, allowing the reliability of short-term forecasts to provide a quantitative constraint on the accuracy of climate predictions from the same system.

## 1. Introduction

The central aim of any atmospheric parametrization scheme must be to improve the forecasting skill of the atmospheric model in which it is embedded and to represent better our beliefs about the future state of the atmosphere, be this the weather in 5 days’ time or the climate in 50 years’ time. One aspect of this goal is the accurate representation of uncertainty: a forecast should skilfully indicate the confidence the forecaster can have in their prediction. There are two main sources of error in atmospheric modelling: errors in the initial conditions and errors in the model’s representation of the atmosphere. The ensemble forecast generated should explore these uncertainties, and a probabilistic forecast should be issued to the user [1]. A probabilistic forecast is of great economic value to the user as it allows reliable assessment of the risks associated with different decisions, which cannot be achieved using a deterministic forecast [2].

Uncertainties in initial conditions, owing to limited spatial and temporal distribution of atmospheric measurements, are represented by perturbing the initial conditions of different ensemble members, for example using singular vectors [3], and tracking their subsequent evolution. Model error arises owing to the computational representation of the equations describing the evolution of the atmosphere. The atmospheric model has a finite resolution, and sub-gridscale processes must be represented through schemes that often grossly simplify the physics involved. For each state of the resolved macroscopic variables, there are many possible states of the unresolved variables, so this parametrization process is a significant source of model uncertainty. The large-scale equations must also be discretized in some way, which is a secondary source of error. If only initial condition uncertainty is represented, the forecast ensemble is underdispersive, i.e. it does not accurately represent the error in the ensemble mean (e.g. Stensrud *et al.* [4]). The verification frequently falls outside of the range of the ensemble; model uncertainty must be included for a skilful forecast.

Accurate representation of model and initial condition uncertainties will ensure that the resultant ensemble forecast is *reliable*. This refers to the consistency, when statistically averaged, of the forecast and measured probabilities of an event [5]. If a forecasting system is reliable then, from the very definition of reliability, it must have the correct climatology [6]. This motivates the use of reliability of shorter term forecasts as a test for climatological prediction skill. Palmer *et al.* [7] use the reliability of seasonal forecasts to calibrate climate change projections, particularly concerning precipitation. They argue that it is a necessary requirement that a climate model gives a reliable weather forecast, though this requirement is not sufficient as numerical weather prediction (NWP) models do not include the longer time-scale physics of the cryosphere and biosphere that are nevertheless important for accurate climate prediction.

Several ways of representing model uncertainty have been proposed. Multi-model ensembles (MMEs), such as the Coupled Model Intercomparison Project, phase 3 [8], allow for a pragmatic representation of model uncertainty. While MMEs have been shown to perform better than the best single model in the ensemble [9], they are not able to represent systemic errors that are potentially common to all deterministically parametrized models, and the superensemble remains underdispersive. Furthermore, the individual models in an MME are not independent, and the effective number of models in the ensemble is far lower than the total number of models [10,11]. This lack of diversity adversely affects how well an MME can represent model uncertainty.

A large source of forecast model error is the assumptions built into the parametrization schemes. The model uncertainty from these assumptions can be explored using several different parametrization schemes (*multi-parametrization*) to generate an ensemble of forecasts [12], though such approaches likely suffer from systemic deficiencies. Alternatively, the perturbed parameter approach takes uncertain parameters in the parametrization schemes and varies their values within their physical range. Perturbing parameters gives a greater control over the ensemble than the multi-parametrization approach [13], though this approach also does not explore structural or systemic errors, as a single base model is used for the experiment.

In this study, stochastic parametrizations are investigated as a way of accurately representing model uncertainty. In their seminal paper, Nastrom & Gage [14] show the presence of a shallow power-law slope in the kinetic energy spectrum in the atmosphere, with no observed scale separation. This lack of scale separation allows errors in the representation of unresolved, poorly constrained small-scale processes to infect the larger scales [15]. Therefore, a one-to-one mapping of the large scale on to the small-scale variables, as is the case in a deterministic parametrization, seems unjustified. A stochastic scheme in which random numbers are included in the computational equations of motion acknowledges this limitation, and an ensemble generated by repeating a stochastic forecast gives an indication of the uncertainty inherent in the parametrization process. A stochastic parametrization must be viewed as providing possible realizations of the sub-gridscale motion, whereas a deterministic parametrization represents the average of all the possible sub-gridscale effects.

There has been significant interest in stochastic parametrization schemes in recent years, and they have been shown to perform well in simple systems such as the Lorenz ’96 system [16–18]. Coarse-graining studies [19] allow for the development of physically motivated schemes such as the stochastically perturbed parametrization tendencies (SPPTs) and stochastic kinetic energy backscatter schemes operationally in use in the European Centre for Medium-Range Weather Forecasts (ECMWF) NWP system [20]. Their inclusion has resulted in a reduction of model biases, and a significant improvement in forecast reliability [20–22]. A further example of a physically motivated scheme is the stochastic multi-cloud model for parametrization of tropical convection proposed by Frenkel *et al.* [23]. The resultant cloud structure and wave features compare favourably with cloud resolving model simulations, especially when compared with the performance of a deterministic parametrization. Stochastic schemes have also been found to be beneficial in climate simulation and analysis [24].

The need for stochastic parametrizations has been motivated by considering the requirement that a forecast ensemble should include an estimate of uncertainty owing to errors in the forecast model. This experiment seeks to test the ability of a stochastic parametrization scheme to represent this model uncertainty skilfully. Therefore, perfect initial conditions are used for all ensemble members, removing initial condition uncertainty. The only source of uncertainty in the forecast is model error from truncation and imperfect parametrization of the unresolved ‘*Y* ’ variables in the Lorenz ’96 system, and from errors in the time-stepping scheme. The spread in the forecast ensemble is generated purely from the stochastic parametrization schemes, so the ability of such a scheme to represent model uncertainty can be rigorously tested. The ability to distinguish between model and initial condition uncertainty can only take place in an idealized ‘toy model’ setting. However, in Wilks [16] and Crommelin & Vanden-Eijnden [17], model uncertainty is not distinguished from initial condition uncertainty in this way, so the ability of stochastic parametrizations to represent model uncertainty was not investigated. In §2, we describe the Lorenz ’96 system used in this experiment, and in §3 we describe the stochastic schemes tested. Sections 4–6 discuss the effects of stochastic schemes on short-term weather prediction skill, reliability and climatological skill, respectively. Section 7 discusses experiments with perturbed parameter ensembles and §8 draws some conclusions.

## 2. The Lorenz ’96 system

There are many benefits of performing proof of concept experiments using simple systems before moving to a general circulation model or NWP model. Simple chaotic systems are transparent and computationally cheap, but are able to mimic certain properties of the atmosphere. Crucially, they also allow for a robust definition of ‘truth’, important for development and testing of parametrizations and verification of forecasts. The Lorenz ’96 system used here was designed by Lorenz [25] to be a ‘toy model’ of the atmosphere, incorporating the interaction of variables of different scales. It is, therefore, particularly suited as a testbed for new parametrization methods that must represent this interaction of scales. The second model proposed in Lorenz [25], henceforth, the L96 system, describes a coupled system of equations for two types of variables arranged around a latitude circle (figure 1; [25,16]),
$$\frac{\mathrm{d}X_k}{\mathrm{d}t} = -X_{k-1}\left(X_{k-2} - X_{k+1}\right) - X_k + F - \frac{hc}{b}\sum_{j=J(k-1)+1}^{kJ} Y_j \tag{2.1a}$$

and

$$\frac{\mathrm{d}Y_j}{\mathrm{d}t} = -cb\,Y_{j+1}\left(Y_{j+2} - Y_{j-1}\right) - c\,Y_j + \frac{hc}{b}X_{\lceil j/J\rceil}, \tag{2.1b}$$
where the variables have cyclic boundary conditions: *X*_{k+K}=*X*_{k} and *Y* _{j+JK}=*Y* _{j}. The *X*_{k} variables are large-amplitude, low-frequency variables, each of which is coupled to many small-amplitude, high-frequency *Y* _{j} variables. Lorenz suggested that the *Y* _{j} represent convective events, whereas the *X*_{k} could represent, for example, larger scale synoptic events. The interpretation of the other parameters, and the values in this study, are shown in table 1. The scaling of the variables is such that one model time unit (MTU) is approximately equal to five atmospheric days, deduced by comparing the error doubling time of the model to that observed in the atmosphere [25].

## 3. Description of the experiment

A series of experiments was carried out using the L96 system. Each of the *K*=8 low-frequency, large-amplitude *X* variables is coupled to *J*=32 high-frequency, small-amplitude *Y* variables. The *X* variables are considered resolved and the *Y* variables unresolved, so must, therefore, be parametrized in a truncated model. The effects of different stochastic parametrizations were then investigated by comparing the truncated forecast model to the ‘truth’, defined by running the full set of coupled equations.

Two different values of the time-scale ratio, *c*=4 and *c*=10, were used in this experiment. The *c*=10 case was also considered by Wilks [16]. This case has a large time-scale separation, so can be considered ‘easy’ to parametrize. However, it has been shown that there is no such time-scale separation in the atmosphere [14], so a second parameter setting of *c*=4 was chosen, for which parametrization of the sub-grid is more difficult, but which more closely represents the real atmosphere.

### (a) ‘Truth’ model

The full set of equations was run and the resultant time series defined as ‘truth’. The equations were integrated using an adaptive fourth-order Runge–Kutta time-stepping scheme, with a maximum time step of 0.001 MTU. After the transients had been removed, 300 initial conditions on the attractor were selected at intervals of 10 MTU, corresponding to 50 ‘atmospheric days’. This interval was selected to ensure adjacent initial conditions are uncorrelated: the temporal autocorrelation of the *X* variables is close to zero after 10 MTU. A truth run was carried out from each of these 300 initial conditions.
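As a concrete illustration, the full two-scale system (equations (2.1a,b)) can be integrated with a simple fixed-step RK4 scheme. This is a minimal sketch: the parameter values `F`, `h`, `b` and `c` below are illustrative stand-ins for the settings in table 1, and the adaptive time-stepping used in the paper is replaced by a fixed step.

```python
import numpy as np

K, J = 8, 32                          # numbers of X and Y variables
F, h, b, c = 20.0, 1.0, 10.0, 10.0    # illustrative values; see table 1

def l96_tendency(state):
    """Tendencies of the full two-scale L96 system (equations 2.1a,b)."""
    X, Y = state[:K], state[K:]
    # coupling: each X_k feels the summed tendency of its J attached Y variables
    coupling = (h * c / b) * Y.reshape(K, J).sum(axis=1)
    dX = (-np.roll(X, 1) * (np.roll(X, 2) - np.roll(X, -1)) - X + F - coupling)
    dY = (-c * b * np.roll(Y, -1) * (np.roll(Y, -2) - np.roll(Y, 1)) - c * Y
          + (h * c / b) * np.repeat(X, J))
    return np.concatenate([dX, dY])

def rk4_step(state, dt):
    """One fixed-step RK4 update (the paper uses an adaptive scheme)."""
    k1 = l96_tendency(state)
    k2 = l96_tendency(state + 0.5 * dt * k1)
    k3 = l96_tendency(state + 0.5 * dt * k2)
    k4 = l96_tendency(state + dt * k3)
    return state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
```

Selecting uncorrelated initial conditions then amounts to sampling this trajectory every 10 MTU after discarding an initial transient.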

### (b) Forecast model

A forecast model was constructed by assuming that only the *X* variables are resolved, and parametrizing the effect of the unresolved sub-gridscale *Y* variables in terms of the resolved *X* variables,
$$\frac{\mathrm{d}X^{*}_k}{\mathrm{d}t} = -X^{*}_{k-1}\left(X^{*}_{k-2} - X^{*}_{k+1}\right) - X^{*}_k + F - U_p(X^{*}_k), \tag{3.1}$$
where *X*^{*}_{k}(*t*) is the forecast value of *X*_{k}(*t*) and *U*_{p} is the parametrized sub-grid tendency. Equation (3.1) was integrated using a piecewise deterministic, adaptive second-order Runge–Kutta scheme, in which the stochastic noise term is held constant over the time step. Such a stochastic Runge–Kutta scheme has been shown to converge to the true Stratonovich forward integration scheme, as long as the parameters in such a scheme are used in the same way they are estimated [26,27]. This was verified for the different stochastic parametrizations tested. The parametrizations *U*_{p}(*X*) approximate the true sub-grid tendencies,
$$U_k(t) = \frac{hc}{b}\sum_{j=J(k-1)+1}^{kJ} Y_j(t), \tag{3.2}$$
which are estimated from the truth time series as
$$U_k(t) \approx -X_{k-1}(t)\left(X_{k-2}(t) - X_{k+1}(t)\right) - X_k(t) + F - \frac{X_k(t+\Delta t) - X_k(t)}{\Delta t}. \tag{3.3}$$
The forecast time step was set to *Δt*=0.005. The L96 system exhibits cyclic symmetry, so the same parametrization is used for all *X*_{k}.
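A sketch of the piecewise-deterministic second-order Runge–Kutta (Heun) step for the truncated model (equation (3.1)): the stochastic contribution inside `U_p` is drawn once and held fixed over the step. The forcing value `F = 20` and the calling interface of `U_p` are assumptions for illustration.

```python
import numpy as np

def forecast_step(X, e, dt, U_p):
    """One piecewise-deterministic RK2 (Heun) step of the truncated model
    (equation 3.1), holding the stochastic term e fixed over the step.
    U_p(X, e) returns the parametrized sub-grid tendency (illustrative API)."""
    def tend(X):
        return (-np.roll(X, 1) * (np.roll(X, 2) - np.roll(X, -1))
                - X + 20.0 - U_p(X, e))          # F = 20 is illustrative
    k1 = tend(X)
    k2 = tend(X + dt * k1)
    return X + 0.5 * dt * (k1 + k2)
```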

The true sub-grid tendency, *U*, was plotted as a function of the large-scale *X* variables for both *c*=4 and *c*=10 (figure 2). This was modelled in terms of a deterministic parametrization, *U*_{p}=*U*_{det}, where

$$U_{\mathrm{det}}(X_k) = b_0 + b_1 X_k + b_2 X_k^2 + b_3 X_k^3 \tag{3.4}$$

for *c*=4 and

$$U_{\mathrm{det}}(X_k) = b_0 + b_1 X_k + b_2 X_k^2 + b_3 X_k^3 \tag{3.5}$$

for *c*=10 (with coefficients fitted separately for each case), and the parameter values (*b*_{0},*b*_{1},*b*_{2} and *b*_{3}) were determined by a least-squares fit to the (*X*,*U*) truth data. However, figure 2 shows significant scatter about the deterministic parametrization—the residuals *r*(*t*)=*U*(*t*)−*U*_{det}(*X*(*t*)) are non-zero. This variability can be taken into account by incorporating a stochastic component, *e*(*t*), into the parametrized tendency, *U*_{p}.
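The least-squares fit and the resulting residuals can be computed in a few lines. This sketch assumes the (X, U) pairs have already been pooled over all k, which the cyclic symmetry of the system permits.

```python
import numpy as np

def fit_deterministic(X, U, order=3):
    """Least-squares polynomial fit U_det(X) (equations 3.4/3.5) and residuals.

    X, U: 1-D arrays of resolved values and estimated sub-grid tendencies,
    pooled over all k (the system is cyclically symmetric)."""
    coeffs = np.polyfit(X, U, order)       # highest power first: [b3, b2, b1, b0]
    U_det = np.polyval(coeffs, X)
    residuals = U - U_det                  # the r(t) to be modelled stochastically
    return coeffs, residuals
```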

A number of different stochastic parametrizations were considered. These use different statistical models to represent the sub-gridscale variability owing to the truncated *Y* variables. The different noise models will be described below.

#### (i) Additive noise

This work builds on Wilks [16], where the effects of white and red additive (A) noise were considered on the skill of the forecast model. The parametrized tendency is modelled as the deterministic tendency and an additive noise term, *e*(*t*),
$$U_p(X_k) = U_{\mathrm{det}}(X_k) + e(t). \tag{3.6}$$
The stochastic term, *e*(*t*), is designed to represent the residuals, *r*(*t*). The temporal and spatial autocorrelation functions for the residuals are shown in figure 3. The temporal autocorrelation is significant, but the spatial correlation is small. Therefore, the temporal characteristics of the residuals are included in the parametrization by modelling *e*(*t*) as a first-order autoregressive (AR(1)) process. The stochastic tendencies for each of the *X* variables are assumed to be mutually independent. It is expected that, in more complicated systems, including the effects of both spatial and temporal correlations will be important for accurately characterizing the sub-gridscale variability. A second-order autoregressive process was also considered, but fitting the increased number of parameters proved difficult, and the resultant improvement over AR(1) was slight, so it will not be discussed further here.

A zero mean AR(1) process can be written as [5]
$$e(t) = \phi\,e(t-\Delta t) + \sigma_e\sqrt{1-\phi^2}\;z(t), \tag{3.7}$$

where *ϕ* is the first autoregressive parameter (the lag-1 autocorrelation), *σ*_{e}^{2} is the variance of the stochastic tendency and *z*(*t*) is unit-variance Gaussian white noise, *z*(*t*)∼*N*(0,1). *ϕ* and *σ*_{e} can be fitted from the truth time series.
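Generating the AR(1) stochastic tendency of equation (3.7) is straightforward; note the √(1−ϕ²) factor, which keeps the stationary standard deviation equal to σ_e regardless of ϕ.

```python
import numpy as np

def ar1_noise(phi, sigma_e, n_steps, rng):
    """Generate a zero-mean AR(1) series with lag-1 autocorrelation phi and
    stationary standard deviation sigma_e (equation 3.7)."""
    e = np.empty(n_steps)
    e[0] = sigma_e * rng.standard_normal()   # draw from the stationary distribution
    for t in range(1, n_steps):
        e[t] = phi * e[t - 1] + sigma_e * np.sqrt(1 - phi**2) * rng.standard_normal()
    return e
```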

#### (ii) State-dependent noise

A second type of noise was considered where the standard deviation of additive noise is dependent on the value of the *X* variable. This will be called *state-dependent* (SD) *noise*. It can be motivated in the L96 system by studying figure 2; the degree of scatter about the cubic fit is greater for large-magnitude *X* values,
$$U_p(X_k) = U_{\mathrm{det}}(X_k) + e(t), \tag{3.8}$$

where the SD standard deviation of *e*(*t*) is modelled as

$$\sigma(X_k) = \sigma_0 + \sigma_1\left|X_k\right|. \tag{3.9}$$
As figure 3 shows a large temporal autocorrelation, it is unlikely that white SD noise will adequately model the residuals. Instead, *e*(*t*) will be modelled as an AR(1) process,
$$e(t) = \phi\,\frac{\sigma(X_k(t))}{\sigma(X_k(t-\Delta t))}\,e(t-\Delta t) + \sigma(X_k(t))\sqrt{1-\phi^2}\;z(t), \tag{3.10}$$
where the time dependency of the standard deviation and the requirement that *e*(*t*) must be a stationary process have motivated the functional form.

The parameters *σ*_{1} and *σ*_{0} can be estimated by binning the residuals according to the magnitude of *X*, and calculating the standard deviation in each bin. The lag-1 autocorrelation was estimated from the residual time series.
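The binning procedure can be sketched as follows; the linear dependence of σ on |X| is the assumed form of equation (3.9).

```python
import numpy as np

def fit_state_dependent_sigma(X, residuals, n_bins=20):
    """Estimate sigma_0, sigma_1 in sigma(X) = sigma_0 + sigma_1*|X| (eq. 3.9)
    by binning the residuals on |X| and fitting a line to the per-bin std."""
    absX = np.abs(X)
    edges = np.linspace(absX.min(), absX.max(), n_bins + 1)
    centres, stds = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (absX >= lo) & (absX < hi)
        if mask.sum() > 10:                    # skip sparsely populated bins
            centres.append(0.5 * (lo + hi))
            stds.append(residuals[mask].std())
    sigma_1, sigma_0 = np.polyfit(centres, stds, 1)
    return sigma_0, sigma_1
```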

#### (iii) Multiplicative noise

Multiplicative (M) noise has been successfully implemented in the ECMWF NWP model using the SPPT scheme, and has been shown to improve the skill of the forecasting system. Therefore, it is of interest whether a parametrization scheme involving multiplicative noise could give significant improvements over additive stochastic schemes in the L96 system.

The parametrization proposed is
$$U_p(X_k) = \left(1 + e(t)\right)U_{\mathrm{det}}(X_k), \tag{3.11}$$
where *e*(*t*) is modelled as an AR(1) process, given by equation (3.7).

The parameters in this model can be estimated by forming a time series of the truth ‘residual ratio’, *R*_{k}, we wish to represent,
$$R_k(t) = \frac{U_k(t) - U_{\mathrm{det}}(X_k)}{U_{\mathrm{det}}(X_k)}. \tag{3.12}$$

However, whenever *U*_{det} approaches zero, the residual ratio tends to infinity. Therefore, the time series was first filtered such that only sections away from *U*_{det}=0 were considered, and the temporal autocorrelation and standard deviation estimated from these sections.
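A minimal sketch of this filtering procedure; the threshold on |U_det| is an illustrative choice, and the lag-1 autocorrelation is estimated crudely by ignoring the gaps the filtering introduces.

```python
import numpy as np

def fit_multiplicative(U_true, U_det, threshold=1.0):
    """Estimate the standard deviation and lag-1 autocorrelation of the
    residual ratio R = (U_true - U_det)/U_det (eq. 3.12), using only points
    where |U_det| exceeds a threshold, to avoid the singularity at U_det = 0.
    The threshold value here is illustrative."""
    R = (U_true - U_det) / U_det
    keep = np.abs(U_det) > threshold
    R_f = R[keep]
    sigma = R_f.std()
    phi = np.corrcoef(R_f[:-1], R_f[1:])[0, 1]   # crude: ignores filtering gaps
    return sigma, phi
```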

For multiplicative noise, it is assumed that the standard deviation of the true tendency is proportional to the parametrized tendency, such that when the parametrized tendency is zero, the uncertainty in the tendency is zero. Figure 2 shows that multiplicative noise does not appear to be a good model for the uncertainty in the L96 system, as the uncertainty in the true tendency is large even when *U*_{det} is zero. Nevertheless, it was investigated.

#### (iv) Multiplicative and additive noise

Figure 2 motivates a final stochastic parametrization scheme for testing in the L96 system, which will include both multiplicative and additive (MA) noise terms. This represents the uncertainty in the parametrized tendency even when the deterministic tendency is zero. This type of uncertainty has been observed in coarse-graining studies. For example, Shutts & Palmer [19] observed that the standard deviation of the true heating in a coarse gridbox does not go to zero when the parametrized heating is zero. This type of stochastic parametrization can also be motivated by considering errors in the time-stepping scheme, which will contribute to errors in the total tendency even if the sub-gridscale tendency is zero.

When formulating this parametrization, the following points were considered.

1. In a toy model setting, random number generation is computationally cheap. However, in a weather or climate prediction model, generation of spatially and temporally correlated fields of random numbers is comparatively expensive, and two separate generators must be used if two such fields are required. It is, therefore, desirable to use only one random number per time step, so that the parametrization could be further developed for use in an atmospheric model.

2. The fewer parameters there are to fit, the simpler the methodology required to fit them, and the easier it is to apply this method to a more complex system such as an atmospheric model. Using few parameters also reduces the risk of overfitting.

We consider the most general form of additive and multiplicative noise,

$$U_p(X_k) = \left(1 + \epsilon_m(t)\right) U_{\mathrm{det}}(X_k) + \epsilon_a(t). \tag{3.13}$$

This can be written as pure additive noise,

$$U_p(X_k) = U_{\mathrm{det}}(X_k) + e(t), \tag{3.14}$$

where

$$e(t) = \epsilon_m(t)\, U_{\mathrm{det}}(X_k) + \epsilon_a(t). \tag{3.15}$$

Following point (1) above, we assume *ϵ*_{m}(*t*) and *ϵ*_{a}(*t*) are the same random number, appropriately scaled,

$$e(t) = \left(\sigma_m U_{\mathrm{det}}(X_k) + \sigma_a\right)\epsilon(t). \tag{3.16}$$

In its current form, equation (3.16) is not symmetric about the origin with respect to *U*_{det}: the standard deviation of the stochastic tendency is zero when *U*_{det}=−*σ*_{a}/*σ*_{m}. Therefore, *U*_{det} in the above equation will be replaced with |*U*_{det}|,

$$U_p(X_k) = U_{\mathrm{det}}(X_k) + e(t), \tag{3.17}$$

where

$$e(t) = \left(\sigma_m \left|U_{\mathrm{det}}(X_k)\right| + \sigma_a\right)\epsilon(t). \tag{3.18}$$

This does not change the nature of the multiplicative noise as *ϵ*(*t*) is zero mean, but it forces the additive part of the noise to act always in the same direction as the multiplicative part. *ϵ*(*t*) is modelled as an AR(1) process of unit variance. The parameters will be fitted from the residual time series.
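The resulting tendency (equations (3.17) and (3.18)) collapses to a single line, with one AR(1) draw ε scaling both the multiplicative and additive parts; a minimal sketch:

```python
import numpy as np

def ma_tendency(U_det_val, eps, sigma_m, sigma_a):
    """Parametrized tendency with combined multiplicative and additive noise
    (equations 3.17-3.18): a single AR(1) draw eps scales both parts, so the
    additive noise always acts in the same direction as the multiplicative."""
    return U_det_val + (sigma_m * np.abs(U_det_val) + sigma_a) * eps
```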

The different stochastic parametrizations used in this experiment are summarized in table 2, together with the parameters measured from the truth time series.

## 4. Weather forecasting skill

The stochastic parametrization schemes were first tested on their ability to predict the ‘weather’ of the L96 system, and represent the uncertainty in their predictions correctly. An ensemble of 40 members was generated for each of the 300 initial conditions on the attractor. Each ensemble member is initialized from the perfect initial conditions defined by the ‘truth’ time series.

Each stochastic parametrization involves two or more tunable parameters that may be estimated from the ‘truth’ time series. Many parameter settings were considered, and the skill of the parametrization evaluated for each setting using scalar skill scores. The ranked probability score (RPS) evaluates a multi-category forecast. In this experiment, ten categories were defined as the ten deciles of the climatological distribution. The ranked probability skill score (RPSS) was also calculated with respect to the climatology [5]. The ignorance score evaluates a continuous forecast. Roulston & Smith [28] suggest defining *N*+1 categories for an ensemble forecast of *N* members, and approximating the probability density function (PDF) as a uniform distribution between consecutive ensemble members. This approximation is used here. The ignorance skill score (IGNSS) was calculated with respect to climatology. The Brier score (BS) is used when considering a dichotomous event [29,5]. The event was defined here as ‘the *X* variable is in the upper tercile of the climatological distribution’. The BS is mathematically related to the RPS [5], and gave very similar results to the RPS, so is not shown here for brevity. The forecast skill scores are calculated at a lead time of 0.6 MTU, equivalent to three atmospheric days, for each case.
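As an illustration, the RPS for a single forecast–verification pair, with the ten categories defined by the climatological deciles, can be computed as follows (the RPSS is then 1 − RPS/RPS_clim):

```python
import numpy as np

def rps(ensemble, truth, climatology):
    """Ranked probability score for one forecast, with ten categories defined
    by the deciles of the climatological distribution."""
    edges = np.percentile(climatology, np.arange(10, 100, 10))  # 9 decile edges
    # forecast category probabilities from the ensemble
    fc_cat = np.searchsorted(edges, ensemble)
    fc_pdf = np.bincount(fc_cat, minlength=10) / len(ensemble)
    # observation as a one-hot category vector
    ob_pdf = np.zeros(10)
    ob_pdf[np.searchsorted(edges, truth)] = 1.0
    # sum of squared differences of the cumulative distributions
    return np.sum((np.cumsum(fc_pdf) - np.cumsum(ob_pdf)) ** 2)
```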

Figure 4 shows the calculated skill scores for a forecast model with an additive AR(1) stochastic parametrization, for both the *c*=4 and *c*=10 cases. The shape of the peak in forecasting skill is qualitatively similar in each case, but the peak skill is lower for the *c*=4 case. The closer time-scale separation of the *c*=4 case is harder to parametrize, so a lower skill is to be expected.

IGNSS shows a different behaviour to RPSS. Ignorance heavily penalizes an underdispersive ensemble, but does not heavily penalize an overdispersive ensemble. This asymmetry is observed in the contour plot for IGNSS—the peak is shifted upwards compared with the peak for RPSS, and deterministic parametrizations have negative skill (they are worse than climatology). The very large-magnitude, high-autocorrelation noise parametrizations are not penalized: they score highly despite being overdispersive.

The RPSS may be decomposed explicitly into reliability, resolution and uncertainty components [5]. This decomposition demonstrates that the deterministic and low-amplitude noise parametrizations score highly on their resolution, but poorly for their reliability, and the converse is true for the large-amplitude, highly autocorrelated noise parametrizations. The peak in skill according to the RPSS corresponds to parametrizations that score reasonably well on both accounts.

A number of important parameter settings can be identified on these plots. The first corresponds to the deterministic parametrization, which occurs on the *x*-axis where the standard deviation of the noise is zero. The second corresponds to white noise, which occurs on the *y*-axis where the autocorrelation parameter is set to zero. In particular, (*ϕ*,*σ*/*σ*_{meas})=(0,1) corresponds to additive white noise with a magnitude fitted to the truth time series. The third setting is the measured parameters, marked by a black plus. Comparing the skill of these three cases shows an improvement over the deterministic scheme as first white noise, then red noise is included in the parametrization.

The RPSS calculated for the other stochastic parametrizations for the *c*=4 case is shown in figure 5. The contour plots for IGNSS are comparable, so are not shown for brevity. The forecasts are more skilful for the *c*=10 case, but qualitatively similar. They have been included in the electronic supplementary material, figure S1. For all cases considered, including a stochastic term in the parametrization scheme results in an improvement in the skill of the forecast over the deterministic scheme. This result is robust to error in the measurement of the parameters — a range of parameters in each forecast model gave good skill scores. This is encouraging, as it indicates that stochastic parametrizations could be useful in modelling the real atmosphere, where temporally and spatially limited, noisy data restrict how accurately these parameters may be estimated.

The results are summarized in figure 6. For each parametrization, the value for the measured parameters is shown when both no temporal autocorrelation (‘white’ noise) and the measured temporal autocorrelation characteristics are used. The significance of the difference between pairs of parametrizations was estimated using a Monte-Carlo technique. Please see the electronic supplementary material for more details, but for example, there is no significant difference between the RPS for AR(1) multiplicative and additive noise and for AR(1) state-dependent noise for the *c*=10 case, but AR(1) multiplicative noise gave a significant improvement over both of these.

The stochastic parametrizations are significantly more skilful than the deterministic parametrization in both the *c*=4 and *c*=10 cases. For the *c*=4 case, the more complicated parametrizations show a significant improvement over simple additive, especially the multiplicative noise. For the closer time-scale separation, the more accurate the representation of the sub-gridscale forcing, the higher the forecast skill. For the *c*=10 case, the large time-scale separation allows the deterministic parametrization to have reasonable forecasting skill, and a simple representation of sub-grid variability is sufficient to represent the uncertainty in the forecast model; the more complicated stochastic parametrizations show little improvement over simple additive AR(1) noise.

Traditional deterministic parametrization schemes are a function of the gridscale variables at the current time step only. If a stochastic parametrization also needs only to represent the sub-grid- and time-scale variability, the white noise schemes would be adequate. However, for both time-scale separations, we observe a significant improvement in the skill of stochastic parametrizations that include a temporal autocorrelation over those which use white noise. This challenges the standard idea that a parametrization should only represent sub-gridscale and sub-time-step variability: including temporal autocorrelation accounts for the effects of the sub-gridscale at time scales greater than the model time step. In the L96 system, the spatial correlations are low. However, in an atmospheric situation, it is likely that spatial correlations will be significant, and a stochastic parametrization must account for the effects of the sub-grid at scales larger than the spatial discretization scale.

## 5. Representation of model uncertainty

One of the central aims of this experiment is to evaluate the use of stochastic parametrizations as a way to represent model uncertainty accurately. The ensemble is, therefore, initialized from perfect initial conditions, and is generated purely by drawing different random numbers in the stochastic part of the parametrization. If the parametrization successfully represents model uncertainty, the spread of the ensemble generated must correctly indicate the uncertainty in the forecast: the ensemble must be *statistically consistent*.

The full forecast PDF represents our uncertainty about the future state of the system. Ideally, the ensemble forecast should be a random sample from that PDF. The consistency condition is that the verification also behaves like a sample from that PDF [30].

In order to meet the consistency condition, the ensemble must have the correct second moment. If it is underdispersive, the verification will frequently fall as an outlier. Conversely, if the ensemble is overdispersive, the verification may fall too often towards the centre of the distribution. The reliability of an ensemble forecast can be tested through the spread–error relationship [31,32]. The expected squared error of the ensemble mean can be related to the expected ensemble variance by assuming the ensemble members and the truth are independently identically distributed random variables with variance *σ*^{2}. Assuming the ensemble is unbiased, this gives the following requirement for a statistically consistent ensemble:
$$\sqrt{\overline{\left(\bar{X} - X_{\mathrm{truth}}\right)^2}} = \sqrt{\frac{M+1}{M}}\;\sqrt{\overline{\sigma^2}}, \tag{5.1}$$

where the variance and mean error have been estimated by averaging over many forecast-verification pairs. For a large ensemble (here *M*=40), the correction factor $\sqrt{(M+1)/M}$ is close to 1.
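The spread-error consistency check of equation (5.1) reduces to comparing two numbers averaged over forecast cases; a minimal sketch:

```python
import numpy as np

def spread_error(ensembles, truths):
    """Check the consistency relation (5.1): RMS error of the ensemble mean
    against sqrt((M+1)/M) times the RMS ensemble spread, averaged over cases.

    ensembles: array of shape (n_cases, M); truths: array of shape (n_cases,)."""
    M = ensembles.shape[1]
    rmse = np.sqrt(np.mean((ensembles.mean(axis=1) - truths) ** 2))
    spread = np.sqrt(np.mean(ensembles.var(axis=1, ddof=1)))
    return rmse, np.sqrt((M + 1) / M) * spread
```

For a statistically consistent ensemble the two returned values agree, up to sampling noise.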

Evaluating the average ensemble error as a function of forecast ensemble spread is a useful measure of how well the forecast model represents uncertainty, and is tested through diagnostic plots of binned root mean square (r.m.s.) spread against the average r.m.s. error in each bin. Figure 7*a* shows this diagnostic for a selection of the parametrization schemes tested for the *c*=4 case. For a well-calibrated ensemble, the points should lie on the diagonal. A clear improvement over the deterministic scheme is seen as first white, then red additive noise is included in the parametrization scheme.

Visual forecast verification measures are limited when comparing many different models as they do not give an unambiguous ranking of the performance of these models. Therefore, we also consider the reliability component of the BS [33], REL, which is a scalar scoring rule testing how reliable an ensemble forecast is when predicting an event. We define the event to be ‘in the upper tercile of the climatological distribution’. The smaller the REL, the closer the forecast probability is to the average observed frequency, and the more reliable the forecast.
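The reliability component can be computed from the Murphy decomposition of the Brier score by binning the forecast probabilities; this sketch uses ten equally spaced probability bins, an illustrative choice.

```python
import numpy as np

def brier_reliability(probs, outcomes, n_bins=10):
    """Reliability component of the Brier score (Murphy decomposition):
    REL = (1/N) * sum over bins of n_i * (p_i - obar_i)^2, where p_i is the
    mean forecast probability in bin i and obar_i the observed frequency."""
    probs, outcomes = np.asarray(probs), np.asarray(outcomes)
    bins = np.clip((probs * n_bins).astype(int), 0, n_bins - 1)
    rel = 0.0
    for i in range(n_bins):
        mask = bins == i
        if mask.any():
            rel += mask.sum() * (probs[mask].mean() - outcomes[mask].mean()) ** 2
    return rel / len(probs)
```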

The results are summarized in figure 7*b*. The REL score indicates the different AR(1) noise terms all perform similarly. A significant improvement is observed when temporal autocorrelation is included in the parametrization, particularly for the *c*=4 case.

## 6. Climatological skill and seamless prediction

The climatology of the L96 system is defined to be the PDF of the *X* variables, averaged over a long run (10 000 MTU ∼ 140 ‘atmospheric years’), and the forecast climatology is defined in an analogous way. The skill at predicting the climatology can then be quantified by measuring the difference between these two PDFs, which may be evaluated in several ways. The Kolmogorov–Smirnov (KS) statistic, *D*, has been used in this context in several other studies [16,18], where
$$D = \max_x \left| P(x) - Q(x) \right|. \tag{6.1}$$
Here, *P* is the forecast cumulative PDF, and *Q* is the verification cumulative PDF. A second measure, the Hellinger distance, *H*, was also calculated for each forecast model,
$$H(p,q) = \sqrt{\frac{1}{2}\int \left(\sqrt{p(x)} - \sqrt{q(x)}\right)^{2}\,\mathrm{d}x}, \tag{6.2}$$
where *p*(*x*) is the forecast PDF, and *q*(*x*) is the verification PDF [34]. Similarly, the Kullback–Leibler (KL) divergence is defined as
$$D_{\mathrm{KL}}(p \,\|\, q) = \int p(x)\,\ln\frac{p(x)}{q(x)}\,\mathrm{d}x, \tag{6.3}$$
motivated by information theory [35]. For all these measures, the smaller the measure, the better the match between forecast and verification climatologies.
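All three measures can be estimated from sampled climatologies on a common histogram grid; a minimal sketch follows (the bin count and the exclusion of empty bins in the KL term are our choices):

```python
import numpy as np

def climatological_distances(x_truth, x_model, n_bins=100):
    """KS statistic, Hellinger distance and KL divergence between two
    sampled climatologies, estimated on a shared histogram grid."""
    lo = min(x_truth.min(), x_model.min())
    hi = max(x_truth.max(), x_model.max())
    edges = np.linspace(lo, hi, n_bins + 1)
    q = np.histogram(x_truth, bins=edges)[0].astype(float)
    p = np.histogram(x_model, bins=edges)[0].astype(float)
    q /= q.sum()                                  # verification PDF
    p /= p.sum()                                  # forecast PDF
    # KS statistic: maximum gap between the cumulative distributions
    ks = np.max(np.abs(np.cumsum(p) - np.cumsum(q)))
    # Hellinger distance
    hell = np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
    # KL divergence, restricted to bins where both PDFs are non-zero
    mask = (p > 0) & (q > 0)
    kl = np.sum(p[mask] * np.log(p[mask] / q[mask]))
    return ks, hell, kl
```

Smaller values of each measure indicate a closer match between forecast and verification climatologies.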

The Hellinger distance was found to be a much smoother measure of climatological skill than the KS statistic, as it integrates over the whole PDF rather than depending on the single point of maximum difference, so the KS statistic is not considered further here. Figure 8 shows that the Hellinger distance and the KL divergence give very similar results; this is not surprising, as Pollard [34] shows the two measures are linked. For these reasons, only the Hellinger distance will be considered below.

The Hellinger distance is evaluated for the different parametrizations for the two cases, *c*=4 and *c*=10. Figure 9 shows the results for *c*=4 when the tuneable parameters are varied as for the weather skill scores. Qualitatively, the shape of the plots is similar to those for the weather skill scores. If a parametrization performs well in forecasting mode, it performs well at reproducing the climate. The peak is shifted up and to the right compared with the RPSS, but is in a similar position to IGNSS and REL. The climatological skill for the different parametrizations is summarized in figure 10. As for the weather forecasting skill, a significant improvement in the climatological skill is observed when temporal autocorrelation is included in the parametrizations.

The climatological skill, as measured by the Hellinger distance, can be compared with the weather skill scores using scatter diagrams (figure 11). This is of interest, as the seamless prediction paradigm suggests that climate models could be verified by evaluating the model in weather forecasting mode. Figure 11*a*,*d* shows the relationship between RPSS and the Hellinger distance. For the *c*=10 case, there appears to be strong negative correlation between the two. However, the peak in RPSS is offset slightly from the minimum in Hellinger distance giving two branches in the scatter plot. The *c*=4 case can be interpreted as being positioned at the joining point of the two branches, and shows how using the RPSS as a method to verify a model’s climatological skill could be misleading.

Figure 11*b*,*e* compares ignorance with the Hellinger distance. The upper branch in figure 11*b* corresponds to the large-magnitude, high-temporal-autocorrelation parametrizations, which score well according to ignorance but have poor climatological skill. This makes ignorance unsuitable as an evaluation method for use in seamless prediction.

Figure 11*c*,*f* shows the results for the reliability component of the BS. There is a strong correlation between REL and the Hellinger distance, indicating this is a suitable score for use in seamless prediction. It is not surprising that the REL is well suited for this task, as it is particularly sensitive to the reliability of an ensemble, which is the characteristic of a forecast important for climate prediction [7]. The other weather skill scores studied put too much weight on resolution to be used for this purpose. Additionally, figure 11*c* indicates that reliability in weather forecasting mode is a necessary but not sufficient requirement of a good climatological forecast, as was suggested by Palmer *et al.* [7].

## 7. Perturbed parameter ensembles in the Lorenz ’96 system

It is of interest to determine whether a perturbed parameter ensemble can also provide a reliable measure of model uncertainty in the Lorenz ’96 system. The four measured parameters defining the cubic polynomial will be perturbed to generate a 40 member ensemble. The skill of this representation of model uncertainty will be evaluated as for the stochastic parametrizations.

Following Stainforth *et al*. [36], each of the four parameters will be set to one of three values: low (*L*), medium (*M*) or high (*H*). The degree to which the parameters should be varied was estimated from the truth time series. The measured *U*(*X*) was split into sections 3 MTU long, and a cubic polynomial was fitted to each section. The variability in each parameter was then calculated as the standard deviation, across sections, of the fitted values. The measured standard deviations are shown in table 3.
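The sectioned fitting procedure can be sketched as follows (assuming the time series are available as arrays sampled at fixed intervals, so that `section_len` in samples corresponds to 3 MTU; the function name is ours):

```python
import numpy as np

def parameter_spread(x, u, section_len):
    """Estimate the variability of the cubic-fit parameters.

    x, u        : (n,) time series of X and the measured sub-grid tendency U(X)
    section_len : number of samples per section

    Returns the mean and standard deviation of each cubic coefficient
    across sections (ordered from x**3 down to the constant term,
    as returned by np.polyfit).
    """
    n_sections = len(x) // section_len
    coeffs = np.empty((n_sections, 4))
    for i in range(n_sections):
        s = slice(i * section_len, (i + 1) * section_len)
        coeffs[i] = np.polyfit(x[s], u[s], deg=3)  # cubic fit to this section
    return coeffs.mean(axis=0), coeffs.std(axis=0, ddof=1)
```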

The low, medium and high values of the parameters are given by
$$b_i^{L} = \hat{b}_i - S\,\sigma_i, \qquad b_i^{M} = \hat{b}_i, \qquad b_i^{H} = \hat{b}_i + S\,\sigma_i, \tag{7.1}$$
where the scale factor, *S*, can be varied to test the sensitivity of the scheme. There are 3^{4}=81 possible permutations of the parameter settings, from which a subset of 40 permutations was selected to sample the uncertainty. This allows for a fair comparison to be made with the stochastic parametrizations, which also use a 40 member ensemble.
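The construction of such an ensemble can be sketched as below. How the 40 permutations were selected from the 81 is not specified in this excerpt, so a random subset without replacement is used purely for illustration:

```python
import itertools
import numpy as np

def perturbed_parameter_ensemble(b_mean, b_std, S=1.0, n_members=40, seed=0):
    """Draw an ensemble of perturbed parameter sets.

    Each of the four cubic coefficients is set to low (mean - S*std),
    medium (mean) or high (mean + S*std), giving 3**4 = 81 possible
    permutations, from which n_members are sampled without replacement.
    (The selection of the subset is a hypothetical choice here.)
    """
    levels = [(-1, 0, 1)] * len(b_mean)
    perms = np.array(list(itertools.product(*levels)))        # (81, 4)
    rng = np.random.default_rng(seed)
    chosen = perms[rng.choice(len(perms), size=n_members, replace=False)]
    return b_mean + S * chosen * b_std                        # (n_members, 4)
```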

The same ‘truth’ model will be used as for the stochastic parametrizations, and the forecast model will be constructed in an analogous way: only the *X* variables are assumed resolved, and the effects of the unresolved sub-gridscale *Y* variables are represented by an ensemble of deterministic parametrizations,
$$\frac{\mathrm{d}X_k}{\mathrm{d}t} = -X_{k-1}\left(X_{k-2} - X_{k+1}\right) - X_k + F - U_p(X_k), \qquad U_p(X) = b_0^{p} + b_1^{p}X + b_2^{p}X^{2} + b_3^{p}X^{3}, \tag{7.2}$$
where the values of the perturbed parameters, *b*^{p}, vary between ensemble members. The scale factor, *S*, in equation (7.1) will be varied to investigate the effect on the skill of the forecast. The ensemble of deterministic parametrizations is shown in figure 12 where the degree of parameter perturbation has been measured from the truth time series (i.e. *S*=1). The truncated model will be integrated using an adaptive second-order Runge–Kutta scheme.
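A truncated model of this form, with a cubic deterministic parametrization, can be sketched as below. Note two simplifications relative to the study: a fixed-step midpoint Runge–Kutta scheme is used in place of the adaptive one, and the parameter values in any usage are illustrative rather than the fitted ones:

```python
import numpy as np

def l96_tendency(x, F, b):
    """Truncated Lorenz '96 tendency with a cubic parametrization
    U_p(X) = b0 + b1*X + b2*X**2 + b3*X**3 for the sub-grid terms."""
    adv = -np.roll(x, 1) * (np.roll(x, 2) - np.roll(x, -1))  # advection term
    u_p = b[0] + b[1] * x + b[2] * x ** 2 + b[3] * x ** 3
    return adv - x + F - u_p

def rk2_step(x, dt, F, b):
    """One fixed-step second-order Runge-Kutta (midpoint) step."""
    k1 = l96_tendency(x, F, b)
    k2 = l96_tendency(x + 0.5 * dt * k1, F, b)
    return x + dt * k2
```

Each ensemble member integrates the same equations with its own perturbed parameter vector *b*^{p}.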

### (a) Weather prediction skill

The skill of the ensemble forecast is evaluated using the RPSS and IGNSS at a lead time of 0.6 MTU for both the *c*=4 and *c*=10 cases. The results are shown in figure 13. The measured parameter perturbations are indicated with a black plus in each case.

Both RPSS and IGNSS indicate the measured perturbed parameter ensemble is significantly less skilful than the stochastic ensemble for both the *c*=4 and *c*=10 cases. Figure 13*c*,*f* shows the reliability component of the BS calculated for the perturbed parameter ensembles. Comparing these figures with figure 7*b*, REL for the perturbed parameter schemes is greater, indicating that the perturbed parameter ensemble forecasts are less reliable than the stochastic parametrization forecasts, so the ensemble is a poorer representation of model uncertainty. The significance of the difference between skill scores for the measured stochastic parametrization schemes and the perturbed parameter schemes with the measured perturbations (*S*=1) is shown in the significance tables in the electronic supplementary material.

The reliability of the forecast ensemble is also considered using the r.m.s. spread–error diagnostic as a function of the scale factor in figure 14. For small scale factors, the ensemble is systematically underdispersive for both the *c*=4 and *c*=10 cases. For larger scale factors in the *c*=10 case, the ensemble is systematically overdispersive for large errors and underdispersive for small errors. Comparing figure 14 with figure 7*a* shows that none of the perturbed parameter ensembles is as reliable as the AR(1) additive stochastic parametrization, which is reflected in the poorer REL score for the perturbed parameter case.

### (b) Climatological skill

The climatology of the perturbed parameter ensemble must include contributions from each of the 40 ensemble members. Therefore, the climatology is defined as the PDF of the *X* variables, averaged over the same total number of MTU as the stochastic parametrizations (10 000); each of the 40 ensemble members is integrated for 250 MTU. The Hellinger distance between the truth and forecast climatologies can then be calculated as a function of the scale factor (figure 15). The climatology of the measured perturbed parameter ensemble is worse than the measured parameter stochastic parametrizations. This is as predicted by the ‘seamless prediction’ paradigm; the perturbed parameter ensembles are less reliable than the stochastic parametrizations, and so predict a less accurate climatology.

## 8. Conclusions

Several different stochastic parametrization schemes were investigated using the Lorenz ’96 system. All showed an improvement in weather and climate forecasting skill over deterministic parametrizations. This result is robust to error in measurement of the parameters—scanning over parameter space indicated a wide range of parameter settings gave good skill scores.

Stochastic parametrizations have been shown to represent the uncertainty in a forecast owing to model deficiencies accurately. This increase in forecast reliability is reflected by the increase in the weather prediction skill and improved climatology of the forecast model.

A significant improvement in the skill of the forecast models was observed when the stochastic parametrizations included temporal autocorrelation in the noise term. This challenges the notion that a parametrization scheme should represent only sub-gridscale (both temporal and spatial) variability. The coupling of scales in a complex system means a successful parametrization must represent the effects of sub-gridscale processes acting on spatial and temporal scales greater than the truncation level.

The correlation between the performance of the forecast model in weather prediction mode, and its ability to reproduce the climatology of the Lorenz ’96 system provides support for the ‘seamless prediction’ paradigm. This provides a method of verifying climate predictions: the climate model can be run in weather forecasting mode, and its short-term predictive skill analysed.

Stochastic representations of model uncertainty are shown to outperform perturbed parameter ensembles in the L96 system. They have higher short-term forecasting skill and are more reliable than perturbed parameter ensembles. They also predict the climatology of the L96 system better.

The Lorenz ’96 system is an excellent tool for testing developments in stochastic parametrizations. These ideas can now be applied to NWP models and tested on the atmosphere.

## Acknowledgements

We thank the following authors for interesting and useful discussions and suggestions: Jochen Bröcker, Andrew Dawson, Chris Ferro, Frank Kwasniok, Martin Leutbecher, Cecile Penland, Dan Rowlands and Paul Williams. This work was supported by a NERC studentship.

## Footnotes

One contribution of 13 to a Theme Issue ‘Mathematics applied to the climate system’.

- © 2013 The Author(s). Published by the Royal Society. All rights reserved.