## Abstract

This study explores the sensitivity of probabilistic predictions of the twenty-first century surface air temperature (SAT) changes to different multi-model averaging methods using available simulations from the Intergovernmental Panel on Climate Change fourth assessment report. A way of observationally constrained prediction is provided by training multi-model simulations for the second half of the twentieth century with respect to long-term components. The Bayesian model averaging (BMA) produces weighted probability density functions (PDFs) and we compare two methods of estimating weighting factors: Bayes factor and expectation–maximization algorithm. It is shown that Bayesian-weighted PDFs for the global mean SAT changes are characterized by multi-modal structures from the middle of the twenty-first century onward, which are not clearly seen in arithmetic ensemble mean (AEM). This occurs because BMA tends to select a few high-skilled models and down-weight the others. Additionally, Bayesian results exhibit larger means and broader PDFs in the global mean predictions than the unweighted AEM. Multi-modality is more pronounced in the continental analysis using 30-year mean (2070–2099) SATs while there is only a little effect of Bayesian weighting on the 5–95% range. These results indicate that this approach to observationally constrained probabilistic predictions can be highly sensitive to the method of training, particularly for the later half of the twenty-first century, and that a more comprehensive approach combining different regions and/or variables is required.

## 1. Introduction

There have been increasing studies on regional-scale climate change detection and attribution using surface air temperatures (SATs; Karoly *et al*. 2003; Stott 2003; Zwiers & Zhang 2003; Zhang *et al*. 2006; Min & Hense 2007). They have found significant anthropogenic influence (greenhouse gases and sulphate aerosols) on the observed SAT changes over continental and even smaller spatial scales. Based on these regional assessment results, Stott *et al*. (2006) suggested producing probabilistic climate predictions weighted with some measure of the model skills evaluated by observations (Allen *et al*. 2000; Stott & Kettleborough 2002). Here, the main assumption is that past observed changes attributable to anthropogenic forcing can be used as a constraint to future warming. This seems to be reasonable considering that future scenarios such as the well-known special report on emissions scenarios (SRES, Nakicenovic & Swart 2000) are based only on anthropogenic forcing factors.

Since climate predictions are inherently uncertain, the information on the uncertainty is essential to decision-makers. There have been recent efforts to develop methods for a probabilistic treatment of uncertainty in the global warming predictions. Murphy *et al*. (2004), Piani *et al*. (2005) and Stainforth *et al*. (2005) used ‘perturbed physics ensembles’ in which model parameters are changed within expert-defined ranges. Using distributed computing resources through the climate-prediction.net project (Allen 1999), they obtained multi-thousand simulations of an atmospheric general circulation model coupled to mixed layer ocean and estimated climate uncertainty from that ensemble. Stott *et al*. (2006) showed continental-scale temperature predictions in which weighting factors are obtained from detection/attribution results using an ensemble of atmosphere–ocean coupled climate models (AOGCMs) while the necessary uncertainty is estimated from the control run with the same model.

Multi-AOGCM analyses have been carried out as well for climate change (e.g. Cubasch *et al*. 2001; Giorgi & Bi 2005). More recently, in order to consider model uncertainty systematically, Bayesian approaches have been suggested and applied (Tebaldi *et al*. 2005; Greene *et al*. 2006; Min & Hense 2006*a*). Although these approaches take Bayesian statistics as their basis, they are different in dealing with variables and methods to obtain weighting factors. Tebaldi *et al*. (2005) evaluated models with respect to the present-day climatology and the inter-model consistency in predictions. Greene *et al*. (2006) fitted a linear model to observation and model simulation, and the method of Min & Hense (2006*a*) is based on measuring a generalized distance between observation and simulation. On the other hand, Raftery *et al*. (2005) suggested the Bayesian model averaging (BMA) as a method of calibrating multi-model weather forecast ensembles. BMA produces a weighted probability density function (PDF) using the posterior probability of each participating model as a weighting factor. Raftery *et al*. (2005) showed the superiority of BMA in the probabilistic forecast as well as deterministic one in the case of mesoscale weather forecasts.

The objective of this study is to examine the effect of BMA on probabilistic predictions by comparing weighted and unweighted PDFs given a rather small number of a multi-model ensemble of opportunity very similar in method to the case of Raftery *et al*. (2005). The questions are: can we apply the BMA method in general directly to climate projections, can it be modified and do the results lead to interpretable results? To answer these questions, we will apply the BMA method to climate change predictions using a multi-AOGCM dataset from the Intergovernmental Panel on Climate Change (IPCC) fourth assessment report (AR4). As a first step, we focus on SAT changes in the twenty-first century under the SRES A1B scenario. The method is applied to global mean SATs and then extended into continental regions. To test modifications, we use two methods of estimating the weighting factors with both based on an analysis of the likelihood.

## 2. Data

Observations of monthly SAT anomalies over land are taken from the Climate Research Unit dataset (CRUTEM2v) for the period 1950–1999 (Jones & Moberg 2003). Area-averaged SATs are calculated over global and continental regions using a temporally varying observational mask for the analysis period. Six continental regions are defined as North America (NAM), Asia (ASI), South America (SAM), Africa (AFR), Australia (AUS) and Europe (EUR) following Stott (2003) and Min & Hense (2007). Figure 1 shows the distribution of observational grids where monthly mean data exist at least once. The same time-varying spatial coverage is applied to the model simulations for the training period (1950–1999) while the constant pattern shown in figure 1 is used for the future simulations (2001–2099).

As model data, we take 21 AOGCMs of IPCC AR4 which provide simulations under the SRES A1B scenario (model description and data are available from the Coupled Model Intercomparison Project phase 3, CMIP3 archive). According to the implemented external forcing, the models are divided into two groups: MME_ALL (natural plus anthropogenic forcing, 10 models) and MME_ANTH (anthropogenic-only forcing, 11 models). Here, we treat ALL and ANTHRO members as a common ensemble, considering that, for the second half of the twentieth century, both ALL and ANTHRO signals are detectable in the observed SAT changes with similar amplitudes (Min & Hense 2007). This approach would not be possible if the complete twentieth century would be taken for the model-data assessment. For the list of analysed models, we refer the reader to figure 2. The ensemble mean of each model is used as an input variable for skill comparisons among models. For models with a single member, we regard the single realization as ensemble mean. To estimate the internal variability of area-averaged SATs, pre-industrial control runs of the 21 models (MME_PI) are used as in the previous studies (Min & Hense 2006*b*, 2007). There are 80 independent samples of 100-year long SATs available. Therefore, this ensemble of opportunity addresses mainly the modelling (epistemic) uncertainty and its effect upon the projections.

This study is based on the assumption that models which simulate better consistency with the observed change will provide more reliable predictions of future climate changes. Dealing with large-scale SAT changes, we also assume that some of the models are plausible representations of the real world and that one can ignore the interactions with and the behaviours of other variables. Additionally, we consider only long-term components of model responses to given external forcing (mainly the anthropogenic ones for 1950–1999). To accomplish the latter, projections on Legendre polynomials in time are used as a low-pass filtering method following Min & Hense (2006*b*). Legendre coefficients from the first to fourth degree (LP1–LP4) are computed for the 50-year SAT changes of observations and model simulations. Zero degree coefficients (LP0) which represent time averages are omitted here to avoid any effect from selecting different reference periods and to reduce the influence of the difference between the ALL and ANTHRO forcing runs which are basically visible in LP0 (Min & Hense 2007). This corresponds to using SAT anomalies relative to 1950–1999 for both observations and model simulations. Model data are interpolated to the observational grid of 5°×5° before analysis.

## 3. Bayesian model averaging

In ensemble forecasting, it is customary to take the arithmetic ensemble mean (AEM) as a prediction quantity and in most cases AEM already provides a better skill than any of the ensemble members alone. However, this approach gives no information about any kind of uncertainty contained in the predictions. BMA can be a powerful tool because it produces a complete PDF as a forecast and provides a quantification of the uncertainties. If one has a high enough number of ensemble members forming a sample of the climate model population, BMA will give a realistic estimate of the modelling uncertainty of the climate system. However, one has to admit that this assumption probably does not hold due to an undersampling of the ‘climate model space’: even available models cannot be regarded as independent because they share components or are from the same institution (Allen & Stainforth 2002). Accordingly we focus on examining the sensitivity of multi-model probabilistic predictions to different weighting methods. In addition, BMA delivers a way of model selection by weighting each ensemble member according to a measure of models' performance in the training period. Weighted probabilistic predictions of the twenty-first century climate change are obtained on the basis of the model evaluation results for the twentieth century.

The theory of BMA is comprehensively described in Hoeting *et al*. (1999). Given forecast from *K* models *f*_{k}, *k*=1, …, *K*, and the training data *y*^{T}, the weighted forecast PDF for predictand *y* is obtained by(3.1)where *w*_{k} is weight for each model and *g*_{k} is a normal PDF with mean *f*_{k} and variance *σ*^{2} which is statistically denoted by *y*|*f*_{k}∼*N*(*f*_{k}, *σ*^{2}). The weights *w*_{k} are estimated from evaluating the models in view of *y*^{T} for which we use two different methods based on Bayesian statistics. Then BMA predictive mean is just the conditional expectation which is defined as weighted multi-model averaging:(3.2)If the weighting factors are all equal, the BMA mean becomes identical to AEM which is simply(3.3)

As climate variables underlie long-term variations, they are correlated not only in space but also in time. Therefore, a time-series of SAT anomalies has to be treated as the realization of a multivariate normally distributed random variable. Dealing with 50-year annual time-series, a dimension reduction is required. As explained above, we apply Legendre expansions in time restricting from LP1 to LP4 by which the linear trend and the long-term variations are only considered. Now observation and model dataset are analysed on a reduced temporal space (dimension *q*=4) which are hereafter denoted by ** d** and

*μ*_{k}, respectively.

The variance *σ*^{2} in equation (3.1) is estimated from MME_PI simulations. From 80 independent 100-year long samples, we calculate variances of global and continental-scale averaged SATs. Table 1 shows the standard deviations. The estimated standard deviation of the global mean annual SAT is 0.17 K. The corresponding continental values are larger, ranging from 0.20 to 0.44 K. NAM and EUR have relatively stronger variabilities than other regions which might be related to the North Atlantic Oscillation. Variances of 30-year mean SATs are smaller by about half of the annual values.

### (a) Bayes factor

The first approach to calculate the model weights *w*_{k} is to use normalized Bayes factors (BFs) as described in Min & Hense (2006*a*). Given observational data ** d**, the BF

*B*

_{kr}of the model

*M*

_{k}with respect to the reference model

*M*

_{r}is defined as the ratio of posterior odds to prior odds:(3.4)Selecting a different reference model has no effect on estimating

*w*

_{k}due to the normalization of the BFs. When two models are single distributions with no free parameters, the BF becomes identical to the likelihood ratio (Kass & Raftery 1995).

Assuming multivariate normal distribution for the observation and simulations, the likelihood can be expressed as(3.5)where *Σ*_{o} and *Σ*_{k} are the covariance matrix of the observation and model simulation, respectively, , and which represents a generalized Mahalanobis distance between observation and model simulation (for more details refer to Min *et al*. 2004). We assume that *Σ*_{o} and *Σ*_{k} are identical to the covariance matrix *Σ*_{ctl} estimated from MME_PI.

### (b) Expectation–maximization algorithm

Another convenient way taken from Raftery *et al*. (2005) is to maximize the log-likelihood function for the training dataset(3.6)by the expectation–maximization (EM) algorithm (Dempster *et al*. 1977). This algorithm is adapted for a problem that can be formulated with unobserved quantities *z*_{ki}. Here, we define *z*_{ki}=1 if model *k* is the best in realization *i* (see below) and *z*_{ki}=0 otherwise.

The EM algorithm is iterative and consists of two steps. In the first E (expectation) step, the current *z*_{ki} is estimated. The E step for iteration *j* is given by(3.7)In the M (maximization) step, weights and covariance matrix are estimated as follows:where *n* is the number of realizations of each model. This method requires the same large number of realizations for every model. However, some models have only a single realization available. Thus, we need a way to expand the sample size. An adequate way can be to generate additional realizations by parametric resampling.

Ideally different realizations of models should represent the whole range of internal variability. Assuming that internal variabilities in control and forced (ALL and ANTH) runs are identical, we can apply a parametric bootstrap technique as described in Efron & Tibshirani (1993). The idea is to assume that the data follow a parametric model which in our case is a multivariate normal distribution *N*(** μ**,

*Σ*). Then the parametric model has two parameters

**(mean) and**

*μ**Σ*(covariance). Here, we take

*μ*_{k}as mean and estimate the covariance

*Σ*

_{ctl}from the MME_PI control runs. Now we randomly draw a sample of size

*n*from the multivariate normal distributionWith these resampled

*μ*_{ki}, the log-likelihood function equation (3.6) is calculated and maximized as described above.

It has turned out that a relatively high number of realizations are necessary to produce stable results with respect to the magnitude of the weights. We choose *n*=20 000 being a compromise between calculation time and stability of the results. The variation of the weights is about two orders of magnitude smaller than the weights themselves.

## 4. Results

### (a) Weighting factors

Figure 2 shows distributions of weighting factors for the global and six continental area-averaged SATs for the 21 participating AOGCMs. The weighting factors are obtained based on BF and EM methods as described above. Results are also displayed for MME_ALL and MME_ANTH separately to see any significant difference between the two groups. BF and EM results show both similarities and differences. They share high-skilled models although there are exceptions for some regions and models. A major difference between the BF and EM results is that the weighting factors from BF are more evenly distributed over a larger number of models, while EM selects only a few best models and damps out the others. Higher weights to fewer models in EM reflects the fact that the model predictions are highly correlated and thereby models that contribute little additional information tend to have very low weights (Raftery *et al*. 2005). In contrast, BF weights are based on a generalized distance measure and so distributed across many models. One needs to note these different characteristics when applying BF or EM methods.

Comparing weights in figure 2, one can find only a few models which have consistently higher or lower weighting factors across the continental regions, for instance, GFDL-CM2.0, MRI-CGCM2.3.2, GISS-AOM, INM-CM3.0, CGCM3.1(t47), CGCM3.1(t63) and FGOALS-g1.0. The other models show very different combinations of weighting factors at different regions. This implies that there is no single best/worst model in simulating global and regional SAT changes, in accord with the concept of multi-model approach.

### (b) Global mean temperature predictions

Using the standard deviations in table 1 and the weighting factors in figure 2, we obtain the weighted PDFs (BF and EM) of model-simulated future SATs and compare them with unweighted one (AEM). Figure 3 shows the results for the global mean annual SATs for the twenty-first century (2001–2099). From the middle of the twenty-first century onward, the PDFs exhibit a multi-modal structure which is strongest in the EM result. Three maximum densities in EM are delivered by the three models of largest weights: MIROC3.2(hi), MRI-CGCM2.3.2 and UKMO-HadGEM1 (figure 2). BF shows a broader response due to more distributed weighting factors. There are no physical reasons for the global mean SATs such as multiple equilibria to expect a multi-modal PDF response. The obvious explanation for the behaviour shown in figure 2 is undersampling especially with respect to the long-term response. The ensemble of opportunity yet seems to have not enough information to represent faithfully the future variability as long as the BMA-based calibrations are considered. At least this implies that skill-weighted predictions are highly sensitive to the method of evaluating ensemble members.

Figure 4 displays the annual time-series of multi-model-weighted ensemble means and their 5–95% percentiles of probabilistic predictions of global mean SATs. It can be seen that the mean values of BF and EM are very similar to each other and larger than AEM where the difference increases in time with a maximum about 0.3 K by the end of the twenty-first century. The 5% percentiles of the three predictions are very close to each other while the 95% percentiles of BF and EM are larger than that of AEM, indicating the broadened PDFs in the upper tail due to the Bayesian weighting. These characteristics are also found in figure 3 showing enhanced densities in the upper branch of PDF time-series of BF and EM.

### (c) Continental-scale temperature predictions

We apply the same technique to continental regions using area-averaged 30-year mean (2070–2099) SATs. Standard deviations of 30-year mean SATs in table 1 and the same weighting factors as in figure 2 are used for this prediction. We take the long-term means to avoid the noisy patterns which arise on interannual time-scales and to simplify comparisons across the regions and weighting methods. Figure 5 shows the PDFs of SAT predictions over six continental regions using BF, EM and AEM methods. The 5% percentile, weighted multi-model mean, and 95% percentile are depicted on top of each PDF plot. Weighted PDF patterns in 2070–2099 are not Gaussian in all three methods. Comparisons to the 2010–2039 predictions (dashed lines) which are more similar to a Gaussian distribution reveal that 21 multi-models are insufficient to sample reasonably the large range of inter-model uncertainties for the late twenty-first century. Hence, their PDFs produce fine structures which can vary highly according to the composition of the sample. Although there is the possibility of multiple flow regimes on a regional scale, the multi-modality illustrated in figure 5 cannot be regarded as a realistic modelling of a probabilistic climate change. Multi-modality appears more clearly in BF and EM; for instance, EM has two maximum peaks near 2.2 and 4.0 K over ASI and marked three peaks in AFR. Nevertheless, there is little change over EUR and AUS. Across the regions, BF and EM patterns are very close, which is already reflected by the similar weighting factors in figure 2.

In terms of multi-model averages, the differences between BF and EM are very small (less than 0.2 K). The effect of Bayesian weighting is not found clearly even in 5–95% percentiles in these 30-year mean SATs. SAM is the region where largest changes appear (0.5 K increase in the 95% percentile). Annual SAT prediction for the continental regions supports this result (not shown). In short, unlike the global mean, we get only a little effect of Bayesian weighting for continental SATs. Besides the problems arising from undersampling as discussed in the global scale, the regional results also point to problems in estimating the weights. Here, we treated each of six regions independently from each other. A more promising approach would be to combine the SATs of each region and their temporal development into a common space–time vector and evaluate the weights under the spatio-temporal correlation implied by the observations and simulation.

## 5. Concluding remarks

The BMA technique is applied to the twenty-first century SAT changes simulated by the multi-model AOGCM ensembles of IPCC AR4 to produce probabilistic predictions of global and regional SATs. This approach provides a way of observationally constrained prediction of PDFs by using weighting factors which are obtained through evaluating models for the last 50 years of the twentieth century. This training is based on long-term temporal components (Legendre degrees from LP1 to LP4) to eliminate the noise on shorter time-scales.

In order to consider the influence of inter-model and internal variability systematically when estimating the weighting factors, we apply two estimation methods for the weights: BF and EM algorithm. The BMA based upon the BF approach takes the likelihood ratio which is an exponential function of a generalized Mahalanobis distance between observation and model simulation. Hence, it filters out low-skilled models more effectively than a mean-square error-based approach (Min & Hense 2006*a*). The BMA based upon the EM algorithm, which is the version suggested by Raftery *et al*. (2005) for mesoscale weather forecasting, tends to select only a few high-skilled models and leave out the other models more strongly than the BF-based method.

When applied to global and regional SAT anomalies, both BMAs exhibit comparable results in mean and higher level quantiles. Especially, the occurrence of multi-modal PDF for the global mean SATs suggests a severe undersampling of the inter-model variability at least for the long-term projections for the second half of the twenty-first century. Another possible or additional explanation could be that the weighting factors which are obtained from simulations and observations for the second half of the twentieth century could not be used beyond a horizon of similar time length. Indications for this conjecture are the prediction of the quasi-unimodal albeit non-Gaussian PDFs for the first half of the twenty-first century.

The results presented here demonstrate that observationally constrained probabilistic climate change predictions using BMA are feasible and can provide more information than the raw ensemble. However, a straightforward application of the BMA relevant for ensemble weather forecasting is not possible and might be highly dependent on the method of measuring weighting factors in the training period even if we have a large enough multi-model ensemble to construct the probabilistic predictions. Comprehensive measure of model skills based either on space–time vectors of SAT or on multiple variables (e.g. temperature and sea level pressure) might be useful to produce more robust weighting factors and hence more reliable probabilistic predictions of global and regional climate changes.

## Acknowledgments

This study was supported by the German Research Foundation (DFG) with grant He1916/8. We are grateful to two anonymous reviewers for their clarifying comments. We also acknowledge the modelling groups for making their model output available as part of the World Climate Research Program's (WCRP's) CMIP3 multi-model dataset, the Program for Climate Model Diagnosis and Intercomparison (PCMDI) for collecting and archiving this data and the WCRP's Working Group on Coupled Modelling (WGCM) for organizing the model data analysis activity. The WCRP CMIP3 multi-model dataset is supported by the Office of Science, US Department of Energy.

## Footnotes

One contribution of 13 to a Theme Issue ‘Ensembles and probabilities: a new era in the prediction of climate change’.

- © 2007 The Royal Society