## Abstract

Stochastic subgrid models have been proposed to capture the missing variability and correct systematic medium-term errors in general circulation models. In particular, the poor representation of subgrid-scale deep convection is a persistent problem that stochastic parametrizations are attempting to correct. In this paper, we construct such a subgrid model using data derived from large-eddy simulations (LESs) of deep convection. We use a data-driven stochastic parametrization methodology to construct a stochastic model describing a finite number of cloud states. Our model emulates, in a computationally inexpensive manner, the deep convection-resolving LES. Transitions between the cloud states are modelled with Markov chains. By conditioning the Markov chains on large-scale variables, we obtain a conditional Markov chain, which reproduces the time evolution of the cloud fractions. Furthermore, we show that the variability and spatial distribution of cloud types produced by the Markov chains become more faithful to the LES data when local spatial coupling is introduced in the subgrid Markov chains. Such spatially coupled Markov chains are equivalent to stochastic cellular automata.

## 1. Introduction

General circulation models (GCMs) are unable to capture the medium-term variability in the tropical atmosphere. Lin *et al.* [1] made a comprehensive study of the tropical wave spectra determined from the Intergovernmental Panel on Climate Change (IPCC) GCMs and showed that none were able to reproduce the observed power spectrum [2] of convectively coupled Kelvin waves, two-day waves, westward inertio-gravity waves and, least of all, the Madden–Julian oscillation [3]. These are the waves that modulate weather on intra-seasonal time scales in the tropics and are increasingly seen to affect two-week weather forecasts in the middle latitudes [3].

One bias that Lin *et al*. [1] identify in these GCMs is ‘the persistence of equatorial precipitation’, which occurs at the subgrid scales. In the parlance of dynamical systems, the subgrid dynamical models quickly attain their equilibrium values and remain there too long. Palmer [4] used simple arguments from dynamical systems to show how the reduction of a chaotic dynamical system to a smaller number of degrees of freedom can suppress the chaos. While this has the obvious effect of suppressing the variability, he argued that it can have the, even more insidious, effect of driving systematic errors in the mean state. A stochastic parametrization of the unresolved convection introduces variability in the GCM description of these processes, and these parametrizations are increasingly being seen as the next generation of subgrid models [4–10].

Khouider *et al.* [7] created a stochastic multi-cloud model based on the deterministic multi-cloud model of Khouider & Majda [11]. The deterministic multi-cloud model was derived to correspond to the observed behaviour of tropical waves [12], where a focus on three cloud types is needed to capture the observed structure of convectively coupled waves. Furthermore, the deterministic model was calibrated so that the dynamics of the waves matched those of the tropical wave spectrum [2]. When implemented in a GCM, it has been shown to capture much of the convectively coupled equatorial wave [13] activity.

In the stochastic model [7], convection is modelled on a two-dimensional microlattice by letting the local convective state at each lattice site switch randomly between four possible states (three cloud types, and clear sky) with a given probability. At the macroscopic level, the area fractions of these four states evolve randomly over time. The fractions effectively determine the feedback from the micro-scale to the macro-scale. Even in the setting of a single column [7], it was shown that the stochastic multi-cloud model has a large degree of variability. When coupled to a one-dimensional dynamical core [8], it produces a large degree of gravity wave variability.

Crommelin & Vanden-Eijnden [14] proposed a data-driven stochastic parametrization methodology, where the stochastic processes driving the parametrization are systematically inferred from data (e.g. from high-resolution models). This method was used by Dorrestijn *et al*. [15] on data from a large-eddy simulation (LES) of shallow convection. This approach leads to a model with random jumps between a finite number of possible subgrid-scale states, where both the discrete states as well as the switching probabilities are estimated from data. Furthermore, the switching probabilities are dependent (conditional) on the macroscopic, resolved-scale state of the atmosphere.

For the shallow convection parametrization in Dorrestijn *et al*. [15], vertical turbulent fluxes of heat and moisture were collected from the LES data and discretized using a clustering method. By contrast, the discrete states used in Khouider *et al*. [7] are cloud types (congestus clouds, deep convective clouds, stratiform clouds and clear sky) rather than flux states. The states and switching probabilities used in Khouider *et al*. [7] are based on physical intuition and observations; they are not inferred from data.

The objective of the current study is to determine a stochastic multi-cloud parametrization approach as in Khouider *et al*. [7] using a data-driven approach from [14,15]. Much as in Khouider *et al*. [7], we use pre-specified cloud types as a basis for discretizing the subgrid-scale states, and study their (time-evolving) fractions on macroscopic domains. The precise discretization, as well as the switching probabilities and the conditioning on the resolved-scale state, are all inferred from LES data, as in Crommelin *et al*. [14] and Dorrestijn *et al*. [15].

Specifically, we use 8 h of simulation of the development of tropical convection based on an idealization of observed conditions in northwest Brazil [16]. Simulated cloud top and rain water path are stored to classify states on the LES (horizontal) grid nodes. We use five states: (i) *clear sky* and the four cloud types (ii) *shallow cumulus*, (iii) *congestus*, (iv) *deep* and (v) *stratiform*. Strictly speaking, clear sky is not a cloud type, but from now on we will refer to five cloud types. At the beginning of the simulation, only clear sky is present. Gradually, shallow cumulus develops, followed by (raining) congestus clouds. After about 5 h, deep convective towers with heavy precipitation develop. The deep convective towers turn into passive stratiform decks that spread and dissolve.

The paper is organized as follows. In §2, we discuss how we model transitions between cloud types with Markov chains (MCs), and how these MCs can be made conditional on the environment, or on the cloud types at neighbouring lattice sites. We describe the LES data and specify the cloud classification in §3. The stochastic multi-cloud model is described in §4. In §§5–7, we infer the transition probabilities of the MCs and assess their ability to reproduce (emulate) the cloud filling fractions from the LES data. In §5, we use a MC without conditioning, in §6 a MC conditioned on the environment, and in §7 a MC conditioned on cloud types at neighbouring lattice sites. Then, we discuss implementation of the multi-cloud model into a simple single-column model (SCM) (§8), again calculating cloud filling fractions. Finally, conclusions about our multi-cloud model, how stochastics can change dynamics and its implications for climate models are given in §9.

## 2. Modelling cloud type transitions with Markov chains

A central element in the stochastic parametrization approach used here and in recent studies [7,14,15] is discretization of the subgrid-scale (e.g. convective) states. Here, each grid point at the microscopic level can be in only one of five possible states. Let us denote by *Y* _{i}(*t*)∈{1,2,3,4,5} the state at time *t* at grid point *i*. The time evolution of *Y* _{i}(*t*) is modelled as a MC, so *Y* _{i}(*t*) changes randomly in accordance with a set of transition probabilities. In the most basic form, these probabilities are simply
2.1
However, in this basic formulation, the probability of, for example, a congestus state at grid point *i* turning into a deep convective state is independent of the environment (macroscopic state) for *i*. To include such dependence, in recent studies [7,14,15], the transition probabilities have been conditioned on the macroscopic state. If we denote by *X*_{i}(*t*) a variable that is representative of the environment of *i* (e.g. convective available potential energy (CAPE), convective inhibition (CIN) or mid-troposphere relative humidity (RH)), the transition probabilities of such a conditional MC (CMC) are
2.2
As can be seen, the transition probabilities in (2.1) and (2.2) are not explicitly dependent on the convective states of neighbouring grid points. If *i* and *j* are neighbouring grid points, *Y* _{i} and *Y* _{j} are completely uncoupled in case (2.1). They are coupled indirectly via *X*_{i} and *X*_{j} in case (2.2), because *X*_{i} and *X*_{j} are coupled at the macroscopic level. Because *i* and *j* are neighbouring grid points, *X*_{i} and *X*_{j} will be strongly correlated. In this paper, we also explore explicit conditioning on the neighbourhood, as this is likely to improve the spatial correlation of the parametrized convection patterns. We do this by considering the conditional transition probabilities
2.3
and
2.4
where {*i*} denotes the neighbourhood of *i* (e.g. the eight direct neighbours on the lattice). We note that by conditioning the MC on neighbouring states, as in (2.3), the MC effectively becomes a stochastic cellular automaton (SCA). A schematic overview of the generalizations of the MCs is shown in figure 1.

Each grid point on the microlattice has a state that evolves randomly according to the same set of transition probabilities, e.g. (2.2). At the macroscopic level, square blocks of microlattice sites are grouped together, and we study the filling fractions (or area fractions) of the various convective states. For each block, we have
2.5
where *n* is the number of microlattice sites in the macroscopic block, and **1**(⋅) is the indicator function. The filling fractions are time-dependent and random, and must sum up to one for each macroscopic block: for all *t*. By matching the size of the macroscopic blocks to the (horizontal) size of GCM model grid boxes, the filling fractions can be used as input for parametrizing vertical transport due to convection.

## 3. Large-eddy simulation

We use the Dutch Atmospheric LES (DALES) model to produce high-resolution data. DALES is a non-hydrostatic model that resolves atmospheric convection explicitly by solving the spatially filtered Navier–Stokes equations under the anelastic approximation. The model has an ice microphysics scheme, but does not account for latent heat release due to freezing. For further details about DALES we refer to Heus *et al*. [17]. The simulation is based on an idealization of observed conditions [16] during the tropical convection experiment TRMM-LBA carried out in northwest Brazil in 1998/1999. There is no horizontal shear, and surface heat and moisture fluxes are held constant throughout the simulation. At the start of the 8 h simulation, the entire LES domain consists of clear sky. Convection develops gradually, first shallow convection, eventually (after about 5 h) also deep convection. We emphasize that it is a non-stationary case of the development of deep convection. The simulation and the resulting data are described in more detail by Böing *et al*. [18].

The horizontal size of the LES domain is 57.6×57.6 km^{2} and the vertical extent is 25 km. The horizontal grid spacing is 150 m and the vertical spacing increases exponentially from 40 m near the surface to 200 m at the upper levels. For every column, we store the simulated cloud top height, rain water path (the vertically integrated rain water content), CAPE and CIN. We also store liquid water potential temperature *θ*_{l} and total water specific humidity *q*_{t} at two model levels, one in the boundary (subcloud) layer at 413 m, the other in the lower free troposphere at 2345 m. These variables are defined by
3.1

with *θ* the potential temperature, *L* the latent heat of vaporization, *c*_{p} the specific heat of dry air at constant pressure, *q*_{l} the non-raining liquid water content and *q*_{v} the water vapour specific humidity. Furthermore, *π* is the Exner function, the ratio of absolute and potential temperature. In the absence of precipitation, *θ*_{l} and *q*_{t} are conserved for moist adiabatic processes. We store the data at time intervals of 1 min during 8 h, resulting in 480 time slices of the variables mentioned above in each of the 384×384 LES model columns. Below, we discuss how these variables are used for classification of each model column state into five cloud types.

### (a) Classification of cloud types

In the vein of Mapes *et al*. [19] and Khouider *et al*. [7], we consider five cloud types: clear sky, shallow cumulus, congestus, deep convection and stratiform. Figure 2*a*,*c*,*e* shows histograms of the cloud top height. At *t*=480, we see three categories (clear sky, low clouds and high clouds), which can be well distinguished, with thresholds at 200 and 5000 m. Furthermore, to distinguish the heavily raining deep convective towers from their passive, modestly raining stratiform remnants, we use the rain water path divided by the cloud top height. We call this the column rain fraction,
3.2

By dividing by the cloud top height we obtain a measure of the rain intensity, from which the vertical extent of raining cloud has been factored out. The CRF makes it easier to identify stratiform clouds, which have high cloud top and low, but not always negligible, rain water path. Furthermore, we can use the same threshold of the CRF, 10^{−5}, to distinguish deep from stratiform, as well as non-raining shallow cumulus from raining congestus clouds. In figure 3, we plot the CRF against the cloud top height and indicate four cloud types with different symbols. The clear sky group is not shown because its CRF is not well defined. In table 1 we summarize the cloud classification.

We can now assign the state of each LES column, at every time step, to one of the five cloud types. Figure 2*b*,*d*,*f* shows snapshots of the LES domain with all columns assigned to one of the cloud types. At *t*=100, we see clear sky sites combined with shallow cumulus clouds and some congestus clouds. At *t*=300, the development of deep towers start. At *t*=480, we see larger deep towers and dissolving stratiform decks.

## 4. The stochastic multi-cloud model

With the LES data discretized according to table 1, we can choose the size of the macroscopic blocks and calculate the filling fractions *σ*_{α}(*t*) on each of these blocks using (2.5). In what follows, the LES blocks always consist of 32×32 microscopic lattice sites (so that *n*=32^{2}), unless explicitly stated otherwise. The corresponding physical size of these blocks is 4.8 km×4.8 km. The total LES domain is covered by 12^{2} of such (non-overlapping) blocks. In figure 4*a*, we show the time evolution of the means and standard deviations of the filling fractions, taken over the 12^{2} different blocks. We emphasize that these are the filling fractions as computed directly from the LES data. With the stochastic multi-cloud model, we aim to emulate the time evolution of the LES filling fractions. This is done by evolving the state (cloud type) of each microlattice site as a MC. The states on the microlattice sites can be grouped again in macroscopic blocks (of any desired size), leading to emulated filling fractions. As already mentioned, the number of MCs grouped together in the multi-cloud model in one macroscopic block will be 1024, except for the creation of plots in figures 7*b*, 8*b* and 11*b* where we use blocks of 64 MCs.

The transition probabilities that characterize the MC are of the form (2.1), (2.2), (2.3) or (2.4). Their numerical values are estimated from the LES data. We use a time step *Δt* of 1 min, matching the saving time step of the LES data. We assess the performance of the various forms (2.1)–(2.4) in the following sections. The choice of the macroscopic environment variable *X*_{i}(*t*), used in (2.2) and (2.4), is discussed there as well.

Eventually, the multi-cloud model has to provide not just filling fractions, but vertical profiles for heating and moistening that can be used for parametrization purposes in a GCM. In §8*b*, we explain how we deal with heating and moistening in an SCM experiment.

## 5. Markov chains

We start by using the simplest form (2.1), i.e. the form where the MC is not conditioned on macroscopic environment variables or on neighbour states. The transition probabilities determine a single 5×5 stochastic matrix in which the entry at the *k*th row and *l*th column is the probability that a site that is in state *k* will switch to state *l* in the next minute. We count transitions in the LES data to estimate the transition probability matrix, resulting in
We use all data of the entire simulation to estimate transition probabilities. In this case we do not take into account the strong dependence of the transitions on time. The reader is reminded that the case we consider is a non-stationary case of the development of deep convection. Next, we will test the skills of this MC.

### (a) Filling fractions of the Markov chain

Figure 4 shows cloud filling fractions observed in the LES data and reproduced by the MC. The MC filling fractions converge quickly to the filling fractions that correspond to the invariant distribution of the transition matrix. These fractions are, therefore, accurate in the sense that they are in agreement with the time averages of the fractions observed in the LES data. However, the standard deviations are too small and the overall time evolution of the LES cloud fractions is not captured at all.

From the results in figure 4, we can conclude that a MC governed by (2.1) is not capable of emulating the LES cloud fractions satisfactorily. A longer time step (20 min) for the MC did not improve any of these deficiencies (results not shown). Rather, the shortcomings are due to the insensitivity of the MC to both the macroscopic environment and the neighbour states. A natural way to improve on this is to include dependence on environment or neighbours. Thus, in the next sections we generalize the MC (2.1) by

— conditioning on the macroscopic state (environment), leading to the CMC form (2.2), or

— coupling to neighbouring cells, leading to the SCA form (2.3).

In the most general form (2.4), both environment and neighbouring states are included. A schematic overview of these generalizations was shown in figure 1.

## 6. Conditional Markov chains

In this section, we explore conditioning of the MCs on a function of some large-scale variables that could be resolved in a GCM. Large-scale variables such as CAPE, CIN, middle troposphere RH or (moist) convergence are considered to be potential indicators of convective behaviour. In §6*a*, we discuss how mutual information can be used as an objective measure to quantify how good these indicators are.

For now, to explain our method we choose to condition on the CAPE and the CIN. These functions of large-scale variables have been used before in Khouider *et al*. [7,20]. A reversibly lifted adiabatic parcel, using the mean thermodynamic properties at the 200–400 m level, is used to calculate the CAPE and the CIN in every LES model column. In the present context, CAPE and CIN mostly indicate the evolution of the surface properties, rather than the state of the free troposphere. CAPE and CIN are affected both by the gradual moistening and heating by surface fluxes and by the presence of cold pools [21]. The values depend on the choice of variable used to construct the adiabats, in our case *θ*_{l}. Although the CAPE values reported here, maximum values of around 4500 J kg^{−1}, are higher than what we had expected, seasonally averaged values as high as 7000 J kg^{−1} have been reported over tropical land masses by Riemann-Campe *et al*. [22].

As before, we divide the whole LES domain into 12^{2} macroscopic blocks (subdomains) and calculate spatial averages of CAPE and CIN on these subdomains. We thereby obtain 12^{2} paths in the CAPE–CIN space, each 480 min long. An even larger part of the CAPE–CIN space could be sampled by combining data from several LES runs with different initial profiles for temperature and humidity; we will not explore this here.

After obtaining the paths in the CAPE–CIN space, we cluster the CAPE–CIN data points in *K* clusters using the K-means++ algorithm [23,24,25]. While clustering the CAPE–CIN space, we use the Euclidean distance with different rescaling factors for CAPE and CIN. The rescaling factors are such that the mean contributions to the distance to the centroids are equal for CAPE and CIN. The clustering algorithm also works for all other (combinations of) large-scale variables, with other scaling factors. The number of clusters *K* has to be chosen beforehand. It should be as small as possible, because for every cluster a 5×5 transition matrix has to be estimated. We refer to Dorrestijn *et al*. [15] and Kwasniok [26] where clustering has been used to construct CMCs.

In figure 5, we show the result of the clustering using *K*=20. For *K*=20, we will show that the CMCs are able to reproduce the correct filling fractions (see §6*b*). All 12^{2} paths start at CIN ≈ 80 J kg^{−1} and CAPE ≈ 2400 J kg^{−1}. Then, CAPE increases and CIN decreases almost uniformly in the domain. When deep convection sets in, the domain starts to become very inhomogeneous, resulting in CAPE and CIN values that differ substantially over the subdomains. After the CAPE–CIN space is divided into *K* regions, the paths in the CAPE–CIN space can be mapped to paths in the space of cluster centroids. To sum up: first, we calculate the (time-evolving) subdomain averages of CAPE and CIN from the LES data, then we cluster these CAPE-CIN averages. To determine the environment state *X*_{i}(*t*) for microlattice site *i*, we use the discretized (clustered) CAPE–CIN state of the subdomain to which site *i* belongs. Thus, *X*_{i}(*t*) effectively takes values in the set of cluster indices: *X*_{i}(*t*)∈{1,2,…,*K*}. Using this *X*_{i}(*t*) in the manner of (2.2) to condition the transition probabilities implies that we have a transition probability matrix associated with each CAPE–CIN cluster.

These transition probability matrices are estimated by counting transitions in the LES data (see also [14]). To estimate the probability *p*_{γ}(*α*,*β*) defined in (2.2) we use the estimator
6.1
where *T*_{γ}(*α*,*β*) is the number of cloud type transitions *α*→*β* observed in the LES data with *X*_{i}(*t*)=*γ*. Thus,
6.2

### (a) Mutual information between environment and cloud type

Large-scale variables such as CAPE, CIN or middle troposphere RH are considered to be potential indicators of convective behaviour. Below we discuss how mutual information can be used as an objective measure to quantify how good these indicators are.

Suppose we have two discrete random variables with a joint probability mass functions *p*^{J}(*x*,*y*) and marginal probability mass functions *p*(*x*) and *p*(*y*). Then, the mutual information is the relative entropy or Kullback–Leibler distance between the joint distribution *p*^{J} and the product distribution *p*^{P}(*x*,*y*)=*p*(*x*)*p*(*y*). It is given by
where the sum is over all values of *x* and *y*. Here *I*(*p*^{J},*p*^{P}) quantifies how much additional information *p*^{J} contains relative to *p*^{P}. For more details about mutual information and other information-theoretic concepts, we refer to Cover & Thomas [27].

In our case, *x* and *y* are the environment state *X*_{i}(*t*) and the cloud type *Y* _{i}(*t*) at the same location, respectively. The mutual information between their joint distribution and the product of their marginal distributions quantifies how good an indicator *X*_{i}(*t*) is for *Y* _{i}(*t*), and thus how useful it is to condition the MC for *Y* _{i} on *X*_{i}. In Meyer *et al*. [28], similar use is made of mutual information to select useful indicators for stochastic cellular automata. We note that, in our case, the joint and marginal distributions are non-stationary; therefore, we calculate the mutual information separately for every time *t* of the LES dataset.

In figure 6, we show three time series for mutual information between the large-scale variables and the cloud type. In the beginning of the simulation, the mutual information is zero. The reason is that clouds have not evolved yet, and therefore the large-scale variables do not give information about the presence of a cloud. The mutual information is first calculated for every time instance and then the average is calculated over the last 2 h (the phase in which deep convection is developed) to obtain a single value for the mutual information such that we can compare different choices of the large-scale variables. In table 2, we list the time-averaged mutual information using various (clustered) quantities for *X*_{i}. To give an interpretation to the value of (mutual) information in nats, we mention that the mutual information between the cloud type and the cloud type itself is 1.1486 (this would be the best possible score).

The result in table 2 shows that the combination of CAPE and CIN gives significantly more information about cloud type than either of them alone. We see that both the vertical wind velocity field (*w*) and the CAPE/CIN fields contain information on the state of convection. Both of them may be used to reproduce some of the time-dependent behaviour of convective organization in low wind shear (e.g. cold pools). Here, we choose for CAPE and CIN to obtain the best filling fractions. A more detailed study of the physical mechanisms behind the organization of deep convection in the present case is given in [18].

As a final remark, we have included the mutual information of the zonal wind (*u*) at 15 843 m in table 2 as a consistency check: *u* at 15 843 m is mainly determined by upward propagating gravity waves that can have a remote origin, and we do not expect it to be a good indicator of the state of convection and cloud type. The low value of the mutual information confirms this intuition.

### (b) Filling fractions of the conditional Markov chain

Figure 7 shows filling fractions produced by CMCs that are conditioned on CAPE and CIN with *K*=20 clusters. Figure 7*a* shows the means and standard deviations of the fractions over 144 macroscopic blocks using 1024 CMCs per block. The time evolution of the means is in good agreement with the LES results, as can be seen by comparing with figure 4*a*. With a smaller number of clusters (*K*=10), the agreement was unsatisfactory (results not shown). Further, the standard deviations are too small compared with the LES results. They can be increased by using a smaller number of CMCs (because fractions determined by a smaller number of MCs are more likely to deviate from the expected values). In figure 7*b,* we show the means and standard deviations using only 64 CMCs per macroscopic block. As expected, using only 64 instead of 1024 CMCs, the standard deviations are larger and, therefore, in better agreement with the LES fractions. In figure 8, we show cloud filling fractions on a single macroscopic block: in figure 8*a*, the fractions of the LES data on a block of size *n*=1024; in figure 8*b*, the fractions as produced by the multi-cloud model using 64 CMCs (conditioned on CAPE–CIN). We see that by using CAPE and CIN to condition the CMCs, the time evolution of the filling fractions is captured. This is not solely because CAPE and CIN are indicators of convection: in the first part of the simulation, CAPE increases (and CIN decreases) steadily with time, so that conditioning on CAPE and CIN is similar to conditioning on time. However, this only holds true for the first part of the 8 h of simulation. In the last hours, CAPE no longer increases in all LES subdomains. Instead, we observe a decrease of CAPE in part of the subdomains.

## 7. Stochastic cellular automaton

In §6, it was shown that conditioning of the MC on the macroscopic environment strongly improves the behaviour of the filling fraction means, cf. figures 4 and 7. However, the variances of the CMC filling fractions are too small, and can only be brought into better agreement with the variances of the LES filling fractions by reducing the number of CMCs per macroscopic block. In this section, we investigate whether coupling to neighbouring sites on the microlattice can improve the emulated variances, without reducing the number of MCs. Thus, we study use of the forms (2.3) and (2.4) for the MC. We expect that, by coupling to neighbouring sites, the spatial correlations of the cloud type patterns will be better captured, thereby increasing the variance.

As mentioned earlier, by conditioning the MC for lattice site *i* on the state of the neighbouring sites, as in (2.3), the MC becomes an SCA. Cellular automata (CA) have been used for parametrization purposes [5,6,29]. In these studies, the CA have deterministic rules, not stochastic ones, and they are chosen by intuition rather than inferred from data. Also, in those recent studies [5,6,29], the cells of the CA can take on two states, not five as is the case here.

First, we estimate the SCA transition probabilities (2.3) from the LES data. As before, *Y* _{i}(*t*) is the cloud type at site *i* at time *t*, *Y* _{i}(*t*)∈{1,2,3,4,5}. Use of (2.3) implies that in principle, for every state *δ* of the combined neighbouring sites *Y* _{{i}}, there is a different transition probability matrix. This reflects, for example, that the probability of a clear sky site turning into a shallow cumulus site may increase as the number of neighbouring shallow cumulus sites increases.

For the neighbourhood of site *i*, denoted {*i*}, we choose the eight sites directly surrounding site *i* in the microlattice (see figure 1). As each site can take on five different values, there are 5^{8} different configurations, i.e. 5^{8} possible values of *δ*. This is too large to be practicable; therefore, we reduce the number of possibilities by conditioning not on *Y* _{{i}}(*t*) directly, but on a simple reduction function *f* that depends on *Y* _{{i}}(*t*). Thus, we use
7.1
rather than (2.3) itself.

Let us denote by |CL|_{i} the number of clear sky sites directly surrounding *i*, and similarly by |SH|_{i}, |CO|_{i}, |DE|_{i} and |ST|_{i} the number of surrounding shallow, congestus, deep and stratiform sites. These numbers are time-dependent. Clearly, |CL|_{i}+|SH|_{i}+|CO|_{i}+|DE|_{i}+|ST|_{i}=8 for all *i* and at all times. As the function *f* we now choose
7.2
The reason for choosing this particular reduction function is that it is a measure of the degree to which the direct environment is convectively active: the more neighbouring sites there are in a state of convection, the larger the value of *f*. Furthermore, a neighbouring site with cloud type congestus increases *f* more than a neighbouring site with cloud type shallow. The function increases even more if there is a neighbouring deep site. The choice of the factor 4 for stratiform is somewhat debatable, but the coefficient has to be larger than 3 to indicate the presence of stratiform instead of some other cloud type. Further, the value has to be as small as possible to reduce the number of states (and therefore matrices) as much as possible. One can use information theory to perform a systematic search for functions that give the most information about the transitions (see [28] for some ideas on this); however, we will not pursue this here. Estimating the probabilities (7.1) is straightforward, using an estimator analogous to (6.1)–(6.2).

We obtain 33 different transition matrices of size 5×5, because 0≤*f*≤32. For each site, the state of the neighbourhood is determined by counting the numbers of different cloud types surrounding it, and computing the corresponding value of *f*_{i}(*t*) as in (7.2). This value determines which transition matrix is used at lattice site *i* at time *t*.

We initialize the SCA multi-cloud model using 384×384 cells all in a clear sky state, corresponding to the initial condition observed in the LES data. As time evolves, some cells switch to shallow cumulus and clusters of shallow cumulus cells appear. Later on, the SCA correctly produces congestus sites in the shallow cumulus clusters. At about 250 min after initialization, similar to LES, deep convective sites appear. These turn into stratiform decks. Eventually, the patterns of the SCA are clear sky areas with some shallow cumulus, and areas of a mixture of congestus, deep and stratiform. This mixture is not observed in the LES data, but the fractions turn out to be correct. First, we show the patterns produced by the SCA in figure 9*a*.

Figure 10*a* shows filling fractions (mean and standard deviation) for the SCA, using (7.1)–(7.2). The standard deviation is taken over macroscopic blocks of size *n*=1024. Both the time evolution and the magnitude of the standard deviations are in much better agreement with the LES data (figure 4*a*) than those produced by the CMC (figure 7). The time evolutions of the means are reasonable, but not as good as those of the CMC. Therefore, as a final step of refinement, we combine CMC and SCA by conditioning the MC both on the macroscopic state *X*_{i}(*t*) and on the neighbouring states *Y* _{{i}}(*t*). We refer to this combination as conditional SCA (CSCA). To the best of our knowledge, a (stochastic) CA conditioned on an ‘external’, time-evolving field (*X*, in our case) has not been studied before.

The filling fractions of the CSCA are shown in figure 10*b*. As before, we used the function (7.2) rather than *Y* _{{i}}(*t*) to condition the CSCA on the neighbouring sites. Thus, the transition probabilities are as in (2.4), but with *Y* _{{i}}(*t*) replaced by the function (7.2). For conditioning on the macroscopic state *X*_{i}(*t*), we used CAPE, clustered with five centroids. The patterns are similar to the patterns of the SCA; compare the panels of figure 9. The time evolution of the filling fraction means is in better agreement with the LES data than was the case with the SCA. We anticipate that further improvement is possible, e.g. with search techniques as in [28], and with methods to reduce the parameter space as in [26]. We leave this for future study.

## 8. Single-column model

In the tests performed in the previous sections, there was no interaction between the large-scale variables and the CMC or CSCA. Therefore, to take a step forward towards implementation in a GCM, we test the multi-cloud model in an SCM experiment. The SCM can be thought of as representative for the behaviour of a single GCM vertical model column. We use one macroscopic block, containing 1024 CMCs, to represent the GCM model column. These CMCs are conditioned on CAPE and CIN, as in §6. We choose suitable large-scale variables and use LES data to precalculate their tendencies. The tendencies are assumed to depend linearly on the filling fractions determined by the multi-cloud model. Thus, the large-scale variables and the cloud filling fractions are coupled to each other, and both evolve over time. Inspired by Khouider *et al*. [7] we take four prognostic variables: *X*_{1}=*q*^{low}_{t}, *X*_{2}=*q*^{high}_{t}, *X*_{3}=*θ*^{low}_{l} and *X*_{4}=*θ*^{high}_{l}, with *q*_{t} and *θ*_{l} as defined in (3.1). The low level is at 413 m and the high level is at 2345 m in the atmosphere. These are the variables that we are going to resolve in our SCM.

We use the CMCs, conditioned on CAPE and CIN, to calculate the filling fractions of each cloud type. Therefore, we have to express CAPE and CIN in terms of the prognostic variables *X*=(*X*_{1},…,*X*_{4})^{T}.

### (a) CAPE* and CIN*

We assume that CAPE is a linear combination of *X*. We compute the coefficients by doing a least-squares fit with the CAPE values from the LES data and the values of *X*, also from the LES data. We write
8.1
where *λ*=(*λ*_{0},…,*λ*_{4}) are the coefficients and where we add the constant term *X*_{0}=1. We solve
and find that the linear CAPE* is almost completely determined by *q*_{t} and *θ*_{l} at the low atmosphere level. The correlation coefficient of CAPE and CAPE* is 0.97, so we can use CAPE* as a proxy for CAPE. In general, this is not the case, but free tropospheric properties change relatively slowly in the LES data. For CIN, we do a linear fit of the logarithm of CIN. We write
8.2
Here *μ*=(*μ*_{0},…,*μ*_{4}) are the coefficients for CIN*. For CIN and CIN* we find a correlation coefficient of 0.77, so we can use CIN* instead of CIN.

### (b) Large-scale tendencies

In a GCM, a parametrization should deliver entire vertical heating and moistening profiles. In our SCM experiment, we only have four prognostic variables and, therefore, we use LES data to determine the influence of the cloud filling fractions *σ* on these four prognostic variables *X*. Below, we propose a method of using data to calculate the heating and moistening (i.e. the tendencies ); whether this method will work for a large number of variables remains to be explored.

In Dorrestijn *et al*. [15] this was done for shallow cumulus convection by clustering vertical heat and moisture fluxes observed in LES data. Here, we will use a least-squares fitting method that we have already used to calculate the CAPE* and CIN*. Every cloud type has an influence on *θ*_{l} and *q*_{t} at the low and high atmosphere levels. This means that
where is the tendency of *X*_{m} (1≤*m*≤4) and is the influence of cloud type *α* on prognostic variable *X*_{m}. We assume that is a linear combination of the prognostic variables *X*:
We now have
8.3
where *σ* is the 1×5 filling fraction vector, *ν*_{m} is a 5×5 matrix that has to be estimated separately for every prognostic variable *X*_{m}, and *X* is the 5×1 prognostic variables vector. For every prognostic variable *X*_{m}, we estimate *ν*_{m} by least-squares fitting. This is done as follows. Our aim is to calculate, for every 1≤*m*≤4:
8.4

In every subdomain of LES we observe the prognostic variables *X*, tendencies and the LES filling fractions *σ*. This is the case for 479 time instances (at the last time instance *t*=480 the tendencies are not estimated). We can write (8.4) in the form *y*=*Zν*. Then, the least-squares fit gives . This gives the best least-squares estimate of the 25 entries in the 5×5 matrix *ν*_{m}.

### (c) Integration of the single-column model

We integrate equation (8.3), to obtain the evolution of the prognostic variables *X*_{1},…,*X*_{4}. As initial condition, we take *σ*=(1,0,0,0,0). This means that each CMC starts in state 1 (corresponding to clear sky). The initial conditions for *X* are the average initial values observed in the LES data. The CMCs produce the filling fractions *σ* and the *ν* are pre-calculated in §8*b*. We recall that the CMCs are conditioned on CAPE* and CIN*.

### (d) Filling fractions of the single-column model

We test the stochastic multi-cloud model in the SCM. In figure 11, we show filling fractions for the SCM using 1024 CMCs. To increase the standard deviation, we do a second experiment using only 64 CMCs. To calculate the standard deviation in every experiment, we use 12^{2} independent runs of the SCM. In this way we can compare the standard deviation to the standard deviation that we observed in the 12^{2} LES blocks (each consisting of 1024 LES columns). Comparing figures 4*a* with 11, we see that the SCM–CMC is capable of reproducing the time evolution of the filling fractions from the LES data. This is a remarkable result because the SCM is not using any LES data during the integration. Recall that the SCM has been constructed from LES data prior to integration.

Using a smaller number of MCs (64 instead of 1024) increases the variance of the filling fractions in the SCM test, as can be seen in figure 11*b*. We expect that further improvement of the evolution of the standard deviations in the SCM is possible using the SCA or the CSCA instead of CMC, but we did not perform these experiments here.

### (e) A 10-day run of the single-column model

We have seen that the multi-cloud model produces correct filling fractions and that it can be used to enhance variability in the SCM. We integrate the SCM over a longer time period. Although the SCM–CMC has not been trained on a longer period, there are no practical restrictions on performing longer time integrations. As in Khouider *et al*. [7], we integrate the SCM for 10 days. Here, using the SCM, we do not aim to represent a realistic simulation of deep convection (as is the case for LES). Rather, we are interested in the long-term behaviour of the SCM as a dynamical system, seen as coarse extrapolation. We investigate whether or not the multi-cloud model can enhance variability in the SCM. In figure 12, we plot time series for the prognostic variable *X*_{4} in the SCM integrated over 10 days with a time step of 1 min. The graphs for the other *X*_{i} are similar. For both runs, with 2500 CMCs and 64 CMCs, we see a cycle of around 8 h. This cycle is not caused by diurnal variations in the surface fluxes, because the CMCs have been trained on data from an LES run with fixed surface fluxes. We note that the trajectory depends strongly on the number of MCs used. With a large number of MCs, the system behaves very regularly. For smaller *n*, the multi-cloud model is more stochastic, and the SCM–CMC model displays more variability.

## 9. Discussion and conclusion

In this paper, we combined, for the first time, the data-driven approach to stochastic parametrization from Crommelin & Vanden-Eijnden [14] and Dorrestijn *et al*. [15] with the stochastic multi-cloud model approach proposed in Khouider *et al*. [7]. We used data from a convection-resolving LES model to infer a multi-cloud model similar to the one studied in Khouider *et al*. [7]. The aim was to formulate a stochastic model that was able to emulate the coarse-grained convective behaviour of the LES. Data for cloud top height and column rain fraction from the LES were used to determine five cloud types: clear sky, shallow cumulus, congestus, deep and stratiform. The coarse-grained convective behaviour of the LES was represented through the filling fractions, or area fractions, of the five cloud types on (horizontal) macroscopic blocks of 32^{2} LES grid points.

The stochastic model (MC) makes random transitions between cloud types at each grid point, in accordance with transition probabilities that are estimated from the LES data. A straightforward MC was not able to reproduce the correct evolution of the filling fractions corresponding to the five cloud types. Therefore, we explored two ways of improving the skills of the MC: first, by conditioning the MC on large-scale variables, obtaining a CMC; second, by conditioning on the neighbouring cells, obtaining an SCA.

The CMC conditioned on a combination of CAPE and CIN was well capable of reproducing the time evolution of the cloud fractions observed in the LES data. The standard deviations of the filling fractions were not very well reproduced by the CMCs. They were too small and not similar to the standard deviations observed in the LES data. The absence of direct spatial coupling between cloud types in neighbouring cells in the CMC made it difficult to capture the time-varying spatial patterns seen in the LES data. Therefore, the enhanced variability due to these patterns could not be captured by the CMCs.

The average filling fractions of the SCA were not as good as the CMC average filling fractions. Nevertheless, the SCA showed a much better evolution of the standard deviation of the filling fractions. By including spatial coupling, spatial and temporal patterns emerged, resulting in more realistic variability. We showed that further improvement can be achieved by additional conditioning on the large-scale variables; however, this comes at the cost of a more complicated model.

A point of discussion is that the CMCs in the multi-cloud model have been trained on LES data of rather specific idealized (atmospheric) conditions. Clearly, not all possible large-scale states were sampled in this dataset. Dividing the LES domain into subdomains, as was done here (as well as in [15]), enlarges the sample of large-scale states. The large-scale states are defined as subdomain averages, so that the variability between the subdomains helps to increase the sample variance. As already mentioned in §6, one can increase the sample variance even more by using data from multiple LES runs with different initial conditions.

We focused on a setting in which shear in the horizontal plane and spatially varying terrain type have not been considered. In the case of a unidirectional shear with varying strength, the transition probabilities of the SCA may have to depend on the neighbouring cells in an anisotropic way. The question of how strong this sensitivity is, has not been addressed here. With varying terrain, a possible solution is conditioning on several types of terrain.

We showed how the LES data can be used to produce heating and moistening rates. We tested the multi-cloud model in a simple SCM experiment. Using the CMCs, the LES filling fractions were faithfully reproduced by the SCM. The degree to which the multi-cloud model was stochastic had a large influence on the variability of the SCM.

## Acknowledgements

The authors are grateful to Pier Siebesma, Harm Jonker and Christian Jakob for stimulating discussions. This research was supported by the Division for Earth and Life Sciences (ALW) with financial aid from the Netherlands Organization for Scientific Research (NWO). The visit of J.A.B. to CWI was financially supported through an NWO visitor travel grant. In addition, we acknowledge sponsoring by the National Computing Facilities Foundation (NCF) for the use of supercomputer facilities, with financial support of NWO. J.A.B. is supported by a grant from the National Science Foundation, DMS-1009959.

## Footnotes

One contribution of 13 to a Theme Issue ‘Mathematics applied to the climate system’.

- © 2013 The Author(s) Published by the Royal Society. All rights reserved.