## Abstract

We present a hierarchical Bayesian framework for the selection of force fields in molecular dynamics (MD) simulations. The framework associates the variability of the optimal parameters of the MD potentials under different environmental conditions with the corresponding variability in experimental data. The high computational cost associated with the hierarchical Bayesian framework is reduced by orders of magnitude through a parallelized Transitional Markov Chain Monte Carlo method combined with the Laplace Asymptotic Approximation. The suitability of the hierarchical approach is demonstrated by performing MD simulations with prescribed parameters to obtain data for transport coefficients under different conditions, which are then used to infer and evaluate the parameters of the MD model. We demonstrate the selection of MD models based on experimental data and verify that the hierarchical model can accurately quantify the uncertainty across experiments; improve the posterior probability density function estimation of the parameters, thus, improve predictions on future experiments; identify the most plausible force field to describe the underlying structure of a given dataset. The framework and associated software are applicable to a wide range of nanoscale simulations associated with experimental data with a hierarchical structure.

## 1. Introduction

The exploration of nanoscale phenomena requires a synergy of experiments and simulations. The study of water transport in carbon nanotube membranes is a striking example that helps showcase this need. The first experiments [1,2] reported flow rates that were 3–5 orders of magnitude larger than those estimated by continuum hydrodynamics and attracted significant interest from the community. Simulations in short carbon nanotubes (CNTs) [3,4] appear to confirm these fast flow rates, in particular when extrapolated to larger lengths used in the experiments, while conflicting with simulations in periodic domains [5]. More recent simulations [6] in CNT membranes of 2 μm, matching the thickness of the membranes used in the earlier experiments, confirmed the results of the periodic simulations. Recent experimental results [7,8] are also consistent with the results of these simulations. A recent review [9] shows also that earlier experimental results [10] differed with those reported in [2] by orders of magnitude while they were consistent with the results of the periodic simulations. These discrepancies showcase the challenges in performing experiments and simulations for nanoscale fluid flows and the need for synergetic studies. Novel experimental techniques [11,12] show the potential to detect molecular events inside CNTs and as such have the potential to reassess flow transport in CNTs. At the same time a number of combined experimental and simulation studies are beginning to emerge [13].

We argue that the Bayesian framework is a key methodology in bridging the experimental data and the molecular dynamics (MD) simulations [14,15]. The calibration of the force fields in MD simulations is a critical aspect of their predictive capabilities. A number of widely used force fields [16,17] have been developed over the years that have provided us with unique insight in processes ranging from protein folding and crystallization to the development of new materials [18].

The calibration of force fields in MD simulations often relies on experimental data that exhibit a special structure. This structure refers to measurements of the same physical quantity (such as viscosity, thermal conductivity, etc.), using different measurement techniques or under variable environmental conditions. One such example is the measurement of the thermal conductivity in nanofluidics [19] obtained from identical samples by 30 organizations worldwide, reporting a variance of less than 10% from the sample average. For some other quantities, the variance may be larger. Several factors can contribute to such discrepancies, such as the choice of the force field and its calibration, computational errors and experimental uncertainties or physical factors such as the hydrophilicity of the interior of nanochannels [20].

Data with such a hierarchical structure are observed in various fields, including social science studies, biochemical or material testing experiments [21]. Usually model regression techniques lump the data from different sources and perform statistical analysis on the pooled dataset [14,15]. Alternatively, one may group the data in different ways and perform in-group/cross-group statistical analysis [22]. Arguably these approaches do not capture the inherent structure of the data and may not be informative to models due to the variation of the type and level of uncertainties across the experiments. In turn, multilevel regression methods have been developed to address this type of problem [22].

Here, we propose to use a hierarchical Bayesian approach, to handle structured experimental data to calibrate and select force fields in MD simulations. This approach has been shown to provide a rigorous foundation for parameter inference and model selection [23,24] in structural dynamics [25–27]. In recent years, applications of the hierarchical Bayesian models have been expanded to a broad range of problems. Examples include the use of a hierarchical Bayesian model to study human decisions under risk in psychology [28,29], and dynamic causal modelling [30] as it is used in neuroimaging [31].

A major concern in using a hierarchical Bayesian analysis is the computational cost associated with evaluating the evidence that is used for model selection. Computing the evidence requires the evaluation of a number of nested multi-dimensional integrals, with the cost scaling as , where *N* is the presumed levels of hierarchy and *p* the number of parameters. Hence, for models with expensive function evaluations, such as in MD, the cost can be prohibitive. In this study, we obtain an efficient approximation of the evidence by combining the Laplace Asymptotic Approximation (LAA) method [32] and the Transitional Markov Chain Monte Carlo (TMCMC) algorithm [33] implemented within the *Π*4U parallel framework [34]. The proposed scheme reduces the computational cost by orders of magnitude in our MD applications. In this paper, we compare two hierarchical and two single-level Bayesian analyses for MD simulations and analyse their benefits and drawbacks. The analysis is performed using the diffusion property of argon as obtained from MD simulations with varying environmental parameters and experimental data for the viscosity of gaseous krypton.

The remainder of this paper begins with a summary of Bayesian inference and model selection for single-level and hierarchical models (§2). Then, the utility of hierarchical models for inferring MD model parameters from experimental measurements of transport coefficients is explored by using MD simulations under different simulation conditions and with varying model parameters to ‘measure’ transport coefficients, which are then used to infer the model MD model parameters. In this way, the approach can be evaluated with reference to the known parameters. The details of the MD simulations are described in §3, while the results from inferring the parameters are discussed in §4. The application of the same approach to data from physical experiments is discussed in §5, and finally concluding remarks are provided in §6.

## 2. Bayesian inference and model selection for hierarchical models

Let *f*(*x*,** θ**) be a given deterministic function of a physical system, where

*x*is the input variable and

**is the parameters of the function. The probabilistic model based on**

*θ**f*can be expressed as 2.1where the output

*y*is the sum of an uncertain error term

*ϵ*

_{y}and the function

*f*. A typical choice of probability density function (PDF) for

*ϵ*

_{y}is a Gaussian distribution

*N*(

*ϵ*

_{y}|0,

*σ*

_{y}) with zero mean and

*σ*

_{y}standard deviation. Given some observed dataset , one can infer the posterior PDF of

**after specifying a prior PDF of**

*θ***, which represents the knowledge about the model parameters before any data are observed.**

*θ*Bayesian models can employ several levels of probability distributions to express their parameters. For example, in the context of Gaussian distributions, samples can be drawn with a certain mean and variance but in turn the mean and variance can have probability distributions themselves, defined recursively. Here, we furthermore distinguish between *single-level* and *hierarchical models* depending on the assumed structure of the data (figure 1). In single-level models, all the data are pooled together and then a Bayesian analysis is performed to identify the model parameters. In hierarchical models, the data are first grouped in multiple subsets and the Bayesian analysis is performed in an intra- and intergroup fashion.

### (a) Single-level Bayesian model

We denote a single-level model as . The posterior PDF for the model parameters () is obtained using Bayes’ theorem:
2.2where is the likelihood, is the prior PDF and is the evidence that will be used for model selection if multiple models are available [23,24]. Note that the evidence is simply a normalizing constant for the posterior PDF that is calculated by
2.3Then, a robust Bayesian approach predicts *y* given any input *x* based on the posterior PDF:
2.4

In , each observation is assumed to be independent given the values of the model parameters. Hence for , this implies that .

### (b) Hierarchical Bayesian model

In a hierarchical model, the parameters ** θ** are characterized by hyperparameters

**that have their own probability distribution. In the current application, hierarchical models are used to quantify the uncertainty due to the variations across different experiments through the hyperparameters**

*ψ***, and all other uncertainties are accounted for through the error term**

*ψ**ϵ*

_{y}. In this case, the hierarchical model allows the structure of the data to be exploited, which leads to a different procedure for performing Bayesian inference.

We assume that the data are obtained from experiments, i.e. with . Hence, the total number of observations is . The hierarchical model () assumes that there are different values for the parameters *θ*_{i} appropriate for each experiment *D*_{i}, and the variability of the parameters among experiments is described by distributions controlled by the hyperparameters ** ψ** and their respective prior. Figure 1

*b*shows the mapping between the model parameters and the structure of the data as a Bayesian network.

Under this model, the posterior PDF of the parameters ** θ** used to predict

*y*in equation (2.4) is 2.5where is the prior model of

**. The posterior PDF of the hyperparameters**

*θ***() can also be calculated using Bayes’ theorem: 2.6Here, is the prior of**

*ψ***, is the evidence of this model . The likelihood of**

*ψ***() can be calculated from 2.7Here, the first line is deduced based on the assumed hierarchical model that datasets are independent when**

*ψ***is given, and the third line assumes that data points in a dataset**

*ψ**D*

_{i}are independent when

*θ*_{i}is given. Again, the evidence in equation (2.6) is a normalizing constant that can be calculated from 2.8Table 1 summarizes a comparison between the single-level and the hierarchical models.

### (c) Efficient approximate model selection

Predictions from multiple models, calibrated from a given dataset , can be made by

— summing up predictions from each model weighted with its posterior, i.e. hyper-robust prediction [35] or

— using the most probable model.

In either case, for each model () the evaluation of the posterior probability is needed. Using Bayes’ theorem 2.9Equal prior probability is often assigned for all models so that we can write , implying that model selection can be performed using the evidence of each model.

The calculation of the evidence requires integration over all model parameters, implying a high computational cost. This is a major challenge for Bayesian model selection especially when the function *f* is nonlinear with respect to the parameters ** θ**. The computational complexity increases for hierarchical models as they imply a nested integration (equations (2.7) and (2.8)). For complex problems, such as the MD simulations presented herein, there are no analytical expressions available. A Monte Carlo Simulation (MCS) is needed to estimate and for each MCS sample of

**, another MCS is needed to estimate for all . This nested MCS leads to an exponential increase of the number of total function evaluations. The computational cost is intractable even when the function**

*ψ**f*can be evaluated efficiently. For example, assuming

*f*can be evaluated in 1 s and 10

^{4}samples are adequate for a single integral estimate, then a nested integral requires 10

^{8}samples, thus over 3 years, to complete one evidence estimation. Moreover, the hyperparameter space is often in a much higher dimension introducing an additional exponential increase in sample size using MCS. The application of efficient sampling methods, such as MCMCs, can alleviate but not resolve the inherently high computational demand of hierarchical Bayesian methods.

In this paper, we use the LAA method to approximate the integrals for evaluating in equation (2.7) (see §2e for more details). Then, we use the TMCMC algorithm implemented in *Π*4U [34] to draw samples that approximate the posterior PDF of ** ψ** and estimate the evidence in equation (2.8). The combined method gives an efficient approximation to Bayesian model selection involving a hierarchical model. Note that we pick the TMCMC sampling method over other Markov Chain Monte Carlo (MCMC) methods because of three reasons: (i) it is an inherently parallel algorithm; (ii) it is based on the idea of simulated annealing, which makes it robust to different kind of non-unimodal distributions; and (iii) the evidence is obtained as a by-product of the algorithm. Ching & Chen [33] and Hadjidoukas

*et al.*[34] provide detail explanations to these points.

### (d) Approximate robust posterior prediction

In the following, we limit our attention to only hierarchical models and for simplicity of notation, the term is omitted in all the PDFs described below. The robust posterior prediction for a hierarchical model can be expressed as
2.10The PDF *p*(*y*|*x*,** ψ**) is estimated using LAA (denoted as ), and is approximated with TMCMC samples, i.e. , where

*ψ*^{(i)}is the

*i*th sample from TMCMC and

*N*is the total number of samples. The robust prediction is given by 2.11In many cases, one may have particular interest in the posterior mean and variance of the output, which can be then estimated as 2.12where

*E*

_{LAA}[

*y*

^{j}|

*x*,

*ψ*^{(i)}] is the LAA estimation for the

*j*th moment of

*y*given input

*x*and the

*i*th sample

*ψ*^{(i)}. The first and the second moments can be expressed as [36] 2.13

### (e) Laplace asymptotic approximation and robust prediction for Gaussian additive noise

In the case of a Gaussian additive noise *ϵ*_{y} with mean zero and standard deviation *σ*_{y}, we pick the prior model to be a Gaussian distribution with mean *μ*_{θ} and covariance matrix *Σ*_{θ} to obtain analytical expressions for LAA calculations. To estimate in equation (2.7) and *p*(*y*|*x*,** ψ**) in equation (2.10), the LAA assumes that Gaussian distribution is a good approximation to the posterior PDF of

*θ*_{i}given all other parameters (

*p*(

*θ*_{i}|,

**,**

*ψ**σ*

_{y})). Then, the LAA of the corresponding evidence

*p*(

*D*

_{i}|

**,**

*ψ**σ*

_{y}) is 2.14Here, denotes the optimal values of

*θ*_{i}that minimize

*L*(

**).**

*θ*We use the Matlab function *fmincon* with an analytical gradient and Hessian function of *L*
2.15and
2.16where and the gradient ∇*f* and Hessian of the deterministic function *f* can be found using the Matlab symbolic toolbox functions—*gradient* and *hessian*.

For convenience, we augment the hyperparameter space ** ψ** to include also the parameter

*σ*

_{y}and we draw samples in the augmented parameter space using TMCMC. However, only the likelihood function depends on

*σ*

_{y}, but not the prior model . Hence, there is a slight change to the expressions for the robust prediction (cf. equation (2.10)): 2.17Here,

*p*(

*y*|

*x*,

**,**

*ψ**σ*

_{y}) is estimated using LAA for each TMCMC sample (equation (2.14) with (

*y*,

*x*) substituting

*D*

_{i}).

## 3. Application to molecular dynamics models

Over the last 50 years, MD has been instrumental to investigations of atomistic phenomena in areas ranging from materials science to medicine. MD simulations integrate Newton’s equation of motion for interacting atoms using various empirical force fields that are often calibrated using experimental data [37]. The quality and predictive capabilities of MD simulations hinge on the formulation and choice of these force fields [38,39]. In MD simulations, statistical estimates for quantities of interest are obtained either by sampling large systems or by extended simulations of smaller systems with various thermodynamic control algorithms [40]. In addition to the uncertainties introduced by the force fields [15], algorithmic software and hardware uncertainties can give rise to variations in the values of predicted quantities. If one simply performs analysis by grouping all data together in one large dataset irrespective of the code or the simulation set-up used, the resulting ‘cross-experiment’ (cross-simulation) uncertainty will mix with the other uncertainties and affect the quality of the posterior PDF used for prediction or model selection. In the formulation of, e.g. Cailliez & Pernot [14], the cross-laboratory uncertainty is essentially given to be estimated as an additional Gaussian component of the prediction error equation. In this study, we show that hierarchical Bayesian modelling provides a rational treatment of the cross-simulation uncertainty. Also, we compare the results of single-level models and hierarchical models to verify the benefits of using the hierarchical one.

Many different transport phenomena (e.g. diffusion, thermal conductivity, electrical conductivity, etc.) of non-zero mass particles can be described by kinetic theory, in which a phase space distribution function is evolved in time. Calculating the correct evolution hinges on having the right collision operator that models how small-distance and small-time interactions between particles change the distribution. MD simulations provide an ideal sandbox for different classical kinetic theories testing, as all inter-particle interactions are calculated and readily provides access to all necessary collision properties. In this section, we provide an example of assimilating MD data into kinetic theory correlations produced by different hardware platforms, number of atoms and force-fields. All of the simulated systems provide measurements of the diffusion coefficient of argon at eight specific thermodynamic conditions—pressure and temperature (*P*,*T*).

### (a) Molecular dynamics simulation set-up

We perform MD simulations of argon, using the classical Lennard–Jones (LJ) potential to describe the interaction among its atoms
3.1where *r*_{ij}=∥**r**_{ij}∥, **r**_{ij}=**r**_{j}−**r**_{i} and **r**_{i} and **r**_{j} are the positions of particles *i* and *j*, respectively. The parameters *ϵ*_{LJ}, *σ*_{LJ} are key components of a chosen force field. In MD simulations, a cut-off radius (*r*_{c}) is introduced when computing pairwise interactions, to reduce the computational cost. The respective force shifted LJ potential (*ϕ*(*r*_{ij})) can be expressed as
3.2where *ϕ*_{LJ}′(*r*)=d*ϕ*_{LJ}(*r*)/d*r*.

The parameters employed for the MD simulations can be summarized in table 2. We report the self-diffusion coefficient of gaseous argon as measured from these simulations in figure 2.

Each simulation consisted of two phases: the equilibration of the system and the production run used to obtain all the ensemble averages. During the equilibration phase, each simulation was run for 30 ns with a time-step Δ*t*=1 fs using a random velocity rescaling algorithm [41] as a thermostat and a Berendsen barostat [42] to obtain the desired (*P*,*T*) values.

Various thermostats were used during the production phase. The Nose–Hoover thermostat [43] with a coupling constant of 2.0 ps and the Parrinello–Rahman barostat [44] with a coupling constant of 2.0 ps. The velocity rescaling algorithm [41] was used with a constant of 2.0 ps. The Andersen-massive thermostat [45] was used with a constant of 1.5 ps. The MTTK barostat was used with a constant of 2.0 ps [46]. The production runs were performed for 30 ns, using a time-step of Δ*t*_{prod}=1 fs. Normal periodic boundary conditions were used in all simulations and self-diffusion coefficients were calculated via the mean-squared displacement of the trajectories and the Einstein–Stokes equation.

Equation (3.3) is used to obtain the diffusion coefficient *D*_{11} given various temperatures *T* and pressures *P* under different simulation conditions, in order to infer the model parameters *ϵ*_{LJ} and *σ*_{LJ}. In the case of argon, where interatomic interactions are described by simple LJ potentials (equation (3.1)), we can use analytical expressions to calculate the self-diffusion coefficient *D*_{11} [47]. This semi-empirical correlation is based on Chapman–Enskog gas theory and employs expansions of the collision integrals [48]:
3.3where *m* denotes the molar mass of argon, and *k* denotes the Boltzmann factor. The collision integrals *Ω* are specified as follows, given the non-dimensional temperature *T**=*kT*/*ϵ*_{LJ}:
where (*C**_{6}) is (*C**_{6})=*C*_{6}/*ϵ*_{LJ}*σ*^{6}_{LJ} and *C*_{6}=0.6041995.

### (b) Comparison of probabilistic models

In this study, we compare four probabilistic models: two single-level and two hierarchical models. We emphasize that the pre-processing (e.g. grouping) of the data are considered a crucial part of the hierarchical Bayesian modelling. Furthermore, we reiterate that the data use for Bayesian inference in this study are obtained from MD simulations using *a priori* specified parameters. Hence we process 80 data points resulting from 10 MD simulation performed at eight different values of (*T*,*P*). These data are then grouped into different datasets that are used to infer the model parameters of the analytical expressions given above. A particular aspect of using data from these simulations is that we can then compare the inferred parameters with the respective ones that have been used in the simulations. In summary, we consider the following models:

: A single-level model that groups the 80 data points into one dataset without distinguishing between

*T*,*P*and other simulation set-ups.: A single-level model with a reduced dataset. The reduced dataset has eight points, defined by the weighted mean and standard deviation of the values obtained from the 10 different simulations for the same

*T*and*P*.: A hierarchical model that groups data obtained from the same simulation set-ups into individual datasets (a total of 10 datasets, each with eight data points for different

*T*and*P*).: A hierarchical model that groups data with the same

*T*and*P*into individual datasets (a total of eight datasets, each with 10 data points from each simulation set-ups).

represents a typical treatment of a structured dataset where the hierarchical structure is ignored [15] and the uncertainty variable *ϵ*_{y} is expected to capture all uncertainties in the inference process. represents another treatment to a structured dataset where the uncertainty across different datasets is represented by the weighted standard deviation of the data. Hence, the uncertainty variable *ϵ*_{y} accounts for all other uncertainties. represents the appropriate treatment to a structured dataset under the Bayesian approach, where the knowledge about the structure of the datasets is used. Hence, the hyperparameters are expected to account for the uncertainty across different datasets, and *ϵ*_{y} will account for all other uncertainties. represents a similar Bayesian treatment to a structured dataset as in , but the structure of the datasets is incorrectly identified. Hence, we do not expect the same results for *ϵ*_{y} and the hyperparameters as in .

### (c) Algorithmic details

*Hyperparameters*: A bivariate Gaussian prior is chosen for the model parameters *ϵ*_{LJ} and *σ*_{LJ} in and along with five hyperparameters—*μ*_{ϵ}, *σ*_{ϵ}, *μ*_{σ}, *σ*_{σ} and *ρ*, which are the mean and standard deviation of *ϵ*_{LJ} and *σ*_{LJ} and the correlation coefficient between *ϵ*_{LJ} and *σ*_{LJ}, respectively. Hence, *μ*_{θ} and *Σ*_{θ} in §2e are
3.4

*Likelihood*: This study employs an uncertainty variable *ϵ*_{y} with a Gaussian distribution of zero mean and *σ*_{y} standard deviation, with *σ*_{y} one of the parameters to be inferred. The function *f* (see equation (2.1)) is computed from equation (3.3), with inputs ** x**={

*T*,

*P*} (temperature and pressure) and model parameters

**={**

*θ**ϵ*

_{LJ},

*σ*

_{LJ}}. The likelihood function is a Gaussian distribution with mean

*f*and standard deviation

*σ*

_{y}, for all data points (

*y*

_{i},

*x*

_{i})=(

*y*

_{i},{

*T*

_{i},

*P*

_{i}}) in single-level and hierarchical models (

*k*=

*S*or

*H*).

*Prior*: Uniform priors are chosen for the model parameters ** θ** in and , the hyperparameters

**in and and the parameter**

*ψ**σ*

_{y}. Table 3 shows the values of the upper and lower bounds of the uniform priors based on the physical constraints of the chosen MD model.

*Posterior estimate*: For the single-level models and , we infer three parameters (*ϵ*_{LJ}, *σ*_{LJ} and *σ*_{y}). We use MCS to estimate the evidence by using samples of the uniform prior obtained from a uniform grid, which leads to an Importance Sampling approximation (with proposal PDF that equals to the prior PDF) of the posterior PDF . Five hundred grid points are used for *ϵ*_{LJ} and *σ*_{LJ}, and 50 grid points are used for *σ*_{y} for the uniform prior range (table 3), with a total of two million samples for the three-dimensional parameter space. The importance sampling approximation of the posterior with *N* samples from the prior PDF can be written as
3.5For the hierarchical models and , the posterior distributions for all parameters (** θ**,

**and**

*ψ**σ*

_{y}) are computed using the algorithm described in §2c.

*TMCMC parameters*: TMCMC generates posterior PDF samples from the prior PDF based on multiple stages of MCMC sampling that follows an annealing schedule for the likelihood function [33]. In each stage, MCMC is performed based on a Gaussian local random walk in order to propose a new sample. The number of samples per stage, which is also the final number of posterior samples, is 20 000. TMCMC has two parameters to tune: *τ*_{CV} controls the annealing speed and *β*^{2}_{COV} controls size of the Gaussian proposal. Suggested by Ching & Chen [33], we choose *τ*_{CV}=100% to achieve optimal convergence from annealing between the prior and the posterior. We choose *β*^{2}_{COV}=0.09, which is slightly higher than the suggested value by Ching & Chen [33], so that the random walk samples explore a larger area in the hyperparameter space.

## 4. Bayesian inference using data obtained from simulations

We run MD simulations to obtain diffusion coefficients of gaseous argon under different environmental settings and prescribed *ϵ*_{LJ} and *σ*_{LJ} values. We compare the results of Bayesian inference using , , and with our simulation set-ups to verify the benefits of using a hierarchical model over a single-level model.

### (a) Verification of the Laplace asymptotic approximation

We observe that the posterior PDF *p*(*θ*_{i}|*D*_{i},** ψ**,

*σ*

_{y}) for all datasets

*D*

_{i}is close to a Gaussian distribution in the problems studied herein. Figure 3 shows an instance of the likelihood functions for one dataset. Similar shapes are obtained for all other datasets. Figure 4 shows the prior

*p*(

*θ*_{i}|

**) and posterior PDF**

*ψ**p*(

*θ*_{i}|

*D*

_{i},

**,**

*ψ**σ*

_{y}) of three random samples of the hyperparameters

**and**

*ψ**σ*

_{y}based on the likelihood in figure 3. The resulting posterior PDF can be well approximated by a single bivariate Gaussian distribution. To further verify the quality of LAA in our studies, we compare the estimation of the evidence

*p*(

*D*

_{i}|

**,**

*ψ**σ*

_{y}) using LAA with the estimate using MCS. Figure 5 illustrates that the MCS and LAA estimates of the evidence are consistent using two random samples of

**and**

*ψ**σ*

_{y}.

### (b) Verification of Transitional Markov Chain Monte Carlo

TMCMC is a multi-stage sampling scheme where samples from the intermediate stages are used to estimate the evidence, but are not used to approximate the posterior PDF. The scheme achieves a high quality of PDF estimation in the final stage. To verify the performance of TMCMC, the resulting posterior PDF is compared with the posterior estimated by direct MCS with 5 million samples on the six dimension parameter space (five hyperparameters and the likelihood uncertainty parameter)—*μ*_{ϵ}, *σ*_{ϵ}, *μ*_{σ}, *σ*_{σ}, *ρ* and *σ*_{y}. Figures 6 and 7 show the resulting samples in the six-dimensional hyperparameter space using TMCMC and MCS, respectively. We observe a significantly better convergence for the TMCMC. The total number of stages equals 8 with 20 000 samples in each stage; thus, a total of 0.16 million samples are generated using TMCMC. Hence, with an order of magnitude less samples, TMCMC achieves a better convergence than MCS in this six-dimensional parameter space.

### (c) Model comparison

Figures 8 and 9 show the marginalized posterior PDF of the model parameters for *k*=*S*1,*S*2,*H*1,*H*2. One can observe that, as expected, the hierarchical model provides the best posterior that represents the data from the 10 simulations. While the other three models show a strong correlation between *ϵ*_{LJ} and *σ*_{LJ}, gives a posterior PDF distributed as a bivariate Gaussian that accurately covers the parameter values from the 10 datasets (figures 8 and 9). This indicates that a hierarchical model can separate the cross-simulation uncertainty from the other types of uncertainties.

How can we verify whether the hierarchical model can indeed separate the uncertainties correctly? Figure 10 shows the marginalized posterior PDF of *σ*_{y} for all four models. Recall that is designed such that it excludes the cross-simulation uncertainty from *σ*_{y}, while will include all uncertainties to *σ*_{y}. One can see that *σ*_{y} posterior of agrees with , while the *σ*_{y} posterior of agrees with that of . This implies that a hierarchical model can, indeed, correctly separately out the cross-simulation uncertainty. On the other hand, a hierarchical model that does not reflect the structure of the data can be misleading.

Figures 11 and 12 show the marginalized posterior PDF of the five hyperparameters. We observe that both and give a similar optimal value of the mean for the model parameters, but the uncertainty is significantly larger for . However, both *σ*_{ϵ} and *σ*_{σ} are approaching zero in , but their values are higher in . These two parameters are representing the effect of cross-simulation uncertainty on the two model parameters. The correct hierarchical model can identify the cross-simulation uncertainty through *σ*_{ϵ} and *σ*_{σ} directly, while the incorrect hierarchical model cannot do so. Instead, leaves high uncertainty to the mean value of the model parameters. The marginal posterior of *ρ* is more peak around −1 in than in as the model parameters posterior PDF clearly shows a stronger correlation in figure 9.

Table 4 shows the log evidence and the posterior probability of each model. The hierarchical model has a selection probability ∼1 over the other models, as one may expect. and are ranked next. is slightly better (in log-scale) than because the more complex model did not improve the interpretation of the data much as compared with , and a simpler model is always preferred in this case due to Ockham’s Razor [23,24]. has a significantly lower probability even in log-scale because the dataset used is significantly smaller than the other models.

Finally, figure 13 shows the effect on posterior PDF by Bayesian robust prediction using the hierarchical () and single-level models (), respectively. A lower prediction uncertainty is observed from the single-level model because the very narrow and elongated posterior captures the pairs of highly correlated *ϵ*_{LJ} and *σ*_{LJ} values that result in similar predictions. On the other hand, the single-mode Gaussian posterior from the hierarchical model leads to a higher prediction uncertainty that scales with the output values. The significantly different posterior prediction from the two models demonstrate the importance of finding the right probabilistic model, even when using a consistent deterministic model. Our results indicate that using the single-level model will lead to an underestimated prediction uncertainty, although the mean values are consistent.

## 5. Results for experimental data

We now extend our formulation to experimental data. We note that unlike the case of using data obtained from MD simulations, here we cannot *a posteriori* verify the correct distribution of parameters. The same methodology is applied to experimental data of krypton instead of the MD simulation data of argon. Data on viscosity are used instead of diffusion based on Kestin *et al.* [47]. Again, we compare the single-level model (similar to ) with the hierarchical model (similar to ) using the same datasets. As shown in figure 14, there are nine sets of experimental data coming from different sources covering different temperature ranges, and different experimental set-ups. In the case of viscosity *η*, the deterministic function *f* (equation (2.1)) is now expressed as [14]
5.1The parameter *T**=*kT*/*ϵ*_{LJ}, *m* denotes the molar mass of krypton, and *k* denotes the Boltzmann factor. The other coefficients are *A*=1.16145, *B*=0.14874, *C*=0.52487, *D*=0.77320, *E*=2.16178 and *F*=2.43787. The model parameters to be inferred in this case are also *ϵ*_{LJ} and *σ*_{LJ}.

Figure 15 shows the likelihood function *p*(*D*_{i}|** θ**,

*σ*

_{y}) of each experimental dataset

*D*

_{i}, which is proportional to the posterior PDF of the parameters when a uniform prior is assumed. Although the nine datasets have a visually consistent trend, the inferred parameter values in the given model for each dataset are actually very different and highly uncertain. In particular, experiments 1–3 show very inconsistent inferred parameter values when compared to the other six experiments (the plotted ranges of

*ϵ*

_{LJ}and

*σ*

_{LJ}for experiment 3 is different from the one for the other experiments). This is because the viscosity value is very sensitive to the parameters

*ϵ*

_{LJ}and

*σ*

_{LJ}. Hence, we use the hierarchical model to infer the uncertainty of the model parameters due to an inaccurate model. If one performs a Bayesian analysis based on a dataset that groups all the nine experimental datasets together, a high model uncertainty value

*σ*

_{y}is expected. In order to balance the error to each data point, the inferred most probable parameter values are expected to be highly affected by the ‘outliers’ (data from experiments 1 to 3). On the contrary, we expect the hierarchical model to capture the cross-experiment uncertainty in the hyperparameters, thus providing a low model uncertainty value. The uncertainty of

*ϵ*

_{LJ}and

*σ*

_{LJ}is expected to be high because of the large difference in the inferred parameter values from some of the experiments. However, the most probable value shall be less affected by the ‘outliers’.

The single-level model uses the TMCMC method from Angelikopoulos *et al.* [15], and the hierarchical model uses the method introduced in this paper. Figures 16 and 17 show the resulting joint posterior PDF of the model parameters *ϵ*_{LJ} and *σ*_{LJ} for both models. In contrast to the single-level model, the hierarchical model results in that gives a smaller model uncertainty value *σ*_{y}, whereas a higher parameter uncertainty is observed from . We observe that the Bayesian model selection prefers the hierarchical model, as shown in table 5.

## 6. Conclusion

We present an efficient computational framework to perform for the first time a full Bayesian analysis and model selection between different hierarchical and single-level probabilistic models. We investigate the benefits of a hierarchical Bayesian model over a single-level model using MD simulations. To tackle the computational challenge that typically arises from a full Bayesian analysis on a hierarchical model, we propose a novel efficient approximation scheme that combines LAA and parallelized TMCMC to approximate the posterior PDF and model selection. Using our results from both simulation and experimental data, we summarize as the benefit of the hierarchical model in three aspects:

(i) Correct separation of the cross-experiment uncertainty from other uncertainties, which helps analysing where the most contribution of the uncertainties comes from.

(ii) Improvement of the posterior PDF of the parameters, which helps improving the accuracy of predictions of future experiments.

(iii) Identification of the most plausible model for a given data from a set of models that can contain multiple hierarchical and single-level models, which helps to discover the underlying data structure if it is not originally known.

In the MD models, the parameter uncertainty has a larger influence on the prediction uncertainty than the model uncertainty *σ*_{y}. The hierarchical model is able to identify the cross-experiment uncertainty, and, thus, gives a higher parameter uncertainty, but a lower model uncertainty. This results in a higher prediction uncertainty than the one obtained from the single-level model, which captures the data more accurately.

It is important to note that the knowledge of the data structure affects the implementation of a hierarchical model. Different assumptions on the data structure lead to different implementations, thus, different hierarchical models. If none of the hierarchical models capture the actual data structure, the models may result in misleading conclusions. It is also important to check if the pair of likelihood and prior model can be well approximated by LAA given the dataset. If a multi-modal posterior is expected, the variational Bayesian EM algorithm can be a good alternative to the LAA method.

## Data accessibility

Experimental data are obtained from the literature cited in this paper. Readers may contact the authors for the synthetic data from our MD simulations.

## Authors' contributions

S.W. and P.A. designed the methods and analysed the data. C.P., R.M. and P.K. designed the study and participated in the analysis of the data. S.W. and P.K. drafted the manuscript and all authors edited and completed the final manuscript.

## Competing interests

The authors declare that they have no competing interests.

## Funding

P.K. acknowledges support from the European Research Council (Advanced Investigator Award no. 341117).

## Acknowledgements

The authors acknowledge the use of resources at the Swiss National Supercomputing Center CSCS under project no. s448.

## Footnotes

One contribution of 11 to a Theo Murphy meeting issue ‘Nanostructured carbon membranes for breakthrough filtration applications: advancing the science, engineering and design’.

- Accepted September 18, 2015.

- © 2015 The Author(s)