## Abstract

Bayesian analysis for Markov jump processes (MJPs) is a non-trivial and challenging problem. Although exact inference is theoretically possible, it is computationally demanding, and so its applicability is limited to a small class of problems. In this paper, we describe the application of Riemann manifold Markov chain Monte Carlo (MCMC) methods using an approximation to the likelihood of the MJP that is valid when the system modelled is near its thermodynamic limit. The proposed approach is both statistically and computationally efficient, and the convergence rate and mixing of the chains allow for fast MCMC inference. The methodology is evaluated using numerical simulations on two problems from chemical kinetics and one from systems biology.

## 1. Introduction

Markov jump processes (MJPs) provide a formal description of the underlying stochastic behaviour of many physical systems and as such have wide applicability across scientific fields. In chemistry and biology, for example, they are applied to model reactions between chemical species [1,2]. In ecology and epidemiology, they are used to model populations of interacting species in the environment [3], whereas in telecommunications they describe the population of information packets over a network [4]. In order to introduce some terminology and notation, we give a more concrete example from chemical kinetics. The modelling methodology is similar in other applications, although different assumptions are needed for calculating reaction rates, depending on the system being modelled. Consider a model for the population of molecules of two interacting chemical species, *X*_{A} and *X*_{B}, in a solution of volume *Ω*, where *X*_{A} and *X*_{B} denote the number of molecules of chemicals *A* and *B*, respectively. The interactions between the species are modelled using *reactions*, specified using the following notation:

$$R_1:\quad A + B \xrightarrow{\;c_1\;} 2A.$$

On the left-hand side appear the *reactants* and on the right-hand side the *products* of the reaction, while over the arrow appears the *rate constant* *c*_{1}, which is the probability that a randomly chosen pair of *A* and *B* will react according to *R*_{1}. This reaction, for example, specifies that a pair of molecules *A* and *B* react with probability *c*_{1} to produce a new molecule of *A*. For calculating the probability of a reaction taking place given the current state of the system, i.e. the number of molecules of chemicals *A* and *B*, several system-dependent assumptions must be made. For chemical reactions, it is assumed that in a well-stirred solution the probability of a reaction is proportional to the populations of its reactants [5].
For *R*_{1}, we can write the reaction rate as *f*_{1}(*X*_{A},*X*_{B},*c*_{1})=*c*_{1}*Ω*^{−1}*X*_{A}*X*_{B}. Following the same reasoning, additional reactions and species can be added in order to construct large and complex reaction networks. Together, the state of the system (*X*_{A},*X*_{B}), the set of reactions and the reaction rates specify an MJP in which the occurrences of reactions are modelled as a Poisson process.

In this particular example, the probability of the reaction has a simple form and is linear with respect to the populations. However, in many real applications, this is often not the case while the *rate constant*, *c*_{1}, is unknown. Given a fully specified MJP, i.e. an MJP with known parameters, rate constants and initial conditions, it is possible to perform exact simulation and obtain samples from the underlying stochastic process using the stochastic simulation algorithm (SSA) described by Gillespie [1]. In many problems, there are system parameters which are not specified or are unknown whereas it is relatively easy to collect partial observations of the physical process at discrete time points. The interest is therefore to obtain statistical estimates of the unknown parameters using the available data.

As a consequence of the Markov property, MJPs satisfy the Chapman–Kolmogorov equation from which we can directly obtain the forward master equation describing the evolution of the system’s state probability over any time interval. However, even for small and simple systems the master equation is intractable and it is not straightforward as to how partially and discretely observed data from the physical process should be incorporated in order to perform inference over unknown system parameters. Recently, Boys *et al*. [6] have shown that it is possible to construct a Markov chain whose stationary probability distribution is the posterior of the unknown parameters without resorting to any approximations of the original MJP. Their method, however, is computationally expensive while the strong correlation between posterior samples means that a large number of Markov chain Monte Carlo (MCMC) iterations are required in order to obtain Monte Carlo estimates with sufficient accuracy.

An alternative is to consider suitable approximations of the likelihood function. The system size expansion described by Van Kampen [7, ch. 10] provides a systematic method for obtaining approximations of a physical process approaching its thermodynamic limit. The simplest approximation yields the macroscopic rate equation (MRE), which describes the thermodynamic limit of the system with a set of ordinary differential equations, neglecting any random fluctuations. Although the MRE has been studied extensively in the literature [8,9], it is not applicable to problems where information about the noise and the random fluctuations is necessary or where the system is far from its thermodynamic limit. The diffusion approximation [10,11] describes the physical process by a set of nonlinear stochastic differential equations with state-dependent Brownian motion. Similar to the master equation, however, the likelihood is intractable. In the study by Roberts & Stramer [12], a transformation is applied such that the Brownian increments are independent of the system state and thus the system can easily be simulated. However, this limits the applicability of the methodology to systems where such a transformation is possible. A more general methodology is presented by Golightly & Wilkinson [13], who used an approximation of the likelihood instead. Finally, a less studied approach for the purpose of inference is the linear noise approximation (LNA), which conveniently decouples the nonlinearity in the diffusion approximation into a nonlinear set of ordinary differential equations, the MRE, and a set of linear stochastic differential equations for the random fluctuations around the deterministic state [7, ch. 10; 14]. Recently, Komorowski *et al*. [15] have shown that the simple analytic form of the approximate likelihood obtained by the LNA simplifies MCMC inference and can be applied to problems with a relatively small number of molecules.

A commonly employed algorithm for MCMC inference is the Metropolis–Hastings algorithm [16], which relies on random perturbations around the current state using a local proposal mechanism. It should be noted here that the state of the Markov chain is different from the state of the stochastic process. In the MCMC context, state refers to the current values of the unknown system parameters whereas the state of the system refers to the value of the stochastic process at a given time. We will use the term state interchangeably for the rest of this paper and its meaning will be clear from the context. Owing to the local nature of the proposal mechanism used in the Metropolis–Hastings algorithm, samples from the posterior exhibit strong random walk behaviour and auto-correlation. Tuning the proposal mechanisms to achieve good mixing and fast convergence is far from straightforward even though some theoretical guidance is provided [17]. MCMC methods, such as the Metropolis-adjusted Langevin algorithm (MALA) [18] and the Hamiltonian Monte Carlo (HMC) [19], have also been studied in the literature and have been shown to be more efficient than random walk Metropolis–Hastings in terms of effective sample size (ESS) and convergence rates on several problems. However, the HMC and MALA also require extensive tuning of the proposal mechanisms [20,21]. For MJPs, the problem is compounded further since system parameters, such as probability rate constants of chemical reactions, are often highly correlated and their values may differ by orders of magnitude. The resulting posterior distributions have long narrow ‘valleys’ preventing any local proposal mechanism from proposing large moves about the parameter space.

More recently, Girolami & Calderhead [22] proposed exploitation of the underlying Riemann manifold of probability density functions when defining MCMC methods, thus exploiting the intrinsic geometry of statistical models and thereby providing a principled framework and a systematic approach to the proposal design process. These algorithms rely on the gradient and Fisher information matrix of the likelihood function to automatically tune the proposal mechanism such that large moves on the parameter space are possible and therefore improve convergence and mixing of the chains. In the study of Calderhead & Girolami [9], this approach has successfully been applied for the MRE approximation of chemical reaction networks. For the LNA, the Fisher information and the gradient of the likelihood function can easily be obtained [2]. In this paper, we study the application of the Riemann manifold MCMC methods for the LNA and compare the mixing efficiency and computational cost with the commonly used Metropolis–Hastings algorithm. Moreover, we study how the Markov chains and the resulting Monte Carlo estimates behave for systems which are far from their thermodynamic limit. The aim is to improve the efficiency of MCMC inference for MJPs in order to allow for larger and more complex models frequently encountered in biology and chemistry to be studied in more detail.

In §2, we give a brief overview of MJPs. The diffusion and linear noise approximations are presented in §3. We then discuss MCMC and the Riemann manifold algorithms in §4, and implementation details in §5. Numerical simulations are presented in §6, while §7 concludes the paper.

## 2. Markov jump processes

A *D*-dimensional stochastic process is a family of *D* random variables ** X**(*t*)=[*X*_{1}(*t*),…,*X*_{D}(*t*)]^{T} indexed by a continuous time variable *t* with initial conditions ** X**(*t*_{0})=*x*_{t0}. An MJP is a stochastic process satisfying the Markov property such that

$$p\big(\mathbf{X}(t_i)\,|\,\mathbf{X}(t_{i-1}),\ldots,\mathbf{X}(t_1)\big) = p\big(\mathbf{X}(t_i)\,|\,\mathbf{X}(t_{i-1})\big),$$

where the dependence on any parameters or other quantities has been suppressed. That is, the conditional probability of the system state at time *t*_{i} only depends on the state of the system at the previous time *t*_{i−1}. An MJP is characterized by a finite number, *M*, of state transitions with rates *f*_{j}(*x*,*θ*,*t*) and state change vectors *s*_{j}=(*s*_{1,j},…,*s*_{D,j})^{T} with *j*∈[1,…,*M*]. *f*_{j}(*x*,*θ*,*t*) d*t* is the probability, given the state of the system at time *t*, ** X**(*t*)=*x*, of a jump to a new state *x*+*s*_{j} in the infinitesimal time interval [*t*,*t*+d*t*]. For the problems we consider in this paper, the transition rates depend not only on the current state and time but also on unknown rate parameters *θ*. From the Markov property, we can directly obtain the conditional probability of the system being in state *x* at time *t* given initial conditions, which is characterized by the master equation

$$\frac{\mathrm{d}p(x,t)}{\mathrm{d}t} = \sum_{j=1}^{M}\big[f_j(x - s_j, \theta, t)\,p(x - s_j, t) - f_j(x, \theta, t)\,p(x, t)\big]. \tag{2.1}$$

Equation (2.1) in general form is intractable especially when the transition rate functions *f*_{j}(⋅) are nonlinear with respect to the system state. Numerical simulation is also prohibitively expensive as the computational cost grows exponentially with *D* [23].

However, given initial conditions ** X**(*t*_{0})=*x*_{t0} and values for the unknown rate parameters *θ*, we can simulate realizations of the MJP by first noting that the time *τ* to the next state transition is exponentially distributed with rate

$$\lambda = \sum_{j=1}^{M} f_j(x_{t_0}, \theta, t),$$

and the new state ** X**(*t*_{0}+*τ*) will be *x*_{t0}+*s*_{j} with probability *f*_{j}(*x*_{t0},*θ*,*t*)/*λ*. This scheme results in an iterative algorithm from which we can forward simulate a complete trajectory for the stochastic process ** X**(*t*), known as the SSA [1] in the chemical kinetics literature.
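As an illustration of this scheme, the following is a minimal sketch of the direct-method SSA. The two-reaction system used here is hypothetical (the reaction *R*_{1} of §1 together with a made-up degradation reaction, with arbitrarily chosen rate constants and initial state) and is not one of the examples analysed in the paper:

```python
import numpy as np

def gillespie_ssa(x0, S, rates, theta, t_max, rng):
    """Direct-method SSA: simulate one trajectory of an MJP.

    x0    : initial state vector (molecule counts per species)
    S     : D x M matrix whose columns are the state change vectors s_j
    rates : function (x, theta, t) -> length-M vector of transition rates f_j
    """
    t, x = 0.0, np.array(x0, dtype=float)
    times, states = [t], [x.copy()]
    while t < t_max:
        f = rates(x, theta, t)
        lam = f.sum()                      # total rate lambda
        if lam <= 0.0:                     # no further transition possible
            break
        t += rng.exponential(1.0 / lam)    # waiting time ~ Exp(lambda)
        if t > t_max:
            break
        j = rng.choice(len(f), p=f / lam)  # transition type ~ f_j / lambda
        x += S[:, j]
        times.append(t)
        states.append(x.copy())
    return np.array(times), np.array(states)

# Hypothetical system for illustration only:
#   R1: A + B -> 2A with rate c1/Omega * X_A * X_B,  R2: A -> 0 with rate c2 * X_A
Omega = 100.0
S = np.array([[ 1, -1],
              [-1,  0]])                   # columns are s_1 and s_2
rates = lambda x, th, t: np.array([th[0] / Omega * x[0] * x[1],
                                   th[1] * x[0]])
rng = np.random.default_rng(0)
times, states = gillespie_ssa([50, 80], S, rates, theta=[1.0, 0.1],
                              t_max=5.0, rng=rng)
```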

From the specification of the MJP, we can also write the likelihood function with respect to the parameters *θ* for a completely observed process ** X**(*t*) on the time interval [0,*T*] as

$$p(\mathbf{X}\,|\,\theta) = \prod_{i=1}^{N} f_{k_i}(x_{i-1}, \theta, \tau_i)\; \exp\!\left(-\int_{0}^{T} \sum_{j=1}^{M} f_j(x_t, \theta, t)\,\mathrm{d}t\right),$$

where *N* is the number of transitions that occurred in the time interval [0,*T*], *k*_{i}∈[1,…,*M*] is the type of the *i*th transition and *τ*_{i} and *x*_{i} are the time and state at the *i*th transition, respectively. Note that the likelihood function corresponds to the generative process described by the SSA. By specifying a suitable prior and applying Bayes' theorem, we can obtain the posterior distribution *p*(*θ*|** X**), which we can use for inference over the unknown parameters *θ* [6].

In many problems of interest, however, we cannot observe the times and types of all transitions in a given time interval. Rather, we can only observe the state of the system ** X**(*t*_{i})=*x*_{i} at discrete time points *t*_{i}∈[0,*T*]. The solution proposed by Boys *et al.* [6] is to treat the trajectories between observed time points, as well as the number, times and types of transitions, as latent variables. This leads to a data augmentation framework [24] in which a Markov chain is constructed to sample from the joint posterior of the parameters and the latent variables. At each MCMC iteration, the complete trajectory of the MJP has to be simulated conditional on the observed data and the parameters, which, for some systems, can be computationally demanding. Furthermore, because of the high-dimensional nature of the simulated trajectory and its strong dependence on the system parameters and observed data, the MCMC algorithm has very poor convergence and mixing properties, requiring many samples from the posterior in order to obtain sufficiently accurate Monte Carlo estimates. Finally, a further complication is that the number of transitions between two observed time points is also unknown and has to be sampled using a reversible-jump-type algorithm [25]. For more details, see Boys *et al.* [6]. The resulting algorithm is therefore computationally demanding, limiting its applicability to small and relatively simple MJPs. A more efficient version of the algorithm is also suggested in Boys *et al.* [6], where, instead of simulating the trajectories between observations using the exact MJP, an approximate proposal distribution is employed to sample trajectories which are accepted or rejected using the Metropolis–Hastings ratio.

## 3. Diffusion and linear noise approximations

An alternative to working directly with the master equation and the original MJP is to consider approximations which provide for efficient simulation and possibly an easy to evaluate likelihood function for discretely observed data. Although the resulting posterior will also be approximate in nature, it can be sufficient for inferential purposes given that the system under consideration is near its thermodynamic limit. Here, we describe the diffusion approximation and from that how we can arrive at the LNA. Our presentation is rather informal and follows Gillespie and colleagues [1,14]. For a more formal derivation, the reader should refer to Van Kampen [7] and Gillespie [11]. The requirement for these approximations to be consistent is the existence of a proportionality constant *Ω*, which governs the size of the fluctuations such that for large *Ω* the jumps will be relatively small, and as both *Ω* and ** x** tend to infinity approaching the system's thermodynamic limit, then

$$f_j(x, \theta, t) \to \Omega\, g_j(z, \theta, t), \tag{3.1}$$

where ** z**=** x**/*Ω* and the functions *g*_{j}(⋅) are independent of *Ω*. For many physical processes where the fluctuations are due to the discrete nature of matter there is a natural *Ω* parameter with such properties. Examples of such parameters are the system size in chemical kinetics, the capacity of a condenser in electric circuits or the mass of a particle [7].

### (a) Diffusion approximation

In order to obtain a Langevin equation which closely matches the dynamics of the MJP, it is assumed that there is an infinitesimal time interval d*t* that satisfies the following conditions:

$$f_j(x_{t'}, \theta, t') \approx f_j(x_t, \theta, t) \quad \forall\, t' \in [t, t + \mathrm{d}t],\; j \in [1,\ldots,M], \tag{3.2}$$

and

$$f_j(x_t, \theta, t)\,\mathrm{d}t \gg 1. \tag{3.3}$$

The first condition constrains d*t* to be small enough such that the transition rate functions remain approximately constant. This implies that the number of transitions of type *j* is distributed as a Poisson random variable with mean *f*_{j}(*x*_{t},*θ*,*t*) d*t* and is independent of other transitions of type *j*′≠*j*. The second condition constrains d*t* to be large enough such that the number of transitions of each type is significantly larger than 1, which further implies that the Poisson distribution can accurately be approximated by a Gaussian distribution. It can be shown [26] that we can choose d*t* and *Ω* such that both conditions are satisfied, and this generally occurs when the system approaches its thermodynamic limit.

Given such a time scale, the state of the system at time *t*+d*t* can be computed by

$$x_{t+\mathrm{d}t} = x_t + \sum_{j=1}^{M} s_j\, \mathcal{N}\big(f_j(x_t, \theta, t)\,\mathrm{d}t,\; f_j(x_t, \theta, t)\,\mathrm{d}t\big), \tag{3.4}$$

where $\mathcal{N}(\mu, \sigma^2)$ denotes a Gaussian random variate with mean *μ* and variance *σ*^{2}. From equation (3.4), we can directly obtain a Langevin equation of the form

$$\mathrm{d}x_t = S f(x_t, \theta, t)\,\mathrm{d}t + S\,\mathrm{diag}\big(f(x_t, \theta, t)\big)^{1/2}\,\mathrm{d}B_t, \tag{3.5}$$

where we use ** S** to denote the matrix whose columns are the state change vectors *s*_{j}, ** f**(⋅) to denote the vector whose elements are the transition rates *f*_{j}(⋅), diag(** v**) a function that returns a diagonal matrix with elements taken from the vector ** v** and d*B*_{t} an *M*-dimensional Wiener process. Note that the dimension of *x*_{t} differs from that of d*B*_{t}.

Owing to the nonlinear state-dependent drift and diffusion coefficients in equation (3.5), the transition density of the stochastic process is also intractable. Therefore, a data augmentation approach similar to the one in Boys *et al.* [6] has to be followed. However, there is no longer the need to sample the number, times and types of state transitions, as the MJP is approximated with a continuous process. Moreover, the latent variables corresponding to unobserved states can now be efficiently simulated by a Euler–Maruyama scheme, which is computationally more efficient than the SSA. This approach has been followed by Roberts & Stramer [12] and Golightly & Wilkinson [13] for inference over the unknown parameters *θ*, while in Heron *et al.* [27] a similar methodology has been applied to real data from an auto-regulatory gene expression network.
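The Euler–Maruyama scheme can be sketched as follows for equation (3.5). The system and parameter values are the same hypothetical ones used in the SSA illustration above, and clamping the rates at zero is a practical safeguard (not part of the formal approximation) against the unbounded Gaussian noise driving populations negative:

```python
import numpy as np

def euler_maruyama(x0, S, rates, theta, t_max, dt, rng):
    """Euler-Maruyama discretization of the chemical Langevin equation (3.5):
    x_{t+dt} = x_t + S f dt + S diag(sqrt(f dt)) dW, with dW ~ N(0, I_M)."""
    n_steps = int(t_max / dt)
    x = np.array(x0, dtype=float)
    path = np.empty((n_steps + 1, len(x)))
    path[0] = x
    for i in range(n_steps):
        f = np.maximum(rates(x, theta), 0.0)   # clamp: rates must stay >= 0
        drift = S @ f * dt
        noise = S @ (np.sqrt(f * dt) * rng.standard_normal(len(f)))
        x = x + drift + noise
        path[i + 1] = x
    return path

# Same hypothetical reactions as before: A + B -> 2A and A -> 0.
Omega = 100.0
S = np.array([[1, -1], [-1, 0]], dtype=float)
rates = lambda x, th: np.array([th[0] / Omega * x[0] * x[1], th[1] * x[0]])
rng = np.random.default_rng(1)
path = euler_maruyama([50.0, 80.0], S, rates, theta=[1.0, 0.1],
                      t_max=5.0, dt=0.01, rng=rng)
```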

### (b) Linear noise approximation

Substituting equation (3.1) in the Langevin equation (3.5) and dividing by *Ω*, we obtain

$$\mathrm{d}z_t = S g(z_t, \theta, t)\,\mathrm{d}t + \Omega^{-1/2}\, S\,\mathrm{diag}\big(g(z_t, \theta, t)\big)^{1/2}\,\mathrm{d}B_t, \tag{3.6}$$

from which we can see that the fluctuations are of the order of $\Omega^{-1/2}$, and in the thermodynamic limit (3.6) reduces to the MRE

$$\frac{\mathrm{d}\phi_t}{\mathrm{d}t} = S g(\phi_t, \theta, t).$$

To obtain the LNA, we make the assumption that for sufficiently large *Ω* a solution to (3.6) will differ from the MRE by a stochastic term of order $\Omega^{-1/2}$. That is,

$$z_t = \phi_t + \Omega^{-1/2}\xi_t, \tag{3.7}$$

where *ϕ*_{t} are deterministic or sure variables satisfying the MRE and *ξ*_{t} are stochastic variables. Rewriting the transition rate functions using (3.7) and Taylor expanding around ** ϕ**, we get

$$g\big(\phi_t + \Omega^{-1/2}\xi_t, \theta, t\big) = g(\phi_t, \theta, t) + \Omega^{-1/2} J_t\, \xi_t + O(\Omega^{-1}), \tag{3.8}$$

where $J_t$ denotes the Jacobian of the transition rates, $(J_t)_{j,d} = \partial g_j(\phi_t, \theta, t)/\partial \phi_{d}$. We can now substitute (3.7) and (3.8) back into (3.6) and collect terms of *O*(1) to get the expression for the differential of ** ϕ**, which is none other than the MRE

$$\mathrm{d}\phi_t = S g(\phi_t, \theta, t)\,\mathrm{d}t. \tag{3.9}$$

Finally, collecting the remaining terms and neglecting terms of $O(\Omega^{-1})$ and higher, we get the differential of ** ξ** as

$$\mathrm{d}\xi_t = S J_t\, \xi_t\,\mathrm{d}t + S\,\mathrm{diag}\big(g(\phi_t, \theta, t)\big)^{1/2}\,\mathrm{d}B_t. \tag{3.10}$$

Equation (3.10) characterizes the fluctuations around the deterministic state ** ϕ** and its validity depends on the size of *Ω*. As *Ω* increases, the magnitude of the individual jumps *s*_{j} becomes negligible relative to the distance in ** ϕ** over which the nonlinearity of ** g** becomes noticeable. A measure of the sufficiency of the LNA is the coefficient of variation, i.e. the ratio of the standard deviation to the mean. For a more thorough discussion on the validity of the LNA, the reader is referred to Ferm *et al.* [23] and the supplementary material of Komorowski *et al.* [15].

### (c) Solution of the linear noise approximation and the approximate likelihood function

LNA provides a convenient expression for the approximate likelihood since the MRE (3.9) can easily be solved numerically and its computational cost is polynomial in *D*. Moreover, equation (3.10) is a system of linear stochastic differential equations which has an explicit solution of the form

$$\xi_t = \Phi(t_0, t)\,\xi_{t_0} + \Phi(t_0, t)\int_{t_0}^{t} \Phi(t_0, s)^{-1}\, S\,\mathrm{diag}\big(g(\phi_s, \theta, s)\big)^{1/2}\,\mathrm{d}B_s, \tag{3.11}$$

where the integral is in the Itô sense and ** Φ**(*t*_{0},*t*) is the solution of

$$\frac{\mathrm{d}\Phi(t_0, t)}{\mathrm{d}t} = S J_t\, \Phi(t_0, t), \qquad \Phi(t_0, t_0) = I, \tag{3.12}$$

with $J_t$ the Jacobian of the transition rates ** g** evaluated at *ϕ*_{t}. Since the Itô integral of a deterministic function is a Gaussian random variable [28], equation (3.11) implies that *ξ*_{t} has a multi-variate normal distribution. To simplify the analysis further, it is assumed that the initial condition for *z*_{t} has a multi-variate normal distribution such that $z_{t_0} \sim \mathcal{N}(\phi_{t_0}, \Omega^{-1}V_{t_0})$. For the rest of the paper, we will assume that *ϕ*_{t0} and *V*_{t0} are known. In cases where the initial conditions are unknown they can be treated as additional parameters. Equations (3.7), (3.9)–(3.11) and the specification of initial conditions further imply that

$$z_t \sim \mathcal{N}\big(\phi_t,\ \Omega^{-1}V_t\big), \tag{3.13}$$

where *ϕ*_{t} are solutions of the MRE and *V*_{t} are solutions of

$$\frac{\mathrm{d}V_t}{\mathrm{d}t} = S J_t\, V_t + V_t\, (S J_t)^{\mathrm{T}} + S\,\mathrm{diag}\big(g(\phi_t, \theta, t)\big)\, S^{\mathrm{T}}.$$

Finally, multiplying (3.13) by *Ω*, we get $x_t \sim \mathcal{N}(\Omega\phi_t,\ \Omega V_t)$.

Assume that we have observations from the stochastic process ** X**(*t*) at discrete time points *t*_{i}∈{*t*_{1},…,*t*_{N}}. Moreover, assume that each observation *x*_{ti} is obtained from an independent realization of ** X**(*t*). For example, to obtain an observation at *t*_{1}=10 the SSA is used to simulate a trajectory from *t*_{0} to *t*_{1} and the state of the system at *t*_{1} is kept. For *t*_{2}=20, the SSA is again used to simulate a new trajectory from *t*_{0} to *t*_{2}, keeping only the state of the system at *t*_{2}, and the process continues until all necessary observations are gathered. These kinds of data are very frequently encountered in biology, where in order to obtain a single measurement the sample has to be 'sacrificed'. This is common in data obtained using polymerase chain reaction reporter assays [29], for example. See also Komorowski *et al.* [15] for an example of an inference problem with such data. Owing to the independence between different observations and the Markov property, the likelihood is simply

$$p(x_{t_1}, \ldots, x_{t_N}\,|\,\theta) = \prod_{i=1}^{N} \mathcal{N}\big(x_{t_i}\,;\ \Omega\phi_{t_i},\ \Omega V_{t_i}\big). \tag{3.14}$$
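To make the computation behind equation (3.14) concrete, the following sketch solves the MRE and variance ordinary differential equations numerically and evaluates the Gaussian log likelihood. The one-dimensional immigration–death process, the parameter values and the data are hypothetical, chosen only for illustration:

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.stats import norm

# Hypothetical 1-D immigration-death process:
#   0 -> X with rate c1*Omega,   X -> 0 with rate c2*X
# MRE (3.9):        dphi/dt = c1 - c2*phi
# Variance (3.10):  dV/dt   = -2*c2*V + c1 + c2*phi
Omega = 500.0

def lna_loglik(theta, t_obs, x_obs, phi0=0.0, V0=0.0):
    c1, c2 = theta
    def odes(t, y):
        phi, V = y
        return [c1 - c2 * phi, -2.0 * c2 * V + c1 + c2 * phi]
    sol = solve_ivp(odes, (0.0, t_obs[-1]), [phi0, V0],
                    t_eval=t_obs, rtol=1e-8)
    phi, V = sol.y
    # Independent observations (eq. 3.14): x_ti ~ N(Omega*phi_ti, Omega*V_ti)
    return norm.logpdf(x_obs, loc=Omega * phi,
                       scale=np.sqrt(Omega * V)).sum()

t_obs = np.array([1.0, 2.0, 5.0, 10.0])
x_obs = np.array([180.0, 300.0, 480.0, 500.0])   # made-up data
ll = lna_loglik([1.0, 1.0], t_obs, x_obs)
```

In a sampler, `lna_loglik` would be called once per proposed value of the parameters.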

In this paper, we only consider observations of this kind. However, the methodology is readily applicable when observations from a single realization of ** X**(*t*) are available. In this case, the likelihood also has the simple form

$$p(\mathbf{X}\,|\,\theta) = \mathcal{N}\big(\mathbf{X}\,;\ \Omega\mu(\theta),\ \Omega\Sigma(\theta)\big),$$

where ** X**=(*x*_{t1},…,*x*_{tN})^{T} is an *ND* vector with all the observations, ** μ**(*θ*)=(*ϕ*_{t1},…,*ϕ*_{tN})^{T} is also an *ND* vector with solutions of the MRE and ** Σ**(*θ*) is an *ND*×*ND* block matrix ** Σ**(*θ*)={** Σ**(*θ*)^{i,j}: *i*,*j*∈[1,…,*N*]} such that

$$\Sigma(\theta)^{i,j} = V_{t_i}\,\Phi(t_i, t_j)^{\mathrm{T}} \ \ \text{for}\ i \le j, \qquad \Sigma(\theta)^{i,j} = \big(\Sigma(\theta)^{j,i}\big)^{\mathrm{T}} \ \ \text{for}\ i > j. \tag{3.15}$$

This stems from the fact that, because of the Markov property and equation (3.13), each *x*_{ti} can be written as a sum of multi-variate normal random variables and therefore ** X** is also a multi-variate normal random variable. For more details, refer to the supplementary material of Komorowski *et al.* [2,15]. The only additional complication which arises for time-series data is that the off-diagonal components of the LNA variance in equation (3.15) need to be estimated by numerically solving the system of ordinary differential equations in equation (3.12). Note that despite the fact that the variance matrix is full, we can still exploit the Markov property and write the likelihood as a product of the conditional likelihoods, and therefore avoid the cost of inverting the *ND*×*ND* variance matrix.

## 4. Markov chain Monte Carlo methods

In this section, we give a brief overview of the MCMC algorithms that we consider in this work. Some familiarity with MCMC concepts is assumed, as an introduction to the subject is outside the scope of this paper.

### (a) Metropolis–Hastings

For a random vector *θ* with density *p*(*θ*), the Metropolis–Hastings algorithm employs a proposal mechanism $q(\theta^{*}\,|\,\theta^{t-1})$ and proposed moves are accepted with probability

$$\alpha = \min\!\left\{1,\ \frac{p(\theta^{*})\,q(\theta^{t-1}\,|\,\theta^{*})}{p(\theta^{t-1})\,q(\theta^{*}\,|\,\theta^{t-1})}\right\}.$$

In the context of Bayesian inference, the target density *p*(*θ*) corresponds to the posterior distribution of the model parameters. Tuning the Metropolis–Hastings algorithm involves selecting the right proposal mechanism. A common choice is a random walk Gaussian proposal of the form $q(\theta^{*}\,|\,\theta^{t-1}) = \mathcal{N}(\theta^{*}\,;\ \theta^{t-1},\ \Sigma)$, where $\mathcal{N}(\cdot\,;\ \mu, \Sigma)$ denotes the multi-variate normal density with mean ** μ** and covariance matrix ** Σ**.

Selecting the covariance matrix, however, is far from trivial in most cases since knowledge about the target density is required. Therefore, a more simplified proposal mechanism is often considered in which the covariance matrix is replaced with a diagonal matrix such as ** Σ**=*ϵ*** I**, where the value of the scale parameter *ϵ* has to be tuned in order to achieve fast convergence and good mixing. Small values of *ϵ* imply small transitions and result in high acceptance rates, while the mixing of the Markov chain is poor. Large values, on the other hand, allow for large transitions but result in most of the samples being rejected.

Tuning the scale parameter becomes even more difficult in problems where the standard deviations of the marginal posteriors differ substantially, since different scales are required for each dimension, and this is exacerbated when correlations between different variables exist. Adaptive schemes for the Metropolis–Hastings algorithm have also been proposed [30] though they should be applied with care [31]. Parameters such as reaction rate constants often differ by orders of magnitude, thus a scaled diagonal covariance matrix will be a bad choice for such problems. In the numerical simulations in the next section, we use a Metropolis within Gibbs scheme where each parameter is updated conditional on all others using a univariate normal density with a parameter-specific scale parameter. This allows us to tune the scale for each proposal independently and achieve better mixing.
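The Metropolis within Gibbs scheme described above, with one univariate proposal scale per parameter, can be sketched as follows. The correlated Gaussian target is a toy stand-in for a posterior, not one of the models considered in the paper:

```python
import numpy as np

def metropolis_within_gibbs(log_post, theta0, scales, n_iter, rng):
    """Component-wise random-walk Metropolis: each parameter is updated
    in turn with its own univariate Gaussian proposal scale."""
    theta = np.array(theta0, dtype=float)
    lp = log_post(theta)
    samples = np.empty((n_iter, len(theta)))
    accepted = np.zeros(len(theta))
    for i in range(n_iter):
        for d in range(len(theta)):
            prop = theta.copy()
            prop[d] += scales[d] * rng.standard_normal()
            lp_prop = log_post(prop)
            if np.log(rng.uniform()) < lp_prop - lp:   # M-H acceptance
                theta, lp = prop, lp_prop
                accepted[d] += 1
        samples[i] = theta
    return samples, accepted / n_iter

# Toy correlated Gaussian target standing in for a posterior.
C = np.array([[1.0, 0.9], [0.9, 1.0]])
Cinv = np.linalg.inv(C)
log_post = lambda th: -0.5 * th @ Cinv @ th
rng = np.random.default_rng(2)
samples, acc = metropolis_within_gibbs(log_post, [3.0, -3.0],
                                       scales=[0.5, 0.5],
                                       n_iter=2000, rng=rng)
```

The per-parameter acceptance rates returned in `acc` are what one would monitor when tuning the scales.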

### (b) Manifold Metropolis-adjusted Langevin algorithm

Denoting the log of the target density as $L(\theta) = \log p(\theta)$, the manifold MALA (MMALA) method [22] defines a Langevin diffusion with stationary distribution *p*(*θ*) on the Riemann manifold of density functions with metric tensor ** G**(*θ*). By employing a first-order Euler integrator to solve the diffusion, a proposal mechanism with density $q(\theta^{*}\,|\,\theta^{t-1}) = \mathcal{N}\big(\theta^{*}\,;\ \mu(\theta^{t-1}, \epsilon),\ \epsilon^{2} G^{-1}(\theta^{t-1})\big)$ is obtained, where *ϵ* is the integration step size, a parameter which needs to be tuned, and the *d*th component of the mean function ** μ**(*θ*,*ϵ*)_{d} is

$$\mu(\theta, \epsilon)_d = \theta_d + \frac{\epsilon^2}{2}\big(G^{-1}(\theta)\,\nabla_{\theta} L(\theta)\big)_d - \epsilon^2 \sum_{i=1}^{n_\theta}\sum_{j=1}^{n_\theta} \big(G^{-1}(\theta)\big)_{i,j}\,\Gamma^{d}_{i,j}, \tag{4.1}$$

where $\Gamma^{d}_{i,j}$ are the Christoffel symbols of the metric in local coordinates [32].

Similarly to MALA [18], owing to the discretization error introduced by the first-order approximation, convergence to the stationary distribution is not guaranteed anymore and thus the Metropolis–Hastings ratio is employed to correct this bias. The MMALA can simply be stated as in algorithm 1 and more details can be found in Girolami & Calderhead [22].
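A single proposal/accept step of this kind can be sketched as follows, with the Christoffel term of equation (4.1) dropped, i.e. a constant-curvature simplification. The Gaussian target, and the choice of its exact precision matrix as the metric, are toy assumptions made purely to illustrate the mechanics of the position-dependent proposal and its asymmetric correction:

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def simplified_mmala_step(theta, log_post, grad, metric, eps, rng):
    """One step with a position-dependent Langevin proposal
    N(mu(theta), eps^2 G(theta)^{-1}), corrected by the M-H ratio."""
    G = metric(theta)
    Ginv = np.linalg.inv(G)
    mu = theta + 0.5 * eps**2 * Ginv @ grad(theta)
    prop = rng.multivariate_normal(mu, eps**2 * Ginv)
    # Reverse-move mean, needed because the proposal is not symmetric.
    Ginv_p = np.linalg.inv(metric(prop))
    mu_p = prop + 0.5 * eps**2 * Ginv_p @ grad(prop)
    log_ratio = (log_post(prop) - log_post(theta)
                 + mvn.logpdf(theta, mu_p, eps**2 * Ginv_p)
                 - mvn.logpdf(prop, mu, eps**2 * Ginv))
    if np.log(rng.uniform()) < log_ratio:
        return prop, True
    return theta, False

# Toy Gaussian target; the metric is taken as the constant true precision.
C = np.array([[1.0, 0.9], [0.9, 1.0]])
P = np.linalg.inv(C)
log_post = lambda th: -0.5 * th @ P @ th
grad = lambda th: -P @ th
metric = lambda th: P
rng = np.random.default_rng(3)
theta = np.array([3.0, -3.0])
for _ in range(100):
    theta, _ = simplified_mmala_step(theta, log_post, grad, metric,
                                     eps=1.0, rng=rng)
```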

We can interpret the proposal mechanism of MMALA as a local Gaussian approximation to the target density, similar to the adaptive Metropolis–Hastings of Haario *et al.* [33]. In contrast to Haario *et al.* [33], the effective covariance matrix in MMALA is the inverse of the metric tensor evaluated at the current position and no samples from the chain are required in order to estimate it, therefore avoiding the difficulties of adaptive MCMC discussed in Andrieu & Thoms [31]. Furthermore, a simplified version of the MMALA (SMMALA) can be derived by assuming a manifold with constant curvature, thus cancelling the last term in equation (4.1), which depends on the Christoffel symbols. Finally, the MMALA can be seen as a generalization of the original MALA [18] since, if the metric tensor ** G**(*θ*) is equal to the identity matrix, corresponding to a Euclidean manifold, then the original algorithm is recovered.

### (c) Manifold Hamiltonian Monte Carlo

The Riemann manifold Hamiltonian Monte Carlo (RMHMC) method defines a Hamiltonian on the Riemann manifold of probability density functions by introducing the auxiliary variables $p \sim \mathcal{N}(0, G(\theta))$, which are interpreted as the momentum at a particular position *θ*, and by considering the negative log of the target density as a potential function. More formally, the Hamiltonian defined on the Riemann manifold is

$$H(\theta, p) = -L(\theta) + \tfrac{1}{2}\log\big[(2\pi)^{n_\theta}|G(\theta)|\big] + \tfrac{1}{2}\,p^{\mathrm{T}} G(\theta)^{-1} p, \tag{4.2}$$

where the terms $-L(\theta) + \tfrac{1}{2}\log[(2\pi)^{n_\theta}|G(\theta)|]$ and $\tfrac{1}{2}p^{\mathrm{T}} G(\theta)^{-1} p$ are the potential energy and kinetic energy, respectively. Simulating the Hamiltonian requires a time-reversible and volume-preserving numerical integrator. For this purpose, the generalized leapfrog algorithm can be employed, providing a deterministic proposal mechanism for simulating from the conditional distribution *θ*|** p**∼*p*(*θ*|** p**). More details about the generalized leapfrog integrator can be found in Girolami & Calderhead [22]. To simulate a path across the manifold, the leapfrog integrator is iterated *L* times; *L*, along with the integration step size *ϵ*, is a parameter requiring tuning. Again, owing to the integration errors in simulating the Hamiltonian, the Metropolis–Hastings ratio is applied in order to ensure convergence to the stationary distribution. Moreover, following the suggestion of Neal [20], the number of leapfrog iterations *L* is randomized in order to improve mixing. The RMHMC algorithm is given in algorithm 2.

Similar to the MMALA, when the metric tensor ** G**(*θ*) is equal to the identity matrix, corresponding to a Euclidean manifold, the RMHMC algorithm is equivalent to the HMC algorithm of Duane *et al.* [19].
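In the Euclidean special case just mentioned, the generalized leapfrog reduces to the standard leapfrog of HMC, which is simple enough to sketch. The Gaussian target and all tuning values below are illustrative assumptions, not settings from the paper:

```python
import numpy as np

def hmc_step(theta, log_post, grad, eps, L, rng):
    """One HMC transition with identity metric: the special case in which
    the generalized leapfrog reduces to the standard leapfrog."""
    p = rng.standard_normal(len(theta))           # momentum ~ N(0, I)
    H0 = -log_post(theta) + 0.5 * p @ p           # initial Hamiltonian
    th, g = theta.copy(), grad(theta)
    for _ in range(L):                            # leapfrog integration
        p = p + 0.5 * eps * g
        th = th + eps * p
        g = grad(th)
        p = p + 0.5 * eps * g
    H1 = -log_post(th) + 0.5 * p @ p
    if np.log(rng.uniform()) < H0 - H1:           # M-H correction
        return th, True
    return theta, False

C = np.array([[1.0, 0.9], [0.9, 1.0]])
P = np.linalg.inv(C)
log_post = lambda th: -0.5 * th @ P @ th
grad = lambda th: -P @ th
rng = np.random.default_rng(4)
theta = np.array([3.0, -3.0])
n_accept = 0
for _ in range(200):
    # Randomizing the number of leapfrog steps, as suggested in the text.
    theta, ok = hmc_step(theta, log_post, grad, eps=0.2,
                         L=int(rng.integers(5, 20)), rng=rng)
    n_accept += ok
```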

## 5. Implementation details

### (a) Gradient and metric tensor for the linear noise approximation

For the manifold MCMC algorithms discussed in this section, we will need the gradient of the log likelihood as well as a metric tensor for the LNA. For density functions, the natural metric tensor is the expected Fisher information ** I**(*θ*) [34], and for a multi-variate normal with mean ** μ**(*θ*) and covariance matrix ** Σ**(*θ*) its general form is

$$I(\theta)_{i,j} = \frac{\partial \mu(\theta)^{\mathrm{T}}}{\partial \theta_i}\,\Sigma^{-1}(\theta)\,\frac{\partial \mu(\theta)}{\partial \theta_j} + \frac{1}{2}\,\mathrm{tr}\!\left(\Sigma^{-1}(\theta)\frac{\partial \Sigma(\theta)}{\partial \theta_i}\,\Sigma^{-1}(\theta)\frac{\partial \Sigma(\theta)}{\partial \theta_j}\right).$$

For the likelihood in equation (3.14), the Fisher information is then a sum of *N* matrices ** I**(*θ*,*t*_{i}), one evaluated at each time point. Similarly, the general form of the partial derivatives of the log of a multi-variate normal is

$$\frac{\partial L(\theta)}{\partial \theta_i} = \frac{\partial \mu(\theta)^{\mathrm{T}}}{\partial \theta_i}\, c + \frac{1}{2}\left(c^{\mathrm{T}}\frac{\partial \Sigma(\theta)}{\partial \theta_i}\, c - \mathrm{tr}\!\left(\Sigma^{-1}(\theta)\frac{\partial \Sigma(\theta)}{\partial \theta_i}\right)\right),$$

where ** c**=*Σ*^{−1}(*θ*)[** x**−** μ**(*θ*)].

Moreover, during the leapfrog integration for the RMHMC and for the mean function of the MMALA, the partial derivatives of the Fisher information are needed. Their general form is

$$\frac{\partial I(\theta)_{j,k}}{\partial \theta_i} = \frac{\partial^2 \mu^{\mathrm{T}}}{\partial \theta_i \partial \theta_j}\, a_k + a_j^{\mathrm{T}}\, \frac{\partial^2 \mu}{\partial \theta_i \partial \theta_k} - a_j^{\mathrm{T}}\, \frac{\partial \Sigma}{\partial \theta_i}\, a_k + \frac{1}{2}\,\mathrm{tr}\!\left(\Sigma^{-1}\frac{\partial^2 \Sigma}{\partial \theta_i \partial \theta_j}\, A_k + A_j\, \Sigma^{-1}\frac{\partial^2 \Sigma}{\partial \theta_i \partial \theta_k} - A_i A_j A_k - A_j A_i A_k\right),$$

where *a*_{i}=*Σ*^{−1}∂** μ**(*θ*)/∂*θ*_{i} and *A*_{i}=*Σ*^{−1}∂** Σ**(*θ*)/∂*θ*_{i}.
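The two general forms for the Fisher information and the gradient translate directly into code. The following sketch evaluates them for a hypothetical two-parameter mean and covariance parametrization; for the LNA, the sensitivities `dmu` and `dSigma` would instead come from solving sensitivity equations:

```python
import numpy as np

def mvn_fisher(dmu, dSigma, Sigma):
    """Expected Fisher information for N(mu(theta), Sigma(theta)):
    I_ij = dmu_i' Sinv dmu_j + 0.5 tr(Sinv dSigma_i Sinv dSigma_j),
    given lists of parameter sensitivities dmu[i] and dSigma[i]."""
    Sinv = np.linalg.inv(Sigma)
    n = len(dmu)
    I = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            I[i, j] = (dmu[i] @ Sinv @ dmu[j]
                       + 0.5 * np.trace(Sinv @ dSigma[i] @ Sinv @ dSigma[j]))
    return I

def mvn_grad(x, mu, Sigma, dmu, dSigma):
    """Gradient of the MVN log density, using c = Sinv (x - mu)."""
    Sinv = np.linalg.inv(Sigma)
    c = Sinv @ (x - mu)
    return np.array([dmu[i] @ c
                     + 0.5 * (c @ dSigma[i] @ c
                              - np.trace(Sinv @ dSigma[i]))
                     for i in range(len(dmu))])

# Hypothetical parametrization: mu = (theta1, theta2), Sigma = theta2^2 * I.
theta = np.array([1.0, 2.0])
mu = np.array([theta[0], theta[1]])
Sigma = theta[1]**2 * np.eye(2)
dmu = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
dSigma = [np.zeros((2, 2)), 2.0 * theta[1] * np.eye(2)]
I = mvn_fisher(dmu, dSigma, Sigma)
g = mvn_grad(np.array([1.5, 1.0]), mu, Sigma, dmu, dSigma)
```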

The above quantities require first- and second-order sensitivities for **ϕ** and **V**, which we obtain by augmenting the ordinary differential equation systems with the additional sensitivity equations. For an ordinary differential equation system of *n*_{y} equations of the form d**y**/d*t* = **F**(**y**, *t*, **θ**), **y**(*t*_{0}) = **y**_{0}(**θ**), with *n*_{θ} parameters **θ**, the first- and second-order forward sensitivity equations are given by equations (5.1) and (5.2), respectively; the first-order equations take the standard form

d(∂**y**/∂**θ**)/d*t* = (∂**F**/∂**y**)(∂**y**/∂**θ**) + *F*_{θ}, (5.1)

while the second-order equations (5.2) additionally involve *F*_{θ,y} and second derivatives of **y**. We use *F*_{θ} to denote the *n*_{y}×*n*_{θ} matrix whose *j*th column contains the partial derivatives of **F** with respect to θ_{j}. *F*_{θ,y} denotes the derivative of *F*_{θ} with respect to **y** and is an *n*_{θ}⋅*n*_{y}×*n*_{y} matrix whose *j*th column contains the partial derivatives of vec(*F*^{T}_{θ}) with respect to *y*_{j}. *I*_{ny} denotes the *n*_{y}×*n*_{y} identity matrix, ⊗ the Kronecker product and vec(**A**) the operator that creates a column vector by stacking the columns of matrix **A**.
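To make the augmentation concrete, here is a minimal first-order example on a scalar ODE; the system d*y*/d*t* = −θ*y* is our own stand-in for illustration, not one of the LNA systems:

```python
import numpy as np
from scipy.integrate import solve_ivp

# First-order forward sensitivities by augmenting the ODE state: for
# dy/dt = F(y, t, theta) the sensitivity s = dy/dtheta obeys
# ds/dt = (dF/dy) s + dF/dtheta, with s(t0) = dy0/dtheta.
def augmented_rhs(t, z, theta):
    y, s = z
    dy = -theta * y            # F(y, t, theta) = -theta * y
    ds = -theta * s - y        # (dF/dy) s + dF/dtheta
    return [dy, ds]

theta, y0, T = 0.5, 2.0, 4.0
sol = solve_ivp(augmented_rhs, (0.0, T), [y0, 0.0], args=(theta,),
                rtol=1e-10, atol=1e-12)
y_T, s_T = sol.y[0, -1], sol.y[1, -1]
# Analytic check: y(T) = y0 exp(-theta T) and dy/dtheta = -T y0 exp(-theta T)
```

Second-order sensitivities follow the same pattern, with the state further augmented by ∂²**y**/∂θ_{i}∂θ_{j}.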

### (b) Re-parametrization

In many problems, the parameters **θ** can be constrained to certain parts of ℝ^{*n*_{θ}}, where *n*_{θ} is the number of parameters. In models of chemical kinetics, for example, rate parameters must be positive and can differ by orders of magnitude. For the MCMC algorithms described in the previous section, we need a re-parametrization that allows the algorithms to operate on an unbounded and unconstrained parameter space.

For the numerical simulations in §6, we use a re-parametrization by introducing the variables *w*_{p} = log_{10}(θ_{p}), *p*∈[1,…,*n*_{θ}]. To ensure that we sample from the correct posterior, the joint density is scaled by the determinant of the Jacobian, such that *p*(**w**, **X**) = *p*(**θ**, **X**)|**J**|, where **J** is an *n*_{θ}×*n*_{θ} diagonal matrix with elements *J*_{p,p} = ∂θ_{p}/∂*w*_{p} = 10^{*w*_{p}} ln 10.

The gradient and the Fisher information, along with its partial derivatives, then follow from the chain rule.
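A sketch of this re-parametrization and the chain-rule correction, assuming the base-10 logarithmic map that is consistent with the priors of the next subsection (the symbol names are ours):

```python
import numpy as np

LN10 = np.log(10.0)

def to_unconstrained(theta):
    # w_p = log10(theta_p): positive, widely varying rates become unbounded
    return np.log10(theta)

def log_abs_det_jacobian(w):
    # log |det J| with J = diag(d theta_p / d w_p) = diag(10**w_p * ln 10)
    return np.sum(w * LN10 + np.log(LN10))

def grad_wrt_w(grad_theta, w):
    # chain rule: dL/dw_p = dL/dtheta_p * 10**w_p * ln 10
    return grad_theta * (10.0 ** w) * LN10
```

The Fisher information transforms analogously, with the diagonal Jacobian applied on both sides.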

### (c) Choice of priors

In Bayesian statistics, priors provide the means for incorporating existing knowledge about the parameters in question. The choice of a suitable prior distribution can be informed by knowledge about the process being modelled, the experimental design and the empirical observations. For example, we might want to prevent rate parameters in chemical kinetics from becoming very large, since the experimental design assumes that reactions are slow enough to be observed. In some cases, the model itself can also guide the choice of prior: when a model is only defined for a certain range of parameter values, a prior restricting the parameters to that range should be used.

In the numerical simulations of the next section, we use independent normal priors on the re-parametrized (logarithmic) parameters. Owing to the re-parametrization introduced earlier, this corresponds to a lognormal prior with base 10 on the parameters **θ**. This choice allows parameters to differ by several orders of magnitude while ensuring that they are strictly positive. Moreover, as noted by Girolami & Calderhead [22], the negative Hessian of the log prior is added to the Fisher information to form the metric tensor used during MCMC sampling. This has the added benefit of regularizing the Fisher information when it is near-singular [9], although we have not observed such problems in the simulations presented here.
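The correspondence between a standard normal prior on the base-10 log-parameters and a base-10 lognormal prior on **θ** is easy to check by simulation (a sanity-check sketch, not part of the paper's procedure):

```python
import numpy as np

rng = np.random.default_rng(2)
w = rng.normal(0.0, 1.0, size=200_000)   # N(0, 1) prior on w = log10(theta)
theta = 10.0 ** w                        # implied prior on theta

# theta is strictly positive, its median is 10**0 = 1, and
# ln(theta) = w * ln(10) is normal with standard deviation ln(10),
# so the prior spans several orders of magnitude.
med = np.median(theta)
sd_log = np.log(theta).std()
```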

## 6. Numerical simulations

### (a) Chemical kinetics

In this section, we consider two examples from chemical kinetics [14] and study the effect of the system size parameter on inference using MCMC. The first system consists of three species where an unstable monomer, *S*_{1}, can dimerize to an unstable dimer, *S*_{2}, which is then converted to a stable form, *S*_{3}. The reaction set for this system is
and the state of the system at time *t* will be denoted by **X**(*t*) = [*S*_{1}(*t*), *S*_{2}(*t*), *S*_{3}(*t*)]^{T}. The propensity functions, or state transition probabilities, are **f**(**X**, **θ**) = [*c*_{1}*S*_{1}(*t*), *c*_{2}*Ω*^{−1}*S*_{1}(*t*)(*S*_{1}(*t*)−1)/2, *c*_{3}*S*_{2}(*t*), *c*_{4}*S*_{3}(*t*)]^{T} and the corresponding state change matrix is given in equation (6.1).

For our experiments, we will assume that initial conditions are known and set them to *S*_{1}(*t*_{0})=5*Ω*, *S*_{2}(*t*_{0})=*S*_{3}(*t*_{0})=0, *t*_{0}=0. Moreover, we will set the reaction rate parameters to *c*_{1}=1, , *c*_{3}=0.5 and *c*_{4}=0.04. Note that we make explicit the relation between the system size and the parameter, and we will infer the rate *c*_{2} only up to a proportionality constant. For all the experiments, we simulate data using the SSA of Gillespie [1] for the time interval *t*∈[0,10] and we discretize such that *t*_{i}−*t*_{i−1}=0.1. Each observation **X**(*t*_{i}) is obtained independently by simulating a trajectory from *t*_{0} to *t*_{i} and keeping only the last state, discarding the rest of the trajectory. Moreover, for each time point *t*_{i}, we also simulate 10 independent observations. Since each observation is obtained from a different trajectory of the MJP, we assume that the initial conditions do not have a point mass; rather, for each trajectory we sample its initial condition from a Poisson distribution with means *S*_{1}(0), *S*_{2}(0) and *S*_{3}(0).
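A minimal sketch of the SSA used to generate such observations. The propensity vector follows **f**(**X**, **θ**) above, but the state-change matrix `S` encodes one plausible reading of the reaction set (monomer decay, dimerization, conversion, degradation) since equation (6.1) is not reproduced here, and the rate values are illustrative only (in particular, `c2` is not the paper's elided value):

```python
import numpy as np

rng = np.random.default_rng(0)

def propensities(x, c, omega):
    # f(X, theta) from the text: [c1*S1, c2*S1*(S1-1)/(2*omega), c3*S2, c4*S3]
    s1, s2, s3 = x
    return np.array([c[0] * s1,
                     c[1] * s1 * (s1 - 1) / (2.0 * omega),
                     c[2] * s2,
                     c[3] * s3])

# One row per reaction; an assumed reading of equation (6.1):
# S1 -> 0, S1 + S1 -> S2, S2 -> S3, S3 -> 0.
S = np.array([[-1,  0,  0],
              [-2,  1,  0],
              [ 0, -1,  1],
              [ 0,  0, -1]])

def ssa(x0, c, omega, t_end):
    """Gillespie's direct method: exponential waiting times, one reaction
    fired per step; run until t_end and return the final state only."""
    t, x = 0.0, np.array(x0, dtype=int)
    while True:
        a = propensities(x, c, omega)
        a0 = a.sum()
        if a0 <= 0.0:
            return x                      # no reaction can fire
        t += rng.exponential(1.0 / a0)
        if t > t_end:
            return x
        x = x + S[rng.choice(len(a), p=a / a0)]

# Illustrative rates and a small system, discarding all but the final state.
x_final = ssa((25, 0, 0), (1.0, 0.5, 0.5, 0.04), omega=5.0, t_end=10.0)
```

Repeating this call from *t*_{0} for each *t*_{i} reproduces the independent-trajectory observation scheme described above.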

We use the synthetic data to perform inference for the rate parameters by drawing samples from the posterior, where *r* indexes independent observations for the same time point. For all simulations in this paper, we assume that the means for the initial conditions are known. Following similar arguments to those for the derivation of the LNA in §3, namely that as the system approaches its thermodynamic limit transition densities become Gaussian, the initial conditions for the ordinary differential equation systems for the mean and variance of the transition densities are **ϕ**(0) = **X**(0)*Ω*^{−1} and *V*(0) = **I**, where **I** is the identity matrix. In a more realistic scenario, the initial conditions must be included as additional parameters in **θ**. For all parameters, we used an independent lognormal prior with base 10, zero mean and unit standard deviation, and chains are initialized by drawing a random sample from the prior. For the Metropolis–Hastings sampler, we set the initial proposal scale parameters to ≈1×10^{−6} and automatically adapt them every 100 samples during the burn-in phase in order to achieve an acceptance rate of 25–30% [17]. The same adaptation strategy was followed for the SMMALA and RMHMC algorithms, where the initial step size was also set to ≈1×10^{−6} and was tuned to achieve acceptance rates of the order of 70–80% [22]. Finally, the number of leapfrog steps for RMHMC was fixed to 5. We found that a burn-in period of 10 000–20 000 samples was adequate for all algorithms to converge to the stationary distribution.
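The every-100-samples adaptation can be sketched as a simple rule that nudges the proposal scale toward the target acceptance band; the multiplicative factors here are illustrative assumptions, not the paper's values:

```python
def adapt_scale(scale, accepted, window=100, target=(0.25, 0.30)):
    """Multiplicatively adjust a proposal scale (or step size) during
    burn-in, based on the acceptance count over the last `window` draws."""
    rate = accepted / window
    if rate < target[0]:
        return scale * 0.8   # rejecting too often: smaller proposals
    if rate > target[1]:
        return scale * 1.2   # accepting too often: bolder proposals
    return scale
```

For SMMALA and RMHMC, the same rule would be applied to the integrator step size with `target=(0.70, 0.80)`.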

Table 1 compares the minimum ESS and the time-normalized ESS obtained by all algorithms for different values of the system size parameter *Ω*. The SMMALA and RMHMC samplers use the gradients and the Fisher information of the approximate likelihood obtained by the LNA in order to make efficient proposals. As the system size increases, and the LNA therefore better approximates the true likelihood, the mixing of the manifold MCMC algorithms improves. For this particular example, we can see that good mixing can be achieved even for very small systems with only ≈25 molecules (*Ω*=5). The Metropolis–Hastings sampler is not affected by the system size, but its mixing is very poor in all cases. From the time-normalized ESS, we can also see that, despite the improved mixing of RMHMC, its computational cost is significant. In contrast, the SMMALA provides a good trade-off between mixing efficiency and computational cost. Finally, table 2 reports the marginal posterior means and standard deviations for different values of *Ω* obtained by RMHMC. The marginal posteriors for parameters *c*_{3} and *c*_{4} with *Ω*≥5 are also shown in figure 1. Results from the Metropolis–Hastings and SMMALA samplers are similar and are omitted. For small system sizes, we observe an increased bias of the Monte Carlo estimate, while the posterior standard deviation is higher, reflecting the high degree of uncertainty around the mean. The bias, however, reduces significantly as the system size increases, and for *Ω*≥5 reasonable estimates can be obtained.

The second example from the chemical kinetics literature that we consider is the Schlögl reaction set,
The corresponding state transition rates and state change matrix are given in equations (6.2) and (6.3), respectively. The state of the system consists only of the number of molecules of a single species, **X**(*t*) = *S*_{1}(*t*). The system is known to have two stable states that appear at different times depending on the size of the system. Wallace *et al.* [14] have shown that the LNA fails to provide a reasonable approximation of this system even for large concentration numbers. Their numerical experiments demonstrate that the LNA can only approximate one of the two modes, depending on the initial conditions. Here, our aim is to show that using the LNA to obtain an approximate posterior over the unknown reaction rate constants can be very misleading for bi-stable systems. Using the resulting posterior means for the reaction rates gives an LNA that fails to approximate either of the two stable modes.

To demonstrate this, we follow the same experimental procedure as in the previous example. That is, we simulate data using the SSA for the time interval *t*_{i}∈[0,10], *t*_{i}−*t*_{i−1}=0.1, with fixed rate parameters and then use these data for posterior inference of the rate parameters using MCMC. Values for the true rate parameters and initial conditions were set as in Wallace *et al.* [14]; namely, *c*_{1}=0.003, *c*_{2}=0.0001, *c*_{3}=200, *c*_{4}=3.5 and **X**(*t*_{0})=280*Ω*, where *Ω* was fixed to 1. After 10 000 burn-in samples, all samplers converged to a posterior distribution over the rate constants. The LNA obtained by using the posterior means for the rate constants is shown in figure 2, along with the data obtained by the SSA and the LNA using the true values for the rate constants. We can see that the LNA obtained from the posterior means fails to approximate either of the two modes; rather, it approximates the empirical mean and variance of the data.

### (b) Single-gene expression

Finally, to illustrate the applicability of the methodology to systems biology, we also consider a simplified model for the biochemical reactions involved in the expression of a single gene to a protein. The model presented in this section is the same as that used in the study of Komorowski *et al.* [15] and we adopt the same notation in order to make comparisons easier. Gene expression is modelled in terms of three biochemical species (DNA, mRNA and protein) and four chemical reactions, or state transitions: transcription, mRNA degradation, translation and protein degradation. The model can be written in chemical reaction notation as

The system state at time *t* is **X**(*t*) = [*R*(*t*), *P*(*t*)]^{T}, where *R*(*t*) and *P*(*t*) are the number of mRNA and protein molecules, respectively. The corresponding state-dependent transition rates are **f**(**X**, *t*) = [*k*_{R}(*t*), *γ*_{R}*R*(*t*), *k*_{P}*R*(*t*), *γ*_{P}*P*(*t*)]^{T}, where *γ*_{R}, *k*_{P} and *γ*_{P} are unknown reaction rate constants. *k*_{R}(*t*) is the time-dependent transcription rate of the gene, which, for the purposes of this section, is modelled as a function in which all the *b*_{i}s are also unknown parameters controlling gene transcription. This corresponds to a transcription rate that, owing to some stimulus (experimental or environmental), increases for *t*<*b*_{2} and then drops towards the baseline *b*_{3} for *t*>*b*_{2}. Finally, the state change matrix for this set of reactions is given in equation (6.4).

As in the study of Komorowski *et al*. [15], we also consider a nonlinear extension of this model in which the transcription rate of the gene *k*_{R}(*t*) is a function of the protein concentration that the gene is transcribed to. This is modelled using a Hill function
where for the experiments in this section we will set *H*=*b*_{3}*k*_{P}/(2*γ*_{R}*γ*_{P}) and , making the protein an inhibitor of mRNA transcription. A schematic of this model is shown in figure 3. For the rest of this section, we will refer to this model as the auto-regulatory single-gene expression model.

Using the transition probabilities **f**(**X**, *t*) and matrix **S**, we simulate synthetic data using the SSA [1] and sample at discrete time points. Values for the unknown rate constants and the parameters controlling gene transcription are shown in table 3. The time interval is taken to be *t*_{i}∈[0,25], while the interval between two observations is *t*_{i}−*t*_{i−1}=0.25. Each time point is sampled from an independent trajectory by starting the SSA from *t*_{0} and simulating up to *t*_{i}, keeping only the state **X**(*t*_{i}) and discarding the rest of the trajectory. This resembles the experimental conditions often encountered in biology where, in order to make an observation, the sample has to be ‘sacrificed’. Finally, for each time point, we also generate 10 independent observations from different trajectories. Initial conditions **X**(*t*_{0}) are simulated from a Poisson distribution with means *b*_{3}/*γ*_{R} and *b*_{3}*k*_{P}/(*γ*_{R}*γ*_{P}) for the mRNA and protein molecules, respectively. The system-size parameter *Ω* is considered to be unknown and for this experiment is set to 1, such that concentrations are equal to the numbers of molecules. Figure 4*a*,*b* shows data simulated from the single-gene expression model as well as the LNA prediction. Simulated data for the auto-regulatory model are presented in figure 4*c*,*d*.

We use the simulated data to infer the unknown parameters **θ** = (*γ*_{R}, *k*_{P}, *γ*_{P}, *b*_{0}, *b*_{1}, *b*_{2}, *b*_{5})^{T} by sampling from the LNA posterior using MCMC, where *r* indexes independent samples for the same time point and *R*=10.

Table 3 summarizes the results from the MCMC chains for the two models of gene expression. First, we see that, despite the relatively small number of molecules in both systems, the LNA provides very accurate estimates of the true parameters. Moreover, the mixing of the Metropolis–Hastings sampler is very poor for both models, while the RMHMC algorithm and the SMMALA perform very well. This can be explained by the strong correlations between parameters in the posterior distribution, which prevent the Metropolis–Hastings sampler from making sufficiently large proposals. For example, the parameters *k*_{P} and *γ*_{P} control mRNA translation and protein degradation, respectively. The concentration of protein molecules is directly affected by both rates, so they are expected to be heavily correlated. In figure 5*a*,*b*, we show the joint marginal posteriors for the parameter pairs (*k*_{P}, *γ*_{P}) and (*γ*_{R}, *b*_{3}) for the single-gene expression model, which exhibit very strong positive correlation. Finally, figure 6 compares the trace plots obtained from Metropolis–Hastings, SMMALA and RMHMC for parameters *γ*_{P} and *k*_{P} of the auto-regulatory gene expression model.

## 7. Conclusions and future work

Bayesian inference for MJPs is a challenging problem with many important practical applications. Previous research [6] has shown that, although exact inference is possible, the computational cost and the auto-correlation of the Markov chains limit its applicability to small problems. The main difficulty stems from the requirement to simulate the trajectory of the MJP between discrete observations. Golightly & Wilkinson [13] have shown that, by considering a diffusion approximation, the simulation can be performed much more efficiently. In this paper, we considered the LNA, which only requires a system of ordinary differential equations to be simulated, while the stochastic fluctuations have an exact analytic solution. The LNA is valid only when the system is sufficiently close to its thermodynamic limit, a condition that is also required for the diffusion approximation. Previous research on the LNA [15] has focused on the Metropolis–Hastings sampler. We have demonstrated here that, when the posterior distribution exhibits strong correlation between parameters, the Metropolis–Hastings sampler produces strongly auto-correlated chains. Such correlations are very common for chemical reaction and gene regulatory systems. The Riemann manifold MCMC algorithms we considered in this work exploit the geometric structure of the target posterior in order to design efficient proposal mechanisms. In particular, the SMMALA is a conceptually simple algorithm that provides a good trade-off between computational cost and sample auto-correlation.

The problems considered in this work are relatively small, but certainly non-trivial, and we believe that the proposed methodology is applicable to larger and more complex systems. The systems we studied in this paper all have a linear dependence on the unknown parameters, and we have not observed any local modes in our simulations. The analysis of systems without these properties is the subject of ongoing work. Moreover, in real applications it is not possible to observe the populations of all species, and there is an additional measurement error term. Extending the LNA to handle such cases is straightforward (e.g. [15]); however, the effect of partial observations and measurement error on MCMC inference needs to be studied in more detail.

## Acknowledgements

We would like to thank the guest editor and the anonymous reviewers for their useful comments that have helped us to improve the presentation of the paper. This work was funded by the FP7 EC project: ASSET—Analysing and Striking the Sensitivities of Embryonal Tumours, reference no. 259348. M.A.G. is funded by Engineering and Physical Sciences Research Council (EPSRC) Advanced Research Fellowship EP/E052029/2.

## Footnotes

One contribution of 17 to a Discussion Meeting Issue ‘Signal processing and inference for the physical sciences’.

- © 2012 The Author(s) Published by the Royal Society. All rights reserved.