Mechanistic models based on systems of nonlinear differential equations can help provide a quantitative understanding of complex physical or biological phenomena. The use of such models to describe nonlinear interactions in molecular biology has a long history; however, it is only recently that advances in computing have allowed these models to be set within a statistical framework, further increasing their usefulness and binding modelling and experimental approaches more tightly together. A probabilistic approach to modelling allows us to quantify uncertainty in both the model parameters and the model predictions, as well as in the model hypotheses themselves. In this paper, the Bayesian approach to statistical inference is adopted and we examine the significant challenges that arise when performing inference over nonlinear ordinary differential equation models describing cell signalling pathways and enzymatic circadian control; in particular, we address the difficulties arising owing to strong nonlinear correlation structures, high dimensionality and non-identifiability of parameters. We demonstrate how recently introduced differential geometric Markov chain Monte Carlo methodology alleviates many of these issues by making proposals based on local sensitivity information, which ultimately allows us to perform effective statistical analysis. Along the way, we highlight the deep link between the sensitivity analysis of such dynamic system models and the underlying Riemannian geometry of the induced posterior probability distributions.
The use of mathematical modelling has long played an important role in describing and predicting the behaviour of natural processes [1–3]. It is only more recently, however, that the use of such models within a statistical framework has allowed modelling and experimental approaches to be more tightly bound together forming an iterative, more symbiotic relationship [4,5]. Mathematical models are abstract representations of reality that are useful for making testable predictions with varying levels of detail. Deterministic nonlinear ordinary differential equations (ODEs), for example, may be most appropriate when describing the average concentration of a protein within a population of cells, in which the stochastic effects of individual molecules do not greatly affect the overall dynamics; this is estimated to be for populations above the range 102–103 molecules per chemical species . For smaller numbers of molecules, the use of stochastic models may be more appropriate . Similarly, when modelling more global physiology, at a tissue or organ level, for example, it is no longer feasible to describe the dynamics with such molecular detail and a more course-grained approach or multi-scale techniques are often necessary, e.g. [8,9]. The current availability of cellular transcript and proteomic assay techniques , which measure average concentrations in cell populations leads us to focus our attention in this paper on continuous deterministic ODE models, which are applicable not just in biology, but also in a large number of other areas of science and engineering, e.g. .
The biochemical mechanisms by which even relatively well-known enzymatic networks operate are not always clear . There are often multiple plausible network topologies that are consistent with the known underlying biology, and in this context, the idea of modularity within biology is important as it provides a natural means of iteratively building up a model description alongside successive biological experiments, with each half of the process informing the other. Model hypotheses can thus be constructed and compared with one another by incorporating new components into an existing model and then assessing the model based on the data . Recent advances in understanding the molecular origin of circadian rhythms in Arabidopsis thaliana [12,13] offer an excellent example of this type of approach, whereby differential equation models have been able to offer predictions that have subsequently been tested experimentally .
Mechanistic modelling is a first step towards characterizing the aetiology of many different types of diseases that are thought to be the result of disruptions to the normal functioning of signalling pathways and biochemical networks. Obtaining a detailed mechanistic understanding of such enzymatic control processes could have major implications in the study and potential treatment of disease at a molecular level. The circadian clock, for example, appears to play a central role in the physiology of plants and mammals, controlling many important cellular functions and influencing pathways implicated with a variety of diseases ; indeed the timing of drug administration relative to circadian rhythms appears to strongly affect the efficacy of certain anti-cancer agents . Signalling pathways are also strongly implicated in the origin of many cancers  and understanding the intricate networks of nonlinear interactions is key to being able to predict the impact of biochemical changes within the system, whether natural or artificially induced with the use of drugs .
Current investigations into biochemical networks are characterized by complex nonlinear dynamics, as well as by measurement and model uncertainty. When studying the possible structure of such systems, working hypotheses can be encoded as statistical models based on systems of nonlinear differential equations that capture all assumptions regarding the likely mechanisms of interaction. As models increase in size and complexity, so does the need for sophisticated statistical methodology that can consistently evaluate and update the evidence in favour of each model, as new data become available (e.g. [4,19]). Given the limited and variable experimental data that are often available, a probabilistic approach based on Bayesian statistics offers a natural way of dealing with such parameter and model uncertainty. Rather than finding an optimal working set of parameters, as is the case in the frequentist setting , the Bayesian paradigm advocates averaging over all possible parameter sets with respect to their individual probabilities. This marginalization procedure automatically provides a compromise between fitting the data and penalizing model complexity, such that we may find the simplest model that is, however, still complex enough to accurately describe the observed dynamic behaviour.
Initial proof of concept investigations has shown that a Bayesian approach to model ranking can be very successful [4,11]; however, it is acknowledged that performing Bayesian inference over ODE models is extremely challenging . The procedure is equivalent to evaluating integrals involving a highly nonlinear function over a high-dimensional space. In more than three or four dimensions deterministic approaches are no longer feasible and we must resort to stochastic integration using simulation-based Monte Carlo techniques. A common approach is to construct a Markov process that converges to the target posterior distribution ; however, this is not straightforward in practice owing to the following three main issues.
— Global convergence. Moving from a random starting position in parameter space to the true stationary distribution may be difficult, owing to the possibility of multiple modes. We require sampling methods that can explore all modes globally with the correct frequency without becoming stuck in local maxima of negligible probability mass.
— Local mixing. Once the Markov chain reaches a region of high probability mass, we want it to fully explore and ideally obtain near-uncorrelated samples. This is challenging for many types of differential equation-based models, particularly those that are high dimensional and whose nonlinear dynamics induce very strong nonlinear correlation structures in the posterior distribution.
— Computational cost. The likelihood of these models can be expensive to evaluate, since it involves approximately solving the system of ODEs with a numerical integration scheme for each set of proposed parameters. Although generally unavoidable, this cost can be minimized through efficient exploration of the parameter space, which can be measured in terms of effective sample size (ESS), normalized by the overall computational time .
The issue of identifiability is very important and has a direct impact on our ability to perform efficient frequentist or Bayesian statistical analysis. It has been noted many times that ODE models of biochemical networks generally exhibit widely varying parameter sensitivities [24–28]; investigation of second-order sensitivities of these models, evaluated at the maximum likelihood, often reveals a wide eigenvalue spectrum that itself may change depending on the point in parameter space at which it is calculated. In settings with such varying parameter scalings, standard Markov chain Monte Carlo (MCMC) samplers generally have very poor mixing properties and produce highly correlated samples , resulting in estimates of the required Bayesian quantities with large Monte Carlo errors. This is often a result of structural unidentifiability of the model , such that parameters cannot be estimated with low variance. For example, if the output of a model depends strongly only on the ratio of two parameters θ1/θ2, then the conditional distributions p(θ1|θ2) and p(θ2|θ1) may be well constrained, but their marginal distributions p(θ1) and p(θ2) might not be.
Recently developed differential geometric MCMC methods, on the other hand, seem particularly well suited to this type of application, since they exploit the natural representation of the model parameter space as a Riemannian manifold , which is induced using the expected Fisher information as a metric tensor . The Fisher information, therefore, defines a local distance measure and effectively allows the MCMC proposals to be based on the curvature of the manifold, which is directly defined by the parameter sensitivities of the underlying model describing the dynamical system, see §10 in Girolami & Calderhead .
We examine the application of manifold MCMC methodology for performing Bayesian inference over ODE models of biologically relevant size with complex dynamics and partially unidentifiable structures. We perform posterior inference on biologically realistic examples of enzymatic control processes that have recently been studied in the systems biology literature. We consider a model describing the main circadian clock components in A. thaliana , building up inference over different sets of model parameters to gain insight into the challenges of a statistical analysis, and also a model describing a cell signalling pathway . We perform inference using datasets generated from each model and infer the posterior distributions describing the probable parameter values.
2. Methods: models and inference
2.1. Differential equation models
We consider statistical models based on systems of ODEs of the form , which defines a relationship between the time derivatives of the states and, often nonlinear, functions of the states and parameters . In all but the very simplest cases, such systems are analytically intractable and must be solved numerically given suitable initial state conditions x0 at t = 0. If we wish to employ sampling methods based on differential geometric information, we must be able to calculate the first, and in some methods second-order sensitivities of the model outputs with respect to the input parameters. One method of obtaining these required sensitivities at all time points is to approximate them using finite differences, however this method may not be very accurate, particularly for higher order sensitivities. A much more accurate approach for ODE models is to calculate them via a forward simulation of auxiliary equations obtained by differentiating the original system with respect to each parameter.1 The general form of the first-order sensitivities therefore follows as 2.1where we have taken the total derivative of f(x(t),θ, x0) with respect to each θi, since the states x also depend implicitly on the parameter values. The sensitivity of xn at time t with respect to a change in the parameter θi is therefore given by St,ni. This gives us a set of linear differential equations to augment the original system.
ODE models for most biological systems can be constructed from a set of common components describing various mechanisms of molecular interaction, such as Michaelis–Menten type enzyme-mediated interactions , and their proposed structure can be informed from current knowledge of the underlying biology. The systems we consider illustrate the typical challenges associated with performing statistical inference over realistic biochemical models.
2.1.1. Circadian control model
Arabidopsis thaliana is widely accepted in plant biology as a model plant worthy of close and careful study . Although its genetic networks are much simpler than higher organisms, it is still complex enough to offer much insight into a large variety of biologically interesting mechanisms of biochemical interaction, such as negative feedback loops that induce nonlinear oscillatory behaviour. Overviews of the recent advances in understanding this reasonably complex biological system are given in McClung  and Hubbard et al. .
We employ a model based on the core circadian network in A. thaliana comprising transcription factors LHY and CCA, and the pseudo-response regulator TOC1. As a simplification, both LHY and CCA are modelled as one component, since they have qualitatively similar behaviour. Michaelis–Menten kinetics are used to describe enzyme-driven protein degradation, with Hill functions describing transcriptional activation of mRNA for LHY/CCA and TOC1. The model is from Locke et al.  and is the minimal description of the network describing circadian behaviour of Arabidopsis, which we simulate in constant darkness. It consists of six nonlinear differential equations and a total of 24 parameters. The concentrations of the species are represented by [TOC1] and [LHY] with subscripts m, c and n denoting mRNA, protein within the cytoplasm and protein within the nucleus, respectively. The Hill coefficients a and b are set to 1 and 2 respectively, based on evidence from the literature . There are therefore 22 free parameters to be inferred from the experimental data, where (ni, gi) are transcription rates, (mi, ki) are degradation rates, (pi) is translation rate and (ri) is rate-defining transport between the nucleus and cytoplasm. The structure of this biochemical network is represented in figure 1 and the equations are given in appendix A.1.
2.1.2. A cell signalling model
The modelling of cell surface receptors is an important means of furthering our understanding of intra-cellular signalling. Such mechanisms allow extra-cellular cues to drive activation and dynamic behaviour within each cell . In particular, changes in the environment are often encoded at a molecular level in terms of changes in the concentration of ligands outside the cell; such ligands may then bind to the surface of cells inducing internal changes brought about by phosphorylation mechanisms. A recent example of such a model is given in Becker et al. , which describes the nonlinear dynamic behaviour of the erythropoietin (Epo) receptor in response to changes in the ligand Epo concentrations outside the cell. Encoded signals sent in this manner have significant biological knock-on effects, invoking cellular responses that activate a number of signalling networks further downstream. The model we employ consists of six nonlinear differential equations, based on mass-action kinetics, with eight parameters, and details of the equations and parameter values are given in appendix A.2. We reparametrize the model such that we have parameters measured in log10 space, allowing us to more easily consider a wide range of possible parameter values that vary by orders of magnitude. This model is complicated further by the addition of an observation model, since the biochemical species in this network are not all observable individually; in particular, only the total concentration of [Epo] and [dEpo_e] is biologically observable, along with [Epo_EpoR], such that our observations take the form y1 = [Epo] + [dEpo_e] and y2 = [Epo_EpoR].
2.2. Parameter identifiability
A model parameter can be considered only weakly identifiable if changes in its value have very little effect on the output of the model. There are two main causes of unidentifiability ; the first comes from measurement uncertainty in the data, and the second is a result of the model structure, in particular the mathematical relationship between the parameters and the states. Structural identifiability issues can occur, for example, when the parameters of a statistical model appear as a ratio; if the output of the model depends only on this ratio, then the numerator and denominator may effectively take on an infinite number of values. This effect has been termed parameter evaporation  since these unconstrained parameters can tend to infinity. This poses a particular problem for methods that make use of second-order geometric information, as the unconstrained parameters can result in an ill-conditioned Hessian or Fisher information matrix , and so the point-wise variance estimates of such parameters can be very poor.
In the context of the circadian biochemical models above, the commonly used Michaelis–Menten terms are of particular interest . This is a model to describe an irreversible enzymatic reaction and relates the rate of the reaction to the concentration of a substrate, [S]. Let us consider the Michaelis–Menten equation 2.2where r is the current reaction rate rmax is the maximum reaction rate and K is the Michaelis–Menten rate, which is inversely related to the enzyme's affinity for the substrate S. Firstly, let us note that 2.3
In this case, when the rate constant K is much less than the concentration level of S, the overall reaction rate is determined by rmax. K is, therefore, unidentifiable as it has almost no effect on the model output. Similarly we see that 2.4and so it is the ratio of rmax to K that is important in determining model behaviour. In this case, both rmax and K are potentially unidentifiable, depending on the values of [S]. This unidentifiability results from our decision to employ Michaelis–Menten kinetics to describe our system; the fact that parameter values could blow up to infinity or evaporate to zero may not have any bearing on the biological reality, but are more likely to be artefacts of our model approximation of the underlying system. Within a Bayesian setting, the use of priors may be used to enforce weak identifiability and improve numerical conditioning .
2.3. The choice of priors
Priors are used in Bayesian statistics as a means of incorporating existing knowledge regarding likely parameter values. In this setting, such knowledge commonly arises from the consideration of the underlying biology and experimental observations, however it may also arise by considering the model itself. In previous work examining oscillatory behaviour of biochemical networks , bifurcation analysis of a relatively simple delay differential equation model was employed to find the regions of parameter space that permitted oscillatory output, and this information then informed the choice of priors. This approach, however, is obviously dependent on the model being amenable to such analysis; bifurcation analysis on much larger, more complex models quickly becomes unfeasible. The idea of informing the range of admissible parameter values based on a mathematical analysis of the model has also been suggested in Transtrum et al. , in particular for the case of examining Michaelis–Menten kinetics, where the aim was to avoid numerical issues and parameter evaporation. Using such an approach for the circadian rhythm models considered in this paper, we can enforce weak identifiability without restricting the range of behaviour our model can produce. In this case, we want K to be neither too small nor too big, but rather roughly the same order of magnitude as we would expect [S] to be.
In the clock model, we employ gamma priors, with shape parameter k = 1.5 and scale parameter θ = 2, for the model parameters. These priors have positive support, reflecting the fact that parameters correspond to physical rate constants, and they discourage parameter values approaching zero, since we assume that all terms in each model play a role in the overall biochemical process. Such priors are also suitable for preventing the rate parameters getting too large, since we assume that all chemical processes are slow enough to be observed. The prior thus serves to constrain the parameter values, and this has the added numerical benefit of regularizing the Fisher information, when it might otherwise be near singular with a very high condition number ; the overall metric tensor is therefore formed by taking the expectation of the posterior distribution, as discussed in §4.2 in Girolami & Calderhead . We employ gamma priors, with shape parameter k = 1.5 and scale parameter θ = 10, for the initial conditions of the clock model; these were chosen to allow the initial values to be roughly the same order of magnitude as the expected concentration levels of the chemical species, as measured experimentally , and they cover the support of the likely initial conditions for each observed species.
In the signalling model, we employ Gaussian priors with a mean of −2 and a standard deviation of 2; since we reparametrize the model in log10 space, we no longer require our prior to have only positive support, and such priors cover parameter values of many orders of magnitude.
2.4. Differential geometric Markov chain Monte Carlo for ordinary differential equation models
Sensitivities of parameters can change depending on the dynamical behaviour of the model, and thus can vary dramatically in different parts of parameter space. This has implications for the successful application of MCMC methods, since the transition kernel must therefore be tuned in different parts of the space, particularly during the transient phase in which the chain may be traversing large regions of space to reach the stationary distribution (see figure 9). In this situation, standard Metropolis–Hastings sampling methods can fail or perform poorly at best. Adaptive MCMC methods  also address this issue by attempting to estimate the correlation structure of the target distribution during sampling; however, this is not an easy task particularly as the dimensionality increases. In this section, we introduce MCMC approaches that calculate this local structure directly via the Fisher information.
Strong correlations and widely differing variances make it challenging to propose large moves that will have a high probability of acceptance. In addition, the expense of calculating the likelihood in this setting underlines the importance of obtaining effective proposal mechanisms. Some parameter values may be perturbed by large amounts and yet still have very little effect on the model output; in contrast, other parameters may be extremely sensitive in terms of their effect on model output and must therefore be perturbed by very small amounts. To complicate matters, sometimes the model output may be sensitive to correlated changes in subsets of parameters. However, recent advances in MCMC methodology provide a systematic means of overcoming these difficulties by the design of proposal mechanisms that make use of the local sensitivity information of the system .
A statistical model can be thought of as a family of parametrized probability distributions, as defined through the likelihood function, which gives the probability of a particular dataset given a set of parameter values. Work in the 1940s by Jeffreys  and Rao  sought to define distance between probability distributions. The elegant solution recast probability theory in the language of differential geometry—in particular they found that a natural measure of distance may be defined between probability distributions using the expected Fisher information as a local metric, and the task is equivalent to calculating shortest paths over a Riemannian manifold , where the expected second-order sensitivities at a set of parameter values define a local distance measure. We can consider two sets of perturbed parameter values to be similar if they result in similar posterior probabilities.
We begin with an analogy. Let us consider measuring distance on a map of a mountainous region. On the flat two-dimensional map, one can define a distance purely in terms of its Euclidean coordinate system; in real life, however, the actual measure of distance must also take into account changes in the height of the terrain. The greater the variability in the height of the terrain, the less accurate the two-dimensional approximation. Similarly, in our statistical model, any measure of distance based purely on the parameter values is likely to be very misleading; a much more accurate measure takes into account rates of change in the probability of the statistical model, and it is exactly this information that is given in a local sensitivity analysis via the expected Fisher information.
From this geometrical point of view, we see why standard MCMC approaches may fare badly, since they generally employ a Euclidean measure of distance when proposing moves in parameter space. Yet, a small change in each parameter value may in actual fact have widely varying effects on the output, and hence probability, of the statistical model. The Fisher information provides a local measure of the expected second-order sensitivities of a system evaluated at a particular set of parameter values. An eigenvalue decomposition of the Fisher information can elucidate the parameter dependencies within the model; most biochemical networks have a relatively small number of parameter combinations that strongly affect the model output, with the remaining combinations having little effect [20,24,25,41].
The differential geometric MCMC methods developed in Girolami & Calderhead  effectively make use of local sensitivity information to propose superior moves in parameter space. In this work, we embed these sampling schemes within a population MCMC framework to help escape local maxima and fully explore the parameter space [21,42,43].
We assume normally distributed additive errors in the experimental data, in which case the expected Fisher information takes the form of the weighted inner product of the sensitivities at each observed timepoint such that , where G(θ)ij denotes the (i,j)th entry in the Fisher information matrix obtained at the parameter values θ. The vector S·,nj has elements that correspond to the sensitivity of the nth species with respect to the jth parameter at each of the sampled time points. The error covariance for the nth species at all sampled time points is denoted as Σn. The metric tensor of the Riemannian manifold induced by the statistical model is then G(θ) + Φ(θ), where Φ(θ) is the negative Hessian of the log-prior , and we employ this in the sampling schemes throughout this paper.
Geodesics are the equivalent of straight lines on a Riemannian manifold, and are formally paths along which the tangent vector is constant. In the case of the Riemannian connection used in the following MCMC algorithms, they also locally define the shortest path between two points on the manifold, allowing them to generate samples very efficiently . The manifold MCMC methods we employ are generalizations of the Langevin and Hamiltonian Monte Carlo methods [44,45]. The first method, the manifold Metropolis-adjusted Langevin algorithm (mMALA) makes proposals by simulating a diffusion process across the manifold. The proposal is a Gaussian distribution, whose mean is given by following the curvature of the space along the direction of steepest natural gradient, and whose covariance is given by a scaled inverse Fisher information matrix. A simplified version of mMALA may also be used by assuming a locally constant metric at each point when making proposals, whose mean then involves a cheaper to compute approximation of the path defined in mMALA. The second method, Riemannian manifold Hamiltonian Monte Carlo (RMHMC) proposes moves by randomly following a geodesic from the current point on the manifold. This is more expensive to compute, as the geodesic paths must be obtained by symplectically integrating a non-separable Hamiltonian system. The reader is directed to Girolami & Calderhead  for a full exposition, and accompanying Matlab code, of the sampling methods employed.
One practical challenge of using the Fisher information matrix is that it can become ill-conditioned for parameters that are highly insensitive . The use of a prior helps mediate this problem somewhat and does make the overall metric tensor G(θ) + Φ(θ) better conditioned, however we will see that this ill-conditioning still negatively impacts on sampling efficiency. The parameters in these models of biochemical networks can be naturally partitioned according to their functional role, and in the case of the Michaelis–Menten parameters, it is possible to predict where the issue of unidentifiability may arise. We can use these natural parameter groupings to design a blocked MCMC algorithm , such that each set of parameters is sampled separately using a manifold MCMC method conditioned on all other parameters; we iterate through exploring the submanifolds induced by our chosen groups of parameters. This blocking strategy is designed to capture a large part of the correlation structure in the system, while improving the numerical stability by using smaller Fisher information matrices. We shall examine the effect of this strategy in §4.
Synthetic datasets are very useful for evaluating the performance of new methodology, and with complex ODE models, it is convenient to have a set of parameter values that generate the data as a benchmark. We investigate the models using a variety of noise levels, which we detail along with the results in §3; ultimately we wish to test our methodology using data with biologically realistic numbers of observations and levels of noise.
For the circadian model, we generate data by sampling 10 data points for each species evenly over a time period of 48 h, using parameter values given in the literature . Using a Gaussian error model, we add zero mean noise to the data points generated for species LHY[m,c,n] and TOC1[m,c,n], with standard deviation proportional to the amplitude of the concentration of each species. The circadian model we investigate has the difficulty that the Fisher information over all parameters is numerically ill-conditioned, as often happens when parameters are weakly identifiable . We therefore employ blocking strategies that are informed by analytical consideration of the model and partition the parameters into subgroups, each containing potentially strong correlation structure. Instead of using one large ill-conditioned Fisher information matrix to inform our sampling proposals, we employ multiple smaller, but better-conditioned, Fisher information matrices.
We embed the sampler within a population framework to help escape local maxima and fully explore the parameter space, noting that the samples obtained could be further used for calculating Bayes factors via thermodynamics integration [21,42,43]. Within this framework, we explore multiple tempered distributions simultaneously, each defined by a power posterior . These form a smooth family of distributions between the prior and posterior, and exchange moves between these distributions allow faster convergence to the global mode of interest. We employ the same temperature schedule as detailed in Calderhead & Girolami .
A burn-in phase of between 10 000 and 20 000 samples was found to be sufficiently long for the Markov chains to converge to the stationary distributions defined by each of the power posteriors used. During this time, the step sizes of the parameters were adjusted every 100 iterations to achieve an acceptance ratio of between 20 and 50 per cent for standard Metropolis and between 30 and 70 per cent for manifold sampling using simplified mMALA. After the burn-in period, step sizes were fixed to ensure samples were drawn from the stationary distribution.
The cell signalling model is made identifiable by fixing two parameters, the values of which may be obtained experimentally  (see appendix A.2). The initial condition of EpoR may also be found in the experiment of Becker et al. . The initial condition of Epo is assumed to be known, and the other four initial conditions are set to zero. The initial conditions of the observed species Epo + dEpo_e and Epo_EpoR are, therefore, completely defined. The Fisher information for the remaining six parameters is well conditioned and we may therefore infer these together, without using a blocking approach. For this model, a burn-in phase of 5000 samples was found to be sufficiently long for the Markov chains to converge.
We obtained the auxiliary sensitivity equations for our ODE models using the Symbolic Math Toolbox in Matlab. This automated process is extremely fast and helps prevent human mathematical errors. We solve them by making use of the SBToolbox2  for Matlab, which provides an interface to a C implementation of the Sundials solver  and is up to two orders of magnitude faster than using the built-in solvers in Matlab.
3.1. Circadian model
We first consider three subsets of the parameters individually; the linear parameters (p1:2, r1:4), the transcription parameters (n1:2, g1:2) and the Michaelis–Menten parameters (m1:6, k1:6). This allows us to see the potential challenges of inferring parameters in these three different types of functions. We perform inference over each set of parameters with the other sets of parameters fixed at their true values.
3.1.1. Linear parameters
For now, we run a single chain initialized on the true parameter values for the purpose of evaluating the sampling properties at the mode, and we fix the initial conditions.
We first compare posterior sampling of the linear parameters using Metropolis, simplified mMALA, MALA and RMHMC, and a comparison of the sample traces is given in figure 2. We employ a component-wise Metropolis sampler, whereby each parameter is updated sequentially conditioned on all other parameters using a Gaussian proposal, and we note that the component-wise Metropolis–Hastings sampler, therefore, requires the ODE model to be solved once for each parameter. The proposal variances are automatically tuned to achieve acceptance rates of between 20 and 50 per cent, and then fixed before drawing posterior samples. This standard sampler performs very poorly and is unable to sample efficiently from the strongly correlated posterior distribution. The manifold methods, however, perform far better, making use of the local geometry information to propose better moves through parameter space. RMHMC makes proposals that follow geodesic paths across the manifold; however, in our implementation, it is computationally much more expensive,2 requiring an average of five evaluations of the second-order sensitivity equations and an average of 25 evaluations of the first-order sensitivity equations. mMALA makes moves based on a diffusion process across the manifold, and also performs much better than Metropolis–Hastings, although again this is computationally expensive because of the need to calculate second-order sensitivities of the differential equation model; it requires two evaluations of the second-order sensitivity equations per iteration. Simplified mMALA makes valid MCMC proposals based on an approximate diffusion across the manifold. It gives similar results to mMALA and is computationally much more efficient as it only requires two evaluations of the first-order sensitivities per iteration. In practice, we find that solving the extended first-order ODE system twice for simplified mMALA is only three to four times slower than solving the original system for each parameter in a component-wise Metropolis-Hastings sampler, yet it produces samples that are visibly far better (figure 2). We refer the interested reader to Girolami & Calderhead  for a more detailed comparison of the time-normalized efficiency of manifold sampling methods, noting that the extended first-order system consists only of additional linear equations that may be solved very efficiently . We therefore employ simplified mMALA as our sampler of choice for all subsequent experiments in this paper.
Figure 3 shows scatter plots of the samples obtained from simplified mMALA for each parameter combination. Even with these linear parameters, there are some very strong correlations and severe scaling differences, illustrating the difficulty of sampling from statistical models of this type. As an example, let us consider parameters r1 and r2, which define transport between the nucleus and cytoplasm for LHY/CCA. There is a very strong correlation structure between these parameters; as the concentration of the protein in the nucleus increases, the concentration of protein in the cytoplasm decreases and vice versa. There is a similar strong correlation structure between parameters r3 and r4 describing nuclear/cytoplasmic transport for TOC1. Although the other parameters do not exhibit such strong dependencies, their differing scales are evident, which further compounds the difficulties.
We also compared the effect that changing the noise level has on parameter inference, and in particular on the practical identifiability of the model. Even with very low levels of noise, we observed wide variability in the parameter estimates; however, the model was still able to make predictions with high certainty, as has been observed in biological models examined in previous work ; dealing with large uncertainty in parameter values is an unavoidable part of the modelling process. For many models, the experimental accuracy required to obtain tight bounds on individual parameters is simply unachievable, and indeed often undesirable if we are mainly interested in the model predictions. The careful use of priors can therefore help to constrain parameter values to biologically meaningful ranges.
3.1.2. Transcription parameters
We now consider inference over the four transcription parameters with Hill coefficients, and fix all other parameters at their true values. Again, there are strong correlations between these parameters, particularly between parameters appearing together within the same algebraic term. This correlation structure is shown in figure 4 and was seen to remain even with increasing levels of error variance in the data.
3.1.3. Michaelis–Menten parameters
Fixing all parameters apart from those appearing in the Michaelis–Menten terms, we investigate the effect of a blocking strategy on sampling. Informed by the analysis of the Michaelis–Menten terms in §2.2, we may group the 12 parameters into pairs; each pair appears together in a Michaelis–Menten term and in certain cases the system may respond to changes in the ratios of these pairs. Figure 5 shows the traces of Michaelis–Menten parameter samples obtained with and without the use of blocking. Without blocking, the Fisher information becomes badly conditioned (condition number greater than 108) and the resulting numerical inaccuracy results in very small step sizes and highly correlated samples being drawn. The use of blocking parameters is seen to mitigate this problem.
Scatter plots of the parameter pairs (mi, ki) are also shown in figure 6. Interestingly, we see that the pair of Michaelis–Menten parameters with the strongest correlation (m1, k1) is indeed for a species with low concentration, such that k1 ≫ [LHY]m. We see that this type of severe correlation structure was premeditated in §1, where we considered the case of the ratio of m1 to k1 driving the dynamics. Clearly, when performing inference using MCMC, we wish to make moves in parameter space that take into account these strong correlations. There was seen to be less correlation between (mi, ki) and (mj, kj), when i ≠j, which we might also expect from the structure of the equations since each Michaelis–Menten term contains only the chemical species of the equation in which it appear.
3.1.4. Inference with full and partial observations
Until now we have employed a single chain initialized at the correct parameter values to examine the local mixing properties of manifold MCMC methodology. We now consider the effect of using random starting values for our parameters. Performing inference on just the linear parameters, keeping all others fixed, we find that there appears to be a single mode, which our Markov chain can reach regardless of the starting parameters. However, if we infer both the linear parameters and the transcription parameters, there is another mode to which our Markov chain may converge. This alternative set of parameters gives a non-oscillatory output that roughly corresponds to the average of the true model output, as shown in figure 7. Systems with complex nonlinear dynamics are particularly susceptible to this problem, since local maxima may occur when the model moves in and out of phase with different parts of the data. Any MCMC methodology used for this problem must therefore not only have good local mixing properties, but it must also be capable of making more global steps, allowing it to escape from local modes of negligible probability mass. Using simplified mMALA to explore a population of tempered distributions  allows us to resolve this issue.
We infer all sets of model parameters together using a blocking strategy and population scheme with 20 tempered distributions, as in Calderhead & Girolami . Figure 8 shows the differences in the predictive model outputs for differing numbers of observed species. This is important as it is usually not possible to obtain measurements for all components in a biological system, owing to either financial or technical constraints.
3.2. Cell signalling model
We infer parameters over the cell signalling model described in §2.1.2. in which the observations are a linear combination of the underlying modelled species, demonstrating that manifold sampling methodology extends straightforwardly to incorporate observation models. It also demonstrates how this methodology may be used to infer parameters of models based on commonly used mass action kinetics. The model is parametrized in log10 space allowing exploration over several orders of magnitude. Despite having only six parameters, this cell signalling model still exhibits reasonably complex dynamics and sensitivities that change markedly throughout the parameter space. This is shown in figure 9, where we observe the path taken by a Markov chain during the burn-in period guided by the sensitivities of the model parameters, which dramatically change as the chain explores new regions of the space. Despite the very wide priors covering orders of magnitude, the population-based simplified mMALA sampler is still able to converge to the correct mode. In addition, we find that such a model based on mass action kinetics also induces a highly correlated, asymmetric posterior distribution that is challenging to sample from, as shown in figure 10, particularly as the parameter values are of differing orders of magnitude. Finally, model predictions are made with very low uncertainty (figure 11); however, it has been noted  that a careful initial analysis of the model is necessary to achieve this structural and practical identifiability.
4. Discussion and conclusions
Statistical analysis of mechanistic models describing biological systems plays a vital role in furthering our understanding of the underlying mechanisms driving observed behaviour [4,19,30,48]. Both frequentist and Bayesian approaches to statistical inference face the challenge of searching through high-dimensional parameter space; the former relying on optimization, the latter on MCMC sampling.
In this paper, we have focused on the Bayesian approach and find standard sampling methods are inadequate for this task. We require our sampler to take into account the locally changing correlation structure in order to propose moves that are accepted with high probability and adequately and efficiently explore the space. One approach is to employ adaptive MCMC methods  that estimate the covariance using previous samples from the Markov chain; however, the rate of diminishing adaptation must be carefully controlled to ensure convergence and it may also take a while to obtain reasonable estimates of what is essentially the global covariance structure. In contrast, we obtain the local covariance structure directly using the induced Riemannian geometry of the model. Manifold MCMC methods  make moves in parameter space based directly on the sensitivities of the underlying differential equations with respect to each parameter, resulting in efficient convergence to the correct posterior distribution.
A further inferential challenge stems from the fact that many complex ODE models may induce multiple modes, which can occur, for example, when a non-oscillatory steady-state solution is found that may be fitted to the average values of the observed data as we have illustrated in figure 7. Population MCMC approaches offer an effective solution to this problem and further allow for accurate estimation of marginal likelihoods .
The RMHMC sampler clearly offers the most effective scheme in terms of generating nearly independent posterior samples; however, solving the equations to calculate the geodesic paths on the induced manifold can be computationally expensive. Ideas for mitigating this issue include the adoption of adjoint differentiation methods for efficiently calculating sensitivities of large numbers of parameters, the use of appropriate guiding Hamiltonians in the proposal mechanism  and implementation in computationally more efficient programming languages. Further work will bring these costs down and make RMHMC available to a larger class of problems. In the meantime, the simplified mMALA sampler offers the benefits of a geometric approach to MCMC, while remaining computationally feasible for large models.
Many realistic biological models have parameters that are unidentifiable and this issue has recently received a lot of attention . Such problems can be mitigated by considering a reduced model or, in a Bayesian setting, employing biological and mathematical analysis to inform the choice of priors and enforce weak identifiability . Although ODE models may be applied to model a wide range of important natural phenomena, other modelling formalisms, such as stochastic and partial differential equations [7,49], are also important in other settings. Recent work developing calculations of the Fisher information for SDEs  suggest that such systems may also be considered within a manifold framework.
The manifold samplers we employed use the local sensitivities of the model to construct an efficient Markov transition kernel, allowing us to sample from complex posterior densities, which in turn provide the desired global assessment of sensitivity over a range of dynamic responses of the system and not just at a single operating point. In this way, manifold samplers provide an efficient route from local to global sensitivity analysis.
We thank the four anonymous referees for constructive comments that have improved this paper. This work was funded by the following research grants: The Molecular Nose ‘EPSRC EP/E032745/1, Advanced Research Fellowship’ EPSRC EP/E052029/1, Inference-based Modelling in Population and Systems Biology ‘BBSRC BB/G006997/1, The Silicon Trypanasome’ BBSRC BB/1004599/1, Analysing and Striking the Sensitivities of Embryonal Tumours' EU FP7, Bayesian Inference in Systems Biology: Modelling Organ Specificity of Circadian Control in Plants' Microsoft Research, Research Fellowship—2020 Science—UCL, University of Oxford, Microsoft Research.
A. Differential equation models
A.1. Circadian control model
We used the following parameter values from the literature : p1 = 9.0002, p2 = 3.6414, r1 = 5.6429, r2 = 8.2453, r3 = 1.2789, r4 = 5.3527, n1 = 0.6187, n2 = 7.7768, g1 = 3.7051, g2 = 9.7142, k1 = 7.8618, m1 = 7.3892, k2 = 3.2829, m2 = 0.4716, k3 = 6.3907, m3 = 4.1307, k4 = 1.0631, m4 = 5.7775, k5 = 0.9271, m5 = 4.4555, k6 = 5.0376, m6 = 7.6121. The Hill coefficients were fixed at a = 1 and b = 2, and the initial conditions used for generating the synthetic data were cLm(0) = 0.1290, cLc(0) = 13.6937, cLn(0) = 9.1584, cTm(0) = 1.9919, cTc(0) = 5.9266, cTn(0) = 1.1007.
A.2. Signalling pathway model
The parameter values (in log10 scale) were determined from the literature to be: kon = −4.091, kex = −2.447, kt = −1.758, ke = −1.177, kdi = −2.730 and kde = −1.884. The remaining parameters were fixed to enforce identifiability: kD = 2.583, Bmax = 2.821. The initial conditions were fixed at: Epo(0) = 2098.9, EpoR(0) = 662.22, Epo_EpoR(0) = 0, Epo_EpoR_i(0) = 0, dEpo_i(0)=0 and dEpo_e(0) = 0.
One contribution of 9 to a Theme Issue ‘Inference in complex systems’.
↵2 This performance could be substantially improved by using adjoint differentiation methods and programming using a compiled language, rather than the interpreted language Matlab that we use here.
- Received May 30, 2011.
- Accepted August 1, 2011.
- This journal is © 2011 The Royal Society