Abstract
Estimating functions provide a very general framework for statistical inference, and are particularly useful when one is either unable or unwilling to specify a likelihood function. This paper aims to provide an accessible review of estimating function theory that has potential for application to the analysis and modelling of a wide range of complex systems. Assumptions are given in terms that can be checked relatively easily in practice, and some of the more technical derivations are relegated to an online supplement for clarity of exposition. The special case of the generalized method of moments is considered in some detail. The main points are illustrated by considering the problem of inference for a class of stochastic rainfall models based on point processes, with simulations used to demonstrate the performance of the methods.
1. Introduction
This paper deals with the generic situation in which one wishes to carry out parametric inference on the basis of a vector y of data, regarded as the realized value of a corresponding vector Y of random variables. By ‘parametric inference’, we mean that interest lies in learning about a finite vector θ = (θ_{1} … θ_{p})′ of p unknown quantities that are known to lie within some subset, Θ say, of ℝ^{p}. In a conventional statistical setting, the elements of θ would be parameters in a statistical model that specifies the joint distribution from which the data are considered to be generated: hence, we will refer to Θ as the parameter space. More generally, however, θ may contain just relevant descriptors (formally, functionals) of this joint distribution such as means, variances or specific probabilities of interest (such as the probability that an individual becomes infected during an epidemic). More generally still, one might wish to calibrate or ‘tune’ a model that is not explicitly probabilistic.
In settings involving a parametric statistical model that specifies the full joint distribution, f(·; θ) say, of Y, the likelihood function forms the basis of most modern statistical techniques for inference. Given data y, the likelihood function is defined as L(θ; y) = f(y; θ), regarded as a function of θ with y held fixed, and plays a central role in Bayesian inference as well as in maximum likelihood and its variants. The likelihood can be regarded as measuring the relative fidelity of the model to the data under different parameter values.
However, particularly when studying complex systems, one may be unable or unwilling to formulate a likelihood function. For example, in non-Gaussian settings, there is a relative scarcity of tractable multivariate distributions: if there are dependencies among the elements of Y, therefore, one may be unable to specify a joint density that is both plausible and tractable.
Another feature of complex systems is that the available data often do not correspond directly with what is being modelled. A common scenario is that the fundamental mechanisms of the process under consideration operate at fine temporal and spatial scales, but only coarse-scale data are available. For example, many stochastic epidemic models attempt to represent the process of disease transmission between individuals but outbreak data, such as those from daily or monthly hospital reports, are typically aggregated both spatially and temporally. Likelihood construction for such models is complicated without knowing the exact times at which individuals become infected (e.g. [1–3]).
Many stochastic models attempt to provide a simplified probabilistic representation of system structure and dynamics. Although potentially useful for obtaining insights into the system behaviour, a fully likelihood-based analysis of such simplified models may create problems by encouraging fidelity to features of the data that the model was never supposed to reproduce. An example of this type of problem is given in §2.
A final situation in which likelihood inference is infeasible is where the model is not probabilistic. This is the case, for example, with many deterministic models of system dynamics in application areas such as epidemic modelling [4]. Such models are, however, routinely calibrated by minimizing some measure of discrepancy between model outputs and observations and, unless they provide a perfect fit to the observations, one is often interested in making some statements about model parameter uncertainty. In such settings, it is still conceptually possible to consider the data as realized values of random variables: any discrepancy between model and data can be described probabilistically if one is prepared to accept, for example, that two sets of observations, gathered under conditions that are identical in terms of the model inputs, will in general be different. In some sense, if the model at hand provides some information about the data y, then we can think of it as partially specifying some aspects of an underlying distribution. Clearly, however, likelihood-based analysis is not possible in such settings because this underlying distribution is not fully specified.
It is possible to develop specific techniques to address each of the problems outlined above: indeed, some of the cited references give explicit suggestions for doing so. As an alternative, however, in this paper we provide a review of a general theory of estimation that is applicable whenever a model is calibrated by solving an equation of the form

g(θ; y) = 0.     (1.1)
Here g(·; ·) is a vector-valued function of the unknown θ and data y, containing p elements (recall that p is the number of elements in θ). In this context, the function g(·; ·) is called an estimating function (EF), and an equation of the form (1.1) is an estimating equation. EFs provide a unifying framework for many statistical techniques—for example, g(θ; ·) may be the gradient of a log-likelihood or error sum of squares—but their applicability extends well beyond this.
Example. To help fix ideas, here and subsequently, suppose that p = 1 so that θ ≡ θ is scalar-valued. Suppose in addition that there are n data items in total so that we can write y = (y_{1} … y_{n})′; and consider the equation
∑_{i=1}^{n}(y_{i} − θ) = 0.     (1.2)
This has the form of equation (1.1), with g(θ; y) = ∑_{i=1}^{n}(y_{i} − θ). The solution to equation (1.2) is clearly θ = n^{−1} ∑_{i=1}^{n} y_{i}: this is just the sample mean, which is often computed as an estimator of the population mean in a statistical model where the data are regarded as drawn independently from a common probability distribution. Thus, the sample mean can be regarded as the root of an estimating equation.
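This root-finding view is easy to make concrete. The sketch below (illustrative only: the simulated data and the bracketing interval are our own arbitrary choices, not from the paper) solves the estimating equation (1.2) numerically and confirms that the root coincides with the sample mean:

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=1.5, size=200)  # illustrative data

def g(theta, y):
    """Estimating function g(theta; y) = sum_i (y_i - theta)."""
    return np.sum(y - theta)

# g is positive at min(y) and negative at max(y), so the interval
# brackets the unique root of the estimating equation.
theta_hat = brentq(g, y.min(), y.max(), args=(y,))
# theta_hat coincides with the sample mean y.mean()
```

With p > 1 the same idea applies, with a multivariate root-finder (e.g. `scipy.optimize.root`) in place of `brentq`.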
Much of the material presented here is perhaps well-known as ‘folklore’ within the statistical community. However, we are not aware of a source that collates and presents all of it in generally applicable terms. The extensive statistical literature on EFs is focused primarily on optimality in specific settings [5,6]. In parallel, a large body of econometrics literature, under the guise of the generalized method of moments (GMM) discussed in §4, deals mainly with regression-type models (see [7–9], ch. 9). Bera & Bilias [10] provide a good review of the historical development of, and connections between, several different inference techniques including EFs, GMM and likelihood methods. We aim here to provide an accessible summary of EF theory, with assumptions given in operational terms so that they can be checked relatively easily in applications. Our approach is similar in spirit to that of Bibby et al. [11], who focus on inference for diffusion processes with a view towards applications in mathematical finance. There are also close connections with the method of indirect inference [12] (see [13] for a review of this area).
In §2, we discuss a class of stochastic models for rainfall time series that illustrate many of the issues discussed above. In §3, we review the general theory of EFs and in §4 we cover the important special case of the GMM: additional details for these sections can be found in the electronic supplementary material. In §5, we present the results of a simulation study to illustrate the performance of the theory in practice when applied to a simple stochastic rainfall model and §6 concludes.
2. Motivating example: stochastic rainfall models
Our use of EFs has been largely motivated by work with a class of stochastic models for rainfall time series. These models date back to Rodriguez-Iturbe et al. [14] and are widely used in the hydrological community to generate artificial rainfall time series, for purposes such as the assessment of flood risk and the impacts of climate change. A model of this type is used in the ‘weather generator’ provided as part of the latest suite of UK national climate change projections [15]. These models exemplify many of the issues summarized in §1. Here, we describe their structure, and highlight some of the difficulties that they pose for inference.
The models are based on a simplified conceptual representation of the physical structure of the rainfall process, in which rainfall is considered to be generated from ‘cells’ of activity in the atmosphere [16]. This conceptual representation is illustrated in figure 1. At the heart of such models is a point process, each event of which marks the arrival time of a cell. Each cell has a random duration, during which it deposits rain with a constant intensity (also random) so that its temporal profile is rectangular. At any point in time, the rainfall intensity is the sum of the intensities of all cells that are currently active.
In one of the simplest models of this type, the driving point process is Poisson and the cell durations are exponentially distributed random variables, realized independently and identically for each cell. Cell intensities have a two-parameter distribution and are again realized independently and identically for each cell; they are also independent of cell duration. This simple model has four parameters: the cell arrival rate (λ, say), mean cell duration (μ_{L}) and the parameters of the cell intensity distribution, which are conveniently expressed via its mean (μ_{X}) and coefficient of variation (σ_{X}/μ_{X}). The model is stationary, and a variety of theoretical properties can be derived [14], such as the mean rainfall intensity (which is λμ_{X}μ_{L}, as might be expected), the autocorrelation function and the probability that no cells are active at a particular time instant (i.e. that it is not raining at that instant). In general, however, full distributional properties are hard to derive, even for this very simple model: this prohibits the construction of a likelihood for the model parameters λ, μ_{X}, σ_{X}/μ_{X} and μ_{L}.
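The model just described is easy to simulate. In the sketch below, the parameter values and the choice of an exponential intensity distribution (for which σ_{X}/μ_{X} = 1) are our own illustrative assumptions; the simulated mean rainfall intensity is then compared against the theoretical value λμ_{X}μ_{L}:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical parameter values (units: hours and mm/h, say)
lam, mu_L, mu_X = 0.1, 2.0, 1.5   # arrival rate, mean duration, mean intensity
T = 200_000.0                     # length of simulation window

# Poisson cell arrivals; exponential durations and intensities,
# realized independently for each cell.
n_cells = rng.poisson(lam * T)
arrivals = rng.uniform(0.0, T, size=n_cells)
durations = rng.exponential(mu_L, size=n_cells)
intensities = rng.exponential(mu_X, size=n_cells)

# Each rectangular cell deposits (intensity x duration) of rain, so the
# total deposited over [0, T] (ignoring small edge effects) is:
total = np.sum(intensities * durations)
mean_intensity = total / T
# mean_intensity should be close to the theoretical value lam * mu_X * mu_L
```

Aggregating the superposed rectangular pulses into hourly or daily totals, as a rain gauge would, is a straightforward extension of this sketch.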
A further difficulty is that rain gauges usually record rainfall accumulations rather than instantaneous intensities. Rainfall data therefore typically consist of time series at a specific temporal resolution, often daily or hourly. Again, some theoretical properties of aggregated series can be derived for many of the point process rainfall models in the literature. For most models, expressions are available for the mean and variance of rainfall intensity at arbitrary time resolutions, as well as the autocorrelation function and proportion of ‘dry’ intervals (i.e. periods during which the rainfall intensity is zero); but it should be clear that any attempt to calculate the joint distribution for the aggregated data (and hence a likelihood function) is essentially hopeless. For the Poisson-based model described above, in fact it is possible to derive a joint distribution for the binary sequence defined by considering whether or not the rainfall in each time interval is zero: this can be used to define a marginal likelihood [17]. However, even this approach seems intractable for the more complex models that are needed for a realistic representation of observed rainfall structure (see [18] for a review of these models).
The rectangular temporal profile of the rain cells in figure 1 is clearly unrealistic in practice: the intensity of real rain cells varies throughout their lifetime, and observed rainfall records rarely contain the same value in successive wet intervals. By contrast, physically realistic model parameter values lead to realizations in which the rainfall intensity is often the same in successive intervals (although this feature is generally considered unimportant in most hydrological applications). Typically, therefore, the only way that model realizations can be made to resemble observed rainfall in detail is by having an unrealistically high cell arrival rate and very short cells. This is an example of a model simplification that potentially renders likelihood-based analysis undesirable: maximum-likelihood parameter estimates might be expected to reflect the temporal resolution of the data rather than the underlying physical structures [19].
In view of the infeasibility and potential undesirability of likelihood-based inference for this class of rainfall models, parameters are usually chosen to achieve as close a match as possible, according to a weighted least-squares criterion, between the observed and fitted values of selected statistical properties [18]. Specifically, let T = (T_{1} … T_{k})′ be a vector of k ≥ p summary statistics computed from data, and denote by τ(θ) = (τ_{1}(θ) … τ_{k}(θ))′ the expected value of T under the model. Then, θ is estimated by minimizing an objective function of the form

S(θ) = ∑_{i=1}^{k} w_{i}{T_{i} − τ_{i}(θ)}²     (2.1)

for some collection of weights {w_{1}, … , w_{k}}. This raises questions such as: how to select the properties {T_{i}}; how to choose the weights {w_{i}}; how to calculate standard errors and confidence regions for the model parameters; and how to discriminate between competing models. In the hydrological literature, many authors have sought to address some of these questions empirically (e.g. [18,20]). However, EF theory provides direct answers to all of them.
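A minimal sketch of this fitting procedure follows, using a deliberately simple hypothetical model of our own choosing rather than a rainfall model: an exponential distribution with mean θ, for which τ(θ) = (θ, θ²)′ when T holds the sample mean and variance. The weights are arbitrary illustrative values:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
y = rng.exponential(scale=2.0, size=5000)  # data; true mean is 2

# Observed summary statistics T and their expected values tau(theta)
# under a hypothetical exponential model with mean theta:
T = np.array([y.mean(), y.var()])

def tau(theta):
    return np.array([theta, theta**2])  # mean and variance of the model

w = np.array([1.0, 0.25])  # illustrative weights

def objective(theta):
    """Weighted least-squares criterion of the form (2.1)."""
    return np.sum(w * (T - tau(theta))**2)

res = minimize_scalar(objective, bounds=(0.01, 10.0), method="bounded")
theta_hat = res.x  # should be close to the true mean, 2
```

For a vector θ, `scipy.optimize.minimize` plays the same role; the questions of how to choose {T_{i}} and {w_{i}} are exactly those answered by the theory reviewed in §§3 and 4.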
3. Estimating functions: review of theory
Most standard treatments of EF theory start by examining the properties of the EFs themselves: e.g. [6,21]. See Desmond [22] for a clear and readable justification of this approach. In applications, however, practitioners typically will be more interested in the properties of the corresponding estimator of θ. For most EFs, only asymptotic properties of the estimators can be derived, although these can still provide useful approximations in finite samples. Here, therefore, we focus on the asymptotic theory for EF estimators.
Throughout, we consider the data as a vector of random variables. For an asymptotic treatment, it is necessary to consider the properties of EF estimators as the sample size (i.e. the length of the data vector Y) tends to infinity. It is convenient to extend the earlier notation to make explicit this dependence on sample size: a set of n data items will be denoted by Y_{n}, and the corresponding EF by g_{n}(·; ·). Then, for a sample of size n, the EF estimator (θ̂_{n}, say) is the root of the equation

g_{n}(θ; Y_{n}) = 0.     (3.1)

This implicitly assumes that equation (3.1) has a unique root. In practice, if there are multiple roots, then a common pragmatic solution is to restrict the search to ‘plausible’ regions of parameter space so as to guarantee a unique root: this is operationally equivalent to working with an amended version of the EF that takes some arbitrary non-zero value elsewhere. By contrast, if there are no roots, then it may be possible to work with an amended EF for which a root exists: as discussed below, this often does not affect the asymptotic properties of the estimator and hence may be considered as a legitimate device for inference, at least insofar as the asymptotics provide useful approximations in finite samples. We therefore proceed under the assumption that equation (3.1) has at least one root.
3.1. Target of the estimation
Our first task is to establish whether an estimating equation of the form (3.1) has a target value. By this, we mean a value, θ_{0} say, to which the associated estimator converges as the sample size increases; this is sometimes called the object of inference. In settings involving a fully specified statistical model, one would hope that θ_{0} is the value of the parameter vector corresponding to the distribution from which the data were generated. However, as noted in §1, one might equally be interested in quantities that would not conventionally be regarded as statistical model parameters. The notion of a target value provides a way of handling this. We proceed under the following assumption:
Assumption 3.1. There exists a sequence (η_{n}) of p × p matrices, independent of θ and such that as n → ∞, η_{n}g_{n}(θ; Y_{n}) converges uniformly in probability to a non-random function, g_{ℓ}(θ) say, with the following properties:

(1) the equation g_{ℓ}(θ) = 0 has a unique root at θ = θ_{0};

(2) g_{ℓ}(·) is bounded away from zero except in the neighbourhood of θ_{0} (formally: for any ε > 0, we can find some δ > 0 such that ‖g_{ℓ}(θ)‖ > δ whenever ‖θ − θ_{0}‖ > ε).
Under this assumption, notice first that g_{n}(θ̂_{n}; Y_{n}) = 0 by definition and, since η_{n} does not vary with θ, we must also have η_{n}g_{n}(θ̂_{n}; Y_{n}) = 0. Next, consider an arbitrary neighbourhood of θ_{0}. As n → ∞, requirement (2) in assumption 3.1 excludes the possibility that infinitesimal wiggles in η_{n}g_{n}(θ; Y_{n}), arising from sampling variation, can produce roots to the estimating equation outside this neighbourhood. As the neighbourhood can be arbitrarily small, θ_{0} is the only value of θ that is guaranteed to lie within it. We have already limited the discussion to situations in which the estimating equation (3.1) has at least one root; any roots of the equation must therefore converge to θ_{0}. This provides an informal justification of the following result:
Result 3.2. Under the conditions of assumption 3.1, as n → ∞, the EF defines a unique estimator that converges in probability to θ_{0}.
In practice, it is often technically challenging to verify the uniform convergence of η_{n}g_{n}(θ; Y_{n}) in assumption 3.1: this requires that, with probability tending to 1 as n → ∞, the discrepancy η_{n}g_{n}(θ; Y_{n}) − g_{ℓ}(θ) becomes arbitrarily small simultaneously at all candidate values of θ. It is often straightforward to establish pointwise convergence (i.e. the discrepancy vanishes at each value of θ considered individually), but this does not guarantee uniform convergence without additional assumptions. Moreover, it is difficult to legislate mathematically for all the possible ways in which uniform convergence may fail. Fortunately, however, such failure is usually associated with pathological situations that are unlikely to be encountered in applications: in practice, it can usually be ensured via fairly innocuous requirements, for example by imposing some conditions of smoothness on the EFs {g_{n}(θ; ·)} (see [23, ch. 6] for examples of this in the context of maximumlikelihood estimation). Some other possibilities are indicated in the examples below; for a more thorough discussion, see [24, ch. 5].
Example (continued). In the present notation, the EF in equation (1.2) is g_{n}(θ; Y_{n}) = ∑_{i=1}^{n}(Y_{i} − θ). Suppose that the {Y_{i}} are independent random variables with a common mean μ and finite variances; and consider the sequence η_{n} = n^{−1}, for which

η_{n}g_{n}(θ; Y_{n}) = n^{−1}∑_{i=1}^{n}(Y_{i} − θ) = Ȳ_{n} − θ,

where Ȳ_{n} is the sample mean of the observations. Now, as n → ∞ the weak law of large numbers guarantees that Ȳ_{n} converges to μ in probability; and therefore n^{−1}g_{n}(θ; Y_{n}) converges to g_{ℓ}(θ) = μ − θ (which is the expectation of the scaled EF: this type of situation is common). The convergence is uniform, as

sup_{θ} |n^{−1}g_{n}(θ; Y_{n}) − g_{ℓ}(θ)| = |Ȳ_{n} − μ|,

which does not depend on θ.
To illustrate the role of requirement (2) in assumption 3.1, consider now modifying the EF in this example to g_{n}^{†}(θ; Y_{n}) = exp(−θ^{2})∑_{i=1}^{n}(Y_{i} − θ). When multiplied by η_{n} = n^{−1}, this modified EF converges uniformly to g_{ℓ}^{†}(θ) = exp(−θ^{2})(μ − θ), which again has a unique zero at θ = μ but also approaches zero as θ → ±∞. In this case, for any finite sample size there will be values of θ for which g_{ℓ}^{†}(θ) is small relative to the sampling variation in n^{−1}g_{n}^{†}(θ; Y_{n}); hence, this modified EF will typically yield multiple (non-convergent) roots to the estimating equation tending to ±∞ as the sample size increases, in addition to the ‘obvious’ target μ.
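This failure can be checked numerically. In the sketch below (the sample size and data distribution are arbitrary choices of ours), far from the target the modified scaled EF is already many orders of magnitude smaller than the O(n^{−1/2}) sampling variation in Ȳ_{n}, so a numerical root-finder working to any realistic tolerance could not distinguish such values of θ from genuine roots:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100
y = rng.normal(loc=0.0, scale=1.0, size=n)  # common mean mu = 0

def g_scaled(theta):
    """Original scaled EF: n^{-1} g_n(theta; y) = ybar - theta."""
    return y.mean() - theta

def g_dagger_scaled(theta):
    """Modified scaled EF: exp(-theta^2) * (ybar - theta)."""
    return np.exp(-theta**2) * (y.mean() - theta)

# Sampling variation in ybar is of order n^{-1/2} ...
noise = 1.0 / np.sqrt(n)
# ... and far from mu the modified EF is already far below that level,
# while the original EF remains comfortably bounded away from zero.
far = abs(g_dagger_scaled(5.0))
```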
In this example, uniform convergence of η_{n}g_{n}(θ; Y_{n}) is guaranteed owing to its being a continuous function of a single statistic, Ȳ_{n}, which itself converges in probability to μ. The same principle applies whenever the (normalized) EF is a continuous function of a finite vector of summary statistics, say T_{n}(Y_{n}) = (T_{1}(Y_{n}) … T_{k}(Y_{n}))′; and where T_{n}(Y_{n}) itself converges in probability to some limiting value. This is the case for an EF constructed by differentiating an objective function such as equation (2.1), for example.
The conditions in assumption 3.1 are sufficient to guarantee the existence of a unique target value for the estimation procedure, and can be verified reasonably intuitively in most applications. However, they are not necessary. For example, requirement (2) can be relaxed if the EF is obtained by differentiating some objective function (the corresponding estimator is sometimes called an M-estimator; see [24, ch. 5]). In this case, it may be that the normalized EF itself has multiple zeroes or that it approaches zero away from θ_{0}; but these zeroes can be eliminated from consideration by examining the corresponding values of the (similarly normalized) objective function.
As in the example above, usually the sequence (η_{n}) tends to the zero matrix as n → ∞. This gives some flexibility for modifying the EF without changing the target value, for example by adding some quantity κ_{n}(θ; Y_{n}) such that η_{n}κ_{n}(θ; Y_{n}) converges uniformly in θ to a vector of zeroes as n → ∞. This in turn allows us to work with modified EFs that have desirable properties in finite samples, for example so that at least one root to equation (3.1) is guaranteed or to reduce estimator bias [25,26].
3.2. Limiting distribution of the estimator
Having established that a particular EF has a unique target value θ_{0}, it is of interest to characterize the uncertainty in the estimator by providing an approximate expression for its covariance matrix and, if possible, for its joint distribution. To proceed with this, we need to consider the p × p matrices of partial derivatives G_{n}(θ) = ∂g_{n}/∂θ, say (which are assumed to exist). We make the following additional assumption:
Assumption 3.3. The matrices {G_{n}(θ)} are invertible. Moreover, there exist sequences (γ_{n}) and (δ_{n}) of invertible p × p matrices that do not depend on θ and are such that:

(1) at θ = θ_{0}, the covariance matrix of the normalized EF γ_{n}g_{n}(θ_{0}; Y_{n}) exists and converges to a limiting positive definite matrix, H say, as n → ∞;

(2) defining M_{n}(θ) = γ_{n}G_{n}(θ)δ_{n} within a neighbourhood of θ_{0}, the matrix M_{n}(θ) converges uniformly in probability to an invertible matrix M(θ) with elements that are continuous functions of θ.
The following result can now be stated (a sketch proof is provided in the electronic supplementary material):
Result 3.4. Under the conditions of assumptions 3.1 and 3.3, as n → ∞, the expected value of θ̂_{n} converges to θ_{0} and the covariance matrix of δ_{n}^{−1}(θ̂_{n} − θ_{0}) converges to M_{0}^{−1}H(M_{0}^{−1})′, where M_{0} = M(θ_{0}) and H is the limiting covariance matrix of the normalized EF in assumption 3.3.

If, in addition, the normalized EF has a limiting multivariate normal (MVN) distribution, then the distribution of δ_{n}^{−1}(θ̂_{n} − θ_{0}) is itself multivariate normal in the limit.
Notice that the exact covariance matrix of γ_{n}g_{n}(θ_{0}; Y_{n}) (defined in the first requirement of assumption 3.3) can be written as γ_{n}Σ_{n}γ′_{n} say, where Σ_{n} is the covariance matrix of the original EF g_{n}(θ_{0}; Y_{n}). Moreover, if the expectation of G_{n}(θ_{0}) exists and is equal to G_{0}, say, then clearly M_{0} in assumption 3.3 is just the limiting value of γ_{n}G_{0}δ_{n}. In this case, an operational interpretation of result 3.4 is as follows:
Result 3.4 (operational version). Let Σ_{n} denote the covariance matrix of g_{n}(θ_{0}; Y_{n}). Then, if assumptions 3.1 and 3.3 hold, and if G_{0} = E[G_{n}(θ_{0})] exists, the distribution of θ̂_{n} is approximately MVN(θ_{0}, G_{0}^{−1}Σ_{n}(G_{0}^{−1})′) in large samples.
The conditions of assumption 3.3 hold in a wide variety of applications and are easy to check. A continuation of the earlier example gives a flavour of their nature.
Example (continued). Suppose the {Y_{i}} are independent random variables with a common mean μ. In this case, θ_{0} = μ is the target of the estimation procedure as demonstrated earlier. Suppose in addition that for all i, var(Y_{i}) = σ_{i}², such that n^{−1}∑_{i=1}^{n}σ_{i}² converges to a constant value, σ̄² say, as n → ∞. Then, the EF g_{n}(θ; Y_{n}) = ∑_{i=1}^{n}(Y_{i} − θ) satisfies the conditions of assumption 3.3 with γ_{n} = δ_{n} = n^{−1/2}. For in this case, we have G_{n}(θ) = ∂g_{n}/∂θ = −n, meeting the differentiability and invertibility condition. Next,

var[γ_{n}g_{n}(θ_{0}; Y_{n})] = n^{−1}∑_{i=1}^{n}σ_{i}²,

which converges to a constant so that requirement (1) is satisfied. Finally, M_{n}(θ) = γ_{n}G_{n}(θ)δ_{n} = n^{−1/2} × (−n) × n^{−1/2} = −1, which fulfils requirement (2).
Recall that the root of the estimating equation in this example is θ̂_{n} = Ȳ_{n}. In this case, therefore, result 3.4 states that n^{1/2}(Ȳ_{n} − μ) has asymptotic mean zero and variance σ̄² = lim_{n→∞} n^{−1}∑_{i=1}^{n}σ_{i}². Moreover, one can appeal to a central limit theorem [27, §2.5] to claim that n^{1/2}(Ȳ_{n} − μ) has a limiting normal distribution. In the limit as n → ∞ therefore, we have

n^{1/2}(Ȳ_{n} − μ) ∼ N(0, σ̄²).

The operational version of result 3.4 restates this by saying that in large samples, Ȳ_{n} ∼ N(μ, n^{−2}∑_{i=1}^{n}σ_{i}²) approximately.
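A small Monte Carlo check of this conclusion is straightforward (homogeneous variances and normal data are simplifying choices of ours; result 3.4 does not require either):

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 500, 4000
mu, sigma = 1.0, 2.0

# reps independent samples of size n, each with var(Y_i) = sigma^2
samples = rng.normal(mu, sigma, size=(reps, n))
ybar = samples.mean(axis=1)

# Normalized estimation error: result 3.4 predicts mean 0 and
# variance sigma^2 in the limit.
z = np.sqrt(n) * (ybar - mu)
emp_mean, emp_var = z.mean(), z.var()
```

With these settings, `emp_mean` should be near 0 and `emp_var` near σ² = 4, in line with the limiting N(0, σ̄²) distribution above.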
For this particular example, the reasoning as presented above seems laboured: indeed, the application of result 3.4 here relies on the properties of Ȳ_{n}, so that the reasoning is arguably circular. The main purpose, however, is to illustrate the nature of the sequences (γ_{n}) and (δ_{n}) in a simple example, and to show the steps needed to verify the requirements of assumption 3.3. The bottom line is to choose (γ_{n}) so as to normalize the variance of the EFs themselves; and then to choose (δ_{n}) to normalize the expectation of their derivatives. As here, it is often appropriate to take both sequences proportional to n^{−1/2} (when θ has p > 1 elements, the choice γ_{n} = δ_{n} = n^{−1/2}I_{p×p} is often appropriate, where I_{p×p} is the p × p identity matrix). This is not always the case, however: particularly in situations involving a combination of stationary and non-stationary stochastic processes, a mixture of normalizations may be needed. A simple example of this occurs when regressing a time series Y_{n} upon the time index: here, to stabilize the variance of the least-squares EF, the component corresponding to the regression slope must be normalized by a factor n^{−3/2} (this can be deduced from the results given in [28, p. 72]).
As with assumption 3.1, the requirements of assumption 3.3 are not strictly necessary to obtain useful asymptotic results for EF estimators. For example, Sweeting [29] provides a generalization of result 3.4 without assuming the existence of expectations or covariance matrices. Another extension is to processes for which the sequence (M_{n}(θ_{0})) converges in distribution to a random matrix M_{0}, rather than to a fixed value: in this case, inference about θ_{0} can be considered as conditional upon the realized value of M_{0} [30]. This occurs, for example, when regressing a time series upon a random walk covariate: in this case, conditioning upon M_{0} is equivalent to conditioning on the observed covariate trajectory, which seems reasonable in applications (see [28, §5.2] for more discussion). Such extensions ensure, in effect, that result 3.4 is applicable in a very broad range of settings.
Result 3.4 provides a means of comparing different EFs with the same target value, via their limiting covariance matrices. In particular, suppose that EFs g^{(A)} and g^{(B)} have the same target value and share the same normalizing sequence (δ_{n}) in assumption 3.3; and denote the limiting covariance matrices of the corresponding normalized estimators by C^{(A)} and C^{(B)}. If the matrix C^{(B)} − C^{(A)} is non-negative definite for all θ ∈ Θ then, in the limit, any linear transformation of θ is estimated at least as precisely (in the sense of having no larger variance) when using g^{(A)} as when using g^{(B)}. All other things being equal, therefore, in this case g^{(A)} is preferable to g^{(B)}. For example, it is known that in regular problems maximum-likelihood estimators (for which the EF is the derivative of the log-likelihood function, as noted in §1) are asymptotically optimal within the class of all EFs that have the same target [23, §6.5; 31, ch. 8]. In settings where likelihood-based inference is infeasible or undesirable, however, one might instead choose to work with a more restricted class of EFs: for example, all EFs defined by expressions of the form (2.1) with different values for the weights {w_{i}}. In this case, if g^{(A)} is a member of the chosen class and if the non-negative definite condition holds for any other choice g^{(B)} from the same class, then g^{(A)} can be thought of as (asymptotically) optimal in that class. The idea of finding the optimal EF within a class will be revisited in §4.
The proof of result 3.4 relies upon a Taylor expansion for the estimation error (see electronic supplementary material, equation S2). This expansion takes the form θ̂_{n} − θ_{0} ≈ −A_{n}^{−1}g_{n}(θ_{0}; Y_{n}), where A_{n} is a matrix of partial derivatives of the EF and A_{n} converges, upon normalization, to M_{0} as defined in result 3.4. Of course, the EF could be multiplied by any non-singular matrix without changing the estimator itself. In particular, the estimator is unaffected if the EF is multiplied by −A_{n}^{−1}, or by something asymptotically equivalent to it (for example, with A_{n} replaced by its expectation, where this exists). Under this standardization, the estimation error and EF have the same limiting distribution. From a practitioner's perspective, this can be seen as providing some justification for the use of optimality criteria based on the standardized EFs themselves, rather than upon the estimators that they define (see references given earlier).
3.3. Model comparison
It is often of interest to investigate whether the data are consistent with specific restrictions on θ, potentially involving several elements simultaneously, that have a relevant subjectmatter interpretation. For example, in the context of the Poisson rainfall model considered in §2, one may wish to determine whether the parameter σ_{X}/μ_{X} is equal to 1, as this might suggest that cell intensities could be modelled using an exponential distribution. More generally, in statistical practice, such restrictions often involve elements of θ being set to zero to indicate that the corresponding model components or processes have no effect upon the data Y. Other restrictions of interest might involve setting two elements of θ equal to each other, or requiring that some subset of the elements should sum to zero. Any such restriction can be regarded as defining a new model that is a special case of the one initially envisaged: to determine whether the restriction is supported by the data is therefore to compare two competing models.
The types of restriction outlined above can all be defined via a set of q ≤ p linear constraints: Ξθ = ξ_{0} say, where Ξ is a q × p matrix of rank q. Thus, to represent the restriction θ_{1} = 0, we could write Ξ = (1 0 ⋯ 0) and ξ_{0} = 0; to represent the restriction θ_{1} = θ_{2}, we could write Ξ = (1 − 1 0 ⋯ 0) and ξ_{0} = 0 (note that ξ_{0} is a scalar in both of these cases as there is only one restriction). To represent a set of q > 1 such restrictions, simply stack the defining equations on top of each other.
Clearly, given an estimator θ̂ of θ, a natural estimator of Ξθ is Ξθ̂. Suppose, as in the operational version of result 3.4 above, that θ̂ has approximately a multivariate normal distribution with mean θ_{0} and covariance matrix V, say, given by

V = G_{0}^{−1}Σ_{n}(G_{0}^{−1})′.     (3.2)
Then, we have Ξθ̂_{n} ∼ MVN(Ξθ_{0}, ΞΓ_{n}Ξ′) (3.3) approximately. Using standard properties of the multivariate normal distribution, therefore, if Ξθ_{0} is indeed equal to ξ_{0} then the quadratic form (Ξθ̂_{n} − ξ_{0})′(ΞΓ_{n}Ξ′)^{−1}(Ξθ̂_{n} − ξ_{0}) (3.4) has approximately a χ^{2} distribution with q degrees of freedom; otherwise, it will tend to be larger. Thus, the hypothesis can be tested at the 100α% level by comparing (3.4) with the 100(1 − α)% point of the χ_{q}^{2} distribution: if equation (3.4) exceeds this value, the hypothesis is rejected. This procedure is sometimes called a quasi-Wald test.
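A minimal numerical sketch of the quasi-Wald test for a single restriction (q = 1, so the quadratic form reduces to a scalar ratio); the estimate and covariance matrix below are invented for illustration, and 3.841 is the 95% point of the χ_{1}^{2} distribution:

```python
# Quasi-Wald test sketch for the restriction theta_1 = 0 (q = 1).
# theta_hat and Gamma_hat are made-up numbers, not results from the paper.
theta_hat = [0.9, -3.4, 0.1, 1.2]
Gamma_hat = [[0.16, 0.0, 0.0, 0.0],
             [0.0, 0.09, 0.0, 0.0],
             [0.0, 0.0, 0.04, 0.0],
             [0.0, 0.0, 0.0, 0.25]]
Xi = [1.0, 0.0, 0.0, 0.0]   # picks out theta_1
xi0 = 0.0

est = sum(x * t for x, t in zip(Xi, theta_hat))        # Xi theta_hat
var = sum(Xi[i] * Gamma_hat[i][j] * Xi[j]
          for i in range(4) for j in range(4))          # Xi Gamma Xi'
wald = (est - xi0) ** 2 / var                           # the quadratic form

# Compare with the chi-squared(1) 95% critical value.
reject = wald > 3.841
```

With q > 1 the same calculation involves inverting the q × q matrix ΞΓΞ′ rather than dividing by a scalar.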
An alternative approach to testing restrictions is based on the limiting distribution of the EF itself; this is analogous to the use of score tests in a likelihood-based setting. If the EF is the gradient vector of some objective function, then the procedure is to optimize the objective function subject to the restrictions of interest, and then to calculate a test statistic as a quadratic form in the EF evaluated at the resulting estimate of θ. Otherwise, however, it is difficult to define an analogue of such a procedure: if a p-component EF does not correspond to the gradient of an objective function then it is not clear how to impose q restrictions on its solution meaningfully. For this reason, such quasi-score tests are rarely used in conjunction with EFs in general settings. One possibility is to define an artificial objective function as a quadratic form in the original EF, and to minimize this subject to the restrictions of interest. In fact, this can be viewed as a special case of the GMM considered in §4, and from this perspective the procedure is equivalent to the LM test of Newey & West [32].
A final possibility for model comparison arises when the EF is a gradient vector: g_{n}(θ; Y_{n}) = ∂Q_{n}/∂θ say, where Q_{n}(θ; Y_{n}) is some differentiable function that is minimized to obtain the estimator (situations involving maximization are easily handled in the same framework: just work with −Q_{n}(θ; Y_{n}) instead). In this case, denote by θ̃_{n} the value of θ minimizing Q_{n}(θ; Y_{n}) under the restriction Ξθ = ξ_{0}. By definition, we must have Q_{n}(θ̂_{n}; Y_{n}) ≤ Q_{n}(θ̃_{n}; Y_{n}). However, it seems reasonable to expect that if the restriction holds, then the difference Q_{n}(θ̃_{n}; Y_{n}) − Q_{n}(θ̂_{n}; Y_{n}) will be small. The following result, which can be regarded as generalizing that of Kent [33] on the distribution of likelihood ratios in misspecified models, makes this precise:
Result 3.5. Suppose that assumptions 3.1 and 3.3 hold and that E[G_{n}(θ_{0})] = G_{0} exists. Then, if Ξθ_{0} = ξ_{0}, the quantity 2[Q_{n}(θ̃_{n}; Y_{n}) − Q_{n}(θ̂_{n}; Y_{n})], where θ̃_{n} minimizes Q_{n}(θ; Y_{n}) subject to the restriction, has approximately the same distribution as Z′A^{−1}Z, where Z ∼ MVN(0, ΞΓ_{n}Ξ′) and A = ΞG_{0}^{−1}Ξ′.
A derivation can be found in the electronic supplementary material.
In principle, result 3.5 yields a straightforward procedure for testing the hypothesis Ξθ_{0} = ξ_{0}: accept if the test statistic (3.5), twice the increase in the minimized objective function caused by imposing the restriction, is less than the appropriate percentile of the distribution of Z′A^{−1}Z and reject otherwise. Equivalently, the procedure defines an objective function threshold below which candidate values of ξ_{0} are accepted and above which they are rejected. The corresponding set of ξ_{0} values defines a confidence region for Ξθ_{0} (of course, the quasi-Wald and quasi-score tests can also be used to generate confidence intervals in this way).
Unfortunately, distributions of quadratic forms such as Z′A^{−1}Z are intractable in general. Notice, however, that in the present context G_{0} is the expectation of a Hessian matrix containing the second derivatives of Q_{n}(·; Y_{n}) at θ_{0}: it is therefore symmetric. Thus, A = ΞG_{0}^{−1}Ξ′ is also symmetric, as is A^{−1}. In this case, the distribution of Z′A^{−1}Z can be approximated with a scaled and shifted χ^{2} distribution [34, p. 88]. Specifically, the distribution is approximated by that of aX + c, where X has a χ^{2} distribution, the constants a and c are given by equation (3.6), and tr(·) denotes the trace operator.
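When the moment-matching constants are awkward to compute, one alternative is to approximate the distribution of Z′A^{−1}Z by direct Monte Carlo simulation, as in this sketch (the matrices V and A^{−1} are invented, and a 2 × 2 Cholesky factorization is worked by hand):

```python
import random

# Monte Carlo approximation to the distribution of Z'A^{-1}Z, with
# Z ~ MVN(0, V). V and A_inv are illustrative assumptions.
random.seed(1)
V = [[1.0, 0.5], [0.5, 1.0]]          # covariance of Z
A_inv = [[2.0, 0.0], [0.0, 1.0]]      # A^{-1} (symmetric, as in the text)

# Cholesky factor of V (2 x 2 case): V = L L'
l11 = V[0][0] ** 0.5
l21 = V[1][0] / l11
l22 = (V[1][1] - l21 ** 2) ** 0.5

draws = []
for _ in range(20000):
    e1, e2 = random.gauss(0, 1), random.gauss(0, 1)
    z = [l11 * e1, l21 * e1 + l22 * e2]
    q = sum(z[i] * A_inv[i][j] * z[j] for i in range(2) for j in range(2))
    draws.append(q)

draws.sort()
crit_95 = draws[int(0.95 * len(draws))]   # estimated 95% point of Z'A^{-1}Z
```

The simulated 95% point can then be used as the acceptance threshold in place of an analytical approximation.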
An important special case of result 3.5 occurs when q = 1, so that ξ, A and Z are all scalars. In this case, the quantity Z′A^{−1}Z becomes Z^{2}/A, which is equal to [var(Z)/A] × X, where X has a χ^{2} distribution with one degree of freedom. Thus, denoting by χ_{1}^{2}(1 − α) the 100(1 − α)% point of the χ_{1}^{2} distribution, a 100(1 − α)% confidence interval for ξ can be constructed as the set of values for which the objective function can be made smaller than the threshold Q_{n}(θ̂_{n}; Y_{n}) + [var(Z)/2A]χ_{1}^{2}(1 − α). (3.7)
In particular, if Ξ is a 1 × p matrix containing zeroes everywhere except for a 1 in position i, so that Ξθ = θ_{i} is the ith component of θ, then var(Z) is just the approximate variance of the estimator of θ_{i} as given by result 3.4, and A is the corresponding diagonal element of G_{0}^{−1}.
Another special case is when Ξ is the p × p identity matrix. Here, the result provides a means of calculating simultaneous confidence regions for all elements of θ based on the values of the objective function: this can be seen as a formal way to determine which values are ‘near-optimal’. Notice that in this case Z ∼ MVN(0, Γ_{n}) and A = G_{0}^{−1} so that Z′A^{−1}Z = Z′G_{0}Z.
Finally, note that if G_{0} = Σ_{n} (i.e. the expected Hessian of the objective function is equal to the covariance matrix of its gradient vector), then Γ_{n} = G_{0}^{−1} (from equation (3.2), coupled with the symmetry of G_{0}). In this case, A is equal to the covariance matrix of Z; and the distribution of Z′A^{−1}Z is exactly χ^{2} with q degrees of freedom. The equality G_{0} = Σ_{n} is known as the second Bartlett identity and holds, for example, when Q_{n}(·; Y_{n}) is the negative log-likelihood for a correctly specified model for Y_{n}; this is the basis for the standard chi-squared reference distribution used in likelihood ratio tests.
For the confidence intervals in the q = 1 case above, it is slightly unsatisfactory that the objective function threshold (3.7) varies depending on the matrix Ξ: in particular, different thresholds are required for different elements of θ. If the second Bartlett identity holds, however, all confidence intervals are defined via the single threshold Q_{n}(θ̂_{n}; Y_{n}) + χ_{1}^{2}(1 − α)/2: this seems much more intuitive. It also has computational advantages, as approximations such as equation (3.6) are not required. Sometimes, therefore, it may be convenient to adjust the objective function to recover the second Bartlett identity. This also has theoretical benefits, as described by Chandler & Bate [35] in the context of inference based on a misspecified log-likelihood.
3.4. Estimation of G_{0} and Σ_{n}
To use results 3.4 and 3.5 in practice, it is necessary to estimate G_{0} and Σ_{n} (or equivalently M_{0} and its covariance counterpart, as defined in result 3.4 and assumption 3.3, respectively). In fact, in large samples it suffices to use any estimator for which, with probability tending to 1 as n → ∞, the estimation error is negligible compared with the quantity being estimated. One possibility is to plug the estimate θ̂_{n} into analytical expressions for G_{0} and Σ_{n}, if these are available. For M_{0}, another option is numerical differentiation of g_{n}(·; ·) at θ̂_{n}.
In the absence of an analytical expression for Σ_{n}, an empirical estimator can often be used instead. For example, if g_{n}(·; ·) can be written as a sum of uncorrelated contributions from m distinct groups of observations, g_{n}(θ; Y_{n}) = ∑_{i=1}^{m}g^{(i)}(θ; Y^{(i)}) say, and if m → ∞ as n → ∞, then Σ_{n} can be estimated to the required accuracy as ∑_{i=1}^{m}g^{(i)}(θ̂_{n}; Y^{(i)})g^{(i)}(θ̂_{n}; Y^{(i)})′. (3.8)
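As a sketch of this kind of empirical estimator (assuming the usual outer-product form for a zero-mean EF; the group contributions below are invented two-dimensional values, taken as already evaluated at the estimate):

```python
# Empirical covariance estimator sketch: with the EF a sum of uncorrelated
# group contributions g_i (each a p-vector evaluated at theta_hat), estimate
# Sigma_n by summing the outer products g_i g_i'. Values are invented.
g = [[0.3, -0.1], [-0.2, 0.4], [0.1, 0.05], [-0.15, -0.3]]

p = len(g[0])
Sigma_hat = [[sum(gi[a] * gi[b] for gi in g) for b in range(p)]
             for a in range(p)]
```

The resulting matrix is symmetric by construction, with each entry accumulating the corresponding products across the m groups.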
A potential difficulty here is that the p(p − 1)/2 distinct elements of Σ_{n} are estimated individually. Unless p is rather small, in finite samples the associated estimation errors can accumulate to degrade many of the approximations upon which result 3.4 is based. It has been noted, for example, that estimators of the form (3.8) can lead to liberal tests of hypotheses [36], and in such situations it can pay to think more carefully about how to estimate Σ_{n}. We see an example of this in §5.
4. The generalized method of moments
Given data Y_{n}, one way to construct an estimating function is to consider a vector T_{n} = T_{n}(Y_{n}) of k ≥ p summary statistics with expectation τ(θ) under the model, so that E_{θ}[T_{n}] = τ(θ). (4.1)
This suggests that θ could be estimated by equating τ(θ) with T_{n}. However, sampling variation in T_{n} will usually render this infeasible, particularly when k > p. As an alternative, therefore, we could minimize an expression of the form (T_{n} − τ(θ))′W_{n}(T_{n} − τ(θ)), (4.2) where W_{n} is a k × k matrix that may depend on the data but which converges in probability to a positive definite matrix W, say. Notice that the expression (2.1), used to fit the stochastic rainfall models reviewed in §2, can be written in the form (4.2): in this case, W = diag(w_{1} w_{2} … w_{k}) is a diagonal matrix.
In fact, for a general treatment, equation (4.2) must be modified slightly. We will assume the existence of a sequence (γ_{n}) of k × k matrices that do not depend on θ and are such that, at θ = θ_{0}, the covariance matrix of γ_{n}(T_{n} − τ(θ_{0})) tends to a limiting positive definite matrix S. Under this assumption, instead of minimizing equation (4.2) we will minimize Q_{n}(θ; Y_{n}) = [γ_{n}(T_{n} − τ(θ))]′W_{n}[γ_{n}(T_{n} − τ(θ))]. (4.3)
The rationale for this is given in the electronic supplementary material, section S4.2. Clearly, equations (4.2) and (4.3) will deliver identical results if γ_{n} is proportional to the k × k identity matrix: γ_{n} = γ_{n}I_{k×k} say. This is the case in many applications (see §5). However, the variances of different components of T_{n} may change at different rates with n: in this case, the theory below applies only to estimators derived from equation (4.3).
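The mechanics of minimizing an objective of the form (4.2) can be sketched as follows (a deliberately tiny example with p = 1, k = 2, diagonal weights and an invented model τ(θ) = (θ, θ²)′; a real application would use a proper numerical optimizer rather than a grid search):

```python
# Toy GMM fit of the form (4.2): one unknown, two summary statistics.
# tau, T_n and the weights are invented purely to show the mechanics.
def tau(theta):
    return [theta, theta ** 2]

T_n = [2.1, 4.6]                 # observed summary statistics
w = [1.0, 0.5]                   # diagonal weighting matrix entries

def Q(theta):
    d = [t - m for t, m in zip(T_n, tau(theta))]
    return sum(wi * di ** 2 for wi, di in zip(w, d))

# Crude grid minimization over [0, 4] in steps of 0.001.
theta_hat = min((Q(th / 1000.0), th / 1000.0) for th in range(0, 4001))[1]
```

The estimate balances the two moment conditions according to the weights: here neither τ_{1}(θ̂) = T_{1} nor τ_{2}(θ̂) = T_{2} holds exactly, since k > p.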
This approach to inference was formalized by Hansen [8] and is referred to as the generalized method of moments. The closely related area of indirect inference (see §1) also involves optimizing an expression of the form (4.3); there, however, the analogue of τ(θ) is itself derived from an approximating model. In this section, we examine GMM estimation from an estimating functions perspective, and also discuss the choice of the matrix W_{n}.
4.1. The generalized method of moments estimating functions
Differentiating equation (4.3) with respect to θ yields, apart from an irrelevant factor of 2, the EF g_{n}(θ; Y_{n}) = D(θ)′γ_{n}′W_{n}γ_{n}(τ(θ) − T_{n}), (4.4) where D(θ) = ∂τ(θ)/∂θ′ (the existence of this k × p matrix is assumed). To translate the theory of §3 to the GMM context, therefore, it is necessary to ensure that assumptions 3.1 and 3.3 hold for equation (4.4). For operational purposes, the requirements are most helpfully stated in terms of the summary statistics T_{n} and their theoretical counterparts τ(θ). In fact, as noted in the earlier discussion of result 3.2, the conditions of assumption 3.1 are trivially satisfied if the elements of T_{n} converge to τ(θ_{0}), and if τ(θ) is bounded away from this limiting value outside the neighbourhood of θ_{0}.
To establish the conditions for assumption 3.3, the first task is to ensure the existence and invertibility of the derivative matrix G_{n}(θ) = ∂g_{n}(θ; Y_{n})/∂θ′. In the electronic supplementary material, it is shown that G_{n}(θ) exists if τ(θ) is twice differentiable with respect to θ. G_{n}(θ) will then be invertible if it is of full rank. This is the case if the functions {τ_{(i)}(θ): i = 1, … , k} are linearly independent (this can be seen by inspection of equations (S9) and (S10) in the electronic supplementary material, together with equation (4.1)). In practice, this means that T_{n} should not contain redundant elements (i.e. elements that add nothing to the information contained elsewhere): this is not a serious restriction.
To progress further, we need some additional notation: let τ_{(r)}(θ) denote the rth element of τ(θ) (r = 1, … , k). With this notation, the requirements of assumption 3.3 in the GMM context can now be stated equivalently as follows:
Assumption 4.1. There exists a sequence (δ_{n}) of invertible p × p matrices that do not depend on θ and are such that:

lim_{n→∞}δ_{n} = 0, the zero matrix;

Within a neighbourhood of θ_{0}, for all r ∈ {1, … , k}, suitably normalized versions of the matrix of first derivatives of τ(θ) and of the matrix of second derivatives of τ_{(r)}(θ) converge to limiting matrices H(θ) and Ω^{(r)}(θ), respectively, all elements of which are continuous functions of θ and where H(θ) is invertible.
The application in §5 provides some insights into the nature of assumption 4.1. However, the continuity requirements are not strong. If H(θ) exists, then its continuity is guaranteed by the earlier assumption that h_{n}(θ; Y_{n}) is twice differentiable; and continuity of Ω^{(r)}(θ) corresponds to that of the second derivatives of τ(θ).
The GMM version of result 3.4 can now be stated as follows (for a proof, see the electronic supplementary material):
Result 4.2. Assuming the existence of a sequence (γ_{n}) such that γ_{n}(T_{n} − τ(θ_{0})) has a limiting covariance matrix S as described earlier, let θ̂_{n} be the value of θ minimizing Q_{n}(θ; Y_{n}) in equation (4.3). Then, under assumption 4.1, the expected value of θ̂_{n} converges to θ_{0} and the covariance matrix of δ_{n}^{−1}(θ̂_{n} − θ_{0}) converges to M(θ_{0})^{−1}H(θ_{0})′WSWH(θ_{0})M(θ_{0})^{−1}, (4.5)
where M(θ_{0}) = H(θ_{0})′WH(θ_{0}). (4.6)
Furthermore, if T_{n} has a limiting multivariate normal distribution after standardization, the limiting distribution of δ_{n}^{−1}(θ̂_{n} − θ_{0}) is itself multivariate normal.
As in the general EF setting, to use this result in practice the quantities H(θ_{0}), S and M(θ_{0}) must be estimated. As suggested in §3, this can be performed via plug-in or empirical estimation based on the EF (4.4). However, for the GMM, it is often simpler (and, in our experience, computationally more stable) to work directly with T_{n} and τ(θ), and then to use equations (4.5) and (4.6) to derive estimates of the limiting covariance matrix and M(θ_{0}). Specifically, H(θ_{0}) can be estimated as ∂τ(θ)/∂θ′ evaluated at θ̂_{n}, and S can be estimated via an empirical or plug-in estimator of the covariance matrix of T_{n}. Once again, however, some care is required with empirical covariance matrix estimators: indeed, there are now k(k − 1)/2 separate elements of S to estimate rather than the p(p − 1)/2 elements of Σ_{n}. This can lead to poor performance if T_{n} contains too many elements (e.g. [37]). This is another argument for careful choice of T_{n}: ideally, it will contain a small number of summary statistics that together carry most of the information about θ.
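The suggested numerical-differentiation estimate of H(θ_{0}) can be sketched with a central-difference Jacobian (the map τ below is invented for illustration, with k = 3 and p = 2):

```python
# Central-difference Jacobian estimate of H(theta_hat) = dtau/dtheta'.
# The function tau is an invented example, not a model from the paper.
def tau(theta):
    a, b = theta
    return [a + b, a * b, b ** 2]

def jacobian(f, theta, h=1e-6):
    """k x p matrix of partial derivatives of f at theta, by central differences."""
    k, p = len(f(theta)), len(theta)
    J = [[0.0] * p for _ in range(k)]
    for j in range(p):
        up = list(theta); up[j] += h
        dn = list(theta); dn[j] -= h
        fu, fd = f(up), f(dn)
        for i in range(k):
            J[i][j] = (fu[i] - fd[i]) / (2 * h)
    return J

H_hat = jacobian(tau, [1.0, 2.0])
# True Jacobian at (1, 2) is [[1, 1], [2, 1], [0, 4]].
```

Central differences are preferred to one-sided differences here because their truncation error is of higher order for the same number of function evaluations.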
4.2. Optimal weighting matrices
Until now, we have not considered in detail the choice of weighting matrix W_{n} in equation (4.3). However, it is clear that this choice will affect the properties of the estimator, and in particular its covariance matrix. In §3, the concept of optimality within a class of EFs was introduced. In the present context, this leads to the question of whether, given a specific set of summary statistics T_{n}, there is an optimal weighting matrix W_{n} in equation (4.3). Recalling that S is the limiting covariance matrix of γ_{n}(T_{n} − τ(θ_{0})), the answer to this question is given by the following:
Result 4.3. Under the conditions set out above, the smallest attainable limiting covariance matrix of δ_{n}^{−1}(θ̂_{n} − θ_{0}) is [H(θ_{0})′S^{−1}H(θ_{0})]^{−1}, and this is achieved by setting W_{n} = S^{−1} in equation (4.3).
The proof follows exactly the steps given in [38, §3.6], and is therefore omitted.
As well as its theoretical optimality, the choice W_{n} = S^{−1} is appealing from another perspective. Specifically, using the results in §S4 of the electronic supplementary material it may be shown that with this choice, asymptotically the objective function (4.3) satisfies the second Bartlett identity. Hypotheses involving linear restrictions on θ can therefore be tested by referring test statistics of the form (3.5) directly to the quantiles of the appropriate chisquared distribution.
4.3. Extensions
In the discussion above, the matrices H(θ) and Ω^{(r)}(θ) contained derivatives of τ(θ) and, in particular, were nonrandom. This is because of the form (4.1) of the quantities used to construct the GMM objective function (4.3). More generally, however, the GMM idea can be applied with any vector h_{n}(θ; Y_{n}) of moment conditions satisfying E[h_{n}(θ_{0}; Y_{n})] = 0. For most such choices, the corresponding derivative matrices will be random: in this case, the theory outlined above can be adapted by requiring that they converge uniformly in probability to nonrandom limits. See the study of Lindsay & Qu [39] for an accessible overview of this area and its links to many other inference methods.
5. Theory in practice: a simulation study
We now illustrate the theory outlined above, using a simulation study involving the four-parameter Poisson rainfall model from §2. As described previously, such models are often fitted by minimizing an expression of the form (2.1); this is a special case of the GMM in which the weighting matrix W_{n} is diagonal. Typically, the minimization must be carried out numerically: our experience is that this can be challenging, and that working with the logarithms of the model parameters can stabilize the procedure. Accordingly, for the purposes of this study the model parameter vector is taken as θ = (logλ logμ_{X} logσ_{X}/μ_{X} logμ_{L})′. The computations are all carried out in the R environment [40] using routines that are available from http://www.homepages.ucl.ac.uk/~ucakarc/work/momfit.html.
5.1. Simulation setup
To account for seasonality in rainfall data, these models are usually fitted separately to data for each calendar month [18,19]: data from different years are considered to provide independent replicates of the rainfall process. Each of our simulations represents 20 years' worth of data for a single calendar month: specifically, m = 20 independent sets of 30 days' worth of hourly values are generated, each from a model with parameter vector θ_{0} = (−3.5 0 0 1.1)′. These values were obtained by rounding estimates obtained by fitting the model to January data from Birmingham airport in the UK for the period 1950–1997 (with time measured in hours and intensity in mm h^{−1}), and can be considered typical of January rainfall in the UK. The 20 year record length is towards the low end of what might be considered desirable in the hydrological literature; however, long hourly rainfall records are relatively scarce so that practitioners are often forced to use shorter records such as those simulated here.
We have generated 1000 simulated datasets with the structure described above. For each dataset, GMM estimates have been computed from a ‘typical’ set of summary statistics (i.e. elements of T_{n} in equation (4.3)) based on common practice in the hydrological literature [18]. These statistics are the mean rainfall intensity; the variance of 1, 6 and 24 hourly rainfall intensities; the proportion of dry hours; the proportion of dry days; and the lag 1 autocorrelation at hourly and daily resolution.
As each dataset consists of m = 20 independent 30-day replicates rather than a single contiguous time series, we have taken the opportunity to calculate, for each dataset, a vector T of summary statistics along with an empirical estimate of its covariance matrix (which is proportional to S in the notation of the previous section). Specifically, we calculate separate summary statistic vectors T^{(1)}, … , T^{(m)} for each of the replicates, and then calculate T = m^{−1}∑_{i=1}^{m}T^{(i)}. The covariance matrix of T is then estimated as [m(m − 1)]^{−1}∑_{i=1}^{m}(T^{(i)} − T)(T^{(i)} − T)′. (5.1)
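A sketch of this replicate-based calculation, using invented replicate summaries (m = 4, k = 2) and the usual 1/(m(m − 1)) normalization for the covariance of a mean:

```python
# Replicate-based summary statistics: average the per-replicate vectors
# T^(1), ..., T^(m) and estimate the covariance of the mean empirically.
# The replicate values are invented for illustration.
T_reps = [[1.0, 0.2], [1.4, 0.1], [0.9, 0.3], [1.1, 0.2]]
m = len(T_reps)
k = len(T_reps[0])

T_bar = [sum(rep[j] for rep in T_reps) / m for j in range(k)]

# cov(T_bar) estimated as (1/(m(m-1))) sum_i (T^(i) - T_bar)(T^(i) - T_bar)'
cov_T_bar = [[sum((rep[a] - T_bar[a]) * (rep[b] - T_bar[b]) for rep in T_reps)
              / (m * (m - 1)) for b in range(k)] for a in range(k)]
```

Because the replicates are independent, no allowance for serial dependence between them is needed; all the within-month dependence is absorbed into the replicate-to-replicate variability.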
A potential problem with this approach is that any small-sample biases in the individual elements of the T^{(i)} will be preserved in the averaging. Of course, the sample mean and proportion of dry intervals are exactly unbiased; however, the sample lag 1 autocorrelation in particular is a biased estimator of the true autocorrelation coefficient in short series [41, pp. 78–79]. In the present context, this is a particular concern for the lag 1 autocorrelation of daily data, for which only 30 observations are available in each replicate. Strictly speaking, this means that as the sample size (i.e. the number of years, or replicates) grows, T will not converge to the vector of theoretical properties under the model. However, in practice, the asymptotic theory is merely a guide to what might be expected if the finite-sample bias is small. We argue, therefore, that if the bias can be made as small as possible, the asymptotics should still be useful. To this end, for our sample autocorrelations we have used Quenouille's bias-reduced estimator [41, p. 79], which eliminates the first-order bias in the usual estimator.
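In its usual half-series form, Quenouille's estimator replaces the sample lag 1 autocorrelation r of the full series by 2r minus the average of the corresponding values from the two half-series. A sketch (the AR(1) series below is simulated purely for illustration):

```python
import random

# Quenouille's half-series bias-reduced lag 1 autocorrelation:
# r_Q = 2 r(full) - [r(first half) + r(second half)] / 2.
def acf1(x):
    """Sample lag 1 autocorrelation of the series x."""
    n = len(x)
    m = sum(x) / n
    num = sum((x[t] - m) * (x[t + 1] - m) for t in range(n - 1))
    den = sum((xt - m) ** 2 for xt in x)
    return num / den

# Simulate a short AR(1) series (30 daily values, coefficient 0.5).
random.seed(7)
x, prev = [], 0.0
for _ in range(30):
    prev = 0.5 * prev + random.gauss(0, 1)
    x.append(prev)

h = len(x) // 2
r_q = 2 * acf1(x) - (acf1(x[:h]) + acf1(x[h:])) / 2
```

The jackknife-style combination cancels the O(1/n) term of the bias at the cost of somewhat increased variance, which is usually an acceptable trade for series as short as 30 observations.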
Apart from the potential bias in the summary statistics, the conditions from §4 are easily shown to hold as m, the number of replicates, tends to infinity. Since T is a mean of m independent sets of summary statistics, we have var(T) = S/m, where S is the covariance matrix of a single set of summary statistics. Thus, the sequence (γ_{n}) defined in §4 can be taken as m^{1/2}I_{k×k}. With this normalization, equations (4.2) and (4.3) yield identical results (see the discussion following equation (4.3)). Moreover, setting δ_{n} = m^{−1/2}I_{p×p} (which tends to the zero matrix as m → ∞, as required by assumption 4.1) ensures that the relevant normalized matrices converge to the appropriate derivatives of τ(θ_{0}). It follows that H(θ) in equations (4.5) and (4.6) is equal to ∂τ/∂θ, and that S in equation (4.5) is equal to m var(T). We estimate H(θ_{0}) by numerical differentiation of τ(θ) at θ̂_{n}, and estimate S as m times the empirical covariance matrix (5.1). We note finally that T might be expected to have approximately a multivariate normal distribution owing to central limit-type results for the individual replicates {T^{(i)}}, as well as to the averaging of these replicates. Thus, result 4.2 should hold in its entirety. The simulation exercise therefore provides an opportunity to determine whether the asymptotics provide a reasonable approximation when m = 20 (i.e. when 20 years' worth of data are available).
As well as examining the validity of the asymptotics, we compare different choices of weighting matrix W in equation (4.3) (for notational convenience, we drop the n subscript here). Result 4.3 gives the optimal choice as S^{−1}. Apart from an irrelevant constant of proportionality, this can be estimated as the inverse of the empirical covariance matrix defined at equation (5.1): we denote the result by W_{0}. To investigate the benefits that accrue from the use of W_{0}, for each simulated dataset we also compute estimates using the following alternatives:

W_{1} = I_{k×k}: equal weights.

W_{2}: unequal weights. This is the same as W_{1}, except that the entries corresponding to the hourly summary statistics are set to 100. This has been suggested as a means of ensuring that the fitted models reproduce the most hydrologically important properties of observed rainfall sequences [18].

W_{3}: inverse variances. This matrix is again diagonal, but the entries are now the inverses of the corresponding diagonal elements of the empirical covariance matrix (5.1). This can be regarded as a crude approximation to the ‘optimal’ weighting matrix W_{0} in which correlations between the summary statistics are ignored.
5.2. Performance measurement criteria
A simple way to compare the performance of the different weighting matrices is to estimate the biases and covariance matrices of the corresponding estimators. With 1000 simulated datasets, these quantities can be estimated directly using the sample mean and covariance matrix of each estimator. We refer to the resulting covariance matrix estimates as ‘empirical’.
Another question of interest is whether the asymptotic covariance matrix given in result 4.2 can be regarded as an adequate approximation to the true covariance matrix. To assess this, for each dataset and weighting matrix we estimate the covariance matrix of the estimator by substituting estimates of H(θ_{0}) and S (as described above) into result 4.2 via equations (4.5) and (4.6); and for each weighting matrix we calculate the sample mean of these estimated covariance matrices as a summary measure. We refer to the results as ‘theoretical’ covariance matrices: one way to check the asymptotic theory is to compare these with the empirical versions described above, for example, by extracting the standard errors corresponding to each parameter and comparing these. It is also of interest, of course, to compare the standard errors between different weighting schemes.
Finally, we assess the accuracy of confidence intervals and regions computed using the asymptotic theory. Specifically, the limiting multivariate normal distribution from result 4.2 provides a means of calculating individual confidence intervals for each parameter in the model: if the theory holds in practice, then a 100(1 − α)% confidence interval should contain the true parameter value roughly 100(1 − α)% of the time. This can be assessed by computing confidence intervals for each simulated dataset, and recording the proportion of these that contain the true value. Similarly, confidence regions for the entire parameter vector can be computed based on objective function thresholds as in result 3.5. To check the validity of such regions, for each dataset we estimate the threshold by plugging in estimates of the required quantities as usual; and then evaluate the objective function at the true parameter value. If the asymptotic theory is valid, the result should be less than the computed threshold roughly 100(1 − α)% of the time.
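The coverage check can be sketched in miniature: simulate repeatedly, build a nominal 95% interval each time, and record the hit rate (here the estimator is just a sample mean, so the coverage should be close to 0.95; all settings are illustrative):

```python
import random

# Miniature coverage experiment: over repeated simulations, record whether
# a nominal 95% normal-theory interval contains the true value.
random.seed(3)
true_val, n_sims, hits = 0.0, 2000, 0
for _ in range(n_sims):
    sample = [random.gauss(true_val, 1) for _ in range(50)]
    mean = sum(sample) / 50
    # standard error of the mean, with the usual n - 1 variance divisor
    se = (sum((s - mean) ** 2 for s in sample) / (50 * 49)) ** 0.5
    if mean - 1.96 * se <= true_val <= mean + 1.96 * se:
        hits += 1

coverage = hits / n_sims
```

In the study proper, the same logic is applied with the GMM estimator and its theoretical covariance matrix in place of the sample mean and its standard error.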
5.3. Results
Figure 2 shows, for each choice of weighting matrix, the distributions of estimation error obtained from the 1000 simulated datasets. The variance-based weighting schemes, W_{3} and W_{0}, perform similarly and deliver much less variable estimators than the fixed schemes W_{1} and W_{2}, especially for the parameter logμ_{L}. Both of the fixed schemes produce several outlying estimates (this is especially true for W_{1} when estimating logσ_{X}/μ_{X}), which suggests that inference based on asymptotic normality of the estimators may not be suitable for these weighting schemes. Interestingly, the use of fixed weighting schemes to fit this type of model is standard in the hydrological literature, and numerical optimization of the GMM objective function frequently must be constrained to ensure that parameter estimates are physically realistic (e.g. [20]). The results of figure 2, together with similar results for other models (not shown here), suggest that the use of variance-based weighting schemes may alleviate this problem to a large extent.
Another conclusion from figure 2 is that, for all weighting matrices, any bias in the estimators is negligible when compared with the sampling variability. This is reassuring, and suggests that any bias in the summary statistics themselves is relatively unimportant for the purposes of parameter estimation for this particular model.
We next consider the variances of the estimators. Table 1 shows the parameter standard errors for each weighting scheme, computed from both the empirically and theoretically estimated covariance matrices as described above. The differences between the empirical standard errors for the different weighting schemes merely reflect the differences in variability noted in figure 2: in particular, there seems little to choose between weighting scheme W_{3} and the optimal scheme W_{0}. Of more interest is to compare the theoretical standard errors with the empirical ones. In general, the theoretical standard errors in table 1 are lower than those observed: exceptions are for logσ_{X}/μ_{X} with weighting schemes W_{1} and W_{2} (for which the accuracy of the computed theoretical covariance matrices was often suspect owing to ill-conditioned matrix calculations); and the results for weighting scheme W_{3}, where the two sets of standard errors are more or less in agreement. Overall, these results suggest that the asymptotic theory tends to underestimate variability and to make the estimators look more precise than they actually are.
The discrepancies between the empirical and theoretical standard errors suggest that confidence intervals and regions, constructed using the theoretical covariance matrices, will be inaccurate. This is indeed the case: table 2 shows the proportion of simulations for which confidence intervals and regions, constructed using the asymptotic theory, included the true parameter value. In general, the coverages are well below their nominal levels. This is particularly true for the confidence regions for θ constructed under the optimal weighting scheme W_{0}. The results for scheme W_{3} seem rather better, although the coverages are still slightly too low.
Clearly, the results in tables 1 and 2 suggest that—as implemented here, at least—the asymptotic theory yields unreliable assessments of estimator variability for this sample size. The problems are at least partly associated with poorly estimated covariance matrices, as indicated by the comparison of empirical and theoretical standard errors in table 1. For the estimators based on W_{1} and W_{2}, it is likely that failure of the normal approximation also contributes. This failure can be seen from the box plots in figure 2, and has also been verified using quantile–quantile plots; the same plots show that the distributions of the estimators from W_{3} and W_{0} are, by contrast, very close to normal.
Overall, the analyses in this section suggest that the variance-dependent weighting schemes W_{3} and W_{0} offer a substantial improvement in estimator performance over the fixed schemes W_{1} and W_{2}. We have obtained similar results in other simulation settings. This seems like a strong argument for using these variance-dependent schemes. It is, however, of interest to understand in more detail the reasons for the poor performance of the asymptotic theory in the assessment of estimator variability, and if possible to find practically feasible ways of improving this performance. We address this next.
5.4. Improved inference
In our simulations, we have seen that the theoretical covariance matrix calculations tend to underestimate the real variability of the estimators. This must be associated either with the estimation of H(θ_{0}) and S, or with a failure of the asymptotic theory to provide reasonable approximations on the basis of 20 replicates of 30 days' data. An extensive set of analyses (not reported here) has been undertaken to identify the cause of the problem: the conclusion was that it is primarily associated with inaccurate estimation of S, the (normalized) covariance matrix of the vector T of summary statistics.
This conclusion is perhaps unsurprising, given the known problems associated with empirical covariance matrix estimation as discussed previously: here, S is an 8 × 8 symmetric matrix to be estimated from m = 20 independent observation vectors. One way to avoid the problem is to obtain an initial estimate of θ using a suboptimal weighting matrix, and then to plug this initial estimate into an analytical expression for S. This effectively reduces the degrees of freedom of the covariance matrix estimator from k(k − 1)/2 (=28 in our application) to p (=4). As a result, we might expect that the use of a plugin estimator will reduce the cumulative effect of sampling variation in the estimation of S.
In the present context, however, and for many other problems involving complex system structures, analytical expressions for S are not available. For example, the variance of second-order sample properties depends on fourth-order theoretical properties of the aggregated rainfall process: even for the relatively simple model considered here, the amount of work involved in deriving these properties is prohibitive. We therefore propose an alternative that can be used whenever analytical evaluation of S is impractical, but where simulation is possible. Specifically, we suggest that an initial estimate of θ is used to simulate a large artificial dataset from which an improved estimate of S can be obtained. Of course, apart from the simulation error (which can be reduced by increasing the size of the artificial dataset), this is equivalent to parametrizing the covariance matrix as a function of θ. In a second stage, θ can be re-estimated using the optimal weights based on this improved estimate of S.
We have applied this idea to our simulations. For each of our 1000 datasets, the initial estimate of θ was obtained using weighting scheme W_{3}, which was shown in the previous section to yield a reasonable estimator. One thousand independent 30-day replicates of the process were then simulated using this pilot estimate, and these replicates were used to construct a new estimate of S. Revised estimates of θ were then obtained using weighting schemes W_{3} and W_{0} from the previous section, but with this new estimate of S substituted for the original, empirical estimate. To discriminate between the different procedures, we denote the resulting weighting matrices by W_{3}^{*} and W_{0}^{*}, respectively.
The empirical and theoretical standard errors for this two-step procedure are compared in the bottom two rows of table 1. The results for W_{3}^{*} are very similar to those of the original version W_{3}. However, there is now much better agreement between the empirical and theoretical versions for the optimal weighting scheme W_{0}^{*}. Moreover, of all the schemes considered, this one now delivers the smallest empirical standard errors for all parameters except logλ (where W_{3} yields a slightly smaller value, although the difference is easily within sampling variation).
The bottom two rows in table 2 show an improvement in the coverages obtained from both W_{3}^{*} and W_{0}^{*}, compared with those from W_{3} and W_{0}; the improvement for W_{0}^{*} is particularly striking. Coupled with the smaller standard errors of the estimators, this makes the two-step optimal procedure potentially attractive for carrying out inference using this model with relatively modest sample sizes. The main drawback is the computing time which, owing to the need for a large simulation to estimate S, can be several times greater than that of a one-step procedure using, say, W_{3}.
It may be argued that, even with the use of the two-step procedure, the coverages from W_{0}^{*} in table 2 still tend to be less than their nominal values. However, the difficulty of the inference task on the basis of a relatively small sample of data should not be underestimated. We have conducted additional simulation experiments with more complex and realistic models, with results similar to those obtained above. We also find that increasing the number of replicates improves the coverages still further, as might be expected. The take-home message for the prudent analyst is that the asymptotic theory can deliver reasonable, but approximate, inference in difficult situations on the basis of relatively small amounts of data.
6. Discussion
We have attempted to summarize some inferential procedures that can be used whenever quantities of interest are estimated by solving an equation of the form (1.1). This is the case for standard techniques such as maximum-likelihood and least-squares estimation, but the framework applies much more widely. The theory reviewed here is particularly useful in situations where standard techniques are not available: the procedures are straightforward provided the covariance matrix and derivative of the estimating function can be estimated. For accurate inference, however, some care is required to obtain a reasonable estimate of the covariance matrix.
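A minimal, self-contained illustration of this recipe (a toy example of our own, not one of the paper's models): for the scalar estimating function g(y; θ) = y − θ, solving the analogue of (1.1) gives the sample mean, and the sandwich formula D^{-1} S D^{-1}/n yields a standard error from the estimated derivative D and covariance S of the estimating function, with no likelihood for y ever specified:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.exponential(2.0, size=1000)  # true mean is 2.0; no likelihood used below

# Estimating function g(y; theta) = y - theta: setting the sum of g over the
# sample to zero gives theta_hat = sample mean.
theta_hat = y.mean()

# Sandwich variance: D = E[dg/dtheta] = -1 (exact here), and S = var(g)
# is estimated empirically from the residuals.
g = y - theta_hat
D = -1.0
S = g.var(ddof=1)
se = np.sqrt(S / (D**2 * len(y)))  # sqrt(D^{-1} S D^{-1} / n)
```

The same structure carries over to vector θ, with D a p × p derivative matrix and S a covariance matrix, which is exactly where a careful estimate of S becomes important.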
We have focused in particular upon the special case of GMM estimation, because this provides a general-purpose—and, in practical terms, relatively painless—way to exploit links between the theoretical properties of models and observable properties of data. As such, the GMM seems, along with the related method of indirect inference, similar in spirit to the technique of approximate Bayesian computation (ABC; [42,43]) in which the aim is to derive a posterior distribution for quantities of interest on the basis of selected data summaries. Indeed, the combination of ABC with indirect inference has recently been applied to the modelling of macroparasite populations [44].
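To make the connection concrete, a bare-bones rejection-ABC sketch (again a toy normal model, with priors and summaries of our own choosing) retains prior draws whose simulated summaries fall close to the observed ones, mirroring the way GMM matches theoretical to observed summaries:

```python
import numpy as np

rng = np.random.default_rng(2)
y_obs = rng.normal(2.0, 1.0, size=200)
t_obs = np.array([y_obs.mean(), y_obs.std()])

# Rejection ABC: draw (mu, sigma) from a uniform prior, simulate a dataset,
# and accept the draw if its summaries lie within eps of the observed ones.
eps = 0.15
accepted = []
for _ in range(20000):
    mu = rng.uniform(0.0, 5.0)
    sigma = rng.uniform(0.1, 3.0)
    y_sim = rng.normal(mu, sigma, size=200)
    t_sim = np.array([y_sim.mean(), y_sim.std()])
    if np.linalg.norm(t_sim - t_obs) < eps:
        accepted.append((mu, sigma))

post = np.array(accepted)  # approximate posterior sample for (mu, sigma)
```

The accepted draws approximate the posterior given the summaries rather than the full data, so the quality of the inference depends, as in GMM, on how informative the chosen summaries are.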
Although the theory outlined here holds under rather mild conditions, these conditions are not always satisfied. Situations in which the theory is not applicable include: procedures involving the optimization of an objective function where the optimum value does not satisfy (1.1) (as in maximum-likelihood estimation of the unknown endpoint of a distribution—see [45, §7.2]); change-point problems in which one or more of the model parameters changes abruptly at a particular value of some covariate (often time—for some insight into why these problems are irregular, see [46]); situations where the parameter space Θ is a closed set outside which the EF is undefined, and where θ_{0} lies on the boundary of Θ so that derivatives are ill-behaved there (this can occur in models involving multiple variance components when some variances are zero); and problems where there are fundamental differences in the behaviour of the covariance or derivative matrix of the EF as θ varies (as when one moves from the stationary to the non-stationary region of parameter space in many time-series models). It is also worth noting that even when the conditions do hold, the target θ_{0} of the estimation procedure is not guaranteed to be the value of substantive interest. Perhaps the best-known example of this is the Neyman–Scott or incidental parameters problem in maximum-likelihood estimation [45, §7.6.2]: here, the total number of model parameters increases with the sample size and the maximum-likelihood estimator of a low-dimensional parameter of interest is inconsistent (in this review, we have only considered situations in which p, the dimension of θ, is fixed for all n). Another example involves the application of generalized estimating equations to misspecified models for clustered data [47].
The techniques for model comparison in §3 assume that one model is nested within the other by imposing some linear restrictions on the parameter vector. In fact, essentially the same result holds for nonlinear restrictions under some mild conditions on their nature (see [31, §8.6] for a discussion of this in the context of maximum-likelihood estimation). However, we have not discussed techniques for comparing non-nested models. For likelihood-based inference, criteria such as the Akaike information criterion and Bayesian information criterion can be used for this purpose [46, §4.7], although even these are considerably less accurate when the models are not nested. Similar criteria seem less well-developed for general EFs. Various suggestions have been made in specific situations, however (see, for example, [48–50]), and most of these follow a very similar structure: this points towards the possibility of formulating a general-purpose criterion for the comparison of non-nested models under a wide range of different estimation techniques. See Cox [45] for a more general discussion of non-nested model comparison.
We conclude with a caveat. The theory presented here is general in nature, and relatively easy to apply in practice. However, in models of complex systems, it will often be the case that improved inference can be obtained from specialized methods that are more difficult to implement but are specifically designed for the particular problem at hand. In practice, therefore, it will often be necessary to weigh up the convenience of the methods presented here against the potential benefits of a specialist technique.
Acknowledgements
Joao Jesus' research was funded by a PhD studentship from the Engineering and Physical Sciences Research Council. We thank the two anonymous reviewers for their constructive and supportive comments.
Footnotes
One contribution to a Theme Issue ‘Inference in complex systems’.
 Received June 15, 2011.
 Accepted August 11, 2011.
 This journal is © 2011 The Royal Society