These reference chapters have been taken from Volume III, and use the same chapter and section numbering as the printed version.
In a discrete choice model the dependent variable only takes on integer values. Such a discrete dependent variable can denote a category (e.g. mode of transport: car, train, or bus) or count the number of events (e.g. the number of accidents in a week). A model with a categorical dependent variable is often called a discrete choice model, but is also referred to as a dummy endogenous or qualitative response model. As our main reference for discrete choice models we use Cramer (2003).
The use of ordinary least squares in a discrete choice model is called the linear probability model. The main shortcoming of this model is that it can predict probabilities that are outside the [0,1] interval. PcGive implements logit and probit estimation which avoids this drawback.
General references include McFadden (1984), Amemiya (1981) and Cramer (2003), among others. In the remainder we exclusively refer to Cramer (2003).
In the simple discrete choice model the dependent variable only takes on two values, for example, when modelling car ownership:

y_{i} = 1 if individual i owns a car, y_{i} = 0 otherwise, i = 1,...,n. 
With a discrete dependent variable, interest lies in modelling the probabilities of observing a certain outcome:
p_{i}=P{y_{i}=1}. 
Using OLS on a dummy dependent variable:
y_{i} = x_{i}'β+ε_{i} 
does not restrict ŷ_{i} between 0 and 1, which would be required if it is to be interpreted as a probability. Also, the disturbances cannot be normally distributed, as they only take on two values: ε_{i}=1−p_{i} or ε_{i}=−p_{i}, writing p_{i}=x_{i}'β, and they are heteroscedastic.
A simple solution is to introduce an underlying continuous variable U_{i}, which is not observed. Observed is:

y_{i} = 1 if U_{i} ≥ 0, y_{i} = 0 if U_{i} < 0. 
Now we can introduce explanatory variables:
U_{i}=x_{i}'β−ε_{i}, 
and write
p_{i}=P{y_{i}=1}=P{x_{i}'β−ε_{i}≥0}=F_{ε} ( x_{i}'β) . 
Observations with y_{i}=1 contribute p_{i} to the likelihood, observations with y_{i}=0 contribute 1−p_{i}:
L( β|X) =∏_{{yi=0}}( 1−p_{i}) ∏_{{yi=1}}p_{i}, 
and the log-likelihood becomes:
l( β|X) =∑_{i=1}^{N}[ ( 1−y_{i}) log ( 1−p_{i}) +y_{i} log p_{i}] . 
The choice of F_{ε} determines the method. Using the logistic distribution:

F_{ε} ( z) = e^{z}/( 1+e^{z}) 

leads to logit. Logit has a linear log-odds ratio:

log ( p_{i}/( 1−p_{i}) ) =x_{i}'β. 
The standard normal distribution gives probit. Because we can multiply U_{i} by any positive constant without changing the observed outcome, the scale of these distributions is fixed by convention: the logistic has variance π^{2}/3, the standard normal has variance equal to 1.
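As a concrete illustration, the binary logit log-likelihood above can be maximized numerically. This is a minimal sketch on simulated data (not PcGive output); the data, coefficients, and optimizer choice are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(beta, X, y):
    """-l(beta) = -sum[(1-y)log(1-p) + y log p], p = F(x'beta), F logistic."""
    z = X @ beta
    # log(1+e^z) = -log(1-p) and log(1+e^-z) = -log p, in numerically stable form
    return np.sum((1 - y) * np.logaddexp(0.0, z) + y * np.logaddexp(0.0, -z))

rng = np.random.default_rng(3)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])     # intercept + one regressor
p_true = 1 / (1 + np.exp(-(X @ np.array([0.5, 1.0]))))
y = (rng.uniform(size=n) < p_true).astype(float)

res = minimize(neg_loglik, np.zeros(2), args=(X, y), method="BFGS")
print(res.x)  # close to the true (0.5, 1.0) in large samples
```

The same skeleton gives probit by replacing the logistic CDF with the standard normal CDF.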
Assume that observations on n individuals (or households, firms, etc.) are obtained by random sampling, and that all individuals have made a choice from the same set of alternatives A={1,2,...,S}. Using subscript i to denote an individual and s to denote an alternative, we can write the dependent variable as y_{i}:
y_{i} = s if individual i has made choice s, s=1,...,S; i=1,...,n. 
The existence of an unobserved continuous variable U_{is} is assumed, where the alternative with the highest value is chosen:
y_{i} = s if U_{is}= max _{k∈A} {U_{ik}}. 
See Cramer (2003, Ch. 2) for an interpretation in terms of utility maximization. Adding a stochastic component:

U_{is} = V_{is} + ε_{is}. 
Logit and probit specify a certain distribution for ε_{is}. The systematic part V_{is} is a functional form of explanatory variables and coefficients. We can distinguish three types of explanatory variables: variables x_{i} that vary only over individuals, variables z_{is} that vary over both individuals and alternatives, and variables that vary only over alternatives.
The x_{i} are called alternative independent regressors, while the remaining two types, called alternative dependent, are treated identically in PcGive.
Moreover, there are two types of coefficients: alternative specific coefficients β_{s}, which differ across alternatives, and generic coefficients γ, which are common to all alternatives.
The full specification in PcGive is:

V_{is} = x_{i}'β_{s} + z_{is}'γ. (eq:2.3)
So alternative independent regressors always enter with an alternative specific coefficient, while alternative dependent regressors have a generic coefficient.
As an illustration, consider the transport mode example:

A = {car, train, bus}. 
Then x_{i} may be the income of individual i, and z_{is} the travel time for individual i using mode s. Interest may focus on the probabilities that individual i uses mode s, or the coefficients, to estimate the influence on marginal utility of explanatory variables. In the absence of alternative specific parameters it is possible to investigate the consequences of introducing a new mode of transport.
Let p_{is} denote the probability that individual i chooses alternative s. Then from (eq:2.2):

p_{is} = P{y_{i}=s} = P{ V_{is}+ε_{is} ≥ V_{ik}+ε_{ik} for all k∈A }. 
The log-likelihood function is:

l(θ) = ∑_{i=1}^{n}∑_{s=1}^{S} Y_{is} log p_{is}(θ), 

where

Y_{is} = 1 if y_{i}=s, and Y_{is} = 0 otherwise. 
The multinomial logit (MNL) model arises from (eq:2.2) when the ε_{ik} are independently type I extreme value distributed (see Cramer, 2003, Ch. 3). This is the most convenient model, because of the closed form of the probabilities:

p_{is} = exp(V_{is}) / ∑_{k=1}^{S} exp(V_{ik}), 

and the concavity of the log-likelihood.
The log-odds ratio is:
log (p_{is}/p_{ik})=V_{is}−V_{ik}. 
Substituting (eq:2.3):
V_{is}−V_{ik} = x_{i}'(β_{s}−β_{k}) +(z_{is}−z_{ik})'γ, 
shows that a normalization is needed on β_{k}: a constant can be added (for all k) without changing the probabilities. PcGive takes:
β_{1} = 0, 
and the first alternative is then called the reference state.
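A sketch of how the MNL probabilities can be computed under the normalization β_{1} = 0. The data and coefficient values are hypothetical, and the model here has one alternative independent regressor and no z variables:

```python
import numpy as np

def mnl_probs(V):
    """p_is = exp(V_is) / sum_k exp(V_ik), computed row-wise and stably."""
    V = V - V.max(axis=1, keepdims=True)  # subtracting a constant leaves p unchanged
    e = np.exp(V)
    return e / e.sum(axis=1, keepdims=True)

x = np.array([[1.0], [2.5]])             # x_i: one regressor, two individuals
beta = np.array([[0.0], [0.4], [-0.2]])  # beta_1 = 0: first alternative is the reference state
V = x @ beta.T                           # V_is = x_i' beta_s
p = mnl_probs(V)
print(p)
```

The log-odds against the reference state recover x_{i}'β_{s} exactly, matching the linear log-odds ratio above.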
Although the name multinomial logit in PcGive is generally used for the specification (eq:2.3), it sometimes stands for a more specific functional form:
V_{is}=x_{i}'β_{s}. 
In that case, the conditional logit model refers to
V_{is}=z_{is}'γ. 
When V_{is} does not depend on attributes of alternatives other than alternative s, the MNL model has the independence from irrelevant alternatives (IIA) property.
PcGive allows the specification of a weight variable in the log-likelihood:
l(θ) = ∑_{i=1}^{n}w_{i}∑_{s=1}^{S}Y_{is} log P_{is}(θ). 
The use could be twofold: weighting grouped observations by their frequency, or correcting for choice-based sampling.
In multinomial logit, only the intercept is incorrect when treating a choice-based sample as a random sample (see Maddala, 1983, pp. 90–91). Manski and Lerman (1977) proposed a weighted estimation to get consistent estimates for choice-based samples. Let Q_{0}(s) be the population fraction choosing alternative s, and H(s) the fraction in the sample choosing s. Denote the observed response of individual i as s_{i}; the weights are then:
w_{i} = Q_{0}(s_{i}) / H(s_{i}). 
Also see Amemiya and Vuong (1987).
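For illustration, the Manski–Lerman weights can be computed as follows. The population fractions Q_{0}(s) are hypothetical values here; in practice they must be known from an external source such as a census:

```python
import numpy as np

choices = np.array([0, 0, 0, 1, 1, 2])                 # observed responses s_i, coded 0..S-1
Q0 = np.array([0.3, 0.5, 0.2])                         # population fractions Q_0(s), assumed known
H = np.bincount(choices, minlength=3) / len(choices)   # sample fractions H(s)
w = Q0[choices] / H[choices]                           # w_i = Q_0(s_i) / H(s_i)
print(w)
```

Alternatives over-represented in the sample relative to the population receive weights below one, and vice versa.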
The coefficient estimates β̂_{s}, s=2,...,S and γ̂ form the basis for the estimated probabilities for each observation: p̂_{is}, s=1,...,S, i=1,...,n.
If the estimated model contains a constant term, it can be compared with the baseline model in a likelihood ratio test. The baseline model has only an intercept, and the resulting test is analogous to the regression F-test. In the baseline MNL model:
V_{is} = α_{s}, s=2,...,S, 
the likelihood can be maximized explicitly, resulting in:
α_{s} = log (f_{s}/f_{1}), 
where f_{s} is the relative frequency in the sample of alternative s. With K alternative independent regressors (x) and C alternative dependent regressors (z), the estimated model has K(S−1)+C parameters, and a log-likelihood of l(θ̂), say. The baseline model has S−1 parameters, and a log-likelihood of l(α̂), so that 2l(θ̂) − 2l(α̂) is χ^{2}[(K−1)(S−1)+C] distributed.
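The baseline quantities can be computed from the sample frequencies alone; a small sketch with hypothetical data:

```python
import numpy as np

y = np.array([0] * 50 + [1] * 30 + [2] * 20)    # hypothetical sample, S = 3 states
n = len(y)
f = np.bincount(y) / n                           # relative frequencies f_s
alpha = np.log(f / f[0])                         # alpha_s = log(f_s / f_1), alpha_1 = 0
loglik_baseline = n * np.sum(f * np.log(f))      # l(alpha-hat) = sum_s n_s log f_s
print(alpha, loglik_baseline)
```

The LR statistic then compares 2l(θ̂) of the fitted model with 2×loglik_baseline.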
If the estimated model has no intercept, the zero line is used in the test. This is a model without parameters:
p̂_{is}= 1/S . 
In a multinomial logit model that does have an intercept, the average of the estimated probabilities for state s equals the relative frequency of state s: (1/n) ∑_{i=1}^{n} p̂_{is}=f_{s}. This follows, because at the maximum, the derivatives of the log-likelihood:

∂l/∂β_{s} = ∑_{i=1}^{n} (Y_{is}−p_{is})x_{i}' 

are zero, and hence, if one of the regressors is a row of ones:

∑_{i=1}^{n} Y_{is}=∑_{i=1}^{n} p̂_{is}. 
With grouped data, the log-likelihood of the saturated model is reported. This is a model with perfect fit: the predicted frequencies equal the observed frequencies. Let n_{is} be the number in group i having response s. Then ∑_{s=1}^{S}n_{is}=n_{i} and ∑_{i=1}^{G}n_{i}=n, with G the number of groups and n the total number of observations. The saturated model

p̂_{is}= n_{is}/n_{i} 

has G(S−1) parameters and a log-likelihood of

∑_{i=1}^{G}∑_{s=1}^{S}n_{is} log ( n_{is}/n_{i} ). 

The saturated log-likelihood is not reported if any cell has n_{is}=0.
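The saturated log-likelihood can be computed directly from the cell counts; a sketch with hypothetical grouped data (G = 2 groups, S = 3 responses):

```python
import numpy as np

n_is = np.array([[10., 5., 5.],    # group 1: counts for each response
                 [4., 8., 8.]])    # group 2
n_i = n_is.sum(axis=1, keepdims=True)
# Saturated fit p_is = n_is / n_i; undefined when any cell has n_is = 0
loglik_saturated = np.sum(n_is * np.log(n_is / n_i))
print(loglik_saturated)
```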
The derivatives of the probabilities with respect to the regressors give the effect of a change in one of the regressors on the probabilities. Since the probabilities sum to one, the derivatives sum to zero. The derivatives for the MNL model are:

∂p_{s}/∂x_{m} = p_{s}( β_{sm} − ∑_{k=1}^{S} p_{k}β_{km} ). 

We can substitute the relative frequency of each state for the probabilities (p̂_{s}=f_{s}) or select other relative frequencies. In the calculation of the t-values the p̂_{s} are treated as non-stochastic.
In the binary probit model the derivatives are:

∂p/∂x_{m} = φ(V̂)β_{m}. 

These derivatives are calculated at the mean of the explanatory variables (V̂= x̄'β̂), but can be calculated at other values.
PcGive also reports quasi-elasticities:

∂p_{s}/∂ log x_{m} = ( ∂p_{s}/∂x_{m} ) x_{m}. 

If x_{m} increases with q%, p_{s} will approximately increase by q×∂p_{s}/∂ log x_{m} percentage points. Elasticities are also reported:

∂ log p_{s}/∂ log x_{m} = ( ∂p_{s}/∂x_{m} ) x_{m}/p_{s}. 
All three measures have the same t-values.
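The three measures are linked by simple rescalings; a sketch for the binary probit case, with hypothetical coefficient estimates and sample means:

```python
import numpy as np
from scipy.stats import norm

beta = np.array([0.8, -0.3])   # hypothetical probit estimates (intercept, slope)
xbar = np.array([1.0, 2.0])    # sample means of the regressors
V = xbar @ beta                # V-hat = xbar' beta-hat
p = norm.cdf(V)
deriv = norm.pdf(V) * beta     # dp/dx_m = phi(V) beta_m
quasi = deriv * xbar           # dp/d log x_m = (dp/dx_m) x_m
elast = quasi / p              # d log p / d log x_m
print(deriv, quasi, elast)
```

Since each measure is the previous one times a non-stochastic factor, their t-values coincide, as stated above.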
The distribution of the estimated probabilities is shown in histograms. The first histogram is the frequency distribution of the probabilities of the observed state. Consider the following example with three states:

The probabilities of the observed state are underlined. Their frequency distribution is shown in Figure 2.1a (e.g. 50% is in [0.4,0.5)), while on the right is the distribution of the estimated probabilities for the first state. These graphs can be made for each state, with S+1 histograms in total.
A norm observation has the sample mean for each explanatory variable. The probabilities of each response can be calculated for the norm observation, as well as for other user-defined norm observations. The probabilities of the norm observation do not equal the relative frequencies of the alternatives: p̂_{s}(V)≠f_{s}.
The predicted choice is the alternative with the highest predicted probability. The table of observed versus predicted choice is a rather crude measure, since 0.34,0.33,0.33 and 0.9,0.05,0.05 both select the first alternative as predicted response. For the example of §2.6 the table would read:

Let p̂(i) denote the estimated probability that individual i selects the alternative she actually chose. These are underlined in the example of §2.6. PcGive can print all p̂(i) which are smaller than a designated value.
In a panel data setting, we have time-series observations on multiple entities, for example companies or individuals. We denote the cross-section sample size by N, and, in an ideal setting, have t=1,...,T time-series observations covering the same calendar period. This is called a balanced panel. In practice, it often happens that some cross-sections start earlier, or finish later. The panel data procedures in PcGive are expressly designed to handle such unbalanced panels.
When T is large, N small, and the panel balanced, it will be possible to use the simultaneous-equations modelling facilities of PcGive. When T is small, and the model dynamic (i.e. includes a lagged dependent variable), the estimation bias can be substantial (see Nickell, 1981). Methods to cope with such dynamic panel data models are the primary focus of this part of PcGive, particularly the GMM-type estimators of Arellano and Bond (1991), Arellano and Bover (1995), and Blundell and Bond (1998), but also some of the Anderson and Hsiao (1982) estimators. Additional information can be found in Arellano (2003). In addition, PcGive makes several of the standard static panel data estimators available, such as between and within groups and feasible GLS. Table 4.1 lists the available estimators.
static panel data models  dynamic panel data models
OLS in levels  OLS in levels
Between estimator  one-step IV estimation (Anderson–Hsiao IV)
Within estimator  one-step GMM estimation
Feasible GLS  one-step estimation with robust std. errors
GLS (OLS residuals)  two-step estimation
Maximum likelihood (ML)
This chapter only provides a cursory overview of panel data methods. In addition to the referenced literature, there are several text books on panel data, notably Hsiao (1986) and more recently Baltagi (1995). Especially relevant is the book by Arellano (2003). Many general econometric text books have a chapter on panel data (but rarely treating dynamic models), see, e.g., Johnston and DiNardo (1997, Ch. 12).
The static single-equation panel model can be written as:

y_{it} = x_{it}'γ + λ_{t} + η_{i} + v_{it}, t = 1,...,T; i = 1,...,N. 

The λ_{t} and η_{i} are respectively time and individual specific effects and x_{it} is a k^{*} vector of explanatory variables. N is the number of cross-section observations. The total number of observations is NT.
Stacking the data for an individual according to time yields:
( y_{i1} ⋮ y_{iT} ) = ( x_{i1}' ⋮ x_{iT}' ) γ + ( λ_{1} ⋮ λ_{T} ) + ( 1 ⋮ 1 ) η_{i} + ( v_{i1} ⋮ v_{iT} ), 
or, more concisely:
y_{i} = X_{i}γ + λ + ι_{i} η_{i} + v_{i}, i = 1, ..., N, 
where y_{i}, λ, and v_{i} are T×1 vectors, X_{i} is a T×k^{*} matrix, and ι_{i} is a T column of ones. Using D_{i} for the matrix with dummies:
y_{i} = X_{i}γ + D_{i}δ + v_{i}, i = 1, ..., N. 
The next step is to stack all individuals, combining the data into W=[X : D]:

y = Xγ + Dδ + v = Wβ + v, with β = (γ' : δ')'. (eq:4.2)
Baltagi (1995) reviews the standard estimation methods used in this setting: OLS on the pooled data, the between estimator, the within estimator, feasible GLS, and maximum likelihood. More detail is provided in §7.3.
The general model that can be estimated with PcGive is a single equation with individual effects of the form:

y_{it} = ∑_{k=1}^{q} α_{k} y_{i,t−k} + β'(L)x_{it} + λ_{t} + η_{i} + v_{it}, t = q+1,...,T_{i}; i = 1,...,N, 

where η_{i} and λ_{t} are respectively individual and time specific effects, x_{it} is a vector of explanatory variables, β(L) is a vector of associated polynomials in the lag operator and q is the maximum lag length in the model. The number of time periods available on the ith individual, T_{i}, is small and the number of individuals, N, is large. Identification of the model requires restrictions on the serial correlation properties of the error term v_{it} and/or on the properties of the explanatory variables x_{it}. It is assumed that if the error term was originally autoregressive, the model has been transformed so that the coefficients α's and β's satisfy some set of common factor restrictions. Thus only serially uncorrelated or moving-average errors are explicitly allowed. The v_{it} are assumed to be independently distributed across individuals with zero mean, but arbitrary forms of heteroscedasticity across units and time are possible. The x_{it} may or may not be correlated with the individual effects η_{i}, and for each of these cases they may be strictly exogenous, predetermined or endogenous variables with respect to v_{it}. A case of particular interest is where the levels x_{it} are correlated with η_{i} but where Δx_{it} (and possibly Δy_{it}) are uncorrelated with η_{i}; this allows the use of (suitably lagged) Δx_{is} (and possibly Δy_{is}) as instruments for equations in levels.
The (T_{i}−q) equations for individual i can be written conveniently in the form:
y_{i}=W_{i}δ +ι_{i}η_{i}+v_{i}, 
where δ is a parameter vector including the α_{k}'s, the β's and the λ's, and W_{i} is a data matrix containing the time series of the lagged dependent variables, the x's and the time dummies. Lastly, ι_{i} is a (T_{i}−q)×1 vector of ones. PcGive can be used to compute various linear GMM estimators of δ with the general form:
δ̂=[ ( ∑_{i}W_{i}^{*'}Z_{i}) A_{N}( ∑_{i}Z_{i}'W_{i}^{*}) ] ^{−1}( ∑_{i}W_{i}^{*'}Z_{i}) A_{N}( ∑_{i}Z_{i}'y_{i}^{*}), 
where
A_{N}=( (1/N) ∑_{i}Z_{i}'H_{i}Z_{i}) ^{−1}, 
and W_{i}^{*} and y_{i}^{*} denote some transformation of W_{i} and y_{i} (e.g. levels, first differences, orthogonal deviations, combinations of first differences (or orthogonal deviations) and levels, deviations from individual means). Z_{i} is a matrix of instrumental variables which may or may not be entirely internal, and H_{i} is a possibly individual-specific weighting matrix.
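The general GMM formula can be implemented in a few lines. This is an illustrative sketch (not PcGive's code) on simulated data; as a check, it reproduces pooled OLS in the just-identified case Z_{i}=W_{i}^{*} with H_{i}=I:

```python
import numpy as np

def gmm(Ws, Zs, ys, Hs):
    """delta = [(sum W'Z) A (sum Z'W)]^{-1} (sum W'Z) A (sum Z'y),
    with A = (1/N sum Z'H Z)^{-1}; Ws, Zs, ys, Hs are per-individual lists."""
    N = len(Ws)
    WZ = sum(W.T @ Z for W, Z in zip(Ws, Zs))
    ZW = sum(Z.T @ W for Z, W in zip(Zs, Ws))
    Zy = sum(Z.T @ y for Z, y in zip(Zs, ys))
    A = np.linalg.inv(sum(Z.T @ H @ Z for Z, H in zip(Zs, Hs)) / N)
    return np.linalg.solve(WZ @ A @ ZW, WZ @ A @ Zy)

rng = np.random.default_rng(1)
Ws = [rng.normal(size=(5, 2)) for _ in range(20)]           # 20 individuals, 5 equations each
ys = [W @ np.array([1.0, -0.5]) + 0.1 * rng.normal(size=5) for W in Ws]
Hs = [np.eye(5)] * 20
delta = gmm(Ws, Ws, ys, Hs)   # Z_i = W_i: collapses to pooled OLS
print(delta)
```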
If the number of columns of Z_{i} equals that of W_{i}^{*}, A_{N} becomes irrelevant and δ̂ reduces to
δ̂ =( ∑_{i}Z_{i}'W_{i}^{*}) ^{−1}( ∑_{i}Z_{i}'y_{i}^{*}). 
In particular, if Z_{i}=W_{i}^{*} and the transformed W_{i} and y_{i} are deviations from individual means or orthogonal deviations [the forward orthogonal deviations transformation subtracts from each observation the mean of the remaining future observations:

x_{it}^{*}=( x_{it} − (1/(T−t)) ∑_{s=t+1}^{T} x_{is} ) ( (T−t)/(T−t+1) ) ^{1/2} for t=1,...,T−1. 

If the original errors are serially uncorrelated and homoscedastic, the transformed errors will also be serially uncorrelated and homoscedastic.], then δ̂ is the within-groups estimator. As another example, if the transformation denotes first differences, Z_{i}=I_{Ti}⊗x_{i}' and H_{i}=v̂_{i}^{*}v̂_{i}^{*'}, where the v̂_{i}^{*} are some consistent estimates of the first-differenced residuals, then δ̂ is the generalized three-stage least squares estimator of Chamberlain (1984). These two estimators require the x_{it} to be strictly exogenous with respect to v_{it} for consistency. In addition, the within-groups estimator can only be consistent as N→∞ for fixed T if W_{i}^{*} does not contain lagged dependent variables and all the explanatory variables are strictly exogenous.
When estimating dynamic models, we shall therefore typically be concerned with transformations that allow the use of lagged endogenous (and predetermined) variables as instruments in the transformed equations. Efficient GMM estimators will typically exploit a different number of instruments in each time period. Estimators of this type are discussed in Arellano (1988), Arellano and Bond (1991), Arellano and Bover (1995) and Blundell and Bond (1998). PcGive can be used to compute a range of linear GMM estimators of this type.
Where there are no instruments available that are uncorrelated with the
individual effects η_{i}, the transformation must eliminate this
component of the error term. The first difference and orthogonal deviations
transformations are two examples of transformations that eliminate η_{i}
from the transformed error term, without at the same time introducing all
lagged values of the disturbances v_{it} into the transformed error term.
Hence these transformations allow the use of suitably lagged endogenous (and predetermined) variables as instruments. For example, if the panel is balanced, q=1, there are no explanatory variables nor time effects, the v_{it} are serially uncorrelated, and the initial conditions y_{i1} are uncorrelated with v_{it} for t=2,...,T, then using first differences we have:
Equations  Instruments available
Δy_{i3}=αΔy_{i2}+Δv_{i3}  y_{i1}
Δy_{i4}=αΔy_{i3}+Δv_{i4}  y_{i1},y_{i2}
⋮  ⋮
Δy_{iT}=αΔy_{i,T−1}+Δv_{iT}  y_{i1},y_{i2},...,y_{i,T−2}
In this case, y_{i}^{*}=(Δy_{i3},...,Δy_{iT})', W_{i}^{*}=(Δy_{i2},...,Δy_{i,T−1})' and

Z_{i}=Z_{i}^{D}= ( (y_{i1}) 0 ⋯ 0 ; 0 (y_{i1},y_{i2}) ⋯ 0 ; ⋮ ; 0 0 ⋯ (y_{i1},...,y_{i,T−2}) ), 

a block-diagonal matrix whose row for the equation at time t contains the instruments (y_{i1},...,y_{i,t−2}).
Notice that precisely the same instrument set would be used to estimate the model in orthogonal deviations. Where the panel is unbalanced, for individuals with incomplete data the rows of Z_{i} corresponding to the missing equations are deleted, and missing values in the remaining rows are replaced by zeros.
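The block-diagonal instrument matrix Z_{i}^{D} can be built mechanically from the levels of y_{i}; a sketch for a balanced individual (the function name is hypothetical):

```python
import numpy as np
from scipy.linalg import block_diag

def ab_instruments(y_i):
    """Z_i^D for the first-differenced AR(1) equations t = 3..T:
    the row for period t holds (y_i1, ..., y_{i,t-2})."""
    T = len(y_i)
    rows = [y_i[:t].reshape(1, -1) for t in range(1, T - 1)]
    return block_diag(*rows)

Z = ab_instruments(np.array([1.0, 2.0, 3.0, 4.0, 5.0]))  # T = 5
print(Z.shape)  # (3, 6): 3 equations, 1 + 2 + 3 = 6 instrument columns
```

For an unbalanced individual one would, as described below, delete the rows for missing equations and zero out missing entries.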
In PcGive, we call one-step estimates those which use some known matrix as the choice for H_{i}. For a first-difference procedure, the one-step estimator uses
H_{i}=H_{i}^{D}= (1/2) ( 2 −1 0 ⋯ 0 ; −1 2 −1 ⋯ 0 ; 0 −1 2 ⋯ 0 ; ⋮ ; 0 0 0 ⋯ 2 ), 

i.e. one half times the tridiagonal matrix with 2 on the diagonal and −1 on the first off-diagonals (the covariance matrix, up to scale, of first differences of white-noise errors),
while for a levels or orthogonal deviations procedure, the one-step estimator sets H_{i} to an identity matrix. If the v_{it} are heteroscedastic, a two-step estimator which uses
H_{i}=v̂_{i}^{*}v̂_{i}^{*'}, 
where v̂_{i}^{*} are one-step residuals, is more efficient (cf. White, 1982). PcGive produces both one-step and two-step GMM estimators, with asymptotic variance matrices that are heteroscedasticity-consistent in both cases. Users should note that, particularly when the v_{it} are heteroscedastic, simulations suggest that the asymptotic standard errors for the two-step estimators can be a poor guide for hypothesis testing in typical sample sizes. In these cases, inference based on asymptotic standard errors for the one-step estimators seems to be more reliable (see Arellano and Bond, 1991, and Blundell and Bond, 1998, for further discussion).
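The two weighting choices can be written down directly; a sketch (function names are illustrative, not PcGive's):

```python
import numpy as np

def H_diff(m):
    """One-step H_i^D for m first-differenced equations: (1/2) * tridiag(-1, 2, -1)."""
    return (2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)) / 2.0

def H_twostep(v_hat):
    """Two-step H_i = v-hat v-hat', built from individual i's one-step residuals."""
    v_hat = v_hat.reshape(-1, 1)
    return v_hat @ v_hat.T

print(H_diff(3))
```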
In models with explanatory variables, Z_{i} may consist of submatrices with the block-diagonal form illustrated above (exploiting all or part of the moment restrictions available), concatenated to straightforward one-column instruments. A judicious choice of the Z_{i} matrix should strike a compromise between prior knowledge (from economic theory and previous empirical work), the characteristics of the sample and computer limitations (see Arellano and Bond, 1991, for an extended discussion and illustration). For example, if a predetermined regressor x_{it}, correlated with the individual effect, is added to the model discussed above, i.e.

y_{it} = αy_{i,t−1} + βx_{it} + η_{i} + v_{it}, 

then the corresponding optimal Z_{i} matrix is given by

Z_{i}= ( (y_{i1},x_{i1},x_{i2}) 0 ⋯ 0 ; 0 (y_{i1},y_{i2},x_{i1},x_{i2},x_{i3}) ⋯ 0 ; ⋮ ; 0 0 ⋯ (y_{i1},...,y_{i,T−2},x_{i1},...,x_{i,T−1}) ), 

a block-diagonal matrix whose block for the equation at time t contains y_{i1},...,y_{i,t−2} and x_{i1},...,x_{i,t−1}.
Where the number of columns in Z_{i} is very large, computational considerations may require those columns containing the least informative instruments to be deleted. Even when computing speed is not an issue, it may be advisable not to use the whole history of the series as instruments in the later cross-sections. For a given cross-sectional sample size (N), the use of too many instruments may result in (small sample) overfitting biases. When overfitting results from the number of time periods (T) becoming large relative to the number of individuals (N), and there are no endogenous regressors present, these GMM estimators are biased towards within groups, which is not a serious concern since the within-groups estimator is itself consistent for models with predetermined variables as T becomes large (see Álvarez and Arellano, 1998). However, in models with endogenous regressors, using too many instruments in the later cross-sections could result in seriously biased estimates. This possibility can be investigated in practice by comparing the GMM and within-groups estimates.
The assumption of no serial correlation in the v_{it} is essential for the consistency of estimators such as those considered in the previous examples, which instrument the lagged dependent variable with further lags of the same variable. Thus, PcGive reports tests for the absence of first-order and second-order serial correlation in the first-differenced residuals. If the disturbances v_{it} are not serially correlated, there should be evidence of significant negative first-order serial correlation in the differenced residuals (i.e. v̂_{it}−v̂_{i,t−1}), and no evidence of second-order serial correlation in the differenced residuals. These tests are based on the standardized average residual autocovariances, which are asymptotically N(0,1) variables under the null of no autocorrelation. The tests reported are based on estimates of the residuals in first differences, even when the estimator is obtained using orthogonal deviations.
More generally, the Sargan (1964) tests of overidentifying restrictions are also reported. That is, if A_{N} has been chosen optimally for any given Z_{i}, the statistic

S = ( ∑_{i}v̂_{i}^{*'}Z_{i}) A_{N}( ∑_{i}Z_{i}'v̂_{i}^{*}) 

is asymptotically distributed as a χ^{2} with as many degrees of freedom as overidentifying restrictions, under the null hypothesis of the validity of the instruments. Note that only the Sargan test based on the two-step GMM estimator is heteroscedasticity-consistent. Again, Arellano and Bond (1991) provide a complete discussion of these procedures.
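The Sargan statistic is a quadratic form in the stacked moments; a minimal sketch (function name and data are hypothetical, and A_N should in practice be the optimal weighting matrix):

```python
import numpy as np

def sargan(res_list, Z_list, A_N):
    """S = (sum_i v_i' Z_i) A_N (sum_i Z_i' v_i); chi^2 with
    (#instruments - #parameters) df under the null of valid instruments."""
    m = sum(Z.T @ v for Z, v in zip(Z_list, res_list))
    return float(m @ A_N @ m)

# Tiny example: 2 individuals, 3 residuals each, 2 instruments
res = [np.array([0.1, -0.2, 0.05]), np.array([-0.05, 0.1, 0.0])]
Z = [np.arange(6.0).reshape(3, 2), np.ones((3, 2))]
A = np.eye(2)
print(sargan(res, Z, A))
```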
Where there are instruments available that are uncorrelated with the
individual effects η_{i}, these variables can be used as instruments for
the equations in levels. Typically this will imply a set of moment
conditions relating to the equations in first differences (or orthogonal
deviations) and a set of moment conditions relating to the equations in
levels, which need to be combined
to obtain the efficient GMM estimator.
For example, if the simple AR(1) model considered earlier is mean-stationary, then the first differences Δy_{it} will be uncorrelated with η_{i}, and this implies that Δy_{i,t−1} can be used as instruments in the levels equations (see Arellano and Bover, 1995, and Blundell and Bond, 1998, for further discussion). In addition to the instruments available for the first-differenced equations that were described earlier, we then have:
Equations  Instruments available
y_{i3}=αy_{i2}+η_{i}+v_{i3}  Δy_{i2}
y_{i4}=αy_{i3}+η_{i}+v_{i4}  Δy_{i3}
⋮  ⋮
y_{iT}=αy_{i,T−1}+η_{i}+v_{iT}  Δy_{i,T−1}
Notice that no instruments are available in this case for the first levels equation (i.e., y_{i2}=αy_{i1}+η_{i}+v_{i2}), and that using further lags of Δy_{is} as instruments here would be redundant, given the instruments that are being used for the equations in first differences. In a balanced panel, we could use only the last levels equation (i.e., y_{iT}=αy_{i,T−1}+η_{i}+v_{iT}), where (Δy_{i2},Δy_{i3},...,Δy_{i,T−1}) would all be valid instruments; however this approach does not extend conveniently to unbalanced panels.
In this case, we use

y_{i}^{*}=(Δy_{i3},...,Δy_{iT}, y_{i3},...,y_{iT})', W_{i}^{*}=(Δy_{i2},...,Δy_{i,T−1}, y_{i2},...,y_{i,T−1})', 

and

Z_{i}=( Z_{i}^{D} 0 ; 0 diag(Δy_{i2},...,Δy_{i,T−1}) ), 
where Z_{i}^{D} is the matrix of instruments for the equations in first differences, as described above. Again Z_{i} would be precisely the same if the transformed equations in y_{i}^{*} and W_{i}^{*} were in orthogonal deviations rather than first differences. In models with explanatory variables, it may be that the levels of some variables are uncorrelated with η_{i}, in which case suitably lagged levels of these variables can be used as instruments in the levels equations, and in this case there may be instruments available for the first levels equation.
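The extra instrument block for the levels equations is just a diagonal matrix in lagged first differences; a sketch (function name hypothetical):

```python
import numpy as np

def level_instruments(y_i):
    """Instruments for the levels equations t = 3..T of the system estimator:
    equation t uses the single instrument dy_{i,t-1}."""
    dy = np.diff(y_i)          # (dy_i2, ..., dy_iT)
    return np.diag(dy[:-1])    # diag(dy_i2, ..., dy_{i,T-1})

Z_L = level_instruments(np.array([1.0, 3.0, 6.0, 10.0]))  # T = 4
print(Z_L)
```

Stacked block-diagonally with Z_{i}^{D}, this gives the system instrument matrix above.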
For the system of equations in first differences and levels, the one-step estimator computed in PcGive uses the weighting matrix

H_{i}=( H_{i}^{D} 0 ; 0 I_{i} ), 
where H_{i}^{D} is the weighting matrix described above for the first-differenced estimator, and I_{i} is an identity matrix with dimension equal to the number of levels equations observed for individual i. For the system of equations in orthogonal deviations and levels, the one-step estimator computed in PcGive sets H_{i} to an identity matrix with dimension equal to the total number of equations in the system for individual i. In both cases the corresponding two-step estimator uses H_{i}=v̂_{i}^{*}v̂_{i}^{*'}. We adopt these particular one-step weighting matrices because they are equivalent in the following sense: for a balanced panel where all the available linear moment restrictions are exploited (i.e., no columns of Z_{i} are omitted for computational or small-sample reasons), the associated one-step GMM estimators are numerically identical, regardless of whether the first difference or orthogonal deviations transformation is used to construct the system. Notice though that the one-step estimator is asymptotically inefficient relative to the two-step estimator for both of these systems, even if the v_{it} are homoscedastic.
Again, simulations have suggested that asymptotic inference based on the one-step versions may be more reliable than asymptotic inference based on the two-step versions, even in moderately large samples (see Blundell and Bond, 1998).
The validity of these extra instruments in the levels equations can be tested using the Sargan statistic provided by PcGive. Since the set of instruments used for the equations in first differences (or orthogonal deviations) is a strict subset of that used in the system of first-differenced (or orthogonal deviations) and levels equations, a more specific test of these additional instruments is a Difference Sargan test, which compares the Sargan statistic for the system estimator and the Sargan statistic for the corresponding first-differenced (or orthogonal deviations) estimator. Another possibility is to compare these estimates using a Hausman (1978) specification test, which can be computed here by including another set of regressors that take the value zero in the equations in first differences (or orthogonal deviations), and reproduce the levels of the right-hand side variables for the equations in levels.
The test statistic is then a Wald test of the hypothesis that the coefficients on these additional regressors are jointly zero. Full details of these test procedures can be found in Arellano and Bond (1991) and Arellano (1995).
This section gives a summary of the statistical output of PcGive, giving the formulae which are used in the computations.
No transformation:
x^{*}_{it} = x_{it}, t = 1,...,T_{i}. 
Deviation from individual means (within transformation), using the individual mean
x̄_{i} = (1/T_{i}) ∑_{s=1}^{T_{i}} x_{is}: 
x^{*}_{it} = x_{it} − x̄_{i}, t = 1,...,T_{i}. 
Orthogonal deviations, which subtract from each observation the mean of the remaining future observations:
x^{o}_{it} = ( x_{it} − (1/(T_{i}−t)) ∑_{s=t+1}^{T_{i}} x_{is} ) ( (T_{i}−t)/(T_{i}−t+1) )^{1/2}, t = 1,...,T_{i}−1. 
The orthogonal deviations are stored with one lag, so that the first observation is lost instead of the last (this brings it in line with first differencing):
x^{*}_{it} = x^{o}_{i,t−1}, t = 2,...,T_{i}. 
GLS (quasi-demeaning):
x^{*}_{it} = x_{it} − θ_{i} x̄_{i}, t = 1,...,T_{i}, 
where the choice of θ_{i} determines the method, as discussed in the next section.
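The within and orthogonal deviations transformations can be sketched as follows for one individual's series (pure NumPy; this does not reproduce PcGive's storage convention with the extra lag):

```python
import numpy as np

def within(x):
    """Deviation from the individual mean: x*_t = x_t - xbar_i."""
    return x - x.mean()

def orth_dev(x):
    """Forward orthogonal deviations for t = 1..T-1:
    x^o_t = (x_t - mean(x_{t+1},...,x_T)) * sqrt((T-t)/(T-t+1))."""
    T = len(x)
    out = np.empty(T - 1)
    for t in range(T - 1):                       # 0-based index: period t+1
        tail = x[t + 1:]
        out[t] = (x[t] - tail.mean()) * np.sqrt((T - t - 1) / (T - t))
    return out

x = np.array([1.0, 2.0, 4.0, 8.0])
print(within(x), orth_dev(x))
```

Both transformations remove the individual effect η_{i}, since a constant has zero within deviation and zero orthogonal deviation.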
The starting point is (eq:4.2):
y = Wβ + v. 
The total number of observations is O=NT. In an unbalanced panel, different individuals may have a different number of observations, T_{i}. In that case the total number of observations equals O=∑T_{i}. Baltagi (1995) reviews the standard estimation methods used in this setting: OLS on the pooled data, the between estimator (OLS on the individual means), the within estimator (OLS on deviations from individual means), feasible GLS, and maximum likelihood.
Here W has k columns, containing all specified regressors including dummies, so p=k.
θ_{i} = 1 − σ_{v}/σ_{i}, σ_{i}^{2} = σ^{2}_{v} + T_{i}σ^{2}_{η}. 
When OLS residuals u = v̂_{OLS} are used, the GLS estimator can be based on:

PcGive computes θ_{i} from σ^{2} _{v} and σ^{2}_{η} with the latter derived from σ̂ _{0}^{2} as:
σ̂ ^{2}_{η} = (σ̂_{0}^{2} − σ̂ _{v}^{2}) N/O . 
In a balanced panel: N/O=1/T. This is not optimal in an unbalanced panel, but seems reasonable.
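The quasi-demeaning weight follows directly from the variance components; a small sketch with hypothetical values:

```python
import numpy as np

sigma_v2, sigma_eta2 = 1.0, 0.5       # hypothetical variance components
T_i = np.array([4, 6, 10])            # group lengths in an unbalanced panel
tau = sigma_eta2 / sigma_v2
theta = 1 - (1 + T_i * tau) ** -0.5   # theta_i = 1 - sigma_v / sigma_i
print(theta)
```

As T_{i} grows, θ_{i} approaches 1 and GLS approaches the within estimator; as σ²_{η} goes to 0, θ_{i} goes to 0 and GLS approaches OLS.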
The standard feasible GLS estimator uses the between and within estimates:

For convenience, θ_{i} is specified using three variance components σ^{2} _{v}, σ^{2}_{a}, and σ^{2}_{η}:





where τ= σ^{2}_{η} / σ^{2}_{v}, so that:
θ_{i} = 1 − (1 + T_{i}τ)^{−1/2}. 
To summarize the implementation in PcGive:
  number of observations, n  degrees of freedom lost in estimation, p  transforms dummies
OLS (no transformation)  O  k  no
within  O  k+N  no
between  N  k  yes
GLS, ML  O  k  yes
It is important to note that the within transformations are only applied to X, and not to the dummies D. In the within estimator, individual dummies are redundant after subtracting the individual means. In the between estimator, time dummies are irrelevant, and individual dummies create a perfect fit. In GLS and ML, the individual means are subtracted with weights.
The single equation dynamic panel model can be written as:

y_{it} = ∑_{s=1}^{m} α_{s} y_{i,t−s} + x_{it}'γ + λ_{t} + η_{i} + v_{it}, t = 0,..., T_{i}−1; i = 0, ..., N−1. 

As before, λ_{t} and η_{i} are time and individual specific effects and x_{it} is a k^{*} vector of explanatory variables. It is assumed that allowance for lagged observations has been made, so observations on y_{i,−m}, ..., y_{i,−1} are available. The total number of observations available for estimation equals O=∑T_{i}.
Stacking the data for an individual according to time, and using D_{i} for the matrix with dummies:
y_{i} = X_{i}γ + D_{i}δ + v_{i}, i = 0, ..., N  1. 
In formulating the model, we distinguish the following types of variables:

The G_{i} are described under the Gmm function. From these variables, the dependent variable, regressor and instrument matrices are formed as follows:

where the ^{*} denotes some transformation which can be chosen from §7.1. This also affects the degrees of freedom, as shown in Table 7.1. The estimators and tests are described in Arellano and Bond (1991), and have the form:


transformation  number of observations, n  degrees of freedom lost in estimation, p  transforms dummies
differences  O−N  k  D^{s}=D (optional: D^{*})
orthogonal deviations  O−N  k  D^{s}=D (optional: D^{*})
none  O  k  no: D^{s}=D
within  O  k+N  no: D^{s}=D^{*}
between  N  k  yes: D^{s}=D^{*}
GLS  O  k  yes: D^{s}=D^{*}
When the option to concentrate out dummies is used, y^{*}, X^{*}, I^{*}, and L are replaced by the residuals from regressing on the set of dummies D^{s}. Subsequently, the dummies are omitted from further computations.
In one-step estimation, H_{i} is the identity matrix, except when estimating in first differences:

H_{1,i}^{diff} = (1/2) ( 2 −1 0 ⋯ ; −1 2 −1 ⋯ ; 0 −1 2 ⋯ ; ⋮ ), 

the tridiagonal weighting matrix for first differences described in §4. In two-step estimation H_{i} is based on the previous step residuals:

H_{2,i} = v̂_{1,i}^{*}v̂_{1,i}^{*'}, 

with a subscript added to distinguish between one- and two-step estimation.
The output reports the residual and total sum of squares:

RSS = ∑_{i} û_{i}^{*'}û_{i}^{*}, TSS = ∑_{i} (y_{i}^{*}−ȳ^{*})'(y_{i}^{*}−ȳ^{*}). 

So the TSS is in deviation from the mean.
The one-step estimated variance of the coefficients is:
V̂_{1}[β̂] = σ̂_{1,u}^{2} M_{1}^{−1}. 
If selected, robust standard errors after one-step estimation are computed as:


When there are no instruments, (eq:7.5) simplifies to:
(W'W)^{−1} (∑_{i=1}^{N} W_{i}'û_{i} û_{i}'W_{i}) (W'W)^{−1}, 
which is the robust variance estimator proposed in Arellano (1987).
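This sandwich form is simple to sketch numerically (illustrative only; the function name, block structure, and toy data are my own):

```python
import numpy as np

def robust_vcov(W_blocks, u_blocks):
    # (W'W)^{-1} (sum_i W_i' u_i u_i' W_i) (W'W)^{-1}:
    # heteroscedasticity-robust variance in the no-instrument case.
    WtW = sum(W.T @ W for W in W_blocks)
    meat = sum(W.T @ np.outer(u, u) @ W for W, u in zip(W_blocks, u_blocks))
    bread = np.linalg.inv(WtW)
    return bread @ meat @ bread

# N = 5 individuals, T = 4 periods, k = 2 regressors (toy data)
rng = np.random.default_rng(1)
W_blocks = [rng.normal(size=(4, 2)) for _ in range(5)]
u_blocks = [rng.normal(size=4) for _ in range(5)]
V = robust_vcov(W_blocks, u_blocks)
```

The middle ('meat') term sums outer products by individual, so arbitrary within-individual correlation of the residuals is allowed for.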
The two-step asymptotic variance matrix is:
V̂_{2}[β̂] = M_{2}^{−1}. 
This variance estimator can be severely downward biased in small samples, and is therefore only reported when the robust standard errors option is switched off. The preferred solution (and the default) is to use the robust variance for two-step estimation, V̂_{2r}[β̂], which uses the small-sample correction derived by Windmeijer (2000).
The AR test for order m is computed as:


Using w_{i} for the residuals lagged m periods (substituting zero for missing lags):
w_{it} = u_{i,t−m} for t = m, ..., T_{i}−1, and w_{it} = 0 for t < m. 
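Constructing w amounts to shifting each individual's residual vector by m places and padding with zeros; a sketch (function name is mine):

```python
import numpy as np

def lag_residuals(u, m):
    # w_t = u_{t-m} for t >= m; missing lags are replaced by zero.
    w = np.zeros_like(u)
    w[m:] = u[:-m]
    return w

u = np.array([1.0, 2.0, 3.0, 4.0])
w = lag_residuals(u, 2)
```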


The components are used as the notation suggests, except for H_{i} in one-step estimation:

After estimation in orthogonal deviations, the AR test (eq:7.6) is based on the residuals from the differenced model as follows (the superscript Δ refers to the regressors and residuals from the model in differenced form):


H_{i}^{Δ} is H_{1,i}^{diff} after one-step estimation, and u_{i}^{Δ'}u_{i}^{Δ} otherwise. Ψ_{i} equals (⋯)w_{i} after one-step estimation, and u_{i}u_{i}^{Δ'}w_{i}^{Δ} after one-step robust or two-step estimation.
The Sargan test, with z−k degrees of freedom, is after one-step estimation:
S_{1} = (∑_{i} v̂^{*'}_{1,i}Z_{i}) A_{1,N} (∑_{i} Z_{i}'v̂^{*}_{1,i}) / σ̂_{1,u}^{2}, 
and after two-step estimation (see eq:4.4 and equation (10) in Arellano and Bond, 1991):
S = (∑_{i} v̂^{*'}_{2,i}Z_{i}) A_{2,N} (∑_{i} Z_{i}'v̂^{*}_{2,i}). 
Three Wald tests are routinely reported to test the significance of groups of variables:
Wald (joint):  on W, all regressors except dummies, 
Wald (dummy):  on D, all dummies (including the constant), 
Wald (time):  on the time dummies (including the constant in the differenced/deviations model). 
In combined estimation, GMM-SYS, the levels equation is used jointly with the transformed equation; see Blundell and Bond (1998). The IV estimation of (eq:7.3) still applies, but the data are organized as follows:

G^{+}_{i} are the GMM-style instruments for the levels equation, described under the GmmLevel function.
The dummies in the transformed equation are always transformed. When using differences or deviations, the effective number of observations in the transformed equations is as before (O−N), whereas the levels equations have O observations. Internally, both the transformed and the levels equation for individual i have T_{i} equations, but, when using differences/deviations, the first observation in the transformed equation is set to zero, resulting in T_{i}−1 effective observations.
When the option to concentrate out dummies is used, y^{*}, X^{*}, I^{*}, and L are replaced by the residuals from regressing on the set of dummies D^{*}, and y and X by the residuals from regressing on D. Note that there is one extra time dummy in combined estimation compared to normal estimation when using differences or deviations.
After estimation, there are two sets of residuals:

The reported residual variance is based on the levels residuals: σ̂_{u+}^{2} = û^{+'}û^{+}/(n−p). Here n = O = ∑_{i=1}^{N} T_{i} and p are as before. The reported RSS and TSS are also based on the levels, using û^{+} and q^{+}.
In one-step estimation, H_{i} is as in (eq:7.4) in the transformed equations, and the identity matrix in the levels equations (½I when using differences).
Only robust standard errors are reported after one-step estimation.
In two-step estimation, H_{i} is based on the previous-step residuals û_{i}' = (û_{i}^{*'} : û_{i}^{+'}).
The AR test (eq:7.6) is based on the residuals from the transformed equation:

After orthogonal deviations, the ^{*} is replaced by ^{Δ} as in (eq:7.8).
The autoregressive conditional-heteroscedastic (ARCH; see Engle, 1982) and generalized ARCH (GARCH; see Bollerslev, 1986) models have found widespread application since their introduction. These models were designed to capture the volatility clustering that can be observed in macroeconomic series such as inflation (Engle, 1982), or in financial time series such as exchange rates and stock market returns. Surveys of GARCH-type models include Bera and Higgins (1993), Bollerslev, Engle, and Nelson (1994), Shephard (1996), and Gourieroux (1997); also see the textbooks by Hamilton (1994), Mills (1999) and Tsay (2005).
The ARCH(q) model featured briefly in Volume I as part of nonlinear modelling, and throughout in the form of the LM test for ARCH effects in the residuals of a regression model. The ARCH(q) model can be defined as:


Assuming ε_{t}~IN[0,1], identically and independently distributed for all t, gives u_{t}|u_{t−1},...,u_{t−q} ~ N(0,h_{t}). So, conditional on the past, the model is normal but heteroscedastic. The log-likelihood for observation t differs from the linear model with normal errors through the time dependence of the conditional variance:
l_{t}( μ, α | F_{t−1}) = c − ½ log( h_{t}) − ½ u_{t}^{2}/h_{t}, 
where F_{t−1} captures all the information up to time t−1.
In contrast with stochastic volatility models (see Shephard, 1996), where the variance is modelled as an unobserved (latent) process, the model in (eq:8.1) leads to an explicit expression for the log-likelihood: given values for μ and α=(α_{1},...,α_{q})', we can evaluate û_{t−q},...,û_{t−1} and therefore ĥ_{t} and l_{t}. This makes maximum likelihood estimation of the model in (eq:8.1), and its extensions discussed below, relatively easy in comparison with stochastic volatility models. There is a problem very early on in this recursion regarding the treatment of û_{t} for t≤0; this is discussed below. The first-order conditions for a maximum of the log-likelihood cannot be solved analytically in terms of the parameters; instead this requires the use of a numerical optimization routine.
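The recursion can be sketched as follows (a minimal illustration of the ARCH(q) likelihood; setting the pre-sample squared residuals to a constant h0 is one possible start-up choice, not necessarily PcGive's):

```python
import numpy as np

def arch_loglik(y, mu, alpha0, alpha, h0):
    # ARCH(q): h_t = alpha0 + sum_i alpha_i u_{t-i}^2, and per observation
    # l_t = -0.5 log(2 pi) - 0.5 log(h_t) - 0.5 u_t^2 / h_t.
    u = y - mu
    q = len(alpha)
    u2 = np.concatenate([np.full(q, h0), u**2])  # pre-sample values set to h0
    ll = 0.0
    for t in range(len(y)):
        h_t = alpha0 + sum(alpha[i] * u2[q + t - 1 - i] for i in range(q))
        ll += -0.5 * np.log(2 * np.pi) - 0.5 * np.log(h_t) - 0.5 * u[t]**2 / h_t
    return ll
```

With q=0 this collapses to the ordinary normal log-likelihood with constant variance α_{0}.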
More generally, the regression model with normal-GARCH(p,q) errors is defined as:


The ARCH(q) model corresponds to GARCH(0,q). The GARCH(p,q) model allows for a parsimonious parametrization of an ARCH(∞) structure: assuming that the roots of 1−β(L)=1−∑_{i=1}^{p}β_{i}L^{i} are outside the unit circle, and letting α(L)=∑_{i=1}^{q}α_{i}L^{i}, (eq:8.2) implies
h_{t} = α_{0}/(1−β(1)) + [α(L)/(1−β(L))] u_{t}^{2}. 
The conditional variance h_{t} must be non-negative, which requires that the coefficients of δ(L)=α(L)/(1−β(L)) are non-negative (and α_{0}≥0). A sufficient condition for this is that α_{i}≥0 and β_{i}≥0 for all i (however, this is overly restrictive; see Nelson and Cao, 1992).
The equation for h_{t} in (eq:8.2) can be written in ARMA form using ν_{t}=u^{2}_{t}−h_{t}=(ε^{2}_{t}−1)h_{t}:


where m = max(p,q) and we set β_{i}=0 for i>p, and α_{i}=0 for i>q. Note that E[ν_{t}|F_{t−1}]=0. It follows that u_{t} has a finite unconditional variance, given by E[u_{t}^{2}]=α_{0}/(1−α(1)−β(1)), if and only if the roots of
1−α(L)−β(L)=1−∑_{i=1}^{m}(α_{i}+β_{i})L^{i} 
are all outside the unit circle. A necessary and (given nonnegativity of α_{i} and β_{i}) sufficient condition for this is α(1)+β(1)=∑_{i=1}^{m}(α_{i}+β_{i})<1. Under this condition, the process u_{t} is weakly stationary. Equation (eq:8.3) also forms the basis for forecasting, because, for the GARCH(1,1) model:
E[h_{T+H}|h_{T}]=(α+β)^{H}h_{T} + α_{0}∑_{j=0}^{H−1}(α+β)^{j}. 
In the first forecast, ν_{T−1} still features, but afterwards it is set to its expectation of zero. The default parameter space in PcGive is formulated in terms of the ARMA representation, with α_{0}≥0, α_{i}+β_{i}≥0, and α(1)+β(1)<1.
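The GARCH(1,1) forecast recursion is easily coded; a sketch (the function name is mine):

```python
def garch11_forecast(h_T, alpha0, alpha, beta, H):
    # E[h_{T+H} | h_T] = (alpha+beta)^H h_T + alpha0 * sum_{j=0}^{H-1} (alpha+beta)^j
    ab = alpha + beta
    return ab**H * h_T + alpha0 * sum(ab**j for j in range(H))
```

When α+β<1 the forecast converges to the unconditional variance α_{0}/(1−α−β) as H grows; when α+β=1 it grows linearly as h_{T} + α_{0}H, the IGARCH case discussed next.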
When α(1)+β(1)=1, the ARMA representation (eq:8.3) has a unit root, so that u_{t} has no finite unconditional variance; this situation is called integrated GARCH (IGARCH). In the IGARCH(1,1) case:
E[h_{T+H}]=h_{T} + α_{0}H, 
which grows linearly with the forecast horizon H. However, the u_{t} process is strictly stationary (provided α_{0} > 0, see Nelson, 1990).
The GARCH model can also be estimated with standardized Studentt distributed errors (see §9.5). As the degrees of freedom go to infinity, this distribution approaches the standard normal.
We also consider the exponential GARCH, or EGARCH(p,q), specification (see Nelson, 1991):


with α_{1}=1. The log formulation implies that h_{t} is always non-negative, regardless of the (possibly negative) parameter values; stationarity is now related to the roots of 1−β(L), in particular to the question whether β(1)<1 (but the implementation in PcGive does not impose restrictions on the EGARCH parameter space). When θ_{1}≠0, the model allows for an asymmetric response. In the normal EGARCH model, the errors follow a standard normal distribution. PcGive also allows the EGARCH model with a generalized error distribution (GED), which is defined as:


where Γ(·) is the gamma function, and λ^{2}=2^{−2/ν}Γ(1/ν)/Γ(3/ν). It contains the standard normal distribution as a special case (ν=2), in which case E|ε_{t}|=(2/π)^{1/2}. When ν<2 the distribution has a thicker tail than the standard normal. PcGive estimates ν^{*}= log(ν/2), which is zero for the standard normal distribution; ν^{*}<0 implies thicker tails.
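The scale constant can be checked numerically; a sketch (assuming the usual normalization λ² = 2^{−2/ν}Γ(1/ν)/Γ(3/ν), under which ν=2 reproduces the standard normal):

```python
import math

def ged_lambda(nu):
    # lambda^2 = 2^(-2/nu) Gamma(1/nu) / Gamma(3/nu); lambda = 1 when nu = 2.
    return math.sqrt(2.0 ** (-2.0 / nu) * math.gamma(1.0 / nu) / math.gamma(3.0 / nu))
```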
Both the GARCH and EGARCH models can be estimated with the conditional variance in the mean, with a choice between h_{t}, h_{t}^{1/2} and log h_{t} entering the vector of regressors x_{t}.
Finally, asymmetric and threshold versions of the GARCH model are available, where h_{t} is specified as:


where D_{t−1}=1 if u_{t−1}<κ_{1} and zero otherwise; κ_{1} is the asymmetry parameter, and κ_{2} the threshold parameter. Either or both can be zero: a pure threshold model has κ_{1}=0 (Glosten, Jagannathan, and Runkle, 1993), and a pure asymmetric GARCH model has κ_{2}=0.
In summary, the following models can be estimated:

The default is a GARCH(1,1) model with normal errors, see (eq:8.2). If desired, other values for p and q can be selected.
The Model Settings dialog also has a range of options (the default is marked below with a *):
This is either a GARCH model with Student-t distributed errors, or an EGARCH model with errors that have a generalized error distribution.
Selecting both results in a threshold asymmetric GARCH model.
This imposes no restrictions, and could in rare cases lead to negative variance estimates.
Nelson and Cao (1992) argued that imposing all coefficients to be nonnegative is overly restrictive, and that negative estimates occur in practice (they list several examples). Subsequently, He and Teräsvirta (1999) have shown that such negative coefficients allow for richer shapes of the autocorrelation function. Doornik and Ooms (2008) discuss the implementation of these restrictions.
Doornik and Ooms (2008) also considered restrictions in terms of (eq:8.3). Defining m= max (p,q), β_{i}=0 for i>p, α_{i}=0 for i>q:


In terms of (eq:8.3), these restrictions imply that the unconditional variance exists and is always positive. Formally, PcGive imposes α(1)+β(1)<1 with strict inequality. However, numerically, the result can come so close as to be virtually indistinguishable from one. In that case the sum is assumed to be on the IGARCH boundary.
Bollerslev (1986) proposed the GARCH model with α_{0} ≥0, α_{i} ≥0, and β_{i} ≥0. This ensures that h_{t}≥0.
Evaluation of the likelihood requires presample values for u_{t}^{2} and h_{t}. PcGive implements two methods:
Bollerslev (1986) suggested using the mean of the squared residuals:


In EGARCH, the recursion is started with the logarithm of the mean of the squared residuals.
This amounts to adding the missing u^{2}_{1}...u^{2}_{m} as parameters to be estimated. This makes the likelihood derivatives more complex, and PcGive will only use numerical derivatives if this method is selected (see below).
This uses minus the inverse of the numerical second derivative of the loglikelihood function.
The information matrix is computed analytically, but only for standard GARCH models.
In addition, the robust standard errors are printed by default when the information matrix J is available. These are of the form suggested by Bollerslev and Wooldridge (1992): Ĵ^{−1}ĜĴ^{−1}, where Ĝ is the outer product of the gradients.
A comparison of various estimators is given in Fiorentini, Calzolari, and Panattoni (1996).
Especially when p>1, it is possible that the likelihood has multiple optima. This final set of advanced options allows for a search from random starting values. Because each of these involves maximization of the likelihood, this option can be time consuming. See Doornik and Ooms (2008) for an application.
PcGive only offers BFGS (see, e.g., Fletcher, 1987, or Gill, Murray, and Wright, 1981), because it is our preferred method for numerical optimization. BFGS avoids the need for second derivatives, while being one of the most robust methods available. It is supplemented by a line search when, at an iteration step, the likelihood does not increase. BFGS was not considered by Fiorentini, Calzolari, and Panattoni (1996), but we found 100% convergence when replicating their Table 1 with 1000 replications (requiring about 17 iterations on average, whether starting from the DGP values or from a starting-value routine).
PcGive uses the method of Galbraith and Zinde-Walsh (1997) applied to the squared data (after removing regressors in the mean). If necessary, the resulting parameter values are reduced to ensure that the unconditional variance exists.
Numerical derivatives are more convenient, but less accurate than analytical derivatives (see Fiorentini, Calzolari, and Panattoni, 1996). The latter are to be preferred, but convenience often dictates the use of the former. In simple GARCH models, we found numerical derivatives sufficiently effective, with model estimation taking the same amount of time, and convergence achieved as frequently.
PcGive uses analytical derivatives, except (1) when the Hessian matrix is required for the variance-covariance matrix, (2) when the initial values of the conditional variances are added as parameters, (3) for EGARCH-type models, and (4) for conditional-variance-in-mean models.
The fractionally-integrated autoregressive-moving average model, denoted arfima(p,d,q), can be used for the statistical analysis of a univariate time series y_{t} with long memory. The more familiar arma (or `Box-Jenkins') models (when d=0) and arima models (when d is a positive integer) are special cases. The arfima(p,d,q) model allows one to model the long-run behaviour of a time series in a flexible way. Empirical modelling involves identification, estimation and testing. In the identification stage of ARIMA modelling one determines the integer part of the order of differencing, d, and the orders of the autoregressive and moving-average parts of the model, p and q. In the ARFIMA model one can estimate d and compute a confidence interval. The arfima model allows for likelihood-ratio and Wald tests of the null of short-memory stationarity (d=0), as well as of the null of unit-root nonstationarity (d=1), against long-memory and intermediate-memory alternatives. These tests are easy to apply since they have conventional chi-squared limit distributions.
The tests complement widely-applied Dickey-Fuller-type tests for d=1 against short-memory alternatives, and tests for d=0 against d=1 based on variances of cumulative sums. PcGive's parameter-instability test applied to the constant term of a regression model can be interpreted as a test for d=0 with fixed mean against d=1 with a random-walk mean.
From the early nineteen-sixties onwards, when Mandelbrot suggested long-memory models for economic time series, there has been a steady growth in the literature on the subject. Robinson (1994) and Baillie (1996) provided useful surveys of the developments in the econometric modelling of long memory; Beran's monograph (Beran, 1994) discusses most of the central issues, including forecasting. The implementation in PcGive is derived from the Arfima package for Ox by Doornik and Ooms (1999), while an application and a more extensive discussion can be found in Doornik and Ooms (2004).
We write the arfima(p,d,q) model as:


where Φ(L)=(1−φ_{1}L−...−φ_{p}L^{p}) is the autoregressive polynomial and Θ(L)=(1+θ_{1}L+...+θ_{q}L^{q}) is the moving-average polynomial in the lag operator L; p and q are integers, d is real. For example, the arfima(1,d,1) model is:
(1−L)^{d}z_{t} = φ_{1}(1−L)^{d}z_{t−1} + ε_{t} + θ_{1}ε_{t−1}, 
where
z_{t} = y_{t}−μ_{t}. 
The notation in textbooks often differs in terms of the sign of the parameters: we follow the regression-model notation.
(1−L)^{d}=Δ^{d} is the fractional difference operator, defined by the following binomial expansion:
(1−L)^{d}=∑_{j=0}^{∞} δ_{j} L^{j} = ∑_{j=0}^{∞} (d over j)(−L)^{j}. 
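The coefficients δ_{j} satisfy the simple recursion δ_{0}=1, δ_{j}=δ_{j−1}(j−1−d)/j, which is how a diffpow-style filter can generate them; a sketch (function name is mine):

```python
def frac_diff_coeffs(d, n):
    # First n coefficients delta_j of (1-L)^d = sum_j delta_j L^j,
    # via the recursion delta_j = delta_{j-1} * (j - 1 - d) / j.
    delta = [1.0]
    for j in range(1, n):
        delta.append(delta[-1] * (j - 1 - d) / j)
    return delta
```

For d=1 this reproduces first differencing (1, −1, 0, ...); for fractional d the coefficients decay hyperbolically, which is the source of the long memory.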
We assume ε_{t}~NID[0,σ_{ε}^{2}], and write μ_{t} for the mean of y_{t}. The arma part of the model is assumed invertible and stationary: all roots of Θ(z)=0 and Φ(z)=0 are outside the unit circle. In addition, Θ(z)=0 and Φ(z)=0 have no common roots. We say that z_{t}=y_{t}−μ_{t} is I(d), integrated of order d. The zero-mean arfima(p,d,q) process z_{t} is covariance stationary if d<0.5, see Hosking (1981), with an autocovariance function that decays hyperbolically. The process has long memory in the case 0<d<0.5. For −0.5<d<0 the process is called intermediate memory or `overdifferenced', see Brockwell and Davis (1993) and, e.g., Chang and Dickey (1994). We assume d>−1, which makes the process z_{t} invertible, see Odaki (1993). We provide three estimation methods: Exact Maximum Likelihood (EML), Modified Profile Likelihood (MPL) and non-linear least squares (NLS). By definition, EML and MPL impose −1 < d < 0.5. MPL is preferred over EML if the model includes regressor variables and the sample is not very large. NLS allows for d > −0.5 and can be used to estimate models for nonstationary processes directly, without a priori differencing. NLS estimation is usually fast.
The next chapter shows how PcGive can be used to estimate arfima models. We discuss implementation details in Chapter 13.
This chapter gives a summary of the implementation details of the procedures in the Arfima package. It complements Doornik and Ooms (2004), while further computational aspects are discussed in Doornik and Ooms (2003).
The basic arma(p,q) model is
y_{t}=φ_{1}y_{t−1}+...+φ_{p}y_{t−p}+ε_{t}+θ_{1}ε_{t−1}+...+θ_{q}ε_{t−q},   t=1,...,T, 
assuming either ε_{t}~NID(0,σ_{ε}^{2}) , or E[ε_{t}]=0 and E[ε_{t}^{2}]=σ_{ε}^{2}. Using lag polynomials and introducing a mean μ we write:
Φ( L) ( y_{t}μ) =Θ( L) ε_{t}. 
With a fractional integration parameter d, the arfima(p,d,q) model is written as


The autocovariance function (ACovF) of a stationary arma process with mean μ:
c( i) = E[ ( y_{t}−μ) ( y_{t−i}−μ) ], 
defines the variance matrix of the joint distribution of y=(y_{1},...,y_{T})':


which is a Toeplitz matrix, denoted by T. Under normality:


The autocorrelation function, ACF: c( i) /c( 0), of a stationary arma process is discussed in many textbooks, and readily computed from the φ_{i} and θ_{i} using the Ox function armavar. We often work with the autocovariances scaled by the error variance:
r=[ r( 0) ... r( T−1) ] ' = σ_{ε}^{−2}[ c( 0) ... c( T−1) ] '. 
An algorithm for the computation of the ACovF of an arfima process is derived in Sowell (1992):






Here Γ is the gamma function, ρ_{j} are the roots of the AR polynomial (assumed distinct), and F(a,1;c;ρ) is the hypergeometric function, see e.g. Abramowitz and Stegun (1970, Ch. 15):
F(a,b;c;ρ)=∑_{i=0}^{∞} [ ( a) _{i}( b) _{i}/( c) _{i} ] ρ^{i}/i! , 
where we use Pochhammer's symbol:
( a) _{i}=a( a+1) ( a+2) ...( a+i−1) ,   ( a) _{0}=1. 
So (1)_{i} equals i!.
In the absence of AR parameters, (eq:13.4) reduces to
c( i) = σ_{ε}^{2}∑_{k=−q}^{q}ψ_{k} [ Γ(1−2d) Γ(d+i−k) ] / [ Γ(d) Γ(1−d) Γ(1−d+i−k) ]. 
Any set of exogenous regressors may be used to explain the mean:
z = y−μ,   μ = f(X,β), 
where X is a T×k matrix. In the leading linear case f(X,β)=Xβ and β is a k×1 vector.
Initial values for the parameter estimates are obtained in the order: regressors in the mean, d, the AR part, and finally the MA part. The very first step is to subtract a mean from y_{t}: z_{t}=y_{t}−μ_{t}. When either the sample mean or a specified (known, possibly zero) mean is used: μ_{t}=μ. If regressors are used, take μ_{t}=f(x_{t},β). In the linear case, β is obtained by regression.


( ⋯ ) φ_{0} = ( ⋯ ), 
where ρ̂( i) is the empirical autocorrelation of u_{t}. When q is zero, the matrix on the right-hand side is the Toeplitz matrix T[ ρ̂(0), ..., ρ̂( p−1) ].
We use OLS to solve this system, giving the initial estimates of the AR coefficients φ̂_{0}; this will also give a solution when the matrix is singular. Subsequently, the arma0 function is used to obtain autoregression residuals u_{t}^{*}.
When the initial values are used as starting values for further estimation, the following adjustments are made:
Based on normality (eq:13.3), and with a procedure to compute the autocovariances in (eq:13.2), the log-likelihood is simply (writing z for the data vector used for maximization):


It is convenient to concentrate σ_{ε}^{2} out of the likelihood, starting by writing Σ= Rσ_{ε}^{2}:
log L( d,φ,θ,β,σ_{ε}^{2}) ~ −½ log | R | − (T/2) log σ_{ε}^{2} − z'R^{−1}z/(2σ_{ε}^{2}). 
Differentiating with respect to σ_{ε}^{2}, and solving yields


with concentrated likelihood (CLF):
l_{c}( d,φ,θ,β) = −(T/2) log ( 2π) − T/2 − ½ log | R | − (T/2) log [ T^{−1}z'R^{−1}z] . 
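The CLF can be evaluated directly from the scaled autocovariances by forming the Toeplitz matrix R; a dense O(T²)-storage sketch (the package itself uses specialized Toeplitz routines, and the function name is mine):

```python
import numpy as np

def concentrated_loglik(z, r):
    # l_c = -T/2 log(2 pi) - T/2 - 1/2 log|R| - T/2 log(z' R^{-1} z / T),
    # where R is the Toeplitz matrix built from r(0), ..., r(T-1).
    T = len(z)
    R = np.array([[r[abs(i - j)] for j in range(T)] for i in range(T)])
    _, logdet = np.linalg.slogdet(R)
    quad = z @ np.linalg.solve(R, z)
    return -T / 2 * np.log(2 * np.pi) - T / 2 - 0.5 * logdet - T / 2 * np.log(quad / T)
```

For white noise (r = [1, 0, 0, ...]) this reduces to the ordinary normal log-likelihood concentrated over σ², a quick sanity check.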
When f(X,β)=Xβ it is more convenient to also concentrate β out of the likelihood. The resulting normal profile loglikelihood function becomes:




The function used in the maximization procedure is:


from which the value of the log-likelihood (eq:13.10) is easily derived. The computational procedure described in Doornik and Ooms (1999) writes
σ_{ε}^{2}=T^{−1}z'R^{−1}z=T^{−1}e'e, 
with R a byproduct of the procedure.
Function (eq:13.12) is maximized using BFGS with numerical derivatives. During estimation, stationarity is imposed at each step by rejecting parameter values which have:
In addition, the procedure can fail because:
The modified profile log-likelihood, l_{M}, for the regression model with stationary arfima errors and f(X,β)=Xβ:


see An and Bloomfield (1993), who applied the idea of Cox and Reid (1987) to reduce the bias of the EML estimator due to the presence of unknown nuisance parameters of the regressors.
The residual variance estimator now uses T−k, so that it is unbiased when p=q=d=0:


Defining e_{t} as the residuals from applying the arfima(p,d,q) filter to y_{t}μ_{t}, the residual variance is:




The arfima filter is computed using the Ox function diffpow, see (eq:13.7), followed by arma0. Since (eq:13.7) essentially drops the first observation, e_{1}=0 when d is estimated.
Function (eq:13.16) is maximized using BFGS with numerical derivatives, optionally with stationarity imposed.
Let θ'= [d φ' θ']. The variance-covariance matrix for the EML (l=l_{P}) and MPL (l=l_{M}) estimates is computed as:
[ −∂^{2}l(θ)/∂θ∂θ' ] ^{−1}. 
The second derivative of l is computed numerically.
For NLS, the variance-covariance matrix is the inverse of minus the numerical second derivative of (eq:13.16).
Estimation output consists of

where f for NLS is from (eq:13.16). So at the same parameter values, EML and NLS would report the same loglikelihood.
AIC.T = −2l̂ + 2s, 
where s is the number of estimated parameters. When no parameters are fixed: s=1+p+q+k+1 (the last term accounting for the residual variance). AIC.T/T is also reported.
It has been found in early Monte Carlo experiments that, in smaller samples, using the theoretical mean could lead to more accurate estimation of d (see e.g. Cheung and Diebold, 1994). This can be seen as the most effective way to reduce the effect of a nuisance parameter on inference for the parameter of interest. Therefore, the Arfima package allows for fixing the mean at a specified value. Let y_{t} denote the original dependent variable, and μ_{y} the known mean.
The z used in §13.3.3 for estimation when specifying a known mean is:
z_{t}=y_{t}−μ_{y}, 
otherwise the package uses
z_{t}=y_{t}−μ̂_{y}. 
The specification of the mean affects the likelihood. For the last term in the log-likelihood:
( y−μ) 'R^{−1}( y−μ) = y'R^{−1}y − 2μ'R^{−1}y + μ'R^{−1}μ, 
so the known-mean case adds μ_{y}ι'R^{−1}y, whereas the second case adds μ̂_{y}ι'R^{−1}y, and different results must be expected.
It is possible to fix d at a specific value, or drop arma terms.
A weight variable w, w_{t}≥0, can be used in estimation. Write w̃_{t} = w_{t}/w_{>0}, where w_{>0} is the mean of the positive weights.
Then (eq:13.12) for EML becomes:
−½ {T^{−1} log | R | − T^{−1}∑_{w̃_{t}>0} log w̃_{t} + log [ T^{−1}∑_{t=1}^{T} e_{t}^{2}w̃_{t} ] }. 
The NLS function is adjusted in a similar fashion. Weighted estimation is not available for MPL, and weights are ignored in forecasting. The weighted residuals, ê_{t}w̃_{t}^{1/2}, are used in the residual-based diagnostics.
With both additive normal regressors x_{t} and innovation Z variables z_{t} the arfima model becomes:
Φ( L) ( 1−L) ^{d} ( y_{t}−x_{t}'β ) = Θ( L)( ε_{t}+z_{t}'γ). 
The notation for the Z variables in this subsection should not be confused with z_{t}, the demeaned y_{t}. After applying the normal EML or NLS filter to z_{t}, z_{t}'γ̂ is subtracted at each iteration.
This model has the familiar ADL (autoregressive distributed-lag) model as a special case, since z_{t} can contain different lags of the same (exogenous) variable. Whereas additive outliers (and missing observations) can be handled using dummies for the X variables, see, e.g., Brockwell and Davis (1993, §12.3), innovation outliers can be modelled by dummies for Z variables. Note that adding a single-observation dummy for a Z variable has the same effect as giving that observation zero weight in the W variable. This is illustrated in fracest8.ox.
Z variables are not available for MPL.
Two methods of forecasting are supported, based on the results in Beran (1994, §8.7). As before, let z=(z_{1},...,z_{T})' denote the observations over the estimation period. Assume z_{t} is stationary and d>−1. The best linear prediction of z_{T+h} is
ẑ_{T+h}=[ r( T−1+h) ... r( h) ] {T[ r( 0) ,...,r( T−1) ] }^{−1}z = q'z, 
which consists of the reversed ACovF starting from h, times the original data weighted by their correlations. The solvetoeplitz function is used to solve Tx=z in order to keep storage requirements of order T, see Doornik and Ooms (1999). The mean square error is
MSE(ẑ_{T+h})=σ̂_{ε}^{2}( r( 0) − r'q) . 
In the presence of a meanfunction μ_{t}=f(x_{t},β) the forecasts are:
ŷ_{T+h}=q'( y−μ̂) + μ̂_{T+h} = q'( y−μ̂) + x_{T+h}'β̂. 
The program computes all requested forecasts ẑ_{h}=(ẑ_{T+1},...,ẑ_{T+h})' and their joint variance-covariance matrix, Cov(ẑ_{h}), simultaneously. Cov(ẑ_{h}) is also used to derive the mean squared errors of partial sums, ∑_{i=1}^{h} ẑ_{T+i}, integrated partial sums, etc.
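A dense sketch of the best linear predictor and its MSE (the package solves the Toeplitz system without forming the matrix; function and argument names are mine):

```python
import numpy as np

def blp_forecast(z, r_full, h, sigma2):
    # z_hat_{T+h} = [r(T-1+h) ... r(h)] {T[r(0),...,r(T-1)]}^{-1} z = q'z,
    # MSE(z_hat_{T+h}) = sigma2 * (r(0) - r'q).
    T = len(z)
    Tmat = np.array([[r_full[abs(i - j)] for j in range(T)] for i in range(T)])
    rev = np.array([r_full[T - 1 + h - i] for i in range(T)])  # r(T-1+h), ..., r(h)
    q = np.linalg.solve(Tmat, rev)
    return q @ z, sigma2 * (r_full[0] - rev @ q)
```

For white noise every forecast is zero and the MSE equals the innovation variance, a quick sanity check.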
`Naive' forecasts are derived from the autoregressive representation of the process, truncated at T+h:
Θ^{−1}( L) Φ( L) ( 1−L) ^{d}z_{t}=( 1−b_{1}L−...−b_{T+h−1}L^{T+h−1}) z_{t}=B( L) z_{t}. 
In this case the z_{t} need not be stationary, cf. Beran (1995), but d>−0.5. The first T coefficients of the ( 1−L) ^{d} polynomial can be computed using the diffpow function when the input is a one followed by T−1 zeroes; this follows from (eq:13.7). For polynomial multiplication and division, polymul and polydiv are used. The naive forecasts are computed recursively:
ẑ_{T+h}^{*}=[ b_{T+h−1} ... b_{1}] ×[ z' ẑ_{T+1} ... ẑ_{T+h−1}] ', 


where the a_{i} are the coefficients of B^{−1}( L) .
Level forecasts are computed by adding the (integrated) partial sums of forecasts to a specified starting level. The reported MSE of the integrated naive forecasts can be obtained directly from (eq:13.17).
Forecasting with Z variables is not yet implemented.