These reference chapters have been taken from Volume III, and use the same chapter and section numbering as the printed version.
The autoregressive conditional-heteroscedastic (ARCH, see Engle, 1982) and generalized ARCH (GARCH, see Bollerslev, 1986) models have found widespread application since their introduction. These models were designed to capture the volatility clustering which can be observed in macro-economic series such as inflation (Engle, 1982), or financial time-series such as exchange rates and stock market returns. Recent surveys of GARCH-type models include Bera and Higgins (1993), Bollerslev, Engle and Nelson (1994), Shephard (1996), and Gourieroux (1997), also see the text books by Hamilton (1994) and Mills (1999).
The ARCH(q) model featured briefly in Volume I as part of non-linear modelling, and throughout in the form of the LM test for ARCH effects in the residuals of a regression model. The ARCH(q) model can be defined as:
|
|
Assuming εt~IN[0,1], identically and independently distributed for all t, gives ut|ut-1...ut-q~N(0,h t). So conditional on the past, the model is normal but heteroscedastic. The log-likelihood for observation t differs from the linear model with normal errors through the time-dependence of the conditional variance:
| lt( μ, α|Ft-1) = c-½ log ( h t) -½ |
| , |
where Ft-1 captures all the information up to t-1.
In contrast with stochastic volatility models (see Shephard, 1996), where the variance is modelled as an unobserved (latent) process, the model in (eq:2.1) leads to an explicit expression for the log-likelihood: given values for μ and α=(α1,...,αq)', we can evaluate ût-q,...,ût-1 and therefore ĥt and lt. This makes maximum likelihood estimation of the model in (eq:2.1), and its extensions discussed below, relatively easy in comparison with stochastic volatility models. There is a problem very early on in this recursion regarding the treatment of ût for t<0 --- this is discussed below. The first-order conditions for a maximum of the log-likelihood cannot be solved analytically in terms of the parameters; instead this requires the use of a numerical optimization routine.
More generally, the regression model with normal-GARCH(p,q) errors is defined as:
|
|
The ARCH(q) model corresponds to GARCH(0,q). The GARCH(p,q) model allows for a parsimonious parametrization of an ARCH(∞) structure: assuming that the roots of 1-β(L)=1-∑i=1pβiLi are outside the unit circle, and letting α(L)=∑i=1qαiLi, (eq:2.2) implies
| ht= |
| + |
| ut2. |
The conditional variance ht must be nonnegative, which requires that the coefficients of δ(L)=α(L)/(1-β(L)) are nonnegative (and α0≥0). A sufficient condition for this is that αi≥0 and βi≥0 for all i (however this is overly restrictive; see Nelson and Cao, 1992).
The equation for ht in (eq:2.2) can be written in ARMA form using νt=u2t-ht=(ε2t-1)ht:
|
|
where m= max (p,q) and we set βi=0 for i>p, and αi=0 for i>q. Note that E [νt|Ft-1]=0. It follows that ut has a finite unconditional variance, given by E[ut2]=α0/(1-α(1)-β(1)), if and only if the roots of
| 1-α(L)-β(L)=1-∑i=1m(αi+βi)Li |
are all outside the unit circle. A necessary and (given nonnegativity of αi and βi) sufficient condition for this is α(1)+β(1)=∑i=1m(αi+βi)<1. Under this condition, the process ut is weakly stationary. Equation (eq:2.3) also forms the basis for forecasting, because, for the GARCH(1,1) model:
| E[hT+H|hT]=(α+β)HhT + α0[∑j=0H-1(α+β)j]. |
In the first forecast, νT-1 still features, but afterwards
it is set to its expectation of zero. The default
parameter space in PcGive is formulated in terms of the ARMA
representation, with α0≥0, αi+βi≥0,
and α(1)+β(1)<1.
When α(1)+β(1)=1, the ARMA representation (eq:2.3) has a unit root, so that ut has no finite unconditional variance; this situation is called integrated GARCH (IGARCH). In the IGARCH(1,1) case:
| E[hT+H]=hT + α0H, |
which grows linearly with the forecast horizon H. However, the ut process is strictly stationary (provided α0 > 0, see Nelson, 1990).
The GARCH model can also be estimated with standardized Student-t distributed errors (see §2.5). As the degrees of freedom go to infinity, this distribution approaches the standard normal.
We also consider the exponential GARCH, or EGARCH(p,q), specification (see Nelson, 1991):
|
|
with α1=1. The log-formulation implies that ht is always nonnegative, regardless of the (possibly negative) parameter values; stationarity is now related to the roots of 1-β(L), in particular to the question whether β(1)<1 (but the implementation in PcGive does not impose restrictions on the EGARCH parameter space). When θ1≠0, the model allows for an asymmetric response. In the normal EGARCH model, the errors follow a standard normal distribution. PcGive also allows the EGARCH model with a generalized error distribution (GED), which is defined as:
|
|
where Γ(.) is the gamma function, and λ2=2-2/νΓ(1/ν)/Γ(3/ν). It contains the standard normal distribution as a special case, in which case E| εt| =(2/π)1/2. When ν< 2 the distribution has a thicker tail than the standard normal. PcGive estimates ν*= log ν/2, which is zero for the standard normal distribution; in that case ν*<0 implies thicker tails.
Both the GARCH and EGARCH models can be estimated with the conditional variance in the mean, with a choice between ht, ht1/2 and log ht entering the vector of regressors xt.
Finally, asymmetric and threshold versions of the GARCH model are available, where ht is specified as:
|
|
where Dt-1=1 if u t-1<κ1 and zero otherwise;
κ1 is the asymmetry parameter, and κ2 the threshold
parameter. Either or both can be zero: a pure threshold model
has κ1=0 (Glosten, Jagannathan and Runkle, 1993),
and a pure asymmetric GARCH model has κ2=0.
In summary, the following models can be estimated:
|
The default is a GARCH(1,1) model with normal errors, see (eq:2.2). If desired, other values for p and q can be selected.
The Model Settings dialog also has a range of options (the default is marked below with a *):
This is either a GARCH model with student-t distributed errors or an EGARCH model with errors which have a generalized error distribution.
Selecting both results in a threshold asymmetric GARCH model,
This imposes no restrictions, and could in rare cases lead to negative variance estimates.
Nelson and Cao (1992) argued that imposing all coefficients to be nonnegative is overly restrictive, and that negative estimates occur in practice (they list several examples). Subsequently, He and Teräsvirta (1999) have shown that such negative coefficients allow for richer shapes of the autocorrelation function. Doornik and Ooms (2000) discuss the implementation of these restrictions.
Doornik and Ooms (2000) also considered restrictions in terms of (eq:2.3). Defining m= max (p,q), βi=0 for i>p, αi=0 for i>q:
|
|
In terms of (eq:2.3), these restrictions imply that the unconditional variance exists, and is always positive. Formally, PcGive imposes α(1)+β(1)<1 with strict inequality. However, numerically, the result can get so close as to being virtually indistinguishable from one. In that case the sum is assumed to be on the IGARCH boundary.
Bollerslev (1986) proposed the GARCH model with α0 ≥0, αi ≥0, and βi ≥0. This ensures that ht≥0.
Evaluation of the likelihood requires pre-sample values for ut2 and ht. PcGive implements two methods:
Bollerslev (1986) suggested using the mean of the squared residuals:
|
|
In EGARCH, the recursion is started with the logarithm of the mean of the squared residuals.
This amounts to adding the missing u21...u2m as parameters which are to be estimated. This make the likelihood derivatives more complex, and PcGive will only use numerical derivatives if this method is selected (see below).
This uses minus the inverse of the numerical second derivative of the log-likelihood function.
The information matrix is computed analytically, but only for standard GARCH models.
In addition, the robust standard errors are printed by default when the information matrix J is available. These are of the form suggested by Bollerslev and Wooldridge (1992): Ĵ-1ĜĴ-1, where Ĝ is the outer product of the gradients.
A comparison of various estimators is given in Fiorentini, Calzolari and Panattoni, 1996.
Especially when p>1, it is possible that the likelihood has multiple optima. This final set of advanced options allows for a search from random starting values. Because each of these involves maximization of the likelihood, this option can be time consuming. See Doornik and Ooms (2000) for an application.
PcGive only offers BFGS (see e.g. Fletcher, 1987 or Gill, Murray and Wright, 1981), because it is our preferred method for numerical optimization. BFGS avoids the need for second derivatives, while being one of the most robust methods available. This is supplemented by a line search when, at an iteration step, the likelihood does not increase. BFGS was not considered by Fiorentini, Calzolari and Panattoni (1996), but we found 100%convergence when replicating their Table 1 with 1000 replications (requiring about 17 iterations on average, whether starting from the DGP values, or from a starting value routine).
PcGive uses the method of Galbraith and Zinde-Walsh (1997) applied to the squared data (after removing regressors in the mean). If necessary, the resulting parameter values are reduced to ensure that the unconditional variance exists.
Numerical derivatives are more convenient, but less accurate than analytical derivatives (see Fiorentini, Calzolari and Panattoni, 1996). The latter are to be preferred, but convenience often dictates the use of the former. In simple GARCH models, we found numerical derivatives sufficiently effective, with model estimation taking the same amount of time, and convergence achieved as frequently.
PcGive uses analytical derivatives, except (1) when the Hessian matrix is required for the variance-covariance matrix, (2) when the initial values of the conditional variances are added as parameters, (3) for EGARCH-type models and (4) for conditional variance in mean models.
In a discrete choice model the dependent variable only takes on integer values. Such a discrete dependent variable can denote a category (e.g. mode of transport: car, train, or bus) or count the number of events (e.g. the number of accidents in a week). A model with a categorical dependent variable is often called a discrete choice model, but is also referred to as dummy-endogenous or qualitative response model. The main reference for discrete choice models is Cramer (2001), and we will frequently omit details when a reference to Cramer (2001) can be made instead. The use of ordinary least squares in a discrete choice model is called the linear probability model. The main shortcoming of this model is that it can predict probabilities that are outside the [0,1] interval. PcGive implements logit and probit estimation which avoids this drawback.
General references include McFadden (1984), Amemiya (1981) and Cramer (2001), among others. In the remainder we exclusively refer to Cramer (2001).
In the simple discrete choice model the dependent variable only takes on two values, for example, when modelling car ownership:
|
With a discrete dependent variable, interest lies in modelling the probabilities of observing a certain outcome:
| pi=P{yi=1}. |
Using OLS on a dummy dependent variable:
| yi = xi'β+εi |
does not restrict ŷi between 0 and 1, which would be required if it is to be interpreted as a probability. Also, the disturbances cannot be normally distributed, as they only take on two values: εi=1-pi or εi=0-pi, writing pi=xi'β, and are heteroscedastic.
A simple solution is to introduce an underlying continuous variable Ui, which is not observed. Observed is:
|
|
Now we can introduce explanatory variables:
| Ui=xi'β-εi. |
and write
| pi=P{yi=1}=P{xi'β-εi≥0}=Fε ( xi'β) . |
Observations with yi=1 contribute pi to the likelihood, observations with yi=0 contribute 1-pi:
| L( β|X) =∏{yi=0}( 1-pi) ∏{yi=1}pi, |
and the log-likelihood becomes:
| l( β|X) =∑i=1N[ ( 1-yi) log ( 1-pi) +yi log pi] . |
The choice of Fε determines the method. Using the logistic distribution:
| Fε ( z) = |
|
leads to logit. Logit has a linear log-odds ratio:
| log ( |
| ) =xi'β. |
The standard normal distribution gives probit. We can multiply yi* by any non-zero constant without changing the outcome, the scale of these distributions is fixed: the logistic has variance π2/3, the standard normal has variance equal to 1.
Assume that observations on n individuals (or households, firms, etc.) are obtained by random sampling, and that all individuals have made a choice from the same set of alternatives A={1,2,...,S}. Using subscript i to denote an individual and s to denote an alternative, we can write the dependent variable as yi:
| yi = s if individual i has made choice s, s=1,...,S; i=1,...,n. |
The existence of an unobserved continuous variable Uis is assumed, where the alternative with the highest value is chosen:
| yi = s if Uis= max k∈A {Uik}. |
See Cramer (2001, Ch. 3) for an interpretation in terms of utility maximization. Adding a stochastic component:
|
|
Logit and probit specify a certain distribution for εis. The systematic part Vis is a functional form of explanatory variables and coefficients. We can distinguish three types of explanatory variables:
The xi are called alternative independent regressors, while the remaining two types, called alternative dependent, are treated identically in PcGive.
Moreover, there are two types of coefficients:
The full specification in PcGive is:
|
|
So alternative independent regressors always enter with an alternative specific coefficient, while alternative dependent regressors have a generic coefficient.
As an illustration, consider the transport mode example:
|
Then xi may be the income of individual i, and zis the travel time for individual i using mode s. Interest may focus on the probabilities that individual i uses mode s, or the coefficients, to estimate the influence on marginal utility of explanatory variables. In the absence of alternative specific parameters it is possible to investigate the consequences of introducing a new mode of transport.
Let pis denote the probability that individual i chooses alternative s. Then from (eq:5.2):
|
The log-likelihood function is:
|
|
where
|
The multinomial logit (MNL) model arises from (eq:5.2) when the εik are independently type I extreme value distributed (see Cramer, 2001, Ch. 3). This is the most convenient model, because of the closed form of the probabilities:
|
|
and the concavity of the log-likelihood.
The log-odds ratio is:
| log (pis/pik)=Vis-Vik. |
Substituting (eq:5.3):
| Vis-Vik = xi'(βs-βk) +(zis-zik)'γ, |
shows that a normalization is needed on βk: a constant can be added (for all k) without changing the probabilities. PcGive takes:
| β1 = 0, |
and the first alternative is then called the reference state.
Although the name multinomial logit in PcGive is generally used for the specification (eq:5.3), it sometimes stands for a more specific functional form:
| Vis=xi'βs. |
In that case, the conditional logit model refers to
| Vis=zis'γ. |
When Vis does not depend on attributes of alternatives other than alternative s, the MNL model has the independence from irrelevant alternatives (IIA) property, see Cramer (2001, Ch. 3).
PcGive allows the specification of a weight variable in the log-likelihood:
| l(θ) = ∑i=1nwi∑s=1SYis log Pis(θ). |
The use could be two-fold:
In multinomial logit, only the intercept is incorrect when treating a choice-based sample as a random sample (see Maddala, 1983, pp 90-91). Manski and Lerman (1977) proposed a weighted estimation to get consistent estimates for choice-based samples. Let Q0(s) be the population fraction choosing alternative s, and H(s) the fraction in the sample choosing s. Denote the observed response of individual i as si, then the weights are:
| wi = Q0(si) / H(si). |
Also see Amemiya and Vuong (1987).
The coefficient estimates β̂s, s=2,...,S and γ̂ form the basis for the estimated probabilities for each observation: p̂is, s=1,...,S, i=1,...,n.
If the estimated model contains a constant term, it can be compared with the baseline model in a likelihood ratio test. The baseline model has only an intercept, and the resulting test is analogous to the regression F-test. In the baseline MNL model:
| Vis = αs, s=2,...,S, |
the likelihood can be maximized explicitly, resulting in:
| αs = log (fs/f1), |
where fs is the relative frequency in the sample of alternative s. With K alternative independent regressors (x) and C alternative dependent regressors (z), the estimated model has K(S-1)+C parameters, and a log-likelihood of l(θ̂), say. The baseline model has S-1 parameters, and a log-likelihood of l(α̂), so that 2l(θ̂) - 2l(α̂) is χ2[(K-1)(S-1)+C] distributed.
If the estimated model has no intercept, the zeroline is used in the test. This is a model without parameters:
| p̂is= 1/S . |
In a multinomial logit model that does have an intercept, the sum of the estimated probabilities for state s equals the number of observations having state s: 1/n ∑i=1n p̂is=fs. This follows, because at the maximum, the derivatives of the log-likelihood:
| =∑i=1n (Yis-pis)x'i |
are zero, and hence, if one of the regressors is a row of ones:
| ∑i=1n Yis=∑i=1n p̂is. |
With grouped data, the log-likelihood of the saturated model is reported. This is a model with perfect fit: the predicted frequencies equal the observed frequencies. Let nis be the number in group i having response s. Then ∑s=1Snis=ni and ∑i=1Gni=n, with G the number of groups and n the total number of observations. The saturated model
| p̂is= |
|
has G(1-S) parameters and a log-likelihood of
| ∑i=1G∑s=1Snis log |
| . |
The saturated log-likelihood is not reported if any cell has nis=0.
The derivatives of the probabilities with respect to the regressors give the effect of a change in one of the regressors on the probabilities. Since the probabilities sum to one, the derivatives sum to zero. The derivatives for the MNL model are:
|
We can substitute the relative frequency of each state for the probabilities (p̂s=fs) or select other relative frequencies. In the calculation of the t-values the p̂s are treated as non-stochastic.
In the binary probit model the derivatives are:
| =φ(V̂)βm. |
These derivatives are calculated at the mean of the explanatory variables (V̂=\bf x'β̂), but can be calculated at other values.
PcGive also reports quasi-elasticities:
| = |
| xm. |
If xm increases with q%, ps will approximately increase by q×∂ps/∂ log xm percentage points. Elasticities are also reported:
|
| . |
All three measures have the same t-values.
The distribution of the estimated probabilities is shown in histograms. The first histogram is the frequency distribution of the probabilities of the observed state. Consider the following example with three states:
|
The probabilities of the observed state are underlined. Their frequency distribution is shown in Figure Figure:5.1a (e.g. 50% is in [0.4,0.5), while on the right is the distribution of the estimated probabilities for the first state. These graphs can be made for each state, with S+1 histograms in total.

A norm observation has the sample mean for each explanatory variable. The probabilities of each response can be calculated for the norm observation, as well as for other user-defined norm observations. The probabilities of the norm observation do not equal the relative frequencies of the alternatives: p̂s(V)≠fs.
The predicted choice is the alternative with the highest predicted probability. The table of observed versus predicted choice is a rather crude measure, since 0.34,0.33,0.33 and 0.9,0.05,0.05 both select the first alternative as predicted response. For the example of §5.6 the table would read:
| ||||||||||||||||||||||||||||||||
Let p̂(i) denote the estimated probability that individual i selects the alternative she actually choose. These are underlined in the example of §5.6. PcGive can print all p̂(i) which are smaller than a designated value.
In a panel data setting, we have time-series observations on multiple entities, for example companies or individuals. We denote the cross-section sample size by N, and, in an ideal setting, have t=1,...,T time-series observations covering the same calendar period. This is called a balanced panel. In practice, it often happens that some cross-sections start earlier, or finish later. The panel data procedures in PcGive are expressly designed to handle such unbalanced panels.
When T is large, N small, and the panel balanced, it will be possible to use the simultaneous-equations modelling facilities of PcGive. When T is small, and the model dynamic (i.e. includes a lagged dependent variable), the estimation bias can be substantial (see Nickell, 1981). Methods to cope with such dynamic panel data models are the primary focus of this part of PcGive, particularly the GMM-type estimators of Arellano and Bond (1991), Arellano and Bover (1995), and Blundell and Bond (1998), but also some of the Anderson and Hsiao (1982) estimators. In addition, PcGive makes several of the standard static panel data estimators available, such as between and within groups and feasible GLS. Table Table:7.1 lists the available estimators.
| static panel data models | dynamic panel data models | |
| OLS in levels | OLS in levels | |
| Between estimator | one-step IV estimation (Anderson--Hsiao IV) | |
| Within estimator | one-step GMM estimation | |
| Feasible GLS | one-step estimation with robust std. errors | |
| GLS (OLS residuals) | two-step estimation | |
| Maximum likelihood (ML) |
This chapter only provides a cursory overview of panel data methods. In addition to the referenced literature, there are several text books on panel data, notably Hsiao (1986) and more recently Baltagi (1995). Many general econometric text books have a chapter on panel data (but rarely treating dynamic models), see, e.g., Johnston and DiNardo (1997, Ch. 12).
The static single-equation panel model can be written as:
|
|
The λt and ηi are respectively time and individual specific effects and xit is a k* vector of explanatory variables. N is the number of cross-section observations. The total number of observations is NT.
Stacking the data for an individual according to time yields:
| ( |
| ) = ( |
| ) + ( |
| ) + ( |
| ) + ( |
| ), |
or, more concisely:
| yi = Xiγ + λ + ιi ηi + vi, i = 1, ..., N, |
where yi, Xi, and λi are T×1, and ιi is a T column of ones. Using Di for the matrix with dummies:
| yi = Xiγ + Diδ + vi, i = 1, ..., N. |
The next step is to stack all individuals, combining the data into W=[X : D]:
|
|
Baltagi (1995) reviews the standard estimation methods used in this setting:
More detail is provided in §10.3.
The general model that can be estimated with PcGive is a single equation with individual effects of the form:
|
|
where ηi and λt are respectively individual and time specific effects, xit is a vector of explanatory variables, β(L) is a vector of associated polynomials in the lag operator and q is the maximum lag length in the model. The number of time periods available on the ith individual, Ti, is small and the number of individuals, N, is large. Identification of the model requires restrictions on the serial correlation properties of the error term vit and/or on the properties of the explanatory variables xit. It is assumed that if the error term was originally autoregressive, the model has been transformed so that the coefficients α's and β's satisfy some set of common factor restrictions. Thus only serially uncorrelated or moving-average errors are explicitly allowed. The vit are assumed to be independently distributed across individuals with zero mean, but arbitrary forms of heteroscedasticity across units and time are possible. The xit may or may not be correlated with the individual effects ηi, and for each of theses cases they may be strictly exogenous, predetermined or endogenous variables with respect to vit. A case of particular interest is where the levels xit are correlated with ηi but where Δxit (and possibly Δyit) are uncorrelated with ηi; this allows the use of (suitably lagged) Δxis (and possibly Δyis) as instruments for equations in levels.
The (Ti-q) equations for individual i can be written conveniently in the form:
| yi=Wiδ +ι iηi+vi, |
where δ is a parameter vector including the αk's, the β's and the λ's, and Wi is a data matrix containing the time series of the lagged dependent variables, the x's and the time dummies. Lastly, ι i is a (Ti-q)×1 vector of ones. PcGive can be used to compute various linear GMM estimators of δ with the general form:
| δ̂=[ ( ∑iWi*'Zi) AN( ∑iZi'Wi*) ] -1( ∑iWi*'Zi) AN( ∑iZi'yi*), |
where
| AN=( 1/N ∑iZi'HiZi) -1, |
and Wi* and yi* denote some transformation of Wi and yi (e.g. levels, first differences, orthogonal deviations, combinations of first differences (or orthogonal deviations) and levels, deviations from individual means). Zi is a matrix of instrumental variables which may or may not be entirely internal, and Hi is a possibly individual-specific weighting matrix.
If the number of columns of Zi equals that of Wi*, AN becomes irrelevant and δ ̂ reduces to
| δ̂ =( ∑iZi'Wi*) -1( ∑iZi'yi*). |
In particular, if Zi=Wi* and the transformed Wi and yi are
deviations from individual means or orthogonal deviations
| xit*=( xit- |
| ) ( |
| ) 1/2 for t=1,...,T-1. |
If the original errors are serially uncorrelated and homoscedastic, the transformed errors will also be serially uncorrelated and homoscedastic.]
, then δ ̂ is the within-groups estimator. As another example, if the transformation denotes first differences, Zi=ITi⊗xi' and Hi=v̂i*v̂i*', where the v̂i* are some consistent estimates of the first-differenced residuals, then δ̂ is the generalized three-stage least squares estimator of Chamberlain (1984). These two estimators require the xit to be strictly exogenous with respect to vit for consistency. In addition, the within-groups estimator can only be consistent as N→∞ for fixed T if Wi* does not contain lagged dependent variables and all the explanatory variables are strictly exogenous.
When estimating dynamic models, we shall therefore typically be concerned with transformations that allow the use of lagged endogenous (and predetermined) variables as instruments in the transformed equations. Efficient GMM estimators will typically exploit a different number of instruments in each time period. Estimators of this type are discussed in Arellano (1988), Arellano and Bond (1991), Arellano and Bover (1995) and Blundell and Bond (1998). PcGive can be used to compute a range of linear GMM estimators of this type.
Where there are no instruments available that are uncorrelated with the
individual effects ηi, the transformation must eliminate this
component of the error term. The first difference and orthogonal deviations
transformations are two examples of transformations that eliminate ηi
from the transformed error term, without at the same time introducing all
lagged values of the disturbances vit into the transformed error term.
Hence these transformations allow the use of suitably lagged endogenous (and predetermined) variables as instruments. For example, if the panel is balanced, p=1, there are no explanatory variables nor time effects, the vit are serially uncorrelated, and the initial conditions yi1 are uncorrelated with vit for t=2,...,T, then using first differences we have:
| Equations | Instruments available | |
| Δyi3=αΔyi2+Δvi3 | yi1 | |
| Δyi4=αΔyi3+Δvi4 | yi1,yi2 | |
| ΔyiT=αΔyi,T-1+ΔviT | yi1,yi2,...,yi,T-2 |
In this case, yi*=(Δyi3,...,ΔyiT)',Wi*=(Δyi2,...,Δyi,T-1)' and
| Zi=ZiD=( |
| ) |
Notice that precisely the same instrument set would be used to estimate the model in orthogonal deviations. Where the panel is unbalanced, for individuals with incomplete data the rows of Zi corresponding to the missing equations are deleted, and missing values in the remaining rows are replaced by zeros.
In PcGive, we call one-step estimates those which use some known matrix as the choice for Hi. For a first-difference procedure, the one-step estimator uses
| Hi=HiD= 1/2 ( |
| ), |
while for a levels or orthogonal deviations procedure, the one-step estimator sets Hi to an identity matrix. If the vit are heteroscedastic, a two-step estimator which uses
| Hi=v̂i*v̂i*', |
where v̂i* are one-step residuals, is more efficient (cf. White, 1982). PcGive produces both one-step and two-step GMM estimators, with asymptotic variance matrices that are heteroscedasticity-consistent in both cases. Users should note that, particularly when the vit are heteroscedastic, simulations suggest that the asymptotic standard errors for the two-step estimators can be a poor guide for hypothesis testing in typical sample sizes. In these cases, inference based on asymptotic standard errors for the one-step estimators seems to be more reliable (see Arellano and Bond, 1991, and Blundell and Bond, 1998 for further discussion).
In models with explanatory variables, Zi may consist of sub-matrices with the block-diagonal form illustrated above (exploiting all or part of the moment restrictions available), concatenated to straightforward one-column instruments. A judicious choice of the Zi matrix should strike a compromise between prior knowledge (from economic theory and previous empirical work), the characteristics of the sample and computer limitations (see Arellano and Bond, 1991 for an extended discussion and illustration). For example, if a predetermined regressor xit correlated with the individual effect, is added to the model discussed above, i.e.
|
then the corresponding optimal Zi matrix is given by
| Zi=( |
| ) |
Where the number of columns in Zi is very large, computational considerations may require those columns containing the least informative instruments to be deleted. Even when computing speed is not an issue, it may be advisable not to use the whole history of the series as instruments in the later cross-sections. For a given cross-sectional sample size (N), the use of too many instruments may result in (small sample) overfitting biases. When overfitting results from the number of time periods (T) becoming large relative to the number of individuals (N), and there are no endogenous regressors present, these GMM estimators are biased towards within groups, which is not a serious concern since the within-groups estimator is itself consistent for models with predetermined variables as T becomes large (see Álvarez and Arellano, 1998). However, in models with endogenous regressors, using too many instruments in the later cross-sections could result in seriously biased estimates. This possibility can be investigated in practice by comparing the GMM and within-groups estimates.
The assumption of no serial correlation in the vit is essential for the
consistency of estimators such as those considered in the previous examples,
which instrument the lagged dependent variable with further lags of the same
variable. Thus, PcGive reports tests for the absence of first-order and
second-order serial correlation in the first-differenced residuals. If the
disturbances vit are not serially correlated, there should be evidence
of significant negative first-order serial correlation in differenced
residuals (i.e. v̂it-v̂i,t-1), and no evidence of
second-order serial correlation in the differenced residuals. These tests
are based on the standardized average residual autocovariances which are
asymptotically N(0,1) variables under the null of no autocorrelation. The
tests reported are based on estimates of the residuals in first differences,
even when the estimator is obtained using orthogonal deviations.
More generally, the Sargan (1964) tests of overidentifying restrictions are also reported. That is, if AN has been chosen optimally for any given Zi, the statistic
|
|
is asymptotically distributed as a χ2 with as many degrees of freedom as overidentifying restrictions, under the null hypothesis of the validity of the instruments. Note that only the Sargan test based on the two-step GMM estimator is heteroscedasticity-consistent. Again, Arellano and Bond (1991) provide a complete discussion of these procedures.
Where there are instruments available that are uncorrelated with the
individual effects ηi, these variables can be used as instruments for
the equations in levels. Typically this will imply a set of moment
conditions relating to the equations in first differences (or orthogonal
deviations) and a set of moment conditions relating to the equations in
levels, which need to be combined
to obtain the efficient GMM estimator.
For example, if the simple AR(1) model considered earlier is mean-stationary, then the first differences Δyit will be uncorrelated with ηi , and this implies that Δyi,t-1 can be used as instruments in the levels equations (see Arellano and Bover, 1995 and Blundell and Bond, 1998 for further discussion). In addition to the instruments available for the first-differenced equations that were described earlier, we then have:
| Equations | Instruments available | |
| yi3=αyi2+ηi+vi3 | Δyi2 | |
| yi4=αyi3+ηi+vi4 | Δyi3 | |
| yiT=αyi,T-1+ηi+viT | Δyi,T-1 |
Notice that no instruments are available in this case for the first levels equation (i.e., yi2=αyi1+ηi+vi2), and that using further lags of Δyis as instruments here would be redundant, given the instruments that are being used for the equations in first differences. In a balanced panel, we could use only the last levels equation (i.e., yiT=αyi,T-1+ηi+viT), where (Δyi2,Δyi3,...,Δyi,T-1) would all be valid instruments; however this approach does not extend conveniently to unbalanced panels.
In this case, we use
|
and
| Zi=( |
| ) |
where ZiD is the matrix of instruments for the equations in first differences, as described above. Again Zi would be precisely the same if the transformed equations in yi* and Wi* were in orthogonal deviations rather than first differences. In models with explanatory variables, it may be that the levels of some variables are uncorrelated with ηi, in which case suitably lagged levels of these variables can be used as instruments in the levels equations, and in this case there may be instruments available for the first levels equation.
For the system of equations in first differences and levels, the one-step estimator computed in PcGive uses the weighting matrix
| Hi=( |
| ) |
where HiD is the weighting matrix described above for the first
differenced estimator, and Ii is an identity matrix with dimension equal
to the number of levels equations observed for individual i. For the
system of equations in orthogonal deviations and levels, the one-step
estimator computed in PcGive sets Hi to an identity matrix with dimension
equal to the total number of equations in the system for individual i. In
both cases the corresponding two-step estimator
uses Hi=v̂i*v̂i*'. We adopt these particular one-step weighting
matrices because they are equivalent in the following sense: for a balanced
panel where all the available linear moment restrictions are exploited (i.e.,
no columns of Zi are omitted for computational or small-sample reasons),
the associated one-step GMM estimators are numerically identical, regardless
of whether the first difference or orthogonal deviations transformation is
used to construct the system. Notice though that the one-step estimator is
asymptotically inefficient relative to the two-step estimator for both of
these systems, even if the vit are homoscedastic.
Again simulations have suggested that asymptotic inference based on the one-step versions may be more reliable than asymptotic inference based on the two-step versions, even in moderately large samples (see Blundell and Bond, 1998).
The validity of these extra instruments in the levels equations can be
tested using the Sargan statistic provided by PcGive.
Since the set of
instruments used for the equations in first differences (or orthogonal
deviations) is a strict subset of that used in the system of
first-differenced (or orthogonal deviations) and levels equations, a more
specific test of these additional instruments is a Difference Sargan test
which compares the Sargan statistic for the system estimator and the Sargan
statistic for the corresponding first-differenced (or orthogonal deviations)
estimator. Another possibility is to compare these estimates using a Hausman (1978)
specification test, which can be computed here by including another set of
regressors that take the value zero in the equations in first differences
(or orthogonal deviations), and reproduce the levels of the right-hand side
variables for the equations in levels.
| Wi*=( |
| ) .' |
]
The test statistic is then a Wald test of the hypothesis that the coefficients on these additional regressors are jointly zero. Full details of these test procedures can be found in Arellano and Bond (1991) and Arellano (1995).
This section gives a summary of the statistical output of PcGive, giving the formulae which are used in the computations.
| x* it = x it, t = 1,...,Ti. |
|
| x i = |
| ∑s=1Ti x is. |
| x* it = x it-x i, t = 1,...,T. |
| xo it = (x it - |
| ∑s=t+2Ti x is) ( |
| )1/2, t = 2,...,Ti-1. |
The orthogonal deviations are stored with one lag, so that the first observation is lost instead of the last (this brings it in line with first differencing):
|
| x* it = x it-θi x i, t = 1,...,Ti, |
where the choice of θi determines the method, as discussed in the next section.
The starting point is (eq:7.2):
| y = Wβ + v. |
The total number of observations is O=NT. In an unbalanced panel, different individuals may have a different number of observations, Ti. In that case the total number of observations equals O=∑Ti. Baltagi (1995) reviews the standard estimation methods used in this setting:
|
Here W has k columns, containing all specified regressors including dummies, so p=k.
| θi = 1 - |
| , σi2 = σ2v + Tiσ2η. |
When OLS residuals u = v̂OLS are used, the GLS estimator can be based on:
|
PcGive computes θi from σ2 v and σ2η with the latter derived from σ̂ 02 as:
| σ̂ 2η = (σ̂02 - σ̂ v2) N/O . |
In a balanced panel: N/O=1/T. This is not optimal in an unbalanced panel, but seems reasonable.
The standard feasible GLS estimator uses the between and within estimates:
|
For convenience, θi is specified using three variance components σ2 v, σ2a, and σ2η:
|
|
|
|
|
where τ= σ2η / σ2v, so that:
| θi = 1 - (1 + Tiτ)-1/2. |
To summarize the implementation in PcGive:
| number of | degrees of freedom | transforms | |
| observations, n | lost in estimation, p | dummies | |
| OLS (no transformation) | O | k | no |
| within | O | k+N | no |
| between | N | k | yes |
| GLS, ML | O | k | yes |
It is important to note that the within transformations are only applied to X, and not to the dummies D. In the within estimator, individual dummies are redundant after subtracting the individual means. In the between estimator, time dummies are irrelevant, and individual dummies create a perfect fit. In GLS and ML, the individual means are subtracted with weights.
The single equation dynamic panel model can be written as:
| yit = ∑s=1m αi yi,t-s + xit'γ + λt + ηi + vit, t = 0,..., Ti-1, i = 0, ..., N - 1. |
As before, λt and ηi are time and individual specific effects and xit is a k* vector of explanatory variables. It is assumed that allowance for lagged observations has been made, so observations on yi,-s, ..., yi,-1 are available. The total number of observations available for estimation equals O=∑Ti.
Stacking the data for an individual according to time: and using D for the matrix with dummies:
| yi = Xiγ + Diδ + vi, i = 0, ..., N - 1. |
In formulating the model, we distinguish the following types of variables:
|
The Gi are described under the Gmm function. From these variables, the dependent variable, regressor and instrument matrices are formed as follows:
|
where the * denotes some transformation which can be chosen from §10.1. This also affects the degrees of freedom, as shown in Table Table:10.1. The estimators and tests are described in Arellano and Bond (1991), and have the form:
|
|
| transformation | number of | degrees of freedom | transforms dummies |
| observations, n | lost in estimation, p | ||
| differences | O-N | k | Ds=D (optional: D*) |
| orthogonal | |||
| deviations | O-N | k | Ds=D (optional: D*) |
| none | O | k | no: Ds=D |
| within | O | k+N | no: Ds=D* |
| between | N | k | yes: Ds=D* |
| GLS | O | k | yes: Ds=D* |
When the option to concentrate out dummies is used, y*, X*, I*, and L are replaced by the residuals from regressing on the set of dummies Ds. Subsequently, the dummies are omitted from further computations.
In one-step estimation, Hi is the identity matrix, except when estimating in first differences:
|
|
In two-step estimation Hi is based on the previous step residuals:
|
with a subscript added to distinguish between one and two step estimation.
The output reports the residual and total sum of squares:
|
So the TSS is in deviation from the mean.
The one-step estimated variance of the coefficients is:
| V̂1[β̂] = σ̂1,u2 M1-1. |
If selected, robust standard errors after one-step estimation are computed as:
|
|
When there are no instruments, (eq:10.5) simplifies to:
| (W'W)-1 (∑i=1N Wi'ûi ûi'Wi) (W'W)-1, |
which is the robust variance estimator proposed in Arellano (1987).
| ∑i=1NT wi'ûi ûiwi. |
]
The two-step asymptotic variance matrix is:
| V̂2[β̂] = M2-1. |
This variance estimator can be severely downward biased in small samples, and is therefore only reported when robust standard errors is switched off. The preffered solution (and the default) is to use the robust variance for two-step estimation, V̂2r[β̂], which uses the small sample correction as derived by Windmeijer (2000).
The AR test for order m is computed as:
|
|
Using wi for the residuals lagged m periods (substituting zero for missing lags):
| wit = ui,t-m for t=m,...,Ti, and wit = 0 for t < m. |
|
|
The components are used as the notation suggests, except for Hi in one-step estimation:
|
After estimation in orthogonal deviations, the AR test (eq:10.6) is based on the residuals from the differenced model as follows (the superscript Δ refers to the regressors and residuals from the model in differenced form):
|
|
HiΔ is H1,idiff after one-step estimation, and uiΔ'uiΔ otherwise. Ψi equals
| ( |
| ) wi |
after one-step estimation, and uiuiΔ'wiΔ after one-step robust or two-step estimation.
The Sargan test with z-k degrees of freedom is after one-step estimation:
| S1 = (∑i v̂*'1,iZi) A1,N (∑i Zi'v̂*1,i)σ̂1,u-2, |
and after two-step estimation (see eq:7.4 and equation (10) in Arellano and Bond, 1991):
| S = (∑i v̂*'2,iZi) A2,N (∑i Zi'v̂*2,i). |
Three Wald tests are routinely reported to test the significance of groups of variables:
| Wald (joint): | on W, all regressors except dummies, |
| Wald (dummy): | on D, all dummies (including constant), |
| Wald (time): | on time dummies (including the constant in the |
| differenced/deviations model). |
In combined estimation, GMM-SYS, the levels equation is used jointly with the transformed equation, see Blundell and Bond (1998). The IV estimation of (eq:10.3) still applies, but the data are organized as follows:
|
G+i are the GMM-style instruments for the levels equation,
described under the GmmLevel function.
The dummies in the transformed equation are always transformed. When using differences or deviations, the effective number of observations in the transformed equations is as before (O-N), whereas the levels equations have O observations. Internally, both the transformed and levels equation i has Ti equations, but, when using differences/deviations, the first observation in the transformed equation is set to zero, resulting in Ti-1 effective observations.
When the option to concentrate out dummies is used, y*, X*, I*, and L are replaced by the residuals from regressing on the set of dummies D*, and y and X on D. Note that there is one time dummy extra in combined estimation compared to normal estimation when using differences or deviations.
After estimation, there are two sets of residuals:
|
The reported residual variance is based on the levels residuals: σ̂u+2 = û+'û+/(n-p). Here n = O =∑i=1N Ti and p are as before. The reported RSS and TSS are also based on the levels, using û+ and q+.
In one-step estimation, Hi is as in (eq:10.4)
in the transformed equations, and the identity matrix in the level
equations (1/2I when using differences).
Only robust standard errors are reported after one-step estimation.
In two-step estimation, Hi is based on the previous step residuals ûi' = (ûi*' : ûi+').
The AR test (eq:10.6) is based on the residuals from the transformed equation:
|
After orthogonal deviations, the * is replaced by Δ as in (eq:10.8).
The fractionally-integrated autoregressive-moving average model, denoted arfima(p,d,q) can be used for the statistical analysis of a univariate time series yt with long memory. The more familiar arma (or `Box--Jenkins') models (when d=0) or arima (when d is a positive integer) are special cases. The arfima(p,d,q) model allows one to model the long-run behaviour of a time series in a flexible way. Empirical modelling involves identification, estimation and testing. In the identification stage of ARIMA-modelling one determines the integer part of the order of differencing, d, and the orders of the autoregressive and the moving-average parts of the model, p, and q. In the ARFIMA-model one can estimate d and compute a confidence interval. The arfima model allows for likelihood ratio and Wald tests for the null of short-memory stationarity (d=0), as well as for the null of unit-root non-stationarity (d=1) against long memory and intermediate memory alternatives. These tests are easy to apply since they have conventional chi-squared limit distributions.
The tests complement widely applied Dickey--Fuller type tests for d=1 against short memory alternatives and tests for d=0 against d=1 based on variances of cumulative sums. PcGive's parameter instability test applied to the constant term of a regression model can be interpreted as a test for d=0 with fixed mean, against d=1 with a random walk mean.
From the early nineteen sixties onwards, when Mandelbrot suggested long-memory models for economic time series, there has been a steady growth in the literature on the subject. Robinson (1994) and Baillie (1996) provided useful surveys of the developments in the econometric modelling of long memory; Beran's monograph, Beran (1994), discusses most of the central issues, including forecasting. The implementation in PcGive is derived from the Arfima package for Ox by Doornik and Ooms (1999), while an application and a more extensive discussion can be found in Doornik and Ooms (2004).
We write the arfima(p,d,q) model as:
|
|
where Φ( L )=(1-φ1L-...-φpLp) is the autoregressive polynomial and Θ( L )=(1+θ1L+...+θqLq) is the moving average polynomial in the lag operator L; p and q are integers, d is real. For example, the arfima(1,d,1) model is:
| ( 1-L) dzt = φ1 ( 1-L) dzt-1 + εt + θ1εt-1, |
where
| zt =yt-μt. |
The notation in textbooks often differs in terms of the sign of the parameters: we follow the regression-model notation.
( 1-L) d=Δd is the fractional difference operator defined by the following binomial expansion:
| (1-L)d=∑j=0∞ δj Lj = ∑j=0∞ ( |
| ) (-L)j. |
We assume εt~NID[0,σε2], and write μt for the mean of yt. The arma-part of the model is assumed invertible and stationary: all roots of Θ(z)=0 and Φ(z)=0 are outside the unit circle. In addition, Θ(z)=0 and Φ(z)=0 do not have common roots. We say that zt=yt-μt is I(d), integrated of order d. The zero mean arfima(p,d,q) process zt is covariance stationary if d<0.5, see Hosking (1981), with autocovariance function that decays hyperbolically. The process is long memory in the case 0<d<0.5. For -0.5<d<0 the process is called intermediate memory or `overdifferenced', see Brockwell and Davis (1993) and e.g. Chang and Dickey (1994). We assume d>-1, which makes the process zt invertible, see Odaki (1993). We provide three estimation methods, Exact Maximum Likelihood (EML), Modified Profile Likelihood (MPL) and nonlinear least squares (NLS). By definition, EML and MPL impose -1 < d < 0.5. MPL is preferred over EML if the model includes regressor variables and the sample is not very large. NLS allows for d > -0.5 and can be used to estimate models for non-stationary processes directly, without a priori differencing. NLS estimation is usually fast.
The next chapter shows how PcGive can be used to estimate arfima models. We discuss implementation details in Chapter 12.
This chapter gives a summary of the implementation details of the procedures in the Arfima package. It complements Doornik and Ooms (2004), while further computational aspects are discussed in Doornik and Ooms (2003).
The basic arma(p,q) model is
| yt=φ1yt-1+...+φpyt-p+εt+θ1εt-1+...+θqεt-q, t=1,...,T, |
assuming either εt~NID(0,σε2) , or E[εt]=0 and E[εt2]=σε2. Using lag polynomials and introducing a mean μ we write:
| Φ( L) ( yt-μ) =Θ( L) εt. |
With a fractional integration parameter d, the arfima(p,d,q) model is written as
|
|
The autocovariance function (ACovF) of a stationary arma process with mean μ:
| c( i) =E[ ( yt-μ) ( yt-i-μ) ], |
defines the variance matrix of the joint distribution of y=(y1,...,yT)':
|
|
which is a Toeplitz matrix, denoted by T. Under normality:
|
|
The autocorrelation function, ACF: c( i) /c( 0), of a stationary arma process is discussed in many textbooks, and readily computed from the φi and θi using the Ox function armavar. We often work with the autocovariances scaled by the error variance:
| r=[ r( 0) ...r( T-1) ] '=σε-2[ c( 0) ...c( T-1) ] '. |
An algorithm for the computation of the ACovF of an arfima process is derived in Sowell (1992):
|
|
|
|
|
|
Here Γ is the gamma function, ρj are the roots of the AR polynomial (assumed distinct), and F(a,1;c;ρ) is the hypergeometric function, see e.g. Abramowitz and Stegun (1970, Ch. 15):
| F(a,b;c;ρ)=∑i=0∞ |
|
| , |
where we use Pochhammer's symbol:
| ( a) i=a( a+1) ( a+2) ...( a+i-1) , ( a) 0=1. |
So (1)i equals i!.
In the absence of AR parameters (eq:13.4) reduces to
| c( i) =σε2∑k=-qqψk |
|
| . |
Any set of exogenous regressors may be used to explain the mean:
| z=y-μ, μ=f(X,β), |
where X is a T×k matrix. In the leading linear case f(X,β)=Xβ and β is a k×1 vector.
Initial values for the parameter estimates are obtained in the order: regressors in mean, d, AR part, and finally MA part. The very first step is to subtract a mean from yt: zt=yt-μt . When either the sample mean or a specified (known, possibly zero) mean is used: μt =μ. If regressors are used, take μt=f(xt,β). In the linear case β is obtained by regression.
|
|
| ( |
| ) φ0 = ( |
| ), |
where ρ̂( i) is the empirical autocorrelation of ut. When q is zero, the matrix on the right-hand side is the Toeplitz matrix T[ ρ̂(0), ..., ρ̂( p-1) ].
We use OLS to solve this system, giving the initial estimates of the AR coefficients φ̂0; this will also give a solution when the matrix is singular. Subsequently, the arma0 function is used to obtain autoregression residuals ut*.
When the initial values are used as starting values for further estimation, the following adjustments are made:
Based on normality (eq:13.3), and with a procedure to compute the autocovariances in (eq:13.2), the log-likelihood is simply (writing z for the data vector used for maximization):
|
|
It is convenient to concentrate σε2 out of the likelihood, starting by writing Σ= Rσ&eps