PcGive Volume III: Econometric Modelling

These reference chapters have been taken from Volume III, and use the same chapter and section numbering as the printed version.

Table of contents
2 Discrete choice models
  2.1 Introduction
  2.2 Binary discrete choice
  2.3 The binary logit and probit model
  2.4 Multinomial discrete choice
  2.4.1 The multinomial logit model
  2.4.2 Weighted estimation
  2.5 Evaluation
  2.5.1 Estimated probabilities
  2.5.2 Likelihood ratio tests
  2.5.3 Derivatives of probabilities
  2.6 Histograms
  2.7 Norm observations
  2.8 Observed versus predicted
  2.9 Outlier analysis
4 Panel Data Models
  4.1 Introduction
  4.2 Econometric methods for static panel data models
  4.2.1 Static panel-data estimation
  4.3 Econometric methods for dynamic panel data models
7 Panel Data Implementation Details
  7.1 Transformations
  7.2 Static panel-data estimation
  7.3 Dynamic panel data estimation
  7.4 Dynamic panel data, combined estimation
8 Introduction to Volatility Models
  8.1 Introduction
10 GARCH Implementation Details
  10.1 GARCH model settings
  10.2 Some implementation details
11 Introduction to Time Series Models
13 ARFIMA Implementation Details
  13.1 Introduction
  13.2 The Arfima model
  13.2.1 Autocovariance function
  13.3 Estimation
  13.3.1 Regressors in mean
  13.3.2 Initial values
  13.3.3 Exact maximum likelihood (EML)
  13.3.4 Modified profile likelihood (MPL)
  13.3.5 Non-linear least squares (NLS)
  13.3.6 Variance-covariance matrix estimates
  13.4 Estimation output
  13.5 Estimation options
  13.5.1 Sample mean versus known mean
  13.5.2 Fixing parameters
  13.5.3 Weighted estimation
  13.5.4 Z variables
  13.6 Forecasting
List of figures
  Figure:2.1 Example histograms of probabilities
List of tables
  Table:4.1 Static and dynamic panel data estimators of PcGive
  Table:7.1 Transformation and the treatment of dummy variables

Chapter 2 Discrete choice models

2.1 Introduction

In a discrete choice model the dependent variable only takes on integer values. Such a discrete dependent variable can denote a category (e.g. mode of transport: car, train, or bus) or count the number of events (e.g. the number of accidents in a week). A model with a categorical dependent variable is often called a discrete choice model, but is also referred to as a dummy-endogenous or qualitative response model. As our main reference for discrete choice models we use Cramer (2003).

The use of ordinary least squares in a discrete choice model is called the linear probability model. The main shortcoming of this model is that it can predict probabilities that are outside the [0,1] interval. PcGive implements logit and probit estimation which avoids this drawback.

General references include McFadden (1984), Amemiya (1981) and Cramer (2003), among others. In the remainder we exclusively refer to Cramer (2003).

2.2 Binary discrete choice

In the simple discrete choice model the dependent variable only takes on two values, for example, when modelling car ownership:

yi=1 if household i owns a private car,
yi=0 otherwise.

With a discrete dependent variable, interest lies in modelling the probabilities of observing a certain outcome:

P{yi=1}=pi,  P{yi=0}=1-pi.
Using OLS on a dummy dependent variable:

yi = xi'β + εi

does not restrict ŷi between 0 and 1, which would be required if it is to be interpreted as a probability. Also, the disturbances cannot be normally distributed, as they only take on two values: εi=1-pi or εi=0-pi, writing pi=xi'β, and are heteroscedastic.

A simple solution is to introduce an underlying continuous variable Ui, which is not observed. Observed is:

yi=0 if Ui<0,
yi=1 if Ui≥0.

Now we can introduce explanatory variables:

Ui = xi'β + εi,

and write

pi=P{yi=1}=P{xi'β+εi≥0}=Fε( xi'β).

Observations with yi=1 contribute pi to the likelihood, observations with yi=0 contribute 1-pi:

L( β|X) =∏{yi=0}( 1-pi) ∏{yi=1}pi,

and the log-likelihood becomes:

l( β|X) =∑i=1N[ ( 1-yi) log ( 1-pi) +yi log pi] .

2.3 The binary logit and probit model

The choice of Fε determines the method. Using the logistic distribution:

Fε( z) = ez / ( 1+ez)

leads to logit. Logit has a linear log-odds ratio:

log ( pi / ( 1-pi) ) =xi'β.

The standard normal distribution gives probit. Since we can multiply Ui by any non-zero constant without changing the outcome, the scale of these distributions is fixed: the logistic has variance π2/3, the standard normal has variance equal to 1.

2.4 Multinomial discrete choice

Assume that observations on n individuals (or households, firms, etc.) are obtained by random sampling, and that all individuals have made a choice from the same set of alternatives A={1,2,...,S}. Using subscript i to denote an individual and s to denote an alternative, we can write the dependent variable as yi:

yi = s   if individual i has made choice s,  s=1,...,S; i=1,...,n.

The existence of an unobserved continuous variable Uis is assumed, where the alternative with the highest value is chosen:

yi = s   if  Uis= max k∈A {Uik}.

See Cramer (2003, Ch. 2) for an interpretation in terms of utility maximization. Adding a stochastic component:

Uis = Vis + εis.

Logit and probit specify a certain distribution for εis. The systematic part Vis is a functional form of explanatory variables and coefficients. We can distinguish three types of explanatory variables:

  1. depending only on the characteristics of individual i (denoted xi);
  2. depending only on the alternative s (denoted rs);
  3. depending on both (denoted zis).

The xi are called alternative independent regressors, while the remaining two types, called alternative dependent, are treated identically in PcGive.

Moreover, there are two types of coefficients:

  1. alternative specific (denoted βj);
  2. generic (denoted γ).

The full specification in PcGive is:

Vis = xi'βs + zis'γ.

So alternative independent regressors always enter with an alternative specific coefficient, while alternative dependent regressors have a generic coefficient.

As an illustration, consider the transport mode example:

yi=1  if mode of transport is car,
yi=2  if mode of transport is bus,
yi=3  if mode of transport is train.

Then xi may be the income of individual i, and zis the travel time for individual i using mode s. Interest may focus on the probabilities that individual i uses mode s, or the coefficients, to estimate the influence on marginal utility of explanatory variables. In the absence of alternative specific parameters it is possible to investigate the consequences of introducing a new mode of transport.

Let pis denote the probability that individual i chooses alternative s. Then from (eq:2.2):

pis =P(yi=s) =P(Uis= max k∈A {Uik})
=P(Vis+εis>Vik+εik for all k≠s).

The log-likelihood function is:

l(θ) = log L(θ) = ∑i=1n ∑s=1S Yis log pis(θ),

where

Yis=1   if yi=s,
Yis=0   otherwise.

2.4.1 The multinomial logit model

The multinomial logit (MNL) model arises from (eq:2.2) when the εik are independently type I extreme value distributed (see Cramer, 2003, Ch. 3). This is the most convenient model, because of the closed form of the probabilities:

pis = exp (Vis) / ∑k=1S exp (Vik) = 1 / [ 1+∑k=1,k≠sS exp (Vik-Vis) ],

and the concavity of the log-likelihood.

The log-odds ratio is:

log (pis/pik)=Vis-Vik.

Substituting (eq:2.3):

Vis-Vik = xi'(βs-βk) +(zis-zik)'γ,

shows that a normalization is needed on βk: a constant can be added (for all k) without changing the probabilities. PcGive takes:

β1 = 0,

and the first alternative is then called the reference state.
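
The closed-form probabilities with the β1=0 normalization can be sketched in a few lines of Python (names and numbers are hypothetical; the max is subtracted before exponentiation purely for numerical stability and does not change the probabilities):

```python
import numpy as np

def mnl_probs(x, betas):
    """Multinomial logit probabilities with beta_1 = 0 (reference state).

    x:     (K,) regressor vector for one individual
    betas: (S-1, K) alternative-specific coefficients for states 2..S
    """
    v = np.concatenate([[0.0], betas @ x])   # V_1 = 0 by the normalization
    e = np.exp(v - v.max())                  # stable exponentiation
    return e / e.sum()                       # p_s = exp(V_s) / sum_k exp(V_k)

x = np.array([1.0, 2.0])                     # e.g. constant + income
betas = np.array([[0.5, -0.2],               # state 2
                  [1.0, -0.5]])              # state 3
p = mnl_probs(x, betas)
```

Adding the same constant to every βs leaves p unchanged, which is exactly why the normalization on β1 is needed.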

Although the name multinomial logit in PcGive is generally used for the specification (eq:2.3), it sometimes stands for a more specific functional form:

Vis = xi'βs.

In that case, the conditional logit model refers to

Vis = zis'γ.

When Vis does not depend on attributes of alternatives other than alternative s, the MNL model has the independence from irrelevant alternatives (IIA) property.

2.4.2 Weighted estimation

PcGive allows the specification of a weight variable in the log-likelihood:

l(θ) = ∑i=1n wi ∑s=1S Yis log pis(θ).

Its use can be two-fold:

  1. Weighted M-estimation (Ruud, 1986)
    To get consistent estimates of the slope parameters (up to an unknown scale factor) in the case of a mis-specified distribution (e.g. using logit when the true model is probit, although in the binary case the models are very close). In discrete choice models weighting is not always necessary, see Ruud (1983).

  2. Choice-based sampling

    In multinomial logit, only the intercept is incorrect when treating a choice-based sample as a random sample (see Maddala, 1983, pp 90-91). Manski and Lerman (1977) proposed a weighted estimation to get consistent estimates for choice-based samples. Let Q0(s) be the population fraction choosing alternative s, and H(s) the fraction in the sample choosing s. Denote the observed response of individual i as si, then the weights are:

    wi = Q0(si) / H(si).

    Also see Amemiya and Vuong (1987).
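
The Manski–Lerman weights follow directly from the two sets of fractions; a sketch with invented shares (not real data):

```python
import numpy as np

def choice_based_weights(choices, pop_frac, sample_frac):
    """w_i = Q0(s_i) / H(s_i): population fraction over sample fraction
    of the alternative actually chosen by individual i."""
    return np.array([pop_frac[s] / sample_frac[s] for s in choices])

# e.g. 3 alternatives, with alternative 0 heavily oversampled
pop_frac    = {0: 0.10, 1: 0.60, 2: 0.30}   # Q0(s): known population shares
sample_frac = {0: 0.40, 1: 0.40, 2: 0.20}   # H(s):  shares in the sample
choices = [0, 1, 2, 0, 1]                   # s_i: observed responses
w = choice_based_weights(choices, pop_frac, sample_frac)
```

Individuals from oversampled alternatives receive weights below one, those from undersampled alternatives weights above one.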

2.5 Evaluation

2.5.1 Estimated probabilities

The coefficient estimates β̂s, s=2,...,S and γ̂ form the basis for the estimated probabilities for each observation: p̂is, s=1,...,S, i=1,...,n.

2.5.2 Likelihood ratio tests

If the estimated model contains a constant term, it can be compared with the baseline model in a likelihood ratio test. The baseline model has only an intercept, and the resulting test is analogous to the regression F-test. In the baseline MNL model:

Vis = αs,  s=2,...,S,

the likelihood can be maximized explicitly, resulting in:

αs = log (fs/f1),

where fs is the relative frequency in the sample of alternative s. With K alternative independent regressors (x) and C alternative dependent regressors (z), the estimated model has K(S-1)+C parameters, and a log-likelihood of l(θ̂), say. The baseline model has S-1 parameters, and a log-likelihood of l(α̂), so that 2l(θ̂) - 2l(α̂) is χ2[(K-1)(S-1)+C] distributed.
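
Since the baseline log-likelihood depends only on the sample frequencies, l(α̂) = n ∑s fs log fs, the test can be computed directly; a sketch (function names are invented, and the fitted log-likelihood passed in is hypothetical):

```python
import numpy as np
from scipy.stats import chi2

def baseline_loglik(counts):
    """Log-likelihood of the intercept-only MNL model: n * sum_s f_s log f_s."""
    counts = np.asarray(counts, dtype=float)
    f = counts / counts.sum()
    return counts.sum() * np.sum(f * np.log(f))

def lr_test(loglik_model, counts, K, C, S):
    """LR test against the baseline model; K alternative independent regressors
    (including the constant), C alternative dependent regressors, S states.
    Degrees of freedom: (K-1)(S-1) + C."""
    lr = 2.0 * (loglik_model - baseline_loglik(counts))
    df = (K - 1) * (S - 1) + C
    return lr, df, chi2.sf(lr, df)

# hypothetical example: 100 observations split 30/50/20 over three states
lr, df, pval = lr_test(-95.0, [30, 50, 20], K=2, C=0, S=3)
```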

If the estimated model has no intercept, the zeroline is used in the test. This is a model without parameters:

p̂is = 1/S.

In a multinomial logit model that does have an intercept, the sum of the estimated probabilities for state s equals the number of observations having state s: 1/n ∑i=1n p̂is = fs. This follows, because at the maximum, the derivatives of the log-likelihood:

∂l/∂βs = ∑i=1n (Yis-pis)xi'

are zero, and hence, if one of the regressors is a row of ones:

∑i=1n Yis = ∑i=1n p̂is.

With grouped data, the log-likelihood of the saturated model is reported. This is a model with perfect fit: the predicted frequencies equal the observed frequencies. Let nis be the number in group i having response s. Then ∑s=1Snis=ni and ∑i=1Gni=n, with G the number of groups and n the total number of observations. The saturated model

p̂is = nis/ni

has G(S-1) parameters and a log-likelihood of

∑i=1G ∑s=1S nis log (nis/ni).

The saturated log-likelihood is not reported if any cell has nis=0.

2.5.3 Derivatives of probabilities

The derivatives of the probabilities with respect to the regressors give the effect of a change in one of the regressors on the probabilities. Since the probabilities sum to one, the derivatives sum to zero. The derivatives for the MNL model are:

∂ps/∂xm = p̂s(β̂sm-∑k=1S β̂km p̂k),
∂ps/∂zsm = γ̂m p̂s(1-p̂s),
∂ps/∂zrm = -γ̂m p̂s p̂r,  r≠s.

We can substitute the relative frequency of each state for the probabilities (p̂s=fs) or select other relative frequencies. In the calculation of the t-values the p̂s are treated as non-stochastic.
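
The first derivative formula can be evaluated at chosen probabilities in a couple of lines; a sketch with invented coefficients (β̂1m=0 for the reference state):

```python
import numpy as np

def mnl_dp_dx(p, beta_m):
    """dp_s/dx_m = p_s (beta_sm - sum_k beta_km p_k), for all states s at once."""
    p = np.asarray(p)
    beta_m = np.asarray(beta_m)
    return p * (beta_m - beta_m @ p)

p = np.array([0.25, 0.50, 0.25])     # probabilities to evaluate at (sum to one)
beta_m = np.array([0.0, 0.8, -0.4])  # coefficient on x_m per state, beta_1m = 0
d = mnl_dp_dx(p, beta_m)
```

Because the probabilities sum to one, the derivatives in d sum to zero, as the text notes.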

In the binary probit model the derivatives are:

∂p/∂xm = φ(V̂)β̂m,

where φ denotes the standard normal density. These derivatives are calculated at the mean of the explanatory variables (V̂ = x̄'β̂), but can be calculated at other values.

PcGive also reports quasi-elasticities:

∂ps/∂ log xm = xm ∂ps/∂xm.

If xm increases by q%, ps will approximately increase by q×∂ps/∂ log xm percentage points. Elasticities are also reported:

∂ log ps/∂ log xm = (xm/ps) ∂ps/∂xm.

All three measures have the same t-values.

2.6 Histograms

The distribution of the estimated probabilities is shown in histograms. The first histogram is the frequency distribution of the probabilities of the observed state. Consider the following example with three states:

observed state   state 1  state 2  state 3
      1            0.1      0.5      0.4
      2            0.1      0.8      0.1
      3            0.5      0.1      0.4
      2            0.3      0.4      0.3

The probabilities of the observed state are underlined. Their frequency distribution is shown in Figure Figure:2.1a (e.g. 50% is in [0.4,0.5)), while on the right is the distribution of the estimated probabilities for the first state. These graphs can be made for each state, with S+1 histograms in total.


Figure:2.1 Example histograms of probabilities

2.7 Norm observations

A norm observation has the sample mean for each explanatory variable. The probabilities of each response can be calculated for the norm observation, as well as for other user-defined norm observations. The probabilities of the norm observation do not equal the relative frequencies of the alternatives: p̂s(V)≠fs.

2.8 Observed versus predicted

The predicted choice is the alternative with the highest predicted probability. The table of observed versus predicted choice is a rather crude measure, since 0.34,0.33,0.33 and 0.9,0.05,0.05 both select the first alternative as predicted response. For the example of §2.6 the table would read:

                    Observed response
                   State 1  State 2  State 3  Predicted total
Predicted State 1     0        0        1           1
Predicted State 2     1        2        0           3
Predicted State 3     0        0        0           0
Observed total        1        2        1           4
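
This cross-tabulation can be reproduced from the example of §2.6; a sketch (states are 0-based in the code):

```python
import numpy as np

probs = np.array([[0.1, 0.5, 0.4],    # estimated probabilities, one row per individual
                  [0.1, 0.8, 0.1],
                  [0.5, 0.1, 0.4],
                  [0.3, 0.4, 0.3]])
observed = np.array([0, 1, 2, 1])     # observed states 1, 2, 3, 2 (0-based)
predicted = probs.argmax(axis=1)      # predicted = highest-probability state

S = probs.shape[1]
table = np.zeros((S, S), dtype=int)   # rows: predicted state, columns: observed state
for o, p in zip(observed, predicted):
    table[p, o] += 1
```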

2.9 Outlier analysis

Let p̂(i) denote the estimated probability that individual i selects the alternative she actually chose. These are underlined in the example of §2.6. PcGive can print all p̂(i) which are smaller than a designated value.

Chapter 4 Panel Data Models

4.1 Introduction

In a panel data setting, we have time-series observations on multiple entities, for example companies or individuals. We denote the cross-section sample size by N, and, in an ideal setting, have t=1,...,T time-series observations covering the same calendar period. This is called a balanced panel. In practice, it often happens that some cross-sections start earlier, or finish later. The panel data procedures in PcGive are expressly designed to handle such unbalanced panels.

When T is large, N small, and the panel balanced, it will be possible to use the simultaneous-equations modelling facilities of PcGive. When T is small, and the model dynamic (i.e. includes a lagged dependent variable), the estimation bias can be substantial (see Nickell, 1981). Methods to cope with such dynamic panel data models are the primary focus of this part of PcGive, particularly the GMM-type estimators of Arellano and Bond (1991), Arellano and Bover (1995), and Blundell and Bond (1998), but also some of the Anderson and Hsiao (1982) estimators. Additional information can be found in Arellano (2003). In addition, PcGive makes several of the standard static panel data estimators available, such as between and within groups and feasible GLS. Table Table:4.1 lists the available estimators.

Table:4.1 Static and dynamic panel data estimators of PcGive

static panel data models     dynamic panel data models
OLS in levels                OLS in levels
Between estimator            one-step IV estimation (Anderson--Hsiao IV)
Within estimator             one-step GMM estimation
Feasible GLS                 one-step estimation with robust std. errors
GLS (OLS residuals)          two-step estimation
Maximum likelihood (ML)

This chapter only provides a cursory overview of panel data methods. In addition to the referenced literature, there are several text books on panel data, notably Hsiao (1986) and more recently Baltagi (1995). Especially relevant is the book by Arellano (2003). Many general econometric text books have a chapter on panel data (but rarely treating dynamic models), see, e.g., Johnston and DiNardo (1997, Ch. 12).

4.2 Econometric methods for static panel data models

4.2.1 Static panel-data estimation

The static single-equation panel model can be written as:

yit = xit'γ + λt + ηi + vit,  t = 1,..., T,  i = 1, ..., N.

The λt and ηi are respectively time and individual specific effects and xit is a k* vector of explanatory variables. N is the number of cross-section observations. The total number of observations is NT.

Stacking the data for an individual according to time yields:

( yi1 )   ( xi1' )     ( λ1 )   ( 1 )      ( vi1 )
(  ⋮  ) = (  ⋮   ) γ + (  ⋮  ) + ( ⋮ ) ηi + (  ⋮  )
( yiT )   ( xiT' )     ( λT )   ( 1 )      ( viT )

or, more concisely:

yi = Xiγ + λ + ιi ηi + vi,  i = 1, ..., N,

where yi, λ, and vi are T×1, Xi is T×k*, and ιi is a T×1 column of ones. Using Di for the matrix with dummies:

yi = Xiγ + Diδ + vi,  i = 1, ..., N.

The next step is to stack all individuals, combining the data into W=[X : D]:

y = Wβ + v.

Baltagi (1995) reviews the standard estimation methods used in this setting: OLS in levels, the between and within estimators, feasible GLS, and maximum likelihood (Table Table:4.1).

More detail is provided in §7.3.
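
As an illustration of one of these methods, the within estimator subtracts individual means from y and X before running OLS; a minimal sketch for a balanced panel (simulated data, not PcGive's implementation):

```python
import numpy as np

def within_estimator(y, X, ids):
    """Within-groups estimator: demean y and X by individual, then OLS."""
    yd = y.astype(float).copy()
    Xd = X.astype(float).copy()
    for i in np.unique(ids):
        m = ids == i
        yd[m] -= yd[m].mean()             # remove individual mean of y
        Xd[m] -= Xd[m].mean(axis=0)       # remove individual means of X
    gamma, *_ = np.linalg.lstsq(Xd, yd, rcond=None)
    return gamma

rng = np.random.default_rng(1)
N, T = 50, 5
ids = np.repeat(np.arange(N), T)
X = rng.normal(size=(N * T, 1))
eta = np.repeat(rng.normal(size=N), T)    # individual effects, swept out by demeaning
y = 2.0 * X[:, 0] + eta + 0.1 * rng.normal(size=N * T)
gamma_hat = within_estimator(y, X, ids)
```

The demeaning eliminates the ηi, so gamma_hat recovers the slope even though the individual effects are correlated with nothing observed.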

4.3 Econometric methods for dynamic panel data models

The general model that can be estimated with PcGive is a single equation with individual effects of the form:

yit = ∑k=1p αk yi,t-k + β'(L)xit + λt + ηi + vit,   t=q+1,...,Ti;  i=1,...,N,

where ηi and λt are respectively individual and time specific effects, xit is a vector of explanatory variables, β(L) is a vector of associated polynomials in the lag operator and q is the maximum lag length in the model. The number of time periods available on the ith individual, Ti, is small and the number of individuals, N, is large. Identification of the model requires restrictions on the serial correlation properties of the error term vit and/or on the properties of the explanatory variables xit. It is assumed that if the error term was originally autoregressive, the model has been transformed so that the coefficients α's and β's satisfy some set of common factor restrictions. Thus only serially uncorrelated or moving-average errors are explicitly allowed. The vit are assumed to be independently distributed across individuals with zero mean, but arbitrary forms of heteroscedasticity across units and time are possible. The xit may or may not be correlated with the individual effects ηi, and for each of these cases they may be strictly exogenous, predetermined or endogenous variables with respect to vit. A case of particular interest is where the levels xit are correlated with ηi but where Δxit (and possibly Δyit) are uncorrelated with ηi; this allows the use of (suitably lagged) Δxis (and possibly Δyis) as instruments for equations in levels.

The (Ti-q) equations for individual i can be written conveniently in the form:

yi = Wiδ + ιiηi + vi,

where δ is a parameter vector including the αk's, the β's and the λ's, and Wi is a data matrix containing the time series of the lagged dependent variables, the x's and the time dummies. Lastly, ιi is a (Ti-q)×1 vector of ones. PcGive can be used to compute various linear GMM estimators of δ with the general form:

δ̂=[ ( ∑iWi*'Zi) AN( ∑iZi'Wi*) ] -1( ∑iWi*'Zi) AN( ∑iZi'yi*),


AN=( 1/N ∑iZi'HiZi) -1,

and Wi* and yi* denote some transformation of Wi and yi (e.g. levels, first differences, orthogonal deviations, combinations of first differences (or orthogonal deviations) and levels, deviations from individual means). Zi is a matrix of instrumental variables which may or may not be entirely internal, and Hi is a possibly individual-specific weighting matrix.
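
The general estimator δ̂ above translates line by line into matrix code; a sketch (the per-individual lists of Wi*, Zi, yi* and Hi are assumed to have been constructed elsewhere):

```python
import numpy as np

def gmm_estimator(W_list, Z_list, y_list, H_list):
    """delta_hat = [ (sum W'Z) A_N (sum Z'W) ]^{-1} (sum W'Z) A_N (sum Z'y),
    with A_N = ( (1/N) sum Z' H Z )^{-1}; all sums run over individuals."""
    N = len(W_list)
    WZ = sum(W.T @ Z for W, Z in zip(W_list, Z_list))      # sum_i W_i*' Z_i
    Zy = sum(Z.T @ y for Z, y in zip(Z_list, y_list))      # sum_i Z_i' y_i*
    ZHZ = sum(Z.T @ H @ Z for Z, H in zip(Z_list, H_list)) / N
    A = np.linalg.inv(ZHZ)                                 # A_N
    M = WZ @ A @ WZ.T
    return np.linalg.solve(M, WZ @ A @ Zy)

# exactly identified toy check: Z_i = W_i*, H_i = I reduces to OLS on stacked data
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = W @ np.array([1.0, 2.0])
delta_hat = gmm_estimator([W, W], [W, W], [y, y], [np.eye(3), np.eye(3)])
```

In the exactly identified case the result does not depend on A_N, mirroring the simplification given below.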

If the number of columns of Zi equals that of Wi*, AN becomes irrelevant and δ̂ reduces to

δ̂ =( ∑iZi'Wi*) -1( ∑iZi'yi*).

In particular, if Zi=Wi* and the transformed Wi and yi are deviations from individual means or orthogonal deviations

[Note: Orthogonal deviations, as proposed by Arellano (1988) and Arellano and Bover (1995), express each observation as the deviation from the average of future observations in the sample for the same individual, and weight each deviation to standardize the variance, i.e.

xit* = ( xit − (1/(T−t)) ∑s=t+1T xis ) ( (T−t)/(T−t+1) )1/2   for t=1,...,T-1.

If the original errors are serially uncorrelated and homoscedastic, the transformed errors will also be serially uncorrelated and homoscedastic.]

, then δ̂ is the within-groups estimator. As another example, if the transformation denotes first differences, Zi = ITi ⊗ xi' and Hi=v̂i*v̂i*', where the v̂i* are some consistent estimates of the first-differenced residuals, then δ̂ is the generalized three-stage least squares estimator of Chamberlain (1984). These two estimators require the xit to be strictly exogenous with respect to vit for consistency. In addition, the within-groups estimator can only be consistent as N→∞ for fixed T if Wi* does not contain lagged dependent variables and all the explanatory variables are strictly exogenous.

When estimating dynamic models, we shall therefore typically be concerned with transformations that allow the use of lagged endogenous (and predetermined) variables as instruments in the transformed equations. Efficient GMM estimators will typically exploit a different number of instruments in each time period. Estimators of this type are discussed in Arellano (1988), Arellano and Bond (1991), Arellano and Bover (1995) and Blundell and Bond (1998). PcGive can be used to compute a range of linear GMM estimators of this type.

Where there are no instruments available that are uncorrelated with the individual effects ηi, the transformation must eliminate this component of the error term. The first difference and orthogonal deviations transformations are two examples of transformations that eliminate ηi from the transformed error term, without at the same time introducing all lagged values of the disturbances vit into the transformed error term.

[Note: There are many other transformations which share these properties. See Arellano and Bover (1995) for further discussion.]

Hence these transformations allow the use of suitably lagged endogenous (and predetermined) variables as instruments. For example, if the panel is balanced, p=1, there are no explanatory variables nor time effects, the vit are serially uncorrelated, and the initial conditions yi1 are uncorrelated with vit for t=2,...,T, then using first differences we have:

Equations                    Instruments available
Δyi3=αΔyi2+Δvi3              yi1
Δyi4=αΔyi3+Δvi4              yi1, yi2
 ⋮                            ⋮
ΔyiT=αΔyi,T-1+ΔviT           yi1, yi2, ..., yi,T-2

In this case, yi*=(Δyi3,...,ΔyiT)', Wi*=(Δyi2,...,Δyi,T-1)' and

Zi =
( yi1  0   0  ...  0   0  ...  0      )
(  0  yi1 yi2 ...  0   0  ...  0      )
(  ⋮                                  )
(  0   0   0  ... yi1 yi2 ... yi,T-2  )

Notice that precisely the same instrument set would be used to estimate the model in orthogonal deviations. Where the panel is unbalanced, for individuals with incomplete data the rows of Zi corresponding to the missing equations are deleted, and missing values in the remaining rows are replaced by zeros.
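
For this AR(1) example, the block-diagonal Zi can be built directly; a sketch for a balanced panel (for unbalanced panels the rows for missing equations would be deleted and missing values replaced by zeros, as described above):

```python
import numpy as np

def gmm_instruments_ar1(y_i):
    """Block-diagonal Z_i for the first-differenced AR(1) equations:
    the row for Δy_it (t = 3..T) uses instruments y_i1, ..., y_i,t-2."""
    T = len(y_i)                       # y_i = (y_i1, ..., y_iT)
    n_rows = T - 2                     # equations t = 3, ..., T
    n_cols = (T - 2) * (T - 1) // 2    # 1 + 2 + ... + (T-2) instruments
    Z = np.zeros((n_rows, n_cols))
    col = 0
    for t in range(3, T + 1):
        k = t - 2                      # instruments available for this equation
        Z[t - 3, col:col + k] = y_i[:k]
        col += k
    return Z

y_i = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical series, T = 4
Z = gmm_instruments_ar1(y_i)
```

With T=4 there are two differenced equations and three instrument columns, matching the pattern shown above.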

In PcGive, we call one-step estimates those which use some known matrix as the choice for Hi. For a first-difference procedure, the one-step estimator uses

Hi = HiD = 1/2 (  2 -1  0 ...  0 )
              ( -1  2 -1 ...  0 )
              (  ⋮              )
              (  0  0 ... -1  2 )
while for a levels or orthogonal deviations procedure, the one-step estimator sets Hi to an identity matrix. If the vit are heteroscedastic, a two-step estimator which uses

Hi = v̂i* v̂i*',

where v̂i* are one-step residuals, is more efficient (cf. White, 1982). PcGive produces both one-step and two-step GMM estimators, with asymptotic variance matrices that are heteroscedasticity-consistent in both cases. Users should note that, particularly when the vit are heteroscedastic, simulations suggest that the asymptotic standard errors for the two-step estimators can be a poor guide for hypothesis testing in typical sample sizes. In these cases, inference based on asymptotic standard errors for the one-step estimators seems to be more reliable (see Arellano and Bond, 1991, and Blundell and Bond, 1998 for further discussion).

In models with explanatory variables, Zi may consist of sub-matrices with the block-diagonal form illustrated above (exploiting all or part of the moment restrictions available), concatenated to straightforward one-column instruments. A judicious choice of the Zi matrix should strike a compromise between prior knowledge (from economic theory and previous empirical work), the characteristics of the sample and computer limitations (see Arellano and Bond, 1991 for an extended discussion and illustration). For example, if a predetermined regressor xit correlated with the individual effect, is added to the model discussed above, i.e.

E(xitvis) = 0  for s ≥ t,
E(xitvis) ≠ 0  otherwise,
E(xitηi) ≠ 0,

then the corresponding optimal Zi matrix is given by

Zi =
( yi1 xi1 xi2  0   0   0   0   0  ...  0  ...   0     0  ...   0     )
(  0   0   0  yi1 yi2 xi1 xi2 xi3 ...  0  ...   0     0  ...   0     )
(  ⋮                                                                 )
(  0   0   0   0   0   0   0   0  ... yi1 ... yi,T-2 xi1 ... xi,T-1  )

Where the number of columns in Zi is very large, computational considerations may require those columns containing the least informative instruments to be deleted. Even when computing speed is not an issue, it may be advisable not to use the whole history of the series as instruments in the later cross-sections. For a given cross-sectional sample size (N), the use of too many instruments may result in (small sample) overfitting biases. When overfitting results from the number of time periods (T) becoming large relative to the number of individuals (N), and there are no endogenous regressors present, these GMM estimators are biased towards within groups, which is not a serious concern since the within-groups estimator is itself consistent for models with predetermined variables as T becomes large (see Álvarez and Arellano, 1998). However, in models with endogenous regressors, using too many instruments in the later cross-sections could result in seriously biased estimates. This possibility can be investigated in practice by comparing the GMM and within-groups estimates.

The assumption of no serial correlation in the vit is essential for the consistency of estimators such as those considered in the previous examples, which instrument the lagged dependent variable with further lags of the same variable. Thus, PcGive reports tests for the absence of first-order and second-order serial correlation in the first-differenced residuals. If the disturbances vit are not serially correlated, there should be evidence of significant negative first-order serial correlation in differenced residuals (i.e. v̂it-v̂i,t-1), and no evidence of second-order serial correlation in the differenced residuals. These tests are based on the standardized average residual autocovariances which are asymptotically N(0,1) variables under the null of no autocorrelation. The tests reported are based on estimates of the residuals in first differences, even when the estimator is obtained using orthogonal deviations.

[Note: Although the validity of orthogonality conditions is not affected, the transformation to orthogonal deviations can induce serial correlation in the transformed error term if the vit are serially uncorrelated but heteroscedastic.]

More generally, the Sargan (1964) tests of overidentifying restrictions are also reported. That is, if AN has been chosen optimally for any given Zi, the statistic

S=( ∑iv̂i*'Zi) AN( ∑iZi'v̂i*)

is asymptotically distributed as a χ2 with as many degrees of freedom as overidentifying restrictions, under the null hypothesis of the validity of the instruments. Note that only the Sargan test based on the two-step GMM estimator is heteroscedasticity-consistent. Again, Arellano and Bond (1991) provide a complete discussion of these procedures.

Where there are instruments available that are uncorrelated with the individual effects ηi, these variables can be used as instruments for the equations in levels. Typically this will imply a set of moment conditions relating to the equations in first differences (or orthogonal deviations) and a set of moment conditions relating to the equations in levels, which need to be combined to obtain the efficient GMM estimator.

[Note: In special cases it may be efficient to use only the equations in levels; for example, in a model with no lagged dependent variables and all regressors strictly exogenous and uncorrelated with individual effects.]

For example, if the simple AR(1) model considered earlier is mean-stationary, then the first differences Δyit will be uncorrelated with ηi , and this implies that Δyi,t-1 can be used as instruments in the levels equations (see Arellano and Bover, 1995 and Blundell and Bond, 1998 for further discussion). In addition to the instruments available for the first-differenced equations that were described earlier, we then have:

Equations                 Instruments available
yi3=αyi2+ηi+vi3           Δyi2
yi4=αyi3+ηi+vi4           Δyi3
 ⋮                         ⋮
yiT=αyi,T-1+ηi+viT        Δyi,T-1

Notice that no instruments are available in this case for the first levels equation (i.e., yi2=αyi1i+vi2), and that using further lags of Δyis as instruments here would be redundant, given the instruments that are being used for the equations in first differences. In a balanced panel, we could use only the last levels equation (i.e., yiT=αyi,T-1i+viT), where (Δyi2,Δyi3,...,Δyi,T-1) would all be valid instruments; however this approach does not extend conveniently to unbalanced panels.

In this case, we use

Wi* = (Δyi2,...,Δyi,T-1, yi2,...,yi,T-1)',

and

Zi =
( ZiD  0   ...   0       )
(  0  Δyi2 ...   0       )
(  ⋮                     )
(  0   0   ... Δyi,T-1   )

where ZiD is the matrix of instruments for the equations in first differences, as described above. Again Zi would be precisely the same if the transformed equations in yi* and Wi* were in orthogonal deviations rather than first differences. In models with explanatory variables, it may be that the levels of some variables are uncorrelated with ηi, in which case suitably lagged levels of these variables can be used as instruments in the levels equations, and in this case there may be instruments available for the first levels equation.

For the system of equations in first differences and levels, the one-step estimator computed in PcGive uses the weighting matrix

Hi = ( HiD    0     )
     (  0   1/2 Ii  )

where HiD is the weighting matrix described above for the first differenced estimator, and Ii is an identity matrix with dimension equal to the number of levels equations observed for individual i. For the system of equations in orthogonal deviations and levels, the one-step estimator computed in PcGive sets Hi to an identity matrix with dimension equal to the total number of equations in the system for individual i. In both cases the corresponding two-step estimator uses Hi=v̂i*v̂i*'. We adopt these particular one-step weighting matrices because they are equivalent in the following sense: for a balanced panel where all the available linear moment restrictions are exploited (i.e., no columns of Zi are omitted for computational or small-sample reasons), the associated one-step GMM estimators are numerically identical, regardless of whether the first difference or orthogonal deviations transformation is used to construct the system. Notice though that the one-step estimator is asymptotically inefficient relative to the two-step estimator for both of these systems, even if the vit are homoscedastic.

[Note: With levels equations included in the system, the optimal weight matrix depends on unknown parameters (for example, the ratio of var(ηi) to var(vit)) even in the homoscedastic case.]

Again simulations have suggested that asymptotic inference based on the one-step versions may be more reliable than asymptotic inference based on the two-step versions, even in moderately large samples (see Blundell and Bond, 1998).

The validity of these extra instruments in the levels equations can be tested using the Sargan statistic provided by PcGive. Since the set of instruments used for the equations in first differences (or orthogonal deviations) is a strict subset of that used in the system of first-differenced (or orthogonal deviations) and levels equations, a more specific test of these additional instruments is a Difference Sargan test which compares the Sargan statistic for the system estimator and the Sargan statistic for the corresponding first-differenced (or orthogonal deviations) estimator. Another possibility is to compare these estimates using a Hausman (1978) specification test, which can be computed here by including another set of regressors that take the value zero in the equations in first differences (or orthogonal deviations), and reproduce the levels of the right-hand side variables for the equations in levels.

[Note: Thus in the AR(1) case described above we would have

(  0   ...   0       yi2 ... yi,T-1 )'
( Δyi2 ... Δyi,T-1   yi2 ... yi,T-1 )

where the first row contains the additional regressors.]


The test statistic is then a Wald test of the hypothesis that the coefficients on these additional regressors are jointly zero. Full details of these test procedures can be found in Arellano and Bond (1991) and Arellano (1995).

Chapter 7 Panel Data Implementation Details

This section gives a summary of the statistical output of PcGive, giving the formulae which are used in the computations.

7.1 Transformations

7.2 Static panel-data estimation

The starting point is (eq:4.2):

y = Wβ + v.

The total number of observations is O=NT. In an unbalanced panel, different individuals may have a different number of observations, Ti. In that case the total number of observations equals O=∑Ti. Baltagi (1995) reviews the standard estimation methods used in this setting:

To summarize the implementation in PcGive:

                          number of          degrees of freedom       transforms
                          observations, n    lost in estimation, p    dummies
OLS (no transformation)   O                  k                        no
within                    O                  k+N                      no
between                   N                  k                        yes
GLS, ML                   O                  k                        yes

It is important to note that the within transformations are only applied to X, and not to the dummies D. In the within estimator, individual dummies are redundant after subtracting the individual means. In the between estimator, time dummies are irrelevant, and individual dummies create a perfect fit. In GLS and ML, the individual means are subtracted with weights.

7.3 Dynamic panel data estimation

The single equation dynamic panel model can be written as:

yit = ∑s=1m αs yi,t-s + xit'γ + λt + ηi + vit,  t = 0,..., Ti-1,  i = 0, ..., N - 1.

As before, λt and ηi are time and individual specific effects and xit is a k*-vector of explanatory variables. It is assumed that allowance for lagged observations has been made, so observations on yi,-m, ..., yi,-1 are available. The total number of observations available for estimation equals O=∑Ti.

Stacking the data for an individual according to time, and using D for the matrix with dummies, gives:

yi = Xiγ + Diδ + vi,  i = 0, ..., N - 1.

In formulating the model, we distinguish the following types of variables:

y dependent variables,
X regressors, including lagged dependent variables,
D dummy variables,
I normal instruments,
L `level' instruments,
G GMM-style instruments.

The Gi are described under the Gmm function. From these variables, the dependent variable, regressor and instrument matrices are formed as follows:

qi = y*i,   Ti ×1,
Wi = [X*i : Dsi],   Ti ×k,
Zi = [Gi : I*i : Dsi : Li],   Ti ×z,

where the * denotes some transformation which can be chosen from §7.1. This also affects the degrees of freedom, as shown in Table 7.1. The estimators and tests are described in Arellano and Bond (1991), and have the form:

M =(∑iWi'Zi)AN (∑iZi'Wi),
AN =(∑iZi'HiZi)-1,
β̂ =M-1(∑iWi'Zi)AN (∑iZi'qi),
σ̂u2 =û'û/(n-p),
û =q - Wβ̂.
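The estimator above can be illustrated numerically. The following sketch specializes the formulas to the just-identified scalar case (one regressor, one instrument column, Hi = I), where M, AN and β̂ are all scalars; the function name and data layout are illustrative, not part of PcGive:

```python
def gmm_one_step(W, Z, q):
    """One-step GMM with a scalar regressor/instrument and H_i = I.

    W, Z, q are lists of per-individual lists (unbalanced panels allowed).
    Computes A_N = (sum_i Z_i'H_i Z_i)^-1, M = (sum W'Z) A_N (sum Z'W),
    and beta = M^-1 (sum W'Z) A_N (sum Z'q)."""
    swz = sum(w * z for Wi, Zi in zip(W, Z) for w, z in zip(Wi, Zi))
    szq = sum(z * y for Zi, qi in zip(Z, q) for z, y in zip(Zi, qi))
    szz = sum(z * z for Zi in Z for z in Zi)  # sum_i Z_i' H_i Z_i with H_i = I
    A = 1.0 / szz
    M = swz * A * swz
    return (swz * A * szq) / M  # in this scalar case: (sum Z'q)/(sum Z'W)
```

With Z = W the estimator collapses to pooled OLS, β̂ = ∑wq/∑w².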

Table:7.1 Transformation and the treatment of dummy variables

transformation   number of          degrees of freedom       transforms dummies
                 observations, n    lost in estimation, p
differences      O-N                k                        Ds=D (optional: D*)
deviations       O-N                k                        Ds=D (optional: D*)
none             O                  k                        no: Ds=D
within           O                  k+N                      no: Ds=D*
between          N                  k                        yes: Ds=D*
GLS              O                  k                        yes: Ds=D*

When the option to concentrate out dummies is used, y*, X*, I*, and L are replaced by the residuals from regressing on the set of dummies Ds. Subsequently, the dummies are omitted from further computations.

In one-step estimation, Hi is the identity matrix, except when estimating in first differences:

H1,i = ITi,   except:   H1,idiff = (
1    -1/2  0    ...  0
-1/2  1   -1/2  ...  0
0    -1/2  1    ...  0
...
0     0    0    ...  1
).
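The tridiagonal weighting matrix for the first-differenced equations is easy to construct directly; a minimal sketch with a hypothetical helper name:

```python
def h_one_step_diff(Ti):
    """Ti x Ti one-step weighting matrix for first-differenced equations:
    1 on the diagonal, -1/2 on the first off-diagonals, 0 elsewhere."""
    H = [[0.0] * Ti for _ in range(Ti)]
    for t in range(Ti):
        H[t][t] = 1.0
        if t + 1 < Ti:
            H[t][t + 1] = H[t + 1][t] = -0.5
    return H
```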

In two-step estimation Hi is based on the previous step residuals:

one step:   H1,i, see (eq:7.4);     M1, A1,N;   û1,i = û*1,i,
two step:   H2,i = û*1,iû*1,i';     M2, A2,N;   û2,i = û*2,i,

with a subscript added to distinguish between one and two step estimation.

The output reports the residual and total sum of squares:

RSS = û' û,
TSS = q' q - (q' ι)2/O.

So the TSS is in deviation from the mean.

The one-step estimated variance of the coefficients is:

V̂1[β̂] = σ̂1,u2 M1-1.

If selected, robust standard errors after one-step estimation are computed as:

V̂1r[β̂] = M1-1 (∑iWi'Zi)A1,N A2,N-1A1,N (∑iZi'Wi) M1-1.

When there are no instruments, (eq:7.5) simplifies to:

(W'W)-1 (∑i=1N Wi'ûi ûi'Wi) (W'W)-1,

which is the robust variance estimator proposed in Arellano (1987).

[Note: This robust estimator is not identical to the standard White (1980) estimator for OLS, which uses as the central matrix:

i=1NT wi'ûi2wi,

where wi denotes the ith row of W.]


The two-step asymptotic variance matrix is:

V̂2[β̂] = M2-1.

This variance estimator can be severely downward biased in small samples, and is therefore only reported when robust standard errors are switched off. The preferred solution (and the default) is to use the robust variance for two-step estimation, V̂2r[β̂], which uses the small-sample correction derived by Windmeijer (2000).

The AR test for order m is computed as:

AR(m) = d0 / (d1 + d2 + d3)1/2.

Using wi for the residuals lagged m periods (substituting zero for missing lags):

wit = ui,t-m for t=m,...,Ti, and   wit = 0 for t < m.

The di are defined as:

d0 = ∑i wi' ui,
d1 = ∑i wi' Hi wi,
d2 = -2 (∑i wi'Wi)M-1 (∑i Wi'Zi) AN (∑i Zi' Hi wi),
d3 = (∑i wi'Wi) V̂[β̂] (∑i Wi'wi).

The components are used as the notation suggests, except for Hi in one-step estimation:

                   Hi             M    AN     ui       V̂[β̂]
one step:          σ̂1,u2 H1,i    M1   A1,N   û*1,i    V̂1[β̂]
one step, robust:  H2,i           M1   A1,N   û*1,i    V̂1r[β̂]
two step:          H2,i           M2   A2,N   û*2,i    V̂2[β̂]
two step, robust:  H2,i           M2   A2,N   û*2,i    V̂2r[β̂]

After estimation in orthogonal deviations, the AR test (eq:7.6) is based on the residuals from the differenced model as follows (the superscript Δ refers to the regressors and residuals from the model in differenced form):

d0 = ∑i wiΔ' uiΔ,
d1 = ∑i wiΔ' HiΔ wiΔ,
d2 = -2 (∑i wiΔ'WiΔ)M-1 (∑i Wi'Zi) AN (∑i Zi' Ψi),
d3 = (∑i wiΔ'WiΔ) V̂[β̂] (∑i WiΔ'wΔi).

HiΔ is H1,idiff after one-step estimation, and ûiΔûiΔ' otherwise. Ψi equals

(
((Ti+1)/Ti)1/2           0                 0    ...  0
((Ti-1)/(Ti-2))1/2   (Ti/(Ti-1))1/2    0    ...  0
...
0   0   0   ...   (1/2)1/2   (2/1)1/2
) wi

after one-step estimation, and ûiûiΔ'wiΔ after one-step robust or two-step estimation.

The Sargan test with z-k degrees of freedom is after one-step estimation:

S1 = (∑i û*1,i'Zi) A1,N (∑i Zi'û*1,i) σ̂1,u-2,

and after two-step estimation (see eq:4.4 and equation (10) in Arellano and Bond, 1991):

S = (∑i û*2,i'Zi) A2,N (∑i Zi'û*2,i).

Three Wald tests are routinely reported to test the significance of groups of variables:

Wald (joint): on W, all regressors except dummies,
Wald (dummy): on D, all dummies (including constant),
Wald (time): on time dummies (including the constant in the
differenced/deviations model).

7.4 Dynamic panel data, combined estimation

In combined estimation, GMM-SYS, the levels equation is used jointly with the transformed equation, see Blundell and Bond (1998). The IV estimation of (eq:7.3) still applies, but the data are organized as follows:

qi =[
y*i
yi
]   2Ti ×1

Wi =[
X*i D*i
Xi Di
]   2Ti ×k

Zi =[
Gi 0 I*i 0 Li
0 G+i Ii Di 0
]   2Ti ×(z+z+)

G+i are the GMM-style instruments for the levels equation, described under the GmmLevel function.

[Note: The code used to estimate Blundell and Bond (1998) had Z*i=(Gi 0 I*i D*i Li).]

The dummies in the transformed equation are always transformed. When using differences or deviations, the effective number of observations in the transformed equations is as before (O-N), whereas the levels equations have O observations. Internally, both the transformed and the levels block for individual i have Ti equations, but, when using differences/deviations, the first observation in the transformed equation is set to zero, resulting in Ti-1 effective observations.

When the option to concentrate out dummies is used, y*, X*, I*, and L are replaced by the residuals from regressing on the set of dummies D*, and y and X on D. Note that there is one time dummy extra in combined estimation compared to normal estimation when using differences or deviations.

After estimation, there are two sets of residuals:

û* =q* - W*β̂ (transformed equation),
û+ =q+ - W+β̂ (levels equation).

The reported residual variance is based on the levels residuals: σ̂u+2 = û+'û+/(n-p). Here n = O =∑i=1N Ti and p are as before. The reported RSS and TSS are also based on the levels, using û+ and q+.

In one-step estimation, Hi is as in (eq:7.4) in the transformed equations, and the identity matrix in the level equations (1/2I when using differences).

[Note: The code used to estimate in Blundell and Bond (1998) had H1,i = ITi also for estimation in differences.]

Only robust standard errors are reported after one-step estimation.

In two-step estimation, Hi is based on the previous step residuals ûi' = (ûi*' : ûi+').

The AR test (eq:7.6) is based on the residuals from the transformed equation:

d0 = ∑i wi*' ui*,
d1 = ∑i wi*' ui* ui*' wi*,
d2 = -2 (∑i wi*'Wi*)M-1 (∑i Wi'Zi) AN (∑i Zi' ui ui*' wi*),
d3 = (∑i wi*'Wi*) V̂[β̂] (∑i Wi*'w*i).

After orthogonal deviations, the * is replaced by Δ as in (eq:7.8).

Chapter 8 Introduction to Volatility Models

8.1 Introduction

The autoregressive conditional-heteroscedastic (ARCH, see Engle, 1982) and generalized ARCH (GARCH, see Bollerslev, 1986) models have found widespread application since their introduction. These models were designed to capture the volatility clustering which can be observed in macro-economic series such as inflation (Engle, 1982), or financial time-series such as exchange rates and stock market returns. Recent surveys of GARCH-type models include Bera and Higgins (1993), Bollerslev, Engle, and Nelson (1994), Shephard (1996), and Gourieroux (1997), also see the text books by Hamilton (1994), Mills (1999) and Tsay (2005).

The ARCH(q) model featured briefly in Volume I as part of non-linear modelling, and throughout in the form of the LM test for ARCH effects in the residuals of a regression model. The ARCH(q) model can be defined as:

yt = μ + ut,   with ut = ht1/2εt,   and ht = α0 + ∑j=1q αjut-j2.

Assuming εt~IN[0,1], identically and independently distributed for all t, gives ut|ut-1,...,ut-q ~ N(0,ht). So conditional on the past, the model is normal but heteroscedastic. The log-likelihood for observation t differs from the linear model with normal errors through the time-dependence of the conditional variance:

lt(μ, α|Ft-1) = c - ½ log(ht) - ½(yt-μ)2/ht,

where Ft-1 captures all the information up to t-1.

In contrast with stochastic volatility models (see Shephard, 1996), where the variance is modelled as an unobserved (latent) process, the model in (eq:8.1) leads to an explicit expression for the log-likelihood: given values for μ and α=(α1,...,αq)', we can evaluate ût-q,...,ût-1 and therefore ĥt and lt. This makes maximum likelihood estimation of the model in (eq:8.1), and its extensions discussed below, relatively easy in comparison with stochastic volatility models. There is a problem very early on in this recursion regarding the treatment of ût for t ≤ 0 --- this is discussed below. The first-order conditions for a maximum of the log-likelihood cannot be solved analytically in terms of the parameters; instead this requires the use of a numerical optimization routine.
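The likelihood recursion can be sketched for the ARCH(q) case of (eq:8.1), using the mean of the squared residuals for the pre-sample values (the default start-up discussed in Chapter 10); the function name is illustrative:

```python
import math

def arch_loglik(y, mu, alpha0, alphas):
    """Gaussian log-likelihood of an ARCH(q) model, with the constant
    c = -0.5*log(2*pi) included. Pre-sample squared residuals are set to
    the mean of u_t^2 (the start-up of the GARCH chapter)."""
    u = [yt - mu for yt in y]
    u2bar = sum(ut * ut for ut in u) / len(u)
    ll = 0.0
    for t, ut in enumerate(u):
        h = alpha0  # conditional variance recursion h_t = a0 + sum_j a_j u_{t-j}^2
        for j, aj in enumerate(alphas, start=1):
            h += aj * (u[t - j] ** 2 if t - j >= 0 else u2bar)
        ll += -0.5 * (math.log(2 * math.pi) + math.log(h) + ut * ut / h)
    return ll
```

With q=0 this reduces to the homoscedastic normal log-likelihood with variance α0.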

More generally, the regression model with normal-GARCH(p,q) errors is defined as:

yt = xt'ζ + ut,
ut = εtht1/2,   εt|Ft-1 ~ N[0,1],
ht = α0 + ∑i=1q αiut-i2 + ∑i=1p βiht-i,   t = 1,...,T.

The ARCH(q) model corresponds to GARCH(0,q). The GARCH(p,q) model allows for a parsimonious parametrization of an ARCH(∞) structure: assuming that the roots of 1-β(L)=1-∑i=1pβiLi are outside the unit circle, and letting α(L)=∑i=1qαiLi, (eq:8.2) implies

ht = α0[1-β(1)]-1 + α(L)[1-β(L)]-1ut2.

The conditional variance ht must be nonnegative, which requires that the coefficients of δ(L)=α(L)/(1-β(L)) are nonnegative (and α0≥0). A sufficient condition for this is that αi≥0 and βi≥0 for all i (however this is overly restrictive; see Nelson and Cao, 1992).

The equation for ht in (eq:8.2) can be written in ARMA form using νt=u2t-ht=(ε2t-1)ht:

ut2 = α0 + ∑i=1mi + βi)ut-i2 - ∑i=1p βiνt-i + νt,

where m = max(p,q) and we set βi=0 for i>p, and αi=0 for i>q. Note that E[νt|Ft-1]=0. It follows that ut has a finite unconditional variance, given by E[ut2]=α0/(1-α(1)-β(1)), if and only if the roots of

1 - ∑i=1mi + βi)zi = 0

are all outside the unit circle. A necessary and (given nonnegativity of αi and βi) sufficient condition for this is α(1)+β(1)=∑i=1mii)<1. Under this condition, the process ut is weakly stationary. Equation (eq:8.3) also forms the basis for forecasting, because, for the GARCH(1,1) model:

E[hT+H|hT]=(α+β)HhT + α0[∑j=0H-1(α+β)j].

In the first forecast, νT-1 still features, but afterwards it is set to its expectation of zero. The default parameter space in PcGive is formulated in terms of the ARMA representation, with α0≥0, αii≥0, and α(1)+β(1)<1.
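The GARCH(1,1) forecast formula can be evaluated directly; a small sketch (illustrative function name), which for α+β<1 converges to the unconditional variance α0/(1-α-β) as the horizon H grows:

```python
def garch11_forecast(hT, alpha0, alpha, beta, H):
    """H-step-ahead conditional variance forecast for GARCH(1,1):
    E[h_{T+H} | h_T] = (a+b)^H h_T + a0 * sum_{j=0}^{H-1} (a+b)^j."""
    s = alpha + beta
    return s ** H * hT + alpha0 * sum(s ** j for j in range(H))
```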

[Note: Formally, PcGive imposes α(1)+β(1)<1 with strict inequality. However, numerically, the result can get so close as to being virtually indistinguishable from one. In that case the sum is assumed to be on the IGARCH boundary. By default, α(1)+β(1)>1 is excluded, although this restriction can be relaxed. ]

When α(1)+β(1)=1, the ARMA representation (eq:8.3) has a unit root, so that ut has no finite unconditional variance; this situation is called integrated GARCH (IGARCH). In the IGARCH(1,1) case:

E[hT+H]=hT + α0H,

which grows linearly with the forecast horizon H. However, the ut process is strictly stationary (provided α0 > 0, see Nelson, 1990).

The GARCH model can also be estimated with standardized Student-t distributed errors (see §9.5). As the degrees of freedom go to infinity, this distribution approaches the standard normal.

We also consider the exponential GARCH, or EGARCH(p,q), specification (see Nelson, 1991):

log ht = α0 + ∑i=1q αi1εt-i + θ2(|εt-i| - E|εt|)} + ∑i=1p βi log ht-i,

with α1=1. The log-formulation implies that ht is always nonnegative, regardless of the (possibly negative) parameter values; stationarity is now related to the roots of 1-β(L), in particular to the question whether β(1)<1 (but the implementation in PcGive does not impose restrictions on the EGARCH parameter space). When θ1≠0, the model allows for an asymmetric response. In the normal EGARCH model, the errors follow a standard normal distribution. PcGive also allows the EGARCH model with a generalized error distribution (GED), which is defined as:

f(z) = ν exp(-½|z/λ|ν) / [λ 2(1+1/ν) Γ(1/ν)],   -∞ < z < ∞,   ν > 0,

where Γ(.) is the gamma function, and λ2=2-2/νΓ(1/ν)/Γ(3/ν). It contains the standard normal distribution as a special case, in which case E|εt|=(2/π)1/2. When ν<2 the distribution has a thicker tail than the standard normal. PcGive estimates ν*= log(ν/2), which is zero for the standard normal distribution; ν*<0 implies thicker tails.
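The GED density and its normalizing constant λ can be checked numerically; a sketch assuming the parametrization above (illustrative function name). At ν=2 it reproduces the standard normal density, and at ν=1 a unit-variance Laplace density:

```python
import math

def ged_density(z, nu):
    """Generalized error density with mean 0 and variance 1 (Nelson, 1991):
    f(z) = nu * exp(-0.5*|z/lam|**nu) / (lam * 2**(1+1/nu) * Gamma(1/nu)),
    with lam**2 = 2**(-2/nu) * Gamma(1/nu) / Gamma(3/nu)."""
    lam = math.sqrt(2.0 ** (-2.0 / nu) * math.gamma(1.0 / nu) / math.gamma(3.0 / nu))
    return nu * math.exp(-0.5 * abs(z / lam) ** nu) / (
        lam * 2.0 ** (1.0 + 1.0 / nu) * math.gamma(1.0 / nu))
```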

Both the GARCH and EGARCH models can be estimated with the conditional variance in the mean, with a choice between ht, ht1/2 and log ht entering the vector of regressors xt.

Finally, asymmetric and threshold versions of the GARCH model are available, where ht is specified as:

ht = α0 + ∑i=1q αi(ut-i1)2 + κ2Dt-1(ut-11)2 + ∑i=1p βiht-i,

where Dt-1=1 if ut-11 and zero otherwise; κ1 is the asymmetry parameter, and κ2 the threshold parameter. Either or both can be zero: a pure threshold model has κ1=0 (Glosten, Jagannathan, and Runkle, 1993), and a pure asymmetric GARCH model has κ2=0.

[Note: The terminology we adopt here is somewhat arbitrary: both models actually allow for asymmetry.]

In summary, the following models can be estimated:

GARCH(p,q), including asymmetric and threshold versions:
  • normal or t errors
  • conditional variance in mean
  • regressors in conditional variance
EGARCH(p,q):
  • normal or GED errors
  • conditional variance in mean
  • regressors in conditional variance.

Chapter 10 GARCH Implementation Details

10.1 GARCH model settings

The default is a GARCH(1,1) model with normal errors, see (eq:8.2). If desired, other values for p and q can be selected.

The Model Settings dialog also has a range of options (the default is marked below with a *):

  1. Main options

    1. Order p and q

    2. choice between EGARCH(p,q), see (eq:8.4), or GARCH(p,q), see (eq:8.2)

    3. Non-normal error distribution

      This is either a GARCH model with Student-t distributed errors or an EGARCH model with errors which have a generalized error distribution.

  2. GARCH variations

    1. Asymmetric GARCH, see (eq:8.6)

    2. Threshold GARCH, see (eq:8.6)

      Selecting both results in a threshold asymmetric GARCH model,

    3. Conditional variance in mean

      • No conditional variance in mean
      • h_t in mean
      • sqrt(h_t) in mean
      • log(h_t) in mean

  3. GARCH parameter restrictions

    1. Unrestricted estimation

      This imposes no restrictions, and could in rare cases lead to negative variance estimates.

    2. Impose conditional variance >= 0

      Nelson and Cao (1992) argued that imposing all coefficients to be nonnegative is overly restrictive, and that negative estimates occur in practice (they list several examples). Subsequently, He and Teräsvirta (1999) have shown that such negative coefficients allow for richer shapes of the autocorrelation function. Doornik and Ooms (2008) discuss the implementation of these restrictions.

    3. *Impose stationarity and alpha+beta >= 0

      Doornik and Ooms (2008) also considered restrictions in terms of (eq:8.3). Defining m= max (p,q), βi=0 for i>p, αi=0 for i>q:

      α0 ≥ 0,
      αi + βi ≥ 0,   for i = 1,...,m.

      In terms of (eq:8.3), these restrictions imply that the unconditional variance exists, and is always positive. Formally, PcGive imposes α(1)+β(1)<1 with strict inequality. However, numerically, the result can get so close as to being virtually indistinguishable from one. In that case the sum is assumed to be on the IGARCH boundary.

    4. Impose alpha,beta >= 0

      Bollerslev (1986) proposed the GARCH model with α0 ≥0, αi ≥0, and βi ≥0. This ensures that ht≥0.

  4. Start-up of GARCH variance recursion.

    Evaluation of the likelihood requires pre-sample values for ut2 and ht. PcGive implements two methods:

    1. *Use mean variance

      Bollerslev (1986) suggested using the mean of the squared residuals:

      ui2 = hi = T-1t=1T ut2,   for i ≤ 0.

      In EGARCH, the recursion is started with the logarithm of the mean of the squared residuals.

    2. Estimate missing variances

      This amounts to adding the missing pre-sample values of ut2 as parameters which are to be estimated. This makes the likelihood derivatives more complex, and PcGive will only use numerical derivatives if this method is selected (see below).

  5. Preferred covariance estimator

    1. Second derivatives

      This uses minus the inverse of the numerical second derivative of the log-likelihood function.

    2. *Information matrix

      The information matrix is computed analytically, but only for standard GARCH models.

    3. Outer product of gradients

    In addition, the robust standard errors are printed by default when the information matrix J is available. These are of the form suggested by Bollerslev and Wooldridge (1992): Ĵ-1ĜĴ-1, where Ĝ is the outer product of the gradients.

    A comparison of various estimators is given in Fiorentini, Calzolari, and Panattoni (1996).

  6. Search for global maximum after initial estimation

    Especially when p>1, it is possible that the likelihood has multiple optima. This final set of advanced options allows for a search from random starting values. Because each of these involves maximization of the likelihood, this option can be time consuming. See Doornik and Ooms (2008) for an application.

10.2 Some implementation details

  1. Maximization technique

    PcGive only offers BFGS (see e.g. Fletcher, 1987 or Gill, Murray, and Wright, 1981), because it is our preferred method for numerical optimization. BFGS avoids the need for second derivatives, while being one of the most robust methods available. This is supplemented by a line search when, at an iteration step, the likelihood does not increase. BFGS was not considered by Fiorentini, Calzolari, and Panattoni (1996), but we found 100% convergence when replicating their Table 1 with 1000 replications (requiring about 17 iterations on average, whether starting from the DGP values, or from a starting value routine).

  2. Starting values for the parameters

    PcGive uses the method of Galbraith and Zinde-Walsh (1997) applied to the squared data (after removing regressors in the mean). If necessary, the resulting parameter values are reduced to ensure that the unconditional variance exists.

  3. Use of analytical derivatives

    Numerical derivatives are more convenient, but less accurate than analytical derivatives (see Fiorentini, Calzolari, and Panattoni, 1996). The latter are to be preferred, but convenience often dictates the use of the former. In simple GARCH models, we found numerical derivatives sufficiently effective, with model estimation taking the same amount of time, and convergence achieved as frequently.

    PcGive uses analytical derivatives, except (1) when the Hessian matrix is required for the variance-covariance matrix, (2) when the initial values of the conditional variances are added as parameters, (3) for EGARCH-type models and (4) for conditional variance in mean models.

Chapter 11 Introduction to Time Series Models

The fractionally-integrated autoregressive-moving average model, denoted arfima(p,d,q) can be used for the statistical analysis of a univariate time series yt with long memory. The more familiar arma (or `Box--Jenkins') models (when d=0) or arima (when d is a positive integer) are special cases. The arfima(p,d,q) model allows one to model the long-run behaviour of a time series in a flexible way. Empirical modelling involves identification, estimation and testing. In the identification stage of ARIMA-modelling one determines the integer part of the order of differencing, d, and the orders of the autoregressive and the moving-average parts of the model, p, and q. In the ARFIMA-model one can estimate d and compute a confidence interval. The arfima model allows for likelihood ratio and Wald tests for the null of short-memory stationarity (d=0), as well as for the null of unit-root non-stationarity (d=1) against long memory and intermediate memory alternatives. These tests are easy to apply since they have conventional chi-squared limit distributions.

The tests complement widely applied Dickey--Fuller type tests for d=1 against short memory alternatives and tests for d=0 against d=1 based on variances of cumulative sums. PcGive's parameter instability test applied to the constant term of a regression model can be interpreted as a test for d=0 with fixed mean, against d=1 with a random walk mean.

From the early nineteen sixties onwards, when Mandelbrot suggested long-memory models for economic time series, there has been a steady growth in the literature on the subject. Robinson (1994) and Baillie (1996) provided useful surveys of the developments in the econometric modelling of long memory; Beran's monograph, Beran (1994), discusses most of the central issues, including forecasting. The implementation in PcGive is derived from the Arfima package for Ox by Doornik and Ooms (1999), while an application and a more extensive discussion can be found in Doornik and Ooms (2004).

We write the arfima(p,d,q) model as:

Φ(L)(1-L)d(yt - μt) = Θ(L)εt,   t = 1,...,T,

where Φ( L )=(1-φ1L-...-φpLp) is the autoregressive polynomial and Θ( L )=(1+θ1L+...+θqLq) is the moving average polynomial in the lag operator L; p and q are integers, d is real. For example, the arfima(1,d,1) model is:

(1-L)dzt = φ1(1-L)dzt-1 + εt + θ1εt-1,

where

zt = yt - μt.

The notation in textbooks often differs in terms of the sign of the parameters: we follow the regression-model notation.

(1-L)d is the fractional difference operator defined by the following binomial expansion:

(1-L)d = ∑j=0 δj Lj = ∑j=0 (d j)(-L)j,

where (d j) denotes the (generalized) binomial coefficient.

We assume εt~NID[0,σε2], and write μt for the mean of yt. The arma-part of the model is assumed invertible and stationary: all roots of Θ(z)=0 and Φ(z)=0 are outside the unit circle. In addition, Θ(z)=0 and Φ(z)=0 do not have common roots. We say that zt=yt-μt is I(d), integrated of order d. The zero-mean arfima(p,d,q) process zt is covariance stationary if d<0.5, see Hosking (1981), with an autocovariance function that decays hyperbolically. The process is long memory in the case 0<d<0.5. For -0.5<d<0 the process is called intermediate memory or `overdifferenced', see Brockwell and Davis (1993) and e.g. Chang and Dickey (1994). We assume d>-1, which makes the process zt invertible, see Odaki (1993). We provide three estimation methods: Exact Maximum Likelihood (EML), Modified Profile Likelihood (MPL) and nonlinear least squares (NLS). By definition, EML and MPL impose -1 < d < 0.5. MPL is preferred over EML if the model includes regressor variables and the sample is not very large. NLS allows for d > -0.5 and can be used to estimate models for non-stationary processes directly, without a priori differencing. NLS estimation is usually fast.
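The coefficients δj of (1-L)d can be generated by the simple recursion δ0=1, δj=δj-1(j-1-d)/j, which follows from the binomial expansion; a sketch with an illustrative function name (for d=1 it returns the first-difference filter 1, -1, 0, ...):

```python
def frac_diff_weights(d, n):
    """First n coefficients delta_j of (1-L)^d = sum_j delta_j L^j,
    via delta_0 = 1, delta_j = delta_{j-1} * (j - 1 - d) / j."""
    delta = [1.0]
    for j in range(1, n):
        delta.append(delta[-1] * (j - 1 - d) / j)
    return delta
```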

The next chapter shows how PcGive can be used to estimate arfima models. We discuss implementation details in Chapter 13.

Chapter 13 ARFIMA Implementation Details

13.1 Introduction

This chapter gives a summary of the implementation details of the procedures in the Arfima package. It complements Doornik and Ooms (2004), while further computational aspects are discussed in Doornik and Ooms (2003).

13.2 The Arfima model

The basic arma(p,q) model is

yt = φ1yt-1 + ... + φpyt-p + εt + θ1εt-1 + ... + θqεt-q,   t = 1,...,T,

assuming either εt~NID(0,σε2), or E[εt]=0 and E[εt2]=σε2. Using lag polynomials and introducing a mean μ we write:

Φ( L) ( yt-μ) =Θ( L) εt.

With a fractional integration parameter d, the arfima(p,d,q) model is written as

Φ( L) ( 1-L) d( yt-μ) =Θ( L) εt.

The autocovariance function (ACovF) of a stationary arma process with mean μ:

c( i) =E[ ( yt-μ) ( yt-i-μ) ],

defines the variance matrix of the joint distribution of y=(y1,...,yT)':

V[y] = (
c(0)    c(1)   ...  c(T-1)
c(1)    c(0)   ...
...
c(T-1)  ...    c(1)  c(0)
) = T[c(0), ..., c(T-1)] = Σ,

which is a Toeplitz matrix, denoted by T. Under normality:

y ~ N[μι, Σ],

where ι is a T×1 vector of ones.
The autocorrelation function, ACF: c( i) /c( 0), of a stationary arma process is discussed in many textbooks, and readily computed from the φi and θi using the Ox function armavar. We often work with the autocovariances scaled by the error variance:

r=[ r( 0) ...r( T-1) ] '=σε-2[ c( 0) ...c( T-1) ] '.

13.2.1 Autocovariance function

An algorithm for the computation of the ACovF of an arfima process is derived in Sowell (1992):

c(i) = σε2k=-qqj=1p ψkζjC(d, p+k-i, ρj),

where

ψk = ∑s=|k|q θsθs-|k|,    ζj-1 = ρj[∏i=1p(1-ρiρj) ∏m≠jj - ρm)],

and

C(d,h,ρ) = [Γ(1-2d)/[Γ(1-d)]2] [(d)h/(1-d)h] [ρ2pF(d+h,1;1-d+h;ρ) + F(d-h,1;1-d-h;ρ) - 1].

Here Γ is the gamma function, ρj are the roots of the AR polynomial (assumed distinct), and F(a,1;c;ρ) is the hypergeometric function, see e.g. Abramowitz and Stegun (1970, Ch. 15):

F(a,b;c;ρ) = ∑i=0 [(a)i(b)i/(c)i] ρi/i!,

where we use Pochhammer's symbol:

( a) i=a( a+1) ( a+2) ...( a+i-1) ,  ( a) 0=1.

So (1)i equals i!.

In the absence of AR parameters (eq:13.4) reduces to

[Note: Note the typo in the equation below (8) in Sowell (1992, p.173): Γ( d+s-l) in the numerator should read Γ( d-s+l) .]

c(i) = σε2k=-qq ψk [Γ(1-2d)/[Γ(1-d)]2] [(d)k-i/(1-d)k-i].
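Pochhammer's symbol and the p=q=0 case of the autocovariance function are straightforward to compute; a sketch using the symmetry c(i)=c(-i), so that the subscript k-i above reduces to i (function names are illustrative):

```python
import math

def pochhammer(a, i):
    """(a)_i = a (a+1) ... (a+i-1), with (a)_0 = 1."""
    out = 1.0
    for k in range(i):
        out *= a + k
    return out

def acovf_0d0(d, sigma2, n):
    """c(0), ..., c(n-1) for arfima(0,d,0) with d < 0.5:
    c(i) = sigma2 * Gamma(1-2d)/Gamma(1-d)^2 * (d)_i / (1-d)_i."""
    c0 = sigma2 * math.gamma(1.0 - 2.0 * d) / math.gamma(1.0 - d) ** 2
    return [c0 * pochhammer(d, i) / pochhammer(1.0 - d, i) for i in range(n)]
```

For d=0 this returns the white-noise autocovariances (σε2, 0, 0, ...), and successive autocovariances satisfy c(i+1)/c(i) = (i+d)/(i+1-d).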

13.3 Estimation

13.3.1 Regressors in mean

Any set of exogenous regressors may be used to explain the mean:

z=y-μ,   μ=f(X,β),

where X is a T×k matrix. In the leading linear case f(X,β)=Xβ and β is a k×1 vector.

13.3.2 Initial values

Initial values for the parameter estimates are obtained in the order: regressors in mean, d, AR part, and finally MA part. The very first step is to subtract a mean from yt: zt = yt - μt. When either the sample mean or a specified (known, possibly zero) mean is used: μt = μ. If regressors are used, take μt=f(xt,β). In the linear case β is obtained by regression.

  1. For the fractional integration parameter the (frequency domain) log periodogram regression of Geweke and Porter-Hudak (1983) is used, yielding d̂0. We use [T1/2] nonzero periodogram points, except when p=q=0 when we use all available points. The initial time domain residuals are then obtained using the Ox function diffpow, which computes `naive' fractional differences of zt:

    ut = ∑j=0t-1 [(-d̂0)j/j!] zt-j,   t = 1,...,T.
  2. Next, AR starting values are obtained from solving the Yule-Walker equations taking the number of MA parameters into account:

    (
    ρ̂(q)      ...  ρ̂(q-p+1)
    ...
    ρ̂(q+p-1)  ...  ρ̂(q)
    ) φ0 = (
    ρ̂(q+1)
    ...
    ρ̂(q+p)
    )

    where ρ̂(i) is the empirical autocorrelation of ut. When q is zero, the matrix on the left-hand side is the Toeplitz matrix T[ρ̂(0), ..., ρ̂(p-1)].

    We use OLS to solve this system, giving the initial estimates of the AR coefficients φ̂0; this will also give a solution when the matrix is singular. Subsequently, the arma0 function is used to obtain autoregression residuals ut*.

  3. Starting values for the MA parameters are derived from ut* using Tunnicliffe-Wilson's method, see Granger and Newbold (1986, p.88). Because this iterative method is slow to converge, we choose rather loose convergence criteria. A non-invertible MA is `flipped' to an invertible MA by inverting roots outside the unit circle. The arma0 function is then used to obtain ARMA residuals ut**.

When the initial values are used as starting values for further estimation, the following adjustments are made:

  1. If d is not significant at 5%, it is set to zero. A value of d̂0 less than -0.45 is set to -0.40, and similarly to 0.40 for a value greater than 0.45.

  2. If q > 0 and the solution from the Yule-Walker equations yields non-stationary AR parameters, the method is applied as if q=0.

13.3.3 Exact maximum likelihood (EML)

Based on normality (eq:13.3), and with a procedure to compute the autocovariances in (eq:13.2), the log-likelihood is simply (writing z for the data vector used for maximization):

log L(d,φ,θ,β,σε2) = -T/2 log(2π) - 1/2 log|Σ| - 1/2 z'Σ-1z.

It is convenient to concentrate σε2 out of the likelihood, starting by writing Σ= Rσε2:

log L(d,φ,θ,β,σε2) ~ -1/2 log|R| - T/2 log σε2 - ½σε-2z'R-1z.


Differentiating with respect to σε2, and solving yields

σ̂ε2 = T-1z'R-1z,
with concentrated likelihood (CLF):

lc( d,φ,θ,β) =- T/2 log ( 2π) - T/2 - 1/2 log | R| - T/2 log [ T-1z'R-1z] .

When f(X,β)=Xβ it is more convenient to also concentrate β out of the likelihood. The resulting normal profile log-likelihood function becomes:

lP( d, φ ,θ ) = - T/2 (1+ log 2π) - 1/2 log | R | - T/2 log [ T-1ẑ'R -1ẑ],


ẑ = y - Xβ̂,   β̂ = (X'R-1X)-1X'R-1y.
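For illustration, when R is diagonal the GLS coefficient β̂ = (X'R-1X)-1X'R-1y with a single regressor reduces to a weighted regression; a minimal sketch (hypothetical function, not the package's algorithm, which exploits the Toeplitz structure of R):

```python
def gls_beta_diag(x, y, r):
    """beta = (X'R^-1 X)^-1 X'R^-1 y for one regressor and R = diag(r)."""
    num = sum(xi * yi / ri for xi, yi, ri in zip(x, y, r))
    den = sum(xi * xi / ri for xi, ri in zip(x, r))
    return num / den
```

With r constant this is ordinary least squares without intercept.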

The function used in the maximization procedure is:

- 1/2 {T-1 log|R| + log σ̂ε2},

from which the value for the log-likelihood (eq:13.10) is easily derived. The computational procedure described in Doornik and Ooms (1999) writes


with |R| a by-product of the procedure.

Function (eq:13.12) is maximized using BFGS with numerical derivatives. During estimation, stationarity is imposed at each step by rejecting parameter values which have:

In addition, the procedure can fail because:

13.3.4 Modified profile likelihood (MPL)

The modified profile log-likelihood, lM, for the regression model with stationary arfima-errors and f(X,β)=:

lM(d,φ,θ) = -T/2 (1 + log 2π) - (1/2 - 1/T) log|R| - (T-k-2)/2 log[T-1ẑ'R-1ẑ] - 1/2 log|X'R-1X|,

see An and Bloomfield (1993), who applied the idea of Cox and Reid (1987) to reduce the bias of the EML estimator due to the presence of unknown nuisance parameters of the regressors.

The residual variance estimator now uses T-k, so that it is unbiased when p=q=d=0:

σ̃ε2 = (T-k)-1ẑ'R-1ẑ.

13.3.5 Non-linear least squares (NLS)

Defining et as the residuals from applying the arfima(p,d,q) filter to yt - μt, the residual variance is:

σ̂ε2 = T-1t=1T et2.
NLS simply maximizes

f( d,φ,θ,β) =- 1/2 log ( 1/T ∑t=1Tet2).

The arfima filter is computed using the Ox function diffpow, see (eq:13.7), followed by arma0. Since (eq:13.7) essentially drops the first observation, e1=0 when d is estimated.

Function (eq:13.16) is maximized using BFGS with numerical derivatives, optionally with stationarity imposed.
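The NLS residual filter can be sketched for the arfima(1,d,1) case: naive fractional differencing as in (eq:13.7), followed by the ARMA recursion et = ut - φ1ut-1 - θ1et-1 with pre-sample values set to zero (an illustrative stand-in for the diffpow/arma0 pair, not the package's own code):

```python
def arfima_filter(z, d, phi, theta):
    """Residuals from an arfima(1,d,1) filter applied to z.

    Step 1: naive fractional differencing, u_t = sum_j delta_j z_{t-j},
    with delta_0 = 1, delta_j = delta_{j-1}(j-1-d)/j, truncated at t.
    Step 2: ARMA recursion e_t = u_t - phi*u_{t-1} - theta*e_{t-1},
    with pre-sample u and e set to zero."""
    T = len(z)
    delta = [1.0]
    for j in range(1, T):
        delta.append(delta[-1] * (j - 1 - d) / j)
    u = [sum(delta[j] * z[t - j] for j in range(t + 1)) for t in range(T)]
    e = []
    for t in range(T):
        up = u[t - 1] if t > 0 else 0.0
        ep = e[-1] if e else 0.0
        e.append(u[t] - phi * up - theta * ep)
    return e
```

With d=φ1=θ1=0 the filter returns the data unchanged; with d=1 it returns first differences (keeping the first observation in levels, consistent with the remark above that the first observation is essentially dropped).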

13.3.6 Variance-covariance matrix estimates

Let θ'= [d φ' θ']. The variance-covariance matrix for the EML (l=lP) and MPL (l=lM) estimates is computed as:

(
(-∂2l(θ)/∂θ∂θ'|θ̂)-1    0
0    σ̂ε2(X'R-1X)-1
).

The second derivative of l is computed numerically.

For NLS, the variance-covariance is the inverse of minus the numerical second derivative of (eq:13.16).

13.4 Estimation output

Estimation output consists of

13.5 Estimation options

13.5.1 Sample mean versus known mean

Early Monte Carlo experiments found that, in smaller samples, using the theoretical mean could lead to more accurate estimation of d (see e.g. Cheung and Diebold, 1994). This can be seen as the most effective way to reduce the effect of a nuisance parameter on inference about the parameter of interest. Therefore, the Arfima package allows for fixing the mean at a specified value. Let yt denote the original dependent variable, and μy the known mean.

The z used in §13.3.3 for estimation when specifying a known mean is:

zt = yt - μy,

otherwise the package uses:

zt = yt - μ̂y,  where μ̂y is the sample mean of yt.
The specification of the mean affects the likelihood. For the last term in the log-likelihood:

(y-μ)'R-1(y-μ) = y'R-1y - 2μ'R-1y + μ'R-1μ,

so the known mean case adds μyι'R-1y, whereas the second case adds μ̂yι'R-1y, and different results must be expected.
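A tiny numerical check makes the difference concrete. For simplicity the sketch below takes R = I, so R-1 drops out of the quadratic form; the data values and the known mean are arbitrary illustrations.

```python
# Quadratic form (y - mu)'R^{-1}(y - mu) with R = I, comparing a known
# mean against the sample mean. All values are illustrative only.
y = [1.0, 2.0, 3.0, 6.0]
mu_known = 2.0                    # hypothetical known mean mu_y
mu_hat = sum(y) / len(y)          # sample mean, here 3.0

qf_known = sum((yt - mu_known) ** 2 for yt in y)    # -> 18.0
qf_sample = sum((yt - mu_hat) ** 2 for yt in y)     # -> 14.0
```

Since the sample mean minimizes the (here unweighted) sum of squares, qf_sample never exceeds qf_known, with equality only when the known mean coincides with the sample mean; hence the two likelihoods differ.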

13.5.2 Fixing parameters

It is possible to fix d at a specific value, or drop arma terms.

13.5.3 Weighted estimation

A weight variable w, wt ≥ 0, can be used in estimation. Write w̃t = wt/w̄>0, where w̄>0 denotes the mean of the positive weights.

Then (eq:13.12) for EML becomes:

- 1/2 {T-1 log|R| - T-1 ∑t: wt>0 log w̃t + log[T-1 ∑t=1T w̃t et2]},

The NLS function is adjusted in a similar fashion. Weighted estimation is not available for MPL, and weights are ignored for forecasting. The weighted residuals, êt w̃t1/2, are used in the residual-based diagnostics.

13.5.4 Z variables

With both additive normal regressors xt and innovation Z variables zt the arfima model becomes:

Φ(L)(1-L)d (yt - xt'β) = Θ(L)(εt + zt'γ).

The notation for the Z variables in this subsection should not be confused with zt, the demeaned yt. After applying the normal EML or NLS filter to zt, zt'γ̂ is subtracted at each iteration.

This model has the familiar ADL (Autoregressive Distributed-lag model) as a special case, since zt can contain different lags of the same (exogenous) variable. Whereas additive outliers (and missing observations) can be estimated using dummies for the X variables, see e.g. Brockwell and Davis (1993, §12.3), innovation outliers can be modelled by dummies for Z variables. Note that adding a single observation dummy for a Z variable has the same effect as giving that observation zero weight in the W variable. This is illustrated in fracest8.ox.

Z variables are not available for MPL.

13.6 Forecasting

Two methods of forecasting are supported, based on the results in Beran (1994, §8.7). As before, let z = (z1,...,zT)' denote the observations over the estimation period. Assume zt is stationary and d > -1. The best linear prediction of zT+h is

ẑT+h = [r(T-1+h) ... r(h)] {T[r(0),...,r(T-1)]}-1 z = q'z,

where T[r(0),...,r(T-1)] denotes the Toeplitz matrix with first column r(0),...,r(T-1),

which consists of the reversed ACovF starting from h, times the original data weighted by their correlations. The solvetoeplitz function is used to solve Tx=z in order to keep storage requirements of order T, see Doornik and Ooms (1999). The mean square error is

MSE(ẑT+h) = σ̂ε2 (r(0) - r'q),  where r denotes the vector [r(T-1+h) ... r(h)]'.
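As an illustration, the prediction weights q can be obtained by solving the Toeplitz system directly. The sketch below is dense linear algebra with illustrative function names (the package itself uses solvetoeplitz to keep storage of order T); the autocorrelation recursion is the standard one for ARFIMA(0,d,0).

```python
import numpy as np

def acf_fracnoise(d, n):
    """Autocorrelations r(0),...,r(n-1) of ARFIMA(0,d,0):
    r(0) = 1, r(k) = r(k-1)(k-1+d)/(k-d)."""
    r = [1.0]
    for k in range(1, n):
        r.append(r[-1] * (k - 1 + d) / (k - d))
    return r

def blp(z, r, h):
    """Best linear prediction q'z of z_{T+h}, and the MSE factor r(0) - r'q.

    Dense sketch: forms the T x T Toeplitz matrix explicitly; r must
    contain at least T+h autocorrelations."""
    T = len(z)
    R = np.array([[r[abs(i - j)] for j in range(T)] for i in range(T)])
    c = np.array([r[T - 1 + h - i] for i in range(T)])   # [r(T-1+h),...,r(h)]
    q = np.linalg.solve(R, c)
    return q @ np.asarray(z), r[0] - c @ q
```

For d = 0 (white noise) all prediction weights are zero and the MSE factor equals r(0) = 1, as expected.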

In the presence of a mean-function μt = f(xt,β) the forecasts are:

ŷT+h = q'(y - μ̂) + μ̂T+h,  μ̂T+h = xT+h'β̂.

The program computes all requested forecasts ẑh = (ẑT+1,...,ẑT+h)' and their joint variance-covariance matrix, Cov(ẑh), simultaneously. Cov(ẑh) is also used to derive the mean squared errors for partial sums, ∑i=1h ẑT+i, integrated partial sums, etc.
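For instance, by linearity the MSE of the plain partial sum is a quadratic form in the joint forecast-error covariance matrix; with ι = (1,...,1)':

```latex
\mathrm{MSE}\Bigl(\textstyle\sum_{i=1}^{h}\hat z_{T+i}\Bigr)
  = \iota'\,\mathrm{Cov}(\hat z_h)\,\iota ,
\qquad \iota = (1,\ldots,1)' ,
```

and integrated partial sums follow analogously with the corresponding weight vector in place of ι.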

`Naive' forecasts are derived from the autoregressive representation of the process, truncated at T+h:

Θ-1(L) Φ(L) (1-L)d zt = (1 - b1L - ... - bT+h-1LT+h-1) zt = B(L) zt.

In this case the zt need not be stationary, cf. Beran (1995), but d > -0.5 is required. The first T coefficients of the (1-L)d polynomial can be computed using the diffpow function when the input is a one followed by T-1 zeroes; this follows from (eq:13.7). For polynomial multiplication and division, polymul and polydiv are used. The naive forecasts are computed recursively:

ẑT+h* = [bT+h-1 ... b1] × [z'  ẑT+1 ... ẑT+h-1]',

with mean square error

MSE(ẑT+h*) = σ̂ε2 (1 + ∑i=1h-1 ai2),  (eq:13.17)

where ai are the coefficients of B-1( L) .
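The construction of the B(L) coefficients from Φ(L), Θ(L), and (1-L)d can be sketched with elementary polynomial arithmetic. The Python routines below are illustrative analogues of the Ox polymul and polydiv functions, and assume the conventions Φ(L) = 1 - φ1L - ... and Θ(L) = 1 + θ1L + ... (the printed text's sign conventions may differ).

```python
def polymul(a, b, n):
    """Coefficients of a(L)b(L), truncated at order n-1 (cf. Ox polymul)."""
    c = [0.0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j < n:
                c[i + j] += ai * bj
    return c

def polydiv(a, b, n):
    """Coefficients of a(L)/b(L) by long division, b[0] != 0 (cf. Ox polydiv)."""
    a = (list(a) + [0.0] * n)[:n]      # work on a padded copy
    q = [0.0] * n
    for i in range(n):
        q[i] = a[i] / b[0]
        for j in range(1, min(len(b), n - i)):
            a[i + j] -= q[i] * b[j]
    return q

def fracdiff_coefs(d, n):
    """First n coefficients of (1-L)^d: c_0 = 1, c_j = c_{j-1}(j-1-d)/j."""
    c = [1.0]
    for j in range(1, n):
        c.append(c[-1] * (j - 1 - d) / j)
    return c

def ar_coefs(d, phi, theta, n):
    """Truncated coefficients of B(L) = Theta^{-1}(L) Phi(L) (1-L)^d."""
    num = polymul([1.0] + [-p for p in phi], fracdiff_coefs(d, n), n)
    return polydiv(num, [1.0] + list(theta), n)
```

For an AR(1) with φ1 = 0.5 and d = 0 this returns coefficients [1, -0.5, 0, ...], i.e. b1 = 0.5 in the notation of the text.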

Level forecasts are computed by adding the (integrated) partial sums of forecasts to a specified starting level. The reported MSE of the integrated naive forecasts can be obtained directly from (eq:13.17).

Forecasting with Z variables is not yet implemented.