
Gauss-Markov Assumptions, Full Ideal Conditions of OLS
The full ideal conditions consist of a collection of assumptions about the true
regression model and the data generating process and can be thought of
as a description of an ideal data set. The ideal conditions have to be met in
order for OLS to be a good estimator (BLUE: unbiased and efficient).
Most real data do not satisfy these conditions, since they are not generated by
an ideal experiment. However, the linear regression model under full ideal
conditions can be thought of as being the benchmark case with which
other models assuming a more realistic DGP should be compared.
One has to be aware of ideal conditions and their violation to be able to control
for deviations from these conditions and render results unbiased or at least
consistent:
1. Linearity in parameters alpha and beta: the DV is a linear function of a set
of IVs and a random error component.
→ Problems: non-linearity, wrong determinants, wrong estimates; a relationship
that is actually there cannot be detected with a linear model.
2. The expected value of the error term is zero for all observations:
$$E(\varepsilon_i) = 0$$
→ Problem: the intercept is biased.
3. Homoskedasticity: the conditional variance of the error term is
constant in all x and over time. The error variance is a measure of
model uncertainty; homoskedasticity implies that the model
uncertainty is identical across observations:
$$V(\varepsilon_i) = E(\varepsilon_i^2) = \sigma^2 = \text{constant}$$
→ Problem: heteroskedasticity – the variance of the error term differs
across observations, i.e. model uncertainty varies from observation
to observation – often a problem in cross-sectional data; omitted
variables bias.
4. The error term is independently distributed and not correlated; no
correlation between observations of the DV:
$$Cov(\varepsilon_i, \varepsilon_j) = E(\varepsilon_i \varepsilon_j) = 0, \quad i \neq j$$
→ Problem: spatial correlation (panel and cross-sectional data), serial
correlation/autocorrelation (panel and time-series data).
5. X_i is deterministic: x is uncorrelated with the error term since X_i is
deterministic:
$$Cov(X_i, \varepsilon_i) = E(X_i \varepsilon_i) - E(X_i)E(\varepsilon_i) = X_i E(\varepsilon_i) - X_i E(\varepsilon_i) = 0 \quad \text{since } X_i \text{ is deterministic}$$
→ Problems: omitted variable bias, endogeneity and simultaneity.
6. Other problems: measurement errors, multicollinearity.

If all Gauss-Markov assumptions are met, then the OLS estimators alpha
and beta are BLUE – best linear unbiased estimators:
best: the variance of the OLS estimator is minimal, smaller than the
variance of any other linear unbiased estimator.
linear: if the relationship is not linear, OLS is not applicable.
unbiased: the expected values of the estimated beta and alpha
equal the true values describing the relationship between x and y.
Inference
Is it possible to generalize the regression results for the sample under
observation to the universe of cases (the population)?
Can you draw conclusions for individuals, countries, time-points beyond
those observations in your data-set?
• Significance tests are designed to answer exactly these questions.
• If a coefficient is significant (p-value<0.10, 0.05, 0.01) then you can
draw conclusions for observations beyond the sample under
observation.
But…
• Only if the sample matches the characteristics of the population.
• This is normally the case if all (Gauss-Markov) assumptions of OLS
regressions are met by the data under observation.
• If this is not the case, the standard errors of the coefficients might be
biased, and therefore the result of the significance test might be
wrong as well, leading to false conclusions.
Significance test: the t-test
$$\sigma^2_{\hat\beta} = \frac{\sigma^2}{N \cdot Var(X)}$$
The t-test:
• t-test for significance: testing the H0 (null hypothesis) that beta
equals zero: H0: beta = 0; HA: beta ≠ 0.
• The test statistic follows a Student t distribution under the null:
$$t = \frac{\hat\beta - r}{SE(\hat\beta)} \sim t_{n-2}, \qquad SE(\hat\beta) = \sqrt{\frac{SSR/(n-2)}{N \cdot Var(X)}}$$
For H0: beta = 0 (r = 0) this reduces to:
$$t = \frac{\hat\beta}{SE(\hat\beta)} \sim t_{n-2}$$

• t_crit is the critical value of a t-distribution for a specific number of
observations and a specific level of significance: the convention in
statistics is a significance level of 5% (2.5% on each side of the t-distribution
for a 2-sided test). A coefficient is significant at this level if its p-value is below 0.05.
Assume beta is 1 and the estimated standard error is 0.8
The critical value of the two-sided symmetric student t-distribution for n=∞ and
alpha=5% is 1.96
Acceptance at the 5% level:
$$t = \frac{\hat\beta - 0}{SE(\hat\beta)}$$
The null (no significant relationship) will not be rejected if $-1.96 \le t \le 1.96$.
This condition can be expressed in terms of beta by substituting for t:
$$-1.96 \le \frac{\hat\beta - 0}{SE(\hat\beta)} \le 1.96$$
Multiplying through by the SE of beta:
$$-1.96 \cdot SE(\hat\beta) \le \hat\beta - 0 \le 1.96 \cdot SE(\hat\beta)$$
Then:
$$0 - 1.96 \cdot SE(\hat\beta) \le \hat\beta \le 0 + 1.96 \cdot SE(\hat\beta)$$
Substituting beta = 1 and SE(beta) = 0.8:
$$0 - 1.96 \cdot 0.8 \le 1 \le 0 + 1.96 \cdot 0.8$$
$$-1.568 \le 1 \le 1.568$$

Since this inequality holds, the null hypothesis is not rejected. Thus we
cannot reject that there is no relationship between x and y, i.e. that beta
equals zero.
[Figure: probability density function of beta with the acceptance region from -1.568 to 1.568 around 0; 2.5% in each tail; the estimate beta = 1 lies inside the acceptance region.]


Now assume that the standard error of beta is 0.4 instead of 0.8, we
get:
0  1.96*0.4  1  0  1.96*0.4
0.784  1  0.784

This inequality does not hold; therefore we reject the null hypothesis that
beta equals zero and decide in favour of the alternative hypothesis
that there is a significant positive relationship between x and y.
[Figure: probability density function of beta with the acceptance region from -0.784 to 0.784 around 0; 2.5% in each tail; the estimate beta = 1 lies outside the acceptance region.]


Significance test – rule of thumb:
If the regression-coefficient (beta) is at least
twice as large as the corresponding
standard error of beta the result is
statistically significant at the 5% level.
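A minimal Stata sketch of this rule of thumb (the auto data and variable names are only illustrative): compute t = beta/SE by hand and compare it with the t reported by the regression.
sysuse auto, clear
regress price weight
display "t = " _b[weight]/_se[weight]    // |t| larger than about 2: significant at the 5% level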
Power of a test
For a given test statistic and a critical region of a given significance
level, we define the probability of rejecting a false null hypothesis as
the power of a test.
The power would be optimal if the probability of rejecting the null
were 1 when there is a relationship and 0 otherwise.
This is, however, not the case in reality. There is always a
positive probability of drawing the wrong conclusions from the
results:

• One can reject the null hypothesis even though it is true (Type I error, alpha error):
alpha = Pr[Type I error] = Pr[rejecting H0 | H0 is true]
• Or not reject the null hypothesis even though it is wrong (Type II error, beta error):
beta = Pr[Type II error] = Pr[accepting H0 | Ha is true]
Type I and Type II errors
Alpha and beta errors: an example:
A criminal trial: taking as the null hypothesis that the defendant is
innocent, a Type I error occurs when the jury wrongly decides that the
defendant is guilty. A Type II error occurs when the jury wrongly
acquits the defendant.

In significance tests:
H0: beta is insignificant (= 0):
Type I error: wrongly rejecting the null hypothesis.
Type II error: wrongly accepting the null that the coefficient is zero.
The choice of the significance level increases or decreases the probabilities of
Type I and Type II errors: the smaller the significance level (5%, 1%), the lower
the probability of a Type I error and the higher the probability of a Type II error.
Confidence Intervals
Significance tests assume that hypotheses come before the test: beta ≠ 0.
However, the significance test leaves us with some vacuum, since we know
that beta is different from zero, but because we have a probabilistic theory
we are not sure what the exact value should be.

Confidence intervals give us a range of numbers that are plausible and are
compatible with the hypothesis.

As for the significance test, the researcher has to choose the level of
confidence (95% is the convention).

Using the same example again: the estimated beta is 1 and SE(beta) is 0.4;
the critical values of the two-sided t-distribution are 1.96 and -1.96.
Calculation of the confidence interval:
The question is how far can a hypothetical value differ from the
estimated result before they become incompatible with the
estimated value?
The regression coefficient b and the hypothetical value beta are
incompatible if either
$$\frac{b - \beta}{SE(b)} > t_{crit} \quad \text{or} \quad \frac{b - \beta}{SE(b)} < -t_{crit}$$
That is, beta is compatible with b if it satisfies the double inequality:
$$b - SE(b) \cdot t_{crit} \le \beta \le b + SE(b) \cdot t_{crit}$$
Any hypothetical value of beta that satisfies this inequality will therefore
automatically be compatible with the estimate b, that is will not be
rejected. The set of all such values, given by the interval between
the lower and upper limits of the inequality, is known as the
confidence interval for b. The centre of the confidence interval is the
estimated b.
If the 5% significance level is adopted, the corresponding confidence
interval is known as the 95% confidence interval (a 1% level corresponds
to a 99% interval). Since the critical value of the t distribution is greater
for the 1% level than for the 5% level, for any given number of degrees of
freedom, it follows that the 99% interval is wider than the 95% interval and
encompasses all the hypothetical values of beta in the 95% confidence
interval plus some more on either side.
Example: b=1, se(b)=0.4, 95% confidence interval, t_critical=1.96,
-1.96:

1-0.4*1.96 ≤ beta ≤ 1+0.4*1.96

95% confidence interval:

0.216 ≤ beta ≤ 1.784

Thus, all values between 0.216 and 1.784 are theoretically possible
and would not be rejected. They are compatible with the estimated b = 1,
which is the central value of the confidence interval.
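A minimal Stata sketch of the same calculation (the auto data and variable names are only illustrative; -regress- also reports this interval directly):
sysuse auto, clear
regress price weight
display "lower bound: " _b[weight] - invttail(e(df_r), 0.025)*_se[weight]
display "upper bound: " _b[weight] + invttail(e(df_r), 0.025)*_se[weight]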
Interpretation of regression results:
reg y x

Source | SS df MS Number of obs = 100


-------------+---------------------------------------------- F( 1, 98) = 89.78
Model | 1248.96129 1 1248.96129 Prob > F = 0.0000
Residual | 1363.2539 98 13.9107541 R-squared = 0.4781
-------------+---------------------------------------------- Adj R-squared = 0.4728
Total | 2612.21519 99 26.386012 Root MSE = 3.7297

----------------------------------------------------------------------------------------------------
y| Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+-------------------------------------------------------------------------------------
x | 1.941914 .2049419 9.48 0.000 1.535213 2.348614
_cons | .8609647 .4127188 2.09 0.040 .0419377 1.679992
----------------------------------------------------------------------------------------------------

Degrees of freedom: number of observations minus number of estimated
parameters, in this case alpha and beta: 100 - 2 = 98. If we had 2 explanatory
variables the number of degrees of freedom would decrease to 97, with 3 to 96,
etc.
The concept of DoF implies that you cannot have more explanatory variables
than observations!
Definitions
Total Sum of Squares (SST):
$$SST = \sum_{i=1}^{n} (y_i - \bar y)^2$$
Explained (Estimation) Sum of Squares (SSE):
$$SSE = \sum_{i=1}^{n} (\hat y_i - \bar y)^2$$
Residual Sum of Squares or Sum of Squared Residuals (SSR):
$$SSR = \sum_{i=1}^{n} \hat\varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \hat\alpha - \hat\beta x_i)^2$$
Goodness of Fit
How well does the explanatory variable explain the dependent
variable?
How well does the regression line fit the data?

The R-squared (coefficient of determination) measures how much variation of
the dependent variable can be explained by the explanatory variables.

The R² is the ratio of the explained variation compared to the total
variation: it is interpreted as the fraction of the sample variation in y
that is explained by x.

Explained variation of y / total variation of y:
$$R^2 = \frac{\sum_{i=1}^{n} (\hat y_i - \bar y)^2}{\sum_{i=1}^{n} (y_i - \bar y)^2} = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}$$
Properties of R²:

• 0 ≤ R² ≤ 1; often the R² is multiplied by 100 to get the percentage of
the sample variation in y that is explained by x
• If the data points all lie on the same line, OLS provides a perfect fit
to the data. In this case the R² equals 1 or 100%.
• A value of R² that is nearly equal to zero indicates a poor fit of the
OLS line: very little of the variation in the y is captured by the
variation in the y_hat (which all lie on the regression line)
• R²=(corr(y,yhat))²
• The R² follows a complex distribution which depends on the
explanatory variable
• Adding further explanatory variables leads to an increase in the R²
• The R² can have a reasonable size in spurious regressions if the
regressors are non-stationary
• Linear transformations of the regression model change the value of
the R² coefficient
• The R² is not bounded between 0 and 1 in models without intercept
Properties of an Estimator
1. Finite Sample Properties

There is often more than one possible estimator for a relationship between
x and y (e.g. OLS or Maximum Likelihood).
How do we choose between two estimators? The two most commonly used
selection criteria are bias and efficiency.
Bias and efficiency are finite sample properties, because they describe
how an estimator behaves when we only have a finite sample (even
though the sample might be large)
In comparison so called “asymptotic properties” of an estimator have
to do with the behaviour of estimators as the sample size grows
without bound
Since we always deal with finite samples and it is hard to say whether
asymptotic properties translate to finite samples, examining the
behaviour of estimators in finite samples seems to be more
important.
Unbiasedness
Unbiasedness: the estimated coefficient is on average true. That is, in repeated
samples of size n the mean outcome of the estimate equals the true – but
unknown – value of the parameter to be estimated:
$$E(\hat\beta - \beta) = 0 \quad \Leftrightarrow \quad E(\hat\beta) = \beta$$
If an estimator is unbiased, then its probability distribution has an expected
value equal to the parameter it is supposed to be estimating. Unbiasedness
does not mean that the estimate we get with any particular sample is equal
to the true parameter or even close. Rather, the mean of all estimates from
infinitely many random samples equals the true parameter.
[Figure: density of the estimates b1 across repeated samples, centred on the true parameter value.]
Sampling Variance of an estimator
Efficiency is a relative measure between two estimators; it refers to the
sampling variance of an estimator, V(beta).
Let $\hat\beta$ and $\tilde\beta$ be two unbiased estimators of the true parameter $\beta$,
with variances $V(\hat\beta)$ and $V(\tilde\beta)$. Then $\hat\beta$ is said to be relatively
more efficient than $\tilde\beta$ if $V(\hat\beta)$ is smaller than $V(\tilde\beta)$.
The property of relative efficiency only helps us to rank two unbiased
estimators.
[Figure: sampling distributions (densities) of two unbiased estimators b1 and b2; b2 has the smaller variance and is therefore relatively more efficient.]
Trade-off between Bias and Efficiency
With real world data and the related problems we sometimes have only
the choice between a biased but efficient and an unbiased but
inefficient estimator. Then another criterion can be used to choose
between the two estimators, the root mean squared error (RMSE).
The RMSE is a combination of bias and efficiency and gives us a
measure of overall performance of an estimator.
RMSE:
$$RMSE(\hat\beta) = \sqrt{\frac{1}{K} \sum_{k=1}^{K} \big(\hat\beta_k - \beta_{true}\big)^2}$$
$$MSE(\hat\beta) = E\big[(\hat\beta - \beta_{true})^2\big]$$
$$MSE(\hat\beta) = Var(\hat\beta) + Bias(\hat\beta, \beta_{true})^2$$
K measures the number of experiments, trials or simulations.
[Figure: densities of the estimated coefficient of z3 from a simulation comparing the xtfevd estimator with fixed effects.]
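A hypothetical Stata sketch of such a simulation exercise (everything here – the data generating process, the true beta of 1, the 100 observations and 500 replications – is an assumption chosen only for illustration):
capture program drop mcsim
program define mcsim, rclass
    drop _all
    set obs 100
    gen x = rnormal()
    gen y = 1*x + rnormal()        // true beta = 1
    regress y x
    return scalar b = _b[x]
end
set seed 12345
simulate b = r(b), reps(500): mcsim
gen sqerr = (b - 1)^2              // squared deviation from the true value
summarize sqerr
display "RMSE = " sqrt(r(mean))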
Asymptotic Properties of Estimators
We can rule out certain silly estimators by studying the asymptotic or
large sample properties of estimators.
We can say something about estimators that are biased and whose
variances are not easily found.
Asymptotic analysis involves approximating the features of the sampling
distribution of an estimator.
Consistency: how far is the estimator likely to be from the true
parameter as we let the sample size increase indefinitely.
If N→∞ the estimated beta equals the true beta:
$$\lim_{n \to \infty} \Pr\big(|\hat\beta_n - \beta| > \delta\big) = 0, \qquad \text{plim}\, \hat\beta_n = \beta,$$
$$\lim_{n \to \infty} E(\hat\beta_n) = \beta$$
Unlike unbiasedness, consistency requires that the distribution of the
estimator collapses around the true value as N approaches infinity.
Thus unbiased estimators are not necessarily consistent, but those
whose variances shrink to zero as the sample size grows are
consistent.
Multiple Regressions
• In most cases the dependent variable y is not just a
function of a single explanatory variable but a
combination of several explanatory variables.
• THUS: drawback of binary regression: impossible to draw
ceteris paribus conclusions about how x affects y
(omitted variable bias).
• Models with k-independent variables:
$$y_i = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_k x_{ik} + \varepsilon_i$$

• Controls for omitted variable bias.
• But: increases the inefficiency of the regression, since explanatory
variables might be collinear.
Obtaining OLS Estimates in Multiple
Regressions
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i$$
$$\hat\beta_1 = \frac{\sum_{i=1}^{n}(x_{i2}-\bar x_2)^2 \sum_{i=1}^{n}(x_{i1}-\bar x_1)(y_i-\bar y) \;-\; \sum_{i=1}^{n}(x_{i1}-\bar x_1)(x_{i2}-\bar x_2) \sum_{i=1}^{n}(x_{i2}-\bar x_2)(y_i-\bar y)}{\sum_{i=1}^{n}(x_{i1}-\bar x_1)^2 \sum_{i=1}^{n}(x_{i2}-\bar x_2)^2 \;-\; \left(\sum_{i=1}^{n}(x_{i1}-\bar x_1)(x_{i2}-\bar x_2)\right)^2}$$
In matrix notation:
$$\hat\beta = (X'X)^{-1} X'y$$

The intercept is the predicted value of y when all explanatory variables
equal zero.
The estimated betas have partial effect or ceteris paribus interpretations:
we obtain the predicted change in y given the changes in each x. When x_2
is held fixed, beta_1 gives the change in y if x_1 changes by one unit.
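A minimal sketch of the matrix formula in Stata's Mata (the auto data and variable names are only illustrative); the result should match the -regress- coefficients:
sysuse auto, clear
regress price mpg weight
mata:
y = st_data(., "price")
X = (st_data(., ("mpg", "weight")), J(rows(y), 1, 1))   // regressors plus a constant
invsym(X'*X)*(X'*y)                                     // OLS coefficient vector
end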
“Holding Other Factors Fixed”

• The power of multiple regression analysis is that it provides a ceteris
paribus interpretation even though the data have not been collected in a
ceteris paribus fashion.
• Example: multiple regression coefficients tell us what effect an additional
year of education has on personal income if we hold social background,
intelligence, sex, number of children, marital status and all other factors
constant that also influence personal income.
Standard Error and Significance in Multiple Regressions
$$Var(\hat\beta_1) = \frac{\hat\sigma^2}{SST_1 (1 - R_1^2)}, \qquad \hat\sigma^2 = \frac{1}{n-(k+1)} \sum_{i=1}^{n} \hat\varepsilon_i^2$$
$$SST_1 = \sum_{i=1}^{n} (x_{i1} - \bar x_1)^2$$
$R_1^2$ is the R² of the regression of $x_{i1}$ on $x_{i2}$:
$$R_1^2 = \frac{SSE}{SST} = \frac{\sum_{i=1}^{n} (\hat x_{i1} - \bar x_1)^2}{\sum_{i=1}^{n} (x_{i1} - \bar x_1)^2}$$
$$SD(\hat\beta_1) = SE(\hat\beta_1) = \frac{\hat\sigma}{\sqrt{SST_1 (1 - R_1^2)}}$$
F – Test: Testing Multiple Linear Restrictions
• t-test (as significance test) is associated with any OLS coefficient.
• We also want to test multiple hypotheses about the underlying
parameters beta_0…beta_k.
• The F-test, tests multiple restriction: e.g. all coefficients jointly equal
zero:
H0: beta_0=beta_1=…=beta_k=0
Ha: H0 is not true, thus at least one beta differs from zero
• The F-statistic (or F-ratio) is defined as:
$$F = \frac{(SSR_r - SSR_{ur})/q}{SSR_{ur}/(n-k-1)}$$
• The F-statistic is F-distributed under the null hypothesis:
$$F \sim F_{q,\; n-k-1}$$

• F-test for overall significance of a regression: H0: all coefficients are
jointly zero – in this case we can also compute the F-statistic by using
the R² of the regression:
$$F = \frac{R^2/k}{(1-R^2)/(n-k-1)}$$
• SSR_r: Sum of Squared Residuals of the restricted model (constant
only)
• SSR_ur: Sum of Squared Residuals of the unrestricted model (all
regressors)
• SSR_r can never be smaller than SSR_ur → F is always non-negative
• k – number of explanatory variables (regressors), n – number of
observations, q – number of exclusion restrictions (q of the variables
have zero coefficients): q = df_r – df_ur (difference in degrees of
freedom between the restricted and unrestricted models; df_r >
df_ur)
• The F-test is a one sided test, since the F-statistic is always non-
negative
• We reject the Null at a given significance level if F>F_critical for this
significance level.
• If H0 is rejected, then we say that all explanatory variables are jointly
statistically significant at the chosen significance level (see the sketch
below for a joint test in Stata).
• THUS: the F-test only allows us not to reject H0 if all t-tests for the
single variables are insignificant too.
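A minimal Stata sketch (the data and variables are only illustrative): -test- after -regress- reports exactly this kind of joint F-statistic.
sysuse auto, clear
regress price mpg weight foreign
test mpg weight            // H0: _b[mpg] = _b[weight] = 0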
Goodness of Fit in multiple Regressions:

As with simple binary regressions we can define SST, SSE and SSR.
And we can calculate the R² in the same way.
BUT: R² never decreases but tends to increase with the number of
explanatory variables.
THUS, R² is a poor tool for deciding whether one variable or several
variables should be added to a model.
We want to know whether a variable has a nonzero partial effect on y in
the population.
Adjusted R²: takes the number of explanatory variables into account,
since the R² increases with the number of regressors:
$$R^2_{adj} = 1 - \frac{n-1}{n-k-1} \left(1 - R^2\right)$$
k is the number of explanatory variables and n the number of
observations.
Comparing Coefficients
The size of the slope parameters depends on the scaling of the
variables (on which scale a variables is measured), e.g. population
in thousands or in millions etc.
To be able to compare the size effects of different explanatory
variables in a multiple regression we can use standardized
coefficients:
$$\hat b_j = \frac{\hat\sigma_{x_j}}{\hat\sigma_y} \hat\beta_j \quad \text{for } j = 1, \ldots, k$$
Standardized coefficients take the standard deviation of the dependent
and explanatory variables into account. So they describe how much
y changes if x changes by one standard deviation instead of one
unit. If x changes by 1 SD – y changes by b_hat SD. This makes the
scale of the regressors irrelevant and we can compare the
magnitude of the effects of different explanatory variables (the
variables with the largest standardized coefficient is most important
in explaining changes in the dependent variable).
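A minimal Stata sketch (illustrative data): the beta option of -regress- reports these standardized coefficients alongside the raw ones.
sysuse auto, clear
regress price mpg weight length, beta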
Problems in Multiple Regressions:
1. Multicollinearity
• Perfect multicollinearity leads to one of the variables being dropped: if x1
and x2 are perfectly correlated (correlation of 1), the statistical
program at hand does the job.

• The higher the correlation, the larger the sampling variance of the
coefficients, the less efficient the estimation and the higher the
probability of erratic point estimates. Multicollinearity can result in
numerically unstable estimates of the regression coefficients (small
changes in X can result in large changes in the estimated regression
coefficients).

• Trade-off between omitted variable bias and inefficiency due to
multicollinearity.
Testing for Multicollinearity
Correlation between explanatory variables: pairwise collinearity can be
determined from viewing a correlation matrix of the independent
variables. However, correlation matrices will not reveal higher-order
collinearity.

Variance Inflation Factor (VIF): measures the impact of collinearity among
the x in a regression model on the precision of estimation. The VIF detects
higher-order multicollinearity: one or more x is/are close to a linear
combination of the other x.
• Variance inflation factors are a scaled version of the multiple
correlation coefficient between variable j and the rest of the
independent variables. Specifically,
$$VIF_j = \frac{1}{1 - R_j^2}$$
where $R_j^2$ is the squared multiple correlation coefficient from regressing $x_j$ on the remaining regressors.
• Variance inflation factors are often given as the reciprocal of the
above formula; in this case they are referred to as tolerances.
• If $R_j$ equals zero (i.e., no correlation between $X_j$ and the remaining
independent variables), then $VIF_j$ equals 1. This is the minimum
value. Neter, Wasserman, and Kutner (1990) recommend looking at
the largest VIF value. A value greater than 10 is an indication of
potential multicollinearity problems (see the sketch below).
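A minimal Stata sketch (illustrative data): -estat vif- after -regress- reports the VIFs and tolerances described above.
sysuse auto, clear
regress price mpg weight length
estat vif          // VIF above about 10 flags potential multicollinearity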
Possible Solutions
• Reduce the overall error: include explanatory variables that are not
correlated with the other variables but explain the dependent variable.
• Drop variables which are highly multi-collinear.
• Increase the variance by increasing the number of observations.
• Increase the variance of the explanatory variables.
• If variables are conceptually similar, combine them into a single index,
e.g. by factor or principal component analysis.
2. Omitted Variable Bias
• The effect of omitted variables that ought to be included:
• Suppose the dependent variable y depends on two explanatory
variables:
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i$$
• But you are unaware of the importance of x2 and only include x1:
$$y_i = \beta_0 + \beta_1 x_{i1} + \varepsilon_i$$
• If x2 is omitted from the regression equation, x1 will have a
“double” effect on y (a direct effect and one mimicking x2)
• The mimicking effect depends on the ability of x1 to mimic x2 (the
correlation) and how much x2 would explain y
• Beta1 in the second equation is biased upwards if x1 and x2 are
positively correlated and biased downwards otherwise (see the bias
formula sketched below).
• Beta1 is only unbiased if x1 and x2 are unrelated (corr(x1,x2) = 0).
• However, including a variable that is unnecessary, because it does
not explain any variation in y, makes the regression inefficient and
the reliability of the point estimates decreases.
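The direction of this bias can be written compactly; a standard textbook sketch (not part of the original slide) for the two-variable case, where $\tilde\delta_1$ denotes the slope from regressing $x_2$ on $x_1$:
$$E(\tilde\beta_1) = \beta_1 + \beta_2 \tilde\delta_1, \qquad \tilde\delta_1 = \frac{Cov(x_1, x_2)}{Var(x_1)}$$
The sign of the bias is thus the sign of $\beta_2 \cdot corr(x_1, x_2)$.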
Testing for Omitted Variables
• Heteroskedasticity of the error term with respect to a specific
independent variable is a good indication of omitted variable bias:
• Plot the error term against all explanatory variables
• Ramsey RESET F-test for omitted variables in the whole model: tests for
wrong functional form (e.g. if an interaction term is omitted; a ready-made
version is sketched after this list):
– Regress Y on the X’s and keep the fitted value Y_hat ;
– Regress Y on the X’s, and Y_hat² and Y_hat³.
– Test the significance of the fitted value terms using an F test.
• Szroeter test for monotonic variance of the error term in the
explanatory variables
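A minimal Stata sketch (illustrative data): -estat ovtest- runs a Ramsey RESET test using powers of the fitted values, essentially the three steps listed above.
sysuse auto, clear
regress price mpg weight
estat ovtest          // H0: no omitted higher-order terms / no wrong functional form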
Solutions:
• Include variables that are theoretically important and have a high
probability of being correlated with one or more variables in the
model and explaining significant parts of the variance in the DV.
• Fixed unit effects for unobserved unit heterogeneity (time invariant
unmeasurable characteristics of e.g. countries – culture, institutions)
3. Heteroskedasticity
The variance of the error term is not constant in each observation but
dependent on unobserved effects, not controlling for this problem
violates one of the basic assumptions of linear regressions and
renders the estimation results inefficient.
Possible causes:
• Omitted variables, for example: spending might vary with the
economic size of a country, but size is not included in the model.
Test:
• plot the error term against all independent variables
• White test if the form of Heteroskedasticity is unknown
• Breusch-Pagan Lagrange Multiplier test if the form is known

Solutions:
• Robust Huber-White sandwich estimator (GLS)
• White Heteroskedasticity consistent VC estimate: manipulates the
variance-covariance matrix of the error term.
• More substantially: include omitted variables
• Dummies for groups of individuals or countries that are assumed to
behave more similar than others
Tests for Heteroskedasticity:
a. Breusch-Pagan LM test for a known form of heteroskedasticity, here
groupwise heteroskedasticity:
$$LM = \frac{T}{2} \sum_{i=1}^{n} \left( \frac{s_i^2}{s^2} - 1 \right)^2$$
$s_i^2$ = group-specific residual variance (based on the squared residuals of group i)
$s^2$ = variance of the pooled OLS residuals
H0: homoskedasticity; the statistic is ~ Chi² with n-1 degrees of freedom.
The LM test assumes normality of the residuals and is not appropriate if
this assumption is not met.

b. Likelihood Ratio Statistic
Residuals are computed using MLE (e.g. iterated FGLS; with OLS residuals
the test loses power):
$$-2\ln\lambda = NT \ln(\hat\sigma^2) - T \sum_{i=1}^{n} \ln(\hat\sigma_i^2) \;\sim\; \chi^2(df = n-1)$$
c. White test if the form of heteroskedasticity is unknown:
H0: $V(\varepsilon_i \mid x_i) = \sigma^2$
Ha: $V(\varepsilon_i \mid x_i) = \sigma_i^2$
1. Estimate the model under H0.
2. Compute the squared residuals $e_i^2$.
3. Use the squared residuals as the dependent variable of an auxiliary
regression; the right-hand side contains all regressors, their quadratic
forms and their interaction terms:
$$e_i^2 = \delta_0 + \delta_1 x_{i1} + \ldots + \delta_k x_{ik} + \delta_{k+1} x_{i1}^2 + \delta_{k+2} x_{i1} x_{i2} + \ldots + \delta_q x_{ik}^2 + \nu_i$$
4. Compute the White statistic from the R² of the auxiliary regression:
$$n R^2 \;\overset{a}{\sim}\; \chi^2(q)$$
5. Use a one-sided test and check whether n·R² is larger than the 95%
quantile of the Chi²-distribution (ready-made versions are sketched below).
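A minimal Stata sketch (illustrative data) of built-in versions of these tests:
sysuse auto, clear
regress price mpg weight
estat hettest             // Breusch-Pagan / Cook-Weisberg LM test
estat imtest, white       // White's general test (regressors, squares, cross-products)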
Robust White Heteroskedasticity Consistent Variance-Covariance Estimator:
Normal variance of beta:
$$\sigma^2_{\hat\beta} = \frac{\sigma^2}{N \cdot Var(X)}$$
Robust White VC matrix:
$$\hat V(\hat\beta) = (X'X)^{-1} X'\hat D X (X'X)^{-1}, \qquad \hat D = diag(e_i^2)$$
D is an n×n matrix with zero off-diagonal elements and the squared
residuals on the diagonal.
The normal variance-covariance matrix is weighted by the non-constant
error variance. Robust standard errors therefore tend to be larger.
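A minimal Stata sketch (illustrative data): the vce(robust) option computes this Huber-White sandwich estimator.
sysuse auto, clear
regress price mpg weight, vce(robust)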
Generalized Least Squares Approaches
• The structure of the variance covariance matrix Omega is used not just to
adjust the standard errors but also the estimated coefficient.
• GLS can be an econometric solution to many violations of the G-M conditions
(Autocorrelation, Heteroskedasticity, Spatial Correlation…), since the Omega
Matrix can be flexibly specified
• Since the Omega matrix is not known, it has to be estimated and GLS
becomes FGLS (Feasible Generalized Least Squares)
• All FGLS approaches are problematic if number of observations is limited –
very inefficient, since the Omega matrix has to be estimated
Beta: 1
 N
  N

    X i '  X i    X i '  yi  
1 1

 i 1   i 1 
1
 N
  N

  X i '  X i    X i '  yi 
ˆ  ˆ 1 ˆ 1

 i 1   i 1 

X 
Estimated covariance matrix: 1
1
' X
Omega matrix with heteroscedastic error structure and
contemporaneously correlated errors, but in principle
FGLS can handle all different correlation structures…:

 12  21 31   n1 
 
 12  22 32   n 2 
   13  23 32   n3 
 
      
  2n 3n   n2 
 1n
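A hypothetical Stata sketch using the built-in Grunfeld panel data (the choice of data, variables and error structure is purely illustrative): -xtgls- estimates FGLS with exactly this kind of Omega.
webuse grunfeld, clear
xtset company year
xtgls invest mvalue kstock, panels(correlated) corr(ar1)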
4. Autocorrelation
The observation of the residual in t1 is dependent on the observation
in t0: not controlling for autocorrelation violates one of the basic
assumptions of OLS and may bias the estimation of the beta coefficients.
Options:
• lagged dependent variable
• differencing the dependent variable
• differencing all variables
• Prais-Winsten transformation of the data
• HAC consistent VC matrix
$$\varepsilon_{it} = \rho \varepsilon_{i,t-1} + \nu_{it}$$
Tests:
• Durbin-Watson, Durbin's m, Breusch-Godfrey test
• Regress e on lag(e)
Autocorrelation
The error term in t1 is dependent on the error term in t0: not controlling
for autocorrelation violates one of the basic assumptions of OLS and
may bias the estimation of the beta coefficients:
$$\varepsilon_{it} = \rho \varepsilon_{i,t-1} + \nu_{it}$$
The residual of a regression model picks up the influences of those
variables affecting the DV that have not been included in the
regression equation. Thus, persistence in excluded variables is the
most frequent cause of autocorrelation.
Autocorrelation makes no prediction about a trend, though a trend
in the DV is often a sign of serial correlation.
Positive autocorrelation: rho is positive: it is more likely that a positive
value of the error term is followed by a positive one and a negative one
by a negative one.
Negative autocorrelation: rho is negative: it is more likely that a positive
value of the error term is followed by a negative one and vice versa.
DW test for first-order AC:
$$d = \frac{\sum_{t=2}^{T} (e_t - e_{t-1})^2}{\sum_{t=1}^{T} e_t^2}$$
• The regression must have an intercept.
• The explanatory variables have to be deterministic.
• Inclusion of an LDV biases the statistic towards 2.
The efficiency problem of serial correlation can be fixed by the Newey-West
HAC consistent VC matrix for heteroskedasticity of unknown form and AC of
order p. Problem: the VC matrix is consistent but the coefficients can still be
biased! (HAC is possible with "ivreg2" in Stata; see also the sketch below.)
$$\hat V_{NW}(\hat\beta) = T (X'X)^{-1} S^* (X'X)^{-1}$$
$$S^* = \frac{1}{T} \sum_{t=1}^{T} e_t^2 x_t x_t' + \frac{1}{T} \sum_{l=1}^{p} \sum_{t=l+1}^{T} \omega_l \, e_t e_{t-l} \left( x_t x_{t-l}' + x_{t-l} x_t' \right)$$
$$\omega_l = 1 - \frac{l}{p+1}$$
Or a simpler test:
• Estimate the model by OLS.
• Compute the residuals.
• Regress the residuals on all independent variables (including the LDV
if present) and the lagged residuals.
• If the coefficient on the lagged residual is significant (with the usual
t-test), we can reject the null of independent errors (see the sketch below).
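A minimal Stata sketch of this procedure (Grunfeld data and variables are only illustrative):
webuse grunfeld, clear
xtset company year
regress invest mvalue kstock
predict e, residuals
regress e L.e mvalue kstock      // a significant coefficient on L.e rejects independent errors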
Lagged Dependent Variable
$$y_{it} = \alpha + \beta_0 y_{i,t-1} + \beta_k x_{it} + \varepsilon_{it}$$
• The interpretation of the LDV as a measure of time-persistency is
misleading.
• The LDV captures an average dynamic effect; this can be shown by
Cochrane-Orcutt distributed lag models. Thus the LDV assumes that all
x-variables have a one-period lagged effect on y.
→ make sure the interpretation is correct – calculate the real effect of the
x-variables.
• Is an insignificant coefficient really insignificant if the coefficient of lagged
y is highly significant?
$$y_{it} = \alpha + \beta_0 y_{i,t-1} + \beta_1 x_{it} + \varepsilon_{it}$$
The long-run effect of x on y implied by this model is $\beta_1 / (1 - \beta_0)$.
First Difference models
• Differencing only the dependent variable – only if theory
predicts effects of levels on changes
• FD estimator assumes that the coefficient of the LDV is
exactly 1 – this is often not true
• Theory predicts effects of changes on changes
• Suggested remedy if time series is non-stationary (has a
single unit root), asymptotic analysis for T→ ∞.
• Consistent
$$y_{it} - y_{i,t-1} = \sum_{k=1}^{K} \beta_k \left( x_{k,it} - x_{k,i,t-1} \right) + \varepsilon_{it} - \varepsilon_{i,t-1}$$
$$\Delta y_{it} = \sum_{k=1}^{K} \beta_k \Delta x_{k,it} + \Delta\varepsilon_{it}$$
Prais-Winsten Transformation
• Models the serial correlation in the error term – the regression results
for the X variables are more straightforwardly interpretable:
$$y_{it} = x_{it}\beta + \varepsilon_{it} \quad \text{with} \quad \varepsilon_{it} = \rho \varepsilon_{i,t-1} + \nu_{it}$$
The $\nu_{it}$ are iid with $N(0, \sigma_\nu^2)$.
• The VC matrix of the error term is
$$E(\varepsilon\varepsilon') = \frac{\sigma_\nu^2}{1-\rho^2} \begin{pmatrix}
1 & \rho & \rho^2 & \cdots & \rho^{T-1} \\
\rho & 1 & \rho & \cdots & \rho^{T-2} \\
\rho^2 & \rho & 1 & \cdots & \rho^{T-3} \\
\vdots & & & \ddots & \vdots \\
\rho^{T-1} & \rho^{T-2} & \rho^{T-3} & \cdots & 1
\end{pmatrix}$$
• The matrix is stacked for N units. Diagonals are 1.
• Prais-Winsten is estimated by GLS. It is derived from the AR(1) model
for the error term. The first observation is preserved.
1. Estimation of a standard linear regression:
$$y_{it} = x_{it}\beta + \varepsilon_{it}$$
2. An estimate of the correlation in the residuals is then obtained by the
following auxiliary regression:
$$\hat\varepsilon_{it} = \rho \hat\varepsilon_{i,t-1} + \nu_{it}$$
3. A Cochrane-Orcutt transformation is applied for observations t = 2, …, n:
$$y_{it} - \rho y_{i,t-1} = \beta \left( x_{it} - \rho x_{i,t-1} \right) + \nu_{it}$$
4. And the transformation for t = 1 is as follows:
$$\sqrt{1-\rho^2}\, y_{1} = \beta \sqrt{1-\rho^2}\, x_{1} + \sqrt{1-\rho^2}\, \varepsilon_{1}$$
5. With iteration to convergence, the whole process is repeated until
the change in the estimate of rho is within a specified tolerance; the
new estimates are used to produce fitted values for y, and rho is
re-estimated by (a ready-made implementation is sketched below):
$$y_{it} - \hat y_{it} = \rho \left( y_{i,t-1} - \hat y_{i,t-1} \right) + \nu_{it}$$
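A minimal Stata sketch (Grunfeld data and a single series, purely illustrative): -prais- implements the Prais-Winsten procedure above, and its corc option gives plain Cochrane-Orcutt.
webuse grunfeld, clear
keep if company == 1
tsset year
prais invest mvalue kstock          // Prais-Winsten: keeps the first observation
prais invest mvalue kstock, corc    // Cochrane-Orcutt: drops the first observation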
Distributed Lag Models
• The simplest form is Cochrane-Orcutt – the dynamic structure of all
independent variables is captured by one parameter, either in the error
term or as an LDV.
• If the dynamics are that simple, an LDV or Prais-Winsten is fine – it
saves degrees of freedom.
• Problem: if theory predicts different lags for different right-hand side
variables, then a mis-specified model necessarily leads to bias.
• Test down – start with a relatively large number of lags for potential
candidates:
$$y_{it} = \beta_1 x_{it} + \beta_2 x_{i,t-1} + \beta_3 x_{i,t-2} + \ldots + \beta_{n+1} x_{i,t-n} + \varepsilon_{it}, \qquad n = 1, \ldots, t-1$$
Specification Issues in Multiple Regressions:
1. Non-Linearity
One or more explanatory variables have a non-linear effect on the
dependent variable: estimating a linear model would lead to wrong
and/or insignificant results. Thus, even though there exists a relationship
in the population between an explanatory variable and the dependent
variable, this relationship cannot be detected due to the strict linearity
assumption of OLS.

Test:
• Ramsey RESET F-test gives a first indication for the whole model
• In general, we can use acprplot to verify the linearity assumption
against an explanatory variable – though this is just “eye-balling”
• Theoretical expectations should guide the inclusion of squared
terms.
[Figures: augmented component-plus-residual plots (acprplot) against institutional openness to trade (standardized) and against level of democracy.]
Solutions
Handy solutions without leaving the linear regression framework:
• Logarithmize the IV and DV: this gives you the elasticity; higher values are
weighted less (Engel curve – income elasticity of demand). This model is
called a log-log model or log-linear model:
$$\log y_i = \log\alpha + \beta \log x_i + \log\varepsilon_i$$

– Different functional forms give parameter estimates that have different substantial
interpretations. The parameters of the linear model have an interpretation as
marginal effects. The elasticities will vary depending on the data. In contrast the
parameters of the log-log model have an interpretation as elasticities. So the log-
log model assumes a constant elasticity over all values of the data set. Therefore
the coefficients of a log-linear model can be interpreted as percentage changes –
if the explanatory variable changes by one percent the dependent variable
changes by beta percent.
– The log transformation is only applicable when all the observations in the data set
are positive. This can be guaranteed by using a transformation like log(X+k) where
k is a positive scalar chosen to ensure positive values. However, careful thought
has to be given to the interpretation of the parameter estimates.
– For a given data set there may be no particular reason to assume that one
functional form is better than the other. A model selection approach is to estimate
competing models by OLS and choose the model with the highest R-squared.
• Include an additional squared term of the IV to test for U-shaped and inverse
U-shaped relationships (see the sketch below). Careful with the interpretation!
The sizes of the two coefficients (linear and squared) determine whether there
is indeed a U-shaped or inverse U-shaped relationship.
$$y_i = \alpha + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i$$
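A minimal Stata sketch (illustrative data): the squared term can be added by hand or via factor-variable notation.
sysuse auto, clear
gen mpg2 = mpg^2
regress price mpg mpg2
* equivalent, using factor variables:
regress price c.mpg##c.mpg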
Hausken, Martin, Plümper 2004: Government Spending and Taxation in
Democracies and Autocracies, Constitutional Political Economy 15, 239-59.
polity_sqr polity govcon
0 0 20.049292
0.25 0.5 18.987946
1 1 18.024796
2.25 1.5 17.159842
4 2 16.393084
6.25 2.5 15.724521
9 3 15.154153
12.25 3.5 14.681982
16 4 14.308005
20.25 4.5 14.032225
25 5 13.85464
30.25 5.5 13.775251
36 6 13.794057
42.25 6.5 13.911059
49 7 14.126257
56.25 7.5 14.43965
64 8 14.851239
72.25 8.5 15.361024
81 9 15.969004
90.25 9.5 16.67518
100 10 17.479551
The „u“ shaped relationship between democracy and government spending:

[Figure: predicted government consumption in % of GDP (vertical axis, roughly 13 to 21) plotted against the degree of democracy (horizontal axis, 0 to 10), tracing the U-shaped relationship from the table above.]
2. Interaction Effects

Two explanatory variables do not only have a direct effect on the dependent
variable but also a combined effect:
$$y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{1i} x_{2i} + \varepsilon_i$$
Interpretation of the combined effect: b1·SD(x1) + b2·SD(x2) + b3·SD(x1·x2).
Example: the monetary policy of a currency union has a direct effect on
monetary policy in outsider countries, but this effect is increased by import
shares (a Stata sketch of an interaction model follows below).
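A minimal Stata sketch of a continuous-by-continuous interaction and its marginal-effect plot (the auto data and variable names are only illustrative; they stand in for spending, unemployment and trade):
sysuse auto, clear
regress price c.mpg##c.weight
margins, dydx(mpg) at(weight = (2000(500)4500))   // marginal effect of mpg as weight varies
marginsplot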
Example: government spending
[Figures: government spending in % of GDP plotted against unemployment for low versus high Christian democratic portfolio; predicted spending against trade (at the mean of unemployment and +/- 1 SD) and against unemployment (at the mean of trade and +/- 1 SD).]
Interaction Effects of Continuous Variables

[Figures: marginal effect of unemployment on government spending as trade openness changes (dependent variable: government spending), with the 95% confidence interval shown as thick dashed lines and a kernel density estimate of trade as a thin dashed line; the mean of international trade exposure is marked.]
3. Dummy variables

An explanatory variable that takes on only the values 0 and 1.
Example: DV: spending; IV: whether a country is a democracy (1) or not (0).
$$y_i = \alpha + \beta D_i + \varepsilon_i$$
Alpha is then the expected value of y for non-democracies and alpha + beta
the expected value for democracies (see the sketch below).
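A minimal Stata sketch (illustrative data): factor-variable notation handles the 0/1 dummy.
sysuse auto, clear
regress price i.foreign    // _cons = mean for the 0 group, coefficient = difference for the 1 group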
4. Outliers

Problem:
The OLS principle implies the minimization of squared
residuals. From this follows that extreme cases can have
a strong impact on the regression line.
Inclusion/exclusion of extreme cases might change the
results significantly.

The slope and intercept of the least squares line are very sensitive to data
points which lie far from the true regression line. These points are called
outliers, i.e. extreme values of observed variables that can distort estimates
of regression coefficients.
Test for Outliers
• symmetry (symplot) and normality (dotplot) of dependent variable gives first
indication for outlier cases
• Residual-vs.-fitted plots (rvfplot) indicate which observations of the DV are
far away from the predicted values
• lvr2plot is the leverage against residual squared plot. The upper left corner
of the plot will be points that are high in leverage and the lower right corner
will be points that are high in the absolute of residuals. The upper right
portion will be those points that are both high in leverage and in the
absolute of residuals.
• DFBETA: how much would the coefficient of an explanatory variable change
if we omitted one observation?
The measure that measures how much impact each observation has
on a particular coefficient is DFBETAs. The DFBETA for an
explanatory variable and for a particular observation is the difference
between the regression coefficient calculated for all of the data and
the regression coefficient calculated with the observation
deleted, scaled by the standard error calculated with the
observation deleted. The cut-off value for DFBETAs is 2/sqrt(n),
where n is the number of observations.
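A minimal Stata sketch (illustrative data): -dfbeta- computes these statistics and stores them as new variables.
sysuse auto, clear
regress price mpg weight
dfbeta                                                   // creates _dfbeta_1, _dfbeta_2, ...
list make _dfbeta_1 if abs(_dfbeta_1) > 2/sqrt(e(N))     // observations beyond the 2/sqrt(n) cut-off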
[Figure: leverage-versus-residual-squared plot (lvr2plot), leverage against normalized residual squared, with country labels for each observation; Belgium, Japan, the Netherlands, Switzerland and Ireland stand out as high-leverage cases.]
Solutions: Outliers
• Include or exclude obvious outlier cases and check their impact on
the regression coefficients.
• Logarithmize the dependent variable and possibly the explanatory
variables as well – this reduces the impact of large values.

Jackknife, bootstrap:
• Are both tests and solutions at the same time: they show whether
single observations have an impact on the results. If so, one can use
the jackknifed and bootstrapped coefficients and standard errors,
which are more robust to outliers than normal OLS results.
• Jackknife: takes the original dataset and runs the same regression N
times, leaving one observation out at a time.
Example command in Stata: „jackknife _b _se, eclass: reg spend unem
growthpc depratio left cdem trade lowwage fdi skand“
• Bootstrapping is a re-sampling technique: for the specified number
of repetitions, the same regression is run for a different sample
randomly drawn (with replacement) from the original dataset.
Example command: „bootstrap _b _se, reps(1000): reg spend unem
growthpc depratio left cdem trade lowwage fdi skand“
