
10 LINEAR REGRESSION

In Chapter 6 we discussed the concepts of covariance and correlation – two ways of measuring the extent to which two random variables, X and Y, are related to each other. In many cases we would like to take this a step further and try to use information from one variable to make predictions about the outcome of the other. For instance ...

10.1 sample covariance and correlation

We have so far considered summarizing a set of observations where one measurement is made on each individual or unit, but often in real-life random experiments we make multiple measurements on each individual. For example, during a health check-up a doctor might record the height, weight, age, sex, pulse rate, and blood pressure.
Just as we did for single measurements, we can represent the observed data by their empirical distribution,
which is now a function of multiple arguments. For example, if we measure two random variables (Xi , Yi )
for the ith individual (say weight and blood pressure), then the empirical distribution function is given by

f(t, s) = \frac{1}{n}\,\#\{\, i : X_i = t,\; Y_i = s \,\}.
We can now use this to estimate population features by the corresponding feature of the empirical distribution.
For example, the population covariance Cov [X, Y ] = E [(X − E [X ])(Y − E [Y ])] = E [XY ] − E [X ]E [Y ]
gives a measure of how X and Y relate to each other. The sample version of this is the sample covariance
S_{XY} = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}) = \frac{1}{n-1}\left[\sum_{i=1}^{n} X_i Y_i - n\,\bar{X}\,\bar{Y}\right].    (10.1.1)

The sample correlation coefficient is defined, analogously to the population correlation coefficient ρ[X, Y], as

r[X, Y] = \frac{S_{XY}}{S_X S_Y},    (10.1.2)

where S_X and S_Y are the sample standard deviations of X and Y respectively. As with ρ[X, Y], r[X, Y] is bounded between −1 and 1, and is invariant to location changes and to scale changes of the same sign; that is, for real numbers a, b, c, d with ac > 0,

r[aX + b, cY + d] = r[X, Y].
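
As a quick check of these definitions, R's built-in cov, cor, and sd functions compute the quantities in (10.1.1) and (10.1.2) directly. The numbers below are hypothetical measurements chosen only for illustration; the final line illustrates the invariance property for positive multipliers.

> x <- c(1.2, 2.5, 3.1, 4.8, 5.0)      # hypothetical X measurements
> y <- c(2.0, 2.7, 3.9, 5.1, 5.2)      # hypothetical Y measurements
> cov(x, y)                            # sample covariance S_XY, using the 1/(n-1) convention
> cov(x, y) / (sd(x) * sd(y))          # sample correlation via (10.1.2)
> cor(x, y)                            # agrees with the previous line
> cor(2*x + 3, 5*y - 1)                # unchanged under positive scale and location changes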

10.2 simple linear model

We will assume that the variable Y depends on X in a linear fashion, but that it is also affected by random factors. Specifically, we will assume there is a regression line y = α + βx and that for given x-values X_1, X_2, ..., X_n the corresponding y-values Y_1, Y_2, ..., Y_n are given by

Y_j = \alpha + \beta X_j + \varepsilon_j,    (10.2.1)

for j = 1, 2, ..., n, where the ε_j are independent random variables with ε_j ∼ Normal(0, σ²). Equation (10.2.1) is referred to as the simple linear model. In particular, ε_j is the (random) vertical distance of the point (X_j, Y_j) from the regression line. For all results below we assume σ² > 0 is the variance of the errors, assumed to be the same for every data point. We also assume that not all of the X_j quantities are the same, so that the variance of these quantities is non-zero. In particular this means n ≥ 2.
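
The following short simulation sketch may help make the model concrete. The parameter values (α = 2, β = 0.5, σ = 1) and the choice of x-values are hypothetical, used only to generate data satisfying (10.2.1).

> set.seed(1)                                  # for reproducibility
> n <- 50; alpha <- 2; beta <- 0.5; sigma <- 1 # hypothetical parameter values
> x <- seq(1, 10, length.out = n)              # fixed x-values X_1, ..., X_n
> eps <- rnorm(n, mean = 0, sd = sigma)        # independent errors eps_j ~ Normal(0, sigma^2)
> y <- alpha + beta*x + eps                    # responses Y_j = alpha + beta*X_j + eps_j
> plot(x, y); abline(alpha, beta)              # scatter of the data with the true regression line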


10.3 the least squares line

The values of (X1 , Y1 ), . . . , (Xn , Yn ) are collected data. Though we assume that this data is produced via
the simple linear model, we typically do not know the actual values of the slope β or the y-intercept α.
The goal of this section is to illustrate a way to estimate these values from the data.
For a line y = a + bx the “residual” of a data point (Xj , Yj ) is defined to be the quantity Yj − (a + bXj ).
This is the difference between the actual y-value of the data point and the location where the line predicts
the y-value should be. In other words, it may be viewed as the error of the line when attempting to predict
the y-value corresponding to the Xj data point. Among all possible lines through the data, there is one
which minimizes the sum of these squared residual errors. This is called the “least squares line”.
Let (X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n) be points on the plane. Suppose we wish to find a line that minimizes the sum of squared residual errors. That is, let g : R² → R be defined as

g(a, b) = \sum_{j=1}^{n}\left[\,Y_j - (a + bX_j)\,\right]^2.

The objective is to minimize g. So using calculus,

0 = \frac{\partial g}{\partial a} = -2\sum_{j=1}^{n}\left[\,Y_j - a - bX_j\,\right]    (10.3.1)

and

0 = \frac{\partial g}{\partial b} = -2\sum_{j=1}^{n} X_j\left[\,Y_j - a - bX_j\,\right].    (10.3.2)

From equation (10.3.1) we have

0 = \sum_{j=1}^{n}\left[\,Y_j - a - bX_j\,\right] = \sum_{j=1}^{n} Y_j - \sum_{j=1}^{n} a - b\sum_{j=1}^{n} X_j = n\bar{Y} - na - bn\bar{X} = n\left(\bar{Y} - (a + b\bar{X})\right).

Therefore¹
\bar{Y} = a + b\bar{X},    (10.3.3)

which shows that the point (X̄, Ȳ) must lie on the least squares line. The point (X̄, Ȳ) is known as the point of averages. Similarly, from equation (10.3.2),
0 = \sum_{j=1}^{n} X_j\left[\,Y_j - a - bX_j\,\right] = \sum_{j=1}^{n}\left(X_jY_j - aX_j - bX_j^2\right) = \sum_{j=1}^{n} X_jY_j - an\bar{X} - b\sum_{j=1}^{n} X_j^2,

so that

\sum_{j=1}^{n} X_jY_j = an\bar{X} + b\sum_{j=1}^{n} X_j^2.    (10.3.4)

We now use the system of two equations (given by (10.3.3) and (10.3.4)) to solve for a and b, obtaining

b = \frac{\left(\sum_{j=1}^{n} X_jY_j\right) - n\bar{X}\bar{Y}}{\left(\sum_{j=1}^{n} X_j^2\right) - n\bar{X}^2}.    (10.3.5)

¹ We shall use the notation X̄, Ȳ, S_X, S_Y, r[X, Y] (below), even though they are not necessarily random quantities. This is to simplify notation and will allow us to use known properties in the event that they are random.


Recall that the sample variance of X_1, X_2, ..., X_n is

S_X^2 = \frac{1}{n-1}\sum_{j=1}^{n}(X_j - \bar{X})^2
      = \frac{1}{n-1}\sum_{j=1}^{n}\left(X_j^2 - 2X_j\bar{X} + \bar{X}^2\right)
      = \frac{1}{n-1}\left[\left(\sum_{j=1}^{n} X_j^2\right) - 2\bar{X}\left(\sum_{j=1}^{n} X_j\right) + n\bar{X}^2\right]
      = \frac{1}{n-1}\left[\left(\sum_{j=1}^{n} X_j^2\right) - 2n\bar{X}^2 + n\bar{X}^2\right]
      = \frac{1}{n-1}\left[\left(\sum_{j=1}^{n} X_j^2\right) - n\bar{X}^2\right].

Therefore, the denominator of (10.3.5) is simply (n − 1)S_X². The numerator may be written more simply by using the notation of sample covariance and correlation defined in (10.1.1) and (10.1.2). So from (10.3.5) we have

b = \frac{\left(\sum_{j=1}^{n} X_jY_j\right) - n\bar{X}\bar{Y}}{\left(\sum_{j=1}^{n} X_j^2\right) - n\bar{X}^2} = \frac{(n-1)S_{XY}}{(n-1)S_X^2} = \frac{r[X, Y]\,S_Y}{S_X}.    (10.3.6)

Using the above and (10.3.3), we can also now write a nice formula for a, which is

a = \bar{Y} - \frac{r[X, Y]\,S_Y}{S_X}\,\bar{X}.    (10.3.7)

By the above calculation we have shown that the least squares line minimizing the sum of the squared residual errors is the line passing through the point of averages (X̄, Ȳ) and having slope b = \frac{r[X, Y]\,S_Y}{S_X}. We state this precisely in the theorem below.

Theorem 10.3.1. Let (X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n) be given data points. Then the least squares line passes through (X̄, Ȳ) and has slope \frac{r[X, Y]\,S_Y}{S_X}.
We illustrate the use of these formulas with two examples given below.
Example 10.3.2. Consider the following five data points:

X  Y
3  6
4  5
5  6
6  4
7  2

These points are not collinear, but suppose we wish to find a line that most closely approximates their trend in the least squares sense described above. Viewing these as samples, it is routine to calculate that the formulas above yield a = 9.1 and b = −0.9. Of all the lines in the plane, the one that minimizes the sum of squared residual errors for the data set above is the line y = 9.1 − 0.9x.
The R software also has a feature to perform a regression directly. To obtain this result using R we
could first create vectors that represent the data:

Version: – April 25, 2016


244 linear regression

> x <- c(3,4,5,6,7)
> y <- c(6,5,6,4,2)

And then instruct R to perform the regression using the command “lm”, indicating the linear model.

> lm(y ~ x)

The order of the variables in this command is important, with y ~ x indicating that the y variable is being predicted using the x variable as input.
The resulting output from R is

(Intercept)            x
        9.1         -0.9

the values of the intercept and slope of the least squares line respectively.
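
Alternatively, the coefficients can be computed directly from the formulas of this section, since b = S_XY/S_X² and a = Ȳ − bX̄. A minimal sketch using R's cov, var, and mean, which should reproduce the lm output above:

> b <- cov(x, y) / var(x)        # slope: S_XY / S_X^2, equivalently r[X,Y] S_Y / S_X
> a <- mean(y) - b * mean(x)     # intercept from (10.3.3): the line passes through the point of averages
> c(a, b)
[1]  9.1 -0.9
> coef(lm(y ~ x))                # same values as reported by lm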


Example 10.3.3. Suppose as part of a health study, a researcher collects data for weights and heights
of sixty adult men in a population. The average height of the men is 174 cm with a sample standard
deviation of 8.0 cm. The average weight of the men is 78 kg with a sample standard deviation of 10 kg.
The correlation between the variables in the sample was 0.55.
This information alone is enough to find the least squares line for predicting weight from height. The
reader may use the formulas above to verify that b = 0.6875 and a = −41.625. Therefore, among all lines,
y = −41.625 + 0.6875x is the one which minimizes the sum of squared residuals.
This does not necessarily mean this line would be appropriate for predicting new data points. To make
such a declaration, we would want to have some evidence that the two variables had a linear relationship to
begin with, but regardless of whether or not the data was produced from a simple linear model, the line
above minimizes error in the least squares sense. 
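
A minimal R sketch of this verification, using only the summary statistics given in the example:

> r <- 0.55; Sx <- 8; Sy <- 10         # sample correlation and standard deviations
> xbar <- 174; ybar <- 78              # sample means of height and weight
> b <- r * Sy / Sx                     # slope r[X,Y] S_Y / S_X, equal to 0.6875
> a <- ybar - b * xbar                 # intercept from (10.3.3), equal to -41.625
> c(a, b)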

exercises

Ex. 10.3.1. Let (X1 , Y1 ), . . . (Xn , Yn ) be data produced via the simple linear model and suppose y = a + bx
is the least squares line for the data. Recall from above that the residual for any given data point is
Yj − (a + bXj ), the error the line makes in predicting the correct y-value from the given x-value. Show
that the sum of the residuals over all n data points must be zero.
Ex. 10.3.2. Suppose that instead of using the simple linear model, we assume the regression line is known to pass through the origin. That is, the regression line has the form y = βx and for given x-values X_1, X_2, ..., X_n the corresponding y-values Y_1, Y_2, ..., Y_n are given by

Y_j = \beta X_j + \varepsilon_j,    (10.3.8)

for j = 1, 2, ..., n. As with the simple linear model, we assume the ε_j are independent random variables with ε_j ∼ Normal(0, σ²). (We will refer to this as the “linear model through the origin” and will have several exercises investigating how several formulas from this chapter would need to be modified for such a model.)
Assuming data (X_1, Y_1), ..., (X_n, Y_n) was produced from the linear model through the origin, find the least squares line through the origin. That is, find a formula for b such that the line y = bx minimizes the sum of squared residual errors.

10.4 a and b as random variables

In this section (and the remainder of this chapter) we will assume that (X_1, Y_1), ..., (X_n, Y_n) follow the simple linear model (10.2.1). In other words, there is a regression line y = α + βx and, for given x-values X_1, X_2, ..., X_n, the corresponding y-values Y_1, Y_2, ..., Y_n are given by (10.2.1). In the previous section this data was used to produce a least squares line y = a + bx minimizing the sum of squared residuals. In this section we investigate how well the random quantities a and b approximate the (unknown) values α and β.


Theorem 10.4.1. Under the assumptions of the simple linear model (10.2.1), the slope b of the least squares line is a linear combination of the Y_j variables. Further, it has a normal distribution with mean β and variance \frac{\sigma^2}{(n-1)S_X^2}.

Proof - First recall that X_1, X_2, X_3, ..., X_n are assumed to be deterministic, so they will be treated as known constants. The data points Y_1, Y_2, ..., Y_n are assumed to follow the simple linear model (10.2.1). So for j = 1, ..., n,

E[Y_j] = E[\alpha + \beta X_j + \varepsilon_j] = \alpha + \beta X_j + E[\varepsilon_j] = \alpha + \beta X_j

and

Var[Y_j] = Var[\alpha + \beta X_j + \varepsilon_j] = Var[\varepsilon_j] = \sigma^2.

Using the formula (10.3.5) we derived for b, together with the above, we have

E[b] = E\left[\frac{\left(\sum_{j=1}^{n} X_jY_j\right) - n\bar{X}\bar{Y}}{(n-1)S_X^2}\right]
     = \frac{1}{(n-1)S_X^2}\left[\left(\sum_{j=1}^{n} X_j E[Y_j]\right) - n\bar{X}E[\bar{Y}]\right]
     = \frac{1}{(n-1)S_X^2}\left[\left(\sum_{j=1}^{n} X_j(\alpha + \beta X_j)\right) - n\bar{X}(\alpha + \beta\bar{X})\right]
     = \frac{1}{(n-1)S_X^2}\left[n\alpha\bar{X} + \beta\left(\sum_{j=1}^{n} X_j^2\right) - n\alpha\bar{X} - \beta n\bar{X}^2\right]
     = \frac{\beta}{(n-1)S_X^2}\left[\left(\sum_{j=1}^{n} X_j^2\right) - n\bar{X}^2\right] = \beta.

Similarly, writing n\bar{X}\bar{Y} = \bar{X}\sum_{j=1}^{n} Y_j shows that the numerator of b equals \sum_{j=1}^{n}(X_j - \bar{X})Y_j, so by the independence of the Y_j,

Var[b] = Var\left[\frac{\sum_{j=1}^{n}(X_j - \bar{X})Y_j}{(n-1)S_X^2}\right]
       = \frac{1}{[(n-1)S_X^2]^2}\sum_{j=1}^{n}(X_j - \bar{X})^2\, Var[Y_j]
       = \frac{\sigma^2\,(n-1)S_X^2}{[(n-1)S_X^2]^2}
       = \frac{\sigma^2}{(n-1)S_X^2}.


The algebra below justifies that b is a linear combination of the Y_j variables:

b = \frac{\left(\sum_{j=1}^{n} X_jY_j\right) - n\bar{X}\bar{Y}}{(n-1)S_X^2} = \frac{1}{(n-1)S_X^2}\left[\left(\sum_{j=1}^{n} X_jY_j\right) - \left(\sum_{j=1}^{n}\bar{X}Y_j\right)\right] = \sum_{j=1}^{n}\left(\frac{X_j - \bar{X}}{(n-1)S_X^2}\right)Y_j.

Since b is a linear combination of independent, normal random variables Yj , b itself is also a normal random
variable (Theorem 6.3.13). 
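
A small Monte Carlo check of Theorem 10.4.1 can be run in R. The design and parameter values below are hypothetical; the Y_j are regenerated repeatedly from the simple linear model (with the X_j held fixed) and the empirical mean and variance of the resulting slopes are compared with β and σ²/((n − 1)S_X²).

> set.seed(2)
> n <- 16; alpha <- 1; beta <- 2; sigma <- 3    # hypothetical parameter values
> x <- rnorm(n, mean = 10, sd = 2)              # fixed design, generated once and held constant
> slopes <- replicate(5000, {
+   y <- alpha + beta*x + rnorm(n, 0, sigma)    # fresh errors for each replication
+   cov(x, y) / var(x)                          # least squares slope b
+ })
> mean(slopes)                                  # should be close to beta = 2
> var(slopes)                                   # should be close to the next line
> sigma^2 / ((n - 1) * var(x))                  # theoretical variance sigma^2 / ((n-1) S_X^2)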
As noted above, the least squares line can be defined as the line of slope b passing through the point of
averages. The following lemma is a useful fact about how these quantities relate to each other.

Lemma 10.4.2. Let b be the slope of the least squares line and let Ȳ be the sample average of the Y_j variables. Then b and Ȳ are independent.

Proof - By Theorem 6.3.13, Ȳ has a normal distribution, and so does b by Theorem 10.4.1. By Theorem 6.4.3, all we have to show is that Ȳ and b are uncorrelated. Note that the Y_j variables are all independent of each other, and so Cov[Y_j, Y_k] will be zero if j ≠ k and will equal the variance σ² otherwise. So,
 
Cov[b, \bar{Y}] = Cov\left[\sum_{j=1}^{n}\frac{X_j - \bar{X}}{(n-1)S_X^2}\,Y_j,\; \frac{1}{n}\sum_{k=1}^{n} Y_k\right]
               = \sum_{j=1}^{n}\sum_{k=1}^{n} Cov\left[\frac{X_j - \bar{X}}{(n-1)S_X^2}\,Y_j,\; \frac{1}{n}\,Y_k\right]
               = \sum_{j=1}^{n}\sum_{k=1}^{n}\frac{X_j - \bar{X}}{n(n-1)S_X^2}\,Cov[Y_j, Y_k]
               = \sum_{j=1}^{n}\frac{X_j - \bar{X}}{n(n-1)S_X^2}\,\sigma^2
               = \frac{\sigma^2}{n(n-1)S_X^2}\sum_{j=1}^{n}\left(X_j - \bar{X}\right) = 0.

We conclude this section with a result on the distribution of a.

Theorem 10.4.3. Under the assumptions of the simple linear model (10.2.1), the y-intercept a (given by (10.3.7)) of the least squares line is a linear combination of the Y_j variables. Further, it has a normal distribution with mean α and variance \sigma^2\left(\frac{1}{n} + \frac{\bar{X}^2}{(n-1)S_X^2}\right).

Proof - See Exercise 10.4.1.

exercises

Ex. 10.4.1. Prove Theorem 10.4.3. (Hint: Make use of the fact that Ȳ = a + bX̄ and what has previously been proven about Ȳ and b.)
Ex. 10.4.2. Show that, generally speaking, a and b are not independent. Find necessary and sufficient conditions for when the two variables are independent.
Ex. 10.4.3. Show that a and Ȳ are never independent.
Ex. 10.4.4. Continuing from Exercise 10.3.2, assuming the regression line y = βx passes through the origin and y = bx is the least squares line through the origin, do the following:
(a) Find the expected value of b.


(b) Find the variance of b.


(c) Determine whether or not b has a normal distribution.
(d) Determine if b and Y are independent.

10.5 predicting new data when σ² is known

In this section we return to the question of using data for prediction. We continue to assume the simple linear model (10.2.1). We further assume that α and β are estimated by a and b (as calculated from the data (X_1, Y_1), ..., (X_n, Y_n)) and that the parameter σ², describing the variability of the data around the regression line, is a known quantity.
First suppose that for a particular deterministic x-value X* we want to use the data to estimate the corresponding y-value Y* = α + βX* on the regression line by Ŷ = a + bX*.

Theorem 10.5.1. The quantity Ŷ = a + bX* has a normal distribution with mean Y* = α + βX* and variance \sigma^2\left(\frac{1}{n} + \frac{(X^* - \bar{X})^2}{(n-1)S_X^2}\right).

Proof - Recall from Theorem 10.4.3 and Theorem 10.4.1 that a and b are both linear combinations of the normal random variables Y_j. So Ŷ has a normal distribution by Theorem 6.3.13. We need only calculate its mean and variance. The expected value is simple to calculate:

E[\hat{Y}] = E[a + bX^*] = E[a] + E[b]X^* = \alpha + \beta X^* = Y^*.

If a and b were independent, then calculating the variance of Ŷ would also be a simple task, but this is typically not the case. However, from Lemma 10.4.2 we know that b and Ȳ are independent. To make use of this, using (10.3.3), we may rewrite the line in point-slope form around the point of averages: Ŷ = Ȳ + b(X* − X̄). From this we have

Var[\hat{Y}] = Var[\bar{Y} + b(X^* - \bar{X})]
             = Var[\bar{Y}] + Var[b](X^* - \bar{X})^2
             = \frac{\sigma^2}{n} + \frac{\sigma^2}{(n-1)S_X^2}(X^* - \bar{X})^2
             = \sigma^2\left(\frac{1}{n} + \frac{(X^* - \bar{X})^2}{(n-1)S_X^2}\right).

Note that, over the various possible values of X*, this variance is minimal when X* is X̄, the average value of the x-data. In this case Var[Ŷ] = σ²/n = Var[Ȳ], as expected. The further X* is from the average of the x-values, the more variance there is in predicting the point on the regression line.
Next suppose that, instead of trying to estimate a point on the regression line, we are trying to predict a new data point produced from the linear model. Let X* now represent the x-value of some new data point and let Y* = α + βX* + ε*, where ε* ∼ Normal(0, σ²) and the random variable ε* is assumed to be independent of all the prior ε_j which produced the original data set. The following theorem addresses the distribution of the predictive error made when estimating Y* by the quantity Ŷ = a + bX*.

Theorem 10.5.2. If (X*, Y*) is a new data point, as described in the previous paragraph, then the predictive error in estimating Y* using the least squares line is (a + bX*) − Y*, which is normally distributed with mean 0 and variance \sigma^2\left(1 + \frac{1}{n} + \frac{(X^* - \bar{X})^2}{(n-1)S_X^2}\right).

Proof - The expected value of the predictive error is zero since

E[(a + bX^*) - Y^*] = E[a] + E[b]X^* - E[\alpha + \beta X^* + \varepsilon^*] = \alpha + \beta X^* - \alpha - \beta X^* - E[\varepsilon^*] = 0.


Both quantities a and b are linear combinations of the Y_j variables, and so

a + bX^* - Y^* = a + bX^* - \alpha - \beta X^* - \varepsilon^* = (-\alpha - \beta X^*) + \text{a linear combination of } Y_1, Y_2, \ldots, Y_n, \varepsilon^*.

All (n + 1) of the variables Y_1, Y_2, ..., Y_n, ε* are independent and have a normal distribution. As (−α − βX*) is a constant, it follows from the above that (a + bX* − Y*) has a normal distribution.
Finally, to calculate the variance, we again rewrite a + bX* in point-slope form and exploit independence:

Var[a + bX^* - Y^*] = Var[\bar{Y} + b(X^* - \bar{X}) - (\alpha + \beta X^* + \varepsilon^*)]
                    = Var[\bar{Y}] + Var[b](X^* - \bar{X})^2 + Var[\varepsilon^*]
                    = \frac{\sigma^2}{n} + \frac{\sigma^2}{(n-1)S_X^2}(X^* - \bar{X})^2 + \sigma^2
                    = \sigma^2\left(1 + \frac{1}{n} + \frac{(X^* - \bar{X})^2}{(n-1)S_X^2}\right).

Example 10.5.3. A mathematics professor at a large university is studying the relationship between scores on a preparation assessment quiz students take on the first day of class and their actual percentage score at the end of the class. Assuming the simple linear model with σ = 6, he takes a random sample of 30 students and discovers their average score on the quiz is X̄ = 54 with a sample standard deviation of S_X = 12, while the average percentage score in the class is Ȳ = 68 with a sample standard deviation of S_Y = 10. The sample correlation is r[X, Y] = 0.6. So according to the results above, the least squares line for predicting the course percentage from the preliminary quiz will be y = 0.5x + 41.
If we wish to use the line to predict the course percentage for someone who scores a 54 on the preliminary quiz, we would find y = 0.5(54) + 41 = 68, as expected, since someone who gets an average score on the quiz is likely to get around the average percentage in the class.
Similarly, if we wish to use the line to predict the course percentage for someone who scores an 80 on the preliminary quiz, we would find y = 0.5(80) + 41 = 81. This is also not surprising: due to the positive correlation, a student scoring above average on the quiz is likely to score above average in the course as well.
The previous theorem allows us to go further and calculate a standard deviation associated with these estimates. For the student who scores a 54 on the preliminary quiz, let Y* be the actual course percentage and let a + bX* = 68 be the least squares line estimate we made above. Then

Var[a + bX^* - Y^*] = 36\left(1 + \frac{1}{30} + 0\right) = 37.2

and so the standard deviation of the predictive error is SD[a + bX* − Y*] ≈ 6.1. This means that students who make an average score of 54 on the preliminary quiz will have a range of percentages in the course; this range will have a normal distribution with mean 68 and standard deviation 6.1. We could then use normal curve computations to make further predictions about how likely such a student may be to reach a certain benchmark.
Next take the example of a student who scores 80 on the preliminary quiz. The least squares line predicts the course percentage for such a student will be a + bX* = 81, but now

Var[a + bX^* - Y^*] = 36\left(1 + \frac{1}{30} + \frac{(80 - 54)^2}{29 \cdot 12^2}\right) \approx 43.0

and so SD[a + bX* − Y*] ≈ 6.6. Students who score an 80 on the preliminary exam will have a range of course percentages with a normal distribution of mean 81 and standard deviation 6.6.
Thinking of the standard deviation as the likely error associated with a prediction, this example suggests that predictions for data further from the mean will tend to have less accuracy than predictions near the mean. This is true in the simple linear model and will be explored in the exercises.
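
These calculations are easy to script. A sketch in R, where the benchmark of 90 in the final line is a hypothetical illustration of the normal curve computations mentioned above:

> sigma <- 6; n <- 30; xbar <- 54; Sx <- 12      # given summary figures
> a <- 41; b <- 0.5                              # least squares line from the example
> pred_sd <- function(xstar)                     # SD of the predictive error, Theorem 10.5.2
+   sigma * sqrt(1 + 1/n + (xstar - xbar)^2 / ((n - 1) * Sx^2))
> pred_sd(54)                                    # about 6.1
> pred_sd(80)                                    # about 6.6
> 1 - pnorm(90, mean = a + b*80, sd = pred_sd(80))  # chance a quiz score of 80 leads to over 90%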


exercises

Ex. 10.5.1. Using the figures from Example 10.5.3, do the following. Two students are selected independently at random. The first scored a 50 on the preliminary quiz while the second scored 60. Determine how likely it is that the student who scored the lower grade on the quiz will score a higher percentage in the course.
Ex. 10.5.2. Explain why Var[a + bX* − Y*] is minimized when X* = X̄.

10.6 hypothesis testing and regression

As a and b both have a normal distribution under the assumption of the simple linear model, it is possible to perform tests of significance concerning the values of α and β. Of particular importance is a test with a null hypothesis that β = 0 and an alternative hypothesis β ≠ 0. This is commonly called a “test of utility”. The reason for this name is that if β = 0, then the simple linear model produces output values Y_j = α + ε_j which do not depend on the corresponding input X_j. Therefore knowing the value of X_j should not be at all helpful in predicting the corresponding Y_j result. However, if β ≠ 0 then knowing X_j should be at least somewhat useful in predicting the Y_j value.
Example 10.6.1. Suppose (X_1, Y_1), ..., (X_16, Y_16) follows the simple linear model with σ = 5 and produces a least squares line y = 0.3 + 1.1x. Suppose the sample average of the X_j data is 20 and the sample variance is S_X² = 10. What is the conclusion of a test of utility at a significance level of α = 0.05?
From the given least squares line, b = 1.1. As noted above, a test of utility compares a null hypothesis that β = 0 to an alternative hypothesis β ≠ 0, so this will be a two-tailed test. If the null were true, then E[b] = 0 and we can use the normal distribution to determine whether the value 1.1 is so far from zero that the null seems unreasonable. Using the same sample-mimicking idea introduced in Chapter 9, we let Z_1, ..., Z_16 be random variables produced from X_1, ..., X_16 via the simple linear model. From Theorem 10.4.1, the slope of the least squares line for the (X_1, Z_1), ..., (X_16, Z_16) data has a normal distribution with mean β = 0 and variance \frac{\sigma^2}{(n-1)S_X^2} = \frac{25}{150} = \frac{1}{6}. Therefore we can calculate

P(|\text{slope of the least squares line}| \geq 1.1) = P\left(|Z| \geq \frac{1.1}{\sqrt{1/6}}\right) = 2P\left(Z < -\frac{1.1}{\sqrt{1/6}}\right) \approx 0.007,

where Z ∼ Normal(0, 1). As this P-value is less than the significance level, the test rejects the null hypothesis. That is, the test concludes that the slope of 1.1 is far enough from 0 that it demonstrates a true relationship between the X_j input values and the Y_j output values.
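
The same P-value can be reproduced with a few lines of R, using the figures of the example:

> b <- 1.1; sigma <- 5; n <- 16; Sx2 <- 10       # observed slope and given quantities
> se_b <- sqrt(sigma^2 / ((n - 1) * Sx2))        # SD of the slope under the null, sqrt(1/6)
> 2 * pnorm(-abs(b) / se_b)                      # two-tailed P-value, approximately 0.007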

exercises

Ex. 10.6.1. Continuing with Example 10.6.1, use Theorem 10.4.3 to devise a hypothesis test for determining
whether or not the regression line goes through the origin. That is, determine whether or not α = 0 is a
plausible assumption.

10.7 estimating an unknown σ²

In many cases the variance σ² of the points around the regression line will be an unknown quantity and so, like α and β, it too will need to be approximated using the (X_1, Y_1), ..., (X_n, Y_n) data. The following theorem provides an unbiased estimator of σ² based on the data.

Theorem 10.7.1. Let (X_1, Y_1), ..., (X_n, Y_n) be data following the simple linear model with n > 2. Let

S^2 = \frac{1}{n-2}\sum_{j=1}^{n}\left(Y_j - (a + bX_j)\right)^2.

Then S² is an unbiased estimator for σ². (That is, E[S²] = σ².)


Proof - Before looking at E[S²] in its entirety, we look at three quantities that will be helpful in computing this expected value.
First note that

Var[Y_j - \bar{Y}] = Var\left[\frac{nY_j - (Y_1 + Y_2 + \cdots + Y_n)}{n}\right]
                   = \frac{1}{n^2}\,Var\left[(n-1)Y_j - \sum_{i=1, i\neq j}^{n} Y_i\right]
                   = \frac{1}{n^2}\left[(n-1)^2\sigma^2 + \sum_{i=1, i\neq j}^{n}\sigma^2\right]
                   = \frac{1}{n^2}\left[(n-1)^2\sigma^2 + (n-1)\sigma^2\right]
                   = \frac{n-1}{n}\,\sigma^2,

and therefore

\sum_{j=1}^{n} E[(Y_j - \bar{Y})^2] = \sum_{j=1}^{n}\left\{ Var[Y_j - \bar{Y}] + \left(E[Y_j - \bar{Y}]\right)^2 \right\}
                                    = \sum_{j=1}^{n}\left\{ \frac{n-1}{n}\sigma^2 + \left((\alpha + \beta X_j) - (\alpha + \beta\bar{X})\right)^2 \right\}
                                    = \sum_{j=1}^{n}\frac{n-1}{n}\sigma^2 + \beta^2\sum_{j=1}^{n}(X_j - \bar{X})^2
                                    = (n-1)\sigma^2 + \beta^2(n-1)S_X^2.    (10.7.1)

Next,

\sum_{j=1}^{n} E[b^2(X_j - \bar{X})^2] = E[b^2]\sum_{j=1}^{n}(X_j - \bar{X})^2
                                       = \left(Var[b] + (E[b])^2\right)\left((n-1)S_X^2\right)
                                       = \left(\frac{\sigma^2}{(n-1)S_X^2} + \beta^2\right)\left((n-1)S_X^2\right)
                                       = \sigma^2 + \beta^2(n-1)S_X^2.    (10.7.2)


Also,

E[bY_j] = Cov[b, Y_j] + E[b]E[Y_j]
        = Cov\left[\sum_{i=1}^{n}\frac{X_i - \bar{X}}{(n-1)S_X^2}\,Y_i,\; Y_j\right] + \beta(\alpha + \beta X_j)
        = \sum_{i=1}^{n}\frac{X_i - \bar{X}}{(n-1)S_X^2}\,Cov[Y_i, Y_j] + \beta(\alpha + \beta X_j)
        = \frac{X_j - \bar{X}}{(n-1)S_X^2}\,Var[Y_j] + \beta(\alpha + \beta X_j)
        = \frac{X_j - \bar{X}}{(n-1)S_X^2}\,\sigma^2 + \beta(\alpha + \beta X_j),

from which (using the independence of b and Ȳ from Lemma 10.4.2, so that E[bȲ] = E[b]E[Ȳ] = β(α + βX̄)) we may determine that

\sum_{j=1}^{n} E[(Y_j - \bar{Y})\,b\,(X_j - \bar{X})] = \sum_{j=1}^{n}(X_j - \bar{X})E[bY_j] - \sum_{j=1}^{n}(X_j - \bar{X})E[b\bar{Y}]
 = \sum_{j=1}^{n}(X_j - \bar{X})\left(\frac{X_j - \bar{X}}{(n-1)S_X^2}\,\sigma^2 + \beta(\alpha + \beta X_j)\right) - \sum_{j=1}^{n}(X_j - \bar{X})\,\beta(\alpha + \beta\bar{X})
 = \sum_{j=1}^{n}\frac{(X_j - \bar{X})^2}{(n-1)S_X^2}\,\sigma^2 + \beta^2\sum_{j=1}^{n}(X_j - \bar{X})(X_j - \bar{X})
 = \sigma^2 + \beta^2(n-1)S_X^2.    (10.7.3)

Finally, putting together the results from equations (10.7.1), (10.7.2), and (10.7.3), we find

E\left[\sum_{j=1}^{n}\left(Y_j - (a + bX_j)\right)^2\right] = E\left[\sum_{j=1}^{n}\left(Y_j - (\bar{Y} + b(X_j - \bar{X}))\right)^2\right]
 = E\left[\sum_{j=1}^{n}\left((Y_j - \bar{Y}) - b(X_j - \bar{X})\right)^2\right]
 = E\left[\sum_{j=1}^{n}\left\{(Y_j - \bar{Y})^2 - 2(Y_j - \bar{Y})b(X_j - \bar{X}) + b^2(X_j - \bar{X})^2\right\}\right]
 = \sum_{j=1}^{n} E[(Y_j - \bar{Y})^2] - 2\sum_{j=1}^{n}E[(Y_j - \bar{Y})b(X_j - \bar{X})] + \sum_{j=1}^{n}E[b^2(X_j - \bar{X})^2]
 = \left[(n-1)\sigma^2 + \beta^2(n-1)S_X^2\right] - 2\left[\sigma^2 + \beta^2(n-1)S_X^2\right] + \left[\sigma^2 + \beta^2(n-1)S_X^2\right]
 = (n-2)\sigma^2.

Hence E[S^2] = E\left[\frac{1}{n-2}\sum_{j=1}^{n}\left(Y_j - (a + bX_j)\right)^2\right] = \sigma^2, as desired.
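
In R, this estimator coincides with the square of the residual standard error reported by lm. A brief sketch, reusing the data of Example 10.3.2:

> x <- c(3,4,5,6,7); y <- c(6,5,6,4,2)        # data from Example 10.3.2
> fit <- lm(y ~ x)
> n <- length(x)
> sum(resid(fit)^2) / (n - 2)                 # S^2, the unbiased estimator of sigma^2
> summary(fit)$sigma^2                        # R's residual standard error, squared; same value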
