Chapter: Simple Linear Regression
Probabilistic Models
Fitting the Model: The Least Squares Approach
Model Assumptions
Assessing the Utility of the Model: Making Inferences
about the Slope β₁
The Coefficients of Correlation and Determination
Using the Model for Estimation and Prediction
A Complete Example
Did You Know that Degree, Color and Race
Make a Difference in Home Refinancing?
A study found that purchasers without a college degree
paid $1,472 more in broker fees than those with a college
degree.
Not only did a degree matter; race was also a factor:
African Americans on average paid $500 more than whites,
and Hispanics $275 more than whites.
Regression analysis was used to determine whether various
borrower characteristics had a bearing on the amount of
broker fees and closing costs paid.
Did You Know that the Presence of an
NFL Team Boosts Rental Costs?
Regression analysis has revealed that in cities with an
NFL team, rental costs for apartments in the central-city
area were 8 percent higher than in cities without an NFL
team.
Property tax receipts were also found to be higher in
cities with NFL teams.
Regression Analysis
Regression analysis examines associative relationships between a
metric dependent variable and one or more independent
variables in the following ways:
• Determine whether the independent variables explain a
significant variation in the dependent variable: whether a
relationship exists.
• Determine how much of the variation in the dependent variable
can be explained by the independent variables: strength of the
relationship.
• Determine the structure or form of the relationship: the
mathematical equation relating the independent and dependent
variables.
• Predict the values of the dependent variable.
• Control for other independent variables when evaluating the
contributions of a specific variable or set of variables.
Regression Analysis
Note that regression analysis is concerned with the nature
and degree of association between variables; it does not
imply or assume any causality.
A First-Order (Straight Line) Probabilistic Model
y = β₀ + β₁x + ε
where
y = dependent or response variable
(variable to be modeled)
x = independent or predictor variable
(variable used as a predictor of y)
ε = random error component
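The probabilistic model above can be illustrated with a small simulation; the parameter values β₀ = 1.0 and β₁ = 0.5 below are hypothetical, chosen only to show the deterministic straight-line component plus a random error:

```python
import random

# A sketch of the first-order model y = b0 + b1*x + e.
# b0 = 1.0 and b1 = 0.5 are hypothetical illustration values.
random.seed(0)

b0, b1 = 1.0, 0.5
xs = list(range(1, 11))

# Each y is the deterministic line plus a normal(0, 1) random error.
ys = [b0 + b1 * x + random.gauss(0, 1) for x in xs]

for x, y in zip(xs, ys):
    print(x, round(y, 2))
```

Because of the error component ε, repeated samples at the same x give different y values scattered around the line E(y) = β₀ + β₁x.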
y = β₀ + β₁x + ε

Example data (attitude toward the city vs. duration of residence):

Respondent | Attitude (y) | Duration of Residence (x) | (unlabeled)
2 | 9 | 12 | 11
3 | 8 | 12 | 4
4 | 3 | 4 | 1
5 | 10 | 12 | 11
6 | 4 | 6 | 1
7 | 5 | 8 | 7
8 | 2 | 2 | 4
9 | 11 | 18 | 8
10 | 9 | 9 | 10
11 | 10 | 17 | 8
12 | 2 | 2 | 5
Scatter Diagram

[Figure: scatterplot of Attitude (y-axis, 0 to 12) vs. Duration of Residence (x-axis, 0 to 20)]
Which Straight Line Is Best on this Scatter?

[Figure: the same scatterplot with four candidate lines, Line 1 through Line 4]
Fitting the Model:
The Least Squares Approach
[Figure: scatterplot of y (0 to 60) vs. x (0 to 60)]
Which Line Fits Best?

[Figure: the same scatterplot with several candidate lines]
Least Squares Line

The least squares (LS) line minimizes the sum of squared errors:

SSE = Σᵢ₌₁ⁿ ε̂ᵢ²,  here with n = 4:  SSE = ε̂₁² + ε̂₂² + ε̂₃² + ε̂₄²

[Figure: four data points with their vertical deviations ε̂₁ through ε̂₄ from the fitted line]
Formula for the Least Squares Estimates

Slope: β̂₁ = SSxy / SSxx
y-intercept: β̂₀ = ȳ - β̂₁x̄

where
SSxy = Σ(xᵢ - x̄)(yᵢ - ȳ)
SSxx = Σ(xᵢ - x̄)²
n = sample size
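As a sketch, the formulas above can be applied directly to the household income vs. consumption expenditure data used later in this chapter:

```python
# Least squares estimates via SSxy/SSxx, using the chapter's
# income (x, $1,000s) vs. consumption expenditure (y, $1,000s) data.
x = [1, 5, 9, 13, 17]
y = [7, 6, 9, 8, 10]

n = len(x)
x_bar = sum(x) / n  # 9.0
y_bar = sum(y) / n  # 8.0

ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # 32
ss_xx = sum((xi - x_bar) ** 2 for xi in x)                        # 160

b1 = ss_xy / ss_xx       # slope estimate
b0 = y_bar - b1 * x_bar  # intercept estimate

print(round(b0, 4), round(b1, 4))  # 6.2 0.2
```

This reproduces the fitted line ŷ = 6.2 + .2x derived later in the chapter.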
Interpreting the Estimates of β₀ and β₁ in Simple
Linear Regression

[Figure: scatterplot of Sales (0 to 4) vs. Advertising (0 to 5)]
Parameter Estimation Solution
ŷ = -.1 + .7x
Coefficient Interpretation Solution

1. Slope (β̂₁ = .7)
• Sales volume (y) is expected to increase by $700
for each $100 increase in advertising (x), over the
sampled range of advertising expenditures from
$100 to $500.

2. y-Intercept (β̂₀ = -.1)
• Since x = 0 is outside the range of the sampled
values of x, the y-intercept has no meaningful
interpretation.
Regression Line Fitted to the Data

[Figure: the Sales vs. Advertising scatterplot with the fitted line ŷ = -.1 + .7x]
Example: Income vs. Consumption Expenditure

Income (x) | Consumption Expenditure (y)
1 | 7
5 | 6
9 | 9
13 | 8
17 | 10

(both variables in $1,000s)
Questions

[Figure: scatterplot of Consumption Expenditure ($1,000s, 5 to 11) vs. Household Income ($1,000s, 0 to 20)]
Solution

x | y | xᵢ - x̄ | (xᵢ - x̄)² | yᵢ - ȳ | (yᵢ - ȳ)² | (xᵢ - x̄)(yᵢ - ȳ)
1 | 7 | -8 | 64 | -1 | 1 | 8
5 | 6 | -4 | 16 | -2 | 4 | 8
9 | 9 | 0 | 0 | 1 | 1 | 0
13 | 8 | 4 | 16 | 0 | 0 | 0
17 | 10 | 8 | 64 | 2 | 4 | 16

With x̄ = 9 and ȳ = 8, the column sums give SSxx = 160 and SSxy = 32, so

β̂₁ = SSxy / SSxx = 32/160 = .2
β̂₀ = ȳ - β̂₁x̄ = 8 - (.2)(9) = 6.2
[Figure: the Consumption Expenditure vs. Household Income scatterplot with the fitted line ŷ = 6.2 + .2x]
Consumption Expenditure Prediction When x = $6,000

ŷ = 6.2 + .2(6) = 7.4, a predicted consumption expenditure of $7,400.

[Figure: the fitted line with the prediction ŷ = 7.4 marked at x = 6]
Consumption Expenditure Prediction When x = $25,000

ŷ = 6.2 + .2(25) = 11.2, a predicted consumption expenditure of $11,200. Note that x = 25 lies beyond the sampled incomes (x = 1 to 17), so this prediction is an extrapolation and should be interpreted with caution.

[Figure: the fitted line extended to x = 25, with ŷ = 11.2 marked]
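Both predictions follow by substituting x into the fitted line; a minimal sketch:

```python
# Predictions from the fitted line y-hat = 6.2 + 0.2x
# (income x and predicted expenditure both in $1,000s).
def predict(x):
    return 6.2 + 0.2 * x

print(round(predict(6), 1))   # 7.4  -> $7,400 at an income of $6,000
print(round(predict(25), 1))  # 11.2 -> $11,200, extrapolating beyond x = 17
```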
C. Compute the Residuals

[Figure: the fitted line ŷ = 6.2 + .2x with the vertical residuals marked]

Income Residual Plot
[Figure: residuals (-2 to 2) plotted against income (0 to 20)]
For each observation, compute the residual yᵢ - ŷᵢ and its square (residual)². Note that
• Σ(residuals) = 0
• SSE = Σ(residuals)² = 3.6
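These facts can be checked directly; a sketch using the fitted line ŷ = 6.2 + .2x and the income/consumption data:

```python
# Residuals for the income/consumption data under y-hat = 6.2 + 0.2x;
# the residuals sum to 0 and SSE = 3.6, as stated above.
x = [1, 5, 9, 13, 17]
y = [7, 6, 9, 8, 10]

residuals = [yi - (6.2 + 0.2 * xi) for xi, yi in zip(x, y)]
sse = sum(r ** 2 for r in residuals)

print([round(r, 1) for r in residuals])  # [0.6, -1.2, 1.0, -0.8, 0.4]
print(round(abs(sum(residuals)), 10))    # 0.0
print(round(sse, 10))                    # 3.6
```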
H₀: β₁ = 0 against Hₐ: β₁ ≠ 0

If the data support the alternative hypothesis, then x does
contribute information for the prediction of y using the
straight-line model.
A Test of Model Utility: Simple Linear Regression

H₀: β₁ = 0
Hₐ: β₁ ≠ 0

Test statistic: t = β̂₁ / s_β̂₁ = β̂₁ / (s / √SSxx)

Rejection region (α = .05, df = n - 2 = 5 - 2 = 3): reject H₀ if |t| > 3.182, with .025 in each tail.

[Figure: t distribution with rejection regions beyond ±3.182]
Test Statistic Solution

s_β̂₁ = s / √SSxx = .6055 / √(55 - 15²/5) = .6055 / √10 = .1914

t = β̂₁ / s_β̂₁ = .70 / .1914 = 3.657

Since 3.657 > 3.182, reject H₀: the model is useful.
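A sketch reproducing this computation from the summary numbers given on the slide (s = .6055, n = 5, Σx = 15, Σx² = 55, β̂₁ = .70); the slide's 3.657 comes from rounding s_β̂₁ to .1914 first:

```python
import math

# t statistic for H0: beta1 = 0, from the slide's summary numbers.
s = 0.6055        # standard error of the regression
b1 = 0.70         # slope estimate
n = 5
sum_x, sum_x2 = 15, 55

ss_xx = sum_x2 - sum_x ** 2 / n  # 55 - 15**2/5 = 10
s_b1 = s / math.sqrt(ss_xx)      # estimated standard error of b1
t = b1 / s_b1

print(round(s_b1, 4), round(t, 3))  # 0.1915 3.656
```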
Test of Slope Coefficient Solution

Regression Statistics
Multiple R: 0.903696114
R Square: 0.816666667
Adjusted R Square: 0.755555556
Standard Error: 0.605530071
Observations: 5

ANOVA
Source | df | SS | MS | F | Significance F
Regression | 1 | 4.9 | 4.9 | 13.36364 | 0.035352847
Residual | 3 | 1.1 | 0.366667 | |
Total | 4 | 6 | | |

The printout's coefficient table reports β̂₁, its standard error s_β̂₁, the t statistic t = β̂₁ / s_β̂₁, and its p-value.
The Coefficients of Correlation and
Determination
The Coefficients of Correlation (r)
Correlation Models
On the Excel printout, Multiple R (0.903696114) is the coefficient of correlation r.
[Figure: deviations of the observed y values from the fitted line at X₁ through X₅, illustrating the residual variation SSE]
Decomposition of the Total Variation

r² = (coefficient of correlation)²

-1 ≤ r ≤ 1 and 0 ≤ r² ≤ 1
Example - Obtaining the Value of r² for the
Sales Revenue Model

r² = (coefficient of correlation)² = (.904)² = .817
On the printout, R Square (0.816666667) is the coefficient of determination r².
Equivalently, using the ANOVA sums of squares, r² = (SSyy - SSE)/SSyy = (6 - 1.1)/6 = .8167.
The ANOVA table also gives the F statistic, F = MSregression / MSresidual = 4.9 / .366667 = 13.364, and its p-value (Significance F) = .0353.
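The r² and F values on the printout can be recomputed from the ANOVA sums of squares alone; a sketch:

```python
# Recomputing r^2 and F from the ANOVA table on the printout:
# SS_regression = 4.9, SS_residual (SSE) = 1.1, SS_total (SSyy) = 6,
# with df_regression = 1 and df_residual = 3.
ss_reg, sse, ss_yy = 4.9, 1.1, 6.0
df_reg, df_res = 1, 3

r2 = (ss_yy - sse) / ss_yy   # equivalently ss_reg / ss_yy
ms_reg = ss_reg / df_reg
ms_res = sse / df_res
f = ms_reg / ms_res

print(round(r2, 4), round(f, 3))  # 0.8167 13.364
```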
Step 1:

Step 2:
Use a statistical software package to estimate the
unknown parameters in the deterministic component of
the hypothesized model. The Excel printout for the
simple linear regression analysis is shown on the next
slide. The least squares estimates of the slope β₁ and
intercept β₀, highlighted on the printout, are

β̂₁ = 4.919331
β̂₀ = 10.277929
Example
Step 4:
First, test the null hypothesis that the slope β₁ is 0
(that is, that there is no linear relationship between fire
damage and the distance from the nearest fire station)
against the alternative hypothesis that fire damage
increases as the distance increases. We test

H₀: β₁ = 0
Hₐ: β₁ > 0

The two-tailed observed significance level for this test is
approximately 0, so the one-tailed p-value is also
approximately 0 and H₀ is rejected.
Example
The 95% confidence interval for β₁ yields (4.070, 5.768).
How? β̂₁ ± t.025 · s_β̂₁, with df = n - 2.
So?
We estimate (with 95% confidence) that the interval
from $4,070 to $5,768 encloses the mean increase (β₁)
in fire damage per additional mile of distance from the fire
station.
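The interval follows from β̂₁ ± t.025 · s_β̂₁. A sketch of the computation, demonstrated with the earlier sales-advertising numbers (β̂₁ = .70, s_β̂₁ = .1914, df = 3, for which t.025 = 3.182), since the fire-damage printout's standard error is not shown on these slides; the fire-damage interval (4.070, 5.768) is produced the same way from its own printout:

```python
# 95% confidence interval for the slope: b1 +/- t_crit * s_b1,
# where t_crit is the upper-.025 t value with df = n - 2.
def slope_ci(b1, s_b1, t_crit):
    half_width = t_crit * s_b1
    return (b1 - half_width, b1 + half_width)

# Sales-advertising example: b1 = .70, s_b1 = .1914, df = 3 -> t = 3.182.
lo, hi = slope_ci(0.70, 0.1914, 3.182)
print(round(lo, 3), round(hi, 3))  # 0.091 1.309
```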
So?
The coefficient of determination is r² = .9235, which
implies that about 92% of the sample variation in fire
damage (y) is explained by the distance (x) between the
fire and the fire station.
Example
r = √r² = √.9235 ≈ .96