Chapter: Simple Linear Regression
Probabilistic Models
Fitting the Model: The Least Squares Approach
Model Assumptions
Assessing the Utility of the Model: Making Inferences
about the Slope β₁
The Coefficients of Correlation and Determination
Using the Model for Estimation and Prediction
A Complete Example
Did You Know that Degree, Color and Race
Make a Difference in Home Refinancing?
A study found that purchasers without a college degree
paid $1,472 more in broker fees than those with a college
degree.
Not only did a degree matter; race was also a factor:
African Americans on average paid $500 more than whites,
and Hispanics $275 more than whites.
Regression analysis was used to determine whether various
borrower characteristics had a bearing on the amount of
broker fees and closing costs paid.
Did You Know that the Presence of an
NFL Team Boosts Rental Costs?
Regression analysis has revealed that in cities with an
NFL team, rental costs for apartments in the central-city
area were 8 percent higher than in cities without an NFL
team.
Property tax receipts were also found to be higher in
cities with NFL teams.
Regression Analysis
Regression analysis examines associative relationships between a
metric dependent variable and one or more independent
variables in the following ways:
• Determine whether the independent variables explain a
significant variation in the dependent variable: whether a
relationship exists.
• Determine how much of the variation in the dependent variable
can be explained by the independent variables: strength of the
relationship.
• Determine the structure or form of the relationship: the
mathematical equation relating the independent and dependent
variables.
• Predict the values of the dependent variable.
• Control for other independent variables when evaluating the
contributions of a specific variable or set of variables.
Regression Analysis
Note that regression analysis is concerned with the nature
and degree of association between variables; it does not
imply or assume any causality.
A First-Order (Straight Line) Probabilistic Model
y = β₀ + β₁x + ε
where
y = dependent or response variable
(variable to be modeled)
x = independent or predictor variable
(variable used as a predictor of y)
ε = random error component
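The probabilistic model above can be illustrated with a small simulation; the parameter values β₀ = 1.0 and β₁ = 0.5 below are hypothetical, chosen only to show the deterministic straight-line component plus a random error:

```python
import random

# A sketch of the first-order model y = b0 + b1*x + e.
# b0 = 1.0 and b1 = 0.5 are hypothetical illustration values.
random.seed(0)

b0, b1 = 1.0, 0.5
xs = list(range(1, 11))

# Each y is the deterministic line plus a normal(0, 1) random error.
ys = [b0 + b1 * x + random.gauss(0, 1) for x in xs]

for x, y in zip(xs, ys):
    print(x, round(y, 2))
```

Because of the error component ε, repeated samples at the same x give different y values scattered around the line E(y) = β₀ + β₁x.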
y = β₀ + β₁x + ε

Example data (attitude toward the city vs. duration of residence):

Respondent | Attitude (y) | Duration of Residence (x) | (unlabeled)
2 | 9 | 12 | 11
3 | 8 | 12 | 4
4 | 3 | 4 | 1
5 | 10 | 12 | 11
6 | 4 | 6 | 1
7 | 5 | 8 | 7
8 | 2 | 2 | 4
9 | 11 | 18 | 8
10 | 9 | 9 | 10
11 | 10 | 17 | 8
12 | 2 | 2 | 5
Scatter Diagram

[Figure: scatterplot of Attitude (y-axis, 0 to 12) vs. Duration of Residence (x-axis, 0 to 20)]
Which Straight Line Is Best on this Scatter?

[Figure: the same scatterplot with four candidate lines, Line 1 through Line 4]
Fitting the Model:
The Least Squares Approach
[Figure: scatterplot of y (0 to 60) vs. x (0 to 60)]
Which Line Fits Best?

[Figure: the same scatterplot with several candidate lines]
Least Squares Line

The least squares (LS) line minimizes the sum of squared errors:

SSE = Σᵢ₌₁ⁿ ε̂ᵢ²,  here with n = 4:  SSE = ε̂₁² + ε̂₂² + ε̂₃² + ε̂₄²

[Figure: four data points with their vertical deviations ε̂₁ through ε̂₄ from the fitted line]
Formula for the Least Squares Estimates

Slope: β̂₁ = SSxy / SSxx
y-intercept: β̂₀ = ȳ - β̂₁x̄

where
SSxy = Σ(xᵢ - x̄)(yᵢ - ȳ)
SSxx = Σ(xᵢ - x̄)²
n = sample size
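As a sketch, the formulas above can be applied directly to the household income vs. consumption expenditure data used later in this chapter:

```python
# Least squares estimates via SSxy/SSxx, using the chapter's
# income (x, $1,000s) vs. consumption expenditure (y, $1,000s) data.
x = [1, 5, 9, 13, 17]
y = [7, 6, 9, 8, 10]

n = len(x)
x_bar = sum(x) / n  # 9.0
y_bar = sum(y) / n  # 8.0

ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # 32
ss_xx = sum((xi - x_bar) ** 2 for xi in x)                        # 160

b1 = ss_xy / ss_xx       # slope estimate
b0 = y_bar - b1 * x_bar  # intercept estimate

print(round(b0, 4), round(b1, 4))  # 6.2 0.2
```

This reproduces the fitted line ŷ = 6.2 + .2x derived later in the chapter.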
Interpreting the Estimates of β₀ and β₁ in Simple
Linear Regression

[Figure: scatterplot of Sales (0 to 4) vs. Advertising (0 to 5)]
Parameter Estimation Solution
ŷ = -.1 + .7x
Coefficient Interpretation Solution

1. Slope (β̂₁ = .7)
• Sales volume (y) is expected to increase by $700
for each $100 increase in advertising (x), over the
sampled range of advertising expenditures from
$100 to $500.

2. y-Intercept (β̂₀ = -.1)
• Since x = 0 is outside the range of the sampled
values of x, the y-intercept has no meaningful
interpretation.
Regression Line Fitted to the Data

[Figure: the Sales vs. Advertising scatterplot with the fitted line ŷ = -.1 + .7x]
Example: Income vs. Consumption Expenditure

Income (x) | Consumption Expenditure (y)
1 | 7
5 | 6
9 | 9
13 | 8
17 | 10

(both variables in $1,000s)
Questions

[Figure: scatterplot of Consumption Expenditure ($1,000s, 5 to 11) vs. Household Income ($1,000s, 0 to 20)]
Solution

x | y | xᵢ - x̄ | (xᵢ - x̄)² | yᵢ - ȳ | (yᵢ - ȳ)² | (xᵢ - x̄)(yᵢ - ȳ)
1 | 7 | -8 | 64 | -1 | 1 | 8
5 | 6 | -4 | 16 | -2 | 4 | 8
9 | 9 | 0 | 0 | 1 | 1 | 0
13 | 8 | 4 | 16 | 0 | 0 | 0
17 | 10 | 8 | 64 | 2 | 4 | 16

With x̄ = 9 and ȳ = 8, the column sums give SSxx = 160 and SSxy = 32, so

β̂₁ = SSxy / SSxx = 32/160 = .2
β̂₀ = ȳ - β̂₁x̄ = 8 - (.2)(9) = 6.2
[Figure: the Consumption Expenditure vs. Household Income scatterplot with the fitted line ŷ = 6.2 + .2x]
Consumption Expenditure Prediction When x = $6,000

ŷ = 6.2 + .2(6) = 7.4, a predicted consumption expenditure of $7,400.

[Figure: the fitted line with the prediction ŷ = 7.4 marked at x = 6]
Consumption Expenditure Prediction When x = $25,000

ŷ = 6.2 + .2(25) = 11.2, a predicted consumption expenditure of $11,200. Note that x = 25 lies beyond the sampled incomes (x = 1 to 17), so this prediction is an extrapolation and should be interpreted with caution.

[Figure: the fitted line extended to x = 25, with ŷ = 11.2 marked]
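Both predictions follow by substituting x into the fitted line; a minimal sketch:

```python
# Predictions from the fitted line y-hat = 6.2 + 0.2x
# (income x and predicted expenditure both in $1,000s).
def predict(x):
    return 6.2 + 0.2 * x

print(round(predict(6), 1))   # 7.4  -> $7,400 at an income of $6,000
print(round(predict(25), 1))  # 11.2 -> $11,200, extrapolating beyond x = 17
```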
C. Compute the Residuals

[Figure: the fitted line ŷ = 6.2 + .2x with the vertical residuals marked]

Income Residual Plot
[Figure: residuals (-2 to 2) plotted against income (0 to 20)]
For each observation, compute the residual yᵢ - ŷᵢ and its square (residual)². Note that
• Σ(residuals) = 0
• SSE = Σ(residuals)² = 3.6
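These facts can be checked directly; a sketch using the fitted line ŷ = 6.2 + .2x and the income/consumption data:

```python
# Residuals for the income/consumption data under y-hat = 6.2 + 0.2x;
# the residuals sum to 0 and SSE = 3.6, as stated above.
x = [1, 5, 9, 13, 17]
y = [7, 6, 9, 8, 10]

residuals = [yi - (6.2 + 0.2 * xi) for xi, yi in zip(x, y)]
sse = sum(r ** 2 for r in residuals)

print([round(r, 1) for r in residuals])  # [0.6, -1.2, 1.0, -0.8, 0.4]
print(round(abs(sum(residuals)), 10))    # 0.0
print(round(sse, 10))                    # 3.6
```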
H₀: β₁ = 0 against Hₐ: β₁ ≠ 0

If the data support the alternative hypothesis, then x does
contribute information for the prediction of y using the
straight-line model.
A Test of Model Utility: Simple Linear Regression

H₀: β₁ = 0
Hₐ: β₁ ≠ 0

Test statistic: t = β̂₁ / s_β̂₁ = β̂₁ / (s / √SSxx)

Rejection region (α = .05, df = n - 2 = 5 - 2 = 3): reject H₀ if |t| > 3.182, with .025 in each tail.

[Figure: t distribution with rejection regions beyond ±3.182]
Test Statistic Solution

s_β̂₁ = s / √SSxx = .6055 / √(55 - 15²/5) = .6055 / √10 = .1914

t = β̂₁ / s_β̂₁ = .70 / .1914 = 3.657

Since 3.657 > 3.182, reject H₀: the model is useful.
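A sketch reproducing this computation from the summary numbers given on the slide (s = .6055, n = 5, Σx = 15, Σx² = 55, β̂₁ = .70); the slide's 3.657 comes from rounding s_β̂₁ to .1914 first:

```python
import math

# t statistic for H0: beta1 = 0, from the slide's summary numbers.
s = 0.6055        # standard error of the regression
b1 = 0.70         # slope estimate
n = 5
sum_x, sum_x2 = 15, 55

ss_xx = sum_x2 - sum_x ** 2 / n  # 55 - 15**2/5 = 10
s_b1 = s / math.sqrt(ss_xx)      # estimated standard error of b1
t = b1 / s_b1

print(round(s_b1, 4), round(t, 3))  # 0.1915 3.656
```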
Test of Slope Coefficient Solution

Regression Statistics
Multiple R: 0.903696114
R Square: 0.816666667
Adjusted R Square: 0.755555556
Standard Error: 0.605530071
Observations: 5

ANOVA
Source | df | SS | MS | F | Significance F
Regression | 1 | 4.9 | 4.9 | 13.36364 | 0.035352847
Residual | 3 | 1.1 | 0.366667 | |
Total | 4 | 6 | | |

The printout's coefficient table reports β̂₁, its standard error s_β̂₁, the t statistic t = β̂₁ / s_β̂₁, and its p-value.
The Coefficients of Correlation and
Determination
The Coefficients of Correlation (r)
Correlation Models
On the Excel printout, Multiple R (0.903696114) is the coefficient of correlation r.
[Figure: deviations of the observed y values from the fitted line at X₁ through X₅, illustrating the residual variation SSE]
Decomposition of the Total Variation

r² = (coefficient of correlation)²

-1 ≤ r ≤ 1 and 0 ≤ r² ≤ 1
Example - Obtaining the Value of r² for the
Sales Revenue Model

r² = (coefficient of correlation)² = (.904)² = .817
On the printout, R Square (0.816666667) is the coefficient of determination r².
Equivalently, using the ANOVA sums of squares, r² = (SSyy - SSE)/SSyy = (6 - 1.1)/6 = .8167.
The ANOVA table also gives the F statistic, F = MSregression / MSresidual = 4.9 / .366667 = 13.364, and its p-value (Significance F) = .0353.
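The r² and F values on the printout can be recomputed from the ANOVA sums of squares alone; a sketch:

```python
# Recomputing r^2 and F from the ANOVA table on the printout:
# SS_regression = 4.9, SS_residual (SSE) = 1.1, SS_total (SSyy) = 6,
# with df_regression = 1 and df_residual = 3.
ss_reg, sse, ss_yy = 4.9, 1.1, 6.0
df_reg, df_res = 1, 3

r2 = (ss_yy - sse) / ss_yy   # equivalently ss_reg / ss_yy
ms_reg = ss_reg / df_reg
ms_res = sse / df_res
f = ms_reg / ms_res

print(round(r2, 4), round(f, 3))  # 0.8167 13.364
```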
Step 1:

Step 2:
Use a statistical software package to estimate the
unknown parameters in the deterministic component of
the hypothesized model. The Excel printout for the
simple linear regression analysis is shown on the next
slide. The least squares estimates of the slope β₁ and
intercept β₀, highlighted on the printout, are

β̂₁ = 4.919331
β̂₀ = 10.277929
Example
Step 4:
First, test the null hypothesis that the slope β₁ is 0
(that is, that there is no linear relationship between fire
damage and the distance from the nearest fire station)
against the alternative hypothesis that fire damage
increases as the distance increases. We test

H₀: β₁ = 0
Hₐ: β₁ > 0

The two-tailed observed significance level for this test is
approximately 0, so the one-tailed p-value is also
approximately 0 and H₀ is rejected.
Example
The 95% confidence interval for β₁ yields (4.070, 5.768).
How? β̂₁ ± t.025 · s_β̂₁, with df = n - 2.
So?
We estimate (with 95% confidence) that the interval
from $4,070 to $5,768 encloses the mean increase (β₁)
in fire damage per additional mile of distance from the fire
station.
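The interval follows from β̂₁ ± t.025 · s_β̂₁. A sketch of the computation, demonstrated with the earlier sales-advertising numbers (β̂₁ = .70, s_β̂₁ = .1914, df = 3, for which t.025 = 3.182), since the fire-damage printout's standard error is not shown on these slides; the fire-damage interval (4.070, 5.768) is produced the same way from its own printout:

```python
# 95% confidence interval for the slope: b1 +/- t_crit * s_b1,
# where t_crit is the upper-.025 t value with df = n - 2.
def slope_ci(b1, s_b1, t_crit):
    half_width = t_crit * s_b1
    return (b1 - half_width, b1 + half_width)

# Sales-advertising example: b1 = .70, s_b1 = .1914, df = 3 -> t = 3.182.
lo, hi = slope_ci(0.70, 0.1914, 3.182)
print(round(lo, 3), round(hi, 3))  # 0.091 1.309
```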
So?
The coefficient of determination is r² = .9235, which
implies that about 92% of the sample variation in fire
damage (y) is explained by the distance (x) between the
fire and the fire station.
Example
r = √r² = √.9235 ≈ .96