
Multiple Linear Regression

Foundation for Predictive Modeling


Day 6
Session 1

1
Predictive Modeling
Process by which a statistical model is created to best predict the outcome or probability of an
outcome.

Predictive models are developed using historical data or from purposely collected data.

Predictive analytics is used in financial services, insurance, telecommunications, retail, travel,


healthcare, pharmaceuticals, sports and other fields.

2
Predictive Modeling
General Approach

Understand Data → Set Business Goal → Apply Business Expertise →
Develop the Statistical Model → Validate the Model → Make Decisions

3
Multiple Linear Regression
Multiple linear regression is an extension of simple linear regression. It is used when we
want to predict the value of a variable based on the values of two or more other
variables.

The variable we want to predict is called the dependent variable (or sometimes, the
response variable).

The variables used to predict the value of the dependent variable are called the
independent variables (or sometimes, the predictor, explanatory or regressor variables).

4
Statistical Model in Multiple Linear Regression

Y = b0 + b1x1 + b2x2 + … + bpxp + e

Where,

Y : Dependent Variable
x1, x2 ,…, xp : Independent Variables
b0, b1 ,…, bp : Parameters of Model
e : Random Error Component
Parameters of the model are estimated by the Least Squares Method.
5
What is the least squares method?

The least squares (LS) criterion states that the sum of the squares of the errors
(or residuals) is minimum.

Mathematically, the following quantity is minimized to estimate the parameters:

Error SS = Σ (Yi − Ŷi)²
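As an illustration of the criterion (using a tiny made-up dataset, not the case-study data), the coefficients returned by R's lm() minimize this error sum of squares; perturbing them can only increase it:

```r
# Tiny synthetic dataset (made-up numbers, for illustration only)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

fit <- lm(y ~ x)                 # least squares fit
ss_fit <- sum(residuals(fit)^2)  # error SS at the LS estimates

# Perturb the slope slightly: the error SS can only go up
b <- coef(fit)
ss_other <- sum((y - (b[1] + (b[2] + 0.1) * x))^2)
ss_fit < ss_other                # TRUE
```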

6
Case Study
Predicting Job Performance Index

[Diagram: pre-recruitment test scores — Aptitude, Test of Language, Technical and
General Information — feed into the Performance Index]

7
Get Started With Scatter Plot Matrix

per_index<-read.csv(file.choose(),header=T)
pairs(~jpi+aptitude+tol+technical+general,data=per_index,
col="blue")

8
Performance Index: Mathematical
Model

Objective : To model performance index after probationary period.

Data: Scores on various tests conducted before recruitment.

Sample Size: 33

Model:

jpi =b0 + b1(aptitude) + b2(tol) + b3(technical)+b4 (general) + e

Parameters of the model are estimated by the Least Squares Method.


9
Snapshot of the Data

[Screenshot of the data: the dependent variable (jpi) and 4 independent variables]
10
Least Squares Estimates of Parameters

            Coefficients
Intercept     -54.2822
aptitude        0.3236
tol             0.0334
technical       1.0955
general         0.5368

Model Equation:
jpi = -54.2822 + 0.3236(aptitude) + 0.0334(tol) + 1.0955(technical) + 0.5368(general)
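With these estimates, the fitted equation can score a new candidate. The test scores below are made-up values, purely for illustration:

```r
# Estimated coefficients from the table above
b <- c(intercept = -54.2822, aptitude = 0.3236, tol = 0.0334,
       technical = 1.0955, general = 0.5368)

# Hypothetical candidate scores (made-up values)
scores <- c(aptitude = 60, tol = 50, technical = 70, general = 55)

jpi_hat <- unname(b["intercept"] + sum(b[names(scores)] * scores))
jpi_hat   # predicted job performance index, about 73.01
```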

11
Global Testing: Testing whether at least one
variable has a significant impact

The aim of global testing is to test the null hypothesis that all the model
parameters are simultaneously equal to zero.

Hypotheses:

H0: b1 = b2 = … = bp = 0 v/s H1: At least one coefficient is not zero

In other words:

H0: None of the independent variables has a significant impact on the dependent
variable

12
Global Testing: Partitioning Total Variation

Total Variation:       Σi (Yi − Ȳ)²   splits into

Explained Variation:   Σi (Ŷi − Ȳ)²
Unexplained Variation: Σi (Yi − Ŷi)²

13
Global Testing: ANOVA and Decision Criterion

Source               DF          SS         MSS        F Value   Pr > F
Model (Explained)    p = 4       2510.007   627.5017   49.8129   <0.0001
Error (Unexplained)  n-p-1 = 28  352.7208   12.5972
Total                n-1 = 32    2862.728

Reject the null hypothesis since P value < 0.05.

At least one variable has a significant impact on the performance index.
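The F value and its P value can be recomputed directly from the SS and DF columns of the table:

```r
ss_model <- 2510.007;  df_model <- 4     # explained
ss_error <- 352.7208;  df_error <- 28    # unexplained

ms_model <- ss_model / df_model          # 627.5017
ms_error <- ss_error / df_error          # 12.5972
f_value  <- ms_model / ms_error          # about 49.81

# Upper-tail probability of the F(4, 28) distribution
p_value <- pf(f_value, df_model, df_error, lower.tail = FALSE)
p_value < 0.0001                         # TRUE, so reject H0
```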
14
Individual Testing using t Test
Hypotheses

H0: bi = 0 v/s H1: bi ≠ 0 ; i = 1, 2, …, p

Test Statistic

t = (Estimate of bi) / (Standard Error of estimated bi)

Under H0, t follows a t distribution with (n − p − 1) d.f.

Reject the null hypothesis if P value < 0.05 and conclude that the
variable has a significant effect on Y.
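For example, the t statistic and P value for aptitude follow from its estimate and standard error in the coefficients table:

```r
estimate <- 0.3236   # aptitude coefficient
std_err  <- 0.0678   # its standard error

t_stat <- estimate / std_err            # about 4.77
# two-sided P value with n - p - 1 = 28 d.f.
p_val <- 2 * pt(abs(t_stat), df = 28, lower.tail = FALSE)
p_val < 0.05                            # TRUE: aptitude is significant
```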

15
Individual Testing: Case Study

Coefficients Standard Error t Stat P-value


Intercept -54.2822 7.3945 -7.3409 0.0000
aptitude 0.3236 0.0678 4.7737 0.0001
tol 0.0334 0.0712 0.4684 0.6431
technical 1.0955 0.1814 6.0395 0.0000
general 0.5368 0.1584 3.3890 0.0021

P values for aptitude, technical and general are less than 0.05.
The P value for test of language (tol) is more than 0.05. Therefore, tol is
the only insignificant variable.

16
Interpretation of Partial Regression Coefficients

For every unit increase in an independent variable, the dependent
variable (Y) will change by the corresponding parameter estimate,
keeping all the other variables constant.

From the parameter estimates table, we observe that the parameter
estimate for the aptitude test is 0.3236.

We can infer that for a one unit increase in the aptitude test score, the job
performance index will increase by 0.3236 units.
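This interpretation can be checked numerically: holding the other scores fixed (at made-up values), raising aptitude by one point moves the prediction by exactly the aptitude coefficient:

```r
b <- c(intercept = -54.2822, aptitude = 0.3236, tol = 0.0334,
       technical = 1.0955, general = 0.5368)

predict_jpi <- function(aptitude, tol, technical, general) {
  unname(b["intercept"] + b["aptitude"] * aptitude + b["tol"] * tol +
           b["technical"] * technical + b["general"] * general)
}

# One extra aptitude point, all other scores held constant (made-up values)
delta <- predict_jpi(61, 50, 70, 55) - predict_jpi(60, 50, 70, 55)
delta   # 0.3236, the aptitude coefficient
```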

17
Measure of Goodness of Fit: R Squared

R² is the proportion of variation in the dependent variable which is explained by
the independent variables. Note that R² always increases when a variable is added
to the model.

R² = Explained Variation / Total Variation = Σi (Ŷi − Ȳ)² / Σi (Yi − Ȳ)²

The adjusted R-squared is a modified version of R-squared that has been adjusted
for the number of predictors in the model:

R²a = 1 − (1 − R²) (n − 1) / (n − p − 1)

18
Multiple Linear Regression in R

per_index<-read.csv(file.choose(),header=T)
jpimodel<-lm(jpi~aptitude+tol+technical+general,data=per_index)
jpimodel
_______________________________________________________________
#default output displayed by typing model object name
Call:
lm(formula = jpi ~ aptitude + tol + technical + general, data = per_index)

Coefficients:
(Intercept) aptitude tol technical general
-54.28225 0.32356 0.03337 1.09547 0.53683
19
Multiple Linear Regression in R (continued)

#detailed output displayed using 'summary' function (slides 20 and 21)

summary(jpimodel)

Call:
lm(formula = jpi ~ aptitude + tol + technical + general, data = per_index)

Residuals:
Min 1Q Median 3Q Max
-7.2891 -2.7692 0.4562 2.8508 5.6068

20
Multiple Linear Regression in R (continued)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -54.28225 7.39453 -7.341 5.41e-08 ***
aptitude 0.32356 0.06778 4.774 5.15e-05 ***
tol 0.03337 0.07124 0.468 0.6431
technical 1.09547 0.18138 6.039 1.65e-06 ***
general 0.53683 0.15840 3.389 0.0021 **

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.549 on 28 degrees of freedom
Multiple R-squared: 0.8768, Adjusted R-squared: 0.8592
F-statistic: 49.81 on 4 and 28 DF, p-value: 2.467e-12
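The individual pieces of this summary can also be pulled out programmatically. Since per_index is read from a local file, the sketch below uses R's built-in mtcars data as a stand-in:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # stand-in model on built-in data
s <- summary(fit)

s$r.squared       # multiple R-squared
s$adj.r.squared   # adjusted R-squared
s$fstatistic      # F value plus its two degrees of freedom
s$coefficients    # estimates, std. errors, t values, P values
```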

21
Summary of Findings

Performance in the aptitude, technical and general tests during the recruitment
phase has a significant influence on job performance.

Test of language is the only insignificant variable.

R squared of the model is 0.88.

88% of the variation in job performance index is explained by the model and
12% is unexplained variation.

22
Regression Model in Matrix Form

Y(n×1) = X(n×(p+1)) β((p+1)×1) + e(n×1)

where

Y = [y1, y2, …, yn]'                                  (vector of observations)

X = matrix with rows [1, xi1, xi2, …, xip], i = 1…n   (design matrix)

β = [β0, β1, …, βp]'                                  (parameter vector)

e = [e1, e2, …, en]'                                  (error vector)

23
Least Square Estimator and its Variance

e Y  X
Z e e  ei2 (Y  X ) (Y  X )
Z
0

ˆ X X  1 X Y

V ( ˆ ) V X X  X Y
1

V (  ) X X  X  V (Y ) X ( X X )  1
ˆ 1

V (  ) X X   2
ˆ 1

24
THANK YOU!!

25