STAT3301 - Term Exam 2 - CH11 Study Package


STAT3301 – Term Exam 2 Review Package

Chapter 11 – Regression with a Binary Dependent Variable


1. Identify and briefly describe the three approaches to regression with a binary dependent variable
discussed in Chapter 11 of your textbook.
Linear Probability Model (LPM):
The Linear Probability Model (LPM) is a simple way to model binary outcomes (such as yes/no or win/lose)
using ordinary multiple regression. The dependent variable is a 0/1 indicator, and the predicted value is
interpreted as the probability that the event occurs, which changes linearly with the independent
variables. Its main limitation is that it can predict probabilities outside the 0 to 1 range.
Probit Regression Model:
The Probit Regression Model is used to model binary outcome variables by estimating the probability
that an event occurs as a function of independent variables. Unlike LPM, Probit uses the standard normal
cumulative distribution function to ensure that all predicted probabilities lie between 0 and 1.
Logit Regression Model:
The Logit Regression Model, like Probit, is used for binary outcome variables. It models the log odds
that an event occurs as a linear combination of the independent variables, and uses the logistic
cumulative distribution function to ensure predicted probabilities lie between 0 and 1. Logit is widely
used because its coefficients are easy to interpret: each one represents the change in the log odds for
a one-unit change in the predictor variable.
2. What are the advantages and disadvantages of the linear probability model (LPM)?
Advantages:

• Simple to estimate and easy to interpret.
• Inference is the same as for multiple regression (just remember to use heteroskedasticity-robust
standard errors).
Disadvantages:

• An LPM says that the change in the predicted probability for a given change in X is the same for all
values of X (i.e. the effect is linear), which is implausible for a probability.
o Using the HMDA example from your textbook, a change in the P/I ratio from 0.3 to 0.4 might
have a large effect on the probability of denial.
o However, once the P/I ratio is so large that the loan is very likely to be denied, increasing the
P/I ratio further will have little effect.
• LPM predicted probabilities can be < 0 or > 1, which is nonsensical!
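
The two disadvantages can be illustrated with a small sketch. The coefficients below are hypothetical values chosen only for illustration (they are not the textbook's HMDA estimates), and the helper names are made up for this example. The LPM gives the same change in predicted probability for every 0.1 increase in the P/I ratio and can leave the 0 to 1 range, while the probit effect shrinks once denial is already very likely.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal c.d.f. Phi(z), computed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical coefficients for Pr(deny = 1 | P/I ratio); illustration only,
# not the textbook's HMDA estimates.
LPM_B0, LPM_B1 = -0.10, 0.60
PROBIT_B0, PROBIT_B1 = -2.2, 3.0


def lpm_prob(pi_ratio: float) -> float:
    """LPM prediction: linear in the P/I ratio, so it is not bounded."""
    return LPM_B0 + LPM_B1 * pi_ratio


def probit_prob(pi_ratio: float) -> float:
    """Probit prediction: the c.d.f. keeps it between 0 and 1."""
    return normal_cdf(PROBIT_B0 + PROBIT_B1 * pi_ratio)


def effect_of_increase(model, start: float, step: float = 0.1) -> float:
    """Change in predicted probability when the P/I ratio rises by `step`."""
    return model(start + step) - model(start)


for start in (0.3, 2.0):
    print(
        f"P/I {start}: LPM effect = {effect_of_increase(lpm_prob, start):.4f}, "
        f"probit effect = {effect_of_increase(probit_prob, start):.4f}"
    )
```

With these made-up numbers, the LPM prediction at a P/I ratio of 2.0 exceeds 1, which is exactly the nonsensical case described in the last bullet.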

3. Which of the following regression models for a binary dependent variable uses the standard normal
cumulative distribution function to ensure that all predicted probabilities lie between 0 and 1?
a. Linear Probability Model
b. Logit Regression Model
c. Probit Regression Model
d. Two Stage Least Squares (TSLS)
Solution = c. Probit Regression Model
4. Describe the fraction correctly predicted measure used to assess the fit for the linear probability
model, probit, and/or logit regression models.
The fraction correctly predicted is the fraction of observations for which the predicted probability is
> 50% when $Y_i = 1$, or < 50% when $Y_i = 0$. An advantage of this measure of fit is that it is easy to
understand. However, a (major) disadvantage is that it does not reflect the quality of the prediction:
if $Y_i = 1$, the observation is treated as correctly predicted whether the predicted probability is
51% or 90%.
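
As a minimal sketch of the definition above, the function below counts an observation as correct when the predicted probability is above 50% and the outcome is 1, or below 50% and the outcome is 0. The function name and the sample data are illustrative assumptions, not values from the textbook.

```python
def fraction_correctly_predicted(y, p_hat, threshold=0.5):
    """Share of observations whose predicted probability falls on the
    correct side of the threshold (above it when y = 1, below it when y = 0)."""
    correct = 0
    for yi, pi in zip(y, p_hat):
        if (yi == 1 and pi > threshold) or (yi == 0 and pi < threshold):
            correct += 1
    return correct / len(y)


# Made-up outcomes and predicted probabilities, for illustration only.
y = [1, 1, 0, 0, 1]
p_hat = [0.51, 0.90, 0.20, 0.60, 0.40]

print(fraction_correctly_predicted(y, p_hat))
```

Note that the first two observations (51% and 90%) both count as correct, which is the weakness of the measure: it ignores how confident the prediction was.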

5. In your textbook you learned that the $R^2$ and $\bar{R}^2$ measures of fit do not make sense with LPM, probit,
and/or logit models. What two other specialized measures are used in their place for these types of
models?

Fraction correctly predicted and pseudo-$R^2$.
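
The pseudo-R2 compares the maximized log-likelihood of the fitted model with that of a model with no regressors, where every observation gets the sample mean of Y as its predicted probability. The sketch below assumes this common definition, pseudo-R2 = 1 - ln(L_model) / ln(L_null); the function names and the fitted probabilities are hypothetical and only illustrate the arithmetic.

```python
import math


def bernoulli_log_likelihood(y, p):
    """Log-likelihood of 0/1 outcomes y given predicted probabilities p."""
    return sum(
        math.log(pi) if yi == 1 else math.log(1.0 - pi)
        for yi, pi in zip(y, p)
    )


def pseudo_r2(y, p_hat):
    """1 - ln(L_model) / ln(L_null), where the null model predicts mean(y)
    for every observation. Requires 0 < mean(y) < 1."""
    p_bar = sum(y) / len(y)
    ll_model = bernoulli_log_likelihood(y, p_hat)
    ll_null = bernoulli_log_likelihood(y, [p_bar] * len(y))
    return 1.0 - ll_model / ll_null


# Made-up outcomes and fitted probabilities, for illustration only.
y = [1, 1, 0, 0, 1]
p_hat = [0.9, 0.8, 0.2, 0.1, 0.7]

print(pseudo_r2(y, p_hat))
```

A model that predicts no better than the sample mean gets a pseudo-R2 of 0, and the value moves toward 1 as the fitted probabilities get closer to the observed outcomes.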

6. Draw a graph of the standard normal probability density function and its associated cumulative
distribution function Φ(z). Which regression model uses this distribution?

Probit regression uses the standard normal c.d.f.


7. What functions are used in the probit and logit regression models to force the predicted values to
lie between 0 and 1?
Probit regression uses the standard normal c.d.f.
Logit regression (aka logistic regression) uses the logistic c.d.f.
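
A short sketch of the two functions, using only the Python standard library. The function names are chosen for this example. Both map any value of the linear index z into the interval between 0 and 1 and both equal 0.5 at z = 0.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal c.d.f. Phi(z), used by probit."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def logistic_cdf(z: float) -> float:
    """Logistic c.d.f. F(z) = 1 / (1 + e^(-z)), used by logit."""
    return 1.0 / (1.0 + math.exp(-z))


for z in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"z = {z:+.1f}: probit = {normal_cdf(z):.4f}, logit = {logistic_cdf(z):.4f}")
```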
8. The Excel file Binary_Loan.xls contains data for the following variables:
• appinc: applicant income, $1,000s
• married: =1 if married
• dep: number of dependents
• emp: years employed in line of work
• yjob: years at this job
• self: =1 if self employed
• pubrec: =1 if filed bankruptcy
• hrat: housing expenses % total income
• obrat: other obligations % total income
• school: =1 if > 12 years schooling
• black: =1 if Black
• hispan: =1 if Hispanic
• male: =1 if male
• approve: =1 if action ==1 or 2
• mortperf: no late mortgage payments
• mortlat1: one or two late payments
• mortlat2: >2 late payments
• loanprc: amount/price
• white: =1 if White

a. Estimate the effect of White on approve, using the Probit model.


b. Find the estimated probability of loan approval for both Whites and non-Whites.
c. Estimate the same relationship with the linear probability model and compare your results.
d. Re-estimate the model adding more determinants of loan-approval, such as: hrat, mortperf,
mortlat1, mortlat2, married, dep, school and emp. What happens to the estimation of the
White coefficient obtained in (a)? Is discrimination against non-Whites evident?
a. The results with the Probit method are as follows:

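This yields the following estimated regression model:

\[ \widehat{\Pr}(approve = 1 \mid white) = \Phi(\hat{\beta}_{cons} + \hat{\beta}_{white} \cdot white) = \Phi(0.5764 + 0.7417 \cdot white) \]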

First, let’s determine if the coefficient associated with the white variable is statistically
significant (at the 5% level of significance). This is determined by the following hypothesis
test:
$H_0: \beta_{white} = 0$
$H_1: \beta_{white} \neq 0$
Looking at the regression output table, we see that the p-value associated with this hypothesis
test is 0.000 which is less than the level of significance of 0.05 or 5%. Therefore, we REJECT
the null hypothesis and conclude that the coefficient associated with the white variable is
statistically significant at a 5% level of significance.

Since the coefficient on the white variable is positive, we can also conclude that being white
seems to have a positive effect in getting your loan approved.

However, probit coefficients cannot be interpreted directly as marginal effects, as OLS coefficients
can. So, a further step is needed in order to interpret the results appropriately.

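b. To determine the effect of white on approve, we evaluate the cumulative distribution function of
the standard normal distribution at the estimated index for a white applicant (white = 1):

\[ \widehat{\Pr}(approve = 1 \mid white = 1) = \Phi(0.5764 + 0.7417 \cdot 1) = \Phi(1.3181) \]

Looking up the probability for a z-value of 1.32 using Table 1 of the Appendix yields:
\[ \widehat{\Pr}(approve = 1 \mid white = 1) \approx 0.9066 \]

To determine the estimated probability of approval for a non-white applicant (white = 0), we evaluate
the same cumulative distribution function at the corresponding index:

\[ \widehat{\Pr}(approve = 1 \mid white = 0) = \Phi(0.5764 + 0.7417 \cdot 0) = \Phi(0.5764) \]

Looking up the probability for a z-value of 0.58 using Table 1 of the Appendix yields:
\[ \widehat{\Pr}(approve = 1 \mid white = 0) \approx 0.7190 \]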
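
As a check on the table lookup, the same probabilities can be computed without rounding the z-values to two decimals. This is a minimal sketch using only the Python standard library; the index values 1.3181 and 0.5764 are the ones from the solution above, while the function and variable names are chosen for this example.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal c.d.f. Phi(z), computed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Probit index values taken from the worked solution.
INDEX_WHITE = 1.3181     # estimated index when white = 1
INDEX_NONWHITE = 0.5764  # estimated index when white = 0

p_white = normal_cdf(INDEX_WHITE)
p_nonwhite = normal_cdf(INDEX_NONWHITE)

print(f"Pr(approve = 1 | white = 1) = {p_white:.4f}")
print(f"Pr(approve = 1 | white = 0) = {p_nonwhite:.4f}")
```

The results should come out at about 0.906 and 0.718, slightly different from the table values because the table lookup rounds z to 1.32 and 0.58.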

c. The results of the Linear Probability Model are given below:


This yields the following estimated regression model:

\[ \hat{Y}_i = \hat{\beta}_{cons} + \hat{\beta}_{white} \cdot white_i = 0.7178 + 0.1884 \cdot white_i \]
We see that for a non-white applicant the predicted probability of approval equals the constant
(0.7178), while for a white applicant it is 0.7178 + 0.1884 × 1 = 0.9062. Thus, for this very
simple specification the LPM and probit estimation methods yield very similar results.
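
The comparison can be made explicit with a short sketch. It uses the LPM estimates (0.7178 and 0.1884) and the probit index values (0.5764 and 1.3181) reported in this solution; the variable names are chosen for this example. With a single 0/1 regressor, both models should essentially reproduce the approval rate within each group, which would explain why the numbers line up so closely.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal c.d.f. Phi(z), computed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Estimates reported in the solution.
LPM_CONS, LPM_WHITE = 0.7178, 0.1884
PROBIT_INDEX_NONWHITE, PROBIT_INDEX_WHITE = 0.5764, 1.3181

# LPM: the coefficient on white is itself the difference in probabilities.
lpm_nonwhite = LPM_CONS
lpm_white = LPM_CONS + LPM_WHITE

# Probit: the difference has to be computed from the two c.d.f. values.
probit_nonwhite = normal_cdf(PROBIT_INDEX_NONWHITE)
probit_white = normal_cdf(PROBIT_INDEX_WHITE)
probit_gap = probit_white - probit_nonwhite

print(f"LPM:    non-white = {lpm_nonwhite:.4f}, white = {lpm_white:.4f}, gap = {LPM_WHITE:.4f}")
print(f"Probit: non-white = {probit_nonwhite:.4f}, white = {probit_white:.4f}, gap = {probit_gap:.4f}")
```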
d. Adding more variables in the Probit model yields the following:

We see that the coefficient on white is smaller in magnitude than in (a), but is still highly
significant (as evidenced by the p-value of 0.000 for the test of the null hypothesis that this
coefficient equals 0 against the alternative that it does not).

Therefore, discrimination against non-Whites still appears to be present, even when we control for
other characteristics.

9. Suppose the probability of getting a student loan is determined by a student’s grade point average
(GPA), age, sex, and level of study – undergraduate, masters, PhD student.
a. Identify the population logit model that could be used to represent this.
b. How would you estimate the probability that a 23-year-old, male, undergraduate student, with
a GPA of 3.2, will obtain a loan?

a. The population logit model would be as follows:

\[ \Pr(loan_i = 1) = F(\beta_0 + \beta_1 GPA_i + \beta_2 age_i + \beta_3 sex_i + \beta_4 masters_i + \beta_5 PhD_i) \]
\[ = \frac{1}{1 + e^{-(\beta_0 + \beta_1 GPA_i + \beta_2 age_i + \beta_3 sex_i + \beta_4 masters_i + \beta_5 PhD_i)}} \]
where:

• $loan_i$ = 1 if the loan is approved for student i; 0 if the loan is denied.
• $GPA_i$ = student i's GPA.
• $age_i$ = age of student i.
• $sex_i$ = 1 if student i is male; 0 if female.
Since the level of study is a categorical variable, we need to create a series of dummy variables
to capture this effect on the probability of getting a student loan. To avoid the dummy variable
trap we need to drop one of the categories. If we drop undergraduate, this leaves:

• $masters_i$ = 1 if student i is a masters student; 0 otherwise.
• $PhD_i$ = 1 if student i is a PhD student; 0 otherwise.
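
The dummy coding described above can be sketched as follows. The function name and the category labels are assumptions made for this example; undergraduate is the omitted base category, so an undergraduate student has both dummies equal to 0.

```python
# Category labels are illustrative; "undergraduate" is the omitted base category.
LEVELS = ("undergraduate", "masters", "PhD")


def level_dummies(level: str) -> dict:
    """Return the masters and PhD dummy variables for a given level of study."""
    if level not in LEVELS:
        raise ValueError(f"unknown level of study: {level!r}")
    return {
        "masters": int(level == "masters"),
        "PhD": int(level == "PhD"),
    }


for level in LEVELS:
    print(level, level_dummies(level))
```

Including a third dummy for undergraduate alongside the intercept would make the three dummies sum to 1 for every student, which is the perfect multicollinearity behind the dummy variable trap.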
b. The estimated probability that a 23-year-old, male, undergraduate student, with a GPA of 3.2,
will obtain a loan is given as follows:
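\[ \widehat{\Pr}(loan = 1) = F(\hat{\beta}_0 + \hat{\beta}_1 \cdot 3.2 + \hat{\beta}_2 \cdot 23 + \hat{\beta}_3 \cdot 1 + \hat{\beta}_4 \cdot 0 + \hat{\beta}_5 \cdot 0) \]
\[ = F(\hat{\beta}_0 + \hat{\beta}_1 \cdot 3.2 + \hat{\beta}_2 \cdot 23 + \hat{\beta}_3) \]
\[ = \frac{1}{1 + e^{-(\hat{\beta}_0 + \hat{\beta}_1 \cdot 3.2 + \hat{\beta}_2 \cdot 23 + \hat{\beta}_3)}} \]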
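
Once coefficient estimates are available, the calculation is a direct substitution. The sketch below uses hypothetical coefficient values, since the question does not supply estimates; the function name and argument order are also assumptions for this example.

```python
import math


def logit_probability(beta, gpa, age, male, masters, phd):
    """Predicted Pr(loan = 1) from a logit model with coefficients
    beta = (b0, b_GPA, b_age, b_sex, b_masters, b_PhD)."""
    index = (
        beta[0]
        + beta[1] * gpa
        + beta[2] * age
        + beta[3] * male
        + beta[4] * masters
        + beta[5] * phd
    )
    return 1.0 / (1.0 + math.exp(-index))


# Hypothetical estimates, for illustration only (not from any data set).
beta_hat = (-4.0, 1.0, 0.05, -0.2, 0.3, 0.5)

# 23-year-old male undergraduate with a GPA of 3.2: masters = 0 and PhD = 0.
p_loan = logit_probability(beta_hat, gpa=3.2, age=23, male=1, masters=0, phd=0)
print(f"Estimated probability of obtaining a loan: {p_loan:.4f}")
```

Because the student is an undergraduate, the masters and PhD terms drop out of the index, matching the simplification in the second line of the solution.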
