
PREDICTIVE MODELING

BUSINESS REPORT

DSBA

DIPTI PATIL
PGP – DSBA Online
Batch: March 2021

Date: 29th Aug, 2021

Table of Contents

Problem 1: Linear Regression
Data Dictionary
Q1.1
Sample of Dataset
Exploratory Data Analysis
Univariate Analysis
Multivariate Analysis
Q1.2
Q1.3
Q1.4 Business Insights and Recommendations
Problem 2: Logistic Regression and Linear Discriminant Analysis
Data Dictionary
Q2.1
Sample of Dataset
Exploratory Data Analysis
Univariate Analysis
Bivariate Analysis
Multivariate Analysis
Q2.2
Q2.3 Logistic Regression Model
Linear Discriminant Analysis Model
Comparison of Logistic Regression & Linear Discriminant Analysis
Q2.4 Business Insights and Recommendations

Pictures: Pic. 1 to Pic. 23

Tables: Table. 1 to Table. 4

Problem 1: Linear Regression
You are hired by a company, Gem Stones Co Ltd, which is a cubic zirconia manufacturer. You are
provided with a dataset containing the prices and other attributes of almost 27,000 cubic zirconia
(which is an inexpensive diamond alternative with many of the same qualities as a diamond). The
company is earning different profits on different price slots. You have to help the company in
predicting the price for the stone on the basis of the details given in the dataset, so it can distinguish
between higher profitable stones and lower profitable stones so as to have a better profit share. Also,
provide them with the best 5 attributes that are most important.
Data Dictionary:
Variable Name    Description
Carat            Carat weight of the cubic zirconia.
Cut              Describes the cut quality of the cubic zirconia. Quality in increasing order: Fair, Good, Very Good, Premium, Ideal.
Color            Colour of the cubic zirconia, with D being the best and J the worst.
Clarity          Clarity refers to the absence of inclusions and blemishes. In order from best to worst (IF = flawless, I1 = level 1 inclusion): IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1.
Depth            The height of the cubic zirconia, measured from the culet to the table, divided by its average girdle diameter.
Table            The width of the cubic zirconia's table expressed as a percentage of its average diameter.
Price            The price of the cubic zirconia.
X                Length of the cubic zirconia in mm.
Y                Width of the cubic zirconia in mm.
Z                Height of the cubic zirconia in mm.

1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values,
Data types, shape, EDA). Perform Univariate and Bivariate Analysis.

Sample of Dataset:

Table. 1

Exploratory Data Analysis:

• The dataset has 26967 observations and 10 variables.


• The variables carat, depth, table, x, y, z have float datatype. Cut, color, clarity have object datatype &
price has int datatype (Pic. 1).

Pic. 1 Pic. 2

• The variable ‘depth’ has 697 missing values (Pic. 2).


• The describe function gives the five-point summary of the dataset (Pic. 3).

Pic. 3

• From the describe output, we can see that the mean is close to the median (the 50% row in the
output) for all variables except ‘price’. This suggests that these variables are roughly symmetric
with little to no skewness.
• Moreover, there is a large gap between the 75th percentile and the maximum values. This indicates
that there are outliers in all variables (refer to Pic. 3).

Pic. 4 Pic. 5
• There are outliers present in all the variables, as seen in the boxplot above (Pic. 4). Pic. 5 shows
the variables after treating the outlier values. We will check them further in the univariate analysis.
• There are 34 duplicate rows, but we do not remove them, as they may simply be different stones
that share the same values.
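
The checks described above can be sketched with pandas as follows. This is a minimal illustration on a tiny made-up frame, not the report's actual notebook; only the column names are taken from the data dictionary, and the values are invented for the example.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame (values are illustrative, not from the real dataset).
df = pd.DataFrame({
    "carat": [0.30, 0.33, 0.90, 0.42, 0.30],
    "cut": ["Ideal", "Premium", "Very Good", "Ideal", "Ideal"],
    "depth": [62.1, 60.8, np.nan, 61.6, 62.1],
    "price": [499, 984, 6289, 1082, 499],
})

n_rows, n_cols = df.shape                   # observations, variables
dtypes = df.dtypes                          # datatype of each column
missing = df.isnull().sum()                 # missing values per column
n_duplicates = int(df.duplicated().sum())   # rows that repeat an earlier row
summary = df.describe()                     # count, mean, std, min, quartiles, max

print(n_rows, n_cols)
print(dtypes)
print(missing)
print("duplicate rows:", n_duplicates)
print(summary)
```

Comparing the `mean` row with the `50%` row of `summary`, and the `75%` row with the `max` row, gives the same quick read on skewness and outliers that the bullets above describe.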

Univariate Analysis:
Carat:

Depth:

Table:

X:

Y:

Z:

Price:

• From the charts above, we can see that there is little to no skewness in the Depth, Table & X
variables, i.e. their distributions are roughly symmetric.
• For the Carat, Y, Z & Price variables, we see that the data is right-skewed.
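
The visual reading of skewness can be cross-checked numerically. The sketch below uses pandas' `skew()` on two made-up columns (the values are invented): a value near 0 indicates a symmetric distribution, and a clearly positive value indicates a right skew.

```python
import pandas as pd

# Made-up values: 'depth' is symmetric, 'price' has a long right tail.
df = pd.DataFrame({
    "depth": [60.5, 61.0, 61.5, 62.0, 62.5, 63.0, 63.5],
    "price": [400, 500, 600, 800, 1200, 4000, 15000],
})

skewness = df.skew()  # sample skewness per column
print(skewness.round(2))
```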

Multivariate Analysis:

Pic. 6 (Heat Map)


From the pair plot (Pic. 7) & heat map (Pic. 6) between variables, we can analyse the correlation
between price and the other variables:
• High correlation between price & the carat, x, y & z variables.
• High correlation between carat & the x, y & z variables.
• We can see 2 clusters in the carat & x variables. The price can vary according to the cluster, but we
will not consider the clusters for linear regression (they can be considered for any further
analysis on price).
• High correlation between the x, y & z variables.
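
The numbers behind a heat map like Pic. 6 come from a correlation matrix. As a rough sketch, assuming Pearson correlation (the pandas default) and using synthetic data with invented relationships between the columns:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic data: x and price are built from carat, table is unrelated.
carat = rng.uniform(0.2, 2.0, size=n)
df = pd.DataFrame({
    "carat": carat,
    "x": 6.0 * np.cbrt(carat) + rng.normal(0, 0.05, size=n),
    "table": rng.normal(57, 2, size=n),
    "price": 4000 * carat + rng.normal(0, 300, size=n),
})

corr = df.corr()  # pairwise Pearson correlation
print(corr.round(2))

# Predictors most strongly correlated with price (0.8 is an arbitrary cut-off).
strong = corr["price"].drop("price")
print(strong[strong.abs() > 0.8])
```

A plotting library can then render `corr` as a heat map; the matrix itself is enough to spot pairs such as carat and x that move together.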

Pic. 7 (Pair Plot)

1.2 Impute null values if present, also check for the values which are equal to zero. Do they have any
meaning or do we need to change them or drop them? Do you think scaling is necessary in this
case?

Null Values before imputing with Median Null Values after imputing with Median

The variable ‘depth’ has 697 null values, which we impute with the median.


Also, the variables ‘x’ & ‘y’ each have 3 zero values & ‘z’ has 9 zero values. These are meaningless, as
x, y & z are the length, width & height of the stone respectively. So we first replace the zeros with null
values and then impute each variable with its ‘median’ value.
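
A minimal pandas sketch of this two-step treatment, on a made-up four-row frame (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Made-up rows: one missing depth and one zero in each of x, y, z.
df = pd.DataFrame({
    "depth": [61.0, np.nan, 62.0, 63.0],
    "x": [4.0, 0.0, 5.0, 6.0],
    "y": [4.1, 5.1, 0.0, 6.1],
    "z": [2.5, 3.0, 3.5, 0.0],
})

dims = ["x", "y", "z"]
# Step 1: a zero length, width or height is physically impossible, so mark it as missing.
df[dims] = df[dims].replace(0, np.nan)
# Step 2: fill every missing value with the median of its own column.
df = df.fillna(df.median())

print(df)
```

Replacing the zeros before computing the medians matters: otherwise the zeros themselves would pull the medians down slightly.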

Scaling is not strictly necessary to fit the linear regression here, since ordinary least squares gives
the same fit regardless of the units of the predictors.

However, with the describe function used previously, we can see that the mean, minimum and maximum
values differ widely across variables, and for variables like y, z & price the spread is very large. This
means the variables are on different scales. So, if we don’t get good accuracy in the model, we will
scale the data using the ‘z-score’ function.
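
For reference, z-score scaling subtracts each column's mean and divides by its standard deviation. The sketch below does this directly in pandas on made-up values; using the population standard deviation (`ddof=0`) is an assumption here, chosen because it matches the default of `scipy.stats.zscore`.

```python
import pandas as pd

# Made-up values on very different scales.
df = pd.DataFrame({
    "carat": [0.3, 0.5, 0.7, 1.1, 1.4],
    "price": [500, 1200, 2500, 6000, 9800],
})

# z = (value - column mean) / column standard deviation
scaled = (df - df.mean()) / df.std(ddof=0)

print(scaled.round(3))
print(scaled.mean().round(6))       # approximately 0 for every column
print(scaled.std(ddof=0).round(6))  # 1 for every column
```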

1.3 Encode the data (having string values) for Modelling. Data Split: Split the data into train and test
(70:30). Apply Linear regression. Performance Metrics: Check the performance of Predictions on
Train and Test sets using Rsquare, RMSE

• The variables cut, color & clarity have object data type. For linear regression we need to encode
them, which we do with get_dummies.
• The unique values are converted to encoded values, as shown below:

Table. 2

• We split the data into 70% train data, used to build the model, & 30% test data, used to test it.
• After fitting the model with statsmodels we get the summary seen in Pic. 9.
• R² & adjusted R² are 94%.
• RMSE for the test data is 836.59, which is quite high.
• The p-value for ‘depth’ is higher than 0.05 (Pic. 9).
• Also, we can see that the VIF values are very high. So we drop the ‘depth’ variable from the model,
scale the data and rerun the model.
• After scaling and removing the ‘depth’ variable, the p-values for all the variables are below 0.05
(refer to Pic. 10), hence we stop dropping variables and finalise the model.
• The final model gives an RMSE of 0.24 for the test data. Because the data was scaled, this value is
in z-score units and is not directly comparable with the earlier 836.59. Together with the R² values
below, it indicates a good fit.
• The R² for train & test data is 94.04% & 94.15% respectively.
• From Pic. 8 & Pic. 10, we can see the coefficients of the variables & the VIF scores, which are much
lower than before removing the ‘depth’ variable & scaling the data.
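
The encode, split, fit and evaluate steps can be sketched as below. Note the swap: the report fits its model with statsmodels, whereas this illustration uses scikit-learn's `LinearRegression` to keep it short. The data is synthetic, and `drop_first=True` and `random_state=1` are assumptions, not settings stated in the report.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 300

# Synthetic stones: price depends on carat and cut (invented relationship).
df = pd.DataFrame({
    "carat": rng.uniform(0.2, 2.0, n),
    "cut": rng.choice(["Fair", "Good", "Ideal"], n),
})
cut_bonus = df["cut"].map({"Fair": 0, "Good": 300, "Ideal": 600})
df["price"] = 4000 * df["carat"] + cut_bonus + rng.normal(0, 200, n)

# Encode the string column as 0/1 dummy columns; the first level is the baseline.
X = pd.get_dummies(df.drop(columns="price"), columns=["cut"], drop_first=True, dtype=float)
y = df["price"]

# 70:30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

model = LinearRegression().fit(X_train, y_train)

results = {}
for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = model.predict(X_part)
    results[name] = {
        "r2": r2_score(y_part, pred),
        "rmse": float(np.sqrt(mean_squared_error(y_part, pred))),
    }
    print(name, "R2:", round(results[name]["r2"], 3), "RMSE:", round(results[name]["rmse"], 1))
```

Similar R² values on the train and test sets are what the report relies on when it concludes that the model generalises.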
Pic. 8 (VIF Score)
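
For context, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The report does not say which implementation produced Pic. 8, so the sketch below computes the textbook definition directly with NumPy on synthetic columns; the helper name `vif` is made up for this example.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of every column of X (textbook definition)."""
    X = np.asarray(X, dtype=float)
    scores = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        A = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1 - resid.var() / target.var()
        scores.append(1 / (1 - r2))
    return np.array(scores)

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = x + rng.normal(scale=0.1, size=500)  # nearly a copy of x, like two stone dimensions
table = rng.normal(size=500)             # unrelated to the other two

scores = vif(np.column_stack([x, y, table]))
print(scores.round(1))
```

Columns that are almost linear combinations of each other, as the x, y & z dimensions are, get very large scores, while an unrelated column stays close to 1.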

Pic. 9 Pic. 10
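
The drop in RMSE from 836.59 to 0.24 reflects a change of units rather than a better model. Assuming price was z-scored along with the predictors (the report says the data was scaled but does not list the columns), the RMSE in standard-deviation units is approximately sqrt(1 - R²), which can be checked against the reported test R²:

```python
import math

r2_test = 0.9415  # test R² reported above

# For a z-scored target, RMSE is approximately sqrt(1 - R²).
rmse_in_std_units = math.sqrt(1 - r2_test)
print(round(rmse_in_std_units, 2))  # 0.24
```

The result agrees with the reported 0.24, so the scaled and unscaled models explain about the same share of the variance in price.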

1.4 Inference: Basis on these predictions, what are the business insights and recommendations.

• Based on the above predictions, we can see that the variables of utmost importance in
determining the price of the cubic zirconia are carat, clarity, cut and the width (y) of the stone.

Business Insights:
• The final Linear Regression equation is:

Price = (-0.85) + (1.23) * carat + (-0.01) * table + (-0.35) * x + (0.27) * y + (-0.08) * z
+ (0.11) * cut_Good + (0.18) * cut_Ideal + (0.17) * cut_Premium + (0.14) * cut_Very_Good
+ (-0.06) * color_E + (-0.07) * color_F + (-0.12) * color_G + (-0.25) * color_H + (-0.39) * color_I
+ (-0.55) * color_J + (1.17) * clarity_IF + (0.75) * clarity_SI1 + (0.52) * clarity_SI2
+ (0.99) * clarity_VS1 + (0.91) * clarity_VS2 + (1.12) * clarity_VVS1 + (1.11) * clarity_VVS2

• The variables with the highest coefficients are:
• Carat (1.23)
• Clarity_IF (1.17)
• Clarity_VVS1 (1.12)
• Clarity_VVS2 (1.11)
• Clarity_VS1 (0.99)
• Clarity_VS2 (0.91)
• Clarity_SI1 (0.75)
• Clarity_SI2 (0.52)
• y (0.27)
• For example, when the value of carat increases by 1 unit, price increases by 1.23 units, keeping
all other predictors constant.
• There are also some negative coefficients, the largest of which are:
• x (-0.35)
• Color H (-0.25)
• Color I (-0.39)
• Color J (-0.55)
• For example, when the value of x increases by 1 unit, the value of price decreases by 0.35 units,
keeping all other predictors constant.
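
To make the equation concrete, the sketch below plugs a hypothetical stone into the reported coefficients. The input values are invented, and it is an assumption that numeric inputs are in scaled (z-score) units while the dummy variables are 0 or 1; the report does not spell this out.

```python
# Coefficients as reported in the final regression equation.
COEF = {
    "Intercept": -0.85, "carat": 1.23, "table": -0.01, "x": -0.35, "y": 0.27, "z": -0.08,
    "cut_Good": 0.11, "cut_Ideal": 0.18, "cut_Premium": 0.17, "cut_Very_Good": 0.14,
    "color_E": -0.06, "color_F": -0.07, "color_G": -0.12, "color_H": -0.25,
    "color_I": -0.39, "color_J": -0.55,
    "clarity_IF": 1.17, "clarity_SI1": 0.75, "clarity_SI2": 0.52, "clarity_VS1": 0.99,
    "clarity_VS2": 0.91, "clarity_VVS1": 1.12, "clarity_VVS2": 1.11,
}

def predict(features):
    """Scaled price for a dict of feature values; features left out count as 0."""
    return COEF["Intercept"] + sum(COEF[name] * value for name, value in features.items())

# Hypothetical stone: one standard deviation above average in size, Ideal cut, colour G, clarity VS1.
stone = {"carat": 1.0, "table": 0.0, "x": 1.0, "y": 1.0, "z": 1.0,
         "cut_Ideal": 1, "color_G": 1, "clarity_VS1": 1}

base = predict(stone)
heavier = predict({**stone, "carat": 2.0})  # same stone, carat raised by 1 unit

print(round(base, 2))            # 1.27
print(round(heavier - base, 2))  # 1.23, the carat coefficient
```

The second print shows the "keeping all other predictors constant" reading: changing only carat by 1 unit shifts the prediction by exactly the carat coefficient.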

Recommendations:
• The higher the coefficient value, the more important the predictor. The best features are:
• Carat
• Clarity IF
• Clarity VVS1 & VVS2
• Clarity VS1 & VS2
• Clarity SI1 & SI2
• Y (width of the stone)
• To improve profitability, the cubic zirconia should have a high carat weight with good clarity, cut
& width of stone.

Problem 2: Logistic Regression and LDA
You are hired by a tour and travel agency which deals in selling holiday packages. You are provided
details of 872 employees of a company. Among these employees, some opted for the package and
some didn’t. You have to help the company in predicting whether an employee will opt for the
package or not on the basis of the information given in the data set. Also, find out the important
factors on the basis of which the company will focus on particular employees to sell their packages.
Data Dictionary:
Variable Name        Description
Holiday_Package      Opted for Holiday Package yes/no?
Salary               Employee salary
age                  Age in years
edu                  Years of formal education
no_young_children    The number of young children (younger than 7 years)
no_older_children    Number of older children
foreign              Foreigner yes/no

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition check,
write an inference on it. Perform Univariate and Bivariate Analysis. Do exploratory data analysis.

Sample of Dataset:

Table. 3

Exploratory Data Analysis:

• The dataset has 872 observations and 7 variables.


• All variables have int datatype except ‘Holliday_Package’ and ‘foreign’, which have object
datatype (Pic. 11).
• There are no missing values in the dataset (Pic. 12).
• There are no duplicate rows present in the dataset.
• The describe function gives the five-point summary of the dataset (Pic. 13).

Pic. 11 Pic. 12

Pic. 13
• From the describe output, we can see that the mean is fairly close to the median (the 50% row in
the output) for all variables except ‘no_young_children’. This suggests that these variables are
roughly symmetric with little to no skewness.
• Moreover, there is a large gap between the 75th percentile and the maximum values. This indicates
that there are outliers in the variables (refer to Pic. 13).

Pic. 14 (Before treating Outliers)

Pic. 15 (After treating Outliers)

• There are outliers present in most of the variables, as seen in the boxplot above (Pic. 14). Pic. 15
shows the variables after treating the outlier values. We will check them further in the univariate
analysis.
• We treat the outlier values in the variables to improve the accuracy of the model.
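
The report does not state how the outliers were treated. One common approach, shown here purely as an assumption, is to cap values at the boxplot whiskers, i.e. 1.5 times the interquartile range beyond the quartiles. The helper name `cap_outliers` and the salary values are made up for this sketch.

```python
import pandas as pd

def cap_outliers(series):
    """Clip values lying beyond 1.5 * IQR from the quartiles (the boxplot whiskers)."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# Made-up salaries with one extreme value.
salary = pd.Series([30000, 35000, 40000, 45000, 50000, 55000, 236000])
capped = cap_outliers(salary)

print(capped.tolist())
```

Here the quartiles are 37,500 and 52,500, so the upper whisker is 75,000 and only the extreme salary is pulled back to it; capping keeps the row in the dataset instead of dropping it.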

Univariate Analysis:

Salary:

Age:

Education:

No younger Children:

No older Children:

Bivariate Analysis:

Pic. 16
• From Pic. 16, we can see that employees between the ages of 25 and 50 with a salary of up to 50K
opt for Holiday Packages. The individual variable analysis leads to the same conclusion.

Pic. 17

• From Pic. 17, we can see that employees with a salary between 25K and 75K and with 5 to 15 years
of formal education opt for Holiday Packages.
• As the number of years of formal education and the salary increase, employees tend not to opt for
the Holiday Package.
• The individual variable analysis leads to the same conclusion.

Pic. 18

• From Pic. 18, we can see that as the number of children increases, fewer employees opt for the
Holiday Package.
• Employees who opt for the Holiday Package mostly have a salary between 25K and 75K and up to
3 children.
• Few employees with a salary above 100K opt for Holiday Packages.

Pic. 19

• From Pic. 19, we can see that as the number of young children increases, fewer employees opt for
the Holiday Package.
• Employees who opt for the Holiday Package mostly have a salary between 25K and 75K and at
most 1 young child.
• Few employees with a salary above 100K opt for Holiday Packages.

Multivariate Analysis:

Pic. 20 (Pair plot)

Pic. 21 (Heat Map)

• The Heat map in Pic. 21 shows that there is no collinearity between the variables.

2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data Split: Split the
data into train and test (70:30). Apply Logistic Regression and LDA (linear discriminant analysis).

• The variables Holiday Package and Foreign have Yes & No responses, so we convert them to
numeric values through encoding.
• We split the data into train data (70%) and test data (30%).
• We apply the Logistic Regression model & LDA model to the train and test data.
• For Logistic Regression we use the grid search method to find the best parameters.
• The grid search gives solver = ‘liblinear’ & tolerance tol = 1e-06 as the best parameters. ‘liblinear’
is well suited to small datasets.
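
A rough sketch of the encoding, split and grid search with scikit-learn follows. The data is synthetic and the rule generating the target is invented; the parameter grid, `cv=3`, `max_iter` and `random_state` are assumptions, since the report only names the winning `solver` and `tol`. The target column is spelled ‘Holliday_Package’ as quoted in the EDA section.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(3)
n = 400

# Synthetic employees (values and the rule behind the target are made up).
df = pd.DataFrame({
    "Salary": rng.normal(50000, 15000, n).round(),
    "age": rng.integers(22, 60, n),
    "no_young_children": rng.integers(0, 3, n),
    "foreign": rng.choice(["yes", "no"], n),
})
score = 0.8 * (df["foreign"] == "yes") - 0.9 * df["no_young_children"] - 0.03 * (df["age"] - 40)
df["Holliday_Package"] = np.where(score + rng.normal(0, 1, n) > -0.5, "yes", "no")

# Encode the yes/no columns as 1/0.
for col in ["foreign", "Holliday_Package"]:
    df[col] = df[col].map({"no": 0, "yes": 1})

X = df.drop(columns="Holliday_Package")
y = df["Holliday_Package"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

# Grid search over solver and tolerance (the data is left unscaled, as the task requires).
grid = GridSearchCV(
    LogisticRegression(max_iter=10000),
    param_grid={"solver": ["liblinear", "lbfgs"], "tol": [1e-4, 1e-6]},
    cv=3,
)
grid.fit(X_train, y_train)

print(grid.best_params_)
print("cross-validated accuracy:", round(grid.best_score_, 3))
print("test accuracy:", round(grid.score(X_test, y_test), 3))
```

On real data the winning combination depends on the grid that was searched, so the parameters printed here need not match the report's.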

2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy,
Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model Final Model: Compare Both
the models and write inference which model is best/optimized.

Logistic Regression Analysis:

Train data & Test data Classification Report:

AUC score & ROC:

Pic. 22
Linear Discriminant Analysis:

Train data & Test data Classification Report:

AUC score & ROC:

Pic. 23

Comparison of Logistic Regression & Linear Discriminant Analysis:

Table. 4
• From the comparison table above, we can see that the train and test values of Accuracy, AUC,
Recall, Precision & F1 Score are in line with each other for both models. This suggests that both
models perform consistently and that there is no underfitting or overfitting issue.
• The values are the same for both models, but scaling was not done for LDA. If we perform it for
LDA, we may get better values compared to Logistic Regression, so LDA seems the better model.
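
A comparison like the one in Table. 4 can be assembled as sketched below: fit both models on the train split, then collect accuracy, confusion matrix and ROC AUC for the train and test splits. The data here is synthetic, so the numbers will not match the report's; the Logistic Regression settings reuse the `solver` and `tol` reported by the grid search.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
n = 600

# Synthetic, noisy binary problem with three numeric predictors.
X = rng.normal(size=(n, 3))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 1.5, n) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

models = {
    "Logistic Regression": LogisticRegression(solver="liblinear", tol=1e-6),
    "LDA": LinearDiscriminantAnalysis(),
}

report = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    for split, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
        pred = model.predict(X_part)
        proba = model.predict_proba(X_part)[:, 1]  # probability of class 1, used for the ROC curve
        report[(name, split)] = {
            "accuracy": accuracy_score(y_part, pred),
            "auc": roc_auc_score(y_part, proba),
            "confusion": confusion_matrix(y_part, pred),
        }
        print(name, split,
              "accuracy:", round(report[(name, split)]["accuracy"], 3),
              "AUC:", round(report[(name, split)]["auc"], 3))
```

Train and test values that sit close together for a model are the pattern the report reads as the absence of overfitting.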

2.4 Inference: Basis on these predictions, what are the insights and recommendations.

In this business problem, we need to help the company predict whether an employee will opt for the
Holiday Package or not on the basis of the information given in the data set. From Table. 4, we can see
that Logistic Regression & Linear Discriminant Analysis both give values which are very similar to
each other. In the EDA we found out that:

• Employees who have a salary of approx. 50K and are between 25 and 50 years of age opt more
often for Holiday Packages.
• Employees with a salary of more than 100K and aged over 50 generally do not opt for
Holiday Packages.
• Employees with older kids (more than 7 years old) and a salary of approx. 50K opt for Holiday
Packages.
The insights:
• The accuracy of the model is approx. 64%, which means that for about 36% of employees the
model does not correctly predict whether the employee will opt for the Holiday Package or
not.
The Recommendations:
• Customised packages should be provided for employees according to age, salary and number of
kids.
• Employees with a salary of more than 100K and aged over 50 should be offered
destinations with spas and leisure retreats, ship cruises etc., where they can relax and all
activities are under one roof.
• Employees with a salary of more than 50K and with young kids (less than 7 years old)
should be offered destinations that are friendly to young kids, such as beaches or places with
water bodies.
• Employees with a salary of more than 50K and with older kids (more than 7 years old)
should be offered destinations where kids can play on their own, such as resorts with play areas
or amusement parks, where every individual can do their own activity.
