
LINEAR REGRESSION

PREPARED BY
SUNIRA
Contents
Problem 1:  Linear Regression………………………………….……………………………………...
You are hired by a company, Gem Stones Co Ltd, which is a cubic zirconia manufacturer. You are provided
with a dataset containing the prices and other attributes of almost 27,000 cubic zirconia stones (an
inexpensive diamond alternative with many of the same qualities as a diamond). The company earns
different profits in different price slots. You have to help the company predict the price of a stone
on the basis of the details given in the dataset, so it can distinguish between more profitable and
less profitable stones and improve its profit share. Also, provide the five attributes that
are most important.
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values, Data
types, shape, EDA, duplicate values). Perform Univariate and Bivariate Analysis…………………………………………
8
1.2 Impute null values if present, also check for the values which are equal to zero. Do they have any
meaning or do we need to change them or drop them? Check for the possibility of combining the sub-levels
of an ordinal variable and take actions accordingly. Explain why you are combining these sub-levels with
appropriate
reasoning…………………………………………………………………………………………………………………………………5
1.3 Encode the data (having string values) for Modelling. Split the data into train and test (70:30). Apply
Linear regression using scikit learn. Perform checks for significant variables using appropriate method from
stats model. Create multiple models and check the performance of Predictions on Train and Test sets using
R square, RMSE & Adj R square. Compare these models and select the best one with appropriate
reasoning………………………………………………………………………………………………………………………………………………….
12
1.4 Inference: Based on these predictions, what are the business insights and recommendations……………….5

Problem 2: Logistic Regression and LDA………………………………...…………………………….


You are hired by a tour and travel agency which deals in selling holiday packages. You are provided details
of 872 employees of a company. Among these employees, some opted for the package and some didn't.
You have to help the company in predicting whether an employee will opt for the package or not on the
basis of the information given in the data set. Also, find out the important factors on the basis of which the
company will focus on particular employees to sell their packages.
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition check, write
an inference on it. Perform Univariate and Bivariate Analysis. Do exploratory data analysis………………………5
2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data Split: Split the data
into train and test (70:30). Apply Logistic Regression and LDA (linear discriminant analysis)………………………7
2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy,
Confusion Matrix; plot the ROC curve and get the ROC_AUC score for each model. Final Model: Compare both
models and write an inference on which model is
best/optimized……………………………………………………………………….7
2.4 Inference: Based on these predictions, what are the insights and recommendations……………………………..5
List of figures (Ques 1)
Fig 1.1,1.2 ………………………………………………………………………………………………………..4
Fig 1.3, 1.4, 1.5, 1.6 ……………………………………………………………………………………………...5
Fig 1.7, 1.8, 1.9, 1.10 …………………………………………………………………………………………….6
Fig 1.11, 1.12, 1.13, 1.14 ………………………………………………………………………………………...7
Fig 1.15. ………………………………………………………………………………………………………….8
Fig 1.16, 1.17 …………………………………………………………………………………………………….9
Fig 1.18, 1.19 ……………………………………………………………………………………………………10
Fig 1.20, 1.21 ……………………………………………………………………………………………………11
Fig 1.22, 1.23 ……………………………………………………………………………………………………12
Fig 1.24, 1.25, 1.26………………………………………………………………………………………………13
Fig 1.27, 1.28 ……………………………………………………………………………………………………14
Fig 1.29 …………………………………………………………………………………………………………15
Fig 1.30 ………………………………………………………………………………………………………….16
Fig 1.31 ………………………………………………………………………………………………………….19
Fig 1.32, 1.33 ……………………………………………………………………20
Fig 1.34, 1.35 ……………………………………………………………………………………………………21
Fig 1.36, 1.37 ……………………………………………………………………………………………………22
Fig 1.38, 1.39 ……………………………………………………………………………………………………23
Fig 1.40, 1.41 ……………………………………………………………………………………………………24
Fig 1.42, 1.43 ……………………………………………………………………………………………………25
Fig 1.44 ………………………………………………………………………………………………………….26

List of Tables (Ques 1)


Table 1.1, 1.2 …………………………………………………………………………………………………….2
Table 1.3 …………………………………………………………………………………………………………3
Table 1.4 …………………………………………………………………………………………………………8
Table 1.5 ………………………………………………………………………………………………………..18
Table 1.6 ………………………………………………………………………………………………………..26

List of Figures (Ques 2)


Fig 1.1 ……………………………………………………………………………………………………………4
Fig 1.2, 1.3 ……………………………………………………………………………………………………….5
Fig 1.4, 1.5 ……………………………………………………………………………………………………….6
Fig 1.6, 1.7 ……………………………………………………………………………………………………….7
Fig 1.8, 1.9 ……………………………………………………………………………………………………….8
Fig 1.10, 1.11 …………………………………………………………………………………………………….9
Fig 1.12, 1.13 ……………………………………………………………………………………………………10
Fig 1.14, 1.15 ……………………………………………………………………………………………………11
Fig 1.16 ………………………………………………………………………………………………………….12
Fig 1.17, 1.18 ……………………………………………………………………………………………………13
Fig 1.19, 1.20 ……………………………………………………………………………………………………14
Fig 1.20 ………………………………………………………………………………………………………….16
Fig 1.21, 1.22 ……………………………………………………………………………………………………17
Fig 1.23 ………………………………………………………………………………………………………….20
Fig 1.24 ………………………………………………………………………………………………………….21
Fig 1.25, 1.26 ……………………………………………………………………………………………………22
Fig 1.27, 1.28 ……………………………………………………………………………………………………23
Fig 1.29, 1.30 ……………………………………………………………………………………………………24
Fig 1.31, 1.32 ……………………………………………………………………………………………………25
List of Tables (Ques 2)
Table 1.1, 1.2 …………………………………………………………………………………………………..2
Table 1.3 ……………………………………………………………………………………………………….3
Table 1.4 ………………………………………………………………………………………………………14
Table 1.5 ………………………………………………………………………………………………………16
Table 1.6 ………………………………………………………………………………………………………25

LINEAR REGRESSION
You are hired by a company, Gem Stones Co Ltd, which is a cubic zirconia
manufacturer. You are provided with a dataset containing the prices and
other attributes of almost 27,000 cubic zirconia stones (an inexpensive
diamond alternative with many of the same qualities as a diamond). The
company earns different profits in different price slots. You have to
help the company predict the price of a stone on the basis of the
details given in the dataset, so it can distinguish between more
profitable and less profitable stones and improve its profit share.
Also, provide the five attributes that are most important.

Data Dictionary:

Variable Name: Description

Carat: Carat weight of the cubic zirconia.
Cut: Cut quality of the cubic zirconia. Quality in increasing order: Fair, Good, Very Good, Premium, Ideal.
Color: Colour of the cubic zirconia, with D being the best and J the worst.
Clarity: Clarity refers to the absence of inclusions and blemishes. In order from best to worst (FL = flawless, I3 = level 3 inclusions): FL, IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1, I2, I3.
Depth: The height of a cubic zirconia, measured from the culet to the table, divided by its average girdle diameter.
Table: The width of the cubic zirconia's table expressed as a percentage of its average diameter.
Price: The price of the cubic zirconia.
X: Length of the cubic zirconia in mm.
Y: Width of the cubic zirconia in mm.
Z: Height of the cubic zirconia in mm.

1.1 Read the data and do exploratory data analysis. Describe the data
briefly. (Check the null values, data types, shape, EDA). Perform
Univariate and Bivariate Analysis.
Loading all the necessary libraries for the model building.
Now, reading the head and tail of the dataset to check whether the data
has been fed properly.
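As a minimal sketch of this step (the file name cubic_zirconia.csv is an assumption, not confirmed by the report):

    import pandas as pd

    # Load the dataset; the file name is an assumption
    df = pd.read_csv("cubic_zirconia.csv")

    # Head and tail confirm the data was read correctly
    print(df.head())
    print(df.tail())

    # Shape and column dtypes
    print(df.shape)
    df.info()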
HEAD OF THE DATA (Table 1.1)

TAIL OF THE DATA (Table 1.2)

Checking the shape of the data (26967, 11)

Checking the info of the data



We have float, int and object data types in the data.


DATA DESCRIPTION (Table 1.3)

We have both categorical and continuous data.

For categorical data we have cut, colour and clarity.
For continuous data we have carat, depth, table, x, y, z and price.
Price will be the target variable.
Checking for duplicates in the data.
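A short sketch of the duplicate check, continuing with the df frame from the loading step (dropping the duplicates is one reasonable choice, not necessarily what was done here):

    # Count exact duplicate rows
    print("Duplicate rows:", df.duplicated().sum())

    # Drop them, keeping the first occurrence
    df = df.drop_duplicates().reset_index(drop=True)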

Unique values in the categorical data


CUT: 5
Fair 781
Good 2441
Very Good 6030
Premium 6899
Ideal 10816

We have 5 cuts, and Ideal seems to be the most preferred cut.

COLOR: 7
J 1443
I 2771
D 3344
H 4102
F 4729
E 4917
G 5661

CLARITY: 8
I1 365
IF 894
VVS1 1839
VVS2 2531
VS1 4093
SI2 4575
VS2 6099
SI1 6571

Univariate / Bivariate analysis


Fig 1.1 Fig 1.2

The distribution of carat appears positively skewed; the multiple peaks
suggest the distribution may be multimodal, and the box plot of carat
shows a large number of outliers. The majority of the data lies in the
range 0 to 1.
Fig 1.3 Fig 1.4

The distribution of depth appears roughly normal, ranging from about 55 to 65.
The box plot of the depth distribution holds many outliers.
Fig 1.5 Fig 1.6

The distribution of table also appears positively skewed, and its box
plot has outliers. Most of the data lies between 55 and 65.

Fig 1.7 Fig 1.8

The distribution of x (length of the cubic zirconia in mm) is positively
skewed, and its box plot contains many outliers. The distribution ranges
from about 4 to 8.
Fig 1.9 Fig 1.10

The distribution of y (width of the cubic zirconia in mm) is positively
skewed, and its box plot also contains outliers. The distribution is
heavily skewed, possibly because the stones are cut to specific shapes,
so only a limited range of sizes appears on the market.

Fig 1.11 Fig 1.12

The distribution of z (height of the cubic zirconia in mm) is positively
skewed, and its box plot also contains outliers. The distribution is
heavily skewed, possibly because the stones are cut to specific shapes,
so only a limited range of sizes appears on the market.
Fig 1.13 Fig 1.14

The price distribution is positively skewed and contains outliers.
Prices range from about Rs 100 to 8,000.

PRICE – HIST

Fig 1.15
Skewness values:

Table 1.4

BIVARIATE ANALYSIS
CUT :
Quality in increasing order: Fair, Good, Very Good, Premium, Ideal.
Fig 1.16

The most preferred cut for diamonds appears to be Ideal.


Fig 1.17

The Ideal cut may be the most preferred because those stones are priced
lower than other cuts.
COLOR:

D being the best and J the worst.



Fig 1.18
We have 7 colours in the data; G seems to be the most common colour.

Fig 1.19
We see that G is priced in the middle of the seven colours, whereas J,
the worst colour, is priced surprisingly high.
CLARITY:
In order from best to worst (FL = flawless, I3 = level 3 inclusions):
FL, IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1, I2, I3

Fig 1.20
SI1 and VS2 appear to be the clarities most often purchased.

Fig 1.21
The data contains no FL diamonds; from this we can understand that
flawless stones are not bringing any revenue to the store.

Relationships between categorical variables


Cut and colour
Fig 1.22

Cut and clarity

Fig 1.23

CORRELATION
CARAT VS PRICE

Fig 1.24

DEPTH VS PRICE

Fig 1.25

X VS PRICE
Fig 1.26

Y VS PRICE
Fig 1.27

Z VS PRICE
Fig 1.28

DATA DISTRIBUTION

Fig 1.29

CORRELATION MATRIX

Fig 1.30

This matrix clearly shows the presence of multicollinearity in the dataset.

1.2 Impute null values if present, also check for the values which are equal to zero. Do
they have any meaning or do we need to change them or drop them? Do you think
scaling is necessary in this case?

Yes, we have null values in depth. Since depth is a continuous variable,
mean or median imputation can be used.
The percentage of null values is under 5%, so these rows could also
simply be dropped.
After median imputation, there are no null values left in the dataset.
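A sketch of the median imputation, assuming df from the earlier steps:

    # Fill the missing depth values with the column median
    df["depth"] = df["depth"].fillna(df["depth"].median())

    # Verify that no nulls remain
    print(df.isnull().sum())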

Checking for values equal to 0



Table 1.5
A few rows have zero values in x, y or z. Since these are the physical
dimensions of a stone, zero values are meaningless, and because only a
handful of rows are affected, we can drop them without harming the model.
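A sketch of dropping the zero-dimension rows, assuming df as above:

    # x, y, z are physical dimensions, so zero values are meaningless; drop those rows
    df = df[(df["x"] > 0) & (df["y"] > 0) & (df["z"] > 0)].reset_index(drop=True)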
SCALING
Scaling can be useful for checking or reducing multicollinearity in the
data. Without scaling, the VIF (variance inflation factor) values are
very high, which indicates the presence of multicollinearity.
These values are calculated after building the linear regression model,
in order to understand the multicollinearity in it.
Scaling had no impact on the model score, the coefficients of the
attributes, or the intercept.
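One way the VIF values may have been computed with statsmodels (a sketch; the exact feature set used is an assumption):

    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # VIF for each numeric predictor (target excluded)
    num = df.drop(columns="price").select_dtypes(include="number")
    vif = pd.Series(
        [variance_inflation_factor(num.values, i) for i in range(num.shape[1])],
        index=num.columns,
    )
    print(vif)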
BEFORE SCALING – VIF VALUES

AFTER SCALING – VIF VALUES

CHECKING THE OUTLIERS IN THE DATA


BEFORE TREATING OUTLIERS

Fig 1.31

Fig 1.32

Fig 1.33

Fig 1.34

Fig 1.35

Fig 1.36

Fig 1.37
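The report does not show the exact outlier treatment, so the following is a sketch of a common IQR-based capping approach; the list of columns treated is an assumption:

    def cap_outliers(series):
        # Clip values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
        q1, q3 = series.quantile([0.25, 0.75])
        iqr = q3 - q1
        return series.clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

    for col in ["carat", "depth", "table", "x", "y", "z", "price"]:
        df[col] = cap_outliers(df[col])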

AFTER TREATING OUTLIERS


Fig 1.38

Fig 1.39

Fig 1.40

Fig 1.41

Fig 1.42

Fig 1.43

Fig 1.44

1.3 Encode the data (having string values) for Modelling. Data Split: Split the data
into train and test (70:30). Apply Linear regression. Performance Metrics: Check the
performance of Predictions on Train and Test sets using R-square and RMSE.
ENCODING THE STRING VALUES
GET DUMMIES

Table 1.6

Dummy variables have been encoded.

The linear regression model cannot handle categorical values directly,
so the categorical values have been encoded as integers.

DROPPING UNWANTED COLUMNS
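A sketch of the encoding, the 70:30 split and the model fit. The random_state, the dropped serial-number column and the absence of a scaling step are assumptions (the report's coefficient magnitudes suggest the data may have been scaled before fitting):

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression

    # Drop a serial-number column if present (assumption)
    df = df.drop(columns=[c for c in df.columns if c.lower().startswith("unnamed")])

    # One-hot encode cut, color and clarity, dropping the first level of each
    df_enc = pd.get_dummies(df, columns=["cut", "color", "clarity"],
                            drop_first=True, dtype=int)

    X = df_enc.drop(columns="price")
    y = df_enc["price"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=1
    )

    lr = LinearRegression().fit(X_train, y_train)
    for name, coef in zip(X.columns, lr.coef_):
        print(f"The coefficient for {name} is {coef}")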

The coefficient for carat is 1.1009417847804501
The coefficient for depth is 0.005605143445570377
The coefficient for table is -0.013319500386804035
The coefficient for x is -0.30504349819633475
The coefficient for y is 0.30391448957926553
The coefficient for z is -0.13916571567987943
The coefficient for cut_Good is 0.09403402912977911
The coefficient for cut_Ideal is 0.1523107462056746
The coefficient for cut_Premium is 0.14852774839849378
The coefficient for cut_Very Good is 0.12583881878452705
The coefficient for color_E is -0.04705442233369822
The coefficient for color_F is -0.06268437439142825
The coefficient for color_G is -0.10072161838356786
The coefficient for color_H is -0.20767313311661612
The coefficient for color_I is -0.3239541927462737
The coefficient for color_J is -0.46858930275015803
The coefficient for clarity_IF is 0.9997691394634902
The coefficient for clarity_SI1 is 0.6389785818271332
The coefficient for clarity_SI2 is 0.42959662348315514
The coefficient for clarity_VS1 is 0.8380875826737564
The coefficient for clarity_VS2 is 0.7660244466083613
The coefficient for clarity_VVS1 is 0.9420769630114072
The coefficient for clarity_VVS2 is 0.9313670288415696
R square on training data

R square on testing data

RMSE on testing data
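A sketch of how these scores may have been computed, continuing from the fit above:

    import numpy as np
    from sklearn.metrics import mean_squared_error

    print("R square on training data:", lr.score(X_train, y_train))
    print("R square on testing data:", lr.score(X_test, y_test))
    print("RMSE on testing data:",
          np.sqrt(mean_squared_error(y_test, lr.predict(X_test))))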

VIF VALUES

We still find multicollinearity in the dataset; to bring these values
down we can drop columns after reviewing the statsmodels output.

From the statsmodels summary we can identify the features that do not
contribute to the model.

Removing those features reduces the VIF values; ideally, VIF should be
below 5.
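A sketch of the statsmodels fit used to obtain the summary below (renaming the dummy columns is needed because names such as "cut_Very Good" contain a space):

    import pandas as pd
    import statsmodels.formula.api as smf

    data = pd.concat([X_train, y_train], axis=1)
    # Replace spaces so every column can appear in the formula
    data.columns = [c.replace(" ", "_") for c in data.columns]
    predictors = [c for c in data.columns if c != "price"]
    ols_model = smf.ols("price ~ " + " + ".join(predictors), data=data).fit()
    print(ols_model.summary())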

STATSMODEL

BEST PARAMS SUMMARY


OLS Regression Results
==============================================================================
Dep. Variable:                  price   R-squared:                       0.942
Model:                            OLS   Adj. R-squared:                  0.942
Method:                 Least Squares   F-statistic:                 1.330e+04
Date:                Fri, 15 Jan 2021   Prob (F-statistic):               0.00
Time:                        22:15:37   Log-Likelihood:                 2954.6
No. Observations:               18870   AIC:                            -5861.
Df Residuals:                   18846   BIC:                            -5673.
Df Model:                          23
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept        -0.7568      0.016    -46.999      0.000      -0.788      -0.725
carat             1.1009      0.009    121.892      0.000       1.083       1.119
depth             0.0056      0.004      1.525      0.127      -0.002       0.013
table            -0.0133      0.002     -6.356      0.000      -0.017      -0.009
x                -0.3050      0.032     -9.531      0.000      -0.368      -0.242
y                 0.3039      0.034      8.934      0.000       0.237       0.371
z                -0.1392      0.024     -5.742      0.000      -0.187      -0.092
cut_Good          0.0940      0.011      8.755      0.000       0.073       0.115
cut_Ideal         0.1523      0.010     14.581      0.000       0.132       0.173
cut_Premium       0.1485      0.010     14.785      0.000       0.129       0.168
cut_Very_Good     0.1258      0.010     12.269      0.000       0.106       0.146
color_E          -0.0471      0.006     -8.429      0.000      -0.058      -0.036
color_F          -0.0627      0.006    -11.075      0.000      -0.074      -0.052
color_G          -0.1007      0.006    -18.258      0.000      -0.112      -0.090
color_H          -0.2077      0.006    -35.323      0.000      -0.219      -0.196
color_I          -0.3240      0.007    -49.521      0.000      -0.337      -0.311
color_J          -0.4686      0.008    -58.186      0.000      -0.484      -0.453
clarity_IF        0.9998      0.016     62.524      0.000       0.968       1.031
clarity_SI1       0.6390      0.014     46.643      0.000       0.612       0.666
clarity_SI2       0.4296      0.014     31.177      0.000       0.403       0.457
clarity_VS1       0.8381      0.014     59.986      0.000       0.811       0.865
clarity_VS2       0.7660      0.014     55.618      0.000       0.739       0.793
clarity_VVS1      0.9421      0.015     63.630      0.000       0.913       0.971
clarity_VVS2      0.9314      0.014     64.730      0.000       0.903       0.960
==============================================================================
Omnibus:                     4696.785   Durbin-Watson:                   1.994
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            17654.853
Skew:                           1.208   Prob(JB):                         0.00
Kurtosis:                       7.076   Cond. No.                         57.0
==============================================================================

After dropping the depth variable:

OLS Regression Results
==============================================================================
Dep. Variable:                  price   R-squared:                       0.942
Model:                            OLS   Adj. R-squared:                  0.942
Method:                 Least Squares   F-statistic:                 1.390e+04
Date:                Fri, 15 Jan 2021   Prob (F-statistic):               0.00
Time:                        22:16:56   Log-Likelihood:                 2953.5
No. Observations:               18870   AIC:                            -5861.
Df Residuals:                   18847   BIC:                            -5680.
Df Model:                          22
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept        -0.7567      0.016    -46.991      0.000      -0.788      -0.725
carat             1.1020      0.009    122.331      0.000       1.084       1.120
table            -0.0139      0.002     -6.770      0.000      -0.018      -0.010
x                -0.3156      0.031    -10.101      0.000      -0.377      -0.254
y                 0.2834      0.031      9.069      0.000       0.222       0.345
z                -0.1088      0.014     -7.883      0.000      -0.136      -0.082
cut_Good          0.0951      0.011      8.876      0.000       0.074       0.116
cut_Ideal         0.1512      0.010     14.508      0.000       0.131       0.172
cut_Premium       0.1474      0.010     14.711      0.000       0.128       0.167
cut_Very_Good     0.1255      0.010     12.239      0.000       0.105       0.146
color_E          -0.0471      0.006     -8.439      0.000      -0.058      -0.036
color_F          -0.0627      0.006    -11.082      0.000      -0.074      -0.052
color_G          -0.1007      0.006    -18.246      0.000      -0.111      -0.090
color_H          -0.2076      0.006    -35.306      0.000      -0.219      -0.196
color_I          -0.3237      0.007    -49.497      0.000      -0.337      -0.311
color_J          -0.4684      0.008    -58.169      0.000      -0.484      -0.453
clarity_IF        1.0000      0.016     62.544      0.000       0.969       1.031
clarity_SI1       0.6398      0.014     46.738      0.000       0.613       0.667
clarity_SI2       0.4302      0.014     31.232      0.000       0.403       0.457
clarity_VS1       0.8386      0.014     60.042      0.000       0.811       0.866
clarity_VS2       0.7667      0.014     55.691      0.000       0.740       0.794
clarity_VVS1      0.9424      0.015     63.655      0.000       0.913       0.971
clarity_VVS2      0.9319      0.014     64.784      0.000       0.904       0.960
==============================================================================
Omnibus:                     4699.504   Durbin-Watson:                   1.994
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            17704.272
Skew:                           1.208   Prob(JB):                         0.00
Kurtosis:                       7.084   Cond. No.                         56.5
==============================================================================

To bring the VIF values down further, we can drop one of each pair of
highly correlated variables; dropping such variables reduces the level
of multicollinearity.

1.4 Inference: Based on these predictions, what are the business insights and recommendations.

We had a business problem: predict the price of a stone and provide insights for the
company on the profits at different price slots. From the EDA we could understand that
the Ideal cut brought the most profit to the company. The colours H, I and J have brought
profits for the company. In clarity, there were no flawless stones, and no profits were
coming from I1, I2 and I3 stones. The Ideal, Premium and Very Good cut types were bringing
profits, whereas Fair and Good were not.
The model captures about 94% of the variation in price (R-squared = 0.942), explained by
the predictors in the training set.
Using statsmodels we can re-run the model to obtain p-values and coefficients, which give a
better understanding of the relationships; variables with p-values above 0.05 can be dropped
and the model re-run for better results.
For this reason, the depth column was dropped in a later iteration.

The equation:

price = (-0.76) * Intercept + (1.10) * carat + (-0.01) * table + (-0.32) * x + (0.28) * y + (-0.11) * z + (0.10) * cut_Good + (0.15) * cut_Ideal + (0.15) * cut_Premium + (0.13) * cut_Very_Good + (-0.05) * color_E + (-0.06) * color_F + (-0.10) * color_G + (-0.21) * color_H + (-0.32) * color_I + (-0.47) * color_J + (1.00) * clarity_IF + (0.64) * clarity_SI1 + (0.43) * clarity_SI2 + (0.84) * clarity_VS1 + (0.77) * clarity_VS2 + (0.94) * clarity_VVS1 + (0.93) * clarity_VVS2

Recommendations
1. The Ideal, Premium and Very Good cut types are the ones bringing in profits, so
marketing could focus on these to bring in more profit.
2. The clarity of the stone is the next most important attribute: the clearer the
stone, the higher the profit.

The best attributes are carat, y (the width of the stone) and the clarity indicators:

clarity_IF
clarity_SI1
clarity_SI2
clarity_VS1
clarity_VS2
clarity_VVS1
clarity_VVS2
THE END
LOGISTIC REGRESSION AND LDA

PREPARED BY
SUNIRA

LOGISTIC REGRESSION AND LDA


You are hired by a tour and travel agency which deals in selling holiday
packages. You are provided details of 872 employees of a company. Among
these employees, some opted for the package and some didn't. You have to
help the company in predicting whether an employee will opt for the
package or not on the basis of the information given in the data set. Also,
find out the important factors on the basis of which the company will focus
on particular employees to sell their packages.

Data Dictionary:

Variable Name: Description

Holliday_Package: Opted for holiday package, yes/no
Salary: Employee salary
age: Age in years
educ: Years of formal education
no_young_children: Number of young children (younger than 7 years)
no_older_children: Number of older children
foreign: Foreigner, yes/no

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and
do null value condition check, write an inference on it. Perform
Univariate and Bivariate Analysis. Do exploratory data analysis.

Loading all the necessary libraries for the model building.

Now, reading the head and tail of the dataset to check whether the data
has been fed properly.
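A minimal sketch of the ingestion step (the file name Holliday_Package.csv is an assumption):

    import pandas as pd

    hp = pd.read_csv("Holliday_Package.csv")
    print(hp.head())
    print(hp.tail())
    print(hp.shape)
    hp.info()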

HEAD OF THE DATA

Table 1.1

TAIL OF THE DATA

Table 1.2

SHAPE OF THE DATA (872, 8)


INFO

 No null values in the dataset.

 We have integer and object data types.

DATA DESCRIBE

Table 1.3

We have both integer and continuous data.

Holliday_Package is our target variable.
Salary, age, educ, no_young_children, no_older_children and foreign are
the attributes we have to examine to help the company predict whether a
person will opt for the holiday package or not.

Check for duplicates in data

Unique values in the categorical data

HOLLIDAY_PACKAGE: 2
Yes 401
No 471
Name: Holliday_Package, dtype: int64

FOREIGN: 2
Yes 216
No 656
Name: foreign, dtype: int64

Percentage of target

This split indicates that about 46% of employees opted for the holiday package.

CATEGORICAL UNIVARIATE ANALYSIS

FOREIGN

Fig 1.1

HOLIDAY PACKAGE

Fig 1.2
HOLIDAY PACKAGE VS SALARY

Fig 1.3
We can see that employees with salaries below 150,000 are the ones who
have opted for the holiday package.

HOLIDAY PACKAGE VS AGE

Fig 1.4
HOLIDAY PACKAGE VS EDUC

Fig 1.5

HOLIDAY PACKAGE VS YOUNG CHILDREN


Fig 1.6

HOLIDAY PACKAGE VS OLDER CHILDREN


Fig 1.7

AGE VS SALARY VS HOLIDAY PACKAGE


Fig 1.8

Fig 1.9

Employees aged 50 to 60 seem not to take the holiday package, whereas
those aged 30 to 50 with salaries below 50,000 have opted for it more
often.

EDUC VS SALARY VS HOLIDAY PACKAGE


Fig 1.10

Fig 1.11

YOUNG CHILDREN VS AGE VS HOLIDAY PACKAGE


Fig 1.12

Fig 1.13

OLDER CHILDREN VS AGE VS HOLIDAY_PACKAGE


Fig 1.14

Fig 1.15

BIVARIATE ANALYSIS
DATA DISTRIBUTION

Fig 1.16

There is little correlation between the variables, and the distributions
appear roughly normal. There is no large difference in the data
distribution between the holiday-package classes; no clear separation of
the two classes is visible in the data.

Fig 1.17
No multicollinearity in the data.
TREATING OUTLIERS
BEFORE OUTLIER TREATMENT
We have outliers in the dataset; since LDA works on numerical
computation, treating the outliers will help the model perform better.
Fig 1.18

AFTER OUTLIER TREATMENT


Fig 1.19

No outliers in the data, all outliers have been treated.

2.2 Do not scale the data. Encode the data (having string values) for Modelling.
Data Split: Split the data into train and test (70:30). Apply Logistic Regression and
LDA (linear discriminant analysis).
ENCODING CATEGORICAL VARIABLE

Table 1.4
The encoding lets the logistic regression model work with the string-valued columns.
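A sketch of the encoding and the 70:30 split, with both classifiers fit at the end (the yes/no mapping, the random_state and the dropped serial-number column are assumptions):

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    # Drop a serial-number column if present (assumption)
    hp = hp.drop(columns=[c for c in hp.columns if c.lower().startswith("unnamed")])

    # Encode the two string columns as 0/1
    for col in ["Holliday_Package", "foreign"]:
        hp[col] = hp[col].map({"no": 0, "yes": 1})

    X = hp.drop(columns="Holliday_Package")
    y = hp["Holliday_Package"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=1, stratify=y
    )

    log_model = LogisticRegression(max_iter=10000).fit(X_train, y_train)
    lda_model = LinearDiscriminantAnalysis().fit(X_train, y_train)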

GRID SEARCH METHOD:

Grid search is used with the logistic regression to find the optimal
solver and its parameters.

The grid search selects the liblinear solver, which is suitable for
small datasets. The tolerance and penalty were also found using grid
search.

Predicting the training data,
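A sketch of the grid search; the parameter grid shown is an assumption (liblinear is kept because it supports both l1 and l2 penalties and suits small datasets):

    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import LogisticRegression

    param_grid = {
        "solver": ["liblinear"],
        "penalty": ["l1", "l2"],
        "tol": [1e-4, 1e-5, 1e-6],
    }
    grid = GridSearchCV(LogisticRegression(max_iter=10000), param_grid, cv=3)
    grid.fit(X_train, y_train)
    print(grid.best_params_)

    best_log = grid.best_estimator_
    ytrain_pred = best_log.predict(X_train)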

Table 1.5

2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets
using Accuracy, Confusion Matrix; plot the ROC curve and get the ROC_AUC score for
each model. Final Model: Compare both models and write an inference on which model
is best/optimized.

CONFUSION MATRIX TRAIN DATA

Fig 1.20
CONFUSION MATRIX FOR TEST DATA

Fig 1.21
ACCURACY

AUC, ROC CURVE FOR TRAIN DATA


Fig 1.22

AUC, ROC CURVE FOR TEST DATA
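A sketch of how the accuracy, confusion matrices and ROC curves for both sets may have been produced, assuming best_log from the grid search:

    import matplotlib.pyplot as plt
    from sklearn.metrics import (accuracy_score, confusion_matrix,
                                 roc_auc_score, roc_curve)

    for name, X_, y_ in [("train", X_train, y_train), ("test", X_test, y_test)]:
        pred = best_log.predict(X_)
        prob = best_log.predict_proba(X_)[:, 1]
        print(name, "accuracy:", accuracy_score(y_, pred))
        print(confusion_matrix(y_, pred))
        print(name, "AUC:", roc_auc_score(y_, prob))
        fpr, tpr, _ = roc_curve(y_, prob)
        plt.plot(fpr, tpr, label=f"{name} ROC")
    plt.plot([0, 1], [0, 1], linestyle="--")  # chance line
    plt.legend()
    plt.show()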



LDA
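A sketch of fitting and scoring the LDA model (default parameters assumed):

    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.metrics import classification_report

    lda_model = LinearDiscriminantAnalysis().fit(X_train, y_train)
    print("Train score:", lda_model.score(X_train, y_train))
    print("Test score:", lda_model.score(X_test, y_test))
    print(classification_report(y_train, lda_model.predict(X_train)))
    print(classification_report(y_test, lda_model.predict(X_test)))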

PREDICTING THE VARIABLE

MODEL SCORE

CLASSIFICATION REPORT TRAIN DATA



MODEL SCORE

CLASSIFICATION REPORT TEST DATA

CHANGING THE CUT-OFF VALUE TO FIND THE OPTIMAL VALUE THAT GIVES BETTER
ACCURACY AND F1 SCORE
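A sketch of the cut-off sweep, assuming lda_model and the training split from above:

    import numpy as np
    from sklearn.metrics import accuracy_score, f1_score

    probs = lda_model.predict_proba(X_train)[:, 1]
    for cutoff in np.arange(0.1, 1.0, 0.1):
        pred = (probs > cutoff).astype(int)
        print(f"cutoff={cutoff:.1f}  "
              f"accuracy={accuracy_score(y_train, pred):.3f}  "
              f"f1={f1_score(y_train, pred):.3f}")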

Fig 1.23

Fig 1.31

AUC AND ROC CURVE

Fig 1.32

Table 1.6

Comparing these two models, we find the results are essentially the
same; however, LDA tends to work better when the target variable is
categorical.

2.4 Inference: Based on these predictions, what are the insights and
recommendations?

Please explain and summarise the various steps performed in this project.
There should be proper business interpretation and actionable insights
present.

We had a business problem where we needed to predict whether an employee
would opt for a holiday package or not. For this problem we made
predictions with both logistic regression and linear discriminant
analysis, and both gave essentially the same results.
The EDA clearly indicates that people aged above 50 are not very
interested in holiday packages; this is one reason we find older
employees not opting for them. People aged roughly 30 to 50 generally
opt for holiday packages. Employees aged 50 to 60 seem not to take the
holiday package, whereas those aged 30 to 50 with salaries below 50,000
have opted for it more often.

The important factors deciding the predictions are salary, age and educ.

Recommendations
1. To improve uptake among employees above the age of 50, packages to
religious destinations could be offered.
2. For people earning more than 150,000, premium vacation packages
could be offered.
3. For employees with a larger number of older children, packages to
family-friendly holiday destinations could be offered.

THE END
