Predictive Modeling Business Report
Submitted by:
Dev Kumar
You are hired by Gem Stones Co. Ltd., a cubic zirconia manufacturer. You are
provided with a dataset containing the prices and other attributes of almost
27,000 cubic zirconia stones (cubic zirconia is an inexpensive diamond
alternative with many of the same qualities as a diamond). The company earns
different profits in different price slots. You have to help the company
predict the price of a stone on the basis of the details given in the dataset,
so it can distinguish between higher-profit and lower-profit stones and
thereby improve its profit share. Also, provide them with the 5 attributes
that are most important.
Introduction to the dataset:
"carat", "cut", "color", "clarity", "depth", "table", "x", "y" and "z" are the
independent variables/features.
"price" is the dependent variable/feature.
Data Description:
RangeIndex: 26967 entries, 0 to 26966
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 carat 26967 non-null float64
1 cut 26967 non-null object
2 color 26967 non-null object
3 clarity 26967 non-null object
4 depth 26270 non-null float64
5 table 26967 non-null float64
6 x 26967 non-null float64
7 y 26967 non-null float64
8 z 26967 non-null float64
9 price 26967 non-null int64
dtypes: float64(6), int64(1), object(3)
memory usage: 2.1+ MB
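The summary above can be reproduced with a short pandas snippet. This is a
minimal sketch; the file name cubic_zirconia.csv is an assumption.

import pandas as pd

# Load the cubic zirconia dataset (file name is an assumption)
df = pd.read_csv("cubic_zirconia.csv")

df.info()                             # dtypes and non-null counts per column
print(df.describe())                  # count, mean, std, min, quartiles, max
print(df.describe(include="object"))  # count, unique, top, freq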
Insight
From the above table we can see that there are null values present in the
depth column of the dataset. There are a total of 26967 rows & 10 columns in
this dataset, indexed from 0 to 26966. Out of the 10 variables, 6 are float64,
3 are object and 1 is int64. Memory used by the dataset: 2.1+ MB.
From the summary table we can infer the count, mean, std, min, 25%, 50%
(median), 75% and max values of all the numeric variables present in the dataset.
From the summary table we can also infer the count, unique, top and freq of
all the categorical variables present in the dataset.
Next, we will plot the Histograms and Boxplot for each column:
An outlier is a data point positioned outside the whiskers of the box plot; it
represents a value numerically distant from the rest of the dataset.
A boxplot is a visual representation of the distribution of numerical data
through its quartiles, and is also a tool used to detect outliers in the data.
Below is the boxplot for the dataset columns:
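A minimal sketch of how these plots might be generated with seaborn and
matplotlib, assuming df is the DataFrame loaded earlier:

import matplotlib.pyplot as plt
import seaborn as sns

# Distribution plot and boxplot for one numeric column, e.g. "carat"
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.histplot(df["carat"], kde=True, ax=axes[0])  # histogram with density curve
sns.boxplot(x=df["carat"], ax=axes[1])           # whiskers flag the outliers
plt.show()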
Figure 1: Distplot and boxplot of "carat".
Insight
*The average depth (the height of a cubic zirconia, measured from the culet to
the table, divided by its average girdle diameter) is around 61.800.
*The standard deviation of depth is 1.412.
*The 25%, 50% (median) and 75% values of depth are 61.000, 61.800 and 62.500.
The skewness indicates that the distribution is approximately normal.
*Depth has outliers.
Insight
*Table (the width of the cubic zirconia's table expressed as a percentage of
its average diameter) ranges from a minimum of 49.000 to a maximum of 79.000.
*The average table value is around 57.455.
*The standard deviation of table is 2.232.
*The 25%, 50% (median) and 75% values of table are 56.000, 57.000 and 59.000.
*Table has outliers.
Insight
The average price of a cubic zirconia is around 3937.526.
The standard deviation of price is 4022.55.
The 25%, 50% (median) and 75% values of price are 945.00, 2375.00 and 5356.00.
Countplot:
A countplot is essentially a histogram or bar graph for categorical variables.
Piechart:
A pie chart is a circle divided into sectors that each represent a proportion
of the whole. It is often used to show proportions, where the sum of the
sectors equals 100%.
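A minimal sketch of how a countplot and pie chart might be drawn for a
categorical feature such as "cut", assuming df from earlier:

import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.countplot(x="cut", data=df, ax=axes[0])                      # frequency bars
df["cut"].value_counts().plot.pie(autopct="%.1f%%", ax=axes[1])  # shares in %
axes[1].set_ylabel("")  # drop the default axis label on the pie
plt.show()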
Figure 8: Countplot and pie plot of the "cut" feature.
Insights:
There are 5 types of cut quality of cubic zirconia present in the dataset:
'Ideal', 'Premium', 'Very Good', 'Good' & 'Fair'.
Cut quality in increasing order is: Fair, Good, Very Good, Premium, Ideal.
40.11% of cubic zirconia have Ideal cut quality, the maximum among the 5 cut
qualities present in the data.
25.6% of cubic zirconia have Premium cut quality.
22.4% of cubic zirconia have Very Good cut quality.
9.0% of cubic zirconia have Good cut quality.
Only 2.9% of cubic zirconia have Fair cut quality, the minimum among the 5 cut
qualities present in the data.
Figure 9: Countplot and pie plot of the "color" feature.
Insights:
There are 7 types of color of cubic zirconia present in the dataset: 'D', 'E',
'F', 'G', 'H', 'I' & 'J', with D being the worst and J the best.
21% of cubic zirconia are of G color, the maximum among the 7 colors present
in the data.
18.3% of cubic zirconia are of E color.
17.5% of cubic zirconia are of F color.
15.2% of cubic zirconia are of H color.
12.4% of cubic zirconia are of D color.
10.3% of cubic zirconia are of I color.
5.3% of cubic zirconia are of J color, the minimum among the 7 colors present
in the data.
Figure 10: Countplot and pie plot of the "clarity" feature.
Insights
There are 8 types of clarity of cubic zirconia present in the dataset: 'IF',
'VVS1', 'VVS2', 'VS1', 'VS2', 'SI1', 'SI2', 'I1'. Cubic zirconia clarity
refers to the absence of inclusions and blemishes (in order from worst to
best): IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1.
24.4% of cubic zirconia are of SI1 clarity, the maximum among the 8 clarity
grades present in the data.
22.6% of cubic zirconia are of VS2 clarity.
16.9% of cubic zirconia are of SI2 clarity.
15.2% of cubic zirconia are of VS1 clarity.
9.4% of cubic zirconia are of VVS2 clarity.
6.8% of cubic zirconia are of VVS1 clarity.
3.3% of cubic zirconia are of IF clarity.
1.4% of cubic zirconia are of I1 clarity, the minimum among the 8 clarity
grades present in the data.
Bivariate Analysis:
Scatter Plot
A scatter plot (scatter chart, scatter graph) uses dots to represent values
for two different numeric variables. The position of each dot on the
horizontal and vertical axes indicates the values for an individual data
point. Scatter plots are used to observe relationships between variables.
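A minimal sketch of one such scatter plot, e.g. price against carat, assuming
df from earlier:

import matplotlib.pyplot as plt
import seaborn as sns

sns.scatterplot(x="carat", y="price", data=df)  # one dot per stone
plt.show()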
Insights:
From the above plot we see that carat and price show a strong positive
relationship: as carat increases, price also increases.
Figure 12: Scatterplot of price vs depth.
Insights
From the above plot we see that depth and price show no relationship, as the
data points are scattered around the mean.
Insights
From the above plot we see that table and price show a weak relationship, as
the data points are scattered around the mean.
Figure 14: Scatterplot of price vs x (length).
Insights
From the above plot we see that x (length) and price show a strong positive
relationship: as x increases, price also increases.
From the above plot we see that y (width) and price show a strong positive
relationship: as y increases, price also increases.
Figure 16: Scatterplot of price vs z (height).
Insights
From the above plot we see that z (height) and price show a strong positive
relationship: as z increases, price also increases.
From the above plot we see that carat and x (length) show a positive
relationship: as carat increases, x (length) also increases.
Figure 18: Scatterplot of carat vs z (height).
Insights
From the above plot we see that carat and z (height) show a positive
relationship: as carat increases, z (height) also increases.
Insights:
Note - cubic zirconia clarity refers to the absence of inclusions and
blemishes (in order from worst to best): IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1.
Cubic zirconia of SI1 clarity have the maximum count of 'Very Good' cut quality.
Heatmap:
Pairplot:
1.2 Impute null values if present, also check for the values which are equal
to zero. Do they have any meaning or do we need to change them or drop them?
Check for the possibility of combining the sub-levels of an ordinal variable
and take actions accordingly. Explain why you are combining these sub-levels
with appropriate reasoning.
Checking for Null Values.
carat 0
cut 0
color 0
clarity 0
depth 697
table 0
x 0
y 0
z 0
price 0
dtype: int64
Insights
From the above output we infer that only the depth variable has null values:
697 of them. Theoretically, 25 to 30% is the maximum share of missing values
allowed, beyond which we might want to drop the variable from the analysis.
Here we have only about 2.58% null values in the depth variable, so we are
going to impute the null values with the median using the pandas .replace()
function.
Imputation is the process of replacing missing data with substituted values
such as the mean or median: if outliers are present we impute with the median;
if not, we impute with the mean. Because missing data can create problems for
analysis, imputation is seen as a way to avoid the pitfalls involved with
listwise deletion of cases that have missing values. For imputation we are
going to use the .replace() function, which returns a copy of the column with
all occurrences of the old value (here NaN) replaced by the new value.
We have missing values in the depth column, and there are various ways of
treating missing values in a dataset; which technique to use depends on the
type of data you are dealing with. In this exercise, we will use the
.replace() function on the numerical column and replace the null values with
the median value, as sketched below.
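A minimal sketch of this imputation step, assuming df is the DataFrame from
earlier:

import numpy as np

# Impute missing depth values with the median (median because depth has outliers)
depth_median = df["depth"].median()
df["depth"] = df["depth"].replace(np.nan, depth_median)
# Equivalent alternative: df["depth"] = df["depth"].fillna(depth_median)
print(df["depth"].isnull().sum())  # expected to print 0 after imputation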
Check for the values which are equal to zero.
Bad values were found in the x (length), y (width) and z (height) columns of
the dataset. As x, y and z are the length, width and height of the cubic
zirconia in mm, and we found that the minimum value of each is zero, these
values do not make sense: length, width and height can't be zero. Thus, we
need to treat and clean them, as sketched below.
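A minimal sketch of this treatment: mark the zeros as missing, then impute
with the column median (an assumption, consistent with the imputation
approach above).

import numpy as np

for col in ["x", "y", "z"]:
    df[col] = df[col].replace(0, np.nan)        # zero dimensions are bad values
    df[col] = df[col].fillna(df[col].median())  # impute with the column median
print(df[["x", "y", "z"]].min())                # minimums should no longer be 0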
Conclusion:
*We successfully imputed the bad (zero) values present in the x (length),
y (width) and z (height) columns of the dataset. We can now clearly see that
the minimum values of x, y and z are no longer zero, and we have appropriate
minimum values for these columns.
Getting unique counts of all object columns:
cut
Ideal 10805
Premium 6886
Very Good 6027
Good 2435
Fair 780
Name: cut, dtype: int64
color
G 5653
E 4916
F 4723
H 4095
D 3341
I 2765
J 1440
Name: color, dtype: int64
clarity
SI1 6565
VS2 6093
SI2 4564
VS1 4087
VVS2 2530
VVS1 1839
IF 891
I1 364
Name: clarity, dtype: int64
Cut:
Cut quality in increasing order is Fair, Good, Very Good, Premium, Ideal. We
are going to club the Good and Very Good labels.
cut
Ideal 10805
Premium 6886
Very Good 6027
Good 2435
Fair 780
Name: cut, dtype: int64
Ideal 10805
Very Good 8462
Premium 6886
Fair 780
Name: cut, dtype: int64
Insights
After grouping we have 4 labels in 'cut'; quality in increasing order is Fair,
Very Good, Premium, Ideal. A sketch of this clubbing step follows.
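A minimal sketch of the clubbing step, assuming the raw string labels are
still in place:

# Merge 'Good' into 'Very Good', leaving four ordered cut labels
df["cut"] = df["cut"].replace({"Good": "Very Good"})
print(df["cut"].value_counts())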
Color
Color - color of the cubic zirconia, with D being the worst and J the best.
Here we relabelled the colors in order to understand them better, and will
encode them with label encoding for model building.
Best 1440
Very Good 2765
Worst 3341
Good 4095
Bad 4723
Very Bad 4916
Fair 5653
Name: color, dtype: int64
Best 1440
Very Good 2765
Worst 3341
Bad 4723
Very Bad 4916
Good 9748
Name: color, dtype: int64
Conclusion:
Here we clubbed the Fair and Good labels to reduce the number of labels.
Clarity
Cubic zirconia clarity refers to the absence of inclusions and blemishes (in
order from worst to best): IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1. Here we
relabelled the clarity grades in order to understand them better, and will
encode them with label encoding for model building.
Best 364
Worst 891
Very Bad 1839
Bad 2530
Fair 4087
Very Good 4564
Good 6093
Better 6565
Name: clarity, dtype: int64
Here we are going to club Bad & Very Bad, Fair & Good, and Better & Very Good
to reduce the labels.
Best 364
Worst 891
Very Bad 4369
Good 10180
Very Good 11129
Name: clarity, dtype: int64
1.3 Encode the data (having string values) for modelling. Split the data into
train and test (70:30). Apply linear regression using scikit-learn. Perform
checks for significant variables using the appropriate method from
statsmodels. Create multiple models and check the performance of predictions
on the train and test sets using R-square, RMSE & adjusted R-square. Compare
these models and select the best one with appropriate reasoning.
Outliers are unusual values in your dataset, and they can distort statistical
analyses and violate their assumptions. Outliers increase the variability in
your data, which decreases statistical power. Consequently, excluding outliers
can cause your results to become statistically significant. That is why we are
doing the outlier treatment. We can also build the model without performing
outlier treatment and compare the results.
Label Encoding:
Code Count Label
5 1440 Best
4 2765 Very Good
0 3341 Worst
2 4723 Bad
1 4916 Very Bad
3 9748 Good
Name: color, dtype: int64
Table 13: Encoding of Color Feature
Code Count Label
4 364 Best
0 891 Worst
1 4369 Very Bad
2 10180 Good
3 11129 Very Good
Name: clarity, dtype: int64
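A minimal sketch of the encoding, split and first linear regression model. The
exact code-to-label mapping in the tables above was assigned manually in the
report; .cat.codes below is a simplified stand-in, and random_state is an
assumption.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Label-encode the object columns (ordering here is a simplification)
for col in ["cut", "color", "clarity"]:
    df[col] = df[col].astype("category").cat.codes

X = df.drop("price", axis=1)
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)  # 70:30 split

lr = LinearRegression().fit(X_train, y_train)
print("Train R^2:", r2_score(y_train, lr.predict(X_train)))
print("Test R^2 :", r2_score(y_test, lr.predict(X_test)))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, lr.predict(X_test))))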
OLS Report:
Results:
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is
correctly specified.
[2] The condition number is large, 1.01e+04. This might indicate that there
are strong multicollinearity or other numerical problems.
Table 16: OLS Model Report After Label Encoding of Problem 1 (with outlier
treatment)
VIF SCORE:
Insights
A VIF value should ideally be between 1 and 5, but here we see that all the
VIF values are very high, which means there is strong multicollinearity
between the variables. A sketch of the VIF computation follows.
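A minimal sketch of the statsmodels OLS fit and VIF computation, assuming
X_train and y_train from the split above:

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# OLS summary gives coefficients, p-values and the condition-number warning
ols_model = sm.OLS(y_train, sm.add_constant(X_train)).fit()
print(ols_model.summary())

# VIF per predictor: values far above 5 signal multicollinearity
for i, col in enumerate(X_train.columns):
    print(col, "--->", variance_inflation_factor(X_train.values, i))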
OLS Report:
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is
correctly specified.
[2] The condition number is large, 6.7e+03. This might indicate that there
are strong multicollinearity or other numerical problems.
Table 17: Model Report After Label Encoding of Problem 1 (without outlier
treatment)
VIF SCORE:
A VIF value should ideally be between 1 and 5, but here we see that all the
VIF values are very high, which means there is strong multicollinearity
between the variables.
OLS Report:
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is
correctly specified.
Table 19: OLS Model Report After Label Encoding of Problem 1 (without outlier
treatment, using Z-score)
VIF:
carat ---> 81.5033418559768
cut ---> 6.860729682922422
color ---> 4.27389351773964
clarity ---> 9.369000024698023
depth ---> 558.577146731718
table ---> 556.1039170515155
x ---> 1133.1266837484013
y ---> 347.88572635041317
z ---> 382.0311260961392
Linear Regression Model 4 (Applied Feature Engineering):
OLS Report:
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is
correctly specified.
[2] The condition number is large, 1.22e+04. This might indicate that there
are strong multicollinearity or other numerical problems.
Conclusion:
Hypothesis Testing:
1.4 Inference: Based on these predictions, what are the business insights and
recommendations?
Insights:
The dataset has 7 fields:
Data Description:
RangeIndex: 872 entries, 0 to 871
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Holliday_Package 872 non-null object
1 Salary 872 non-null int64
2 age 872 non-null int64
3 educ 872 non-null int64
4 no_young_children 872 non-null int64
5 no_older_children 872 non-null int64
6 foreign 872 non-null object
dtypes: int64(5), object(2)
memory usage: 47.8+ KB
Insights
From the above results we can see that there are no null values present in the
dataset. There are a total of 872 rows & 7 columns in this dataset, indexed
from 0 to 871.
Out of the 7 variables, 5 are int64 and 2 are object. Memory used by the
dataset: 47.8+ KB.
Insights
From the above table we can infer the count, mean, std, min, 25%, 50%
(median), 75% and max values of all the numeric variables present in the
dataset.
From the above table we can infer the count, unique, top and freq of all the
categorical variables present in the dataset.
Holliday_Package 0
Salary 0
age 0
educ 0
no_young_children 0
no_older_children 0
foreign 0
dtype: int64
Insights:
From the above output we infer that there are no null values in the dataset.
Insights
A histogram takes as input a numeric variable only. The variable is cut into
several bins, and the number of observations per bin is represented by the
height of the bar. It is possible to represent the distribution of several
variables on the same axis using this technique.
A boxplot gives a nice summary of one or several numeric variables. The line
that divides the box into two parts represents the median of the data. The
ends of the box show the upper and lower quartiles. The extreme lines show the
highest and lowest values excluding outliers.
Figure 23: Distplot and boxplot of Salary in Problem 2.
Insights
Figure 25: Distplot and boxplot of educ in Problem 2.
Insights
Countplot
A countplot is essentially a histogram or bar graph for discrete & categorical
variables.
Figure 26: Countplot of no_young_children in Problem 2.
Insights
76.26% of employees have 0 young children (younger than 7 years).
16.85% of employees have 1 young child (younger than 7 years).
6.3% of employees have 2 young children (younger than 7 years).
0.57% of employees have 3 young children (younger than 7 years).
Insights
Bivariate Analysis:
Scatter Plot
A scatter plot (scatter chart, scatter graph) uses dots to represent values
for two different numeric variables. The position of each dot on the
horizontal and vertical axes indicates the values for an individual data
point. Scatter plots are used to observe relationships between variables.
Around 61.2% of employees who are not foreigners opted 'no' for the holiday
package.
Around 38.71% of employees who are not foreigners opted 'yes' for the holiday
package.
Around 68.05% of employees who are foreigners opted 'yes' for the holiday
package.
Around 31.94% of employees who are foreigners opted 'no' for the holiday
package.
Heatmap:
2.2 Do not scale the data. Encode the data (having string values) for
Modelling. Data Split: Split the data into train and test (70:30). Apply Logistic
Regression and LDA (linear discriminant analysis).
Result:
Label encoding has been done for the categorical columns and all columns are
now numeric.
After performing EDA and various data preprocessing & data preparation steps,
our dataset is now ready for supervised modelling algorithms like Logistic
Regression & LDA (Linear Discriminant Analysis). A sketch of the modelling
step follows.
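A minimal sketch of the split and the two models; the DataFrame name df2 and
random_state are assumptions.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = df2.drop("Holliday_Package", axis=1)  # encoded holiday-package data
y = df2["Holliday_Package"]
x_train, x_test, train_labels, test_labels = train_test_split(
    X, y, test_size=0.30, random_state=1)  # 70:30 split

log_model = LogisticRegression(max_iter=10000).fit(x_train, train_labels)
lda_model = LinearDiscriminantAnalysis().fit(x_train, train_labels)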
Checking the dimensions of the training and test data.
x_train (610, 6)
x_test (262, 6)
train_labels (610,)
test_labels (262,)
Grid search was used for finding the optimal values of the hyperparameters; a
sketch follows. The best parameters found:
{'penalty': 'l2', 'solver': 'newton-cg', 'tol': 0.0001}
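A minimal sketch of the grid search; the parameter grid below is an
illustrative assumption that contains the winning values above.

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

param_grid = {"penalty": ["l2"],
              "solver": ["newton-cg", "lbfgs", "sag"],
              "tol": [0.0001, 0.00001]}
grid = GridSearchCV(LogisticRegression(max_iter=10000), param_grid, cv=3)
grid.fit(x_train, train_labels)
print(grid.best_params_)  # e.g. {'penalty': 'l2', 'solver': 'newton-cg', 'tol': 0.0001}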
2.3 Performance Metrics: Check the performance of predictions on the train and
test sets using accuracy and the confusion matrix. Plot the ROC curve and get
the ROC_AUC score for each model. Final Model: Compare both models and write
an inference on which model is best/optimized.
array([[102, 43],
       [ 50, 67]], dtype=int64)
Figure 33: AUC/ROC and confusion matrix for test data in Problem 2 by applying
Logistic Regression.
Classification Report:
Train Data:
AUC: 74.1%
Accuracy: 67.37%
Precision: 68%
Recall: 56%
f1-Score: 62%
Test Data:
AUC: 70.5%
Accuracy: 65%
Precision: 61%
Recall: 57%
f1-Score: 59%
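A minimal sketch of how these metrics might be computed for the logistic
regression model on the test set; model and variable names follow the sketches
above.

import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score, roc_curve)

pred = log_model.predict(x_test)
prob = log_model.predict_proba(x_test)[:, 1]  # probability of class 1

print(confusion_matrix(test_labels, pred))
print("Accuracy:", accuracy_score(test_labels, pred))
print(classification_report(test_labels, pred))
print("AUC:", roc_auc_score(test_labels, prob))

fpr, tpr, _ = roc_curve(test_labels, prob)  # points for the ROC curve
plt.plot(fpr, tpr)
plt.show()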
Model Evaluation on LDA
array([[254, 72],
[126, 158]], dtype=int64)
Figure 34: AUC /ROC and Confusion Matrix for train data in Problem 2 by
applying LDA
array([[103, 42],
       [ 52, 65]], dtype=int64)
Figure 35: AUC /ROC and Confusion Matrix for test data in Problem 2 by
applying LDA
LDA Conclusion:
Train Data:
AUC: 74%
Accuracy: 68%
Precision: 69%
Recall: 56%
f1-Score: 61%
Test Data:
AUC: 70.5%
Accuracy: 64%
Precision: 61%
Recall: 56%
f1-Score: 58%
Figure 36: ROC curve for the two models on training data in Problem 2.
We will do this custom cut-off exercise on both the training data and the test
data (a sketch follows the tables below).
Table 34: Classification Report of default cut-off train data & the
custom cut-off train data
Table 35: Classification Report of default cut-off test data & the custom
cut-off test data
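A minimal sketch of applying a custom probability cut-off; the 0.4 threshold
is illustrative, not the value used in the report.

import numpy as np
from sklearn.metrics import classification_report

custom_cutoff = 0.4  # illustrative threshold instead of the default 0.5
train_prob = log_model.predict_proba(x_train)[:, 1]
custom_pred = np.where(train_prob > custom_cutoff, 1, 0)
print(classification_report(train_labels, custom_pred))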
2.4 Inference: Based on these predictions, what are the insights and
recommendations?