
PREDICTIVE MODELLING - PROJECT

Gloria Susan Raju


4-11-2021
Problem 1: Linear Regression

You are hired by a company, Gem Stones co ltd, which is a cubic zirconia manufacturer. You are provided with a dataset containing the prices and other attributes of almost 27,000 cubic zirconia stones (an inexpensive diamond alternative with many of the same qualities as a diamond). The company earns different profits in different price slots. You have to help the company predict the price of a stone on the basis of the details given in the dataset, so that it can distinguish between higher-profit and lower-profit stones and improve its profit share. Also, provide them with the 5 most important attributes.

1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null
values, Data types, shape, EDA). Perform Univariate and Bivariate Analysis.
EXPLORATORY DATA ANALYSIS

 The dataset consists of 11 variables – ‘Unnamed: 0, carat, cut, color, clarity, depth, table, x, y, z, price’.

 The variable ‘Unnamed: 0’ is not needed for exploratory data analysis or any further predictions, so we can choose to drop it. After dropping the column, the dataset looks as below:

 The shape of the data is (26967, 10).
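A minimal sketch of these first steps, assuming the file is named cubic_zirconia.csv (the actual file name may differ):

    import pandas as pd

    # Load the dataset and drop the index-like 'Unnamed: 0' column
    df = pd.read_csv("cubic_zirconia.csv")
    df = df.drop(columns=["Unnamed: 0"])

    print(df.shape)   # (26967, 10)
    df.info()         # shows the float, int and object datatypes listed below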
INFORMATION OF DATA

 The data contains float, int and object datatypes.


Carat, depth, table, x, y, z are float datatypes;
Cut, color, clarity are object datatypes;
Price is int datatype.
Checking for Null value:

 We can see that the variable ‘depth’ has a total of 697 null values, i.e., about 2.5% of the values in this column are missing.
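The null-value check can be reproduced with a short sketch like this:

    # Missing values per column, as counts and as a percentage of all rows
    print(df.isnull().sum())                      # 'depth' shows 697 nulls
    print((df.isnull().mean() * 100).round(2))    # roughly 2.5% for 'depth'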

DESCRIPTIVE STATISTICS OF THE DATA

 There are three categorical variables: ‘cut’, ‘color’ and ‘clarity’. Cut has 5 unique values, color has 7 unique values and clarity has 8 unique values.

 ‘carat, depth, table, x, y, z, price’ are continuous variables.

 Price will be the target variable considered while building the Linear Regression model.
CHECKING FOR DUPLICATES IN THE DATA

 After checking for duplicates in the dataset, we find a total of 34 duplicate rows. We can choose to remove them to get better predictions and insights from the model.
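A sketch of the duplicate check and removal:

    # Count duplicate rows, then drop them
    print(df.duplicated().sum())   # 34 duplicate rows
    df = df.drop_duplicates()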
UNIQUE VALUES IN CATEGORICAL VARIABLES

 Cut has 5 unique values: ‘Fair, Good, Very Good, Premium, Ideal’. Quality increases in the order Fair, Good, Very Good, Premium, Ideal, with Ideal being the highest quality and Fair the lowest.

 Color has 7 unique values: ‘J, I, D, H, F, E, G’. D is the best quality and J the worst.

 Clarity has 8 unique values in the dataset. The full grading scale from best to worst is FL, IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1, I2, I3, where FL = flawless and I3 = level 3 inclusions.
UNIVARIATE ANALYSIS
CARAT DISTRIBUTION

 The plot shows the carat weight distribution of the cubic zirconia; it is positively skewed, with a skewness value of 1.114789.

 The boxplot shows that the data contains a large number of outliers.
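The univariate plots in this section can be produced with a pattern like the one below, shown here for carat; the same code applies to the other continuous variables by changing the column name:

    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    sns.histplot(df["carat"], kde=True, ax=axes[0])   # distribution plot
    sns.boxplot(x=df["carat"], ax=axes[1])            # boxplot to spot outliers
    plt.show()

    print(df["carat"].skew())   # ~1.11, i.e. positively skewed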


DEPTH DISTRIBUTION

 Depth is the height of a cubic zirconia, measured from the culet to the table, divided by its average girdle diameter. From the plot we can see that the distribution is almost normal, with a skewness value of -0.026086.

 The boxplot shows a large number of outliers in the depth distribution.
TABLE DISTRIBUTION

 Table is the width of the cubic zirconia's table expressed as a percentage of its average diameter. The plot shows that the data is positively skewed, with a skewness value of 0.765805.

 The box plot shows outliers present in the table data.
LENGTH DISTRIBUTION

 The plot shows the length (x) of the cubic zirconia in mm. The distribution plot shows that the data is positively skewed, with a skewness value of 0.392290.

 The boxplot shows many outliers present in the data.
WIDTH DISTRIBUTION

 The plot shows the width (y) of the cubic zirconia in mm. The distribution plot shows that the data is positively skewed, with a skewness value of 3.867764.

 The boxplot shows that the width distribution contains outliers.
HEIGHT DISTRIBUTION

 The plot shows the distribution of the height (z) of the cubic zirconia in mm. The distribution is positively skewed, with a skewness value of 2.580665.

 The boxplot shows many outliers in the data.
CUT DISTRIBUTION

 Ideal has the largest number of observations in the dataset, whereas Fair has the least. Ideal is the best quality cut and Fair the lowest.

COLOR DISTRIBUTION

 G has the highest count and J the least. D is the best color, followed by E, F, G, H and I, with J being the worst.
CLARITY DISTRIBUTION

 Cubic zirconia clarity refers to the absence of inclusions and blemishes. The plot shows the count of observations for each clarity category. From the plot we can see that SI1 has the highest count, followed by VS2.

 The order from best to worst is FL, IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1, I2, I3.

PRICE DISTRIBUTION

 The above plot shows the price of the cubic zirconia. From the plot we can see that the data is positively skewed, with a skewness value of 1.619116.

 The above boxplot shows that there are many outliers present in the data.

BIVARIATE ANALYSIS
PAIRPLOT DISTRIBUTION

CUT AND PRICE

 The above plot shows the distribution of price across cut types. We can see that Premium has the highest price and Ideal the lowest.
CLARITY AND PRICE

 SI2 has the highest price compared to the other clarity categories.

COLOR AND PRICE

 From the plot, we can see that J has the highest price among all the color categories, followed by I, H, G, F, E and D.
DEPTH AND PRICE

X AND PRICE

Y AND PRICE

Z AND PRICE

CORRELATION BETWEEN VARIABLES

 The above plot shows the correlation between the variables.

 We can see that multicollinearity is present in the data.

 Carat is strongly correlated with x, y, z and price.

 Similarly, x is strongly correlated with y and z.
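A sketch of the correlation heatmap over the numeric columns:

    import matplotlib.pyplot as plt
    import seaborn as sns

    corr = df.select_dtypes(include="number").corr()
    sns.heatmap(corr, annot=True, cmap="coolwarm")   # annotate each correlation value
    plt.show()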

1.2 Impute null values if present, also check for the values which are equal to zero. Do they
have any meaning or do we need to change them or drop them? Do you think scaling is
necessary in this case?
Checking for null value

 Depth is the only variable with null values; only about 2.5% of its values are missing.

 We can choose to impute the null values with the median or the mean. Here I have chosen median imputation. Once the imputation is completed, there are no null values left in the dataset.
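A minimal sketch of the median imputation:

    # Replace the missing 'depth' values with the column median
    df["depth"] = df["depth"].fillna(df["depth"].median())
    print(df.isnull().sum().sum())   # 0 -> no nulls remain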

 A few columns contain some values equal to zero. I chose not to remove or change them.

OUTLIERS

 From the univariate analysis, we found that the data contains outliers, so we can handle them by treating the outliers.

 After treating the outliers, the data looks as below:
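One common way to treat outliers is to cap values at the 1.5 * IQR whiskers; a sketch of that approach (the exact treatment used in the project may differ):

    def cap_outliers(series):
        # Clip values beyond the boxplot whiskers to the whisker limits
        q1, q3 = series.quantile([0.25, 0.75])
        iqr = q3 - q1
        return series.clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

    num_cols = ["carat", "depth", "table", "x", "y", "z", "price"]
    df[num_cols] = df[num_cols].apply(cap_outliers)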

SCALING

 Scaling normalizes the range of the independent variables (features) of the data. Since multicollinearity is present in the data, we can choose to scale it.

 I chose the StandardScaler() method to scale the data. After scaling, each feature has a mean of 0 and a standard deviation of 1.
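A sketch of the scaling step with StandardScaler, which standardizes each numeric column to mean 0 and unit variance:

    from sklearn.preprocessing import StandardScaler

    num_cols = ["carat", "depth", "table", "x", "y", "z", "price"]
    scaler = StandardScaler()
    df[num_cols] = scaler.fit_transform(df[num_cols])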

1.3 Encode the data (having string values) for Modelling. Data Split: Split the data into train
and test (70:30). Apply Linear regression. Performance Metrics: Check the performance of
Predictions on Train and Test sets using Rsquare, RMSE.
ENCODING STRING VALUES

 We use the get_dummies() function to encode the string values for modelling, i.e., to convert the categorical variables into dummy (indicator) variables.
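A sketch of the encoding step; drop_first=True drops one dummy per category to avoid perfect collinearity, which is consistent with the 23 predictor columns reported below:

    import pandas as pd

    df_encoded = pd.get_dummies(df, columns=["cut", "color", "clarity"],
                                drop_first=True, dtype=int)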

 After converting the variables, the data looks as below:

 The columns are

Test and Train Split

 We split the data into train and test sets as 70% and 30%. We copy all the predictor variables into the X data frame and the target variable (price) into the y data frame.

 X Dataframe looks like below

 The shape of X is (26933,23)

 Y data frame looks as below

 Shape of Y is (26933, 1).
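A sketch of the 70:30 split (the random_state value is an assumption):

    from sklearn.model_selection import train_test_split

    X = df_encoded.drop(columns=["price"])   # 23 predictor columns
    y = df_encoded["price"]                  # target variable

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=1)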
LINEAR REGRESSION MODEL:

 We run LinearRegression() to fit the best model on the training data.

 The coefficients for the independent variables are as below:

 The intercept for our model is -0.7689302284856566.

 R square on training data is 0.9402044588687953

 R square on testing data is 0.9419074345242372


 RMSE on Training data is 0.21068012556157584

 RMSE on testing data is 0.20815894795593584
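A sketch of fitting the model and computing the reported metrics:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error, r2_score

    lr = LinearRegression()
    lr.fit(X_train, y_train)

    for name, X_, y_ in [("train", X_train, y_train), ("test", X_test, y_test)]:
        pred = lr.predict(X_)
        print(name,
              "R2:", r2_score(y_, pred),
              "RMSE:", np.sqrt(mean_squared_error(y_, pred)))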


 Variance Inflation Factor values for the data are as below:

 From the VIF values we can still see high collinearity in the data. Ideally, VIF values should be below 5.
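One way to compute the VIF values, using statsmodels on the encoded training data (a sketch):

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    X_const = sm.add_constant(X_train)
    values = X_const.values.astype(float)
    vif = pd.Series(
        [variance_inflation_factor(values, i) for i in range(values.shape[1])],
        index=X_const.columns)
    print(vif)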

INFERENTIAL STATISTICS

 1ST Iteration

 For the 2nd iteration I dropped the variable ‘depth’ to reduce the high collinearity.

 The final equation is:

Price = (-0.77) + (1.08) * carat + (-0.01) * table + (-0.33) * x + (0.26) * y + (-0.05) * z + (0.12) * cut_Good + (0.18) * cut_Ideal + (0.17) * cut_Premium + (0.15) * cut_Very_Good + (-0.05) * color_E + (-0.06) * color_F + (-0.11) * color_G + (-0.21) * color_H + (-0.33) * color_I + (-0.48) * color_J + (1.0) * clarity_IF + (0.63) * clarity_SI1 + (0.42) * clarity_SI2 + (0.83) * clarity_VS1 + (0.76) * clarity_VS2 + (0.94) * clarity_VVS1 + (0.93) * clarity_VVS2

1.4 Inference: Based on these predictions, what are the business insights and recommendations?
 The exploratory analysis clearly showed that stones with Ideal, Premium and Very Good cuts brought in more profit for the company. Hence, we can recommend more marketing strategies to promote these cuts, for example advertising or inviting social media influencers.

 Similarly, the colors H, I and J bring in more profit, so we should maintain this and use these colors to bring in more profit for the company. For the other colors that are not bringing in as much profit, we can either decrease their price or promote them so that they sell.

 Since stones sell best when their clarity is higher, the jeweller should ensure that they are of the finest quality, thereby bringing in more customers.
__________________________________________________________________________________________
__________________________________________________________________________________________

Problem 2: Logistic Regression and LDA

You are hired by a tour and travel agency which deals in selling holiday packages. You are provided details of
872 employees of a company. Among these employees, some opted for the package and some didn't. You
have to help the company in predicting whether an employee will opt for the package or not on the basis of
the information given in the data set. Also, find out the important factors on the basis of which the company
will focus on particular employees to sell their packages.

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition
check, write an inference on it. Perform Univariate and Bivariate Analysis. Do exploratory data
analysis.
EXPLORATORY DATA ANALYSIS

 The dataset consists of 8 variables: ‘Unnamed: 0, Holliday_Package, Salary, age, educ, no_young_children, no_older_children, foreign’.

 Since we do not need the variable ‘Unnamed: 0’ for prediction or model building, we can drop the column. After dropping it, the data looks as below:

INFORMATION OF THE DATA

 The shape of the data is (872, 7).


Null Value checking

 There are no null values present in the dataset.

DESCRIPTIVE STATISTICS

 There are two categorical variables – Holliday_Package and foreign.

 The minimum value for age is 20 and the maximum is 62.

DUPLICATES

 There are no duplicate values in the data.


UNIQUE VALUES OF CATEGORICAL VARIABLES

 Holliday_Package has two values: no and yes. No has a total of 471 values whereas yes has 401 values.

 Foreign has two values: no and yes. No has 656 values and yes has 216 values.

UNIVARIATE ANALYSIS
SALARY DISTRIBUTION

 The above plot shows the salary of the employees. From the plot we can see that the data is positively skewed, with a skewness value of 3.103216.

 From the boxplot, we can see that there are many outliers present in the data.

AGE DISTRIBUTION

 The above plot shows the distribution of age in years. The distribution is positively skewed, with a skewness value of 0.146412.

 From the boxplot, we can find that there are no outliers present in the variable age.

EDUC DISTRIBUTION

 The above plot shows the distribution of years of formal education. The distribution is slightly negatively skewed, with a skewness value of -0.045501.

 The boxplot shows that the variable has outliers.

No_young_children Distribution

 The plot shows the number of young children (below 7 years). The distribution is positively skewed, with a skewness value of 1.946515.

 The boxplot shows outliers present in the data.

No_older_children Distribution

 The plot shows the distribution of the number of older children. The distribution is positively skewed, with a skewness value of 0.953951.

 The boxplot shows there are few outliers present in the data.

Foreign Distribution

 The above plot shows the distribution of foreign. It has two categories; category ‘no’ has more values than ‘yes’.
Holliday_Package Distribution

 The above plot shows the distribution of Holliday_Package. It has two categories, yes and no; ‘no’ has more values than ‘yes’.

BIVARIATE ANALYSIS
PAIR PLOT DATA DISTRIBUTION

 From the pairplot, we can see that there is not much correlation between the variables, and the data distributions look roughly normal without huge variation.

CORRELATION HEATMAP

 There is no strong correlation between the variables.


HOLLIDAY_PACKAGE AND SALARY

 From the plot, we can see that employees with a salary of more than 100000 opted for the Holliday_Package.

Holliday_Package and Age

 Employees aged below 50 are choosing the Holliday_Package, while those above 50 are less likely to take it.
Holliday_Package and educ

 Employees with more years of formal education are choosing the Holliday_Package.

No_young_children and Holliday_package

 Employees with young children are much less likely to choose the Holliday_Package.

No_older_children and Holliday_Package

 Employees with older children are opting for the Holliday_Package.

AGE AND SALARY

 Employees aged between 30 and 50 with a salary of less than 100000 are opting for the Holliday_Package.
EDUC AND SALARY

 Employees with formal education and a salary above 50000 are opting for the Holliday_Package.

REMOVING OUTLIERS

 From the univariate analysis, we found that there are many outliers in the data. For Logistic Regression and LDA, it is better to treat the outliers in order to get the best results.

 After treating the outliers, the data looks as below. There are no outliers left in the data after treatment.

2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data Split:
Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear
discriminant analysis).
ENCODING THE CATEGORICAL VARIABLES FOR MODELLING

 We use the get_dummies() function to encode the string values for modelling, i.e., to convert the categorical variables into dummy (indicator) variables.

 After converting the variables, the data looks as below:

 The new columns are

TRAIN AND TEST SPLIT

 Copying the predictor variables into an X data frame and the target variable into a Y data frame.

 Then we split the data into train and test sets as 70% and 30%.

 Y_train value counts:

 Y_test value counts:

LOGISTIC REGRESSION MODEL

 Fitting the logistic regression model on the training data:

 Predicting probabilities on the test data:
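A sketch of the logistic regression fit and probability prediction, assuming the target has been encoded as 0/1 and the data split as described above (solver settings are assumptions):

    from sklearn.linear_model import LogisticRegression

    log_model = LogisticRegression(max_iter=10000)
    log_model.fit(X_train, y_train)

    test_prob = log_model.predict_proba(X_test)   # probability of each class
    test_pred = log_model.predict(X_test)         # predicted labels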

LINEAR DISCRIMINANT ANALYSIS MODEL

 For the LDA model, we convert the categorical target variable to integers (1 and 0).

 Then we copy the predictor and target variables into X and Y data frames and split the data into train and test sets as 70% and 30%.

 X_train

 Y_train

 X_test

 Y_test

 We fit LinearDiscriminantAnalysis() on the training data.

 We then use the fitted model to predict probabilities on the train and test data.
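A minimal sketch of the LDA fit and probability prediction:

    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    lda_model = LinearDiscriminantAnalysis()
    lda_model.fit(X_train, y_train)

    train_prob = lda_model.predict_proba(X_train)
    test_prob = lda_model.predict_proba(X_test)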

2.3 Performance Metrics: Check the performance of predictions on the train and test sets using Accuracy, Confusion Matrix, ROC curve and ROC_AUC score for each model. Final Model: Compare both models and write an inference on which model is best/optimized.
LOGISTIC REGRESSION MODEL
Confusion Matrix and Classification Report of the Train data

Confusion Matrix and classification report for the test data:

AUC and ROC Curve for the Train data

 ROC AUC score: 0.661

 Accuracy of the train data: 0.6327868


AUC and ROC Curve for the Test Data:

 ROC AUC score: 0.675

 Accuracy of the test data: 0.66030
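A sketch of how these metrics can be computed, shown for the logistic regression model on the test data and assuming a 0/1 encoded target; the same calls apply to the train data and to the LDA model:

    import matplotlib.pyplot as plt
    from sklearn.metrics import (accuracy_score, classification_report,
                                 confusion_matrix, roc_auc_score, roc_curve)

    pred = log_model.predict(X_test)
    prob = log_model.predict_proba(X_test)[:, 1]   # probability of class 1

    print(confusion_matrix(y_test, pred))
    print(classification_report(y_test, pred))
    print("Accuracy:", accuracy_score(y_test, pred))
    print("AUC:", roc_auc_score(y_test, prob))

    fpr, tpr, _ = roc_curve(y_test, prob)          # points for the ROC curve
    plt.plot(fpr, tpr)
    plt.show()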


LINEAR DISCRIMINANT ANALYSIS MODEL
CLASSIFICATION REPORT AND CONFUSION MATRIX FOR THE TRAIN DATA

Confusion Matrix and classification Report for the Test Data

AUC AND ROC For both Train and Test Data

 AUC for the Training Data: 0.661

 AUC for the Test Data: 0.675


EVALUATION OF THE MODEL BY CHANGING THE CUT-OFF VALUE
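Changing the cut-off means labelling an observation as 1 when its predicted probability exceeds a chosen threshold instead of the default 0.5. A sketch, using an illustrative threshold of 0.4 on the logistic regression probabilities:

    from sklearn.metrics import classification_report

    cutoff = 0.4   # illustrative value; several cut-offs can be compared this way
    custom_pred = (log_model.predict_proba(X_test)[:, 1] > cutoff).astype(int)
    print(classification_report(y_test, custom_pred))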

COMPARING LR AND LDA MODEL

 Both the models have almost the same metrics for the classification report.

 The Logistic Regression model has better precision and recall. Therefore the Logistic Regression model is the better optimized one.

2.4 Inference: Based on these predictions, what are the insights and recommendations?
 From the EDA, we could see that people aged above 50 are not opting for the holiday packages. This might be due to concerns about their safety during travel or the price of the package. By focusing on this age group, we can add promotional strategies, explain the safety precautions taken during travel, and even offer them senior citizen discounts.

 Secondly, we can bring in more marketing strategies, such as social media campaigns, giveaways and lucky draws, to attract more customers.

 People with a salary of more than 150000 are more numerous among those choosing the holiday packages; therefore we can roll out more offers for this category of people to bring in more customers, which should also be profitable for the company.

__________________________________________________________________________________________
__________________________________________________________________________________________

