Assignment - Predictive Modeling


Table of Contents

Q1.1. Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values, Data types, shape, EDA). Perform Univariate and Bivariate Analysis.
Q1.2. Impute null values if present, also check for the values which are equal to zero. Do they have any meaning, or do we need to change them or drop them? Do you think scaling is necessary in this case?
Q1.3. Encode the data (having string values) for Modelling. Data Split: Split the data into test and train (70:30). Apply Linear regression. Performance Metrics: Check the performance of Predictions on Train and Test sets using Rsquare, RMSE.
Q1.4. Inference: Basis on these predictions, what are the business insights and recommendations.
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition check, write an inference on it. Perform Univariate and Bivariate Analysis. Do exploratory data analysis.
2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data Split: Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear discriminant analysis).
2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare both the models and write inference on which model is best/optimized.
2.4 Inference: Basis on these predictions, what are the insights and recommendations.

Project – Predictive Modelling


Problem Statement:
You are hired by Gem Stones Co. Ltd, a cubic zirconia manufacturer. You are provided with a dataset containing the prices and other attributes of almost 27,000 cubic zirconia stones (an inexpensive diamond alternative with many of the same qualities as a diamond). The company earns different profits in different price slots. You have to help the company predict the price of a stone on the basis of the details given in the dataset, so that it can distinguish between higher-profit and lower-profit stones and improve its profit share. Also, provide them with the 5 most important attributes. The data set is cubic_zirconia.csv.

Data Dictionary

Carat: Carat weight of the cubic zirconia.

Cut: Cut quality of the cubic zirconia. Quality in increasing order: Fair, Good, Very Good, Premium, Ideal.

Color: Colour of the cubic zirconia, with D being the best and J the worst.

Clarity: Clarity refers to the absence of inclusions and blemishes. In order from best to worst (FL = flawless, I3 = level 3 inclusions): FL, IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1, I2, I3.

Depth: The height of the cubic zirconia, measured from the culet to the table, divided by its average girdle diameter.

Table: The width of the cubic zirconia's table expressed as a percentage of its average diameter.

Price: The price of the cubic zirconia.

X: Length of the cubic zirconia in mm.

Y: Width of the cubic zirconia in mm.

Z: Height of the cubic zirconia in mm.

Q1.1. Read the data and do exploratory data analysis. Describe the
data briefly. (Check the null values, Data types, shape, EDA). Perform
Univariate and Bivariate Analysis.

Exploratory Data Analysis (EDA) is the process of understanding a data set by summarizing its main characteristics, often by plotting them visually. This step is very important, especially before we move on to modelling the data. Plotting in EDA includes histograms, box plots, pair plots and many more. It often takes considerable time to explore the data. Through the process of EDA, we can refine the problem statement or definition for our data set, which is very important.

Imported the required libraries.


To build the Linear Regression Model on our dataset we need to import the following
packages:

➢ Importing the dataset - “cubic_zirconia.csv”


➢ Data Summary and Exploratory Data Analysis:
✓ Checking that the data has been imported properly.
✓ Head: the top 5 rows of the dataset are viewed using the .head() function.
✓ Dimension of the dataset: the dimension or shape of the dataset can be shown using the .shape attribute. It shows that the dataset given to us has 26967 rows and 11 columns or variables.
✓ Structure of the dataset: the structure of the dataset can be inspected using the .info() function.
➢ Data types are Integer/Float/Object.
➢ Checking for duplicates: there are 34 duplicate rows in the dataset, as computed using the .duplicated() function. We will drop the duplicates.
➢ We will drop the first column 'Unnamed: 0' as it is not important for our study. A sketch of these ingestion and inspection steps is shown below.
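The sketch assumes the file name from the problem statement and the Gem_df dataframe name used in later snippets of this report; it is illustrative rather than the exact code used.

import pandas as pd

# Load the dataset provided with the problem statement
Gem_df = pd.read_csv("cubic_zirconia.csv")

# First look at the data
print(Gem_df.head())       # top 5 rows
print(Gem_df.shape)        # (rows, columns) -> expected (26967, 11)
Gem_df.info()              # data types and non-null counts

# Drop the serial-number column, then count and drop duplicate rows
Gem_df = Gem_df.drop(columns=["Unnamed: 0"])
print(Gem_df.duplicated().sum())
Gem_df = Gem_df.drop_duplicates().reset_index(drop=True)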

Descriptive statistics for the dataset:

Carat: This is an independent variable, and it ranges from 0.2 to 4.5. The mean value is around 0.8 and 75% of the stones are at or below 1.05 carat. The standard deviation is around 0.477, which shows that the data is skewed with a right tail. This means the majority of the stones are of lower carat; there are very few stones above 1.05 carat.

Depth: The percentage height of the cubic zirconia stones is in the range of 50.80 to 73.60. The average height of the stones is 61.80; 25% of the stones are below 61 and 75% are below 62.5. The standard deviation of the height is 1.4, indicating a roughly normal distribution.

Table: The percentage width of the cubic zirconia is in the range of 49 to 79. The average is around 57; 25% of the stones are below 56 and 75% of the stones have a width of less than 59. The standard deviation is 2.24. Thus the data does not show a normal distribution and is similar to carat, with most of the stones having a smaller width; this also shows that outliers are present in the variable.

Price: Price is the predicted (target) variable. The mean price is around 3938 and the maximum is 18818. The median price of the stones is 2375, while 25% of the stones are priced below 945 and 75% are priced below 5356. The standard deviation of the price is 4022, indicating that the prices of the majority of the stones are in the lower range and the distribution is right skewed.

The variables x, y, and z seem to follow a normal distribution with a few outliers.

Checking Correlation in the data using Heatmap

Observations:
➢ High correlation between features such as carat, x, y, z and price.
➢ Low correlation between table and the other features.
➢ Depth is negatively correlated with most of the other features except carat.
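A minimal sketch of how such a heatmap can be produced, assuming the seaborn/matplotlib stack and the Gem_df dataframe used elsewhere in this report:

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation heatmap over the numeric columns only
corr = Gem_df.select_dtypes(include="number").corr()
plt.figure(figsize=(10, 7))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation between numeric variables")
plt.show()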

Univariate & Bivariate Analysis


Getting unique counts of the categorical variables:
Looking at the unique values for the variable "cut" (Fair, Good, Ideal, Premium, Very Good), we see that each unique value has a quality ranking.

Price Distribution of Cut Variable

➢ For the cut variable, the most sold are Ideal cut gems and the least sold are Fair cut gems.
➢ All cut types have outliers with respect to price.
➢ Ideal cut gems appear slightly lower priced, while Premium cut gems appear slightly more expensive.

Price Distribution of Color Variable


➢ For the color variable, the most sold are G coloured gems and the least sold are J coloured gems.
➢ All colour types have outliers with respect to price.
➢ However, the least priced seem to be E coloured gems; J and I coloured gems seem to be more expensive.

Price Distribution of Clarity Variable


➢ For the clarity variable, the most sold are SI1 clarity gems and the least sold are I1 clarity gems.
➢ All clarity types have outliers with respect to price.
➢ SI1 clarity stones seem slightly lower priced; VS2 and SI2 clarity stones seem to be more expensive.
Getting unique counts of Numeric Variables
Histograms and Boxplot for each variable to check the data distribution
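A minimal sketch of how these histograms and boxplots could be produced; it assumes the lowercase column names used in the later snippets and is illustrative rather than the exact plotting code used here.

import matplotlib.pyplot as plt
import seaborn as sns

num_cols = ["carat", "depth", "table", "x", "y", "z", "price"]

# One histogram (with KDE) and one boxplot per numeric variable
for col in num_cols:
    fig, axes = plt.subplots(1, 2, figsize=(10, 3))
    sns.histplot(Gem_df[col], kde=True, ax=axes[0])
    sns.boxplot(x=Gem_df[col], ax=axes[1])
    axes[0].set_title(f"Distribution of {col}")
    axes[1].set_title(f"Boxplot of {col}")
    plt.tight_layout()
    plt.show()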
Observations:
Independent variables:
➢ Depth is the only variable which can be considered normally distributed.
➢ Carat, table, x, y, and z have multiple modes and a wide spread of data.
➢ Outliers: a large number of outliers are present in all the variables (carat, depth, table, x, y, z).
Price is the target or dependent variable:
➢ It is right skewed with a large range of outliers.
There are outliers present in all the variables as per the above plots.

From the above data it is seen that, except for the carat and price variables, all other variables have mean and median values very close to each other, so there appears to be no skewness in those variables. For carat and price we see some difference between the mean and median values, which indicates some skewness in the data.
Treatment of outliers by IQR method
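A minimal sketch of the IQR capping described here; whether every numeric column (including price) was capped is an assumption of the sketch, not a statement of the exact treatment applied.

# Cap values outside the IQR whiskers (Q1 - 1.5*IQR, Q3 + 1.5*IQR)
def treat_outliers_iqr(df, col):
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    df[col] = df[col].clip(lower=lower, upper=upper)
    return df

for col in ["carat", "depth", "table", "x", "y", "z", "price"]:
    Gem_df = treat_outliers_iqr(Gem_df, col)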
Box Plots after outliers’ treatment
➢ Checked data correlation via heatmap: the heatmap shows the correlation between variables.
➢ Carat, x, y, and z demonstrate strong correlation with each other, i.e. multicollinearity.

Bivariate Analysis:
Pair Plot:

Observations:
➢ The pair plot allows us to see both the distribution of a single variable and the relationships between two variables.
Conclusion of EDA:
• Price: this variable gives a continuous output with the price of the cubic zirconia stones. This will be our target variable.
• Carat, depth, table, x, y and z are numerical or continuous variables.
• Cut, clarity and colour are categorical variables.
• We will drop the first column 'Unnamed: 0' as it is not important for our study, which leaves the dataset with 26967 rows and 10 columns.
• Only the 'depth' column has missing values (697), which we will impute with its median value.
• There are a total of 34 duplicate rows, as computed using the .duplicated() function. We will drop the duplicates.
• Upon dropping the duplicates, the shape of the data set is 26933 rows and 10 columns.

Q1.2. Impute null values if present, also check for the values which
are equal to zero. Do they have any meaning, or do we need to
change them or drop them? Do you think scaling is necessary in this
case?

Answer:

It is important to check if the data has any missing values or gibberish data in it. We checked for both the object and numerical data types and can confirm the following:
✓ There is no gibberish or missing data in the object type columns: cut, color and clarity.
CUT : 5
Fair 780
Good 2435
Very Good 6027
Premium 6886
Ideal 10805
Name: cut, dtype: int64

COLOR : 7
J 1440
I 2765
D 3341
H 4095
F 4723
E 4916
G 5653
Name: color, dtype: int64

CLARITY : 8
I1 364
IF 891
VVS1 1839
VVS2 2530
VS1 4087
SI2 4564
VS2 6093
SI1 6565
Name: clarity, dtype: int64

✓ There are missing values in the column "depth": 697 cells, or 2.6% of the total data set. We can choose to impute these values using the mean or the median. We checked both and the results are almost identical.
✓ For this case study, I have used the median to impute the missing values.
Table 1.2.1 – Checking the missing values

Table 1.2.2. – checking for missing values after imputing the values
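A minimal sketch of the median imputation described above, assuming the Gem_df dataframe used in the other snippets of this report:

# Impute the 697 missing 'depth' values with the column median and verify
print(Gem_df.isnull().sum())
Gem_df["depth"] = Gem_df["depth"].fillna(Gem_df["depth"].median())
print(Gem_df.isnull().sum())   # should now show zero nulls in every column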
Table: 1.2.3: Describe Function showing presence of 0 values in x, y, and z columns

✓ While there are no missing values in the numerical columns, there are a few zeros in the columns x (3), y (3) and z (9). One of these rows was removed while dropping duplicates, and the remaining eight rows are dealt with here. Since only 8 rows contain zero values, they account for a negligible share of the data and could simply be dropped. Also, when I checked the correlation values, there is strong multicollinearity between all three columns, so most likely I won't even use them in building my linear regression model. I have chosen to drop those rows, as they represent an insignificant number compared to the overall dataset and would not add much value to the analysis.
Gem_df = Gem_df[(Gem_df["x"] != 0) & (Gem_df["y"] != 0) & (Gem_df["z"] != 0)]

Table 1.2.4: Describe function confirming there are no 0 values in the x, y, and z columns
We now have a dataset with 26,925 rows, as opposed to the 26,933 rows we had after treating duplicate entries.

Scaling: In regression it is often good practice to centre the variables so that the predictors have a mean of 0. This makes it easier to interpret the intercept term as the expected value of Yi when the predictor values are set to their means. Otherwise, the intercept is interpreted as the expected value of Yi when the predictors are set to 0, which may not be a realistic or interpretable situation. Another valid reason for scaling in regression is when one predictor variable has a very large scale. In that case, the regression coefficients may be of a very small order of magnitude, which can be unclear to interpret. The convention of standardizing predictors primarily exists so that the units of the regression coefficients are comparable.
More often, a dataset contains features that vary highly in magnitude, units and range. Many machine learning algorithms use the Euclidean distance between two data points in their computations, and this can be a potential problem. Scaling helps to standardize the independent features present in the data to a fixed range. If feature scaling is not done, a machine learning algorithm tends to weigh larger values higher and treat smaller values as lower, regardless of the unit of the values. Features with high magnitudes will weigh in far more in the distance calculations than features with low magnitudes. To suppress this effect, we need to bring all features to the same level of magnitude.

For this case study, however, let's look at the data more closely to decide whether we need to scale it.
The describe output shared above indicates that the mean and standard deviation of the original numeric variables do not vary dramatically, so even if we don't scale the numbers our model performance will not change much; the impact will be insignificant. The immediate effects we do see if we scale the data and run our linear model are:
• Faster execution; the model converges faster.
• The intercept shrinks to an almost negligible value.
• The coefficients can now be interpreted in standard deviation units instead of the raw unit increments of the unscaled linear model.
There is, however, no difference in the interpretation of the model scores or in the graphical representation of the linear model using a scatterplot.

Table 1.2.11 – Correlation between variables of the dataset


We can identify that there is a strong correlation between independent variables such as carat, x, y, and z. All these variables are also strongly correlated with the target variable, price. This indicates that our dataset struggles with multicollinearity. Depth does not show any strong relation with the price variable. For this case study, I will drop the x, y, and z variables before creating the linear regression model. Similarly, depth does not seem to influence the price variable and hence, at some point, I will drop this variable from my model building process as well.

Keeping the above points in mind, for this dataset I don't think scaling the data will make much difference. However, for the sake of checking how it impacts the overall model coefficients, I have still carried it out for this study.

Please note, centering/scaling does not affect our statistical inference in regression models: the estimates are adjusted appropriately, and the p-values will be the same.
Sharing the brief results from running the model on the scaled data.

The coefficients are mentioned below:

The intercept value is -2.268175876465854e-16, or almost 0.


The regression model score on the train and test data is the same as with the unscaled linear regression model, at 91.4% and 91.7%, respectively. This indicates the model is a right fit and has avoided being underfit or overfit. The RMSE is 0.2885 in standard deviation units, i.e. the residual standard deviation is about 29% of the target's standard deviation (roughly 8% unexplained variance). Scaling allows better interpretability when studying the model.
While reading the coefficient values of the scaled variables, I can interpret them as: for a one standard deviation increase in carat, the (standardized) price changes by about 1.06 standard deviations, holding the other variables constant, and so on for the other coefficients.
Visual representation of the scaled model:

Q1.3. Encode the data (having string values) for Modelling. Data Split:
Split the data into test and train (70:30). Apply Linear regression.
Performance Metrics: Check the performance of Predictions on Train
and Test sets using Rsquare, RMSE.
Answer:
We have three object columns, which have string values - cut, color, and clarity
Let us read the data in brief before deciding on what kind of encoding technique needs to be
used.
I would quickly check the statistical inference of our target variable – price, with the following
output:

Let us now check the first object column, cut, for a quick read. I am using the groupby function to check the relationship of cut with price, with aggregations such as mean, median, and standard deviation.
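A minimal sketch of that groupby read, with the aggregation choices being the ones mentioned above:

# Price summary by cut quality: mean, median and standard deviation
cut_summary = Gem_df.groupby("cut")["price"].agg(["mean", "median", "std"])
print(cut_summary.sort_values("mean"))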
We can establish that there is an order in the ranking: the mean price increases from Ideal, then Good, to Very Good, Premium and Fair. The Fair segment has the highest median value as well. Since I can see an ordered ranking here, I have encoded this variable on an ordinal scale (label encoding) and won't use one-hot encoding for it. However, it is absolutely fine to treat this variable using one-hot encoding as well; we can certainly try that approach if we would like to see the impact of the individual cut categories on the price target variable. Overall, given the summary above, I don't think cut will be a strong predictor, and hence I have stayed with label encoding for the rest of the case study.
Gem_df['cut_s'] = Gem_df['cut'].map({'Ideal': 1, 'Good': 2, 'Very Good': 3, 'Premium': 4, 'Fair': 5})

Let’s look at the relationship between color and price, using bivariate analysis:
The color variable seems to be making an impact on the price variable and it should
be used further in our linear regression model.

Lets quickly check relationship between clarity and price, bivariate analysis:
For me, its hard to pick an order and the variable seems to be having a direct impact
on the price variable. We should read this further while creating our Linear
Relationship model.
In one-hot encoding, the integer encoded variable is removed and a new binary variable is added for each unique value. One-hot encoding allows the representation of categorical data to be more expressive. Many machine learning algorithms cannot work with categorical data directly and hence the categories must be converted into numbers. This is required for both input and output variables that are categorical. The only disadvantage is that for high-cardinality variables the feature space can blow up quickly and we then have to deal with high dimensionality.

For this case study, I have used one-hot encoding for the color and clarity independent variables with the option "drop_first=True", which can be referred to as dummy encoding and uses (k-1) encoding. One-hot encoding ends up with k variables per category, while dummy encoding ends up with k-1 variables. This also helps us deal with the dimensionality issue, if needed.
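A minimal sketch of this dummy encoding, assuming the Gem_df dataframe; with drop_first=True the first alphabetical level of each variable (color D, clarity I1) is dropped, which matches the encoded column names listed later in this report.

# Dummy (k-1) encoding for the color and clarity columns
Gem_df = pd.get_dummies(Gem_df, columns=["color", "clarity"], drop_first=True)
print(Gem_df.filter(like="color_").columns.tolist())
print(Gem_df.filter(like="clarity_").columns.tolist())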

Table 1.3.1: Encoded data for Categorical Variables

We have already established that several variables (carat, x, y, and z) demonstrate strong correlation, i.e. multicollinearity. So, before proceeding with the linear regression model, we need to remove some of them from the model creation exercise. I have decided to drop x, y, and z from my linear regression model and to keep the carat column, as it has the strongest relation with the target variable price out of the four columns. The depth column also does not have much impact on the price column and hence I have chosen not to use it either.

Table 1.3.2: Heatmap showcasing multicollinearity between different variables


Table 1.3.3: Heatmap after dropping the x, y, z, and depth columns before creating the Linear Regression model
After finalizing the columns that I intend to use for my model, I have saved them in a new dataframe:

sel_df = Gem_df[['carat', 'table', 'cut_s', 'color_E', 'color_F', 'color_G', 'color_H', 'color_I', 'color_J',
                 'clarity_IF', 'clarity_SI1', 'clarity_SI2', 'clarity_VS1', 'clarity_VS2', 'clarity_VVS1', 'clarity_VVS2', 'price']]

Without scaling the data:

✓ Copy all the predictor variables into the x dataframe:
x = sel_df.drop('price', axis=1)
✓ Copy the target into the y dataframe:
y = sel_df[['price']]
✓ Split x and y into training and test sets in a 70:30 ratio:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=1)
The model is then fitted and scored as sketched below.
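This sketch assumes the x_train/x_test split created above and uses scikit-learn's LinearRegression; the object name lin_reg is illustrative rather than the exact code used in this report.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# Fit ordinary least squares on the training split
lin_reg = LinearRegression()
lin_reg.fit(x_train, y_train)

print("Intercept:", lin_reg.intercept_)
print("Coefficients:", dict(zip(x_train.columns, lin_reg.coef_.ravel())))

# R-squared on train and test, and RMSE on the test set
print("Train R^2:", lin_reg.score(x_train, y_train))
print("Test R^2:", lin_reg.score(x_test, y_test))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, lin_reg.predict(x_test))))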

The intercept of the model is -4109.09.

The model score on the training data set is 91.4%.
The model score on the test data is 91.7%.
Our model is in the right fit zone: it is neither underfit nor overfit, and it is working out fine for us.
The linear regression equation can be represented as:
price = -4109.09 + 9199.07*carat - 36.80*table - 55.23*cut_s + ... + 5119.51*clarity_IF + 3197.93*clarity_SI1 + ... + 4809.18*clarity_VVS1 + 4688.70*clarity_VVS2

The R-squared value is 0.914, or 91.4% in percentage terms.

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. 100% indicates that the model explains all the variability of the response data around its mean. The value of R-squared varies from 0 to 1, and any value close to 1 indicates a well-fitted regression line; our R-squared score of 0.914 signifies good performance.

The root mean square error (RMSE) is 1167.72.

RMSE is the standard deviation of the residuals or prediction errors. Residuals are a measure of how far data points are from the regression line; RMSE is a measure of how spread out these residuals are. In other words, it tells us how concentrated the data is around the line of best fit.

For better analysis, we should also study MAPE (mean absolute percentage error), the average of the absolute percentage errors between the predicted and actual values. When comparing the accuracy of various forecasting methods, the one with the lowest MAPE has the best predictive power.
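A minimal sketch of one common way to compute MAPE for this regression, reusing the hypothetical lin_reg object from the earlier sketch:

import numpy as np

# Mean absolute percentage error between actual and predicted prices
def mape(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return np.mean(np.abs((y_true - y_pred) / y_true))

print("Train MAPE:", mape(y_train, lin_reg.predict(x_train)))
print("Test MAPE:", mape(y_test, lin_reg.predict(x_test)))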
The MAPE of my model on the training and test data is almost the same, at 0.48318 and 0.48688, respectively. The similar results indicate that the model performs almost identically on training and test data, so we can assume it is in the right fit zone and has avoided being underfit or overfit.
Hypothesis testing for linear regression: the null hypothesis states that there is no relation between the dependent variable, price, and the independent variables. Looking at the summary table above, all the p-values are less than 0.05, so at the 95% confidence level we can say that the variables have a direct impact on the price variable. The carat and clarity variables seem to impact the price positively; surprisingly, the colour of the stones offsets the price increase. We can, however, study the model further to address multicollinearity where present.
Table 1.3.5: Scatterplot on test data between dependent variable – price - and
Independent variables

As stated earlier, with this specific dataset I don't think we need to scale the data; however, to see its impact, let's quickly view the results after scaling.
I have used the Z-score to scale the data. Z-scores become comparable by measuring the observations in multiples of the standard deviation of the sample. The mean of a z-transformed sample is always zero.
from scipy.stats import zscore
x_train_scaled = x_train.apply(zscore)
x_test_scaled = x_test.apply(zscore)
y_train_scaled = y_train.apply(zscore)
y_test_scaled = y_test.apply(zscore)
The values are now scaled.
The new intercept is -2.268175876465854e-16, which is almost equal to 0; this is a by-product of the scaling exercise.
The scaled linear regression equation is shown below; interpretability is now in terms of a one standard deviation increase rather than the one unit increase of the unscaled linear model:
price ≈ 0 + 1.05*carat - 0.019*table - 0.018*cut_s + ... + 0.3004*clarity_VVS1 + 0.3409*clarity_VVS2
The model score on the train data (91.4%) and test data (91.7%) stays exactly the same as before scaling the data. So, scaling does not impact or hinder the model scoring.
The RMSE is now 0.2886 in standard deviation units, i.e. the residual standard deviation is about 29% of the target's standard deviation. Scaling allows better interpretability when studying the model.
Table 1.3.6: Post scaling, Scatterplot on test data between price - and Independent
variables

The scatterplot also stays the same despite scaling the data.

Q1.4 Inference: Basis on these predictions, what are the business insights and recommendations.

We have a dataset with strong correlations between independent variables, so we need to tackle the issue of multicollinearity, which can hinder the interpretation of the model. Multicollinearity makes it difficult to understand how an individual variable influences the target variable; however, it does not affect the accuracy of the model. As a result, while creating the model I dropped a number of independent variables that displayed multicollinearity or had no direct relation with the target variable.
Table: Showing Correlation between variables before model building

While looking at the data during univariate and bivariate analysis, we established that carat is strongly related to the price variable, and also to several other independent variables (x, y, and z), with low correlation to variables such as table and cut. It can be established that carat will be a strong predictor in our model.

The same trend was displayed even after the object columns were encoded. The carat variable continues to display strong correlation with price and with most of the other variables, confirming its claim to be the most important predictor.
Figure – Heatmap with encoded variables, before the model creation stage

After the linear regression model was created, we can see this assumption coming true: the carat variable emerged as the single biggest factor impacting the target variable, followed by a few of the clarity variables. The carat variable has the highest coefficient value compared with the other studied variables for this test case.
The sample linear equation after running the model, before scaling:
price = -4109.09 + 9199.07*carat - 36.80*table - 55.23*cut_s + ... + 5119.51*clarity_IF + 3197.93*clarity_SI1 + ... + 4809.18*clarity_VVS1 + 4688.70*clarity_VVS2

Even after scaling, our claim about carat being an important driver in our linear equation is reaffirmed.

We can then look at the Variance Inflation Factor (VIF) scores to check multicollinearity. As per industry convention, and at least for this case study, any variable with a VIF score greater than 10 is taken to indicate severe collinearity. The table below shows the VIF for all the variables.

From the above scores, we can establish that table and clarity_SI1, clarity_SI2, clarity_VS1 and clarity_VS2 display severe collinearity and need to be addressed.

VIF measures the intercorrelation among independent variables in a multiple regression model. In mathematical terms, the variance inflation factor for a regression model variable is the ratio of the overall model variance to the variance of a model containing only that single independent variable. As an example, the VIF value for carat in the table above reflects its intercorrelation with the other independent variables in the dataset, and so on for the other variables.
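A minimal sketch of how these VIF scores can be computed with statsmodels, assuming the x_train predictors used for the model above:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Add a constant so the VIFs are computed against an intercept model
X_vif = sm.add_constant(x_train).astype(float)
vif = pd.DataFrame({
    "variable": X_vif.columns,
    "VIF": [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
})
print(vif)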

If we are looking to fine-tune the model, we can simply drop these columns from our linear regression model, see how the results pan out, and check the model performance again.

As an alternative approach, we can one-hot (dummy) encode the "cut" variable and run the model again to check the overall model score or to tackle the issue of multicollinearity. This would allow us to read the impact of the individual cut categories on the target variable, price, if the company intends to study that as well. However, for this case study and for the reasons mentioned above, I have not used one-hot encoding on the cut variable.

For the business, based on the model that we have created for this test case, the key variables that are likely to positively drive price are (top 5 in descending order):
1. carat
2. clarity_IF
3. clarity_VVS1
4. clarity_VVS2
5. clarity_VS1

Recommendations:
As expected, carat is a strong predictor of the overall price of a stone.

Clarity refers to the absence of inclusions and blemishes and has also emerged as a strong predictor of price. Stones with clarity IF, VVS1, VVS2 and VS1 help the firm command a higher price.
Colours such as H, I and J will not help the firm command a higher price for such stones.
The company should instead focus on stones of colour D, E and F to command relatively higher price points and support sales.
This can also indicate that the company should look to come up with new colour stones, such as clear stones or a different/unique colour, that help impact the price positively.
The company should focus on the stones' carat and clarity in order to increase their prices. Ideal customers will also contribute to more profits. Marketing efforts can be used to educate customers about the importance of a better carat value and of the clarity index. After this, the company can segment and target customers based on their income/paying capacity etc., which can be studied further.
Problem 2: Logistic Regression and LDA

You are hired by a tour and travel agency which deals in selling holiday packages.
You are provided details of 872 employees of a company. Among these employees,
some opted for the package and some didn't. You have to help the company in
predicting whether an employee will opt for the package or not on the basis of the
information given in the data set. Also, find out the important factors on the basis of
which the company will focus on particular employees to sell their packages.
Dataset for Problem 2: Holiday_Package.csv

Data Dictionary:
Holiday_Package: Opted for holiday package, yes/no?
Salary: Employee salary
age: Age in years
edu: Years of formal education
no_young_children: The number of young children (younger than 7 years)
no_older_children: Number of older children
foreign: Foreigner, yes/no

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and
do null value condition check, write an inference on it. Perform
Univariate and Bivariate Analysis. Do exploratory data analysis.

Exploratory Data Analysis (EDA) is the process of understanding a data set by summarizing its main characteristics, often by plotting them visually. This step is very important, especially before we move on to modelling the data. Plotting in EDA includes histograms, box plots, pair plots and many more. It often takes considerable time to explore the data. Through the process of EDA, we can refine the problem statement or definition for our data set, which is very important.

Imported the required libraries.


To build the Logistic Regression Model and LDA on our dataset we need to import the
following packages:
➢ Importing the dataset - “Holiday_Package.csv”
➢ Data Summary and Exploratory Data Analysis:
✓ Checking that the data has been imported properly.
✓ Head: the top 5 rows of the dataset are viewed using the .head() function.
✓ Dimension of the dataset: the dimension or shape of the dataset can be shown using the .shape attribute. It shows that the dataset given to us has 872 rows and 8 columns or variables.
✓ Structure of the dataset: the structure of the dataset can be inspected using the .info() function.
➢ Data types are Integer/Object.
➢ Checking for duplicates: there are no duplicate rows in the dataset, as computed using the .duplicated() function.
➢ We have dropped the first column 'Unnamed: 0' as it is not important for our study. The shape is now 872 rows and 7 columns.

Univariate Analysis

Descriptive statistics for the dataset:

Summary of the dataset:

• Holiday Package: this is a categorical variable and will be our target variable.
• Salary, age, educ, no_young_children and no_older_children are numerical or continuous variables.
• Salary ranges from 1322 to 236961. The average salary of employees is around 47729 with a standard deviation of 23418, indicating that the data is not normally distributed. A skew of 0.71 indicates that the data is right skewed and there are few employees earning more than the average of 47729. 75% of the employees earn below 53469, while 25% of the employees earn below 35324.
• The age of the employees ranges from 20 to 62. The median is around 39; 25% of the employees are below 32 and 25% are above 48. The standard deviation is around 10, indicating an almost normal distribution.
• Years of formal education range from 1 to 21 years. 25% of the population has 8 or fewer years of formal education, the median is around 9 years, and 75% of the employees have 12 or fewer years. The standard deviation of education is around 3. This variable also indicates skewness in the data.
• Foreign is a categorical variable.
• We have dropped the first column 'Unnamed: 0' as it is not important for our study; it holds serial numbers and is not required for further analysis. The shape is now 872 rows and 7 columns.
• There are no null values.
• There are no duplicates.
Getting unique counts of Numerical Variables

Checking the unique values for the target variable 'Holliday_Package':

We can observe that 54% of the employees did not opt for the holiday package and 46% are interested in the package. This implies we have a fairly balanced dataset.
Checking the unique values of the foreign variable as it is categorical:

We can observe that 75% of the employees are not foreigners and 25% are foreigners.

➢ Checked for data distribution by plotting histogram


Let us check for outliers by plotting box plots.
We can observe that there are significant outliers present in the variable 'Salary'; however, there are minimal outliers in other variables like 'educ', 'no_young_children' and 'no_older_children'. There are no outliers in the variable 'age'. For interpretation purposes we need to study the variables no_young_children and no_older_children before outlier treatment. For this case study we have done outlier treatment only for Salary and educ.

Treatment of outliers by IQR method (placed it here for comparison sake only)
Box Plots after outliers’ treatment –
Bi-Variate Analysis with Target variable

Holiday Package & Salary

While performing the bivariate analysis we observe that the salary distributions for employees opting and not opting for the holiday package are similar in nature. However, the distribution is more spread out for people not opting for the holiday package.

Holiday Package & age

There are no outliers present in age.
The distribution of the age variable with respect to the holiday package is also similar in nature.
We can clearly see that employees in the middle age range (34 to 45 years) opt for the holiday package more than older and younger employees.
Holiday Package & educ

This variable also shows a similar pattern, which means education is likely not a variable influencing holiday packages for employees.
We observe that employees with few years of formal education (1 to 7 years) and those with higher education opt for the holiday package less than employees with 8 to 12 years of formal education.

There is a significant difference between employees with young children who opt for the holiday package and those who do not.
We can clearly see that people with young children tend not to opt for holiday packages.

The distribution for opting or not opting for the holiday package looks the same for employees with older children. At this point, this might not be a good predictor while creating our logistic model.
There is almost the same distribution for both scenarios when dealing with employees with older children.

Checking the pairwise distribution of the continuous variables: Salary, age, educ, no_young_children and no_older_children.
➢ Checked for data correlation.
➢ We will look at the correlation between independent variables to see which factors might influence the choice of holiday package.
➢ Heatmap showing correlation between variables.

We can see that there isn't any strong correlation between the variables. Salary and education display moderate correlation and no_older_children is somewhat correlated with salary. However, there are no strong correlations in the data set.

2.2 Do not scale the data. Encode the data (having string values) for
Modelling. Data Split: Split the data into train and test (70:30). Apply
Logistic Regression and LDA (linear discriminant analysis).

Answer:
While linear regression helps us predict a continuous target variable, logistic regression helps us predict a discrete target variable. Logistic regression is one of the "white-box" algorithms which helps us determine probability values and the corresponding cut-offs. Logistic regression is used to solve problems where we need probability outputs, after which we can decide the appropriate cut-off points to get the target class outputs.

Precisely, logistic regression is defined as a statistical approach for calculating the probability outputs for the target labels. In its basic form, it is used to classify binary data. Logistic regression is very similar to linear regression in that the explanatory variables (X) are combined with weights to predict a target variable of a binary class (y).
Evaluation of a logistic regression model: the performance of classification algorithms is judged by the confusion matrix, which contains the classification counts of actual versus predicted labels.
Pros and cons of logistic regression:
Pros: the logistic regression classification model is simple and easily scalable to multiple classes.
Cons: the classifier constructs linear boundaries, and the interpretation of the coefficient values is difficult.

In the given dataset, the target variable, Holliday_Package, and an independent variable, foreign, are object variables. Let us study them one at a time.
Holliday_Package: the distribution seems fine, with 54% for no and 46% for yes.

foreign: the data is imbalanced, skewed towards no with a relatively smaller share for yes.
Both variables can be encoded into numerical values for model creation and analytical purposes.
Table 2.2.1: Encoding the string values

Table 2.2.2: Checking if the values are converted into numeric using the head()
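A minimal sketch of this encoding, assuming the HP dataframe name used in the later snippets and lowercase yes/no labels in the raw data:

# Map the two object columns to numeric codes (no = 0, yes = 1)
HP["Holliday_Package"] = HP["Holliday_Package"].map({"no": 0, "yes": 1})
HP["foreign"] = HP["foreign"].map({"no": 0, "yes": 1})
print(HP.head())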
Before we proceed with model creation, let us read the rest of the data to see how the numerical variables may also impact the model. I will first look at the data correlation to quickly identify variable importance using the heatmap.

We can see that there isn't any strong correlation between the variables. Salary and education display moderate correlation and no_older_children is somewhat correlated with salary. However, there are no strong correlations in the data set.
Looking at the no_young_children variable:

Although this variable is numeric, it shows a varied distribution between 1 and 2 children when a bivariate analysis is done against the dependent variable. It is therefore advisable to treat this variable as categorical and encode it.
HP = pd.get_dummies(HP,columns = ['no_young_children'], drop_first=True)
after which the variable appears as encoded dummy columns.
Let's read no_older_children as well:

Looking at the table above, there does not seem to be much variation in the distribution for employees with more than 0 older children; the distributions are close enough across the Holliday_Package classes. For this test case, I don't think this variable will be an important factor for the model or for analysis purposes, and hence I will drop it from my model building process.
Let’s now initiate the model:

• Copy all the predictor variables into the X dataframe:
X = HP.drop(columns=['Holliday_Package', 'no_older_children'], axis=1)

• Copy the target into the y dataframe:
y = HP['Holliday_Package']

• Split X and y into training and test sets in a 70:30 ratio. This implies 70% of the data will be used for training and the remaining 30% for testing:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1, stratify=y)
Checking the data split for the dependent variable y in both the train and test data: the percentage split between no and yes is almost the same, at 54% and 46%, respectively, for both the train and test data sets.
Table 2.2.3: Data split for the target variable Holliday_Package in the training and test sets
The data proportions seem reasonable and we can continue with model building as the next step.

As the next step, we will initialize the LogisticRegression function and fit the logistic regression model. Thereafter we will predict on the training and test data sets.

model = LogisticRegression(solver='newton-cg', max_iter=10000, penalty='none', verbose=True, n_jobs=2)
model.fit(X_train, y_train)
ytrain_predict = model.predict(X_train)
ytest_predict = model.predict(X_test)

In the equation y = mx + c, when we fit the model what we are really doing is choosing the values for m and c, the slope and the intercept. The point of fitting the model is to find the values of m and c such that y = mx + c describes a line that fits our observed data well.

Solver: I have used the Newton method for the solver as the data is not very large. Newton methods use an exact Hessian matrix and compute second-order derivatives, so the results are likely to be more reliable. We also used grid search to identify the best solver, and newton-cg came out as the best solver method.
Model accuracy scores:
Training data:
model.score(X_train, y_train) is 66.1%
Test data:
model.score(X_test, y_test) is 65.6%
The accuracy scores aren't too different, so the model can be considered a right fit, avoiding the underfit and overfit scenarios.
We can apply GridSearchCV here to fine-tune the model further and see if it improves the results. GridSearchCV is a brute-force iterative method of obtaining the best model based on a scoring metric and the parameters provided by us.
We can pass in the parameters to see which set the model prescribes as best. A glimpse of the steps taken is shared below:
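A minimal sketch of such a grid search; the parameter grid shown here (solvers and tolerances with an l2 penalty) is an illustrative assumption, not the exact grid used for this report.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {
    "solver": ["newton-cg", "lbfgs", "liblinear", "sag"],
    "penalty": ["l2"],
    "tol": [1e-4, 1e-5],
}
grid_search = GridSearchCV(
    estimator=LogisticRegression(max_iter=10000),
    param_grid=param_grid,
    scoring="f1",
    cv=3,
)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)

best_model = grid_search.best_estimator_
print("Train accuracy:", best_model.score(X_train, y_train))
print("Test accuracy:", best_model.score(X_test, y_test))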
After running GridSearchCV, we can check the accuracy scores again to identify whether the model performance has improved.
The accuracy scores are:
Train data: 66%
Test data: 66%
These scores are almost the same as before putting the model through the GridSearchCV process, with a marginally better result for the training data and a marginal reduction in accuracy for the test data.
After creating this model, we move on to Linear Discriminant Analysis (LDA).
LDA uses prior probabilities to predict the corresponding target probabilities. The prior probability is the probability of y (say, equal to 1) without taking into account any other data or variables. The corresponding updated probabilities when the covariates (Xs) are available are called the posterior probabilities: we want to find P(Y=1|X). Thus, Linear Discriminant Analysis (LDA) discriminates between the two classes by looking at the features (Xs).

Like logistic regression, here also we will build a model and fit it, after which we can evaluate the accuracy scores.
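A minimal sketch of fitting and scoring the LDA model on the same split; the object name lda_model is illustrative.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix, classification_report

lda_model = LinearDiscriminantAnalysis()
lda_model.fit(X_train, y_train)

# Accuracy, confusion matrix and classification report on both splits
for name, X_part, y_part in [("Train", X_train, y_train), ("Test", X_test, y_test)]:
    pred = lda_model.predict(X_part)
    print(name, "accuracy:", lda_model.score(X_part, y_part))
    print(confusion_matrix(y_part, pred))
    print(classification_report(y_part, pred))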
Figure 2.2.1: Confusion Matrix for both Training and Test data

Accuracy scores for the:
Training data: 66%
Test data: 66%
We can look to refine the cut-off values to see if the model returns better results. We started the model building with a default cut-off value of 0.5 for classifying the training and test data. We will now try to refine the model with different/custom cut-off values of our choice. For this case study, I have stepped the cut-off in increments of 10% (0.1) up to a maximum of 1.
After this, we can compare the best result, as shared by the model, with the default model.
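A minimal sketch of that cut-off search, reusing the hypothetical lda_model object from the sketch above and selecting the threshold by F1 score on the training data:

import numpy as np
from sklearn.metrics import f1_score

train_probs = lda_model.predict_proba(X_train)[:, 1]

# Try cut-offs 0.1, 0.2, ..., 0.9 and keep the one with the best train F1 score
best_cutoff, best_f1 = 0.5, 0.0
for cutoff in np.arange(0.1, 1.0, 0.1):
    preds = (train_probs > cutoff).astype(int)
    score = f1_score(y_train, preds)
    if score > best_f1:
        best_cutoff, best_f1 = cutoff, score
print("Best cut-off:", round(best_cutoff, 1), "train F1:", round(best_f1, 3))

# Apply the chosen cut-off to the test data
test_preds = (lda_model.predict_proba(X_test)[:, 1] > best_cutoff).astype(int)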

As shared above, the initial accuracy score on the test data was 66%, which is the same after the custom cut-off model, but the F1 score improved from an initial value of 57% to 67% with the custom model.
For many problems, accuracy may not be the best metric on which to base a decision, so we need to be aware of other metrics such as precision, recall and F1 score, which become more relevant when choosing the best LDA model.

2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare Both the models and write inference which model is best/optimized.

The confusion matrix cells are populated by the terms:

True Positive (TP): values which are predicted as True and are actually True.
True Negative (TN): values which are predicted as False and are actually False.
False Positive (FP): values which are predicted as True but are actually False.
False Negative (FN): values which are predicted as False but are actually True.
ROC Curve: the Receiver Operating Characteristic (ROC) curve measures the performance of models by evaluating the trade-off between sensitivity (true positive rate) and 1 - specificity (false positive rate).
AUC: the area under the curve (AUC) is another measure for classification models based on the ROC; it is the accuracy judged by the area under the ROC curve.
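A minimal sketch of plotting the ROC curve and computing AUC for the logistic regression model fitted earlier (the model object from section 2.2):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Predicted probabilities of class 1 on the test set
probs = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, probs)
print("Test AUC:", roc_auc_score(y_test, probs))

plt.plot(fpr, tpr, label="Logistic Regression")
plt.plot([0, 1], [0, 1], linestyle="--", label="No-skill line")
plt.xlabel("False positive rate (1 - specificity)")
plt.ylabel("True positive rate (sensitivity)")
plt.legend()
plt.show()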

Performance metrics of the logistic regression model:

➢ Train data:
AUC score is 0.741 or 74.1%
Confusion matrix for the train data:
Classification report for the train data:

➢ Test data:
AUC score is 0.741 or 74.1%
Confusion matrix for the test data:
Classification report for the test data:

Looking at the metrics for both the training and the test data, the accuracy scores are the same at 66%. Our model is close enough to be treated as a right fit model; it is not struggling with being overfit or underfit.
The AUC scores for the training and test data are also the same, at 74.1%.
The model performance is reasonable on the F1 score as well, with the training data performing better at 62% while the test data gave an F1 score of 57%.
Let us see if these numbers change if we try to further refine the model using GridSearchCV.
Applying GridSearchCV on Logistic Regression:
Confusion Matrix on the Train data:
Classification report on Train data:

Confusion Matrix on the Test data:

Classification report on Test data:

Even after applying the best parameters, the model performance is similar, with almost the same accuracy scores as before putting the model through the GridSearchCV step.
Accuracy score on train data: 66%
Accuracy score on test data: 66%
Also, the F1 scores are similar to the default model, at 62% and 57% for the train and test data, respectively.
To summarize, our model scores are not very good, but the model seems to be a right fit and avoids the underfit and overfit scenarios. The training data performs a bit better than the test data, though the difference is small. The point to note is that the F1 score is better on the training data and dips slightly on the test data. We can study the data further and do more feature engineering to improve the scores, as needed for the business case.

LDA Model:
Confusion Matrix on Training data:

Classification report for both training and testing data:


The accuracy score of the training data and the test data is the same at 66%. This is almost identical to the logistic regression model result so far. The AUC score is marginally lower for the test data; otherwise it is also almost the same as the logistic regression model. The F1 scores are 61% and 57% for the train and test data, respectively, which again is close to the logistic regression model.

AUC for the training data: 0.740 or 74.0%
AUC for the test data: 0.725 or 72.5%

Overall, the model seems to be a right fit and stays away from being underfit or overfit. Let us see if we can refine the results further and improve the F1 score of the test data specifically.

Custom cut off for the LDA model:

Comparison of the Classification report:


Looking at the custom classification report for the test data, we can see that we have managed to retain our accuracy score and have improved the F1 score from 57% to 67%.
Confusion matrix with the custom cut-off for the test data:

array([[84, 58],
[31, 89]], dtype=int64)

Let’s compare both the models now:


Classification report and confusion matrix of the Logistics model on the test data
Classification Report of the default cut-off test data of LDA model:

              precision    recall  f1-score   support

           0       0.65      0.79      0.71       142
           1       0.67      0.50      0.57       120

    accuracy                           0.66       262
   macro avg       0.66      0.64      0.64       262
weighted avg       0.66      0.66      0.65       262

Classification Report of the custom cut-off test data of the LDA model:

              precision    recall  f1-score   support

           0       0.73      0.59      0.65       142
           1       0.61      0.74      0.67       120

    accuracy                           0.66       262
   macro avg       0.67      0.67      0.66       262
weighted avg       0.67      0.66      0.66       262

As stated above, both models, logistic regression and LDA, offer almost similar results, while LDA offers the flexibility to control important metrics such as precision, recall and F1 score by changing the custom cut-off. In this case study, the moment we changed the cut-off to 40%, we were able to improve our precision, recall and F1 scores considerably. Further, it is up to the business whether they would allow playing with custom cut-off values or not.
For this case study, however, I have chosen to proceed with logistic regression as it is easier to implement and interpret, and very efficient to train. Also, our dependent variable follows a binary classification of classes, and hence it is appropriate for us to rely on the logistic regression model to study the test case at hand.

Logistic regression is a classification algorithm used to find the probability of event success and event failure. It is used when the dependent variable is binary (0/1, True/False, Yes/No) in nature. It learns a linear relationship from the given dataset and then introduces non-linearity in the form of the sigmoid function.

2.4 Inference: Basis on these predictions, what are the insights and
recommendations.

We started this test case by looking at the data correlation to identify early trends and patterns. At one stage, salary and education seemed to be important parameters which might have played out as important predictors.
Holiday Package & Salary

While performing the bivariate analysis we observe that the salary distributions for employees opting and not opting for the holiday package are similar in nature. However, the distribution is more spread out for people not opting for the holiday package.

Holiday Package & age

There are no outliers present in age.
The distribution of the age variable with respect to the holiday package is also similar in nature; the range of ages for people not opting for the holiday package is more spread out compared with people opting for it.
We can clearly see that employees in the middle age range (34 to 45 years) opt for the holiday package more than older and younger employees.
However, the almost similar distributions for salary and age indicate that they might not come out as strong predictors after the model is created. Let's carry on with more data exploration and check.

There is a significant difference between employees with young children who opt for the holiday package and those who do not.
We can clearly see that people with young children tend not to opt for holiday packages.
We identify that the number of young children has a varied distribution and might end up playing an important role in our model building process.
Employees with older children have an almost similar distribution for opting and not opting for the holiday package across the number of children, and hence I don't think this will be an important predictor at all for my model; I did not include this specific variable in my model building process.
For this test case, I have chosen logistic regression as the better model for interpretation and analytical purposes. Keeping that in mind, I will quickly refer to the coefficient values:
1. Coefficient value for Salary: -2.10948637e-05, or almost 0
2. Coefficient value for age: -6.36093014e-02, or almost 0
3. Coefficient value for educ: 5.97505341e-02, or almost 0
4. Coefficient value for foreign: 1.21216610e+00
5. Coefficient value for no_young_children_1: -1.84510523e+00
6. Coefficient value for no_young_children_2: -2.54622113e+00
7. Coefficient value for no_young_children_3: -1.81252011e+00

A coefficient for a predictor variable shows the effect of a one unit change in that predictor. In logistic regression models the coefficients are expressed on the log-odds scale:
log(p / (1 - p)) = b0 + b1*x1 + b2*x2 + ... + bn*xn
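A minimal sketch of turning these log-odds coefficients into odds ratios for easier reading, reusing the fitted model and X_train from section 2.2:

import numpy as np
import pandas as pd

# exp(coefficient) is the multiplicative change in the odds for a one unit change in the predictor
odds_ratios = pd.Series(np.exp(model.coef_.ravel()), index=X_train.columns)
print(odds_ratios.sort_values(ascending=False))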
Interestingly, and as expected by me, salary and age did not turn out to be important predictors for my model. Also, the number of young children has emerged as a strong predictor (in likelihood terms) of not opting for holiday packages.
As interpretation:
1) There is no plausible effect of salary, age, and education on the prediction for Holliday_Package. These variables don't seem to impact the decision to opt for holiday packages, as we couldn't establish a strong relation between these variables and the target variable.
2) Foreign has emerged as a strong predictor with a positive coefficient value. The likelihood of a foreigner opting for a holiday package is high.
3) The no_young_children variable reduces the probability of opting for holiday packages, especially for couples with two young children.

The company can try binning salary ranges to see if they can derive some more meaningful interpretation out of that variable. They could also club salary or age into different buckets and see if there is some plausible impact on the target variable. Otherwise, the business can use different modelling techniques to do a deeper dive.

Recommendation:
1) The company should really focus on foreigners to drive the sales of their holiday packages, as that is where the majority of conversions are going to come from.
2) The company can direct their marketing efforts or offers towards foreigners for better conversion on holiday packages.
3) The company should stay away from targeting parents with young children. The chance of selling to parents with 2 young children is probably the lowest. This also fits with the fact that parents try to avoid travelling with young children.
4) If the firm wants to target parents with older children, that might still give a more favourable return on their marketing efforts than spending them on couples with young children.
