Predictive Modelling Project Gloria Susan Raju 11 APR 2021 PDF
MODELLING - PROJECT
You are hired by a company Gem Stones co ltd, which is a cubic zirconia manufacturer. You are provided with
the dataset containing the prices and other attributes of almost 27,000 cubic zirconia (which is an inexpensive
diamond alternative with many of the same qualities as a diamond). The company is earning different profits
on different price slots. You have to help the company in predicting the price for the stone on the basis of
the details given in the dataset so it can distinguish between higher profitable stones and lower profitable
stones so as to have better profit share. Also, provide them with the best 5 attributes that are most important.
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null
values, Data types, shape, EDA). Perform Univariate and Bivariate Analysis.
EXPLORATORY DATA ANALYSIS
The dataset consists of 11 variables: ‘Unnamed: 0, carat, cut, color, clarity, depth, table, x, y, z, price’.
The variable ‘Unnamed: 0’ is not needed for exploratory data analysis or any further predictions. Hence,
we can choose to drop the column. After dropping the column, the dataset looks as below:
The shape of the data is (26967, 10).
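A minimal sketch of this step, assuming the data is held in a pandas DataFrame. The three rows below are invented stand-ins for the real file (whose name is not given in this report); only the column names come from the dataset.

```python
import pandas as pd

# Toy stand-in for the real dataset: values are invented, column names are from the report.
df = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3],
    "carat": [0.30, 0.33, 0.90],
    "cut": ["Ideal", "Premium", "Very Good"],
    "color": ["E", "G", "E"],
    "clarity": ["SI1", "IF", "VVS2"],
    "depth": [62.1, 60.8, 62.2],
    "table": [58.0, 58.0, 60.0],
    "x": [4.27, 4.42, 6.04],
    "y": [4.29, 4.46, 6.12],
    "z": [2.66, 2.70, 3.78],
    "price": [499, 984, 6289],
})

# The index-like column carries no information, so drop it.
df = df.drop(columns=["Unnamed: 0"])

print(df.shape)   # (rows, columns)
print(df.dtypes)  # data type of each remaining column
```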
INFORMATION OF DATA
We can see that the variable ‘depth’ has a total of 697 null values, i.e., about 2.6% of the data is missing
for this column (697 of 26,967 rows).
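The null check can be sketched as follows. The small DataFrame is an invented example; the last line simply repeats the arithmetic for the counts quoted in this report (697 nulls in 26,967 rows).

```python
import numpy as np
import pandas as pd

# Invented example with one missing depth value.
df = pd.DataFrame({
    "carat": [0.30, 0.33, 0.90, 0.42],
    "depth": [62.1, np.nan, 62.2, 61.6],
})

null_counts = df.isnull().sum()                  # nulls per column
null_pct = (df.isnull().mean() * 100).round(2)   # share of nulls per column, in percent
print(pd.DataFrame({"nulls": null_counts, "percent": null_pct}))

# Same calculation for the figures reported for the real data.
reported_pct = round(697 / 26967 * 100, 2)
print(reported_pct)
```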
DESCRIPTIVE STATISTICS OF THE DATA
There are three categorical variables: cut, color and clarity. Cut has 5 unique values, color has 7 unique
values and clarity has 8 unique values.
Price will be the target variable considered while building the Linear Regression model.
CHECKING FOR DUPLICATES IN THE DATA
After checking for duplicates in the dataset, we can see there are a total of 34 duplicate rows. We can
choose to remove the duplicates to get better predictions and insights from the model.
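A short sketch of the duplicate check, using an invented four-row frame in which one row repeats. On the real data the same calls would be expected to report the 34 duplicates mentioned above.

```python
import pandas as pd

# Invented example: the third row is an exact copy of the first.
df = pd.DataFrame({
    "carat": [0.30, 0.33, 0.30, 0.90],
    "cut": ["Ideal", "Premium", "Ideal", "Good"],
    "price": [499, 984, 499, 6289],
})

n_duplicates = int(df.duplicated().sum())              # rows identical to an earlier row
df_clean = df.drop_duplicates().reset_index(drop=True)

print(n_duplicates, df_clean.shape)
```

Dropping 34 rows from 26,967 leaves 26,933, which matches the shape of Y reported later.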
UNIQUE VALUES IN CATEGORICAL VARIABLES
Cut has 5 unique values: ‘Fair, Good, Very Good, Premium, and Ideal’. Quality increases in the order
Fair, Good, Very Good, Premium, Ideal, where Ideal is the highest quality and Fair the lowest.
Color has 7 unique values: ‘J, I, D, H, F, E, G’. D is the best quality and J the worst quality.
Clarity has 8 unique values in the data. The full clarity scale, in order from best to worst, is
‘FL, IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1, I2, I3’, where FL = flawless and I3 = level 3 inclusions;
the dataset contains 8 of these 11 grades.
UNIVARIATE ANALYSIS
CARAT DISTRIBUTION
The plot shows the carat weight distribution of the cubic zirconia, which is positively skewed. The
skewness value for carat is 1.114789.
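The skewness values quoted in this section can be obtained with pandas' skew(). The sketch below uses invented numbers: a right-tailed series gives a positive value and a symmetric one gives zero.

```python
import pandas as pd

# Invented carat-like values with a long right tail.
carat = pd.Series([0.3, 0.3, 0.4, 0.5, 0.7, 1.0, 1.5, 2.5])
skewness = carat.skew()          # sample skewness; positive means a longer right tail

# A perfectly symmetric series for comparison.
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
symmetric_skewness = symmetric.skew()

print(skewness, symmetric_skewness)
```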
DEPTH DISTRIBUTION
Depth is the height of a cubic zirconia, measured from the culet to the table, divided by its average
girdle diameter. From the plot we can see that the distribution is almost normal, with a skewness
value of -0.026086.
The boxplot shows a large number of outliers in the depth distribution.
TABLE DISTRIBUTION
The width of the cubic zirconia's table is expressed as a percentage of its average diameter. The plot
shows that the data is positively skewed, with a skewness value of 0.765805.
The boxplot shows outliers present in the table data.
LENGTH DISTRIBUTION
The plot shows the length of the cubic zirconia in mm. The distribution is positively skewed, with a
skewness value of 0.392290.
The boxplot shows many outliers present in the data.
WIDTH DISTRIBUTION
The plot shows the width of the cubic zirconia in mm. The distribution is positively skewed, with a
skewness value of 3.867764.
The boxplot shows that the width distribution contains outliers.
HEIGHT DISTRIBUTION
The plot shows the distribution of the height of the cubic zirconia in mm. The distribution is
positively skewed, with a skewness of 2.580665.
The boxplot shows many outliers in the data.
CUT DISTRIBUTION
Ideal has the largest number of observations in the dataset, whereas Fair has the fewest. Ideal is the best
quality and Fair the lowest.
COLOR DISTRIBUTION
G has the largest number of observations and J the fewest. D is the best color, followed by E, F, G, H and I,
with J being the worst.
CLARITY DISTRIBUTION
Cubic zirconia clarity refers to the absence of inclusions and blemishes. The plot shows the total
count of observations for each category of clarity. From the plot we can see that SI1 has the largest
number of observations, followed by VS2.
The order from best to worst is FL, IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1, I2, I3.
PRICE DISTRIBUTION
The plot shows the distribution of the price of the cubic zirconia. The data is positively skewed,
with a skewness value of 1.619116.
The boxplot shows that there are many outliers present in the data.
BIVARIATE ANALYSIS
PAIRPLOT DISTRIBUTION
CUT AND PRICE
The plot shows the distribution of price across cut types. We can see that Premium has the highest
price and Ideal the lowest.
CLARITY AND PRICE
COLOR AND PRICE
From the plot, we can see that J has the highest price among all the color categories,
followed by I, H, G, F, E and D.
DEPTH AND PRICE
X AND PRICE
Y AND PRICE
Z AND PRICE
The variables carat, x, y, z and price are strongly correlated with each other.
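A correlation matrix is one way to quantify this. The sketch below builds invented data in which x and price are driven by carat (the coefficients and noise levels are assumptions for illustration only) and then calls DataFrame.corr().

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 2.5, size=500)

# Invented relationships: length grows with the cube root of weight, price with weight.
toy = pd.DataFrame({
    "carat": carat,
    "x": 6.5 * np.cbrt(carat) + rng.normal(0, 0.05, size=500),
    "price": 4000 * carat + rng.normal(0, 300, size=500),
})

corr = toy.corr()   # pairwise Pearson correlation
print(corr.round(2))
```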
1.2 Impute null values if present, also check for the values which are equal to zero. Do they
have any meaning or do we need to change them or drop them? Do you think scaling is
necessary in this case?
Checking for null value
Depth is the only variable with null values; about 2.6% of its values are missing.
We can choose to impute the null values with the median or the mean. Here I have chosen median
imputation. Once the imputation is completed, there are no null values left in the dataset.
A few columns contain some values equal to zero. I chose not to remove them.
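A minimal sketch of median imputation and the zero check, on invented values. Note that a zero in x, y or z cannot be a real measurement of a stone, so such rows are usually worth inspecting even if, as here, they are kept.

```python
import numpy as np
import pandas as pd

# Invented example: two missing depths, and a zero in x and in z.
df = pd.DataFrame({
    "depth": [62.1, np.nan, 61.0, 60.0, np.nan],
    "x": [4.27, 0.0, 6.04, 4.42, 5.10],
    "y": [4.29, 4.46, 6.12, 4.40, 5.05],
    "z": [2.66, 2.70, 0.0, 2.71, 3.10],
})

depth_median = df["depth"].median()             # NaN values are skipped by default
df["depth"] = df["depth"].fillna(depth_median)  # median imputation

zero_counts = (df[["x", "y", "z"]] == 0).sum()  # number of zeros per dimension column
print(depth_median)
print(zero_counts)
```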
OUTLIERS
From the univariate analysis, we found that the data has outliers, so we deal with them by
treating the outliers.
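The report does not say which treatment was used. One common choice, shown here purely as an assumption, is to cap values at the boxplot whiskers (1.5 times the interquartile range beyond the quartiles). The helper name cap_outliers_iqr is hypothetical.

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper, not from the report)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Invented prices with one extreme value.
price = pd.Series([400.0, 500.0, 600.0, 700.0, 800.0, 900.0, 1000.0, 20000.0])
capped = cap_outliers_iqr(price)

print(capped.tolist())
```

Capping keeps every row, unlike dropping outliers, which matters when many observations fall outside the whiskers.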
SCALING
Scaling can be done to normalize the range of independent variables or features of data. Since
multicollinearity is present in the data, we can choose to scale the data.
I chose the StandardScaler() method to scale the data. After scaling, each variable has a mean of 0
and a standard deviation of 1.
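A small sketch of what StandardScaler() does to each column: it centres the values to mean 0 and scales them to unit standard deviation, so scaled values can be negative or larger than 1. The input numbers are invented.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented feature matrix: two columns on very different scales.
X = np.array([[0.3, 55.0],
              [0.7, 57.0],
              [1.0, 58.0],
              [2.0, 62.0]])

scaled = StandardScaler().fit_transform(X)

print(scaled.mean(axis=0))  # approximately 0 for each column
print(scaled.std(axis=0))   # approximately 1 for each column
```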
1.3 Encode the data (having string values) for Modelling. Data Split: Split the data into train
and test (70:30). Apply Linear regression. Performance Metrics: Check the performance of
Predictions on Train and Test sets using Rsquare, RMSE.
ENCODING STRING VALUES
We use the get_dummies() function to encode the string values for modelling, i.e., converting the
categorical variables to dummy or indicator variables.
We split the data into train and test sets in a 70:30 ratio. We copy all the predictor variables into the
X data frame and the target, price, into the y data frame.
Shape of Y is (26933, 1).
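The encoding and split can be sketched as below on invented rows. The use of drop_first=True and the random_state value are assumptions; the report does not state either.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented example with one categorical column.
df = pd.DataFrame({
    "carat": [0.3, 0.5, 0.7, 0.9, 1.0, 1.2, 1.5, 1.7, 2.0, 2.2],
    "cut": ["Ideal", "Premium", "Good", "Ideal", "Fair",
            "Very Good", "Premium", "Ideal", "Good", "Fair"],
    "price": [500, 900, 1500, 2500, 3000, 4200, 6000, 7500, 9000, 11000],
})

# One indicator column per category; drop_first avoids a redundant column (assumption).
encoded = pd.get_dummies(df, columns=["cut"], drop_first=True, dtype=int)

X = encoded.drop(columns=["price"])   # predictors
y = encoded[["price"]]                # target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1   # 70:30 split; seed is arbitrary
)
print(X_train.shape, X_test.shape)
```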
LINEAR REGRESSION MODEL:
We run LinearRegression() to fit the model to the training data.
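A minimal sketch of fitting the model and computing R-square and RMSE on both sets. The data is synthetic (price is generated from carat with an invented coefficient and noise), so the printed scores say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 2.5, size=200)
price = 4000 * carat + rng.normal(0, 300, size=200)   # invented relationship
X = carat.reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.30, random_state=1
)

model = LinearRegression().fit(X_train, y_train)

def report(name, X_part, y_part):
    """Print and return R-square and RMSE for one data split."""
    pred = model.predict(X_part)
    r2 = r2_score(y_part, pred)
    rmse = float(np.sqrt(mean_squared_error(y_part, pred)))
    print(f"{name}: R2={r2:.3f}, RMSE={rmse:.1f}")
    return r2, rmse

train_r2, train_rmse = report("train", X_train, y_train)
test_r2, test_rmse = report("test", X_test, y_test)
```

Similar scores on train and test suggest the model is not overfitting; a large gap would point the other way.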
We can still see high correlation present in the data from the VIF values. As a rule of thumb, a VIF of 5
or less is considered acceptable.
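For reference, the variance inflation factor of a feature is 1 / (1 - R^2), where R^2 comes from regressing that feature on all the others. The sketch below computes it with plain NumPy on invented data; the function name vif is hypothetical and this is not necessarily how the values in this report were produced.

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the remaining columns."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Design matrix: intercept plus every column except j.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1 - resid.var() / target.var()
        out[j] = 1 / (1 - r2)
    return out

rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 2.5, size=500)
x = 6.5 * np.cbrt(carat) + rng.normal(0, 0.05, size=500)  # nearly determined by carat
table = rng.normal(57, 2, size=500)                       # unrelated to the others

vifs = vif(np.column_stack([carat, x, table]))
print(vifs.round(2))
```

In this toy setup carat and x get large VIFs because each is almost a function of the other, while table stays close to 1.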
INFERENTIAL STATISTICS
1ST Iteration
For the 2nd iteration, I dropped the variable depth to reduce the high collinearity.
1.4 Inference: Basis on these predictions, what are the business insights and
recommendations.
The exploratory analysis clearly showed us that stones with Ideal, Premium and Very Good cuts
brought in more profits to the company. Hence we can recommend bringing in more marketing strategies
to promote these cuts, e.g., advertising or inviting social media influencers.
Similarly, the colors H, I and J are bringing in more profits, so we need to maintain the same and use these
colors to bring in more profits to the company. For the other colors, which are not bringing in
profits, we can either decrease their price or promote them so that they sell out.
Since stones sell most when their clarity is higher, the jeweler should make sure that they are
of the finest quality, hence bringing in more customers.
__________________________________________________________________________________________
Problem 2: Logistic Regression and LDA
You are hired by a tour and travel agency which deals in selling holiday packages. You are provided details of
872 employees of a company. Among these employees, some opted for the package and some didn't. You
have to help the company in predicting whether an employee will opt for the package or not on the basis of
the information given in the data set. Also, find out the important factors on the basis of which the company
will focus on particular employees to sell their packages.
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition
check, write an inference on it. Perform Univariate and Bivariate Analysis. Do exploratory data
analysis.
EXPLORATORY DATA ANALYSIS
Since we do not need the variable Unnamed for prediction or model building, we can drop the column.
After dropping the column, the data look as below:
INFORMATION OF THE DATA
DESCRIPTIVE STATISTICS
DUPLICATES
Holliday_Package has two values: no and yes. No has a total of 471 values whereas yes has 401 values.
Foreign has two values: no and yes. No has 656 values and yes has 216 values.
UNIVARIATE ANALYSIS
SALARY DISTRIBUTION
The plot shows the salary of employees. The data is positively skewed, with a skewness value
of 3.103216.
From the boxplot, we can see that there are many outliers present in the data.
AGE DISTRIBUTION
The plot shows the distribution of age in years. It is slightly positively skewed, with a skewness
value of 0.146412.
From the boxplot, we can see that there are no outliers present in the variable age.
EDUC DISTRIBUTION
The plot shows the distribution of years of formal education. It is slightly negatively skewed,
with a skewness of -0.045501.
No_young_children Distribution
The plot shows the number of young children (younger than 7 years). The distribution is positively
skewed, with a skewness value of 1.946515.
No_older_children Distribution
The plot shows the distribution of the number of older children. The distribution is positively skewed,
with a skewness of 0.953951.
The boxplot shows there are a few outliers present in the data.
Foreign Distribution
The plot shows the distribution of Foreign. It has two categories; ‘no’ has more values than ‘yes’.
Holliday_Package Distribution
The plot shows the distribution of Holliday_Package. It has two categories, yes and no; ‘no’ has
more values than ‘yes’.
BIVARIATE ANALYSIS
PAIR PLOT DATA DISTRIBUTION
From the pairplot, we can see that there is not much correlation between the variables, and the
distributions look roughly normal with no large variation.
CORRELATION HEATMAP
From the plot, we can see that employees with a salary of more than 100000 opted for the
Holliday_Package.
Holliday_Package and Age
Employees younger than 50 tend to choose the Holliday_Package, and employees older than 50 are
less likely to take it.
Holliday_Package and educ
People with more years of formal education are choosing the Holliday_Package.
No_young_children and Holliday_package
AGE AND SALARY
People aged between 30 and 50 with a salary of less than 100000 are opting for the Holliday_Package.
EDUC AND SALARY
People with formal education and a salary above 50000 are opting for the Holliday_Package.
REMOVING OUTLIERS
From the univariate analysis we found that there are many outliers present in the data. For Logistic
Regression and LDA, it is better to treat the outliers in order to get the best results.
After treating the outliers the data looks as below. There are no outliers present in the data after
treatment.
2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data Split:
Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear
discriminant analysis).
ENCODING THE CATEGORICAL VARIABLES FOR MODELLING
We use the get_dummies() function to encode the string values for modelling, i.e., converting the
categorical variables to dummy or indicator variables.
We copy the predictor variables into an X data frame and the target variable into a Y data frame.
Then we split the data into train and test sets in a 70:30 ratio.
LOGISTIC REGRESSION MODEL
Fitting the logistic regression model to the training data:
For the LDA model, we convert the categorical target variable to integers (1 and 0).
Then we copy the predictor and target variables into the X and Y data frames and split the data into
train and test sets in a 70:30 ratio.
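The steps above can be sketched as follows. The data is synthetic: the column names follow this report, but the values and the rule that generates Holliday_Package are invented so that the toy target is learnable. drop_first=True, max_iter and random_state are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.integers(20, 63, size=n),
    "educ": rng.integers(1, 22, size=n),
    "foreign": rng.choice(["no", "yes"], size=n),
})
# Invented rule: younger and foreign employees opt in more often, plus noise.
score = (-0.08 * (df["age"] - 40)
         + 1.5 * (df["foreign"] == "yes")
         + rng.normal(0, 1, size=n))
df["Holliday_Package"] = np.where(score > 0.5, "yes", "no")

# Encode string predictors as 0/1 indicators and the target as 1 (yes) / 0 (no).
X = pd.get_dummies(df.drop(columns=["Holliday_Package"]), drop_first=True, dtype=int)
y = (df["Holliday_Package"] == "yes").astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

log_reg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

print("Logistic regression test accuracy:", log_reg.score(X_test, y_test))
print("LDA test accuracy:", lda.score(X_test, y_test))
```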
X_train
Y_train
X_Test
Y_Test
2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model Final
Model: Compare Both the models and write inference which model is best/optimized.
LOGISTIC REGRESSION MODEL
Confusion Matrix and Classification Report of the Train data
Confusion Matrix and classification report for the test data:
ROC and AUC Score : 0.661
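As a reference for how these metrics are computed, here is a small hand-checkable sketch with invented labels and predicted probabilities (not the model output from this report). The 0.5 cut-off for turning probabilities into classes is an assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, roc_curve

# Invented true labels and predicted probabilities of class 1.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.6, 0.7, 0.9])
y_pred = (y_prob >= 0.5).astype(int)      # class prediction at a 0.5 threshold

cm = confusion_matrix(y_true, y_pred)     # rows: actual class, columns: predicted class
acc = accuracy_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_prob)       # uses probabilities, not class labels
fpr, tpr, thresholds = roc_curve(y_true, y_prob)

print(cm)
print(acc, auc)
```

Plotting fpr against tpr gives the ROC curve; the AUC is the area under it, which here equals the share of (negative, positive) pairs ranked correctly, 11 of 16.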
Confusion Matrix and classification Report for the Test Data
AUC AND ROC For both Train and Test Data
COMPARING LR AND LDA MODEL
Both models have almost the same metrics in the classification report.
The Logistic Regression model has better precision and recall. Therefore, the Logistic Regression model
is considered the better-optimized model.
2.4 Inference: Basis on these predictions, what are the insights and recommendations.
From the EDA, we found that people aged above 50, particularly between 50 and 60, are not opting for
holiday packages. This might be due either to concerns about their safety during travel or to the price
of the package. By focusing on this age group, we can add promotional strategies that explain the
safety precautions taken during travel, and we can even offer senior citizen discounts to them.
Secondly, we can bring in more marketing strategies, such as social media campaigns with giveaways and
lucky draws, to attract more customers.
People with a salary of more than 150000 are more numerous among those choosing the holiday packages;
therefore we can roll out offers for this category of people to bring in more customers, which should
also be profitable for the company.
__________________________________________________________________________________________