Predictive Modelling Project 1 PDF
You are hired by Gem Stones Co Ltd, a cubic zirconia manufacturer. You are provided with a dataset containing
the prices and other attributes of almost 27,000 cubic zirconia stones (cubic zirconia is an inexpensive
diamond alternative with many of the same qualities as a diamond). The company earns different profits in
different price slots. You have to help the company predict the price of a stone on the basis of the
details given in the dataset, so that it can distinguish between higher-profit and lower-profit stones and
improve its profit share. Also, provide them with the 5 attributes that are most important.
Data Dictionary:
Cut: Describes the cut quality of the cubic zirconia. Quality in increasing order: Fair, Good, Very Good, Premium, Ideal.
Color: Colour of the cubic zirconia, with D being the best and J the worst.
Clarity: Refers to the absence of inclusions and blemishes. In order from best to worst (FL = flawless, I3 = level 3 inclusions): FL, IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1, I2, I3.
Depth: The height of a cubic zirconia, measured from the culet to the table, divided by its average girdle diameter.
Table: The width of the cubic zirconia's table expressed as a percentage of its average diameter.
All the necessary libraries were loaded, and the data load and basic information of the data were checked.
The target variable is price.
Among the other variables, cut, color and clarity are categorical, whereas carat, depth, table, x, y and z are
continuous.
Checking for Null Values:
There are 697 null values in depth. This is less than 3% of the total values.
• In cut, there are five unique values: Fair, Good, Very Good, Premium and Ideal. The Ideal cut seems to be
the most preferred cut.
• There are 7 different colors in the data set.
• There are 8 different values for clarity.
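The checks described above can be sketched as follows. This is a minimal illustration, not the original notebook: the inline CSV text is a made-up stand-in for the real file, and the column names are assumed from the data dictionary and the variable list above.

```python
import io
import pandas as pd

# Hypothetical four-row sample standing in for the real dataset (values are made up).
csv_text = """carat,cut,color,clarity,depth,table,x,y,z,price
0.30,Ideal,E,SI1,62.1,58.0,4.27,4.29,2.66,499
0.33,Premium,G,IF,,58.0,4.42,4.46,2.70,984
0.90,Very Good,E,VVS2,62.2,60.0,6.04,6.12,3.78,6289
1.01,Fair,J,SI2,64.5,57.0,6.35,6.30,4.00,3500
"""
df = pd.read_csv(io.StringIO(csv_text))

# Null values per column, as a count and as a percentage of all rows.
null_counts = df.isnull().sum()
null_share = df.isnull().mean() * 100

# Treat every non-numeric column as categorical and count its unique levels.
cat_cols = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
unique_levels = {c: df[c].nunique() for c in cat_cols}

print(null_counts[null_counts > 0])
print(unique_levels)
```

On the real data, the same calls would report the null count in depth and the number of levels in cut, color and clarity.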
Univariate / Bivariate analysis
• Carat is positively skewed and may be multimodal, as multiple peaks are seen in the data.
• Depth is approximately normally distributed with a single peak, ranging between 55 and 70.
• Price is positively skewed.
• Table is positively skewed and may be multimodal, as multiple peaks are seen in the data.
• X, Y and Z are positively skewed, with X possibly multimodal.
• Carat, depth, table, price, x, y and z all show outliers.
Count Plots:
This clearly shows that the count increases as the quality of cut increases, and the Ideal cut seems to be the most
preferred.
The plot between cut and price shows that the Ideal cut seems to be cheaper, hence it is also the most preferred cut.
The 697 null values in depth (less than 3% of the total values) were filled by median imputation on the depth
column, after which there are no null values.
Some rows have 0 as the value of x, y or z. As x, y and z denote the dimensions of the stone, a value of 0 does not
make sense, hence these rows can be dropped.
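A minimal sketch of these two cleaning steps, using a small made-up frame in place of the real data (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two missing depth values and two rows with a zero dimension.
df = pd.DataFrame({
    "depth": [61.5, np.nan, 62.3, 60.8, np.nan],
    "x": [4.3, 5.1, 0.0, 6.2, 5.7],
    "y": [4.3, 5.2, 4.9, 0.0, 5.7],
    "z": [2.7, 3.1, 3.0, 3.8, 3.5],
})

# Median imputation: replace missing depth values with the column median.
df["depth"] = df["depth"].fillna(df["depth"].median())

# Keep only rows where all three dimensions are non-zero.
dims = ["x", "y", "z"]
clean = df[(df[dims] != 0).all(axis=1)].reset_index(drop=True)

print(df["depth"].isnull().sum(), len(clean))
```

The median is used rather than the mean because it is less affected by the outliers noted earlier.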
SCALING
From the correlation matrix presented above, it is clear that multicollinearity is present in the data. The
variables were scaled to bring them onto a comparable range and to reduce the inflated VIF values. Scaling does
not change the fit of the model (R-square and predictions), although the coefficients and the intercept are then
expressed in scaled units.
The variance inflation factor (VIF) computed after scaling indicates that the multicollinearity has been taken care of.
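One way the VIF comparison could be reproduced is sketched below. It assumes z-score scaling with scikit-learn's StandardScaler and the statsmodels variance_inflation_factor helper applied without an added constant; the report does not state which tools were actually used, and the data here is randomly generated.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical, independently generated columns (not the real data).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "carat": rng.uniform(0.2, 2.0, 200),
    "depth": rng.normal(61.7, 1.4, 200),
    "table": rng.normal(57.5, 2.2, 200),
})

def vif_table(frame):
    """Return the VIF of every column of a numeric frame."""
    values = frame.to_numpy(dtype=float)
    vifs = [variance_inflation_factor(values, i) for i in range(values.shape[1])]
    return pd.Series(vifs, index=frame.columns)

vif_raw = vif_table(X)

# Z-score scaling: each column gets mean 0 and standard deviation 1.
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)
vif_scaled = vif_table(X_scaled)

print(vif_raw.round(1))
print(vif_scaled.round(2))
```

In this sketch the drop in VIF comes from centring the columns. Columns that are strongly correlated with each other would keep a high VIF after scaling, so the scaled values are the ones worth inspecting.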
Treating outliers:
1.3 Encode the data (having string values) for Modelling. Data Split: Split the data into train and test (70:30).
Apply Linear regression. Performance Metrics: Check the performance of Predictions on Train and Test sets
using Rsquare, RMSE.
A linear regression model does not take categorical data, hence the categorical variables are encoded with dummies.
The unnamed column is dropped as it does not carry any meaning, and the target variable is separated from the other variables.
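The encoding, 70:30 split and evaluation steps might look like the sketch below. The data is synthetic and the price relationship is invented for illustration; only the workflow (dummy encoding, split, linear regression, R-square and RMSE on both sets) follows the text.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical data: price depends on carat plus noise (an invented relationship).
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "carat": rng.uniform(0.2, 2.0, n),
    "cut": rng.choice(["Fair", "Good", "Very Good", "Premium", "Ideal"], n),
    "color": rng.choice(list("DEFGHIJ"), n),
})
df["price"] = 4000 * df["carat"] + rng.normal(0, 300, n)

# Dummy encoding; drop_first avoids one redundant column per categorical variable.
encoded = pd.get_dummies(df, columns=["cut", "color"], drop_first=True, dtype=float)

# Separate the target from the predictors and split 70:30.
X = encoded.drop(columns="price")
y = encoded["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

model = LinearRegression().fit(X_train, y_train)

def report(X_part, y_part):
    """Return (R-square, RMSE) of the fitted model on one data split."""
    pred = model.predict(X_part)
    rmse = float(np.sqrt(mean_squared_error(y_part, pred)))
    return r2_score(y_part, pred), rmse

r2_train, rmse_train = report(X_train, y_train)
r2_test, rmse_test = report(X_test, y_test)
print(r2_train, rmse_train, r2_test, rmse_test)
```

Similar R-square and RMSE values on the train and test sets suggest that the model is not overfitting.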
From the initial data analysis we understood that the Ideal cut has better price points than the other cuts, hence
providing better profit to the company. The colours H, I and J have better price points, and G has the median price
points. In clarity there are no values for flawless, hence flawless has no relation with profit.
The statsmodels output shows which variables have little effect. Only depth seems to have little effect, so the
depth variable was dropped and the equation was formed.
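A sketch of this step with the statsmodels formula API is shown below. The data is synthetic, and the 0.05 p-value threshold is an assumption; the report does not state the criterion used to decide that depth has little effect.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical scaled data in which depth has no real effect on price.
rng = np.random.default_rng(2)
n = 400
df = pd.DataFrame({
    "carat": rng.normal(0, 1, n),
    "depth": rng.normal(0, 1, n),
    "table": rng.normal(0, 1, n),
})
df["price"] = 1.2 * df["carat"] - 0.3 * df["table"] + rng.normal(0, 0.3, n)

# Fit the full model and read the p-value of every predictor.
predictors = ["carat", "depth", "table"]
full = smf.ols("price ~ " + " + ".join(predictors), data=df).fit()
p_values = full.pvalues.drop("Intercept")

# Assumed rule: drop predictors whose p-value exceeds 0.05, then refit.
weak = p_values[p_values > 0.05].index.tolist()
kept = [c for c in predictors if c not in weak]
reduced = smf.ols("price ~ " + " + ".join(kept), data=df).fit()

print(weak)
print(reduced.params.round(2))
```

The fitted parameters of the reduced model are what would be written out as the final equation.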
The equation:
price = (-0.83) * Intercept + (1.24) * carat + (-0.02) * table + (-0.37) * x + (0.32) * y + (-0.12) * z
+ (0.11) * cut_Good + (0.18) * cut_Ideal + (0.17) * cut_Premium + (0.15) * cut_Very_Good
+ (-0.05) * color_E + (-0.07) * color_F + (-0.12) * color_G + (-0.24) * color_H + (-0.38) * color_I + (-0.54) * color_J
+ (1.16) * clarity_IF + (0.74) * clarity_SI1 + (0.5) * clarity_SI2 + (0.97) * clarity_VS1 + (0.89) * clarity_VS2
+ (1.09) * clarity_VVS1 + (1.08) * clarity_VVS2
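As a worked example, the coefficients above can be applied to a single stone and ranked by magnitude. The stone's feature values below are hypothetical, and both the inputs and the predicted price are assumed to be in the scaled units of the fitted model. Ranking by absolute coefficient is only one rough way of picking the most important attributes.

```python
# Coefficients copied from the equation above; the intercept is -0.83.
intercept = -0.83
coefficients = {
    "carat": 1.24, "table": -0.02, "x": -0.37, "y": 0.32, "z": -0.12,
    "cut_Good": 0.11, "cut_Ideal": 0.18, "cut_Premium": 0.17, "cut_Very_Good": 0.15,
    "color_E": -0.05, "color_F": -0.07, "color_G": -0.12, "color_H": -0.24,
    "color_I": -0.38, "color_J": -0.54,
    "clarity_IF": 1.16, "clarity_SI1": 0.74, "clarity_SI2": 0.5,
    "clarity_VS1": 0.97, "clarity_VS2": 0.89, "clarity_VVS1": 1.09,
    "clarity_VVS2": 1.08,
}

# Hypothetical stone: scaled carat of 1.0, Ideal cut, colour G, clarity VS2.
# Features that are not listed are taken as 0.
stone = {"carat": 1.0, "cut_Ideal": 1, "color_G": 1, "clarity_VS2": 1}

predicted = intercept + sum(coefficients[name] * value for name, value in stone.items())

# Rank the attributes by the absolute size of their coefficient.
top5 = sorted(coefficients, key=lambda name: abs(coefficients[name]), reverse=True)[:5]

print(round(predicted, 2))  # 1.36
print(top5)
```

For this stone the sum is -0.83 + 1.24 + 0.18 - 0.12 + 0.89 = 1.36, and the five largest coefficients belong to carat and four clarity levels.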
Recommendations
• The Ideal, Premium and Very Good cut types bring in more profit, hence more marketing can be done for them
to increase profits.
• Clarity is highly important: the clearer the stone, the higher the profit.
• The diameter is one of the next most important attributes. The median diameter is 5.71, hence stones can
be cut to around this size to make more profit.
Problem 2: Logistic Regression and LDA
You are hired by a tour and travel agency which deals in selling holiday packages. You are provided details of 872
employees of a company. Among these employees, some opted for the package and some didn't. You have to
help the company in predicting whether an employee will opt for the package or not on the basis of the
information given in the data set. Also, find out the important factors on the basis of which the company will
focus on particular employees to sell their packages.
Data Dictionary:
Checking the data types shows two categorical variables, Holiday_package and foreign. There are 8
columns and 872 rows.
Checking for Null Values
Data Describe:
Salary, age, educ, number of young children, number of older children and whether the employee has been abroad
(foreign) are the attributes to be checked to help the company predict whether a person will opt for the holiday
package or not.
• Salary is positively skewed, age is normally distributed, educ has multiple peaks, and the number of young
children and the number of older children are positively skewed with more than one peak.
• The salary data has a lot of outliers, whereas the other variables have few outliers.
Data Distribution
The two holiday package classes do not show clearly different distributions, as there is no large difference in
the data distribution between them.
There is no multicollinearity present in the data.
There is clear indication that people with salary more than 1,50,000 have always opted for holiday packages
Holiday package vs Educ
Logistic regression was performed. A grid search was applied to find the optimal solver and
parameters to be used in the logistic regression.
The liblinear solver was suggested, and the penalty and tolerance level were found.
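A grid search of this kind could be set up as in the sketch below. The data is synthetic, and the grid values, the 3-fold cross-validation and the F1 scoring are assumptions, since the report only names the solver, penalty and tolerance as the tuned parameters.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical binary classification data standing in for the holiday package data.
X, y = make_classification(n_samples=500, n_features=6, n_informative=3, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=3, stratify=y
)

# Two sub-grids, because only liblinear supports the l1 penalty here.
param_grid = [
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "tol": [1e-4, 1e-3]},
    {"solver": ["lbfgs", "newton-cg"], "penalty": ["l2"], "tol": [1e-4, 1e-3]},
]

search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=3, scoring="f1"
)
search.fit(X_train, y_train)

best_model = search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)

print(search.best_params_)
print(round(test_accuracy, 3))
```

The best_params_ attribute reports the chosen solver, penalty and tolerance, and best_estimator_ is the model refitted on the full training set with those settings.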
LDA
Logistic Regression
Confusion matrix,
Changing the cut off value to check optimal F1 score and accuracy,
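The cut-off check can be sketched as follows, here with an LDA model on synthetic data. The grid of cut-offs from 0.1 to 0.9 is an assumption; the same loop works with the predicted probabilities of the logistic regression model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Hypothetical binary classification data standing in for the holiday package data.
X, y = make_classification(n_samples=600, n_features=6, n_informative=3, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=4, stratify=y
)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
proba = lda.predict_proba(X_test)[:, 1]  # predicted probability of class 1

# Score every cut-off: predict class 1 when the probability reaches the cut-off.
results = []
for cutoff in np.linspace(0.1, 0.9, 9):
    pred = (proba >= cutoff).astype(int)
    results.append((
        round(float(cutoff), 1),
        accuracy_score(y_test, pred),
        f1_score(y_test, pred, zero_division=0),
    ))

# Pick the cut-off with the highest F1 score.
best_cutoff, best_accuracy, best_f1 = max(results, key=lambda row: row[2])
print(best_cutoff, round(best_accuracy, 3), round(best_f1, 3))
```

For brevity the cut-off is chosen on the test split here; choosing it on the training or a validation split gives a less optimistic estimate.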
2.4 Inference: Basis on these predictions, what are the insights and recommendations.
From the given data set we have to predict whether a particular person would opt for a holiday package or not.
To understand this we built both logistic regression and LDA models. LDA seems to perform slightly better than the
logistic regression.
The exploratory data analysis shows that salary, age and educ are important parameters, and gives insights such as:
• People with a salary of more than 1,50,000 are opting for the package.
• People above 50 years are not opting for the package much.
• People aged 30 to 50 generally opt for holiday packages depending on their salary.
Recommendations
• As the employees earning more than 1,50,000 are opting for the package, there should be more options and
more attractive packages for them. This will allow the company to earn more, as they will be ready to spend if
the packages are good.
• As older people are not taking any packages, there could be options that will attract them, such as pilgrimage
packages.