
Linear_Regression_Assignment

Problem Statement:
You are hired by Gem Stones Co. Ltd, a cubic zirconia manufacturer. You are provided with a dataset containing the prices and other attributes of almost 27,000 cubic zirconia stones (cubic zirconia is an inexpensive diamond alternative with many of the same qualities as a diamond). The company earns different profits in different price slots. You have to help the company predict the price of a stone on the basis of the details given in the dataset, so that it can distinguish between higher-profit and lower-profit stones and improve its profit share. Also, provide the five attributes that are most important.
Read the data and perform exploratory data analysis. Describe the data briefly (check null values, data types, shape). Perform univariate and bivariate analysis.

Loading all the necessary libraries for model building.


Now, reading the head and tail of the dataset to check whether the data has been read in properly.
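The loading-and-inspection step can be sketched as follows. In the assignment the data comes from a CSV (the filename is an assumption here; substitute the file actually provided), so a tiny hand-made sample with the same columns stands in for it:

```python
import pandas as pd

# In the assignment the data would be read from a file, e.g.:
#   df = pd.read_csv("cubic_zirconia.csv")   # filename is an assumption
# A small stand-in frame with the same columns:
df = pd.DataFrame({
    "carat":   [0.30, 0.33, 0.90, 0.42, 0.31],
    "cut":     ["Ideal", "Premium", "Very Good", "Ideal", "Ideal"],
    "color":   ["E", "G", "E", "F", "F"],
    "clarity": ["SI1", "IF", "VVS2", "VS1", "VVS1"],
    "depth":   [62.1, 61.9, 62.2, 61.6, 60.4],
    "table":   [58.0, 58.0, 60.0, 56.0, 59.0],
    "x":       [4.27, 4.42, 6.04, 4.82, 4.35],
    "y":       [4.29, 4.46, 6.12, 4.80, 4.43],
    "z":       [2.66, 2.75, 3.78, 2.96, 2.65],
    "price":   [499, 984, 6289, 1082, 779],
})

print(df.head())   # first five rows
print(df.tail())   # last five rows
print(df.shape)    # (rows, columns)
df.info()          # dtypes and non-null counts in one view
```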

Data Head:

Data Tail:

Checking shape and Info of the Data:


We have float, int and object data types in the data.

Description of the Data:

We have both categorical and continuous data.

The categorical attributes are cut, color and clarity.
The continuous attributes are carat, depth, table, x, y, z and price.
Price is the target variable.

Checking for duplicates in the Data:


Unique values in the Categorical Data:
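Both checks are one-liners in pandas. A sketch on a small stand-in frame (the real one has ~27,000 rows), with one row duplicated on purpose:

```python
import pandas as pd

df = pd.DataFrame({
    "cut":     ["Ideal", "Premium", "Ideal", "Fair", "Ideal"],
    "color":   ["E", "G", "E", "J", "E"],
    "clarity": ["SI1", "IF", "SI1", "I1", "SI1"],
    "price":   [499, 984, 499, 2757, 499],
})

# Count exact duplicate rows, then drop them.
n_dupes = df.duplicated().sum()
df = df.drop_duplicates()

# Unique values in each categorical column.
for col in ["cut", "color", "clarity"]:
    print(col, "->", sorted(df[col].unique()))
```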

Univariate / Bivariate Analysis:

The distribution of carat appears positively skewed; the multiple peaks suggest the distribution may be multimodal, and the box plot of carat shows a large number of outliers. The majority of the data lies in the range of 0 to 1.

The distribution of depth appears approximately normal.

Depth ranges from 55 to 65.
The box plot of the depth distribution shows many outliers.

The distribution of table also appears positively skewed.

The box plot of table has outliers.
Most of the data lies between 55 and 65.

The distribution of x (length of the cubic zirconia in mm) is positively skewed.

The box plot shows many outliers.
The distribution ranges from 4 to 8.

The distribution of y (width of the cubic zirconia in mm) is positively skewed, and its box plot also shows outliers.
The distribution is strongly positively skewed.
The skewness may be because the stones are always cut to specific shapes, so there may not be many different sizes in the market.

The distribution of z (height of the cubic zirconia in mm) is positively skewed.

The box plot also shows outliers.
The distribution is strongly positively skewed.
As with y, the skewness may be because the stones are always cut to specific shapes, so there may not be many different sizes in the market.

Price is also positively skewed and has outliers in the data.
The bulk of the price distribution lies between Rs. 100 and 8,000.

Price – Histogram:

Skew of Data:
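The skew figures above come straight from pandas; the real call is simply `df.skew(numeric_only=True)`. A minimal sketch on stand-in data with the same right-tailed shape:

```python
import pandas as pd

# Stand-in numeric sample; on the real frame the call is identical.
df = pd.DataFrame({
    "carat": [0.3, 0.4, 0.5, 0.9, 2.0],
    "price": [400, 600, 900, 4000, 15000],
})

skews = df.skew(numeric_only=True)
print(skews)
# Positive values confirm the right-tailed (positively skewed) shape
# described above for carat and price.
```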

Bivariate Analysis:

Cut:
Quality in increasing order: Fair, Good, Very Good, Premium, Ideal.

The most common cut is Ideal.

One reason the Ideal cut may be the most common is that those stones are priced lower than other cuts.

Color:
D is the best and J is the worst
We have 7 colours in the data, The G seems to be the preferred colour,

We see the G is priced in the middle of the seven colours, whereas J being the worst
colour price seems too high.

Clarity:
Best to Worst, FL = flawless, I3= level 3 inclusions) FL, IF, VVS1, VVS2, VS1, VS2,
SI1, SI2, I1, I2, I3
The clarity VS2 seems to be preferred by people.

The data has No FL diamonds, from this we can clearly understand the flawless
diamonds are not bringing any profits to the store.

More Relations between Categorical Variables:

Cut and Color:

Cut and Clarity:


Correlation:

Carat vs Price:

Depth vs Price:
X vs Price:
Y vs Price:
Z vs Price:

Data Distribution:
Correlation Matrix:
This matrix clearly shows the presence of multicollinearity in the dataset.
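The matrix itself is `df.corr()` on the numeric columns. A sketch on a stand-in frame where x, y, z and carat move together, as they do in the real data:

```python
import pandas as pd

# Stand-in frame mimicking the real pattern: the size columns rise together.
df = pd.DataFrame({
    "carat": [0.3, 0.5, 0.7, 1.0, 1.5],
    "x":     [4.3, 5.1, 5.7, 6.4, 7.3],
    "y":     [4.3, 5.1, 5.8, 6.4, 7.4],
    "z":     [2.7, 3.2, 3.6, 4.0, 4.6],
    "price": [500, 1500, 2500, 5000, 9000],
})

corr = df.corr(numeric_only=True)
print(corr.round(2))
# Near-1 off-diagonal entries (e.g. carat vs x/y/z) are the sign of
# multicollinearity flagged above.
```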

Checking whether any value is “0”:

Scaling:
Scaling can be useful for checking or reducing multicollinearity in the data. If scaling is not applied, the VIF (variance inflation factor) values are very high, which indicates the presence of multicollinearity. These values are calculated after building the linear regression model, to understand the multicollinearity in it. Scaling had no impact on the model score, the attribute coefficients or the intercept.
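In practice the VIF values would usually be computed with statsmodels' `variance_inflation_factor`. A dependency-free sketch of the same quantity, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing column j on the remaining columns:

```python
import numpy as np

def vif(X):
    """VIF for each column of X: regress the column on the remaining
    columns (plus an intercept) and return 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return out

# Two nearly collinear columns -> huge VIFs; an unrelated one -> VIF near 1.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = np.column_stack([a,
                     a + rng.normal(scale=0.01, size=200),
                     rng.normal(size=200)])
print([round(v, 1) for v in vif(X)])
```

The common rule of thumb used later in this report is that a VIF above 5 flags a problematic column.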

After Scaling – VIF Values:


Checking the Outliers in the Data:

Before treating Outliers:

After treating Outliers:
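The report does not state its exact treatment; one common choice, sketched here, is IQR capping: values outside [Q1 - 1.5·IQR, Q3 + 1.5·IQR] are clipped to the fence rather than dropped.

```python
import pandas as pd

def cap_outliers(s):
    """Clip a Series to the usual 1.5*IQR fences (an assumed treatment,
    not necessarily the one used in the original notebook)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

s = pd.Series([0.3, 0.4, 0.5, 0.6, 0.7, 5.0])   # 5.0 is an outlier
capped = cap_outliers(s)
print(capped.tolist())   # the 5.0 is pulled down to the upper fence
```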


Encode the data (having string values) for modelling. Data split: split the data into train and test sets (70:30). Apply linear regression. Performance metrics: check the performance of the predictions on the train and test sets using R-square and RMSE.
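The split-fit-score steps above can be sketched with scikit-learn. The data here is synthetic (price roughly linear in carat), standing in for the encoded frame:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: price roughly linear in carat, plus noise.
rng = np.random.default_rng(42)
carat = rng.uniform(0.2, 2.0, size=500)
price = 4000 * carat + rng.normal(scale=200, size=500)
X = carat.reshape(-1, 1)

# 70:30 split, as the assignment specifies.
X_tr, X_te, y_tr, y_te = train_test_split(X, price, test_size=0.30,
                                          random_state=1)

model = LinearRegression().fit(X_tr, y_tr)
for name, Xs, ys in [("train", X_tr, y_tr), ("test", X_te, y_te)]:
    pred = model.predict(Xs)
    rmse = mean_squared_error(ys, pred) ** 0.5   # RMSE = sqrt(MSE)
    print(f"{name}: R2={r2_score(ys, pred):.3f}  RMSE={rmse:.1f}")
```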

ENCODING THE STRING VALUES:


GET DUMMIES:

Dummies have been encoded:


A linear regression model cannot take categorical values directly, so we have encoded the categorical values as integers for better results.
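The dummy encoding above is `pd.get_dummies`. A minimal sketch; `drop_first=True` keeps k-1 dummies per category (which is why no `cut_Fair`, `color_D` or `clarity_I1` column appears in the equation later):

```python
import pandas as pd

df = pd.DataFrame({
    "cut":   ["Ideal", "Premium", "Fair"],
    "color": ["E", "G", "J"],
    "price": [500, 900, 2700],
})

# drop_first=True drops the first level of each category, so the dummy
# columns are not perfectly collinear with the intercept.
encoded = pd.get_dummies(df, columns=["cut", "color"], drop_first=True)
print(encoded.columns.tolist())
```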

DROPPING UNWANTED COLUMNS:

Linear Model:
VIF – Values:
We still find multicollinearity in the dataset. To bring these values down to a lower level, we can drop columns after fitting the stats model.
From the stats model we can identify the features that do not contribute to the model; after removing those features the VIF values will be reduced.
The ideal VIF value is less than 5.

Stats Model:

BEST PARAMS SUMMARY:


After dropping the depth variable

OLS Regression Results:


To bring the values down to lower levels we can drop one of each pair of variables that are highly correlated; dropping such variables brings the multicollinearity level down.

Insights:
We had a business problem: predict the price of a stone and provide the company with insights on profits in different price slots. From the EDA we can see that among the cuts, the Ideal cut brought the most profit to the company. The colours H, I and J have brought profits for the company. In clarity, there were no flawless stones, and no profits were coming from the I1, I2 and I3 stones. The Ideal, Premium and Very Good cut types were bringing profits, whereas Fair and Good were not.
The model captures 95% of the variation in price, as explained by the predictors in the training set.
Using the stats model, re-running the model gives us p-values and coefficients, which provide a better understanding of the relationships; variables with p-values above 0.05 can be dropped and the model re-run for better results. For better accuracy, the depth column was dropped in a later iteration.

The Equation:

price = (-0.76) * Intercept + (1.1) * carat + (-0.01) * table + (-0.32) * x + (0.28) * y + (-0.11) * z + (0.1) * cut_Good + (0.15) * cut_Ideal + (0.15) * cut_Premium + (0.13) * cut_Very_Good + (-0.05) * color_E + (-0.06) * color_F + (-0.1) * color_G + (-0.21) * color_H + (-0.32) * color_I + (-0.47) * color_J + (1.0) * clarity_IF + (0.64) * clarity_SI1 + (0.43) * clarity_SI2 + (0.84) * clarity_VS1 + (0.77) * clarity_VS2 + (0.94) * clarity_VVS1 + (0.93) * clarity_VVS2
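The fitted equation can be applied directly as a dot product of the coefficients with the feature values (note the features were scaled and dummy-encoded, so the illustrative inputs below are in scaled units, not raw millimetres):

```python
# Coefficients copied from the fitted equation above.
coef = {
    "Intercept": -0.76, "carat": 1.1, "table": -0.01, "x": -0.32,
    "y": 0.28, "z": -0.11, "cut_Good": 0.1, "cut_Ideal": 0.15,
    "cut_Premium": 0.15, "cut_Very_Good": 0.13, "color_E": -0.05,
    "color_F": -0.06, "color_G": -0.1, "color_H": -0.21,
    "color_I": -0.32, "color_J": -0.47, "clarity_IF": 1.0,
    "clarity_SI1": 0.64, "clarity_SI2": 0.43, "clarity_VS1": 0.84,
    "clarity_VS2": 0.77, "clarity_VVS1": 0.94, "clarity_VVS2": 0.93,
}

def predict(features):
    """Scaled-price prediction: Intercept is fixed at 1 and any dummy
    not listed in `features` is treated as 0."""
    features = {"Intercept": 1.0, **features}
    return sum(coef[k] * features.get(k, 0.0) for k in coef)

# An Ideal-cut, colour-G, clarity-VS2 stone with illustrative scaled values.
print(round(predict({"carat": 0.5, "table": 0.1, "x": 0.4, "y": 0.4,
                     "z": 0.3, "cut_Ideal": 1, "color_G": 1,
                     "clarity_VS2": 1}), 3))
```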

Recommendations:
1. The Ideal, Premium and Very Good cut types are the ones bringing profits, so marketing could focus on these to bring in more profit.
2. The clarity of the diamond is the next most important attribute: the clearer the stone, the higher the profit.

The five best attributes are:

Carat
y (the width of the stone)
clarity_IF
clarity_SI1
clarity_SI2
clarity_VS1
clarity_VS2
clarity_VVS1
clarity_VVS2

--------------------------------------The END----------------------------------------
