

Introduction………………………………………………………………………………………………………………………………………3

EDA and Business Application…………………………………………………………………………………………………………...5

Data Cleaning and Pre-Processing……………………………………………………………………………………………………23

Model Building………………………………………………………………………………………………………………………………..27

Model Validation……………………………………………………………………………………………………………………………..30

Final Interpretation/Recommendation…………………………………………………………………………………………...30

List of Figures

Figure1: Price Boxplot…………………………………………………………………………………………………………………….….5


Figure2: room_bed Boxplot………………………………………………………………………………………………………….......5
Figure3: room_bath Boxplot……………………………………………………………………………………………………………...6
Figure4: living_measure Boxplot………………………………………………………………………………………………………..6
Figure5: lot_measure Boxplot…………………………………………………………………………………………………………...6
Figure6: ceil Boxplot………………………………………………………………………………………………………………………….7
Figure7: quality………………………………………………………………………………………………………………………………….7
Figure8: ceil_measure……………………………………………………………………………………………………………………….7
Figure9: price………………………………………………………………………………………………………………………………….…9
Figure10: room_bed………………………………………………………………………………………………………………………..10
Figure11: room_bath……………………………………………………………………………………………………………………….10
Figure12: room_bath distribution……………………………………………………………………………………………………11
Figure13: living_measure distribution……………………………………………………………………………………………..11
Figure14: living_measure boxplot……………………………………………………………………………………………………12
Figure15: lot_measure boxplot……………………………………………………………………………………………………..…15
Figure16: ceil……………………………………………………………………………………………………………………………………14
Figure17: quality rating……………………………………………………………………………………………………………………15
Figure18: ceil_measure distribution…………………………………………………………………………………………………..15
Figure19: ceil vs ceil_measure……………………………………………………………………………………………………….…16
Figure20: basement distribution………………………………………………………………………………………………………16
Figure21: year_built………………………………………………………………………………………………………………………...17
Figure22: yr_renovated……………………………………………………………………………………………………………………17
Figure23: furnished………………………………………………………………………………………………………………………….18
Figure24: heatmap………………………………………………………………………………………………………………………..…20
Figure25: month_year……………………………………………………………………………………………………………………..21
Figure26: room_bed…………………………………………………………………………………………………………………………22
Figure27: room_bath……………………………………………………………………………………………………………………….23
Figure 28: ceil_measure distribution………………………………………………………………………………………………..24
Figure29: basement distribution……………………………………………………………………………………………………..25
Figure30: basement boxplot…………………………………………………………………………………………………………….25
Figure31: living_measure boxplot……………………………………………………………………………………………………26
Figure32: living_measure distribution…………………………………………………………………………………………..…27

List of Tables:
Table1: Houses having high quality rating………………………………………………………………………………………..15
Table2: Correlation Check…………………………………………………………………………………………………………………19
Table3: Outliers ceil_measure………………………………………………………………………………………………………….23
Table4: Outliers basement……………………………………………………………………………………………………………….24
Table5: Outliers living_measure………………………………………………………………………………………………………26
Table 6: Model Score for Validation…………………………………………………………………………………………………30

1. Introduction
A brief introduction to the problem statement and the need for solving it.

Problem Statement

A house's value is more than just location and square footage. Like the features that make up a
person, an educated party would want to know all the aspects that give a house its value. For example, if
we want to sell a house, we do not know what price we can ask, as it cannot be too low or too
high. To find a house's price, we usually look for similar properties in our neighbourhood and, based on
the collected data, try to assess our own house's price.

a) Problem Definition

When a person or business wants to sell or buy a house, they always face this issue: they
do not know what price they should offer. Because of this, they might offer too little or too much for
the property. Therefore, we can analyse the available data on properties in the area and
predict the price. We need to find out how these attributes influence house prices. Right pricing is a
very important aspect of selling a house, so it is very important to understand what the factors are and how
they influence the house price. The objective is to predict the right price of a house based on its
attributes.

b) Need of the study/project

Build a model that will predict the house price when the required features are passed to it. So we will:

 Find the significant features in the given dataset that affect the house price the most.
 Build the best feasible model to predict the house price at a 95% confidence level.

c) Understanding business/social opportunity

As people often do not know which features/aspects contribute to a property's price, we can provide
house buying/selling guidance services in the area, so that they can buy or sell their property at the most
suitable price and do not lose their hard-earned money by offering too low a price or by waiting
endlessly for buyers because of an overly high price.

2. EDA and Business Implication
Uni-variate / bi-variate / multi-variate analysis to understand the relationships between variables, and
how the analysis impacts the business.

Exploratory Data Analysis: let's do some visual analysis of the features.

Univariate Analysis - By BoxPlot

Figure1: Price Boxplot

Figure2: room_bed Boxplot

Figure3: room_bath Boxplot

Figure4: living_measure Boxplot

Figure5: lot_measure Boxplot

Figure6: ceil Boxplot

Figure7: quality

Figure8: ceil_measure

We can see that many features have outliers, so we may need to treat them before building the
model.

Analyzing Feature: cid

 We have 176 properties that were sold more than once in the given data

Analyzing Feature: dayhours

0 April-2015
1 March-2015
2 August-2014
3 October-2014
4 February-2015
Name: month_year, dtype: object

 We converted the dayhours feature into a month_year feature for easier analysis; a sketch of the conversion is shown below.
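
A minimal sketch of how this conversion could be done with pandas, assuming the raw data is loaded into a DataFrame named df and that dayhours holds a parseable timestamp (the file name below is illustrative):

import pandas as pd

# Load the raw data (file name is illustrative)
df = pd.read_csv("house_data.csv")

# Parse the raw timestamp and derive a month_year label such as "April-2015"
df["dayhours"] = pd.to_datetime(df["dayhours"])
df["month_year"] = df["dayhours"].dt.strftime("%B-%Y")

print(df["month_year"].head())
print(df["month_year"].value_counts())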

April-2015 2231
July-2014 2211
June-2014 2180
August-2014 1940
October-2014 1878
March-2015 1875
September-2014 1774
May-2014 1768
December-2014 1471
November-2014 1411
February-2015 1250
January-2015 978
May-2015 646
Name: month_year, dtype: int64

 We can see that most houses were sold in April and July.

month_year
April-2015 561933.463021
August-2014 536527.039691
December-2014 524602.893270
February-2015 507919.603200
January-2015 525963.251534
July-2014 544892.161013
June-2014 558123.736239
March-2015 544057.683200
May-2014 548166.600113
May-2015 558193.095975
November-2014 522058.861800
October-2014 539127.477636
September-2014 529315.868095
Name: price, dtype: float64

So the sale data of the properties spans May-2014 to May-2015, and April has the highest mean
price.
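
A small sketch of how these monthly mean prices could be computed, assuming df already carries the month_year column created earlier:

# Mean sale price per month_year
monthly_mean_price = df.groupby("month_year")["price"].mean()
print(monthly_mean_price)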

Analyzing Feature: Price (our Target)

count 2.161300e+04
mean 5.401822e+05
std 3.673622e+05
min 7.500000e+04
25% 3.219500e+05
50% 4.500000e+05
75% 6.450000e+05
max 7.700000e+06
Name: price, dtype: float64

Figure9: price

 The price ranges from 75,000 to 7,700,000 and the distribution is right-skewed.

Analyzing Feature: room_bed

3.0 9875
4.0 6854
2.0 2747
5.0 1595
6.0 270
1.0 197
7.0 38
8.0 13
0.0 13
9.0 6
10.0 3
33.0 1

11.0 1
Name: room_bed, dtype: int64

 The values 33 and 11 appear to be outliers; we need to check these data points before
treating them.

 We will delete the 33-bedroom data point after the bivariate analysis, as it looks like an
outlier: it has a low price for a 33-bedroom property.
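
A minimal sketch of how this record could be inspected and then dropped (column names are from the dataset; the exact treatment is an assumption):

# Inspect the suspicious 33-bedroom record before removing it
print(df[df["room_bed"] == 33][["room_bed", "room_bath", "living_measure", "price"]])

# Drop the record, treated here as a data-entry error
df = df[df["room_bed"] != 33]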

Figure10: room_bed

 Most of the houses/properties have 3 or 4 bedrooms

Analyzing Feature: room_bath

0.00 10
0.50 4
0.75 72
1.00 3829
1.25 9
1.50 1439
1.75 3031
2.00 1917
2.25 2147
2.50 5358
2.75 1178
3.00 750
3.25 588
3.50 726
3.75 155
4.00 135
4.25 78
4.50 100
4.75 23

5.00 21
5.25 13
5.50 10
5.75 4
6.00 6
6.25 2
6.50 2
6.75 2
7.50 1
7.75 1
8.00 2
Name: room_bath, dtype: int64

Figure11: room_bath
 Skewness is : 0.5102509663719975

Figure12: room_bath distribution

Analyzing Feature: Living measure

 Skewness is : 1.460564983728366

count 21613.000000
mean 2078.226588
std 919.980534
min 2.250000
25% 1420.000000
50% 1910.000000
75% 2550.000000
max 13540.000000
Name: living_measure, dtype: float64

Figure13: living_measure distribution

 The distribution shows that living_measure is right-skewed.

Figure14: living_measure boxplot

 There are many outliers in living_measure. We need to review them further before treating them.

 We have only 9 properties with a living_measure above 8,000, so we will treat these as
outliers.
Analyzing Feature: lot_measure

Skewness is : 13.084880210575367

count 2.161300e+04
mean 1.509003e+04
std 4.138466e+04
min 5.200000e+02
25% 5.043000e+03
50% 7.618000e+03
75% 1.066000e+04
max 1.651359e+06
Name: lot_measure, dtype: float64

Figure15: lot_measure boxplot

 We have only 1 property with a lot_measure above 1,250,000, so we need to treat it.

Analyzing Feature: ceil

1.0 10647
2.0 8210
1.5 1977
3.0 610
2.5 161
3.5 8
Name: ceil, dtype: int64

Figure16: ceil

 We can see that most houses have 1 floor.

 The graph above confirms the same: most properties have 1 or 2 floors.

Analyzing Feature: coast


0.0 21452
1.0 161
Name: coast, dtype: int64

Analyzing Feature: sight

0.0 19494
2.0 959
3.0 510
1.0 332
4.0 318
Name: sight, dtype: int64

Analyzing Feature: condition

3.0 14063
4.0 5655
5.0 1694
2.0 171
1.0 30
Name: condition, dtype: int64

Analyzing Feature: quality

Figure17: quality rating

Table1: Houses having high quality rating

 There are only 13 properties with the highest quality rating.

Analyzing Feature: ceil_measure

 Skewness is : 1.446780194446363

Figure18: ceil_measure distribution

Figure19: ceil vs ceil_measure

 There is no clear pattern in ceil vs ceil_measure.

 The vertical lines at each point represent the inter-quartile range of values at that point.

Analyzing Feature: basement

Figure20: basement distribution

 We can see 2 Gaussians, which tells us that some properties have basements and some
do not.
 Almost 60% of the properties have no basement.

Analyzing Feature: yr_built

Figure21: year_built

 The built year of the properties ranges from 1900 to 2014, and we can see an upward trend
over time.

Analyzing Feature: yr_renovated

 Only 914 of the 21,613 houses were renovated.

Figure22: yr_renovated

 We will now create an age column from the yr_built and yr_renovated columns, as sketched below.
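
A minimal sketch of one way such an age feature could be derived, assuming yr_renovated is 0 for houses that were never renovated and using the year of sale (from dayhours, parsed earlier) as the reference year; both assumptions are illustrative:

import numpy as np

# Use the renovation year if the house was renovated, otherwise the built year
effective_year = np.where(df["yr_renovated"] > 0, df["yr_renovated"], df["yr_built"])

# Age of the property at the time of sale
df["age"] = df["dayhours"].dt.year - effective_year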

Analyzing Feature: furnished

Figure23: furnished

 Most properties are not furnished. The furnished column needs to be converted into a
categorical column, as sketched below.
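
A small sketch of the conversion, assuming furnished holds 0/1 flags; one-hot encoding it would also produce the furnished_1 column referred to later:

# Treat furnished as categorical rather than numeric
df["furnished"] = df["furnished"].astype(int)

# One-hot encode for modelling (creates a furnished_1 column)
df = pd.get_dummies(df, columns=["furnished"], drop_first=True)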

BIVARIATE ANALYSIS

Correlation Check:

Table2: Correlation Check

 From the correlation matrix above, we can see linear relationships among the following features:

 price: room_bath, living_measure, quality, living_measure15, furnished

 living_measure: price, room_bath. So we can consider dropping the 'room_bath' variable.

 quality: price, room_bath, living_measure

 ceil_measure: price, room_bath, living_measure, quality

 living_measure15: price, living_measure, quality. We can consider dropping living_measure15 as well, as it gives the same information as living_measure.

 lot_measure15: lot_measure. We can consider dropping lot_measure15, as it gives the same information as lot_measure.

 furnished: quality

 total_area: lot_measure, lot_measure15. We can consider dropping the total_area feature as well, as it gives the same information as lot_measure.

We can plot a heatmap to easily confirm the above findings.

Figure24: heatmap
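
A minimal sketch of how such a correlation heatmap could be produced with seaborn (figure size and colour map are illustrative choices):

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix of the numeric features
corr = df.corr(numeric_only=True)

plt.figure(figsize=(14, 10))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation heatmap")
plt.show()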
Analyzing Bivariate for Feature: month_year

Figure25: month_year
 The mean price of the houses tends to be higher during March, April and May compared to
the September to December period.

Analyzing Bivariate for Feature: room_bed

Figure26: room_bed

 There is a clear increasing trend in price with room_bed.

Figure27: room_bath

 There is an upward trend in price as room_bath increases.

Analyzing Bivariate for Feature: living_measure

3. Data Cleaning and Pre-processing

DATA PROCESSING

Treating Outliers

 We have seen outliers in the columns room_bed (the 33-bedroom record), living_measure,
lot_measure, ceil_measure and basement.

 Treating outliers for column - ceil_measure

Table3: Outliers ceil_measure

 We found 611 records that are outliers.

Figure 28: ceil_measure distribution

 After treating the ceil_measure outliers, the data is reduced by about 600 (~3%) data
points, but it is now nicely distributed.

Treating outliers for column - basement

Table4: Outliers basement

 We found 408 records that are outliers; let's drop them.

Figure29: basement distribution

 After treating the basement outliers, we can see that about 400 (~2%) more data points were removed.
In total, about 5% of the data has been treated after handling ceil_measure and basement.

Figure30: basement boxplot

Treating outliers for column - living_measure

Table5: Outliers living_measure

 We found 178 records that are outliers. Let's treat them by dropping them.

Figure31: living_measure boxplot

Figure32: living_measure distribution

 By treating the living_measure outliers, we lose 178 more data points, and the data distribution
now looks close to normal.
 Outliers are dropped because they appear to be data-entry errors, and their large deviation
from the other records would significantly affect the efficiency of the predictive model.
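
A minimal sketch of an IQR-based rule that could be used to flag and drop such outliers; the 1.5 x IQR fence is an assumption about the exact rule used:

def drop_iqr_outliers(frame, column, factor=1.5):
    """Drop rows whose value in `column` lies outside the IQR fences."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    mask = frame[column].between(lower, upper)
    print(f"{column}: dropping {(~mask).sum()} outlier records")
    return frame[mask]

df = drop_iqr_outliers(df, "living_measure")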

4. Model Building

 Be clear on why the particular model(s) were chosen.


Linear Regression:
Linear-regression models are relatively simple and provide an easy-to-interpret
mathematical formula that can generate predictions. Linear regression can be
applied to various areas in business and academic study.
You’ll find that linear regression is used in everything from biological, behavioral,
environmental and social sciences to business. Linear-regression models have
become a proven way to scientifically and reliably predict the future. Because linear
regression is a long-established statistical procedure, the properties of linear-
regression models are well understood, and such models can be trained very quickly.

Lasso Regression:

Lasso regression is also called a penalized regression method. It is often used in machine
learning to select a subset of variables. It can provide greater prediction accuracy than
other regression models, and Lasso regularization helps improve model interpretability.

Ridge Regression:
Like simple linear regression, ridge regression assumes a linear relationship between the
target variable and the independent variables. Ridge regression is used where there is high
correlation between the independent variables in the data set: such multicollinearity makes
the ordinary estimates unstable, so a bias (penalty) term is introduced into the ridge
regression equation. It is a very powerful regression algorithm in which the model is less
prone to overfitting. It is a type of regularised linear regression that uses L2 regularisation.
KNN Regression:

K nearest neighbors (KNN) is a supervised machine learning algorithm. A supervised
machine learning algorithm's goal is to learn a function f such that f(X) = Y, where X is
the input and Y is the output. KNN can be used for both classification and regression.

Decision Tree Regression:

Linear regression is used to predict continuous outputs when there is a linear
relationship between the features of the dataset and the output variable. It suits
regression problems where you are trying to predict something with infinitely many
possible values, such as the price of a house.
Decision trees can be used for either classification or regression problems and are
useful for complex datasets. They work by splitting the dataset, in a tree-like
structure, into smaller and smaller subsets and then making predictions based on which
subset a new example falls into.

Decision trees are useful when there are complex relationships between the features
and the output variable. They also work well compared to other algorithms when
there are missing features, when there is a mix of categorical and numerical features,
and when there is a big difference in the scale of the features.
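
A minimal sketch of how these candidate models could be fitted and compared with scikit-learn, assuming the cleaned DataFrame df contains only numeric/encoded predictors plus the price target (feature list, split ratio and hyperparameters are illustrative):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Predictors and target (assumes non-numeric columns were dropped or encoded)
X = df.drop(columns=["price"])
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Lasso Regression": Lasso(alpha=1.0),
    "Ridge Regression": Ridge(alpha=1.0),
    "KNN Regression": KNeighborsRegressor(n_neighbors=5),
    "Decision Tree Regression": DecisionTreeRegressor(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: R^2 on test data = {model.score(X_test, y_test):.3f}")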

Model Tuning
 Effort to improve model performance.

Gradient Boosting, Bagging Regression Model & Random Forest

Gradient Boosting Method:

 Often provides predictive accuracy that cannot be trumped.
 Lots of flexibility - it can optimize on different loss functions and provides
several hyperparameter tuning options that make the function fit very flexible.
 No data pre-processing required - it often works great with categorical and
numerical values as is.
 Handles missing data - imputation is not required.

 Gradient boosting models will keep improving to minimize all errors. This
can overemphasize outliers and cause overfitting.
 Computationally expensive - it often requires many trees (>1000), which can be
time- and memory-exhaustive.
 The high flexibility results in many parameters that interact and heavily
influence the behaviour of the approach (number of iterations, tree depth,
regularization parameters, etc.). This requires a large grid search during
tuning.
 Less interpretable in nature, although this is easily addressed with various
tools.

Bagging Method:

 Bagging offers the advantage of allowing many weak learners to combine their
efforts to outdo a single strong learner. It also helps reduce variance, hence
limiting the overfitting of models in the procedure.
 One disadvantage of bagging is that it introduces a loss of interpretability of the
model. The resultant model can also show a lot of bias when the proper
procedure is ignored. And although bagging is highly accurate, it can be
computationally expensive, which may discourage its use in certain instances.

Random Forest Method:

 Random Forest is based on the bagging algorithm and uses the ensemble
learning technique. It creates many trees on subsets of the data and
combines the output of all the trees. In this way it reduces the overfitting problem
in decision trees, reduces the variance, and therefore improves the accuracy.

 Random Forest can be used to solve both classification and regression problems.

 Random Forest works well with both categorical and continuous variables.

 Random Forest can automatically handle missing values.

 No feature scaling required: feature scaling (standardization and normalization)
is not needed, as Random Forest uses a rule-based approach instead of
distance calculations.

 Handles non-linear parameters efficiently: unlike curve-based algorithms,
non-linear relationships do not affect the performance of a Random Forest. So, if
there is high non-linearity between the independent variables, Random Forest
may outperform other curve-based algorithms.

 Random Forest is usually robust to outliers and can handle them automatically.

 The Random Forest algorithm is very stable. Even if a new data point is introduced
into the dataset, the overall algorithm is not affected much, since the new data
may impact one tree, but it is very hard for it to impact all the trees.

 Random Forest is comparatively less impacted by noise.

Disadvantages of Random Forest

1. Complexity: Random Forest creates a lot of trees (unlike a single tree in the
case of a decision tree) and combines their outputs. By default, it creates 100
trees in the Python sklearn library. To do so, the algorithm requires much more
computational power and resources. A decision tree, on the other hand, is simple
and does not require as many computational resources.

2. Longer training period: Random Forest requires much more time to train than a
decision tree, as it generates a lot of trees (instead of one) and makes decisions
based on the majority of votes.
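
A minimal sketch of fitting the ensemble models discussed above, reusing the X_train/X_test split from the earlier model-comparison sketch (hyperparameter values are illustrative):

from sklearn.ensemble import (BaggingRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)

ensembles = {
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=500, learning_rate=0.05, random_state=42),
    "Bagging": BaggingRegressor(n_estimators=100, random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

for name, model in ensembles.items():
    model.fit(X_train, y_train)
    print(f"{name}: R^2 on test data = {model.score(X_test, y_test):.3f}")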

5. Model Validation - How was the model validated? Just accuracy, or anything else
too?

There are a number of different model validation techniques; choosing the right one
will depend upon your data and what you're trying to achieve with your machine
learning model. The following is one of the most common model validation techniques.

Train and Test Split or Holdout

The most basic type of validation technique is a train and test split. The point of a
validation technique is to see how your machine learning model reacts to data it’s
never seen before. All validation methods are based on the train and test split, but
will have slight variations. With this basic validation method, you split your data into
two groups: training data and testing data. You hold back your testing data and do
not expose your machine learning model to it, until it’s time to test the model. Most
people use a 70/30 split for their data, with 70% of the data used to train the model.
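
A minimal sketch of this holdout approach with a 70/30 split, comparing train and test scores to check for overfitting (the choice of model here is illustrative):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X = df.drop(columns=["price"])
y = df["price"]

# 70% of the data is used for training, 30% is held back for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print(f"Train R^2: {model.score(X_train, y_train):.3f}")
print(f"Test R^2: {model.score(X_test, y_test):.3f}")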

Table 6: Model Score for Validation

6. Final interpretation / recommendation

 We have built different models. The performance (score and 95% confidence interval of the
scores) of the model built on dataset-1 is better in terms of the 95% confidence interval.

 The top key features to consider for pricing a property are: 'furnished_1', 'yr_built',
'living_measure', 'quality_8', 'lot_measure15', 'quality_9', 'ceil_measure', 'total_area'.
 So, one needs to thoroughly assess one's property on the parameters suggested and list its
price accordingly; similarly, if one wants to buy a house, one should check the features
suggested above for that house and calculate the predicted price. This can then be
compared to the listed price.
 For further improvement, datasets can be prepared by treating outliers in different ways
and by hyperparameter-tuning the ensemble models. Creating polynomial features and further
improving the model performance can also be explored.

