Capstone Project Submission
Introduction
Model Building
Model Validation
Final Interpretation/Recommendation
List of Tables:
Table 1: Correlation Check
Table 2: Houses having high quality rating
Table 3: Outliers ceil_measure
Table 4: Outliers basement
Table 5: Outliers living_measure
Table 6: Model Score for Validation
1. Introduction
Brief introduction about the problem statement and the need of solving it.
Problem Statement
A house's value is determined by more than location and square footage. Like the features that make up a person, an educated party wants to know all the aspects that give a house its value. For example, if we want to sell a house, we do not know what price to ask, as it can be neither too low nor too high. To estimate a house's price, we usually look for similar properties in our neighbourhood and assess our own price from the collected data.
Problem Definition
When a person or business wants to sell or buy a house, they face the problem of not knowing what price to offer, and so may offer too little or too much for the property. We can therefore analyze the available data on properties in the area and predict the price. We need to find how these attributes influence house prices. Right pricing is a very important aspect of selling a house, so it is essential to understand which factors influence the price and how. The objective is to predict the right price of a house based on its attributes.
We will build a model that predicts the house price when the required features are passed to it. To do so, we will:
- Find the significant features in the given dataset that affect the house price the most.
- Build the best feasible model to predict the house price at a 95% confidence level.
Since people often do not know which features make up a property's price, we can offer house buying/selling guidance services in the area, so that people can buy or sell their property at the most suitable price tag and neither lose their hard-earned money by offering a low price nor keep waiting for buyers by asking a high one.
2. EDA and Business Implication
Uni-variate / bi-variate / multi-variate analysis to understand the relationships between variables. How does your analysis impact the business?
Exploratory Data Analysis
Let's do some visual data analysis of the features.
Figure3: room_bath Boxplot
Figure6: ceil Boxplot
Figure7: quality
Figure8: ceil_measure
We can see that many features have outliers, so we may need to treat them before building the model.
There are 176 properties in the data that were sold more than once.
0 April-2015
1 March-2015
2 August-2014
3 October-2014
4 February-2015
Name: month_year, dtype: object
April-2015 2231
July-2014 2211
June-2014 2180
August-2014 1940
October-2014 1878
March-2015 1875
September-2014 1774
May-2014 1768
December-2014 1471
November-2014 1411
February-2015 1250
January-2015 978
May-2015 646
Name: month_year, dtype: int64
month_year
April-2015 561933.463021
August-2014 536527.039691
December-2014 524602.893270
February-2015 507919.603200
January-2015 525963.251534
July-2014 544892.161013
June-2014 558123.736239
March-2015 544057.683200
May-2014 548166.600113
May-2015 558193.095975
November-2014 522058.861800
October-2014 539127.477636
September-2014 529315.868095
Name: price, dtype: float64
So the timeline of the sale data runs from May-2014 to May-2015, and April 2015 has the highest mean price.
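The per-month mean prices above come from a simple groupby aggregation; a minimal pandas sketch on toy data (the month_year and price column names match the dataset, the values are illustrative):

```python
import pandas as pd

# Toy stand-in for the housing data; column names mirror the dataset,
# the values are made up for illustration.
df = pd.DataFrame({
    "month_year": ["April-2015", "April-2015", "May-2014", "May-2014"],
    "price": [600000, 520000, 540000, 556000],
})

# Mean sale price per month, as tabulated above.
mean_price = df.groupby("month_year")["price"].mean()
```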
count 2.161300e+04
mean 5.401822e+05
std 3.673622e+05
min 7.500000e+04
25% 3.219500e+05
50% 4.500000e+05
75% 6.450000e+05
max 7.700000e+06
Name: price, dtype: float64
Figure9: price
3.0 9875
4.0 6854
2.0 2747
5.0 1595
6.0 270
1.0 197
7.0 38
8.0 13
0.0 13
9.0 6
10.0 3
33.0 1
11.0 1
Name: room_bed, dtype: int64
The values 33 and 11 appear to be outliers; we need to check these data points before imputing them.
We will delete the 33-bedroom data point after bivariate analysis, as it looks like an outlier given its low price for a 33-bedroom property.
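Dropping such a record is a one-line filter; a minimal pandas sketch with made-up rows (only the room_bed and price column names mirror the dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "room_bed": [3, 4, 33, 2],
    "price": [450000, 640000, 640000, 322000],
})

# Remove the implausible 33-bedroom record flagged during univariate analysis.
df_clean = df[df["room_bed"] != 33]
```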
Figure10: room_bed
0.00 10
0.50 4
0.75 72
1.00 3829
1.25 9
1.50 1439
1.75 3031
2.00 1917
2.25 2147
2.50 5358
2.75 1178
3.00 750
3.25 588
3.50 726
3.75 155
4.00 135
4.25 78
4.50 100
4.75 23
5.00 21
5.25 13
5.50 10
5.75 4
6.00 6
6.25 2
6.50 2
6.75 2
7.50 1
7.75 1
8.00 2
Name: room_bath, dtype: int64
Figure11: room_bath
Skewness is : 0.5102509663719975
Skewness is : 1.460564983728366
count 21613.000000
mean 2078.226588
std 919.980534
min 2.250000
25% 1420.000000
50% 1910.000000
75% 2550.000000
max 13540.000000
Name: living_measure, dtype: float64
There are many outliers in living_measure; we need to review them further before treating them.
Only 9 properties have a living_measure above 8,000, so we will treat these outliers.
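The skewness figures above and the 8,000 cut-off can be reproduced with pandas; a minimal sketch on synthetic values (not the report's data):

```python
import pandas as pd

living_measure = pd.Series([1420, 1910, 2550, 13540, 9000])

# Sample skewness; a large positive value indicates a long right tail.
skew = living_measure.skew()

# Keep only properties at or below the 8,000 living_measure cut-off.
trimmed = living_measure[living_measure <= 8000]
```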
Analyzing Feature: lot_measure
Skewness is : 13.084880210575367
count 2.161300e+04
mean 1.509003e+04
std 4.138466e+04
min 5.200000e+02
25% 5.043000e+03
50% 7.618000e+03
75% 1.066000e+04
max 1.651359e+06
Name: lot_measure, dtype: float64
Only 1 property has a lot_measure above 1,250,000, so we need to treat it.
1.0 10647
2.0 8210
1.5 1977
3.0 610
2.5 161
3.5 8
Name: ceil, dtype: int64
Figure16: ceil
The graph above confirms that most properties have 1 or 2 floors.
0.0 19494
2.0 959
3.0 510
1.0 332
4.0 318
Name: sight, dtype: int64
3.0 14063
4.0 5655
5.0 1694
2.0 171
1.0 30
Name: condition, dtype: int64
Figure17: quality rating
Only 13 properties have the highest quality rating.
Skewness is : 1.446780194446363
Figure 18: ceil_measure distribution
The vertical lines at each point represent the interquartile range of values at that point.
Figure20: basement distribution
We can see two Gaussians, which tells us that some properties have basements and some do not.
Almost 60% of the properties have no basement.
Figure21: year_built
The build years of the properties range from 1900 to 2014, and we can see an upward trend over time.
Figure22: yr_renovated
Now we will create an age column from the yr_built and yr_renovated columns.
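One plausible construction (an assumption; the report does not state its exact formula) is to measure age from the sale year, letting a renovation reset the effective build year:

```python
import pandas as pd

df = pd.DataFrame({
    "yr_built":     [1960, 1990, 2005],
    "yr_renovated": [2000,    0,    0],   # 0 = never renovated
    "sale_year":    [2014, 2014, 2015],
})

# A renovation resets the effective build year; age is measured at sale time.
effective_year = df[["yr_built", "yr_renovated"]].max(axis=1)
df["age"] = df["sale_year"] - effective_year
```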
Figure23: furnished
Most properties are not furnished. The furnished column needs to be converted into a categorical column.
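The conversion is straightforward in pandas; a minimal sketch (the column name matches the dataset, the values are illustrative), which also shows the one-hot dummies that later surface as features like furnished_1:

```python
import pandas as pd

df = pd.DataFrame({"furnished": [0, 1, 0, 0]})

# Treat furnished as a category rather than a number...
df["furnished"] = df["furnished"].astype("category")

# ...and expand it into indicator columns for modelling.
dummies = pd.get_dummies(df["furnished"], prefix="furnished")
```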
BIVARIATE ANALYSIS
Correlation Check:
From the matrix above, we can see a linear relationship between the following features:
furnished : quality
We can plot a heatmap to easily confirm the findings above.
Figure24: heatmap
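The heatmap is just a coloured view of the pairwise correlation matrix; a minimal sketch on toy numbers (the real matrix is computed on the full dataset):

```python
import pandas as pd

# Toy data: quality and price rise together, furnished flips at higher quality.
df = pd.DataFrame({
    "quality":   [5, 6, 7, 8, 9],
    "furnished": [0, 0, 1, 1, 1],
    "price":     [300, 350, 500, 620, 700],
})

corr = df.corr()
# Passing `corr` to e.g. seaborn.heatmap(corr, annot=True) renders the
# figure referenced above; strongly correlated pairs stand out at a glance.
```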
Analyzing Bivariate for Feature: month_year
Figure25: month_year
The mean price of houses tends to be higher during March, April, and May than during the September to December period.
Figure 26: room_bed
Figure27: room_bath
3. Data Processing
Treating Outliers
We found 611 records that are outliers.
After treating the ceil_measure outliers, the data was reduced by about 600 (~3%) data points, but it is now nicely distributed.
Figure29: basement distribution
After treating the basement outliers, about 400 (~2%) data points were imputed. In total, about 5% of the data was imputed after treating ceil_measure and basement.
Table5: Outliers living_measure
Figure32: living_measure distribution
By treating the living_measure outliers, we lost 178 more data points, and the data distribution now looks normal.
Outliers are dropped because they appear to be data-entry errors, and their large deviation from the other records significantly reduces the efficiency of the predictive model.
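A common way to flag such outliers (an assumption; the report does not spell out its exact rule) is the 1.5 × IQR fence:

```python
import pandas as pd

s = pd.Series([100, 120, 130, 140, 150, 5000])

# Tukey's fences: anything beyond 1.5 * IQR from the quartiles is an outlier.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
inliers = s[(s >= lower) & (s <= upper)]
```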
4. Model Building
Lasso Regression:
Lasso regression is also called a penalized regression method. It is commonly used in machine learning to select a subset of variables. It can provide greater prediction accuracy than plain linear regression when many features are irrelevant, and the Lasso regularization also makes the model easier to interpret.
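A minimal scikit-learn sketch of this selection effect on synthetic data (not the report's model): only two of five features drive the target, and the L1 penalty zeroes the rest.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features actually influence the target.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

# The L1 penalty shrinks irrelevant coefficients exactly to zero,
# performing variable selection as a side effect of fitting.
lasso = Lasso(alpha=0.5).fit(X, y)
```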
Ridge Regression:
Ridge regression, like simple linear regression, assumes a linear relationship between the target variable and the independent variables. It is used where there is high correlation between the independent variables in the data set, which makes ordinary least-squares estimates unstable. Ridge regression deliberately introduces a small amount of bias through an L2 penalty term, which stabilizes the coefficients. It is a very powerful regression algorithm whose models are less prone to overfitting; it is a type of regularised linear regression that uses L2 regularisation.
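A minimal scikit-learn sketch of that stabilising effect on synthetic, nearly collinear predictors (not the report's model):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)   # almost a copy of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(scale=0.1, size=100)

ols = LinearRegression().fit(X, y)
# The L2 penalty trades a little bias for smaller, more stable coefficients.
ridge = Ridge(alpha=1.0).fit(X, y)
```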
Decision Tree Regression:
Decision trees are useful when there are complex relationships between the features and the output variable. They also work well compared to other algorithms when there are missing features, when there is a mix of categorical and numerical features, and when features differ greatly in scale.
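A minimal scikit-learn sketch of a regression tree on synthetic step-shaped data (not the report's model), illustrating how a tree captures a non-linear relationship:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
# A step function: cheap houses below the threshold, expensive above it.
y = np.where(X[:, 0] < 5, 100.0, 300.0) + rng.normal(scale=5, size=300)

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
pred = tree.predict([[2.0], [8.0]])
```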
Model Tuning
Effort to improve model performance.
Gradient boosting models keep improving to minimize all errors. This can overemphasize outliers and cause overfitting.
They are computationally expensive, often requiring many trees (>1000), which can be time- and memory-intensive.
The high flexibility results in many parameters that interact and influence
heavily the behavior of the approach (number of iterations, tree depth,
regularization parameters, etc.). This requires a large grid search during
tuning.
They are less interpretable in nature, although this can be addressed with various tools.
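Such a grid search can be run with scikit-learn's GridSearchCV; a small illustrative sketch on synthetic data (not the report's actual tuning grid):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

# A deliberately tiny grid; real tuning grids are much larger.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(GradientBoostingRegressor(random_state=0), param_grid, cv=3)
search.fit(X, y)
```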
Bagging Method:
Random Forest works well with both categorical and continuous variables.
Random Forest algorithm is very stable. Even if a new data point is introduced
in the dataset, the overall algorithm is not affected much since the new data
may impact one tree, but it is very hard for it to impact all the trees.
Random Forest is comparatively less impacted by noise.
1. Complexity: Random Forest creates many trees (unlike the single tree of a decision tree) and combines their outputs. By default, it creates 100 trees in the Python sklearn library. This requires much more computational power and resources; a decision tree, on the other hand, is simple and does not require as much computation.
2. Longer Training Period: Random Forest requires much more time to train than a decision tree, as it builds many trees (instead of one) and combines their outputs by majority vote (or by averaging, for regression).
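A minimal scikit-learn sketch confirming the 100-tree default and showing the feature importances a forest exposes (synthetic data, not the report's model):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=4, n_informative=2, random_state=0)

# n_estimators defaults to 100 trees in current sklearn versions.
rf = RandomForestRegressor(random_state=0).fit(X, y)
importances = rf.feature_importances_   # normalised to sum to 1
```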
5. Model Validation - How was the model validated? Just accuracy, or anything else too?
There are a number of different model validation techniques; choosing the right one depends on your data and on what you're trying to achieve with your machine learning model. These are the most common model validation techniques.
The most basic type of validation technique is a train and test split. The point of a
validation technique is to see how your machine learning model reacts to data it’s
never seen before. All validation methods are based on the train and test split, but
will have slight variations. With this basic validation method, you split your data into
two groups: training data and testing data. You hold back your testing data and do
not expose your machine learning model to it, until it’s time to test the model. Most
people use a 70/30 split for their data, with 70% of the data used to train the model.
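The 70/30 split described above is a single call in scikit-learn; a minimal sketch on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=100, n_features=3, random_state=0)

# Hold back 30% of the rows; the model never sees them until evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)
```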
We built several models. The performance (score and 95% confidence interval of the scores) of the model built on dataset-1 is better, based on its 95% confidence interval.
6. Final Interpretation/Recommendation
The top key features to consider for pricing a property are: 'furnished_1', 'yr_built', 'living_measure', 'quality_8', 'lot_measure15', 'quality_9', 'ceil_measure', 'total_area'.
So, a seller needs to thoroughly assess their property on the suggested parameters and list its price accordingly; similarly, someone who wants to buy a house should check the suggested features of the house and calculate the predicted price, which can then be compared to the listed price.
For further improvement, datasets can be built by treating outliers in different ways and by hyperparameter-tuning the ensemble models. Creating polynomial features to improve model performance can also be explored further.