
BUSINESS REPORT
Capstone Project

HOUSE PRICE PREDICTION

Project Note-1

SONAL SINGH
01/05/2022

CONTENT
1) Introduction of the Problem
a) Defining problem statement
b) Need of the study/project
c) Understanding business/social opportunity

2) Data Report
a) Understanding how data was collected in terms of time, frequency
and methodology
b) Visual inspection of data (rows, columns, descriptive details)
c) Understanding of attributes (variable info, renaming if required)

3) Exploratory data analysis


a) Duplicates
b) Univariate analysis (distribution and spread for every continuous
attribute, distribution of data in categories for categorical ones)
c) Bivariate analysis
d) Removal of unwanted variables
e) Missing value treatment
f) Outlier treatment
g) Variable transformation
h) Scaling of data
i) Log Transformation
j) Encoding

4) Business Insights from EDA


a) Is the data unbalanced? If so, what can be done? Please explain in
the context of the business
b) Any business insights using clustering
c) Any other business insights

1) INTRODUCTION
This section introduces the project and provides a basic understanding of its objectives. The
analysis deals with the prediction of house prices based on the factors given in the data set
that define the attributes of a house. In other words, it targets understanding the real estate
market of the given geographical location. The prediction of a house price does not depend
only on the square footage of space it occupies; other factors such as the number of bedrooms,
bathrooms and floors, basement area, condition of the house, quality of the house, year of
build, waterfront view, age of the house and age of renovation of the house are a few of the
important points that play a major role in determining its cost. Through this project we try
to derive different patterns, explore multiple other questions and derive answers to them by
applying our learning and models from the past 11 months of study.

Defining Problem Statement


The goal of this analysis is to understand the relationship between the features of a house
and how those features can predict its price.
A house's value is more than just location and square footage. Like the features that make up
a person, an educated party would want to know all the aspects that give a house its value. For
example, suppose you want to sell a house and do not know what price to expect: it cannot be
too low or too high. To find the house price you usually look for similar properties in your
neighborhood and, based on the gathered data, assess the price of your own house.

Assumptions
This section aims at understanding the attributes in the data set which are not explained well
in the problem.

Ceil – the number of levels/floors of the house; 1 is the lowest value in the data and 3.5
the maximum.

Coast – 0 indicates no waterfront view and 1 indicates a waterfront view

Condition – 1 indicates the poorest condition and 5 indicates the best condition

Quality – 1 indicates the poorest quality and 13 indicates the best quality

Furnished – 0 indicates not furnished and 1 indicates furnished

Scope of Project
This section considers, as data scientists, what the scope of this project is in the real
world. Real estate is an always-active market. It is also one of the markets that gets hit
hardest in times of economic distress. As per research, real estate generates almost 35
percent of the total revenue of the country's economy. For the young population, real estate
is among the most viable options to invest in. Even during the Corona pandemic this market
kept working, although it saw some crashes and booms in parallel with the stock
movements.
A seller cannot estimate the price of a house on intuition alone. The features of the house
help evaluate its price, and different houses have different features; comparing the features
of many houses helps evaluate relevant prices. Hence, analyzing data in bulk can help predict
the house price. How do we arrive at profitable pricing for houses and buildings, so that
neither the seller nor the buyer is at a loss? That is where the factors affecting the price
of the house come into the picture. If a fair evaluation is made of all the factors, how they
contribute and why they contribute, then a profitable figure can be derived which leads to a
win-win situation for both parties. Understanding the contribution of real estate to the
economy and to the standard of living of an individual, it is essential for us to contribute
our data skills so as to make for a fair and profitable future.

Understanding business/social opportunity


This section explains how such a project or study can generate business profitability or
social benefits.
Real estate is a booming sector that contributes hugely to the country's economy. It is also
one of the sectors that contributes substantially to generating employment; not only for house
brokers and builders, but also for the laborers who help with the construction of the buildings.
If a sector contributes so heavily to the economy and to employment, it is only fair to have
honest and viable pricing of the product that the sector generates, in our case houses. Unfair
pricing is an injustice not only to the buyer and the seller but also to the workers who
contribute to building the real estate. Moreover, for the big companies in the business of
building, buying and selling properties, the major share of turnover comes from the pricing
of houses, whether newly built or resold. Housing is also the investment option chosen by a
majority of the public. Hence, as data scientists, it is our duty to provide fair pricing and a
sound understanding of the factors that contribute to the pricing of properties. This project
is therefore important to the lives of people, as well as to the profits of companies at home
and abroad.

2) Data Report

Understanding how data was collected in terms of time, frequency and methodology
This section explains how the data was collected.
This is a Capstone Project driven by Great Learning, hence the "House Price
Prediction" data was provided to us through the learning platform.
The data was collected over the period from 2014 to 2015.

Visual inspection of data (rows, columns, descriptive details)

The various attributes provided are

1) cid: a notation for a house


2) dayhours: Date house was sold
3) price: Price is prediction target
4) room_bed: Number of Bedrooms/House
5) room_bath: Number of bathrooms/bedrooms
6) living_measure: square footage of the home
7) lot_measure: square footage of the lot
8) ceil: Total floors (levels) in house
9) coast: House which has a view to a waterfront
10) sight: Has been viewed
11) condition: How good the condition is (Overall)
12) quality: grade given to the housing unit, based on grading system
13) ceil_measure: square footage of house apart from basement
14) basement_measure: square footage of the basement
15) yr_built: Built Year
16) yr_renovated: Year when house was renovated
17) zipcode: zip
18) lat: Latitude coordinate
19) long: Longitude coordinate
20) living_measure15: Living room area in 2015 (implies some renovations); this might or
might not have affected the lot size area
21) lot_measure15: Lot size area in 2015 (implies some renovations)
22) furnished: Based on the quality of room
23) total_area: Measure of both living and lot

In Fig. 1 we can see the initial look of the data. It tells us that the data has 23 columns.
These columns are the different factors that impact the price of a house, such as the number
of bedrooms, number of bathrooms, number of floors, quality of the house, condition of the
house, etc. Each column has a different name and a different meaning.
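A minimal pandas sketch of this first inspection (the file name house_price.csv is an assumption for illustration):

```python
# Load the dataset and take a first look at it.
import pandas as pd

df = pd.read_csv("house_price.csv")  # assumed file name
print(df.head())   # initial look at the data, as in Fig 1
print(df.shape)    # expected: (21613, 23)
```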

Number of Rows & Columns

We see that there are 21613 rows and 23 columns in the dataset. This tallies with the figure
above, where we also got 23 columns. The 21613 rows mean there are 21613 entries of different
instances. These rows may contain missing data or duplicates, and they can also hold unwanted
inputs, such as an object value inside a float/integer column.

Data Info

In the dataset we have more than 21k records and 23 columns, out of which
• 12 features are of float type
• 4 features are of integer type
• 7 features are of object type

We see that room_bed, room_bath, living_measure, lot_measure, ceil, coast, sight, condition,
quality, ceil_measure, basement, yr_built, living_measure15, lot_measure15, furnished and
total_area have null values. We know this because the figure above shows that bedrooms and
bathrooms have only 21505 non-null values, which means the remaining (21613 – 21505) entries
are null (NaN). The same pattern appears in the other columns listed; all of them have null
values, and it is important to identify and then treat those nulls.
Another observation is the data type of each column: object, float64 or int64. The object
dtype appears when alphabetic characters or symbols creep into the dataset, float64 when
there are decimals, and int64 (integer64) when there are integer values. A surprising
observation is that dayhours is of object type; this is due to the presence of a "T" in the
middle of the date string. The remaining features such as total_area, long, yr_built,
condition, coast and ceil are numerical features but are shown as object dtype because of bad
data that needs to be treated. In conclusion, 12 columns are of float64 type, 4 columns are of
int64 type and 7 columns are of object type. Bad or missing data needs to be treated to get
accurate results.
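A hedged sketch of how the mis-typed object columns can be coerced back to numeric (the column list follows the text above; invalid entries become NaN, to be imputed later):

```python
# Coerce columns that should be numeric but were read as object
# because of bad entries; invalid values become NaN.
import pandas as pd

bad_numeric_cols = ["total_area", "long", "yr_built", "condition", "coast", "ceil"]
for col in bad_numeric_cols:
    df[col] = pd.to_numeric(df[col], errors="coerce")
```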

Data Description

Besides graphs, statistics that summarize the distribution of the data are used to transform
data into information. The five-number summary, which forms the basis for a boxplot, is a good
example of summarizing data. The table above shows the summary statistics of the dataset.

• CID: House ID/Property ID. Not used for analysis


• price: Our target column value is in 75k - 7700k range. As Mean > Median, it's Right-
Skewed.
• room_bed: The number of bedrooms ranges from 0 - 33. As Mean is slightly > Median,
it's slightly Right-Skewed.
• room_bath: The number of bathrooms ranges from 0 - 8. As Mean is slightly < Median,
it's slightly Left-Skewed.
• living_measure: square footage of house ranges from 290 - 13,540. As Mean > Median,
it's Right-Skewed.
• lot_measure: Square footage of the lot ranges from 520 - 1,651,359. As Mean is almost double
the Median, it's Highly Right-Skewed.
• ceil: The number of floors ranges from 1 - 3.5. As Mean ~ Median, it's almost Normally
Distributed.
• coast: This value represents whether the house has a waterfront view or not.
It's a categorical column. From the above analysis we got to know that very few houses have a
waterfront view.
• sight: Value ranges from 0 - 4. As Mean > Median, it's Right-Skewed
• condition: Represents rating of house which ranges from 1 - 5. As Mean > Median,
it's Right-Skewed
• quality: Representing grade given to house which range from 1 - 13. As Mean > Median,
it's Right-Skewed.
• ceil_measure: square footage of house apart from basement ranges in 290 - 9,410. As
Mean > Median, it's Right-Skewed.
• basement: Square footage of the basement ranges from 0 - 4,820. As Mean is much greater than
Median, it's Highly Right-Skewed.
• yr_built: House built year ranges from 1900 - 2015. As Mean < Median, it's Left-
Skewed.
• yr_renovated: House renovation year runs up to 2015 (0 when never renovated). So, this column
can be used as a Categorical Variable for knowing whether a house is renovated or not.
• zipcode: House Zip Code ranges from 98001 - 98199. As Mean > Median, it's Right-
Skewed.
• lat: Latitude ranges from 47.1559 - 47.7776. As Mean < Median, it's Left-Skewed.
• long: Longitude ranges from -122.5190 to -121.315. As Mean > Median, it's Right-
Skewed.
• living_measure15: Value ranges from 399 to 6,210. As Mean > Median, it's Right-
Skewed.
• lot_measure15: Value ranges from 651 to 871,200. As Mean is much greater than Median,
it's Highly Right-Skewed.
• furnished: Representing whether house is furnished or not. It's a Categorical Variable
• total_area: Total area of the house ranges from 1,423 to 1,652,659. As Mean is almost
double the Median, it's Highly Right-Skewed.

From the above analysis we got to know that most columns' distributions are Right-Skewed and
only a few features are Left-Skewed (like room_bath, yr_built, lat).

3) Exploratory Data Analysis


This section carries out a deeper level of data cleaning for the dataset. It covers the
univariate analysis, bivariate analysis, removal of unwanted variables, missing-value
treatment (already done in the previous section), outlier treatment, variable
transformation and the addition of any new variables. This is essential because we cannot
work on unclean data; the Exploratory Data Analysis cleans the data to make it ready for
processing. Unclean data, filled with missing values, outliers and unwanted variables, can
make the analysis erroneous and the outcome misleading.

Removal of unwanted variables


There can also be miscellaneous columns like ID that we can drop from the analysis,
as it is a mere identifier and does not contribute to our analysis.

Missing value treatment

Out of the total 21613 entries, the maximum missing (null) count is 166, in
living_measure15. Next, we observe that the columns with a high number of missing values are
the number of bedrooms and bathrooms. The rest of the columns have substantially fewer missing
values; for example lot_measure15, furnished and total_area have only 29 nulls each, and sight
and condition have just 57. An interesting observation is that 166, the highest null count, is
well under 1 percent of the total 21613 records. This implies that at most about 1 percent of
the data is missing or null, and it needs to be treated to get more accurate results.

Bad data and missing data have been treated: bad data was replaced with NaN values, and the
null values were then treated with the SimpleImputer using the mode method (a sketch follows).
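A hedged sketch of this imputation step; the report's mode-based treatment is shown via SimpleImputer's most_frequent strategy, applied across all columns as a simplifying assumption:

```python
# Impute remaining NaNs with the mode (most frequent value) per column,
# matching the mode-based treatment described above.
import pandas as pd
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="most_frequent")
df = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```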

Duplicates

As per the code, we see there are no duplicate rows (a sketch of the check follows).
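A minimal sketch of that duplicate check:

```python
# Count fully identical rows; the report finds none.
print(df.duplicated().sum())  # expected: 0
```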

Univariate analysis (distribution and spread for every continuous attribute, distribution of
data in categories for categorical ones)

This is the simplest form of data analysis, where the data being analyzed consists of just one
variable. Since it is a single variable, it does not deal with causes or relationships. The main
purpose of univariate analysis is to describe the data and find the patterns that exist within
it (a plotting sketch follows).
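A hedged sketch of the typical univariate plots used here, assuming seaborn; one continuous and one categorical attribute are shown as examples:

```python
# Univariate views: histogram for a continuous column,
# count plot for a categorical one.
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.histplot(df["price"], kde=True, ax=axes[0])    # distribution and spread
sns.countplot(x="furnished", data=df, ax=axes[1])  # category counts
plt.show()
```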

• Very few houses are renovated: only 914 out of the total 21613 records.
• Houses with sight 0 are by far the most common, followed by houses with sight 2; houses with
sight 1 or 4 are very few.
• Most of the houses in the dataset have between 0 and 5 bedrooms.
• More houses were built from the year 2000 onwards; from 1900 to 1950 comparatively fewer
houses were constructed.
• Unfurnished houses dominate the dataset: about 17500 houses are unfurnished and only about
4000 are furnished.
• Most of the houses in the dataset are non-coastal; a negligible number are near the coast.

Bivariate analysis (relationship between different variables, correlations)


To plot multiple pairwise bivariate distributions in a dataset, we can use the pairplot()
function. It shows the relationship for every pairwise combination of variables in a DataFrame
as a matrix of plots, while the diagonal plots are the univariate plots (a sketch follows).
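A hedged sketch of the pair plot; plotting all 23 columns at once is slow, so a representative subset of features (an assumption for illustration) is shown:

```python
# Pairwise bivariate plots for a subset of features against price.
import seaborn as sns

sns.pairplot(df[["price", "room_bath", "living_measure", "quality"]])
```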

From the above pair plot, we observed/deduced the following:

• room_bed: The plot of our target variable (price) against room_bed is not linear; its
distribution shows several Gaussian-like modes.
• room_bath: Its plot against price shows a somewhat linear relationship. The distribution has
a number of Gaussian-like modes.
• living_measure: The plot against price shows a strong linear relationship. It also has a
linear relationship with room_bath, so we might remove one of these two. Its distribution is
Right-Skewed.
• lot_measure: No clear relationship with price.
• ceil: No clear relationship with price. We can see it has only 6 unique values;
therefore, we can convert this column into a categorical column.
• coast: No clear relationship with price. Clearly it's categorical variable with 2 unique
values.
• sight: No clear relationship with price. This has 5 unique values. Can be converted to
Categorical variable.
• condition: No clear relationship with price. This has 5 unique values. Can be converted
to Categorical variable.
• quality: Somewhat linear relationship with price. Has discrete values from 1 - 13. Can
be converted to Categorical variable.
• ceil_measure: Strong linear relationship with price. Also with room_bath and
living_measure features. Distribution is Right-Skewed.
• basement: No clear relationship with price.
• yr_built: No clear relationship with price.
• yr_renovated: No clear relationship with price. It effectively has 2 states, so it can be
converted into a Categorical Variable which tells whether the house has been renovated or not.
• zipcode, lat, long: No clear relationship with price or any other feature.
• living_measure15: Somewhat linear relationship with the target feature. It is essentially the
same as living_measure, therefore we can drop this variable.
• lot_measure15: No clear relationship with price or any other feature.
• furnished: No clear relationship with price or any other feature. 2 unique values so can
be converted to Categorical Variable
• total_area: No clear relationship with price, but it has a very strong linear relationship
with lot_measure, so one of the two can be dropped.
• A linear relation exists between lot_measure and total_area, and there is also
some linear relation between ceil_measure and living_measure.

Analysing Bivariate

for Feature: room_bed

There is a clear increasing trend in price with room_bed: price increases with the
number of bedrooms.

for Feature: room_bath

There is an upward trend in price with room_bath: price increases with the
number of bathrooms.

for Feature: living_measure

There is a clear increase in the price of the property with an increase in living measure, but
there seems to be one outlier to this trend, which needs to be evaluated.

Feature: lot_measure

There does not seem to be any clear relation between lot_measure and the price trend.

For lot_measure <25000

Almost 95% of the houses have a lot_measure below 25000, but there is no clear trend between
lot_measure and price.

For lot_measure >100000

Even for these larger lots, price increases with an increase in living measure.

Feature: ceil

There is a slight upward trend in price with ceil, which then falls off at the higher levels.

Feature: coast

House properties with a waterfront view tend to have higher prices compared to non-
waterfront properties.

Feature: sight

Properties with a higher price tend to have higher sight (view) values compared to houses with
a lower price.

Sight - Viewed in relation with price and living_measure

The above graph also confirms that properties with a higher price tend to have higher sight
values compared to houses with a lower price.

Feature: condition

The price of the house increases with the condition rating of the house.

Condition - Viewed in relation with price and living_measure

Most houses are rated as 3 or more.

We find that smaller houses tend to be in better condition, and better-condition houses
command higher prices.

Feature: quality

As the quality grade increases, price and living_measure increase (in mean and median).
There is a clear increase in the price of a house with a higher quality rating.

quality - Viewed in relation with price and living_measure.

Most houses are graded as 6 or more.


We can see some outliers as well.
There is a clear increase in the price of a house with a higher quality rating.

Feature: ceil_measure

There is an upward trend in price with ceil_measure.

Feature: basement

We will create a categorical variable 'has_basement' to separate houses with a basement from
those without one. This categorical variable will be used for further analysis (a sketch
follows).
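A minimal sketch of this binning, assuming the basement square-footage column is named basement as in the report's tables:

```python
# Flag houses that have any basement area at all.
import numpy as np

df["has_basement"] = np.where(df["basement"] > 0, "Yes", "No")
```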

basement - after binning, the data shows that houses with a basement are costlier and have a
higher living measure (mean and median).

Houses with a basement fetch a better price compared to houses without a basement.

Houses with a basement have a higher price and living measure.

Feature: yr_built

As per the graph, most of the houses were built around the 2000s and the fewest around the
1900s. There is a slight increasing trend: as the year increases, the number of houses built
increases.

Feature: yr_renovated

Most renovated houses were renovated after the 1980s. We will create a new categorical
variable 'has_renovated' to categorize a property as renovated or non-renovated; for further
analysis we will use this categorical variable (a sketch follows).
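An analogous sketch for the renovation flag, assuming yr_renovated is 0 when the house was never renovated:

```python
# Flag houses that have ever been renovated.
import numpy as np

df["has_renovated"] = np.where(df["yr_renovated"] > 0, "Yes", "No")
```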

Renovated properties have a higher price than others with the same living measure.

Feature: furnished

Furnished houses have a higher price than non-furnished houses.

Analyzing Feature: Zipcode, Lat, Long

The above figure aims at understanding how the combination of latitude, longitude and zip code
affects the price; this is essentially a geographical study of price variation. We see that the
highest prices are concentrated in the center of the map while the lowest occur among the
coastline houses; this is probably related to most of our houses not having a coast view.
The highest prices occur around longitude -122.2 to -122.3 and latitude 47.5 to 47.6, a
location with a high number of houses.
The lowest prices are around longitude -121.906 and latitude 47.26.

Correlation

From the above matrix, we see linear relationships in the features below:

1. price: room_bath, living_measure, quality, living_measure15, furnished
2. living_measure: price, room_bath. So we can consider dropping the 'room_bath' variable.
3. quality: price, room_bath, living_measure
4. ceil_measure: price, room_bath, living_measure, quality
5. living_measure15: price, living_measure, quality.
6. lot_measure15: lot_measure.
7. furnished: quality
8. total_area: lot_measure, lot_measure15.

We can plot a heatmap to easily confirm the above findings (a sketch follows).
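A hedged sketch of how such a correlation heatmap can be produced with seaborn:

```python
# Correlation heatmap over the numeric columns to confirm the
# pairwise relationships listed above.
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(14, 10))
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
```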

Heat Map

Outlier

We see that all the variables have outliers; the black dots lying at a distance from the
whiskers mark the outliers. We need to treat and remove them. Using the outlier-removal code
in Python, we removed all the outliers (a sketch follows).
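A hedged sketch of an IQR-based outlier treatment consistent with the boxplots above; whether the report drops or caps the outlying values is not stated, so capping at the whisker limits is shown as one common variant, and the column list is an assumption:

```python
# Cap values beyond 1.5 * IQR at the whisker limits.
def treat_outliers(series):
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

for col in ["price", "living_measure", "lot_measure", "total_area"]:
    df[col] = treat_outliers(df[col])
```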

Outlier Treatment

We now have the boxplots of the cleaned data with no outliers. There are no longer any distant
black dots in the boxplots, which confirms that there are no outliers left in the data; all of
them have been successfully removed.

Multicollinearity

The dataset has only 22 features and 1 target column. Since the number of features is only 22,
which is not high, we should be fine using all the features given in the dataset. Out of the
22 features, we can clearly see that cid and dayhours are just audit columns and do not add
any value to the model in terms of prediction. Based on the correlation heat map/VIF above and
the summary listed, we identify the features below, with a correlation of less than 0.25 with
the target variable 'price', as potential columns to be excluded from the data. However, we
will experiment with different feature sets in the models and finalize the feature list.

cid, dayhours, zipcode, lot_measure, lot_measure15, long, condition, yr_built, yr_renovated,
total_area.
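A hedged sketch of a VIF check to accompany the heatmap; the feature subset is an assumption for illustration:

```python
# Variance inflation factor per feature; very high values flag
# multicollinearity candidates for removal.
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = df[["room_bath", "living_measure", "quality", "ceil_measure"]].dropna()
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)
```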

Scaling
Scaling is necessary for unscaled data because the variables are on different scales; variables
with larger ranges, such as price and the square-footage measures, may otherwise get more
weight. Scaling brings all the values into a relatively similar range. Below is a snapshot of
the scaled data (a sketch follows).
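A minimal sketch of the z-score scaling described above, using scikit-learn's StandardScaler:

```python
# Z-score scaling: each numeric column is centred at 0 with unit variance.
import pandas as pd
from sklearn.preprocessing import StandardScaler

num_cols = df.select_dtypes(include="number").columns
df_scaled = pd.DataFrame(
    StandardScaler().fit_transform(df[num_cols]), columns=num_cols
)
```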

Unscaled data

Scaled data using the z-score method

Before Scaling After Scaling

We can see from the above figures that there is no major difference in the shape of the
distributions before and after scaling; z-score scaling shifts and rescales the values but
preserves the shape.

Transformation
Log transformation compresses the range of large values and stretches out small ones, which
reduces right skew. In situations where the data is highly skewed and the algorithm we plan to
use for prediction has a prerequisite that the data be normally distributed, the transformation
below can be applied (a sketch follows).
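A hedged sketch of the log transformation applied to the right-skewed columns; log1p is used so that zero values (e.g. basement area) stay defined, and the column list is an assumption:

```python
# Log-transform the heavily right-skewed measures.
import numpy as np

for col in ["price", "living_measure", "lot_measure", "total_area"]:
    df[col] = np.log1p(df[col])
```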

Price

Graph after log transformation


Living measure

Graph after log transformation

Lot measures

Graph after log transformation

Quality

Graph after log transformation

Ceil measure

Graph after log transformation

Year built

Graph after log transformation

Living measure15

Graph after log transformation


Lot measure15

Graph after log transformation

Total area

Graph after log transformation

As per the above graphs of the various numeric fields, the distributions are more spread out,
and in the boxplots the outliers now sit much closer to the box because of the scaling. With
the transformation, the distributions have visibly changed and we achieve better skewness and
kurtosis.

Label Encoding

Encoding deals with the categorical features: each category is encoded with a numerical value
(a sketch follows).
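A minimal sketch of the label encoding, assuming the categorical flags created earlier:

```python
# Map each category to an integer code.
from sklearn.preprocessing import LabelEncoder

for col in ["has_basement", "has_renovated"]:
    df[col] = LabelEncoder().fit_transform(df[col])
```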

4) Business Insights from EDA

This section presents business insights from the analysis done above, taking into account all
the pointers that have been coded, discussed and elaborated above.

a) I) Not applicable, as we are going to develop a regression model, not a classification
model.
II) The value counts of the independent variables do seem to be unbalanced due to
the outliers. The IQR method is used for treating the outliers, since outliers can degrade
the quality of the data and result in overestimation or underestimation.

b) We used K-Means clustering to obtain the WSS plot, and it is clearly noticeable
that the elbow is at k = 3. Hence the optimal number of clusters is 3.

After running the silhouette score analysis, we arrive at the K-Means clusters; the figure
shows the clustering using K-Means and the head of the data (a sketch follows).
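A hedged sketch of the elbow (WSS) and silhouette analysis described above, run on the scaled numeric frame df_scaled from the scaling step:

```python
# WSS (inertia) and silhouette score for a range of cluster counts;
# the elbow sits at k = 3 per the report.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(df_scaled)
    print(k, km.inertia_, silhouette_score(df_scaled, km.labels_))
```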

The above figure gives us the train and test dataset split.

c) I) The missing data in the dataset has already been imputed based on the data type.
II) Living measure is the most significant variable in our analysis, and since living
measure, lot measure and ceil measure are proportional, we need not spend a lot
of time analyzing all of them; analysis based on living measure alone provides
much greater insight.
III) It is evident from the EDA that an ideal house has 2-3 bedrooms and up to 3
bathrooms. Even though houses with 8 or more bedrooms and bathrooms have sold for
higher prices, few people seem to buy them; a much higher number of sales involve
three-bedroom houses, hence equal or even more revenue could be obtained by selling
more houses with three bedrooms and bathrooms.
IV) Although the majority of houses are not furnished, the bivariate analysis shows that
furnished houses produce more revenue compared to unfurnished ones.
V) From the above analysis, we can conclude that high-quality houses command the
highest prices.
VI) Combined, these features can help estimate the house price.
