1 Problem statement and agenda
2 Summary of the data
3 Info of the data
4 Top 5 rows
5 Describing the data
6 Univariate and Bivariate
7 Checking and treating
8 Pair plot and heatmap
9 Histogram
10 Encoding and Scaling
11 Business insights


• A house's value depends on more than location and square footage. Just as many features make up a
person, an informed buyer or seller wants to know every aspect that gives a house its value. For example,
suppose you want to sell a house but do not know what price to expect; it can be neither too low nor too high.
To find a price, you usually look for similar properties in your neighborhood and assess your house based on
the data you gather.

• Take advantage of all the feature variables available and use them to analyze and predict house prices.

• Your job is to use your data science skills to provide suitable insights about the data and to help
predict the price of a house.


• House prices increase every year, so there is a need for a system that predicts house prices in the
future.

• House price prediction can also help a developer determine the selling price of a house, and can
help a customer choose the right time to purchase one.


• We will be able to go through the data in detail and build a thorough understanding of it.

• We will identify the important factors that need to be included to get better results.


• Shape of dataset – Number of rows – 21613; Number of columns – 23

• Null values: several columns contain NA values. Because the missing values are few compared with the
size of the dataset, we drop the affected rows.

• After dropping them, none of the 23 columns contains null values.

Column             Nulls before   Nulls after
dayhours                      0             0
price                         0             0
room_bed                    108             0
room_bath                   108             0
living_measure               17             0
lot_measure                  42             0
ceil                         42             0
coast                         1             0
lat                           0             0
sight                        57             0
condition                    57             0
quality                       1             0
ceil_measure                  1             0
basement                      1             0
yr_built                      1             0
yr_renovated                  0             0
zipcode                       0             0
long                          0             0
living_measure15            166             0
lot_measure15                29             0
furnished                    29             0
total_area                   29             0

• Duplicate values: there are no duplicate rows in the data.

Number of duplicate rows = 0
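The null-value and duplicate checks described above can be sketched with pandas as follows. This is a minimal illustration on a tiny stand-in frame, not the report's actual code; the frame contents and the variable names (`df_clean`, `nulls_before`, and so on) are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame (assumption); the real dataset has 21613 rows and 23 columns.
df = pd.DataFrame({
    "price": [600000, 190000, 735000, 257000, 450000],
    "room_bed": [4, 2, np.nan, 3, 2],
    "living_measure": [3050, 670, 3040, np.nan, 1120],
})

print("Shape:", df.shape)

# Count missing values per column before any treatment.
nulls_before = df.isnull().sum()

# Drop every row that has at least one missing value.
df_clean = df.dropna().reset_index(drop=True)
nulls_after = df_clean.isnull().sum()

# Count fully duplicated rows in the cleaned frame.
n_duplicates = int(df_clean.duplicated().sum())

print(pd.DataFrame({"before": nulls_before, "after": nulls_after}))
print(f"Number of duplicate rows = {n_duplicates}")
```

Dropping rows is reasonable only while the missing share stays small; with many missing values, imputation would usually be considered instead.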


• Missing values: the non-null counts below show that several columns have missing entries.

• Data types: there are 12 columns of float data type, 4 columns of integer data type and 7
columns of object data type.

 #   Column            Non-Null Count  Dtype
 0   cid               21613 non-null  int64
 1   dayhours          21613 non-null  object
 2   price             21613 non-null  int64
 3   room_bed          21505 non-null  float64
 4   room_bath         21505 non-null  float64
 5   living_measure    21596 non-null  float64
 6   lot_measure       21571 non-null  float64
 7   ceil              21613 non-null  object
 8   coast             21613 non-null  object
 9   sight             21556 non-null  float64
 10  condition         21613 non-null  object
 11  quality           21612 non-null  float64
 12  ceil_measure      21612 non-null  float64
 13  basement          21612 non-null  float64
 14  yr_built          21613 non-null  object
 15  yr_renovated      21613 non-null  int64
 16  zipcode           21613 non-null  int64
 17  lat               21613 non-null  float64
 18  long              21613 non-null  object
 19  living_measure15  21447 non-null  float64
 20  lot_measure15     21584 non-null  float64
 21  furnished         21584 non-null  float64
 22  total_area        21613 non-null  object
dtypes: float64(12), int64(4), object(7)

Top 5 rows: the top 5 observations of the dataset are displayed below to give some idea of the different
features and their values.

   cid         dayhours         price   room_bed  room_bath  living_measure  lot_measure  ceil  coast  sight  ...
0  3876100940  20150427T000000  600000  4         1.75       3050            9440         1     1      0
1  3145600250  20150317T000000  190000  2         1          670             3101         1     1      0
2  7129303070  20140820T000000  735000  4         2.75       3040            2415         3     1      0
3  7338220280  20141010T000000  257000  3         2.5        1740            3721         3     1      0
4  7950300670  20150218T000000  450000  2         1          1120            4590         1     1      0

   basement  yr_built  yr_renovated  zipcode  lat      living_measure15  lot_measure15  furnished  total_area
0  1250      1966      0             98034    47.7228  2020              8660           0          12490
1  0         1948      0             98118    47.5546  1660              4100           0          3771
2  0         1966      0             98118    47.5188  2620              2433           0          5455
3  0         2009      0             98002    47.3363  2030              3794           0          5461
4  0         1924      0             98118    47.5663  1120              5100           0          571


                    count          mean           std           min           25%           50%           75%           max
cid               21387.0  4.577761e+09  2.877364e+09  1.000102e+06  2.122054e+09  3.904920e+09  7.307250e+09  9.900000e+09
price             21387.0  5.403520e+05  3.681089e+05  7.500000e+04  3.210000e+05  4.500000e+05  6.450000e+05  7.700000e+06
room_bed          21387.0  3.370880e+00  9.304884e-01  0.000000e+00  3.000000e+00  3.000000e+00  4.000000e+00  3.300000e+01
room_bath         21387.0  2.114941e+00  7.698064e-01  0.000000e+00  1.750000e+00  2.250000e+00  2.500000e+00  8.000000e+00
living_measure    21387.0  2.080473e+03  9.189430e+02  2.900000e+02  1.430000e+03  1.910000e+03  2.550000e+03  1.354000e+04
lot_measure       21387.0  1.511142e+04  4.144908e+04  5.200000e+02  5.040000e+03  7.620000e+03  1.068750e+04  1.651359e+06
sight             21387.0  2.348623e-01  7.672480e-01  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  4.000000e+00
quality           21387.0  7.657923e+00  1.176458e+00  1.000000e+00  7.000000e+00  7.000000e+00  8.000000e+00  1.300000e+01
ceil_measure      21387.0  1.789026e+03  8.285817e+02  2.900000e+02  1.190000e+03  1.560000e+03  2.210000e+03  9.410000e+03
basement          21387.0  2.914476e+02  4.426845e+02  0.000000e+00  0.000000e+00  0.000000e+00  5.600000e+02  4.820000e+03
yr_renovated      21387.0  8.389204e+01  4.005111e+02  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  2.015000e+03
zipcode           21387.0  9.807789e+04  5.349811e+01  9.800100e+04  9.803300e+04  9.806500e+04  9.811700e+04  9.819900e+04
lat               21387.0  4.756000e+01  1.385830e-01  4.715590e+01  4.747065e+01  4.757170e+01  4.767800e+01  4.777760e+01
living_measure15  21387.0  1.987044e+03  6.857636e+02  3.990000e+02  1.490000e+03  1.840000e+03  2.360000e+03  6.210000e+03
lot_measure15     21387.0  1.276213e+04  2.724116e+04  6.510000e+02  5.100000e+03  7.620000e+03  1.008500e+04  8.712000e+05
furnished         21387.0  1.969421e-01  3.976975e-01  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  1.000000e+00

• The describe method shows how the data is spread, for the numerical as well as the categorical
columns. We can clearly see the minimum, mean, percentile and maximum values.
• From the table above, price ranges from 7.5e4 (75,000) to 7.7e6 (7,700,000), with a standard
deviation of about 3.68e5.
• The minimum is 0 for furnished, basement, room_bed and room_bath. Where a zero is not meaningful
(for example room_bed or room_bath), it needs to be handled during data pre-processing.
• For some attributes (for example room_bed, quality and lat) the mean and median are close, which
suggests a roughly symmetric distribution; for others, such as price and lot_measure, the mean is well
above the median, which indicates right skew.
• The large gaps between the 75th percentile and the maximum (for example lot_measure: 1.07e4 versus
1.65e6) indicate that outliers are present in the data.
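The mean-versus-median comparison can be made concrete with a short calculation. The sketch below copies the mean and 50% values of four attributes from the summary table above; the relative-gap measure and the 0.15 threshold are illustrative assumptions, not part of the original analysis. On the full data, a table like the one above is typically produced with `df.describe().T`.

```python
import pandas as pd

# Mean and median (50%) values copied from the summary table above.
summary = pd.DataFrame(
    {
        "mean": [5.403520e05, 3.370880e00, 1.511142e04, 7.657923e00],
        "median": [4.500000e05, 3.000000e00, 7.620000e03, 7.000000e00],
    },
    index=["price", "room_bed", "lot_measure", "quality"],
)

# Relative gap between mean and median; a large positive gap hints at right skew.
summary["rel_gap"] = (summary["mean"] - summary["median"]) / summary["median"]

# The 0.15 cut-off is an arbitrary illustrative choice.
summary["right_skew_hint"] = summary["rel_gap"] > 0.15

print(summary)
```

With these values, price and lot_measure are flagged while room_bed and quality are not, which matches a reading of the table by eye.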


            count  unique  top              freq
dayhours    21387  372     20140623T000000  141
ceil        21387  7       1                10547
coast       21387  3       0                21197
condition   21387  5       3                13881
yr_built    21387  116     2014             554
long        21387  753     -122.29          115
total_area  21387  11094   $                39


[Figures: distribution plots of room_bed, price, room_bath, living_measure, lot_measure, living_measure15,
lat, ceil_measure, basement, zipcode and furnished]

• There are outliers present in the living_measure15 variable.

• Only a few peaks (two) are found in the distribution of lot_measure15.

• There are no outliers present in the zipcode variable.

• Multiple peaks are found in the distribution of lat.

• There are outliers present in the basement variable.

• A single peak is found in the distribution of yr_renovated.

• There are no outliers present in the cid variable.

• Multiple peaks are found in the distribution of cid.

• There are outliers present in the price variable.

• Multiple peaks are found in the distribution of dayhours.

• There are no outliers present in the yr_built variable.

• Multiple peaks are found in the distribution of sight.

• Multiple peaks are found in the distribution of long, which also appears to be left-skewed.

We can see that most columns contain outliers; the exceptions noted above are zipcode, cid and yr_built.






As most of the columns contain outliers, we need to treat them.

• To remove the outliers, we used the interquartile range (IQR) method on the dataset.
• After this treatment, the data no longer contains outliers.
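A minimal sketch of IQR-based outlier treatment is shown below. The report does not state the multiplier or whether outlying rows were dropped or capped, so the conventional 1.5 x IQR fences and the helper names `iqr_bounds` and `drop_iqr_outliers` are assumptions; the sample frame is illustrative only.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_iqr_outliers(df, columns, k=1.5):
    """Keep only rows whose values lie inside the fences for every listed column."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        low, high = iqr_bounds(df[col], k)
        mask &= df[col].between(low, high)
    return df[mask]


# Illustrative frame (assumption): the last row is extreme in both columns.
df = pd.DataFrame({
    "price": [321000, 450000, 645000, 500000, 400000, 7700000],
    "room_bed": [3, 3, 4, 3, 2, 33],
})

# Option 1: drop rows that fall outside the fences.
cleaned = drop_iqr_outliers(df, ["price", "room_bed"])

# Option 2: cap values at the fences instead of dropping rows.
capped_price = df["price"].clip(*iqr_bounds(df["price"]))

print(cleaned)
print(capped_price)
```

Dropping shrinks the dataset, while capping keeps the row count unchanged; which one suits the analysis depends on how many rows are affected.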

In the pair plot above, scatter diagrams are plotted for all the numerical columns in the dataset. A scatter
plot gives a visual impression of the degree of correlation between two columns. The pairplot
function in seaborn makes it easy to generate joint scatter plots for all the columns in the data.

From the pair plot and correlation matrix above, we find that the following pairs are highly correlated:

• living_measure & ceil_measure
• living_measure & living_measure15
• living_measure & quality
• living_measure & room_bath
• price & living_measure
• lot_measure & lot_measure15
• price & lot_measure
• room_bed & living_measure
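Highly correlated pairs like those listed above can also be extracted programmatically from the correlation matrix. The sketch below is run on synthetic data, not the housing dataset; the helper name `high_corr_pairs` and the 0.7 threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df, threshold=0.7):
    """List column pairs whose absolute Pearson correlation is at least `threshold`."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skip the diagonal
            r = float(corr.iloc[i, j])
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(r, 3)))
    # Strongest correlations first.
    return sorted(pairs, key=lambda p: -abs(p[2]))


# Synthetic data (assumption): ceil_measure is built to track living_measure,
# while lat is generated independently.
rng = np.random.default_rng(0)
living = rng.normal(2000, 900, size=500)
df = pd.DataFrame({
    "living_measure": living,
    "ceil_measure": 0.85 * living + rng.normal(0, 150, size=500),
    "lat": rng.normal(47.5, 0.14, size=500),
})

pairs = high_corr_pairs(df)
print(pairs)
```

The same correlation matrix (`df.corr()`) is what a heatmap visualizes, so the list and the plot should agree.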


Distribution of ceil

Total = 21387
Number of Groups = 7

• Most houses have a single floor (level).

• About 10547 houses have only one floor.

Distribution of coast

Total = 21387
Number of Groups = 3

• Very few houses have a waterfront view: the other two groups hold only about 30 and 160 houses.

• Most of the houses, about 21197, do not have a waterfront view.

Distribution of condition

Total = 21387
Number of Groups = 5

• Most of the houses, about 13881, are rated 3 for their condition.
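The totals and group counts quoted for each categorical column can be obtained with `value_counts`. A minimal sketch on a made-up `condition` column (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up ratings for illustration only.
condition = pd.Series([3, 3, 4, 3, 5, 3, 4, 2], name="condition")

# Frequency of each category, most frequent first.
counts = condition.value_counts()

print(f"Total = {counts.sum()}")
print(f"Number of Groups = {counts.size}")
print(counts)

# The most frequent category and how often it occurs.
top_category = counts.idxmax()
top_frequency = int(counts.max())
print(f"Most common rating: {top_category} ({top_frequency} houses)")
```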

The top 5 records of the dataset after converting the categories are displayed below.

   cid       dayhours  price   room_bed  room_bath  living_measure  lot_measure  ceil  coast  sight  ...
0  3.88E+09  351       600000  4         1.75       3050            9440         1     1      0
1  3.15E+09  310       190000  2         1          670             3101         1     1      0
2  7.13E+09  110       735000  4         2.75       3040            2415         3     1      0
3  7.34E+09  161       257000  3         2.5        1740            3721         3     1      0
4  7.95E+09  283       450000  2         1          1120            4590         1     1      0

   basement  yr_built  yr_renovated  zipcode  lat     long  living_measure15  lot_measure15  furnished  total_area
0  1250      67        0             98034    47.723  455   2020              8660           0          1823
1  0         49        0             98118    47.555  546   1660              4100           0          6392
2  0         67        0             98118    47.519  528   2620              2433           0          7734
3  0         110       0             98002    47.336  485   2030              3794           0          7739
4  0         25        0             98118    47.566  557   1120              5100           0          7914
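The report does not say how the categories were converted. One plausible approach, sketched here as an assumption, is to cast each object column to a pandas categorical and keep its integer codes, which assigns codes in sorted order of the category labels. The sample rows and the column list are illustrative.

```python
import pandas as pd

# Small sample with two text-typed columns (illustrative values).
df = pd.DataFrame({
    "dayhours": ["20150427T000000", "20150317T000000", "20140820T000000"],
    "yr_built": ["1966", "1948", "1966"],
    "price": [600000, 190000, 735000],
})

# Columns to encode; listed explicitly here rather than detected by dtype.
categorical_cols = ["dayhours", "yr_built"]

encoded = df.copy()
for col in categorical_cols:
    # Categories are sorted, so the smallest label receives code 0.
    encoded[col] = encoded[col].astype("category").cat.codes

print(encoded)
print(encoded.dtypes)
```

These codes impose an ordering on the labels, which is sensible for dates and years but can mislead a model when the categories have no natural order.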

Scaling is necessary here because the attributes in the dataset are measured on very different scales. Any
method that relies on a distance measure is dominated by the attributes with the largest numeric values, so
values expressed in higher units outweigh those expressed in lower units and distort the overall analysis.
Scaling normalizes the range of the independent variables so that all attributes are on the same scale
before moving further. It is part of the data preparation step of the overall exploratory data analysis (EDA).

The scaled data (and the box plot after scaling) obtained with StandardScaler:

   cid        dayhours  price    room_bed  room_bath  living_measure  lot_measure  ceil      coast  sight  basement
0  -0.24386   1.5863    0.3525   0.74607   -0.48313   1.181325        0.146        -0.91159  0      0      2.315272
1  -0.49775   1.2119    -1.2846  -1.5956   -1.52214   -1.65428        -1.11        -0.91159  0      0      -0.68073
2  0.886784   -0.6139   0.8916   0.74607   0.90221    1.16941         -1.25        0.936214  0      0      -0.68073
3  0.959393   -0.1483   -1.0171  -0.4248   0.555875   -0.37945        -0.99        0.936214  0      0      -0.68073
4  1.172121   0.9654    -0.2464  -1.5956   -1.52214   -1.11814        -0.82        -0.91159  0      0      -0.68073

   yr_built  yr_renovated  zipcode   lat       long      living_measure15  lot_measure15  furnished  total_area
0  -0.17138  0             -0.82046  1.174785  -0.22647  0.068362          0.082345       0          1823
1  -0.78418  0             0.749731  -0.03896  0.434679  -0.48601          -0.96223       0          6392
2  -0.17138  0             0.749731  -0.2973   0.303903  0.992317          -1.34409       0          7734
3  1.29253   0             -1.41862  -1.61423  -0.00851  0.083761          -1.03232       0          7739
4  -1.60124  0             0.749731  0.045467  0.514598  -1.31757          -0.73315       0          7914
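The z-score scaling described above, (x - mean) / standard deviation, can be sketched directly in pandas. This mirrors the formula rather than calling scikit-learn; the population standard deviation (`ddof=0`) is used because that is what StandardScaler computes. The helper name `standardize` and the sample frame are assumptions, so the printed numbers will not match the table above.

```python
import pandas as pd


def standardize(df):
    """Return z-scores per column: (x - mean) / population standard deviation."""
    mean = df.mean()
    std = df.std(ddof=0)
    # A constant column has zero spread; divide by 1 so it becomes all zeros
    # instead of NaN.
    std = std.replace(0, 1)
    return (df - mean) / std


# Illustrative frame: columns on very different scales plus one constant column.
df = pd.DataFrame({
    "price": [600000, 190000, 735000, 257000, 450000],
    "room_bed": [4, 2, 4, 3, 2],
    "coast": [0, 0, 0, 0, 0],
})

scaled = standardize(df)
print(scaled.round(4))
```

After scaling, every non-constant column has mean 0 and standard deviation 1, so no attribute dominates a distance calculation simply because of its units.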

Business insights:

• The attributes are on very different scales, so we use StandardScaler, which scales the data
and returns the z-score of every attribute.

• Scaling and encoding also help in making more accurate predictions.

• This method converts variables with different scales of measurement onto a single scale.

• StandardScaler normalizes the data using the formula (x - mean) / standard deviation.

• Training a model, testing it and then making predictions with it helps in getting accurate output.

• The dataset has a significant number of outliers, which were removed during the data preprocessing step.

• Among the attributes in the dataset, price is one of the important features, as seen from the
feature importance parameters in the models.

• The best attributes which can help the business in predictions can be-







• Linear regression can help in achieving the remaining objective, as a linear regression model
predicts real estate values from the given data.

• It can also help the business or investors understand the trend of housing prices in a
particular location.

• Also, the most important factor is location because it helps in determining the prevailing land
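As a rough illustration of the train, test and predict workflow mentioned above, the sketch below fits an ordinary least squares linear regression with NumPy on synthetic data. Everything here is an assumption made for the example: the two features, the coefficients used to generate the target, the 80/20 split and the variable names. It is not the model built in this report. scikit-learn's LinearRegression fits the same kind of model.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Synthetic features (assumption): living area and a quality grade.
living = rng.uniform(500, 4000, size=n)
quality = rng.integers(5, 11, size=n).astype(float)

# Synthetic target built from made-up coefficients plus noise.
price = 50000 + 200 * living + 30000 * quality + rng.normal(0, 20000, size=n)

# Design matrix with an intercept column.
X = np.column_stack([np.ones(n), living, quality])

# Random 80/20 train/test split.
idx = rng.permutation(n)
train, test = idx[:160], idx[160:]

# Ordinary least squares fit on the training rows only.
coef, *_ = np.linalg.lstsq(X[train], price[train], rcond=None)

# Predict on the held-out rows and compute R^2.
pred = X[test] @ coef
ss_res = np.sum((price[test] - pred) ** 2)
ss_tot = np.sum((price[test] - price[test].mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("coefficients:", coef.round(2))
print("test R^2:", round(float(r2), 4))
```

Evaluating on rows the model has not seen is what makes the reported score a fair estimate of how it would behave on new houses.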

