
House Prices Prediction in King County

Jiajun Zhang

December 4, 2020

1 Introduction
The goal of this project is to build predictive machine learning models to predict house sale prices in King County, USA. The data can be found on Kaggle and downloaded through the following link. It includes information on houses sold between May 2014 and May 2015. The full description of the dataset, including the meanings of all the variables, is available here. For example, the variable view is an index from 0 to 4 of how good the view of the property is. Since this is a classical regression problem, there are many approaches we can use to achieve our goal. In this project, we will look at several of these methods and their respective performances. The source code for this project has been posted on RPubs and GitHub.

2 Data Preparation
The original data contains 21613 observations and 21 features. By looking at the structure of the dataset shown in Figure 1, we can easily remove some of the variables from our analysis since they only provide auxiliary information, such as id, date, etc. We also notice that some variables are supposed to be categorical but are stored as numeric types. For example, the variable waterfront indicates whether the house has a waterfront view, where 1 means yes and 0 otherwise, so we need to convert it into a categorical variable. We can perform the same transformation on other such variables as well. Finally, it is always a good idea to check whether the data contain any missing values or outliers.
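As a rough sketch, the cleaning described above might look like the following in R; the file name and the data frame name housing are assumptions for illustration, not the original code.

housing <- read.csv("kc_house_data.csv")   # assumed file name from Kaggle

# drop purely auxiliary columns
housing$id   <- NULL
housing$date <- NULL

# store the indicator / ordinal variables as factors
housing$waterfront <- factor(housing$waterfront)
housing$view       <- factor(housing$view)
housing$condition  <- factor(housing$condition)

# check for missing values and eyeball the response for outliers
colSums(is.na(housing))
summary(housing$price)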

Figure 1: Structure of Housing Dataset

Figure 2 shows the Pearson's correlations between variables. This is worth mentioning because, from the result, we can clearly see that some of the variables are highly correlated with each other, such as sqft_above and sqft_living. In general, we do not want features in a model to have duplicated effects, since that can easily cause overfitting. By definition, the variable sqft_above measures the square footage above ground, so it carries much the same information as sqft_living. I have also manually created a variable age, indicating the number of years since the house was built or last renovated. Finally, I decided to use only the explanatory variables bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, view, condition, age, lat, and long.
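A sketch of the feature engineering and correlation check is given below; using 2015 as the reference year for age is my assumption, since the data ends in May 2015.

# age: years since the house was built or last renovated
# (yr_renovated is 0 when a house was never renovated, so pmax falls back to yr_built)
housing$age <- 2015 - pmax(housing$yr_built, housing$yr_renovated)

# Pearson's correlation matrix over the numeric columns (as in Figure 2)
num_vars <- sapply(housing, is.numeric)
round(cor(housing[, num_vars], method = "pearson"), 2)

# keep only the selected explanatory variables plus the response
keep <- c("price", "bedrooms", "bathrooms", "sqft_living", "sqft_lot", "floors",
          "waterfront", "view", "condition", "age", "lat", "long")
housing <- housing[, keep]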

Figure 2: Pearson’s Correlation

Data Visualization

After cleaning our data, we can generate some graphs to visualize patterns between the response and the predictors. Ideally, we expect an increasing relationship between price and sqft_living, because a larger living area generally implies a more expensive house. Similarly, we expect the number of bathrooms and the price of the house to be positively correlated. From Figure 3 and Figure 4, we can see patterns that are consistent with these assumptions. There are certainly many other findings we could demonstrate just by exploring the relationships graphically; to keep this report simple, I will not list them all here.
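The plots in Figures 3 and 4 can be produced with something like the following ggplot2 code; the exact plot types used in the report are a guess on my part.

library(ggplot2)

# price against living area, with a linear trend line
ggplot(housing, aes(x = sqft_living, y = price)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm") +
  labs(title = "Price vs. Square Feet Living")

# price against the number of bathrooms
ggplot(housing, aes(x = factor(bathrooms), y = price)) +
  geom_boxplot() +
  labs(x = "bathrooms", title = "Price vs. Number of Bathrooms")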

Figure 3: Relationship Between Price and Square Feet Living

Figure 4: Relationship between Price and Number of Bathrooms

Train-Test Split

Before building our statistical models, we first need to split our data into a training set and a validation set. In this project, I used 80% of the data for training, which leaves about 4300 observations for testing. This step is important because we want to evaluate our models on a local test set before actually making predictions.
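A minimal sketch of the split; the seed and object names are arbitrary choices on my part.

set.seed(123)
train_idx <- sample(nrow(housing), size = floor(0.8 * nrow(housing)))
train <- housing[train_idx, ]
test  <- housing[-train_idx, ]
nrow(test)   # roughly 4300 observations held out for validation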

3 Data Analysis

Data Preprocessing

First, we scale and center all the numerical explanatory variables around 0. This is important because it prevents one variable from outweighing another simply due to differences in units. For example, the variable sqft_living has a much larger range than bathrooms before scaling.
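One way to do this is with caret's preProcess, estimating the centering and scaling parameters on the training set only; this is a sketch and may differ from the original preprocessing.

library(caret)

num_cols <- c("bedrooms", "bathrooms", "sqft_living", "sqft_lot",
              "floors", "age", "lat", "long")
pp <- preProcess(train[, num_cols], method = c("center", "scale"))
train[, num_cols] <- predict(pp, train[, num_cols])
test[, num_cols]  <- predict(pp, test[, num_cols])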

Linear Regression

After separating our data, we can first attempt linear regression. However, before building the model, we need to check whether the assumptions of linear regression, such as normality and homoscedasticity, hold. Figure 5 suggests that the residuals are not normally distributed and do not appear to have constant variance. Since the assumptions are violated, we cannot directly apply the method, and the response needs some transformation.
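The diagnostic plots in Figure 5 come from the untransformed full model, roughly as follows.

fit0 <- lm(price ~ ., data = train)
par(mfrow = c(2, 2))
plot(fit0)   # residuals vs. fitted, normal Q-Q, scale-location, leverage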

Figure 5: Model Diagnostics

In this particular example, I used the Box-Cox method, which is a power transformation that selects the optimal λ by maximizing the log-likelihood [1],

L(λ) = −(n/2) log(RSS_λ / n) + (λ − 1) Σ log(yi),

where the sum runs over all observations.
Since the optimal λ selected is −0.03, our linear model becomes y^λ = y^(−0.03) = β0 + Σ βi xi with i = 1, ..., 11. Then, by checking the model diagnostics again, we see that the normality and homoscedasticity conditions have improved a lot, as shown in Figure 6. We first fit the model with all the explanatory variables and perform a hypothesis test on each of them: if a predictor has a p-value greater than 0.05 in the summary output, we do not reject the null hypothesis βj = 0 and conclude that the variable does not have a significant linear relationship with the response. From the summary output I obtained, only the predictor condition2 (the second level of the categorical variable condition) has a p-value larger than 0.05. This level could be removed if we were to perform one-hot encoding on the categorical variable; however, in this case, I do not think simply removing one level from the model is a big deal.
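A sketch of the Box-Cox step using MASS::boxcox, which profiles the log-likelihood above over a grid of λ values; the grid and object names are my own choices rather than the original code.

library(MASS)

bc <- boxcox(lm(price ~ ., data = train), lambda = seq(-0.5, 0.5, by = 0.01))
lambda <- bc$x[which.max(bc$y)]        # about -0.03, as reported above

# refit the full model on the transformed response
train_bc <- train
train_bc$price <- train_bc$price^lambda
fit_bc <- lm(price ~ ., data = train_bc)

summary(fit_bc)                        # R^2 and per-coefficient t-tests
par(mfrow = c(2, 2)); plot(fit_bc)     # diagnostics as in Figure 6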

Figure 6: Model Diagnostics After Box-Cox Transformation

Also, the summary output suggests that our full model has an R² value of 0.70. We can then perform 10-fold cross validation on the training set to verify whether the model's performance changes on new test data. Since the result did not change much, we use the model to make predictions on the validation set. There are many ways to measure model accuracy; in this project, I will use the Root Mean Square Error (RMSE). The linear regression with all features gives an RMSE of 238338.4, and we will use this number to compare with other models later. Moreover, we should always make sure that our regression model does not suffer from multicollinearity, which can be verified by looking at the VIF values for all predictors. In this example, we do not have such an issue.
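As a sketch of the evaluation described above (object names carried over from the earlier sketches), the predictions come back on the transformed scale and are back-transformed before computing the RMSE.

library(car)

pred_bc <- predict(fit_bc, newdata = test)   # predictions on the transformed scale
pred    <- pred_bc^(1 / lambda)              # back-transform to the original price scale
sqrt(mean((test$price - pred)^2))            # validation RMSE

vif(fit_bc)                                  # large values would signal multicollinearity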

Regularization

Since we are able to use a standard linear model, we can also apply some shrinkage or regularization methods to see whether they improve the model; specifically, regularization helps prevent the model from overfitting. There are two well-known methods, LASSO and ridge, sometimes referred to as L1 and L2 regularization. In particular, we can use LASSO for feature selection to discard unimportant features.
The ideas behind the two methods are similar: both try to find the optimal λ that adjusts the penalty term in the loss function. LASSO has the loss function RSS + λ Σ |βj|, where the sum runs over j = 1, ..., p and j indexes the predictors. Ridge regression, on the other hand, has the form RSS + λ Σ βj² [3]. The selection of λ is important here since it affects the values of the slopes βj. If λ is zero, the problem is essentially ordinary least squares. When λ is large, the coefficients are driven to zero, which causes underfitting. As suggested in Figure 7, as we increase λ, more coefficients converge to zero. The lines in different colors and the numbers on the left indicate the different coefficients. In this particular example, coefficients 3 and 16, sqft_living and lat (in green and blue), converge to zero the slowest, which indicates that they are the most significant features in the linear model. This demonstrates that we can perform feature selection with LASSO by tuning λ.
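A sketch of the LASSO fit with glmnet (alpha = 1 gives the L1 penalty); model.matrix expands the factor variables because glmnet needs a numeric matrix. The object names are my own.

library(glmnet)

x_train <- model.matrix(price ~ ., data = train)[, -1]   # drop the intercept column
y_train <- train$price

lasso_fit <- glmnet(x_train, y_train, alpha = 1)
plot(lasso_fit, xvar = "lambda", label = TRUE)           # coefficient paths as in Figure 7

Setting alpha = 0 in the same call produces the ridge coefficient paths shown in Figure 8 instead.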

Figure 7: LASSO Convergence

Ridge regression has a similar cost function and a similar role for λ. However, in this case, we cannot force the coefficients to be exactly zero. As λ gets larger, the effect of the shrinkage penalty grows, and the coefficients can only get closer to zero but never exactly zero [2]. As shown in Figure 8, the ridge coefficients converge more slowly and do not shrink all the way to zero. So, ridge regression penalizes the variables with minor contributions to the model by pushing their coefficients toward zero.

Figure 8: Ridge Convergence

By tuning λ, we will certainly get different coefficients and different model performances. As shown in Figure 9, the idea is to find the optimal λ that minimizes the test error, using cross validation on the training set. In this case, if we were to increase λ further, the MSE would get larger, so the optimal value is around e^(−11).
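The tuning itself can be sketched with cv.glmnet; the same call with alpha = 0 tunes the ridge model. The seed and names are arbitrary.

set.seed(123)
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1, nfolds = 10)
plot(cv_lasso)                       # CV error against log(lambda), as in Figure 9
log(cv_lasso$lambda.min)             # optimal lambda on the log scale

x_test <- model.matrix(price ~ ., data = test)[, -1]
pred   <- predict(cv_lasso, newx = x_test, s = "lambda.min")
sqrt(mean((test$price - pred)^2))    # validation RMSE for the tuned model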

Figure 9: Parameter Tuning

Since the obtained R² values are all about 0.70, we may suspect that a linear relationship does not explain the data well enough. Personally, I am not a fan of polynomial regression, even though it might work well in this particular example: the method is a bit difficult to interpret, and we need to pay careful attention to the choice of degree, since it can easily cause overfitting. Instead, we can try some tree-based models. Specifically, I tried a regression tree and an XGBoost tree, which is a type of ensemble method.

XGBoost Tree

The regression tree method gives a best result of 224030.2 in RMSE and 0.627 in R². I gave up on random forest because it is really time consuming on my computer. Since XGBoost requires numerical input only, we first need to perform one-hot encoding on the categorical variables. Due to the page limit of this report, we will not walk through the whole XGBoost procedure; roughly speaking, it is a more powerful approach than bagging. Also, unlike linear regression, it is hard to interpret; however, the trees built by the XGBoost algorithm are not as complicated as those in a random forest.
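A sketch of the XGBoost setup: one-hot encode through model.matrix and train a basic model. The hyperparameter values and number of rounds below are assumptions for illustration, not the report's exact settings (which used the defaults with 10-fold CV).

library(xgboost)

x_train <- model.matrix(price ~ ., data = train)[, -1]
x_test  <- model.matrix(price ~ ., data = test)[, -1]

dtrain <- xgb.DMatrix(data = x_train, label = train$price)

xgb_fit <- xgb.train(params = list(objective = "reg:squarederror",
                                   max_depth = 6, eta = 0.1),
                     data = dtrain, nrounds = 200, verbose = 0)

pred <- predict(xgb_fit, x_test)
sqrt(mean((test$price - pred)^2))    # validation RMSE
cor(test$price, pred)^2              # R-squared on the validation set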

Tuning the hyperparameters in XGBoost is a challenge since there are quite a few of them. In this case, I used the default hyperparameters with 10-fold CV on the training set. The result shown in Figure 10 is a significant improvement over the results from linear regression with regularization. This surprised me, since I had not previously worked with XGBoost. The result could potentially be even better if we spent some time tuning the hyperparameters; since I have no experience with that yet, I think following the procedures in this tutorial would help us improve the model further.

Figure 10: Result of XGBoost on the Training Set

In the end, we make our predictions using XGBoost and obtain an R² value of 0.894 and an RMSE of 119420. This result may need further investigation, and it could potentially reflect overfitting, given that both values are significantly better than the training results in Figure 10.

4 Model Comparisons
So far, we have seen several different approaches and their performances are shown
in Table 1.

Methods/Models                          R²      RMSE
Linear Regression                       0.702   238338.4
LASSO Regression (L1 Regularization)    0.705   236769.2
Ridge Regression (L2 Regularization)    0.702   221165
Regression Tree                         0.627   224030.2
XGBoost                                 0.894   119420

Table 1: Model Performances

5 Conclusion
Although there are many other methods we could have used in this project, I think this housing dataset is especially good for practicing different regression algorithms. Previously, I had focused too much on finding a linear relationship and on regularization. In this particular project, we have seen that linear regression with L2 regularization slightly improved the linear model in terms of RMSE, but not as strongly as XGBoost did. In machine learning projects nowadays, ensemble methods are widely used, and XGBoost is one of the most famous boosting methods. Starting from this project, I intend to apply this method better and experiment with it more in the future.

References
[1] David Dalpiaz. Applied Statistics with R, Chapter 14: Transformations, 30 Oct. 2020. https://daviddalpiaz.github.io/appliedstats/transformations.html.

[2] Kassambara. Penalized Regression Essentials: Ridge, Lasso & Elastic Net, 03 Nov. 2018. http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/153-penalized-regression-essentials-ridge-lasso-elastic-net/.

[3] Anuja Nagpal. L1 and L2 Regularization Methods, 14 Oct. 2017. https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c.
