Machine Learning Report Official
Abstract
This report describes how we developed a model that predicts the prices of used cars from historical sales data. We apply several classic methods, such as Linear
Regression (LR), Random Forest (RF), Gradient Boosting (GBM), XGBoost (XGB),
and LightGBM. We also provide experimental results demonstrating the performance of the models and the evaluation metrics.
1. INTRODUCTION
In recent years, Artificial Intelligence has proven effective in many fields such as
machine learning, computer vision, natural language processing, image processing,
and speech processing. In the automotive industry, many companies are developing and
deploying machine learning models to predict vehicle prices and then implement strategies to increase their revenue. In this project, we work on used-vehicle price
prediction, which is a very useful application: predicting prices accurately helps attract
customers and optimize sales. Within the scope of this subject, we present several methods to
train prediction models, including Linear Regression, Random Forest Regressor, and Gradient
Boosting Regressor.
2. BACKGROUND
This section provides a brief explanation of the theoretical background necessary to
understand our project report.
• transmission: the way the vehicle transfers power from the engine to the wheels
(manual or automatic).
For each of the algorithms mentioned, we will analyze the importance of the above features,
i.e., the relative importance of each feature in the training data.
We split the data into two sets: a train set containing 80% of the samples and a test set containing the remaining 20%.
Furthermore, we also scale the features with the StandardScaler to make the model easier to train, as in the sketch below.
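A minimal sketch of this preprocessing step (the array names and the synthetic data are illustrative stand-ins for our feature matrix and price targets):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative data: X holds the feature columns, y the selling prices.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = rng.normal(size=1000)

# 80/20 train/test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the train set only, then apply it to both sets,
# so no information from the test set leaks into training.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```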
year. These two amounts of money are proportional, and their correlation can be expressed as a first-degree
equation (e.g., $y = a + bx$).
In this report, we will implement and interpret the Linear Regression model based on
the theory above.
• Each observation of $x$ is an $n$-dimensional vector $x_i = (x_{i1}, x_{i2}, x_{i3}, \ldots)$, where $x_{i1}, x_{i2}, \ldots$
are the features of an instance.

$$f(x) = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_n x_n$$

• Observation $x = (x_1, x_2, \ldots, x_n)^T$
The goal is to make future predictions as accurate as possible by minimizing a loss function.
The empirical error of the prediction is:
$$L = \frac{1}{M} \sum_{i=1}^{M} (y_i - f(x_i))^2 = \frac{1}{M} \sum_{i=1}^{M} (y_i - w_0 - w_1 x_{i1} - w_2 x_{i2} - \ldots - w_n x_{in})^2$$
$$f^* = \arg\min_f L(f) \iff w^* = \arg\min_w \sum_{i=1}^{M} (y_i - w_0 - w_1 x_{i1} - \ldots - w_n x_{in})^2$$
$$\frac{\partial L}{\partial w} = X^T (Xw - y) = 0 \;\rightarrow\; w^* = (X^T X)^{-1} X^T y$$
where $X$ is the data matrix of size $M \times (n+1)$, whose $i$-th row is $X_i = (1, x_{i1}, x_{i2}, \ldots, x_{in})$, and
$y = (y_1, y_2, \ldots, y_M)^T$.
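As a sketch, this closed-form solution can be computed directly with NumPy; the data here are synthetic stand-ins, and in practice np.linalg.solve is preferred over explicitly inverting $X^T X$:

```python
import numpy as np

# Illustrative data: M = 100 observations, n = 3 features.
rng = np.random.default_rng(0)
M, n = 100, 3
X_raw = rng.normal(size=(M, n))
y = rng.normal(size=M)

# Prepend a column of ones so w[0] plays the role of the intercept w0,
# giving the M x (n+1) data matrix described above.
X = np.hstack([np.ones((M, 1)), X_raw])

# w* = (X^T X)^{-1} X^T y; solve() is numerically safer than inv().
w_star = np.linalg.solve(X.T @ X, X.T @ y)
y_pred = X @ w_star
```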
One of the most popular ways to minimize the loss function is the Gradient Descent
algorithm.
Rather than trying arbitrary values of $w$ until the loss is minimized, Gradient Descent steers this search using the gradient: the value $w_{t+1}$ at step $t+1$ is obtained from the
gradient of the loss function at the previous step $t$:
$$w_{t+1} = w_t - lr \cdot \frac{\partial L}{\partial w}(w_t)$$

$$w_{t+1} = w_t - lr \cdot X^T (X w_t - y)$$
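A minimal NumPy sketch of this update rule (the learning rate, step count, and synthetic data are illustrative; the gradient is averaged over the M samples for numerical stability, which simply rescales the learning rate):

```python
import numpy as np

def gradient_descent(X, y, lr=0.01, n_steps=1000):
    """Minimize the squared-error loss with the update
    w_{t+1} = w_t - lr * X^T (X w_t - y) / M."""
    M, d = X.shape
    w = np.zeros(d)
    for _ in range(n_steps):
        gradient = X.T @ (X @ w - y) / M  # average gradient over M samples
        w = w - lr * gradient
    return w

# Illustrative usage (the first column of ones is the intercept term).
rng = np.random.default_rng(0)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
true_w = np.array([1.0, 2.0, -3.0])
y = X @ true_w + 0.1 * rng.normal(size=200)
print(gradient_descent(X, y))  # approaches [1.0, 2.0, -3.0]
```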
To identify outliers, we use the box plot technique, which is explained below.
The box plot is a useful technique for describing data in the middle and at the ends of a
distribution. It uses the median and the lower and upper quartiles (the 25th and
75th percentiles). The difference between the lower and upper quartiles is called the interquartile range
(IQR).
We follow these steps:
• Step 1: Calculate the median and the lower and upper quartiles.
• Step 2: Calculate the interquartile range, i.e., the difference between the
lower and upper quartiles (denoted IQR).
• Step 3: Denote by Q1 and Q3 the lower and upper quartiles respectively, and calculate
the fence points: the inner fences Q1 − 1.5 · IQR and Q3 + 1.5 · IQR, and the outer fences Q1 − 3 · IQR and Q3 + 3 · IQR.
• Step 4: Flag the points that fall outside the fences.
• Step 5: Remove the outliers. Points beyond the outer fences are called extreme outliers,
and points between the inner and outer fences are called mild outliers. In this section,
we consider points beyond the outer fences, as in the sketch below.
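A sketch of this procedure with pandas, assuming a DataFrame with a numeric column named selling_price (the column name and data are illustrative); here we drop only extreme outliers, i.e., points beyond the outer fences:

```python
import pandas as pd

def remove_extreme_outliers(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Drop rows whose value in `col` lies beyond the outer fences."""
    q1 = df[col].quantile(0.25)   # lower quartile
    q3 = df[col].quantile(0.75)   # upper quartile
    iqr = q3 - q1                 # interquartile range
    lower_fence = q1 - 3 * iqr    # outer fences: 3 * IQR beyond the quartiles
    upper_fence = q3 + 3 * iqr
    return df[(df[col] >= lower_fence) & (df[col] <= upper_fence)]

# Illustrative usage:
df = pd.DataFrame({"selling_price": [3_000, 4_500, 5_200, 6_100, 250_000]})
print(remove_extreme_outliers(df, "selling_price"))  # the 250_000 row is dropped
```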
Outlier removal improves the accuracy of the linear regression model
from 64% to 69.5% on the train set and from 63.5% to 67.4% on the test set. Although
the improvement is modest, it demonstrates how outliers impact the model.
2.4. Boosting
2.4.1. What is Ensemble Learning?
Ensemble Learning is a machine learning approach that combines several base
models to produce one optimal model. Instead of building a single model and deploying it
in the hope that it predicts accurately, ensemble methods take many models into account and average their outputs to produce the final result [3].
There are three main classes of ensemble learning: bagging, boosting, and stacking. In
this report, we focus on boosting methods such as Gradient Boosting and XGBoost.
$$F_0(x) = \arg\min_\gamma \sum_{i=1}^{n} L(y_i, \gamma)$$
where $L$ is the loss function; in the regression case this is the squared error $L(y_i, \gamma) = (y_i - \gamma)^2$.
We find the value of $\gamma$ that minimizes $\sum L(y_i, \gamma)$ via the $\arg\min$. Taking the derivative of $\sum L$ with respect to $\gamma$ and setting it equal to zero, we get:
$$\gamma = \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y} \;\rightarrow\; F_0(x) = \bar{y}$$
– Step 2.1: Compute the residuals by taking the derivative of the loss function with
respect to the previous prediction $F_{m-1}(x)$ and multiplying by $-1$. We have:
$$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)} = -\frac{\partial (y_i - F_{m-1})^2}{\partial F_{m-1}} = 2(y_i - F_{m-1}), \quad i = 1, \ldots, n$$
$$\gamma_{jm} = \arg\min_\gamma \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma) = \arg\min_\gamma \sum_{x_i \in R_{jm}} (y_i - F_{m-1}(x_i) - \gamma)^2, \quad j = 1, 2, \ldots, J_m$$
Similar to Step 1, taking the derivative with respect to $\gamma$ and setting it equal to zero,
we have:
$$\frac{\partial}{\partial \gamma} \sum_{x_i \in R_{jm}} (y_i - F_{m-1}(x_i) - \gamma)^2 = 0$$

$$-2 \sum_{x_i \in R_{jm}} (y_i - F_{m-1}(x_i) - \gamma) = 0$$

$$n_j \gamma = \sum_{x_i \in R_{jm}} (y_i - F_{m-1}(x_i))$$

$$\gamma = \frac{1}{n_j} \sum_{x_i \in R_{jm}} (y_i - F_{m-1}(x_i)) = \frac{1}{2 n_j} \sum_{x_i \in R_{jm}} r_{im}$$

where $n_j$ is the number of samples in region $R_{jm}$: the optimal $\gamma$ is the average residual within the region (the factor of 2 disappears under the usual convention $L = \frac{1}{2}(y - F)^2$).
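A compact sketch of the full algorithm derived above, using scikit-learn's DecisionTreeRegressor as the base learner (the hyperparameter values are illustrative): initialize with $\bar{y}$, then repeatedly fit a tree to the residuals and add its shrunken prediction.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boosting_fit(X, y, n_estimators=100, learning_rate=0.1, max_depth=3):
    """Gradient boosting for squared-error loss, following the derivation above."""
    f0 = np.mean(y)                # Step 1: F0(x) = y-bar
    trees, pred = [], np.full(len(y), f0)
    for _ in range(n_estimators):
        residuals = y - pred       # Step 2.1: the (pseudo-)residuals
        # Step 2.2/2.3: a regression tree fit on the residuals partitions the
        # data into leaves R_jm, and each leaf predicts the mean residual,
        # which is exactly the optimal gamma_jm derived above.
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        pred += learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees

def gradient_boosting_predict(X, f0, trees, learning_rate=0.1):
    pred = np.full(X.shape[0], f0)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred
```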
After deploying the Gradient Boosting Regressor in two ways (one using the Scikit-learn
library and the other implementing the algorithm above) on the vehicle dataset, we get quite good
results, with an accuracy of 94.88% on the train set and 87.48% on the test set (the same
results for both implementations).
There are some important parameters impacting the model which we want to explain:
• n_estimators: the number of sequential trees to be modeled.
• learning_rate: controls how much each additional tree's prediction contributes
to the combined prediction. Small learning rates are popular because they prevent
the model from fitting too closely to the idiosyncrasies of any single tree, but they require a higher number
of trees.
• max_depth: the maximum depth of a tree. Limiting it helps avoid overfitting, since a
higher depth allows the model to learn relations very specific to a particular sample.
In our code, we initially chose values for the above parameters intuitively. After using the
GridSearchCV function from the Scikit-learn library to tune the hyperparameters, the results
improve on the original: 97.42% on the train set and 89.13% on the test set. A sketch of the tuning step follows.
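In this sketch, the synthetic training data and the exact grid values are illustrative, not our actual dataset or search space:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative stand-in for the preprocessed training data.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(500, 5))
y_train = X_train @ rng.normal(size=5) + 0.1 * rng.normal(size=500)

# Illustrative grid over the three parameters discussed above.
param_grid = {
    "n_estimators": [100, 300, 500],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 4, 5],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    cv=5,             # 5-fold cross-validation on the train set
    scoring="r2",     # R^2, matching the accuracy metric reported above
    n_jobs=-1,        # evaluate grid points in parallel
)
search.fit(X_train, y_train)
print(search.best_params_)
print(search.best_score_)
```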
2.4.5. XGBoost
Extreme Gradient Boosting (XGBoost, sometimes abbreviated XGBM) is a more recent take on gradient boosting
machines that works very similarly to GBM. In XGBoost, trees are added sequentially (one
at a time), each learning from the errors of the previous trees and improving on them. Although
the algorithms of XGBoost and GBM are quite similar, there are a few differences
between them:
• Instead of a purely sequential process like GBM, XGBoost parallelizes the work
within each tree across nodes, which makes it faster than GBM.
• XGBoost handles missing data automatically, so we can drop this step from data
preprocessing.
The initial accuracy of XGBoost is 98.89% on the train set and 86.78% on the test set
(with default parameters).
We tune some hyperparameters whose roles we understand, illustrated in the sketch after this list:
• n_estimators: the number of trees to be modeled, i.e., the number of boosting
rounds.
• learning_rate: the step size at each iteration, which scales each new tree's contribution to the combined prediction and improves the chance of reaching the optimum. Its value lies between 0 and
1.
• max_depth: the maximum depth of a tree. A high value for max_depth might
increase the performance but also the chance of overfitting.
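A minimal sketch of training an XGBRegressor with these hyperparameters (the data and the parameter values are illustrative, not our tuned settings):

```python
import numpy as np
from xgboost import XGBRegressor

# Illustrative stand-in for the preprocessed data.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)
X_train, X_test = X[:800], X[800:]
y_train, y_test = y[:800], y[800:]

# The three hyperparameters discussed above, with illustrative values.
model = XGBRegressor(
    n_estimators=300,    # number of boosting rounds / trees
    learning_rate=0.05,  # shrinks each tree's contribution
    max_depth=4,         # limits tree depth to control overfitting
)
model.fit(X_train, y_train)
print(model.score(X_train, y_train), model.score(X_test, y_test))  # R^2 scores
```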
• Step 2: For each feature, choose the best split point of that feature at node d
(computed by a metric such as Gini impurity).
• Step 3: Continue splitting the node into several child nodes using the best-split
method, which by default uses Gini impurity values.
• Step 4: Repeat the above process to build a complete individual Decision Tree, as in the sketch below.
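As a sketch, RandomForestRegressor automates this whole procedure, growing each tree on a bootstrap sample by repeating Steps 2-4; note that for regression scikit-learn scores splits by squared-error reduction rather than the Gini impurity used for classification (the data and hyperparameters below are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Illustrative stand-in for the preprocessed data.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)

# Each of the 200 trees repeats Steps 2-4 on a bootstrap sample,
# splitting on the feature and threshold that most reduce squared error.
forest = RandomForestRegressor(n_estimators=200, criterion="squared_error",
                               random_state=42)
forest.fit(X, y)

# The per-feature importances referred to earlier in Section 2.
print(forest.feature_importances_)
```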
REFERENCES
[1] Weisberg, Sanford. Applied Linear Regression. Vol. 528. John Wiley & Sons, 2005.
[2] Ben-Gal, Irad. "Outlier Detection." Data Mining and Knowledge Discovery Handbook. Springer, Boston, MA, 2005. 131-146.
[4] Schapire, Robert E. "A Brief Introduction to Boosting." IJCAI. Vol. 99. 1999.
[5] Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Section 10.13.1, "Relative Importance of Predictor Variables," page 367.