
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY

INTRODUCTION TO ARTIFICIAL INTELLIGENCE

MINI PROJECT: OLD VEHICLES PRICES PREDICTION

Instructors: Nguyễn Nhật Quang, PhD.

Students: Trần Thanh Trường - 20214938


Hoàng Đình Dũng - 20214882
Phan Công Anh - 20210078

Hanoi - December, 2022



Abstract
This report describes how we developed a prediction model using data on previously sold used cars. For this task, we use several classic methods such as Linear Regression (LR), Random Forest (RF), Gradient Boosting (GBM), XGBoost (XGB), and LightGBM. We also provide experimental results demonstrating the performance of the models together with the evaluation metrics.

1. INTRODUCTION
In recent years, Artificial Intelligence has proven effective in many fields such as machine learning, computer vision, natural language processing, image processing, and speech processing. In the automotive industry, many companies develop and deploy machine learning models to predict vehicle prices and then implement strategies to increase their revenue. In this project, we work on Old Vehicle Price Prediction, which is a very useful application: predicting prices helps attract customers and optimize sales. Within the scope of this subject, we present several methods to train prediction models, such as Linear Regression, Random Forest Regressor, and Gradient Boosting Regressor.

2. BACKGROUND
This section provides a brief explanation of the theoretical background necessary to
understand our project report.

2.1. Data Preprocessing and Cleaning


We use some basic steps to clean the dataset:
• Remove missing data
• Remove duplicate rows based on all columns
• Remove unnecessary columns
• Reset the index after removing missing data and duplicate rows
• Remove some unnecessary characters from several specific columns
• Encode the categorical data
After preprocessing and cleaning the data, we obtain some information about the dataset:


To train the model conveniently, we replaced the "name" column with a "company" column, where the company is the first word of the "name" column, as sketched below.
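The following snippet is a minimal sketch of these cleaning steps with pandas; the file name cars.csv, the exact columns that contain unit suffixes, and the label-encoding choice are assumptions for illustration, not details taken from the report.

```python
import pandas as pd

# Load the raw dataset (file name is an assumption)
df = pd.read_csv("cars.csv")

# Remove missing data and duplicate rows, then reset the index
df = df.dropna().drop_duplicates().reset_index(drop=True)

# Strip unit suffixes from specific columns, e.g. "23.4 kmpl" -> 23.4
# (the set of affected columns is an assumption)
for col in ["mileage", "engine", "max_power"]:
    df[col] = df[col].astype(str).str.extract(r"([\d.]+)", expand=False).astype(float)

# Replace "name" with a "company" column (first word of the name)
df["company"] = df["name"].str.split().str[0]
df = df.drop(columns=["name"])

# Encode categorical columns as integer codes
for col in ["fuel", "seller_type", "transmission", "owner", "company"]:
    df[col] = df[col].astype("category").cat.codes
```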

2.2. Splitting Data


Let us first describe the features:
• year: year of manufacture

• km_driven: distance traveled by the vehicle, in kilometers

• fuel: type of fuel the vehicle uses

• seller_type: type of seller

• transmission: the way the vehicle transfers power from the engine to the wheels (manual or automatic)

• owner: type of vehicle owner

• mileage: the vehicle's fuel mileage (distance traveled per unit of fuel)

• engine: type of engine the vehicle uses

• max_power: the maximum power of the vehicle

• torque: a measure of the engine's turning force

• seats: number of seats in the vehicle

• company: name of the company that made the vehicle

• selling_price: price of the vehicle; this is the target we predict

For each of the algorithms mentioned, we analyze feature importance, i.e. the relative importance of each of the above features in the training data.
We split the data into two sets: a train set of size 0.8 and a test set of size 0.2.

Furthermore, we also scale the features using a Standard Scaler to make the model easier to train, as in the sketch below.
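A minimal sketch of this split and scaling with scikit-learn; the variable names X and y and the random seed are assumptions.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Separate features and target (column names follow Section 2.2)
X = df.drop(columns=["selling_price"])
y = df["selling_price"]

# 80/20 train/test split (random seed is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the train set only, then transform both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```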

2.3. Linear Regression


2.3.1. What is Linear Regression
Linear regression is one of the basic techniques for predicting the value of a target based on other related, known data values. The model is trained on a labeled dataset and then used to predict unknown values. Mathematically, the method expresses the correlation between the dependent variable and the independent variables as a linear equation. For example, suppose we have data showing your shopping expenses and your salary last year; these two amounts of money are proportional, and their correlation can be shown as a first-degree equation (e.g., y = a + bx).
In this report, we implement and interpret a Linear Regression model based on the theory above.

2.3.2. How does the Linear Regression model work?

The core problem of Linear Regression is to find a function f(x) from training data D = {(x_1, y_1), (x_2, y_2), ...} such that y_i ≈ f(x_i) for every i.

• Each observation x_i is an n-dimensional vector x_i = (x_{i1}, x_{i2}, x_{i3}, ...), where x_{i1}, x_{i2}, ... are the features of an instance.

The learned function takes a linear form [1]:

f(x) = w_0 + w_1 x_1 + w_2 x_2 + ... + w_n x_n

• w_0, w_1, w_2, ... are the regression coefficients; w_0 is sometimes called the "bias"

• Observation x = (x_1, x_2, ..., x_n)^T

• Note: learning a linear function is equivalent to learning the coefficient vector w = (w_0, w_1, ..., w_n)^T

The goal is to make the best possible predictions on future data by minimizing a loss function. The empirical error of the prediction is:

L = \frac{1}{M} \sum_{i=1}^{M} (y_i - f(x_i))^2
  = \frac{1}{M} \sum_{i=1}^{M} (y_i - w_0 - w_1 x_{i1} - w_2 x_{i2} - ... - w_n x_{in})^2

We find f^* so that L is minimized:

f^* = \arg\min_f L(f)
⟺ w^* = \arg\min_w \sum_{i=1}^{M} (y_i - w_0 - w_1 x_{i1} - ... - w_n x_{in})^2

By taking the derivative of L with respect to w and setting it equal to zero, we have:

\frac{\partial L}{\partial w} = X^T (Xw - y) = 0
→ w^* = (X^T X)^{-1} X^T y

where X is the data matrix of size M×(n+1), whose i-th row is X_i = (1, x_{i1}, x_{i2}, ..., x_{in}), and y = (y_1, y_2, ..., y_M)^T.


Then, we produce a new prediction:

y_x = w_0^* + w_1^* x_1 + ... + w_n^* x_n
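As a quick illustration, the closed-form solution above can be computed directly with NumPy; this is a minimal sketch, and the handling of the bias column in the design matrix is an assumption about how the data is arranged.

```python
import numpy as np

def fit_linear_regression(X, y):
    """Solve the normal equation w* = (X^T X)^{-1} X^T y."""
    # Prepend a column of ones so w[0] plays the role of the bias w_0
    X_design = np.hstack([np.ones((X.shape[0], 1)), X])
    # Use a linear solve instead of an explicit inverse for numerical stability
    w = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)
    return w

def predict(X, w):
    X_design = np.hstack([np.ones((X.shape[0], 1)), X])
    return X_design @ w
```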

One of the most popular ways to minimize the loss function is the Gradient Descent algorithm.
Rather than trying values of w in the loss function until the loss is minimized, Gradient Descent optimizes this search using the gradient: the value w_{t+1} at step t+1 is computed from the gradient of the loss function at the previous step t:

w_{t+1} = w_t - lr \cdot \frac{\partial L}{\partial w}(w_t)
w_{t+1} = w_t - lr \cdot X^T (X w_t - y)

where lr is the learning rate.
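Below is a minimal NumPy sketch of this update rule; the learning rate, number of iterations, and the 1/M scaling of the gradient are assumptions for illustration.

```python
import numpy as np

def gradient_descent(X, y, lr=0.01, n_iters=1000):
    """Minimize the squared loss with the update w <- w - lr * X^T (Xw - y)."""
    M, n = X.shape
    X_design = np.hstack([np.ones((M, 1)), X])  # add bias column
    w = np.zeros(n + 1)
    for _ in range(n_iters):
        grad = X_design.T @ (X_design @ w - y) / M  # averaged gradient
        w -= lr * grad
    return w
```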


After applying both the Scikit-learn implementation and the Gradient Descent method to the dataset, we get the same result: 64% accuracy on the training set and 63.5% on the test set.

Let us check the decrease of the loss function:

We also plot the correlation between the true values and the predicted values:


2.3.3. Improving the Linear Regression Model

To improve the above results, we researched outlier detection in regression and applied it to our project.
Outliers are values that lie far from the expected distribution. They distort the model significantly, make the feature distributions less well-behaved, and cause linear regression models to produce worse, more biased results. There are several ways to deal with this problem, such as removing outliers from the observations, using algorithms that are robust to them, or treating them. By removing outliers from the training data before modeling, we can get a better fit to the data and better performance [2].


To identify outliers, we use the box plot technique, which is explained below.
The box plot is a useful technique for describing data in the middle and at the ends of a distribution. It uses the median and the lower and upper quartiles (often the 25th and 75th percentiles). The difference between the lower and upper quartiles is called the interquartile range (IQ).
We follow these steps:

• Step 1: Calculate the median and the lower and upper quartiles.

• Step 2: Calculate the interquartile range, which is the difference between the lower and upper quartiles (denoted IQ).

• Step 3: Denote by Q1 and Q3 the lower and upper quartiles, respectively. Calculate the following points:

– lower inner fence: Q1 − 1.5·IQ
– upper inner fence: Q3 + 1.5·IQ
– lower outer fence: Q1 − 3·IQ
– upper outer fence: Q3 + 3·IQ

• Step 4: Plot a box that has the following form.

• Step 5: Remove outliers. Points beyond the outer fence are called extreme outliers, and points beyond the inner fence are called mild outliers. In this section, we only consider points beyond the outer fence; a small sketch of this filtering is given after this list.
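A minimal pandas sketch of the Step 5 filtering; the 3·IQ outer-fence rule follows the steps above, while applying it only to selling_price is an assumption about which columns the project actually filtered.

```python
def remove_extreme_outliers(df, column):
    """Drop rows whose value in `column` lies beyond the outer fences."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iq = q3 - q1                      # interquartile range
    lower_outer = q1 - 3 * iq
    upper_outer = q3 + 3 * iq
    mask = df[column].between(lower_outer, upper_outer)
    return df[mask].reset_index(drop=True)

# Example usage (column choice is an assumption)
df = remove_extreme_outliers(df, "selling_price")
```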

Outlier detection helps us improve the accuracy of the linear regression model, from 64% to 69.5% on the train set and from 63.5% to 67.4% on the test set. Although the gain is modest, it shows how outliers affect the model.


2.4. Boosting
2.4.1. What is Ensemble Learning
Ensemble Learning is a family of machine learning methods that combines several base models to produce an optimal model. Instead of building a single model and deploying it in the hope that it predicts accurately, ensemble methods take many models into account and average them in order to produce the final result [3].

There are three main classes of ensemble learning: bagging, boosting, and stacking. In this report, we focus on boosting methods such as Gradient Boosting and XGBoost.

2.4.2. What is Boosting

Boosting is one of the three main methods in Ensemble Learning. Unlike bagging, which trains many decision trees in parallel on different samples of the same dataset and produces a prediction by averaging, boosting combines a set of weak learners into a strong learner to minimize training error. In boosting, a random sample of data is selected and fitted with a model, and models are then trained sequentially; that is, each model tries to compensate for the weaknesses of its predecessor [4].

2.4.3. Gradient Boosting

One of the most popular methods in Ensemble Learning is the Gradient Boosting Machine. It is a powerful technique for building predictive models for regression and classification tasks. In our project, we use a Gradient Boosting Regressor to train the model and produce predictions.
In this project, we built gradient boosting regression trees step by step on the vehicle dataset and obtained good results. To understand intuitively how it works, you can follow the image and the steps we explain below.


We explain the Gradient Boosting algorithm step by step:


• Step 1: Initialize the model with a constant value:

  F_0(x) = \arg\min_γ \sum_{i=1}^{n} L(y_i, γ)

  where L is the loss function; in the regression case it is the squared error L = (y_i - γ)^2. We look for the value γ that minimizes \sum_i L(y_i, γ). Taking the derivative of \sum_i L with respect to γ and setting it equal to zero, we get:

  γ = \frac{1}{n} \sum_{i=1}^{n} y_i = ȳ  →  F_0(x) = ȳ

• Step 2: We iterate the following sub-steps M times.

  – Step 2.1: Compute the residuals by taking the derivative of the loss function with respect to the previous prediction F_{m-1}(x) and multiplying it by -1. We have:

    r_{im} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x) = F_{m-1}(x)},  i = 1, ..., n
           = -\frac{\partial (y_i - F_{m-1}(x_i))^2}{\partial F_{m-1}(x_i)}
           = 2(y_i - F_{m-1}(x_i))

    We can drop the factor 2 since it is a constant, so the residual is r_{im} = y_i - F_{m-1}(x_i).


  – Step 2.2: Using the features x, we train a regression tree on the residuals and create terminal nodes R_{jm} for j = 1, 2, ..., J_m, where:
      * m: tree index
      * j: terminal node index
      * J_m: total number of leaves
  – Step 2.3: In this step, we find γ_{jm} so that the loss function is minimized on each terminal node:

    γ_{jm} = \arg\min_γ \sum_{x_i ∈ R_{jm}} L(y_i, F_{m-1}(x_i) + γ)
           = \arg\min_γ \sum_{x_i ∈ R_{jm}} (y_i - F_{m-1}(x_i) - γ)^2

    for j = 1, 2, ..., J_m.


    Similar to Step 1, taking the derivative with respect to γ and setting it equal to zero, we have:

    \frac{\partial}{\partial γ} \sum_{x_i ∈ R_{jm}} (y_i - F_{m-1}(x_i) - γ)^2 = 0
    -2 \sum_{x_i ∈ R_{jm}} (y_i - F_{m-1}(x_i) - γ) = 0
    n_j γ = \sum_{x_i ∈ R_{jm}} (y_i - F_{m-1}(x_i))
    γ = \frac{1}{n_j} \sum_{x_i ∈ R_{jm}} r_{im}

    Note: n_j denotes the number of samples in terminal node j.

    In conclusion, we calculate the average of the target values (residuals) in each terminal node, which gives the prediction values γ_{jm} of the regression tree.
  – Step 2.4: We update the model: F_m(x) = F_{m-1}(x) + ν \sum_{j=1}^{J_m} γ_{jm} 1(x ∈ R_{jm}), where ν ∈ (0, 1) is the learning rate, which controls the degree of contribution of the additional tree prediction γ to the combined prediction F. A small learning rate also reduces model overfitting.

Remember that we need to iterate Step 2 M times.


The brief algorithm is shown below:
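To make the steps concrete, here is a minimal sketch of the algorithm above using scikit-learn decision trees as weak learners; the tree depth, number of iterations, and learning rate are assumptions, and this is an illustration rather than the project's actual implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class SimpleGradientBoostingRegressor:
    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth

    def fit(self, X, y):
        # Step 1: initialize with the mean of the targets (F_0 = y-bar)
        self.f0 = np.mean(y)
        self.trees = []
        prediction = np.full(len(y), self.f0)
        for _ in range(self.n_estimators):
            # Step 2.1: residuals r_im = y_i - F_{m-1}(x_i)
            residuals = y - prediction
            # Steps 2.2-2.3: fit a regression tree to the residuals;
            # each leaf value is the mean residual in that leaf
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, residuals)
            # Step 2.4: update the combined model with learning rate v
            prediction = prediction + self.learning_rate * tree.predict(X)
            self.trees.append(tree)
        return self

    def predict(self, X):
        pred = np.full(X.shape[0], self.f0)
        for tree in self.trees:
            pred += self.learning_rate * tree.predict(X)
        return pred
```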

After deploying the Gradient Boosting Regressor in two ways (one using the Scikit-learn library and the other using the algorithm above) on the vehicle dataset, we get quite good results, with an accuracy of 94.88% on the train set and 87.48% on the test set (the same results for both implementations).

There are some important parameters impacting the model which we want to explain:
• n_estimators: the number of sequential trees to be modeled.
• learning_rate: controls the degree of contribution of each additional tree's prediction to the combined prediction. Small learning rates are popular because they make the model robust to the specific characteristics of any single tree, but they require a higher number of trees.
• max_depth: the maximum depth of a tree. Limiting it helps avoid overfitting, since a higher depth allows the model to learn relations that are very specific to a particular sample.
In our code, we initially chose the values of the above parameters intuitively. After using the GridSearchCV function from the Scikit-learn library to tune the hyperparameters, the results become better than the original: 97.42% on the train set and 89.13% on the test set.
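A minimal sketch of this tuning step, reusing the split and scaled data from the sketch in Section 2.2; the parameter grid values, scoring choice, and number of folds are assumptions, not the exact settings used in the project.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Candidate values are assumptions for illustration
param_grid = {
    "n_estimators": [100, 300, 500],
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [3, 4, 5],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    cv=5,           # 5-fold cross-validation
    scoring="r2",   # R^2 score; assumed to match the "accuracy" reported above
    n_jobs=-1,
)
search.fit(X_train_scaled, y_train)

print("Best parameters:", search.best_params_)
print("Test score:", search.best_estimator_.score(X_test_scaled, y_test))
```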

2.4.4. Feature Importance

A Gradient Boosting model provides an importance score for each attribute after constructing the trees. The score shows how useful each feature was in the decision trees: the more an attribute is used to make key decisions within the trees, the higher its relative importance. Importance is computed for each decision tree by this principle and then averaged over all decision trees within the model. Feature importances are calculated for every attribute in the dataset, allowing them to be compared with each other [5].
The image below shows the feature importances of our GBM model, in which max_power is the most important.
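A minimal sketch of how such a chart can be produced from a fitted scikit-learn model; the use of matplotlib and the fitted model name gbm are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# `gbm` is assumed to be an already fitted GradientBoostingRegressor
importances = pd.Series(gbm.feature_importances_, index=X.columns)
importances.sort_values().plot(kind="barh")
plt.xlabel("Relative importance")
plt.title("GBM feature importances")
plt.tight_layout()
plt.show()
```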


2.4.5. XGBoost
Extreme Gradient Boosting (XGBM) is a more recent implementation of gradient boosting machines that works very similarly to GBM. In XGBM, trees are added sequentially (one at a time), learning from the errors of the previous trees and improving on them. Although the algorithms of XGBM and GBM are quite similar, there are still a few differences between them:

• XGBM reduces overfitting and underfitting and improves model performance by using regularization techniques.

• Instead of the purely sequential process of GBM, XGBM parallelizes the work at each node, which makes it faster than GBM.

• XGBM handles missing data automatically, so we can skip this step in data preprocessing.

The initial accuracy of XGBoost on the train set is 98.89%, and on the test set it is 86.78% (with default parameters).
We tried to tune some hyperparameters whose meaning we understand at a basic level (a small sketch using them follows this list):
• n_estimators: the number of trees to be modeled, which is also the number of boosting rounds.

• learning_rate: the step size at each iteration; it scales how much each added tree contributes to the combined prediction and improves the chance of reaching the optimum. This value is between 0 and 1.

• max_depth: the maximum depth of a tree. A high value of max_depth might increase the performance but also the chance of overfitting.

• colsample_bytree: a value between 0 and 1 that represents the fraction of columns to be randomly sampled for each tree.
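A minimal sketch of fitting an XGBoost regressor with these hyperparameters; the specific values are assumptions for illustration, not the tuned values from the project.

```python
from xgboost import XGBRegressor

# Hyperparameter values below are illustrative assumptions
model = XGBRegressor(
    n_estimators=500,
    learning_rate=0.1,
    max_depth=5,
    colsample_bytree=0.8,
    random_state=42,
)
model.fit(X_train_scaled, y_train)

print("Train score:", model.score(X_train_scaled, y_train))
print("Test score:", model.score(X_test_scaled, y_test))
```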


After using GridSearchCV to tune the hyperparameters, we get an accuracy of 90.69% on the test set.

Because of our lack of experience in hyperparameter tuning, we only tried to tune them at a basic level. We hope that you will give us more feedback and share experience on this topic.

2.5. Random Forest


2.5.1. Understanding the basic Decision Tree
The Decision Tree is one of the most powerful methods in machine learning. It has a tree-like, hierarchical structure that includes a root node, branches, internal nodes, and leaf nodes. Outcomes branch from the root node down to terminal nodes, and all possible outcomes are represented in the leaf nodes. To find the tree that best splits the data, the tree is typically trained with the CART algorithm using an evaluation metric such as Gini impurity.
A Decision Tree uses a greedy algorithm to find the optimal splits within a single tree. One of the ways for decision trees to achieve stable accuracy is to form an ensemble via the Random Forest algorithm.

2.5.2. What is Random Forest

As mentioned above, Ensemble Learning includes the bagging method. One of the most popular algorithms in machine learning is Random Forest, which is a bagging technique based on many Decision Trees built on different samples; it aggregates all their votes to find the majority for classification, or averages their values in the case of regression.

2.5.3. Random Forest Algorithm

This algorithm includes several steps (a small sketch follows this list):

• Step 1: Randomly choose m features from the n given features of the input dataset (m < n); the default value is √n if nothing else is specified.

• Step 2: For each chosen feature, find the best split point (computed by a metric such as Gini impurity) at node d.

• Step 3: Continue splitting the node into several child nodes using the best-split method, which by default uses Gini impurity values.

• Step 4: Repeat the above process to build a complete individual Decision Tree.

• Step 5: From all the built Decision Trees, we form a Random Forest model.
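A minimal sketch of training a Random Forest regressor on this task with scikit-learn; the number of trees and the max_features choice are assumptions (tree-based models do not require the scaled features, so the raw split is used here).

```python
from sklearn.ensemble import RandomForestRegressor

# max_features="sqrt" mirrors the default m = sqrt(n) rule described above
rf = RandomForestRegressor(
    n_estimators=200,
    max_features="sqrt",
    random_state=42,
    n_jobs=-1,
)
rf.fit(X_train, y_train)

print("Train score:", rf.score(X_train, y_train))
print("Test score:", rf.score(X_test, y_test))
```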


2.5.4. Feature Importance

The feature importance in the Random Forest method measures how much each feature is used across the trees of the forest. It equals the total impurity decrease contributed by that feature over all decision trees in the forest. Let us look at the feature importances of our Random Forest model:

max_power is still the most important feature; it has a strong impact on the RF model. We can also see that low-cardinality features are not as important as higher-cardinality ones; for example, the year feature is more crucial than the transmission feature.

2.6. Evaluation and Prediction


3. CONCLUSION

REFERENCES
[1] Weisberg, Sanford. Applied Linear Regression. Vol. 528. John Wiley & Sons, 2005.

[2] Ben-Gal, Irad. "Outlier detection." Data Mining and Knowledge Discovery Handbook. Springer, Boston, MA, 2005. 131-146.


[3] Polikar, Robi. "Ensemble learning." Ensemble Machine Learning. Springer, Boston, MA, 2012. 1-34.

[4] Schapire, Robert E. "A brief introduction to boosting." IJCAI. Vol. 99. 1999.

[5] Section 10.13.1 “Relative Importance of Predictor Variables” of the book The El-
ements of Statistical Learning: Data Mining, Inference, and Prediction, page 367.
