Decision Trees
Motivation:
The classification techniques most commonly used by analysts are Logistic Regression, Decision Trees and SVM. Decision Trees prove very useful because of the following limitations of Logistic Regression:
1. It doesn’t perform well when the feature space is too large
2. It doesn’t handle a large number of variables/features well
3. It can only learn linear decision boundaries
We will work through an implementation of Decision Trees using the Banknote
dataset. You can find the data at
(https://archive.ics.uci.edu/ml/datasets/banknote+authentication)
The data was extracted from specimens of forged and genuine banknotes. The data
columns describe various continuous properties of the scanned images. The aim of the
problem set is to develop a classifier which, given the properties of a note, classifies
it as genuine or forged.
What are decision trees?
A decision tree is a flowchart-like model that classifies an example by applying a
sequence of attribute tests, starting at a root node and following branches down to a
leaf that holds the prediction. Decision trees are popular because the final model is
easy for practitioners and domain experts alike to understand. The final decision tree
can explain exactly why a specific prediction was made, making it very attractive for
operational use.
Representation
Given below is an example of a decision tree classifying whether a day is suitable for
playing tennis.
A decision tree consists of three main parts: root node, leaf nodes and branches. It
starts with a single node called the “root node”, which then branches into “child nodes”.
At each node, a specific attribute is tested and the training data is partitioned
accordingly.
In the above example, the root node tests the attribute “Outlook” and splits into 3
branches according to the 3 values taken by the attribute: “Sunny”, “Overcast” and
“Rain”.
Regression vs Classification
Classification trees are used when the target variable is categorical in nature, whereas
regression trees are used when the target variable is continuous or numerical in
nature.
In linear regression, a single formula is used for prediction across the entire training
data set. But when there are many features, they can interact in non-linear ways
and modelling them with a single formula becomes difficult; hence regression trees are
used. Classification trees work in the same manner, only the target variable is
categorical in nature.
Building a Model:
Choosing a split criterion to start forming child nodes is the first step in building a
decision tree. There are many indices and measures available to test the homogeneity of
a sample after a split; some of the popular ones are described below.
Popular cost functions:
1. Information gain:
Information gain measures the reduction in entropy achieved by classifying the training
data on the basis of an attribute. The higher the information gain, the better the
attribute is at classifying the data.
Mathematically,
Entropy(S) = − p1 log2 p1 − p2 log2 p2 − ... − pn log2 pn
Entropy(S | A) = Σ_c P(c) × Entropy(c)
Gain(S, A) = Entropy(S) − Entropy(S | A)
where S is the training set, A is the attribute, and Entropy(S | A) is the weighted
entropy of the child nodes c produced by splitting on attribute A.
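The definitions above can be turned into a short sketch. The split below is the classic play-tennis example (9 “yes” / 5 “no” days split on Outlook); the function names are our own choices, not from the original problem set.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum over classes of p * log2(p)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Gain(S, A) = Entropy(S) - weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# The play-tennis split on "Outlook" (9 yes / 5 no overall)
parent = ["yes"] * 9 + ["no"] * 5
sunny = ["yes"] * 2 + ["no"] * 3
overcast = ["yes"] * 4
rain = ["yes"] * 3 + ["no"] * 2
gain = information_gain(parent, [sunny, overcast, rain])  # ~0.247 bits
```

Note that the pure “Overcast” child contributes zero entropy, which is exactly why the Outlook split scores well.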
2. Gini index:
For classification, the Gini index is used, which provides an indication of how
“pure” the leaf nodes are (how mixed the training data assigned to each node is).
G = Σ_k pk × (1 − pk)
where G is the Gini index over all classes and pk is the proportion of training instances
with class k in the region of interest.
For a binary classification problem, this can be re-written as:
G = 2 × p1 × p2
or
G = 1 − (p1² + p2²)
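A minimal implementation of this formula, with the two binary edge cases worked out (the function name is our own):

```python
def gini(labels):
    """G = 1 - sum(p_k^2) over the classes in a node; 0 means a pure node."""
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

pure = gini([1, 1, 1, 1])    # 0.0 -> perfectly pure node
mixed = gini([0, 1, 0, 1])   # 0.5 -> worst case for a binary split
skewed = gini([0, 0, 0, 1])  # 2 * 0.75 * 0.25 = 0.375, matching G = 2*p1*p2
```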
3. Chi-square test:
Chi-square is another test, used to determine the statistical significance of the
difference between the parent node and the child nodes after the split.
The higher the value of chi-square, the higher the statistical significance of the
difference between the parent node and the child nodes, and thus the better the
attribute is for classification.
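As a sketch of this idea, the standard chi-square statistic can be computed over a contingency table whose rows are the child nodes and whose columns are class counts; the function and the example counts here are our own illustration, not from the original problem set.

```python
def chi_square(table):
    """Chi-square statistic: sum of (observed - expected)^2 / expected,
    where expected counts assume the split is independent of the class."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# A split that separates the classes fairly well scores higher...
good_split = chi_square([[8, 2], [3, 7]])   # ~5.05
# ...than one that leaves both children as mixed as the parent
poor_split = chi_square([[5, 5], [6, 6]])   # 0.0
```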
Evaluating the performance of Decision trees:
Confusion Matrix: The performance of the decision tree over the test data set can be
summarised in a confusion matrix.
Each test example will fall into one of four categories:
1. True Positives (TP): positive instances correctly classified
2. True Negatives (TN): negative instances correctly classified
3. False Positives (FP): negative instances incorrectly classified as positive
4. False Negatives (FN): positive instances incorrectly classified as negative
Accuracy can be calculated from the table as the ratio of correct predictions to total
predictions.
However, accuracy alone is not very reliable when the number of examples belonging
to each class differs greatly.
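The four counts and the accuracy ratio can be sketched directly (the function name and the toy labels are our own):

```python
def confusion_matrix(y_true, y_pred):
    """Counts (TP, FP, FN, TN) for binary labels where 1 = positive."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)  # correct predictions / total predictions
```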
Decision trees are prone to overfitting: if they are grown too deep, they lose their
generalisation capability.
Tree pruning:
One remedy for overfitting is pruning: branches that add little predictive power are
removed after the tree is grown, trading a small increase in training error for better
generalisation on unseen data.
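As one sketch of pruning, scikit-learn supports cost-complexity pruning via the `ccp_alpha` parameter of `DecisionTreeClassifier`; the iris dataset and the alpha value here are our own illustrative choices, not from the original problem set.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Grow a full, unpruned tree first
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# Cost-complexity pruning: a larger ccp_alpha removes more branches,
# yielding a smaller tree that generalises better
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)
```

`full.cost_complexity_pruning_path(X, y)` can be used to list the candidate alpha values, which are typically chosen by cross-validation.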
Random forests:
As discussed earlier, when working with large sets of data, decision trees often run into
the problem of overfitting. Hence an ensemble learning technique called Random
Forests is used.
To classify a new object based on its attributes, each tree gives a classification, and
these are then aggregated into one result: the forest chooses the class with the most
votes from all the trees, and in the case of regression it takes the average of the
outputs of the different trees.
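A minimal sketch of this voting ensemble with scikit-learn's `RandomForestClassifier`; the synthetic data is a stand-in of our own, not the banknote set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a small binary classification problem
X, y = make_classification(n_samples=500, n_features=4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# 100 trees, each fit on a bootstrap sample of the training data;
# prediction is a majority vote across the trees
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_tr, y_tr)
accuracy = forest.score(X_te, y_te)
```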
Random forests are more popular than single decision trees because they can reduce
overfitting without introducing much bias.
Despite being more accurate, random forest algorithms are computationally
expensive and harder to implement.
Implementation
We’ll break down our implementation of the banknote case study into the following four
steps:
● Data Exploration and Pre-processing
● Data Splitting
● Building the Model
● Making Predictions
Data was extracted from images taken from genuine and forged banknote-like
specimens. For digitization, an industrial camera usually used for print inspection was
used. The final images have 400 x 400 pixels. Due to the object lens and the distance
to the investigated object, grayscale pictures with a resolution of about 660 dpi were
obtained. A Wavelet Transform tool was used to extract features from the images.
The variables and their respective descriptions are as follows:
● Var: variance of the Wavelet Transformed image (continuous)
● Skewness: skewness of the Wavelet Transformed image (continuous)
● Kurtosis: kurtosis of the Wavelet Transformed image (continuous)
● Entropy: entropy of the image (continuous)
● Class: discrete variable denoting the class to which the banknote belongs
For data exploration we start with multivariate analysis to get an idea of how the
decision boundary might look.
A rough idea of the decision boundary can be gathered from the above visualization,
where the red dots represent class 0 and the blue dots represent class 1.
We’ll use two splitting criteria and then compare which gives the better model. In each
case we need to split the data into train and test sets, as done below. (For now we
have limited the maximum tree depth; we will discuss the reasoning behind this later.)
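Since the original code listing is not reproduced here, the split and the two trees might be sketched as follows with scikit-learn; the synthetic data is a stand-in of our own for the four banknote features, and the depth cap of 4 is an assumed value.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the four banknote features (Var, Skewness, Kurtosis, Entropy)
X, y = make_classification(n_samples=1372, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Two splitting criteria; max_depth is capped to limit overfitting
tree_gini = DecisionTreeClassifier(criterion="gini", max_depth=4,
                                   random_state=0).fit(X_train, y_train)
tree_entropy = DecisionTreeClassifier(criterion="entropy", max_depth=4,
                                      random_state=0).fit(X_train, y_train)
```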
We can visualize our decision trees using the code below.
You will require graphviz for this operation; it can be easily installed from the
command prompt. This is how the PNG output of our decision tree looks:
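As the original listing is not reproduced here, one way to do this with scikit-learn's `export_graphviz` is sketched below on the iris dataset (our own stand-in for the banknote tree):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# export_graphviz produces DOT source; render it to a PNG with the
# graphviz tools, e.g. `dot -Tpng tree.dot -o tree.png`
dot_source = export_graphviz(tree, out_file=None, filled=True, rounded=True)
```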
Performance of the Model
Sklearn has a module for accuracy metrics; we will fit both models and check their
accuracy.
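A sketch of this comparison using `sklearn.metrics.accuracy_score`; as before, the synthetic data stands in for the banknote features and the depth cap is an assumed value.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1372, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit one tree per splitting criterion and score each on the held-out set
scores = {}
for criterion in ("gini", "entropy"):
    model = DecisionTreeClassifier(criterion=criterion, max_depth=4,
                                   random_state=0).fit(X_train, y_train)
    scores[criterion] = accuracy_score(y_test, model.predict(X_test))
```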
Random Forests
As you can see, the accuracy of both models has improved significantly.
A new dimension to the study of Decision Trees comes from learning about bagging and
boosting. These topics are covered in the advanced modules.