
CS 422

September 3, 2020
Machine Learning

❑ Classification
Machine Learning Definition

❑ "Field of study that gives computers the ability to learn without


being explicitly programmed“ (Wikipedia)
❑ Basic case – learn to differentiate between two classes in the
data
Machine Learning Definition

❑ "Field of study that gives computers the ability to learn without


being explicitly programmed“ (Wikipedia)
❑ Is this a picture of a cat? Or not?
Machine Learning Definition

❑ "Field of study that gives computers the ability to learn without


being explicitly programmed“ (Wikipedia)
❑ Is this a spam email? Or not?
Big Picture of Machine Learning Process

❑ Machine learning algorithms differ in how they create the model of the data
Machine Learning

Supervised
❑ There are manually labeled examples of the "Yes"/"No" classes, or more generally, "1"/"-1"
❑ The model is built using those labeled examples
❑ Manual labels are expensive to produce
❑ In general, better performance

Unsupervised
❑ There are no manually labeled examples
❑ Easier to use because no labeled data is required
❑ Usually, less precise results
Supervised Machine Learning
Supervised Classification Algorithms

❑ Decision Trees, Random Forest

❑ Acknowledgment
❑ Used the material from the website
http://www.popsci.com/how-machine-learning-works-interactive
Some intuition about features

❑ First step – define features
❑ Example: elevation for homes in San Francisco vs. New York
Decision tree classifier

❑ A decision tree uses if-then statements to define patterns in data.
❑ For example, if a home's elevation is above some number, then the home is probably in San Francisco.
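A minimal sketch of that rule in Python (the ~240 ft threshold echoes the split point used later in these slides; the function and names are purely illustrative):

```python
def classify_home(elevation_ft, split_point=240):
    """One if-then rule: high-elevation homes are probably in San Francisco."""
    if elevation_ft > split_point:
        return "San Francisco"
    return "New York"

print(classify_home(300))  # San Francisco
print(classify_home(30))   # New York
```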
Splitting the data by features

❑ These statements split the data into two branches based on some value.
❑ That value between the branches is called a split point. Homes to the left of that point get categorized in one way, while those to the right are categorized in another. A split point is the decision tree's version of a boundary.
Trade-offs

❑ Picking a split point has tradeoffs. Our initial split (~240 ft) incorrectly classifies some San Francisco homes as New York ones.
❑ In the accompanying pie chart, the large slice of green on the left shows all the San Francisco homes that are misclassified. These are called false negatives.
❑ However, a split point meant to capture every San Francisco home will include many New York homes as well. These are called false positives.
Best Split

❑ At the best split, the results of each branch should be as homogeneous (or pure) as possible. There are several mathematical methods you can choose between to calculate the best split (see the sketch below).
❑ As we see here, even the best split on a single feature does not fully separate the San Francisco homes from the New York ones.
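One common such method is the Gini index, which these slides later use in find_best_split(). A minimal sketch, assuming a single numeric feature and trying every midpoint between consecutive sorted values as a candidate split point:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_impurity(left, right):
    """Impurity of a split: size-weighted average of the branch impurities."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_split(values, labels):
    """Try every midpoint between consecutive sorted values as a split point."""
    pairs = sorted(zip(values, labels))
    best_thresh, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        thresh = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x < thresh]
        right = [y for x, y in pairs if x >= thresh]
        if not left or not right:
            continue
        score = split_impurity(left, right)
        if score < best_score:
            best_thresh, best_score = thresh, score
    return best_thresh, best_score

# Toy elevations (ft) labeled by city
elev = [10, 30, 60, 200, 300, 500]
city = ["NY", "NY", "NY", "SF", "SF", "SF"]
print(best_split(elev, city))  # (130.0, 0.0) -- a perfectly pure split
```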
Recursion!

❑ To add another split point, the algorithm repeats the process above on the subsets of data. This repetition is called recursion, and it is a concept that appears frequently in training models.
Growing a Decision Tree
Machine Learning

❑ Classification – more formal definition


Classification

❑ Classification is the task of assigning objects to one of several predefined categories

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Illustrating Classification Task

❑ Induction: a learning algorithm is applied to the training set to learn a model
❑ Deduction: the model is applied to the test set to predict the unknown class labels

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Training Set → Learning Algorithm (Induction) → Model

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Test Set + Model → Apply Model (Deduction) → Predicted labels
Classification Function

❑ Definition 4.1: Classification is the task of learning a target function f that maps each attribute set x to one of the predefined class labels y.

❑ Set of data points X = {x1, x2, ..., xn}
❑ Set of functions H = {f1, f2, ...}
❑ Set of labels Y = {y1, y2, ...}

f: X -> Y
Classification Function

❑ Definition 4.1: Classification is the task of learning a target function f that maps each attribute set x to one of the predefined class labels y.

❑ Set of data points X = {x1, x2, ..., xn}
❑ Set of functions H = {f1, f2, ...}
❑ Set of labels Y = {y1, y2, ...}

f: X -> Y

❑ Example: a linear function f(x) = a*x + b
❑ If x in R^N, then a in R^N and f(x) = ∑i ai*xi + b
❑ y = 1 if f(x) > 0
❑ y = -1 if f(x) <= 0
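A minimal sketch of this thresholded linear classifier (the weights a and b here are hand-picked for illustration; a learning algorithm would fit them to data):

```python
import numpy as np

def f(x, a, b):
    """Linear scoring function f(x) = sum_i a_i * x_i + b."""
    return np.dot(a, x) + b

def predict(x, a, b):
    """Threshold the score: y = 1 if f(x) > 0, else y = -1."""
    return 1 if f(x, a, b) > 0 else -1

a = np.array([1.0, -2.0])   # example weights (chosen by hand here;
b = 0.5                     # a learning algorithm would fit them to data)
print(predict(np.array([3.0, 1.0]), a, b))   #  1, since 3 - 2 + 0.5 > 0
print(predict(np.array([0.0, 1.0]), a, b))   # -1, since -2 + 0.5 <= 0
```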
Training and Test Sets of Data

❑ We need data to build the model


❑ Set of data points X = {x1, x2, ….., xn}
❑ Use some of the data to train the model
❑ Use the rest to evaluate the quality of the model
Training and Test Sets of Data

❑ Training Data Set
❑ The data we use to train the model
❑ Validation Data Set
❑ The data we use during training to tune hyperparameters
❑ Test Data Set
❑ The data we use to evaluate the model
❑ Test data CANNOT be used during training!
Training and Test Sets of Data Creation

❑ Train/Test Split
❑ Designate % as training data
❑ Designate % as validation if needed
❑ Rest is test data

❑ Cross Validation
❑ N-fold cross validation
❑ (N-1)/N for training
❑ 1/Nth for testing
Training and Test Sets of Data Creation Cont.

❑ Holdout
❑ Reserve 2/3 for training and 1/3 for testing
❑ Random subsampling
❑ Repeated holdout
❑ Cross validation
❑ Partition data into k disjoint subsets
❑ k-fold: train on k-1 partitions, test on the remaining one
❑ Leave-one-out: k=n
❑ Bootstrap
❑ Sampling with replacement
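A short sketch of the holdout and k-fold strategies with scikit-learn (the Iris data is used purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, KFold

X, y = load_iris(return_X_y=True)

# Holdout: reserve 2/3 for training and 1/3 for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42)

# N-fold cross validation: each fold trains on (N-1)/N of the data
kf = KFold(n_splits=10, shuffle=True, random_state=42)
for train_idx, test_idx in kf.split(X):
    X_tr, X_te = X[train_idx], X[test_idx]  # fit on X_tr, evaluate on X_te
```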
Classification Error

❑ Classification error on the training set

Model applied to the training records:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Apply Model to Training Data

❑ Record to classify: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Decision tree:
Refund?
├─ Yes → NO
└─ No → Marital Status?
        ├─ Single, Divorced → Taxable Income?
        │       ├─ < 80K → NO
        │       └─ > 80K → YES
        └─ Married → NO

❑ Walking the record down the tree (Refund = No, then Married), we assign Cheat = "No": f(x) = y
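Written out in code, this tree is just nested if-then rules. A sketch (attribute values passed as plain strings, income in thousands):

```python
def predict_cheat(refund, marital_status, taxable_income):
    """The decision tree above, written out as nested if-then rules."""
    if refund == "Yes":
        return "No"
    # Refund == "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split on taxable income at 80K
    return "Yes" if taxable_income > 80 else "No"

# Record: Refund = No, Married, 80K  ->  Cheat = "No"
print(predict_cheat("No", "Married", 80))
```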
Classification Error

❑ Classification error on the training set

Tid  Refund  Marital Status  Taxable Income  Cheat    Model prediction
1    Yes     Single          125K            No       f(x1) = "No"
2    No      Married         100K            No       f(x2) = "No"
3    No      Single          70K             No       f(x3) = "No"
4    Yes     Married         120K            No       f(x4) = "No"
5    No      Divorced        95K             Yes      f(x5) = "Yes"
6    No      Married         60K             No       ...
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

❑ The classification error on the training set is zero.
Classification Function

❑ How to compute the best function?
❑ Finding the optimal function may be computationally infeasible
❑ E.g., exponential number of possible decision trees
Classification Function

❑ How to compute the best function?
❑ Finding the optimal function may be computationally infeasible
❑ E.g., exponential number of possible decision trees
❑ Size of the hypothesis space H
❑ E.g., how many distinct decision trees with n Boolean attributes?
   = number of Boolean truth tables = 2^(2^n)
   Truth table row → path to leaf: 2^n paths
   For each row we can choose T or F: 2^(2^n) trees
   E.g., with 6 Boolean attributes, there are 2^(2^6) = 18,446,744,073,709,551,616 trees
   n = 2: 2^2 = 4 rows. For each row we can choose T or F: 2^4 = 16 functions.
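A quick sanity check of the counting argument:

```python
n = 6
print(2 ** (2 ** n))   # 18446744073709551616 distinct Boolean functions

n = 2
print(2 ** n)          # 4 truth-table rows
print(2 ** (2 ** n))   # 16 distinct functions
```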
❑ Decision Tree Classifiers
❑ Acknowledgment: based on the notes by Dr. Faisal Shafait, German Research Center for Artificial Intelligence (DFKI)
Decision Tree Classifiers

❑ How do we automatically learn optimal questions to ask at each node?
❑ How do we minimise the expected number of queries?
❑ How do we make learning/estimation efficient?
❑ How do we handle continuous features/distributional outputs?
Decision Tree Induction

❑ Avoids solving the NP-hard problem
❑ Finds a local minimum (not the globally optimal tree)
❑ Uses a greedy, recursive, top-down approach
Decision Tree Induction
❑ Algorithm 4.1 Decision tree induction algorithm
❑ E is the set of training records
❑ F is the set of attributes
❑ TreeGrowth(E, F)
❑ if stopping_cond(E, F) = true then
– leaf = createNode()
– leaf.label = Classify(E)
– return leaf
❑ else
– root = createNode()
– root.test_cond = find_best_split(E, F)
– V = { v | v is a possible outcome of root.test_cond }
– for each v in V
» Ev = { e | root.test_cond(e) = v and e in E }
» child = TreeGrowth(Ev, F)
» add child as a descendant of root and label the edge (root → child) as v
– end for
❑ end if
Decision Tree Induction

❑ Algorithm 4.1 Decision tree induction algorithm – helper routines (pseudocode as on the previous slide)
❑ createNode() extends the tree with a new node. The node has either a test condition (node.test_cond) or a class label (node.label).
❑ find_best_split() determines which attribute to select as the test condition (e.g., by Gini index)
❑ Classify() assigns the class label to a leaf node
❑ stopping_cond() is used to terminate the node creation
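A minimal Python sketch of Algorithm 4.1 with the helper routines stubbed out as described above (find_best_split is a placeholder here; a real implementation would score each attribute, e.g., by the Gini index):

```python
from collections import Counter

def tree_growth(E, F):
    """Recursive sketch of Algorithm 4.1.
    E: list of (attrs_dict, label) records; F: attribute names still available."""
    if stopping_cond(E, F):
        return {"leaf": True, "label": classify(E)}
    attr = find_best_split(E, F)
    root = {"leaf": False, "test": attr, "children": {}}
    values = {rec[0][attr] for rec in E}      # possible outcomes of the test
    for v in values:
        Ev = [rec for rec in E if rec[0][attr] == v]
        root["children"][v] = tree_growth(Ev, [a for a in F if a != attr])
    return root

def stopping_cond(E, F):
    """Stop when the records are pure or no attributes remain."""
    labels = {label for _, label in E}
    return len(labels) <= 1 or not F

def classify(E):
    """Leaf label = majority class of the records at the node."""
    return Counter(label for _, label in E).most_common(1)[0][0]

def find_best_split(E, F):
    """Placeholder: a real implementation would score each attribute
    (e.g., by Gini index) and return the best; here we take the first."""
    return F[0]
```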
Hunt’s Algorithm

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Hunt's Algorithm

❑ Let Dt be the set of training records that reach a node t:
– If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
– If Dt is an empty set, then t is a leaf node labeled by the default class, yd
– If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.

[Figure: the tree grows in stages – a single leaf (Don't Cheat), then a split on Refund (Yes → Don't Cheat), then a split on Marital Status (Married → Don't Cheat), and finally a split on Taxable Income (< 80K → Don't Cheat, >= 80K → Cheat).]
Drawbacks of Hunt’s Algorithm

❑ Works if every combination of attribute values is present in the training data and each combination has a unique class label
❑ Additional conditions are needed:
❑ Some child nodes are empty. This can happen if none of the training records have the combination of values associated with these nodes. The node is declared a leaf node with the majority label of its parent's records.
❑ All records at a node have the same attribute values but the class labels differ. Assign the majority class label.
Characteristics of Decision Tree Induction
❑ NP-hard problem to find the optimal decision tree
❑ Computationally inexpensive tree induction procedures exist
❑ Decision trees, especially smaller ones, are easy to interpret
❑ Expressive representation of discrete-valued functions. Do not generalize well to certain types of Boolean problems
❑ Not expressive enough for modeling continuous variables, particularly when the test condition involves only a single attribute at a time
❑ Robust to noise, especially combined with techniques for avoiding overfitting
❑ Redundant attributes do not adversely affect the accuracy of decision trees, though the tree can become unnecessarily complex
❑ Most decision tree induction algorithms use the recursive top-down approach; as the tree grows, the number of records goes down. The number of records at a node can become too small (data fragmentation)
❑ Duplicated subtrees (the same subtree can appear in different branches)
Characteristics of Decision Tree Induction

❑ Test condition involves one attribute at a time, so the decision boundary consists of axis-parallel segments.

[Figure: a scatter of two classes in the unit square with the induced rectangular decision regions.]

x < 0.43?
├─ Yes → y < 0.47?
│        ├─ Yes → (4 : 0)
│        └─ No  → (0 : 4)
└─ No  → y < 0.33?
         ├─ Yes → (0 : 3)
         └─ No  → (4 : 0)
Oblique Decision Trees

❑ Test condition may involve multiple attributes, e.g. x + y < 1
❑ More expressive representation
❑ Finding the optimal test condition is computationally expensive

[Figure: a diagonal boundary x + y = 1 separating Class = + from Class = –.]
Model Evaluation

❑ How do we know if the greedy approach is good?
❑ How do we evaluate a classification model, e.g., a decision tree?
Metrics for Performance Evaluation

❑ Focus on the predictive capability of a model
❑ Rather than how fast it classifies, how long it takes to build, scalability, etc.
❑ Confusion Matrix:

                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL   Class=Yes   a (TP)      b (FN)
CLASS    Class=No    c (FP)      d (TN)

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
Metrics for Performance Evaluation

                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL   Class=Yes   a (TP)      b (FN)
CLASS    Class=No    c (FP)      d (TN)

❑ Most widely used metric:

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Limitation of Accuracy

❑ Consider a 2-class problem


❑ Number of Class 0 examples = 9990
❑ Number of Class 1 examples = 10

❑ If model predicts everything to be class 0, accuracy is


9990/10000 = 99.9 %
❑ Accuracy is misleading because model does not detect any
class 1 example
Cost Matrix

                     PREDICTED CLASS
C(i|j)               Class=Yes    Class=No
ACTUAL   Class=Yes   C(Yes|Yes)   C(No|Yes)
CLASS    Class=No    C(Yes|No)    C(No|No)

❑ C(i|j): cost of misclassifying a class j example as class i
Computing Cost of Classification

Cost Matrix:
C(i|j)     PREDICTED +   PREDICTED -
ACTUAL +   -1            100
ACTUAL -   1             0

Model M1 (Accuracy = 80%, Cost = 3910):
           PREDICTED +   PREDICTED -
ACTUAL +   150           40
ACTUAL -   60            250

Model M2 (Accuracy = 90%, Cost = 4255):
           PREDICTED +   PREDICTED -
ACTUAL +   250           45
ACTUAL -   5             200
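A short sketch that reproduces these numbers from the count and cost matrices (rows are actual classes, columns are predicted):

```python
import numpy as np

# Rows: actual (+, -); columns: predicted (+, -)
cost = np.array([[-1, 100],
                 [ 1,   0]])

m1 = np.array([[150, 40],
               [ 60, 250]])
m2 = np.array([[250, 45],
               [  5, 200]])

for name, m in [("M1", m1), ("M2", m2)]:
    acc = np.trace(m) / m.sum()            # (a + d) / N
    total_cost = (m * cost).sum()          # cost-weighted counts, summed
    print(f"{name}: accuracy = {acc:.0%}, cost = {total_cost}")
# M1: accuracy = 80%, cost = 3910
# M2: accuracy = 90%, cost = 4255
```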
Cost vs Accuracy

Count matrix:
                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL   Class=Yes   a           b
CLASS    Class=No    c           d

Cost matrix:
                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL   Class=Yes   p           q
CLASS    Class=No    q           p

Accuracy is proportional to cost if
1. C(Yes|No) = C(No|Yes) = q
2. C(Yes|Yes) = C(No|No) = p

N = a + b + c + d
Accuracy = (a + d) / N
Cost = p(a + d) + q(b + c)
     = p(a + d) + q(N – a – d)
     = qN – (q – p)(a + d)
     = N [q – (q – p) × Accuracy]
Model Complexity

[Figure: two decision trees grown on the same 24 training records, with the class counts (C1 : C2) shown at each leaf. The smaller tree has 4 leaves with counts (3:0), (5:2), (1:4), (3:6) and training error e(T) = 6/24 = 0.25; the larger tree has 7 leaves with counts (3:0), (3:1), (2:1), (0:2), (1:2), (3:1), (0:5) and e(T) = 4/24 = 0.167.]
Model Complexity

❑ Pessimistic Error Estimate
❑ Add a penalty Ω(t) for each leaf node
❑ n(t) is the number of training records at node t
❑ e(t) is the classification error of node t
❑ k is the number of leaf nodes

error'(T) = (e(T) + Ω(T)) / N(T)
          = [∑t=1..k (e(t) + Ω(t))] / [∑t=1..k n(t)]
Model Complexity
error'(T) = (e(T) + Ω(T)) / N(T) = [∑t=1..k (e(t) + Ω(t))] / [∑t=1..k n(t)]

❑ Smaller tree (4 leaves, e(T) = 6/24):
   Ω(t) = 0.5 → error'(T) = (6 + 4 × 0.5)/24 ≈ 0.33
   Ω(t) = 1   → error'(T) = (6 + 4 × 1)/24 ≈ 0.417
❑ Larger tree (7 leaves, e(T) = 4/24):
   Ω(t) = 0.5 → error'(T) = (4 + 7 × 0.5)/24 ≈ 0.31
   Ω(t) = 1   → error'(T) = (4 + 7 × 1)/24 ≈ 0.458
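A small helper that reproduces the four pessimistic estimates, assuming the leaf and error counts read off the figure above:

```python
def pessimistic_error(errors, leaves, n, omega):
    """error'(T) = (e(T) + k * omega) / N(T) for a tree with k leaves."""
    return (errors + leaves * omega) / n

# Larger tree: 7 leaves, 4 training errors on 24 records
print(round(pessimistic_error(4, 7, 24, 0.5), 2))   # 0.31
print(round(pessimistic_error(4, 7, 24, 1.0), 3))   # 0.458
# Smaller tree: 4 leaves, 6 training errors
print(round(pessimistic_error(6, 4, 24, 0.5), 2))   # 0.33
print(round(pessimistic_error(6, 4, 24, 1.0), 3))   # 0.417
```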
Cost-Sensitive Measures

Precision (p) = a / (a + c)
Recall (r) = a / (a + b)
F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)

❑ Precision is biased towards C(Yes|Yes) & C(Yes|No)
❑ Recall is biased towards C(Yes|Yes) & C(No|Yes)
❑ F-measure is the harmonic mean of precision and recall, biased towards all except C(No|No)
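A direct translation of the formulas (the confusion-matrix counts are hypothetical):

```python
def precision_recall_f1(a, b, c):
    """a = TP, b = FN, c = FP (d = TN is not used by these measures)."""
    p = a / (a + c)
    r = a / (a + b)
    f = 2 * r * p / (r + p)          # equivalently 2a / (2a + b + c)
    return p, r, f

p, r, f = precision_recall_f1(a=40, b=10, c=10)
print(p, r, f)   # 0.8 0.8 0.8
```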
Cost-Sensitive Measures, Multiclass

https://towardsdatascience.com/multi-class-metrics-made-simple-part-ii-the-f1-score-ebe8b2c2ca1

Cost-Sensitive Measures, Multiclass Cont.
Macro vs Micro Measures

❑ Data set with a 90%-10% class distribution
❑ A classifier can achieve 90% accuracy by assigning the majority class label to everything
❑ 90% micro-averaged accuracy
❑ Macro-averaged accuracy is 50%
❑ Macro-averaged measures are insensitive to class imbalance and treat all classes as equal
Methods for Performance Evaluation

❑ How to obtain a reliable estimate of performance?
❑ Performance of a model may depend on other factors besides the learning algorithm:
❑ Class distribution
❑ Cost of misclassification
❑ Size of training and test sets
Learning Curve

❑ A learning curve shows how accuracy changes with varying sample size
❑ Requires a sampling schedule for creating the learning curve:
❑ Arithmetic sampling (Langley et al.)
❑ Geometric sampling (Provost et al.)
❑ Effect of small sample size:
– Bias in the estimate
– Variance of the estimate
Generalization Error

❑ Is the training error the best measure of the goodness of the model?

[Figure: the learned decision regions together with points not seen during training.]
Generalization Error

❑ Error on the actual whole data according to its natural distribution
❑ The training set is a subset of the whole data
❑ Expected value of the error on the whole data vs. the actual error on the training set
Estimating Generalization Errors

❑ Re-substitution errors: error on training set ( e(t) )
❑ Generalization errors: error on test set ( e'(t) )
❑ Methods for estimating generalization errors:
❑ Optimistic approach: e'(t) = e(t)
❑ Pessimistic approach:
– For each leaf node: e'(t) = e(t) + 0.5
– Total errors: e'(T) = e(T) + N × 0.5 (N: number of leaf nodes)
– For a tree with 30 leaf nodes and 10 errors on training (out of 1000 instances):
  Training error = 10/1000 = 1%
  Generalization error = (10 + 30 × 0.5)/1000 = 2.5%
❑ Reduced error pruning (REP):
– Uses a validation data set to estimate the generalization error
– Needs new ways for estimating errors
Practical Issues of Classification

❑ Underfitting and Overfitting

❑ Missing Values

❑ Costs of Classification
Underfitting and Overfitting

[Figure: training and test error vs. model complexity.]

❑ Underfitting: when the model is too simple, both training and test errors are large
❑ Overfitting: when the model is too complex, the training error keeps decreasing while the test error starts to grow
Overfitting due to Noise

The decision boundary is distorted by a noise point
Overfitting due to Insufficient Examples

❑ Lack of data points in the lower half of the diagram makes it difficult to correctly predict the class labels of that region
❑ An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task
HW Assignment Report Tips

❑ Report content
❑ What do you see
❑ Why is it the case
❑ Is it important
❑ Does it help to understand the problem you are working on
❑ Does it help to understand the results that you get with
your approach
❑ What did you learn
❑ What do you want others to learn
❑ Analysis and Conclusions are the most important
parts of your report
HW Assignment Report Tips

❑ Try to make your report short and informative


❑ Long != Informative
❑ Don’t repeat definitions, give definitions once at the beginning of
the report
❑ Don't repeat the same sentences with different numbers
❑ E.g., "The performance of the decision stump is X, the performance of ... is Y"
❑ Results like that are best represented in a table
❑ Don’t write a manual for using a tool, describe only your steps
that matter for the analysis and conclusion
❑ Always write a conclusion
❑ What did you learn
❑ What were the most interesting results
❑ What do you want the others to learn after reading your report
Results Table
❑ Grading policy
❑ Late submission policy
❑ No regrading
❑ Write everything in YOUR OWN WORDS
❑Explain each step of how you got to the answer
– The answers like “Yes”, “No”, “42” without any
explanation will result in 0 points
❑Write simple explanations
– Very precise and short explanations
– Make it easier for the grader to understand your explanation
❑Provide details for all your steps
❑In other words show that you understand the
problem ☺
Iris Dataset
Classifiers

❑ Decision Stump:
❑ A model consisting of a one-level decision tree.
❑ Only one internal node is immediately connected to the
terminal nodes.
❑ Predicts based on a single input feature.
❑ For continuous features a threshold feature value is selected
to split the attribute.

❑ SimpleCart:
❑ Can produce a multi-level decision tree.
❑ Only binary splits on attributes.
Iris Dataset

❑ 4 attributes, 1 class attribute, 3 classes.
❑ PetalLength and PetalWidth separate the classes well.
DecisionStump (Iris)

❑ Size of tree = 4
❑ No. of leaf nodes = 3
❑ Accuracy = 66.66%
❑ 10-fold cross-validation used.
❑ In each cross-validation iteration:
❑ Size of training set = 135 records Test set = 15 records
❑ Model uses only PetalLength for classification.
❑ Petal length split on threshold value 2.45
❑ No record classified as Iris-virginica.
❑ Relatively poor performance in terms of accuracy.
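The slides use Weka's DecisionStump; a rough scikit-learn equivalent (an assumption, not the tool used in class) is a depth-1 decision tree, which shows the same ≈2/3 accuracy under 10-fold cross-validation, since a single split can separate at most two of the three classes:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# A depth-1 tree is a decision stump: one split on one feature
stump = DecisionTreeClassifier(max_depth=1)
scores = cross_val_score(stump, X, y, cv=10)
print(scores.mean())   # roughly 2/3: one split cannot separate 3 classes
```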
❑ Random forest classifier
Ensemble Methods

❑ Construct a set of classifiers from the training data
❑ Predict the class label of previously unseen records by aggregating the predictions made by multiple classifiers
General Idea

Original training data D
Step 1: Create multiple data sets D1, D2, ..., Dt-1, Dt
Step 2: Build multiple classifiers C1, C2, ..., Ct-1, Ct
Step 3: Combine the classifiers into C*
Why does it work?

❑ Suppose there are 25 base classifiers
❑ Each classifier has error rate ε = 0.35
❑ Assume the classifiers are independent
❑ Probability that the ensemble classifier makes a wrong prediction (a majority of at least 13 base classifiers errs):

P(wrong) = ∑i=13..25 C(25, i) ε^i (1 − ε)^(25−i) = 0.06
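The same computation in Python:

```python
from math import comb

eps, n = 0.35, 25
# The ensemble errs when a majority (13 or more) of the 25 classifiers err
p_wrong = sum(comb(n, i) * eps**i * (1 - eps)**(n - i)
              for i in range(13, n + 1))
print(round(p_wrong, 2))   # 0.06
```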

❑ Random forest classifier
Random Forest

❑ Train a collection of trees (an ensemble method)
❑ Averages over (diverse) classification trees (a forest)
❑ For each tree, draw L samples of the original data
❑ At each node, randomly sample P queries (attributes) and choose the best among them
Random Forest

❑ Aggregate across trees (majority vote or average ⇒ mixture model)
❑ Avoids over-fitting and is computationally efficient
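A minimal scikit-learn sketch of this recipe (hyperparameter values are illustrative, not the settings used in class):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Each tree sees a bootstrap sample of the data; each split considers a
# random subset of features; predictions are aggregated by majority vote.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=42)
print(cross_val_score(forest, X, y, cv=10).mean())   # typically ~0.95 on Iris
```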
Random Forests

❑ Random forests are a very popular tool for classification, e.g. in computer vision
❑ Based on decision trees: classifiers constructed greedily using the conditional entropy
❑ The extension hinges on two ideas:
❑ Building an ensemble of trees by training on subsets of the data
❑ Considering a reduced number of possible queries (attributes) at each node