
UNIT-IV: CLASSIFICATION

VENKAT NAIK
Introduction to Classification

• Def1:
– Classification is the task of assigning objects to
one of several predefined categories.
• Def2:
– Classification is the task of mapping an input
attribute set x into its class label y

VENKAT NAIK
Contd..
• A classification model is useful for two purposes
– Descriptive Modeling: it can serve as an
explanatory tool for distinguishing between objects
of different classes.
– Predictive Modeling: it can be used to predict the
class label of unknown records.
NOTE:
The class label must be a discrete attribute. If the target
attribute is continuous, it should be discretized, or else
regression techniques are used.

VENKAT NAIK
Contd..

• Application of Classification & Prediction


– Credit approval
– Target marketing
– Medical diagnosis
– Treatment effectiveness analysis
– Fraud Detection
– Performance Prediction
– Molecular biology
– Software cost/Defect estimation
– Software Reliability prediction

VENKAT NAIK
General Approach to Solving Classification Problem

VENKAT NAIK
Contd..
• A Classification Technique (or Classifier) is a systematic approach to building
“Classification Models” from an input dataset.
• Ex:
– Decision Tree classifier
– Rule based classifier
– Neural networks
– Support Vector machines
– Naïve bayes classifiers etc…..
• Each technique employs a “Learning algorithm”
to identify a model that best fits the relationship between the attribute set
and class label of the input data.

• The model generated by a learning algorithm should both fit the input data
well and correctly predict the class labels of records it has never seen
before.

• Therefore, a key objective of the learning algorithm is to build models with
good generalization capability, i.e., models that accurately predict the class
labels of previously unseen records.

VENKAT NAIK
Contd..
• Training set:
– Consists of records whose class labels are known
– The training set is used to build a classification model.
• Test set:
– Consists of records with unknown class labels

VENKAT NAIK
Example of a Decision Tree

Training Data (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class label):

Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)

Refund?
  Yes -> NO
  No  -> MarSt?
           Married          -> NO
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES

VENKAT NAIK
Another Example of Decision Tree

Training Data (the same ten records as on the previous slide).

Model: an alternative Decision Tree that fits the same data

MarSt?
  Married          -> NO
  Single, Divorced -> Refund?
                        Yes -> NO
                        No  -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES

NOTE: There could be more than one tree
that fits the same data!

VENKAT NAIK
Apply Model to Test Data

Test Data:

Refund   Marital Status   Taxable Income   Cheat
No       Married          80K              ?

Start from the root of the tree:
  Refund = No      -> follow the "No" branch to MarSt
  MarSt = Married  -> reach the leaf NO
  Assign Cheat = "No"

VENKAT NAIK
Evaluation of Classifiers
• Evaluation of the performance of the classifier is based
on the counts of test records correctly and incorrectly
predicted by the model.
• These counts are tabulated in a table known as “
Confusion Matrix”.
• The confusion matrix is a useful tool for analyzing how
well your classifier can recognize tuples of different
classes.
• TP &TN tell us when the classifier is getting things right.
• FP & FN tell us when the classifier is getting things
wrong.

VENKAT NAIK
Training Dataset Contd..
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
VENKAT NAIK
Contd..
Confusion matrix:

                            Predicted class
                            Yes                     No
Actual class    Yes         True Positive (TP)      False Negative (FN)
                No          False Positive (FP)     True Negative (TN)

Here,
NOTE: Positive tuples are records labeled as Yes; negative tuples are records labeled as No.
TP indicates the number of positive tuples that were correctly predicted as positive.
TN indicates the number of negative tuples that were correctly predicted as negative.
FP: negative tuples that were incorrectly labeled as positive.
Ex: tuples of class buys_computer = no for which the classifier predicted buys_computer = yes.
FN: positive tuples that were incorrectly labeled as negative.
Ex: tuples of class buys_computer = yes for which the classifier predicted buys_computer = no.

VENKAT NAIK
Contd..

 Performance of a model can be evaluated in two ways:

1. Accuracy = Number of correct predictions / Total number of predictions
   (or)
   Accuracy = (TP + TN) / (TP + TN + FP + FN)

2. Error rate = Number of wrong predictions / Total number of predictions
   (or)
   Error rate = (FP + FN) / (TP + TN + FP + FN)
VENKAT NAIK
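The two measures above can be computed directly from the confusion-matrix counts. Below is a minimal Python sketch (illustrative, not part of the slides); the function names accuracy and error_rate and the example counts are assumptions made for the demonstration.

def accuracy(tp, tn, fp, fn):
    # Accuracy = (TP + TN) / (TP + TN + FP + FN)
    return (tp + tn) / (tp + tn + fp + fn)

def error_rate(tp, tn, fp, fn):
    # Error rate = (FP + FN) / (TP + TN + FP + FN) = 1 - accuracy
    return (fp + fn) / (tp + tn + fp + fn)

# Hypothetical counts for a test set of 100 records:
tp, tn, fp, fn = 40, 45, 7, 8
print(accuracy(tp, tn, fp, fn))    # 0.85
print(error_rate(tp, tn, fp, fn))  # 0.15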
Contd..

• P is the number of positive tuples

• N is the number of negative tuples

VENKAT NAIK
Contd..

VENKAT NAIK
Classification Techniques
1. Decision Tree
I. Decision Tree construction
II. Methods for expressing attribute test conditions
III. Measures for selecting the best split
IV. Algorithm for decision tree induction
2. Naïve Bayes Classifier
3. Bayesian Belief Networks Classifier
4. K-Nearest Neighbor Classification
I. KNN Algorithm & its Characteristics

VENKAT NAIK
Decision Tree Classification
• Decision tree :
– A flow-chart-like tree structure
– A root node has no incoming edges and zero or more
outgoing edges
– Internal node denotes a test on an attribute and it has
only one incoming edge and two or more outgoing
edges.
– Branch represents an outcome of the test
– Leaf nodes represent class labels or class distribution
– EX: ID3, C4.5 and CART are decision tree based
classification algorithms.
– NOTE: These are supervised learning methods
VENKAT NAIK
Training Dataset Contd..
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
VENKAT NAIK
Output: Construction of Decision Tree
for “buys_computer”

age?
  <= 30   -> student?
               no  -> no
               yes -> yes
  31..40  -> yes
  > 40    -> credit_rating?
               excellent -> no
               fair      -> yes

VENKAT NAIK
Algorithm for Decision Tree Induction
• Tree Construction:
– Tree is constructed in a top-down recursive divide-and-conquer manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized in
advance)
– Samples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
• Conditions for stopping partitioning:
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority
voting is employed for classifying the leaf
– There are no samples left

• Use of decision tree: classifying an unknown sample
– Test the attribute values of the sample against the decision tree
Decision Tree Induction Algorithm Pseudo-code
Algorithm GenDecTree(Sample S, Attlist A)
1. Create a node N.
2. If all samples in S are of the same class C, then label N with C;
   terminate.
3. If A is empty, then label N with the most common class C in S
   (majority voting); terminate.
4. Select a ∈ A with the highest information gain; label N with a.
5. For each value v of a:
   a. Grow a branch from N with the condition a = v;
   b. Let Sv be the subset of samples in S with a = v;
   c. If Sv is empty, then attach a leaf labeled with the most
      common class in S;
   d. Else attach the node generated by GenDecTree(Sv, A − {a}).

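The pseudo-code above can be turned into running code. Below is a minimal Python sketch (illustrative, not part of the slides) of the same top-down, information-gain-based induction for categorical attributes; the names gen_dec_tree, entropy, and info_gain and the tiny two-record dataset are assumptions made for the example. The tree is represented as nested dictionaries, and a leaf is simply a class label.

from collections import Counter
from math import log2

def entropy(labels):
    # Information(D) = -sum(p_j * log2(p_j)) over the class proportions
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def info_gain(records, labels, attr):
    # Gain(attr) = Information(D) - weighted Information of the partitions on attr
    total = len(labels)
    after_split = 0.0
    for v in set(r[attr] for r in records):
        subset = [lbl for r, lbl in zip(records, labels) if r[attr] == v]
        after_split += (len(subset) / total) * entropy(subset)
    return entropy(labels) - after_split

def gen_dec_tree(records, labels, attrs):
    # Step 2: all samples of the same class -> leaf labeled with that class
    if len(set(labels)) == 1:
        return labels[0]
    # Step 3: no attributes left -> leaf labeled with the majority class
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Step 4: select the attribute with the highest information gain
    best = max(attrs, key=lambda a: info_gain(records, labels, a))
    node = {best: {}}
    # Step 5: grow one branch per value of the selected attribute
    for v in set(r[best] for r in records):
        sub = [(r, lbl) for r, lbl in zip(records, labels) if r[best] == v]
        node[best][v] = gen_dec_tree([r for r, _ in sub], [lbl for _, lbl in sub],
                                     [a for a in attrs if a != best])
    return node

# Tiny usage example with two records and class labels 'no'/'yes':
data = [{"student": "no", "credit": "fair"}, {"student": "yes", "credit": "fair"}]
print(gen_dec_tree(data, ["no", "yes"], ["student", "credit"]))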
Characteristics of Decision Tree Induction
• It does not require any prior assumptions about the data distribution
• Finding an optimal decision tree is an NP-hard problem
• Computationally inexpensive
• Small trees are easy to interpret
• Does not work well with binary data
• Robust and avoids overfitting
• Redundant attributes do not affect accuracy
• Top-down, recursive partitioning approach
• Subtrees may be replicated many times (which makes interpretation difficult)
VENKAT NAIK
Methods for Expressing Attribute Test Conditions
• Decision tree induction algorithms must
provide a method for expressing an attribute
test condition and its corresponding outcomes
for different attribute types.
1. Binary Attributes
2. Nominal Attributes
3. Ordinal Attributes
4. Continuous Attributes

VENKAT NAIK
Contd..
1. Binary Attribute:
The test condition for a binary attribute
generates two potential outcomes as shown
in below figure.
Example (binary attribute): Body Temperature
  Warm-blooded | Cold-blooded

(The figure also illustrates binary groupings of the nominal attribute Marital Status:
{Married} vs. {Single, Divorced}, {Single} vs. {Married, Divorced}, and {Divorced} vs. {Single, Married}.)

VENKAT NAIK
Contd..

• Nominal Attributes:
– A nominal attribute can have many values.
– Number of outcomes depends on the number of
distinct values for the corresponding attribute.
– Ex: Marital Status is Nominal attribute.
Marital Status

  single | married | divorced

(a) Multi-way split: one outcome per distinct value.


Contd..
3. Ordinal Attributes:
-Ordinal attributes can produce binary or multiway splits.
- Ordinal attribute values can be grouped as long as the grouping does not violate the order
property of the attribute values.
NOTE: Fig 4.10(c) violates this property because it combines the attribute values Small and
Large into the same partition while Medium and Extra Large are combined into another
partition.

VENKAT NAIK
Contd..
• 4. Continuous Attributes:
– For continuous attributes, the test condition can
be expressed as a comparison test (A < v) or (A ≥ v)
with binary outcomes, or as a range query with
outcomes of the form vi ≤ A < vi+1, for i = 1, …, k.

– The difference between these approaches is
shown in Fig 4.1

VENKAT NAIK
Contd..
• For the binary case, the decision tree algorithm
must consider all possible split positions v, and it
selects the one that produces the best partition.
• For the multiway split, the algorithm must
consider all possible ranges of continuous values.
• After discretization, a new ordinal value will be
assigned to each discretized interval. Adjacent
intervals can also be aggregated into wider
ranges as long as the order property is
preserved.

VENKAT NAIK
Measures for Attribute Selecting Best Split
• Measures are used to determine the best way to split the
records.
• These measures indicate how well a split separates the data
partition [DP] into distinct classes.
• These measures enable us to split the records in the best
possible way, such that every partition is “pure”,
i.e., all records of that partition belong to the same class.
They are also known as “splitting rules”
• Following are the widely used measures for attribute
selection.
1. Information Gain
2. Gain Ratio
3. Gini Index
VENKAT NAIK
Contd..
1. Information Gain:
 If node ‘N’ represents the tuples of data
partition DP, then the attribute with the
maximum information gain is selected as the
splitting attribute for node ‘N’.
 Advantages:
 It requires minimal information to classify tuples.
 It produces the least impurity in the partitions.
 It minimizes the number of tests required to classify a
tuple.
 It results in a simple tree.

VENKAT NAIK
Contd..
 Though we always try to produce a pure partition, a split is
likely to produce impure partitions. The expected amount of
information needed to classify a tuple in DP is given by

Information(DP) = − Σ (j = 1..m) dp_j × log2(dp_j)

where dp_j = probability that a tuple in DP belongs to class Cj.

Information(DP) is also called the “entropy of DP”. It is the amount of information needed
to identify the class of a tuple.

VENKAT NAIK
Contd..
 The information gain is calculated as the difference between the original
information requirement and the information requirement after
partitioning on an attribute A:

Gain(A) = Information(DP) − Information_A(DP)

WHERE

Information_A(DP) = Σ (j = 1..v) ( |DPj| / |DP| ) × Information(DPj)

and v is the number of partitions (outcomes of the test on A).

VENKAT NAIK
Example:

VENKAT NAIK
Contd..
1. Information needed to classify a tuple in D:
   Info(D) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940

2. Next, compute the expected information requirement for each
   attribute. For age:
   Info_age(D) = (5/14)·Info(2,3) + (4/14)·Info(4,0) + (5/14)·Info(3,2)
               = 0.694

VENKAT NAIK
Contd..

3. The gain in information from such a partitioning
   would be
   Gain(age) = Info(D) − Info_age(D) = 0.940 − 0.694 = 0.246

VENKAT NAIK
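The numbers in this example can be reproduced with a short computation. Below is a minimal Python sketch (illustrative, not part of the slides) that recomputes Info(D), Info_age(D), and Gain(age) for the buys_computer training data; the helper name info is an assumption made for the example.

from math import log2

def info(counts):
    # Entropy of a class distribution given as a list of class counts
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# Class distribution in D: 9 "yes" and 5 "no" tuples
info_d = info([9, 5])                                        # ~0.940

# Partitions induced by age:
# <=30 -> (2 yes, 3 no), 31..40 -> (4 yes, 0 no), >40 -> (3 yes, 2 no)
partitions = [[2, 3], [4, 0], [3, 2]]
info_age = sum(sum(p) / 14 * info(p) for p in partitions)    # ~0.694

gain_age = info_d - info_age                                 # ~0.246
print(round(info_d, 3), round(info_age, 3), round(gain_age, 3))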
Contd..

VENKAT NAIK
Contd..
2. Gain Ratio:
 The information gain measure is biased toward tests with
many outcomes. That is, it prefers to select attributes having a large
number of values. For example, consider an attribute that acts as a
unique identifier, such as product ID. A split on product ID would result
in a large number of partitions (as many as there are values), each
one containing just one tuple. Because each partition is pure, the
information required to classify data set D based on this partitioning
would be 0. Therefore, the information gained by
partitioning on this attribute is maximal. Clearly, such a partitioning is
useless for classification.
• C4.5, a successor of ID3, uses an extension to information gain known
as gain ratio, which attempts to overcome this bias. It applies a kind
of normalization to information gain using a “split information” value
defined analogously with Info(D) as

SplitInfo_A(D) = − Σ (j = 1..v) ( |Dj| / |D| ) × log2( |Dj| / |D| )
VENKAT NAIK
Contd..
• This value represents the potential information generated
by splitting the training data set, D, into v partitions,
corresponding to the v outcomes of a test on attribute A.
 Note that, for each outcome, it considers the number of
tuples having that outcome with respect to the total
number of tuples in D. It differs from information gain,
which measures the information with respect to
classification that is acquired based on the same
partitioning.
 The gain ratio is defined as

GainRatio(A) = Gain(A) / SplitInfo_A(D)

 The attribute with the maximum gain ratio is selected as the
splitting attribute
VENKAT NAIK
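As a concrete illustration of the definition above, the following minimal Python sketch (illustrative, not part of the slides) computes Gain, SplitInfo, and GainRatio for the income attribute of the buys_computer data; the class counts per income value are read off the training table, and the helper name info is an assumption made for the example.

from math import log2

def info(counts):
    # Entropy of a class distribution given as a list of class counts
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# income partitions: low -> (3 yes, 1 no), medium -> (4 yes, 2 no), high -> (2 yes, 2 no)
parts = [[3, 1], [4, 2], [2, 2]]
n = 14

gain_income = info([9, 5]) - sum(sum(p) / n * info(p) for p in parts)   # ~0.029
split_info = -sum((sum(p) / n) * log2(sum(p) / n) for p in parts)       # ~1.557
gain_ratio = gain_income / split_info                                   # ~0.019
print(round(gain_income, 3), round(split_info, 3), round(gain_ratio, 3))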
Contd..

VENKAT NAIK
Contd..
3. Gini Index:
• The Gini index is used in CART.
• The Gini index measures the impurity of D, a
data partition or set of training tuples, as

  Gini(D) = 1 − Σ (i = 1..m) p_i²

  where p_i is the probability that a tuple in D belongs to class Ci.

VENKAT NAIK
Contd..
• The Gini index considers a binary split for each attribute. Let’s
first consider the case where A is a discrete-valued attribute
having v distinct values, {a1, a2, …, av}, occurring in D.
• To determine the best binary split on A, we examine all the possible
subsets SA that can be formed using known values of A. Each subset SA
can be considered as a binary test for attribute A of the form “A ∈ SA?”.
Given a tuple, this test is satisfied if the value of A for the tuple is
among the values listed in SA. If A has v possible values, then there
are 2^v possible subsets.

For example,
if income has three possible values, namely
{low, medium, high}, then the possible subsets are
{low, medium, high}, {low, medium}, {low, high}, {medium,
high}, {low}, {medium}, {high}.
VENKAT NAIK
Contd..
• The reduction in impurity that would be incurred
by a binary split on a discrete- or continuous-valued attribute A is

• ΔGini(A) = Gini(D) − Gini_A(D)

• The attribute that maximizes the reduction in


impurity is selected as the splitting attribute. This
attribute and either its splitting subset (for a
discrete-valued splitting attribute) or split-point
(for a continuous-valued splitting attribute)
together form the splitting criterion.

VENKAT NAIK
Contd..

VENKAT NAIK
Contd..
• The Gini Index value computed based on this
partitioning is

– Gini Index of subset {low, medium} = 0.458


– Gini Index of subset {

VENKAT NAIK
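The Gini computations above can be reproduced in a few lines. Below is a minimal Python sketch (illustrative, not part of the slides); the class counts come from the buys_computer table, and the particular binary split shown ({low, high} vs. {medium} on income, one of the candidate subsets the search would examine) is an assumption made for the example.

def gini(counts):
    # Gini(D) = 1 - sum(p_i^2) over the class proportions
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(partition_counts):
    # Weighted Gini index of a binary (or multiway) partition
    n = sum(sum(p) for p in partition_counts)
    return sum(sum(p) / n * gini(p) for p in partition_counts)

gini_d = gini([9, 5])                              # ~0.459
# income in {low, high}: 5 yes / 3 no; income = medium: 4 yes / 2 no
gini_income_split = gini_split([[5, 3], [4, 2]])   # ~0.458
reduction = gini_d - gini_income_split             # impurity reduction for this split
print(round(gini_d, 3), round(gini_income_split, 3), round(reduction, 3))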
Naïve Bayes Classifier
• It is a classification technique based on Bayes’ Theorem with an
assumption of independence between predictors. In simple
terms, it assumes that the presence of a particular feature
in a class is unrelated to the presence of any other feature.
• Consider each attribute and class label as random
variables
• Given a record with attributes (A1, A2,…,An)
– Goal is to predict class C
– Specifically, we want to find the value of C that maximizes
P(C| A1, A2,…,An )
• Can we estimate P(C| A1, A2,…,An ) directly from data?

VENKAT NAIK
Naïve Bayes Classifier
• Approach:
– compute the posterior probability P(C | A1, A2, …, An) for all
values of C using the Bayes theorem

P(C | A1 A2 … An) = P(A1 A2 … An | C) × P(C) / P(A1 A2 … An)

• Assume independence among attributes Ai when class is


given:
– P(A1, A2, …, An |C) = P(A1| Cj) P(A2| Cj)… P(An| Cj)

– Can estimate P(Ai| Cj) for all Ai and Cj.

VENKAT NAIK
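The decision rule above (pick the class C that maximizes P(C) multiplied by the product of the P(Ai | C) terms) can be sketched directly. Below is a minimal Python sketch (illustrative, not part of the slides) for categorical attributes; the names train_nb and predict_nb and the three-record toy dataset are assumptions made for the example, and no smoothing of zero counts is applied.

from collections import Counter, defaultdict

def train_nb(records, labels):
    # Estimate P(C) and P(Ai = v | C) from the training data by counting
    n = len(labels)
    priors = {c: cnt / n for c, cnt in Counter(labels).items()}
    cond = defaultdict(lambda: defaultdict(Counter))   # cond[c][attr][value] = count
    for r, c in zip(records, labels):
        for attr, value in r.items():
            cond[c][attr][value] += 1
    return priors, cond, Counter(labels)

def predict_nb(x, priors, cond, class_counts):
    # Return the class maximizing P(C) * product of P(Ai = v | C)
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for attr, value in x.items():
            score *= cond[c][attr][value] / class_counts[c]
        if score > best_score:
            best_class, best_score = c, score
    return best_class, best_score

# Tiny usage with two categorical attributes:
data = [{"refund": "yes", "status": "single"},
        {"refund": "no", "status": "married"},
        {"refund": "no", "status": "single"}]
labels = ["no", "no", "yes"]
priors, cond, counts = train_nb(data, labels)
print(predict_nb({"refund": "no", "status": "single"}, priors, cond, counts))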
Example: Contd..
How to Estimate Probabilities from Data?

Training data (Refund and Marital Status are categorical, Taxable Income is continuous, Evade is the class):

Tid   Refund   Marital Status   Taxable Income   Evade
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

• Class prior: P(C) = Nc / N
  – e.g., P(No) = 7/10, P(Yes) = 3/10

• For discrete attributes:
  P(Ai | Ck) = |Aik| / Nc
  – where |Aik| is the number of instances having attribute value Ai and belonging to class Ck
  – Examples: P(Status = Married | No) = 4/7, P(Refund = Yes | Yes) = 0

VENKAT NAIK
How to Estimate Probabilities from Data? Contd..

• For continuous attributes:


– Discretize the range into bins
• one ordinal attribute per bin
• violates independence assumption
– Two-way split: (A < v) or (A > v)
• choose only one of the two splits as new attribute
– Probability density estimation:
• Assume attribute follows a normal distribution
• Use data to estimate parameters of distribution
(e.g., mean and standard deviation)
• Once the probability distribution is known, we can use it to
estimate the conditional probability P(Ai | c)
Sample mean = (125 + 100 + 70 + ... + 75) / 7 = 110
Sample variance σ² = [ (125 − 110)² + (100 − 110)² + ... + (75 − 110)² ] / (7 − 1) = 2975
Sample standard deviation σ = √2975 = 54.54

VENKAT NAIK
How to Estimate Probabilities from Data? Contd..

(Training data: the same ten records as in the previous slide.)

• Normal distribution:

  P(Ai | cj) = 1 / √(2π σij²) × exp( −(Ai − μij)² / (2 σij²) )

  – one distribution for each (Ai, cj) pair

• For (Income, Class = No):
  – sample mean μ = 110
  – sample variance σ² = 2975

  P(Income = 120 | No) = 1 / (√(2π) × 54.54) × exp( −(120 − 110)² / (2 × 2975) ) = 0.0072
VENKAT NAIK
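The density value above can be reproduced with a short computation. Below is a minimal Python sketch (illustrative, not part of the slides) that estimates the sample mean and variance of the seven class-No incomes and evaluates the normal density at Income = 120; the helper name normal_pdf is an assumption made for the example.

from math import sqrt, pi, exp

incomes_no = [125, 100, 70, 120, 60, 220, 75]              # Taxable Income of class No
n = len(incomes_no)
mean = sum(incomes_no) / n                                 # 110
var = sum((x - mean) ** 2 for x in incomes_no) / (n - 1)   # 2975

def normal_pdf(x, mu, sigma2):
    # 1 / sqrt(2*pi*sigma^2) * exp( -(x - mu)^2 / (2*sigma^2) )
    return (1.0 / sqrt(2 * pi * sigma2)) * exp(-(x - mu) ** 2 / (2 * sigma2))

print(normal_pdf(120, mean, var))                          # ~0.0072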
Contd..

VENKAT NAIK
Ex: Contd..

VENKAT NAIK
Ex: 1-Weather Forecasting
Weather     Play
Sunny       No
Overcast    Yes
Rainy       Yes
Sunny       Yes
Sunny       Yes
Overcast    Yes
Rainy       No
Rainy       No
Sunny       Yes
Rainy       Yes
Sunny       No
Overcast    Yes
Overcast    Yes
Rainy       No

• STEP 1: Convert the dataset to a frequency table.
• STEP 2: Create a likelihood table by finding the probabilities.
• STEP 3: Use the Naïve Bayesian equation to calculate the posterior probability for each class.
• Conclusion: The class with the highest posterior probability is the outcome of the prediction.

VENKAT NAIK
Contd..
Step 1: Frequency Table

Weather       NO   YES
Overcast      0    4
Rainy         3    2
Sunny         2    3
Grand Total   5    9

Step 2: Likelihood Table

P(Overcast) = 4/14 ≈ 0.29
P(Rainy)    = 5/14 ≈ 0.36
P(Sunny)    = 5/14 ≈ 0.36
P(No)       = 5/14 ≈ 0.36
P(Yes)      = 9/14 ≈ 0.64

PROBLEM: Can players play if the weather is sunny?

P(Yes | Sunny) = P(Sunny | Yes) × P(Yes) / P(Sunny)
P(Sunny | Yes) = 3/9 ≈ 0.33, P(Yes) = 0.64, P(Sunny) = 0.36
P(Yes | Sunny) = 0.33 × 0.64 / 0.36 ≈ 0.60

Since "Yes" has the higher posterior probability, the players can play the game.

VENKAT NAIK
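The posterior computed above can be checked with a few lines of code. Below is a minimal Python sketch (illustrative, not part of the slides) that derives P(Yes | Sunny) directly from the 14 weather/play records listed earlier.

weather = ["Sunny", "Overcast", "Rainy", "Sunny", "Sunny", "Overcast", "Rainy",
           "Rainy", "Sunny", "Rainy", "Sunny", "Overcast", "Overcast", "Rainy"]
play    = ["No", "Yes", "Yes", "Yes", "Yes", "Yes", "No",
           "No", "Yes", "Yes", "No", "Yes", "Yes", "No"]

n = len(weather)
p_yes = play.count("Yes") / n                      # 9/14
p_sunny = weather.count("Sunny") / n               # 5/14
p_sunny_given_yes = sum(1 for w, p in zip(weather, play)
                        if w == "Sunny" and p == "Yes") / play.count("Yes")   # 3/9

print(p_sunny_given_yes * p_yes / p_sunny)         # ~0.60, so the prediction is Yes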
EX:2 NBC: Training Data
age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no

Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’

Data sample:
X = (age <= 30, Income = medium, Student = yes, Credit_rating = Fair)
VENKAT NAIK
Contd..
• P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357

• Compute P(X|Ci) for each class


P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
• X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007

Therefore, X belongs to class (“buys_computer = yes”)

VENKAT NAIK
EX3: NBC
Outlook Temperature Humidity Windy Class
overcast hot high false P
rain mild high false P
rain cool normal false P
overcast cool normal true P
sunny cool normal false P
rain mild normal false P
sunny mild normal true P          (9 tuples of class P)
overcast mild high true P
overcast hot normal false P

Outlook Temperature Humidity Windy Class


sunny hot high false N
sunny hot high true N
rain cool normal true N
sunny mild high false N
rain mild high true N             (5 tuples of class N)

VENKAT NAIK
Contd..
• Given the training set, we compute the probabilities:

Outlook       P     N          Humidity    P     N
sunny         2/9   3/5        high        3/9   4/5
overcast      4/9   0          normal      6/9   1/5
rain          3/9   2/5

Temperature   P     N          Windy       P     N
hot           2/9   2/5        true        3/9   3/5
mild          4/9   2/5        false       6/9   2/5
cool          3/9   1/5
• We also have the probabilities
– P = 9/14
– N = 5/14

VENKAT NAIK
Contd..

• To classify a new sample X= < sunny,cool, high,false >

• Prob(P|X) = Prob(P)*Prob(sunny|P)*Prob(cool|P)* Prob(high|


P)*Prob(false|P) = 9/14*2/9*3/9*3/9*6/9 = 0.01
• Prob(N|X) = Prob(N)*Prob(sunny|N)*Prob(cool|N)*
Prob(high|N)*Prob(false|N) = 5/14*3/5*1/5*4/5*2/5 = 0.013
• Therefore X takes class label N

VENKAT NAIK
Contd..

• Second example X = <rain, hot, high, false>

• P(X|p)·P(p) =
P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) =
3/9·2/9·3/9·6/9·9/14 = 0.010582

• P(X|n)·P(n) =
P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) =
2/5·2/5·4/5·2/5·5/14 = 0.018286

• Sample X is classified in class N (don’t play)

VENKAT NAIK
Contd..

• Advantages :
– Easy to implement
– Good results obtained in most of the cases
• Disadvantages:
– Assumption: class conditional independence, therefore loss of accuracy
– Practically, dependencies exist among variables
• E.g., Symptoms: fever, cough etc., Disease: lung cancer, diabetes, etc.
• Dependencies among these cannot be modeled by Naïve Bayesian Classifier
• How to deal with these dependencies?
– Bayesian Belief Networks

VENKAT NAIK
Ex:4 Apply NBC Contd..

VENKAT NAIK
Contd..

VENKAT NAIK
Contd..

VENKAT NAIK
Contd..

VENKAT NAIK
Bayesian Belief Networks [BBN] Classifier
• The Naive Bayesian classifier makes the assumption of
class conditional independence, that is, given the class
label of a tuple, the values of the attributes are assumed
to be conditionally independent of one another. This
simplifies computation.
• When the assumption holds true, then the naïve
Bayesian classifier is the most accurate in comparison
with all other classifiers.
• In practice, however, dependencies can exist between
variables.
• To model such dependencies, we use BBN.
• BBN uses “joint conditional probability distributions “.
VENKAT NAIK
Contd..
• They allow class conditional independencies to
be defined between subsets of variables.
• They provide a graphical model of causal
relationships, on which learning can be
performed.
• Trained Bayesian belief networks can be used
for classification.
• BBN is also known as belief networks, Bayesian
networks, and Probabilistic networks.

VENKAT NAIK
Contd..
• A BBN is defined by two components:
1. Directed Acyclic Graph (DAG)
2. Conditional Probability Table (CPT)

1. DAG:
(i) Each node in the DAG represents a random variable.
Variables can be discrete- or continuous-valued.
(ii) Each arc represents a relationship or probabilistic
dependence. If an arc is drawn from a node Y to
a node Z, then Y is a parent or immediate
predecessor of Z, and Z is a descendant of Y. Each
variable is conditionally independent of its
nondescendants in the graph, given its parents.
VENKAT NAIK
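The two components can be represented very compactly in code. Below is a minimal Python sketch (illustrative, not part of the slides) of a two-node network with an arc Y -> Z: a prior table for the parent Y, a CPT for the child Z, and the factorized joint probability; all probability values are hypothetical.

p_y = {"yes": 0.3, "no": 0.7}                      # hypothetical prior P(Y)
cpt_z_given_y = {                                  # hypothetical CPT P(Z | Y)
    "yes": {"yes": 0.8, "no": 0.2},
    "no":  {"yes": 0.1, "no": 0.9},
}

def joint(y, z):
    # P(Y = y, Z = z) = P(Y = y) * P(Z = z | Y = y), following the DAG structure
    return p_y[y] * cpt_z_given_y[y][z]

# Marginal P(Z = yes), obtained by summing the joint over the parent's values
p_z_yes = sum(joint(y, "yes") for y in p_y)
print(joint("yes", "yes"), p_z_yes)                # 0.24, 0.31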
Contd..

VENKAT NAIK
Contd..

VENKAT NAIK
Contd..

VENKAT NAIK
Contd..

VENKAT NAIK
Contd..

VENKAT NAIK
BBN Characteristics
1. BBN provides an approach for capturing the prior knowledge
of a particular domain using a graphical model. The network
can also be used to encode causal dependencies among
variables.
2. Constructing the networks can be time consuming and
requires a large amount of effort. However, once the
structure of the network has been determined, adding a new
variable is quite straightforward.
3. BN are well suited to dealing with incomplete data. Instances
with missing attributes can be handled by summing or
integrating the probabilities over all possible values of the
attribute.
4. Because the data is combined probabilistically with prior
knowledge , the method is quite robust to model overfitting.
VENKAT NAIK
K-Nearest Neighbor Classification Algorithm
• KNN Uses k “closest” points (nearest neighbors) for
performing classification.
• Basic Idea:
– If it walks like a duck, quacks like a duck, then it’s probably
a duck.

(Figure: compute the distance from the test record to the training records, then choose the k “nearest” records.)
VENKAT NAIK
Contd..

(Figure: an unknown record among the training records.)

 Requires three things:
– The set of stored records
– Distance Metric to compute
distance between records
– The value of k, the number of
nearest neighbors to retrieve

 To classify an unknown record:


– Compute distance to other
training records
– Identify k nearest neighbors
– Use class labels of nearest
neighbors to determine the
class label of unknown record
(e.g., by taking majority vote)

VENKAT NAIK
Contd..

(Figure: the 1-, 2-, and 3-nearest neighbors of a record x.)

NOTE: The k-nearest neighbors of a record x are the data points that
have the k smallest distances to x

VENKAT NAIK
Steps to classify the record by KNN
• Compute the distance between two points:
  – Euclidean distance

    d(p, q) = √( Σi (pi − qi)² )

• Determine the class from the nearest-neighbor list
  – take the majority vote of class labels among the k-nearest neighbors
  – Weight the votes according to distance
    • weight factor, w = 1/d²
VENKAT NAIK
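The steps above can be put together in a short routine. Below is a minimal Python sketch (illustrative, not part of the slides) that classifies a record by Euclidean distance and a simple majority vote; the names euclidean and knn_classify and the toy two-dimensional dataset are assumptions made for the example (a distance-weighted vote with w = 1/d² could be substituted for the plain majority vote).

from math import sqrt
from collections import Counter

def euclidean(p, q):
    # d(p, q) = sqrt( sum_i (p_i - q_i)^2 )
    return sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_classify(train_points, train_labels, x, k=3):
    # Sort the training records by distance to the unknown record x
    neighbors = sorted(zip(train_points, train_labels),
                       key=lambda rec: euclidean(rec[0], x))[:k]
    # Majority vote among the k nearest neighbors
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Tiny usage: two-dimensional points belonging to two classes
points = [(1, 1), (2, 1), (8, 9), (9, 8), (1, 2)]
labels = ["A", "A", "B", "B", "A"]
print(knn_classify(points, labels, (2, 2), k=3))   # "A"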
Example:
Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Class counts per attribute value:

Marital Status:   Single   Married   Divorced        Refund:       Yes   No
Class = Yes       2        0         1               Class = Yes   0     3
Class = No        2        4         1               Class = No    3     4

Distance between nominal attribute values:

d(V1, V2) = Σi | n1i/n1 − n2i/n2 |

d(Single, Married)   = | 2/4 − 0/4 | + | 2/4 − 4/4 | = 1
d(Single, Divorced)  = | 2/4 − 1/2 | + | 2/4 − 1/2 | = 0
d(Married, Divorced) = | 0/4 − 1/2 | + | 4/4 − 1/2 | = 1
d(Refund=Yes, Refund=No) = | 0/3 − 3/7 | + | 3/3 − 4/7 | = 6/7

VENKAT NAIK
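The nominal-value distances above follow the formula d(V1, V2) = Σi | n1i/n1 − n2i/n2 |. Below is a minimal Python sketch (illustrative, not part of the slides) that reproduces those numbers from the class counts in the table; the helper name nominal_distance is an assumption made for the example.

def nominal_distance(counts_v1, counts_v2):
    # counts_vk[i] = number of records with value Vk in class i
    n1, n2 = sum(counts_v1), sum(counts_v2)
    return sum(abs(a / n1 - b / n2) for a, b in zip(counts_v1, counts_v2))

# Class counts (Cheat = Yes, Cheat = No) from the table above
single   = [2, 2]
married  = [0, 4]
divorced = [1, 1]
refund_yes, refund_no = [0, 3], [3, 4]

print(nominal_distance(single, married))        # 1.0
print(nominal_distance(single, divorced))       # 0.0
print(nominal_distance(married, divorced))      # 1.0
print(nominal_distance(refund_yes, refund_no))  # ~0.857 = 6/7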
Contd..

Tid   Refund   Marital Status   Taxable Income   Cheat
X     Yes      Single           125K             No
Y     No       Married          100K             No

Distance between record X and record Y:

Δ(X, Y) = wX × wY × Σ (i = 1..d) d(Xi, Yi)²

where:

wX = (Number of times X is used for prediction) / (Number of times X predicts correctly)

wX ≈ 1 if X makes accurate predictions most of the time
wX > 1 if X is not reliable for making predictions


VENKAT NAIK
K-NN Characteristics
• K-NN classifiers are lazy learners
– It does not build models explicitly
– Unlike eager learners such as decision tree
induction and rule-based systems
– Classifying unknown records is relatively
expensive.

VENKAT NAIK
