
UNIT IV LEARNING

Probability basics - Bayes Rule and its Applications - Bayesian Networks – Exact and Approximate
Inference in Bayesian Networks - Hidden Markov Models - Forms of Learning - Supervised Learning
- Learning Decision Trees – Regression and Classification with Linear Models - Artificial Neural
Networks – Nonparametric Models - Support Vector Machines - Statistical Learning - Learning with
Complete Data - Learning with Hidden Variables- The EM Algorithm – Reinforcement Learning

BAYESIAN THEORY

Bayes’ theorem (Bayes’ law or Bayes’ rule) describes the probability of an event based on prior
knowledge of conditions that might be related to the event.
For example, if diabetes is related to age, then, using Bayes’ theorem, a person’s age can be used
to assess more accurately the probability that they have diabetes, compared to an assessment made
without knowledge of the person’s age. Bayes’ theorem is the basis of uncertain reasoning, where the
results are unpredictable.

Bayes Rule
P(h|D) = P(D|h) · P(h) / P(D)
P(h) - prior probability of hypothesis h
P(D) - prior probability of data D (the evidence)
P(h|D) - posterior probability (probability of h given the evidence D)
P(D|h) - likelihood of D given h (probability of the evidence given h)

Axioms of probability
1. All probabilities are between 0 and 1, i.e. 0 ≤ P(A) ≤ 1
2. P(True) = 1 and P(False) = 0
3. P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

BAYESIAN NETWORK
• A Bayesian network is a probabilistic graphical model that represents a set of variables and their
probabilistic dependencies and independencies. It is otherwise known as a Bayes net, Bayesian belief
network or simply a belief network. A Bayesian network specifies a joint distribution in a structured
form: it represents dependence and independence via a directed graph, a network of concepts linked
with conditional probabilities.
• Bayesian network consists of
– Nodes = random variables
– Edges = direct dependence
• Directed edges => direct dependence
• Absence of an edge => conditional independence
• Requires that graph is acyclic (no directed cycles)
• 2 components to a Bayesian network
– The graph structure (conditional independence assumptions)
– The numerical probabilities (for each variable given its parents)

For example, suppose a lab is known to produce 98% accurate results. A positive result then means
that a person X has malaria with probability 0.98 and does not have it with probability 0.02. This 2%
is the uncertainty factor, and handling such uncertainty is the reason we use Bayesian theory, which is
also known as probability learning.

The probabilities are numeric values between 0 and 1 that represent uncertainties.
i) Simple Bayesian network

p(A,B,C) = p(C|A,B)p(A)p(B)
ii) 3-way Bayesian network (Marginal Independence)

p(A,B,C) = p(A) p(B) p(C)


iii) 3-way Bayesian network (Conditionally independent effects)

p(A,B,C) = p(B|A)p(C|A)p(A)
B and C are conditionally independent Given A
iv) 3-way Bayesian network (Markov dependence)

p(A,B,C) = p(C|B) p(B|A)p(A)

Problem 1
You have a new burglar alarm installed. It is reliable at detecting burglary, but it also responds to minor
earthquakes. Two neighbours (John, Mary) promise to call you at work when they hear the alarm. John
always calls when he hears the alarm, but sometimes confuses the telephone ringing with the alarm.
Mary likes loud music and sometimes misses the alarm. Find the probability of the event that the alarm
has sounded but neither a burglary nor an earthquake has occurred, and both Mary and John call.
Consider 5 binary variables
B=Burglary occurs at your house
E=Earth quake occurs at your home
A=Alarm goes off
J=John calls to report alarm
M=Mary calls to report the alarm
Probability of the event that the alarm has sounded but neither a burglary nor an earthquake has
occurred and both Mary and John call:
P(J, M, A, ¬E, ¬B) = P(J|A) · P(M|A) · P(A|¬E, ¬B) · P(¬E) · P(¬B)
= 0.90 × 0.70 × 0.001 × 0.998 × 0.999
≈ 0.00063
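
A small sketch of the same calculation in Python, assuming the usual textbook CPT values for this network (P(B)=0.001, P(E)=0.002, P(A|¬B,¬E)=0.001, P(J|A)=0.90, P(M|A)=0.70), since the CPT table itself is not reproduced above.

```python
# Sketch of the Problem 1 calculation, assuming standard textbook CPT values.

P_B, P_E = 0.001, 0.002                 # priors for burglary and earthquake
P_A_given_notB_notE = 0.001             # alarm with neither cause
P_J_given_A, P_M_given_A = 0.90, 0.70   # neighbours calling given the alarm

# P(J, M, A, ~E, ~B) = P(J|A) P(M|A) P(A|~E,~B) P(~E) P(~B)
joint = P_J_given_A * P_M_given_A * P_A_given_notB_notE * (1 - P_E) * (1 - P_B)
print(round(joint, 5))                  # ~0.00063
```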
Problem 2
Rain influences sprinkler usage, and rain and the sprinkler together influence whether the grass is wet.
What is the probability that it is raining, the sprinkler is on, and the grass is wet?

Solution
Let S= Sprinkler
R=Rain
G=Grass wet
P(G,S,R)=P(G|S,R).P(S|R).P(R)
=0.99*0.01*0.2
=0.00198
Problem 3
Bayesian Classifier: Training Dataset
Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’
Data sample
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Solution
• P(Ci):
P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
• Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
• X = (age <= 30 , income = medium, student = yes, credit_rating = fair)

P(X|Ci) :
P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) :
P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
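
The class-conditional probabilities above can be recomputed by simple counting. The sketch below redoes the Problem 3 calculation in Python; the tuples and attribute ordering follow the table above.

```python
# Naive Bayes by counting over the 14 training tuples from Problem 3.

data = [
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31...40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31...40", "medium", "no", "excellent", "yes"),
    ("31...40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]

x = ("<=30", "medium", "yes", "fair")            # the query tuple X

scores = {}
for label in ("yes", "no"):
    rows = [r for r in data if r[-1] == label]
    score = len(rows) / len(data)                # prior P(Ci)
    for i, value in enumerate(x):                # times P(xi | Ci) for each attribute
        score *= sum(1 for r in rows if r[i] == value) / len(rows)
    scores[label] = score

print(scores)                                    # {'yes': ~0.028, 'no': ~0.007}
print("predicted class:", max(scores, key=scores.get))   # yes
```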

Problem 4
Did the patient have malignant tumour or not?
A patient takes a lab test and the result comes back positive. The test returns a correct positive
result in only 98% of the cases in which a malignant tumour is actually present, and a correct negative
result in only 97% of the cases in which it is not present. Furthermore, 0.008 of the entire population
have this tumour.
Solution:
P(tumour) = 0.008          P(¬tumour) = 0.992
P(+|tumour) = 0.98         P(−|tumour) = 0.02
P(+|¬tumour) = 0.03        P(−|¬tumour) = 0.97

P(tumour|+) = P(+|tumour) · P(tumour) / P(+) = (0.98 × 0.008) / P(+)
P(¬tumour|+) = P(+|¬tumour) · P(¬tumour) / P(+) = (0.03 × 0.992) / P(+)

Since the two posteriors must sum to 1:
(0.98 × 0.008)/P(+) + (0.03 × 0.992)/P(+) = 1
P(+) = 0.98 × 0.008 + 0.03 × 0.992 = 0.00784 + 0.02976 = 0.0376

P(tumour|+) = 0.00784 / 0.0376 ≈ 0.21
P(¬tumour|+) = 0.02976 / 0.0376 ≈ 0.79
The probability of not having the tumour is higher, so the person most likely does not have a malignant tumour.

Case 2:
Hypothesis: Does the patient have a malignant tumour if the test result is negative?

Solution:
P(tumour) = 0.008          P(¬tumour) = 0.992
P(+|tumour) = 0.98         P(−|tumour) = 0.02
P(+|¬tumour) = 0.03        P(−|¬tumour) = 0.97

P(tumour|−) = P(−|tumour) · P(tumour) / P(−) = (0.02 × 0.008) / P(−)
P(¬tumour|−) = P(−|¬tumour) · P(¬tumour) / P(−) = (0.97 × 0.992) / P(−)

(0.02 × 0.008)/P(−) + (0.97 × 0.992)/P(−) = 1
P(−) = 0.02 × 0.008 + 0.97 × 0.992 = 0.00016 + 0.96224 = 0.9624

P(tumour|−) = 0.00016 / 0.9624 ≈ 0.00017
P(¬tumour|−) = 0.96224 / 0.9624 ≈ 0.99983

The probability of not having the tumour is high, so the person does not have a malignant tumour.

HIDDEN MARKOV MODEL


A Markov process is a simple stochastic process in which the distribution of future states depends only
on the present state and not on how it arrived in the present state. A random sequence has the Markov
property if its distribution is determined solely by its current state. Any random process having this
property is called a Markov random process. For observable state sequences, this leads to a Markov
chain model. For non-observable states, this leads to a Hidden Markov Model.

MARKOV MODEL
A Markov model is a discrete finite system with N distinct states. It begins (at time t=1) in some initial
state. At each time step (t=1,2,...) the system moves from the current state to the next state according to
the transition probabilities associated with the current state. This kind of system is called a finite or
discrete Markov model.
Markov property (memoryless property): the state of the system at time t+1 depends only on the
state of the system at time t; the future is independent of the past given the present. Three basic pieces
of information define a Markov model:
 Parameter space
 State space
 State transition probability

Stationary assumption: In general, a process is called stationary if the transition probabilities are
independent of t, namely for all t, P[Xt+1 = xj | Xt = xi] = pij.

Set of states {S1, S2,…,SN}


• Process moves from one state to another generating a sequence of states: Si1, Si2,…,Sik,..
• Markov chain property: probability of each subsequent state depends only on what was the previous
state: P (Sik|Si1, Si2,…,Sik-1)=P(Sik| Sik-1)
• To define Markov model, the following probabilities have to be specified: transition probabilities
aij=P(Si|Sj) and initial probabilities π i=P(Si).
Example 1: Weather prediction
Tomorrow’s weather depends on today’s weather.
Tomorrow
Today Rainy Cloudy Sunny
Rainy 0.4 0.3 0.3
Cloudy 0.2 0.6 0.2
Sunny 0.1 0.1 0.8

What is the probability that the weather for the next 7 days will be “sun-sun-rain-rain-sun-cloudy-sun”
when today is sunny?
S1: rain, S2: cloudy, S3: sunny
P(O|model) = P(S3, S3, S3, S1, S1, S3, S2, S3 | model)
= P(S3) · P(S3|S3) · P(S3|S3) · P(S1|S3) · P(S1|S1) · P(S3|S1) · P(S2|S3) · P(S3|S2)
= π3 · a33 · a33 · a31 · a11 · a13 · a32 · a23
= 1 × 0.8 × 0.8 × 0.1 × 0.4 × 0.3 × 0.1 × 0.2
= 1.536 × 10⁻⁴
Example 2: Consider another three-state Markov model.
Initial state probability matrix
π = (πi) = [0.5  0.2  0.3]
State transition probability matrix
A = {aij} =
[0.6  0.2  0.2]
[0.5  0.3  0.2]
[0.4  0.1  0.5]
What is the probability of 5 consecutive up days (staying in state 1 for 5 days)?
P(1,1,1,1,1) = π1 · a11 · a11 · a11 · a11 = 0.5 × (0.6)⁴ = 0.0648
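
A short Python sketch of how such sequence probabilities are computed for the weather example above: multiply the initial probability by one transition probability per step.

```python
# Sketch for the weather example: probability of an observed state sequence in a
# first-order Markov chain, using the transition table given above.

states = {"rainy": 0, "cloudy": 1, "sunny": 2}
A = [[0.4, 0.3, 0.3],   # from rainy
     [0.2, 0.6, 0.2],   # from cloudy
     [0.1, 0.1, 0.8]]   # from sunny

def sequence_probability(seq, start_prob=1.0):
    """P(s1, s2, ..., sn) = P(s1) * product of the transition probabilities."""
    p = start_prob
    for prev, cur in zip(seq, seq[1:]):
        p *= A[states[prev]][states[cur]]
    return p

# today is sunny, then sun-sun-rain-rain-sun-cloudy-sun
seq = ["sunny", "sunny", "sunny", "rainy", "rainy", "sunny", "cloudy", "sunny"]
print(sequence_probability(seq))   # ~1.536e-4
```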

HIDDEN MARKOV MODEL


A hidden Markov model is an extension of a Markov model in which the input symbols are not the
same as the states. This means that we don’t know which state we are in. Often we face scenarios where
states cannot be directly observed. So there is a need of Hidden Markov Model.

aij are state transition probabilities, bik are observation (output) probabilities.

Set of states {S1, S2,…,SN}


• Process moves from one state to another generating a sequence of states: Si1, Si2,…,Sik,..
• Markov chain property: probability of each subsequent state depends only on what was the previous
state: P (Sik|Si1, Si2,…,Sik-1)=P(Sik| Sik-1)
• States are not visible, but each state randomly generates one of M observations (or visible states) {v1,
v2,…, vM}
• To define hidden Markov model, the following probabilities have to be specified: matrix of transition
probabilities A=(aij), aij= P(si|sj) , matrix of observation probabilities B=(bi(vm)), bi(vm)= P(vm|si) and a
vector of initial probabilities π=(πi), πi = P(si) . Model is represented by M=(A, B, π).

Example 1:

Number of states: N = 3
Number of observations: M = 3
V = {R, G, B}
Initial state distribution π = [1, 0, 0]
State transition probability distribution
A = {aij} =
[0.6  0.2  0.2]
[0.5  0.3  0.2]
[0.3  0.1  0.6]
Observation symbol probability distribution
B = {bi(vk)} =
[3/6  2/6  1/6]
[1/6  3/6  2/6]
[1/6  1/6  4/6]
Consider n urns containing coloured balls of m distinct colours, where each urn contains a different
mixture of the coloured balls.
Sequence generating algorithm
1. Pick initial urn according to some random process
2. Randomly pick a ball from the urn and then replace it.
3. Select another urn according to a random selection process.
4. Repeat steps 2 & 3.
Here, what is hidden? We can just see the chosen balls. We can’t see which urn is selected at a time.
So, urn selection (state transition) information is hidden.

Main issues of Hidden Markov Model


1. Evaluation Problem
Given the HMM M= (A, B, π) and the observation sequence O=o1 o2 ... oK , calculate the
probability that model M has generated sequence O.
Solution: Use Forward-Backward HMM algorithms for efficient calculations.
Forward recursion for HMM: Define the forward variable αk(i) as the joint probability of
the partial observation sequence o1, o2, ..., ok and the event that the hidden state at time k is si:
αk(i) = P(o1, o2, ..., ok, qk = si).
Backward recursion for HMM: Define the backward variable βk(i) as the probability of the
partial observation sequence ok+1, ok+2, ..., oK given that the hidden state at time k is si:
βk(i) = P(ok+1, ok+2, ..., oK | qk = si).
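
A minimal sketch of the forward recursion in Python. The model values (pi, A, B) below are made-up illustrative numbers in the M = (A, B, π) notation used above; B[i][v] is P(v | si).

```python
# Forward algorithm (evaluation problem): alpha_k(i) is the probability of the
# first k observations with the chain in state i at time k. Model values are
# hypothetical examples, not taken from the notes.

def forward(pi, A, B, observations):
    n = len(pi)
    alpha = [pi[i] * B[i][observations[0]] for i in range(n)]      # initialisation
    for obs in observations[1:]:                                   # induction
        alpha = [B[j][obs] * sum(alpha[i] * A[i][j] for i in range(n))
                 for j in range(n)]
    return sum(alpha)                                              # P(O | model)

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]            # B[i][v] = P(observation v | state i)
print(forward(pi, A, B, [0, 1, 1]))
```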

2. Decoding Problem
Given the HMM M= (A, B, π) and the observation sequence O=o1 o2 ... oK, calculate the most
likely sequence of hidden states Si that produced this observation sequence O.
Solution: Use efficient Viterbi algorithm
Define variable δk(i) as the maximum probability of producing observation sequence o1, o2 ...
ok when moving along any hidden state sequence q1… qk-1 and getting into qk= si .
δk(i) = max P(q1… qk-1 , qk= si , o1 o2 ... ok) where max is taken over all possible paths q1… qk-1.
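
A minimal sketch of the Viterbi recursion in Python, again with made-up illustrative model values; it keeps back-pointers so the most likely hidden state path can be recovered.

```python
# Viterbi algorithm (decoding problem): delta_k(i) is the highest probability of
# any state path ending in state i that explains o1..ok. Model values are hypothetical.

def viterbi(pi, A, B, observations):
    n = len(pi)
    delta = [pi[i] * B[i][observations[0]] for i in range(n)]
    back = []                                            # back-pointers, one list per step
    for obs in observations[1:]:
        ptr, new_delta = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: delta[i] * A[i][j])
            ptr.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][obs])
        back.append(ptr)
        delta = new_delta
    # follow the back-pointers from the best final state
    path = [max(range(n), key=lambda i: delta[i])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path)), max(delta)

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
print(viterbi(pi, A, B, [0, 1, 1]))     # most likely hidden state path and its probability
```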

3. Learning Problem
Given some training observation sequences O=o1 o2 ... oK and general structure of HMM
(number of hidden and visible states), determine HMM parameters M= (A, B, π) that best fit
training data.
Solution: Use iterative expectation-maximization algorithm to find local maximum of P(O|M)
- Baum-Welch algorithm
aij = (expected number of transitions from state sj to state si) / (expected number of transitions out of state sj)

bi(vm) = (expected number of times observation vm occurs in state si) / (expected number of times in state si)

Advantages of HMM on Sequential Data


 Natural model structure: doubly stochastic process
 Efficient and good modelling tool for sequences with temporal constraints, spatial variability
along the sequence and real world complex processes.
 Mathematically strong and computationally efficient

Application areas of HMM


 Online handwriting recognition
 Speech recognition
 Gesture recognition
 Language modelling
 Motion video analysis and tracking
 Optical character recognition
 Stock price prediction
 Flood prediction

FORMS OF LEARNING
Any component of an agent can be improved by learning from data.
Components to be learned
The components of these agents include:
1. A direct mapping from conditions on the current state to actions.
2. A means to infer relevant properties of the world from the percept sequence.
3. Information about the way the world evolves and about the results of possible actions the agent can
take.
4. Utility information indicating the desirability of world states.
5. Action-value information indicating the desirability of actions.
6. Goals that describe classes of states whose achievement maximizes the agent’s utility.
In supervised learning the agent observes some example input–output pairs and learns a function that
maps from input to output.
In unsupervised learning the agent learns patterns in the input even though no explicit feedback is
supplied. The most common unsupervised learning task is clustering: detecting potentially useful
clusters of input examples
In reinforcement learning the agent learns from a series of reinforcements, rewards or punishments.
In semi-supervised learning we are given a few labeled examples and must make what we can of a
large collection of unlabeled examples.

SUPERVISED LEARNING
The task of supervised learning is this: Given a training set of N example input–output pairs (x1, y1),(x2,
y2),...(xN , yN) , where each yj was generated by an unknown function y = f(x), discover a function h that
approximates the true function f.
Here x and y can be any value. The function h is a hypothesis. Learning is a search through the
space of possible hypotheses for one that will perform well, even on new examples beyond the training
set. To measure the accuracy of a hypothesis we give it a test set of examples that are distinct from the
training set. We say a hypothesis generalizes well if it correctly predicts the value of y for novel
examples.
When the output y is one of a finite set of values (such as sunny, cloudy or rainy), the learning problem
is called classification.
When y is a number, the learning problem is called regression.
Supervised learning can be done by choosing the hypothesis h∗ that is most probable given the data:
h∗ = argmaxh P(h | data)
By Bayes’ rule this is equivalent to
h∗ = argmaxh P(data | h) P(h)
The computer is presented with example inputs and their desired outputs, given by a "teacher",
and the goal is to learn a general rule that maps inputs to outputs.
o Prediction
o Classification (discrete labels),
o Regression (real values)

Prediction
Example: Price of a used car
x : car attributes
y : price
y = g (x | θ )
θ parameters

Classification
Suppose you have a basket filled with different kinds of fruits, and your task is to arrange them into
groups. Assume you already know, from previous work, the physical characteristics of each kind of
fruit, so arranging fruits of the same type together is easy. That previous work is called the training
data in data mining, and the label you assign (the fruit name) is the response variable, i.e. the decision
variable.
No. SIZE COLOR SHAPE FRUIT NAME
1 Big Red Rounded shape with a depression at the top Apple
2 Small Red Heart-shaped to nearly globular Cherry
3 Big Green Long curving cylinder Banana
4 Small Green Round to oval, Bunch shape Cylindrical Grape
Suppose you take a new fruit from the basket and observe its size, colour and shape. If the size is big,
the colour is red, and the shape is rounded with a depression at the top, you confirm the fruit as an apple
and put it in the apple group.
Learning from the training data and then applying that knowledge to the test data (the new fruit) is
called supervised learning.
Regression
Given example pairs of heights and weights of a set of people, find a model to predict the weight
of a person from her height

LEARNING DECISION TREE


A decision tree represents a function that takes as input a vector of attribute values and returns a
“decision”—a single output value. The input and output values can be discrete or continuous. A decision
tree reaches its decision by performing a sequence of tests. Each internal node in the tree corresponds
to a test of the value of one of the input attributes and the branches from the node are labeled with the
possible values of the attribute. Each leaf node in the tree specifies a value to be returned by the function.
Choosing attribute tests
The scheme used in decision tree learning for selecting attributes is designed to minimize the depth of
the final tree. The idea is to pick the attribute that goes as far as possible toward providing an exact
classification of the examples. A perfect attribute divides the examples into sets that are all positive or
all negative.

Assessing the performance of learning algorithm


1. Collect a large set of examples.
2. Divide it in to two disjoint sets, the training set and a testing set.
3. Apply the learning algorithm to the training set, generating a hypothesis h.
4. Measure the percentage of examples in the test set that are correctly classified by h.
5. Repeat steps 1 to 4 for different sizes of training sets and different randomly selected training
sets of each size.
Entropy is a measure of the uncertainty of a random variable:
Entropy(S) = − Σi pi log2 pi
The information gain from the attribute test on A is the expected reduction in entropy:
Gain(S, A) = Entropy(S) − Σv∈Values(A) (|Sv| / |S|) · Entropy(Sv)
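
A small Python sketch of these two formulas; the tiny example list at the end is illustrative only.

```python
# Entropy and information gain for a list of examples, where each example is a
# pair (attribute_values_dict, class_label).

from math import log2
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(examples, attribute):
    labels = [cls for _, cls in examples]
    remainder = 0.0
    for v in {attrs[attribute] for attrs, _ in examples}:
        subset = [cls for attrs, cls in examples if attrs[attribute] == v]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder

# illustrative call on three made-up rows
examples = [({"deadline": "Urgent", "party": "Yes"}, "Party"),
            ({"deadline": "Urgent", "party": "No"}, "Study"),
            ({"deadline": "Near", "party": "Yes"}, "Party")]
print(information_gain(examples, "party"))
```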

Over-fitting
It is a situation when the model works well for the training data but fails for new testing data.

How to prevent over-fitting


 Let us have a fully grown tree T.
 Choose a test node having only leaf nodes as descendants.
 If the test appears to be irrelevant, remove the test and replace it with a leaf node with the
majority class.
 Repeat until all tests seem to be relevant.
A decision tree learning system for real world applications must be able to handle the following issues.
 Missing data
 Multi-valued attribute
 Continuous and integer valued input attribute
 Continuous-valued output attribute.

Problem 1: Construct a decision tree for the following data

Deadline Is there party Lazy Activity


Urgent Yes Yes Party
Urgent No Yes Study
Near Yes Yes Party
Near Yes No Party
None No Yes Pub
None Yes No Party
Near No No Study
Near No Yes TV
Near Yes Yes Party
Urgent No No Study

Entropy(S) = −Pparty log2 Pparty − Pstudy log2 Pstudy − Ppub log2 Ppub − PTV log2 PTV

Party Study Pub TV


5 3 1 1

Entropy(S) = −(5/10) log2(5/10) − (3/10) log2(3/10) − (1/10) log2(1/10) − (1/10) log2(1/10)

= −0.5 log2 0.5 − 0.3 log2 0.3 − 0.1 log2 0.1 − 0.1 log2 0.1

= 0.5 + 0.5211 + 0.3322 + 0.3322
= 1.6855
E(S, deadline)=P(urgent) * E(1,2)+P(Near)*E(2,1,1)+P(None)*E(2,1)

Party Study Pub TV Total


Urgent 1 2 0 0 3
Near 2 1 0 1 4
None 2 0 1 0 3

= (3/10) · E(1/3, 2/3) + (4/10) · E(2/4, 1/4, 1/4) + (3/10) · E(2/3, 1/3)

= 0.3 · [−(1/3) log2(1/3) − (2/3) log2(2/3)] + 0.4 · [−(2/4) log2(2/4) − (1/4) log2(1/4) − (1/4) log2(1/4)]
+ 0.3 · [−(2/3) log2(2/3) − (1/3) log2(1/3)]

= 0.3 · [0.5283 + 0.3899] + 0.4 · [0.5 + 0.5 + 0.5] + 0.3 · [0.3899 + 0.5283]
= 0.2755 + 0.6 + 0.2755
= 1.151
Information Gain(S, deadline) = 1.6855 − 1.151 = 0.5345
E(S, Party)=P(Yes)*E(5,0,0,0)+P(No)*E(3,1,1)

Party Study Pub TV Total


Yes 5 0 0 0 5
No 0 3 1 1 5
= (5/10) · E(5/5, 0, 0, 0) + (5/10) · E(3/5, 1/5, 1/5)

= 0.5 · E(1) + 0.5 · E(0.6, 0.2, 0.2)

= 0.5 · [−1 · log2 1] + 0.5 · [−0.6 log2 0.6 − 0.2 log2 0.2 − 0.2 log2 0.2]
= 0 + 0.5 · [0.4422 + 0.4644 + 0.4644]
= 0.5 × 1.371
= 0.6855
IG(S,Party)=1.6855-0.6855=1
E(S,lazy)=P(yes)*E(3,1,1,1)+P(No)*E(2,2)

Party Study Pub TV Total


Yes 3 1 1 1 6
No 2 2 0 0 4
= (6/10) · E(3/6, 1/6, 1/6, 1/6) + (4/10) · E(2/4, 2/4)

= 0.6 · E(0.5, 0.1667, 0.1667, 0.1667) + 0.4 · E(0.5, 0.5)
= 0.6 · [−0.5 log2 0.5 − 0.1667 log2 0.1667 − 0.1667 log2 0.1667 − 0.1667 log2 0.1667]
+ 0.4 · [−0.5 log2 0.5 − 0.5 log2 0.5]
= 0.6 · [0.5 + 0.4309 + 0.4309 + 0.4309] + 0.4 · [0.5 + 0.5]
= 1.0756 + 0.4
= 1.4756
IG(S,Lazy)=1.6855-1.4756=0.2099
The maximum information gain is IG(S, Party), so the root node will be Party, which has two feature
values, Yes and No. The Yes branch leads directly to the decision “Go to party”; the No branch is still
undecided, so we next consider only the examples with Party = No.

Deadline Is there party Lazy Activity


Urgent No Yes Study
None No Yes Pub
Near No No Study
Near No Yes TV
Urgent No No Study
Study Pub TV
3 1 1

E(S) = −(3/5) log2(3/5) − (1/5) log2(1/5) − (1/5) log2(1/5)

= −0.6 log2 0.6 − 0.2 log2 0.2 − 0.2 log2 0.2

= 0.4422 + 0.4644 + 0.4644
= 1.371
E(S,deadline)=P(urgent)*E(2)+P(None)*E(1)+P(Near)*E(1,1)

Study Pub TV Total


Urgent 2 0 0 2
Near 1 0 1 2
None 0 1 0 1

= (2/5) · E(2/2) + (1/5) · E(1/1) + (2/5) · E(1/2, 1/2)

= 0.4 · E(1) + 0.2 · E(1) + 0.4 · E(0.5, 0.5)
= 0 + 0 + 0.4 · [−0.5 log2 0.5 − 0.5 log2 0.5]
= 0.4
IG(S, deadline)=1.371-0.4=0.971
E(S,Lazy)=P(Yes)*E(1,1,1)+P(No)*E(2)

Study Pub TV Total


Yes 1 1 1 3
No 2 0 0 2

= (3/5) · E(1/3, 1/3, 1/3) + (2/5) · E(2/2)

= 0.6 · E(0.3333, 0.3333, 0.3333) + 0.4 · E(1)
= 0.6 · [−0.3333 log2 0.3333 − 0.3333 log2 0.3333 − 0.3333 log2 0.3333] + 0
= 0.6 · [0.5283 + 0.5283 + 0.5283]
= 0.9509
IG(S,lazy)=1.371-0.9509=0.4201

For this subset the maximum information gain is IG(S, deadline), so Deadline becomes the next node
under the Party = No branch: Urgent → Study, None → Pub, while the Near branch is still undecided.
The remaining examples with Deadline = Near are:

Deadline Is there party Lazy Activity
Near No No Study
Near No Yes TV

Splitting these on Lazy completes the tree:
Party = Yes → Go to party
Party = No → Deadline: Urgent → Study, None → Pub, Near → Lazy (Yes → TV, No → Study)

REGRESSION AND CLASSIFICATION WITH LINEAR MODELS


Univariate linear regression
A univariate linear function (a straight line) with input x and output y has the form y = w1x+ w0, where
w0 and w1 are real-valued coefficients to be learned. The value of y is changed by changing the relative
weight of one term or another. Define w to be the vector [w0, w1]

Figure shows an example of a training set of n points in the x, y plane, each point representing the size
in square feet and the price of a house offered for sale.

The task of finding the hw that best fits these data is called linear regression. To fit a line to the data,
find the values of the weights [w0, w1] that minimize the empirical loss, i.e. find
w∗ = argminw Σj (yj − hw(xj))²
Gradient Descent
For one training example (x, y), the weight-update rules are
w0 ← w0 + α (y − hw(x))   and   w1 ← w1 + α (y − hw(x)) · x
For N training examples, the updates sum the errors over all examples:
w0 ← w0 + α Σj (yj − hw(xj))   and   w1 ← w1 + α Σj (yj − hw(xj)) · xj
These updates constitute the batch gradient descent learning rule for univariate linear regression.
Convergence to the unique global minimum is guaranteed but may be very slow.
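
A minimal sketch of this batch gradient descent rule in Python; the data points and the learning rate α are illustrative values, not from the text.

```python
# Batch gradient descent for univariate linear regression, h_w(x) = w1*x + w0.

def fit_line(xs, ys, alpha=0.01, epochs=1000):
    w0, w1 = 0.0, 0.0
    for _ in range(epochs):
        # batch rule: sum the prediction errors over all N examples
        errors = [y - (w1 * x + w0) for x, y in zip(xs, ys)]
        w0 += alpha * sum(errors)
        w1 += alpha * sum(e * x for e, x in zip(errors, xs))
    return w0, w1

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.2, 7.9]              # roughly y = 2x
print(fit_line(xs, ys))                # w0 close to 0, w1 close to 2
```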
Multivariate Linear Regression

In multivariate linear regression, the hypothesis space is the set of functions of the form
hw(xj) = w0 + w1 xj,1 + · · · + wn xj,n
Augmented vectors: add a feature to each x by tacking on a 1 (xj,0 = 1), so that hw(xj) = w · xj = Σi wi xj,i.
The update equation for each weight wi is
wi ← wi + α Σj xj,i (yj − hw(xj))
It is also possible to solve analytically for the w that minimizes loss. Let y be the vector of outputs for
the training examples, and X be the data matrix. Then the solution that minimizes the squared error is
w∗ = (XᵀX)⁻¹ Xᵀ y
It is common to use regularization on multivariate linear functions to avoid over- fitting. With
regularization we minimize the total cost of a hypothesis, counting both the empirical loss and the
complexity of the hypothesis.

Linear classifiers with a hard threshold


Linear functions can be used to do classification as well as regression. For example, Figure
shows data points of two classes: earthquakes and underground explosions. Each point is defined by
two input values, x1 and x2, that refer to body and surface wave magnitudes computed from the seismic
signal. Given these training data, the task of classification is to learn a hypothesis h that will take new
(x1, x2) points and return either 0 for earthquakes or 1 for explosions.
(a) Plot of two seismic data parameters, for earthquakes (white circles) and nuclear
explosions (black circles) (b) The same domain with more data points

A decision boundary is a line that separates the two classes. In the figure, the decision boundary is
a straight line. A linear decision boundary is called a linear separator and data that admit such a separator
are called linearly separable. The classification hypothesis is
hw(x) = Threshold(w · x), where Threshold(z) = 1 if z ≥ 0 and 0 otherwise.

Linear classification with logistic regression


With the logistic function
g(z) = 1 / (1 + e⁻ᶻ), so that hw(x) = g(w · x),
the weight update for minimizing the loss is
wi ← wi + α (y − hw(x)) · hw(x) (1 − hw(x)) · xi
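
A small sketch of this update rule in Python, in its stochastic form (one randomly chosen example per step); the two-feature data set and the learning rate are illustrative.

```python
# Logistic regression trained with the update
# w_i <- w_i + alpha * (y - h_w(x)) * h_w(x) * (1 - h_w(x)) * x_i

import math
import random

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(points, labels, alpha=0.1, epochs=5000):
    w = [0.0, 0.0, 0.0]                              # w0 (bias), w1, w2
    for _ in range(epochs):
        i = random.randrange(len(points))
        x = [1.0] + points[i]                        # augmented input, x0 = 1
        h = logistic(sum(wi * xi for wi, xi in zip(w, x)))
        for k in range(len(w)):
            w[k] += alpha * (labels[i] - h) * h * (1 - h) * x[k]
    return w

points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
labels = [0, 0, 0, 1]                                # a small linearly separable task
print(train(points, labels))                         # learned weight vector [w0, w1, w2]
```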

ARTIFICIAL NEURAL NETWORK


ANN are also called neural networks, connectionist systems, neuromorphic systems, parallel distributed
processing (PDP) systems, etc. Networks of relatively simple processing units, which are very abstract
models of neurons; the network does the computation more than the units.
Neuron-like units

A simple mathematical model for a neuron


Typical activation functions

Neural Network and Logic


McCulloch and Pitts, 1943: showed that whatever you can do with logic networks, you can do with
networks of abstract neuron-like units.

Single Layer Network


A single layer neural network consists of a set of units organized in a layer. Each unit Un
receives a weighted input Ij with weight Wjn. The figure shows a single layer neural network with j
inputs and outputs.

Multilayer Network
A multilayer network has two or more layers of units, with the output from one layer serving
as input to the next. Generally a multilayer network has three kinds of layers: an input layer, an
output layer and hidden layers. Layers with no external output connections are referred to as hidden
layers. A multilayer neural network structure is given in the figure.

Feed Forward neural network


In this network, the information moves in only one direction: forward from the input nodes,
through the hidden nodes and to the output nodes. There are no cycles or loops in the network. In other
words, a feed forward neural network is one that does not have any connections from output
back to input. All inputs with variable weights are connected with every other node. A single layer feed
forward network has one layer of nodes, whereas a multilayer feed forward network has multiple layers
of nodes. The structure of a feed forward multilayer network is given in the figure.

Back Propagation neural network


Multilayer neural networks most commonly use a learning technique called the back propagation
algorithm. In a back propagation neural network, the output values are compared
with the correct answer to compute the value of some predefined error function. By various techniques
the error is then fed back through the network. Using this information, the algorithm adjusts the
weights of each connection in order to reduce the value of the error function by some small amount.
After repeating this process for a sufficiently large number of training cycles the network will usually
converge to some state where the error of the calculation is small.
The goal of back propagation, as with most training algorithms, is to iteratively adjust the
weights in the network to produce the desired output by minimizing the output error. The algorithm’s
goal is to solve credit assignment problem. Back propagation is a gradient-descent approach in that it
uses the minimization of first-order derivatives to find an optimal solution. The standard back
propagation algorithm is given below.

Step 1: Build a network with the chosen number of input, hidden and output units.
Step 2: Initialize all the weights to low random values.
Step 3: Randomly choose a single training pair.
Step 4: Copy the input pattern to the input layer.
Step 5: Cycle the network so that the activation from the inputs generates the activations in the
hidden and output layers.
Step 6: Calculate the error derivative between the output activation and the final output.
Step 7: Apply the method of back propagation to the summed products of the weights and errors in
the output layer in order to calculate the error in the hidden units.
Step 8: Update the weights attached to each unit according to the error in that unit, the output from
the unit below it and the learning parameters, until the error is sufficiently low.
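
A minimal sketch of these steps in Python/NumPy for a tiny network with one hidden layer of two sigmoid units. The network size, learning rate and training task (the OR function) are illustrative choices, not from the text; the comments map the lines to the steps above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.5, (2, 3))   # input(2)+bias -> 2 hidden units (Step 2: small random weights)
W2 = rng.normal(0.0, 0.5, (1, 3))   # hidden(2)+bias -> 1 output unit

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)  # last column = bias input
y = np.array([[0.0], [1.0], [1.0], [1.0]])        # OR targets, used here for simplicity
alpha = 0.5

for _ in range(10000):
    i = rng.integers(len(X))                              # Step 3: choose a training pair
    h = np.append(sigmoid(W1 @ X[i]), 1.0)                # Step 5: forward pass (+ bias unit)
    out = sigmoid(W2 @ h)
    delta_out = (y[i] - out) * out * (1 - out)            # Step 6: output error derivative
    delta_h = (W2[:, :2].T @ delta_out) * h[:2] * (1 - h[:2])  # Step 7: back-propagated error
    W2 += alpha * np.outer(delta_out, h)                  # Step 8: update weights
    W1 += alpha * np.outer(delta_h, X[i])

for row in X:
    hidden = np.append(sigmoid(W1 @ row), 1.0)
    print(row[:2], round(float(sigmoid(W2 @ hidden)[0]), 2))   # outputs move toward the targets
```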

NON-PARAMETRIC MODEL
A learning model that summarizes data with a set of parameters of fixed size is called a
parametric model.
e.g., linear models, neural networks
A nonparametric model is one that cannot be characterized by a bounded set of parameters.
For example, suppose that each hypothesis we generate simply retains within itself all of the training
examples and uses all of them to predict the next example. Such a hypothesis family would be
nonparametric because the effective number of parameters is unbounded - it grows with the number of
examples. This approach is called instance-based learning or memory-based learning. The simplest
instance-based learning method is table lookup.

Nearest Neighbor Models


K-nearest neighbors algorithm:
• Save all the training examples
• For classification: find k nearest neighbors of the input and take a vote (make k odd)
• For regression: take mean or median of the k nearest neighbors, or do a local regression on them
Distances are measured with a Minkowski distance:
Lᵖ(xj, xq) = ( Σi |xj,i − xq,i|ᵖ )^(1/p)
With p = 2 this is Euclidean distance and with p = 1 it is Manhattan distance. With Boolean attribute
values, the number of attributes on which the two points differ is called the Hamming distance.
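
A small Python sketch of k-nearest-neighbour classification with a Minkowski distance, in the spirit of the description above; the toy points are illustrative.

```python
# k-nearest-neighbour classification by majority vote over the k closest points.

from collections import Counter

def minkowski(a, b, p=2):
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1.0 / p)

def knn_classify(train, query, k=3, p=2):
    """train is a list of (point, label) pairs; returns the majority label of the k nearest."""
    nearest = sorted(train, key=lambda item: minkowski(item[0], query, p))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "earthquake"), ((1.2, 0.9), "earthquake"),
         ((4.0, 4.2), "explosion"), ((4.1, 3.9), "explosion"), ((3.8, 4.0), "explosion")]
print(knn_classify(train, (1.1, 1.0), k=3))   # earthquake
```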

(a) A k-nearest-neighbor model showing the extent of the explosion class for the data with k = 1.
Overfitting is apparent. (b) With k = 5, the overfitting problem goes away for this data set.

Curse of Dimensionality
In high dimensions, the nearest points tend to be far away.

The curse of dimensionality: (a) The length of the average neighborhood for 10-nearest-
neighbors in a unit hypercube with 1,000,000 points, as a function of the number of dimensions.
(b) The proportion of points that fall within a thin shell consisting of the outer 1% of the
hypercube, as a function of the number of dimensions. Sampled from 10,000 randomly
distributed points.
Nonparametric regression
Figure shows an example of some different models

Nonparametric regression models: (a) connect the dots, (b) 3-nearest neighbors average, (c) 3-
nearest-neighbors linear regression, (d) locally weighted regression with a quadratic kernel of
width k = 10.
Locally weighted regression
Locally weighted regression gives us the advantages of nearest neighbors, without the discontinuities.
The idea of locally weighted regression is that at each query point xq, the examples that are close to xq
are weighted heavily, and the examples that are farther away are weighted less heavily or not at all. The
decrease in weight over distance is always gradual, not sudden. For a given query point xq we solve the
following weighted regression problem using gradient descent:

A quadratic kernel with kernel width k = 10, centered on the query point x = 0.
SUPPORT VECTOR MACHINE
SVM is a supervised learning algorithm and a widely used classification method for both
linear and nonlinear data. SVM applies directly to data that are linearly separable; for nonlinear data,
kernel functions are used. SVM finds the separating hyperplane using support vectors (“essential”
training tuples) and margins (defined by the support vectors). These are the three
properties of SVMs
1. SVMs construct a maximum margin separator—a decision boundary with the largest possible
distance to example points
2. SVMs create a linear separating hyperplane, but they have the ability to embed the data into a
higher-dimensional space, using the so-called kernel trick. Often, data that are not linearly
separable in the original input space are easily separable in the higher dimensional space.
3. SVMs are a nonparametric method—they retain training examples and potentially need to store
them all. SVMs combine the advantages of nonparametric and parametric models: they have
the flexibility to represent complex functions, but they are resistant to overfitting.

Hyperplane: the separating hyperplane should have the largest margin in a high-dimensional space when
separating the data into two classes. The margin between the two classes represents the longest distance
between the closest data points of those classes.

H1 and H2 are the planes. If we maximize the margin (distance) between the two hyperplanes and then
divide by 2, we get the decision boundary. Taking only two dimensions, the equation of the separating
hyperplane (a line) is w·x + b = 0, which can be written as w·x = 0 when the bias b is absorbed into an
augmented weight vector.
For each vector xi either
w.xi + b≥ 1 for xi having the class 1
or
w.xi + b≤ -1 for xi having the class -1
d+ = the shortest distance to the closest positive point
d- = the shortest distance to the closest negative point

Support Vectors for linearly separable case


• Support vectors are the elements of the training set that would change the position of the dividing
hyperplane if removed.
• Support vectors are the critical elements of the training set
• The problem of finding the optimal hyper plane is an optimization problem and can be solved by
optimization techniques.
if w.x + b=0 then we get the decision boundary
if w.x + b= 1 then we get (+) class hyperplane
for all positive(x) points satisfy this rule (w.x + b ≥ 1)
if w.x + b= -1 then we get (-) class hyperplane
for all negative(x) points satisfy this rule (w.x + b≤ -1)
D1: wᵀx + b = 1, i.e. wᵀx + b − 1 = 0
D2: wᵀx + b = −1, i.e. wᵀx + b + 1 = 0
The total distance between D1 and D2 is thus 2/‖w‖.

For every training point (x, y) we check the value of y·(w·x + b):
if y·(w·x + b) = 1, the point lies exactly on the margin and is a support vector; it is classified
correctly, so keep the parameters;
else if y·(w·x + b) > 1, the point is classified correctly, so keep the parameters;
else, the point is classified incorrectly, so adjust the parameters.
Some commonly used kernel functions
1. Linear
K(X, Y) = XTY
2. Polynomial of degree d
K(X, Y) = ( XTY+1)d
3. Gaussian Radial Basis Function
K(X, Y) = exp(−‖X − Y‖² / (2σ²))
4. Tanh kernel
K(X, Y) = tanh(κ (XᵀY) − δ)

Applications of SVM
• SVM has been used successfully in many real-world problems
- Text (and hypertext) categorization
- Image classification
- Bioinformatics (Protein classification, Cancer classification)
- Hand-written character recognition

Support vector Machine provides very simple method for linear classification. But performance, in case
of nonlinearly separable data, largely depends on the choice of the kernel.
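
As a usage illustration (assuming scikit-learn is available), the sketch below trains an SVM with the Gaussian RBF kernel listed above on a tiny made-up data set; SVC's C parameter controls how soft the margin is.

```python
# Small usage sketch of a kernel SVM with scikit-learn; the data are illustrative.

from sklearn.svm import SVC

X = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],      # class 0
     [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]]      # class 1
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="rbf", C=1.0, gamma="scale")  # Gaussian radial basis kernel
clf.fit(X, y)
print(clf.predict([[0.15, 0.2], [1.1, 0.9]]))  # expected: [0 1]
print(clf.support_vectors_)                    # the "essential" training tuples
```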

STATISTICAL LEARNING
Bayesian learning simply calculates the probability of each hypothesis, given the data, and
makes predictions on that basis. That is, the predictions are made by using all the hypotheses, weighted
by their probabilities, rather than by using just a single “best” hypothesis. In this way, learning is
reduced to probabilistic inference. Let D represent all the data, with observed value d; then the
probability of each hypothesis is obtained by Bayes’ rule:
P(hi | d) = α P(d | hi) P(hi)
Now, suppose we want to make a prediction about an unknown quantity X. Then we have
P(X | d) = Σi P(X | hi) P(hi | d)
The key quantities in the Bayesian approach are the hypothesis prior, P(hi), and the likelihood of the
data under each hypothesis, P(d | hi).

A very common approximation, one that is usually adopted in science, is to make predictions based on a
single most probable hypothesis, that is, an hi that maximizes P(hi | d). This is often called a maximum
a posteriori or MAP hypothesis. In both Bayesian learning and MAP learning, the hypothesis prior P(hi)
plays an important role.

LEARNING WITH COMPLETE DATA

Statistical learning begins with parameter learning with complete data. A parameter learning task
involves finding the numerical parameters for a probability model whose structure is fixed. Data are
complete when each data point contains values for every variable in the probability model being learned.
Complete data simplify the problem of learning the parameters of a complex model.
Learning Structure
 Maximum-likelihood parameter learning: Discrete models
 Naive Bayes models
 Maximum-likelihood parameter learning: Continuous models
 Bayesian parameter learning

Maximum-likelihood parameter learning: Discrete models


1. Write down an expression for the likelihood of the data as a function of the parameter(s).
2. Write down the derivative of the log likelihood with respect to each parameter.
3. Find the parameter values such that the derivatives are zero.

Suppose we buy a bag of lime and cherry candies from a new manufacturer whose lime and cherry
proportions are completely unknown. Let θ be the proportion of cherry candies, giving the hypothesis hθ;
the proportion of lime candies is then 1 − θ. Suppose we unwrap N candies, of which c are cherries and
ℓ = N − c are limes.

Equation of likelihood:
P(d | hθ) = Πj P(dj | hθ) = θᶜ (1 − θ)ℓ

Log likelihood:
L(d | hθ) = log P(d | hθ) = c log θ + ℓ log(1 − θ)

To find the maximum-likelihood value of θ, we differentiate L with respect to θ and set the
resulting expression to zero:
dL/dθ = c/θ − ℓ/(1 − θ) = 0   ⟹   θ = c/(c + ℓ) = c/N
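
A small numeric check of this result in Python: with c cherries and ℓ limes observed, the log likelihood is maximised at θ = c/N. The counts below are illustrative.

```python
# Maximum-likelihood estimate for the candy example: theta_ML = c / N.

import math

def log_likelihood(theta, c, limes):
    return c * math.log(theta) + limes * math.log(1.0 - theta)

c, limes = 7, 3                          # 7 cherry and 3 lime candies out of N = 10
theta_ml = c / (c + limes)
print(theta_ml)                          # 0.7

# the maximum-likelihood value beats any other candidate value of theta
print(all(log_likelihood(theta_ml, c, limes) >= log_likelihood(t / 100, c, limes)
          for t in range(1, 100)))       # True
```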

Naive Bayes models


The most common Bayesian network model used in machine learning is the naive Bayes
model. In this model, the “class” variable C is the root and the “attribute” variables Xi are the leaves.
The model is “naive” because it assumes that the attributes are conditionally independent of each other,
given the class. With observed attribute values x1, . . . , xn, the probability of each class is given by
P(C | x1, . . . , xn) = α P(C) Πi P(xi | C)

A deterministic prediction can be obtained by choosing the most likely class.

Maximum-likelihood parameter learning: Continuous models


Continuous probability models such as the linear Gaussian model were used earlier. Consider learning
the parameters of a Gaussian density function on a single variable; that is, the data are generated as
follows:
P(x) = (1 / √(2πσ²)) e^(−(x−μ)² / (2σ²))
The parameters of this model are the mean μ and the standard deviation σ. The log likelihood is
L = Σj [ −log(√(2π) σ) − (xj − μ)² / (2σ²) ]
Setting the derivatives with respect to μ and σ to zero yields
μ = (1/N) Σj xj      σ = √( (1/N) Σj (xj − μ)² )

Bayesian parameter learning
Maximum-likelihood learning gives rise to some very simple procedures, but it has some
serious deficiencies with small data sets. Bayesian parameter learning places a hypothesis prior over
the possible values of the parameters and updates this distribution as data arrive. If the parameters are
assumed independent, the prior can be factored over the individual parameters.

Learning Bayes net structure


This approach is to search for a good model. It overcomes all disadvantage of all the above
models.

EM ALGORITHM
The EM algorithm was presented by Dempster, Laird and Rubin in 1977. It is an iterative
estimation algorithm that can derive maximum-likelihood estimates in the presence of
missing/hidden data (incomplete data).

Uses of EM algorithm
 Filling the missing data in a sample.
 Discovering the value of latent variables.
 Estimating the parameters of HMM
 Estimating the parameters of finite mixtures
 Unsupervised learning of clusters
 Parameters of mixtures of Gaussian

Algorithm
1. Consider a set of starting parameter
- Given a set of incomplete data
- Assume observed data come from a specific model
2. Use these to “estimate” the missing data, formulate some parameters for that model, use this to
guess the missing value/data (expectation step).
3. Use the “complete” data to update the parameters: from the missing data and observed data, find
the most likely parameters (maximization step).
4. Repeat steps 2 and 3 until convergence.

The main two steps of this algorithm are


Expectation step: Use current parameters to reconstruct hidden structure.
Maximization step: Use the hidden structures to re-estimate parameters.

Basic Setting in EM
X is a set of data points
𝜃 is a parameter vector
EM is a method to find 𝛳𝑀𝐿, where
𝛳𝑀𝐿 = arg max L(𝛳) = arg max log P(X | 𝛳)
L(𝛳) is the likelihood function.

Z=(X,Y)
Z: complete data (augmented data)
X: observed data (incomplete data)
Y: hidden data (missing data)
Coin Toss Problem
The target is to figure out the probability of heads of two coins. ML estimate can be directly calculated
from the result.
 We have two coins A and B
 The probabilities for heads are qA and qB.
 5 measurement sets, with 10 coin tosses in each set
Coin type   Tosses        Heads/Tails
B           HTTTHHTHTH    5H, 5T
A           HHHHTHHHHH    9H, 1T
A           HTHHHHHTHH    8H, 2T
B           HTHHHTTTTT    4H, 6T
A           THHHTHHHTH    7H, 3T
Total for coin A: 24H, 6T; total for coin B: 9H, 11T
Maximum likelihood:
qA = 24/(24+6) = 0.8 (when you toss coin A, the probability that a head comes up is 0.8)
qB = 9/(9+11) = 0.45 (when you toss coin B, the probability that a head comes up is 0.45)

The same problem can be worked out with the EM algorithm when information is missing. Here we do
not know which coin was tossed in each set, so we cannot calculate the maximum likelihood directly;
this is where the EM algorithm is used.
5 sets of 10 tosses each
HTTTHHTHTH
HHHHTHHHHH
HTHHHHHTHH
HTHHHTTTTT
THHHTHHHTH
1. Initialization step (randomly choose initial values between 0 and 1):
qA^(0) = 0.6
qB^(0) = 0.5

2. Estimation step
Binomial distribution:
P(k heads in n tosses) = C(n, k) pᵏ (1 − p)ⁿ⁻ᵏ = [n! / ((n − k)! k!)] pᵏ (1 − p)ⁿ⁻ᵏ

where n is the number of coin tosses, k is the number of heads, and p is the current estimate of the
probability of heads.

For the first set (5 heads in 10 tosses):
C(10, 5) · 0.6⁵ · (1 − 0.6)⁵ = 0.201 (using qA)
C(10, 5) · 0.5⁵ · (1 − 0.5)⁵ = 0.246 (using qB)

Maximum likelihood estimates (the responsibility of each coin) for the first set:
P(A) = 0.201 / (0.201 + 0.246) = 0.45
P(B) = 0.246 / (0.201 + 0.246) = 0.55

Tosses        Heads/Tails   P(A)   P(B)   Expected for A   Expected for B
HTTTHHTHTH    5H, 5T        0.45   0.55   2.2H, 2.2T       2.8H, 2.8T
HHHHTHHHHH    9H, 1T        0.80   0.20   7.2H, 0.8T       1.8H, 0.2T
HTHHHHHTHH    8H, 2T        0.73   0.27   5.9H, 1.7T       2.1H, 0.5T
HTHHHTTTTT    4H, 6T        0.35   0.65   1.4H, 2.1T       2.6H, 3.8T
THHHTHHHTH    7H, 3T        0.65   0.35   4.6H, 1.8T       2.5H, 1.1T
Total                                      21.3H, 8.6T      11.7H, 8.4T

qA^(1) = 21.3 / (21.3 + 8.6) = 0.71
qB^(1) = 11.7 / (11.7 + 8.4) = 0.58

Convergence happens in the 10th iteration.


qA^(10) = 0.80 (when you toss coin A, the probability that a head comes up is 0.8)
qB^(10) = 0.52 (when you toss coin B, the probability that a head comes up is 0.52)
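
A compact Python sketch of this two-coin EM loop, using the five head counts from the table and the same initial guesses; after a few iterations the estimates approach the values quoted above.

```python
# EM for the two-coin problem: alternate computing responsibilities (E step)
# and re-estimating the head probabilities from expected counts (M step).

from math import comb

heads = [5, 9, 8, 4, 7]              # heads observed in each set of n = 10 tosses
n = 10
qA, qB = 0.6, 0.5                    # initial guesses

for _ in range(10):
    A_heads = A_tails = B_heads = B_tails = 0.0
    for h in heads:
        # E step: responsibility of each coin for this set (binomial likelihoods)
        lA = comb(n, h) * qA**h * (1 - qA)**(n - h)
        lB = comb(n, h) * qB**h * (1 - qB)**(n - h)
        wA, wB = lA / (lA + lB), lB / (lA + lB)
        A_heads += wA * h; A_tails += wA * (n - h)
        B_heads += wB * h; B_tails += wB * (n - h)
    # M step: re-estimate the head probabilities from the expected counts
    qA = A_heads / (A_heads + A_tails)
    qB = B_heads / (B_heads + B_tails)

print(round(qA, 2), round(qB, 2))    # approaches 0.80 and 0.52
```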
EM for K-means
1. Initialize means 𝜇𝑘 .
2. E step: Assign each point to a cluster.
3. M step: Given clusters, refine mean 𝜇𝑘 of each cluster k.
4. Stop when change in mean is small.
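
A minimal sketch of this E/M loop for one-dimensional points; the data and the two initial means are illustrative.

```python
# K-means as alternating E and M steps: assign points to the nearest mean,
# then refine each mean as the average of its assigned points.

def kmeans(points, means, iterations=20):
    for _ in range(iterations):
        # E step: assign each point to a cluster
        clusters = [[] for _ in means]
        for p in points:
            nearest = min(range(len(means)), key=lambda k: abs(p - means[k]))
            clusters[nearest].append(p)
        # M step: refine the mean of each cluster
        means = [sum(c) / len(c) if c else m for c, m in zip(clusters, means)]
    return means

points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
print(kmeans(points, means=[0.0, 10.0]))   # roughly [1.0, 5.07]
```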

EM for Gaussian Mixtures


1. Initialize the Gaussian mixture parameters: means 𝜇𝑘, covariances Ʃ𝑘 and mixing coefficients П𝑘.
2. E step: Assign each point an assignment score (responsibility) for each cluster k.
3. M step: Given the scores, adjust 𝜇𝑘, Ʃ𝑘 and П𝑘 for each cluster k.
4. Evaluate the likelihood. If the likelihood or the parameters converge, stop; otherwise repeat from
step 2.

Strength of EM
 Numerical stability: in every iteration of the EM algorithm, the likelihood of the observed data
increases.
 The EM algorithm handles parameter constraints gracefully.
Problems with EM
 Convergence can be very slow on some problems and is intimately related to the amount of
missing information.
 It guarantees to improve the probability of the training corpus, which is different from reducing
the errors directly.
 It cannot guarantee to reach global maximum.

REINFORCEMENT LEARNING
Reinforcement learning is close to human learning. Algorithm learns a policy of how to act in
a given environment. Every action has some impact in the environment, and the environment provides
rewards that guides the learning algorithm. Reinforcement learning deals with agents that must sense
and act upon their environment.
In many complex domains, reinforcement learning is the only feasible way to train a program
to perform at high levels. For example, in game playing, it is very hard for a human to provide accurate
and consistent evaluations of large numbers of positions, which would be needed to train an evaluation
function directly from examples. Instead, the program can be told when it has won or lost, and it can
use this information to learn an evaluation function that gives reasonably accurate estimates of the
probability of winning from any given position.
Passive reinforcement learning
The agent’s policy is fixed and the task is to learn the utilities of states (or state–action pairs); this
could also involve learning a model of the environment.
 The agent sees the sequences of state transitions and the associated rewards.
 The environment generates the state transitions and the agent perceives them.

The agent executes a set of trials in the environment using its policy π. In each trial, the agent
starts in state (1,1) and experiences a sequence of state transitions until it reaches one of the terminal
states, (4,2) or (4,3). Its percepts supply both the current state and the reward received in that state, so
a typical trial is a sequence of visited states, each annotated with the reward received there.

Naïve updating
Naïve updating is otherwise called as LMS (least mean squares) approach. In essence, it assumes that
for each state in a training sequence, the observed reward-to-go on that sequence provides direct
evidence of the actual expected reward-to-go. Thus, at the end of each sequence, the algorithm
calculates the observed reward-to-go for each state and updates the estimated utility for that state
accordingly.
Adaptive dynamic programming
The term adaptive dynamic programming (or ADP) denotes any reinforcement learning method that
works by solving the utility equations with a dynamic programming algorithm. In terms of its ability to
make good use of experience, ADP provides a standard against which to measure other reinforcement
learning algorithms. The utilities are computed by solving the set of equations
U(i) = R(i) + Σj Mij U(j)
where R(i) is the reward associated with being in state i, and Mij is the probability that a transition will
occur from state i to state j.

Temporal difference learning


The key is to use the observed transitions to adjust the values of the observed states so that they agree
with the constraint equations. Suppose that we observe a transition from state i to state j, where
currently U(i) = −0.5 and U(j) = +0.5. This suggests that we should consider increasing U(i) to make it
agree better with its successor. This can be achieved using the following updating rule:
U(i) ← U(i) + α (R(i) + U(j) − U(i))
where α is the learning rate parameter. Because this update rule uses the difference in utilities between
successive states, it is often called the temporal-difference, or TD, equation. The basic idea of all
temporal-difference methods is to first define the conditions that hold locally when the utility estimates
are correct, and then to write an update equation that moves the estimates toward this ideal "equilibrium"
equation.
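
A small Python sketch of this TD update applied repeatedly to one trial; the states, rewards and learning rate below are illustrative, not the grid-world trial from the text.

```python
# Temporal-difference update: after a transition from state i to state j,
# nudge U(i) toward R(i) + U(j).

def td_update(U, trial, alpha=0.1):
    """U maps states to utility estimates; trial is a list of (state, reward) pairs."""
    for (i, r_i), (j, _) in zip(trial, trial[1:]):
        U[i] = U.get(i, 0.0) + alpha * (r_i + U.get(j, 0.0) - U.get(i, 0.0))
    terminal_state, terminal_reward = trial[-1]
    U[terminal_state] = terminal_reward          # a terminal state's utility is its reward
    return U

U = {}
trial = [("s1", -0.04), ("s2", -0.04), ("s3", +1.0)]
for _ in range(50):                              # repeat the same trial to show convergence
    U = td_update(U, trial)
print(U)                                         # U(s2) approaches 0.96, U(s1) approaches 0.92
```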

Active learning
Active learning, where the agent must also learn what to do. An active agent must consider
what actions to take, what their outcomes may be, and how they will affect the rewards received.
 The environment model must now incorporate the probabilities of transitions to other states
given a particular action. We will use Mᵃij to denote the probability of reaching state j if
action a is taken in state i.
 The constraints on the utility of each state must now take into account the fact that the agent
has a choice of actions. A rational agent will maximize its expected utility
 The agent must now choose an action at each step, and will need a performance element to do
so. In the algorithm, this means calling PERFORMANCE-ELEMENT(e) and returning the
resulting action.
