CS229 HMM Notes
Daniel Ramage
December 1, 2007

Abstract
How can we apply machine learning to data that is represented as a sequence of observations over time? For instance, we might be interested in discovering the sequence of words that someone spoke based on an audio recording of their speech. Or we might be interested in annotating a sequence of words with their part-of-speech tags. These notes provide a thorough mathematical introduction to the concept of Markov Models, a formalism for reasoning about states over time, and Hidden Markov Models, where we wish to recover a series of states from a series of observations. The final section includes some pointers to resources that present this material from other perspectives.
1 Markov Models
Given a set of states S = {s1, s2, ..., s|S|}, we can observe a series over time ~z ∈ S^T. For example, we might have the states from a weather system S = {sun, cloud, rain} with |S| = 3 and observe the weather over a few days {z1 = ssun, z2 = scloud, z3 = scloud, z4 = srain, z5 = scloud} with T = 5.
The observed states of our weather example represent the output of a random process over time. Without some further assumptions, state sj at time t could be a function of any number of variables, including all the states from times 1 to t−1 and possibly many others that we don't even model. However, we will make two Markov assumptions that will allow us to tractably reason about time series.
The limited horizon assumption is that the probability of being in a state at time t depends only on the state at time t − 1. The intuition underlying this assumption is that the state at time t represents enough summary of the past to reasonably predict the future. Formally:

P(zt | zt−1, zt−2, ..., z1) = P(zt | zt−1)

The stationary process assumption is that the conditional distribution over the next state given the current state does not change over time. Formally:

P(zt | zt−1) = P(z2 | z1); t ∈ 2..T
As a convention, we will also assume that there is an initial state and initial observation z0 ≡ s0, where s0 represents the initial probability distribution over states at time 0. This notational convenience allows us to encode our belief about the prior probability of seeing the first real state z1 as P(z1 | z0). Note that P(zt | zt−1, ..., z1) = P(zt | zt−1, ..., z1, z0) because we've defined z0 = s0 for any state sequence. (Other presentations of HMMs sometimes represent these prior beliefs with a vector π ∈ R^|S|.)
We parametrize these transitions by defining a state transition matrix A ∈ R^((|S|+1)×(|S|+1)). The value Aij is the probability of transitioning from state i to state j at any time t. For our sun and rain example, we might have the following transition matrix (rows index the previous state, columns the next; the values are consistent with the probabilities used in the example computed below):

            s0    ssun  scloud srain
    s0      0     .33   .33    .33
    ssun    0     .8    .1     .1
    scloud  0     .2    .6     .2
    srain   0     .1    .2     .7
Note that these numbers (which I made up) represent the intuition that the weather is self-correlated: if it's sunny it will tend to stay sunny, cloudy will stay cloudy, etc. This pattern is common in many Markov models and can be observed as a strong diagonal in the transition matrix. Note that in this example, our initial state s0 shows uniform probability of transitioning to each of the three states in our weather system.
Using our convention for z0 and the Markov assumptions, the joint probability of a series of states ~z factors as:

P(~z; A) = P(zT, zT−1, ..., z1; A)
         = P(zT, zT−1, ..., z1, z0; A)
         = P(zT | zT−1, ..., z1; A) P(zT−1 | zT−2, ..., z1; A) · · · P(z1 | z0; A)
         = ∏_{t=1}^T P(zt | zt−1; A)
         = ∏_{t=1}^T A_{zt−1, zt}
In the second line we introduce z0 into our joint probability, which is allowed by the definition of z0 above. The third line is true of any joint distribution by the chain rule of probabilities or repeated application of Bayes rule. The fourth line follows from the Markov assumptions and the last line represents these terms as their elements in our transition matrix A.
Let's compute the probability of an example time sequence. We want P(z1 = ssun, z2 = scloud, z3 = srain, z4 = srain, z5 = scloud), which can be factored as P(ssun | s0) P(scloud | ssun) P(srain | scloud) P(srain | srain) P(scloud | srain) = .33 × .1 × .2 × .7 × .2.
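This factorization is easy to check in code. The sketch below hardcodes the illustrative transition matrix from above (the self-correlated values are made up for the example; index 0 plays the role of s0, 1 = sun, 2 = cloud, 3 = rain) and multiplies the appropriate entries:

```python
# Probability of a state sequence under a Markov chain.
# States: 0 = s0 (start), 1 = sun, 2 = cloud, 3 = rain.
A = [
    [0.0, 0.33, 0.33, 0.33],  # from s0: uniform over the weather states
    [0.0, 0.80, 0.10, 0.10],  # from sun
    [0.0, 0.20, 0.60, 0.20],  # from cloud
    [0.0, 0.10, 0.20, 0.70],  # from rain
]

def sequence_probability(z, A):
    """P(z1..zT) = prod_t A[z_{t-1}][z_t], with the start state z0 = 0 prepended."""
    prob = 1.0
    prev = 0  # z0 = s0 by convention
    for state in z:
        prob *= A[prev][state]
        prev = state
    return prob

# sun, cloud, rain, rain, cloud  ->  .33 * .1 * .2 * .7 * .2
p = sequence_probability([1, 2, 3, 3, 2], A)
```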
From a learning perspective, we could seek to find the parameters A that maximize the log-likelihood of a sequence of observations ~z. This corresponds to finding the likelihoods of transitioning from sunny to cloudy versus sunny to sunny, etc., that make a set of observations most likely. Let's define the log-likelihood of a Markov model:

l(A) = log P(~z; A)
     = log ∏_{t=1}^T A_{zt−1, zt}
     = Σ_{t=1}^T log A_{zt−1, zt}
     = Σ_{i=1}^{|S|} Σ_{j=1}^{|S|} Σ_{t=1}^T 1{zt−1 = si ∧ zt = sj} log Aij
In the last line, we use an indicator function whose value is one when the condition holds and zero otherwise to select the observed transition at each time step. When solving this optimization problem, it's important to ensure that the solved parameters A still make a valid transition matrix. In particular, we need to enforce that the outgoing probability distribution from state i always sums to 1 and all elements of A are non-negative. We can solve this optimization problem using the method of Lagrange multipliers.
max_A  l(A)
s.t.   Σ_{j=1}^{|S|} Aij = 1, i = 1..|S|
       Aij ≥ 0, i, j = 1..|S|
L(A, α) = Σ_{i=1}^{|S|} Σ_{j=1}^{|S|} Σ_{t=1}^T 1{zt−1 = si ∧ zt = sj} log Aij + Σ_{i=1}^{|S|} αi (1 − Σ_{j=1}^{|S|} Aij)
∂L(A, α)/∂Aij = ∂/∂Aij (Σ_{t=1}^T 1{zt−1 = si ∧ zt = sj} log Aij) + ∂/∂Aij αi (1 − Σ_{j=1}^{|S|} Aij)
              = (1/Aij) Σ_{t=1}^T 1{zt−1 = si ∧ zt = sj} − αi ≡ 0

⇒ Aij = (1/αi) Σ_{t=1}^T 1{zt−1 = si ∧ zt = sj}
∂L(A, α)/∂αi = 1 − Σ_{j=1}^{|S|} Aij
             = 1 − Σ_{j=1}^{|S|} (1/αi) Σ_{t=1}^T 1{zt−1 = si ∧ zt = sj} ≡ 0

⇒ αi = Σ_{j=1}^{|S|} Σ_{t=1}^T 1{zt−1 = si ∧ zt = sj}
     = Σ_{t=1}^T 1{zt−1 = si}
Substituting in this value for αi into the expression we derived for Aij, we obtain our final maximum likelihood parameter value for Âij:

Âij = (Σ_{t=1}^T 1{zt−1 = si ∧ zt = sj}) / (Σ_{t=1}^T 1{zt−1 = si})
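This closed form says the maximum likelihood estimate is just normalized transition counts. A minimal sketch (integer-coded states, with 0 as the start state s0):

```python
from collections import Counter

def mle_transitions(z, n_states):
    """Maximum likelihood A_ij: count transitions i -> j and normalize
    by the number of visits to i. States are integers 0..n_states-1;
    index 0 is the start state s0, which is prepended to the sequence."""
    trans = Counter()
    visits = Counter()
    prev = 0
    for state in z:
        trans[(prev, state)] += 1
        visits[prev] += 1
        prev = state
    A = [[0.0] * n_states for _ in range(n_states)]
    for (i, j), c in trans.items():
        A[i][j] = c / visits[i]
    return A

# With states sun=1, cloud=2, the sequence s0 -> 1 -> 1 -> 2 gives
# A[0][1] = 1.0, A[1][1] = 0.5, A[1][2] = 0.5
A_hat = mle_transitions([1, 1, 2], 3)
```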
2 Hidden Markov Models

In a Hidden Markov Model, we do not observe the states themselves but only some probabilistic function of them. At each time step the hidden state zt emits an output xt drawn from an output vocabulary V, governed by an emission matrix B ∈ R^(|S|×|V|) whose entry Bjk is the probability of state sj emitting output vk. The output independence assumption states that P(xt | x1, ..., xt−1, z1, ..., zt) = P(xt | zt) = B_{zt, xt}. For example, suppose the hidden states are the weather from before and our observations are how many ice creams were consumed each day, where our alphabet just encodes the number of ice creams consumed, i.e. V = {v1 = 1 ice cream, v2 = 2 ice creams, v3 = 3 ice creams}. What questions can an HMM let us answer?
There are three fundamental questions we might ask of an HMM. What is the probability of an observed sequence (how likely were we to see 3, 2, 1, 2 ice creams consumed)? What is the most likely series of states to generate the observations (what was the weather for those four days)? And how can we learn values for the HMM's parameters A and B given some data?
In an HMM, we assume that our data was generated by the following process: posit the existence of a series of states ~z over the length of our time series. This state sequence is generated by a Markov model parametrized by a state transition matrix A. At each time step t, we select an output xt as a function of the state zt. Therefore, to get the probability of a sequence of observations, we need to add up the likelihood of the data ~x given every possible series of states.

P(~x; A, B) = Σ_{~z} P(~x, ~z; A, B)
            = Σ_{~z} P(~x | ~z; A, B) P(~z; A, B)
The formulas above are true for any probability distribution. However, the HMM assumptions allow us to simplify the expression further:

P(~x; A, B) = Σ_{~z} P(~x | ~z; A, B) P(~z; A, B)
            = Σ_{~z} (∏_{t=1}^T P(xt | zt; B)) (∏_{t=1}^T P(zt | zt−1; A))
            = Σ_{~z} (∏_{t=1}^T B_{zt, xt}) (∏_{t=1}^T A_{zt−1, zt})
The good news is that this is a simple expression in terms of our parameters. The derivation follows the HMM assumptions: the output independence assumption, Markov assumption, and stationary process assumption are all used to derive the second line. The bad news is that the sum is over every possible assignment to ~z. Because zt can take one of |S| possible values at each time step, evaluating this sum directly will require O(|S|^T) operations.
The Forward Procedure is a dynamic programming algorithm that computes this sum efficiently by caching αi(t) = P(x1, ..., xt, zt = si; A, B), the probability of emitting the first t observations and ending in state si. Each value is computed from the previous time step's values, so the total cost is O(|S|²·T) rather than O(|S|^T).

Algorithm 1 Forward Procedure for computing αi(t)

1. Base case: αi(0) = A_{0,i}, i = 1..|S|
2. Recursion: αj(t) = Σ_{i=1}^{|S|} αi(t − 1) Aij B_{j,xt}, j = 1..|S|, t = 1..T

The total probability of the observations is then

P(~x; A, B) = Σ_{i=1}^{|S|} αi(T)
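In code, the recursion is a single loop over time with a sum over previous states at each step. This sketch (a Python illustration, not from the notes) starts the table as a one-hot distribution on the start state s0, so the first recursion step produces αj(1) = A_{0,j} B_{j,x1}; A and B follow the conventions above, with row/column 0 reserved for s0:

```python
def forward(x, A, B):
    """Compute P(x_1..x_T; A, B) with the Forward Procedure.
    A: (|S|+1) x (|S|+1) transition matrix, start state s0 at index 0.
    B: B[j][k] = probability that state j emits symbol k (row 0 unused).
    Runs in O(|S|^2 T) time instead of the O(|S|^T) naive sum."""
    n = len(A)
    alpha = [1.0 if i == 0 else 0.0 for i in range(n)]  # one-hot on s0
    for x_t in x:
        # alpha_j(t) = B[j][x_t] * sum_i alpha_i(t-1) * A[i][j]
        alpha = [B[j][x_t] * sum(alpha[i] * A[i][j] for i in range(n))
                 for j in range(n)]
    return sum(alpha)  # P(x) = sum_i alpha_i(T)
```

For short sequences the result can be checked against the naive sum over every possible state assignment.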
One of the most common queries of a Hidden Markov Model is to ask what was the most likely series of states ~z ∈ S^T given an observed series of outputs ~x ∈ V^T. Formally, we seek:

arg max_{~z} P(~z | ~x; A, B) = arg max_{~z} P(~x, ~z; A, B) / (Σ_{~z′} P(~x, ~z′; A, B)) = arg max_{~z} P(~x, ~z; A, B)
The first simplification follows from Bayes rule and the second from the observation that the denominator does not directly depend on ~z. Naively, we might try every possible assignment to ~z and take the one with the highest joint probability assigned by our model. However, this would require O(|S|^T) operations just to enumerate the set of possible assignments. At this point, you might think a dynamic programming solution like the Forward Algorithm might save the day, and you'd be right. Notice that if you replaced the arg max_{~z} with Σ_{~z}, our current task is exactly analogous to the expression which motivated the forward procedure.
Algorithm 2 Naive application of EM to HMMs

Repeat until convergence:
(E-Step) Set
    Q(~z) := P(~z | ~x; A, B)
(M-Step) Set
    A, B := arg max_{A,B} Σ_{~z} Q(~z) log (P(~x, ~z; A, B) / Q(~z))
    s.t. Σ_{j=1}^{|S|} Aij = 1, i = 1..|S|;  Aij ≥ 0, i, j = 1..|S|
         Σ_{k=1}^{|V|} Bik = 1, i = 1..|S|;  Bik ≥ 0, i = 1..|S|, k = 1..|V|
The Viterbi Algorithm is just like the forward procedure except that instead of tracking the total probability of generating the observations seen so far, we need only track the maximum probability and record its corresponding state sequence.
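A sketch of that idea (same conventions as the forward code: index 0 is the start state, and A and B are illustrative): the forward sum becomes a max, and backpointers record which previous state achieved it.

```python
def viterbi(x, A, B):
    """Most likely state sequence arg max_z P(x, z; A, B).
    A includes the start state s0 at index 0; B[j][k] is the probability
    that state j emits symbol k (row 0 unused)."""
    n = len(A)
    # delta[j] = max over paths ending in state j of the joint probability
    delta = [1.0 if i == 0 else 0.0 for i in range(n)]  # one-hot on s0
    backpointers = []
    for x_t in x:
        scores = [[delta[i] * A[i][j] * B[j][x_t] for i in range(n)]
                  for j in range(n)]
        bp = [max(range(n), key=lambda i: scores[j][i]) for j in range(n)]
        delta = [scores[j][bp[j]] for j in range(n)]
        backpointers.append(bp)
    # Trace back from the best final state.
    best = max(range(n), key=lambda j: delta[j])
    path = [best]
    for bp in reversed(backpointers):
        path.append(bp[path[-1]])
    path.reverse()
    return path[1:]  # drop the start state z0
```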
The final question to ask of an HMM is: given a set of observations, what are the values of the state transition probabilities A and the output emission probabilities B that make the data most likely? For example, solving for the maximum likelihood parameters based on a speech recognition dataset will allow us to effectively train the HMM before asking for the maximum likelihood state assignment of a candidate speech signal.
In this section, we present a derivation of the Expectation Maximization algorithm for Hidden Markov Models. This proof follows from the general formulation of EM presented in the CS229 lecture notes. Algorithm 2 shows the basic EM algorithm. Notice that the optimization problem in the M-Step is now constrained such that A and B contain valid probabilities. Like the maximum likelihood solution we found for (non-Hidden) Markov models, we'll be able to solve this optimization problem with Lagrange multipliers. Notice also that the E-Step and M-Step both require enumerating all |S|^T possible labellings of ~z. We'll make use of the Forward and Backward algorithms mentioned earlier to compute a set of sufficient statistics for our E-Step and M-Step tractably.

First, let's rewrite the objective function using our Markov assumptions.
A, B = arg max_{A,B} Σ_{~z} Q(~z) log (P(~x, ~z; A, B) / Q(~z))
     = arg max_{A,B} Σ_{~z} Q(~z) log P(~x, ~z; A, B)
     = arg max_{A,B} Σ_{~z} Q(~z) log (∏_{t=1}^T P(xt | zt; B)) (∏_{t=1}^T P(zt | zt−1; A))
     = arg max_{A,B} Σ_{~z} Q(~z) Σ_{t=1}^T (log B_{zt, xt} + log A_{zt−1, zt})
     = arg max_{A,B} Σ_{~z} Q(~z) Σ_{i=1}^{|S|} Σ_{j=1}^{|S|} Σ_{k=1}^{|V|} Σ_{t=1}^T (1{zt = sj ∧ xt = vk} log Bjk + 1{zt−1 = si ∧ zt = sj} log Aij)
In the first line we split the log division into a subtraction and note that the denominator's term does not depend on the parameters A, B. The Markov assumptions are applied in line 3. Line 5 uses indicator functions to index A and B by state.

Just as for the maximum likelihood parameters for a visible Markov model, it is safe to ignore the inequality constraints because the solution form naturally results in only positive solutions. Constructing the Lagrangian:
L(A, B, δ, ǫ) = Σ_{~z} Q(~z) Σ_{i=1}^{|S|} Σ_{j=1}^{|S|} Σ_{k=1}^{|V|} Σ_{t=1}^T (1{zt = sj ∧ xt = vk} log Bjk + 1{zt−1 = si ∧ zt = sj} log Aij)
              + Σ_{j=1}^{|S|} ǫj (1 − Σ_{k=1}^{|V|} Bjk) + Σ_{i=1}^{|S|} δi (1 − Σ_{j=1}^{|S|} Aij)

∂L(A, B, δ, ǫ)/∂Aij = Σ_{~z} Q(~z) (1/Aij) Σ_{t=1}^T 1{zt−1 = si ∧ zt = sj} − δi ≡ 0

Aij = (1/δi) Σ_{~z} Q(~z) Σ_{t=1}^T 1{zt−1 = si ∧ zt = sj}

∂L(A, B, δ, ǫ)/∂Bjk = Σ_{~z} Q(~z) (1/Bjk) Σ_{t=1}^T 1{zt = sj ∧ xt = vk} − ǫj ≡ 0

Bjk = (1/ǫj) Σ_{~z} Q(~z) Σ_{t=1}^T 1{zt = sj ∧ xt = vk}
Taking partial derivatives with respect to the Lagrange multipliers and substituting our values of Aij and Bjk above:

∂L(A, B, δ, ǫ)/∂δi = 1 − Σ_{j=1}^{|S|} Aij
                   = 1 − Σ_{j=1}^{|S|} (1/δi) Σ_{~z} Q(~z) Σ_{t=1}^T 1{zt−1 = si ∧ zt = sj} ≡ 0

δi = Σ_{j=1}^{|S|} Σ_{~z} Q(~z) Σ_{t=1}^T 1{zt−1 = si ∧ zt = sj}
   = Σ_{~z} Q(~z) Σ_{t=1}^T 1{zt−1 = si}
∂L(A, B, δ, ǫ)/∂ǫj = 1 − Σ_{k=1}^{|V|} Bjk
                   = 1 − Σ_{k=1}^{|V|} (1/ǫj) Σ_{~z} Q(~z) Σ_{t=1}^T 1{zt = sj ∧ xt = vk} ≡ 0

ǫj = Σ_{k=1}^{|V|} Σ_{~z} Q(~z) Σ_{t=1}^T 1{zt = sj ∧ xt = vk}
   = Σ_{~z} Q(~z) Σ_{t=1}^T 1{zt = sj}
Substituting these expressions for δi and ǫj back in, we get our maximum likelihood estimates:

Âij = (Σ_{~z} Q(~z) Σ_{t=1}^T 1{zt−1 = si ∧ zt = sj}) / (Σ_{~z} Q(~z) Σ_{t=1}^T 1{zt−1 = si})

B̂jk = (Σ_{~z} Q(~z) Σ_{t=1}^T 1{zt = sj ∧ xt = vk}) / (Σ_{~z} Q(~z) Σ_{t=1}^T 1{zt = sj})

Each of these sums is still over all |S|^T assignments to ~z, but the forward probabilities αi(t) and their backward analogues βj(t) let us compute them efficiently. Consider the numerator of Âij:

Σ_{~z} Q(~z) Σ_{t=1}^T 1{zt−1 = si ∧ zt = sj}
= Σ_{t=1}^T Σ_{~z} 1{zt−1 = si ∧ zt = sj} Q(~z)
= Σ_{t=1}^T Σ_{~z} 1{zt−1 = si ∧ zt = sj} P(~z | ~x; A, B)
= (1 / P(~x; A, B)) Σ_{t=1}^T Σ_{~z} 1{zt−1 = si ∧ zt = sj} P(~z, ~x; A, B)
= (1 / P(~x; A, B)) Σ_{t=1}^T αi(t) Aij B_{j,xt} βj(t + 1)
In the first two steps we rearrange terms and substitute in for our definition of Q. Then we use Bayes rule in deriving line four, followed by the definitions of α, β, A, and B, in line five. Similarly, the denominator can be represented by summing out over j the value of the numerator.
Σ_{~z} Q(~z) Σ_{t=1}^T 1{zt−1 = si}
= Σ_{j=1}^{|S|} Σ_{~z} Q(~z) Σ_{t=1}^T 1{zt−1 = si ∧ zt = sj}
= (1 / P(~x; A, B)) Σ_{j=1}^{|S|} Σ_{t=1}^T αi(t) Aij B_{j,xt} βj(t + 1)

Dividing, the 1 / P(~x; A, B) factors cancel:

Âij = (Σ_{t=1}^T αi(t) Aij B_{j,xt} βj(t + 1)) / (Σ_{j′=1}^{|S|} Σ_{t=1}^T αi(t) Aij′ B_{j′,xt} βj′(t + 1))
The emission probabilities follow similarly. For the numerator of B̂jk:

Σ_{~z} Q(~z) Σ_{t=1}^T 1{zt = sj ∧ xt = vk}
= (1 / P(~x; A, B)) Σ_{t=1}^T Σ_{~z} 1{zt = sj ∧ xt = vk} P(~z, ~x; A, B)
= (1 / P(~x; A, B)) Σ_{i=1}^{|S|} Σ_{t=1}^T Σ_{~z} 1{zt−1 = si ∧ zt = sj ∧ xt = vk} P(~z, ~x; A, B)
= (1 / P(~x; A, B)) Σ_{i=1}^{|S|} Σ_{t=1}^T 1{xt = vk} αi(t) Aij B_{j,xt} βj(t + 1)

And for the denominator:

Σ_{~z} Q(~z) Σ_{t=1}^T 1{zt = sj}
= (1 / P(~x; A, B)) Σ_{i=1}^{|S|} Σ_{t=1}^T Σ_{~z} 1{zt−1 = si ∧ zt = sj} P(~z, ~x; A, B)
= (1 / P(~x; A, B)) Σ_{i=1}^{|S|} Σ_{t=1}^T αi(t) Aij B_{j,xt} βj(t + 1)

Combining these expressions, we have the following form for our maximum likelihood emission probabilities:

B̂jk = (Σ_{i=1}^{|S|} Σ_{t=1}^T 1{xt = vk} αi(t) Aij B_{j,xt} βj(t + 1)) / (Σ_{i=1}^{|S|} Σ_{t=1}^T αi(t) Aij B_{j,xt} βj(t + 1))

Algorithm 3 summarizes the resulting updates.

Algorithm 3 Forward-Backward algorithm for HMM parameter learning

γt(i, j) := αi(t) Aij B_{j,xt} βj(t + 1)

Aij := (Σ_{t=1}^T γt(i, j)) / (Σ_{j′=1}^{|S|} Σ_{t=1}^T γt(i, j′))

Bjk := (Σ_{i=1}^{|S|} Σ_{t=1}^T 1{xt = vk} γt(i, j)) / (Σ_{i=1}^{|S|} Σ_{t=1}^T γt(i, j))
In the E-Step, rather than explicitly evaluating Q(~z) for all ~z ∈ S^T, we compute a sufficient statistic γt(i, j) = αi(t) Aij B_{j,xt} βj(t + 1) that is proportional to the probability of transitioning between state si and sj at time t given all of our observations ~x. The derived expressions for Aij and Bjk are intuitively appealing. Aij is computed as the expected number of transitions from si to sj divided by the expected number of appearances of si. Similarly, Bjk is computed as the expected number of emissions of vk from sj divided by the expected number of appearances of sj.
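One full iteration of these updates can be sketched as follows (illustrative Python, not from the notes; the α and β tables use the standard indexing, with α started one-hot on the start state s0, so the statistic γ appears as alpha[t−1][i] · A[i][j] · B[j][x_t] · beta[t][j]):

```python
def em_step(x, A, B):
    """One Forward-Backward (Baum-Welch) re-estimation of A and B.
    Conventions: state 0 is the start state s0 (never re-entered),
    x is a list of symbol indices,
    alpha[t][i] = P(x_1..x_t, z_t = s_i),
    beta[t][i]  = P(x_{t+1}..x_T | z_t = s_i)."""
    n, T = len(A), len(x)
    V = len(B[0])
    # Forward table: alpha[0] is one-hot on the start state.
    alpha = [[1.0 if i == 0 else 0.0 for i in range(n)]]
    for t in range(1, T + 1):
        alpha.append([B[j][x[t - 1]] * sum(alpha[t - 1][i] * A[i][j]
                                           for i in range(n))
                      for j in range(n)])
    # Backward table: beta[T][i] = 1.
    beta = [[1.0] * n for _ in range(T + 1)]
    for t in range(T - 1, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][x[t]] * beta[t + 1][j]
                       for j in range(n)) for i in range(n)]
    # gamma[t][i][j] ~ expected transition s_i -> s_j emitting x at step t+1
    gamma = [[[alpha[t - 1][i] * A[i][j] * B[j][x[t - 1]] * beta[t][j]
               for j in range(n)] for i in range(n)]
             for t in range(1, T + 1)]
    # Re-estimate, guarding rows with zero expected visits.
    A_new = [[0.0] * n for _ in range(n)]
    B_new = [[0.0] * V for _ in range(n)]
    for i in range(n):
        denom = sum(gamma[t][i][j] for t in range(T) for j in range(n))
        if denom > 0:
            for j in range(n):
                A_new[i][j] = sum(gamma[t][i][j] for t in range(T)) / denom
    for j in range(n):
        denom = sum(gamma[t][i][j] for t in range(T) for i in range(n))
        if denom > 0:
            for k in range(V):
                B_new[j][k] = sum(gamma[t][i][j] for t in range(T)
                                  for i in range(n) if x[t] == k) / denom
    return A_new, B_new
```

In practice one repeats em_step until the likelihood stops improving, restarts from several initializations, and smooths A and B so no entry collapses to zero.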
Like many applications of EM, parameter learning for HMMs is a non-convex problem with many local maxima. EM will converge to a maximum based on its initial parameters, so multiple runs might be in order. Also, it is often important to smooth the probability distributions represented by A and B so that no transition or emission is assigned 0 probability.
There are many good sources for learning about Hidden Markov Models. For applications in NLP, I recommend consulting Jurafsky & Martin's draft second edition of Speech and Language Processing¹ or Manning & Schütze's Foundations of Statistical Natural Language Processing. Also, Eisner's HMM-in-a-spreadsheet [1] is a lightweight interactive way to play with an HMM that requires only a spreadsheet application.
References

[1] Jason Eisner. An interactive spreadsheet for teaching the forward-backward algorithm. In Dragomir Radev and Chris Brew, editors, Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching NLP and CL, pages 10–18, 2002.

¹ http://www.cs.colorado.edu/~martin/slp2.html