Title: Fast Temporal State-Splitting for HMM Model Selection and Learning
1. Fast Temporal State-Splitting for HMM Model Selection and Learning
- Sajid Siddiqi
- Geoffrey Gordon
- Andrew Moore
2-6. [Figure: an observation sequence x over time, built up step by step]
How many kinds of observations (x)? 3
How many kinds of transitions (x_t → x_{t+1})? 4
7. We say that this sequence exhibits four states under the first-order Markov assumption. Our goal is to discover the number of such states (and their parameter settings) in sequential data, and to do so efficiently.
8. Definitions
- An HMM is a 3-tuple λ = (A, B, π), where
  - A: N×N transition matrix
  - B: N×M observation probability matrix
  - π: N×1 prior probability vector
- |λ|: number of states in HMM λ, i.e. N
- T: number of observations in the sequence
- q_t: the state the HMM is in at time t
9. HMMs as DBNs
10. Transition Model
Row i of A holds the distribution over next states, a_ij = P(q_{t+1} = s_j | q_t = s_i):

i | P(q_{t+1}=s_1|q_t=s_i)  P(q_{t+1}=s_2|q_t=s_i)  …  P(q_{t+1}=s_j|q_t=s_i)  …  P(q_{t+1}=s_N|q_t=s_i)
1 | a_11  a_12  …  a_1j  …  a_1N
2 | a_21  a_22  …  a_2j  …  a_2N
3 | a_31  a_32  …  a_3j  …  a_3N
i | a_i1  a_i2  …  a_ij  …  a_iN
N | a_N1  a_N2  …  a_Nj  …  a_NN

In the DBN, each of these probability tables is identical across timesteps.
11Observation Model
O0
i P(Ot1qtsi) P(Ot2qtsi) P(Otkqtsi) P(OtMqtsi)
1 b1(1) b1 (2) b1 (k) b1(M)
2 b2 (1) b2 (2) b2(k) b2 (M)
3 b3 (1) b3 (2) b3(k) b3 (M)
i bi(1) bi (2) bi(k) bi (M)
N bN (1) bN (2) bN(k) bN (M)
O1
O2
O3
Notation
O4
12. HMMs as DBNs
13. HMMs as FSAs
[Figure: the same HMM drawn as a finite-state automaton over states S1, S2, S3, S4]
14. Operations on HMMs
- Problem 1: Evaluation
  - Given an HMM and an observation sequence, what is the likelihood of this sequence?
- Problem 2: Most Probable Path
  - Given an HMM and an observation sequence, what is the most probable path through state space?
- Problem 3: Learning HMM parameters
  - Given an observation sequence and a fixed number of states, what is an HMM that is likely to have produced this string of observations?
- Problem 4: Learning the number of states
  - Given an observation sequence, what is an HMM (of any size) that is likely to have produced this string of observations?
15. Operations on HMMs

Problem | Computation | Algorithm | Complexity
Evaluation | P(O|λ) | Forward-Backward | O(TN²)
Path inference | Q* = argmax_Q P(O, Q|λ) | Viterbi | O(TN²)
Parameter learning | λ* = argmax_{λ,Q} P(O, Q|λ) | Viterbi Training | O(TN²) per iteration
Parameter learning | λ* = argmax_λ P(O|λ) | Baum-Welch (EM) | O(TN²) per iteration
Learning the number of states | ?? | ?? | ??
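Problem 1 has a direct dynamic-programming solution; below is a minimal NumPy sketch of the forward recursion. The 2-state, 2-symbol parameters are made-up toy values, not from the talk:

```python
import numpy as np

def forward_likelihood(A, B, pi, obs):
    """Evaluation: P(O | lambda) via the forward recursion, O(T N^2)."""
    alpha = pi * B[:, obs[0]]           # alpha_1(i) = pi_i * b_i(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # alpha_{t+1}(j) = (sum_i alpha_t(i) a_ij) * b_j(o_{t+1})
    return alpha.sum()                  # P(O | lambda) = sum_i alpha_T(i)

# Toy 2-state, 2-symbol HMM (illustrative numbers only)
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
B = np.array([[0.7, 0.3],
              [0.1, 0.9]])
pi = np.array([0.5, 0.5])
print(forward_likelihood(A, B, pi, [0, 1, 1]))
```

In practice the recursion is run on scaled or log-space values to avoid underflow on long sequences.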
16. Path Inference
- Viterbi algorithm for calculating argmax_Q P(O, Q | λ)
17-18. [Figure: Viterbi trellis of δ_t(i) values, for t = 1…9 and states 1…N, filled in column by column]
19. Path Inference
- Viterbi algorithm for calculating argmax_Q P(O, Q | λ)
Running time: O(TN²). Yields a globally optimal path through hidden state space, associating each timestep with exactly one HMM state.
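The δ recursion from the trellis can be written out directly; here is a minimal log-space NumPy sketch of the Viterbi decoder, with toy parameters that are illustrative assumptions:

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Path inference: argmax_Q P(O, Q | lambda) in O(T N^2), in log space."""
    T, N = len(obs), len(pi)
    delta = np.log(pi) + np.log(B[:, obs[0]])    # delta_1(i)
    psi = np.zeros((T, N), dtype=int)            # backpointers
    for t in range(1, T):
        scores = delta[:, None] + np.log(A)      # scores[i, j] = delta_{t-1}(i) + log a_ij
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta.argmax())]                 # best final state
    for t in range(T - 1, 0, -1):                # follow backpointers
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# Toy parameters (illustrative numbers only)
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
B = np.array([[0.7, 0.3],
              [0.1, 0.9]])
pi = np.array([0.5, 0.5])
print(viterbi(A, B, pi, [0, 0, 1, 1, 1]))
```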
20-21. Parameter Learning I
- Viterbi Training (the k-means of sequences)
  Q^{s+1} = argmax_Q P(O, Q | λ^s)   (Viterbi algorithm)
  λ^{s+1} = argmax_λ P(O, Q^{s+1} | λ)
Running time: O(TN²) per iteration. Models the posterior belief as a δ-function per timestep in the sequence. Performs well on data with easily distinguishable states.
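A compact sketch of how such a hard-EM loop might look, alternating Viterbi decoding with count-based reestimation. The smoothing constant and the toy data are assumptions, not from the talk:

```python
import numpy as np

def viterbi_training(A, B, pi, obs, iters=10, smooth=1e-3):
    """Hard EM: alternate Viterbi decoding with count-based reestimation."""
    N, M = B.shape
    T = len(obs)
    for _ in range(iters):
        # hard E-step: best path q under the current parameters
        delta = np.log(pi) + np.log(B[:, obs[0]])
        psi = np.zeros((T, N), dtype=int)
        for t in range(1, T):
            sc = delta[:, None] + np.log(A)
            psi[t] = sc.argmax(axis=0)
            delta = sc.max(axis=0) + np.log(B[:, obs[t]])
        q = [int(delta.argmax())]
        for t in range(T - 1, 0, -1):
            q.append(int(psi[t, q[-1]]))
        q = q[::-1]
        # M-step: reestimate parameters from (smoothed) path counts
        A = np.full((N, N), smooth)
        B = np.full((N, M), smooth)
        for t in range(T - 1):
            A[q[t], q[t + 1]] += 1
        for t in range(T):
            B[q[t], obs[t]] += 1
        A /= A.sum(axis=1, keepdims=True)
        B /= B.sum(axis=1, keepdims=True)
        pi = np.full(N, smooth)
        pi[q[0]] += 1
        pi /= pi.sum()
    return A, B, pi

# Toy run (illustrative numbers only)
A0 = np.array([[0.6, 0.4], [0.4, 0.6]])
B0 = np.array([[0.6, 0.4], [0.3, 0.7]])
pi0 = np.array([0.5, 0.5])
obs = [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1]
A1, B1, pi1 = viterbi_training(A0, B0, pi0, obs)
```

The smoothing keeps all probabilities strictly positive, so the log-space decoding in the next iteration stays well defined.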
22-23. Parameter Learning II
- Baum-Welch (the GMM of sequences)
- Iterate the following two steps until convergence:
  - Calculate the expected complete log-likelihood given λ^s
  - Obtain updated model parameters λ^{s+1} by maximizing this log-likelihood
  Obj(λ, λ^s) = E_Q[ log P(O, Q | λ) | O, λ^s ]
  λ^{s+1} = argmax_λ Obj(λ, λ^s)
Running time: O(TN²) per iteration, but with a larger constant. Models the full posterior belief over hidden states per timestep. Effectively models sequences with overlapping states, at the cost of extra computation.
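The full posterior that Baum-Welch maintains can be sketched with the forward-backward recursions; γ_t(i) = P(q_t = s_i | O, λ) is the per-timestep belief referred to above. Toy parameters are assumed, and the pairwise posteriors needed to reestimate A are omitted for brevity:

```python
import numpy as np

def state_posteriors(A, B, pi, obs):
    """E-step quantity: gamma_t(i) = P(q_t = s_i | O, lambda)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                     # forward pass
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0                                   # backward pass
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta                             # proportional to P(q_t = i, O)
    return gamma / gamma.sum(axis=1, keepdims=True)

# Toy parameters (illustrative numbers only)
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
B = np.array([[0.7, 0.3],
              [0.1, 0.9]])
pi = np.array([0.5, 0.5])
gamma = state_posteriors(A, B, pi, [0, 1, 1])
```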
24-25. HMM Model Selection
- Distinction between the model search and the actual selection step
- We can search the space of HMMs with different N using parameter learning, and perform selection using a criterion like BIC.
Running time: O(Tn²) to compute the likelihood for BIC.
26-27. HMM Model Selection I
- for n = 1 … N_max
  - Initialize n-state HMM randomly
  - Learn model parameters
  - Calculate BIC score
  - If best so far, store model
  - If larger model not chosen, stop
Running time: O(Tn²) per iteration. Drawback: local minima in parameter optimization.
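The selection criterion can be sketched as log-likelihood minus a complexity penalty. The free-parameter count below is the standard one for a discrete HMM and is an assumption here; the talk's penalized BIC may differ in its exact form:

```python
import math

def hmm_bic(loglik, N, M, T):
    """BIC score for an N-state, M-symbol discrete HMM on T observations."""
    # free parameters: each row of A (N-1 free), each row of B (M-1 free), pi (N-1 free)
    n_params = N * (N - 1) + N * (M - 1) + (N - 1)
    return loglik - 0.5 * n_params * math.log(T)
```

A larger model is kept only when its likelihood gain outweighs the growth of the penalty term.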
28-29. HMM Model Selection II
- for n = 1 … N_max
  - for i = 1 … NumTries
    - Initialize n-state HMM randomly
    - Learn model parameters
    - Calculate BIC score
    - If best so far, store model
  - If larger model not chosen, stop
Running time: O(NumTries × Tn²) per iteration. Evaluates NumTries candidate models for each n to overcome local minima. However, this is expensive, and still prone to local minima, especially for large N.
30-31. Idea: Binary state splits to generate candidate models
- Notation: λ_s = HMM parameters related to state s; λ_\s = HMM parameters not related to state s
- To split state s into s1 and s2:
  - Create λ' such that λ'_\s = λ_\s
  - Initialize λ'_s1 and λ'_s2 based on λ_s and on parameter constraints
- This is an effective heuristic for avoiding local minima
(first proposed in Ostendorf and Singer, 1997)
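One plausible initialization for such a binary split: copy state s's outgoing transitions, share its incoming probability mass between the two copies, and perturb the emission rows so EM can pull them apart. This is an illustrative sketch under those assumptions; the paper's exact parameter constraints may differ:

```python
import numpy as np

def split_state(A, B, pi, s, eps=0.01, rng=None):
    """Candidate generation: split state s into two near-identical copies.

    Parameters not involving s are untouched; the new copy is appended
    as state N. Illustrative sketch only.
    """
    rng = rng or np.random.default_rng(0)
    N, M = B.shape
    A2 = np.zeros((N + 1, N + 1))
    A2[:N, :N] = A
    A2[N, :N] = A[s, :]              # s2 copies s's outgoing transitions
    col = A2[:, s].copy()
    A2[:, s] = col / 2               # incoming mass shared equally
    A2[:, N] = col / 2               # between the two copies
    B2 = np.vstack([B, B[s]])
    for i in (s, N):                 # nudge the copies' emissions apart
        B2[i] = B2[i] + eps * rng.random(M)
        B2[i] /= B2[i].sum()
    pi2 = np.append(pi, pi[s] / 2)   # prior mass also shared
    pi2[s] /= 2
    return A2, B2, pi2

# Toy parameters (illustrative numbers only)
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
B = np.array([[0.7, 0.3],
              [0.1, 0.9]])
pi = np.array([0.5, 0.5])
A2, B2, pi2 = split_state(A, B, pi, s=0)
```

Splitting incoming mass and outgoing rows this way keeps every row of A2, every row of B2, and pi2 properly normalized.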
32-34. Overall algorithm
- Start with a small number of states
- Repeat:
  - EM (Baum-Welch or Viterbi Training)
  - Binary state splits followed by EM
  - BIC on the training set
  - Stop when a bigger HMM is not selected
What counts as efficient? We want this loop to be at most O(TN²).
35-37. HMM Model Selection III
- Initialize an n_0-state HMM randomly
- for n = n_0 … N_max
  - Learn model parameters
  - for i = 1 … n
    - Split state i, learn model parameters (O(Tn²) per split)
    - Calculate BIC score
    - If best so far, store model
  - If larger model not chosen, stop
Running time: O(Tn³) per iteration of the outer loop. More effective at avoiding local minima than the previous approaches, but scales poorly because of the n³ term.
38-39. Fast Candidate Generation
- Only consider timesteps owned by s in the Viterbi path
- Only allow parameters of the split states to vary
- Merge parameters and store as a candidate
40. OptimizeSplitParams I: Split-State Viterbi Training (SSVT)
Iterate until convergence.
41. Constrained Viterbi
- When splitting state s into {s1, s2}, we recompute the affected part of the path using a fast constrained Viterbi algorithm over only those timesteps owned by s in Q*, constraining them to belong to s1 or s2.
42-45. The Viterbi path is denoted by Q*. Suppose we split state N into {s1, s2}.
[Figure: in the trellis, the column δ_t(N) is replaced by two columns δ_t(s1) and δ_t(s2); only the timesteps previously owned by state N are re-decoded, each constrained to s1 or s2]
47. OptimizeSplitParams I: Split-State Viterbi Training (SSVT)
Iterate until convergence.
Running time: O(T_s n) per iteration, where T_s is the number of timesteps owned by s. When splitting state s, assumes the rest of the HMM parameters (λ_\s) and the rest of the Viterbi path (Q*_\T_s) are both fixed.
48. Fast approximate BIC
- Compute once for the base model: O(Tn²)
- Update optimistically for each candidate model: O(T_s)
(first proposed in Stolcke and Omohundro, 1994)
49-50. HMM Model Selection IV
- Initialize an n_0-state HMM randomly
- for n = n_0 … N_max
  - Learn model parameters
  - for i = 1 … n
    - Split state i, optimize by constrained EM (O(Tn) per split)
    - Calculate approximate BIC score
    - If best so far, store model
  - If larger model not chosen, stop
Running time: O(Tn²) per iteration of the outer loop!
51. Algorithms
- SOFT (slower, more accurate): Baum-Welch / Constrained Baum-Welch
- HARD (faster, coarser): Viterbi Training / Constrained Viterbi Training
52. Results
- Learning fixed-size models
- Learning variable-size models
Baseline: fixed-size HMM trained with Baum-Welch, five restarts
53-57. Learning fixed-size models [experiment figures and results tables]
58-62. Learning variable-size models [experiment figures]
63-64. Conclusion
- Pros
  - Simple and efficient method for HMM model selection
  - Also learns better fixed-size models
  - Often faster than a single run of Baum-Welch
  - Different variants suitable for different problems
- Cons
  - Greedy heuristic: no performance guarantees
  - Binary splits are also prone to local minima
  - Why binary splits?
  - Works less well on discrete-valued data
  - Greater error from the Viterbi path assumptions
65. Thank you
66. Appendix
67. Viterbi Algorithm
68. Constrained Viterbi
69. EM for HMMs
71. More definitions
73. OptimizeSplitParams II: Constrained Baum-Welch
Iterate until convergence.
74. Penalized BIC