Fast Temporal State-Splitting for HMM Model Selection and Learning

Transcript and Presenter's Notes
1
Fast Temporal State-Splitting for HMM Model Selection and Learning
  • Sajid Siddiqi
  • Geoffrey Gordon
  • Andrew Moore

2
[Plot: an observation sequence, value x against time t]
3
[Plot: an observation sequence, value x against time t]
How many kinds of observations (x)?
4
[Plot: an observation sequence, value x against time t]
How many kinds of observations (x)? 3
5
[Plot: an observation sequence, value x against time t]
How many kinds of observations (x)? 3
How many kinds of transitions (xt → xt+1)?
6
[Plot: an observation sequence, value x against time t]
How many kinds of observations (x)? 3
How many kinds of transitions (xt → xt+1)? 4
7
[Plot: an observation sequence, value x against time t]
We say that this sequence exhibits four states under the first-order Markov assumption.
Our goal is to discover the number of such states (and their parameter settings) in sequential data, and to do so efficiently.
How many kinds of observations (x)? 3
How many kinds of transitions (xt → xt+1)? 4
8
Definitions
  • An HMM is a 3-tuple λ = (A, B, π), where
  • A: N×N transition matrix
  • B: N×M observation probability matrix
  • π: N×1 prior probability vector
  • |λ|: number of states in HMM λ, i.e. N
  • T: number of observations in the sequence
  • qt: the state the HMM is in at time t

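For concreteness, a minimal NumPy container for these quantities might look like the following sketch (the class and attribute names are illustrative, not from the talk):

```python
import numpy as np

class HMM:
    """Minimal HMM parameter container: lambda = (A, B, pi)."""
    def __init__(self, A, B, pi):
        self.A = np.asarray(A, dtype=float)            # N x N, A[i, j] = P(q_{t+1}=j | q_t=i)
        self.B = np.asarray(B, dtype=float)            # N x M, B[i, k] = P(O_t=k | q_t=i)
        self.pi = np.asarray(pi, dtype=float).ravel()  # length-N prior, pi[i] = P(q_1=i)

    @property
    def N(self):
        return self.A.shape[0]                         # number of states
```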
9
HMMs as DBNs
10
Transition Model
i | P(qt+1=s1 | qt=si) | P(qt+1=s2 | qt=si) | ... | P(qt+1=sj | qt=si) | ... | P(qt+1=sN | qt=si)
1 | a11 | a12 | ... | a1j | ... | a1N
2 | a21 | a22 | ... | a2j | ... | a2N
3 | a31 | a32 | ... | a3j | ... | a3N
... | | | | | |
i | ai1 | ai2 | ... | aij | ... | aiN
... | | | | | |
N | aN1 | aN2 | ... | aNj | ... | aNN

Each of these probability tables is identical
11
Observation Model
i | P(Ot=1 | qt=si) | P(Ot=2 | qt=si) | ... | P(Ot=k | qt=si) | ... | P(Ot=M | qt=si)
1 | b1(1) | b1(2) | ... | b1(k) | ... | b1(M)
2 | b2(1) | b2(2) | ... | b2(k) | ... | b2(M)
3 | b3(1) | b3(2) | ... | b3(k) | ... | b3(M)
... | | | | | |
i | bi(1) | bi(2) | ... | bi(k) | ... | bi(M)
... | | | | | |
N | bN(1) | bN(2) | ... | bN(k) | ... | bN(M)
12
HMMs as DBNs
13
HMMs as FSAs
HMMs as DBNs
[Diagram: the same HMM drawn as a finite-state automaton over states S1, S2, S3, S4]
14
Operations on HMMs
  • Problem 1: Evaluation
  • Given an HMM and an observation sequence, what is the likelihood of this sequence?
  • Problem 2: Most Probable Path
  • Given an HMM and an observation sequence, what is the most probable path through state space?
  • Problem 3: Learning HMM parameters
  • Given an observation sequence and a fixed number of states, what is an HMM that is likely to have produced this string of observations?
  • Problem 4: Learning the number of states
  • Given an observation sequence, what is an HMM (of any size) that is likely to have produced this string of observations?

15
Operations on HMMs
Problem | Computation | Algorithm | Complexity
Evaluation | P(O | λ) | Forward-Backward | O(TN²)
Path Inference | Q* = argmax_Q P(O, Q | λ) | Viterbi | O(TN²)
Parameter Learning | λ* = argmax_{λ,Q} P(O, Q | λ); λ* = argmax_λ P(O | λ) | Viterbi Training; Baum-Welch (EM) | O(TN²)
Learning the number of states | ?? | ?? | ??
16
Path Inference
  • Viterbi Algorithm for calculating argmax_Q P(O, Q | λ)

17
[Viterbi trellis: rows t = 1..9, columns δt(1), δt(2), δt(3), ..., δt(N)]
18
[Viterbi trellis: rows t = 1..9, columns δt(1), δt(2), δt(3), ..., δt(N)]
19
Path Inference
  • Viterbi Algorithm for calculating argmax_Q P(O, Q | λ)

Running time: O(TN²). Yields a globally optimal path through hidden state space, associating each timestep with exactly one HMM state.
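A minimal log-space sketch of this recursion for discrete observations, assuming the HMM container sketched earlier (an illustration, not the authors' implementation):

```python
import numpy as np

def viterbi(hmm, O):
    """Return the most probable state path argmax_Q P(O, Q | lambda). Runs in O(T N^2)."""
    T, N = len(O), hmm.N
    logA, logB, logpi = np.log(hmm.A), np.log(hmm.B), np.log(hmm.pi)
    delta = np.full((T, N), -np.inf)   # delta[t, i] = best log-prob of a path ending in state i at time t
    psi = np.zeros((T, N), dtype=int)  # back-pointers
    delta[0] = logpi + logB[:, O[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA    # scores[i, j]: come from state i, move to state j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, O[t]]
    # backtrack from the best final state
    Q = np.zeros(T, dtype=int)
    Q[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        Q[t] = psi[t + 1, Q[t + 1]]
    return Q
```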
20
Parameter Learning I
  • Viterbi Training (≈ K-means for sequences)

21
Parameter Learning I
  • Viterbi Training (≈ K-means for sequences)

Q^(s+1) = argmax_Q P(O, Q | λ^(s))   (Viterbi algorithm)
λ^(s+1) = argmax_λ P(O, Q^(s+1) | λ)

Running time: O(TN²) per iteration. Models the posterior belief as a δ-function per timestep in the sequence. Performs well on data with easily distinguishable states.
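A sketch of this loop for discrete observations, reusing the viterbi() sketch above; the add-one smoothing, the fixed iteration count, and leaving π unchanged are assumptions, not details from the talk:

```python
import numpy as np

def viterbi_training(hmm, O, n_iters=20):
    """Hard EM: alternate Viterbi decoding with count-based re-estimation."""
    for _ in range(n_iters):
        Q = viterbi(hmm, O)                        # Q^(s+1) = argmax_Q P(O, Q | lambda^(s))
        N, M = hmm.B.shape
        A = np.ones((N, N))                        # add-one smoothing (assumption)
        B = np.ones((N, M))
        for t in range(len(O) - 1):
            A[Q[t], Q[t + 1]] += 1
            B[Q[t], O[t]] += 1
        B[Q[-1], O[-1]] += 1
        hmm.A = A / A.sum(axis=1, keepdims=True)   # lambda^(s+1) = argmax P(O, Q^(s+1) | lambda)
        hmm.B = B / B.sum(axis=1, keepdims=True)   # pi is left unchanged in this sketch
    return hmm
```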
22
Parameter Learning II
  • Baum-Welch (≈ GMM for sequences)
  1. Iterate the following two steps until convergence:
  2. Calculate the expected complete log-likelihood given λ^(s)
  3. Obtain updated model parameters λ^(s+1) by maximizing this log-likelihood

23
Parameter Learning II
  • Baum-Welch (≈ GMM for sequences)
  1. Iterate the following two steps until convergence:
  2. Calculate the expected complete log-likelihood given λ^(s)
  3. Obtain updated model parameters λ^(s+1) by maximizing this log-likelihood

Obj(λ, λ^(s)) = E_Q[ log P(O, Q | λ) | O, λ^(s) ]
λ^(s+1) = argmax_λ Obj(λ, λ^(s))

Running time: O(TN²) per iteration, but with a larger constant. Models the full posterior belief over hidden states per timestep. Effectively models sequences with overlapping states, at the cost of extra computation.
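For reference, the M-step that maximizes this objective has the standard Baum-Welch form (standard textbook updates, not transcribed from the slides), where the posteriors are computed with Forward-Backward:

$$
\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \qquad
\hat{b}_i(k) = \frac{\sum_{t:\,O_t=k} \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}, \qquad
\hat{\pi}_i = \gamma_1(i),
$$

with $\gamma_t(i) = P(q_t = s_i \mid O, \lambda^{(s)})$ and $\xi_t(i,j) = P(q_t = s_i, q_{t+1} = s_j \mid O, \lambda^{(s)})$.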
24
HMM Model Selection
  • Distinction between model search and actual
    selection step
  • We can search the spaces of HMMs with different N
    using parameter learning, and perform selection
    using a criterion like BIC.

25
HMM Model Selection
  • Distinction between model search and actual
    selection step
  • We can search the spaces of HMMs with different N
    using parameter learning, and perform selection
    using a criterion like BIC.

Running time: O(Tn²) to compute the likelihood for BIC
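The BIC criterion referred to here has the standard form (a standard statement; the paper's exact parameter bookkeeping may differ):

$$
\mathrm{BIC}(\lambda) = \log P(O \mid \lambda) \;-\; \frac{\#\mathrm{params}(\lambda)}{2} \log T,
$$

where, for an N-state HMM over M discrete symbols, $\#\mathrm{params}(\lambda) = N(N-1) + N(M-1) + (N-1)$ free parameters.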
26
HMM Model Selection I
  • for n = 1 to Nmax
  • Initialize n-state HMM randomly
  • Learn model parameters
  • Calculate BIC score
  • If best so far, store model
  • If larger model not chosen, stop

27
HMM Model Selection I
  • for n = 1 to Nmax
  • Initialize n-state HMM randomly
  • Learn model parameters
  • Calculate BIC score
  • If best so far, store model
  • If larger model not chosen, stop

Running time: O(Tn²) per iteration. Drawback: local minima in parameter optimization
28
HMM Model Selection II
  • for n = 1 to Nmax
  • for i = 1 to NumTries
  • Initialize n-state HMM randomly
  • Learn model parameters
  • Calculate BIC score
  • If best so far, store model
  • If larger model not chosen, stop

29
HMM Model Selection II
  • for n = 1 to Nmax
  • for i = 1 to NumTries
  • Initialize n-state HMM randomly
  • Learn model parameters
  • Calculate BIC score
  • If best so far, store model
  • If larger model not chosen, stop

Running time: O(NumTries × Tn²) per iteration. Evaluates NumTries candidate models for each n to overcome local minima. However, this is expensive and still prone to local minima, especially for large N.
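As a sketch, this search could be coded as follows; random_hmm(), baum_welch(), and bic_score() are hypothetical helpers (a random initializer, the EM routine, and the BIC criterion above), not functions from the paper:

```python
import numpy as np

def select_hmm_restarts(O, Nmax, num_tries=5):
    """Model Selection II: random restarts at each size n, BIC used for selection."""
    M = int(np.max(O)) + 1                        # observation symbols assumed coded 0..M-1
    best_model, best_bic = None, -np.inf
    for n in range(1, Nmax + 1):
        best_at_n = -np.inf
        for _ in range(num_tries):
            hmm = random_hmm(n, M)                # hypothetical random initializer
            hmm = baum_welch(hmm, O)              # or viterbi_training(hmm, O)
            score = bic_score(hmm, O)             # log-likelihood minus complexity penalty
            best_at_n = max(best_at_n, score)
            if score > best_bic:
                best_model, best_bic = hmm, score
        if best_at_n < best_bic:                  # larger model not chosen: stop
            break
    return best_model
```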
30
Idea: Binary state splits to generate candidate models
  • To split state s into s1 and s2:
  • Create λ' such that λ'\s = λ\s
  • Initialize λ's1 and λ's2 based on λs and on parameter constraints
  • Notation: λs = HMM parameters related to state s; λ\s = HMM parameters not related to state s

First proposed in Ostendorf and Singer, 1997
31
Idea: Binary state splits to generate candidate models
  • To split state s into s1 and s2:
  • Create λ' such that λ'\s = λ\s
  • Initialize λ's1 and λ's2 based on λs and on parameter constraints (a code sketch follows below)
  • This is an effective heuristic for avoiding local minima
  • Notation: λs = HMM parameters related to state s; λ\s = HMM parameters not related to state s

First proposed in Ostendorf and Singer, 1997
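One possible implementation of such a split, keeping λ\s untouched and appending s2 as a new state; the halving of incoming mass and the small emission perturbation are assumptions, not the paper's exact initialization (the HMM container sketched earlier is assumed):

```python
import numpy as np

def split_state(hmm, s, eps=0.01, rng=np.random):
    """Split state s into s1 (reusing index s) and s2 (appended as a new last state)."""
    N, M = hmm.B.shape
    A = np.zeros((N + 1, N + 1))
    A[:N, :N] = hmm.A
    A[N, :N] = hmm.A[s]                      # s2 leaves the state the way s did
    A[:, s] /= 2.0                           # incoming mass shared between s1 and s2
    A[:, N] = A[:, s]
    B = np.vstack([hmm.B, hmm.B[s]])         # s2 emits like s, plus a small perturbation (assumption)
    B[N] = B[N] + eps * rng.dirichlet(np.ones(M))
    B[N] /= B[N].sum()
    pi = np.append(hmm.pi, hmm.pi[s] / 2.0)  # prior mass shared as well
    pi[s] /= 2.0
    return HMM(A / A.sum(axis=1, keepdims=True), B, pi)
```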
32
Overall algorithm
33
Overall algorithm
Start with a small number of states
EM (B.W. or V.T.)
Binary state splits followed by EM
BIC on training set
Stop when bigger HMM is not selected
34
Overall algorithm
Start with a small number of states
EM (B.W. or V.T.)
Binary state splits followed by EM
BIC on training set
Stop when bigger HMM is not selected
What is efficient? We want this loop to be at most O(TN²)
35
HMM Model Selection III
  • Initialize n0-state HMM randomly
  • for n = n0 to Nmax
  • Learn model parameters
  • for i = 1 to n
  • Split state i, learn model parameters
  • Calculate BIC score
  • If best so far, store model
  • If larger model not chosen, stop

36
HMM Model Selection III
  • Initialize n0-state HMM randomly
  • for n = n0 to Nmax
  • Learn model parameters
  • for i = 1 to n
  • Split state i, learn model parameters
  • Calculate BIC score
  • If best so far, store model
  • If larger model not chosen, stop

O(Tn²)
37
HMM Model Selection III
  • Initialize n0-state HMM randomly
  • for n = n0 to Nmax
  • Learn model parameters
  • for i = 1 to n
  • Split state i, learn model parameters
  • Calculate BIC score
  • If best so far, store model
  • If larger model not chosen, stop

O(Tn²)
Running time: O(Tn³) per iteration of the outer loop. More effective at avoiding local minima than previous approaches. However, it scales poorly because of the n³ term.
38
Fast Candidate Generation
39
Fast Candidate Generation
Only consider timesteps owned by s in Viterbi path
Only allow parameters of split states to vary
Merge parameters and store as candidate
40
OptimizeSplitParams I: Split-State Viterbi Training (SSVT)
Iterate until convergence
41
Constrained Viterbi
  • Splitting state s into s1, s2. We calculate the new assignment of these timesteps
  • using a fast constrained Viterbi algorithm over only those timesteps owned by s in Q*, constraining them to belong to s1 or s2 (a sketch follows below).

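A sketch of the idea: each maximal run of timesteps owned by s is re-decoded with a two-state Viterbi over {s1, s2}, with entry and exit terms taken from the fixed neighbouring states. The split model and state indices are assumed to come from a split such as the one sketched earlier; this is an illustration, not the authors' code:

```python
import numpy as np

def constrained_viterbi(hmm2, O, Q, s, s1, s2):
    """Reassign only timesteps with Q[t] == s to s1 or s2; all other timesteps keep their states."""
    logA, logB, logpi = np.log(hmm2.A), np.log(hmm2.B), np.log(hmm2.pi)
    Qnew = np.array(Q, copy=True)
    T = len(O)
    t = 0
    while t < T:
        if Q[t] != s:
            t += 1
            continue
        u = t                                   # find the maximal run [t, u) owned by s
        while u < T and Q[u] == s:
            u += 1
        states = (s1, s2)
        # two-state Viterbi over the run, entering from the fixed previous state (or the prior)
        delta = np.array([(logpi[c] if t == 0 else logA[Qnew[t - 1], c]) + logB[c, O[t]]
                          for c in states])
        psi = np.zeros((u - t, 2), dtype=int)
        for r in range(t + 1, u):
            scores = delta[:, None] + logA[np.ix_(states, states)]
            psi[r - t] = scores.argmax(axis=0)
            delta = scores.max(axis=0) + logB[list(states), O[r]]
        if u < T:                               # exit into the fixed next state
            delta = delta + logA[list(states), Qnew[u]]
        best = int(delta.argmax())              # backtrack within the run
        for r in range(u - 1, t - 1, -1):
            Qnew[r] = states[best]
            best = int(psi[r - t, best])
        t = u
    return Qnew
```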
42
The Viterbi path is denoted by Q*. Suppose we split state N into s1, s2.
[Viterbi trellis: rows t = 1..9, columns δt(1), δt(2), δt(3), ..., δt(N)]
43
The Viterbi path is denoted by Q*. Suppose we split state N into s1, s2.
[Viterbi trellis: rows t = 1..9, columns δt(1), δt(2), δt(3), δt(s1), δt(s2); the entries for timesteps owned by N are now unknown (?)]
44
The Viterbi path is denoted by Q*. Suppose we split state N into s1, s2.
[Viterbi trellis: rows t = 1..9, columns δt(1), δt(2), δt(3), δt(s1), δt(s2)]
45
The Viterbi path is denoted by Q*. Suppose we split state N into s1, s2.
[Viterbi trellis: rows t = 1..9, columns δt(1), δt(2), δt(3), δt(s1), δt(s2)]
46
(No Transcript)
47
OptimizeSplitParams I: Split-State Viterbi Training (SSVT)
Iterate until convergence
Running time: O(Ts n) per iteration. When splitting state s, this assumes that the rest of the HMM parameters (λ\s) and the rest of the Viterbi path (Q \ Ts) are both held fixed.
48
Fast approximate BIC
Compute once for the base model: O(Tn²). Update optimistically for each candidate model: O(Ts).
First proposed in Stolcke and Omohundro, 1994
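Schematically, the optimistic update replaces only the contribution of the timesteps owned by s (a paraphrase of the idea, not a formula from the paper):

$$
\mathrm{BIC}(\lambda') \;\approx\; \log P(O, Q \mid \lambda) \;-\; \Delta_s \;+\; \Delta_{s_1,s_2} \;-\; \frac{\#\mathrm{params}(\lambda')}{2} \log T,
$$

where $\Delta_s$ and $\Delta_{s_1,s_2}$ are the log-likelihood contributions of the timesteps owned by $s$ before and after the split, each computable in $O(T_s)$ time.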
49
HMM Model Selection IV
  • Initialize n0-state HMM randomly
  • for n = n0 to Nmax
  • Learn model parameters
  • for i = 1 to n
  • Split state i, optimize by constrained EM
  • Calculate approximate BIC score
  • If best so far, store model
  • If larger model not chosen, stop

50
HMM Model Selection IV
  • Initialize n0-state HMM randomly
  • for n = n0 to Nmax
  • Learn model parameters
  • for i = 1 to n
  • Split state i, optimize by constrained EM
  • Calculate approximate BIC score
  • If best so far, store model
  • If larger model not chosen, stop

O(Tn)
Running time: O(Tn²) per iteration of the outer loop!
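Putting the pieces together, the outer loop of Selection IV might look like this sketch; optimize_split_params() and approx_bic() are hypothetical stand-ins for SSVT / constrained Baum-Welch and the optimistic BIC update, while random_hmm(), baum_welch(), viterbi(), and split_state() are the earlier sketches and helpers:

```python
import numpy as np

def select_hmm_splitting(O, n0, Nmax):
    """Model Selection IV: grow the HMM by binary state splits, scored with approximate BIC."""
    M = int(np.max(O)) + 1                         # observation symbols assumed coded 0..M-1
    hmm = random_hmm(n0, M)                        # hypothetical random initializer
    best_model, best_bic = None, -np.inf
    for n in range(n0, Nmax + 1):
        hmm = baum_welch(hmm, O)                   # full EM (or viterbi_training) on the current model
        Q = viterbi(hmm, O)                        # base Viterbi path, computed once per outer iteration
        improved = False
        for s in range(hmm.N):                     # try splitting each state in turn
            cand = split_state(hmm, s)
            cand = optimize_split_params(cand, O, Q, s)  # SSVT / constrained B-W: only split states vary
            score = approx_bic(cand, O, Q, s)            # optimistic BIC update, O(T_s) per candidate
            if score > best_bic:
                best_model, best_bic, improved = cand, score, True
        if not improved:                           # larger model not chosen: stop
            break
        hmm = best_model
    return best_model
```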
51
Algorithms
SOFT (slower, more accurate): Baum-Welch / Constrained Baum-Welch
HARD (faster, coarser): Viterbi Training / Constrained Viterbi Training
52
Results
  1. Learning fixed-size models
  2. Learning variable-sized models

Baseline: fixed-size HMM trained with Baum-Welch, with five restarts
53
Learning fixed-size models
54
(No Transcript)
55
Fixed-size experiments table, continued
56
Learning fixed-size models
57
Learning fixed-size models
58
Learning variable-size models
59
Learning variable-size models
60
Learning variable-size models
61
Learning variable-size models
62
Learning variable-size models
63
Conclusion
  • Pros
  • Simple and efficient method for HMM model selection
  • Also learns better fixed-size models
  • (Often faster than a single run of Baum-Welch)
  • Different variants suitable for different problems

64
Conclusion
  • Pros
  • Simple and efficient method for HMM model selection
  • Also learns better fixed-size models
  • (Often faster than a single run of Baum-Welch)
  • Different variants suitable for different problems
  • Cons
  • Greedy heuristic: no performance guarantees
  • Binary splits are also prone to local minima
  • Why binary splits?
  • Works less well on discrete-valued data
  • Greater error from Viterbi path assumptions

65
Thank you
66
Appendix
67
Viterbi Algorithm
68
Constrained Viterbi
69
EM for HMMs
70
(No Transcript)
71
More definitions
72
(No Transcript)
73
OptimizeSplitParams II: Constrained Baum-Welch
Iterate until convergence
74
Penalized BIC