Title: Fast Temporal State-Splitting for HMM Model Selection and Learning
1. Fast Temporal State-Splitting for HMM Model Selection and Learning
- Sajid Siddiqi
- Geoffrey Gordon
- Andrew Moore
2-6. [Figure: an observation sequence x over time, built up step by step]
How many kinds of observations (x)? 3
How many kinds of transitions (x_t → x_{t+1})? 4
7. We say that this sequence exhibits four states under the first-order Markov assumption. Our goal is to discover the number of such states (and their parameter settings) in sequential data, and to do so efficiently.
8. Definitions
- An HMM is a 3-tuple λ = (A, B, π), where
  - A: N×N transition matrix
  - B: N×M observation probability matrix
  - π: N×1 prior probability vector
- |λ|: number of states in HMM λ, i.e. N
- T: number of observations in the sequence
- q_t: the state the HMM is in at time t
9. HMMs as DBNs
10. Transition Model
Row i of A holds the distribution over next states, a_ij = P(q_{t+1} = s_j | q_t = s_i):

i | P(q_{t+1}=s_1|q_t=s_i)  P(q_{t+1}=s_2|q_t=s_i)  …  P(q_{t+1}=s_j|q_t=s_i)  …  P(q_{t+1}=s_N|q_t=s_i)
1 | a_11  a_12  …  a_1j  …  a_1N
2 | a_21  a_22  …  a_2j  …  a_2N
3 | a_31  a_32  …  a_3j  …  a_3N
i | a_i1  a_i2  …  a_ij  …  a_iN
N | a_N1  a_N2  …  a_Nj  …  a_NN

In the DBN, each of these probability tables is identical across timesteps.
11Observation Model
O0
i P(Ot1qtsi) P(Ot2qtsi) P(Otkqtsi) P(OtMqtsi)
1 b1(1) b1 (2) b1 (k) b1(M)
2 b2 (1) b2 (2) b2(k) b2 (M)
3 b3 (1) b3 (2) b3(k) b3 (M)
i bi(1) bi (2) bi(k) bi (M)
N bN (1) bN (2) bN(k) bN (M)
O1
O2
O3
Notation
O4
12. HMMs as DBNs
13. HMMs as FSAs
[Figure: the same HMM drawn as a finite-state automaton over states S1, S2, S3, S4]
14. Operations on HMMs
- Problem 1: Evaluation
  - Given an HMM and an observation sequence, what is the likelihood of this sequence?
- Problem 2: Most Probable Path
  - Given an HMM and an observation sequence, what is the most probable path through state space?
- Problem 3: Learning HMM parameters
  - Given an observation sequence and a fixed number of states, what is an HMM that is likely to have produced this string of observations?
- Problem 4: Learning the number of states
  - Given an observation sequence, what is an HMM (of any size) that is likely to have produced this string of observations?
15. Operations on HMMs

Problem | Computation | Algorithm | Complexity
Evaluation | P(O|λ) | Forward-Backward | O(TN²)
Path inference | Q* = argmax_Q P(O, Q|λ) | Viterbi | O(TN²)
Parameter learning | λ* = argmax_{λ,Q} P(O, Q|λ) | Viterbi Training | O(TN²) per iteration
Parameter learning | λ* = argmax_λ P(O|λ) | Baum-Welch (EM) | O(TN²) per iteration
Learning the number of states | ?? | ?? | ??
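Problem 1 has a direct dynamic-programming solution; below is a minimal NumPy sketch of the forward recursion. The 2-state, 2-symbol parameters are made-up toy values, not from the talk:

```python
import numpy as np

def forward_likelihood(A, B, pi, obs):
    """Evaluation: P(O | lambda) via the forward recursion, O(T N^2)."""
    alpha = pi * B[:, obs[0]]           # alpha_1(i) = pi_i * b_i(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # alpha_{t+1}(j) = (sum_i alpha_t(i) a_ij) * b_j(o_{t+1})
    return alpha.sum()                  # P(O | lambda) = sum_i alpha_T(i)

# Toy 2-state, 2-symbol HMM (illustrative numbers only)
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
B = np.array([[0.7, 0.3],
              [0.1, 0.9]])
pi = np.array([0.5, 0.5])
print(forward_likelihood(A, B, pi, [0, 1, 1]))
```

In practice the recursion is run on scaled or log-space values to avoid underflow on long sequences.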
16. Path Inference
- Viterbi algorithm for calculating argmax_Q P(O, Q | λ)
17-18. [Figure: Viterbi trellis of δ_t(i) values, for t = 1…9 and states 1…N, filled in column by column]
19. Path Inference
- Viterbi algorithm for calculating argmax_Q P(O, Q | λ)
Running time: O(TN²). Yields a globally optimal path through hidden state space, associating each timestep with exactly one HMM state.
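The δ recursion from the trellis can be written out directly; here is a minimal log-space NumPy sketch of the Viterbi decoder, with toy parameters that are illustrative assumptions:

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Path inference: argmax_Q P(O, Q | lambda) in O(T N^2), in log space."""
    T, N = len(obs), len(pi)
    delta = np.log(pi) + np.log(B[:, obs[0]])    # delta_1(i)
    psi = np.zeros((T, N), dtype=int)            # backpointers
    for t in range(1, T):
        scores = delta[:, None] + np.log(A)      # scores[i, j] = delta_{t-1}(i) + log a_ij
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta.argmax())]                 # best final state
    for t in range(T - 1, 0, -1):                # follow backpointers
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# Toy parameters (illustrative numbers only)
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
B = np.array([[0.7, 0.3],
              [0.1, 0.9]])
pi = np.array([0.5, 0.5])
print(viterbi(A, B, pi, [0, 0, 1, 1, 1]))
```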
20-21. Parameter Learning I
- Viterbi Training (the k-means of sequences)
  Q^{s+1} = argmax_Q P(O, Q | λ^s)   (Viterbi algorithm)
  λ^{s+1} = argmax_λ P(O, Q^{s+1} | λ)
Running time: O(TN²) per iteration. Models the posterior belief as a δ-function per timestep in the sequence. Performs well on data with easily distinguishable states.
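A compact sketch of how such a hard-EM loop might look, alternating Viterbi decoding with count-based reestimation. The smoothing constant and the toy data are assumptions, not from the talk:

```python
import numpy as np

def viterbi_training(A, B, pi, obs, iters=10, smooth=1e-3):
    """Hard EM: alternate Viterbi decoding with count-based reestimation."""
    N, M = B.shape
    T = len(obs)
    for _ in range(iters):
        # hard E-step: best path q under the current parameters
        delta = np.log(pi) + np.log(B[:, obs[0]])
        psi = np.zeros((T, N), dtype=int)
        for t in range(1, T):
            sc = delta[:, None] + np.log(A)
            psi[t] = sc.argmax(axis=0)
            delta = sc.max(axis=0) + np.log(B[:, obs[t]])
        q = [int(delta.argmax())]
        for t in range(T - 1, 0, -1):
            q.append(int(psi[t, q[-1]]))
        q = q[::-1]
        # M-step: reestimate parameters from (smoothed) path counts
        A = np.full((N, N), smooth)
        B = np.full((N, M), smooth)
        for t in range(T - 1):
            A[q[t], q[t + 1]] += 1
        for t in range(T):
            B[q[t], obs[t]] += 1
        A /= A.sum(axis=1, keepdims=True)
        B /= B.sum(axis=1, keepdims=True)
        pi = np.full(N, smooth)
        pi[q[0]] += 1
        pi /= pi.sum()
    return A, B, pi

# Toy run (illustrative numbers only)
A0 = np.array([[0.6, 0.4], [0.4, 0.6]])
B0 = np.array([[0.6, 0.4], [0.3, 0.7]])
pi0 = np.array([0.5, 0.5])
obs = [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1]
A1, B1, pi1 = viterbi_training(A0, B0, pi0, obs)
```

The smoothing keeps all probabilities strictly positive, so the log-space decoding in the next iteration stays well defined.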
22-23. Parameter Learning II
- Baum-Welch (the GMM of sequences)
- Iterate the following two steps until convergence:
  - Calculate the expected complete log-likelihood given λ^s
  - Obtain updated model parameters λ^{s+1} by maximizing this log-likelihood
  Obj(λ, λ^s) = E_Q[ log P(O, Q | λ) | O, λ^s ]
  λ^{s+1} = argmax_λ Obj(λ, λ^s)
Running time: O(TN²) per iteration, but with a larger constant. Models the full posterior belief over hidden states per timestep. Effectively models sequences with overlapping states, at the cost of extra computation.
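The full posterior that Baum-Welch maintains can be sketched with the forward-backward recursions; γ_t(i) = P(q_t = s_i | O, λ) is the per-timestep belief referred to above. Toy parameters are assumed, and the pairwise posteriors needed to reestimate A are omitted for brevity:

```python
import numpy as np

def state_posteriors(A, B, pi, obs):
    """E-step quantity: gamma_t(i) = P(q_t = s_i | O, lambda)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                     # forward pass
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0                                   # backward pass
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta                             # proportional to P(q_t = i, O)
    return gamma / gamma.sum(axis=1, keepdims=True)

# Toy parameters (illustrative numbers only)
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
B = np.array([[0.7, 0.3],
              [0.1, 0.9]])
pi = np.array([0.5, 0.5])
gamma = state_posteriors(A, B, pi, [0, 1, 1])
```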
24-25. HMM Model Selection
- Distinction between the model search and the actual selection step
- We can search the space of HMMs with different N using parameter learning, and perform selection using a criterion like BIC.
Running time: O(Tn²) to compute the likelihood for BIC.
26-27. HMM Model Selection I
- for n = 1 … N_max
  - Initialize n-state HMM randomly
  - Learn model parameters
  - Calculate BIC score
  - If best so far, store model
  - If larger model not chosen, stop
Running time: O(Tn²) per iteration. Drawback: local minima in parameter optimization.
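The selection criterion can be sketched as log-likelihood minus a complexity penalty. The free-parameter count below is the standard one for a discrete HMM and is an assumption here; the talk's penalized BIC may differ in its exact form:

```python
import math

def hmm_bic(loglik, N, M, T):
    """BIC score for an N-state, M-symbol discrete HMM on T observations."""
    # free parameters: each row of A (N-1 free), each row of B (M-1 free), pi (N-1 free)
    n_params = N * (N - 1) + N * (M - 1) + (N - 1)
    return loglik - 0.5 * n_params * math.log(T)
```

A larger model is kept only when its likelihood gain outweighs the growth of the penalty term.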
28-29. HMM Model Selection II
- for n = 1 … N_max
  - for i = 1 … NumTries
    - Initialize n-state HMM randomly
    - Learn model parameters
    - Calculate BIC score
    - If best so far, store model
  - If larger model not chosen, stop
Running time: O(NumTries × Tn²) per iteration. Evaluates NumTries candidate models for each n to overcome local minima. However, this is expensive, and still prone to local minima, especially for large N.
30-31. Idea: Binary state splits to generate candidate models
- Notation: λ_s = HMM parameters related to state s; λ_\s = HMM parameters not related to state s
- To split state s into s1 and s2:
  - Create λ' such that λ'_\s = λ_\s
  - Initialize λ'_s1 and λ'_s2 based on λ_s and on parameter constraints
- This is an effective heuristic for avoiding local minima
(first proposed in Ostendorf and Singer, 1997)
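One plausible initialization for such a binary split: copy state s's outgoing transitions, share its incoming probability mass between the two copies, and perturb the emission rows so EM can pull them apart. This is an illustrative sketch under those assumptions; the paper's exact parameter constraints may differ:

```python
import numpy as np

def split_state(A, B, pi, s, eps=0.01, rng=None):
    """Candidate generation: split state s into two near-identical copies.

    Parameters not involving s are untouched; the new copy is appended
    as state N. Illustrative sketch only.
    """
    rng = rng or np.random.default_rng(0)
    N, M = B.shape
    A2 = np.zeros((N + 1, N + 1))
    A2[:N, :N] = A
    A2[N, :N] = A[s, :]              # s2 copies s's outgoing transitions
    col = A2[:, s].copy()
    A2[:, s] = col / 2               # incoming mass shared equally
    A2[:, N] = col / 2               # between the two copies
    B2 = np.vstack([B, B[s]])
    for i in (s, N):                 # nudge the copies' emissions apart
        B2[i] = B2[i] + eps * rng.random(M)
        B2[i] /= B2[i].sum()
    pi2 = np.append(pi, pi[s] / 2)   # prior mass also shared
    pi2[s] /= 2
    return A2, B2, pi2

# Toy parameters (illustrative numbers only)
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
B = np.array([[0.7, 0.3],
              [0.1, 0.9]])
pi = np.array([0.5, 0.5])
A2, B2, pi2 = split_state(A, B, pi, s=0)
```

Splitting incoming mass and outgoing rows this way keeps every row of A2, every row of B2, and pi2 properly normalized.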
32-34. Overall algorithm
- Start with a small number of states
- Repeat:
  - EM (Baum-Welch or Viterbi Training)
  - Binary state splits followed by EM
  - BIC on the training set
  - Stop when a bigger HMM is not selected
What counts as efficient? We want this loop to be at most O(TN²).
35-37. HMM Model Selection III
- Initialize an n_0-state HMM randomly
- for n = n_0 … N_max
  - Learn model parameters
  - for i = 1 … n
    - Split state i, learn model parameters (O(Tn²) per split)
    - Calculate BIC score
    - If best so far, store model
  - If larger model not chosen, stop
Running time: O(Tn³) per iteration of the outer loop. More effective at avoiding local minima than the previous approaches, but scales poorly because of the n³ term.
38-39. Fast Candidate Generation
- Only consider timesteps owned by s in the Viterbi path
- Only allow parameters of the split states to vary
- Merge parameters and store as a candidate
40. OptimizeSplitParams I: Split-State Viterbi Training (SSVT)
Iterate until convergence.
41. Constrained Viterbi
- When splitting state s into {s1, s2}, we recompute the affected part of the path using a fast constrained Viterbi algorithm over only those timesteps owned by s in Q*, constraining them to belong to s1 or s2.
42-45. The Viterbi path is denoted by Q*. Suppose we split state N into {s1, s2}.
[Figure: in the trellis, the column δ_t(N) is replaced by two columns δ_t(s1) and δ_t(s2); only the timesteps previously owned by state N are re-decoded, each constrained to s1 or s2]
47. OptimizeSplitParams I: Split-State Viterbi Training (SSVT)
Iterate until convergence.
Running time: O(T_s n) per iteration, where T_s is the number of timesteps owned by s. When splitting state s, assumes the rest of the HMM parameters (λ_\s) and the rest of the Viterbi path (Q*_\T_s) are both fixed.
48. Fast approximate BIC
- Compute once for the base model: O(Tn²)
- Update optimistically for each candidate model: O(T_s)
(first proposed in Stolcke and Omohundro, 1994)
49-50. HMM Model Selection IV
- Initialize an n_0-state HMM randomly
- for n = n_0 … N_max
  - Learn model parameters
  - for i = 1 … n
    - Split state i, optimize by constrained EM (O(Tn) per split)
    - Calculate approximate BIC score
    - If best so far, store model
  - If larger model not chosen, stop
Running time: O(Tn²) per iteration of the outer loop!
51. Algorithms
- SOFT (slower, more accurate): Baum-Welch / Constrained Baum-Welch
- HARD (faster, coarser): Viterbi Training / Constrained Viterbi Training
52. Results
- Learning fixed-size models
- Learning variable-size models
Baseline: fixed-size HMM trained with Baum-Welch, five restarts
53-57. Learning fixed-size models [experiment figures and results tables]
58-62. Learning variable-size models [experiment figures]
63-64. Conclusion
- Pros
  - Simple and efficient method for HMM model selection
  - Also learns better fixed-size models
  - Often faster than a single run of Baum-Welch
  - Different variants suitable for different problems
- Cons
  - Greedy heuristic: no performance guarantees
  - Binary splits are also prone to local minima
  - Why binary splits?
  - Works less well on discrete-valued data
  - Greater error from the Viterbi path assumptions
65. Thank you
66. Appendix
67. Viterbi Algorithm
68. Constrained Viterbi
69. EM for HMMs
71. More definitions
73. OptimizeSplitParams II: Constrained Baum-Welch
Iterate until convergence.
74. Penalized BIC