Title: Fast State Discovery for HMM Model Selection and Learning
1. Fast State Discovery for HMM Model Selection and Learning
- Sajid M. Siddiqi
- Geoffrey J. Gordon
- Andrew W. Moore
- CMU
2. [Plot: observation sequence O_t over time t]
Consider a sequence of real-valued observations (speech, sensor readings, stock prices, ...).
3-4. [Same plot]
We can model it purely based on contextual properties.
5-6. [Same plot]
However, we would miss important temporal structure.
7. [Same plot] Current efficient approaches learn the wrong model.
8. [Same plot] Our method successfully discovers the overlapping states.
9. Our goal: efficiently discover states in sequential data while learning a Hidden Markov Model.
10. Motion Capture
11. Definitions and Notation
- An HMM is λ = (A, B, π), where
  - A: N×N transition matrix
  - B: observation model (μ_s, Σ_s) for each of the N states
  - π: N×1 prior probability vector
- T: length of the observation sequence O_1, …, O_T
- q_t: the state the HMM is in at time t; q_t ∈ {s_1, …, s_N}
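For concreteness, the notation above might be held in a small container like this (a sketch with our own class and method names, not code from the talk; it assumes Gaussian observation densities per state, as the slides' (μ_s, Σ_s) notation suggests):

```python
import numpy as np

class GaussianHMM:
    """An HMM lambda = (A, B, pi) with a Gaussian observation model B."""
    def __init__(self, A, means, covs, pi):
        self.A = np.asarray(A)          # N x N transition matrix
        self.means = np.asarray(means)  # mu_s for each of the N states
        self.covs = np.asarray(covs)    # Sigma_s for each of the N states
        self.pi = np.asarray(pi)        # N x 1 prior probability vector
        self.N = self.A.shape[0]

    def sample(self, T, rng=None):
        """Draw a length-T observation sequence O_1..O_T and states q_1..q_T."""
        rng = np.random.default_rng(0) if rng is None else rng
        q = np.empty(T, dtype=int)
        O = np.empty((T, self.means.shape[1]))
        q[0] = rng.choice(self.N, p=self.pi)
        for t in range(T):
            if t > 0:
                q[t] = rng.choice(self.N, p=self.A[q[t - 1]])
            O[t] = rng.multivariate_normal(self.means[q[t]], self.covs[q[t]])
        return O, q
```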
12-13. Operations on HMMs
14-17. Previous Approaches
- Multi-restart Baum-Welch is inefficient and highly prone to local minima.
We propose Simultaneous Temporal and Contextual Splitting (STACS): a top-down approach that is much better at state discovery while being at least as efficient, and a variant, V-STACS, that is much faster.
18-20. Bayesian Information Criterion (BIC) for Model Selection
- We would like to compute the posterior probability for model selection:
  - P(model size | data) ∝ P(data | model size) · P(model size)
  - log P(model size | data) = log P(data | model size) + log P(model size) + const.
- BIC assumes a prior that penalizes complexity (favors smaller models):
  - log P(model size | data) ≈ log P(data | model size, λ_MLE) − (FP/2) log T
  - where FP = number of free parameters, T = length of the data sequence, and λ_MLE is the ML parameter estimate
- BIC is an asymptotic approximation to the true posterior.
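The BIC formula above can be sketched in a few lines for an N-state HMM with d-dimensional Gaussian observations. The free-parameter count below (prior, transition rows, means, symmetric covariances) is a standard choice and our assumption, not quoted from the slides:

```python
import numpy as np

def bic_score(log_lik, N, d, T):
    """BIC approximation: log P(data | size, lambda_MLE) - (FP/2) log T.

    log_lik: maximized log-likelihood of the data under the N-state model
    N: number of states, d: observation dimension, T: sequence length
    """
    fp = (N - 1)                     # prior pi (sums to 1)
    fp += N * (N - 1)                # transition matrix rows (each sums to 1)
    fp += N * d                      # Gaussian means
    fp += N * d * (d + 1) // 2       # symmetric covariance matrices
    return log_lik - 0.5 * fp * np.log(T)
```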
21. Algorithm Summary (STACS/V-STACS)
- Initialize an n_0-state HMM randomly
- for n = n_0, …, N_max:
  - Learn model parameters
  - for i = 1, …, n:
    - Split state i; optimize by constrained EM (STACS) or constrained Viterbi training (V-STACS)
    - Calculate the approximate BIC score of the split model
  - Choose the best split based on approximate BIC
  - Compare to the original model with exact BIC (STACS) or approximate BIC (V-STACS)
  - If the larger model is not chosen, stop
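The grow-and-test control flow of the summary above can be sketched as a runnable toy. Here `toy_score` is a stand-in (a made-up score that peaks at 5 states), not the paper's EM/BIC routines; only the loop structure mirrors the algorithm:

```python
def toy_score(n):
    # Pretend BIC: likelihood gain saturates at 5 states, penalty grows linearly.
    return 10 * min(n, 5) - 1.5 * n

def grow_hmm(n0, n_max):
    """Grow the model one state at a time until the score stops improving."""
    n = n0
    while n < n_max:
        # "Split each state and keep the best split" collapses, in this toy,
        # to simply proposing a model with n + 1 states.
        if toy_score(n + 1) > toy_score(n):
            n += 1          # accept the best-scoring split
        else:
            break           # larger model not chosen: stop
    return n
```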
22. STACS
- input: n_0, data sequence O = O_1, …, O_T
- output: HMM λ of appropriate size
- λ ← n_0-state initial HMM
- repeat
  - optimize λ over sequence O
  - choose a subset of states Λ
  - for each s ∈ Λ
    - design a candidate model λ_s
    - choose a relevant subset of sequence O
    - split state s; optimize λ_s over the subset
    - score λ_s
  - end for
  - if max_s score(λ_s) > score(λ)
    - λ ← best-scoring candidate λ_s
  - else
    - return λ
23. STACS
- Learn parameters using EM; calculate the Viterbi path Q.
24. STACS
- Consider splits on all states (e.g., for state s_2).
25. STACS
- Choose a subset D = {O_t : Q(t) = s_2}.
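The subset D is read off the Viterbi path, so both steps fit in a short numpy sketch (function and variable names are our own; `log_b[t, i]` denotes log P(O_t | state i)):

```python
import numpy as np

def viterbi(log_pi, log_A, log_b):
    """Most likely state path Q for an HMM, in log space."""
    T, N = log_b.shape
    delta = np.empty((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = log_pi + log_b[0]
    for t in range(1, T):
        trans = delta[t - 1][:, None] + log_A        # trans[i, j]: come from i, go to j
        psi[t] = np.argmax(trans, axis=0)            # best predecessor for each j
        delta[t] = trans[psi[t], np.arange(N)] + log_b[t]
    Q = np.empty(T, dtype=int)
    Q[-1] = np.argmax(delta[-1])
    for t in range(T - 2, -1, -1):                   # backtrack
        Q[t] = psi[t + 1][Q[t + 1]]
    return Q

def split_subset(O, Q, s):
    """D = {O_t : Q(t) = s}; |D| is roughly T/N on average."""
    idx = np.where(Q == s)[0]
    return O[idx], idx
```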
26. STACS
- Note that |D| is O(T/N).
28. STACS
- Split the state.
- Constrain λ_s to match λ except for the offspring states' observation densities and all their transition probabilities, both incoming and outgoing.
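One plausible way to build the split candidate λ_s: duplicate state s into two offspring with perturbed means and evenly divided transition mass. The perturbation scheme below is our guess at a sensible initialization, not the paper's exact recipe:

```python
import numpy as np

def split_state(A, means, covs, pi, s, eps=0.1, rng=None):
    """Return a candidate (A2, means2, covs2, pi2) with state s split in two."""
    rng = np.random.default_rng(0) if rng is None else rng
    N, d = means.shape
    # Offspring observation densities: jitter the parent mean in both directions.
    jitter = eps * rng.standard_normal(d)
    means2 = np.vstack([means, means[s] + jitter])
    means2[s] = means[s] - jitter
    covs2 = np.concatenate([covs, covs[s:s + 1]], axis=0)  # both inherit Sigma_s
    # Transitions: new state copies the parent's outgoing row, then the
    # incoming mass into the parent is split evenly between the offspring.
    A2 = np.zeros((N + 1, N + 1))
    A2[:N, :N] = A
    A2[N, :N] = A[s, :N]
    A2[:, N] = A2[:, s] / 2.0
    A2[:, s] /= 2.0
    A2 /= A2.sum(axis=1, keepdims=True)   # renormalize rows for safety
    pi2 = np.append(pi, pi[s] / 2.0)
    pi2[s] /= 2.0
    return A2, means2, covs2, pi2
```

The free parameters of the offspring would then be refit by the constrained two-state EM (or Viterbi training) described on the following slides.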
29. STACS
- Learn the free parameters using two-state EM over D. This optimizes the partially observed likelihood P(O, Q\D | λ_s).
30. STACS
- Update Q over D to get the new path R.
32. STACS
- Scoring is of two types:
  - Candidates are compared to each other according to their Viterbi path likelihoods.
33. STACS
- The best candidate in this ranking is compared to the un-split model λ using BIC, i.e.,
  - log P(model | data) ≈ log P(data | model) − complexity penalty.
34. Viterbi STACS (V-STACS)
35. Viterbi STACS (V-STACS)
- Recall that STACS learns the free parameters using two-state EM over D. However, EM also has winner-take-all variants.
36. Viterbi STACS (V-STACS)
- V-STACS uses two-state Viterbi training over D to learn the free parameters: hard updates, vs. STACS's soft updates.
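The soft-vs-hard contrast can be shown on the offspring mean updates over a 1-D subset D (a schematic comparison, not the full M-step; `resp` is an assumed name for each timestep's responsibilities for the two offspring):

```python
import numpy as np

def soft_mean_update(D, resp):
    """EM: each O_t contributes to both offspring, weighted by its posterior."""
    return (resp.T @ D) / resp.sum(axis=0)

def hard_mean_update(D, resp):
    """Viterbi training: each O_t counts only toward its argmax offspring."""
    z = np.argmax(resp, axis=1)
    return np.array([D[z == k].mean() for k in (0, 1)])
```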
37. Viterbi STACS (V-STACS)
- In V-STACS, the Viterbi path likelihood is used to approximate the BIC comparison against the un-split model.
38. Time Complexity
- Optimizing the N candidates takes
  - N × O(T) time for STACS
  - N × O(T/N) time for V-STACS
- Scoring the N candidates takes N × O(T) time
- Candidate search and scoring is O(TN)
- Best-candidate evaluation is
  - O(TN²) for exact BIC in STACS
  - O(TN) for approximate BIC in V-STACS
39-44. Other Methods
- Li-Biswas:
  - Generates two candidates:
    - splits the state with the highest variance
    - merges the pair of closest states (rarely chosen)
  - Optimizes all candidate parameters over the entire sequence
- ML-SSS:
  - Generates 2N candidates, splitting each state in two ways
  - Contextual split: optimizes the offspring states' observation densities with 2-Gaussian mixture EM; assumes the offspring are connected in parallel
  - Temporal split: optimizes the offspring states' observation densities, self-transitions, and mutual transitions with EM; assumes the offspring are connected in series
  - Optimizes the split of state s over all timesteps with nonzero posterior probability of being in state s, i.e., O(T) data points
45. Results
46. Data sets
- Australian Sign-Language data collected from two 5DT instrumented gloves and an Ascension Flock-of-Birds tracker [Kadous 2002] (available in the UCI KDD Archive)
- Other data sets obtained from the literature:
  - Robot, MoCap, MLog, Vowel
47. Learning HMMs of Predetermined Size: Scalability
- Robot data (other data sets similar)
48. Learning HMMs of Predetermined Size: Log-Likelihood
- Learning a 40-state HMM on Robot data (other data sets similar)
49. Learning HMMs of Predetermined Size: Learning 40-state HMMs
50. Model Selection: Synthetic Data
- Generalize from (4 states, T = 1000) to (10 states, T = 10,000)
51. Model Selection: Synthetic Data
- Both STACS and V-STACS discovered 10 states and the correct underlying transition structure
52. Model Selection: Synthetic Data
- Li-Biswas and ML-SSS failed to find the 10-state model
- 10-state Baum-Welch also failed to find the correct observation and transition models, even with 50 restarts!
53. Model Selection: BIC Score
- MoCap data (other data sets similar)
54. Model Selection
55. Sign-language recognition
- Initial results on sign-language word recognition
- 95 distinct words, 27 instances each, divided 81
- [Table: average classification accuracy and final HMM size N for each method]
56. Modeling motion capture data
- 35-dimensional data (thanks to Adrien Treuille)
57. Modeling motion capture data
58. Modeling motion capture data
- [Side-by-side panels: original data vs. STACS simulation (found 235 states)]
59-60. Modeling motion capture data
- [Side-by-side panels: original data vs. STACS simulation (found 235 states) vs. Baum-Welch (on 235 states)]
- [Video]
61. Discovering Underlying Structure
- Sparse dynamics are difficult to learn using regular EM
- STACS smoothly tiles the low-dimensional manifold of observations while recovering the correct dynamic structure
62. Conclusion
- A better method for HMM model selection and learning:
  - discovers hidden states
  - avoids local minima
  - faster than Baum-Welch
- Even when learning HMMs of known size, it is better to discover states with STACS up to the desired N
- Widespread applicability:
  - classification, recognition, and prediction for real-valued sequential data problems
66-69. The Viterbi path is denoted by Q. Suppose we split state N into s_1, s_2.