Title: Fast State Discovery for HMM Model Selection and Learning
1. Fast State Discovery for HMM Model Selection and Learning
- Sajid M. Siddiqi
- Geoffrey J. Gordon
- Andrew W. Moore
- CMU
2. [Plot: observation sequence O_t over time t]
Consider a sequence of real-valued observations (speech, sensor readings, stock prices, ...).
3-4. [Same plot]
We can model it purely based on contextual properties.
5-6. [Same plot]
However, we would miss important temporal structure.
7. [Same plot] Current efficient approaches learn the wrong model.
8. [Same plot] Our method successfully discovers the overlapping states.
9. Our goal: efficiently discover states in sequential data while learning a Hidden Markov Model.
10. Motion Capture
11. Definitions and Notation
- An HMM is λ = (A, B, π), where
  - A: N×N transition matrix
  - B: observation model (μ_s, Σ_s) for each of the N states
  - π: N×1 prior probability vector
- T: length of the observation sequence O_1, …, O_T
- q_t: the state the HMM is in at time t; q_t ∈ {s_1, …, s_N}
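For concreteness, the notation above might be held in a small container like this (a sketch with our own class and method names, not code from the talk; it assumes Gaussian observation densities per state, as the slides' (μ_s, Σ_s) notation suggests):

```python
import numpy as np

class GaussianHMM:
    """An HMM lambda = (A, B, pi) with a Gaussian observation model B."""
    def __init__(self, A, means, covs, pi):
        self.A = np.asarray(A)          # N x N transition matrix
        self.means = np.asarray(means)  # mu_s for each of the N states
        self.covs = np.asarray(covs)    # Sigma_s for each of the N states
        self.pi = np.asarray(pi)        # N x 1 prior probability vector
        self.N = self.A.shape[0]

    def sample(self, T, rng=None):
        """Draw a length-T observation sequence O_1..O_T and states q_1..q_T."""
        rng = np.random.default_rng(0) if rng is None else rng
        q = np.empty(T, dtype=int)
        O = np.empty((T, self.means.shape[1]))
        q[0] = rng.choice(self.N, p=self.pi)
        for t in range(T):
            if t > 0:
                q[t] = rng.choice(self.N, p=self.A[q[t - 1]])
            O[t] = rng.multivariate_normal(self.means[q[t]], self.covs[q[t]])
        return O, q
```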
12-13. Operations on HMMs
14-17. Previous Approaches
- Multi-restart Baum-Welch is inefficient and highly prone to local minima.
We propose Simultaneous Temporal and Contextual Splitting (STACS): a top-down approach that is much better at state discovery while being at least as efficient, and a variant, V-STACS, that is much faster.
18-20. Bayesian Information Criterion (BIC) for Model Selection
- We would like to compute the posterior probability for model selection:
  - P(model size | data) ∝ P(data | model size) · P(model size)
  - log P(model size | data) = log P(data | model size) + log P(model size) + const.
- BIC assumes a prior that penalizes complexity (favors smaller models):
  - log P(model size | data) ≈ log P(data | model size, λ_MLE) − (FP/2) log T
  - where FP = number of free parameters, T = length of the data sequence, and λ_MLE is the ML parameter estimate
- BIC is an asymptotic approximation to the true posterior.
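The BIC formula above can be sketched in a few lines for an N-state HMM with d-dimensional Gaussian observations. The free-parameter count below (prior, transition rows, means, symmetric covariances) is a standard choice and our assumption, not quoted from the slides:

```python
import numpy as np

def bic_score(log_lik, N, d, T):
    """BIC approximation: log P(data | size, lambda_MLE) - (FP/2) log T.

    log_lik: maximized log-likelihood of the data under the N-state model
    N: number of states, d: observation dimension, T: sequence length
    """
    fp = (N - 1)                     # prior pi (sums to 1)
    fp += N * (N - 1)                # transition matrix rows (each sums to 1)
    fp += N * d                      # Gaussian means
    fp += N * d * (d + 1) // 2       # symmetric covariance matrices
    return log_lik - 0.5 * fp * np.log(T)
```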
21. Algorithm Summary (STACS/V-STACS)
- Initialize an n_0-state HMM randomly
- for n = n_0, …, N_max:
  - Learn model parameters
  - for i = 1, …, n:
    - Split state i; optimize by constrained EM (STACS) or constrained Viterbi training (V-STACS)
    - Calculate the approximate BIC score of the split model
  - Choose the best split based on approximate BIC
  - Compare to the original model with exact BIC (STACS) or approximate BIC (V-STACS)
  - If the larger model is not chosen, stop
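The grow-and-test control flow of the summary above can be sketched as a runnable toy. Here `toy_score` is a stand-in (a made-up score that peaks at 5 states), not the paper's EM/BIC routines; only the loop structure mirrors the algorithm:

```python
def toy_score(n):
    # Pretend BIC: likelihood gain saturates at 5 states, penalty grows linearly.
    return 10 * min(n, 5) - 1.5 * n

def grow_hmm(n0, n_max):
    """Grow the model one state at a time until the score stops improving."""
    n = n0
    while n < n_max:
        # "Split each state and keep the best split" collapses, in this toy,
        # to simply proposing a model with n + 1 states.
        if toy_score(n + 1) > toy_score(n):
            n += 1          # accept the best-scoring split
        else:
            break           # larger model not chosen: stop
    return n
```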
22. STACS
- input: n_0, data sequence O = O_1, …, O_T
- output: HMM λ of appropriate size
- λ ← n_0-state initial HMM
- repeat
  - optimize λ over sequence O
  - choose a subset of states Λ
  - for each s ∈ Λ
    - design a candidate model λ_s
    - choose a relevant subset of sequence O
    - split state s; optimize λ_s over the subset
    - score λ_s
  - end for
  - if max_s score(λ_s) > score(λ)
    - λ ← best-scoring candidate λ_s
  - else
    - return λ
23. STACS
- Learn parameters using EM; calculate the Viterbi path Q.
24. STACS
- Consider splits on all states (e.g., for state s_2).
25. STACS
- Choose a subset D = {O_t : Q(t) = s_2}.
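The subset D is read off the Viterbi path, so both steps fit in a short numpy sketch (function and variable names are our own; `log_b[t, i]` denotes log P(O_t | state i)):

```python
import numpy as np

def viterbi(log_pi, log_A, log_b):
    """Most likely state path Q for an HMM, in log space."""
    T, N = log_b.shape
    delta = np.empty((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = log_pi + log_b[0]
    for t in range(1, T):
        trans = delta[t - 1][:, None] + log_A        # trans[i, j]: come from i, go to j
        psi[t] = np.argmax(trans, axis=0)            # best predecessor for each j
        delta[t] = trans[psi[t], np.arange(N)] + log_b[t]
    Q = np.empty(T, dtype=int)
    Q[-1] = np.argmax(delta[-1])
    for t in range(T - 2, -1, -1):                   # backtrack
        Q[t] = psi[t + 1][Q[t + 1]]
    return Q

def split_subset(O, Q, s):
    """D = {O_t : Q(t) = s}; |D| is roughly T/N on average."""
    idx = np.where(Q == s)[0]
    return O[idx], idx
```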
26. STACS
- Note that |D| is O(T/N).
28. STACS
- Split the state.
- Constrain λ_s to match λ except for the offspring states' observation densities and all their transition probabilities, both incoming and outgoing.
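One plausible way to build the split candidate λ_s: duplicate state s into two offspring with perturbed means and evenly divided transition mass. The perturbation scheme below is our guess at a sensible initialization, not the paper's exact recipe:

```python
import numpy as np

def split_state(A, means, covs, pi, s, eps=0.1, rng=None):
    """Return a candidate (A2, means2, covs2, pi2) with state s split in two."""
    rng = np.random.default_rng(0) if rng is None else rng
    N, d = means.shape
    # Offspring observation densities: jitter the parent mean in both directions.
    jitter = eps * rng.standard_normal(d)
    means2 = np.vstack([means, means[s] + jitter])
    means2[s] = means[s] - jitter
    covs2 = np.concatenate([covs, covs[s:s + 1]], axis=0)  # both inherit Sigma_s
    # Transitions: new state copies the parent's outgoing row, then the
    # incoming mass into the parent is split evenly between the offspring.
    A2 = np.zeros((N + 1, N + 1))
    A2[:N, :N] = A
    A2[N, :N] = A[s, :N]
    A2[:, N] = A2[:, s] / 2.0
    A2[:, s] /= 2.0
    A2 /= A2.sum(axis=1, keepdims=True)   # renormalize rows for safety
    pi2 = np.append(pi, pi[s] / 2.0)
    pi2[s] /= 2.0
    return A2, means2, covs2, pi2
```

The free parameters of the offspring would then be refit by the constrained two-state EM (or Viterbi training) described on the following slides.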
29. STACS
- Learn the free parameters using two-state EM over D. This optimizes the partially observed likelihood P(O, Q\D | λ_s).
30. STACS
- Update Q over D to get the new path R.
32. STACS
- Scoring is of two types:
  - Candidates are compared to each other according to their Viterbi path likelihoods.
33. STACS
- The best candidate in this ranking is compared to the un-split model λ using BIC, i.e.,
  - log P(model | data) ≈ log P(data | model) − complexity penalty.
34. Viterbi STACS (V-STACS)
35. Viterbi STACS (V-STACS)
- Recall that STACS learns the free parameters using two-state EM over D. However, EM also has winner-take-all variants.
36. Viterbi STACS (V-STACS)
- V-STACS uses two-state Viterbi training over D to learn the free parameters: hard updates, vs. STACS's soft updates.
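The soft-vs-hard contrast can be shown on the offspring mean updates over a 1-D subset D (a schematic comparison, not the full M-step; `resp` is an assumed name for each timestep's responsibilities for the two offspring):

```python
import numpy as np

def soft_mean_update(D, resp):
    """EM: each O_t contributes to both offspring, weighted by its posterior."""
    return (resp.T @ D) / resp.sum(axis=0)

def hard_mean_update(D, resp):
    """Viterbi training: each O_t counts only toward its argmax offspring."""
    z = np.argmax(resp, axis=1)
    return np.array([D[z == k].mean() for k in (0, 1)])
```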
37. Viterbi STACS (V-STACS)
- In V-STACS, the Viterbi path likelihood is used to approximate the BIC comparison against the un-split model.
38. Time Complexity
- Optimizing the N candidates takes
  - N × O(T) time for STACS
  - N × O(T/N) time for V-STACS
- Scoring the N candidates takes N × O(T) time
- Candidate search and scoring is O(TN)
- Best-candidate evaluation is
  - O(TN²) for exact BIC in STACS
  - O(TN) for approximate BIC in V-STACS
39-44. Other Methods
- Li-Biswas:
  - Generates two candidates:
    - splits the state with the highest variance
    - merges the pair of closest states (rarely chosen)
  - Optimizes all candidate parameters over the entire sequence
- ML-SSS:
  - Generates 2N candidates, splitting each state in two ways
  - Contextual split: optimizes the offspring states' observation densities with 2-Gaussian mixture EM; assumes the offspring are connected in parallel
  - Temporal split: optimizes the offspring states' observation densities, self-transitions, and mutual transitions with EM; assumes the offspring are connected in series
  - Optimizes the split of state s over all timesteps with nonzero posterior probability of being in state s, i.e., O(T) data points
45. Results
46. Data sets
- Australian Sign-Language data collected from two 5DT instrumented gloves and an Ascension Flock-of-Birds tracker [Kadous 2002] (available in the UCI KDD Archive)
- Other data sets obtained from the literature:
  - Robot, MoCap, MLog, Vowel
47. Learning HMMs of Predetermined Size: Scalability
- Robot data (other data sets similar)
48. Learning HMMs of Predetermined Size: Log-Likelihood
- Learning a 40-state HMM on Robot data (other data sets similar)
49. Learning HMMs of Predetermined Size: Learning 40-state HMMs
50. Model Selection: Synthetic Data
- Generalize from (4 states, T = 1000) to (10 states, T = 10,000)
51. Model Selection: Synthetic Data
- Both STACS and V-STACS discovered 10 states and the correct underlying transition structure
52. Model Selection: Synthetic Data
- Li-Biswas and ML-SSS failed to find the 10-state model
- 10-state Baum-Welch also failed to find the correct observation and transition models, even with 50 restarts!
53. Model Selection: BIC Score
- MoCap data (other data sets similar)
54. Model Selection
55. Sign-language recognition
- Initial results on sign-language word recognition
- 95 distinct words, 27 instances each, divided 81
- [Table: average classification accuracy and final HMM size N for each method]
56. Modeling motion capture data
- 35-dimensional data (thanks to Adrien Treuille)
57. Modeling motion capture data
58. Modeling motion capture data
- [Side-by-side panels: original data vs. STACS simulation (found 235 states)]
59-60. Modeling motion capture data
- [Side-by-side panels: original data vs. STACS simulation (found 235 states) vs. Baum-Welch (on 235 states)]
- [Video]
61. Discovering Underlying Structure
- Sparse dynamics are difficult to learn using regular EM
- STACS smoothly tiles the low-dimensional manifold of observations while recovering the correct dynamic structure
62. Conclusion
- A better method for HMM model selection and learning:
  - discovers hidden states
  - avoids local minima
  - faster than Baum-Welch
- Even when learning HMMs of known size, it is better to discover states with STACS up to the desired N
- Widespread applicability:
  - classification, recognition, and prediction for real-valued sequential data problems
66-69. The Viterbi path is denoted by Q. Suppose we split state N into s_1, s_2.