Fast State Discovery for HMM Model Selection and Learning - PowerPoint PPT Presentation

Transcript and Presenter's Notes
1
Fast State Discovery for HMM Model Selection and
Learning
  • Sajid M. Siddiqi
  • Geoffrey J. Gordon
  • Andrew W. Moore
  • CMU

2
[Plot: observation value Ot over time t]
Consider a sequence of real-valued observations
(speech, sensor readings, stock prices, ...).
3
We can model it purely based on contextual
properties.
4
(No Transcript)
5
However, we would miss important temporal
structure.
6
(No Transcript)
7
[Plot: Ot vs. t]
Current efficient approaches learn the wrong
model.
8
Our method successfully discovers the overlapping
states.
9
Our goal: efficiently discover states in
sequential data while learning a Hidden Markov
Model.
10
Motion Capture
11
Definitions and Notation
  • An HMM is λ = {A, B, π} where
  • A : N×N transition matrix
  • B : observation model {μs, Σs} for each of N
    states
  • π : N×1 prior probability vector
  • T : size of observation sequence O1,...,OT
  • qt : the state the HMM is in at time t; qt ∈
    {s1,...,sN}
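To make the notation concrete, here is a minimal sketch of a parameter container for such an HMM, assuming Gaussian observation densities as used throughout the talk (the class and field names are ours, not the authors'):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GaussianHMM:
    """Illustrative container for the HMM parameters defined above."""
    A: np.ndarray      # N x N transition matrix; rows sum to 1
    mu: np.ndarray     # N x d observation means, one per state (the "B" model)
    sigma: np.ndarray  # N x d x d observation covariances, one per state
    pi: np.ndarray     # length-N prior probability vector

    @property
    def N(self) -> int:
        return self.A.shape[0]
```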

12
Operations on HMMs
13
Operations on HMMs
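The operations table on these slides is not transcribed, but the two operations the talk relies on later, evaluating the data likelihood P(O | λ) and computing the Viterbi path Q, are standard. A textbook sketch of the scaled forward recursion for the log-likelihood (our code, not the authors'):

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(hmm: "GaussianHMM", O: np.ndarray) -> float:
    """Scaled forward algorithm: returns log P(O_1..T | lambda)."""
    T = len(O)
    # b[t, s] = density of observation O_t under state s
    b = np.array([[multivariate_normal.pdf(O[t], hmm.mu[s], hmm.sigma[s])
                   for s in range(hmm.N)] for t in range(T)])
    alpha = hmm.pi.ravel() * b[0]
    c = alpha.sum()
    ll = np.log(c)
    alpha /= c
    for t in range(1, T):
        alpha = (alpha @ hmm.A) * b[t]   # propagate, then weight by b_s(O_t)
        c = alpha.sum()                  # scaling keeps alpha numerically stable
        ll += np.log(c)
        alpha /= c
    return ll
```

Replacing the sums in this recursion with maxes (plus back-pointers) gives the Viterbi path, whose likelihood V-STACS later uses for fast scoring.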
14
Previous Approaches
  • Multi-restart Baum-Welch is inefficient and highly
    prone to local minima

15
(No Transcript)
16
(No Transcript)
17
Previous Approaches
  • Multi-restart Baum-Welch is inefficient and highly
    prone to local minima

We propose Simultaneous Temporal and Contextual
Splitting (STACS): a top-down approach that is
much better at state discovery while being at
least as efficient, and a variant, V-STACS, that is
much faster.
18
Bayesian Information Criterion (BIC) for Model
Selection
  • Would like to compute the posterior probability
    for model selection:
  • P(model size | data) ∝ P(data | model size) ·
    P(model size)
  • log P(model size | data) = log P(data | model size)
    + log P(model size) + const

19
Bayesian Information Criterion (BIC) for Model
Selection
  • BIC assumes a prior that penalizes complexity
    (favors smaller models):
  • log P(model size | data) ≈
    log P(data | model size, θMLE) − (FP/2) log T
  • where FP = number of free parameters, T =
    length of data sequence, and θMLE is the ML
    parameter estimate

20
Bayesian Information Criterion (BIC) for Model
Selection
  • BIC is an asymptotic approximation to the true
    posterior
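A minimal sketch of how the BIC score above might be computed for a Gaussian HMM; the free-parameter count assumes full covariances, which the slides do not spell out, and the function name is ours:

```python
import numpy as np

def bic_score(log_lik: float, N: int, d: int, T: int) -> float:
    """BIC ≈ log P(data | model size, theta_MLE) - (FP / 2) log T."""
    # Free parameters (assumption: d-dim full-covariance Gaussians):
    # N(N-1) transitions, N-1 priors, N*d means, N*d(d+1)/2 covariance entries.
    fp = N * (N - 1) + (N - 1) + N * d + N * d * (d + 1) // 2
    return log_lik - 0.5 * fp * np.log(T)
```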

21
Algorithm Summary (STACS/VSTACS)
  • Initialize an n0-state HMM randomly
  • for n = n0, ..., Nmax
  • Learn model parameters
  • for i = 1, ..., n
  • Split state i, optimize by constrained EM
    (STACS) or constrained Viterbi training (V-STACS)
  • Calculate approximate BIC score of the split model
  • Choose the best split based on approximate BIC
  • Compare to the original model with exact BIC
    (STACS) or approximate BIC (V-STACS)
  • if the larger model is not chosen, stop
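Read as code, the loop above might look like the following sketch; every helper here (random_hmm, learn_params, split_and_optimize, the BIC scorers) is a hypothetical stand-in for the step of the same name on the slide:

```python
def stacs_model_selection(O, n0, N_max, variant="STACS"):
    """Sketch of the outer STACS/V-STACS loop; all helpers are placeholders."""
    hmm = random_hmm(n0)                       # initialize n0-state HMM randomly
    for n in range(n0, N_max + 1):
        hmm = learn_params(hmm, O, variant)    # EM (STACS) or Viterbi training
        candidates = [split_and_optimize(hmm, O, s, variant)
                      for s in range(hmm.N)]   # one constrained split per state
        best = max(candidates, key=lambda c: approx_bic(c, O))
        # exact BIC for STACS, approximate BIC for V-STACS
        score = exact_bic if variant == "STACS" else approx_bic
        if score(best, O) <= score(hmm, O):
            return hmm                         # larger model not chosen: stop
        hmm = best
    return hmm
```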

22
STACS
  • input: n0, data sequence O = O1,...,OT
  • output: HMM λ of appropriate size
  • λ ← n0-state initial HMM
  • repeat
  • optimize λ over sequence O
  • choose a subset of states S
  • for each s ∈ S
  • design a candidate model λs
  • choose a relevant subset of sequence O
  • split state s, optimize λs over the subset
  • score λs
  • end for
  • if maxs(score(λs)) > score(λ)
  • λ ← best-scoring candidate λs
  • else stop

23
STACS
  • Learn parameters using EM, calculate the Viterbi
    path Q
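The Viterbi path Q used here (and for choosing the subset D on the next slides) is the standard dynamic program; a sketch in the same style as before, assuming the per-state log observation densities are precomputed (our code):

```python
import numpy as np

def viterbi_path(hmm: "GaussianHMM", log_b: np.ndarray) -> np.ndarray:
    """Most likely state sequence Q, given log_b[t, s] = log b_s(O_t)."""
    T, N = log_b.shape
    logA = np.log(hmm.A)
    delta = np.log(hmm.pi.ravel()) + log_b[0]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA      # scores[i, j]: reach j via i
        back[t] = scores.argmax(axis=0)     # best predecessor of each state
        delta = scores.max(axis=0) + log_b[t]
    Q = np.zeros(T, dtype=int)
    Q[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):          # trace back-pointers
        Q[t] = back[t + 1, Q[t + 1]]
    return Q
```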

24
STACS
  • Consider splits on all states
  • e.g. for state s2

25
STACS
  • Choose a subset D = {Ot : Q(t) = s2}

26
STACS
  • Note that |D| is O(T/N)

27
STACS
  • Split the state

28
STACS
  • Constrain λs to equal λ, except for the offspring
    states' observation densities and all their transition
    probabilities, both in and out

29
STACS
  • Learn the free parameters using two-state EM over
    D. This optimizes the partially observed
    likelihood P(O, Q\D | λs)

30
STACS
  • Update Q over D to get R

31
STACS
  • Scoring is of two types

32
STACS
  • The candidates are compared to each other
    according to their Viterbi path likelihoods
33
STACS
  • The best candidate in this ranking is compared to
    the un-split model λ using BIC, i.e.
  • log P(model | data) ≈ log P(data | model) −
    complexity penalty
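Putting slides 23-33 together, one candidate split might be built roughly as in the sketch below. The mean-perturbation initializer and the helpers add_state_copy and two_state_em are our assumptions; the slides only specify what is held fixed and what is optimized:

```python
import numpy as np

def make_split_candidate(hmm, O, Q, s, eps=1e-2):
    """Sketch: split state s into two offspring and optimize only the
    offspring's densities and in/out transitions over D = {O_t : Q(t) = s}."""
    D = O[Q == s]                    # timesteps Viterbi-assigned to s; |D| ~ T/N
    cand = add_state_copy(hmm, s)    # hypothetical helper: clone state s
    # Nudge the offspring means apart so two-state EM can break symmetry
    # (our choice of initializer; not specified on the slides).
    cand.mu[s] += eps
    cand.mu[-1] -= eps
    # Constrained optimization: all other states' parameters stay fixed.
    cand = two_state_em(cand, D, free_states=(s, cand.N - 1))
    return cand
```

Candidates are then ranked by Viterbi path likelihood, and only the winner is scored against the un-split model with BIC.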
34
Viterbi STACS (V-STACS)

35
Viterbi STACS (V-STACS)
  • Recall that STACS learns the free parameters
    using two-state EM over D. However, EM also has
    winner-take-all variants

36
Viterbi STACS (V-STACS)
  • V-STACS uses two-state Viterbi training over D to
    learn the free parameters, which uses hard
    updates vs. STACS' soft updates

37
Viterbi STACS (V-STACS)
  • The Viterbi path likelihood is used to
    approximate the BIC vs. the un-split model in
    V-STACS
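The hard-vs.-soft distinction shows up in how each variant re-estimates a state mean; a minimal sketch for d-dimensional observations D (a T×d array), with our own function names:

```python
import numpy as np

def hard_mean_update(D: np.ndarray, assign: np.ndarray, s: int) -> np.ndarray:
    """V-STACS-style Viterbi training: each point counts for exactly one state."""
    return D[assign == s].mean(axis=0)

def soft_mean_update(D: np.ndarray, gamma: np.ndarray, s: int) -> np.ndarray:
    """STACS-style EM: each point is weighted by its posterior gamma[t, s]."""
    w = gamma[:, s]
    return (w[:, None] * D).sum(axis=0) / w.sum()
```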

38
Time Complexity
  • Optimizing N candidates takes
  • N × O(T) time for STACS
  • N × O(T/N) time for V-STACS
  • Scoring N candidates takes N × O(T) time
  • Candidate search and scoring is O(TN)
  • Best-candidate evaluation is
  • O(TN²) for BIC in STACS
  • O(TN) for approximate BIC in V-STACS

39
Other Methods
  • Li-Biswas
  • Generates two candidates:
  • splits the state with the highest variance
  • merges the closest pair of states (rarely chosen)

40
Other Methods
  • Optimizes all candidate parameters over the entire
    sequence

41
Other Methods
  • ML-SSS
  • Generates 2N candidates, splitting each state in
    two ways

42
Other Methods
  • Contextual split optimizes the offspring states'
    observation densities with 2-Gaussian mixture EM;
    assumes offspring connected in parallel

43
Other Methods
  • Temporal split optimizes the offspring states'
    observation densities, self-transitions and
    mutual transitions with EM; assumes offspring
    connected in series

44
Other Methods
  • Optimizes the split of state s over all timesteps
    with nonzero posterior probability of being in
    state s, i.e. O(T) data points

45
Results
46
Data sets
  • Australian Sign-Language data collected from two
    5DT instrumented gloves and an Ascension
    Flock-of-Birds tracker [Kadous 2002] (available in
    the UCI KDD Archive)
  • Other data sets obtained from the literature
  • Robot, MoCap, MLog, Vowel

47
Learning HMMs of Predetermined Size: Scalability
  • Robot data (others similar)

48
Learning HMMs of Predetermined Size: Log-Likelihood
  • Learning a 40-state HMM on Robot data (others
    similar)

49
Learning HMMs of Predetermined Size
Learning 40-state HMMs
50
Model Selection: Synthetic Data
  • Generalize (4 states, T = 1,000) to (10 states,
    T = 10,000)

51
Model Selection: Synthetic Data
  • Both STACS and V-STACS discovered 10 states and the
    correct underlying transition structure

52
Model Selection: Synthetic Data
  • Li-Biswas and ML-SSS failed to find the 10-state model
  • 10-state Baum-Welch also failed to find the correct
    observation and transition models, even with 50
    restarts!

53
Model Selection: BIC Score
  • MoCap data (others similar)

54
Model Selection
55
Sign-language recognition
  • Initial results on sign-language word recognition
  • 95 distinct words, 27 instances each, divided 81
  • Average classification accuracies and HMM sizes

[Table: average classification accuracy and final N per method]
56
Modeling motion capture data
  • 35-dimensional data (thanks to Adrien Treuille)

57
Modeling motion capture data
  • Original data


58
Modeling motion capture data
  • Original data
  • STACS simulation (found 235 states)


59
Modeling motion capture data
  • Original data
  • STACS simulation (found 235 states)
  • Baum-Welch (on 235 states)


60
Modeling motion capture data
  • Original data
  • STACS simulation (found 235 states)
  • Baum-Welch (on 235 states)
  • Video

61
Discovering Underlying Structure
  • Sparse dynamics are difficult to learn using
    regular EM
  • STACS smoothly tiles the low-dimensional manifold
    of observations along with the correct dynamic
    structure

62
Conclusion
  • A better method for HMM model selection and
    learning:
  • discovers hidden states
  • avoids local minima
  • is faster than Baum-Welch
  • Even when learning HMMs of known size, it is better
    to discover states using STACS up to the desired N
  • Widespread applicability:
  • classification, recognition and prediction for
    real-valued sequential data problems

63
(No Transcript)
64
(No Transcript)
65
(No Transcript)
66
The Viterbi path is denoted by Q. Suppose we split
state N into s1, s2.
67
(No Transcript)
68
(No Transcript)
69
(No Transcript)