A Review of Hidden Markov Models for Context-Based Classification
ICML'01 Workshop on Temporal and Spatial Learning
Williams College, June 28th, 2001
- Padhraic Smyth
- Information and Computer Science
- University of California, Irvine
- www.datalab.uci.edu
2. Outline

- Context in classification
- Brief review of hidden Markov models
- Hidden Markov models for classification
- Simulation results: how useful is context?
  - (with Dasha Chudova, UCI)
3. Historical Note

- Classification in context was well-studied in pattern recognition in the 1960s and 70s
  - e.g., recursive Markov-based algorithms were proposed before hidden Markov algorithms and models were fully understood
- Applications in:
  - OCR for word-level recognition
  - remote-sensing pixel classification
4. Papers of Note

- Raviv, J., "Decision-making in Markov chains applied to the problem of pattern recognition," IEEE Trans. Information Theory, 3(4), 1967
- Hanson, Riseman, and Fisher, "Context in word recognition," Pattern Recognition, 1976
- Toussaint, G., "The use of context in pattern recognition," Pattern Recognition, 10, 1978
- Mohn, Hjort, and Storvik, "A simulation study of some contextual classification methods for remotely sensed data," IEEE Trans. Geoscience and Remote Sensing, 25(6), 1987
5. Context-Based Classification Problems

- Medical diagnosis
  - classification of a patient's state over time
- Fraud detection
  - detection of stolen credit cards
- Electronic nose
  - detection of landmines
- Remote sensing
  - classification of pixels into ground cover
6. Modeling Context

- Common theme: context
  - class labels (and features) are persistent in time/space
7. Modeling Context

- Common theme: context
  - class labels (and features) are persistent in time/space

[Figure: observed features X1 ... XT above the hidden class chain C1 ... CT, laid out along the time axis]
8. Feature Windows

- Predict Ct using a window, e.g., f(Xt, Xt-1, Xt-2)
  - e.g., the NETtalk application

[Figure: observed features X1 ... XT and hidden classes C1 ... CT over time]
9. Alternative Probabilistic Modeling

- E.g., assume p(Ct | history) = p(Ct | Ct-1)
  - first-order Markov assumption on the classes

[Figure: observed features X1 ... XT and hidden classes C1 ... CT over time]
10. Brief Review of Hidden Markov Models (HMMs)
11. Graphical Models

- Basic idea: p(U) <-> an annotated graph
  - let U be a set of random variables of interest
  - 1-1 mapping from U to nodes in a graph
  - the graph encodes the independence structure of the model
  - numerical specifications of p(U) are stored locally at the nodes
12. Acyclic Directed Graphical Models (aka belief/Bayesian networks)

p(A,B,C) = p(C|A,B) p(A) p(B)

In general, p(X1, X2, ..., XN) = Π p(Xi | parents(Xi))
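As a minimal sketch of this factorization (the probability tables below are my own invented numbers, not from the slides), here is the three-node "independent causes" network p(A,B,C) = p(C|A,B) p(A) p(B) with the local factors stored at the nodes:

```python
# Hypothetical example of a directed graphical model's factorization
# for three binary variables A, B, C with C a child of A and B.

p_A = {0: 0.7, 1: 0.3}                # prior p(A)
p_B = {0: 0.6, 1: 0.4}                # prior p(B)
p_C_given_AB = {                      # conditional p(C=1 | A, B)
    (0, 0): 0.1, (0, 1): 0.5,
    (1, 0): 0.6, (1, 1): 0.9,
}

def joint(a, b, c):
    """Joint probability assembled from the local factors."""
    pc1 = p_C_given_AB[(a, b)]
    pc = pc1 if c == 1 else 1.0 - pc1
    return p_A[a] * p_B[b] * pc

# The local factors define a proper distribution: the joint sums to 1.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(round(total, 10))  # 1.0
```

The point of the factorization is that each node only stores a table over itself and its parents, rather than one table over all 2^3 joint configurations.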
13. Undirected Graphical Models (UGs)

- Undirected edges reflect correlational dependencies
  - e.g., particles in physical systems, pixels in an image
- Also known as Markov random fields, Boltzmann machines, etc.
14. Examples of 3-way Graphical Models

Markov chain: p(A,B,C) = p(C|B) p(B|A) p(A)
15. Examples of 3-way Graphical Models

Markov chain: p(A,B,C) = p(C|B) p(B|A) p(A)

Independent causes: p(A,B,C) = p(C|A,B) p(A) p(B)
16. Hidden Markov Graphical Model

- Assumption 1
  - p(Ct | history) = p(Ct | Ct-1)
  - first-order Markov assumption on the classes
- Assumption 2
  - p(Xt | history, Ct) = p(Xt | Ct)
  - Xt depends only on the current class Ct
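The two assumptions translate directly into a generative sampler. The sketch below (my own code and parameter choices, assuming two classes with Gaussian outputs as in the simulations later in the talk) draws each class from the previous class only, and each feature from the current class only:

```python
import random

# Illustrative HMM sampler: 2 classes, Gaussian features.
# Assumption 1: p(Ct | history) = p(Ct | Ct-1)   -> Markov class chain
# Assumption 2: p(Xt | history, Ct) = p(Xt | Ct) -> Xt drawn from Ct alone

def sample_hmm(T, p_self=0.9, means=(0.0, 1.0), sigma=1.0, seed=0):
    rng = random.Random(seed)
    c = rng.randrange(2)                  # uniform initial class
    classes, features = [], []
    for _ in range(T):
        classes.append(c)
        features.append(rng.gauss(means[c], sigma))   # assumption 2
        # assumption 1: stay in class c with probability p_self
        c = c if rng.random() < p_self else 1 - c
    return classes, features

classes, features = sample_hmm(T=10)
print(classes)
```

With a high self-transition probability, the sampled class sequence shows exactly the persistence in time that "context" refers to.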
17. Hidden Markov Graphical Model

[Figure: observed features X1 ... XT and hidden classes C1 ... CT over time]

Notes:
- all temporal dependence is modeled through the class variable C
- this is the simplest possible model
- avoids modeling p(X | other Xs)
18. Generalizations of HMMs

[Figure: hidden weather states C1 ... CT linking observed atmospheric measurements A1 ... AT to observed spatial rainfall R1 ... RT]

- Hidden-state model relating atmospheric measurements to local rainfall
- Weather state couples multiple variables in time and space (Hughes and Guttorp, 1996)
- Graphical models: a language for spatio-temporal modeling
19. Exact Probability Propagation (PP) Algorithms

- Basic PP algorithm
  - Pearl, 1988; Lauritzen and Spiegelhalter, 1988
  - assume the graph has no loops
  - declare one node (any node) to be the root
  - schedule two phases of message-passing:
    - nodes pass messages up to the root
    - messages are distributed back to the leaves
  - (if there are loops, convert the loopy graph to an equivalent tree)
20. Properties of the PP Algorithm

- Exact
  - p(node | all data) is recoverable at each node
  - i.e., we get exact posteriors from local message-passing
  - modification: MPE, the most likely instantiation of all nodes jointly
- Efficient
  - complexity is exponential in the size of the largest clique
  - brute force is exponential in all variables
21. Hidden Markov Graphical Model

[Figure: observed features X1 ... XT and hidden classes C1 ... CT over time]
22. PP Algorithm for an HMM

[Figure: observed features X1 ... XT and hidden classes C1 ... CT]

- Let CT be the root
23. PP Algorithm for an HMM

[Figure: observed features X1 ... XT and hidden classes C1 ... CT]

- Let CT be the root
- Absorb evidence from the Xs (which are fixed)
24. PP Algorithm for an HMM

[Figure: observed features X1 ... XT and hidden classes C1 ... CT]

- Let CT be the root
- Absorb evidence from the Xs (which are fixed)
- Forward pass: pass evidence forward from C1
25. PP Algorithm for an HMM

[Figure: observed features X1 ... XT and hidden classes C1 ... CT]

- Let CT be the root
- Absorb evidence from the Xs (which are fixed)
- Forward pass: pass evidence forward from C1
- Backward pass: pass evidence backward from CT
- (This is the celebrated forward-backward algorithm for HMMs)
26. Comments on the F-B Algorithm

- Complexity: O(T m²)
- Has been reinvented several times
  - e.g., the BCJR algorithm for error-correcting codes
- Real-time recursive version
  - run the algorithm forward to the current time t
  - can propagate backwards to revise history
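The forward and backward passes sketched above can be written in a few lines. This is a minimal implementation under my own naming conventions (pi for the initial class distribution, A for the m x m transition matrix, lik for the T x m array of likelihoods p(Xt | Ct = j)); each message is rescaled to avoid numerical underflow:

```python
import numpy as np

# Minimal forward-backward sketch: returns the posteriors
# p(Ct | X1..XT) for every t. Cost is O(T m^2), as noted above.

def forward_backward(pi, A, lik):
    T, m = lik.shape
    alpha = np.zeros((T, m))          # forward messages (rescaled)
    beta = np.ones((T, m))            # backward messages (rescaled)
    # forward pass: absorb evidence and pass it forward from C1
    alpha[0] = pi * lik[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * lik[t]
        alpha[t] /= alpha[t].sum()
    # backward pass: pass evidence backward from CT
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (lik[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    post = alpha * beta
    return post / post.sum(axis=1, keepdims=True)   # p(Ct | all data)

# Toy usage: 2 classes, 3 time steps, persistent transitions.
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.1, 0.9]])
lik = np.array([[0.8, 0.2], [0.7, 0.3], [0.1, 0.9]])
post = forward_backward(pi, A, lik)
print(post.shape)  # (3, 2)
```

Note that the rescaling at each step leaves the per-step posteriors unchanged, since the final line renormalizes each row anyway.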
27. HMMs and Classification
28. Forward-Backward Algorithm

- Classification
  - the algorithm produces p(Ct | all other data) at each node
  - to minimize 0-1 loss, choose the most likely class at each t
- Most likely class sequence?
  - not the same as the sequence of most likely classes
  - can be found instead with Viterbi / dynamic programming
  - replace the sums in F-B with max
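The "replace sums with max" recipe gives the Viterbi recursion. Here is a sketch under the same conventions as before (pi, A, lik are my notation; log space avoids underflow):

```python
import numpy as np

# Viterbi sketch: same O(T m^2) recursion as forward-backward, but with
# the sums replaced by max, yielding the single most likely class
# sequence rather than per-step posteriors.

def viterbi(pi, A, lik):
    T, m = lik.shape
    logA = np.log(A)
    delta = np.log(pi) + np.log(lik[0])    # best log-prob ending in each class
    back = np.zeros((T, m), dtype=int)     # backpointers
    for t in range(1, T):
        scores = delta[:, None] + logA     # scores[i, j]: best path into j via i
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(lik[t])
    # trace the best path back from the most likely final class
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy usage: a noisy middle observation gets overruled by context.
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.1, 0.9]])
lik = np.array([[0.9, 0.1], [0.4, 0.6], [0.9, 0.1]])
print(viterbi(pi, A, lik))  # [0, 0, 0]
```

The toy example illustrates the slide's point: the sequence of per-step most likely classes would flip to class 1 at t=2, but the jointly most likely sequence stays in class 0 throughout.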
29. Supervised HMM Learning

- Use your favorite classifier to learn p(C|X)
  - i.e., ignore the temporal aspect of the problem (temporarily)
- Now estimate p(Ct | Ct-1) from labeled training data
- We then have a fully operational HMM
  - no need to use EM for learning if class labels are provided (i.e., do supervised HMM learning)
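The transition-estimation step is just counting and normalizing. A small sketch (the smoothing pseudocount is my addition, not from the slides):

```python
import numpy as np

# Estimate p(Ct | Ct-1) from labeled class sequences by counting
# consecutive-pair transitions and normalizing each row. A pseudocount
# keeps unseen transitions from getting probability exactly zero.

def estimate_transitions(label_seqs, m, pseudocount=1.0):
    counts = np.full((m, m), pseudocount)
    for seq in label_seqs:
        for prev, cur in zip(seq[:-1], seq[1:]):
            counts[prev, cur] += 1
    return counts / counts.sum(axis=1, keepdims=True)  # row i: p(Ct | Ct-1=i)

A_hat = estimate_transitions([[0, 0, 0, 1, 1], [1, 1, 0, 0]], m=2)
print(A_hat)
```

Combined with any instantaneous classifier for p(C|X), this completes the supervised HMM with no EM required.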
30. Fault Diagnosis Application (Smyth, Pattern Recognition, 1994)

[Figure: observed features X1 ... XT and hidden fault classes C1 ... CT]

- Fault detection in 34m antenna systems
- Classes: normal, short-circuit, tacho problem, ...
- Features: AR coefficients measured every 2 seconds
- Classes are persistent over time
31. Approach and Results

- Classifiers
  - Gaussian model and neural network
  - trained on labeled instantaneous window data
- Markov component
  - transition probabilities estimated from MTBF data
- Results
  - the discriminative neural net was much better than the Gaussian model
  - the Markov component reduced the error rate (all false alarms) from 2% to 0
32. Classification With and Without the Markov Context

[Figure: observed features X1 ... XT and hidden classes C1 ... CT]

We will compare what happens when:
(a) we just make decisions based on p(Ct | Xt) (ignoring context)
(b) we use the full Markov context (i.e., use forward-backward to integrate temporal information)
40. Simulation Experiments
41. Systematic Simulations

[Figure: observed features X1 ... XT and hidden classes C1 ... CT]

Simulation setup:
1. Two Gaussian classes, at mean 0 and mean 1
   => vary the separation/sigma of the Gaussians
2. Markov dependence, with transition matrix

   A = [ p    1-p ]
       [ 1-p  p   ]

   => vary p (the self-transition probability), the strength of context

Look at the Bayes error with and without context.
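A Monte Carlo sketch of this setup (my own code, not the original experiment) generates a class sequence from the symmetric Markov chain, draws the Gaussian features, and compares the error of decisions based on p(Ct | Xt) alone against decisions based on the forward-backward posterior:

```python
import numpy as np

# Two unit-variance Gaussian classes at means 0 and 1, self-transition
# probability p. Compare (a) context-free classification from the
# instantaneous likelihoods against (b) forward-backward smoothing.

def gauss_lik(x, means=(0.0, 1.0), sigma=1.0):
    # T x 2 array of (unnormalized) Gaussian likelihoods p(xt | class)
    x = np.asarray(x)[:, None]
    return np.exp(-0.5 * ((x - np.asarray(means)) / sigma) ** 2)

def simulate_errors(p=0.95, T=2000, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    c = np.zeros(T, dtype=int)
    c[0] = rng.integers(2)
    for t in range(1, T):                     # Markov class sequence
        c[t] = c[t - 1] if rng.random() < p else 1 - c[t - 1]
    x = rng.normal(c.astype(float), sigma)    # Gaussian features
    lik = gauss_lik(x, sigma=sigma)
    # (a) context-free: pick the class with the larger likelihood
    err_free = np.mean(lik.argmax(axis=1) != c)
    # (b) forward-backward posterior
    A = np.array([[p, 1 - p], [1 - p, p]])
    alpha = np.zeros((T, 2))
    beta = np.ones((T, 2))
    alpha[0] = 0.5 * lik[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * lik[t]
        alpha[t] /= alpha[t].sum()
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (lik[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    err_fb = np.mean((alpha * beta).argmax(axis=1) != c)
    return err_free, err_fb

err_free, err_fb = simulate_errors()
print(err_free, err_fb)   # context should reduce the error
```

With one-sigma class separation the context-free error sits near the single-sample Bayes error for these Gaussians, while the smoothed error drops well below it, which is the qualitative effect the following slides quantify.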
47. In Summary...

- Context reduces error
  - greater Markov dependence => greater reduction
- The reduction is dramatic for p > 0.9
  - e.g., even with minimal Gaussian separation, the Bayes error can be reduced to zero!
48. Approximate Methods

- Forward-only
  - necessary in many applications
- Two nearest neighbors
  - only use information from C(t-1) and C(t+1)
- How suboptimal are these methods?
53. In Summary (for Approximations)...

- Forward-only
  - tracks the forward-backward reductions
  - generally recovers much more than 50% of the gap between F-B and the context-free Bayes error
- 2-neighbors
  - typically worse than forward-only
  - much worse for small separation
  - much worse for very high transition probabilities
  - does not converge to zero Bayes error
54. Extensions to Simple HMMs

- Semi-Markov models: the duration in each state need not be geometric
- Segmental Markov models: outputs within each state have a non-constant mean (a regression function)
- Dynamic belief networks: allow arbitrary dependencies among classes and features
- Stochastic grammars, spatial landmark models, etc.
- See the afternoon talks at this workshop for other approaches
55. Conclusions

- Context is increasingly important in many classification applications
- Graphical models
  - HMMs are a simple and practical approach
  - graphical models provide a general-purpose language for context
- Theory/simulation
  - the effect of context on the error rate can be dramatic
61-66. Sketch of the PP Algorithm in Action

[Figure sequence: successive slides animate messages 1, 2, 3, and 4 being passed through the graph toward the root and back out to the leaves]