A Review of Hidden Markov Models for Context-Based Classification (ICML'01 Workshop on Temporal and Spatial Learning)

Transcript and Presenter's Notes



1
A Review of Hidden Markov Models for
Context-Based Classification
ICML'01 Workshop on Temporal and Spatial Learning
Williams College, June 28th, 2001
  • Padhraic Smyth
  • Information and Computer Science
  • University of California, Irvine
  • www.datalab.uci.edu

2
Outline
  • Context in classification
  • Brief review of hidden Markov models
  • Hidden Markov models for classification
  • Simulation results: how useful is context?
  • (with Dasha Chudova, UCI)

3
Historical Note
  • Classification in Context was well-studied in
    pattern recognition in the 60s and 70s
  • e.g., recursive Markov-based algorithms were
    proposed, before hidden Markov algorithms and
    models were fully understood
  • Applications in
  • OCR for word-level recognition
  • remote-sensing pixel classification

4
Papers of Note
Raviv, J., "Decision-making in Markov chains applied
to the problem of pattern recognition," IEEE Trans.
Information Theory, 13(4), 1967.
Hanson, Riseman, and Fisher, "Context in word
recognition," Pattern Recognition, 1976.
Toussaint, G., "The use of context in pattern
recognition," Pattern Recognition, 10, 1978.
Mohn, Hjort, and Storvik, "A simulation study of some
contextual classification methods for remotely sensed
data," IEEE Trans. Geo. Rem. Sens., 25(6), 1987.
5
Context-Based Classification Problems
  • Medical Diagnosis
  • classification of a patient's state over time
  • Fraud Detection
  • detection of stolen credit cards
  • Electronic Nose
  • detection of landmines
  • Remote Sensing
  • classification of pixels into ground cover

6
Modeling Context
  • Common theme: context
  • class labels (and features) are persistent in
    time/space

7
Modeling Context
  • Common theme: context
  • class labels (and features) are persistent in
    time/space

[Figure: hidden classes C1, C2, C3, ..., CT over
time, each emitting an observed feature Xt]
8
Feature Windows
  • Predict Ct using a window, e.g., f(Xt, Xt-1,
    Xt-2)
  • e.g., NETtalk application

[Figure: hidden classes C1, C2, C3, ..., CT over
time, each emitting an observed feature Xt]
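A minimal sketch of the windowing idea: stack the
current and previous feature values into one vector
per time step and feed that to any static classifier.
The function name and the zero-padding at the start
of the sequence are illustrative choices, not from
the talk:

```python
import numpy as np

def window_features(x, k=3):
    """Build f(Xt, Xt-1, ..., Xt-k+1) per time step by stacking
    lagged copies of the feature sequence (zero-padded at the start)."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    out = np.zeros((T, k))
    for j in range(k):
        out[j:, j] = x[:T - j]   # column j holds the value from j steps ago
    return out
```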
9
Alternative Probabilistic Modeling
  • E.g., assume p(Ct | history) = p(Ct | Ct-1)
  • first order Markov assumption on the classes

[Figure: hidden classes C1, C2, C3, ..., CT over
time, each emitting an observed feature Xt]
10
Brief review of hidden Markov models (HMMs)
11
Graphical Models
  • Basic idea: p(U) <-> an annotated graph
  • Let U be a set of random variables of interest
  • 1-1 mapping from U to nodes in a graph
  • graph encodes independence structure of model
  • numerical specifications of p(U) are stored
    locally at the nodes

12
Acyclic Directed Graphical Models (aka
belief/Bayesian networks)
p(A,B,C) = p(C|A,B) p(A) p(B)
In general, p(X1, X2, ..., XN) = Π p(Xi |
parents(Xi))
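To make "numerical specifications stored locally at
the nodes" concrete, here is a small sketch for the
three-variable network above; the conditional
probability tables for binary A, B, C are
hypothetical numbers, and the joint is assembled
exactly by the factorization:

```python
import numpy as np

# Hypothetical local tables for binary A, B, C (not from the talk).
p_A = np.array([0.7, 0.3])                 # p(A)
p_B = np.array([0.6, 0.4])                 # p(B)
p_C_AB = np.array([[[0.9, 0.1],            # p(C | A=0, B=0)
                    [0.5, 0.5]],           # p(C | A=0, B=1)
                   [[0.4, 0.6],            # p(C | A=1, B=0)
                    [0.2, 0.8]]])          # p(C | A=1, B=1)

# Assemble the joint via p(A,B,C) = p(C|A,B) p(A) p(B).
joint = p_A[:, None, None] * p_B[None, :, None] * p_C_AB
assert np.isclose(joint.sum(), 1.0)        # a valid distribution
```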

13
Undirected Graphical Models (UGs)
  • Undirected edges reflect correlational
    dependencies
  • e.g., particles in physical systems, pixels in an
    image
  • Also known as Markov random fields, Boltzmann
    machines, etc.

14
Examples of 3-way Graphical Models
Markov chain: p(A,B,C) = p(C|B) p(B|A) p(A)
15
Examples of 3-way Graphical Models
Markov chain: p(A,B,C) = p(C|B) p(B|A) p(A)
Independent causes: p(A,B,C) = p(C|A,B) p(A) p(B)
16
Hidden Markov Graphical Model
  • Assumption 1
  • p(Ct | history) = p(Ct | Ct-1)
  • first order Markov assumption on the classes
  • Assumption 2
  • p(Xt | history, Ct) = p(Xt | Ct)
  • Xt only depends on current class Ct
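The two assumptions fully specify a generative model
once the emission densities are chosen; a minimal
sampling sketch, assuming Gaussian class-conditional
features (the choice used in the simulations later in
the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hmm(T, A, pi, means, sigma=1.0):
    """Sample (classes, features) under the two assumptions above:
    Ct depends only on Ct-1, and Xt depends only on Ct.
    A: (m, m) transition matrix; pi: (m,) initial distribution."""
    m = len(pi)
    c = np.zeros(T, dtype=int)
    c[0] = rng.choice(m, p=pi)
    for t in range(1, T):
        c[t] = rng.choice(m, p=A[c[t - 1]])   # Assumption 1
    x = rng.normal(np.asarray(means)[c], sigma)  # Assumption 2
    return c, x
```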

17
Hidden Markov Graphical Model
[Figure: hidden classes C1, C2, C3, ..., CT over
time, each emitting an observed feature Xt]
Notes:
- all temporal dependence is modeled through the
  class variable C
- this is the simplest possible model
- avoids modeling p(X | other Xs)
18
Generalizations of HMMs
[Figure: hidden weather states C1, ..., CT coupling
observed atmospheric measurements A1, ..., AT to
observed spatial rainfall R1, ..., RT over time]
Hidden state model relating atmospheric measurements
to local rainfall. The weather state couples multiple
variables in time and space (Hughes and Guttorp,
1996). Graphical models: a language for
spatio-temporal modeling.
19
Exact Probability Propagation (PP) Algorithms
  • Basic PP Algorithm
  • Pearl, 1988; Lauritzen and Spiegelhalter, 1988
  • Assume the graph has no loops
  • Declare 1 node (any node) to be a root
  • Schedule two phases of message-passing
  • nodes pass messages up to the root
  • messages are distributed back to the leaves
  • (if loops, convert loopy graph to an equivalent
    tree)

20
Properties of the PP Algorithm
  • Exact
  • p(node | all data) is recoverable at each node
  • i.e., we get exact posterior from local
    message-passing
  • modification: MPE = most likely instantiation of
    all nodes jointly
  • Efficient
  • Complexity exponential in size of largest clique
  • Brute force exponential in all variables

21
Hidden Markov Graphical Model
[Figure: hidden classes C1, C2, C3, ..., CT over
time, each emitting an observed feature Xt]
22
PP Algorithm for a HMM
[Figure: hidden classes C1, C2, C3, ..., CT over
time, each emitting an observed feature Xt]
Let CT be the root.
23
PP Algorithm for a HMM
[Figure: hidden classes C1, C2, C3, ..., CT over
time, each emitting an observed feature Xt]
Let CT be the root. Absorb evidence from the Xs
(which are fixed).
24
PP Algorithm for a HMM
[Figure: hidden classes C1, C2, C3, ..., CT over
time, each emitting an observed feature Xt]
Let CT be the root. Absorb evidence from the Xs
(which are fixed). Forward pass: pass evidence
forward from C1.
25
PP Algorithm for a HMM
[Figure: hidden classes C1, C2, C3, ..., CT over
time, each emitting an observed feature Xt]
Let CT be the root. Absorb evidence from the Xs
(which are fixed). Forward pass: pass evidence
forward from C1. Backward pass: pass evidence
backward from CT. (This is the celebrated
forward-backward algorithm for HMMs.)
26
Comments on F-B Algorithm
  • Complexity: O(T m^2), for T time steps and m classes
  • Has been reinvented several times
  • e.g., BCJR algorithm for error-correcting codes
  • Real-time recursive version
  • run algorithm forward to current time t
  • can propagate backwards to revise history
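A minimal NumPy sketch of this O(T m^2) recursion,
with the standard scaling trick to avoid underflow;
the interface (obs_lik[t, j] proportional to
p(xt | Ct = j), transition matrix A, initial
distribution pi) is an illustrative assumption, not
from the talk:

```python
import numpy as np

def forward_backward(obs_lik, A, pi):
    """Posterior marginals p(Ct | all data) for an HMM chain.

    obs_lik: (T, m) array, obs_lik[t, j] ∝ p(xt | Ct = j)
    A:       (m, m) matrix, A[i, j] = p(Ct = j | Ct-1 = i)
    pi:      (m,) initial class distribution p(C1)
    """
    T, m = obs_lik.shape
    alpha = np.zeros((T, m))          # scaled forward messages
    beta = np.ones((T, m))            # scaled backward messages
    scale = np.zeros(T)

    alpha[0] = pi * obs_lik[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):             # forward pass
        alpha[t] = (alpha[t - 1] @ A) * obs_lik[t]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    for t in range(T - 2, -1, -1):    # backward pass
        beta[t] = A @ (obs_lik[t + 1] * beta[t + 1]) / scale[t + 1]

    gamma = alpha * beta              # combine the two passes
    return gamma / gamma.sum(axis=1, keepdims=True)
```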

27
HMMs and Classification
28
Forward-Backward Algorithm
  • Classification
  • Algorithm produces p(Ct | all other data) at each
    node
  • to minimize 0-1 loss
  • choose most likely class at each t
  • Most likely class sequence?
  • Not the same as the sequence of most likely
    classes
  • can be found instead with Viterbi/dynamic
    programming
  • replace sums in F-B with max
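A matching sketch of the Viterbi variant: the forward
sums are literally replaced by maxima (in log space),
and back-pointers recover the jointly most likely
class sequence. Same assumed interface as the
forward-backward sketch above:

```python
import numpy as np

def viterbi(obs_lik, A, pi):
    """Jointly most likely class sequence: F-B with sums -> max."""
    T, m = obs_lik.shape
    log_A = np.log(A)
    delta = np.log(pi) + np.log(obs_lik[0])   # best log-prob per state
    back = np.zeros((T, m), dtype=int)        # back-pointers

    for t in range(1, T):
        cand = delta[:, None] + log_A         # cand[i, j]: into j via i
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + np.log(obs_lik[t])

    path = np.empty(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):             # trace the pointers back
        path[t - 1] = back[t, path[t]]
    return path
```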

29
Supervised HMM learning
  • Use your favorite classifier to learn p(C | X)
  • i.e., ignore temporal aspect of problem
    (temporarily)
  • Now, estimate p(Ct | Ct-1) from labeled training
    data
  • We have a fully operational HMM
  • no need to use EM for learning if class labels
    are provided (i.e., do supervised HMM learning)
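A sketch of the transition-estimation step, counting
class transitions in labeled training sequences; the
add-alpha smoothing (to avoid zero probabilities for
unseen transitions) is an added assumption, not from
the talk:

```python
import numpy as np

def estimate_transitions(label_seqs, m, alpha=1.0):
    """Estimate p(Ct = j | Ct-1 = i) from labeled sequences.

    label_seqs: iterable of integer class-label sequences
    m: number of classes; alpha: smoothing pseudo-count
    """
    counts = np.full((m, m), alpha)
    for seq in label_seqs:
        for prev, cur in zip(seq[:-1], seq[1:]):
            counts[prev, cur] += 1          # count each transition
    return counts / counts.sum(axis=1, keepdims=True)
```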

30
Fault Diagnosis Application (Smyth, Pattern
Recognition, 1994)
[Figure: hidden fault classes C1, C2, C3, ..., CT
over time, each emitting observed features Xt]
Fault Detection in 34m Antenna Systems.
Classes: normal, short-circuit, tacho problem, ...
Features: AR coefficients measured every 2 seconds.
Classes are persistent over time.
31
Approach and Results
  • Classifiers
  • Gaussian model and neural network
  • trained on labeled instantaneous window data
  • Markov component
  • transition probabilities estimated from MTBF data
  • Results
  • discriminative neural net much better than
    Gaussian
  • Markov component reduced the error rate (all
    false alarms) from 2% to 0%

32
Classification with and without the Markov context
[Figure: hidden classes C1, C2, C3, ..., CT over
time, each emitting an observed feature Xt]
We will compare what happens when (a) we just make
decisions based on p(Ct | Xt) (ignore context), and
(b) we use the full Markov context (i.e., use
forward-backward to integrate temporal information)
33-39
(No Transcript)
40
Simulation Experiments
41
Systematic Simulations
[Figure: hidden classes C1, C2, C3, ..., CT over
time, each emitting an observed feature Xt]
Simulation setup:
1. Two Gaussian classes, at mean 0 and mean 1
   => vary sigma, the separation of the Gaussians
2. Markov dependence:
   A = [  p   1-p ]
       [ 1-p   p  ]
   => vary p (self-transition), the strength of
   context
Look at Bayes error with and without context
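A sketch of this setup, implementing the (a)/(b)
comparison from slide 32 and reusing the
forward_backward function sketched at slide 26; it
estimates the error rates empirically on one long
sampled sequence rather than computing the exact
Bayes error, and the defaults for T, p, and sigma are
arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(T, p, sigma):
    """Sample from the symmetric 2-state chain with Gaussian
    features at means 0 and 1, as in the setup above."""
    c = np.zeros(T, dtype=int)
    c[0] = rng.integers(2)
    for t in range(1, T):
        c[t] = c[t - 1] if rng.random() < p else 1 - c[t - 1]
    x = rng.normal(c.astype(float), sigma)
    return c, x

def error_rates(T=100_000, p=0.95, sigma=1.0):
    c, x = simulate(T, p, sigma)
    # per-step likelihoods for means 0 and 1 (the shared Gaussian
    # normalizing constant cancels everywhere below)
    lik = np.stack([np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
                    for mu in (0.0, 1.0)], axis=1)
    no_context = (lik[:, 1] > lik[:, 0]).astype(int)  # (a) ignore context
    A = np.array([[p, 1 - p], [1 - p, p]])
    post = forward_backward(lik, A, np.array([0.5, 0.5]))
    with_context = post.argmax(axis=1)                # (b) full context
    return (no_context != c).mean(), (with_context != c).mean()
```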
42-46
(No Transcript)
47
In summary...
  • Context reduces error
  • greater Markov dependence => greater reduction
  • Reduction is dramatic for p > 0.9
  • e.g., even with minimal Gaussian separation,
    Bayes error can be reduced to zero!!

48
Approximate Methods
  • Forward-Only
  • necessary in many applications
  • Two nearest-neighbors
  • only use information from C(t-1) and C(t+1)
  • How suboptimal are these methods?
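A sketch of the forward-only approximation: the same
forward recursion as before, but decisions use only
past and current evidence, so it can run in real
time. Interface as in the earlier sketches:

```python
import numpy as np

def forward_only(obs_lik, A, pi):
    """Filtered posteriors p(Ct | x1, ..., xt): forward pass only,
    no backward sweep, so usable online."""
    T, m = obs_lik.shape
    post = np.zeros((T, m))
    f = pi * obs_lik[0]
    post[0] = f / f.sum()
    for t in range(1, T):
        f = (post[t - 1] @ A) * obs_lik[t]   # predict, then update
        post[t] = f / f.sum()
    return post
```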

49-52
(No Transcript)
53
In summary (for approximations)...
  • Forward only
  • tracks forward-backward reductions
  • generally closes much more than 50% of the gap
    between F-B and the context-free Bayes error
  • 2-neighbors
  • typically worse than forward only
  • much worse for small separation
  • much worse for very high transition probs
  • does not converge to zero Bayes error

54
Extensions to Simple HMMs
Semi-Markov models: duration in each state need not
be geometric.
Segmental Markov models: outputs within each state
have a non-constant mean, e.g., a regression
function.
Dynamic belief networks: allow arbitrary dependencies
among classes and features.
Stochastic grammars, spatial landmark models, etc.
See the afternoon talks at this workshop for other
approaches.
55
Conclusions
  • Context is increasingly important in many
    classification applications
  • Graphical models
  • HMMs are a simple and practical approach
  • graphical models provide a general-purpose
    language for context
  • Theory/Simulation
  • Effect of context on error rate can be dramatic

56-60
(No Transcript)
61-66
Sketch of the PP algorithm in action
[Figure sequence: messages numbered 1-4 passed one at
a time across an example graph, first in toward the
root, then back out to the leaves]