Hidden Markov Models: Probabilistic Reasoning Over Time

1
Hidden Markov Models: Probabilistic Reasoning Over Time
  • Natural Language Processing
  • CMSC 25000
  • February 23, 2006

2
Agenda
  • Hidden Markov Models
  • Uncertain observation
  • Temporal Context
  • Recognition: Viterbi
  • Training the model: Baum-Welch
  • Speech Recognition
  • Framing the problem: Sounds to Sense
  • Speech Recognition as Modern AI

3
Hidden Markov Models (HMMs)
  • An HMM is
  • 1) A set of states
  • 2) A set of transition probabilities
  • Where aij is the probability of the transition qi -> qj
  • 3) Observation probabilities
  • The probability of observing ot in state i
  • 4) An initial probability dist over states
  • The probability of starting in state i
  • 5) A set of accepting states
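A minimal sketch of these five components in Python/NumPy, using a hypothetical two-state model (all names and numbers below are illustrative, not taken from the slides):

import numpy as np

# Hypothetical two-state HMM emitting "blue" or "red" balls.
states = ["bin1", "bin2"]           # 1) set of states
A = np.array([[0.6, 0.4],           # 2) transition probabilities a_ij
              [0.3, 0.7]])
obs_symbols = ["blue", "red"]
B = np.array([[0.3, 0.7],           # 3) observation probabilities b_j(o_t)
              [0.6, 0.4]])
pi = np.array([0.9, 0.1])           # 4) initial distribution over states
final_states = {0, 1}               # 5) accepting/final states (here: all states)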

4
Three Problems for HMMs
  • Find the probability of an observation sequence
    given a model
  • Forward algorithm
  • Find the most likely path through a model given
    an observed sequence
  • Viterbi algorithm (decoding)
  • Find the most likely model (parameters) given an
    observed sequence
  • Baum-Welch (EM) algorithm

5
Bins and Balls
  • Observation sequence: Blue Blue Red

State sequence   Path probability
1 1 1   (0.9·0.3)(0.6·0.3)(0.6·0.7) = 0.0204
1 1 2   (0.9·0.3)(0.6·0.3)(0.4·0.4) = 0.0077
1 2 1   (0.9·0.3)(0.4·0.6)(0.3·0.7) = 0.0136
1 2 2   (0.9·0.3)(0.4·0.6)(0.7·0.4) = 0.0181
2 1 1   (0.1·0.6)(0.3·0.7)(0.6·0.7) = 0.0052
2 1 2   (0.1·0.6)(0.3·0.7)(0.4·0.4) = 0.0020
2 2 1   (0.1·0.6)(0.7·0.6)(0.3·0.7) = 0.0052
2 2 2   (0.1·0.6)(0.7·0.6)(0.7·0.4) = 0.0070
6
Answers and Issues
  • Here, to compute the probability of the observed sequence
  • Just add up all the state sequence probabilities (brute force, sketched below)
  • To find the most likely state sequence
  • Just pick the sequence with the highest value
  • Problem: Computing all paths is expensive
  • On the order of 2T·N^T multiplications
  • Solution: Dynamic Programming
  • Sweep across all states at each time step
  • Summing (Problem 1) or Maximizing (Problem 2)
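As a concrete sketch of why brute force is expensive, the snippet below (reusing the hypothetical pi/A/B arrays from the earlier sketch) enumerates all N^T state sequences for a 3-symbol observation, sums their probabilities (Problem 1), and takes the maximum (Problem 2):

from itertools import product

def path_prob(path, obs):
    # joint probability of one state sequence and the observations
    p = pi[path[0]] * B[path[0], obs[0]]
    for t in range(1, len(obs)):
        p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
    return p

obs = [0, 0, 1]                                       # e.g. blue, blue, red
paths = product(range(len(states)), repeat=len(obs))  # all N^T state sequences
probs = {p: path_prob(p, obs) for p in paths}

print(sum(probs.values()))          # Problem 1: P(observations | model)
print(max(probs, key=probs.get))    # Problem 2: most likely state sequence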

7
Forward Probability
Where α is the forward probability, t is the time in the utterance, i and j are states in the HMM, aij is the transition probability, bj(ot) is the probability of observing ot in state j, N is the max (final) state, and T is the last time step.
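The equation itself did not survive in this transcript; written out with the notation in the legend, the standard forward recursion it refers to is:

\alpha_1(j) = \pi_j \, b_j(o_1)
\alpha_t(j) = \Big[ \sum_{i=1}^{N} \alpha_{t-1}(i) \, a_{ij} \Big] b_j(o_t), \quad 2 \le t \le T
P(O \mid \lambda) = \sum_{j=1}^{N} \alpha_T(j)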
8
Forward Algorithm
  • Idea: a matrix where each cell forward[t, j] represents the probability of being in state j after seeing the first t observations.
  • Each cell expresses the probability forward[t, j] = P(o1, o2, ..., ot, qt = j | λ)
  • qt = j means "the tth state in the sequence of states is state j."
  • Compute the probability by summing over extensions of all paths leading to the current cell.
  • An extension of a path from a state i at time t-1 to state j at t is computed by multiplying together: i. the previous path probability from the previous cell, forward[t-1, i]; ii. the transition probability aij from previous state i to current state j; and iii. the observation likelihood bj(ot) that current state j matches observation symbol ot.

9
Forward Algorithm
  • Function Forward(observations of length T, state-graph) returns forward-probability
  • num-states <- num-of-states(state-graph)
  • Create a path probability matrix forward[num-states+2, T+2]
  • forward[0, 0] <- 1.0
  • For each time step t from 0 to T do
  • for each state s from 0 to num-states do
  • for each transition s' from s in state-graph
  • new-score <- forward[s, t] * a[s, s'] * b[s'](ot)
  • forward[s', t+1] <- forward[s', t+1] + new-score
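A runnable NumPy version of this pseudocode (a sketch that indexes real states only, with no dummy start state, and assumes the hypothetical pi/A/B arrays defined earlier):

import numpy as np

def forward(obs, pi, A, B):
    # fwd[t, j] = P(o_1..o_t, q_t = j | model)
    T, N = len(obs), len(pi)
    fwd = np.zeros((T, N))
    fwd[0] = pi * B[:, obs[0]]                    # initialization
    for t in range(1, T):
        for j in range(N):
            # sum over all predecessor states i
            fwd[t, j] = np.sum(fwd[t - 1] * A[:, j]) * B[j, obs[t]]
    return fwd, fwd[-1].sum()                     # matrix and P(observations | model)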

10
Viterbi Algorithm
  • Find BEST sequence given signal
  • Best: P(sequence | signal)
  • Take HMM + observation sequence
  • => sequence (+ probability)
  • Dynamic programming solution
  • Record most probable path ending at a state i
  • Then most probable path from i to end
  • O(bMn)

11
Viterbi Code
Function Viterbi(observations of length T, state-graph) returns best-path
  num-states <- num-of-states(state-graph)
  Create a path probability matrix viterbi[num-states+2, T+2]
  viterbi[0, 0] <- 1.0
  For each time step t from 0 to T do
    for each state s from 0 to num-states do
      for each transition s' from s in state-graph
        new-score <- viterbi[s, t] * a[s, s'] * b[s'](ot)
        if ((viterbi[s', t+1] == 0) or (viterbi[s', t+1] < new-score)) then
          viterbi[s', t+1] <- new-score
          back-pointer[s', t+1] <- s
  Backtrace from the highest probability state in the final column of viterbi and return the path
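A corresponding NumPy sketch of Viterbi: the same sweep as the forward algorithm, but maximizing instead of summing and keeping back-pointers for the final backtrace (again assuming the hypothetical pi/A/B arrays from earlier):

import numpy as np

def viterbi(obs, pi, A, B):
    # returns the most likely state sequence and its probability
    T, N = len(obs), len(pi)
    v = np.zeros((T, N))
    back = np.zeros((T, N), dtype=int)
    v[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        for j in range(N):
            scores = v[t - 1] * A[:, j]           # best way into state j from each i
            back[t, j] = np.argmax(scores)
            v[t, j] = scores.max() * B[j, obs[t]]
    path = [int(np.argmax(v[-1]))]                # backtrace from best final state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return list(reversed(path)), v[-1].max()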
12
Modeling Sequences, Redux
  • Discrete observation values
  • Simple, but inadequate
  • Many observations highly variable
  • Gaussian pdfs over continuous values
  • Assume normally distributed observations
  • Typically sum over multiple shared Gaussians
  • Gaussian mixture models
  • Trained with HMM model

13
Learning HMMs
  • Issue: Where do the probabilities come from?
  • Solution: Learn from data
  • Trains transition (aij) and emission (bj) probabilities
  • Typically assume structure
  • Baum-Welch, aka the forward-backward algorithm
  • Iteratively estimate counts of transitions/emitted symbols
  • Get estimated probabilities by forward computation
  • Divide probability mass over contributing paths

14
Learning HMMs
  • Issue: Where do the probabilities come from?
  • Supervised/manual construction
  • Solution: Learn from data
  • Trains transition (aij), emission (bj), and initial (πi) probabilities
  • Typically assume the state structure is given
  • Unsupervised
  • Baum-Welch, aka the forward-backward algorithm
  • Iteratively estimate counts of transitions/emitted symbols
  • Get estimated probabilities by forward computation
  • Divide probability mass over contributing paths

15
Manual Construction
  • Manually labeled data
  • Observation sequences, aligned to
  • Ground truth state sequences
  • Compute (relative) frequencies of state
    transitions
  • Compute frequencies of observations/state
  • Compute frequencies of initial states
  • Bootstrapping: iterate tag, correct, re-estimate, tag...
  • Problems
  • Labeled data is expensive, hard or impossible to obtain, and may be inadequate to fully estimate the model
  • Sparseness problems

16
Unsupervised Learning
  • Re-estimation from unlabeled data
  • Baum-Welch aka forward-backward algorithm
  • Assume representative collection of data
  • E.g. recorded speech, gene sequences, etc
  • Assign initial probabilities
  • Or estimate from very small labeled sample
  • Compute state sequences given the data
  • I.e. use forward algorithm
  • Update transition, emission, initial probabilities

17
Updating Probabilities
  • Intuition
  • Observations identify state sequences
  • Adjust probability of transitions/emissions
  • Make them closer to those consistent with the observed data
  • Increase P(Observations | Model)
  • Functionally
  • For each state i, what proportion of transitions from state i go to state j?
  • For each state i, what proportion of observations
    match O?
  • How often is state i the initial state?

18
Estimating Transitions
  • Consider updating transition aij
  • Compute probability of all paths using aij
  • Compute the probability of all paths through i (with and without i -> j)

19
Forward Probability
Where α is the forward probability, t is the time in the utterance, i and j are states in the HMM, aij is the transition probability, bj(ot) is the probability of observing ot in state j, N is the max (final) state, and T is the last time step.
20
Backward Probability
Where β is the backward probability, t is the time in the sequence, i and j are states in the HMM, aij is the transition probability, bj(ot) is the probability of observing ot in state j, N is the final state, and T is the last time step.
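As with the forward slides, the equation is missing from the transcript; in the notation of the legend, the standard backward recursion is:

\beta_T(i) = 1
\beta_t(i) = \sum_{j=1}^{N} a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j), \quad t = T-1, \dots, 1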
21
Re-estimating
  • Estimate transitions from i -> j
  • Estimate observations in j
  • Estimate initial i
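The re-estimation formulas these bullets refer to (the standard Baum-Welch updates, written with the forward and backward probabilities from the previous slides):

\gamma_t(i) = \frac{\alpha_t(i) \, \beta_t(i)}{P(O \mid \lambda)}
\xi_t(i,j) = \frac{\alpha_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)}{P(O \mid \lambda)}
\hat{a}_{ij} = \sum_{t=1}^{T-1} \xi_t(i,j) \Big/ \sum_{t=1}^{T-1} \gamma_t(i)
\hat{b}_j(k) = \sum_{t:\, o_t = k} \gamma_t(j) \Big/ \sum_{t=1}^{T} \gamma_t(j)
\hat{\pi}_i = \gamma_1(i)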

22
Speech Recognition
  • Goal
  • Given an acoustic signal, identify the sequence
    of words that produced it
  • Speech understanding goal
  • Given an acoustic signal, identify the meaning
    intended by the speaker
  • Issues
  • Ambiguity: many possible pronunciations
  • Uncertainty: what signal, what word/sense produced this sound sequence

23
Decomposing Speech Recognition
  • Q1: What speech sounds were uttered?
  • Human languages: 40-50 phones
  • Basic sound units: b, m, k, ax, ey, ... (ARPAbet)
  • Distinctions categorical to speakers
  • Acoustically continuous
  • Part of knowledge of language
  • Build per-language inventory
  • Could we learn these?

24
Decomposing Speech Recognition
  • Q2: What words produced these sounds?
  • Look up sound sequences in a dictionary
  • Problem 1: Homophones
  • Two words, same sounds: too, two
  • Problem 2: Segmentation
  • No spaces between words in continuous speech
  • I scream/ice cream, Wreck a nice beach/Recognize speech
  • Q3: What meaning produced these words?
  • NLP (But that's not all!)

26
Signal Processing
  • Goal: Convert impulses from the microphone into a representation that
  • is compact
  • encodes features relevant for speech recognition
  • Compactness: Step 1
  • Sampling rate: how often we look at the data
  • 8 kHz, 16 kHz (44.1 kHz: CD quality)
  • Quantization factor: how much precision
  • 8-bit, 16-bit (encoding: u-law, linear)

27
(A Little More) Signal Processing
  • Compactness: Feature identification
  • Capture mid-length speech phenomena
  • Typically frames of 10 ms (80 samples)
  • Overlapping
  • Vector of features, e.g. energy at some frequency
  • Vector quantization
  • n-feature vectors => n-dimensional space
  • Divide into m regions (e.g. 256)
  • All vectors in a region get the same label, e.g. C256

28
Speech Recognition Model
  • Question: Given the signal, what words?
  • Problem: uncertainty
  • Capture of sound by the microphone, how phones produce sounds, which words make phones, etc.
  • Solution: Probabilistic model
  • P(words | signal) = P(signal | words) P(words) / P(signal)
  • Idea: Maximize P(signal | words) P(words)
  • P(signal | words): acoustic model; P(words): language model
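Written out, this is the standard noisy-channel decomposition; P(signal) is constant over candidate word strings, so it drops out of the maximization:

\hat{W} = \arg\max_W P(W \mid signal) = \arg\max_W \frac{P(signal \mid W) \, P(W)}{P(signal)} = \arg\max_W P(signal \mid W) \, P(W)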

29
Language Model
  • Idea: some utterances are more probable than others
  • Standard solution: n-gram model
  • Typically tri-gram: P(wi | wi-1, wi-2)
  • Collect training data
  • Smooth with bi- and uni-grams to handle sparseness
  • Product over words in the utterance
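Under the trigram assumption, the product over the words of the utterance is:

P(w_1, \dots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1})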

30
Acoustic Model
  • P(signal | words)
  • words -> phones; phones -> vector quantization
  • Words -> phones
  • Pronunciation dictionary lookup
  • Multiple pronunciations?
  • Probability distribution
  • Dialect variation: tomato
  • Coarticulation
  • Product along the path

[Figure: pronunciation network with arc probabilities 0.5, 0.5, 0.5, 0.2, 0.5, 0.8]
31
Pronunciation Example
  • Observations 0/1

32
Acoustic Model
  • P(signal | phones)
  • Problem: Phones can be pronounced differently
  • Speaker differences, speaking rate, microphone
  • Phones may not even appear; different contexts
  • The observation sequence is uncertain
  • Solution: Hidden Markov Models
  • 1) Hidden => observations are uncertain
  • 2) Probability of word sequences =>
  • state transition probabilities
  • 3) 1st-order Markov => use 1 prior state

33
Acoustic Model
  • 3-state phone model for [m]
  • Use a Hidden Markov Model (HMM)
  • Probability of the sequence = sum of the probabilities of the paths

[Figure: 3-state phone model for [m]]
Transition probabilities: self-loops 0.3, 0.9, 0.4; forward transitions 0.7, 0.1, 0.6
Observation probabilities per state: C1 0.5, C2 0.2, C3 0.3; C3 0.2, C4 0.7, C5 0.1; C4 0.1, C6 0.4, C6 0.5
34
ASR Training
  • Models to train
  • Language model: typically tri-gram
  • Observation likelihoods: B
  • Transition probabilities: A
  • Pronunciation lexicon: sub-phone, word
  • Training materials
  • Speech files + word transcriptions
  • Large text corpus
  • Small phonetically transcribed speech corpus

35
Training
  • Language model
  • Uses large text corpus to train n-grams
  • 500 M words
  • Pronunciation model
  • HMM state graph
  • Manual coding from dictionary
  • Expand to triphone context and sub-phone models

36
HMM Training
  • Training the observations
  • E.g. Gaussians: set uniform initial mean/variance
  • Train based on the contents of a small (e.g. 4-hour) phonetically labeled speech set (e.g. Switchboard)
  • Training A & B
  • Forward-Backward algorithm training

37
Does it work?
  • Yes
  • 99% on isolated single digits
  • 95% on restricted short utterances (air travel)
  • 80% on professional news broadcasts
  • No
  • 55% on conversational English
  • 35% on conversational Mandarin
  • ?? Noisy cocktail parties

38
N-grams
  • Perspective
  • Some sequences (words/chars) are more likely than
    others
  • Given sequence, can guess most likely next
  • Used in
  • Speech recognition
  • Spelling correction,
  • Augmentative communication
  • Other NL applications

39
Corpus Counts
  • Estimate probabilities by counts in large
    collections of text/speech
  • Issues
  • Wordforms (surface) vs lemma (root)
  • Case? Punctuation? Disfluency?
  • Type (distinct words) vs Token (total)

40
Basic N-grams
  • Most trivial: 1/(number of tokens); too simple!
  • Standard unigram: frequency
  • word occurrences / total corpus size
  • E.g. the: 0.07; rabbit: 0.00001
  • Too simple: no context!
  • Conditional probabilities of word sequences

41
Markov Assumptions
  • Exact computation requires too much data
  • Approximate the probability given all prior words
  • Assume finite history
  • Bigram: Probability of a word given 1 previous word
  • First-order Markov
  • Trigram: Probability of a word given 2 previous words
  • N-gram approximation

Bigram sequence
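The equation that accompanied this caption is missing from the transcript; the chain rule and the bigram (first-order Markov) approximation it refers to are:

P(w_1^n) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1}) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})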
42
Issues
  • Relative frequency
  • Typically compute the count of the sequence
  • Divide by the count of the prefix
  • Corpus sensitivity
  • Shakespeare vs Wall Street Journal
  • Very unnatural
  • N-grams
  • Unigrams: little; bigrams: collocations; trigrams: phrases

43
Evaluating n-gram models
  • Entropy & Perplexity
  • Information-theoretic measures
  • Measure the information in a grammar, or the fit to data
  • Conceptually, a lower bound on the number of bits needed to encode
  • Entropy: H(X), where X is a random variable and p is its probability function
  • E.g. 8 things: number them as a code => 3 bits/transmission
  • Alternatively: short codes for high-probability items, longer codes for lower
  • Can reduce
  • Perplexity
  • Weighted average of number of choices
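The standard definitions behind these bullets, for a random variable X with probability function p and a word sequence W of length N:

H(X) = -\sum_{x} p(x) \log_2 p(x)
\mathrm{Perplexity}(W) = P(w_1 \dots w_N)^{-1/N} = 2^{H(W)}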

44
Entropy of a Sequence
  • Basic sequence
  • Entropy of a language: sequences of infinite length
  • Assume stationary & ergodic
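The per-word entropy of a sequence, and its limit for a language (the entropy rate), assuming a stationary ergodic process so that a single long sample suffices:

\frac{1}{n} H(W_1^n) = -\frac{1}{n} \sum_{W_1^n \in L} p(W_1^n) \log_2 p(W_1^n)
H(L) = \lim_{n \to \infty} \frac{1}{n} H(W_1^n) = \lim_{n \to \infty} -\frac{1}{n} \log_2 p(W_1^n)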

45
Cross-Entropy
  • Comparing models
  • Actual distribution unknown
  • Use simplified model to estimate
  • Closer match will have lower cross-entropy
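The cross-entropy of a model m with respect to the true distribution p, again per word in the limit; since H(p) <= H(p, m), the closer model gives the lower value:

H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \sum_{W_1^n} p(W_1^n) \log_2 m(W_1^n) \approx -\frac{1}{n} \log_2 m(w_1 \dots w_n) \text{ for a long sample}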

46
Speech Recognition as Modern AI
  • Draws on wide range of AI techniques
  • Knowledge representation & manipulation
  • Optimal search: Viterbi decoding
  • Machine Learning
  • Baum-Welch for HMMs
  • Nearest neighbor & k-means clustering for signal identification
  • Probabilistic reasoning/Bayes rule
  • Manage uncertainty in signal, phone, word mapping
  • Enables real world application