Decoding Techniques for Automatic Speech Recognition

Transcript and Presenter's Notes

1
Decoding Techniques for Automatic Speech
Recognition
  • Florian Metze
  • Interactive Systems Laboratories

2
Outline
  • Decoding in ASR
  • Search Problem
  • Evaluation Problem
  • Viterbi Algorithm
  • Tree Search
  • Re-Entry
  • Recombination

3
The ASR problem: argmax_W p(W|x)
  • Two major knowledge sources:
  • Acoustic Model p(x|W)
  • Language Model P(W)
  • Bayes: p(W|x) P(x) = p(x|W) P(W)
  • Search problem: argmax_W p(x|W) P(W)  (see the sketch below)
  • p(x|W) consists of Hidden Markov Models
  • Dictionary defines the state sequence: hello → /hh eh l ow/
  • Full model: concatenation of states (i.e. sounds)
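
A minimal sketch (not taken from the slides) of how these two knowledge sources are combined in the search: both are turned into log scores and added, and the best word sequence wins. The candidate sequences, the probability values, and the LM weight are made-up illustrative choices.

    import math

    # Hypothetical candidates with made-up values:
    # acoustic[W] plays the role of p(x|W), lm[W] the role of P(W).
    acoustic = {"hello world": 1e-42, "yellow whirled": 3e-43}
    lm = {"hello world": 1e-3, "yellow whirled": 1e-7}

    def log_score(W, lm_weight=1.0):
        # argmax_W p(x|W) * P(W), computed as a sum of log scores
        # (the LM weight is a common practical addition, not mentioned on the slide)
        return math.log(acoustic[W]) + lm_weight * math.log(lm[W])

    best = max(acoustic, key=log_score)
    print(best)  # -> hello world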

4
Target Function / Measure
  • WER: minimum edit distance between reference and hypothesis (a small sketch of the computation follows below)
  • Example:
  • REF: the quick brown fox jumps over
  • HYP: quick brown fox jump is over
  • Errors: 1 deletion (the), 1 substitution (jumps → jump), 1 insertion (is)
  • WER = 3/7 ≈ 43%
  • Different measure from max p(W|x)!
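
A small sketch of the WER computation as a minimum edit distance (Levenshtein distance over words); the function name and the usage strings are my own.

    def wer(ref, hyp):
        """Word error rate = (substitutions + deletions + insertions) / #reference words."""
        r, h = ref.split(), hyp.split()
        # d[i][j] = minimum edit distance between r[:i] and h[:j]
        d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
        for i in range(len(r) + 1):
            d[i][0] = i                               # i deletions
        for j in range(len(h) + 1):
            d[0][j] = j                               # j insertions
        for i in range(1, len(r) + 1):
            for j in range(1, len(h) + 1):
                sub = 0 if r[i - 1] == h[j - 1] else 1
                d[i][j] = min(d[i - 1][j - 1] + sub,  # match / substitution
                              d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1)        # insertion
        return d[len(r)][len(h)] / len(r)

    print(wer("hello world how are you", "hello word how you"))  # 2 errors / 5 words = 0.4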

5
A simpler problem: Evaluation
  • So far we have:
  • Dictionary: hello → /hh eh l ow/
  • Acoustic Model: p_hh(x), p_eh(x), p_l(x), p_ow(x)
  • Language Model: P(hello world)
  • State sequence: /hh eh l ow w er l d/
  • Given W and x: Alignment needed!

[Figure: alignment of the state sequence /hh eh l ow/ against the acoustic frames]
7
The Viterbi Algorithm
  • Beam search from left to right
  • Resulting alignment is the best match given the state models p(x) and the observations x

[Figure: trellis of local acoustic scores p(x) for the states hh, eh, l, ow over the time frames]
8
The Viterbi Algorithm (contd)
  • Evaluation problem: Dynamic Time Warping
  • Best alignment for given W, x, and state models p(x), obtained by locally adding scores (-log p) for states and transitions (a small sketch follows the figure below)

[Figure: trellis of accumulated scores for the states hh, eh, l, ow over the time frames]
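
A minimal sketch of this alignment recursion for a linear /hh eh l ow/ state sequence, assuming a simple left-to-right topology (stay in a state or advance by one). The local scores are made up and do not reproduce the grid from the slide.

    # local[s][t] stands for the local score -log p_s(x_t) of state s at frame t.
    states = ["hh", "eh", "l", "ow"]
    local = [
        [1.0, 1.2, 1.3, 1.2, 1.5, 1.4, 1.6],  # hh
        [1.3, 1.0, 1.1, 1.0, 1.2, 1.3, 1.4],  # eh
        [1.2, 1.3, 1.0, 1.1, 1.0, 1.2, 1.3],  # l
        [1.4, 1.2, 1.2, 1.1, 1.1, 1.0, 1.0],  # ow
    ]
    S, T = len(states), len(local[0])
    INF = float("inf")

    score = [[INF] * T for _ in range(S)]  # accumulated score per state and frame
    back = [[0] * T for _ in range(S)]     # best predecessor state per cell

    score[0][0] = local[0][0]              # the alignment must start in the first state
    for t in range(1, T):
        for s in range(S):
            # left-to-right topology: stay in s or come from s - 1
            cands = [(score[s][t - 1], s)]
            if s > 0:
                cands.append((score[s - 1][t - 1], s - 1))
            prev_score, prev_state = min(cands)
            score[s][t] = prev_score + local[s][t]
            back[s][t] = prev_state

    # Backtrace from the last state at the last frame to recover the alignment.
    path, s = [], S - 1
    for t in range(T - 1, -1, -1):
        path.append(states[s])
        s = back[s][t]
    path.reverse()
    print(path)  # best-scoring state for each of the T frames
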
9
Pronunciation Prefix Trees (PPT)
  • Tree representation of the search dictionary
  • Very compact → fast!
  • Viterbi Algorithm also works for trees (a small construction sketch follows the example below)

BROADWAY  B R OA D W EY
BROADLY   B R OA D L IE
BUT       B AH T
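
A small sketch of building a pronunciation prefix tree from the three entries above; the node representation and function names are my own.

    class PPTNode:
        def __init__(self):
            self.children = {}  # phone -> PPTNode
            self.word = None    # set at the node where a pronunciation ends

    def build_ppt(lexicon):
        root = PPTNode()
        for word, phones in lexicon.items():
            node = root
            for phone in phones:  # shared prefixes share nodes -> compact tree
                node = node.children.setdefault(phone, PPTNode())
            node.word = word
        return root

    lexicon = {
        "BROADWAY": ["B", "R", "OA", "D", "W", "EY"],
        "BROADLY":  ["B", "R", "OA", "D", "L", "IE"],
        "BUT":      ["B", "AH", "T"],
    }
    root = build_ppt(lexicon)
    print(list(root.children))                # ['B']       -- one shared root arc
    print(list(root.children["B"].children))  # ['R', 'AH'] -- BROAD* vs. BUT split here
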
10
Viterbi Search for PPTs
  • A PPT is traversed in a time-synchronous way
  • Apply the Viterbi Algorithm on:
  • the state level (sub-phonemic units b, m, e)
  • → constrained by the HMM topology
  • the phone level
  • → constrained by the PPT
  • What do we do when we reach the end of a word?

11
Re-Entrant PPTs for continuous speech
  • Isolated word recognition:
  • Search terminates in the leaves of the PPT
  • Decoding of word sequences:
  • Re-enter the PPT and store the Viterbi path using a backpointer table (sketched below)
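
A minimal sketch of such a backpointer table, with made-up field names: each word end appends an entry recording which word ended where and which entry preceded it, so the word sequence can be traced back at the end.

    from dataclasses import dataclass

    @dataclass
    class Backpointer:
        word: str       # word that just ended
        end_frame: int  # frame at which it ended
        prev: int       # index of the predecessor entry (-1 = utterance start)

    backpointers = []  # grows while decoding

    def re_enter(word, end_frame, prev_index):
        """Store the Viterbi path info and return the index new tree instances point to."""
        backpointers.append(Backpointer(word, end_frame, prev_index))
        return len(backpointers) - 1

    def trace(index):
        """Recover the word sequence by following the backpointer chain."""
        words = []
        while index != -1:
            bp = backpointers[index]
            words.append(bp.word)
            index = bp.prev
        return list(reversed(words))

    # toy usage: "hello" ends at frame 42, "world" ends at frame 80
    i1 = re_enter("hello", 42, -1)
    i2 = re_enter("world", 80, i1)
    print(trace(i2))  # ['hello', 'world']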

12
Problem: Branching Factor
  • Imagine a sequence of 3 words with a 10k vocabulary
  • 10k³ = 10¹² = 1000G paths (potentially)
  • Not everything will be expanded, of course
  • Viterbi approximation → path recombination
  • Given: P(Candy | hi I am) = P(Candy | hello I am)

13
Path Recombination
At time t:
  Path 1: w1 ... wN with score s1
  Path 2: v1 ... vM with score s2
where
  s1 = p(x1 ... xt | w1 ... wN) · ∏i P(wi | wi-1 wi-2)
  s2 = p(x1 ... xt | v1 ... vM) · ∏i P(vi | vi-1 vi-2)
In the end, we're only interested in the best path!
14
Path Recombination (contd)
  • To expand the search space into a new root (sketched below):
  • Pick the path with the best score so far (Viterbi approximation)
  • Initialize scores and backpointers for the root node according to the best predecessor word
  • Store the left-context model information with the last phone of the predecessor (context-dependent acoustic models: /s ih t/ → /l ih p/)
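
A sketch of this recombination step with an assumed data layout: among the word-end hypotheses active at time t, only the best-scoring one initializes the new root. Scores are treated as negative log probabilities (smaller is better); all values are made up.

    # Word-end hypotheses active at time t: (score, backpointer_index, last_phone)
    word_ends = [
        (231.7, 5, "t"),  # "... sit"
        (229.4, 8, "p"),  # "... lip"
        (233.1, 2, "t"),  # "... hat"
    ]

    def expand_root(word_ends):
        """Viterbi approximation: initialize the root from the best predecessor only."""
        best_score, best_bp, last_phone = min(word_ends)
        return {
            "score": best_score,         # carried over into the new tree copy
            "backpointer": best_bp,      # remembers where the best predecessor ended
            "left_context": last_phone,  # kept for cross-word context-dependent models
        }

    print(expand_root(word_ends))  # initialized from the hypothesis with score 229.4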

15
Problem with Re-Entry
  • For a correct use of the Viterbi algorithm, the
    choice of the best path must include the score
    for the transition from the predecessor word to
    the successor word
  • Since the word identity is not yet known at the root level, the best predecessor cannot be chosen at this point

16
Consequences
  • Wrong predecessor words
  • → language model information only at leaf level
  • Wrong word boundaries
  • The starting point for the successor word is
    determined without any language model information
  • Incomplete linguistic information
  • Open pruning thresholds are needed for beam
    search

17
Three-Pass search strategy
  • First pass: search on a tree-organized lexicon (PPT)
  • Aggressive path recombination at word ends
  • Use linguistic information only approximately
  • Generate a list of starting words for each frame
  • Second pass: search on a flat-organized lexicon
  • Fix the word segmentation from the first pass
  • Full use of the language model (often needs a third pass)

18
Three-Pass Decoder Results
  • Q4g system with cache for acoustic scores
  • 4000 acoustic models trained on BNESST
  • 40k Vocabulary
  • Test on readBN data

Search Pass          Error Rate (%)   Real-time factor
Tree Pass            22.0             9.6
Flat Pass            18.8             0.9
Lattice Rescoring    15.0             0.2
19
One-Pass Decoder Motivation
  • The efficient use of all available knowledge
    sources as early as possible should result in
    faster decoding
  • Use the same engine to decode along
  • Statistical n-gram language models with arbitrary
    n
  • Context-free grammars (CFG)
  • Word-graphs

20
Linguistic states
  • Linguistic state, examples:
  • (n-1)-word history for a statistical n-gram LM
  • Grammar state for CFGs
  • (lattice node, word history) for word-graphs
  • To fully use the linguistic knowledge source, the
    linguistic state has to be kept during decoding
  • Path recombination has to be delayed until the
    word identity is known

21
Linguistic context assignment
  • Key idea: establish a linguistic polymorphism for each node of the PPT
  • Maintain a list of linguistically morphed instances in each node
  • Each instance stores its own backpointer and scores for each state of the underlying HMM, with respect to the linguistic state of that instance (see the sketch below)
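
A sketch of how such a node might be represented; the class and field names are my own, and the b/m/e state layout and example contexts are made up.

    from dataclasses import dataclass, field

    @dataclass
    class Instance:
        lct: tuple        # linguistic context, e.g. the (n-1)-word history
        scores: list      # one accumulated score per HMM state (b, m, e)
        backpointer: int  # entry in the backpointer table for this context

    @dataclass
    class Node:
        phone: str
        children: dict = field(default_factory=dict)   # phone -> Node
        instances: list = field(default_factory=list)  # linguistically morphed instances

    node = Node("OA")
    node.instances.append(Instance(lct=("bullets", "over"), scores=[12.4, 13.1, 14.0], backpointer=17))
    node.instances.append(Instance(lct=("drive", "down"), scores=[11.9, 12.7, 13.8], backpointer=23))
    print(len(node.instances))  # two linguistic contexts alive in the same tree node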

22
PPT with linguistically morphed instances
[Figure: PPT for BROADWAY (B R OA D W EY), BROADLY (B R OA D L IE) and BUT (B AH T), with linguistically morphed instances attached to its nodes]

Typically a 3-gram LM, i.e. P(W) = ∏i P(wi | Wi) with word histories Wi, e.g. P(broadway | bullets over)
23
Language Model Lookahead
  • Since the linguistic state is known, the
    complete LM information P(W) can be applied to
    the instances, given the possible successor words
    for that node of the PPT
  • Let:
  • lct = linguistic context/state of instance i of node n
  • path(w) = path of word w in the PPT
  • π(n, lct) = min over {w : node n ∈ path(w)} of P(w | lct)
  • score(i) = p(x1 ... xt | w1 ... wN) · P(wN | wN-1 ...) · π(n, lct)  (a small sketch follows)
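
A small sketch of the lookahead score π(n, lct): the best LM score over all words whose PPT path passes through node n, computed here by brute force over a toy word list. Scores are taken as -log probabilities (so "best" = minimum); the probabilities and the node naming are made up.

    import math

    # Toy trigram probabilities P(w | lct) for one context lct = ("bullets", "over")
    lm_prob = {"BROADWAY": 1e-4, "BROADLY": 1e-6, "BUT": 1e-2}

    # Words whose PPT path passes through each node (node names are illustrative)
    words_below = {
        "B": ["BROADWAY", "BROADLY", "BUT"],
        "B-R-OA-D": ["BROADWAY", "BROADLY"],
    }

    def lookahead(node):
        """pi(n, lct) = best (minimum -log) LM score over the words reachable from node n."""
        return min(-math.log(lm_prob[w]) for w in words_below[node])

    print(lookahead("B"))         # dominated by BUT      (-log 1e-2)
    print(lookahead("B-R-OA-D"))  # dominated by BROADWAY (-log 1e-4)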

24
LM Lookahead (contd)
  • When the word becomes unique, the exact LM score is already incorporated and no explicit word transition needs to be computed
  • The LM scores π will be updated on demand, based on a compressed PPT (smearing of LM scores)
  • Tighter pruning thresholds can be used since the language model information is no longer delayed

25
Early Path Recombination
  • Path recombination can be performed as soon as the word becomes unique, which is usually a few nodes before reaching the leaf. This reduces the number of unique linguistic contexts and instances (see the sketch below)
  • This is particularly effective for cross-word models due to the fan-out in the right-context models
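
A sketch of how the "word becomes unique" point can be found: count, for every PPT node (i.e. pronunciation prefix), how many words pass through it; early recombination is possible once that count drops to one. The toy lexicon is the one from the earlier slide, the function name is my own.

    from collections import Counter

    lexicon = {
        "BROADWAY": ("B", "R", "OA", "D", "W", "EY"),
        "BROADLY":  ("B", "R", "OA", "D", "L", "IE"),
        "BUT":      ("B", "AH", "T"),
    }

    # How many words share each prefix (= pass through each PPT node)?
    prefix_count = Counter()
    for phones in lexicon.values():
        for i in range(1, len(phones) + 1):
            prefix_count[phones[:i]] += 1

    def unique_at(word):
        """First position in the pronunciation where the word identity is unique."""
        phones = lexicon[word]
        for i in range(1, len(phones) + 1):
            if prefix_count[phones[:i]] == 1:
                return i
        return len(phones)

    print(unique_at("BROADWAY"))  # 5 -> unique at phone 'W', one node before the leaf 'EY'
    print(unique_at("BUT"))       # 2 -> unique right after 'AH'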

26
One-pass Decoder Summary
  • One-Pass decoder based on
  • One copy of tree with dynamically allocated
    instances
  • Early path recombination
  • Full language model lookahead
  • Linguistic knowledge sources
  • Statistical n-grams with n > 3 possible
  • Context-free grammars

27
Results
             Real-time factor          Error rate (%)
             3-pass      1-pass        3-pass     1-pass
VM           6.8         4.0           26.9       26.9
readBN       12.2        4.2           14.7       13.9
Meeting      55          38            43.7       43.4
28
Remarks on speed-up
  • Speed-up ranges from a factor of almost 3 for the readBN task to 1.4 for the meeting data
  • Speed-up depends strongly on matched domain
    conditions
  • Decoder profits from sharp language models
  • LM Lookahead less effective for weak language
    models due to unmatched conditions

29
Memory usage (Q4g)

Module                 3-pass    1-pass
Acoustic Models        44 MB     44 MB
Language Model         87 MB     82 MB
Overhead               16 MB     16 MB
Decoder (permanent)    120 MB    18 MB
Decoder (dynamic)      100 MB    20 MB
Total                  367 MB    180 MB
30
Summary
  • Decoding is time- and memory-consuming
  • Search errors occur when beams are too tight (→ trade-off) or the Viterbi assumption is violated
  • State-of-the-art one-pass decoder:
  • Tree structure for efficiency
  • Linguistically morphed instances of nodes and leaves
  • Other approaches exist (stack decoding, a-posteriori decoding, ...)