Decoding Techniques for Automatic Speech Recognition

Transcript and Presenter's Notes

1
Decoding Techniques for Automatic Speech
Recognition
  • Florian Metze
  • Interactive Systems Laboratories

2
Outline
  • Decoding in ASR
  • Search Problem
  • Evaluation Problem
  • Viterbi Algorithm
  • Tree Search
  • Re-Entry
  • Recombination

3
The ASR problem: argmax_W p(W|x)
  • Two major knowledge sources:
  • Acoustic Model p(x|W)
  • Language Model P(W)
  • Bayes: p(W|x) P(x) = p(x|W) P(W)
  • Search problem: argmax_W p(x|W) P(W)  (see the sketch below)
  • p(x|W) consists of Hidden Markov Models
  • Dictionary defines the state sequence: hello → /hh eh l ow/
  • Full model: concatenation of states (i.e. sounds)
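
A minimal sketch (not taken from the slides) of how these two knowledge sources are combined in the search: both are turned into log scores and added, and the best word sequence wins. The candidate sequences, the probability values, and the LM weight are made-up illustrative choices.

    import math

    # Hypothetical candidates with made-up values:
    # acoustic[W] plays the role of p(x|W), lm[W] the role of P(W).
    acoustic = {"hello world": 1e-42, "yellow whirled": 3e-43}
    lm = {"hello world": 1e-3, "yellow whirled": 1e-7}

    def log_score(W, lm_weight=1.0):
        # argmax_W p(x|W) * P(W), computed as a sum of log scores
        # (the LM weight is a common practical addition, not mentioned on the slide)
        return math.log(acoustic[W]) + lm_weight * math.log(lm[W])

    best = max(acoustic, key=log_score)
    print(best)  # -> hello world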

4
Target Function / Measure
  • WER: minimum edit distance between reference and hypothesis (a small sketch of the computation follows below)
  • Example:
  • REF: the quick brown fox jumps over
  • HYP: quick brown fox jump is over
  • Errors: 1 deletion (the), 1 substitution (jumps → jump), 1 insertion (is)
  • WER = 3/7 ≈ 43%
  • Different measure from max p(W|x)!
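
A small sketch of the WER computation as a minimum edit distance (Levenshtein distance over words); the function name and the usage strings are my own.

    def wer(ref, hyp):
        """Word error rate = (substitutions + deletions + insertions) / #reference words."""
        r, h = ref.split(), hyp.split()
        # d[i][j] = minimum edit distance between r[:i] and h[:j]
        d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
        for i in range(len(r) + 1):
            d[i][0] = i                               # i deletions
        for j in range(len(h) + 1):
            d[0][j] = j                               # j insertions
        for i in range(1, len(r) + 1):
            for j in range(1, len(h) + 1):
                sub = 0 if r[i - 1] == h[j - 1] else 1
                d[i][j] = min(d[i - 1][j - 1] + sub,  # match / substitution
                              d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1)        # insertion
        return d[len(r)][len(h)] / len(r)

    print(wer("hello world how are you", "hello word how you"))  # 2 errors / 5 words = 0.4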

5
A simpler problem: Evaluation
  • So far we have:
  • Dictionary: hello → /hh eh l ow/
  • Acoustic Model: p_hh(x), p_eh(x), p_l(x), p_ow(x)
  • Language Model: P(hello world)
  • State sequence: /hh eh l ow w er l d/
  • Given W and x: Alignment needed!

[Figure: alignment of the state sequence /hh eh l ow/ against the acoustic frames]
7
The Viterbi Algorithm
  • Beam search from left to right
  • Resulting alignment is the best match given the state models p(x) and the observations x

[Figure: trellis of local acoustic scores p(x) for the states hh, eh, l, ow over the time frames]
8
The Viterbi Algorithm (contd)
  • Evaluation problem: Dynamic Time Warping
  • Best alignment for given W, x, and state models p(x), obtained by locally adding scores (-log p) for states and transitions (a small sketch follows the figure below)

[Figure: trellis of accumulated scores for the states hh, eh, l, ow over the time frames]
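
A minimal sketch of this alignment recursion for a linear /hh eh l ow/ state sequence, assuming a simple left-to-right topology (stay in a state or advance by one). The local scores are made up and do not reproduce the grid from the slide.

    # local[s][t] stands for the local score -log p_s(x_t) of state s at frame t.
    states = ["hh", "eh", "l", "ow"]
    local = [
        [1.0, 1.2, 1.3, 1.2, 1.5, 1.4, 1.6],  # hh
        [1.3, 1.0, 1.1, 1.0, 1.2, 1.3, 1.4],  # eh
        [1.2, 1.3, 1.0, 1.1, 1.0, 1.2, 1.3],  # l
        [1.4, 1.2, 1.2, 1.1, 1.1, 1.0, 1.0],  # ow
    ]
    S, T = len(states), len(local[0])
    INF = float("inf")

    score = [[INF] * T for _ in range(S)]  # accumulated score per state and frame
    back = [[0] * T for _ in range(S)]     # best predecessor state per cell

    score[0][0] = local[0][0]              # the alignment must start in the first state
    for t in range(1, T):
        for s in range(S):
            # left-to-right topology: stay in s or come from s - 1
            cands = [(score[s][t - 1], s)]
            if s > 0:
                cands.append((score[s - 1][t - 1], s - 1))
            prev_score, prev_state = min(cands)
            score[s][t] = prev_score + local[s][t]
            back[s][t] = prev_state

    # Backtrace from the last state at the last frame to recover the alignment.
    path, s = [], S - 1
    for t in range(T - 1, -1, -1):
        path.append(states[s])
        s = back[s][t]
    path.reverse()
    print(path)  # best-scoring state for each of the T frames
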
9
Pronunciation Prefix Trees (PPT)
  • Tree representation of the search dictionary
  • Very compact → fast!
  • Viterbi Algorithm also works for trees (a small construction sketch follows the example below)

BROADWAY  B R OA D W EY
BROADLY   B R OA D L IE
BUT       B AH T
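
A small sketch of building a pronunciation prefix tree from the three entries above; the node representation and function names are my own.

    class PPTNode:
        def __init__(self):
            self.children = {}  # phone -> PPTNode
            self.word = None    # set at the node where a pronunciation ends

    def build_ppt(lexicon):
        root = PPTNode()
        for word, phones in lexicon.items():
            node = root
            for phone in phones:  # shared prefixes share nodes -> compact tree
                node = node.children.setdefault(phone, PPTNode())
            node.word = word
        return root

    lexicon = {
        "BROADWAY": ["B", "R", "OA", "D", "W", "EY"],
        "BROADLY":  ["B", "R", "OA", "D", "L", "IE"],
        "BUT":      ["B", "AH", "T"],
    }
    root = build_ppt(lexicon)
    print(list(root.children))                # ['B']       -- one shared root arc
    print(list(root.children["B"].children))  # ['R', 'AH'] -- BROAD* vs. BUT split here
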
10
Viterbi Search for PPTs
  • A PPT is traversed in a time-synchronous way
  • Apply the Viterbi Algorithm on:
  • the state level (sub-phonemic units b, m, e)
  • → constrained by the HMM topology
  • the phone level
  • → constrained by the PPT
  • What do we do when we reach the end of a word?

11
Re-Entrant PPTs for continuous speech
  • Isolated word recognition:
  • Search terminates in the leaves of the PPT
  • Decoding of word sequences:
  • Re-enter the PPT and store the Viterbi path using a backpointer table (sketched below)
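
A minimal sketch of such a backpointer table, with made-up field names: each word end appends an entry recording which word ended where and which entry preceded it, so the word sequence can be traced back at the end.

    from dataclasses import dataclass

    @dataclass
    class Backpointer:
        word: str       # word that just ended
        end_frame: int  # frame at which it ended
        prev: int       # index of the predecessor entry (-1 = utterance start)

    backpointers = []  # grows while decoding

    def re_enter(word, end_frame, prev_index):
        """Store the Viterbi path info and return the index new tree instances point to."""
        backpointers.append(Backpointer(word, end_frame, prev_index))
        return len(backpointers) - 1

    def trace(index):
        """Recover the word sequence by following the backpointer chain."""
        words = []
        while index != -1:
            bp = backpointers[index]
            words.append(bp.word)
            index = bp.prev
        return list(reversed(words))

    # toy usage: "hello" ends at frame 42, "world" ends at frame 80
    i1 = re_enter("hello", 42, -1)
    i2 = re_enter("world", 80, i1)
    print(trace(i2))  # ['hello', 'world']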

12
Problem: Branching Factor
  • Imagine a sequence of 3 words with a 10k vocabulary
  • 10k³ = 10¹² = 1000G paths (potentially)
  • Not everything will be expanded, of course
  • Viterbi approximation → path recombination
  • Given: P(Candy | hi I am) = P(Candy | hello I am)

13
Path Recombination
At time t:
  Path 1: w1 ... wN with score s1
  Path 2: v1 ... vM with score s2
where
  s1 = p(x1 ... xt | w1 ... wN) · ∏i P(wi | wi-1 wi-2)
  s2 = p(x1 ... xt | v1 ... vM) · ∏i P(vi | vi-1 vi-2)
In the end, we're only interested in the best path!
14
Path Recombination (contd)
  • To expand the search space into a new root (sketched below):
  • Pick the path with the best score so far (Viterbi approximation)
  • Initialize scores and backpointers for the root node according to the best predecessor word
  • Store the left-context model information with the last phone of the predecessor (context-dependent acoustic models: /s ih t/ → /l ih p/)
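
A sketch of this recombination step with an assumed data layout: among the word-end hypotheses active at time t, only the best-scoring one initializes the new root. Scores are treated as negative log probabilities (smaller is better); all values are made up.

    # Word-end hypotheses active at time t: (score, backpointer_index, last_phone)
    word_ends = [
        (231.7, 5, "t"),  # "... sit"
        (229.4, 8, "p"),  # "... lip"
        (233.1, 2, "t"),  # "... hat"
    ]

    def expand_root(word_ends):
        """Viterbi approximation: initialize the root from the best predecessor only."""
        best_score, best_bp, last_phone = min(word_ends)
        return {
            "score": best_score,         # carried over into the new tree copy
            "backpointer": best_bp,      # remembers where the best predecessor ended
            "left_context": last_phone,  # kept for cross-word context-dependent models
        }

    print(expand_root(word_ends))  # initialized from the hypothesis with score 229.4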

15
Problem with Re-Entry
  • For a correct use of the Viterbi algorithm, the
    choice of the best path must include the score
    for the transition from the predecessor word to
    the successor word
  • Since the word identity is not yet known at the root level, the best predecessor cannot be chosen at this point

16
Consequences
  • Wrong predecessor words
  • → language model information only at leaf level
  • Wrong word boundaries
  • The starting point for the successor word is
    determined without any language model information
  • Incomplete linguistic information
  • Open pruning thresholds are needed for beam
    search

17
Three-Pass search strategy
  • First pass: search on a tree-organized lexicon (PPT)
  • Aggressive path recombination at word ends
  • Use linguistic information only approximately
  • Generate a list of starting words for each frame
  • Second pass: search on a flat-organized lexicon
  • Fix the word segmentation from the first pass
  • Full use of the language model (often needs a third pass)

18
Three-Pass Decoder Results
  • Q4g system with cache for acoustic scores
  • 4000 acoustic models trained on BNESST
  • 40k Vocabulary
  • Test on readBN data

Search Pass          Error Rate (%)   Real-time factor
Tree Pass            22.0             9.6
Flat Pass            18.8             0.9
Lattice Rescoring    15.0             0.2
19
One-Pass Decoder Motivation
  • The efficient use of all available knowledge
    sources as early as possible should result in
    faster decoding
  • Use the same engine to decode along
  • Statistical n-gram language models with arbitrary
    n
  • Context-free grammars (CFG)
  • Word-graphs

20
Linguistic states
  • Linguistic state, examples:
  • (n-1)-word history for a statistical n-gram LM
  • Grammar state for CFGs
  • (lattice node, word history) for word-graphs
  • To fully use the linguistic knowledge source, the
    linguistic state has to be kept during decoding
  • Path recombination has to be delayed until the
    word identity is known

21
Linguistic context assignment
  • Key idea: establish a linguistic polymorphism for each node of the PPT
  • Maintain a list of linguistically morphed instances in each node
  • Each instance stores its own backpointer and scores for each state of the underlying HMM, with respect to the linguistic state of that instance (see the sketch below)
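
A sketch of how such a node might be represented; the class and field names are my own, and the b/m/e state layout and example contexts are made up.

    from dataclasses import dataclass, field

    @dataclass
    class Instance:
        lct: tuple        # linguistic context, e.g. the (n-1)-word history
        scores: list      # one accumulated score per HMM state (b, m, e)
        backpointer: int  # entry in the backpointer table for this context

    @dataclass
    class Node:
        phone: str
        children: dict = field(default_factory=dict)   # phone -> Node
        instances: list = field(default_factory=list)  # linguistically morphed instances

    node = Node("OA")
    node.instances.append(Instance(lct=("bullets", "over"), scores=[12.4, 13.1, 14.0], backpointer=17))
    node.instances.append(Instance(lct=("drive", "down"), scores=[11.9, 12.7, 13.8], backpointer=23))
    print(len(node.instances))  # two linguistic contexts alive in the same tree node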

22
PPT with linguistically morphed instances
[Figure: PPT for BROADWAY (B R OA D W EY), BROADLY (B R OA D L IE) and BUT (B AH T), with linguistically morphed instances attached to its nodes]

Typically a 3-gram LM, i.e. P(W) = ∏i P(wi | Wi) with word histories Wi, e.g. P(broadway | bullets over)
23
Language Model Lookahead
  • Since the linguistic state is known, the
    complete LM information P(W) can be applied to
    the instances, given the possible successor words
    for that node of the PPT
  • Let:
  • lct = linguistic context/state of instance i of node n
  • path(w) = path of word w in the PPT
  • π(n, lct) = min over {w : node n ∈ path(w)} of P(w | lct)
  • score(i) = p(x1 ... xt | w1 ... wN) · P(wN | wN-1 ...) · π(n, lct)  (a small sketch follows)
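
A small sketch of the lookahead score π(n, lct): the best LM score over all words whose PPT path passes through node n, computed here by brute force over a toy word list. Scores are taken as -log probabilities (so "best" = minimum); the probabilities and the node naming are made up.

    import math

    # Toy trigram probabilities P(w | lct) for one context lct = ("bullets", "over")
    lm_prob = {"BROADWAY": 1e-4, "BROADLY": 1e-6, "BUT": 1e-2}

    # Words whose PPT path passes through each node (node names are illustrative)
    words_below = {
        "B": ["BROADWAY", "BROADLY", "BUT"],
        "B-R-OA-D": ["BROADWAY", "BROADLY"],
    }

    def lookahead(node):
        """pi(n, lct) = best (minimum -log) LM score over the words reachable from node n."""
        return min(-math.log(lm_prob[w]) for w in words_below[node])

    print(lookahead("B"))         # dominated by BUT      (-log 1e-2)
    print(lookahead("B-R-OA-D"))  # dominated by BROADWAY (-log 1e-4)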

24
LM Lookahead (contd)
  • When the word becomes unique, the exact LM score is already incorporated and no explicit word transition needs to be computed
  • The LM scores π will be updated on demand, based on a compressed PPT (smearing of LM scores)
  • Tighter pruning thresholds can be used since the language model information is no longer delayed

25
Early Path Recombination
  • Path recombination can be performed as soon as the word becomes unique, which is usually a few nodes before reaching the leaf. This reduces the number of unique linguistic contexts and instances (see the sketch below)
  • This is particularly effective for cross-word models due to the fan-out in the right-context models
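
A sketch of how the "word becomes unique" point can be found: count, for every PPT node (i.e. pronunciation prefix), how many words pass through it; early recombination is possible once that count drops to one. The toy lexicon is the one from the earlier slide, the function name is my own.

    from collections import Counter

    lexicon = {
        "BROADWAY": ("B", "R", "OA", "D", "W", "EY"),
        "BROADLY":  ("B", "R", "OA", "D", "L", "IE"),
        "BUT":      ("B", "AH", "T"),
    }

    # How many words share each prefix (= pass through each PPT node)?
    prefix_count = Counter()
    for phones in lexicon.values():
        for i in range(1, len(phones) + 1):
            prefix_count[phones[:i]] += 1

    def unique_at(word):
        """First position in the pronunciation where the word identity is unique."""
        phones = lexicon[word]
        for i in range(1, len(phones) + 1):
            if prefix_count[phones[:i]] == 1:
                return i
        return len(phones)

    print(unique_at("BROADWAY"))  # 5 -> unique at phone 'W', one node before the leaf 'EY'
    print(unique_at("BUT"))       # 2 -> unique right after 'AH'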

26
One-pass Decoder Summary
  • One-Pass decoder based on
  • One copy of tree with dynamically allocated
    instances
  • Early path recombination
  • Full language model lookahead
  • Linguistic knowledge sources
  • Statistical n-grams with n > 3 possible
  • Context-free grammars

27
Results
             Real-time factor          Error rate (%)
             3-pass      1-pass        3-pass     1-pass
VM           6.8         4.0           26.9       26.9
readBN       12.2        4.2           14.7       13.9
Meeting      55          38            43.7       43.4
28
Remarks on speed-up
  • Speed-up ranges from a factor of almost 3 for the readBN task to 1.4 for the meeting data
  • Speed-up depends strongly on matched domain
    conditions
  • Decoder profits from sharp language models
  • LM Lookahead less effective for weak language
    models due to unmatched conditions

29
Memory usage (Q4g)

Module                 3-pass    1-pass
Acoustic Models        44 MB     44 MB
Language Model         87 MB     82 MB
Overhead               16 MB     16 MB
Decoder (permanent)    120 MB    18 MB
Decoder (dynamic)      100 MB    20 MB
Total                  367 MB    180 MB
30
Summary
  • Decoding is time- and memory-consuming
  • Search errors occur when beams are too tight (→ trade-off) or the Viterbi assumption is violated
  • State-of-the-art one-pass decoder:
  • Tree structure for efficiency
  • Linguistically morphed instances of nodes and leaves
  • Other approaches exist (stack decoding, a-posteriori decoding, ...)