Coupling between ASR and MT in Speech-to-Speech Translation

Provided by: Arthu61
Learn more at: http://www.cs.cmu.edu

1
Coupling between ASR and MT in Speech-to-Speech
Translation
  • Arthur Chan
  • Prepared for
  • Advanced Machine Translation Seminar

2
This Seminar
  • Introduction (6 slides)
  • Ringger's categorization of coupling between ASR
    and NLU (7 slides)
  • Interfaces in Loose Coupling
  • 1-best and N-best (5 slides)
  • Lattices/Confusion Network/Confidence Estimation
    (12 slides)
  • Results from literature
  • Tight Coupling
  • Ney's theory and 2 methods of implementation (14
    slides)
  • (Sorry, without FST approaches.)
  • Some "as is" ideas on this topic

3
6 papers on Coupling of Speech-to-Speech
Translation
  • H. Ney, "Speech translation: Coupling of
    recognition and translation," in Proc. ICASSP,
    1999.
  • Casacuberta et al., "Architectures for
    speech-to-speech translation using finite-state
    models," in Proc. Workshop on Speech-to-Speech
    Translation, 2002.
  • E. Matusov, S. Kanthak, and H. Ney, "On the
    integration of speech recognition and statistical
    machine translation," in Proc. InterSpeech,
    2005.
  • S. Saleem, S. C. Jou, S. Vogel, and T. Schultz,
    "Using word lattice information for a tighter
    coupling in speech translation systems," in Proc.
    ICSLP, 2004.
  • V. H. Quan et al., "Integrated N-best re-ranking
    for spoken language translation," in Proc.
    EuroSpeech, 2005.
  • N. Bertoldi and M. Federico, "A new decoder for
    spoken language translation based on confusion
    networks," in IEEE ASRU Workshop, 2005.

4
A Conceptual Model of Speech-to-Speech Translation
[Diagram: waveforms -> Speech Recognizer -> decoding result(s) -> Machine Translator -> translation -> Speech Synthesizer -> waveforms]
5
Motivation of Tight Coupling between ASR and MT
  • The 1-best output of ASR could be wrong
  • MT could benefit from a wide range of
    supplementary information provided by ASR
  • N-best list
  • Lattice
  • Sentence-/word-based confidence scores
  • E.g. Word posterior probability
  • Confusion network
  • Or consensus decoding (Mangu 1999)
  • MT quality may depend on WER of ASR (?)

6
Scope of this talk.
[Diagram: waveforms -> Speech Recognizer -> (1-best? N-best? Lattice? Confusion network?) -> Machine Translator -> translation -> Speech Synthesizer -> waveforms]
7
Topics Covered Today
  • The concept of Coupling
  • Tightness of coupling between ASR and
    Technology X. (Ringger 95)
  • Two questions
  • What could ASR provide in loose coupling?
  • Discussion of interfaces between ASR and MT in
    loose coupling
  • What is the status of tight coupling?
  • Ney's formulation

8
Topics not covered
  • Direct Modeling
  • Uses features from both ASR and MT
  • Referred to by some as ASR and MT unification
  • Implication of the MT search algorithms on the
    coupling
  • Generation of speech from text.
  • The presenter doesn't know enough.

9
The Concept of Coupling
10
Classification of Coupling of ASR and Natural
Language Understanding (NLU)
  • Proposed in Ringger 95, Harper 94
  • 3 Dimensions of ASR/NLU
  • Complexity of the search algorithm
  • Simple N-gram?
  • Incrementality of the coupling
  • On-line? Left-to-right?
  • Tightness of the coupling
  • Tight? Loose? Semi-tight?

11
Tightness of Coupling
Tight
Semi-Tight
Loose
12
Notes
  • Semi-tight coupling could appear as
  • Feedback loop between ASR and Technology X for
    the whole utterance of speech
  • Or Feedback loop between ASR and Technology X for
    every frame.
  • The Ringger system
  • A good way to understand how speech-based systems
    are developed

13
Example 1: LM
  • Suppose someone asserts that ASR has to be used
    with 13-grams.
  • In tight coupling
  • A search is devised to find the word sequence
    with the best combined acoustic score and
    13-gram likelihood
  • In loose coupling
  • A simple search is used to generate some outputs
    (N-best list, lattice, etc.)
  • The 13-gram is then used to rescore the outputs.
  • In semi-tight coupling
  • 1. A simple search is used to generate results
  • 2. The 13-gram is applied at word ends only (but
    the exact history is not stored)
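
The loose-coupling case can be sketched in a few lines of Python. This is only a minimal illustration: `rescore_nbest`, the toy LM table, and all scores are made-up stand-ins (not taken from Ringger's system or any real recognizer), and the lookup table simply plays the role of any expensive LM that is applied after the first pass.

```python
import math


def rescore_nbest(nbest, lm_logprob, lm_weight=1.0):
    """Re-rank (words, acoustic_logprob) pairs with an external LM score."""
    rescored = []
    for words, am_logprob in nbest:
        total = am_logprob + lm_weight * lm_logprob(words)
        rescored.append((total, words))
    rescored.sort(key=lambda item: item[0], reverse=True)
    return [words for _, words in rescored]


# Hypothetical stand-in for a high-order LM: sentence-level probabilities.
TOY_LM = {
    ("recognize", "speech"): math.log(0.6),
    ("wreck", "a", "nice", "beach"): math.log(0.1),
}


def toy_lm_logprob(words):
    # Unseen sentences get a small floor probability.
    return TOY_LM.get(tuple(words), math.log(1e-6))


# First-pass output, best acoustic score first (scores are made up).
nbest = [
    (["wreck", "a", "nice", "beach"], -10.0),
    (["recognize", "speech"], -10.5),
]
best = rescore_nbest(nbest, toy_lm_logprob)[0]
print(best)
```

Here the first pass prefers one hypothesis on acoustic score alone, and the second-pass LM flips the ranking. In tight coupling the same LM score would have to be applied inside the search itself.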

14
Example 2: Higher-order AM
  • Segmental models assume that observation
    probabilities are not conditionally independent.
  • Suppose someone asserts that a segmental model is
    better than a plain HMM.
  • Tight coupling: direct search for the best word
    sequence using the segmental model.
  • Loose coupling: use the segmental model to rescore.
  • Semi-tight coupling: hybrid HMM-segmental model
    algorithm?

15
Summary of Coupling between ASR and NLU
16
Implication on ASR/MT coupling
  • The categorization generalizes to many systems
  • Loose coupling
  • Any system which uses 1-best, n-best, lattice, or
    other inputs for 1-way module communication
  • (Bertoldi 2005)
  • CMU System (Saleem 2004)
  • (Matusov 2005)
  • Tight coupling
  • (Ney 1999)
  • (Casacuberta 2002)
  • Semi-tight coupling
  • (Quan 2005)

17
Interfaces in Loose Coupling: 1-best and N-best
18
Perspectives
  • ASR outputs
  • 1-best results
  • N-best results
  • Lattice
  • Consensus network.
  • Confidence scores
  • How does ASR generate these outputs?
  • Why are they generated?
  • What if there are multiple ASRs?
  • (and what if their results are combined?)

19
Origin of the 1-best.
  • Decoding of HMM-based ASR
  • Searching the best path in a huge HMM-state
    lattice.
  • 1-best ASR result
  • The best path one could find from backtracking.
  • State Lattice (Next page)

20
[Figure: state lattice]
21
Note on 1-best
  • Most of the time, 1-best = word sequence
  • Why?
  • In LVCSR, storing the backtracking pointer table
    for the state sequence takes a lot of memory
    (even nowadays)
  • Compare this with the number of frames of scores
    that need to be stored
  • Usually a backtrack pointer stores
  • The previous words before the current word
  • Clever structures dynamically allocate the
    backtracking pointer table.

22
What is N-best list?
  • Trace back not only from the 1st best, but also
    from the 2nd best, 3rd best, etc.
  • Pathways
  • Directly from the search backtrack pointer table
  • Exact N-best algorithm (Chow 90)
  • Word-pair N-best algorithm (Chow 91)
  • A* search using the Viterbi score as heuristic
    (Chow 92)
  • Generate a lattice first, then generate the N-best
    list from the lattice
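
The last two pathways (best-first search guided by a Viterbi score, run over a lattice) can be sketched on a tiny word lattice. This is an illustrative sketch only, not the published algorithms: the lattice format `edges`, the function name `nbest_from_lattice`, and the costs are assumptions. Costs are negative log scores, and node ids are assumed to be numbered in topological order.

```python
import heapq


def nbest_from_lattice(edges, start, end, n):
    """Return up to n (cost, words) paths from start to end, cheapest first.

    edges: {node: [(next_node, word, cost), ...]} describing a DAG.
    """
    nodes = set(edges) | {nxt for arcs in edges.values() for nxt, _, _ in arcs}
    # Backward Viterbi pass: cheapest remaining cost from each node to `end`.
    best_to_end = {node: float("inf") for node in nodes}
    best_to_end[end] = 0.0
    for node in sorted(nodes, reverse=True):
        for nxt, _, cost in edges.get(node, []):
            best_to_end[node] = min(best_to_end[node], cost + best_to_end[nxt])
    # Forward best-first search; priority = cost so far + exact cost-to-go.
    heap = [(best_to_end[start], 0.0, start, ())]
    results = []
    while heap and len(results) < n:
        _, cost_so_far, node, words = heapq.heappop(heap)
        if node == end:
            results.append((cost_so_far, list(words)))
            continue
        for nxt, word, cost in edges.get(node, []):
            g = cost_so_far + cost
            heapq.heappush(heap, (g + best_to_end[nxt], g, nxt, words + (word,)))
    return results


# Toy lattice with made-up costs: two alternatives for each of two words.
edges = {
    0: [(1, "I", 0.1), (1, "eye", 1.2)],
    1: [(2, "see", 0.2), (2, "sea", 0.9)],
}
top3 = [words for _, words in nbest_from_lattice(edges, start=0, end=2, n=3)]
print(top3)
```

Because the heuristic is the exact cost-to-go, complete paths leave the queue in order of total cost, so the search can stop as soon as N paths have been found.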

23
Interfaces in Loose Coupling: Lattice, Consensus
Network and Confidence Estimation
24
What is a Lattice?
  • A compact representation of the state lattice
  • Only word nodes (or links) are involved
  • Difference between N-best and lattice
  • A lattice can be a compact representation of an
    N-best list.

25
(No Transcript)
26
How is a lattice generated?
  • From the decoding backtracking pointer table
  • Only record all the links between word nodes.
  • From an N-best list
  • It becomes a compact representation of the N-best
    list
  • Sometimes spurious links will be introduced
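
A toy illustration of the spurious-link point, assuming the crudest possible merge: equal-length hypotheses are merged position by position. The function names and example sentences are hypothetical. Real lattice construction aligns on time and word identity, so this only shows the effect, not a practical algorithm.

```python
from itertools import product


def merge_nbest(nbest):
    """Merge equal-length hypotheses into word alternatives per position."""
    slots = [[] for _ in nbest[0]]
    for hyp in nbest:
        for position, word in enumerate(hyp):
            if word not in slots[position]:
                slots[position].append(word)
    return slots


def all_paths(slots):
    """Enumerate every word sequence the merged graph accepts."""
    return [list(path) for path in product(*slots)]


nbest = [["I", "see", "it"], ["eye", "sea", "it"]]
slots = merge_nbest(nbest)
# Paths accepted by the merged graph that were never in the N-best list.
spurious = [path for path in all_paths(slots) if path not in nbest]
print(spurious)
```

The merged graph is smaller than the list it came from, but it now also accepts word sequences the recognizer never proposed.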

27
How is a lattice generated when there are phone
contexts at the word end?
  • Very complicated when phonetic context is
    involved
  • Not only the word end needs to be stored but also
    the phone contexts.
  • The lattice has the word identity as well as the
    contexts
  • The lattice can become very large.

28
How is this resolved?
  • Some use only approximate triphones to generate
    the lattice in the first stage (BBN)
  • Some generate the lattice with full CD-phones
    but convert it back to a no-context lattice (RWTH)
  • Some use the lattice with full CD-phone contexts
    (RWTH)

29
What do ASR folks do when the lattice is still too large?
  • Use some criteria to prune the lattice.
  • Example criteria
  • Word posterior probability
  • Application of another LM or AM, then filtering.
  • General confidence score
  • Maximum lattice density
  • (number of words in the lattice / number of words
    in the utterance)
  • Or generate an even more compact representation
    than lattices
  • E.g. consensus network.
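
Two of these criteria, lattice density and word posterior pruning, are simple enough to sketch. The arc format (a `(word, posterior)` pair), the function names, and the numbers are assumptions for illustration. A real pruner would also have to keep the lattice connected, which this sketch ignores.

```python
def lattice_density(arcs, num_spoken_words):
    """Word arcs in the lattice divided by the number of words spoken."""
    return len(arcs) / num_spoken_words


def prune_by_posterior(arcs, threshold):
    """Keep only arcs whose word posterior reaches the threshold."""
    return [arc for arc in arcs if arc[1] >= threshold]


# Toy lattice for a 2-word utterance; posteriors are made up.
arcs = [("I", 0.7), ("eye", 0.2), ("aye", 0.1), ("see", 0.6), ("sea", 0.4)]
pruned = prune_by_posterior(arcs, threshold=0.3)

print(lattice_density(arcs, 2), lattice_density(pruned, 2))
```

Raising the threshold lowers the density, which is the knob a downstream stage such as MT ends up depending on.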

30
Conclusions on lattices
  • Lattice generation itself could be a complicated
    issue
  • Sometimes, what a post-processing stage (e.g. MT)
    gets is a pre-filtered, pre-processed result.

31
Confusion Network and Consensus Hypothesis
  • Confusion Network
  • Or Sausage Network.
  • Or Consensus Network

32
Special Properties (?)
  • More local than lattice
  • One can apply simple criteria to find the best
    results
  • E.g. consensus decoding applies word posterior
    probabilities on the confusion network.
  • More tractable
  • In terms of size
  • Found to be useful in
  • ?
  • ?
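
The "simple criterion" can be sketched directly: in each slot of the confusion network, pick the word with the highest posterior. The data layout (a list of `{word: posterior}` dicts), the `<eps>` symbol for "no word here", and the numbers are assumptions for illustration.

```python
EPS = "<eps>"  # assumed marker for an empty (deleted) slot entry


def consensus_hypothesis(confusion_network):
    """Pick the highest-posterior entry in every slot, skipping empty picks."""
    words = []
    for slot in confusion_network:
        best_word = max(slot, key=slot.get)
        if best_word != EPS:
            words.append(best_word)
    return words


# Toy confusion network with made-up posteriors.
confusion_network = [
    {"I": 0.9, "eye": 0.1},
    {"see": 0.55, "sea": 0.45},
    {EPS: 0.7, "it": 0.3},
]
hypothesis = consensus_hypothesis(confusion_network)
print(hypothesis)
```

Each decision is local to one slot, which is what makes the structure easier to handle than a general lattice.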

33
How to generate consensus network?
  • From the lattice
  • Summary of Mangu's algorithm
  • Intra-word clustering
  • Inter-word clustering

34
Note on Consensus Network
  • Note
  • Time information might not be preserved in
    confusion network
  • The similarity function directly affects the final
    output of the consensus network.

35
Other ways to generate confusion network
  • From the N-best list
  • Using ROVER.
  • A mixture of voting and adding word confidence
    scores

36
Confidence Measure
  • Anything other than likelihood which could tell
    whether the answer is useful
  • E.g.
  • Word posterior probability
  • P(W|A)
  • Usually computed using lattices
  • Language model backoff mode
  • Other posterior probabilities (frame, sentence)
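
How posteriors are usually computed on a lattice can be sketched with a forward-backward pass. This is a minimal sketch under stated assumptions: arcs are `(src, dst, word, log_score)` tuples, node ids are in topological order, and every node lies on some path from `start` to `end`. It returns arc posteriors; a word-level confidence would additionally sum over arcs that carry the same word in overlapping time spans.

```python
import math
from collections import defaultdict


def logsumexp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def arc_posteriors(arcs, start, end):
    """Posterior of each arc = score mass of paths through it / total mass."""
    incoming, outgoing = defaultdict(list), defaultdict(list)
    for arc in arcs:
        outgoing[arc[0]].append(arc)
        incoming[arc[1]].append(arc)
    nodes = sorted({a[0] for a in arcs} | {a[1] for a in arcs})
    # Forward pass: log mass of all partial paths from start to each node.
    alpha = {start: 0.0}
    for node in nodes:
        if node != start:
            alpha[node] = logsumexp([alpha[s] + sc for s, _, _, sc in incoming[node]])
    # Backward pass: log mass of all partial paths from each node to end.
    beta = {end: 0.0}
    for node in reversed(nodes):
        if node != end:
            beta[node] = logsumexp([sc + beta[d] for _, d, _, sc in outgoing[node]])
    total = alpha[end]
    return [(w, math.exp(alpha[s] + sc + beta[d] - total)) for s, d, w, sc in arcs]


# Toy lattice with made-up scores.
arcs = [
    (0, 1, "I", math.log(0.8)), (0, 1, "eye", math.log(0.2)),
    (1, 2, "see", math.log(0.6)), (1, 2, "sea", math.log(0.4)),
]
posteriors = dict(arc_posteriors(arcs, start=0, end=2))
print(posteriors)
```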

37
Interfaces in Loose Coupling: Results from the
Literature
38
General word
  • Coupling in SST is still pretty new
  • Papers are chosen according to whether some
    outputs have been used
  • Other techniques such as direct modeling might be
    mixed into the papers.

39
N-best list (Quan 2005)
  • Using N-best list for reranking
  • Interpolation weights of AM and TM are then
    optimized.
  • Summary
  • Reranking gives improvements.
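
The idea of optimizing interpolation weights can be sketched as a grid search on a small development set. This is not the setup of Quan 2005: the two-feature score, the exact-match criterion (a stand-in for a real MT metric), the function names, and all numbers are assumptions for illustration.

```python
def rerank(candidates, am_weight, tm_weight=1.0):
    """Return the candidate with the best weighted sum of log scores."""
    return max(
        candidates,
        key=lambda c: am_weight * c["am_logprob"] + tm_weight * c["tm_logprob"],
    )


def tune_am_weight(dev_set, grid):
    """Pick the acoustic weight from `grid` that gets most references right."""
    def num_correct(am_weight):
        return sum(
            rerank(candidates, am_weight)["translation"] == reference
            for candidates, reference in dev_set
        )
    return max(grid, key=num_correct)


# Toy dev set: (translated N-best candidates, reference); scores are made up.
dev_set = [
    ([{"translation": "I see", "am_logprob": -1.0, "tm_logprob": -5.0},
      {"translation": "eye sea", "am_logprob": -3.0, "tm_logprob": -4.0}],
     "I see"),
    ([{"translation": "thank you", "am_logprob": -2.0, "tm_logprob": -1.0},
      {"translation": "tank you", "am_logprob": -1.0, "tm_logprob": -3.0}],
     "thank you"),
]
best_weight = tune_am_weight(dev_set, grid=[0.0, 0.25, 1.0, 4.0])
print(best_weight)
```

In this toy set, ignoring the acoustic score loses the first utterance and trusting it too much loses the second, so a middle weight wins.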

40
Lattices: CMU results (Saleem 2004)
  • Summary of results
  • Lattice word error rate improves as lattice
    density increases
  • Lattice density and the weight on acoustic scores
    turn out to be important parameters to tune
  • Values that are too large or too small could hurt.

41
LWER against Lattice Density
42
Modified Bleu scores against lattice density
43
Optimal density and score weight based on
Utterance Length.
44
Consensus Network
  • Bertoldi 2005 is probably the only work on a
    confusion-network-based method
  • Summary of results
  • When direct modeling is applied
  • The consensus network doesn't beat the N-best method.
  • The authors argue for the speed and simplicity of
    the algorithm

45
Confidence: Does it help?
  • According to Zhang 2006, yes.
  • Confidence measure (CM) filtering is used to
    filter out unnecessary results in the N-best list
  • Note: the approach used is quite different.

46
Conclusion on Loose Coupling
  • SR could give a rich set of outputs.
  • It is still unknown what type of output should
    be used in the pipeline.
  • Currently, comprehensive experimental studies on
    which method is best seem to be lacking.
  • The usage of confusion networks and confidence
    estimation seems to be under-explored.

47
Tight Coupling: Theory and Practice
48
Theory (Ney 1999)
Bayes rule
Introduce f as a hidden variable
Bayes rule
Assume x doesn't depend on the target language
Sum to max
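
The equations on this slide did not survive extraction. Going by the step labels above and the three factors named on the next slide, the derivation presumably runs as follows, with x the acoustic observations, f the source sentence, and e the target sentence. This is a reconstruction, not a copy of the original slide.

```latex
\begin{align*}
\hat{e} &= \arg\max_{e} \Pr(e \mid x) \\
        &= \arg\max_{e} \Pr(e)\,\Pr(x \mid e)
           && \text{(Bayes rule)} \\
        &= \arg\max_{e} \Pr(e) \sum_{f} \Pr(f, x \mid e)
           && \text{(introduce $f$ as a hidden variable)} \\
        &= \arg\max_{e} \Pr(e) \sum_{f} \Pr(f \mid e)\,\Pr(x \mid f, e)
           && \text{(Bayes rule, applied as the chain rule)} \\
        &= \arg\max_{e} \Pr(e) \sum_{f} \Pr(f \mid e)\,\Pr(x \mid f)
           && \text{($x$ does not depend on $e$ given $f$)} \\
        &\approx \arg\max_{e} \Pr(e)\,\max_{f} \Pr(f \mid e)\,\Pr(x \mid f)
           && \text{(sum to max)}
\end{align*}
```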
49
Layman point of view
  • Three factors
  • Pr(e): target language model
  • Pr(f|e): translation model
  • Pr(x|f): acoustic model
  • Note: the assumption has been made that only the
    best-matching f for e is used.

50
Comparison with SR
  • In SR
  • Pr(f): source language model
  • In tight coupling
  • Pr(f|e), Pr(e): translation model and target
    language model

51
Algorithmic Point of View
  • Brute-force method: instead of incorporating the
    LM into the standard Viterbi algorithm
  • Incorporate P(e) and P(f|e)
  • => Very complicated

52
Assumptions in Modeling
  • Alignment Models (HMM)
  • Acoustic Modeling
  • The speech recognizer will produce a word graph.
  • Each link with a word hypothesis covers a portion
    of the acoustic scores. (The notation is confusing
    in the paper.)

53
Lexicon Modeling
  • A further assumption beyond the standard IBM models
  • The target word is assumed to depend on the
    previous word
  • So, in fact, a source LM is actually there.

54
First Implementation: Local Average Assumption
  • Local average assumption
  • P(x|e) is used to capture the local
    characteristics of the acoustics.

55
Justification of Using the Local Average Assumption
  • Rephrased from the author (p. 3, para. 2)
  • Lexicon modeling and language modeling will cause
    f_{j-1}, f_j, f_{j+1} to appear in the math.
  • In other words
  • It is too complicated to carry out
  • Computational advantage: the local score can be
    obtained from the word graph alone, before
    translation
  • => The full translation strategy can still be
    carried out

56
Computation of P(x|e)
  • Make use of best source sequence
  • Also refer to Wessel 98,
  • A commonly used word posterior probability
    algorithm for lattice
  • A forward-backward like procedure is used

57
Second Method: Monotone Alignment Assumption -
Network
58
Monotone Alignment Assumption: Formula for Text
Input
  • A closed-form solution exists from DP: O(J E^2)
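
The formula on this slide is missing, but the shape of such a DP can be sketched. The following is a simplified Viterbi-style search, not Ney's exact recursion: each source word either stays aligned to the current target word or opens a new target word scored by a bigram LM, and there is no alignment or fertility model. All names and probabilities are hypothetical. The two nested loops over the target vocabulary per source position give a cost linear in the source length J and quadratic in the target vocabulary size E.

```python
import math


def monotone_translate(source, vocab, lex_logprob, lm_logprob):
    """Best target sentence under a monotone, word-by-word alignment."""
    # best[e]: best log score of a partial hypothesis ending in target word e.
    best = {e: lm_logprob("<s>", e) + lex_logprob(source[0], e) for e in vocab}
    history = []  # one backpointer table per source position after the first
    for f in source[1:]:
        new_best, back = {}, {}
        for e in vocab:
            # Option 1: f stays aligned to the current target word e.
            score, pointer = best[e], (e, False)
            # Option 2: f opens a new target word e after predecessor e_prev.
            for e_prev in vocab:
                candidate = best[e_prev] + lm_logprob(e_prev, e)
                if candidate > score:
                    score, pointer = candidate, (e_prev, True)
            new_best[e] = score + lex_logprob(f, e)
            back[e] = pointer
        best = new_best
        history.append(back)
    # Trace back, emitting a target word each time a new one was opened.
    e = max(best, key=best.get)
    target = []
    for back in reversed(history):
        e_prev, opened_new = back[e]
        if opened_new:
            target.append(e)
        e = e_prev
    target.append(e)  # the first target word
    return list(reversed(target))


# Toy models with made-up probabilities.
LEX = {("ich", "I"): 0.9, ("sehe", "see"): 0.9}
LM = {("<s>", "I"): 0.8, ("<s>", "see"): 0.2, ("I", "see"): 0.7,
      ("I", "I"): 0.3, ("see", "I"): 0.5, ("see", "see"): 0.5}


def lex_logprob(f, e):
    return math.log(LEX.get((f, e), 0.05))


def lm_logprob(e_prev, e):
    return math.log(LM.get((e_prev, e), 1e-3))


translation = monotone_translate(["ich", "sehe"], ["I", "see"], lex_logprob, lm_logprob)
print(translation)
```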

59
Monotone Alignment Assumption: Formula for
Speech Input
  • DP
  • O(J E^2 F^2)

60
How to make the Monotone Assumption work?
  • Words need to be reordered
  • As part of the search strategy.
  • Is the acoustic model assumption used?
  • I.e., are we talking about a word lattice or still
    a state lattice?
  • Don't know; it seems like we are actually talking
    about a word lattice.
  • Supported by Matusov 2005

61
Experimental Results in Matusov, Kanthak and Ney
2005
  • Summary of the results
  • Translation quality is only improved by tight
    coupling when the lattice density is not high.
  • Same as in Saleem 2004, incorporation of acoustic
    scores helps.

62
Conclusion: Possible Issues of Tight Coupling
  • Possibilities
  • In SR, the source n-gram LM is already very close
    to the best configuration.
  • The complexity of the algorithm is too high;
    approximation is still necessary to make it work.
  • When the tight-coupling criterion is used, it is
    possible that the LM and the TM need to be
    jointly estimated.
  • The current approaches still haven't really
    implemented tight coupling
  • There might be bugs in the programs.

63
Conclusion
  • Two major issues in the coupling of SST are discussed
  • In loose coupling
  • Consensus networks and confidence scoring are
    still not fully utilized
  • In tight coupling
  • The approach seems to be haunted by the very high
    complexity of search algorithm construction

64
Discussion
65
The End. Thanks.
66
Literature
  • R. Zhang and G. Kikui, "Integration of speech
    recognition and machine translation: Speech
    recognition word lattice translation," Speech
    Communication, vol. 48, issues 3-4, 2006.
  • H. Ney, "Speech translation: Coupling of
    recognition and translation," in Proc. ICASSP,
    1999.
  • E. Matusov, S. Kanthak, and H. Ney, "On the
    integration of speech recognition and statistical
    machine translation," in Proc. InterSpeech, 2005.
  • S. Saleem, S. C. Jou, S. Vogel, and T. Schultz,
    "Using word lattice information for a tighter
    coupling in speech translation systems," in Proc.
    ICSLP, 2004.
  • V. H. Quan et al., "Integrated N-best re-ranking
    for spoken language translation," in Proc.
    EuroSpeech, 2005.
  • N. Bertoldi and M. Federico, "A new decoder for
    spoken language translation based on confusion
    networks," in IEEE ASRU Workshop, 2005.
  • L. Mangu, E. Brill, and A. Stolcke, "Finding
    consensus in speech recognition: word error
    minimization and other applications of confusion
    networks," Computer Speech and Language, 14(4),
    373-400, 2000.
  • E. Ringger, "A Robust Loose Coupling for Speech
    Recognition and Natural Language Understanding,"
    1995.