How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik, Centre for Language and Speech Technology (CLST), Radboud University Nijmegen, the Netherlands


1
How to handle pronunciation variation in ASR:
By storing episodes in memory?
Helmer Strik
Centre for Language and Speech Technology (CLST)
Radboud University Nijmegen, the Netherlands
2
Overview
  • Contents:
  • Variation, the invariance problem
  • ASR: Automatic Speech Recognition
  • HSR: Human Speech Recognition
  • ESR: Episodic Speech Recognition

3
Invariance problem
  • One of the main issues in speech recognition is
    the large amount of variability present in
    speech.
  • SRIV2006: ITRW on Speech Recognition and
    Intrinsic Variation
  • Invariance problem:
  • variation in stimuli, invariant percept
  • also visual, tactile, etc.
  • Studied in many fields, no consensus
  • 2 paradigms:
  • invariant
  • episodic

4
Invariance problem (1)
  • Example 1: Speech
  • Dutch word "natuurlijk" (naturally, of course)
  • /natyrlək/
  • /natylək/
  • /tyk/
  • Multiword expressions (MWEs):
  • a lot of reduction
  • many variants

5
Invariance problem (2)
  • Example 2: Writing (vision)
  • "natuurlijk" shown in many different fonts and
    handwriting styles
  • Familiar styles (fonts, handwriting)
  • are recognized better

6
ASR - Paradigm
  • Invariant, symbolic approach:
  • utterance
  • sequence of words
  • sequence of phonemes
  • sequence of states
  • parametric description: pdfs / ANN

7
ASR - Paradigm
  • Same paradigm (HMMs) since the 70s
  • Assumptions incorrect, questionable
  • Insufficient performance
  • ASR vs. HSR: error rates 8-80x higher
  • Slow progress (ceiling effect?)
  • Simply using more and more data is not sufficient
    (Moore, 2001)
  • → A new paradigm is needed!
  • However, only a few attempts

8
HSR - Indexical information
  • Speech - 2 types of information
  • Verbal info.: what, contents
  • Indexical info.: how, form
  • e.g. environmental and
  • speaker-specific aspects
  • (pitch, loudness, speech rate, voice quality)

9
HSR - Indexical information
  • Traditional ASR model:
  • Verbal information is used
  • Indexical information:
  • treated as noise, disturbances
  • Preprocessing:
  • strip it off
  • Normalization (VTLN, MLLR, etc.)
  • And in HSR?

10
HSR - Indexical information
  • HSR: Strip off indexical information?
  • No!
  • Familiar voices and accents
  • recognize and mimic
  • Indexical information
  • is perceived and encoded

11
HSR - Indexical information
  • Verbal & indexical information:
  • processed independently?
  • No!
  • Familiar voices are recognized better
  • Facilitation, also with similar speech

12
HSR - Indexical and detailed information
  • Experimental results: indexical information and
  • fine phonetic detail (Hawkins et al.)
  • influence perception
  • Difficult to explain / integrate in the
    traditional, invariant model
  • New models: episodic models,
  • for auditory and visual perception

13
ESR - Basic idea
  • A new paradigm for ASR is needed
  • An episodic model !!??
  • Training
  • Store trajectories - (representatives of)
    episodes
  • Recognition
  • Calculate distance between X and sequences of
    stored trajectories (DTW)
  • Take the one with minimum distance the
    recognized word
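The training and recognition steps above can be sketched in a few lines of code. This is a minimal illustration only, not the actual system: `frame_dist`, `dtw`, `recognize`, and the toy lexicon are assumed names, and real episodes would be sequences of acoustic feature vectors (e.g. MFCC frames).

```python
def frame_dist(a, b):
    # Euclidean distance between two feature frames (lists of floats)
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def dtw(X, Y):
    # Dynamic Time Warping distance between two trajectories X and Y
    n, m = len(X), len(Y)
    D = [[float("inf")] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = frame_dist(X[i - 1], Y[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def recognize(X, lexicon):
    # lexicon: {word: [stored trajectories (episodes)]}
    # Training = storing episodes; recognition = minimum-distance word.
    return min(
        (dtw(X, ep), word) for word, eps in lexicon.items() for ep in eps
    )[1]
```

A toy lexicon with one stored episode per word already shows the idea: the input is assigned to the word whose stored trajectory it can be warped onto most cheaply.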

14
ESR: Invariant vs. episodic

  phone-based HMM               ESR
  ---------------------------------------------------------------
  Unit:
    Phone                       Syllable, word, ...
  Representation:
    States (pdfs or ANN)        Trajectories
  Compare:
    Trajectory (X) vs. states   Trajectory (X) vs. trajectories

  Parsimonious representation   Extensive representation
  Complex mapping               Simple mapping
  Variation is noise            Variation contains info.
  Normalization                 Use variation

15
Representation
  • pdfs (Gaussians): much detail and dynamic
    information is lost
  • Trajectories: details are preserved
  • Example: phone /aj/ from "nine"; begin of X,
    3 parts: aj(, aj, aj)
16
Unit: phone(me)
  • Switchboard (Greenberg et al.)
  • deletion: 25% of the phones
  • substitution: 30% of the phones
  • together: 55%!!
  • Difficult for a model based on sequences of
    phones.
  • Syllables: less than 1% deleted
  • Phonetic transcriptions and their evaluation:
  • Large differences between humans
  • What is the golden reference?
  • Speech = a sequence of symbols?

17
Unit: Multiword expressions (MWEs)
  • MWEs (see poster)
  • A lot of reduction:
  • many phonemes deleted or substituted
  • Many variants (= sequences of phonemes):
  • more than 90 for the 2 MWEs studied
  • Difficult to handle in ASR systems with current
    methods for pronunciation variation modeling.
  • Reduction, e.g. for a MWE: 4 words with 7
    syllables reduced to 1 entity with 2 syllables
  • What should be stored?
  • Units of various lengths?

18
An episodic approach for ASR
  • Advantages:
  • More information during search:
  • dynamic, indexical, fine phonetic detail
  • Continuity constraints can be used
  • (reduces the trajectory folding problem)
  • Model is simpler
  • Disadvantage:
  • More information during search → complexity
  • Brain: a lot of storage and CPU
  • Computers: more and more powerful

19
An episodic approach for ASR
  • Strik (2003): ITC-irst, Trento, Italy; ICPhS,
    Barcelona
  • De Wachter et al. (2003): Interspeech-2003
  • Axelrod & Maison (2004): ICASSP-2004
  • Maier & Moore (2005): Interspeech-2005
  • Aradilla, Vepa & Bourlard (2005): Interspeech-2005
  • Matton, De Wachter, et al. (2005): SPECOM-2005
  • Promising results
  • The computing power and memory that are needed to
    investigate the episodic approach to speech
    recognition are (becoming) available

20
The HSR-ASR gap
  • HSR & ASR: 2 different communities
  • Different people, departments, journals,
    terminology, goals, methodologies
  • Goals, evaluation:
  • HSR: simulate experimental findings
  • ASR: reduce WER

21
The HSR-ASR gap
  • Marr (1982): 3 levels of modeling
  • Computational
  • Algorithmic
  • Implementational
  • HSR: (larger) differences at the higher levels
  • ASR: implementations, end-to-end models using
    real speech signals as input
  • Thousands of experiments; the WER has been gradually reduced
  • However, essentially the same model
  • New model: performance (WER), funding, etc.

22
The HSR-ASR gap - bridge
  • Use the same evaluation metric for HSR & ASR systems:
    reaction times (Cutler & Robinson, 1992)
  • Use knowledge or components from the other field
    (Scharenborg et al., 2003)
  • Use models that are suitable for both HSR & ASR
    research
  • Evaluation from both the HSR & ASR points of view
  • S2S: Sound to Sense (Sarah Hawkins)
  • Marie Curie Research Training Network (MC-RTN)
  • Recently approved by the EU

23
Episodic speech recognition
THE END
25
ESRASA model
26
ESRASA model
  • ESRASA =
  • Episodic Speech Recognition And Structure
    Acquisition
  • The ESRASA model is inspired by several previous
    models, especially:
  • the model described in Johnson (1997),
  • WRAPSA (Jusczyk, 1993), and
  • CGM (Nosofsky, 1986)
  • The ESRASA model is a feedforward neural network
    with two sets of weights: atTention weights Tn
    and assoCiation weights Cew. Besides these two
    sets of weights, words, episodes (for speech
    units), and their base activation levels (Bw and
    Be, respectively) will be stored in memory.

27
ESRRecognition
28
ESR: Preselection
  • Why preselection?
  • Reduce CPU & memory use
  • Increase performance
  • Also used in DTW-based pattern recognition
    applications
  • Used in many HSR models

29
ESR: Competition
  • Recognize unknown word X:
  • Calculate the distance between X and sequences of
    stored episodes (DTW)
  • Take the one with minimum distance: the
    recognized word
  • Use continuity constraints (as in TTS)

30
ESR: DTW (Dynamic Time Warping)
31
ESR Research: Preselection?
  • Best method? Compare:
  • kNN: k nearest neighbour search
  • Lower-bound distance: Ddtw ≥ Dlb ≥ d
  • Build an index for the lexicon
  • Is preselection needed?
  • Compare with & without preselection
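The lower-bound idea can be sketched in a few lines. This is an illustrative version with assumed names (`lb_kim`, `nearest_with_pruning`): since the cheap bound never exceeds the true DTW distance, any candidate whose bound already exceeds the best full-DTW distance found so far can be skipped without computing the full DTW.

```python
def frame_dist(a, b):
    # Euclidean distance between two feature frames
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def dtw(X, Y):
    # Full Dynamic Time Warping distance (the expensive computation)
    n, m = len(X), len(Y)
    D = [[float("inf")] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = frame_dist(X[i - 1], Y[j - 1])
            D[i][j] = c + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def lb_kim(X, Y):
    # Every warping path aligns first frame with first and last with
    # last, so this cheap quantity never exceeds the DTW distance.
    return max(frame_dist(X[0], Y[0]), frame_dist(X[-1], Y[-1]))

def nearest_with_pruning(X, episodes):
    # Returns the index of the nearest episode and how many full DTW
    # evaluations were actually needed.
    best_d, best_i, full_evals = float("inf"), None, 0
    for i, ep in enumerate(episodes):
        if lb_kim(X, ep) >= best_d:
            continue  # pruned: the lower bound already rules it out
        full_evals += 1
        d = dtw(X, ep)
        if d < best_d:
            best_d, best_i = d, i
    return best_i, full_evals
```

With a good episode ordering (or a lexicon index, as the slide suggests), most candidates are eliminated by the bound alone.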

32
ESR Research: Units for preselection?
  • Compare:
  • Syllable
  • Word
  • Begin (window of fixed length)

33
ESR Research: Units for competition?
  • Compare:
  • Syllables
  • Words
  • In combination with multisyllables?
  • Multisyllables (reduction, resyllabification):
  • "Ik weet het niet" ("I don't know") → "kweeni"
  • "Op een gegeven moment" ("at a certain moment") → "pgeefment"
  • "Zeven-en" ("seven-and") → "ze-fnen"

34
ESR Research: Exemplars?
  • How to select exemplars?
  • DTW distances + hierarchical clustering
  • VQ / LVQ / K-means
  • Trade-off: normalization vs. (size of the) lexicon
  • Compare normalization techniques:
  • TDNR, MVN, HN
  • VTLN
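Of the normalization techniques listed, mean and variance normalization (MVN) is simple enough to sketch. This is an illustrative version with an assumed function name `mvn`, shifting each feature dimension of a trajectory to zero mean and scaling it to unit variance:

```python
def mvn(traj):
    # traj: list of frames, each frame a list of feature values.
    # Normalizes each dimension to zero mean and unit variance.
    dims, n = len(traj[0]), len(traj)
    means = [sum(f[d] for f in traj) / n for d in range(dims)]
    stds = [
        # 'or 1.0' guards against a constant (zero-variance) dimension
        (sum((f[d] - means[d]) ** 2 for f in traj) / n) ** 0.5 or 1.0
        for d in range(dims)
    ]
    return [
        [(f[d] - means[d]) / stds[d] for d in range(dims)] for f in traj
    ]
```

Such per-trajectory normalization reduces speaker and channel differences before exemplars are compared or clustered.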

35
ESR Research: Features?
  • Compare:
  • Spectral features: MFCC, PLP, LPC
  • Articulatory features (ANN)
  • Combine spectral & articulatory features
  • Different features for preselection & competition?

36
ESR Research: Distance metrics?
  • Compare (frame-based metrics):
  • Euclidean
  • Mahalanobis
  • Itakura (for LPC)
  • Perceptually based?
  • Distance metric for trajectories?
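The first two frame-based metrics can be sketched as follows. A diagonal-covariance Mahalanobis variant is shown for simplicity; that choice, like the function names, is an assumption, since the slides do not specify the covariance form:

```python
def euclidean(a, b):
    # Plain Euclidean distance between two feature frames
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def mahalanobis_diag(a, b, var):
    # Diagonal-covariance Mahalanobis distance: each dimension is
    # scaled by its variance, so high-variance dimensions count less.
    return sum((x - y) ** 2 / v for x, y, v in zip(a, b, var)) ** 0.5
```

With unit variances the two metrics coincide; with variances estimated from training data, Mahalanobis down-weights unreliable feature dimensions.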

37
HMM-based ASR: Information sources
  • HMM-based ASR, roughly 3 ways:
  • Class-specific HMMs
  • Multistream
  • 2-pass decoding
  • Disadvantages:
  • Many classes
  • Synchronization & recombination
  • Pass 1: no / less knowledge

38
ESR Research: Information sources
  • ESR: compare 2 trajectories
  • All details are available during the search, e.g.
    context & dynamic information
  • Compare the shape & timing of feature contours:
  • F0 rise: early or final, half or complete
  • Tags can be added to the lexicon:
  • continuity constraints

39
HSR - Foreign English Examples
  • Conversation about Italy.

FEE 1
I was robbed in Milan.
By parachute?
dropped / robbed
40
HSR - Indexical information
  • HSR: Strip off indexical information?
  • No!
  • Familiar voices and accents
  • recognize and mimic (FEE 2)
  • Indexical information
  • is perceived and encoded

41
HSR - Indexical information
  • Verbal & indexical information:
  • processed independently? No!
  • Familiar voices are recognized better
  • FEE 3
  • Facilitation, also with similar speech
  • FEE 4

42
ASR - Pronunciation variation
  • SRIV2006:
  • ITRW on Speech Recognition and Intrinsic
    Variation
  • Pronunciation variation modeling for ASR:
  • Improvements, but generally small
  • Is the current ASR paradigm suitable?
  • Phonetic transcriptions and their evaluation:
  • Large differences between humans
  • What is the golden reference?
  • Speech = a sequence of symbols?