Automatic Detection of Voice Onset Time Contrasts For Use in Pronunciation Assessment - PowerPoint PPT Presentation

Abe Kazemzadeh¹, Joseph Tepperman¹, Jorge Silva¹,
Hong You², Sungbok Lee¹, Abeer Alwan², and
Shrikanth Narayanan¹
¹University of Southern California and
²University of California, Los Angeles
Project Description
  • Automatically determine whether a voiceless stop
    consonant is pronounced with native or accented
    pronunciation, based on voice onset time (VOT)
    characteristics.
  • Use data from the Tball corpus of ESL children
    doing oral reading tasks.
  • Evaluate different methods of accomplishing this:
    • State duration measurements
    • Explicit modelling of aspiration
    • Phone probability discrimination

Motivation for Studying VOT
  • This study was motivated by a desire to determine
    whether a phone was pronounced with a non-standard
    pronunciation.
  • Other reasons to study VOT:
    • It is an important contrastive feature.
    • It gives information about stress.
    • It gives information about word segmentation.
    • It may give information about emphasis.

Results
Error rates (%) by method and stop consonant. For the
aspiration model and probability comparison, errors are
given separately for the short-VOT / long-VOT classes.

  Method                            /p/      /t/      /k/
  Baseline                          55       23       29
  Baseline, 3rd HMM state duration  19       20       48
  Aspiration model                  5 / 36   11 / 38  57 / 17
  Probability comparison            36 / 4   0 / 5    0 / 6

  • The probability-comparison models were trained on
    the test data, so they may be over-trained.

Methodology
  • Baseline: use duration measurements from a forced
    alignment.
  • Aspiration model: insert an /h/ symbol in the
    transcriptions with standard pronunciation, train
    accordingly, and decode the test files to see
    whether the /h/ phone is recognized.
  • Probability comparison: cut out the phones of
    interest from the audio files, train separate
    models and a combined model, and evaluate the
    likelihood of the separate models with respect to
    the combined model.
  • The data was transcribed by ear, with special
    symbols for non-standard pronunciations.
  • Standard 3-state HMMs were used.
  • The evaluation metric was the error rate for each
    of the two classes, evaluated separately.
  • When thresholds were used, they were set at the
    point of equal error rate for both classes.
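The duration baseline reduces each token to one number and thresholds it at the point where both classes have equal error. A minimal sketch of that thresholding, assuming each token is summarized by a single duration score (e.g., the duration in ms of a stop's HMM state from a forced alignment) and that longer durations indicate the standard, long-VOT pronunciation. The function names and the toy durations are hypothetical. With few tokens an exact equal error rate may be unreachable, so the sketch picks the candidate threshold with the smallest gap between the two class error rates.

```python
from typing import List, Sequence, Tuple


def class_error_rates(short_vot: Sequence[float],
                      long_vot: Sequence[float],
                      threshold: float) -> Tuple[float, float]:
    """Per-class error rates for the rule: score >= threshold -> long VOT."""
    # Short-VOT (accented) tokens are errors if they reach the threshold.
    short_err = sum(1 for d in short_vot if d >= threshold) / len(short_vot)
    # Long-VOT (standard) tokens are errors if they fall below it.
    long_err = sum(1 for d in long_vot if d < threshold) / len(long_vot)
    return short_err, long_err


def equal_error_threshold(short_vot: Sequence[float],
                          long_vot: Sequence[float]) -> Tuple[float, float, float]:
    """Return (threshold, short_err, long_err) with the smallest error gap."""
    # Every observed score is a candidate, plus one candidate above all scores.
    candidates: List[float] = sorted(set(short_vot) | set(long_vot))
    candidates.append(candidates[-1] + 1.0)
    best = None
    for t in candidates:
        short_err, long_err = class_error_rates(short_vot, long_vot, t)
        gap = abs(short_err - long_err)
        # Strict "<" keeps the lowest threshold among ties.
        if best is None or gap < best[0]:
            best = (gap, t, short_err, long_err)
    _, t, short_err, long_err = best
    return t, short_err, long_err


if __name__ == "__main__":
    short_tokens = [10.0, 15.0, 20.0, 35.0]  # hypothetical durations (ms), accented
    long_tokens = [30.0, 45.0, 50.0, 60.0]   # hypothetical durations (ms), standard
    print(equal_error_threshold(short_tokens, long_tokens))  # (35.0, 0.25, 0.25)
```

The same function could threshold any scalar score, such as a log-likelihood difference, as long as larger values point toward the long-VOT class.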

What is VOT?
  • It is the interval between the release of
    closure of an articulator (the transient
    burst) and the start of voicing.
  • It is defined for stop consonants, e.g., /p, b, t,
    d, k, g/.
  • VOT takes values along a continuum:
  • When the start of voicing precedes the release of
    closure for a stop, VOT is negative.
  • When the release of closure and onset of voicing
    are coincident, VOT is zero.
  • When voicing comes after release of closure, VOT
    is positive.
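The sign convention above can be written out directly. A minimal sketch, assuming the release (burst) time and the voicing-onset time have already been located, in seconds, by some other means. The function names and the optional `tolerance_ms` parameter (for treating values close to zero as zero) are hypothetical additions for illustration.

```python
def voice_onset_time_ms(release_s: float, voicing_onset_s: float) -> float:
    """VOT in ms: start of voicing minus release of closure (both in seconds)."""
    return (voicing_onset_s - release_s) * 1000.0


def vot_sign(vot_ms: float, tolerance_ms: float = 0.0) -> str:
    """Classify a VOT value as negative, zero, or positive."""
    if vot_ms < -tolerance_ms:
        return "negative"  # voicing starts before the release of closure
    if vot_ms > tolerance_ms:
        return "positive"  # voicing starts after the release of closure
    return "zero"          # release and voicing onset coincide (within tolerance)


if __name__ == "__main__":
    print(vot_sign(voice_onset_time_ms(0.100, 0.160)))  # positive
    print(vot_sign(voice_onset_time_ms(0.250, 0.170)))  # negative
    print(vot_sign(voice_onset_time_ms(0.250, 0.250)))  # zero
```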

Tball Corpus
  • Los Angeles-area elementary schools.
  • 256 children, mainly native Spanish speakers.
  • Reading words, letters, and numbers, and naming
    pictures and colors.
  • Collected jointly by USC and UCLA.

Discussion
  • Studies have noted that VOT is longest for /k/,
    shorter for /t/, and shortest for /p/ (k > t > p).
  • Roughly, each successive method was more
    difficult than the last.
  • The results improved over the baseline, but the
    last approach (comparing probabilities) may have
    been over-trained.
  • Comparing probabilities may be easier to extend
    to other pronunciation modelling tasks.
  • Increasing the frame rate did not help much.
  • If an initial consonant has a short VOT, this
    does not necessarily imply a non-standard accent;
    one must also know the stress pattern of the word.

Physical Realization of VOT
  • Stop consonants are produced with a closure of
    the vocal tract at a specific point, the place of
    articulation.
  • During the closure, sub-laryngeal pressure builds
    up.
  • When the closure is released, there is a transient
    burst of air, frication due to turbulence at the
    place of articulation, and aspiration noise from
    turbulence at the glottis.
  • Voicing may occur before, during, or after the
    release of closure.

[Figure with labels: tester, techie, child]
Conclusion
  • When classifying stop consonants based on VOT
    characteristics, different approaches work better
    on different stops.
  • Measuring the duration of the stop state works
    reasonably well for /t, k/ because they have longer
    VOT than /p/.
  • Detecting the insertion of an aspiration model
    during decoding works well for /p, t/ but not for
    /k/, which has too many false positives.
  • Comparing phone probabilities worked well except
    for unaspirated /p/.

Linguistic Significance of VOT
  • VOT distinguishes consonants with the same place
    of articulation (/p/ vs. /b/, /t/ vs. /d/, etc.).
  • However, different languages use different VOT
    intervals in these contrasts (e.g., taco, pasta):
    • English voiceless stops: VOT 40-50 ms
    • Spanish voiceless stops: VOT near zero
    • English voiced stops: VOT near zero
    • Spanish voiced stops: negative VOT (voicing
      begins before the release of closure)
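The voiceless-stop ranges listed above suggest why a short VOT on an English /p, t, k/ can sound Spanish-accented. A hypothetical sketch that labels a measured VOT by whichever language's typical voiceless range is nearer, using only the figures stated above; treating "near zero" as exactly 0 ms, and the names `closer_pattern` and `distance_to_range`, are assumptions for illustration, not part of the study's method.

```python
from typing import Tuple

# Ranges from the list above; "near zero" treated as exactly 0 ms (assumption).
ENGLISH_VOICELESS_MS: Tuple[float, float] = (40.0, 50.0)
SPANISH_VOICELESS_MS: Tuple[float, float] = (0.0, 0.0)


def distance_to_range(vot_ms: float, bounds: Tuple[float, float]) -> float:
    """Distance in ms from vot_ms to the closed interval bounds (0 if inside)."""
    lo, hi = bounds
    if vot_ms < lo:
        return lo - vot_ms
    if vot_ms > hi:
        return vot_ms - hi
    return 0.0


def closer_pattern(vot_ms: float) -> str:
    """Label a voiceless-stop VOT by the nearer typical range (ties go to English)."""
    d_en = distance_to_range(vot_ms, ENGLISH_VOICELESS_MS)
    d_es = distance_to_range(vot_ms, SPANISH_VOICELESS_MS)
    return "English-like" if d_en <= d_es else "Spanish-like"


if __name__ == "__main__":
    for vot in (45.0, 8.0, 70.0):
        print(vot, closer_pattern(vot))
```

A fixed rule like this ignores stress and place of articulation, both of which shift VOT, which is one reason the methods above learn their decisions from data instead.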

Future Work
  • Since VOT is a timing-related phenomenon, it may
    help to explicitly model the state duration
    density in the HMMs.
  • Optimisation criteria other than maximum
    likelihood estimation might be better suited to
    training models for this purpose.
  • More traditional classification approaches could
    also be evaluated.

Acknowledgements
Special thanks to the Tball Project for the data, the
EE619 class for feedback, and Daylen Riggs and Nathan
Go for help with the transcriptions. References are
available on request.