Title: Automatic Detection of Voice Onset Time Contrasts For Use in Pronunciation Assessment
Abe Kazemzadeh (1), Joseph Tepperman (1), Jorge Silva (1), Hong You (2), Sungbok Lee (1), Abeer Alwan (2), and Shrikanth Narayanan (1)
(1) University of Southern California, (2) University of California, Los Angeles
Project Description
- Automatically distinguish whether a voiceless stop consonant is pronounced with a native or accented pronunciation, based on voice onset time (VOT) characteristics.
- Use data from the Tball corpus: ESL children doing oral reading tasks.
- Evaluate different methods of accomplishing this:
  - State duration measurements
  - Explicit modelling of aspiration
  - Phone probability discrimination
Motivation for Studying VOT
- This study was motivated by a desire to determine whether a phone was produced with a non-standard pronunciation.
- Other reasons to study VOT:
  - It is an important contrastive feature
  - It gives information about stress
  - It gives information about word segmentation
  - It may give information about emphasis
Results
- Baseline method error rates
  - p 55, t 23, k 29
  - p 19, t 20, k 48 using duration of 3rd HMM state
- With aspiration model (ShortVOT / LongVOT)
  - p 5 / 36
  - t 11 / 38
  - k 57 / 17
- With probability comparison (ShortVOT / LongVOT)
  - p 36 / 4
  - t 0 / 5
  - k 0 / 6
  - (trained on test data; over-trained?)
Methodology
- Baseline: use duration measurements from a forced alignment.
- Aspiration model: insert an /h/ symbol in the transcriptions with standard pronunciation, train accordingly, and decode the test files to see if the /h/ phone is recognized.
- Probability comparison: cut out the phones of interest from the audio file, train separate models and a combined model, and evaluate the likelihood of the separate models with respect to the combined model.
- The data was transcribed by ear, with special symbols for non-standard pronunciations.
- Standard 3-state HMMs.
- The evaluation metric used was the error rate for both classes, evaluated separately.
- When using thresholds, the point of equal error rate for both classes was used.
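As an illustration of the equal-error-rate thresholding described above, the following is a minimal sketch, not the implementation used in this study. The function name `equal_error_threshold` and the duration values are hypothetical; the values are made up for illustration and are not Tball data. A token is labelled "long VOT" when its measured duration is at or above the threshold, and the threshold is chosen where the two class error rates are closest.

```python
def equal_error_threshold(short_vals, long_vals):
    """Return (threshold, err_short, err_long) where the class error rates are closest.

    A token is labelled "long" when its value is >= threshold.
    """
    # Every observed value is a candidate threshold.
    candidates = sorted(set(short_vals) | set(long_vals))
    best = None
    for thr in candidates:
        # Short-VOT tokens at or above the threshold are misclassified as long.
        err_short = sum(v >= thr for v in short_vals) / len(short_vals)
        # Long-VOT tokens below the threshold are misclassified as short.
        err_long = sum(v < thr for v in long_vals) / len(long_vals)
        gap = abs(err_short - err_long)
        if best is None or gap < best[0]:
            best = (gap, thr, err_short, err_long)
    return best[1:]


# Made-up durations in milliseconds, for illustration only (not Tball data).
short_vot = [5, 10, 12, 18, 25, 40]
long_vot = [20, 35, 45, 50, 60, 70]

thr, err_short, err_long = equal_error_threshold(short_vot, long_vot)
print(f"threshold={thr} ms, short error={err_short:.2f}, long error={err_long:.2f}")
```

Reporting the error rate of each class separately, as in the Results, keeps a skewed class balance from hiding poor performance on the smaller class.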
What is VOT?
- It is the interval between the release of closure of an articulator (the transient burst) and the start of voicing.
- Defined for stop consonants, e.g. /p,b,t,d,k,g/
- VOT has a continuum of values:
  - When the start of voicing precedes the release of closure for a stop, VOT is negative.
  - When the release of closure and onset of voicing are coincident, VOT is zero.
  - When voicing comes after release of closure, VOT is positive.
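The sign convention above can be sketched in a few lines. This is only an illustration of the definition: the function names, the `tol_ms` tolerance parameter, and the example time stamps are assumptions, not part of the study.

```python
def vot_ms(release_ms, voicing_onset_ms):
    """VOT in milliseconds: time of voicing onset minus time of closure release."""
    return voicing_onset_ms - release_ms


def vot_category(vot, tol_ms=0):
    """Classify a VOT value by sign; tol_ms is a hypothetical tolerance around zero."""
    if vot < -tol_ms:
        return "negative"  # voicing starts before the release
    if vot > tol_ms:
        return "positive"  # voicing starts after the release
    return "zero"          # release and voicing onset coincide


# Made-up (release, voicing onset) time stamps in milliseconds.
examples = {
    "voicing leads the release": (200, 120),
    "coincident": (200, 200),
    "voicing lags the release": (200, 245),
}
for label, (release, voicing) in examples.items():
    v = vot_ms(release, voicing)
    print(f"{label}: VOT = {v} ms ({vot_category(v)})")
```

In practice a measured VOT is rarely exactly zero, which is why the sketch allows a small tolerance.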
Tball Corpus
- Los Angeles area elementary schools.
- 256 children, mainly native Spanish speakers.
- Reading words, letters, and numbers, and naming pictures and colors.
- Collected through a collaboration between USC and UCLA.
Discussion
- Studies have noted that VOT varies with place of articulation: /k/ > /t/ > /p/.
- Roughly, each successive method was more difficult than the last.
- The results improved from the baseline, but the last approach (comparing probabilities) may have been over-trained.
- Comparing probabilities may be easier to extend to other pronunciation modelling tasks.
- Increasing the frame rate didn't help much.
- If an initial consonant has a short VOT, this does not necessarily imply a non-standard accent; one must know the stress pattern of the word.
Physical Realization of VOT
- Stop consonants are produced with a closure of the vocal tract at a specific point, the place of articulation.
- During the closure, there is a build-up of sub-laryngeal pressure.
- When the closure is released, there is a transient burst of air, frication due to turbulence at the place of articulation, and aspiration noise from turbulence at the glottis.
- Voicing may occur before, during, or after the release of closure.
Conclusion
- When classifying stop consonants based on VOT characteristics, different approaches work better on different stops.
- Measuring the duration of the stop state works reasonably well for /t,k/ because they have longer VOT than /p/.
- Detecting insertion of an aspiration model during decoding works well for /p,t/ but not /k/, which has too many false positives.
- Comparing phone probabilities worked well except for unaspirated /p/.
Linguistic Significance of VOT
- VOT distinguishes consonants with the same place of articulation (/p/ vs. /b/, /t/ vs. /d/, etc.)
- However, different languages use different VOT intervals in contrasts (e.g. taco, pasta).
  - English voiceless stops: VOT 40-50 ms
  - Spanish voiceless stops: VOT near zero
  - English voiced stops: VOT near zero
  - Spanish voiced stops: negative VOT (voicing before release of closure)
Future Work
- Since VOT is a timing-related phenomenon, it may help to explicitly model the state duration density in the HMMs.
- Other optimisation criteria might be better suited than maximum likelihood estimation to train models for this purpose.
- More traditional classification approaches.
Acknowledgements
Special thanks to the Tball Project for the data, the EE619 class for feedback, and Daylen Riggs and Nathan Go for help with the transcriptions. References available on request.