1
CS 551/651: Structure of Spoken Language
Lecture 5: Characteristics of Place of Articulation; Phonetic Transcription
John-Paul Hosom, Fall 2008
2
Acoustic-Phonetic Features: Manner of Articulation
Approximately 8 manners of articulation:
Name          Sub-Types           Examples
Vowel         vowel, diphthong    aa, iy, uw, eh, ow
Approximant   liquid, glide       l, r, w, y
Nasal                             m, n, ng
Stop          unvoiced, voiced    p, t, k, b, d, g
Fricative     unvoiced, voiced    f, th, s, sh, v, dh, z, zh
Affricate     unvoiced, voiced    ch, jh
Aspiration                        h
Flap                              dx, nx
A change in manner of articulation is usually abrupt and visible; manner provides much information about the location of phonemes.
3
Acoustic-Phonetic Features: Place of Articulation
Approximately 8 places of articulation for consonants:
Name              Examples
Labial            p, b, m, (w)
Labio-Dental      f, v
Dental            th, dh
Alveolar          t, d, s, z, n, l
Palato-Alveolar   sh, zh, ch, jh, r
Palatal           y
Velar             k, g, ng, (w)
Glottal           h
/l/ doesn't have the same coarticulatory properties as other alveolars: it starts as alveolar (/t/, /d/), then becomes palato-alveolar.
/r/ is really a retroflex, and has a complex place of articulation.
Place of articulation is more subject to coarticulation than manner.
The F2 trajectory is important for identifying place of articulation.
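
The manner and place classes listed on the two slides above can be held in a small lookup table. The following is only a sketch: the dictionary layout and the helper names (`classes_of`, `describe`) are assumptions made for illustration, and the phoneme symbols are exactly those given as examples on the slides. The parenthesized (w) is simply listed under both labial and velar.

```python
# Manner and place classes, copied from the example columns of the slides.
MANNER = {
    "vowel": ["aa", "iy", "uw", "eh", "ow"],
    "approximant": ["l", "r", "w", "y"],
    "nasal": ["m", "n", "ng"],
    "stop": ["p", "t", "k", "b", "d", "g"],
    "fricative": ["f", "th", "s", "sh", "v", "dh", "z", "zh"],
    "affricate": ["ch", "jh"],
    "aspiration": ["h"],
    "flap": ["dx", "nx"],
}

PLACE = {
    "labial": ["p", "b", "m", "w"],
    "labio-dental": ["f", "v"],
    "dental": ["th", "dh"],
    "alveolar": ["t", "d", "s", "z", "n", "l"],
    "palato-alveolar": ["sh", "zh", "ch", "jh", "r"],
    "palatal": ["y"],
    "velar": ["k", "g", "ng", "w"],
    "glottal": ["h"],
}


def classes_of(phoneme, table):
    """Return every class name in `table` whose member list contains `phoneme`."""
    return [name for name, members in table.items() if phoneme in members]


def describe(phoneme):
    """Hypothetical helper: look up manner and place for one phoneme symbol."""
    return {
        "manner": classes_of(phoneme, MANNER),
        "place": classes_of(phoneme, PLACE),
    }


print(describe("ng"))
print(describe("w"))
```

Vowels and flaps have no entry in the place table because the slide lists places for consonants only, so `describe("aa")` returns an empty place list.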
4

Acoustic-Phonetic Features: Place of Articulation
Labial (/p/, /b/, /m/, /w/)
constriction (or complete closure) at the lips
the only unvoiced labial is /p/
the only nasal labial is /m/
characterized by F1, F2, and (even) F3 of adjacent vowel(s) rapidly and briefly decreasing at the border with the labial

Acoustic-Phonetic Features: Place of Articulation
Labio-Dental (/f/, /v/)
produced by constriction between lower lip and upper teeth
in English, all labio-dental phonemes are fricatives
can be characterized by formants of adjacent vowel(s) decreasing at the border with the labio-dental (similar to the characteristics of labials)
Dental (/th/, /dh/)
produced by constriction between tongue tip and upper teeth (sometimes the tongue tip is closer to the alveolar ridge)
in English, all dental phonemes are fricatives
may be characterized by stronger energy above 6 kHz, but weaker than the /sh/, /zh/ fricatives

Acoustic-Phonetic Features: Place of Articulation
Alveolar (/t/, /d/, /s/, /z/, /n/, /l/)
tongue tip is at or near the alveolar ridge
a large number of English consonants are alveolar
primary cue to alveolars: F2 of neighboring vowel(s) is around 1800 Hz, except for /l/
/l/ has low F1 (≈ 400 Hz) and low F2 (≈ 1000 Hz), with high F3
/l/ before a vowel is light /l/; /l/ after a vowel is dark /l/.

Acoustic-Phonetic Features: Place of Articulation
Palato-Alveolar (/sh/, /zh/, /ch/, /jh/, /r/)
tongue is between the alveolar ridge and the hard palate
2 fricatives, 2 affricates, 1 retroflex
the retroflex has a "depression" midway along the tongue
the palato-alveolar fricatives tend to have strong energy, because the weak constriction allows a large airflow
/r/ (and /er/) is most easily identified by F3 below 2000 Hz
Palatal (/y/)
produced with the tongue close to the hard palate
an extreme production of /iy/
F1 and F2 tend to be more spread apart than in /iy/; F1 is lower than in /iy/

Acoustic-Phonetic Features: Place of Articulation
Velar (/k/, /g/, /ng/)
produced with constriction against the velum (soft palate)
only the plosives /k/ and /g/, and the nasal /ng/
the characteristic of velars is the velar pinch, in which F2 and F3 of the neighboring vowel become very close at the boundary with the velar. More visible with the front vowel /ih/

Acoustic-Phonetic Features: Place of Articulation
Glottal (/h/)
/h/ is the nominal glottal phoneme in English; in reality, the tongue can be in any vowel-like position
the primary cue for /h/ is formant structure without voicing, an energy dip, and/or an increase in aspiration noise in higher frequencies.

Distinctive Phonetic Features: Summary
Distinctive features may be used to categorize phonetic sub-classes and show relationships between phonemes
There is often not a one-to-one correspondence between a feature value and a particular trait in the speech signal
A variety of context-dependent and context-independent cues (sometimes conflicting, sometimes complementary) serve to identify features
Speech is highly variable and highly context-dependent, and cues to phonemic identity are spread in both the spectral and time domains. This diffusion of features makes automatic speech recognition difficult, but human speech recognition is able to use the diffusion for robustness.

Redundancy
Distinctive features are not always independent; some redundancy may be implied (especially with binary features)
Example: Spanish
[+high] -> [-low]
[+low] -> [-high]
[-back] -> [-round]
[+round] -> [+back]
[+low] -> [+back]
[+low] -> [-round]
[-back] -> [-low]
[+round] -> [-low]
These relationships are language- and feature-set-specific. (from Schane, p. 35-38)
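
Implications of this kind can be applied mechanically to fill in the features that a partial specification leaves out. The sketch below assumes the Spanish example reads as eight implications ([+high] -> [-low], [+low] -> [-high], [-back] -> [-round], [+round] -> [+back], [+low] -> [+back], [+low] -> [-round], [-back] -> [-low], [+round] -> [-low]); that reading, the rule encoding, and the function name `fill_redundant` are all assumptions for illustration, not something taken from Schane's text.

```python
# Each rule is (antecedent, consequent); each side is (feature, value),
# with True for "+" and False for "-". The rule list is an assumed reading
# of the Spanish example on the slide.
RULES = [
    (("high", True), ("low", False)),
    (("low", True), ("high", False)),
    (("back", False), ("round", False)),
    (("round", True), ("back", True)),
    (("low", True), ("back", True)),
    (("low", True), ("round", False)),
    (("back", False), ("low", False)),
    (("round", True), ("low", False)),
]


def fill_redundant(features):
    """Return a copy of `features` with every implied feature value added.

    Raises ValueError if the given values contradict one of the rules.
    """
    result = dict(features)
    changed = True
    while changed:  # repeat until no rule adds anything new
        changed = False
        for (feat, value), (implied_feat, implied_value) in RULES:
            if result.get(feat) != value:
                continue
            if implied_feat not in result:
                result[implied_feat] = implied_value
                changed = True
            elif result[implied_feat] != implied_value:
                raise ValueError(f"{implied_feat} contradicts rule on {feat}")
    return result


print(fill_redundant({"low": True}))
print(fill_redundant({"round": True}))
```

Under these rules, specifying only [+low] is enough to derive [-high], [+back], and [-round], which is the sense in which those three values are redundant.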
12

Redundancy
Redundant information can be indicated by circling redundant features

Some redundancies are universal (a sound can't be both high and low)
Phonetic sequences also have constraints (redundant info.)
English has no more than 3 word-initial consonants; in this case, the first consonant is always /s/, the next is always /p/, /t/, or /k/, and the third is always /r/ or /l/ (from Schane, p. 36-40)
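
The three-consonant constraint as stated on the slide can be written as a simple check. This is a minimal sketch: the function name and the representation of an onset as a list of phoneme symbols are assumptions, and the rule encoded is only the one given above, not a full model of English phonotactics.

```python
def is_allowed_three_consonant_onset(onset):
    """Check a word-initial three-consonant cluster against the slide's rule:
    /s/, then /p/, /t/, or /k/, then /r/ or /l/."""
    if len(onset) != 3:
        raise ValueError("expected exactly three consonant symbols")
    first, second, third = onset
    return first == "s" and second in {"p", "t", "k"} and third in {"r", "l"}


print(is_allowed_three_consonant_onset(["s", "t", "r"]))  # as in "street"
print(is_allowed_three_consonant_onset(["s", "p", "l"]))  # as in "splash"
print(is_allowed_three_consonant_onset(["t", "s", "r"]))
```

Because the first slot is fully determined and the other two are tightly restricted, knowing that a cluster has three consonants already carries most of the information about which consonants they are.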

13
Phonetic Transcription
Given a corpus of speech data, it's often necessary to create a transcription:
word level
phoneme level
time-aligned phoneme level
time-aligned detailed phoneme level (with diacritics)
other information: phonetic stress, emotion, syntax, repair
Most common are word-level and time-aligned phoneme level.
Time-aligned phonetic transcription example:
0 110 .pau
110 180 h
180 240 eh
240 280 l
280 390 ow
390 540 .pau
[Figure labels without recoverable times: t, uw, .br]
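
A time-aligned transcription in the "start end label" layout of the example above is easy to read into a program. The sketch below is illustrative only: it assumes three whitespace-separated fields per line with times in milliseconds (the slide does not state the unit), and the names `Segment`, `parse_labels`, and `boundaries` are hypothetical.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    start_ms: int  # assumed unit: milliseconds
    end_ms: int
    label: str


def parse_labels(text):
    """Parse lines of the form 'start end label' into Segment records."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        start, end, label = fields
        segments.append(Segment(int(start), int(end), label))
    return segments


def boundaries(segments):
    """Return the internal boundary times (the end of every segment but the last)."""
    return [seg.end_ms for seg in segments[:-1]]


EXAMPLE = """\
0 110 .pau
110 180 h
180 240 eh
240 280 l
280 390 ow
390 540 .pau
"""

segments = parse_labels(EXAMPLE)
print([seg.label for seg in segments])
print(boundaries(segments))
```

Keeping boundaries as a flat list of times makes it straightforward to compare two labelings of the same utterance.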
14
Phonetic Transcription
Are phonemes precise quantities with exact boundaries?
No: humans disagree on phonetic labels and boundary positions; disagreement may be a matter of interpretation of the utterance.
Phonetic label agreement between humans:
Full, Base Label Set: 55 (English), 62 (German), 50 (Mandarin), 42 (Spanish)
Broad Categories: 7, corresponding to manner of articulation
From Cole, Oshika, et al., ICSLP '94
15

Phonetic Transcription
70% agreement on 55 phonemes, 89% agreement on 7 categories
Best phoneme-level automatic speech recognition result on TIMIT, with a 39-phoneme symbol set: 75.8% (Antoniou and Reynolds)
Differences:
Human agreement was evaluated on spontaneous speech (stories); TIMIT is read speech
Humans used 55 phonemes; 39 phonemes were used for evaluating TIMIT
Phoneme agreement doesn't translate into word accuracy: human word accuracy is typically an order of magnitude better than that of the best automatic speech recognition system.

16
Phonetic Transcription
Phonetic label boundary agreement between humans
Agreement is measured by comparing two manual labelings, A and B, and computing the percentage of cases in which B labels are within some threshold (20 msec) of A labels.
[Figure: agreement (%) vs. threshold (msec)]
Average agreement of 93.8% within 20 msec threshold
Maximum agreement of 96% within 20 msec
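
The agreement measure described above can be sketched in a few lines. This version assumes that labelings A and B contain the same number of boundaries in the same order, so that they can be compared pairwise, and that "within" includes a difference exactly equal to the threshold; the function name and the sample boundary times are made up for illustration.

```python
def boundary_agreement(a_ms, b_ms, threshold_ms=20):
    """Percentage of boundaries in B that lie within threshold_ms of the
    corresponding boundary in A (pairwise comparison, assumed alignment)."""
    if len(a_ms) != len(b_ms):
        raise ValueError("labelings must have the same number of boundaries")
    if not a_ms:
        raise ValueError("no boundaries to compare")
    hits = sum(1 for a, b in zip(a_ms, b_ms) if abs(a - b) <= threshold_ms)
    return 100.0 * hits / len(a_ms)


# Hypothetical boundary times (msec) from two labelers of one utterance.
labeler_a = [110, 180, 240, 280, 390]
labeler_b = [115, 210, 238, 300, 391]

print(boundary_agreement(labeler_a, labeler_b))      # 20 msec threshold
print(boundary_agreement(labeler_a, labeler_b, 10))  # stricter threshold
```

Sweeping `threshold_ms` over a range of values gives the kind of agreement-versus-threshold curve that the slide plots.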
17
Phonetic Transcription
Is there a correct answer?
No: transcription is inherently subjective, although semi-arbitrary guidelines can be imposed.
Is measuring accuracy meaningless?
No: phonemes do have identity and order, although details may be subjective.
Sometimes very precise (if semi-arbitrary) labels and boundaries are extremely important (e.g. concatenative text-to-speech databases).
What about getting a computer to generate transcriptions, or at least phonetic boundaries?
Advantages: consistent, fast
Disadvantages: not accurate compared to human transcription; not robust to different speakers, environments
18

Phonetic Transcription
Automatic Phonetic Alignment (assume phonetic identity is known)
Two common methods:
(1) Forced Alignment
Use an existing speech recognizer, constrained to recognize only the correct phoneme sequence. The search process used by HMM recognizers returns both phoneme identity and location. Location information is boundary information.
(2) Dynamic Time Warping
(a) Use text-to-speech or utterance templates to generate the same speech content with known boundaries.
(b) Warp the time scale of the reference (TTS or template) with the input speech to minimize spectral error.
(c) Convert known boundary locations to the original time scale.
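
Steps (b) and (c) of the dynamic time warping method can be illustrated with a small, unoptimized sketch. It is not the implementation of any particular alignment tool: it assumes each frame is a short list of spectral feature values, uses squared Euclidean distance as the "spectral error", allows the three standard step directions, and maps a reference boundary to the first input frame aligned with it. All function names and the toy data are hypothetical.

```python
def frame_distance(x, y):
    """Squared Euclidean distance between two feature frames (assumed error measure)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def dtw_path(reference, observed):
    """Return the minimum-cost warping path as (reference_index, observed_index) pairs."""
    n, m = len(reference), len(observed)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(reference[i - 1], observed[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the final cell to the first, always taking the cheapest predecessor.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
        path.append((i - 1, j - 1))
    path.reverse()
    return path


def map_boundaries(path, reference_boundaries):
    """Step (c): convert reference boundary frames to the input's time scale.

    Each boundary is the index of the first frame of a new segment."""
    return [next(j for i, j in path if i >= b) for b in reference_boundaries]


# Toy data: the reference changes sound at frame 2; the input says the same
# thing more slowly, so the change happens later.
reference = [[0.0], [0.0], [1.0], [1.0]]
observed = [[0.0], [0.0], [0.0], [0.0], [1.0], [1.0], [1.0]]

path = dtw_path(reference, observed)
print(map_boundaries(path, [2]))
```

The quadratic cost table is fine for short utterances; a practical system would typically constrain the search to a band around the diagonal.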

19
Phonetic Transcription
Accuracy of automatic alignment: speaker-independent alignment using Forced Alignment
[Figure: agreement (%) vs. threshold (msec)]
20
Phonetic Transcription
Comparing manual and automatic alignment of the TIMIT corpus