CS 551/651 - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: CS 551/651


1
CS 551/651: Structure of Spoken Language
Lecture 5: Characteristics of Place of Articulation; Phonetic Transcription
John-Paul Hosom, Fall 2008
2
Acoustic-Phonetic Features: Manner of Articulation
Approximately 8 manners of articulation:

Name          Sub-Types            Examples
Vowel         vowel, diphthong     aa, iy, uw, eh, ow, ...
Approximant   liquid, glide        l, r, w, y
Nasal                              m, n, ng
Stop          unvoiced, voiced     p, t, k, b, d, g
Fricative     unvoiced, voiced     f, th, s, sh, v, dh, z, zh
Affricate     unvoiced, voiced     ch, jh
Aspiration                         h
Flap                               dx, nx

A change in manner of articulation is usually abrupt and visible; manner provides much information about the location of phonemes.
3
Acoustic-Phonetic Features: Place of Articulation
Approximately 8 places of articulation for consonants:

Name              Examples
Labial            p, b, m, (w)
Labio-Dental      f, v
Dental            th, dh
Alveolar          t, d, s, z, n, l
Palato-Alveolar   sh, zh, ch, jh, r
Palatal           y
Velar             k, g, ng, (w)
Glottal           h

  • /l/ doesn't have the same coarticulatory properties as other alveolars
  • the affricates /ch/ and /jh/ start as alveolar (/t/, /d/), then become palato-alveolar
  • /r/ is really a retroflex, and has a complex place of articulation
Place of articulation is more subject to coarticulation than manner; the F2 trajectory is important for identifying place of articulation.
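The manner and place tables from the two slides above can be held as simple lookup tables. The sketch below is only an illustration: the dictionary layout and the helper names `classes_of` and `describe` are assumptions, not part of the lecture, and the vowel entry contains only the few vowels the slide lists.

```python
# Tables transcribed from the slides; the layout and helper names are
# illustrative assumptions, not part of the lecture.
MANNER = {
    "vowel": ["aa", "iy", "uw", "eh", "ow"],  # the slide lists only a few vowels
    "approximant": ["l", "r", "w", "y"],
    "nasal": ["m", "n", "ng"],
    "stop": ["p", "t", "k", "b", "d", "g"],
    "fricative": ["f", "th", "s", "sh", "v", "dh", "z", "zh"],
    "affricate": ["ch", "jh"],
    "aspiration": ["h"],
    "flap": ["dx", "nx"],
}

PLACE = {
    "labial": ["p", "b", "m", "w"],
    "labio-dental": ["f", "v"],
    "dental": ["th", "dh"],
    "alveolar": ["t", "d", "s", "z", "n", "l"],
    "palato-alveolar": ["sh", "zh", "ch", "jh", "r"],
    "palatal": ["y"],
    "velar": ["k", "g", "ng", "w"],  # /w/ is listed under both labial and velar
    "glottal": ["h"],
}


def classes_of(phoneme, table):
    """Return every class in `table` whose member list contains `phoneme`."""
    return [name for name, members in table.items() if phoneme in members]


def describe(phoneme):
    """Look up the manner and place classes of a phoneme symbol."""
    return {
        "manner": classes_of(phoneme, MANNER),
        "place": classes_of(phoneme, PLACE),
    }


print(describe("s"))
print(describe("w"))
```

Returning lists rather than single values keeps the double listing of /w/ (labial and velar) visible instead of silently picking one place.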
4
  • Acoustic-Phonetic Features: Place of Articulation
  • Labial (/p/, /b/, /m/, /w/)
  • constriction (or complete closure) at lips
  • the only unvoiced labial is /p/
  • the only nasal labial is /m/
  • characterized by F1, F2, (even) F3 of adjacent vowel(s)
    rapidly and briefly decreasing at border with labial

5
  • Acoustic-Phonetic Features: Place of Articulation
  • Labio-Dental (/f/, /v/)
  • produced by constriction between lower lip and upper teeth
  • in English, all labio-dental phonemes are fricatives
  • can be characterized by formants of adjacent vowel(s)
    decreasing at border with labio-dental (similar to
    characteristics of labials)
  • Dental (/th/, /dh/)
  • produced by constriction between tongue tip and upper teeth
    (sometimes tongue tip is closer to alveolar ridge)
  • in English, all dental phonemes are fricatives
  • may be characterized by stronger energy above 6 kHz, but
    weaker than /sh/, /zh/ fricatives

6
  • Acoustic-Phonetic Features: Place of Articulation
  • Alveolar (/t/, /d/, /s/, /z/, /n/, /l/)
  • tongue tip is at or near alveolar ridge
  • a large number of English consonants are alveolar
  • primary cue to alveolars: F2 of neighboring vowel(s) is
    around 1800 Hz, except for /l/
  • /l/ has low F1 (≈ 400 Hz) and F2 (≈ 1000 Hz), high F3
  • /l/ before vowel is light /l/, after vowel is dark /l/.

7
  • Acoustic-Phonetic Features: Place of Articulation
  • Palato-Alveolar (/sh/, /zh/, /ch/, /jh/, /r/)
  • tongue is between alveolar ridge and hard palate
  • 2 fricatives, 2 affricates, 1 retroflex
  • retroflex has 'depression' midway along tongue
  • the palato-alveolar fricatives tend to have strong energy
    due to weak constriction allowing large airflow
  • /r/ (and /er/) most easily identified by F3 below 2000 Hz
  • Palatal (/y/)
  • produced with tongue close to hard palate
  • extreme production of /iy/
  • F1-F2 tend to be more spread than /iy/, F1 is lower than
    /iy/

8
  • Acoustic-Phonetic Features: Place of Articulation
  • Velar (/k/, /g/, /ng/)
  • produced with constriction against velum (soft palate)
  • only plosives /k/ and /g/, and nasal /ng/
  • characteristic of velars is the "velar pinch", in which F2
    and F3 of neighboring vowel become very close at boundary
    with velar. More visible in front vowel /ih/

9
  • Acoustic-Phonetic Features: Place of Articulation
  • Glottal (/h/)
  • /h/ is the nominal glottal phoneme in English; in reality,
    the tongue can be in any vowel-like position
  • the primary cue for /h/ is formant structure without
    voicing, an energy dip, and/or an increase in aspiration
    noise in higher frequencies.

10
  • Distinctive Phonetic Features: Summary
  • Distinctive features may be used to categorize phonetic
    sub-classes and show relationships between phonemes
  • There is often not a one-to-one correspondence between a
    feature value and a particular trait in the speech signal
  • A variety of context-dependent and context-independent cues
    (sometimes conflicting, sometimes complementary) serve to
    identify features
  • Speech is highly variable, highly context-dependent, and
    cues to phonemic identity are spread in both the spectral
    and time domains. The diffusion of features makes automatic
    speech recognition difficult, but human speech recognition
    is able to use this diffusion for robustness.

11
  • Redundancy
  • Distinctive features are not always independent; some
    redundancy may be implied (especially binary features)
  • Example: Spanish

[+high] → [-low]
[+low] → [-high]
[-back] → [-round]
[+round] → [+back]
[+low] → [+back]
[+low] → [-round]
[-back] → [-low]
[+round] → [-low]

These relationships are language- and feature-set-specific. (from Schane, p. 35-38)
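One way to see what "implied" means here is to check each implication against a vowel inventory. The sketch below reads each slide entry as "feature value implies feature value". The feature matrix for the five Spanish vowels is an assumption (a common textbook analysis; the slide shows only the implications), and the names `VOWELS`, `RULES`, `violations` and `all_rules_hold` are hypothetical.

```python
# Assumed binary feature values for the five Spanish vowels. The slide gives
# only the implications, so this matrix is an illustrative assumption.
VOWELS = {
    "i": {"high": True,  "low": False, "back": False, "round": False},
    "e": {"high": False, "low": False, "back": False, "round": False},
    "a": {"high": False, "low": True,  "back": True,  "round": False},
    "o": {"high": False, "low": False, "back": True,  "round": True},
    "u": {"high": True,  "low": False, "back": True,  "round": True},
}

# Each rule reads: (feature, value) implies (feature, value).
RULES = [
    (("high", True), ("low", False)),
    (("low", True), ("high", False)),
    (("back", False), ("round", False)),
    (("round", True), ("back", True)),
    (("low", True), ("back", True)),
    (("low", True), ("round", False)),
    (("back", False), ("low", False)),
    (("round", True), ("low", False)),
]


def violations(rule, vowels):
    """Return the vowels that match the rule's premise but not its conclusion."""
    (f1, v1), (f2, v2) = rule
    return [name for name, feats in vowels.items()
            if feats[f1] == v1 and feats[f2] != v2]


def all_rules_hold(rules, vowels):
    """True if no vowel in the inventory violates any rule."""
    return all(not violations(rule, vowels) for rule in rules)


print(all_rules_hold(RULES, VOWELS))
# A relationship that is NOT redundant in this inventory: [+back] → [+round].
print(violations((("back", True), ("round", True)), VOWELS))
```

Under the assumed matrix, the second call shows why [+back] → [+round] is not on the list: /a/ is back but unrounded.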
12
  • Redundancy
  • Redundant information can be indicated by circling
    redundant features
  • Some redundancies are universal (can't be both high and
    low)
  • Phonetic sequences also have constraints (redundant info.)
  • English has no more than 3 word-initial consonants; in this
    case, the first consonant is always /s/, the next is always
    /p/, /t/, or /k/, and the third is always /r/ or /l/
    (from Schane, p. 36-40)
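The sequence constraint on three-consonant word onsets can be written as a small predicate. This is a sketch of the rule exactly as stated on the slide; the function name `valid_three_consonant_onset` and the list-of-symbols input format are assumptions.

```python
def valid_three_consonant_onset(onset):
    """Check a three-consonant word-initial cluster against the slide's rule:
    first /s/, then /p/, /t/ or /k/, then /r/ or /l/."""
    if len(onset) != 3:
        raise ValueError("expected exactly three consonant symbols")
    first, second, third = onset
    return first == "s" and second in {"p", "t", "k"} and third in {"r", "l"}


print(valid_three_consonant_onset(["s", "t", "r"]))  # as in "string"
print(valid_three_consonant_onset(["s", "p", "l"]))  # as in "splash"
print(valid_three_consonant_onset(["s", "m", "r"]))  # not permitted by the rule
```

Because the first slot is fully predictable and the other two are tightly restricted, knowing that a word starts with three consonants already carries most of the information about which consonants they are.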

13
Phonetic Transcription
Given a corpus of speech data, it's often necessary to create a transcription:
  • word level
  • phoneme level
  • time-aligned phoneme level
  • time-aligned detailed phoneme level (with diacritics)
  • other information: phonetic stress, emotion, syntax, repair
Most common are word-level and time-aligned phoneme level.
Time-aligned phonetic transcription examples:
0 110 .pau
110 180 h
180 240 eh
240 280 l
280 390 ow
390 540 .pau
t
uw
.br
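A time-aligned transcription in the "start end label" layout shown above is easy to read into a list of segments. The sketch below assumes the times are integers in milliseconds (the slide does not state the unit) and skips any line that does not have exactly three fields, such as a label with no times. The names `Segment`, `parse_alignment` and `is_contiguous` are hypothetical.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    start: int  # assumed to be milliseconds; the slide does not give the unit
    end: int
    label: str


def parse_alignment(text: str) -> List[Segment]:
    """Parse lines of the form 'start end label' into Segment records."""
    segments = []
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank lines and labels that carry no times
        start, end, label = int(parts[0]), int(parts[1]), parts[2]
        if end <= start:
            raise ValueError(f"segment {label!r} has non-positive duration")
        segments.append(Segment(start, end, label))
    return segments


def is_contiguous(segments: List[Segment]) -> bool:
    """True if every segment starts exactly where the previous one ends."""
    return all(a.end == b.start for a, b in zip(segments, segments[1:]))


EXAMPLE = """\
0 110 .pau
110 180 h
180 240 eh
240 280 l
280 390 ow
390 540 .pau
"""

segments = parse_alignment(EXAMPLE)
for seg in segments:
    print(seg.label, seg.end - seg.start)
print(is_contiguous(segments))
```

The contiguity check is a cheap sanity test for hand-edited label files, where a gap or overlap between neighboring segments usually signals a typing error.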
14
Phonetic Transcription
Are phonemes precise quantities with exact boundaries?
  No: humans disagree on phonetic labels and boundary positions; disagreement may be a matter of interpretation of the utterance.
Phonetic label agreement between humans:
  • Full, Base Label Set: 55 (English), 62 (German), 50 (Mandarin), 42 (Spanish)
  • Broad Categories: 7, corresponding to manner of articulation
From Cole, Oshika, et al., ICSLP '94
15
  • Phonetic Transcription
  • 70% agreement on 55 phonemes, 89% agreement on 7 categories
  • Best phoneme-level automatic speech recognition results on
    TIMIT, with a 39-phoneme symbol set: 75.8% (Antoniou and
    Reynolds)
  • Differences:
  • Human agreement was evaluated on spontaneous speech
    (stories); TIMIT is read speech
  • Humans used 55 phonemes; 39 phonemes were used for
    evaluating TIMIT
  • Phoneme agreement doesn't translate into word accuracy:
    human word error rate is typically an order of magnitude
    lower than that of the best automatic speech recognition
    system.

16
Phonetic Transcription
Phonetic label boundary agreement between humans:
Agreement is measured by comparing two manual labelings, A and B, and computing the percentage of cases in which B labels are within some threshold (e.g. 20 msec) of A labels.
[Figure: agreement (%) vs. threshold (msec)]
Average agreement of 93.8% within 20 msec threshold
Maximum agreement of 96% within 20 msec
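The agreement measure described on this slide can be sketched in a few lines. The version below assumes that labelings A and B contain the same boundaries in the same order (that is, the same phoneme sequence) and that "within" is inclusive; neither detail is given on the slide. The boundary times in the usage example are made up for illustration, and `boundary_agreement` is a hypothetical name.

```python
def boundary_agreement(a_ms, b_ms, threshold_ms=20.0):
    """Percentage of boundaries in B that lie within threshold_ms of the
    corresponding boundary in A.

    Assumes a_ms and b_ms list the same boundaries in the same order.
    """
    if len(a_ms) != len(b_ms):
        raise ValueError("labelings must contain the same number of boundaries")
    if not a_ms:
        raise ValueError("need at least one boundary")
    within = sum(1 for a, b in zip(a_ms, b_ms) if abs(a - b) <= threshold_ms)
    return 100.0 * within / len(a_ms)


# Hypothetical boundary times (msec) from two labelers of one utterance.
labeler_a = [110, 180, 240, 280, 390]
labeler_b = [105, 195, 238, 310, 392]

for threshold in (10, 20, 30):
    print(threshold, boundary_agreement(labeler_a, labeler_b, threshold))
```

Sweeping the threshold, as in the loop, produces the kind of agreement-versus-threshold curve that the figure on this slide plots.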
17
Phonetic Transcription
Is there a correct answer?
  No: transcription is inherently subjective, although semi-arbitrary guidelines can be imposed.
Is measuring accuracy meaningless?
  No: phonemes do have identity and order, although details may be subjective. Sometimes very precise (if semi-arbitrary) labels and boundaries are extremely important (e.g. concatenative text-to-speech databases).
What about getting a computer to generate transcriptions, or at least phonetic boundaries?
  • Advantages: consistent, fast
  • Disadvantages: not accurate compared to human transcription; not robust to different speakers, environments
18
  • Phonetic Transcription
  • Automatic Phonetic Alignment (assume phonetic identity is
    known)
  • Two common methods:
  • (1) Forced Alignment: Use an existing speech recognizer,
    constrained to recognize only the correct phoneme sequence.
    The search process used by HMM recognizers returns both
    phoneme identity and location. Location information is
    boundary information.
  • (2) Dynamic Time Warping: (a) Use text-to-speech or
    utterance templates to generate the same speech content
    with known boundaries. (b) Warp the time scale of the
    reference (TTS or template) with the input speech to
    minimize spectral error. (c) Convert the known boundary
    locations to the original time scale.
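Steps (b) and (c) of the Dynamic Time Warping method can be sketched as follows. This is a minimal textbook DTW with Euclidean frame distance and the three basic step directions, not the specific procedure used in the lecture. Real systems would use spectral feature vectors per frame; the one-dimensional "features" below are toy values, and the names `dtw_path` and `map_boundaries` are hypothetical.

```python
def dtw_path(ref, inp):
    """Return the minimum-cost warping path as (ref_frame, input_frame) pairs.

    ref and inp are lists of feature vectors (lists of floats).
    """
    n, m = len(ref), len(inp)
    inf = float("inf")

    def dist(x, y):
        # Euclidean distance stands in for "spectral error" in this sketch.
        return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

    # cost[i][j]: best cumulative cost of aligning ref[:i] with inp[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(ref[i - 1], inp[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the end to recover the path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def map_boundaries(path, ref_boundaries):
    """Map reference boundary frames to the first input frame aligned to them."""
    first_input = {}
    for r, t in path:
        first_input.setdefault(r, t)  # path is monotonic, so first hit is earliest
    return [first_input[b] for b in ref_boundaries]


# Toy example: three "phonemes" with feature values 0, 1 and 2.
reference = [[0.0], [0.0], [1.0], [1.0], [2.0], [2.0]]
ref_boundaries = [2, 4]  # known boundary frames in the reference
spoken = [[0.0], [0.0], [0.0], [0.0], [1.0], [1.0], [2.0], [2.0], [2.0]]

print(map_boundaries(dtw_path(reference, spoken), ref_boundaries))
```

In the toy input the first segment is stretched to four frames, so the reference boundaries at frames 2 and 4 land later in the input.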

19
Phonetic Transcription
Accuracy of automatic alignment: speaker-independent alignment using Forced Alignment
[Figure: agreement (%) vs. threshold (msec)]
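The idea behind forced alignment can be illustrated with a heavily simplified Viterbi search: the phoneme sequence is fixed, each frame must belong to one phoneme, and the search only decides where each phoneme starts. This sketch uses one state per phoneme and no transition probabilities, which is far simpler than a real HMM recognizer. The per-frame log-likelihoods are invented toy values, and `forced_align` and `frame` are hypothetical names.

```python
def forced_align(loglik, phones):
    """Return the start frame of each phone in the fixed sequence `phones`.

    loglik[t][p] is the log-likelihood of phone p at frame t (one dict per
    frame). One state per phone, left-to-right, no transition probabilities.
    """
    n_frames, n_phones = len(loglik), len(phones)
    if n_frames < n_phones:
        raise ValueError("need at least one frame per phone")
    neg_inf = float("-inf")
    score = [[neg_inf] * n_phones for _ in range(n_frames)]
    back = [[0] * n_phones for _ in range(n_frames)]
    score[0][0] = loglik[0][phones[0]]

    for t in range(1, n_frames):
        for s in range(min(t + 1, n_phones)):
            stay = score[t - 1][s]
            move = score[t - 1][s - 1] if s > 0 else neg_inf
            if move > stay:
                best, back[t][s] = move, s - 1
            else:
                best, back[t][s] = stay, s
            score[t][s] = best + loglik[t][phones[s]]

    # Trace back from the last phone at the last frame.
    starts = [0] * n_phones
    s = n_phones - 1
    for t in range(n_frames - 1, 0, -1):
        prev = back[t][s]
        if prev != s:
            starts[s] = t  # phone s was entered at frame t
            s = prev
    return starts


PHONES = ["h", "eh", "l"]


def frame(best):
    """Toy log-likelihoods: the 'best' phone scores high, the others low."""
    return {p: (-0.1 if p == best else -3.0) for p in PHONES}


frames = [frame("h"), frame("h"), frame("eh"), frame("eh"), frame("eh"), frame("l")]
print(forced_align(frames, PHONES))
```

The returned start frames are the boundary information the slides refer to; multiplying by the frame step of the front end (an unstated parameter here) would convert them to times.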
20
Phonetic Transcription
Comparing manual and automatic alignment of TIMIT corpus:
  • Automatic method still makes stupid mistakes.
  • Manual labeling criteria not rigorously defined.
  • Performance degrades significantly in presence
    of noise.
  • Assumes correct phonetic sequence is known