1
What are the Essential Cues for Understanding Spoken Language?
Steven Greenberg
International Computer Science Institute
1947 Center Street, Berkeley, CA 94704
http://www.icsi.berkeley.edu/steveng
steveng@icsi.berkeley.edu
2
No Scientist is an Island
IMPORTANT COLLEAGUES
ACOUSTIC BASIS OF SPEECH INTELLIGIBILITY: Takayuki Arai, Joy Hollenback, Rosaria Silipo
AUDITORY-VISUAL INTEGRATION FOR SPEECH PROCESSING: Ken Grant
AUTOMATIC SPEECH RECOGNITION AND FEATURE CLASSIFICATION: Shawn Chang, Lokendra Shastri, Mirjam Wester
STATISTICAL ANALYSIS OF PRONUNCIATION VARIATION: Eric Fosler, Leah Hitchcock, Joy Hollenback
3
Germane Publications
STATISTICAL PROPERTIES OF SPOKEN LANGUAGE AND PRONUNCIATION MODELING
Fosler-Lussier, E., Greenberg, S. and Morgan, N. (1999) Incorporating contextual phonetics into automatic speech recognition. Proceedings of the 14th International Congress of Phonetic Sciences, San Francisco.
Greenberg, S. (1997) On the origins of speech intelligibility in the real world. Proceedings of the ESCA Workshop on Robust Speech Recognition for Unknown Communication Channels, Pont-a-Mousson, France, pp. 23-32.
Greenberg, S. (1999) Speaking in shorthand - A syllable-centric perspective for understanding pronunciation variation. Speech Communication, 29, 159-176.
Greenberg, S. and Fosler-Lussier, E. (2000) The uninvited guest: Information's role in guiding the production of spontaneous speech. Proceedings of the Crest Workshop on Models of Speech Production: Motor Planning and Articulatory Modelling, Kloster Seeon, Germany.
Greenberg, S., Hollenback, J. and Ellis, D. (1996) Insights into spoken language gleaned from phonetic transcription of the Switchboard corpus. Proceedings of the International Conference on Spoken Language Processing (ICSLP), Philadelphia, pp. S24-27.
AUTOMATIC PHONETIC TRANSCRIPTION AND ACOUSTIC FEATURE CLASSIFICATION
Chang, S., Greenberg, S. and Wester, M. (2001) An elitist approach to articulatory-acoustic feature classification. 7th European Conference on Speech Communication and Technology (Eurospeech-2001).
Chang, S., Shastri, L. and Greenberg, S. (2000) Automatic phonetic transcription of spontaneous speech (American English). Proceedings of the International Conference on Spoken Language Processing, Beijing.
Shastri, L., Chang, S. and Greenberg, S. (1999) Syllable segmentation using temporal flow model neural networks. Proceedings of the 14th International Congress of Phonetic Sciences, San Francisco.
Wester, M., Greenberg, S. and Chang, S. (2001) A Dutch treatment of an elitist approach to articulatory-acoustic feature classification. 7th European Conference on Speech Communication and Technology (Eurospeech-2001).
http://www.icsi.berkeley.edu/steveng
4
Germane Publications
PERCEPTUAL BASES OF SPEECH INTELLIGIBILITY
Arai, T. and Greenberg, S. (1998) Speech intelligibility in the presence of cross-channel spectral asynchrony. IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, pp. 933-936.
Greenberg, S. and Arai, T. (1998) Speech intelligibility is highly tolerant of cross-channel spectral asynchrony. Proceedings of the Joint Meeting of the Acoustical Society of America and the International Congress on Acoustics, Seattle, pp. 2677-2678.
Greenberg, S. and Arai, T. (2001) The relation between speech intelligibility and the complex modulation spectrum. Submitted to the 7th European Conference on Speech Communication and Technology (Eurospeech-2001).
Greenberg, S., Arai, T. and Silipo, R. (1998) Speech intelligibility derived from exceedingly sparse spectral information. Proceedings of the International Conference on Spoken Language Processing, Sydney, pp. 74-77.
Silipo, R., Greenberg, S. and Arai, T. (1999) Temporal constraints on speech intelligibility as deduced from exceedingly sparse spectral representations. Proceedings of Eurospeech, Budapest.
AUDITORY-VISUAL SPEECH PROCESSING
Grant, K. and Greenberg, S. (2001) Speech intelligibility derived from asynchronous processing of auditory-visual information. Submitted to the ISCA Workshop on Audio-Visual Speech Processing (AVSP-2001).
PROSODIC STRESS ACCENT: AUTOMATIC CLASSIFICATION AND CHARACTERIZATION
Hitchcock, L. and Greenberg, S. (2001) Vowel height is intimately associated with stress-accent in spontaneous American English discourse. Submitted to the 7th European Conference on Speech Communication and Technology (Eurospeech-2001).
Silipo, R. and Greenberg, S. (1999) Automatic transcription of prosodic stress for spontaneous English discourse. Proceedings of the 14th International Congress of Phonetic Sciences, San Francisco.
Silipo, R. and Greenberg, S. (2000) Prosodic stress revisited: Reassessing the role of fundamental frequency. Proceedings of the NIST Speech Transcription Workshop, College Park, MD.
Silipo, R. and Greenberg, S. (2000) Automatic detection of prosodic stress in American English discourse. Technical Report 2000-1, International Computer Science Institute, Berkeley, CA.
http://www.icsi.berkeley.edu/steveng
5
PROLOGUE: The Central Challenge for Models of Speech Recognition
6
Language - The Traditional Perspective
The classical view of spoken language posits a
quasi-arbitrary relation between the lower and
higher tiers of linguistic organization
7
The Serial Frame Perspective on Speech
  • Traditional models of speech recognition assume
    that the identity of a phonetic segment depends
    on the detailed spectral profile of the
    acoustic signal for a given (usually 25-ms)
    frame of speech

8
Language - A Syllable-Centric Perspective
A more empirical perspective on spoken language focuses on the syllable as the interface between sound and meaning.
Within this framework the relationship between the syllable and the higher and lower tiers is non-arbitrary and statistically systematic.
9
Lines of Evidence
10
Take Home Messages
  • Segmentation is crucial for understanding spoken
    language
  • At the level of the phrase
  • the word
  • the syllable
  • the phonetic segment
  • But ... this linguistic segmentation is inherently fuzzy
  • As is the spectral information associated with
    each linguistic tier
  • The low-frequency (3-25 Hz) modulation spectrum
    is a crucial acoustic (and possibly visual)
    parameter associated with intelligibility
  • It provides segmentation information that unites
    the phonetic segment with the syllable (and
    possibly the word and beyond)
  • Many properties of spontaneous spoken language
    differ from those of laboratory and citation
    speech
  • There are systematic patterns in real speech
    that potentially reveal underlying principles
    of linguistic organization

11
The Central Importance of the Modulation
Spectrum and the Syllable for Understanding
Spoken Language
12
Effects of Reverberation on the Speech Signal
Reflections from walls and other surfaces
routinely modify the spectro-temporal structure
of the speech signal under everyday conditions
13
Effects of Reverberation on the Speech Signal
Reflections from walls and other surfaces routinely modify the temporal and modulation spectral properties of the speech signal. The modulation spectrum's peak is attenuated and shifted down to ca. 2 Hz.
(based on an illustration by Hynek Hermansky)
14
Modulation Spectrum Computation
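The computation was shown on the original slide as a diagram. Below is a minimal sketch of one standard way to carry it out (band-pass filtering, envelope extraction, then the spectrum of the envelope); the band edges, filter orders and smoothing cutoff are illustrative assumptions, not the exact parameters behind these slides:

```python
# Modulation spectrum of one frequency band:
# band-pass -> amplitude envelope -> spectrum of the envelope.
import numpy as np
from scipy.signal import butter, sosfilt, hilbert

def modulation_spectrum(x, fs, band=(1000.0, 2000.0), env_lp=50.0):
    """Return (modulation frequency axis, normalized magnitude) for one band."""
    # 1. Isolate one frequency band of the speech signal.
    sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
    sub = sosfilt(sos, x)
    # 2. Amplitude envelope (Hilbert magnitude), smoothed so that only
    #    slow fluctuations remain.
    env = np.abs(hilbert(sub))
    env = sosfilt(butter(4, env_lp, btype="lowpass", fs=fs, output="sos"), env)
    # 3. The spectrum of the mean-removed envelope is the modulation
    #    spectrum of this band; normalize by the mean envelope so the
    #    measure is independent of overall level.
    spec = np.abs(np.fft.rfft(env - env.mean()))
    freqs = np.fft.rfftfreq(env.size, d=1.0 / fs)
    return freqs, spec / (env.mean() * env.size)
```

The 3-6 Hz energy used on later slides can then be read off as, e.g., `mag[(freqs >= 3) & (freqs <= 6)].sum()`.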
15
The Modulation Spectrum Reflects Syllables
The peak in the distribution of syllable duration is close to the mean (ca. 200 ms). The syllable duration distribution closely matches the shape of the modulation spectrum, suggesting that the modulation spectrum reflects syllables.
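A rough arithmetic check of that correspondence: a mean syllable duration of about 200 ms implies envelope energy fluctuating at roughly

$$f \approx \frac{1}{0.2\ \mathrm{s}} = 5\ \mathrm{Hz},$$

squarely within the 3-6 Hz modulation region emphasized in the slides that follow.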
16
The Ability to Understand Speech Under
Reverberant Conditions (Spectral Asynchrony)
17
Spectral Asynchrony - Method
The output of quarter-octave frequency bands is quasi-randomly time-shifted relative to a common reference. The maximum shift interval ranged between 40 and 240 ms (in 20-ms steps); the mean shift interval is half of the maximum. Adjacent channels were separated by a minimum of one-quarter of the maximum shift range.
Stimuli: 40 TIMIT sentences, e.g., "She washed his dark suit in greasy dish water all year."
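A minimal sketch of this desynchronization manipulation, assuming uniform random shifts and simplified quarter-octave band edges (the study's exact filters and constraint search may differ):

```python
# Desynchronize quarter-octave channels by quasi-random delays and
# re-sum them. Band range, filter order and the constraint search are
# simplified assumptions for illustration.
import numpy as np
from scipy.signal import butter, sosfilt

def desynchronize(x, fs, max_shift_ms, f_lo=300.0, f_hi=3400.0, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    n_bands = int(np.floor(4 * np.log2(f_hi / f_lo)))   # quarter-octave bands
    edges = f_lo * 2.0 ** (np.arange(n_bands + 1) / 4.0)
    max_shift = int(fs * max_shift_ms / 1000.0)
    min_gap = max_shift // 4     # adjacent channels differ by >= 1/4 of range
    out = np.zeros(len(x) + max_shift)
    prev = None
    for lo, hi in zip(edges[:-1], edges[1:]):
        while True:              # draw until the adjacency constraint holds
            s = int(rng.integers(0, max_shift + 1))
            if prev is None or abs(s - prev) >= min_gap:
                break
        prev = s
        sos = butter(4, (lo, hi), btype="bandpass", fs=fs, output="sos")
        out[s:s + len(x)] += sosfilt(sos, x)   # shift this channel, then sum
    return out
```

With uniform draws on [0, max_shift] the mean shift is half the maximum, as stated above.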
18
Spectral Asynchrony - Paradigm
The magnitude of energy in the 3-6 Hz region of the modulation spectrum is computed for each sub-band (of 4 or 7 channels) as a function of spectral asynchrony. The modulation spectrum magnitude is relatively unaffected by asynchronies of 80 ms or less (open symbols), but is appreciably diminished for asynchronies of 160 ms or more. Is intelligibility correlated with the reduction in the 3-6 Hz modulation spectrum?
19
Intelligibility and Spectral Asynchrony
Speech intelligibility does appear to be roughly correlated with the energy in the modulation spectrum between 3 and 6 Hz. The strength of the correlation varies with the sub-band and the degree of spectral asynchrony.
20
Spectral Asynchrony - Summary
  • Speech is capable of withstanding a high degree
    of temporal asynchrony across frequency channels
  • This form of cross-spectral asynchrony is similar
    to the effects of many common forms of acoustic
    reverberation
  • Speech intelligibility remains high (>75%) until this (maximum) asynchrony exceeds 140 ms
  • The magnitude of the low-frequency (3-6 Hz)
    modulation spectrum is highly correlated with
    speech intelligibility

21
Understanding Spoken Language Under Very Sparse
Spectral Conditions
22
A Flaw in the Spectral Asynchrony Study
Of the 448 possible combinations of four slits across the spectrum (where one slit is present in each of the 4 sub-bands), ca. 10% (i.e., 45) exhibit a coefficient of variation of less than 10%. Thus, the seeming temporal tolerance of the auditory system may be illusory (if listeners can decode the speech signal using information from only a small number of channels distributed across the spectrum).
[Figures: distribution of channel asynchrony; intelligibility of spectrally desynchronized speech]
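A small sketch of the combinatorial check behind this caveat: choose one slit per sub-band and compute the coefficient of variation (std/mean) of the chosen channels' delays. The per-channel delays below are placeholders; the 4/4/4/7 sub-band sizes (4 x 4 x 4 x 7 = 448 combinations) are one decomposition consistent with the 4- and 7-channel sub-bands mentioned earlier:

```python
# Count 4-channel combinations (one channel per sub-band) whose imposed
# delays are nearly synchronous, i.e. coefficient of variation < 10%.
import numpy as np
from itertools import product

def low_cv_combinations(subband_shifts_ms, cv_limit=0.10):
    hits = 0
    for combo in product(*subband_shifts_ms):   # one channel per sub-band
        c = np.asarray(combo, dtype=float)
        if c.mean() > 0 and c.std() / c.mean() < cv_limit:
            hits += 1
    return hits

# Hypothetical per-channel delays (ms): three 4-channel sub-bands and
# one 7-channel sub-band -> 4 * 4 * 4 * 7 = 448 combinations in total.
subbands = [[20, 40, 60, 80]] * 3 + [[20, 40, 60, 80, 100, 120, 140]]
print(low_cv_combinations(subbands))
```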
23
Spectral Slit Paradigm
Can listeners decode spoken sentences using just four narrow (1/3-octave) channels ("slits") distributed across the spectrum? The edge of each slit was separated from its nearest neighbor by an octave. The modulation pattern of each slit differs from that of the others, yet the four-slit compound waveform looks very similar to the full-band signal.
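A minimal sketch of the slit construction, with hypothetical center frequencies chosen only to satisfy the one-octave edge-separation rule stated above (the study's exact band centers may differ; the sampling rate must cover the top slit, e.g. 16 kHz):

```python
# Keep four 1/3-octave bands ("slits") and discard the rest of the
# spectrum; the compound waveform is the sum of the four slits.
import numpy as np
from scipy.signal import butter, sosfilt

def slit_signal(x, fs, centers=(335.0, 850.0, 2140.0, 5400.0)):
    out = np.zeros_like(x, dtype=float)
    for fc in centers:
        lo, hi = fc / 2 ** (1 / 6), fc * 2 ** (1 / 6)   # 1/3-octave edges
        sos = butter(6, (lo, hi), btype="bandpass", fs=fs, output="sos")
        out += sosfilt(sos, x)
    return out
```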


24
Word Intelligibility - Single Slits
The intelligibility associated with any single slit is only 2% to 9%. The mid-frequency slits exhibit somewhat higher intelligibility than the lateral slits.
25
Word Intelligibility - Road Map
1. Intelligibility as a function of the number of slits (from one to four)
26
Word Intelligibility - 1 Slit
27
Word Intelligibility - 2 Slits
28
Word Intelligibility - 3 Slits
29
Word Intelligibility - 4 Slits
30
Word Intelligibility - Road Map
2. Intelligibility for different combinations of two-slit compounds. The two center slits yield the highest intelligibility.
31
Word Intelligibility - 2 Slits
32
Word Intelligibility - 2 Slits
33
Intelligibility - 2 Slits
34
Intelligibility - 2 Slits
35
Intelligibility - 2 Slits
36
Intelligibility - 2 Slits
37
Word Intelligibility - Road Map
3. Intelligibility for different combinations of three-slit compounds. Combinations with one or two center slits yield the highest intelligibility.
38
Intelligibility - 3 Slits
39
Intelligibility - 3 Slits
40
Intelligibility - 3 Slits
41
Intelligibility - 3 Slits
42
Word Intelligibility - Road Map
4. Four slits yield nearly (but not quite) perfect intelligibility of ca. 90%. This maximum level of intelligibility makes it possible to deduce the specific contribution of each slit, by itself and in combination with others.
43
Intelligibility - 3 Slits
44
Spectral Slits - Summary
  • A detailed spectro-temporal analysis of the
    speech signal is not required to understand
    spoken language
  • An exceedingly sparse spectral representation
    can, under certain circumstances, yield nearly
    perfect intelligibility

45
Modulation Spectrum Across Frequency
The modulation spectrum varies in magnitude across frequency. The shape of the modulation spectrum is similar for the three lowest slits, but the highest-frequency slit differs from the rest in exhibiting far greater energy at mid modulation frequencies.
46
Word Intelligibility - Single Slits
The intelligibility associated with any single slit ranges between 2% and 9%, suggesting that the shape and magnitude of the modulation spectrum per se are NOT the controlling variables for intelligibility.
47
Spectral Slits - Summary
  • A detailed spectro-temporal analysis of the
    speech signal is not required to understand
    spoken language
  • An exceedingly sparse spectral representation
    can, under certain circumstances, yield nearly
    perfect intelligibility
  • The magnitude component of the modulation
    spectrum does not appear to be the controlling
    variable for intelligibility

48
The Effect of Desynchronizing Sparse Spectral
Information on Speech Intelligibility
49
Modulation Spectrum Across Frequency
Desynchronizing the slits by more than 25 ms
results in a significant decline in
intelligibility
50
Spectral Slits - Summary
  • Even small amounts of asynchrony (>25 ms) imposed on spectral slits can result in significant degradation of intelligibility
  • Asynchrony greater than 50 ms has a profound impact on intelligibility

51
Intelligibility and Slit Asynchrony
52
Spectral Slits - Summary
  • A detailed spectro-temporal analysis of the
    speech signal is not required to understand
    spoken language
  • An exceedingly sparse spectral representation
    can, under certain circumstances, yield nearly
    perfect intelligibility
  • The magnitude component of the modulation
    spectrum does not appear to be the controlling
    variable for intelligibility

53
Spectral Slits - Summary
  • Small amounts of asynchrony (>25 ms) imposed on spectral slits can result in significant degradation of intelligibility
  • Asynchrony greater than 50 ms has a profound impact on intelligibility
  • Intelligibility progressively declines with
    greater amounts of asynchrony up to an asymptote
    of ca. 250 ms
  • Beyond asynchronies of 250 ms intelligibility
    IMPROVES, but the amount of improvement depends
    on individual factors
  • Such results are NOT inconsistent with the high
    intelligibility of desynchronized
    full-spectrum speech, but rather imply that the
    auditory system is capable of extracting
    phonetically important information from a
    relatively small proportion of spectral channels
  • BOTH the amplitude and phase components of the
    modulation spectrum are extremely important for
    speech intelligibility
  • The modulation phase is of particular importance
    for cross-spectral integration of phonetic
    information

54
Speech Intelligibility Derived from Asynchronous
Presentation of Auditory and Visual Information
55
Auditory-Visual Integration of Speech
  • Video of spoken (Harvard/IEEE) sentences,
    presented in tandem with sparse spectral
    representation (low- and high-frequency slits)

56
Auditory-Visual Integration - Mean
Intelligibility
  • When the AUDIO signal LEADS the VIDEO, there is a
    progressive decline in intelligibility, similar
    to that observed for audio-alone signals
  • When the VIDEO signal LEADS the AUDIO,
    intelligibility is preserved for asynchrony
    intervals as large as 200 ms

9 Subjects
57
Auditory-Visual Integration - by Individual Ss
Video lagging is often better than synchronous presentation. There is considerable variation across subjects.
58
Audio-Video Integration Summary
  • Sparse audio and speech-reading information, when presented alone, provide minimal intelligibility,
  • but can, when combined, provide good intelligibility
  • When the audio signal leads the video,
    intelligibility falls off rapidly as a function
    of onset asynchrony
  • When the video signal leads the audio,
    intelligibility is maintained for asynchronies
    as long as 200 ms
  • The dynamics of the video appear to be combined
    with the dynamics associated with the audio to
    provide good intelligibility
  • The dynamics associated with the video signal are
    probably most closely associated with place of
    articulation information
  • The implication is that place information has a
    long time constant of ca. 200 ms and appears
    linked to the syllable

59
Perceptual Evidence for the Spectral Origin of
Articulatory-Acoustic Features
60
Spectral Slit Paradigm
  • Signals were CV and VC Nonsense Syllables (from
    CUNY)

61
Consonant Recognition - Single Slits
62
Consonant Recognition - 1 Slit
63
Consonant Recognition - 2 Slits
64
Consonant Recognition - 3 Slits
65
Consonant Recognition - 4 Slits
66
Consonant Recognition - 5 Slits
67
Consonant Recognition - 2 Slits
68
Consonant Recognition - 2 Slits
69
Consonant Recognition - 2 Slits
70
Consonant Recognition - 2 Slits
71
Consonant Recognition - 2 Slits
72
Consonant Recognition - 2 Slits
73
Consonant Recognition - 3 Slits
74
Consonant Recognition - 3 Slits
75
Consonant Recognition - 3 Slits
76
Consonant Recognition - 4 Slits
77
Consonant Recognition - 5 Slits
78
Articulatory Feature Analysis
  • The consonant recognition results can be scored in terms of articulatory features correct
  • When the accuracy of the features is scored relative to the accuracy of consonant recognition, an interesting pattern emerges (see the scoring sketch after this list)
  • Certain features (place and manner) appear to be highly correlated with consonant recognition performance,
  • while the voicing and rounding features are less highly correlated
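A minimal sketch of this feature-level scoring, assuming a toy feature table (place, manner and voicing only; the rounding feature and the full consonant inventory are omitted for brevity):

```python
# Feature-level scoring: a response consonant is counted correct on a
# feature when it shares that feature's value with the target.
FEATURES = {  # consonant -> (place, manner, voicing)
    "p": ("labial",   "stop",      "voiceless"),
    "b": ("labial",   "stop",      "voiced"),
    "t": ("alveolar", "stop",      "voiceless"),
    "d": ("alveolar", "stop",      "voiced"),
    "s": ("alveolar", "fricative", "voiceless"),
    "z": ("alveolar", "fricative", "voiced"),
}

def feature_scores(pairs):
    """pairs: iterable of (target, response) consonant labels."""
    names = ("place", "manner", "voicing")
    totals = dict.fromkeys(names, 0)
    seg_correct, n = 0, 0
    for target, response in pairs:
        n += 1
        seg_correct += target == response
        for name, t, r in zip(names, FEATURES[target], FEATURES[response]):
            totals[name] += t == r
    return {name: totals[name] / n for name in names}, seg_correct / n

feats, seg = feature_scores([("p", "b"), ("t", "t"), ("s", "z"), ("d", "t")])
print(feats, seg)  # place 1.0, manner 1.0, voicing 0.25; segments 0.25
```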

79
Correlation - AFs/Consonant Recognition
Consonant recognition is almost perfectly correlated with place-of-articulation performance. This correlation suggests that the place feature is based on cues distributed across the entire speech spectrum, in contrast to features such as voicing and rounding, which appear to be extracted from a narrower band of the spectrum. Manner is also highly correlated with consonant recognition, implying that this feature is extracted from a fairly broad portion of the spectrum.
80
Phonetic Transcription of Spontaneous (American)
English
81
Phonetic Transcription of Spontaneous English
  • Telephone dialogues of 5-10 minutes duration - SWITCHBOARD
  • Amount of material manually transcribed:
  • 4 hours labeled at the phone level and segmented at the syllabic level (this material was later phonetically segmented by automatic methods)
  • 1 hour labeled and segmented at the phonetic-segment level
  • Diversity of material transcribed:
  • Spans speech of both genders (ca. 50/50), reflecting a wide range of American dialectal variation (6 regions plus "army brat"), speaking rate and voice quality
  • Transcribed by whom?
  • 11 undergraduates and 1 graduate student, all enrolled at UC-Berkeley. Most of the corpus was transcribed by four individuals out of the twelve
  • Supervised by Steven Greenberg and John Ohala
  • Transcription system:
  • A variant of Arpabet, with phonetic diacritics such as _gl, _cr, _fr, _n, _vl, _vd
  • How long does transcription take? (Don't ask!)
  • 388 times real time for labeling and segmentation at the phonetic-segment level
  • 150 times real time for labeling phonetic segments and segmenting syllables
  • How was labeling and segmentation performed?
  • Using a display of the signal waveform, spectrogram, word transcription and forced alignments (estimates of phones and boundaries), plus audio (listening at multiple time scales - phone, word, utterance) on Sun workstations
  • Data available at: http://www.icsi.berkeley.edu/real/stp

82
Phonetic Transcription
What a typical computer screen shot of the
speech material looks like to a transcriber
83
A Brief Tour of Pronunciation Variation in Spontaneous American English
84
How Many Pronunciations of "and"?
85
How Many Pronunciations of "and"?
86
How Many Different Pronunciations?
The 20 most frequent words account for 35% of the tokens
87
How Many Different Pronunciations?
The 40 most frequent words account for 45% of the tokens
88
How Many Different Pronunciations?
The 60 most frequent words account for 55% of the tokens
89
How Many Different Pronunciations?
The 80 most frequent words account for 62% of the tokens
90
How Many Different Pronunciations?
The 100 most frequent words account for 67% of the tokens
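A minimal sketch of the cumulative-coverage computation behind the preceding five slides, with made-up counts standing in for the corpus word-frequency table:

```python
# What fraction of all word tokens do the N most frequent word types
# account for? `counts` below is a small stand-in for the real table.
from collections import Counter

def coverage(counts: Counter, top_n: int) -> float:
    total = sum(counts.values())
    return sum(c for _, c in counts.most_common(top_n)) / total

counts = Counter({"and": 500, "the": 480, "i": 450, "you": 430, "a": 400,
                  "to": 380, "of": 300, "it": 280, "that": 260, "know": 240})
print(f"top 5 word types cover {coverage(counts, 5):.0%} of tokens")
```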
91
English Syllable Structure is (Sort of) Like Japanese
Most syllables are simple in form (no consonant clusters): 87% of the pronunciations are simple syllabic forms, and 84% of the canonical corpus is composed of simple syllabic forms.
C = consonant, V = vowel. Examples: CV "go", CVC "cat", VC "of", V "a"
Corpus: canonical representation vs. actual pronunciation
Coda consonants tend to drop.
n = 103,054
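A minimal sketch of how a syllable's phone string reduces to the C/V templates tallied above (the vowel set is a plausible Arpabet-style subset, not the transcription system's full inventory):

```python
# Map a syllable's phone list to its structural template (C = consonant,
# V = vowel), so that simple vs. complex forms can be tallied.
VOWELS = {"aa", "ae", "ah", "ao", "aw", "ay", "eh", "er", "ey",
          "ih", "iy", "ow", "oy", "uh", "uw"}

def cv_template(phones):
    return "".join("V" if p in VOWELS else "C" for p in phones)

print(cv_template(["g", "ow"]))                  # "go"    -> CV  (simple)
print(cv_template(["k", "ae", "t"]))             # "cat"   -> CVC (simple)
print(cv_template(["s", "t", "aa", "p", "s"]))   # "stops" -> CCVCC (complex)
```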
92
Complex Syllables ARE Important (Though)
There are many complex syllable forms (consonant clusters), but all occur relatively infrequently. Thus, despite English's reputation for complex syllabic forms, only ca. 15% of the syllable tokens are actually complex.
Complex syllables tend to be part of noun phrases (nouns or adjectives).
C = consonant, V = vowel. Examples: CVCC "fifth", VCC "ounce", CCV "stow", CCVC "stoop", CCVCC "stops", CCCVCC "strength"
Coda consonants tend to drop.
n = 17,760
93
Syllable-Centric Pronunciation Patterns
Onsets are pronounced canonically far more often than nuclei or codas. Codas tend to be pronounced canonically more frequently in formal speech (read sentences) than in spontaneous dialogues.
[Figure: percent canonically pronounced by syllable position, read sentences vs. spontaneous speech]
Example: "cat" = /k ae t/ -> /k/ onset, /ae/ nucleus, /t/ coda
n = 120,814
94
Complex Onsets are Highly Canonical
Complex onsets are pronounced more canonically than simple onsets, despite the greater potential for deviation from the standard pronunciation. COMPLEX onsets contain TWO or MORE consonants.
[Figure: percent canonically pronounced by syllable onset type, read sentences vs. spontaneous speech]
95
Speaking Style Affects Syllable Codas
COMPLEX codas contain TWO or MORE consonants. Codas are much more likely to be realized canonically in formal than in spontaneous speech.
[Figure: percent canonically pronounced by syllable coda type. STP = spontaneous phone dialogues; TIMIT = read sentences]
96
Onsets (but not Codas) Affect Nuclei
The presence of a syllable onset has a substantial impact on the realization of the nucleus.
[Figure: percent canonically pronounced. STP = spontaneous phone dialogues; TIMIT = read sentences]
97
Syllable-Centric Articulatory Feature Analysis
  • Place of articulation deviates most in nucleus
    position
  • Manner of articulation deviates most in onset and
    coda position
  • Voicing deviates most in coda position

Phonetic deviation along a SINGLE feature
Place is VERY unstable in nucleus position
Place deviates very little from canonical form in the onset and coda; it is a STABLE articulatory feature (AF) in these positions
98
Articulatory PLACE Feature Analysis
  • Place of articulation is a dominant feature in
    nucleus position only
  • Drives the feature deviation in the nucleus for
    manner and rounding

Phonetic deviation across SEVERAL features
Place carries manner and rounding in the nucleus
99
Articulatory MANNER Feature Analysis
  • Manner of articulation is a dominant feature in
    onset and coda position
  • Drives the feature deviation in onsets and codas
    for place and voicing

Phonetic deviation across SEVERAL features
Manner drives place and voicing deviations in
the onset and coda
Manner is less stable in the coda than in the
onset
100
Articulatory VOICING Feature Analysis
  • Voicing is a subordinate feature in all syllable
    positions
  • Its deviation pattern is controlled by manner in
    onset and coda positions

Phonetic deviation across SEVERAL features
Voicing is unstable in coda position and is
dominated by manner
101
The Intimate Relation Between Stress Accent
and Vocalic Identity (especially height)
102
What is (usually) Meant by Prosodic Stress?
  • Prosody is supposed to pertain to extra-phonetic
    cues in the acoustic signal
  • The pattern of variation over a sequence of SYLLABLES, pertaining to syllabic DURATION, AMPLITUDE and PITCH (f0) variation over time (but the plot thickens, as we shall see)

103
OGI Stories - Pitch Doesn't Cut the Mustard
  • Although pitch range is the most important of the f0-related cues, it is not as good a predictor of stress as DURATION

104
Total Energy is the Best Predictor of Stress
  • Duration x Amplitude is superior to all other combination pairs of acoustic parameters; pitch appears largely redundant with duration (a toy version of this predictor is sketched below)
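A toy version of the duration-x-amplitude ("integrated energy") predictor; the threshold and feature scaling are illustrative assumptions, not the study's actual classifier:

```python
# Predict full stress from integrated energy ~ duration * amplitude.
from dataclasses import dataclass

@dataclass
class Syllable:
    duration_ms: float   # nucleus duration
    amplitude: float     # mean amplitude, normalized to 0-1

def predict_stressed(syl: Syllable, threshold: float = 60.0) -> bool:
    # Larger duration-x-amplitude products predict full stress better
    # than f0-based cues on the OGI Stories data (per the slide above).
    return syl.duration_ms * syl.amplitude > threshold

print(predict_stressed(Syllable(220, 0.8)))   # long and loud  -> True
print(predict_stressed(Syllable(70, 0.4)))    # short and soft -> False
```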

105
The Nitty Gritty (a.k.a. the Corpus Material)
  • SWITCHBOARD PHONETIC TRANSCRIPTION CORPUS
  • Switchboard contains informal telephone dialogues
  • 54 minutes of material that had previously been phonetically transcribed (by highly trained phonetics students from UC-Berkeley)
  • 45.5 minutes of pure speech (filled pauses and junctures filtered out), consisting of
  • 9,991 words, 13,446 syllables, 33,370 phonetic segments
  • All of this material had been hand-segmented at either the phonetic-segment or syllabic level by the transcribers
  • The syllabically segmented material was subsequently segmented at the phonetic-segment level by a special-purpose neural network trained on 72 minutes of hand-segmented Switchboard material. This automatic segmentation was manually verified
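A quick consistency check on these figures: 13,446 syllables in 45.5 minutes (2,730 s) works out to

$$\frac{13{,}446\ \text{syllables}}{2{,}730\ \mathrm{s}} \approx 4.9\ \text{syllables/s} \approx 200\ \mathrm{ms}\ \text{per syllable},$$

in line with the mean syllable duration cited earlier in the deck.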

106
Manual Transcription of Stress Accent
  • 2 UC-Berkeley Linguistics students each transcribed the full 45 minutes of material (i.e., there is 100% overlap between the two)
  • Three levels of stress accent were marked for each syllabic nucleus:
  • Fully stressed (78% concordance between transcribers)
  • Completely unstressed (85% interlabeler agreement)
  • An intermediate level of accent (neither fully stressed nor completely unstressed; ca. 60% concordance)
  • Hence, 95% concordance in terms of some level of stress
  • The labels of the two transcribers were averaged
  • In those instances where there was disagreement, the magnitude of disparity was almost always (ca. 90%) one step. Usually, disagreement signaled a genuine ambiguity in stress accent
  • The illustrations in this presentation are based solely on those data on which both transcribers concurred (i.e., fully stressed or completely unstressed)

107
A Brief Primer on Vocalic Acoustics
  • Vowel quality is generally thought to be a
    function primarily of two articulatory
    properties - both related to the motion of the
    tongue
  • The front-back plane is most closely associated
    with the second formant frequency (or more
    precisely F2 - F1) and the volume of the
    front-cavity resonance
  • The height parameter is closely linked to the
    frequency of F1
  • In the classic vowel triangle, segments are positioned in terms of the tongue positions associated with their production, as follows

108
Durational Differences - Stressed/Unstressed
  • There is a large dynamic range in duration
    between stressed and unstressed nuclei
  • Diphthongs and tense, low monophthongs tend to
    have a larger range than the lax monophthongs

109
Spatial Patterning of Duration and Amplitude
  • Let's return to the vowel triangle and see if it can shed light on certain patterns in the vocalic data
  • The duration, amplitude (and their product, integrated energy) will be plotted on a 2-D grid, where the x-axis will always be in terms of hypothetical front-back tongue position (and hence remains constant throughout the plots to follow)
  • The y-axis will serve as the dependent measure, expressed in terms of duration, amplitude, or their product

110
Duration - Monophthongs vs. Diphthongs
All nuclei
111
Duration - Monophthongs vs. Diphthongs
[Figure: duration of monophthongs vs. diphthongs, for stressed vs. unstressed nuclei]
112
Proportion of Stress Accent and Vowel Height
113
Take Home Messages
  • The vowel system of English (and perhaps other languages as well) needs to be re-thought in light of the intimate relationship between vocalic identity, nucleus duration and stress accent
  • Stressed syllables tend to have significantly longer nuclei than their unstressed counterparts, consistent with the findings reported by Silipo and Greenberg in previous years' meetings regarding the OGI Stories corpus (telephone monologues)
  • Certain vocalic classes exhibit a far greater dynamic range in duration than others
  • Diphthongs tend to be longer than monophthongs, BUT ...
  • The low monophthongs (ae, aa, ay, aw, ao) exhibit patterns of duration and dynamic range under stress (accent) similar to diphthongs
  • The statistical patterns are consistent with the hypothesis that duration serves under many conditions as either a primary or secondary cue for vowel height (normally associated with the frequency of the first formant)

114
Take Home Messages
  • Moreover, the stress-accent system in spontaneous
    (American) English appears to be closely
    associated with vocalic identity
  • Low vowels are far more likely to be fully
    stressed than high vowels (with the mid vowels
    exhibiting an intermediate probability of being
    stressed)
  • Thus, the identity of a vowel cannot be considered independently of stress accent
  • The two parameters are likely to be flip sides of the same Koine
  • Although English is not generally considered to be a vowel-quantity language (as Finnish is), given the close relationship between stress accent and duration, and between duration and vowel quality, there is some sense in which English (and perhaps other stress-accent languages) manifests certain properties of a quantity system

115
Automatic Methods for Articulatory Feature
Extraction and Phonetic Transcription
116
Manner-Specific Place Classification
117
Manner Feature Classification/Segmentation
  • Automatic methods (neural networks) can
    accurately label MANNER of articulation
    features for spontaneous material (Switchboard
    corpus)
  • Implication: MANNER information may be relatively co-terminous with phonetic segments and evade co-articulation effects

118
Label Accuracy per Frame
  • Central frames are labeled more accurately than
    those close to the segmental boundaries
  • Implication: some frames are created more equal than others

OGI Numbers Corpus
Frame step interval: 10 ms
119
MANNER Classification - Elitist Approach
  • Confident (usually central) frames are classified more accurately (a minimal selection sketch follows below)

NTIMIT (telephone) Corpus
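A minimal sketch of frame selection in the spirit of the elitist approach, assuming per-frame class posteriors from the manner network; the 0.7 confidence threshold is an illustrative choice:

```python
# Keep only frames whose winning-class posterior exceeds a confidence
# threshold; these are typically the central frames of a segment.
import numpy as np

def elitist_select(posteriors: np.ndarray, threshold: float = 0.7):
    """posteriors: (n_frames, n_classes) array of per-frame probabilities.
    Returns (indices of kept frames, winning class of each kept frame)."""
    winners = posteriors.argmax(axis=1)
    confidence = posteriors.max(axis=1)
    keep = np.flatnonzero(confidence >= threshold)
    return keep, winners[keep]

posteriors = np.array([[0.90, 0.05, 0.05],   # confident, central frame
                       [0.40, 0.35, 0.25],   # ambiguous frame near a boundary
                       [0.10, 0.80, 0.10]])
kept, labels = elitist_select(posteriors)
print(kept, labels)   # -> [0 2] [0 1]
```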
120
Manner-Specific Place Classification
  • Knowing the manner improves place
    classification for consonants

NTIMIT (telephone) Corpus
121
Manner-Specific Place Classification
  • Knowing the manner improves place
    classification for vowels as well

NTIMIT (telephone) Corpus
122
Manner-Specific Place Classification - Dutch
  • Knowing the manner improves place
    classification for consonants and vowels in
    DUTCH as well as in English

VIOS (telephone) Corpus
123
Manner-Specific Place Classification - Dutch
  • Knowing the manner improves place
    classification for the approximant segments
    in DUTCH
  • Approximants are classified as vocalic rather
    than as consonantal

VIOS (telephone) Corpus
124
Take Home Messages
  • Automatic recognition systems can be used to test
    specific hypotheses about the acoustic
    properties of articulatory features, segments and
    syllables
  • Manner information appears to be well classified
    and segmented
  • Suggests that manner features may be the key
    articulatory feature dimension for segmentation
    within the syllable
  • Place information is not as well classified as
    manner information
  • Improvement of place with manner-specific
    classification suggests that place recognition
    does depend to a certain degree on manner
    classification
  • Voicing information appears to be relatively
    robust under many conditions and therefore is
    likely to emanate from a variety of spectral
    regions
  • The time constant for voicing information is also
    likely to be less than or coterminous with the
    segment

125
Sample Transcription from the ALPS System
  • The ALPS (automatic labeling of phonetic
    segments) system performs very similarly to
    manual transcription in terms of both labels and
    segmentation
  • 11-ms average concordance in segmentation
  • 83% concordance with respect to phonetic labels

OGI Numbers (telephone) corpus
126
ALPS Output Can Be Superior to Alignments
Switchboard (telephone) Corpus
127
Grand Summary and Conclusions
  • The controlling parameters for understanding
    spoken language appear to be based on
    low-frequency modulation patterns in the acoustic
    signal associated with the syllable
  • Both the magnitude and phase of the modulation
    patterns are important
  • Encoding information in terms of low-frequency
    modulations provides a certain degree of
    robustness to the speech signal that enables it
    to be decoded under a wide range of acoustic and
    speaking conditions
  • Manner information appears to be the key to
    understanding segmentation internal to the
    syllable
  • Place features appear to be dominant and most
    stable at syllable onset and coda
  • Manner is the stable feature dimension for the
    syllabic nucleus
  • Voicing and rounding appear to be auxiliary
    features linked to manner and place feature
    information
  • Real speech can be useful in delineating underlying patterns of linguistic organization

128
That's All, Folks! Many Thanks for Your Time and Attention