Title: Time Frames
1 Time Frames of Spoken Language
Steven Greenberg
International Computer Science Institute
1947 Center Street, Berkeley, CA 94704
http://www.icsi.berkeley.edu/steveng
steveng_at_icsi.berkeley.edu
In collaboration with Hannah Carvey, Leah Hitchcock and Shawn Chang
2 Acknowledgements and Thanks
Statistical Analysis and Automatic Classification: Hannah Carvey, Shawn Chang, Leah Hitchcock
Research Funding: U.S. National Science Foundation, U.S. Department of Defense
3 For Further Information
Consult the web site: www.icsi.berkeley.edu/steveng
4 OVERTURE: The Central Challenge for Models of Speech Recognition
5 Language - The Traditional Perspective
The classical view of spoken language posits a
quasi-arbitrary relation between the lower and
higher tiers of linguistic organization
Cat /k/ /ae/ /t/
Cat k ae t
7 The Serial Frame Perspective on Speech
Traditional models of speech recognition assume that
the identity of a phonetic segment is derived
from a detailed spectral profile of the acoustic
signal (provided courtesy of the auditory system),
computed for each interval (frame) of speech.
(This is literally how automatic speech
recognition systems decode the speech signal.)
8 Challenge Number One: Pronunciation Variability
11 Pronunciation Variability of Real Speech
Pronunciation patterns encountered in everyday
life are extremely diverse. There are literally
dozens of ways in which common words are
pronounced (as the following two slides
illustrate for the word "and", based on manual
phonetic annotation of a corpus comprising
telephone dialogues).
12 How Many Pronunciations of "and"?
Canonical pronunciation
13 How Many Pronunciations of "and"?
14 Pronunciation Variability of Real Speech
There are literally dozens of ways in which common
words are pronounced, as the following slide
illustrates for the 20 most frequent words from
the same corpus (Switchboard).
15 How Many Different Pronunciations?
The 20 most frequent words account for 35% of the
tokens
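The two quantities on this slide (the share of tokens covered by the most frequent words, and the number of distinct pronunciations per word) can be illustrated with a minimal sketch. It assumes the annotated corpus has been reduced to a list of (word, pronunciation) pairs; the function name `summarize` and the toy data are hypothetical and are not taken from Switchboard.

```python
from collections import Counter, defaultdict

def summarize(tokens, k=20):
    """tokens: list of (word, pronunciation) pairs, one per spoken token.

    Returns the fraction of all tokens covered by the k most frequent
    words, and the number of distinct pronunciations of each of them.
    """
    if not tokens:
        raise ValueError("no tokens")
    word_counts = Counter(word for word, _ in tokens)
    variants = defaultdict(set)
    for word, pron in tokens:
        variants[word].add(pron)
    top = word_counts.most_common(k)
    coverage = sum(n for _, n in top) / len(tokens)
    n_variants = {word: len(variants[word]) for word, _ in top}
    return coverage, n_variants

# Invented toy data for illustration only
tokens = [("and", "ae n d"), ("and", "ae n"), ("and", "n"), ("and", "ae n"),
          ("the", "dh ax"), ("the", "dh iy"), ("cat", "k ae t"), ("i", "ay")]
coverage, n_variants = summarize(tokens, k=2)
print(coverage, n_variants)
```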
16 QUESTION: How do listeners decode the speech
signal given the large amount of pronunciation
variation?
17 Challenge Number Two: Acoustic Variability
20 Effects of Reverberation on the Speech Signal
Reflections from walls and other surfaces
routinely modify the spectro-temporal structure
of the speech signal under everyday conditions.
Yet the intelligibility of speech is
remarkably stable (unless the amount of
reverberation or background noise is truly
extreme). How can this be so?
21 QUESTION: Is there some acoustic property that
provides a basis for perceptual stability of the
speech signal?
25 An Invariant Property of the Speech Signal?
Low-frequency energy fluctuations of the pressure
waveform are largely preserved under many
acoustic-interference conditions. In reverberant
environments the modulation spectrum's peak is
attenuated and shifted down to ca. 2 Hz (but is
largely preserved). ("What is the modulation
spectrum?" you ask.) Let's find out!
Modulation Spectrum
(based on an illustration by Hynek Hermansky)
26 Modulation Spectrum Computation
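The computation itself is not spelled out in the slide text, so what follows is only a minimal sketch of one common way to estimate a modulation spectrum, not necessarily the procedure used for the figures in this talk. It assumes a broadband amplitude envelope (rectify, then average over 10-ms frames) followed by a Fourier transform of that envelope. The function names, the 100-Hz envelope rate and the 20-Hz upper limit are all assumptions; a plain DFT is used to keep the example dependency-free, where a real analysis would normally use an FFT library.

```python
import cmath
import math

def amplitude_envelope(signal, fs, env_fs=100):
    """Full-wave rectify the waveform and average it over consecutive
    frames, giving a smoothed envelope sampled env_fs times per second."""
    hop = int(round(fs / env_fs))
    n_frames = len(signal) // hop
    return [sum(abs(x) for x in signal[i * hop:(i + 1) * hop]) / hop
            for i in range(n_frames)]

def modulation_spectrum(envelope, env_fs, max_mod_hz=20.0):
    """Magnitude of the discrete Fourier transform of the mean-removed,
    Hann-windowed envelope, for modulation frequencies up to max_mod_hz."""
    n = len(envelope)
    mean = sum(envelope) / n
    windowed = [(e - mean) * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, e in enumerate(envelope)]
    freqs, mags = [], []
    k = 0
    while k <= n // 2 and k * env_fs / n <= max_mod_hz:
        coeff = sum(w * cmath.exp(-2j * math.pi * k * i / n)
                    for i, w in enumerate(windowed))
        freqs.append(k * env_fs / n)
        mags.append(abs(coeff))
        k += 1
    return freqs, mags

# Synthetic check (not speech): a 1 kHz tone amplitude-modulated at 4 Hz.
fs = 8000
tone = [(1.0 + 0.8 * math.sin(2 * math.pi * 4.0 * i / fs))
        * math.sin(2 * math.pi * 1000.0 * i / fs) for i in range(4 * fs)]
env_fs = 100
freqs, mags = modulation_spectrum(amplitude_envelope(tone, fs, env_fs), env_fs)
peak_hz = freqs[mags.index(max(mags))]
print(peak_hz)  # modulation frequency with the largest magnitude
```

On this synthetic tone the largest component should fall at the 4 Hz modulation rate. For speech, the same steps would typically be applied per frequency band and the band spectra then combined.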
28 Intelligibility and the Modulation Spectrum
Significant attenuation (or distortion) of the
modulation spectrum results in an appreciable
decline in the ability to understand spoken
language. Why should this be so?
Greenberg and Arai (1998)
34 Anatomy of the Modulation Spectrum
Why is the modulation spectrum's integrity so
crucial for intelligibility?
What does it reflect linguistically?
Why is the bandwidth of the modulation spectrum
associated with (intelligible) speech so broad?
Does the modulation spectrum reflect a unitary
property of the speech signal? Or something more complex?
Modulation spectrum of 40 TIMIT sentences
(computed across a 6-kHz bandwidth)
38 The Modulation Spectrum Reflects Syllables
The peak in the modulation spectrum (for speech)
is ca. 5 Hz (200 ms). The distribution associated
with SYLLABLE DURATION is similar to the pattern
of the MODULATION SPECTRUM, suggesting that the
latter reflects SYLLABLES.
Syllable duration distribution associated with a
30-minute subset of Switchboard
Syllable duration (in terms of equivalent
modulation frequency)
Modulation spectrum of a short excerpt from the
Switchboard Corpus
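Comparing syllable duration with the modulation spectrum relies on expressing a duration as its "equivalent modulation frequency", i.e. the reciprocal of the duration, which is how 200 ms corresponds to 5 Hz. A minimal sketch of that conversion follows; the function name and the example durations are illustrative assumptions, not values from the corpus.

```python
def equivalent_modulation_hz(duration_ms):
    """Modulation frequency (Hz) whose period equals the given duration (ms)."""
    if duration_ms <= 0:
        raise ValueError("duration must be positive")
    return 1000.0 / duration_ms

# Illustrative syllable durations in milliseconds (invented values)
durations_ms = [100, 200, 250, 500]
values = [equivalent_modulation_hz(d) for d in durations_ms]
print(values)  # [10.0, 5.0, 4.0, 2.0]
```

A histogram of such values over all syllables in a corpus gives a duration distribution on the same frequency axis as the modulation spectrum.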
43 The Trouble with Syllables
The question thus arises: if the modulation
spectrum truly reflects syllables in the speech
signal, why is the distribution of syllable
duration so broad? And does this variability in
syllable duration reflect something significant?
Syllable duration (modulation frequency)
Modulation Spectrum
Modulation spectrum of 15 minutes of spontaneous
Japanese speech (OGI-TS corpus) compared with the
syllable duration distribution for the same
material (Arai and Greenberg, 1997)
44 PART ONE: What Underlies Variation in Word Duration?
46 Word Duration
Most words (81%) in the Switchboard corpus are
monosyllabic, and most of the remainder are
disyllabic (together comprising 95% of the
words). The distribution of word duration
therefore largely parallels that of
syllables (plotted in units of duration, ms, on a
logarithmic scale).
All Words
51 What Underlies Word Duration Variability?
Is this distribution of lexical duration of a
uniform nature (and source)? Or does it reflect a
more complex set of phenomena? It has been
observed for WRITTEN text that the more frequent
words tend to be shorter and the less common
words longer (i.e., Zipf's law). Does such a
relationship hold for spoken language? Let's find
out!
56 Is Word Duration Related to Word Frequency?
Word duration (derived from the phonetically
annotated portion of the Switchboard corpus) can
be plotted relative to frequency of
occurrence. Such an exercise shows that there is a
WEAK relationship (r = 0.42) between lexical
(unigram) frequency and word duration. There is a
lot of variability in word duration for any
given frequency range, suggesting that lexical
frequency, alone, is unlikely to account for
variation in word duration.
Words with fewer than 5 instances omitted from
graph
r = 0.42
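A correlation of this kind can be sketched as a Pearson coefficient between word frequency and mean word duration, after dropping words with fewer than 5 instances as on the slide. The slide text does not say how the axes were scaled, so the use of log10 frequency here is an assumption, as are the function names and the toy numbers (which are invented and will not reproduce the reported value or necessarily its sign).

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def frequency_duration_r(word_stats, min_count=5):
    """word_stats maps word -> (token count, mean duration in ms).

    Words with fewer than min_count tokens are omitted, and frequency
    is log-transformed (an assumption) before correlating.
    """
    kept = [(math.log10(count), dur)
            for count, dur in word_stats.values() if count >= min_count]
    xs, ys = zip(*kept)
    return pearson_r(xs, ys)

# Invented toy statistics for illustration only
word_stats = {
    "the": (1000, 120.0), "and": (800, 150.0), "think": (100, 260.0),
    "really": (50, 300.0), "yesterday": (10, 520.0), "xylophone": (2, 700.0),
}
r = frequency_duration_r(word_stats)
print(round(r, 2))
```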
65 If Not (entirely) Word Frequency, Then What?
One parameter that might be more directly related
to word duration (and other durational properties
of speech) is STRESS ACCENT.
Stress accent is related to the emphasis (or prominence)
associated with individual syllables within a word.
Although dictionaries list the stress
patterns associated with words, this information
is but a rough guide to the actual patterns
observed (as is the phonetic pronunciation
provided in the dictionary).
In order to obtain empirical data pertaining to
stress accent, it is necessary to manually
annotate a corpus (syllable by syllable).
This manual annotation has been
performed for a 45-minute subset of the
Switchboard corpus, which has also been labeled
with respect to phonetic segments, syllables and
words.
It is thus possible to ascertain the
relationship between stress accent and duration
at the level of the word, syllable and phonetic
segment.
The remainder of this presentation
focuses on the statistical relationship between
stress accent and duration at these different
linguistic tiers.
Before examining these data,
let's briefly consider the nature of the
annotated material (this is important for
evaluating the reliability of the results
obtained).
66 INTERMEZZO: Being Phonetically (and Prosodically) Annotated
76 Phonetic Transcription of Spontaneous English
Telephone dialogues of 5-10 minutes' duration,
from the SWITCHBOARD corpus, have been
phonetically annotated (labeled and segmented).
Most of this material has been manually annotated:
- 4 hours labeled at the phone level and segmented at the syllabic level
- 1 hour labeled and segmented at the phonetic-segment level
- The remaining material has been segmented at the phonetic-segment level using automatic methods
- 45 minutes of stress-accent-labeled material
- An additional four hours of material automatically labeled with respect to accent (this latter material is not used in the current analysis, but will be available soon)
There is a lot of diversity in the material transcribed:
- Spans speech of both genders (ca. 50/50), reflecting a wide range of American dialectal variation, speaking rate and voice quality
Transcription system: a variant of Arpabet
(which was also used for transcription of the
TIMIT corpus)
78 Phonetic Transcription of Spontaneous English
The data are available at http://www.icsi.berkeley.edu/real/stp
87 Phonetic Transcription
How was the labeling and segmentation performed?
VERY carefully ... by UC-Berkeley linguistics students
Using a display of the signal waveform, spectrogram,
word transcription and forced alignments (automatic
estimates of phones and boundaries), plus audio
(listening at multiple time scales - phone, word,
utterance), on Sun workstations
Additionally, automatic
segmentation and labeling of articulatory manner
was used as a guide for phonetic labeling and
segmentation in recent work
97 Annotation of Stress Accent
- Forty-five minutes of the phonetically annotated
portion of the Switchboard corpus was manually
labeled with respect to stress accent
- Three levels of accent were distinguished
- Heavy, Light, None
- (In actuality, labelers assigned a 1 to fully
accented syllables, a null to completely
unaccented syllables, and a 0.5 to all others)
- An example of the annotation (attached to the
vocalic nucleus) is shown below (where the accent
levels could not be derived from a dictionary)
- In this example most of the syllables are
unaccented, with two labeled as lightly accented
(0.5) and one other labeled as very lightly
accented (0.25)
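The relation between the labelers' numeric scores and the three named accent levels can be written as a small mapping. This is only a sketch under the assumption that any score strictly between 0 and 1 (including the 0.25 in the example) counts as "light". The function name and the placeholder syllable labels are hypothetical.

```python
def accent_level(score):
    """Map a labeler's score to one of the three named accent levels.

    None (a null label) or 0 -> "none"; 1 -> "heavy";
    anything in between (e.g. 0.5 or 0.25) -> "light".
    """
    if score is None or score == 0:
        return "none"
    if not 0 < score <= 1:
        raise ValueError(f"score out of range: {score}")
    return "heavy" if score == 1 else "light"

# Placeholder syllables with invented scores, attached to each vocalic nucleus
annotation = [("syl1", None), ("syl2", 0.5), ("syl3", None), ("syl4", 0.25)]
levels = [accent_level(score) for _, score in annotation]
print(levels)  # ['none', 'light', 'none', 'light']
```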
98 PART TWO: The Relation between Stress Accent and Word Duration
99Back to Stress Accent and Word Duration
Stress accent is supposed to bear some systematic
relation to three principal acoustic parameters
of the speech signal
100Back to Stress Accent and Word Duration
Stress accent is supposed to bear some systematic
relation to three principal acoustic parameters
of the speech signal Fundamental Frequency
101Back to Stress Accent and Word Duration
Stress accent is supposed to bear some systematic
relation to three principal acoustic parameters
of the speech signal Fundamental
Frequency Amplitude
102Back to Stress Accent and Word Duration
Stress accent is supposed to bear some systematic
relation to three principal acoustic parameters
of the speech signal: Fundamental
Frequency, Amplitude, Duration
104Back to Stress Accent and Word Duration
Stress accent is supposed to bear some systematic
relation to three principal acoustic parameters
of the speech signal: Fundamental
Frequency, Amplitude, Duration. In previous
studies my colleagues and I have shown that
f0-related cues play a relatively small role in
stress accent assignment (at least for
spontaneous American English material)
105Back to Stress Accent and Word Duration
Stress accent is supposed to bear some systematic
relation to three principal acoustic parameters
of the speech signal: Fundamental
Frequency, Amplitude, Duration. In previous
studies my colleagues and I have shown that
f0-related cues play a relatively small role in
stress accent assignment (at least for
spontaneous American English material) Amplitude
and duration appear to play a far more important
role than f0
106Back to Stress Accent and Word Duration
Stress accent is supposed to bear some systematic
relation to three principal acoustic parameters
of the speech signal: Fundamental
Frequency, Amplitude, Duration. In previous
studies my colleagues and I have shown that
f0-related cues play a relatively small role in
stress accent assignment (at least for
spontaneous American English material) Amplitude
and duration appear to play a far more important
role than f0 Therefore, it is not unreasonable
to assume that the stress accent patterns
associated with words bear some tangible relation
to lexical duration
107Back to Stress Accent and Word Duration
Stress accent is supposed to bear some systematic
relation to three principal acoustic parameters
of the speech signal: Fundamental
Frequency, Amplitude, Duration. In previous
studies my colleagues and I have shown that
f0-related cues play a relatively small role in
stress accent assignment (at least for
spontaneous American English material) Amplitude
and duration appear to play a far more important
role than f0 Therefore, it is not unreasonable
to assume that the stress accent patterns
associated with words bear some tangible relation
to lexical duration So
108Back to Stress Accent and Word Duration
Stress accent is supposed to bear some systematic
relation to three principal acoustic parameters
of the speech signal: Fundamental
Frequency, Amplitude, Duration. In previous
studies my colleagues and I have shown that
f0-related cues play a relatively small role in
stress accent assignment (at least for
spontaneous American English material) Amplitude
and duration appear to play a far more important
role than f0 Therefore, it is not unreasonable
to assume that the stress accent patterns
associated with words bear some tangible relation
to lexical duration. So, let's find out!
109Word Duration and Stress Accent Level
Let's first examine the durational properties of
heavily accented words
110Word Duration and Stress Accent Level
Let's first examine the durational properties of
heavily accented words (these are words
containing at least one heavily accented
syllable)
111Word Duration and Stress Accent Level
Let's first examine the durational properties of
heavily accented words (these are words
containing at least one heavily accented
syllable). The mean duration of this subset (36%)
is 378 ms (s.d. 168 ms)
Heavily Accented
112Word Duration and Stress Accent Level
Let's first examine the durational properties of
heavily accented words (these are words
containing at least one heavily accented
syllable). The mean duration of this subset (36%)
is 378 ms (s.d. 168 ms). Most of the heavily
accented words are longer than 200 ms
Heavily Accented
113Word Duration and Stress Accent Level
Let's now compare the duration of the heavily
accented words with that of their lightly
accented counterparts (25% of the total)
Heavily Accented
114Word Duration and Stress Accent Level
Let's now compare the duration of the heavily
accented words with that of their lightly
accented counterparts (25% of the total). The mean
duration of this subset is 255 ms (s.d. 116 ms)
Heavily Accented
Lightly Accented
115Word Duration and Stress Accent Level
Let's now compare the duration of the heavily
accented words with that of their lightly
accented counterparts (25% of the total). The mean
duration of this subset is 255 ms (s.d. 116
ms). In many respects the durational properties of
these two subsets are similar
Heavily Accented
Lightly Accented
116Word Duration and Stress Accent Level
Let's now compare the duration of unaccented
words with that of their accented counterparts
Heavily Accented
Lightly Accented
117Word Duration and Stress Accent Level
Let's now compare the duration of unaccented
words with that of their accented
counterparts. The mean duration of the unaccented
subset (39%) is 149 ms (s.d. 78 ms)
Unaccented
Heavily Accented
Lightly Accented
118Word Duration and Stress Accent Level
Let's now compare the duration of unaccented
words with that of their accented
counterparts. The mean duration of the unaccented
subset (39%) is 149 ms (s.d. 78 ms). The
unaccented words are generally shorter than 200
ms
Unaccented
Heavily Accented
Lightly Accented
119Word Duration and Stress Accent Level
Let's now compare the duration of unaccented
words with that of their accented
counterparts. The mean duration of the unaccented
subset (39%) is 149 ms (s.d. 78 ms). The
unaccented words are generally shorter than 200
ms and form a very different distribution from
that of their accented counterparts
Unaccented
Heavily Accented
Lightly Accented
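The partition of words by accent level can be sketched as follows. The rule for heavily accented words (at least one heavily accented syllable) comes from the slides; extending it to "light" by taking the highest level present in the word is an assumption, as are the function names and the word list, whose durations are made up purely for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev

def word_accent(syllable_levels):
    """A word takes the highest accent level found among its syllables."""
    if "heavy" in syllable_levels:
        return "heavy"
    if "light" in syllable_levels:   # assumed analogue of the "heavy" rule
        return "light"
    return "none"

# (duration in ms, accent level of each syllable) -- invented values
words = [
    (410, ["heavy", "none"]),
    (350, ["heavy"]),
    (260, ["light"]),
    (240, ["none", "light"]),
    (120, ["none"]),
    (160, ["none"]),
]

def summarize(words):
    """Return {accent level: (mean duration, sample s.d.)} in ms."""
    groups = defaultdict(list)
    for duration_ms, syllables in words:
        groups[word_accent(syllables)].append(duration_ms)
    return {level: (mean(d), stdev(d)) for level, d in groups.items()}

summary = summarize(words)
for level, (m, s) in summary.items():
    print(f"{level}: mean {m:.0f} ms, s.d. {s:.0f} ms")
```

Applied to real annotations, a summary of this kind is what yields per-subset means and standard deviations like those quoted on the slides.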
120Word Duration and Stress Accent Level
Let's now compare the durational properties of
ALL WORDS in the corpus with those pertaining to
words of varying accent levels
Unaccented
Heavily Accented
Lightly Accented
121Word Duration and Stress Accent Level
Let's now compare the durational properties of
ALL WORDS in the corpus with those pertaining to
words of varying accent levels When we do so,
All Words
Unaccented
Heavily Accented
Lightly Accented
122Word Duration and Stress Accent Level
Let's now compare the durational properties of
ALL WORDS in the corpus with those pertaining to
words of varying accent levels When we do so, we
notice that the left-hand branch of the lexical
distribution largely reflects unaccented words,
All Words
Unaccented
Heavily Accented
Lightly Accented
123Word Duration and Stress Accent Level
Let's now compare the durational properties of
ALL WORDS in the corpus with those pertaining to
words of varying accent levels When we do so, we
notice that the left-hand branch of the lexical
distribution largely reflects unaccented words,
while the right-hand branch reflects mostly
accented words (with the peak reflecting both)
All Words
Unaccented
Heavily Accented
Lightly Accented
124Word Duration and Stress Accent Level
Therefore, it appears that the broad distribution
of word duration (and, in turn, syllable
duration) largely reflects the co-existence of
accented and unaccented words within spontaneous
speech
All Words
Unaccented
Heavily Accented
Lightly Accented
125Word Duration and Stress Accent Level
Therefore, it appears that the broad distribution
of word duration (and, in turn, syllable
duration) largely reflects the co-existence of
accented and unaccented words within spontaneous
speech What are the implications of this insight?
All Words
Unaccented
Heavily Accented
Lightly Accented
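A quick arithmetic check shows how the three subsets combine into one broad distribution. The sketch below treats the overall word-duration distribution as a mixture of the heavy, light and unaccented subsets, reading the bracketed 36, 25 and 39 on the earlier slides as percentages of the total (an assumption) and treating the reported s.d. values as population values (also an assumption).

```python
from math import sqrt

# (share of words, mean in ms, s.d. in ms) as quoted on the slides
subsets = {
    "heavy": (0.36, 378.0, 168.0),
    "light": (0.25, 255.0, 116.0),
    "none":  (0.39, 149.0, 78.0),
}

def mixture_mean_sd(parts):
    """Mean and s.d. of a mixture given (weight, mean, sd) components."""
    parts = list(parts)
    total = sum(w for w, _, _ in parts)
    mu = sum(w * m for w, m, _ in parts) / total
    # E[X^2] of the mixture is the weighted sum of (sd^2 + mean^2)
    second_moment = sum(w * (s * s + m * m) for w, m, s in parts) / total
    return mu, sqrt(second_moment - mu * mu)

mu, sd = mixture_mean_sd(subsets.values())
print(f"all words: mean {mu:.0f} ms, s.d. {sd:.0f} ms")
```

Under these assumptions the pooled mean comes out at about 258 ms with a standard deviation of about 160 ms. The pooled spread is much wider than that of the unaccented subset alone (78 ms), which is the sense in which the breadth of the overall distribution reflects the co-existence of accented and unaccented words.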
126Breadth of the Modulation Spectrum
The broad bandwidth of the modulation spectrum,
therefore, appears to reflect the heterogeneity
in syllabic and lexical duration associated with
variation in stress accent level
127Breadth of the Modulation Spectrum
The broad bandwidth of the modulation spectrum,
therefore, appears to reflect the heterogeneity
in syllabic and lexical duration associated with
variation in stress accent level
All Accents (Convergence)
Unaccented
Heavily Accented
Modulation spectrum of 40 TIMIT sentences
(computed across a 6-kHz bandwidth)
128Breadth of the Modulation Spectrum
The broad bandwidth of the modulation spectrum,
therefore, appears to reflect the heterogeneity
in syllabic and lexical duration associated with
variation in stress accent level Does this
insight have implications for the lower tiers of
spoken language?
All Accents (Convergence)
Unaccented
Heavily Accented
Modulation spectrum of 40 TIMIT sentences
(computed across a 6-kHz bandwidth)
129Breadth of the Modulation Spectrum
The broad bandwidth of the modulation spectrum,
therefore, appears to reflect the heterogeneity
in syllabic and lexical duration associated with
variation in stress accent level Does this
insight have implications for the lower tiers of
spoken language? (e.g., the phonetic and
phonological levels)
All Accents (Convergence)
Unaccented
Heavily Accented
Modulation spectrum of 40 TIMIT sentences
(computed across a 6-kHz bandwidth)
130Breadth of the Modulation Spectrum
The broad bandwidth of the modulation spectrum,
therefore, appears to reflect the heterogeneity
in syllabic and lexical duration associated with
variation in stress accent level Does this
insight have implications for the lower tiers of
spoken language? (e.g., the phonetic and
phonological levels) Let's find out!
All Accents (Convergence)
Unaccented
Heavily Accented
Modulation spectrum of 40 TIMIT sentences
(computed across a 6-kHz bandwidth)
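As a rough illustration of why heterogeneous durations imply a broad modulation spectrum, one can convert a duration into the modulation frequency of a unit that repeats at that period. This reciprocal rule of thumb is an assumption made for the sketch, not the procedure used to compute the spectra shown here, and the function name is hypothetical. The durations are the mean word durations quoted earlier for the three accent levels.

```python
def modulation_hz(duration_ms: float) -> float:
    """Modulation frequency (Hz) of a unit recurring every duration_ms."""
    return 1000.0 / duration_ms

mean_word_ms = {"heavy": 378.0, "light": 255.0, "none": 149.0}
for level, duration in mean_word_ms.items():
    print(f"{level}: {duration:.0f} ms -> {modulation_hz(duration):.1f} Hz")
```

On this crude reading, the longer accented units map to lower modulation frequencies and the shorter unaccented units to higher ones, so a mixture of the two spans a wide band.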
131 INTERMEZZO Anatomy of the Syllable
132The Importance of the Syllable
The analyses to follow are all linked, in some
fashion, to syllable structure
133The Importance of the Syllable
The analyses to follow are all linked, in some
fashion, to syllable structure In order to
highlight patterns germane to variation in
segmental duration it is necessary to partition
the data in terms of syllable position
134The Importance of the Syllable
The analyses to follow are all linked, in some
fashion, to syllable structure In order to
highlight patterns germane to variation in
segmental duration it is necessary to partition
the data in terms of syllable position (as
well as stress accent level)
135The Importance of the Syllable
The analyses to follow are all linked, in some
fashion, to syllable structure In order to
highlight patterns germane to variation in
segmental duration it is necessary to partition
the data in terms of syllable position (as
well as stress accent level) As a consequence, we
will examine the onsets, codas and nuclei of
syllables separately in order to gain insight
into the underlying patterns
136The Importance of the Syllable
The analyses to follow are all linked, in some
fashion, to syllable structure In order to
highlight patterns germane to variation in
segmental duration it is necessary to partition
the data in terms of syllable position (as
well as stress accent level) As a consequence, we
will examine the onsets, codas and nuclei of
syllables separately in order to gain insight
into the underlying patterns What is an onset?
137The Importance of the Syllable
The analyses to follow are all linked, in some
fashion, to syllable structure In order to
highlight patterns germane to variation in
segmental duration it is necessary to partition
the data in terms of syllable position (as
well as stress accent level) As a consequence, we
will examine the onsets, codas and nuclei of
syllables separately in order to gain insight
into the underlying patterns. What is an onset?
What is a nucleus?
138The Importance of the Syllable
The analyses to follow are all linked, in some
fashion, to syllable structure In order to
highlight patterns germane to variation in
segmental duration it is necessary to partition
the data in terms of syllable position (as
well as stress accent level) As a consequence, we
will examine the onsets, codas and nuclei of
syllables separately in order to gain insight
into the underlying patterns. What is an onset?
What is a nucleus? What is a coda?
139The Importance of the Syllable
The analyses to follow are all linked, in some
fashion, to syllable structure In order to
highlight patterns germane to variation in
segmental duration it is necessary to partition
the data in terms of syllable position (as
well as stress accent level) As a consequence, we
will examine the onsets, codas and nuclei of
syllables separately in order to gain insight
into the underlying patterns. What is an onset?
What is a nucleus? What is a coda? The following
slides provide a brief (and gentle) introduction
to syllable structure
140Syllable and Phonetic Segment Illustrated
Syllables generally consist of three constituents
- ONSET, NUCLEUS, CODA
J = JUNCTURE
141Syllable and Phonetic Segment Illustrated
Syllables generally consist of three constituents
- ONSET, NUCLEUS, CODA Virtually all syllables
contain a NUCLEUS, which is VOCALIC (by
definition)
J = JUNCTURE
142Syllable and Phonetic Segment Illustrated
Syllables generally consist of three constituents
- ONSET, NUCLEUS, CODA Virtually all syllables
contain a NUCLEUS, which is VOCALIC (by
definition) Most (but not all) syllables also
contain an ONSET (usually a CONSONANT)
J = JUNCTURE
143Syllable and Phonetic Segment Illustrated
Syllables generally consist of three constituents
- ONSET, NUCLEUS, CODA Virtually all syllables
contain a NUCLEUS, which is VOCALIC (by
definition) Most (but not all) syllables also
contain an ONSET (usually a CONSONANT) Many
syllables contain a CODA (also typically a
CONSONANT)
J = JUNCTURE
144Syllable and Phonetic Segment Illustrated
Syllables generally consist of three constituents
- ONSET, NUCLEUS, CODA Virtually all syllables
contain a NUCLEUS, which is VOCALIC (by
definition) Most (but not all) syllables also
contain an ONSET (usually a CONSONANT) Many
syllables contain a CODA (also typically a
CONSONANT). The most common syllable form in
English is Onset + Nucleus + Coda ("Nine")
J = JUNCTURE
145Syllable and Phonetic Segment Illustrated
Syllables generally consist of three constituents
- ONSET, NUCLEUS, CODA Virtually all syllables
contain a NUCLEUS, which is VOCALIC (by
definition) Most (but not all) syllables also
contain an ONSET (usually a CONSONANT) Many
syllables contain a CODA (also typically a
CONSONANT). The most common syllable form in
English is Onset + Nucleus + Coda
("Nine"), followed in popularity by Onset +
Nucleus ("Two")
J = JUNCTURE
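The three constituents can be made concrete with a small sketch that splits a syllable's segments around its vocalic nucleus. The ARPABET-style phone symbols, the vowel inventory and the function name are assumptions chosen for illustration; syllables lacking a vowel are simply rejected here.

```python
# Assumed ARPABET-style vowel symbols (diphthongs count as one segment)
VOWELS = {"aa", "ae", "ah", "ao", "aw", "ay", "eh", "er",
          "ey", "ih", "iy", "ow", "oy", "uh", "uw"}

def split_syllable(phones):
    """Return (onset, nucleus, coda) for one syllable's phone list."""
    vowel_positions = [i for i, p in enumerate(phones) if p in VOWELS]
    if not vowel_positions:
        raise ValueError("no vocalic nucleus found")
    first, last = vowel_positions[0], vowel_positions[-1]
    onset = phones[:first]            # consonants before the nucleus
    nucleus = phones[first:last + 1]  # the vocalic portion
    coda = phones[last + 1:]          # consonants after the nucleus
    return onset, nucleus, coda

print(split_syllable(["n", "ay", "n"]))  # "nine": onset, nucleus, coda
print(split_syllable(["t", "uw"]))       # "two": onset and nucleus only
```

For "nine" all three constituents are filled, while "two" has an empty coda, matching the two most common syllable forms named above.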
146 PART THREE Stress Accent and Syllable Position
147The Importance of Syllable Structure
Before going into the details of durational
variation at the segmental level we briefly
examine some general patterns of pronunciation
variation that are conditioned by syllable
position and stress accent
148The Importance of Syllable Structure
Before going into the details of durational
variation at the segmental level we briefly
examine some general patterns of pronunciation
variation that are conditioned by syllable
position and stress accent These data serve to
illustrate the sort of variation observed that is
conditioned by position within the syllable
149Pronunciation Variation Syllable and Accent
Pronunciation variation is systematic at the
level of the syllable
Deletions
CODA Territory
Substitutions
Insertions
ONSET Territory
NUCLEUS Territory
150Pronunciation Variation Syllable and Accent
Pronunciation variation is systematic at the
level of the syllable. It's also systematic when
stress accent is taken into account
Deletions
CODA Territory
Substitutions
Insertions
ONSET Territory
NUCLEUS Territory
151Pronunciation Variation Syllable and Accent
Pronunciation variation is systematic at the
level of the syllable. It's also systematic when
stress accent is taken into account. BOTH syllable
structure and accent level are required for a
full accounting
Deletions
CODA Territory
Substitutions
Insertions
ONSET Territory
NUCLEUS Territory
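Counts of deletions, substitutions and insertions come from comparing a word's canonical (dictionary) pronunciation with what was actually spoken. The sketch below does this with Python's general-purpose `difflib.SequenceMatcher`, which is a stand-in for a proper phonetic alignment and not the procedure behind the figures on these slides. The phone symbols and example pronunciations are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher

def edit_counts(canonical, observed):
    """Tally substitutions, deletions and insertions between phone lists."""
    counts = Counter(substitutions=0, deletions=0, insertions=0)
    matcher = SequenceMatcher(None, canonical, observed, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        n_can, n_obs = i2 - i1, j2 - j1
        if tag == "replace":
            # Pair off what we can; any surplus is a deletion or insertion
            counts["substitutions"] += min(n_can, n_obs)
            counts["deletions"] += max(0, n_can - n_obs)
            counts["insertions"] += max(0, n_obs - n_can)
        elif tag == "delete":
            counts["deletions"] += n_can
        elif tag == "insert":
            counts["insertions"] += n_obs
    return counts

# Coda /d/ dropped from a canonical "and"
print(edit_counts(["ae", "n", "d"], ["ae", "n"]))
# Onset consonant replaced in a canonical "that"
print(edit_counts(["dh", "ae", "t"], ["d", "ae", "t"]))
```

To relate such counts to syllable structure, each canonical phone would additionally need to be tagged as onset, nucleus or coda, and the tallies kept separately per position and accent level.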
152 A Coarse Perspective on Pronunciation
Variation (at the level of the syllable and
stress accent)
153Analysis of Durational Properties of Speech
The following analyses are conditioned on stress
accent level and (for the most part) syllable
position
154Analysis of Durational Properties of Speech
The following analyses are conditioned on stress
accent level and (for the most part) syllable
position We will begin with analyses illustrating
the patterns associated with three levels of
stress accent (heavy, light and none) to show the
graded nature of the durational properties
pertaining to syllable and segment duration
155Analysis of Durational Properties of Speech
The following analyses are conditioned on stress
accent level and (for the most part) syllable
position We will begin with analyses illustrating
the patterns associated with three levels of
stress accent (heavy, light and none) to show the
graded nature of the durational properties
pertaining to syllable and segment
duration However, for purposes of illustrative
clarity, many of the slides will show only two
levels of accent (heavy and none) in order to
delineate the differences in duration associated
with stress accent level
156Analysis of Durational Properties of Speech
The following analyses are conditioned on stress
accent level and (for the most part) syllable
position We will begin with analyses illustrating
the patterns associated with three levels of
stress accent (heavy, light and none) to show the
graded nature of the durational properties
pertaining to syllable and segment
duration However, for purposes of illustrative
clarity, many of the slides will show only two
levels of accent (heavy and none) in order to
delineate the differences in duration associated
with stress accent level Under such conditions,
the durational properties associated with light
accent are generally intermediate between heavy
accent and none
157Syllable Duration - Across Syllable Forms
- There is a broad range of syllable structures
observed in spoken English
158Syllable Duration - Across Syllable Forms
- There is a broad range of syllable structures
observed in spoken English
- Together, the V, VC, CV and CVC forms account for
85% of syllables
159Syllable Duration - Across Syllable Forms
- There is a broad range of syllable structures
observed in spoken English
- Together, the V, VC, CV and CVC forms account for
85% of syllables
- The CVCC and CCVC forms account for another 10%
160Syllable Duration - Across Syllable Forms
- There is a broad range of syllable structures
observed in spoken English
- Together, the V, VC, CV and CVC forms account for
85% of syllables
- The CVCC and CCVC forms account for another 10%
- Together, the CV and CVC forms cover ca. 60% of
the syllables
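Coverage figures of this kind come from reducing each syllable to its canonical form (a string of C and V) and tallying how often each form occurs. A minimal sketch, assuming ARPABET-style phone symbols and a tiny invented syllable list; the function names are hypothetical and the resulting shares are not corpus statistics.

```python
from collections import Counter

VOWELS = {"aa", "ae", "ah", "ao", "aw", "ay", "eh", "er",
          "ey", "ih", "iy", "ow", "oy", "uh", "uw"}  # assumed inventory

def syllable_form(phones):
    """Reduce a phone list to its canonical form, e.g. ['n','ay','n'] -> 'CVC'."""
    return "".join("V" if p in VOWELS else "C" for p in phones)

def form_shares(syllables):
    """Proportion of syllables belonging to each canonical form."""
    counts = Counter(syllable_form(s) for s in syllables)
    total = sum(counts.values())
    return {form: n / total for form, n in counts.items()}

# Invented example syllables
syllables = [["n", "ay", "n"], ["t", "uw"], ["ae", "n", "d"], ["ay"], ["dh", "ae", "t"]]
shares = form_shares(syllables)
print(shares)
```

Summing the shares for a chosen set of forms, such as CV and CVC, gives the kind of cumulative coverage quoted above.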
161Syllable Duration - Across Syllable Forms
- It is not surprising that syllable duration is
largely a function of the number of segments
within the syllable (as shown in the graph below)
V = Vowel, C = Consonant
Canonical Syllable Forms
162Syllable Duration - Across Syllable Forms
- It is not surprising that syllable duration is
largely a function of the number of segments
within the syllable (as shown in the graph below)
- Note the systematic lengthening of the syllable
for each form as the accent level increases from
none to light to heavy
V = Vowel, C = Consonant
Canonical Syllable Forms
163Syllable Duration - Across Syllable Forms
- It is not surprising that syllable duration is
largely a function of the number of segments
within the syllable (as shown in the graph below)
- Note the systematic lengthening of the syllable
for each form as the accent level increases from
none to light to heavy
- This pattern is representative of accent's impact
on duration
V = Vowel, C = Consonant
Canonical Syllable Forms
164Syllable Duration - Across Syllable Forms
- It is not surprising that syllable duration is
largely a function of the number of segments
within the syllable (as shown in the graph below)
- Note the systematic lengthening of the syllable
for each form as the accent level increases from
none to light to heavy
- This pattern is representative of accent's impact
on duration (as we'll see)
V = Vowel, C = Consonant
Canonical Syllable Forms
165Syllable Duration - Accent Level/Syllable Form
This graph shows the same data as the previous
slides, but from the perspective of only two
accent levels (heavy and none)
V = Vowel, C = Consonant
Canonical Syllable Forms
166Syllable Duration - Accent Level/Syllable Form
This graph shows the same data as the previous
slides, but from the perspective of only two
accent levels (heavy and none). The heavily
accented syllables are generally 60-100% longer
than their unaccented counterparts
V = Vowel, C = Consonant
Canonical Syllable Forms
167Syllable Duration - Accent Level/Syllable Form
This graph shows the same data as the previous
slides, but from the perspective of only two
accent levels (heavy and none). The heavily
accented syllables are generally 60-100% longer
than their unaccented counterparts. The disparity
in duration is most pronounced for syllable forms
with one or no consonants (i.e., V, VC, CV)
V = Vowel, C = Consonant
Canonical Syllable Forms
168Syllable Duration - Accent Level/Syllable Form
This graph shows the same data as the previous
slides, but from the perspective of only two
accent levels (heavy and none). The heavily
accented syllables are generally 60-100% longer
than their unaccented counterparts. The disparity
in duration is most pronounced for syllable forms
with one or no consonants (i.e., V, VC, CV). This
pattern implies that accent has the greatest
impact on vocalic duration
V = Vowel, C = Consonant
Canonical Syllable Forms
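The comparison between heavily accented and unaccented syllables is a relative lengthening, which can be written out explicitly. The function name and the example durations below are hypothetical and chosen only to show the arithmetic; they are not values read off the graph.

```python
def percent_longer(accented_ms: float, unaccented_ms: float) -> float:
    """Relative lengthening of an accented unit over its unaccented counterpart."""
    return 100.0 * (accented_ms - unaccented_ms) / unaccented_ms

# Invented durations for one syllable form
print(percent_longer(160.0, 100.0))  # 60% longer
print(percent_longer(200.0, 100.0))  # 100% longer, i.e. twice as long
```

On this measure, "100% longer" and "twice as long" describe the same relationship.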
169Nucleus Duration - Accent Level/Syllable Form
The hypothesis delineated on the previous slide
(that accent has the most profound impact on
vocalic duration) is confirmed in the graph below
Canonical Syllable Forms
170Nucleus Duration - Accent Level/Syllable Form
The hypothesis delineated on the previous slide
(that accent has the most profound impact on
vocalic duration) is confirmed in the graph
below. Vowels in accented
syllables (of all forms) are at least twice as
long as their unaccented counterparts
Canonical Syllable Forms
171Nucleus Duration - Accent Level/Syllable Form
The hypothesis delineated on the previous slide
(that accent has the most profound impact on
vocalic duration) is confirmed in the graph
below. Vowels in accented
syllables (of all forms) are at least twice as
long as their unaccented counterparts. This
pattern implies that the syllable nucleus absorbs
a major comp