Time Frames

1
Time Frames of Spoken Language
Steven Greenberg
International Computer Science Institute
1947 Center Street, Berkeley, CA 94704
http://www.icsi.berkeley.edu/steveng
steveng_at_icsi.berkeley.edu
In Collaboration with Hannah Carvey, Leah
Hitchcock and Shawn Chang
2
Acknowledgements and Thanks
Statistical Analysis and Automatic
Classification: Hannah Carvey, Shawn Chang, Leah
Hitchcock
Research Funding: U.S. National Science
Foundation, U.S. Department of Defense
3
For Further Information
Consult the web site: www.icsi.berkeley.edu/steveng
4
OVERTURE: The Central Challenge for Models of
Speech Recognition
5
Language - The Traditional Perspective
The classical view of spoken language posits a
quasi-arbitrary relation between the lower and
higher tiers of linguistic organization
Cat /k/ /ae/ /t/
Cat k ae t
7
The Serial Frame Perspective on Speech
Traditional models of speech recognition assume
the identity of a phonetic segment is derived
from a detailed spectral profile of the acoustic
signal (provided courtesy of the auditory system)
computed for each interval (frame) of speech
(this is literally how automatic speech
recognition systems decode the speech signal)
8
Challenge Number One: Pronunciation Variability
11
Pronunciation Variability of Real Speech
Pronunciation patterns encountered in everyday
life are extremely diverse.
There are literally dozens of ways in which common
words are pronounced (as the following two slides
illustrate for the word "and", based on manual
phonetic annotation of a corpus comprising
telephone dialogues)
12
How Many Pronunciations of "and"?
Canonical pronunciation
13
How Many Pronunciations of "and"?
14
Pronunciation Variability of Real Speech
There are literally dozens of ways in which common
words are pronounced, as the following slide
illustrates for the 20 most frequent words from
the same corpus (Switchboard)
15
How Many Different Pronunciations?
The 20 most frequent words account for 35% of the
tokens
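
The coverage figure on this slide is a simple ratio: tokens belonging to the k most frequent word types, divided by all tokens. The sketch below shows that calculation on a made-up token list. The function name `top_k_coverage` and the toy data are illustrative assumptions, not the Switchboard counts.

```python
from collections import Counter

def top_k_coverage(tokens, k):
    """Fraction of all tokens accounted for by the k most frequent word types."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    counts = Counter(tokens)
    # Sum the token counts of the k most frequent types.
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(tokens)

# Hypothetical toy transcript, NOT Switchboard data.
toy_tokens = "i and the you and i know and the and uh yeah".split()
coverage = top_k_coverage(toy_tokens, 2)
print(f"top-2 types cover {coverage:.0%} of the tokens")
```

On a real corpus the same call with k = 20 would give the kind of figure quoted above.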
16
QUESTION: How do listeners decode the speech
signal given the large amount of pronunciation
variation?
17
Challenge Number Two: Acoustic Variability
20
Effects of Reverberation on the Speech Signal
Reflections from walls and other surfaces
routinely modify the spectro-temporal structure
of the speech signal under everyday conditions.
Yet the intelligibility of speech is remarkably
stable (unless the amount of reverberation or
background noise is truly extreme).
How can this be so?
21
QUESTION: Is there some acoustic property that
provides a basis for perceptual stability of the
speech signal?
25
An Invariant Property of the Speech Signal?
Low-frequency energy fluctuations of the pressure
waveform are largely preserved under many
acoustic-interference conditions.
In reverberant environments the modulation
spectrum's peak is attenuated and shifted down to
ca. 2 Hz (but is largely preserved).
("What is the modulation spectrum?" you ask.)
Let's find out!
Modulation Spectrum
based on an illustration by Hynek Hermansky
26
Modulation Spectrum Computation
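
The figure for this slide did not survive in the transcript, so the author's exact procedure is not recoverable here. As a rough illustration only, one common way to estimate a modulation spectrum is to extract the amplitude envelope of the waveform, remove its mean, and take the spectrum of that slowly varying envelope. The sketch below does this for a single wideband channel using only the standard library. The function name, the 100 Hz envelope rate, the block-averaging smoother and the synthetic test tone are all assumptions made for the example, and published analyses often work per frequency band rather than wideband.

```python
import cmath
import math

def modulation_spectrum(signal, sample_rate, env_rate=100, max_mod_hz=20.0):
    """Return (frequencies_hz, magnitudes) of the amplitude-envelope spectrum."""
    block = sample_rate // env_rate      # samples per envelope point
    n_env = len(signal) // block
    if n_env < 2:
        raise ValueError("signal too short for the chosen envelope rate")
    # 1. Envelope: full-wave rectify, then average over short blocks
    #    (a crude low-pass filter plus downsampling to env_rate).
    envelope = [
        sum(abs(s) for s in signal[i * block:(i + 1) * block]) / block
        for i in range(n_env)
    ]
    # 2. Remove the mean so the 0 Hz component does not dominate.
    mean = sum(envelope) / n_env
    envelope = [e - mean for e in envelope]
    # 3. Direct DFT of the envelope, limited to low modulation frequencies.
    freqs, mags = [], []
    k = 1
    while k * env_rate / n_env <= max_mod_hz and k <= n_env // 2:
        coeff = sum(
            e * cmath.exp(-2j * math.pi * k * n / n_env)
            for n, e in enumerate(envelope)
        )
        freqs.append(k * env_rate / n_env)
        mags.append(abs(coeff) / n_env)
        k += 1
    return freqs, mags

# Synthetic check (not speech): a 1 kHz carrier amplitude-modulated at 4 Hz.
sr = 8000
tone = [
    (1.0 + 0.8 * math.cos(2 * math.pi * 4 * n / sr))
    * math.sin(2 * math.pi * 1000 * n / sr)
    for n in range(2 * sr)
]
freqs, mags = modulation_spectrum(tone, sr)
peak_hz = freqs[mags.index(max(mags))]
print(f"modulation peak near {peak_hz} Hz")
```

For the synthetic tone the peak falls at the 4 Hz modulation rate that was built into it; a production implementation would use an FFT library instead of the direct DFT.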
28
Intelligibility and the Modulation Spectrum
Significant attenuation (or distortion) of the
modulation spectrum results in an appreciable
decline in the ability to understand spoken
language.
Why should this be so?
Greenberg and Arai (1998)
34
Anatomy of the Modulation Spectrum
Why is the modulation spectrum's integrity so
crucial for intelligibility?
What does it reflect linguistically?
Why is the bandwidth of the modulation spectrum
associated with (intelligible) speech so broad?
Does the modulation spectrum reflect a unitary
property of the speech signal?
Or something more complex?
Modulation spectrum of 40 TIMIT sentences
(computed across a 6-kHz bandwidth)
38
The Modulation Spectrum Reflects Syllables
The peak in the modulation spectrum (for speech)
is ca. 5 Hz (200 ms).
The distribution associated with SYLLABLE DURATION
is similar to the pattern of the MODULATION
SPECTRUM, suggesting that the latter reflects
SYLLABLES.
Syllable duration distribution associated with a
30-minute subset of Switchboard
Syllable duration (in terms of equivalent
modulation frequency)
Modulation spectrum of a short excerpt from the
Switchboard Corpus
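
The "equivalent modulation frequency" of a syllable is taken here to be the reciprocal of its duration, which is how a 200 ms syllable corresponds to 5 Hz. A minimal sketch of that conversion follows; the function name and the list of example durations are illustrative assumptions, not measurements from the corpus.

```python
def duration_ms_to_modulation_hz(duration_ms):
    """Equivalent modulation frequency (Hz) of a unit lasting duration_ms."""
    if duration_ms <= 0:
        raise ValueError("duration must be positive")
    return 1000.0 / duration_ms

# Hypothetical syllable durations in milliseconds (not corpus data).
for duration in (100, 200, 500):
    hz = duration_ms_to_modulation_hz(duration)
    print(f"{duration} ms -> {hz:g} Hz")
```

Plotting a histogram of syllable durations on this reciprocal axis is what allows it to be overlaid on a modulation spectrum.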
43
The Trouble with Syllables
The question thus arises: If the modulation
spectrum truly reflects syllables in the speech
signal, why is the distribution of syllable
duration so broad?
And does this variability in syllable duration
reflect something significant?
Syllable duration (modulation frequency)
Modulation Spectrum
Modulation spectrum of 15 minutes of spontaneous
Japanese speech (OGI-TS corpus) compared with the
syllable duration distribution for the same
material (Arai and Greenberg, 1997)
44
PART ONE: What Underlies Variation in Word
Duration?
46
Word Duration
Most words (81%) in the Switchboard corpus are
monosyllabic, and most of the remainder are
disyllabic (together comprising 95% of the
words).
The distribution of word duration therefore
largely parallels that of syllables (plotted in
units of duration (ms) on a logarithmic scale)
All Words
51
What Underlies Word Duration Variability?
Is this distribution of lexical duration of a
uniform nature (and source)?
Or does it reflect a more complex set of
phenomena?
It has been observed for WRITTEN text that the
more frequent words tend to be shorter and the
less common words longer (i.e., Zipf's law).
Does such a relationship hold for spoken language?
Let's find out!
56
Is Word Duration Related to Word Frequency?
Word duration (derived from the phonetically
annotated portion of the Switchboard corpus) can
be plotted relative to frequency of occurrence.
Such an exercise shows that there is a WEAK
relationship (r = 0.42) between lexical (unigram)
frequency and word duration.
There is a lot of variability in word duration
for any given frequency range, suggesting that
lexical frequency, alone, is unlikely to account
for variation in word duration.
Words with fewer than 5 instances omitted from
graph
r = 0.42
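
As a rough illustration of the analysis described on this slide, the sketch below groups tokens by word, drops words with fewer than 5 instances, and computes a Pearson correlation between word frequency and mean word duration. Several things here are assumptions rather than the author's procedure: the function names, the use of log10 of the count as the frequency measure, and the toy tokens, which are invented and say nothing about the strength or sign of the relationship in Switchboard.

```python
import math
from collections import defaultdict

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)

def frequency_duration_points(tokens, min_count=5):
    """Turn (word, duration_ms) tokens into (log10 count, mean duration) per word.

    Words with fewer than min_count instances are omitted.
    """
    durations = defaultdict(list)
    for word, duration_ms in tokens:
        durations[word].append(duration_ms)
    return [
        (math.log10(len(ds)), sum(ds) / len(ds))
        for ds in durations.values()
        if len(ds) >= min_count
    ]

# Hypothetical toy tokens, NOT Switchboard measurements.
toy_tokens = (
    [("the", 100.0)] * 8 + [("know", 200.0)] * 6
    + [("really", 300.0)] * 5 + [("switchboard", 600.0)] * 2
)
points = frequency_duration_points(toy_tokens)
r = pearson_r([x for x, _ in points], [y for _, y in points])
print(f"{len(points)} word types kept, r = {r:.2f}")
```

The word with only two instances is excluded by the `min_count` filter, mirroring the omission noted under the graph.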
65
If Not (entirely) Word Frequency, Then What?
One parameter that might be more directly related
to word duration (and other durational properties
of speech) is STRESS ACCENT.
Stress accent is related to the emphasis (or
prominence) associated with individual syllables
within a word.
Although dictionaries list the stress patterns
associated with words, this information is but a
rough guide to the actual patterns observed (as
is the phonetic pronunciation provided in the
dictionary).
In order to obtain empirical data pertaining to
stress accent, it is necessary to manually
annotate a corpus (syllable by syllable).
This manual annotation has been performed for a
45-minute subset of the Switchboard corpus, which
has also been labeled with respect to phonetic
segments, syllables and words.
It is thus possible to ascertain the relationship
between stress accent and duration at the level
of the word, syllable and phonetic segment.
The remainder of this presentation focuses on the
statistical relationship between stress accent
and duration at these different linguistic tiers.
Before examining these data, let's briefly
consider the nature of the annotated material
(this is important for evaluating the reliability
of the results obtained)
66
INTERMEZZO: Being Phonetically (and
Prosodically) Annotated
76
Phonetic Transcription of Spontaneous English
Telephone dialogues of 5-10 minutes duration,
from the SWITCHBOARD corpus, have been
phonetically annotated (labeled and segmented)
Most of this Material has been Manually Annotated
4 hours labeled at the phone level and segmented
at the syllabic level
1 hour labeled and segmented at the
phonetic-segment level
The remaining material has been segmented at the
phonetic-segment level using automatic methods
45 minutes of stress-accent-labeled material
An additional four hours of material
automatically labeled with respect to accent
(this latter material was not used in the current
analysis, but will be available soon)
There is a Lot of Diversity in the Material
Transcribed
Spans speech of both genders (ca. 50/50),
reflecting a wide range of American dialectal
variation, speaking rate and voice quality
Transcription System
A variant of Arpabet (which was also used for
transcription of the TIMIT corpus)
78
Phonetic Transcription of Spontaneous English
The Data are Available at:
http://www.icsi.berkeley.edu/real/stp
87
Phonetic Transcription
How was the Labeling and Segmentation Performed?
VERY carefully, by UC-Berkeley linguistics
students
Using a display of the signal waveform,
spectrogram, word transcription and forced
alignments (automatic estimates of phones and
boundaries), and audio (listening at multiple
time scales - phone, word, utterance), on Sun
workstations
Additionally, automatic segmentation and labeling
of articulatory manner was used as a guide for
phonetic labeling and segmentation in recent work
97
Annotation of Stress Accent

Forty-five minutes of the phonetically annotated
portion of the Switchboard corpus was manually
labeled with respect to stress accent
Three levels of accent were distinguished:
Heavy, Light, None
(In actuality, labelers assigned a 1 to fully
accented syllables, a null to completely
unaccented syllables, and a 0.5 to all others)
An example of the annotation (attached to the
vocalic nucleus) is shown below (where the accent
levels could not be derived from a dictionary)
In this example most of the syllables are
unaccented, with two labeled as lightly accented
(0.5) and one other labeled as very lightly
accented (0.25)
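
The labeling scheme above can be pictured as a numeric value attached to each vocalic nucleus and then collapsed into the three named levels. The sketch below is one possible encoding, not the annotators' tooling: the function name, the rule that any value strictly between 0 and 1 (such as 0.25) counts as "light", and the example syllables are all assumptions.

```python
def accent_category(label):
    """Map a numeric accent label to 'heavy', 'light' or 'none'.

    None (a null label) and 0 are treated as unaccented; 1 as fully
    accented; anything strictly between 0 and 1 as lightly accented
    (an assumption made for this sketch).
    """
    if label is None or label == 0:
        return "none"
    if label >= 1:
        return "heavy"
    if 0 < label < 1:
        return "light"
    raise ValueError(f"unexpected accent label: {label!r}")

# Hypothetical (vocalic nucleus, label) pairs, not taken from the corpus.
syllables = [
    ("ih", None), ("ae", 0.5), ("ax", None), ("iy", 0.25),
    ("ax", None), ("ow", 0.5), ("ih", None), ("ax", None),
]
categories = [accent_category(label) for _, label in syllables]
print(categories.count("none"), "unaccented,", categories.count("light"), "light")
```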

98
PART TWO: The Relation between Stress Accent
and Word Duration
99
Back to Stress Accent and Word Duration
Stress accent is supposed to bear some systematic
relation to three principal acoustic parameters
of the speech signal
100
Back to Stress Accent and Word Duration
Stress accent is supposed to bear some systematic
relation to three principal acoustic parameters
of the speech signal:
Fundamental Frequency
101
Back to Stress Accent and Word Duration
Stress accent is supposed to bear some systematic
relation to three principal acoustic parameters
of the speech signal:
Fundamental Frequency, Amplitude
102
Back to Stress Accent and Word Duration
Stress accent is supposed to bear some systematic
relation to three principal acoustic parameters
of the speech signal:
Fundamental Frequency, Amplitude, Duration
103
Back to Stress Accent and Word Duration
Stress accent is supposed to bear some systematic
relation to three principal acoustic parameters
of the speech signal:
Fundamental Frequency, Amplitude, Duration
104
Back to Stress Accent and Word Duration
Stress accent is supposed to bear some systematic
relation to three principal acoustic parameters
of the speech signal:
Fundamental Frequency, Amplitude, Duration
In previous studies my colleagues and I have
shown that f0-related cues play a relatively
small role in stress accent assignment (at least
for spontaneous American English material)
105
Back to Stress Accent and Word Duration
Stress accent is supposed to bear some systematic
relation to three principal acoustic parameters
of the speech signal:
Fundamental Frequency, Amplitude, Duration
In previous studies my colleagues and I have
shown that f0-related cues play a relatively
small role in stress accent assignment (at least
for spontaneous American English material)
Amplitude and duration appear to play a far more
important role than f0
106
Back to Stress Accent and Word Duration
Stress accent is supposed to bear some systematic
relation to three principal acoustic parameters
of the speech signal:
Fundamental Frequency, Amplitude, Duration
In previous studies my colleagues and I have
shown that f0-related cues play a relatively
small role in stress accent assignment (at least
for spontaneous American English material)
Amplitude and duration appear to play a far more
important role than f0
Therefore, it is not unreasonable to assume that
the stress accent patterns associated with words
bear some tangible relation to lexical duration
107
Back to Stress Accent and Word Duration
Stress accent is supposed to bear some systematic
relation to three principal acoustic parameters
of the speech signal:
Fundamental Frequency, Amplitude, Duration
In previous studies my colleagues and I have
shown that f0-related cues play a relatively
small role in stress accent assignment (at least
for spontaneous American English material)
Amplitude and duration appear to play a far more
important role than f0
Therefore, it is not unreasonable to assume that
the stress accent patterns associated with words
bear some tangible relation to lexical duration
So
108
Back to Stress Accent and Word Duration
Stress accent is supposed to bear some systematic
relation to three principal acoustic parameters
of the speech signal:
Fundamental Frequency, Amplitude, Duration
In previous studies my colleagues and I have
shown that f0-related cues play a relatively
small role in stress accent assignment (at least
for spontaneous American English material)
Amplitude and duration appear to play a far more
important role than f0
Therefore, it is not unreasonable to assume that
the stress accent patterns associated with words
bear some tangible relation to lexical duration
So, let's find out!
109
Word Duration and Stress Accent Level
Let's first examine the durational properties of
heavily accented words
110
Word Duration and Stress Accent Level
Let's first examine the durational properties of
heavily accented words (these are words
containing at least one heavily accented
syllable)
111
Word Duration and Stress Accent Level
Let's first examine the durational properties of
heavily accented words (these are words
containing at least one heavily accented
syllable)
The mean duration of this subset (36%)
is 378 ms (s.d. 168 ms)
Heavily Accented
112
Word Duration and Stress Accent Level
Let's first examine the durational properties of
heavily accented words (these are words
containing at least one heavily accented
syllable)
The mean duration of this subset (36%)
is 378 ms (s.d. 168 ms)
Most of the heavily accented words are longer
than 200 ms
Heavily Accented
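
The word-level definition used here (a heavily accented word contains at least one heavily accented syllable) can be sketched as taking the strongest accent among a word's syllables. This is an illustrative sketch: the name word_accent is hypothetical, and extending the same rule to lightly accented words (strongest syllable is light) is an assumption, since the slides state the rule explicitly only for the heavy case.

```python
# Ranking of the three accent levels, weakest to strongest
ORDER = {"none": 0, "light": 1, "heavy": 2}


def word_accent(syllable_levels):
    """Return the strongest accent level found among a word's syllables."""
    return max(syllable_levels, key=ORDER.__getitem__)


# Hypothetical two-syllable word with one heavily accented syllable
example = word_accent(["none", "heavy"])
```

Under this rule a word counts as unaccented only when every one of its syllables is unaccented.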
113
Word Duration and Stress Accent Level
Let's now compare the duration of the heavily
accented words with that of their lightly
accented counterparts (25% of the total)
Heavily Accented
114
Word Duration and Stress Accent Level
Let's now compare the duration of the heavily
accented words with that of their lightly
accented counterparts (25% of the total)
The mean duration of this subset is 255 ms
(s.d. 116 ms)
Heavily Accented
Lightly Accented
115
Word Duration and Stress Accent Level
Let's now compare the duration of the heavily
accented words with that of their lightly
accented counterparts (25% of the total)
The mean duration of this subset is 255 ms
(s.d. 116 ms)
In many respects the durational properties of
these two subsets are similar
Heavily Accented
Lightly Accented
116
Word Duration and Stress Accent Level
Let's now compare the duration of unaccented
words with that of their accented counterparts
Heavily Accented
Lightly Accented
117
Word Duration and Stress Accent Level
Let's now compare the duration of unaccented
words with that of their accented counterparts
The mean duration of the unaccented subset (39%)
is 149 ms (s.d. 78 ms)
Unaccented
Heavily Accented
Lightly Accented
118
Word Duration and Stress Accent Level
Let's now compare the duration of unaccented
words with that of their accented counterparts
The mean duration of the unaccented subset (39%)
is 149 ms (s.d. 78 ms)
The unaccented words are generally shorter than
200 ms
Unaccented
Heavily Accented
Lightly Accented
119
Word Duration and Stress Accent Level
Let's now compare the duration of unaccented
words with that of their accented counterparts
The mean duration of the unaccented subset (39%)
is 149 ms (s.d. 78 ms)
The unaccented words are generally shorter than
200 ms and have a very different distributional
form from their accented counterparts
Unaccented
Heavily Accented
Lightly Accented
120
Word Duration and Stress Accent Level
Let's now compare the durational properties of
ALL WORDS in the corpus with those pertaining to
words of varying accent levels
Unaccented
Heavily Accented
Lightly Accented
121
Word Duration and Stress Accent Level
Let's now compare the durational properties of
ALL WORDS in the corpus with those pertaining to
words of varying accent levels
When we do so,
All Words
Unaccented
Heavily Accented
Lightly Accented
122
Word Duration and Stress Accent Level
Let's now compare the durational properties of
ALL WORDS in the corpus with those pertaining to
words of varying accent levels
When we do so, we notice that the left-hand
branch of the lexical distribution largely
reflects unaccented words,
All Words
Unaccented
Heavily Accented
Lightly Accented
123
Word Duration and Stress Accent Level
Let's now compare the durational properties of
ALL WORDS in the corpus with those pertaining to
words of varying accent levels
When we do so, we notice that the left-hand
branch of the lexical distribution largely
reflects unaccented words, while the right-hand
branch reflects mostly accented words (with the
peak reflecting both)
All Words
Unaccented
Heavily Accented
Lightly Accented
124
Word Duration and Stress Accent Level
Therefore, it appears that the broad distribution
of word duration (and, in turn, syllable
duration) largely reflects the co-existence of
accented and unaccented words within spontaneous
speech
All Words
Unaccented
Heavily Accented
Lightly Accented
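
The idea that the all-words distribution is a blend of the three accent subsets can be made concrete with a small calculation. The sketch below combines the proportions, means and standard deviations quoted on the preceding slides into the mean and s.d. of the mixture. It assumes that the three percentages are proportions of word tokens and that the subsets partition the corpus; the slides do not report the overall figures, so the result is only what those assumptions imply. The function name mixture_mean_sd is hypothetical.

```python
import math

# (proportion of words, mean duration in ms, s.d. in ms), as quoted on the slides
subsets = {
    "heavy": (0.36, 378.0, 168.0),
    "light": (0.25, 255.0, 116.0),
    "none": (0.39, 149.0, 78.0),
}


def mixture_mean_sd(parts):
    """Mean and s.d. of a mixture, given each component's weight, mean and s.d."""
    total_w = sum(w for w, _, _ in parts.values())
    mean = sum(w * m for w, m, _ in parts.values()) / total_w
    # E[X^2] of the mixture is the weighted sum of each component's E[X^2]
    second_moment = sum(w * (s**2 + m**2) for w, m, s in parts.values()) / total_w
    return mean, math.sqrt(second_moment - mean**2)


overall_mean, overall_sd = mixture_mean_sd(subsets)
print(f"overall mean ~ {overall_mean:.0f} ms, s.d. ~ {overall_sd:.0f} ms")
```

Under these assumptions the overall mean comes out near 258 ms, between the unaccented mean (149 ms) and the heavily accented mean (378 ms), which is consistent with the left-hand and right-hand branches described above.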
125
Word Duration and Stress Accent Level
Therefore, it appears that the broad distribution
of word duration (and, in turn, syllable
duration) largely reflects the co-existence of
accented and unaccented words within spontaneous
speech
What are the implications of this insight?
All Words
Unaccented
Heavily Accented
Lightly Accented
126
Breadth of the Modulation Spectrum
The broad bandwidth of the modulation spectrum,
therefore, appears to reflect the heterogeneity
in syllabic and lexical duration associated with
variation in stress accent level
127
Breadth of the Modulation Spectrum
The broad bandwidth of the modulation spectrum,
therefore, appears to reflect the heterogeneity
in syllabic and lexical duration associated with
variation in stress accent level
All Accents (Convergence)
Unaccented
Heavily Accented
Modulation spectrum of 40 TIMIT sentences
(computed across a 6-kHz bandwidth)
128
Breadth of the Modulation Spectrum
The broad bandwidth of the modulation spectrum,
therefore, appears to reflect the heterogeneity
in syllabic and lexical duration associated with
variation in stress accent level
Does this insight have implications for the lower
tiers of spoken language?
All Accents (Convergence)
Unaccented
Heavily Accented
Modulation spectrum of 40 TIMIT sentences
(computed across a 6-kHz bandwidth)
129
Breadth of the Modulation Spectrum
The broad bandwidth of the modulation spectrum,
therefore, appears to reflect the heterogeneity
in syllabic and lexical duration associated with
variation in stress accent level
Does this insight have implications for the lower
tiers of spoken language (e.g., the phonetic and
phonological levels)?
All Accents (Convergence)
Unaccented
Heavily Accented
Modulation spectrum of 40 TIMIT sentences
(computed across a 6-kHz bandwidth)
130
Breadth of the Modulation Spectrum
The broad bandwidth of the modulation spectrum,
therefore, appears to reflect the heterogeneity
in syllabic and lexical duration associated with
variation in stress accent level
Does this insight have implications for the lower
tiers of spoken language (e.g., the phonetic and
phonological levels)?
Let's find out!
All Accents (Convergence)
Unaccented
Heavily Accented
Modulation spectrum of 40 TIMIT sentences
(computed across a 6-kHz bandwidth)
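
One rough way to see why heterogeneous durations imply a broad modulation spectrum is to convert each duration into the repetition rate it would correspond to if units of that length followed one another back to back. The sketch below does this for the mean word durations quoted earlier. The reciprocal mapping is a simplifying assumption for illustration (the slides do not state a formula), and the function name modulation_hz is hypothetical.

```python
def modulation_hz(duration_ms: float) -> float:
    """Repetition rate (Hz) of back-to-back units lasting duration_ms each."""
    return 1000.0 / duration_ms


# Mean word durations quoted on the earlier slides
for label, ms in [("heavy", 378.0), ("light", 255.0), ("none", 149.0)]:
    print(f"{label:>5}: {ms:.0f} ms -> {modulation_hz(ms):.1f} Hz")
```

Under this simplification the accented and unaccented means map to clearly different rates, so a mixture of the two would be expected to spread energy over a range of modulation frequencies rather than a single narrow peak.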
131
INTERMEZZO Anatomy of the Syllable
132
The Importance of the Syllable
The analyses to follow are all linked, in some
fashion, to syllable structure
133
The Importance of the Syllable
The analyses to follow are all linked, in some
fashion, to syllable structure
In order to highlight patterns germane to
variation in segmental duration it is necessary
to partition the data in terms of syllable
position
134
The Importance of the Syllable
The analyses to follow are all linked, in some
fashion, to syllable structure
In order to highlight patterns germane to
variation in segmental duration it is necessary
to partition the data in terms of syllable
position (as well as stress accent level)
135
The Importance of the Syllable
The analyses to follow are all linked, in some
fashion, to syllable structure
In order to highlight patterns germane to
variation in segmental duration it is necessary
to partition the data in terms of syllable
position (as well as stress accent level)
As a consequence, we will examine the onsets,
codas and nuclei of syllables separately in order
to gain insight into the underlying patterns
136
The Importance of the Syllable
The analyses to follow are all linked, in some
fashion, to syllable structure
In order to highlight patterns germane to
variation in segmental duration it is necessary
to partition the data in terms of syllable
position (as well as stress accent level)
As a consequence, we will examine the onsets,
codas and nuclei of syllables separately in order
to gain insight into the underlying patterns
What is an onset?
137
The Importance of the Syllable
The analyses to follow are all linked, in some
fashion, to syllable structure
In order to highlight patterns germane to
variation in segmental duration it is necessary
to partition the data in terms of syllable
position (as well as stress accent level)
As a consequence, we will examine the onsets,
codas and nuclei of syllables separately in order
to gain insight into the underlying patterns
What is an onset?
What is a nucleus?
138
The Importance of the Syllable
The analyses to follow are all linked, in some
fashion, to syllable structure
In order to highlight patterns germane to
variation in segmental duration it is necessary
to partition the data in terms of syllable
position (as well as stress accent level)
As a consequence, we will examine the onsets,
codas and nuclei of syllables separately in order
to gain insight into the underlying patterns
What is an onset?
What is a nucleus?
What is a coda?
139
The Importance of the Syllable
The analyses to follow are all linked, in some
fashion, to syllable structure
In order to highlight patterns germane to
variation in segmental duration it is necessary
to partition the data in terms of syllable
position (as well as stress accent level)
As a consequence, we will examine the onsets,
codas and nuclei of syllables separately in order
to gain insight into the underlying patterns
What is an onset?
What is a nucleus?
What is a coda?
The following slides provide a brief (and gentle)
introduction to syllable structure
140
Syllable and Phonetic Segment Illustrated
Syllables generally consist of three constituents
- ONSET, NUCLEUS, CODA
J = JUNCTURE
141
Syllable and Phonetic Segment Illustrated
Syllables generally consist of three constituents
- ONSET, NUCLEUS, CODA
Virtually all syllables contain a NUCLEUS, which
is VOCALIC (by definition)
J = JUNCTURE
142
Syllable and Phonetic Segment Illustrated
Syllables generally consist of three constituents
- ONSET, NUCLEUS, CODA
Virtually all syllables contain a NUCLEUS, which
is VOCALIC (by definition)
Most (but not all) syllables also contain an
ONSET (usually a CONSONANT)
J = JUNCTURE
143
Syllable and Phonetic Segment Illustrated
Syllables generally consist of three constituents
- ONSET, NUCLEUS, CODA
Virtually all syllables contain a NUCLEUS, which
is VOCALIC (by definition)
Most (but not all) syllables also contain an
ONSET (usually a CONSONANT)
Many syllables contain a CODA (also typically a
CONSONANT)
J = JUNCTURE
144
Syllable and Phonetic Segment Illustrated
Syllables generally consist of three constituents
- ONSET, NUCLEUS, CODA
Virtually all syllables contain a NUCLEUS, which
is VOCALIC (by definition)
Most (but not all) syllables also contain an
ONSET (usually a CONSONANT)
Many syllables contain a CODA (also typically a
CONSONANT)
The most common syllable form in English is
Onset + Nucleus + Coda (e.g., "Nine")
J = JUNCTURE
145
Syllable and Phonetic Segment Illustrated
Syllables generally consist of three constituents
- ONSET, NUCLEUS, CODA
Virtually all syllables contain a NUCLEUS, which
is VOCALIC (by definition)
Most (but not all) syllables also contain an
ONSET (usually a CONSONANT)
Many syllables contain a CODA (also typically a
CONSONANT)
The most common syllable form in English is
Onset + Nucleus + Coda (e.g., "Nine")
Followed in popularity by Onset + Nucleus
(e.g., "Two")
J = JUNCTURE
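
The three-way division of the syllable can be sketched as a split of a consonant/vowel pattern into its constituents. This is a minimal illustration, assuming a syllable is written as a string of C and V symbols with exactly one vocalic stretch; the function name split_syllable is hypothetical, and the sketch ignores the rare syllables that lack a vocalic nucleus.

```python
import re


def split_syllable(cv_pattern: str):
    """Split a C/V pattern such as 'CVC' into (onset, nucleus, coda)."""
    match = re.fullmatch(r"(C*)(V+)(C*)", cv_pattern)
    if match is None:
        raise ValueError(f"not a single-syllable C/V pattern: {cv_pattern!r}")
    return match.groups()


nine = split_syllable("CVC")  # onset + nucleus + coda, as in "nine"
two = split_syllable("CV")    # onset + nucleus, no coda, as in "two"
```

An empty string in the first or last position simply means that the syllable has no onset or no coda.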
146
PART THREE Stress Accent and Syllable Position
147
The Importance of Syllable Structure
Before going into the details of durational
variation at the segmental level we briefly
examine some general patterns of pronunciation
variation that are conditioned by syllable
position and stress accent
148
The Importance of Syllable Structure
Before going into the details of durational
variation at the segmental level we briefly
examine some general patterns of pronunciation
variation that are conditioned by syllable
position and stress accent
These data serve to illustrate the sort of
variation that is conditioned by position within
the syllable
149
Pronunciation Variation Syllable and Accent
Pronunciation variation is systematic at the
level of the syllable
(Figure labels: Deletions, Substitutions, Insertions; All Segments; ONSET Territory, NUCLEUS Territory, CODA Territory)
150
Pronunciation Variation Syllable and Accent
Pronunciation variation is systematic at the
level of the syllable
It's also systematic when stress accent is taken
into account
(Figure labels: Deletions, Substitutions, Insertions; All Segments; ONSET Territory, NUCLEUS Territory, CODA Territory)
151
Pronunciation Variation Syllable and Accent
Pronunciation variation is systematic at the
level of the syllable
It's also systematic when stress accent is taken
into account
BOTH syllable structure and accent level are
required for a full accounting
(Figure labels: Deletions, Substitutions, Insertions; All Segments; ONSET Territory, NUCLEUS Territory, CODA Territory)
152
A Coarse Perspective on Pronunciation
Variation (at the level of the syllable and
stress accent)

153
Analysis of Durational Properties of Speech
The following analyses are conditioned on stress
accent level and (for the most part) syllable
position
154
Analysis of Durational Properties of Speech
The following analyses are conditioned on stress
accent level and (for the most part) syllable
position
We will begin with analyses illustrating the
patterns associated with three levels of stress
accent (heavy, light and none) to show the graded
nature of the durational properties pertaining to
syllable and segment duration
155
Analysis of Durational Properties of Speech
The following analyses are conditioned on stress
accent level and (for the most part) syllable
position
We will begin with analyses illustrating the
patterns associated with three levels of stress
accent (heavy, light and none) to show the graded
nature of the durational properties pertaining to
syllable and segment duration
However, for purposes of illustrative clarity,
many of the slides will show only two levels of
accent (heavy and none) in order to delineate the
differences in duration associated with stress
accent level
156
Analysis of Durational Properties of Speech
The following analyses are conditioned on stress
accent level and (for the most part) syllable
position
We will begin with analyses illustrating the
patterns associated with three levels of stress
accent (heavy, light and none) to show the graded
nature of the durational properties pertaining to
syllable and segment duration
However, for purposes of illustrative clarity,
many of the slides will show only two levels of
accent (heavy and none) in order to delineate the
differences in duration associated with stress
accent level
Under such conditions, the durational properties
associated with light accent are generally
intermediate between heavy accent and none
157
Syllable Duration - Across Syllable Forms

There is a broad range of syllable structures
observed in spoken English

158
Syllable Duration - Across Syllable Forms

There is a broad range of syllable structures
observed in spoken English
Together, the V, VC, CV and CVC forms account for
85% of syllables

159
Syllable Duration - Across Syllable Forms

There is a broad range of syllable structures
observed in spoken English
Together, the V, VC, CV and CVC forms account for
85% of syllables
The CVCC and CCVC forms account for another 10%

160
Syllable Duration - Across Syllable Forms

There is a broad range of syllable structures
observed in spoken English
Together, the V, VC, CV and CVC forms account for
85% of syllables
The CVCC and CCVC forms account for another 10%
Together, the CV and CVC forms cover ca. 60% of
the syllables
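
The coverage figures on this slide imply two further quantities that are not stated directly: the share of syllables left for all other forms, and the combined share of the V and VC forms. The short calculation below derives both. It treats the quoted percentages as exact, although the slide gives the CV and CVC figure only as approximate ("ca."), so the results should be read as rough estimates. The variable names are hypothetical.

```python
# Percentages of syllables, as quoted on the slide
four_simple_forms = 85   # V, VC, CV and CVC together
cvcc_and_ccvc = 10       # CVCC and CCVC together
cv_and_cvc = 60          # CV and CVC together (approximate)

# Share left over for every other syllable form
other_forms = 100 - (four_simple_forms + cvcc_and_ccvc)

# Share attributable to the onsetless V and VC forms together
v_and_vc = four_simple_forms - cv_and_cvc

print(f"other forms: ~{other_forms}%, V and VC: ~{v_and_vc}%")
```

On these figures the six named forms cover roughly 95% of syllables, with the onsetless V and VC forms contributing about a quarter.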

161
Syllable Duration - Across Syllable Forms

It is not surprising that syllable duration is
largely a function of the number of segments
within the syllable (as shown in the graph below)

V = Vowel, C = Consonant
Canonical Syllable Forms
162
Syllable Duration - Across Syllable Forms

It is not surprising that syllable duration is
largely a function of the number of segments
within the syllable (as shown in the graph below)
Note the systematic lengthening of the syllable
for each form as the accent level increases from
none to light to heavy

V = Vowel, C = Consonant
Canonical Syllable Forms
163
Syllable Duration - Across Syllable Forms

It is not surprising that syllable duration is
largely a function of the number of segments
within the syllable (as shown in the graph below)
Note the systematic lengthening of the syllable
for each form as the accent level increases from
none to light to heavy
This pattern is representative of accent's impact
on duration

V = Vowel, C = Consonant
Canonical Syllable Forms
164
Syllable Duration - Across Syllable Forms

It is not surprising that syllable duration is
largely a function of the number of segments
within the syllable (as shown in the graph below)
Note the systematic lengthening of the syllable
for each form as the accent level increases from
none to light to heavy
This pattern is representative of accent's impact
on duration (as we'll see)

V = Vowel, C = Consonant
Canonical Syllable Forms
165
Syllable Duration - Accent Level/Syllable Form
This graph shows the same data as the previous
slides, but from the perspective of only two
accent levels (heavy and none)
V = Vowel, C = Consonant
Canonical Syllable Forms
166
Syllable Duration - Accent Level/Syllable Form
This graph shows the same data as the previous
slides, but from the perspective of only two
accent levels (heavy and none)
The heavily accented syllables are generally
60-100% longer than their unaccented counterparts
V = Vowel, C = Consonant
Canonical Syllable Forms
167
Syllable Duration - Accent Level/Syllable Form
This graph shows the same data as the previous
slides, but from the perspective of only two
accent levels (heavy and none)
The heavily accented syllables are generally
60-100% longer than their unaccented counterparts
The disparity in duration is most pronounced for
syllable forms with one or no consonants (i.e.,
V, VC, CV)
V = Vowel, C = Consonant
Canonical Syllable Forms
168
Syllable Duration - Accent Level/Syllable Form
This graph shows the same data as the previous
slides, but from the perspective of only two
accent levels (heavy and none)
The heavily accented syllables are generally
60-100% longer than their unaccented counterparts
The disparity in duration is most pronounced for
syllable forms with one or no consonants (i.e.,
V, VC, CV)
This pattern implies that accent has the greatest
impact on vocalic duration
V = Vowel, C = Consonant
Canonical Syllable Forms
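
The "60-100% longer" comparison is a relative lengthening: the difference between the accented and unaccented durations, expressed as a percentage of the unaccented duration. The sketch below shows the calculation. The function name percent_longer is hypothetical, and the two durations in the example are made-up values chosen only to illustrate the arithmetic; they are not taken from the graph.

```python
def percent_longer(accented_ms: float, unaccented_ms: float) -> float:
    """Relative lengthening of the accented unit over the unaccented one, in %."""
    return 100.0 * (accented_ms - unaccented_ms) / unaccented_ms


# Hypothetical syllable durations (ms), for illustration only
example = percent_longer(accented_ms=320.0, unaccented_ms=200.0)
print(f"{example:.0f}% longer")
```

On this measure, a unit that is twice as long as its unaccented counterpart is 100% longer.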
169
Nucleus Duration - Accent Level/Syllable Form
The hypothesis delineated on the previous slide
(that accent has the most profound impact on
vocalic duration) is confirmed in the graph below

Canonical Syllable Forms
170
Nucleus Duration - Accent Level/Syllable Form
The hypothesis delineated on the previous slide
(that accent has the most profound impact on
vocalic duration) is confirmed in the graph below
Vowels in accented syllables (of all forms) are
at least twice as long as their unaccented
counterparts
Canonical Syllable Forms
171
Nucleus Duration - Accent Level/Syllable Form
The hypothesis delineated on the previous slide
(that accent has the most profound impact on
vocalic duration) is confirmed in the graph below
Vowels in accented syllables (of all forms) are
at least twice as long as their unaccented
counterparts
This pattern implies that the syllable nucleus
absorbs a major comp
