Prosody in Spoken Language Understanding

1
Prosody in Spoken Language Understanding

Gina Anne Levow
University of Chicago
January 4, 2008
NLP Winter School 2008

U: Give me the price for AT&T.
U: Give me the price for AT&T.
U: Give me the price for AT&T.
U: Give me the price for American Telephone and Telegraph.

3
Error Correction Spiral

U: Give me the price for AT&T.
S: Hewlett Packard was 83 ¾, up 2 ½ since yesterday.
U: Give me the price for AT&T.
S: Data General was at 10 ½, up a quarter.
U: Give me the price for AT&T.
S: Hewlett Packard was 83 ¾, up 2 ½.
U: Give me the price for American Telephone and Telegraph.
S: Excuse me?

4
Roadmap

Corrections: A motivating example
Defining prosody
Why prosody?
Challenges in prosody
Prosody in language understanding
Recognizing tone and pitch accent
Spoken corrections, Topic segmentation
Conclusions

5
Defining Prosody

Prosody
Phonetic phenomena in speech that span more than
a single segment (suprasegmental)
Prosody includes
Stress, focus, tone, intonation, length/pause,
rhythm
Prosodic features include
Pitch: perceptual correlate of fundamental
frequency
f0: rate of vocal fold vibration
Loudness/intensity, duration, segment quality

6
Why Prosody?

Prosody plays a crucial role
At all levels of language
Lexical, syntactic, pragmatic/discourse
Establishes meaning
Disambiguates sense and structure
Across language families
Common physiological, articulatory basis
In synthesis and recognition of fluent speech

7
Prosody and the Lexicon

Lexical: Determines word identity
Prosodic effect at the syllable level (minimal
unit)
Lexical stress: syllable prominence
Combination of length, pitch movement, loudness
REcord (N) vs reCORD (V)
Pitch accent can differentiate words in some
languages
Lexical tone: tone languages, e.g., Chinese,
Punjabi
Pitch height (register) and/or shape (contour)

Ma (high): mother
Ma (rising): hemp
Ma (low): horse
Ma (falling): scold
8
Prosody and Syntax

Prosody can disambiguate structure
Associated with chunking and attachment
Not identical with syntactic phrase boundaries
Prosody is predictable from syntax, except when
it isn't
Prosodic phrasing indicated by
Some combination of pause, change in pitch

9
Chunking, or phrasing

A1: I met Mary and Elena's mother at the mall yesterday.
A2: I met Mary and Elena's mother at the mall yesterday.

Example from Jennifer Venditti
10
Punctuation Prosody Humor

A panda goes into a restaurant and has a meal.
Just before he leaves, he takes out a gun and
fires it. The irate restaurant owner says, "Why
did you do that?" The panda replies, "I'm a
panda. Look it up." The restaurateur goes to his
dictionary and under "panda" finds: "black and
white arboreal, bear-like creature; eats, shoots
and leaves."

11
Prosody in Pragmatics & Discourse

Focus
Prominence, new information: pitch accent
October eleventh
Sentence type, dialogue act
Statement vs. declarative question: It's
raining(?)
Discourse Structure (Topic), Emotion

from Shih, Prosody Learning and Generation
12
Challenges in Prosody I

Highly variable
Actual realization differs from ideal
Speaker variation
Gender, vocal tract differences, idiosyncrasy
Tonal coarticulation
Neighboring tones influence each other (like
segmental coarticulation)
Underlying fall can become rise
Parallel encoding
Effects at multiple levels realized
simultaneously

13
Challenges in Prosody II

Challenges for learning
Lack of training data
Sparseness
Many prosodic phenomena are infrequent
E.g., non-declarative utterances, topic
boundaries, contrastive accents, etc.
Challenging for machine learning methods
Costs of labeling
Many prosodic events require expert labeling
Need large corpus to attest
Time-consuming, expensive

Context and Learning in Multilingual Tone and
Pitch Accent Recognition

15
Strategy: Context

Common model across languages
Pure acoustic-prosodic model
No word label, POS, lexical stress info
English, Mandarin Chinese (also Cantonese,
isiZulu)
Exploit contextual information
Features from adjacent syllables, phrase contour
Analyze impact of
Context position, context encoding, context type
> 12.5% reduction in error over no context

16
Data Collections

English (Ostendorf et al, 95)
Boston University Radio News Corpus, f2b
Manually annotated, aligned, syllabified
4 Pitch accent labels, aligned to syllables
Mandarin
TDT2 Voice of America Mandarin Broadcast News
Automatically aligned, syllabified
4 main tones, neutral

17
Local Feature Extraction

Uniform representation for tone, pitch accent
Motivated by Pitch Target Approximation Model
Tone/pitch accent target exponentially approached
Linear target height, slope (Xu et al, 99)
Base features
Pitch, intensity: max, mean, min, range
(Praat, speaker normalized)
Pitch at 5 points across voiced region
Duration
Initial, final in phrase
Slope
Linear fit to last half of pitch contour
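
The base pitch features listed above can be sketched roughly as follows. This is a minimal illustration, not the system from the talk: the function names, the dictionary keys, and the assumption that the input is an already speaker-normalized f0 track for the voiced region of one syllable are all hypothetical. Intensity features and the phrase-initial/final flags are omitted.

```python
from statistics import mean


def linear_slope(xs, ys):
    """Least-squares slope of ys against xs (0.0 if xs has no spread)."""
    mx, my = mean(xs), mean(ys)
    denom = sum((x - mx) ** 2 for x in xs)
    if denom == 0:
        return 0.0
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / denom


def local_pitch_features(times, f0, n_points=5):
    """Summary features for one syllable's voiced pitch contour.

    times: frame times (any consistent unit); f0: pitch values,
    assumed here to be speaker normalized beforehand.
    """
    if len(times) != len(f0) or len(f0) < 2:
        raise ValueError("need two or more aligned (time, f0) frames")
    last = len(f0) - 1
    # Sample n_points evenly spaced positions across the voiced region.
    points = [f0[round(i * last / (n_points - 1))] for i in range(n_points)]
    half = len(f0) // 2
    return {
        "max": max(f0),
        "mean": mean(f0),
        "min": min(f0),
        "range": max(f0) - min(f0),
        "points": points,
        "duration": times[-1] - times[0],
        # Slope: linear fit to the last half of the pitch contour.
        "slope": linear_slope(times[half:], f0[half:]),
    }
```

Fitting the slope only to the second half of the contour follows the bullet above; the first half of a syllable tends to carry the transition from the previous target.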

18
Context Features

Local context
Extended features
Pitch max, mean, adjacent points of preceding,
following syllables
Difference features
Difference between
Pitch max, mean, mid, slope
Intensity max, mean
Of preceding, following and current syllable
Phrasal context
Compute collection average phrase slope
Compute scalar pitch values, adjusted for slope

19
Classification Experiments

Classifier: Support Vector Machine
Linear kernel
Multiclass formulation
SVMlight (Joachims), LibSVM (Chang & Lin 01)
4:1 training/test splits
Experiments: Effects of
Context position: preceding, following, none,
both
Context encoding: Extended/Difference
Context type: local, phrasal

20
Results: Local Context
Accuracy (%) by context configuration

Context          Mandarin Tone    English Pitch Accent
Full             74.5             81.3
Extend PrePost   74               80.7
Extend Pre       74               79.9
Extend Post      70.5             76.7
Diffs PrePost    75.5             80.7
Diffs Pre        76.5             79.5
Diffs Post       69               77.3
Both Pre         76.5             79.7
Both Post        71.5             77.6
No context       68.5             75.9
23
Discussion: Local Context

Any context information improves over none
Preceding context information consistently
improves over none or following context
information
English: Generally more context features are
better
Mandarin: Following context can degrade
Little difference in encoding (Extend vs Diffs)
Consistent with phonetic analysis (Xu) that
carryover coarticulation is greater than
anticipatory

24
Results & Discussion: Phrasal Context

Phrase Context   Mandarin Tone    English Pitch Accent
Phrase           75.5             81.3
No Phrase        72               79.9

Phrase contour compensation enhances recognition
Simple strategy
Use of non-linear slope compensation may improve
results

25
Context Summary

Employ common acoustic representation
Tone (Mandarin), pitch accent (English)
Cantonese: 64%, 68% with RBF kernel
SVM classifiers, linear kernel: 76%, 81%
Local context effects
Up to > 20% relative reduction in error
Preceding context greatest contribution
Carryover vs anticipatory
Phrasal context effects
Compensation for phrasal contour improves
recognition
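
As a worked check of the relative error reduction figure, the computation can be written out using the accuracies reported in the local context results (no context: 68.5 Mandarin, 75.9 English; best with context: 76.5 Mandarin, 81.3 English). The function name is only illustrative.

```python
def relative_error_reduction(baseline_acc, improved_acc):
    """Relative reduction in error rate, in percent.

    Both arguments are accuracies in percent; error = 100 - accuracy.
    """
    base_err = 100.0 - baseline_acc
    new_err = 100.0 - improved_acc
    return 100.0 * (base_err - new_err) / base_err


# Accuracies taken from the local context results table.
mandarin = relative_error_reduction(68.5, 76.5)  # roughly 25%
english = relative_error_reduction(75.9, 81.3)   # roughly 22%
```

Both values come out above 20%, which agrees with the summary bullet.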

26
Strategy: Training

Challenge
Can we use the underlying acoustic structure of
the language through unlabeled examples to
reduce the need for expensive labeled training
data?
Exploit semi-supervised and unsupervised learning
Semi-supervised: Laplacian SVM
Unsupervised: k-means and asymmetric k-lines clustering
Substantially outperform baselines
Can approach supervised levels

27
Data Collections & Processing

English (as before)
Boston University Radio News Corpus, f2b
Binary: Unaccented vs accented
4-way: Unaccented, High, Downstepped High, Low
Mandarin
Lab speech data (Xu, 1999)
5-syllable utterances; vary tone, focus position
In-focus, pre-focus, post-focus
TDT2 Voice of America Mandarin Broadcast News
4-way: High, Mid-rising, Low, High falling
isiZulu (as before)
Read web sentences
2-way: High vs low

28
Semi-supervised Learning

Approach
Employ small amount of labeled data
Exploit information from additional presumably
more available unlabeled data
Few prior examples: several weakly supervised
(Wong et al., 05)
Classifier
Laplacian SVM (Sindhwani, Belkin & Niyogi 05)
Semi-supervised variant of SVM
Exploits unlabeled examples
RBF kernel, typically 6 nearest neighbors,
transductive
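
Laplacian SVMs use unlabeled examples through a neighborhood graph built over all points, labeled and unlabeled. The sketch below shows only that ingredient, a k-nearest-neighbor graph Laplacian and the smoothness penalty it defines; it is not the classifier itself. The binary edge weights, squared Euclidean distance, and function names are assumptions made for illustration.

```python
def knn_graph_laplacian(points, k=6):
    """Unnormalized Laplacian L = D - W of a symmetrized kNN graph.

    points: list of equal-length numeric tuples.
    Edges get binary weights (an assumption; heat-kernel weights are
    another common choice).
    """
    n = len(points)

    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    W = [[0] * n for _ in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        nearest = sorted(others, key=lambda j: dist2(points[i], points[j]))[:k]
        for j in nearest:
            W[i][j] = W[j][i] = 1  # symmetrize: keep the edge in both directions
    degree = [sum(row) for row in W]
    return [[(degree[i] if i == j else 0) - W[i][j] for j in range(n)]
            for i in range(n)]


def smoothness(L, f):
    """f^T L f: large when neighboring points receive different outputs."""
    n = len(f)
    return sum(f[i] * L[i][j] * f[j] for i in range(n) for j in range(n))
```

Penalizing `smoothness(L, f)` during training pushes the decision function to change slowly between nearby examples, which is how the unlabeled data constrains the classifier.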

29
Experiments

Pitch accent recognition
Binary classification: Unaccented/Accented
1000 instances, proportionally sampled
Labeled training: 200 unacc, 100 acc
80% accuracy (cf. 84% w/ 15x labeled SVM)
Mandarin tone recognition
4-way classification: n(n-1)/2 binary classifiers
400 instances, balanced; 160 labeled
Clean lab speech, in-focus: 94%
cf. 99% w/ SVM, 1000s of training samples; 85%
w/ SVM, 160 training samples
Broadcast news: 70%
cf. < 50% w/ SVM, 160 training samples
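
The n(n-1)/2 construction for 4-way tone classification can be illustrated as follows: one binary classifier per pair of classes, combined by voting. The nearest-prototype binary classifiers and the prototype values are toy stand-ins chosen for this sketch, not the Laplacian SVMs used in the experiments.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_pairs(classes):
    """All unordered class pairs: n(n-1)/2 of them."""
    return list(combinations(classes, 2))


def one_vs_one_predict(x, classes, binary_classifiers):
    """Majority vote over pairwise decisions.

    binary_classifiers maps each pair (a, b) to a function that
    returns either a or b for input x. Ties go to the class listed
    first in `classes`.
    """
    votes = Counter(binary_classifiers[pair](x)
                    for pair in one_vs_one_pairs(classes))
    return max(classes, key=lambda c: votes[c])


def make_nearest_prototype(a, b, prototypes):
    """Toy binary classifier: pick whichever prototype is closer."""
    def classify(x):
        def d2(c):
            return sum((u - v) ** 2 for u, v in zip(x, prototypes[c]))
        return a if d2(a) <= d2(b) else b
    return classify


# Hypothetical (height, slope) prototypes for four tone classes.
TONES = ("high", "rising", "low", "falling")
PROTOTYPES = {"high": (1, 0), "rising": (0, 1), "low": (-1, 0), "falling": (0, -1)}
CLASSIFIERS = {(a, b): make_nearest_prototype(a, b, PROTOTYPES)
               for a, b in one_vs_one_pairs(TONES)}
```

With four tones this yields six pairwise classifiers, and each test syllable is assigned the class that wins the most pairwise decisions.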

30
Unsupervised Learning

Question
Can we identify the tone structure of a language
from the acoustic space without training?
Analogous to language acquisition
Significant recent research in unsupervised
clustering
Established approaches: k-means
Spectral clustering (Shi & Malik 97; Fischer &
Poland 2004): asymmetric k-lines
Little research for tone
Self-organizing maps (Gauthier et al., 2005)
Tones identified in lab speech using f0
velocities
Cluster-based bootstrapping (Narayanan et al.,
2006)
Prominence clustering (Tamburini 05)

31
Contrasting Clustering

Contrasts
Clustering: 2-16 clusters, label w/ most frequent
class
3 spectral approaches
Perform spectral decomposition of affinity matrix
Asymmetric k-lines (Fischer & Poland 2004)
Symmetric k-lines (Fischer & Poland 2004)
Laplacian Eigenmaps (Belkin, Niyogi, Sindhwani
2004)
Binary weights, k-lines clustering
K-means: Standard Euclidean distance
# of clusters: 2-16
Best results: > 78%
2 clusters: asymmetric k-lines > k-means
Larger numbers of clusters: all similar
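
The evaluation scheme above (cluster, label each cluster with its most frequent class, then score) can be sketched with the k-means baseline. This is a minimal illustration: the initial centroids are passed in explicitly to keep the example deterministic, and the spectral k-lines methods are not implemented here.

```python
from collections import Counter


def kmeans(points, centroids, n_iter=20):
    """Plain k-means with squared Euclidean distance.

    points: list of numeric tuples; centroids: initial centers.
    Returns the cluster index assigned to each point.
    """
    centroids = [tuple(c) for c in centroids]
    assign = []
    for _ in range(n_iter):
        assign = [
            min(range(len(centroids)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return assign


def majority_label_accuracy(assign, gold):
    """Label each cluster with its most frequent gold class and score it."""
    by_cluster = {}
    for a, g in zip(assign, gold):
        by_cluster.setdefault(a, Counter())[g] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return correct / len(gold)
```

Because every cluster is scored against its own majority class, accuracy under this scheme cannot decrease as a cluster is split further, which is one reason results are compared across a range of cluster counts.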

32
Contrasting Learners
33
Tone Clustering I

Mandarin four tones
400 samples balanced
2-phase clustering 2-5 clusters each
Asymmetric k-lines, k-means clustering
Clean read speech
In-focus syllables: 87% (cf. 99% supervised)
In-focus and pre-focus: 77% (cf. 93% supervised)
Broadcast news: 57% (cf. 74% supervised)
K-means requires more clusters to reach k-lines
level

34
Tone Structure
First phase of clustering splits high/rising from
low/falling by slope. Second phase splits by pitch
height.
35
Conclusions

Common prosodic framework for tone and pitch
accent recognition
Contextual modeling enhances recognition
Local context and broad phrase contour
Carryover coarticulation has larger effect for
Mandarin
Exploiting unlabeled examples for recognition
Semi- and Un-supervised approaches
Best cases approach supervised levels with less
training
Exploits acoustic structure of tone and accent
space

36
Error Correction Spiral

U: Give me the price for AT&T.
S: Hewlett Packard was 83 ¾, up 2 ½ since yesterday.
U: Give me the price for AT&T.
S: Data General was at 10 ½, up a quarter.
U: Give me the price for AT&T.
S: Hewlett Packard was 83 ¾, up 2 ½.
U: Give me the price for American Telephone and Telegraph.
S: Excuse me?

37
Recognizing Spoken Corrections

Spoken Corrections
Recognize user attempts to correct ASR failures
Compare original input to repeat corrections
Significant differences
Corrections: increases in duration, pause
#/length, final fall
Increases in pitch accent for misrecognitions
Automatic recognition with decision trees,
boosting
Distinguish corrective/not (human level)
Key features: raw/normalized duration, pause
Identify specific word being corrected
Key features: highest pitch, widest pitch range

38
The Problem: Speech Topic Segmentation

Separate audio stream into component topics

On "World News Tonight" this Thursday, another
bad day on stock markets, all over the world
global economic anxiety. Another massacre in
Kosovo, the U.S. and its allies prepare to do
something about it. Very slowly. And the
millennium bug, Lubbock Texas prepares for
catastrophe, Bangalore, in India, sees only
profit.
39
Is It Possible in Mandarin?
40
Recognizing Shifts in Topic & Turn

Topic & Turn boundaries in English & Mandarin
Initial syllables
Significantly higher pitch, loudness than final
Lexical and prosodic cues
Cue words, tf-idf similarity; pitch, loudness,
silence
Automatic recognition with decision trees,
boosting
Voting to combine text, prosody, silence: 97%
accuracy
Key features
Pause; pitch, loudness contrast between syllables
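
One simple way to combine the three evidence sources is an unweighted majority vote over per-position boundary decisions. The slides do not specify the voting scheme, so the sketch below is an assumption, and the function and argument names are hypothetical.

```python
def vote_boundaries(text, prosody, silence):
    """Unweighted majority vote over three boundary detectors.

    Each argument is a sequence of booleans, one per candidate
    position; a boundary is kept when two or more detectors agree.
    """
    if not (len(text) == len(prosody) == len(silence)):
        raise ValueError("detectors must cover the same candidate positions")
    return [int(t) + int(p) + int(s) >= 2
            for t, p, s in zip(text, prosody, silence)]
```

For example, a position flagged by the prosody and silence detectors but not by the text detector would still be kept as a boundary.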

41
Conclusions & Opportunities

Prosody
Rich source of information for languages
Challenging due to variation, paucity of data
Can be successfully employed, with learning, to
improve language understanding
Pitch accent, tone, dialogue act, turn, topic,
Unrestricted conversational, multi-party,
multimodal speech much more challenging
Increased variability, interaction with
non-verbal evidence

42
Thanks

Dinoj Surendran, Siwei Wang, Yi Xu
V. Sindhwani, M. Belkin, P. Niyogi; I. Fischer,
J. Poland; T. Joachims; C-C. Chang, C. Lin
This work supported by NSF Grant 0414919
http://people.cs.uchicago.edu/levow/tai

43
Phrasing can disambiguate
I met Mary and Elena's mother at the mall yesterday.
[Figure labels: Mary, Elena's mother, mall]
One intonation phrase with relatively flat
overall pitch range.
44
Phrasing can disambiguate
I met Mary and Elena's mother at the mall yesterday.
[Figure labels: Mary, Elena's mother, mall]
Separate phrases, with expanded pitch movements.
45
Lists of numbers, nouns

twenty.eight.five
ninety.four.three
seventy.three.seven
forty.seven.seven
seventy.seven.seven

coffee cake and cream
chocolate ice cream and cake
fish fingers and bottles
cheese sandwiches and milk
cream buns and chocolate

from Prosody on the Web tutorial on chunking
46
Clustering

Pitch accent clustering
4-way distinction: 1000 samples, proportional
2-16 clusters constructed
Assign most frequent class label to each cluster
Classifier
Asymmetric k-lines
context-dependent kernel radii, non-spherical
> 78% accuracy
2 clusters: asymmetric k-lines best
Context effects
Vector w/ preceding context vs vector with no
context: comparable
