Prosody in Spoken Language Understanding


1
Prosody in Spoken Language Understanding
  • Gina Anne Levow
  • University of Chicago
  • January 4, 2008
  • NLP Winter School 2008

2
  • U: Give me the price for AT&T.
  • U: Give me the price for AT&T.
  • U: Give me the price for AT&T.
  • U: Give me the price for American Telephone and
    Telegraph.

3
Error Correction Spiral
  • U: Give me the price for AT&T.
  • S: Hewlett Packard was 83 ¾, up 2 ½ since
    yesterday.
  • U: Give me the price for AT&T.
  • S: Data General was at 10 ½, up a quarter.
  • U: Give me the price for AT&T.
  • S: Hewlett Packard was 83 ¾, up 2 ½.
  • U: Give me the price for American Telephone and
    Telegraph.
  • S: Excuse me?

4
Roadmap
  • Corrections: A motivating example
  • Defining prosody
  • Why prosody?
  • Challenges in prosody
  • Prosody in language understanding
  • Recognizing tone and pitch accent
  • Spoken corrections, Topic segmentation
  • Conclusions

5
Defining Prosody
  • Prosody
  • Phonetic phenomena in speech that span more than
    a single segment: suprasegmental
  • Prosody includes
  • Stress, focus, tone, intonation, length/pause,
    rhythm
  • Prosodic features include
  • Pitch: perceptual correlate of fundamental
    frequency
  • f0: rate of vocal fold vibration
  • Loudness/intensity, duration, segment quality

6
Why Prosody?
  • Prosody plays a crucial role
  • At all levels of language
  • Lexical, syntactic, pragmatic/discourse
  • Establishes meaning
  • Disambiguates sense and structure
  • Across language families
  • Common physiological, articulatory basis
  • In synthesis and recognition of fluent speech

7
Prosody and the Lexicon
  • Lexical: Determines word identity
  • Prosodic effect at the syllable level (minimal
    unit)
  • Lexical stress: syllable prominence
  • Combination of length, pitch movement, loudness
  • REcord (N) vs reCORD (V)
  • Pitch accent can differentiate words in some
    languages
  • Lexical tone: tone languages, e.g. Chinese,
    Punjabi
  • Pitch height (register) and/or shape (contour)

  • Ma (high): mother
  • Ma (rising): hemp
  • Ma (low): horse
  • Ma (falling): scold
8
Prosody and Syntax
  • Prosody can disambiguate structure
  • Associated with chunking and attachment
  • Not identical with syntactic phrase boundaries
  • Prosody is predictable from syntax, except when
    it isn't
  • Prosodic phrasing indicated by
  • Some combination of pause, change in pitch

9
Chunking, or phrasing
  • A1: I met Mary and Elena's mother at the mall
    yesterday.
  • A2: I met Mary and Elena's mother at the mall
    yesterday.

Example from Jennifer Venditti
10
Punctuation Prosody Humor
  • A panda goes into a restaurant and has a meal.
    Just before he leaves he takes out a gun and
    fires it. The irate restaurant owner says, "Why
    did you do that?" The panda replies, "I'm a
    panda. Look it up." The restaurateur goes to his
    dictionary and under "panda" finds: "black and
    white arboreal, bear-like creature; eats, shoots
    and leaves."

11
Prosody in Pragmatics & Discourse
  • Focus
  • Prominence, new information: pitch accent
  • October eleventh
  • Sentence type, dialogue act
  • Statement vs. declarative question: "It's
    raining(?)"
  • Discourse Structure (Topic), Emotion

from Shih, Prosody Learning and Generation
12
Challenges in Prosody I
  • Highly variable
  • Actual realization differs from ideal
  • Speaker variation
  • Gender, vocal tract differences, idiosyncrasy
  • Tonal coarticulation
  • Neighboring tones influence (like segmental)
  • Underlying fall can become rise
  • Parallel encoding
  • Effects at multiple levels realized
    simultaneously

13
Challenges in Prosody II
  • Challenges for learning
  • Lack of training data
  • Sparseness
  • Many prosodic phenomena are infrequent
  • E.g., non-declarative utterances, topic
    boundaries, contrastive accents, etc
  • Challenging for machine learning methods
  • Costs of labeling
  • Many prosodic events require expert labeling
  • Need large corpus to attest
  • Time-consuming, expensive

14
  • Context and Learning in Multilingual Tone and
    Pitch Accent Recognition

15
Strategy: Context
  • Common model across languages
  • Pure acoustic-prosodic model
  • No word label, POS, lexical stress info
  • English, Mandarin Chinese (also Cantonese,
    isiZulu)
  • Exploit contextual information
  • Features from adjacent syllables, phrase contour
  • Analyze impact of
  • Context position, context encoding, context type
  • > 12.5% reduction in error over no context

16
Data Collections
  • English (Ostendorf et al, 95)
  • Boston University Radio News Corpus, f2b
  • Manually annotated, aligned, syllabified
  • 4 Pitch accent labels, aligned to syllables
  • Mandarin
  • TDT2 Voice of America Mandarin Broadcast News
  • Automatically aligned, syllabified
  • 4 main tones, neutral

17
Local Feature Extraction
  • Uniform representation for tone, pitch accent
  • Motivated by Pitch Target Approximation Model
  • Tone/pitch accent target exponentially approached
  • Linear target height, slope (Xu et al., 99)
  • Base features
  • Pitch, Intensity: max, mean, min, range
  • (Praat, speaker normalized)
  • Pitch at 5 points across voiced region
  • Duration
  • Initial, final in phrase
  • Slope
  • Linear fit to last half of pitch contour
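
The base features listed above can be sketched in a few lines. This is a minimal illustration, not the extraction code used in the talk: it assumes the pitch track for one syllable's voiced region is already available (e.g. exported from Praat) and already speaker-normalized. The function names `base_pitch_features` and `ls_slope` and the nearest-sample choice for the 5 points are assumptions made for the example.

```python
from statistics import mean


def ls_slope(xs, ys):
    """Least-squares slope of ys against xs (0.0 if xs has no spread)."""
    mx, my = mean(xs), mean(ys)
    den = sum((x - mx) ** 2 for x in xs)
    if den == 0:
        return 0.0
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / den


def base_pitch_features(times, f0):
    """Summarize one syllable's voiced pitch contour.

    times: sample times in seconds, ascending.
    f0:    pitch values at those times (assumed speaker-normalized).
    """
    n = len(f0)
    if n < 2 or n != len(times):
        raise ValueError("need at least two aligned (time, f0) samples")
    feats = {
        "max": max(f0),
        "min": min(f0),
        "mean": mean(f0),
        "range": max(f0) - min(f0),
        "duration": times[-1] - times[0],
    }
    # Pitch at 5 evenly spaced points across the voiced region
    # (nearest available sample; an assumption of this sketch).
    for k in range(5):
        feats[f"p{k}"] = f0[round(k * (n - 1) / 4)]
    # Slope: linear fit to the last half of the pitch contour.
    feats["slope"] = ls_slope(times[n // 2:], f0[n // 2:])
    return feats
```

The same max/mean/min/range summary could be applied to an intensity track; the phrase-initial and phrase-final flags would come from the alignment rather than from the contour.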

18
Context Features
  • Local context
  • Extended features
  • Pitch max, mean, adjacent points of preceding,
    following syllables
  • Difference features
  • Difference between
  • Pitch max, mean, mid, slope
  • Intensity max, mean
  • Of preceding, following and current syllable
  • Phrasal context
  • Compute collection average phrase slope
  • Compute scalar pitch values, adjusted for slope
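
As a rough illustration of the two local-context encodings, the sketch below builds "extended" features (the neighbor's own values) and "difference" features (current minus neighbor) for one syllable. The function name `context_features`, the feature keys, and the zero-fill at phrase edges are all assumptions of this example, not details given on the slide.

```python
def context_features(prev, cur, nxt, keys=("max", "mean", "slope")):
    """Return extended and difference features for the current syllable.

    prev, cur, nxt: dicts of per-syllable features; prev or nxt may be
    None when the syllable has no neighbor on that side.
    """
    out = {}
    for side, neighbor in (("pre", prev), ("post", nxt)):
        for k in keys:
            if neighbor is None:
                # Assumption: missing context is encoded as zeros.
                out[f"ext_{side}_{k}"] = 0.0
                out[f"diff_{side}_{k}"] = 0.0
            else:
                out[f"ext_{side}_{k}"] = neighbor[k]          # extended
                out[f"diff_{side}_{k}"] = cur[k] - neighbor[k]  # difference
    return out
```

Restricting the output to the `pre` or `post` keys gives the preceding-only and following-only conditions compared in the experiments.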

19
Classification Experiments
  • Classifier: Support Vector Machine
  • Linear kernel
  • Multiclass formulation
  • SVMlight (Joachims), LibSVM (Chang & Lin 01)
  • 4:1 training / test splits
  • Experiments: Effects of
  • Context position: preceding, following, none,
    both
  • Context encoding: Extended/Difference
  • Context type: local, phrasal

20
Results: Local Context (accuracy, %)
Context          Mandarin Tone   English Pitch Accent
Full             74.5            81.3
Extend PrePost   74              80.7
Extend Pre       74              79.9
Extend Post      70.5            76.7
Diffs PrePost    75.5            80.7
Diffs Pre        76.5            79.5
Diffs Post       69              77.3
Both Pre         76.5            79.7
Both Post        71.5            77.6
No context       68.5            75.9
23
Discussion: Local Context
  • Any context information improves over none
  • Preceding context information consistently
    improves over none or following context
    information
  • English: Generally more context features are
    better
  • Mandarin: Following context can degrade
  • Little difference in encoding (Extend vs Diffs)
  • Consistent with phonetic analysis (Xu) that
    carryover coarticulation is greater than
    anticipatory

24
Results & Discussion: Phrasal Context (accuracy, %)
Phrase Context   Mandarin Tone   English Pitch Accent
Phrase           75.5            81.3
No Phrase        72              79.9
  • Phrase contour compensation enhances recognition
  • Simple strategy
  • Use of non-linear slope compensation may improve
    results
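
One simple reading of phrase contour compensation is: estimate an average linear pitch slope over all phrases in the collection, then remove that trend from each pitch value according to its time offset within the phrase. The sketch below follows that reading; the function names and the exact subtraction are assumptions of this example rather than the talk's implementation.

```python
from statistics import mean


def ls_slope(points):
    """Least-squares slope of pitch over time for (time, pitch) pairs."""
    ts = [t for t, _ in points]
    ps = [p for _, p in points]
    mt, mp = mean(ts), mean(ps)
    den = sum((t - mt) ** 2 for t in ts)
    if den == 0:
        return 0.0
    return sum((t - mt) * (p - mp) for t, p in points) / den


def average_phrase_slope(phrases):
    """Collection-level average slope.

    phrases: list of phrases, each a list of
             (seconds from phrase start, pitch) pairs.
    """
    return mean([ls_slope(phrase) for phrase in phrases])


def compensate(pitch, t, slope):
    """Remove the linear phrase trend from a pitch value observed at
    time t (seconds from phrase start)."""
    return pitch - slope * t
```

With a negative average slope (declination), syllables late in a phrase are shifted upward, so a late high tone and an early high tone become more comparable.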

25
Context Summary
  • Employ common acoustic representation
  • Tone (Mandarin), pitch accent (English)
  • Cantonese: 64%, 68% with RBF kernel
  • SVM classifiers, linear kernel: 76%, 81%
  • Local context effects
  • Up to > 20% relative reduction in error
  • Preceding context: greatest contribution
  • Carryover vs anticipatory
  • Phrasal context effects
  • Compensation for phrasal contour improves
    recognition
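
The relative error reduction can be checked directly from the local-context results table, taking the "No context" row as the baseline and the best row for each language as the improved system. The helper name below is hypothetical.

```python
def relative_error_reduction(baseline_acc, improved_acc):
    """Relative reduction in error (in percent), given two accuracies
    expressed in percent."""
    baseline_err = 100.0 - baseline_acc
    improved_err = 100.0 - improved_acc
    return 100.0 * (baseline_err - improved_err) / baseline_err


# Accuracies taken from the local-context results table.
mandarin = relative_error_reduction(68.5, 76.5)  # No context vs Diffs Pre
english = relative_error_reduction(75.9, 81.3)   # No context vs Full
print(f"Mandarin: {mandarin:.1f}%  English: {english:.1f}%")
```

This gives about 25.4% for Mandarin tone and 22.4% for English pitch accent, both above the 20% mark.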

26
Strategy: Training
  • Challenge
  • Can we use the underlying acoustic structure of
    the language through unlabeled examples to
    reduce the need for expensive labeled training
    data?
  • Exploit semi-supervised and unsupervised learning
  • Semi-supervised: Laplacian SVM
  • K-means and asymmetric k-lines clustering
  • Substantially outperform baselines
  • Can approach supervised levels

27
Data Collections & Processing
  • English (as before)
  • Boston University Radio News Corpus, f2b
  • Binary: Unaccented vs accented
  • 4-way: Unaccented, High, Downstepped High, Low
  • Mandarin
  • Lab speech data (Xu, 1999)
  • 5-syllable utterances: vary tone, focus position
  • In-focus, pre-focus, post-focus
  • TDT2 Voice of America Mandarin Broadcast News
  • 4-way: High, Mid-rising, Low, High falling
  • isiZulu (as before)
  • Read web sentences
  • 2-way: High vs low

28
Semi-supervised Learning
  • Approach
  • Employ small amount of labeled data
  • Exploit information from additional, presumably
    more available, unlabeled data
  • Few prior examples; several weakly supervised
    (Wong et al., 05)
  • Classifier
  • Laplacian SVM (Sindhwani, Belkin & Niyogi 05)
  • Semi-supervised variant of SVM
  • Exploits unlabeled examples
  • RBF kernel, typically 6 nearest neighbors,
    transductive
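
Laplacian SVMs use unlabeled data through a neighborhood graph over all examples, labeled and unlabeled alike. The sketch below only builds that graph and its Laplacian L = D - W from a k-nearest-neighbor relation; it does not train the SVM. The binary edge weights, the symmetrization rule, and the function name are assumptions of this example.

```python
import math


def knn_graph_laplacian(points, k=6):
    """Graph Laplacian L = D - W of a symmetrized k-NN graph.

    points: list of equal-length numeric tuples (labeled and unlabeled).
    k:      number of nearest neighbors per point.
    """
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: math.dist(points[i], points[j]))
        for j in others[:k]:
            # Assumption: binary weights, edge kept if either point
            # lists the other among its k nearest neighbors.
            W[i][j] = W[j][i] = 1.0
    degree = [sum(row) for row in W]
    return [
        [(degree[i] if i == j else 0.0) - W[i][j] for j in range(n)]
        for i in range(n)
    ]
```

A classifier that is penalized for changing quickly along the edges of this graph is pushed to give nearby points, including unlabeled ones, the same label.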

29
Experiments
  • Pitch accent recognition
  • Binary classification Unaccented/Accented
  • 1000 instances, proportionally sampled
  • Labeled training: 200 unaccented, 100 accented
  • 80% accuracy (cf. 84% w/ SVM with 15x the labeled
    data)
  • Mandarin tone recognition
  • 4-way classification: n(n-1)/2 binary classifiers
  • 400 instances, balanced; 160 labeled
  • Clean lab speech, in-focus: 94%
  • cf. 99% w/ SVM, 1000s of training samples; 85%
    w/ SVM, 160 training samples
  • Broadcast news: 70%
  • cf. < 50% w/ SVM, 160 training samples
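
The n(n-1)/2 formulation trains one binary classifier per pair of classes and lets them vote; for 4 tones that is 6 classifiers. The sketch below shows only the voting step. The nearest-prototype stand-ins and their (height, slope) coordinates are toy assumptions used in place of trained binary classifiers.

```python
import math
from collections import Counter
from itertools import combinations


def one_vs_one_predict(x, classes, binary_classifiers):
    """Predict by majority vote over all pairwise classifiers.

    binary_classifiers maps each pair (a, b) to a function that
    returns either a or b for the input x.
    """
    votes = Counter()
    for pair in combinations(classes, 2):
        votes[binary_classifiers[pair](x)] += 1
    return votes.most_common(1)[0][0]


# Toy stand-ins for trained classifiers: pick the nearer of two
# hypothetical (pitch height, slope) prototypes.
prototypes = {
    "high": (1.0, 0.0),
    "rising": (0.0, 1.0),
    "low": (-1.0, 0.0),
    "falling": (0.0, -1.0),
}
classes = list(prototypes)


def make_pair_classifier(a, b):
    def classify(x):
        if math.dist(x, prototypes[a]) <= math.dist(x, prototypes[b]):
            return a
        return b
    return classify


classifiers = {
    (a, b): make_pair_classifier(a, b) for a, b in combinations(classes, 2)
}
```

Calling `one_vs_one_predict((0.1, 0.9), classes, classifiers)` collects six votes, one per pair of tones.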

30
Unsupervised Learning
  • Question
  • Can we identify the tone structure of a language
    from the acoustic space without training?
  • Analogous to language acquisition
  • Significant recent research in unsupervised
    clustering
  • Established approaches: k-means
  • Spectral clustering (Shi & Malik 97; Fischer &
    Poland 2004): asymmetric k-lines
  • Little research for tone
  • Self-organizing maps (Gauthier et al., 2005)
  • Tones identified in lab speech using f0
    velocities
  • Cluster-based bootstrapping (Narayanan et al.,
    2006)
  • Prominence clustering (Tamburini 05)

31
Contrasting Clustering
  • Contrasts
  • Clustering: 2-16 clusters, label w/ most frequent
    class
  • 3 Spectral approaches
  • Perform spectral decomposition of affinity matrix
  • Asymmetric k-lines (Fischer & Poland 2004)
  • Symmetric k-lines (Fischer & Poland 2004)
  • Laplacian Eigenmaps (Belkin, Niyogi & Sindhwani
    2004)
  • Binary weights, k-lines clustering
  • K-means: Standard Euclidean distance
  • # of clusters: 2-16
  • Best results: > 78%
  • 2 clusters: asymmetric k-lines > k-means
  • Larger numbers of clusters: all similar
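
The evaluation scheme for the clustering contrasts (cluster first, then label each cluster with its most frequent class) can be sketched with the k-means baseline. This is a plain Lloyd's-algorithm illustration with a deterministic "first k points" initialization chosen for reproducibility; it is not the k-lines method, and all names are assumptions of this example.

```python
import math
from collections import Counter


def kmeans(points, k, iters=50):
    """Lloyd's k-means with Euclidean distance.

    Returns one cluster index per point. Centers start at the first
    k points (an assumption made to keep the sketch deterministic).
    """
    centers = [list(p) for p in points[:k]]
    assign = None
    for _ in range(iters):
        new_assign = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_assign == assign:
            break  # converged
        assign = new_assign
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def majority_label_accuracy(assign, labels):
    """Label each cluster with its most frequent class and score the
    resulting assignment against the true labels."""
    correct = 0
    for c in set(assign):
        counts = Counter(l for a, l in zip(assign, labels) if a == c)
        correct += counts.most_common(1)[0][1]
    return correct / len(labels)
```

Because each cluster takes its majority class, accuracy under this scheme can only stay the same or rise as clusters are split further, which is worth keeping in mind when comparing 2 clusters against 16.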

32
Contrasting Learners
33
Tone Clustering I
  • Mandarin: four tones
  • 400 samples, balanced
  • 2-phase clustering: 2-5 clusters each
  • Asymmetric k-lines, k-means clustering
  • Clean read speech
  • In-focus syllables: 87% (cf. 99% supervised)
  • In-focus and pre-focus: 77% (cf. 93% supervised)
  • Broadcast news: 57% (cf. 74% supervised)
  • K-means requires more clusters to reach k-lines
    level

34
Tone Structure
First phase of clustering splits high/rising from
low/falling by slope; second phase splits by pitch height.
35
Conclusions
  • Common prosodic framework for tone and pitch
    accent recognition
  • Contextual modeling enhances recognition
  • Local context and broad phrase contour
  • Carryover coarticulation has larger effect for
    Mandarin
  • Exploiting unlabeled examples for recognition
  • Semi- and Un-supervised approaches
  • Best cases approach supervised levels with less
    training
  • Exploits acoustic structure of tone and accent
    space

36
Error Correction Spiral
  • U: Give me the price for AT&T.
  • S: Hewlett Packard was 83 ¾, up 2 ½ since
    yesterday.
  • U: Give me the price for AT&T.
  • S: Data General was at 10 ½, up a quarter.
  • U: Give me the price for AT&T.
  • S: Hewlett Packard was 83 ¾, up 2 ½.
  • U: Give me the price for American Telephone and
    Telegraph.
  • S: Excuse me?

37
Recognizing Spoken Corrections
  • Spoken Corrections
  • Recognize user attempts to correct ASR failures
  • Compare original input to repeat corrections
  • Significant differences
  • Corrections: increases in duration, pause/length,
    final fall
  • Increases in pitch accent for misrecognitions
  • Automatic recognition with decision trees,
    boosting
  • Distinguish corrective/not (human level)
  • Key features: raw/normalized duration, pause
  • Identify specific word being corrected
  • Key features: highest pitch, widest pitch range

38
The Problem: Speech Topic Segmentation
  • Separate audio stream into component topics

On "World News Tonight" this Thursday, another
bad day on stock markets, all over the world
global economic anxiety. Another massacre in
Kosovo, the U.S. and its allies prepare to do
something about it. Very slowly. And the
millennium bug, Lubbock Texas prepares for
catastrophe, Bangalore, in India, sees only
profit.
39
Is It Possible in Mandarin?
40
Recognizing Shifts in Topic & Turn
  • Topic & turn boundaries in English & Mandarin
  • Initial syllables
  • Significantly higher pitch, loudness than final
  • Lexical and prosodic cues
  • Cue words, tf-idf similarity; pitch, loudness,
    silence
  • Automatic recognition with decision trees,
    boosting
  • Voting to combine text, prosody, silence: 97%
    accuracy
  • Key features
  • Pause; pitch, loudness contrast between syllables
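
The lexical tf-idf similarity cue can be illustrated as cosine similarity between the word vectors of adjacent stretches of transcript: low similarity across a candidate boundary suggests a topic shift. The weighting below (raw term frequency times log(N/df)) and the function names are assumptions of this sketch; the talk does not specify the variant used.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict) per tokenized document."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    return [
        {w: tf * math.log(n / df[w]) for w, tf in Counter(doc).items()}
        for doc in docs
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(w, 0.0) for w, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy segments loosely based on the news example earlier in the talk.
segments = [
    ["stock", "markets", "bad", "day"],
    ["stock", "markets", "economic", "anxiety"],
    ["kosovo", "massacre", "allies", "prepare"],
]
vectors = tfidf_vectors(segments)
within_topic = cosine(vectors[0], vectors[1])
across_topic = cosine(vectors[1], vectors[2])
```

Here `within_topic` is positive while `across_topic` is zero, so the drop in similarity marks the second boundary as the more likely topic shift.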

41
Conclusions & Opportunities
  • Prosody
  • Rich source of information for languages
  • Challenging due to variation, paucity of data
  • Can be successfully employed, with learning, to
    improve language understanding
  • Pitch accent, tone, dialogue act, turn, topic,
  • Unrestricted conversational, multi-party,
    multimodal speech much more challenging
  • Increased variability, interaction with
    non-verbal evidence

42
Thanks
  • Dinoj Surendran, Siwei Wang, Yi Xu
  • V. Sindhwani, M. Belkin, P. Niyogi; I. Fischer &
    J. Poland; T. Joachims; C.-C. Chang & C.-J. Lin
  • This work supported by NSF Grant 0414919
  • http://people.cs.uchicago.edu/levow/tai

43
Phrasing can disambiguate
I met Mary and Elena's mother at the mall
yesterday
[Figure labels: Mary, Elena's mother, mall]
One intonation phrase with relatively flat
overall pitch range.
44
Phrasing can disambiguate
I met Mary and Elena's mother at the mall
yesterday
[Figure labels: Mary, Elena's mother, mall]
Separate phrases, with expanded pitch movements.
45
Lists of numbers, nouns
  • twenty.eight.five
  • ninety.four.three
  • seventy.three.seven
  • forty.seven.seven
  • seventy.seven.seven
  • coffee cake and cream
  • chocolate ice cream and cake
  • fish fingers and bottles
  • cheese sandwiches and milk
  • cream buns and chocolate

from Prosody on the Web tutorial on chunking
46
Clustering
  • Pitch accent clustering
  • 4-way distinction: 1000 samples, proportional
  • 2-16 clusters constructed
  • Assign most frequent class label to each cluster
  • Classifier
  • Asymmetric k-lines
  • Context-dependent kernel radii, non-spherical
  • > 78% accuracy
  • 2 clusters: asymmetric k-lines best
  • Context effects
  • Vector w/ preceding context vs vector with no
    context: comparable