Transcript and Presenter's Notes

Title: Understanding Spoken Corrections in Human-Computer Dialogue

1
Understanding Spoken Corrections in Human-Computer Dialogue
Gina-Anne Levow
University of Chicago
http://www.cs.uchicago.edu/~levow
MAICS, April 1, 2006
2
Error Correction Spiral
  • U: Give me the price for AT&T.
  • S: Hewlett-Packard was 83 ¾, up 2 ½ since
    yesterday.
  • U: Give me the price for AT&T.
  • S: Data General was at 10 ½, up a quarter.
  • U: Give me the price for AT&T.
  • S: Hewlett-Packard was 83 ¾, up 2 ½.
  • U: Give me the price for American Telephone and
    Telegraph.
  • S: Excuse me?

3
Identifying Corrections
  • Most "Reasoning" Approaches
  • Correction → Violates Task / Belief Constraint
  • Requires Tight Task, Belief Model
  • Often Requires Accurate Recognition
  • This Approach
  • Uses Acoustic or Lexical Information
  • Content- and Context-Independent

4
Accomplishments
  • Corrections vs. Original Inputs
  • Significant Differences in Duration, Pause, Pitch
  • Corrections vs. Recognizer Models
  • Contrasts in Phonology and Duration
  • Correction Recognition
  • Decision Tree Classifier: 65-77% accuracy
  • Human Baseline: 80%

5
Why Corrections?
  • Recognizer Error Rates: 25-40%
  • REAL meaning of utterance
  • user intent
  • Corrections misrecognized 2.5X as often
  • Hard to Correct → Poor Quality System

6
Why it's Necessary
  • Error Repair Requires Detection
  • Errors can be very difficult to detect
  • E.g. Misrecognitions
  • Focus Repair Efforts
  • Corrections Decrease Recognition Accuracy
  • Adaptation Requires Identification

7
Why is it Hard?
  • Recognition Failures and Errors
  • Repetition ↔ Correction
  • 500 Strings → 6700 Instances (80%)
  • Speech Recognition Technology
  • Variation - Undesirable, Suppressed

8
(No Transcript)
9
Roadmap
  • Data Collection and Description
  • SpeechActs System Field Trial
  • Characterizing Corrections
  • Original-Repeat Pair Data Analysis
  • Acoustic and Phonological Measures & Results
  • Recognizing Corrections
  • Conclusions and Future Work

10
SpeechActs System
  • Speech-Only System over the Telephone
  • (Yankelovich, Levow & Marx 1995)
  • Access to Common Desktop Applications
  • Email, Calendar, Weather, Stock Quotes
  • BBN's Hark Speech Recognition,
    Centigram TruVoice Speech Synthesis
  • In-house Natural Language Analysis,
    Back-end Applications, Dialog Manager

11
System Data Overview
  • Approximately 60 hours of interactions
  • Digitized at 8kHz, 8-bit mu-law encoding
  • 18 subjects: 14 novices, 4 experts, single shots
  • 7529 user utterances, 1961 errors
  • P(error | correct) = 18%; P(error | error) = 44%
    (estimation sketched below)
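
A minimal sketch (not the original analysis code) of how these two conditional rates could be estimated from a per-utterance outcome sequence; the example labels are hypothetical:

    # Hypothetical outcome labels in dialogue order; not data from the field trial.
    def conditional_error_rates(outcomes):
        after_correct = [b for a, b in zip(outcomes, outcomes[1:]) if a == "correct"]
        after_error = [b for a, b in zip(outcomes, outcomes[1:]) if a == "error"]
        return (after_correct.count("error") / len(after_correct),   # P(error | correct)
                after_error.count("error") / len(after_error))       # P(error | error)

    print(conditional_error_rates(
        ["correct", "error", "error", "correct", "correct", "error", "correct"]))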

12
System Recognition Error Types
  • Rejection Errors - Below Recognition Level
  • U: Switch to Weather
  • S (Heard): <nothing>
  • S (Said): Huh?
  • Misrecognition Errors - Substitution in Text
  • U: Switch to Weather
  • S (Heard): Switch to Calendar
  • S (Said): On Friday, May 4, you have talk at
    Chicago.
  • 1250 Rejections (about 2/3)
  • 706 Misrecognitions (about 1/3)

13
Analysis Data
  • 300 Original Input-Repeat Correction Pairs
  • Lexically Matched, Same Speaker
  • Example
  • S (Said): Please say mail, calendar, weather.
  • U: Switch to Weather. (Original)
  • S (Said): Huh?
  • U: Switch to Weather. (Repeat Correction)

14
Analysis Duration
  • Automatic Forced Alignment, Hand-Edited
  • Total: Speech Onset to End of Utterance
  • Speech: Total minus Internal Silence
  • Contrasts Original Input/Repeat Correction
  • Total Increases 12.5% on average
  • Speech Increases 9% on average
    (duration measures sketched below)
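
A minimal sketch of the two duration measures, assuming word- and silence-level segments from a forced alignment; the segment format and example values are hypothetical, not the SpeechActs tooling:

    # Hypothetical alignment format: (label, start_sec, end_sec); "sil" marks silence.
    # Total = speech onset to end of utterance; Speech = Total minus internal silence.
    def duration_measures(segments):
        words = [s for s in segments if s[0] != "sil"]
        onset, end = words[0][1], words[-1][2]
        internal_sil = sum(e - s for lbl, s, e in segments
                           if lbl == "sil" and s >= onset and e <= end)
        return end - onset, (end - onset) - internal_sil  # (Total, Speech)

    # Example: "switch to weather" with a short internal pause
    print(duration_measures([("switch", 0.10, 0.45), ("sil", 0.45, 0.52),
                             ("to", 0.52, 0.60), ("weather", 0.60, 1.05)]))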

15
(No Transcript)
16
Analysis Pause
  • Utterance-Internal Silence > 10 ms
  • Not Preceding Unvoiced Stops (t), Affricates (ch)
  • Contrasts Original Input/Repeat Correction
  • Absolute: 46% Increase
  • Ratio of Silence to Total Duration: 58% Increase
    (pause measures sketched below)
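
Using the same hypothetical segment format as the duration sketch, a sketch of the two pause measures; the exclusion of silences before unvoiced stops and affricates is approximated here by a check on the following word's first letters:

    # Hypothetical sketch: sum utterance-internal silences longer than 10 ms,
    # skipping those that immediately precede an unvoiced stop or affricate.
    UNVOICED_ONSETS = ("t", "p", "k", "ch")

    def pause_measures(segments):
        words = [s for s in segments if s[0] != "sil"]
        onset, end = words[0][1], words[-1][2]
        pause = 0.0
        for i, (lbl, s, e) in enumerate(segments):
            if lbl == "sil" and onset < s and e < end and (e - s) > 0.010:
                nxt = segments[i + 1][0] if i + 1 < len(segments) else ""
                if not nxt.startswith(UNVOICED_ONSETS):
                    pause += e - s
        return pause, pause / (end - onset)  # (absolute pause, silence-to-total ratio)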

17
(No Transcript)
18
Pitch Tracks
19
Analysis Pitch I
  • ESPS/Waves Pitch Tracker, Hand-Edited
  • Normalized Per-Subject (sketched below)
  • (Value - Subject Mean) / (Subject Std Dev)
  • Pitch Maximum, Minimum, Range
  • Whole Utterance & Last Word
  • Contrasts Original Input/Repeat Correction
  • Significant Decrease in Pitch Minimum
  • Whole Utterance & Last Word
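
A minimal sketch of the per-subject normalization above, z-scoring each pitch value against that subject's mean and standard deviation; the subject ID and values are hypothetical:

    from statistics import mean, stdev

    # Normalized pitch = (value - subject mean) / (subject std dev), per subject.
    def normalize_by_subject(pitch_by_subject):
        normalized = {}
        for subject, values in pitch_by_subject.items():
            mu, sd = mean(values), stdev(values)
            normalized[subject] = [(v - mu) / sd for v in values]
        return normalized

    print(normalize_by_subject({"s01": [180.0, 210.0, 195.0, 250.0]}))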

20
Analysis Pitch II
21
Analysis Overview
  • Significant Differences: Original vs. Correction
  • Duration & Pause
  • Significant Increases, Original vs. Correction
  • Pitch
  • Significant Decrease in Pitch Minimum
  • Increase in Final Falling Contours
  • Conversational-to-Clear Speech Shift

22
Analysis Phonology
  • Reduced Form → Citation Form
  • Schwa to unreduced vowel (20%)
  • E.g. "Switch t' mail" → "Switch to mail"
  • Unreleased or Flapped 't' → Released 't' (50%)
  • E.g. "Read message tweny" → "Read message twenty"
  • Citation Form → Hyperclear Form
  • Extreme lengthening, calling intonation (20%)
  • E.g. "Goodbye" → "Goodba-aye"

23
Durational Model Contrasts I
Original Inputs: Final vs. Non-final position
[Histogram: non-final vs. final words; % of words by departure from model
mean in std dev]
24
Durational Model Contrasts
Compare to SR model (Chung 1995): phrase-final
lengthening. Words in final position are
significantly longer than non-final words and
than the model prediction; all are significantly
longer in correction utterances.
[Histogram: non-final vs. final words; % of words by departure from model
mean in std dev]
25
Durational Model Contrasts II
Correction Utterances: Greater Increases
[Histogram: correction utterances, non-final vs. final words; % of words by
departure from model mean in std dev (measure sketched below)]
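
A sketch of the departure measure shown in these plots, assuming the recognizer duration model supplies a per-word mean and standard deviation (cf. the Chung 1995 comparison); the model values and word list are hypothetical:

    # Departure of an observed word duration from the duration model, in model
    # standard deviations, tallied separately for non-final and final positions.
    def model_departures(words, model):
        # words: list of (word, observed_dur_sec, is_final); model: word -> (mean, std)
        non_final, final = [], []
        for word, dur, is_final in words:
            mu, sd = model[word]
            (final if is_final else non_final).append((dur - mu) / sd)
        return non_final, final

    print(model_departures(
        [("switch", 0.40, False), ("to", 0.12, False), ("weather", 0.70, True)],
        {"switch": (0.35, 0.05), "to": (0.10, 0.03), "weather": (0.50, 0.08)}))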
26
Analysis Overview II
  • Original and Correction vs. Recognizer Model
  • Phonology
  • Reduced Form → Citation Form → Hyperclear Form
  • Conversational to (Hyper) Clear Shift
  • Duration
  • Contrast between Final and Non-final Words
  • Departure from ASR Model
  • Increase for Corrections, especially Final Words

27
Automatic Recognition of Spoken Corrections
  • Machine learning classifier
  • Decision Trees
  • Trained on labeled examples
  • Features: Duration, Pause, Pitch
  • Evaluation
  • Overall: 65% accuracy (incl. text features)
  • Absolute and normalized duration
  • Misrecognitions: 77% accuracy (incl. text
    features)
  • Absolute and normalized duration, pitch
  • 65% accuracy with acoustic features only
  • Approaches human baseline of 79.4%

28
Accomplishments
  • Contrasts between Originals and Corrections
  • Significant Differences in Duration, Pause, Pitch
  • Conversational-to-Clear Speech Shifts
  • Shifts away from Recognizer Models
  • Corrections Recognized at 65-77% Accuracy
  • Near-human Levels

29
Future Work
  • Modify ASR Duration Model for Correction
  • Reflect Phonological and Duration Change
  • Identify Locus of Correction for Misrecognitions
  • U: Switch to Weather
  • S (Heard): Switch to Calendar
  • S (Said): On Friday, May 4, you have talk at
    Chicago.
  • U: Switch to WEATHER!
  • Preliminary tests
  • 26/28 Corrected Words Detected, 2 False Alarms

30
Future Work
  • Identify and Exploit Cues to Discourse and
    Information Structure
  • Incorporate Prosodic Features into Model of
    Spoken Dialogue
  • Exploit Text and Acoustic Features for
    Segmentation of Broadcast Audio and Video
  • Necessary first phase for information retrieval
  • Assess language independence
  • First phase: Segmentation of Mandarin and
    Cantonese Broadcast News (in collaboration with
    CUHK)

31
Classification of Spoken Corrections
  • Decision Trees
  • Intelligible, Robust to irrelevant attributes
  • Drawback: Rectangular decision boundaries; don't
    combine features
  • Features (38 total, 15 in best trees)
  • Duration, pause, pitch, and amplitude
  • Normalized and absolute
  • Training and Testing (setup sketched below)
  • 50% Original Inputs, 50% Repeat Corrections
  • 7-way cross-validation
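
A minimal sketch of this training setup using scikit-learn (assumed tooling, not what the original study used): a decision tree over the acoustic feature vectors with 7-fold cross-validation on a balanced set; the feature matrix here is random placeholder data:

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Placeholder data: 300 utterances x 38 duration/pause/pitch/amplitude features;
    # labels: 0 = original input, 1 = repeat correction, balanced 50/50.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 38))
    y = np.repeat([0, 1], 150)

    clf = DecisionTreeClassifier(min_samples_leaf=10)  # cf. "minimum of 10 per branch"
    print(cross_val_score(clf, X, y, cv=7).mean())     # 7-way cross-validation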

32
Recognizer Results (Overall)
  • Tree Size: 57 (unpruned), 37 (pruned)
  • Minimum of 10 nodes per branch required
  • First Split: Normalized Duration (All Trees)
  • Most Important Features
  • Normalized & Absolute Duration, Speaking Rate
  • 65% Accuracy vs. Null Baseline of 50%

33
Example Tree
34
Classifier Results Misrecognitions
  • Most important features
  • Absolute and Normalized Duration
  • Pitch Minimum and Pitch Slope
  • 77% accuracy (with text features)
  • 65% (acoustic features only)
  • Null baseline: 50%
  • Human baseline: 79.4% (Hauptmann & Rudnicky 1987)

35
Misrecognition Classifier
36
Background Related Work
  • Detecting and Preventing Miscommunication
  • (Smith & Gordon 96, Traum & Dillenbourg 96)
  • Identifying Discourse Structure in Speech
  • Prosody (Grosz & Hirschberg 92, Swerts &
    Ostendorf 95)
  • Cue words & prosody (Taylor et al 96,
    Hirschberg & Litman 93)
  • Self-repairs (Heeman & Allen 94, Bear et al 92)
  • Acoustic-only (Nakatani & Hirschberg 94,
    Shriberg et al 97)
  • Speaking Modes (Ostendorf et al 96, Daly & Zue
    96)
  • Spoken Corrections
  • Human baseline (Rudnicky & Hauptmann 87)
  • (Oviatt et al 96, 98; Levow 98, 99; Hirschberg et
    al 99, 00)
  • Other languages (Bell & Gustafson 99, Pirker et
    al 99, Fischer 99)

37
Learning Method Options
  • (K)-Nearest Neighbor
  • Need Commensurable Attribute Values
  • Sensitive to Irrelevant Attributes
  • Labeling Speed - Training Set Size
  • Neural Nets
  • Hard to Interpret
  • Can Require More Computation & Training Data
  • Fast, Accurate when Trained
  • Decision Trees ← selected approach
  • Intelligible, Robust to Irrelevant Attributes
  • Fast, Compact when Trained
  • Drawback: Rectangular Decision Boundaries, Don't
    Test Feature Combinations