Title: Understanding Spoken Corrections in Human-Computer Dialogue. Gina-Anne Levow, University of Chicago. http://www.cs.uchicago.edu/~levow. MAICS, April 1, 2006
1Understanding Spoken Corrections in Human-Computer Dialogue
Gina-Anne Levow
University of Chicago
http://www.cs.uchicago.edu/~levow
MAICS, April 1, 2006
2Error Correction Spiral
- U: Give me the price for ATT.
- S: Hewlett-Packard was 83 ¾, up 2 ½ since yesterday.
- U: Give me the price for ATT.
- S: Data General was at 10 ½, up a quarter.
- U: Give me the price for ATT.
- S: Hewlett-Packard was 83 ¾, up 2 ½.
- U: Give me the price for American Telephone and Telegraph.
- S: Excuse me?
3Identifying Corrections
- Most "Reasoning" Approaches
- Correction -> Violates Task / Belief Constraint
- Requires Tight Task, Belief Model
- Often Requires Accurate Recognition
- This Approach
- Uses Acoustic or Lexical Information
- Content, Context Independent
4Accomplishments
- Corrections vs. Original Inputs
- Significant Differences: Duration, Pause, Pitch
- Corrections vs. Recognizer Models
- Contrasts in Phonology and Duration
- Correction Recognition
- Decision Tree Classifier: 65-77% accuracy
- Human Baseline: 80%
5Why Corrections?
- Recognizer Error Rates: 25-40%
- REAL meaning of utterance
- user intent
- Corrections misrecognized 2.5x as often
- Hard to Correct -> Poor-Quality System
6Why it's Necessary
- Error Repair Requires Detection
- Errors can be very difficult to detect
- E.g. Misrecognitions
- Focus Repair Efforts
- Corrections Decrease Recognition Accuracy
- Adaptation Requires Identification
7Why is it Hard?
- Recognition Failures and Errors
- Repetition <> Correction
- 500 Strings -> 6700 Instances (80%)
- Speech Recognition Technology
- Variation - Undesirable, Suppressed
9Roadmap
- Data Collection and Description
- SpeechActs System Field Trial
- Characterizing Corrections
- Original-Repeat Pair Data Analysis
- Acoustic and Phonological Measures Results
- Recognizing Corrections
- Conclusions and Future Work
10SpeechActs System
- Speech-Only System over the Telephone
- (Yankelovich, Levow & Marx 1995)
- Access to Common Desktop Applications
- Email, Calendar, Weather, Stock Quotes
- BBN's Hark Speech Recognition, Centigram TruVoice Speech Synthesis
- In-house Natural Language Analysis, Back-end Applications, Dialog Manager
11System Data Overview
- Approximately 60 hours of interactions
- Digitized at 8kHz, 8-bit mu-law encoding
- 18 subjects: 14 novices, 4 experts, single sessions
- 7529 user utterances, 1961 errors
- P(error | correct) = 18%; P(error | error) = 44%
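These conditional error rates explain the "error correction spiral" of slide 2: once one utterance fails, the next is much more likely to fail too. A back-of-envelope sketch (my illustration, not from the talk; the function name is mine):

```python
# Reported conditional error rates from the SpeechActs field trial.
P_ERR_GIVEN_CORRECT = 0.18  # P(error | previous utterance correct)
P_ERR_GIVEN_ERROR = 0.44    # P(error | previous utterance in error)

def p_error_streak(length):
    """Probability that, right after a correct recognition, the next
    `length` utterances are all misrecognized (an error spiral)."""
    return P_ERR_GIVEN_CORRECT * P_ERR_GIVEN_ERROR ** (length - 1)

for k in range(1, 4):
    print(f"P(spiral of length >= {k}) = {p_error_streak(k):.4f}")
```

Because errors beget errors (44% vs. 18%), spirals decay far more slowly than independent errors would predict.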
12System Recognition Error Types
- Rejection Errors - Below Recognition Level
- U: Switch to Weather
- S (Heard): <nothing>
- S (Said): Huh?
- Misrecognition Errors - Substitution in Text
- U: Switch to Weather
- S (Heard): Switch to Calendar
- S (Said): On Friday, May 4, you have a talk at Chicago.
- 1250 Rejections (2/3)
- 706 Misrecognitions (1/3)
13Analysis Data
- 300 Original Input-Repeat Correction Pairs
- Lexically Matched, Same Speaker
- Example
- S (Said): Please say mail, calendar, weather.
- U: Switch to Weather. (Original)
- S (Said): Huh?
- U: Switch to Weather. (Repeat)
14Analysis Duration
- Automatic Forced Alignment, Hand-Edited
- Total: Speech Onset to End of Utterance
- Speech: Total - Internal Silence
- Contrasts: Original Input vs. Repeat Correction
- Total Increases 12.5% on average
- Speech Increases 9% on average
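The two duration measures above can be computed directly from a forced alignment. A minimal sketch (the segment format, the "sil" label, and the function name are my assumptions, not the talk's tooling):

```python
# Sketch: an alignment is a list of (start_sec, end_sec, label) segments,
# with silence segments labeled "sil".

def duration_measures(alignment):
    """Return (total, speech): total runs from speech onset to the end
    of the utterance; speech is total minus utterance-internal silence."""
    spoken = [seg for seg in alignment if seg[2] != "sil"]
    onset, offset = spoken[0][0], spoken[-1][1]
    total = offset - onset
    internal_sil = sum(e - s for s, e, lab in alignment
                       if lab == "sil" and s >= onset and e <= offset)
    return total, total - internal_sil

# Toy utterance: "switch" [0.10-0.45], pause, "to" [0.55-0.70], "mail" [0.70-1.20]
align = [(0.00, 0.10, "sil"), (0.10, 0.45, "switch"),
         (0.45, 0.55, "sil"), (0.55, 0.70, "to"), (0.70, 1.20, "mail")]
total, speech = duration_measures(align)
```

Leading silence before speech onset is excluded from both measures; only pauses between words count against Speech.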
16Analysis Pause
- Utterance-Internal Silence > 10 ms
- Not Preceding Unvoiced Stops (t) or Affricates (ch)
- Contrasts: Original Input vs. Repeat Correction
- Absolute: 46% Increase
- Ratio of Silence to Total Duration: 58% Increase
18Pitch Tracks
19Analysis Pitch I
- ESPS/Waves Pitch Tracker, Hand-Edited
- Normalized Per-Subject:
- (Value - Subject Mean) / (Subject Std Dev)
- Pitch Maximum, Minimum, Range
- Whole Utterance & Last Word
- Contrasts: Original Input vs. Repeat Correction
- Significant Decrease in Pitch Minimum
- Whole Utterance & Last Word
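The per-subject normalization on this slide is a standard z-score, so pitch values from speakers with very different baseline F0 become comparable. A minimal sketch (the function name is mine, and I assume the population standard deviation; the talk does not say whether population or sample std dev was used):

```python
from statistics import mean, pstdev

def normalize_pitch(values):
    """Z-score one subject's pitch values against that subject's own
    mean and standard deviation: z = (value - mean) / std dev."""
    mu = mean(values)
    sigma = pstdev(values)  # population std dev; sample std dev also plausible
    return [(v - mu) / sigma for v in values]

# Toy F0 values (Hz) for one subject.
z = normalize_pitch([180.0, 200.0, 220.0])
```

After normalization, "pitch minimum" contrasts between originals and corrections are measured in subject-relative standard deviations rather than raw Hz.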
20Analysis Pitch II
21Analysis Overview
- Significant Differences: Original vs. Correction
- Duration & Pause
- Significant Increases from Original to Correction
- Pitch
- Significant Decrease in Pitch Minimum
- Increase in Final Falling Contours
- Conversational-to-Clear Speech Shift
22Analysis Phonology
- Reduced Form -> Citation Form
- Schwa to unreduced vowel (20%)
- E.g. "Switch t' mail" -> "Switch to mail"
- Unreleased or Flapped 't' -> Released 't' (50%)
- E.g. "Read message tweny" -> "Read message twenty"
- Citation Form -> Hyperclear Form
- Extreme lengthening, calling intonation (20%)
- E.g. "Goodbye" -> "Goodba-aye"
23Durational Model Contrasts I
[Histogram: Original Inputs, Final vs. Non-final position; x-axis: Departure from Model Mean (Std Dev); y-axis: % of Words]
24Durational Model Contrasts
- Compare to SR model (Chung 1995): phrase-final lengthening
- Words in final position significantly longer than non-final words and than the model prediction
- All significantly longer in correction utterances
[Histogram: Final vs. Non-final; x-axis: Departure from Model Mean (Std Dev); y-axis: % of Words]
25Durational Model Contrasts II
Correction Utterances: Greater Increases
[Histogram: Final vs. Non-final; x-axis: Departure from Model Mean (Std Dev); y-axis: % of Words]
26Analysis Overview II
- Originals and Corrections vs. Recognizer Model
- Phonology
- Reduced Form -> Citation Form -> Hyperclear Form
- Conversational to (Hyper) Clear Shift
- Duration
- Contrast between Final and Non-final Words
- Departure from ASR Model
- Increase for Corrections, especially Final Words
27Automatic Recognition of Spoken Corrections
- Machine learning classifier
- Decision Trees
- Trained on labeled examples
- Features Duration, Pause, Pitch
- Evaluation
- Overall: 65% accuracy (incl. text features)
- Absolute and normalized duration
- Misrecognitions: 77% accuracy (incl. text features)
- Absolute and normalized duration, pitch
- 65% accuracy with acoustic features only
- Approaches human baseline of 79.4%
28Accomplishments
- Contrasts between Originals and Corrections
- Significant Differences in Duration, Pause, Pitch
- Conversational-to-Clear Speech Shifts
- Shifts away from Recognizer Models
- Corrections Recognized at 65-77%
- Near-human Levels
29Future Work
- Modify ASR Duration Model for Correction
- Reflect Phonological and Duration Change
- Identify Locus of Correction for Misrecognitions
- U: Switch to Weather
- S (Heard): Switch to Calendar
- S (Said): On Friday, May 4, you have a talk at Chicago.
- U: Switch to WEATHER!
- Preliminary tests
- 26/28 Corrected Words Detected, 2 False Alarms
30Future Work
- Identify and Exploit Cues to Discourse and Information Structure
- Incorporate Prosodic Features into Model of Spoken Dialogue
- Exploit Text and Acoustic Features for Segmentation of Broadcast Audio and Video
- Necessary first phase for information retrieval
- Assess language independence
- First phase: Segmentation of Mandarin and Cantonese Broadcast News (in collaboration with CUHK)
31Classification of Spoken Corrections
- Decision Trees
- Intelligible, Robust to irrelevant attributes
- Rectangular decision boundaries; don't combine features
- Features (38 total, 15 in best trees)
- Duration, pause, pitch, and amplitude
- Normalized and absolute
- Training and Testing
- 50% Original Inputs, 50% Repeat Corrections
- 7-way cross-validation
32Recognizer Results (Overall)
- Tree Size 57 (unpruned), 37 (pruned)
- Minimum of 10 nodes per branch required
- First Split: Normalized Duration (All Trees)
- Most Important Features:
- Normalized & Absolute Duration, Speaking Rate
- 65% Accuracy; Null Baseline: 50%
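Since every learned tree split first on normalized duration, even a single-threshold "stump" on that one feature captures the core signal: corrections are lengthened. A toy illustration on synthetic data (my sketch with invented values, not the actual trees or the MAICS corpus):

```python
# Decision stump: find the duration threshold that best separates
# corrections (label 1, lengthened) from originals (label 0).

def fit_stump(durations, labels):
    """Return the threshold with highest training accuracy for the
    rule: predict 'correction' when normalized duration > threshold."""
    best_thr, best_acc = None, -1.0
    for thr in sorted(set(durations)):
        preds = [1 if d > thr else 0 for d in durations]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_thr, best_acc = thr, acc
    return best_thr

# Synthetic normalized durations (std devs from the subject mean).
dur = [-0.9, -0.4, -0.1, 0.2, 0.6, 1.1, 1.4, 1.8]
lab = [0, 0, 0, 0, 1, 1, 1, 1]
thr = fit_stump(dur, lab)
preds = [1 if d > thr else 0 for d in dur]
```

The real trees then refine this first split with further tests on absolute duration, speaking rate, pause, and pitch features.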
33Example Tree
34Classifier Results Misrecognitions
- Most important features
- Absolute and Normalized Duration
- Pitch Minimum and Pitch Slope
- 77% accuracy (with text)
- 65% (acoustic features only)
- Null baseline: 50%
- Human baseline: 79.4% (Hauptmann & Rudnicky 1987)
35Misrecognition Classifier
36Background Related Work
- Detecting and Preventing Miscommunication
- (Smith & Gordon 96; Traum & Dillenbourg 96)
- Identifying Discourse Structure in Speech
- Prosody (Grosz & Hirschberg 92; Swerts & Ostendorf 95)
- Cue words & prosody (Taylor et al 96; Hirschberg & Litman 93)
- Self-repairs (Heeman & Allen 94; Bear et al 92)
- Acoustic-only (Nakatani & Hirschberg 94; Shriberg et al 97)
- Speaking Modes (Ostendorf et al 96; Daly & Zue 96)
- Spoken Corrections
- Human baseline (Rudnicky & Hauptmann 87)
- (Oviatt et al 96, 98; Levow 98, 99; Hirschberg et al 99, 00)
- Other languages (Bell & Gustafson 99; Pirker et al 99; Fischer 99)
37Learning Method Options
- (K)-Nearest Neighbor
- Need Commensurable Attribute Values
- Sensitive to Irrelevant Attributes
- Labeling Speed - Training Set Size
- Neural Nets
- Hard to Interpret
- Can Require More Computation & Training Data
- Fast, Accurate when Trained
- Decision Trees (selected)
- Intelligible, Robust to Irrelevant Attributes
- Fast, Compact when Trained
- Rectangular Decision Boundaries; Don't Test Feature Combinations