1
Cost of Misunderstandings
  • Modeling the Cost of Misunderstanding Errors in
    the CMU Communicator Dialog System
  • Presented by Dan Bohus (dbohus@cs.cmu.edu)
  • Work by Dan Bohus, Alex Rudnicky
  • Carnegie Mellon University, 2001

2
Outline
  • Quick overview of previous utterance-level
    confidence annotation work.
  • Modeling the cost of misunderstandings in spoken
    dialog systems.
  • Experimental results.
  • Further analysis.
  • Summary, further work, conclusions.

3
Utterance-Level Confidence Annotation Overview
  • Confidence annotation as data-driven
    classification.
  • Corpus: 2 months, 131 dialogs, 4550 utterances.
  • Features: 12 features from the decoder, parsing,
    and dialog management levels.
  • Classifiers: Decision Tree, ANN, BayesNet,
    AdaBoost, NaiveBayes, SVM; Logistic Regression
    model (later on).

4
Confidence annotator performance
  • Baseline error rate: 32%.
  • Garble baseline: 25%.
  • Classifier performance: 16%.
  • Differences between classifiers are statistically
    insignificant, except for Naïve Bayes.
  • On a soft metric, the logistic regression model
    clearly outperformed the others.
  • But is this the right way to evaluate performance?

5
Judging Performance
  • Classification Error Rate (FP + FN).
  • Implicitly assumes that FP and FN errors have the
    same cost.
  • But the cost of a misunderstanding in a dialog
    system is presumably different for FPs and FNs.
  • Build an error function which takes these costs
    into account, and optimize for that.
  • Cost also depends on:
  • domain/system (not a problem)
  • dialog state

6
Problem Formulation
  • (1) Develop a cost model which allows us to
    quantitatively assess the costs of FP and FN
    errors.
  • (2) Use the costs to pick the optimal tradeoff
    point on the classifier ROC.

7
The Cost Model
  • Model the impact of the FPs and FNs on system
    performance.
  • Identify a suitable performance metric P.
  • Build a statistical regression model at the
    dialog session level:
  • P = f(FPs, FNs)
  • P = k + Cost_FP · FP + Cost_FN · FN (linear regression)
  • Then we can plot f, and implicitly optimize for P.
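Estimating such a linear cost model is an ordinary least-squares fit at the session level. A minimal sketch with NumPy, using made-up per-session counts rather than the actual corpus:

```python
import numpy as np

# Hypothetical per-session data (not the Communicator corpus):
# FP/FN error counts and an observed performance score P per dialog.
FP = np.array([0, 1, 2, 3, 1, 4, 0, 2], dtype=float)
FN = np.array([1, 0, 2, 1, 3, 2, 0, 4], dtype=float)
P  = np.array([0.9, 0.8, 0.5, 0.4, 0.4, 0.1, 1.0, 0.2])

# Design matrix [1, FP, FN]: fit P = k + Cost_FP*FP + Cost_FN*FN.
X = np.column_stack([np.ones_like(FP), FP, FN])
k, cost_fp, cost_fn = np.linalg.lstsq(X, P, rcond=None)[0]

# The fitted (negative) slopes quantify how much each error type
# hurts the session-level performance metric.
print(f"k = {k:.2f}, Cost_FP = {cost_fp:.2f}, Cost_FN = {cost_fn:.2f}")
```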

8
Measuring Performance
  • User satisfaction (e.g. on a 5-point scale)
  • Hard to get.
  • Very subjective; hard to make consistent
    across users.
  • Concept transfer efficiency:
  • CTC: correctly transferred concepts per turn
  • ITC: incorrectly transferred concepts per turn
  • Completion

9
Detour: The Dataset
  • 134 dialogs (2561 utterances), collected using 4
    scenarios.
  • Satisfaction scores available for only 35 dialogs.
  • Corpus manually labeled at the concept level.
  • 4 labels: OK / RBAD / PBAD / OOD.
  • Aggregate utterance labels generated.
  • Confidence annotator decisions logged.
  • Computed counts of FPs, FNs, CTCs, ITCs for each
    session.

10
Example
  • U: I want to fly from Pittsburgh to Boston
  • S: I want to fly from Pittsburgh to Austin
  • C: I_want/OK, Depart_Loc/OK,
    Arrive_Loc/RBAD
  • Only 2 relevantly expressed concepts.
  • If Accept: CTC = 1, ITC = 1
  • If Reject: CTC = 0, ITC = 0
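The bookkeeping in this example can be sketched as a small helper (hypothetical code, using the concept labels from the corpus):

```python
# Hypothetical helper: given the labels of the relevant concepts in a
# turn and the annotator's accept/reject decision, count correctly
# (CTC) and incorrectly (ITC) transferred concepts.
def transfer_counts(labels, accept):
    if not accept:                # rejecting the turn transfers nothing
        return 0, 0
    ctc = sum(lab == "OK" for lab in labels)
    itc = sum(lab in ("RBAD", "PBAD") for lab in labels)
    return ctc, itc

# Depart_Loc/OK, Arrive_Loc/RBAD (I_want is not task-relevant):
relevant = ["OK", "RBAD"]
print(transfer_counts(relevant, accept=True))   # (1, 1)
print(transfer_counts(relevant, accept=False))  # (0, 0)
```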

11
Targeting Efficiency: Model 1
  • 3 successively refined models.
  • CTC = a · FP + b · FN + c · TN + k
  • CTC: correctly transferred concepts / turn
  • TN: true negatives

12
Targeting Efficiency: Model 2
  • CTC − ITC = a · REC + b · FP + c · FN + d · TN + k
  • ITC: incorrectly transferred concepts / turn
  • REC: relevantly expressed concepts

13
Targeting Efficiency: Model 3
  • CTC − ITC = a · REC + b · FPC + c · FPNC + d · FN + e · TN + k
  • 2 types of FPs:
  • With concepts: FPC
  • Without concepts: FPNC

14
Model 3 - Results
  • CTC − ITC = a · REC + b · FPC + c · FPNC + d · FN + e · TN + k

15
Other Models
  • Completion (binary):
  • Logistic regression model.
  • Estimated model does not indicate a good fit.
  • User satisfaction (5-point scale):
  • Based on only 35 dialogs.
  • R² = 0.61 (similar to the literature, e.g. Walker et al.)
  • Explanation: subjectivity of the metric; limited
    dataset.
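For the binary completion outcome, a logistic regression of completion on per-session error counts could be estimated as below (a sketch on fabricated counts; as the slide notes, the model fitted on the real data was not a good fit):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Fabricated per-session [FP, FN] counts and completion outcomes.
X = np.array([[0, 1], [1, 0], [2, 2], [3, 1],
              [1, 3], [4, 2], [0, 0], [2, 4]], dtype=float)
completed = np.array([1, 1, 0, 0, 0, 0, 1, 0])

# P(completed) = sigmoid(b0 + b1*FP + b2*FN); more errors should
# lower the predicted completion probability.
model = LogisticRegression().fit(X, completed)
print("coefficients:", model.coef_[0], "intercept:", model.intercept_[0])
print("P(completed | 0 errors):", model.predict_proba([[0, 0]])[0, 1])
```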

16
Problem Formulation
  • (1) Develop a cost model which allows us to
    quantitatively assess the costs of FP and FN
    errors.
  • (2) Use the costs to pick the optimal tradeoff
    point on the classifier ROC.

17
Tuning the Confidence Annotator
  • Using Model 3:
  • CTC − ITC = a · REC + b · FPNC + c · FPC + d · FN + e · TN + k
  • Drop k and REC, plug in the fitted coefficients:
  • Cost = 0.48 · FPNC + 2.12 · FPC + 1.33 · FN + 0.56 · TN
  • Minimize Cost instead of Classification Error
    Rate (FP + FN), and we'll implicitly maximize
    concept transfer efficiency.
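With the fitted cost function in hand, tuning the annotator amounts to sweeping the acceptance threshold and picking the one with minimal total cost. A sketch on fabricated logs (the confidence scores, correctness labels, and has-concepts flags are all made up):

```python
import numpy as np

# Fabricated annotator logs: confidence score per utterance, whether
# the utterance was actually understood correctly, and whether it
# carried any concepts (distinguishing FPC from FPNC).
conf         = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])
is_ok        = np.array([1,   1,   0,   1,   0,   0,   1,   0])
has_concepts = np.array([1,   1,   1,   0,   1,   0,   1,   1])

def total_cost(threshold):
    accept = conf >= threshold
    fpc  = np.sum(accept & (is_ok == 0) & (has_concepts == 1))
    fpnc = np.sum(accept & (is_ok == 0) & (has_concepts == 0))
    fn   = np.sum(~accept & (is_ok == 1))
    tn   = np.sum(~accept & (is_ok == 0))
    # Coefficients from the slide's Model 3 cost function.
    return 0.48 * fpnc + 2.12 * fpc + 1.33 * fn + 0.56 * tn

thresholds = np.linspace(0.0, 1.0, 21)
best = min(thresholds, key=total_cost)
print(f"best threshold = {best:.2f}, cost = {total_cost(best):.2f}")
```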

18
Operating Characteristic
19
Further Analysis
  • Is CTC − ITC really modeling dialog performance?
  • Mean: 0.71, Std. Dev.: 0.28
  • Mean for completed dialogs: 0.82
  • Mean for uncompleted dialogs: 0.57
  • Difference between the means is significant at a
    very high level of confidence:
  • p-value = 7.23 × 10⁻⁹ (t-test)
  • So it looks like CTC − ITC is okay, right?
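The significance check is a two-sample t-test on per-dialog CTC − ITC scores of completed vs. uncompleted dialogs. A sketch with SciPy on synthetic scores drawn to mimic the reported group means (the real per-dialog scores are not in the slides):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic per-dialog CTC - ITC scores, drawn to mimic the reported
# group means (0.82 completed vs. 0.57 uncompleted).
completed   = rng.normal(0.82, 0.2, size=80)
uncompleted = rng.normal(0.57, 0.2, size=54)

t, p = stats.ttest_ind(completed, uncompleted)
print(f"t = {t:.2f}, p = {p:.2e}")  # a small p rejects "equal means"
```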

20
Further Analysis (contd)
  • Can we reliably extrapolate to other areas of the
    operating characteristic ?

21
Further Analysis (contd)
  • Can we reliably extrapolate to other areas of the
    operating characteristic ?
  • Yes, look at the distribution of the FP and FN
    ratios across dialogs.

22
Further Analysis (contd)
  • Impact of the baseline error rate?
  • Compared models constructed from high and low
    error-rate data.
  • For low error rates, the curve becomes
    monotonically increasing.
  • This clearly indicates that "trust everything /
    use no confidence annotation" is the way to go in
    this setting.

23
Our Explanation So Far
  • The ability to easily overwrite incorrectly
    captured information in the CMU Communicator.
  • Relatively low error rates.
  • The likelihood of repeated misrecognition is low.

24
Conclusion
  • Data-driven approach to quantitatively assess the
    costs of various types of misunderstandings.
  • Models based on efficiency fit the data well;
    obtained costs confirm intuition.
  • For the CMU Communicator, the model predicts that
    total cost stays the same across a large range of
    the classifier's operating characteristic.

25
Further Experiments
  • But, of course, we can verify the predictions
    experimentally:
  • Collect new data with the system running with a
    very low threshold.
  • 55 dialogs collected so far.
  • Thanks to those who have participated in these
    experiments.
  • Please help the others, if you have the time:
  • www.cs.cmu.edu/dbohus/scenarios.htm
  • Re-estimate the models, verify the predictions.

26
Confusion Matrix
  • FP: false acceptance
  • FN: false detection/rejection
  • Fallout = FP / (FP + TN) = FP / N_BAD
  • CDR = 1 − Fallout = 1 − (FP / N_BAD)
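These definitions reduce to two lines of arithmetic; a quick sketch (the counts are illustrative, not from the corpus):

```python
# Fallout and CDR from confusion-matrix counts: FP = bad utterances
# falsely accepted, TN = bad utterances correctly rejected, so
# N_BAD = FP + TN.
def fallout_cdr(fp, tn):
    n_bad = fp + tn
    fallout = fp / n_bad      # fraction of bad utterances accepted
    cdr = 1 - fallout         # correct detection rate
    return fallout, cdr

print(fallout_cdr(25, 75))  # (0.25, 0.75)
```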