Title: DQR test suites for spoken dialogue system evaluation: A paradigm for a qualitative evaluation
1 DQR test suites for spoken dialogue system evaluation: A paradigm for a qualitative evaluation
- Jean-Yves Antoine, VALORIA, U. Bretagne Sud, Vannes, France
- Jérôme Zeiliger, INRS-Telecom, Quebec, Canada
- Jean Caelen, CLIPS, Institut IMAG, Grenoble, France
2 Quantitative evaluation
- Overall performance of the system
- Accuracy rates: system outputs compared against predefined references
- Advantages
- Objective evaluation
- Overall improvements over time
- Drawbacks
- Lack of predictive power
- Lack of genericity
3 Predictability: some questions
- Overall accuracy rate of the system
- How does it depend on the performances of its components?
- Overall accuracy rate of a specific component
- How does it depend on the testing data?
- How does it depend on the application?
- How should it enlighten us about future improvements?
4 Predictability: a solution
Quantitative evaluation
- Assessment of the overall improvements of the technology
Qualitative evaluation
- Appropriateness to a specific task / application
Evaluation of the system's behaviour on EVERY specific phenomenon
PREDICTABILITY
5 DQR methodology
- Qualitative evaluation in NLP
- TSNLP, FRACAS, AUPELF-UREF
- DQR test suites
- Declaration D: the utterance the system should understand. D concerns a specific phenomenon
- Peter is attending a meeting. He is to chair it.
- Question Q: assesses the understanding of D
- Is Peter to chair a meeting?
- Reply R: Yes / No
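A DQR item can be pictured as a small record holding the three parts plus the phenomenon it targets. The sketch below is only an illustration: the class name `DQRItem`, its field names, and the phenomenon label "anaphora" are assumptions made for this example, not part of the DQR methodology as presented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DQRItem:
    """One test item: Declaration, Question, expected Reply (hypothetical layout)."""
    phenomenon: str   # the specific phenomenon D is meant to exercise
    declaration: str  # D: the utterance the system should understand
    question: str     # Q: assesses the understanding of D
    reply: bool       # R: True for "Yes", False for "No"


# The example from the slide; the "anaphora" label is an assumed annotation.
item = DQRItem(
    phenomenon="anaphora",
    declaration="Peter is attending a meeting. He is to chair it.",
    question="Is Peter to chair a meeting?",
    reply=True,
)
```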
6 DQR Evaluation and Speech
EXTENSIONS OF THE DQR METHODOLOGY
Specificity of the spoken language interaction
Specificity of the speech technologies
Structural analysis: spontaneous, unexpected structures
Dialogue strategy
Practical adaptation of the DQR test suites
7 Multi-level Evaluation
- Literal understanding (structural analysis)
- Implicit understanding (anaphora, ellipses)
- Inference
- - common sense reasoning (logical inferences)
- - pragmatic reasoning
- - multiple-turn inferences
- Speech act interpretation (intention in action)
- Speaker's intention recognition (preliminary intention)
- Relevance
- - reply of the system
- - dialogue strategy
8 Practical achievement
Simplicity of the question Q
- (D) I need to go to Granada tomorrow morning
- (Q) Go to Granada
- (R) Yes
Simplicity of the evaluation
- Computation of the answer: mere unification
- Accuracy rate specific to each phenomenon
R_system = UNIF(D, Q)
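The slide does not define UNIF in detail. As a minimal sketch, assume the system represents D and Q as nested attribute-value frames and that the reply is "Yes" when Q's frame is subsumed by D's frame or by one of its sub-frames. The function names `subsumes`, `unif`, and `accuracy_by_phenomenon`, and the hand-written frames, are all hypothetical; a real system would unify its own internal representations.

```python
from collections import defaultdict


def subsumes(q, d):
    """True if every feature of q appears in d with a compatible value."""
    if isinstance(q, dict):
        if not isinstance(d, dict):
            return False
        return all(k in d and subsumes(v, d[k]) for k, v in q.items())
    return q == d


def unif(d, q):
    """Simplified UNIF(D, Q): q matches d itself or any nested part of d."""
    if subsumes(q, d):
        return True
    if isinstance(d, dict):
        return any(unif(v, q) for v in d.values())
    return False


def accuracy_by_phenomenon(results):
    """results: iterable of (phenomenon, system_reply, expected_reply)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for phenomenon, system_reply, expected_reply in results:
        totals[phenomenon] += 1
        hits[phenomenon] += int(system_reply == expected_reply)
    return {p: hits[p] / totals[p] for p in totals}


# Assumed frames for: (D) I need to go to Granada tomorrow morning / (Q) Go to Granada
d = {"action": "go", "destination": "Granada", "date": "tomorrow", "time": "morning"}
q = {"action": "go", "destination": "Granada"}
r_system = unif(d, q)  # True, i.e. the system replies "Yes"
```

Because the comparison is a plain Yes / No match, an accuracy rate can be kept separately for each phenomenon rather than as one global figure.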
9 Genericity
Unification of the intrinsic representations of the system
- No predefined references
- No common representations
Complete independence
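One way to picture this independence: the evaluator only needs each system to parse an utterance and unify two of its own representations, and never inspects them. The sketch below is an assumption-laden illustration; the `DialogueSystem` protocol, `run_item`, and the toy `KeywordSystem` are hypothetical names, not an interface defined by the DQR methodology.

```python
from typing import Any, Protocol


class DialogueSystem(Protocol):
    """What the evaluator assumes about a system (hypothetical interface)."""

    def parse(self, utterance: str) -> Any: ...

    def unify(self, d_repr: Any, q_repr: Any) -> bool: ...


def run_item(system: DialogueSystem, declaration: str, question: str, expected: bool) -> bool:
    """Return True if the system's Yes/No reply equals the expected reply."""
    reply = system.unify(system.parse(declaration), system.parse(question))
    return reply == expected


class KeywordSystem:
    """Toy system whose internal representation is a set of lowercase words."""

    def parse(self, utterance: str) -> set:
        return set(utterance.lower().split())

    def unify(self, d_repr: set, q_repr: set) -> bool:
        # "Unification" here is just: every word of Q occurs in D.
        return q_repr <= d_repr


ok = run_item(
    KeywordSystem(),
    "I need to go to Granada tomorrow morning",
    "Go to Granada",
    True,
)
```

Any other system, with frames, logical forms, or anything else as its representation, could be plugged into `run_item` unchanged, since the evaluator compares only the final Yes / No replies.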
10 Predictability: literal understanding
- Key information retrieval
(D) I need to go to Granada tomorrow morning
(Q) Go to Granada (R) Yes
(D) Turn right after the building with the red shutters
(Q) Red shutters (R) Yes
(Q) Building with shutters (R) Yes
11 Predictability: negative tests
Positive tests: tracking the errors
Negative tests: explaining the errors
Example: literal understanding
(D) Turn right after the building with the red shutters
(Q) Red building (R) No
(D) Move the circle and the triangle on the right
(Q) Move the right triangle (R) No
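One possible reading of the positive / negative split: a wrong "No" on a positive test signals that some information was lost, while a wrong "Yes" on a negative test shows what the system mistakenly built (for instance, attaching "red" to "building"). The sketch below tallies both kinds of failure; the function `diagnose`, its labels, and the system replies in `runs` are invented for illustration and are not results from any real system.

```python
from collections import Counter


def diagnose(expected: bool, system: bool) -> str:
    """Classify one test outcome (labels are hypothetical)."""
    if expected == system:
        return "correct"
    # Expected Yes but got No: information from D was missed.
    # Expected No but got Yes: the system accepted a wrong structure.
    return "missed" if expected else "false acceptance"


# (question, expected reply, made-up system reply)
runs = [
    ("Red shutters", True, True),
    ("Red building", False, True),
    ("Move the right triangle", False, False),
]

summary = Counter(diagnose(expected, system) for _, expected, system in runs)
```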
12 Predictability: spoken constructions
- Repetitions, self-corrections
(D) I want to leave tomorrow evening no sorry morning
(Q) Tomorrow morning (R) Yes
(D) On the right of the circle, draw a red triangle
(Q) Draw a circle (R) No
13 Conclusion
- A predictive and generic paradigm of evaluation
- Already in use in NLP (Fracas, 1996)
- Adaptable to spoken language understanding
- AUPELF-UREF French-speaking evaluation
- Adaptable to spoken dialogue?
- Lack of interactive abilities of the present systems