Title: Using Word-level Features to Better Predict Student Emotions during Spoken Tutoring Dialogues
1Using Word-level Features to Better Predict
Student Emotions during Spoken Tutoring Dialogues
- Mihai Rotaru
- Diane J. Litman
- DoD Group Meeting Presentation
2Introduction
- Why is it important to detect/handle emotions?
- Emotion annotation
- Classification task
- Previous work
3(Spoken) Tutoring dialogues
- Education
- Classroom setting
- Human (one-on-one) tutoring
- Computer tutoring (ITS: Intelligent Tutoring Systems)
- Addressing the learning gap between human and computer tutoring
- Dialogue-based ITS (e.g., Why2)
- Improve the language understanding module of ITS
- Incorporate affective reasoning
- Connection between learning and student emotional state
- Adding human-provided emotional scaffolding to a reading tutor increases student persistence (Aist et al., 2002)
4Human-Computer Excerpt
- Tutor26: Let me ask my question in a more general way. What happens to the velocity of a body when there is no force acting on it?
- Student27: dammit (ASR: it is)
- Tutor28: Could you please repeat that?
- Student29: same (ASR: i same)
- Tutor30: Right. And what happens to the HORIZONTAL velocity of a body when there is no HORIZONTAL force acting on it?
- Student31: zero (ASR: the zero)
- Tutor32: Well... OK, let's consider how forces in the vertical direction affect the motion of a body in a horizontal direction in a different scenario <omitted>
- Student33: oh gravity you already said this (ASR: beats gravity you're exert less)
- Tutor34: Fine. Are there any other forces acting on the apple as it falls?
- Student35: no why are you doing this again (ASR: no y and to it yes)
- Tutor36: Correct. The only force acting on the apple as it falls to the earth is the earth's gravitational force. In which direction does gravitational force act?
- Student37: downward you computer (ASR: downward you computer)
5Affective reasoning
- Prerequisites
- Dialogue-based ITS: Why2
- Interaction via speech: ITSPOKE (Intelligent Tutoring SPOKEn dialogue system)
- Affective reasoning
- Detect student emotions
- Handle student emotions
6- Back-end is Why2-Atlas system (VanLehn et al., 2002)
- Sphinx2 speech recognition and Cepstral text-to-speech
9Student emotions
- Emotion annotation
- Perceived, intuitive expressions of emotion
- Relative to other turns in context and tutoring task
- 3 Main emotion classes
- Negative - e.g. uncertain, bored, irritated, confused, sad (question turns)
- Positive - e.g. confident, enthusiastic
- Neutral - no strong expression of negative or positive emotion (grounding turns)
- Corpora
- Human-Human (453 student turns from 10 dialogues)
- Human-Computer (333 student turns from 15
dialogues)
10Annotation example
- Tutor: Uh let us talk of one car first.
- Student: ok. (EMOTION: NEUTRAL)
- Tutor: If there is a car, what is it that exerts force on the car such that it accelerates forward?
- Student: The engine. (EMOTION: POSITIVE)
- Tutor: Uh well engine is part of the car, so how can it exert force on itself?
- Student: um (EMOTION: NEGATIVE)
11Classification task
- 3 Levels of Annotation Granularity
- NPN - Negative, Positive, Neutral
- NnN - Negative, Non-Negative
- positives and neutrals are conflated as Non-Negative
- EnE - Emotional, Non-Emotional
- negatives and positives are conflated as Emotional; neutrals are Non-Emotional
- useful for triggering system adaptation (HH corpus analysis)
- Agreed subset
- Predict the class of each student turn
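The three granularities above can be read as relabelings of the same NPN annotation. A minimal sketch, assuming lowercase string labels (the label spellings and function names are illustrative, not taken from the original annotation tools):

```python
# Hypothetical label strings; the original annotation scheme may spell them differently.
NEGATIVE, POSITIVE, NEUTRAL = "negative", "positive", "neutral"

def to_nnn(npn_label):
    """NnN: positives and neutrals are conflated as non-negative."""
    return "negative" if npn_label == NEGATIVE else "non-negative"

def to_ene(npn_label):
    """EnE: negatives and positives are conflated as emotional; neutrals are non-emotional."""
    return "non-emotional" if npn_label == NEUTRAL else "emotional"

# Labels of the three student turns in the annotation example slide.
turn_labels = [NEUTRAL, POSITIVE, NEGATIVE]
nnn_labels = [to_nnn(label) for label in turn_labels]
ene_labels = [to_ene(label) for label in turn_labels]
```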
12Previous work - Features
- Human-Human
- 5 feature types
- Acoustic-prosodic
- amplitude, pitch, duration
- Lexical
- Other automatic
- Manual
- Identifiers
- Combinations
- Current turn
- Contextual
- Local: previous two turns
- Global: all turns so far
- Human-Computer
- 3 feature types
- Acoustic-prosodic
- amplitude, pitch, duration
- Lexical
- Other automatic
- Manual
- Identifiers
- Combinations
13Previous work - Results
Litman and Forbes, ACL 2004
14How to improve?
- Use word-level features instead of turn-level features
- Extend the pitch features set
- Simplified word-level emotion model
15Why word-level features?
- Emotion might not be expressed over the entire turn
- "This is great" [figure labels: Angry, Happy]
16Why word-level features? (2)
- Can approximate pitch contour better at sub-turn levels.
- Especially for longer turns
[Figure: "This is great"]
17Extended pitch features set
- Previous work
- Min, Max
- Avg, Stdev
- Extend with
- Start, End
- Regression coefficient and regression error
- Quadratic regression coefficient
from Batliner et al. 2003
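One way the extended pitch set might be computed from a sequence of f0 values is sketched below. It assumes equally spaced voiced frames indexed 0, 1, 2, ..., takes the regression error to be the mean squared residual of the linear fit, and takes the quadratic coefficient to be the leading term of a second-order least-squares fit. These definitions and all names are assumptions for illustration; the slides do not give the exact formulas.

```python
from statistics import mean, pstdev

def _fit_line(y, x):
    """Least-squares fit of y on [1, x]; returns (residuals, slope)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx else 0.0
    residuals = [yi - (my + slope * (xi - mx)) for xi, yi in zip(x, y)]
    return residuals, slope

def pitch_features(f0):
    """f0: non-empty list of pitch values (Hz) for one word or one turn."""
    x = [float(i) for i in range(len(f0))]
    res, slope = _fit_line(f0, x)
    # Leading coefficient of the quadratic fit: regress the linear-fit
    # residuals of f0 on the linear-fit residuals of x^2.
    x2_res, _ = _fit_line([xi * xi for xi in x], x)
    denom = sum(r * r for r in x2_res)
    quad = sum(a * b for a, b in zip(res, x2_res)) / denom if denom > 1e-12 else 0.0
    return {
        "min": min(f0), "max": max(f0),
        "avg": mean(f0), "stdev": pstdev(f0),
        "start": f0[0], "end": f0[-1],
        "reg_coef": slope,
        "reg_err": mean([r * r for r in res]),
        "quad_coef": quad,
    }

rising = pitch_features([100.0, 110.0, 120.0, 130.0])
```

For very short words (one or two frames) the slope and quadratic terms fall back to 0.0 here; a real system would need its own policy for such cases.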
18But wait
[Diagram. Turn-level: Student turn -> Features -> Machine learning -> Turn emotional class. Word-level: Word 1 ... Word n -> Features (one vector per word) -> Machine learning -> ? -> Turn emotional class]
Sönmez et al., 1998
19Word-level emotion model
[Diagram. Turn-level: Student turn -> Features -> Machine learning -> Turn emotional class. Word-level: Word 1 ... Word n -> Features (one vector per word) -> Word-level emotion (one per word) -> Turn emotional class]
20Word-level emotion model
- Training phase
- Each word labeled with turn class
- Extra features to identify the position of the word in the turn (distance in words from the beginning and end of the turn)
- Learn emotion model at the word level
- Test phase
- Predict each word class based on the learned model
- Use majority/weighted voting to label the turn based on its word classes
- Ties are broken randomly
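The training and test phases above can be sketched as follows. The word-level classifier itself is left out; the sketch only shows how training examples might be derived from a labeled turn and how per-word predictions might be combined. The 0-based distances, the feature names, and the meaning of the optional weights are assumptions, since the slides do not define them.

```python
import random
from collections import Counter

def words_to_examples(turn_words, turn_label):
    """Training phase: each word inherits the class of its turn, plus
    its distance in words from the beginning and end of the turn."""
    n = len(turn_words)
    return [
        ({"word": w, "dist_from_start": i, "dist_from_end": n - 1 - i}, turn_label)
        for i, w in enumerate(turn_words)
    ]

def vote_turn_label(word_labels, weights=None, rng=random):
    """Test phase: majority vote (or weighted vote if weights are given)
    over the predicted word classes of a non-empty turn; ties are broken randomly."""
    if weights is None:
        weights = [1.0] * len(word_labels)
    totals = Counter()
    for label, weight in zip(word_labels, weights):
        totals[label] += weight
    best = max(totals.values())
    tied = sorted(label for label, score in totals.items() if score == best)
    return rng.choice(tied)

examples = words_to_examples(["the", "engine"], "emotional")
turn_class = vote_turn_label(["emotional", "emotional", "non-emotional"])
```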
21Questions to answer
- Will word-level features work better than turn-level features for emotion prediction?
- Yes
- If yes, where does the advantage come from?
- Better prediction of longer turns
- Is there a feature set that offers robust performance?
- Yes. Combination of pitch and lexical features at word level.
22Experiments
- EnE classification, agreed turns
- Two contrasting corpora
- Two contrasting learners (WEKA)
- IB1 nearest neighbor classifier
- ADA boosted decision trees
23Feature sets
- Only pitch and lexical features
- 6 sets of features
- Turn level
- Lex-Turn: only lexical
- Pitch-Turn: only pitch
- PitchLex-Turn: lexical and prosodic
- Word level
- Lex-Word: only lexical + positional
- Pitch-Word: only pitch + positional
- PitchLex-Word: lexical and prosodic + positional
- Baseline: majority class
- 10 x 10 cross validation
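For orientation, the majority-class baseline and the evaluation protocol can be sketched as below, reading "10 x 10 cross validation" as 10 repetitions of 10-fold cross-validation. That reading is an assumption, as is computing the baseline over the whole label set; the experiments themselves were run in WEKA, and this is not its API.

```python
import random
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent class."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

def repeated_kfold(n_items, k=10, repeats=10, seed=0):
    """Yield (train_indices, test_indices) for `repeats` runs of k-fold CV.
    Folds are not stratified in this sketch."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_items))
        rng.shuffle(order)
        for fold in range(k):
            test = order[fold::k]
            held_out = set(test)
            train = [i for i in order if i not in held_out]
            yield train, test

splits = list(repeated_kfold(25))
baseline = majority_baseline_accuracy(["nE", "E", "nE", "nE"])
```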
24Results IB1 on HH
- Word-level features significantly outperform turn-level features
- Word-level better than turn-level on longer turns
- Best performers: Lex-Word, PitchLex-Word
25Results ADA on HH
- Turn-level performance increases a lot
- Word-level significantly better than turn-level on feature sets with pitch
- Word-level better than turn-level on longer turns but the difference is smaller
- Best performers: Lex-Turn, Lex-Word, PitchLex-Word
26Results IB1 on HC
- Word-level features significantly outperform turn-level features
- Lexical information less helpful than on HH corpus
- Word-level better than turn-level on longer turns
- Best performers: Pitch-Word, PitchLex-Word
27Results ADA on HC
- Difference not significant anymore
- IB1 better than ADA on word-level features
- ADA has bigger variance on this corpus
- Word-level better than turn-level on longer turns but the difference is smaller
- Best performers: Pitch-Turn, Pitch-Word, PitchLex-Turn, PitchLex-Word
28Discussion
- Lexical features at turn and word-level are similar
- Performance dependent on corpus and learner
- Pitch features differ significantly
- Word-level better than turn-level (4/6)
- PitchLex-Word a consistent best performer
- Our best accuracies comparable with previous work
29Conclusions & Future work
- Word-level better than turn-level for emotion prediction
- Even under a very simple word-level emotion model
- Word-level better at predicting longer turns
- PitchLex-Word a consistent best performer
- Future work
- More refined word-level emotion models
- HMMs
- Co-training
- Filter irrelevant words
- Use the prosodic information left out
- See if our conclusions generalize to detecting student uncertainty
- Experiment with other sub-turn units (breath groups)
30Feature Extraction per Student Turn
- Five feature types
- acoustic-prosodic (1)
- non acoustic-prosodic
- lexical (2)
- other automatic (3)
- manual (4)
- identifiers (5)
- Research questions
- utility of different features
- speaker and task dependence
31Feature Types (1)
- Acoustic-Prosodic Features (normalized)
- 4 pitch (f0): max, min, mean, standard dev.
- 4 energy (RMS): max, min, mean, standard dev.
- 4 temporal:
- turn duration (seconds)
- pause length preceding turn (seconds)
- tempo (syllables/second)
- internal silence in turn (zero f0 frames)
- available to ITSPOKE in real time
32Feature Types (2)
- Lexical Items
- word occurrence vector
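A word occurrence vector can be sketched as one binary entry per vocabulary word. Whether the original features were binary or counts, and how turns were tokenized, is not stated in the slides; the whitespace tokenization and function names below are assumptions.

```python
def build_vocabulary(turns):
    """Sorted list of all distinct lowercase words seen in the given turns."""
    return sorted({word for turn in turns for word in turn.lower().split()})

def occurrence_vector(turn, vocabulary):
    """1 if the vocabulary word occurs in the turn, else 0."""
    present = set(turn.lower().split())
    return [1 if word in present else 0 for word in vocabulary]

turns = ["ok", "the engine", "um"]
vocabulary = build_vocabulary(turns)
vector = occurrence_vector("the engine", vocabulary)
```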
33Feature Types (3)
- Other Automatic Features available from ITSPOKE logs
- Turn Begin Time (seconds from dialog start)
- Turn End Time (seconds from dialog start)
- Is Temporal Barge-in (student turn begins before tutor turn ends)
- Is Temporal Overlap (student turn begins and ends in tutor turn)
- Number of Words in Turn
- Number of Syllables in Turn
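The two temporal flags follow directly from the turn begin and end times. A minimal sketch, assuming all times are seconds from dialog start and that boundary cases count as overlap (the function and parameter names are illustrative, not ITSPOKE log fields):

```python
def is_temporal_barge_in(student_begin, tutor_end):
    """Student turn begins before the preceding tutor turn ends."""
    return student_begin < tutor_end

def is_temporal_overlap(student_begin, student_end, tutor_begin, tutor_end):
    """Student turn both begins and ends inside the tutor turn."""
    return tutor_begin <= student_begin and student_end <= tutor_end

# Tutor speaks from 10.0 s to 15.0 s; student speaks from 14.2 s to 16.0 s.
barge_in = is_temporal_barge_in(14.2, 15.0)
overlap = is_temporal_overlap(14.2, 16.0, 10.0, 15.0)
```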
34Feature Types (4)
- Manual Features (currently) available only from human transcription
- Is Prior Tutor Question (tutor turn contains ?)
- Is Student Question (student turn contains ?)
- Is Semantic Barge-in (student turn begins at tutor word/pause boundary)
- Number of Hedging/Grounding Phrases (e.g. mm-hm, um)
- Is Grounding (canonical phrase turns not preceded by a tutor question)
- Number of False Starts in Turn (e.g. acc-acceleration)