Title: multimodal emotion recognition and expressivity analysis (ICME 2005 Special Session)
1. multimodal emotion recognition and expressivity analysis (ICME 2005 Special Session)
- Stefanos Kollias, Kostas Karpouzis
- Image, Video and Multimedia Systems Lab, National Technical University of Athens
2. expressivity and emotion recognition
- affective computing: the capability of machines to recognize, express, model, communicate and respond to emotional information
- computers need the ability to recognize human emotion
- everyday HCI is emotional: three-quarters of computer users admit to swearing at computers
- user input and system reaction are important to pinpoint problems or provide natural interfaces
3. the targeted interaction framework
- Generating intelligent interfaces with affective, learning, reasoning and adaptive capabilities.
- Multidisciplinary expertise is the basic means for novel interfaces, including perception and emotion recognition, semantic analysis, cognition, modelling and expression generation, and production of multimodal avatars capable of adapting to the goals and context of interaction.
- Humans function through four primary modes of being, i.e., affect, motivation, cognition, and behavior; these are related to feeling, wanting, thinking, and acting.
- Affect is particularly difficult, requiring us to understand and model the causes and consequences of emotions. The latter, especially as realized in behavior, is a daunting task.
4. (no transcript)
5. everyday emotional states
"I think you might be getting just a wee bit bored... maybe a coffee?"
- dramatic extremes (terror, misery, elation) are fascinating, but marginal for HCI
- the target of an affect-aware system:
- register everyday states with an emotional component: excitement, boredom, irritation, enthusiasm, stress, satisfaction, amusement
- achieve sensitivity to everyday emotional states
6. affective computing applications
- detect specific incidents/situations that need human intervention
- e.g. anger detection in a call center
- naturalistic interfaces
- the keyboard/mouse/pointer paradigm can be difficult for the elderly, handicapped people or children
- speech and gesture interfaces can be useful
7. the EU perspective
- until 2002, related research was dominated by mid-scale projects:
- ERMIS: multimodal emotion recognition (facial expressions, linguistic and prosody analysis)
- NECA: networked affective ECAs
- SAFIRA: affective input interfaces
- NICE: Natural Interactive Communication for Edutainment
- MEGA: Multisensory Expressive Gesture Applications
- INTERFACE: Multimodal Analysis/Synthesis System for Human Interaction to Virtual and Augmented Environments
8. the EU perspective
- FP6 (2002-2006) issued two calls for multimodal interfaces
- Call 1 (April 2003) and Call 5 (September 2005), covering multimodal and multilingual areas
- Integrated Projects: AMI (Augmented Multi-party Interaction) and CHIL (Computers In the Human Interaction Loop)
- Networks of Excellence: HUMAINE and SIMILAR
- other calls covered Leisure and Entertainment, e-Inclusion, Cognitive Systems, and Presence and Interaction
9. the HUMAINE Network of Excellence
- FP6 Call 1 Network of Excellence: Research on Emotions and Human-Machine Interaction
- start: 1st January 2004; duration: 48 months
- IST thematic priority: Multimodal Interfaces
- emotions in human-machine interaction
- creation of a new, interdisciplinary research community
- advancing the state of the art in a principled way
10. the HUMAINE Network of Excellence
- 33 partner groups from 14 countries
- coordinated by Queen's University of Belfast
- goals of HUMAINE:
- integrate existing expertise in psychology, computer engineering, cognition, interaction and usability
- promote shared insight
- http://emotion-research.net
11. moving forward
- future EU orientations include (extracted from the Call 1 evaluation, 2004):
- adaptability and re-configurable interfaces
- collaborative technologies and interfaces in the arts
- less explored modalities, e.g. haptics, bio-sensing
- affective computing, including character and facial expression recognition and animation
- more product innovation and industrial impact
- FP7 direction: Simulation, Visualization, Interaction, Mixed Reality
- blending semantic/knowledge and interface technologies
12. the special session
- segment-based approach to the recognition of emotions in speech
- M. Shami, M. Kamel, University of Waterloo
- comparing feature sets for acted and spontaneous speech in view of automatic emotion recognition
- T. Vogt, E. Andre, University of Augsburg
- annotation and detection of blended emotions in real human-human dialogs recorded in a call center
- L. Vidrascu, L. Devillers, LIMSI-CNRS, France
- a real-time lip sync system using a genetic algorithm for automatic neural network configuration
- G. Zoric, I. Pandzic, University of Zagreb
- visual/acoustic emotion recognition
- Cheng-Yao Chen, Yue-Kai Huang, Perry Cook, Princeton University
- an intelligent system for facial emotion recognition
- R. Cowie, E. Douglas-Cowie, Queen's University of Belfast, J. Taylor, King's College, S. Ioannou, M. Wallace, IVML/NTUA
13. the big picture
- feature extraction from multiple modalities
- prosody, words, face, gestures, biosignals
- unimodal recognition
- multimodal recognition
- using detected features to cater for affective interaction (see the pipeline sketch below)
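A minimal sketch of how such a pipeline might be organized, assuming per-modality feature extractors and classifiers with late fusion by averaging; all names and types here are illustrative, not taken from the systems described in this deck:

```python
from typing import Callable, Dict, List

# Hypothetical label set; the actual systems use richer everyday states.
EMOTIONS = ["neutral", "boredom", "irritation", "satisfaction"]

class Modality:
    """One channel (prosody, words, face, gestures, biosignals)."""
    def __init__(self, name: str,
                 extract: Callable[[bytes], List[float]],
                 classify: Callable[[List[float]], Dict[str, float]]):
        self.name = name
        self.extract = extract      # raw signal -> feature vector
        self.classify = classify    # feature vector -> class probabilities

def recognize(raw: Dict[str, bytes], modalities: List[Modality]) -> Dict[str, float]:
    """Unimodal recognition per channel, then late fusion by averaging."""
    fused = {e: 0.0 for e in EMOTIONS}
    for m in modalities:
        probs = m.classify(m.extract(raw[m.name]))   # unimodal result
        for e in EMOTIONS:
            fused[e] += probs.get(e, 0.0) / len(modalities)
    return fused
```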
14. audiovisual emotion recognition
- the core system combines modules dealing with:
- visual signs
- the linguistic content of speech (what you say)
- the paralinguistic content (how you say it)
- and recognition based on all the signs
15. facial analysis module
- face detection, i.e. finding a face without prior information about its location
- or using prior knowledge about where to look
- face tracking
- extraction of key regions and points in the face
- monitoring of movements over time (as features for users' expressions/emotions)
- providing a confidence level for the validity of each detected feature
16. facial analysis module
- face detection, obtained through SVM classification
- facial feature extraction, by robust estimation of the primary facial features, i.e., eyes, mouth, eyebrows and nose
- fusion of different extraction techniques, with confidence level estimation
- MPEG-4 FP and FAP feature extraction to feed the expression and emotion recognition task
- 3-D modeling for improved accuracy in FP and FAP feature estimation, at an increased computational load, when the facial user model is known
17. facial analysis module
[figure: the extracted mask for the eyes; detected feature points in the masks]
18. FAP estimation
- absence of a clear quantitative definition of FAPs
- it is possible to model FAPs through the movement of FDP feature points, using distances s(x, y)
- e.g. close_t_r_eyelid (F20) and close_b_r_eyelid (F22): D13 = s(3.2, 3.4), f13 = D13 - D13_NEUTRAL
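A minimal sketch of this distance-based FAP estimate: the FP indices (3.2, 3.4) and the D13/f13 names follow the slide's example, while the helper functions and the coordinates are illustrative assumptions:

```python
import math

def s(p, q):
    """Euclidean distance between two feature points (x, y)."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def fap_eyelid_closure(fp, fp_neutral):
    """f13 = D13 - D13_NEUTRAL, with D13 = s(FP 3.2, FP 3.4),
    as in the close_t_r_eyelid / close_b_r_eyelid example above."""
    d13 = s(fp["3.2"], fp["3.4"])
    d13_neutral = s(fp_neutral["3.2"], fp_neutral["3.4"])
    return d13 - d13_neutral

# invented coordinates: positive output means the eyelids are
# further apart than in the neutral frame
current = {"3.2": (120, 80), "3.4": (120, 95)}
neutral = {"3.2": (120, 82), "3.4": (120, 94)}
print(fap_eyelid_closure(current, neutral))
```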
19. face detection
[diagram: an SVM classifier labels each candidate region as "face" or "no face"]
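A minimal sliding-window sketch of that face/no-face decision, assuming a scikit-learn SVM already trained on fixed-size grayscale patches; the window size, step and class encoding are invented, and multi-scale search and confidence estimation are omitted:

```python
import numpy as np
from sklearn.svm import SVC

PATCH = 19  # window size in pixels; an assumption, not the ERMIS value

def detect_faces(image: np.ndarray, clf: SVC, step: int = 4):
    """Slide a PATCH x PATCH window over a grayscale image and keep
    the windows the SVM labels as 'face' (class 1)."""
    hits = []
    h, w = image.shape
    for y in range(0, h - PATCH, step):
        for x in range(0, w - PATCH, step):
            window = image[y:y + PATCH, x:x + PATCH].ravel()[None, :]
            if clf.predict(window)[0] == 1:   # 1 = face, 0 = no face
                hits.append((x, y, PATCH, PATCH))
    return hits
```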
20. face detection
[figure: detected face; estimation of the active contour of the face; extraction of the facial area]
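A minimal sketch of the contour-estimation step using scikit-image's active contour (snake), initialized from a hypothetical detection box; all parameter values are illustrative:

```python
import numpy as np
from skimage.filters import gaussian
from skimage.segmentation import active_contour

def face_contour(gray: np.ndarray, box):
    """Fit a snake around the face, initialized as an ellipse
    inscribed in the (x, y, w, h) detection box."""
    x, y, w, h = box
    t = np.linspace(0, 2 * np.pi, 200)
    init = np.column_stack([y + h / 2 + (h / 2) * np.sin(t),   # rows
                            x + w / 2 + (w / 2) * np.cos(t)])  # cols
    snake = active_contour(gaussian(gray, sigma=3),
                           init, alpha=0.015, beta=10.0, gamma=0.001)
    return snake  # (N, 2) array of (row, col) contour points
```

The facial area can then be extracted by masking the pixels inside the returned contour.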
21. facial feature extraction
[figure: extraction of eye and mouth key regions within the face; extraction of MPEG-4 Feature Points (FPs), i.e. key points in the eye and mouth regions]
22. other visual features
- visemes, eye gaze, head pose
- movement patterns, temporal correlations
- hand gestures, body movements
- deictic/conversational gestures
- body language
- measurable parameters to render expressivity on affective ECAs
- spatial extent, repetitiveness, volume, etc. (see the sketch below)
23. video analysis using 3D
- Step 1: scan or approximate a 3D model
- (in this case estimated from video data only, using a face space approach)
24. video analysis using 3D
- Step 2: represent the 3D model using a predefined template geometry; the same template is used for expressions
- this template shows higher density around the eyes and mouth, and lower density around flatter areas such as the cheeks, forehead, etc.
25. video analysis using 3D
- Step 3: construct a database of facial expressions by recording various actors; the statistics derived from these performances are stored in terms of a Dynamic Face Space
- Step 4: apply the expressions to the actor in the video data
26. video analysis using 3D
- Step 5: matching; rotate the head, apply various expressions and match the current state with the 2D video frame
- a global minimization process
27. video analysis using 3D
- the global matching/minimization process is complex (a simplified sketch follows below)
- it is sensitive to:
- illumination, which may vary across the sequence
- shading and shadowing effects on the face
- color changes, or color differences
- variability in expressions: some expressions cannot be generated using the statistics of the a priori recorded sequences
- it is time consuming (several minutes per frame)
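A heavily simplified sketch of the matching/minimization step as analysis-by-synthesis: fit head rotation, translation and face-space expression coefficients by least-squares over sparse point reprojection error. The linear face space, z-axis-only rotation and orthographic projection are illustrative assumptions; the real system matches dense appearance under varying illumination, which is why it takes minutes per frame:

```python
import numpy as np
from scipy.optimize import least_squares

def rot_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def residuals(params, mean, basis, observed_2d):
    """params = [theta, tx, ty, expression coefficients...]."""
    theta, tx, ty = params[:3]
    coeffs = params[3:]
    # linear face space: neutral shape plus expression deformation
    shape = mean + basis @ coeffs            # (3N,) flattened vertices
    pts = shape.reshape(-1, 3) @ rot_z(theta).T
    proj = pts[:, :2] + np.array([tx, ty])   # orthographic projection
    return (proj - observed_2d).ravel()

# toy data: 10 model points, a 2-mode expression basis
rng = np.random.default_rng(0)
mean = rng.normal(size=30)
basis = rng.normal(size=(30, 2)) * 0.1
observed = mean.reshape(-1, 3)[:, :2] + 0.05  # slightly shifted "frame"

fit = least_squares(residuals, x0=np.zeros(5), args=(mean, basis, observed))
print(fit.x)  # recovered pose (theta, tx, ty) and expression coefficients
```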
28. video analysis using 3D
[figure: local template matching; pose estimation]
29. video analysis using 3D
30. video analysis using 3D
[figure: 3D models]
31. video analysis using 3D
[figure: adding expressions]
32. auditory module
- linguistic analysis aims to extract the words that the speaker produces
- paralinguistic analysis aims to extract significant variations in the way words are produced, mainly in pitch, loudness, timing, and voice quality
- both are designed to cope with the less-than-perfect signals that are likely to occur in real use
33. linguistic analysis
[diagram: (a) the Linguistic Analysis Subsystem; (b) the Speech Recognition Module, in which parameter extraction, acoustic modeling, language modeling and a dictionary feed a search engine that turns the enhanced speech signal into text]
34. paralinguistic analysis
- ASSESS, developed by QUB, describes speech at multiple levels:
- intensity and spectrum; edits, pauses, frication
- raw pitch estimates and a smooth fitted curve
- rises and falls in intensity and pitch
- (a comparable open-source extraction is sketched below)
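ASSESS itself is not publicly distributed; a rough open-source analogue of its lower levels, assuming librosa, could extract pitch and intensity contours and pause statistics along these lines (the thresholds are invented):

```python
import librosa
import numpy as np

def paralinguistic_profile(wav_path: str):
    """Crude stand-in for ASSESS's lower levels: pitch contour,
    intensity contour, and pause detection."""
    y, sr = librosa.load(wav_path, sr=16000)
    # raw pitch estimates (F0) with voicing decisions
    f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                 fmax=librosa.note_to_hz("C6"))
    # intensity as frame-wise RMS energy
    rms = librosa.feature.rms(y=y)[0]
    # pauses: frames below a hand-tuned fraction of peak energy
    pauses = rms < 0.1 * rms.max()
    return {
        "median_f0": float(np.nanmedian(f0)),
        "f0_range": float(np.nanmax(f0) - np.nanmin(f0)),
        "pause_ratio": float(pauses.mean()),
        "intensity_var": float(rms.var()),
    }
```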
35. integrating the evidence
- level 1:
- facial emotion
- phonetic emotion
- linguistic emotion
- level 2:
- total emotional state (result of the "level 1" emotions)
- modeling technique: fuzzy set theory (research by Massaro suggests this models the way humans integrate signs); a toy sketch follows
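A toy sketch of level-2 integration in that fuzzy spirit: each level-1 module reports membership degrees over emotion labels, and the total state multiplies the supports and renormalizes, roughly as in Massaro's fuzzy logical model of perception. The numbers are invented:

```python
def fuzzy_integrate(*memberships):
    """Combine per-modality fuzzy membership degrees into a total
    emotional state: multiply supports per label, then normalize."""
    labels = memberships[0].keys()
    combined = {e: 1.0 for e in labels}
    for m in memberships:
        for e in labels:
            combined[e] *= m.get(e, 1e-6)  # small floor avoids hard vetoes
    total = sum(combined.values()) or 1.0
    return {e: v / total for e, v in combined.items()}

facial     = {"anger": 0.7, "boredom": 0.2, "neutral": 0.1}
phonetic   = {"anger": 0.5, "boredom": 0.3, "neutral": 0.2}
linguistic = {"anger": 0.6, "boredom": 0.1, "neutral": 0.3}
print(fuzzy_integrate(facial, phonetic, linguistic))  # anger dominates
```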
36. integrating the evidence
- mechanisms linking attention and emotion in the brain form a useful model
[diagram: Goals (SFG); IMC (heteromodal cortex); Goals/Inhibition (ACG); Visual Input; Thalamus/Superior Colliculus; Salience: NBM (ACh source); Valence: Amygdala]
37. biosignal analysis
- different emotional expressions produce different changes in autonomic activity:
- anger: increased heart rate and skin temperature
- fear: increased heart rate, decreased skin temperature
- happiness: decreased heart rate, no change in skin temperature
- easily integrated with external channels (face and speech); a toy rule-based sketch follows
- presentation by J. Kim at the HUMAINE WP4 workshop, September 2004
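A toy rule-based sketch of exactly the mapping listed above, assuming baseline-relative heart-rate and skin-temperature deltas; the thresholds are invented, and a deployed system would learn such boundaries from data:

```python
def autonomic_rule(hr_delta: float, temp_delta: float, eps: float = 0.5):
    """Map baseline-relative heart-rate and skin-temperature changes
    to the three patterns on the slide; eps is an invented dead zone."""
    if hr_delta > eps and temp_delta > eps:
        return "anger"        # HR up, skin temperature up
    if hr_delta > eps and temp_delta < -eps:
        return "fear"         # HR up, skin temperature down
    if hr_delta < -eps and abs(temp_delta) <= eps:
        return "happiness"    # HR down, skin temperature unchanged
    return "unknown"

print(autonomic_rule(hr_delta=3.0, temp_delta=-1.2))  # -> fear
```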
38biosignal analysis
Acoustics and noise
EEG Brain waves
Respiration Breathing rate
Temperature
EMG Muscle tension
BVP- Blood volume pulse
GSR Skin conductivity
EKG Heart rate
39. biosignal analysis
- skin sensing requires physical contact
- accuracy and robustness to motion artifacts need improvement; signals are vulnerable to distortion
- most research measures artificially elicited emotions in a lab environment, and from a single subject
- different individuals show emotion with different responses in autonomic channels (hard for multi-subject settings)
- physiological emotion recognition is rarely studied; the literature offers ideas rather than well-defined solutions
40. multimodal emotion recognition
- recognition models and application dependency
- discrete / dimensional / appraisal-theory models
- theoretical models of multimodal integration
- direct / separate / dominant / motor integration
- modality synchronization
- visemes, EMGs and FAPs; SC-RSP and speech
- temporal evolution and modality sequentiality
- multimodal recognition techniques (contrasted in the sketch below)
- classifiers; context; goals; cognition/attention; modality significance in interaction
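A minimal sketch contrasting two of the integration models named above: direct (feature-level) fusion trains one classifier over concatenated modality features, while separate (decision-level) fusion combines per-modality outputs, here weighted by an assumed modality-significance factor. The classifier choice and weights are placeholders:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def direct_fusion(face_X, speech_X, y):
    """Feature-level (direct) integration: one classifier over the
    concatenated modality features."""
    return LogisticRegression(max_iter=1000).fit(
        np.hstack([face_X, speech_X]), y)

def separate_fusion(face_probs, speech_probs, w_face=0.6, w_speech=0.4):
    """Decision-level (separate) integration: weight each modality's
    class probabilities by its assumed significance in context."""
    return w_face * face_probs + w_speech * speech_probs
```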