Title: multimodal emotion recognition and expressivity analysis (ICME 2005 Special Session)
1. multimodal emotion recognition and expressivity analysis (ICME 2005 Special Session)
- Stefanos Kollias, Kostas Karpouzis
- Image, Video and Multimedia Systems Lab, National Technical University of Athens
2. expressivity and emotion recognition
- affective computing: the capability of machines to recognize, express, model, communicate and respond to emotional information
- computers need the ability to recognize human emotion
- everyday HCI is emotional: three-quarters of computer users admit to swearing at computers
- user input and system reaction are important to pinpoint problems or provide natural interfaces
3. the targeted interaction framework
- Generating intelligent interfaces with affective, learning, reasoning and adaptive capabilities.
- Multidisciplinary expertise is the basic means for novel interfaces, including perception and emotion recognition, semantic analysis, cognition, modelling and expression generation, and production of multimodal avatars capable of adapting to the goals and context of interaction.
- Humans function through four primary modes of being, i.e., affect, motivation, cognition, and behavior; these are related to feeling, wanting, thinking, and acting.
- Affect is particularly difficult, requiring us to understand and model the causes and consequences of emotions. The latter, especially as realized in behavior, is a daunting task.
4. (no transcript)
5. everyday emotional states
"I think you might be getting just a wee bit bored... maybe a coffee?"
- dramatic extremes (terror, misery, elation) are fascinating, but marginal for HCI
- the target of an affect-aware system:
- register everyday states with an emotional component: excitement, boredom, irritation, enthusiasm, stress, satisfaction, amusement
- achieve sensitivity to everyday emotional states
6. affective computing applications
- detect specific incidents/situations that need human intervention
- e.g. anger detection in a call center
- naturalistic interfaces
- the keyboard/mouse/pointer paradigm can be difficult for the elderly, handicapped people or children
- speech and gesture interfaces can be useful
7. the EU perspective
- until 2002, related research was dominated by mid-scale projects:
- ERMIS: multimodal emotion recognition (facial expressions, linguistic and prosody analysis)
- NECA: networked affective ECAs
- SAFIRA: affective input interfaces
- NICE: Natural Interactive Communication for Edutainment
- MEGA: Multisensory Expressive Gesture Applications
- INTERFACE: Multimodal Analysis/Synthesis System for Human Interaction to Virtual and Augmented Environments
8. the EU perspective
- FP6 (2002-2006) issued two calls for multimodal interfaces
- Call 1 (April 2003) and Call 5 (September 2005), covering multimodal and multilingual areas
- Integrated Projects: AMI (Augmented Multi-party Interaction) and CHIL (Computers In the Human Interaction Loop)
- Networks of Excellence: HUMAINE and SIMILAR
- other calls covered Leisure and Entertainment, e-Inclusion, Cognitive Systems, and Presence and Interaction
9. the HUMAINE Network of Excellence
- FP6 Call 1 Network of Excellence: Research on Emotions and Human-Machine Interaction
- start: 1st January 2004; duration: 48 months
- IST thematic priority: Multimodal Interfaces
- emotions in human-machine interaction
- creation of a new, interdisciplinary research community
- advancing the state of the art in a principled way
10. the HUMAINE Network of Excellence
- 33 partner groups from 14 countries
- coordinated by Queen's University of Belfast
- goals of HUMAINE:
- integrate existing expertise in psychology, computer engineering, cognition, interaction and usability
- promote shared insight
- http://emotion-research.net
11. moving forward
- future EU orientations include (extracted from the Call 1 evaluation, 2004):
- adaptability and re-configurable interfaces
- collaborative technologies and interfaces in the arts
- less explored modalities, e.g. haptics, bio-sensing
- affective computing, including character and facial expression recognition and animation
- more product innovation and industrial impact
- FP7 direction: Simulation, Visualization, Interaction, Mixed Reality
- blending semantic/knowledge and interface technologies
12. the special session
- segment-based approach to the recognition of emotions in speech
- M. Shami, M. Kamel, University of Waterloo
- comparing feature sets for acted and spontaneous speech in view of automatic emotion recognition
- T. Vogt, E. Andre, University of Augsburg
- annotation and detection of blended emotions in real human-human dialogs recorded in a call center
- L. Vidrascu, L. Devillers, LIMSI-CNRS, France
- a real-time lip sync system using a genetic algorithm for automatic neural network configuration
- G. Zoric, I. Pandzic, University of Zagreb
- visual/acoustic emotion recognition
- Cheng-Yao Chen, Yue-Kai Huang, Perry Cook, Princeton University
- an intelligent system for facial emotion recognition
- R. Cowie, E. Douglas-Cowie, Queen's University of Belfast, J. Taylor, King's College, S. Ioannou, M. Wallace, IVML/NTUA
13. the big picture
- feature extraction from multiple modalities
- prosody, words, face, gestures, biosignals
- unimodal recognition
- multimodal recognition
- using detected features to cater for affective interaction (see the pipeline sketch below)
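A minimal sketch of how such a pipeline might be organized, assuming per-modality feature extractors and classifiers with late fusion by averaging; all names and types here are illustrative, not taken from the systems described in this deck:

```python
from typing import Callable, Dict, List

# Hypothetical label set; the actual systems use richer everyday states.
EMOTIONS = ["neutral", "boredom", "irritation", "satisfaction"]

class Modality:
    """One channel (prosody, words, face, gestures, biosignals)."""
    def __init__(self, name: str,
                 extract: Callable[[bytes], List[float]],
                 classify: Callable[[List[float]], Dict[str, float]]):
        self.name = name
        self.extract = extract      # raw signal -> feature vector
        self.classify = classify    # feature vector -> class probabilities

def recognize(raw: Dict[str, bytes], modalities: List[Modality]) -> Dict[str, float]:
    """Unimodal recognition per channel, then late fusion by averaging."""
    fused = {e: 0.0 for e in EMOTIONS}
    for m in modalities:
        probs = m.classify(m.extract(raw[m.name]))   # unimodal result
        for e in EMOTIONS:
            fused[e] += probs.get(e, 0.0) / len(modalities)
    return fused
```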
14. audiovisual emotion recognition
- the core system combines modules dealing with:
- visual signs
- the linguistic content of speech (what you say)
- the paralinguistic content (how you say it)
- and recognition based on all the signs
15. facial analysis module
- face detection, i.e. finding a face without prior information about its location
- or using prior knowledge about where to look
- face tracking
- extraction of key regions and points in the face
- monitoring of movements over time (as features for users' expressions/emotions)
- providing a confidence level for the validity of each detected feature
16. facial analysis module
- face detection, obtained through SVM classification
- facial feature extraction, by robust estimation of the primary facial features, i.e., eyes, mouth, eyebrows and nose
- fusion of different extraction techniques, with confidence level estimation
- MPEG-4 FP and FAP feature extraction to feed the expression and emotion recognition task
- 3-D modeling for improved accuracy in FP and FAP feature estimation, at an increased computational load, when the facial user model is known
17. facial analysis module
[figure: the extracted mask for the eyes; detected feature points in the masks]
18. FAP estimation
- absence of a clear quantitative definition of FAPs
- it is possible to model FAPs through the movement of FDP feature points, using distances s(x, y)
- e.g. close_t_r_eyelid (F20) and close_b_r_eyelid (F22): D13 = s(3.2, 3.4), f13 = D13 - D13_NEUTRAL
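A minimal sketch of this distance-based FAP estimate: the FP indices (3.2, 3.4) and the D13/f13 names follow the slide's example, while the helper functions and the coordinates are illustrative assumptions:

```python
import math

def s(p, q):
    """Euclidean distance between two feature points (x, y)."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def fap_eyelid_closure(fp, fp_neutral):
    """f13 = D13 - D13_NEUTRAL, with D13 = s(FP 3.2, FP 3.4),
    as in the close_t_r_eyelid / close_b_r_eyelid example above."""
    d13 = s(fp["3.2"], fp["3.4"])
    d13_neutral = s(fp_neutral["3.2"], fp_neutral["3.4"])
    return d13 - d13_neutral

# invented coordinates: positive output means the eyelids are
# further apart than in the neutral frame
current = {"3.2": (120, 80), "3.4": (120, 95)}
neutral = {"3.2": (120, 82), "3.4": (120, 94)}
print(fap_eyelid_closure(current, neutral))
```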
19. face detection
[diagram: an SVM classifier labels each candidate region as "face" or "no face"]
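A minimal sliding-window sketch of that face/no-face decision, assuming a scikit-learn SVM already trained on fixed-size grayscale patches; the window size, step and class encoding are invented, and multi-scale search and confidence estimation are omitted:

```python
import numpy as np
from sklearn.svm import SVC

PATCH = 19  # window size in pixels; an assumption, not the ERMIS value

def detect_faces(image: np.ndarray, clf: SVC, step: int = 4):
    """Slide a PATCH x PATCH window over a grayscale image and keep
    the windows the SVM labels as 'face' (class 1)."""
    hits = []
    h, w = image.shape
    for y in range(0, h - PATCH, step):
        for x in range(0, w - PATCH, step):
            window = image[y:y + PATCH, x:x + PATCH].ravel()[None, :]
            if clf.predict(window)[0] == 1:   # 1 = face, 0 = no face
                hits.append((x, y, PATCH, PATCH))
    return hits
```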
20. face detection
[figure: detected face; estimation of the active contour of the face; extraction of the facial area]
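A minimal sketch of the contour-estimation step using scikit-image's active contour (snake), initialized from a hypothetical detection box; all parameter values are illustrative:

```python
import numpy as np
from skimage.filters import gaussian
from skimage.segmentation import active_contour

def face_contour(gray: np.ndarray, box):
    """Fit a snake around the face, initialized as an ellipse
    inscribed in the (x, y, w, h) detection box."""
    x, y, w, h = box
    t = np.linspace(0, 2 * np.pi, 200)
    init = np.column_stack([y + h / 2 + (h / 2) * np.sin(t),   # rows
                            x + w / 2 + (w / 2) * np.cos(t)])  # cols
    snake = active_contour(gaussian(gray, sigma=3),
                           init, alpha=0.015, beta=10.0, gamma=0.001)
    return snake  # (N, 2) array of (row, col) contour points
```

The facial area can then be extracted by masking the pixels inside the returned contour.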
21. facial feature extraction
[figure: extraction of eye and mouth key regions within the face; extraction of MPEG-4 Feature Points (FPs), i.e. key points in the eye and mouth regions]
22. other visual features
- visemes, eye gaze, head pose
- movement patterns, temporal correlations
- hand gestures, body movements
- deictic/conversational gestures
- body language
- measurable parameters to render expressivity on affective ECAs
- spatial extent, repetitiveness, volume, etc. (see the sketch below)
23. video analysis using 3D
- Step 1: scan or approximate a 3D model
- (in this case estimated from video data only, using a face space approach)
24. video analysis using 3D
- Step 2: represent the 3D model using a predefined template geometry; the same template is used for expressions
- this template shows higher density around the eyes and mouth, and lower density around flatter areas such as the cheeks, forehead, etc.
25. video analysis using 3D
- Step 3: construct a database of facial expressions by recording various actors; the statistics derived from these performances are stored in terms of a Dynamic Face Space
- Step 4: apply the expressions to the actor in the video data
26. video analysis using 3D
- Step 5: matching; rotate the head, apply various expressions and match the current state with the 2D video frame
- a global minimization process
27. video analysis using 3D
- the global matching/minimization process is complex (a simplified sketch follows below)
- it is sensitive to:
- illumination, which may vary across the sequence
- shading and shadowing effects on the face
- color changes, or color differences
- variability in expressions: some expressions cannot be generated using the statistics of the a priori recorded sequences
- it is time consuming (several minutes per frame)
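A heavily simplified sketch of the matching/minimization step as analysis-by-synthesis: fit head rotation, translation and face-space expression coefficients by least-squares over sparse point reprojection error. The linear face space, z-axis-only rotation and orthographic projection are illustrative assumptions; the real system matches dense appearance under varying illumination, which is why it takes minutes per frame:

```python
import numpy as np
from scipy.optimize import least_squares

def rot_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def residuals(params, mean, basis, observed_2d):
    """params = [theta, tx, ty, expression coefficients...]."""
    theta, tx, ty = params[:3]
    coeffs = params[3:]
    # linear face space: neutral shape plus expression deformation
    shape = mean + basis @ coeffs            # (3N,) flattened vertices
    pts = shape.reshape(-1, 3) @ rot_z(theta).T
    proj = pts[:, :2] + np.array([tx, ty])   # orthographic projection
    return (proj - observed_2d).ravel()

# toy data: 10 model points, a 2-mode expression basis
rng = np.random.default_rng(0)
mean = rng.normal(size=30)
basis = rng.normal(size=(30, 2)) * 0.1
observed = mean.reshape(-1, 3)[:, :2] + 0.05  # slightly shifted "frame"

fit = least_squares(residuals, x0=np.zeros(5), args=(mean, basis, observed))
print(fit.x)  # recovered pose (theta, tx, ty) and expression coefficients
```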
28. video analysis using 3D
[figure: local template matching; pose estimation]
29. video analysis using 3D
30. video analysis using 3D
[figure: 3D models]
31. video analysis using 3D
[figure: adding expressions]
32. auditory module
- linguistic analysis aims to extract the words that the speaker produces
- paralinguistic analysis aims to extract significant variations in the way words are produced, mainly in pitch, loudness, timing, and voice quality
- both are designed to cope with the less-than-perfect signals that are likely to occur in real use
33. linguistic analysis
[diagram: (a) the Linguistic Analysis Subsystem; (b) the Speech Recognition Module, in which parameter extraction, acoustic modeling, language modeling and a dictionary feed a search engine that turns the enhanced speech signal into text]
34. paralinguistic analysis
- ASSESS, developed by QUB, describes speech at multiple levels:
- intensity and spectrum; edits, pauses, frication
- raw pitch estimates and a smooth fitted curve
- rises and falls in intensity and pitch
- (a comparable open-source extraction is sketched below)
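ASSESS itself is not publicly distributed; a rough open-source analogue of its lower levels, assuming librosa, could extract pitch and intensity contours and pause statistics along these lines (the thresholds are invented):

```python
import librosa
import numpy as np

def paralinguistic_profile(wav_path: str):
    """Crude stand-in for ASSESS's lower levels: pitch contour,
    intensity contour, and pause detection."""
    y, sr = librosa.load(wav_path, sr=16000)
    # raw pitch estimates (F0) with voicing decisions
    f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                 fmax=librosa.note_to_hz("C6"))
    # intensity as frame-wise RMS energy
    rms = librosa.feature.rms(y=y)[0]
    # pauses: frames below a hand-tuned fraction of peak energy
    pauses = rms < 0.1 * rms.max()
    return {
        "median_f0": float(np.nanmedian(f0)),
        "f0_range": float(np.nanmax(f0) - np.nanmin(f0)),
        "pause_ratio": float(pauses.mean()),
        "intensity_var": float(rms.var()),
    }
```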
35. integrating the evidence
- level 1:
- facial emotion
- phonetic emotion
- linguistic emotion
- level 2:
- total emotional state (result of the "level 1" emotions)
- modeling technique: fuzzy set theory (research by Massaro suggests this models the way humans integrate signs); a toy sketch follows
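A toy sketch of level-2 integration in that fuzzy spirit: each level-1 module reports membership degrees over emotion labels, and the total state multiplies the supports and renormalizes, roughly as in Massaro's fuzzy logical model of perception. The numbers are invented:

```python
def fuzzy_integrate(*memberships):
    """Combine per-modality fuzzy membership degrees into a total
    emotional state: multiply supports per label, then normalize."""
    labels = memberships[0].keys()
    combined = {e: 1.0 for e in labels}
    for m in memberships:
        for e in labels:
            combined[e] *= m.get(e, 1e-6)  # small floor avoids hard vetoes
    total = sum(combined.values()) or 1.0
    return {e: v / total for e, v in combined.items()}

facial     = {"anger": 0.7, "boredom": 0.2, "neutral": 0.1}
phonetic   = {"anger": 0.5, "boredom": 0.3, "neutral": 0.2}
linguistic = {"anger": 0.6, "boredom": 0.1, "neutral": 0.3}
print(fuzzy_integrate(facial, phonetic, linguistic))  # anger dominates
```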
36. integrating the evidence
- mechanisms linking attention and emotion in the brain form a useful model
[diagram: Goals (SFG); IMC (heteromodal cortex); Goals/Inhibition (ACG); Visual Input; Thalamus/Superior Colliculus; Salience: NBM (ACh source); Valence: Amygdala]
37. biosignal analysis
- different emotional expressions produce different changes in autonomic activity:
- anger: increased heart rate and skin temperature
- fear: increased heart rate, decreased skin temperature
- happiness: decreased heart rate, no change in skin temperature
- easily integrated with external channels (face and speech); a toy rule-based sketch follows
- presentation by J. Kim at the HUMAINE WP4 workshop, September 2004
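A toy rule-based sketch of exactly the mapping listed above, assuming baseline-relative heart-rate and skin-temperature deltas; the thresholds are invented, and a deployed system would learn such boundaries from data:

```python
def autonomic_rule(hr_delta: float, temp_delta: float, eps: float = 0.5):
    """Map baseline-relative heart-rate and skin-temperature changes
    to the three patterns on the slide; eps is an invented dead zone."""
    if hr_delta > eps and temp_delta > eps:
        return "anger"        # HR up, skin temperature up
    if hr_delta > eps and temp_delta < -eps:
        return "fear"         # HR up, skin temperature down
    if hr_delta < -eps and abs(temp_delta) <= eps:
        return "happiness"    # HR down, skin temperature unchanged
    return "unknown"

print(autonomic_rule(hr_delta=3.0, temp_delta=-1.2))  # -> fear
```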
38biosignal analysis
Acoustics and noise
EEG Brain waves
Respiration Breathing rate
Temperature
EMG Muscle tension
BVP- Blood volume pulse
GSR Skin conductivity
EKG Heart rate
39. biosignal analysis
- skin sensing requires physical contact
- accuracy and robustness to motion artifacts need improvement; signals are vulnerable to distortion
- most research measures artificially elicited emotions in a lab environment, and from a single subject
- different individuals show emotion with different responses in autonomic channels (hard for multi-subject settings)
- physiological emotion recognition is rarely studied; the literature offers ideas rather than well-defined solutions
40. multimodal emotion recognition
- recognition models and application dependency
- discrete / dimensional / appraisal-theory models
- theoretical models of multimodal integration
- direct / separate / dominant / motor integration
- modality synchronization
- visemes, EMGs and FAPs; SC-RSP and speech
- temporal evolution and modality sequentiality
- multimodal recognition techniques (contrasted in the sketch below)
- classifiers; context; goals; cognition/attention; modality significance in interaction
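A minimal sketch contrasting two of the integration models named above: direct (feature-level) fusion trains one classifier over concatenated modality features, while separate (decision-level) fusion combines per-modality outputs, here weighted by an assumed modality-significance factor. The classifier choice and weights are placeholders:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def direct_fusion(face_X, speech_X, y):
    """Feature-level (direct) integration: one classifier over the
    concatenated modality features."""
    return LogisticRegression(max_iter=1000).fit(
        np.hstack([face_X, speech_X]), y)

def separate_fusion(face_probs, speech_probs, w_face=0.6, w_speech=0.4):
    """Decision-level (separate) integration: weight each modality's
    class probabilities by its assumed significance in context."""
    return w_face * face_probs + w_speech * speech_probs
```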