Transcript and Presenter's Notes

Title: Combined Gesture-Speech Analysis and Synthesis


1
Combined Gesture-Speech Analysis and Synthesis
  • M. Emre Sargin, Engin Erzin, Yücel Yemez, A.
    Murat Tekalp
  • {msargin, eerzin, yyemez, mtekalp}@ku.edu.tr
  • Multimedia Vision and Graphics Laboratory, Koç
    University

2
Outline
  • Project Objective
  • Technical Description
  • Preparation of Gesture-Speech Database
  • Detection of Gesture Elements
  • Gesture-Speech Correlation Analysis
  • Synthesis of Gestures Accompanying Speech
  • Resources
  • Work Plan
  • Team Members

3
Project Objective
  • The production of speech and gesture is
    interactive throughout the entire communication
    process.
  • Computer-Human Interaction systems should be
    interactive in the same way: in an edutainment
    application, an animated person's speech should be
    aided and complemented by his/her gestures.
  • Two main goals of this project:
  • Analysis and modeling of the correlation between
    speech and gestures.
  • Synthesis of correlated natural gestures
    accompanying speech.

4
Technical Description
  • Preparation of Gesture-Speech Database
  • Detection of Gesture Elements
  • Gesture-Speech Correlation Analysis
  • Synthesis of Gestures Accompanying Speech

5
Preparation of Database
  • Gestures of a specific person will be
    investigated.
  • The video database for that specific person
    should include the gestures that he/she
    frequently uses.
  • Locations of the head, arms, elbows, etc. should
    be easily detectable and traceable.

6
Detection of Gesture Elements
  • In this project, we consider arm and head
    gestures.
  • The main tasks in the detection of gesture
    elements:
  • Tracking of head region.
  • Tracking of hand and possibly shoulder and elbow.
  • Extraction of gesture features.
  • Recognition and labeling of gestures.

7
Head Region Tracking
  • To extract motion information from the head, one
    must first extract the head region.
  • An exhaustive search for the head in each frame
    is a possible solution; however, it is
    computationally inefficient.
  • Tracking is efficient in terms of computational
    complexity.
  • The motion information calculated for tracking
    will also be used for the head gesture features.

8
Tracking Methodology
  • Exhaustive search for head region in initial
    frame
  • Haar-Based Face Detection
  • Skin Color information
  • Extraction of motion information from head region
  • Optical flow vectors
  • Fitting global motion parameters optical flow
    vectors
  • Warp search window according to motion
    information.
  • Search for head region in the search window.
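
A minimal sketch of this tracking loop, using the Python bindings of the OpenCV library listed under Resources (a modern stand-in for the library as it existed at the time). The video file name, cascade choice, and all parameter values are illustrative assumptions, and the windowed re-detection step is only indicated in a comment.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("speaker.avi")            # hypothetical input video
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

# 1) Exhaustive search in the initial frame: Haar-based face detection.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
x, y, w, h = cascade.detectMultiScale(prev_gray, 1.1, 4)[0]

# Corner features inside the head region feed the optical-flow step.
mask = np.zeros_like(prev_gray)
mask[y:y + h, x:x + w] = 255
pts = cv2.goodFeaturesToTrack(prev_gray, 50, 0.01, 5, mask=mask)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # 2) Optical-flow vectors for the head region (pyramidal Lucas-Kanade).
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
    good_old, good_new = pts[status == 1], nxt[status == 1]

    # 3) Fit global motion parameters to the optical-flow vectors.
    M, _ = cv2.estimateAffinePartial2D(good_old, good_new)

    # 4) Warp the search window according to the fitted motion; the
    #    windowed re-detection of the head region is omitted here.
    x, y = cv2.transform(np.array([[[x, y]]], np.float32), M)[0][0]

    prev_gray, pts = gray, good_new.reshape(-1, 1, 2)
```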

9
Head Tracking Results
10
Hand Tracking Methodology
  • The hand region will be extracted using skin
    color information.
  • Robust state-space tracking will be applied:
  • Observations are the position of the hand.
  • States are the position, speed, and acceleration
    of the hand.
  • Kalman filtering removes unwanted noise from the
    features.
  • In a regular Kalman filter, the parameters are
    fixed.
  • In a robust Kalman filter, the parameters are
    re-adjusted at each iteration to minimize the MSE
    and overcome the effects of abrupt changes in the
    motion of the hand.
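
A minimal 1-D constant-acceleration Kalman filter, as a sketch of the state-space tracker described above (one such filter per image coordinate). The frame rate, noise covariances, and synthetic measurements are illustrative assumptions; the robust variant's re-tuning is only noted in a comment.

```python
import numpy as np

dt = 1.0 / 25.0                      # assumed 25 fps video
F = np.array([[1, dt, 0.5 * dt**2],  # state transition for state
              [0, 1,  dt],           # [position, velocity, acceleration]
              [0, 0,  1.0]])
H = np.array([[1.0, 0.0, 0.0]])      # we observe hand position only
Q = np.eye(3) * 1e-3                 # process noise, fixed (regular KF)
R = np.array([[0.1]])                # measurement noise, fixed

x, P = np.zeros((3, 1)), np.eye(3)
measurements = np.cumsum(np.random.randn(100))  # stand-in position track

for z in measurements:
    # Predict.
    x, P = F @ x, F @ P @ F.T + Q
    # Update with the observed position z.
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
    x = x + K @ (np.array([[z]]) - H @ x)
    P = (np.eye(3) - K @ H) @ P
    # A robust Kalman filter would re-adjust Q and R here from the
    # innovation statistics to absorb abrupt changes in hand motion.
```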

11
Extraction of Gesture Features
  • Head Gesture Features: the global motion
    parameters calculated within the head region will
    be used.
  • Hand Gesture Features: the hand's center-of-mass
    position and its calculated velocity will form
    the hand gesture features.
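
As a sketch of how these per-frame feature vectors might be assembled; the layouts below are illustrative assumptions, not the project's exact definitions.

```python
import numpy as np

def head_features(global_motion_params):
    # Head gesture features: the fitted global motion parameters
    # (e.g. a 2x3 affine matrix) flattened into one vector.
    return np.asarray(global_motion_params).ravel()

def hand_features(center_prev, center, dt):
    # Hand gesture features: center-of-mass position plus the
    # velocity estimated from consecutive frames.
    center_prev, center = np.asarray(center_prev), np.asarray(center)
    velocity = (center - center_prev) / dt
    return np.concatenate([center, velocity])    # [cx, cy, vx, vy]
```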

12
Gesture-Speech Correlation Analysis
  • Recognized gestures are labeled w.r.t. time.
  • Head Gestures: Down, Up, Left, Right, Left-Right,
    etc.
  • Arm Gestures: Abduction, Adduction, Extension,
    etc.
  • Recognized speech patterns are labeled w.r.t.
    time.
  • Semantic Info: Approval and Refusal phrases, etc.
  • Prosodic Info: Intonational phrases, ToBI
    transcriptions, etc.
  • Correlation analysis via examining:
  • Co-occurrence Matrix
  • Input/Output Hidden Markov Models

13
Co-occurrence Matrix
  • Estimation of the joint probability distribution
    function f(g, s):
  • For each time sample, give a vote to the related
    gesture-speech label pair.
  • For a specific speech element s_i, the most
    correlated gesture is
  • g_i = argmax_x f(g_x, s_i)
  • Relatively easy to compute.
  • Gives an intuition about what we are examining.
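
A small sketch of this voting scheme; the label alphabets are illustrative assumptions.

```python
import numpy as np

gestures = ["down", "up", "left", "right", "none"]   # assumed alphabet
speech = ["approval", "refusal", "other"]            # assumed alphabet
counts = np.zeros((len(gestures), len(speech)))

def vote(g_label, s_label):
    # One vote per time sample for the co-occurring label pair.
    counts[gestures.index(g_label), speech.index(s_label)] += 1

def most_correlated_gesture(s_label):
    # g_i = argmax_x f(g_x, s_i), with f the normalized count matrix.
    f = counts / counts.sum()        # estimate of the joint pdf f(g, s)
    return gestures[np.argmax(f[:, speech.index(s_label)])]

vote("up", "approval")                        # e.g. one labeled time sample
print(most_correlated_gesture("approval"))    # -> "up"
```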
14
Input/Output Hidden Markov Models
  • An IOHMM is a graphical model that allows the
    mapping of input sequences into output sequences.
  • It is used in three sequence-processing tasks:
  • Prediction
  • Regression
  • Classification
  • The model is trained to maximize the conditional
    probability of an output sequence y_1, ..., y_t
    given an input sequence x_1, ..., x_t.
  • In our project:
  • The input sequence will be the speech labels.
  • The output sequence will be the gesture labels.
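
A minimal forward pass for a discrete IOHMM, sketching how the conditional probability P(y | x) is evaluated when transitions and emissions depend on the input symbol. The toy dimensions and random parameters are illustrative assumptions, and training (e.g. by EM, as in the Torch library listed under Resources) is omitted.

```python
import numpy as np

n_states, n_in, n_out = 3, 2, 4
rng = np.random.default_rng(0)

# A[x] is the state-transition matrix used when the input symbol is x;
# B[x] holds the per-state output distributions for input symbol x.
A = rng.dirichlet(np.ones(n_states), size=(n_in, n_states))
B = rng.dirichlet(np.ones(n_out), size=(n_in, n_states))
pi = np.full(n_states, 1.0 / n_states)

def conditional_likelihood(xs, ys):
    # Forward algorithm for P(y_1..y_t | x_1..x_t) under the
    # input-driven model.
    alpha = pi * B[xs[0]][:, ys[0]]
    for x, y in zip(xs[1:], ys[1:]):
        alpha = (alpha @ A[x]) * B[x][:, y]
    return alpha.sum()

# Toy sequences: speech labels as inputs, gesture labels as outputs.
print(conditional_likelihood(xs=[0, 1, 1], ys=[2, 0, 3]))
```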

15
Synthesis of Gestures Accompanying Speech
  • Based on the methodology used in the correlation
    analysis, given a speech signal:
  • Features will be extracted.
  • The most probable speech label will be assigned
    to each speech pattern.
  • The gesture pattern that is most correlated with
    the speech pattern will be used to animate a
    stick model of a person.
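
Putting the pieces together, a sketch of this synthesis chain under the co-occurrence approach; every helper below is a hypothetical stand-in, and most_correlated_gesture() refers to the co-occurrence sketch above.

```python
def split_into_patterns(speech_signal):
    # Hypothetical: segment the signal into speech patterns and
    # extract features per pattern.
    return speech_signal

def classify_speech_label(pattern):
    # Hypothetical: return the most probable speech label for a pattern.
    return "approval"

def animate_stick_model(gesture_label):
    # Hypothetical: drive the stick-figure animation with the gesture.
    print("animating:", gesture_label)

def synthesize(speech_signal):
    # For each speech pattern, pick the most correlated gesture pattern
    # (via the co-occurrence estimate) and animate the stick model.
    for pattern in split_into_patterns(speech_signal):
        s_label = classify_speech_label(pattern)
        animate_stick_model(most_correlated_gesture(s_label))
```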

16
Resources
  • Database Preparation and Labeling:
  • VirtualDub
  • Anvil
  • Praat
  • Image Processing and Feature Extraction:
  • Matlab Image Processing Toolbox
  • OpenCV Image Processing Library
  • Gesture-Speech Correlation Analysis:
  • HTK HMM Toolbox
  • Torch Machine Learning Library

17
Work Plan
  • Timeline of the project
  • Schedule of the lectures

18
Team Members
  • Ferda Ofli
  • Koç University
  • Image, Video Processing and Feature Extraction
  • Yelena Yasinnik
  • Massachusetts Institute of Technology
  • Audio-Visual Correlation Analysis
  • Oya Aran
  • Boğaziçi University
  • Gesture Based Human-Computer Interaction Systems

19
Team Members
  • Alexey Anatolievich Karpov
  • Saint-Petersburg Institute for Informatics and
    Automation
  • Speech Based Human-Computer Interaction Systems
  • Stephen Wilson
  • University College Dublin
  • Audio-Visual Gesture Annotation
  • Alexander Refsum Jensenius
  • Department of Music, University of Oslo
  • Gesture Analysis

20
References
  • J. Yao and J. R. Cooperstock, "Arm Gesture
    Detection in a Classroom Environment," Proc.
    WACV '02, pp. 153-157, 2002.
  • Y. Azoz, L. Devi, R. Sharma, "Tracking Hand
    Dynamics in Unconstrained Environments," Proc.
    Int. Conference on Automatic Face and Gesture
    Recognition '98, pp. 274-279, 1998.
  • S. Malassiotis, N. Aifanti, M. G. Strintzis, "A
    Gesture Recognition System Using 3D Data," Proc.
    Int. Symposium on 3D Data Processing,
    Visualization and Transmission '02, pp. 190-193,
    2002.
  • J.-M. Chung, N. Ohnishi, "Cue Circles: Image
    Feature for Measuring 3-D Motion of Articulated
    Objects Using Sequential Image Pair," Proc. Int.
    Conference on Automatic Face and Gesture
    Recognition '98, pp. 474-479, 1998.
  • S. Kettebekov, M. Yeasin, R. Sharma, "Prosody
    based co-analysis for continuous recognition of
    coverbal gestures," Proc. ICMI '02, pp. 161-166,
    2002.
  • F. Quek, D. McNeill, R. Ansari, X.-F. Ma, R.
    Bryll, S. Duncan, K. E. McCullough, "Gesture cues
    for conversational interaction in monocular
    video," Proc. Int. Workshop on Recognition,
    Analysis, and Tracking of Faces and Gestures in
    Real-Time Systems '99, pp. 119-126, 1999.
  • For detailed information visit
    http://htk.eng.cam.ac.uk
  • L. Rabiner, B. Juang, "An introduction to hidden
    Markov models," IEEE ASSP Magazine, Vol. 3,
    Iss. 1, pp. 4-16, Jan 1986.
  • A. Just, O. Bernier, S. Marcel, "Recognition of
    isolated complex mono- and bi-manual 3D hand
    gestures," Proc. 6th ICAFGR, 2004.