Exploiting video information for Meeting Structuring

1
Exploiting video information for Meeting
Structuring
2
Agenda
  • Introduction
  • Feature set extension
  • Video features processing
  • Video features integration
  • Preliminary results
  • Conclusions

3
Meeting Structuring (1)
  • Goal: recognise events which involve one or more
    communicative modalities
  • Monologue / Dialogue / Note taking / Presentation
    / Presentation at the whiteboard
  • Working environment: IDIAP framework
  • 69 five-minute meetings with 4 participants
  • 30 transcribed meetings
  • Scripted meeting structure

4
Meeting Structuring (2)
  • 3 audio-derived feature families:
  • Speaker Turns, Prosodic Features, Lexical
    Features

[Figure: feature extraction pipeline. Microphone-array
beam-forming yields Speaker Turns; lapel microphones yield
Prosody (pitch baseline, energy, rate of speech); ASR
transcription yields Lexical features, used for
monologue/dialogue (M/DI) discrimination]
5
Meeting Structuring (3)
  • Dynamic Bayesian Network based models (using
    GMTK, Bilmes et al.)
  • Multi-stream processing (parallel stream
    processing)
  • Counter structure (state duration modelling)

[Figure: two-stream DBN unrolled over time steps 0, t, t+1:
counter nodes C, action nodes A with auxiliary nodes E,
per-stream sub-states S1 and S2, and per-stream
observations Y1 and Y2]
  • 3 feature families:
  • Prosodic features (S1)
  • Speaker Turns (S2)
  • Lexical features (S3)
  • Leave-one-out cross-validation
    over 30 annotated meetings

Results:

               Corr  Sub  Del  Ins  AER
W/o counter    91.7  4.5  3.8  2.6  10.9
With counter   92.9  5.1  1.9  1.9   9.0
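(AER appears to be the sum of the Sub, Del and Ins rates,
e.g. 4.5 + 3.8 + 2.6 = 10.9 for the model without the
counter; the 9.0 in the counter row seems to reflect
rounding.)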
6
Feature set extension (1)
  • Multi-party meetings are multi-modal
    communicative processes
  • Our features cover only two modalities: audio
    (prosodic features, speaker turns) and lexical
    content (lexical monologue/dialogue
    discriminator)

Exploiting video content is the next step!
7
Feature set extension (2)
  • Goal: improve the recognition of Note
    taking, Presentation and Whiteboard

These are the three most confused symbols, and
three meeting actions which highly involve
body/hand movements.
Approach: extract low-level video features and
leave their interpretation to high-level
specialised models
8
Feature set extension (3)
  • We need motion features for hands/head-torso
    regions
  • Constraints:
  • The system must be simple
  • Reliable against environmental changes
    (lighting, backgrounds, ...)
  • Open to further extensions / modifications
  • Initial assumptions:
  • Meeting video content is quite static
  • Participants occupy only a few spatial regions
    and tend to stay there
  • Meeting room configuration (camera positions,
    seats, furniture, ...) is fixed

9
Video feature extraction (1)
  • Motion analysis is performed using
    Kanade-Lucas-Tomasi (KLT) feature tracking and
    partitioning the resulting trajectories
    according to their relative position in the
    scene
  • Four spatial regions for each scene: Head 1 / 2,
    Hands 1 / 2
10
KLT (1)
  • Assumption: the brightness of every point of a
    (slowly) moving or static object does not change
    between images taken at nearby time instants:
    $I(x, y, t) = I(x + u\,dt,\ y + v\,dt,\ t + dt)$
    (Taylor series approximated to the 1st
    derivative)
  • Optical flow constraint equation:
    $\nabla I \cdot v + I_t = 0$
  • $I_t$ represents how fast the intensity is
    changing with time, $v = (u, v)$ are the moving
    object speeds, and $\nabla I$ is the brightness
    gradient
  • One equation in two unknowns, hence more than
    one solution
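The constraint follows from a first-order Taylor expansion
of the brightness-constancy assumption; in LaTeX, roughly:

    I(x + u\,dt,\; y + v\,dt,\; t + dt)
      \approx I(x, y, t) + I_x\,u\,dt + I_y\,v\,dt + I_t\,dt
    % brightness constancy equates the left side to I(x, y, t);
    % subtracting I(x, y, t) and dividing by dt:
    I_x\,u + I_y\,v + I_t = 0
    \quad\Leftrightarrow\quad \nabla I \cdot v = -I_t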
11
KLT (2)
  • Minimizing the weighted least-square error
    $E(v) = \sum_{x'} w(x')\,[\nabla I(x') \cdot v + I_t(x')]^2$,
    where the $x'$ are neighbour points of $x$,
    assumed to share the same constant velocity
  • In two dimensions the system has the form
    $G v = b$, with $G = \sum w\,\nabla I\,\nabla I^{T}$
    and $b = -\sum w\,I_t\,\nabla I$
  • If $G$ is invertible, the solution is
    $v = G^{-1} b$ (see the sketch below)
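As a concrete illustration, a minimal NumPy sketch of
solving this 2x2 system for one window (the derivative
arrays Ix, Iy, It and the weights w are assumed inputs,
not from the slides):

    import numpy as np

    def lk_velocity(Ix, Iy, It, w):
        """Solve G v = b for one window.
        Ix, Iy, It: flattened spatial/temporal derivatives
        over the window; w: per-pixel weights."""
        grads = np.stack([Ix, Iy], axis=1)   # N x 2 gradients
        G = (w[:, None] * grads).T @ grads   # 2x2 structure tensor
        b = -(w[:, None] * grads).T @ It     # right-hand side
        return np.linalg.solve(G, b)         # velocity (u, v)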
12
KLT (3)
  • A good feature is one that can be tracked well
    (Tomasi et al.): if $\lambda_1, \lambda_2$ are
    the eigenvalues of $G$, the system is
    well-conditioned if $\min(\lambda_1, \lambda_2) > \lambda$
    for some threshold $\lambda$ (high texture
    content), with large eigenvalues but in the same
    range
  • ... and even better if it is part of a human
    body: pixels with a higher probability of being
    skin are preferred (see the sketch below)
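A possible form of the resulting feature-quality test (the
threshold values and the skin cutoff below are illustrative
assumptions, not taken from the slides):

    import numpy as np

    def is_good_feature(G, p_skin, lam_min=1e-2, ratio_max=10.0):
        """Accept a window if G is well-conditioned (both
        eigenvalues above lam_min and within ratio_max of
        each other) and the pixel is likely to be skin."""
        l1, l2 = np.linalg.eigvalsh(G)       # ascending order
        ok = l1 > lam_min and l2 / l1 < ratio_max
        return ok and p_skin > 0.5           # illustrative cutoff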
13
KLT (4)
  • KLT feature tracking consists of 3 steps:
  • Select n good features
  • Track the selected n features
  • Replace lost features

We decided to track n = 100 features using a square
(7x7) window
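The three steps, sketched with OpenCV's KLT implementation
(a minimal sketch: the video filename and the
corner-detection parameters are assumptions; n = 100
features and the 7x7 window come from the slide):

    import cv2
    import numpy as np

    cap = cv2.VideoCapture("meeting.avi")    # assumed input
    _, frame = cap.read()
    prev = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # 1. Select n = 100 good features
    pts = cv2.goodFeaturesToTrack(prev, maxCorners=100,
                                  qualityLevel=0.01, minDistance=5)

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # 2. Track the selected features in a 7x7 window
        pts, status, _ = cv2.calcOpticalFlowPyrLK(
            prev, gray, pts, None, winSize=(7, 7))
        pts = pts[status.ravel() == 1]
        # 3. Replace lost features to keep n = 100
        if len(pts) < 100:
            extra = cv2.goodFeaturesToTrack(
                gray, maxCorners=100 - len(pts),
                qualityLevel=0.01, minDistance=5)
            if extra is not None:
                pts = np.vstack([pts, extra])
        prev = gray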
14
Skin modelling
  • Colour-based approach in the (Cr, Cb) chromatic
    subspace

Skin samples are taken from unused meetings.
Initial experiments were made using a single
Gaussian; now a 3-component Gaussian Mixture Model
is used.
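A minimal sketch of such a model with scikit-learn (the
sample file and the use of sklearn are assumptions; the
slide does not specify the original implementation):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Assumed input: (N, 2) array of (Cr, Cb) values sampled
    # from skin pixels of held-out ("unused") meetings.
    skin_crcb = np.load("skin_crcb.npy")     # hypothetical file

    skin_gmm = GaussianMixture(n_components=3,
                               covariance_type="full")
    skin_gmm.fit(skin_crcb)

    def skin_score(crcb_pixels):
        """Log-likelihood of (Cr, Cb) pixels under the skin
        GMM; thresholding it yields a skin / non-skin map."""
        return skin_gmm.score_samples(crcb_pixels)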
15
Video feature extraction (2)
Structure of the implemented system:

[Figure: pipeline. Video frames feed Skin Detection (driven
by the skin model) and the KLT tracker, which outputs 100
features / frame; these are accumulated into a Trajectory
Structure of 100 trajectories / frame]
16
Video feature extraction (3)
Trajectory Structure processing (see the sketch below):
  • Remove long and quite static trajectories
  • Define 4 partitions (regions): 2 x heads (H1, H2)
    and 2 x hands (Ha1, Ha2)
  • Classify the trajectories into these 4 regions
  • Define 2 additional fixed regions (L, R)
  • Evaluate the average motion per region
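A sketch of the classification and average-motion steps
(the region rectangles, the static-motion threshold and the
trajectory format are illustrative assumptions):

    import numpy as np

    # Assumed region boxes: name -> (x0, y0, x1, y1)
    REGIONS = {"H1": (100, 0, 220, 120), "H2": (420, 0, 540, 120),
               "Ha1": (80, 200, 260, 360), "Ha2": (400, 200, 580, 360)}

    def region_of(p):
        for name, (x0, y0, x1, y1) in REGIONS.items():
            if x0 <= p[0] < x1 and y0 <= p[1] < y1:
                return name
        return None

    def average_motion(trajectories):
        """trajectories: list of (T, 2) arrays of tracked
        positions. Returns one mean motion vector per region."""
        motions = {name: [] for name in REGIONS}
        for traj in trajectories:
            disp = traj[-1] - traj[0]        # net displacement
            if np.linalg.norm(disp) < 1.0:   # drop static tracks
                continue
            name = region_of(traj[0])        # classify by position
            if name is not None:
                motions[name].append(disp)
        return {n: (np.mean(v, axis=0) if v else np.zeros(2))
                for n, v in motions.items()}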
17
Video feature extraction (4)
[Figure: four panels (1-4) illustrating the trajectory
processing steps]
18
Video feature extraction (5)
Averaging motion vectors over many trajectories helps
reduce noise.
For each scene, 4 motion vectors, one for each region, are
estimated (soon to be enhanced with 2 more regions/vectors,
L and R, in order to detect whether someone is entering or
leaving the scene).
  • Open issues:
  • Loss of tracking for fast-moving objects
    (to be accounted for during the tracking)
  • Assumption of a fixed scene structure
  • Delayed/offline processing

19
Integration
Goal: extend the multi-stream model with a new
video stream
  • It is possible that the extended model will be
    intractable due to the increased state space
  • In this case:
  • State-space reduction through a multi-time-scale
    approach will be attempted
  • Early integration of Speaker Turns and
    Lexical features will be investigated
20
Preliminary results
  • Before proceeding with the proposed integration
    we need to:
  • compare video performance against the other
    feature families
  • validate the extracted video features

Single-feature accuracy (%):

           Speaker   Prosodic  Lexical   Video
           Turns     Features  Features  Features
Accuracy   85.9      69.9      52.6      48.1

Two-stream models:

                            Corr  Sub  Del  Ins  AER
(A) Speaker Turns +
    (Prosody, Lexical)      87.8  4.5  7.7  3.2  15.4
(B) Speaker Turns +
    (Video Features)        90.4  3.2  6.4  4.5  14.1

Video features alone perform quite poorly, but they
seem to be helpful when evaluated together with
Speaker Turns.

21
Summary
  • Extraction of video features through:
  • a skin-detector-enhanced KLT feature tracker
  • segmentation of trajectories into 4/6 spatial
    regions
  • (a simple and fast approach, but with some open
    problems)
  • Validation of motion vectors as a video feature
  • Integration into the existing framework (work in
    progress)