Exploiting video information for Meeting Structuring

1
Exploiting video information for Meeting
Structuring
2
Agenda
  • Introduction
  • Feature set extension
  • Video features processing
  • Video features integration
  • Preliminary results
  • Conclusions

3
Meeting Structuring (1)
  • Goal: recognise events which involve one or more
    communicative modalities
  • Monologue / Dialogue / Note taking / Presentation
    / Presentation at the whiteboard
  • Working environment: IDIAP framework
  • 69 five-minute meetings with 4 participants
  • 30 transcribed meetings
  • Scripted meeting structure

4
Meeting Structuring (2)
  • 3 audio-derived feature families:
  • Speaker Turns, Prosodic Features, Lexical
    Features

[Figure: feature extraction pipeline. Microphone-array
beam-forming yields Speaker Turns; lapel microphones yield
Prosody (pitch baseline, energy, rate of speech); ASR
transcription yields Lexical features, used for
monologue/dialogue (M/DI) discrimination]
5
Meeting Structuring (3)
  • Dynamic Bayesian Network based models (using
    GMTK, Bilmes et al.)
  • Multi-stream processing (parallel stream
    processing)
  • Counter structure (state duration modelling)

[Figure: two-stream DBN unrolled over time steps 0, t, t+1:
counter nodes C, action nodes A with auxiliary nodes E,
per-stream sub-states S1 and S2, and per-stream
observations Y1 and Y2]
  • 3 feature families:
  • Prosodic features (S1)
  • Speaker Turns (S2)
  • Lexical features (S3)
  • Leave-one-out cross-validation
    over 30 annotated meetings

Results:

               Corr  Sub  Del  Ins  AER
W/o counter    91.7  4.5  3.8  2.6  10.9
With counter   92.9  5.1  1.9  1.9   9.0
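(AER appears to be the sum of the Sub, Del and Ins rates,
e.g. 4.5 + 3.8 + 2.6 = 10.9 for the model without the
counter; the 9.0 in the counter row seems to reflect
rounding.)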
6
Feature set extension (1)
  • Multi-party meetings are multi-modal
    communicative processes
  • Our features cover only two modalities: audio
    (prosodic features, speaker turns) and lexical
    content (lexical monologue/dialogue
    discriminator)

Exploiting video content is the next step!
7
Feature set extension (2)
  • Goal: improve the recognition of Note
    taking, Presentation and Whiteboard

These are the three most confused symbols, and
three meeting actions which highly involve
body/hand movements.
Approach: extract low-level video features and
leave their interpretation to high-level
specialised models
8
Feature set extension (3)
  • We need motion features for hands/head-torso
    regions
  • Constraints:
  • The system must be simple
  • Reliable against environmental changes
    (lighting, backgrounds, ...)
  • Open to further extensions / modifications
  • Initial assumptions:
  • Meeting video content is quite static
  • Participants occupy only a few spatial regions
    and tend to stay there
  • Meeting room configuration (camera positions,
    seats, furniture, ...) is fixed

9
Video feature extraction (1)
  • Motion analysis is performed using
    Kanade-Lucas-Tomasi (KLT) feature tracking and
    partitioning the resulting trajectories
    according to their relative position in the
    scene
  • Four spatial regions for each scene: Head 1 / 2,
    Hands 1 / 2
10
KLT (1)
  • Assumption: the brightness of every point of a
    (slowly) moving or static object does not change
    between images taken at nearby time instants:
    $I(x, y, t) = I(x + u\,dt,\ y + v\,dt,\ t + dt)$
    (Taylor series approximated to the 1st
    derivative)
  • Optical flow constraint equation:
    $\nabla I \cdot v + I_t = 0$
  • $I_t$ represents how fast the intensity is
    changing with time, $v = (u, v)$ are the moving
    object speeds, and $\nabla I$ is the brightness
    gradient
  • One equation in two unknowns, hence more than
    one solution
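The constraint follows from a first-order Taylor expansion
of the brightness-constancy assumption; in LaTeX, roughly:

    I(x + u\,dt,\; y + v\,dt,\; t + dt)
      \approx I(x, y, t) + I_x\,u\,dt + I_y\,v\,dt + I_t\,dt
    % brightness constancy equates the left side to I(x, y, t);
    % subtracting I(x, y, t) and dividing by dt:
    I_x\,u + I_y\,v + I_t = 0
    \quad\Leftrightarrow\quad \nabla I \cdot v = -I_t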
11
KLT (2)
  • Minimizing the weighted least-square error
    $E(v) = \sum_{x'} w(x')\,[\nabla I(x') \cdot v + I_t(x')]^2$,
    where the $x'$ are neighbour points of $x$,
    assumed to share the same constant velocity
  • In two dimensions the system has the form
    $G v = b$, with $G = \sum w\,\nabla I\,\nabla I^{T}$
    and $b = -\sum w\,I_t\,\nabla I$
  • If $G$ is invertible, the solution is
    $v = G^{-1} b$ (see the sketch below)
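As a concrete illustration, a minimal NumPy sketch of
solving this 2x2 system for one window (the derivative
arrays Ix, Iy, It and the weights w are assumed inputs,
not from the slides):

    import numpy as np

    def lk_velocity(Ix, Iy, It, w):
        """Solve G v = b for one window.
        Ix, Iy, It: flattened spatial/temporal derivatives
        over the window; w: per-pixel weights."""
        grads = np.stack([Ix, Iy], axis=1)   # N x 2 gradients
        G = (w[:, None] * grads).T @ grads   # 2x2 structure tensor
        b = -(w[:, None] * grads).T @ It     # right-hand side
        return np.linalg.solve(G, b)         # velocity (u, v)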
12
KLT (3)
  • A good feature is one that can be tracked well
    (Tomasi et al.): if $\lambda_1, \lambda_2$ are
    the eigenvalues of $G$, the system is
    well-conditioned if $\min(\lambda_1, \lambda_2) > \lambda$
    for some threshold $\lambda$ (high texture
    content), with large eigenvalues but in the same
    range
  • ... and even better if it is part of a human
    body: pixels with a higher probability of being
    skin are preferred (see the sketch below)
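A possible form of the resulting feature-quality test (the
threshold values and the skin cutoff below are illustrative
assumptions, not taken from the slides):

    import numpy as np

    def is_good_feature(G, p_skin, lam_min=1e-2, ratio_max=10.0):
        """Accept a window if G is well-conditioned (both
        eigenvalues above lam_min and within ratio_max of
        each other) and the pixel is likely to be skin."""
        l1, l2 = np.linalg.eigvalsh(G)       # ascending order
        ok = l1 > lam_min and l2 / l1 < ratio_max
        return ok and p_skin > 0.5           # illustrative cutoff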
13
KLT (4)
  • KLT feature tracking consists of 3 steps:
  • Select n good features
  • Track the selected n features
  • Replace lost features

We decided to track n = 100 features using a square
(7x7) window
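The three steps, sketched with OpenCV's KLT implementation
(a minimal sketch: the video filename and the
corner-detection parameters are assumptions; n = 100
features and the 7x7 window come from the slide):

    import cv2
    import numpy as np

    cap = cv2.VideoCapture("meeting.avi")    # assumed input
    _, frame = cap.read()
    prev = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # 1. Select n = 100 good features
    pts = cv2.goodFeaturesToTrack(prev, maxCorners=100,
                                  qualityLevel=0.01, minDistance=5)

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # 2. Track the selected features in a 7x7 window
        pts, status, _ = cv2.calcOpticalFlowPyrLK(
            prev, gray, pts, None, winSize=(7, 7))
        pts = pts[status.ravel() == 1]
        # 3. Replace lost features to keep n = 100
        if len(pts) < 100:
            extra = cv2.goodFeaturesToTrack(
                gray, maxCorners=100 - len(pts),
                qualityLevel=0.01, minDistance=5)
            if extra is not None:
                pts = np.vstack([pts, extra])
        prev = gray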
14
Skin modelling
  • Colour-based approach in the (Cr, Cb) chromatic
    subspace

Skin samples are taken from unused meetings.
Initial experiments were made using a single
Gaussian; now a 3-component Gaussian Mixture Model
is used.
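A minimal sketch of such a model with scikit-learn (the
sample file and the use of sklearn are assumptions; the
slide does not specify the original implementation):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Assumed input: (N, 2) array of (Cr, Cb) values sampled
    # from skin pixels of held-out ("unused") meetings.
    skin_crcb = np.load("skin_crcb.npy")     # hypothetical file

    skin_gmm = GaussianMixture(n_components=3,
                               covariance_type="full")
    skin_gmm.fit(skin_crcb)

    def skin_score(crcb_pixels):
        """Log-likelihood of (Cr, Cb) pixels under the skin
        GMM; thresholding it yields a skin / non-skin map."""
        return skin_gmm.score_samples(crcb_pixels)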
15
Video feature extraction (2)
Structure of the implemented system:

[Figure: pipeline. Video frames feed Skin Detection (driven
by the skin model) and the KLT tracker, which outputs 100
features / frame; these are accumulated into a Trajectory
Structure of 100 trajectories / frame]
16
Video feature extraction (3)
Trajectory Structure processing (see the sketch below):
  • Remove long and quite static trajectories
  • Define 4 partitions (regions): 2 x heads (H1, H2)
    and 2 x hands (Ha1, Ha2)
  • Classify the trajectories into these 4 regions
  • Define 2 additional fixed regions (L, R)
  • Evaluate the average motion per region
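A sketch of the classification and average-motion steps
(the region rectangles, the static-motion threshold and the
trajectory format are illustrative assumptions):

    import numpy as np

    # Assumed region boxes: name -> (x0, y0, x1, y1)
    REGIONS = {"H1": (100, 0, 220, 120), "H2": (420, 0, 540, 120),
               "Ha1": (80, 200, 260, 360), "Ha2": (400, 200, 580, 360)}

    def region_of(p):
        for name, (x0, y0, x1, y1) in REGIONS.items():
            if x0 <= p[0] < x1 and y0 <= p[1] < y1:
                return name
        return None

    def average_motion(trajectories):
        """trajectories: list of (T, 2) arrays of tracked
        positions. Returns one mean motion vector per region."""
        motions = {name: [] for name in REGIONS}
        for traj in trajectories:
            disp = traj[-1] - traj[0]        # net displacement
            if np.linalg.norm(disp) < 1.0:   # drop static tracks
                continue
            name = region_of(traj[0])        # classify by position
            if name is not None:
                motions[name].append(disp)
        return {n: (np.mean(v, axis=0) if v else np.zeros(2))
                for n, v in motions.items()}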
17
Video feature extraction (4)
[Figure: four panels (1-4) illustrating the trajectory
processing steps]
18
Video feature extraction (5)
Averaging motion vectors over many trajectories helps
reduce noise.
For each scene, 4 motion vectors, one for each region, are
estimated (soon to be enhanced with 2 more regions/vectors,
L and R, in order to detect whether someone is entering or
leaving the scene).
  • Open issues:
  • Loss of tracking for fast-moving objects
    (to be accounted for during the tracking)
  • Assumption of a fixed scene structure
  • Delayed/offline processing

19
Integration
Goal: extend the multi-stream model with a new
video stream
  • It is possible that the extended model will be
    intractable due to the increased state space
  • In this case:
  • State-space reduction through a multi-time-scale
    approach will be attempted
  • Early integration of Speaker Turns and
    Lexical features will be investigated
20
Preliminary results
  • Before proceeding with the proposed integration
    we need to:
  • compare video performance against the other
    feature families
  • validate the extracted video features

Single-feature accuracy (%):

           Speaker   Prosodic  Lexical   Video
           Turns     Features  Features  Features
Accuracy   85.9      69.9      52.6      48.1

Two-stream models:

                            Corr  Sub  Del  Ins  AER
(A) Speaker Turns +
    (Prosody, Lexical)      87.8  4.5  7.7  3.2  15.4
(B) Speaker Turns +
    (Video Features)        90.4  3.2  6.4  4.5  14.1

Video features alone perform quite poorly, but they
seem to be helpful when evaluated together with
Speaker Turns.

21
Summary
  • Extraction of video features through:
  • a skin-detector-enhanced KLT feature tracker
  • segmentation of trajectories into 4/6 spatial
    regions
  • (a simple and fast approach, but with some open
    problems)
  • Validation of motion vectors as a video feature
  • Integration into the existing framework (work in
    progress)