Dynamic Bayesian Networks for Meeting Structuring - PowerPoint PPT Presentation

Dynamic Bayesian Networks for Meeting Structuring Alfred Dielmann, Steve Renals (University of Sheffield) – PowerPoint PPT presentation

1
Dynamic Bayesian Networks for Meeting Structuring
  • Alfred Dielmann, Steve Renals
  • (University of Sheffield)

2
Introduction
GOAL
  • Automatic analysis of meetings through
    multimodal event recognition
  • Approach: objective measures and statistical methods
  • Multimodal events: events that involve one or more
    communicative modalities and represent the behaviour
    of a single participant or of the whole group
3
Multimodal Recognition
[Block diagram. Labels: Meeting Room (Audio, Video); Signal Pre-processing;
 Feature Extraction; Specialised Recognition Systems (Speech, Video, Gestures);
 Models; Knowledge Database; Information Retrieval; Multimodal Events Recognition]
4
Group Actions
  • The machine observes group behaviours through
    objective measures (external observer)
  • Results of this analysis are structured into a
    sequence of symbols (coding system) that is
      • Exhaustive (covering the entire meeting duration)
      • Mutually exclusive (non-overlapping symbols)
  • We used the coding system adopted by the IDIAP
    framework, composed of 5 meeting actions
    derived from different communicative modalities:
      • Monologue / Dialogue / Note taking / Presentation
        / Presentation at the whiteboard
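The two properties of the coding system (exhaustive and mutually exclusive) can be checked mechanically. The sketch below is only an illustration: the `(start, end, action)` tuple layout and the function name are hypothetical and do not come from the IDIAP annotation format.

```python
# The five meeting actions listed on the slide.
ACTIONS = {"Monologue", "Dialogue", "Note taking", "Presentation",
           "Presentation at the whiteboard"}

def is_valid_coding(segments, duration):
    """Return True if the segments tile [0, duration) with no gap or overlap.

    `segments` is a hypothetical list of (start, end, action) tuples.
    """
    if not segments:
        return False
    ordered = sorted(segments, key=lambda seg: seg[0])
    # Exhaustive at the edges: first segment starts at 0, last ends at duration.
    if ordered[0][0] != 0 or ordered[-1][1] != duration:
        return False
    # Consecutive segments must touch exactly: a gap breaks exhaustiveness,
    # an overlap breaks mutual exclusivity.
    for previous, current in zip(ordered, ordered[1:]):
        if previous[1] != current[0]:
            return False
    # Every segment must be non-empty and carry one of the five actions.
    return all(start < end and action in ACTIONS
               for start, end, action in ordered)
```

For example, `is_valid_coding([(0, 60, "Monologue"), (60, 300, "Dialogue")], 300)` passes, while the same annotation with the second segment starting at 50 (overlap) or 70 (gap) is rejected.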

5
Corpus
  • 60 meetings (2 sets of 30) collected in the IDIAP
    Smart Meeting Room
      • 30 meetings are used for training
      • 23 meetings are used for testing
      • 7 meetings will be used for validating the results
  • 4 participants per meeting
  • 5 hours of multi-channel audio-visual recordings
      • 3 fixed cameras
      • 4 lapel microphones + 8-element circular
        microphone array
  • Meeting agendas are generated a priori and
    strictly followed, in order to have an average of
    5 meeting actions per meeting
  • Available for public distribution:
    http://mmm.idiap.ch/
6
Features (1)
Only features derived from audio are currently
used:
  • Speaker Turns (Mic. Array, Beam-forming, Dimension reduction)
  • Prosody and Acoustic (Lapel Mic.): Pitch baseline,
    Energy, Rate Of Speech, ...
7
Features (2)
  • Speaker Turns

Speaker Turns Features: $L_i(t)\,L_j(t-1)\,L_k(t-2)$, for indices i, j, k
$L_i(t)$: location-based speech activities (SRP-PHAT
beamforming), kindly provided by IDIAP
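A minimal sketch of how such speaker-turn features could be computed, assuming the expression on the slide is a product of speech activities over three consecutive frames, evaluated for every index triple (i, j, k). The input layout `activity[t][i]` and the function name are hypothetical.

```python
from itertools import product

def speaker_turn_features(activity):
    """Build one feature vector per frame t >= 2.

    activity[t][i] is the speech activity at location i in frame t
    (hypothetical layout). Each vector holds the products
    activity[t][i] * activity[t-1][j] * activity[t-2][k] for all (i, j, k).
    """
    n = len(activity[0]) if activity else 0
    features = []
    for t in range(2, len(activity)):
        now, prev1, prev2 = activity[t], activity[t - 1], activity[t - 2]
        features.append([now[i] * prev1[j] * prev2[k]
                         for i, j, k in product(range(n), repeat=3)])
    return features
```

With N locations each frame yields N^3 values, which suggests why a dimension-reduction step is useful before modelling.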
8
Features (3)
Prosodic features (block diagram labels):
  • Lapel Mic.: RMS Energy; Pitch (Pitch extractor, Filters (*));
    Rate Of Speech (MRATE)
  • Mic. Array: Beam-forming, Speech activity
  • Mask Features using Speech activity
(*) Histogram, median and interpolating filter
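Two of the steps named on this slide, median filtering and masking with speech activity, can be sketched as follows. The window width, the fill value for masked frames and the function names are assumptions chosen for illustration; the slide does not give these details.

```python
from statistics import median

def median_filter(values, width=3):
    """Sliding-window median; the window is truncated at the sequence edges."""
    half = width // 2
    return [median(values[max(0, t - half): t + half + 1])
            for t in range(len(values))]

def mask_features(features, speech_activity, fill=0.0):
    """Keep a feature value only in frames where speech activity is on."""
    return [value if active else fill
            for value, active in zip(features, speech_activity)]
```

A median filter removes isolated outliers (for instance octave errors in a pitch track) without smearing them into neighbouring frames, and masking prevents prosodic values measured during silence from reaching the model.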
9
Features (4)
We'd like to integrate other features:
  • Video (Image Processing): Participants Motion features,
    Other blob positions, Gestures and Actions
  • Audio (ASR): Transcripts
  • Other: everything that could be automatically extracted
    from a recorded meeting
10
Dynamic Bayesian Networks (1)
  • Bayesian Networks are a convenient graphical way
    to describe statistical (in)dependencies among
    random variables
  • Directed Acyclic Graph + Conditional Probability Tables (CPTs)
  • Given a set of examples, EM learning
    algorithms (e.g. Baum-Welch) can be used to
    train the CPTs
  • Given a set of known evidence nodes,
    the probability of the other nodes can be computed
    through inference

[Figure: example Bayesian network with nodes A, F, C, S, L, O]
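To make "inference given evidence" concrete, here is a minimal sketch on a hypothetical two-node network H -> E. The node names and all probabilities are made up for illustration and are unrelated to the network drawn on the slide.

```python
# Hypothetical CPTs for a two-node network H -> E (values are invented).
P_H = {True: 0.3, False: 0.7}
P_E_GIVEN_H = {True: {True: 0.9, False: 0.1},
               False: {True: 0.2, False: 0.8}}

def posterior_h(evidence):
    """Return P(H | E = evidence) by enumerating the joint P(H) * P(E | H)."""
    joint = {h: P_H[h] * P_E_GIVEN_H[h][evidence] for h in (True, False)}
    total = sum(joint.values())  # this is P(E = evidence)
    return {h: p / total for h, p in joint.items()}
```

Enumeration like this only scales to tiny networks; larger models need structured inference algorithms, but the quantity being computed is the same.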
11
Dynamic Bayesian Networks (2)
  • DBNs are an extension of BNs with random variables
    that evolve in time
  • A static BN is instantiated for each temporal slice t
  • Temporal dependencies between variables are made
    explicit

[Figure: the static BN (nodes C, S, L, O) replicated for time slices t = 0, t = 1, ..., t = T]
12
Dynamic Bayesian Networks (3)
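  • Hidden Markov Models, Kalman Filter Models and
    other state-space models are just special cases
    of DBNs

[Figure: representation of an HMM as an instance of a DBN. Hidden states
 $Q_0, \dots, Q_t, Q_{t+1}$ and observations $Y_0, \dots, Y_t, Y_{t+1}$ at
 times t = 0, ..., t, t+1; labels $\pi$, A and B]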
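Written out, the DBN view of an HMM corresponds to the factorisation below. This is the textbook form rather than something stated on the slide, and it assumes the figure's labels denote the initial-state distribution $\pi$, the transition probabilities A and the emission model B.

```latex
P(Q_{0:T}, Y_{0:T}) =
  \underbrace{P(Q_0)}_{\pi}
  \prod_{t=1}^{T} \underbrace{P(Q_t \mid Q_{t-1})}_{A}
  \prod_{t=0}^{T} \underbrace{P(Y_t \mid Q_t)}_{B}
```

Each factor is one edge type of the unrolled graph, repeated in every time slice.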
13
Dynamic Bayesian Networks (4)
  • Representing HMMs in terms of DBNs makes it easy to
    create variations on the basic theme

[Figure: two example variations drawn as DBNs over time slices 0, t, t+1:
 Factorial HMMs and Coupled HMMs (node labels: X, Z, V, Q, Y)]
14
Dynamic Bayesian Networks (5)
  • The use of DBNs and BNs presents some advantages:
      • Intuitive way to represent models graphically,
        with a standard notation
      • Unified theory for a huge number of models
          • Connecting different models in a structured view
          • Making it easier to study new models
      • Unified set of instruments (e.g. GMTK) to work
        with them (training, inference, decoding)
          • Maximizes resource reuse
          • Minimizes setup time

15
First Model (1)
  • Early integration of features and modelling
    through a 2-level Hidden Markov Model

[Figure: DBN over time slices 0, ..., t, t+1, ..., T with three layers:
 Hidden Meeting Actions $A_t$; Hidden Sub-states $S_t$;
 Observable Features Vector $Y_t$]
16
First Model (2)
  • The main idea behind this model is to decompose
    each meeting action into a sequence of sub-actions
    or sub-states
  • (Note that different actions are free to share
    the same sub-state)
  • The structure is composed of two ergodic HMM
    chains
      • The top chain links sub-states $S_t$ with
        actions $A_t$
      • The lower one maps the feature vectors
        $Y_t$ directly into a sub-state $S_t$

[Figure: layers $A_t$, $S_t$, $Y_t$ for time slices 0, ..., t]
17
First Model (3)
  • The sequence of actions $A_t$ is known a priori
  • The sequence $S_t$ is determined during the
    training process, and the meaning of each sub-state
    is unknown
  • The cardinality of $S_t$ is one of the model's
    parameters
  • The mapping of observable features $Y_t$ into
    hidden sub-states $S_t$ is obtained through
    Gaussian Mixture Models

[Figure: layers $A_t$, $S_t$, $Y_t$ for time slices 0, ..., t]
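One way to read the two-level structure is the factorisation below. It is only a sketch: the exact set of edges is an assumption (an action chain, sub-states that depend on the previous sub-state and the current action, and features that depend on the sub-state only), because the transcript preserves the layers of the figure but not its arrows.

```latex
P(A_{0:T}, S_{0:T}, Y_{0:T}) =
  P(A_0)\, P(S_0 \mid A_0)\, P(Y_0 \mid S_0)
  \prod_{t=1}^{T} P(A_t \mid A_{t-1})\, P(S_t \mid S_{t-1}, A_t)\, P(Y_t \mid S_t)
```

Under this reading, $P(Y_t \mid S_t)$ is the Gaussian Mixture Model mentioned on the slide, and the number of values $S_t$ can take is the cardinality parameter.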
18
Second Model (1)
  • Multistream processing of features through two
    parallel and independent Hidden Markov Models

[Figure: DBN over time slices 0, ..., t, t+1, ..., T with layers:
 Action Counter $C_t$; Enable Transitions $E_t$; Meeting Actions $A_t$;
 Hidden Sub-states $S_t^1$ and $S_t^2$; observed Prosodic Features and
 Speaker Turns Features ($Y_t^1$, $Y_t^2$)]
19
Second Model (2)
  • Each feature group (or modality) $Y^m$ is mapped
    into an independent HMM chain; every group is
    therefore evaluated independently and mapped into
    a hidden sub-state $S_t^m$
  • As in the previous model, there is another HMM
    layer (A), which represents meeting actions
  • The whole sub-state $S_t^1 \times S_t^2 \times \dots \times S_t^n$ is
    mapped into an action $A_t$

[Figure: layers $A_t$, $S_t^1$, $S_t^2$, $Y_t^1$, $Y_t^2$ for time slices 0, ..., t]
20
Second Model (3)
  • It is a variable-duration HMM with an explicit
    enable node
  • $A_t$ represents meeting actions as usual
  • $C_t$ counts meeting actions
  • $E_t$ is a binary indicator variable that enables
    state changes inside the node $A_t$

$C_t$  1 1 2 2 2
$E_t$  0 1 0 0 0
$A_t$  8 8 5 5 5

[Figure: DBN fragment over time slices 0, t, t+1 with nodes $C_t$, $E_t$ and $A_t$]
21
Second Model (4)
  • Training: when $A_t$ changes, $C_t$ is incremented
    and $E_t$ is set on for a single frame ($A_t$, $E_t$ and
    $C_t$ are part of the training dataset)
  • The behaviours of $E_t$ and $C_t$ learned during the
    training phase are then exploited during
    decoding

$C_t$  1 1 2 2 2
$E_t$  0 1 0 0 0
$A_t$  8 8 5 5 5

  • Decoding: $A_t$ is free to change only if $E_t$
    is high, and then according to the state of $C_t$
22
Results
  • Results obtained with the two models described
    previously, using only audio-derived features
    (all figures in %)

               Corr.  Sub.  Del.  Ins.  AER
First Model    93.2   2.3   4.5   4.5   11.4
Second Model   94.7   1.5   3.8   0.8    6.1

  • The second model effectively reduces both the
    number of Substitutions and the number of
    Insertions
  • AER is equivalent to the Word Error Rate measure used
    to evaluate speech recogniser performance
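Assuming AER is computed like the Word Error Rate, that is (Sub + Del + Ins) divided by the number of reference actions, it can be obtained from a minimum edit-distance alignment between the reference and the recognised action sequences. The function below is an illustrative sketch, not the scoring tool used for the table. The table itself is consistent with this definition: 1.5 + 3.8 + 0.8 = 6.1 for the second model, and 2.3 + 4.5 + 4.5 = 11.3 for the first, which matches the reported 11.4 up to rounding.

```python
def action_error_rate(reference, hypothesis):
    """Return (subs, dels, ins, aer_percent) from a minimum edit-distance alignment."""
    n, m = len(reference), len(hypothesis)
    if n == 0:
        raise ValueError("reference must not be empty")
    # cost[i][j] = (errors, subs, dels, ins) aligning reference[:i] with hypothesis[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)   # only deletions
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)   # only insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e, s, d, k = cost[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                best = (e, s, d, k)            # match
            else:
                best = (e + 1, s + 1, d, k)    # substitution
            e, s, d, k = cost[i - 1][j]
            best = min(best, (e + 1, s, d + 1, k))   # deletion
            e, s, d, k = cost[i][j - 1]
            best = min(best, (e + 1, s, d, k + 1))   # insertion
            cost[i][j] = best
    _, subs, dels, ins = cost[n][m]
    return subs, dels, ins, 100.0 * (subs + dels + ins) / n
```

Because insertions count as errors, the rate can exceed 100 % when the recogniser outputs many spurious actions.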
23
Conclusions
  • A new approach has been proposed
  • The results achieved seem promising, and in the
    future we'd like to
      • Validate them with the remaining part of the
        test set (or possibly an independent test set)
      • Integrate other features
          • video, ASR transcripts, Xtalk, ...
      • Try new experiments with existing models
      • Develop new DBN-based models

25
Multimodal Recognition (2)
Knowledge sources
  • Raw Audio
  • Raw Video
  • Acoustic Features
  • Visual Features
  • Automatic Speech Recognition
  • Video Understanding
  • Gesture Recognition
  • Eye Gaze Tracking
  • Emotion Detection
  • ...

Approaches
  • A standalone high-level recogniser operating on
    low-level raw data
  • Fusion of different recognisers at an early
    stage, generating hybrid recognisers (like AVSR)
  • Integration of the recognisers' outputs through a
    high-level recogniser