1
An Examination of Audio-Visual Fused HMMs for
Speaker Recognition
  • David Dean, Tim Wark and Sridha Sridharan
  • Presented by David Dean

2
Why audio-visual speaker recognition?
  • "Bimodal recognition exploits the synergy between
    acoustic speech and visual speech, particularly
    under adverse conditions. It is motivated by the
    need, in many potential applications of
    speech-based recognition, for robustness to speech
    variability, high recognition accuracy, and
    protection against impersonation."
  • (Chibelushi, Deravi and Mason, 2002)

3
Early and late fusion
  • Most early approaches to audio-visual speaker
    recognition (AVSPR) used either early or late
    fusion (feature or decision)
  • Problems:
  • Decision fusion cannot model temporal
    dependencies
  • Feature fusion suffers from problems with noise,
    and has difficulty modelling the asynchrony of
    audio-visual speech (Chibelushi et al., 2002)

[Diagrams: early (feature-level) fusion and late (decision-level) fusion architectures]
4
Middle fusion - coupled HMMs
  • Middle-fusion models accept two streams of input,
    and the combination is performed within the
    classifier
  • Most middle fusion is performed using coupled HMMs
  • Can be difficult to train
  • Dependencies between hidden states are not strong
    (Brand, 1999)

5
Middle fusion - fused HMMs
  • Pan et al. (2004) used probabilistic models to
    investigate the optimal multi-stream HMM design
  • The design maximises the mutual information
    between audio and video
  • They found that linking the observations of one
    modality to the hidden states of the other was
    better than linking just the hidden states (i.e.
    the coupled HMM)
  • The fused HMM approach results in two designs:
    acoustic-biased and video-biased

[Figure: structure of the acoustic-biased FHMM]
6
Choosing the dominant modality
  • The choice of the dominant modality (the one
    biased towards) should be based upon which
    individual HMM can more reliably estimate the
    hidden state sequence for a particular
    application
  • Generally audio
  • Alternatively, both versions can be used
    concurrently and decision fused (as in Pan et
    al., 2004)
  • This research looks at the relative performance
    of each biased FHMM design individually
  • If recognition can be performed using only one
    FHMM, decoding can be done in half the time
    compared to decision fusion of both FHMMs

7
Training FHMMs
  • Both biased FHMMs (if both are needed) are
    trained independently
  • Train the dominant (audio for acoustic-biased,
    video for video-biased) HMM independently upon
    the training observation sequences for that
    modality
  • The best hidden state sequence of the trained HMM
    is found for each training observation using the
    Viterbi process
  • Calculate the coupling parameters between the
    dominant hidden state sequence and the training
    observation sequences for the subordinate
    modality
  • i.e. estimate the probability of observing each
    subordinate observation while within a particular
    dominant hidden state (a sketch of this counting
    step follows below)
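
A minimal Python sketch of this coupling-parameter estimation, assuming an
hmmlearn-style dominant HMM with a decode() method and subordinate observations
already quantised to codebook indices; the function name, arguments, and the
Laplace smoothing are illustrative assumptions, not the authors' HTK-based code.

import numpy as np

def estimate_coupling(dominant_hmm, dominant_seqs, subordinate_vq_seqs,
                      n_states, codebook_size, smoothing=1.0):
    """Estimate P(subordinate VQ symbol | dominant hidden state) by counting."""
    # Laplace smoothing keeps unseen (state, symbol) pairs at non-zero probability
    counts = np.full((n_states, codebook_size), smoothing)
    for dom_obs, sub_vq in zip(dominant_seqs, subordinate_vq_seqs):
        # Viterbi alignment of the dominant stream (audio for the acoustic-biased FHMM)
        _, state_path = dominant_hmm.decode(dom_obs, algorithm="viterbi")
        # Count which subordinate VQ symbols co-occur with each dominant state
        for state, symbol in zip(state_path, sub_vq):
            counts[state, symbol] += 1
    # Normalise rows so each state's symbol distribution sums to one
    return counts / counts.sum(axis=1, keepdims=True)

Each row of the returned table is the estimated distribution over subordinate
VQ symbols for one dominant hidden state.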

8
Decoding FHMMs
  • The dominant FHMM can be viewed as a special type
    of HMM that outputs observations in two streams
  • This does not affect the decoding lattice, and
    the Viterbi algorithm can be used to decode
  • Provided that it has access to the observations in
    both streams (a decoding sketch follows below)
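
A minimal two-stream Viterbi sketch under the same assumptions: the dominant
HMM's per-frame emission log-likelihoods and the coupling table from training
are precomputed, and sub_vq is an integer array of subordinate VQ symbols, one
per frame. The names and array layout are illustrative, not the decoder
actually used in the experiments.

import numpy as np

def fhmm_viterbi(log_start, log_trans, log_emission, log_coupling, sub_vq):
    """Viterbi over the dominant HMM, with each frame scored by both streams."""
    T, N = log_emission.shape
    # Joint per-frame score: dominant-stream emission plus subordinate coupling term
    joint = log_emission + log_coupling[:, sub_vq].T          # shape (T, N)
    delta = log_start + joint[0]
    psi = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans                   # predecessor scores
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + joint[t]
    # Backtrack the best state path; delta.max() is the utterance log-score
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return delta.max(), path[::-1]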

9
Experimental setup
[Block diagram: three systems are compared.
  1. HMM decision fusion: acoustic feature extraction -> acoustic HMM; lip
     location and tracking -> visual feature extraction -> visual HMM; the two
     scores are decision fused into a speaker decision.
  2. Acoustic-biased FHMM: both feature streams -> acoustic-biased FHMM ->
     speaker decision.
  3. Video-biased FHMM: both feature streams -> video-biased FHMM -> speaker
     decision.]
10
Lip location and tracking
  • Lip tracking was performed as in Dean et al. (2005).

11
Feature extraction and datasets
  • Audio
  • 12 MFCCs + 1 energy coefficient, with deltas and
    accelerations (43 features)
  • Video
  • 20 DCT coefficients, with deltas and accelerations
    (60 features; a feature-stacking sketch follows at
    the end of this slide)
  • Isolated speech from CUAVE (Patterson et al.,
    2002)
  • 4 sequences for training, 1 for testing (for each
    of 36 speakers)
  • Each sequence consists of the digits zero through
    nine
  • Testing was also performed on noisy data
  • Speech-babble corrupted audio versions
  • Poorly-tracked lip region-of-interest video
    features

[Example lip regions of interest: well tracked and poorly tracked]
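
A minimal feature-stacking sketch in Python, using librosa MFCCs and a 2-D DCT
of the lip image as stand-ins for the front end described above; the slide's
exact 43- and 60-dimensional vectors and coefficient selection are not
reproduced here.

import numpy as np
import librosa
from scipy.fftpack import dctn

def audio_features(signal, sr):
    # 13 cepstral coefficients (12 MFCCs plus an energy-like C0),
    # then deltas and accelerations
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),
                       librosa.feature.delta(mfcc, order=2)])
    return feats.T                                   # frames x features

def video_features(lip_roi_frames, n_coeffs=20):
    # Keep the first n_coeffs 2-D DCT coefficients of each lip image
    # (proper zig-zag coefficient selection is omitted for brevity)
    statics = np.array([dctn(frame, norm="ortho").ravel()[:n_coeffs]
                        for frame in lip_roi_frames])
    deltas = np.gradient(statics, axis=0)
    accels = np.gradient(deltas, axis=0)
    return np.hstack([statics, deltas, accels])      # frames x 60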
12
Fused HMM design
  • Both acoustic- and visual-biased FHMMs are
    examined
  • Underlying HMMs are speaker-dependent word-models
    for each digit
  • MLLR adapted from speaker-independent background
    word-models
  • Trained using the HTK Toolkit (Young et al., 2002)
  • Secondary models are based on discrete
    vector-quantisation (VQ) codebooks
  • The codebook is generated from the secondary
    (subordinate-stream) training data
  • The number of occurrences of each discrete VQ
    value within each dominant state was recorded to
    arrive at an estimate of the probability of each
    VQ symbol given that state
  • A codebook size of 100 was found to work best for
    both modalities (a codebook sketch follows below)
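
A minimal sketch of the discrete VQ codebook step, using scikit-learn's KMeans
as a stand-in for the quantiser actually used; only the codebook size of 100
comes from the slide, everything else is an assumption.

import numpy as np
from sklearn.cluster import KMeans

def build_codebook(secondary_features, codebook_size=100, seed=0):
    """Cluster subordinate-stream training frames into a VQ codebook."""
    frames = np.vstack(secondary_features)   # stack frames from all training sequences
    return KMeans(n_clusters=codebook_size, n_init=10, random_state=seed).fit(frames)

def quantise(codebook, observation_seq):
    """Map each subordinate frame to its nearest codeword index."""
    return codebook.predict(observation_seq)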

13
Decision Fusion
  • Fused HMM performance is compared to decision
    fusion of the single-modality HMMs in each stream
  • The weight of each stream is based upon an audio
    weight parameter a, which can range from
  • 0 (video only), to
  • 1 (audio only)
  • Two decision fusion configurations were used (a
    weighting sketch follows below)
  • a = 0.5
  • Simulated adaptive fusion: the best a for each
    noise level
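
A minimal sketch of the weighted decision fusion baseline, assuming the fused
score is a linear combination of the per-utterance log-likelihoods from the
audio and video HMMs (the exact combination rule is not given on the slide).

def fuse_scores(audio_loglik, video_loglik, a=0.5):
    """Linear score fusion: a = 0 is video only, a = 1 is audio only."""
    return a * audio_loglik + (1.0 - a) * video_loglik

def identify_speaker(audio_scores, video_scores, a=0.5):
    """Pick the speaker whose models give the highest fused score."""
    fused = {spk: fuse_scores(audio_scores[spk], video_scores[spk], a)
             for spk in audio_scores}
    return max(fused, key=fused.get)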

14
Speaker recognition - well-tracked video
  • [Plot: Tier-1 recognition rate across audio noise
    levels]
  • The video HMM, video-biased FHMM, and decision
    fusion all perform at 100%
  • The audio-biased FHMM performs much better than
    the audio HMM alone, but not as well as video at
    low noise levels

15
Speaker recognition - poorly-tracked video
  • Video performance is degraded through poor
    tracking
  • The video-biased FHMM shows no real improvement
    over the video HMM
  • The audio-biased FHMM is better than all other
    configurations at most audio-noise levels
  • Even better than simulated adaptive fusion

16
Video vs. Audio-Biased FHMM
  • Adding video to audio HMMs to create an
    acoustic-biased FHMM provides a clear improvement
    over the HMM alone
  • However, adding audio to video HMMs provides
    negligible improvement
  • The video HMM provides poor state alignment

17
Acoustic-biased FHMM vs. Decision Fusion
  • FHMMs can take advantage of the relationship
    between modalities on a frame-by-frame basis
  • Decision fusion can only compare two scores over
    an entire utterance
  • The FHMM even works better than simulated adaptive
    fusion at most noise levels
  • Actual adaptive fusion would require estimation
    of noise levels
  • The FHMM runs with no knowledge of the noise

18
Conclusion
  • Acoustic-biased FHMMs provide a clear improvement
    over acoustic HMMs
  • Video-biased FHMMs do not improve upon video HMMs
  • Video HMMs are unreliable at estimating state
    sequences
  • The acoustic-biased FHMM performs better than
    simulated adaptive decision fusion at most noise
    levels
  • With around half the decoding processing cost
    (and more saved when the cost of real adaptive
    fusion is included)

19
Future/Continuing Work
  • As the CUAVE database is quite small for speaker
    recognition experiments (only 36 subjects),
    research has continued on the XM2VTS database
    (Messer et al., 1999), which has 295 subjects
  • Continuous GMM models replaced the VQ secondary
    models
  • The video DCT VQ couldn't handle session
    variability
  • Verification (rather than identification) allows
    system performance to be examined more easily
  • The system is still undergoing development

20
References
  • M. Brand, A bayesian computer vision system for
    modeling human interactions, in ICVS99, Gran
    Canaria, Spain, 1999.
  • C. Chibelushi, F. Deravi, and J. Mason, A review
    of speech-based bimodal recognition, Multimedia,
    IEEE Transactions on, vol. 4, no. 1, pp. 2337,
    2002.
  • D. Dean, P. Lucey, S. Sridharan, and T. Wark,
    Comparing audio and visual information for
    speech processing, in ISSPA 2005, Sydney,
    Australia, 2005, pp. 5861.
  • K. Messer, J. Matas, J. Kittler, J. Luettin, and
    G. Maitre, Xm2vtsdb The extended m2vts
    database, in Audio and Video-based Biometric
    Person Authentication (AVBPA 99), Second
    International Conference on, Washington D.C.,
    1999, pp. 7277.
  • H. Pan, S. Levinson, T. Huang, and Z.-P. Liang,
    A fused hidden markov model with application to
    bimodal speech processing, IEEE Transactions on
    Signal Processing, vol. 52, no. 3, pp. 573581,
    2004.
  • E. Patterson, S. Gurbuz, Z. Tufekci, and J.
    Gowdy, Cuave a new audio-visual database for
    multimodal human-computer interface research, in
    Acoustics, Speech, and Signal Processing, 2002.
    Proceedings. (ICASSP 02). IEEE International
    Conference on, vol. 2, 2002, pp. 20172020.
  • S. Young, G. Evermann, D. Kershaw, G. Moore, J.
    Odell, D. Ollason, D. Povey, V. Valtchev, and P.
    Woodland, The HTK Book, 3rd ed. Cambridge, UK
    Cambridge University Engineering Department.,
    2002.

21
Questions?