A Baseline System for Speaker Recognition

1
A Baseline System for Speaker Recognition
  • C. Mokbel, H. Greige, R. Zantout, H. Abi Akl
  • A. Ghaoui, J. Chalhoub, R. Bayeh
  • University of Balamand - ELISA

2
Outline
  • Introduction
  • Baseline speaker recognition system
  • NIST 2002 evaluation
  • Conclusion and perspective

3
Introduction
  • A baseline system was built and used in
    the NIST 2002 speaker recognition evaluation
  • GMM based system
  • Normalization using z-norm
  • Adaptation technique used to estimate speaker
    model starting from world model

4
Baseline Speaker Recognition System
  • Feature extraction
  • Speech-recognition-based feature vectors
  • 13 MFCC coefficients, including the energy on
    a logarithmic scale
  • First- and second-order derivatives
  • Leading to 39 feature parameters
  • Preprocessing using cepstral mean normalization
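
The feature pipeline above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the 13 static coefficients are already computed, and the regression window (`width=2`) and the function names `delta` and `build_feature_vectors` are hypothetical choices. The slides do not say whether the mean is removed before or after the derivatives are taken; because derivatives are unaffected by a constant offset, the order does not change the derivative columns.

```python
import numpy as np

def delta(features, width=2):
    """Regression-based time derivative; features has shape (frames, dims)."""
    n_frames = features.shape[0]
    # Repeat the first and last frames so every frame has `width` neighbours.
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    denom = 2.0 * sum(k * k for k in range(1, width + 1))
    out = np.zeros_like(features, dtype=float)
    for k in range(1, width + 1):
        ahead = padded[width + k : width + k + n_frames]
        behind = padded[width - k : width - k + n_frames]
        out += k * (ahead - behind)
    return out / denom

def build_feature_vectors(static):
    """static: (frames, 13) MFCCs incl. log energy -> (frames, 39) vectors."""
    static = static - static.mean(axis=0)  # cepstral mean normalization
    d1 = delta(static)                     # first-order derivatives
    d2 = delta(d1)                         # second-order derivatives
    return np.hstack([static, d1, d2])
```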

5
Baseline Speaker Recognition System
  • GMM modeling for both hypotheses: speaker and
    non-speaker (world)
  • EM algorithm to train the world model
    (Baum-Welch)
  • Initialization using LBG VQ
  • Speaker model: mean vectors adapted from the
    world model
  • Approximation of the unified adaptation
    approach ("Online Adaptation of HMMs to
    Real-Life Conditions: A Unified Framework", IEEE
    Trans. on SAP, Vol. 9, No. 4, May 2001)

6
Baseline Speaker Recognition System
  • Speaker Adaptation
  • World model Gaussian distributions grouped in a
    binary tree
  • Speaker data driven determination of the Gaussian
    classes
  • MLLR applied based on these classes; only the
    means of the Gaussian distributions are adapted
  • MAP applied to the leaf Gaussian distributions

7
Baseline Speaker Recognition System
  • Building the Gaussian tree bottom up
  • Grouping the closest Gaussian distributions
    two by two
  • The distance between two Gaussian distributions is
    the loss in likelihood of the associated data
    when the two Gaussians are merged into
    a single Gaussian
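
One way to make this distance concrete is sketched below. It rests on assumptions the slides do not state: diagonal covariances, each Gaussian being the maximum-likelihood fit to its own data, and a moment-matched merge. Under those assumptions the likelihood loss reduces to a difference of log-variance terms. Each Gaussian is a hypothetical `(count, mean, variance)` tuple, and the tree is returned as nested tuples of leaf indices.

```python
import numpy as np

def merge(g1, g2):
    """Moment-match two (count, mean, variance) Gaussians into one."""
    n1, m1, v1 = g1
    n2, m2, v2 = g2
    n = n1 + n2
    mean = (n1 * m1 + n2 * m2) / n
    var = (n1 * (v1 + m1 ** 2) + n2 * (v2 + m2 ** 2)) / n - mean ** 2
    return n, mean, var

def merge_loss(g1, g2):
    """Log-likelihood lost on the associated data when g1 and g2 are merged."""
    n, _, var = merge(g1, g2)
    return 0.5 * (n * np.log(var).sum()
                  - g1[0] * np.log(g1[2]).sum()
                  - g2[0] * np.log(g2[2]).sum())

def build_tree(gaussians):
    """Greedy bottom-up pairing; leaves are indices into `gaussians`."""
    nodes = [(i, g) for i, g in enumerate(gaussians)]
    while len(nodes) > 1:
        pairs = [(a, b) for a in range(len(nodes))
                 for b in range(a + 1, len(nodes))]
        # Pick the pair whose merge loses the least likelihood.
        a, b = min(pairs, key=lambda p: merge_loss(nodes[p[0]][1], nodes[p[1]][1]))
        merged = ((nodes[a][0], nodes[b][0]), merge(nodes[a][1], nodes[b][1]))
        nodes = [nd for i, nd in enumerate(nodes) if i not in (a, b)] + [merged]
    return nodes[0][0]
```

The exhaustive pair search is quadratic per merge, which is acceptable for a sketch with a few hundred Gaussians but would normally be cached.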

8
Baseline Speaker Recognition System
  • After the E-step of the EM algorithm, the weights
    associated with the leaves of the tree are
    propagated up through the tree to the root
  • Going from the root to the leaves, a node is
    selected whenever one of its two children has a
    weight below a threshold
  • This defines a partition that will be used in an
    MLLR algorithm
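
The top-down selection can be sketched as below. This is an interpretation of the slide, not the authors' code: the tree is assumed to be nested tuples with integer leaf indices, a leaf reached during the descent forms its own class, and the names `leaves`, `node_weight` and `select_classes` are hypothetical.

```python
def leaves(node):
    """All leaf indices under a node of a nested-tuple binary tree."""
    if isinstance(node, int):
        return [node]
    return leaves(node[0]) + leaves(node[1])

def node_weight(node, leaf_weights):
    """Weight propagated up to a node: the sum of its leaf weights."""
    return sum(leaf_weights[i] for i in leaves(node))

def select_classes(node, leaf_weights, threshold):
    """Partition the leaves into regression classes, descending from the root."""
    if isinstance(node, int):
        return [[node]]
    left, right = node
    weakest = min(node_weight(left, leaf_weights), node_weight(right, leaf_weights))
    if weakest < threshold:
        # A child has too little data, so stop here and share one transform.
        return [leaves(node)]
    return (select_classes(left, leaf_weights, threshold)
            + select_classes(right, leaf_weights, threshold))
```

For example, with the tree `((0, 1), (2, 3))`, leaf weights `[5.0, 1.0, 4.0, 4.0]` and a threshold of 3.0, leaves 0 and 1 share one class while leaves 2 and 3 each get their own.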

9
Baseline Speaker Recognition System
  • MAP algorithm
  • The estimated Gaussian mean parameters at the leaves
    are smoothed, using a fixed weight, with the
    parameters of the world Gaussians

10
Baseline Speaker Recognition System
  • Given a target speaker model $\lambda_s$, the world model
    $\lambda_w$ and a test utterance $X$, the score for this
    utterance is computed as the log likelihood
    ratio
  • $s = \log \frac{p(X \mid \lambda_s)}{p(X \mid \lambda_w)}$
  • This score should be normalized because
    the world model is not precise
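
The score can be computed as in the sketch below, which assumes diagonal-covariance GMMs and frames treated as independent (neither is stated on the slide). A model is represented as a hypothetical `(weights, means, variances)` tuple of arrays with shapes `(M,)`, `(M, D)` and `(M, D)`.

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """Total log p(X | model) for a diagonal GMM; X has shape (frames, D)."""
    diff = X[:, None, :] - means[None, :, :]
    log_comp = -0.5 * (np.log(2.0 * np.pi * variances).sum(axis=1)
                       + (diff ** 2 / variances).sum(axis=2))
    log_comp += np.log(weights)
    # Log-sum-exp over components, computed stably, then summed over frames.
    peak = log_comp.max(axis=1, keepdims=True)
    per_frame = peak[:, 0] + np.log(np.exp(log_comp - peak).sum(axis=1))
    return float(per_frame.sum())

def llr_score(X, speaker, world):
    """Log likelihood ratio between the speaker model and the world model."""
    return gmm_log_likelihood(X, *speaker) - gmm_log_likelihood(X, *world)
```

In practice the ratio is often divided by the number of frames so that utterances of different lengths are comparable; the slide does not say whether the system does this.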

11
Baseline Speaker Recognition System
  • Normalization using the z-norm
  • A few impostor utterances are used
  • A score is computed for every utterance
  • The different scores define a distribution per
    target speaker
  • Target speakers' distributions should be similar
    for a decision using a single threshold
  • Center and scale the distribution (zero mean, unit variance)
  • $n_s = a\,s + b$
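
Centering and scaling with the impostor-score mean $\mu$ and standard deviation $\sigma$ gives $a = 1/\sigma$ and $b = -\mu/\sigma$ in the affine form above. A minimal sketch, with hypothetical function names, where the impostor scores are those obtained against one target speaker's model:

```python
import numpy as np

def znorm_params(impostor_scores):
    """Affine z-norm coefficients (a, b) for one target speaker."""
    mu = np.mean(impostor_scores)
    sigma = np.std(impostor_scores)
    return 1.0 / sigma, -mu / sigma

def znorm(score, a, b):
    """Normalized score n_s = a * s + b."""
    return a * score + b
```

After normalization the impostor scores of every target speaker have zero mean and unit variance, which is what allows a single global threshold.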

12
Baseline Speaker Recognition System
  • Based on the data from the 2001 evaluation a DET
    curve can be plotted
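  • Find the optimal decision threshold that minimizes
    the cost defined by NIST 2002, i.e.
  • $C_{det} = C_{miss} \cdot \Pr(\text{miss} \mid \text{target}) \cdot \Pr(\text{target}) + C_{FalseAlarm} \cdot \Pr(\text{FalseAlarm} \mid \text{NonTarget}) \cdot (1 - \Pr(\text{target}))$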
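
The threshold search can be sketched as a sweep over candidate thresholds on development scores. The slide does not give the cost parameters; the defaults below (miss cost 10, false-alarm cost 1, target prior 0.01) are the values usually quoted for NIST evaluations of that period and should be read as assumptions, as should the function names.

```python
import numpy as np

def detection_cost(target_scores, nontarget_scores, threshold,
                   c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Detection cost at a given threshold (accept when score >= threshold)."""
    p_miss = np.mean(np.asarray(target_scores) < threshold)
    p_fa = np.mean(np.asarray(nontarget_scores) >= threshold)
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

def best_threshold(target_scores, nontarget_scores, **cost):
    """Sweep every observed score (plus 'reject all') and keep the cheapest."""
    candidates = np.concatenate([target_scores, nontarget_scores, [np.inf]])
    costs = [detection_cost(target_scores, nontarget_scores, t, **cost)
             for t in candidates]
    best = int(np.argmin(costs))
    return candidates[best], costs[best]
```

Each candidate threshold corresponds to one operating point on the DET curve, so the sweep simply picks the point with the lowest cost.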

13
NIST 2002 evaluation
  • Feature vector: 13 MFCCs + 13 Δ + 13 Δ² (first and second derivatives)
  • Cepstral Mean Normalization
  • Gender-dependent GMMs with 256 Gaussian components
    for the world model
  • Trained on a subset of the cellular data of the NIST
    2001 evaluation

14
NIST 2002 evaluation
  • Target speaker model adapted from world model
  • For every iteration and after the E step
  • Threshold (cumulative probability of 3.0) to
    select tree nodes
  • MLLR used to update the Gaussian means
  • Approximated MAP to smooth the MLLR-estimated
    parameters: linear combination of the MLLR
    estimated mean (weight 0.8) and the world (a priori)
    mean (weight 0.2)
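
The smoothing step is a fixed-weight interpolation of the two sets of means. A minimal sketch, assuming the MLLR-adapted means are already available; the function name `smooth_means` is hypothetical:

```python
import numpy as np

def smooth_means(mllr_means, world_means, mllr_weight=0.8):
    """Approximate MAP: interpolate MLLR-adapted means with the world means."""
    mllr_means = np.asarray(mllr_means, dtype=float)
    world_means = np.asarray(world_means, dtype=float)
    return mllr_weight * mllr_means + (1.0 - mllr_weight) * world_means
```

A full MAP update would weight each Gaussian by the amount of adaptation data it received; the fixed 0.8/0.2 split replaces that data-dependent weight with a constant.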

15
NIST 2002 evaluation
  • 16 male and 21 female speakers (NIST 2001) used
    as impostors (8 test files from each)
  • The pseudo-impostor scores define a distribution
    used to z-normalize the score for a given target
    speaker
  • Global threshold estimated on NIST 2001 data in
    order to minimize the cost

16
NIST 2002 evaluation
  • System characteristics
  • CPU time on a Pentium III 800 MHz
  • 2.1 ms per frame and per speaker for speaker
    model adaptation
  • 0.92 ms per frame for the test
  • Memory usage
  • 360 Kbytes per test

17
NIST 2002 evaluation
  • Results
  • $C_{det} = 0.100292$
  • Min $C_{det} = 0.097833$
  • DET Curve

18
NIST 2002 evaluation

19
NIST 2002 evaluation

20
NIST 2002 evaluation

21
Conclusions and perspectives
  • A new baseline system has been developed and
    evaluated
  • Much work remains to be done, mainly:
  • Optimize the feature extraction module
  • Implement the complete Unified Adaptation
    approach
  • Investigate new normalization strategies
  • Integrate automatic labeling of speech segments