An Intro to Speaker Recognition
Nikki Mirghafori, EECS 225D, 4/23/12

Transcript and Presenter's Notes

1
An Intro to Speaker Recognition
  • Nikki Mirghafori
  • Acknowledgment: some slides borrowed from the
    Heck & Reynolds tutorial, and A. Stolcke.

2
Today's class
  • Interactive
  • Measures of success for today
  • You talk at least as much as I do
  • You learn and remember the basics
  • You feel you can do this stuff
  • We all have fun with the material!

3
A 10-minute Project Design
  • You are experts with different backgrounds. Your
    previous startup companies were wildly
    successful. A large VC firm in the valley wants
    to fund YOUR next creation, as long as the
    project is in speaker recognition.
  • The VC funding is yours if you come up with a
    coherent plan and list of issues:
  • What is your proposed application?
  • What will be the sources of error and
    variability, i.e., technology challenges?
  • What types of features will you use?
  • What sorts of statistical modeling
    tools/techniques?
  • What will be your data needs?
  • Any other issues you can think of along your
    path?

4
Extracting Information from Speech
  • What's noise? What's signal?
  • Orthogonal in many ways
  • Use many of the same models and tools

Goal: Automatically extract information
transmitted in the speech signal
5
Speaker Recognition Applications
  • Access control
  • Physical facilities
  • Data and data networks
  • Transaction authentication
  • Telephone credit card purchases
  • Bank wire transfers
  • Fraud detection
  • Monitoring
  • Remote time and attendance logging
  • Home parole verification
  • Information retrieval
  • Customer information for call centers
  • Audio indexing (speech skimming device)
  • Personalization
  • Forensics
  • Voice sample matching

6
Tasks
  • Identification vs. verification
  • Closed set vs. open set identification
  • Also, segmentation, clustering, tracking...

7
Identification
[Diagram: test speech scored against a speaker model database: "Whose voice is it?"]
Closed-set Speaker Identification
8
Identification
[Diagram: test speech scored against a speaker model database; the voice may belong to none of the enrolled speakers]
Open-set Speaker Identification
9
Verification/Authentication/Detection
[Diagram: test speech scored against the claimed speaker's model: "Does the voice match?"]
Verification requires a claimant ID
10
Speech Modalities
  • Text-dependent recognition
  • Recognition system knows text spoken by person
  • Examples: fixed phrase, prompted phrase
  • Used for applications with strong control over
    user input
  • Knowledge of spoken text can improve system
    performance
  • Text-independent recognition
  • Recognition system does not know text spoken by
    person
  • Examples: user-selected phrase, conversational
    speech
  • Used for applications with less control over user
    input
  • More flexible system but also more difficult
    problem
  • Speech recognition can provide knowledge of
    spoken text
  • Text-constrained recognition: an exercise for
    the reader.

11
Text-constrained Recognition
  • Basic idea: build speaker models for words rich
    in speaker information
  • Example:
  • What time did you say? um... okay, I_think
    that's a good plan.
  • Text-dependent strategy in a text-independent
    context

12
Voice as a biometric
  • Biometric: a human-generated signal or
    attribute for authenticating a person's identity
  • Voice is a popular biometric
  • natural signal to produce
  • does not require a specialized input device
  • ubiquitous: telephones and microphone-equipped
    PCs
  • Voice biometric combines with other forms of
    security

Strongest security
  • Something you have - e.g., a badge
  • Something you know - e.g., a password
  • Something you are - e.g., your voice

13
How to build a system?
  • Feature choices
  • low-level (MFCC, PLP, LPC, F0, ...) and
    high-level (words, phones, prosody, ...)
  • Types of models
  • HMM, GMM, Support Vector Machines (SVM), DTW,
    Nearest Neighbor, Neural Nets
  • Making decisions: log-likelihood thresholds,
    threshold setting for the desired operating point
  • Other issues: normalization (z-norm, t-norm),
    optimal data selection to match expected
    conditions, channel variability, noise, etc.
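
Z-norm, mentioned above, rescales a raw score by impostor-score statistics estimated for the claimed model; a minimal sketch with fabricated scores:

```python
import numpy as np

def z_norm(raw_score, impostor_scores):
    """Z-normalization: center and scale a raw score by the mean and
    standard deviation of the model's scores against impostor speech."""
    mu = np.mean(impostor_scores)
    sigma = np.std(impostor_scores)
    return (raw_score - mu) / sigma

# Fabricated impostor scores for one speaker model
impostors = np.array([-1.2, -0.8, -1.0, -0.9, -1.1])
print(z_norm(-0.5, impostors) > 0)  # scores above the impostor mean map above zero
```

T-norm is analogous but estimates the statistics per test utterance against a set of cohort models rather than per model.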

14
Verification Performance
  • There are many factors to consider in designing
    an evaluation of a speaker verification system
  • Most importantly: the evaluation data and design
    should match the target application domain of
    interest

15
Verification Performance
[DET plot: probability of false reject (in %) vs. probability of false accept (in %). Error rates improve with increasing constraints: text-dependent input (and combinations), clean data, a single microphone, and large amounts of train/test speech.]
16
Verification Performance
  • Example performance curve

Application operating point depends on relative
costs of the two error types
[DET plot: probability of false reject (in %) vs. probability of false accept (in %)]
17
Human vs. Machine
  • Motivation for comparing human to machine
  • Evaluating speech coders and potential forensic
    applications
  • Schmidt-Nielsen and Crystal used NIST evaluation
    (DSP Journal, January 2000)
  • Same amount of training data
  • Matched handset-type tests
  • Mismatched handset-type tests
  • Used 3-sec conversational utterances from
    telephone speech

[Chart of error rates: humans 44% better in one condition, 15% worse in the other]
18
Features
  • Desirable attributes of features for an automatic
    system (Wolf 72)
  • Occur naturally and frequently in speech
  • Easily measurable
  • Not change over time or be affected by the
    speaker's health
  • Not be affected by reasonable background noise
    nor depend on specific transmission
    characteristics
  • Not be subject to mimicry

(These attributes correspond to being practical, robust, and secure.)
  • No feature has all these attributes

19
Training & Test Phases
Enrollment Phase
[Diagram: training speech for each speaker passes through feature extraction and model training, yielding a model for each speaker]
20
Decision making
  • Verification decision approaches have roots in
    signal detection theory
  • 2-class hypothesis test:
  • H0: the speaker is an impostor
  • H1: the speaker is indeed the claimed speaker
  • Statistic computed on the test utterance S as a
    likelihood ratio: Λ(S) = p(S | H1) / p(S | H0)
21
Decision making
  • Identification: pick the model (of N) with the
    best score
  • Verification: the usual approach is via
    likelihood ratio tests, i.e., hypothesis testing
  • By Bayes:
  • P(target | x) / P(nontarget | x) =
    [P(x | target) P(target)] / [P(x | nontarget) P(nontarget)]
  • accept if > threshold, reject otherwise
  • Can't sum over all non-target talkers -- the
    whole world, for SV!
  • Use cohorts (a collection of impostors)
  • Train a universal/world/background model
    (speaker-independent, i.e., trained on many
    speakers)
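
The likelihood-ratio decision above can be sketched numerically; here the target and background models are single diagonal Gaussians standing in for real GMM/HMM models, and all parameter values and the threshold are illustrative:

```python
import numpy as np

def diag_gauss_loglik(x, mean, var):
    """Per-frame log-likelihood under a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var, axis=-1)

def verify(frames, target, background, threshold=0.0):
    """Accept the claim if the average log-likelihood ratio exceeds the threshold."""
    llr = np.mean(diag_gauss_loglik(frames, *target)
                  - diag_gauss_loglik(frames, *background))
    return llr, llr > threshold

# Toy 2-D "cepstral" frames drawn near the target model's mean
rng = np.random.default_rng(0)
target = (np.array([1.0, -1.0]), np.array([0.5, 0.5]))
background = (np.array([0.0, 0.0]), np.array([2.0, 2.0]))
frames = rng.normal(target[0], 0.5, size=(200, 2))

llr, accepted = verify(frames, target, background)
print(accepted)  # frames match the target model, so the claim is accepted
```

Averaging the per-frame log ratio normalizes for utterance length, matching the "normalize by token/frame count" idea used throughout these systems.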

22
Spectral Based Approach
  • Traditional speaker recognition systems use
  • Cepstral features
  • Gaussian Mixture Models (GMMs)

D.A. Reynolds, T.F. Quatieri, R.B. Dunn, "Speaker
Verification Using Adapted Gaussian Mixture
Models," Digital Signal Processing, 10(1-3),
January/April/July 2000
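
A numeric sketch in the spirit of the GMM-UBM approach referenced above: the score is the average per-frame log-likelihood ratio between a speaker GMM and a universal background GMM. The mixture parameters are toy values, and MAP adaptation of the UBM to the speaker is omitted for brevity:

```python
import numpy as np

def gmm_loglik(frames, weights, means, variances):
    """Log-likelihood of each frame under a diagonal-covariance GMM
    (log-sum-exp over components for numerical stability)."""
    diff = frames[:, None, :] - means[None, :, :]
    comp = (-0.5 * np.sum(np.log(2 * np.pi * variances)
                          + diff ** 2 / variances, axis=-1)
            + np.log(weights))                      # (n_frames, n_components)
    m = comp.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(comp - m).sum(axis=1, keepdims=True))).ravel()

# Toy 2-component, 2-D models: a "speaker" GMM and a broad UBM
weights = np.array([0.5, 0.5])
spk_means = np.array([[1.0, 1.0], [-1.0, 1.0]])
ubm_means = np.array([[0.0, 0.0], [0.0, 0.0]])
variances = np.ones((2, 2))

rng = np.random.default_rng(1)
frames = rng.normal([1.0, 1.0], 0.8, size=(300, 2))

# GMM-UBM score: average frame log-likelihood ratio
score = np.mean(gmm_loglik(frames, weights, spk_means, variances)
                - gmm_loglik(frames, weights, ubm_means, variances))
print(score > 0)
```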
23
Features Levels of Information
Hierarchy of perceptual cues (high to low):
  • High level: semantics, idiolect, pronunciations,
    idiosyncrasies (learned: socio-economic status,
    education, place of birth)
  • Intermediate: prosody, rhythm, speed,
    intonation, volume modulation (personality type,
    parental influence)
  • Low level: acoustic aspects of speech (nasal,
    deep, breathy, rough), reflecting the anatomical
    structure of the vocal apparatus
24
Low level features
  • Speech production model: source-filter
    interaction
  • Anatomical structure (vocal tract/glottis)
    conveyed in the speech spectrum

[Diagram: glottal pulses (source) filtered by the vocal tract to produce the speech signal]
25
Word N-gram Features
  • Idea (Doddington 2001)
  • Word usage can be idiosyncratic to a speaker
  • Model speakers by relative frequencies of word
    N-grams
  • Reflects vocabulary AND grammar
  • Cf. similar approaches for authorship and
    plagiarism detection on text documents.
  • First (unpublished) use in speaker recognition
    Heck et al. (1998)
  • Implementation:
  • Get 1-best word recognition output
  • Extract N-gram frequencies
  • Model with likelihood ratios, OR
  • Model frequency vectors with an SVM

Example bigram relative frequencies:
I_shall 0.002
I_think 0.025
I_would 0.012
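
The N-gram frequency extraction step can be sketched in a few lines; the sample sentence is made up:

```python
from collections import Counter

def bigram_frequencies(words):
    """Relative frequencies of word bigrams in a 1-best recognition output."""
    bigrams = [f"{a}_{b}" for a, b in zip(words, words[1:])]
    counts = Counter(bigrams)
    total = sum(counts.values())
    return {bg: c / total for bg, c in counts.items()}

# Hypothetical 1-best output for one conversation side
words = "i think that is a good plan i think so".split()
freqs = bigram_frequencies(words)
print(freqs["i_think"])  # 2 of the 9 bigrams
```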

26
Phone N-gram features
Model the pattern of phone usage, or short-term
pronunciation, for a speaker

[Diagram: open-loop phone recognition produces phone N-gram frequency vectors (e.g., 0.0254 0.0068 0.0198 ...), which a support vector machine (SVM) maps to a score]
27
MLLR transform vectors as features
[Diagram: speaker-independent vs. speaker-dependent Gaussians for phone classes A and B; the MLLR transforms mapping one to the other serve as features]
28
Models
  • HMMs
  • text-dependent (could use whole word/phone
    models)
  • prompted (phone models)
  • text-independent (use LVCSR) -- or GMMs!
  • templates: DTW (if text-dependent system)
  • nearest neighbor: frame-level, training data as
    the model, non-parametric
  • neural nets: train explicitly discriminative
    models
  • SVMs

29
Speaker Models -- HMM
  • Speaker models (voiceprints) represent voice
    biometric in compact and generalizable form
  • Modern speaker verification systems use Hidden
    Markov Models (HMMs)
  • HMMs are statistical models of how a speaker
    produces sounds
  • HMMs represent underlying statistical variations
    in the speech state (e.g., phoneme) and temporal
    changes of speech between the states.
  • Fast training algorithms (EM) exist for HMMs with
    guaranteed convergence properties.

[Diagram: left-to-right HMM for the word "had" (states h-a-d)]
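
The HMM likelihood used in both training and scoring can be computed with the standard forward algorithm; a compact sketch with a made-up 3-state left-to-right model and discrete emissions:

```python
import numpy as np

def forward_loglik(log_init, log_trans, log_emit, obs):
    """Log-likelihood of a discrete observation sequence under an HMM
    (forward algorithm, computed in the log domain)."""
    alpha = log_init + log_emit[:, obs[0]]
    for o in obs[1:]:
        # sum over previous states in the log domain, then emit
        alpha = np.logaddexp.reduce(alpha[:, None] + log_trans, axis=0) + log_emit[:, o]
    return np.logaddexp.reduce(alpha)

# Toy 3-state left-to-right HMM (as for the phones of "had"); all tables made up
init = np.log([1.0, 1e-12, 1e-12])
trans = np.log(np.array([[0.7, 0.3, 1e-12],
                         [1e-12, 0.7, 0.3],
                         [1e-12, 1e-12, 1.0]]))
emit = np.log(np.array([[0.8, 0.1, 0.1],    # state "h" favors symbol 0
                        [0.1, 0.8, 0.1],    # state "a" favors symbol 1
                        [0.1, 0.1, 0.8]]))  # state "d" favors symbol 2

ll = forward_loglik(init, trans, emit, [0, 0, 1, 1, 2])
print(ll < 0)  # a log-probability, so negative here
```

EM training (Baum-Welch) builds on the same forward (and backward) quantities.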
30
Speaker Models: HMM/GMM
  • Form of HMM depends on the application

31
Word N-gram Modeling Likelihood Ratios
  • Model: N-gram token log-likelihood ratio
  • Numerator: speaker language model estimated from
    enrollment data
  • Denominator: background language model estimated
    from a large speaker population
  • Normalize by token count
  • Choose all reasonably frequent bigrams or
    trigrams, or a weighted combination of both
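
The steps above amount to averaging per-token log-likelihood ratios between a speaker LM and a background LM; a sketch in which the probability tables and the floor constant are illustrative assumptions:

```python
import math

def token_llr(tokens, speaker_lm, background_lm, floor=1e-6):
    """Average per-token log-likelihood ratio between a speaker language
    model and a background language model (both dicts of probabilities)."""
    total = sum(math.log(speaker_lm.get(t, floor))
                - math.log(background_lm.get(t, floor))
                for t in tokens)
    return total / len(tokens)  # normalize by token count

# Toy relative frequencies for a few bigram tokens
speaker_lm = {"i_think": 0.025, "you_know": 0.010, "i_would": 0.012}
background_lm = {"i_think": 0.008, "you_know": 0.015, "i_would": 0.011}

score = token_llr(["i_think", "i_think", "you_know"], speaker_lm, background_lm)
print(score > 0)  # this speaker overuses "i_think" relative to the background
```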

32
Speaker Recognition with SVMs
  • Each speech sample (training or test) generates a
    point in a derived feature space
  • The SVM is trained to separate the target sample
    from the impostor (UBM) samples
  • Scores are computed as the Euclidean distance
    from the decision hyperplane to the test sample
    point
  • SVM training is biased against misclassifying
    positive examples (typically very few, often just
    1)

[Diagram: background samples, the target sample, and a test sample in the derived feature space, separated by the SVM hyperplane]
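
Scoring as signed Euclidean distance from the decision hyperplane can be sketched as follows; the weight vector and bias are hypothetical stand-ins for a trained SVM's parameters:

```python
import numpy as np

def svm_score(x, w, b):
    """Signed Euclidean distance from the hyperplane w.x + b = 0;
    positive means the point falls on the target-speaker side."""
    return (w @ x + b) / np.linalg.norm(w)

# Hypothetical trained hyperplane in a 3-D derived feature space
w = np.array([1.0, -2.0, 0.5])
b = -0.25

test_point = np.array([0.8, -0.3, 0.1])
print(svm_score(test_point, w, b) > 0)
```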
33
Feature Transforms for SVMs
  • SVMs have been a boon for higher-level (as well
    as cepstral) speaker recognition research: they
    allow great flexibility in the choice of features
  • However, we need a sequence kernel
  • Dominant approach: transform the variable-length
    feature stream into a fixed, finite-dimensional
    feature space
  • Then use a linear kernel
  • All the action is in the feature transform!

34
Combination of Systems
  • Systems work best in combination, especially
    ones using higher-level features
  • Need to estimate optimal combination weights,
    e.g., using a neural network
  • Combination weights trained on a held-out
    development dataset

[Diagram: GMM, MLLR, word HMM, and phone N-gram system scores feeding a neural-network combiner]
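
A minimal stand-in for the combiner: logistic regression over per-system scores, fit by gradient descent on a synthetic "held-out" set (all scores and labels here are fabricated for illustration):

```python
import numpy as np

def fit_combiner(scores, labels, lr=0.5, steps=500):
    """Learn combination weights for per-system scores via logistic regression."""
    X = np.hstack([scores, np.ones((len(scores), 1))])  # add a bias column
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - labels) / len(labels)      # gradient step
    return w

# Synthetic held-out scores from two subsystems (columns) over 100 trials
rng = np.random.default_rng(2)
target = rng.normal(2.0, 0.5, size=(50, 2))    # target trials score high
impostor = rng.normal(-2.0, 0.5, size=(50, 2)) # impostor trials score low
scores = np.vstack([target, impostor])
labels = np.concatenate([np.ones(50), np.zeros(50)])

w = fit_combiner(scores, labels)
combined = np.hstack([scores, np.ones((100, 1))]) @ w
acc = np.mean((combined > 0) == labels.astype(bool))
print(acc)
```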
35
Variability The Achilles Heel...
  • Variability (extrinsic and intrinsic) in the
    spectrum can cause errors
  • The focus has mainly been on extrinsic
    variability
  • Channel mismatch
  • Microphone
  • carbon-button, hands-free, ...
  • Acoustic environment
  • Office, car, airport, ...
  • Transmission channel
  • Landline, cellular, VoIP, ...
  • Three compensation approaches
  • Feature-based
  • Model-based
  • Score-based

36
NIST Speaker Verification Evaluations
  • Annual NIST evaluations of speaker verification
    technology (since 1996)
  • Aim: provide a common paradigm for comparing
    technologies
  • Focus: conversational telephone speech
    (text-independent)

37
The NIST Evaluation Task
  • Conversational telephone speech, interviews
  • Landline, cellular, hands-free, multiple mics in
    a room
  • 5 min of conversation between two speakers
  • Various conditions, e.g.,
  • Training: 8, 1, or another number of
    conversation sides
  • Test: 1 conversation side, 30 secs, etc.
  • Evaluation metrics:
  • Equal Error Rate (EER)
  • Decision Cost Function (DCF)
  • (C_miss = 10, C_FA = 1, P_target = 0.01)
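
EER and minimum DCF can be computed from target and impostor score lists by sweeping a decision threshold; a self-contained sketch on synthetic scores (the DCF parameters follow the slide):

```python
import numpy as np

def eer_and_dcf(target_scores, impostor_scores,
                c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Sweep thresholds over all observed scores; return the EER and the
    minimum decision cost function (DCF)."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])
    p_fa = np.array([(impostor_scores >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(p_miss - p_fa))          # where the two rates cross
    eer = (p_miss[i] + p_fa[i]) / 2
    dcf = np.min(c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target))
    return eer, dcf

rng = np.random.default_rng(3)
targets = rng.normal(1.5, 1.0, 1000)    # synthetic target-trial scores
impostors = rng.normal(-1.5, 1.0, 1000) # synthetic impostor-trial scores
eer, dcf = eer_and_dcf(targets, impostors)
print(0.0 <= eer <= 0.1, dcf >= 0.0)
```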

38
The End
  • What's one interesting thing you learned today
    that you may share with a friend over dinner?

39
Backup slides
40
Word Conditional Models -- example
  • Boakye et al. (2004)
  • 19 words and bigrams
  • Discourse markers: actually, anyway, like, see,
    well, now, you_know, you_see, i_think, i_mean
  • Filled pauses: um, uh
  • Backchannels: yeah, yep, okay, uhhuh, right,
    i_see, i_know
  • Trained whole-word HMMs, instead of GMMs, to
    model the evolution of speech in time
  • Combines well with the low-level (i.e., cepstral
    GMM) system, especially with more training data

41
Phone N-Grams -- example
  • Idea (Hatch et al., 05): model the pattern of
    phone usage or short-term pronunciation for a
    speaker
  • Use open-loop phone recognition to obtain phone
    hypotheses
  • Create models of relative frequencies of phone
    n-grams of the speaker vs. others
  • Use SVM for modeling
  • Combines well, esp. with increased data
  • Works across languages