Title: An Intro to Speaker Recognition
1. An Intro to Speaker Recognition
- Nikki Mirghafori
- Acknowledgment: some slides borrowed from the Heck & Reynolds tutorial, and from A. Stolcke.
2. Today's Class
- Interactive
- Measures of success for today
- You talk at least as much as I do
- You learn and remember the basics
- You feel you can do this stuff
- We all have fun with the material!
3. A 10-Minute Project Design
- You are experts with different backgrounds. Your previous startup companies were wildly successful. A large VC firm in the valley wants to fund YOUR next creation, as long as the project is in speaker recognition.
- The VC funding is yours if you come up with some kind of coherent plan/list of issues:
- What is your proposed application?
- What will be the sources of error and variability, i.e., the technology challenges?
- What types of features will you use?
- What sorts of statistical modeling tools/techniques?
- What will be your data needs?
- Any other issues you can think of along your path?
4. Extracting Information from Speech
- What's noise? What's signal?
- Orthogonal in many ways
- Use many of the same models and tools
- Goal: automatically extract information transmitted in the speech signal
5. Speaker Recognition Applications
- Access control
- Physical facilities
- Data and data networks
- Transaction authentication
- Telephone credit card purchases
- Bank wire transfers
- Fraud detection
- Monitoring
- Remote time and attendance logging
- Home parole verification
- Information retrieval
- Customer information for call centers
- Audio indexing (speech skimming device)
- Personalization
- Forensics
- Voice sample matching
6. Tasks
- Identification vs. verification
- Closed set vs. open set identification
- Also, segmentation, clustering, tracking...
7. Identification
Speaker Model Database
Test Speech
Whose voice is it?
Closed-set Speaker Identification
8. Identification
Speaker Model Database
Test Speech
Whose voice is it?
Open-set Speaker Identification
9. Verification/Authentication/Detection
Speaker Model Database
Test Speech
Does the voice match?
Verification requires claimant ID
10. Speech Modalities
- Text-dependent recognition
- Recognition system knows the text spoken by the person
- Examples: fixed phrase, prompted phrase
- Used for applications with strong control over user input
- Knowledge of the spoken text can improve system performance
- Text-independent recognition
- Recognition system does not know the text spoken by the person
- Examples: user-selected phrase, conversational speech
- Used for applications with less control over user input
- More flexible system, but also a more difficult problem
- Speech recognition can provide knowledge of the spoken text
- Text-constrained recognition: an exercise for the reader.
11. Text-constrained Recognition
- Basic idea: build speaker models for words rich in speaker information
- Example: "What time did you say? um... okay, I_think that's a good plan."
- A text-dependent strategy in a text-independent context
12. Voice as a Biometric
- Biometric: a human-generated signal or attribute for authenticating a person's identity
- Voice is a popular biometric:
- a natural signal to produce
- does not require a specialized input device
- ubiquitous: telephones and microphone-equipped PCs
- A voice biometric combined with other forms of security gives the strongest security:
- Something you have - e.g., a badge
- Something you know - e.g., a password
- Something you are - e.g., voice
13. How to Build a System?
- Feature choices:
- low-level (MFCC, PLP, LPC, F0, ...) and high-level (words, phones, prosody, ...)
- Types of models:
- HMM, GMM, Support Vector Machines (SVM), DTW, Nearest Neighbor, Neural Nets
- Making decisions: log-likelihood ratio thresholds; threshold setting for the desired operating point
- Other issues: score normalization (z-norm, t-norm), optimal data selection to match expected conditions, channel variability, noise, etc.
14. Verification Performance
- There are many factors to consider in the design of an evaluation of a speaker verification system
- Most importantly: the evaluation data and design should match the target application domain of interest
15. Verification Performance
[Plot: Probability of False Accept (in %) vs. Probability of False Reject (in %); error drops with increasing constraints: text-dependent (combinations), clean data, single microphone, large amount of train/test speech]
16. Verification Performance
- Example performance curve: Probability of False Accept (in %) vs. Probability of False Reject (in %)
- The application operating point depends on the relative costs of the two error types
17. Human vs. Machine
- Motivation for comparing human to machine:
- Evaluating speech coders and potential forensic applications
- Schmidt-Nielsen and Crystal used NIST evaluation data (DSP Journal, January 2000)
- Same amount of training data
- Matched handset-type tests
- Mismatched handset-type tests
- Used 3-sec conversational utterances from telephone speech
[Chart of error rates: humans were 44% better in one handset condition and 15% worse in the other]
18. Features
- Desirable attributes of features for an automatic system (Wolf '72):
- Practical: occur naturally and frequently in speech; easily measurable
- Robust: do not change over time or with the speaker's health; not affected by reasonable background noise; do not depend on specific transmission characteristics
- Secure: not subject to mimicry
- No feature has all these attributes
19. Training & Test Phases
- Enrollment phase: training speech for each speaker → feature extraction → model training → a model for each speaker
20. Decision Making
- Verification decision approaches have roots in signal detection theory
- Two-class hypothesis test:
- H0: the speaker is an impostor
- H1: the speaker is indeed the claimed speaker
- The statistic computed on test utterance S is a likelihood ratio
21Decision making
- Identification pick model (of N) with best
score - Verification usual approach is via likelihood
ratio tests, hypothesis testing, i.e. - By Bayes
- P(targetx)/P(nontargetx)
- P(xtarget)P(target)/P(xnontarget)P(nontarge
t) - accept if gt threshold, reject otherwise
- Cant sum over all non-target talkers -- world
for SV! - Use cohorts (collection of impostors)
- Train universal/world/background model
(speaker independent, its trained on many
speakers)
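The accept/reject rule above can be written in a few lines. A minimal sketch, assuming the log-likelihoods of the test utterance under the target and background models have already been computed (the function names here are hypothetical, not from the slides):

```python
def verify(loglik_target, loglik_background, threshold=0.0):
    """Accept the identity claim if the log-likelihood ratio exceeds the threshold.

    loglik_target / loglik_background: log-likelihoods of the test utterance
    under the claimed-speaker model and the universal background model.
    """
    llr = loglik_target - loglik_background  # log P(x|target) - log P(x|nontarget)
    return llr > threshold

def posterior_ratio(p_x_target, p_x_nontarget, p_target=0.5):
    """The same comparison in probability space, via Bayes' rule."""
    p_nontarget = 1.0 - p_target
    return (p_x_target * p_target) / (p_x_nontarget * p_nontarget)
```

With a zero threshold, the claim is accepted exactly when the target model explains the utterance better than the background model.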
22. Spectral-Based Approach
- Traditional speaker recognition systems use:
- Cepstral features
- Gaussian Mixture Models (GMMs)
D.A. Reynolds, T.F. Quatieri, R.B. Dunn, "Speaker Verification Using Adapted Gaussian Mixture Models," Digital Signal Processing, 10(1-3), January/April/July 2000.
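The scoring side of the GMM approach can be sketched as follows; this is a toy illustration with hand-set parameters (a real system trains the mixtures with EM and adapts them from a UBM, per Reynolds et al.):

```python
import math

def gauss_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at one feature frame x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_loglik(frames, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM."""
    total = 0.0
    for x in frames:
        comps = [math.log(w) + gauss_logpdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        mx = max(comps)  # log-sum-exp for numerical stability
        total += mx + math.log(sum(math.exp(c - mx) for c in comps))
    return total / len(frames)

# Toy 2-D "cepstral" frames and single-component models (hypothetical values):
frames = [[0.1, -0.2], [0.0, 0.1]]
spk = gmm_loglik(frames, [1.0], [[0.0, 0.0]], [[1.0, 1.0]])  # speaker model
ubm = gmm_loglik(frames, [1.0], [[2.0, 2.0]], [[1.0, 1.0]])  # background model
score = spk - ubm  # positive: frames better explained by the speaker model
```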
23. Features: Levels of Information
- Hierarchy of perceptual cues (high-level to low-level):
- semantics, idiolect, pronunciations, idiosyncrasies (socio-economic status, education, place of birth)
- prosody, rhythm, speed, intonation, volume modulation (personality type, parental influence)
- acoustic aspects of speech: nasal, deep, breathy, rough (anatomical structure of the vocal apparatus)
24. Low-Level Features
- Speech production model: source-filter interaction
- Anatomical structure (vocal tract/glottis) is conveyed in the speech spectrum
[Diagram: glottal pulses (source) filtered by the vocal tract yield the speech signal]
25Word N-gram Features
- Idea (Doddington 2001)
- Word usage can be idiosyncratic to a speaker
- Model speakers by relative frequencies of word
N-grams - Reflects vocabulary AND grammar
- Cf. similar approaches for authorship and
plagiarism detection on text documents. - First (unpublished) use in speaker recognition
Heck et al. (1998) - Implementation
- Get 1-best word recognition output
- Extract N-gram frequencies
- Model likelihood ratio OR
- Model frequency vectors by SVM
I_shall 0.002
I_think 0.025
I_would 0.012
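The frequency-extraction step can be sketched directly; `ngram_frequencies` is a hypothetical helper operating on the 1-best recognizer output:

```python
from collections import Counter

def ngram_frequencies(transcript, n=2):
    """Relative frequencies of word n-grams in a 1-best recognition output."""
    words = transcript.lower().split()
    grams = ["_".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

# Toy transcript: "i_think" occurs in 2 of the 8 bigrams, i.e., frequency 0.25
freqs = ngram_frequencies("i think i think that is a good plan")
```

The resulting dictionary is exactly the kind of frequency vector that is either scored with a likelihood ratio or handed to an SVM.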
26. Phone N-gram Features
- Model the pattern of phone usage, or short-term pronunciation, for a speaker
[Diagram: open-loop phone recognition → phone N-gram frequency vectors → Support Vector Machine (SVM) → score]
27. MLLR Transform Vectors as Features
[Diagram: speaker-independent models are adapted to speaker-dependent ones per phone class (A, B); the MLLR transforms themselves become the features]
28. Models
- HMMs
- text-dependent (could use whole-word/phone models)
- prompted (phone models)
- text-independent (use LVCSR) -- or GMMs!
- Templates: DTW (if a text-dependent system)
- Nearest neighbor: frame-level, training data as the model, non-parametric
- Neural nets: train explicitly discriminating models
- SVMs
29. Speaker Models -- HMM
- Speaker models (voiceprints) represent the voice biometric in a compact and generalizable form
- Modern speaker verification systems use Hidden Markov Models (HMMs)
- HMMs are statistical models of how a speaker produces sounds
- HMMs represent the underlying statistical variations in the speech state (e.g., phoneme) and the temporal changes of speech between the states
- Fast training algorithms (EM) exist for HMMs, with guaranteed convergence properties
[Example: an HMM for the word "had" (h-a-d)]
30. Speaker Models: HMM/GMM
- The form of the HMM depends on the application
31. Word N-gram Modeling: Likelihood Ratios
- Model: N-gram token log-likelihood ratio
- Numerator: speaker language model, estimated from enrollment data
- Denominator: background language model, estimated from a large speaker population
- Normalize by token count
- Choose all reasonably frequent bigrams or trigrams, or a weighted combination of both
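A minimal sketch of the normalized token LLR described above, with toy speaker and background language models (real ones are estimated from enrollment and background data and need smoothing for unseen N-grams):

```python
import math

def token_llr(test_counts, speaker_lm, background_lm):
    """Per-token average log-likelihood ratio over the chosen N-gram tokens.

    test_counts: N-gram counts from the test utterance.
    speaker_lm / background_lm: token -> probability maps (toy models here).
    """
    total_tokens = sum(test_counts.values())
    llr = sum(c * (math.log(speaker_lm[g]) - math.log(background_lm[g]))
              for g, c in test_counts.items())
    return llr / total_tokens  # normalize by token count

# The speaker says "i_think" 3x more often than the background population:
score = token_llr({"i_think": 3, "you_know": 1},
                  {"i_think": 0.03, "you_know": 0.01},
                  {"i_think": 0.01, "you_know": 0.01})
```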
32. Speaker Recognition with SVMs
- Each speech sample (training or test) generates a point in a derived feature space
- The SVM is trained to separate the target sample from the impostor (UBM) samples
- Scores are computed as the Euclidean distance from the decision hyperplane to the test sample point
- SVM training is biased against misclassifying positive examples (typically very few, often just 1)
[Diagram: background samples, the target sample, and a test sample around the SVM decision hyperplane]
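The hyperplane-distance scoring can be sketched in a few lines; the weight vector and bias below are hypothetical, standing in for a trained SVM:

```python
import math

def svm_score(w, b, x):
    """Signed Euclidean distance from the hyperplane w.x + b = 0 to point x.

    Positive scores fall on the target-speaker side of the hyperplane.
    """
    margin = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm = math.sqrt(sum(wi * wi for wi in w))
    return margin / norm

# Hypothetical 2-D hyperplane and test point:
score = svm_score(w=[3.0, 4.0], b=-5.0, x=[3.0, 4.0])  # distance 4.0
```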
33. Feature Transforms for SVMs
- SVMs have been a boon for higher-level (as well as cepstral) speaker recognition research: they allow great flexibility in the choice of features
- However, we need a sequence kernel
- Dominant approach: transform the variable-length feature stream into a fixed, finite-dimensional feature space
- Then use a linear kernel
- All the action is in the feature transform!
34. Combination of Systems
- Systems work best in combination, especially ones using higher-level features
- Need to estimate optimal combination weights, e.g., using a neural network
- Combination weights are trained on a held-out development dataset
[Diagram: GMM, MLLR, Word HMM, and Phone N-gram scores feed a neural network combiner]
35. Variability: The Achilles' Heel...
- Variability (extrinsic & intrinsic) in the spectrum can cause errors
- The focus has mainly been on extrinsic variability:
- Channel mismatch
- Microphone
- carbon-button, hands-free, ...
- Acoustic environment
- office, car, airport, ...
- Transmission channel
- landline, cellular, VoIP, ...
- Three compensation approaches:
- Feature-based
- Model-based
- Score-based
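One score-based compensation method (mentioned earlier as z-norm) rescales each model's raw scores by that model's impostor-score statistics. A minimal sketch, assuming a set of impostor-trial scores has already been collected for the model:

```python
def znorm(raw_score, impostor_scores):
    """Zero normalization: standardize a model's raw score using the mean
    and standard deviation of that model's scores on impostor utterances."""
    n = len(impostor_scores)
    mean = sum(impostor_scores) / n
    var = sum((s - mean) ** 2 for s in impostor_scores) / n
    return (raw_score - mean) / var ** 0.5

# Impostor scores centered at 0 with std sqrt(0.5); a raw score of 2.0
# normalizes to 2*sqrt(2) standard deviations above the impostor mean:
normalized = znorm(2.0, [0.0, 1.0, -1.0, 0.0])
```

This makes scores from different speaker models comparable, so a single global threshold can be used.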
36. NIST Speaker Verification Evaluations
- Annual NIST evaluations of speaker verification technology (since 1996)
- Aim: provide a common paradigm for comparing technologies
- Focus: conversational telephone speech (text-independent)
37. The NIST Evaluation Task
- Conversational telephone speech, interviews
- Landline, cellular, hands-free, multiple mics in a room
- 5 min of conversation between two speakers
- Various conditions, e.g.,
- Training: 8, 1, or another number of conversation sides
- Test: 1 conversation side, 30 secs, etc.
- Evaluation:
- Equal Error Rate (EER)
- Decision Cost Function (DCF), with (C_miss, C_fa, P_target) = (10, 1, 0.01)
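The two evaluation measures can be sketched on hypothetical score lists; the threshold below happens to sit at the equal-error point of the toy data:

```python
def error_rates(target_scores, impostor_scores, threshold):
    """False-reject (miss) and false-accept rates at a given threshold."""
    p_miss = sum(s < threshold for s in target_scores) / len(target_scores)
    p_fa = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return p_miss, p_fa

def dcf(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Decision cost function with the (10, 1, 0.01) NIST parameters."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

# Toy scores: 1 of 4 targets rejected, 1 of 4 impostors accepted (EER = 25%)
targets = [2.0, 1.5, 0.5, -0.5]
impostors = [-2.0, -1.0, 0.0, 1.0]
p_miss, p_fa = error_rates(targets, impostors, threshold=0.25)
```

Sweeping the threshold and finding where p_miss equals p_fa gives the EER; the DCF instead weights the two errors by their costs and the target prior.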
38. The End
- What's one interesting thing you learned today that you might share with a friend over dinner?
39. Backup Slides
40. Word-Conditional Models -- Example
- Boakye et al. (2004)
- 19 words and bigrams:
- Discourse markers: actually, anyway, like, see, well, now, you_know, you_see, i_think, i_mean
- Filled pauses: um, uh
- Backchannels: yeah, yep, okay, uhhuh, right, i_see, i_know
- Trained whole-word HMMs, instead of GMMs, to model the evolution of speech in time
- Combines well with a low-level (i.e., cepstral GMM) system, especially with more training data
41. Phone N-grams -- Example
- Idea (Hatch et al., '05): model the pattern of phone usage, or short-term pronunciation, for a speaker
- Use open-loop phone recognition to obtain phone hypotheses
- Create models of the relative frequencies of the speaker's phone n-grams vs. others'
- Use an SVM for modeling
- Combines well, especially with increased data
- Works across languages