Title: An Intro to Speaker Recognition
1. An Intro to Speaker Recognition
- Nikki Mirghafori
- Acknowledgment: some slides borrowed from the Heck & Reynolds tutorial, and from A. Stolcke.
2. Today's Class
- Interactive
- Measures of success for today
- You talk at least as much as I do
- You learn and remember the basics
- You feel you can do this stuff
- We all have fun with the material!
3. A 10-Minute Project Design
- You are experts with different backgrounds. Your previous startup companies were wildly successful. A large VC firm in the valley wants to fund YOUR next creation, as long as the project is in speaker recognition.
- The VC funding is yours if you come up with some kind of coherent plan/list of issues:
- What is your proposed application?
- What will be the sources of error and variability, i.e., the technology challenges?
- What types of features will you use?
- What sorts of statistical modeling tools/techniques?
- What will be your data needs?
- Any other issues you can think of along your path?
4. Extracting Information from Speech
- What's noise? What's signal?
- Orthogonal in many ways
- Use many of the same models and tools
- Goal: automatically extract information transmitted in the speech signal
5. Speaker Recognition Applications
- Access control
- Physical facilities
- Data and data networks
- Transaction authentication
- Telephone credit card purchases
- Bank wire transfers
- Fraud detection
- Monitoring
- Remote time and attendance logging
- Home parole verification
- Information retrieval
- Customer information for call centers
- Audio indexing (speech skimming device)
- Personalization
- Forensics
- Voice sample matching
6. Tasks
- Identification vs. verification
- Closed set vs. open set identification
- Also, segmentation, clustering, tracking...
7. Identification
Speaker Model Database
Test Speech
Whose voice is it?
Closed-set Speaker Identification
8. Identification
Speaker Model Database
Test Speech
Whose voice is it?
Open-set Speaker Identification
9. Verification/Authentication/Detection
Speaker Model Database
Test Speech
Does the voice match?
Verification requires claimant ID
10. Speech Modalities
- Text-dependent recognition
- Recognition system knows the text spoken by the person
- Examples: fixed phrase, prompted phrase
- Used for applications with strong control over user input
- Knowledge of the spoken text can improve system performance
- Text-independent recognition
- Recognition system does not know the text spoken by the person
- Examples: user-selected phrase, conversational speech
- Used for applications with less control over user input
- More flexible system, but also a more difficult problem
- Speech recognition can provide knowledge of the spoken text
- Text-constrained recognition: an exercise for the reader.
11. Text-constrained Recognition
- Basic idea: build speaker models for words rich in speaker information
- Example: "What time did you say? um... okay, I_think that's a good plan."
- A text-dependent strategy in a text-independent context
12. Voice as a Biometric
- Biometric: a human-generated signal or attribute for authenticating a person's identity
- Voice is a popular biometric:
- a natural signal to produce
- does not require a specialized input device
- ubiquitous: telephones and microphone-equipped PCs
- A voice biometric combined with other forms of security gives the strongest security:
- Something you have - e.g., a badge
- Something you know - e.g., a password
- Something you are - e.g., voice
13. How to Build a System?
- Feature choices:
- low-level (MFCC, PLP, LPC, F0, ...) and high-level (words, phones, prosody, ...)
- Types of models:
- HMM, GMM, Support Vector Machines (SVM), DTW, Nearest Neighbor, Neural Nets
- Making decisions: log-likelihood ratio thresholds; threshold setting for the desired operating point
- Other issues: score normalization (z-norm, t-norm), optimal data selection to match expected conditions, channel variability, noise, etc.
14. Verification Performance
- There are many factors to consider in the design of an evaluation of a speaker verification system
- Most importantly: the evaluation data and design should match the target application domain of interest
15. Verification Performance
[Plot: Probability of False Accept (in %) vs. Probability of False Reject (in %); error drops with increasing constraints: text-dependent (combinations), clean data, single microphone, large amount of train/test speech]
16. Verification Performance
- Example performance curve: Probability of False Accept (in %) vs. Probability of False Reject (in %)
- The application operating point depends on the relative costs of the two error types
17. Human vs. Machine
- Motivation for comparing human to machine:
- Evaluating speech coders and potential forensic applications
- Schmidt-Nielsen and Crystal used NIST evaluation data (DSP Journal, January 2000)
- Same amount of training data
- Matched handset-type tests
- Mismatched handset-type tests
- Used 3-sec conversational utterances from telephone speech
[Chart of error rates: humans were 44% better in one handset condition and 15% worse in the other]
18. Features
- Desirable attributes of features for an automatic system (Wolf '72):
- Practical: occur naturally and frequently in speech; easily measurable
- Robust: do not change over time or with the speaker's health; not affected by reasonable background noise; do not depend on specific transmission characteristics
- Secure: not subject to mimicry
- No feature has all these attributes
19. Training & Test Phases
- Enrollment phase: training speech for each speaker → feature extraction → model training → a model for each speaker
20. Decision Making
- Verification decision approaches have roots in signal detection theory
- Two-class hypothesis test:
- H0: the speaker is an impostor
- H1: the speaker is indeed the claimed speaker
- The statistic computed on test utterance S is a likelihood ratio
21Decision making
- Identification pick model (of N) with best
score - Verification usual approach is via likelihood
ratio tests, hypothesis testing, i.e. - By Bayes
- P(targetx)/P(nontargetx)
- P(xtarget)P(target)/P(xnontarget)P(nontarge
t) - accept if gt threshold, reject otherwise
- Cant sum over all non-target talkers -- world
for SV! - Use cohorts (collection of impostors)
- Train universal/world/background model
(speaker independent, its trained on many
speakers)
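The accept/reject rule above can be written in a few lines. A minimal sketch, assuming the log-likelihoods of the test utterance under the target and background models have already been computed (the function names here are hypothetical, not from the slides):

```python
def verify(loglik_target, loglik_background, threshold=0.0):
    """Accept the identity claim if the log-likelihood ratio exceeds the threshold.

    loglik_target / loglik_background: log-likelihoods of the test utterance
    under the claimed-speaker model and the universal background model.
    """
    llr = loglik_target - loglik_background  # log P(x|target) - log P(x|nontarget)
    return llr > threshold

def posterior_ratio(p_x_target, p_x_nontarget, p_target=0.5):
    """The same comparison in probability space, via Bayes' rule."""
    p_nontarget = 1.0 - p_target
    return (p_x_target * p_target) / (p_x_nontarget * p_nontarget)
```

With a zero threshold, the claim is accepted exactly when the target model explains the utterance better than the background model.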
22. Spectral-Based Approach
- Traditional speaker recognition systems use:
- Cepstral features
- Gaussian Mixture Models (GMMs)
D.A. Reynolds, T.F. Quatieri, R.B. Dunn, "Speaker Verification Using Adapted Gaussian Mixture Models," Digital Signal Processing, 10(1-3), January/April/July 2000.
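The scoring side of the GMM approach can be sketched as follows; this is a toy illustration with hand-set parameters (a real system trains the mixtures with EM and adapts them from a UBM, per Reynolds et al.):

```python
import math

def gauss_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at one feature frame x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_loglik(frames, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM."""
    total = 0.0
    for x in frames:
        comps = [math.log(w) + gauss_logpdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        mx = max(comps)  # log-sum-exp for numerical stability
        total += mx + math.log(sum(math.exp(c - mx) for c in comps))
    return total / len(frames)

# Toy 2-D "cepstral" frames and single-component models (hypothetical values):
frames = [[0.1, -0.2], [0.0, 0.1]]
spk = gmm_loglik(frames, [1.0], [[0.0, 0.0]], [[1.0, 1.0]])  # speaker model
ubm = gmm_loglik(frames, [1.0], [[2.0, 2.0]], [[1.0, 1.0]])  # background model
score = spk - ubm  # positive: frames better explained by the speaker model
```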
23. Features: Levels of Information
- Hierarchy of perceptual cues (high-level to low-level):
- semantics, idiolect, pronunciations, idiosyncrasies (socio-economic status, education, place of birth)
- prosody, rhythm, speed, intonation, volume modulation (personality type, parental influence)
- acoustic aspects of speech: nasal, deep, breathy, rough (anatomical structure of the vocal apparatus)
24. Low-Level Features
- Speech production model: source-filter interaction
- Anatomical structure (vocal tract/glottis) is conveyed in the speech spectrum
[Diagram: glottal pulses (source) filtered by the vocal tract yield the speech signal]
25Word N-gram Features
- Idea (Doddington 2001)
- Word usage can be idiosyncratic to a speaker
- Model speakers by relative frequencies of word
N-grams - Reflects vocabulary AND grammar
- Cf. similar approaches for authorship and
plagiarism detection on text documents. - First (unpublished) use in speaker recognition
Heck et al. (1998) - Implementation
- Get 1-best word recognition output
- Extract N-gram frequencies
- Model likelihood ratio OR
- Model frequency vectors by SVM
I_shall 0.002
I_think 0.025
I_would 0.012
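The frequency-extraction step can be sketched directly; `ngram_frequencies` is a hypothetical helper operating on the 1-best recognizer output:

```python
from collections import Counter

def ngram_frequencies(transcript, n=2):
    """Relative frequencies of word n-grams in a 1-best recognition output."""
    words = transcript.lower().split()
    grams = ["_".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

# Toy transcript: "i_think" occurs in 2 of the 8 bigrams, i.e., frequency 0.25
freqs = ngram_frequencies("i think i think that is a good plan")
```

The resulting dictionary is exactly the kind of frequency vector that is either scored with a likelihood ratio or handed to an SVM.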
26. Phone N-gram Features
- Model the pattern of phone usage, or short-term pronunciation, for a speaker
[Diagram: open-loop phone recognition → phone N-gram frequency vectors → Support Vector Machine (SVM) → score]
27. MLLR Transform Vectors as Features
[Diagram: speaker-independent models are adapted to speaker-dependent ones per phone class (A, B); the MLLR transforms themselves become the features]
28. Models
- HMMs
- text-dependent (could use whole-word/phone models)
- prompted (phone models)
- text-independent (use LVCSR) -- or GMMs!
- Templates: DTW (if a text-dependent system)
- Nearest neighbor: frame-level, training data as the model, non-parametric
- Neural nets: train explicitly discriminating models
- SVMs
29. Speaker Models -- HMM
- Speaker models (voiceprints) represent the voice biometric in a compact and generalizable form
- Modern speaker verification systems use Hidden Markov Models (HMMs)
- HMMs are statistical models of how a speaker produces sounds
- HMMs represent the underlying statistical variations in the speech state (e.g., phoneme) and the temporal changes of speech between the states
- Fast training algorithms (EM) exist for HMMs, with guaranteed convergence properties
[Example: an HMM for the word "had" (h-a-d)]
30. Speaker Models: HMM/GMM
- The form of the HMM depends on the application
31. Word N-gram Modeling: Likelihood Ratios
- Model: N-gram token log-likelihood ratio
- Numerator: speaker language model, estimated from enrollment data
- Denominator: background language model, estimated from a large speaker population
- Normalize by token count
- Choose all reasonably frequent bigrams or trigrams, or a weighted combination of both
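A minimal sketch of the normalized token LLR described above, with toy speaker and background language models (real ones are estimated from enrollment and background data and need smoothing for unseen N-grams):

```python
import math

def token_llr(test_counts, speaker_lm, background_lm):
    """Per-token average log-likelihood ratio over the chosen N-gram tokens.

    test_counts: N-gram counts from the test utterance.
    speaker_lm / background_lm: token -> probability maps (toy models here).
    """
    total_tokens = sum(test_counts.values())
    llr = sum(c * (math.log(speaker_lm[g]) - math.log(background_lm[g]))
              for g, c in test_counts.items())
    return llr / total_tokens  # normalize by token count

# The speaker says "i_think" 3x more often than the background population:
score = token_llr({"i_think": 3, "you_know": 1},
                  {"i_think": 0.03, "you_know": 0.01},
                  {"i_think": 0.01, "you_know": 0.01})
```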
32. Speaker Recognition with SVMs
- Each speech sample (training or test) generates a point in a derived feature space
- The SVM is trained to separate the target sample from the impostor (UBM) samples
- Scores are computed as the Euclidean distance from the decision hyperplane to the test sample point
- SVM training is biased against misclassifying positive examples (typically very few, often just 1)
[Diagram: background samples, the target sample, and a test sample around the SVM decision hyperplane]
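The hyperplane-distance scoring can be sketched in a few lines; the weight vector and bias below are hypothetical, standing in for a trained SVM:

```python
import math

def svm_score(w, b, x):
    """Signed Euclidean distance from the hyperplane w.x + b = 0 to point x.

    Positive scores fall on the target-speaker side of the hyperplane.
    """
    margin = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm = math.sqrt(sum(wi * wi for wi in w))
    return margin / norm

# Hypothetical 2-D hyperplane and test point:
score = svm_score(w=[3.0, 4.0], b=-5.0, x=[3.0, 4.0])  # distance 4.0
```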
33. Feature Transforms for SVMs
- SVMs have been a boon for higher-level (as well as cepstral) speaker recognition research: they allow great flexibility in the choice of features
- However, we need a sequence kernel
- Dominant approach: transform the variable-length feature stream into a fixed, finite-dimensional feature space
- Then use a linear kernel
- All the action is in the feature transform!
34. Combination of Systems
- Systems work best in combination, especially ones using higher-level features
- Need to estimate optimal combination weights, e.g., using a neural network
- Combination weights are trained on a held-out development dataset
[Diagram: GMM, MLLR, Word HMM, and Phone N-gram scores feed a neural network combiner]
35. Variability: The Achilles' Heel...
- Variability (extrinsic & intrinsic) in the spectrum can cause errors
- The focus has mainly been on extrinsic variability:
- Channel mismatch
- Microphone
- carbon-button, hands-free, ...
- Acoustic environment
- office, car, airport, ...
- Transmission channel
- landline, cellular, VoIP, ...
- Three compensation approaches:
- Feature-based
- Model-based
- Score-based
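One score-based compensation method (mentioned earlier as z-norm) rescales each model's raw scores by that model's impostor-score statistics. A minimal sketch, assuming a set of impostor-trial scores has already been collected for the model:

```python
def znorm(raw_score, impostor_scores):
    """Zero normalization: standardize a model's raw score using the mean
    and standard deviation of that model's scores on impostor utterances."""
    n = len(impostor_scores)
    mean = sum(impostor_scores) / n
    var = sum((s - mean) ** 2 for s in impostor_scores) / n
    return (raw_score - mean) / var ** 0.5

# Impostor scores centered at 0 with std sqrt(0.5); a raw score of 2.0
# normalizes to 2*sqrt(2) standard deviations above the impostor mean:
normalized = znorm(2.0, [0.0, 1.0, -1.0, 0.0])
```

This makes scores from different speaker models comparable, so a single global threshold can be used.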
36. NIST Speaker Verification Evaluations
- Annual NIST evaluations of speaker verification technology (since 1996)
- Aim: provide a common paradigm for comparing technologies
- Focus: conversational telephone speech (text-independent)
37. The NIST Evaluation Task
- Conversational telephone speech, interviews
- Landline, cellular, hands-free, multiple mics in a room
- 5 min of conversation between two speakers
- Various conditions, e.g.,
- Training: 8, 1, or another number of conversation sides
- Test: 1 conversation side, 30 secs, etc.
- Evaluation:
- Equal Error Rate (EER)
- Decision Cost Function (DCF), with (C_miss, C_fa, P_target) = (10, 1, 0.01)
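The two evaluation measures can be sketched on hypothetical score lists; the threshold below happens to sit at the equal-error point of the toy data:

```python
def error_rates(target_scores, impostor_scores, threshold):
    """False-reject (miss) and false-accept rates at a given threshold."""
    p_miss = sum(s < threshold for s in target_scores) / len(target_scores)
    p_fa = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return p_miss, p_fa

def dcf(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Decision cost function with the (10, 1, 0.01) NIST parameters."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

# Toy scores: 1 of 4 targets rejected, 1 of 4 impostors accepted (EER = 25%)
targets = [2.0, 1.5, 0.5, -0.5]
impostors = [-2.0, -1.0, 0.0, 1.0]
p_miss, p_fa = error_rates(targets, impostors, threshold=0.25)
```

Sweeping the threshold and finding where p_miss equals p_fa gives the EER; the DCF instead weights the two errors by their costs and the target prior.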
38. The End
- What's one interesting thing you learned today that you might share with a friend over dinner?
39. Backup Slides
40. Word-Conditional Models -- Example
- Boakye et al. (2004)
- 19 words and bigrams:
- Discourse markers: actually, anyway, like, see, well, now, you_know, you_see, i_think, i_mean
- Filled pauses: um, uh
- Backchannels: yeah, yep, okay, uhhuh, right, i_see, i_know
- Trained whole-word HMMs, instead of GMMs, to model the evolution of speech in time
- Combines well with a low-level (i.e., cepstral GMM) system, especially with more training data
41. Phone N-grams -- Example
- Idea (Hatch et al., '05): model the pattern of phone usage, or short-term pronunciation, for a speaker
- Use open-loop phone recognition to obtain phone hypotheses
- Create models of the relative frequencies of the speaker's phone n-grams vs. others'
- Use an SVM for modeling
- Combines well, especially with increased data
- Works across languages