Lecture 16 Speaker Recognition - PowerPoint PPT Presentation

Provided by: bill475

1
Lecture 16 Speaker Recognition
Information College, Shandong University at Weihai
2
Definition
  • Method of recognizing a person from his/her
    voice.
  • Depends on Speaker Specific Characteristics
  • To determine whether a specified speaker is
    speaking in a given segment of speech
  • This task is the one closest to biometric
    identification using speech

3
Voice is a popular Biometric
  • Voice Biometric
  • Natural signal to produce
  • Does not require a specialized input device
  • Can be used on site or remotely
  • Telephone banking, voice mail browsing, ...
  • Security
  • Keys, card, ...
  • Passwords, PIN, ...
  • Fingerprint, voiceprint, Iris-print

4
Similar Tasks
  • Speaker Verification
  • Extract information from the stream of speech.
  • Verifies that a person is who she/he claims to
    be.
  • One-to-one comparison.
  • Speaker Recognition
  • Extract information from the stream of speech.
  • Assigns an identity to the voice of an unknown
    person.
  • One-to-many comparison.
  • Speech Recognition
  • Extracts information from the stream of speech.
  • Figures out what a person is saying.
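
The difference between the one-to-one and one-to-many comparisons above can be illustrated with a minimal sketch. The cosine similarity measure, the threshold of 0.9, the speaker names and the three-dimensional "voice models" are all assumptions made for illustration; they are not taken from the lecture.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(test_vec, claimed_model, threshold=0.9):
    """One-to-one: accept or reject a claimed identity (threshold is hypothetical)."""
    return cosine_similarity(test_vec, claimed_model) >= threshold

def identify(test_vec, enrolled):
    """One-to-many: return the enrolled speaker whose model is most similar."""
    return max(enrolled, key=lambda name: cosine_similarity(test_vec, enrolled[name]))

# Hypothetical enrolled speakers and an unknown test sample.
enrolled = {"alice": [1.0, 0.2, 0.1], "bob": [0.1, 1.0, 0.3]}
sample = [0.9, 0.25, 0.1]

claim_alice = verify(sample, enrolled["alice"])  # compares against one model only
claim_bob = verify(sample, enrolled["bob"])
who = identify(sample, enrolled)                 # compares against every model
```

Verification needs a threshold because it must be able to reject; identification as sketched here always returns the closest enrolled speaker.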

5
Task of Today
  • Speech Recognition
  • History
  • Scheme
  • Speaker Features
  • Methods

6
Recognition Milestone
  • 1920, first electromechanical toy, "Rex"
    (Elmwood Co.)
  • Late 1940s, US Defense, Automatic Translation
    Machine
  • Project failed, but sparked the research at MIT,
    CMU, commercial institutions.
  • During the 1950s, the first system capable of
    recognizing digits spoken over the telephone was
    developed by Bell Labs.
  • 1962, Shoebox from IBM
  • In the early 1970s, the HARPY system, capable of
    recognizing sentences with a limited grammar, was
    developed at Carnegie Mellon University.
  • HARPY required as much computing power as 50
    contemporary computers.
  • Moreover, the system recognized discrete speech,
    where words are separated by longer pauses than
    usual.

7
Recognition Milestone
  • In the 1980s, significant progress in speech
    recognition technology
  • Word error rates continue to drop by a factor of 2
    every two years.
  • IBM in 1985: real-time recognition of isolated
    words from a set of 20,000 after 20 minutes of
    training, with an error rate < 5%.
  • AT&T: call routing system using speaker-independent
    word-spotting technology for a few key phrases.
  • Several very large vocabulary dictation systems
  • require speakers to pause between words.
  • Better for specific domain.
  • In the 1990s
  • VoiceBroker deployed by Charles Schwab, stock
    brokerage, in 1996.
  • ViaVoice by IBM, first distributed with the now
    almost forgotten operating system OS/2 in 1996.
  • 1997, Dragon introduced NaturallySpeaking, the first
    continuous speech recognition package
  • Today
  • Airline reservations with British Airways,
  • Train reservation for Amtrak,
  • Weather forecasts, telephone directory
    information

8
Terminology of Speech Recognition
  • Speaker Dependent Recognition
  • The recognition system is designed to work with
    just one or a small number of individual speakers
  • Speaker Independent Recognition
  • These systems are designed to work with all the
    speakers from a given linguistic community

9
Terminology of Speech Recognition
  • Large Vocabulary Recognition
  • Examples are domain-specific recognition systems
    such as those used by medical consultants for
    dictating notes on their ward rounds
  • Very difficult to make accurate large-vocabulary,
    speaker-independent systems
  • Small Vocabulary Recognition
  • Typically recognition of a few keywords such as
    digits or a set of commands.
  • Example: voice-operated telephone number dialing

10
Terminology of Speech Recognition
  • Isolated Word Recognition
  • Systems which can only recognize individual words
    which are preceded and followed by a relatively
    long period of silence
  • Connected Word Recognition
  • Systems which can recognize a limited sequence of
    words spoken in succession (e.g. Ninety-eight
    thirty-five four thousand)
  • Continuous Word Recognition
  • These systems can recognize speech as it occurs
    and recognize the speech in real time. Such
    systems usually work with a large vocabulary, but
    with moderate accuracy.

11
Speech Recognition Scheme
  • Three steps in Speech recognition are performed
    in ANY recognition system
  • Feature Extraction
  • Measurement of similarity
  • Decision making
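
The three steps above can be sketched as a toy pipeline. The specific features (mean energy and zero-crossing rate), the Euclidean distance, the rejection threshold and the reference labels are illustrative assumptions; a real system would use spectral features and trained models.

```python
import math

def extract_features(signal):
    """Step 1: toy features, mean energy and zero-crossing rate of the signal."""
    energy = sum(s * s for s in signal) / len(signal)
    crossings = sum(1 for a, b in zip(signal, signal[1:]) if a * b < 0)
    return [energy, crossings / (len(signal) - 1)]

def distance(a, b):
    """Step 2: Euclidean distance (smaller means more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def decide(test_pattern, references, max_distance=0.5):
    """Step 3: pick the closest reference, or reject if nothing is close enough."""
    best = min(references, key=lambda label: distance(test_pattern, references[label]))
    return best if distance(test_pattern, references[best]) <= max_distance else None

# Hypothetical reference patterns in the same (energy, zero-crossing rate) space.
references = {"buzz": [1.0, 0.9], "hum": [1.0, 0.1]}
accepted = decide(extract_features([1, -1, 1, -1, 1, -1]), references)
rejected = decide(extract_features([3, 3, 3, 3]), references)
```

The first signal lands close to the "buzz" reference and is accepted; the second is far from both references, so the decision step returns `None`.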

12
Recognition Systems
[Figure: block diagram of a recognition system]
speech -> feature extraction -> test pattern -> pattern matching (against reference patterns) -> decision rule -> accept/reject
  • Feature extraction: derive a compact representation
    of the speech waveform
  • Pattern matching: find the word with the greatest
    similarity to the input speech
  • Pattern matching is constrained in many ways, e.g.
    the rules of language (grammar), spelling and
    possible pronunciations
  • Feature vector: c0(t), c1(t), ..., cM(t);
    Δc0(t), Δc1(t), ..., ΔcM(t);
    Δ²c0(t), Δ²c1(t), ..., Δ²cM(t)
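
If the feature vector on this slide is read as static coefficients c0(t) ... cM(t) followed by their first and second time differences (an assumption, since the slide gives no formula), the stacking can be sketched as below. The central-difference rule with repeated edge frames is one simple choice; practical front ends often use a regression over a wider window.

```python
def delta(frames):
    """Central-difference delta over time; edge frames are repeated as padding."""
    n = len(frames)
    out = []
    for t in range(n):
        prev_f = frames[max(t - 1, 0)]
        next_f = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev_f, next_f)])
    return out

def add_deltas(frames):
    """Stack static, delta and delta-delta coefficients for every frame."""
    d1 = delta(frames)
    d2 = delta(d1)
    return [c + a + b for c, a, b in zip(frames, d1, d2)]

# Three frames with two static coefficients each (made-up values).
frames = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]]
stacked = add_deltas(frames)
```

Each output frame has three times as many values as the input frame: the static part, its rate of change, and the rate of change of that rate.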
13
Speech Model Features
14
Speaker Recognition Features
  • The features are low-level speech signal
    representation parameters that convey complete
    information about the signal.
  • High-level characteristics like accent,
    intonation, etc. are encoded within the
    representation in a very complex and cryptic
    manner.
  • The features contain speaker-dependent
    components.
  • Uniqueness and permanence of the features is
    problematic.

15
Questions
  • Do the features that uniquely characterize people
    exist?
  • Uniqueness and permanence of most of the features
    used in biometric systems have not been proven.
  • Is the human ability to identify a person a
    limit that no automatic system can overcome?
  • Automated systems might be able to identify
    people better than the average person can. In
    practice, expert systems do not perform the task
    better than the experts who built them.

16
Questions
  • How important are the algorithms versus the
    knowledge of features and their relationships to
    achieve high identification accuracy?
  • Knowledge of features and their relationships is
    fundamental for accurate biometric systems. The
    algorithms play an important, still secondary,
    role in the process as no algorithm can
    compensate for the lack of the adequate features.

17
Speaker models
  • Used to represent the speaker specific
    information conveyed in the feature vectors
  • Several different modeling techniques have been
    applied
  • Template Matching
  • Nearest Neighbor
  • Neural Networks
  • Hidden Markov Models
  • State-of-the-art speaker recognition algorithms
    are based on statistical models of short-term
    acoustic measurements on the input speech signal

18
Speaker models
  • Use long-term averages of acoustic features
    (spectrum, pitch): the first and earliest idea
  • Averaging removes the factors that cause
    intra-speaker variation, leaving only the
    speaker-dependent component.
  • Drawback: requires a long speech utterance (> 20 s)
  • Training a speaker-dependent (SD) model for each
    speaker
  • Explicit segmentation: HMM
  • Implicit segmentation: VQ, GMM
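
The long-term averaging idea from this slide can be sketched in a few lines. The two-dimensional feature values, the speaker labels and the use of Euclidean distance for comparison are assumptions chosen only to make the example runnable.

```python
import math

def long_term_average(frames):
    """Average each feature dimension over all frames of an utterance."""
    n = len(frames)
    return [sum(f[i] for f in frames) / n for i in range(len(frames[0]))]

def closest_speaker(test_frames, profiles):
    """Return the enrolled speaker whose long-term average is nearest (Euclidean)."""
    avg = long_term_average(test_frames)
    return min(profiles, key=lambda name: math.dist(avg, profiles[name]))

# Hypothetical enrollment data: one averaged profile per speaker.
profiles = {
    "A": long_term_average([[1.0, 2.0], [3.0, 4.0]]),
    "B": long_term_average([[8.0, 1.0], [10.0, 3.0]]),
}
guess = closest_speaker([[2.5, 2.5], [1.5, 3.5]], profiles)
```

With only a handful of frames the average is dominated by what was said rather than who said it, which is why this approach needs long utterances.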

19
Speaker models
  • HMM
  • Advantage: text-independent
  • Drawback: a significant increase in
    computational complexity
  • VQ
  • Advantage: unsupervised clustering
  • Drawback: text-dependent
  • GMM
  • Advantage: text-independent, probabilistic
    framework (robust), computationally efficient,
    easy to implement.
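
A minimal sketch of how a GMM scores speech frames for each speaker follows. It assumes diagonal covariances and already-trained parameters; the mixture weights, means, variances and speaker labels are made up, and training (for example with EM) is not shown.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_log_likelihood(x, gmm):
    """log p(x) for a mixture given as a list of (weight, mean, variance) tuples."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
    peak = max(terms)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def score_utterance(frames, gmm):
    """Average per-frame log-likelihood; frame order does not matter."""
    return sum(gmm_log_likelihood(f, gmm) for f in frames) / len(frames)

# Hypothetical two-component models for two speakers.
models = {
    "A": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [2.0, 2.0], [1.0, 1.0])],
    "B": [(0.5, [6.0, 6.0], [1.0, 1.0]), (0.5, [8.0, 8.0], [1.0, 1.0])],
}
frames = [[0.1, -0.2], [1.9, 2.1], [0.5, 0.4]]
best = max(models, key=lambda k: score_utterance(frames, models[k]))
```

Because the score is a plain average over frames, shuffling the frames gives the same result, which is what makes the model text-independent.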

20
Speaker models
  • Discriminative Neural Network
  • Models the decision function that best
    discriminates between speakers
  • Advantage
  • Fewer parameters and higher performance compared
    to the VQ model.
  • Drawback
  • The network must be retrained when a new
    speaker is added to the system.

21
Progressing
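[Figure: timeline of speaker modeling methods from 1985 to 1995. The set of methods grows over time: VQ, NN; then HMM, VQ, NN; then GMM, HMM, VQ, NN. Axis labels: Word Error Rate; Easy to Hard.]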
21
State of the Art Speech Recognition
22
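VQ Example
[Figure: a test sample plotted in a two-dimensional acoustic space (axes: Acoustic Space 1, Acoustic Space 2) together with Speaker A and Speaker B. The distortion is the distance from the sample to each speaker. This sample has less distortion for A than for B.]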
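
The decision illustrated on this slide, choosing the speaker with the least distortion, can be sketched as follows. The codebooks here are hand-written placeholders; in practice each speaker's codebook would be trained by clustering that speaker's feature vectors, and squared Euclidean distance is only one possible distortion measure.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def average_distortion(frames, codebook):
    """Mean squared distance from each frame to its nearest codeword."""
    return sum(min(sq_dist(f, c) for c in codebook) for f in frames) / len(frames)

# Hypothetical two-codeword codebooks in a two-dimensional acoustic space.
codebooks = {
    "A": [[0.0, 0.0], [1.0, 1.0]],
    "B": [[5.0, 5.0], [6.0, 4.0]],
}
sample = [[0.1, 0.0], [0.9, 1.2]]

distortions = {name: average_distortion(sample, cb) for name, cb in codebooks.items()}
best = min(distortions, key=distortions.get)
```

The sample frames sit near Speaker A's codewords, so A's average distortion is the smaller of the two and A is selected.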
23
HMM Example
  • Two models of "tomato"

Each word in the vocabulary is represented by its
phonemes. Each phoneme is modeled as an HMM. A
word model is constructed by combining the HMMs for
its phonemes.
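
Building a word model from phoneme HMMs can be sketched as below. This is a simplification under stated assumptions: each phoneme gets a single state (real systems commonly use several states per phoneme), the self-loop probability of 0.6 is arbitrary, emission distributions are omitted, and the two phoneme spellings for "tomato" are illustrative guesses rather than the ones shown on the slide.

```python
def concat_phoneme_hmms(phonemes, self_loop=0.6):
    """Left-to-right word HMM: one state per phoneme plus a final exit state."""
    n = len(phonemes)
    trans = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        trans[i][i] = self_loop            # stay in the same phoneme
        trans[i][i + 1] = 1.0 - self_loop  # move on to the next phoneme
    trans[n][n] = 1.0                      # absorbing exit state
    return phonemes + ["<end>"], trans

# Two hypothetical pronunciations of the same word, one word model each.
pronunciations = {
    "tomato_1": ["t", "ah", "m", "ey", "t", "ow"],
    "tomato_2": ["t", "ah", "m", "aa", "t", "ow"],
}
word_models = {name: concat_phoneme_hmms(p) for name, p in pronunciations.items()}
states, trans = word_models["tomato_1"]
```

The two word models share most of their states and differ only in the vowel of the second syllable, which is how a recognizer can accept either pronunciation.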
24
Gaussian Mixture Model (GMM)
Speech Recognition
(GMM) State Level
25
Gaussian Mixture Model (GMM)
Speaker Recognition
Speaker k



26
Limits
  • The best performing algorithms for
    text-independent speaker verification use
    Gaussian Mixture Models (GMM) (single state HMM)
  • The linguistic structure of the speech signal is
    not taken into account and all sounds are
    represented using a single model
  • The sequential information is ignored
  • There is a recent trend in using High-level
    features
  • Large Vocabulary Continuous Speech Recognition
    System
  • Good results for a small set of languages
  • Needs a huge amount of annotated speech data
    (an enormous amount of time and human effort)
  • Language and task dependent