ROBUST SPEECH RECOGNITION: Signal Processing for Speech Applications
1
ROBUST SPEECH RECOGNITION: Signal Processing for
Speech Applications
  • Richard Stern
  • Robust Speech Recognition Group
  • Carnegie Mellon University
  • Telephone: (412) 268-2535
  • Fax: (412) 268-3890
  • rms@cs.cmu.edu
  • http://www.cs.cmu.edu/rms
  • Short Course at UNAM
  • August 14-17, 2007

2
SIGNAL PROCESSING FOR SPEECH APPLICATIONS
  • Major speech technologies
  • Speech coding
  • Speech synthesis
  • Speech recognition

3
OVERVIEW OF SPEECH RECOGNITION
Speech features
Phoneme hypotheses
Decision making procedure
Feature extraction
  • Major functional components
  • Signal processing to extract features from speech
    waveforms
  • Comparison of features to pre-stored templates
  • Important design choices
  • Choice of features
  • Specific method of comparing features to stored
    templates

4
GOALS OF SPEECH REPRESENTATIONS
  • Capture important phonetic information in speech
  • Computational efficiency
  • Efficiency in storage requirements
  • Optimize generalization

5
GOALS OF THIS LECTURE
  • We will describe and explain how we accomplish
    feature extraction for automatic speech
    recognition.
  • Some specific topics
  • Sampling
  • Linear predictive coding (LPC)
  • LPC-derived cepstral coefficients (LPCC)
  • Mel-frequency cepstral coefficients (MFCC)
  • Some of the underlying mathematics
  • Continuous-time Fourier transform (CTFT)
  • Discrete-time Fourier transform (DTFT)
  • Z-transforms

6
OUTLINE OF PRESENTATION
  • Introduction
  • Why perform signal processing?
  • The source-filter model of speech production
  • Sampling of continuous-time signals
  • Digital filtering of signals
  • Frequency representations in continuous and
    discrete time
  • Feature extraction for speech recognition

7
WHY PERFORM SIGNAL PROCESSING?
  • A look at the time-domain waveform of the word "six"

It's hard to infer much from the time-domain
waveform
8
WHY PERFORM SIGNAL PROCESSING IN THE FREQUENCY
DOMAIN?
  • Human hearing is based on frequency analysis
  • Use of frequency analysis often simplifies signal
    processing
  • Use of frequency analysis often facilitates
    understanding

9
THE SOURCE-FILTER MODEL FOR SPEECH
  • A useful model for representing the generation of
    speech sounds

10
Speech coding: separating the vocal-tract
excitation and filter
  • Original speech
  • Speech with 75-Hz excitation
  • Speech with 150-Hz excitation
  • Speech with noise excitation

11
SAMPLING CONTINUOUS-TIME SIGNALS
  • Original speech waveform and its samples

12
THE SAMPLING THEOREM
  • Nyquist theorem: If the signal is sampled at a
    frequency that is at least twice the maximum
    frequency of the incoming speech, we can recover
    the original waveform by lowpass filtering. With
    lower sampling frequencies, aliasing will occur,
    which produces distortion from which the original
    signal cannot be recovered.
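The theorem can be illustrated numerically. A minimal sketch (not from the slides; the rates are chosen for illustration): a 7-kHz tone sampled at 10 kHz violates the Nyquist criterion and becomes indistinguishable from a 3-kHz tone.

```python
import numpy as np

fs = 10000.0       # sampling rate in Hz; Nyquist frequency is fs/2 = 5 kHz
n = np.arange(64)  # sample indices

# A 7 kHz tone undersampled at 10 kHz aliases down to fs - 7 kHz = 3 kHz.
tone_7k = np.cos(2 * np.pi * 7000.0 * n / fs)
tone_3k = np.cos(2 * np.pi * 3000.0 * n / fs)

# Once sampled, the two tones produce identical sequences, so the original
# 7 kHz component cannot be recovered -- this is aliasing distortion.
aliased = np.allclose(tone_7k, tone_3k)
```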

(Block diagram: speech wave × sampling pulse train → lowpass filter → recovered speech wave)
13
EFFECTS OF ALIASING
  • Undersampling at 10 kHz

14
Digital filtering of signals
  • Filter 1: y[n] = 3.6 y[n-1] - 5.0 y[n-2] + 3.2 y[n-3] - 0.82 y[n-4] + 0.013 x[n] + 0.032 x[n-1] + 0.044 x[n-2] + 0.033 x[n-3] + 0.013 x[n-4]
  • Filter 2: y[n] = 2.7 y[n-1] - 3.3 y[n-2] + 2.0 y[n-3] - 0.57 y[n-4] + 0.35 x[n] - 1.3 x[n-1] + 2.0 x[n-2] - 1.3 x[n-3] + 0.35 x[n-4]
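A difference equation like Filter 1 can be run directly. A minimal sketch using NumPy/SciPy, assuming Filter 1 coefficients b = [0.013, 0.032, 0.044, 0.033, 0.013] and a = [1, -3.6, 5.0, -3.2, 0.82] (the signs are a reconstruction; they do not survive the slide extraction cleanly). The hand-rolled loop shows what `scipy.signal.lfilter` computes:

```python
import numpy as np
from scipy.signal import lfilter

# Filter 1 in standard lfilter form:
#   a[0] y[n] = sum_k b[k] x[n-k] - sum_{k>=1} a[k] y[n-k]
# NOTE: coefficient signs are an assumption reconstructed from the slide.
b1 = [0.013, 0.032, 0.044, 0.033, 0.013]  # feedforward (x) coefficients
a1 = [1.0, -3.6, 5.0, -3.2, 0.82]         # feedback (y) coefficients

def difference_eq(b, a, x):
    """Direct-form evaluation of the recursive difference equation."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y[n] = acc / a[0]
    return y

x = np.random.default_rng(0).standard_normal(32)
y_loop = difference_eq(b1, a1, x)
y_scipy = lfilter(b1, a1, x)   # the library computes the same recursion
```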
15
Filter 1 in the time domain
16
Output of Filter 1 in the frequency domain (original vs. lowpass)
17
Filter 2 in the time domain
18
Output of Filter 2 in the frequency domain (original vs. highpass)
19
OUTLINE OF PRESENTATION
  • Introduction
  • Frequency representations in continuous and
    discrete time
  • Continuous-time Fourier series (CTFS)
  • Continuous-time Fourier transform (CTFT)
  • Discrete-time Fourier transform (DTFT)
  • Short-time Fourier analysis
  • Effects of window size and shape
  • Z-transforms
  • Feature extraction for speech recognition

20
Frequency Representation: The Continuous-Time
Fourier Series (CTFS)
  • If x(t) is periodic with period T, then x(t) = Σ_k X_k e^(jk(2π/T)t)
  • where X_k = (1/T) ∫_T x(t) e^(-jk(2π/T)t) dt

21
Frequency Representation: The Continuous-Time
Fourier Series (CTFS)
  • Comments
  • The coefficients X_k are complex
  • Alternate representation
  • where

22
Example of Fourier series synthesis: Building a
square wave from a sum of cosines
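The synthesis can be reproduced numerically. A short illustrative sketch (the 1-Hz fundamental is an arbitrary choice): an even square wave is built from odd cosine harmonics with amplitudes (4/π)(-1)^i/(2i+1).

```python
import numpy as np

def square_partial_sum(t, n_terms, f0=1.0):
    """Partial Fourier series of an even square wave: odd cosine harmonics."""
    x = np.zeros_like(t)
    for i in range(n_terms):
        k = 2 * i + 1  # only odd harmonics appear in a square wave
        x += (4.0 / np.pi) * (-1) ** i * np.cos(2.0 * np.pi * k * f0 * t) / k
    return x

# Evaluate at the centres of the +1 and -1 half-cycles: as more terms are
# added, the partial sum converges to the square wave's values +1 and -1.
t = np.array([0.0, 0.5])
x = square_partial_sum(t, 200)
```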
23
Frequency Representation: The Continuous-Time
Fourier Transform (CTFT)
  • X(jΩ) = ∫ x(t) e^(-jΩt) dt
  • where x(t) = (1/2π) ∫ X(jΩ) e^(jΩt) dΩ

24
Frequency Representation: The Discrete-Time
Fourier Transform (DTFT)
  • X(e^(jω)) = Σ_n x[n] e^(-jωn)
Comment: DTFTs are always periodic in frequency,
so typically only frequencies from -π to π are
used.
25
EXAMPLES OF CTFTs and DTFTs
  • Continuous-time decaying exponential

26
EXAMPLES OF CTFTs and DTFTs
  • Discrete-time decaying exponential
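The discrete-time example has a simple closed form: for x[n] = aⁿu[n] with |a| < 1, the DTFT is X(e^(jω)) = 1/(1 - a e^(-jω)). A quick numerical check (a = 0.8 is an illustrative value, not necessarily the slide's):

```python
import numpy as np

a = 0.8                              # decay factor, |a| < 1 so the DTFT exists
w = np.linspace(-np.pi, np.pi, 201)  # frequencies across one period

# Truncated DTFT sum: a^n decays fast enough that 200 terms suffice.
n = np.arange(200)
X_sum = (a ** n) @ np.exp(-1j * np.outer(n, w))

# Closed form for the decaying exponential.
X_closed = 1.0 / (1.0 - a * np.exp(-1j * w))
```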

27
CORRESPONDENCE BETWEEN FREQUENCY IN DISCRETE AND
CONTINUOUS TIME
  • Suppose that a continuous-time signal is sampled
    at a rate greater than the Nyquist rate, with a
    time between samples of T
  • Let Ω represent continuous-time frequency in
    radians/sec
  • Let ω represent discrete-time frequency in
    radians
  • Then ω = ΩT
  • Comment: The maximum discrete-time frequency, ω =
    π, corresponds to the Nyquist frequency in
    continuous time, half the sampling rate
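The correspondence can be checked directly. A small sketch, assuming a 16-kHz sampling rate and a 4-kHz tone (so ω = ΩT = π/2, half the maximum discrete-time frequency):

```python
import numpy as np

fs = 16000.0                   # assumed sampling rate (Hz)
T = 1.0 / fs                   # time between samples (s)
Omega = 2.0 * np.pi * 4000.0   # continuous-time frequency (rad/s)

omega = Omega * T              # discrete-time frequency (rad): pi/2 here
n = np.arange(32)
sampled = np.cos(Omega * (n * T))   # sample the continuous-time cosine
discrete = np.cos(omega * n)        # the equivalent discrete-time cosine
```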

28
SHORT-TIME FOURIER ANALYSIS
  • Problem: Conventional Fourier analysis does not
    capture the time-varying nature of speech signals
  • Solution: Multiply signals by a finite-duration
    window function, then compute the DTFT
  • Side effect: Windowing causes spectral blurring

29
TWO POPULAR WINDOW SHAPES
  • Rectangular window: w[n] = 1, 0 ≤ n ≤ N-1
  • Hamming window: w[n] = 0.54 - 0.46 cos(2πn/(N-1)), 0 ≤ n ≤ N-1
  • Comment: The Hamming window is frequently
    preferred because of its frequency response
30
TIME AND FREQUENCY RESPONSE OF WINDOWS
31
EFFECT OF WINDOW DURATION
  • Short-duration window vs. long-duration window

32
BREAKING UP THE INCOMING SPEECH INTO FRAMES USING
HAMMING WINDOWS
33
Generalizing the DTFT: The Z-transform
  • Discrete-Time Fourier Transform (DTFT): X(e^(jω)) = Σ_n x[n] e^(-jωn)
  • Z-transform: X(z) = Σ_n x[n] z^(-n)
  • where z = r e^(jω); the DTFT is the Z-transform
    evaluated on the unit circle (r = 1)

34
WHY DO WE USE Z-TRANSFORMS?
  • Z-Transforms enable us to relate difference
    equations to frequency response of linear systems
  • Z-transforms facilitate design of discrete-time
    systems
  • Z-transforms provide insight into linear
    prediction

35
CHARACTERIZING LINEAR SHIFT-INVARIANT SYSTEMS BY
DIFFERENCE EQUATIONS
  • A simple example ...
  • Difference equation characterizing system
  • Or

(x[n] → LSI system with impulse response h[n] → y[n])
36
TIME-DOMAIN AND FREQUENCY-DOMAIN CHARACTERIZATION
OF SYSTEMS
  • Let h[n] be the response of a system to the unit
    impulse function and let H(z) be the Z-transform
    of h[n]
  • Then y[n] = x[n] * h[n] (convolution)
  • and Y(z) = X(z) H(z)

37
RELATING Z-TRANSFORMS TO DIFFERENCE EQUATIONS
  • Z-transform delay property:
  • if x[n] ↔ X(z), then x[n-k] ↔ z^(-k) X(z)

38
RELATING Z-TRANSFORMS TO DIFFERENCE EQUATIONS
  • Difference equation characterizing system
  • Z-transform characterizing system

39
POLES AND ZEROS
  • Poles and zeros are the roots of the denominator
    and numerator of LSI systems
  • Zeros of system are at z = 0, z = 1
  • Poles of system are at z = 0.9e^(jπ/4), z = 0.9e^(-jπ/4)

40
MAGNITUDE OF THE DTFT FROM POLE AND ZERO LOCATIONS
  • To evaluate the magnitude of the DTFT, consider
    the locus of points in the z-plane corresponding
    to the unit circle.
  • For each location on the unit circle, the
    magnitude of the DTFT is proportional to the
    product of the distances from the zeros divided
    by the product of the distances from the poles
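This geometric rule is easy to verify numerically. A sketch using the pole and zero locations from the example slide (zeros at 0 and 1, poles at 0.9e^(±jπ/4), as reconstructed from the slide text), comparing the product-of-distances rule against direct evaluation of the rational system function:

```python
import numpy as np

# Pole/zero locations as reconstructed from the slide's example.
zeros = np.array([0.0, 1.0])
poles = 0.9 * np.exp(1j * np.array([np.pi / 4, -np.pi / 4]))

w = np.linspace(0.0, np.pi, 256)
z = np.exp(1j * w)             # the locus of points on the unit circle

# Product of distances to the zeros over product of distances to the poles.
H_geom = (np.prod(np.abs(z[:, None] - zeros[None, :]), axis=1)
          / np.prod(np.abs(z[:, None] - poles[None, :]), axis=1))

# Direct evaluation of |H(z)| with monic numerator and denominator.
num = np.polyval(np.poly(zeros), z)
den = np.polyval(np.poly(poles), z)
H_direct = np.abs(num / den)
```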

41
Inferring the Magnitude of the DTFT from the
Z-transform
  • The magnitude of the DTFT is obtained from the
    magnitude of H(z) as we walk around the unit
    circle of the z-plane

42
SUMMARY OF Z-TRANSFORM DISCUSSION
  • Z-transforms are a generalization of DTFTs
  • Difference equations can be easily obtained from
    Z-transforms
  • Locations of poles and zeros in z-plane provide
    insight about frequency response

43
OUTLINE OF PRESENTATION
  • Introduction
  • Frequency representations in continuous and
    discrete time
  • Feature extraction for speech recognition
  • Linear prediction (LPC)
  • Representations based on cepstral coefficients
  • LPCC - Linear prediction-based cepstral
    coefficients
  • MFCC - Mel frequency cepstral coefficients
  • Perceptual linear prediction (PLP)

44
LINEAR PREDICTION OF SPEECH
  • Find the best all-pole approximation to the
    DTFT of a segment of speech
  • All-pole model is reasonable for most speech
  • Very efficient in terms of data storage
  • Coefficients a_k can be computed efficiently
  • Phase information not preserved (not a problem
    for us)
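A minimal sketch of the classic autocorrelation method, using the Levinson-Durbin recursion named later in the deck. The AR(2) test signal and its coefficients are invented for illustration, not taken from the slides:

```python
import numpy as np
from scipy.signal import lfilter

def levinson_durbin(r, order):
    """Solve the LPC normal equations given autocorrelations r[0..order]."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]                                 # prediction-error power
    for i in range(1, order + 1):
        k = -np.dot(a[:i], r[i:0:-1]) / err    # reflection coefficient
        a[:i + 1] += k * a[i::-1]              # order-update of A(z)
        err *= 1.0 - k * k
    return a, err

# Synthesize an all-pole (AR) signal with known coefficients ...
true_a = np.array([1.0, -0.75, 0.5])           # A(z) = 1 - 0.75 z^-1 + 0.5 z^-2
x = lfilter([1.0], true_a, np.random.default_rng(0).standard_normal(200000))

# ... then recover them from the autocorrelation sequence.
N = len(x)
r = np.array([np.dot(x[:N - k], x[k:]) for k in range(3)]) / N
a, err = levinson_durbin(r, 2)    # a should be close to true_a
```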

45
LINEAR PREDICTION EXAMPLE
  • Spectra from the /ih/ in "six"
  • Comment: The LPC spectrum follows the spectral peaks well

46
FEATURES FOR SPEECH RECOGNITION CEPSTRAL
COEFFICIENTS
  • The cepstrum is the inverse transform of the log
    of the magnitude of the spectrum
  • Useful for separating convolved signals (like the
    source and filter in the speech production model)
  • Can be thought of as the Fourier series expansion
    of the magnitude of the Fourier transform
  • Generally provides more efficient and robust
    coding of speech information than LPC
    coefficients
  • Most common basic feature for speech recognition
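A sketch of the real cepstrum and of the separation property described above: because convolution multiplies spectra, the log magnitudes, and hence the cepstra, add. The random signals here are illustrative stand-ins for source and filter:

```python
import numpy as np

def real_cepstrum(x):
    """Inverse transform of the log of the magnitude of the spectrum."""
    return np.fft.ifft(np.log(np.abs(np.fft.fft(x)))).real

rng = np.random.default_rng(1)
source = rng.standard_normal(64)   # stand-in for the excitation
filt = rng.standard_normal(64)     # stand-in for the vocal-tract filter

# Circular convolution via the DFT: the spectra multiply ...
speech = np.fft.ifft(np.fft.fft(source) * np.fft.fft(filt)).real

# ... so the cepstrum of the convolved signal is the sum of the cepstra.
c_sum = real_cepstrum(source) + real_cepstrum(filt)
c_speech = real_cepstrum(speech)
```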

47
TWO WAYS OF DERIVING CEPSTRAL COEFFICIENTS
  • LPC-derived cepstral coefficients (LPCC)
  • Compute traditional LPC coefficients
  • Convert to cepstra using linear transformation
  • Warp cepstra using bilinear transform
  • Mel-frequency cepstral coefficients (MFCC)
  • Compute log magnitude of windowed signal
  • Multiply by triangular Mel weighting functions
  • Compute inverse discrete cosine transform

48
COMPUTING CEPSTRAL COEFFICIENTS
  • Comments
  • MFCC is currently the most popular
    representation.
  • Typical systems include a combination of
  • MFCC coefficients
  • Delta MFCC coefficients
  • Delta delta MFCC coefficients
  • Power and delta power coefficients

49
COMPUTING LPC CEPSTRAL COEFFICIENTS
  • Procedure used in SPHINX-I
  • A/D conversion at 16-kHz sampling rate
  • Apply Hamming window, duration 320 samples (20
    msec) with 50% overlap (100-Hz frame rate)
  • Pre-emphasize to boost high-frequency components
  • Compute first 14 auto-correlation coefficients
  • Perform Levinson-Durbin recursion to obtain 14
    LPC coefficients
  • Convert LPC coefficients to cepstral coefficients
  • Perform frequency warping to spread low
    frequencies
  • Apply vector quantization to generate three
    codebooks

50
An example: the vowel in "welcome"
  • The original time function

51
THE TIME FUNCTION AFTER WINDOWING
52
THE RAW SPECTRUM
53
PRE-EMPHASIZING THE SIGNAL
  • Typical pre-emphasis filter: y[n] = x[n] - a x[n-1], with a slightly less than 1
  • Its frequency response boosts the high-frequency components
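A sketch of the pre-emphasis step, assuming the commonly used coefficient 0.97 (the exact value is not given on the slide):

```python
import numpy as np

alpha = 0.97  # assumed pre-emphasis coefficient; values near 1 are typical

def preemphasize(x, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1]: a one-zero highpass filter."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

# Magnitude response |H(e^{jw})| = |1 - alpha e^{-jw}| at DC, mid-band, Nyquist.
w = np.array([0.0, np.pi / 2.0, np.pi])
H = np.abs(1.0 - alpha * np.exp(-1j * w))
# Gain grows from 1 - alpha = 0.03 at DC to 1 + alpha = 1.97 at Nyquist,
# boosting the high-frequency components of the speech.
```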

54
THE SPECTRUM OF THE PRE-EMPHASIZED SIGNAL
55
THE LPC SPECTRUM
56
THE TRANSFORM OF THE CEPSTRAL COEFFICIENTS
57
THE BIG PICTURE: THE ORIGINAL SPECTROGRAM
58
EFFECTS OF LPC PROCESSING
59
COMPARING REPRESENTATIONS
  • Original speech vs. LPCC vs. cepstra (unwarped)

60
COMPUTING MEL FREQUENCY CEPSTRAL COEFFICIENTS
  • Segment incoming waveform into frames
  • Compute frequency response for each frame using
    DFTs
  • Group magnitude of frequency response into 25-40
    channels using triangular weighting functions
  • Compute log of weighted magnitudes for each
    channel
  • Take inverse DCT/DFT of the log channel outputs,
    producing 14 cepstral coefficients for each frame
  • Calculate delta and double-delta coefficients
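The steps above can be sketched for a single frame. This is an illustrative implementation, not CMU's: the 26-filter bank, the 320-sample frame, and the synthetic tone are assumptions for the example, and the delta computation is omitted.

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular weighting functions, evenly spaced on the mel scale."""
    mel_points = np.linspace(0.0, hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        for j in range(lo, c):
            fb[i - 1, j] = (j - lo) / max(c - lo, 1)   # rising edge
        for j in range(c, hi):
            fb[i - 1, j] = (hi - j) / max(hi - c, 1)   # falling edge
    return fb

def mfcc_frame(frame, fs, n_filters=26, n_ceps=14):
    mag = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))  # windowed DFT
    fb = mel_filterbank(n_filters, len(frame), fs)
    log_energies = np.log(fb @ mag + 1e-10)    # log of mel channel outputs
    return dct(log_energies, type=2, norm='ortho')[:n_ceps]   # inverse-DCT step

fs = 16000
t = np.arange(320) / fs                 # one 20-ms frame at 16 kHz
frame = np.sin(2.0 * np.pi * 440.0 * t) # a synthetic tone as the input
coeffs = mfcc_frame(frame, fs)          # 14 cepstral coefficients
```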

61
AN EXAMPLE: DERIVING MFCC COEFFICIENTS
62
WEIGHTING THE FREQUENCY RESPONSE
63
THE ACTUAL MEL WEIGHTING FUNCTIONS
64
THE LOG ENERGIES OF THE MEL FILTER OUTPUTS
65
THE CEPSTRAL COEFFICIENTS
66
LOG SPECTRA RECOVERED FROM CEPSTRA
67
COMPARING SPECTRAL REPRESENTATIONS
  • Original speech vs. mel log magnitudes vs. log spectra recovered from cepstra

68
SUMMARY
  • We outlined some of the relevant issues
    associated with the representation of speech for
    automatic recognition
  • The source-filter model of speech production
  • Sampling
  • Windowing
  • The Discrete-Time Fourier Transform (DTFT)
  • Concepts of frequency in continuous time and
    discrete time
  • The Z-transform
  • Linear prediction/linear predictive coding (LPC)
  • Mel frequency cepstral coefficients (MFCC)
