Voice DSP Processing II

Transcript and Presenter's Notes
1
Voice DSP Processing II
  • Yaakov J. Stein
  • Chief Scientist, RAD Data Communications

2
Voice DSP
  • Part 1 Speech biology and what we can learn from
    it
  • Part 2 Speech DSP (AGC, VAD, features, echo
    cancellation)
  • Part 3 Speech compression techniques
  • Part 4 Speech Recognition

3
Voice DSP - Part 2
  • Simplest processing
  • Gain
  • AGC
  • VAD
  • More complex processing
  • pitch tracking
  • U/V decision
  • computing LPC
  • other features
  • Echo Cancellation
  • Sources of echo
  • Echo suppression
  • Echo cancellation
  • Adaptive noise cancellation
  • The LMS algorithm
  • Other adaptive algorithms
  • The standard LEC

4
Voice DSP - Part 2a
  • Simplest
  • voice
  • DSP

5
Gain (volume) Control
  • In analog processing (electronics) gain requires
    an amplifier
  • Great care must be taken to ensure linearity!
  • In digital processing (DSP) gain requires only
    multiplication
  • y = G x
  • Need enough bits!

6
Automatic Gain Control (AGC)
  • Can we set the gain automatically?
  • Yes, based on the signal's energy!
  • E = ∫ x²(t) dt ≈ Σn xn²
  • All we have to do is apply gain until we attain the
    desired energy
  • Assume we want the energy to be Y
  • Then
  • y = √(Y/E) x = G x
  • has exactly this energy
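As a concrete illustration, the gain computation above can be sketched in a few lines of Python (the function name and list-based interface are illustrative, not from the slides):

```python
import math

def apply_agc(x, target_energy):
    """Scale a block of samples so its energy equals target_energy.

    Minimal sketch of the slide's G = sqrt(Y/E); names are illustrative.
    """
    energy = sum(s * s for s in x)            # E = sum of squared samples
    if energy == 0.0:
        return list(x)                        # silent block: leave unchanged
    gain = math.sqrt(target_energy / energy)  # G = sqrt(Y / E)
    return [gain * s for s in x]
```

For example, the block [3, 4] has energy 25, so a target energy of 100 gives gain 2.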

7
AGC - cont.
  • What if the input isn't stationary (gets stronger
    and weaker over time)?
  • The energy is defined for all times -∞ < t < ∞
  • so it can't help!
  • So we define energy in a window E(t)
  • and continuously vary the gain G(t)
  • This is Adaptive Gain Control
  • We don't want the gain to jump from window to window
  • so we smooth the instantaneous gain
  • G(t) = a G(t) + (1-a) √(Y/E(t))
  • IIR filter
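A windowed version with the smoothed gain might look like this (the window length, initial gain, and the guard for silent windows are assumptions of this sketch):

```python
import math

def adaptive_gain(signal, target_energy, window=160, a=0.9):
    """Per-window AGC with one-pole (IIR) smoothing of the gain:
    G(t) = a*G(t) + (1-a)*sqrt(Y/E(t)).  Sketch only."""
    out, g = [], 1.0
    for start in range(0, len(signal), window):
        block = signal[start:start + window]
        e = sum(s * s for s in block)   # windowed energy E(t)
        if e == 0.0:
            e = 1e-12                   # avoid division by zero in silence
        g = a * g + (1.0 - a) * math.sqrt(target_energy / e)
        out.extend(g * s for s in block)
    return out
```

Setting a = 0 disables the smoothing and reverts to the instantaneous per-window gain.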

8
AGC - cont.
  • The a coefficient determines how fast G(t) can
    change
  • In more complex implementations we may separately
    control
  • integration time, attack time, release time
  • What is involved in the computation of G(t) ?
  • Squaring of input value
  • Accumulation
  • Square root (or Pythagorean sum)
  • Inversion (division)
  • Square root and inversion are hard for a DSP
    processor
  • but algorithmic improvements are possible (and
    often needed)

9
Simple VAD
  • Sometimes it is useful to know whether someone is
    talking (or not)
  • Save bandwidth
  • Suppress echo
  • Segment utterances
  • We might be able to get away with energy VOX
  • Normally need Noise Riding Threshold / Signal
    Riding Threshold
  • However, there are problems with energy VOX
  • since it doesn't differentiate between speech and
    noise
  • What we really want is a speech-specific activity
    detector
  • Voice Activity Detector

10
Simple VAD - cont.
  • VADs operate by recognizing that speech is
    different from noise
  • Speech is low-pass while noise is white
  • Speech is mostly voiced and so has pitch in a
    given range
  • Average noise amplitude is relatively constant
  • A simple VAD may use
  • zero crossings
  • zero crossing derivative
  • spectral tilt filter
  • energy contours
  • combinations of the above
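A toy frame classifier combining two of these cues might look as follows (the thresholds and names are assumptions of this sketch; a real VAD would adapt them, e.g. with noise/signal riding thresholds):

```python
def simple_vad(frame, energy_thresh, zcr_thresh):
    """Declare speech when frame energy is high and the zero-crossing
    rate is low (voiced speech is low-pass; white noise crosses often)."""
    energy = sum(s * s for s in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
    zcr = crossings / (len(frame) - 1)
    return energy > energy_thresh and zcr < zcr_thresh
```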

11
Other simple processes
  • Simple = not significantly dependent on details
    of speech signal
  • Speed change of recorded signal
  • Speed change with pitch compensation
  • Pitch change with speed compensation
  • Sample rate conversion
  • Tone generation
  • Tone detection
  • Dual tone generation
  • Dual tone detection (need high reliability)

12
Voice DSP - Part 2b
  • Complex
  • voice
  • DSP

13
Correlation
  • One major difference between simple and complex
    processing
  • is the computation of correlations (related to
    LPC model)
  • Correlation is a measure of similarity
  • Shouldn't we use squared difference to measure
    similarity?
  • D² = ⟨ (x(t) − y(t))² ⟩
  • No, since squared difference is sensitive to
  • gain
  • time shifts

14
Correlation - cont.
  • D² = ⟨(x(t) − y(t))²⟩ = ⟨x²⟩ + ⟨y²⟩ − 2⟨x(t) y(t)⟩
  • So when D² is minimal, C(0) = ⟨x(t) y(t)⟩ is
    maximal
  • and arbitrary gains don't change this
  • To take time shifts into account
  • C(τ) = ⟨x(t) y(t+τ)⟩
  • and look for the maximal τ!
  • We can even find out how much a signal resembles
    itself

15
Autocorrelation
  • Crosscorrelation: Cxy(τ) = ⟨x(t) y(t+τ)⟩
  • Autocorrelation: Cx(τ) = ⟨x(t) x(t+τ)⟩
  • Cx(0) is the energy!
  • Autocorrelation helps find hidden periodicities!
  • Much stronger than looking in the time
    representation
  • Wiener-Khintchine theorem
  • Autocorrelation C(τ) and power spectrum S(f) are an
    FT pair
  • So autocorrelation contains the same information
    as the power spectrum
  • and can itself be computed by FFT
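A direct O(N·L) computation of the lags can be sketched as below (for long windows one would instead use the FFT route just mentioned):

```python
def autocorr(x, max_lag):
    """Autocorrelation C(tau) = sum_t x[t] * x[t + tau]
    for tau = 0..max_lag; C(0) is the energy."""
    n = len(x)
    return [sum(x[t] * x[t + lag] for t in range(n - lag))
            for lag in range(max_lag + 1)]
```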

16
Pitch tracking
  • How can we measure (and track) the pitch?
  • We can look for it in the spectrum
  • but it may be very weak
  • may not even be there (filtered out)
  • need high resolution spectral estimation
  • Correlation based methods
  • The pitch periodicity should be seen in the
    autocorrelation!
  • Sometimes computationally simpler is the
  • Absolute Magnitude Difference Function
  • ⟨ |x(t) − x(t+τ)| ⟩

17
Pitch tracking - cont.
  • Sondhi's algorithm for autocorrelation-based
    pitch tracking
  • obtain window of speech
  • determine if the segment is voiced (see U/V
    decision below)
  • low-pass filter and center-clip
  • to reduce formant
    induced correlations
  • compute autocorrelation lags corresponding to
    valid pitch intervals
  • find lag with maximum correlation OR
  • find lag with maximal accumulated correlation in
    all multiples
  • Post processing
  • Pitch trackers rarely make small errors (usually
    double pitch)
  • So correct outliers based on neighboring values
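The central correlation step of such a tracker can be sketched as follows (this omits the voicing decision, low-pass filtering, center-clipping, the accumulated-multiples variant, and post-processing; fmin/fmax are assumed bounds on the valid pitch range):

```python
def pitch_by_autocorr(frame, fs, fmin=50.0, fmax=400.0):
    """Return the pitch (Hz) whose lag maximizes the autocorrelation
    over the valid pitch interval.  Sketch only."""
    lo, hi = int(fs / fmax), int(fs / fmin)   # lag range for valid pitch
    best_lag, best_c = lo, float("-inf")
    for lag in range(lo, min(hi, len(frame) - 1) + 1):
        c = sum(frame[t] * frame[t + lag] for t in range(len(frame) - lag))
        if c > best_c:
            best_c, best_lag = c, lag
    return fs / best_lag
```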

18
Other Pitch Trackers
  • Miller's data-reduction and Gold and Rabiner's
    parallel processing methods
  • Zero-crossings, energy, extrema of waveform
  • Noll's cepstrum based pitch tracker
  • Since the pitch and formant contributions are
    separated in cepstral domain
  • Most accurate for clean speech, but not robust in
    noise
  • Methods based on LPC error signal
  • LPC technique breaks down at pitch pulse onset
  • Find periodicity of error by autocorrelation
  • Inverse filtering method
  • Remove formant filtering by low-order LPC
    analysis
  • Find periodicity of excitation by autocorrelation
  • Sondhi-like methods are the best for noisy speech

19
U/V decision
  • Between VAD and pitch tracking
  • Simplest U/V decision is based on energy and zero
    crossings
  • More complex methods are combined with pitch
    tracking
  • Methods based on pattern recognition
  • Is voicing well defined?
  • Degree of voicing (buzz)
  • Voicing per frequency band (interference)
  • Degree of voicing per frequency band

20
LPC Coefficients
  • How do we find the vocal tract filter
    coefficients?
  • System identification problem
  • All-pole (AR) filter
  • Connection to prediction
  • sn = G en + Σm am sn-m
  • Can find G from the energy (so let's ignore it)

Unknown filter
known input
known output
21
LPC Coefficients
  • For simplicity let's assume three a coefficients
  • sn = en + a1 sn-1 + a2 sn-2 + a3 sn-3
  • Need three equations!
  • sn   = en   + a1 sn-1 + a2 sn-2 + a3 sn-3
  • sn+1 = en+1 + a1 sn   + a2 sn-1 + a3 sn-2
  • sn+2 = en+2 + a1 sn+1 + a2 sn   + a3 sn-1
  • In matrix form
  •   | sn   |   | en   |   | sn-1 sn-2 sn-3 | | a1 |
  •   | sn+1 | = | en+1 | + | sn   sn-1 sn-2 | | a2 |
  •   | sn+2 |   | en+2 |   | sn+1 sn   sn-1 | | a3 |
  • s = e + S a

22
LPC Coefficients - cont.
  • s = e + S a
  • so by simple algebra
  • a = S⁻¹ (s − e)
  • and we have reduced the problem to matrix
    inversion
  • Toeplitz matrix so the inversion is easy
    (Levinson-Durbin algorithm)
  • Unfortunately noise makes this attempt break
    down!
  • Move to next time and the answer will be
    different.
  • Need to somehow average the answers
  • The proper averaging is before the equation
    solving
  • correlation vs autocovariance

23
LPC Coefficients - cont.
  • Can't just average over time - all equations
    would be the same!
  • Let's take the input to be zero
  • sn = Σm am sn-m
  • multiply by sn-q and sum over n
  • Σn sn sn-q = Σm am Σn sn-m sn-q
  • we recognize the autocorrelations
  • Cs(q) = Σm Cs(m−q) am
  • Yule-Walker equations
  • autocorrelation method: sn outside the window are
    zero (Toeplitz)
  • autocovariance method: use all needed sn (no
    window)
  • Also - pre-emphasis!
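Because the autocorrelation-method system is Toeplitz, it can be solved by the Levinson-Durbin recursion mentioned above. A sketch, using the common convention A(z) = 1 + a[1] z⁻¹ + …, so the predictor coefficients in the slides' notation are the negatives of a[1..p]:

```python
def levinson_durbin(r, order):
    """Solve the Toeplitz Yule-Walker system by Levinson-Durbin.
    r[0..order] are autocorrelations; returns (a, err) where a is the
    inverse-filter polynomial [1, a1, ..., ap] and err the residual
    prediction error energy."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                    # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k                # error shrinks each order
    return a, err
```

For an AR(1) signal with r[k] = 0.5ᵏ the recursion recovers a single predictor coefficient 0.5 (a[1] = −0.5) and the higher-order coefficients vanish.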

24
Alternative features
  • The a coefficients aren't the only set of
    features
  • Reflection coefficients (cylinder model)
  • log-area coefficients (cylinder model)
  • pole locations
  • LPC cepstrum coefficients
  • Line Spectral Pair frequencies
  • All theoretically contain the same information
    (algebraic transformations)
  • Euclidean distance in LPC cepstrum space ≈
    Itakura-Saito measure
  • so these are popular in speech recognition
  • LPC (a) coefficients don't quantize or
    interpolate well
  • so these aren't good for speech compression
  • LSP frequencies are best for compression

25
LSP coefficients
  • a coefficients are not statistically equally
    weighted
  • pole positions are better (geometric)
  • but radius is sensitive near unit circle
  • Is there an all-angle representation?
  • Theorem 1 Every real polynomial with all roots
    on the unit circle
  • is palindromic (e.g. 1 + 2t + t²) or
    antipalindromic (e.g. t + t² − t³)
  • Theorem 2 Every polynomial can be written as the
    sum of
  • palindromic and antipalindromic polynomials
  • Consequence Every polynomial can be represented
    by roots
  • on the unit circle, that is, by angles

26
Voice DSP - Part 2c
  • Echo
  • Cancellation

27
Acoustic Echo
28
Line echo
[Diagram: Telephone 1 - hybrid - hybrid - Telephone 2]
29
Echo suppressor

In practice we need more: VOX, over-ride, reset, etc.
30
Why not echo suppression?
  • Echo suppression makes conversation half duplex
  • Waste of full-duplex infrastructure
  • Conversation unnatural
  • Hard to break in
  • Dead sounding line
  • It would be better to cancel the echo
  • subtract the echo signal allowing desired signal
    through
  • but that requires DSP.

31
Echo cancellation?
  • Unfortunately, it's not so easy
  • Outgoing signal is delayed, attenuated, distorted
  • Two echo canceller architectures
  • MODEM TYPE
  • LINE ECHO CANCELLER (LEC)

[Two block diagrams: the modem-type and LEC architectures, each showing near end, echo path, far end, and the subtraction that yields the clean signal]
32
LEC architecture

[Block diagram: hybrid, A/D and D/A converters, adaptive filter H driven by far-end signal X, echo estimate subtracted from near-end signal Y, doubletalk detector gating adaptation, NLP on the residual]
33
Adaptive Algorithms
  • How do we
  • find the echo cancelling filter?
  • keep it correct even if the echo path parameters
    change?
  • Need an algorithm that continually changes the
    filter parameters
  • All adaptive algorithms are based on the same
    ideas
  • (lack of correlation between desired signal and
    interference)
  • Let's start with a simpler case - adaptive noise
    cancellation

34
Noise cancellation
[Block diagram: noise n reaches the listener through unknown gain h, added to signal x; the canceller subtracts e n to produce the output y]
35
Noise cancellation - cont.
  • Assume that the noise is distorted only by an
    unknown gain h
  • We correct by transmitting e n so that the audience
    hears
  • y = x + h n − e n = x + (h−e) n
  • the energy of this signal is
  • Ey = ⟨y²⟩ = ⟨x²⟩ + (h−e)² ⟨n²⟩ + 2 (h−e) ⟨x n⟩
  • Assume that Cxn = ⟨x n⟩ = 0
  • We need only set e to minimize Ey ! (turn knob
    until minimal)
  • Even if the distortion is a complete filter h
  • we set the ANC filter e to minimize Ey

36
The LMS algorithm
  • Gradient descent on energy
  • correction to H is proportional to error d times
    input X

H ← H + λ d X
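One iteration of this update can be sketched as follows (function and variable names are illustrative; h is the FIR estimate of the echo path, x_buf the most recent far-end samples, d the near-end sample, mu the step size λ):

```python
def lms_step(h, x_buf, d, mu):
    """One LMS iteration: estimate the echo as the dot product of h
    and x_buf, form the residual d - estimate, then nudge the filter
    in the gradient-descent direction H <- H + mu * error * X."""
    y = sum(hi * xi for hi, xi in zip(h, x_buf))  # estimated echo
    e = d - y                                     # residual error
    h = [hi + mu * e * xi for hi, xi in zip(h, x_buf)]
    return h, e
```

Run repeatedly on a stream, the filter converges toward the true echo path when the step size is small enough.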
37
Nonlinear processing
  • Because of finite numeric precision
  • the LEC (linear) filtering cannot completely
    remove echo
  • Standard LEC adds center clipping to remove
    residual echo
  • Clipping threshold needs to be properly set by
    adaptation

38
Doubletalk detection
  • Adaptation of H should take place only when far
    end speaks
  • So we freeze adaptation when there is no far-end
    speech or there is double-talk,
  • that is, whenever the near end speaks
  • Geigel algorithm compares absolute value of
    near-end speech
  • to half the maximum absolute value in X buffer
  • If near-end exceeds far-end can assume only
    near-end is speaking
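The comparison described above can be sketched in a few lines (names and the simplified buffer handling are assumptions of this sketch; threshold 0.5 is the "half" rule):

```python
def geigel_doubletalk(near_sample, far_buffer, threshold=0.5):
    """Geigel test: declare near-end speech (and freeze adaptation)
    when |near| exceeds threshold * max |far| over the X buffer."""
    far_peak = max(abs(x) for x in far_buffer)
    return abs(near_sample) > threshold * far_peak
```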