Voice DSP Processing I - PowerPoint PPT Presentation

About This Presentation

Title:

Voice DSP Processing I

Description:

DSP Processing I Yaakov J. Stein Chief Scientist RAD Data Communications Voice DSP Part 1 Speech biology and what we can learn from it Part 2 Speech DSP (AGC, VAD ... – PowerPoint PPT presentation

Slides: 54

Provided by: YJS5

Transcript and Presenter's Notes

Title: Voice DSP Processing I

1
Voice DSP Processing I

Yaakov J. Stein
Chief Scientist, RAD Data Communications

2
Voice DSP

Part 1 Speech biology and what we can learn from
it
Part 2 Speech DSP (AGC, VAD, features, echo
cancellation)
Part 3 Speech compression techniques
Part 4 Speech Recognition

3
Voice DSP - Part 1a

Speech production mechanisms
Biology of the vocal tract
Pitch and formants
Sonograms
The basic LPC model
The cepstrum
LPC cepstrum
Line spectral pairs

4
Voice DSP - Part 1b

Speech perception mechanisms
Biology of the ear
Psychophysical phenomena
Weber's law
Fechner's law
Changes
Masking

5
Voice DSP - Part 1c

Speech quality measurement
Subjective measurement
MOS and its variants
Objective measurement
PSQM, PESQ

6
Voice DSP - Part 2a

Basic speech processing
Simplest processing
AGC
Simplistic VAD
More complex processing
pitch tracking
formant tracking
U/V decision
computing LPC and other features

7
Voice DSP - Part 2b

Echo Cancellation
Sources of echo (acoustic vs. line echo)
Echo suppression and cancellation
Adaptive noise cancellation
The LMS algorithm
Other adaptive algorithms
The standard LEC

8
Voice DSP - Part 3

Speech compression techniques
PCM
ADPCM
SBC
VQ
ABS-CELP
MBE
MELP
STC
Waveform Interpolation

9
Voice DSP - Part 4

Speech Recognition tasks
ASR Engine
Phonetic labeling
DTW
HMM
State-of-the-Art

10
Voice DSP - Part 1a

Speech
production
mechanisms

11
Speech Production Organs

Brain
Hard Palate
Nasal cavity
Velum
Teeth
Uvula
Lips
Mouth cavity
Pharynx
Tongue
Larynx
Trachea
Lungs
12
Speech Production Organs - cont.

Air from lungs is exhaled into trachea (windpipe)
Vocal cords (folds) in larynx can produce
periodic pulses of air
by opening and closing (glottis)
Throat (pharynx), mouth, tongue and nasal cavity
modify air flow
Teeth and lips can introduce turbulence
Epiglottis separates esophagus (food pipe) from
trachea

13
Voiced vs. Unvoiced Speech

When vocal cords are held open, air flows
unimpeded
When laryngeal muscles stretch them, glottal flow
is in bursts
When glottal flow is periodic, the speech is called
voiced
Basic interval/frequency is called the pitch
Pitch period is usually between 2.5 and 20
milliseconds
Pitch frequency is between 50 and 400 Hz
You can feel the vibration of the larynx
Vowels are always voiced (unless whispered)
Consonants come in voiced/unvoiced pairs,
for example B/P G/K D/T V/F J/CH TH/th
W/WH Z/S ZH/SH
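
The pitch period and pitch frequency ranges quoted on this slide are reciprocals of each other. A minimal Python sketch of the conversion (the function name is illustrative, not from the slides):

```python
def pitch_frequency_hz(period_ms: float) -> float:
    """Convert a pitch period in milliseconds to a pitch frequency in Hz."""
    return 1000.0 / period_ms

# The two ends of the pitch-period range given on the slide.
for period_ms in (2.5, 20.0):
    print(f"{period_ms} ms -> {pitch_frequency_hz(period_ms):.0f} Hz")
```

The shortest period (2.5 ms) corresponds to the highest pitch (400 Hz) and the longest (20 ms) to the lowest (50 Hz).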

14
Excitation spectra

Voiced speech
Pulse train is not sinusoidal - harmonic
rich
Unvoiced speech
Common assumption: white noise
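
To see why a pulse train is called harmonic rich, the following Python sketch computes the spectrum of an idealized pulse train with a direct DFT. The signal length and pulse spacing are arbitrary assumptions chosen so that the period divides the length exactly; a real glottal pulse is smoother than a unit impulse.

```python
import cmath

def dft_magnitudes(x):
    """Magnitude of the discrete Fourier transform, computed directly (O(N^2))."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
            for k in range(n)]

N, PERIOD = 80, 10  # assumed: 80 samples, one unit pulse every 10 samples
pulse_train = [1.0 if t % PERIOD == 0 else 0.0 for t in range(N)]
mags = dft_magnitudes(pulse_train)

# Bins (up to the Nyquist bin) that carry energy: every multiple of N / PERIOD.
harmonic_bins = [k for k in range(N // 2 + 1) if mags[k] > 1e-6]
print(harmonic_bins)
print([round(mags[k], 6) for k in harmonic_bins])
```

Every harmonic of the pitch frequency appears with the same magnitude, whereas a pure sinusoid would occupy a single bin.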

[Figure: excitation spectra of voiced and unvoiced speech versus frequency f]
15
Effect of vocal tract

Mouth and nasal cavities have resonances
Resonant frequencies
depend on geometry

16
Effect of vocal tract - cont.

Sound energy at these resonant frequencies is
amplified
Frequencies of peak amplification are called
formants

[Figure: vocal tract frequency response versus frequency, with F0 marked]
17
Formant frequencies

Peterson - Barney data (note the vowel triangle)

18
Sonograms
19
Cylinder model(s)

Rough model of throat and mouth cavity
With nasal cavity

[Figure: two cylinder models driven by voice excitation; tube ends labeled open, open, and open/closed]
20
Phonemes

The smallest acoustic unit that can change
meaning
Different languages have different phoneme sets
Types (notations: phonetic, CVC, ARPABET)
Vowels
front (heed, hid, head, hat)
mid (hot, heard, hut, thought)
back (boot, book, boat)
diphthongs (buy, boy, down, date)
Semivowels
liquids (w, l)
glides (r, y)

21
Phonemes - cont.

Consonants
nasals (murmurs) (n, m, ng)
stops (plosives)
voiced (b,d,g)
unvoiced (p, t, k)
fricatives
voiced (v, that, z, zh)
unvoiced (f, think, s, sh)
affricates (j, ch)
whispers (h, what)
gutturals ( ? ,? )
clicks, etc. etc. etc.

22
Basic LPC Model

Pulse Generator
LPC synthesis filter
U/V Switch
White Noise Generator
23
Basic LPC Model - cont.

Pulse generator produces a harmonic rich periodic
impulse train (with pitch period and gain)
White noise generator produces a random signal
(with gain)
U/V switch chooses between voiced and unvoiced
speech
LPC filter amplifies formant frequencies
(all-pole or AR IIR filter)
The output will resemble true speech to within
residual error
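
The model on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the author's implementation: the two filter coefficients are a hypothetical stable resonance rather than values estimated from speech, and the pitch period of 80 samples is arbitrary.

```python
import random

def lpc_synthesize(a, excitation, gain=1.0):
    """All-pole (AR) synthesis: s[n] = gain * e[n] + sum_k a[k] * s[n-1-k]."""
    s = []
    for n, e in enumerate(excitation):
        past = sum(a[k] * s[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
        s.append(gain * e + past)
    return s

def make_excitation(voiced, n_samples, pitch_period=80, seed=0):
    """U/V switch: periodic impulse train if voiced, white noise otherwise."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

# Hypothetical 2-pole filter (one resonance); poles have radius sqrt(0.9) < 1.
A = [1.6, -0.9]
voiced = lpc_synthesize(A, make_excitation(True, 400))
unvoiced = lpc_synthesize(A, make_excitation(False, 400), gain=0.1)
print(voiced[:3], unvoiced[:3])
```

A practical coder would use around ten coefficients per frame and update them, together with the pitch, gain and U/V decision, every few tens of milliseconds.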

24
Cepstrum

Another way of thinking about the LPC model
Speech spectrum is obtained from multiplication:
spectrum of (pitch) pulse train times
vocal tract (formant) frequency response
So log of this spectrum is obtained from addition:
log spectrum of pitch train plus
log of vocal tract frequency response
Consider this log spectrum to be the spectrum of
some new signal,
called the cepstrum
The cepstrum is the sum of two components:
excitation plus vocal tract
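
The additivity claimed on this slide can be checked numerically. The Python sketch below uses the real cepstrum (inverse DFT of the log magnitude spectrum), which is only one of several variants, and toy signals chosen so that no spectral value is zero; both signals and the circular convolution are assumptions made for the illustration.

```python
import cmath
import math

def dft(x, inverse=False):
    """Direct O(N^2) DFT; the inverse includes the 1/N factor."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def real_cepstrum(x):
    """Inverse DFT of the log magnitude spectrum."""
    log_mag = [math.log(abs(v)) for v in dft(x)]
    return [v.real for v in dft(log_mag, inverse=True)]

def circular_convolve(x, h):
    n = len(x)
    return [sum(x[m] * h[(t - m) % n] for m in range(n)) for t in range(n)]

N = 64
excitation = [0.0] * N
excitation[0], excitation[8] = 1.0, 0.5   # toy "pitch" structure at lag 8
tract = [0.6 ** t for t in range(N)]      # toy decaying impulse response
speech = circular_convolve(excitation, tract)

c_sum = [a + b for a, b in zip(real_cepstrum(excitation), real_cepstrum(tract))]
c_speech = real_cepstrum(speech)
max_diff = max(abs(a - b) for a, b in zip(c_sum, c_speech))
print(max_diff)  # rounding error only: the two components add
```

Because the excitation contributes at multiples of the pitch lag and the vocal tract at low quefrencies, the two components can be separated by liftering.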

25
Cepstrum - cont.

Cepstral processing has its own language
Cepstrum (note that this is really a signal in
the time domain)
Quefrency (its units are seconds)
Liftering (filtering)
Alanysis
Saphe
Several variants
complex cepstrum
power cepstrum
LPC cepstrum

26
Do we know enough?

Standard speech model (LPC)
(used by most speech processing/compression/recognition
systems)
is a model of speech production
Unfortunately, speech production and speech
perception systems
are not matched
So next we'll look at the biology of the hearing
(auditory) system
and some psychophysics (perception)

27
Voice DSP - Part 1b

Speech
Hearing perception mechanisms

28
Hearing Organs
29
Hearing Organs - cont.

Sound waves impinge on outer ear and enter auditory
canal
Amplified waves cause eardrum to vibrate
Eardrum separates outer ear from middle ear
The Eustachian tube equalizes air pressure of
middle ear
Ossicles (hammer, anvil, stirrup) amplify
vibrations
Oval window separates middle ear from inner ear
Stirrup excites oval window which excites liquid
in the cochlea
The cochlea is curled up like a snail
The basilar membrane runs along middle of cochlea
The organ of Corti transduces vibrations to
electric pulses
Pulses are carried by the auditory nerve to the
brain

30
Function of Cochlea

Cochlea has 2 1/2 to 3 turns
were it straightened out it would be 3 cm in
length
The basilar membrane runs down the center of the
cochlea
as does the organ of Corti
15,000 cilia (hairs) contact the vibrating
basilar membrane
and release neurotransmitter stimulating
30,000 auditory neurons
Cochlea is wide (1/2 cm) near oval window and
tapers towards apex
is stiff near oval window and
flexible near apex
Hence high frequencies cause section near oval
window to vibrate
low frequencies cause section
near apex to vibrate
Overlapping bank of filters: frequency decomposition

31
Psychophysics - Weber's law

Ernst Weber: Professor of physiology at Leipzig in
the early 1800s
Just Noticeable Difference (JND):
minimal stimulus change that can be detected
by senses
Discovery: $\Delta I = K I$
Example
Tactile sense: place coins in each hand
subject could discriminate between 10 coins
and 11,
but not 20/21, but could 20/22!
Similarly vision: lengths of lines, taste:
saltiness, sound: frequency
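
The coin example follows directly from Weber's law, read here as: the just noticeable change in a stimulus I is K times I. A small Python sketch, where the Weber fraction K = 0.1 is an assumption chosen only to match the example:

```python
def is_noticeable(intensity, changed, weber_fraction):
    """Weber's law: a change is detectable when |delta I| >= K * I."""
    return abs(changed - intensity) >= weber_fraction * intensity

K = 0.1  # assumed Weber fraction, chosen to reproduce the coin example
for base, changed in [(10, 11), (20, 21), (20, 22)]:
    print(f"{base} vs {changed}: noticeable = {is_noticeable(base, changed, K)}")
```

One extra coin is enough at 10 coins, but at 20 coins the threshold has doubled to two coins.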

32
Weber's law - cont.

This makes a lot of sense

Bill Gates
33
Psychophysics - Fechner's law

Weber's law is not a true psychophysical law
it relates stimulus threshold to stimulus
(both physical entities)
not internal representation (feelings) to
physical entity
Gustav Theodor Fechner: student of Weber
(medicine, physics, philosophy)
Simplest assumption: JND is a single internal unit
Using Weber's law we find
$Y = A \log I + B$
Fechner Day (October 22, 1850)
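
One way to see where the logarithm comes from, sketched under the slide's assumption that each JND corresponds to one unit of internal sensation Y. Treating the JND steps as infinitesimal is an added approximation, and A and B are constants that absorb the Weber fraction K, the base of the logarithm and the zero point of the scale.

```latex
\Delta Y = 1 \ \text{per JND}, \qquad \Delta I = K I
\;\Longrightarrow\;
\frac{dY}{dI} \approx \frac{\Delta Y}{\Delta I} = \frac{1}{K I}
\;\Longrightarrow\;
Y = \frac{1}{K} \ln I + B = A \log I + B
```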

34
Fechner's law - cont.

Log is very compressive
Fechner's law explains the fantastic ranges of
our senses
Sight: single photon - direct sunlight: $10^{15}$
Hearing: eardrum moves 1 H atom - jet plane: $10^{12}$
Bel defined to be $\log_{10}$ of power ratio
decibel (dB): one tenth of a Bel
$d(\mathrm{dB}) = 10 \log_{10} (P_1 / P_2)$
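
As a worked example of the decibel definition, read as ten times the base-10 logarithm of the power ratio P1 / P2, the two sensory ranges on this slide (taken here as power ratios of 10 to the 12th and 10 to the 15th) can be converted in Python. The function name is illustrative.

```python
import math

def power_ratio_db(p1, p2):
    """Decibel value of the power ratio p1 / p2."""
    return 10.0 * math.log10(p1 / p2)

hearing_db = power_ratio_db(1e12, 1.0)  # range of hearing quoted on the slide
sight_db = power_ratio_db(1e15, 1.0)    # range of sight quoted on the slide
print(f"hearing: about {hearing_db:.0f} dB, sight: about {sight_db:.0f} dB")
```

A range of twelve orders of magnitude in power collapses to about 120 dB, which is the compression the slide refers to.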

35
Fechner's law - sound amplitudes

Companding:
adaptation of logarithm to positive/negative
signals
μ-law and A-law are piecewise linear
approximations
Equivalent to linear sampling at 12-14 bits
(8 bit linear sampling is significantly
noisier)
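
A sketch of why companding helps quiet signals. It uses the continuous mu-law curve with mu = 255 rather than the piecewise linear approximation used in practice, and a simple uniform 8 bit quantizer; the function names and the test sample are assumptions made for the illustration.

```python
import math

MU = 255.0  # the value used in mu-law telephony

def mu_law_compress(x, mu=MU):
    """Map x in [-1, 1] to [-1, 1] with a sign-preserving logarithmic curve."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

def quantize(y, bits=8):
    """Uniform quantizer on [-1, 1) with 2**bits levels."""
    levels = 2 ** (bits - 1)
    q = max(-levels, min(levels - 1, round(y * levels)))
    return q / levels

x = 0.001  # a quiet sample, with full scale equal to 1.0
linear = quantize(x)                                     # 8 bit linear
companded = mu_law_expand(quantize(mu_law_compress(x)))  # 8 bit mu-law
print(linear, companded)
```

The linear quantizer rounds the quiet sample to zero, while the companded path recovers it to within a few percent using the same 8 bits.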

36
Fechner's law - sound frequencies

octaves, well tempered scale
Critical bands
Frequency warping
Melody: 1 KHz = 1000, JND afterwards
$M = 1000 \log_2 (1 + f_{\mathrm{KHz}})$
Barkhausen: can be simultaneously heard
$B = 25 + 75 (1 + 1.4 f_{\mathrm{KHz}}^2)^{0.69}$
excite different basilar
membrane regions
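
The two formulas on this slide are read here as the mel scale, M = 1000 log2(1 + f), and the critical bandwidth in Hz, B = 25 + 75 (1 + 1.4 f^2)^0.69, with f in kHz in both. That reading of the garbled slide text is an assumption, as are the function names in this Python sketch.

```python
import math

def mel(f_khz):
    """Mel value for a frequency in kHz: M = 1000 * log2(1 + f)."""
    return 1000.0 * math.log2(1.0 + f_khz)

def critical_bandwidth_hz(f_khz):
    """Critical bandwidth in Hz: B = 25 + 75 * (1 + 1.4 f^2) ** 0.69."""
    return 25.0 + 75.0 * (1.0 + 1.4 * f_khz ** 2) ** 0.69

for f in (0.1, 0.5, 1.0, 2.0, 4.0):
    print(f"{f} kHz: {mel(f):.0f} mel, critical band {critical_bandwidth_hz(f):.0f} Hz")
```

By construction 1 kHz maps to 1000 mel, and the critical bandwidth grows from about 100 Hz at low frequencies to several hundred Hz at the top of the telephone band.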

37
Psychophysics - changes

Our senses respond to changes

38
Psychophysics - masking

Masking: strong tones block weaker ones at nearby
frequencies;
narrowband noise blocks
tones (up to critical band)

39
Voice DSP - Part 1c

Speech
Quality
Measurement

40
Why does it sound the way
it sounds?

PSTN
BW = 0.2-3.8 KHz, SNR > 30 dB
PCM, ADPCM (BER $10^{-3}$)
five nines reliability
line echo cancellation
Voice over packet network
speech compression
delay, delay variation, jitter
packet loss/corruption/priority
echo cancellation

41
Subjective Voice Quality

Old Measures
5/9
DRT
DAM
The modern scale
MOS
DMOS

meet neat seat feet Pete beat heat
42
MOS according to ITU

P.800 Subjective Determination of Transmission
Quality
Annex B Absolute Category Rating (ACR)
Score  Listening Quality  Listening Effort
5      excellent          relaxed
4      good               attention needed
3      fair               moderate effort
2      poor               considerable effort
1      bad                no meaning with feasible effort
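
A mean opinion score is the average of the category votes given by a panel of listeners. A minimal Python sketch, with a hypothetical panel of eight votes on the listening quality scale:

```python
from statistics import mean

ACR_LISTENING_QUALITY = {5: "excellent", 4: "good", 3: "fair", 2: "poor", 1: "bad"}

def mean_opinion_score(votes):
    """Average the listeners' category votes (each an integer from 1 to 5)."""
    if not votes or any(v not in ACR_LISTENING_QUALITY for v in votes):
        raise ValueError("votes must be a non-empty list of integers from 1 to 5")
    return mean(votes)

votes = [4, 4, 3, 5, 4, 3, 4, 4]  # hypothetical panel of eight listeners
print(f"MOS = {mean_opinion_score(votes)}")
```

A real test would follow the P.800 procedure for listener selection, material and conditions; this only shows the final averaging step.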

43
MOS according to ITU (cont)

Annex D Degradation Category Rating (DCR)
Annex E Comparison Category Rating (CCR)
ACR not good at high quality speech
Score  DCR                CCR
5      inaudible
4      not annoying
3      slightly annoying  much better
2      annoying           better
1      very annoying      slightly better
0                         the same
-1                        slightly worse
-2                        worse
-3                        much worse

44
Some MOS numbers

Effect of Speech Compression
(from ITU-T Study Group 15)
Condition                                   MOS
Quiet room, 48 KHz 16 bit linear sampling   5.0
PCM (A-law/μ-law) 64 Kb/s                   4.1
G.723.1 @ 6.3 Kb/s                          3.9
G.729 @ 8 Kb/s                              3.9
ADPCM G.726 32 Kb/s                         3.8
toll quality
GSM @ 13 Kb/s                               3.6
VSELP IS54 @ 8 Kb/s                         3.4

45
The Problem(s) with MOS

Accurate MOS tests are the only reliable
benchmark
BUT
MOS tests are off-line
MOS tests are slow
MOS tests are expensive
Different labs give consistently different
results
Most MOS tests only check one aspect of system

46
The Problem(s) with SNR

Naive question: Isn't CCR the same as SNR?
SNR does not correlate well with subjective
criteria
Squared difference is not an accurate comparator
Gain
Delay
Phase
Nonlinear processing
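
The gain and delay items can be illustrated with a pure tone. In this Python sketch (the signal, delay and gain values are arbitrary assumptions), a half-period delay and a 6 dB attenuation both leave the tone perceptually the same tone, yet the squared-difference SNR reports very different numbers.

```python
import math

def snr_db(reference, test):
    """Classic SNR: reference energy over energy of the sample-wise difference."""
    signal = sum(r * r for r in reference)
    noise = sum((r - t) ** 2 for r, t in zip(reference, test))
    return float("inf") if noise == 0 else 10.0 * math.log10(signal / noise)

N = 800
tone = [math.sin(2 * math.pi * 0.05 * n) for n in range(N)]            # period: 20 samples
delayed = [math.sin(2 * math.pi * 0.05 * (n - 10)) for n in range(N)]  # half-period delay
quieter = [0.5 * s for s in tone]                                      # gain of 0.5

print(f"identical: {snr_db(tone, tone)} dB")
print(f"delayed:   {snr_db(tone, delayed):.2f} dB")
print(f"quieter:   {snr_db(tone, quieter):.2f} dB")
```

The delayed copy scores a negative SNR even though nothing was added to it, which is why objective measures need time alignment and gain equalization before any comparison.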

47
Speech distance measures

Many objective measures have been proposed
Segmental SNR
Itakura-Saito distance
Euclidean distance in Cepstrum space
Bark spectral distortion
Coherence Function
None correlate well with MOS
ITU target - find a quality-measure that does
correlate well
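
As an example of the first measure in the list, segmental SNR averages the SNR in dB over short frames, so quiet frames count as much as loud ones. The Python sketch below makes several assumptions: a frame length of 160 samples, no clamping of per-frame values, and a synthetic two-frame signal with a fixed-amplitude error.

```python
import math

def segmental_snr_db(reference, test, frame_len=160):
    """Mean of per-frame SNRs in dB; frames with zero signal or noise are skipped."""
    frame_snrs = []
    for start in range(0, len(reference) - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        tst = test[start:start + frame_len]
        signal = sum(r * r for r in ref)
        noise = sum((r - t) ** 2 for r, t in zip(ref, tst))
        if signal > 0 and noise > 0:
            frame_snrs.append(10.0 * math.log10(signal / noise))
    if not frame_snrs:
        raise ValueError("no frame with both signal and noise energy")
    return sum(frame_snrs) / len(frame_snrs)

# One loud frame followed by one quiet frame, both with the same small error.
loud = [math.sin(2 * math.pi * n / 20) for n in range(160)]
quiet = [0.1 * s for s in loud]
reference = loud + quiet
test_signal = [r + (0.01 if n % 2 == 0 else -0.01) for n, r in enumerate(reference)]

seg_snr = segmental_snr_db(reference, test_signal)
print(f"segmental SNR: {seg_snr:.2f} dB")
```

The overall SNR of the same pair is dominated by the loud frame, so the segmental figure comes out several dB lower.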

48
Some objective methods

Perceptual Speech Quality Measurement (PSQM)
ITU-T P.861
Perceptual Analysis Measurement System (PAMS)
BT proprietary technique
Perceptual Evaluation of Speech Quality (PESQ)
ITU-T P.862
Objective Measurement of Perceived Audio Quality
(PEAQ)
ITU-R BS.1387

49
Objective Quality Strategy
speech
50
PSQM philosophy (from P.861)
Internal Representation
Perceptual model
Audible Difference
Cognitive Model
Perceptual model
Internal Representation
51
PSQM philosophy (cont)

Perceptual Modelling (Internal representation)
Short time Fourier transform
Frequency warping (telephone-band filtering, Hoth
noise)
Intensity warping
Cognitive Modelling
Loudness scaling
Internal cognitive noise
Asymmetry
Silent interval processing
PSQM Values
0 (no degradation) to 6.5 (maximum degradation)
Conversion to MOS
PSQM to MOS calibration using known references
Equivalent Q values

52
Problems with PSQM

Designed for telephony grade speech codecs
Doesn't take network effects into account
filtering
variable time delay
localized distortions
Draft standard P.862 adds
transfer function equalization
time alignment, delay skipping
distortion averaging

53
PESQ philosophy (from P.862)
Perceptual model
Internal Representation
Cognitive Model
Audible Difference
Time Alignment
Perceptual model
Internal Representation
