Speech - PowerPoint PPT Presentation

Title: Speech


Slides: 33
Provided by: robertfor
Learn more at: http://www.bk.isy.liu.se
Transcript and Presenter's Notes


1
Speech Audio Coding
  • TSBK01 Image Coding and Data Compression
  • Lecture 11, 2003
  • Jörgen Ahlberg

2
Outline
  • Part I - Speech
  • Speech
  • History of speech synthesis coding
  • Speech coding methods
  • Part II Audio
  • Psychoacoustic models
  • MPEG-4 Audio

3
Speech Production
  • The human vocal apparatus consists of:
  • lungs
  • trachea (windpipe)
  • larynx
  • contains two folds of skin, called vocal cords, which
    blow apart and flap together as air is forced
    through
  • oral tract
  • nasal tract

4
The Speech Signal
5
The Speech Signal
6
The Speech Signal
  • Elements of the speech signal
  • spectral resonance (formants, moving)
  • periodic excitation (voicing, pitched): pitch
    contour
  • noise excitation (fricatives, unvoiced, no
    pitch)
  • transients (stop-release bursts)
  • amplitude modulation (nasals, approximants)
  • timing

7
The Speech Signal
Vowels are characterised by formants and are generally
voiced. The tongue and lips shape them (effect of rounding).
Examples of vowels: a, e, i, o, u, ah, oh.
Vibration of vocal cords: male 50-250 Hz,
female up to 500 Hz. Vowels have on average a much
longer duration than consonants. Most of the
acoustic energy of a speech signal is carried by
vowels.
[Figure: F1-F2 chart of formant positions]
8
History of Speech Coding
  • 1926 - PCM - first conceived by Paul M. Rainey,
    and independently by Alec Reeves (ITT, Paris)
    in 1937. Deployed in the US PSTN in 1962.
  • 1939 - Channel vocoder (VODER) - first
    analysis-by-synthesis system, developed by
    Homer Dudley of AT&T Bell Labs.

10
OVE formant synthesis (Gunnar Fant, KTH), 1953
11
History of Speech Coding
  • 1926 - PCM - first conceived by Paul M. Rainey,
    and independently by Alec Reeves (ITT, Paris)
    in 1937. Deployed in the US PSTN in 1962.
  • 1939 - Channel vocoder (VODER) - first
    analysis-by-synthesis system, developed by
    Homer Dudley of AT&T Bell Labs.
  • 1952 - Delta modulation proposed, differential
    PCM invented.
  • 1957 - μ-law encoding proposed (standardised for
    the telephone network in 1972 as G.711).
  • 1974 - ADPCM developed.
  • 1984 - CELP vocoder proposed (the majority of
    speech coding standards today use a variation
    on CELP).

12
Source-filter Model of Speech Production
  • The signal from a source is filtered by a
    time-varying filter with resonant properties
    similar to those of the vocal tract.
  • The gain controls A_V and A_N determine the
    intensity of voiced and unvoiced excitation.
  • Higher-frequency formants are attenuated by
    12 dB/octave (due to the nature of our speech
    organs).
  • This is an oversimplified model of speech
    production. However, it is very often adequate
    for understanding the basic principles.

13
Speech Coding Strategies
  • 1. PCM
  • Invented 1926, deployed 1962.
  • The speech signal is sampled at 8 kHz.
  • Uniform quantization requires >10 bits/sample.
  • Non-uniform quantization (G.711, 1972):
    quantizing the companded signal y to 8 bits
    gives 64 kbit/s (8 bits × 8 kHz).
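
The non-uniform quantization step can be illustrated with a minimal sketch of μ-law companding followed by uniform 8-bit quantization. It uses the continuous μ-law curve with μ = 255, whereas G.711 itself specifies a segmented approximation, so this is an illustration of the principle and not the standard's exact encoder. All function names are hypothetical.

```python
import math

MU = 255  # mu-law parameter commonly used for 8-bit telephony (assumed here)

def mu_law_compress(x: float) -> float:
    """Map x in [-1, 1] to y in [-1, 1] with logarithmic companding."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_law_expand(y: float) -> float:
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def quantize_8bit(y: float) -> int:
    """Uniformly quantize y in [-1, 1] to one of 256 integer codes."""
    return min(255, int((y + 1.0) / 2.0 * 256))

def dequantize_8bit(code: int) -> float:
    """Map a code back to the centre of its quantization interval."""
    return (code + 0.5) / 256 * 2.0 - 1.0

def bitrate_kbit(sample_rate_hz: int, bits_per_sample: int) -> float:
    """Bit rate of a PCM stream in kbit/s."""
    return sample_rate_hz * bits_per_sample / 1000

# A quiet sample is reproduced more accurately with companding
# than with plain uniform 8-bit quantization.
x = 0.01
x_companded = mu_law_expand(dequantize_8bit(quantize_8bit(mu_law_compress(x))))
x_uniform = dequantize_8bit(quantize_8bit(x))
```

Small amplitudes get finer quantization steps after companding, which is why 8 bits suffice where uniform quantization needs more than 10.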

14
Speech Coding Strategies
  • 2. Adaptive DPCM
  • Example: G.726 (ADPCM, developed 1974)
  • Adaptive predictor based on six previous
    differences.
  • Gain-adaptive quantizer with 15 levels ⇒ 32
    kbit/s (4 bits/sample at 8 kHz).

15
Speech Coding Strategies
  • 3. Model-based Speech Coding
  • Advanced speech coders are based on models of how
    speech is produced

[Diagram: excitation source → vocal tract]
16
An Excitation Source
[Diagram: noise generator, pulse generator, pitch]
17
Vocal Tract Filter 1: A Fixed Filter Bank
18
Vocal Tract Filter 2: A Controllable Filter
19
Linear Predictive Coding (LPC)
  • The controllable filter is modelled as
    $y_n = \sum_i a_i y_{n-i} + G\varepsilon_n$,
    where $\varepsilon_n$ is the input signal and
    $y_n$ is the output.
  • We need to estimate the vocal tract parameters
    ($a_i$ and $G$) and the excitation parameters (pitch,
    v/uv).
  • Typically the source signal is divided into short
    segments and the parameters are estimated for
    each segment.
  • Example: The speech signal is sampled at 8 kHz
    and divided into segments of 180 samples (22.5
    ms/segment).
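
The recursive filter described above can be sketched as follows: a minimal, direct implementation that feeds either a pulse train (voiced) or white noise (unvoiced) through the all-pole filter for one 180-sample segment. The coefficient values, the pitch period of 80 samples and all function names are illustrative assumptions, not values from the lecture.

```python
import random

def lpc_synthesize(a, gain, excitation):
    """Compute y[n] = sum_i a[i-1] * y[n-i] + gain * excitation[n], zero initial state."""
    y = []
    for n, e in enumerate(excitation):
        acc = gain * e
        for i, ai in enumerate(a, start=1):
            if n - i >= 0:
                acc += ai * y[n - i]
        y.append(acc)
    return y

def pulse_train(length, period):
    """Unit pulses every `period` samples (voiced excitation)."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]

def white_noise(length, seed=0):
    """Gaussian noise (unvoiced excitation), seeded for repeatability."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(length)]

SAMPLE_RATE_HZ = 8000
FRAME_LEN = 180
frame_ms = 1000 * FRAME_LEN / SAMPLE_RATE_HZ  # segment duration in ms

a = [1.3, -0.8]  # hypothetical stable second-order filter (pole radius about 0.89)
voiced = lpc_synthesize(a, 1.0, pulse_train(FRAME_LEN, 80))
unvoiced = lpc_synthesize(a, 0.1, white_noise(FRAME_LEN))
```

A decoder would repeat this for every segment, with new coefficients, gain, pitch and v/uv decision each time.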

20
Typical Scheme of an LPC Coder
[Diagram: noise generator, pulse generator (pitch), v/uv, gain, vocal tract filter (filter coeffs)]
21
Estimating the Parameters
  • v/uv estimation
  • Based on energy and frequency spectrum.
  • Pitch-period estimation
  • Look for periodicity, either via the autocorrelation
    function (ACF) or some other measure, for example one
    that gives a minimum value when p equals the pitch period.
  • Typical pitch periods: 20-160 samples.
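
One measure with a minimum at the pitch period is the average magnitude difference function (AMDF). The transcript does not preserve which measure the slide showed, so using AMDF here is an assumption. The sketch below searches lags of 20 to 160 samples on a synthetic sawtooth signal; the function names and the test signal are illustrative.

```python
def amdf(signal, lag):
    """Average magnitude difference between the signal and itself delayed by `lag`."""
    count = len(signal) - lag
    return sum(abs(signal[i + lag] - signal[i]) for i in range(count)) / count

def estimate_pitch_period(signal, min_lag=20, max_lag=160):
    """Return the lag p in [min_lag, max_lag] that minimises the AMDF."""
    return min(range(min_lag, max_lag + 1), key=lambda p: amdf(signal, p))

# Synthetic periodic signal: a sawtooth with a period of 90 samples.
PERIOD = 90
signal = [float(n % PERIOD) for n in range(400)]
estimated_period = estimate_pitch_period(signal)
```

On real speech the minimum is never exactly zero, and multiples of the true period give nearly equal minima, so practical estimators add safeguards against pitch doubling.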

22
Estimating the Parameters
  • Vocal tract filter estimation
  • Find the filter coefficients that minimize the
    error $e^2 = (y_n - \sum_i a_i y_{n-i} - G\varepsilon_n)^2$.
  • Compare to the computation of optimal predictors
    (Lecture 7).

23
Estimating the Parameters
  • Assuming a stationary signal, the solution is
    $a = R^{-1} p$, where R and p contain ACF values.
  • This is called the autocorrelation method.
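
As a minimal sketch of the autocorrelation method, the code below builds the ACF values of a segment and solves the Toeplitz system R a = p with the Levinson-Durbin recursion, where R[i][j] = r(|i-j|) and p[i] = r(i+1). Choosing Levinson-Durbin as the solver is an assumption (any linear solver would do), and the synthetic second-order test signal and function names are illustrative.

```python
import random

def autocorr(x, max_lag):
    """Return [r(0), ..., r(max_lag)] with r(k) = sum_n x[n] * x[n-k]."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve R a = p for the predictor coefficients a; also return the residual energy."""
    a = [0.0] * order
    err = r[0]
    for m in range(order):
        acc = r[m + 1] - sum(a[i] * r[m - i] for i in range(m))
        k = acc / err                      # reflection (PARCOR) coefficient
        new_a = a[:]
        new_a[m] = k
        for i in range(m):
            new_a[i] = a[i] - k * a[m - 1 - i]
        a = new_a
        err *= 1.0 - k * k
    return a, err

# Synthetic test signal from a known filter y[n] = 1.3 y[n-1] - 0.8 y[n-2] + noise.
rng = random.Random(0)
true_a = [1.3, -0.8]
y = []
for n in range(2000):
    past = sum(ai * y[n - i] for i, ai in enumerate(true_a, start=1) if n - i >= 0)
    y.append(past + rng.gauss(0.0, 1.0))

r = autocorr(y, 2)
a_hat, residual_energy = levinson_durbin(r, 2)  # a_hat should be close to true_a
```

The reflection coefficients k computed along the way are the PARCOR coefficients, one of the alternative filter representations.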

24
Estimating the Parameters
  • Alternatively, in the case of a non-stationary
    signal, the coefficients are computed from
    autocovariance values of the segment.
  • This is called the autocovariance method.

25
Example
  • Coding of parameters using LPC10 (1984)

Parameter          Bits
v/uv               1
Pitch              6
Voiced filter      46
Unvoiced filter    46
Synchronization    1
Sum                54 bits (one filter per frame) ⇒ 2.4 kbit/s
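
The step from 54 bits per frame to 2.4 kbit/s can be checked with a few lines of arithmetic. The sketch assumes the 180-sample frame at 8 kHz from the earlier LPC example, and counts one filter (voiced or unvoiced) per frame; the variable names are illustrative.

```python
# Bits per frame: only one of the two filter descriptions is sent per frame.
FRAME_BITS = {"v/uv": 1, "pitch": 6, "filter": 46, "synchronization": 1}

SAMPLE_RATE_HZ = 8000
FRAME_LEN = 180  # samples per frame (22.5 ms), assumed from the LPC example

bits_per_frame = sum(FRAME_BITS.values())
bitrate_bps = bits_per_frame * SAMPLE_RATE_HZ / FRAME_LEN
bitrate_kbit = bitrate_bps / 1000
```

For comparison, 8-bit PCM at the same sampling rate needs 64 kbit/s, roughly 27 times more.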
26
The Vocal Tract Filter
  • Different representations
  • LPC parameters
  • PARCOR (Partial Correlation Coefficients)
  • LSF (Line Spectrum Frequencies)

27
Code Excited Linear Prediction Coding (CELP)
  • Encoding
  • LPC analysis ⇒ V(z)
  • Define perceptual weighting filter. This permits
    more noise at formant frequencies, where it will
    be masked by the speech
  • Synthesise speech using each codebook entry in
    turn as the input to V(z)
  • Calculate optimum gain to minimise perceptually
    weighted error energy in the speech frame
  • Select codebook entry that gives lowest error
  • Transmit LPC parameters and codebook index
  • Decoding
  • Receive LPC parameters and codebook index
  • Re-synthesise speech using V(z) and codebook
    entry
  • Performance
  • 16 kbit/s: MOS = 4.2, delay = 1.5 ms, 19 MIPS
  • 8 kbit/s: MOS = 4.1, delay = 35 ms, 25 MIPS
  • 2.4 kbit/s: MOS = 3.3, delay = 45 ms, 20 MIPS
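
The codebook search in the encoder can be sketched as an exhaustive loop: each entry is passed through V(z), the optimum gain is computed in closed form, and the entry with the lowest error energy wins. This simplified version omits the perceptual weighting filter, and the frame length, codebook size, filter coefficients and function names are all hypothetical.

```python
import random

def synthesize(a, excitation):
    """All-pole synthesis: y[n] = excitation[n] + sum_i a[i-1] * y[n-i]."""
    y = []
    for n, e in enumerate(excitation):
        y.append(e + sum(ai * y[n - i]
                         for i, ai in enumerate(a, start=1) if n - i >= 0))
    return y

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def celp_search(target, a, codebook):
    """Return (index, gain, error) of the entry that best matches the target frame."""
    best = None
    for index, code in enumerate(codebook):
        synth = synthesize(a, code)
        energy = dot(synth, synth)
        if energy == 0.0:
            continue
        gain = dot(target, synth) / energy                   # optimum gain for this entry
        error = dot(target, target) - gain * dot(target, synth)  # remaining error energy
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best

# Demo: the target is entry 5 scaled by 0.7, so the search should recover it.
rng = random.Random(1)
FRAME_LEN, CODEBOOK_SIZE = 40, 16
codebook = [[rng.gauss(0.0, 1.0) for _ in range(FRAME_LEN)]
            for _ in range(CODEBOOK_SIZE)]
a = [1.3, -0.8]
target = [0.7 * s for s in synthesize(a, codebook[5])]
index, gain, error = celp_search(target, a, codebook)
```

Only the index and gain (plus the LPC parameters) need to be transmitted; the decoder holds the same codebook and re-runs `synthesize`.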

28
Examples
  • G.728
  • V(z) is chosen as a large FIR filter (M = 50).
  • The gain and FIR parameters are estimated
    recursively from previously received samples.
  • The code book contains 127 sequences.
  • GSM
  • The code book contains regular pulse trains with
    variable frequency and amplitudes.
  • MELP
  • Mixed excitation linear prediction
  • The code book is combined with a noise generator.

29
Other Variations
  • SELP: Self Excited Linear Prediction
  • MPLP: Multi-Pulse Excited Linear Prediction
  • MBE: Multi-Band Excitation Coding

30
Quality Levels
Quality level             Bandwidth      Bitrate
Broadcast quality         10 kHz         >64 kbit/s
Network (toll) quality    300-3400 Hz    16-64 kbit/s
Communication quality                    4-16 kbit/s
Synthetic quality                        <4 kbit/s
31
Subjective Assessment
  • In digital communications, speech quality is
    classified into four general categories, namely
    broadcast, network or toll, communications, and
    synthetic.
  • Broadcast: wideband, high-quality commentary
    speech, generally achieved at rates above
    64 kbit/s.
  • MOS (Mean Opinion Score): the result of averaging
    opinion scores from a set of 20-60 untrained
    subjects.
  • They rate the quality from 1 to 5 (1-bad, 2-poor,
    3-fair, 4-good, 5-excellent).
  • A MOS of 4 or higher defines good or toll quality
    (network quality): the reconstructed signal is
    generally indistinguishable from the original.
  • A MOS between 3.5 and 4.0 defines communication
    quality (telephone communications).
  • A MOS between 2.5 and 3.5 implies synthetic quality.

32
Subjective Assessment
  • DRT (Diagnostic Rhyme Test): listeners should
    recognise one of the two possible words in a
    set of rhyming pairs (e.g. meat/heat).
  • DAM (Diagnostic Acceptability Measure): trained
    listeners judge various factors, e.g. muffledness,
    buzziness, intelligibility.

[Figure: Quality versus data rate (8 kHz sampling rate)]