Speech-Coding Techniques - PowerPoint PPT Presentation

About This Presentation
Title:

Speech-Coding Techniques

Description:

Speech-Coding Techniques Chapter 3 Introduction Efficient speech-coding techniques Advantages for VoIP Digital streams of ones and zeros The lower the bandwidth, the ... – PowerPoint PPT presentation

Slides: 43
Provided by: MFC5
Transcript and Presenter's Notes

Title: Speech-Coding Techniques


1
Speech-Coding Techniques
  • Chapter 3

2
Introduction
  • Efficient speech-coding techniques
  • Advantages for VoIP
  • Digital streams of ones and zeros
  • The lower the bandwidth, the lower the quality
  • RTP payload types
  • Processing power
  • Better quality (for a given bandwidth) requires a
    more complex algorithm
  • A balance between quality and cost

3
Voice Quality
  • Bandwidth is easily quantified
  • Voice quality is subjective
  • MOS, Mean Opinion Score
  • ITU-T Recommendation P.800
  • Excellent 5
  • Good 4
  • Fair 3
  • Poor 2
  • Bad 1
  • A minimum of 30 people
  • Listen to voice samples or in conversations

4
  • P.800 recommendations
  • The selection of participants
  • The test environment
  • Explanations to listeners
  • Analysis of results
  • Toll quality
  • A MOS of 4.0 or higher
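
The scoring procedure on these two slides can be sketched as a simple average. This is a minimal illustration, not the P.800 procedure itself: the function name, the listener-count check, and the sample panel are all assumptions made for the example.

```python
from statistics import mean

# Labels of the five-point scale quoted on the slide.
SCALE = {5: "Excellent", 4: "Good", 3: "Fair", 2: "Poor", 1: "Bad"}

def mean_opinion_score(ratings, min_listeners=30):
    """Average listener ratings on the 1-5 scale.

    min_listeners mirrors the slide's "minimum of 30 people"; enforcing it
    here is an illustrative choice.
    """
    if len(ratings) < min_listeners:
        raise ValueError("too few listeners for a meaningful score")
    if any(r not in SCALE for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    return mean(ratings)

# Hypothetical panel: 12 listeners answer "Excellent", 18 answer "Good".
ratings = [5] * 12 + [4] * 18
score = mean_opinion_score(ratings)
is_toll_quality = score >= 4.0  # threshold taken from the slide
print(f"MOS = {score:.1f}, toll quality: {is_toll_quality}")
```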

5
  • Subjective and objective quality-testing
    techniques
  • PSQM, Perceptual Speech Quality Measurement
  • ITU-T P.861
  • faithfully represent human judgement and
    perception
  • algorithmic comparison between the output signal
    and a known input
  • type of speaker, loudness, delay, active/silence
    frames, clipping, environmental noise

6
A Little About Speech
  • Speech
  • Air pushed from the lungs past the vocal cords
    and along the vocal tract
  • The basic vibrations: the vocal cords
  • The sound is altered by the disposition of the
    vocal tract (tongue and mouth)
  • Model the vocal tract as a filter
  • The shape changes relatively slowly
  • The vibrations at the vocal cords
  • The excitation signal

7
Speech sounds
  • Voiced sound
  • The vocal cords vibrate open and close
  • Interrupt the air flow
  • Quasi-periodic pulses of air
  • The rate of the opening and closing: the pitch
  • A high degree of periodicity at the pitch period
  • 2-20 ms
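
To make the 2-20 ms pitch-period range concrete, the sketch below converts it to a pitch frequency and to a number of samples per period. The 8000 samples-per-second rate is an assumption borrowed from the later Voice Sampling slide, and the function names are made up for the example.

```python
SAMPLE_RATE_HZ = 8000  # assumed telephony sampling rate

def pitch_frequency_hz(period_ms):
    """Pitch frequency is the reciprocal of the pitch period."""
    return 1000.0 / period_ms

def samples_per_period(period_ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of samples that span one pitch period."""
    return round(sample_rate_hz * period_ms / 1000.0)

for period_ms in (2, 20):
    print(period_ms, "ms:",
          pitch_frequency_hz(period_ms), "Hz,",
          samples_per_period(period_ms), "samples")
```

Under these assumptions the range corresponds to pitch frequencies from 50 Hz to 500 Hz.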

8
  • Voiced speech
  • Power spectrum density

9
  • Unvoiced sounds
  • Forcing air at high velocities through a
    constriction
  • The glottis is held open
  • Noise-like turbulence
  • Show little long-term periodicity
  • Short-term correlations still present

10
  • unvoiced speech
  • Power spectrum density

11
  • Plosive sounds
  • A complete closure in the vocal tract
  • Air pressure is built up and released suddenly
  • A vast array of sounds
  • The speech signal is relatively predictable over
    time
  • The reduction of transmission bandwidth can be
    significant

12
Voice Sampling
  • A-to-D
  • discrete samples of the waveform and represent
    each sample by some number of bits
  • A signal can be reconstructed if it is sampled at
    a minimum of twice its maximum frequency
  • Human speech
  • 300-3800 Hz
  • 8000 samples per second
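
The sampling rule and the resulting bit rate can be checked with a few lines of arithmetic. This is only a sketch; the helper names are hypothetical, and the 8 bits per sample comes from the figure text on this slide.

```python
def nyquist_rate_hz(max_freq_hz):
    """Minimum sampling rate: twice the highest frequency in the signal."""
    return 2 * max_freq_hz

def pcm_bit_rate(sample_rate_hz, bits_per_sample):
    """Bit rate of a PCM stream in bits per second."""
    return sample_rate_hz * bits_per_sample

min_rate = nyquist_rate_hz(3800)   # upper edge of the speech band on the slide
rate = pcm_bit_rate(8000, 8)       # 8000 samples/s, 8-bit code words

print("minimum sampling rate:", min_rate, "Hz")
print("PCM bit rate:", rate, "bit/s")
```

Sampling at 8000 Hz leaves some margin above the minimum rate for a 3800 Hz band edge.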

[Figure: waveform samples over time. Each sample is encoded into an 8-bit PCM
code word (e.g. 01100101) => 8000 x 8 bit/s]
13
Quantization
  • How many bits are used to represent each sample
  • Quantization noise
  • The difference between the actual level of the
    input analog signal and its quantized value
  • More bits reduce the noise
  • Diminishing returns
  • Uniform quantization levels
  • Louder talkers sound better
  • 11.2/11 vs. 2.2/2: the same 0.2 error is
    relatively larger for the quiet sample
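
The "louder talkers sound better" point can be illustrated by quantizing the two values from the slide with the same step size. This is a minimal sketch: a step of 1.0 and the function names are assumptions chosen to match the 11.2 and 2.2 example.

```python
def quantize_uniform(x, step=1.0):
    """Round x to the nearest multiple of step."""
    return step * round(x / step)

def relative_error(x, step=1.0):
    """Quantization error as a fraction of the sample's own magnitude."""
    return abs(x - quantize_uniform(x, step)) / abs(x)

loud = relative_error(11.2)   # 11.2 is represented as 11
quiet = relative_error(2.2)   # 2.2 is represented as 2

print(f"loud sample:  {loud:.1%} relative error")
print(f"quiet sample: {quiet:.1%} relative error")
```

The absolute error is the same in both cases, so the quieter sample ends up with the worse signal-to-noise ratio.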

14
  • Non-uniform quantization
  • Smaller quantization steps at smaller signal
    levels
  • Spread signal-to-noise ratio more evenly
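
One common way to get smaller steps at small signal levels is companding: compress the signal, quantize uniformly, then expand. The sketch below uses the continuous mu-law formula with mu = 255 as an assumed example. It is not the bit-exact segmented encoding of any standard, and the function names are made up.

```python
import math

MU = 255  # assumed compression parameter

def compress(x, mu=MU):
    """Continuous mu-law compression of x in [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def expand(y, mu=MU):
    """Inverse of compress()."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

def quantize_companded(x, bits=8):
    """Compress, quantize uniformly, then expand back."""
    levels = 2 ** (bits - 1)              # levels per polarity
    y = round(compress(x) * levels) / levels
    return expand(y)

def quantize_uniform(x, bits=8):
    """Plain uniform quantizer with the same number of levels."""
    levels = 2 ** (bits - 1)
    return round(x * levels) / levels

for x in (0.01, 0.1, 0.9):
    err_c = abs(x - quantize_companded(x)) / x
    err_u = abs(x - quantize_uniform(x)) / x
    print(f"x={x}: companded {err_c:.2%}, uniform {err_u:.2%}")
```

With companding the relative error stays at a similar level for quiet and loud samples, which is the "spread more evenly" effect described above.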

15
DTX and Comfort Noise
  • DTX is Discontinuous Transmission
  • Voice activity detector (VAD) detects if there is
    active speech or not.
  • When there is no active speech different DTX
    procedures can be used
  • No Transmission at all
  • Comfort Noise (CN) using RFC 3389
  • Codec built-in CN, like AMR SID (Silence
    Descriptor)
  • Frequency of Comfort Noise packets varies but is
    usually some fraction of normal packet rate
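
A rough sketch of how a VAD decision can drive DTX follows. It assumes a bare energy threshold and a fixed comfort-noise update interval. Both values and all names are hypothetical, and real voice activity detectors look at far more than frame energy.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)

def dtx_decisions(frames, threshold=100.0, sid_interval=8):
    """Return 'SPEECH', 'SID' or 'NONE' for each frame.

    threshold and sid_interval are illustrative values only.
    """
    decisions = []
    silent_run = 0                      # consecutive inactive frames so far
    for frame in frames:
        if frame_energy(frame) >= threshold:
            silent_run = 0
            decisions.append("SPEECH")  # active speech: send a normal packet
        else:
            # Send a comfort-noise update at the start of silence and then
            # once every sid_interval frames; otherwise send nothing.
            send_sid = silent_run % sid_interval == 0
            decisions.append("SID" if send_sid else "NONE")
            silent_run += 1
    return decisions

speech = [100, -100] * 80   # hypothetical loud frame
silence = [1, -1] * 80      # hypothetical near-silent frame
print(dtx_decisions([speech] + [silence] * 9 + [speech]))
```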

16
Type of Speech Coders
  • Waveform codecs
  • Sample and code
  • High-quality and not complex
  • Large amount of bandwidth
  • source codecs (vocoders)
  • Match the incoming signal to a math model
  • Linear-predictive filter model of the vocal tract
  • A voiced/unvoiced flag for the excitation
  • The information is sent rather than the signal
  • Low bit rates, but sounds synthetic
  • Higher bit rates do not improve much

17
  • Hybrid codecs
  • Attempt to provide the best of both
  • Perform a degree of waveform matching
  • Utilize the sound production model
  • Quite good quality at low bit rate

18
G.711
  • The most commonplace codec
  • Used in circuit-switched telephone network
  • PCM, Pulse-Code Modulation
  • If uniform quantization
  • 12 bits x 8 k samples/sec = 96 kbps
  • Non-uniform quantization
  • 8 bits x 8 k samples/sec = 64 kbps, the DS0 rate
  • mu-law
  • North America
  • A-law
  • Other countries, a little friendlier to lower
    signal levels
  • An MOS of about 4.3

19
DPCM
  • DPCM, Differential PCM
  • Only transmit the difference between the
    predicted value and the actual value
  • Voice changes relatively slowly
  • It is possible to predict the value of a sample
    based on the values of previous samples
  • The receiver performs the same prediction
  • The simplest form
  • No prediction
  • No algorithmic delay
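
The idea of sending only the prediction error can be sketched as follows. The predictor here is simply the previous reconstructed sample and the step size is fixed. Both are assumptions made for the example rather than details taken from the slide.

```python
def dpcm_encode(samples, step=4):
    """Quantize the difference between each sample and its prediction."""
    predicted = 0
    codes = []
    for s in samples:
        code = round((s - predicted) / step)  # quantized prediction error
        codes.append(code)
        predicted += code * step              # track what the decoder rebuilds
    return codes

def dpcm_decode(codes, step=4):
    """Run the same prediction as the encoder and add back each difference."""
    predicted = 0
    out = []
    for code in codes:
        predicted += code * step
        out.append(predicted)
    return out

samples = [0, 10, 22, 30, 33, 31]   # hypothetical slowly varying signal
codes = dpcm_encode(samples)
decoded = dpcm_decode(codes)
print("codes:  ", codes)
print("decoded:", decoded)
```

Because the encoder predicts from the reconstructed values rather than the original ones, the quantization error does not accumulate at the receiver.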

20
ADPCM
  • ADPCM, Adaptive DPCM
  • Predicts sample values based on
  • Past samples
  • Factoring in some knowledge of how speech varies
    over time
  • The error is quantized and transmitted
  • Fewer bits required
  • G.721
  • 32 kbps
  • G.726
  • A-law/mu-law PCM -> 16, 24, 32, 40 kbps
  • An MOS of about 4.0 at 32 kbps
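
The four G.726 rates follow from the number of bits spent on each quantized error sample. The short check below assumes 8000 samples per second and 2 to 5 bits per sample; the helper name is hypothetical.

```python
SAMPLE_RATE_HZ = 8000  # assumed telephony sampling rate

def adpcm_bit_rate(bits_per_sample, sample_rate_hz=SAMPLE_RATE_HZ):
    """Bit rate in bit/s when each error sample uses bits_per_sample bits."""
    return bits_per_sample * sample_rate_hz

rates = {bits: adpcm_bit_rate(bits) for bits in (2, 3, 4, 5)}
for bits, bps in rates.items():
    print(f"{bits} bits/sample -> {bps // 1000} kbps")
```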

21
Analysis-by-Synthesis (AbS) Codecs
  • Hybrid codec
  • Fill the gap between waveform and source codecs
  • The most successful and commonly used
  • Time-domain AbS codecs
  • Not a simple two-state, voiced/unvoiced
  • Different excitation signals are attempted
  • Closest to the original waveform is selected
  • MPE, Multi-Pulse Excited
  • RPE, Regular-Pulse Excited
  • CELP, Code-Excited Linear Predictive
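
The analysis-by-synthesis loop described above can be sketched as an exhaustive search: synthesize each candidate excitation and keep the one closest to the target. The one-pole filter, the tiny codebook, and all names are stand-ins chosen for illustration, not the filters or codebooks of any real codec.

```python
def synthesize(excitation, a=0.9):
    """One-pole filter y[n] = x[n] + a*y[n-1], standing in for the vocal tract."""
    out, prev = [], 0.0
    for x in excitation:
        prev = x + a * prev
        out.append(prev)
    return out

def squared_error(u, v):
    """Sum of squared differences between two equal-length signals."""
    return sum((p - q) ** 2 for p, q in zip(u, v))

def abs_search(target, codebook, a=0.9):
    """Index of the excitation whose synthesized output best matches target."""
    errors = [squared_error(target, synthesize(c, a)) for c in codebook]
    return min(range(len(codebook)), key=errors.__getitem__)

# Hypothetical four-entry codebook of excitation vectors.
codebook = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, -1.0, 0.0],
    [0.5, 0.5, 0.5, 0.5],
]
target = synthesize(codebook[2])   # pretend this is the original waveform
best = abs_search(target, codebook)
print("selected excitation index:", best)
```

Only the selected index needs to be sent, since the decoder holds the same codebook and can run the same synthesis.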

22
G.728 LD-CELP
  • CELP codecs
  • A filter: its characteristics change over time
  • A codebook of acoustic vectors
  • A vector: a set of elements representing various
    characteristics of the excitation
  • Transmit
  • Filter coefficients, gain, a pointer to the
    vector chosen
  • Low Delay CELP
  • Backward-adaptive coder
  • Use previous samples to determine filter
    coefficients
  • Operates on five samples at a time
  • Delay < 1 ms
  • Only the pointer is transmitted

23
  • 1024 vectors in the code book
  • 10-bit pointer (index)
  • 16 kbps
  • LD-CELP encoder
  • Minimize a frequency-weighted mean-square error

24
  • LD-CELP decoder
  • An MOS score of about 3.9
  • One-quarter of G.711 bandwidth
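
The G.728 figures quoted on these slides fit together arithmetically. The check below assumes 8000 samples per second and uses only the numbers given above; the constant names are made up.

```python
import math

SAMPLE_RATE_HZ = 8000      # assumed telephony sampling rate
CODEBOOK_SIZE = 1024       # vectors in the codebook
SAMPLES_PER_VECTOR = 5     # samples coded at a time
G711_BIT_RATE = 64000      # bit/s, for comparison

index_bits = int(math.log2(CODEBOOK_SIZE))                # bits per pointer
vectors_per_second = SAMPLE_RATE_HZ / SAMPLES_PER_VECTOR
bit_rate = index_bits * vectors_per_second                # bit/s
block_ms = 1000 * SAMPLES_PER_VECTOR / SAMPLE_RATE_HZ     # duration of 5 samples

print("index bits:", index_bits)
print("bit rate:", bit_rate, "bit/s")
print("block duration:", block_ms, "ms")
print("fraction of G.711:", bit_rate / G711_BIT_RATE)
```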

25
G.723.1 ACELP
  • 6.3 or 5.3 kbps
  • Both mandatory
  • Can change from one to another during a
    conversation
  • The coder
  • A band-limited input speech signal
  • Sampled at 8 kHz, 16-bit uniform PCM quantization
  • Operate on blocks of 240 samples at a time
  • A look-ahead of 7.5 ms
  • A total algorithmic delay of 37.5 ms, plus other
    delays
  • A high-pass filter to remove any DC component

26
  • Various operations to determine the appropriate
    filter coefficients
  • 5.3 kbps, Algebraic Code-Excited Linear
    Prediction
  • 6.3 kbps, Multi-pulse Maximum Likelihood
    Quantization
  • The transmission
  • Linear prediction coefficients
  • Gain parameters
  • Excitation codebook index
  • 24-octet frames at 6.3 kbps, 20-octet frames at
    5.3 kbps

27
  • G.723.1 Annex A
  • Silence Insertion Descriptor (SID) frames of
    size four octets
  • The two lsbs of the first octet
  • 00: 6.3 kbps, 24 octets/frame
  • 01: 5.3 kbps, 20 octets/frame
  • 10: SID frame, 4 octets/frame
  • An MOS of about 3.8
  • At least 37.5 ms delay
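
Because the two least significant bits of the first octet identify the frame type, a receiver can walk a buffer of concatenated frames without any extra length field. The sketch below assumes exactly the three code points listed on the slide. The function names are hypothetical and the value 11 is rejected because the slide does not describe it.

```python
# Frame type (two LSBs of the first octet) -> (label, size in octets).
FRAME_TYPES = {
    0b00: ("6.3 kbps", 24),
    0b01: ("5.3 kbps", 20),
    0b10: ("SID", 4),
}

def frame_info(first_octet):
    """Return (label, size) for the frame that starts with first_octet."""
    kind = first_octet & 0b11
    if kind not in FRAME_TYPES:
        raise ValueError("frame type 11 is not described on the slide")
    return FRAME_TYPES[kind]

def split_frames(payload):
    """Split a byte string of concatenated frames into (label, frame) pairs."""
    frames, pos = [], 0
    while pos < len(payload):
        label, size = frame_info(payload[pos])
        frames.append((label, payload[pos:pos + size]))
        pos += size
    return frames

# Hypothetical payload: one frame of each type, with all other bits zero.
payload = bytes([0b00] + [0] * 23) + bytes([0b01] + [0] * 19) + bytes([0b10, 0, 0, 0])
labels = [label for label, _ in split_frames(payload)]
print(labels)

# Each speech frame covers 240 samples at 8 kHz, i.e. 30 ms.
frame_ms = 1000 * 240 / 8000
print("frame duration:", frame_ms, "ms; with 7.5 ms look-ahead:", frame_ms + 7.5, "ms")
```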

28
G.729
  • 8 kbps
  • Input frames of 10 ms, 80 samples for 8 kHz
    sampling rate
  • 5 ms look-ahead
  • Algorithmic delay of 15 ms
  • An 80-bit frame for 10 ms of speech
  • A complex codec
  • G.729A (Annex A), a number of simplifications
  • Same frame structure
  • G.729 and G.729A encoders/decoders interoperate
  • Slightly lower quality

29
  • G.729B (Annex B)
  • VAD, Voice Activity Detection
  • Based on analysis of several parameters of the
    input
  • The current frame plus two preceding frames
  • DTX, Discontinuous Transmission
  • Send nothing or send an SID frame
  • SID frame contains information to generate
    comfort noise
  • CNG, Comfort Noise Generation
  • G.729, an MOS of about 4.0
  • G.729A an MOS of about 3.7

30
  • G.729 Annex D
  • a lower-rate extension
  • 6.4 kbps: 10 ms speech frames, 64 bits/frame
  • MOS comparable to 6.3 kbps G.723.1
  • G.729 Annex E
  • a higher bit rate enhancement
  • the linear prediction filter of G.729 has 10
    coef.
  • that of G.729 Annex E has 30 coef.
  • the codebook of G.729 has 35 bits
  • that of G.729 Annex E has 44 bits
  • 118 bits/frame = 11.8 kbps
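
The G.729 bit rates on these slides all come from the number of bits in one 10 ms frame. The check below uses only the figures quoted above; the helper name is hypothetical.

```python
FRAME_MS = 10       # G.729 frame duration from the slide
LOOKAHEAD_MS = 5    # look-ahead from the slide

def bit_rate_kbps(bits_per_frame, frame_ms=FRAME_MS):
    """Bits per millisecond is numerically the same as kbps."""
    return bits_per_frame / frame_ms

bits_per_frame = {"G.729": 80, "G.729 Annex D": 64, "G.729 Annex E": 118}
rates = {name: bit_rate_kbps(bits) for name, bits in bits_per_frame.items()}
algorithmic_delay_ms = FRAME_MS + LOOKAHEAD_MS

for name, kbps in rates.items():
    print(f"{name}: {kbps} kbps")
print("algorithmic delay:", algorithmic_delay_ms, "ms")
```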

31
Other Codecs
  • CDMA QCELP defined in IS-733
  • Variable-rate coder
  • Two most common rates
  • The high rate, 13.3 kbps
  • A lower rate, 6.2 kbps
  • Silence suppression
  • For use with RTP, RFC 2658

32
  • GSM Enhanced Full-Rate (EFR)
  • GSM 06.60
  • An enhanced version of GSM Full-Rate
  • ACELP-based codec
  • The same bit rate and the same overall packing
    structure
  • 12.2 kbps
  • Support discontinuous transmission
  • For use with RTP, RFC 1890

33
  • GSM Adaptive Multi-Rate (AMR) codec
  • 20 ms coding delay
  • Eight different modes
  • 4.75 kbps to 12.2 kbps
  • 12.2 kbps, GSM EFR
  • 7.4 kbps, IS-641 (TDMA cellular systems)
  • Change the mode at any time
  • Offer discontinuous transmission
  • The SID (Silence Descriptor) is sent in every 8th
    frame and is 5 bytes in size
  • The coding choice of many 3G wireless networks
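
The saving from discontinuous transmission can be estimated from the SID figures above. This sketch assumes that a frame lasts 20 ms, matching the coding delay on the slide, and it ignores packet headers; the constant names are made up.

```python
FRAME_MS = 20             # assumed frame duration
SID_BYTES = 5             # SID size from the slide
SID_EVERY_N_FRAMES = 8    # SID sent in every 8th frame
LOWEST_MODE_BPS = 4750    # lowest AMR speech mode from the slide

sid_interval_ms = FRAME_MS * SID_EVERY_N_FRAMES          # time between SIDs
sid_bit_rate = SID_BYTES * 8 * 1000 / sid_interval_ms    # bit/s during silence

print("SID interval:", sid_interval_ms, "ms")
print("bit rate during silence:", sid_bit_rate, "bit/s")
print(f"fraction of the 4.75 kbps mode: {sid_bit_rate / LOWEST_MODE_BPS:.1%}")
```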

34
  • The MOS values are for laboratory conditions
  • G.711 does not deal with lost packets
  • G.729 can accommodate a lost frame by
    interpolating from previous frames
  • But cause errors in subsequent speech frames
  • Processing Power
  • G.728 or G.729, 40 MIPS
  • G.726, 10 MIPS

35
iLBC
  • a FREE codec for robust VoIP
  • 13.33 kbps with an encoding frame length of 30
    ms, and 15.20 kbps with a frame length of 20 ms
  • Computational complexity in the range of G.729A
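
The two iLBC modes can be turned into frame sizes with a little arithmetic. The sketch rounds to whole bits because the quoted 13.33 kbps is itself a rounded figure; the helper name is hypothetical.

```python
def frame_bits(bit_rate_bps, frame_ms):
    """Bits in one encoded frame, rounded to a whole number."""
    return round(bit_rate_bps * frame_ms / 1000)

bits_30ms = frame_bits(13330, 30)   # 13.33 kbps mode
bits_20ms = frame_bits(15200, 20)   # 15.20 kbps mode

print("30 ms frame:", bits_30ms, "bits =", bits_30ms // 8, "bytes")
print("20 ms frame:", bits_20ms, "bits =", bits_20ms // 8, "bytes")
```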

36
Speex
  • Open-source patent-free speech codec
  • CELP (code-excited linear prediction) codec
  • operating modes
  • narrowband (8 kHz sampling rate)
  • 2.15-24.6 kb/s
  • delay of 30 ms
  • wideband (16 kHz sampling rate)
  • 4-44.2 kb/s
  • delay of 34 ms
  • ultra-wideband (32 kHz sampling rate)
  • intensity stereo encoding
  • variable bit rate (VBR) possible
  • voice activity detection (VAD)

37
  • Cascaded Codecs
  • E.g., G.711 stream -> G.729 encoder/decoder
  • Might not even come close to G.729
  • Each coder only generates an approximation of the
    incoming signal
  • Audio samples
  • http://www.cs.columbia.edu/hgs/audio/codecs.html

38
Effects of packetization

39
Tones, Signal, and DTMF Digits
  • The hybrid codecs are optimized for human speech
  • Other data may need to be transmitted
  • Tones: fax tones, dial tone, busy tone
  • DTMF digits for two-stage dialing or voice-mail
  • G.711 is OK
  • G.723.1 and G.729 can be unintelligible
  • The ingress gateway needs to intercept
  • The tones and DTMF digits
  • Use an external signaling system

40
  • Easy at the start of a call
  • Difficult in the middle of a call
  • Encode the tones differently from the speech
  • Send them along the same media path
  • An RTP packet provides the name of the tone and
    the duration
  • Or, a dynamic RTP profile: an RTP packet
    containing the frequency, volume, and duration
  • RFC 2198
  • An RTP payload format for redundant audio data
  • Sending both types of RTP payload

41
  • RTP Payload Format for DTMF Digits
  • An Internet Draft
  • Both methods described before
  • A large number of tones and events
  • DTMF digits, a busy tone, a congestion tone, a
    ringing tone, etc.
  • The named events
  • E: the end of the tone; R: reserved

42
  • Payload format
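
The figure for this slide is not in the transcript, so the sketch below assumes the named-event layout as later published in RFC 2833: an 8-bit event code, the E and R bits, a 6-bit volume, and a 16-bit duration. The function names are hypothetical, and the layout should be checked against the specification before it is relied on.

```python
import struct

# Event codes for DTMF digits: 0-9 map to themselves, then * and #.
DTMF_EVENTS = {**{str(d): d for d in range(10)}, "*": 10, "#": 11}

def pack_event(digit, end, volume, duration):
    """Build a 4-octet named-event payload in network byte order.

    Assumed layout: event(8) | E(1) R(1) volume(6) | duration(16).
    """
    if not 0 <= volume <= 63:
        raise ValueError("volume must fit in 6 bits")
    if not 0 <= duration <= 0xFFFF:
        raise ValueError("duration must fit in 16 bits")
    flags = (0x80 if end else 0) | volume   # R (reserved) bit stays zero
    return struct.pack("!BBH", DTMF_EVENTS[digit], flags, duration)

def unpack_event(payload):
    """Return (event, end_flag, volume, duration) from a 4-octet payload."""
    event, flags, duration = struct.unpack("!BBH", payload)
    return event, bool(flags & 0x80), flags & 0x3F, duration

# Hypothetical example: end of digit "5", volume 10, duration 800 timestamp units.
packet = pack_event("5", end=True, volume=10, duration=800)
print(packet.hex(), unpack_event(packet))
```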