Advanced Speech Coding for VoIP - PowerPoint PPT Presentation

1
Advanced Speech Coding for VoIP
  • Contents
  • 1) speech codec properties
  • 2) current speech coding standards
  • advanced speech coding methods
  • 3) speech quality factors in VoIP
  • 4) wideband speech coding over VoIP
  • Tero Piirainen, tp105475@cs.tut.fi
  • Speech coding basics can be found in "Speech
    Coding for VoIP" by Konsta Koppinen (VoIP seminar
    20.10.2000)

2
Critical speech codec properties
  • 1) Bit rate
  • 2) Codec complexity
  • 3) Speech quality
  • intelligibility
  • echo
  • delay

3
Bit rate
  • wireline quality speech is 64 kbps PCM coded
  • human speech contains lots of redundancy,
    removing redundancy saves bits
  • normal speech can be compressed effectively 1:10
    while maintaining wireline quality, with silence
    suppression up to 1:20
  • advanced coding means more delay and more
    complexity
  • background noise makes it much more difficult to
    code speech
  • low bit rate codecs perform worse in noisy
    environments

4
Complexity
  • better coding (lower bit rate at better quality)
    requires more processing
  • low bit rate codecs require 20-40 MIPS
  • at the same time, processing power is needed also
    for
  • echo cancellation
  • noise suppression
  • etc.
  • minimizing complexity means minimizing hardware /
    CPU MHz requirements

5
Speech quality
  • Speech quality
  • intelligibility
  • echo
  • delay
  • Intelligibility can be measured by MOS
  • subjective listener tests, ratings ranging from 1 to 5
  • 1 bad
  • 2 poor
  • 3 fair
  • 4 good (= wireline quality)
  • 5 excellent

6
Speech coding standards
  • G.711
  • PCM wireline quality
  • high bit rate (64 kbps)
  • G.723.1
  • wireline quality at 5.3/6.3 kbps
  • chosen as default IP telephony speech codec by
    International Multimedia Teleconferencing
    Consortium (IMTC) Voice over IP (VoIP) forum
  • heavy computation (30 MIPS)
  • narrowband reference codec
  • G.729A
  • almost wireline quality at 8 kbps
  • low delay (35 ms), low complexity

7
Speech coding standards
  • G.722
  • SB-ADPCM (Sub-Band Adaptive Differential Pulse
    Code Modulation)
  • wideband codec, sampling 16 kHz audio signals
  • bit rate 48..64 kbps
  • used in many applications that require audio
    frequency bandwidth coding, such as video
    conferencing and multimedia
  • wideband reference codec
  • ETSI GSM AMR
  • variable rate speech coding suitable for packet
    networks
  • based on the earlier ETSI GSM speech codecs (FR,
    HR, EFR)
  • offers robust coding and wireline quality while
    increasing network capacity
  • adaptive coding to one of eight data rate modes
    4.75...12.2 kbps
  • AMR speech coding algorithm is based on ACELP
    (Algebraic-Codebook-Excited Linear Predictive
    Coding)

8
ITU-T SpC summary
9
ETSI SpC summary
10
Speech codecs in H.323
  • H.323 mandatory codec
  • G.711
  • H.323 supported speech codecs
  • G.722
  • G.723
  • G.728
  • G.729
  • GSM codecs (FR, EFR, AMR)
  • also audio codecs (MPEG1,..)

11
Speech production
12
Speech Synthesis Model
13
Linear prediction
  • The vocal tract forms a tube characterized by
    resonances, which are called formants.
  • LPC analyzes the speech signal by estimating the
    formants.
  • The LPC parameters are transmitted and used as
    input to LPC synthesis at the receiving end.
  • Because speech signals vary over time, LPC is done
    on short frames, normally 30 to 50 frames per second.
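The per-frame LPC analysis described above can be sketched in a few lines of pure Python (an illustrative sketch, not any codec's reference implementation): compute the frame autocorrelation, then solve for the prediction coefficients with the Levinson-Durbin recursion.

```python
def autocorr(x, order):
    """Autocorrelation r[0..order] of a (windowed) speech frame."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(order + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations for LPC coefficients a[0..order]
    (a[0] = 1) from the autocorrelation r; also return the final
    prediction error energy."""
    a = [0.0] * (order + 1)
    a[0] = 1.0
    err = r[0]
    for m in range(1, order + 1):
        # reflection coefficient for order m
        k = -sum(a[j] * r[m - j] for j in range(m)) / err
        new_a = a[:]
        for j in range(1, m):
            new_a[j] = a[j] + k * a[m - j]
        new_a[m] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err
```

For an AR(1)-shaped autocorrelation such as r = [1.0, 0.9, 0.81], the recursion recovers a single predictor tap of about -0.9 and a zero second tap.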

14
Linear Prediction
15
LPC example - unvoiced sound
16
LPC spectrum over time
(figure: 3-D LPC spectrum, axes intensity / time / frequency)
17
Long term correlation
  • The vocal cords produce the excitation signal,
    which is characterized by its intensity (loudness)
    and frequency (pitch).
  • Long-term correlation is represented by the lag:
    the number of samples between long-term periods in
    a continuous signal.
  • Lag values typically range between 20 and 150
    samples, corresponding to the frequency range
    400-50 Hz.
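A minimal open-loop lag search illustrates the idea: over the allowed lag range, pick the lag whose (positively correlated) normalized autocorrelation score is largest. This is an illustrative sketch, not the search procedure of any particular standard.

```python
def open_loop_lag(x, lag_min=20, lag_max=150):
    """Pick the lag that maximizes the normalized correlation
    between x[n] and x[n - lag] over one analysis window."""
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        num = sum(x[n] * x[n - lag] for n in range(lag, len(x)))
        den = sum(x[n - lag] ** 2 for n in range(lag, len(x)))
        # only positively correlated lags count as pitch candidates
        score = num * num / den if (den > 0 and num > 0) else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

On a synthetic signal with a 40-sample period, the search returns 40 rather than a multiple, because longer lags leave fewer correlated samples in the window.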

18
Analysis-by-Synthesis Coding
  • In the analysis-by-synthesis speech coding method
    the encoder includes a local decoder used for
    speech synthesis.
  • Input speech is analyzed to obtain the
    coefficients required for the synthesis filters.
  • Excitation vectors are generated and passed
    through the local decoder, i.e. the synthesis filters.
  • The synthesized speech for each excitation vector
    is subtracted from the original speech to form an
    objective error.
  • The objective error is spectrally weighted to
    obtain a perceptually more meaningful measure of
    the coding error. The excitation vector and gain
    that minimize this weighted error are selected.
  • The search for the long-term periodic component of
    the speech signal using the analysis-by-synthesis
    method can be interpreted as the use of an
    adaptive codebook. The codebook is indexed by the
    lag, and the gain corresponds to the excitation gain.
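A toy version of this loop, with the perceptual weighting filter omitted for brevity, might look as follows; `synthesize` is a plain all-pole filter and the codebook here is a hypothetical list of excitation vectors, not any standard's codebook.

```python
def synthesize(excitation, a, state=None):
    """All-pole synthesis: y[n] = e[n] - sum_k a[k] * y[n-k],
    with a[0] = 1 by convention."""
    order = len(a) - 1
    mem = list(state) if state else [0.0] * order
    out = []
    for e in excitation:
        y = e - sum(a[k] * mem[k - 1] for k in range(1, order + 1))
        out.append(y)
        mem = [y] + mem[:-1]
    return out

def abs_search(target, codebook, a):
    """Analysis-by-synthesis search: synthesize every codevector,
    pick the (index, gain) pair minimizing the squared error.
    (Real codecs minimize a spectrally weighted error instead.)"""
    best = (None, 0.0, float("inf"))
    for idx, cv in enumerate(codebook):
        y = synthesize(cv, a)
        num = sum(t * s for t, s in zip(target, y))
        den = sum(s * s for s in y)
        g = num / den if den > 0 else 0.0       # optimal gain for this codevector
        err = sum((t - g * s) ** 2 for t, s in zip(target, y))
        if err < best[2]:
            best = (idx, g, err)
    return best
```

If the target was itself produced by one of the codevectors at some gain, the search recovers exactly that index and gain with zero residual error.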

19
Basics of Adaptive Codebook
  • LTP-state memory can be interpreted as an
    adaptive codebook in which the consecutive
    codevectors differ only by one new value and a
    shift.
  • LAG is an index to the codebook/delay line and
    GAIN is the excitation gain.
  • Excitation from the adaptive codebook is combined
    with fixed excitation.
  • Delay line is updated with the "best" codevector.

20
Virtual LAG
(figure: input speech with actual, virtual, and used LAG values)
  • In the simple case, LAG is larger than the
    subframe length
  • If smaller LAG values are used, the virtual LAG
    trick is needed, because in the decoder only past
    samples are available
  • For samples whose delayed counterparts would fall
    inside the current frame, LAG value multiples are
    used (utilizing periodicity)
  • Enhances voices with high pitch (children, women)
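The virtual LAG trick can be sketched as follows: the first LAG samples of the adaptive-codebook vector come from past excitation, while later positions reuse values already built inside the current subframe, which is equivalent to using LAG multiples.

```python
def adaptive_codevector(past_exc, lag, subframe_len):
    """Build the adaptive-codebook vector for a subframe.
    When lag < subframe_len, positions n >= lag copy v[n - lag],
    i.e. the vector is extended periodically (virtual lag)."""
    v = []
    for n in range(subframe_len):
        if n < lag:
            v.append(past_exc[len(past_exc) - lag + n])  # delayed past sample
        else:
            v.append(v[n - lag])  # periodic extension inside the subframe
    return v
```

With a past excitation of [1, 2, 3] and lag 2 on a 5-sample subframe, the vector repeats the last two past samples periodically: [2, 3, 2, 3, 2].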

21
Basics of vector quantization
  • Each vector (e.g. of LPC coefficients) can take
    arbitrary values in k-dimensional space.
  • A vector is replaced and represented by a centroid.
  • A centroid is the vector in the parameter space
    whose distance to a cluster of vectors is minimal.
  • A discrete set of centroids gives a quantization
    from R^k to C, where C is the set of centroids.
  • The table of centroids is called a codebook.

22
Basics of vector quantization
  • Quantization with a codebook
  • For each vector, the centroid (codevector) that
    minimizes the distortion between the original
    sample and the codevector is searched.
  • A codevector is represented by its index in the
    codebook.
  • An exhaustive codebook search has exponentially
    growing calculation and storage requirements.
  • By using a specific codebook structure, the search
    can be sped up and the growth of complexity and
    storage made linear.
  • Example: binary tree structure.
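Exhaustive nearest-centroid search is simply a minimum-distance scan over the codebook; structured codebooks (e.g. binary trees) exist precisely to avoid this full scan. A sketch:

```python
def quantize(vec, codebook):
    """Return the index of the codevector with minimum squared
    Euclidean distance to vec (exhaustive search)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: dist2(vec, codebook[i]))
```

Only the returned index is transmitted; the decoder looks the codevector up in its copy of the codebook.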

23
Two level vector quantizer
  • Speech codec parameters can be quantized as
    vectors instead of quantizing each parameter
    separately. Vector quantization results in savings
    in output bit rate.
  • The best quantization vector is searched from a
    predefined codebook.
  • To decrease the complexity of the search, it can
    be done in stages. The original vector is divided
    into smaller parts (vector splitting, band
    splitting, etc.)
  • For example, in the GSM HR speech codec, the four
    best candidate vectors out of 64 are selected in a
    prequantizer. In the next phase, 4 x 32 vectors
    are evaluated and the best is chosen.
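A hypothetical two-stage, M-best search in the spirit of the GSM HR example above. The codebooks and the additive refinement structure here are illustrative assumptions, not the HR codec's actual tables.

```python
def two_stage_search(vec, pre_cb, refine_cbs, m_best=4):
    """Stage 1: keep the m_best prequantizer candidates.
    Stage 2: evaluate each candidate's refinement codebook
    (reconstruction assumed additive) and pick the overall best
    (prequantizer index, refinement index) pair."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    cands = sorted(range(len(pre_cb)),
                   key=lambda i: dist2(vec, pre_cb[i]))[:m_best]
    best = (None, None, float("inf"))
    for i in cands:
        for j, r in enumerate(refine_cbs[i]):
            # hypothetical reconstruction: prequantizer + refinement vector
            recon = [p + q for p, q in zip(pre_cb[i], r)]
            d = dist2(vec, recon)
            if d < best[2]:
                best = (i, j, d)
    return best[0], best[1]
```

With 64 prequantizer entries and 32 refinements each, the staged search evaluates 64 + 4 x 32 = 192 vectors instead of the 2048 of a flat exhaustive search.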

24
Silence suppression
  • In a two-way conversation, about 60% of the time
    carries only background noise; the silent periods
    can be suppressed without degrading quality.
  • VAD: voice activity detection
  • detects voice activity; only speech is coded and
    transmitted
  • CNG: comfort noise generation
  • completely silent periods feel uncomfortable to
    the receiver
  • CNG algorithms fill the silent periods with
    generated noise
  • some noise parameters are transmitted to maintain
    realistic noise characteristics
  • Lowers bandwidth usage -> well suited for packet
    communication.
  • VAD performance
  • If the VAD sensitivity is low, the algorithm will
    fail to notice the beginning of speech ->
    front-end clipping.
  • If the VAD is too sensitive -> inefficiency.
  • The performance of the VAD algorithm becomes
    apparent in noisy environments such as an office,
    where there can be conversation or music as the
    background noise.
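A toy energy-based VAD with a hangover period illustrates the sensitivity trade-off described above (real codec VADs, e.g. in AMR, use much richer features; the threshold and hangover length here are arbitrary choices).

```python
import math

def vad(frames, threshold_db=-40.0, hangover=3):
    """Mark each frame active if its mean energy exceeds a level
    threshold; a hangover keeps a few trailing frames active to
    avoid clipping the ends of words."""
    decisions, hang = [], 0
    for f in frames:
        energy = sum(s * s for s in f) / len(f)
        level = 10.0 * math.log10(energy) if energy > 0 else -120.0
        if level > threshold_db:
            hang = hangover          # reset the hangover counter
            decisions.append(True)
        elif hang > 0:
            hang -= 1                # still inside the hangover period
            decisions.append(True)
        else:
            decisions.append(False)  # silence: send CNG parameters instead
    return decisions
```

Raising the threshold risks front-end clipping; lowering it (or lengthening the hangover) transmits more noise-only frames and wastes bandwidth.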

25
Example Analysis by Synthesis Coding (EFR)
(block diagram: input speech -> LPC analysis -> LSP quantization;
adaptive codebook: LAG, GAIN; algebraic codebook: PULSE POSITIONS, GAIN)
26
Main Blocks of GSM EFR Codec
  • High-pass filtering of the input speech frame in
    order to remove DC offset.
  • LPC analysis twice per speech frame (every 10 ms),
    i.e. two sets of coefficients are calculated for a
    10th-order linear prediction filter. LPC
    coefficients are represented as Line Spectrum
    Pairs (LSP) and quantized.
  • Open-loop lag search is done twice per speech
    frame (every 10 ms), producing candidate LAG
    values.
  • Closed-loop lag search is done for each subframe
    (every 5 ms) on the basis of the candidate lags.
    Optimization of the LTP parameters (LTP lag and
    gain) is done using analysis-by-synthesis search
    (adaptive codebook). Lag values can be fractional
    (non-integer) with an accuracy of 1/6th of a
    sample. In addition, virtual lag is used, i.e. the
    lag can be less than the size of the subframe
    (pitch over 200 Hz). The lag value for the first
    subframe is fully coded; for the other three
    subframes, delta coding is used.
  • The lag is the number of samples between long-term
    periods in a continuous signal. The range of lag
    values for the GSM EFR codec is 18-143,
    corresponding to the frequency range 444-56 Hz.
    LTP resolution is enhanced in EFR by using
    fractional LAG values, which are computed from an
    upsampled signal.

27
Main Blocks of GSM EFR Codec
  • If lag values smaller than the subframe length are
    used, virtual lag calculation is needed, because
    in the decoder only past samples are available.
    For samples whose delayed counterparts would fall
    inside the current frame, lag value multiples are
    used, utilizing the periodicity of the speech
    signal. Virtual lag enhances the high-pitched
    voices of children and women.
  • The codec architecture is called Algebraic Code
    Excited Linear Prediction (ACELP). The excitation
    vector utilizes an algebraic codebook: the vector
    is formed by 10 non-zero pulses (equal to -1 or
    +1) per subframe (40 samples) so as to minimize
    the coding error.
  • The synthesis filter includes the effect of the
    subjective weighting filters.
  • The optimal combination of pulse positions is
    determined using analysis-by-synthesis search. The
    codebook index, i.e. the combination of pulse
    positions, is calculated for each subframe.
  • The gains are quantized. For the algebraic
    codebook gain, moving-average prediction is used.
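As a simplified stand-in for the joint analysis-by-synthesis pulse search, the sketch below places 10 pulses of +/-1 greedily on EFR-style interleaved tracks (5 tracks, 2 pulses each, over a 40-sample subframe), using the magnitude of a target signal directly instead of filtering each candidate through the weighted synthesis filter as the real codec does.

```python
SUBFRAME = 40
TRACKS = 5               # positions p on track t satisfy p % 5 == t
PULSES_PER_TRACK = 2     # 5 tracks x 2 pulses = 10 pulses, as in EFR

def greedy_acelp_search(target):
    """Simplified sequential pulse placement: on each track, put
    pulses where |target| is largest, with the sign of the target.
    (The real EFR search evaluates pulse combinations jointly.)"""
    exc = [0.0] * SUBFRAME
    for track in range(TRACKS):
        positions = sorted(range(track, SUBFRAME, TRACKS),
                           key=lambda p: -abs(target[p]))[:PULSES_PER_TRACK]
        for p in positions:
            exc[p] += 1.0 if target[p] >= 0 else -1.0
    return exc
```

Because only pulse positions and signs are transmitted (plus one gain), the excitation is described very compactly, which is the point of the algebraic codebook.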

28
AMR in short
  • Improved speech quality at variable rates
  • Based on the GSM FR, HR and EFR codecs
  • Ability to trade speech quality against capacity
    smoothly and flexibly by codec mode adaptation
  • Improved robustness to errors
  • Wideband option is under consideration
  • Narrowband codec specifications ready in
    December 1998
  • A link adaptation mechanism is required for
    measuring channel quality and selecting speech
    codec modes

29
Speech quality in VoIP
  • These factors have a major effect on speech
    quality:
  • Delay
  • Jitter
  • Packet loss
  • Echo
  • Tandeming

30
Delay
  • major factor in speech quality
  • Preferred maximum one-way delay is 200-400 ms
    without echo; with echo, the maximum is 25 ms
  • three components of delay
  • algorithmic delay
  • frame sampling and lookahead delay, typically
    15-40 ms
  • processing delay
  • speech frame encoding and decoding delay,
    typically 5-10 ms
  • communications delay
  • channel delay between encoder and decoder,
    typically ?

31
Jitter
  • variations in delay
  • data buffering must be used at network edges
    ensuring that a constant stream of speech frames
    can be reproduced
  • jitter may vary during a VoIP call
  • smart buffering, where the buffer size can be
    changed during a call
  • constantly measures network condition in order to
    decide on making these buffer adjustments
  • if the buffer size is increased additional speech
    segments must be synthesised.
  • if buffer size is decreased some parts of the
    speech signal must be dropped.
  • gt These adjustments would preferably be made
    during silent periods.
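A minimal adaptive playout buffer illustrating these adjustments. The sequence numbering and the grow-on-late / shrink-on-backlog rule are illustrative assumptions, not any product's algorithm.

```python
class JitterBuffer:
    """Toy adaptive playout buffer: holds frames keyed by sequence
    number and adapts its target depth from observed behaviour."""

    def __init__(self, depth=2):
        self.depth = depth      # target number of buffered frames
        self.frames = {}        # seq -> frame payload
        self.next_seq = 0
        self.late = 0

    def push(self, seq, frame):
        if seq >= self.next_seq:
            self.frames[seq] = frame
        else:
            self.late += 1      # arrived after its playout slot

    def pop(self):
        """Called once per frame interval by the playout clock.
        Returns None for a missing frame (to be concealed)."""
        frame = self.frames.pop(self.next_seq, None)
        self.next_seq += 1
        return frame

    def adapt(self):
        """Preferably run during a silent period: grow the buffer
        after late arrivals, shrink it when the backlog stays high."""
        if self.late > 0:
            self.depth += 1     # growing means synthesising extra speech
        elif len(self.frames) > self.depth:
            self.depth -= 1     # shrinking means dropping some speech
        self.late = 0
```

Growing the buffer trades extra delay for fewer late-frame losses; shrinking it does the opposite, which is why the adjustment is best hidden in silence.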

32
Packet loss
  • a problem in heavily loaded networks, where packet
    loss rates may be up to 10%
  • speech codecs are robust to random bit errors, but
    not to losing complete speech frames
  • forward error correction for voice frames is not
    relevant in IP networks because it cannot handle
    the loss of whole packets
  • interleaving cannot be used because of the extra
    delay
  • retransmissions are not possible because of the
    real-time requirements
  • current solutions usually include an error
    detection function, usually a plain CRC check,
    which can be used as an indication to start a
    packet loss recovery procedure
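A common recovery procedure at the decoder is to repeat the last good frame with decaying gain, then mute; a sketch (the repeat limit and decay factor here are arbitrary choices, not from any standard):

```python
def conceal(frames, max_repeats=3):
    """Replace lost frames (None) by repeating the last good frame
    with geometrically decaying gain; mute after repeated losses."""
    out, last, repeats = [], None, 0
    for f in frames:
        if f is not None:
            last, repeats = f, 0
            out.append(f)
        elif last is not None and repeats < max_repeats:
            repeats += 1
            gain = 0.5 ** repeats          # attenuate each repetition
            out.append([s * gain for s in last])
        else:
            out.append([0.0] * (len(last) if last else 160))  # mute
    return out
```

The decay avoids buzzy artifacts when several consecutive frames are lost; parameter-domain concealment in CELP decoders follows the same idea on the lag and gain parameters.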

33
Echo
  • Echo cancelling should be used when the round-trip
    delay exceeds 50 ms
  • in VoIP, echo cancelling becomes complicated, as
    the delays are longer and may vary -> VoIP
    terminals must implement echo cancelling
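Echo cancellers are typically adaptive FIR filters that model the echo path from the far-end signal; a minimal normalized-LMS sketch (filter length and step size are illustrative):

```python
def nlms_echo_cancel(far, mic, taps=64, mu=0.5, eps=1e-6):
    """Adaptive FIR filter estimates the echo path from the far-end
    signal and subtracts the echo estimate from the microphone
    signal; the residual is what gets transmitted."""
    w = [0.0] * taps            # echo path estimate
    x = [0.0] * taps            # delay line of far-end samples
    out = []
    for f, m in zip(far, mic):
        x = [f] + x[:-1]
        echo_hat = sum(wi * xi for wi, xi in zip(w, x))
        e = m - echo_hat        # error = near-end speech + residual echo
        norm = sum(xi * xi for xi in x) + eps
        # NLMS update: step normalized by input power
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x)]
        out.append(e)
    return out
```

With fixed line echo the filter converges once and stays; the VoIP difficulty mentioned above is that long, varying delays force longer filters and re-convergence.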

34
Tandem effect
  • intermediate speech coding and decoding phases
    deteriorate speech quality considerably
  • most evident with low bit rate codecs
  • especially important where VoIP interworks with
    other networks that use speech coding (mobile
    networks, etc.)
  • e.g. transcoding between G.723.1 and a GSM codec
    produces poor speech quality
  • TFO: tandem-free operation
  • standardization is going on in ETSI for TFO with
    GSM networks
  • TFO can yield considerable savings in operational
    costs

35
Wideband Speech
  • the fundamental bandwidth limitation of the public
    switched telephone network prevents speech quality
    from being enhanced further
  • -> most current codecs achieve good performance
    only for narrowband speech, where the audio
    bandwidth is limited to 3.4 kHz
  • in wideband speech coding the audio bandwidth is
    extended to 7 kHz
  • wideband coding exceeds wireline quality

36
Why Wideband Speech ?
  • Wideband speech supplies superior speech quality
    over current narrowband speech services and
    exceeds the quality of wireline phones
  • In narrowband speech (100-3600 Hz band) important
    high-frequency components are lost (e.g. in 's'
    sounds)
  • Wideband uses the 50-7000 Hz band, thus improving
    naturalness, presence and intelligibility
  • wideband speech offers superior voice quality over
    the existing narrowband services (cellular
    systems, PSTN)
  • especially suitable for applications with high
    quality audio parts

37
Wideband vs. narrowband
38
Wideband speech quality
  • results indicate that there is a significant
    benefit in the wideband solution over narrowband
  • the increased audio bandwidth provided by the wide
    speech band creates an effect of proximity between
    the users
  • it almost completely eliminates the feeling of
    "talking over a wire" of the wireline network
  • Codec MOS:
  • GSM EFR (narrowband 3.4 kHz): 3.3
  • G.722, 48 kbit/s (wideband 7 kHz): 4.06
  • G.722, 56 kbit/s (wideband 7 kHz): 4.57

39
Nokia AMR WB Speech Codec
  • Collaboration with VoiceAge (University of
    Sherbrooke in Canada)
  • ACELP technology very similar to AMR NB and GSM
    EFR
  • Multirate codec with 9 modes
  • Same coding algorithm in each mode
  • Very high code and data ROM reuse between modes
    (much better than in the AMR NB codec)
  • VAD (integrated into the speech codec) and
    comfort noise generation
  • Link Adaptation
  • 9 speech codec modes
  • 6.60, 8.85, 12.65, 14.25, 15.85, 18.25, 19.85,
    23.05, 23.85 kbit/s

40
AMR WB Standardization in 3GPP
  • Initiated based on a feasibility study in SMG11
    (2Q 1999)
  • Initially 9 candidates; 5 companies were qualified
    for the selection phase: Ericsson, the FDNS
    consortium (FT, DT, Nortel, Siemens), Motorola,
    Nokia and Texas Instruments
  • The Nokia WB codec was selected as the best codec
    in the 3GPP TSG S4 meeting in October 2000 -> will
    be standardized
  • The final specifications are to be approved in
    March 2001 (R4)
  • The selected Nokia WB codec has also been approved
    to proceed into the ITU WB codec selection in
    March 2001

41
Speech Codec Bit Allocation into Parameter Groups
42
AMR WB Speech Quality vs. ITU G722 WB
(chart: subjective speech quality of the Nokia AMR WB codec at bit
rates 6.6-23.85 kbit/s, compared with G.722 at 48, 56 and 64 kbit/s)
43
AMR-WB Speech Quality in GSM Channel
(chart: subjective speech quality degradation as a function of channel
quality, C/I from 13 dB down to 4 dB plus error-free, for AMR-WB,
AMR-NB and EFR)
44
Applications for Wide Band Speech
  • Wideband telephony
  • AMR NB: equal to PSTN speech quality
  • AMR WB: improves quality and provides naturalness
  • Conferencing (conversational multimedia)
  • Quality improvement over the current codecs
    (G.722 at 48 and 56 kbit/s)
  • Bit rate drops to half or less compared to G.722
  • Streaming
  • Low complexity, low bit rate solution for
    browsing-type applications

45
ITU-T wideband activity around 16 kbit/s
  • In 1999, the following guidelines were considered
    relevant in ITU-T for the new wideband activity
    around 16 kbit/s (12, 16, 20 and 24 kbit/s):
  • Input and output audio signals should have a
    bandwidth of 7 kHz at a sampling rate of 16 kHz.
  • Primary signals of interest are clean speech and
    speech in background noise.
  • High speech quality, with the objective of
    equivalence to G.722 at 56/64 kbit/s.
  • 16 kbit/s is the main bit rate. Candidates are
    required to scale in bit rate down to lower rates
    (less than 16 kbit/s) and up to 24 kbit/s with no
    fundamental changes in either the technology or
    the algorithm used.
  • Robustness to frame erasures and random bit
    errors.
  • Low algorithmic delay (frame size of 20 ms or
    integer sub-multiples).
  • The applications for the new activity were
    considered to be: Voice over IP (VoIP) and
    Internet applications, PSTN applications, mobile
    communications, ISDN wideband telephony, and ISDN
    videotelephony and video-conferencing.

46
Introduction of wideband into systems
  • implementation of WB requires
  • 16 kHz sampling frequency in A/D and D/A
  • Acoustic design of handset
  • Tandem Free Operation (TFO)