EE2F1 Multimedia 1: Speech

1
EE2F1 Multimedia (1): Speech and Audio Technology
Lecture 7: Speech Synthesis (1)
Martin Russell
Electronic, Electrical and Computer Engineering
School of Engineering
The University of Birmingham
2
Stages in text-to-speech synthesis
  • Text normalisation
  • Text-to-phone conversion
  • Linguistic analysis
  • Semantic analysis
  • Conversion of phone-sequence to sequence of
    synthesiser control parameters
  • Synthesis of acoustic speech signal
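
As a rough illustration of the first two stages only (text normalisation and text-to-phone conversion), here is a minimal Python sketch. The tiny number table, the lexicon and the informal phone symbols are all hypothetical stand-ins; a real system would use large rule sets and pronunciation dictionaries.

```python
import re

# Hypothetical toy resources, for illustration only.
NUMBER_WORDS = {"2": "two"}
LEXICON = {
    "platform": ["p", "l", "a", "t", "f", "o:", "m"],
    "two": ["t", "u:"],
}

def normalise(text):
    """Text normalisation: lower-case, drop punctuation, expand digits to words."""
    tokens = re.findall(r"[a-z]+|\d+", text.lower())
    return [NUMBER_WORDS.get(tok, tok) for tok in tokens]

def to_phones(words):
    """Text-to-phone conversion by simple dictionary lookup."""
    phones = []
    for word in words:
        phones.extend(LEXICON[word])
    return phones

words = normalise("Platform 2")
print(words)             # ['platform', 'two']
print(to_phones(words))
```

The later stages (linguistic analysis, control parameters, signal synthesis) are not covered by this sketch.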

3
Approaches to synthesis
  • Final stage is to convert phone or word
    sequence into a sequence of synthesiser control
    parameters
  • Two main approaches
  • Waveform concatenation
  • Model-based speech synthesis (includes
    articulatory synthesis)

4
Waveform Concatenation
  • Join together, or concatenate, stored sections of
    real speech
  • Sections may correspond to whole word, or
    sub-word units
  • Early systems based on whole words
  • e.g. the speaking clock (UK telephone system, 1936)
  • Storage and access major issues
  • Speech quality requires data-rates of 16,000 to
    32,000 bits per second (bps)
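
To see why storage is an issue at these data rates, a back-of-the-envelope calculation helps. The one-minute duration below is an arbitrary assumption chosen for illustration.

```python
def storage_bytes(duration_s, bit_rate_bps):
    """Bytes needed to store duration_s seconds of speech at bit_rate_bps."""
    return duration_s * bit_rate_bps / 8

for rate in (16_000, 32_000):
    kilobytes = storage_bytes(60, rate) / 1000
    print(f"{rate} bps: {kilobytes:.0f} kB per minute of speech")
```

At these rates one minute of stored speech takes roughly 120 kB to 240 kB.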

5
1936 Speaking Clock
From John Holmes, 'Speech synthesis and
recognition', courtesy of British
Telecommunications plc
6
Whole word concatenation (1)
  • Whole word concatenation can give good quality
    speech (as in speaking clock), but has many
    disadvantages
  • pronunciation of a word influenced by
    neighbouring words (co-articulation)
  • prosodic effects like intonation,
    rate-of-speaking and amplitude also influenced by
    context.
  • interpretation of a sentence will be strongly
    influenced by details of the individual words used
    (e.g. which word is stressed in
    "Mary didn't buy Sam a pizza")

7
Whole word concatenation (2)
  • Disadvantages (continued)
  • words must be extracted from the right sort of
    sentence
  • most suitable for applications where structure of
    the sentence is constrained, e.g., announcements,
    lists
  • may need to record more than one example of each
    word, e.g., raised pitch at end of a list,
    pre-pause lengthening

8
Example: original recording
The next train to arrive at platform 2 will call
at Bromsgrove, Droitwich Spa, Worcester Foregate
Street and Malvern Link
9
Example: trivial concatenative synthesis
The next train to arrive at platform 2 will call
at Malvern Link, Worcester Foregate Street,
Droitwich Spa and Bromsgrove
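
The reordered announcement can be sketched as a plain butt-join of stored units. In this Python sketch the unit inventory is hypothetical: each "recording" is a short list of dummy samples standing in for a waveform cut from the original sentence.

```python
# Hypothetical unit inventory: label -> stored samples (dummy values).
CARRIER = "The next train to arrive at platform 2 will call at"
UNITS = {
    CARRIER: [0.1] * 8,
    "Bromsgrove": [0.2] * 3,
    "Droitwich Spa": [0.3] * 4,
    "Worcester Foregate Street": [0.4] * 6,
    "Malvern Link": [0.5] * 3,
    "and": [0.6] * 1,
}

def announce(stations):
    """Return (text, samples) for an announcement listing two or more stations."""
    labels = [CARRIER] + stations[:-1] + ["and", stations[-1]]
    samples = []
    for label in labels:
        samples.extend(UNITS[label])  # plain join, no smoothing or re-intonation
    text = f"{CARRIER} {', '.join(stations[:-1])} and {stations[-1]}"
    return text, samples

text, samples = announce(
    ["Malvern Link", "Worcester Foregate Street", "Droitwich Spa", "Bromsgrove"]
)
print(text)
```

Because each unit keeps the prosody it had in its original position (for example, list-final intonation on "Malvern Link"), a join like this is likely to sound unnatural when the order changes.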
10
Example repeated
  • Original recording
  • Concatenative synthesis

11
Whole word concatenation (3)
  • Disadvantages (continued)
  • to add new words the original speaker must be
    found, or all words must be re-recorded
  • even with specialist facilities, selection and
    extraction of suitable words is labour intensive
    and time consuming

12
Sub-word concatenation (1)
  • Limitations of word-based methods suggest
    concatenative speech synthesis based on sub-word
    units
  • Need well-annotated, phonetically-balanced corpus
    of speech recordings
  • Extract fragments from waveforms in the corpus
    which represent basic units of speech, and can
    be concatenated and used for speech synthesis

13
Sub-word concatenation (2)
  • Difficulties include
  • identification of a set of suitable units
  • careful annotation of large amounts of data
  • derivation of a good method for concatenation

14
Sub-word concatenation (3)
  • Sub-word concatenation overcomes difficulties
    with adding new words to the application
    vocabulary, but other problems are exacerbated.
  • In particular, coarticulation and pitch
    continuity problems occur within, as well as
    between, words.
  • Necessary to use several examples of each phone
    (corresponding roughly to different allophones).

15
Sub-word concatenation (4)
  • Natural to select fragments that characterise the
    phone target values, but modelling transitions
    between these targets is a significant problem

16
Example: sub-word concatenation
"stack" (original)
"task" (sub-word concatenative synthesis)
17
Transitional units (1)
  • Central regions of many speech sounds are
    approximately stationary and less susceptible to
    coarticulation effects.
  • Hence select fragments which characterise
    transitions between phones, rather than phone
    targets.
  • e.g., diphone - transition between two phones.
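
A phone sequence maps onto diphones by pairing each phone with its successor. This minimal Python sketch assumes informal phone symbols and a hypothetical "#" marker for silence at the utterance boundaries.

```python
def to_diphones(phones, boundary="#"):
    """Return the diphone labels (phone-to-phone transitions) for a phone sequence."""
    padded = [boundary] + list(phones) + [boundary]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

print(to_diphones(["s", "t", "a", "k"]))
# ['#-s', 's-t', 't-a', 'a-k', 'k-#']
```

Each label names a stored fragment running from the middle of one phone to the middle of the next, so joins fall in the approximately stationary central regions.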

18
Transitional units (2)
  • There are contextually-induced differences
    between instantiations of the central region of
    a phone, which cause discontinuities if they are
    not attended to.
  • Possible solutions are
  • use several different examples of each diphone
  • store short transition regions, and
    interpolate between end values

19
Transitional units (3)
  • Coping with coarticulation effects by modelling
    transitions and
  • (a) using multiple examples to cope with
    variation in the instantiation of the phone
    centres, and
  • (b) by interpolation between short transition
    regions
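
Option (b) can be sketched as linear interpolation between the last parameter frame of one stored transition and the first frame of the next. This assumes the units are stored as parameter vectors rather than raw waveforms; the two-value frames below are hypothetical formant frequencies in Hz.

```python
def interpolate_join(end_frame, start_frame, n_steps):
    """Return n_steps frames linearly interpolated between two parameter frames."""
    frames = []
    for step in range(1, n_steps + 1):
        w = step / (n_steps + 1)  # weight moves from end_frame towards start_frame
        frames.append([(1 - w) * a + w * b for a, b in zip(end_frame, start_frame)])
    return frames

end_of_first = [500.0, 1500.0]     # last frame of one transition region
start_of_second = [700.0, 1100.0]  # first frame of the next
for frame in interpolate_join(end_of_first, start_of_second, 3):
    print(frame)
# [550.0, 1400.0]
# [600.0, 1300.0]
# [650.0, 1200.0]
```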

20
More on prosody
  • Discontinuity in the fundamental frequency is
    exacerbated for sub-word methods.
  • Can use source-filter model to separate excitation
    signal from vocal-tract shape.
  • Vocal-tract shape descriptions can then be
    concatenated and an appropriately smooth
    fundamental frequency pattern can be added
    separately.
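
The separation of excitation and vocal-tract shape can be sketched in Python as an impulse-train source, driven by a smooth fundamental frequency contour, passed through a filter. The single two-pole resonator standing in for the vocal tract, and all numeric values, are simplifying assumptions for illustration.

```python
import math

def impulse_train(f0_contour, frame_len, sample_rate):
    """Excitation: one unit impulse per pitch period, following f0_contour (Hz per frame)."""
    n = len(f0_contour) * frame_len
    excitation = [0.0] * n
    next_pulse = 0.0
    while next_pulse < n:
        idx = int(next_pulse)
        excitation[idx] = 1.0
        f0 = f0_contour[idx // frame_len]
        next_pulse += sample_rate / f0  # advance by one pitch period
    return excitation

def resonator(x, freq, bandwidth, sample_rate):
    """Two-pole resonator standing in for a single vocal-tract formant."""
    r = math.exp(-math.pi * bandwidth / sample_rate)
    a1 = 2.0 * r * math.cos(2.0 * math.pi * freq / sample_rate)
    a2 = -r * r
    y, y1, y2 = [], 0.0, 0.0
    for sample in x:
        out = sample + a1 * y1 + a2 * y2
        y.append(out)
        y2, y1 = y1, out
    return y

# A smoothly falling pitch contour (120 Hz down to 100 Hz), chosen independently
# of the filter settings.
f0 = [120.0 - 20.0 * i / 9 for i in range(10)]
speech = resonator(impulse_train(f0, 80, 8000), freq=500.0, bandwidth=80.0, sample_rate=8000)
print(len(speech))  # 800
```

Because the contour is supplied separately, the filter descriptions can be concatenated without inheriting pitch jumps from the original recordings.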

21
PSOLA: Pitch Synchronous Overlap and Add
  • PSOLA (Charpentier, 1986)
  • Most successful current approach to concatenative
    synthesis
  • In PSOLA, the end regions of windowed waveform
    samples are overlapped pitch-synchronously and
    added
  • BT's Laureate is an example

22
PSOLA
From John Holmes and Wendy Holmes, 'Speech
synthesis and recognition', Taylor & Francis, 2001
23
Speech modification using PSOLA
  • In addition to speech synthesis from segments,
    there are two other common applications of PSOLA
  • Pitch modification
  • Duration modification
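
Pitch modification can be sketched as a much-simplified time-domain PSOLA in Python. The sketch assumes a constant, known pitch period with pitch marks at multiples of that period; a real system must detect the marks from the signal. Function and parameter names are illustrative.

```python
import math

def hann(length):
    """Periodic Hann window; copies of hann(2 * P) placed P samples apart sum to one."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / length) for k in range(length)]

def psola_pitch_shift(signal, period, factor):
    """Scale pitch by `factor` while keeping duration (simplified TD-PSOLA).

    Assumes pitch marks at 0, period, 2 * period, ... (in samples).
    """
    n = len(signal)
    window = hann(2 * period)            # grains are two pitch periods long
    analysis_marks = list(range(0, n, period))
    new_period = period / factor         # closer marks give a higher pitch
    output = [0.0] * n
    t = 0.0
    while t < n:
        centre_out = int(round(t))
        # Reuse the analysis grain nearest in time to this synthesis mark.
        centre_in = min(analysis_marks, key=lambda m: abs(m - centre_out))
        for k in range(2 * period):
            src = centre_in - period + k
            dst = centre_out - period + k
            if 0 <= src < n and 0 <= dst < n:
                output[dst] += signal[src] * window[k]  # overlap and add
        t += new_period
    return output

# An impulse train with period 100 samples, raised by one octave.
pulses = [1.0 if i % 100 == 0 else 0.0 for i in range(1000)]
shifted = psola_pitch_shift(pulses, period=100, factor=2.0)
print([i for i, v in enumerate(shifted) if v > 0.5][:5])  # [0, 50, 100, 150, 200]
```

Duration modification follows the same pattern: the grain spacing is kept at the original period, and grains are repeated or dropped to lengthen or shorten the signal.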

24
Increasing pitch using PSOLA
From John Holmes and Wendy Holmes, 'Speech
synthesis and recognition', Taylor & Francis, 2001
25
Decreasing pitch using PSOLA
From John Holmes and Wendy Holmes, 'Speech
synthesis and recognition', Taylor & Francis, 2001
26
The Laureate System
  • The BT Laureate system is a modern, PSOLA-based
    synthesiser
  • See Edington et al. (1996a), also look at the web
    site
  • Demonstration

27
PSOLA strengths and weaknesses
  • Strengths
  • Produces good quality speech
  • Weaknesses
  • Large, annotated corpus needed for each voice
  • Requires accurate pitch peak detection
  • Inflexible: new voices can only be produced by
    recording and labelling significant speech
    corpora from new speakers
  • Automatic annotation of corpora using techniques
    from speech recognition

28
Summary
  • Concatenative speech synthesis
  • Whole word concatenation
  • Importance of prosody
  • Sub-word concatenation
  • Choice of sub-word units
  • PSOLA