1
Text-to-Speech Introduction
  • Heng Ji
  • hengji_at_cs.qc.cuny.edu
  • Sept 18, 2008

Acknowledgement: some slides from Dan Jurafsky
2
Outline
  • Questions about Assignment 4 and Assignment 5?
  • Remember to send me presentation slides by
    Sunday, March 29, 11:59 pm
  • Syllabus
  • Text-to-Speech Introduction

3
Applications of Speech Synthesis/Text-to-Speech
(TTS)
  • Games
  • Telephone-based Information (directions, air
    travel, banking, etc)
  • Eyes-free (in car)
  • Education (Reading tutors, L2)
  • Services for the hearing impaired
  • Reading email aloud

4
History: The 1936 UK Speaking Clock
From http://web.ukonline.co.uk/freshwater/clocks/spkgclock.htm
5
The UK Speaking Clock
  • July 24, 1936
  • Photographic storage on 4 glass disks
  • 2 disks for minutes, 1 for hours, 1 for seconds
  • Other words in sentence distributed across 4
    disks, so all 4 used at once.
  • Voice of Miss J. Cain

6
A technician adjusts the amplifiers of the first
speaking clock
From http://web.ukonline.co.uk/freshwater/clocks/spkgclock.htm
7
TTS Demos (all are Unit-Selection)
  • AT&T
  • http://www.research.att.com/ttsweb/tts/demo.php
  • IBM
  • http://www-306.ibm.com/software/pervasive/tech/demos/tts.shtml
  • Cepstral
  • http://www.cepstral.com/cgi-bin/demos/general
  • Rhetorical (Scansoft)
  • http://www.rhetorical.com/cgi-bin/demo.cgi
  • Festival
  • http://www-2.cs.cmu.edu/~awb/festival_demos/index.html

8
ARPAbet Vowels
2009-11-17
Speech and Language Processing, Jurafsky and Martin
9
Brief Historical Interlude
  • Pictures and some text from Hartmut Traunmüller's
    web site
  • http://www.ling.su.se/staff/hartmut/kemplne.htm
  • Von Kempelen 1780 (b. Bratislava 1734, d. Vienna
    1804)
  • Leather resonator manipulated by the operator to
    copy vocal tract configuration during sonorants
    (vowels, glides, nasals)
  • Bellows provided air stream, counterweight
    provided inhalation
  • Vibrating reed produced periodic pressure wave

10
Von Kempelen
  • Small whistles controlled consonants
  • Rubber mouth and nose; nose had to be covered
    with two fingers for non-nasals
  • Unvoiced sounds: mouth covered, auxiliary bellows
    driven by string provides puff of air

From Traunmüller's web site
11
Modern TTS systems
  • 1960s: first full TTS, Umeda et al. (1968)
  • 1970s
  • Joe Olive 1977: concatenation of linear-prediction
    diphones
  • Speak and Spell
  • 1980s
  • 1979: MIT MITalk (Allen, Hunnicutt, Klatt)
  • 1990s-present
  • Diphone synthesis
  • Unit selection synthesis

12
2. Overview of TTS: Architectures of Modern
Synthesis
  • Articulatory Synthesis: model movements of
    articulators and acoustics of vocal tract
  • Formant Synthesis: start with acoustics, create
    rules/filters to create each formant
  • Concatenative Synthesis: use databases of stored
    speech to assemble new utterances

Text from Richard Sproat slides
13
Fundamental Components

Diagram: TTS System with stages words, Text Pre-processing, Prosody, Concatenation
14
Development Tools
  • FreeTTS
  • http://freetts.sourceforge.net/docs/index.php
  • Festival
  • http://festvox.org/festival/

15
Festival
  • Open source speech synthesis system
  • Designed for development and runtime use
  • Used in many commercial and academic systems
  • Distributed with RedHat 9.x
  • Hundreds of thousands of users
  • Multilingual
  • No built-in language
  • Designed to allow addition of new languages
  • Additional tools for rapid voice development
  • Statistical learning tools
  • Scripts for building models

Text from Richard Sproat
16
Festival as software
  • http://festvox.org/festival/
  • General system for multi-lingual TTS
  • C/C++ code with Scheme scripting language
  • General replaceable modules
  • Lexicons, LTS, duration, intonation, phrasing,
    POS tagging, tokenizing, diphone/unit selection,
    signal processing
  • General tools
  • Intonation analysis (f0, Tilt), signal
    processing, CART building, N-gram, SCFG, WFST

Text from Richard Sproat
17
Festival as software
  • http://festvox.org/festival/
  • No fixed theories
  • New languages without new C code
  • Multiplatform (Unix/Windows)
  • Full sources in distribution
  • Free software

Text from Richard Sproat
18
Getting help
  • Online manual
  • http://festvox.org/docs/manual-1.4.3
  • Alt-h (or Esc-h) on current symbol: short help
  • Alt-s (or Esc-s): speak help
  • Alt-m: go to man page
  • Use TAB key for completion

19
Converting from words to phones
  • Two methods
  • Dictionary-based
  • Rule-based (Letter-to-Sound, LTS)
  • Early systems, all LTS
  • MITalk was radical in having huge 10K word
    dictionary
  • Now systems use a combination
  • CMU dictionary: 127K words
  • http://www.speech.cs.cmu.edu/cgi-bin/cmudict

20
Two steps
  • PG&E will file schedules on April 20.
  • TEXT ANALYSIS: Text into intermediate
    representation
  • WAVEFORM SYNTHESIS: From the intermediate
    representation into waveform

21
The Hourglass
22
Rules for end-of-utterance detection
  • A dot with one or two letters is an abbrev
  • A dot with 3 cap letters is an abbrev.
  • An abbrev followed by 2 spaces and a capital
    letter is an end-of-utterance
  • Non-abbrevs followed by capitalized word are
    breaks
  • This fails for
  • Cog. Sci. Newsletter
  • Lots of cases at end of line.
  • Badly spaced/capitalized sentences
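
The rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, "letters" is taken to mean alphabetic characters, and a token counts as a candidate break only if it ends in a dot and the next token is capitalized. It is not the Festival implementation.

```python
import re

def is_abbrev(token: str) -> bool:
    """A dot after one or two letters, or after 3 capital letters, marks an abbrev."""
    if not token.endswith("."):
        return False
    stem = token[:-1]
    if not stem.isalpha():
        return False
    return len(stem) <= 2 or (len(stem) == 3 and stem.isupper())

def split_utterances(text: str) -> list[str]:
    # Keep the whitespace runs so the "2 spaces" rule can inspect them.
    pieces = re.split(r"(\s+)", text.strip())
    tokens, gaps = pieces[0::2], pieces[1::2]
    utterances, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if i == len(tokens) - 1:
            break
        next_is_capitalized = tokens[i + 1][:1].isupper()
        if not tok.endswith(".") or not next_is_capitalized:
            continue
        # Abbrevs need 2 spaces before the capital; non-abbrevs always break.
        if not is_abbrev(tok) or gaps[i] == "  ":
            utterances.append(" ".join(current))
            current = []
    if current:
        utterances.append(" ".join(current))
    return utterances
```

On "Cog. Sci. Newsletter" this sketch splits after every token, which is exactly the failure case listed on the slide.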

From Alan Black lecture notes
23
Decision Tree: is a word end-of-utterance?
24
Learning Decision Trees
  • DTs are rarely built by hand
  • Hand-building only possible for very simple
    features, domains
  • Lots of algorithms for DT induction

25
Next Step: Identify Types of Tokens, and Convert
Tokens to Words
  • Pronunciation of numbers often depends on type
  • 1776 as a date: seventeen seventy six
  • 1776 as a phone number: one seven seven six
  • 1776 as a quantifier: one thousand seven hundred
    (and) seventy six
  • 25 as a day: twenty-fifth
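
A minimal sketch of type-dependent expansion, assuming the token type has already been decided by an earlier classifier. The type labels ("year", "digits", "cardinal") and the function names are illustrative only, the range is limited to 0-9999, and ordinals such as "twenty-fifth" are left out.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def cardinal(n: int) -> str:
    """Spell out 0..9999 as a cardinal number."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + ("" if rest == 0 else " " + cardinal(rest))
    thousands, rest = divmod(n, 1000)
    return ONES[thousands] + " thousand" + ("" if rest == 0 else " " + cardinal(rest))

def expand_number(token: str, kind: str) -> str:
    """Expand a digit string according to its (already known) token type."""
    if kind == "year":
        # Four-digit years are read as two two-digit numbers.
        return cardinal(int(token[:2])) + " " + cardinal(int(token[2:]))
    if kind == "digits":
        # Phone numbers and similar tokens are read digit by digit.
        return " ".join(ONES[int(d)] for d in token)
    if kind == "cardinal":
        return cardinal(int(token))
    raise ValueError(f"unknown token type: {kind}")
```

The year branch is deliberately naive: a year such as 1905 would need an extra rule for the "oh five" reading.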

26
Classify token into 1 of 20 types
  • EXPN: abbreviation, contractions (adv, N.Y., mph,
    gov't)
  • LSEQ: letter sequence (CIA, D.C., CDs)
  • ASWD: read as word, e.g. CAT, proper names
  • MSPL: misspelling
  • NUM: number (cardinal) (12, 45, 1/2, 0.6)
  • NORD: number (ordinal), e.g. May 7, 3rd, Bill
    Gates II
  • NTEL: telephone (or part), e.g. 212-555-4523
  • NDIG: number as digits, e.g. Room 101
  • NIDE: identifier, e.g. 747, 386, I5, PC110
  • NADDR: number as street address, e.g. 5000
    Pennsylvania
  • NZIP, NTIME, NDATE, NYER, MONEY, BMONEY,
    PRCT, URL, etc.
  • SLNT: not spoken (KENT*REALTY)

27
Dictionaries aren't always sufficient
  • Unknown words
  • Their number seems to grow linearly with the
    number of words in unseen text
  • Mostly person, company, product names
  • But also foreign words, etc.
  • So commercial systems have a 3-part system
  • Big dictionary
  • Special code for handling names
  • Machine learned LTS system for other unknown words

28
Improvements
  • Take names out of the training data
  • And acronyms
  • Detect both of these separately
  • And build special-purpose tools to do LTS for
    names and acronyms
  • Names
  • Can do morphology (Walters -> Walter, Lucasville)
  • Can write stress-shifting rules (Jordan ->
    Jordanian)
  • Rhyme analogy: Plotsky by analogy with Trotsky
    (replace tr with pl)
  • Liberman and Church: for 250K most common names,
    got 212K (85%) from these modified-dictionary
    methods, used LTS for rest.

29
Text Pre-Processing (Block Diagram)
Diagram: Word Segmenter, Acronym Converter, Number Converter, Word to Diphone Translator (Phonetization), MLDS, Diphone Dictionary
30
Text Normalization
  • Analysis of raw text into pronounceable words
  • Sample problems
  • He stole $100 million from the bank
  • It's 13 St. Andrews St.
  • The home page is http://www.cs.qc.cuny.edu
  • yes, see you the following tues, that's 09/23/08
  • Steps
  • Identify tokens in text
  • Chunk tokens into reasonably sized sections
  • Map tokens to words
  • Identify types for words

31
Number Converter
  • Replace numerals with their textual versions
  • 100 -> one hundred
  • Handle fractional and decimal numbers
  • 0.25 -> point two five

32
Acronym Converter
  • Replace acronyms with single letter components
  • A.B.C. -> A B C
  • Change abbreviations to full textual format
  • Mr. -> Mister
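
Both conversions can be sketched as a table lookup plus a pattern match. The abbreviation table below is a tiny hypothetical sample, and the function name is illustrative. Ambiguous abbreviations such as "St." (Saint or Street) are deliberately left out, since they need context.

```python
import re

# Hypothetical sample table; a real system would use a much larger one.
ABBREVIATIONS = {"Mr.": "Mister", "Dr.": "Doctor"}

# Two or more single letters, each followed by a dot, e.g. "A.B.C."
DOTTED_ACRONYM = re.compile(r"(?:[A-Za-z]\.){2,}")

def convert_token(token: str) -> str:
    """Expand a known abbreviation or spell out a dotted acronym."""
    if token in ABBREVIATIONS:
        return ABBREVIATIONS[token]
    if DOTTED_ACRONYM.fullmatch(token):
        # Drop the dots and separate the letters with spaces.
        return " ".join(token.replace(".", ""))
    return token
```

Any token that matches neither case is passed through unchanged for the later stages to handle.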

33
Word Segmenter
  • Divide sentence into word segments
  • Special delimiter to separate segments (i.e.
    )
  • Segments can be
  • A single word
  • An acronym
  • A numeral
  • Identify punctuation marks

34
2. Homograph disambiguation
19 most frequent homographs, from Liberman and
Church
  • use 319
  • increase 230
  • close 215
  • record 195
  • house 150
  • contract 143
  • lead 131
  • live 130
  • lives 105
  • protest 94

  • survey 91
  • project 90
  • separate 87
  • present 80
  • read 72
  • subject 68
  • rebel 48
  • finance 46
  • estimate 46

Not a huge problem, but still important
35
POS Tagging for homograph disambiguation
  • Many homographs can be distinguished by POS
  • use: y uw s / y uw z
  • close: k l ow s / k l ow z
  • house: h aw s / h aw z
  • live: l ay v / l ih v
  • REcord / reCORD
  • INsult / inSULT
  • OBject / obJECT
  • OVERflow / overFLOW
  • DIScount / disCOUNT
  • CONtent / conTENT
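
Once a POS tagger has labelled the word, disambiguation reduces to a lookup keyed on (word, POS). The sketch below assumes a tagger that emits coarse tags named NOUN, VERB and ADJ. Those tag names, the function name, and the assignment of each pronunciation to a tag are illustrative additions, not something the slide specifies.

```python
# (word, coarse POS tag) -> ARPAbet pronunciation
HOMOGRAPHS = {
    ("use", "NOUN"): "y uw s",
    ("use", "VERB"): "y uw z",
    ("close", "ADJ"): "k l ow s",
    ("close", "VERB"): "k l ow z",
    ("house", "NOUN"): "h aw s",
    ("house", "VERB"): "h aw z",
    ("live", "ADJ"): "l ay v",
    ("live", "VERB"): "l ih v",
}

def pronounce_homograph(word: str, pos: str):
    """Return the pronunciation for this POS, or None if the pair is not listed."""
    return HOMOGRAPHS.get((word.lower(), pos))
```

A None result means the caller should fall back to the ordinary dictionary or LTS lookup.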

36
3. Letter-to-Sound: Getting from words to phones
  • Two methods
  • Dictionary-based
  • Rule-based (Letter-to-Sound, LTS)
  • Early systems, all LTS
  • MITalk was radical in having huge 10K word
    dictionary
  • Now systems use a combination

37
Names
  • Big problem area is names
  • Names are common
  • 20% of tokens in typical newswire text will be
    names
  • 1987 Donnelly list (72 million households)
    contains about 1.5 million names
  • Personal names: McArthur, D'Angelo, Jiminez,
    Rajan, Raghavan, Sondhi, Xu, Hsu, Zhang, Chang,
    Nguyen
  • Company/Brand names: Infinit, Kmart, Cytyc,
    Medamicus, Inforte, Aaon, Idexx Labs, Bebe

38
Names
  • Methods
  • Can do morphology (Walters -> Walter, Lucasville)
  • Can write stress-shifting rules (Jordan ->
    Jordanian)
  • Rhyme analogy: Plotsky by analogy with Trotsky
    (replace tr with pl)
  • Liberman and Church: for 250K most common names,
    got 212K (85%) from these modified-dictionary
    methods, used LTS for rest.
  • Can do automatic country detection (from letter
    trigrams) and then do country-specific rules
  • Can train g2p system specifically on names
  • Or specifically on types of names (brand names,
    Russian names, etc)

39
Acronyms
  • We saw above
  • Use machine learning to detect acronyms
  • EXPN
  • ASWORD
  • LETTERS
  • Use acronym dictionary, hand-written rules to
    augment

40
Letter-to-Sound Rules
  • Earliest algorithms: handwritten
    Chomsky+Halle-style rules
  • Festival version of such LTS rules
  • ( LEFTCONTEXT [ ITEMS ] RIGHTCONTEXT = NEWITEMS )
  • Example
  • ( # [ c h ] C = k )
  • ( # [ c h ] = ch )
  • # denotes beginning of word
  • C means all consonants
  • Rules apply in order
  • christmas pronounced with k
  • But word with ch followed by non-consonant
    pronounced ch
  • E.g., choice
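
To make the "rules apply in order" point concrete, here is a small Python sketch of an ordered rule matcher. It is not Festival's Scheme implementation. The rule encoding as Python tuples, the use of `#` for a word boundary and `C` for any consonant, and the choice to copy unmatched letters through unchanged are all assumptions made for the example.

```python
VOWELS = set("aeiou")

# Each rule: (left context, items, right context, new items), tried in order.
RULES = [
    (["#"], ["c", "h"], ["C"], ["k"]),   # word-initial "ch" before a consonant
    (["#"], ["c", "h"], [], ["ch"]),     # any other word-initial "ch"
]

def context_matches(pattern, actual):
    """Compare a context pattern to letters; 'C' stands for any consonant."""
    if len(pattern) != len(actual):
        return False
    for symbol, letter in zip(pattern, actual):
        if symbol == "C":
            if not letter.isalpha() or letter in VOWELS:
                return False
        elif symbol != letter:
            return False
    return True

def letters_to_sounds(word, rules=RULES):
    letters = ["#"] + list(word.lower()) + ["#"]
    output = []
    i = 1
    while i < len(letters) - 1:
        for left, items, right, new in rules:
            start, end = i - len(left), i + len(items)
            if start < 0:
                continue
            if (letters[i:end] == items
                    and context_matches(left, letters[start:i])
                    and context_matches(right, letters[end:end + len(right)])):
                output.extend(new)   # first matching rule wins
                i = end
                break
        else:
            output.append(letters[i])  # no rule matched: copy the letter
            i += 1
    return output
```

Swapping the two rules would make "christmas" come out with ch, because the more general rule would then match first.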

41
Stress rules in hand-written LTS
  • English: famously evil. One from Allen et al. 1987
  • Where X must contain all prefixes
  • Assign 1-stress to the vowel in a syllable
    preceding a weak syllable followed by a
    morpheme-final syllable containing a short vowel
    and 0 or more consonants (e.g. difficult)
  • Assign 1-stress to the vowel in a syllable
    preceding a weak syllable followed by a
    morpheme-final vowel (e.g. oregano)
  • etc

42
Modern method: Learning LTS rules automatically
  • Induce LTS from a dictionary of the language
  • Black et al. 1998
  • Applied to English, German, French
  • Two steps
  • alignment
  • (CART-based) rule-induction

43
Alignment
  • Letters: c  h  e  c  k  e  d
  • Phones:  ch _  eh _  k  _  t
  • Black et al. Method 1
  • First scatter epsilons in all possible ways to
    cause letters and phones to align
  • Then collect stats for P(phone | letter) and select
    best to generate new stats
  • This is iterated a number of times until it
    settles (5-6 iterations)
  • This is the EM (expectation maximization) algorithm
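
The loop described above can be sketched as follows. This is a simplified illustration, not Black et al.'s code: it assumes each letter maps to at most one phone (so a word never has more phones than letters), writes epsilon as "_", and re-estimates from the single best alignment per word in each round. The function names and the tiny lexicon format are hypothetical.

```python
from collections import Counter
from itertools import combinations

EPS = "_"

def alignments(letters, phones):
    """Yield every way to scatter epsilons so phones line up with letters."""
    n = len(letters)
    for slots in combinations(range(n), len(phones)):
        row = [EPS] * n
        for slot, phone in zip(slots, phones):
            row[slot] = phone
        yield tuple(row)

def train_alignments(lexicon, iterations=5):
    """lexicon: list of (letter string, phone list). Returns best row per word."""
    # Initial stats: count (letter, phone) pairs over all candidate alignments.
    counts = Counter()
    for letters, phones in lexicon:
        for row in alignments(letters, phones):
            counts.update(zip(letters, row))
    best = {}
    for _ in range(iterations):
        totals = Counter()
        for (letter, _phone), c in counts.items():
            totals[letter] += c

        def score(letters, row):
            # Product of the current estimates of P(phone | letter).
            p = 1.0
            for pair in zip(letters, row):
                p *= counts[pair] / totals[pair[0]]
            return p

        new_counts = Counter()
        for letters, phones in lexicon:
            row = max(alignments(letters, phones), key=lambda r: score(letters, r))
            best[letters] = row
            new_counts.update(zip(letters, row))   # select best, generate new stats
        counts = new_counts
    return best
```

For "checked" with phones ch eh k t there are 35 candidate alignments, and the one shown on the slide is among them. With a one-word lexicon the statistics are too sparse to prefer it, which is why the method is trained on a whole dictionary.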

44
Alignment
  • Some alignments will turn out to be really bad.
  • These are just the cases where pronunciation
    doesn't match letters
  • Dept: d ih p aa r t m ah n t
  • CMU: s iy eh m y uw
  • Lieutenant: l eh f t eh n ax n t (British)
  • Also foreign words
  • These can just be removed from alignment training

45
TTS Prosody

Diagram labels: done (yes/no), Acoustic Manipulation, MLDS, Diphone Retrieval, Concatenation, Diphone Database
46
Prosody: from words+phones to boundaries, accent,
F0, duration
  • Prosodic phrasing
  • Need to break utterances into phrases
  • Punctuation is useful, not sufficient
  • Accents
  • Prediction of accents: which syllables should be
    accented
  • Realization of F0 contour: given accents/tones,
    generate F0 contour
  • Duration
  • Predicting duration of each phone

47
Defining Intonation
  • Ladd (1996), "Intonational phonology"
  • The use of suprasegmental phonetic features
  • Suprasegmental = above and beyond the
    segment/phone
  • F0
  • Intensity (energy)
  • Duration
  • to convey sentence-level pragmatic meanings
  • i.e. meanings that apply to phrases or utterances
    as a whole, not lexical stress, not lexical tone.

48
Three aspects of prosody
  • Prominence: some syllables/words are more
    prominent than others
  • Structure/boundaries: sentences have prosodic
    structure
  • Some words group naturally together
  • Others have a noticeable break or disjuncture
    between them
  • Tune: the intonational melody of an utterance.

From Ladd (1996)
49
Prosodic Prominence: Pitch Accents
  • A: What types of foods are a good source of
    vitamins?
  • B1: Legumes are a good source of VITAMINS.
  • B2: LEGUMES are a good source of vitamins.
  • Prominent syllables are
  • Louder
  • Longer
  • Have higher F0 and/or sharper changes in F0
    (higher F0 velocity)

Slide from Jennifer Venditti
50
Stress vs. accent (2)
  • The speaker decides to make the word vitamin more
    prominent by accenting it.
  • Lexical stress tells us that this prominence will
    appear on the first syllable, hence VItamin.

51
Which word receives an accent?
  • It depends on the context. For example, the new
    information in the answer to a question is often
    accented, while the old information usually is
    not.
  • Q1: What types of foods are a good source of
    vitamins?
  • A1: LEGUMES are a good source of vitamins.
  • Q2: Are legumes a source of vitamins?
  • A2: Legumes are a GOOD source of vitamins.
  • Q3: I've heard that legumes are healthy, but what
    are they a good source of?
  • A3: Legumes are a good source of VITAMINS.

Slide from Jennifer Venditti
52
Factors in accent prediction
  • Part of speech
  • Content words are usually accented
  • Function words are rarely accented
  • Of, for, in, on, that, the, a, an, no, to, and,
    but, or, will, may, would, can, her, is, their,
    its, our, there, is, am, are, was, were, etc.
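
As a baseline, the part-of-speech factor alone can be approximated with a function-word list: accent everything that is not on the list. The sketch below uses the words listed above. The function name and the punctuation stripping are illustrative choices, and this baseline ignores context (new versus old information) entirely.

```python
FUNCTION_WORDS = set(
    "of for in on that the a an no to and but or will may would can her "
    "is their its our there am are was were".split()
)

def predict_accents(words):
    """Return (word, accented?) pairs: content words True, function words False."""
    result = []
    for word in words:
        key = word.lower().strip(".,?!")
        result.append((word, key not in FUNCTION_WORDS))
    return result
```

On "Legumes are a good source of vitamins" this accents legumes, good, source and vitamins, which over-accents compared with any of the context-dependent answers on the previous slide.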

53
Complex Noun Phrase Structure
  • Sproat, R. 1994. English noun-phrase accent
    prediction for text-to-speech. Computer Speech
    and Language 8:79-94.
  • Proper names: stress on right-most word
  • New York CITY; Paris, FRANCE
  • Adjective-Noun combinations: stress on noun
  • Large HOUSE, red PEN, new NOTEBOOK
  • Noun-Noun compounds: stress on left noun
  • HOTdog (food) versus HOT DOG (overheated animal)
  • WHITE house (place) versus WHITE HOUSE (made of
    stucco)
  • Examples
  • MEDICAL Building, APPLE cake, cherry PIE.
  • What about Madison avenue, Park street?
  • Some rules
  • Furniture+Room -> RIGHT (e.g., kitchen TABLE)
  • Proper-name Street -> LEFT (e.g. PARK street)

54
Levels of prominence
  • Most phrases have more than one accent
  • The last accent in a phrase is perceived as more
    prominent
  • Called the Nuclear Accent
  • Emphatic accents like nuclear accent often used
    for semantic purposes, such as indicating that a
    word is contrastive, or the semantic focus.
  • The kind of thing you represent via *s in IM,
    or capitalized letters
  • "I know SOMETHING interesting is sure to
    happen," she said to herself.
  • Can also have words that are less prominent than
    usual
  • Reduced words, especially function words.
  • Often use 4 classes of prominence
  • emphatic accent,
  • pitch accent,
  • unaccented,
  • reduced

55
Yes-No question
are legumes a good source of VITAMINS
Rise from the main accent to the end of the
sentence.
Slide from Jennifer Venditti
56
Surprise-redundancy tune
How many times do I have to tell you ...
legumes are a good source of vitamins
Low beginning followed by a gradual rise to a
high at the end.
Slide from Jennifer Venditti
57
Contradiction tune
I've heard that linguini is a good source of
vitamins.
linguini isn't a good source of vitamins
... how could you think that?
Sharp fall at the beginning, flat and low, then
rising at the end.
Slide from Jennifer Venditti
58
Duration
  • Simplest
  • fixed size for all phones (100 ms)
  • Next simplest
  • average duration for that phone (from training
    data). Samples from SWBD, in ms:
  • aa 118, b 68
  • ax 59, d 68
  • ay 138, dh 44
  • eh 87, f 90
  • ih 77, g 66
  • Next Next Simplest
  • add in phrase-final and initial lengthening plus
    stress
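
The three levels can be sketched as one function. The per-phone means are the SWBD samples from this slide and the 100 ms default is the fixed size mentioned first. The multiplicative lengthening factors are made-up placeholder values chosen only for illustration, and a real system would estimate them from data.

```python
# Mean durations in ms (SWBD samples from the slide).
MEAN_MS = {"aa": 118, "ax": 59, "ay": 138, "eh": 87, "ih": 77,
           "b": 68, "d": 68, "dh": 44, "f": 90, "g": 66}
DEFAULT_MS = 100          # simplest model: fixed size for all phones

# Placeholder factors, not measured values.
STRESS_FACTOR = 1.2
PHRASE_INITIAL_FACTOR = 1.1
PHRASE_FINAL_FACTOR = 1.4

def phone_duration_ms(phone, stressed=False, phrase_initial=False,
                      phrase_final=False):
    """Mean duration for the phone, lengthened by stress and phrase position."""
    ms = float(MEAN_MS.get(phone, DEFAULT_MS))
    if stressed:
        ms *= STRESS_FACTOR
    if phrase_initial:
        ms *= PHRASE_INITIAL_FACTOR
    if phrase_final:
        ms *= PHRASE_FINAL_FACTOR
    return ms
```

Phones missing from the table fall back to the fixed 100 ms, so the function degrades to the simplest model when no statistics are available.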

59
Intermediate representation: using Festival
  • Do you really want to see all of it?

60
Waveform Synthesis
  • Given
  • String of phones
  • Prosody
  • Desired F0 for entire utterance
  • Duration for each phone
  • Stress value for each phone, possibly accent
    value
  • Generate
  • Waveforms

61
Diphone TTS architecture
  • Training
  • Choose units (kinds of diphones)
  • Record 1 speaker saying 1 example of each diphone
  • Mark the boundaries of each diphone,
  • cut each diphone out and create a diphone
    database
  • Synthesizing an utterance,
  • grab relevant sequence of diphones from database
  • Concatenate the diphones, doing slight signal
    processing at boundaries
  • use signal processing to change the prosody (F0,
    energy, duration) of selected sequence of diphones
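
The retrieval and concatenation steps can be sketched as below. The diphone naming scheme ("a-b"), the "pau" silence label used to pad the utterance, and the database as a dict from diphone name to a list of samples are all assumptions for the example. The boundary smoothing and prosody modification steps are omitted.

```python
def to_diphones(phones, silence="pau"):
    """Turn a phone string into diphone names, padded with silence at both ends."""
    padded = [silence] + list(phones) + [silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

def synthesize(phones, database):
    """Grab each diphone's samples from the database and concatenate them."""
    samples = []
    for name in to_diphones(phones):
        samples.extend(database[name])   # KeyError if a diphone was never recorded
    return samples
```

A phone string of length n always yields n + 1 diphones, each spanning the transition from the middle of one phone to the middle of the next.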

62
Recent stuff
  • Problems with Unit Selection Synthesis
  • Can't modify signal
  • (mixing modified and unmodified sounds bad)
  • But database often doesn't have exactly what you
    want
  • Solution: HMM (Hidden Markov Model) Synthesis
  • Won recent TTS bakeoff.
  • Sounds less natural to researchers
  • But naïve subjects preferred it
  • Has the potential to improve on both diphone and
    unit selection.

63
Summary
  • ARPAbet
  • TTS Architectures
  • TTS Components
  • Text Analysis
  • Text Normalization
  • Homograph Disambiguation
  • Grapheme-to-Phoneme (Letter-to-Sound)
  • Intonation
  • Waveform Generation
  • Diphones
  • Unit Selection
  • HMM
