Speech Processing: Text to Speech Synthesis

Slides: 111

Provided by: KathyM151

1
Speech Processing: Text to Speech Synthesis

August 11, 2005

2
CS 224S / LINGUIST 236: Speech Recognition and
Synthesis

Dan Jurafsky

Lecture 3: TTS Overview, History, and
Letter-to-Sound
IP Notice: lots of info, text, and diagrams on
these slides come (thanks!) from Alan Black's
excellent lecture notes and from Richard Sproat's
great new slides.
3

Text-to-speech is defined as the automatic
production of speech through a
grapheme-to-phoneme transcription of the
sentences to be uttered.
Text Analyzer
Normalization
Morphological Analyzer
Contextual Analyzer
Syntactic-Prosodic Parser
Grapheme to phoneme rules
Prosody Generator

Dictionary based G2P

Rule based G2P

5
Text Normalization

Analysis of raw text into pronounceable words
Sample problems:
The robbers stole Rs 100 lakhs from the bank
It's 13/4 Modern Ave.
The home page is
http://www.facweb.iitkgp.ernet.in/sudeshna/
yes, see you the following tues, that's 23/08/05
Steps:
Identify tokens in text
Chunk tokens into reasonably sized sections
Map tokens to words
Identify types for words
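
The steps above can be sketched in Python. This is only an illustration, not Festival's tokenizer: the token type names, the regular expressions, and the tiny abbreviation table are assumptions chosen to cover the sample problems on this slide.

```python
import re

# Hypothetical abbreviation table; a real system would use a far larger one.
ABBREVIATIONS = {"Rs": "rupees", "Ave.": "avenue", "tues": "tuesday"}

def identify_tokens(text):
    """Split raw text into whitespace-delimited tokens."""
    return text.split()

def token_type(token):
    """Assign a coarse type to a token (simplified)."""
    core = token.rstrip(",")
    if re.fullmatch(r"\d{1,2}/\d{1,2}/\d{2,4}", core):
        return "date"
    if re.fullmatch(r"\d+/\d+", core):
        return "number_pair"  # e.g. a house number such as 13/4
    if re.fullmatch(r"\d+", core):
        return "number"
    if core.startswith("http"):
        return "url"
    if core in ABBREVIATIONS:
        return "abbreviation"
    return "word"

def normalize(text):
    """Return (token, type, words) triples; only abbreviations are expanded."""
    result = []
    for token in identify_tokens(text):
        kind = token_type(token)
        core = token.rstrip(",")
        words = ABBREVIATIONS[core] if kind == "abbreviation" else core
        result.append((token, kind, words))
    return result
```

Dates, numbers, and URLs are only typed here; turning "23/08/05" into spoken words needs a separate expansion rule per type.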

6
Grapheme to Phoneme

How to pronounce a word? Look in dictionary! But:
Unknown words and names will be missing
Turkish, German, and other hard languages
uygarlaStIramadIklarImIzdanmISsInIzcasIna
"(behaving) as if you are among those whom we
could not civilize"
Morphemes: uygar laS tIr ama dIk lar ImIz dan mIS
sInIz casIna
Glosses: civilized bec caus NegAble ppart pl p1pl
abl past 2pl AsIf
So need Letter to Sound Rules
Also homograph disambiguation (wind, live, read)

7
Grapheme to Phoneme in Indian languages

Hindi: does not need a dictionary. Letter-to-sound
rules can capture the pronunciation of most
words.
Bengali: harder than Hindi, but can mostly be
handled using rules and a list of exceptions.

8
Prosody: from words+phones to boundaries, accent,
F0, duration

The term prosody refers to certain properties of
the speech signal which are related to audible
changes in pitch, loudness, syllable length.
Prosodic phrasing
Need to break utterances into phrases
Punctuation is useful, not sufficient
Accents
Prediction of accents: which syllables should be
accented
Realization of F0 contour: given accents/tones,
generate F0 contour
Duration
Predicting duration of each phone

9
Waveform synthesis: from segments, F0, duration
to waveform

Collecting diphones
need to record diphones in correct contexts
l sounds different in onset than coda, t is
flapped sometimes, etc.
need quiet recording room
then need to label them very very exactly
Unit selection: how to pick the right unit?
Search
Joining the units
dumb (just stick'em together)
PSOLA (Pitch-Synchronous Overlap and Add)
MBROLA (Multi-Band Resynthesis Overlap-Add)

10
Festival

Open source speech synthesis system
Designed for development and runtime use
Used in many commercial and academic systems
Distributed with RedHat 9.x
Hundreds of thousands of users
Multilingual
No built-in language
Designed to allow addition of new languages
Additional tools for rapid voice development
Statistical learning tools
Scripts for building models

Text from Richard Sproat
11
Festival as software

http://festvox.org/festival/
General system for multi-lingual TTS
C/C++ code with Scheme scripting language
General replaceable modules
Lexicons, LTS, duration, intonation, phrasing,
POS tagging, tokenizing, diphone/unit selection,
signal processing
General tools
Intonation analysis (f0, Tilt), signal
processing, CART building, N-gram, SCFG, WFST

Text from Richard Sproat
12
Festival as software

http://festvox.org/festival/
No fixed theories
New languages without new C code
Multiplatform (Unix/Windows)
Full sources in distribution
Free software

Text from Richard Sproat
13
CMU FestVox project

Festival is an engine; how do you make voices?
Festvox: building synthetic voices
Tools, scripts, documentation
Discussion and examples for building voices
Example voice databases
Step by step walkthroughs of processes
Support for English and other languages
Support for different waveform synthesis methods
Diphone
Unit selection
Limited domain

Text from Richard Sproat
14
Synthesis tools

I want my computer to talk
Festival Speech Synthesis
I want my computer to talk in my voice
FestVox Project
I want it to be fast and efficient
Flite

Text from Richard Sproat
15
Using Festival

How to get Festival to talk
Scheme (Festival's scripting language)
Basic Festival commands

Text from Richard Sproat
16
Getting it to talk

Say a file
festival --tts file.txt
From Emacs
say region, say buffer
Command line interpreter
festival> (SayText "hello")

Text from Richard Sproat
17
Quick Intro to Scheme

Scheme is a dialect of LISP
Expressions are
atoms or
lists
a bcd "hello world" 12.3
(a b c)
(a (1 2) seven)
Interpreter evaluates expressions
Atoms evaluate as variables
Lists evaluate as function calls
bxx
3.14
(+ 2 3)

Text from Richard Sproat
18
Quick Intro to Scheme

Setting variables
(set! a 3.14)
Defining functions
(define (timestwo n) (* 2 n))
(timestwo a)
6.28

Text from Richard Sproat
19
Lists in Scheme

festival> (set! alist '(apples pears bananas))
(apples pears bananas)
festival> (car alist)
apples
festival> (cdr alist)
(pears bananas)
festival> (set! blist (cons 'oranges alist))
(oranges apples pears bananas)
festival> append alist blist
#<SUBR(6) append>
(apples pears bananas)
(oranges apples pears bananas)
festival> (append alist blist)
(apples pears bananas oranges apples pears
bananas)
festival> (length alist)
3
festival> (length (append alist blist))
7

Text from Richard Sproat
20
Scheme speech

Make an utterance of type text
festival> (set! utt1 (Utterance Text "hello"))
#<Utterance 0xf6855718>
Synthesize an utterance
festival> (utt.synth utt1)
#<Utterance 0xf6855718>
Play waveform
festival> (utt.play utt1)
#<Utterance 0xf6855718>
Do all together
festival> (SayText "This is an example")
#<Utterance 0xf6961618>

Text from Richard Sproat
21
Scheme speech

In a file
(define (SpeechPlus a b)
  (SayText
    (format nil
      "%d plus %d equals %d"
      a b (+ a b))))
Loading files
festival> (load "file.scm")
t
Do all together
festival> (SpeechPlus 2 4)
#<Utterance 0xf6961618>

Text from Richard Sproat
22
Scheme speech

(define (sp_time hour minute)
  (cond
    ((< hour 12)
     (SayText
       (format nil
         "It is %d %d in the morning"
         hour minute)))
    ((< hour 18)
     (SayText
       (format nil
         "It is %d %d in the afternoon"
         (- hour 12) minute)))
    (t
     (SayText
       (format nil
         "It is %d %d in the evening"
         (- hour 12) minute)))))

Text from Richard Sproat
23
Getting help

Online manual
http://festvox.org/docs/manual-1.4.3
Alt-h (or esc-h) on current symbol: short help
Alt-s (or esc-s) to speak help
Alt-m: go to man page
Use TAB key for completion

24
Lexicons and Lexical Entries

You can explicitly give pronunciations for words
Each language/dialect has its own lexicon
You can look up words with
(lex.lookup WORD)
You can add entries to the current lexicon
(lex.add.entry NEWENTRY)
Entry: (WORD POS (SYL0 SYL1 ...))
Syllable: ((PHONE0 PHONE1 ...) STRESS)
Example
("cepstra" n (((k eh p) 1) ((s t r aa) 0)))

25
Converting from words to phones

Two methods
Dictionary-based
Rule-based (Letter-to-sound = LTS)
Early systems: all LTS
MITalk was radical in having a huge 10K-word
dictionary
Now systems use a combination
CMU dictionary: 127K words
http://www.speech.cs.cmu.edu/cgi-bin/cmudict

26
Dictionaries aren't always sufficient

Unknown words
Their number seems to grow linearly with the
number of words in unseen text
Mostly person, company, product names
But also foreign words, etc.
So commercial systems have 3-part system
Big dictionary
Special code for handling names
Machine learned LTS system for other unknown words

27
Letter-to-Sound Rules

Festival LTS rules
( LEFTCONTEXT [ ITEMS ] RIGHTCONTEXT = NEWITEMS )
Example
( # [ c h ] C = k )
( # [ c h ] = ch )
# denotes beginning of word
C means all consonants
Rules apply in order
christmas pronounced with k
But word with ch followed by non-consonant
pronounced ch
E.g., choice
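
A minimal Python sketch of how ordered context rules like these can be applied. It is not Festival's LTS engine: the tuple encoding of the rules, the vowel set used to define "consonant", and the fall-through that copies unmatched letters unchanged are all assumptions made for illustration.

```python
VOWELS = set("aeiou")

def is_consonant(ch):
    return ch.isalpha() and ch not in VOWELS

# Each rule: (left context, items, right context, new items).
# "#" marks a word boundary and "C" stands for any consonant; "" means
# "no constraint". Rules are tried in order, so the first match wins.
RULES = [
    ("#", "ch", "C", ["k"]),
    ("#", "ch", "", ["ch"]),
]

def context_matches(symbol, ch):
    if symbol == "":
        return True
    if symbol == "C":
        return is_consonant(ch)
    return symbol == ch

def apply_rules(word):
    padded = "#" + word + "#"
    i, phones = 1, []
    while i < len(padded) - 1:
        for left, items, right, new in RULES:
            end = i + len(items)
            if (padded[i:end] == items
                    and context_matches(left, padded[i - 1])
                    and context_matches(right, padded[end])):
                phones.extend(new)
                i = end
                break
        else:
            phones.append(padded[i])  # no rule fired: pass the letter through
            i += 1
    return phones
```

With these two rules, `apply_rules("christmas")` starts with `k` and `apply_rules("choice")` starts with `ch`; swapping the rule order would make every word-initial "ch" come out as `ch`.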

28
Stress rules in LTS

English: famously evil. One from Allen et al. 1987:
V -> [1-stress] / X _ C* {Vshort C C? | V}
{Vshort C* | V}
Where X must contain all prefixes
Assign 1-stress to the vowel in a syllable
preceding a weak syllable followed by a
morpheme-final syllable containing a short vowel
and 0 or more consonants (e.g. difficult)
Assign 1-stress to the vowel in a syllable
preceding a weak syllable followed by a
morpheme-final vowel (e.g. oregano)
etc

29
Modern method: learning LTS rules automatically

Induce LTS from a dictionary of the language
Black et al. 1998
Applied to English, German, French
Two steps: alignment and (CART-based)
rule induction

30
Alignment

Letters: c h e c k e d
Phones: ch _ eh _ k _ t
Black et al. Method 1
First scatter epsilons in all possible ways to
cause letters and phones to align
Then collect stats for P(letter|phone) and select
best to generate new stats
This is iterated a number of times until it
settles (5-6 iterations)
This is the EM (expectation-maximization)
algorithm
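
The first step, scattering epsilons in all possible ways, can be sketched as follows. This covers only the enumeration, not the statistics or the EM iterations, and it assumes each letter maps to at most one phone and that there are never more phones than letters.

```python
from itertools import combinations

def scatter_epsilons(letters, phones, epsilon="_"):
    """Yield every alignment that keeps phone order and pads with epsilons."""
    n = len(letters)
    # Choose which letter positions carry a phone; the rest get an epsilon.
    for slots in combinations(range(n), len(phones)):
        aligned = [epsilon] * n
        for slot, phone in zip(slots, phones):
            aligned[slot] = phone
        yield list(zip(letters, aligned))

letters = list("checked")
phones = ["ch", "eh", "k", "t"]
alignments = list(scatter_epsilons(letters, phones))
```

For "checked" this gives 7-choose-4 = 35 candidate alignments, one of which is the alignment shown on the slide. Counting letter/phone pairs over all candidates for every dictionary word would give the initial statistics.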

31
Alignment

Black et al method 2
Hand specify which letters can be rendered as
which phones
C goes to k/ch/s/sh
W goes to w/v/f, etc
Once mapping table is created, find all valid
alignments, find P(letter|phone), score all
alignments, take best

32
Alignment

Some alignments will turn out to be really bad.
These are just the cases where pronunciation
doesn't match letters
Dept: d ih p aa r t m ah n t
CMU: s iy eh m y uw
Lieutenant: l eh f t eh n ax n t (British)
Also foreign words
These can just be removed from alignment training

33
Building CART trees

Build a CART tree for each letter in alphabet (26
plus accented) using context of +/-3 letters
# # # c h e c -> ch
c h e c k e d -> _
This produces 92-96% correct LETTER accuracy
(58-75% word accuracy) for English
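
A small sketch of how aligned letters and phones could be turned into training examples with a window of three letters on each side. The function name and the use of "#" as padding are assumptions; the CART training itself is not shown.

```python
def letter_contexts(letters, phones, width=3, pad="#"):
    """Return (window, phone) pairs, one per letter of an aligned word."""
    padded = [pad] * width + list(letters) + [pad] * width
    pairs = []
    for i, phone in enumerate(phones):
        # The window holds `width` letters either side of letter i.
        window = padded[i:i + 2 * width + 1]
        pairs.append((" ".join(window), phone))
    return pairs

pairs = letter_contexts("checked", ["ch", "_", "eh", "_", "k", "_", "t"])
```

Here `pairs[0]` is the window for the first "c" (predicting `ch`) and `pairs[3]` is the window for the second "c" (predicting the epsilon `_`).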

34
Improvements

Take names out of the training data
And acronyms
Detect both of these separately
And build special-purpose tools to do LTS for
names and acronyms
Names
Can do morphology (Walters -> Walter, Lucasville)
Can write stress-shifting rules (Jordan ->
Jordanian)
Rhyme analogy: Plotsky by analogy with Trotsky
(replace tr with pl)
Liberman and Church: for the 250K most common
names, got 212K (85%) from these
modified-dictionary methods, used LTS for the rest.

35
Speech Recognition
36
Speech Recognition

Applications of Speech Recognition (ASR)
Dictation
Telephone-based Information (directions, air
travel, banking, etc)
Hands-free (in car)
Speaker Identification
Language Identification
Second language ('L2') (accent reduction)
Audio archive searching

37
LVCSR

Large Vocabulary Continuous Speech Recognition
20,000-64,000 words
Speaker independent (vs. speaker-dependent)
Continuous speech (vs isolated-word)

38
LVCSR Design Intuition

Build a statistical model of the speech-to-words
process
Collect lots and lots of speech, and transcribe
all the words.
Train the model on the labeled speech
Paradigm: Supervised Machine Learning + Search

39
Speech Recognition Architecture
40
The Noisy Channel Model

Search through space of all possible sentences.
Pick the one that is most probable given the
waveform.

41
The Noisy Channel Model (II)

What is the most likely sentence out of all
sentences in the language L given some acoustic
input O?
Treat acoustic input O as sequence of individual
observations
O = o1, o2, o3, ..., ot
Define a sentence as a sequence of words
W = w1, w2, w3, ..., wn

42
Noisy Channel Model (III)

Probabilistic implication: pick the
highest-probability sentence
We can use Bayes' rule to rewrite this
Since denominator is the same for each candidate
sentence W, we can ignore it for the argmax
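
The equations for this slide did not survive in the transcript. A standard way to write the argument, using W for the candidate sentence and O for the acoustic input as defined on the previous slide, would be:

```latex
\hat{W} = \operatorname*{argmax}_{W \in L} P(W \mid O)
        = \operatorname*{argmax}_{W \in L} \frac{P(O \mid W)\, P(W)}{P(O)}
        = \operatorname*{argmax}_{W \in L} P(O \mid W)\, P(W)
```

The last step drops P(O) because it is the same for every candidate W.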

43
A quick derivation of Bayes Rule

Conditionals
Rearranging
And also

44
Bayes (II)

We know
So rearranging things
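
The formulas for these two slides are missing from the transcript. The usual derivation, with A and B as placeholder events (the symbols are an assumption), runs:

```latex
% Conditionals
P(A \mid B) = \frac{P(A, B)}{P(B)}, \qquad P(B \mid A) = \frac{P(A, B)}{P(A)}
% Rearranging, and also
P(A, B) = P(A \mid B)\, P(B), \qquad P(A, B) = P(B \mid A)\, P(A)
% We know both right-hand sides equal P(A, B), so rearranging things
P(A \mid B)\, P(B) = P(B \mid A)\, P(A)
\quad\Longrightarrow\quad
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```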

45
Noisy channel model
likelihood
prior
46
The noisy channel model

Ignoring the denominator leaves us with two
factors: P(Source) and P(Signal|Source)

47
Speech Architecture meets Noisy Channel
48
Five easy pieces

Feature extraction
Acoustic Modeling
HMMs, Lexicons, and Pronunciation
Decoding
Language Modeling

49
Feature Extraction

Digitize Speech
Extract Frames

50
Digitizing Speech
51
Digitizing Speech (A-D)

Sampling
measuring amplitude of signal at time t
16,000 Hz (samples/sec) Microphone (Wideband)
8,000 Hz (samples/sec) Telephone
Why?
Need at least 2 samples per cycle
max measurable frequency is half sampling rate
Human speech < 10,000 Hz, so need max 20K
Telephone filtered at 4K, so 8K is enough
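
The "half the sampling rate" limit can be checked numerically. In this sketch (function names are illustrative), a tone above the limit yields exactly the same samples as a lower tone, so the two cannot be told apart after sampling.

```python
import math

def nyquist_frequency(sample_rate_hz):
    """Highest frequency that can be measured at a given sampling rate."""
    return sample_rate_hz / 2

def sample_cosine(freq_hz, sample_rate_hz, n_samples):
    return [math.cos(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(n_samples)]

# At 8,000 samples/sec a 5,000 Hz tone lies above the 4,000 Hz limit and
# produces the same samples as a 3,000 Hz tone (it "aliases").
above_limit = sample_cosine(5000, 8000, 16)
alias = sample_cosine(3000, 8000, 16)
```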

52
Digitizing Speech (II)

Quantization
Representing real value of each amplitude as
integer
8-bit (-128 to 127) or 16-bit (-32768 to 32767)
Formats
16 bit PCM
8 bit mu-law log compression
LSB (Intel) vs. MSB (Sun, Apple)
Headers
Raw (no header)
Microsoft wav
Sun .au

40 byte header
53
Frame Extraction

A frame (25 ms wide) extracted every 10 ms

[Figure: overlapping 25 ms frames taken every
10 ms, giving feature vectors a1, a2, a3, ...]
Figure from Simon Arnfield
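
A minimal sketch of the framing step, assuming 16,000 Hz audio held in a plain Python list. Dropping the incomplete frame at the end is a simplification made here; the slides do not say how the tail is handled.

```python
def extract_frames(samples, sample_rate_hz=16000, width_ms=25, step_ms=10):
    """Cut a signal into overlapping frames of width_ms every step_ms."""
    width = sample_rate_hz * width_ms // 1000  # 400 samples at 16,000 Hz
    step = sample_rate_hz * step_ms // 1000    # 160 samples at 16,000 Hz
    frames = []
    start = 0
    while start + width <= len(samples):
        frames.append(samples[start:start + width])
        start += step
    return frames

one_second = [0.0] * 16000  # one second of silence as stand-in audio
frames = extract_frames(one_second)
```

Because the step is shorter than the width, each frame shares 15 ms with the one before it.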
54
MFCC (Mel Frequency Cepstral Coefficients)

Do FFT to get spectral information
Like the spectrogram/spectrum we saw earlier
Apply Mel scaling
Linear below 1kHz, log above, equal samples above
and below 1kHz
Models human ear: more sensitivity at lower
frequencies
Plus Discrete Cosine Transform
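
The slide does not give a formula for the Mel scale. The sketch below uses one widely used approximation, mel = 2595 log10(1 + f/700), which is an assumption here; it is roughly linear at low frequencies and logarithmic higher up, and maps 1,000 Hz to about 1,000 mel.

```python
import math

def hz_to_mel(freq_hz):
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_spaced_frequencies(low_hz, high_hz, count):
    """Frequencies in Hz that are evenly spaced on the Mel scale."""
    low, high = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high - low) / (count - 1)
    return [mel_to_hz(low + i * step) for i in range(count)]

centers = mel_spaced_frequencies(0.0, 8000.0, 10)
```

The gaps between successive `centers` widen as frequency rises, which is how a Mel filterbank gives finer resolution at lower frequencies.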

55
Final Feature Vector

39 Features per 10 ms frame
12 MFCC features
12 Delta MFCC features
12 Delta-Delta MFCC features
1 (log) frame energy
1 Delta (log) frame energy
1 Delta-Delta (log) frame energy
So each frame represented by a 39D vector
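
One way to see where the 39 values come from: start with 13 static values per frame (12 MFCCs plus log energy) and append their deltas and delta-deltas. The delta formula used here (half the difference between the neighbouring frames, with edge frames repeated) and the ordering of values within the vector are simplifying assumptions; real front ends often use a longer regression window.

```python
def deltas(frames):
    """Half the difference between the next and the previous frame."""
    out = []
    last = len(frames) - 1
    for t in range(len(frames)):
        prev, nxt = frames[max(t - 1, 0)], frames[min(t + 1, last)]
        out.append([(n - p) / 2.0 for p, n in zip(prev, nxt)])
    return out

def add_dynamic_features(static_frames):
    """static_frames: list of 13-value frames (12 MFCCs + log energy)."""
    d1 = deltas(static_frames)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static_frames, d1, d2)]

static = [[float(t)] * 13 for t in range(5)]  # made-up static features
features = add_dynamic_features(static)
```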

56
Where we are

Given a sequence of acoustic feature vectors,
one every 10 ms
Goal: output a string of words
We'll spend 6 lectures on how to do this
Rest of today:
Markov Models
Hidden Markov Models in the abstract
Forward Algorithm
Viterbi Algorithm
Start of HMMs for speech

57
Acoustic Modeling

Given a 39D vector corresponding to the
observation of one frame o_i
And given a phone q we want to detect
Compute p(o_i|q)
Most popular method
GMM (Gaussian mixture models)
Other methods
MLP (multi-layer perceptron)

58
Acoustic Modeling: MLP computes p(q|o)
59
Gaussian Mixture Models

Also called fully-continuous HMMs
P(o|q) computed by a Gaussian

60
Gaussians for Acoustic Modeling
A Gaussian is parameterized by a mean and a
variance
Different means
[Figure: P(o|q) plotted against o. P(o|q) is
highest at the mean and low very far from the
mean.]
61
Training Gaussians

A (single) Gaussian is characterized by a mean
and a variance
Imagine that we had some training data in which
each phone was labeled
We could just compute the mean and variance from
the data
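
For a single feature, "just compute the mean and variance from the data" can be sketched as below. The observation values are made up, and the variance is the maximum-likelihood estimate (dividing by N).

```python
import math

def fit_gaussian(values):
    """Estimate mean and variance from labeled observations of one phone."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance

def gaussian_pdf(o, mean, variance):
    """Density of observation o under a one-dimensional Gaussian."""
    return (math.exp(-((o - mean) ** 2) / (2 * variance))
            / math.sqrt(2 * math.pi * variance))

observations = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up feature values
mean, variance = fit_gaussian(observations)
```

The density is largest at the mean and falls off for observations far from it.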

62
But we need 39 gaussians, not 1!

The observation o is really a vector of length 39
So need a vector of Gaussians

63
Actually, mixture of gaussians

Each phone is modeled by a sum of different
gaussians
Hence able to model complex facts about the data

Phone A
Phone B
64
Gaussians acoustic modeling

Summary: each phone is represented by a GMM
parameterized by
M mixture weights
M mean vectors
M covariance matrices
Usually assume covariance matrix is diagonal
I.e. just keep separate variance for each
cepstral feature
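
A sketch of scoring one observation vector with a diagonal-covariance GMM, working in log space for numerical stability. The function names and argument layout are illustrative; with a diagonal covariance, each component is just a product of one-dimensional Gaussians, one per feature.

```python
import math

def log_diag_gaussian(o, means, variances):
    """Log density of vector o under a Gaussian with diagonal covariance."""
    total = 0.0
    for x, m, v in zip(o, means, variances):
        total += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return total

def gmm_log_likelihood(o, weights, mean_vectors, variance_vectors):
    """log p(o|q) for a phone q modeled by M weighted Gaussians."""
    logs = [math.log(w) + log_diag_gaussian(o, m, v)
            for w, m, v in zip(weights, mean_vectors, variance_vectors)]
    peak = max(logs)  # log-sum-exp: factor out the largest term
    return peak + math.log(sum(math.exp(l - peak) for l in logs))
```

In a real system `o` would have 39 entries and each phone would have its own weights, means, and variances.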

65
ASR Lexicon: Markov Models for pronunciation
66
The Hidden Markov model
67
Formal definition of HMM

States: a set of states Q = q1, q2, ..., qN
Transition probabilities: a set of probabilities
A = a01, a02, ..., an1, ..., ann.
Each aij represents P(j|i)
Observation likelihoods: a set of likelihoods
B = bi(ot), the probability that state i generated
observation ot
Special non-emitting initial and final states
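
The definition can be made concrete with a toy HMM and the forward algorithm, which sums the probability of the observations over all state paths. Everything in this sketch is illustrative: the two states, the discrete symbols "a" and "b" standing in for acoustic vectors, and the dictionaries `start` and `final` that play the role of the non-emitting initial and final states.

```python
def forward(observations, states, start, trans, emit, final):
    """Return P(observations) under the HMM, summed over all state paths."""
    # Initialisation: leave the initial state and emit the first observation.
    alpha = {j: start[j] * emit[j][observations[0]] for j in states}
    # Recursion: extend every partial path by one transition and emission.
    for o in observations[1:]:
        alpha = {j: sum(alpha[i] * trans[i][j] for i in states) * emit[j][o]
                 for j in states}
    # Termination: move into the non-emitting final state.
    return sum(alpha[i] * final[i] for i in states)

states = ["q1", "q2"]
start = {"q1": 1.0, "q2": 0.0}                  # a01, a02
trans = {"q1": {"q1": 0.5, "q2": 0.5},          # aij = P(j|i)
         "q2": {"q1": 0.0, "q2": 0.5}}
final = {"q1": 0.0, "q2": 0.5}                  # transitions into final state
emit = {"q1": {"a": 1.0, "b": 0.0},             # bi(ot)
        "q2": {"a": 0.0, "b": 1.0}}

probability = forward(["a", "b"], states, start, trans, emit, final)
```

In a speech system the emission table would be replaced by the Gaussian mixture likelihoods described earlier.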

68
Pieces of the HMM

Observation likelihoods (b), p(o|q), represent
the acoustics of each phone, and are computed by
the gaussians (Acoustic Model, or AM)
Transition probabilities represent the
probability of different pronunciations
(different sequences of phones)
States correspond to phones

69
Pieces of the HMM

Actually, I lied when I said states correspond to
phones
Actually states usually correspond to triphones
CHEESE (phones): ch iy z
CHEESE (triphones): #-ch+iy, ch-iy+z, iy-z+#

70
Pieces of the HMM

Actually, I lied again when I said states
correspond to triphones
In fact, each triphone has 3 states for
beginning, middle, and end of the triphone.

71
A real HMM
72
Cross-word triphones

Word-Internal Context-Dependent Models
OUR LIST
SIL AA+R AA-R L+IH L-IH+S IH-S+T S-T
Cross-Word Context-Dependent Models
OUR LIST
SIL-AA+R AA-R+L R-L+IH L-IH+S IH-S+T S-T+SIL
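
The difference between the two schemes can be sketched with a small function that labels each phone with its neighbours. It assumes the common left-phone+right label convention (an assumption about the notation intended on the slide), and the function name and optional context arguments are illustrative.

```python
def triphones(phones, left_context=None, right_context=None):
    """Label each phone as left-phone+right; a missing context is omitted."""
    labels = []
    for i, phone in enumerate(phones):
        left = phones[i - 1] if i > 0 else left_context
        right = phones[i + 1] if i + 1 < len(phones) else right_context
        label = phone
        if left is not None:
            label = left + "-" + label
        if right is not None:
            label = label + "+" + right
        labels.append(label)
    return labels

# Word-internal: no context crosses the word boundary.
our_internal = triphones(["AA", "R"])
list_internal = triphones(["L", "IH", "S", "T"])

# Cross-word: neighbouring words (or silence) supply the missing context.
our_cross = triphones(["AA", "R"], left_context="SIL", right_context="L")
list_cross = triphones(["L", "IH", "S", "T"], left_context="R", right_context="SIL")
```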

73
Summary

ASR Architecture
The Noisy Channel Model
Five easy pieces of an ASR system
Feature Extraction
39 MFCC features
Acoustic Model
Gaussians for computing p(o|q)
Lexicon/Pronunciation Model
HMM
Next time: Decoding, i.e. how to combine these to
compute words from speech!

74
Perceptual properties

Pitch: perceptual correlate of frequency
Loudness: perceptual correlate of power, which is
related to the square of amplitude
