Title: DYNAMIC ADAPTATION FOR LANGUAGE AND DIALECT IN A SPEECH SYNTHESIS SYSTEM, Craig Olinsky, Media Lab Europe / University College Dublin
1 DYNAMIC ADAPTATION FOR LANGUAGE AND DIALECT IN A SPEECH SYNTHESIS SYSTEM
Craig Olinsky
Media Lab Europe / University College Dublin
2 OVERVIEW: ADAPTATION IN SPEECH SYNTHESIS
- Many of the areas that could benefit most from community-focused IT resource development have very high illiteracy rates. For such users, speech-based systems provide the most obvious and natural way to interact with computers.
- Without the widespread availability of the high-quality speech databases, computer-readable lexicons, and other pre-processed linguistic information that exist for, for instance, standard dialects of French or German, it is expensive and difficult to build such systems.
- (learning from sample case in other presentation)
3 OVERVIEW: ADAPTATION IN SPEECH SYNTHESIS
- Even within a particular language (including the major ones), personalizing a speech synthesis system for a particular use, market, and especially accent can provide much benefit to a deployed system. Recent articles have suggested, in fact, that humans connect better as listeners with a speaker and voice that sound like them: they not only find it easier to listen to and understand what is said, but also find it more natural to assign emotional state and to judge such factors as authority, honesty, and even intelligibility.
4 OVERVIEW: ADAPTATION IN SPEECH SYNTHESIS
- Perhaps the system can LISTEN to the user, and then CHANGE ITS OUTPUT to sound more like what it hears?
- Instead of creating a dedicated system for every purpose, set up a number of baseline systems (for different languages, language families, etc.) and set them learning.
- We benefit from the work put into developing the baseline system, while requiring a (minimum?) of additional focused training data.
- Assumption: learning an accent, a dialect, or a language is not a distinct process in each case, but all a matter of degree?
5 OVERVIEW: ADAPTATION IN SPEECH SYNTHESIS
- HUMAN ANALOGUE: People who live for a period of time in an area where a different accent or dialect of their language is spoken often (involuntarily) start to pick up the local manners of speech.
- SPEECH RECOGNITION ANALOGUE: Speaker Adaptation, a procedure in which the acoustic model of the recognition system (or in limited cases the language model as well), after being fully trained, is provided with additional speech data. Based upon this data, the values, parameters, nodes, weights, or other coefficients representing the acoustic model are shifted towards the new information, so that the system should exhibit improved performance on data resembling the new training data, even though such data may not have been included in its initial training procedure.
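The idea of shifting trained parameters towards new data can be illustrated with a minimal sketch. It assumes a single mean vector per acoustic unit and uses a MAP-style interpolation between the trained mean and the average of the new frames. The slides do not name a specific adaptation algorithm, so this technique, the function name `adapt_mean`, and the prior weight `tau` are all assumptions made for illustration.

```python
def adapt_mean(prior_mean, new_frames, tau=10.0):
    """Shift a trained mean vector towards the average of new feature frames.

    prior_mean: mean from the fully trained model (list of floats).
    new_frames: adaptation data, one feature vector per frame.
    tau: hypothetical prior weight; larger values trust the trained model more.
    """
    n = len(new_frames)
    if n == 0:
        # No adaptation data: keep the trained parameters unchanged.
        return list(prior_mean)
    dim = len(prior_mean)
    sample_mean = [sum(frame[d] for frame in new_frames) / n for d in range(dim)]
    # Weighted interpolation: more adaptation frames pull the mean further.
    return [(tau * m + n * s) / (tau + n) for m, s in zip(prior_mean, sample_mean)]


# Ten new frames with tau=10 move the mean halfway to the new data.
print(adapt_mean([0.0, 0.0], [[1.0, 2.0]] * 10, tau=10.0))  # [0.5, 1.0]
```

With little adaptation data the result stays close to the trained model, which fits the small-target-data setting this talk assumes.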
6 BACKGROUND: SPEAKER ADAPTATION FOR SPEECH RECOGNITION SYSTEMS
- QUICK PROCEDURE OVERVIEW
- Given a set of recorded target utterances and associated transcripts:
- Generate a synthesized utterance from each transcript using the current synthesizer (letter-to-sound rules, phones, speech database, etc.).
- Compare the target recording to the generated source form to determine how the two pronunciations differ.
- Re-organize the phone units and the speech unit selection process to incorporate differences and information from the target recording units.
- Modify the lexical entries and letter-to-sound rules of the existing synthesizer to produce output that more closely resembles the target utterance.
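The comparison step in the procedure above could be sketched as a sequence alignment over phone labels. This assumes that both the synthesized form and the target recording have already been reduced to phone label sequences (for example by a recognizer), which the slides do not specify. The phone symbols and the function name `pronunciation_differences` are hypothetical.

```python
from difflib import SequenceMatcher


def pronunciation_differences(source_phones, target_phones):
    """List where the target pronunciation departs from the source one.

    Returns (operation, source_segment, target_segment) tuples, where the
    operation is 'replace', 'delete' or 'insert'.
    """
    matcher = SequenceMatcher(a=source_phones, b=target_phones, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append((tag, source_phones[i1:i2], target_phones[j1:j2]))
    return differences


# Hypothetical phone strings for one word as produced by the current
# synthesizer (source) and as heard in the target recording.
source = ["t", "ah", "m", "ey", "t", "ow"]
target = ["t", "ah", "m", "aa", "t", "ow"]
print(pronunciation_differences(source, target))  # [('replace', ['ey'], ['aa'])]
```

Differences collected over many utterances are the kind of evidence the last two steps would draw on when revising phone units, lexical entries, and letter-to-sound rules.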
7 VARIATION AND ADAPTATION
- Ignoring for a moment vocabulary choice and other semantic issues of usage, it is possible to consider variation across accents, dialects, and even languages as a difference in degree of variation in a few key areas:
- the phonetic inventory, which comprises the basic building blocks from which pronunciations are built;
- a set of pronunciation rules or examples, which dictate how the phonetic units are put together to assign a pronunciation to an orthographic form and subsequently speak the desired text; and
- a collection of conventionalized stress and intonational patterns, which help provide structure and syntactic/semantic context to the overall produced utterances.
8 VARIATION AND ADAPTATION
- Cross-Speaker Adaptation. In this mode, a generalized speech synthesizer is adapted towards the voice of a single user of the system. Assuming that the original voice of the synthesizer is that of a professional speaker, this can be done in one of two ways. Either qualities of the user's voice can be applied to the default voice, while still retaining the database of sound samples of the original speaker for use as the concatenative synthetic voice; or, conversely, the database can be expanded (or replaced) with samples of the user's voice, while some abstract quality of the original professional voice is nonetheless retained, ideally providing some measure of the clarity and understandability for which the original speaker was initially hired. The ability to create natural-sounding speech by concatenating samples drawn from a speech database made up of recordings from multiple users, and/or of varying quality, would also help encourage an open-source bazaar of decentralized users attempting to amass the large number of recorded forms necessary for a multi-purpose unit-selection synthesizer.
9 VARIATION AND ADAPTATION
- Cross-Dialect Adaptation. This is almost exactly the case described above, except that the default voice and the specific user's voice differ in dialect, or to some greater degree than the average set of native speakers from a given area. That is, we would expect not only variation in voice quality, but also limited differences in vocabulary, phonetic inventory, distinguishable minimal pairs, accent, and the like. The result is that not only the unit-selection database, but also those components which assign phonetic realizations to the given text (the letter-to-sound rules and the pronunciation dictionary or lexicon) may need alteration.
- Cross-Language Adaptation. In this case, we retain some degree of phonetic inventory similarity between the source and destination language, but our letter-to-sound rules and lexicon need gross modification, or may even be unusable (even language pairs which are very similar in pronunciation, such as Japanese and Korean, could nonetheless use unrelated orthographic forms, or vice versa).
10 VARIATION AND ADAPTATION
- Cross-Language Adaptation, Single Speaker Variant. In this case, we have recordings from a single speaker (i.e., the user), whose voice we want to be able to speak naturally in languages of which the user is not a native speaker. We thus want to use information about these other languages to adapt the synthesizer of the user's voice to speak multilingually. (This is especially significant in our global community, where many proper nouns, such as personal names and locations, cannot be properly pronounced simply by following the phonological rules of a single language.)
- Language Acquisition. In the extreme case, we wish to bootstrap an empty synthesizer (with no lexicon or knowledge of pronunciation rules whatsoever) to speak like us simply by speaking to it, without hard-coding direct linguistic or phonetic knowledge. This is a task that a non-technical, non-expert native speaker should be able to perform.
11 VARIATION AND ADAPTATION
- Ignoring for a moment vocabulary choice and other semantic issues of usage, it is possible to consider variation across accents, dialects, and even languages as a difference in degree of variation in a few key areas:
- the phonetic inventory, which comprises the basic building blocks from which pronunciations are built;
- a set of pronunciation rules or examples, which dictate how the phonetic units are put together to assign a pronunciation to an orthographic form and subsequently speak the desired text; and
- a collection of conventionalized stress and intonational patterns, which help provide structure and syntactic/semantic context to the overall produced utterances.
12 OVERVIEW: ADAPTATION IN SPEECH SYNTHESIS
- Synthesis adds a problem beyond those of recognition adaptation: the database of recorded segments is itself used for concatenation. This means that we cannot just merge the entire set of recorded data together: there would be noticeable discrepancies between concatenative units taken from each individual speaker. On the other hand, if we just use the new set of segments, we aren't adapting; we're just building a new synthesizer. For this study, we take the new target data to be a small data set, not enough to be a good set of units for synthesis on its own.
13 OVERVIEW: ADAPTATION IN SPEECH SYNTHESIS
- We are thus required to use existing (source) units for synthesis. However, these source recordings and their associated existing synthetic voice have a specific accent/dialect, with a pre-defined phone set. Even with a proper dictionary and proper letter-to-sound rules providing us with a proper pronunciation (taking into account pronunciation variation for our target accent), stringing the best-match units together likely won't sound like a native speaker of that accent. The vowel quality might be vastly different, or phones might be missing in the source language (e.g., a French /r/). We want to adapt for this. Overall, we want to sound native in the target accent/dialect/language, using units recorded from the speaker of a different one.
14 PHONE UNIT ADAPTATION
- If the variation between source and target speech is large enough, it is likely that we will describe the target speech with a different phone set than that of the source speech.
- We may still find that the pronunciation of a particular phone in the target corresponds more closely with that of a different one than our source pronunciation lexicon would suggest (for instance, schwa reduction).
- Or we might have an existing target pronunciation lexicon or pronunciation rules with a predefined phone set we wish to use.
- To utilize data from our source synthesizer in such a case, we need to assign appropriate mappings between source and target phones.
- This can be seen as a matter of degree as to how much effort or knowledge is incorporated into creating the mapping, how closely such a mapping corresponds to the observed data, and thus our (assumed) rating of the quality of such a mapping.
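As a minimal sketch of what using such a mapping might look like, the function below re-expresses a target-phone pronunciation in source phones so that source units can be selected for it. The mapping table, the phone symbols, and the name `map_pronunciation` are hypothetical. Target phones with no mapping are reported rather than guessed, since those are the cases that need further adaptation.

```python
def map_pronunciation(target_phones, target_to_source):
    """Rewrite a target-phone pronunciation using source phones.

    target_to_source: hypothetical dict from target phone to source phone.
    Returns (mapped, unmapped): the mapped sequence, with None wherever no
    source phone is assigned, and the list of target phones lacking a mapping.
    """
    mapped, unmapped = [], []
    for phone in target_phones:
        if phone in target_to_source:
            mapped.append(target_to_source[phone])
        else:
            mapped.append(None)  # no source unit can be chosen for this phone yet
            unmapped.append(phone)
    return mapped, unmapped


# Hypothetical mapping that covers two of the three target phones.
mapping = {"R": "r", "u": "uw"}
print(map_pronunciation(["R", "u", "Z"], mapping))  # (['r', 'uw', None], ['Z'])
```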
15 PHONE UNIT ADAPTATION
Figure 1: Degrees of Phoneme Mapping
(alleged) WORST to (alleged) BEST: Source Phoneme Set; Naïve Mapping; Linguistically-Motivated Phoneme Mapping; Data-Driven Mapping; Target Phoneme Set
16 PHONE UNIT ADAPTATION
- na(t)ive approach: this approach follows the principle a non-native speaker would follow when speaking a second language: he basically has the phonetic inventory of the first language and partially uses that inventory when speaking the second language.
- phonetic approach: this strategy follows principles of the production of sounds in the human vocal tract: the sound that agrees in the most phonetic features with the untrained one is taken instead of the unknown one of the goal language.
- data-driven approach: this approach determines the similarity among phones from the data given by the trained recognizer; according to a distance measure, the most similar units may be joined.
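The phonetic and data-driven approaches can be contrasted in a short sketch. It assumes heavily simplified feature sets for the phonetic approach and, for the data-driven approach, one mean acoustic vector per phone with Euclidean distance as the distance measure. The slides fix neither the features nor the measure, so the inventories, vectors, and function names below are hypothetical.

```python
import math

# Hypothetical, heavily simplified articulatory features for source phones.
SOURCE_FEATURES = {
    "k": {"consonant", "voiceless", "velar", "stop"},
    "g": {"consonant", "voiced", "velar", "stop"},
    "m": {"consonant", "voiced", "bilabial", "nasal"},
}

# Hypothetical mean acoustic vectors for the same source phones.
SOURCE_VECTORS = {"k": (1.0, 4.0), "g": (1.5, 2.0), "m": (6.0, 1.0)}


def phonetic_match(target_features, source_features):
    """Pick the source phone sharing the most features with the target phone.

    Ties are broken alphabetically because candidates are visited in sorted
    order and max() keeps the first best candidate.
    """
    return max(sorted(source_features),
               key=lambda phone: len(source_features[phone] & target_features))


def data_driven_match(target_vector, source_vectors):
    """Pick the source phone whose mean vector is closest to the target's."""
    return min(sorted(source_vectors),
               key=lambda phone: math.dist(source_vectors[phone], target_vector))


# A voiceless velar fricative, absent from the hypothetical source inventory.
target_features = {"consonant", "voiceless", "velar", "fricative"}
print(phonetic_match(target_features, SOURCE_FEATURES))  # k
print(data_driven_match((1.2, 3.5), SOURCE_VECTORS))     # k
```

In this toy case the two approaches agree. With real data they can disagree, which is where the ranking of mapping quality becomes an empirical question.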
17 PRONUNCIATION ADAPTATION
- Typically taken for granted in multilingual speech adaptation studies is the presence of a pronunciation dictionary and/or rules for the target language.
- At the far extremes, we assume the existence of well-targeted pronunciation rules: in the worst case, rules designed for the source speech, and in the best case, rules specifically designed for the target. In between, we use a number of methods to derive or create a pronunciation module, based either upon the existing source-language methods, the target speech data itself, or some combination.
18 PRONUNCIATION ADAPTATION
Figure 2: Letter-to-sound rules / lexicon
(alleged) WORST to (alleged) BEST: Principled Source-Only; Foreign Approximation; Language-Neutral; Trained from Target Data; Principled Target-Only
19 PRONUNCIATION ADAPTATION
- Principled Source-Only: this approach merely uses pronunciation methods specifically designed for the source speech to generate a pronunciation form for the target. This approach can result in extremely inaccurate pronunciation approximations, such as one might expect from a native English speaker's attempt at a native pronunciation of an unusual foreign word.
- Foreign Approximation: this approach can be seen as akin to the na(t)ive approach of phone mapping discussed above. In this case, the speaker recognizes that the word being pronounced is not a native one, and relaxes some of the language-specific rules or attempts to move the pronunciation closer to that of the assumed language of the word in question. The result is closer, but still inaccurate and strongly accented.
20 PRONUNCIATION ADAPTATION
- Language-Neutral: this approach ignores all language-specific information, assuming a set of very generic or regular pronunciation rules that propose a (relatively) direct relation between orthographic form and pronunciation. Such rules would closely resemble those used for a language with artificially few pronunciation exceptions, such as Esperanto, rather than those of English.
- Trained from Target Data: in this method, an aligned text and speech signal are provided to a recognizer, along with (possibly) a limited set of pronunciation transcriptions as training data. In some automatic way, the system learns a set of pronunciation rules and/or a lexicon of pronunciations which closely matches the training data.
- Principled Target-Only: this approach assumes a provided pronunciation module specifically designed to generate correct pronunciations for our target language/dialect/accent.
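The Language-Neutral option can be illustrated with a minimal sketch in which every letter maps directly to one phone, with no context rules or exceptions. The letter table, the phone symbols, and the name `language_neutral_pronunciation` are hypothetical, and a real rule set would need to cover the full orthography.

```python
# Hypothetical one-letter-to-one-phone table in the spirit of a highly
# regular orthography; it is deliberately incomplete.
LETTER_TO_PHONE = {
    "a": "a", "b": "b", "d": "d", "e": "e", "i": "i", "k": "k",
    "l": "l", "m": "m", "n": "n", "o": "o", "p": "p", "r": "r",
    "s": "s", "t": "t", "u": "u",
}


def language_neutral_pronunciation(word, table=LETTER_TO_PHONE):
    """Assign a pronunciation by direct letter-to-phone lookup.

    Non-alphabetic characters (hyphens, apostrophes) are skipped; a letter
    without a rule raises ValueError rather than being guessed.
    """
    phones = []
    for char in word.lower():
        if not char.isalpha():
            continue
        if char not in table:
            raise ValueError(f"no rule for letter {char!r}")
        phones.append(table[char])
    return phones


print(language_neutral_pronunciation("saluton"))
# ['s', 'a', 'l', 'u', 't', 'o', 'n']
```

A table like this carries no knowledge of either source or target language, which is why it sits in the middle of the scale rather than at either end.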
21 UNIT DATABASE COMPOSITION
Figure 3: Methods of Comprising Unit Database
(alleged) WORST to (alleged) BEST: Source Speaker Recordings Only; Union of all (unprincipled); Source Speaker, uncovered phones from target only; Set of Digitally Altered Segments; Target Speaker Only
22 ADAPTATION FROM MIMICRY
- We know from the beginning that our source unit database is of the best quality (in terms of recording, segmentation, labelling, etc.).
- But we can't directly synthesize from the source database, because we will get accented, non-native-sounding speech.
- Is there a way to generate speech in a non-accented or differently-accented way from a single speech database?
- Try to find a neutrally accented speaker? (What does this mean: someone heavily polylingual? Someone geographically in between the two languages or accents?)
- Look at mimicry studies: how someone (intentionally) modifies their voice to sound like a different speaker.
23 ADAPTATION FROM MIMICRY
- Anders Eriksson and Pär Wretling, "How Flexible is the Human Voice? A Case Study of Mimicry"
- Close mimicry of global speech rate
- No change for timing at segmental level
- Mean fundamental frequency and variation matched timing closely
- Formant frequencies attained with varying success
- Vowel imitation intermediate between voice and target
- Fundamental frequency changes were more successful than changes in timing
24 STAGES OF THE EXPERIMENT
- Our development efforts and systems will follow the four modes listed in the research overview, in order of ascribed complexity. For the Cross-Speaker Adaptation case, we will utilize a base voice and training speaker of native American English.
- For the Cross-Dialect Adaptation study, we will retain the use of English for the basic case, adapting over a selection of American, British, and Irish English dialects.
- We will then finish with two data sets for Cross-Language Adaptation, proceeding in order of linguistic variation: variation over the set of Celtic languages still in current use (Irish, Scottish Gaelic, and, slightly more distantly, Welsh) and a selection of Asian Indian languages, including (at least) Bengali and Hindi.