DYNAMIC ADAPTATION FOR LANGUAGE AND DIALECT IN A SPEECH SYNTHESIS SYSTEM, Craig Olinsky, Media Lab Europe / University College Dublin (PowerPoint presentation)

1
DYNAMIC ADAPTATION FOR LANGUAGE AND DIALECT IN A
SPEECH SYNTHESIS SYSTEM
Craig Olinsky
Media Lab Europe / University College Dublin
2
OVERVIEW: ADAPTATION IN SPEECH SYNTHESIS
  • Many of the areas which could most benefit from
    community-focused IT resource development have
    very high illiteracy rates among their populace.
    For such users, speech-based systems provide the
    most obvious and natural way to interface with
    computers.
  • Without the widespread availability of
    high-quality speech databases, computer-readable
    lexicons, and other pre-processed linguistic
    information of the kind that exists for, say,
    standard dialects of French or German, such
    systems are expensive and difficult to build.
  • (learning from sample case in other
    presentation)

3
OVERVIEW: ADAPTATION IN SPEECH SYNTHESIS
  • Even within a particular language (including
    the major ones), personalizing a speech
    synthesis system for a particular use, market,
    and especially accent can provide much benefit
    to a deployed system. Recent articles have
    suggested, in fact, that listeners connect
    better with a speaker and voice that sound like
    them: they find it easier to listen to and
    understand what is said, and more natural to
    attribute emotional state and to judge factors
    such as authority, honesty, and even
    intelligibility.

4
OVERVIEW: ADAPTATION IN SPEECH SYNTHESIS
  • Perhaps the system can LISTEN to the user, and
    then CHANGE ITS OUTPUT to sound more like what it
    hears?
  • Instead of creating a dedicated system for every
    purpose, set up a number of baseline systems
    (for different languages, language families,
    etc.) and set them learning.
  • We benefit from the work put into developing the
    baseline system, while requiring a (minimal?)
    amount of additional focused training data.
  • Assumption: learning an accent, a dialect, or a
    language is not a distinct process in each case,
    but all a matter of degree?

5
OVERVIEW: ADAPTATION IN SPEECH SYNTHESIS
  • HUMAN ANALOGUE: People who live for a period of
    time in an area where a different accent or
    dialect of their language is spoken often
    (involuntarily) start to pick up the local
    manners of speech.
  • SPEECH RECOGNITION ANALOGUE: Speaker
    Adaptation, a procedure in which the acoustic
    model of the recognition system (or, in limited
    cases, the language model as well), after being
    fully trained, is provided with additional speech
    data. Based upon this data, the values,
    parameters, nodes, weights, or other coefficients
    representing the acoustic model are shifted
    towards the new information, so that the
    system should exhibit improved performance on
    data resembling the new training data, even
    though such data may not have been included in
    its initial training procedure.
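
As a rough illustration of "shifting" model parameters towards new data, the sketch below applies a MAP-style update to the mean of a single Gaussian. The slides do not say which adaptation method is intended, so the update rule, the function name `map_adapt_mean`, and the prior weight `tau` are assumptions made for this example only.

```python
def map_adapt_mean(prior_mean, frames, tau=10.0):
    """Shift one Gaussian mean towards adaptation frames (MAP-style update).

    prior_mean: mean vector from the fully trained model (list of floats).
    frames: adaptation feature vectors assumed to belong to this Gaussian.
    tau: hypothetical prior weight; larger values trust the original model more.
    """
    n = len(frames)
    if n == 0:
        # No adaptation data: keep the original parameters unchanged.
        return list(prior_mean)
    dim = len(prior_mean)
    sums = [sum(frame[d] for frame in frames) for d in range(dim)]
    # Weighted average of the prior mean and the new data.
    return [(tau * prior_mean[d] + sums[d]) / (tau + n) for d in range(dim)]


prior = [0.0, 0.0]
new_frames = [[2.0, 4.0]] * 10
print(map_adapt_mean(prior, new_frames, tau=10.0))  # [1.0, 2.0]
```

With ten new frames and `tau=10.0`, the adapted mean lands halfway between the original mean and the new data; more adaptation data pulls it further towards the new speaker.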

6
BACKGROUND: SPEAKER ADAPTATION FOR SPEECH
RECOGNITION SYSTEMS
  • QUICK PROCEDURE OVERVIEW
  • Given a set of recorded target utterances and
    associated transcripts:
  • Generate a synthesized utterance from each
    transcript using the current synthesizer
    (letter-to-sound rules, phones, speech database,
    etc.)
  • Compare the target recording to the generated
    source form to determine how the two
    pronunciations differ.
  • Re-organize the phone units and the speech unit
    selection process to incorporate the differences
    and information from the target recording units.
  • Modify the lexical entries and letter-to-sound
    rules of the existing synthesizer to produce
    output that more closely resembles the target
    utterance.
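
The comparison step can be sketched as an alignment of two phone sequences. The example below assumes that phone transcriptions of both the synthesized form and the target recording are already available (how they are obtained is not covered here) and uses plain Levenshtein alignment; the function names and the ARPAbet-like phone symbols are illustrative, not taken from the presentation.

```python
def align_phones(source, target):
    """Align two phone sequences with Levenshtein edit distance.

    Returns (source_phone, target_phone) pairs; None marks an insertion
    or a deletion.
    """
    rows, cols = len(source) + 1, len(target) + 1
    cost = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        cost[i][0] = i
    for j in range(cols):
        cost[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            cost[i][j] = min(cost[i - 1][j - 1] + sub,  # match / substitution
                             cost[i - 1][j] + 1,        # deletion
                             cost[i][j - 1] + 1)        # insertion
    # Trace back from the bottom-right corner to recover the alignment.
    pairs = []
    i, j = len(source), len(target)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if source[i - 1] == target[j - 1] else 1
            if cost[i][j] == cost[i - 1][j - 1] + sub:
                pairs.append((source[i - 1], target[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((source[i - 1], None))
            i -= 1
        else:
            pairs.append((None, target[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def differences(pairs):
    """Keep only the positions where the two pronunciations disagree."""
    return [(s, t) for s, t in pairs if s != t]


synthesized = ["t", "ah", "m", "ey", "t", "ow"]
recorded = ["t", "ah", "m", "aa", "t", "ow"]
print(differences(align_phones(synthesized, recorded)))  # [('ey', 'aa')]
```

The resulting list of mismatches is the kind of evidence that could feed the later steps, such as proposing a lexicon change or a phone mapping.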

7
VARIATION AND ADAPTATION
  • Ignoring for a moment issues such as vocabulary
    choice and other semantic issues of usage, it is
    possible to consider variation across accents,
    dialects, and even languages as differing
    degrees of variation in a few key areas:
  • the phonetic inventory, which comprises the basic
    building blocks from which pronunciations are
    built,
  • a set of pronunciation rules or examples, which
    dictate how the phonetic units are put together
    to assign a pronunciation to an orthographic
    form, and subsequently speak the desired text,
    and
  • a collection of conventionalized stress and
    intonational patterns, which help provide
    structure and syntactic/semantic context to the
    overall produced utterances.

8
VARIATION AND ADAPTATION
  • Cross-Speaker Adaptation. In this mode, a
    generalized speech synthesizer is adapted towards
    the voice of a single user of the system. This
    can be done in one of two ways. Assuming that
    the original voice of the synthesizer is that
    of a professional speaker, either qualities of
    the user's voice can be applied to the default
    voice, while still retaining the database of
    sound samples of the original speaker for use as
    the concatenative synthetic voice; or,
    conversely, the database can be expanded (or
    replaced) with samples of the user's voice,
    while some abstract quality of the original
    professional voice is nonetheless retained,
    ideally providing some measure of the clarity
    and intelligibility for which the original
    speaker was initially chosen. The ability to
    create natural-sounding speech by concatenating
    samples drawn from a speech database composed of
    recordings from multiple users, and/or of
    varying quality, would also help encourage an
    open-source "bazaar" of decentralized users
    working to amass the large number of recorded
    forms necessary for a multi-purpose
    unit-selection synthesizer.

9
VARIATION AND ADAPTATION
  • Cross-Dialect Adaptation. This is almost exactly
    the case described above, except that the
    default voice and the specific user's voice
    differ in dialect, or differ to a greater degree
    than the average set of native speakers from a
    given area. That is, we would expect not only
    variation in voice quality, but also limited
    differences in vocabulary, phonetic inventory,
    distinguishable minimal pairs, accent, and the
    like. The result is that not only the
    unit-selection database, but also the
    components which assign phonetic realizations to
    the given text (the letter-to-sound rules and the
    pronunciation dictionary or lexicon), may need
    alteration.
  • Cross-Language Adaptation. In this case, we
    retain some degree of phonetic inventory
    similarity between the source and destination
    languages, but our letter-to-sound rules and
    lexicon need gross modification, or may even be
    unusable (even language pairs which are very
    similar in pronunciation, such as Japanese and
    Korean, could nonetheless use unrelated
    orthographic forms, or vice versa).

10
VARIATION AND ADAPTATION
  • Cross-Language Adaptation, Single Speaker
    Variant. In this case, we have recordings from a
    single speaker (i.e., the user), whose voice we
    want to be able to speak naturally in languages
    of which the user is not a native speaker. We
    thus want to use information about these other
    languages to adapt the synthesizer of the user's
    voice to speak multilingually. (This is
    especially significant in our global community,
    where many proper nouns, such as personal names
    and locations, cannot be properly pronounced
    simply by following the phonological rules of a
    single language.)
  • Language Acquisition. In the extreme case, we
    wish to bootstrap an empty synthesizer (with no
    lexicon or knowledge of pronunciation rules
    whatsoever) to speak like us simply by speaking
    to it, without hard-coding direct linguistic or
    phonetic knowledge. This is a task that a
    non-technical, non-expert native speaker
    should be able to perform.

12
OVERVIEW: ADAPTATION IN SPEECH SYNTHESIS
  • Synthesis adds a problem beyond those of
    recognition adaptation: the database of recorded
    segments is itself used for concatenation. This
    means that we cannot just merge the entire set
    of recorded data together, as there would be
    noticeable discrepancies between concatenative
    units taken from each individual speaker. On the
    other hand, if we just use the new set of
    segments, we aren't adapting; we're just
    building a new synthesizer. For this study, we
    take the new target data to be a small data set,
    not enough to be a good set of units for
    synthesis on its own.

13
OVERVIEW: ADAPTATION IN SPEECH SYNTHESIS
  • We are thus required to use existing (source)
    units for synthesis. However, these source
    recordings and their associated existing
    synthetic voice have a specific accent/dialect,
    with a pre-defined phone set. Even with a proper
    dictionary and proper letter-to-sound rules
    providing us with a proper pronunciation (taking
    into account pronunciation variation for our
    target accent), stringing the best-match units
    together likely won't sound like a native
    speaker of that accent. The vowel quality might
    be vastly different, or phones might be missing
    in the source language (e.g., a French /r/). We
    want to adapt for this. Overall, we want to
    sound native in the target
    accent/dialect/language, using units recorded
    from a speaker of a different one.
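
A first practical check for the "missing phones" problem is to list which phones a target lexicon needs that the source unit database cannot supply. The sketch below is only an illustration: the function name, the toy lexicon, and the phone symbols are all hypothetical.

```python
def uncovered_phones(target_pronunciations, source_inventory):
    """Return target phones that have no unit in the source inventory.

    target_pronunciations: dict mapping words to lists of phone symbols.
    source_inventory: iterable of phone symbols available as source units.
    """
    needed = {phone
              for phones in target_pronunciations.values()
              for phone in phones}
    return sorted(needed - set(source_inventory))


# Toy target lexicon; "R" stands in for a French /r/, "y" for a front
# rounded vowel. Both are assumed to be absent from the source voice.
target_lexicon = {"rouge": ["R", "u", "Z"], "tu": ["t", "y"]}
source_units = {"t", "u", "Z", "r", "i"}
print(uncovered_phones(target_lexicon, source_units))  # ['R', 'y']
```

Phones reported here are the ones that would need a mapping to a source phone, digitally altered units, or units taken from the small target recording set.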

14
PHONE UNIT ADAPTATION
  • If the variation between source and target speech
    is large enough, it is likely that we will need
    to describe the target speech with a different
    phone set than that of the source speech.
  • We may still find that the pronunciation of a
    particular phone in the target corresponds more
    closely to that of a different phone than our
    source pronunciation lexicon would suggest (for
    instance, schwa reduction).
  • Or we might have an existing target
    pronunciation lexicon or pronunciation rules with
    a predefined phone set we wish to use.
  • To utilize data from our source synthesizer in
    such a case, we need to assign appropriate
    mappings between source and target phones.
  • This can be seen as a matter of degree: how
    much effort or knowledge is incorporated into
    creating the mapping, how closely the mapping
    corresponds to the observed data, and thus our
    (assumed) rating of the quality of the mapping.

15
PHONE UNIT ADAPTATION
Figure 1: Degrees of Phoneme Mapping
[Figure: a scale running from (alleged) WORST to
(alleged) BEST, between the Source Phoneme Set and
the Target Phoneme Set: Naïve Mapping,
Linguistically-Motivated Mapping, Data-Driven
Mapping]
 
16
PHONE UNIT ADAPTATION
  • na(t)ive approach: this approach follows the
    principle a non-native speaker would follow when
    speaking a second language: the speaker
    basically has the phonetic inventory of the
    first language, and partially uses that
    inventory when speaking the second language.
  • phonetic approach: this strategy follows
    principles of sound production in the human
    vocal tract: the known sound that agrees in the
    most phonetic features with the unknown sound of
    the goal language is used in its place.
  • data-driven approach: this approach determines
    the similarity among phones from the data given
    by the trained recognizer; according to a
    distance measure, the most similar units may be
    joined.
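
The phonetic and data-driven approaches can be contrasted in a short sketch. The feature table below is a heavily simplified, hypothetical stand-in for a real articulatory feature system, and Euclidean distance between per-phone mean vectors is an assumed distance measure; the slides do not specify either.

```python
import math

# Hypothetical, heavily simplified articulatory feature sets.
FEATURES = {
    "p": {"consonant", "bilabial", "stop", "voiceless"},
    "b": {"consonant", "bilabial", "stop", "voiced"},
    "f": {"consonant", "labiodental", "fricative", "voiceless"},
    "B": {"consonant", "bilabial", "fricative", "voiced"},  # target-only phone
}


def map_by_features(target_phone, source_phones, features=FEATURES):
    """Phonetic approach: pick the source phone sharing the most features.

    Ties are broken alphabetically, which is an arbitrary choice.
    """
    wanted = features[target_phone]
    return max(sorted(source_phones),
               key=lambda p: len(wanted & features[p]))


def map_by_distance(target_mean, source_means):
    """Data-driven approach: pick the source phone with the nearest mean.

    source_means maps each source phone to a mean acoustic vector; these
    would come from trained models, but are made up in the example below.
    """
    return min(sorted(source_means),
               key=lambda p: math.dist(target_mean, source_means[p]))


print(map_by_features("B", ["p", "b", "f"]))  # b
print(map_by_distance([0.2, 0.9], {"b": [1.0, 0.0], "v": [0.0, 1.0]}))  # v
```

The two functions can disagree for the same target phone, which is one way to see why the choice of mapping strategy is treated as a matter of degree rather than a single correct answer.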

17
PRONUNCIATION ADAPTATION
  • Typically taken for granted in multilingual
    speech adaptation studies is the presence of a
    pronunciation dictionary and/or rules for the
    target language.
  • At the far extremes, we assume the existence of
    well-targeted pronunciation rules: in the worst
    case, ones designed for the source speech, and
    in the best case, ones specifically designed for
    the target. In between, we use a number of
    methods to derive or create a pronunciation
    module, based upon the existing source-language
    methods, the target speech data itself, or some
    combination.

18
PRONUNCIATION ADAPTATION
Figure 2: Letter-to-sound rules / lexicon
[Figure: a scale running from (alleged) WORST to
(alleged) BEST: Principled Source-Only, Foreign
Approximation, Language-Neutral, Trained from
Target Data, Principled Target-Only]
 
19
PRONUNCIATION ADAPTATION
  • Principled Source-Only: this approach merely
    uses pronunciation methods specifically designed
    for the source speech to generate a pronunciation
    form for the target. This approach can result in
    extremely inaccurate pronunciation
    approximations, such as one might expect from a
    native English speaker's attempt at a native
    pronunciation of an unusual foreign word.
  • Foreign Approximation: this approach can be
    seen as akin to the na(t)ive approach to phone
    mapping discussed above. In this case, the
    speaker recognizes that the word being pronounced
    is not a native one, and relaxes some of the
    language-specific rules or attempts to move the
    pronunciation closer to that of the assumed
    language of the word in question. The result is
    closer, but still inaccurate and strongly
    accented.

20
PRONUNCIATION ADAPTATION
  • Language-Neutral: this approach ignores all
    language-specific information, assuming a set of
    very generic or regular pronunciation rules that
    propose a (relatively) direct relation between
    orthographic form and pronunciation. Such rules
    would closely resemble those used for a language
    with artificially few pronunciation exceptions,
    such as Esperanto, rather than those of English.
  • Trained from Target Data: in this method, an
    aligned text and speech signal are provided to a
    recognizer, along with (possibly) a limited set
    of pronunciation transcriptions as training data.
    In some automatic way, the system learns a set
    of pronunciation rules and/or a lexicon of
    pronunciations which closely matches the training
    data.
  • Principled Target-Only: this approach assumes a
    provided pronunciation module specifically
    designed to generate correct pronunciations for
    our target language/dialect/accent.
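
To make the Language-Neutral option concrete, the sketch below pronounces a word letter by letter from a one-letter-one-phone table, and lets an optional lexicon (for instance, one learned from target data) override the rules. The table, the function name `pronounce`, and the example entries are hypothetical and do not describe any real language or the author's system.

```python
# Hypothetical one-letter-one-phone table; not taken from any real language.
LETTER_TO_PHONE = {
    "a": "a", "b": "b", "d": "d", "e": "e", "i": "i", "k": "k", "l": "l",
    "m": "m", "n": "n", "o": "o", "p": "p", "r": "r", "s": "s", "t": "t",
    "u": "u",
}


def pronounce(word, lexicon=None, table=LETTER_TO_PHONE):
    """Return a phone list for a word.

    A lexicon entry, if present, wins over the rules; otherwise each
    letter maps directly to one phone. Letters missing from the table
    are silently skipped, which a real system would need to handle.
    """
    word = word.lower()
    if lexicon and word in lexicon:
        return list(lexicon[word])
    return [table[ch] for ch in word if ch in table]


learned = {"tomato": ["t", "ah", "m", "aa", "t", "ow"]}
print(pronounce("kato"))             # ['k', 'a', 't', 'o']
print(pronounce("Tomato", learned))  # ['t', 'ah', 'm', 'aa', 't', 'ow']
```

Keeping the lexicon as an override on top of generic rules means a small amount of target data can correct the most noticeable errors without replacing the whole pronunciation module.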

21
UNIT DATABASE COMPOSITION
Figure 3: Methods of Composing the Unit Database
[Figure: a scale running from (alleged) WORST to
(alleged) BEST: Source Speaker Only; Union of all
Recordings (unprincipled); Source Speaker,
uncovered phones from target only; Set of
Digitally Altered Segments; Target Speaker Only]
 
22
ADAPTATION FROM MIMICRY
  • We know from the beginning that our source unit
    database is of the best quality (in terms of
    recording, segmentation, labelling, etc.).
  • But we can't directly synthesize from the source
    database, because we will get accented,
    non-native-sounding speech.
  • Is there a way to generate speech in a
    non-accented or differently-accented way from a
    single speech database?
  • Try to find a "neutrally" accented speaker?
    (What does this mean: someone heavily
    polylingual? Someone geographically in between
    the two languages or accents?)
  • Look at mimicry studies: how someone
    (intentionally) modifies their voice to sound
    like a different speaker.

23
ADAPTATION FROM MIMICRY
  • Anders Eriksson and Pär Wretling, "How Flexible
    is the Human Voice? A Case Study of Mimicry"
  • Close mimicry of global speech rate
  • No change in timing at the segmental level
  • Mean fundamental frequency and variation matched
    timing closely
  • Formant frequencies attained with varying
    success
  • Vowel imitation intermediate between the
    mimic's own voice and the target
  • Fundamental frequency changes were more
    successful than changes in timing

24
STAGES OF THE EXPERIMENT
  • Our development efforts and systems will follow
    the four modes listed in the research overview in
    order of ascribed complexity. For the
    Cross-Speaker Adaptation case, we will use a
    base voice and a training speaker who are both
    native speakers of American English.
  • For the Cross-Dialect Adaptation study, we will
    retain the use of English for the basic case,
    adapting over a selection of American, British,
    and Irish English dialects.
  • We will then finish with two data sets for
    Cross-Language Adaptation, proceeding in order of
    linguistic variation: first, variation over the
    set of Celtic languages still in current use
    (Irish, Scottish Gaelic, and, slightly more
    distantly, Welsh), and then a selection of
    Asian Indian languages, including (at least)
    Bengali and Hindi.