1
Vocal-Tract Length Normalization for Acoustic-to-Articulatory Mapping Using Neural Networks
2pSC5
  • Sorin Dusan and Li Deng
  • sdusan@speech2.uwaterloo.ca, deng@speech2.uwaterloo.ca
  • http://crg7.uwaterloo.ca/sdusan
  • Department of Electrical and Computer Engineering
  • University of Waterloo, Waterloo, Ontario, CANADA N2L 3G1

2
Objective
ASA 138th Meeting
Sorin Dusan, Li Deng
Vocal-Tract Length Normalization
  • Accounting for Inter-Speaker Variability
  • The main goal of this work is to compensate for the variability of the speech signal caused by physiological differences in vocal-tract size, for an acoustic-to-articulatory inversion application, by:
  • Estimating a normalized vocal-tract length (VTL) factor of the speaker from short utterances,
  • Normalizing the acoustic parameters of the speech signal using a frequency warping method based on the estimated normalization factor.

3
Background
  • Inter-Speaker Variability
  • A main source of variability in speech is caused by physiological differences in the vocal-tract size and shape of speakers due to age, gender, etc.
  • Among these anatomical parameters, one of the most important is the overall vocal-tract length of the speaker.
  • VTL differences can produce variations of 20% or more in formant frequencies between male and female speakers (Fant [1]; Nordström [6]).

4
Background
  • Speaker Normalization
  • A way of accounting for the variability of the speech signal due to differences in the vocal-tract size of speakers.
  • A normalization based on a single parameter, corresponding to the overall VTL, can account for a large part of the inter-speaker variability [6].
  • VTL normalization can be done by employing a linear frequency scaling (Lin and Che [3]) or a frequency warping method (Lee and Rose [2]), based on the estimated VTL factor.

5
Background
  • Speech Inversion Based on a Reference Model
  • Use articulatory-acoustic data of a single speaker to train a speech inversion model, then estimate the articulatory trajectories from new speech acoustic signals of the reference speaker [7].
  • To apply the speech inversion to other speakers:
  • Estimate the overall VTL and apply normalization,
  • Estimate the articulatory trajectories using the normalized acoustic parameters,
  • Scale the vocal-tract parameters by the VTL factor.
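The three steps above can be sketched as a pipeline. This is a minimal illustration, not the authors' code: `estimate_vtl`, `normalize`, and `invert_reference` are hypothetical stand-ins for the components described in these slides.

```python
# Hypothetical sketch of applying a reference-speaker inversion model to a
# new speaker. The three callables are assumed stand-ins for the paper's
# VTL estimator, frequency-warping normalizer, and inversion model.
def invert_for_new_speaker(frames, estimate_vtl, normalize, invert_reference):
    k = estimate_vtl(frames)                    # normalized VTL factor (1.0 = reference)
    warped = [normalize(f, k) for f in frames]  # VTL-normalized acoustic parameters
    trajectories = invert_reference(warped)     # inversion trained on one speaker
    # scale the recovered vocal-tract parameters back to the new speaker's size
    return [[p * k for p in frame] for frame in trajectories]
```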

6
Assumptions
  • Vocal-Tract Size
  • The variations in size of the three-dimensional vocal tract found in humans can influence the acoustic parameters of the speech signal in many ways.
  • A first approximation of these variations can be made using the VTL as a single scaling parameter (Nordström [6]). A better approximation uses two parameters, one for the length of the mouth cavity and one for the length of the pharynx cavity (Fant [1]).
  • We assumed a first-order approximation using the VTL.

7
Assumptions
  • Figure 1. Spectra of synthesized /a/ using linear
    variation of VTL

8
Models
We used Maeda's articulatory and acoustic models [4, 5] to synthesize the training and evaluation data.
Figure 3. Vocal-tract profile of Maeda's articulatory model
9
Method
  • The speaker normalization method presented here is a pre-processing stage of the acoustic-to-articulatory inversion and consists of two parts: estimation of the overall VTL of the speaker and normalization of the speech acoustic parameters.
  • (1) Estimation of VTL
  • The estimation of the overall VTL of a speaker was carried out using a neural network. We used a feed-forward network with two hidden layers of sigmoid neurons and an output layer of linear neurons.

10
Method
  • The network had 10 inputs, corresponding to the first 10 mel-frequency cepstrum coefficients (MFCCs), 30 tansig neurons in the first hidden layer, 10 logsig neurons in the second hidden layer, and one purelin neuron in the output layer, producing a single output: the VTL factor.
  • The MFCC parameters and the corresponding VTL factors from synthesized training data were used to train the network with back-propagation. The VTL estimation was evaluated on synthesized test data.
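The 10-30-10-1 architecture can be written out as a forward pass. The NumPy sketch below uses random placeholder weights purely to show the layer shapes and transfer functions; the actual weights come from back-propagation training, which is not reproduced here.

```python
import numpy as np

# Minimal sketch of the network described above: 10 MFCC inputs,
# 30 tansig (tanh) units, 10 logsig (logistic) units, 1 purelin (linear)
# output. Weights are random placeholders, not the trained values.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(30, 10)) * 0.1, np.zeros(30)
W2, b2 = rng.normal(size=(10, 30)) * 0.1, np.zeros(10)
W3, b3 = rng.normal(size=(1, 10)) * 0.1, np.zeros(1)

def logsig(x):
    return 1.0 / (1.0 + np.exp(-x))

def vtl_network(mfcc):
    """Map a vector of the first 10 MFCCs to an estimated VTL factor."""
    h1 = np.tanh(W1 @ mfcc + b1)   # tansig first hidden layer
    h2 = logsig(W2 @ h1 + b2)      # logsig second hidden layer
    return (W3 @ h2 + b3)[0]       # purelin output neuron
```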

11
Method
  • The estimation of the VTL factor is based on averaging the output of the neural network over a given input sequence of vowels, as illustrated in Figure 2.
  • (2) Speech Normalization
  • We used a frequency warping normalization with a factor inversely proportional to the VTL factor, shifting the filter-bank of the cepstrum analyzer by this factor, as in Lee and Rose [2].
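One way to picture the filter-bank shift is to scale the centre frequencies of a standard mel filter-bank by the warp factor. This is an illustrative sketch of that idea, not the exact analyzer used in the paper; the function name and signature are invented for the example.

```python
import numpy as np

# Sketch of filter-bank warping: place triangular-filter centres on the
# mel scale, then scale them by a warp factor alpha (taken here to be
# inversely proportional to the estimated VTL factor).
def warped_centres(n_filters, f_low, f_high, alpha):
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mels = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_filters + 2)
    centres = mel_to_hz(mels[1:-1])  # interior points are the filter centres
    return alpha * centres           # shift the whole bank by alpha
```

With alpha = 1 the bank is unwarped; a longer-than-reference vocal tract (VTL factor above 1) gives alpha below 1, pulling the filters toward lower frequencies.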

12
Method
  • Figure 2. Estimating the VTL factor for an
    /euyoai/ sequence

13
Data
  • We synthesized training and evaluation data containing sequences of 6 English vowels.
  • Training Data
  • The training data contained the utterance /ayeoiu/ synthesized 250 times, linearly scaling the VTL between 15.0 cm and 18.75 cm (a 100-125% range).
  • Evaluation Data
  • We synthesized several evaluation data sets, each containing 250 utterances of the same vowels in a different context, using the same VTL range.
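The training range described above can be laid out explicitly. Assuming the 15 cm lower end is the 100% reference (which the 15.0-18.75 cm span implies), the 250 linearly spaced lengths and their normalized factors are:

```python
import numpy as np

# The 250 training VTL values: linearly spaced over 15.0-18.75 cm,
# i.e. normalized VTL factors of 1.00-1.25, assuming a 15 cm reference.
vtls = np.linspace(15.0, 18.75, 250)
factors = vtls / 15.0
```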

14
Data
  • Figure 4. Two training /ayeoiu/ utterances with
    different VTL

15
Estimation Results (Training)
  • Figure 5. Estimation of VTL factor (/ayeoiu/
    training data)

16
Estimation Results (Evaluation)
  • Figure 6. Estimation of VTL factor (/euyoai/
    evaluation data)

17
Normalization Results
  • Figure 7. Normalization of vowels produced with VTL = 18.75 cm

18
Normalization Results
  • Figure 8. Normalization of an /euyoai/ sequence (VTL = 18.75 cm)

19
Conclusions
  • We developed a method for estimating the VTL of a speaker from a single utterance of a sequence of vowels, using neural networks, and for normalizing the acoustic parameters, using frequency warping, for acoustic-to-articulatory inversion.
  • Evaluated on various synthesized data, the estimation method showed an average error of under 1% and a maximum error of 3.2% for utterances of 6 vowels in different contexts.
  • The MFCC normalization reduced the acoustic-parameter RMS errors for vowels by about 50%.

20
Selected References
  • [1] G. Fant, "Non-uniform vowel normalization," STL-QPSR 2-3, pp. 1-19, 1975.
  • [2] L. Lee and R. C. Rose, "Speaker normalization using efficient frequency warping procedures," ICASSP '96, pp. 353-356, 1996.
  • [3] Q. Lin and C. Che, "Normalizing the vocal tract length for speaker independent speech recognition," IEEE Signal Processing Letters, 2(11), pp. 201-203, 1995.
  • [4] S. Maeda, "Improved articulatory model," JASA 84, S146, 1988.
  • [5] S. Maeda, "A digital simulation method of the vocal-tract system," Speech Communication, 1, pp. 199-229, 1982.
  • [6] P. E. Nordström, "Attempts to simulate female and infant vocal tracts from male area functions," STL-QPSR 2-3, pp. 20-33, 1975.
  • [7] S. Dusan, "Estimation of articulatory gestures from the speech acoustic signal," PhD Thesis, 1999 (in progress).