A Segment-Based Generative Model of Speech

1
A Segment-Based Generative Model of Speech
  • Kannan Achan
  • Joint work with
  • Sam Roweis, Aaron Hertzmann, Brendan Frey
  • University of Toronto
  • http://www.psi.toronto.edu/kannan/segmental

2
  • Speech processing purely in the time domain is
    generally considered difficult
  • Perceptual similarity is not preserved
  • Microphones, room acoustics
  • A time-frequency representation is generally used
  • Phase!

Same utterance, different microphones
3
Voiced, Unvoiced, Silence
  • Vibrating vocal cords: voiced speech
  • Frequency of vibration called pitch
  • Turbulent air flow: unvoiced speech
  • Coloured noise
  • Silence period

4
Time domain modeling
Voiced region
Unvoiced region
  • Identify the glottal pulse period for voiced regions

Mechanism to identify a region as unvoiced
5
Time domain modeling: voiced region
  • Goal: find segments s1, s2, ..., sN corresponding to
    glottal pulse periods
  • Notice that adjacent segments look similar
  • Transformation t(a,b,g)
  • Time warp (a): stretch/shrink, maps an n-vector
    to an (a·n)-vector
  • Amplitude scaling (b): scalar multiplication
    (b·x)
  • Amplitude shift (g): scalar addition (x + g)
  • For a given ak, we can find bk and gk using
    linear regression
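
The transformation on this slide can be illustrated with a minimal sketch. It assumes the time warp is implemented by linear interpolation and that the scale and shift are fit by ordinary least squares; the slides do not say how the warp is computed, and the function names warp_to_length and fit_scale_shift are hypothetical.

```python
import numpy as np

def warp_to_length(x, m):
    """Time warp: resample an n-vector to m samples (linear interpolation, an assumption)."""
    n = len(x)
    return np.interp(np.linspace(0.0, n - 1.0, m), np.arange(n), x)

def fit_scale_shift(prev_seg, cur_seg):
    """For the warp factor a implied by the two segment lengths, find the
    amplitude scale b and shift g minimizing ||b * warp(prev_seg) + g - cur_seg||^2."""
    a = len(cur_seg) / len(prev_seg)
    warped = warp_to_length(prev_seg, len(cur_seg))
    design = np.column_stack([warped, np.ones_like(warped)])
    (b, g), *_ = np.linalg.lstsq(design, cur_seg, rcond=None)
    prediction = b * warped + g   # prediction of cur_seg from prev_seg
    return a, b, g, prediction
```

Here the warp factor is fixed by the two segment lengths, so only b and g are fit; searching over candidate values of a would repeat the regression once per candidate.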

6
Generative Model
  • Assuming segments are generated by a first-order
    Markov process, 4 types of transitions are
    possible
  • Voiced to Voiced
  • Voiced to Unvoiced
  • Unvoiced to Voiced
  • Unvoiced to Unvoiced
  • Given segment boundaries b, segment types v
    (voiced, v = 1, or unvoiced, v = 0) and
    transformation t, the generative model is a
    conditional Markov model

7
Generative Model
  • Successive voiced regions
  • red overlay in the 2nd period is the prediction

8
Generative Model
When 2 successive frames are not voiced, we
assume that phase information in the latter
cannot be reliably predicted, so we use a model
of the power spectrum.
Define f(y) = abs(F(y)) / ||abs(F(y))||
(normalized power spectrum)
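
A minimal sketch of the normalized spectrum feature follows. It assumes F is the discrete Fourier transform, that the normalization is the Euclidean norm, and that segments are zero-padded to a fixed length n_fft so spectra of different segments are comparable; none of these choices is stated on the slide.

```python
import numpy as np

def normalized_spectrum(y, n_fft=256):
    """Magnitude spectrum of y divided by its norm (phase is discarded).

    n_fft is a hypothetical fixed transform size: shorter segments are
    zero-padded, longer ones are truncated by np.fft.rfft.
    """
    mag = np.abs(np.fft.rfft(y, n=n_fft))
    norm = np.linalg.norm(mag)
    return mag / norm if norm > 0 else mag
```

Under these assumptions the feature is unchanged when the segment is multiplied by any nonzero constant, which is why amplitude does not need to be modeled separately for these segments.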
9
Regularizer - Upward Zero Crossings (due to John
Hopfield)
  • Constrain the segment boundaries to start and
    end only at upward zero crossings

To further regularize the space of valid segment
boundaries, we can impose constraints on the
minimum and maximum length of segments
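
The two constraints above can be sketched as follows. The definition of an upward zero crossing used here (a negative sample followed by a non-negative one) and the function names are assumptions made for illustration.

```python
import numpy as np

def upward_zero_crossings(x):
    """Indices i where the signal goes from x[i-1] < 0 to x[i] >= 0."""
    x = np.asarray(x)
    return np.flatnonzero((x[:-1] < 0) & (x[1:] >= 0)) + 1

def candidate_segments(crossings, min_len, max_len):
    """All (start, end) pairs of crossings whose length lies in [min_len, max_len]."""
    return [(int(s), int(e))
            for i, s in enumerate(crossings)
            for e in crossings[i + 1:]
            if min_len <= e - s <= max_len]
```

Only pairs returned by candidate_segments need to be scored during inference, which keeps the search space small.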
10
Inference (approx. E step)
  • Computational task
  • Infer segment boundaries, segment types and
    transformation parameters
  • Exact inference intractable
  • Number of valid configurations of the boundary
    variables is exponential
  • Find MAP estimates using dynamic programming
  • 2-dimensional dynamic programming grid with size
    given by the number of upward zero crossings
  • For every valid boundary pair (a,b), the grid
    entry holds the probability of (a,b) being the
    last segment in the best segmentation of the
    signal up to b
  • Grid is sparse
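
The dynamic program described on this slide can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: scores are log-probabilities, segment types are ignored, the whole span from the first to the last crossing must be covered, and pair_score and first_score are hypothetical callables standing in for the model's transition terms.

```python
import math

def map_segmentation(z, min_len, max_len, pair_score, first_score=lambda seg: 0.0):
    """Best segmentation of the span z[0]..z[-1] with boundaries drawn from z.

    z          : sorted sample indices of upward zero crossings
    pair_score : hypothetical log-score of a segment given the previous segment
    """
    K = len(z)
    best, back = {}, {}   # sparse grid keyed by (i, j): last segment is (z[i], z[j])
    for j in range(1, K):
        for i in range(j):
            if not (min_len <= z[j] - z[i] <= max_len):
                continue  # invalid length, so this grid cell stays empty
            if i == 0:
                best[(i, j)], back[(i, j)] = first_score((z[i], z[j])), None
                continue
            for h in range(i):
                if (h, i) not in best:
                    continue
                cand = best[(h, i)] + pair_score((z[h], z[i]), (z[i], z[j]))
                if cand > best.get((i, j), -math.inf):
                    best[(i, j)], back[(i, j)] = cand, h
    ends = [(score, i) for (i, j), score in best.items() if j == K - 1]
    if not ends:
        return None, -math.inf
    score, i = max(ends)
    path, j = [K - 1, i], K - 1
    while back[(i, j)] is not None:   # follow back-pointers to the first segment
        h = back[(i, j)]
        path.append(h)
        i, j = h, i
    return [z[k] for k in reversed(path)], score

# Toy score (an assumption): successive segments should have similar lengths.
toy = lambda prev, cur: -float(((cur[1] - cur[0]) - (prev[1] - prev[0])) ** 2)
bounds, score = map_segmentation([0, 7, 20, 40, 52, 60, 80], 10, 30, toy)
# bounds == [0, 20, 40, 60, 80]
```

Storing the grid as a dictionary reflects the sparsity noted above: cells exist only for boundary pairs that satisfy the length constraints and are reachable from the start.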

11
Learning
  • Learn the parameters of the model, l0 and l1, by
    maximizing the expected value of the complete log
    likelihood (the posterior is the delta function
    computed during inference)
  • Updates for l0 and l1 correspond to the normalized
    average spectrum of voiced and unvoiced segments
  • The variance parameters correspond to the
    variances of these spectra
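
One possible reading of these updates is sketched below: for each segment type, average the per-segment normalized magnitude spectra, renormalize the average, and take the per-frequency variance. The order of averaging and normalizing, the fixed transform size n_fft, and the function name are assumptions; the slide gives only the summary above.

```python
import numpy as np

def spectral_updates(segments, types, n_fft=256):
    """Return {v: (mean_spectrum, variance)} for segment types v in {0, 1}.

    segments : list of 1-D arrays from the inferred (MAP) segmentation
    types    : 0 for unvoiced, 1 for voiced, one entry per segment
    """
    feats = []
    for seg in segments:
        mag = np.abs(np.fft.rfft(seg, n=n_fft))
        feats.append(mag / (np.linalg.norm(mag) + 1e-12))  # epsilon avoids 0/0
    feats = np.array(feats)
    types = np.asarray(types)
    params = {}
    for v in (0, 1):
        group = feats[types == v]
        if len(group) == 0:
            continue   # no segments of this type were inferred
        mean = group.mean(axis=0)
        params[v] = (mean / (np.linalg.norm(mean) + 1e-12), group.var(axis=0))
    return params
```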

12
Results: typical segmentation
13
Time Scale Modification
  • Stochastically remove or add frames

Original clip
2 x slower
2 x faster
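
A minimal sketch of stochastic time scale modification, assuming the "frames" are the inferred segments and that each segment is emitted a random number of times whose expectation equals the desired stretch. The parameter stretch and the sampling rule are hypothetical; the slide does not specify how frames are chosen.

```python
import numpy as np

def time_scale(signal, boundaries, stretch, seed=0):
    """Repeat or drop whole segments so the output is about `stretch` times as long.

    stretch = 2.0 gives roughly 2x slower speech, stretch = 0.5 roughly 2x faster.
    """
    rng = np.random.default_rng(seed)
    whole = int(np.floor(stretch))
    frac = stretch - whole
    out = []
    for s, e in zip(boundaries[:-1], boundaries[1:]):
        copies = whole + int(rng.random() < frac)   # expected value equals stretch
        out.extend([signal[s:e]] * copies)
    return np.concatenate(out) if out else np.array([])
```

Because whole pitch periods are inserted or removed, the length of each period, and therefore the pitch, is left unchanged while the duration changes.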
14
Pitch Tracking / Voicing Detection
Counting the number of samples in the segments
for voiced region gives an estimate for pitch
period.
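
As a small worked example of this estimate: a voiced segment of N samples recorded at sampling rate fs corresponds to a pitch of fs / N Hz. The sketch below assumes the segment boundaries are sample indices; the function name and the sample_rate parameter are introduced for illustration.

```python
import numpy as np

def pitch_track(boundaries, sample_rate):
    """Pitch estimate in Hz for each voiced segment, from its length in samples."""
    periods = np.diff(np.asarray(boundaries))   # samples per glottal pulse period
    return sample_rate / periods
```

For instance, 80-sample segments at a sampling rate of 8000 Hz give a pitch of 100 Hz.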
15
Filling in missing/corrupted regions of speech
  • Our algorithm treats the corrupted region as
    unvoiced.
  • To reconstruct, fill in the corrupted region by
    generating new segments with periods between those
    of the two bounding voiced regions.

16
Clipped speech restoration
  • Saturation due to poor recording / quantization
  • Can we use the inferred transformation/pitch to
    complete?

Work in progress
17
Voice/Gender conversion
  • A very naïve approach
  • Pitch of male voice around 110 Hz
  • Pitch of female voice around 210 Hz
  • Idea: stretch/shrink segments to
    decrease/increase pitch
  • Cubic spline smoother along segment boundaries

Work in progress
18
Current Work
  • Multiple sound sources
  • Several templates evolving simultaneously
  • Need for more complicated model
  • Example
  • Voice background music
  • Timescale modified (slower)
  • Denoising
  • Compression
  • Companding (volume normalization)
  • Reverberant filtering