A Segment-Based Probabilistic Generative Model of Speech

1
A Segment-Based Probabilistic Generative Model
of Speech

Kannan Achan
Joint work with
Sam Roweis, Aaron Hertzmann, Brendan Frey
University of Toronto
http://www.psi.toronto.edu/kannan/segmental

2
Time Domain Speech Processing

Speech processing purely in the time domain is
generally considered difficult
Very high variability
Microphones, room acoustics
Noise can be a serious problem
A time-frequency representation is generally used
Stable
Spectrogram reading
But:
Phase information is generally discarded
Timing information is lost
Employs arbitrary windowing

Same utterance, different microphones
3
Still, time domain is appealing
No information is discarded from the input
signal.
Notice that there is a lot of amazing
structure in the time signal.
4
Speech wave: a quick primer

Vibrating vocal cords: voiced speech
Frequency of vibration is called pitch
Turbulent air flow: unvoiced speech
Coloured noise
Silence period

5
Goal: Segment the Waveform

Group samples into voiced, unvoiced, or silence
regions
Segment voiced regions into glottal pulses

6
Generative Model of Speech Production

Assuming segments are generated by a first-order
Markov process, 4 types of transitions are
possible:
1. Voiced to Voiced
2. Voiced to Unvoiced
3. Unvoiced to Voiced
4. Unvoiced to Unvoiced
Given segment boundaries b, segment types v
(voiced, v = 1, or unvoiced, v = 0) and
transformations t, the generative model is a
conditional Markov model

For simplicity, we discard silence periods
beforehand
7
Time domain modeling: voiced region

Voiced → Voiced transition: the next segment is a
noisy copy of a transformed version of the
previous one
Transformations t = (α, β, γ)
Time Warp (α): Stretch/Shrink. Maps an
n-vector to an αn-vector
Amplitude Scaling (β): Scalar multiplication
(βx)
Amplitude Shift (γ): Scalar addition (x + γ)
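
A minimal sketch of the three transformations applied to a previous segment. The slides do not say how the time warp is computed, so linear interpolation is an assumption here, and the names `time_warp` and `transform` are hypothetical.

```python
def time_warp(x, alpha):
    """Resample x (length n) to about alpha * n samples.

    Linear interpolation is an assumption; the slides only say that the
    warp stretches or shrinks the segment.
    """
    n = len(x)
    m = max(2, round(alpha * n))
    out = []
    for i in range(m):
        # Position of output sample i on the input time axis.
        pos = i * (n - 1) / (m - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append((1 - frac) * x[lo] + frac * x[hi])
    return out


def transform(x, alpha, beta, gamma):
    """Time warp, then amplitude scaling (beta) and amplitude shift (gamma)."""
    return [beta * v + gamma for v in time_warp(x, alpha)]


prev_segment = [0.0, 1.0, 0.0, -1.0, 0.0]
# Noise-free prediction of the next segment: twice as long, half the amplitude.
predicted = transform(prev_segment, alpha=2.0, beta=0.5, gamma=0.1)
```

Under the model, the observed next segment would be this prediction plus noise.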

8
Generative Model: Harmonic region

Successive voiced regions: the
red overlay in the 2nd period is the prediction

The best transformation can be found locally using
linear regression
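
To illustrate the regression step, here is a small sketch that fits the amplitude scaling and shift by ordinary least squares. It assumes the previous segment has already been time-warped to the length of the current one; the function name `fit_scale_shift` and the sample values are hypothetical.

```python
def fit_scale_shift(w, y):
    """Least-squares beta, gamma minimizing sum((y - (beta * w + gamma)) ** 2).

    w: previous segment, already warped to len(y) (an assumption).
    y: current segment.
    """
    n = len(w)
    mean_w = sum(w) / n
    mean_y = sum(y) / n
    var_w = sum((a - mean_w) ** 2 for a in w)
    cov_wy = sum((a - mean_w) * (b - mean_y) for a, b in zip(w, y))
    # A constant w carries no scale information; fall back to beta = 0.
    beta = cov_wy / var_w if var_w > 0 else 0.0
    gamma = mean_y - beta * mean_w
    return beta, gamma


w = [0.0, 1.0, 0.0, -1.0]
y = [0.1, 0.9, 0.1, -0.7]   # exactly 0.8 * w + 0.1
beta, gamma = fit_scale_shift(w, y)
```

The residual of this fit can then serve as the noise term when scoring a candidate pair of segments.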
9
Generative Model: Non-Harmonic region
When 2 successive frames are not voiced, we
assume that phase information in the latter
cannot be reliably predicted → Model only the
power spectrum. λk and Σk are the (learned) mean
and covariance of the normalized power spectrum
of the model. f(y) is the normalized power
spectrum of y
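
A rough sketch of scoring a segment by its normalized power spectrum. Several choices are assumptions not stated in the slides: a naive DFT, normalization so that the spectrum sums to 1, a diagonal covariance, and segments of a fixed length. Both function names are hypothetical.

```python
import cmath
import math


def normalized_power_spectrum(y):
    """Power at the non-negative DFT frequencies of y, scaled to sum to 1."""
    n = len(y)
    power = []
    for k in range(n // 2 + 1):
        coeff = sum(y[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power.append(abs(coeff) ** 2)
    total = sum(power)
    return [p / total for p in power] if total > 0 else power


def diag_gaussian_loglik(f, mean, var):
    """Log density of f under a Gaussian with diagonal covariance (assumed)."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(f, mean, var)
    )


# An alternating signal puts all of its power in the highest frequency bin.
f = normalized_power_spectrum([1.0, -1.0] * 4)
score = diag_gaussian_loglik(f, mean=f, var=[0.01] * len(f))
```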
10
Waveform continuity: Zero Crossings (due to
John Hopfield)

Constrain the segment boundaries to start and
end only at upward zero crossings

Ensures waveform continuity
Makes optimization tractable
To further regularize the space of valid segment
boundaries, we can impose constraints on the
minimum and maximum length of segments
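
The boundary constraints can be sketched as follows. The exact definition of an upward zero crossing (negative sample followed by a non-negative one) and the function names are assumptions made for illustration.

```python
def upward_zero_crossings(x):
    """Indices i where the signal rises from negative to non-negative."""
    return [i for i in range(1, len(x)) if x[i - 1] < 0 <= x[i]]


def valid_segments(crossings, min_len, max_len):
    """All (start, end) crossing pairs with min_len <= end - start <= max_len."""
    return [
        (a, b)
        for i, a in enumerate(crossings)
        for b in crossings[i + 1:]
        if min_len <= b - a <= max_len
    ]


signal = [0.5, -0.5, 0.5, 1.0, -1.0, -0.2, 0.3, -0.1, 0.4]
crossings = upward_zero_crossings(signal)        # [2, 6, 8]
segments = valid_segments(crossings, 2, 4)       # [(2, 6), (6, 8)]
```

The length limits discard the pair (2, 8), which is why the set of candidate segments stays small.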

11
Inference

Computational task:
Infer segment boundaries (b), segment types (v)
and transformation parameters (t)
Exact inference is intractable
The number of valid configurations of the boundary
variables is exponential
Find MAP estimates using dynamic programming
2-dimensional dynamic programming grid with size
given by the number of zero crossings
For every valid pair of boundaries
(a,b), the entry in the grid is the
probability of (a,b) being the last segment in
the best segmentation of the signal up to b.
The grid is sparse
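
A simplified sketch of the dynamic programming grid described above. It works with log-scores instead of probabilities, ignores segment types, and takes a hypothetical user-supplied `log_score(prev, cur)` function in place of the model's transition terms, so it illustrates the recursion only.

```python
def best_segmentation(crossings, min_len, max_len, log_score):
    """Best-scoring segmentation by dynamic programming over boundary pairs.

    grid[(a, b)] is the best log-score of any segmentation from crossings[0]
    up to b whose last segment is (a, b). log_score(prev, cur) is hypothetical;
    prev is None for the first segment.
    """
    start, end = crossings[0], crossings[-1]
    grid, back = {}, {}
    for b in crossings[1:]:                  # crossings are sorted ascending
        for a in crossings:
            if a >= b or not (min_len <= b - a <= max_len):
                continue                     # sparse grid: skip invalid pairs
            if a == start:
                grid[(a, b)] = log_score(None, (a, b))
                back[(a, b)] = None
            # Extend every stored segmentation that ends at boundary a.
            candidates = [
                (grid[(c, a)] + log_score((c, a), (a, b)), (c, a))
                for c in crossings if (c, a) in grid
            ]
            if candidates:
                grid[(a, b)], back[(a, b)] = max(candidates)
    finals = [(score, seg) for seg, score in grid.items() if seg[1] == end]
    if not finals:
        return []
    _, seg = max(finals)
    path = []
    while seg is not None:                   # follow back-pointers to the start
        path.append(seg)
        seg = back[seg]
    return path[::-1]


def prefer_equal_lengths(prev, cur):
    """Toy score: penalize a change in length between successive segments."""
    if prev is None:
        return 0
    return -abs((cur[1] - cur[0]) - (prev[1] - prev[0]))


path = best_segmentation([0, 4, 8, 12, 13, 16], 3, 5, prefer_equal_lengths)
```

With the toy score, the path through four equal-length segments wins over the alternative that passes through boundary 13.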

12
Learning

Learn the parameters of the model, λ0 and λ1, by
maximizing the expected value of the complete log
likelihood (the posterior is the delta function
computed during inference)
Updates for λ0 and λ1 correspond to the normalized
average spectrum of voiced and unvoiced segments
The Σs correspond to variances on these spectra
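
As an illustration of the update, the sketch below averages the normalized spectra of all segments assigned to one type and computes a per-bin variance. It assumes spectra of equal length and a diagonal variance; the function name and the example values are hypothetical.

```python
def update_spectrum_model(spectra):
    """Mean and per-bin variance of a list of equal-length normalized spectra."""
    n = len(spectra)
    columns = list(zip(*spectra))            # one tuple of values per frequency bin
    mean = [sum(col) / n for col in columns]
    var = [sum((v - m) ** 2 for v in col) / n for col, m in zip(columns, mean)]
    return mean, var


# Spectra of the segments that inference labeled with one type (made-up values).
spectra_of_one_type = [[0.2, 0.8], [0.4, 0.6]]
mean, var = update_spectrum_model(spectra_of_one_type)
```

Running the same update once per segment type gives one mean and one variance vector for each.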

13
Results: typical segmentation
14
Time Scale Modification

Stochastically remove or add frames

Original clip
2× slower
2× faster
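
One way to read "stochastically remove or add frames" is sketched below: each inferred segment is emitted a random number of times whose average is the inverse of the speed-up factor. This reading, the function name `time_scale`, and the fixed default seed are assumptions.

```python
import random


def time_scale(segments, rate, rng=None):
    """Stochastically repeat or drop whole segments.

    rate > 1 speeds playback up (segments are dropped); rate < 1 slows it
    down (segments are repeated). Each segment appears 1 / rate times on
    average.
    """
    if rng is None:
        rng = random.Random(0)               # fixed seed, for repeatable output
    copies = 1.0 / rate
    whole = int(copies)
    out = []
    for seg in segments:
        # Emit `whole` copies, plus one more with probability copies - whole.
        n = whole + (1 if rng.random() < copies - whole else 0)
        out.extend([seg] * n)
    return out


segments = ["s0", "s1", "s2", "s3"]
slower = time_scale(segments, rate=0.5)      # every segment appears twice
faster = time_scale(segments, rate=2.0)      # about half of the segments remain
```

Because whole pitch periods are repeated or dropped, the pitch within each remaining segment is unchanged.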
15
Pitch Tracking
Counting the number of samples in each segment
of a voiced region gives an estimate of the pitch
period.
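
In code, the estimate is the sampling rate divided by the segment length. The 16 kHz sampling rate in the example is an assumption, as is the function name.

```python
def pitch_from_segments(boundaries, sample_rate):
    """Pitch estimate in Hz for each voiced segment given as (start, end)."""
    return [sample_rate / (end - start) for start, end in boundaries]


# Two voiced segments of 160 samples each at an assumed 16 kHz sampling rate.
pitches = pitch_from_segments([(0, 160), (160, 320)], sample_rate=16000)
```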
16
Voicing detection
Performed automatically during the course of
dynamic programming. We can just read off the
optimal segmentation labels
17
Quantitative Results
18
Filling in missing/corrupted regions of speech

Our algorithm treats the corrupted region as
unvoiced.
To reconstruct - fill in the corrupted region by
generating new segments with periods between the
two bounding voiced regions.

19
Work in progress: Declipping / Denoising

Clipped speech restoration
Saturation due to poor recording / quantization
Can we use the inferred transformation to
complete?

20
Conclusion

We have presented a simple segmental model for
analyzing speech waveforms directly in the time
domain.
A wide range of applications becomes possible under
this single framework
We are also investigating many possible
applications, including voice conversion, volume
equalization and reverberant filtering

21
Dynamic programming in detail
22
Estimating transformation can be seen as linear
regression
23
Generative model of noisy time domain signals
24
Voice/Gender conversion

A very naïve approach
Pitch of male voice: around 110 Hz
Pitch of female voice: around 210 Hz
Idea: stretch/shrink segments to
decrease/increase pitch
Cubic spline smoother along segment boundaries

Work in progress
25
Current Work

Multiple sound sources
Several templates evolving simultaneously
Need for more complicated model
Example
Voice with background music
Timescale modified (slower)
Denoising
Compression
Companding (volume normalization)
Reverberant filtering
