A Segmental HMM for Speech Waveforms

About This Presentation

Slides: 24

Provided by: Kan6150

Transcript and Presenter's Notes

1
A Segmental HMM for Speech Waveforms

Kannan Achan, Sam Roweis, Brendan Frey

Speech processing purely in the time domain is generally considered difficult
Perceptual similarity is not preserved
Sensitive to microphones and room acoustics
A time-frequency representation is generally used

3
How is Speech produced?

Energy source: air pressure from the lungs
Larynx, vocal cords
Modulated by a transfer function that depends on the shape of the vocal tract

4
Voiced, Unvoiced, Silence

Vibrating vocal cords: voiced speech
Frequency of vibration is called pitch
Turbulent air flow: unvoiced speech
Coloured noise
Silence period

5
Time domain modeling
Voiced region
Unvoiced region

Identify the glottal pulse period for voiced regions

Mechanism to identify a region as unvoiced
6
Time domain modeling
Goal: break the speech signal S into segments s_1, s_2, ..., s_N corresponding to glottal pulse periods
Notice that adjacent segments look similar
Idea: first-order Markov chain
Model each segment as a transformed version of the previous segment
7
Transformations

Time warp (α): horizontal stretching
Stretches/shrinks the segment
Matrix multiplication: Sx
S is an αn × n matrix
Maps an n-vector to an αn-vector
Amplitude scaling (β): vertical stretching
Scalar multiplication: βx
Amplitude shift (γ): vertical shift
Scalar addition: x + γ
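
The three transformations can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the slides describe the warp as a matrix S, and linear interpolation is assumed here as one plausible way to realize it. The names `time_warp` and `transform` are hypothetical.

```python
def time_warp(x, new_len):
    """Resample x to new_len samples by linear interpolation (assumed warp)."""
    n = len(x)
    if n == 1 or new_len == 1:
        return [x[0]] * new_len
    out = []
    for i in range(new_len):
        pos = i * (n - 1) / (new_len - 1)  # fractional index into x
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append((1 - frac) * x[lo] + frac * x[hi])
    return out


def transform(x, alpha, beta, gamma):
    """Time warp by alpha, then amplitude scaling by beta and shift by gamma."""
    new_len = max(1, round(alpha * len(x)))
    return [beta * v + gamma for v in time_warp(x, new_len)]


segment = [0.0, 1.0, 0.0, -1.0, 0.0]
stretched = transform(segment, alpha=2.0, beta=0.5, gamma=0.1)
print(len(segment), len(stretched))
```

Applying the warp before the amplitude terms mirrors the order on the slide; since scaling and shifting act per sample, the order between the warp and the amplitude terms does not change the result for an interpolating warp.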

8
Probability model

Segments s_1, s_2, ..., s_K
Segment boundaries b_1, b_2, ..., b_{K+1}
Transformation T_k maps s_{k-1} to s_k
Parameterized as T_k(α_k, β_k, γ_k): time warp, amplitude scaling, amplitude shift

α_k is discretized; its range is hand-set using the expected pitch period and the sampling frequency
For a given α_k, β_k and γ_k can be found by linear regression
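
The regression step has a standard closed form, written out here as a sketch. The symbol u for the warped previous segment and n' for its length are notation assumed for this note, not taken from the slides.

```latex
% u = S_{\alpha_k} s_{k-1}: previous segment warped to the length n' of s_k
\begin{align*}
(\beta_k, \gamma_k) &= \arg\min_{\beta,\gamma} \sum_{i=1}^{n'} \bigl(\beta\, u_i + \gamma - s_{k,i}\bigr)^2 \\
\beta_k &= \frac{\sum_{i} (u_i - \bar{u})(s_{k,i} - \bar{s}_k)}{\sum_{i} (u_i - \bar{u})^2},
\qquad
\gamma_k = \bar{s}_k - \beta_k\, \bar{u}
\end{align*}
```

Here \bar{u} and \bar{s}_k are the sample means of the two segments, so each candidate α_k costs only one pass over the segment.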
9
Regularizer: Upward Zero Crossings (due to John Hopfield)

Constrain the segment boundaries to start and end only at upward zero crossings
Goal: break the speech signal into glottal pulse periods, each starting at an upward zero crossing
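
A minimal sketch of how upward zero crossings might be located in a sampled waveform. The convention used (previous sample negative, current sample non-negative) is an assumption; the slides do not specify how exact zeros are handled, and `upward_zero_crossings` is a hypothetical name.

```python
def upward_zero_crossings(x):
    """Indices i where the signal rises through zero: x[i-1] < 0 <= x[i]."""
    return [i for i in range(1, len(x)) if x[i - 1] < 0.0 <= x[i]]


wave = [1.0, -1.0, -2.0, 1.0, 2.0, -1.0, 0.0, 3.0]
print(upward_zero_crossings(wave))
```

Restricting boundaries to these indices shrinks the search space from every sample to a short list of candidates.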

10
A Simple Greedy Algorithm

Given a good initial guess S_t = (b_t, b_{t+1})
Enumerate the N zero crossings that occur immediately after b_{t+1}: z_1^t, z_2^t, ..., z_N^t
For each zero crossing z_k^t
  Resample S_t to be of length (z_k^t - b_{t+1})
  Find the optimal amplitude scaling and shift parameters using linear regression
  Compute the error with the target segment, the waveform x between b_{t+1} and z_k^t
End
Select the target segment with the least error as S_{t+1}
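
One greedy step can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: resampling uses linear interpolation, the error is a per-sample mean squared error (so that candidates of different lengths are comparable), and all function names are hypothetical.

```python
def resample(x, new_len):
    """Linearly interpolate x to new_len samples (assumed form of the warp)."""
    n = len(x)
    if n == 1 or new_len == 1:
        return [x[0]] * new_len
    out = []
    for i in range(new_len):
        pos = i * (n - 1) / (new_len - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        out.append((1 - (pos - lo)) * x[lo] + (pos - lo) * x[hi])
    return out


def fit_scale_shift(src, tgt):
    """Least-squares beta, gamma for tgt ~ beta * src + gamma."""
    n = len(src)
    ms, mt = sum(src) / n, sum(tgt) / n
    var = sum((s - ms) ** 2 for s in src)
    if var == 0.0:
        return 0.0, mt  # constant template: only a shift can be fitted
    cov = sum((s - ms) * (t - mt) for s, t in zip(src, tgt))
    beta = cov / var
    return beta, mt - beta * ms


def greedy_next_boundary(x, b_t, b_t1, n_candidates=5):
    """Choose the end of the next segment among zero crossings after b_t1."""
    template = x[b_t:b_t1]
    crossings = [i for i in range(b_t1 + 1, len(x)) if x[i - 1] < 0.0 <= x[i]]
    best = None
    for z in crossings[:n_candidates]:
        target = x[b_t1:z]
        warped = resample(template, len(target))
        beta, gamma = fit_scale_shift(warped, target)
        err = sum((beta * w + gamma - t) ** 2
                  for w, t in zip(warped, target)) / len(target)
        if best is None or err < best[1]:
            best = (z, err)
    return best  # (boundary, error), or None if no crossing was found


pulse = [0.0, 1.0, 2.0, 1.0, 0.0, -1.0, -2.0, -1.0]
signal = pulse * 5
print(greedy_next_boundary(signal, 0, 8))
```

Calling the function repeatedly, each time feeding the chosen boundary back in as the new b_{t+1}, walks through the waveform one segment at a time.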

11
Problems with the Greedy Technique
Unvoiced-to-voiced transitions: the lack of a reliable template leads to poor estimates
We need an efficient optimizer to infer the segments
12
Embedded HMM (Neal, Beal & Roweis, NIPS 2003)

Addresses the issue of sampling from the posterior distribution in a nonlinear dynamical system
We will use it as an optimizer to find the MAP estimate

13
HMM: Create a Pool of Segments

Let z_1^1, z_2^1, ..., z_K^1 be the participating zero crossings in the current solution.
Define Nghd(z_k^t, g) as the set containing z_k^t and the g neighbouring upward zero crossings around z_k^t.
The states of the HMM are given by the set Nghd(z_k^t, g) × Nghd(z_k^{t+1}, g)
Consider only those tuples that form a valid segment (non-negative length)
Define a compatibility function to enforce segment continuity
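
The pool construction might look like the sketch below. Two readings are assumed because the slide leaves them open: the neighbourhood takes g crossings on each side of the current boundary, and zero-length segments are excluded so that every state can be resampled. All names are hypothetical.

```python
def neighbourhood(all_crossings, z, g):
    """z plus up to g upward zero crossings on each side of it (assumed reading)."""
    i = all_crossings.index(z)
    return all_crossings[max(0, i - g): i + g + 1]


def candidate_states(all_crossings, z_start, z_end, g):
    """(start, end) pairs drawn from Nghd(z_start, g) x Nghd(z_end, g)."""
    return [(s, e)
            for s in neighbourhood(all_crossings, z_start, g)
            for e in neighbourhood(all_crossings, z_end, g)
            if e > s]  # assumption: drop empty segments as well as reversed ones


def compatible(state_a, state_b):
    """Segment continuity: the next segment starts where this one ends."""
    return state_a[1] == state_b[0]


crossings = [10, 20, 30, 40, 50]
print(candidate_states(crossings, 20, 30, g=1))
```

With g = 1 each position has at most nine candidate states, so the pool stays small even for long utterances.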

14
Dynamic programming to infer segment boundaries

Using the current estimate, form a pool of candidate segments
Run the Viterbi algorithm to find the best path

[Figure: Viterbi trellis over positions 1, ..., T; each state is a candidate boundary pair, e.g. (b_1^t, b_2^t), (b_k^t, b_l^t), (b_x^t, b_y^t)]
Monotonically improves the model likelihood of the observed waveform
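
A generic Viterbi pass over such a pool can be sketched as below. It is a simplified illustration, not the embedded-HMM procedure itself: the cost of a path is assumed to be the sum of pairwise costs between consecutive states (for example, a negative log-likelihood of the transformation error), with an infinite cost marking incompatible pairs. `viterbi` and `pair_cost` are hypothetical names.

```python
def viterbi(candidates, pair_cost):
    """Minimum-cost path that picks one state per position.

    candidates[t] is the non-empty list of states at position t;
    pair_cost(a, b) is the cost of moving from state a to state b
    (float('inf') if the pair is not allowed).
    """
    inf = float("inf")
    best = [0.0] * len(candidates[0])  # the first segment is taken as free
    back = []
    for t in range(1, len(candidates)):
        row, ptr = [], []
        for s in candidates[t]:
            costs = [best[j] + pair_cost(p, s)
                     for j, p in enumerate(candidates[t - 1])]
            j_min = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[j_min])
            ptr.append(j_min)
        best = row
        back.append(ptr)
    end = min(range(len(best)), key=best.__getitem__)
    if best[end] == inf:
        return None  # no compatible path exists
    path = [end]
    for ptr in reversed(back):  # trace the back-pointers to the start
        path.append(ptr[path[-1]])
    path.reverse()
    return [candidates[t][i] for t, i in enumerate(path)]


def toy_cost(a, b):
    """Toy cost: segments must be contiguous; penalize a change in length."""
    if a[1] != b[0]:
        return float("inf")
    return abs((b[1] - b[0]) - (a[1] - a[0]))


pool = [[(0, 8), (0, 9)], [(8, 16), (9, 16), (8, 17)], [(16, 24), (17, 24)]]
print(viterbi(pool, toy_cost))
```

Because the current solution is always among the candidates, the best path found can never be worse than it, which is consistent with the monotonic improvement noted on the slide.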
15
Results
(C) After 3 iterations
16
Time Scale Modification

Stochastically remove or add frames

Audio examples: original clip, 2x slower, 2x faster
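
Stochastic removal or addition of frames could be sketched as follows, assuming that a "frame" is one inferred segment and that each segment is emitted 1/rate times in expectation. The function name, the `rate` parameter and the seeding are assumptions for illustration.

```python
import random


def time_scale(x, boundaries, rate, seed=0):
    """Speed x up (rate > 1) or slow it down (rate < 1) by dropping or repeating segments."""
    rng = random.Random(seed)
    copies = 1.0 / rate                # expected number of copies per segment
    out = list(x[:boundaries[0]])      # samples before the first boundary
    for start, end in zip(boundaries, boundaries[1:]):
        n = int(copies)
        if rng.random() < copies - n:  # randomize the fractional part
            n += 1
        out.extend(x[start:end] * n)
    out.extend(x[boundaries[-1]:])     # samples after the last boundary
    return out


samples = list(range(40))
bounds = [0, 8, 16, 24, 32, 40]
print(len(time_scale(samples, bounds, rate=0.5)))
```

Since whole pitch periods are repeated or dropped, the duration changes while the length of each period, and hence the pitch, is left untouched.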
17
Voicing/Unvoicing detection

Periodicity of the segments can be used to discriminate voiced and unvoiced regions
Voiced regions are more periodic
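
One way to turn periodicity into a decision is to correlate adjacent segments, as in this sketch. The normalized-correlation measure and the 0.8 threshold are assumptions chosen for illustration; the slides do not say which periodicity measure was used.

```python
import math


def resample(x, new_len):
    """Linearly interpolate x to new_len samples."""
    n = len(x)
    if n == 1 or new_len == 1:
        return [x[0]] * new_len
    out = []
    for i in range(new_len):
        pos = i * (n - 1) / (new_len - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        out.append((1 - (pos - lo)) * x[lo] + (pos - lo) * x[hi])
    return out


def periodicity(seg_a, seg_b):
    """Normalized correlation in [-1, 1] after warping seg_a to len(seg_b)."""
    a = resample(seg_a, len(seg_b))
    ma, mb = sum(a) / len(a), sum(seg_b) / len(seg_b)
    num = sum((u - ma) * (v - mb) for u, v in zip(a, seg_b))
    den = math.sqrt(sum((u - ma) ** 2 for u in a) *
                    sum((v - mb) ** 2 for v in seg_b))
    return num / den if den > 0.0 else 0.0


def is_voiced(seg_a, seg_b, threshold=0.8):
    """Assumed rule: adjacent segments that correlate strongly are voiced."""
    return periodicity(seg_a, seg_b) >= threshold


pulse = [0.0, 1.0, 2.0, 1.0, 0.0, -1.0, -2.0, -1.0]
noise = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
print(is_voiced(pulse, pulse), is_voiced(pulse, noise))
```

The correlation is invariant to amplitude scaling and shift, so it measures only how well the shape repeats.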

18
Pitch Tracking
Counting the number of samples in each segment of a voiced region gives an estimate of the pitch period.
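
The conversion from segment length to pitch is simple arithmetic: a segment of n samples at sampling frequency fs has period n / fs seconds and frequency fs / n Hz. A minimal sketch, with `pitch_track` as a hypothetical name and the boundaries assumed to come from a voiced region:

```python
def pitch_track(boundaries, fs):
    """Return (time in seconds, pitch in Hz) for each segment."""
    track = []
    for start, end in zip(boundaries, boundaries[1:]):
        period = end - start  # pitch period in samples
        track.append((start / fs, fs / period))
    return track


for time_s, f0 in pitch_track([0, 80, 160, 260], fs=8000):
    print(time_s, f0)
```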
19
Pitch on spectrogram
20
Clipped speech restoration