A Segmental HMM for Speech Waveforms

1
A Segmental HMM for Speech Waveforms
  • Kannan Achan, Sam Roweis, Brendan Frey

2
  • Speech processing purely in the time domain is
    generally considered difficult
  • Perceptual similarity is not preserved
  • Microphones, room acoustics
  • A time-frequency representation is generally used

3
How is speech produced?
  • Energy source: air pressure from the lungs
  • Larynx, vocal cords
  • Modulated by a transfer function that depends on
    the shape of the vocal tract

4
Voiced, Unvoiced, Silence
  • Vibrating vocal cords: voiced speech
  • Frequency of vibration is called pitch
  • Turbulent air flow: unvoiced speech
  • Coloured noise
  • Silence period

5
Time domain modeling
[Figure labels: voiced region, unvoiced region]
  • Identify the glottal pulse period for voiced regions
  • Mechanism to identify a region as unvoiced

6
Time domain modeling
  • Goal: break the speech signal S into segments
    s_1, s_2, ..., s_N corresponding to glottal pulse
    periods
  • Notice that adjacent segments look similar
  • Idea: first-order Markov chain
  • Model each segment as a transformed version of the
    previous segment

7
Transformations
  • Time warp (α): horizontal stretching
  • Stretches/shrinks the segment
  • Matrix multiplication: Sx
  • S is an αn × n matrix
  • Maps an n-vector to an αn-vector
  • Amplitude scaling (β): vertical stretching
  • Scalar multiplication: βx
  • Amplitude shift (γ): vertical shift
  • Scalar addition: x + γ
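
The three transformations above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slides do not say how the warp matrix S is built, so linear interpolation is assumed here, and the order of composition (warp, then scale, then shift) is also an assumption. The names `warp_matrix`, `transform`, `alpha`, `beta` and `gamma` are hypothetical.

```python
import numpy as np

def warp_matrix(n, alpha):
    """Build an (m x n) matrix S with m = round(alpha * n).

    Assumption: each output sample is a linear interpolation of two
    neighbouring input samples, so every row of S sums to 1.
    """
    m = max(2, int(round(alpha * n)))
    S = np.zeros((m, n))
    pos = np.linspace(0, n - 1, m)      # source position read by each output sample
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, n - 1)
    frac = pos - lo
    rows = np.arange(m)
    S[rows, lo] += 1.0 - frac
    S[rows, hi] += frac
    return S

def transform(x, alpha, beta, gamma):
    """Time warp, then amplitude scaling, then amplitude shift."""
    S = warp_matrix(len(x), alpha)
    return beta * (S @ x) + gamma
```

With a warp factor of 2, a 4-sample segment becomes an 8-sample segment; a factor below 1 shrinks it.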

8
Probability model
  • Segments s_1, s_2, ..., s_K
  • Segment boundaries b_1, b_2, ..., b_{K+1}
  • Transformation T_k maps s_{k-1} to s_k
  • Parameterized as T_k(α_k, β_k, γ_k)

  • α_k is discretized; its range is hand-set using the
    expected pitch period and sampling frequency
  • For a given α_k, we can find β_k and γ_k using linear
    regression
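
The regression step can be sketched like this. It is an illustration under stated assumptions: the time warp is taken to be the length ratio of the two segments and is applied by linear interpolation, and the amplitude scaling and shift are then found by ordinary least squares. The function names `resample` and `fit_transform` are hypothetical.

```python
import numpy as np

def resample(x, m):
    """Resample x to m samples by linear interpolation (an assumed warp)."""
    x = np.asarray(x, dtype=float)
    return np.interp(np.linspace(0, len(x) - 1, m), np.arange(len(x)), x)

def fit_transform(prev_seg, seg):
    """Fit seg ~ beta * warp(prev_seg) + gamma.

    Returns (alpha, beta, gamma, err) where alpha is the length ratio and
    err is the sum of squared residuals of the least-squares fit.
    """
    seg = np.asarray(seg, dtype=float)
    alpha = len(seg) / len(prev_seg)
    warped = resample(prev_seg, len(seg))
    A = np.column_stack([warped, np.ones(len(seg))])
    (beta, gamma), *_ = np.linalg.lstsq(A, seg, rcond=None)
    err = float(np.sum((A @ np.array([beta, gamma]) - seg) ** 2))
    return alpha, float(beta), float(gamma), err
```

Because the time warp is fixed before the fit, the remaining problem is linear in the scaling and shift, which is why a closed-form regression is enough.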
9
Regularizer: Upward Zero Crossings (due to John
Hopfield)
  • Constrain the segment boundaries to start and
    end only at upward zero crossings
  • Goal: break the speech signal into glottal pulse
    periods, each starting at an upward zero crossing
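
A minimal sketch of finding upward zero crossings in a sampled waveform. The convention used here (index i is a crossing when sample i-1 is negative and sample i is non-negative) is an assumption; the slides do not define it precisely. The function name is hypothetical.

```python
import numpy as np

def upward_zero_crossings(x):
    """Return indices i where x[i-1] < 0 <= x[i]."""
    x = np.asarray(x, dtype=float)
    return np.flatnonzero((x[:-1] < 0) & (x[1:] >= 0)) + 1
```

Restricting boundaries to these indices shrinks the search space from every sample to a small set of candidates per pitch period.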

10
A Simple Greedy algorithm
  • Given a good initial guess S_t = (b_t, b_{t+1})
  • Enumerate the N zero crossings that occur
    immediately after b_{t+1}: z_1^t, z_2^t, ..., z_N^t
  • For each zero crossing z_k^t
  • Resample S_t to be of length (z_k^t - b_{t+1})
  • Find the optimal amplitude scaling and shift
    parameters using linear regression
  • Compute the error with the target segment, i.e.
    the samples of x between b_{t+1} and z_k^t
  • End
  • Select the target segment with the
    least error as S_{t+1}
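
One step of the greedy procedure above might look like the sketch below. Several details are assumptions: resampling uses linear interpolation, the error is the mean squared residual (so that candidates of different lengths are comparable), and `crossings` is a sorted list of upward zero crossing indices computed elsewhere. The name `greedy_next_boundary` is hypothetical.

```python
import numpy as np

def greedy_next_boundary(x, b_t, b_t1, crossings, n_candidates=5):
    """Pick the end of the next segment given the current segment x[b_t:b_t1]."""
    x = np.asarray(x, dtype=float)
    template = x[b_t:b_t1]
    src = np.arange(len(template))
    candidates = [int(z) for z in crossings if z > b_t1][:n_candidates]
    best_err, best_z = None, None
    for z in candidates:
        target = x[b_t1:z]
        if len(target) < 2:
            continue
        # Resample the template to the candidate segment length.
        warped = np.interp(np.linspace(0, len(template) - 1, len(target)), src, template)
        # Least-squares amplitude scaling and shift.
        A = np.column_stack([warped, np.ones(len(target))])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        err = float(np.mean((A @ coef - target) ** 2))
        if best_err is None or err < best_err:
            best_err, best_z = err, z
    return best_z
```

Calling this repeatedly, each time using the segment just selected as the new template, walks through the waveform one pitch period at a time.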

11
Problems with Greedy technique
  • Unvoiced to voiced transition: lack of a reliable
    template leads to poor estimates
  • We need an efficient optimizer to infer the segments

12
Embedded HMM (Neal, Beal & Roweis, NIPS 2003)
  • Addresses the issue of sampling from the
    posterior distribution in a nonlinear dynamical
    system
  • We will use it as an optimizer to find the MAP
    estimate

13
HMM: Create a Pool of Segments
  • Let z_1^1, z_2^1, ..., z_K^1 be the participating zero
    crossings in the current solution.
  • Define Nghd(z_k^t, g) as the set containing z_k^t and
    g neighbouring upward zero crossings around z_k^t.
  • States of the HMM are given by the set Nghd(z_k^t, g) ×
    Nghd(z_k^{t+1}, g)
  • Consider only those tuples that form a valid
    segment (non-negative length)
  • Define a compatibility function to enforce segment
    continuity
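
The pool construction can be sketched as below. This is an illustration with assumptions: the neighbourhood is taken to mean up to g upward zero crossings on each side of the current boundary (the slides do not say whether g counts one side or both), and continuity is taken to mean that the next segment starts where the previous one ends. All function names are hypothetical.

```python
from itertools import product

def neighbourhood(crossings, z, g):
    """z together with up to g crossings on each side (assumed reading of g)."""
    i = crossings.index(z)
    return crossings[max(0, i - g): i + g + 1]

def segment_states(crossings, start, end, g):
    """Candidate (start, end) pairs around the current segment, keeping only
    pairs with non-negative length."""
    starts = neighbourhood(crossings, start, g)
    ends = neighbourhood(crossings, end, g)
    return [(s, e) for s, e in product(starts, ends) if e - s >= 0]

def compatible(prev_state, next_state):
    """Segment continuity: the next segment starts where the previous ends."""
    return prev_state[1] == next_state[0]
```

Here `crossings` is a sorted list of upward zero crossing indices, and `start` and `end` are the boundaries of one segment in the current solution.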

14
Dynamic programming to infer segment boundaries
  • Using the current estimate, form a pool of candidate
    segments
  • Run the Viterbi algorithm to find the best path

[Figure: Viterbi trellis over time steps 1, 2, ..., T; each state is a pair of candidate segment boundaries, e.g. (b_1^t, b_2^t), (b_k^t, b_l^t), (b_x^t, b_y^t)]
  • Monotonically improves the model likelihood of
    the observed waveform
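
A generic Viterbi pass over such a trellis can be sketched as follows. This is not the authors' code: `states[t]` is assumed to be a non-empty list of candidate (start, end) pairs for time step t, and `pair_cost` is a hypothetical user-supplied function that returns the cost of moving from one state to the next (for example a regression error), or infinity when the two segments are not contiguous.

```python
def viterbi(states, pair_cost):
    """Return (best_path, total_cost) minimizing the summed pairwise cost."""
    cost = [0.0] * len(states[0])
    back = []
    for t in range(1, len(states)):
        new_cost, ptr = [], []
        for s in states[t]:
            totals = [cost[j] + pair_cost(p, s) for j, p in enumerate(states[t - 1])]
            j = min(range(len(totals)), key=totals.__getitem__)
            new_cost.append(totals[j])
            ptr.append(j)
        cost = new_cost
        back.append(ptr)
    # Trace back from the cheapest final state.
    j = min(range(len(cost)), key=cost.__getitem__)
    total = cost[j]
    path = [states[-1][j]]
    for t in range(len(states) - 1, 0, -1):
        j = back[t - 1][j]
        path.append(states[t - 1][j])
    return path[::-1], total
```

If the current solution is always included in the candidate pool, the best path can never score worse than the current one, which is consistent with the monotonic improvement noted on the slide.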
15
Results
(C) After 3 iterations
16
Time Scale Modification
  • Stochastically remove or add frames

[Audio clips: original clip, 2× slower, 2× faster]
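
A minimal sketch of stochastic time-scale modification, assuming that the "frames" are the inferred segments and that each segment is emitted a random number of times whose expected value is 1/rate. No smoothing is applied at the joins. The function name and the `rate` and `seed` parameters are hypothetical.

```python
import numpy as np

def time_scale(x, boundaries, rate, seed=0):
    """rate > 1 plays faster (segments dropped), rate < 1 slower (segments repeated)."""
    x = np.asarray(x)
    rng = np.random.default_rng(seed)
    expected = 1.0 / rate
    out = []
    for s, e in zip(boundaries[:-1], boundaries[1:]):
        copies = int(np.floor(expected))
        # Stochastic rounding of the fractional part.
        if rng.random() < expected - copies:
            copies += 1
        out.extend([x[s:e]] * copies)
    return np.concatenate(out) if out else x[:0]
```

Because whole pitch periods are removed or repeated, the duration changes while the length of each period, and therefore the pitch, stays the same.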
17
Voicing/Unvoicing detection
  • Periodicity of the segments can be used to
    discriminate voiced and unvoiced regions
  • Voiced regions are more periodic
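
One possible periodicity measure is sketched below. The slides do not specify the measure, so this is an assumption: the previous segment is resampled to the length of the current one and the normalized correlation between the two is returned. Values near 1 suggest a voiced region. The function name is hypothetical.

```python
import numpy as np

def periodicity(prev_seg, seg):
    """Normalized correlation between adjacent segments, in [-1, 1]."""
    prev_seg = np.asarray(prev_seg, dtype=float)
    seg = np.asarray(seg, dtype=float)
    warped = np.interp(np.linspace(0, len(prev_seg) - 1, len(seg)),
                       np.arange(len(prev_seg)), prev_seg)
    a = warped - warped.mean()
    b = seg - seg.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0
```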

18
Pitch Tracking
Counting the number of samples in each segment
of a voiced region gives an estimate of the pitch
period.
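
As a small worked example, the pitch in Hz follows from the segment lengths and the sampling frequency. The function name and the parameter `fs` (sampling frequency in Hz) are hypothetical.

```python
import numpy as np

def pitch_track(boundaries, fs):
    """Pitch estimate in Hz for each segment: fs / (samples per pitch period)."""
    return fs / np.diff(boundaries)
```

For example, segments of 80 samples at a sampling frequency of 8000 Hz correspond to a pitch of 100 Hz.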
19
Pitch on spectrogram
20
Clipped speech restoration
  • Saturation due to poor recording / quantization
  • Can we use the inferred transformation/pitch to
    complete the clipped regions?

Work in progress

21
Voice/Gender conversion
  • A very naïve approach
  • Pitch of a male voice: around 110 Hz
  • Pitch of a female voice: around 210 Hz
  • Idea: stretch/shrink segments to
    decrease/increase pitch
  • Cubic spline smoother along segment boundaries
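
The stretch/shrink idea can be sketched as below, under stated assumptions: each segment is resampled by linear interpolation, and the cubic spline smoothing at segment boundaries mentioned on the slide is omitted. Note that this simple version also changes the overall duration. The function name and the `factor` parameter are hypothetical.

```python
import numpy as np

def shift_pitch(x, boundaries, factor):
    """factor > 1 shrinks each segment (higher pitch); factor < 1 stretches it."""
    x = np.asarray(x, dtype=float)
    out = []
    for s, e in zip(boundaries[:-1], boundaries[1:]):
        seg = x[s:e]
        m = max(2, int(round(len(seg) / factor)))
        out.append(np.interp(np.linspace(0, len(seg) - 1, m),
                             np.arange(len(seg)), seg))
    return np.concatenate(out)
```

Using the rough figures on the slide, moving from about 110 Hz to about 210 Hz would correspond to a factor of roughly 210/110, or about 1.9.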

[Audio examples: female, trumpet]
Work in progress

22
Multiple sound sources
  • Several templates evolving simultaneously
  • Need for more complicated model
  • Example: voice + background music,
    timescale modified (slower)

23
  • Model a hidden state corresponding to the state of
    the glottis
  • 0: unvoiced / silence
  • 1: voiced
  • Handle multiple speakers
  • Need for more complicated models