Title: Probabilistic Inference of Speech Signals from Phaseless Spectrograms
1 Probabilistic Inference of Speech Signals from Phaseless Spectrograms
- Kannan Achan, Sam Roweis, Brendan Frey
- University of Toronto
2 Time-Frequency Representation
- The spectrogram is the most common representation of speech: a summary of short-time spectral features
- It contains useful features of a speech signal
- It shows the energy distribution across frequencies over time
3 Short-time Spectral Analysis
- m_k: magnitude of the Fourier transform of the samples in the kth window
- Windows overlap; we assume an overlap of n/2
- Use Hamming/Hanning windows to avoid boundary effects
- m_k = |F s_k|, where F is the Fourier matrix and s_k holds the samples in the kth frame
- Spectrogram M: the magnitude spectra m_k arranged as columns
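The short-time analysis on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `spectrogram`, the use of `rfft` (which keeps only the non-redundant frequency bins), and the test tone are assumptions made for the example.

```python
import numpy as np

def spectrogram(s, n=256):
    """Magnitude spectrogram M using Hamming windows of length n and overlap n/2."""
    hop = n // 2
    window = np.hamming(n)
    num_frames = (len(s) - n) // hop + 1
    # Row k holds s_k, the samples in the kth frame.
    frames = np.stack([s[k * hop : k * hop + n] for k in range(num_frames)])
    # m_k = |F (window * s_k)|; rfft keeps the n/2 + 1 non-redundant bins.
    # Transpose so that each column of M is one frame.
    return np.abs(np.fft.rfft(frames * window, axis=1)).T

# Usage: a 1 kHz tone sampled at 8 kHz puts its energy in a single frequency bin.
fs = 8000
t = np.arange(2048) / fs
M = spectrogram(np.sin(2 * np.pi * 1000 * t))
peak_bins = M.argmax(axis=0)
```

Each column of `M` is one m_k; the phase returned by the FFT is discarded, which is exactly the information the rest of the talk tries to recover.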
4 Applications and the Bottleneck
- Time Scale Modification
  - Lengthen/shorten speech without altering frequency content
  - Idea: subsample columns of the spectrogram
[Figure: spectrogram with subsampled columns (2x faster); the corresponding waveform is marked "?"]
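The column-subsampling idea is simple to state in code. The sketch below assumes a spectrogram stored with one frame per column; the helper name `subsample_columns` and the random stand-in spectrogram are hypothetical.

```python
import numpy as np

def subsample_columns(M, speedup=2):
    """Keep every `speedup`-th column (frame) of a spectrogram M (frequency x frames)."""
    return M[:, ::speedup]

# Usage with a stand-in spectrogram of 129 frequency bins and 40 frames.
rng = np.random.default_rng(0)
M = rng.random((129, 40))
M_fast = subsample_columns(M, speedup=2)
```

The result has half as many frames but the same frequency axis. It carries no phase, so turning it back into a waveform is the open step.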
5 Applications and the Bottleneck
- Speech de-noising
  - Spectral subtraction, Algonquin, ...
  - Idea: de-noise the spectrogram
[Figure: noisy waveform, noisy spectrogram, de-noised spectrogram, de-noised waveform; the step back to the de-noised waveform is marked "?"]
6 What is the Bottleneck?
Techniques that modify the spectrogram need a reliable procedure to reconstruct the underlying time-domain signal.
[Figure: block diagram. Input speech waveform S -> windowed FFT (F) -> abs(.) gives the spectrogram, angle(.) gives the phase -> time scale modification, denoising, ... -> modified spectrogram, corresponding phase? -> windowed IFFT -> output speech waveform S]
7 Is the Problem Solvable?
- Goal: perform the frequency-to-time transformation
  - Instead of using magnitude and phase -> signal
  - Use multiple magnitude constraints
- Recall: overlapping windows
  - Every s_i contributes to 2 columns in the spectrogram
  - 2 constraints for every s_i, both due to magnitude
  - There is hope
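The "two constraints per sample" argument can be checked by counting how many frames cover each sample when windows of length n overlap by n/2. This is an illustrative sketch; the helper name `frame_coverage` and the small sizes are assumptions. Samples in the first and last half-window fall in only one frame, so the count of 2 holds for interior samples.

```python
import numpy as np

def frame_coverage(num_samples, n):
    """Count how many length-n frames (hop n/2) contain each sample index."""
    hop = n // 2
    counts = np.zeros(num_samples, dtype=int)
    for start in range(0, num_samples - n + 1, hop):
        counts[start : start + n] += 1
    return counts

# Usage: 24 samples, window length 8, overlap 4.
counts = frame_coverage(num_samples=24, n=8)
```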
8 Standard Method: Griffin and Lim
- Alternate until convergence:
  - estimate the phase given the hypothesized time-domain signal and the observed spectrum
  - estimate the time-domain signal given the observed spectrum and a hypothesized phase
- Issues
  - Inconsistent estimates of phase and signal at any iteration
  - Convergence
  - Poor perceptual quality
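The alternation described on this slide can be sketched as follows. This is a simplified illustration in the spirit of Griffin and Lim, not a reference implementation: the helper names, the random initialization, the fixed iteration count, and the layout of `M` (one row per frame) are assumptions of the example.

```python
import numpy as np

def stft(s, n, window):
    """Windowed FFT of overlapping frames (hop n/2); one row per frame."""
    hop = n // 2
    num_frames = (len(s) - n) // hop + 1
    frames = np.stack([s[k * hop : k * hop + n] for k in range(num_frames)])
    return np.fft.rfft(frames * window, axis=1)

def istft(S, n, window, length):
    """Windowed IFFT with least-squares overlap-add."""
    hop = n // 2
    frames = np.fft.irfft(S, n=n, axis=1)
    signal = np.zeros(length)
    norm = np.zeros(length)
    for k, frame in enumerate(frames):
        signal[k * hop : k * hop + n] += window * frame
        norm[k * hop : k * hop + n] += window ** 2
    return signal / np.maximum(norm, 1e-12)

def griffin_lim(M, n, num_iters=50, seed=0):
    """M: observed magnitudes, one row per frame. Returns (signal, error per iteration)."""
    window = np.hamming(n)
    length = (M.shape[0] - 1) * (n // 2) + n
    s = np.random.default_rng(seed).standard_normal(length)  # hypothesized signal
    errors = []
    for _ in range(num_iters):
        S = stft(s, n, window)
        errors.append(np.linalg.norm(np.abs(S) - M))
        phase = np.exp(1j * np.angle(S))         # phase from the hypothesized signal
        s = istft(M * phase, n, window, length)  # signal from observed magnitude + phase
    return s, errors

# Usage: recover a waveform from the magnitudes of a two-tone test signal.
n = 64
t = np.arange(1024)
target = np.sin(0.2 * t) + 0.5 * np.sin(0.05 * t)
M = np.abs(stft(target, n, np.hamming(n)))
s_hat, errors = griffin_lim(M, n, num_iters=30)
```

The `errors` list tracks the mismatch between the observed magnitudes and the magnitudes of the current estimate. A nonzero final value reflects the inconsistency listed under "Issues": the combined magnitude and phase need not be the STFT of any real signal.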
9 Probabilistic Inference of Time-domain Speech Signals
- Given a spectrogram M, the goal is to infer the underlying time-domain signal S
- Say we have prior knowledge about the speaker
- Idea: prior distribution P(S)
  - Rth-order auto-regressive model
10 Probabilistic Inference of Time-domain Speech Signals
- Probability model for spectrogram M and speech s
[Figure: graphical model. Prior (AR model) over the time-domain signal (to be inferred); likelihood function linking the signal to the observed spectrogram frames m]
11 Probability Model
Prior P(s): Rth-order auto-regressive model
- each sample is predicted as a linear combination of the previous R samples (all-pole filter)
- AR parameters (a_r) are known in advance
Likelihood P(M|s): the probability of observing a spectrogram M given a time-domain signal s
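As an illustration of the prior, the sketch below scores a signal under an AR model with known coefficients. It assumes a Gaussian prediction error with standard deviation `rho`; that parameter, the function name `ar_log_prior`, and the 2nd-order coefficients in the usage example are hypothetical and chosen only for the demonstration.

```python
import numpy as np

def ar_log_prior(s, a, rho=1.0):
    """Unnormalized log P(s) under an Rth-order AR model with known coefficients a.

    Sample s_i is predicted as sum_r a[r-1] * s_{i-r}; the prediction error is
    assumed Gaussian with standard deviation rho (an assumption of this sketch).
    """
    h = np.concatenate(([1.0], -np.asarray(a)))  # prediction-error filter
    residual = np.convolve(s, h, mode="valid")   # s_i - sum_r a_r s_{i-r}, for i >= R
    return -0.5 * np.sum(residual ** 2) / rho ** 2

# Usage: a signal generated by the AR model scores higher than a shuffled copy.
rng = np.random.default_rng(0)
a = np.array([1.6, -0.8])  # hypothetical stable 2nd-order coefficients
s = np.zeros(500)
for i in range(2, 500):
    s[i] = a[0] * s[i - 1] + a[1] * s[i - 2] + rng.standard_normal()
shuffled = rng.permutation(s)
```

Shuffling keeps the sample values but destroys the temporal structure the AR model expects, so `ar_log_prior(shuffled, a)` comes out much lower than `ar_log_prior(s, a)`.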
12 Making the Signal Explicit in the Model
Simplify m_k(s) by introducing the Fourier transform matrix, F
- Log probability is a quartic in the unknowns, s_i
- Derivative is a cubic in the unknowns
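One way to see the quartic claim is to write the joint log probability out under two assumptions that are not spelled out on the slide: Gaussian noise in both terms, and a likelihood that compares energy (squared-magnitude) spectra. The variances \sigma^2 and \rho^2 and the bin index f are notation introduced here for illustration.

```latex
\begin{aligned}
\log P(s, M) &= -\frac{1}{2\sigma^2} \sum_{k} \sum_{f}
   \Big( \big|(F s_k)_f\big|^2 - m_{k,f}^2 \Big)^2
   - \frac{1}{2\rho^2} \sum_{i} \Big( s_i - \sum_{r=1}^{R} a_r s_{i-r} \Big)^2
   + \text{const} \\
\big|(F s_k)_f\big|^2 &= \Big( \sum_{t} \cos(\omega_f t)\, s_{k,t} \Big)^2
   + \Big( \sum_{t} \sin(\omega_f t)\, s_{k,t} \Big)^2
\end{aligned}
```

Under these assumptions each energy |(F s_k)_f|^2 is quadratic in the samples, so its squared mismatch is quartic, the AR term is only quadratic, and differentiating with respect to any s_i gives a cubic.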
13 Inference - ICM
- Iterative Conditional Modes
  - Iteratively select a variable and assign the MAP estimate given the observations and the other variables
  - Guaranteed to increase P(S,M)
- Issues
  - Updates a single sample at a time
  - Prone to poor local optima
14 Inference (joint optimization)
- Directly search for max(log(P(S,M)))
- Jointly optimize s_1, s_2, ..., s_N using conjugate gradients; this involves computing the gradient of log P(S,M) with respect to s
- Algorithm makes global changes to the waveform
- Avoids inconsistent phase estimates (phase is implicit in the formulation)
- Guaranteed to reduce the discrepancy between the spectrogram of the estimated waveform and the observed spectrogram
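The joint optimization can be sketched with a small nonlinear conjugate-gradient routine. Everything specific here is an assumption of the example rather than the authors' implementation: the Gaussian noise scales `sigma` and `rho`, a likelihood on energy spectra over all FFT bins, the Polak-Ribiere update with a backtracking line search, and the toy AR coefficients.

```python
import numpy as np

def neg_log_joint(s, E, a, n, sigma=1.0, rho=1.0):
    """Return -log P(s, M) up to a constant, and its gradient with respect to s.

    E holds the observed energy spectrum (squared magnitudes), one row per frame.
    """
    hop, w = n // 2, np.hamming(n)
    value, grad = 0.0, np.zeros_like(s)
    for k in range(E.shape[0]):
        seg = slice(k * hop, k * hop + n)
        X = np.fft.fft(w * s[seg])
        d = np.abs(X) ** 2 - E[k]                    # energy mismatch, quadratic in s
        value += 0.5 * np.sum(d ** 2) / sigma ** 2   # quartic in s
        grad[seg] += 2.0 * w * np.real(np.fft.fft(d * np.conj(X))) / sigma ** 2
    h = np.concatenate(([1.0], -np.asarray(a)))      # AR prediction-error filter
    r = np.convolve(s, h, mode="valid")
    value += 0.5 * np.sum(r ** 2) / rho ** 2
    grad += np.convolve(r, h[::-1]) / rho ** 2
    return value, grad

def conjugate_gradient(f, s0, num_iters=100):
    """Polak-Ribiere nonlinear CG with a backtracking (Armijo) line search."""
    s = s0.copy()
    value, g = f(s)
    direction = -g
    history = [value]
    for _ in range(num_iters):
        if not np.any(g):
            break                                    # stationary point reached
        slope = g @ direction
        if slope >= 0:                               # not a descent direction: restart
            direction, slope = -g, -(g @ g)
        step = 1.0
        while step > 1e-20:
            new_value, new_g = f(s + step * direction)
            if new_value <= value + 1e-4 * step * slope:
                break
            step *= 0.5
        else:
            break                                    # no acceptable step found
        s = s + step * direction                     # every sample moves at once
        beta = max(0.0, new_g @ (new_g - g) / (g @ g))
        direction = -new_g + beta * direction
        value, g = new_value, new_g
        history.append(value)
    return s, history

# Usage: observe only the energies of a toy AR signal, then infer a waveform.
rng = np.random.default_rng(0)
n, a = 16, np.array([1.6, -0.8])                     # hypothetical AR coefficients
true_s = np.zeros(96)
for i in range(2, 96):
    true_s[i] = a[0] * true_s[i - 1] + a[1] * true_s[i - 2] + rng.standard_normal()
w = np.hamming(n)
E = np.stack([np.abs(np.fft.fft(w * true_s[k * 8 : k * 8 + n])) ** 2 for k in range(11)])
s0 = rng.standard_normal(96)
s_hat, history = conjugate_gradient(lambda s: neg_log_joint(s, E, a, n), s0, num_iters=50)
```

No phase variable appears anywhere in this sketch; the waveform itself is the unknown. The line search only accepts steps that lower the objective, so `history` never increases.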
15 Experiments
- Setup
  - Hamming window of length 256
  - Overlap: 128 samples
  - 12th-order auto-regressive model as prior
- Data: randomly chosen utterances from the NIST/WSJ database
- Evaluation
  - Perceptual quality of sound in the estimated signal. Audio demonstrations: http://www.psi.utoronto.ca/kannan/spectrogram/
  - SNR analysis
- Application: time scale modification
16 Results
[Figure: panels labeled original signal (input), Griffin-Lim, and our algorithm (CG)]
17 Results in dB gain
- Phase is not explicit in the model
- Use an approximation to SNR
Application: time scale modification (audio demonstration)
18 Variational Inference
- Current work: find the posterior distribution P(S|M)
- Exact inference is intractable!
- Mean field inference: approximate using a fully factored distribution
  - Goal: infer the mean and variance of every time sample
  - Minimize the KL divergence between Q(s) and P(S,M)
19 Mean Field Inference
- G, a function of the means µ and the variances, accounts for uncertainty in S. Estimates with high uncertainty don't influence other estimates.
- If we set the variances to 0, the first (entropy) and third terms vanish; this is equivalent to our earlier formulation
20 Conclusion
- Probabilistic model of time-domain signal and spectrogram
- Takes advantage of prior information, if available
- Joint optimization avoids poor local optima
- Can easily be extended to
  - other types of prior information
  - deal with missing or noisy spectrogram frames
21 References
- Griffin, D. W. and Lim, J. S. Signal estimation from modified short-time Fourier transform. IEEE Trans. on Acoustics, Speech and Signal Processing, 32(2), 1984.
- Roucos, S. and Wilgus, A. M. High Quality Time-Scale Modification for Speech. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, IEEE, 1985, 493-496.
- Green, P., Barker, J., Cooke, M. and Josifovski, L. Handling Missing and Unreliable Information in Speech Recognition. AISTATS 8, 2001.
- Frey, B. J., Kristjansson, T., Deng, L. and Acero, A. Learning dynamic noise models from noisy speech for robust speech recognition. Advances in Neural Information Processing Systems (NIPS), 2001.