Title: A Microphone Array Beamforming Approach to Blind Speech Separation
PASCAL Speech Separation Challenge II
- Iain McCowan, Ivan Himawan, Mike Lincoln.
- CSIRO, U Edin, QUT, IDIAP.
The Challenge
- To separate the speech of two talkers
simultaneously reading sentences from the Wall
Street Journal (WSJ) speech corpus.
- Recordings are made using 16 microphones in two
eight-element circular arrays on a table in the
centre of a reverberant meeting room.
- Knowledge of the room layout and speaker and
microphone placements may NOT be used UNLESS
derived automatically.
Possible Solutions
- Traditionally, two different approaches to
multi-channel speech enhancement:
  - Microphone Array Beamforming,
  - Blind Source Separation.
- Beamforming is more robust for ASR, but requires
knowledge of microphone and speaker locations.
Our Approach
- Automatically derive microphone and speaker
positions.
- Apply microphone array beamforming to separate
the speech.
Our System
- Array Shape Calibration.
- Speaker Localisation.
- Beamforming.
- Post-filtering.
- Speech Recognition.
1. Array Shape Calibration
- Consists of determining the relative positions of
array elements.
- Existing techniques rely on:
  - Good initial estimate of geometry.
  - Multiple explicit calibrating signals of known
    location.
  - Purpose-built devices, e.g. with speakers
    co-located with microphones.
- Tested on simulated data.
1. Array Shape Calibration
- A diffuse noise field is a good model for a
number of practical environments (e.g. office,
car).
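As an illustration of the diffuse noise model, the sketch below evaluates the standard theoretical coherence between two omnidirectional microphones in a spherically isotropic field, sin(2*pi*f*d/c) / (2*pi*f*d/c). The function name, the 343 m/s speed of sound, and the choice of the spherical (rather than cylindrical) model are assumptions made for this example, not details taken from the slides.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def diffuse_coherence(freqs_hz, distance_m, c=SPEED_OF_SOUND):
    """Theoretical coherence of a spherically isotropic noise field
    between two omnidirectional microphones `distance_m` apart."""
    # np.sinc(x) is sin(pi*x)/(pi*x), so passing 2*f*d/c yields
    # sin(2*pi*f*d/c) / (2*pi*f*d/c).
    return np.sinc(2.0 * np.asarray(freqs_hz, dtype=float) * distance_m / c)
```

The coherence is 1 at 0 Hz and first crosses zero at f = c / (2d), so closely spaced microphones stay coherent up to higher frequencies than widely spaced ones. That dependence on d is what makes the inter-microphone distance recoverable from measured noise.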
1. Array Shape Calibration
Input: Multi-channel Microphone Signals
a. Detect Noise Frames -> Measured Noise Coherence Matrix
b. Fit Model to Measured Coherence -> Inter-Microphone Distance Matrix
c. Multidimensional Scaling -> Microphone Position Vector
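Steps b and c can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the sinc coherence model for a spherically isotropic field, fits each distance by a plain grid search, and uses classical (eigendecomposition-based) multidimensional scaling. The function names, the search range, and the speed of sound are hypothetical choices.

```python
import numpy as np


def fit_distance(measured_coh, freqs_hz, c=343.0, candidates=None):
    """Step b (sketch): pick the distance whose sinc coherence curve
    has the smallest squared error against the measured coherence."""
    if candidates is None:
        # Hypothetical search range in metres.
        candidates = np.linspace(0.005, 1.0, 400)
    model = np.sinc(2.0 * np.outer(candidates, freqs_hz) / c)
    errors = np.sum((model - measured_coh) ** 2, axis=1)
    return candidates[np.argmin(errors)]


def classical_mds(dist, dims=2):
    """Step c (sketch): recover coordinates from a distance matrix,
    up to rotation, reflection and translation."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centred squared distances give the Gram matrix of the points.
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:dims]
    return eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))
```

Because the recovered geometry is only defined up to a rigid transform, the output is a relative array shape rather than absolute room coordinates.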
1. Array Shape Calibration
[1] I. McCowan, M. Lincoln, and I. Himawan,
Microphone Array Calibration in Diffuse Noise
Fields, submitted to IEEE TASLP, 2006.
1. Array Shape Calibration
- Global position calibration for all 16
microphones.
- K-means to cluster into localised sub-arrays.
- Increase K until cluster dimensions reach minimum
threshold (all elements < 5 cm apart).
- Re-calibrate positions for each sub-array.
- Ensure common coordinate system by aligning to
initial global estimates.
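The clustering step can be sketched as below: K is increased until every cluster is smaller than a given diameter. This is an illustrative sketch only. The plain Lloyd's-algorithm K-means with farthest-point initialisation, the function names, and the `max_diameter` parameter are assumptions for the example; the slides do not say how K-means was initialised or implemented.

```python
import numpy as np


def kmeans(points, k, iters=50):
    """Lloyd's algorithm with deterministic farthest-point initialisation."""
    centres = [points[0]]
    for _ in range(1, k):
        # Distance from each point to its nearest already-chosen centre.
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centres], axis=0)
        centres.append(points[np.argmax(d)])
    centres = np.array(centres, dtype=float)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None] - centres[None], axis=2)
        labels = np.argmin(dists, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = points[labels == j].mean(axis=0)
    return labels


def cluster_diameter(points):
    """Largest pairwise distance within one cluster."""
    return np.max(np.linalg.norm(points[:, None] - points[None], axis=2))


def find_subarrays(positions, max_diameter):
    """Increase K until every cluster is smaller than max_diameter."""
    for k in range(1, len(positions) + 1):
        labels = kmeans(positions, k)
        if all(cluster_diameter(positions[labels == j]) < max_diameter
               for j in np.unique(labels)):
            return labels
    return np.arange(len(positions))  # fallback: one cluster per microphone
```

For example, `find_subarrays(positions, max_diameter=0.25)` on two well-separated 20 cm circular arrays would stop at K = 2, with one label per array.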
2. Speaker Localisation
- Note: in this step we assume two stationary
speakers.
- For each sub-array:
  - Grid search over SRP-PHAT values.
  - Take 2 prominent sources for each file.
- Merge estimates across sub-arrays:
  - Take globally most confident source as first
    estimate.
  - Select second estimate as one with greatest
    azimuth angular separation from first (low
    likelihood of being the same speaker).
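The two stages above can be sketched as follows. The first function scans SRP-PHAT power over a grid of candidate azimuths for a single frame, and the second applies the merging rule (most confident source first, then the candidate farthest away in azimuth). This is a simplified sketch under stated assumptions: far-field sources, a 2-D azimuth-only grid, and a single frame. The function names and parameters are hypothetical, and the search space actually used is not given in the slides.

```python
import numpy as np


def srp_phat_azimuth(frames, mic_xy, fs, azimuths_deg, c=343.0):
    """SRP-PHAT power for far-field candidate azimuths.

    frames: (n_mics, n_samples) signals; mic_xy: (n_mics, 2) positions in metres.
    """
    spectra = np.fft.rfft(frames, axis=1)
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / fs)
    az = np.deg2rad(np.asarray(azimuths_deg, dtype=float))
    directions = np.stack([np.cos(az), np.sin(az)], axis=1)  # unit vectors towards source
    arrival = -(directions @ mic_xy.T) / c  # (n_az, n_mics) relative arrival times
    power = np.zeros(len(az))
    n_mics = frames.shape[0]
    for i in range(n_mics):
        for j in range(i + 1, n_mics):
            cross = spectra[i] * np.conj(spectra[j])
            cross = cross / (np.abs(cross) + 1e-12)  # PHAT weighting: keep phase only
            tdoa = arrival[:, i] - arrival[:, j]
            steer = np.exp(2j * np.pi * np.outer(tdoa, freqs))  # undo candidate delay
            power += np.real(steer @ cross)
    return power


def pick_two_sources(azimuths_deg, confidences):
    """First: most confident candidate. Second: farthest in azimuth from it."""
    az = np.asarray(azimuths_deg, dtype=float)
    first = int(np.argmax(confidences))
    # Wrap differences so that 350 and 10 degrees count as 20 degrees apart.
    separation = np.abs((az - az[first] + 180.0) % 360.0 - 180.0)
    second = int(np.argmax(separation))
    return az[first], az[second]
```

Choosing the second source by angular separation rather than by confidence trades some accuracy for a lower risk of reporting the same talker twice.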
3. Beamforming
- Filter-sum beamforming for spatial filtering of
signals.
- Superdirective filter weights:
  - Maximise gain in desired direction while
    minimising average gain over all other
    directions.
  - Shown to be robust beamformer in ASR
    applications.
  - Better gain than simple delay-sum, but less
    signal distortion than many adaptive techniques.
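A common textbook form of the superdirective beamformer computes, per frequency, w = G^-1 d / (d^H G^-1 d), where d is the steering vector and G is the diffuse-noise coherence matrix. The sketch below follows that form under stated assumptions: far-field steering in 2-D, the sinc coherence model, and a small diagonal loading term for numerical robustness. The function names, the loading value, and the speed of sound are hypothetical, and the slides do not give the exact formulation used.

```python
import numpy as np

C = 343.0  # assumed speed of sound in m/s


def steering_vector(mic_xy, azimuth_deg, freq_hz):
    """Far-field steering vector for a source at the given azimuth."""
    az = np.deg2rad(azimuth_deg)
    direction = np.array([np.cos(az), np.sin(az)])
    arrival = -(mic_xy @ direction) / C  # relative arrival time per microphone
    return np.exp(-2j * np.pi * freq_hz * arrival)


def diffuse_coherence_matrix(mic_xy, freq_hz, loading=1e-2):
    """Sinc coherence matrix of a diffuse field, with diagonal loading."""
    dists = np.linalg.norm(mic_xy[:, None] - mic_xy[None], axis=2)
    return np.sinc(2.0 * freq_hz * dists / C) + loading * np.eye(len(mic_xy))


def superdirective_weights(mic_xy, azimuth_deg, freq_hz, loading=1e-2):
    """Weights with unit gain towards the source and minimum diffuse-noise gain."""
    d = steering_vector(mic_xy, azimuth_deg, freq_hz)
    gamma = diffuse_coherence_matrix(mic_xy, freq_hz, loading)
    gamma_inv_d = np.linalg.solve(gamma, d)
    return gamma_inv_d / (d.conj() @ gamma_inv_d)
```

The beamformer output for one frequency bin is then `np.vdot(w, x)`, that is w^H x, for a vector `x` of microphone spectra. Larger `loading` values pull the weights towards delay-sum, which is less directive but more tolerant of position errors.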
4. Post-filtering
- Simple masking post-filter to separate speech.
- Motivation [2]:
  - For 2 speech signals combined additively, the log
    spectrum is well modelled as the maximum of the 2
    individual log spectra. This is due to sparsity
    of speech signal over frequency and time.
[2] S. T. Roweis, Factorial models and
refiltering for speech separation and denoising,
in Eurospeech, 2003, pp. 1009-1012.
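One way to turn the log-max observation into a masking post-filter is a binary mask: each time-frequency bin is assigned to whichever beamformer output is louder there. The sketch below assumes one beamformer steered at each talker and a hard 0/1 mask with an optional floor. The function name and the `floor` parameter are hypothetical, and the slides do not specify the exact mask that was used.

```python
import numpy as np


def masking_postfilter(beam_a, beam_b, floor=0.0):
    """Binary-mask post-filter (sketch).

    beam_a, beam_b: complex STFTs (freq x time) from beamformers steered
    at talker A and talker B. Each bin is kept only in the output whose
    beamformer has the larger magnitude there.
    """
    mask_a = np.abs(beam_a) >= np.abs(beam_b)
    gain_a = np.where(mask_a, 1.0, floor)
    gain_b = np.where(mask_a, floor, 1.0)
    return gain_a * beam_a, gain_b * beam_b
```

A non-zero `floor` (for example 0.1) attenuates rather than zeroes the rejected bins, which may reduce musical-noise artefacts at the cost of more residual cross-talk.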
5. Speech Recognition
- Provided evaluation recognition system.
- HTK, HMM/GMM, Tri-gram LM.
- Adaptation on Dev set to account for distant
microphones and processing.
Development Results
- Results on SSC2 Dev set.
- Evaluation measures:
  - Array calibration and speaker localisation:
    accuracy compared to known microphone and desired
    speaker locations.
  - Beamformer, Post-filter: Speech Recognition (WER).
- To avoid speaker-dependent adaptation, these dev
results were generated using K-fold cross-validation.
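For reference, word error rate (WER) is the word-level edit distance between the recogniser output and the reference transcript, divided by the number of reference words. The sketch below is a minimal illustration of that definition, not the challenge's scoring tool. The function name is hypothetical and it assumes a non-empty, whitespace-tokenised reference.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions.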
1. Array Calibration Accuracy
2. Speaker Localisation Accuracy
3. Speech Recognition
[Table: WER; columns: Both, Best]
Evaluation Results
[Table: WER; columns: Both, Best]
Conclusions
- This was a difficult challenge.
- Array processing yields major improvement over
single distant microphone.
- Results show that lapel-like performance is a
realistic target for ongoing research.
- Ways to improve on our baseline system:
  - Better speaker localisation.
  - Investigate other post-filter strategies.
  - More sophisticated ASR.