A Microphone Array Beamforming Approach to Blind Speech Separation

1
A Microphone Array Beamforming Approach to Blind
Speech Separation
PASCAL Speech Separation Challenge II

Iain McCowan, Ivan Himawan, Mike Lincoln.
CSIRO, U Edin, QUT, IDIAP.

2
The Challenge

To separate the speech of two talkers
simultaneously reading sentences from the Wall
Street Journal (WSJ) speech corpus.
Recordings are made using 16 microphones in two
eight-element circular arrays on a table in the
centre of a reverberant meeting room.
Knowledge of the room layout and speaker and
microphone placements may NOT be used UNLESS
derived automatically.

3
Possible Solutions

Traditionally, two different approaches to
multi-channel speech enhancement:
Microphone Array Beamforming,
Blind Source Separation.
Beamforming is more robust for ASR, but requires
knowledge of microphone and speaker locations.

4
Our Approach

Automatically derive microphone and speaker
positions.
Apply microphone array beamforming to separate
the speech.

5
Our System

Array Shape Calibration.
Speaker Localisation.
Beamforming.
Post-filtering.
Speech Recognition.

6
1. Array Shape Calibration

Consists of determining the relative positions of
array elements.
Existing techniques rely on:
Good initial estimate of geometry.
Multiple explicit calibrating signals of known
location.
Purpose-built devices, e.g. with speakers
co-located with microphones.
Tested on simulated data.

7
1. Array Shape Calibration

A diffuse noise field is a good model for a
number of practical environments (e.g. office,
car).
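
The usual closed form for a spherically isotropic (diffuse) field is a sinc function of frequency and microphone spacing. The sketch below evaluates it, assuming omnidirectional microphones and a speed of sound of 343 m/s. The function name and constants are illustrative and are not taken from the slides.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, an assumed room-temperature value


def diffuse_coherence(freqs_hz, distance_m, c=SPEED_OF_SOUND):
    """Theoretical coherence between two omnidirectional microphones
    `distance_m` apart in a spherically isotropic (diffuse) noise field."""
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    # np.sinc(t) is sin(pi*t)/(pi*t), so t = 2*f*d/c gives sin(x)/x
    # with x = 2*pi*f*d/c.
    return np.sinc(2.0 * freqs_hz * distance_m / c)


freqs = np.array([0.0, 500.0, 1000.0, 2000.0])
print(diffuse_coherence(freqs, 0.05))  # closely spaced pair
print(diffuse_coherence(freqs, 0.20))  # wider pair: coherence drops at lower frequencies
```

Because the curve depends only on the spacing, a measured noise coherence carries information about the distance between the two microphones.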

8
1. Array Shape Calibration
9
1. Array Shape Calibration
Input: Multi-channel Microphone Signals
a. Detect Noise Frames -> Measured Noise Coherence Matrix
b. Fit Model to Measured Coherence -> Inter-Microphone Distance Matrix
c. Multidimensional Scaling -> Microphone Position Vector
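
Steps b and c can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the sinc coherence model for a diffuse field, a simple grid-search least-squares fit (the slides do not say how the fit is done) and classical (Torgerson) multidimensional scaling. The names fit_distance and classical_mds are hypothetical.

```python
import numpy as np


def fit_distance(freqs_hz, measured, candidates_m, c=343.0):
    """Step b: choose the candidate spacing whose diffuse-field (sinc)
    coherence curve is closest to the measured one in the least-squares sense."""
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    measured = np.asarray(measured, dtype=float)
    errors = [np.sum((np.sinc(2.0 * freqs_hz * d / c) - measured) ** 2)
              for d in candidates_m]
    return float(candidates_m[int(np.argmin(errors))])


def classical_mds(dist, dims=2):
    """Step c: recover coordinates from a distance matrix. The result is
    only defined up to translation, rotation and reflection."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering  # double centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:dims]             # largest eigenvalues first
    return eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))


# Toy check on a 10 cm square of microphones with exact distances.
mics = np.array([[0.0, 0.0], [0.1, 0.0], [0.1, 0.1], [0.0, 0.1]])
true_dist = np.linalg.norm(mics[:, None] - mics[None], axis=-1)
estimate = classical_mds(true_dist)
```

In a full system, fit_distance would be run once per microphone pair to fill the distance matrix before calling classical_mds.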
10
1. Array Shape Calibration
[1] I. McCowan, M. Lincoln, and I. Himawan,
"Microphone Array Calibration in Diffuse Noise
Fields," submitted to IEEE TASLP, 2006.
11
1. Array Shape Calibration

Global position calibration for all 16
microphones.
K-means to cluster into localised sub-arrays.
Increase K until cluster dimensions reach minimum
threshold (all elements < 5 cm apart).
Re-calibrate positions for each sub-array.
Ensure common coordinate system by aligning to
initial global estimates.
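
The clustering loop and the final alignment could look roughly like this. It is a sketch under stated assumptions: plain Lloyd's K-means with a deterministic farthest-point initialisation (the slides do not specify one), and an orthogonal Procrustes fit for the alignment step, which is one common way to remove the rotation and reflection ambiguity of a re-calibrated sub-array. All function names are hypothetical.

```python
import numpy as np


def pairwise(points_a, points_b):
    return np.linalg.norm(points_a[:, None] - points_b[None], axis=-1)


def kmeans(points, k, iters=100):
    # Farthest-point initialisation: an assumption made for repeatability.
    centres = [points[0]]
    while len(centres) < k:
        nearest = pairwise(points, np.array(centres)).min(axis=1)
        centres.append(points[int(np.argmax(nearest))])
    centres = np.array(centres, dtype=float)
    for _ in range(iters):
        labels = pairwise(points, centres).argmin(axis=1)
        updated = np.array([points[labels == j].mean(axis=0)
                            if np.any(labels == j) else centres[j]
                            for j in range(k)])
        if np.allclose(updated, centres):
            break
        centres = updated
    return labels


def max_diameter(points, labels):
    """Largest distance between any two members of the same cluster."""
    return max(pairwise(points[labels == j], points[labels == j]).max()
               for j in np.unique(labels))


def cluster_subarrays(points, limit_m):
    """Increase K until every cluster is smaller than limit_m."""
    for k in range(1, len(points) + 1):
        labels = kmeans(points, k)
        if max_diameter(points, labels) < limit_m:
            return k, labels


def align(estimate, reference):
    """Rotate/reflect and translate `estimate` onto `reference`."""
    est_c = estimate - estimate.mean(axis=0)
    ref_c = reference - reference.mean(axis=0)
    u, _, vt = np.linalg.svd(est_c.T @ ref_c)
    return est_c @ (u @ vt) + reference.mean(axis=0)
```

For sub-array j, a call such as align(subarray_estimate, global_estimate[labels == j]) would place the re-calibrated positions back in the global coordinate system.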

12
2. Speaker Localisation

Note: in this step we assume two stationary
speakers.
For each sub-array:
Grid search over SRP-PHAT values.
Take the 2 most prominent sources for each file.
Merge estimates across sub-arrays:
Take globally most confident source as first
estimate.
Select second estimate as one with greatest
azimuth angular separation from first (low
likelihood of being the same speaker).
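
A compact sketch of both stages follows. It assumes a far-field source, an azimuth-only search grid and a single STFT frame; none of these details are given in the slides. The steered power is computed in the frequency domain with PHAT whitening, which matches the sum of pairwise GCC-PHAT values up to a constant. Peak picking for the 2 prominent sources is left out. Function names and the (azimuth, confidence) tuple layout are illustrative.

```python
import numpy as np


def srp_phat_azimuth(spectra, freqs_hz, mic_xy, grid_deg, c=343.0):
    """spectra: (mics, freqs) complex STFT frame; mic_xy: (mics, 2) in metres.
    Returns the PHAT-weighted steered response power for each grid azimuth."""
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    whitened = spectra / np.maximum(np.abs(spectra), 1e-12)  # PHAT weighting
    angles = np.deg2rad(np.asarray(grid_deg, dtype=float))
    directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (G, 2)
    # How much earlier (in seconds) each mic hears a far-field source.
    advance_s = directions @ mic_xy.T / c                            # (G, M)
    steering = np.exp(-2j * np.pi * advance_s[:, :, None]
                      * freqs_hz[None, None, :])                     # (G, M, F)
    steered = np.sum(steering * whitened[None, :, :], axis=1)        # (G, F)
    return np.sum(np.abs(steered) ** 2, axis=1)


def angular_separation(a_deg, b_deg):
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def merge_estimates(candidates):
    """candidates: list of (azimuth_deg, confidence) from all sub-arrays.
    First speaker: most confident candidate. Second speaker: candidate
    with the greatest azimuth separation from the first."""
    first = max(candidates, key=lambda cand: cand[1])
    second = max(candidates,
                 key=lambda cand: angular_separation(cand[0], first[0]))
    return first, second
```

Choosing the second source by separation rather than by confidence avoids picking a second peak that belongs to the same talker.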

13
3. Beamforming

Filter-sum beamforming for spatial filtering of
signals.
Superdirective filter weights.
Maximise gain in desired direction while
minimising average gain over all other
directions.
Shown to be a robust beamformer in ASR
applications.
Better gain than simple delay-sum, but less
signal distortion than many adaptive techniques.
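
One standard way to obtain superdirective weights is the MVDR solution with a diffuse-noise coherence matrix in place of a measured noise covariance. The sketch below computes weights for a single frequency under that formulation. The far-field steering vector, the 343 m/s speed of sound and the diagonal loading value are assumptions made here for illustration, and the function names are hypothetical.

```python
import numpy as np


def steering_vector(mic_pos, look_dir, freq_hz, c=343.0):
    """Far-field phase of a unit plane wave from look_dir at each mic."""
    look_dir = np.asarray(look_dir, dtype=float)
    look_dir = look_dir / np.linalg.norm(look_dir)
    return np.exp(2j * np.pi * freq_hz * (mic_pos @ look_dir) / c)


def diffuse_coherence_matrix(mic_pos, freq_hz, c=343.0):
    """sinc(2*f*d/c) coherence between every microphone pair."""
    dists = np.linalg.norm(mic_pos[:, None] - mic_pos[None], axis=-1)
    return np.sinc(2.0 * freq_hz * dists / c)


def superdirective_weights(mic_pos, look_dir, freq_hz, loading=1e-2):
    """w = G^-1 d / (d^H G^-1 d), with G the loaded diffuse coherence matrix.
    Unit gain towards look_dir, minimum output power for diffuse noise."""
    steer = steering_vector(mic_pos, look_dir, freq_hz)
    gamma = diffuse_coherence_matrix(mic_pos, freq_hz)
    gamma = gamma + loading * np.eye(len(mic_pos))  # keeps the inverse well behaved
    solved = np.linalg.solve(gamma, steer)
    return solved / (steer.conj() @ solved)


# Beamformer output for one frequency bin x (one value per mic): y = w^H x
# y = np.vdot(weights, x)
```

Larger loading values pull the weights towards plain delay-sum, trading directivity for robustness to microphone mismatch.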

14
4. Post-filtering

Simple masking post-filter to separate speech.
Motivation [2]:
For 2 speech signals combined additively, the log
spectrum is well modelled as the maximum of the 2
individual log spectra. This is due to the sparsity
of speech signals over frequency and time.

[2] S. T. Roweis, "Factorial models and
refiltering for speech separation and denoising,"
in Eurospeech, 2003, pp. 1009-1012.
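
The slides do not give the exact form of the mask. One simple reading of the log-max idea is a binary mask: in each time-frequency bin, keep only the beamformer output with the most energy and zero the other. The sketch below shows that reading as an assumption, with a hypothetical function name.

```python
import numpy as np


def max_mask_postfilter(beam_spectra):
    """beam_spectra: (speakers, frames, freqs) complex STFTs, one per
    beamformer output. Each time-frequency bin is kept only in the
    output where it has the highest power and zeroed elsewhere."""
    power = np.abs(beam_spectra) ** 2
    dominant = np.argmax(power, axis=0)  # (frames, freqs)
    masks = np.stack([dominant == s for s in range(beam_spectra.shape[0])])
    return beam_spectra * masks


# Two beamformer outputs, one frame, two frequency bins.
beams = np.array([[[3 + 0j, 1 + 0j]],
                  [[1 + 0j, 2 + 0j]]])
separated = max_mask_postfilter(beams)
```

Here the first output keeps only its first bin and the second output keeps only its second bin.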
15
5. Speech Recognition

Provided evaluation recognition system.
HTK, HMM/GMM, Tri-gram LM.
Adaptation on Dev set to account for distant
microphones and processing.

16
Development Results

Results on SSC2 Dev set.
Evaluation measures:
Array calibration and speaker localisation:
accuracy compared to known microphone and desired
speaker locations.
Beamformer, post-filter:
speech recognition (WER).
To avoid speaker-dependent adaptation, these dev
results were generated using K-fold
cross-validation.
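
For reference, word error rate (WER) is the word-level edit distance between the reference and recognised transcripts, divided by the number of reference words. The sketch below is a minimal version that ignores any text normalisation a real scoring tool would apply. The function name is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] is the edit distance between the processed part of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.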

17
1. Array Calibration Accuracy
18
2. Speaker Localisation Accuracy
19
3. Speech Recognition
[Table: WER; columns: Both, Best]
20
Evaluation Results
[Table: WER; columns: Both, Best]
21
Conclusions

This was a difficult challenge.
Array processing yields major improvement over
single distant microphone.
Results show that lapel-like performance is a
realistic target for ongoing research.
Ways to improve on our baseline system:
Better speaker localisation.
Investigate other post-filter strategies.
More sophisticated ASR.
