Transcript and Presenter's Notes

Title: Parham Aarabi


1
Parham Aarabi
Assistant Professor, Canada Research Chair in Multi-Sensor Information Systems, and Founder/Director of the Artificial Perception Lab
The Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto
2
Our Research Goal
Multi-Sensor Information Fusion for Human-Computer Interaction Applications. Examples include:
  • Multi-Microphone Sound Localization
  • Multi-Microphone Speech Separation
  • Audiovisual Speech Processing
  • Robotics
Why? To improve life for humans (e.g. speech recognition for cars or the disabled, more intelligent robotics, and intelligent environments/cars/homes).
3
The Artificial Perception Lab
9 graduate students, 1 postdoc, and 25 undergraduate researchers
4
Goal for today
Discuss central APL projects:
  • Sound Localization
  • Speech Separation/Enhancement (briefly)
Introduce other ongoing projects:
  • Acoustical Robotic Navigation
  • Audiovisual Sound Localization
5
Microphone Array
Cameras
6
(No Transcript)
7
Now we transition
Basic Sound Localization
8
Basic Sound Localization
Microphone arrays can be used to localize sound sources, since each source emanates a sound wave that arrives at each microphone at a different time and amplitude.
Applications:
  • smart rooms
  • automatic teleconferencing
  • robust speech recognition
  • robotics
  • other HCI-related applications
9
Basic Sound Localization
10
Basic Sound Localization
Knowledge of the TDOA between each microphone pair constrains the source location to a hyperbola in 2D, or a hyperboloid in 3D.
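As a formal sketch (notation assumed here, not from the slides): for microphones at positions m_i and m_j, speed of sound c, and measured TDOA τ_ij, the constraint is

$$\|y - m_i\| - \|y - m_j\| = c\,\tau_{ij},$$

which is precisely the definition of one branch of a hyperbola (in 2D) or hyperboloid (in 3D) with foci at the two microphones.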
11
TDOA-based Sound Localization
TDOA estimate
12
Basic Sound Localization
13
Sound Localization
  • Sound localization can be expressed as
  • F(y) is a Spatial Likelihood Function (SLF)
  • The most basic example: delay-and-sum beamforming (a.k.a. Steered Response Power, or SRP)
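The equations on this slide did not survive transcription; a standard formulation consistent with the text (notation assumed) is

$$\hat{y} = \arg\max_{y} F(y), \qquad F_{\mathrm{SRP}}(y) = \int \Big| \sum_{n=1}^{N} X_n(\omega)\, e^{j\omega \tau_n(y)} \Big|^2 \, d\omega,$$

where X_n(ω) is the spectrum of microphone n's signal and τ_n(y) is the propagation delay from candidate location y to microphone n.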

14
The simplest SLF generation technique is delay-and-sum energy scanning. Many more advanced techniques exist.
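A minimal Python sketch of delay-and-sum energy scanning (all function and parameter names are illustrative, not from the slides):

```python
import numpy as np

def delay_and_sum_slf(signals, mic_pos, grid, fs, c=343.0):
    # signals: (n_mics, n_samples) synchronized recordings
    # mic_pos: (n_mics, 2) microphone coordinates in metres
    # grid:    (n_points, 2) candidate source positions to scan
    n_mics, n_samples = signals.shape
    spectra = np.fft.rfft(signals, axis=1)                 # per-mic spectra X_n(w)
    omega = 2 * np.pi * np.fft.rfftfreq(n_samples, 1.0 / fs)
    slf = np.empty(len(grid))
    for i, y in enumerate(grid):
        tau = np.linalg.norm(mic_pos - y, axis=1) / c      # propagation delays tau_n(y)
        steered = spectra * np.exp(1j * omega * tau[:, None])  # compensate the delays
        slf[i] = np.sum(np.abs(steered.sum(axis=0)) ** 2)  # steered response power
    return slf                                             # one SLF value per grid point
```

The location estimate is then grid[np.argmax(slf)].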
15
Sound Localization
  • Filter-and-sum beamforming based (i.e. using Generalized Cross-Correlations [Knapp76])
  • The SRP-PHAT algorithm [Dibiase01] uses the Phase Transform
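For reference, the Generalized Cross-Correlation of [Knapp76] with the Phase Transform weighting takes the standard form

$$R_{ij}(\tau) = \int \frac{X_i(\omega)\, X_j^*(\omega)}{\left| X_i(\omega)\, X_j^*(\omega) \right|}\, e^{j\omega\tau}\, d\omega,$$

and SRP-PHAT scores a candidate location y by summing R_ij(τ_ij(y)) over all microphone pairs.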

16
Microphone Array
17
Sound Localization
  • A problem with SRP-PHAT is that all microphones are weighted equally
  • We should weight microphones according to their level-of-access [Aarabi01, Aarabi03, Mungamuru03]

18
Microphone Array
Microphone Arrays
19
Measuring the reliability of a microphone array
  • Three primary factors affect the reliability of a microphone:
  • Source Directivity
  • Microphone Directivity
  • Source-Microphone Distance

20
Enhanced Sound Localization
  • Since we are modeling directivities, it is now
    possible to extract source orientation
  • So, we now have Enhanced Sound Localization
  • The Spatial Likelihood Function is now also a function of the source orientation angle
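In that notation, localization becomes a joint search over position and orientation (a sketch, with the orientation symbol assumed):

$$(\hat{y}, \hat{\varphi}) = \arg\max_{y,\, \varphi}\; F(y, \varphi)$$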

21
Enhanced Sound Localization
  • Temporal ML Algorithm
  • Weighted SRP-PHAT Algorithm

22
Sound Localization Example Using 24 Mics.
23
Sound Localization Example Using 24 Mics.
24
Sound Localization Example Using 24 Mics.
(Colour scale: high likelihood to low likelihood)
25
Sound Localization Example Using 24 Mics.
26
More experiments with Stationary Speaker
27
Results: Stationary Speaker
28
Experimental Setup
29
Moving Speaker: Comparison with SRP-PHAT
(Panels: Weighted SRP-PHAT vs. SRP-PHAT)
30
Moving Speaker: Comparison with SRP-PHAT
Over 100 seconds (1000 frames of 100 ms each) of moving-speaker trials:
  • SRP-PHAT: 7.3% anomalies, 23 cm average error
  • Weighted SRP-PHAT: 3% anomalies, 20 cm average error
An anomaly is a localization whose distance error is greater than 1 m.
31
Now we transition from Sound Localization Algorithms to Sound Localization Hardware Implementation
32
Hardware based Speech Localization
  • Problems with current sound localization implementations:
  • Scalability (not appropriate for more than 10 microphones)
  • Power requirements (not good for mobile applications)
  • Space requirements (multiple chips, etc.)
  • As a result, we implemented a hardware-based 2-microphone sound localization (TDOA estimation) system in 0.18 µm CMOS

33
Hardware based Speech Localization
  • Initially implemented on FPGA [Nguyen, ICASSP03/ICME03]
  • Used the Phase Transform technique, as shown below
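The figure itself did not survive transcription; the standard Phase Transform TDOA estimator it refers to has the form

$$\hat{\tau} = \arg\max_{\tau} \int e^{j\left(\angle X_1(\omega) - \angle X_2(\omega)\right)}\, e^{j\omega\tau}\, d\omega,$$

i.e. GCC-PHAT evaluated between the two microphone signals.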

34
Solution
  • A full-custom ASIC solution is capable of:
  • 100% resource utilization
  • Efficient power utilization
  • Efficient scalability options

35
Chip Block Diagram
(Block diagram: DSP front-end and DSP core)
36
Maximum Likelihood Engine
  • The most computationally expensive part of the chip

37
The Result
38
Chip Testing
39
Chip Features
  • The 1.8 V core consumes 28.98 mW (10 times more efficient than our FPGA implementation, and 20-50 times more efficient than typical DSP implementations)
  • At 20 dB SNR, about 20% of the localizations resulted in anomalies, with a 2.2 degree average angle error in non-anomalous estimations
  • The next step is to combine speech localization and separation into a single VLSI chip, for Tablet PC/PDA/cell phone applications

40
Now we transition from Sound Localization Hardware Implementation to Speech Separation
41
Speech Separation Using Time-Frequency Masking [Aarabi, Fusion02; Shi, ICASSP03/ICME03; Aarabi, ICME03]
Question: How can we use knowledge about the location of the sound source in order to remove noise and unwanted background speakers?
42
Speech Separation Using Time-Frequency Masking
43
Speech Separation Using Time-Frequency Masking
(Spectrogram: frequency ω vs. time index k)
44
How do we process the noisy recordings to get back our signal of interest? Idea: scale each time-frequency (TF) block based on the phases in each recording.
(Spectrograms of the microphone 1 and microphone 2 recordings)
45
Time-Frequency (TF) speech representation
The magnitude spectrograms |X1k(ω)| and |X2k(ω)| are not the complete frequency-domain representation; we also have the two phase functions ∠X1k(ω) and ∠X2k(ω).
(Spectrogram: frequency ω vs. time index k)
46
Using the two phase functions, we obtain a TF
mask
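A minimal Python sketch of a phase-based TF mask in this spirit (the soft-mask form and the parameter gamma are assumptions, not the exact published mask):

```python
import numpy as np
from scipy.signal import stft, istft

def phase_based_tf_mask(x1, x2, tau, fs, gamma=1.0, nperseg=256):
    # x1, x2: the two microphone recordings; tau: target's TDOA in seconds
    f, _, X1 = stft(x1, fs=fs, nperseg=nperseg)
    _, _, X2 = stft(x2, fs=fs, nperseg=nperseg)
    omega = 2 * np.pi * f[:, None]                  # rad/s, broadcast over frames
    # wrapped error between the observed inter-mic phase difference and
    # the one the target's TDOA predicts
    err = np.angle(X1 * np.conj(X2) * np.exp(-1j * omega * tau))
    mask = 1.0 / (1.0 + gamma * np.abs(err))        # near 1 where phases agree
    _, s_hat = istft(X1 * mask, fs=fs, nperseg=nperseg)
    return s_hat                                    # masked (enhanced) signal
```

TF blocks whose inter-microphone phase difference matches ωτ (the target) pass through; blocks dominated by other sources are attenuated.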
47
Example
Result after applying the mask to the first microphone's signal
Original signal
48
Speech Recognition Results
TFM (Time-Frequency Masking) outperforms both DS (delay-and-sum) and SD (superdirective [Bitzer99]) beamforming at 0 dB SNR.
49
Multi-Microphone Probabilistic Speech Separation [Rennie, ICASSP03/ICME03]
The previous technique assumed no prior knowledge about the speech sources. Question: How can such prior knowledge be used, in conjunction with the spatial position of the source, to separate multiple speakers?
50
M Sources, N Mics.
(Diagram: sources s1(t), s2(t), …, sM(t) mixing into microphone signals x1(t), …, xN(t))
51
Multi-Microphone Probabilistic Speech Separation
Our approach:
1. Learn probabilistic models for each source
2. Estimate the original source signals by computing the most likely (or, alternatively, the expected) source signal given the prior speech model, the mixed microphone recordings, and the time-delays of arrival
This is an extension of the work of Deng, Kristjansson, Frey, and Acero, as well as others.
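In symbols (notation assumed): given the mixed recordings x, the TDOAs τ, and a prior speech model p(s), the estimate is

$$\hat{s} = \arg\max_{s}\; p(x \mid s, \tau)\, p(s),$$

or alternatively the posterior mean E[s | x, τ].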
52
Graphical model representation (for each
frequency)
(Graphical model: per-frequency source variables s and their dependencies)
53
Preliminary Results with 2 Microphones
  • Each microphone receives a mixture of 2, 3, or 4 speakers, i.e. 0 dB SNR between speakers.
  • In addition, the microphone signal is corrupted by independent white Gaussian noise at 20 dB, 10 dB, and 0 dB.

54
Now we transition from Speech Separation to Other Topics
55
Audiovisual Sound Localization [Aarabi01]
56
Acoustic Robot Navigation
57
Conclusions
  • The fusion of multiple sensors allows for more
    accurate sound localization and speech
    recognition.
  • Current research efforts include:
  • Audiovisual Speech Separation
  • Dynamic Camera and Microphone Arrays
  • Multi-Rate Multi-Microphone Signal Enhancement
  • Multi-Microphone Speaker Identification

58
Please visit
www.apl.utoronto.ca