Transcript and Presenter's Notes

Title: Parham Aarabi


1
Parham Aarabi
Assistant Professor, Canada Research Chair in Multi-Sensor Information Systems, and Founder/Director of the Artificial Perception Lab
The Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto
2
Our Research Goal
Multi-Sensor Information Fusion for Human-Computer Interaction Applications. Examples include:
  • Multi-Microphone Sound Localization
  • Multi-Microphone Speech Separation
  • Audiovisual Speech Processing
  • Robotics
Why? To improve life for humans (e.g. speech recognition for cars or the disabled, more intelligent robotics, and intelligent environments/cars/homes).
3
The Artificial Perception Lab
9 graduate students, 1 postdoc, and 25 undergraduate researchers
4
Goal for today
Discuss central APL projects:
  • Sound Localization
  • Speech Separation/Enhancement (briefly)
Introduce other ongoing projects:
  • Acoustical Robotic Navigation
  • Audiovisual Sound Localization
5
Microphone Array
Cameras
6
(No Transcript)
7
Now we transition
Basic Sound Localization
8
Basic Sound Localization
Microphone arrays can be used to localize sound sources, since each source emanates a sound wave that arrives at each microphone at a different time and amplitude.
Applications:
  • smart rooms
  • automatic teleconferencing
  • robust speech recognition
  • robotics
  • other HCI-related applications
9
Basic Sound Localization
10
Basic Sound Localization
Knowledge of the TDOA between each microphone pair constrains the source location to a hyperbola in 2D, or a hyperboloid in 3D.
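As a formal sketch (notation assumed here, not from the slides): for microphones at positions m_i and m_j, speed of sound c, and measured TDOA τ_ij, the constraint is

$$\|y - m_i\| - \|y - m_j\| = c\,\tau_{ij},$$

which is precisely the definition of one branch of a hyperbola (in 2D) or hyperboloid (in 3D) with foci at the two microphones.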
11
TDOA-based Sound Localization
TDOA estimate
12
Basic Sound Localization
13
Sound Localization
  • Sound localization can be expressed as
  • F(y) is a Spatial Likelihood Function (SLF)
  • The most basic example: delay-and-sum beamforming (a.k.a. Steered Response Power, or SRP)
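The equations on this slide did not survive transcription; a standard formulation consistent with the text (notation assumed) is

$$\hat{y} = \arg\max_{y} F(y), \qquad F_{\mathrm{SRP}}(y) = \int \Big| \sum_{n=1}^{N} X_n(\omega)\, e^{j\omega \tau_n(y)} \Big|^2 \, d\omega,$$

where X_n(ω) is the spectrum of microphone n's signal and τ_n(y) is the propagation delay from candidate location y to microphone n.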

14
The simplest SLF generation technique is delay-and-sum energy scanning. Many more advanced techniques exist.
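A minimal Python sketch of delay-and-sum energy scanning (all function and parameter names are illustrative, not from the slides):

```python
import numpy as np

def delay_and_sum_slf(signals, mic_pos, grid, fs, c=343.0):
    # signals: (n_mics, n_samples) synchronized recordings
    # mic_pos: (n_mics, 2) microphone coordinates in metres
    # grid:    (n_points, 2) candidate source positions to scan
    n_mics, n_samples = signals.shape
    spectra = np.fft.rfft(signals, axis=1)                 # per-mic spectra X_n(w)
    omega = 2 * np.pi * np.fft.rfftfreq(n_samples, 1.0 / fs)
    slf = np.empty(len(grid))
    for i, y in enumerate(grid):
        tau = np.linalg.norm(mic_pos - y, axis=1) / c      # propagation delays tau_n(y)
        steered = spectra * np.exp(1j * omega * tau[:, None])  # compensate the delays
        slf[i] = np.sum(np.abs(steered.sum(axis=0)) ** 2)  # steered response power
    return slf                                             # one SLF value per grid point
```

The location estimate is then grid[np.argmax(slf)].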
15
Sound Localization
  • Filter-and-sum beamforming based (i.e. using Generalized Cross-Correlations [Knapp76])
  • The SRP-PHAT algorithm [Dibiase01] uses the Phase Transform
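For reference, the Generalized Cross-Correlation of [Knapp76] with the Phase Transform weighting takes the standard form

$$R_{ij}(\tau) = \int \frac{X_i(\omega)\, X_j^*(\omega)}{\left| X_i(\omega)\, X_j^*(\omega) \right|}\, e^{j\omega\tau}\, d\omega,$$

and SRP-PHAT scores a candidate location y by summing R_ij(τ_ij(y)) over all microphone pairs.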

16
Microphone Array
17
Sound Localization
  • A problem with SRP-PHAT is that all microphones are weighted equally
  • We should weight microphones according to their level-of-access [Aarabi01, Aarabi03, Mungamuru03]

18
Microphone Array
Microphone Arrays
19
Measuring the reliability of a microphone array
  • Three primary factors affect the reliability of a microphone:
  • Source Directivity
  • Microphone Directivity
  • Source-Microphone Distance

20
Enhanced Sound Localization
  • Since we are modeling directivities, it is now
    possible to extract source orientation
  • So, we now have Enhanced Sound Localization
  • The Spatial Likelihood Function is now also a function of the source orientation angle
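In that notation, localization becomes a joint search over position and orientation (a sketch, with the orientation symbol assumed):

$$(\hat{y}, \hat{\varphi}) = \arg\max_{y,\, \varphi}\; F(y, \varphi)$$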

21
Enhanced Sound Localization
  • Temporal ML Algorithm
  • Weighted SRP-PHAT Algorithm

22
Sound Localization Example Using 24 Mics.
23
Sound Localization Example Using 24 Mics.
24
Sound Localization Example Using 24 Mics.
(Colour scale: high likelihood to low likelihood)
25
Sound Localization Example Using 24 Mics.
26
More experiments with Stationary Speaker
27
Results: Stationary Speaker
28
Experimental Setup
29
Moving Speaker: Comparison with SRP-PHAT
(Panels: Weighted SRP-PHAT vs. SRP-PHAT)
30
Moving Speaker: Comparison with SRP-PHAT
Over 100 seconds (1000 frames of 100 ms each) of moving-speaker trials:
  • SRP-PHAT: 7.3% anomalies, 23 cm average error
  • Weighted SRP-PHAT: 3% anomalies, 20 cm average error
An anomaly is a localization whose distance error is greater than 1 m.
31
Now we transition from Sound Localization Algorithms to Sound Localization Hardware Implementation
32
Hardware based Speech Localization
  • Problems with current sound localization implementations:
  • Scalability (not appropriate for more than 10 microphones)
  • Power requirements (not good for mobile applications)
  • Space requirements (multiple chips, etc.)
  • As a result, we implemented a hardware-based 2-microphone sound localization (TDOA estimation) system in 0.18 µm CMOS

33
Hardware based Speech Localization
  • Initially implemented on FPGA [Nguyen, ICASSP03/ICME03]
  • Used the Phase Transform technique, as shown below
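The figure itself did not survive transcription; the standard Phase Transform TDOA estimator it refers to has the form

$$\hat{\tau} = \arg\max_{\tau} \int e^{j\left(\angle X_1(\omega) - \angle X_2(\omega)\right)}\, e^{j\omega\tau}\, d\omega,$$

i.e. GCC-PHAT evaluated between the two microphone signals.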

34
Solution
  • A full-custom ASIC solution is capable of:
  • 100% resource utilization
  • Efficient power utilization
  • Efficient scalability options

35
Chip Block Diagram
(Block diagram: DSP front-end and DSP core)
36
Maximum Likelihood Engine
  • The most computationally expensive part of the chip

37
The Result
38
Chip Testing
39
Chip Features
  • The 1.8 V core consumes 28.98 mW (10 times more efficient than our FPGA implementation, and 20-50 times more efficient than typical DSP implementations)
  • At 20 dB SNR, about 20% of the localizations resulted in anomalies, with a 2.2 degree average angle error in non-anomalous estimations
  • The next step is to combine speech localization and separation into a single VLSI chip, for Tablet PC/PDA/cell phone applications

40
Now we transition from Sound Localization Hardware Implementation to Speech Separation
41
Speech Separation Using Time-Frequency Masking [Aarabi, Fusion02; Shi, ICASSP03/ICME03; Aarabi, ICME03]
Question: How can we use knowledge about the location of the sound source in order to remove noise and unwanted background speakers?
42
Speech Separation Using Time-Frequency Masking
43
Speech Separation Using Time-Frequency Masking
(Spectrogram: frequency ω vs. time index k)
44
How do we process the noisy recordings to get back our signal of interest? Idea: scale each time-frequency (TF) block based on the phases in each recording.
(Spectrograms of the microphone 1 and microphone 2 recordings)
45
Time-Frequency (TF) speech representation
The magnitude spectrograms |X1k(ω)| and |X2k(ω)| are not the complete frequency-domain representation; we also have the two phase functions ∠X1k(ω) and ∠X2k(ω).
(Spectrogram: frequency ω vs. time index k)
46
Using the two phase functions, we obtain a TF
mask
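A minimal Python sketch of a phase-based TF mask in this spirit (the soft-mask form and the parameter gamma are assumptions, not the exact published mask):

```python
import numpy as np
from scipy.signal import stft, istft

def phase_based_tf_mask(x1, x2, tau, fs, gamma=1.0, nperseg=256):
    # x1, x2: the two microphone recordings; tau: target's TDOA in seconds
    f, _, X1 = stft(x1, fs=fs, nperseg=nperseg)
    _, _, X2 = stft(x2, fs=fs, nperseg=nperseg)
    omega = 2 * np.pi * f[:, None]                  # rad/s, broadcast over frames
    # wrapped error between the observed inter-mic phase difference and
    # the one the target's TDOA predicts
    err = np.angle(X1 * np.conj(X2) * np.exp(-1j * omega * tau))
    mask = 1.0 / (1.0 + gamma * np.abs(err))        # near 1 where phases agree
    _, s_hat = istft(X1 * mask, fs=fs, nperseg=nperseg)
    return s_hat                                    # masked (enhanced) signal
```

TF blocks whose inter-microphone phase difference matches ωτ (the target) pass through; blocks dominated by other sources are attenuated.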
47
Example
Result after applying the mask to the first microphone's signal
Original signal
48
Speech Recognition Results
TFM (Time-Frequency Masking) outperforms both DS (delay-and-sum) and SD (superdirective [Bitzer99]) beamforming at 0 dB SNR.
49
Multi-Microphone Probabilistic Speech Separation [Rennie, ICASSP03/ICME03]
The previous technique assumed no prior knowledge about the speech sources. Question: How can such prior knowledge be used, in conjunction with the spatial position of the source, to separate multiple speakers?
50
M Sources, N Mics.
(Diagram: sources s1(t), s2(t), …, sM(t) mixing into microphone signals x1(t), …, xN(t))
51
Multi-Microphone Probabilistic Speech Separation
Our approach:
1. Learn probabilistic models for each source
2. Estimate the original source signals by computing the most likely (or, alternatively, the expected) source signal given the prior speech model, the mixed microphone recordings, and the time-delays of arrival
This is an extension of the work of Deng, Kristjansson, Frey, and Acero, as well as others.
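In symbols (notation assumed): given the mixed recordings x, the TDOAs τ, and a prior speech model p(s), the estimate is

$$\hat{s} = \arg\max_{s}\; p(x \mid s, \tau)\, p(s),$$

or alternatively the posterior mean E[s | x, τ].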
52
Graphical model representation (for each
frequency)
(Graphical model: per-frequency source variables s and their dependencies)
53
Preliminary Results with 2 Microphones
  • Each microphone receives a mixture of 2, 3, or 4 speakers, i.e. 0 dB SNR between speakers.
  • In addition, the microphone signal is corrupted by independent white Gaussian noise at 20 dB, 10 dB, and 0 dB.

54
Now we transition from Speech Separation to Other Topics
55
Audiovisual Sound Localization [Aarabi01]
56
Acoustic Robot Navigation
57
Conclusions
  • The fusion of multiple sensors allows for more
    accurate sound localization and speech
    recognition.
  • Current research efforts include:
  • Audiovisual Speech Separation
  • Dynamic Camera and Microphone Arrays
  • Multi-Rate Multi-Microphone Signal Enhancement
  • Multi-Microphone Speaker Identification

58
Please visit
www.apl.utoronto.ca