1
Bayesian Machine Learning for Signal Processing
  • Hagai T. Attias
  • Golden Metallic, Inc.
  • San Francisco, CA
  • Tutorial
  • 6th International Conference on Independent Component Analysis and Blind Source Separation, Charleston, SC, March 2006

2
ICA / BSS is 15 Years Old
  • First pair of papers: Comon, Jutten & Herault, Signal Processing, 1991
  • First papers on a statistical machine learning approach to ICA/BSS: Bell & Sejnowski 1995; Cardoso 1996; Pearlmutter & Parra 1997
  • First conference on ICA/BSS: Helsinki, 2000
  • Lesson drawn by many: ICA is a cool problem. Let's find many approaches to it and many places where it's useful.
  • Lesson drawn by some: statistical machine learning is a cool framework. Let's use it to transform adaptive signal processing. ICA is a good start.

3
Noise Cancellation
4
From Noise Cancellation to ICA
[Figure: diagram contrasting noise cancellation and ICA, with labels Background, Microphone, TV, TV, ICA]
5
Noise Cancellation Derivation
  • y = sensors, x = sources, n = time point
  • y1(n) = x1(n) + w x2(n)
  • y2(n) = x2(n)
  • Joint probability distribution of observed sensor data:
  • p(y) = px(x1 = y1 - w y2, x2 = y2)
  • Assume the sources are independent, identically distributed Gaussians, with mean 0 and precisions v1, v2
  • Observed data likelihood:
  • L = log p(y) = -0.5 v1 (y1 - w y2)^2 + const.
  • dL/dw = 0 → linear equation for w (a small numerical sketch follows)
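A minimal numerical sketch of this derivation, with synthetic signals and an invented leakage coefficient w_true: setting dL/dw = 0 gives the least-squares estimate for w.

    import numpy as np

    # Noise-cancellation sketch; signals and w_true are illustrative only.
    rng = np.random.default_rng(0)
    n = 1000
    x1 = rng.normal(size=n)                      # signal of interest
    x2 = rng.normal(size=n)                      # background (TV) source
    w_true = 0.7
    y1 = x1 + w_true * x2                        # primary sensor
    y2 = x2                                      # reference sensor

    w_hat = np.sum(y1 * y2) / np.sum(y2 ** 2)    # solution of dL/dw = 0
    x1_hat = y1 - w_hat * y2                     # cancelled output
    print(round(w_hat, 3))                       # close to 0.7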

6
Noise Cancellation → ICA Derivation
  • y = sensors, x = sources, n = time point
  • y(n) = A x(n) , A = square mixing matrix
  • x(n) = G y(n) , G = square unmixing matrix
  • Probability distribution of observed sensor data:
  • p(y) = |G| px(G y)
  • Assume the sources are i.i.d. non-Gaussians
  • Observed data likelihood:
  • L = log p(y) = log |G| + log p(x1) + log p(x2)
  • dL/dG = 0 → non-linear equation for G (see the sketch below)
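As a hedged illustration of this maximum-likelihood view, the sketch below runs the natural-gradient update G ← G + ε (I - φ(x) xᵀ) G referenced on slide 17, with a fixed tanh score function standing in for the source model (an assumption; the tutorial's own algorithm learns MoG sources). Data and mixing matrix are synthetic.

    import numpy as np

    # Natural-gradient maximum-likelihood ICA sketch.
    rng = np.random.default_rng(3)
    n = 5000
    S = np.vstack([rng.laplace(size=n), rng.laplace(size=n)])   # non-Gaussian sources
    A = np.array([[1.0, 0.6], [0.4, 1.0]])                      # square mixing matrix
    Y = A @ S

    G = np.eye(2)                                               # unmixing matrix
    eps = 1e-3
    for it in range(2000):
        X = G @ Y                                               # current source estimates
        phi = np.tanh(X)                                        # assumed score function
        G += eps * (np.eye(2) - (phi @ X.T) / n) @ G            # natural gradient step

    print(np.round(G @ A, 2))   # approximately a scaled permutation matrix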

7
Sensor Noise and Hidden Variables
  • y = sensors, x = sources, u = noise, n = time point
  • y(n) = A x(n) + u(n)
  • x are now hidden variables: even if A is known, one cannot obtain x exactly from y
  • However, one can compute the posterior probability of x conditioned on y:
  • p(x|y) = p(y|x) p(x) / p(y)
  • where p(y|x) = pu(y - A x)
  • To learn A from data, one must use an expectation-maximization (EM) algorithm (and often approximate it)

8
Probabilistic Graphical Models
  • Model the distribution of observed data
  • Graph structure determines the probabilistic
    dependence between variables
  • We focus on DAGs (directed acyclic graphs)
  • Node = variable
  • Arrow = probabilistic dependence

[Figure: a single node x with p(x); a two-node graph x → y with p(y,x) = p(y|x) p(x)]
9
Linear Classification
  • c = class label (discrete, multinomial)
  • y = data (continuous, Gaussian)
  • p(c) = πc , p(y|c) = N(y | µc, Λc)
  • Training set: pairs (y, c)
  • Learn parameters by maximum likelihood:
  • L = log p(y,c) = log p(y|c) + log p(c)
  • Test set: given y, classify using p(c|y) = p(y,c) / p(y) (a small sketch follows the figure note below)

[Figure: graph c → y, p(y,c) = p(y|c) p(c)]
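A minimal sketch of this classifier on synthetic two-class data; the use of full class covariances is an assumption, since the slide does not specify the form of Λc.

    import numpy as np
    from scipy.stats import multivariate_normal

    # Fit class priors, means and covariances by maximum likelihood, then
    # classify a test point with Bayes' rule p(c|y) ∝ p(y|c) p(c).
    rng = np.random.default_rng(2)
    Y = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
    c = np.repeat([0, 1], 100)                                   # class labels

    classes = np.unique(c)
    pi = np.array([np.mean(c == k) for k in classes])            # p(c)
    mu = np.array([Y[c == k].mean(axis=0) for k in classes])     # class means
    cov = np.array([np.cov(Y[c == k].T) for k in classes])       # class covariances

    def classify(y):
        scores = [np.log(pi[k]) + multivariate_normal.logpdf(y, mu[k], cov[k])
                  for k in classes]
        return classes[int(np.argmax(scores))]

    print(classify(np.array([2.5, 2.5])))                        # expected: 1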
10
Linear Regression
  • x = predictor (continuous, Gaussian)
  • y = dependent variable (continuous, Gaussian)
  • p(x) = N(x | µ, Γ) , p(y|x) = N(y | Ax, Λ)
  • Training set: pairs (y, x)
  • Learn parameters by maximum likelihood:
  • L = log p(y,x) = log p(y|x) + log p(x)
  • Test set: given x, predict using p(y|x)

[Figure: graph x → y, p(y,x) = p(y|x) p(x)]
11
Clustering
  • c = class label (discrete, multinomial)
  • y = data (continuous, Gaussian)
  • p(c) = πc , p(y|c) = N(y | µc, Λc)
  • Training set: y only
  • p(y) is a mixture of Gaussians (MoG)
  • Learn parameters by expectation maximization (EM); a minimal sketch follows the figure note below
  • Test set: given y, cluster using p(c|y) = p(y,c) / p(y)
  • Limit of zero variance: vector quantization (VQ)

[Figure: graph c → y, p(y,c) = p(y|c) p(c), but now c is hidden]
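A minimal EM sketch for this MoG clustering model, with diagonal covariances and synthetic data; all names are illustrative.

    import numpy as np

    # EM for a mixture of Gaussians with diagonal covariances.
    rng = np.random.default_rng(1)
    Y = np.vstack([rng.normal(-2, 0.5, size=(200, 2)),
                   rng.normal(+2, 0.5, size=(200, 2))])    # two clusters
    N, D = Y.shape
    C = 2                                                  # number of components
    pi = np.full(C, 1.0 / C)                               # p(c)
    mu = Y[rng.choice(N, C, replace=False)]                # class means
    var = np.ones((C, D))                                  # diagonal variances

    for it in range(50):
        # E-step: responsibilities p(c | y) for every data point
        logp = (np.log(pi)
                - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1)
                - 0.5 * np.sum((Y[:, None, :] - mu) ** 2 / var, axis=2))
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)                  # N x C

        # M-step: re-estimate pi, mu, var from the responsibilities
        Nc = r.sum(axis=0)
        pi = Nc / N
        mu = (r.T @ Y) / Nc[:, None]
        var = (r.T @ (Y ** 2)) / Nc[:, None] - mu ** 2 + 1e-6

    print(np.round(mu, 2))   # recovered cluster centres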
12
Factor Analysis
  • x = factors (continuous, Gaussian)
  • y = data (continuous, Gaussian)
  • p(x) = N(x | 0, I) , p(y|x) = N(y | Ax, Λ)
  • Training set: y only
  • p(y) is Gaussian with covariance A Aᵀ + Λ⁻¹
  • Learn parameters by expectation maximization (EM)
  • Test set: given y, obtain the factors from p(x|y) = p(y,x) / p(y); a short sketch of this posterior follows below
  • Limit of zero noise: principal component analysis (PCA)

[Figure: graph x → y, p(y,x) = p(y|x) p(x), but now x is hidden]
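A short sketch of the factor posterior used above: with p(x) = N(0, I) and p(y|x) = N(Ax, Λ), the posterior p(x|y) is Gaussian with covariance (I + AᵀΛA)⁻¹ and mean Σ AᵀΛ y. A diagonal Λ and synthetic numbers are assumed here.

    import numpy as np

    # Exact FA posterior over the factors for one observation y.
    rng = np.random.default_rng(4)
    D, K = 5, 2
    A = rng.normal(size=(D, K))
    lam = np.full(D, 10.0)                                        # diagonal noise precisions
    y = A @ rng.normal(size=K) + rng.normal(size=D) / np.sqrt(lam)

    Sigma = np.linalg.inv(np.eye(K) + A.T @ (lam[:, None] * A))   # posterior covariance
    x_mean = Sigma @ A.T @ (lam * y)                              # posterior mean
    print(np.round(x_mean, 2))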
13
Dynamical Models
Hidden Markov Model (Baum-Welch)
State Space Model (Kalman Smoothing)
Switching State Space Model (Intractable)
14
Probabilistic Inference
Factor analysis model: p(x) = N(x | 0, I), p(y | x, A, Λ) = N(y | Ax, Λ)
  • Nodes inside the frame: variables, vary in time
  • Nodes outside the frame: parameters, constant in time
  • Parameters have prior distributions p(A), p(Λ)
  • Bayesian inference: compute the full posterior distribution p(x, A, Λ | y) over all hidden nodes conditioned on observed nodes
  • Bayes rule: p(x, A, Λ | y) = p(y | x, A, Λ) p(x) p(A) p(Λ) / p(y)
  • In hidden variable models, the joint posterior can generally not be computed exactly; the normalization factor p(y) is intractable

[Figure: factor analysis graphical model; parameter nodes A and Λ outside the frame, variable nodes x → y inside the frame]
15
MAP and Maximum Likelihood
  • MAP (maximum a posteriori): consider only the parameter values that maximize the posterior p(x, A, Λ | y)
  • This is the maximum likelihood method:
  • compute A, Λ that maximize L = log p(y | A, Λ)
  • However, in hidden variable models L is a complicated function of the parameters; direct maximization would require gradient-based techniques, which are slow
  • Solution: the EM algorithm
  • Iterative algorithm; each iteration has an E-step and an M-step
  • E-step: compute the posterior over the hidden variables, p(x|y)
  • M-step: maximize the complete data likelihood E[log p(y, x, A, Λ)] w.r.t. the parameters A, Λ; E = posterior average over x

16
Derivation of the EM Algorithm
  • Instead of the likelihood L = log p(y), consider
  • F(q) = E[log p(y,x)] - E[log q(x|y)]
  • where q(x|y) is a trial posterior and E = average over x w.r.t. q
  • One can show F(q) = L - KL[ q(x|y) || p(x|y) ] ≤ L
  • Hence F is upper bounded by L, and F = L when q = true posterior
  • EM performs an alternating maximization of F
  • The E-step maximizes F w.r.t. the posterior q
  • The M-step maximizes F w.r.t. the parameters A, Λ
  • Hence EM performs maximum likelihood (the identity is spelled out below)
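For completeness, the identity behind the last few bullets, using p(y,x) = p(x|y) p(y):

    \begin{aligned}
    F(q) &= \mathbb{E}_q[\log p(y,x)] - \mathbb{E}_q[\log q(x\mid y)] \\
         &= \mathbb{E}_q[\log p(x\mid y)] + \log p(y) - \mathbb{E}_q[\log q(x\mid y)] \\
         &= \log p(y) - \mathrm{KL}\!\left(q(x\mid y)\,\|\,p(x\mid y)\right) \;\le\; \log p(y) = L .
    \end{aligned}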

17
ICA by EM: MoG Sources
  • Each source distribution p(x) is a 1-dim mixture of Gaussians
  • The Gaussian labels s are hidden variables
  • The data y = A x, hence x = G y, are not hidden
  • Likelihood: L = log |G| + log p(x)
  • F(q) = log |G| + E[log p(x,s)] - E[log q(s|y)]
  • E-step: q(s|y) = p(x,s) / z
  • M-step: G ← G + ε (I - φ(x) xᵀ) G (natural gradient)
  • φ(x) is linear in x and q
  • Can also learn the source parameters MoG1, MoG2 at the M-step

18
Noisy, Non-Square ICA: Independent Factor Analysis
  • The Gaussian labels s are hidden variables
  • The data y = A x + u, hence x are also hidden
  • p(y|x) = N(y | Ax, Λ)
  • Likelihood: L = log p(y); must marginalize over x, s
  • F(q) = E[log p(y,x,s)] - E[log q(x,s|y)]
  • E-step: q(x,s|y) = q(x|s,y) q(s|y)
  • M-step: linear equations for A, Λ
  • Can also learn the source parameters MoG1, MoG2 at the M-step
  • Convergence problems in low noise

19
Intractability of Inference
  • In many models of interest the E-step is computationally intractable
  • Switching state space model: the posterior over the discrete state, p(s|y), is exponential in time
  • Independent factor analysis: the posterior over the Gaussian labels is exponential in the number of sources
  • Approximations must be made
  • MAP approximation: consider only the most likely state configuration(s)
  • Markov chain Monte Carlo: convergence may be quite slow and hard to determine

20
Variational EM
  • Idea: use an approximate posterior which has a factorized form
  • Example: switching state space model; factorize the continuous states from the discrete states
  • p(x,s|y) ≈ q(x,s|y) = q(x|y) q(s|y)
  • make no other assumptions (e.g., on functional forms)
  • To derive, consider F(q) from the derivation of EM:
  • F(q) = E[log p(y,x,s)] - E[log q(x|y)] - E[log q(s|y)]
  • E performs posterior averaging w.r.t. q
  • Maximize F alternately w.r.t. q(x|y) and q(s|y):
  • q(x|y) = exp( E_s[log p(y,x,s)] ) / z_s
  • q(s|y) = exp( E_x[log p(y,x,s)] ) / z_x
  • This adds an internal loop to the E-step; the M-step is unchanged (the fixed-point form is written out below)
  • Convergence is guaranteed since F(q) is upper bounded by L
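Holding q(s|y) fixed and maximizing F over q(x|y), and vice versa, gives the standard mean-field fixed-point equations, which is what the two updates above express:

    \begin{aligned}
    q(x\mid y) &\propto \exp\!\big( \mathbb{E}_{q(s\mid y)}[\log p(y,x,s)] \big), \\
    q(s\mid y) &\propto \exp\!\big( \mathbb{E}_{q(x\mid y)}[\log p(y,x,s)] \big).
    \end{aligned}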

21
Switching Model: 2 Variational Approximations
[Figure: the model and two variational approximations, over discrete states s(1), s(2), s(3) and continuous states x(1), x(2), x(3)]
  • Approximation I (Baum-Welch, Kalman): Gaussian, smoothed
  • Approximation II (Baum-Welch, MoG): multimodal, not smoothed
22
IFA: 2 Variational Approximations
[Figure: the IFA model and two variational approximations]
  • Approximation I: source posterior is Gaussian, correlated
  • Approximation II: source posterior is multimodal, independent
23
Model Order Selection
  • How does one determine the optimal number of factors in FA?
  • Maximum likelihood would always prefer more complex models, since they fit the data better; but they overfit
  • The probabilistic inference approach: place a prior p(A) over the model parameters, and consider the marginal likelihood
  • L = log p(y) = E[log p(y,A)] - E[log p(A|y)]
  • Compute L for each number of factors; choose the number that maximizes L
  • An alternative approach: place a prior p(A) assuming a maximum number of factors. The prior has a hyperparameter for each column of A: its precision α. Optimize the precisions by maximizing L. Unnecessary columns will have α → infinity (a common form of this update is sketched below)
  • Both approaches require computing the parameter posterior p(A|y), which is usually intractable
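For reference, one common form of this precision update (an assumption here; the slide does not give the formula) is obtained by maximizing the bound over each α_j, with D the number of sensors and a_j the j-th column of A, the average taken under the (approximate) posterior over A:

    \alpha_j \;=\; \frac{D}{\big\langle a_j^{\top} a_j \big\rangle} .

Columns the data do not need have ⟨a_jᵀ a_j⟩ → 0, hence α_j → ∞, and the column is effectively pruned.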

24
Variational Bayesian EM
  • Idea: use an approximate posterior which factorizes the parameters from the hidden variables
  • Example: factor analysis
  • p(x,A|y) ≈ q(x,A|y) = q(x|y) q(A|y)
  • make no other assumptions (e.g., on functional forms)
  • To derive, consider F(q) from the derivation of EM:
  • F(q) = E[log p(y,x,A)] - E[log q(x|y)] - E[log q(A|y)]
  • E performs posterior averaging w.r.t. q
  • Maximize F alternately w.r.t. q(x|y) and q(A|y):
  • E-step: q(x|y) = exp( E_A[log p(y,x,A)] ) / z_A
  • M-step: q(A|y) = exp( E_x[log p(y,x,A)] ) / z_x
  • Plus, maximize F w.r.t. the noise precision Λ and the hyperparameters α (MAP approximation); a runnable sketch follows below
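A runnable sketch of this VB-EM scheme for factor analysis with ARD column precisions, under simplifying assumptions not stated on the slide (isotropic noise precision, a shared row covariance for q(A), point estimates for the noise precision and hyperparameters); all variable names and data are illustrative.

    import numpy as np

    # Variational Bayesian factor analysis sketch; synthetic data with 2 true factors.
    rng = np.random.default_rng(0)
    D, K, N = 10, 4, 500                          # sensors, max factors, samples
    A_true = rng.normal(size=(D, 2))
    Y = A_true @ rng.normal(size=(2, N)) + 0.1 * rng.normal(size=(D, N))

    A_mean = 0.1 * rng.normal(size=(D, K))        # <A> under q(A)
    Sigma_A = np.eye(K)                           # shared row covariance of q(A)
    alpha = np.ones(K)                            # ARD hyperparameters
    lam = 1.0                                     # noise precision

    for it in range(200):
        # E-step: q(x_n) = N(m_n, Sigma_x)
        AtA = A_mean.T @ A_mean + D * Sigma_A              # <A^T A>
        Sigma_x = np.linalg.inv(np.eye(K) + lam * AtA)
        M = lam * Sigma_x @ A_mean.T @ Y                   # posterior means (K x N)
        Sxx = N * Sigma_x + M @ M.T                        # sum_n <x_n x_n^T>

        # M-step: q(A), Gaussian by conjugacy
        Sigma_A = np.linalg.inv(np.diag(alpha) + lam * Sxx)
        A_mean = lam * (Y @ M.T) @ Sigma_A

        # Hyperparameter and noise-precision updates (MAP)
        alpha = D / (np.sum(A_mean ** 2, axis=0) + D * np.diag(Sigma_A))
        AtA = A_mean.T @ A_mean + D * Sigma_A
        resid = (np.sum(Y ** 2) - 2 * np.sum(Y * (A_mean @ M))
                 + np.trace(AtA @ Sxx))
        lam = N * D / resid

    # Columns with very large alpha are effectively switched off: the model
    # infers that only about 2 factors are needed.
    print(np.round(alpha, 1))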

25
VB Approximation for IFA
[Figure: the IFA model and its VB approximation, with source label nodes s1, s2, factor nodes x1, x2, and mixing matrix node A]
26
Conjugate Priors
  • Which form should one choose for prior distributions?
  • Conjugate prior idea: choose a prior such that the resulting posterior distribution has the same functional form as the prior
  • Single Gaussian: the posterior over the mean is
  • p(µ|y) = p(y|µ) p(µ) / p(y)
  • conjugate prior is Gaussian (an example is worked below)
  • Single Gaussian: the posterior over the mean and precision is
  • p(µ,Λ|y) ∝ p(y|µ,Λ) p(µ,Λ)
  • conjugate prior is Normal-Wishart
  • Factor analysis: the VB posterior over the mixing matrix is
  • q(A|y) = exp( E_x[log p(y,x|A)] ) p(A) / z
  • conjugate prior is Gaussian
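As a concrete instance of the first case: with likelihood y_n ~ N(µ, λ⁻¹) for n = 1..N (known precision λ) and Gaussian prior p(µ) = N(µ | m₀, τ₀⁻¹), the posterior is again Gaussian, with the precisions adding and the mean a precision-weighted average:

    p(\mu \mid y_{1:N}) \;=\; \mathcal{N}\!\left(\mu \,\Big|\,
        \frac{\tau_0 m_0 + \lambda \sum_{n} y_n}{\tau_0 + N\lambda},\;
        (\tau_0 + N\lambda)^{-1}\right).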

27
Separation of Convolutive Mixtures of Speech Sources
  • Blind separation methods use extremely simple models for source distributions
  • Speech signals have a rich structure. Models that capture aspects of it could result in improved separation, deconvolution, and noise robustness
  • One such model works in the windowed FFT domain:
  • x(n,k) = G(k) y(n,k)
  • where n = frame index, k = frequency
  • Train a MoG model on the x(n,k) such that different components capture different speech spectra
  • Plug this model into IFA and use EM to obtain separation of convolutive mixtures (a minimal sketch of the frequency-domain setup follows)
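A minimal sketch of the frequency-domain setup only (not the EM learning itself): the per-frequency unmixing matrix G(k) below is a placeholder identity, where the described approach would learn it from the data; the signals are random stand-ins for two microphone recordings.

    import numpy as np
    from scipy.signal import stft, istft

    # Windowed-FFT domain unmixing: x(n, k) = G(k) y(n, k).
    fs = 16000
    y = np.random.default_rng(5).normal(size=(2, fs))        # stand-in for 2 mics

    f, t, Y = stft(y, fs=fs, nperseg=512)                     # Y: (2, freqs, frames)
    n_freq = Y.shape[1]
    G = np.tile(np.eye(2, dtype=complex), (n_freq, 1, 1))     # per-frequency unmixing

    X = np.einsum('kij,jkn->ikn', G, Y)                       # apply G(k) per bin
    _, x = istft(X, fs=fs, nperseg=512)                       # back to the time domain
    print(x.shape)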

28
Noise Suppression in Speech Signals
  • Existing methods, based on e.g. spectral subtraction and array processing, often produce unsatisfactory noise suppression
  • Algorithms based on probabilistic models can (1) exploit rich speech models, (2) learn the noise from noisy data (not just silent segments), (3) work with one or more microphones
  • Use the speech model in the windowed FFT domain
  • Λ(k) = noise precision per frequency (inverse spectrum)

29
Interference Suppression and Evoked Source
Separation in MEG data
  • y(n) = MEG sensor data, x(n) = evoked brain sources, u(n) = interference sources, v(n) = sensor noise
  • Evoked stimulus experimental paradigm: evoked sources are active only after the stimulus onset
  • pre-stimulus: y(n) = B u(n) + v(n)
  • post-stimulus: y(n) = A x(n) + B u(n) + v(n)
  • SEIFA is an extension of IFA to this case: model x by MoG, model u by Gaussians N(0, I), model v by a Gaussian N(0, Λ)
  • Use the pre-stimulus data to learn the interference mixing matrix B and noise precision Λ; use the post-stimulus data to learn the evoked mixing matrix A
  • Use VB-EM to infer from data the optimal number of interference factors u and of evoked factors x; this also protects against overfitting
  • Cleaned data: y = A x. Contribution of factor j: y_i = A_ij x_j
  • Advantages over ICA: no need to discard information by dimensionality reduction; can exploit stimulus onset information; superior noise suppression

30
Stimulus Evoked Independent Factor Analysis
[Figure: pre-stimulus and post-stimulus SEIFA graphical models, with interference factors u, sensor data y, and mixing matrix B shown]
31
Brain Source Localization using MEG
  • Problem: localize brain sources that respond to a stimulus
  • Response model is simple: y(n) = F s(n) + v(n)
  • F = lead field (known), s = brain voxel activity
  • However, the number of voxels (3000-10000) is much larger than the number of sensors (100-300)
  • One approach: fit multiple dipole sources; the cost is exponential in the number of sources
  • Idea: loop over voxels; for each one, use VB-EM to learn a modified FA model
  • y(n) = F z(n) + A x(n) + v(n)
  • where F = lead field for that voxel, z = voxel activity, x = response from all other active voxels
  • Obtain a localization map by plotting <z(n)^2> per voxel
  • Superior results to existing (beamforming-based) methods; can handle correlated sources

32
MEG Localization Model
[Figure: pre-stimulus and post-stimulus localization models, with nodes z, u, x, y, mixing matrices A, B, and lead field F]
33
Conclusion
  • Statistical machine learning provides a principled framework for formulating and solving adaptive signal processing problems
  • Process:
  • (1) design a probabilistic model that corresponds to the problem
  • (2) use machinery for exact and approximate inference to learn the model from data, including model order
  • (3) extend the model, e.g. by incorporating rich signal models, to improve performance
  • Problems treated here: noise suppression, source separation, source localization
  • Domains: speech, audio, biomedical data
  • Domains outside this tutorial: image, video, text, coding, ...
  • Future: algorithms derived from probabilistic models take over and completely transform adaptive signal processing