Bayesian Machine Learning for Signal Processing

- Hagai T. Attias
- Golden Metallic, Inc.
- San Francisco, CA
- Tutorial
- 6th International Conference on Independent Component Analysis and Blind Source Separation, Charleston, SC, March 2006

ICA / BSS is 15 Years Old

- First pair of papers: Comon; Jutten & Herault, Signal Processing, 1991
- First papers on a statistical machine learning approach to ICA/BSS: Bell & Sejnowski 1995; Cardoso 1996; Pearlmutter & Parra 1997
- First conference on ICA/BSS: Helsinki, 2000
- Lesson drawn by many: ICA is a cool problem. Let's find many approaches to it and many places where it's useful.
- Lesson drawn by some: statistical machine learning is a cool framework. Let's use it to transform adaptive signal processing. ICA is a good start.

From Noise Cancellation to ICA

[Figure: panels labeled "Noise Cancellation" and "ICA"; diagram labels: Background, Microphone, TV.]

Noise Cancellation Derivation

- $y$ = sensor, $x$ = sources, $n$ = time point
- $y_1(n) = x_1(n) + w(n)\, x_2(n)$
- $y_2(n) = x_2(n)$
- Joint probability distribution of observed sensor data:
- $p(y) = p_x(x_1 = y_1 - w y_2,\; x_2 = y_2)$
- Assume the sources are independent, identically distributed Gaussians, with mean 0 and precisions $v_1$, $v_2$
- Observed data likelihood:
- $L = \log p(y) = -0.5\, v_1 (y_1 - w y_2)^2 + \text{const.}$
- $dL/dw = 0 \rightarrow$ linear equation for $w$
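The linear equation for $w$ can be illustrated numerically. The sketch below is not taken from the slides: it assumes a single scalar weight that is constant in time, sums the likelihood over time points, and uses a hypothetical function name (`estimate_canceller_weight`) and synthetic data.

```python
import numpy as np

def estimate_canceller_weight(y1, y2):
    """Maximum-likelihood scalar w for y1 = x1 + w * y2 with Gaussian x1."""
    # With L = -0.5 * v1 * sum_n (y1 - w * y2)**2 + const, setting dL/dw = 0
    # gives the linear equation  w * sum(y2**2) = sum(y1 * y2).
    return np.dot(y1, y2) / np.dot(y2, y2)

rng = np.random.default_rng(0)
n = 10_000
x1 = rng.normal(0.0, 1.0, n)    # source of interest (synthetic)
x2 = rng.normal(0.0, 2.0, n)    # noise source, observed directly as y2
w_true = 0.7                    # assumed scalar coupling, for illustration only

y1 = x1 + w_true * x2
y2 = x2

w_hat = estimate_canceller_weight(y1, y2)
x1_hat = y1 - w_hat * y2        # noise-cancelled estimate of x1
```

Note that the precision $v_1$ drops out of the solution, which is why it does not appear in the function.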

Noise Cancellation → ICA Derivation

- $y$ = sensor, $x$ = sources, $n$ = time point
- $y(n) = A x(n)$, $A$ = square mixing matrix
- $x(n) = G y(n)$, $G$ = square unmixing matrix
- Probability distribution of observed sensor data:
- $p(y) = |G|\, p_x(G y)$
- Assume the sources are i.i.d. non-Gaussians
- Observed data likelihood:
- $L = \log p(y) = \log |G| + \log p(x_1) + \log p(x_2)$
- $dL/dG = 0 \rightarrow$ non-linear equation for $G$
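To see why the stationarity condition is non-linear, the gradient can be written out. The score function $\phi$ below is notation introduced here for illustration, and the expression is for a single time point (in practice it is averaged over $n$):

```latex
\begin{aligned}
L(G) &= \log|\det G| + \sum_i \log p(x_i), \qquad x = G y, \\
\frac{\partial L}{\partial G} &= G^{-\top} - \phi(x)\, y^{\top}, \qquad
\phi_i(x_i) = -\frac{d \log p(x_i)}{d x_i}.
\end{aligned}
```

For non-Gaussian sources $\phi$ is a non-linear function of $x = G y$, so $dL/dG = 0$ has no closed-form solution and is typically solved iteratively.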

Sensor Noise and Hidden Variables

- $y$ = sensor, $x$ = sources, $u$ = noise, $n$ = time point
- $y(n) = A x(n) + u(n)$
- $x$ are now hidden variables: even if $A$ is known, one cannot obtain $x$ exactly from $y$
- However, one can compute the posterior probability of $x$ conditioned on $y$:
- $p(x \mid y) = p(y \mid x)\, p(x) / p(y)$
- where $p(y \mid x) = p_u(y - A x)$
- To learn $A$ from data, one must use an expectation-maximization (EM) algorithm (and often approximate it)

Probabilistic Graphical Models

- Model the distribution of observed data
- Graph structure determines the probabilistic dependence between variables
- We focus on DAGs: directed acyclic graphs
- Node = variable
- Arrow = probabilistic dependence

[Graphs: a single node $x$ with $p(x)$; an arrow $x \rightarrow y$ with $p(y,x) = p(y \mid x)\, p(x)$]

Linear Classification

- $c$ = class label: discrete, multinomial
- $y$ = data: continuous, Gaussian
- $p(c) = \pi_c$, $p(y \mid c) = N(y \mid \mu_c, \Lambda_c)$
- Training set: pairs $\{y, c\}$
- Learn parameters by maximum likelihood:
- $L = \log p(y,c) = \log p(y \mid c) + \log p(c)$
- Test set: $y$; classify using $p(c \mid y) = p(y,c) / p(y)$

[Graph: $c \rightarrow y$; $p(y,c) = p(y \mid c)\, p(c)$]
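The training and test steps for this classifier can be sketched as follows. This is an illustration under assumptions, not the tutorial's code: the function names are hypothetical, the data are synthetic, and each class Gaussian is parameterized by a covariance matrix estimated per class.

```python
import numpy as np

def fit_gaussian_classifier(Y, c, n_classes):
    """Maximum-likelihood class priors, means and covariances from pairs (y, c)."""
    priors, means, covs = [], [], []
    for k in range(n_classes):
        Yk = Y[c == k]
        priors.append(len(Yk) / len(Y))
        means.append(Yk.mean(axis=0))
        covs.append(np.cov(Yk, rowvar=False, bias=True))  # ML (1/N) estimate
    return np.array(priors), np.array(means), np.array(covs)

def log_gaussian(Y, mean, cov):
    """Row-wise log N(y | mean, cov)."""
    d = Y.shape[1]
    diff = Y - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(cov), diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def class_posterior(Y, priors, means, covs):
    """p(c | y) = p(y | c) p(c) / p(y), computed in the log domain."""
    log_joint = np.stack([np.log(p) + log_gaussian(Y, m, S)
                          for p, m, S in zip(priors, means, covs)], axis=1)
    log_joint -= log_joint.max(axis=1, keepdims=True)   # numerical stability
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
Y = np.vstack([rng.normal([-2, 0], 1.0, (200, 2)),
               rng.normal([2, 0], 1.0, (200, 2))])
c = np.repeat([0, 1], 200)

priors, means, covs = fit_gaussian_classifier(Y, c, 2)
post = class_posterior(Y, priors, means, covs)
accuracy = (post.argmax(axis=1) == c).mean()
```

With a separate covariance per class the decision boundary is quadratic in general; sharing one covariance across classes makes it linear.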

Linear Regression

- $x$ = predictor: continuous, Gaussian
- $y$ = dependent: continuous, Gaussian
- $p(x) = N(x \mid \mu, \Lambda_x)$, $p(y \mid x) = N(y \mid A x, \Lambda)$
- Training set: pairs $\{y, x\}$
- Learn parameters by maximum likelihood:
- $L = \log p(y,x) = \log p(y \mid x) + \log p(x)$
- Test set: $x$; predict using $p(y \mid x)$

[Graph: $x \rightarrow y$; $p(y,x) = p(y \mid x)\, p(x)$]
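For the regression model, maximizing $\log p(y \mid x)$ over $A$ gives a set of linear equations. A minimal sketch, assuming samples are stored as rows and using a hypothetical function name and synthetic data:

```python
import numpy as np

def fit_linear_gaussian(X, Y):
    """ML estimates for p(y | x) = N(y | A x, noise); rows of X and Y are samples."""
    # dL/dA = 0 gives the linear equations  A (sum_n x x^T) = sum_n y x^T.
    A = np.linalg.solve(X.T @ X, X.T @ Y).T
    resid = Y - X @ A.T
    noise_cov = resid.T @ resid / len(X)   # ML noise covariance (inverse precision)
    return A, noise_cov

rng = np.random.default_rng(0)
A_true = np.array([[1.0, -2.0],
                   [0.5, 3.0],
                   [0.0, 1.0]])            # 3 outputs, 2 predictors (synthetic)
X = rng.normal(size=(500, 2))
Y = X @ A_true.T + 0.1 * rng.normal(size=(500, 3))

A_hat, noise_cov = fit_linear_gaussian(X, Y)
y_pred = A_hat @ np.array([1.0, 1.0])      # predictive mean of p(y | x) at a test x
```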

Clustering

- $c$ = class label: discrete, multinomial
- $y$ = data: continuous, Gaussian
- $p(c) = \pi_c$, $p(y \mid c) = N(y \mid \mu_c, \Lambda_c)$
- Training set: $y$
- $p(y)$ is a mixture of Gaussians (MoG)
- Learn parameters by expectation maximization (EM)
- Test set: $y$; cluster using $p(c \mid y) = p(y,c) / p(y)$
- Limit of zero variance: vector quantization (VQ)

[Graph: $c \rightarrow y$; $p(y,c) = p(y \mid c)\, p(c)$, but now $c$ is hidden]
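EM for a mixture of Gaussians alternates between computing $p(c \mid y)$ and re-estimating the parameters. The sketch below is one-dimensional for brevity; the function name, initial values, and synthetic data are assumptions made for illustration.

```python
import numpy as np

def em_mog_1d(y, means, n_iter=50):
    """EM for a 1-D mixture of Gaussians; `means` is the initial guess."""
    means = np.asarray(means, dtype=float).copy()
    k = len(means)
    weights = np.full(k, 1.0 / k)
    variances = np.full(k, y.var())
    for _ in range(n_iter):
        # E-step: responsibilities p(c | y) = p(y | c) p(c) / p(y)
        log_joint = (np.log(weights)
                     - 0.5 * np.log(2 * np.pi * variances)
                     - 0.5 * (y[:, None] - means) ** 2 / variances)
        log_joint -= log_joint.max(axis=1, keepdims=True)
        resp = np.exp(log_joint)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted maximum-likelihood estimates
        nk = resp.sum(axis=0)
        weights = nk / len(y)
        means = resp.T @ y / nk
        variances = (resp * (y[:, None] - means) ** 2).sum(axis=0) / nk
    return weights, means, variances, resp

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-3.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])
weights, means, variances, resp = em_mog_1d(y, means=[-1.0, 1.0])
labels = resp.argmax(axis=1)   # hard cluster assignment from p(c | y)
```

Replacing the soft responsibilities by hard assignments and fixing the variances gives a k-means style update, which corresponds to the zero-variance (VQ) limit.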

Factor Analysis

- $x$ = factors: continuous, Gaussian
- $y$ = data: continuous, Gaussian
- $p(x) = N(x \mid 0, I)$, $p(y \mid x) = N(y \mid A x, \Lambda)$
- Training set: $y$
- $p(y)$ is Gaussian with covariance $A A^\top + \Lambda^{-1}$
- Learn parameters by expectation maximization (EM)
- Test set: $y$; obtain factors by $p(x \mid y) = p(y,x) / p(y)$
- Limit of zero noise: principal component analysis (PCA)

[Graph: $p(y,x) = p(y \mid x)\, p(x)$, but now $x$ is hidden]
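A compact EM loop for factor analysis can make the E-step and M-step concrete. This is a sketch under assumptions not stated in the slides: the data are zero-mean, the noise precision is diagonal (stored as variances `psi`), and the function names and synthetic data are hypothetical.

```python
import numpy as np

def fa_log_likelihood(Y, A, psi):
    """Mean log p(y) under N(0, A A^T + diag(psi)); psi = 1 / noise precision."""
    d = Y.shape[1]
    C = A @ A.T + np.diag(psi)
    _, logdet = np.linalg.slogdet(C)
    maha = np.einsum("ni,ij,nj->n", Y, np.linalg.inv(C), Y)
    return np.mean(-0.5 * (d * np.log(2 * np.pi) + logdet + maha))

def fa_em(Y, n_factors, n_iter=100, seed=0):
    """EM for factor analysis on zero-mean data Y (rows are samples)."""
    n, d = Y.shape
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(d, n_factors))
    psi = Y.var(axis=0)
    history = []
    for _ in range(n_iter):
        history.append(fa_log_likelihood(Y, A, psi))
        # E-step: p(x | y) is Gaussian with covariance S and mean S A^T Lambda y
        AtL = A.T / psi                                   # A^T Lambda
        S = np.linalg.inv(np.eye(n_factors) + AtL @ A)
        Xm = Y @ AtL.T @ S                                # posterior means, one row per sample
        # M-step: linear equations for A, then the noise variances
        Exx = n * S + Xm.T @ Xm                           # sum_n E[x x^T]
        A = np.linalg.solve(Exx, Xm.T @ Y).T
        psi = np.mean(Y * (Y - Xm @ A.T), axis=0)
    history.append(fa_log_likelihood(Y, A, psi))
    return A, psi, np.array(history)

rng = np.random.default_rng(1)
A_true = rng.normal(size=(5, 2))
X = rng.normal(size=(2000, 2))
Y = X @ A_true.T + 0.3 * rng.normal(size=(2000, 5))
Y -= Y.mean(axis=0)

A_hat, psi_hat, history = fa_em(Y, n_factors=2)
```

The recorded `history` should be non-decreasing, which is the usual sanity check for an EM implementation. The learned `A_hat` is only determined up to a rotation of the factors.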

Dynamical Models

Hidden Markov Model (Baum-Welch)

State Space Model (Kalman Smoothing)

Switching State Space Model (Intractable)

Probabilistic Inference

Factor analysis model: $p(x) = N(x \mid 0, I)$, $p(y \mid x, A, \Lambda) = N(y \mid A x, \Lambda)$

- Nodes inside frame = variables, vary in time
- Nodes outside frame = parameters, constant in time
- Parameters have prior distributions $p(A)$, $p(\Lambda)$
- Bayesian inference: compute the full posterior distribution $p(x, A, \Lambda \mid y)$ over all hidden nodes conditioned on observed nodes
- Bayes rule: $p(x, A, \Lambda \mid y) = p(y \mid x, A, \Lambda)\, p(x)\, p(A)\, p(\Lambda) / p(y)$
- In hidden variable models, the joint posterior generally cannot be computed exactly. The normalization factor $p(y)$ is intractable

[Graph: nodes $A$, $x$, $\Lambda$, $y$]

MAP and Maximum Likelihood

- MAP (maximum a posteriori): consider only the parameter values that maximize the posterior $p(x, A, \Lambda \mid y)$
- This is the maximum likelihood method:
- compute $A, \Lambda$ that maximize $L = \log p(y \mid A, \Lambda)$
- However, in hidden variable models $L$ is a complicated function of the parameters; direct maximization would require gradient based techniques, which are slow
- Solution: the EM algorithm
- Iterative algorithm; each iteration has an E-step and an M-step
- E-step: compute posterior over hidden variables $p(x \mid y)$
- M-step: maximize complete data likelihood $E \log p(y, x, A, \Lambda)$ w.r.t. the parameters $A, \Lambda$ ($E$ = posterior average over $x$)

Derivation of the EM Algorithm

- Instead of the likelihood $L = \log p(y)$, consider
- $F(q) = E \log p(y,x) - E \log q(x \mid y)$
- where $q(x \mid y)$ is a trial posterior and $E$ = average over $x$ w.r.t. $q$
- Can show: $F(q) = L - \mathrm{KL}[\, q(x \mid y) \,\|\, p(x \mid y) \,] \le L$
- Hence $F$ is upper bounded by $L$, and $F = L$ when $q$ = true posterior
- EM performs an alternate maximization of $F$
- The E-step maximizes $F$ w.r.t. the posterior $q$
- The M-step maximizes $F$ w.r.t. the parameters $A, \Lambda$
- Hence EM performs maximum likelihood
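The bound used in this derivation follows in a few lines from $p(y,x) = p(x \mid y)\, p(y)$. Here $E_q$ makes explicit that the average is taken w.r.t. the trial posterior $q$:

```latex
\begin{aligned}
F(q) &= E_q \log p(y,x) - E_q \log q(x \mid y) \\
     &= E_q \log \frac{p(x \mid y)\, p(y)}{q(x \mid y)} \\
     &= \log p(y) - E_q \log \frac{q(x \mid y)}{p(x \mid y)} \\
     &= L - \mathrm{KL}\left[\, q(x \mid y) \,\|\, p(x \mid y) \,\right] \;\le\; L .
\end{aligned}
```

Since the KL divergence is non-negative and vanishes only when the two distributions coincide, the E-step choice $q(x \mid y) = p(x \mid y)$ makes the bound tight.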

ICA by EM: MoG Sources

- Each source distribution $p(x)$ is a 1-dim mixture of Gaussians
- The Gaussian labels $s$ are hidden variables
- The data $y = A x$, hence $x = G y$ are not hidden
- Likelihood: $L = \log |G| + \log p(x)$
- $F(q) = \log |G| + E \log p(x,s) - E \log q(s \mid y)$
- E-step: $q(s \mid y) = p(x,s) / z$
- M-step: $G \rightarrow G + \epsilon\, (I - \phi(x)\, x^\top)\, G$ (natural gradient)
- $\phi(x)$ is linear in $x$ and $q$
- Can also learn the source parameters MoG1, MoG2 at the M-step
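The E-step and natural-gradient M-step for square, noiseless ICA can be sketched as below. Several choices here are assumptions made for illustration rather than details from the slides: the source model is a fixed zero-mean two-component MoG (its parameters are not learned), the updates are batch averages over all time points, and the step size, iteration count, function names, and synthetic Laplacian sources are hypothetical.

```python
import numpy as np

# Assumed, fixed source model: zero-mean, two-component MoG (heavy-tailed).
WEIGHTS = np.array([0.5, 0.5])
MEANS = np.array([0.0, 0.0])
PRECISIONS = np.array([4.0, 0.25])

def e_step(X):
    """Return q(s | y) = p(x, s) / z and log z = log p(x) for each entry of X."""
    diff = X[..., None] - MEANS
    log_joint = (np.log(WEIGHTS) + 0.5 * np.log(PRECISIONS / (2 * np.pi))
                 - 0.5 * PRECISIONS * diff ** 2)
    log_z = np.logaddexp.reduce(log_joint, axis=-1)
    return np.exp(log_joint - log_z[..., None]), log_z

def log_likelihood(G, Y):
    """Time average of log|det G| + sum_i log p(x_i), with x = G y."""
    _, log_z = e_step(Y @ G.T)
    return np.linalg.slogdet(G)[1] + log_z.sum(axis=1).mean()

def ica_mog(Y, n_iter=300, eps=0.1):
    """Natural-gradient ICA with MoG sources; rows of Y are time points."""
    d = Y.shape[1]
    G = np.eye(d)
    for _ in range(n_iter):
        X = Y @ G.T
        q, _ = e_step(X)                                   # E-step
        # Score function: linear in x and in the responsibilities q
        phi = (q * PRECISIONS * (X[..., None] - MEANS)).sum(axis=-1)
        # M-step: natural-gradient update of the unmixing matrix
        G = G + eps * (np.eye(d) - phi.T @ X / len(Y)) @ G
    return G

rng = np.random.default_rng(0)
S = rng.laplace(size=(5000, 2))             # synthetic non-Gaussian sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])      # synthetic square mixing matrix
Y = S @ A.T

G = ica_mog(Y)
P = np.abs(G @ A)                           # close to a scaled permutation if separated
crosstalk = P.min(axis=1) / P.max(axis=1)
```

The sources are recovered only up to scaling and permutation, so separation quality is judged from `G @ A` rather than from `G` itself.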

Noisy, Non-Square ICA: Independent Factor Analysis

- The Gaussian labels $s$ are hidden variables
- The data $y = A x + u$, hence $x$ are also hidden
- $p(y \mid x) = N(y \mid A x, \Lambda)$
- Likelihood: $L = \log p(y)$ must marginalize over $x, s$
- $F(q) = E \log p(y,x,s) - E \log q(x,s \mid y)$
- E-step: $q(x,s \mid y) = q(x \mid s,y)\, q(s \mid y)$
- M-step: linear eqs for $A$, $\Lambda$
- Can also learn the source parameters MoG1, MoG2 at the M-step
- Convergence problem in low noise

Intractability of Inference

- In many models of interest the E-step is computationally intractable
- Switching state space model: posterior over discrete state $p(s \mid y)$ is exponential in time
- Independent factor analysis: posterior over Gaussian labels is exponential in number of sources
- Approximations must be made
- MAP approximation: consider only the most likely state configuration(s)
- Markov Chain Monte Carlo: convergence may be quite slow and hard to determine

Variational EM

- Idea: use an approximate posterior which has a factorized form
- Example: switching state space model
- factorize the continuous states from the discrete states:
- $p(x,s \mid y) \approx q(x,s \mid y) = q(x \mid y)\, q(s \mid y)$
- make no other assumptions (e.g., functional forms)
- To derive, consider $F(q)$ from the derivation of EM:
- $F(q) = E \log p(y,x,s) - E \log q(x \mid y) - E \log q(s \mid y)$
- $E$ performs posterior averaging w.r.t. $q$
- Maximize $F$ alternately w.r.t. $q(x \mid y)$ and $q(s \mid y)$:
- $q(x \mid y) = \exp\{E_s \log p(y,x,s)\} / z_s$
- $q(s \mid y) = \exp\{E_x \log p(y,x,s)\} / z_x$
- This adds an internal loop in the E-step; the M-step is unchanged
- Convergence is guaranteed since $F(q)$ is upper bounded by $L$
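The alternating updates can be read as fixed-point conditions on $\log q$. Setting the functional derivative of $F$ to zero for each factor in turn, with the other factor held fixed and the normalization of $q$ enforced, gives (a standard mean-field result, written here in the notation of this slide):

```latex
\begin{aligned}
F(q) &= E_q \log p(y,x,s) - E_q \log q(x \mid y) - E_q \log q(s \mid y), \\
\frac{\delta F}{\delta q(x \mid y)} = 0 \;&\Rightarrow\; \log q(x \mid y) = E_s \log p(y,x,s) - \log z_s, \\
\frac{\delta F}{\delta q(s \mid y)} = 0 \;&\Rightarrow\; \log q(s \mid y) = E_x \log p(y,x,s) - \log z_x .
\end{aligned}
```

Each update depends on the other factor through the average, which is why the two must be iterated inside the E-step.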

Switching Model: 2 Variational Approximations

[Figures: the model, variational approximation I, and variational approximation II, drawn over discrete states s(1), s(2), s(3) and continuous states x(1), x(2), x(3).]

I: Baum-Welch, Kalman; Gaussian, smoothed

II: Baum-Welch, MoG; multimodal, not smoothed

IFA: 2 Variational Approximations

[Figures: the model, variational approximation I, and variational approximation II.]

I: Source posterior is Gaussian, correlated

II: Source posterior is multimodal, independent

Model Order Selection

- How does one determine the optimal number of factors in FA?
- Maximum likelihood would always prefer more complex models, since they fit the data better, but they overfit
- The probabilistic inference approach: place a prior $p(A)$ over the model parameters, and consider the marginal likelihood
- $L = \log p(y) = E \log p(y, A) - E \log p(A \mid y)$
- Compute $L$ for each number of factors. Choose the number that maximizes $L$
- An alternative approach: place a prior $p(A)$ assuming a maximum number of factors. The prior has a hyperparameter for each column of $A$: its precision $\alpha$. Optimize the precisions by maximizing $L$. Unnecessary columns will have $\alpha \rightarrow \infty$
- Both approaches require computing the parameter posterior $p(A \mid y)$, which is usually intractable

Variational Bayesian EM

- Idea: use an approximate posterior which factorizes the parameters from the hidden variables
- Example: factor analysis
- $p(x, A \mid y) \approx q(x, A \mid y) = q(x \mid y)\, q(A \mid y)$
- make no other assumptions (e.g., functional forms)
- To derive, consider $F(q)$ from the derivation of EM:
- $F(q) = E \log p(y,x,A) - E \log q(x \mid y) - E \log q(A \mid y)$
- $E$ performs posterior averaging w.r.t. $q$
- Maximize $F$ alternately w.r.t. $q(x \mid y)$ and $q(A \mid y)$:
- E-step: $q(x \mid y) = \exp\{E_A \log p(y,x,A)\} / z_A$
- M-step: $q(A \mid y) = \exp\{E_x \log p(y,x,A)\} / z_x$
- Plus, maximize $F$ w.r.t. the noise precision $\Lambda$ and hyperparameters $\alpha$ (MAP approximation)

VB Approximation for IFA

[Figures: the model and the VB approximation, with nodes s1, s2, x1, x2, and A.]

Conjugate Priors

- Which form should one choose for prior distributions?
- Conjugate prior idea: choose a prior such that the resulting posterior distribution would have the same functional form as the prior
- Single Gaussian: posterior over mean is
- $p(\mu \mid y) = p(y \mid \mu)\, p(\mu) / p(y)$
- conjugate prior is Gaussian
- Single Gaussian: posterior over mean + precision is
- $p(\mu, \Lambda \mid y) \propto p(y \mid \mu, \Lambda)\, p(\mu, \Lambda)$
- conjugate prior is Normal-Wishart
- Factor analysis: VB posterior over mixing matrix is
- $q(A \mid y) = \exp\{E_x \log p(y, x \mid A)\}\, p(A) / z$
- conjugate prior is Gaussian
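The simplest conjugate case, a Gaussian prior on the mean of a Gaussian with known precision, has a closed-form posterior: precisions add, and the posterior mean is a precision-weighted average. A minimal one-dimensional sketch, with a hypothetical function name and toy numbers:

```python
import numpy as np

def posterior_over_mean(y, prior_mean, prior_precision, data_precision):
    """Posterior over the mean of a 1-D Gaussian with known data precision.

    The prior is N(mu | prior_mean, 1 / prior_precision); the posterior is
    again Gaussian, which is what conjugacy means.
    """
    n = len(y)
    post_precision = prior_precision + n * data_precision
    post_mean = (prior_precision * prior_mean
                 + data_precision * np.sum(y)) / post_precision
    return post_mean, post_precision

y = np.array([1.0, 2.0, 3.0])
post_mean, post_precision = posterior_over_mean(
    y, prior_mean=0.0, prior_precision=1.0, data_precision=1.0)
# post_mean is 1.5 and post_precision is 4.0 for these toy values
```

Because the posterior has the same form as the prior, it can serve as the prior for the next batch of data.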

Separation of Convoluted Mixtures of Speech Sources

- Blind separation methods use extremely simple models for source distributions
- Speech signals have a rich structure. Models that capture aspects of it could result in improved separation, deconvolution, and noise robustness
- One such model: work in the windowed FFT domain
- $x(n,k) = G(k)\, y(n,k)$
- where $n$ = frame index, $k$ = frequency
- Train a MoG model on the $x(n,k)$ such that different components capture different speech spectra
- Plug this model into IFA and use EM to obtain separation of convoluted mixtures
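Applying a separate unmixing matrix at each frequency of the windowed FFT can be sketched as follows. Only the mechanics of $x(n,k) = G(k)\, y(n,k)$ are shown: the frame length, hop size, Hann window, and function names are assumptions, and `G` is an identity placeholder standing in for matrices that would be learned by EM.

```python
import numpy as np

def stft(signal, frame_len=256, hop=128):
    """Windowed FFT; returns an array of shape (n_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def unmix(Y, G):
    """x(n, k) = G(k) y(n, k).

    Y has shape (n_frames, n_freq, n_sensors);
    G has shape (n_freq, n_sources, n_sensors).
    """
    return np.einsum("kij,nkj->nki", G, Y)

rng = np.random.default_rng(0)
sensors = rng.normal(size=(2, 4096))                 # two synthetic sensor signals
Y = np.stack([stft(s) for s in sensors], axis=-1)    # (n_frames, n_freq, 2)

# Placeholder unmixing matrices: identity at every frequency k.
G = np.tile(np.eye(2, dtype=complex), (Y.shape[1], 1, 1))
X = unmix(Y, G)
```

A convolutive mixture in the time domain becomes, approximately, an instantaneous mixture per frequency bin, which is what makes the per-frequency matrix $G(k)$ a natural parameterization.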

Noise Suppression in Speech Signals

- Existing methods based on, e.g., spectral subtraction and array processing, often produce unsatisfactory noise suppression
- Algorithms based on probabilistic models can (1) exploit rich speech models, (2) learn the noise from noisy data (not just silent segments), (3) work with one or more microphones
- Use speech model in the windowed FFT domain
- $\Lambda(k)$ = noise precision per frequency (inverse spectrum)

Interference Suppression and Evoked Source Separation in MEG data

- $y(n)$ = MEG sensor data, $x(n)$ = evoked brain sources, $u(n)$ = interference sources, $v(n)$ = sensor noise
- Evoked stimulus experimental paradigm: evoked sources are active only after the stimulus onset
- pre-stimulus: $y(n) = B u(n) + v(n)$
- post-stimulus: $y(n) = A x(n) + B u(n) + v(n)$
- SEIFA is an extension of IFA to this case: model $x$ by MoG, model $u$ by Gaussians $N(0, I)$, model $v$ by Gaussian $N(0, \Lambda)$
- Use pre-stimulus to learn interference mixing matrix $B$ and noise precision $\Lambda$; use post-stimulus to learn evoked mixing matrix $A$
- Use VB-EM to infer from data the optimal number of interference factors $u$ and of evoked factors $x$; also protect from overfitting
- Cleaned data: $y = A x$. Contribution of factor $j$: $y_i = A_{ij} x_j$
- Advantages over ICA: no need to discard information by dim reduction; can exploit stimulus onset information; superior noise suppression

Stimulus Evoked Independent Factor Analysis

[Figures: pre-stimulus and post-stimulus models; recoverable node labels: u, y, B.]

Brain Source Localization using MEG

- Problem: localize brain sources that respond to a stimulus
- Response model is simple: $y(n) = F s(n) + v(n)$
- $F$ = lead field (known), $s$ = brain voxel activity
- However, the number of voxels (3000-1000) is much larger than the number of sensors (100-300)
- One approach: fit multiple dipole sources; cost is exponential in the number of sources
- Idea: loop over voxels; for each one, use VB-EM to learn a modified FA model
- $y(n) = F z(n) + A x(n) + v(n)$
- where $F$ = lead field for that voxel, $z$ = voxel activity, $x$ = response from all other active voxels
- Obtain a localization map by plotting per voxel
- Superior results to existing (beamforming based) methods; can handle correlated sources

MEG Localization Model

[Figures: pre-stimulus and post-stimulus models; recoverable node labels: z, u, x, y, A, B, F.]

Conclusion

- Statistical machine learning provides a principled framework for formulating and solving adaptive signal processing problems
- Process:
- (1) design a probabilistic model that corresponds to the problem
- (2) use machinery for exact and approximate inference to learn the model from data, including model order
- (3) extend the model, by e.g. incorporating rich signal models, to improve performance
- Problems treated here: noise suppression, source separation, source localization
- Domains: speech, audio, biomedical data
- Domains outside this tutorial: image, video, text, coding, ...
- Future: algorithms derived from probabilistic models take over and completely transform adaptive signal processing