
Correlation with Errors-In-Variables and an Application to Galaxies

- William H. Jefferys
- University of Texas at Austin, USA

A Problem in Correlation

- A graduate student at Maryland asked me to assist her with a problem involving galaxy data. She wanted to know whether the data showed clear evidence of correlation and, if so, what the correlation was and how strong the evidence for it was.

The Data

Comments on Data

- At first glance the correlation seems obvious.
- But there is an unusual feature of her problem: she knew that the data were imperfect, and each data point had an error bar in both x and y. Standard treatments of correlation do not address this situation.

The Data

Comments on Data

- The presence of the error bars contributes to uncertainty about how big the correlation is and how well it is determined.
- The data are sparse, so we also need to be concerned about small-number statistics.
- The student was also concerned about how the lowest point affected any correlation. What would happen, she wondered, if it were not included in the sample? She was afraid that the ellipticity of that particular galaxy was so low that it might not have been measured accurately, and in fact that the galaxy might belong to a different class of galaxies.

The Data

Bayesian Analysis and Astronomy

- She had been trying to use our program GaussFit to analyze the data, but it is not designed for tasks such as measuring correlation.
- I, of course, suggested to her that a Bayesian approach might be appropriate.
- Bayesian methods offer many advantages for astronomical research and have attracted much recent interest.
- Astronomy and Astrophysics Abstracts lists 169 articles with the keywords "Bayes" or "Bayesian" in the past 5 years, and the number is increasing rapidly (there were 53 in 2000 alone, up from 33 in 1999).

Advantages of Bayesian Methods

- Bayesian methods allow us to do things that would be difficult or impossible with standard (frequentist) statistical analysis.
- It is simple to incorporate prior physical or statistical information.
- Interpretation of results is very natural.
- Model comparison is easy and straightforward. (This is such a problem.)
- It is a systematic way of approaching statistical problems, rather than a collection of ad hoc techniques. Very complex problems (difficult or impossible to handle classically) are straightforwardly analyzed within a Bayesian framework.

The Fundamental Idea

- In a nutshell, Bayesian analysis entails the following uniform and systematic steps:
- Choose prior distributions ("priors") that reflect your knowledge about each parameter and model under consideration, prior to looking at the data.
- Determine the likelihood function for the data under each model and parameter value.
- Compute and normalize the full posterior distribution, conditioned on the data, using Bayes' theorem.
- Derive summaries of quantities of interest from the full posterior distribution by integrating over the posterior distribution to produce marginal distributions or integrals of interest (e.g., means, variances).
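The four steps above can be sketched on a toy problem (an assumed example, not the talk's galaxy analysis): inferring the mean of normal data with known standard deviation, using a simple grid approximation to the posterior.

```python
import numpy as np

# Toy illustration of the four Bayesian steps: infer the mean mu of
# normally distributed data with known sigma = 1, on a grid.

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=20)          # observed data

mu_grid = np.linspace(-5, 5, 2001)

# Step 1: prior -- flat over the grid (vague prior information).
prior = np.ones_like(mu_grid)

# Step 2: likelihood of the data at each candidate mu (in logs).
log_like = np.array([-0.5 * np.sum((y - mu) ** 2) for mu in mu_grid])

# Step 3: posterior via Bayes' theorem, normalized over the grid.
post = prior * np.exp(log_like - log_like.max())
post /= np.trapz(post, mu_grid)

# Step 4: summaries -- posterior mean and variance.
post_mean = np.trapz(mu_grid * post, mu_grid)
post_var = np.trapz((mu_grid - post_mean) ** 2 * post, mu_grid)

print(post_mean, post_var)
```

With a flat prior the posterior here is N(ȳ, σ²/n), so the grid summaries should reproduce the sample mean and σ²/n.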

Priors

- Choose prior distributions ("priors") that reflect your knowledge about each parameter and model prior to looking at the data.
- The investigator is required to provide all relevant prior information that he has before proceeding.
- There is always prior information. For example, we cannot count a negative number of photons. Parallaxes are greater than zero. We now know that the most likely value of the Hubble constant is in the ballpark of 60-80 km/sec/Mpc (say), with smaller probabilities of its being higher or lower.
- Prior information can be statistical in nature, e.g., we may have statistical knowledge about the spatial or velocity distribution of stars, or the variation in a telescope's plate scale.

Prior Information

- In Bayesian analysis, our knowledge about an unknown quantity θ is encoded in a prior distribution on the quantity in question, e.g., p(θ|B), where B is background information.
- Where prior information is vague or uninformative, a vague prior often recovers results similar to a classical analysis.
- EXCEPTION: model selection/model averaging.
- Sensitive dependence of the results on reasonable variations in prior information indicates that no analysis, Bayesian or other, can give reliable results.

Likelihood Function

- Determine the likelihood function of the data under each model and parameter value.
- The likelihood function describes the statistical properties of the mathematical model of our problem. It tells us how the statistics of the observations (e.g., normal or Poisson data) are related to the parameters and any background information.
- It is proportional to the sampling distribution for observing the data Y, given the parameters, but we are interested in its functional dependence on the parameters.
- The likelihood is known up to a constant but arbitrary factor, which cancels out in the analysis.

Posterior Distribution

- Compute and normalize the full posterior distribution, conditioned on the data, using Bayes' theorem.
- The posterior distribution encodes what we know about the parameters and model after we observe the data. Thus, Bayesian analysis models learning.
- Bayes' theorem says that

    p(θ|Y) = p(Y|θ) p(θ) / p(Y)

- Bayes' theorem is a trivial result of probability theory. The denominator is just a normalization factor and can often be dispensed with.

Bayes Theorem (Proof)

- By standard probability theory,

    p(θ, Y) = p(θ|Y) p(Y) = p(Y|θ) p(θ),

  from which Bayes' theorem follows immediately.

Integration and Marginalization

- Derive summaries of quantities of interest from the full posterior distribution by integrating over the posterior distribution to produce marginal distributions or integrals of interest (e.g., means, variances).
- Bayesian methodology provides a simple and systematic way of handling nuisance parameters that are required by the analysis but are of no interest to us. We just integrate them out (marginalize them) to obtain the marginal distribution of the parameter(s) of interest.
- Likewise, computing summary statistics is simple, e.g., posterior means and variances.
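Marginalization can be sketched numerically (an assumed toy example): evaluate a joint posterior p(μ, σ | Y) on a grid, then integrate out the nuisance parameter σ to obtain the marginal posterior of μ.

```python
import numpy as np

# Marginalize a nuisance parameter: normal data with unknown mean mu and
# unknown sigma, Jeffreys prior 1/sigma on sigma, flat prior on mu.

rng = np.random.default_rng(1)
y = rng.normal(loc=0.0, scale=2.0, size=30)

mu_grid = np.linspace(-3, 3, 301)
sig_grid = np.linspace(0.5, 5, 301)
MU, SIG = np.meshgrid(mu_grid, sig_grid, indexing="ij")

# Log joint posterior (up to a constant): likelihood times 1/sigma prior.
n = len(y)
log_post = (-n * np.log(SIG)
            - np.sum((y[:, None, None] - MU) ** 2, axis=0) / (2 * SIG ** 2)
            - np.log(SIG))
post = np.exp(log_post - log_post.max())

# Marginalize: integrate the joint posterior over sigma.
marg_mu = np.trapz(post, sig_grid, axis=1)
marg_mu /= np.trapz(marg_mu, mu_grid)

# Posterior mean of mu from the marginal distribution.
mu_mean = np.trapz(mu_grid * marg_mu, mu_grid)
print(mu_mean)
```

The resulting marginal of μ is centered on the sample mean, as expected for this model.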

Bayesian Model Selection/Averaging

- Given models Mi, each of which depends on a vector of parameters θM, and given data Y, Bayes' theorem tells us that

    p(θM, Mi | Y) = p(Y | θM, Mi) p(θM | Mi) p(Mi) / p(Y)

- The probabilities p(θM | Mi) and p(Mi) are the prior probabilities of the parameters given the model and of the model, respectively; p(Y | θM, Mi) is the likelihood function, and p(θM, Mi | Y) is the joint posterior probability distribution of the parameters and models, given the data.
- Note that some parameters may not appear in some models, and there is no requirement that the models be nested.

Bayesian Model Selection

- Assume for the moment that we have supplied priors and performed the necessary integrations to produce a normalized posterior distribution.
- In practice this is often done by simulation using Markov Chain Monte Carlo (MCMC).
- Once this has been done, it is simple in principle to compute posterior probabilities of the models:

    p(Mi | Y) = ∫ p(θM, Mi | Y) dθM

- This set of numbers summarizes our degree of belief in each of the models, after looking at the data. In model selection, we choose the model with the highest posterior probability.

Strategy

- I do not see a simple frequentist approach to this student's problem.
- A reasonable Bayesian approach is fairly straightforward:
- Assume that the underlying true (but unknown) galaxy parameters ξi and ηi (corresponding to the observed xi and yi) are distributed as a bivariate normal distribution.
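The model just described can be sketched as a data-generating process (all parameter values below are assumed, chosen only for illustration): true values (ξi, ηi) drawn from a bivariate normal, observed (xi, yi) equal to the true values plus measurement errors with known standard deviations.

```python
import numpy as np

# Errors-in-variables setup: latent bivariate-normal truths plus known
# per-point measurement noise (hypothetical parameter values).

rng = np.random.default_rng(2)
n = 12                                   # sparse sample, as in the talk
alpha, beta = 0.0, 0.0                   # true center of the distribution
rho = -0.8                               # true correlation coefficient
sig_xi, sig_eta = 1.0, 1.0               # true standard deviations

cov = np.array([[sig_xi ** 2, rho * sig_xi * sig_eta],
                [rho * sig_xi * sig_eta, sig_eta ** 2]])
true = rng.multivariate_normal([alpha, beta], cov, size=n)
xi, eta = true[:, 0], true[:, 1]

# Known per-point measurement standard deviations (the error bars).
s = np.full(n, 0.3)                      # errors in x
t = np.full(n, 0.3)                      # errors in y

x = xi + rng.normal(0.0, s)              # observed x_i
y = eta + rng.normal(0.0, t)             # observed y_i
print(np.corrcoef(x, y)[0, 1])
```

Note that the naive sample correlation of (x, y) is attenuated relative to the true ρ of the latent values, which is exactly why the error bars must enter the analysis.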

Strategy

- Since we do not know ξi and ηi for each galaxy, but instead only the observed values xi and yi, we introduce the ξi and ηi for each galaxy as latent variables. These are parameters to be estimated.
- Here α and β give the true center of the distribution; ρ is the true correlation coefficient, and σξ and ση are the true standard deviations. None of these quantities is known. They are also parameters which must be estimated.

Strategy

- Since we are using a bivariate normal, the variance-covariance matrix is

    V = [ σξ²        ρ σξ ση ]
        [ ρ σξ ση    ση²     ]

  with inverse

    V⁻¹ = 1/(1 - ρ²) [ 1/σξ²         -ρ/(σξ ση) ]
                     [ -ρ/(σξ ση)    1/ση²      ]

Strategy

- So the density is

    p(ξ, η) = 1/(2π σξ ση √(1 - ρ²)) × exp{ -[ (ξ-α)²/σξ² - 2ρ(ξ-α)(η-β)/(σξ ση) + (η-β)²/ση² ] / (2(1 - ρ²)) }

Strategy

- This expression may be regarded as our prior on the latent variables ξi and ηi. It depends on the other parameters (α, β, ρ, σξ, ση). We can regard these as hyperparameters, which will in turn require their own priors.
- The joint prior on all the latent variables ξi and ηi can be written as a product

    p(Ξ, H | α, β, ρ, σξ, ση) = ∏i p(ξi, ηi | α, β, ρ, σξ, ση),

  where Ξ and H are vectors whose components are ξi and ηi respectively.

Strategy

- We know the distributions of the data xi and yi conditional on ξi and ηi. Their joint distribution is given by

    xi | ξi ~ N(ξi, si²),   yi | ηi ~ N(ηi, ti²),   independently for each i.

- Here si and ti are the standard deviations of the data points, assumed known perfectly for this analysis (these are the basis of the error bars I showed earlier...).

The Data

The true value is somewhere near this error ellipse.

Strategy

- Now we can write down the likelihood, the joint probability of observing the data, conditional on the parameters (here only the latent parameters appear; the others are implicit through the prior on the latent parameters):

    p(X, Y | Ξ, H, S, T) = ∏i 1/(2π si ti) exp{ -(xi - ξi)²/(2si²) - (yi - ηi)²/(2ti²) },

  where X, Y, S, and T are vectors whose components are xi, yi, si, and ti, respectively.

Priors

- The next step is to assign priors for each of the parameters, including the latent variables. Lacking special information, I chose conventional priors for all but Ξ and H. Thus, I assign:
- Improper constant (flat) priors on α and β.
- Improper Jeffreys priors 1/σξ and 1/ση on σξ and ση.
- We have two models, one with correlation (M1) and one without (M0). I assign p(M1) = p(M0) = 1/2.
- We will compare M1 and M0 by computing their posterior probabilities. I chose the prior p(ρ|M1) on ρ to be flat and normalized on [-1, 1] and zero elsewhere; I chose a delta-function prior p(ρ|M0) = δ(ρ) on M0.
- Priors on Ξ and H were displayed earlier.

Posterior Distribution

- The posterior distribution is proportional to the prior times the likelihood, as Bayes instructs us.

Simulation Strategy

- We used simulation to generate a sample from the posterior distribution through a combination of Gibbs and Metropolis-Hastings samplers (Metropolis-within-Gibbs).
- The sample can be used to calculate quantities of interest:
- Compute the posterior mean and variance of the correlation coefficient ρ (calculate the sample mean and variance of ρ).
- Plot the posterior distribution of ρ (plot a histogram of ρ from the sample).
- Determine quantiles of the posterior distribution of ρ (use quantiles of the sample).
- Compute posterior probabilities of each model (calculate the frequency of the model in the sample).
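The summary step can be sketched as follows, with a synthetic sample standing in for real MCMC output (the draws below are assumed, for illustration only): given draws of ρ and a model indicator from the sampler, compute the summaries listed above.

```python
import numpy as np

# Posterior summaries from a (stand-in) MCMC sample of (model, rho).

rng = np.random.default_rng(3)
n_draws = 10_000

# Hypothetical sample: model indicator (1 = correlated model M1) and rho
# draws (exactly 0 under M0, negative values under M1).
model = (rng.random(n_draws) < 0.95).astype(int)
rho = np.where(model == 1, -rng.beta(8, 2, n_draws), 0.0)

rho_m1 = rho[model == 1]

post_mean = rho_m1.mean()                     # posterior mean of rho under M1
post_sd = rho_m1.std(ddof=1)                  # posterior standard deviation
q05, q50, q95 = np.quantile(rho_m1, [0.05, 0.5, 0.95])

p_m1 = (model == 1).mean()                    # posterior probability of M1
odds_m1 = p_m1 / (1.0 - p_m1)                 # posterior odds on M1

print(post_mean, post_sd, q50, odds_m1)
```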

Posterior Conditionals

- The conditional distribution on ξi and ηi is proportional to the bivariate normal prior for (ξi, ηi) times the normal likelihood terms for xi and yi.
- By combining terms and completing the square, we can sample ξi and ηi from a bivariate normal.

Posterior Conditionals

- Similarly, the posterior conditional on α, β is proportional to the product over galaxies of the bivariate normal density in (ξi - α, ηi - β).
- Again, by completing the square we can sample α and β from a bivariate normal.
- Note that if this were not an EIV problem, we would have x's and y's instead of ξ's and η's.

Posterior Conditionals

- The posterior conditional on σξ and ση is proportional to the priors 1/σξ and 1/ση times the same product of bivariate normal densities.
- It wasn't obvious to me that we could sample this in a Gibbs step, but maybe there's a way to do it. I just used independent M-H steps with a uniform symmetric proposal distribution, tuning the step size for good mixing, and this worked fine.
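The M-H move just described can be sketched on a toy target (assumed here: a standard deviation σ with a Jeffreys 1/σ prior and normal data with known mean, not the talk's actual conditional): a random walk with a uniform symmetric proposal, with the step size tuned by hand.

```python
import numpy as np

# Random-walk Metropolis for a scale parameter sigma, uniform symmetric
# proposal. Toy target: normal data with mean 0, prior 1/sigma.

rng = np.random.default_rng(4)
y = rng.normal(0.0, 1.5, size=50)

def log_post(sigma):
    # log of (1/sigma) * prod_i N(y_i | 0, sigma^2), up to a constant
    if sigma <= 0:
        return -np.inf
    return -(len(y) + 1) * np.log(sigma) - np.sum(y ** 2) / (2 * sigma ** 2)

step = 0.3          # maximum step size, tuned by hand for good mixing
sigma = 1.0         # starting value
draws = []
accepts = 0
for _ in range(5000):
    prop = sigma + rng.uniform(-step, step)   # symmetric proposal
    # Symmetric proposal => M-H ratio reduces to the posterior ratio.
    if np.log(rng.random()) < log_post(prop) - log_post(sigma):
        sigma = prop
        accepts += 1
    draws.append(sigma)

draws = np.array(draws[1000:])                # discard burn-in
print(draws.mean(), accepts / 5000)
```

Because the proposal is symmetric, the proposal densities cancel in the acceptance ratio, which is what makes this step so simple to implement.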

Posterior Conditionals

- We do a reversible-jump step on ρ and M simultaneously. Here the idea is to propose a model M′ and at the same time a correlation ρ′ in an M-H step, and either accept or reject according to the M-H α.
- Proposing from M0 to M0 or from M1 to M1 is basically simple, just an ordinary M-H step.
- If we are proposing between models, then things are a bit more complicated. This is due to the fact that the dimensionalities of the parameter spaces are different between the two models.

Posterior Conditionals

- The posterior conditional of ρ under M1 is proportional to (1/2) times the product over galaxies of the bivariate normal density given earlier, viewed as a function of ρ on [-1, 1].
- (The leading factor of 1/2 comes from the prior on ρ and is very important.)

Posterior Conditionals

- The posterior conditional of ρ under M0 is proportional to δ(ρ) times the same product, evaluated at ρ = 0.
- The δ-function guarantees that ρ = 0.
- Here, the proportionality factor is chosen so as to match the factor of 1/2 under M1. The factors come from the priors: p(ρ|M1) = U(-1, 1), which has an implicit factor of 1/2, and p(ρ|M0) = δ(ρ).

Posterior Conditionals

- The M-H ratio when jumping from (M, ρ) to (M′, ρ′) is therefore the ratio of posterior conditionals times the ratio of proposals (where the q's are the proposals).
- We sampled using a beta proposal q(ρ′ | M1, ·) with parameters tuned by experiment for efficiency and good mixing under the complex model, and with a proposal q(M′ | ·) that also was chosen by experiment with an eye to getting an accurate estimate of the posterior odds on M1.
- The idea is that a beta proposal on ρ matches the actual conditional pretty well, so it will be accepted with high probability; the M-H ratio will be close to 1.

Sampling Strategy for Our Problem

- To summarize:
- We sampled the α, β, ξj, ηj in Gibbs steps (α and β appear in the posterior distribution as a bivariate normal distribution, as do the ξj, ηj).
- We sampled σξ, ση with M-H steps using symmetric uniform proposals centered on the current point, adjusting the maximum step for good mixing.
- We sampled ρ and M in a simultaneous reversible-jump M-H step, using a beta proposal on ρ with parameters tuned by experiment for efficiency and good mixing under the complex model, and with a proposal on M that also was chosen by experiment with an eye to getting an accurate estimate of the posterior odds on M.

Results

- For the data set including the circled point, we obtained:
- Odds on model with correlation: 207 (assumes prior odds equal to 1)
- Median ρ: -0.81
- Mean ρ: -0.79 ± 0.10

Posterior distribution of ρ (including all points)

Results

- For the data set including the circled point, we obtained:
- Odds on model with correlation: 207 (assumes prior odds equal to 1)
- Median ρ: -0.81
- Mean ρ: -0.79 ± 0.10
- For the data set without the circled point, we obtained:
- Odds on model with correlation: 9.9
- Median ρ: -0.70
- Mean ρ: -0.68 ± 0.16

Posterior distribution of ρ (excluding 1 point)

Final Comments

- This problem combines several interesting features:
- Latent variables, introduced because this is an errors-in-variables problem.
- Model selection, implemented through reversible-jump MCMC simulation.
- A combination of Gibbs and Metropolis-Hastings steps to implement the sampler (Metropolis-within-Gibbs).
- It is a good example of how systematic application of basic Bayesian analysis can yield a satisfying solution of a problem that, when looked at frequentistically, seems almost intractable.

Final Comments

- One final comment: if you look at the tail area in either of the two cases investigated, you will see that it is much less than the 1/200 or 1/10 odds ratio that we calculated for the odds of M0 against M1. This is an example of how tail areas in general are not reliable statistics for deciding whether a hypothesis should be selected.

Practical Computation

- Until recently, a major practical difficulty has been computing the required integrals, limiting the method to situations where results can be obtained exactly or with analytic approximations.
- In the past 15 years, considerable progress has been made in solving the computational difficulties, particularly with the development of Markov Chain Monte Carlo (MCMC) methods for simulating a random sample from the full posterior distribution, from which marginal distributions and summary means and variances (as well as other averages) can be calculated conveniently.
- These methods have their origin in physics. The Metropolis-Hastings and Gibbs sampler methods are two popular schemes that originated in early attempts to solve large physics problems by Monte Carlo methods.

Practical Computation: Markov Chains

- Start from an arbitrary point in the space of models and parameters. Following a specific set of rules, which depend only on the unnormalized posterior distribution, generate a random walk in this space, such that the distribution of the generated points converges to a sample drawn from the underlying posterior distribution.
- The random walk is a Markov chain: that is, each step depends only upon the immediately previous step, and not on any of the earlier steps.
- Many rules for generating the transition from one state to the next are possible. All converge to the same distribution. One attempts to choose a rule that will give efficient sampling with a reasonable expenditure of effort and time.

Gibbs Sampler

- The Gibbs sampler is a scheme for generating a sample from the full posterior distribution by sampling in succession from the conditional distributions, when they can be sampled from conveniently. Thus, let the parameter vector θ be decomposed into a set of subvectors θ1, θ2, ..., θn. Suppose it is possible to sample from the conditional distributions
- p(θ1 | θ2, θ3, ..., θn),
- p(θ2 | θ1, θ3, ..., θn),
- ...,
- p(θn | θ1, θ2, ..., θn-1).

Gibbs Sampler (2)

- Starting from an arbitrary initial vector θ(0) = (θ1(0), θ2(0), ..., θn(0)), generate in succession vectors θ(1), θ(2), ... by sampling in succession from the conditional distributions
- p(θ1(k) | θ2(k-1), θ3(k-1), ..., θn(k-1)),
- p(θ2(k) | θ1(k), θ3(k-1), ..., θn(k-1)),
- ...,
- p(θn(k) | θ1(k), θ2(k), ..., θn-1(k)), with θ(k) = (θ1(k), θ2(k), ..., θn(k)).
- In the limit, the sample thus generated will converge (in distribution) to a sample drawn from the full posterior distribution.
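The scheme above can be sketched on the simplest non-trivial target (an assumed toy example): a bivariate standard normal with correlation ρ, whose two full conditionals are exact univariate normals, sampled in alternation.

```python
import numpy as np

# Gibbs sampler for a bivariate standard normal with correlation rho:
# alternate draws from the two exact conditional distributions.

rng = np.random.default_rng(5)
rho = 0.8
n_iter = 20_000

th1, th2 = 0.0, 0.0          # arbitrary starting point
draws = np.empty((n_iter, 2))
for k in range(n_iter):
    # p(th1 | th2) = N(rho * th2, 1 - rho^2)
    th1 = rng.normal(rho * th2, np.sqrt(1 - rho ** 2))
    # p(th2 | th1) = N(rho * th1, 1 - rho^2)
    th2 = rng.normal(rho * th1, np.sqrt(1 - rho ** 2))
    draws[k] = th1, th2

sample = draws[1000:]        # discard burn-in
print(np.corrcoef(sample[:, 0], sample[:, 1])[0, 1])
```

After burn-in, the empirical correlation of the chain approaches the target's ρ, illustrating the convergence-in-distribution claim.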

Metropolis-Hastings Step

- Gibbs sampling works when the conditional distributions are standard distributions from which samples can easily be drawn. This is not usually the case, and we would then have to replace Gibbs steps with another scheme.
- A Metropolis-Hastings step involves producing a sample from a suitable proposal distribution q(θ′|θ), where θ is the value at the previous step. Then a calculation is done to see whether to accept the new θ′ as the new step, or to keep the old θ as the new step. If we retain the old value, the sampler does not move the parameter θ at this step. If we accept the new value, it will move.
- We choose q so that random samples can easily be generated from it, and with other desirable characteristics. This is part of the art of designing good (efficient) sampling schemes.

Metropolis-Hastings Step (2)

- Specifically, if p(θ) is the target distribution from which we wish to sample, first generate θ′ from q(θ′|θ).
- Then calculate

    α = min{1, [p(θ′) q(θ|θ′)] / [p(θ) q(θ′|θ)]}

- Then generate a random number r uniform on [0, 1].
- Accept the proposed θ′ if r ≤ α; otherwise keep θ.
- Note that if q(θ′|θ) = p(θ′) for all θ, θ′, then we will always accept the new value. In this case the Metropolis-Hastings step becomes an ordinary Gibbs step.
- Generally, although the Metropolis-Hastings steps are guaranteed to produce a Markov chain with the right limiting distribution, one gets better performance the closer we can approximate p(θ′) by q(θ′|θ).
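The acceptance rule above can be written out directly (an assumed example: the target is a standard normal, and the proposal is deliberately asymmetric, a hypothetical choice made only so the full ratio with both q terms is exercised).

```python
import numpy as np

# One Metropolis-Hastings step with a general (asymmetric) proposal q,
# implementing alpha = min{1, p(th')q(th|th') / (p(th)q(th'|th))} in logs.

rng = np.random.default_rng(6)

def log_p(th):
    return -0.5 * th ** 2                      # target: standard normal (unnormalized)

prop_sd = 1.0

def log_q(th_new, th_old):
    # Proposal density q(th_new | th_old): normal centered at th_old / 2,
    # which is NOT symmetric, so the q terms do not cancel.
    return -0.5 * ((th_new - th_old / 2) / prop_sd) ** 2

def mh_step(th):
    th_new = rng.normal(th / 2, prop_sd)       # draw from q(. | th)
    log_alpha = (log_p(th_new) + log_q(th, th_new)
                 - log_p(th) - log_q(th_new, th))
    if np.log(rng.random()) < min(0.0, log_alpha):
        return th_new                          # accept: move to the new value
    return th                                  # reject: keep the old value

th = 0.0
chain = []
for _ in range(20_000):
    th = mh_step(th)
    chain.append(th)
chain = np.array(chain[2000:])
print(chain.mean(), chain.std())
```

Despite the asymmetric proposal, the chain's limiting distribution is the target, because the q(θ|θ′)/q(θ′|θ) factor corrects for the asymmetry.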