# Correlation with Errors-In-Variables and an Application to Galaxies

PowerPoint transcript (36 slides). Author: William Jefferys.

Transcript and Presenter's Notes
1
Correlation with Errors-In-Variables and an
Application to Galaxies
• William H. Jefferys
• University of Texas at Austin, USA

2
A Problem in Correlation
A student consulted me on a problem involving galaxy data. She wanted to know whether the data showed clear evidence of correlation and, if so, what the correlation was and how strong the evidence for it was.

3
The Data
4
• At first glance the correlation seems obvious.
• But there is an unusual feature of her problem: she knew that the data were imperfect, and each data point had an error bar in both x and y. Standard treatments of correlation do not address this situation.

5
The Data
6
• The presence of the error bars contributes to uncertainty as to how big the correlation is and how well it is determined.
• The data are sparse, so we also need to be careful about how much they can tell us.
• The student also was concerned about how the lowest point affected any correlation. What would happen, she wondered, if it were not included in the sample? She was afraid that the ellipticity of the particular galaxy was so low that it might not have been measured accurately, and in fact that the galaxy might belong to a different class of galaxies.

7
The Data
8
Bayesian Analysis and Astronomy
• She had been trying to use our program GaussFit to analyze the data, but it is not designed for this kind of problem.
• I, of course, suggested to her that a Bayesian approach might be appropriate.
• Bayesian methods offer many advantages for astronomical research and have attracted much recent interest.
• Astronomy and Astrophysics Abstracts lists 169 articles with the keywords "Bayes" or "Bayesian" in the past 5 years, and the number is increasing rapidly (there were 53 in 2000 alone, up from 33 in 1999).

9
• Bayesian methods allow us to do things that would
be difficult or impossible with standard
(frequentist) statistical analysis.
• It is simple to incorporate prior physical or
statistical information
• Interpretation of results is very natural
• Model comparison is easy and straightforward. (The present problem is an example.)
• It is a systematic way of approaching statistical
problems, rather than a collection of ad hoc
techniques. Very complex problems (difficult or
impossible to handle classically) are
straightforwardly analyzed within a Bayesian
framework.

10
The Fundamental Idea
• In a nutshell, Bayesian analysis entails the following uniform and systematic steps:
• Choose prior distributions (priors) that describe prior knowledge about each model under consideration, prior to looking at the data.
• Determine the likelihood function for the data under each model and parameter value.
• Compute and normalize the full posterior distribution, conditioned on the data, using Bayes' theorem.
• Derive summaries of quantities of interest from the full posterior distribution by integrating over the posterior distribution to produce marginal distributions or integrals of interest (e.g., means, variances).
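The four steps above can be sketched numerically for a toy problem (not the galaxy data): inferring a normal mean on a grid. All names and numbers below are illustrative assumptions.

```python
import numpy as np

# Toy sketch of the four steps: infer a normal mean mu from data with
# known sigma = 1, on a parameter grid.
rng = np.random.default_rng(1)
data = rng.normal(2.0, 1.0, size=20)

mu = np.linspace(-5.0, 10.0, 3001)             # parameter grid

# 1. Prior, chosen before looking at the data: N(0, 5^2).
prior = np.exp(-0.5 * (mu / 5.0) ** 2)

# 2. Likelihood of the data at each grid value of mu.
loglike = np.array([-0.5 * np.sum((data - m) ** 2) for m in mu])
like = np.exp(loglike - loglike.max())         # rescale to avoid underflow

# 3. Posterior via Bayes' theorem, normalized over the grid.
post = prior * like
post /= post.sum()

# 4. Summaries from the posterior: mean and variance.
post_mean = np.sum(mu * post)
post_var = np.sum((mu - post_mean) ** 2 * post)
```

On a grid this is exact enough to check against the conjugate-normal answer, which is the point of the exercise: the posterior combines prior precision and data precision.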

11
Priors
• Choose prior distributions (priors) that describe prior knowledge about each model, prior to looking at the data.
• The investigator is required to provide all relevant prior information that he has before proceeding.
• There is always prior information. For example, we cannot count a negative number of photons. Parallaxes are greater than zero. We now know that the most likely value of the Hubble constant is in the ballpark of 60-80 km/s/Mpc (say), with smaller probabilities of its being higher or lower.
• Prior information can be statistical in nature, e.g., we may have statistical knowledge about the spatial or velocity distribution of stars, or the variation in a telescope's plate scale.

12
Prior Information
• In Bayesian analysis, our knowledge about an unknown quantity θ is encoded in a prior distribution on the quantity in question, e.g., p(θ | B), where B is background information.
• Where prior information is vague or uninformative, a vague prior often recovers results similar to a classical analysis.
• EXCEPTION: model selection/model averaging.
• Sensitive dependence of the results on reasonable variations in prior information indicates that no analysis, Bayesian or other, can give reliable results.

13
Likelihood Function
• Determine the likelihood function of the data
under each model and parameter value.
• The likelihood function describes the statistical
properties of the mathematical model of our
problem. It tells us how the statistics of the
observations (e.g., normal or Poisson data) are
related to the parameters and any background
information.
• It is proportional to the sampling distribution for observing the data Y, given the parameters, but we are interested in its functional dependence on the parameters.
• The likelihood is known only up to a constant but arbitrary factor, which cancels out in the analysis.

14
Posterior Distribution
• Compute and normalize the full posterior
distribution, conditioned on the data, using
Bayes theorem.
• The posterior distribution encodes what we know
about the parameters and model after we observe
the data. Thus, Bayesian analysis models
learning.
• Bayes' theorem says that p(θ | Y, B) = p(Y | θ, B) p(θ | B) / p(Y | B).
• Bayes' theorem is a trivial result of probability theory. The denominator is just a normalization factor and can often be dispensed with.

15
Bayes Theorem (Proof)
• By standard probability theory, p(θ, Y | B) = p(θ | Y, B) p(Y | B) = p(Y | θ, B) p(θ | B), from which Bayes' theorem follows immediately.

16
Integration and Marginalization
• Derive summaries of quantities of interest from
the full posterior distribution by integrating
over the posterior distribution to produce
marginal distributions or integrals of interest
(e.g., means, variances).
• Bayesian methodology provides a simple and systematic way of handling nuisance parameters required by the analysis but which are of no interest to us. We just integrate them out (marginalize them) to obtain the marginal distribution of the parameter(s) of interest.
• Likewise, computing summary statistics is simple, e.g., posterior means and variances.
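Marginalization can be illustrated on a grid for a toy two-parameter posterior (invented for illustration; not the galaxy posterior):

```python
import numpy as np

# Toy sketch of marginalization: theta is the parameter of interest,
# nu a nuisance parameter; the joint posterior is an assumed correlated
# bivariate normal, evaluated on a grid.
theta = np.linspace(-4.0, 4.0, 401)
nu = np.linspace(-4.0, 4.0, 401)
T, N = np.meshgrid(theta, nu, indexing="ij")

rho = 0.6
joint = np.exp(-0.5 * (T ** 2 - 2 * rho * T * N + N ** 2) / (1 - rho ** 2))
joint /= joint.sum()                      # normalize over the grid

marg = joint.sum(axis=1)                  # integrate nu out: p(theta | data)

post_mean = np.sum(theta * marg)          # summaries from the marginal
post_var = np.sum((theta - post_mean) ** 2 * marg)
```

Summing over the nuisance axis is the discrete analogue of the integral; the resulting marginal carries all remaining information about θ.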

17
Bayesian Model Selection/Averaging
• Given models Mi, each of which depends on a vector of parameters θM, and given data Y, Bayes' theorem tells us that p(θM, Mi | Y) ∝ p(Y | θM, Mi) p(θM | Mi) p(Mi).
• The probabilities p(θM | Mi) and p(Mi) are the prior probabilities of the parameters given the model and of the model, respectively; p(Y | θM, Mi) is the likelihood function, and p(θM, Mi | Y) is the joint posterior probability distribution of the parameters and models, given the data.
• Note that some parameters may not appear in some models, and there is no requirement that the models be nested.

18
Bayesian Model Selection
• Assume for the moment that we have supplied
priors and performed the necessary integrations
to produce a normalized posterior distribution.
• In practice this is often done by simulation
using Markov Chain Monte Carlo (MCMC).
• Once this has been done, it is simple in principle to compute posterior probabilities of the models: p(Mi | Y) = ∫ p(θM, Mi | Y) dθM.
• This set of numbers summarizes our degree of belief in each of the models, after looking at the data. In model selection, we choose the model with the highest posterior probability.

19
Strategy
• I do not see a simple frequentist approach to this student's problem.
• A reasonable Bayesian approach is fairly straightforward.
• Assume that the underlying true (but unknown) galaxy parameters ξi and ηi (corresponding to the observed xi and yi) are distributed as a bivariate normal distribution.

20
Strategy
• Since we do not know ξi and ηi for each galaxy but instead only the observed values xi and yi, we introduced the ξi and ηi for each galaxy as latent variables. These are parameters to be estimated.
• Here α and β give the true center of the distribution; ρ is the true correlation coefficient, and σξ and ση are the true standard deviations. None of these quantities are known. They are also parameters which must be estimated.

21
Strategy
• Since we are using a bivariate normal, the variance-covariance matrix is
  Σ = [ σξ²       ρ σξ ση ]
      [ ρ σξ ση   ση²     ]
  with inverse
  Σ⁻¹ = (1/(1−ρ²)) [ 1/σξ²         −ρ/(σξ ση) ]
                    [ −ρ/(σξ ση)    1/ση²      ]

22
Strategy
• So the density is
  p(ξ, η) = 1/(2π σξ ση √(1−ρ²)) · exp{ −[u² − 2ρuv + v²] / [2(1−ρ²)] },
  where u = (ξ − α)/σξ and v = (η − β)/ση.

23
Strategy
• This expression may be regarded as our prior on the latent variables ξi and ηi. It depends on the other parameters (α, β, ρ, σξ, ση). We can regard these as hyperparameters, which will in turn require their own priors.
• The joint prior on all the latent variables ξi and ηi can be written as a product p(Ξ, Η | α, β, ρ, σξ, ση) = ∏i p(ξi, ηi | α, β, ρ, σξ, ση), where Ξ and Η are vectors whose components are ξi and ηi respectively.

24
Strategy
• We know the distributions of the data xi and yi conditional on ξi and ηi. Their joint distribution is given by p(xi, yi | ξi, ηi) = N(xi; ξi, si²) · N(yi; ηi, ti²).
• Here si and ti are the standard deviations of the data points, assumed known perfectly for this analysis (these are the basis of the error bars I showed earlier...).

25
The Data
The true value is somewhere near this error
ellipse
26
Strategy
• Now we can write down the likelihood, the joint probability of observing the data, conditional on the parameters (here only the latent parameters appear; the others are implicit through the prior on the latent parameters): p(X, Y | Ξ, Η, S, T) = ∏i p(xi, yi | ξi, ηi), where X, Y, S, and T are vectors whose components are xi, yi, si, and ti, respectively.
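As a sketch, the errors-in-variables log-likelihood above might be coded as follows (names are illustrative assumptions, not the talk's actual code):

```python
import numpy as np

# Sketch of the errors-in-variables likelihood: each observed pair
# (x_i, y_i) is independently normal about its latent (xi_i, eta_i),
# with known standard deviations s_i and t_i.
def log_likelihood(x, y, xi, eta, s, t):
    """log p(X, Y | Xi, Eta), up to the arbitrary constant factor."""
    return -0.5 * np.sum(((x - xi) / s) ** 2 + ((y - eta) / t) ** 2)
```

Dropping the normalizing constant is harmless here, exactly as the earlier slide on the likelihood function notes: constant factors cancel in the analysis.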

27
Priors
• The next step is to assign priors for each of the parameters, including the latent variables. Lacking special information, I chose conventional priors for all but Ξ and Η. Thus, I assign:
• Improper constant (flat) priors on α and β.
• Improper Jeffreys priors 1/σξ and 1/ση on σξ and ση.
• We have two models, one with correlation (M1) and one without (M0). I assign p(M1) = p(M0) = 1/2.
• We will compare M1 and M0 by computing their posterior probabilities. I chose the prior p(ρ | M1) on ρ to be flat and normalized on [−1, 1] and zero elsewhere; I chose a delta-function prior p(ρ | M0) = δ(ρ) on M0.
• Priors on Ξ and Η were displayed earlier.

28
Posterior Distribution
• The posterior distribution is proportional to the prior times the likelihood, as Bayes' theorem instructs us:
  p(α, β, ρ, σξ, ση, Ξ, Η, M | X, Y) ∝ (1/σξ)(1/ση) p(M) p(ρ | M) p(Ξ, Η | α, β, ρ, σξ, ση) p(X, Y | Ξ, Η).

29
Simulation Strategy
• We used simulation to generate a sample from the posterior distribution through a combination of Gibbs and Metropolis-Hastings samplers (Metropolis-within-Gibbs).
• The sample can be used to calculate quantities of interest:
• Compute the posterior mean and variance of the correlation coefficient ρ (calculate the sample mean and variance of ρ).
• Plot the posterior distribution of ρ (plot a histogram of ρ from the sample).
• Determine quantiles of the posterior distribution of ρ (use quantiles of the sample).
• Compute posterior probabilities of each model (calculate the frequency of the model in the sample).
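The summaries listed above can be sketched in code. The "sample" here is synthetic, drawn to resemble the talk's reported results; in the real analysis it would be the Metropolis-within-Gibbs output.

```python
import numpy as np

# Sketch: posterior summaries computed from an MCMC sample.  The sample
# below is simulated for illustration (model indicator + rho draws).
rng = np.random.default_rng(42)
in_m1 = rng.random(10000) < 0.95               # iterations spent in M1
rho = np.where(in_m1, rng.normal(-0.79, 0.10, 10000), 0.0)

rho_m1 = rho[in_m1]                            # draws of rho under M1
post_mean = rho_m1.mean()                      # posterior mean of rho
post_sd = rho_m1.std()                         # posterior standard deviation
q05, q50, q95 = np.quantile(rho_m1, [0.05, 0.50, 0.95])
p_m1 = in_m1.mean()                            # posterior probability of M1
odds = p_m1 / (1.0 - p_m1)                     # posterior odds on correlation
```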

30
Posterior Conditionals
• The conditional distribution on ξi and ηi looks like p(ξi, ηi | ...) ∝ p(ξi, ηi | α, β, ρ, σξ, ση) · exp[−(xi − ξi)²/(2si²)] · exp[−(yi − ηi)²/(2ti²)].
• By combining terms and completing the square we can sample ξi and ηi from a bivariate normal.
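The completed-square update can be sketched as a conjugate Gaussian combination: the bivariate-normal prior term and the two Gaussian data terms multiply to another bivariate normal, with the precision matrices adding. Names below are illustrative assumptions, not the talk's code.

```python
import numpy as np

# Sketch of the Gibbs update for one galaxy's latent (xi_i, eta_i).
def sample_latent(x_i, y_i, s_i, t_i, alpha, beta, rho, s_xi, s_eta, rng):
    # Prior covariance of (xi_i, eta_i) and its precision (inverse).
    cov = np.array([[s_xi ** 2, rho * s_xi * s_eta],
                    [rho * s_xi * s_eta, s_eta ** 2]])
    prior_prec = np.linalg.inv(cov)
    data_prec = np.diag([1.0 / s_i ** 2, 1.0 / t_i ** 2])  # from error bars

    post_prec = prior_prec + data_prec                 # precisions add
    post_cov = np.linalg.inv(post_prec)
    post_mean = post_cov @ (prior_prec @ np.array([alpha, beta])
                            + data_prec @ np.array([x_i, y_i]))
    return rng.multivariate_normal(post_mean, post_cov)
```

The posterior mean is a precision-weighted average of the prior center (α, β) and the observed point (xi, yi), which is exactly what "completing the square" produces.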

31
Posterior Conditionals
• Similarly, the posterior conditional on α, β is proportional to ∏i p(ξi, ηi | α, β, ρ, σξ, ση).
• Again, by completing the square we can sample α and β from a bivariate normal.
• Note that if this were not an EIV problem, we would have x's and y's instead of ξ's and η's.

32
Posterior Conditionals
• The posterior conditional on σξ and ση is proportional to (1/(σξ ση)) ∏i p(ξi, ηi | α, β, ρ, σξ, ση).
• It wasn't obvious to me that we could sample this in a Gibbs step, but maybe there's a way to do it. I just used independent M-H steps with a uniform symmetric proposal distribution, tuning the step size for good mixing, and this worked fine.
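A step of this kind can be sketched as follows. With a symmetric proposal the proposal density cancels from the acceptance ratio; `log_target` stands for the (unnormalized) log posterior conditional, and the half-width `step` is the tuning knob mentioned above.

```python
import numpy as np

# Sketch of the Metropolis step used for sigma_xi and sigma_eta:
# a symmetric uniform proposal centered on the current value.
def mh_step(sigma, log_target, step, rng):
    prop = sigma + rng.uniform(-step, step)        # symmetric uniform proposal
    if prop <= 0.0:                                # sigma must stay positive:
        return sigma, False                        # reject out-of-range moves
    log_a = log_target(prop) - log_target(sigma)   # q cancels (symmetric)
    if np.log(rng.random()) < log_a:
        return prop, True                          # accept the proposed value
    return sigma, False                            # keep the current value
```

In practice one monitors the acceptance fraction and adjusts `step` until the chain mixes well.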

33
Posterior Conditionals
• We do a reversible-jump step on ρ and M simultaneously. Here the idea is to propose a model M′ and at the same time a correlation ρ′ in an M-H step, and either accept or reject according to the M-H acceptance probability.
• Proposing from M0 to M0 or from M1 to M1 is basically simple, just an ordinary M-H step.
• If we are proposing between models, then things are a bit more complicated. This is due to the fact that the dimensionalities of the parameter spaces are different between the two models.

34
Posterior Conditionals
• The posterior conditional of ρ under M1 is proportional to (1/2) ∏i p(ξi, ηi | α, β, ρ, σξ, ση).
• (The leading factor of 1/2 comes from the prior on ρ and is very important.)

35
Posterior Conditionals
• The posterior conditional of ρ under M0 is proportional to δ(ρ).
• The δ-function guarantees that ρ = 0.
• Here, the proportionality factor under M0 must be chosen consistently with the factor of 1/2 under M1. The factors come from the priors: p(ρ | M1) = U(−1, 1), which has an implicit factor 1/2, and p(ρ | M0) = δ(ρ).

36
Posterior Conditionals
• The M-H ratio when jumping from (M, ρ) to (M′, ρ′) is therefore (where the q's are the proposals)
  min{1, [p(M′, ρ′ | ...) q(M, ρ)] / [p(M, ρ | ...) q(M′, ρ′)]}.
• We sampled using a beta proposal q(ρ′ | M1) with parameters tuned by experiment for efficiency and good mixing under the complex model, and with a proposal q(M′) that also was chosen by experiment with an eye to getting an accurate estimate of the posterior odds on M1.
• The idea is that a beta proposal on ρ matches the actual conditional pretty well, so it will be accepted with high probability; the M-H ratio will be close to 1.
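The between-model jump can be sketched on a toy target. Everything below is invented for illustration: the unnormalized target under M1 is an arbitrary function g(ρ), not the galaxy posterior conditional, and the proposal parameters are fixed rather than tuned. The exact answer for this toy is p(M1) = (2/3)/(1 + 2/3) = 0.4.

```python
import numpy as np

# Toy sketch of the reversible-jump step between M0 (rho fixed at 0)
# and M1 (rho free on [-1, 1]).
w0 = 1.0                                   # unnormalized weight of M0
def g(rho):                                # assumed unnormalized target under M1
    return 0.5 * (1.0 - rho ** 2)

def q_rho(rho):                            # Beta(2,2) proposal rescaled to [-1, 1]
    u = (rho + 1.0) / 2.0
    return 6.0 * u * (1.0 - u) / 2.0       # pdf of rho = 2*Beta(2,2) - 1

rng = np.random.default_rng(0)
state_m, state_rho, hits_m1 = 0, 0.0, 0
n = 100_000
for _ in range(n):
    # Propose a model (probability 1/2 each) and, under M1, a rho draw.
    prop_m = int(rng.integers(2))
    prop_rho = 2.0 * rng.beta(2.0, 2.0) - 1.0 if prop_m == 1 else 0.0
    q_new = 0.5 * (q_rho(prop_rho) if prop_m == 1 else 1.0)
    q_old = 0.5 * (q_rho(state_rho) if state_m == 1 else 1.0)
    t_new = g(prop_rho) if prop_m == 1 else w0
    t_old = g(state_rho) if state_m == 1 else w0
    # M-H ratio: target(new) * q(old) / (target(old) * q(new)).
    if rng.random() < min(1.0, (t_new * q_old) / (t_old * q_new)):
        state_m, state_rho = prop_m, prop_rho
    hits_m1 += state_m

p_m1 = hits_m1 / n     # frequency of M1 in the chain estimates p(M1 | Y)
```

Because the jump to M0 lands on a single point, the dimension-matching bookkeeping reduces to carrying the proposal density of ρ′ in the ratio, which is why the 1/2 factors on the previous slides matter.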

37
Sampling Strategy for Our Problem
• To summarize:
• We sampled α, β, ξj, ηj in Gibbs steps (α and β appear in the posterior distribution as a bivariate normal distribution, as do the ξj, ηj).
• We sampled σξ, ση with M-H steps using symmetric uniform proposals centered on the current point, adjusting the maximum step for good mixing.
• We sampled ρ and M in a simultaneous reversible-jump M-H step, using a beta proposal on ρ with parameters tuned by experiment for efficiency and good mixing under the complex model, and with a proposal on M that also was chosen by experiment with an eye to getting an accurate estimate of the posterior odds on M.

38
Results
• For the data set including the circled point, we obtained:
• Odds on model with correlation: 207 (assumes prior odds equal to 1)
• Median ρ: −0.81
• Mean ρ: −0.79 ± 0.10

39
Posterior distribution of ρ (including all points)
40
Results
• For the data set including the circled point, we obtained:
• Odds on model with correlation: 207 (assumes prior odds equal to 1)
• Median ρ: −0.81
• Mean ρ: −0.79 ± 0.10
• For the data set without the circled point, we obtained:
• Odds on model with correlation: 9.9
• Median ρ: −0.70
• Mean ρ: −0.68 ± 0.16

41
Posterior distribution of ρ (excluding 1 point)
42
• This problem combines several interesting features:
• Latent variables, introduced because this is an errors-in-variables problem.
• Model selection, implemented through reversible-jump MCMC simulation.
• A combination of Gibbs and Metropolis-Hastings steps to implement the sampler (Metropolis-within-Gibbs).
• It is a good example of how systematic application of basic Bayesian analysis can yield a satisfying solution of a problem that, when looked at frequentistically, seems almost intractable.

43
• One final comment: If you look at the tail area in either of the two cases investigated, you will see that it is much less than the 1/200 or 1/10 odds ratio that we calculated for the odds of M0 against M1. This is an example of how tail areas in general are not reliable statistics for deciding whether a hypothesis should be selected.

44
Practical Computation
• Until recently, a major practical difficulty has been computing the required integrals, limiting the method to situations where results can be obtained exactly or with analytic approximations.
• In the past 15 years, considerable progress has
been made in solving the computational
difficulties, particularly with the development
of Markov Chain Monte Carlo (MCMC) methods for
simulating a random sample from the full
posterior distribution, from which marginal
distributions and summary means and variances (as
well as other averages) can be calculated
conveniently.
• These methods have their origin in physics. The
Metropolis-Hastings and Gibbs sampler methods are
two popular schemes that originated in early
attempts to solve large physics problems by Monte
Carlo methods.

45
Practical Computation Markov Chains
• Start from an arbitrary point in the space of
models and parameters. Following a specific set
of rules, which depend only on the unnormalized
posterior distribution, generate a random walk in
this space, such that the distribution of the
generated points converges to a sample drawn from
the underlying posterior distribution.
• The random walk is a Markov chain: that is, each step depends only upon the immediately previous step, and not on any of the earlier steps.
• Many rules for generating the transition from one
state to the next are possible. All converge to
the same distribution. One attempts to choose a
rule that will give efficient sampling with a
reasonable expenditure of effort and time.

46
Gibbs Sampler
• The Gibbs Sampler is a scheme for generating a sample from the full posterior distribution by sampling in succession from the conditional distributions, when they can be sampled from conveniently. Thus, let the parameter vector θ be decomposed into a set of subvectors θ1, θ2, …, θn. Suppose it is possible to sample from the conditional distributions
• p(θ1 | θ2, θ3, …, θn),
• p(θ2 | θ1, θ3, …, θn),
• …
• p(θn | θ1, θ2, …, θn−1).

47
Gibbs Sampler (2)
• Starting from an arbitrary initial vector θ(0) = (θ1(0), θ2(0), …, θn(0)), generate in succession vectors θ(1), θ(2), … by sampling in succession from the conditional distributions
• p(θ1(k) | θ2(k−1), θ3(k−1), …, θn(k−1)),
• p(θ2(k) | θ1(k), θ3(k−1), …, θn(k−1)),
• …
• p(θn(k) | θ1(k), θ2(k), …, θn−1(k)), with θ(k) = (θ1(k), θ2(k), …, θn(k)).
• In the limit, the sample thus generated will converge (in distribution) to a sample drawn from the full posterior distribution.
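The scheme above can be sketched for a toy target: a bivariate normal with correlation r, whose full conditionals are the univariate normals θ1 | θ2 ~ N(r·θ2, 1 − r²) and symmetrically. The target and all numbers are illustrative assumptions.

```python
import numpy as np

# Toy two-block Gibbs sampler for a correlated bivariate normal.
r = 0.8
rng = np.random.default_rng(3)
t1 = t2 = 0.0
draws = np.empty((20000, 2))
for k in range(20000):
    t1 = rng.normal(r * t2, np.sqrt(1.0 - r ** 2))  # sample p(theta1 | theta2)
    t2 = rng.normal(r * t1, np.sqrt(1.0 - r ** 2))  # sample p(theta2 | theta1)
    draws[k] = t1, t2

# After a burn-in, the draws behave as a sample from the joint distribution.
sample_corr = np.corrcoef(draws[1000:].T)[0, 1]
```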

48
Metropolis-Hastings Step
• Gibbs sampling works when the conditional distributions are standard distributions from which samples can easily be drawn. This is not usually the case, and we would have to replace Gibbs steps with another scheme.
• A Metropolis-Hastings step involves producing a sample from a suitable proposal distribution q(θ′ | θ), where θ is the value at the previous step. Then a calculation is done to see whether to accept the new θ′ as the new step, or to keep the old θ as the new step. If we retain the old value, the sampler does not move the parameter θ at this step. If we accept the new value, it will move.
• We choose q so that random samples can easily be generated from it, and with other characteristics. This is part of the art of designing good (efficient) sampling schemes.

49
Metropolis-Hastings Step (2)
• Specifically, if p(θ) is the target distribution from which we wish to sample, first generate θ′ from q(θ′ | θ).
• Then calculate α = min{1, [p(θ′) q(θ | θ′)] / [p(θ) q(θ′ | θ)]}.
• Then generate a random number r uniform on [0, 1].
• Accept the proposed θ′ if r ≤ α; otherwise keep θ.
• Note that if q(θ′ | θ) = p(θ′) for all θ, θ′, then we will always accept the new value. In this case the Metropolis-Hastings step becomes an ordinary Gibbs step.
• Generally, although the Metropolis-Hastings steps are guaranteed to produce a Markov chain with the right limiting distribution, one gets better performance the closer we can approximate p(θ′) by q(θ′ | θ).
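The recipe above translates directly into code. This is a generic sketch: `p` is any unnormalized target density, and `q_pdf`/`q_sample` are an assumed (possibly asymmetric) proposal; constants in `p` and `q_pdf` cancel in the ratio.

```python
import numpy as np

# Sketch of a single Metropolis-Hastings update.  q_pdf(a, b) is the
# proposal density of a given b; q_sample draws from q(. | theta).
def mh_update(theta, p, q_sample, q_pdf, rng):
    prop = q_sample(theta, rng)                      # theta' ~ q(. | theta)
    a = min(1.0, (p(prop) * q_pdf(theta, prop))
                 / (p(theta) * q_pdf(prop, theta)))  # Hastings ratio
    return prop if rng.random() < a else theta       # accept, or keep theta
```

For example, a random-walk normal proposal targeting a standard normal: with a symmetric `q_pdf` the Hastings correction is 1 and this reduces to the plain Metropolis rule.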