1
Correlation with Errors-In-Variables and an
Application to Galaxies
  • William H. Jefferys
  • University of Texas at Austin, USA

2
A Problem in Correlation
  • A graduate student at Maryland asked me to assist her on a
    problem involving galaxy data. She wanted to know if the data
    showed clear evidence of correlation, and if so, what the
    correlation was and how strong the evidence for it was.

3
The Data
4
Comments on Data
  • At first glance the correlation seems obvious.
  • But there is an unusual feature of her problem: she knew that
    the data were imperfect, and each data point had an error bar
    in both x and y. Standard treatments of correlation do not
    address this situation.

5
The Data
6
Comments on Data
  • The presence of the error bars contributes to uncertainty as
    to how big the correlation is and how well it is determined.
  • The data are sparse, so we also need to be concerned about
    small-number statistics.
  • The student also was concerned about how the lowest point
    affected any correlation. What would happen, she wondered, if
    it were not included in the sample? She was afraid that the
    ellipticity of the particular galaxy was so low that it might
    not have been measured accurately, and in fact that the galaxy
    might belong to a different class of galaxies.

7
The Data
8
Bayesian Analysis and Astronomy
  • She had been trying to use our program GaussFit
    to analyze the data, but it is not designed for
    tasks such as measuring correlation.
  • I, of course, suggested to her that a Bayesian approach might
    be appropriate.
  • Bayesian methods offer many advantages for astronomical
    research and have attracted much recent interest.
  • Astronomy and Astrophysics Abstracts lists 169 articles with
    the keywords "Bayes" or "Bayesian" in the past 5 years, and
    the number is increasing rapidly (there were 53 in 2000 alone,
    up from 33 in 1999).

9
Advantages of Bayesian Methods
  • Bayesian methods allow us to do things that would
    be difficult or impossible with standard
    (frequentist) statistical analysis.
  • It is simple to incorporate prior physical or
    statistical information
  • Interpretation of results is very natural
  • Model comparison is easy and straightforward.
    (This is such a problem)
  • It is a systematic way of approaching statistical
    problems, rather than a collection of ad hoc
    techniques. Very complex problems (difficult or
    impossible to handle classically) are
    straightforwardly analyzed within a Bayesian
    framework.

10
The Fundamental Idea
  • In a nutshell, Bayesian analysis entails the following uniform
    and systematic steps:
  • Choose prior distributions (priors) that reflect your
    knowledge about each parameter and model under consideration,
    prior to looking at the data.
  • Determine the likelihood function for the data under each
    model and parameter value.
  • Compute and normalize the full posterior distribution,
    conditioned on the data, using Bayes' theorem.
  • Derive summaries of quantities of interest from the full
    posterior distribution by integrating over the posterior
    distribution to produce marginal distributions or integrals of
    interest (e.g., means, variances).

11
Priors
  • Choose prior distributions (priors) that
    reflect your knowledge about each parameter and
    model prior to looking at the data
  • The investigator is required to provide all
    relevant prior information that he has before
    proceeding
  • There is always prior information. For example, we cannot
    count a negative number of photons. Parallaxes are greater
    than zero. We now know that the most likely value of the
    Hubble constant is in the ballpark of 60-80 km/sec/Mpc (say),
    with smaller probabilities of its being higher or lower.
  • Prior information can be statistical in nature, e.g., we may
    have statistical knowledge about the spatial or velocity
    distribution of stars, or the variation in a telescope's plate
    scale.

12
Prior Information
  • In Bayesian analysis, our knowledge about an unknown quantity
    θ is encoded in a prior distribution on the quantity in
    question, e.g., p(θ | B), where B is background information.
  • Where prior information is vague or uninformative, a vague
    prior often recovers results similar to a classical analysis.
  • EXCEPTION: model selection/model averaging.
  • Sensitive dependence of the results on reasonable variations
    in prior information indicates that no analysis, Bayesian or
    other, can give reliable results.

13
Likelihood Function
  • Determine the likelihood function of the data
    under each model and parameter value.
  • The likelihood function describes the statistical
    properties of the mathematical model of our
    problem. It tells us how the statistics of the
    observations (e.g., normal or Poisson data) are
    related to the parameters and any background
    information.
  • It is proportional to the sampling distribution
    for observing the data Y, given the parameters,
    but we are interested in its functional
    dependence on the parameters
  • The likelihood is known up to a constant but
    arbitrary factor which cancels out in the
    analysis.

14
Posterior Distribution
  • Compute and normalize the full posterior distribution,
    conditioned on the data, using Bayes' theorem.
  • The posterior distribution encodes what we know about the
    parameters and model after we observe the data. Thus, Bayesian
    analysis models learning.
  • Bayes' theorem says that
    p(θ | Y) = p(Y | θ) p(θ) / p(Y)
  • Bayes' theorem is a trivial result of probability theory. The
    denominator is just a normalization factor and can often be
    dispensed with.

15
Bayes' Theorem (Proof)
  • By standard probability theory,
    p(θ, Y) = p(θ | Y) p(Y) = p(Y | θ) p(θ),
    from which Bayes' theorem follows immediately on dividing
    through by p(Y).

16
Integration and Marginalization
  • Derive summaries of quantities of interest from
    the full posterior distribution by integrating
    over the posterior distribution to produce
    marginal distributions or integrals of interest
    (e.g., means, variances).
  • Bayesian methodology provides a simple and systematic way of
    handling nuisance parameters required by the analysis but
    which are of no interest to us. We just integrate them out
    (marginalize them) to obtain the marginal distribution of the
    parameter(s) of interest.
  • Likewise, computing summary statistics is simple, e.g.,
    posterior means and variances.
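For instance, writing the full parameter vector as (θ1, θ2) with
θ2 a nuisance parameter, the marginal posterior of θ1 and its
posterior mean are the integrals

    p(\theta_1 \mid Y) = \int p(\theta_1, \theta_2 \mid Y)\, d\theta_2 ,
    \qquad
    E[\theta_1 \mid Y] = \int \theta_1\, p(\theta_1 \mid Y)\, d\theta_1 .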

17
Bayesian Model Selection/Averaging
  • Given models Mi, each of which depends on a vector of
    parameters θM, and given data Y, Bayes' theorem tells us that
    p(θM, Mi | Y) ∝ p(Y | θM, Mi) p(θM | Mi) p(Mi)
  • The probabilities p(θM | Mi) and p(Mi) are the prior
    probabilities of the parameters given the model and of the
    model, respectively; p(Y | θM, Mi) is the likelihood function,
    and p(θM, Mi | Y) is the joint posterior probability
    distribution of the parameters and models, given the data.
  • Note that some parameters may not appear in some models, and
    there is no requirement that the models be nested.

18
Bayesian Model Selection
  • Assume for the moment that we have supplied priors and
    performed the necessary integrations to produce a normalized
    posterior distribution.
  • In practice this is often done by simulation using Markov
    Chain Monte Carlo (MCMC).
  • Once this has been done, it is simple in principle to compute
    the posterior probabilities of the models:
    p(Mi | Y) = ∫ p(θM, Mi | Y) dθM
  • This set of numbers summarizes our degree of belief in each of
    the models, after looking at the data. In model selection, we
    choose the model with the highest posterior probability.

19
Strategy
  • I do not see a simple frequentist approach to this student's
    problem.
  • A reasonable Bayesian approach is fairly straightforward.
  • Assume that the underlying true (but unknown) galaxy
    parameters ξi and ηi (corresponding to the observed xi and yi)
    are distributed as a bivariate normal distribution:
    (ξi, ηi) ~ N((α, β), Σ)

20
Strategy
  • Since we do not know ξi and ηi for each galaxy but instead
    only the observed values xi and yi, we introduced the ξi and
    ηi for each galaxy as latent variables. These are parameters
    to be estimated.
  • Here α and β give the true center of the distribution, ρ is
    the true correlation coefficient, and σξ and ση are the true
    standard deviations. None of these quantities are known. They
    are also parameters which must be estimated.

21
Strategy
  • Since we are using a bivariate normal, the variance-covariance
    matrix is
    \Sigma = \begin{pmatrix} \sigma_\xi^2 & \rho\,\sigma_\xi\sigma_\eta \\ \rho\,\sigma_\xi\sigma_\eta & \sigma_\eta^2 \end{pmatrix},
    with inverse
    \Sigma^{-1} = \frac{1}{1-\rho^2}\begin{pmatrix} 1/\sigma_\xi^2 & -\rho/(\sigma_\xi\sigma_\eta) \\ -\rho/(\sigma_\xi\sigma_\eta) & 1/\sigma_\eta^2 \end{pmatrix}.

22
Strategy
  • So the density is
    p(\xi, \eta \mid \alpha, \beta, \rho, \sigma_\xi, \sigma_\eta) = \frac{1}{2\pi\,\sigma_\xi\sigma_\eta\sqrt{1-\rho^2}} \exp\!\left(-\tfrac{1}{2}\,\varphi^{T}\Sigma^{-1}\varphi\right),
    where φ = (ξ - α, η - β).

23
Strategy
  • This expression may be regarded as our prior on the latent
    variables ξi and ηi. It depends on the other parameters
    (α, β, ρ, σξ, ση). We can regard these as hyperparameters,
    which will in turn require their own priors.
  • The joint prior on all the latent variables ξi and ηi can be
    written as a product,
    p(Ξ, H | α, β, ρ, σξ, ση) = ∏i p(ξi, ηi | α, β, ρ, σξ, ση),
    where Ξ and H are vectors whose components are ξi and ηi
    respectively.

24
Strategy
  • We know the distributions of the data xi and yi conditional
    on ξi and ηi; their joint distribution is sketched below.
  • Here si and ti are the standard deviations of the data points,
    assumed known perfectly for this analysis (these are the basis
    of the error bars I showed earlier...).
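As a minimal sketch, assuming the x and y measurement errors are
independent Gaussians with the known standard deviations si and ti
(the independence is an assumption on my part), this joint
distribution takes the form

    p(x_i, y_i \mid \xi_i, \eta_i) = N(x_i \mid \xi_i, s_i^2)\; N(y_i \mid \eta_i, t_i^2) .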

25
The Data
The true value is somewhere near this error
ellipse
26
Strategy
  • Now we can write down the likelihood, the joint probability of
    observing the data, conditional on the parameters; a sketch is
    given below. Here only the latent parameters appear (the
    others enter implicitly through the prior on the latent
    parameters), and X, Y, S, and T are vectors whose components
    are xi, yi, si, and ti, respectively.
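Under the same independent-Gaussian-error assumption sketched
above, the likelihood would be the product over galaxies

    p(X, Y \mid \Xi, H, S, T) = \prod_i N(x_i \mid \xi_i, s_i^2)\; N(y_i \mid \eta_i, t_i^2) .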

27
Priors
  • The next step is to assign priors for each of the
    parameters, including the latent variables.
    Lacking special information, I chose conventional
    priors for all but X and H. Thus, I assign
  • Improper constant flat priors on a and b.
  • Improper Jeffreys priors 1/sx and 1/sh on sx and
    sh.
  • We have two models, one with correlation (M1) and
    one without (M0). I assign p(M1) p(M0)1/2
  • We will compare M1 and M0 by computing their
    posterior probabilities. I chose the prior
    p(rM1) on r to be flat and normalized on 1,1
    and zero elsewhere I chose a delta-function
    prior p(rM0) d(r0) on M0
  • Priors on X and H were displayed earlier

28
Posterior Distribution
  • The posterior distribution is proportional to the
    prior times the likelihood, as Bayes instructs us
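Schematically, combining that likelihood with the priors listed
above (still under the independent-error assumption), the joint
posterior would be, up to normalization,

    p(\Xi, H, \alpha, \beta, \sigma_\xi, \sigma_\eta, \rho, M \mid X, Y)
    \;\propto\;
    \Bigl[\prod_i N(x_i \mid \xi_i, s_i^2)\, N(y_i \mid \eta_i, t_i^2)\Bigr]
    \Bigl[\prod_i p(\xi_i, \eta_i \mid \alpha, \beta, \rho, \sigma_\xi, \sigma_\eta)\Bigr]
    \frac{1}{\sigma_\xi\,\sigma_\eta}\; p(\rho \mid M)\; p(M) .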

29
Simulation Strategy
  • We used simulation to generate a sample from the posterior
    distribution through a combination of Gibbs and
    Metropolis-Hastings samplers (Metropolis-within-Gibbs).
  • The sample can be used to calculate quantities of interest
    (see the sketch below):
  • Compute the posterior mean and variance of the correlation
    coefficient ρ (calculate the sample mean and variance of ρ).
  • Plot the posterior distribution of ρ (plot a histogram of ρ
    from the sample).
  • Determine quantiles of the posterior distribution of ρ (use
    quantiles of the sample).
  • Compute posterior probabilities of each model (calculate the
    frequency of the model in the sample).
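A minimal sketch of these summaries in Python, assuming rho is an
array of posterior draws of ρ and model is the matching array of
sampled model labels (0 or 1); the variable and function names are
mine, not from the talk:

import numpy as np
import matplotlib.pyplot as plt

# rho:   1-D array of sampled correlation values, one per MCMC iteration
# model: 1-D array of sampled model indicators (0 = no correlation, 1 = correlation)

def summarize(rho, model):
    post_mean = rho.mean()                             # posterior mean of rho
    post_sd = rho.std(ddof=1)                          # posterior standard deviation
    quantiles = np.quantile(rho, [0.025, 0.5, 0.975])  # median and a 95% interval
    p_m1 = (model == 1).mean()                         # posterior probability of M1
    odds_m1 = p_m1 / (1.0 - p_m1)                      # posterior odds on the correlation model
    plt.hist(rho[model == 1], bins=40, density=True)   # posterior distribution of rho under M1
    plt.xlabel("rho")
    plt.show()
    return post_mean, post_sd, quantiles, odds_m1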

30
Posterior Conditionals
  • The conditional distribution on ξi and ηi is proportional to
    the product of the data terms for galaxy i and the bivariate
    normal prior term (a sketch is given below).
  • By combining terms and completing the square we can sample ξi
    and ηi from a bivariate normal.
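Under the assumptions above, this conditional would be
proportional to

    p(\xi_i, \eta_i \mid \text{rest}) \propto N(x_i \mid \xi_i, s_i^2)\, N(y_i \mid \eta_i, t_i^2)\; p(\xi_i, \eta_i \mid \alpha, \beta, \rho, \sigma_\xi, \sigma_\eta) ,

a product of Gaussian factors in (ξi, ηi), which is why completing
the square yields a bivariate normal.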

31
Posterior Conditionals
  • Similarly, the posterior conditional on α, β is again a
    product of bivariate normal terms, viewed as a function of
    (α, β).
  • Again, by completing the square we can sample α and β from a
    bivariate normal.
  • Note that if this were not an EIV problem, we would have x's
    and y's here instead of ξ's and η's.

32
Posterior Conditionals
  • The posterior conditional on σξ and ση is proportional to the
    Jeffreys priors 1/σξ and 1/ση times the product of the
    bivariate normal prior terms for the latent variables.
  • It wasn't obvious to me that we could sample this in a Gibbs
    step, but maybe there's a way to do it. I just used
    independent M-H steps with a uniform symmetric proposal
    distribution, tuning the step size for good mixing, and this
    worked fine (see the sketch below).
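A minimal sketch of such a Metropolis step in Python, assuming a
function log_cond(sig_xi, sig_eta) that returns the log of the
conditional density just described (the function name and default
step size are mine; the step would be tuned by experiment):

import numpy as np

def mh_sigma_step(sig_xi, sig_eta, log_cond, step=0.1, rng=None):
    """One Metropolis update of (sigma_xi, sigma_eta) with a symmetric uniform proposal."""
    rng = rng or np.random.default_rng()
    # propose uniformly within +/- step of the current values
    prop_xi = sig_xi + rng.uniform(-step, step)
    prop_eta = sig_eta + rng.uniform(-step, step)
    if prop_xi <= 0 or prop_eta <= 0:      # standard deviations must stay positive
        return sig_xi, sig_eta             # reject proposals outside the support
    # symmetric proposal: acceptance ratio is just the ratio of target densities
    log_alpha = log_cond(prop_xi, prop_eta) - log_cond(sig_xi, sig_eta)
    if np.log(rng.random()) < log_alpha:
        return prop_xi, prop_eta           # accept
    return sig_xi, sig_eta                 # reject and keep the current values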

33
Posterior Conditionals
  • We do a reversible-jump step on ρ and M simultaneously. Here
    the idea is to propose a model and at the same time a
    correlation ρ in an M-H step, and either accept or reject
    according to the M-H acceptance probability.
  • Proposing from M0 to M0 or from M1 to M1 is basically simple,
    just an ordinary M-H step.
  • If we are proposing between models, then things are a bit more
    complicated. This is due to the fact that the dimensionalities
    of the parameter spaces are different between the two models.

34
Posterior Conditionals
  • The posterior conditional of ρ under M1 is the flat prior 1/2
    on [-1, 1] times the product of the bivariate normal terms for
    the latent variables, viewed as a function of ρ.
  • (The leading factor of 1/2 comes from the prior on ρ and is
    very important.)

35
Posterior Conditionals
  • The posterior conditional of ρ under M0 is proportional to
    δ(ρ) times the same bivariate normal terms.
  • The δ-function guarantees that ρ = 0.
  • Here, the proportionality factor is chosen to be consistent
    with the factor of 1/2 kept under M1. The factors come from
    the priors p(ρ | M1) = U(-1, 1), which has an implicit factor
    of 1/2, and p(ρ | M0) = δ(ρ).

36
Posterior Conditionals
  • The M-H ratio when jumping from (M, ρ) to (M', ρ') is
    therefore the usual ratio of target times reverse-proposal
    density to target times forward-proposal density (where the
    q's are the proposals); a sketch is given below.
  • We sampled using a beta proposal q(ρ | M1) with parameters
    tuned by experiment for efficiency and good mixing under the
    complex model, and with a proposal over models that also was
    chosen by experiment with an eye to getting an accurate
    estimate of the posterior odds on M1.
  • The idea is that a beta proposal on ρ matches the actual
    conditional pretty well, so it will be accepted with high
    probability; the M-H ratio will be close to 1.
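A minimal sketch of this simultaneous (M, ρ) update in Python,
under the priors listed earlier (p(M0) = p(M1) = 1/2, flat prior
1/2 on ρ under M1, δ(ρ) under M0). The function loglik(rho), the
beta parameters, and the move probability are placeholders of
mine; in the actual analysis they were tuned by experiment:

import numpy as np
from scipy import stats

def rj_step(M, rho, loglik, a=2.0, b=2.0, q_move=0.5, rng=None):
    """One reversible-jump M-H update of (M, rho).

    M       -- current model: 0 (rho fixed at 0) or 1 (rho free on [-1, 1])
    rho     -- current correlation (0 whenever M == 0)
    loglik  -- log of prod_i N2((xi_i, eta_i); (alpha, beta), Sigma(rho)),
               the only piece of the posterior that depends on rho
    a, b    -- beta-proposal parameters (tuned by experiment)
    q_move  -- probability of proposing a switch to the other model
    """
    rng = rng or np.random.default_rng()
    log_half = np.log(0.5)                       # flat prior density 1/2 on [-1, 1]
    M_new = 1 - M if rng.random() < q_move else M

    # forward proposal for rho and its log density
    if M_new == 1:
        u = rng.beta(a, b)                       # draw on (0, 1), then map to (-1, 1)
        rho_new = 2.0 * u - 1.0
        log_q_fwd = stats.beta.logpdf(u, a, b) - np.log(2.0)
    else:
        rho_new, log_q_fwd = 0.0, 0.0            # point mass at rho = 0

    # reverse-proposal density for the current rho
    if M == 1:
        log_q_rev = stats.beta.logpdf((rho + 1.0) / 2.0, a, b) - np.log(2.0)
    else:
        log_q_rev = 0.0

    # prior factors: 1/2 under M1, delta mass under M0; p(M0) = p(M1) cancels,
    # as does the probability of proposing the model move (same in both directions)
    log_prior_new = log_half if M_new == 1 else 0.0
    log_prior_old = log_half if M == 1 else 0.0

    log_alpha = (loglik(rho_new) + log_prior_new + log_q_rev
                 - loglik(rho) - log_prior_old - log_q_fwd)
    if np.log(rng.random()) < log_alpha:
        return M_new, rho_new                    # accept the jump
    return M, rho                                # reject; keep the current state

After many sweeps, the posterior odds on M1 are then estimated
from the relative frequency of M = 1 in the chain, as described on
the Simulation Strategy slide.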

37
Sampling Strategy for Our Problem
  • To summarize:
  • We sampled the α, β, ξj, ηj in Gibbs steps (α and β appear in
    the posterior distribution as a bivariate normal distribution,
    as do the ξj, ηj).
  • We sampled σξ, ση with M-H steps using symmetric uniform
    proposals centered on the current point, adjusting the maximum
    step for good mixing.
  • We sampled ρ and M in a simultaneous reversible-jump M-H step,
    using a beta proposal on ρ with parameters tuned by experiment
    for efficiency and good mixing under the complex model, and
    with a proposal on M that also was chosen by experiment with
    an eye to getting an accurate estimate of the posterior odds
    on M.

38
Results
  • For the data set including the circled point, we obtained:
  • Odds on model with correlation: 207 (assumes prior odds equal
    to 1)
  • Median ρ: -0.81
  • Mean ρ: -0.79 ± 0.10

39
Posterior distribution of ρ (including all points)
40
Results
  • For the data set including the circled point, we obtained:
  • Odds on model with correlation: 207 (assumes prior odds equal
    to 1)
  • Median ρ: -0.81
  • Mean ρ: -0.79 ± 0.10
  • For the data set without the circled point, we obtained:
  • Odds on model with correlation: 9.9
  • Median ρ: -0.70
  • Mean ρ: -0.68 ± 0.16

41
Posterior distribution of ρ (excluding 1 point)
42
Final Comments
  • This problem combines several interesting features:
  • Latent variables, introduced because this is an
    errors-in-variables problem
  • Model selection, implemented through reversible-jump MCMC
    simulation
  • A combination of Gibbs and Metropolis-Hastings steps to
    implement the sampler (Metropolis-within-Gibbs)
  • It is a good example of how systematic application of basic
    Bayesian analysis can yield a satisfying solution to a problem
    that, when looked at frequentistically, seems almost
    intractable.

43
Final Comments
  • One final comment: if you look at the tail area in either of
    the two cases investigated, you will see that it is much less
    than the 1/200 or 1/10 odds ratio that we calculated for the
    odds of M0 against M1. This is an example of how tail areas in
    general are not reliable statistics for deciding whether a
    hypothesis should be selected.

44
Practical Computation
  • Until recently, a major practical difficulty has
    been computing the required integrals, limiting
    the method to situations where results can be
    obtained exactly or with analytic approximations
  • In the past 15 years, considerable progress has
    been made in solving the computational
    difficulties, particularly with the development
    of Markov Chain Monte Carlo (MCMC) methods for
    simulating a random sample from the full
    posterior distribution, from which marginal
    distributions and summary means and variances (as
    well as other averages) can be calculated
    conveniently.
  • These methods have their origin in physics. The
    Metropolis-Hastings and Gibbs sampler methods are
    two popular schemes that originated in early
    attempts to solve large physics problems by Monte
    Carlo methods.

45
Practical Computation Markov Chains
  • Start from an arbitrary point in the space of
    models and parameters. Following a specific set
    of rules, which depend only on the unnormalized
    posterior distribution, generate a random walk in
    this space, such that the distribution of the
    generated points converges to a sample drawn from
    the underlying posterior distribution.
  • The random walk is a Markov chain: that is, each step depends
    only upon the immediately previous step, and not on any of the
    earlier steps.
  • Many rules for generating the transition from one
    state to the next are possible. All converge to
    the same distribution. One attempts to choose a
    rule that will give efficient sampling with a
    reasonable expenditure of effort and time.

46
Gibbs Sampler
  • The Gibbs Sampler is a scheme for generating a sample from the
    full posterior distribution by sampling in succession from the
    conditional distributions, when they can be sampled from
    conveniently. Thus, let the parameter vector θ be decomposed
    into a set of subvectors θ1, θ2, …, θn. Suppose it is possible
    to sample from the conditional distributions
  • p(θ1 | θ2, θ3, …, θn),
  • p(θ2 | θ1, θ3, …, θn),
  • …,
  • p(θn | θ1, θ2, …, θn-1).

47
Gibbs Sampler (2)
  • Starting from an arbitrary initial vector
    θ(0) = (θ1(0), θ2(0), …, θn(0)), generate in succession
    vectors θ(1), θ(2), …, by sampling in succession from the
    conditional distributions
  • p(θ1(k) | θ2(k-1), θ3(k-1), …, θn(k-1)),
  • p(θ2(k) | θ1(k), θ3(k-1), …, θn(k-1)),
  • …,
  • p(θn(k) | θ1(k), θ2(k), …, θn-1(k)), with
    θ(k) = (θ1(k), θ2(k), …, θn(k)).
  • In the limit, the sample thus generated will converge (in
    distribution) to a sample drawn from the full posterior
    distribution.

48
Metropolis-Hastings Step
  • Gibbs sampling works when the conditional distributions are
    standard distributions from which samples can easily be drawn.
    This is not usually the case, and we would have to replace
    Gibbs steps with another scheme.
  • A Metropolis-Hastings step involves producing a sample θ' from
    a suitable proposal distribution q(θ' | θ), where θ is the
    value at the previous step. Then a calculation is done to see
    whether to accept the new θ' as the new step, or to keep the
    old θ as the new step. If we retain the old value, the sampler
    does not move the parameter θ at this step. If we accept the
    new value, it will move.
  • We choose q so that random samples can easily be generated
    from it, and with other desirable characteristics. This is
    part of the art of designing good (efficient) sampling
    schemes.

49
Metropolis-Hastings Step (2)
  • Specifically, if p(θ) is the target distribution from which we
    wish to sample, first generate θ' from q(θ' | θ).
  • Then calculate
  • α = min{1, [p(θ') q(θ | θ')] / [p(θ) q(θ' | θ)]}
  • Then generate a random number r uniform on [0, 1].
  • Accept the proposed θ' if r ≤ α, otherwise keep θ.
  • Note that if q(θ' | θ) = p(θ') for all θ, θ', then we will
    always accept the new value. In this case the
    Metropolis-Hastings step becomes an ordinary Gibbs step.
  • Generally, although the Metropolis-Hastings steps are
    guaranteed to produce a Markov chain with the right limiting
    distribution, one gets better performance the closer we can
    approximate p(θ') by q(θ' | θ).
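The same recipe in Python, as a minimal sketch; log_p is the log
target and propose/log_q are the proposal sampler and its log
density (the names are mine):

import numpy as np

def mh_step(theta, log_p, propose, log_q, rng=None):
    """One Metropolis-Hastings update of theta.

    log_p(theta)                -- log of the target density p(theta)
    propose(theta, rng)         -- draws theta' from the proposal q(theta' | theta)
    log_q(theta_to, theta_from) -- log proposal density q(theta_to | theta_from)
    """
    rng = rng or np.random.default_rng()
    theta_new = propose(theta, rng)
    # log of alpha = min{1, [p(theta') q(theta | theta')] / [p(theta) q(theta' | theta)]}
    log_alpha = (log_p(theta_new) + log_q(theta, theta_new)
                 - log_p(theta) - log_q(theta_new, theta))
    if np.log(rng.random()) <= log_alpha:   # accept with probability min(1, alpha)
        return theta_new
    return theta                            # keep the old value

With a symmetric proposal the two q terms cancel and the step
reduces to the plain Metropolis rule sketched earlier for σξ and
ση.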