1
Quantifying Uncertainty in Inverse Problems
  • Mathematical Geophysics and Uncertainty in Earth
    Models
  • Colorado School of Mines
  • 14-25 June 2004
  • P.B. Stark
  • Department of Statistics
  • University of California
  • Berkeley, CA 94720-3860
  • www.stat.berkeley.edu/stark

2
Inverse Problems are Statistics
  • After my 1st talk at Berkeley, Lucien Le Cam told me "I don't understand all this fuss about inverse problems. Forward problems are Probability and inverse problems are Statistics."
  • It took me 10 years to understand just how right he was.

3
Inverse Problems as Statistics
  • Measurable space X of possible data.
  • Set Θ of possible descriptions of the world: models.
  • Family P = {Pθ : θ ∈ Θ} of probability distributions on X, indexed by models θ.
  • Forward operator θ ↦ Pθ maps model θ into a probability measure on X.
  • X-valued data X are a sample from Pθ.
  • Pθ is everything: randomness in the "truth," measurement error, systematic error, censoring, etc.

4
Models
  • Θ usually has special structure.
  • Θ could be a convex subset of a separable Banach space T. (geomag, seismo, grav, MT, …)
  • Physical significance of Θ generally gives θ ↦ Pθ reasonable analytic properties, e.g., continuity.

5
Forward Problems in Physical Science
  • Often thought of as a composition of steps:
  • transform idealized model θ into perfect, noise-free, infinite-dimensional data (approximate physics)
  • keep a finite number of the "perfect" data, because we can only measure, record, and compute with finite lists
  • possibly corrupt the list with measurement error.
  • Equivalent to a single-step procedure with "corruption" on a par with the physics, and the mapping incorporating the censoring.

6
Inverse Problems
  • Observe data X drawn from Pθ for some unknown θ ∈ Θ. (Assume Θ contains at least two points; otherwise, the data are superfluous.)
  • Use X and the knowledge that θ ∈ Θ to learn about θ; for example, to estimate a parameter g(θ) (the value at θ of a continuous G-valued function g defined on Θ).

7
Modeling/Prediction/Estimation
  • What does it mean to solve an inverse problem?
  • Constructing a model that fits the data ≠ estimation (e.g., LS, RLS). Predicting other data ≠ estimating the model.
  • Properties of a model that fits the data need not be shared by the true model.
  • The data need not constrain the parameter of interest: non-uniqueness.
  • Confusion between properties of algorithms and properties of problems.
  • The "well-posed questions" of P. Sabatier.

8
Inverse Problems in Physical Science
  • Inverse problems in science are often solved using applied-math methods for ill-posed problems (e.g., Tikhonov regularization, analytic inversions).
  • Those methods are designed to answer different questions; they can behave poorly with data (e.g., bad bias and variance).
  • Inference ≠ construction. The statistical viewpoint may be more appropriate for interpreting real data with stochastic errors.

9
Elements of the Statistical View
  • Distinguish between characteristics of the problem, and characteristics of methods used to draw inferences.
  • One fundamental qualitative property of a parameter:
  • g is identifiable if for all η, ζ ∈ Θ,
  • g(η) ≠ g(ζ) ⇒ Pη ≠ Pζ.
  • In most inverse problems, g(θ) = θ is not identifiable, and few linear functionals of θ are identifiable.

10
Deterministic and Statistical Connections
  • Identifiability (distinct parameter values yield distinct probability distributions for the observables) is similar to uniqueness (the forward operator maps at most one model into the observed data).
  • Consistency (the parameter can be estimated with arbitrary accuracy as the number of data grows) is related to stability of a recovery algorithm (small changes in the data produce small changes in the recovered model).
  • There are quantitative connections too.

11
More Notation
  • Let T be a separable Banach space, T* its normed dual.
  • Write the pairing between T* and T:
  • ⟨•, •⟩ : T* × T → R.

12
Linear Forward Problems
  • A forward problem is linear if
  • Θ is a subset of a separable Banach space T
  • X = Rⁿ, X = (Xj)j=1..n
  • For some fixed sequence (κj)j=1..n of elements of T*,
  • Xj = ⟨κj, θ⟩ + εj, θ ∈ Θ,
  • where ε = (εj)j=1..n is a vector of stochastic errors whose distribution does not depend on θ.

13
Linear Forward Problems, cont'd.
  • Linear functionals κj are the "representers."
  • Distribution Pθ is the probability distribution of X.
  • Typically, dim(Θ) = ∞; at least, n < dim(Θ), so estimating θ is an underdetermined problem.
  • Define
  • K : Θ → Rⁿ
  • θ ↦ (⟨κj, θ⟩)j=1..n.
  • Abbreviate the forward problem by X = Kθ + ε, θ ∈ Θ.

14
Linear Inverse Problems
  • Use X = Kθ + ε, and the constraint θ ∈ Θ, to estimate or make inferences about g(θ).
  • The probability distribution of X depends on θ only through Kθ, so if there are two points
  • θ1, θ2 ∈ Θ such that Kθ1 = Kθ2 but
  • g(θ1) ≠ g(θ2),
  • then g(θ) is not identifiable.
  • This is like what geophysicists call "non-uniqueness."

15
Example: Sampling w/ systematic and random error
  • Observe
  • Xj = f(tj) + rj + εj, j = 1, 2, …, n,
  • f ∈ C, a set of smooth functions on [0, 1]
  • tj ∈ [0, 1]
  • |rj| ≤ 1, j = 1, 2, …, n
  • εj iid N(0, 1).
  • Take Θ = C × [-1, 1]ⁿ, X = Rⁿ, and θ = (f, r1, …, rn).
  • Then Pθ has density
  • (2π)^(-n/2) exp{-½ Σj=1..n (xj - f(tj) - rj)²}.
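
A minimal simulation sketch of this data model (the particular f, the equally spaced tj, and the way the rj are drawn are illustrative assumptions, not specified on the slide):

% Sketch: simulate X_j = f(t_j) + r_j + e_j with |r_j| <= 1, e_j iid N(0,1).
n  = 100;
tj = linspace(0, 1, n)';            % sample points in [0, 1]
f  = @(t) sin(2*pi*t);              % an assumed smooth f (illustration only)
rj = 2*rand(n, 1) - 1;              % fixed systematic errors, |r_j| <= 1
ej = randn(n, 1);                   % stochastic errors, iid N(0, 1)
X  = f(tj) + rj + ej;               % observed data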

16
Discretization and Regularization
  • Discretization introduces error.
  • Also masks error: reduces apparent uncertainty; unidentifiable parameters seem identifiable.
  • The literature confuses numerical stability and accuracy.
  • Discretization and regularization increase stability and bias. Without constraints or a prior, can't tell whether the added bias is big or small. Usually it could be infinite.

17
Discretize
f(x) ≈ Σi=1..m ai φi(x), m > n; {φi(x)}i=1..m linearly independent on [0, 1]. For many choices of {φi(x)}i=1..m, e.g., sinusoids, splines, boxcars, apparently can estimate f with bounded bias (without bias if the rj are 0). Illusion: an artifact of the discretization.
18
Sketch Identifiability
Pζ = Pη even though η ≠ ζ, so θ is not identifiable.
g cannot be estimated with bounded bias.
Pζ = Pη but g(η) ≠ g(ζ), so g is not identifiable.
19
Backus-Gilbert Theory
Let Θ = T be a Hilbert space. Let g ∈ T = T* be a linear parameter. Let {κj}j=1..n ⊂ T*. Then g(θ) is identifiable iff g = Λ·K for some 1×n matrix Λ. If also E[ε] = 0, then Λ·X is unbiased for g. If also ε has covariance matrix Σ = E[εεᵀ], then the MSE of Λ·X is Λ·Σ·Λᵀ.
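
A finite-dimensional sketch of this recipe (discretizing T to R^p, so K becomes an n×p matrix of representers and g a 1×p row vector; the particular K, g, sigma, and theta below are illustrative):

% Sketch: Backus-Gilbert in a discretized setting (T ~ R^p).
n = 5; p = 50; sigma = 0.1;
K = randn(n, p);                    % rows: representers kappa_j sampled on a grid
g = K(1,:) + 2*K(3,:);              % a functional in the row space of K, hence identifiable
Lambda = g / K;                     % solve g = Lambda*K (least squares)
identifiable = norm(Lambda*K - g) < 1e-8*norm(g);
Sigma = sigma^2 * eye(n);           % error covariance
theta = randn(p, 1);                % some model
X = K*theta + sigma*randn(n, 1);    % data
ghat = Lambda * X;                  % unbiased for g*theta when identifiable
mse  = Lambda * Sigma * Lambda';    % MSE of the linear estimate
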
20
Sketch Backus-Gilbert
21
Decision Rules
  • A (randomized) decision rule
  • δ : X → M₁(A)
  • x ↦ δx(·),
  • is a measurable mapping from the space X of possible data to the collection M₁(A) of probability distributions on a separable metric space A of actions.
  • A non-randomized decision rule is a randomized decision rule that, to each x ∈ X, assigns a unit point mass at some value
  • a = a(x) ∈ A.

22
Why randomized rules?
  • In some problems, randomized rules have better behavior.
  • Allowing randomized rules can make the set of decisions convex (by allowing mixtures of different decisions), which makes the math easier.
  • If the risk is convex, Rao-Blackwell theorem says
    that the optimal decision is not randomized.
    (More on this later.)

23
Example: randomization natural
  • A coin has chance 1/3 of landing with one side showing and chance 2/3 of the other side showing. Don't know which side is which.
  • Want to decide whether P(heads) = 1/3 or 2/3.
  • Toss the subject coin 10 times. X = number of heads.
  • Toss a fair coin once. U = number of heads.

Use the data to pick the more likely scenario, but if the data don't help, decide by tossing a fair coin.
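
A simulation sketch of this randomized rule (the cutoffs "say 2/3 when X > 5, say 1/3 when X < 5, flip U when X = 5" and the assumed truth P(heads) = 2/3 are illustrative choices):

% Sketch: randomized decision for P(heads) = 1/3 vs 2/3.
nsim = 1e5; p_true = 2/3;                   % assume the 2/3 side really is "heads"
X = binornd(10, p_true, nsim, 1);           % heads in 10 tosses of the subject coin
U = rand(nsim, 1) < 0.5;                    % fair-coin tie-breaker
say_two_thirds = (X > 5) | (X == 5 & U);    % data decide unless X = 5
fprintf('P(correct decision) ~ %.3f\n', mean(say_two_thirds));
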
24
Estimators
  • An estimator of a parameter g(θ) is a decision rule δ for which the space A of possible actions is the space G of possible parameter values.
  • ĝ = ĝ(X) is common notation for an estimator of g(θ).
  • Usually write a non-randomized estimator as a G-valued function of x instead of an M₁(G)-valued function.

25
Comparing Decision Rules
  • Infinitely many decision rules and estimators.
  • Which one to use?
  • The best one!
  • But what does "best" mean?

26
Loss and Risk
  • 2-player game: Nature v. Statistician.
  • Nature picks θ from Θ. θ is secret, but the statistician knows Θ.
  • Statistician picks δ from a set D of rules. δ is secret.
  • Generate data X from Pθ, apply δ.
  • Statistician pays loss L(θ, δ(X)). L should be dictated by scientific context, but …
  • Risk is expected loss: r(θ, δ) = Eθ L(θ, δ(X)).
  • A good rule δ has small risk, but what does "small" mean?

27
Strategy
  • Rare that one δ has smallest risk ∀θ ∈ Θ.
  • δ is admissible if it is not dominated (no estimator does at least as well for every θ, and better for at least one θ).
  • Minimax decision minimizes ρΘ(δ) = supθ∈Θ r(θ, δ) over δ ∈ D.
  • Minimax risk is ρΘ* = infδ∈D ρΘ(δ).
  • Bayes decision minimizes
  • ρπ(δ) = ∫Θ r(θ, δ) π(dθ) over δ ∈ D
  • for a given prior probability distribution π on Θ.
  • Bayes risk is ρπ* = infδ∈D ρπ(δ).

28
Minimax is Bayes for least favorable prior
Pretty generally, for convex Θ and D and a concave-convex-like risk r, the minimax risk equals the Bayes risk for the least favorable prior: supπ ρπ* = ρΘ*.
  • If minimax risk >> Bayes risk, the prior π (not the data or Θ) controls the apparent uncertainty of the Bayes estimate.

29
Common Risk: Mean Distance Error (MDE)
  • Let dG denote the metric on G, and let ĝ be an estimator of g.
  • MDE at θ of ĝ is
  • MDEθ(ĝ, g) = Eθ dG(ĝ(X), g(θ)).
  • If the metric derives from a norm, MDE is mean norm error (MNE).
  • If the norm is Hilbertian, MNE² is mean squared error (MSE).

30
Shrinkage
  • Suppose X ~ N(θ, I) with dim(θ) = d ≥ 3.
  • X is not admissible for θ for squared-error loss (Stein, 1956).
  • Dominated by X(1 - a/(b + ‖X‖²)) for small a and big b.
  • James-Stein: better is δJS(X) = X(1 - a/‖X‖²), 0 < a ≤ 2(d-2).
  • Even better to take the positive part of the shrinkage factor:
  • δJS+(X) = X(1 - a/‖X‖²)+, 0 < a ≤ 2(d-2).
  • δJS+ isn't minimax for MSE, but close.
  • Implications for Backus-Gilbert estimates of d ≥ 3 linear functionals.
  • ∃ extensions to other distributions; see Evans & Stark (1996).
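
A Monte Carlo sketch of the dominance claim (the true θ and the choice a = d - 2 are illustrative):

% Sketch: MSE of X vs. the positive-part James-Stein estimator, d = 10.
d = 10; a = d - 2; nsim = 1e4;
theta = 2*randn(d, 1);                      % an arbitrary true mean
mseX = 0; mseJS = 0;
for i = 1:nsim
  X = theta + randn(d, 1);
  shrink = max(0, 1 - a/(X'*X));            % positive-part shrinkage factor
  mseX  = mseX  + norm(X - theta)^2/nsim;
  mseJS = mseJS + norm(shrink*X - theta)^2/nsim;
end
fprintf('MSE(X) ~ %.2f, MSE(JS+) ~ %.2f\n', mseX, mseJS);   % MSE(X) ~ d = 10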

31
Example: Bounded Normal Mean
  • Observe X ~ N(θ, 1). Know a priori that θ ∈ [-τ, τ].
  • Want to estimate g(θ) = θ.
  • φ(·) = standard normal density; Φ(·) = standard normal cumulative distribution function.
  • Suppose we choose to use squared-error loss:
  • L(θ, d) = (θ - d)²
  • r(θ, δ) = Eθ L(θ, δ(X)) = Eθ (θ - δ(X))²
  • ρΘ(δ) = supθ∈Θ r(θ, δ) = supθ∈Θ Eθ (θ - δ(X))²
  • ρΘ* = infδ∈D supθ∈Θ Eθ (θ - δ(X))²

32
Risk of X for bounded normal mean
  • Consider the simple (maximum likelihood) estimator
  • δ(X) = X.
  • EX = θ, so X is unbiased for θ, and θ is unbiasedly estimable.
  • r(θ, X) = Eθ (θ - X)² = Var(X) = 1.
  • Consider a uniform Bayesian prior to capture the constraint θ ∈ [-τ, τ]:
  • π = U[-τ, τ], the uniform distribution on the interval [-τ, τ].
  • ρπ(X) = ∫-τ..τ r(θ, X) π(dθ) = ∫-τ..τ 1 · (2τ)⁻¹ dθ = 1.
  • In this example, the frequentist risk of X equals the Bayes risk of X for the uniform prior π (but X is not the Bayes estimator).

33
Truncation is better (but not best)
  • Easy to find an estimator better than X from both frequentist and Bayes perspectives.
  • Truncation estimate: δT(X) = max(-τ, min(τ, X)), i.e., pull X back to the nearest point of [-τ, τ].

δT is biased, but has smaller MSE than X, whatever θ ∈ Θ may be. (δT is the constrained maximum likelihood estimate.)
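
A simulation sketch comparing the two risks over a grid of θ values (τ = 3, the grid spacing, and the simulation size are arbitrary choices):

% Sketch: MSE of X vs. the truncation estimate over theta in [-tau, tau].
tau = 3; nsim = 2e4; thetas = -tau:0.25:tau;
mseX = zeros(size(thetas)); mseT = mseX;
for k = 1:length(thetas)
  X  = thetas(k) + randn(nsim, 1);
  dT = max(-tau, min(tau, X));              % pull X back to [-tau, tau]
  mseX(k) = mean((X  - thetas(k)).^2);      % ~ 1 for every theta
  mseT(k) = mean((dT - thetas(k)).^2);      % < 1 for every theta in [-tau, tau]
end
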
34
Risk of δT
[Figure: sampling density φ(x - θ) for a θ in [-τ, τ]; the truncation estimate δT maps the tail probability Pθ(X < -τ) to the endpoint -τ (and similarly at τ); axes marked at -τ, 0, θ, τ.]
35
Minimax MSE Estimate of BNM
  • Truncation estimate is better than X, but it is neither minimax nor Bayes.
  • Clear that ρΘ* ≤ min(1, τ²): MSE(X) = 1, and ρΘ(0) = τ².
  • Minimax MSE estimator is a nonlinear shrinkage estimator.
  • Affine minimax MSE risk is τ²/(1+τ²).
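
A small check of the affine part of this claim (worst-case MSE of the rule δ(X) = cX over θ ∈ [-τ, τ]; the grid over c is just for illustration):

% Worst-case MSE of c*X over theta in [-tau, tau] is c^2 + (1-c)^2*tau^2,
% minimized at c = tau^2/(1+tau^2), with value tau^2/(1+tau^2).
tau = 3; c = (0:0.001:1)';
worst = c.^2 + (1 - c).^2 * tau^2;          % sup over theta is attained at |theta| = tau
[rmin, i] = min(worst);
fprintf('c* = %.3f, affine minimax MSE = %.3f, tau^2/(1+tau^2) = %.3f\n', ...
        c(i), rmin, tau^2/(1 + tau^2));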

36
Bayes estimation of BNM
  • For the uniform prior π = U[-τ, τ], the posterior density of θ given x is
  • f(θ | x) = φ(θ - x) / [Φ(τ - x) - Φ(-τ - x)], for θ ∈ [-τ, τ] (and 0 otherwise).
37
Posterior Mean
  • The mean of the posterior density minimizes the Bayes risk when the loss is squared error:
  • E[θ | x] = x - [φ(τ - x) - φ(-τ - x)] / [Φ(τ - x) - Φ(-τ - x)].

38
Bayes estimator is also nonlinear shrinkage
function f = bayesUnif(x, tau)
% Posterior mean of theta given x, for the U[-tau, tau] prior and N(theta, 1) data.
f = x - (normpdf(tau - x) - normpdf(-tau - x)) ./ (normcdf(tau - x) - normcdf(-tau - x));
return
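
A usage sketch reproducing the comparison in the figure below (the x grid, τ = 3, and the plotting commands are additions for illustration):

x = (-6:0.1:6)'; tau = 3;
d_pi = bayesUnif(x, tau);                   % Bayes (posterior-mean) estimate
d_T  = max(-tau, min(tau, x));              % truncation estimate, for comparison
plot(x, x, '--', x, d_T, ':', x, d_pi, '-');
legend('X', 'truncation', 'Bayes posterior mean');
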
[Figure: X, the truncation estimate δT, and the Bayes estimator δπ plotted as functions of x for τ = 3; both axes run from -6 to 6.]
For τ = 3, Bayes risk ρπ ≈ 0.7 (by simulation). Minimax risk ρΘ ≈ 0.75.
39
Bayes/Minimax Risks for Mean Squared Error
nsim = 20000; risk = 0; tau = 5;
for theta = -tau:.01:tau
  pred = bayesUnif(theta + randn(nsim, 1), tau);
  risk = risk + (pred - theta)'*(pred - theta)/nsim;
end
risk/length(-tau:.01:tau)

Knowing θ ~ U[-τ, τ] versus knowing θ ∈ [-τ, τ].
40
Confidence Interval for BNM
  • Might be interested in a confidence set for θ instead of a point estimate. A 1-α confidence set I satisfies
  • Pθ(I(X) ∋ θ) ≥ 1-α, ∀θ ∈ Θ.
  • Actions are now sets, not points. Decision rules are probability measures on sets.
  • Sensible loss? Lebesgue measure of the confidence set. But consider the application!
  • Risk is expected measure.

41
Some Confidence Procedures
  • Naïve: [X - zα/2, X + zα/2]
  • Truncated: [X - zα/2, X + zα/2] ∩ [-τ, τ]
  • Affine, fixed-length: [aX + b - c/2, aX + b + c/2]
  • Nonlinear, fixed-length: [f(X) - c/2, f(X) + c/2], f measurable
  • Variable length: [l(X), u(X)], l, u measurable
  • Likelihood ratio test (LRT): include all η ∈ [-τ, τ] s.t.
  • fη(X)/f0(X) ≥ cη.
  • Choose cη to get 1-α coverage.
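
A simulation sketch of the first two procedures (θ = 0, τ = 3, and α = 0.05 are illustrative; the naive length 2·z0.025 = 3.92 matches the value quoted on the next slide):

% Coverage and expected length of the naive and truncated intervals.
alpha = 0.05; z = norminv(1 - alpha/2); tau = 3; theta = 0; nsim = 1e5;
X   = theta + randn(nsim, 1);
lo  = X - z;  hi  = X + z;                  % naive interval
loT = max(lo, -tau); hiT = min(hi, tau);    % truncated: intersect with [-tau, tau]
coverage  = mean(lo <= theta & theta <= hi);   % ~ 0.95; truncation keeps any theta in [-tau, tau]
len_naive = mean(hi - lo);                  % = 2*z ~ 3.92
len_trunc = mean(max(hiT - loT, 0));        % shorter on average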

42
Minimax Expected Length
  • Seek the procedure with minimax expected length for θ ∈ [-τ, τ].
  • For τ ≤ 2zα, the minimax procedure is "Truncated Pratt"; it has a simple form. (Evans et al., 2003; Schafer & Stark, 2003.)
  • Table for α = 0.05.
  • Naïve has length 3.92.

43
Regularization
  • From a statistical perspective, regularization tries to exploit a bias/variance tradeoff to reduce MNE.
  • There are situations in which regularization is optimal; it depends on the prior information, the forward operator, the loss function, and the regularization functional. See, e.g., Donoho (1995), O'Sullivan (1986).
  • Generally need some prior information about θ to know whether a given amount of smoothing increases bias² by more than it decreases variance. (But cf. shrinkage.)
  • Can think of regularization in dual ways: minimizing a measure of size subject to fitting the data adequately, or minimizing a measure of misfit subject to keeping the model small. Complementary interpretations of the Lagrange multiplier.
  • Generally, to get consistency, need the smoothing to decrease at the right rate as the number of data increases. The rate depends on the true smoothness.

44
Sketch Regularization
45
Consistency of Occam's Inversion
  • Common approach: minimize norm (or another regularization functional) subject to mean data misfit ≤ 1.
  • Sometimes called Occam's Inversion (Constable et al., 1987): the simplest hypothesis consistent with the data.
  • In many circumstances, this estimator is inconsistent: as the number of data grows, there is a greater and greater chance that the estimator is 0. The allowable misfit grows faster than the norm of the noise-free data image.
  • In common situations, consistency of the general approach requires data redundancy and averaging.

46
Singular Value Decomposition, Linear Problems
  • Assume Θ ⊆ T, a separable Hilbert space; {εj}j=1..n iid N(0, σ²); {κj}j=1..n linearly independent.
  • K is compact: infinite-dimensional null space. Let K* : Rⁿ → T be the adjoint operator to K.
  • ∃ n triples (νj, xj, λj), j = 1..n, with νj ∈ T, xj ∈ X and λj ∈ R, such that
  • K νj = λj xj,
  • K* xj = λj νj.
  • {νj}j=1..n can be chosen to be orthonormal in T; {xj}j=1..n can be chosen to be orthonormal in X.
  • λj > 0, ∀j. Order so that λ1 ≥ λ2 ≥ … ≥ λn > 0.
  • (νj, xj, λj)j=1..n is a singular value decomposition of K.

47
Singular Value Weighting
  • Can write the minimum-norm model that fits the data exactly as
  • MNE: δMN(X) = Σj=1..n λj⁻¹ (xj·X) νj.
  • Write θ = θ∥ + θ⊥ (components in the span of {νj} and its orthocomplement).
  • Biasθ(δMN) = Eθ δMN(X) - θ = -θ⊥.
  • Varθ δMN = Eθ ‖Σj=1..n λj⁻¹ (xj·ε) νj‖² = σ² Σj=1..n λj⁻².
  • Components associated with small λj make the variance big: noise components are multiplied by λj⁻¹.
  • Singular value truncation: reconstruct θ using the νj with λj > t:
  • δSVT = Σj=1..m λj⁻¹ (xj·X) νj, where m = max{k : λk > t}.
  • Mollifies the noise magnification but increases bias.
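
A discretized sketch of these two estimators (K is replaced by a toy n×p matrix; σ, the threshold t, and the decaying spectrum are all illustrative):

% Sketch: minimum-norm estimate and singular-value truncation via the SVD.
n = 10; p = 200; sigma = 0.05; t = 0.1;
K = randn(n, p) * diag(exp(-(1:p)/20));     % toy operator with decaying "spectrum"
theta = randn(p, 1);
X = K*theta + sigma*randn(n, 1);
[U, S, V] = svd(K, 'econ');                 % U(:,j) ~ x_j, V(:,j) ~ nu_j, diag(S) ~ lambda_j
lam  = diag(S);
dMN  = V * ((U'*X) ./ lam);                 % minimum-norm model fitting the data exactly
m    = sum(lam > t);                        % keep singular values above the threshold
dSVT = V(:,1:m) * ((U(:,1:m)'*X) ./ lam(1:m));
varMN  = sigma^2 * sum(lam.^(-2));          % variance term of the MNE
varSVT = sigma^2 * sum(lam(1:m).^(-2));     % smaller variance, at the price of extra bias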

48
Bias of SVT
  • Bias of SVT is bigger than that of the MNE by the projection of θ onto span{νj}j=m+1..n.
  • Variance of SVT is smaller by σ² Σj=m+1..n λj⁻².
  • With enough prior information about θ (to control the bias), can exploit the bias-variance tradeoff to reduce MSE.
  • SVT ∈ a family of estimators that re-weight the singular functions in the reconstruction:
  • δw = Σj=1..n w(λj) (xj·X) νj.
  • Regularization using a norm penalty, with regularization parameter λ, corresponds to
  • w(u) = u/(u² + λ).
  • These tend to have norm smaller than the maximum-likelihood estimate; they can be viewed as shrinkage.
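
Continuing the discretized sketch above (same toy K and X), the re-weighted family and the norm-penalty weight look like this; lambdaReg is an illustrative value of the regularization parameter:

% Re-weighted reconstruction: w(lambda_j) replaces 1/lambda_j.
[U, S, V] = svd(K, 'econ');
lam = diag(S);
lambdaReg = 1e-2;
w  = @(u) u ./ (u.^2 + lambdaReg);          % norm-penalty (Tikhonov-type) weight, w(u) <= 1/u
dW = V * (w(lam) .* (U'*X));                % shrunk relative to the minimum-norm model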

49
Examples of Singular Functions
  • Linear, time-invariant filters: complex sinusoids
  • Circular convolution: sinusoids
  • Band- and time-limiting: prolate spheroidal wavefunctions
  • Main-field geomagnetism: spherical harmonics, times radial polynomials in 1/r
  • Abel transform: Jacobi polynomials and Chebyshev polynomials
  • See Donoho (1995) for more examples and references.

50
Minimax Estimation of Linear Parameters
  • Observe X = Kθ + ε ∈ Rⁿ, with
  • θ ∈ Θ ⊆ T, T a separable Hilbert space
  • Θ convex
  • {εi}i=1..n iid N(0, σ²).
  • Seek to learn about g(θ) : Θ → R, linear, bounded on Θ.
  • For a variety of risks (MSE, MAD, length of a fixed-length confidence interval), the minimax risk is controlled by the modulus of continuity of g, calibrated to the noise level.
  • The full problem is no harder than the hardest 1-dimensional subproblem; it reduces to a BNM (Donoho, 1994).

51
Example: Geomagnetism
  • Θ = {θ ∈ ℓ₂(w) : Σl=1..∞ wl Σm=-l..l θlm² ≤ q}.
  • Estimate g(θ) = θlm.
  • Symmetry of Θ and linearity of K, g let us characterize the modulus:

Problem: maximize a linear functional of a vector in the intersection of two ellipsoids. In the main-field geomagnetism problem, as the data sampling becomes more uniform over the spherical idealization of a satellite orbit, both the norm (the prior information) and the operator K are diagonalized by spherical harmonics.
52
Modulus of Continuity
53
Distinguishing two models
  • Data tell the difference between two models ζ and η if the L¹ distance between Pζ and Pη is large:

54
Summary
  • "Solving" an inverse problem means different things to different audiences.
  • The statistical viewpoint is a useful abstraction. Physics is in the mapping θ ↦ Pθ; prior information is in the constraint θ ∈ Θ.
  • There is more information in the assertion θ ~ π, with π supported on Θ, than there is in the constraint θ ∈ Θ.
  • Separating the model θ from the parameters g(θ) of interest is useful: Sabatier's "well-posed questions." Many interesting questions can be answered without knowing the entire model.
  • Thinking about measures of performance is useful.
  • Difficulty of the problem ≠ performance of a specific method.

55
References & Acknowledgements
  • Constable, S.C., R.L. Parker & C.G. Constable, 1987. Occam's inversion: A practical algorithm for generating smooth models from electromagnetic sounding data, Geophysics, 52, 289-300.
  • Donoho, D.L., 1994. Statistical Estimation and Optimal Recovery, Ann. Stat., 22, 238-270.
  • Donoho, D.L., 1995. Nonlinear solution of linear inverse problems by wavelet-vaguelette decomposition, Appl. Comput. Harm. Anal., 2, 101-126.
  • Evans, S.N. & Stark, P.B., 2002. Inverse Problems as Statistics, Inverse Problems, 18, R1-R43.
  • Evans, S.N., B.B. Hansen & P.B. Stark, 2003. Minimax expected measure confidence sets for restricted location parameters, Tech. Rept. 617, Dept. of Statistics, UC Berkeley.
  • Le Cam, L., 1986. Asymptotic Methods in Statistical Decision Theory, Springer-Verlag, NY, 742pp.
  • O'Sullivan, F., 1986. A statistical perspective on ill-posed inverse problems, Statistical Science, 1, 502-518.
  • Schafer, C.M. & P.B. Stark, 2003. Using what we know: inference with physical constraints, Tech. Rept., Dept. of Statistics, UC Berkeley.
  • Stark, P.B., 1992. Inference in infinite-dimensional inverse problems: Discretization and duality, J. Geophys. Res., 97, 14,055-14,082.
  • Stark, P.B., 1992. Minimax confidence intervals in geomagnetism, Geophys. J. Intl., 108, 329-338.
  • Created using TexPoint by G. Necula, http://raw.cs.berkeley.edu/texpoint