1
Noisy Global Function Optimization
  • Dan Lizotte

2
Noisy Function?
  • Conditional distribution
  • Function Evaluation F(x) is a sample from
    P(F | x)
  • There is no P(x)
  • We get to pick x, deterministically or
    stochastically, as usual.

3
Common Assumptions
  • F(x) = µ(x) + e(x)
  • What they don't tell you
  • µ(x) is an arbitrary deterministic function
  • e(x) is an r.v. with E(e) = 0 (i.e., E(F(x)) = µ(x))
  • Really only makes sense if e(x) is unimodal
  • Samples are probably close to µ
  • But maybe not normal

4
What's the Difference?
  • Deterministic Function Optimization
  • Oh, I have this function f(x)
  • Gradient is ∇f
  • Hessian is H
  • Noisy Function Optimization
  • Oh, I have this r.v. F(x) = µ(x) + e(x)
  • I think the form of µ is like ...
  • I think that P(e(x)) is something like ...

5
What's the Plan?
  • Get samples of F(x) = µ(x) + e(x)
  • Estimate and minimize µ(x)
  • Regression + Optimization
  • i.e., reduce to deterministic global minimization

6
Frequentist (?) Approach
  • Maybe I'm being unfair
  • Not claiming all non-Bayesians would think this
    is a good idea
  • Raises interesting questions

7
Suppose for a moment
  • You thought F(x) = µ(x) + e
  • You thought µ(x) = ax² + bx + c
  • You thought e ~ N(0, σ²)
  • What do you do?
  • Estimate a, b, c
  • Minimize µ(x) (return xmin = -b/2a)
  • Estimate how?
  • Sample points, do least squares (max L)
  • Sample where? Does it matter? (sketch below)
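
A minimal numerical sketch of this recipe (the quadratic, the noise level, and the sample locations below are made up; the true minimizer is set to -1.25 to match the later slides):

    import numpy as np

    rng = np.random.default_rng(0)

    def F(x, sigma=1.0):
        # hypothetical noisy objective with true minimizer at x = -1.25
        return (x + 1.25) ** 2 + rng.normal(0.0, sigma, size=np.shape(x))

    xs = np.linspace(-3.0, 1.0, 20)      # where we choose to sample
    ys = F(xs)
    a, b, c = np.polyfit(xs, ys, deg=2)  # least-squares (max likelihood) quadratic fit
    x_min = -b / (2.0 * a)               # minimizer of the fitted quadratic
    print(x_min)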

8
(No Transcript)
9
(No Transcript)
10
(No Transcript)
11
xmin = -1.25
StdDev = 5.77, Mean = -0.22
12
xmin = -1.25
StdDev = 0.078, Mean = -1.23
13
Does choice of x matter?
  • Clearly.
  • A frequentist would probably try to choose x to
    minimize the variance of -b/2a
  • Can't do that without knowing the real function
  • Some sort of sequential decision process (small simulation below)
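
A small simulation of why the choice of x matters; the two designs and the noise level are illustrative only, echoing the spread in the fitted xmin shown on the two preceding slides:

    import numpy as np

    rng = np.random.default_rng(0)
    true_min, sigma = -1.25, 1.0

    def xmin_spread(design, trials=2000):
        # Repeatedly sample F at the design points, fit a quadratic,
        # and record the estimate -b/(2a); report its mean and std dev.
        estimates = []
        for _ in range(trials):
            ys = (design - true_min) ** 2 + rng.normal(0.0, sigma, size=design.shape)
            a, b, _ = np.polyfit(design, ys, deg=2)
            estimates.append(-b / (2.0 * a))
        return np.mean(estimates), np.std(estimates)

    print(xmin_spread(np.linspace(-1.6, -0.9, 10)))  # clustered design: noisy estimate
    print(xmin_spread(np.linspace(-4.0, 1.5, 10)))   # spread-out design: tighter estimate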

14
Bayesian Approach
  • A Bayesian
  • Has a prior
  • Gets data
  • Computes a posterior
  • In this case, prior and posterior over F
  • Encoding our uncertainty about F enables
    principled decision making
  • Everybody loves Gaussians

15
Gaussian
  • Unimodal
  • Concentrated
  • Easy to compute with
  • Sometimes
  • Tons of crazy properties

16
Multivariate Gaussian
  • Same thing, but more so
  • Some things are harder
  • No nice form for the CDF
  • Classical view: points

17
Covariance Matrix
  • Shape parameter
  • Eigenstuff indicates variances and correlations (sketch below)
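
A tiny sketch of reading the "eigenstuff" out of a covariance matrix; the 2x2 matrix here is made up for illustration:

    import numpy as np

    Sigma = np.array([[4.0, 1.8],
                      [1.8, 1.0]])               # made-up 2-D covariance matrix

    variances = np.diag(Sigma)                   # per-coordinate variances
    corr = Sigma[0, 1] / np.sqrt(Sigma[0, 0] * Sigma[1, 1])  # correlation coefficient
    evals, evecs = np.linalg.eigh(Sigma)         # spreads and directions of the ellipse
    print(variances, corr)
    print(evals)
    print(evecs)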

18
(No Transcript)
19
Higher Dimensions
  • Visualizing > 3 dimensions is difficult
  • Thinking about vectors in the i,j,k
    engineering sense is a trap
  • Means and marginals are practical
  • But then we don't see correlations
  • Marginal distributions are Gaussian
  • e.g., F6 ~ N(µ(F6), σ(F6))

20
Yet Higher Dimensions
  • Why stop there?
  • We indexed before with Z. Why not R?
  • Need functions µ(Fx), σ(Fx, Fy) for all x, y ∈ R
  • F is now an uncountably infinite dimensional
    vector
  • Don't panic! It's just a function

21
Getting Ridiculous
  • Why stop there?
  • We indexed before with R. Why not R^d?
  • Need functions µ(Fx), σ²(Fx, Fy) for all x, y ∈ R^d

22
Gaussian Process
  • Probability distribution indexed by an arbitrary
    set
  • Each element gets a Gaussian distribution over
    the reals with mean µ(x)
  • These distributions are dependent/correlated as
    defined by σ²(Fx, Fy)
  • Any subset of indices defines a multivariate
    Gaussian distribution (sketch below)
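
A sketch of the "any subset of indices is multivariate Gaussian" property, assuming a zero prior mean and a squared-exponential kernel (one common choice; the slide leaves the kernel unspecified):

    import numpy as np

    def kernel(x, y, ell=0.5, sig2=1.0):
        # squared-exponential covariance σ²(Fx, Fy); an assumed choice
        return sig2 * np.exp(-0.5 * (x - y) ** 2 / ell ** 2)

    xs = np.linspace(0.0, 5.0, 50)               # a finite subset of the index set
    mu = np.zeros_like(xs)                       # prior mean µ(x) = 0 (assumed)
    K = kernel(xs[:, None], xs[None, :])         # covariance of that subset

    rng = np.random.default_rng(0)
    draw = rng.multivariate_normal(mu, K + 1e-9 * np.eye(len(xs)))  # one draw from the prior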

23
Gaussian Process
  • Index set can be pretty much whatever
  • Reals
  • Real vectors
  • Graphs
  • Strings
  • Most interesting structure is in σ²(Fx, Fy), the
    kernel.

24
Bayesian Updates for GPs
  • Oh yeah: Bayesian, remember?
  • How do Bayesians use a Gaussian Process?
  • Start with GP prior
  • Get some data
  • Compute a posterior
  • Ask interesting questions about the posterior

25
Prior
26
Data
27
Posterior
28
Computing the Posterior
  • Given
  • Prior
  • List of observed data points Fobs
  • (indexed by a list o1, o2, ..., oj)
  • List of query points Fask
  • (indexed by a list a1, a2, ..., ak)
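
The slide lists only the ingredients; here is a sketch of the actual computation using the standard Gaussian conditioning formulas, assuming a zero prior mean, a kernel k, and an observation-noise variance s2_n (names are mine, not the slides'):

    import numpy as np

    def gp_posterior(x_obs, f_obs, x_ask, k, s2_n=1e-2):
        # Posterior mean and covariance of F at the query points x_ask,
        # given observations f_obs at x_obs, under kernel k and noise s2_n.
        K_oo = k(x_obs[:, None], x_obs[None, :]) + s2_n * np.eye(len(x_obs))
        K_ao = k(x_ask[:, None], x_obs[None, :])
        K_aa = k(x_ask[:, None], x_ask[None, :])
        L = np.linalg.cholesky(K_oo)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, f_obs))
        mean = K_ao @ alpha                      # posterior mean at x_ask
        V = np.linalg.solve(L, K_ao.T)
        cov = K_aa - V.T @ V                     # posterior covariance at x_ask
        return mean, cov

With the kernel sketched earlier, gp_posterior(x_obs, f_obs, x_ask, kernel) gives the posterior that the candidate-selection ideas below would query.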

29
What now?
  • We constantly have a model Fpost of our function
    F
  • As we accumulate data, the model improves
  • How should we accumulate data?
  • Use the model to select which point to sample next

30
Candidate Selection
  • Caveat: This will take some work.
  • Only useful if F is expensive to evaluate
  • Dog walking is 10s per evaluation
  • Here are some ideas

31
Idea 0: Min Posterior Mean
  • For which point x does F(x) have the lowest
    posterior mean?
  • This is, in general, a non-convex, global
    optimization problem.
  • WHAT??!!
  • I know, but remember F is expensive
  • Problems
  • Trapped in local minima (below prior mean)
  • Does not acknowledge model uncertainty
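
A grid-based sketch of Idea 0 (the grid stands in for the global optimization the slide mentions; gp_posterior and kernel are the hypothetical helpers sketched above):

    import numpy as np

    def next_point_idea0(x_obs, f_obs, grid):
        # Query wherever the posterior mean is lowest; ignores uncertainty entirely.
        mean, _ = gp_posterior(x_obs, f_obs, grid, kernel)
        return grid[np.argmin(mean)]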

32
Idea 1: Thompson Sampling
  • Idea 0 is too dumb.
  • Sample a function from the posterior
  • Can only be sampled at a list of points
  • Minimize the sampled function
  • This returns a sample from P(X is min)
  • XT ~ P(F(X) < F(X') ∀ X')
  • Problems
  • Explores too much
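
A sketch of Thompson sampling on the same grid (again using the hypothetical gp_posterior and kernel from above):

    import numpy as np

    def next_point_thompson(x_obs, f_obs, grid, rng):
        # Draw one function from the posterior (realized only at the grid points)
        # and minimize that draw; the result is a sample from P(X is the minimizer).
        mean, cov = gp_posterior(x_obs, f_obs, grid, kernel)
        draw = rng.multivariate_normal(mean, cov + 1e-9 * np.eye(len(grid)))
        return grid[np.argmin(draw)]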

33
Idea 2: PMin
  • Thompson is more sensible, but still not quite
    right
  • Why not select
  • x = argmax_X P(F(X) < F(X') ∀ X')
  • i.e., sample F(x) next where x is most likely to
    be the minimum of the function
  • Because it's hard
  • Or at least I can't do it. Our domain's kinda
    big.
  • We can simulate this with repeated Thompson
    sampling
  • This is the optimal greedy action
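
A sketch of the "simulate PMin with repeated Thompson sampling" idea: approximate P(x is the minimizer) on a grid by counting which point wins across many posterior draws (helpers as above):

    import numpy as np

    def next_point_pmin(x_obs, f_obs, grid, rng, n_draws=200):
        mean, cov = gp_posterior(x_obs, f_obs, grid, kernel)
        cov = cov + 1e-9 * np.eye(len(grid))
        wins = np.zeros(len(grid))
        for _ in range(n_draws):
            draw = rng.multivariate_normal(mean, cov)
            wins[np.argmin(draw)] += 1           # this grid point was the minimizer
        return grid[np.argmax(wins)]             # most probable minimizer on the grid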

34
Finally
  • AIBOs!

35
Seriously
  • did we have to go through all that to get to the
    dogs?
  • Yeah, sorry.
  • Now it's easy
  • An AIBO walk is a map from R^51 to R
  • 51 motion parameters map to velocity
  • Optimize the parameters to go fast

36
AIBO Walking
  • Set up a Gaussian process over R^51
  • Kernel is also Gaussian (careful!)
  • Parameters for priors found by maximum likelihood
  • We could be more Bayesian here and use priors
    over the model parameters
  • Walk, get velocity, pick new parameters, walk
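
A sketch of a Gaussian (squared-exponential) kernel over R^51 gait-parameter vectors; the hyperparameters below are placeholders standing in for the values the slide says were found by maximum likelihood:

    import numpy as np

    def walk_kernel(X1, X2, ell=1.0, sig2=1.0):
        # X1: (n, 51), X2: (m, 51) arrays of gait parameters; ell and sig2 are
        # placeholder hyperparameters that would be fit by maximizing the
        # marginal likelihood of the observed walks.
        d = (X1[:, None, :] - X2[None, :, :]) / ell
        return sig2 * np.exp(-0.5 * np.sum(d ** 2, axis=-1))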

37
AIBO Walking
  • Results?
  • Pretty not too bad!
  • But not earth-shattering
  • Even choosing uniformly random parameters isn't
    too bad
  • But using, say, PMax shows a definite improvement
  • I think there's more structure in the problem
    than people let on

38
Whew.
  • I need a break
  • Does anybody have any questions or brilliant
    ideas?