1
Noisy Global Function Optimization
  • Dan Lizotte

2
Noisy Function?
  • Conditional distribution
  • Function Evaluation F(x) is a sample from
    P(F | x)
  • There is no P(x)
  • We get to pick x, deterministically or
    stochastically, as usual.

3
Common Assumptions
  • F(x) = µ(x) + e(x)
  • What they don't tell you
  • µ(x) is an arbitrary deterministic function
  • e(x) is an r.v. with E(e) = 0 (i.e., E(F(x)) = µ(x))
  • Really only makes sense if e(x) is unimodal
  • Samples are probably close to µ
  • But maybe not normal

4
What's the Difference?
  • Deterministic Function Optimization
  • Oh, I have this function f(x)
  • Gradient is ∇f
  • Hessian is H
  • Noisy Function Optimization
  • Oh, I have this r.v. F(x) = µ(x) + e(x)
  • I think the form of µ is like ...
  • I think that P(e(x)) is something like ...

5
What's the Plan?
  • Get samples of F(x) = µ(x) + e(x)
  • Estimate and minimize µ(x)
  • Regression + Optimization
  • i.e., reduce to deterministic global minimization

6
Frequentist (?) Approach
  • Maybe I'm being unfair
  • Not claiming all non-Bayesians would think this
    is a good idea
  • Raises interesting questions

7
Suppose for a moment
  • You thought F(x) = µ(x) + e
  • You thought µ(x) = ax² + bx + c
  • You thought e ~ N(0, σ²)
  • What do you do?
  • Estimate a, b, c
  • Minimize µ(x) (return xmin = -b/2a)
  • Estimate how?
  • Sample points, do least squares (max L)
  • Sample where? Does it matter? (sketch below)
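
A minimal numerical sketch of this recipe (the quadratic, the noise level, and the sample locations below are made up; the true minimizer is set to -1.25 to match the later slides):

    import numpy as np

    rng = np.random.default_rng(0)

    def F(x, sigma=1.0):
        # hypothetical noisy objective with true minimizer at x = -1.25
        return (x + 1.25) ** 2 + rng.normal(0.0, sigma, size=np.shape(x))

    xs = np.linspace(-3.0, 1.0, 20)      # where we choose to sample
    ys = F(xs)
    a, b, c = np.polyfit(xs, ys, deg=2)  # least-squares (max likelihood) quadratic fit
    x_min = -b / (2.0 * a)               # minimizer of the fitted quadratic
    print(x_min)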

8
(No Transcript)
9
(No Transcript)
10
(No Transcript)
11
xmin = -1.25
StdDev = 5.77, Mean = -0.22
12
xmin = -1.25
StdDev = 0.078, Mean = -1.23
13
Does choice of x matter?
  • Clearly.
  • A frequentist would probably try to choose x to
    minimize the variance of -b/2a
  • Can't do that without knowing the real function
  • Some sort of sequential decision process (small simulation below)
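
A small simulation of why the choice of x matters; the two designs and the noise level are illustrative only, echoing the spread in the fitted xmin shown on the two preceding slides:

    import numpy as np

    rng = np.random.default_rng(0)
    true_min, sigma = -1.25, 1.0

    def xmin_spread(design, trials=2000):
        # Repeatedly sample F at the design points, fit a quadratic,
        # and record the estimate -b/(2a); report its mean and std dev.
        estimates = []
        for _ in range(trials):
            ys = (design - true_min) ** 2 + rng.normal(0.0, sigma, size=design.shape)
            a, b, _ = np.polyfit(design, ys, deg=2)
            estimates.append(-b / (2.0 * a))
        return np.mean(estimates), np.std(estimates)

    print(xmin_spread(np.linspace(-1.6, -0.9, 10)))  # clustered design: noisy estimate
    print(xmin_spread(np.linspace(-4.0, 1.5, 10)))   # spread-out design: tighter estimate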

14
Bayesian Approach
  • A Bayesian
  • Has a prior
  • Gets data
  • Computes a posterior
  • In this case, prior and posterior over F
  • Encoding our uncertainty about F enables
    principled decision making
  • Everybody loves Gaussians

15
Gaussian
  • Unimodal
  • Concentrated
  • Easy to compute with
  • Sometimes
  • Tons of crazy properties

16
Multivariate Gaussian
  • Same thing, but more so
  • Some things are harder
  • No nice form for the CDF
  • Classical view: points

17
Covariance Matrix
  • Shape parameter
  • Eigenstuff indicates variances and correlations (sketch below)
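
A tiny sketch of reading the "eigenstuff" out of a covariance matrix; the 2x2 matrix here is made up for illustration:

    import numpy as np

    Sigma = np.array([[4.0, 1.8],
                      [1.8, 1.0]])               # made-up 2-D covariance matrix

    variances = np.diag(Sigma)                   # per-coordinate variances
    corr = Sigma[0, 1] / np.sqrt(Sigma[0, 0] * Sigma[1, 1])  # correlation coefficient
    evals, evecs = np.linalg.eigh(Sigma)         # spreads and directions of the ellipse
    print(variances, corr)
    print(evals)
    print(evecs)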

18
(No Transcript)
19
Higher Dimensions
  • Visualizing > 3 dimensions is difficult
  • Thinking about vectors in the i,j,k
    engineering sense is a trap
  • Means and marginals are practical
  • But then we don't see correlations
  • Marginal distributions are Gaussian
  • e.g., F6 ~ N(µ(F6), σ(F6))

20
Yet Higher Dimensions
  • Why stop there?
  • We indexed before with Z. Why not R?
  • Need functions µ(Fx), σ(Fx, Fy) for all x, y ∈ R
  • F is now an uncountably infinite dimensional
    vector
  • Don't panic! It's just a function

21
Getting Ridiculous
  • Why stop there?
  • We indexed before with R. Why not R^d?
  • Need functions µ(Fx), σ²(Fx, Fy) for all x, y ∈ R^d

22
Gaussian Process
  • Probability distribution indexed by an arbitrary
    set
  • Each element gets a Gaussian distribution over
    the reals with mean µ(x)
  • These distributions are dependent/correlated as
    defined by σ²(Fx, Fy)
  • Any subset of indices defines a multivariate
    Gaussian distribution (sketch below)
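
A sketch of the "any subset of indices is multivariate Gaussian" property, assuming a zero prior mean and a squared-exponential kernel (one common choice; the slide leaves the kernel unspecified):

    import numpy as np

    def kernel(x, y, ell=0.5, sig2=1.0):
        # squared-exponential covariance σ²(Fx, Fy); an assumed choice
        return sig2 * np.exp(-0.5 * (x - y) ** 2 / ell ** 2)

    xs = np.linspace(0.0, 5.0, 50)               # a finite subset of the index set
    mu = np.zeros_like(xs)                       # prior mean µ(x) = 0 (assumed)
    K = kernel(xs[:, None], xs[None, :])         # covariance of that subset

    rng = np.random.default_rng(0)
    draw = rng.multivariate_normal(mu, K + 1e-9 * np.eye(len(xs)))  # one draw from the prior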

23
Gaussian Process
  • Index set can be pretty much whatever
  • Reals
  • Real vectors
  • Graphs
  • Strings
  • Most interesting structure is in σ²(Fx, Fy), the
    kernel.

24
Bayesian Updates for GPs
  • Oh yeah: Bayesian, remember?
  • How do Bayesians use a Gaussian Process?
  • Start with GP prior
  • Get some data
  • Compute a posterior
  • Ask interesting questions about the posterior

25
Prior
26
Data
27
Posterior
28
Computing the Posterior
  • Given
  • Prior
  • List of observed data points Fobs
  • (indexed by a list o1, o2, ..., oj)
  • List of query points Fask
  • (indexed by a list a1, a2, ..., ak)
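
The slide lists only the ingredients; here is a sketch of the actual computation using the standard Gaussian conditioning formulas, assuming a zero prior mean, a kernel k, and an observation-noise variance s2_n (names are mine, not the slides'):

    import numpy as np

    def gp_posterior(x_obs, f_obs, x_ask, k, s2_n=1e-2):
        # Posterior mean and covariance of F at the query points x_ask,
        # given observations f_obs at x_obs, under kernel k and noise s2_n.
        K_oo = k(x_obs[:, None], x_obs[None, :]) + s2_n * np.eye(len(x_obs))
        K_ao = k(x_ask[:, None], x_obs[None, :])
        K_aa = k(x_ask[:, None], x_ask[None, :])
        L = np.linalg.cholesky(K_oo)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, f_obs))
        mean = K_ao @ alpha                      # posterior mean at x_ask
        V = np.linalg.solve(L, K_ao.T)
        cov = K_aa - V.T @ V                     # posterior covariance at x_ask
        return mean, cov

With the kernel sketched earlier, gp_posterior(x_obs, f_obs, x_ask, kernel) gives the posterior that the candidate-selection ideas below would query.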

29
What now?
  • We constantly have a model Fpost of our function
    F
  • As we accumulate data, the model improves
  • How should we accumulate data?
  • Use the model to select which point to sample next

30
Candidate Selection
  • Caveat: This will take some work.
  • Only useful if F is expensive to evaluate
  • Dog walking is 10s per evaluation
  • Here are some ideas

31
Idea 0: Min Posterior Mean
  • For which point x does F(x) have the lowest
    posterior mean?
  • This is, in general, a non-convex, global
    optimization problem.
  • WHAT??!!
  • I know, but remember F is expensive
  • Problems
  • Trapped in local minima (below prior mean)
  • Does not acknowledge model uncertainty
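
A grid-based sketch of Idea 0 (the grid stands in for the global optimization the slide mentions; gp_posterior and kernel are the hypothetical helpers sketched above):

    import numpy as np

    def next_point_idea0(x_obs, f_obs, grid):
        # Query wherever the posterior mean is lowest; ignores uncertainty entirely.
        mean, _ = gp_posterior(x_obs, f_obs, grid, kernel)
        return grid[np.argmin(mean)]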

32
Idea 1: Thompson Sampling
  • Idea 0 is too dumb.
  • Sample a function from the posterior
  • Can only be sampled at a list of points
  • Minimize the sampled function
  • This returns a sample from P(X is min)
  • XT ~ P(F(X) < F(X') ∀ X')
  • Problems
  • Explores too much
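
A sketch of Thompson sampling on the same grid (again using the hypothetical gp_posterior and kernel from above):

    import numpy as np

    def next_point_thompson(x_obs, f_obs, grid, rng):
        # Draw one function from the posterior (realized only at the grid points)
        # and minimize that draw; the result is a sample from P(X is the minimizer).
        mean, cov = gp_posterior(x_obs, f_obs, grid, kernel)
        draw = rng.multivariate_normal(mean, cov + 1e-9 * np.eye(len(grid)))
        return grid[np.argmin(draw)]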

33
Idea 2: PMin
  • Thompson is more sensible, but still not quite
    right
  • Why not select
  • x = argmax_X P(F(X) < F(X') ∀ X')
  • i.e., sample F(x) next where x is most likely to
    be the minimum of the function
  • Because it's hard
  • Or at least I can't do it. Our domain's kinda
    big.
  • We can simulate this with repeated Thompson
    sampling
  • This is the optimal greedy action
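
A sketch of the "simulate PMin with repeated Thompson sampling" idea: approximate P(x is the minimizer) on a grid by counting which point wins across many posterior draws (helpers as above):

    import numpy as np

    def next_point_pmin(x_obs, f_obs, grid, rng, n_draws=200):
        mean, cov = gp_posterior(x_obs, f_obs, grid, kernel)
        cov = cov + 1e-9 * np.eye(len(grid))
        wins = np.zeros(len(grid))
        for _ in range(n_draws):
            draw = rng.multivariate_normal(mean, cov)
            wins[np.argmin(draw)] += 1           # this grid point was the minimizer
        return grid[np.argmax(wins)]             # most probable minimizer on the grid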

34
Finally
  • AIBOs!

35
Seriously
  • did we have to go through all that to get to the
    dogs?
  • Yeah, sorry.
  • Now it's easy
  • An AIBO walk is a map from R^51 to R
  • 51 motion parameters map to velocity
  • Optimize the parameters to go fast

36
AIBO Walking
  • Set up a Gaussian process over R^51
  • Kernel is also Gaussian (careful!)
  • Parameters for priors found by maximum likelihood
  • We could be more Bayesian here and use priors
    over the model parameters
  • Walk, get velocity, pick new parameters, walk
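
A sketch of a Gaussian (squared-exponential) kernel over R^51 gait-parameter vectors; the hyperparameters below are placeholders standing in for the values the slide says were found by maximum likelihood:

    import numpy as np

    def walk_kernel(X1, X2, ell=1.0, sig2=1.0):
        # X1: (n, 51), X2: (m, 51) arrays of gait parameters; ell and sig2 are
        # placeholder hyperparameters that would be fit by maximizing the
        # marginal likelihood of the observed walks.
        d = (X1[:, None, :] - X2[None, :, :]) / ell
        return sig2 * np.exp(-0.5 * np.sum(d ** 2, axis=-1))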

37
AIBO Walking
  • Results?
  • Pretty not too bad!
  • But not earth-shattering
  • Even choosing uniformly random parameters isn't
    too bad
  • But using, say, PMax shows a definite improvement
  • I think there's more structure in the problem
    than people let on

38
Whew.
  • I need a break
  • Does anybody have any questions or brilliant
    ideas?