Title: Fast Probabilistic Modeling for Combinatorial Optimization
1. Fast Probabilistic Modeling for Combinatorial Optimization
- Scott Davies
- School of Computer Science
- Carnegie Mellon University
- (and Justsystem Pittsburgh Research Center)
- Joint work with Shumeet Baluja
2. Combinatorial Optimization
- Maximize evaluation function f(x)
  - input: fixed-length bitstring x = (x1, x2, ..., xn)
  - output: real value
- x might represent:
  - job shop schedules
  - TSP tours
  - discretized numeric values
  - etc.
- Our focus: black-box optimization
  - No domain-dependent heuristics
[Diagram: bitstring x = 1001001... goes into black box f(x), which outputs 37.4]
3. Commonly Used Approaches
- Hill-climbing, simulated annealing
  - Generate candidate solutions neighboring a single current working solution (e.g. differing by one bit)
  - Typically make no attempt to model how particular bits affect solution quality
- Genetic algorithms
  - Attempt to implicitly capture the dependency of solution quality on bit values by maintaining a population of candidate solutions
  - Use crossover and mutation operators on population members to generate new candidate solutions
4. Using Explicit Probabilistic Models
- Maintain an explicit probability distribution P from which we generate new candidate solutions
- Initialize P to the uniform distribution
- Until termination criteria met:
  - Stochastically generate K candidate solutions from P
  - Evaluate them
  - Update P to make it more likely to generate solutions similar to the good solutions
- Return best bitstring ever evaluated
- Several different choices for what sort of P to use and how to update it after candidate solution evaluation
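A minimal sketch of this generic loop (not from the talk; sample_model and update_model are placeholders for whichever model family gets plugged in):

```python
def model_based_optimize(f, model, sample_model, update_model,
                         K=100, iterations=100):
    """Generic model-based optimization: sample K candidates from an
    explicit distribution P (the model), evaluate them, and update the
    model to favor the better solutions."""
    best_val, best_x = float("-inf"), None
    for _ in range(iterations):
        candidates = [sample_model(model) for _ in range(K)]
        evals = sorted(((f(x), x) for x in candidates),
                       key=lambda t: t[0], reverse=True)
        if evals[0][0] > best_val:
            best_val, best_x = evals[0][0], evals[0][1][:]
        model = update_model(model, evals)   # pull P toward good solutions
    return best_x, best_val
```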
5. Population-Based Incremental Learning
- Population-Based Incremental Learning (PBIL) [Baluja, 1995]
- Maintains a vector of probabilities: one independent probability P(xi = 1) for each bit xi (each initialized to 0.5)
- Until termination criteria met:
  - Generate a population of K bitstrings from P. Evaluate them.
  - For each of the best M bitstrings of these K, adjust P to make that bitstring more likely
- Also sometimes uses mutation and/or pushes P away from bad solutions
- Often works very well (compared to hillclimbing and GAs) despite its independence assumption
- Update rule, with learning rate α:
  - P(xi = 1) ← (1 − α)·P(xi = 1) + α   if xi is set to 1
  - P(xi = 1) ← (1 − α)·P(xi = 1)       if xi is set to 0
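A hedged sketch of this loop with PBIL's independent-bit model, using the update rule above and omitting mutation and learning away from bad solutions:

```python
import random

def pbil(f, n, K=50, M=1, alpha=0.1, iterations=1000):
    """PBIL sketch: one independent P(x_i = 1) per bit, nudged toward
    each of the best M bitstrings every generation."""
    p = [0.5] * n                                   # P(x_i = 1)
    best_val, best_x = float("-inf"), None
    for _ in range(iterations):
        pop = [[int(random.random() < p[i]) for i in range(n)]
               for _ in range(K)]
        evals = sorted(((f(x), x) for x in pop),
                       key=lambda t: t[0], reverse=True)
        if evals[0][0] > best_val:
            best_val, best_x = evals[0][0], evals[0][1][:]
        for _, x in evals[:M]:                      # best M of the K
            for i in range(n):
                p[i] = (1 - alpha) * p[i] + alpha * x[i]
    return best_x, best_val
```

For example, pbil(lambda x: sum(x), n=20) drifts the probability vector toward the all-ones string.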
6. Modeling Inter-Bit Dependencies
- MIMIC [De Bonet et al., 1997]: sort bits into a Markov chain in which each bit's distribution is conditioned on the value of the previous bit in the chain
  - e.g. P(x1, x2, ..., xn) = P(x3)·P(x6|x3)·P(x2|x6)···
- Generate chain from the best N% of bitstrings evaluated so far; generate new bitstrings from chain; repeat
- More general framework: Bayesian networks
7. Bayesian Networks
- P(x1, ..., xn) = P(x3)·P(x12)·P(x2|x3, x12)·P(x6|x3)···
- Specified by:
  - Network structure (must be a DAG)
  - Probability distribution for each variable (bit) given each possible combination of its parents' values
8. Optimization with Bayesian Networks
- Two operations we need to perform during optimization:
  - Given a network, generate bitstrings from the probability distribution represented by that network
    - Trivial with Bayesian networks
  - Given a set of good bitstrings, find the network most likely to have generated them
    - Given a fixed network structure, trivial to fill in the probability tables optimally
    - But what structure should we use?
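The "trivial" generation step is ancestral sampling: visit the bits in a topological order of the DAG and sample each one given its already-sampled parents. A sketch, where the network representation (parent lists plus conditional probability tables) is my own assumption:

```python
import random

def sample_bitstring(order, parents, cpt):
    """Ancestral sampling from a Bayesian network over bits.
    order:   topological order of the DAG's variables
    parents: parents[i] = list of x_i's parent indices
    cpt:     cpt[i][parent_values_tuple] = P(x_i = 1 | parents)"""
    x = [0] * len(order)
    for i in order:
        pv = tuple(x[j] for j in parents[i])   # parents already sampled
        x[i] = int(random.random() < cpt[i][pv])
    return x
```

Filling in cpt for a fixed structure is likewise just counting: set each entry to the empirical frequency of x_i = 1 among datapoints matching that parent configuration.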
9. Learning Network Structures from Data
- Problem statement: given a dataset D and a set of allowable Bayesian networks, find the network B with the maximum posterior probability
  - i.e., maximize P(B|D) ∝ P(D|B)·P(B)
10. Learning Network Structures from Data
- Maximizing log P(D|B) reduces to maximizing −Σi H(xi | Πi)
- where:
  - Πi is the set of xi's parents in B
  - H(xi | Πi) is the conditional entropy of xi given its parents, under the empirical distribution exhibited in D
11. Single-Parent Tree-Shaped Networks
- Let's allow each bit to be conditioned on at most one other bit.
- Adding an arc from xj to xi increases the network score by:
  - H(xi) − H(xi|xj)   (xj's information gain with xi)
  - = H(xj) − H(xj|xi)   (not necessarily obvious, but true)
  - = I(xi, xj), the mutual information between xi and xj
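Empirical mutual information between two bits can be computed straight from counts in the dataset; a minimal sketch (helper name my own):

```python
from math import log2

def mutual_information(data, i, j):
    """I(x_i; x_j) = H(x_i) + H(x_j) - H(x_i, x_j), all entropies taken
    under the empirical distribution of 'data' (a list of bitstrings)."""
    n = len(data)
    def H(counts):
        return -sum(c / n * log2(c / n) for c in counts if c > 0)
    ci = [sum(1 for x in data if x[i] == b) for b in (0, 1)]
    cj = [sum(1 for x in data if x[j] == b) for b in (0, 1)]
    cij = [sum(1 for x in data if (x[i], x[j]) == (a, b))
           for a in (0, 1) for b in (0, 1)]
    return H(ci) + H(cj) - H(cij)
```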
12. Optimal Single-Parent Tree-Shaped Networks
- To find the optimal single-parent tree-shaped network, just find the maximum spanning tree using I(xi, xj) as the weight for the edge between xi and xj [Chow and Liu, 1968]
- Can be done in O(n²) time (assuming D has already been reduced to sufficient statistics)
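A sketch of the Chow–Liu step via Prim's algorithm, reusing the mutual_information helper above (datapoint weighting omitted for brevity):

```python
def chow_liu_tree(data, n):
    """Optimal single-parent tree: maximum spanning tree over the
    I(x_i, x_j) edge weights (Prim's algorithm). Returns parent[i]
    for each bit; the root's parent is None."""
    in_tree = {0}                                  # arbitrary root
    parent = {0: None}
    best = {i: (mutual_information(data, 0, i), 0) for i in range(1, n)}
    while len(in_tree) < n:
        i = max((k for k in range(n) if k not in in_tree),
                key=lambda k: best[k][0])
        parent[i] = best[i][1]
        in_tree.add(i)
        for j in range(n):                         # refresh candidate edges
            if j not in in_tree:
                w = mutual_information(data, i, j)
                if w > best[j][0]:
                    best[j] = (w, i)
    return parent
```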
13. Optimal Dependency Trees for Combinatorial Optimization
- [Baluja & Davies, 1997]
- Start with a dataset D initialized from the uniform distribution
- Until termination criteria met:
  - Build optimal dependency tree T with which to model D.
  - Generate K bitstrings from the probability distribution represented by T. Evaluate them.
  - Add the best M bitstrings to D after decaying the weight of all datapoints already in D by a factor α between 0 and 1.
- Return best bitstring ever evaluated.
- Running time: O(Kn + Mn²) per iteration
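The dataset bookkeeping in this loop (decay every existing weight by α, then add the best M new strings at full weight) might look like the following sketch; the tree fit would then use these weights in its counts, which the simpler helpers above ignore:

```python
def update_dataset(D, weights, evals, M, alpha):
    """Decay all existing datapoint weights by alpha (0 < alpha < 1),
    then append the best M freshly evaluated bitstrings at weight 1.
    'evals' is a list of (value, bitstring) pairs."""
    weights = [w * alpha for w in weights]
    evals = sorted(evals, key=lambda t: t[0], reverse=True)
    for _, x in evals[:M]:
        D.append(x)
        weights.append(1.0)
    return D, weights
```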
14. Graph-Coloring Example
- Noisy graph-coloring example
- For each edge connecting vertices of different colors, add 1 to the evaluation function with probability 0.5
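The evaluation function as described, as a small sketch (assuming a two-coloring with one bit per vertex; the encoding is my assumption):

```python
import random

def noisy_coloring_value(bits, edges):
    """For each edge whose endpoints get different colors, add 1 with
    probability 0.5. Two colors, one bit per vertex (assumed encoding)."""
    return sum(1 for (u, v) in edges
               if bits[u] != bits[v] and random.random() < 0.5)
```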
15. Optimal Dependency Tree Results
- Tested the following optimization algorithms on a variety of small problems, up to 256 bits long:
  - Optimization with optimal dependency trees
  - Optimization with chain-shaped networks (à la MIMIC)
  - PBIL
  - Hillclimbing
  - Genetic algorithms
- General trend:
  - Hillclimbing & GAs did relatively poorly
  - Among probabilistic methods, PBIL < Chains < Trees: more accurate models are better
16. Modeling Higher-Order Dependencies
- The maximum spanning tree algorithm gives us the optimal Bayesian network in which each node has at most one parent.
- What about finding the best network in which each node has at most K parents, for K > 1?
  - NP-complete problem! [Chickering et al., 1995]
- However, we can use search heuristics to look for good network structures (e.g., [Heckerman et al., 1995]), e.g. hillclimbing.
17. Bayesian Network-Based Combinatorial Optimization
- Initialize D with C bitstrings from the uniform distribution, and Bayesian network B to the empty network containing no edges
- Until termination criteria met:
  - Perform steepest-ascent hillclimbing from B to find locally optimal network B'. Repeat until no changes increase the score:
    - Evaluate how each possible edge addition, removal, or reversal would affect the penalized log-likelihood score
    - Perform the change that maximizes the increase in score.
  - Set B to B'.
  - Generate and evaluate K bitstrings from B.
  - Decay the weight of datapoints in D by α.
  - Add the best M of the K recently generated datapoints to D.
- Return best bitstring ever evaluated.
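A compact sketch of this structure search: score each bit's parent set by −H(xi | Πi) minus a per-arc penalty (the penalty form is my assumption), and greedily apply the best single-arc addition or removal while keeping the graph acyclic. Arc reversals and the efficiency hacks mentioned on the next slide are omitted.

```python
from collections import Counter
from math import log2

def cond_entropy(data, i, parents):
    """Empirical conditional entropy H(x_i | parents) over 'data'."""
    n = len(data)
    joint = Counter((tuple(x[j] for j in parents), x[i]) for x in data)
    marg = Counter(tuple(x[j] for j in parents) for x in data)
    return -sum((c / n) * log2(c / marg[pv]) for (pv, _), c in joint.items())

def creates_cycle(parents, child, new_parent):
    """Would the arc new_parent -> child create a directed cycle?
    Yes iff child is already an ancestor of new_parent."""
    stack, seen = [new_parent], set()
    while stack:
        node = stack.pop()
        if node == child:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False

def hillclimb_structure(data, n, penalty=0.5):
    """Steepest-ascent search over single-arc additions/removals. Only
    x_i's local score term changes when its parent set changes."""
    parents = {i: [] for i in range(n)}

    def local(i, ps):
        return -cond_entropy(data, i, ps) - penalty * len(ps)

    while True:
        best_gain, best_move = 1e-12, None
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                if j in parents[i]:                     # try removing j -> i
                    trial = [p for p in parents[i] if p != j]
                elif not creates_cycle(parents, i, j):  # try adding j -> i
                    trial = parents[i] + [j]
                else:
                    continue
                gain = local(i, trial) - local(i, parents[i])
                if gain > best_gain:
                    best_gain, best_move = gain, (i, trial)
        if best_move is None:
            return parents
        i, trial = best_move
        parents[i] = trial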
18. Bayesian Networks: Results
- Does better than the tree-based optimization algorithm on some toy problems
- Roughly the same as the tree-based algorithm on others
- Significantly more computation than the tree-based algorithm, however, despite efficiency hacks
- Why not much better results?
  - Too much emphasis on exploitation rather than exploration?
  - Steepest-ascent hillclimbing over network structures not good enough, particularly when starting from old networks?
19. Using Probabilistic Models for Intelligent Restarts
- The tree-based algorithm's O(n²) execution time per iteration is very expensive for large problems
  - Even more so for the algorithm based on more complicated Bayesian networks
- One possible approach: use probabilistic models to select good starting points for faster optimization algorithms, e.g. hillclimbing or PBIL
20. COMIT
- Combining Optimizers with Mutual Information Trees [Baluja and Davies, 1998]
- Initialize dataset D with bitstrings drawn from the uniform distribution
- Until termination criteria met:
  - Build optimal dependency tree T with which to model D.
  - Use T to stochastically generate K bitstrings. Evaluate them.
  - Execute a fast-search procedure, initialized with the best solutions generated from T.
  - Replace up to M of the worst bitstrings in D with the best bitstrings found during the fast-search run just executed.
- Return best bitstring ever evaluated
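COMIT's outer loop as a hedged sketch: fit_tree and sample_tree could be the Chow–Liu and ancestral-sampling helpers sketched earlier, and fast_search is any procedure that takes a starting bitstring and returns the bitstrings it visited.

```python
import random

def comit(f, n, fit_tree, sample_tree, fast_search,
          D_size=1000, K=100, M=100, iterations=50):
    """COMIT sketch: a dependency tree models dataset D, the tree's best
    sample seeds a fast search, and the search's best results replace
    the worst datapoints in D."""
    D = [[random.randint(0, 1) for _ in range(n)] for _ in range(D_size)]
    best_val, best_x = float("-inf"), None
    for _ in range(iterations):
        tree = fit_tree(D)
        samples = sorted((sample_tree(tree) for _ in range(K)),
                         key=f, reverse=True)
        visited = sorted(fast_search(f, samples[0]), key=f, reverse=True)
        D.sort(key=f)                      # worst datapoints first
        D[:M] = visited[:M]                # replace up to M of the worst
        val, x = max(((f(y), y) for y in visited + samples),
                     key=lambda t: t[0])
        if val > best_val:
            best_val, best_x = val, x[:]
    return best_x, best_val
```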
21. COMIT with Hill-climbing
- Use the single best bitstring generated by the tree as the starting point of a stochastic hillclimbing algorithm that allows at most PATIENCE moves to points of equal value before giving up (see the sketch below)
- |D| kept at 1000; M set to 100.
- We compare COMIT vs.:
  - Hillclimbing with restarts from bitstrings chosen randomly from the uniform distribution
  - Hillclimbing with restarts from the best bitstring out of K chosen randomly from the uniform distribution
22. COMIT w/Hillclimbing: Example of Behavior
- TSP domain: 100 cities, 700 bits
[Plot: Tour Length (×10³) vs. Evaluation Number (×10³), comparing Hillclimber and COMIT]
23. COMIT w/Hillclimbing: Results
- Each number is an average over at least 25 runs
- Each algorithm given 200,000 evaluation function calls
- Highlighted: better than each non-COMIT hillclimber with confidence > 95%
- AHCxxx: pick best of xxx randomly generated starting points before hillclimbing
- COMITxxx: pick best of xxx starting points generated by tree before hillclimbing
24. COMIT with PBIL
- Generate K samples from T; evaluate them
- Initialize PBIL's P vector according to the unconditional distributions contained in the best C of these examples
- PBIL run terminated after 5000 evaluations without improvement
- |D| kept at 1000; M set to 100 (as before)
- PBIL parameters: α = 0.15; update P with the single best example out of 50 each iteration; some mutation of P used as well
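Initializing PBIL's vector from the tree's samples is just per-bit frequencies over the best C examples; a minimal sketch:

```python
def init_pbil_vector(best_samples):
    """Set each P(x_i = 1) to the unconditional frequency of bit i
    among the best C tree-generated samples."""
    C, n = len(best_samples), len(best_samples[0])
    return [sum(x[i] for x in best_samples) / C for i in range(n)]
```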
25. COMIT w/PBIL: Results
- Each number is an average over at least 50 runs
- Each algorithm given 600,000 evaluation function calls
- Highlighted: better than opposing algorithm(s) with confidence > 95%
- PBILR: PBIL with restart after 5000 evaluations without improvement (P reinitialized to 0.5)
26. Conclusions & Future Work
- COMIT makes probabilistic modeling applicable to much larger problems
- COMIT led to significant improvements over the baseline search algorithm in most problems tested
- Future work:
  - Applying COMIT to more interesting problems
  - Using COMIT to combine results of multiple search algorithms
  - Incorporating domain knowledge
27. Future Work
- Making the algorithm based on complex Bayesian networks more practical
  - Combine w/ simpler search algorithms, à la COMIT?
- Applying COMIT to more interesting problems and other baseline search algorithms
  - WALKSAT?
- Using more sophisticated probabilistic models
- Using COMIT to combine results of multiple search algorithms