Title: Fast Probabilistic Modeling for Combinatorial Optimization
1. Fast Probabilistic Modeling for Combinatorial Optimization
- Scott Davies
- School of Computer Science
- Carnegie Mellon University
- (and Justsystem Pittsburgh Research Center)
- Joint work with Shumeet Baluja
2. Combinatorial Optimization
- Maximize evaluation function f(x)
  - input: fixed-length bitstring x = (x1, x2, ..., xn)
  - output: real value
- x might represent:
  - job shop schedules
  - TSP tours
  - discretized numeric values
  - etc.
- Our focus: black-box optimization
  - No domain-dependent heuristics
[Diagram: bitstring x = 1001001... goes into black box f(x), which outputs 37.4]
3. Commonly Used Approaches
- Hill-climbing, simulated annealing
  - Generate candidate solutions neighboring a single current working solution (e.g. differing by one bit)
  - Typically make no attempt to model how particular bits affect solution quality
- Genetic algorithms
  - Attempt to implicitly capture the dependency of solution quality on bit values by maintaining a population of candidate solutions
  - Use crossover and mutation operators on population members to generate new candidate solutions
4. Using Explicit Probabilistic Models
- Maintain an explicit probability distribution P from which we generate new candidate solutions
- Initialize P to the uniform distribution
- Until termination criteria met:
  - Stochastically generate K candidate solutions from P
  - Evaluate them
  - Update P to make it more likely to generate solutions similar to the good solutions
- Return best bitstring ever evaluated
- Several different choices for what sort of P to use and how to update it after candidate solution evaluation
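A minimal sketch of this generic loop (not from the talk; sample_model and update_model are placeholders for whichever model family gets plugged in):

```python
def model_based_optimize(f, model, sample_model, update_model,
                         K=100, iterations=100):
    """Generic model-based optimization: sample K candidates from an
    explicit distribution P (the model), evaluate them, and update the
    model to favor the better solutions."""
    best_val, best_x = float("-inf"), None
    for _ in range(iterations):
        candidates = [sample_model(model) for _ in range(K)]
        evals = sorted(((f(x), x) for x in candidates),
                       key=lambda t: t[0], reverse=True)
        if evals[0][0] > best_val:
            best_val, best_x = evals[0][0], evals[0][1][:]
        model = update_model(model, evals)   # pull P toward good solutions
    return best_x, best_val
```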
5. Population-Based Incremental Learning
- Population-Based Incremental Learning (PBIL) [Baluja, 1995]
- Maintains a vector of probabilities: one independent probability P(xi = 1) for each bit xi (each initialized to 0.5)
- Until termination criteria met:
  - Generate a population of K bitstrings from P. Evaluate them.
  - For each of the best M bitstrings of these K, adjust P to make that bitstring more likely
- Also sometimes uses mutation and/or pushes P away from bad solutions
- Often works very well (compared to hillclimbing and GAs) despite its independence assumption
- Update rule, with learning rate α:
  - P(xi = 1) ← (1 − α)·P(xi = 1) + α   if xi is set to 1
  - P(xi = 1) ← (1 − α)·P(xi = 1)       if xi is set to 0
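A hedged sketch of this loop with PBIL's independent-bit model, using the update rule above and omitting mutation and learning away from bad solutions:

```python
import random

def pbil(f, n, K=50, M=1, alpha=0.1, iterations=1000):
    """PBIL sketch: one independent P(x_i = 1) per bit, nudged toward
    each of the best M bitstrings every generation."""
    p = [0.5] * n                                   # P(x_i = 1)
    best_val, best_x = float("-inf"), None
    for _ in range(iterations):
        pop = [[int(random.random() < p[i]) for i in range(n)]
               for _ in range(K)]
        evals = sorted(((f(x), x) for x in pop),
                       key=lambda t: t[0], reverse=True)
        if evals[0][0] > best_val:
            best_val, best_x = evals[0][0], evals[0][1][:]
        for _, x in evals[:M]:                      # best M of the K
            for i in range(n):
                p[i] = (1 - alpha) * p[i] + alpha * x[i]
    return best_x, best_val
```

For example, pbil(lambda x: sum(x), n=20) drifts the probability vector toward the all-ones string.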
6. Modeling Inter-Bit Dependencies
- MIMIC [De Bonet et al., 1997]: sort bits into a Markov chain in which each bit's distribution is conditioned on the value of the previous bit in the chain
  - e.g. P(x1, x2, ..., xn) = P(x3)·P(x6|x3)·P(x2|x6)···
- Generate chain from the best N% of bitstrings evaluated so far; generate new bitstrings from chain; repeat
- More general framework: Bayesian networks
7. Bayesian Networks
- P(x1, ..., xn) = P(x3)·P(x12)·P(x2|x3, x12)·P(x6|x3)···
- Specified by:
  - Network structure (must be a DAG)
  - Probability distribution for each variable (bit) given each possible combination of its parents' values
8. Optimization with Bayesian Networks
- Two operations we need to perform during optimization:
  - Given a network, generate bitstrings from the probability distribution represented by that network
    - Trivial with Bayesian networks
  - Given a set of good bitstrings, find the network most likely to have generated them
    - Given a fixed network structure, trivial to fill in the probability tables optimally
    - But what structure should we use?
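The "trivial" generation step is ancestral sampling: visit the bits in a topological order of the DAG and sample each one given its already-sampled parents. A sketch, where the network representation (parent lists plus conditional probability tables) is my own assumption:

```python
import random

def sample_bitstring(order, parents, cpt):
    """Ancestral sampling from a Bayesian network over bits.
    order:   topological order of the DAG's variables
    parents: parents[i] = list of x_i's parent indices
    cpt:     cpt[i][parent_values_tuple] = P(x_i = 1 | parents)"""
    x = [0] * len(order)
    for i in order:
        pv = tuple(x[j] for j in parents[i])   # parents already sampled
        x[i] = int(random.random() < cpt[i][pv])
    return x
```

Filling in cpt for a fixed structure is likewise just counting: set each entry to the empirical frequency of x_i = 1 among datapoints matching that parent configuration.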
9. Learning Network Structures from Data
- Problem statement: given a dataset D and a set of allowable Bayesian networks, find the network B with the maximum posterior probability
  - i.e., maximize P(B|D) ∝ P(D|B)·P(B)
10. Learning Network Structures from Data
- Maximizing log P(D|B) reduces to maximizing −Σi H(xi | Πi)
- where:
  - Πi is the set of xi's parents in B
  - H(xi | Πi) is the conditional entropy of xi given its parents, under the empirical distribution exhibited in D
11. Single-Parent Tree-Shaped Networks
- Let's allow each bit to be conditioned on at most one other bit.
- Adding an arc from xj to xi increases the network score by:
  - H(xi) − H(xi|xj)   (xj's information gain with xi)
  - = H(xj) − H(xj|xi)   (not necessarily obvious, but true)
  - = I(xi, xj), the mutual information between xi and xj
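Empirical mutual information between two bits can be computed straight from counts in the dataset; a minimal sketch (helper name my own):

```python
from math import log2

def mutual_information(data, i, j):
    """I(x_i; x_j) = H(x_i) + H(x_j) - H(x_i, x_j), all entropies taken
    under the empirical distribution of 'data' (a list of bitstrings)."""
    n = len(data)
    def H(counts):
        return -sum(c / n * log2(c / n) for c in counts if c > 0)
    ci = [sum(1 for x in data if x[i] == b) for b in (0, 1)]
    cj = [sum(1 for x in data if x[j] == b) for b in (0, 1)]
    cij = [sum(1 for x in data if (x[i], x[j]) == (a, b))
           for a in (0, 1) for b in (0, 1)]
    return H(ci) + H(cj) - H(cij)
```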
12. Optimal Single-Parent Tree-Shaped Networks
- To find the optimal single-parent tree-shaped network, just find the maximum spanning tree using I(xi, xj) as the weight for the edge between xi and xj [Chow and Liu, 1968]
- Can be done in O(n²) time (assuming D has already been reduced to sufficient statistics)
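A sketch of the Chow–Liu step via Prim's algorithm, reusing the mutual_information helper above (datapoint weighting omitted for brevity):

```python
def chow_liu_tree(data, n):
    """Optimal single-parent tree: maximum spanning tree over the
    I(x_i, x_j) edge weights (Prim's algorithm). Returns parent[i]
    for each bit; the root's parent is None."""
    in_tree = {0}                                  # arbitrary root
    parent = {0: None}
    best = {i: (mutual_information(data, 0, i), 0) for i in range(1, n)}
    while len(in_tree) < n:
        i = max((k for k in range(n) if k not in in_tree),
                key=lambda k: best[k][0])
        parent[i] = best[i][1]
        in_tree.add(i)
        for j in range(n):                         # refresh candidate edges
            if j not in in_tree:
                w = mutual_information(data, i, j)
                if w > best[j][0]:
                    best[j] = (w, i)
    return parent
```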
13. Optimal Dependency Trees for Combinatorial Optimization
- [Baluja & Davies, 1997]
- Start with a dataset D initialized from the uniform distribution
- Until termination criteria met:
  - Build optimal dependency tree T with which to model D.
  - Generate K bitstrings from the probability distribution represented by T. Evaluate them.
  - Add the best M bitstrings to D after decaying the weight of all datapoints already in D by a factor α between 0 and 1.
- Return best bitstring ever evaluated.
- Running time: O(Kn + Mn²) per iteration
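The dataset bookkeeping in this loop (decay every existing weight by α, then add the best M new strings at full weight) might look like the following sketch; the tree fit would then use these weights in its counts, which the simpler helpers above ignore:

```python
def update_dataset(D, weights, evals, M, alpha):
    """Decay all existing datapoint weights by alpha (0 < alpha < 1),
    then append the best M freshly evaluated bitstrings at weight 1.
    'evals' is a list of (value, bitstring) pairs."""
    weights = [w * alpha for w in weights]
    evals = sorted(evals, key=lambda t: t[0], reverse=True)
    for _, x in evals[:M]:
        D.append(x)
        weights.append(1.0)
    return D, weights
```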
14. Graph-Coloring Example
- Noisy graph-coloring example
- For each edge connecting vertices of different colors, add 1 to the evaluation function with probability 0.5
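The evaluation function as described, as a small sketch (assuming a two-coloring with one bit per vertex; the encoding is my assumption):

```python
import random

def noisy_coloring_value(bits, edges):
    """For each edge whose endpoints get different colors, add 1 with
    probability 0.5. Two colors, one bit per vertex (assumed encoding)."""
    return sum(1 for (u, v) in edges
               if bits[u] != bits[v] and random.random() < 0.5)
```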
15. Optimal Dependency Tree Results
- Tested the following optimization algorithms on a variety of small problems, up to 256 bits long:
  - Optimization with optimal dependency trees
  - Optimization with chain-shaped networks (à la MIMIC)
  - PBIL
  - Hillclimbing
  - Genetic algorithms
- General trend:
  - Hillclimbing & GAs did relatively poorly
  - Among probabilistic methods, PBIL < Chains < Trees: more accurate models are better
16. Modeling Higher-Order Dependencies
- The maximum spanning tree algorithm gives us the optimal Bayesian network in which each node has at most one parent.
- What about finding the best network in which each node has at most K parents, for K > 1?
  - NP-complete problem! [Chickering et al., 1995]
- However, we can use search heuristics to look for good network structures (e.g., [Heckerman et al., 1995]), e.g. hillclimbing.
17. Bayesian Network-Based Combinatorial Optimization
- Initialize D with C bitstrings from the uniform distribution, and Bayesian network B to the empty network containing no edges
- Until termination criteria met:
  - Perform steepest-ascent hillclimbing from B to find locally optimal network B'. Repeat until no changes increase the score:
    - Evaluate how each possible edge addition, removal, or reversal would affect the penalized log-likelihood score
    - Perform the change that maximizes the increase in score.
  - Set B to B'.
  - Generate and evaluate K bitstrings from B.
  - Decay the weight of datapoints in D by α.
  - Add the best M of the K recently generated datapoints to D.
- Return best bitstring ever evaluated.
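A compact sketch of this structure search: score each bit's parent set by −H(xi | Πi) minus a per-arc penalty (the penalty form is my assumption), and greedily apply the best single-arc addition or removal while keeping the graph acyclic. Arc reversals and the efficiency hacks mentioned on the next slide are omitted.

```python
from collections import Counter
from math import log2

def cond_entropy(data, i, parents):
    """Empirical conditional entropy H(x_i | parents) over 'data'."""
    n = len(data)
    joint = Counter((tuple(x[j] for j in parents), x[i]) for x in data)
    marg = Counter(tuple(x[j] for j in parents) for x in data)
    return -sum((c / n) * log2(c / marg[pv]) for (pv, _), c in joint.items())

def creates_cycle(parents, child, new_parent):
    """Would the arc new_parent -> child create a directed cycle?
    Yes iff child is already an ancestor of new_parent."""
    stack, seen = [new_parent], set()
    while stack:
        node = stack.pop()
        if node == child:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False

def hillclimb_structure(data, n, penalty=0.5):
    """Steepest-ascent search over single-arc additions/removals. Only
    x_i's local score term changes when its parent set changes."""
    parents = {i: [] for i in range(n)}

    def local(i, ps):
        return -cond_entropy(data, i, ps) - penalty * len(ps)

    while True:
        best_gain, best_move = 1e-12, None
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                if j in parents[i]:                     # try removing j -> i
                    trial = [p for p in parents[i] if p != j]
                elif not creates_cycle(parents, i, j):  # try adding j -> i
                    trial = parents[i] + [j]
                else:
                    continue
                gain = local(i, trial) - local(i, parents[i])
                if gain > best_gain:
                    best_gain, best_move = gain, (i, trial)
        if best_move is None:
            return parents
        i, trial = best_move
        parents[i] = trial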
18. Bayesian Networks: Results
- Does better than the tree-based optimization algorithm on some toy problems
- Roughly the same as the tree-based algorithm on others
- Significantly more computation than the tree-based algorithm, however, despite efficiency hacks
- Why not much better results?
  - Too much emphasis on exploitation rather than exploration?
  - Steepest-ascent hillclimbing over network structures not good enough, particularly when starting from old networks?
19. Using Probabilistic Models for Intelligent Restarts
- The tree-based algorithm's O(n²) execution time per iteration is very expensive for large problems
  - Even more so for the algorithm based on more complicated Bayesian networks
- One possible approach: use probabilistic models to select good starting points for faster optimization algorithms, e.g. hillclimbing or PBIL
20. COMIT
- Combining Optimizers with Mutual Information Trees [Baluja and Davies, 1998]
- Initialize dataset D with bitstrings drawn from the uniform distribution
- Until termination criteria met:
  - Build optimal dependency tree T with which to model D.
  - Use T to stochastically generate K bitstrings. Evaluate them.
  - Execute a fast-search procedure, initialized with the best solutions generated from T.
  - Replace up to M of the worst bitstrings in D with the best bitstrings found during the fast-search run just executed.
- Return best bitstring ever evaluated
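COMIT's outer loop as a hedged sketch: fit_tree and sample_tree could be the Chow–Liu and ancestral-sampling helpers sketched earlier, and fast_search is any procedure that takes a starting bitstring and returns the bitstrings it visited.

```python
import random

def comit(f, n, fit_tree, sample_tree, fast_search,
          D_size=1000, K=100, M=100, iterations=50):
    """COMIT sketch: a dependency tree models dataset D, the tree's best
    sample seeds a fast search, and the search's best results replace
    the worst datapoints in D."""
    D = [[random.randint(0, 1) for _ in range(n)] for _ in range(D_size)]
    best_val, best_x = float("-inf"), None
    for _ in range(iterations):
        tree = fit_tree(D)
        samples = sorted((sample_tree(tree) for _ in range(K)),
                         key=f, reverse=True)
        visited = sorted(fast_search(f, samples[0]), key=f, reverse=True)
        D.sort(key=f)                      # worst datapoints first
        D[:M] = visited[:M]                # replace up to M of the worst
        val, x = max(((f(y), y) for y in visited + samples),
                     key=lambda t: t[0])
        if val > best_val:
            best_val, best_x = val, x[:]
    return best_x, best_val
```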
21. COMIT with Hill-climbing
- Use the single best bitstring generated by the tree as the starting point of a stochastic hillclimbing algorithm that allows at most PATIENCE moves to points of equal value before giving up (see the sketch below)
- |D| kept at 1000; M set to 100.
- We compare COMIT vs.:
  - Hillclimbing with restarts from bitstrings chosen randomly from the uniform distribution
  - Hillclimbing with restarts from the best bitstring out of K chosen randomly from the uniform distribution
22. COMIT w/Hillclimbing: Example of Behavior
- TSP domain: 100 cities, 700 bits
[Plot: Tour Length (×10³) vs. Evaluation Number (×10³), comparing Hillclimber and COMIT]
23. COMIT w/Hillclimbing: Results
- Each number is an average over at least 25 runs
- Each algorithm given 200,000 evaluation function calls
- Highlighted: better than each non-COMIT hillclimber with confidence > 95%
- AHCxxx: pick best of xxx randomly generated starting points before hillclimbing
- COMITxxx: pick best of xxx starting points generated by tree before hillclimbing
24. COMIT with PBIL
- Generate K samples from T; evaluate them
- Initialize PBIL's P vector according to the unconditional distributions contained in the best C of these examples
- PBIL run terminated after 5000 evaluations without improvement
- |D| kept at 1000; M set to 100 (as before)
- PBIL parameters: α = 0.15; update P with the single best example out of 50 each iteration; some mutation of P used as well
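Initializing PBIL's vector from the tree's samples is just per-bit frequencies over the best C examples; a minimal sketch:

```python
def init_pbil_vector(best_samples):
    """Set each P(x_i = 1) to the unconditional frequency of bit i
    among the best C tree-generated samples."""
    C, n = len(best_samples), len(best_samples[0])
    return [sum(x[i] for x in best_samples) / C for i in range(n)]
```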
25. COMIT w/PBIL: Results
- Each number is an average over at least 50 runs
- Each algorithm given 600,000 evaluation function calls
- Highlighted: better than opposing algorithm(s) with confidence > 95%
- PBILR: PBIL with restart after 5000 evaluations without improvement (P reinitialized to 0.5)
26. Conclusions & Future Work
- COMIT makes probabilistic modeling applicable to much larger problems
- COMIT led to significant improvements over the baseline search algorithm in most problems tested
- Future work:
  - Applying COMIT to more interesting problems
  - Using COMIT to combine results of multiple search algorithms
  - Incorporating domain knowledge
27. Future Work
- Making the algorithm based on complex Bayesian networks more practical
  - Combine w/ simpler search algorithms, à la COMIT?
- Applying COMIT to more interesting problems and other baseline search algorithms
  - WALKSAT?
- Using more sophisticated probabilistic models
- Using COMIT to combine results of multiple search algorithms