1
Fast Probabilistic Modeling for Combinatorial Optimization
  • Scott Davies
  • School of Computer Science
  • Carnegie Mellon University
  • (and Justsystem Pittsburgh Research Center)
  • Joint work with Shumeet Baluja

2
Combinatorial Optimization
  • Maximize evaluation function f(x)
  • input: fixed-length bitstring x = (x1, x2, ..., xn)
  • output: real value
  • x might represent:
  • job shop schedules
  • TSP tours
  • discretized numeric values
  • etc.
  • Our focus: Black Box optimization
  • No domain-dependent heuristics

[Figure: a black box maps an input bitstring x = 1001001... to an output value f(x) = 37.4]
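To make the black-box setting concrete, here is a minimal sketch of an evaluation function in Python; OneMax (count the 1-bits) is a toy stand-in, since the optimizer only ever sees inputs and outputs:

    # Toy black-box evaluation function: the optimizer sees only the
    # mapping from a fixed-length bitstring to a real value.
    def evaluate(x):              # x: list of 0/1 values, e.g. [1, 0, 0, 1, ...]
        return float(sum(x))      # OneMax: number of 1-bits (a stand-in)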
3
Commonly Used Approaches
  • Hill-climbing, simulated annealing
  • Generate candidate solutions neighboring a single current working solution (e.g. differing by one bit)
  • Typically make no attempt to model how particular
    bits affect solution quality
  • Genetic algorithms
  • Attempt to implicitly capture dependency of
    solution quality on bit values by maintaining a
    population of candidate solutions
  • Use crossover and mutation operators on
    population members to generate new candidate
    solutions

4
Using Explicit Probabilistic Models
  • Maintain an explicit probability distribution P
    from which we generate new candidate solutions
  • Initialize P to uniform distribution
  • Until termination criteria met
  • Stochastically generate K candidate solutions
    from P
  • Evaluate them
  • Update P to make it more likely to generate
    solutions similar to the good solutions
  • Return best bitstring ever evaluated
  • Several different choices for what sorts of P to
    use and how to update it after candidate solution
    evaluation
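A minimal Python sketch of this generic loop, with the model P left abstract behind sample/update hooks (all names and defaults here are illustrative, not from the paper):

    # Generic probabilistic-model optimization loop. sample(P) draws one
    # bitstring from the model; update(P, good) shifts the model toward
    # the good solutions. Both are supplied by the specific method
    # (PBIL, dependency trees, Bayesian networks).
    def model_based_optimize(evaluate, P, sample, update, K=100, iterations=50):
        best, best_f = None, float("-inf")
        for _ in range(iterations):
            pop = [sample(P) for _ in range(K)]       # K candidates from P
            pop.sort(key=evaluate, reverse=True)      # evaluate them
            if evaluate(pop[0]) > best_f:
                best, best_f = pop[0], evaluate(pop[0])
            P = update(P, pop[:10])                   # favor the best few
        return best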

5
Population-Based Incremental Learning
  • Population-Based Incremental Learning (PBIL) [Baluja, 1995]
  • Maintains a vector of probabilities: one independent probability P(xi=1) for each bit xi, each initialized to 0.5
  • Until termination criteria met
  • Generate a population of K bitstrings from P.
    Evaluate them.
  • For each of the best M bitstrings of these K,
    adjust P to make that bitstring more likely
  • Also sometimes uses mutation and/or pushes P
    away from bad solutions
  • Often works very well (compared to hillclimbing
    and GAs) despite its independence assumption

Update rule for each such bitstring, with learning rate α:
P(xi=1) ← (1 − α)·P(xi=1) + α   if xi is set to 1
P(xi=1) ← (1 − α)·P(xi=1)       if xi is set to 0
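A sketch of one PBIL generation under this update rule (K, M, and alpha are illustrative values):

    import random

    # One PBIL generation: sample K bitstrings from the probability vector P,
    # then move P toward each of the best M samples with learning rate alpha.
    def pbil_step(P, evaluate, K=100, M=5, alpha=0.1):
        pop = [[1 if random.random() < p else 0 for p in P]
               for _ in range(K)]
        pop.sort(key=evaluate, reverse=True)
        for x in pop[:M]:
            P = [(1 - alpha) * p + alpha * bit for p, bit in zip(P, x)]
        return P

    # Usage: P = [0.5] * n, then repeat pbil_step until termination.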
6
Modeling Inter-Bit Dependencies
  • MIMIC [De Bonet et al., 1997]: sort bits into a Markov chain in which each bit's distribution is conditioned on the value of the previous bit in the chain
  • e.g. P(x1, x2, ..., xn) = P(x3)·P(x6|x3)·P(x2|x6)···
  • Generate the chain from the best N bitstrings evaluated so far; generate new bitstrings from the chain; repeat
  • More general framework: Bayesian networks

7
Bayesian Networks
  • P(x1, ..., xn) = P(x3)·P(x12)·P(x2|x3,x12)·P(x6|x3)···
  • Specified by:
  • Network structure (must be a DAG)
  • Probability distribution for each variable (bit) given each possible combination of its parents' values

8
Optimization with Bayesian Networks
  • Two operations we need to perform during
    optimization
  • Given a network, generate bitstrings from the
    probability distribution represented by that
    network
  • Trivial with Bayesian networks
  • Given a set of good bitstrings, find the network most likely to have generated them
  • Given a fixed network structure, trivial to fill
    in the probability tables optimally
  • But what structure should we use?

9
Learning Network Structures from Data
  • Problem statement: given a dataset D and a set of allowable Bayesian networks, find the network B with the maximum posterior probability P(B|D) ∝ P(D|B)·P(B)

10
Learning Network Structures from Data
  • Maximizing log P(D|B) reduces to maximizing −Σi H(xi | Πi)
  • where
  • Πi is the set of xi's parents in B
  • H(xi | Πi) is the average entropy of the empirical conditional distribution of xi given its parents, as exhibited in D
11
Single-Parent Tree-Shaped Networks
  • Let's allow each bit to be conditioned on at most one other bit.
  • Adding an arc from xj to xi increases the network score by
  • H(xi) − H(xi|xj)   (the information gain about xi from xj)
  • = H(xj) − H(xj|xi)   (not necessarily obvious, but true)
  • = I(xi, xj), the mutual information between xi and xj
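The second equality is just the symmetry of mutual information; it drops out of the definitions in a few lines (a and b range over the values of xi and xj):

    \begin{aligned}
    I(x_i; x_j) &= \sum_{a,b} P(a,b)\log\frac{P(a,b)}{P(a)\,P(b)} \\
                &= \sum_{a,b} P(a,b)\log\frac{P(a \mid b)}{P(a)} = H(x_i) - H(x_i \mid x_j) \\
                &= \sum_{a,b} P(a,b)\log\frac{P(b \mid a)}{P(b)} = H(x_j) - H(x_j \mid x_i)
    \end{aligned}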

12
Optimal Single-Parent Tree-Shaped Networks
  • To find the optimal single-parent tree-shaped network, just find the maximum spanning tree using I(xi, xj) as the weight for the edge between xi and xj [Chow and Liu, 1968]
  • Can be done in O(n²) time (assuming D has already been reduced to sufficient statistics)
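A straightforward Python sketch of the Chow-Liu construction: empirical pairwise mutual informations as edge weights, maximum spanning tree grown Prim-style. (The O(n²) version caches the pairwise statistics and best-edge bookkeeping; that is omitted here for brevity.)

    import math
    from collections import Counter

    def mutual_info(D, i, j):
        """Empirical mutual information between bits i and j in dataset D."""
        n = len(D)
        cij = Counter((x[i], x[j]) for x in D)
        ci, cj = Counter(x[i] for x in D), Counter(x[j] for x in D)
        return sum((c / n) * math.log(c * n / (ci[a] * cj[b]))
                   for (a, b), c in cij.items())

    def chow_liu_tree(D, n_bits):
        """Return (parent, child) edges of the max-MI spanning tree."""
        in_tree, edges = {0}, []
        while len(in_tree) < n_bits:
            i, j = max(((i, j) for i in in_tree
                        for j in range(n_bits) if j not in in_tree),
                       key=lambda e: mutual_info(D, e[0], e[1]))
            edges.append((i, j))          # bit j is conditioned on bit i
            in_tree.add(j)
        return edges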

13
Optimal Dependency Trees for Combinatorial Optimization
  • [Baluja & Davies, 1997]
  • Start with a dataset D initialized from the
    uniform distribution
  • Until termination criteria met
  • Build optimal dependency tree T with which to
    model D.
  • Generate K bitstrings from probability
    distribution represented by T. Evaluate them.
  • Add best M bitstrings to D after decaying the weight of all datapoints already in D by a factor α between 0 and 1.
  • Return best bitstring ever evaluated.
  • Running time: O(Kn + Mn²) per iteration
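A sketch of this outer loop in Python; build_tree and sample_tree stand in for fitting and sampling the dependency tree (as in the Chow-Liu sketch above), and the weights implement the decay factor (names and defaults are illustrative):

    # Tree-based optimization loop with exponentially decayed dataset
    # weights: old datapoints fade by `decay` (the slide's alpha) each
    # iteration, so the fitted tree tracks recent good solutions.
    def tree_optimize(evaluate, D, weights, build_tree, sample_tree,
                      K=100, M=10, decay=0.9, iterations=50):
        best = max(D, key=evaluate)
        for _ in range(iterations):
            T = build_tree(D, weights)                     # fit tree to D
            pop = sorted((sample_tree(T) for _ in range(K)),
                         key=evaluate, reverse=True)
            best = max(best, pop[0], key=evaluate)
            weights = [w * decay for w in weights]         # decay old data
            D, weights = D + pop[:M], weights + [1.0] * M  # add best M
        return best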

14
Graph-Coloring Example
  • Noisy Graph-Coloring example
  • For each edge connecting two vertices of different colors, add 1 to the evaluation function with probability 0.5

15
Optimal Dependency Tree Results
  • Tested the following optimization algorithms on a variety of small problems, up to 256 bits long
  • Optimization with optimal dependency trees
  • Optimization with chain-shaped networks (à la MIMIC)
  • PBIL
  • Hillclimbing
  • Genetic algorithms
  • General trend
  • Hillclimbing & GAs did relatively poorly
  • Among probabilistic methods, PBIL < Chains < Trees: more accurate models are better

16
Modeling Higher-Order Dependencies
  • The maximum spanning tree algorithm gives us the
    optimal Bayesian Network in which each node has
    at most one parent.
  • What about finding the best network in which each node has at most K parents, for K > 1?
  • NP-complete problem! [Chickering et al., 1995]
  • However, search heuristics such as hillclimbing can be used to look for good network structures (e.g., [Heckerman et al., 1995])

17
Bayesian Network-Based Combinatorial Optimization
  • Initialize D with C bitstrings from the uniform distribution, and the Bayesian network B to the empty network containing no edges
  • Until termination criteria met
  • Perform steepest-ascent hillclimbing from B to find a locally optimal network B'. Repeat until no changes increase the score:
  • Evaluate how each possible edge addition, deletion, or reversal would affect the penalized log-likelihood score
  • Perform the change that maximizes the increase in score
  • Set B to B'.
  • Generate and evaluate K bitstrings from B.
  • Decay the weight of datapoints in D by α.
  • Add best M of the K recently generated datapoints
    to D.
  • Return best bit string ever evaluated.
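A sketch of the steepest-ascent structure search used in the inner loop; score stands for the penalized log-likelihood, and moves(B) enumerates the legal modified networks (one per edge addition, deletion, or reversal, with acyclicity checks), all names illustrative:

    # Steepest-ascent hillclimbing over network structures: score every
    # candidate edge change, take the single best one, stop at a local
    # optimum where no change increases the score.
    def hillclimb_structure(B, D, score, moves):
        current = score(B, D)
        while True:
            candidates = [(score(B2, D), B2) for B2 in moves(B)]
            if not candidates:
                return B
            best_score, best_B = max(candidates, key=lambda t: t[0])
            if best_score <= current:
                return B                     # locally optimal network
            B, current = best_B, best_score  # take the steepest step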

18
Bayesian Networks Results
  • Does better than the Tree-based optimization algorithm on some toy problems
  • Roughly the same as the Tree-based algorithm on others
  • Significantly more computation than Tree-based
    algorithm, however, despite efficiency hacks
  • Why not much better results?
  • Too much emphasis on exploitation rather than
    exploration?
  • Steepest-ascent hillclimbing over network
    structures not good enough, particularly when
    starting from old networks?

19
Using Probabilistic Models for Intelligent Restarts
  • The Tree-based algorithm's O(n²) execution time per iteration is very expensive for large problems
  • Even more so for the algorithm based on more complicated Bayesian networks
  • One possible approach: use probabilistic models to select good starting points for faster optimization algorithms, e.g. hillclimbing or PBIL

20
COMIT
  • Combining Optimizers with Mutual Information Trees [Baluja and Davies, 1998]
  • Initialize dataset D with bitstrings drawn from
    uniform distribution
  • Until termination criteria met
  • Build optimal dependency tree T with which to
    model D.
  • Use T to stochastically generate K bitstrings.
    Evaluate them.
  • Execute a fast-search procedure, initialized with
    the best solutions generated from T.
  • Replace up to M of the worst bitstrings in D with
    the best bitstrings found during the fast-search
    run just executed.
  • Return best bitstring ever evaluated
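A Python sketch of the COMIT loop; fast_search is the plugged-in optimizer (hillclimbing or PBIL) and is assumed to return the solutions it visited, with build_tree and sample_tree as in the Chow-Liu sketch above (names and defaults illustrative):

    # COMIT: fit a tree to D, sample candidates, seed a fast search with
    # the best sample, then fold the search's best results back into D.
    def comit(evaluate, D, build_tree, sample_tree, fast_search,
              K=1000, M=100, iterations=50):
        best = max(D, key=evaluate)
        for _ in range(iterations):
            T = build_tree(D)
            samples = sorted((sample_tree(T) for _ in range(K)),
                             key=evaluate, reverse=True)
            found = fast_search(samples[0])           # seed the fast search
            best = max([best] + found, key=evaluate)
            new = sorted(found, key=evaluate, reverse=True)[:M]
            D.sort(key=evaluate)                      # worst bitstrings first
            D[:len(new)] = new                        # replace up to M worst
        return best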

21
COMIT with Hill-climbing
  • Use the single best bitstring generated by the tree as the starting point of a stochastic hillclimbing algorithm that allows at most PATIENCE moves to points of equal value before giving up
  • |D| kept at 1000; M set to 100.
  • We compare COMIT vs.
  • Hillclimbing with restarts from bitstrings chosen
    randomly from uniform distribution
  • Hillclimbing with restarts from best bitstring
    out of K chosen randomly from uniform
    distribution
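A sketch of that stochastic hillclimber; here any non-improving evaluation counts against the PATIENCE budget, which is one plausible reading of the slide:

    import random

    # Stochastic hillclimbing: flip one random bit; accept improvements
    # and sideways moves, reject worse ones; give up after `patience`
    # consecutive non-improving steps.
    def stochastic_hillclimb(x, evaluate, patience=1000):
        fx, stale = evaluate(x), 0
        while stale <= patience:
            y = list(x)
            i = random.randrange(len(y))
            y[i] = 1 - y[i]                    # flip one random bit
            fy = evaluate(y)
            if fy > fx:
                x, fx, stale = y, fy, 0        # improvement: reset budget
            elif fy == fx:
                x, stale = y, stale + 1        # sideways move: spend budget
            else:
                stale += 1                     # worse: reject, spend budget
        return x, fx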

22
COMIT w/Hillclimbing: Example of Behavior
  • TSP domain: 100 cities, 700 bits

[Figure: tour length (×10³) vs. evaluation number (×10³) for the hillclimber and COMIT]
23
COMIT w/Hillclimbing Results
  • Each number is an average over at least 25 runs
  • Each algorithm given 200,000 evaluation function
    calls
  • Highlighted: better than each non-COMIT hillclimber with confidence > 95%
  • AHCxxx: pick the best of xxx randomly generated starting points before hillclimbing
  • COMITxxx: pick the best of xxx starting points generated by the tree before hillclimbing

24
COMIT with PBIL
  • Generate K samples from T; evaluate them
  • Initialize PBIL's P vector according to the unconditional distributions contained in the best C of these examples
  • PBIL run terminated after 5000 evaluations
    without improvement
  • |D| kept at 1000; M set to 100 (as before)
  • PBIL parameters: α = 0.15; update P with the single best example out of 50 each iteration; some mutation of P used as well
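A sketch of the seeding step: each entry of PBIL's P vector becomes the fraction of the best C tree-generated samples in which that bit is 1 (names illustrative):

    # Initialize PBIL's probability vector from the best C samples:
    # P[i] = empirical frequency of bit i among those samples.
    def init_P(samples, evaluate, C=100):
        best = sorted(samples, key=evaluate, reverse=True)[:C]
        n_bits = len(best[0])
        return [sum(x[i] for x in best) / len(best) for i in range(n_bits)]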

25
COMIT w/PBIL Results
  • Each number is an average over at least 50 runs
  • Each algorithm given 600,000 evaluation function
    calls
  • Highlighted: better than the opposing algorithm(s) with confidence > 95%
  • PBILR: PBIL with restart after 5000 evaluations without improvement (P reinitialized to 0.5)

26
Conclusions & Future Work
  • COMIT makes probabilistic modeling applicable to
    much larger problems
  • COMIT led to significant improvements over the baseline search algorithm on most problems tested
  • Future work
  • Applying COMIT to more interesting problems
  • Using COMIT to combine results of multiple search
    algorithms
  • Incorporating domain knowledge

27
Future Work
  • Making algorithm based on complex Bayesian
    Networks more practical
  • Combine with simpler search algorithms, à la COMIT?
  • Applying COMIT to more interesting problems and
    other baseline search algorithms
  • WALKSAT?
  • Using more sophisticated probabilistic models
  • Using COMIT to combine results of multiple search
    algorithms