Architecture and Details of a High Quality, Large-Scale Analytical Placer

1
Architecture and Details of a High Quality,
Large-Scale Analytical Placer
  • Andrew B. Kahng, Sherief Reda and Qinke Wang
  • VLSI CAD Lab
  • University of California, San Diego
  • http://vlsicad.ucsd.edu/
  • Work partially supported by the MARCO Gigascale
    Systems Research Center. ABK is currently with
    Blaze DFM, Inc., Sunnyvale, CA.

2
Outline
  • History of APlace
  • From APlace1.0 to APlace2.0
  • Anatomy of APlace2.0
  • New techniques in APlace2.0
  • Experimental Results
  • Conclusions and Future Work

3
History of APlace
  • Research to study Synopsys patent
  • Naylor et al., US Patent 6,301,693 (2001)
  • Extensible foundation APlace1.0
  • Timing-driven placement
  • Mixed-size placement
  • Area-I/O placement
  • ISPD-2005 placement contest → APlace2.0
  • Many parts of APlace rewritten
  • Superior performance

4
Outline
  • History of APlace
  • From APlace1.0 to APlace2.0
  • Anatomy of APlace2.0
  • New techniques in APlace2.0
  • Experimental Results
  • Conclusions and Future Work

5
APlace Problem Formulation
  • Constrained nonlinear optimization: divide the
    layout area into uniform bins, and seek to
    minimize HPWL etc. so that the total cell area in
    every bin is equalized:
    $\min_{x,y} \mathrm{HPWL}(x, y)$ subject to $D_g(x, y) = D$ for every global bin $g$
  • $D_g(x, y)$: density function that equals
    the total cell area in global bin $g$
  • $D$: average cell area over all global bins

6
Nonlinear Optimization
  • Smooth approximation of placement objectives:
    wirelength, density function, etc.
  • Quadratic penalty method
  • Solve a sequence of unconstrained minimization
    problems for a sequence of µ → 0
  • Conjugate Gradient (CG) solver
  • Useful for finding an unconstrained minimum
    of a high-dimensional function
  • Adaptable to large-scale placement problems:
    memory requirement is linear in problem size
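
For reference, the quadratic penalty method named on this slide is usually written as below. This is the textbook form, not a formula taken from the slides: $WL$ for the smoothed wirelength, $D_g$ for the smoothed density of bin $g$, and the $1/(2\mu)$ scaling are assumptions made for illustration.

```latex
\min_{x,y}\; WL(x,y) \;+\; \frac{1}{2\mu} \sum_{g} \bigl( D_g(x,y) - D \bigr)^2,
\qquad \mu \to 0
```

Each value of $\mu$ gives one unconstrained problem for the CG solver; the solution for one $\mu$ would typically serve as the starting point for the next, smaller $\mu$.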

7
Wirelength Approximation
  • Half-Perimeter Wirelength (HPWL)
  • Half-perimeter of a net's bounding box
  • Simple, close measure of routing congestion
  • Not strictly convex, nor everywhere differentiable
  • Log-Sum-Exp approximation
  • Naylor et al., US Patent 6,301,693 (2001)
  • Precise, closer to HPWL when α → 0
  • Strictly convex, continuously differentiable
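
A minimal sketch of the log-sum-exp idea in one dimension, assuming the commonly cited form α·(ln Σ exp(x_i/α) + ln Σ exp(−x_i/α)). The function names `hpwl_1d` and `lse_wirelength_1d` are hypothetical, and the max/min shift is only there for numerical stability; none of this is taken from the APlace source.

```python
import math

def hpwl_1d(coords):
    """Exact half-perimeter term along one axis: max minus min pin coordinate."""
    return max(coords) - min(coords)

def lse_wirelength_1d(coords, alpha):
    """Smooth stand-in for hpwl_1d (assumed form, see lead-in).

    Computes alpha * (ln sum exp(x/alpha) + ln sum exp(-x/alpha)),
    shifting by max/min first so the exponentials cannot overflow.
    """
    hi = max(coords)
    lo = min(coords)
    pos = hi + alpha * math.log(sum(math.exp((x - hi) / alpha) for x in coords))
    neg = -lo + alpha * math.log(sum(math.exp((lo - x) / alpha) for x in coords))
    return pos + neg

pins = [0.0, 3.0, 10.0]
tight = lse_wirelength_1d(pins, alpha=0.1)   # small alpha: very close to HPWL
loose = lse_wirelength_1d(pins, alpha=5.0)   # large alpha: smoother, overestimates
```

The approximation never falls below the exact value and exceeds it by at most 2·α·ln(n) for a net with n pins, which is why shrinking α tightens it.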

8
α: Smoothing Parameter
  • Significance criterion for choosing nets with
    large wirelength to minimize
  • Larger gradients for longer nets
  • Minimize long nets more efficiently than short
    nets
  • Two-pin net
  • Partial gradient for x1
  • close to 0 when net length |x1 − x2| is small
    compared to α
  • close to 1 or −1 otherwise
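
The two-pin behaviour on this slide can be checked with a short derivation. It assumes the usual log-sum-exp form for the x-direction and writes the smoothing parameter as $\alpha$; the formula itself is not shown in the transcript.

```latex
WL(x_1, x_2) = \alpha \ln\!\left(e^{x_1/\alpha} + e^{x_2/\alpha}\right)
             + \alpha \ln\!\left(e^{-x_1/\alpha} + e^{-x_2/\alpha}\right)

\frac{\partial WL}{\partial x_1}
  = \frac{1}{1 + e^{-(x_1 - x_2)/\alpha}} - \frac{1}{1 + e^{(x_1 - x_2)/\alpha}}
  = \tanh\!\left(\frac{x_1 - x_2}{2\alpha}\right)
```

Under this assumption the gradient is near 0 when $|x_1 - x_2| \ll \alpha$ and saturates at $\pm 1$ when $|x_1 - x_2| \gg \alpha$, which matches the two cases listed.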

9
Area Potential Function
  • Overlap area
  • overlap along the x and y directions
  • 0/1 function with cell size ignored
  • Area potential function: defines an area
    potential exerted by a cell on nearby grids
  • smooth bell-shaped function for standard cells
    (Naylor et al., US Patent 6,301,693 (2001))

10
Module Area Potential Function
  • Mixed-size placement: decide scope of area
    potential based on module's dimension
  • p(d): potential function
  • d: distance from module to grid
  • radius r = w/2 + 2wg for a block with width w
  • convex curve: d < w/2 + wg
  • concave curve: w/2 + wg < d < w/2 + 2wg
  • smooth at d = w/2 + wg

[Figure: p(d) plotted against d, with p(d) = 1 − ad² on the convex part and p(d) = b(r − d)² on the concave part; axis marks at −w/2 − 2wg and w/2 + wg]
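
A small sketch of this piecewise potential, reading the radius as r = w/2 + 2wg and the knee as d = w/2 + wg. The slide does not give the coefficients a and b; the values below are an assumption, derived by requiring p and its slope to be continuous at the knee. The name `module_potential` is hypothetical.

```python
def module_potential(d, w, wg):
    """Area potential of a block of width w at distance d from a grid of width wg."""
    r = w / 2 + 2 * wg                        # radius of influence
    knee = w / 2 + wg                         # where convex meets concave
    # Assumed coefficients: chosen so value and slope match at the knee.
    a = 4.0 / ((w + 2 * wg) * (w + 4 * wg))
    b = 2.0 / (wg * (w + 4 * wg))
    d = abs(d)                                # potential is symmetric in d
    if d <= knee:
        return 1.0 - a * d * d                # convex part
    if d <= r:
        return b * (r - d) ** 2               # concave part
    return 0.0                                # no influence beyond the radius

p_center = module_potential(0.0, w=8.0, wg=2.0)
p_knee = module_potential(6.0, w=8.0, wg=2.0)
p_edge = module_potential(8.0, w=8.0, wg=2.0)
```

With w = 8 and wg = 2, both branches give 0.25 at the knee d = 6, and the potential falls to 0 at the radius d = 8.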
11
Changes: APlace1.0 → APlace2.0
  • Strong scalability from new clustering algorithm
  • Dynamic adjustment of weights for wirelength and
    overlap penalty during global placement
  • Improvements to legalization, detailed placement
  • whitespace compaction
  • cell reordering algorithms
  • global greedy cell movement
  • APlace2.0 vs. APlace1.0: up to 19% WL reduction,
    1.5-2x speedup

12
IBM BigBlue4 Placement
2.1M instances, HPWL 833.21, CPU 23h
13
Outline
  • History of APlace
  • From APlace1.0 to APlace2.0
  • Anatomy of APlace2.0
  • New techniques in APlace2.0
  • Experimental Results
  • Conclusions and Future Work

14
Anatomy of APlace 2.0
  • Global Phase: Clustering → Adaptive APlace engine → Unclustering
  • Detailed Phase: Legalization → WS arrangement → Cell order polishing → Global moving
15
New Feature 1: Multi-Level Clustering
  • Objective: cluster to reduce runtime and allow
    scalable implementations with no compromise to
    quality
  • Multi-level approach using best-choice clustering
    (ISPD05)
  • Clustering ratio ≈ 10
  • Top-level clusters ≈ 2000
  • Wirelength calculation
  • assume modules located at cluster center
  • only consider inter-cluster parts of nets

[Flowchart: netlist → reduce netlist size by 10x → size 2000? (no: reduce again; yes: global placement) → flat? (no: uncluster, then global placement again; yes: Legalization)]
16
Best-Choice Clustering
  • Each clustering level uses the best-choice
    heuristic with lazy updates and tight area
    control
  • For each clustering level
  • Calculate the clustering score of each node to
    its neighbors
  • based on the number of connections and areas
  • Sort all nodes based on their best scores using
    a heap
  • Until target clustering ratio is reached
  • If the top node of the heap is valid, cluster it
    with its closest neighbor
  • Else recalculate the top node's score, reinsert it
    in the heap, and continue
  • After clustering, calculate the clustering score
    of the new node and reinsert it into the heap
  • Update the netlist and mark all neighbors of the
    new node as invalid
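
The loop on this slide can be sketched as below. This is a simplified illustration, not the APlace code: the score (sum of 1/|net| over shared nets, divided by the combined area), the `max_area` cap standing in for "tight area control", the integer node ids, and the name `best_choice_cluster` are all assumptions.

```python
import heapq

def best_choice_cluster(areas, nets, target_count, max_area):
    """areas: {int node: area}; nets: list of node lists.
    Returns {cluster id: set of original nodes}."""
    areas = dict(areas)
    nets = [set(n) for n in nets]
    members = {v: {v} for v in areas}
    node_nets = {v: set() for v in areas}
    for i, net in enumerate(nets):
        for v in net:
            node_nets[v].add(i)

    def best_partner(u):
        """Best-scoring neighbor of u within the area cap, or None."""
        scores = {}
        for i in node_nets[u]:
            if len(nets[i]) < 2:
                continue                      # net is now internal to one cluster
            for v in nets[i]:
                if v != u and areas[u] + areas[v] <= max_area:
                    scores[v] = scores.get(v, 0.0) + 1.0 / len(nets[i])
        if not scores:
            return None
        v = max(scores, key=lambda k: (scores[k] / (areas[u] + areas[k]), -k))
        return scores[v] / (areas[u] + areas[v]), v

    def push(u):
        best = best_partner(u)
        if best is not None:
            heapq.heappush(heap, (-best[0], u, best[1]))  # max-heap via negation

    heap = []
    for u in list(areas):
        push(u)
    invalid = set()
    next_id = max(areas) + 1
    while len(areas) > target_count and heap:
        _, u, v = heapq.heappop(heap)
        if u not in areas:
            continue                          # stale entry: u was merged already
        if u in invalid:                      # lazy update: rescore only now
            invalid.discard(u)
            push(u)
            continue
        new, next_id = next_id, next_id + 1   # valid top node: merge u and v
        areas[new] = areas.pop(u) + areas.pop(v)
        members[new] = members.pop(u) | members.pop(v)
        node_nets[new] = node_nets.pop(u) | node_nets.pop(v)
        for i in node_nets[new]:
            nets[i] -= {u, v}
            nets[i].add(new)
            invalid.update(nets[i] - {new})   # neighbors' scores are now out of date
        push(new)
    return members

clusters = best_choice_cluster(
    areas={0: 1, 1: 1, 2: 1, 3: 1},
    nets=[[0, 1], [0, 1], [2, 3], [1, 2]],
    target_count=2,
    max_area=2,
)
```

The point of the lazy update is that a merge only flags its neighbors; their scores are recomputed when they reach the top of the heap, not immediately.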

17
Two Clustering Concerns
  • Mark boundaries of clustering hierarchy at each
    clustering level
  • → allow exact reversal of clustering during
    unclustering
  • Meet target number of objects by avoiding
    saturation
  • → bypass small fixed objects during
    clustering

[Figure: a cluster formed by bypassing fixed objects]
18
Multiple Levels of Grids
  • Adaptive grid size based on average cluster size
  • Better global optimization
  • use solution of placement problem constrained
    with coarser grids as initial solution for
    problem constrained with finer grids
  • Better scalability
  • larger grid size spreads modules faster
  • Different levels of relaxation for density
    constraints
  • According to grid size

19
New Feature 2: Adaptive WL Weight
  • Important to QOR
  • Initial weight value
  • For each cluster level and grid level
  • Based on wirelength and density partial
    derivatives
  • Goal: magnitudes of the two gradients roughly equal
  • Decrease WL weight by half whenever CG solver
    obtains a stable solution
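
One way to read the weighting rule on this slide is sketched below. The slide only says that the initial weight is based on the partial derivatives so that the gradient magnitudes are roughly equal; using the sum of absolute values as the magnitude is an assumption, and both function names are hypothetical.

```python
def initial_wl_weight(wl_grad, density_grad):
    """Weight that makes the weighted wirelength gradient match the
    density gradient in magnitude (sum of absolute values, an assumption)."""
    wl_norm = sum(abs(g) for g in wl_grad)
    density_norm = sum(abs(g) for g in density_grad)
    return density_norm / wl_norm if wl_norm > 0 else 1.0

def weight_schedule(initial_weight, num_rounds):
    """Weights used over successive CG rounds: halved after each stable solution."""
    weights = []
    w = initial_weight
    for _ in range(num_rounds):
        weights.append(w)
        w *= 0.5
    return weights

w0 = initial_wl_weight(wl_grad=[1.0, -1.0], density_grad=[4.0, 4.0])
schedule = weight_schedule(w0, num_rounds=3)
```

Halving the wirelength weight shifts emphasis toward the density term round by round, so cells spread gradually instead of all at once.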

20
New Feature 3: Legalization and Detailed
Placement
Variant of greedy legalization algorithm
(Hill01)
  1. Sort all cells from left to right; move each cell,
    in order, to the closest legal position
  2. Sort all cells from right to left; move each cell,
    in order, to the closest legal position
  3. Pick the better of (1) and (2)
  • Detailed Placement Components
  • Global cell movement (Goto81, KenningsM98
    BoxPlace, FP)
  • Whitespace compaction (KahngTZ99, KahngMR04)
  • Cell order polishing (similar to rowIroning, FS
    detailed placer)
  • Intra-row cell reordering
  • Inter-row cell reordering
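
The three-step greedy legalization at the top of this slide can be illustrated on a single row. This is a deliberately reduced sketch: the real algorithm works across many rows, while here one row, total displacement as the quality measure, and the names `sweep` and `legalize_row` are all assumptions.

```python
def sweep(cells, row_start, row_end, left_to_right):
    """cells: list of (name, x, width). Place cells in sorted order, each at the
    closest position that does not overlap cells placed before it.
    Returns ({name: x}, total displacement), or None if the row overflows."""
    order = sorted(cells, key=lambda c: c[1], reverse=not left_to_right)
    placed = {}
    cost = 0.0
    frontier = row_start if left_to_right else row_end
    for name, x, width in order:
        if left_to_right:
            pos = max(x, frontier)            # cannot start before the frontier
            frontier = pos + width
            if frontier > row_end:
                return None
        else:
            pos = min(x, frontier - width)    # cannot end after the frontier
            frontier = pos
            if frontier < row_start:
                return None
        placed[name] = pos
        cost += abs(pos - x)
    return placed, cost

def legalize_row(cells, row_start, row_end):
    """Run both sweeps and keep the one with the smaller total displacement."""
    results = [r for r in (sweep(cells, row_start, row_end, True),
                           sweep(cells, row_start, row_end, False)) if r is not None]
    if not results:
        raise ValueError("cells do not fit in the row with either sweep")
    return min(results, key=lambda r: r[1])

positions, displacement = legalize_row(
    [("a", 0, 2), ("b", 1, 2), ("c", 8, 2)], row_start=0, row_end=10)
```

In this example the right-to-left sweep would push cell a past the row start, so the left-to-right result is kept.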

21
Global Moving
  • Move cell to optimal location among available
    whitespace
  • improve quality when utilization is low
  • Two steps
  • search for available location in optimal region
    of a cell's placement
  • search for available location in best bin
  • divide placement area into uniform bins
  • choose "best" bin according to available
    whitespace and cost of moving cell to bin center
  • assume normal distribution of whitespace with
    width and estimate if an available location exists

22
WhiteSpace (WS) Compaction
[Figure: a row of sites 1-12 between a start node and an end node, with one chain of candidate sites for each of cell 1, cell 2, cell 3, ..., cell n]
  • Each chain represents the possible placement
    sites for one cell
  • The cost on the arrows is the change in HPWL when
    the cell moves to each site
  • The order of chains corresponds to the order of
    cells from left to right in a row
  • A shortest path from source to sink gives the
    best way to compact WS
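
Because the chains are ordered and each cell takes one site, the shortest path on this slide can be found with a simple dynamic program. The sketch below assumes integer site indices, cell widths in sites, and a caller-supplied `cost(i, s)` giving the HPWL change of putting cell i's left edge at site s; `compact_row` is a hypothetical name.

```python
def compact_row(widths, num_sites, cost):
    """Return (total cost, left-edge site per cell) for cells kept in their
    given left-to-right order without overlap. Assumes at least one cell."""
    INF = float("inf")
    n = len(widths)
    best = [[INF] * num_sites for _ in range(n)]   # best[i][s]: cheapest path to (i, s)
    back = [[-1] * num_sites for _ in range(n)]    # predecessor site of cell i-1
    for i in range(n):
        run_min, run_arg = INF, -1                 # cheapest legal predecessor so far
        for s in range(num_sites - widths[i] + 1):
            if i == 0:
                prev = 0.0                         # edge from the start node
            else:
                t = s - widths[i - 1]              # newest site cell i-1 may use
                if t >= 0 and best[i - 1][t] < run_min:
                    run_min, run_arg = best[i - 1][t], t
                prev = run_min
            if prev < INF:
                best[i][s] = prev + cost(i, s)
                back[i][s] = run_arg
    s = min(range(num_sites), key=lambda k: best[n - 1][k])
    total = best[n - 1][s]
    if total == INF:
        raise ValueError("cells do not fit in the row")
    positions = [0] * n
    for i in range(n - 1, -1, -1):                 # walk the path back to the start
        positions[i] = s
        s = back[i][s]
    return total, positions

# Two cells of width 2 in a 6-site row; cell 0 prefers site 2 (weight 2),
# cell 1 prefers site 3 (weight 1). They cannot both get their wish.
total, sites = compact_row(
    widths=[2, 2], num_sites=6,
    cost=lambda i, s: 2 * abs(s - 2) if i == 0 else abs(s - 3))
```

The running minimum keeps the work proportional to cells times sites, since each cell only needs the cheapest predecessor position to its left.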

23
Cell Order Polishing
  • Permute a small window of neighboring cells in
    order to improve wirelength
  • MetaPlacer's rowIroning: up to 15 cells in one
    row, assuming equal whitespace distribution
  • FengShui's cell ordering: six objects in one or
    more rows, regarding whitespace as pseudo cells
  • Branch-and-bound algorithm
  • four nearby cells in one or multiple rows
  • consider optimal placement for each permutation
  • more accurate, overlap-free permutations and no
    cell shifting

24
Single-Row Cell Ordering
  • Cost of placing first j cells of a permutation
  • cost: wirelength increase when placing a cell
  • ΔWL ≠ 0 only if cell is leftmost or rightmost
  • remaining cells placed to the right of first j
    cells
  • unrelated to order or placement of remaining
    cells
  • BB algorithm
  • construct permutations in lexicographic order
  • next permutation has same prefix as the previous
    one
  • beginning rows of DP table can be reused where
    possible
  • cut branch when minimum cost of placing first j
    cells > best cost so far
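
For a window this small, the effect of single-row reordering can be illustrated with plain exhaustive enumeration. Note that this is a stand-in, not the branch-and-bound with prefix reuse described on the slide; packing cells with no whitespace, using cell centers as pin locations, and the name `best_order` are assumptions.

```python
from itertools import permutations

def best_order(cells, widths, nets, fixed_x, window_start):
    """cells: names inside the window; nets: lists of pin names (window cells or
    fixed pins); fixed_x: x coordinate of every pin outside the window.
    Returns the permutation with the smallest x-direction wirelength."""
    def wirelength(order):
        x = dict(fixed_x)
        pos = window_start
        for c in order:                        # pack cells left to right
            x[c] = pos + widths[c] / 2.0       # pin assumed at the cell center
            pos += widths[c]
        return sum(max(x[p] for p in net) - min(x[p] for p in net) for net in nets)
    return min(permutations(cells), key=wirelength)

order = best_order(
    cells=["a", "b"],
    widths={"a": 2, "b": 2},
    nets=[["a", "R"], ["b", "L"]],             # a pulls right, b pulls left
    fixed_x={"L": 0.0, "R": 10.0},
    window_start=4.0,
)
```

With four cells there are only 24 permutations, so enumeration is cheap; pruning starts to matter as the window grows.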

25
Two- or Three-Row Cell Ordering
  • DP algorithm
  • decide how many cells are assigned to each row,
    from top to bottom
  • construct a permutation in lexicographic order
  • find optimal placement within the window
  • Y-cost of placing first j cells is accurate
  • remaining cells placed lower than first j cells
  • X-cost of placing first j cells is inaccurate when
    a net connects placed and unplaced cells
  • results show it is still effective with a small
    set of cells and a small window

26
Outline
  • Introduction
  • Clustering
  • Global Placement
  • Detailed Placement
  • Experimental Results
  • IBM ISPD04
  • IBM-PLACE v2
  • IBM ICCAD04
  • IBM ISPD05
  • Conclusions and Future Work

27
IBM ISPD04
  • Test basic placer performance with standard cells

Circuit   APlace2.0   mPL5   Capo9.0   Dragon3   FP1    FS2.6
ibm10     17.20       17.3   1.1       1.04      1.07   1.07
ibm11     13.22       14     1.09      1.03      1.09   1.04
ibm12     21.83       22.3   1.11      1.03      1.08   1.07
ibm13     16.46       16.6   1.1       1.05      1.11   1.09
ibm14     30.55       31.6   1.1       1.05      1.11   1.04
ibm15     38.38       38.5   1.09      1.04      1.13   1.07
ibm16     41.36       43     1.1       1.05      1.07   1.09
ibm17     60.82       61.3   1.09      1.08      1.08   1.08
ibm18     39.32       41     1.09      1.02      1.1    1.04
Average   0.97        1      1.09      1.03      1.08   1.06
  • 3% better than the best other placer, mPL5 (ISPD05)

28
IBM Place V2
  • Test placer under whitespace presence and
    routability

Circuit      APlace2.0   Vias     mPL-R+WSA
ibm09-easy   3.023       495073   3.5
ibm09-hard   3.027       503410   3.65
ibm10-easy   5.977       758598   6.84
ibm10-hard   5.931       772744   6.76
ibm11-easy   4.577       638523   5.16
ibm11-hard   4.654       656525   5.15
ibm12-easy   8.337       892915   10.52
ibm12-hard   8.317       902465   10.13
Average      0.88                 1
  • 12% better than mPL-R+WSA (ICCAD04)

29
IBM ICCAD04
  • Test placer performance with cells and blocks
    (floorplacement)

Circuit   APlace2.0   FS2.6   Capo9.0
ibm10     28.55       41.96   34.98
ibm11     18.67       21.19   22.31
ibm12     33.51       40.84   40.78
ibm13     23.03       25.45   28.7
ibm14     35.9        39.93   40.97
ibm15     46.82       51.96   59.19
ibm16     54.58       62.77   67
ibm17     66.49       69.38   78.78
ibm18     42.14       45.59   50.39
Average   0.86        1       1.05
  • 14% and 19% better than FS and Capo, respectively

30
IBM ISPD05
  • Test placer performance with cells and
    movable/fixed blocks

Placer      adaptec2   adaptec4   BB1      BB2      BB3      BB4       AVG
APlace2.0   87.31      187.65     94.64    143.82   357.89   833.21    1
mFAR        91.53      190.84     97.7     168.7    379.95   876.28    1.06
Dragon      94.72      200.88     102.39   159.71   380.45   903.96    1.08
mPL         97.11      200.94     98.31    173.22   369.66   904.19    1.09
FastPlace   107.86     204.48     101.56   169.89   458.49   889.87    1.16
Capo        99.71      211.25     108.21   172.3    382.63   1098.76   1.17
NTUP        100.31     206.45     106.54   190.66   411.81   1154.15   1.21
FengShui    122.99     337.22     114.57   285.43   471.15   1040.05   1.5
KW          157.65     352.01     149.44   322.22   656.19   1403.79   1.84
  • 6% better than the best other placer (mFAR)

31
APlace2.0 Conclusions
  • 60 days from a clean sheet of paper: Qinke Wang and
    Sherief Reda
  • Scalable implementation
  • State-of-the-art clustering and global placement
    engines
  • Improved detailed placement engine
  • Better than best published results by
  • 3% on the ISPD04 suite
  • 14% on ICCAD04
  • 12% on IBM-PLACE v2
  • 6% on the ISPD05 Placement Contest
  • Recent applications (other than restoring
    functionality)
  • IR-drop driven placement (ICCD-2005 Best Paper)
  • Lens aberration-aware placement (DATE-2006)
  • Toward APlace3.0?

32
Thank You
  • Questions?

33
Goals and Plan
  • Goals
  • Build a new placer to win the competition
  • Scalable, robust, high-quality implementation
  • Leave no stone unturned / QOR on the table
  • Plan and Schedule
  • Work within most promising framework: APlace
  • 30 days for coding, 30 days for tuning

34
Philosophy
  • Respect the competition
  • Well-funded groups with decades of experience
  • ABKGroup's Capo, MLPart, APlace: all unfunded
    side projects
  • No placement-related industry interactions
  • QOR target: 24-26% better than Capo v9r6 on all
    known benchmarks
  • Nearly pulled out 10 days before competition
  • Work smart
  • Solve scalability and speed basics first
  • Slimmed-down data structure, -msse compiler
    options, etc.
  • Ordered list of 15 QOR ideas to implement
  • Daily regressions on all known benchmarks
  • Synthetic testcases to predict bb3, bb4, etc.

35
Implementation Framework: New APlace Flow
  • APlace weaknesses
  • Weak clustering
  • Poor legalization / detailed placement
  • New APlace
  • New clustering
  • Adaptive parameter setting for scalability
  • New legalization + iterative detailed placement
  • Global Phase: Clustering → Adaptive APlace engine → Unclustering
  • Detailed Phase: Legalization → WS arrangement → Cell order polishing → Global moving
36
Parameterization and Parallelizing
Tuning Knobs
  • Clustering ratio, top-level clusters, cluster
    area constraints
  • Initial wirelength weight, wirelength weight
    reduction ratio
  • Max CG iterations for each wirelength weight
  • Target placement discrepancy
  • Detailed placement parameters, etc.

Resources
  • SDSC ROCKS Cluster: 8 Xeon CPUs at 2.8GHz
  • Michigan, Prof. Sylvester's Group: 8 various CPUs
  • UCSD FWGrid: 60 Opteron CPUs at 1.6GHz
  • UCSD VLSICAD Group: 8 Xeon CPUs at 2.4GHz

Wirelength improvement after tuning: 2-3%
37
Artificial Benchmark Synthesis
  • Synthetic benchmarks to test code scalability and
    performance
  • Rapid response to broadcast of s00-nam.pdf
  • Created synthetic versions of bigblue3 and
    bigblue4 within 48 hours
  • Mimicked fixed-block layout diagrams in the
    artificial benchmark creation
  • This process was useful: we identified (and
    solved) a problem with clustering in the presence
    of many small fixed blocks

38
Results
Circuit    GP HPWL   Leg HPWL   DP HPWL   CPU (h)
adaptec1   80.20     81.80      79.50     3
adaptec2   84.70     92.18      87.31     3
adaptec3   218.00    230.00     218.00    10
adaptec4   182.90    194.75     187.71    13
bigblue1   93.67     97.85      94.64     5
bigblue2   140.68    147.85     143.80    12
bigblue3   357.28    407.09     357.89    22
bigblue4   813.91    868.07     833.21    50
39
Conclusions
  • ISPD05: an exercise in process and philosophy
  • At end, we were still 4% short of where we wanted
    to be
  • Not happy with how we handled 5-day time frame
  • Auto-tuning → first results were the best results
  • During competition, wrote but then left out
    annealing DP improvements that gained another
    0.5%
  • Students and IBM ARL did a really, really great
    job
  • Currently restoring capabilities (congestion,
    timing-driven, etc.) and cleaning (antecedents in
    Naylor patent)