Architecture and Details of a High Quality, Large-Scale Analytical Placer

1
Architecture and Details of a High Quality,
Large-Scale Analytical Placer

Andrew B. Kahng, Sherief Reda and Qinke Wang
VLSI CAD Lab
University of California, San Diego
http://vlsicad.ucsd.edu/
Work partially supported by the MARCO Gigascale
Systems Research Center. ABK is currently with
Blaze DFM, Inc., Sunnyvale, CA.

2
Outline

History of APlace
From APlace1.0 to APlace2.0
Anatomy of APlace2.0
New techniques in APlace2.0
Experimental Results
Conclusions and Future Work

3
History of APlace

Research to study Synopsys patent
Naylor et al., US Patent 6,301,693 (2001)
Extensible foundation: APlace1.0
Timing-driven placement
Mixed-size placement
Area-I/O placement
ISPD-2005 placement contest → APlace2.0
Many parts of APlace rewritten
Superior performance

4
Outline

History of APlace
From APlace1.0 to APlace2.0
Anatomy of APlace2.0
New techniques in APlace2.0
Experimental Results
Conclusions and Future Works

5
APlace Problem Formulation

Constrained nonlinear optimization: divide the layout area into uniform bins, and seek to minimize HPWL etc. so that the total cell area in every bin is equalized

min HPWL(x, y)  s.t.  D_g(x, y) = D for each global bin g

D_g(x, y): density function that equals the total cell area in global bin g
D: average cell area over all global bins

6
Nonlinear Optimization

Smooth approximation of placement objectives: wirelength, density function, etc.
Quadratic Penalty method
Solve a sequence of unconstrained minimization problems for a sequence of µ → 0
Conjugate Gradient (CG) solver
Useful for finding an unconstrained minimum of a high-dimensional function
Adaptable to large-scale placement problems: memory requirement is linear in problem size

7
Wirelength Approximation

Half-Perimeter Wirelength (HPWL)
Half-perimeter of net's bounding box
Simple, close measure of routing congestion
Not strictly convex, nor everywhere differentiable
Log-Sum-Exp approximation
Naylor et al., US Patent 6,301,693 (2001)
Precise: closer to HPWL as α → 0
Strictly convex, continuously differentiable
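
As a rough illustration of the log-sum-exp idea, the sketch below smooths the one-dimensional half-perimeter term (max minus min pin coordinate) of a single net. The function names are hypothetical, and the exact expression used in APlace is not shown on the slide; this assumes the common form α·ln Σ exp(x/α) + α·ln Σ exp(−x/α), applied per net and per axis.

```python
import math

def hpwl_1d(xs):
    """Exact one-dimensional half-perimeter term: max minus min pin coordinate."""
    return max(xs) - min(xs)

def lse_wirelength_1d(xs, alpha):
    """Log-sum-exp smoothing of hpwl_1d (assumed form), smoothing parameter alpha > 0."""
    hi, lo = max(xs), min(xs)
    # Shift by the extreme coordinate so every exponent is <= 0 (avoids overflow).
    smooth_max = hi + alpha * math.log(sum(math.exp((x - hi) / alpha) for x in xs))
    smooth_min = lo - alpha * math.log(sum(math.exp((lo - x) / alpha) for x in xs))
    return smooth_max - smooth_min
```

For n pins the smoothed value lies between the exact HPWL term and that term plus 2·α·ln n, so shrinking α tightens the approximation at the price of sharper gradients.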

8
α: Smoothing Parameter

Significance criterion for choosing nets with large wirelength to minimize
Larger gradients for longer nets
Minimize long nets more efficiently than short nets

Two-pin net
Partial gradient for x1:
close to 0 when net length |x1 - x2| is small compared to α
close to 1 or -1 otherwise
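
The two-pin behavior can be checked numerically. Assuming the log-sum-exp form α·ln(e^(x1/α) + e^(x2/α)) + α·ln(e^(−x1/α) + e^(−x2/α)) (an assumption, since the slide's formula is not in the transcript), the partial derivative with respect to x1 works out to tanh((x1 − x2) / (2α)). The function names below are illustrative only.

```python
import math

def lse_two_pin(x1, x2, alpha):
    """Log-sum-exp wirelength of a two-pin net along one axis (assumed form)."""
    hi, lo = max(x1, x2), min(x1, x2)
    d = (hi - lo) / alpha
    # Smooth-max and smooth-min each add a correction of alpha*log(1 + exp(-d)).
    return (hi - lo) + 2.0 * alpha * math.log1p(math.exp(-d))

def two_pin_grad_x1(x1, x2, alpha):
    """Closed-form partial derivative of lse_two_pin with respect to x1."""
    return math.tanh((x1 - x2) / (2.0 * alpha))
```

With α = 1, a net of length 0.01 yields a gradient of about -0.005 for x1 = 0, x2 = 0.01, while a net of length 10 yields a gradient within 0.001 of ±1, so long nets dominate the descent direction.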

9
Area Potential Function

Overlap area
overlap along the x and y directions
0/1 function with cell size ignored
Area potential function: defines an area potential exerted by a cell on nearby grids
smooth bell-shaped function for standard cells
Naylor et al., US Patent 6,301,693 (2001)

10
Module Area Potential Function

Mixed-size placement: decide scope of area potential based on module's dimension
p(d): potential function
d: distance from module to grid
radius r = w/2 + 2wg for block with width w

convex curve: d < w/2 + wg
concave curve: w/2 + wg < d < w/2 + 2wg
smooth at d = w/2 + wg

[Figure: plot of p(d) versus d; inner curve labeled 1 - ad^2, outer curve labeled b(r - d)^2, axis marks at -w/2 - 2wg and w/2 + wg]
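
A minimal sketch of such a piecewise potential follows. It assumes the two pieces are p(d) = 1 - a·d² for d up to w/2 + wg and p(d) = b·(r - d)² out to r = w/2 + 2wg. The coefficients a and b are not given in the transcript; here they are derived from the stated requirement that the curve be smooth (equal value and slope) at d = w/2 + wg. The name make_potential is hypothetical.

```python
def make_potential(w, wg):
    """Return p(d) for a block of width w and grid parameter wg (assumed meaning)."""
    d0 = w / 2.0 + wg          # switch point between the two pieces
    r = w / 2.0 + 2.0 * wg     # radius beyond which the potential is zero
    # Matching value and slope at d0 gives a = 1/(d0*r) and b = 1/(wg*r).
    a = 1.0 / (d0 * r)
    b = 1.0 / (wg * r)

    def p(d):
        d = abs(d)             # the potential is symmetric about the module
        if d <= d0:
            return 1.0 - a * d * d
        if d <= r:
            return b * (r - d) ** 2
        return 0.0

    return p
```

For w = 4 and wg = 1 this gives d0 = 3, r = 4, a = 1/12 and b = 1/4, and both pieces evaluate to 0.25 with slope -0.5 at d = 3.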
11
Changes: APlace1.0 → APlace2.0

Strong scalability from new clustering algorithm
Dynamic adjustment of weights for wirelength and overlap penalty during global placement
Improvements to legalization, detailed placement
whitespace compaction
cell reordering algorithms
global greedy cell movement
APlace2.0 vs. APlace1.0: up to 19% WL reduction, 1.5-2x speedup

12
IBM BigBlue4 Placement
2.1M instances, HPWL 833.21, CPU 23h
13
Outline

History of APlace
From APlace1.0 to APlace2.0
Anatomy of APlace2.0
New techniques in APlace2.0
Experimental Results
Conclusions and Future Works

14
Anatomy of APlace 2.0

Global Phase: Clustering → Adaptive APlace engine → Unclustering
Detailed Phase: Legalization → WS arrangement → Cell order polishing → Global moving

15
New Feature 1: Multi-Level Clustering

Objective: cluster to reduce runtime and allow scalable implementations with no compromise to quality

Multi-level approach using best-choice clustering (ISPD05)
Clustering ratio ≈ 10
Top-level clusters ≈ 2000
Wirelength calculation
assume modules located at cluster center
only consider inter-cluster parts of nets

[Flowchart: netlist → cluster (reduce netlist size by 10x) → size 2000 reached? (no: cluster again; yes: global placement) → flat? (no: uncluster, then global placement again; yes: Legalization)]

16
Best-Choice Clustering

Each clustering level uses the best-choice
heuristic with lazy updates and tight area
control

For each clustering level:

Calculate the clustering score of each node to its neighbors, based on the number of connections and areas
Sort all nodes based on their best scores using a heap
Until target clustering ratio is reached:
  If top node of heap is valid, then cluster it with its closest neighbor
    calculate the clustering score of the new node and reinsert into the heap
    update netlist and mark all neighbors of the new node as invalid
  Else recalculate the top node score, reinsert it in the heap, and continue
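
The loop above can be sketched in a few lines. This is an illustrative version under stated assumptions, not the APlace code: the score between two nodes is assumed to be the sum of 1/|net| over shared nets divided by the combined area, area control is a single hypothetical max_area cap, and merged nodes get made-up ids of the form ("cluster", k).

```python
import heapq
import itertools
from collections import defaultdict

def best_choice_cluster(nets, area, target_count, max_area=float("inf")):
    """Merge nodes until at most target_count remain; returns lists of original ids."""
    nets = [set(net) for net in nets]
    area = dict(area)
    node_nets = defaultdict(set)            # node -> indices of nets touching it
    for i, net in enumerate(nets):
        for u in net:
            node_nets[u].add(i)
    alive = set(area)
    valid = dict.fromkeys(alive, True)      # False => heap score may be stale
    members = {u: [u] for u in alive}
    heap, tie = [], itertools.count()       # tie counter keeps heap entries comparable

    def push(u):
        """Score u against every neighbor and push its best choice, if any."""
        conn = defaultdict(float)
        for i in node_nets[u]:
            for v in nets[i]:
                if v != u:
                    conn[v] += 1.0 / len(nets[i])
        scored = [(c / (area[u] + area[v]), v) for v, c in conn.items()
                  if area[u] + area[v] <= max_area]      # assumed area control
        if scored:
            score, v = max(scored, key=lambda sv: sv[0])
            heapq.heappush(heap, (-score, next(tie), u, v))

    for u in list(alive):
        push(u)
    new_ids = itertools.count()
    while len(alive) > target_count and heap:
        _, _, u, v = heapq.heappop(heap)
        if u not in alive:
            continue                         # u was already merged away
        if not valid[u] or v not in alive:
            valid[u] = True                  # lazy update: rescore only when popped
            push(u)
            continue
        new = ("cluster", next(new_ids))
        area[new] = area[u] + area[v]
        members[new] = members.pop(u) + members.pop(v)
        touched = node_nets.pop(u) | node_nets.pop(v)
        alive.difference_update((u, v))
        alive.add(new)
        for i in touched:                    # update the netlist
            nets[i].difference_update((u, v))
            nets[i].add(new)
        node_nets[new] = touched
        valid[new] = True
        push(new)
        for i in touched:                    # neighbors of the new node go stale
            for x in nets[i]:
                if x != new:
                    valid[x] = False
    return [members[u] for u in alive]
```

For example, with nets [{0, 1}, {0, 1}, {2, 3}, {1, 2}], unit areas and a target of two clusters, the doubly connected pair 0 and 1 merges first and the sketch ends with the clusters {0, 1} and {2, 3}. Marking neighbors invalid instead of rescoring them immediately is what makes the updates lazy: a stale node is rescored only if it ever reaches the top of the heap.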

17
Two Clustering Concerns

Mark boundaries of clustering hierarchy at each clustering level
→ allow exact reversal of clustering during unclustering
Meet target number of objects by avoiding saturation
→ bypass small fixed objects during clustering

[Figure: a cluster formed while bypassing fixed objects]
18
Multiple Levels of Grids

Adaptive grid size based on average cluster size
Better global optimization
use solution of placement problem constrained
with coarser grids as initial solution for
problem constrained with finer grids
Better scalability
larger grid size spreads modules faster
Different levels of relaxation for density
constraints
According to grid size

19
New Feature 2: Adaptive WL Weight

Important to QOR
Initial weight value
For each cluster level and grid level
Based on wirelength and density partial derivatives
Goal: magnitudes of gradients roughly equal
Decrease WL weight by half whenever CG solver obtains a stable solution
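
A small sketch of the weight rule follows. It assumes the objective has the form weight·WL + density penalty and that "magnitude" means the Euclidean norm of each gradient; neither detail is stated on the slide, and both function names are hypothetical.

```python
import math

def initial_wl_weight(wl_grad, density_grad):
    """Pick the weight so weight*|grad WL| equals |grad density| (Euclidean norms assumed)."""
    wl_norm = math.sqrt(sum(g * g for g in wl_grad))
    density_norm = math.sqrt(sum(g * g for g in density_grad))
    return density_norm / wl_norm if wl_norm > 0.0 else 1.0

def wl_weight_schedule(initial_weight, rounds):
    """Yield the weight for each round, halving it after every stable CG solution."""
    weight = initial_weight
    for _ in range(rounds):
        yield weight
        weight *= 0.5
```

In a full placer, each yielded weight would drive one conjugate-gradient run to a stable solution before the next, smaller weight shifts emphasis toward the density term.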

20
New Feature 3: Legalization and Detailed Placement
Variant of greedy legalization algorithm (Hill01)

(1) Sort all cells from left to right; move each cell in order to the closest legal position
(2) Sort all cells from right to left; move each cell in order to the closest legal position
Pick the better of (1) and (2)
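
The two-pass idea can be sketched for a single row with continuous coordinates. This is a simplification: real legalization handles many rows and discrete sites. The clamp that reserves room for the cells still to be placed is an added assumption to keep the sketch always legal, "better" is assumed to mean smaller total displacement, and the names are hypothetical.

```python
def _pack(cells, row_start, row_end):
    """Greedy left-to-right pass over (name, desired_x, width) tuples."""
    order = sorted(cells, key=lambda c: c[1])
    remaining = sum(w for _, _, w in order)
    if remaining > row_end - row_start:
        raise ValueError("cells do not fit in the row")
    frontier, placed = row_start, {}
    for name, x, w in order:
        latest = row_end - remaining          # leave room for the cells still to come
        pos = min(max(x, frontier), latest)   # closest spot right of the frontier
        placed[name] = pos
        frontier = pos + w
        remaining -= w
    return placed

def legalize_row(cells, row_start, row_end):
    """Run the pass in both directions and keep the result with less displacement."""
    desired = {name: x for name, x, _ in cells}
    width = {name: w for name, _, w in cells}
    span = row_start + row_end
    forward = _pack(cells, row_start, row_end)
    # The right-to-left pass is the same pass on mirrored coordinates.
    mirrored = [(n, span - (x + w), w) for n, x, w in cells]
    backward = {n: span - p - width[n]
                for n, p in _pack(mirrored, row_start, row_end).items()}

    def displacement(placed):
        return sum(abs(placed[n] - desired[n]) for n in placed)

    return min(forward, backward, key=displacement)
```

For cells a, b, c of width 4 with desired positions 0, 2 and 9 in a row spanning 0 to 14, the sketch keeps a at 0 and c at 9 and shifts b to 4.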

Detailed Placement Components
Global cell movement (Goto81, KenningsM98
BoxPlace, FP)
Whitespace compaction (KahngTZ99, KahngMR04)
Cell order polishing (similar to rowIroning, FS
detailed placer)
Intra-row cell reordering
Inter-row cell reordering

21
Global Moving

Move cell to optimal location among available
whitespace
improve quality when utilization is low
Two steps
search for available location in optimal region of a cell's placement
search for available location in best bin
divide placement area into uniform bins
choose "best" bin according to available whitespace and cost of moving cell to bin center
assume normal distribution of whitespace with
width and estimate if an available location exists

22
WhiteSpace (WS) Compaction
[Figure: a row of placement sites numbered 1 to 12; between a start node and an end node, one chain of candidate-site nodes for each of cell 1, cell 2, cell 3, ..., cell n]

Each chain represents the possible placement sites for one cell
The cost on each arrow is the change in HPWL when the cell moves to that site
The order of chains corresponds to the order of cells from left to right in a row
A shortest path from source (start node) to sink (end node) gives the best way to compact WS
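
Because the chains form a layered graph, the shortest path can be found with a simple dynamic program. The sketch below is one possible formulation, not the APlace code: cost(i, s) is a caller-supplied, hypothetical function returning the HPWL change for putting cell i with its left edge at site s, widths are measured in sites, and at least one cell is assumed.

```python
def compact_row(widths, num_sites, cost):
    """Return (total_cost, left_sites) for cells kept in their left-to-right order."""
    INF = float("inf")
    best, back = [], []    # best[i][s]: cheapest path ending with cell i at site s
    for i, w in enumerate(widths):
        row = [INF] * num_sites
        brow = [None] * num_sites
        # Running minimum over the legal positions of the previous cell.
        run_min, run_arg = (0.0, None) if i == 0 else (INF, None)
        for s in range(num_sites):
            if i > 0:
                p = s - widths[i - 1]         # rightmost legal site for cell i-1
                if p >= 0 and best[i - 1][p] < run_min:
                    run_min, run_arg = best[i - 1][p], p
            if s + w <= num_sites and run_min < INF:
                row[s] = run_min + cost(i, s)
                brow[s] = run_arg
        best.append(row)
        back.append(brow)
    end = min(range(num_sites), key=lambda s: best[-1][s])
    if best[-1][end] == INF:
        raise ValueError("cells do not fit in the row")
    sites = [end]
    for i in range(len(widths) - 1, 0, -1):   # walk the back-pointers to the start
        sites.append(back[i][sites[-1]])
    sites.reverse()
    return best[-1][end], sites
```

The running minimum keeps the work proportional to the number of cells times the number of sites, rather than examining every arrow between consecutive chains.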

23
Cell Order Polishing

Permute a small window of neighboring cells in
order to improve wirelength
MetaPlacer's rowIroning: up to 15 cells in one row, assuming equal whitespace distribution
FengShui's cell ordering: six objects in one or more rows, regarding whitespace as pseudo cells
Branch-and-bound algorithm
four nearby cells in one or multiple rows
consider optimal placement for each permutation
more accurate, overlap-free permutations and no
cell shifting

24
Single-Row Cell Ordering

Cost of placing first j cells of a permutation
cost: wirelength increase when placing a cell
ΔWL ≠ 0 only if cell is leftmost or rightmost
remaining cells placed to the right of first j cells
unrelated to order or placement of remaining cells
BB algorithm
construct permutations in lexicographic order
next permutation has same prefix as the previous one
beginning rows of DP table can be reused as much as possible
cut branch when minimum cost of placing first j cells > best cost till now
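
A simplified branch-and-bound over window permutations is sketched below. It is not the DP-table scheme from the slide: as a stand-in bound it uses the x-span of each net over the pins already known (fixed pins plus placed cells), which can only grow as more cells are placed. Cells are assumed to be packed abutted from x0 with pins at cell centers, and all names are hypothetical.

```python
def order_window(widths, nets, x0):
    """widths: {cell: width}; nets: list of (window_cells, fixed_pin_xs).
    Returns (best_cost, best_order) over all left-to-right orders of the window."""
    names = sorted(widths)                    # lexicographic branching order
    best = {"cost": float("inf"), "order": None}

    def partial_cost(pos):
        """Sum of net spans over known pins; a lower bound on the final cost."""
        total = 0.0
        for cells, fixed in nets:
            xs = list(fixed) + [pos[c] for c in cells if c in pos]
            if len(xs) >= 2:
                total += max(xs) - min(xs)
        return total

    def dfs(prefix, pos, x):
        bound = partial_cost(pos)
        if bound >= best["cost"]:
            return                            # cut: cannot beat the best so far
        if len(prefix) == len(names):
            best["cost"], best["order"] = bound, list(prefix)
            return
        for c in names:
            if c in pos:
                continue
            pos[c] = x + widths[c] / 2.0      # pin assumed at the cell center
            prefix.append(c)
            dfs(prefix, pos, x + widths[c])
            prefix.pop()
            del pos[c]

    dfs([], {}, x0)
    return best["cost"], best["order"]
```

With three width-2 cells where a connects to a fixed pin at x = 10, c connects to a fixed pin at x = -5, and a and b share a net, the sketch returns the order c, b, a with cost 13.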

25
Two- or Three-Row Cell Ordering

DP algorithm
decide how many cells are assigned to each row, from top to bottom
construct a permutation in lexicographic order
find optimal placement within the window
Y-cost of placing first j cells: accurate
remaining cells placed lower than first j cells
X-cost of placing first j cells: inaccurate when a net connects placed and unplaced cells
results show it is still effective with a small set of cells and a small window

26
Outline

Introduction
Clustering
Global Placement
Detailed Placement
Experimental Results
IBM ISPD04
IBM-PLACE v2
IBM ICCAD04
IBM ISPD05
Conclusions and Future Works

27
IBM ISPD04

Test basic placer performance with standard cells

Circuit   APlace2.0   mPL5   Capo9.0   Dragon3   FP1    FS2.6
ibm10     17.20       17.3   1.1       1.04      1.07   1.07
ibm11     13.22       14     1.09      1.03      1.09   1.04
ibm12     21.83       22.3   1.11      1.03      1.08   1.07
ibm13     16.46       16.6   1.1       1.05      1.11   1.09
ibm14     30.55       31.6   1.1       1.05      1.11   1.04
ibm15     38.38       38.5   1.09      1.04      1.13   1.07
ibm16     41.36       43     1.1       1.05      1.07   1.09
ibm17     60.82       61.3   1.09      1.08      1.08   1.08
ibm18     39.32       41     1.09      1.02      1.1    1.04
Average   0.97        1      1.09      1.03      1.08   1.06

3% better than the best other placer, mPL5 (ISPD05)

28
IBM Place V2

Test placer under whitespace presence and
routability

Circuit      APlace2.0   Vias     mPLWSA
ibm09-easy   3.023       495073   3.5
ibm09-hard   3.027       503410   3.65
ibm10-easy   5.977       758598   6.84
ibm10-hard   5.931       772744   6.76
ibm11-easy   4.577       638523   5.16
ibm11-hard   4.654       656525   5.15
ibm12-easy   8.337       892915   10.52
ibm12-hard   8.317       902465   10.13
Average      0.88                 1

12% better than mPL-RWSA (ICCAD04)

29
IBM ICCAD04

Test placer performance with cells and blocks
(floorplacement)

Circuit   APlace2.0   FS2.6   Capo9.0
ibm10     28.55       41.96   34.98
ibm11     18.67       21.19   22.31
ibm12     33.51       40.84   40.78
ibm13     23.03       25.45   28.7
ibm14     35.9        39.93   40.97
ibm15     46.82       51.96   59.19
ibm16     54.58       62.77   67
ibm17     66.49       69.38   78.78
ibm18     42.14       45.59   50.39
Average   0.86        1       1.05

14% and 19% better than FS and Capo, respectively

30
IBM ISPD05

Test placer performance with cells and
movable/fixed blocks

Placer      adaptec2   adaptec4   BB1      BB2      BB3      BB4       AVG
APlace2.0   87.31      187.65     94.64    143.82   357.89   833.21    1
mFAR        91.53      190.84     97.7     168.7    379.95   876.28    1.06
Dragon      94.72      200.88     102.39   159.71   380.45   903.96    1.08
mPL         97.11      200.94     98.31    173.22   369.66   904.19    1.09
FastPlace   107.86     204.48     101.56   169.89   458.49   889.87    1.16
Capo        99.71      211.25     108.21   172.3    382.63   1098.76   1.17
NTUP        100.31     206.45     106.54   190.66   411.81   1154.15   1.21
FengShui    122.99     337.22     114.57   285.43   471.15   1040.05   1.5
KW          157.65     352.01     149.44   322.22   656.19   1403.79   1.84

6% better than the best other placer (mFAR)

31
APlace2.0 Conclusions

60 days clean sheet of paper Qinke Wang
Sherief Reda
Scalable implementation
State-of-the-art clustering and global placement
engines
Improved detailed placement engine
Better than best published results by
3%: ISPD04 suite
14%: ICCAD04
12%: IBMPLACE V.2
6%: ISPD05 Placement Contest
Recent Applications (other than restoring
functionality)
IR-drop driven placement (ICCD-2005 Best Paper)
Lens aberration-aware placement (DATE-2006)
Toward APlace3.0 ?

32
Thank You

Questions?

33
Goals and Plan

Goals
Build a new placer to win the competition
Scalable, robust, high-quality implementation
Leave no stone unturned / QOR on the table
Plan and Schedule
Work within most promising framework: APlace
30 days for coding, 30 days for tuning

34
Philosophy

Respect the competition
Well-funded groups with decades of experience
ABKGroup's Capo, MLPart, APlace: all unfunded side projects
No placement-related industry interactions
QOR target: 24-26% better than Capo v9r6 on all known benchmarks
Nearly pulled out 10 days before competition
Work smart
Solve scalability and speed basics first
Slimmed-down data structure, -msse compiler
options, etc.
Ordered list of 15 QOR ideas to implement
Daily regressions on all known benchmarks
Synthetic testcases to predict bb3, bb4, etc.

35
Implementation Framework
New APlace Flow

APlace weaknesses
Weak clustering
Poor legalization / detailed placement

New APlace
New clustering
Adaptive parameter setting for scalability
New legalization + iterative detailed placement

Flow
Global Phase: Clustering → Adaptive APlace engine → Unclustering
Detailed Phase: Legalization → WS arrangement → Cell order polishing → Global moving

36
Parameterization and Parallelizing
Tuning Knobs

Clustering ratio, top-level clusters, cluster
area constraints
Initial wirelength weight, wirelength weight
reduction ratio
Max CG iterations for each wirelength weight
Target placement discrepancy
Detailed placement parameters, etc.

Resources

SDSC ROCKS Cluster: 8 Xeon CPUs at 2.8GHz
Michigan Prof. Sylvester's Group: 8 various CPUs
UCSD FWGrid: 60 Opteron CPUs at 1.6GHz
UCSD VLSICAD Group: 8 Xeon CPUs at 2.4GHz

Wirelength improvement after tuning: 2-3%
37
Artificial Benchmark Synthesis

Synthetic benchmarks to test code scalability and
performance
Rapid response to broadcast of s00-nam.pdf
Created synthetic versions of bigblue3 and
bigblue4 within 48 hours
Mimicked fixed-block layout diagrams in the
artificial benchmark creation
This process was useful: we identified (and solved) a problem with clustering in the presence of many small fixed blocks

38
Results
Circuit    GP HPWL   Leg HPWL   DP HPWL   CPU (h)
adaptec1   80.20     81.80      79.50     3
adaptec2   84.70     92.18      87.31     3
adaptec3   218.00    230.00     218.00    10
adaptec4   182.90    194.75     187.71    13
bigblue1   93.67     97.85      94.64     5
bigblue2   140.68    147.85     143.80    12
bigblue3   357.28    407.09     357.89    22
bigblue4   813.91    868.07     833.21    50
39
Conclusions

ISPD05: an exercise in process and philosophy
At end, we were still 4% short of where we wanted
Not happy with how we handled 5-day time frame
Auto-tuning → first results best results
During competition, wrote but then left out annealing DP improvements that gained another 0.5%
Students and IBM ARL did a really, really great job
Currently restoring capabilities (congestion, timing-driven, etc.) and cleaning (antecedents in Naylor patent)
