Title: Floorplanning Optimization with Trajectory Piecewise-Linear Model for Pipelined Interconnects
- C. Long, L. J. Simonson, W. Liao and L. He
- EDA Lab, EE Dept. UCLA
- DAC 2004
2 Outline
- Motivation
- Background
- Trajectory piecewise-linear CPI model
- CPI-aware floorplanning
- Experiment results
- Conclusion and discussions
3 Motivation
- Traditional design flow
- Architecture optimization minimizes CPI
- Floorplanning optimization maximizes clock frequency
- Architectural optimization is separated from physical optimization under the assumption that layout does NOT change CPI.
4 Traditional Flow
- A few years ago
- Clock rates were much lower
- More time for signal to reach its destination
- Inductance was less of a factor in delay
- Interconnect delay was smaller
- Less resistance
- Lower aspect ratio meant less capacitance
- Inter-module communication took less than one cycle
- Interconnect length used to determine clock period (just clock it faster until it doesn't work)
- Floorplanning had no impact on the cycle-by-cycle operation (CPI) of the processor
5 A New Interconnect Centric Reality
- Now
- Clock rates have increased by an order of magnitude
- My P2 from 1998 is 400 MHz; the Prescott P4 will be 4.0 GHz by the fourth quarter of '04 and has 31 pipeline stages for integer operations, some of which are due exclusively to interconnect pipelining
- Interconnects have longer delay with higher aspect ratio
- Die size is the same
- A signal can take up to ten clock cycles to travel from one corner of a chip to the opposite corner in 90nm technology
- Inter-module communication is likely to take more than one cycle
- Clock period is now a constraint, not an objective
- Interconnect is pipelined when it cannot meet the constraint
- A pipelined interconnect delays the cycle in which a signal arrives
- Changes the cycle-by-cycle behavior (CPI) of the system
- Determined by floorplanning
6 How to solve this problem?
- Evaluate performance during floorplanning optimization
- Efficiency of the evaluation is the key
- Cycle-accurate simulation is too slow for this purpose
[Figure: design flow linking ISA/configuration, architecture optimization, floorplanning optimization and performance evaluation]
7 Contributions of our work
- We have pointed out that interconnect latency has a significant impact on architecture performance and that it is critical to consider it during floorplanning
- We have developed an efficient table-based cycles-per-instruction (CPI) model
- Called the trajectory piecewise-linear (TPWL) model, with error less than 3.0%
- We have integrated the TPWL CPI model with floorplan optimization
- To reduce CPI by up to 28.57% with a small area overhead of 5.72%
8 Background
- Architecture and partitioning
- A superscalar implementation of the MIPS instruction set
- Similar to Alpha 21264
- Twelve blocks
Block    Area (mm^2)    Block    Area (mm^2)
IALU1    1.00           IALU2    1.00
IALU3    1.00           IMULT    1.00
F_ADD    1.94           F_MULT   2.07
RUU      3.04           Decode   1.44
Branch   2.27           L2       75.6
IL1      8.99           DL1      10.03
9 Bus Latency Vectors
- Interface between physical level and architecture level
- Twelve buses
- Bus latency vectors (B)
- E.g., B = (3, 4, 7, ...)
- Characterize a floorplan as a vector containing the latency of each interconnect
Bus id   Terminals      Bus id   Terminals
1        IALU1, RUU     7        IL1, L2
2        IALU2, RUU     8        DL1, L2
3        IALU3, RUU     9        Branch, IL1
4        IMULT, RUU     10       Decode, Branch
5        FPADD, RUU     11       LSQ, DL1
6        FPMUL, RUU     12       Decode, RUU
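As a rough illustration of how a floorplan could be turned into a bus latency vector, the sketch below derives one latency per bus from block positions. The bus terminals follow the table above; everything else is an assumption made for the example and is not stated on the slides: block centers given in mm, Manhattan wire length, a delay that grows linearly with length (as for a buffered wire), and a minimum latency of one cycle.

```python
import math

# Bus terminals as listed in the table above (bus id -> pair of blocks).
BUSES = {
    1: ("IALU1", "RUU"), 2: ("IALU2", "RUU"), 3: ("IALU3", "RUU"),
    4: ("IMULT", "RUU"), 5: ("FPADD", "RUU"), 6: ("FPMUL", "RUU"),
    7: ("IL1", "L2"), 8: ("DL1", "L2"), 9: ("Branch", "IL1"),
    10: ("Decode", "Branch"), 11: ("LSQ", "DL1"), 12: ("Decode", "RUU"),
}


def bus_latency_vector(centers, delay_per_mm_ns, clock_period_ns):
    """Return B as a list of per-bus latencies in clock cycles (bus 1 first).

    centers: dict mapping block name -> (x, y) center in mm (hypothetical input).
    delay_per_mm_ns: assumed linear wire delay; not a value from the slides.
    """
    latencies = []
    for bus_id in sorted(BUSES):
        a, b = BUSES[bus_id]
        (xa, ya), (xb, yb) = centers[a], centers[b]
        length_mm = abs(xa - xb) + abs(ya - yb)  # Manhattan distance
        delay_ns = length_mm * delay_per_mm_ns
        # A bus that cannot meet the clock constraint is pipelined, so its
        # latency is the number of cycles needed; at least one cycle here.
        latencies.append(max(1, math.ceil(delay_ns / clock_period_ns)))
    return latencies
```

With this representation, two floorplans that yield the same vector B look identical to the architecture level, which is what lets B act as the interface between the two levels.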
10 Miss Events and Performance Loss
- Types of miss events
- Data Cache Miss
- Instruction Cache Miss
- TLB Miss
- Branch Misprediction
- Other sources of performance loss
- Data dependencies
- Resource Contention
11 Measuring Performance
- No hardware to measure
- Need a model of the hardware
- Simulate the execution of the machine
- Two types of simulation
- Trace driven simulation
- Shade to generate instruction and address trace, dinero to model cache, etc.
- Fast, 10s of instructions on host machine per instruction on target machine
- Inaccurate
- good for I-Cache performance loss measurement
- bad for D-Cache performance loss measurement
- poor for branch misprediction performance loss
- very bad for data dependency performance loss
- Execution driven simulation
- State of target hardware is maintained and updated in memory as each instruction is processed
- Slow, 1000s of instructions on host machine per instruction on target machine
- Cycle-accurate, true to cycle-by-cycle behavior of hardware
12 Cycle Accurate Simulation
- Given B, compute CPI
- Modify the architecture according to B
- Change the configuration file
- Insert buffers between modules
- Measure CPI for a subset of the SPEC2000 benchmark suite
- Floating point benchmarks: equake and mesa
- Integer benchmarks: gzip, vortex and mcf
- Take the arithmetic mean of these benchmarks as the CPI for B
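The averaging step can be written down directly. The sketch below assumes the per-benchmark CPI values for one bus latency vector B have already been measured by simulation; the function name and the dictionary layout are illustrative choices, not part of the original flow.

```python
# Benchmarks named on the slide: two floating point, three integer.
BENCHMARKS = ("equake", "mesa", "gzip", "vortex", "mcf")


def cpi_for_bus_vector(per_benchmark_cpi):
    """Arithmetic mean of the per-benchmark CPI values for one vector B.

    per_benchmark_cpi: dict mapping benchmark name -> measured CPI
    (hypothetical input format).
    """
    missing = [b for b in BENCHMARKS if b not in per_benchmark_cpi]
    if missing:
        raise ValueError(f"missing benchmarks: {missing}")
    return sum(per_benchmark_cpi[b] for b in BENCHMARKS) / len(BENCHMARKS)
```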
13 CPI Models
- A CPI model estimates CPI as a function of parameters of interest, such as interconnect latency, architecture configuration, etc.
- CPI models in the literature
- Statistical simulation [Nussbaum01]
- Based on a single detailed simulation
- Generate a synthetic instruction trace
- Take advantage of cache and branch prediction statistics
- Statistical sampling of cycle-accurate simulation
- Sampling instead of truncating: selectively measuring in detail only an appropriate benchmark subset
- Configuring a systematic sampling simulation run to achieve a desired confidence in estimates
- More efficient than cycle-accurate simulation but still slow; none of them considers interconnect latency
14 Traditional floorplanning
- Optimize floorplan via simulated annealing (SA) algorithm
- Objective function
- Moves
- Change the position or shape of blocks
- Cooling scheme
- Initial temperature
- Constant cooling rate
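The ingredients listed above (objective function, moves, initial temperature, constant cooling rate) fit a generic simulated annealing loop. The following is a minimal sketch of such a loop, not the floorplanner used in this work: the acceptance rule is the standard Metropolis criterion, and the parameter names and default values (`t0`, `cooling`, `moves_per_temp`, `t_min`) are assumptions chosen for illustration.

```python
import math
import random


def simulated_annealing(initial, cost, neighbor, t0=100.0, cooling=0.95,
                        moves_per_temp=50, t_min=1e-3, seed=0):
    """Generic SA loop with a constant (geometric) cooling rate.

    cost(solution) -> float is the objective to minimize.
    neighbor(solution, rng) -> a new solution produced by one move.
    """
    rng = random.Random(seed)
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    t = t0
    while t > t_min:
        for _ in range(moves_per_temp):
            candidate = neighbor(current, rng)
            delta = cost(candidate) - current_cost
            # Always accept improvements; accept worse solutions with a
            # probability that shrinks as the temperature drops.
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current, current_cost = candidate, current_cost + delta
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        t *= cooling  # constant cooling rate
    return best, best_cost
```

For floorplanning, `cost` would typically combine area and total wire length, and `neighbor` would change the position or shape of a block; both are left as callables here because the slides do not give their exact form.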
15 Floorplanning considering CPI
- Based on simulated annealing
- Objective function
- Extend from traditional floorplanning framework
- Key is to estimate CPI efficiently
- Moves and cooling schedule remain the same
16 Trajectory of SA
- The path that SA follows during optimization is a trajectory in the solution space
- We only need to accurately estimate CPI in the area where the trajectory travels
- The trajectory of SA with objective of area, wire length and CPI is close to that of area and wire length only
[Figure: SA trajectories in the Bus1/Bus2 latency plane for two objectives: "Area and wire length" and "Area, wire length and CPI"]
17 Trajectory Piecewise-linear CPI Model
- Build a piecewise-linear model for a small solution region around the trajectories of SA
- Three phases: sampling, collecting and simulating
- An example for a 2-dimensional bus vector
[Figure: example in the plane of Latency (bus1) vs. Latency (bus2), with points marked "simulation"]
18 TPWL Sampling
[Figure: sampled points in the plane of Latency (bus1) vs. Latency (bus2)]
- Sample a complete simulated annealing process with objective of area and total wire length to obtain a set of bus latency vectors (points in n dimensions)
19 TPWL Collecting
[Figure: sampled points grouped into balls in the plane of Latency (bus1) vs. Latency (bus2)]
- Collect all the points obtained in the sampling phase in as few balls as possible (TPC problem)
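The slides do not say how the TPC problem is solved. One plausible approach, shown here purely as an assumption, is a greedy set-cover heuristic: fix a ball radius, restrict candidate centers to the sampled points, and repeatedly pick the center whose ball contains the most still-uncovered points. The function name and the `radius` parameter are hypothetical.

```python
import math


def greedy_ball_cover(points, radius):
    """Cover all points with balls of a fixed radius; return the ball centers.

    Greedy heuristic (not guaranteed to be minimal): centers are chosen from
    the points themselves, most-covering candidate first.
    """
    pts = list(dict.fromkeys(map(tuple, points)))  # drop duplicates, keep order
    uncovered = set(range(len(pts)))
    centers = []
    while uncovered:
        best_i, best_cov = None, set()
        for i in range(len(pts)):
            cov = {j for j in uncovered
                   if math.dist(pts[i], pts[j]) <= radius}
            if len(cov) > len(best_cov):
                best_i, best_cov = i, cov
        # An uncovered point always covers itself, so best_cov is non-empty.
        centers.append(pts[best_i])
        uncovered -= best_cov
    return centers
```

Each returned center then needs exactly one cycle-accurate simulation, so fewer balls means a cheaper model, while a smaller radius keeps the linear approximation around each center more trustworthy.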
20 TPWL Simulating
[Figure: ball centers marked "simulation" in the plane of Latency (bus1) vs. Latency (bus2)]
- Obtain CPI by cycle-accurate simulation for the center of each ball
- Build a CPI table indexed by these center points
21 CPI estimation under TPWL model
- Based on each entry, the CPI of a target B can be estimated by first-order expansion
- For each entry, a weight is calculated based on the distance between the target B and the entry in the CPI table
- The final estimate is the weighted sum of the estimates based on each entry
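The three steps above can be sketched as follows. The slides do not state how the first-order terms are obtained or which weighting function is used, so two assumptions are made here: each table entry carries a gradient of CPI with respect to the bus latencies (for example from finite differences), and the weights decay exponentially with distance relative to the nearest entry, in the style of TPWL weighting for non-linear systems. The names `tpwl_estimate` and `beta` are illustrative.

```python
import math


def tpwl_estimate(target, table, beta=25.0):
    """Estimate CPI at bus latency vector `target` from a CPI table.

    table: list of (center, cpi, gradient) tuples, where `gradient` is an
    assumed per-bus sensitivity of CPI around `center`.
    """
    dists = [math.dist(target, center) for center, _, _ in table]
    d_min = min(dists)
    if d_min == 0.0:
        # Target coincides with a table entry: use that entry alone.
        weights = [1.0 if d == 0.0 else 0.0 for d in dists]
    else:
        # Nearest entry gets weight 1; farther entries decay quickly.
        weights = [math.exp(-beta * (d / d_min - 1.0)) for d in dists]
    total = sum(weights)

    estimate = 0.0
    for (center, cpi, grad), w in zip(table, weights):
        # First-order expansion around this entry.
        linear = cpi + sum(g * (t - c) for g, t, c in zip(grad, target, center))
        estimate += (w / total) * linear
    return estimate
```

Because the weights are normalized, the result is a convex combination of the per-entry linear estimates; if CPI were exactly linear in B, every entry would give the same answer and the estimate would be exact.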
22 CPI-aware Floorplanning - Overview
- Integrate the TPWL CPI model with a traditional floorplanning tool
23 Iterative TPWL model
- When the trajectory with objective of area and
total wire length is significantly different from
the trajectory with objective of area, total wire
length and CPI, an iterative TPWL model is needed
[Figure: SA trajectories in the Bus1/Bus2 latency plane: "Area and wire length", "iteration 1", "iteration 2", and "Area, wire length and CPI"]
24 Iterative TPWL Model
- Iteratively expand the CPI table to build an iterative TPWL (iTPWL) model
- Based on the TPWL model, but from the second iteration on, the objective of SA is area, total wire length and CPI
- Improve the accuracy of CPI estimation and the quality of the final floorplan
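One way to read the iteration described above is as a loop that alternates SA runs with table expansion. The sketch below is only a schematic under that reading: `run_sa`, `cover` and `simulate` are hypothetical callables standing in for the floorplanner, the collecting step and the cycle-accurate simulator, and the number of iterations is an arbitrary parameter.

```python
def build_itpwl_table(run_sa, cover, simulate, iterations=2):
    """Iteratively grow a CPI table (center -> simulated CPI).

    run_sa(table): returns the bus latency vectors visited by one SA run.
        With table=None the objective is area and wire length only; with a
        table, CPI estimated from that table is added to the objective.
    cover(points): returns the ball centers that cover the points.
    simulate(center): returns the cycle-accurate CPI at one center.
    """
    table = {}
    for it in range(iterations):
        points = run_sa(None if it == 0 else table)
        for center in cover(points):
            center = tuple(center)
            if center not in table:  # simulate each center only once
                table[center] = simulate(center)
    return table
```

The first pass reproduces the plain TPWL table; later passes add entries where the CPI-aware trajectory leaves the region already covered.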
25 Summary on TPWL CPI Model
- Originally proposed for modeling non-linear systems [Rewienski03]
- Outperforms other techniques based on quadratic reduction
- TPWL model is suitable for floorplanning optimization
- The trajectory of SA with objective of area, total wire length and CPI is close to that with objective of area and total wire length only
- When these two trajectories are not close, the iTPWL model is employed to improve the accuracy
- Contribution of this paper on TPWL model
- Introduce the TPC problem
- Expand TPWL model to iTPWL model
26 Experiment results
- Verification of CPI models
- Error of TPWL model: 2.62%; error of iTPWL model: 1.66%
27 Impact of models on final floorplans
- Comparison of the floorplans obtained by the access ratio, sensitivity rate, TPWL and iTPWL models with objective of area, total wire length and CPI
- Access ratio: use the access ratio of interconnects to represent the impact on system performance
- Sensitivity rate: estimate CPI based on first-order expansion on the original point
28 Floorplanning with iTPWL Model
- Comparison between floorplans obtained by
different objectives
29 Running time
- SimpleScalar simulation times to build the TPWL and iTPWL models
30 Conclusion and discussion
- Propose an accurate CPI model with less than 3.0% error
- The CPI-aware floorplanner reduces CPI by up to 28.57% with a small area overhead of 5.72%
- Expand the TPWL model and improve the accuracy of estimation
- The accuracy of the iTPWL model leads to floorplanning solutions with high quality and enables us to develop good heuristics, such as access ratio, to minimize CPI without explicit CPI calculation.
- Plan to apply this model to architecture changes