Floorplanning Optimization with Trajectory Piecewise-Linear Model for Pipelined Interconnects

Learn more at: http://eda.ee.ucla.edu

1
Floorplanning Optimization with Trajectory
Piecewise-Linear Model for Pipelined Interconnects

C. Long, L. J. Simonson, W. Liao and L. He
EDA Lab, EE Dept. UCLA
DAC 2004

2
Outline

Motivation
Background
Trajectory piecewise-linear CPI model
CPI-aware floorplanning
Experiment results
Conclusion and discussions

3
Motivation

Traditional design flow
Architecture optimization: minimize CPI
Floorplanning optimization: maximize clock
frequency
Architectural optimization is separated from
physical optimization under the assumption that
layout does NOT change CPI.

4
Traditional Flow

A few years ago
Clock rates were much lower
More time for signal to reach its destination
Inductance was less of a factor in delay
Interconnect delay was smaller
Less resistance
Lower aspect ratio meant less capacitance
Inter-module communication took less than one
cycle
Interconnect length used to determine clock
period (just clock it faster until it doesn't
work)
Floorplanning had no impact on the cycle-by-cycle
operation (CPI) of the processor

5
A New Interconnect Centric Reality

Now
Clock rates have increased by an order of
magnitude
My P2 from 1998 is 400 MHz; the Prescott P4 will
be 4.0 GHz by the fourth quarter of '04 and has 31
pipeline stages for integer operations, some of
which are due exclusively to interconnect
pipelining
Interconnects have longer delay with higher
aspect ratio
Die size is the same
A signal can take up to ten clock cycles to
travel from one corner of a chip to the opposite
corner in 90nm technology
Inter-module communication is likely to take
more than one cycle
Clock period is now a constraint, not an
objective
Interconnect is pipelined when it cannot meet the
constraint
A pipelined interconnect delays the cycle a
signal arrives
Changes the cycle-by-cycle behavior (CPI) of the
system
Determined by floorplanning

6
How to solve this problem?

Evaluate performance during floorplanning
optimization
Efficiency of the evaluation is the key
Cycle-accurate simulation is too slow for this
purpose

[Figure: flow diagram connecting architecture optimization (ISA, configuration) and floorplanning optimization through performance evaluation]

7
Contributions of our work

We have pointed out that the interconnect latency
has a significant impact on architecture
performance and it is critical to consider it
during floorplanning
We have developed an efficient table-based
cycles-per-instruction (CPI) model
Called the trajectory piecewise-linear (TPWL)
model, with error less than 3.0%
We have integrated the TPWL CPI model with
floorplan optimization
To reduce CPI by up to 28.57% with a small area
overhead of 5.72%

8
Background

Architecture and partitioning
A SuperScalar implementation of the MIPS
instruction set
Similar to Alpha 21264
Twelve blocks

Block    Area (mm²)    Block     Area (mm²)
IALU1    1.00          IALU2     1.00
IALU3    1.00          IMULT     1.00
F_ADD    1.94          F_MULT    2.07
RUU      3.04          Decode    1.44
Branch   2.27          L2        75.6
IL1      8.99          DL1       10.03

9
Bus Latency Vectors

Interface between physical level and architecture
level
Twelve buses
Bus latency vectors (B)
E.g., B = (3, 4, 7, ...)
Characterize a floorplan as a vector containing
the latency of each interconnect
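
As a rough illustration of how a floorplan might be reduced to a bus latency vector, the sketch below measures the Manhattan distance between block centers and converts it to a whole number of clock cycles. The block coordinates, the `MM_PER_CYCLE` constant and the function name `bus_latency_vector` are assumptions made for this example; the slides do not state the delay model that was actually used.

```python
import math

# Hypothetical distance (in mm) that a signal can travel in one clock cycle.
MM_PER_CYCLE = 2.0

def bus_latency_vector(centers, buses, mm_per_cycle=MM_PER_CYCLE):
    """Return the latency, in cycles, of each bus for a given floorplan.

    centers: dict mapping block name -> (x, y) center in mm (illustrative)
    buses:   list of (block_a, block_b) terminal pairs, one per bus
    """
    latencies = []
    for a, b in buses:
        (xa, ya), (xb, yb) = centers[a], centers[b]
        manhattan = abs(xa - xb) + abs(ya - yb)
        # Every bus takes at least one cycle; longer wires need more stages.
        latencies.append(max(1, math.ceil(manhattan / mm_per_cycle)))
    return latencies

# Illustrative placement of three blocks, not taken from the presentation.
centers = {"IALU1": (0.5, 0.5), "RUU": (3.5, 1.5), "Decode": (3.5, 6.5)}
buses = [("IALU1", "RUU"), ("Decode", "RUU")]
B = bus_latency_vector(centers, buses)
print(B)
```

Moving a block changes the distances and therefore the entries of B, which is how the floorplan reaches the architecture-level simulation in this simplified picture.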

Bus id   Terminal         Bus id   Terminal
1        IALU1, RUU       7        IL1, L2
2        IALU2, RUU       8        DL1, L2
3        IALU3, RUU       9        Branch, IL1
4        IMULT, RUU       10       Decode, Branch
5        FPADD, RUU       11       LSQ, DL1
6        FPMUL, RUU       12       Decode, RUU

10
Miss Events and Performance Loss

Types of miss events
Data Cache Miss
Instruction Cache Miss
TLB Miss
Branch Miss Prediction
Other sources of performance loss
Data dependencies
Resource Contention

11
Measuring Performance

No hardware to measure
Need a model of the hardware
Simulate the execution of the machine
Two types of simulation
Trace driven simulation
Shade to generate instruction and address trace,
dinero to model cache, etc.
Fast, 10s of instructions on host machine per
instruction on target machine
Inaccurate
good for I-Cache performance loss measurement
bad for D-Cache performance loss measurement
poor for branch miss prediction performance loss
very bad for data dependency performance loss
Execution driven simulation
State of target hardware is maintained and
updated in memory as each instruction is
processed
Slow, 1000s of instructions on host machine per
instruction on target machine
Cycle-accurate, true to cycle by cycle behavior
of hardware

12
Cycle Accurate Simulation

Given B, compute CPI
Modify the architecture according to B
Change the configuration file
Insert buffers between modules
Measure CPI for a subset of the SPEC2000
benchmark suite
Floating-point benchmarks: equake and mesa
Integer benchmarks: gzip, vortex and mcf
Take the arithmetic mean of these benchmarks as
the CPI for B
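
The averaging step can be sketched as follows. The `simulate` callable stands in for a cycle-accurate simulator run and `fake_simulate` is an invented formula used only so that the example executes; neither comes from the presentation.

```python
from statistics import mean

# Benchmark subset named on the slide.
BENCHMARKS = ["equake", "mesa", "gzip", "vortex", "mcf"]

def cpi_for_bus_vector(bus_vector, simulate):
    """Arithmetic mean of the per-benchmark CPI for one bus latency vector.

    simulate(benchmark, bus_vector) is assumed to return the CPI measured
    by a cycle-accurate simulation of that benchmark.
    """
    return mean(simulate(bench, bus_vector) for bench in BENCHMARKS)

def fake_simulate(benchmark, bus_vector):
    # Placeholder for a real simulator; the numbers carry no meaning.
    return 1.0 + 0.01 * sum(bus_vector) + 0.1 * BENCHMARKS.index(benchmark)

average_cpi = cpi_for_bus_vector([3, 4, 7], fake_simulate)
print(average_cpi)
```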

13
CPI Models

A CPI model estimates CPI as a function of
parameters of interest, such as interconnect
latency, architecture configuration, etc.
CPI models in the literature
Statistical simulation [Nussbaum01]
Based on a single detailed simulation
Generate a synthetic instruction trace
Take advantage of cache and branch prediction
statistics
Statistical sampling of cycle accurate simulation
Sampling instead of truncating: selectively
measuring in detail only an appropriate benchmark
subset
Configuring a systematic sampling simulation run
to achieve a desired confidence in estimates
More efficient than cycle-accurate simulation but
still slow; none of them considers interconnect
latency

14
Traditional floorplanning

Optimize floorplan via simulated annealing (SA)
algorithm
Objective function
Moves
Change the position or shape of blocks
Cooling scheme
Initial temperature
Constant cooling rate
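
A minimal sketch of such an annealing loop is shown below, assuming a geometric schedule (temperature multiplied by a constant rate) and a toy one-dimensional placement problem. The function names, the default parameters and the `NETS` list are illustrative assumptions; the slides do not give the objective function or the move set of the actual floorplanner.

```python
import math
import random

def simulated_annealing(initial, cost, neighbor, t0=100.0, rate=0.95,
                        t_min=1e-3, moves_per_temp=50, seed=0):
    """Generic SA loop with a constant (geometric) cooling rate."""
    rng = random.Random(seed)
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    t = t0
    while t > t_min:
        for _ in range(moves_per_temp):
            candidate = neighbor(current, rng)
            delta = cost(candidate) - current_cost
            # Always accept improvements; accept uphill moves with
            # probability exp(-delta / t).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current, current_cost = candidate, current_cost + delta
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        t *= rate
    return best, best_cost

# Toy problem: order four unit-width blocks in a row so that connected
# blocks sit close together. The nets are made up for this example.
NETS = [("A", "C"), ("A", "D"), ("B", "D")]

def wire_length(order):
    pos = {name: i for i, name in enumerate(order)}
    return sum(abs(pos[a] - pos[b]) for a, b in NETS)

def swap_two(order, rng):
    # Move: exchange the positions of two randomly chosen blocks.
    i, j = rng.sample(range(len(order)), 2)
    new = list(order)
    new[i], new[j] = new[j], new[i]
    return tuple(new)

start = ("A", "B", "C", "D")
best, best_cost = simulated_annealing(start, wire_length, swap_two)
print(best, best_cost)
```

A real floorplanner would replace `wire_length` with a cost that also accounts for area and `swap_two` with moves that change block positions or shapes.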

15
Floorplanning considering CPI

Based on simulated annealing
Objective function
Extend from traditional floorplanning framework
Key is to estimate CPI efficiently
Moves and cooling schedule remain the same

16
Trajectory of SA

The path that SA follows during optimization is a
trajectory in the solution space
We only need to accurately estimate CPI in the
area where the trajectory travels
The trajectory of SA with objective of area, wire
length and CPI is close to that of area and wire
length only

[Figure: SA trajectories in the Bus1-Bus2 latency plane for two objectives: area and wire length; area, wire length and CPI]

17
Trajectory Piecewise-linear CPI Model

Build a piecewise-linear model for a small
solution region around the trajectories of SA
Three phases: sampling, collecting and simulating
An example for 2-dimension bus vector

[Figure: points selected for simulation; axes: Latency (bus1), Latency (bus2)]

18
TPWL Sampling
[Figure: sampled points; axes: Latency (bus1), Latency (bus2)]

Sample a complete simulated annealing process
with the objective of area and total wire length
to obtain a set of bus latency vectors (points in
n-dimensional space)

19
TPWL Collecting
[Figure: sampled points grouped into balls; axes: Latency (bus1), Latency (bus2)]

Cover all the points obtained in the sampling
phase with as few balls as possible (TPC problem)
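
One simple way to approach such a covering problem is a greedy heuristic, sketched below. This is an assumption for illustration only: the slides do not say how the TPC problem is solved, and the fixed `radius`, the Euclidean distance and the restriction of centers to sampled points are all choices made for this example.

```python
import math

def greedy_ball_cover(points, radius):
    """Cover `points` with balls of fixed `radius` centered on sample points.

    Greedy heuristic: repeatedly pick the point whose ball covers the most
    still-uncovered points. Returns the list of chosen centers.
    """
    uncovered = set(range(len(points)))
    centers = []
    while uncovered:
        best_i, best_cov = None, set()
        for i in range(len(points)):
            cov = {j for j in uncovered
                   if math.dist(points[i], points[j]) <= radius}
            if len(cov) > len(best_cov):
                best_i, best_cov = i, cov
        centers.append(points[best_i])
        uncovered -= best_cov
    return centers

# Illustrative 2-D bus latency vectors forming two clusters.
samples = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
centers = greedy_ball_cover(samples, radius=1.0)
print(centers)
```

Greedy covering does not guarantee the minimum number of balls, but each chosen center then costs exactly one cycle-accurate simulation.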

20
TPWL Simulating
[Figure: ball centers marked for simulation; axes: Latency (bus1), Latency (bus2)]

Obtain CPI by cycle-accurate simulation for the
center of each ball
Build a CPI table indexed by these center points

21
CPI estimation under TPWL model

Based on each entry, the CPI of a target B can be
estimated by first-order expansion
For each entry, a weight is calculated based on
the distance between the target B and the entry
in CPI table
The final estimation is the weighted sum of the
estimation based on each entry
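
The estimation steps above can be sketched as follows. The sketch assumes that each table entry stores a center, its simulated CPI and a gradient of CPI with respect to each bus latency, and that the weights decay exponentially with distance relative to the nearest entry. The table layout, the `beta` parameter and all numbers are assumptions for illustration; the slides do not specify the weight function.

```python
import math

def estimate_cpi(target, table, beta=25.0):
    """Weighted sum of first-order CPI estimates.

    table: list of (center, cpi, gradient) tuples (assumed layout).
    """
    dists = [math.dist(target, center) for center, _, _ in table]
    d_min = min(dists)
    if d_min == 0.0:
        # The target coincides with a simulated point: use its CPI directly.
        return table[dists.index(d_min)][1]
    # Exponential weights relative to the nearest entry, then normalized.
    raw = [math.exp(-beta * d / d_min) for d in dists]
    total = sum(raw)
    estimate = 0.0
    for (center, cpi, grad), w in zip(table, raw):
        # First-order expansion around this entry.
        first_order = cpi + sum(
            g * (t - c) for g, t, c in zip(grad, target, center))
        estimate += (w / total) * first_order
    return estimate

# Two illustrative entries for a 2-dimensional bus latency vector.
table = [
    ((3, 4), 1.20, (0.02, 0.01)),
    ((6, 8), 1.40, (0.03, 0.02)),
]
at_center = estimate_cpi((3, 4), table)
nearby = estimate_cpi((4, 4), table)
print(at_center, nearby)
```

With a large `beta` the nearest entry dominates, so the estimate behaves like a piecewise-linear function that switches smoothly between entries.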

22
CPI-aware Floorplanning- Overview

Integrate the TPWL CPI model with a traditional
floorplanning tool

23
Iterative TPWL model

When the trajectory with objective of area and
total wire length is significantly different from
the trajectory with objective of area, total wire
length and CPI, an iterative TPWL model is needed

[Figure: SA trajectories in the Bus1-Bus2 latency plane: area and wire length; iteration 1; iteration 2; area, wire length and CPI]

24
Iterative TPWL Model

Iteratively expand the CPI table to build an
iterative TPWL (iTPWL) model
Based on the TPWL model, but from the second
iteration on, the objective of SA is area, total
wire length and CPI
Improve the accuracy of CPI estimation and the
quality of the final floorplan
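
The iteration can be sketched as a loop that reruns SA and simulates only the points that fall outside every existing ball. The callables `run_sa` and `simulate_cpi`, the fixed `radius` and the simple incremental covering are assumptions for illustration, not the authors' implementation.

```python
import math

def build_itpwl_table(run_sa, simulate_cpi, radius, iterations=3):
    """Grow a CPI table over several SA runs.

    run_sa(table) -> list of bus latency vectors visited by SA. With an
        empty table it is expected to optimize area and wire length only;
        afterwards it may use the table to add estimated CPI to the objective.
    simulate_cpi(vector) -> CPI from a cycle-accurate simulation.
    """
    table = {}  # center (tuple) -> simulated CPI
    for _ in range(iterations):
        for point in run_sa(table):
            point = tuple(point)
            # Simulate only points not already inside an existing ball.
            if all(math.dist(point, c) > radius for c in table):
                table[point] = simulate_cpi(point)
    return table

# Stand-ins so the sketch runs; the values are made up.
def fake_sa(table):
    # The first run explores one region; later runs drift to another.
    return [(3, 4), (3, 5)] if not table else [(3, 4), (8, 8)]

simulated = []
def fake_simulate(vector):
    simulated.append(vector)
    return 1.0 + 0.01 * sum(vector)

table = build_itpwl_table(fake_sa, fake_simulate, radius=1.5, iterations=2)
print(table)
```

Only two simulations run in this toy case, because (3, 5) lies inside the ball around (3, 4) and (3, 4) is already in the table when the second run visits it.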

25
Summary on TPWL CPI Model

Originally proposed for modeling non-linear
systems [Rewienski03]
Outperforms other techniques based on quadratic
reduction
TPWL model is suitable for floorplanning
optimization
The trajectory of SA with objective of area,
total wire length and CPI is close to that with
objective of area and total wire length only
When these two trajectories are not close, iTPWL
model is employed to improve the accuracy
Contribution of this paper on TPWL model
Introduce the TPC problem
Expand TPWL model to iTPWL model

26
Experiment results

Verification of CPI models
Error of TPWL model: 2.62%
Error of iTPWL model: 1.66%

27
Impact of models to final floorplans

Comparison of the floorplans obtained with the
access ratio, sensitivity rate, TPWL and iTPWL
models with the objective of area, total wire
length and CPI
Access ratio: use the access ratio of
interconnects to represent their impact on system
performance
Sensitivity rate: estimate CPI based on
first-order expansion at the original point

28
Floorplanning with iTPWL Model

Comparison between floorplans obtained by
different objectives

29
Running time

SimpleScalar simulation times to build up the
TPWL and iTPWL models

30
Conclusion and discussion

Proposed an accurate CPI model with less than 3.0%
error
The CPI-aware floorplanner reduces CPI by up to
28.57% with a small area overhead of 5.72%
Expanded the TPWL model and improved the accuracy
of estimation
The accuracy of the iTPWL model leads to
floorplanning solutions with high quality and
enables us to develop good heuristics, such as
access ratio, to minimize CPI without explicit
CPI calculation.
Plan to apply this model to architecture changes