Title: Simultaneous Short-Path and Long-Path Timing Optimization for FPGAs
1Simultaneous Short-Path and Long-Path Timing
Optimization for FPGAs
- Ryan Fung, Vaughn Betz, William Chow
- Altera Toronto Technology Centre
2Terminology and Motivation
3Long-Path Timing
- Most CAD algorithms focus solely on reducing path delays to meet operating frequency requirements
- Example long-path timing constraints include clock period, IO TSETUP, and IO TCLOCK-TO-OUTPUT requirements
Design Operating Period > Delay(Clock -> Src Reg) + Delay(Comb Logic) + TSETUP(Dst Reg) - Delay(Clock -> Dst Reg)
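The long-path constraint above can be turned into a slack calculation. The following is a minimal sketch, not part of the original slides; the function name `long_path_slack` and all numeric values are hypothetical, and delays are assumed to be in nanoseconds.

```python
def long_path_slack(period, clk_to_src, comb_delay, tsetup_dst, clk_to_dst):
    """Slack (ns) of a register-to-register long-path constraint.

    The constraint is met when the returned slack is positive.
    """
    required = clk_to_src + comb_delay + tsetup_dst - clk_to_dst
    return period - required


# Hypothetical example values, all in ns.
slack = long_path_slack(period=10.0, clk_to_src=1.5, comb_delay=6.0,
                        tsetup_dst=0.5, clk_to_dst=1.0)
print(f"long-path slack = {slack:.1f} ns")
```

Positive slack is the quantity that slack allocation later distributes among connections as extra allowed delay.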
4Long-Path Timing
- Most CAD algorithms focus solely on reducing path delays to meet operating frequency requirements
[Figure: source register, combinational logic, and destination register driven from the clock IO; timing labels t = 0 and t < -TSETUP (Dst Reg)]
5Short-Path Timing
- Satisfying only long-path timing constraints does not guarantee design functionality
- All register hold requirements must also be met
- Short-path constraints express these requirements
- Examples include THOLD for register-to-register transfers, IO THOLD, and IO minimum TCLOCK-TO-OUTPUT
[Figure: source register, combinational logic, and destination register driven from the clock IO]
Register hold requirements met iff Delay(Clock -> Src Reg) + Delay(Comb Logic) - Delay(Clock -> Dst Reg) > THOLD(Dst Reg)
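The hold condition can be checked the same way as a clock-period constraint, but with the sign of the goal reversed: here the data path must be slow enough. This is an illustrative sketch only; the function name `short_path_slack` and the numbers are hypothetical, chosen to show clock skew causing a violation.

```python
def short_path_slack(clk_to_src, comb_delay, clk_to_dst, thold_dst):
    """Slack (ns) of a register-to-register short-path (hold) constraint.

    The hold requirement is met when the returned slack is positive.
    """
    return clk_to_src + comb_delay - clk_to_dst - thold_dst


# Hypothetical example: the destination clock arrives 1 ns later than the
# source clock, and the combinational path is very fast.
slack = short_path_slack(clk_to_src=1.0, comb_delay=0.5,
                         clk_to_dst=2.0, thold_dst=0.25)
print(f"short-path slack = {slack:.2f} ns")
if slack <= 0:
    print(f"more than {-slack:.2f} ns of extra data-path delay is needed")
```

A negative result is the amount of delay that has to be inserted on the data path, which is the job the later slides hand to the router.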
6Short-Path Timing
- Satisfying only long-path timing constraints does not guarantee design functionality
- All register hold requirements must also be met
- Short-path constraints express these requirements
[Figure: source register, combinational logic, and destination register driven from the clock IO; timing labels t = 0 and t > THOLD (Dst Reg)]
7Short-Path Timing and FPGAs
- Before this work, designers manually repaired many short-path violations
- More painful as:
  - Designs get larger (more clocks than low-skew networks)
  - Process variation increases skew of clock networks
  - Clocking strategies increase in complexity
- Fixing short-path violations may introduce long-path violations -> design iterations
- Typical manual technique uses logic cell buffers -> wastes logic
- Major appeal of FPGAs is fast time-to-market and low engineering costs
- Need CAD algorithms that automatically optimize designs to meet all requirements, simultaneously
8Programmable Delay Chains in FPGAs
- CAD tools set delays to fix short-path problems
- Short-path violations may persist, may create long-path violations
- Programmable delay chains cost silicon area
- Only used to slow down signals entering, leaving FPGA
- Delay chains are not sufficient to fix all violations
- Several paths may pass through the same delay chain
- Below: IO TSETUP < 3 ns and IO THOLD < 0 cannot both be satisfied
[Figure: input IO feeding Path A (delay 2 ns) and Path B (delay 6 ns); clock delay 3 ns]
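The infeasibility in this example can be checked with a short sweep over delay chain settings. The sketch below rests on several assumptions that are not stated on the slide: both paths pass through one shared delay chain at the input IO, register setup and hold times are negligible, IO TSETUP is taken as data delay minus clock delay, and IO THOLD as clock delay minus data delay. The setting range and step are hypothetical.

```python
PATH_A_NS = 2.0   # fastest data path (sets IO THOLD)
PATH_B_NS = 6.0   # slowest data path (sets IO TSETUP)
CLOCK_NS = 3.0    # clock delay to the registers


def io_tsetup(data_delay, chain_delay, clock_delay):
    """Assumed model: setup time seen at the pin."""
    return data_delay + chain_delay - clock_delay


def io_thold(data_delay, chain_delay, clock_delay):
    """Assumed model: hold time seen at the pin."""
    return clock_delay - (data_delay + chain_delay)


def feasible_settings(settings):
    """Return the delay chain settings meeting both IO requirements."""
    ok = []
    for d in settings:
        tsu = io_tsetup(PATH_B_NS, d, CLOCK_NS)
        th = io_thold(PATH_A_NS, d, CLOCK_NS)
        if tsu < 3.0 and th < 0.0:
            ok.append(d)
    return ok


settings = [i * 0.25 for i in range(17)]  # 0 ns to 4 ns, hypothetical range
print(feasible_settings(settings))
```

Under these assumptions the hold requirement needs more than 1 ns of added delay, while the setup requirement tolerates none, so no single setting works. The two paths need different amounts of delay, which a shared delay chain cannot provide.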
9Solution
10Overview of Overall Strategy
- Attack the simultaneous short-path and long-path timing optimization problem in two phases
  - New slack allocation algorithm
    - Short/long-path constraints -> connection delay budgets
  - New FPGA routing algorithm
    - Guided by delay budgets
- Overall algorithm name: RCV (Routing Cost Valleys)
  - Name inspired by the shape of the cost vs. delay curve of the new routing algorithm
11Comments on Overall Strategy
- Effective optimization of short-path timing constraints can be achieved by extending only the routing algorithm
  - Routing delay is a relatively large fraction of total delay
  - Router can model delays relatively accurately
  - Router has many options to insert delay
    - Using spirals of routing resources
    - Selecting delay chain settings (if modeled in routing graph)
    - Selecting different LUT inputs (if modeled in routing graph)
[Figure: LUT truth tables for f = not(A)B, shown with inputs A and B assigned to the LUT input pins in either order]
12Prior Work Long-Path Slack Allocation
- Explicitly monitoring all path-level timing constraints during optimization is highly inefficient (memory/run-time)
  - Path count can be exponential in circuit size
- Long-path slack allocation produces a maximum delay budget for each connection in a design
  - Minimax-PERT algorithm [Youssef et al., ICCAD, 1990]
- Long-path constraints are met if the design is implemented so that, for every connection c, delay(c) <= DBUDGET_MAX(c)
[Figure: long-path slack allocation example. Desired period 10 ns, slack 8 ns; other values shown: 5 ns, 5 ns, 1 ns, 1 ns, 2 ns, 8 ns]
13Basic Short-/Long-Path Slack Allocation
- Slack allocation can also be used to produce minimum delay budgets from short-path slacks
- Need to determine legal min and max delay budgets
  - DBUDGET_MIN(c) > DBUDGET_MAX(c) cannot be satisfied
- Basic algorithm determines min delay budgets by allocating short-path slack to max delay budgets
[Figure: short-path slack allocation example. Desired THOLD 4 ns, slack 6 ns; other values shown: 2 ns, 2 ns, 5 ns, 5 ns, 3 ns, 1 ns]
14Basic Algorithm
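[Flow diagram]
Inputs: Initial Delays (Lower-Bound Delays), Long-Path Timing Constraints, Short-Path Timing Constraints
Step 1: Iterative Minimax-PERT Positive Slack Allocation using Long-Path Timing Constraints -> Maximum Delay Budgets
Step 2: Iterative Minimax-PERT Positive Slack Allocation using Short-Path Timing Constraints -> Minimum Delay Budgets
Outputs: Maximum Delay Budgets, Minimum Delay Budgets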
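The two-stage flow can be sketched for the simplest possible case. The code below is an illustration under strong assumptions, not the Minimax-PERT algorithm itself: it handles a single path, splits slack equally among connections in one pass, and uses a hypothetical function name and illustrative numbers. The real algorithm works iteratively on the whole timing graph.

```python
def basic_budgets(initial, max_path_delay, min_path_delay):
    """Return (d_min, d_max) budget lists for one path.

    initial: lower-bound delay of each connection on the path (ns)
    max_path_delay: largest path delay the long-path constraint allows
    min_path_delay: smallest path delay the short-path constraint allows
    """
    n = len(initial)
    # Stage 1: hand out long-path slack to obtain maximum delay budgets.
    long_slack = max_path_delay - sum(initial)
    if long_slack < 0:
        raise ValueError("long-path constraint cannot be met")
    d_max = [d + long_slack / n for d in initial]
    # Stage 2: hand out short-path slack, starting from the max budgets,
    # to obtain minimum delay budgets (never below the lower bound).
    short_slack = sum(d_max) - min_path_delay
    if short_slack < 0:
        raise ValueError("short-path constraint cannot be met")
    d_min = [max(lo, hi - short_slack / n) for lo, hi in zip(initial, d_max)]
    return d_min, d_max


d_min, d_max = basic_budgets([1.0, 1.0], max_path_delay=10.0,
                             min_path_delay=4.0)
print("min budgets:", d_min)
print("max budgets:", d_max)
```

Because the minimum budgets are derived by subtracting from the maximum budgets, the minimum can never exceed the maximum, which is the ordering property the basic algorithm relies on.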
15Comments on Basic Algorithm
- Algorithm sequence guarantees DBUDGET_MIN(c) < DBUDGET_MAX(c)
- Considers lower/upper bounds on delay for each connection
  - Restrictions imposed on connections by the FPGA routing fabric
- Problem: algorithm sequence -> final DBUDGET_MIN may be too small to satisfy short-path timing
- Solution: need to consider short-path timing before finalizing maximum delay budgets
16Illustration of Problem
[Figure: input IO, connection c, constant (negligible) delay resources, and clock; labeled delays 2.1 ns, 0.7 ns, and 1.8 ns clock delay]
IO TSETUP requirement: 3 ns
IO THOLD requirement: 0 ns
- c delay needs to increase by 1.1 ns for THOLD
- c shares 2 ns of long-path slack with 7 other connections
- Need to ensure 55% of long-path slack is allocated to c to leave room for DBUDGET_MIN(c) to satisfy short-path timing
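The numbers on this slide can be reproduced with a few lines of arithmetic. The sketch assumes a topology that the extracted text does not state explicitly: the short path consists of connection c alone (0.7 ns), the long path consists of the 2.1 ns segment plus c, IO TSETUP is data delay minus clock delay, and IO THOLD is clock delay minus data delay. The function name is hypothetical.

```python
CLOCK_DELAY = 1.8   # ns
TSETUP_REQ = 3.0    # ns, IO TSETUP requirement
THOLD_REQ = 0.0     # ns, IO THOLD requirement
DELAY_C = 0.7       # ns, current delay of connection c
DELAY_REST = 2.1    # ns, rest of the long path (assumed topology)


def slack_share_needed():
    """Return (extra delay c needs, long-path slack, fraction needed)."""
    # Hold: clock delay - data delay <= THOLD_REQ
    extra_for_hold = (CLOCK_DELAY - THOLD_REQ) - DELAY_C
    # Setup: data delay - clock delay <= TSETUP_REQ
    long_path_slack = (TSETUP_REQ + CLOCK_DELAY) - (DELAY_REST + DELAY_C)
    return extra_for_hold, long_path_slack, extra_for_hold / long_path_slack


extra, slack, fraction = slack_share_needed()
print(f"extra delay for c: {extra:.1f} ns")
print(f"long-path slack:   {slack:.1f} ns")
print(f"share c must get:  {fraction:.0%}")
```

An allocation that splits the 2 ns evenly among c and the 7 other connections would give c far less than this share, which is the failure mode the slide describes.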
17Solution
- Use a pre-processing step to find an initial (lower-bound) set of delays that satisfy short-path timing constraints
- Pre-processing step iterates between short-path and long-path negative slack allocation
  - Short-path negative slack allocation adds delay to connections to fix short-path timing violations
  - Long-path negative slack allocation removes delay from connections to avoid long-path timing violations
18Enhancement to Basic Algorithm
/* Adjust initial delays to provide short-path critical connections more delay. */
DTEMP(c) = DINITIAL(c)
iterate until stopping condition met {
    perform short-path timing analysis using DTEMP(c)
    allocate negative short-path slack using Minimax-PERT and update DTEMP(c)
    perform long-path timing analysis using DTEMP(c)
    allocate negative long-path slack using Minimax-PERT and update DTEMP(c)
}
DINITIAL(c) = DTEMP(c)
/* Continue with basic algorithm. */
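The pre-processing loop can be sketched on a toy model. The code below is not the authors' implementation: it replaces Minimax-PERT with a simple equal-share rule, represents the design as an explicit list of paths (which the slides warn does not scale), and uses hypothetical names throughout. The example data assumes a short path through connection c alone and a long path through a and c.

```python
EPS = 1e-9  # tolerance for floating-point comparisons


def preprocess_initial_delays(initial, lower_bound, paths, max_iters=50):
    """Return adjusted per-connection delays (dict: connection -> ns).

    paths: list of (connections, min_delay, max_delay) tuples.
    """
    d = dict(initial)  # DTEMP(c) = DINITIAL(c)
    for _ in range(max_iters):
        changed = False
        # Short-path step: add delay equally along paths that are too fast.
        for conns, min_delay, _max_delay in paths:
            deficit = min_delay - sum(d[c] for c in conns)
            if deficit > EPS:
                for c in conns:
                    d[c] += deficit / len(conns)
                changed = True
        # Long-path step: remove delay from paths that are too slow,
        # never going below a connection's lower-bound delay.
        for conns, _min_delay, max_delay in paths:
            excess = sum(d[c] for c in conns) - max_delay
            room = {c: d[c] - lower_bound[c] for c in conns}
            total_room = sum(room.values())
            if excess > EPS and total_room > EPS:
                take = min(excess, total_room)
                for c in conns:
                    d[c] -= take * room[c] / total_room
                changed = True
        if not changed:  # stopping condition: nothing left to fix
            break
    return d


initial = {"a": 2.1, "c": 0.7}
paths = [
    (("c",), 1.8, float("inf")),  # short path: needs at least 1.8 ns
    (("a", "c"), 0.0, 4.8),       # long path: at most 4.8 ns
]
adjusted = preprocess_initial_delays(initial, initial, paths)
print(adjusted)
```

The iteration cap is a design choice for the sketch: when short-path and long-path requirements conflict, the two steps can keep undoing each other, so some stopping condition beyond "no change" is needed.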
19Routing Algorithm Overview
- Builds upon negotiated congestion routing
  - [Ebeling et al., IEEE Trans. on VLSI, Dec. 1995]
- Negotiated congestion framework is excellent for FPGA routing where wiring is quite limited
- This work modifies the delay cost and look-ahead function to achieve desirable routing delays
  - Consider min/max connection delay budgets
20Routing Algorithm Background
- Timing-driven negotiated congestion routers begin by picking a set of resources to implement each connection
  - Routing resources are initially selected to implement each connection for minimum delay
  - Electrical shorts (congestion) are ignored
- Congestion is resolved over several re-routing iterations: connections are encouraged to take detours
- Router inner loop does a directed routing-graph search using a cost to score the use of resources
  - Congestion component: gradually reduce congestion
  - Delay component: keep critical connection delays to a minimum
21Delay Portion of Old Routing Cost
- Linear long-path cost
- Critical connections have steeper slope
[Figure: delay portion of routing cost vs. total estimated routing path delay]
22Incorporating Delay Budgets in Routing
- Minimum cost point is the target delay
  - Between minimum and maximum delay budgets
[Figure: delay portion of routing cost vs. total estimated routing path delay, marking the min delay budget, target delay, and max delay budget]
23Incorporating Delay Budgets in Routing
- Short-path linear region has slope -CRIT_SHORT-PATH
- CRIT_SHORT-PATH = (1.0 - DLOWER_BOUND / DTARGET)^0.5
[Figure: delay portion of routing cost vs. total estimated routing path delay, showing the short-path linear region and the long-path linear region, marking DLOWER_BOUND and DTARGET]
24Incorporating Delay Budgets in Routing
- Quadratic costs ensure delay budgets are respected unless significant congestion is encountered
- Congestion will be resolved, sacrificing timing as little as possible
[Figure: delay portion of routing cost vs. total estimated routing path delay, showing the short-path quadratic region and the long-path quadratic region, marking the min delay budget, target delay, and max delay budget]
25Incorporating Delay Budgets in Routing
- Shape inspired overall algorithm name
  - Routing Cost Valleys (RCV)
[Figure: delay portion of routing cost vs. total estimated routing path delay, showing the short-path quadratic region and the long-path quadratic region, marking the min delay budget, target delay, and max delay budget]
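A valley-shaped delay cost of the kind described on these slides can be sketched as a piecewise function. The exact expressions are not given in the slides, so everything here is an assumption made for illustration: the function name, the `quad_weight` parameter, and the way the quadratic penalty is added on top of the linear term. Only the overall shape (minimum at the target delay, linear inside the budgets, quadratic beyond them) follows the slides.

```python
def delay_cost(delay, d_min, d_target, d_max, crit_short, crit_long,
               quad_weight=10.0):
    """Delay portion of the routing cost for one connection (assumed form).

    d_min <= d_target <= d_max are the min budget, target, and max budget.
    crit_short and crit_long are the magnitudes of the linear slopes.
    """
    if delay < d_target:
        # Short-path side: slope -crit_short toward the target.
        cost = crit_short * (d_target - delay)
        if delay < d_min:
            cost += quad_weight * (d_min - delay) ** 2
    else:
        # Long-path side: slope +crit_long away from the target.
        cost = crit_long * (delay - d_target)
        if delay > d_max:
            cost += quad_weight * (delay - d_max) ** 2
    return cost


# Hypothetical budgets (ns) and criticalities.
for d in (1.0, 2.0, 3.0, 5.0, 6.0):
    c = delay_cost(d, d_min=2.0, d_target=3.0, d_max=5.0,
                   crit_short=0.5, crit_long=1.0)
    print(f"delay {d:.1f} ns -> cost {c:.2f}")
```

Because the penalty outside the budgets is large but finite, a congestion cost can still outweigh it, which matches the slides' point that congestion is resolved while sacrificing timing as little as possible.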
26Results
27Example Route with Inserted Delay
35Experimental Platform
- 100 representative FPGA designs from Altera customers
  - User timing/placement/routing constraints removed
- Targeting Stratix devices
  - 3,264 to 67,311 logic elements (median of 12,072 elements)
- Version 4.0 of Altera's Quartus II Software
- No routing failures observed during experiments
  - Benefit of using costs to enforce delay budgets
  - Congestion penalization pushes router to sacrifice timing to achieve a legal solution
36Long-Path Results
- Quartus II Software was instructed to optimize
clock frequency, FMAX
37Long-Path Results
- Results
  - 3.9% average FMAX improvement
  - Within 8% of an upper bound on FMAX (tolerating electrical shorts)
  - 35.6% router time increase
  - 9.3% placement-and-routing time increase
    - Includes delay budget computation time
- Quadratic region (beyond delay budgets) of routing cost beneficial
  - Heavily penalize adding delay that will likely cause a timing violation
- Linear region of routing cost beneficial
  - Delay budget violations inevitable when optimizing for maximum speed
  - Linear costs help cover delay budget violations by achieving margin on other connections
38Internal THOLD Results
- Quartus II Software was instructed to
  - Maximize clock frequency, FMAX
  - Fix THOLD violations related to internal register transfers
- 18 of 100 circuits had internal THOLD problems
- Results
  - Designs with THOLD violations reduced from 18 to 5
  - 3.4% average FMAX improvement
  - 20.9% placement-and-routing time increase
39Internal THOLD Results (Continued)
40PCI Core Results
- 66-MHz PCI cores represent a challenging combined short/long-path timing optimization problem
  - IO timing requirements (IO TSETUP and THOLD) are difficult to satisfy in FPGAs
- 72 master-target 66-MHz PCI cores compiled into a range of Altera Stratix devices and packages
  - Two fastest speed grades
41PCI Core Results (Before This Work)
42PCI Core Results (After This Work)
43Conclusions
- RCV simultaneously optimizes short/long-path timing
  - Relatively small (14-21%) increase in placement-and-routing time
  - Greatly reduces short-path timing violations
- RCV improves long-path timing optimization
  - 4% higher circuit speed for 9% increase in place-and-route time
  - Delay budgets guide the router
    - Alert it when a non-critical connection may become critical
- RCV reduces the need for delay chains in FPGAs
  - Save area without sacrificing timing