Title: Global Critical Path: A Tool for System-Level Timing Analysis
1Global Critical Path: A Tool for System-Level Timing Analysis
- Girish Venkataramani
- Tiberiu Chelcea
- Seth C. Goldstein
- Mihai Budiu
2Timing Analysis
Gate-Level
- Reg-to-Reg critical path
- Determines frequency
- Fault analysis, verification
System-Level
- Global Critical Path (GCP)
- Determines iteration bound of system execution
- System bottleneck analysis
- Power, performance optimization
3GCP-driven Optimizations
Figure: RTL graph for JPEG encoder kernel (50 K gates)
- Global Critical Path (GCP)
- Longest Sequential Path
- Optimize GCP for performance
- Optimize for power/area
4GCP-driven Optimizations
Figure: RTL graph for JPEG encoder kernel (50 K gates)
- Global Critical Path (GCP)
- Longest Sequential Path
- Optimize GCP for performance
- Optimize for power/area
- Update GCP for re-use
5System Model
- How do we represent the system architecture?
- Unified model
- Captures resource usage
- RTL captures system architecture
- Too large: potentially millions of signals
Figure labels: RTL Model, Arbitrary DAG, pipeline stage
6Modeling Execution State
- System state is defined by the values of state-holding elements
- Key insight: the collective state of the signals controlling registers determines execution progress
Model the transaction-level interaction between circuit signals
Figure labels: Control-path, Data-path, Reg, enable
7Modeling Execution State
- Discrete event model captures interaction
- Events: control-path signal transitions
- Behaviors: partial execution ordering of events
Figure: timing diagram with transition events on signals s1, s2, s3 and a Behavior node
Timing diagram defines event ordering
An event-behavior graph is constructed by
modeling the control-path for every state-holding
element in the system
8Profiling-based Timing Analysis
Incorporated into CASH compiler flow (ASPLOS 04, IWLS 04)
Figure labels (tool flow): Spec, Front-End, IR, Optimization Loop, Construct model, Back-End, Gate-level Circuit, Verilog Simulation, Profiling, Trace Processing, GCP, Slack
9Trace Processing
- Profiling records event firing times, slack
- Trace processing computes GCP
Figure: timing diagram for behavior b with events on s1, s2, and s3 (time axis marks 5, 8, 9)
Slack(s3, b) = 3, Slack(s1, b) = 0
A 0-slack input is locally critical (Fields, ISCA 01)
GCP is the longest path of 0-slack events
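The two rules above (slack per input, then a backward walk over 0-slack inputs) can be sketched in a few lines of Python. This is a minimal illustration, not the CASH implementation: the `Event` record, the function names, and the extra start event `s0` are hypothetical, and the assignment of times 5, 8, 9 to s3, s1, s2 is an assumption chosen so that the slacks match the values on the slide.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    name: str
    time: float
    # Names of the events that the behavior producing this event waited on.
    inputs: list = field(default_factory=list)

def input_slacks(event, events):
    """Slack of each input = arrival time of the last input minus its own time."""
    if not event.inputs:
        return {}
    last = max(events[i].time for i in event.inputs)
    return {i: last - events[i].time for i in event.inputs}

def global_critical_path(events, last_name):
    """Walk backwards from the final event, always following a 0-slack input."""
    path = [last_name]
    current = events[last_name]
    while current.inputs:
        slacks = input_slacks(current, events)
        critical = min(slacks, key=slacks.get)  # the last-arriving input has slack 0
        path.append(critical)
        current = events[critical]
    return list(reversed(path))

# Hypothetical trace: s0 is an assumed start event; behavior b waits on s1 and s3.
events = {
    "s0": Event("s0", 0),
    "s3": Event("s3", 5, ["s0"]),
    "s1": Event("s1", 8, ["s0"]),
    "s2": Event("s2", 9, ["s1", "s3"]),
}

slacks_at_b = input_slacks(events["s2"], events)
gcp = global_critical_path(events, "s2")
print(slacks_at_b, gcp)
```

Under these assumptions s1 is the locally critical input of b, so the walk goes s2, s1, s0 and skips s3, which has 3 units of slack.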
10Event Type Matters
Two types of events:
- Data events: from producers
- Back-pressure events: from consumers
Figure labels: Control-path, Data-path, Reg, enable, ready, stall
11Example Asynchronous Pipelines
Figure labels: Events, Behaviors, RTL, Handshake Protocol
Data events: req↑
Back-pressure events: ack↑, req↓, ack↓
12Event Sequences Matter
Data events: req↑
Back-pressure events: ack↑, req↓, ack↓
- Property of GCP for 4-phase H/S
- Data event sequence, pathdata = <req↑>
- B-P event sequence, pathsync = <ack↑, req↓, ack↓>
- GCP topology is always <pathdata pathsync>
13What does this mean?
PATHdata = <req↑>: good, computation-bound
PATHsync = <ack↑, req↓, ack↓>: maybe bad, resource-bound
Optimization goal: eliminate PATHsync from the GCP (e.g., slack matching, ICCAD 06)
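One way to act on this distinction is to measure how much of a computed GCP consists of back-pressure events. The sketch below assumes a 4-phase handshake in which the rising request is the data event and the other three phases are back-pressure events. The labels `req+`, `ack+`, `req-`, `ack-`, the `stage:` prefix, and the sample path are hypothetical, and counting events rather than weighting them by time is a simplification.

```python
from collections import Counter

# Assumed labels: "req+" is the rising request (data event); the remaining three
# phases of the 4-phase handshake are treated as back-pressure events.
DATA_EVENTS = {"req+"}
BACK_PRESSURE_EVENTS = {"ack+", "req-", "ack-"}

def classify(event):
    """Drop a hypothetical 'stage:' prefix and classify the handshake edge."""
    edge = event.split(":")[-1]
    if edge in DATA_EVENTS:
        return "data"
    if edge in BACK_PRESSURE_EVENTS:
        return "sync"
    raise ValueError(f"unknown event label: {event}")

def summarize_gcp(gcp):
    """Return the fraction of GCP events that are data vs. back-pressure events."""
    counts = Counter(classify(e) for e in gcp)
    total = len(gcp)
    if total == 0:
        return {"data": 0.0, "sync": 0.0}
    return {kind: counts[kind] / total for kind in ("data", "sync")}

# Illustrative path only: stage b stalls on its consumer before stage c proceeds.
gcp = ["a:req+", "b:req+", "b:ack+", "b:req-", "b:ack-", "c:req+"]
summary = summarize_gcp(gcp)
print(summary)
```

A large "sync" share suggests the path is resource-bound at some stage, which is the case the optimization goal above targets.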
14Conclusions
- GCP is the longest sequential path of system execution
- Excellent bottleneck analysis tool
- Analyzes system performance across hierarchy
- GCP can be automatically computed and used for driving system-level power and performance optimizations
- Slack Matching (performance)
- Operation Chaining (power, area)
- Hybrid Handshake Protocol (power-performance)
15Thank you!
- Phoenix project
- http://www.cs.cmu.edu/phoenix
16Backup Slides
17Complexity of Profiling-based Analysis
18Slack Update
- System model is a network of behaviors
- New slack is propagated through the graph
- Propagation stops if either
- Minimum of in-slack, Min(si), is unchanged, or
- Update completes a whole cycle
- Finally, update GCP: follow locally critical inputs
E.g., let B1 fire earlier by dB1 time units
Figure labels: B1, B2, s1, s2, s3, s4, dB1, Min(s1,s2)
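The propagation rule can be sketched as a worklist over the behavior network. This is one possible reading of the slide, not the authors' algorithm. It assumes that a behavior fires when its last input arrives, that an input arriving earlier gains that much slack, and that the behavior then moves earlier by its new minimum in-slack. The names `succs`, `in_slack`, and `propagate_speedup`, and the behaviors X and Y, are hypothetical.

```python
from collections import deque

def propagate_speedup(succs, in_slack, start, delta):
    """Propagate "start fires delta time units earlier" through the network.

    succs:    dict mapping a behavior to the list of behaviors it feeds
    in_slack: dict mapping (pred, succ) to the slack of that input at succ;
              updated in place
    Returns a dict: behavior -> total amount by which it now fires earlier.
    """
    gain = {start: delta}
    work = deque([(start, delta, frozenset([start]))])
    while work:
        b, inc, path = work.popleft()
        for s in succs.get(b, []):
            if s in path:
                continue  # the update has gone around a whole cycle: stop here
            in_slack[(b, s)] += inc  # this input now arrives inc earlier
            inputs = [edge for edge in in_slack if edge[1] == s]
            m = min(in_slack[edge] for edge in inputs)
            if m <= 0:
                continue  # minimum in-slack unchanged: s cannot fire earlier
            for edge in inputs:
                in_slack[edge] -= m  # s fires m earlier, so each in-slack shrinks by m
            gain[s] = gain.get(s, 0) + m
            work.append((s, m, path | {s}))
    return gain

# Hypothetical network: B1 -> B2 -> B3, with side inputs X -> B2 and Y -> B3.
succs = {"B1": ["B2"], "B2": ["B3"]}
in_slack = {("B1", "B2"): 0, ("X", "B2"): 2, ("B2", "B3"): 0, ("Y", "B3"): 0}
gain = propagate_speedup(succs, in_slack, "B1", 3)
print(gain, in_slack)
```

In this example B2 can only move 2 units earlier because its other input had 2 units of slack, and B3 cannot move at all because its input from Y is still locally critical, so propagation stops there. The GCP would then be recomputed by following the 0-slack inputs.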
19Slack Update
D(Pi) ? Pi Si
Figure labels: S1, S2, bm, b1, b2, ? p1, ? p2
20Slack Update
Assume L(b1) changes by d
D(Pi) ? Pi Si
Figure labels: S1, S2, bm, b1, b2, ? p1, ? p2
Diff = abs(? p1 - ? p2)
- If ? p1 > ? p2
- New S2 = (Old S2) d
- S1 unchanged (= 0)
- If ? p2 > ? p1
- New S2 = Min(d, Diff)
- New S1 = Max(0, (Diff d))
21Summarizing the GCP
Figure: timeline of behaviors b1 to b4 (time axis marks 5, 10, 15); edges are annotated with <GCP Frequency, Average Slack>, with values <1.0, 0>, <0.5, 2>, and <0.5, 3>
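A summary of this shape could be produced by aggregating per-firing observations for each edge. The sketch below assumes that GCP frequency means the fraction of an edge's firings that lie on the GCP and that average slack is the mean slack over those firings. That interpretation, the `summarize_edges` name, and the sample observations are assumptions for illustration, not data from the slides.

```python
from collections import defaultdict

def summarize_edges(observations):
    """observations: iterable of (edge, slack, on_gcp) tuples, one per firing.

    Returns {edge: (gcp_frequency, average_slack)}.
    """
    stats = defaultdict(lambda: [0, 0, 0.0])  # firings, firings on GCP, total slack
    for edge, slack, on_gcp in observations:
        entry = stats[edge]
        entry[0] += 1
        entry[1] += 1 if on_gcp else 0
        entry[2] += slack
    return {edge: (on / n, total / n) for edge, (n, on, total) in stats.items()}

# Made-up observations: two firings per edge.
observations = [
    (("b1", "b4"), 0, True),
    (("b1", "b4"), 0, True),
    (("b2", "b3"), 0, True),
    (("b2", "b3"), 4, False),
]
summary = summarize_edges(observations)
print(summary)
```

An edge that is always critical comes out as (1.0, 0.0), while an edge that is critical in half of its firings and has 4 units of slack otherwise comes out as (0.5, 2.0).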
22Modeling the Processor Blocks
Figure: processor block diagram. Stages: Fetch, Dispatch, Issue, Exec, Memory, Commit. Structures: Instr. Fetch Q (IFQ), Res. Stations (RS), Functional Units, Re-Order Buffer (ROB). Labels: ALU Instruction, Mem Instruction
- Model is a single-issue, out-of-order processor.
- It is simulated by MASE, a cycle-accurate processor simulator based on SimpleScalar
23Modeling the Signals
Figure: processor block diagram. Stages: Fetch, Dispatch, Issue, Exec, Memory, Commit. Structures: Instr. Fetch Q (IFQ), Res. Stations (RS), Functional Units, Re-Order Buffer (ROB)
Let's examine the issue block logic
24Modeling the Signals
Figure: issue block and its neighbors. Stages: Dispatch, Issue, Exec. Structures: Res. Stations (RS), Functional Units, Re-Order Buffer (ROB)
25Modeling the Signals
Figure: signals of the issue block. Blocks: Dispatch, Res. Stations (RS), ISSUE (Dependency Check, Resource Check), Functional Units, Exec, Re-Order Buffer (ROB). Signals: instr, remove, issue, resource available, forwarded data
26Finer-grained Model
Figure: finer-grained model of the single-issue, out-of-order processor, annotated with the direction of instruction flow
Image rendered by dotty, a graph-drawing tool from Graphviz
27Quick-Sort
Issue width = 1, ROB size = 2, IFQ size = 2, LSQ size = 2, RS size = 2
Exec. time: 652 k cycles
28ROB Critical Path
Figure: processor block diagram. Stages: Fetch, Dispatch, Issue, Exec, Memory, Commit. Structures: Instr. Fetch Q (IFQ), Res. Stations (RS), Functional Units, Re-Order Buffer (ROB)
- ROB size is also referred to as the Instruction Window
- It is an upper bound on Instruction-Level Parallelism (ILP)
29Quick-Sort
Issue width = 1, ROB size = 2, IFQ size = 2, LSQ size = 2, RS size = 2
Exec. time: 652 k cycles
30Larger ROB
Need wider issue width
Issue width = 1, ROB size = 2 → 4, IFQ size = 2, LSQ size = 2, RS size = 2
Exec. time: 652 k → 542 k cycles
31Wider Issue
LSQ is critical
Issue width = 1 → 2, ROB size = 2 → 4, IFQ size = 2, LSQ size = 2, RS size = 2
Exec. time: 652 k → 542 k → 459 k cycles
32Larger LSQ
Data forwarding
Issue width = 1 → 2, ROB size = 2 → 4, IFQ size = 2, LSQ size = 2 → 4, RS size = 2
Exec. time: 652 k → 542 k → 459 k → 418 k cycles
1.5x faster
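The overall speedup follows directly from the execution times reported on the Quick-Sort slides. The short script below recomputes the per-step and cumulative speedups. The step labels and the `speedups` helper are illustrative; only the cycle counts come from the slides.

```python
# Execution times in thousands of cycles, in the order reported on the slides.
steps = [
    ("baseline", 652),
    ("ROB size 2 -> 4", 542),
    ("issue width 1 -> 2", 459),
    ("LSQ size 2 -> 4", 418),
]

def speedups(steps):
    """Return (label, speedup over previous step, speedup over baseline) rows."""
    base = steps[0][1]
    rows = []
    for (_, prev), (label, cur) in zip(steps, steps[1:]):
        rows.append((label, prev / cur, base / cur))
    return rows

rows = speedups(steps)
for label, step, total in rows:
    print(f"{label}: {step:.2f}x over previous, {total:.2f}x over baseline")
```

The final row gives 652 / 418, or about 1.56x over the baseline, which the slide reports as 1.5x.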
33Final Critical Path
Figure: processor block diagram. Stages: Fetch, Dispatch, Issue, Exec, Memory, Commit. Structures: Instr. Fetch Q (IFQ), Res. Stations (RS), Functional Units, Re-Order Buffer (ROB)
- Data forwarding path is most critical
- Therefore, true data dependencies restrict parallelism