Global Critical Path: A Tool for System-Level Timing Analysis


1
Global Critical Path: A Tool for System-Level Timing Analysis

Girish Venkataramani
Tiberiu Chelcea
Seth C. Goldstein
Mihai Budiu

2
Timing Analysis
Gate-Level:
Reg-to-Reg critical path
Determines frequency
Fault analysis, verification

System-Level:
Global Critical Path (GCP)
Determines iteration bound of system execution
System bottleneck analysis
Power, performance optimization

3
GCP-driven Optimizations
RTL graph for JPEG encoder kernel

Global Critical Path (GCP)
Longest Sequential Path
Optimize GCP for performance
Optimize for power/area

50 K gates
4
GCP-driven Optimizations
RTL graph for JPEG encoder kernel

Global Critical Path (GCP)
Longest Sequential Path
Optimize GCP for performance
Optimize for power/area
Update GCP for re-use

50 K gates
5
System Model

How do we represent the system architecture?
Unified model
Captures resource usage
RTL captures system architecture
Too large: potentially millions of signals

[Figure: RTL model; labels: arbitrary DAG, pipeline stage]
6
Modeling Execution State

System state is defined by the values of state-holding elements
Key insight: the collective state of the signals controlling registers determines execution progress
Model the transaction-level interaction between circuit signals

[Figure: control path and data path, with the control path driving the enable of a register (Reg)]
7
Modeling Execution State

Discrete event model captures interaction
Events: control-path signal transitions
Behaviors: partial execution ordering of events

[Figure: timing diagram of signals s1, s2, s3 and the corresponding behavior]
Timing diagram defines event ordering
An event-behavior graph is constructed by
modeling the control-path for every state-holding
element in the system
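One way to picture such an event-behavior graph in code is sketched below. It is only an illustrative data structure, not the representation used by the authors: the class names `Event`, `Behavior`, and `EventBehaviorGraph`, the `rising` flag, the firing rule in the comment, and the example signals are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """A transition on a control-path signal (hypothetical encoding)."""
    signal: str
    rising: bool


@dataclass
class Behavior:
    """Orders events: assumed here to produce its outputs after all inputs occur."""
    name: str
    inputs: tuple
    outputs: tuple


@dataclass
class EventBehaviorGraph:
    behaviors: dict = field(default_factory=dict)

    def add(self, behavior):
        self.behaviors[behavior.name] = behavior

    def producers(self, event):
        """Names of behaviors that generate the given event."""
        return [b.name for b in self.behaviors.values() if event in b.outputs]

    def consumers(self, event):
        """Names of behaviors that wait on the given event."""
        return [b.name for b in self.behaviors.values() if event in b.inputs]


# Hypothetical example: behavior "b" waits on s1 and s3, then produces s2.
s1_up, s2_up, s3_up = Event("s1", True), Event("s2", True), Event("s3", True)
graph = EventBehaviorGraph()
graph.add(Behavior("b", inputs=(s1_up, s3_up), outputs=(s2_up,)))
graph.add(Behavior("c", inputs=(s2_up,), outputs=()))
```

Keeping events immutable (`frozen=True`) lets them be compared and shared between the producer and consumer sides of the graph.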
8
Profiling-based Timing Analysis
Incorporated into CASH compiler flow [ASPLOS 04, IWLS 04]

[Figure: flow diagram with blocks Spec, Front-End, IR, Back-End, Gate-level Circuit, Verilog Simulation, Profiling, Trace Processing, and Construct model, plus GCP, Slack and an Optimization Loop]
9
Trace Processing

Profiling records event firing times, slack
Trace processing computes GCP

[Figure: timing diagram for behavior b showing events on s1, s2, and s3 firing, with time marks 5, 8, and 9]
Slack(s3, b) = 3, Slack(s1, b) = 0
0-slack input is locally critical [Fields, ISCA 01]
GCP is longest path of 0-slack events
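The two steps on this slide (per-input slack, then following 0-slack inputs) can be sketched as follows. This is a minimal illustration rather than the tool's actual trace processor: the `Firing` record, the assumption that a behavior fires when its last input arrives, and the arrival times (8 for s1, 5 for s3, chosen to reproduce the slack values 0 and 3 shown on the slide) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Firing:
    """One dynamic firing of a behavior (hypothetical trace record)."""
    name: str
    inputs: dict  # producer firing name -> arrival time at this behavior


def input_slacks(firing):
    """Slack of each input: how long it waited for the last-arriving input."""
    last = max(firing.inputs.values())
    return {src: last - t for src, t in firing.inputs.items()}


def global_critical_path(trace, last_firing):
    """Walk backwards from the final firing, always taking a 0-slack input."""
    path = [last_firing]
    cur = last_firing
    while trace[cur].inputs:
        slacks = input_slacks(trace[cur])
        cur = min(slacks, key=slacks.get)  # the locally critical input
        path.append(cur)
    path.reverse()
    return path


# Assumed arrival times, matching Slack(s3, b) = 3 and Slack(s1, b) = 0.
trace = {
    "s1": Firing("s1", {}),
    "s3": Firing("s3", {}),
    "b": Firing("b", {"s1": 8, "s3": 5}),
}
slacks = input_slacks(trace["b"])
gcp = global_critical_path(trace, "b")
```

Here `s1` is the locally critical input of `b`, so the backward walk yields the path `s1`, `b`.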
10
Event Type Matters
Two types of events:

Data events
From producers

Back-pressure events
From consumers

[Figure: control path and data path, with ready and stall signals and the enable of a register (Reg)]
11
Example: Asynchronous Pipelines
[Figure panels: Events and Behaviors; RTL; Handshake Protocol]
Data events: req↑
Back-pressure events: ack↑, req↓, ack↓
12
Event Sequences Matter
Data events: req↑
Back-pressure events: ack↑, req↓, ack↓

Property of GCP for 4-phase handshake (H/S)
Data event sequence: path_data = <req↑>
Back-pressure (B-P) event sequence: path_sync = <ack↑ → req↓ → ack↓>
GCP topology is always <path_data path_sync>

13
What does this mean?
path_data = <req↑>: good, computation-bound
path_sync = <ack↑ → req↓ → ack↓>: may be bad, resource-bound
Optimization goal: eliminate path_sync from the GCP (e.g., slack matching [ICCAD 06])
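To make the data versus synchronization distinction concrete, the sketch below measures how much of a GCP consists of back-pressure events. It assumes the usual 4-phase ordering, written here as `req+`, `ack+`, `req-`, `ack-`, with `req+` as the only data event; the stage names and the example path are invented for illustration.

```python
from itertools import groupby

# Assumed 4-phase classification: req+ carries data, the rest is back-pressure.
DATA_EVENTS = {"req+"}
SYNC_EVENTS = {"ack+", "req-", "ack-"}


def kind(event):
    """Classify a (stage, transition) pair as 'data' or 'sync'."""
    _stage, transition = event
    if transition in DATA_EVENTS:
        return "data"
    if transition in SYNC_EVENTS:
        return "sync"
    raise ValueError(f"unknown transition: {transition}")


def segments(gcp):
    """Collapse the path into runs of consecutive data or sync events."""
    return [(k, len(list(run))) for k, run in groupby(gcp, key=kind)]


def sync_fraction(gcp):
    """Share of GCP events that are back-pressure events."""
    if not gcp:
        return 0.0
    return sum(1 for e in gcp if kind(e) == "sync") / len(gcp)


# Hypothetical GCP through three pipeline stages.
gcp = [
    ("stage1", "req+"), ("stage2", "req+"),
    ("stage2", "ack+"), ("stage2", "req-"), ("stage2", "ack-"),
    ("stage3", "req+"),
]
runs = segments(gcp)
fraction = sync_fraction(gcp)
```

A high sync fraction would flag a resource-bound path, the case an optimization such as slack matching tries to remove.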
14
Conclusions

GCP is the longest sequential path of system
execution
Excellent bottleneck analysis tool
Analyzes system performance across hierarchy
GCP can be automatically computed and used for
driving system-level power and performance
optimizations
Slack Matching (performance)
Operation Chaining (power, area)
Hybrid Handshake Protocol (power-performance)

15
Thank you!

Phoenix project
http://www.cs.cmu.edu/phoenix

16
Backup Slides
17
Complexity of Profiling-based Analysis
7
18
Slack Update

System model is a network of behaviors
New slack is propagated through the graph
Propagation stops if either:
The minimum in-slack, Min(si), is unchanged, or
The update completes a whole cycle
Finally, update the GCP: follow locally critical inputs

E.g., let B1 fire earlier by dB1 time units
[Figure: behaviors B1 and B2; labels s1, s2, dB1, Min(s1,s2), s3, s4]
19
Slack Update
D(Pi) ? Pi Si
[Figure: behaviors b1, b2, and bm, paths p1 and p2, slacks S1 and S2]
20
Slack Update
Assume L(b1) changes by d
D(Pi) ? Pi Si
[Figure: behaviors b1, b2, and bm, paths p1 and p2, slacks S1 and S2]
Diff = abs(? p1 - ? p2)

If ? p1 > ? p2:
New S2 = (Old S2) d
S1 unchanged (= 0)

If ? p2 > ? p1:
New S2 = Min(d, Diff)
New S1 = Max(0, (Diff d))

21
Summarizing the GCP
[Figure: timeline (time marks 5, 10, 15) of behaviors b1 to b4, with edges annotated as <GCP frequency, average slack>: <1.0, 0>, <0.5, 2>, <0.5, 3>]
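A summary of this kind could be computed roughly as below. The slide does not define the two numbers, so the sketch assumes that GCP frequency is the fraction of an edge's dynamic occurrences that lie on the GCP and that average slack is the mean slack over those occurrences. The observation tuples and their values are invented.

```python
from collections import defaultdict


def summarize(observations):
    """Map each static edge to (GCP frequency, average slack).

    observations: iterable of (edge, slack, on_gcp), one per dynamic occurrence.
    """
    acc = defaultdict(lambda: [0, 0, 0.0])  # occurrences, GCP hits, slack sum
    for edge, slack, on_gcp in observations:
        entry = acc[edge]
        entry[0] += 1
        entry[1] += int(on_gcp)
        entry[2] += slack
    return {edge: (hits / n, total / n) for edge, (n, hits, total) in acc.items()}


# Hypothetical trace: edge b1->b3 is always critical, b2->b3 only half the time.
observations = [
    (("b1", "b3"), 0, True),
    (("b1", "b3"), 0, True),
    (("b2", "b3"), 0, True),
    (("b2", "b3"), 4, False),
]
summary = summarize(observations)
```

Under these assumptions the edge `b1 -> b3` is summarized as (1.0, 0.0) and `b2 -> b3` as (0.5, 2.0), the same shape as the annotations in the figure.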
22
Modeling the Processor Blocks
[Figure: processor block diagram. Pipeline stages: Fetch, Dispatch, Issue, Exec, Memory, Commit. Structures: Instr. Fetch Q (IFQ), Res. Stations (RS), Functional Units, Re-Order Buffer (ROB). Legend: ALU instruction, Mem instruction]

Model is a single-issue, out-of-order processor.
It is simulated by MASE, a cycle-accurate processor simulator based on SimpleScalar.

23
Modeling the Signals
[Figure: processor block diagram. Pipeline stages: Fetch, Dispatch, Issue, Exec, Memory, Commit. Structures: Instr. Fetch Q (IFQ), Res. Stations (RS), Functional Units, Re-Order Buffer (ROB)]
Let's examine the issue block logic
24
Modeling the Signals
[Figure: zoom on the Dispatch, Issue, and Exec stages with Res. Stations (RS), Functional Units, and Re-Order Buffer (ROB)]
25
Modeling the Signals
[Figure: ISSUE block with Dependency Check and Resource Check, connected to Dispatch, Res. Stations (RS), Functional Units, Exec, and Re-Order Buffer (ROB); signals: instr, remove, issue, resource available, forwarded data]
26
Finer-grained Model
Direction of Instruction Flow
Single-issue, out-of-order processor. Image rendered by dotty, a graph-drawing tool from Graphviz.
27
Quick-Sort
Issue width = 1, ROB size = 2, IFQ size = 2, LSQ size = 2, RS size = 2
Exec. time: 652 k cycles
28
ROB Critical Path
[Figure: processor block diagram (Fetch, Dispatch, Issue, Exec, Memory, Commit; IFQ, RS, Functional Units, ROB) showing the critical path through the ROB]

ROB size is also referred to as the instruction window: an upper bound on instruction-level parallelism (ILP)

29
Quick-Sort
Issue width = 1, ROB size = 2, IFQ size = 2, LSQ size = 2, RS size = 2
Exec. time: 652 k cycles
30
Larger ROB
Need wider issue-width
Issue width = 1, ROB size = 2 → 4, IFQ size = 2, LSQ size = 2, RS size = 2
Exec. time: 652 k → 542 k cycles
31
Wider Issue
LSQ is critical
Issue width = 1 → 2, ROB size = 2 → 4, IFQ size = 2, LSQ size = 2, RS size = 2
Exec. time: 652 k → 542 k → 459 k cycles
32
Larger LSQ
Data forwarding
Issue width = 1 → 2, ROB size = 2 → 4, IFQ size = 2, LSQ size = 2 → 4, RS size = 2
Exec. time: 652 k → 542 k → 459 k → 418 k cycles
1.5x faster
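The speedup figure can be checked from the execution times quoted on the Quick-Sort slides. The snippet below is just that arithmetic; the step labels are shorthand invented here, and only the cycle counts come from the slides.

```python
# Execution times in thousands of cycles, as quoted on the Quick-Sort slides.
exec_kcycles = {
    "baseline": 652,
    "ROB size 2 -> 4": 542,
    "issue width 1 -> 2": 459,
    "LSQ size 2 -> 4": 418,
}


def speedups(times):
    """Speedup of every configuration relative to the first (baseline) entry."""
    base = next(iter(times.values()))
    return {step: base / t for step, t in times.items()}


cumulative = speedups(exec_kcycles)
for step, s in cumulative.items():
    print(f"{step}: {s:.2f}x")
```

The final configuration works out to 652 / 418, roughly 1.56x, in line with the "1.5x faster" on the slide.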
33
Final Critical Path
[Figure: processor block diagram (Fetch, Dispatch, Issue, Exec, Memory, Commit; IFQ, RS, Functional Units, ROB) showing the final critical path]