Global Critical Path: A Tool for System-Level Timing Analysis

1
Global Critical Path: A Tool for System-Level Timing Analysis
  • Girish Venkataramani
  • Tiberiu Chelcea
  • Seth C. Goldstein
  • Mihai Budiu

2
Timing Analysis
Gate-Level
  • Reg-to-Reg critical path
  • Determines frequency
  • Fault analysis, verification
System-Level
  • Global Critical Path (GCP)
  • Determines iteration bound of system execution
  • System bottleneck analysis
  • Power, performance optimization

3
GCP-driven Optimizations
[Figure: RTL graph for JPEG encoder kernel, 50 K gates]
  • Global Critical Path (GCP)
  • Longest Sequential Path
  • Optimize GCP for performance
  • Optimize for power/area

4
GCP-driven Optimizations
[Figure: RTL graph for JPEG encoder kernel, 50 K gates]
  • Global Critical Path (GCP)
  • Longest Sequential Path
  • Optimize GCP for performance
  • Optimize for power/area
  • Update GCP for re-use

5
System Model
  • How do we represent the system architecture?
  • Unified model
  • Captures resource usage
  • RTL captures system architecture
  • Too large: potentially millions of signals

[Figure labels: RTL Model, Arbitrary DAG, pipeline stage]
6
Modeling Execution State
  • System state is defined by the values of state-holding
    elements
  • Key insight: the collective state of the signals
    controlling the registers determines execution progress

Model the transaction-level interaction between circuit signals
[Figure labels: Control-path, Data-path, Reg, enable]
7
Modeling Execution State
  • Discrete event model captures interaction
  • Events: control-path signal transitions
  • Behaviors: partial execution ordering of events

[Figure: timing diagram of signals s1, s2, s3 and a behavior relating their transition events]
Timing diagram defines event ordering
An event-behavior graph is constructed by modeling the control-path
for every state-holding element in the system
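The event-behavior model described on this slide can be pictured as a small data structure. The following is only an illustrative sketch, not the authors' implementation: the class names, the "rise"/"fall" transition labels, and the choice of which events are inputs or outputs of behavior `b` are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """A control-path signal transition (hypothetical representation)."""
    signal: str
    transition: str  # assumed labels: "rise" or "fall"


@dataclass
class Behavior:
    """A partial ordering: every input event fires before the output events."""
    name: str
    inputs: list
    outputs: list


@dataclass
class EventBehaviorGraph:
    behaviors: list = field(default_factory=list)

    def add(self, behavior):
        self.behaviors.append(behavior)

    def producers_of(self, event):
        """Names of behaviors that fire the given event."""
        return [b.name for b in self.behaviors if event in b.outputs]

    def consumers_of(self, event):
        """Names of behaviors that wait for the given event."""
        return [b.name for b in self.behaviors if event in b.inputs]


# Assumed example: behavior b waits for events on s1 and s3, then fires s2.
s1 = Event("s1", "rise")
s2 = Event("s2", "rise")
s3 = Event("s3", "rise")

graph = EventBehaviorGraph()
graph.add(Behavior("b", inputs=[s1, s3], outputs=[s2]))
graph.add(Behavior("b_next", inputs=[s2], outputs=[s3]))

print(graph.producers_of(s2))  # ['b']
print(graph.consumers_of(s2))  # ['b_next']
```

One behavior would be created per state-holding element's control path; the graph then records which behaviors each event connects.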
8
Profiling-based Timing Analysis
Incorporated into the CASH compiler flow [ASPLOS 04, IWLS 04]
[Figure: tool flow. Labels: Spec, Front-End, IR, Back-End, Gate-level Circuit, Verilog Simulation, Profiling, Trace Processing, Construct model, GCP, Slack, Optimization Loop]
9
Trace Processing
  • Profiling records event firing times, slack
  • Trace processing computes GCP

[Figure: timing diagram for behavior b over events s1?, s2?, s3?, showing when s2? and s3? fire; time axis marked at 5, 8, 9]
Slack(s3?, b) = 3, Slack(s1?, b) = 0
A 0-slack input is locally critical [Fields, ISCA 01]
The GCP is the longest path of 0-slack events
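The slack and GCP definitions above can be illustrated with a short sketch. It is a simplification under stated assumptions, not the tool's trace processor: each event is assumed to fire once, `firings` is a hypothetical map from an output event to the arrival times of the inputs of the behavior that produced it, and the arrival times 8 and 5 are assigned to s1 and s3 only because that assignment reproduces the slacks quoted on the slide.

```python
def input_slacks(arrivals):
    """Slack of each input = arrival time of the last input minus its own arrival."""
    last = max(arrivals.values())
    return {event: last - t for event, t in arrivals.items()}


def trace_gcp(firings, last_event):
    """Walk backward from last_event, always following a 0-slack input.

    firings: {output_event: {input_event: arrival_time}} (assumed trace format).
    Returns the path from the earliest reached event to last_event.
    """
    path = [last_event]
    seen = {last_event}
    current = last_event
    while current in firings:
        slacks = input_slacks(firings[current])
        critical = min(slacks, key=slacks.get)  # the last-arriving input has slack 0
        if critical in seen:  # guard against looping in this single-firing sketch
            break
        path.append(critical)
        seen.add(critical)
        current = critical
    path.reverse()
    return path


# Behavior b: inputs s1 (arrives at 8) and s3 (arrives at 5); output s2.
slacks_b = input_slacks({"s1": 8, "s3": 5})
print(slacks_b)  # {'s1': 0, 's3': 3}

# Hypothetical extra behavior producing s1 from s0, to make the path longer.
firings = {"s2": {"s1": 8, "s3": 5}, "s1": {"s0": 2}}
gcp = trace_gcp(firings, "s2")
print(gcp)  # ['s0', 's1', 's2']
```

A real trace contains many firings of the same event, so a full implementation would key events by occurrence as well as by name.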
10
Event Type Matters
Two types of events
  • Data events: from producers
  • Back-pressure events: from consumers

[Figure labels: Control-path, Data-path, Reg, enable, ready, stall]
11
Example: Asynchronous Pipelines
[Figure labels: Events, Behaviors, RTL, Handshake Protocol]
Data events: req↑
Back-pressure events: ack↑, req↓, ack↓
12
Event Sequences Matter
Data events: req↑
Back-pressure events: ack↑, req↓, ack↓
  • Property of GCP for 4-phase handshake (H/S)
  • Data event sequence: path_data = <req↑>
  • Back-pressure (B-P) event sequence: path_sync = <ack↑, req↓, ack↓>
  • GCP topology is always <path_data path_sync>

13
What does this mean?
PATH_data <req↑>: Good, computation-bound
PATH_sync <ack↑, req↓, ack↓>: Maybe bad, resource-bound
Optimization goal: eliminate PATH_sync from the GCP, e.g., by slack matching [ICCAD 06]
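To make the computation-bound versus resource-bound distinction concrete, the sketch below measures how much of a GCP consists of back-pressure events. It assumes the standard 4-phase handshake ordering (request rises, acknowledge rises, request falls, acknowledge falls) with the rising request as the only data event. The `+`/`-` suffixes, the channel names, and the example path are hypothetical.

```python
from collections import Counter

DATA_EVENTS = {"req+"}                  # rising request: data from the producer
SYNC_EVENTS = {"ack+", "req-", "ack-"}  # back-pressure part of the handshake


def sync_fraction(gcp):
    """Fraction of GCP events that are back-pressure (synchronization) events."""
    if not gcp:
        return 0.0
    sync = sum(1 for _, event in gcp if event in SYNC_EVENTS)
    return sync / len(gcp)


def sync_channels(gcp):
    """Channels ranked by how many back-pressure events they put on the GCP."""
    counts = Counter(channel for channel, event in gcp if event in SYNC_EVENTS)
    return counts.most_common()


# Hypothetical GCP: a list of (channel, event) pairs in path order.
gcp = [
    ("c1", "req+"),
    ("c2", "req+"),
    ("c2", "ack+"),
    ("c2", "req-"),
    ("c2", "ack-"),
    ("c3", "req+"),
]

print(sync_fraction(gcp))  # 0.5
print(sync_channels(gcp))  # [('c2', 3)]
```

Channels that contribute many back-pressure events are the natural first candidates for an optimization such as slack matching.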
14
Conclusions
  • GCP is the longest sequential path of system
    execution
  • Excellent bottleneck analysis tool
  • Analyzes system performance across hierarchy
  • GCP can be automatically computed and used for
    driving system-level power and performance
    optimizations
  • Slack Matching (performance)
  • Operation Chaining (power, area)
  • Hybrid Handshake Protocol (power-performance)

15
Thank you!
  • Phoenix project
  • http://www.cs.cmu.edu/phoenix

16
Backup Slides
17
Complexity of Profiling-based Analysis
7
18
Slack Update
  • System model is a network of behaviors
  • New slack is propagated through the graph
  • Propagation stops if either:
  • the minimum of the in-slacks, Min(si), is unchanged, or
  • the update completes a whole cycle
  • Finally, update the GCP: follow locally critical
    inputs

E.g., let B1 fire earlier by dB1 time units
[Figure labels: B1, dB1, s1, s2, B2, Min(s1,s2), s3, s4]
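The propagation rule on this slide can be sketched as a worklist pass over the behavior network. This is a simplified illustration under assumptions, not the authors' algorithm: `in_slack` and `successors` are hypothetical dictionaries, each behavior is advanced at most once (which stands in for "the update completes a whole cycle"), and the function modifies `in_slack` in place.

```python
from collections import deque


def propagate_slack(in_slack, successors, start, delta):
    """Propagate the effect of `start` firing `delta` time units earlier.

    in_slack:   {behavior: {predecessor: slack on that input}} (updated in place)
    successors: {behavior: [downstream behaviors]}
    Returns {behavior: how much earlier it now fires}.
    """
    earlier = {start: delta}
    queue = deque([start])
    while queue:
        src = queue.popleft()
        d = earlier[src]
        for dst in successors.get(src, []):
            slacks = in_slack[dst]
            slacks[src] += d          # this input now arrives d earlier
            if dst in earlier:        # already advanced: the update closed a cycle
                continue
            gain = min(slacks.values())
            if gain == 0:             # Min(si) unchanged: dst does not move, stop here
                continue
            for pred in slacks:       # dst fires `gain` earlier; renormalize slacks
                slacks[pred] -= gain
            earlier[dst] = gain
            queue.append(dst)
    return earlier


# Hypothetical network: B1 and B3 feed B2, and B2 feeds B4.
in_slack = {"B2": {"B1": 0, "B3": 2}, "B4": {"B2": 0}}
successors = {"B1": ["B2"], "B3": ["B2"], "B2": ["B4"]}

moved = propagate_slack(in_slack, successors, "B1", 5)
print(moved)           # {'B1': 5, 'B2': 2, 'B4': 2}
print(in_slack["B2"])  # {'B1': 3, 'B3': 0}
```

In this example B2 can only move up by 2 because its other input, from B3, becomes the locally critical one; following the 0-slack inputs afterwards gives the updated GCP.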
19
Slack Update
D(Pi) ? Pi Si
S1
S2
bm
b1
b2
? p1
? p2
20
Slack Update
Assume L(b1) changes by d
D(Pi) ? Pi Si
S1
S2
bm
b1
b2
? p1
? p2
Diff = abs(? p1 - ? p2)
  • If ? p1 > ? p2
  • New S2 = (Old S2) d
  • S1 unchanged (= 0)
  • If ? p2 > ? p1
  • New S2 = Min(d, Diff)
  • New S1 = Max(0, (Diff d))

21
Summarizing the GCP
[Figure: timeline of firings of behaviors b1, b2, b3, b4, time axis marked at 5, 10, 15, summarized as a graph whose edges are annotated <1.0, 0>, <1.0, 0>, <1.0, 0>, <0.5, 2>, <0.5, 3>]
Edge annotation: <GCP Frequency, Average Slack>
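A summary of this kind could be computed from per-firing observations as sketched below. The record format, the edge names, and the reading of "GCP frequency" as the fraction of an edge's firings that lie on the GCP are assumptions made for illustration.

```python
from collections import defaultdict


def summarize(observations):
    """Summarize each edge as (GCP frequency, average slack).

    observations: iterable of (edge, slack, on_gcp) tuples, one per firing.
    """
    count = defaultdict(int)
    critical = defaultdict(int)
    total_slack = defaultdict(float)
    for edge, slack, on_gcp in observations:
        count[edge] += 1
        critical[edge] += 1 if on_gcp else 0
        total_slack[edge] += slack
    return {
        edge: (critical[edge] / count[edge], total_slack[edge] / count[edge])
        for edge in count
    }


# Hypothetical observations over two iterations.
observations = [
    (("b1", "b3"), 0, True),
    (("b1", "b3"), 0, True),
    (("b2", "b3"), 0, True),
    (("b2", "b3"), 4, False),
]

summary = summarize(observations)
print(summary[("b1", "b3")])  # (1.0, 0.0)
print(summary[("b2", "b3")])  # (0.5, 2.0)
```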
22
Modeling the Processor Blocks
[Figure: processor pipeline. Stages: Fetch, Dispatch, Issue, Exec, Memory, Commit. Structures: Instr. Fetch Q (IFQ), Res. Stations (RS), Functional Units, Re-Order Buffer (ROB). Legend: ALU Instruction, Mem Instruction]
  • Model is a single-issue, out-of-order processor.
  • It is simulated by MASE, a cycle-accurate
    processor simulator based on SimpleScalar


23
Modeling the Signals
[Figure: processor pipeline. Stages: Fetch, Dispatch, Issue, Exec, Memory, Commit. Structures: Instr. Fetch Q (IFQ), Res. Stations (RS), Functional Units, Re-Order Buffer (ROB)]
Let's examine the issue block logic
24
Modeling the Signals
[Figure: portion of the pipeline. Stages: Dispatch, Issue, Exec. Structures: Res. Stations (RS), Functional Units, Re-Order Buffer (ROB)]
25
Modeling the Signals
[Figure: issue block logic. Blocks: Dispatch, Res. Stations (RS), Functional Units, Exec, Re-Order Buffer (ROB), and ISSUE with Dependency Check and Resource Check. Signals: instr, remove, issue, resource available, forwarded data]
26
Finer-grained Model
[Figure: finer-grained model graph, with the direction of instruction flow marked]
Single-issue, out-of-order processor. Image rendered by dotty,
a graph-drawing tool from Graphviz.
27
Quick-Sort
Issue width: 1; ROB size: 2; IFQ size: 2; LSQ size: 2; RS size: 2
Exec. time: 652 k cycles
28
ROB Critical Path
[Figure: processor pipeline. Stages: Fetch, Dispatch, Issue, Exec, Memory, Commit. Structures: Instr. Fetch Q (IFQ), Res. Stations (RS), Functional Units, Re-Order Buffer (ROB)]
  • ROB size is also referred to as the instruction window
  • → Upper bound on Instruction-Level Parallelism
    (ILP)

29
Quick-Sort
Issue width: 1; ROB size: 2; IFQ size: 2; LSQ size: 2; RS size: 2
Exec. time: 652 k cycles
30
Larger ROB
Need wider issue-width
Issue width: 1; ROB size: 2 → 4; IFQ size: 2; LSQ size: 2; RS size: 2
Exec. time: 652 k → 542 k cycles
31
Wider Issue
LSQ is critical
Issue width: 1 → 2; ROB size: 2 → 4; IFQ size: 2; LSQ size: 2; RS size: 2
Exec. time: 652 k → 542 k → 459 k cycles
32
Larger LSQ
Data forwarding
Issue width: 1 → 2; ROB size: 2 → 4; IFQ size: 2; LSQ size: 2 → 4; RS size: 2
Exec. time: 652 k → 542 k → 459 k → 418 k cycles
1.5x faster
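The cumulative speedup after each change follows directly from the execution times on these slides. The short calculation below uses only those four numbers; the step labels are shorthand for the configuration changes.

```python
# Execution time in thousands of cycles after each configuration change.
exec_time_kcycles = [
    ("baseline", 652),
    ("ROB size 2 -> 4", 542),
    ("issue width 1 -> 2", 459),
    ("LSQ size 2 -> 4", 418),
]

baseline = exec_time_kcycles[0][1]
speedups = {step: baseline / t for step, t in exec_time_kcycles}

for step, factor in speedups.items():
    print(f"{step}: {factor:.2f}x")
# baseline: 1.00x
# ROB size 2 -> 4: 1.20x
# issue width 1 -> 2: 1.42x
# LSQ size 2 -> 4: 1.56x
```

The final ratio, 652 / 418, is about 1.56, which the slide rounds to 1.5x.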
33
Final Critical Path
[Figure: processor pipeline. Stages: Fetch, Dispatch, Issue, Exec, Memory, Commit. Structures: Instr. Fetch Q (IFQ), Res. Stations (RS), Functional Units, Re-Order Buffer (ROB)]
  • Data forwarding path is most critical
  • → True data dependencies restrict parallelism