Title: Low-Complexity Reorder Buffer Architecture*


1
Low-Complexity Reorder Buffer Architecture*
Gurhan Kucuk, Dmitry Ponomarev, Kanad Ghose
Department of Computer Science
State University of New York, Binghamton, NY 13902-6000
http://www.cs.binghamton.edu/lowpower
16th Annual ACM International Conference on Supercomputing (ICS'02), June 24th, 2002
*Supported in part by DARPA through the PAC-C program and NSF
2
Outline
  • ROB complexities
  • Motivation for the low-complexity ROB
  • Low-complexity ROB design
  • Results
  • Concluding remarks

3
What This Work is All About
  • Complex, richly-ported ROBs are common in modern
    superscalar datapaths
  • The number of ports grows further when results are
    held within ROB slots (example: Pentium III)
  • ROB complexity reduction is important for
    reducing power and improving performance
  • The ROB dissipates a non-trivial fraction of the
    total chip power
  • ROB accesses stretch over several cycles
  • Goal of this work: reduce the complexity and
    power dissipation of the ROB without sacrificing
    performance

4
Pentium III-like Superscalar Datapath
[Figure: datapath block diagram. Labels: Fetch (F1, F2), Decode/Dispatch (D1, D2), instruction dispatch, IQ, Instruction Issue, Function Units (FU1, FU2, ..., FUm), EX, LSQ, D-cache, ROB, ARF (Architectural Register File), result/status forwarding buses]
5
ROB Port Requirements for a W-way CPU
  • Writeback: W write ports to write results
  • Decode/Dispatch: W write ports to set up entries
  • Dispatch/Issue: 2W read ports to read the source operands
  • Commit: W read ports for instruction commitment
6
ROB Port Requirements for a W-way CPU
  • Writeback: W write ports to write results
  • Decode/Dispatch: 1 W-wide write port to set up entries
  • Dispatch/Issue: 2W read ports to read the source operands
  • Commit: 1 W-wide read port for instruction commitment
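The port counts listed on the two slides above can be tallied with a few lines of arithmetic. The following is a minimal sketch, not part of the original talk: the function name `rob_ports` and its flags are hypothetical, and the sketch simply counts each W-wide port as one port.

```python
def rob_ports(width: int, wide_ports: bool = False,
              source_read_ports: bool = True) -> dict:
    """Tally ROB ports for a W-way machine from the per-stage list on the slides.

    wide_ports=True models the variant where entry setup and commit each use
    a single W-wide port. source_read_ports=False models a ROB without ports
    for reading source operands. Both flags are illustrative assumptions.
    """
    ports = {
        "writeback_write": width,
        "dispatch_write": 1 if wide_ports else width,
        "source_read": 2 * width if source_read_ports else 0,
        "commit_read": 1 if wide_ports else width,
    }
    ports["total"] = sum(ports.values())
    return ports


if __name__ == "__main__":
    for label, kwargs in [
        ("baseline", {}),
        ("W-wide setup/commit ports", {"wide_ports": True}),
        ("no source read ports", {"source_read_ports": False}),
    ]:
        print(label, rob_ports(4, **kwargs))
```

For a 4-way machine the source-operand read ports account for 8 of the ports in either variant, which is why they are the natural target for removal.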
7
Where are the Source Values Coming From?
[Figure: the Pentium III-like datapath (Fetch F1/F2, Decode/Dispatch D1/D2, IQ, Instruction Issue, FU1 ... FUm, EX, LSQ, D-cache, ROB, ARF, result/status forwarding buses) with three source-value paths numbered 1, 2 and 3]
8
Where are the Source Values Coming From?
[Chart: breakdown of source operand origins, with shares of 62%, 32% and 6%. 96-entry ROB, 4-way processor, SPEC2K benchmarks]
9
How Efficiently are the Ports Used?
  • Writeback: W write ports to write results
  • Decode/Dispatch: W write ports to set up entries
  • Dispatch/Issue: 2W read ports to read the source operands
  • Commit: W read ports for instruction commitment
[Figure annotation: 6%]
10
Approaches to Reducing ROB Complexity
  • Reduce the number of read ports for reading out
    the source operand values
  • More radical (and better): completely eliminate
    the read ports for reading source operand values!

11
Reducing the Number of Read Ports
[Chart: performance drop per SPEC2K benchmark. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg. Average IPC drop annotations: 3.5% and 1.0%]
12
Problems with Retaining Fewer Source Read Ports on the ROB
  • Need arbitration for the small number of ports
  • Additional logic is needed to block the instructions
    that could not get a port
  • Need a switching network to route the operands to
    the correct destinations
  • Multi-cycle access still remains in the critical
    path of the Dispatch/Issue logic

13
Our Solution: Elimination of Read Ports
[Figure: Pentium III-like datapath (animation step) with source-value paths 1, 2 and 3 marked]
14
Our Solution: Elimination of Read Ports
[Figure: Pentium III-like datapath (animation step) with source-value paths 1, 2 and 3 marked]
15
Our Solution: Elimination of Read Ports
[Figure: Pentium III-like datapath (animation step) with only source-value paths 1 and 3 marked]
16
Comparison of ROB Bitcells (0.18µ, TSMC)
[Figure: layout of a 32-ported SRAM bitcell next to the layout of a 16-ported SRAM bitcell]
  • Area reduction: 71%
  • Shorter bitlines and wordlines
17
Our Solution: Elimination of Read Ports
[Figure: Pentium III-like datapath (animation step) with no numbered source-value paths marked]
Area reduction: 45%
18
Eliminating/Reducing the Number of Read Ports: Effects on Power Dissipation
  • Power is reduced because of
      • shorter bitlines and wordlines
      • lower capacitive loading
      • fewer decoders
      • fewer drivers and sense amps

19
Completely Eliminating the Source Read Ports on the ROB
  • The problem: the issue of instructions that require a
    value stored in the ROB will stall
  • Solution: forward the value to the waiting instruction
    at the time the value is committed (LATE FORWARDING)

20
Late Forwarding: Use the Normal Forwarding Buses!
[Figure: Pentium III-like datapath (animation step): Fetch F1/F2, Decode/Dispatch D1/D2, IQ, Instruction Issue, FU1 ... FUm, EX, LSQ, D-cache, ROB, ARF, result/status forwarding buses]
21
Late Forwarding: Use the Normal Forwarding Buses!
[Figure: Pentium III-like datapath (animation step): Fetch F1/F2, Decode/Dispatch D1/D2, IQ, Instruction Issue, FU1 ... FUm, EX, LSQ, D-cache, ROB, ARF, result/status forwarding buses]
22
Optimizing Late Forwarding
  • PROBLEM: if Late Forwarding is done for every
    result that is committed, additional forwarding
    buses are needed to avoid degrading performance
  • SOLUTION: Selective Late Forwarding (SLF)
      • SLF requires an additional bit in the ROB
      • That bit is set by the dispatched instructions
        that require Late Forwarding
      • No additional forwarding buses are needed, since
        SLF traffic is very small
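The difference between forwarding every committed result and forwarding selectively can be illustrated with a small model. This is a hedged sketch only: `RobEntry`, `request_late_forward` and `commit` are hypothetical names, and the slides do not describe the hardware at this level of detail.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class RobEntry:
    slot: int
    value: Optional[int] = None
    slf: bool = False  # the extra Selective Late Forwarding bit


def request_late_forward(rob: Dict[int, RobEntry], slot: int) -> None:
    """Called by a dispatched instruction that could not obtain its operand."""
    rob[slot].slf = True


def commit(entries: List[RobEntry], forward_all: bool = False) -> List[Tuple[int, int]]:
    """Return the (slot, value) pairs broadcast on the forwarding buses at commit.

    forward_all=True models plain Late Forwarding (every result is broadcast);
    the default models SLF, where only entries with the bit set are broadcast.
    """
    return [(e.slot, e.value) for e in entries if forward_all or e.slf]


if __name__ == "__main__":
    rob = {s: RobEntry(s, value=v) for s, v in [(10, 1), (11, 2), (12, 7)]}
    request_late_forward(rob, 12)          # one consumer is waiting on slot 12
    committed = list(rob.values())
    print("SLF broadcasts:", commit(committed))
    print("Late Forwarding broadcasts:", commit(committed, forward_all=True))
```

In this toy run SLF puts one value on the buses where unconditional Late Forwarding would put three, which is the effect the SLF bit is meant to achieve.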

23
Late Forwarding: Use the Normal Forwarding Buses!
[Figure: Pentium III-like datapath (animation step) with the result/status forwarding buses highlighted]
Only 3.5% of the traffic is from SELECTIVE LATE FORWARDING
24
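Performance Drop of Simplified ROB
[Chart: performance drop per SPEC2K benchmark. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg. Average IPC drop annotations: 9.6%, 3.5% and 1.0%. Other annotations: 17% and 37%]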
25
IPC Penalty: Source Value Not Accessible within the ROB
[Figure: lifetime of a result value along a time axis. Labels: Result Generation, Forwarding, Value within ROB, Late Forwarding/Commitment, Value within ARF]
26
Improving IPC with No Read Ports
  • Cache recently generated values in a set of
    RETENTION LATCHES (RL)
  • Retention Latches are SMALL and FAST
  • Only 8 to 16 latches needed in the set
  • Entire set has 1 or 2 read ports

27
Datapath with the Retention Latches
[Figure: Pentium III-like datapath (animation step): Fetch F1/F2, Decode/Dispatch D1/D2, IQ, Instruction Issue, FU1 ... FUm, EX, LSQ, D-cache, ROB, ARF, result/status forwarding buses]
28
Datapath with the Retention Latches
[Figure: the same datapath with a RETENTION LATCHES block added]
29
The Structure of the Retention Latch Set
[Figure: retention latch set. Labels:]
  • 8 or 16 latches
  • Fields: L-ported CAM field (key: ROB_slot_id), Result Values, Status
  • L ROB slot addresses (L = 1 or 2)
  • L recently-written results (L = 1 or 2 works great)
  • W write ports for writing up to W results in parallel
30
Retention Latch Management Strategies
  • FIFO
      • 8-entry RL: 42% hit rate
      • 16-entry RL: 55% hit rate
  • LRU
      • 8-entry RL: 56% hit rate
      • 16-entry RL: 62% hit rate
  • Random Replacement
      • Worse performance than FIFO
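The FIFO and LRU strategies above differ only in whether a lookup refreshes an entry. The sketch below models the retention latch set as a tiny associative store keyed by ROB slot; the class name `RetentionLatches` and its methods are hypothetical, and the model ignores ports and timing.

```python
from collections import OrderedDict
from typing import Optional


class RetentionLatches:
    """Toy model of a small set of latches keyed by ROB slot id."""

    def __init__(self, size: int = 8, policy: str = "FIFO") -> None:
        if policy not in ("FIFO", "LRU"):
            raise ValueError("policy must be 'FIFO' or 'LRU'")
        self.size = size
        self.policy = policy
        self.entries: "OrderedDict[int, int]" = OrderedDict()  # oldest first

    def write(self, rob_slot: int, value: int) -> None:
        """Record a newly written result, evicting the oldest entry if full."""
        if rob_slot in self.entries:
            del self.entries[rob_slot]
        elif len(self.entries) >= self.size:
            self.entries.popitem(last=False)
        self.entries[rob_slot] = value

    def lookup(self, rob_slot: int) -> Optional[int]:
        """Return the cached value on a hit, or None on a miss."""
        if rob_slot not in self.entries:
            return None
        if self.policy == "LRU":
            self.entries.move_to_end(rob_slot)  # a hit refreshes the entry
        return self.entries[rob_slot]


if __name__ == "__main__":
    for policy in ("FIFO", "LRU"):
        rl = RetentionLatches(size=2, policy=policy)
        rl.write(1, 10)
        rl.write(2, 20)
        rl.lookup(1)       # touch slot 1
        rl.write(3, 30)    # forces one eviction
        print(policy, "slot 1:", rl.lookup(1), "slot 2:", rl.lookup(2))
```

With two latches, the touched entry survives the eviction under LRU but not under FIFO, which is consistent with LRU showing the higher hit rates on the slide.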

31
Hit Ratios to Retention Latches
[Chart: hit ratios per SPEC2K benchmark. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg. Average hit ratio annotations: 42%, 55%, 56% and 62%]
32
Accessing Retention Latch Entries
  • The ROB index is used as a unique key in the
    Retention Latches to search for result values
  • Need to maintain unique keys even when we have
      • Reuse of a ROB slot
          • Not a problem for FIFO
          • For LRU, simply flush the RL entry at commit time
      • Branch mispredictions

33
Handling Branch Mispredictions
  • Selective RL Flushing: retention latch entries
    that are in the mispredicted path are flushed
      • Uses branch tags
      • Complicated implementation
  • Complete RL Flushing: all retention latch entries
    are flushed
      • Very simple implementation
      • Performance drop is only 1.5% compared to
        selective flushing
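The two flushing policies can be contrasted in a few lines. This is a sketch under stated assumptions: the latches are modeled as a dictionary from ROB slot to a (value, branch tag) pair, and the tag scheme is hypothetical since the slides only say that branch tags are used.

```python
from typing import Dict, Set, Tuple

# Hypothetical model: ROB slot -> (value, tag of the branch the result depends on)
Latches = Dict[int, Tuple[int, int]]


def complete_flush(latches: Latches) -> None:
    """Complete RL flushing: drop every entry on a misprediction."""
    latches.clear()


def selective_flush(latches: Latches, squashed_tags: Set[int]) -> None:
    """Selective RL flushing: drop only entries on the mispredicted path."""
    doomed = [slot for slot, (_, tag) in latches.items() if tag in squashed_tags]
    for slot in doomed:
        del latches[slot]


if __name__ == "__main__":
    latches: Latches = {12: (7, 0), 13: (9, 1), 14: (4, 2)}
    selective_flush(latches, squashed_tags={1, 2})
    print("after selective flush:", latches)
    complete_flush(latches)
    print("after complete flush:", latches)
```

Selective flushing keeps entries that are still valid but needs a tag comparison per entry, whereas complete flushing needs no per-entry state at all.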

34
Misprediction Handling Performance
[Chart: IPC for the misprediction handling schemes. Average IPC drop: 1.5%]
35
Scenario 1: Traditional Design
Instruction: ADD R1, R2, R3 (ROB index 5)
Simplified IDB entry 1: Instruction = ADD, ROB index = 5; Src1 arch. = 2, Src1 valid = ?, Src1 value = ?; Src2 arch. = 3, Src2 valid = ?, Src2 value = ?
36
Scenario 1: Traditional Design
Instruction: ADD R1, R2, R3 (ROB index 5)
Rename Table (columns: Arch., ROB/Phys., location flag with ROB=0, ARF=1):
  • Arch. 2: ROB/Phys. 12, flag 0
  • Arch. 3: ROB/Phys. 3, flag 1
Simplified IDB entry 1: Src1 reg. = 2, Src1 valid = ?, Src1 value = ?; Src2 reg. = 3, Src2 valid = ?, Src2 value = ?
37
Scenario 1: Traditional Design
Rename Table as on the previous slide.
ROB (columns: ROB/Phys., Phys. valid, Phys. value): entry 12, valid = 1, value = 7
Simplified IDB entry 1: Src1 valid = ?, Src1 value = ?; Src2 valid = ?, Src2 value = ?
38
Scenario 1: Traditional Design
ROB: entry 12, valid = 1, value = 7
Simplified IDB entry 1: Src1 valid = 1, Src1 value = 7; Src2 valid = ?, Src2 value = ?
39
Scenario 1: Traditional Design
ROB: entry 12, valid = 0, value = ?
Simplified IDB entry 1: Src1 valid = ?, Src1 value = ?; Src2 valid = ?, Src2 value = ?
40
Scenario 1: Traditional Design
ROB: entry 12, valid = 0, value = ?
Simplified IDB entry 1: Src1 valid = 0, Src1 value = ?; Src2 valid = ?, Src2 value = ?
41
Scenario 1: Traditional Design
ARF (columns: Arch., Arch. value): Arch. 3, value = 43
Simplified IDB entry 1: Src1 valid = 1, Src1 value = 7; Src2 valid = ?, Src2 value = ?
42
Scenario 1: Traditional Design
ARF: Arch. 3, value = 43
Simplified IDB entry 1: Src1 valid = 1, Src1 value = 7; Src2 valid = 1, Src2 value = 43
43
Scenario 2: Simplified ROB with RLs
Instruction: ADD R1, R2, R3 (ROB index 5)
Simplified IDB entry 1: Instruction = ADD, ROB index = 5; Src1 arch. = 2, Src1 valid = ?, Src1 value = ?; Src2 arch. = 3, Src2 valid = ?, Src2 value = ?
44
Scenario 2: Simplified ROB with RLs
Instruction: ADD R1, R2, R3 (ROB index 5)
Rename Table (columns: Arch., ROB/Phys., location flag with ROB=0, ARF=1):
  • Arch. 2: ROB/Phys. 12, flag 0
  • Arch. 3: ROB/Phys. 3, flag 1
Simplified IDB entry 1: Src1 reg. = 2, Src1 valid = ?, Src1 value = ?; Src2 reg. = 3, Src2 valid = ?, Src2 value = ?
45
Scenario 2: Simplified ROB with RLs
Rename Table as on the previous slide.
Retention Latches (columns: ROB/Phys., Phys. value): entry 12, value = 7
Simplified IDB entry 1: Src1 valid = ?, Src1 value = ?; Src2 valid = ?, Src2 value = ?
46
Scenario 2: Simplified ROB with RLs
Retention Latches: entry 12, value = 7
Simplified IDB entry 1: Src1 valid = 1, Src1 value = 7; Src2 valid = ?, Src2 value = ?
47
Scenario 2: Simplified ROB with RLs
Retention Latches: MISS
Simplified IDB entry 1: Src1 valid = ?, Src1 value = ?; Src2 valid = ?, Src2 value = ?
48
Scenario 2: Simplified ROB with RLs
Retention Latches: MISS
ROB (columns: ROB/Phys., Phys. valid, Phys. value, SLF): entry 12, valid = X, value = X, SLF = 0 (X: don't care)
Simplified IDB entry 1: Src1 valid = 0, Src1 value = ?; Src2 valid = ?, Src2 value = ?
49
Scenario 2: Simplified ROB with RLs
Retention Latches: MISS
ROB: entry 12, valid = X, value = X, SLF = 1 (X: don't care)
Simplified IDB entry 1: Src1 valid = 0, Src1 value = ?; Src2 valid = ?, Src2 value = ?
50
Scenario 2: Simplified ROB with RLs
ARF (columns: Arch., Arch. value): Arch. 3, value = 43
Simplified IDB entry 1: Src1 valid = 1, Src1 value = 7; Src2 valid = ?, Src2 value = ?
51
Scenario 2: Simplified ROB with RLs
ARF: Arch. 3, value = 43
Simplified IDB entry 1: Src1 valid = 1, Src1 value = 7; Src2 valid = 1, Src2 value = 43
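The Scenario 2 walkthrough for ADD R1, R2, R3 can be replayed in code. This is a minimal sketch assuming dictionary-based tables; `read_source`, `Operand` and the table layouts are hypothetical, while the register numbers and values (ROB slot 12 holding 7, ARF register 3 holding 43) are taken from the slides.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

ROB, ARF = 0, 1  # location flag in the rename table, as on the slides


@dataclass
class Operand:
    valid: bool
    value: Optional[int]


def read_source(arch_reg: int,
                rename_table: Dict[int, Tuple[int, int]],
                arf: Dict[int, int],
                retention_latches: Dict[int, int],
                slf_bits: Dict[int, int]) -> Operand:
    """Resolve one source operand at dispatch when the ROB has no read ports."""
    index, location = rename_table[arch_reg]
    if location == ARF:
        return Operand(True, arf[index])                # committed value
    if index in retention_latches:
        return Operand(True, retention_latches[index])  # RL hit
    slf_bits[index] = 1                                 # RL miss: request SLF
    return Operand(False, None)


if __name__ == "__main__":
    rename_table = {2: (12, ROB), 3: (3, ARF)}
    arf = {3: 43}

    slf_bits: Dict[int, int] = {}
    hit = read_source(2, rename_table, arf, {12: 7}, slf_bits)
    print("R2 with RL hit:", hit, "SLF bits:", slf_bits)

    miss = read_source(2, rename_table, arf, {}, slf_bits)
    print("R2 with RL miss:", miss, "SLF bits:", slf_bits)

    print("R3 from ARF:", read_source(3, rename_table, arf, {}, slf_bits))
```

On a miss the operand stays invalid and the SLF bit of ROB slot 12 is set, so the value would reach the waiting instruction over the forwarding buses when slot 12 commits.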
52
Experimental Setup: the AccuPower (DATE'02)
[Figure: tool flow diagram. Labels:]
  • Compiled SPEC benchmarks, Datapath specs
  • Microarchitectural Simulator (rooted in SimpleScalar)
  • Performance stats
  • Transition counts, Context information
  • Energy/Power Estimator
  • Power/energy stats
  • VLSI layout data, SPICE deck, SPICE
  • SPICE measures of energy per transition
53
Configuration of the Simulated System
  • Machine width: 4-way
  • Issue Queue: 32 entries
  • Reorder Buffer: 96 entries
  • Load/Store Queue: 32 entries
Simulated the execution of SPEC2000 benchmarks
54
Assumed Timings
Timing of the baseline model:
  • D1: Rename Table lookup for ROB index
  • D2, D3: source operand read from the ROB
Timing of the simplified ROB:
  • D1: Rename Table lookup for ROB index
  • D2: associative lookup of operand from retention
    latches using ROB index as a key (smaller delay: few latches)
55
Experimental Results: Effect on Performance
[Chart: performance drop per SPEC2K benchmark. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg. Avg. IPC drop annotations: 0.1%, -1.6%, -1.0% and -2.3%]
56
Experimental Results: Effect on Performance
[Chart: performance drop per SPEC2K benchmark. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg. Avg. IPC drop annotations: 3.3%, 1.7%, 2.3% and 1.0%]
57
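Experimental Results: Effect on Power
[Chart: power savings per SPEC2K benchmark. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg. Avg. savings annotations: 30%, 23.4%, 22.2%, 21% and 20.2%]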
58
Summary of Results
  • Significantly reduced ROB complexity and power
    dissipation
      • 45% area reduction
      • 20% to 30% power reduction across SPEC 2000
        benchmarks
  • Actual IPC improvements
      • 1.6% to 2.3% gain across SPEC benchmarks
      • IPC gains come from the 1-cycle access to the RL (vs. the
        2 cycles that would be needed for ROB access)

59
Related Work
  • Value-Aging Buffer (Hu & Martonosi, PACS 2000)
  • Forwarding Buffer and Clustered Register Cache
    (Borch et al., HPCA'02)
  • Multiple Register Banks (Cruz et al., ISCA'00;
    Balasubramonian et al., MICRO'01)
  • See paper for discussions

60
Conclusions
  • Typical source operand location statistics can be
    successfully exploited to reduce ROB complexity
  • Significant reduction in ROB area and power: no
    ROB ports are needed for reading source operands
  • IPC gains are possible because of the use of a
    small, low-ported set of Retention Latches to supply
    cached operand values in a single cycle
