Title: Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File
1. Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File
- Stephen Hines, Gary Tyson, and David Whalley
- Computer Science Dept.
- Florida State University
- June 8-16, 2007
2. Instruction Packing
- Store frequently occurring instructions, as specified by the compiler, in a small, low-power Instruction Register File (IRF)
- Allow multiple instruction fetches from the IRF by packing instruction references together
  - Tightly packed: multiple IRF references in one instruction word
  - Loosely packed: piggybacks an IRF reference onto an existing instruction
- Facilitate parameterization of some instructions using an Immediate Table (IMM)
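As a rough illustration (not code from the talk), the compile-time side of instruction packing can be sketched as follows. The instruction strings, the greedy run-based packing policy, and the pack size are illustrative assumptions; the slides only state that frequent instructions go into a 32-entry IRF and that tight packs hold multiple references.

```python
from collections import Counter

IRF_SIZE = 32
MAX_PACK = 5  # assume a tightly packed word holds up to 5 IRF references

def build_irf(trace):
    """Choose the most frequently occurring instructions for the IRF."""
    return [inst for inst, _ in Counter(trace).most_common(IRF_SIZE)]

def pack(trace, irf):
    """Greedily merge consecutive IRF-resident instructions into tight packs."""
    index = {inst: i for i, inst in enumerate(irf)}
    packed, run = [], []

    def flush():
        nonlocal run
        while run:
            chunk, run = run[:MAX_PACK], run[MAX_PACK:]
            packed.append(("pack", tuple(index[i] for i in chunk)))

    for inst in trace:
        if inst in index:
            run.append(inst)       # extend the current run of IRF hits
        else:
            flush()                # emit any pending pack
            packed.append(("plain", inst))
    flush()
    return packed

trace = ["addu r2,r3,r4", "lw r5,0(r2)", "addu r2,r3,r4", "jal foo",
         "addu r2,r3,r4", "lw r5,0(r2)"]
irf = build_irf(trace)[:2]         # top-2 stands in for a full 32-entry IRF
packed = pack(trace, irf)          # 6 instruction words shrink to 3
```

The point of the sketch is the code-size effect: runs of IRF-resident instructions collapse into single packed words, which is why densely executed regions benefit most.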
3. Execution of IRF Instructions
[Figure: pipeline diagram spanning the instruction fetch stage and the first half of the instruction decode stage — PC, IF/ID latch, IRF, and IRWP — showing packed instructions being expanded from the IRF before reaching the instruction decoder. Caption: Executing a Tightly Packed Param4c Instruction]
4. Outline
- Introduction
- IRF and Instruction Packing Overview
- Integrating an IRF with an L0 I-Cache
- Decoupling Instruction Fetch
- Experimental Evaluation
- Related Work
- Conclusions & Future Work
5. MIPS+IRF Instruction Formats
[Figure: modified MIPS instruction format diagrams]
- T-type (tightly packed): opcode, packed IRF references (inst1, inst2, inst3, ...), and parameter field (s)
- R-type: opcode (inst), rs, rt, rd, function
- I-type: opcode (inst), rs, rt, immediate
- J-type: opcode, immediate, win (IRF window)
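To make the T-type layout concrete, here is a hedged bit-field sketch. The exact widths are an assumption chosen so the fields sum to 32 bits (a 6-bit opcode, five 5-bit IRF indices, and a 1-bit parameter flag `s`); the slides name the fields but not their sizes.

```python
OPCODE_BITS, INST_BITS, NUM_SLOTS = 6, 5, 5  # assumed widths: 6 + 5*5 + 1 = 32

def encode_t_type(opcode, slots, s=0):
    """Pack an opcode, five 5-bit IRF indices, and an s bit into one word."""
    assert len(slots) == NUM_SLOTS and all(0 <= x < 32 for x in slots)
    word = opcode
    for idx in slots:
        word = (word << INST_BITS) | idx
    return (word << 1) | s

def decode_t_type(word):
    """Recover (opcode, slots, s) from a packed T-type word."""
    s = word & 1
    word >>= 1
    slots = []
    for _ in range(NUM_SLOTS):
        slots.append(word & 0x1F)
        word >>= INST_BITS
    return word, list(reversed(slots)), s

word = encode_t_type(0b011101, [3, 17, 3, 0, 29], s=1)
assert decode_t_type(word) == (0b011101, [3, 17, 3, 0, 29], 1)
```

Note how a single 32-bit word carries five IRF references, which is what lets one fetch feed several decode slots.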
6. Previous Work on the IRF
- Register Windowing & Loop Cache (MICRO 2005)
- Compiler Optimizations (CASES 2006)
  - Instruction Selection
  - Register Renaming
  - Instruction Scheduling
7. Integrating an IRF with an L0 I-Cache
- L0 (or filter) caches:
  - Small and direct-mapped
  - Fast hit time
  - Low energy per access
  - Higher miss rate than L1
- 256B L0 I-cache with 8B line size [Kin97]:
  - Fetch energy reduced by 68%
  - Cycle time increased by 46%!!!
- The IRF reduces code size, while an L0 cache focuses only on energy reduction, at the cost of performance
- The IRF can alleviate the performance penalty of L0 cache misses by overlapping fetch
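A minimal model of the 256B/8B-line configuration cited above makes the trade-off visible; the trace, the one-cycle miss penalty, and the loop addresses are illustrative assumptions, not data from the study.

```python
LINE_SIZE = 8                    # bytes per line, as in [Kin97]
NUM_LINES = 256 // LINE_SIZE     # 32 lines, direct-mapped

def simulate_l0(pc_trace, miss_penalty=1):
    """Count L0 misses over a PC trace; each miss adds a fetch stall cycle."""
    tags = [None] * NUM_LINES
    misses = 0
    for pc in pc_trace:
        line = pc // LINE_SIZE
        idx, tag = line % NUM_LINES, line // NUM_LINES
        if tags[idx] != tag:     # direct-mapped lookup: tag mismatch = miss
            tags[idx] = tag
            misses += 1
    return misses, len(pc_trace) + misses * miss_penalty  # total fetch cycles

# A tight 4-instruction loop (16 bytes) fits in the L0, so only the first
# iteration misses, once per line touched.
loop = [0x400, 0x404, 0x408, 0x40C] * 100
misses, cycles = simulate_l0(loop)
```

Loops that fit behave well; the 46% cycle-time penalty and the misses on less regular code are what the overlapping IRF fetch is meant to hide.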
8. L0 Cache Miss Penalty
[Figure: 9-cycle pipeline timing diagram (IF, ID, EX, M, WB) for Insn1 through Insn4, showing an L0 cache miss stalling fetch and delaying the subsequent instructions]
9. Overlapping Fetch with an IRF
[Figure: the same pipeline timing diagram with a tightly packed pair Pack2a/Pack2b fetched together in a single cycle (IFab); the second packed instruction decodes and issues from the IRF while the L0 miss is serviced, so Insn3 still completes by cycle 9]
10. Decoupling Instruction Fetch
- Instruction bandwidth in a pipeline is usually uniform (fetch, decode, issue, commit, ...)
  - This artificially limits the effective design space
- Front-end throttling improves energy utilization by reducing fetch bandwidth in areas of low ILP
- The IRF can provide virtual front-end throttling:
  - Fetch fewer instructions each cycle, but allow multiple issue of packed instructions
  - Areas of high ILP are often densely packed
  - Infrequently executed sections of code have lower ILP
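The throttling idea above can be sketched with a toy model: fetch bandwidth stays fixed at one word per cycle, but packed words expand to several operations at decode. The ops-per-word counts below are invented for illustration.

```python
def effective_issue(fetched_words):
    """fetched_words: ops delivered by each fetched word
    (1 for a plain instruction, up to 5 for a tight pack).
    Returns mean operations delivered per fetch cycle."""
    cycles = len(fetched_words)   # one word fetched per cycle
    ops = sum(fetched_words)
    return ops / cycles

hot_loop  = [5, 4, 5, 3]   # densely packed, high-ILP region
cold_code = [1, 1, 2, 1]   # infrequently executed, mostly unpacked
```

With a narrow fetch port, the hot region still sustains a wide effective issue rate while the cold region naturally throttles itself, which is the "virtual front-end throttling" claim in code form.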
11. Out-of-Order Pipeline Configurations
12. Experimental Evaluation
- MiBench embedded benchmark suite: 6 categories representing common tasks for various domains
- SimpleScalar MIPS/PISA architectural simulator
- Wattch/Cacti extensions for modeling energy consumption (inactive portions of the pipeline dissipate only 10% of normal energy under cc3 clock gating)
- VPO (Very Portable Optimizer) targeted for SimpleScalar MIPS/PISA
13. L0 Study Configuration Data
14. Execution Efficiency for L0 I-Caches
15. Energy Efficiency for L0 I-Caches
16. Decoupled Fetch Configurations
17. Execution Efficiency for Asymmetric Pipeline Bandwidth
18. Energy Efficiency for Asymmetric Pipeline Bandwidth
19. Energy-Delay² for Asymmetric Pipeline Bandwidth
20. Related Work
- L-caches subdivide the instruction cache so that one portion contains the most frequently accessed code
- Loop caches capture simple loop behaviors and replay instructions
- Zero Overhead Loop Buffers (ZOLB)
- Pipeline gating / front-end throttling: stall fetch in areas of low IPC
21. Conclusions and Future Work
- The IRF can alleviate fetch bottlenecks from L0 I-cache misses or branch mispredictions:
  - Increased IPC of the L0 system by 6.75%
  - Further decreased energy of the L0 system by 5.78%
- Decoupling fetch provides a wider spectrum of energy/performance design points to evaluate
- Future topics:
  - Can we pack areas where the L0 is likely to miss?
  - IRF + encrypted or compressed I-caches
  - IRF + asymmetric frequency clustering (of pipeline backend functional units)
22. The End
24. Energy Consumption
25. Static Code Size
26. Conclusions & Future Work
- Compiler optimizations targeted specifically for the IRF can further reduce energy (12.2%→15.8%), code size (16.8%→28.8%), and execution time
- Unique transformation opportunities exist due to the IRF, such as code duplication for code-size reduction, and predication
- As processor designs become more idiosyncratic, it is increasingly important to explore evolving existing compiler optimizations
- Register targeting and loop unrolling should also be explored with instruction packing
- Enhanced parameterization techniques
28. Instruction Redundancy
- Profiled the largest benchmark in each of the six MiBench categories
- The 32 most frequent instructions comprise 66.5% of total dynamic and 31% of total static instructions
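The redundancy measurement above is simple to express in code: from a dynamic instruction profile, compute what fraction of executed instructions the top-k entries cover. The toy trace below is invented; only the method mirrors the slide.

```python
from collections import Counter

def top_k_coverage(trace, k=32):
    """Fraction of dynamic instructions covered by the k most frequent ones."""
    top = Counter(trace).most_common(k)
    return sum(count for _, count in top) / len(trace)

# Toy trace where a few instructions dominate execution, as the MiBench
# profiles show; top-2 here stands in for the top-32 of the study.
trace = ["addu"] * 50 + ["lw"] * 30 + ["sw"] * 15 + ["jal"] * 5
cov = top_k_coverage(trace, k=2)
```

A coverage number like the slide's 66.5% is what justifies spending only 32 IRF entries: a small register file captures most of the dynamic instruction stream.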
29. Compilation Framework