Title: Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File
1. Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File
- Stephen Hines, Gary Tyson, and David Whalley
- Computer Science Dept.
- Florida State University
- June 8-16, 2007
2. Instruction Packing
- Store frequently occurring instructions, as specified by the compiler, in a small, low-power Instruction Register File (IRF)
- Allow multiple instruction fetches from the IRF by packing instruction references together
  - Tightly packed: multiple IRF references in one instruction word
  - Loosely packed: piggybacks an IRF reference onto an existing instruction
- Facilitate parameterization of some instructions using an Immediate Table (IMM)
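As a rough illustration (not code from the talk), the compile-time side of instruction packing can be sketched as follows. The instruction strings, the greedy run-based packing policy, and the pack size are illustrative assumptions; the slides only state that frequent instructions go into a 32-entry IRF and that tight packs hold multiple references.

```python
from collections import Counter

IRF_SIZE = 32
MAX_PACK = 5  # assume a tightly packed word holds up to 5 IRF references

def build_irf(trace):
    """Choose the most frequently occurring instructions for the IRF."""
    return [inst for inst, _ in Counter(trace).most_common(IRF_SIZE)]

def pack(trace, irf):
    """Greedily merge consecutive IRF-resident instructions into tight packs."""
    index = {inst: i for i, inst in enumerate(irf)}
    packed, run = [], []

    def flush():
        nonlocal run
        while run:
            chunk, run = run[:MAX_PACK], run[MAX_PACK:]
            packed.append(("pack", tuple(index[i] for i in chunk)))

    for inst in trace:
        if inst in index:
            run.append(inst)       # extend the current run of IRF hits
        else:
            flush()                # emit any pending pack
            packed.append(("plain", inst))
    flush()
    return packed

trace = ["addu r2,r3,r4", "lw r5,0(r2)", "addu r2,r3,r4", "jal foo",
         "addu r2,r3,r4", "lw r5,0(r2)"]
irf = build_irf(trace)[:2]         # top-2 stands in for a full 32-entry IRF
packed = pack(trace, irf)          # 6 instruction words shrink to 3
```

The point of the sketch is the code-size effect: runs of IRF-resident instructions collapse into single packed words, which is why densely executed regions benefit most.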
3. Execution of IRF Instructions
[Figure: pipeline diagram spanning the instruction fetch stage and the first half of the instruction decode stage — PC, IF/ID latch, IRF, and IRWP — showing packed instructions being expanded from the IRF before reaching the instruction decoder. Caption: Executing a Tightly Packed Param4c Instruction]
4. Outline
- Introduction
- IRF and Instruction Packing Overview
- Integrating an IRF with an L0 I-Cache
- Decoupling Instruction Fetch
- Experimental Evaluation
- Related Work
- Conclusions & Future Work
5. MIPS+IRF Instruction Formats
[Figure: modified MIPS instruction format diagrams]
- T-type (tightly packed): opcode, packed IRF references (inst1, inst2, inst3, ...), and parameter field (s)
- R-type: opcode (inst), rs, rt, rd, function
- I-type: opcode (inst), rs, rt, immediate
- J-type: opcode, immediate, win (IRF window)
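To make the T-type layout concrete, here is a hedged bit-field sketch. The exact widths are an assumption chosen so the fields sum to 32 bits (a 6-bit opcode, five 5-bit IRF indices, and a 1-bit parameter flag `s`); the slides name the fields but not their sizes.

```python
OPCODE_BITS, INST_BITS, NUM_SLOTS = 6, 5, 5  # assumed widths: 6 + 5*5 + 1 = 32

def encode_t_type(opcode, slots, s=0):
    """Pack an opcode, five 5-bit IRF indices, and an s bit into one word."""
    assert len(slots) == NUM_SLOTS and all(0 <= x < 32 for x in slots)
    word = opcode
    for idx in slots:
        word = (word << INST_BITS) | idx
    return (word << 1) | s

def decode_t_type(word):
    """Recover (opcode, slots, s) from a packed T-type word."""
    s = word & 1
    word >>= 1
    slots = []
    for _ in range(NUM_SLOTS):
        slots.append(word & 0x1F)
        word >>= INST_BITS
    return word, list(reversed(slots)), s

word = encode_t_type(0b011101, [3, 17, 3, 0, 29], s=1)
assert decode_t_type(word) == (0b011101, [3, 17, 3, 0, 29], 1)
```

Note how a single 32-bit word carries five IRF references, which is what lets one fetch feed several decode slots.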
6. Previous Work on the IRF
- Register Windowing & Loop Cache (MICRO 2005)
- Compiler Optimizations (CASES 2006)
  - Instruction Selection
  - Register Renaming
  - Instruction Scheduling
7. Integrating an IRF with an L0 I-Cache
- L0 (or filter) caches:
  - Small and direct-mapped
  - Fast hit time
  - Low energy per access
  - Higher miss rate than L1
- 256B L0 I-cache with 8B line size [Kin97]:
  - Fetch energy reduced by 68%
  - Cycle time increased by 46%!!!
- The IRF reduces code size, while an L0 cache focuses only on energy reduction, at the cost of performance
- The IRF can alleviate the performance penalty of L0 cache misses by overlapping fetch
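A minimal model of the 256B/8B-line configuration cited above makes the trade-off visible; the trace, the one-cycle miss penalty, and the loop addresses are illustrative assumptions, not data from the study.

```python
LINE_SIZE = 8                    # bytes per line, as in [Kin97]
NUM_LINES = 256 // LINE_SIZE     # 32 lines, direct-mapped

def simulate_l0(pc_trace, miss_penalty=1):
    """Count L0 misses over a PC trace; each miss adds a fetch stall cycle."""
    tags = [None] * NUM_LINES
    misses = 0
    for pc in pc_trace:
        line = pc // LINE_SIZE
        idx, tag = line % NUM_LINES, line // NUM_LINES
        if tags[idx] != tag:     # direct-mapped lookup: tag mismatch = miss
            tags[idx] = tag
            misses += 1
    return misses, len(pc_trace) + misses * miss_penalty  # total fetch cycles

# A tight 4-instruction loop (16 bytes) fits in the L0, so only the first
# iteration misses, once per line touched.
loop = [0x400, 0x404, 0x408, 0x40C] * 100
misses, cycles = simulate_l0(loop)
```

Loops that fit behave well; the 46% cycle-time penalty and the misses on less regular code are what the overlapping IRF fetch is meant to hide.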
8. L0 Cache Miss Penalty
[Figure: 9-cycle pipeline timing diagram (IF, ID, EX, M, WB) for Insn1 through Insn4, showing an L0 cache miss stalling fetch and delaying the subsequent instructions]
9. Overlapping Fetch with an IRF
[Figure: the same pipeline timing diagram with a tightly packed pair Pack2a/Pack2b fetched together in a single cycle (IFab); the second packed instruction decodes and issues from the IRF while the L0 miss is serviced, so Insn3 still completes by cycle 9]
10. Decoupling Instruction Fetch
- Instruction bandwidth in a pipeline is usually uniform (fetch, decode, issue, commit, ...)
  - This artificially limits the effective design space
- Front-end throttling improves energy utilization by reducing fetch bandwidth in areas of low ILP
- The IRF can provide virtual front-end throttling:
  - Fetch fewer instructions each cycle, but allow multiple issue of packed instructions
  - Areas of high ILP are often densely packed
  - Infrequently executed sections of code have lower ILP
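The throttling idea above can be sketched with a toy model: fetch bandwidth stays fixed at one word per cycle, but packed words expand to several operations at decode. The ops-per-word counts below are invented for illustration.

```python
def effective_issue(fetched_words):
    """fetched_words: ops delivered by each fetched word
    (1 for a plain instruction, up to 5 for a tight pack).
    Returns mean operations delivered per fetch cycle."""
    cycles = len(fetched_words)   # one word fetched per cycle
    ops = sum(fetched_words)
    return ops / cycles

hot_loop  = [5, 4, 5, 3]   # densely packed, high-ILP region
cold_code = [1, 1, 2, 1]   # infrequently executed, mostly unpacked
```

With a narrow fetch port, the hot region still sustains a wide effective issue rate while the cold region naturally throttles itself, which is the "virtual front-end throttling" claim in code form.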
11. Out-of-Order Pipeline Configurations
12. Experimental Evaluation
- MiBench embedded benchmark suite: 6 categories representing common tasks for various domains
- SimpleScalar MIPS/PISA architectural simulator
- Wattch/Cacti extensions for modeling energy consumption (inactive portions of the pipeline dissipate only 10% of normal energy under cc3 clock gating)
- VPO (Very Portable Optimizer) targeted for SimpleScalar MIPS/PISA
13. L0 Study Configuration Data
14. Execution Efficiency for L0 I-Caches
15. Energy Efficiency for L0 I-Caches
16. Decoupled Fetch Configurations
17. Execution Efficiency for Asymmetric Pipeline Bandwidth
18. Energy Efficiency for Asymmetric Pipeline Bandwidth
19. Energy-Delay² for Asymmetric Pipeline Bandwidth
20. Related Work
- L-caches subdivide the instruction cache so that one portion contains the most frequently accessed code
- Loop caches capture simple loop behaviors and replay instructions
- Zero Overhead Loop Buffers (ZOLB)
- Pipeline gating / front-end throttling: stall fetch in areas of low IPC
21. Conclusions and Future Work
- The IRF can alleviate fetch bottlenecks from L0 I-cache misses or branch mispredictions:
  - Increased IPC of the L0 system by 6.75%
  - Further decreased energy of the L0 system by 5.78%
- Decoupling fetch provides a wider spectrum of energy/performance design points to evaluate
- Future topics:
  - Can we pack areas where the L0 is likely to miss?
  - IRF + encrypted or compressed I-caches
  - IRF + asymmetric frequency clustering (of pipeline backend functional units)
22. The End
24. Energy Consumption
25. Static Code Size
26. Conclusions & Future Work
- Compiler optimizations targeted specifically for the IRF can further reduce energy (12.2%→15.8%), code size (16.8%→28.8%), and execution time
- Unique transformation opportunities exist due to the IRF, such as code duplication for code-size reduction, and predication
- As processor designs become more idiosyncratic, it is increasingly important to explore evolving existing compiler optimizations
- Register targeting and loop unrolling should also be explored with instruction packing
- Enhanced parameterization techniques
28. Instruction Redundancy
- Profiled the largest benchmark in each of the six MiBench categories
- The 32 most frequent instructions comprise 66.5% of total dynamic and 31% of total static instructions
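The redundancy measurement above is simple to express in code: from a dynamic instruction profile, compute what fraction of executed instructions the top-k entries cover. The toy trace below is invented; only the method mirrors the slide.

```python
from collections import Counter

def top_k_coverage(trace, k=32):
    """Fraction of dynamic instructions covered by the k most frequent ones."""
    top = Counter(trace).most_common(k)
    return sum(count for _, count in top) / len(trace)

# Toy trace where a few instructions dominate execution, as the MiBench
# profiles show; top-2 here stands in for the top-32 of the study.
trace = ["addu"] * 50 + ["lw"] * 30 + ["sw"] * 15 + ["jal"] * 5
cov = top_k_coverage(trace, k=2)
```

A coverage number like the slide's 66.5% is what justifies spending only 32 IRF entries: a small register file captures most of the dynamic instruction stream.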
29. Compilation Framework