Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File

1
Addressing Instruction Fetch Bottlenecks
by Using an Instruction Register File
  • Stephen Hines, Gary Tyson, and David Whalley
  • Computer Science Dept.
  • Florida State University
  • June 8-16, 2007

2
Instruction Packing
  • Store frequently occurring instructions as
    specified by the compiler in a small, low-power
    Instruction Register File (IRF)
  • Allow multiple instruction fetches from the IRF
    by packing instruction references together
  • Tight packing encodes multiple IRF references in
    a single instruction word
  • Loose packing piggybacks an IRF reference onto
    an existing instruction
  • Facilitate parameterization of some instructions
    using an Immediate Table (IMM)
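As a toy sketch of the idea above (not the paper's actual compiler pass), the following Python selects the most frequently executed instructions for a 32-entry IRF and rewrites a trace so that IRF-resident instructions become small indices; the trace and all names are illustrative.

```python
from collections import Counter

IRF_SIZE = 32  # entries in the hypothetical Instruction Register File

def select_irf_entries(dynamic_trace, size=IRF_SIZE):
    """Pick the most frequently executed instructions for the IRF."""
    return [inst for inst, _ in Counter(dynamic_trace).most_common(size)]

def pack(instructions, irf):
    """Replace IRF-resident instructions with small indices; the rest
    remain ordinary memory-resident instructions."""
    index = {inst: i for i, inst in enumerate(irf)}
    return [("irf", index[i]) if i in index else ("mem", i)
            for i in instructions]

trace = ["addu r2,r3,r4", "lw r5,0(r2)", "addu r2,r3,r4", "nop"]
irf = select_irf_entries(trace)
packed = pack(trace, irf)
```

In a real compiler the selection would be profile-driven across whole programs, and parameterized variants of an instruction could share one IRF slot via the IMM table.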

3
Execution of IRF Instructions
[Figure: datapath spanning the instruction fetch stage and the first half of the instruction decode stage — the PC drives fetch, the packed instruction is held in the IF/ID latch, and IRF references (selected via the IRWP window pointer) are sent to the instruction decoder. Caption: Executing a Tightly Packed Param4c Instruction.]
4
Outline
  • Introduction
  • IRF and Instruction Packing Overview
  • Integrating an IRF with an L0 I-Cache
  • Decoupling Instruction Fetch
  • Experimental Evaluation
  • Related Work
  • Conclusions and Future Work

5
MIPS+IRF Instruction Formats
[Figure: bit-field diagrams of the MIPS+IRF instruction formats — T-type (packed IRF reference fields inst1, inst2, inst3, ... plus an s bit), R-type (inst, rt, rd, function), I-type (inst, rs, rt, immediate), J-type (immediate, win for the IRF window).]
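The packed format can be illustrated with a small encoder/decoder. The bit layout below (a 6-bit opcode, five 5-bit IRF index fields, and one trailing s bit in a 32-bit word) is an assumed T-type-style layout for illustration, not necessarily the paper's exact encoding.

```python
FIELD_SHIFTS = (21, 16, 11, 6, 1)  # five 5-bit IRF index fields

def encode_t_type(opcode, fields, s_bit=0):
    """Pack an opcode, five IRF indices, and an s bit into a 32-bit word."""
    word = (opcode & 0x3F) << 26
    for shift, f in zip(FIELD_SHIFTS, fields):
        word |= (f & 0x1F) << shift
    return word | (s_bit & 0x1)

def decode_t_type(word):
    """Recover the opcode, the five IRF indices, and the s bit."""
    opcode = (word >> 26) & 0x3F
    fields = [(word >> shift) & 0x1F for shift in FIELD_SHIFTS]
    s_bit = word & 0x1
    return opcode, fields, s_bit

# One fetched word can thus reference several IRF-resident instructions.
w = encode_t_type(0x2A, [3, 7, 0, 0, 0], s_bit=1)
```

Decode of such a word yields multiple IRF indices, which is what lets a single fetch feed several instructions to the decoder.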
6
Previous Work in IRF
  • Register Windowing and Loop Cache (MICRO 2005)
  • Compiler Optimizations (CASES 2006)
  • Instruction Selection
  • Register Renaming
  • Instruction Scheduling

7
Integrating an IRF with an L0 I-Cache
  • L0 or Filter Caches
  • Small and direct-mapped
  • Fast hit time
  • Low energy per access
  • Higher miss rate than L1
  • 256B L0 I-cache, 8B line size [Kin97]
  • Fetch energy reduced 68%
  • Cycle time increased 46%!!!
  • The IRF also reduces code size, while an L0 cache
    focuses only on energy reduction, at the cost of
    performance
  • The IRF can alleviate the performance penalty of
    L0 cache misses by overlapping fetch

8
L0 Cache Miss Penalty
[Figure: pipeline diagram over cycles 1–9 — an L0 cache miss adds a stall cycle to instruction fetch, so Insn2, Insn3, and Insn4 each pass through the IF, ID, EX, M, and WB stages one cycle later than in the hit case.]
9
Overlapping Fetch with an IRF
[Figure: pipeline diagram over cycles 1–9 — the two instructions of a pack (Pack2a, Pack2b) are fetched together (IFab) and continue to decode and issue from the IRF while the L0 miss is serviced, so the miss latency is overlapped and Insn3 is not delayed.]
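The overlap in the figure can be sketched as a tiny cycle counter (an assumed simplification of the mechanism, not a real pipeline model): IRF references left over from a previously fetched pack keep decode busy and hide L0 miss stall cycles.

```python
def fetch_cycles(stream, penalty=1):
    """stream: list of (l0_miss, refs) per fetched word, where refs is
    the number of IRF references packed into that word."""
    cycles = 0
    leftover = 0  # IRF references from the prior pack still feeding decode
    for is_miss, refs in stream:
        stall = penalty if is_miss else 0
        hidden = min(stall, leftover)   # decode drains leftover IRF refs
        cycles += 1 + stall - hidden
        leftover = max(refs - 1, 0)
    return cycles

# With a 2-wide pack fetched before the miss, the stall cycle is hidden.
with_irf = fetch_cycles([(False, 2), (True, 1), (False, 1)])
without = fetch_cycles([(False, 1), (True, 1), (False, 1)])
```

The unpacked stream pays the full miss penalty; the packed one hides it, mirroring the two pipeline diagrams above.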
10
Decoupling Instruction Fetch
  • Instruction bandwidth in a pipeline is usually
    uniform (fetch, decode, issue, commit, etc.)
  • Artificially limits the effective design space
  • Front-end throttling improves energy utilization
    by reducing the fetch bandwidth in areas of low
    ILP
  • IRF can provide virtual front-end throttling
  • Fetch fewer instructions every cycle, but allow
    multiple issue of packed instructions
  • Areas of high ILP are often densely packed
  • Lower ILP for infrequently executed sections of
    code
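A minimal sketch of this "virtual throttling" effect, under the assumption of one fetched word per cycle: a packed word can expand into several instructions at issue, capped at the machine's issue width, so densely packed (often high-ILP) regions issue wider with no extra fetch bandwidth.

```python
def issued_per_cycle(packed_stream, issue_width=2):
    """packed_stream: number of IRF references packed into each fetched
    word; returns instructions issued per cycle, capped at issue width."""
    return [min(max(refs, 1), issue_width) for refs in packed_stream]

# Dense packs widen issue; unpacked regions naturally throttle to 1.
widths = issued_per_cycle([1, 3, 2, 1], issue_width=2)
```

Fetch stays narrow everywhere; only the regions the compiler packed densely, which tend to be the hot, high-ILP ones, use the full issue width.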

11
Out-of-order Pipeline Configurations
12
Experimental Evaluation
  • MiBench embedded benchmark suite — 6 categories
    representing common tasks for various domains
  • SimpleScalar MIPS/PISA architectural simulator
  • Wattch/Cacti extensions for modeling energy
    consumption (inactive portions of the pipeline
    dissipate only 10% of normal energy under cc3
    clock gating)
  • VPO (Very Portable Optimizer) targeted for
    SimpleScalar MIPS/PISA

13
L0 Study Configuration Data
14
Execution Efficiency for L0 I-Caches
15
Energy Efficiency for L0 I-Caches
16
Decoupled Fetch Configurations
17
Execution Efficiency for Asymmetric Pipeline
Bandwidth
18
Energy Efficiency for Asymmetric Pipeline
Bandwidth
19
Energy-Delay2 for Asymmetric Pipeline Bandwidth
20
Related Work
  • L-caches subdivide instruction cache, such that
    one portion contains the most frequently accessed
    code
  • Loop Caches capture simple loop behaviors and
    replay instructions
  • Zero Overhead Loop Buffers (ZOLB)
  • Pipeline gating / front-end throttling stalls
    fetch in regions of low IPC

21
Conclusions and Future Work
  • Future Topics
  • Can we pack areas where L0 is likely to miss?
  • IRF + encrypted or compressed I-caches
  • IRF + asymmetric frequency clustering (of
    pipeline back-end functional units)
  • IRF can alleviate fetch bottlenecks from L0
    I-Cache misses or branch mispredictions
  • Increased IPC of the L0 system by 6.75%
  • Further decreased energy of the L0 system by 5.78%
  • Decoupling fetch provides a wider spectrum of
    design points to be evaluated (energy/performance)

22
The End
  • Questions ???

23
(No Transcript)
24
Energy Consumption
25
Static Code Size
26
Conclusions and Future Work
  • Compiler optimizations targeted specifically for
    the IRF can further reduce energy (12.2%–15.8%),
    code size (16.8%–28.8%), and execution time
  • Unique transformation opportunities exist due to
    IRF, such as code duplication for code size
    reduction and predication
  • As processor designs become more idiosyncratic,
    it is increasingly important to explore the
    possibility of evolving existing compiler
    optimizations
  • Register targeting and loop unrolling should also
    be explored with instruction packing
  • Enhanced parameterization techniques

27
(No Transcript)
28
Instruction Redundancy
  • Profiled largest benchmark in each of six MiBench
    categories
  • The most frequent 32 instructions comprise 66.5%
    of total dynamic and 31% of total static
    instructions
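The measurement above can be reproduced in miniature (an assumed methodology sketch, with a made-up trace): count dynamic instruction frequencies and compute what fraction of the trace the top-n instructions cover.

```python
from collections import Counter

def top_n_coverage(trace, n=32):
    """Fraction of dynamic instructions covered by the n most frequent."""
    counts = Counter(trace)
    covered = sum(c for _, c in counts.most_common(n))
    return covered / len(trace)

# Made-up trace: a few instructions dominate, as in real hot loops.
trace = ["nop"] * 6 + ["addu"] * 3 + ["lw"]
coverage = top_n_coverage(trace, n=2)
```

High coverage by a handful of instructions is exactly what makes a small IRF effective.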

29
Compilation Framework
30
(No Transcript)