Compiler Managed Dynamic Instruction Placement In A Low-Power Code Cache

1
Compiler Managed Dynamic Instruction Placement In
A Low-Power Code Cache
  • Rajiv Ravindran, Pracheeti Nagarkar, Ganesh
    Dasika, Robert Senger, Eric Marsman,
    Scott Mahlke, Richard Brown
  • Department of Electrical Engineering and
    Computer Science
  • University of Michigan, Ann Arbor

2
Introduction
  • Instruction fetch power dominant in low-power
    embedded processors
  • 27% for the StrongARM
  • 50% for the Motorola MCORE
  • Two alternatives
  • Scratch-pad: no hardware overhead; part of the
    physical address space; but managed in software
  • Instruction cache: hardware managed; transparent
    to the user; but needs power-hungry tag-checking
    and comparison logic
3
Focus Of This Work
  • Explore the use of scratch-pad for reducing
    instruction fetch power
  • Two possible software managed schemes
  • Static
  • Map hot regions prior to execution
  • Contents do not change during execution
  • Dynamic
  • Allow contents to change during execution
  • Explicit copying of hot regions

4
Scratch-pad Management: Static Approach
[Figure: control flow graph of basic blocks BB1 to BB14 with three traces, T1, T2, and T3, marked]
5
Scratch-pad Management: Static Approach
[Figure: trace T1 highlighted]
profit = size × freq
6
Scratch-pad Management: Static Approach
Equivalent to bin-packing
[Figure: a 96-byte scratch-pad holding T1 (64 bytes) and T2 (32 bytes)]
profit = size × freq
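The static selection can be sketched as a small 0/1 knapsack problem (the slide calls it bin-packing): pick the set of traces that fits in the scratch-pad and maximizes total profit, reading the slide's formula as profit = size × freq. This is only an illustrative sketch, not the authors' implementation. The function name, the execution frequencies, and the 32-byte size of T3 are assumptions; only the 96-byte capacity and the sizes of T1 and T2 come from the slides.

```python
def select_static(traces, capacity):
    """Choose traces to pin in the scratch-pad for the whole run.

    traces: name -> (size_in_bytes, execution_frequency)
    Returns (total_profit, chosen_names), with profit = size * freq.
    """
    # best[c] = best (profit, chosen) using at most c bytes
    best = [(0, ())] * (capacity + 1)
    for name, (size, freq) in traces.items():
        profit = size * freq
        # Walk capacities downward so each trace is used at most once.
        for c in range(capacity, size - 1, -1):
            candidate = best[c - size][0] + profit
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - size][1] + (name,))
    return best[capacity]


# Hypothetical frequencies; T3's size is also an assumption.
example_traces = {"T1": (64, 100), "T2": (32, 500), "T3": (32, 50)}
static_choice = select_static(example_traces, capacity=96)
```

With these made-up frequencies the sketch selects T1 and T2, which fill the 96 bytes exactly, and T3 stays in main memory for the entire execution.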
7
Scratch-pad Management: Dynamic Approach
[Figure: control flow graph (BB1 to BB14, traces T1 to T3) beside a plot of scratch-pad space against time]
Scratch-pad size: 96 bytes, drawn as a 64-byte region and a 32-byte region
Timeline: copy T1 (64-byte region)
8
Scratch-pad Management: Dynamic Approach
[Figure: plot of scratch-pad space against time, with T1 in the 64-byte region and T2 in the 32-byte region]
Scratch-pad size: 96 bytes
Timeline: copy T1, copy T2
9
Scratch-pad Management: Dynamic Approach
[Figure: plot of scratch-pad space against time, with T1 in the 64-byte region and T2 followed by T3 in the 32-byte region]
Scratch-pad size: 96 bytes
Timeline: copy T1, copy T2, copy T3 over T2
10
Scratch-pad Management: Dynamic Approach
[Figure: plot of scratch-pad space against time, with T1 in the 64-byte region and T2, T3, T2 in turn in the 32-byte region]
Scratch-pad size: 96 bytes
Timeline: copy T1, copy T2, copy T3 over T2, copy T2 over T3
11
Scratch-pad Management: Dynamic Approach
[Figure: control flow graph (BB1 to BB14, traces T1 to T3) with four copy points, beside the plot of scratch-pad space against time]
Copy points: Copy1 for T1, Copy2 for T2, Copy3 for T2, Copy4 for T3
Scratch-pad size: 96 bytes (T1 in the 64-byte region; T2 and T3 alternate in the 32-byte region)
Timeline: copy T1 (copy1), copy T2 (copy2), copy T3 over T2 (copy4), copy T2 over T3 (copy3), copy T3 over T2 (copy4)
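The timeline on these slides can be replayed with a few lines of code: each trace is assigned a scratch-pad region, and a copy is needed whenever the trace about to run is not the one currently held in its region. This is a minimal sketch for illustration only. The function name, the region labels, and the execution order (T1, T2, T3, T2, T3, read off the timeline) are assumptions.

```python
def replay_copies(execution_order, region_of):
    """Return the copies needed to run the traces in the given order.

    region_of maps each trace to its scratch-pad region; traces that
    share a region overwrite each other. Each result entry is
    (copied_trace, evicted_trace_or_None).
    """
    resident = {}  # region -> trace currently held there
    copies = []
    for trace in execution_order:
        region = region_of[trace]
        if resident.get(region) != trace:
            copies.append((trace, resident.get(region)))
            resident[region] = trace
    return copies


# T1 owns the 64-byte region; T2 and T3 share the 32-byte region.
region_of = {"T1": "64b", "T2": "32b", "T3": "32b"}
timeline = replay_copies(["T1", "T2", "T3", "T2", "T3"], region_of)
```

The result lists five copies: T1, T2, T3 over T2, T2 over T3, and T3 over T2, matching the sequence shown on the slide.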
12
Objectives Of This Work
  • Develop a dynamic, compiler-managed scheme to
    exploit the scratch-pad
  • Prior work: Verma et al., '04
  • ILP-based solution
  • Not scalable
  • Limits scope of analysis to a single procedure
    or loop nests
  • Practical solution
  • Scalable
  • Handles arbitrary control flow graphs
  • Inter-procedural analysis

13
Our Approach
  • Two phases
  • Trace selection and scratch-pad (SP) allocation
  • Identify frequently executed traces
  • Select the most energy beneficial traces
  • Place them with possible overlap to reduce copy
    overhead
  • Copy placement
  • Insert copies to realize the placement
  • Hoist within the control flow graph to minimize
    overhead
  • Fix branch targets into selected traces

14
SP Allocation: Computing Energy Gain
Benefit: energy savings when the trace is executed from scratch-pad instead of memory
CopyCost: overhead associated with copying the trace once
\[ \text{Benefit} = \text{ProfileWeight} \times \text{Size} \times \Delta\text{FetchEnergy} \]
\[ \text{CopyCost} = \text{Size} \times (\text{FetchEnergy} + \text{WriteEnergy}) \]
\[ \text{Energy Gain} = \text{Benefit} - \text{CopyCost} \]
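As a quick worked example, the energy gain of one trace can be computed directly, reading the slide's formulas as Benefit = ProfileWeight × Size × ΔFetchEnergy and CopyCost = Size × (FetchEnergy + WriteEnergy). The function name and all numeric values below are hypothetical and in arbitrary energy units; the slides do not give the per-fetch or per-write energies.

```python
def energy_gain(profile_weight, size, delta_fetch_energy,
                fetch_energy, write_energy):
    """Energy gain of placing one trace in the scratch-pad, copied once."""
    # Savings from fetching the trace out of the scratch-pad instead of memory.
    benefit = profile_weight * size * delta_fetch_energy
    # Copying reads every unit from memory and writes it to the scratch-pad.
    copy_cost = size * (fetch_energy + write_energy)
    return benefit - copy_cost


# Hypothetical values: a 64-unit trace executed 100 times.
gain = energy_gain(profile_weight=100, size=64, delta_fetch_energy=1,
                   fetch_energy=2, write_energy=1)
```

Here the benefit is 6400 and the one-time copy cost is 192, so the gain is 6208. A rarely executed trace can come out negative, in which case copying it would cost more than it saves.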
15
SP Allocation: Placing Traces
Execution order: T1 T1 T2 T2 T2 T1 T1 T2 T2 T2 T3 T1 T1 T2 T2 T2 T3
[Figure: the execution order annotated with an initial copy and two recopies each for T1 and T2]
\[ \text{Dynamic Copy Cost} = (\text{number of copies of } T1) \times \text{CopyCost}(T1) + (\text{number of copies of } T2) \times \text{CopyCost}(T2) \]
16
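Temporal Relationship Graph (Gloy et al., '97)
Execution order: T1 T1 T2 T2 T2 T1 T1 T2 T2 T2 T3 T1 T1 T2 T2 T2 T3
[Figure: the execution order annotated with two copies of T1 and two copies of T2; a graph with nodes T1, T2, and T3]
\[ 2 \times \text{CopyCost}(T1) + 2 \times \text{CopyCost}(T2) \]
Edge weights between two nodes denote the Dynamic Copy Cost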
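One way to derive such edge weights from an execution order is to count, for every pair of traces, how often one would have to be recopied because the other ran since its last execution (assuming the two share scratch-pad space). The sketch below is an illustration under that assumption, not the construction from Gloy et al. or from the authors' compiler. The function names and copy costs are hypothetical; the execution order is the one shown on the slide.

```python
from collections import defaultdict


def recopy_counts(execution_order):
    """counts[(a, b)]: times a is re-run after b has run since a's last run."""
    counts = defaultdict(int)
    seen_since = {}  # trace -> other traces executed since its last run
    for trace in execution_order:
        for other in seen_since.get(trace, ()):
            counts[(trace, other)] += 1
        seen_since[trace] = set()
        for name, seen in seen_since.items():
            if name != trace:
                seen.add(trace)
    return counts


def edge_weight(counts, copy_cost, a, b):
    """Dynamic copy cost if traces a and b were to overlap."""
    return counts[(a, b)] * copy_cost[a] + counts[(b, a)] * copy_cost[b]


order = "T1 T1 T2 T2 T2 T1 T1 T2 T2 T2 T3 T1 T1 T2 T2 T2 T3".split()
counts = recopy_counts(order)
copy_cost = {"T1": 10, "T2": 5, "T3": 5}  # hypothetical costs
weight_t1_t2 = edge_weight(counts, copy_cost, "T1", "T2")
```

For this order T1 is recopied twice because of T2 and T2 twice because of T1, which reproduces the 2 × CopyCost(T1) + 2 × CopyCost(T2) weight on the slide (30 with the made-up costs).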
17
SP Allocation: Placing Traces
Energy Gain: T1 → 3104 nJ, T2 → 15952 nJ, T3 → 752 nJ
[Figure: 96-byte scratch-pad with T2 placed]
18
SP Allocation: Placing Traces
Energy Gain: T1 → 3104 nJ, T2 → 15952 nJ, T3 → 752 nJ
[Figure: 96-byte scratch-pad with T2 and T1 placed]
19
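SP Allocation: Placing Traces
[Figure: 96-byte scratch-pad holding T2 and T1, beside the temporal relationship graph over T1, T2, and T3 with edge weights 432 nJ, 96 nJ, and 144 nJ]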
20
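SP Allocation: Placing Traces
[Figure: 96-byte scratch-pad in which T3 shares a region with T2 (labelled "T2, T3") and T1 occupies the rest, beside the temporal relationship graph with edge weights 432 nJ, 96 nJ, and 144 nJ]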
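The placement steps on these slides can be approximated by a greedy loop: take traces in decreasing order of energy gain, give each one free space while any remains, and otherwise overlap it with the already-placed region that has the lowest temporal-relationship-graph cost, provided the gain still outweighs that cost. This is a simplified sketch, not the authors' algorithm. The energy gains and the sizes of T1 and T2 come from the slides; the 32-byte size of T3 and the assignment of the three edge weights to particular edges are assumptions.

```python
def place_traces(traces, trg, sp_size):
    """Greedy scratch-pad placement with overlap.

    traces: name -> (size_in_bytes, energy_gain)
    trg:    frozenset({a, b}) -> dynamic copy cost if a and b overlap
    Returns a list of regions: {"offset", "size", "traces"}.
    """
    regions = []
    free_offset = 0
    by_gain = sorted(traces.items(), key=lambda kv: kv[1][1], reverse=True)
    for name, (size, gain) in by_gain:
        if free_offset + size <= sp_size:
            regions.append({"offset": free_offset, "size": size, "traces": [name]})
            free_offset += size
            continue
        # No free space left: find the cheapest region to share.
        best, best_cost = None, None
        for region in regions:
            if region["size"] < size:
                continue
            cost = sum(trg.get(frozenset((name, other)), 0)
                       for other in region["traces"])
            if best_cost is None or cost < best_cost:
                best, best_cost = region, cost
        if best is not None and gain > best_cost:
            best["traces"].append(name)
    return regions


traces = {"T1": (64, 3104), "T2": (32, 15952), "T3": (32, 752)}  # T3 size assumed
trg = {  # which weight belongs to which edge is assumed
    frozenset(("T1", "T2")): 432,
    frozenset(("T2", "T3")): 96,
    frozenset(("T1", "T3")): 144,
}
regions = place_traces(traces, trg, sp_size=96)
```

Under these assumptions T2 and T1 fill the 96 bytes, and T3 ends up sharing T2's region because that edge is the cheaper of its two options, which matches the "T2, T3" label on the final slide of this sequence.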
21
Copy Placement
  • Initially, naively place copies at trace entry
    points
  • Guarantees correct but inefficient execution
  • Reduce the copy overhead
  • Identify frequently executed copies
  • Iteratively hoist copies to less frequently
    executed blocks
  • Remove redundant copies
  • Ensure that the hoists and removals are legal
  • Traces must be present in the scratch-pad prior
    to execution
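One plausible way to find redundant copies is a forward "must be resident" dataflow analysis: a copy of a trace is redundant if the trace is guaranteed to be in the scratch-pad on every path reaching that copy. The sketch below illustrates that idea on a tiny made-up control flow graph; it is an assumption about how such a check could work, not the analysis used by the authors. All names, the graph, and the conflict sets are hypothetical.

```python
def resident_on_entry(cfg, entry, copies, conflicts, all_traces):
    """For each block, the traces guaranteed resident when it is entered."""
    preds = {block: [] for block in cfg}
    for block, succs in cfg.items():
        for succ in succs:
            preds[succ].append(block)

    def transfer(block, resident):
        out = set(resident)
        for trace in copies.get(block, []):
            out -= conflicts.get(trace, set())  # overlapping traces are evicted
            out.add(trace)
        return out

    in_sets = {block: set(all_traces) for block in cfg}  # optimistic start
    in_sets[entry] = set()                               # scratch-pad starts empty
    changed = True
    while changed:
        changed = False
        for block in cfg:
            if block == entry:
                continue
            outs = [transfer(p, in_sets[p]) for p in preds[block]]
            new_in = set.intersection(*outs) if outs else set()
            if new_in != in_sets[block]:
                in_sets[block], changed = new_in, True
    return in_sets


def redundant_copies(cfg, entry, copies, conflicts, all_traces):
    in_sets = resident_on_entry(cfg, entry, copies, conflicts, all_traces)
    redundant = set()
    for block, block_copies in copies.items():
        resident = set(in_sets[block])
        for trace in block_copies:
            if trace in resident:
                redundant.add((block, trace))
            else:
                resident -= conflicts.get(trace, set())
                resident.add(trace)
    return redundant


# Hypothetical loop: T1 is copied before the loop and again in the loop body.
loop_cfg = {"entry": ["A"], "A": ["B"], "B": ["B", "C"], "C": []}
loop_redundant = redundant_copies(
    loop_cfg, "entry", {"A": ["T1"], "B": ["T1"]}, {}, {"T1"})

# Same idea, but a block on the back edge copies T2, which overlaps T1.
clash_cfg = {"entry": ["A"], "A": ["B"], "B": ["C"], "C": ["B"]}
clash_redundant = redundant_copies(
    clash_cfg, "entry", {"A": ["T1"], "B": ["T1"], "C": ["T2"]},
    {"T1": {"T2"}, "T2": {"T1"}}, {"T1", "T2"})
```

In the first graph the copy inside the loop is flagged as redundant because T1 is already resident on both incoming edges. In the second, T2 evicts T1 on the back edge, so no copy can be removed.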

22
Copy Placement: Initial Placement
[Figure: control flow graph (BB1 to BB14, traces T1 to T3) with copies at the trace entry points: C1-T1, C2-T1, C3-T1, C1-T2, C3-T2, C1-T3, C2-T3]
23
Copy Placement: Redundant Copies
[Figure: the control flow graph with copies C1-T1, C2-T1, C3-T1, C1-T2, C3-T2, C1-T3, C2-T3 and with annotations "T1" and "T2, T3" next to the blocks]
24
Copy Placement: Hoisting
[Figure: control flow graph with copies C1-T1, C1-T2, and C1-T3]
Live-Range: T1 → BB4, BB6, BB7; T2 → BB9, BB10; T3 → BB12
25
Copy Placement: Hoisting
Live-Range of T2 before hoist
[Figure: control flow graph with annotations "T1" and "T2, T3" next to the blocks and copies C1-T2 and C1-T3]
26
Copy Placement: Hoisting
Live-Range of T2 after hoist → legal
[Figure: the control flow graph with annotations "T1" and "T2, T3" next to the blocks; C1-T2 now appears near the top of the graph, and C1-T3 is unchanged]
27
Experimental Setup
  • Trimaran compiler framework
  • Measured instruction fetch power
  • Varied scratch-pad size from 32 bytes to 4 Kbytes
  • Two configurations
  • WIMS microcontroller at the Univ. of Michigan
  • On-chip memory and scratch-pad
  • Static vs dynamic schemes
  • PowerMill
  • Conventional processor
  • Off-chip memory, on-chip scratch-pad vs on-chip
    I-cache
  • CACTI model
  • Scratch-pad vs I-cache
  • DMA copying
  • 2 bytes per cycle, stalling
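The DMA figure above gives a simple way to estimate the stall caused by one copy. The sketch below assumes the processor stalls for the whole transfer and that there is no setup overhead; both are assumptions, as is the function name. Only the rate of 2 bytes per cycle comes from the slide.

```python
DMA_BYTES_PER_CYCLE = 2  # from the slide: DMA copies 2 bytes per cycle


def copy_stall_cycles(trace_size_bytes):
    """Cycles the processor stalls while one trace is copied by DMA."""
    # Ceiling division: a final odd byte still takes a full cycle.
    return -(-trace_size_bytes // DMA_BYTES_PER_CYCLE)


stall_64 = copy_stall_cycles(64)
```

Under these assumptions a 64-byte trace stalls the processor for 32 cycles per copy, which is why reducing the number of dynamic copies matters as much as choosing the right traces.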

28
Energy Savings: Static vs Dynamic
[Figure: bar chart of WIMS energy savings with a 64-byte scratch-pad; y-axis: Energy Improvement (%), 0 to 60; series: dynamic, static; benchmarks: fir, sha, epic, unepic, cjpeg, djpeg, blowfish, rawcaudio, rawdaudio, mpeg2enc, mpeg2dec, pegwitenc, pegwitdec, pgpencode, pgpdecode, gsmencode, gsmdecode, g721encode, g721decode, plus the average]
Average savings for Dynamic: 28%
Average savings for Static: 17%
29
Effect of Varying Scratch-pad Size
[Figure: plots for pegwitenc against SP size (32 to 4096 bytes); axes: Energy Savings (0 to 35) and Hit Rate (0 to 100); series: Static Energy, Dynamic Energy, Static Hit Rate, Dynamic Hit Rate]
30
Scratch-pad Size For 95% Hit Rate
[Figure: bar chart per benchmark, plus the average; y-axis: Size (bytes), 0 to 9000; series: static, dynamic]
Dynamic is 2.5x better than static
31
Energy Savings: SP vs I-Cache
[Figure: bar chart of CACTI energy savings with a 64-byte scratch-pad or I-cache; y-axis: Energy Improvement (%), -60 to 120; series: dynamic, static, I-cache; per-benchmark bars plus the average]
Average savings for Dynamic: 48%
Average savings for Static: 25%
Average savings for I-cache: 30%
32
Conclusions
  • Compiler directed dynamic placement in
    scratch-pad
  • Arbitrary control flow graph
  • Inter-procedural
  • Two phases: SP allocation and copy placement
  • 28% savings for dynamic as compared to 16% for
    static for a 64-byte scratch-pad
  • 41% savings for dynamic as compared to 31% for
    static for a 256-byte scratch-pad
  • 2% to 10% stall cycles
  • Within 0% to 11% of optimal, but scalable

33
For more information: http://cccp.eecs.umich.edu
Thank You!