Title: Compiler Managed Dynamic Instruction Placement In A Low-Power Code Cache
1Compiler Managed Dynamic Instruction Placement In A Low-Power Code Cache
- Rajiv Ravindran, Pracheeti Nagarkar, Ganesh Dasika, Robert Senger, Eric Marsman, Scott Mahlke, Richard Brown
- Department of Electrical Engineering and Computer Science
- University of Michigan, Ann Arbor
2Introduction
- Instruction fetch power is dominant in low-power embedded processors
- 27% for the StrongARM
- 50% for the Motorola MCORE
- Two alternatives
- Scratch-pad: no hardware overhead; part of the physical address space; managed in software
- Instruction-cache: hardware managed; transparent to the user; power-hungry tag-checking and comparison logic
3Focus Of This Work
- Explore the use of scratch-pad for reducing instruction fetch power
- Two possible software managed schemes
- Static
- Map hot regions prior to execution
- Contents do not change during execution
- Dynamic
- Allow contents to change during execution
- Explicit copying of hot regions
4Scratch-pad Management Static Approach
[Figure: control flow graph with basic blocks BB1 to BB14; three hot traces, T1, T2, and T3, are marked.]
5Scratch-pad Management Static Approach
[Figure: trace T1 highlighted in the control flow graph.]
profit = size × freq
6Scratch-pad Management Static Approach
- Equivalent to bin-packing
- profit = size × freq
[Figure: a 96-byte scratch-pad holding T1 (64 bytes) and T2 (32 bytes).]
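The static selection on this slide can be sketched as a 0/1 knapsack over the traces. This is a minimal illustration, not the authors' implementation: it assumes profit is size times execution frequency, the function name `select_static` is hypothetical, and the frequencies and the size of T3 in the usage note are made-up values (only the 64-byte T1, the 32-byte T2, and the 96-byte scratch-pad come from the slides).

```python
from typing import Dict, List, Tuple


def select_static(traces: Dict[str, Tuple[int, int]], capacity: int) -> List[str]:
    """Pick the traces to pin in the scratch-pad for the whole run.

    traces maps a trace name to (size_in_bytes, execution_frequency).
    Returns the chosen names, sorted, maximising the summed profit.
    """
    # best[c] = (total profit, chosen names) using at most c bytes.
    best: List[Tuple[int, List[str]]] = [(0, []) for _ in range(capacity + 1)]
    for name, (size, freq) in traces.items():
        profit = size * freq  # assumed profit model: size x freq
        # Walk capacities downwards so each trace is used at most once.
        for c in range(capacity, size - 1, -1):
            candidate = best[c - size][0] + profit
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - size][1] + [name])
    return sorted(best[capacity][1])
```

For example, `select_static({"T1": (64, 100), "T2": (32, 400), "T3": (32, 50)}, 96)` fills the 96 bytes with T1 and T2 and leaves T3 in memory, since nothing is ever swapped in the static scheme.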
7Scratch-pad Management Dynamic Approach
[Figure: control flow graph (BB1 to BB14, traces T1, T2, T3) next to a timeline of the scratch-pad space (96 bytes, drawn as a 64-byte and a 32-byte region). Step shown: copy T1 into the 64-byte region.]
8Scratch-pad Management Dynamic Approach
[Figure: scratch-pad timeline (96 bytes). Steps shown: copy T1 into the 64-byte region, then copy T2 into the 32-byte region.]
9Scratch-pad Management Dynamic Approach
[Figure: scratch-pad timeline (96 bytes). Steps shown: copy T1, copy T2, then copy T3 over T2 in the 32-byte region; T1 stays in the 64-byte region.]
10Scratch-pad Management Dynamic Approach
[Figure: scratch-pad timeline (96 bytes). Steps shown: copy T1, copy T2, copy T3 over T2, then copy T2 over T3; T1 stays in the 64-byte region.]
11Scratch-pad Management Dynamic Approach
[Figure: control flow graph (BB1 to BB14) annotated with four copy points: Copy1 for T1, Copy2 for T2, Copy3 for T2, and Copy4 for T3. In the scratch-pad timeline (96 bytes), T1 stays in the 64-byte region while T2 and T3 alternate in the 32-byte region: copy T1, copy T2, copy T3 over T2, copy T2 over T3, copy T3 over T2.]
12Objectives Of This Work
- Develop a dynamic compiler managed scheme to exploit scratch-pad
- Prior work: Verma et al., '04
- ILP based solution
- Not scalable
- Limits scope of analysis to single procedure, loop-nests
- Practical solution
- Scalable
- Handle arbitrary control flow graphs
- Inter-procedural analysis
13Our Approach
- Two phases
- Trace selection and scratch-pad (SP) allocation
- Identify frequently executed traces
- Select the most energy beneficial traces
- Place them with possible overlap to reduce copy overhead
- Copy placement
- Insert copies to realize the placement
- Hoist within the control flow graph to minimize overhead
- Fix branch targets into selected traces
14SP Allocation Computing Energy Gain
- Benefit: energy savings when the trace is executed from scratch-pad instead of memory
- CopyCost: overhead associated with copying the trace once
Benefit = ProfileWeight × Size × ΔFetchEnergy
CopyCost = Size × (FetchEnergy + WriteEnergy)
Energy Gain = Benefit - CopyCost
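The energy-gain computation on this slide is small enough to write out directly. The sketch below assumes per-byte energies in nJ and treats ΔFetchEnergy as the per-byte saving of a scratch-pad fetch over a memory fetch; the function name and the numbers in the usage note are illustrative, not measurements from the talk.

```python
def energy_gain(profile_weight: float, size: int, delta_fetch_energy: float,
                fetch_energy: float, write_energy: float) -> float:
    """Energy gain of placing one trace in the scratch-pad with a single copy.

    profile_weight     : how often the trace executes (from profiling)
    size               : trace size in bytes
    delta_fetch_energy : per-byte saving when fetching from scratch-pad
    fetch_energy       : per-byte cost of reading the trace from memory
    write_energy       : per-byte cost of writing it into the scratch-pad
    """
    benefit = profile_weight * size * delta_fetch_energy
    copy_cost = size * (fetch_energy + write_energy)
    return benefit - copy_cost
```

With hypothetical values, `energy_gain(100, 64, 0.5, 1.0, 0.25)` gives a benefit of 3200 and a copy cost of 80, so the gain is 3120. A trace that rarely executes can have a negative gain, which is why selection is driven by gain rather than by size alone.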
15SP Allocation Placing Traces
[Figure: execution sequence T1 T1 T2 T2 T2 T1 T1 T2 T2 T2 T3 T1 T1 T2 T2 T2 T3, annotated with the initial copy and two recopies of T1, and the initial copy and two recopies of T2.]
Dynamic Copy Cost = (# copies of T1) × CopyCost(T1) + (# copies of T2) × CopyCost(T2)
16Temporal Relationship Graph [Gloy et al., '97]
[Figure: execution sequence T1 T1 T2 T2 T2 T1 T1 T2 T2 T2 T3 T1 T1 T2 T2 T2 T3, annotated with two copies of T1 and two copies of T2; below it, a graph with nodes T1, T2, and T3.]
Dynamic Copy Cost = 2 × CopyCost(T1) + 2 × CopyCost(T2)
- Edge weights between two nodes denote the Dynamic Copy Cost
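One way to read the edge weights is as interleaving counts over the profiled trace sequence: for a pair (A, B), count how often two successive executions of A have at least one execution of B between them, since that is when A would have to be recopied if A and B shared space. The sketch below follows that reading, which reproduces the "2 and 2" on this slide for T1 and T2; the function names and the copy costs in the usage note are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def interleave_counts(sequence: List[str]) -> Dict[Tuple[str, str], int]:
    """counts[(a, b)] = number of times two successive executions of a
    are separated by at least one execution of b."""
    counts: Dict[Tuple[str, str], int] = defaultdict(int)
    seen_since: Dict[str, Set[str]] = {}  # trace -> traces seen since it last ran
    for trace in sequence:
        if trace in seen_since:
            for other in seen_since[trace]:
                counts[(trace, other)] += 1
        seen_since[trace] = set()
        for name, seen in seen_since.items():
            if name != trace:
                seen.add(trace)
    return counts


def edge_weight(counts: Dict[Tuple[str, str], int],
                copy_cost: Dict[str, float], a: str, b: str) -> float:
    """Dynamic copy cost if traces a and b were to share scratch-pad space."""
    return counts[(a, b)] * copy_cost[a] + counts[(b, a)] * copy_cost[b]
```

For the sequence on the slide, `interleave_counts("T1 T1 T2 T2 T2 T1 T1 T2 T2 T2 T3 T1 T1 T2 T2 T2 T3".split())` yields 2 for both `("T1", "T2")` and `("T2", "T1")`. With hypothetical copy costs of 64 and 32 units, `edge_weight` for the T1/T2 edge is then 2 × 64 + 2 × 32 = 192.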
17SP Allocation Placing Traces
[Figure: 96-byte scratch-pad with T2 placed first.]
- Energy Gain T1: 3104 nJ
- Energy Gain T2: 15952 nJ
- Energy Gain T3: 752 nJ
18SP Allocation Placing Traces
[Figure: 96-byte scratch-pad with T2 placed and T1 added in the remaining space.]
- Energy Gain T1: 3104 nJ
- Energy Gain T2: 15952 nJ
- Energy Gain T3: 752 nJ
19SP Allocation Placing Traces
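[Figure: temporal relationship graph over T1, T2, and T3 with edge weights 432 nJ, 96 nJ, and 144 nJ, shown next to the 96-byte scratch-pad holding T2 and T1.]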
20SP Allocation Placing Traces
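[Figure: temporal relationship graph over T1, T2, and T3 with edge weights 432 nJ, 96 nJ, and 144 nJ; in the 96-byte scratch-pad, T3 is placed in the region already used by T2 (labelled "T2, T3") while T1 keeps its own region.]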
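The placement walk-through on these slides can be approximated with a greedy sketch: take traces in decreasing energy gain, try every start offset, and pick the one whose overlap with already placed traces has the smallest summed graph weight. This is an assumed heuristic, not the authors' algorithm. The energy gains, the 96-byte size, and the sizes of T1 and T2 come from the slides; the 32-byte size of T3 and the assignment of 96 nJ to the T2/T3 edge and 144 nJ to the T1/T3 edge are assumptions, since the extracted figure does not show which edge carries which weight.

```python
from typing import Dict, FrozenSet


def place_traces(sizes: Dict[str, int], gain: Dict[str, float],
                 trg: Dict[FrozenSet[str], float], sp_size: int) -> Dict[str, int]:
    """Greedy placement; returns trace name -> start offset in the scratch-pad."""
    placed: Dict[str, int] = {}
    for name in sorted(sizes, key=lambda n: gain[n], reverse=True):
        size = sizes[name]
        if size > sp_size:
            continue  # trace can never fit
        best_off, best_cost = 0, None
        for off in range(sp_size - size + 1):
            cost = 0.0
            for other, o_off in placed.items():
                # Two byte ranges overlap if each starts before the other ends.
                if off < o_off + sizes[other] and o_off < off + size:
                    cost += trg.get(frozenset((name, other)), 0.0)
            if best_cost is None or cost < best_cost:
                best_off, best_cost = off, cost
        # Keep the trace only if it still saves energy after conflict cost.
        if gain[name] - best_cost > 0:
            placed[name] = best_off
    return placed


sizes = {"T1": 64, "T2": 32, "T3": 32}           # T3 size is assumed
gain = {"T1": 3104, "T2": 15952, "T3": 752}      # nJ, from the slides
trg = {frozenset(("T1", "T2")): 432,             # edge assignment is assumed
       frozenset(("T2", "T3")): 96,
       frozenset(("T1", "T3")): 144}
layout = place_traces(sizes, gain, trg, 96)
```

Under these assumptions T2 and T1 get disjoint regions and T3 lands on top of T2, matching the final layout in the figure. Scanning every byte offset is simple but quadratic; a real allocator would likely restrict candidates to region boundaries.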
21Copy Placement
- Initially, naively place copies at trace entry points
- Guarantees correct but inefficient execution
- Reduce the copy overhead
- Identify frequently executed copies
- Iteratively hoist copies to less frequently executed blocks
- Remove redundant copies
- Ensure that the hoists and removal are legal
- Traces must be present in the scratch-pad prior to their execution
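Redundant-copy removal can be illustrated with a standard forward "must" dataflow pass: a copy of a trace at a block is redundant if, on every path reaching that block, the trace is already resident and has not been overwritten by an overlapping trace. The sketch below is one such formulation under that assumption; it is not taken from the talk, and the function name, the tiny chain-shaped control flow graph, and the copy positions are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def redundant_copies(cfg: Dict[str, List[str]], entry: str,
                     copies: Dict[str, List[str]],
                     overlaps: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Return (block, trace) pairs whose copy is redundant.

    cfg      : block -> successor blocks
    copies   : block -> traces copied at block entry, in order
    overlaps : trace -> traces sharing its scratch-pad space (every trace is a key)
    """
    preds: Dict[str, List[str]] = {b: [] for b in cfg}
    for block, succs in cfg.items():
        for s in succs:
            preds[s].append(block)

    def transfer(block: str, resident: Set[str]) -> Set[str]:
        resident = set(resident)
        for t in copies.get(block, []):
            resident -= overlaps[t]  # copying t evicts whatever shares its space
            resident.add(t)
        return resident

    # Optimistic initial state, as usual for a must-analysis.
    out = {b: set(overlaps) for b in cfg}
    in_: Dict[str, Set[str]] = {b: set() for b in cfg}
    changed = True
    while changed:
        changed = False
        for b in cfg:
            if b == entry or not preds[b]:
                new_in: Set[str] = set()  # nothing is resident at program start
            else:
                new_in = set.intersection(*(out[p] for p in preds[b]))
            new_out = transfer(b, new_in)
            if new_in != in_[b] or new_out != out[b]:
                in_[b], out[b] = new_in, new_out
                changed = True

    result = []
    for b in cfg:
        resident = set(in_[b])
        for t in copies.get(b, []):
            if t in resident:
                result.append((b, t))
            resident -= overlaps[t]
            resident.add(t)
    return result


cfg = {"A": ["B"], "B": ["C"], "C": ["D"], "D": []}
copies = {"A": ["T2"], "B": ["T3"], "C": ["T2"], "D": ["T2"]}
overlaps = {"T1": set(), "T2": {"T3"}, "T3": {"T2"}}
found = redundant_copies(cfg, "A", copies, overlaps)
```

Here the copy of T2 in C is needed because T3 overwrote it in B, while the copy in D is redundant because T2 is still resident. The same residency sets could serve as a legality check when hoisting a copy, since a hoist must not leave any block where a trace executes without being present.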
22Copy Placement Initial Placement
[Figure: control flow graph (BB1 to BB14) with copies placed at trace entry points: C1-T1, C2-T1, and C3-T1 for T1; C1-T2 and C3-T2 for T2; C1-T3 and C2-T3 for T3.]
23Copy Placement Redundant Copies
[Figure: control flow graph with the initial copies (C1-T1, C2-T1, C3-T1, C1-T2, C3-T2, C1-T3, C2-T3); blocks are annotated with trace labels (T1; T2, T3), which identifies the redundant copies.]
24Copy Placement Hoisting
[Figure: control flow graph with the remaining copies C1-T1, C1-T2, and C1-T3.]
Live-Range:
- T1: BB4, BB6, BB7
- T2: BB9, BB10
- T3: BB12
25Copy Placement Hoisting
Live-Range of T2 before hoist
[Figure: control flow graph with blocks annotated by trace labels (T1; T2, T3); copy C1-T2 sits just before BB9 and copy C1-T3 just before BB13.]
26Copy Placement Hoisting
Live-Range of T2 after hoist: legal
[Figure: control flow graph with blocks annotated by trace labels (T1; T2, T3); copy C1-T2 has been hoisted from BB9 up toward the top of the graph, and copy C1-T3 is unchanged.]
27Experimental Setup
- Trimaran compiler framework
- Measured instruction fetch power
- Varied scratch-pad size from 32-bytes to 4-Kbytes
- Two configurations
- WIMS microcontroller at the Univ. of Michigan
- On-chip memory and scratch-pad
- Static vs dynamic schemes
- PowerMill
- Conventional processor
- Off-chip memory, on-chip scratch-pad vs on-chip I-cache
- CACTI model
- Scratch-pad vs I-cache
- DMA copying
- 2 bytes per cycle, stalling
28Energy Savings Static vs Dynamic
[Figure: bar chart "WIMS Energy Savings, 64-Byte scratch-pad". Y-axis: Energy Improvement (0 to 60). Series: dynamic, static. Benchmarks: fir, sha, epic, unepic, cjpeg, djpeg, blowfish, rawcaudio, rawdaudio, mpeg2enc, mpeg2dec, pegwitenc, pegwitdec, pgpencode, pgpdecode, gsmencode, gsmdecode, g721encode, g721decode, and average.]
- Average savings for Dynamic: 28%
- Average savings for Static: 17%
29Effect of Varying Scratch-pad Size
[Figure: plots for pegwitenc. Series: Static Energy, Static Hit Rate, Dynamic Energy, Dynamic Hit Rate. Axes: Energy Savings (0 to 35), Hit Rate (0 to 100), SP Size (bytes): 32, 64, 128, 256, 512, 1024, 2048, 4096.]
30Scratch-pad Size For 95% Hit Rate
[Figure: bar chart of the scratch-pad size needed per benchmark, plus the average. Y-axis: Size (bytes), 0 to 9000. Series: static, dynamic.]
- Dynamic is 2.5x better than static
31Energy Savings SP vs I-Cache
[Figure: bar chart "CACTI energy savings, 64-byte scratch-pad/I-cache". Y-axis: Energy Improvement (-60 to 120). Series: dynamic, static, I-cache. One group of bars per benchmark, plus the average.]
- Average savings for Dynamic: 48%
- Average savings for Static: 25%
- Average savings for I-cache: 30%
32Conclusions
- Compiler directed dynamic placement in scratch-pad
- Arbitrary control flow graph
- Inter-procedural
- Two phases: SP allocation and copy placement
- 28% savings for dynamic as compared to 16% for static for a 64-byte scratch-pad
- 41% savings for dynamic as compared to 31% for static for a 256-byte scratch-pad
- 2% to 10% stall cycles
- Within 0% to 11% of optimal, but scalable
33For more information: http://cccp.eecs.umich.edu
Thank You!