Compiler Managed Dynamic Instruction Placement In A Low-Power Code Cache

Slides: 34

Provided by: Kevi1

Learn more at: https://cccp.eecs.umich.edu

Transcript and Presenter's Notes

1
Compiler Managed Dynamic Instruction Placement In A Low-Power Code Cache

Rajiv Ravindran, Pracheeti Nagarkar, Ganesh Dasika, Robert Senger, Eric Marsman, Scott Mahlke, Richard Brown
Department of Electrical Engineering and Computer Science
University of Michigan, Ann Arbor
2
Introduction

Instruction fetch power is dominant in low-power embedded processors
27% for the StrongARM
50% for the Motorola MCORE
Two alternatives
Scratch-pad
No hardware overhead
Part of the physical address space
Managed in software (drawback)
Instruction cache
Hardware managed
Transparent to the user
Power-hungry tag-checking and comparison logic (drawback)
3
Focus Of This Work

Explore the use of scratch-pad for reducing instruction fetch power
Two possible software managed schemes
Static
Map hot regions prior to execution
Contents do not change during execution
Dynamic
Allow contents to change during execution
Explicit copying of hot regions

4
Scratch-pad Management: Static Approach
[Figure: control flow graph of basic blocks BB1 to BB14 with traces T1, T2, and T3 marked]
5
Scratch-pad Management: Static Approach
profit = size × freq
[Figure: control flow graph with trace T1 highlighted]
6
Scratch-pad Management: Static Approach
Equivalent to bin-packing
profit = size × freq
[Figure: 96-byte scratch-pad filled with T1 (64 bytes) and T2 (32 bytes)]
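
The static scheme can be sketched as a small packing problem: choose the set of traces with the highest total profit (read here as size times execution frequency) that fits in the scratch-pad. The version below is a minimal 0/1 knapsack sketch, not the authors' implementation. The sizes of T1 and T2 follow the slide; the size of T3 and all frequencies are made-up values for illustration.

```python
def static_allocate(traces, capacity):
    """Pick the subset of traces with maximum total profit that fits.

    traces: dict name -> (size_bytes, exec_freq); profit = size * freq.
    Returns (total_profit, sorted list of chosen trace names).
    """
    # best[c] = (profit, chosen names) achievable with at most c bytes
    best = [(0, [])] * (capacity + 1)
    for name, (size, freq) in traces.items():
        profit = size * freq
        # Walk capacities downward so each trace is used at most once.
        for c in range(capacity, size - 1, -1):
            cand_profit = best[c - size][0] + profit
            if cand_profit > best[c][0]:
                best[c] = (cand_profit, best[c - size][1] + [name])
    total, chosen = best[capacity]
    return total, sorted(chosen)


# T1 and T2 sizes are from the slide; T3's size and all frequencies are hypothetical.
traces = {"T1": (64, 100), "T2": (32, 400), "T3": (32, 50)}
total, chosen = static_allocate(traces, capacity=96)
print(chosen, total)
```

With these assumed frequencies the sketch selects T1 and T2, which together fill the 96-byte scratch-pad as in the slide's figure. T3 is left in main memory for the whole run, which is the limitation the dynamic approach addresses.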
7
Scratch-pad Management: Dynamic Approach
[Figure: control flow graph (BB1 to BB14, traces T1 to T3) beside a timeline of the 96-byte scratch-pad, split into a 64-byte and a 32-byte region. Step shown: copy T1 into the 64-byte region]
8
Scratch-pad Management: Dynamic Approach
[Figure: timeline of the 96-byte scratch-pad. Steps shown: copy T1 into the 64-byte region; copy T2 into the 32-byte region]
9
Scratch-pad Management: Dynamic Approach
[Figure: timeline of the 96-byte scratch-pad. Steps shown: copy T1; copy T2; copy T3 over T2 in the 32-byte region]
10
Scratch-pad Management: Dynamic Approach
[Figure: timeline of the 96-byte scratch-pad. Steps shown: copy T1; copy T2; copy T3 over T2; copy T2 over T3]
11
Scratch-pad Management: Dynamic Approach
Copy instructions in the control flow graph: Copy1 for T1, Copy2 for T2, Copy3 for T2, Copy4 for T3
[Figure: control flow graph (BB1 to BB14, traces T1 to T3) with the four copy points, beside a timeline of the 96-byte scratch-pad (one 64-byte and one 32-byte region). Copy steps over time: copy T1; copy T2; copy T3 over T2; copy T2 over T3; copy T3 over T2]
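
The timeline on these slides can be replayed with a few lines of code. The sketch below assumes a fixed placement in which T1 owns the 64-byte region while T2 and T3 share the 32-byte region, and an execution order (hypothetical, chosen to match the copy steps listed on the slide). It only models which trace is resident in each region; it does not model the copy instructions in the control flow graph.

```python
def simulate(placement, sequence):
    """Replay trace entries and log the copies needed.

    placement: dict trace -> region name (traces sharing a region overlap).
    sequence: order in which traces are entered at run time.
    """
    resident = {}  # region -> trace currently held there
    log = []
    for trace in sequence:
        region = placement[trace]
        current = resident.get(region)
        if current == trace:
            continue  # already in the scratch-pad, no copy needed
        if current is None:
            log.append(f"copy {trace}")
        else:
            log.append(f"copy {trace} over {current}")
        resident[region] = trace
    return log


placement = {"T1": "64b", "T2": "32b", "T3": "32b"}
sequence = ["T1", "T2", "T3", "T2", "T3"]  # hypothetical execution order
log = simulate(placement, sequence)
for step in log:
    print(step)
```

Because T1 has its own region it is copied only once, while T2 and T3 evict each other every time control moves between them.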
12
Objectives Of This Work

Develop a dynamic compiler-managed scheme to exploit scratch-pad
Prior work [Verma et al., 04]
ILP-based solution
Not scalable
Limits scope of analysis to single procedures and loop nests
Practical solution
Scalable
Handle arbitrary control flow graphs
Inter-procedural analysis

13
Our Approach

Two phases
Trace selection and scratch-pad (SP) allocation
Identify frequently executed traces
Select the most energy-beneficial traces
Place them with possible overlap to reduce copy overhead
Copy placement
Insert copies to realize the placement
Hoist within the control flow graph to minimize overhead
Fix branch targets into selected traces

14
SP Allocation: Computing Energy Gain
Benefit: energy savings when the trace is executed from scratch-pad instead of memory
CopyCost: overhead associated with copying the trace once
Benefit = ProfileWeight × Size × ΔFetchEnergy
CopyCost = Size × (FetchEnergy + WriteEnergy)
Energy Gain = Benefit - CopyCost
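
As a worked example, the energy gain of a trace can be computed directly from these quantities, reading the slide as Benefit = ProfileWeight × Size × ΔFetchEnergy and CopyCost = Size × (FetchEnergy + WriteEnergy). The slides do not give the underlying parameter values; the numbers below are illustrative assumptions, chosen so that the result equals the 3104 nJ energy gain reported for T1 on a later slide.

```python
def energy_gain(profile_weight, size, delta_fetch_energy, fetch_energy, write_energy):
    """Energy gain of running a trace from the scratch-pad, paying for one copy.

    Energies are per byte (nJ here); profile_weight is the execution count.
    """
    benefit = profile_weight * size * delta_fetch_energy
    copy_cost = size * (fetch_energy + write_energy)
    return benefit - copy_cost


# All parameter values are assumptions for illustration, not taken from the slides.
gain = energy_gain(
    profile_weight=100,      # trace executed 100 times
    size=64,                 # bytes
    delta_fetch_energy=0.5,  # nJ saved per byte fetched from scratch-pad
    fetch_energy=1.0,        # nJ per byte read from memory during the copy
    write_energy=0.5,        # nJ per byte written into the scratch-pad
)
print(gain)  # 3104.0
```

A trace is worth placing only when this value is positive, that is, when it executes often enough to pay back the one-time copy.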
15
SP Allocation: Placing Traces
Execution sequence: T1 T1 T2 T2 T2 T1 T1 T2 T2 T2 T3 T1 T1 T2 T2 T2 T3
[Figure: the sequence annotated with an initial copy and two recopies of T1, and an initial copy and two recopies of T2]
Dynamic Copy Cost = (copies of T1) × CopyCost(T1) + (copies of T2) × CopyCost(T2)
16
Temporal Relationship Graph [Gloy et al., 97]
Execution sequence: T1 T1 T2 T2 T2 T1 T1 T2 T2 T2 T3 T1 T1 T2 T2 T2 T3
[Figure: the sequence annotated with two copies of T1 and two copies of T2; graph with nodes T1, T2, and T3]
Edge weight (T1, T2) = 2 × CopyCost(T1) + 2 × CopyCost(T2)
Edge weights between two nodes denote the dynamic copy cost
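
One plausible way to derive these edge weights from a profile is sketched below: for each pair of traces, replay the execution sequence as if the two shared the same scratch-pad space and count how often each has to be recopied after being evicted by the other. This is an interpretation of the slide, not the authors' code. The function names and the per-trace copy costs are hypothetical; the execution sequence is the one shown on the slide.

```python
from itertools import combinations


def recopies(sequence, a, b):
    """Count recopies of a and b if they shared the same scratch-pad space."""
    counts = {a: 0, b: 0}
    loaded = set()   # traces that have been copied in at least once
    resident = None  # which of the two currently occupies the shared space
    for t in sequence:
        if t not in (a, b) or t == resident:
            continue
        if t in loaded:
            counts[t] += 1  # evicted earlier by the other trace, so recopy
        loaded.add(t)
        resident = t
    return counts


def build_trg(sequence, copy_cost):
    """Edge weight = dynamic copy cost if the two traces overlapped."""
    graph = {}
    for a, b in combinations(sorted(set(sequence)), 2):
        n = recopies(sequence, a, b)
        graph[(a, b)] = n[a] * copy_cost[a] + n[b] * copy_cost[b]
    return graph


sequence = "T1 T1 T2 T2 T2 T1 T1 T2 T2 T2 T3 T1 T1 T2 T2 T2 T3".split()
copy_cost = {"T1": 10, "T2": 5, "T3": 5}  # hypothetical units
trg = build_trg(sequence, copy_cost)
print(trg)
```

For T1 and T2 the sketch counts two recopies each, which matches the 2 × CopyCost(T1) + 2 × CopyCost(T2) weight on the slide. A heavy edge means the two traces interleave often and should not share space.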
17
SP Allocation: Placing Traces
Energy Gain: T1 = 3104 nJ, T2 = 15952 nJ, T3 = 752 nJ
[Figure: 96-byte scratch-pad with T2 placed]
18
SP Allocation: Placing Traces
Energy Gain: T1 = 3104 nJ, T2 = 15952 nJ, T3 = 752 nJ
[Figure: 96-byte scratch-pad with T2 and T1 placed]
19
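SP Allocation: Placing Traces
[Figure: 96-byte scratch-pad holding T2 and T1, beside the temporal relationship graph over T1, T2, and T3 with edge weights 432 nJ, 96 nJ, and 144 nJ]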
20
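SP Allocation: Placing Traces
[Figure: 96-byte scratch-pad with T3 placed in the same space as T2 (labelled "T2, T3") and T1 in the remaining space, beside the temporal relationship graph with edge weights 432 nJ, 96 nJ, and 144 nJ]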
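
The placement walk-through on these slides can be approximated by a simple greedy procedure: take traces in decreasing order of energy gain, give each one free space while any remains, and otherwise overlap it with the already placed trace that has the cheapest conflict, provided the gain still exceeds that conflict cost. This is a simplified sketch, not the authors' algorithm. The energy gains are from the slides. The trace sizes (T3 in particular) and the assignment of the three edge weights to specific trace pairs are assumptions, since the extracted slides do not show them.

```python
def place_traces(traces, capacity, conflict_cost):
    """Greedy scratch-pad placement with overlap.

    traces: dict name -> (size_bytes, energy_gain)
    conflict_cost: dict frozenset({a, b}) -> dynamic copy cost if a and b overlap
    Returns a list of regions: {"offset", "size", "traces"}.
    """
    regions = []
    used = 0
    by_gain = sorted(traces.items(), key=lambda kv: -kv[1][1])
    for name, (size, gain) in by_gain:
        if used + size <= capacity:
            # Free space is available: no overlap, no recopy cost.
            regions.append({"offset": used, "size": size, "traces": [name]})
            used += size
            continue
        best = None  # (cost, region) of the cheapest acceptable overlap
        for region in regions:
            if region["size"] < size:
                continue
            cost = sum(conflict_cost[frozenset((name, o))] for o in region["traces"])
            if cost < gain and (best is None or cost < best[0]):
                best = (cost, region)
        if best is not None:
            best[1]["traces"].append(name)
    return regions


# Gains (nJ) are from the slides; sizes and edge-to-pair mapping are assumed.
traces = {"T1": (64, 3104), "T2": (32, 15952), "T3": (32, 752)}
conflict_cost = {
    frozenset(("T1", "T2")): 432,
    frozenset(("T1", "T3")): 144,
    frozenset(("T2", "T3")): 96,
}
regions = place_traces(traces, capacity=96, conflict_cost=conflict_cost)
for r in regions:
    print(r)
```

Under these assumptions T2 and T1 fill the 96 bytes and T3 is overlapped with T2, the outcome shown on the last placement slide.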
21
Copy Placement

Initially, naively place copies at trace entry points
Guarantees correct but inefficient execution
Reduce the copy overhead
Identify frequently executed copies
Iteratively hoist copies to less frequently executed blocks
Remove redundant copies
Ensure that the hoists and removal are legal
Traces are present prior to execution

22
Copy Placement: Initial Placement
[Figure: control flow graph (BB1 to BB14, traces T1 to T3) with copies placed at trace entry points: C1-T1, C2-T1, and C3-T1 for T1; C1-T2 and C3-T2 for T2; C1-T3 and C2-T3 for T3]
23
Copy Placement: Redundant Copies
[Figure: the same control flow graph and copies, with each block annotated with the traces involved at that point (T1, or T2 and T3), used to identify redundant copies]
24
Copy Placement: Hoisting
Live ranges: T1: BB4, BB6, BB7; T2: BB9, BB10; T3: BB12
[Figure: control flow graph (BB1 to BB14) with the remaining copies C1-T1, C1-T2, and C1-T3]
25
Copy Placement: Hoisting
Live range of T2 before hoist
[Figure: control flow graph with per-block trace annotations; C1-T2 sits just before BB9, C1-T3 near BB12]
26
Copy Placement: Hoisting
Live range of T2 after hoist: legal
[Figure: control flow graph with per-block trace annotations; C1-T2 has been hoisted from just before BB9 to the top of the graph, near BB1 and BB2; C1-T3 remains near BB12]
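
One plausible reading of the legality check is that hoisting a copy extends the live range of its trace, and the hoist is legal as long as the extended range does not run into the live range of any trace that shares the same scratch-pad space. The sketch below encodes that reading with sets of basic blocks. The original live ranges are from the slides; the overlap relation (T2 and T3 sharing space) follows the placement slides, and the extended range for the hoisted copy is hypothetical because the transcript does not preserve the graph's edges.

```python
def hoist_is_legal(trace, new_live_range, live_ranges, overlaps):
    """Return True if the extended live range conflicts with no overlapping trace.

    live_ranges: dict trace -> set of basic blocks where the trace must be resident
    overlaps: dict trace -> set of traces sharing its scratch-pad space
    """
    for other in overlaps.get(trace, ()):
        if new_live_range & live_ranges[other]:
            return False  # would clobber a trace that is still needed
    return True


live_ranges = {
    "T1": {"BB4", "BB6", "BB7"},
    "T2": {"BB9", "BB10"},
    "T3": {"BB12"},
}
overlaps = {"T2": {"T3"}, "T3": {"T2"}}

# Hypothetical live range of T2 after hoisting its copy toward the top of the graph.
hoisted_t2 = {"BB2", "BB3", "BB4", "BB5", "BB6", "BB7", "BB8", "BB9", "BB10"}
legal = hoist_is_legal("T2", hoisted_t2, live_ranges, overlaps)

# Extending the range into BB12 would collide with T3, which shares T2's space.
too_far = hoist_is_legal("T2", hoisted_t2 | {"BB12"}, live_ranges, overlaps)
print(legal, too_far)
```

T1 does not appear in the check for T2 because it occupies separate space, so T2's live range may pass through T1's blocks freely.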
27
Experimental Setup

Trimaran compiler framework
Measured instruction fetch power
Varied scratch-pad size from 32 bytes to 4 Kbytes
Two configurations
WIMS microcontroller at the University of Michigan
On-chip memory and scratch-pad
Static vs dynamic schemes
PowerMill
Conventional processor
Off-chip memory, on-chip scratch-pad vs on-chip I-cache
CACTI model
Scratch-pad vs I-cache
DMA copying
2 bytes per cycle, stalling
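
As a quick piece of arithmetic on the DMA model, the stall incurred by one copy follows from the trace size and the 2 bytes per cycle transfer rate. The sketch below assumes the processor stalls for the entire transfer and that there is no DMA setup cost; neither detail is stated on the slide.

```python
import math


def dma_stall_cycles(trace_size_bytes, bytes_per_cycle=2):
    """Cycles the processor stalls while one trace is copied by DMA."""
    return math.ceil(trace_size_bytes / bytes_per_cycle)


print(dma_stall_cycles(64))  # 32
print(dma_stall_cycles(32))  # 16
```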

28
Energy Savings: Static vs Dynamic
[Figure: WIMS energy savings with a 64-byte scratch-pad. Bar chart of energy improvement (%, axis 0 to 60) for the dynamic and static schemes across fir, sha, epic, unepic, cjpeg, djpeg, blowfish, rawcaudio, rawdaudio, mpeg2enc, mpeg2dec, pegwitenc, pegwitdec, pgpencode, pgpdecode, gsmencode, gsmdecode, g721encode, g721decode, and the average]
Average savings for dynamic: 28%
Average savings for static: 17%
29
Effect of Varying Scratch-pad Size
[Figure: pegwitenc. Energy savings (axis 0 to 35) and hit rate (axis 0 to 100) versus SP size (32 to 4096 bytes); series: static energy, static hit rate, dynamic energy, dynamic hit rate]
30
Scratch-pad Size for 95% Hit Rate
[Figure: bar chart of the scratch-pad size (bytes, axis 0 to 9000) required by the static and dynamic schemes across fir, sha, epic, unepic, cjpeg, djpeg, blowfish, rawcaudio, rawdaudio, mpeg2enc, mpeg2dec, pegwitenc, pegwitdec, pgpencode, pgpdecode, gsmencode, gsmdecode, g721encode, g721decode, and the average]
Dynamic is 2.5x better than static
31
Energy Savings: SP vs I-Cache
[Figure: CACTI energy savings, 64-byte scratch-pad/I-cache. Bar chart of energy improvement (%, axis -60 to 120) for dynamic, static, and I-cache across fir, sha, epic, unepic, cjpeg, djpeg, blowfish, rawcaudio, rawdaudio, mpeg2enc, mpeg2dec, pegwitenc, pegwitdec, pgpencode, pgpdecode, gsmencode, gsmdecode, g721encode, g721decode, and the average]
Average savings for dynamic: 48%
Average savings for static: 25%
Average savings for I-cache: 30%
32
Conclusions

Compiler-directed dynamic placement in scratch-pad
Arbitrary control flow graphs
Inter-procedural
Two phases: SP allocation and copy placement
28% savings for dynamic compared to 16% for static with a 64-byte scratch-pad
41% savings for dynamic compared to 31% for static with a 256-byte scratch-pad
2% to 10% stall cycles
Within 0% to 11% of optimal, but scalable

33
For more information: http://cccp.eecs.umich.edu
Thank You!
