System-Level Memory Bus Power And Performance Optimization for Embedded Systems

1
System-Level Memory Bus Power And Performance
Optimization for Embedded Systems
  • Ke Ning
  • kning_at_ece.neu.edu
  • David Kaeli
  • kaeli_at_ece.neu.edu

2
Why Is Power More Important?
  • "Power: A First-Class Architectural Design Constraint" (Trevor Mudge, 2001)
  • Increasing complexity for higher performance
    (MIPS)
  • Parallelism, pipeline, memory/cache size
  • Higher clock frequency, larger die size
  • Rising dynamic power consumption
  • CMOS process continues to shrink
  • Smaller size logic gates reduce Vthreshold
  • Lower Vthreshold will have higher leakage
  • Leakage power will exceed dynamic power
  • Things are getting worse in embedded systems
  • Low power and low cost systems
  • Fixed or Limited applications/functionalities
  • Real-time systems with timing constraints

3
Power Breakdown of An Embedded System
(Pie chart: power breakdown measured at 25°C; internal 1.2 V, 400 MHz CCLK Blackfin processor; external 3.3 V, 133 MHz SDRAM, 27 MHz PPI. The external memory bus is the research target.)
Source: Analog Devices Inc.
4
Introduction
  • Related work on microprocessor power
  • Low power design trend
  • Power metrics
  • Power performance tradeoffs
  • Power optimization techniques
  • Power estimation framework
  • Experimental framework built on a Blackfin cycle-accurate simulator
  • Validated through a Blackfin EZKit board
  • Power aware bus arbitration
  • Memory page remapping

5
Outline
  • Research Motivation and Introduction
  • Related Work
  • Power Estimation Framework
  • Optimization I: Power-Aware Bus Arbitration
  • Optimization II: Memory Page Remapping
  • Summary

6
Power Modeling
  • Dynamic power estimation
  • Instruction-level models [Tiwari94], JouleTrack [Sinha01]
  • Function-level model [Qu00]
  • Architecture models: Cai-Lim model / TEMPEST [CaiLim99], Wattch [Brooks00], SimplePower [Ye00]
  • Static power estimation
  • Butts-Sohi model [Butts00]
  • Previous memory system power estimation
  • Activity model: CACTI [Wilton96]
  • Trace-driven model: Dinero IV [Elder98]

7
Power Equation

P = P_dynamic + P_leakage = A · C · V_DD^2 · f + N · k_design · I_leakage · V_DD

where A is the activity factor, C the total switched capacitance, V_DD the supply voltage, f the clock frequency, N the transistor count, k_design the design/technology factor, and I_leakage the normalized leakage current.
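A numeric sketch of the equation above; every value below is an illustrative placeholder, not a measured Blackfin parameter.

```python
# Numeric sketch of P = A*C*V_DD^2*f + N*k_design*I_leak*V_DD.
# All inputs are made-up example values, not measured Blackfin data.

def total_power(A, C, V_dd, f, N, k_design, I_leak):
    """Return (dynamic, leakage) power per the equation on this slide."""
    p_dynamic = A * C * V_dd ** 2 * f          # A * C * V_DD^2 * f
    p_leakage = N * k_design * I_leak * V_dd   # N * k_design * I_leak * V_DD
    return p_dynamic, p_leakage

dyn, leak = total_power(A=0.2, C=1e-9, V_dd=1.2, f=400e6,
                        N=1e7, k_design=1.0, I_leak=1e-12)
print(dyn, leak)  # dynamic power dominates at these example values
```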
8
Common Power Optimization Techniques
  • Gating (turn off unused components)
  • Clock gating
  • Voltage gating Cache decay Hu01
  • Scaling (scale operating point of an component)
  • Voltage scaling Drowsy cache Flautner02
  • Frequency scaling Pering98
  • Resource scaling DRAM power mode Delaluz01
  • Banking (break single component into smaller
    sub-units)
  • Vertical sub-banking Filter cacheKin97
  • Horizontal sub-banking Scratchpad Kandemir01
  • Clustering (partition components into clusters)
  • Switching reduction (redesigning with lower
    activity)
  • Bus encoding Permutation Code Mehta96,
    redundant codeStan95, Benini98, WZEMusoll97

9
Power Aware Figure of Merit
  • Delay, D
  • Performance, MIPS
  • Power, P
  • Battery life (mobile), packaging (high performance)
  • Obvious choice for a power-performance tradeoff: P·D
  • Joules/instruction, or inversely MIPS/W
  • Energy figure
  • Mobile / low-power applications
  • Energy-Delay: P·D^2
  • MIPS^2/W [Gonzalez96]
  • Energy-Delay-Square: P·D^3
  • MIPS^3/W
  • Voltage and frequency independent
  • More generically, MIPS^m/W
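A quick illustration of the MIPS^m/W figures of merit above; the MIPS and watt numbers are made up for the example, not benchmark results.

```python
# Compare two hypothetical designs under the MIPS^m/W family of metrics.

def merit(mips, watts, m=2):
    """MIPS^m per watt; m = 1 gives MIPS/W, m = 2 gives MIPS^2/W."""
    return mips ** m / watts

fast_hot = merit(400, 2.0)   # 400 MIPS design burning 2.0 W (example)
slow_cool = merit(200, 0.6)  # 200 MIPS design burning 0.6 W (example)
print(fast_hot > slow_cool)  # under MIPS^2/W the faster design wins here
```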

10
Power Optimization Effect on Power Figure
  • Most optimization schemes sacrifice performance for lower power consumption, except switching reduction.
  • All optimization schemes improve power efficiency.
  • All optimization schemes increase hardware complexity.

11
Outline
  • Research Motivation and Introduction
  • Related Work
  • Power Estimation Framework
  • Optimization I: Power-Aware Bus Arbitration
  • Optimization II: Memory Page Remapping
  • Summary

12
External Bus
  • External bus components
  • Typically an off-chip bus
  • Includes control bus, address bus, data bus
  • External bus power consumption
  • Dynamic power factors: activity, capacitance, frequency, voltage
  • Leakage power: supply voltage, threshold voltage, CMOS technology
  • Different from internal memory bus power
  • Longer physical distance, higher bus capacitance, lower speed
  • Cross-line interference, higher leakage current
  • Different communication protocols (memory/peripheral dependent)
  • Multiplexed row/column address bus, narrower data bus

13
Embedded SOC System Architecture

(Block diagram: a media processor core with instruction and data caches connects over the internal bus to the system DMA controller (Memory DMA 0/1, PPI DMA, SPORT DMA) and to the External Bus Interface Unit (EBIU). The external bus links the EBIU to SDRAM, FLASH memory, and asynchronous devices; a streaming interface through an NTSC/PAL encoder provides S-Video/CVBS output and a NIC. The power modeling area covers the EBIU and the external bus.)
14
ADSP-BF533 EZ-Kit Lite Board
(Board photo: the BF533 Blackfin processor with SDRAM and FLASH memory; video in/out through a video codec/ADV converter, audio in/out through an audio codec/AD converter, and SPORT data I/O.)
15
External Bus Power Estimator
  • Previous Approaches
  • Used Hamming distance Benini98
  • Control signal was not considered
  • Shared row and column address bus
  • Memory state transitions were not considered
  • In Our Estimator
  • Integrate memory control signal power into the
    model
  • Consider the case where row and column address
    are shared
  • Memory state transitions and stalls also cost
    power
  • Consider page miss penalty and traffic reverse
    penalty
  • P(bus) = P(page miss) + P(bus turnaround) + P(control signal) + P(address generation) + P(data transmission) + P(leakage)
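The additive model above can be sketched as a simple component sum; the per-component values here are hypothetical per-access energies, not the calibrated estimator's numbers.

```python
# Component-sum sketch of the additive external bus power model:
# P(bus) = page miss + turnaround + control + address + data + leakage.

def bus_power(components):
    """Sum the per-component power contributions of one bus transaction."""
    return sum(components[k] for k in ("page_miss", "turnaround", "control",
                                       "address", "data", "leakage"))

# Hypothetical per-access contributions (arbitrary units).
access = {"page_miss": 0.0, "turnaround": 0.1, "control": 0.3,
          "address": 0.5, "data": 2.0, "leakage": 0.2}
print(bus_power(access))
```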

16
Two External Bus SDRAM Timing Models

(a) SDRAM access in sequential command mode:
    Bank 0 request: P A R R R R N N N N
    Bank 1 request: P A R R N N N N
(b) SDRAM access in pipelined command mode:
    Bank 0 request: P A R R R R N N
    Bank 1 request: P A R R R R
(Time axis: system clock cycles, SCLK.)
Legend: P = PRECHARGE, A = ACTIVATE, N = NOP, R = READ
17
Bus Power Simulation Framework

(Simulation flow: a program is compiled to a target binary and executed on an instruction-level simulator with a memory hierarchy model; a memory trace generator feeds the external bus power estimator, which combines a memory power model with a memory technology timing model to produce the bus power estimate. All modules are software we developed.)
18
Multimedia Benchmark Configurations
Name | Description | I-Cache Size | D-Cache Size
MPEG2-ENC | MPEG-2 video encoder with 720x480 4:2:0 input frames | 16k | 16k
MPEG2-DEC | MPEG-2 video decoder of 720x480 sequence with 4:2:2 CCIR frame output | 16k | 16k
H264-ENC | H.264/MPEG-4 Part 10 (AVC) video encoder achieving very high data compression | 16k | 16k
H264-DEC | H.264/MPEG-4 Part 10 (AVC) video decompression algorithm | 16k | 16k
JPEG-ENC | JPEG image encoder for 512x512 image | 8k | 8k
JPEG-DEC | JPEG image decoder for 512x512 image | 8k | 8k
PGP-ENC | Pretty Good Privacy encryption and digital signature of text message | 8k | 4k
PGP-DEC | Pretty Good Privacy decryption of encrypted message | 8k | 4k
G721-ENC | G.721 voice encoder of 16-bit input audio samples | 4k | 2k
G721-DEC | G.721 voice decoder of encoded bits | 4k | 2k
19
Outline
  • Research Motivation and Introduction
  • Related Work
  • Power Estimation Framework
  • Optimization I: Power-Aware Bus Arbitration
  • Optimization II: Memory Page Remapping
  • Summary

20
Optimization I: Bus Arbitration
  • Multiple bus access masters in an SOC system
  • Processor cores
  • Data/Instruction caches
  • DMA
  • ASIC modules
  • Multimedia applications
  • High bus bandwidth throughput
  • Large memory footprint
  • Efficient arbitration algorithm can
  • Increase power awareness
  • Increase bus throughput
  • Reduce bus power

21
Bus Arbitration Target Region

(Block diagram: the same SOC architecture, with the EBIU (arbitration enabled) highlighted as the bus arbitration target region between the internal bus masters and the external bus to SDRAM, FLASH memory, and asynchronous devices.)
22
Bus Arbitration Schemes
  • EBIU with arbitration enabled
  • Handle core-to-memory and core-to-peripheral
    communication
  • Resolve bus access contention
  • Schedule bus access requests
  • Traditional algorithms
  • First Come First Serve (FCFS)
  • Fixed Priority
  • Power-aware algorithms (categorized by power metric / cost function)
  • Minimum power (P^1 D^0) or (1, 0)
  • Minimum delay (P^0 D^1) or (0, 1)
  • Minimum power-delay product (P^1 D^1) or (1, 1)
  • Minimum power-delay-square product (P^1 D^2) or (1, 2)
  • More generically (P^n D^m) or (n, m)

23
Bus Arbitration Schemes (Continued)
  • Power Aware Arbitration
  • From the current pending requests in the waiting
    queue, find a permutation of the external bus
    requests to achieve the minimum total power
    and/or performance cost.
  • Reducible to minimum Hamiltonian path problem in
    a graph G(V,E).
  • Vertex: request R(t, s, b, l)
  • t: request arrival time
  • s: starting address
  • b: block size
  • l: read / write
  • Edge: transition between requests i and j
  • edge weight w(i, j) is the cost of the transition

24
Minimum Hamiltonian Path Problem
R0 is the last request on the bus and must be the starting point of a path. R1, R2, R3 are the requests in the queue. The edge weight is w(i,j) = P(i,j)^n · D(i,j)^m, where P(i,j) is the power of Rj issued after Ri and D(i,j) is the delay of Rj after Ri. Example minimum Hamiltonian path: R0 → R3 → R1 → R2, with path weight w(0,3) + w(3,1) + w(1,2). Finding the minimum-weight Hamiltonian path is NP-complete.
(Figure: complete directed graph over R0, R1, R2, R3 with edge weights w(i,j).)
25
Greedy Solution
Greedy algorithm (local minimum): only the next request in the path is needed, min_j w(0,j), where w(i,j) is the edge weight of graph G(V,E).
In each iteration of arbitration:
1. A new graph G(V,E) is constructed.
2. The greedy-choice request is arbitrated onto the bus.
(Figure: the same request graph; the greedy search considers only the edges out of R0.)
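A minimal sketch of the greedy selection above: from the last granted request, pick the pending request j minimizing w(i,j) = P(i,j)^n · D(i,j)^m. The power() and delay() callbacks and the toy cost numbers are hypothetical stand-ins for the PEU estimates, not the actual hardware model.

```python
# Greedy power-aware arbitration sketch: one min() over the pending queue.

def greedy_arbitrate(last, queue, power, delay, n=1, m=0):
    """Return the pending request with minimum transition cost from `last`."""
    return min(queue, key=lambda req: power(last, req) ** n * delay(last, req) ** m)

# Toy cost model: a bank switch (page miss) costs 1.0, a same-bank access 0.1.
requests = [("R1", 0), ("R2", 1), ("R3", 0)]  # (name, bank) pairs
power = lambda a, b: 0.1 if a[1] == b[1] else 1.0
delay = lambda a, b: 1.0
last = ("R0", 1)
print(greedy_arbitrate(last, requests, power, delay))  # ('R2', 1): same bank
```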
26
Experimental Setup
  • Utilized the embedded power modeling framework
  • Implemented eleven different arbitration schemes inside the EBIU
  • FCFS, Fixed Priority
  • Minimum power (P^1 D^0) or (1, 0), minimum delay (P^0 D^1) or (0, 1), and (1,1), (1,2), (2,1), (1,3), (3,1), (3,2), (2,3)
  • 10 multimedia application benchmarks were ported to the Blackfin architecture and simulated, including MPEG-2, H.264, JPEG, PGP, and G.721.

27
Power Improvement
  • Power-aware arbitration schemes have lower power consumption than Fixed Priority and FCFS.
  • The difference across power-aware arbitration strategies is small.
  • The pipelined command model gives 6-7% power savings over the sequential command model for MPEG2 ENC/DEC.
  • The results are consistent across all other benchmarks.

28
Speed Improvement
  • Power-aware schemes have smaller bus delay than traditional Fixed Priority and FCFS.
  • The difference across power-aware arbitration strategies is small.
  • The pipelined command model gives a 3-9% speedup over the sequential command model for MPEG2 ENC/DEC.
  • The results are consistent across all other benchmarks.

29
Comparison with Exhaustive Algorithm
  • The greedy algorithm can fail in certain cases.
  • Complexity is O(n) vs. O(n!) for exhaustive search.
  • The performance difference is negligible.

(Figure: example request graph where, after a new request arrives, exhaustive search finds a cheaper path than the greedy choice; edge weights 5, 7, 15, 17, 18, 20.)
30
Comments on Experimental Results
  • Power-aware arbitrators significantly reduce external bus power for all 8 benchmarks, with 14% power savings on average.
  • Power-aware arbitrators reduce bus access delay; delays are reduced by 21% on average across the 8 benchmarks.
  • The pipelined SDRAM model has a big performance advantage over the sequential SDRAM model, achieving 6% power savings and a 12% speedup.
  • Power and delay on the external bus are highly correlated; minimum power also achieves minimum delay.
  • Minimum-power schemes lead to simpler design options; scheme (1, 0) is preferred due to its simplicity.

31
Design of A Power Estimation Unit (PEU)

(Datapath: the PEU stores the open row address of each bank (0-3), the last bank address, and the last column address. The next request address is split into bank, row, and column fields. The bank address is compared with the stored bank state; a mismatch outputs bank-miss power. The row address is compared with the open row of the addressed bank; a mismatch outputs page-miss penalty power and updates the stored address register. The Hamming distance between the new and last column addresses gives the column address/data power. The sum of these terms is the estimated power.)
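The column address/data term above rests on Hamming distance: the number of bus lines that toggle between consecutive values drives switching power. A minimal sketch, with made-up 16-bit address values:

```python
# Hamming-distance switching estimate, as used for the PEU's data/address term.

def hamming(a: int, b: int) -> int:
    """Count the bus lines that toggle between two consecutive bus values."""
    return bin(a ^ b).count("1")

# Example consecutive 16-bit column addresses on the multiplexed bus.
prev_addr, next_addr = 0x00FF, 0x0F0F
print(hamming(prev_addr, next_addr))  # 8 lines toggle
```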
32
Two Arbitrator Implementation Structures
(Two implementation structures. Shared PEU: a request queue buffer holds entries (t, s, b, l); a single Power Estimation Unit evaluates the pending requests against the memory/bus state info one at a time, a comparator selects the minimum-power request, and the access command generator drives the external bus, updating the state after each grant. Dedicated PEU: one PEU per queue entry evaluates all pending requests in parallel ahead of the comparator, trading extra hardware for lower arbitration latency.)
33
Performance of two structures
  • Higher PEU delay lowers external bus performance for both the MPEG-2 encoder and decoder.
  • When the PEU delay is 5 cycles or more, the dedicated structure is preferred over the shared structure; otherwise the shared structure suffices.

34
Summary of Bus Arbitration Schemes
  • Efficient bus arbitration provides benefits to both power and performance over traditional arbitration schemes.
  • Minimum power and minimum delay are highly correlated in external bus behavior.
  • The pipelined SDRAM model has a significant advantage over the sequential SDRAM model.
  • Arbitration scheme (1, 0) is recommended.
  • The minimum-power approach provides more design options and leads to simpler design implementations. The trade-off between design complexity and performance was presented.

35
Outline
  • Research Motivation and Introduction
  • Related Work
  • Power Estimation Framework
  • Optimization I: Power-Aware Bus Arbitration
  • Optimization II: Memory Page Remapping
  • Summary

36
Data Access Pattern in Multimedia Apps
(Figure: address-vs-time plots of three access patterns: fixed stride, 2-way stream, and 2-D stride.)
  • 3 common data access patterns in multimedia applications
  • Majority of cycles are in loop bodies and array accesses
  • High data access bandwidth
  • Poor locality, cross-page references
37
Previous work on Access Pattern
  • Previous work was performance-driven and took an OS/compiler-related approach
  • Data pre-fetching [Chen94, Zhang00]
  • Memory customization [Adve00, Grun01]
  • Data layout optimization [Catthoor98, DeLaLuz04]
  • Shortcomings of OS/compiler-based strategies
  • The dominant activities of multimedia benchmarks are within large monolithic data buffers.
  • Buffers generally contain many memory pages and cannot be further optimized.
  • Constrained by OS and compiler capability; poor flexibility.

38
Optimization II - Page Remapping
  • A technique currently used for peripheral memory access in large memory spaces.
  • External memories in embedded multimedia systems have
  • High bus access overhead
  • Page miss penalties
  • Efficient page remapping can
  • Reduce page misses
  • Improve external bus throughput
  • Reduce power / energy consumption

39
Page Remapping Target Region

(Block diagram: the same SOC architecture, with the SDRAM on the external bus highlighted as the page remapping target region.)
40
SDRAM Memory Pages
  • High memory access latency; minimum latency of one SCLK cycle
  • Page miss penalty
  • Additional latency due to refresh cycles
  • No guaranteed access time due to arbitration logic
  • Non-sequential reads/writes suffer

(Figure: SDRAM organized as banks 0 to M-1 by pages 0 to N-1; X marks show accesses scattered across pages 0-4 of different banks.)
41
SDRAM Page Miss Penalty
(Timing diagram, system clock cycles (SCLK). Top: two read bursts to different pages; each burst pays the P, A (precharge, activate) penalty before its four R (read) commands, and the D (data) words follow with a gap between bursts. Bottom: the second burst hits an open page, so its reads issue immediately and the data words stream back-to-back.)
Legend: N = NOP, D = DATA, P = PRECHARGE, A = ACTIVATE, R = READ
42
SDRAM Timing Parameters
SDRAM parameter | SCLK cycles
trcd | 1-15
trp | 1-7
tras | 1-15
tcas | 2-3

Access type | Number of cycles
Read cycle | trp + n(tcas)
Write cycle | twp
Page miss | trp + trcd
Refresh cycle | 2(trcd) × nrows

twp: write-to-precharge; trp: read-to-precharge; tras: activate-to-precharge; tcas: read latency. A page miss carries an 8-10 SCLK penalty.
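A per-burst latency sketch built from the table above; the trp/trcd/tcas defaults are example values chosen inside the listed ranges, not vendor datasheet timings.

```python
# Burst-read latency sketch: a page miss pays the trp + trcd penalty.

def read_latency(n_words, page_hit, trp=2, trcd=2, tcas=2):
    """SCLK cycles for a burst read; a page miss pays trp + trcd extra."""
    penalty = 0 if page_hit else trp + trcd  # page-miss penalty = trp + trcd
    return penalty + tcas + n_words          # CAS latency, then 1 word/cycle

print(read_latency(4, page_hit=True))   # 6 cycles for a page hit
print(read_latency(4, page_hit=False))  # 10 cycles with the page-miss penalty
```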
43
SDRAM Page Access Sequence (I)
(Figure: 12 reads across 4 banks, clustered on pages 0, 1, and 3; because consecutive accesses keep switching pages within the same banks, the command stream is P A R repeated twelve times, i.e., every read pays a precharge/activate page miss.)
Typical access pattern of 2-D stride / 2-way stream; poor data layout causes significant access overhead.
P = Precharge, A = Activation, R = Read
44
SDRAM Page Access Sequence (II)
(Figure: the same 12 reads with the data distributed so all accesses fall on page 0 of the 4 banks; only four P A R page openings are needed, and the remaining reads issue back-to-back.)
Less access overhead with a distributed data layout.
P = Precharge, A = Activation, R = Read
45
Why we use Page Remapping

(Figure: a hot page (Page 2) has accesses in all four banks; its page remapping entry (2, 0, 1, 3) permutes the bank assignment so conflicting accesses are redirected to different banks.)
46
Module in an SOC System
  • Address translation unit; it only translates the bank address
  • A non-MMU system inserts a page remapping module before the EBIU
  • An MMU system can take advantage of the existing address translation unit; no extra hardware is needed

(Diagram: internal bus → page remapping module → EBIU → external bus → SDRAM, FLASH memory, asynchronous devices.)
47
Sequence (I) after Remapping
(Figure: after remapping, the 12 reads from Sequence I are spread across banks so only four P A R page openings are needed, with the remaining reads issuing back-to-back.)
Same performance as Sequence II. Applicable to monolithic data buffers (e.g., frame buffers).
P = Precharge, A = Activation, R = Read
48
Page Remapping Algorithm
  • An NP-complete problem.
  • Reducible to a graph coloring problem on a page transition graph G(V, E).
  • Vertex: page I(m,n)
  • m: page bank number
  • n: page row number
  • Edge: transition from page I(m,n) to page I(p,q)
  • Weighted edges capture page traversals during program execution
  • The edge weight is the number of transitions from page I(m,n) to page I(p,q)
  • Color: bank
  • Each bank has one distinct color.
  • Every page is assigned one color.

49
Page Remapping Algorithm (continued)
  • Page remapping algorithm
  • From the page transition graph, find the color (bank) assignment for each page such that the transition cost between same-color pages is minimized.
  • Algorithm steps
  • Sort the edges by transition weight
  • Process the edges in decreasing weight order
  • Color the pages associated with each edge
  • A weight parameter array for each page represents the cost of mapping that page into each bank
  • e.g., (500, 200, 0, 0)
  • 5 different situations can arise when processing each edge
  • A page remapping table (PMT) is generated as the result of the mapping.
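The steps above can be sketched as a greedy edge-driven coloring. This is a simplified illustration only: it ignores slot-occupancy conflicts within a bank row and the five per-edge situations handled by the full algorithm.

```python
# Greedy page-to-bank coloring sketch: process page-transition edges in
# decreasing weight order and split each edge's pages across banks.

def remap(edges, n_banks=4):
    """edges: (weight, page_a, page_b) tuples. Returns page -> bank map."""
    bank = {}
    for _, a, b in sorted(edges, reverse=True):   # decreasing weight order
        for page, other in ((a, b), (b, a)):
            if page not in bank:
                used = {bank[other]} if other in bank else set()
                # choose the first bank the neighbouring page does not use
                bank[page] = next(c for c in range(n_banks) if c not in used)
    return bank

edges = [(500, "I0,0", "I0,1"), (200, "I1,1", "I1,2"), (100, "I0,0", "I3,1")]
mapping = remap(edges)
print(mapping)
assert mapping["I0,0"] != mapping["I0,1"]  # heaviest edge split across banks
```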

50
Example Case
(Left: original page allocation — Bank 0: I0,0 (page 0), I0,1 (page 1); Bank 1: I1,1 (page 1), I1,2 (page 2), I1,3 (page 3); Bank 2: I2,1 (page 1); Bank 3: I3,1 (page 1). Right: page transition graph with edge weights — (I0,0, I0,1): 500, (I1,1, I1,2): 200, (I0,0, I3,1): 100, (I1,2, I2,1): 80, (I3,1, I1,3): 60, (I1,1, I3,1): 50, (I2,1, I1,3): 40, (I0,0, I1,1): 30.)
51
Initial Step
(Empty 4-bank × 4-page grid: no page is mapped, all slots are available.)
52
Step (1) two unmapped pages
Selected edge: (I0,0, I0,1), weight 500.
Actions: allocate unmapped pages I0,0 and I0,1.
Weight parameter updates:
  I0,0 → bank 0: (0, 500, 0, 0)
  I0,1 → bank 1: (500, 0, 0, 0)
53
Step (2) two unmapped pages
Selected edge: (I1,1, I1,2), weight 200.
Actions: allocate unmapped pages I1,1 and I1,2.
Weight parameter updates:
  I1,1 → bank 0: (0, 200, 0, 0)
  I1,2 → bank 1: (200, 0, 0, 0)
54
Step (3) one unmapped page
Selected edge: (I0,0, I3,1), weight 100.
Actions: map page I3,1; no change for I0,0.
Weight parameter updates:
  I3,1 → bank 2: (100, 0, 0, 0)
  I0,0 → bank 0: (0, 500, 100, 0)
55
Step (4) one unmapped page
Selected edge: (I1,2, I2,1), weight 80.
Actions: map page I2,1; no change for I1,2.
Weight parameter updates:
  I2,1 → bank 3: (0, 80, 0, 0)
  I1,2 → bank 1: (200, 0, 0, 80)
56
Step (5) one unmapped page
Selected edge: (I3,1, I1,3), weight 60.
Actions: map page I1,3; no change for I3,1.
Weight parameter updates:
  I1,3 → bank 0: (0, 0, 60, 0)
  I3,1 → bank 2: (160, 0, 0, 0)
57
Step (6) same row pages
Selected edge: (I1,1, I3,1), weight 50.
Actions: both I1,1 and I3,1 are on the same row; no actions.
58
Step (7) two mapped pages
Selected edge: (I2,1, I1,3), weight 40.
Actions: both I2,1 and I1,3 are mapped; no conflicts.
Weight parameter updates:
  I1,3 → bank 0: (0, 0, 60, 40)
  I2,1 → bank 3: (40, 80, 0, 0)
59
Step (8) conflict resolving
Selected edge: (I0,0, I1,1), weight 30.
Actions: both I0,0 and I1,1 are mapped and in the same bank.
Current weight parameters:
  I0,1 → bank 1: (500, 0, 0, 0)
  I1,1 → bank 0: (30, 200, 0, 0)
  I2,1 → bank 3: (40, 80, 0, 0)
  I3,1 → bank 2: (160, 0, 0, 0)
Updated: I0,0 stays in bank 0 with weights (0, 500, 100, 30); the conflict between I0,0 and I1,1 is resolved by reassigning the page-1 entries across banks (final row-1 layout: I3,1, I2,1, I0,1, I1,1 in banks 0-3). No conflict remains.
60
Generated PMT table
(Figure: the generated page remapping table (PMT). A memory page address (14 bits) from the I-cache/D-cache miss path indexes the 4 kB PMT, which stores a 2-bit remapped bank address per page (e.g., I0,0 → 00, I0,1 → 10, I1,1 → 11, I1,2 → 01, I1,3 → 00, I2,1 → 00, I3,1 → 01). The remapped bank address (2 bits) is concatenated with the row/column address (22 bits) to form the external memory address sent to the EBIU for the 16 MB external SDRAM.)
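A hypothetical sketch of the PMT lookup described above, assuming the 2 bank bits sit at the top of the 24-bit address and the 14-bit page index is taken from the high address bits; the exact field positions are illustrative, not the BF533 layout.

```python
# Hypothetical PMT lookup: replace the bank bits of a 24-bit external
# address (16 MB SDRAM) with the table entry for its 14-bit page index.

BANK_SHIFT = 22   # assumed position of the 2-bit bank field
PAGE_SHIFT = 10   # assumed 1 kB pages -> 14-bit page index

def remap_address(addr: int, pmt: list) -> int:
    """Replace the bank bits of `addr` with the PMT entry for its page."""
    page = addr >> PAGE_SHIFT                  # 14-bit page index
    new_bank = pmt[page] & 0b11                # 2-bit remapped bank address
    row_col = addr & ((1 << BANK_SHIFT) - 1)   # 22-bit row/column address
    return (new_bank << BANK_SHIFT) | row_col

pmt = [0] * (1 << 14)        # one 2-bit entry per page (a 4 kB table)
addr = (1 << 22) | 0x345     # example access, originally in bank 1
pmt[addr >> PAGE_SHIFT] = 2  # PMT: remap this page to bank 2
print(hex(remap_address(addr, pmt)))
```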
61
Experimental Setup
  • Utilized embedded power modeling framework
  • Extended address translation unit for page
    remapping
  • Page coloring program to generate PMT
  • Same 10 Multimedia application benchmarks
  • MPEG-2 encoder and decoder
  • H.264 encoder and decoder
  • JPEG encoder and decoder
  • PGP encoder and decoder
  • G.721 encoder and decoder

62
Page Miss Reduction
63
External Bus Power
64
Average Access Delay
65
Comments of Page Remapping
  • The page remapping algorithm was presented by example.
  • Our algorithm significantly reduces the memory page miss rate, by 70-80% on average.
  • For a 4-bank SDRAM memory system, we reduced external memory access time by 12.6%.
  • The proposed algorithm reduces power consumption in the majority of the benchmarks, by 13.2% on average.
  • Combining the effects of both power and delay, our algorithm significantly benefits the total energy cost.
  • A stability study was done in the dissertation: a PMT generated from one test input vector performs well on different inputs.

66
Outline
  • Research Motivation and Introduction
  • Related Work
  • Power Estimation Framework
  • Optimization I: Power-Aware Bus Arbitration
  • Optimization II: Memory Page Remapping
  • Summary

67
Summary
  • Reviewed the issues of external bus power in a system-on-a-chip (SOC) embedded system.
  • Built an external bus power estimation framework and experimental methodology. [PACS04]
  • Proposed a series of power-aware bus arbitration schemes and showed their performance improvement over traditional schemes. [HiPEAC05]; also appeared in LNCS Transactions on High-Performance Embedded Architectures and Compilers.
  • Proposed a page remapping algorithm to reduce page misses and showed its power and delay improvements. [LCTES07]

68
Future Work
  • Integration of the power estimation framework into a complete tool chain
  • Extend arbitration schemes to multiple memory interfaces and other peripheral interfaces
  • Compare the performance of page remapping with corresponding OS/compiler schemes

69
  • Thank You !