Title: System-Level Memory Bus Power And Performance Optimization for Embedded Systems
1. System-Level Memory Bus Power and Performance Optimization for Embedded Systems
- Ke Ning (kning@ece.neu.edu)
- David Kaeli (kaeli@ece.neu.edu)
2. Why Is Power More Important?
- "Power: A First-Class Design Constraint for Future Architectures" [Trevor Mudge, 2001]
- Increasing complexity for higher performance (MIPS)
  - Parallelism, pipelining, memory/cache size
  - Higher clock frequency, larger die size
  - Rising dynamic power consumption
- CMOS process continues to shrink
  - Smaller logic gates reduce V_threshold
  - Lower V_threshold leads to higher leakage
  - Leakage power will exceed dynamic power
- Things get worse in embedded systems
  - Low-power and low-cost systems
  - Fixed or limited applications/functionality
  - Real-time systems with timing constraints
3. Power Breakdown of an Embedded System
[Pie chart of system power breakdown with the research target highlighted. Conditions: 25°C; 1.2 V internal at 400 MHz CCLK (Blackfin processor); 3.3 V external at 133 MHz (SDRAM) and 27 MHz (PPI). Source: Analog Devices Inc.]
4. Introduction
- Related work on microprocessor power
  - Low-power design trends
  - Power metrics
  - Power/performance tradeoffs
  - Power optimization techniques
- Power estimation framework
  - Experimental framework built on a cycle-accurate Blackfin simulator
  - Validated against a Blackfin EZ-Kit board
- Power-aware bus arbitration
- Memory page remapping
5. Outline
- Research Motivation and Introduction
- Related Work
- Power Estimation Framework
- Optimization I: Power-Aware Bus Arbitration
- Optimization II: Memory Page Remapping
- Summary
6. Power Modeling
- Dynamic power estimation
  - Instruction-level models: [Tiwari94], JouleTrack [Sinha01]
  - Function-level model: [Qu00]
  - Architecture models: Cai-Lim model / TEMPEST [CaiLim99], Wattch [Brooks00], SimplePower [Ye00]
- Static power estimation
  - Butts-Sohi model [Butts00]
- Previous memory system power estimation
  - Activity model: CACTI [Wilton96]
  - Trace-driven model: Dinero IV [Elder98]
7. Power Equation

P = P_dynamic + P_leakage = A · C · V_DD² · f + V_DD · N · k_design · I_leakage

where:
- A: activity factor
- C: total capacitance
- V_DD: supply voltage
- f: frequency
- N: transistor count
- k_design: technology/design factor
- I_leakage: normalized leakage current
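As a quick sanity check on the equation above, the two terms can be evaluated directly. The numeric values in the test below are illustrative assumptions, not measurements from this work:

```python
# Sketch of the slide's power equation:
#   P_dynamic = A * C * Vdd^2 * f             (switching power)
#   P_leakage = Vdd * N * k_design * I_leak   (Butts-Sohi style leakage)

def total_power(activity, capacitance, vdd, freq, n_transistors, k_design, i_leak):
    """Return (dynamic, leakage, total) power in watts."""
    p_dynamic = activity * capacitance * vdd ** 2 * freq
    p_leakage = vdd * n_transistors * k_design * i_leak
    return p_dynamic, p_leakage, p_dynamic + p_leakage
```

Note how dynamic power scales quadratically with V_DD while leakage scales only linearly, which is why voltage scaling is so effective against the dynamic term.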
8. Common Power Optimization Techniques
- Gating (turn off unused components)
  - Clock gating
  - Voltage gating: cache decay [Hu01]
- Scaling (scale a component's operating point)
  - Voltage scaling: drowsy cache [Flautner02]
  - Frequency scaling [Pering98]
  - Resource scaling: DRAM power modes [Delaluz01]
- Banking (break a single component into smaller sub-units)
  - Vertical sub-banking: filter cache [Kin97]
  - Horizontal sub-banking: scratchpad [Kandemir01]
- Clustering (partition components into clusters)
- Switching reduction (redesign for lower activity)
  - Bus encoding: permutation codes [Mehta96], redundant codes [Stan95, Benini98], WZE [Musoll97]
9. Power-Aware Figures of Merit
- Delay, D
  - Performance, MIPS
- Power, P
  - Battery life (mobile), packaging (high performance)
- Obvious choice for the power-performance tradeoff: P·D
  - Joules/instruction, or inversely MIPS/W
  - An energy figure
  - Mobile / low-power applications
- Energy-Delay: P·D²
  - MIPS²/W [Gonzalez96]
- Energy-Delay-Square: P·D³
  - MIPS³/W
  - Voltage- and frequency-independent
- More generically, MIPS^m/W
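The choice of exponent m changes which design wins a comparison. A minimal sketch, using two hypothetical design points (not data from this work):

```python
# Illustrative comparison of the MIPS^m / W family of figures of merit.

def figure_of_merit(mips, watts, m):
    """Generic MIPS^m / W metric: m=1 -> MIPS/W, m=2 -> MIPS^2/W, m=3 -> MIPS^3/W."""
    return mips ** m / watts

# Two hypothetical designs: B is 25% faster but burns 60% more power.
a = {"mips": 400, "watts": 1.0}
b = {"mips": 500, "watts": 1.6}

# m=1 and m=2 favor the frugal design A; m=3 weights speed enough for B to win.
for m in (1, 2, 3):
    fa = figure_of_merit(a["mips"], a["watts"], m)
    fb = figure_of_merit(b["mips"], b["watts"], m)
    print(f"m={m}: A={fa:.3g}  B={fb:.3g}  winner={'A' if fa > fb else 'B'}")
```

This is why the higher-m metrics suit high-performance parts: they discount power in favor of raw speed.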
10. Power Optimization Effects on the Power Figure
- Most optimization schemes sacrifice performance for lower power consumption, except switching reduction.
- All optimization schemes yield higher power efficiency.
- All optimization schemes increase hardware complexity.
11. Outline
- Research Motivation and Introduction
- Related Work
- Power Estimation Framework
- Optimization I: Power-Aware Bus Arbitration
- Optimization II: Memory Page Remapping
- Summary
12. External Bus
- External bus components
  - Typically an off-chip bus
  - Includes control bus, address bus, data bus
- External bus power consumption
  - Dynamic power factors: activity, capacitance, frequency, voltage
  - Leakage power factors: supply voltage, threshold voltage, CMOS technology
- Different from internal memory bus power
  - Longer physical distance, higher bus capacitance, lower speed
  - Cross-line interference, higher leakage current
  - Different communication protocols (memory/peripheral dependent)
  - Multiplexed row/column address bus, narrower data bus
13. Embedded SOC System Architecture
[Block diagram: the media processor core (with data cache and instruction cache) and the system DMA controller (Memory DMA 0/1, PPI DMA, SPORT DMA) sit on the internal bus; the External Bus Interface Unit (EBIU) bridges to the external bus, which connects SDRAM, FLASH memory, and asynchronous devices. Peripherals include an NTSC/PAL encoder (S-Video/CVBS), a streaming interface, and a NIC. The power modeling area covers the EBIU and the external bus.]
14. ADSP-BF533 EZ-Kit Lite Board
[Board diagram: BF533 Blackfin processor with SDRAM memory and FLASH memory; video in/out through a video codec/ADV converter; audio in/out through an audio codec/AD converter; SPORT data I/O.]
15. External Bus Power Estimator
- Previous approaches
  - Used Hamming distance [Benini98]
  - Control signals were not considered
  - Shared row and column address bus
  - Memory state transitions were not considered
- In our estimator
  - Integrate memory control signal power into the model
  - Consider the case where row and column addresses are shared
  - Memory state transitions and stalls also cost power
  - Consider page miss penalties and traffic reversal penalties

P(bus) = P(page miss) + P(bus turnaround) + P(control signal) + P(address generation) + P(data transmission) + P(leakage)
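The estimator's decomposition above is purely additive, so its skeleton is simple. A minimal sketch, with placeholder component names; in the real framework each term would come from bus activity traces and the SDRAM timing model:

```python
# Additive decomposition of external bus power:
# P(bus) = P(page miss) + P(turnaround) + P(control) + P(address) + P(data) + P(leakage)

def bus_power(components):
    """Sum the six bus power components; reject incomplete inputs."""
    required = {"page_miss", "turnaround", "control", "address", "data", "leakage"}
    missing = required - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(components[k] for k in required)
```

Keeping each term separate (rather than one lumped number) is what lets the later arbitration schemes reason about which component a scheduling decision affects.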
16. Two External Bus SDRAM Timing Models
[Timing diagrams over system clock cycles (SCLK); command legend: P = PRECHARGE, A = ACTIVATE, N = NOP, R = READ.
(a) Sequential command mode: the bank 0 request (P A R R R R N N N N) completes before the bank 1 request (P A R R N N N N) begins.
(b) Pipelined command mode: bank 1's P/A commands overlap bank 0's reads, eliminating the NOP gaps between requests.]
17. Bus Power Simulation Framework
[Flow diagram: a program is compiled to a target binary and run on the instruction-level simulator; together with the memory hierarchy model, it drives the memory trace generator. The trace feeds the external bus power estimator, which combines the memory power model and the memory technology timing model to produce bus power. The highlighted modules are the software we developed.]
18. Multimedia Benchmark Configurations

Name | Description | I-Cache Size | D-Cache Size
MPEG2-ENC | MPEG-2 video encoder with 720x480 4:2:0 input frames. | 16k | 16k
MPEG2-DEC | MPEG-2 video decoder of a 720x480 sequence with 4:2:2 CCIR frame output. | 16k | 16k
H264-ENC | H.264/MPEG-4 Part 10 (AVC) digital video encoder achieving very high data compression. | 16k | 16k
H264-DEC | H.264/MPEG-4 Part 10 (AVC) video decompression algorithm. | 16k | 16k
JPEG-ENC | JPEG image encoder for a 512x512 image. | 8k | 8k
JPEG-DEC | JPEG image decoder for a 512x512 image. | 8k | 8k
PGP-ENC | Pretty Good Privacy encryption and digital signature of a text message. | 8k | 4k
PGP-DEC | Pretty Good Privacy decryption of an encrypted message. | 8k | 4k
G721-ENC | G.721 voice encoder of 16-bit input audio samples. | 4k | 2k
G721-DEC | G.721 voice decoder of encoded bits. | 4k | 2k
19. Outline
- Research Motivation and Introduction
- Related Work
- Power Estimation Framework
- Optimization I: Power-Aware Bus Arbitration
- Optimization II: Memory Page Remapping
- Summary
20. Optimization I: Bus Arbitration
- Multiple bus-access masters in an SOC system
  - Processor cores
  - Data/instruction caches
  - DMA
  - ASIC modules
- Multimedia applications
  - High bus bandwidth requirements
  - Large memory footprint
- An efficient arbitration algorithm can
  - Increase power awareness
  - Increase bus throughput
  - Reduce bus power
21. Bus Arbitration Target Region
[Same SOC block diagram as before, with the EBIU (arbitration enabled) highlighted as the target region between the internal bus masters and the external bus devices (SDRAM, FLASH memory, asynchronous devices).]
22. Bus Arbitration Schemes
- EBIU with arbitration enabled
  - Handles core-to-memory and core-to-peripheral communication
  - Resolves bus access contention
  - Schedules bus access requests
- Traditional algorithms
  - First Come First Serve (FCFS)
  - Fixed Priority
- Power-aware algorithms (categorized by power metric / cost function)
  - Minimum power: (P1D0) or (1, 0)
  - Minimum delay: (P0D1) or (0, 1)
  - Minimum power-delay product: (P1D1) or (1, 1)
  - Minimum power-delay-square product: (P1D2) or (1, 2)
  - More generically: (PnDm) or (n, m)
23. Bus Arbitration Schemes (Continued)
- Power-aware arbitration
  - From the current pending requests in the waiting queue, find a permutation of the external bus requests that achieves the minimum total power and/or performance cost.
  - Reducible to the minimum Hamiltonian path problem on a graph G(V, E).
- Vertex: request R(t, s, b, l)
  - t: request arrival time
  - s: starting address
  - b: block size
  - l: read/write
- Edge: transition between requests i and j
  - Edge weight w(i, j) is the cost of the transition
24. Minimum Hamiltonian Path Problem
- R0: the last request on the bus; it must be the starting point of any path.
- R1, R2, R3: requests in the queue.
- w(i, j) = P(i, j)^n · D(i, j)^m, where P(i, j) is the power of Rj issued after Ri and D(i, j) is the delay of Rj issued after Ri.
- Example: the Hamiltonian path R0 → R3 → R1 → R2 has minimum path weight w(0,3) + w(3,1) + w(1,2).
- An NP-complete problem.
[Graph: R0, R1, R2, R3 fully connected by directed weighted edges w(i, j).]
25. Greedy Solution
- Greedy algorithm (local minimum): only the next request in the path is needed, min_j w(0, j), where w(i, j) is the edge weight of graph G(V, E).
- In each arbitration iteration:
  1. A new graph G(V, E) is constructed.
  2. The greedy-solution request is arbitrated onto the bus.
[Graph: the same request graph as the previous slide, with only R0's outgoing edges compared.]
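The greedy step above reduces to one `min` over the pending queue. A minimal sketch; the cost functions here are stand-ins (the real arbiter derives power and delay from the SDRAM/bus state), and the bare-address requests simplify the slide's R(t, s, b, l) tuples:

```python
# Greedy power-aware arbitration: pick the pending request with the lowest
# transition cost from the last-serviced request, w(i, j) = P(i, j)^n * D(i, j)^m.
# The default (n, m) = (1, 0) is the minimum-power scheme recommended later.

def greedy_arbitrate(last, pending, power_fn, delay_fn, n=1, m=0):
    """Return the pending request with minimum w(last, j)."""
    def weight(req):
        return power_fn(last, req) ** n * delay_fn(last, req) ** m
    return min(pending, key=weight)

# Hypothetical cost model: staying within the same 4 kB page is cheap,
# crossing pages incurs a page-miss cost.
same_page_power = lambda a, b: 1 if (a >> 12) == (b >> 12) else 10
unit_delay = lambda a, b: 1
```

Changing (n, m) reweights the same queue, so one routine covers all the (PnDm) schemes listed on slide 22.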
26. Experimental Setup
- Utilized the embedded power modeling framework
- Implemented eleven different arbitration schemes inside the EBIU
  - FCFS, Fixed Priority
  - Minimum power (P1D0) or (1, 0), minimum delay (P0D1) or (0, 1), and (1,1), (1,2), (2,1), (1,3), (3,1), (3,2), (2,3)
- 10 multimedia application benchmarks were ported to the Blackfin architecture and simulated, including MPEG-2, H.264, JPEG, PGP, and G.721.
27. Power Improvement
- Power-aware arbitration schemes have lower power consumption than Fixed Priority and FCFS.
- The difference across power-aware arbitration strategies is small.
- The pipelined command model yields 6-7% savings over the sequential command model for the MPEG-2 encoder and decoder.
- The results are consistent across all other benchmarks.
28. Speed Improvement
- Power-aware schemes have lower bus delay than traditional Fixed Priority and FCFS.
- The difference across power-aware arbitration strategies is small.
- The pipelined command model yields a 3-9% speedup over the sequential command model for the MPEG-2 encoder and decoder.
- The results are consistent across all other benchmarks.
29. Comparison with the Exhaustive Algorithm
- The greedy algorithm can fail in certain cases.
- Complexity: O(n) vs. O(n!).
- The performance difference is negligible.
[Example graph: starting from R0, exhaustive search finds the globally cheapest path while greedy search follows the locally cheapest edge; the edge weights (5, 7, 15, 17, 18, 20) illustrate a case where the two paths differ.]
30. Comments on Experimental Results
- Power-aware arbiters significantly reduce external bus power for all 8 benchmarks, with 14% power savings on average.
- Power-aware arbiters reduce bus access delay by 21% on average across the 8 benchmarks.
- The pipelined SDRAM model has a large performance advantage over the sequential SDRAM model, achieving 6% power savings and a 12% speedup.
- Power and delay on the external bus are highly correlated; minimum power also achieves minimum delay.
- Minimum-power schemes lead to simpler design options; scheme (1, 0) is preferred for its simplicity.
31. Design of a Power Estimation Unit (PEU)
[Block diagram: the PEU keeps the open row address of each bank (banks 0-3), the last bank address, and the last column address. For the next request address: if the bank address differs from the last bank address, output bank-miss power; if the row address differs from that bank's open row address, output page-miss penalty power and update the register; the column address contributes data power computed from the Hamming distance to the last column address (which is then updated). The sum is the estimated power.]
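The PEU's per-request estimate described above can be sketched in a few lines. The penalty constants are illustrative assumptions, not the values used in the thesis hardware:

```python
# Sketch of a PEU estimate for one request, assuming (as in the slide)
# per-bank open-row registers and Hamming-distance switching power on the
# column address / data lines.

BANK_MISS_POWER = 10   # hypothetical cost units
PAGE_MISS_POWER = 8
BIT_TOGGLE_POWER = 1

def hamming(a, b):
    """Number of differing bits between two addresses."""
    return bin(a ^ b).count("1")

def estimate_power(open_rows, last_bank, last_col, bank, row, col):
    """Return (estimated power, new last-column address); updates open_rows."""
    power = 0
    if bank != last_bank:                 # bank miss
        power += BANK_MISS_POWER
    if open_rows.get(bank) != row:        # page (row) miss in this bank
        power += PAGE_MISS_POWER
        open_rows[bank] = row             # the slide updates the row register
    power += BIT_TOGGLE_POWER * hamming(last_col, col)
    return power, col
```

The comparator of slide 32 would simply run this over every queued request and pick the minimum.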
32. Two Arbiter Implementation Structures
[Diagrams of two structures, both fed with memory/bus state info and a request queue buffer of (t, s, b, l) entries. Shared-PEU structure: a single Power Estimation Unit evaluates the queued requests in turn; a comparator selects the minimum-power request for the access command generator driving the external bus. Dedicated-PEU structure: one PEU per queue entry evaluates all requests in parallel before the comparator.]
33. Performance of the Two Structures
- Higher PEU delay lowers external bus performance for both the MPEG-2 encoder and decoder.
- When the PEU delay is 5 cycles or more, the dedicated structure is preferred over the shared structure; otherwise, the shared structure suffices.
34. Summary of Bus Arbitration Schemes
- Efficient bus arbitration provides both power and performance benefits over traditional arbitration schemes.
- Minimum power and minimum delay are highly correlated in external bus performance.
- The pipelined SDRAM model has a significant advantage over the sequential SDRAM model.
- Arbitration scheme (1, 0) is recommended.
- The minimum-power approach provides more design options and leads to simpler design implementations. The trade-off between design complexity and performance was presented.
35. Outline
- Research Motivation and Introduction
- Related Work
- Power Estimation Framework
- Optimization I: Power-Aware Bus Arbitration
- Optimization II: Memory Page Remapping
- Summary
36. Data Access Patterns in Multimedia Apps
- Three common data access patterns in multimedia applications: fixed stride, 2-way stream, and 2-D stride.
- The majority of cycles are spent in loop bodies and array accesses.
- High data access bandwidth.
- Poor locality; cross-page references.
[Figure: address-versus-time plots of the fixed-stride, 2-way-stream, and 2-D-stride patterns.]
37. Previous Work on Access Patterns
- Previous work was performance-driven and took OS/compiler-related approaches
  - Data prefetching [Chen94, Zhang00]
  - Memory customization [Adve00, Grun01]
  - Data layout optimization [Catthoor98, DeLaLuz04]
- Shortcomings of OS/compiler-based strategies
  - Multimedia benchmarks' dominant activities occur within large monolithic data buffers.
  - Buffers generally span many memory pages and cannot be further optimized.
  - Constrained by OS and compiler capabilities; poor flexibility.
38. Optimization II: Page Remapping
- A technique currently used for large-memory-space peripheral memory access.
- External memories in embedded multimedia systems suffer
  - High bus access overhead
  - Page miss penalties
- Efficient page remapping can
  - Reduce page misses
  - Improve external bus throughput
  - Reduce power/energy consumption
39. Page Remapping Target Region
[Same SOC block diagram as before, with the EBIU and external SDRAM highlighted as the page remapping target region.]
40. SDRAM Memory Pages
- High memory access latency; minimum latency of one SCLK cycle.
- Page miss penalty.
- Additional latency due to refresh cycles.
- No guaranteed access time due to arbitration logic.
- Non-sequential reads/writes suffer most.
[Figure: an M-bank by N-page grid; X marks show which pages are open in each bank.]
41. SDRAM Page Miss Penalty
[Timing diagrams over system clock cycles (SCLK); legend: P = PRECHARGE, A = ACTIVATE, R = READ, N = NOP, D = DATA. Top: two consecutive page misses, each requiring P A before its four reads, so the data bursts are separated by the miss penalty. Bottom: the second access hits the open page, so its reads issue immediately and the data bursts run back-to-back.]
42. SDRAM Timing Parameters

SDRAM parameter | SCLK cycles
t_rcd | 1-15
t_rp | 1-7
t_ras | 1-15
t_cas | 2-3

Access type | Number of cycles
Read cycle | t_rp + n(t_cas)
Write cycle | t_wp
Page miss | t_rp + t_rcd
Refresh cycle | 2(t_rcd) · n_rows

Legend: t_wp = write to precharge; t_rp = read to precharge; t_ras = activate to precharge; t_cas = read latency. A page miss carries an 8-10 SCLK penalty.
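The access-cycle formulas in the table translate directly into code. A minimal sketch using the slide's parameter names; the default-free signatures leave the ranges above to the caller, and the refresh formula is transcribed as written on the slide:

```python
# Access latencies in SCLK cycles, per the SDRAM timing table.

def read_cycle(trp, tcas, n_words):
    """Read: trp + n(tcas) for an n-word burst."""
    return trp + n_words * tcas

def page_miss_penalty(trp, trcd):
    """Page miss: precharge (trp) plus row activate (trcd)."""
    return trp + trcd

def refresh_cycle(trcd, n_rows):
    """Refresh: 2(trcd) * nrows, as given on the slide."""
    return 2 * trcd * n_rows
```

With t_rp = 3 and t_rcd = 5, for example, a page miss costs 8 SCLKs, at the low end of the 8-10 SCLK penalty quoted above.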
43. SDRAM Page Access Sequence (I)
[Figure: 12 reads spread across 4 banks and 4 pages such that consecutive reads land on different pages; every read incurs a full P A R (precharge, activate, read) sequence, i.e., 12 page misses. This is the typical access pattern of 2-D stride / 2-way stream: poor data layout causes significant access overhead. P = Precharge, A = Activation, R = Read.]
44. SDRAM Page Access Sequence (II)
[Figure: the same 12 reads laid out so each bank's accesses stay on a single open page; only 4 P A R sequences (one per bank) are needed, and the remaining reads hit open pages. A distributed data layout yields much less access overhead. P = Precharge, A = Activation, R = Read.]
45. Why We Use Page Remapping
[Figure: the data of page 2 is spread across banks 0-3; applying the page remapping entry (2, 0, 1, 3) for page 2 permutes its bank assignment so that consecutive accesses fall on different banks and avoid page conflicts.]
46. Module in an SOC System
- The address translation unit only translates bank addresses.
- A non-MMU system inserts a page remapping module before the EBIU.
- An MMU system can take advantage of the existing address translation unit; no extra hardware is needed.
[Diagram: internal bus → page remapping module → EBIU → external bus → SDRAM, FLASH memory, asynchronous devices.]
47. Sequence (I) after Remapping
[Figure: the 12 reads of sequence (I) after page remapping; only 4 P A R sequences (one per bank) remain, with the other reads hitting open pages. Same performance as sequence (II). Applicable to monolithic data buffers (e.g., frame buffers). P = Precharge, A = Activation, R = Read.]
48. Page Remapping Algorithm
- An NP-complete problem.
- Reducible to a graph coloring problem on a page transition graph G(V, E).
- Vertex: page I(m, n)
  - m: page bank number
  - n: page row number
- Edge: transition from page I(m, n) to I(p, q)
  - Weighted edges capture page traversal during program execution
  - The edge weight is the number of transitions from page I(m, n) to page I(p, q)
- Color: bank
  - Each bank has one distinct color.
  - Every page is assigned one color.
49. Page Remapping Algorithm (Continued)
- Page remapping algorithm
  - From the page transition graph, find a color (bank) assignment for each page such that the transition cost between same-color pages is minimized.
- Algorithm steps
  - Sort the edges by transition weight
  - Process the edges in decreasing weight order
  - Color the pages associated with each edge
  - A weight parameter array for each page represents the cost of mapping that page into each bank, e.g., [500, 200, 0, 0]
  - There are 5 different situations when processing an edge
  - The page remapping table (PMT) is generated as the result of the mapping.
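The steps above can be sketched as a greedy coloring pass. This is a simplified stand-in for the full 5-case algorithm: it handles only the "place an unmapped page in its cheapest allowed bank" cases, and its tie-breaking may differ from the worked example on the following slides:

```python
# Greedy sketch of the page-coloring step: process edges of the page
# transition graph in decreasing weight order, keeping per page a cost
# array of mapping it into each bank (the slide's "weight parameter array").

def remap_pages(edges, n_banks=4):
    """edges: list of (page_a, page_b, weight). Returns {page: bank}."""
    cost = {}       # page -> per-bank transition cost array
    mapping = {}    # page -> assigned bank
    for a, b, w in sorted(edges, key=lambda e: -e[2]):
        for p in (a, b):
            cost.setdefault(p, [0] * n_banks)
        for p, other in ((a, b), (b, a)):
            if p not in mapping:
                banned = mapping.get(other)            # avoid the partner's bank
                choices = [k for k in range(n_banks) if k != banned]
                mapping[p] = min(choices, key=lambda k: cost[p][k])
        # future edges now see the cost of colliding with these pages
        cost[a][mapping[b]] += w
        cost[b][mapping[a]] += w
    return mapping
```

Because edges are visited heaviest-first, the most frequent page transitions are the first to be guaranteed different banks, which is the core idea of the algorithm.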
50. Example Case
Original page allocation (banks 0-3 × pages 0-3): I0,0 at (bank 0, page 0); I0,1, I1,1, I2,1, I3,1 at page 1 in banks 0-3; I1,2 at (bank 1, page 2); I1,3 at (bank 1, page 3).
Page transition graph edge weights: (I0,0, I0,1): 500; (I1,1, I1,2): 200; (I0,0, I3,1): 100; (I1,2, I2,1): 80; (I3,1, I1,3): 60; (I1,1, I3,1): 50; (I2,1, I1,3): 40; (I0,0, I1,1): 30.
51. Initial Step
[Empty 4-bank × 4-page grid: no page is mapped; all slots are available.]
52. Step (1): Two Unmapped Pages
- Selected edge: (I0,0, I0,1), weight 500.
- Actions: allocate the unmapped pages I0,0 (→ bank 0) and I0,1 (→ bank 1).
- Weight parameter updates: I0,0 @ bank 0: [0, 500, 0, 0]; I0,1 @ bank 1: [500, 0, 0, 0].
53. Step (2): Two Unmapped Pages
- Selected edge: (I1,1, I1,2), weight 200.
- Actions: allocate the unmapped pages I1,1 (→ bank 0) and I1,2 (→ bank 1).
- Weight parameter updates: I1,1 @ bank 0: [0, 200, 0, 0]; I1,2 @ bank 1: [200, 0, 0, 0].
54. Step (3): One Unmapped Page
- Selected edge: (I0,0, I3,1), weight 100.
- Actions: map I3,1 (→ bank 2); no change for I0,0.
- Weight parameter updates: I3,1 @ bank 2: [100, 0, 0, 0]; I0,0 @ bank 0: [0, 500, 100, 0].
55. Step (4): One Unmapped Page
- Selected edge: (I1,2, I2,1), weight 80.
- Actions: map I2,1 (→ bank 3); no change for I1,2.
- Weight parameter updates: I2,1 @ bank 3: [0, 80, 0, 0]; I1,2 @ bank 1: [200, 0, 0, 80].
56. Step (5): One Unmapped Page
- Selected edge: (I3,1, I1,3), weight 60.
- Actions: map I1,3 (→ bank 0); no change for I3,1.
- Weight parameter updates: I1,3 @ bank 0: [0, 0, 60, 0]; I3,1 @ bank 2: [160, 0, 0, 0].
57. Step (6): Same-Row Pages
- Selected edge: (I1,1, I3,1), weight 50.
- Actions: I1,1 and I3,1 are on the same row; no action needed.
58. Step (7): Two Mapped Pages
- Selected edge: (I2,1, I1,3), weight 40.
- Actions: both I2,1 and I1,3 are already mapped; no conflict.
- Weight parameter updates: I1,3 @ bank 0: [0, 0, 60, 40]; I2,1 @ bank 3: [40, 80, 0, 0].
59. Step (8): Conflict Resolution
- Selected edge: (I0,0, I1,1), weight 30.
- Situation: both I0,0 and I1,1 are already mapped, and both sit in bank 0.
- Current weight parameters: I0,1 @ bank 1: [500, 0, 0, 0]; I1,1 @ bank 0: [30, 200, 0, 0]; I2,1 @ bank 3: [40, 80, 0, 0]; I3,1 @ bank 2: [160, 0, 0, 0].
- Resolution: the row-1 pages are rearranged across the banks (the slide shows the order I3,1, I2,1, I0,1, I1,1) so that I0,0 and I1,1 no longer share a bank.
- Updated weight parameters: I0,0 @ bank 0: [0, 500, 100, 30]. No conflict remains.
60. Generated PMT Table
[Diagram: the external memory address issued by the I-cache/D-cache is split into a 14-bit memory page address and a 22-bit row/column address. The page address indexes the 4 kB Page Remapping Table (PMT), whose 2-bit entry supplies the bank address sent, together with the row/column address, through the EBIU to the 16 MB external SDRAM. Example entries: I0,0 → 00, I0,1 → 10, I1,1 → 11, I1,2 → 01, I1,3 → 00, I2,1 → 00, I3,1 → 01.]
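The PMT lookup described above amounts to replacing one bit field of the address. A minimal sketch; the exact bit positions (10-bit page offset, bank bits at the top of the 24-bit address) are assumptions for illustration, not the BF533's actual layout:

```python
# PMT address translation sketch: a 24-bit external address (16 MB) has its
# 2-bit bank field replaced by the PMT entry for its 14-bit page number.

PAGE_SHIFT = 10   # assumed: low 10 bits stay inside a page
BANK_SHIFT = 22   # assumed: bank bits occupy the top 2 bits
BANK_MASK = 0b11

def remap_address(addr, pmt):
    """Replace the bank bits of addr with the PMT entry for its page."""
    page = addr >> PAGE_SHIFT                     # 14-bit page number indexes the PMT
    new_bank = pmt[page] & BANK_MASK
    return (addr & ~(BANK_MASK << BANK_SHIFT)) | (new_bank << BANK_SHIFT)
```

Because only the bank field changes, the translation is a single table read plus a mask, which is why the slide's MMU variant needs no extra hardware.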
61. Experimental Setup
- Utilized the embedded power modeling framework
- Extended the address translation unit for page remapping
- Page coloring program to generate the PMT
- Same 10 multimedia application benchmarks
  - MPEG-2 encoder and decoder
  - H.264 encoder and decoder
  - JPEG encoder and decoder
  - PGP encoder and decoder
  - G.721 encoder and decoder
62. Page Miss Reduction [results chart]
63. External Bus Power [results chart]
64. Average Access Delay [results chart]
65. Comments on Page Remapping
- The page remapping algorithm was presented by example.
- Our algorithm significantly reduces the memory page miss rate, by 70-80% on average.
- For a 4-bank SDRAM memory system, we reduced external memory access time by 12.6%.
- The proposed algorithm reduces power consumption in the majority of the benchmarks, averaging a 13.2% power reduction.
- Combining the power and delay effects, our algorithm significantly benefits the total energy cost.
- A stability study was done in the dissertation: a PMT generated from one test vector input performs well on different inputs.
66. Outline
- Research Motivation and Introduction
- Related Work
- Power Estimation Framework
- Optimization I: Power-Aware Bus Arbitration
- Optimization II: Memory Page Remapping
- Summary
67. Summary
- Reviewed the issues of external bus power in a system-on-a-chip (SOC) embedded system.
- Built an external bus power estimation framework and experimental methodology. [PACS'04]
- Proposed a series of power-aware bus arbitration schemes and showed their improvement over traditional schemes. [HiPEAC'05; also appeared in LNCS Transactions on High-Performance Embedded Architectures and Compilers]
- Proposed a page remapping algorithm to reduce page misses, with power and delay improvements. [LCTES'07]
68. Future Work
- Integrate the power estimation framework into a complete tool chain
- Extend the arbitration schemes to multiple memory interfaces and other peripheral interfaces
- Compare the performance of page remapping with corresponding OS/compiler schemes