System-Level Memory Bus Power And Performance Optimization for Embedded Systems

1
System-Level Memory Bus Power And Performance
Optimization for Embedded Systems
  • Ke Ning
  • kning_at_ece.neu.edu
  • David Kaeli
  • kaeli_at_ece.neu.edu

2
Why Is Power More Important?
  • "Power: A First-Class Architectural Design Constraint" (Trevor Mudge, 2001)
  • Increasing complexity for higher performance
    (MIPS)
  • Parallelism, pipeline, memory/cache size
  • Higher clock frequency, larger die size
  • Rising dynamic power consumption
  • CMOS process continues to shrink
  • Smaller size logic gates reduce Vthreshold
  • Lower Vthreshold will have higher leakage
  • Leakage power will exceed dynamic power
  • Things are getting worse in embedded systems
  • Low power and low cost systems
  • Fixed or Limited applications/functionalities
  • Real-time systems with timing constraints

3
Power Breakdown of An Embedded System
(Pie chart: power breakdown measured at 25°C; internal 1.2 V, 400 MHz CCLK Blackfin processor; external 3.3 V, 133 MHz SDRAM, 27 MHz PPI. The external memory bus is the research target.)
Source: Analog Devices Inc.
4
Introduction
  • Related work on microprocessor power
  • Low power design trend
  • Power metrics
  • Power performance tradeoffs
  • Power optimization techniques
  • Power estimation framework
  • Experimental framework built on a Blackfin cycle-accurate simulator
  • Validated through a Blackfin EZKit board
  • Power aware bus arbitration
  • Memory page remapping

5
Outline
  • Research Motivation and Introduction
  • Related Work
  • Power Estimation Framework
  • Optimization I: Power-Aware Bus Arbitration
  • Optimization II: Memory Page Remapping
  • Summary

6
Power Modeling
  • Dynamic power estimation
  • Instruction-level models [Tiwari94], JouleTrack [Sinha01]
  • Function-level model [Qu00]
  • Architecture models: Cai-Lim model / TEMPEST [CaiLim99], Wattch [Brooks00], SimplePower [Ye00]
  • Static power estimation
  • Butts-Sohi model [Butts00]
  • Previous memory system power estimation
  • Activity model: CACTI [Wilton96]
  • Trace-driven model: Dinero IV [Elder98]

7
Power Equation

P = P_dynamic + P_leakage = A · C · V_DD^2 · f + N · k_design · I_leakage · V_DD

where A is the activity factor, C the total switched capacitance, V_DD the supply voltage, f the clock frequency, N the transistor count, k_design the design/technology factor, and I_leakage the normalized leakage current.
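A numeric sketch of the equation above; every value below is an illustrative placeholder, not a measured Blackfin parameter.

```python
# Numeric sketch of P = A*C*V_DD^2*f + N*k_design*I_leak*V_DD.
# All inputs are made-up example values, not measured Blackfin data.

def total_power(A, C, V_dd, f, N, k_design, I_leak):
    """Return (dynamic, leakage) power per the equation on this slide."""
    p_dynamic = A * C * V_dd ** 2 * f          # A * C * V_DD^2 * f
    p_leakage = N * k_design * I_leak * V_dd   # N * k_design * I_leak * V_DD
    return p_dynamic, p_leakage

dyn, leak = total_power(A=0.2, C=1e-9, V_dd=1.2, f=400e6,
                        N=1e7, k_design=1.0, I_leak=1e-12)
print(dyn, leak)  # dynamic power dominates at these example values
```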
8
Common Power Optimization Techniques
  • Gating (turn off unused components)
  • Clock gating
  • Voltage gating Cache decay Hu01
  • Scaling (scale operating point of an component)
  • Voltage scaling Drowsy cache Flautner02
  • Frequency scaling Pering98
  • Resource scaling DRAM power mode Delaluz01
  • Banking (break single component into smaller
    sub-units)
  • Vertical sub-banking Filter cacheKin97
  • Horizontal sub-banking Scratchpad Kandemir01
  • Clustering (partition components into clusters)
  • Switching reduction (redesigning with lower
    activity)
  • Bus encoding Permutation Code Mehta96,
    redundant codeStan95, Benini98, WZEMusoll97

9
Power Aware Figure of Merit
  • Delay, D
  • Performance, MIPS
  • Power, P
  • Battery life (mobile), packaging (high performance)
  • Obvious choice for a power-performance tradeoff: P·D
  • Joules/instruction, or inversely MIPS/W
  • Energy figure
  • Mobile / low-power applications
  • Energy-Delay: P·D^2
  • MIPS^2/W [Gonzalez96]
  • Energy-Delay-Square: P·D^3
  • MIPS^3/W
  • Voltage and frequency independent
  • More generically, MIPS^m/W
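A quick illustration of the MIPS^m/W figures of merit above; the MIPS and watt numbers are made up for the example, not benchmark results.

```python
# Compare two hypothetical designs under the MIPS^m/W family of metrics.

def merit(mips, watts, m=2):
    """MIPS^m per watt; m = 1 gives MIPS/W, m = 2 gives MIPS^2/W."""
    return mips ** m / watts

fast_hot = merit(400, 2.0)   # 400 MIPS design burning 2.0 W (example)
slow_cool = merit(200, 0.6)  # 200 MIPS design burning 0.6 W (example)
print(fast_hot > slow_cool)  # under MIPS^2/W the faster design wins here
```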

10
Power Optimization Effect on Power Figure
  • Most optimization schemes sacrifice performance for lower power consumption, except switching reduction.
  • All optimization schemes improve power efficiency.
  • All optimization schemes increase hardware complexity.

11
Outline
  • Research Motivation and Introduction
  • Related Work
  • Power Estimation Framework
  • Optimization I: Power-Aware Bus Arbitration
  • Optimization II: Memory Page Remapping
  • Summary

12
External Bus
  • External bus components
  • Typically an off-chip bus
  • Includes control bus, address bus, data bus
  • External bus power consumption
  • Dynamic power factors: activity, capacitance, frequency, voltage
  • Leakage power: supply voltage, threshold voltage, CMOS technology
  • Different from internal memory bus power
  • Longer physical distance, higher bus capacitance, lower speed
  • Cross-line interference, higher leakage current
  • Different communication protocols (memory/peripheral dependent)
  • Multiplexed row/column address bus, narrower data bus

13
Embedded SOC System Architecture

(Block diagram: a media processor core with instruction and data caches connects over the internal bus to the system DMA controller (Memory DMA 0/1, PPI DMA, SPORT DMA) and to the External Bus Interface Unit (EBIU). The external bus links the EBIU to SDRAM, FLASH memory, and asynchronous devices; a streaming interface through an NTSC/PAL encoder provides S-Video/CVBS output and a NIC. The power modeling area covers the EBIU and the external bus.)
14
ADSP-BF533 EZ-Kit Lite Board
(Board photo: the BF533 Blackfin processor with SDRAM and FLASH memory; video in/out through a video codec/ADV converter, audio in/out through an audio codec/AD converter, and SPORT data I/O.)
15
External Bus Power Estimator
  • Previous Approaches
  • Used Hamming distance Benini98
  • Control signal was not considered
  • Shared row and column address bus
  • Memory state transitions were not considered
  • In Our Estimator
  • Integrate memory control signal power into the
    model
  • Consider the case where row and column address
    are shared
  • Memory state transitions and stalls also cost
    power
  • Consider page miss penalty and traffic reverse
    penalty
  • P(bus) = P(page miss) + P(bus turnaround) + P(control signal) + P(address generation) + P(data transmission) + P(leakage)
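The additive model above can be sketched as a simple component sum; the per-component values here are hypothetical per-access energies, not the calibrated estimator's numbers.

```python
# Component-sum sketch of the additive external bus power model:
# P(bus) = page miss + turnaround + control + address + data + leakage.

def bus_power(components):
    """Sum the per-component power contributions of one bus transaction."""
    return sum(components[k] for k in ("page_miss", "turnaround", "control",
                                       "address", "data", "leakage"))

# Hypothetical per-access contributions (arbitrary units).
access = {"page_miss": 0.0, "turnaround": 0.1, "control": 0.3,
          "address": 0.5, "data": 2.0, "leakage": 0.2}
print(bus_power(access))
```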

16
Two External Bus SDRAM Timing Models

(a) SDRAM access in sequential command mode:
    Bank 0 request: P A R R R R N N N N
    Bank 1 request: P A R R N N N N
(b) SDRAM access in pipelined command mode:
    Bank 0 request: P A R R R R N N
    Bank 1 request: P A R R R R
(Time axis: system clock cycles, SCLK.)
Legend: P = PRECHARGE, A = ACTIVATE, N = NOP, R = READ
17
Bus Power Simulation Framework

(Simulation flow: a program is compiled to a target binary and executed on an instruction-level simulator with a memory hierarchy model; a memory trace generator feeds the external bus power estimator, which combines a memory power model with a memory technology timing model to produce the bus power estimate. All modules are software we developed.)
18
Multimedia Benchmark Configurations
Name | Description | I-Cache Size | D-Cache Size
MPEG2-ENC | MPEG-2 video encoder with 720x480 4:2:0 input frames | 16k | 16k
MPEG2-DEC | MPEG-2 video decoder of 720x480 sequence with 4:2:2 CCIR frame output | 16k | 16k
H264-ENC | H.264/MPEG-4 Part 10 (AVC) video encoder achieving very high data compression | 16k | 16k
H264-DEC | H.264/MPEG-4 Part 10 (AVC) video decompression algorithm | 16k | 16k
JPEG-ENC | JPEG image encoder for 512x512 image | 8k | 8k
JPEG-DEC | JPEG image decoder for 512x512 image | 8k | 8k
PGP-ENC | Pretty Good Privacy encryption and digital signature of text message | 8k | 4k
PGP-DEC | Pretty Good Privacy decryption of encrypted message | 8k | 4k
G721-ENC | G.721 voice encoder of 16-bit input audio samples | 4k | 2k
G721-DEC | G.721 voice decoder of encoded bits | 4k | 2k
19
Outline
  • Research Motivation and Introduction
  • Related Work
  • Power Estimation Framework
  • Optimization I: Power-Aware Bus Arbitration
  • Optimization II: Memory Page Remapping
  • Summary

20
Optimization I: Bus Arbitration
  • Multiple bus access masters in an SOC system
  • Processor cores
  • Data/Instruction caches
  • DMA
  • ASIC modules
  • Multimedia applications
  • High bus bandwidth throughput
  • Large memory footprint
  • Efficient arbitration algorithm can
  • Increase power awareness
  • Increase bus throughput
  • Reduce bus power

21
Bus Arbitration Target Region

(Block diagram: the same SOC architecture, with the EBIU (arbitration enabled) highlighted as the bus arbitration target region between the internal bus masters and the external bus to SDRAM, FLASH memory, and asynchronous devices.)
22
Bus Arbitration Schemes
  • EBIU with arbitration enabled
  • Handle core-to-memory and core-to-peripheral
    communication
  • Resolve bus access contention
  • Schedule bus access requests
  • Traditional algorithms
  • First Come First Serve (FCFS)
  • Fixed Priority
  • Power-aware algorithms (categorized by power metric / cost function)
  • Minimum power (P^1 D^0) or (1, 0)
  • Minimum delay (P^0 D^1) or (0, 1)
  • Minimum power-delay product (P^1 D^1) or (1, 1)
  • Minimum power-delay-square product (P^1 D^2) or (1, 2)
  • More generically (P^n D^m) or (n, m)

23
Bus Arbitration Schemes (Continued)
  • Power Aware Arbitration
  • From the current pending requests in the waiting
    queue, find a permutation of the external bus
    requests to achieve the minimum total power
    and/or performance cost.
  • Reducible to minimum Hamiltonian path problem in
    a graph G(V,E).
  • Vertex: request R(t, s, b, l)
  • t: request arrival time
  • s: starting address
  • b: block size
  • l: read / write
  • Edge: transition between requests i and j
  • edge weight w(i, j) is the cost of the transition

24
Minimum Hamiltonian Path Problem
R0 is the last request on the bus and must be the starting point of a path. R1, R2, R3 are the requests in the queue. The edge weight is w(i,j) = P(i,j)^n · D(i,j)^m, where P(i,j) is the power of Rj issued after Ri and D(i,j) is the delay of Rj after Ri. Example minimum Hamiltonian path: R0 → R3 → R1 → R2, with path weight w(0,3) + w(3,1) + w(1,2). Finding the minimum-weight Hamiltonian path is NP-complete.
(Figure: complete directed graph over R0, R1, R2, R3 with edge weights w(i,j).)
25
Greedy Solution
Greedy algorithm (local minimum): only the next request in the path is needed, min_j w(0,j), where w(i,j) is the edge weight of graph G(V,E).
In each iteration of arbitration:
1. A new graph G(V,E) is constructed.
2. The greedy-choice request is arbitrated onto the bus.
(Figure: the same request graph; the greedy search considers only the edges out of R0.)
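A minimal sketch of the greedy selection above: from the last granted request, pick the pending request j minimizing w(i,j) = P(i,j)^n · D(i,j)^m. The power() and delay() callbacks and the toy cost numbers are hypothetical stand-ins for the PEU estimates, not the actual hardware model.

```python
# Greedy power-aware arbitration sketch: one min() over the pending queue.

def greedy_arbitrate(last, queue, power, delay, n=1, m=0):
    """Return the pending request with minimum transition cost from `last`."""
    return min(queue, key=lambda req: power(last, req) ** n * delay(last, req) ** m)

# Toy cost model: a bank switch (page miss) costs 1.0, a same-bank access 0.1.
requests = [("R1", 0), ("R2", 1), ("R3", 0)]  # (name, bank) pairs
power = lambda a, b: 0.1 if a[1] == b[1] else 1.0
delay = lambda a, b: 1.0
last = ("R0", 1)
print(greedy_arbitrate(last, requests, power, delay))  # ('R2', 1): same bank
```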
26
Experimental Setup
  • Utilized the embedded power modeling framework
  • Implemented eleven different arbitration schemes inside the EBIU
  • FCFS, Fixed Priority
  • Minimum power (P^1 D^0) or (1, 0), minimum delay (P^0 D^1) or (0, 1), and (1,1), (1,2), (2,1), (1,3), (3,1), (3,2), (2,3)
  • 10 multimedia application benchmarks were ported to the Blackfin architecture and simulated, including MPEG-2, H.264, JPEG, PGP, and G.721.

27
Power Improvement
  • Power-aware arbitration schemes have lower power consumption than Fixed Priority and FCFS.
  • The difference across power-aware arbitration strategies is small.
  • The pipelined command model gives 6-7% power savings over the sequential command model for MPEG2 ENC/DEC.
  • The results are consistent across all other benchmarks.

28
Speed Improvement
  • Power-aware schemes have smaller bus delay than traditional Fixed Priority and FCFS.
  • The difference across power-aware arbitration strategies is small.
  • The pipelined command model gives a 3-9% speedup over the sequential command model for MPEG2 ENC/DEC.
  • The results are consistent across all other benchmarks.

29
Comparison with Exhaustive Algorithm
  • The greedy algorithm can fail in certain cases.
  • Complexity is O(n) vs. O(n!) for exhaustive search.
  • The performance difference is negligible.

(Figure: example request graph where, after a new request arrives, exhaustive search finds a cheaper path than the greedy choice; edge weights 5, 7, 15, 17, 18, 20.)
30
Comments on Experimental Results
  • Power-aware arbitrators significantly reduce external bus power for all 8 benchmarks, with 14% power savings on average.
  • Power-aware arbitrators reduce bus access delay; delays are reduced by 21% on average across the 8 benchmarks.
  • The pipelined SDRAM model has a big performance advantage over the sequential SDRAM model, achieving 6% power savings and a 12% speedup.
  • Power and delay on the external bus are highly correlated; minimum power also achieves minimum delay.
  • Minimum-power schemes lead to simpler design options; scheme (1, 0) is preferred due to its simplicity.

31
Design of A Power Estimation Unit (PEU)

(Datapath: the PEU stores the open row address of each bank (0-3), the last bank address, and the last column address. The next request address is split into bank, row, and column fields. The bank address is compared with the stored bank state; a mismatch outputs bank-miss power. The row address is compared with the open row of the addressed bank; a mismatch outputs page-miss penalty power and updates the stored address register. The Hamming distance between the new and last column addresses gives the column address/data power. The sum of these terms is the estimated power.)
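The column address/data term above rests on Hamming distance: the number of bus lines that toggle between consecutive values drives switching power. A minimal sketch, with made-up 16-bit address values:

```python
# Hamming-distance switching estimate, as used for the PEU's data/address term.

def hamming(a: int, b: int) -> int:
    """Count the bus lines that toggle between two consecutive bus values."""
    return bin(a ^ b).count("1")

# Example consecutive 16-bit column addresses on the multiplexed bus.
prev_addr, next_addr = 0x00FF, 0x0F0F
print(hamming(prev_addr, next_addr))  # 8 lines toggle
```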
32
Two Arbitrator Implementation Structures
(Two implementation structures. Shared PEU: a request queue buffer holds entries (t, s, b, l); a single Power Estimation Unit evaluates the pending requests against the memory/bus state info one at a time, a comparator selects the minimum-power request, and the access command generator drives the external bus, updating the state after each grant. Dedicated PEU: one PEU per queue entry evaluates all pending requests in parallel ahead of the comparator, trading extra hardware for lower arbitration latency.)
33
Performance of two structures
  • Higher PEU delay lowers external bus performance for both the MPEG-2 encoder and decoder.
  • When the PEU delay is 5 cycles or more, the dedicated structure is preferred over the shared structure; otherwise the shared structure suffices.

34
Summary of Bus Arbitration Schemes
  • Efficient bus arbitration provides benefits to both power and performance over traditional arbitration schemes.
  • Minimum power and minimum delay are highly correlated in external bus behavior.
  • The pipelined SDRAM model has a significant advantage over the sequential SDRAM model.
  • Arbitration scheme (1, 0) is recommended.
  • The minimum-power approach provides more design options and leads to simpler design implementations. The trade-off between design complexity and performance was presented.

35
Outline
  • Research Motivation and Introduction
  • Related Work
  • Power Estimation Framework
  • Optimization I: Power-Aware Bus Arbitration
  • Optimization II: Memory Page Remapping
  • Summary

36
Data Access Pattern in Multimedia Apps
(Figure: address-vs-time plots of three access patterns: fixed stride, 2-way stream, and 2-D stride.)
  • 3 common data access patterns in multimedia applications
  • Majority of cycles are in loop bodies and array accesses
  • High data access bandwidth
  • Poor locality, cross-page references
37
Previous work on Access Pattern
  • Previous work was performance-driven and took an OS/compiler-related approach
  • Data pre-fetching [Chen94, Zhang00]
  • Memory customization [Adve00, Grun01]
  • Data layout optimization [Catthoor98, DeLaLuz04]
  • Shortcomings of OS/compiler-based strategies
  • The dominant activities of multimedia benchmarks are within large monolithic data buffers.
  • Buffers generally contain many memory pages and cannot be further optimized.
  • Constrained by OS and compiler capability; poor flexibility.

38
Optimization II - Page Remapping
  • A technique currently used for peripheral memory access in large memory spaces.
  • External memories in embedded multimedia systems have
  • High bus access overhead
  • Page miss penalties
  • Efficient page remapping can
  • Reduce page misses
  • Improve external bus throughput
  • Reduce power / energy consumption

39
Page Remapping Target Region

(Block diagram: the same SOC architecture, with the SDRAM on the external bus highlighted as the page remapping target region.)
40
SDRAM Memory Pages
  • High memory access latency; minimum latency of one SCLK cycle
  • Page miss penalty
  • Additional latency due to refresh cycles
  • No guaranteed access time due to arbitration logic
  • Non-sequential reads/writes suffer

(Figure: SDRAM organized as banks 0 to M-1 by pages 0 to N-1; X marks show accesses scattered across pages 0-4 of different banks.)
41
SDRAM Page Miss Penalty
(Timing diagram, system clock cycles (SCLK). Top: two read bursts to different pages; each burst pays the P, A (precharge, activate) penalty before its four R (read) commands, and the D (data) words follow with a gap between bursts. Bottom: the second burst hits an open page, so its reads issue immediately and the data words stream back-to-back.)
Legend: N = NOP, D = DATA, P = PRECHARGE, A = ACTIVATE, R = READ
42
SDRAM Timing Parameters
SDRAM parameter | SCLK cycles
trcd | 1-15
trp | 1-7
tras | 1-15
tcas | 2-3

Access type | Number of cycles
Read cycle | trp + n(tcas)
Write cycle | twp
Page miss | trp + trcd
Refresh cycle | 2(trcd) × nrows

twp: write-to-precharge; trp: read-to-precharge; tras: activate-to-precharge; tcas: read latency. A page miss carries an 8-10 SCLK penalty.
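A per-burst latency sketch built from the table above; the trp/trcd/tcas defaults are example values chosen inside the listed ranges, not vendor datasheet timings.

```python
# Burst-read latency sketch: a page miss pays the trp + trcd penalty.

def read_latency(n_words, page_hit, trp=2, trcd=2, tcas=2):
    """SCLK cycles for a burst read; a page miss pays trp + trcd extra."""
    penalty = 0 if page_hit else trp + trcd  # page-miss penalty = trp + trcd
    return penalty + tcas + n_words          # CAS latency, then 1 word/cycle

print(read_latency(4, page_hit=True))   # 6 cycles for a page hit
print(read_latency(4, page_hit=False))  # 10 cycles with the page-miss penalty
```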
43
SDRAM Page Access Sequence (I)
(Figure: 12 reads across 4 banks, clustered on pages 0, 1, and 3; because consecutive accesses keep switching pages within the same banks, the command stream is P A R repeated twelve times, i.e., every read pays a precharge/activate page miss.)
Typical access pattern of 2-D stride / 2-way stream; poor data layout causes significant access overhead.
P = Precharge, A = Activation, R = Read
44
SDRAM Page Access Sequence (II)
(Figure: the same 12 reads with the data distributed so all accesses fall on page 0 of the 4 banks; only four P A R page openings are needed, and the remaining reads issue back-to-back.)
Less access overhead with a distributed data layout.
P = Precharge, A = Activation, R = Read
45
Why we use Page Remapping

(Figure: a hot page (Page 2) has accesses in all four banks; its page remapping entry (2, 0, 1, 3) permutes the bank assignment so conflicting accesses are redirected to different banks.)
46
Module in an SOC System
  • Address translation unit; it only translates the bank address
  • A non-MMU system inserts a page remapping module before the EBIU
  • An MMU system can take advantage of the existing address translation unit; no extra hardware is needed

(Diagram: internal bus → page remapping module → EBIU → external bus → SDRAM, FLASH memory, asynchronous devices.)
47
Sequence (I) after Remapping
(Figure: after remapping, the 12 reads from Sequence I are spread across banks so only four P A R page openings are needed, with the remaining reads issuing back-to-back.)
Same performance as Sequence II. Applicable to monolithic data buffers (e.g., frame buffers).
P = Precharge, A = Activation, R = Read
48
Page Remapping Algorithm
  • An NP-complete problem.
  • Reducible to a graph coloring problem on a page transition graph G(V, E).
  • Vertex: page I(m,n)
  • m: page bank number
  • n: page row number
  • Edge: transition from page I(m,n) to page I(p,q)
  • Weighted edges capture page traversals during program execution
  • The edge weight is the number of transitions from page I(m,n) to page I(p,q)
  • Color: bank
  • Each bank has one distinct color.
  • Every page is assigned one color.

49
Page Remapping Algorithm (continued)
  • Page remapping algorithm
  • From the page transition graph, find the color (bank) assignment for each page such that the transition cost between same-color pages is minimized.
  • Algorithm steps
  • Sort the edges by transition weight
  • Process the edges in decreasing weight order
  • Color the pages associated with each edge
  • A weight parameter array for each page represents the cost of mapping that page into each bank
  • e.g., (500, 200, 0, 0)
  • 5 different situations can arise when processing each edge
  • A page remapping table (PMT) is generated as the result of the mapping.
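The steps above can be sketched as a greedy edge-driven coloring. This is a simplified illustration only: it ignores slot-occupancy conflicts within a bank row and the five per-edge situations handled by the full algorithm.

```python
# Greedy page-to-bank coloring sketch: process page-transition edges in
# decreasing weight order and split each edge's pages across banks.

def remap(edges, n_banks=4):
    """edges: (weight, page_a, page_b) tuples. Returns page -> bank map."""
    bank = {}
    for _, a, b in sorted(edges, reverse=True):   # decreasing weight order
        for page, other in ((a, b), (b, a)):
            if page not in bank:
                used = {bank[other]} if other in bank else set()
                # choose the first bank the neighbouring page does not use
                bank[page] = next(c for c in range(n_banks) if c not in used)
    return bank

edges = [(500, "I0,0", "I0,1"), (200, "I1,1", "I1,2"), (100, "I0,0", "I3,1")]
mapping = remap(edges)
print(mapping)
assert mapping["I0,0"] != mapping["I0,1"]  # heaviest edge split across banks
```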

50
Example Case
(Left: original page allocation — Bank 0: I0,0 (page 0), I0,1 (page 1); Bank 1: I1,1 (page 1), I1,2 (page 2), I1,3 (page 3); Bank 2: I2,1 (page 1); Bank 3: I3,1 (page 1). Right: page transition graph with edge weights — (I0,0, I0,1): 500, (I1,1, I1,2): 200, (I0,0, I3,1): 100, (I1,2, I2,1): 80, (I3,1, I1,3): 60, (I1,1, I3,1): 50, (I2,1, I1,3): 40, (I0,0, I1,1): 30.)
51
Initial Step
(Empty 4-bank × 4-page grid: no page is mapped, all slots are available.)
52
Step (1) two unmapped pages
Selected edge: (I0,0, I0,1), weight 500.
Actions: allocate unmapped pages I0,0 and I0,1.
Weight parameter updates:
  I0,0 → bank 0: (0, 500, 0, 0)
  I0,1 → bank 1: (500, 0, 0, 0)
53
Step (2) two unmapped pages
Selected edge: (I1,1, I1,2), weight 200.
Actions: allocate unmapped pages I1,1 and I1,2.
Weight parameter updates:
  I1,1 → bank 0: (0, 200, 0, 0)
  I1,2 → bank 1: (200, 0, 0, 0)
54
Step (3) one unmapped page
Selected edge: (I0,0, I3,1), weight 100.
Actions: map page I3,1; no change for I0,0.
Weight parameter updates:
  I3,1 → bank 2: (100, 0, 0, 0)
  I0,0 → bank 0: (0, 500, 100, 0)
55
Step (4) one unmapped page
Selected edge: (I1,2, I2,1), weight 80.
Actions: map page I2,1; no change for I1,2.
Weight parameter updates:
  I2,1 → bank 3: (0, 80, 0, 0)
  I1,2 → bank 1: (200, 0, 0, 80)
56
Step (5) one unmapped page
Selected edge: (I3,1, I1,3), weight 60.
Actions: map page I1,3; no change for I3,1.
Weight parameter updates:
  I1,3 → bank 0: (0, 0, 60, 0)
  I3,1 → bank 2: (160, 0, 0, 0)
57
Step (6) same row pages
Selected edge: (I1,1, I3,1), weight 50.
Actions: both I1,1 and I3,1 are on the same row; no actions.
58
Step (7) two mapped pages
Selected edge: (I2,1, I1,3), weight 40.
Actions: both I2,1 and I1,3 are mapped; no conflicts.
Weight parameter updates:
  I1,3 → bank 0: (0, 0, 60, 40)
  I2,1 → bank 3: (40, 80, 0, 0)
59
Step (8) conflict resolving
Selected edge: (I0,0, I1,1), weight 30.
Actions: both I0,0 and I1,1 are mapped and in the same bank.
Current weight parameters:
  I0,1 → bank 1: (500, 0, 0, 0)
  I1,1 → bank 0: (30, 200, 0, 0)
  I2,1 → bank 3: (40, 80, 0, 0)
  I3,1 → bank 2: (160, 0, 0, 0)
Updated: I0,0 stays in bank 0 with weights (0, 500, 100, 30); the conflict between I0,0 and I1,1 is resolved by reassigning the page-1 entries across banks (final row-1 layout: I3,1, I2,1, I0,1, I1,1 in banks 0-3). No conflict remains.
60
Generated PMT table
(Figure: the generated page remapping table (PMT). A memory page address (14 bits) from the I-cache/D-cache miss path indexes the 4 kB PMT, which stores a 2-bit remapped bank address per page (e.g., I0,0 → 00, I0,1 → 10, I1,1 → 11, I1,2 → 01, I1,3 → 00, I2,1 → 00, I3,1 → 01). The remapped bank address (2 bits) is concatenated with the row/column address (22 bits) to form the external memory address sent to the EBIU for the 16 MB external SDRAM.)
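A hypothetical sketch of the PMT lookup described above, assuming the 2 bank bits sit at the top of the 24-bit address and the 14-bit page index is taken from the high address bits; the exact field positions are illustrative, not the BF533 layout.

```python
# Hypothetical PMT lookup: replace the bank bits of a 24-bit external
# address (16 MB SDRAM) with the table entry for its 14-bit page index.

BANK_SHIFT = 22   # assumed position of the 2-bit bank field
PAGE_SHIFT = 10   # assumed 1 kB pages -> 14-bit page index

def remap_address(addr: int, pmt: list) -> int:
    """Replace the bank bits of `addr` with the PMT entry for its page."""
    page = addr >> PAGE_SHIFT                  # 14-bit page index
    new_bank = pmt[page] & 0b11                # 2-bit remapped bank address
    row_col = addr & ((1 << BANK_SHIFT) - 1)   # 22-bit row/column address
    return (new_bank << BANK_SHIFT) | row_col

pmt = [0] * (1 << 14)        # one 2-bit entry per page (a 4 kB table)
addr = (1 << 22) | 0x345     # example access, originally in bank 1
pmt[addr >> PAGE_SHIFT] = 2  # PMT: remap this page to bank 2
print(hex(remap_address(addr, pmt)))
```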
61
Experimental Setup
  • Utilized embedded power modeling framework
  • Extended address translation unit for page
    remapping
  • Page coloring program to generate PMT
  • Same 10 Multimedia application benchmarks
  • MPEG-2 encoder and decoder
  • H.264 encoder and decoder
  • JPEG encoder and decoder
  • PGP encoder and decoder
  • G.721 encoder and decoder

62
Page Miss Reduction
63
External Bus Power
64
Average Access Delay
65
Comments of Page Remapping
  • The page remapping algorithm was presented by example.
  • Our algorithm significantly reduces the memory page miss rate, by 70-80% on average.
  • For a 4-bank SDRAM memory system, we reduced external memory access time by 12.6%.
  • The proposed algorithm reduces power consumption in the majority of the benchmarks, by 13.2% on average.
  • Combining the effects of both power and delay, our algorithm significantly benefits the total energy cost.
  • A stability study was done in the dissertation: a PMT generated from one test input vector performs well on different inputs.

66
Outline
  • Research Motivation and Introduction
  • Related Work
  • Power Estimation Framework
  • Optimization I: Power-Aware Bus Arbitration
  • Optimization II: Memory Page Remapping
  • Summary

67
Summary
  • Reviewed the issues of external bus power in a system-on-a-chip (SOC) embedded system.
  • Built an external bus power estimation framework and experimental methodology. [PACS04]
  • Proposed a series of power-aware bus arbitration schemes and showed their performance improvement over traditional schemes. [HiPEAC05]; also appeared in LNCS Transactions on High-Performance Embedded Architectures and Compilers.
  • Proposed a page remapping algorithm to reduce page misses and showed its power and delay improvements. [LCTES07]

68
Future Work
  • Integration of the power estimation framework into a complete tool chain
  • Extend arbitration schemes to multiple memory interfaces and other peripheral interfaces
  • Compare the performance of page remapping with corresponding OS/compiler schemes

69
  • Thank You !