Low Power Processor Design: Part I

Provided by: cseIit

1
Low Power Processor Design Part I

M. Balakrishnan

2
Contents

Introduction
Processor development history
Data storage and movement
Control and sequencing
Component
Processor Power

3
Components of Computation

Any algorithm is a sequence of steps.
Executing any algorithm involves the following
components:
Computation: perform the computation
Data: bring the data to the compute units and
take away the results
Control: schedule and activate the steps, and
generate the associated control signals to
bring/take away the data and perform the
computation

4
Processors: What is different?

Processors vis-à-vis ASICs are distinguished (or
identified) by instructions being associated
with control
These instructions isolate the user from the
micro-architectural details and provide an easy
programming paradigm

[Figure: two labeled blocks, Instructions and Micro-architecture]

5
Processor Developments 1971-2008

Processor architectures have developed over the
years in all three components, with the
primary focus on increasing performance.
Apart from performance, the complexity of
applications and portability across platforms
have also driven architectural developments.
More recently, we have been forced to look into
the impact of these architectural developments
on power. The current trend, exemplified by
multi-core design, is to look at performance-power
benefits.

6
Computation in Processors
Single ALU for add/sub/logic ops
Increase in effective rate of computation per
unit time (integer ops or floating ops per
second)
Increase in data word-length
Dedicated hardware op. units
Pipelined ALUs
Vector ops
Multiple ALUs (superscalar/VLIW)
7
Data Movement in Processors
Memory and a single accumulator
Increase in effective rate of memory access per
unit time
Secondary storage/ Virtual memory
Register file
Cache
Bypass paths
Write/read buffers
8
Instruction Flow
Instruction register and decoder
Increase in effective rate of instructions
executed per unit time (MIPS)
Microprogramming
Instruction prefetch
Instruction pipelining
Speculative execution
Issue units
9
Register File Power Reduction [2]

A large register file (RF) with multiple ports is
key to processor performance measured in terms
of IPC (instructions per cycle).
Such large register files consume considerable
energy and also add delay.
Many techniques have been used to reduce RF
access requirements. Many of them revolve
around de-allocating registers as soon as
possible, which reduces register pressure
and thus improves performance
for a given RF.
Here we discuss a technique that reduces writes to
the RF by using the bypass network to feed
the operand instead.

10
Multi-port Register File
[Figure: multi-ported RF connected to two ALUs]
11
Multi-port RF Access Energy
[Figure: signal labels ReadPort 1, WriteAdr 1, ReadAdr 1, ReadPort 2, ReadAdr 2]
12
Writeback Avoidance Condition

A result value is transient if all of the following
conditions hold:
The value is short-lived. A value generated by
instruction x is short-lived if the target
register of x is redefined by another instruction
before x is written back.
There is only one consumer of this value.
The consuming instruction is issued before the
value is produced.
There is no branch instruction between the
value producer and the re-definer.
The sole consumer of the value is not
subject to replay caused by load-latency
misprediction or memory-dependence
misprediction.
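
The conditions above can be read as a simple conjunction. The following is a minimal Python sketch of that check, not the hardware mechanism from [2]; the record type `ResultInfo` and all of its field names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ResultInfo:
    """Hypothetical per-result bookkeeping; field names are illustrative only."""
    redefined_before_writeback: bool       # target register redefined before write-back
    consumer_count: int                    # number of instructions that read the value
    consumer_issued_before_produced: bool  # consumer is already waiting on the bypass
    branch_between_producer_and_redefiner: bool
    consumer_may_replay: bool              # load-latency or memory-dependence misprediction


def is_transient(r: ResultInfo) -> bool:
    """Return True only if every listed condition holds."""
    return (
        r.redefined_before_writeback
        and r.consumer_count == 1
        and r.consumer_issued_before_produced
        and not r.branch_between_producer_and_redefiner
        and not r.consumer_may_replay
    )


def needs_writeback(r: ResultInfo) -> bool:
    """A transient value can skip the register file write."""
    return not is_transient(r)
```

A result that fails any single condition, for example one with two consumers, is treated as an ordinary value and is written back as usual.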

13
Selective Writeback

Transient values need not be written back into
the register file; they can be sent directly to the
ALU execution units through the bypass.
This requires three bits per register file to
ensure that the conditions for a transient
value are met. In addition, checkpointing is used
to roll back in case of interrupts, etc.

14
Register File Bypass Path
[Figure: multi-ported RF and bypass buffer connected to two ALUs]
15
Results of Selective Write-back

Write energy is 1.8 times read energy.
45% of the produced results are transients and
need not be written back.
This results in a 36% reduction in energy
consumption in the RF. Assuming the RF itself takes
10% to 25% of the overall energy, this results in
a 3% to 7% overall reduction.
It also improves the performance of the base
processor by 10.9%. Over related techniques it
improves performance by 5% to 10%.
The technique can be used to reduce the number of
registers and/or ports and thus save energy.

16
Associative Memory

A number of associative memories are used in
modern processor designs. These include the
DTLB and ITLB (data and instruction TLBs) and the
STQ and LDQ (store and load queues).
These are energy-intensive components, as each
search involves both a broadcast of the key and a
concurrent comparison across all the key entries.
As the key data repeats frequently, this repetition
can be exploited for energy reduction [3].

17
Associative Memory Structure
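[Figure: associative memory structure. A broadcast key is compared against key 0 ... key (n-1) by per-entry comparators; the match lines feed an encoder, and a multiplexer selects among data 0 ... data (n-1).]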
18
Search Key Memoization [3]

Typically the high-order bits of the key repeat
frequently.
The key is split into high-order (H) and low-order
(L) bits. The optimal dividing line between the H
and L bits varies from application to application.
It is also a function of the OS's address
allocation strategy.
If the H part of the key repeats, it is neither
broadcast nor compared again. The result of the
previous match on the H bits is stored in a
flip-flop in each entry and is reused.
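
The idea can be sketched in software as follows. This is an illustrative Python model under stated assumptions, not the circuit from [3]: the class name `MemoizedCAM`, the default split (`l_bits=12` out of a 40-bit key), and the use of "bits compared" as a rough proxy for energy are all assumptions made for this example.

```python
class MemoizedCAM:
    """Toy associative memory whose keys are split into H and L parts."""

    def __init__(self, keys, data, key_bits=40, l_bits=12):
        self.l_bits = l_bits
        self.h_bits = key_bits - l_bits
        self.l_mask = (1 << l_bits) - 1
        self.keys_h = [k >> l_bits for k in keys]
        self.keys_l = [k & self.l_mask for k in keys]
        self.data = list(data)
        self.match_h = [False] * len(self.data)  # one stored H-match bit per entry
        self.last_h = None                       # H part of the previous search key
        self.bits_compared = 0                   # crude energy proxy (assumption)

    def search(self, key):
        key_h, key_l = key >> self.l_bits, key & self.l_mask
        if key_h != self.last_h:
            # H part changed: broadcast it and refresh every stored match bit.
            self.match_h = [kh == key_h for kh in self.keys_h]
            self.bits_compared += self.h_bits * len(self.keys_h)
            self.last_h = key_h
        # The L part is always broadcast and compared.
        self.bits_compared += self.l_bits * len(self.keys_l)
        for i, kl in enumerate(self.keys_l):
            if self.match_h[i] and kl == key_l:
                return self.data[i]
        return None
```

In this model, two consecutive searches that share the same H part pay for the H comparison only once; the second search compares only the L bits of each entry. The sketch keeps the entries fixed after construction, so it does not model how the stored match bit is refreshed when an entry is rewritten.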

19
Key Memoization Structure
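[Figure: key memoization structure. KeyH and KeyL are compared against key 0H and key 0L by separate comparators to produce the match signal; additional labels: clock, drive-upper.]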
20
Results

For a 40-bit virtual address, the size of L varied
from 10 to 20 bits.
DTLB power consumption was reduced by 70% and
ITLB by 93%. For the ITLB, L was only 3 bits.
Splitting the key more than two ways could reduce
power further. With 3 components, the DTLB power
reduction went up to 81%.

21
Off-Chip Bus Energy Reduction

Off-chip buses typically connect the cache to
main memory (and possibly other peripherals).
Studies show they consume 10% to 23% of
system power.
Data encoding techniques have been proposed
to reduce the activity on these buses.
More recently, value caches on both sides of the
bus, with associated encoding, have been used for
power reduction.

22
Value Cache Approach

Yang et al. [4] first proposed a cache-based system
for transmitting frequently occurring values. With
the cache size restricted to the word length (32),
every hit can be transmitted by toggling just one
bit, with a control signal indicating hit/miss. On
a miss, the original data value is sent.
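
A minimal Python sketch of this scheme is shown below. It is an illustration under stated assumptions rather than the design from [4]: the class names, the FIFO replacement policy, and the choice to insert a value into both caches on every miss are assumptions made so that the two ends stay in sync. Toggles on the control line are not counted.

```python
class ValueCache:
    """Fixed-size table of values; sender and receiver keep identical copies."""

    def __init__(self, size=32):
        self.size = size
        self.entries = []
        self.next_slot = 0  # FIFO replacement pointer (an assumption)

    def find(self, value):
        return self.entries.index(value) if value in self.entries else None

    def insert(self, value):
        if len(self.entries) < self.size:
            self.entries.append(value)
        else:
            self.entries[self.next_slot] = value
            self.next_slot = (self.next_slot + 1) % self.size


class Sender:
    def __init__(self):
        self.cache, self.bus = ValueCache(), 0

    def send(self, value):
        slot = self.cache.find(value)
        if slot is not None:
            self.bus ^= 1 << slot      # hit: toggle exactly one bus line
            return self.bus, True
        self.cache.insert(value)       # miss: drive the raw value
        self.bus = value
        return self.bus, False


class Receiver:
    def __init__(self):
        self.cache, self.bus = ValueCache(), 0

    def receive(self, bus, hit):
        if hit:
            slot = (bus ^ self.bus).bit_length() - 1  # which line toggled
            value = self.cache.entries[slot]
        else:
            value = bus
            self.cache.insert(value)   # mirror the sender's update
        self.bus = bus
        return value


def transfer(values):
    """Send 32-bit values across the bus; return (received values, data-line toggles)."""
    sender, receiver = Sender(), Receiver()
    received, toggles, prev = [], 0, 0
    for v in values:
        bus, hit = sender.send(v)
        toggles += bin(bus ^ prev).count("1")
        prev = bus
        received.append(receiver.receive(bus, hit))
    return received, toggles
```

Because both ends apply the same insertions in the same order, a toggled line index identifies the same cached value on either side, and every hit costs a single data-line transition regardless of how many bits differ between consecutive values.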

[Figure: on-chip data cache and off-chip memory connected by the data bus and a control line, with a value cache on each side]
23
References

[1] V. Venkatachalam and M. Franz, Power Reduction
Techniques for Microprocessor Systems, ACM
Computing Surveys, Vol. 37, No. 3, Sep. 2005, pp.
195-237
[2] D. Balkan et al., Selective writeback: exploiting
transient values for energy-efficiency and
performance, ISLPED 2006, Oct. 2006, pp. 37-42
[3] J. Sharkey et al., Power efficient wakeup tag
broadcast, ICCD 2005, Oct. 2005, pp. 654-661
[4] J. Yang et al., FV encoding for low power data
I/O, ISLPED 2001, Aug. 2001, pp. 84-87