Low Power Processor Design: Part I - PowerPoint PPT Presentation

Provided by: cseIit

1
Low Power Processor Design: Part I
  • M. Balakrishnan

2
Contents
  • Introduction
  • Processor development history
  • Data storage and movement
  • Control and sequencing
  • Component
  • Processor Power

3
Components of Computation
  • Any algorithm is a sequence of steps
  • Any algorithm execution has the following
    components:
  • Computation: perform the computation
  • Data: bring the data to the compute units and
    take away the results
  • Control: schedule and activate the steps and
    generate the associated control signals to
    bring/take away the data and perform the
    computation

4
Processors: What is different?
  • Processors vis-à-vis ASICs are distinguished (or
    identified) by instructions being associated
    with control
  • These instructions isolate the user from the
    micro-architectural details and provide an easy
    programming paradigm

[Figure labels: Instructions, Micro-architecture]
5
Processor Developments: 1971-2008
  • Processor architectures have developed over the
    years in all three components, with the primary
    focus on increasing performance
  • Apart from performance, complexity of
    applications and portability across platforms
    have also driven the architectural developments
  • More recently, we have been forced to look into
    the impact of these architectural developments
    on power. The current trend, exemplified by
    multi-core design, is to look at performance-power
    benefits

6
Computation in Processors
  • Single ALU for add/sub/logic ops
  • Increase in effective rate of computation per
    unit time (integer ops or floating ops per
    second)
  • Increase in data word-length
  • Dedicated hardware op. units
  • Pipelined ALUs
  • Vector ops
  • Multiple ALUs (superscalar/VLIW)
7
Data Movement in Processors
  • Memory, single accumulator
  • Increase in effective rate of memory access per
    unit time
  • Secondary storage / virtual memory
  • Register file
  • Cache
  • Bypass paths
  • Write/read buffers
8
Instruction Flow
  • Instruction register, decoder
  • Increase in effective rate of instructions
    executed per unit time (MIPS)
  • Microprogramming
  • Instruction prefetch
  • Instruction pipelining
  • Speculative execution
  • Issue units
9
Register File Power Reduction [2]
  • A large register file (RF) with multiple ports
    is key to processor performance measured in
    terms of IPC (instructions per cycle).
  • Such large register files consume considerable
    energy and add delay.
  • Many techniques have been used to reduce RF
    access requirements. Many of them revolve
    around de-allocating existing registers as
    soon as possible, which reduces register
    pressure and thus improves performance for a
    given RF.
  • Here we discuss a technique that reduces writes
    to the RF by using the bypass network to feed
    the operand instead.

10
Multi-port Register File
[Figure labels: Multi-ported RF, ALU, ALU]
11
Multi-port RF Access Energy
[Figure labels: ReadPort 1, WriteAdr 1, ReadAdr 1, ReadPort 2, ReadAdr 2]
12
Writeback Avoidance Condition
  • A result value is transient if all of the
    following hold:
  • The value must be short-lived. A value generated
    by instruction x is short-lived if the target
    register of x is redefined by another instruction
    before the result of x is written back.
  • There is only one consumer for this value.
  • The consuming instruction is issued before the
    value is produced.
  • There must be no branch instruction between the
    value producer and the re-definer.
  • The sole consumer of the value should not be
    subject to replay caused by load latency
    mis-prediction or memory dependence
    mis-prediction.
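
The five conditions above can be read as a single predicate. The following is a minimal sketch, assuming the pipeline already knows each fact about a produced value; the ResultInfo record and its field names are hypothetical and do not come from the cited paper.

```python
from dataclasses import dataclass


@dataclass
class ResultInfo:
    """Facts about one produced value (all field names are hypothetical)."""
    redefined_before_writeback: bool       # target register redefined before writeback
    consumer_count: int                    # number of instructions that read the value
    consumer_issued_before_produce: bool   # consumer issued before the value is produced
    branch_between_producer_and_redefiner: bool
    consumer_may_replay: bool              # load-latency or memory-dependence replay


def is_transient(r: ResultInfo) -> bool:
    """Return True only if all five conditions listed on the slide hold."""
    return (
        r.redefined_before_writeback
        and r.consumer_count == 1
        and r.consumer_issued_before_produce
        and not r.branch_between_producer_and_redefiner
        and not r.consumer_may_replay
    )


ok = ResultInfo(True, 1, True, False, False)
two_readers = ResultInfo(True, 2, True, False, False)
print(is_transient(ok), is_transient(two_readers))
```

A value that fails any one condition is treated as an ordinary result and written back as usual.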

13
Selective Writeback
  • Transient values need not be written back into
    the register file but can be sent directly to the
    ALU execution units through the bypass network.
  • This requires three bits per register file to
    check that the required conditions for a
    transient value are met. In addition,
    checkpointing is used for rolling back in case
    of interrupts, etc.

14
Register File Bypass Path
[Figure labels: Multi-ported RF, Bypass buffer, ALU, ALU]
15
Results of Selective Write-back
  • Write energy is 1.8 times read energy.
  • 45% of the produced results are transients and
    need not be written back.
  • This results in a 36% reduction in energy
    consumption in the RF. Assuming the RF itself
    takes 10% to 25% of the overall energy, this
    results in a 3% to 7% overall reduction.
  • It also improves the performance of the base
    processor by 10.9%. Over related techniques it
    improves performance by 5% to 10%.
  • The technique can be used to reduce the number of
    registers and/or ports and thus save energy.

16
Associative Memory
  • A number of associative memories are used in
    modern processor designs. These include the
    DTLB and ITLB (data and instruction TLBs) and
    the STQ and LDQ (store and load queues).
  • These are energy-intensive components, as each
    search involves both a broadcast of the key and
    a concurrent comparison across all the stored
    keys.
  • As the key data repeats frequently, this
    repetition can be exploited for energy
    reduction [3].

17
Associative Memory Structure
[Figure labels: broadcast key, comparator, key 0 / data 0, key 1 / data 1, ..., key (n-1) / data (n-1), match, encoder, multiplexer]
18
Search Key Memoization [3]
  • Typically the high-order bits of the search key
    repeat frequently
  • The optimal dividing line between the high-order
    (H) and low-order (L) bits varies from
    application to application. It is also a
    function of the address allocation strategy of
    the OS
  • If the H part of the key repeats, it is neither
    broadcast nor compared again. The result of the
    previous match for the H bits is stored in a
    flip-flop in each entry and is reused
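
The memoization idea can be sketched in software. The model below is illustrative only: the class name MemoizedCAM, the 12-bit L field, and the use of "bits compared" as a stand-in for energy are all assumptions, and the stored keys are assumed not to change between searches (a real design would have to clear an entry's flip-flop when its key is replaced).

```python
class MemoizedCAM:
    """Toy associative memory that reuses the previous H-bit match results."""

    def __init__(self, keys, key_bits=40, l_bits=12):
        self.l_bits = l_bits
        self.h_bits = key_bits - l_bits
        self.l_mask = (1 << l_bits) - 1
        self.keys = list(keys)
        self.h_match = [False] * len(self.keys)  # one flip-flop per entry
        self.last_h = None                       # H part of the previous search key
        self.bits_compared = 0                   # crude proxy for compare energy

    def search(self, key):
        """Return the index of the matching entry, or None if there is no match."""
        h, l = key >> self.l_bits, key & self.l_mask
        if h != self.last_h:
            # H changed: broadcast and compare it in every entry, latch the results.
            self.h_match = [(k >> self.l_bits) == h for k in self.keys]
            self.bits_compared += self.h_bits * len(self.keys)
            self.last_h = h
        # The L part is broadcast and compared on every search.
        self.bits_compared += self.l_bits * len(self.keys)
        for i, k in enumerate(self.keys):
            if self.h_match[i] and (k & self.l_mask) == l:
                return i
        return None


cam = MemoizedCAM([0x1234567001, 0x1234567002, 0x7654321003, 0x1234567004])
results = [cam.search(k) for k in (0x1234567002, 0x1234567004, 0x7654321003)]
print(results, cam.bits_compared)
```

In this run the second search shares its H part with the first, so only the L bits are compared for it; a full 40-bit comparison on every search would cost 3 x 40 x 4 = 480 compared bits in the same model.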

19
Key Memoization Structure
[Figure labels: KeyH, KeyL, key 0H, key 0L, comparator (H), comparator (L), match, clock, drive-upper]
20
Results
  • For a 40-bit virtual address, the size of L
    varied from 10 to 20 bits
  • DTLB power consumption was reduced by 70% and
    ITLB by 93%. For the ITLB, L was only 3 bits
  • A split into more than two parts could reduce
    power further. With 3 components, the DTLB power
    reduction went up to 81%

21
Off-Chip Bus Energy Reduction
  • Off-chip busses typically connect cache to the
    main memory (and possibly other peripherals).
  • Studies show they also consume 10% to 23% of
    system power.
  • Techniques for data encoding have been proposed
    to reduce the activity on these busses.
  • More recently, value caches on both sides of the
    bus, with associated encoding, have been used
    for power reduction.
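
Bus "activity" is usually measured as the number of lines that toggle between consecutive words. The helper below is a minimal sketch of that measurement; the function name bus_transitions and the assumption that the bus starts at all zeros are illustrative choices, not something stated in the slides.

```python
def bus_transitions(words, width=32):
    """Count bit toggles on a `width`-bit bus as `words` are sent in order."""
    mask = (1 << width) - 1
    prev = 0       # assumption: all bus lines start at zero
    toggles = 0
    for w in words:
        w &= mask
        toggles += bin(prev ^ w).count("1")  # lines whose value changed
        prev = w
    return toggles


print(bus_transitions([0x0000FFFF, 0xFFFF0000, 0xFFFF0000]))
```

Encoding schemes aim to lower this count for typical data streams, since each toggle on an off-chip line charges or discharges a comparatively large capacitance.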

22
Value Cache Approach
  • Yang et al. [4] first proposed a cache-based
    system for transmitting frequently occurring
    values. With the cache size restricted to the
    word length (32 entries), every hit can be
    transmitted by toggling just one bit, with a
    control signal indicating hit/miss. On a miss,
    the original data value is sent.

[Figure labels: On-chip Data Cache, Value cache, Control, Data Bus, Value cache, Off-chip Memory]
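
The hit/miss protocol can be sketched as a pair of identical caches, one at each end of the bus. This is an illustrative model, not the exact scheme of [4]: the class name ValueCacheEnd, the FIFO replacement policy, and the rule that a missed value is inserted into both caches are assumptions made for the example.

```python
class ValueCacheEnd:
    """One end of the bus; sender and receiver each hold an identical copy."""

    def __init__(self, size=32):
        self.size = size
        self.values = []       # cached values; index = bus line toggled on a hit
        self.next_victim = 0   # FIFO replacement pointer (an assumption)
        self.bus = 0           # last data word seen on the bus

    def _insert(self, value):
        if len(self.values) < self.size:
            self.values.append(value)
        else:
            self.values[self.next_victim] = value
            self.next_victim = (self.next_victim + 1) % self.size

    def send(self, value):
        """Return (hit, bus_word) to drive onto the control and data lines."""
        if value in self.values:
            self.bus ^= 1 << self.values.index(value)  # toggle exactly one line
            return True, self.bus
        self._insert(value)
        self.bus = value  # miss: the original data goes out unchanged
        return False, self.bus

    def receive(self, hit, bus_word):
        """Recover the original value from the control bit and the data word."""
        if hit:
            index = (self.bus ^ bus_word).bit_length() - 1  # which line toggled
            value = self.values[index]
        else:
            value = bus_word
            self._insert(value)
        self.bus = bus_word
        return value


tx, rx = ValueCacheEnd(), ValueCacheEnd()
data = [0xDEADBEEF, 0, 0xDEADBEEF, 0, 0xDEADBEEF]
hits, decoded = [], []
for v in data:
    hit, word = tx.send(v)
    hits.append(hit)
    decoded.append(rx.receive(hit, word))
print(hits, decoded == data)
```

Because both ends apply the same insertion rule to the same sequence of misses, their caches stay in step without any extra traffic; each hit costs one data-line toggle plus whatever the control line contributes.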
23
References
  • [1] V. Venkatachalam and M. Franz, Power
    Reduction Techniques for Microprocessor Systems,
    ACM Computing Surveys, Vol. 37, No. 3, Sep. 2005,
    pp. 195-237
  • [2] D. Balkan et al., Selective Writeback:
    Exploiting Transient Values for Energy-Efficiency
    and Performance, ISLPED 2006, Oct. 2006, pp. 37-42
  • [3] J. Sharkey et al., Power Efficient Wakeup Tag
    Broadcast, ICCD 2005, Oct. 2005, pp. 654-661
  • [4] J. Yang et al., FV Encoding for Low Power Data
    I/O, ISLPED 2001, Aug. 2001, pp. 84-87