Title: Energy Recovery Design for Low-Power ASICs
1Energy Recovery Design for Low-Power ASICs
- Conrad H. Ziesler (1), Joohee Kim (1)
- Suhwan Kim (2), Marios C. Papaefthymiou (1)
- (1) Advanced Computer Architecture Laboratory, University of Michigan, Ann Arbor
- (2) T. J. Watson Research Center, IBM Research, Yorktown Heights
3Tutorial Outline
- Introduction to energy recovery
- Application of energy recovery to SOC design
- Fine-grain dynamic pipelines
- Finite state machines
- Memory arrays
- Multi-GHz clocking
4Introduction to Energy Recovery
- Power dissipation in static CMOS design
- Energy recovery operation
- Implementation issues
- Quick glance at history
5Static CMOS Power
- Leakage Power
- Crowbar Power
- Active Power
6Static CMOS Active Power
- Active power to transfer charge to/from output capacitor
- Energy stored in capacitor when charged to voltage V is ½ CV²
- Energy dissipated in R1 to charge capacitor is ½ CV²
- To discharge capacitor, all energy stored in it is dissipated in R2
- Note: Voltage supply level fixed
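The ½ CV² charging loss can be checked numerically. The sketch below is an illustration only: it assumes an ideal RC model with a fixed supply, the function name and component values are hypothetical, and it integrates i²R over the charging transient to show that the loss does not depend on R.

```python
import math

def step_charge_dissipation(R, C, V, steps=20_000, t_end_in_tau=30.0):
    """Integrate i(t)^2 * R while an RC load charges from a fixed supply V.

    Assumes the ideal step response i(t) = (V / R) * exp(-t / (R * C)).
    """
    tau = R * C
    dt = t_end_in_tau * tau / steps
    energy = 0.0
    for k in range(steps):
        t = (k + 0.5) * dt                  # midpoint rule
        i = (V / R) * math.exp(-t / tau)
        energy += i * i * R * dt
    return energy

# Hypothetical values: 1 pF load charged to 1.5 V through two different resistances.
C, V = 1e-12, 1.5
stored = 0.5 * C * V ** 2
for R in (100.0, 10_000.0):
    ratio = step_charge_dissipation(R, C, V) / stored
    print(f"R = {R:>8.0f} ohm: dissipated / stored = {ratio:.4f}")
```

Both ratios come out essentially equal to one: with a fixed supply, changing the switch resistance changes how fast the energy is lost, not how much.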
7What if we had a time-varying power-supply?
- Turn on switch when time-varying voltage source is at same level as the voltage on output capacitor.
- Voltage on output capacitor slews up or down at same rate as voltage source.
- Most of the energy stored in capacitor is returned to time-varying power-supply.
8Model Simplification
- Use same switch for charging and discharging currents (if it can conduct in both directions).
- Time-varying voltage source Vpc, called power-clock, provides energy and synchronization.
- Textbooks sometimes draw ideal energy recovery model using current source instead of voltage source. Principle of operation is the same.
9Example: Linear Ramp
10Energy Dissipation
- Integrate charging or discharging current through resistor R
- Dissipation for either charging or discharging is approximately (RC/T) CV²
- T is rise time of Vpc (linear ramp)
If T >> RC delay and recovered energy is efficiently recycled, power savings can be substantial.
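The (RC/T) CV² approximation can be compared with the exact loss of an ideal RC circuit driven by a linear ramp. This is a sketch under stated assumptions: a lossless ideal ramp source, a constant R and C, and hypothetical component values. The function name is illustrative and not taken from the tutorial.

```python
import math

def ramp_dissipation(R, C, V, T):
    """Energy lost in R when a linear ramp 0 -> V with rise time T charges C.

    Closed-form solution of the RC circuit: loss during the ramp, plus the
    loss while the remaining voltage gap settles after the ramp has ended.
    """
    tau = R * C
    a = -math.expm1(-T / tau)          # 1 - exp(-T / tau)
    b = -math.expm1(-2.0 * T / tau)    # 1 - exp(-2T / tau)
    during_ramp = R * (C * V / T) ** 2 * (T - 2.0 * tau * a + 0.5 * tau * b)
    gap = V * (tau / T) * a            # source minus capacitor voltage at t = T
    after_ramp = 0.5 * C * gap ** 2
    return during_ramp + after_ramp

# Hypothetical values: R = 1 kohm, C = 1 pF (RC = 1 ns), V = 1.5 V.
R, C, V = 1e3, 1e-12, 1.5
tau = R * C
for T in (0.01 * tau, 10 * tau, 100 * tau):
    e = ramp_dissipation(R, C, V, T)
    print(f"T = {T / tau:6.2f} RC: "
          f"E / (0.5 CV^2) = {e / (0.5 * C * V ** 2):.4f}, "
          f"E / ((RC/T) CV^2) = {e / ((tau / T) * C * V ** 2):.4f}")
```

For T much longer than RC the exact loss tracks (RC/T) CV² closely, while for a very fast ramp it approaches the conventional ½ CV².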
11Implementation Issues
- How is power-clock generated?
- What capacitances should one try to recover energy from?
- How does the timing work if power and clock are intermingled?
12Power-Clock Generation Challenges
- Need to drive a large mostly capacitive load with a controlled rise/fall time waveform.
- Capacitive load may change as different gates switch on or off.
- Must do so with much less than CV² dissipation, otherwise gains are lost.
13Where to recover energy from?
- Power-clock is time-varying
- Synchronize energy recovery circuits with power-clock.
- Energy recovery requires small RC delay with respect to rise/fall time T
- Aim at small RC/T ratio.
- Energy recovery saves on active switching power
- Does not necessarily make sense to apply it on
low-activity loads.
14Good Candidates
- Large capacitive loads that frequently switch
- Clock networks
- Memory bit lines
- LCD row/column drivers
- High activity loads that are synchronous to power-clock
- Dynamic pipelined logic
15Quick Glance at History
- Physics (1970s)
- Logical reversibility of computation
- Connection to thermodynamics (adiabatic computing)
- No absolute minimum to energy dissipation, if computing is arbitrarily slow. (Does not have to be slow, however.)
- Engineering (1990s)
- Logic circuitry
- Energy recycling circuitry
- VLSI prototyping
16Sample Design Points from the 90s
- The case for reversible computation
- P. Solomon and D. Frank, IWLPD'94
- Asymptotically zero-energy split-level charge recovery logic
- S.G. Younis and T. F. Knight, Jr., IWLPD'94
- Clock-powered CMOS: A hybrid adiabatic logic style for energy-efficient computing
- N. Tzartzanis and W. C. Athas, ARVLSI'99
- And many, many others....
17Split-Level Charge-Recovery Logic
[Figure: split-level charge-recovery logic gate with input, internal node, and output; pass gate controlled by P and /P; rails φ and /φ]
- Initially, φ and /φ at Vdd/2, P at Gnd, and /P at Vdd.
- On valid input, the pass gate is turned on by gradually swinging P and /P.
- Rails φ and /φ split, gradually swinging to Vdd and Gnd.
- As soon as output is sampled, pass gate is turned off.
- Internal node is restored by gradually swinging φ and /φ back to Vdd/2.
- When is the gate output restored?
18Reversible Pipelines
[Figure: reversible pipeline. Forward split-level charge recovery logic blocks E, F, G, H and reverse blocks E⁻¹, F⁻¹, G⁻¹, H⁻¹, driven by rail phases φ1 through φ8 and pass-gate controls P1, /P1, P2, /P2. Output node set by E, restored by F⁻¹.]
- Gate outputs are restored using a reverse pipeline whose elements perform the inverse function of the forward pipeline
- Multiple phases required
19The Remainder of this Tutorial
- Application of energy recovery to SOC Modules
- Low switching activity state-machines or datapaths
- Fine-grain pipelines with high switching activity
- Multi-gigahertz class design
- Memory and memory-like arrays
- Interconnect and I/O
- Defining Characteristics
- Low overhead, fast operation, no reversibility
- Energy recovering chips that work and save power
over conventional operation!
20Targeting Energy Recovery to ASICs and SOC Modules
- Module characterization
- Throughput requirements
- How much time is available to compute?
- Is the application easily pipelined?
- Expected switching activity
- Can it be easily reduced?
- Memory, computation, or control?
- Clocking requirements
- Fixed frequency? Range of frequencies?
21Low Switching Activity Finite-State Machines or Datapaths
22Low Switching Activity Modules
- Best place to focus efforts is on clock tree and flip-flops
- Switching activity is low everywhere else
- Every capacitance connected to the clock switches every cycle
- Consider applying resonant clocking system
23Sample Dissipation Breakdown
- Focus on clock dissipation
- 20-50% of total power
- Apply energy recovery
- Single-phase clock node
- Target design automation
- Automate energy recovery
24Resonant Clock ASIC
- Compatible with ASIC flow
- Synthesized by Conrad Ziesler and Joohee Kim using in-house standard-cell library and commercial tools
- Energy recovering clock tree and SRAM word/bit lines
- Low-cost bulk CMOS process
- TSMC 0.25um, 108-pin PGA package, through MOSIS
- High frequency (300MHz)
- Low voltage (1.0-1.5V)
25ASIC Statistics
- Discrete wavelet transform (DWT)
- 3,897 gates, 413 ffs
- 15,571 transistors
- 400um x 900um
- 13.6 pF, 21 nH
- 300 MHz, 1.5V
- 0.25um logic process
Dual-mode DWT
Clock generator
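As a sanity check on these figures, the ideal LC resonance formula f = 1 / (2π√(LC)) can be evaluated for the listed 13.6 pF and 21 nH. This is a minimal sketch that assumes the two values are the resonating clock capacitance and the inductor, and that losses and parasitics are ignored. The function name is hypothetical.

```python
import math

def resonant_frequency_hz(inductance_h, capacitance_f):
    """Natural frequency of an ideal, lossless LC tank."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance_h * capacitance_f))

# Values from the ASIC statistics: 21 nH and 13.6 pF.
f = resonant_frequency_hz(21e-9, 13.6e-12)
print(f"ideal LC resonance: {f / 1e6:.1f} MHz")
```

Under these assumptions the result lands just below 300 MHz, in line with the listed operating frequency.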
26Chip Microphotograph
27System Overview
28The Energy Recovering Flip-Flop
[Figure: energy recovering flip-flop, showing the probe and the state element]
14 transistors, 84 um²
- Clock signal: single-phase resonant sinusoid
- Probe activates state element only if next state differs from present state.
- Low voltage operation at high speeds
- Delay similar to conventional flip-flops
- Fully compatible with standard-cell ASIC flow
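The probe behavior can be illustrated with a purely behavioral model. This is a sketch, not the circuit: it assumes one data sample per clock cycle and simply counts how often the state element would be activated. The function name and the sample stream are hypothetical.

```python
def count_activations(d_stream, initial_q=0):
    """Count cycles in which the state element is activated.

    Behavioral assumption: the probe compares next state (D) with present
    state (Q) each cycle and activates the state element only on a mismatch.
    """
    q = initial_q
    activations = 0
    for d in d_stream:
        if d != q:          # next state differs from present state
            q = d
            activations += 1
    return activations, q

# Hypothetical low-activity data stream: 6 cycles, only 2 state changes.
print(count_activations([0, 0, 1, 1, 1, 0]))
```

With low-activity data, most cycles leave the state element idle, which is the case the idle versus active characterization on the next slide distinguishes.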
29Flip-Flop Power Characterization
Order of magnitude difference between idle (D, Q
constant) and active (D, Q changing) dissipation.
30Resonant Clock Generator
- Resonate entire clock capacitance with small inductor
- Pump resonant system with NMOS switch at appropriate times
- NMOS switch only conducts incremental losses whenever on
[Figure: clock generator blocks: Control, Pre-driver, Driver, NMOS Switch]
31Clock Generator Operation
32Our Contributions
- Application of energy recovery to ASIC clock network
- Fully synthesized ASIC
- Resonant LC tank forms power-clock
- On-chip power-clock generator w/ off-chip inductor
- Fabrication in 0.25um standard CMOS process
- Compare energy recovering clocking system with conventional clocking system
- Direct DC power measurements for complete system
- Correct operation, 100-300MHz
- 20-50% savings in total power consumption
- 80-90% savings in clock power
33Recovering vs. Conventional Hardware
- Synthesized dual-mode ASIC
- Conventional
- Energy recovery
- Dual-mode flip-flop cell
- Conventional clock tree with conventional flip-flop
- Resonant clock tree with energy-recovering flip-flop
- Direct comparison of dissipation at target
throughput using identical hardware structures
34Correct Function
signature output (Verilog simulator)
35Summary
- Energy recovery technologies for reducing clock dissipation
- Single-phase sinusoidal clock
- Efficient, LC resonant clock generator
- Low power sinusoidally clocked flip-flop
- Key attributes
- Compatible with ASIC design flow, low overhead
- High frequency (100-300MHz)
- Low voltage (1-1.5V)
- Real, working chips (in 0.25um logic process)
36Break
37Fine-Grain Pipelines with High Switching Activity
38High Switching Activity Pipelines
- Problem Parameters
- Can't reduce switching activity
- Need lots of throughput
- Can easily do fine-grained pipelining
- Solution
- Dynamic energy recovering logic
- True single-phase source-coupled adiabatic logic
(SCAL)
39SCAL-D Logic Family
- Dynamic logic family that works with simple sinusoidal LC resonant clock
- Alternate PMOS and NMOS type gates like NMOS/PMOS "zipper" domino logic
- Test chip: 200MHz 8-bit multiplier implemented in 0.5um bulk silicon process
40SCAL-D Topology
[Figure: SCAL-D NMOS NAND gate driven by the power clock, with dual-rail inputs at/af and bt/bf, outputs ot/of, bias, and Vss]
- Sense amplifier
- Precharge diodes
- Current switches
- Evaluation tree
- Current tail
- Power-clock
- Single-phase sinusoidal power-clock
- Minimum-size low-swing evaluation tree
- Built-in state element
- Dual-rail noise-tolerant design
41SCAL-D Operation
- NMOS Precharge Phase
- rising edge of power-clock
- charge transferred to load
42SCAL-D Operation
- NMOS Evaluate Phase
- peak of power-clock
- non-adiabatic evaluation current
- current purposefully limited
43SCAL-D Operation
- NMOS Sense Phase
- falling edge of power clock
- sense amplifiers drive load
44SCAL-D Operation
- NMOS Hold Phase
- negative peak of power clock
- next pipeline stage samples outputs
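The four NMOS-gate phases on the preceding slides can be summarized as a lookup over the power-clock angle. This sketch is only a reading aid: it assumes a sine convention with 0 degrees at the rising zero crossing, and the 90-degree windows centered on each edge or peak are an arbitrary choice made here, not boundaries given in the slides.

```python
def nmos_gate_phase(angle_deg):
    """Map a power-clock angle to the NMOS-gate phase named on the slides.

    Convention (assumed): clock = sin(angle), so 0 deg is the rising edge,
    90 deg the positive peak, 180 deg the falling edge, 270 deg the negative peak.
    Each phase is given a 90-degree window centered on its edge or peak.
    """
    a = angle_deg % 360.0
    if a < 45.0 or a >= 315.0:
        return "precharge"   # rising edge of power-clock
    if a < 135.0:
        return "evaluate"    # peak of power-clock
    if a < 225.0:
        return "sense"       # falling edge of power-clock
    return "hold"            # negative peak of power-clock

for angle in (0, 90, 180, 270):
    print(angle, nmos_gate_phase(angle))
```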
45SCAL-D Implicit Pipelining
- Free pipelining: no flip-flops needed
- Example: pipelined "AND + full adder" cell from array multiplier
Static CMOS: 100 transistors
SCAL-D: 85 transistors
46Multiplier Chip
- Suhwan Kim (while still at UM) and Conrad Ziesler received First Prize in VLSI Design Contest, DAC 2001.
- Minimalist approach
- Simple tools: Magic and SPICE
- Low-cost standard CMOS process
- HP 0.5um, 40-pin DIP package, through MOSIS.
- Operational chip demonstrates practicality of energy-recovering circuit design
- Non-trivial size (8-bit operands, on-chip clock, self-test)
- High throughput
- Low energy dissipation
47Chip Microphotograph
48Test Chip Overview
- Two multipliers with self-test per chip (minimum size die)
- Integrated power-clock generator
- Resonant LC oscillator
49Multiplier and Self-Test
[Figure: multiplier floorplan with input BILBO (self-test), product array, multiplicand buffers, result summation, result buffers, self-test control, and output BILBO (self-test)]
- 9,048 devices in multiplier array, 2,806 devices in self-test circuitry
- Implemented entirely in energy-recovering dynamic logic family.
50Energy Comparison
[Figure: dissipation per cycle (pJ, 0 to 500) versus frequency (MHz, 50 to 200) for 2-stage, 4-stage, and 8-stage static CMOS and for energy recovery. Data points are annotated with supply voltages between 1.6V and 3.0V; the plot carries a "4x" annotation.]
51Single-Phase Power-Clock Generator
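[Figure: single-phase power-clock generator with supplies Vdd and Vss, switches S1 and S2, bias inputs Vbp and Vbn, inductor L, and power-clock output PC; sub-blocks of 25, 19, and 10 transistors]
- Zero-voltage switching
- LC-resonant clock generation
- External/bondwire inductor L
- Resistive/capacitive adiabatic load
- Compact: 170 x 115 um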
52Switch Timings
- Inductor current builds linearly when switches are on.
- Peak switch current less than peak inductor current.
- Switch S1 turned on at positive voltage peak.
- Switch S2 turned on at negative voltage peak.
- Fixed "on-window" controlled by pulse generator.
[Figure: inductor current and output voltage waveforms]
53Power-Clock Waveform
- Single-phase sinusoidal waveform at 140MHz
- 60pF load, 10nH external inductor
- One DC supply (Vdd, Vss), two DC biases (NMOS, PMOS)
54Multi-GHz Clocking
55Multi-Gigahertz Designs
- Need speed more than anything
- Clock distribution and skew biggest problems
- Retiming desirable for performance
- Need multiple phases of clock
- Clock power dominates
56Rotary Clock(TM) Principles
- Consider MultiGig's Rotary-Clock network (related slides courtesy of John Wood, MultiGig Inc.)
- Multiple transmission line loops arranged in grid
- Each loop supports a square-wave oscillation in lock step with neighbors
- Small variations between loops average/cancel out
- Ultra-high frequency low-skew clocking
57Multi-Ring Visualization
- Phase lock at junctions without PLL
58Numerical Example: Large Chip
Process: 0.18u CMOS, 1.8V
Size: 15 x 15 mm
Global clock: 2.5 GHz
FFs: 2 million; total capacitance: 10,000pF
Metal: width 40u, spacing 40u, thickness 1.5u, copper x 2
Active area: < 1%
Grid X pitch: 1300u; grid Y pitch: 1300u
Power: CV²F = 78 W; rotary power: 6 W
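The conventional clock power figure can be roughly reproduced from the listed capacitance, supply, and frequency. This is a back-of-the-envelope sketch: it assumes the full 10,000 pF switches rail to rail every cycle, and the function name is hypothetical. With the rounded values on the slide the product comes out near 81 W, slightly above the quoted 78 W; the slide does not say which exact inputs were used.

```python
def cv2f_power_w(capacitance_f, voltage_v, frequency_hz):
    """Conventional switching power C * V^2 * f for a load toggling every cycle."""
    return capacitance_f * voltage_v ** 2 * frequency_hz

# Rounded values from the slide: 10,000 pF, 1.8 V, 2.5 GHz.
conventional_w = cv2f_power_w(10_000e-12, 1.8, 2.5e9)
print(f"CV^2 f with rounded inputs: {conventional_w:.1f} W")

# Ratio of the two power figures quoted on the slide (78 W versus 6 W).
print(f"quoted conventional / rotary: {78.0 / 6.0:.1f}x")
```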
59Chip Microphotograph
60Test Chip Measurements
Power: 75% less than CV²F of clock capacitance
61Later Test Chips
- Test chip 2
- Switched-capacitor tuning
- (100MHz +/- 35% measured)
- Test chip 3: quad ring
- 3.5 GHz
- Tunable (varactor +/- 10%)
- Jitter below measurable levels
62Benefits of Rotary Clock Architecture
- Scalable in size and frequency.
- Reduces dynamic clock power.
- Guaranteed near-zero skew.
- Precise skew scheduling possible.
- Negligible jitter.
- Inherently low noise
- Tolerant to process, temperature, and supply
variation.
63Memories and Array-Like Structures
64Memory and Array Structures
- Many heavily loaded wires all switching
- Already optimized algorithms to reduce number of accesses
- Already using low-power sense-amplifiers
- Consider energy recovery on the bit wires
65Breakdown of CPU Dissipation
strongARM, JSSC 1996
66Memory Power
256x256 array
Long bit/word lines with large capacitance lead to high power consumption
67Energy Recovery Driver
[Figure: ER driver between a power source (0 - VDD) and load C; labels: Q, ER Driver, C, Power source]
- Sinusoidal power-clock
- Synchronization ?
- Correctness
- Efficiency
68Outline
- Energy recovering driver
- Synchronization
- Feedback
- Energy recovering SRAM
- Operation
- Simulation of full-custom ERSRAM
- Voltage scaling behavior
69One Extreme: Fully Gradual Transition
[Figure: ON window, power-clock (PC), and driver output waveforms]
- Maximum power efficiency
- Relatively slow operation
70The Other Extreme: Abrupt Transition
[Figure: two ON windows, power-clock (PC), and driver output waveforms]
- Low power efficiency
- Fast operation
71Partially Gradual Transition
[Figure: two ON windows, power-clock (PC), and driver output waveforms]
72ER Driver Core
[Figure: ER driver core with inputs ch and dch, pull-up control, pull-down control, power-clock (PC), and driver output]
- Single-phase power-clock
- Transmission gate ( wide range of operation )
- Synchronizing circuitry
73Pull-Up Control
[Figure: pull-up control circuit and waveforms, with signals ch, ch_out, and PC]
- Transistors sized to achieve correct timing and
pulse width
74Pull-Down Control
[Figure: pull-down control circuit and waveforms, with signals dch, dch_out, and PC]
- Transistors sized to achieve correct timing and
pulse width
75Tolerance to Control Timing Variations
[Figure: minimum required and maximum possible ch and dch control windows, shown against PC and the driver output]
76Dissipation During Consecutive Charging/Discharging
[Figure: PC and output waveforms for consecutive charging and for consecutive discharging]
77Complete Structure with Feedback
[Figure: complete ER driver with inputs ch and dch, pull-up control, pull-down control, power-clock (PC), driver output, and feedback]
- Feedback circuitry
- Prevents redundant dissipation during consecutive
charging/discharging
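The effect of the feedback can be pictured with a behavioral count of driver control pulses. This is a sketch under loud assumptions: it models one requested output level per cycle, assumes that without feedback every request triggers a charge or discharge pulse, and counts pulses rather than energy. The function name and the bit pattern are hypothetical.

```python
def count_driver_pulses(requested_levels, feedback, initial_level=0):
    """Count charge/discharge control pulses issued by a behavioral ER driver.

    With feedback enabled, a request that matches the current output level is
    suppressed, modeling the removal of redundant consecutive charging or
    discharging.
    """
    level = initial_level
    pulses = 0
    for requested in requested_levels:
        if feedback and requested == level:
            continue            # output already at requested level
        pulses += 1
        level = requested
    return pulses

pattern = [1, 1, 1, 0, 0, 1]    # hypothetical bit-line pattern
print("without feedback:", count_driver_pulses(pattern, feedback=False))
print("with feedback:   ", count_driver_pulses(pattern, feedback=True))
```

In this toy pattern only the three actual level changes issue a pulse once feedback is enabled.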
78ERSRAM Operation
[Figure: WL, BLT, and BLF waveforms across write, idle, and read cycles]
- WL: Explicit discharge after each access for low power
- BL: Precharge low (with modified sense amp) for single cycle read and write
79ERSRAM Architecture
[Figure: two 128 x 256 cell arrays, each with its own WL driver, BL driver, and sense amp]
- 2 x 128 x 256
- Only drivers and sense amplifiers are different
from that of conventional SRAM
80Simulation
- TSMC 0.35um process.
- Full-custom 256x256 conventional and ER SRAM
- Hspice simulation
81Power Breakdown
82Wide Operation Range
[Figure legend: Conv. SRAM, ERSRAM]
- Tolerant to variations in operating conditions
- Memory failure due to mistiming in sense amp
enable
83Voltage Scaling
- Functions correctly down to 0.7V, 1MHz
84Summary
- Static RAM with novel energy recovering driver
- Single-phase power-clock
- Single-cycle read and write
- High speed, low complexity
- Operation range: 0.7V, 1MHz to 3.5V, 500MHz with 0.35um process
- Energy efficiency: 2.6x at 3V, 300MHz, alternating read-write access
85Previous Work on SRAMs
- Multi-phase power-clock
- Multi-cycle operation
- Relatively high complexity and low/moderate speeds
- Inefficient during consecutive charging
- Not voltage scalable
- Somasekhar, Ye and Roy, ISLPED 1995
- Tzartzanis and Athas, ISLPED 1996
- Moon and Jeong, JSSC 1998
- Avery and Jabri, ISLPED 1998
- Kwon, Lim and Chae, ISLPED 2000
- Ng and Lau, JCSC 2000
- Tzartzanis, Athas and Svensson, ESSCIRC 2000
86Interconnect and I/O
- Highly capacitive parallel buses that take an entire cycle for data transfer
- Can't reduce switching activity or drive voltage
- Low slew-rates desirable
- Consider energy recovering driver
- Techniques similar to memories
87Tutorial Summary
- Introduction to energy recovery
- Application of energy recovery to SOC design
- Finite state machines
- Fine-grain dynamic pipelines
- Multi-GHz clocking
- Memory arrays and I/O
- Functional chips in bulk silicon demonstrating
substantial energy savings and fast operation in
practice
88Reference Material
- Energy recovery group web site at U. Michigan
- http://www.eecs.umich.edu/acal/energyrecovery
- Extensive list of references in our tutorial paper in SOC'03 proceedings
- The Physics of Computation
- R. Feynman