UC Davis Seminar - PowerPoint PPT Presentation

UC Davis Seminar
Seminar 1. Ram Kumar Krishnamurthy, Microprocessor Research Labs, Intel Corporation, Hillsboro, OR. ram.krishnamurthy_at_intel.com. Intel Labs, July 5, 2005

1
Microprocessor and DSP Technologies for the
Nanoscale Era
Seminar 1
Ram Kumar Krishnamurthy
Microprocessor Research Labs, Intel Corporation, Hillsboro, OR
ram.krishnamurthy_at_intel.com
July 5, 2005
2
About Circuits Research Lab
  • Established 1996
  • Belongs under Microprocessor Technology Labs
  • Located in Hillsboro, Oregon, USA (primary) and
    Bangalore, India
  • 75 researchers
  • Charter
  • High-performance low-power digital circuits
  • Off-chip I/O signaling circuits
  • Power delivery circuits
  • >50 patents, >25 papers per year

3
Motivation: Higher performance at lower power and cost
Pentium 4 Architecture
Pentium Pro Architecture
Pentium Architecture
486
386
286
8086
Strong demand for > 1 TIPS performance beyond this decade. How do you get there?
4
Our Research Agenda Outlook
Year: 2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018
Technology Node (nm): 90, 65, 45, 32, 22, 16, 11, 8
Integration Capacity (BT): 0.5, 1, 2, 4, 8, 16, 32, 64
Delay (CV/I) scaling: 0.7, 0.7, >0.7, then delay scaling will slow down (2010 onward)
Energy/Logic Op scaling: >0.35, >0.5, >0.5, then energy scaling will slow down (2010 onward)
Bulk Planar CMOS: High Probability at first, moving to Low Probability
Alternate, 3G etc: Low Probability at first, moving to High Probability
Variability: Medium, then High, then Very High
ILD (K): 3, <3, then reduce slowly towards 2-2.5
RC Delay: 1, 1, 1, 1, 1, 1, 1, 1
Metal Layers: 6-7, 7-8, 8-9, then 0.5 to 1 layer per generation
5
Intel's Research Focus
Technology Leadership
Complete solution stack
Technology
Arch & Design
Platforms
Software
6
Architectures & Designs
Platform | Back End Server | Server | Desktop | Mobile | Handheld
Family | Itanium | Itanium, Xeon | Pentium, Celeron | Centrino, Pentium | Xscale
Architecture | IA64, VLIW | IA64/IA32 | IA32 | IA32 | ARM
Word | 64 bit | 64 bit (Itanium), 32 bit (Xeon) | 32 bit | 32 bit | 32 bit
Address Space | Huge | Huge/4 GB | 4 GB | 4 GB | 4 GB
Cache | 6 MB | 6 MB, 2 MB | 1 MB | 1 MB | 512 KB
Performance | High | High | High | Medium | Low
Power | 130 W | 100 W | < 100 W | 25 W | < 1 W
Power Metric | Watts/sq ft, Watts/cu ft | Watts/sq ft, Watts/cu ft | Watts | Watt-hours, Battery Life | Watt-hours, Battery Life
Cost | High | High | Med | Med to Low | Low
Our research agenda addresses all these platforms
7
Is Transistor a Good Switch?
8
Sub-threshold Leakage
Transistors will not be switches, but dimmers
9
Leakage Power
A. Grove, IEDM 2002
Leakage power limits Vt scaling
10
High Leakage → Impacts Functionality
M. Anders, R. Krishnamurthy et al, 2001 Symp.
VLSI Circuits
  • Sub-65nm Dynamic Circuit Active Leakage
    Tolerance
  • Cache, RF, Arrays, Bitlines most affected
  • Keeper sizes > 50% of pulldown strength
  • High contention → degraded performance
  • Slow keeper shutoff → high short-circuit power

11
Power Will be the Limiter
1B transistor integration capacity will exist
But the Power
Applications will demand TIPS performance
Challenge: Highest performance within the power envelope
12
Power Trend
Cooling Capacity Of Conventional System
Pentium 4 processor
Business As Usual is Not an Option
Pentium II processor
Pentium processor
Power (W)
486
386
C scales by 30% per generation but Vcc scales by 10-15% only!
Must maintain or reduce power in future
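The scaling remark above can be made concrete with a short calculation. This is a minimal sketch that assumes the standard switching-power relation P ∝ C · Vcc² · f, which the slide does not state explicitly. The function name `relative_dynamic_power` and the frequency scale factors are hypothetical, chosen only for illustration.

```python
def relative_dynamic_power(c_scale: float, v_scale: float, f_scale: float) -> float:
    """Ratio of new to old switching power for the same logic, using P ~ C * V^2 * f."""
    return c_scale * v_scale ** 2 * f_scale


# From the slide: C shrinks by 30% (x0.70), Vcc shrinks by 10-15% (x0.90 to x0.85).
# The frequency factors are assumptions, not figures from the talk.
for v_scale in (0.90, 0.85):
    for f_scale in (1.0, 1.5):
        ratio = relative_dynamic_power(0.70, v_scale, f_scale)
        print(f"Vcc x{v_scale:.2f}, f x{f_scale:.1f}: power x{ratio:.2f}")
```

Under these assumptions the per-block saving is modest once frequency rises, and it does not account for the growth in transistor count per generation, which is why total power tends to climb.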
13
Gate Oxide is Near Limit
Intel's High-K leadership is crucial for the industry
14
Power Density Will Get Even Worse
  • Need to Keep the Junctions Cool
  • Performance (Higher Frequency)
  • Lower leakage (Exponential)
  • Better reliability (Exponential)

Pat Gelsinger, ISSCC 2001
15
Active Power Reduction
Multiple Supply Voltages
Replicated Designs
Need high-speed multi-supply level converter
techniques
16
Leakage Control
Stack Effect
Body Bias
Sleep Transistor
Vbp
Vdd
+Ve
Logic Block
Equal Loading
Vbn
-Ve
2-10X reduction
2-1000X reduction
2-200X reduction
Need low leakage and leakage tolerant techniques
17
Dual Vt Design for Active Leakage Reduction
  • Technology provides two Vt
  • High Vt with nominal Ioff (lower performance)
  • Low Vt with 10X higher Ioff (higher performance)

High Vt
Number of paths
Employing high Vt everywhere yields
lower performance, and lower leakage (1X)
Delay
Low Vt
Employing low Vt everywhere yields
higher performance, but higher leakage (10X)
Number of paths
Delay
Logic path between latch boundaries
Dual Vt
Selective usage of low and high Vt yields higher performance,
yet low leakage: between 1X and <<10X
Number of paths
Delay
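The dual-Vt trade-off above can be illustrated with a toy model. This sketch assumes a path-level assignment rule (low Vt only on paths that miss a delay target at high Vt) and uses the slide's 1X and 10X leakage figures. The function names, the delay values, and the target are hypothetical, and real flows assign Vt per gate rather than per path.

```python
def assign_dual_vt(high_vt_delays_ps, target_ps):
    """Give low Vt only to paths that are too slow at high Vt (hypothetical rule)."""
    return ["low" if delay > target_ps else "high" for delay in high_vt_delays_ps]


def relative_leakage(assignment, low_vt_leak=10.0):
    """Average leakage relative to an all-high-Vt design (1X high Vt, 10X low Vt)."""
    total = sum(low_vt_leak if vt == "low" else 1.0 for vt in assignment)
    return total / len(assignment)


# Illustrative path delays in ps at high Vt; not measured data.
delays = [100, 150, 220, 260]
assignment = assign_dual_vt(delays, target_ps=200)
print(assignment, relative_leakage(assignment))
```

With only the slow paths on low Vt, the average leakage lands between the all-high-Vt case (1X) and the all-low-Vt case (10X), which is the point the slide makes.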
18
Chip Multi-Processing
  • Multi-core, each core Multi-threaded
  • Shared cache and front side bus
  • Each core has different Vdd & Freq
  • Core hopping to spread hot spots
  • Lower junction temperature

19
Memory Latency
[Figure: CPU, cache, and memory. Cache: small, a few clocks. Memory: large, 50-100ns.]
Assume 50ns memory latency
Cache miss hurts performance; worse at higher frequency.
Need power-efficient high-speed I/O techniques.
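Why a fixed memory latency hurts more at higher frequency follows from simple arithmetic. The sketch below uses the slide's assumed 50ns latency; the clock frequencies and the helper name `miss_penalty_cycles` are illustrative assumptions.

```python
def miss_penalty_cycles(latency_ns: float, freq_ghz: float) -> float:
    """Core clock cycles spent waiting on one memory access (ns * GHz = cycles)."""
    return latency_ns * freq_ghz


# 50ns is the latency assumed on the slide; the frequencies are examples only.
for freq_ghz in (1.0, 2.0, 4.0):
    cycles = miss_penalty_cycles(50.0, freq_ghz)
    print(f"{freq_ghz:.0f} GHz: {cycles:.0f} cycles per miss")
```

The latency in nanoseconds stays constant, so each doubling of clock frequency doubles the number of cycles lost per cache miss.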
20
Increase on-die Memory
  • Large on die memory provides
  • Increased Data Bandwidth & Reduced Latency
  • Hence, higher performance for much lower power

21
Special Purpose Hardware Acceleration
TCP Offload Engine
Opportunities for acceleration:
  • Network processing engines
  • MPEG Encode/Decode engines
  • Speech engines
  • Wireless communication/baseband
2.23 mm x 3.54 mm, 260K transistors
Special-purpose HW: best MIPS/Watt
22
Energy-efficient Data-path Circuits
Cache
Processor thermal map
Temp (°C)
Execution core
Integer and FP ALUs and MACs
  • ALUs: performance and peak-current limiters
  • High activity → thermal hotspots
  • Goal: high-performance energy-efficient design

23
130nm 9GHz 32-bit Integer ALU (ISSCC02)
32-bit integer exec core
M. Anders, R. Krishnamurthy et al, Intl. Solid-state Circuits Conf. 2002;
IEEE Journal of Solid-state Circuits 11/02
24
90nm 7GHz 64-bit Integer ALU (ISSCC04)
Upper-order 32-bit ALU
Lower-order 32-bit ALU
Clock Generator and Drivers
I/O Circuits
Process: 90nm Dual-Vt CMOS, 7 Metal
Die area: 0.474 mm²
64-bit ALU layout area: 0.073 mm²
Total transistor count: 6100
64-bit ALU average switching power (α = 0.3): 89mW at 4GHz, 1.3V, 25°C
64-bit ALU active leakage power: 9.6mW at 1.3V, 25°C
64-bit ALU maximum frequency: 7GHz at 2.1V, 25°C
32-bit ALU average switching power (α = 0.3): 71mW at 7GHz, 1.3V, 25°C
32-bit ALU active leakage power: 4.4mW at 1.3V, 25°C

S. Mathew, R. Krishnamurthy et al, Intl. Solid-state Circuits Conf. 2004;
IEEE Journal of Solid-state Circuits 01/05
64-bit ALU die microphotograph and measured
performance summary
  • 7GHz single-cycle 64-bit integer ALU (measured
    in 90nm CMOS)
  • Simultaneous 9GHz single-cycle 32-bit integer
    ALU mode
  • Fastest reported single-cycle 64-bit integer ALU
    performance

25
90nm 1GHz 9mW 16x16b Multiplier (ISSCC05)
Clock Generator and Drivers
16x16b Multiplier
R-PLA
I/O Circuits
Registers
Process: 90nm Dual-Vt CMOS
Die area: 0.474 mm²
16b Multiplier and PLA layout area: 0.03 mm²
16b Multiplier worst-case power: 9mW at 1GHz, 1.3V, 50°C (nominal)
16b Multiplier active leakage power: 540µW at 1.3V, 50°C (nominal)
16b Multiplier peak performance: 1.5GHz, 32mW at 1.95V, 50°C
16b Multiplier low-voltage mode performance: 50MHz, 79µW at 0.57V, 50°C
Reconfigurable PLA peak performance: 2.3GHz, 4.2mW at 1.3V, 50°C
Reconfigurable PLA worst-case power: 2mW at 1GHz, 1.3V, 50°C (nominal)
Stand-by mode power: 75µW (7X reduction vs. active leakage)
S. Hsu, R. Krishnamurthy et al, Intl. Solid-state
Circuits Conf. 2005
16x16-bit Multiplier die microphotograph and
measured performance summary
  • 1GHz single-cycle 16x16-bit DSP multiplier
    (measured in 90nm CMOS)
  • Reconfigurable PLA control engine
  • 9pJ/Op or 110GOPS/Watt
  • Highest reported GOPS/Watt for single-cycle
    16-bit multiply
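The energy-efficiency figures above follow from the measured power and frequency. The sketch below redoes that arithmetic with the slide's 9mW at 1GHz, assuming one multiply per cycle; the helper names are hypothetical.

```python
def energy_per_op_pj(power_mw: float, freq_ghz: float) -> float:
    """Energy per operation in pJ, assuming one operation per cycle (mW / GHz = pJ)."""
    return power_mw / freq_ghz


def gops_per_watt(power_mw: float, freq_ghz: float) -> float:
    """Throughput per watt in GOPS/W, assuming one operation per cycle."""
    return freq_ghz / (power_mw / 1000.0)


# Worst-case multiplier power from the summary table: 9mW at 1GHz.
print(energy_per_op_pj(9.0, 1.0))  # pJ per multiply
print(gops_per_watt(9.0, 1.0))     # GOPS per watt
```

The result is 9 pJ per multiply and about 111 GOPS/Watt, consistent with the rounded 110 GOPS/Watt quoted on the slide.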

26
32-bit ALU architecture
Mux control
Shift control
External operands
5:1 Mux
6:1 Mux
Adder core
O/p Mux
Sum
External operands
2:1 Mux
6:1 Mux
Mux control
Sign control
Loopback bus
Multiple ALUs clustered together in the execution
core → High power density
27
Full-Adder Intro
28
The Binary Adder
29
The Ripple-Carry Adder
Worst-case delay is linear in the number of bits:
$t_d = O(N)$
$t_{adder} = (N-1)\,t_{carry} + t_{sum}$
Goal: make the fastest possible carry-path circuit
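A bit-level model makes the linear carry chain visible. This is a minimal behavioral sketch, not a circuit description; the function names and the default width are assumptions made for illustration.

```python
def full_adder(a: int, b: int, cin: int):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | ((a | b) & cin)
    return s, cout


def ripple_carry_add(x: int, y: int, n: int = 32):
    """Add two n-bit values bit by bit; the carry passes through every stage in turn."""
    carry = 0
    result = 0
    for i in range(n):  # N sequential carry steps: the O(N) worst-case path
        a = (x >> i) & 1
        b = (y >> i) & 1
        s, carry = full_adder(a, b, carry)
        result |= s << i
    return result, carry


print(ripple_carry_add(5, 7, 8))    # sum and carry-out for an 8-bit add
print(ripple_carry_add(255, 1, 8))  # carry ripples through all eight bits
```

The loop body depends on the carry from the previous iteration, which is the software analogue of the (N-1) carry delays in the expression above.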
30
Static CMOS Full Adder
28 Transistors
31
$Sum_i = A_i \oplus B_i \oplus Carry_{i-1}$
$Carry_i = A_i \cdot B_i + (A_i + B_i) \cdot Carry_{i-1}$
Carry Look-ahead
32
$Sum_i = A_i \oplus B_i \oplus Carry_{i-1}$
$Carry_i = A_i \cdot B_i + (A_i + B_i) \cdot Carry_{i-1}$
Partial Sum
33
$Sum_i = A_i \oplus B_i \oplus Carry_{i-1}$
$Carry_i = A_i \cdot B_i + (A_i + B_i) \cdot Carry_{i-1}$
Partial Sum
Propagate
Generate
34
$Sum_i = A_i \oplus B_i \oplus Carry_{i-1}$
$Carry_i = A_i \cdot B_i + (A_i + B_i) \cdot Carry_{i-1}$
Partial Sum
Propagate
Generate
$Carry_i = G_i + P_i \cdot Carry_{i-1}$, with $G_i = A_i \cdot B_i$ and $P_i = A_i + B_i$
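The generate/propagate form can be checked against its unrolled look-ahead expansion. The sketch below assumes generate as AND and propagate as OR of the operand bits; the function names are hypothetical and the model is behavioral only.

```python
def pg_signals(x: int, y: int, n: int):
    """Per-bit propagate (A OR B) and generate (A AND B), least significant bit first."""
    g = [((x >> i) & (y >> i)) & 1 for i in range(n)]
    p = [((x >> i) | (y >> i)) & 1 for i in range(n)]
    return p, g


def carries_recurrence(x: int, y: int, n: int, cin: int = 0):
    """Carry_i = G_i + P_i * Carry_{i-1}, evaluated sequentially."""
    p, g = pg_signals(x, y, n)
    carries, c = [], cin
    for i in range(n):
        c = g[i] | (p[i] & c)
        carries.append(c)
    return carries


def carries_lookahead(x: int, y: int, n: int, cin: int = 0):
    """Same carries from the unrolled form: each carry uses only P, G, and cin."""
    p, g = pg_signals(x, y, n)
    carries = []
    for i in range(n):
        c, run = 0, 1  # run = P_i * P_{i-1} * ... * P_{j+1}
        for j in range(i, -1, -1):
            c |= g[j] & run
            run &= p[j]
        c |= run & cin  # carry-in propagated through every bit up to i
        carries.append(c)
    return carries


print(carries_recurrence(0b1011, 0b0110, 4))
print(carries_lookahead(0b1011, 0b0110, 4))
```

Both functions return the same list, which is the property a look-ahead adder relies on: every carry can be formed directly from the P and G signals without waiting for the previous carry.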

35
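High-performance Adders: Kogge-Stone
[Figure: Kogge-Stone adder organization, gate stages 1-7. Even and odd input bits each pass through PG generation, carry-merge stages CM1-CM5, and an XOR stage to produce Sum_even and Sum_odd.]
Carry-merge: $GG = G_i + P_i \cdot G_{i-1}$, $GP = P_i \cdot P_{i-1}$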
  • Generate all 32 carries
  • Full-blown binary tree → energy-inefficient
  • Carry-merge stages: log2(32) → 5 stages
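A behavioral model shows how the carry-merge stages double their span each time and finish in log2(N) steps. This is a sketch under the assumption of OR-style propagate and a zero carry-in; the function name and return shape are hypothetical.

```python
def kogge_stone_add(x: int, y: int, n: int = 32):
    """Parallel-prefix addition: returns (sum, carry_out, carry_merge_stages)."""
    g = [((x >> i) & (y >> i)) & 1 for i in range(n)]         # generate
    p = [((x >> i) | (y >> i)) & 1 for i in range(n)]         # propagate
    half_sum = [((x >> i) ^ (y >> i)) & 1 for i in range(n)]  # partial sum

    stages, dist = 0, 1
    while dist < n:
        new_g, new_p = g[:], p[:]
        for i in range(dist, n):
            # Carry-merge: GG = G_i + P_i * G_(i-dist), GP = P_i * P_(i-dist)
            new_g[i] = g[i] | (p[i] & g[i - dist])
            new_p[i] = p[i] & p[i - dist]
        g, p = new_g, new_p
        dist *= 2
        stages += 1

    # After the last stage, g[i] is the carry out of bit i.
    result = 0
    for i in range(n):
        carry_in = g[i - 1] if i > 0 else 0
        result |= (half_sum[i] ^ carry_in) << i
    return result, g[n - 1], stages


print(kogge_stone_add(123456789, 987654321))
```

For 32 bits the loop runs five times, matching the five carry-merge stages on the slide. Every bit position is updated in every stage, which is where the gate count and the energy cost come from.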

36
Kogge-Stone Adder
[Figure: 32-bit Kogge-Stone tree over bit positions 0-31, showing the PG row, the carry-merge gates, and the XOR row.]
  • Critical path: PG + 5 + XOR = 7 gate stages
  • Generate, Propagate fanouts of 2, 3
  • Maximum interconnect spans 16b

Energy inefficient
37
Sparse-tree Adder Architecture
  • Generate every 4th carry in parallel
  • Side-path 4-bit conditional sum generator
  • 73% fewer carry-merge gates → energy-efficient

38
Non-critical Sum Generator
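[Figure: 4-bit conditional sum generator. Inputs: P_i, G_i+1, P_i+1, (P_i+2, G_i+2), (P_i+3, G_i+3). Elements shown: carry-in constants 1 and 0, six carry-merge (CM) gates, six XOR gates forming the conditional sums Sum_i,1 and Sum_i,0, and four 2:1 muxes selected by Carry that output Sum_i, Sum_i+1, Sum_i+2, and Sum_i+3.]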
  • Non-critical path ripple carry chain
  • Reduced area, energy consumption, leakage
  • Generate conditional sums for each bit
  • Sparse-tree carry selects appropriate sum
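The interplay between the sparse carry tree and the conditional sum generators can be modeled behaviorally. This sketch assumes 4-bit groups, OR-style propagate, and a parallel-prefix merge across group-level generate/propagate signals standing in for the sparse tree; all function names are hypothetical and no circuit-level detail from the talk is reproduced.

```python
def bits(value: int, n: int):
    """Little-endian list of the low n bits of value."""
    return [(value >> i) & 1 for i in range(n)]


def ripple_group(a_bits, b_bits, cin: int):
    """Non-critical ripple chain: sums and carry-out for one group."""
    sums, carry = [], cin
    for a, b in zip(a_bits, b_bits):
        sums.append(a ^ b ^ carry)
        carry = (a & b) | ((a | b) & carry)
    return sums, carry


def sparse_tree_add(x: int, y: int, n: int = 32, group: int = 4):
    """Returns (sum, carry_out) using one carry per group plus conditional sums."""
    a, b = bits(x, n), bits(y, n)
    starts = range(0, n, group)

    # Group-level generate (carry-out with cin = 0) and propagate (all bits propagate).
    gg, gp = [], []
    for lo in starts:
        ga, gb = a[lo:lo + group], b[lo:lo + group]
        gg.append(ripple_group(ga, gb, 0)[1])
        gp.append(int(all(ai | bi for ai, bi in zip(ga, gb))))

    # Carry-merge across groups only: this yields every 4th carry.
    dist = 1
    while dist < len(gg):
        new_g, new_p = gg[:], gp[:]
        for k in range(dist, len(gg)):
            new_g[k] = gg[k] | (gp[k] & gg[k - dist])
            new_p[k] = gp[k] & gp[k - dist]
        gg, gp = new_g, new_p
        dist *= 2

    # Each group precomputes both conditional sums; the group carry selects one.
    result = 0
    for k, lo in enumerate(starts):
        ga, gb = a[lo:lo + group], b[lo:lo + group]
        sums0, _ = ripple_group(ga, gb, 0)
        sums1, _ = ripple_group(ga, gb, 1)
        carry_in = gg[k - 1] if k > 0 else 0
        chosen = sums1 if carry_in else sums0
        for j, s in enumerate(chosen):
            result |= s << (lo + j)
    return result, gg[-1]


print(sparse_tree_add(123456789, 987654321))
```

Only one carry per group goes through the merge tree, while the short ripple chains run in parallel off the critical path, which is the division of labor the bullets above describe.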

39
Adder Core Critical Path
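[Figure: adder core critical path. Labels: adder inputs; clocks clk, clk2, clk3; single-rail dynamic sparse-tree path with PG, GG1, GG3, GG7, GG15, GG27, and C27; static sum generator with CM0 latch, CM1, and XOR producing Sum31_0 and Sum31_1, from which Sum31 is selected.]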
  • Critical path: 7 gate stages → same as KS
  • Sparse-tree: single-rail dynamic
  • Exploit non-criticality of sum generator
  • Convert to static logic → Semi-dynamic design

40
Sparse-tree Architecture
  • Performance impact (20% speedup)
  • 33-50% reduced G/P fanouts
  • 80% reduced wiring complexity
  • 30% reduction in maximum interconnect
  • Power impact (56% reduction)
  • 73% fewer carry-merge gates
  • 50% reduction in average transistor size

41
Energy-delay Space
[Figure: worst-case energy (pJ, 0-100) vs. delay (ps, 140-280) in 130nm CMOS at 1.2V, 110°C, comparing Dynamic Kogge-Stone with Semi-dynamic Sparse-Tree; the 4GHz design point is marked, with 20% and 56% annotations.]
  • 20% speedup over Kogge-Stone
  • 56% worst-case energy reduction
  • Scales with activity factor

42
Semi-dynamic Design
[Figure: average energy (pJ, 0-40) vs. activity factor (0-0.5) for Dynamic Kogge-Stone and Semi-dynamic Sparse-Tree, with a 71% annotation.]
  • Static sum generators: low switching activity
  • 71% lower average energy at 10% activity

43
So, How Do We Get There?
Significant challenges ahead. They can only be solved
with joint industry-university collaboration
44
Thank You for Your Attention
  • Q&A
  • Our publications can be found in:
  • IEEE Intl. Solid-State Circuits Conference, 2001-
  • IEEE Journal of Solid-State Circuits, 2001-
  • Symposium on VLSI Circuits, 1999-
  • Intl. Symposium on Low-power Design, 1999-
  • Custom Integrated Circuits Conference, SOCC,
    etc., 1999-

45
Backup
46
Optimized First-level Carry-merge
Conditional carry for Cin = 0
0
CM
Gi
C_0
  • Carry-merge stage reduces to inverter
  • Conditional carry_0 = Gi

47
Optimized First-level Carry-merge
1
Conditional carry for Cin = 1
CM
Pi
C_1
Gi
Ai Bi Pi Gi C_1
0 0 0 0 1
0 1 1 0 0
1 0 1 0 0
1 1 1 1 0
Pi
C_1
  • Pi & Gi correlated
  • Conditional carry_1 = Pi
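The simplification of the first-level carry-merge can be verified exhaustively over the four input combinations. The sketch assumes Pi is the OR and Gi the AND of the operand bits, as the truth table indicates. Reading the table's C_1 column as the complement of Pi (an inverting gate output) is an assumption, not something the transcript states.

```python
rows = []
for a in (0, 1):
    for b in (0, 1):
        p = a | b  # propagate
        g = a & b  # generate
        carry_cin0 = g | (p & 0)  # carry-merge with Cin = 0
        carry_cin1 = g | (p & 1)  # carry-merge with Cin = 1
        assert carry_cin0 == g    # reduces to Gi
        assert carry_cin1 == p    # reduces to Pi, because Gi = 1 implies Pi = 1
        rows.append((a, b, p, g, 1 - p))  # last column: complement of Pi

for row in rows:
    print(*row)
```

The printed rows reproduce the Ai, Bi, Pi, Gi, C_1 table above, so a full carry-merge gate is unnecessary at the first level for either assumed carry-in.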

48
Optimized Sum Generator
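[Figure: optimized 4-bit sum generator with the optimized first-level carry-merge. Inputs: P_i, G_i+1, P_i+1, (P_i+2, G_i+2), (P_i+3, G_i+3). Elements shown: four carry-merge (CM) gates, six XOR gates forming the conditional sums Sum_i,1 and Sum_i,0, and four 2:1 muxes selected by Carry that output Sum_i, Sum_i+1, Sum_i+2, and Sum_i+3.]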
  • Optimized non-critical path: 4 stages