Achieving 550 MHz in an ASIC Methodology

1
Achieving 550 MHz in an ASIC Methodology
David Chinnery, Borivoje Nikolic, Kurt Keutzer
University of California at Berkeley
2
ASIC and custom gap in processors in 0.18 um
200 MHz in slower, cheaper process
5× gap between ASIC and custom speed
Can this be bridged?
Pentium 4: 1700 MHz
Average ASIC: 100 to 250 MHz
Tensilica Xtensa: 320 MHz
  • Last year we showed the 6× to 8× gap and its
    causes
  • But where is an ASIC bridging the gap?
3
ASICs can do better: disk drive read channels
The picture in 1999
200 MHz in slower, cheaper process
1.5× gap between ASIC and custom speed
This can't be bridged yet
Average ASIC: 100 to 250 MHz
Pentium III: 800 MHz
TI SP4140: 550 MHz, 0.21 um
Tensilica Xtensa: 320 MHz
  • Last year we showed the 6× to 8× gap and its
    causes
  • But where is an ASIC bridging the gap?
  • Texas Instruments SP4140!
4
Where does the speed go?
5
Where does the speed go?
  • ×4.20 micro-architecture and pipelining
  • Architectural transformations to shorten critical
    path
  • Reducing critical path length by inserting
    flip-flops or latches
  • ×1.00 vs. good ASIC
6
Where does the speed go?
  • ×2.00 due to process variation and accessibility
  • ×1.20 vs. good ASIC

ASIC libraries may lag technology improvements
[Figure: distribution of chips produced vs. speed, marking ASIC worst case in worst process, ASIC good yield in worst process, ASIC good yield in good process, and the fastest custom bin; annotated ×2.0]
7
Where does the speed go?
  • ×1.50 through dynamic logic on critical paths
  • p-channel MOSFETs replaced by precharge
    transistor
  • Reduced gate input capacitance, reduced area
  • There are also other high-speed custom logic
    styles

[Figure: static CMOS gate and domino logic gate, each between VDD and GND, with the domino gate driven by the clock]
8
Where does the speed go?
  • ×1.40 timing
  • Distribute clock tree carefully to reduce clock
    skew
  • Use latches instead of flip-flops
  • Avoid clock skew and setup time penalty
  • Share slack between pipeline stages
  • Slack passing
  • Time borrowing
  • ×1.15 vs. good ASIC

[Figure: latch timing diagram showing the clock (H/L), D1, Q1, D2, Q2, the latch D-Q delays, the combinational delays of comb1 and comb2, and the slack passed between stages]
9
Where does the speed go?
  • ×1.25 good floorplanning and placement
  • Place connected modules nearby, reducing wire
    lengths
  • Layout follows datapath
  • ×1.00 vs. good ASIC
10
Where does the speed go?
  • ×1.25 appropriate sizing of transistors and wires
  • ×1.05 vs. good ASIC

[Figure: two circuit schematics between VDD and GND, each driving a load capacitance C]
11
ASIC Example: The TI SP4140
  • Constraints
  • Entirely new design, in a new process, in 9
    months
  • Disk drive read channel needs high throughput,
    525 Mb/s
  • Maximum power 1.7 W at full speed
  • Technology characteristics
  • Supply voltage 1.9 V
  • Process 0.21 um CMOS (0.18 um effective channel
    length)

[Figure: SP4140 block diagram. Write path (write data to write signal): encoder, scrambler, precomp. Read path (read signal to read data): VGA, CT filter, ADC, FIR filter equalizer, Viterbi decoder, descrambler, decoder. Also shown: servo and timing recovery.]
12
The TI SP4140 Critical Paths
  • The digital logic critical paths are in the read
    portion
  • Encoded data is read and 6-bit sampled at 550
    Mb/s
  • FIR filter
  • Critical path of this is the multiply-accumulate
    operation
  • Runs at 275 MHz, 550 Mb/s throughput
  • Lower power consumption
  • Viterbi decoder
  • Critical part of this is the add-compare-select
    (ACS) unit
  • Single cycle feedback
  • Hard to pipeline, would have to unroll the
    recursive loop
  • Runs at 550 MHz, 525 Mb/s output (redundancy
    removed)

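The clock and throughput figures on this slide can be checked with a few lines of arithmetic. This is a minimal sketch, assuming each path delivers one bit per clock cycle and that the FIR filter uses two parallel paths; the function name `throughput_mbps` is hypothetical.

```python
def throughput_mbps(clock_mhz: float, parallel_paths: int = 1) -> float:
    """Throughput if every path delivers one bit per clock cycle (assumption)."""
    return clock_mhz * parallel_paths

# FIR filter: 275 MHz clock, two parallel paths
fir_mbps = throughput_mbps(275, parallel_paths=2)

# Viterbi decoder: single-cycle recursion at the full 550 MHz
viterbi_mbps = throughput_mbps(550)

# Ratio of user data out (525 Mb/s) to channel bits in (550 Mb/s)
code_rate = 525 / viterbi_mbps

print(fir_mbps, viterbi_mbps, round(code_rate, 4))
```

Halving the FIR clock is what allows the lower power consumption noted above, while the Viterbi recursion has no such option and must close timing at the full rate.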
13
Components of Critical Path Delay
  • With flip-flops, cycle time T is a function of:
  • Combinational logic delay
  • Clock skew
  • Setup time, when input must be stable
  • Clock-to-Q delay, from clock edge to when output
    changes

[Figure: flip-flop timing diagram showing data, Q1, Q2, the clock arrival times Tclock1 and Tclock2, and the clock-to-Q delay]
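In the usual flip-flop timing model, the four components listed on this slide simply add up to the minimum cycle time. The sketch below assumes that textbook form; the function names and all delay values are hypothetical and chosen only to illustrate the sum.

```python
def min_cycle_time_ff(t_comb: float, t_skew: float,
                      t_setup: float, t_clk_to_q: float) -> float:
    """Minimum clock period (ns) for a flip-flop stage: all four terms add."""
    return t_comb + t_skew + t_setup + t_clk_to_q

def max_frequency_mhz(period_ns: float) -> float:
    """Convert a clock period in ns to a frequency in MHz."""
    return 1000.0 / period_ns

# Hypothetical delays in ns, not taken from the SP4140
period = min_cycle_time_ff(t_comb=1.5, t_skew=0.2, t_setup=0.1, t_clk_to_q=0.2)
print(f"T >= {period:.2f} ns, f <= {max_frequency_mhz(period):.0f} MHz")
```

Everything except `t_comb` is timing overhead: it consumes cycle time without doing useful computation.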
14
Concrete for our bridge
We examine the architecture, timing overhead, and process.
15
1. Architectural Transformations
  • Increase speed by reducing the critical path
    length
  • Pipelining: adding latches or flip-flops between
    logic
  • With flip-flops, pipeline stages must be balanced
    for high speed
  • Latches allow time borrowing and sharing between
    stages
  • Parallelization: increasing throughput by
    duplication
  • Retiming to remove some logic from the critical
    path
  • 25% speed up from ASIC 320 MHz to 400 MHz!
16
Pipelined FIR
  • Pipelining breaks up the critical path into
    smaller pieces

[Figure: direct-form FIR and transpose-form FIR, each with input x(n), tap multipliers h0, h1, h2, ..., hn, and output y(n)]
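The two FIR structures named on this slide compute the same output; only the position of the registers differs. The following is a behavioural sketch, not the SP4140 implementation, and the function names are hypothetical. In the transpose form a register sits between every pair of adders, so the longest path per cycle is one multiply and one add.

```python
def fir_direct(x, h):
    """Direct form: delay line on the input, then one long adder chain."""
    delay = [0] * len(h)
    out = []
    for sample in x:
        delay = [sample] + delay[:-1]          # shift the input delay line
        out.append(sum(c * d for c, d in zip(h, delay)))
    return out

def fir_transpose(x, h):
    """Transpose form: input feeds every tap, registers sit between adders."""
    state = [0] * (len(h) - 1)                 # registers between the adders
    out = []
    for sample in x:
        products = [c * sample for c in h]
        out.append(products[0] + (state[0] if state else 0))
        # register i latches product i+1 plus the previous value of register i+1
        state = [
            products[i + 1] + (state[i + 1] if i + 1 < len(state) else 0)
            for i in range(len(state))
        ]
    return out

x = [1, 2, 3, 4, 5]
h = [1, -2, 3]
print(fir_direct(x, h))
print(fir_transpose(x, h))
```

Both calls print the same sequence, which is the point of the transformation: the arithmetic is unchanged while the critical path is broken into smaller pieces.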
17
Pipelined, Interleaved FIR
  • Computation in parallel to double throughput

[Figure: transpose-form FIR with input x(n), tap multipliers h0, h1, h2, ..., hn, and output y(n)]
Two-path parallel transpose FIR: throughput doubles, area doubles
18
Pipelined, Interleaved FIR
  • 8 pipeline stages, parallel computation on two
    paths
  • Speed up by pipelining and interleaving, with
    flip-flops
  • Initial cycle time of
  • After transposing to pipeline
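  • Interleaving doubles throughput, but doubles the
    area

[Figure: two interleaved transpose-form FIR paths with inputs x(n)odd and x(n)even, tap multipliers h0, h1, h2, h3, ..., hn on each path, outputs y(n)odd and y(n)even, and a MUX with a select input producing y(n)]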
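One way to picture the two-path arrangement is as two copies of the filter, each responsible for every other output sample, with a multiplexer alternating between them. The sketch below is a behavioural model under that assumption; how the hardware actually routes odd and even samples is not stated in the text, and the function names are hypothetical.

```python
def fir_reference(x, h):
    """Plain single-path FIR used as the reference result."""
    return [
        sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(x))
    ]

def fir_two_path(x, h):
    """Two filter copies: path 0 produces even-indexed outputs, path 1 odd."""
    def path(start):
        # Each path produces one output every two sample periods,
        # so it can be clocked at half the sample rate.
        return [
            sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(start, len(x), 2)
        ]

    even, odd = path(0), path(1)
    # Output MUX: the select signal alternates between the two paths
    return [even[i // 2] if i % 2 == 0 else odd[i // 2] for i in range(len(x))]

x = [1, 2, 3, 4, 5, 6]
h = [1, -2, 3]
print(fir_two_path(x, h) == fir_reference(x, h))
```

Because each path only needs a result every second sample, a 275 MHz clock on both paths is enough to sustain 550 Mb/s, at the cost of duplicating the filter.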
19
Retimed Viterbi Add-Compare-Select
  • Retiming removes subtractor from the critical
    path
  • Critical path is shorter!

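[Figure: transformation to compare-select-add. Standard add-compare-select: state metrics sm from step n-1 for states i and j, branch metrics bm(i,k) and bm(j,k), a select signal, and a MUX producing the state metric for state k at step n. Retimed compare-select-add: path metrics p from step n-1 for states i and j are compared and selected, then branch metrics bm(k,l) and bm(k,m) are added to give the path metrics for step n.]
Retimed Compare-Select-Add: area doubles, subtractor not in critical path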
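The retiming on this slide moves the addition of branch metrics from the start of the recursion to the end, so each stored value already includes the next branch metric. The sketch below checks that both orderings give the same state metrics. It is a behavioural model only: the small fully connected trellis, the data layout `bm[n][i][k]`, and the function names are assumptions for illustration, and gate-level timing is not modelled.

```python
def acs(sm0, bm):
    """Add-compare-select: bm[n][i][k] is the branch metric from state i to k."""
    sm = list(sm0)
    states = range(len(sm0))
    history = []
    for step in bm:
        # add the branch metrics, then compare and select the minimum
        sm = [min(sm[i] + step[i][k] for i in states) for k in states]
        history.append(sm)
    return history

def csa(sm0, bm):
    """Compare-select-add: stored path metrics already include the next branch."""
    states = range(len(sm0))
    # Pre-add the first branch metrics once, outside the recursion
    p = [[sm0[i] + bm[0][i][k] for k in states] for i in states]
    history = []
    for n in range(len(bm)):
        # compare-select on the pre-added path metrics
        sm = [min(p[i][k] for i in states) for k in states]
        history.append(sm)
        if n + 1 < len(bm):
            # add: one adder per outgoing branch, hence the extra area
            p = [[sm[k] + bm[n + 1][k][l] for l in states] for k in states]
    return history

branch_metrics = [
    [[1, 3], [2, 0]],
    [[0, 2], [4, 1]],
    [[3, 1], [1, 2]],
]
print(acs([0, 0], branch_metrics) == csa([0, 0], branch_metrics))
```

In hardware the additions for all outgoing branches can run alongside the comparison, which is why the stored quantity changes from one state metric per state to one path metric per branch.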
20
2. Reducing Timing Overhead
  • Timing overhead's proportion of total delay
    varies
  • Custom is 1.4× faster than typical ASIC
  • Better clock skew: ×1.10
  • Fast latches and flip-flops: ×1.10
  • Can include logic in a custom latch
  • Overlapping clock phases to eliminate some
    latches
  • ASICs can reduce skew and use faster memory
    elements
  • Typical ASIC has timing overhead of 1.0 ns
  • SP4140 has timing overhead of less than 0.5 ns
  • Speed up of 25% from 400 MHz to 500 MHz!
  • ×1.15 vs. good ASIC
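The 400 MHz to 500 MHz step follows directly from the overhead figures on this slide. This is a minimal sketch, assuming the logic delay stays fixed and that the 400 MHz starting point carried the typical 1.0 ns overhead; the function names are hypothetical.

```python
def to_period_ns(f_mhz: float) -> float:
    return 1000.0 / f_mhz

def to_freq_mhz(t_ns: float) -> float:
    return 1000.0 / t_ns

def frequency_after_overhead_cut(f_mhz: float, old_overhead_ns: float,
                                 new_overhead_ns: float) -> float:
    """Keep the logic delay fixed and swap one timing overhead for another."""
    logic_ns = to_period_ns(f_mhz) - old_overhead_ns
    return to_freq_mhz(logic_ns + new_overhead_ns)

# 400 MHz is a 2.5 ns cycle; with 1.0 ns of overhead, 1.5 ns is left for logic.
# Cutting the overhead to 0.5 ns gives a 2.0 ns cycle.
f_new = frequency_after_overhead_cut(400.0, old_overhead_ns=1.0, new_overhead_ns=0.5)
print(f"{f_new:.0f} MHz, speed up of {f_new / 400.0 - 1:.0%}")
```

With an overhead of less than 0.5 ns the result would be slightly above 500 MHz, so the 25% figure is the conservative end of the estimate.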
21
Use Latches
  • Cycle time T with flip-flops
  • Latch-based designs are faster because
  • Latch is transparent for half of the cycle
  • Cycle time not affected by
  • clock skew
  • setup time
  • If input arrives after latch is transparent, and
    before the latch closes
  • Minimum time between clock edges is
  • When latch is transparent, delay from
    input arrival to output changing is

[Figure: clock driving latches L1 and L2, with combinational logic 1 after L1 and combinational logic 2 after L2]
22
Use Latches
  • Latches are faster than flip-flops because
  • Pipeline stages don't have to be well-balanced
  • Slack passing and time borrowing between stages

[Figure: latches L1 and L2 with combinational logic 1 and combinational logic 2, and a timing diagram of the clock (H/L), D1, Q1, D2, Q2 showing the latch D-Q delays, the delays of comb1 and comb2, and the slack passed]
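The benefit of not needing balanced stages can be illustrated with an idealized model. With flip-flops, the slowest stage sets the clock period. With latches, a slow stage can borrow time from a fast neighbour, so in the ideal case only the average stage delay matters, plus two latch D-to-Q delays per cycle. The formulas, function names, and all delay values below are assumptions for illustration; the model ignores hold constraints and only checks that the borrowed time fits in the half cycle during which a latch is transparent.

```python
def ff_period(stage_delays_ns, t_clk_q, t_setup, t_skew):
    """Flip-flops: the slowest stage plus full timing overhead sets the period."""
    return max(stage_delays_ns) + t_clk_q + t_setup + t_skew

def latch_period_ideal(stage_delays_ns, t_dq):
    """Latches with ideal time borrowing: average stage delay plus two D-Q delays."""
    return sum(stage_delays_ns) / len(stage_delays_ns) + 2 * t_dq

def max_borrow(stage_delays_ns, t_dq, period_ns):
    """Time the slowest stage takes beyond one period (must fit in half a cycle)."""
    return max(0.0, max(stage_delays_ns) + 2 * t_dq - period_ns)

stages = [2.0, 1.0]  # hypothetical unbalanced stage delays in ns
t_ff = ff_period(stages, t_clk_q=0.2, t_setup=0.1, t_skew=0.2)
t_latch = latch_period_ideal(stages, t_dq=0.15)
borrowed = max_borrow(stages, t_dq=0.15, period_ns=t_latch)

print(f"flip-flops: {t_ff:.2f} ns, latches: {t_latch:.2f} ns")
print(f"borrowed {borrowed:.2f} ns, window {t_latch / 2:.2f} ns")
```

In this example the slow stage borrows less than half a cycle from the fast one, so the idealized latch period is achievable under the model's assumptions.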
23
Reduce flip-flop setup time and clock-to-Q delay
  • Sometimes have to use flip-flops
  • Single cycle recursion in Viterbi decoder
  • No time borrowing!
  • Fast flip-flops: pulsed latches
  • Hybrid-latch flip-flop
  • Sense-amplifier flip-flop
  • Custom cell characterized for standard cell
    synthesis of SP4140
  • Characteristics
  • Smaller setup time and clock-to-Q delay
  • First stage generates a pulse
  • Second stage captures the pulse
  • Clock skew tolerance
24
Hybrid-Latch Flip-flop
  • When D transitions low
  • X, Clk, and the delayed inverted Clk all high
  • Causes Q to transition low
  • When D transitions high
  • Low pulse on X
  • Causes Q to transition high
  • Otherwise Q is held by the cross-coupled inverters

[Figure: hybrid-latch flip-flop schematic with a pulse generator stage producing internal node X, followed by a true single-phase clock latch; signals D, Clk, Q, and Vdd]
25
Sense-Amplifier Flip-flop (SAFF)
  • Sense amplifier amplifies the difference between
    D and its complement
  • After the clock goes high, it pulls S or R low
  • Set-reset latch captures the pulse
  • Sized for typical load
  • Characterized for use in ASIC flow

[Figure: SAFF schematic with a sense amplifier stage (inputs D and its complement, Clk, Vdd) driving S and R into a set-reset latch that produces Q and its complement]
26
Partitioning and Clock Tree Design
  • Timing critical blocks are 10,000 to 30,000 gates
  • Layout areas of 1 to 2 mm^2; small size helps
    synthesis converge
  • Blocks have local gated clock trees
  • Clock distribution over a smaller area reduces
    clock skew
  • Fixed fanout at each clock tree level
  • Insert dummies to match the insertion delays
  • Local trees merged into global tree with added
    clock gating
  • Poor ASIC skew is 500 ps or more
  • Good ASIC skew is 200 ps
  • TI SP4140 clock skew of 60 ps

[Figure: clock tree, with each local branch fanning out to about 100 clocked elements]
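The skew figures on this slide are easier to compare as a share of the clock period. This is a minimal sketch that evaluates all three at 550 MHz; the choice of 550 MHz as the common reference and the function name `skew_fraction` are assumptions for illustration.

```python
SKEW_PS = {"poor ASIC": 500, "good ASIC": 200, "TI SP4140": 60}

def skew_fraction(skew_ps: float, clock_mhz: float) -> float:
    """Fraction of one clock period consumed by clock skew."""
    period_ps = 1e6 / clock_mhz
    return skew_ps / period_ps

for name, skew in SKEW_PS.items():
    print(f"{name}: {skew_fraction(skew, 550):.1%} of a 550 MHz cycle")
```

At this clock rate, poor ASIC skew would consume more than a quarter of the cycle, which is why small blocks with local clock trees matter.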
27
3. Process Variation and Accessibility
  • ASIC libraries calculate worst case speeds for
    process
  • Achieve a good yield by better knowledge of
    process variation
  • SP4140 has a voltage regulator on chip, reducing
    supply voltage variation
  • Speed up of 10% from 500 MHz to 550 MHz!
  • Speeds off a line estimated to vary by 20% to 40%
  • Custom designs can speed bin the chips

[Figure: distribution of chips produced vs. speed, marking ASIC worst case, ASIC with good yield (×1.1), and the fast chips (×1.2 to ×1.4), with the rest slower]
28
ASIC vs. custom speed, area for ACS
  • ASIC exploration
  • ACS recursion as fast as 2.2 ns
  • CSA recursion as fast as 1.6 ns
  • Area increases 2.5×
  • Synthesis increases area for speed
  • Custom CSA
  • Half the area at same speed
  • 20% faster at same area

[Figure: area (0 to 20000) vs. clock period (1.0 to 3.5 ns) for ACS, CSA, and custom CSA]
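Converting the two recursion times on this slide to clock frequencies shows why the retimed structure was needed. This is a minimal sketch; the 550 MHz target comes from the Viterbi decoder's clock rate, and the function names are hypothetical.

```python
TARGET_MHZ = 550.0

def max_freq_mhz(period_ns: float) -> float:
    """Highest clock frequency a recursion of the given delay can support."""
    return 1000.0 / period_ns

def meets_target(period_ns: float, target_mhz: float = TARGET_MHZ) -> bool:
    return max_freq_mhz(period_ns) >= target_mhz

for name, period in [("ACS", 2.2), ("CSA", 1.6)]:
    verdict = "meets" if meets_target(period) else "misses"
    print(f"{name}: {max_freq_mhz(period):.0f} MHz, {verdict} {TARGET_MHZ:.0f} MHz")

print(f"CSA is {2.2 / 1.6:.3f}x faster than ACS")
```

A 550 MHz clock allows about 1.82 ns per cycle, so the 2.2 ns ACS recursion falls short while the 1.6 ns CSA recursion fits, at the cost of the 2.5× area increase.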
29
Summary
We've quantified the speed differences between
ASIC and custom designs. We've told you how to
improve ASIC speeds.
  • Good ASIC speeds of 320 MHz
  • SP4140
  • Architectural transformations to get to 400 MHz
  • Reducing timing overhead to get to 500 MHz
  • Reduce process variation to get to 550 MHz
  • Gap closed from 5× to 3×
  • Area and power gap of about 2×
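The steps in this summary can be checked against each other with simple ratios. The sketch below assumes the gap is measured against the 1700 MHz Pentium 4 quoted earlier in the talk; the variable and function names are hypothetical.

```python
STEPS_MHZ = [
    ("good ASIC", 320),
    ("architectural transformations", 400),
    ("reduced timing overhead", 500),
    ("reduced process variation", 550),
]
CUSTOM_MHZ = 1700  # Pentium 4 clock, assumed to be the reference for the gap

def step_speedups(steps):
    """Speed-up of each step relative to the one before it."""
    return [(name, cur / prev) for (_, prev), (name, cur) in zip(steps, steps[1:])]

for name, speedup in step_speedups(STEPS_MHZ):
    print(f"{name}: x{speedup:.2f}")

gap_before = CUSTOM_MHZ / STEPS_MHZ[0][1]
gap_after = CUSTOM_MHZ / STEPS_MHZ[-1][1]
print(f"gap: {gap_before:.1f}x before, {gap_after:.1f}x after")
```

The three steps multiply to an overall speed-up of about 1.7×, which is what brings the gap from roughly 5× down to roughly 3×.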
30
Future Work
  • Quantify impact with respect to power and area
  • How do we improve ASIC area and power?
  • Does it help to extend standard cell library with
    more sizes?
  • What is the design time cost of including custom
    cells characterized for an ASIC flow?
  • What about bit slice tiling and custom placement
    versus automated place and route?
  • Look for more ASIC-oriented techniques for
    closing speed gap with minimal power and area
    increase
