Design Methodology for Semi Custom Processor Cores

Transcript and Presenter's Notes

1
Design Methodology for Semi Custom Processor Cores

Victor Zyuban
Sameh Asaad
Thomas Fox
Anne-Marie Haen
Daniel Littrell
Jaime Moreno
IBM T.J. Watson Research Center, Yorktown Heights, NY

2
Introduction

We describe the methodology used in the implementation of a DSP core whose requirements did not allow for a typical soft-core or hard-core approach
500 MHz worst case (WC), 350 mW at 1.5 V, 105 °C, 0.13 µm foundry technology
Objectives:
Exceed the performance and power characteristics of designs built with a standard ASIC flow (a typical ASIC runs at 300 MHz in this technology) without compromising its productivity and generality
Enable integration of custom components
Enable application of power reduction techniques not provided by the ASIC flow, such as power gating, reverse bias, and data retention
Allow optimizations across design phases
Quick turn-around time, reproducible results

3
Methodology Overview
[Flow diagram; stage labels and annotations as recovered from the slide]
ISA/uA
VHDL: define hierarchy, clock gating, latch grouping, instantiate custom components
Hiasynth/Booledozer: define assertions and synthesis directives, logic synthesis, optimization with pseudo-latches
Clock splitter insertion, scan insertion, hierarchical Verilog, porting design to Cadence
PD: pre-placement / pre-routing, place, route, extract, timing / clock skew / scan order, power analysis
Feedback annotations: modify Arch/uA (change latencies, redefine resource usage); re-arrange logic, re-group latches; adjust synthesis constraints; scan clock; rewire scan

4
Overview of main design techniques

Hierarchical VHDL and synthesis; pre-placement of components
Grouping of latches for clock splitters in VHDL; pre-placement of latches and clock splitters
Enforcing bit ordering in the datapath (bit stack seeding)
Instantiation of decoupling buffers in VHDL; pre-placement of decoupling buffers
Pre-routing of the clock grid and the power-ground grid

5
Hierarchical Synthesis and Pre-placement of Components: methodology

Every unit is broken up into components (a few thousand gates each)
Components are synthesized independently
The layout of the unit is organized into a set of overlapping boxes; gates constituting components are assigned to the appropriate boxes, leaving sufficient flexibility for the place and route tools
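
The box assignment described above can be illustrated with a small Python sketch. It models pre-placement boxes as rectangles and maps each component's gates to that component's box. The `Box` class, the gate names and the coordinates are hypothetical; the slides do not show the actual layout-tool constraints used in the flow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned pre-placement region (hypothetical layout units)."""
    name: str
    x0: int
    y0: int
    x1: int
    y1: int

    def overlaps(self, other: "Box") -> bool:
        # Two rectangles overlap when they intersect on both axes.
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)


def assign_gates(components, boxes):
    """Map every gate to the box of the component it belongs to."""
    assignment = {}
    for comp, gates in components.items():
        box = boxes[comp]  # raises KeyError if a component has no box
        for gate in gates:
            assignment[gate] = box.name
    return assignment


# Hypothetical unit with two components. The boxes overlap on purpose so
# the place and route tools keep some freedom near the shared boundary.
boxes = {
    "FU1": Box("box_FU1", 0, 0, 60, 40),
    "FU2": Box("box_FU2", 50, 0, 110, 40),
}
components = {"FU1": ["fu1_g0", "fu1_g1"], "FU2": ["fu2_g0"]}
placement = assign_gates(components, boxes)
```

The gates are only bound to a region, not to exact coordinates, which is one way to read "leaving sufficient flexibility for the place and route tools".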

[Figure: components FU1 to FU5 and dust logic, shown in the VHDL entry and in the layout]
6
Hierarchical Synthesis and Pre-placement of Components: benefits

Different components get the best power/performance/area characteristics when synthesized with different directives
Gates inside components are sized for smaller area; only gates constituting dust use high-power books
Most of the wires are restricted within smaller areas and are therefore short
Most of the gates in the design use low-power books
Both area and power are saved

[Figure labels: slice 0, slice 1, slice 2, slice 3, control slice, pointer update unit (3), bit reverse unit (0)]
7
Latch grouping: fine-grain clock gating
[Figure: VHDL entry versus post-synthesis processing. Labels: single or multi-bit behavioral latches; Gate1, Gate2; cclk; C; grid clock; clk1, clk2. Caption: CG-OR instances define gated latch groups]

Gating is controlled at the granularity of a latch group driven by the same splitter
Performs early (L1 and L2) gating, similar to the ASIC Clock-OR methodology
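
As a minimal sketch of the grouping idea, the Python below collects latches that share a gating signal into groups, one group per splitter. The latch names and the `max_per_splitter` limit are assumptions for illustration; the slides name only the gating signals Gate1 and Gate2 and do not state a splitter fan-out limit.

```python
from collections import defaultdict


def group_by_gate(latches, max_per_splitter=16):
    """Return (gate, [latch names]) groups, one group per splitter.

    latches: iterable of (latch_name, gate_signal) pairs.
    max_per_splitter: hypothetical fan-out limit of one splitter.
    """
    by_gate = defaultdict(list)
    for name, gate in latches:
        by_gate[gate].append(name)

    groups = []
    for gate, names in by_gate.items():
        # A gate with many latches is spread over several splitters,
        # all driven by the same gating condition.
        for i in range(0, len(names), max_per_splitter):
            groups.append((gate, names[i:i + max_per_splitter]))
    return groups


latches = [
    ("acc[0]", "Gate1"),
    ("acc[1]", "Gate1"),
    ("acc[2]", "Gate1"),
    ("ptr[0]", "Gate2"),
]
groups = group_by_gate(latches, max_per_splitter=2)
```

With a limit of two latches per splitter, the three Gate1 latches end up in two groups and the Gate2 latch in a third.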

8
Latch Grouping without clock gating
[Figure: VHDL entry versus post-synthesis processing. Labels: single or multi-bit behavioral latches; grid clock; clk1, clk2. Caption: buffer instances with special names define clock/latch groups]

The designer controls latch grouping by inserting special placeholder buffers to form latch groups
A post-synthesis script replaces the buffers with splitters and the behavioral latches with LSSD L1/L2 latches
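
The post-synthesis replacement step might look roughly like the following Python sketch, which works on a toy netlist. The `CLKGRP_` name prefix and the cell names `BUF`, `BEH_LATCH`, `CLK_SPLITTER` and `LSSD_L1L2` are all hypothetical; the slides say only that buffers with special names are replaced by splitters and behavioral latches by LSSD L1/L2 latches.

```python
def post_synthesis_rewrite(netlist, buffer_prefix="CLKGRP_"):
    """Return a new netlist with placeholder cells swapped for real ones.

    netlist: list of dicts with at least "name" and "cell" keys.
    buffer_prefix: hypothetical naming convention for placeholder buffers.
    """
    rewritten = []
    for inst in netlist:
        new = dict(inst)  # copy so the input netlist is left untouched
        if inst["cell"] == "BUF" and inst["name"].startswith(buffer_prefix):
            new["cell"] = "CLK_SPLITTER"  # placeholder buffer -> splitter
        elif inst["cell"] == "BEH_LATCH":
            new["cell"] = "LSSD_L1L2"  # behavioral latch -> LSSD latch pair
        rewritten.append(new)
    return rewritten


netlist = [
    {"name": "CLKGRP_a", "cell": "BUF", "out": "clk1"},
    {"name": "buf_data", "cell": "BUF", "out": "n7"},
    {"name": "r0", "cell": "BEH_LATCH", "clk": "clk1"},
]
result = post_synthesis_rewrite(netlist)
```

Ordinary data buffers such as `buf_data` are left alone because only the name prefix marks a buffer as a clock-group placeholder.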

9
Pre-placement of latches and clock splitters

Clock wires are short, resulting in power reduction
The length of clock wires is under control, resulting in small clock skew, higher frequency and faster convergence on timing
Bit-precise placement of dataflow latches enforces bit ordering in the datapath, resulting in improved routability and savings in power and area
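
Bit-precise placement can be pictured with a short Python sketch that seeds one latch per bit at a fixed pitch, so that bit order in the register matches physical order in the layout. The register name, the coordinates and the pitch are hypothetical values chosen for illustration.

```python
def bit_stack_placement(reg_name, width, x, y0, bit_pitch):
    """Place bit i of a register at (x, y0 + i * bit_pitch)."""
    return {f"{reg_name}[{i}]": (x, y0 + i * bit_pitch) for i in range(width)}


def is_bit_ordered(placement, reg_name, width):
    """Check that the y coordinate grows strictly with the bit index."""
    ys = [placement[f"{reg_name}[{i}]"][1] for i in range(width)]
    return all(a < b for a, b in zip(ys, ys[1:]))


# Hypothetical 40-bit dataflow latch, one bit per row, rows 4 units apart.
placement = bit_stack_placement("acc", width=40, x=100, y0=0, bit_pitch=4)
ordered = is_bit_ordered(placement, "acc", 40)
```

Because every bit has a known row, logic that follows the latch can be placed in the same row, which is one plausible reason for the routability gain mentioned above.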

[Figure labels: clock distribution; latches; clock splitters; clock, no preplacement]
10
Instantiation and pre-placement of decoupling buffers: methodology

Used when a long wire or a non-critical block of logic needs to be decoupled from the critical path
Decoupling buffers are instantiated in VHDL and preserved in the synthesis and post-synthesis steps
Overlapping pre-placement boxes are created in the layout, and decoupling buffers are assigned to the appropriate boxes

[Figure: VHDL entry (case 1), VHDL entry (case 2) and layout. Labels: latch, decoupling buffer, FU1, FU2]
11
Instantiation and pre-placement of decoupling buffers: benefits

The power level of the decoupling buffers is precisely controlled, without impacting the gates constituting the components (FUs)
Allows keeping the power level of most books inside the unit small, using high-power books only where they need to drive long wires or high fan-out (FO)
Decoupling high-capacitance nodes from critical paths improves speed
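
The speed benefit can be illustrated with a first-order RC estimate in Python. This is a back-of-the-envelope sketch, not the timing model used by the authors: the 0.69 * R * C delay formula is the standard 50% point of a single-pole RC response, and every resistance and capacitance value below is invented for illustration.

```python
def stage_delay_ps(r_drive_ohm, c_load_ff):
    """First-order RC delay estimate in picoseconds.

    0.69 * R * C is the 50% point of a single-pole RC step response.
    ohm * fF gives 1e-15 s, which is 1e-3 ps.
    """
    return 0.69 * r_drive_ohm * c_load_ff * 1e-3


# Hypothetical values, not taken from the slides.
R_DRIVE = 2000.0      # output resistance of the critical-path gate, ohm
C_CRITICAL = 10.0     # load of the critical sink, fF
C_LONG_WIRE = 100.0   # long non-critical wire on the same net, fF
C_BUFFER_IN = 3.0     # input capacitance of a decoupling buffer, fF

# Without decoupling, the critical gate drives the long wire directly.
delay_without = stage_delay_ps(R_DRIVE, C_CRITICAL + C_LONG_WIRE)

# With decoupling, the critical gate sees only the buffer input, and the
# buffer (off the critical path) drives the long wire.
delay_with = stage_delay_ps(R_DRIVE, C_CRITICAL + C_BUFFER_IN)
```

In this toy model the critical gate's load drops from 110 fF to 13 fF, so it can stay a small low-power book while only the buffer needs to be a high-power one.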

[Figure labels: decoupling buffers, 40-bit latch, output wires]
12
Core assembly and timing methodology overview: 1) generation of abstracts, unit level
Unit layout
Chipbench
extraction of global wiring (pd file)
EinsTimer
Unit layout abstract
Unit timing abstract
13
Core assembly and timing methodology overview: 2) final step, core level
top schematic
Generate Physical Hierarchy
Unit layout abstract
Unit timing abstract
top floorplan
Placement (Cadence Preview, SKILL scripts)
Routing (CCAR)
top routed floorplan
Chipbench
extraction of global wiring
EinsTimer
14
Placed and routed eLite core
[Figure: core floorplan. Labeled blocks: custom instruction memory 32 kB; custom vector register file, 256 x 16 bit, 8 read / 4 write; VPU; DEC; AU; IU; BU; BIU; X bus control (two labels); CR; 16-bit and 40-bit labels (four pairs); vector control; slice 0 to slice 3; custom data memory 32 kB; reduct unit; VMU; SD bus control]
15
Conclusion

Significant speed improvement compared with the standard ASIC flow (critical path reduced from 3 ns to 2 ns in some units)
Significant area reduction (> 30%) due to dominant usage of low-power cells
Significant power reduction (in the range of 50%)
Careful pre-placement of clock splitters and clock gating circuitry allows more time for calculating the clock gating conditions
Increased from 0.1 ns to 0.6 ns for the 500 MHz WC design with highly efficient OR-style (early) clock gating, allowing clock gating of 90% of eligible latches
Generic VHDL: easy to maintain, port and simulate
Short time from VHDL to layout, fast turn-around time to close on timing, with consistent convergence
Up to 3 VHDL-to-layout iterations per unit per day by 2 to 3 designers
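
The timing figures quoted in the conclusion can be related to each other with a few lines of Python arithmetic. The input numbers (500 MHz, 3 ns, 2 ns, 0.1 ns, 0.6 ns) come from the slides; expressing the gating window as a fraction of the clock cycle is an interpretation added here, and the helper names are hypothetical.

```python
def to_period_ns(f_mhz):
    """Clock period in ns for a frequency in MHz."""
    return 1000.0 / f_mhz


def to_freq_mhz(t_ns):
    """Clock frequency in MHz for a period in ns."""
    return 1000.0 / t_ns


cycle_ns = to_period_ns(500)         # cycle time of the 500 MHz target
freq_at_3ns = to_freq_mhz(3.0)       # frequency a 3 ns critical path allows
path_speedup = 3.0 / 2.0             # critical path reduced from 3 ns to 2 ns

# Time available for computing the gating condition, as a cycle fraction.
gating_fraction_before = 0.1 / cycle_ns
gating_fraction_after = 0.6 / cycle_ns
```

A 2 ns critical path matches the 500 MHz target exactly, and the window for computing gating conditions grows from about 5% to about 30% of the cycle.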