Title: Design Methodology for Semi Custom Processor Cores
1. Design Methodology for Semi Custom Processor Cores
- Victor Zyuban
- Sameh Asaad
- Thomas Fox
- Anne-Marie Haen
- Daniel Littrell
- Jaime Moreno
- IBM T.J. Watson Research Center, Yorktown Heights, NY
2. Introduction
- We describe the methodology used to implement a DSP core whose requirements did not allow for a typical soft-core or hard-core approach
  - 500 MHz worst case (WC), 350 mW at 1.5 V, 105 °C, 0.13 µm foundry technology
- Objectives
  - Exceed the performance and power characteristics of designs built using a standard ASIC flow (a typical ASIC runs at 300 MHz in this technology), without compromising its productivity and generality
  - Enable integration of custom components
  - Enable application of power reduction techniques not provided by the ASIC flow, such as power gating, reverse bias, and data retention
  - Allow optimizations across design phases
  - Quick turn-around time, reproducible results
3. Methodology Overview
[Flow diagram. Stages from top to bottom, with the feedback loop of each stage in parentheses:]
- ISA/uA (feedback: modify architecture/microarchitecture, i.e. change latencies, redefine resource usage)
- VHDL: define hierarchy, clock gating, latch grouping; instantiate custom components (feedback: re-arrange logic, re-group latches)
- Hiasynth/Booledozer: define assertions and synthesis directives, logic synthesis, optimization with pseudo-latches (feedback: adjust synthesis constraints)
- Clock splitter insertion, scan insertion, hierarchical Verilog, porting the design to Cadence; scan clock (feedback: rewire scan)
- PD: pre-placement / pre-routing, place, route, extract, timing / clock skew / scan order, power analysis
4. Overview of main design techniques
- Hierarchical VHDL and synthesis; pre-placement of components
- Grouping of latches for clock splitters in VHDL; pre-placement of latches and clock splitters
- Enforcing bit ordering in the datapath (bit stack seeding)
- Instantiation of decoupling buffers in VHDL; pre-placement of decoupling buffers
- Pre-routing clock grid and power-ground grid
5. Hierarchical Synthesis and Pre-placement of Components: methodology
- Every unit is broken up into components (a few thousand gates each)
- Components are synthesized independently
- Layout of the unit is organized into a set of overlapping boxes; gates constituting components are assigned to the appropriate boxes, leaving sufficient flexibility for the place and route tools
[Figure: VHDL entry with functional units FU1 to FU5 and "dust" logic, and the corresponding layout with one box per functional unit.]
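The box assignment described on this slide can be sketched as a small data structure. This is a minimal illustration only: the `Box` helper, the coordinates, and the reading of "dust" as leftover glue logic outside the components are assumptions, not details of the original flow. It shows how the gates of each component might be bound to overlapping regions while the place and route tools stay free inside each region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned pre-placement region (coordinates in illustrative units)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def overlaps(self, other: "Box") -> bool:
        # True when the two regions share a non-empty area.
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)


# Hypothetical floorplan: one box per component. "dust" is assumed here to be
# the glue logic outside the components, so its box spans the whole unit.
boxes = {
    "FU1": Box(0, 0, 40, 30),
    "FU2": Box(35, 0, 80, 30),   # overlaps FU1 by 5 units in x
    "FU3": Box(0, 25, 80, 60),   # overlaps FU1 and FU2 by 5 units in y
    "dust": Box(0, 0, 80, 60),
}


def assign_gates(gates_by_component):
    """Map every gate to the box of the component it belongs to."""
    return {gate: boxes[comp]
            for comp, gates in gates_by_component.items()
            for gate in gates}


def overlapping_pairs():
    """List every pair of boxes that share some area."""
    names = sorted(boxes)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if boxes[a].overlaps(boxes[b])]
```

A placement tool would then receive one region constraint per gate instead of fixed coordinates, which is what leaves it room to optimize within each box.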
6. Hierarchical Synthesis and Pre-placement of Components: benefits
- Different components get the best power/performance/area characteristics when synthesized with different directives
- Gates inside components are sized for smaller area; only gates constituting dust use high-power books
- Most of the wires are restricted to smaller areas and are therefore short
- Most of the gates in the design use low-power books
- Both area and power are saved
[Figure: layout showing slices 0 to 3, the control slice, the pointer update unit (3), and the bit reverse unit (0).]
7. Latch grouping: fine-grain clock gating
[Figure: VHDL entry (single or multi-bit behavioral latches clocked by clk1 and clk2; CG-OR instances with inputs Gate1 and Gate2 define gated latch groups) and post-synthesis processing (grid clock, Gate1 and Gate2, cclk, C).]
- Control granularity down to the latch group driven by the same splitter
- Performs early (L1 and L2) gating, similar to the ASIC Clock-OR methodology
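As a rough illustration of why OR-style gating is called "early", the sketch below models a generic OR gate on the clock. It is not the CG-OR circuit or the clock splitter from the slides; the signal polarity and the names `gated_clock` and `gate_off` are assumptions. In this generic model the gating signal may change only while the clock is high, so the gating condition has to be ready early in the cycle.

```python
def gated_clock(clk: int, gate_off: int) -> int:
    """Generic OR-style gating: while gate_off is 1 the output is held high."""
    return clk | gate_off


def count_falling_edges(samples):
    """Count 1 -> 0 transitions in a sampled waveform."""
    return sum(1 for prev, cur in zip(samples, samples[1:])
               if prev == 1 and cur == 0)


# Four clock cycles, sampled twice per cycle (high phase, then low phase).
clk      = [1, 0, 1, 0, 1, 0, 1, 0]
# Suppress cycles 2 and 3. gate_off changes only at samples where clk is 1,
# so the OR output cannot glitch.
gate_off = [0, 0, 1, 1, 1, 1, 0, 0]

cclk = [gated_clock(c, g) for c, g in zip(clk, gate_off)]
```

In this toy waveform the ungated clock has four falling edges and the gated clock has two, so the latches behind the gate see no clock activity during the two suppressed cycles.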
8. Latch Grouping without clock gating
[Figure: VHDL entry (single or multi-bit behavioral latches clocked by clk1 and clk2; buffer instances with special names define clock/latch groups) and post-synthesis processing (grid clock).]
- Designer controls latch grouping by inserting special placeholder buffers to form latch groups
- Post-synthesis script replaces buffers with splitters and behavioral latches with LSSD L1/L2 latches
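The post-synthesis replacement step can be sketched on a toy netlist. Everything concrete here is hypothetical: the cell names (`BUF`, `BEH_LATCH`, `CLK_SPLITTER`, `LSSD_L1L2`), the `clkgrp_*_buf` naming pattern, and the dictionary netlist format are stand-ins, since the slides do not give the actual library or naming convention.

```python
import re

# Toy netlist: one dict per instance. All names are illustrative assumptions.
netlist = [
    {"name": "clkgrp_a_buf", "cell": "BUF",       "out": "clk1"},
    {"name": "clkgrp_b_buf", "cell": "BUF",       "out": "clk2"},
    {"name": "r0",           "cell": "BEH_LATCH", "clk": "clk1"},
    {"name": "r1",           "cell": "BEH_LATCH", "clk": "clk1"},
    {"name": "r2",           "cell": "BEH_LATCH", "clk": "clk2"},
    {"name": "u7",           "cell": "NAND2"},
]

# Assumed "special name" pattern that marks a placeholder buffer.
GROUP_BUFFER = re.compile(r"^clkgrp_\w+_buf$")


def post_synthesis(instances):
    """Return a new netlist with splitters and LSSD latches substituted."""
    result = []
    for inst in instances:
        new = dict(inst)  # copy, so the input netlist is left untouched
        if inst["cell"] == "BUF" and GROUP_BUFFER.match(inst["name"]):
            new["cell"] = "CLK_SPLITTER"
        elif inst["cell"] == "BEH_LATCH":
            new["cell"] = "LSSD_L1L2"
        result.append(new)
    return result


def latch_groups(instances):
    """Group LSSD latch names by the clock net that drives them."""
    groups = {}
    for inst in instances:
        if inst["cell"] == "LSSD_L1L2":
            groups.setdefault(inst["clk"], []).append(inst["name"])
    return groups


converted = post_synthesis(netlist)
```

Because the grouping is carried by the clock net each latch is attached to, the designer's choice of placeholder buffers in VHDL determines which latches end up behind which splitter.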
9. Pre-placement of latches and clock splitters
- Clock wires are short, resulting in power reduction
- Length of clock wires is under control, resulting in small clock skew, higher frequency, and faster convergence on timing
- Bit-precise placement of dataflow latches enforces bit ordering in the datapath, resulting in improved routability and savings in power and area
[Figure: layout views showing the clock distribution, latches, and clock splitters with pre-placement, compared with the clock without pre-placement.]
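Bit-precise latch placement can be illustrated with a short sketch that seeds a bit stack and puts one splitter next to each latch group. The bit pitch, column position, group size, and the one-unit splitter offset are made-up values, and the function names are hypothetical; the point is only that the clock wire length becomes a predictable function of the seeding.

```python
def place_bit_stack(width, bit_pitch, x_column, group_size):
    """Place latch bit i on row i; place one splitter per group of bits."""
    latches = {i: (x_column, i * bit_pitch) for i in range(width)}
    splitters = {}
    for start in range(0, width, group_size):
        bits = range(start, min(start + group_size, width))
        # Splitter sits one unit left of the column, at the group's mid-height.
        y_mid = sum(latches[b][1] for b in bits) / len(bits)
        splitters[start // group_size] = (x_column - 1.0, y_mid)
    return latches, splitters


def clock_wire_length(latches, splitters, group_size):
    """Sum of Manhattan distances from each latch to its group's splitter."""
    total = 0.0
    for bit, (x, y) in latches.items():
        sx, sy = splitters[bit // group_size]
        total += abs(x - sx) + abs(y - sy)
    return total


# A 40-bit dataflow latch split into five groups of eight bits.
latches, splitters = place_bit_stack(
    width=40, bit_pitch=2.0, x_column=10.0, group_size=8)
total_length = clock_wire_length(latches, splitters, group_size=8)
```

With every latch-to-splitter distance fixed before routing, the clock wire length and therefore the skew within a group are known up front rather than left to the placer.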
10. Instantiation and pre-placement of decoupling buffers: methodology
- Used when a long wire or a non-critical block of logic needs to be decoupled from the critical path
- Decoupling buffers are instantiated in VHDL and preserved in the synthesis and post-synthesis steps
- Overlapping pre-placement boxes are created in the layout; decoupling buffers are assigned to the appropriate boxes
[Figure: VHDL entry (case 1 and case 2) showing latches, FU1, FU2, and a decoupling buffer, with the corresponding layout.]
11. Instantiation and pre-placement of decoupling buffers: benefits
- The power level of the decoupling buffers is precisely controlled, without impacting the gates constituting the components (FUs)
- Allows keeping the power level of most books inside the unit small, using high-power books only where they need to drive long wires or high fan-out (FO)
- Decoupling high-capacitance nodes from critical paths improves speed
[Figure: layout showing the decoupling buffers, a 40-bit latch, and the output wires.]
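The speed benefit of decoupling a high-capacitance node can be shown with a first-order RC estimate. The resistance and capacitance values below are invented for illustration and are not taken from the design; `stage_delay` is a hypothetical helper using the simple approximation delay = R x C.

```python
def stage_delay(r_driver_kohm, c_load_ff):
    """First-order RC delay in picoseconds (1 kOhm x 1 fF = 1 ps)."""
    return r_driver_kohm * c_load_ff


# Illustrative values only, not measurements from the core.
R_DRIVER = 2.0       # kOhm, low-power gate on the critical path
C_NEXT_GATE = 5.0    # fF, input of the next gate on the critical path
C_LONG_WIRE = 60.0   # fF, long wire to a non-critical consumer
C_BUFFER_IN = 4.0    # fF, input of the decoupling buffer

# Without decoupling, the critical-path driver also charges the long wire.
delay_direct = stage_delay(R_DRIVER, C_NEXT_GATE + C_LONG_WIRE)

# With a decoupling buffer, the driver sees only the buffer input; the
# buffer drives the long wire off the critical path.
delay_decoupled = stage_delay(R_DRIVER, C_NEXT_GATE + C_BUFFER_IN)
```

Under these assumed values the critical-path stage delay drops from 130 ps to 18 ps, and only the decoupling buffer, not the critical-path gate, needs a high power level to drive the wire.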
12. Core assembly and timing methodology overview: 1) generation of abstracts (unit level)
[Flow diagram: unit layout -> Chipbench (extraction of global wiring, pd file) -> EinsTimer; outputs are the unit layout abstract and the unit timing abstract.]
13. Core assembly and timing methodology overview: 2) final step (core level)
[Flow diagram: top schematic, unit layout abstracts, and unit timing abstracts -> generate physical hierarchy -> top floorplan -> placement (Cadence Preview, SKILL scripts) and routing (CCAR) -> top routed floorplan -> Chipbench (extraction of global wiring) -> EinsTimer.]
14. Placed and routed eLite core
[Figure: core floorplan. Labeled blocks: custom instruction memory (32 kB); custom data memory (32 kB); custom vector register file (256 x 16 bit, 8 read / 4 write); VPU with slices 0 to 3, marked with 16-bit and 40-bit sections; vector control; reduct unit; VMU; DEC; AU; IU; BU; BIU; CR; X bus control; SD bus control.]
15. Conclusion
- Significant
  - speed improvement compared with the standard ASIC flow (critical path reduced from 3 ns to 2 ns in some units)
  - area reduction (> 30%) due to dominant usage of low-power cells
  - power reduction (in the range of 50%)
- Careful pre-placement of clock splitters and clock gating circuitry allows more time for calculating the clock gating conditions
  - Increased from 0.1 to 0.6 ns for the 500 MHz WC design with highly efficient OR-style (early) clock gating, which allows clock gating of 90% of eligible latches
- Generic VHDL: easy to maintain, port, and simulate
- Short time from VHDL to layout, fast turn-around time to close on timing, with consistent convergence
  - up to 3 VHDL-to-layout iterations per unit per day by 2 to 3 designers
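The timing figures in this deck can be related with simple arithmetic. The sketch below only converts between frequency and cycle time and expresses the clock-gating time budget as a fraction of the 500 MHz cycle; the helper names are hypothetical and nothing here is a measured result beyond the numbers quoted on the slides.

```python
def period_ns(freq_mhz):
    """Cycle time in nanoseconds for a frequency in MHz."""
    return 1000.0 / freq_mhz


def freq_mhz(period):
    """Frequency in MHz for a cycle time in nanoseconds."""
    return 1000.0 / period


cycle = period_ns(500)            # target cycle time of the core
asic_cycle = period_ns(300)       # typical ASIC cycle in this technology

# Critical path quoted as reduced from 3 ns to 2 ns in some units.
freq_before = freq_mhz(3.0)
freq_after = freq_mhz(2.0)

# Time available to compute clock gating conditions, as a share of the cycle.
budget_before = 0.1 / cycle
budget_after = 0.6 / cycle
```

A 2 ns critical path matches the 500 MHz target, a 3 ns path corresponds to roughly 333 MHz, and the gating-condition budget grows from 5% to 30% of the cycle.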