Design of a 32x32 Variable-Packet-Size Buffered Crossbar Switch Chip

1
Design of a 32x32 Variable-Packet-Size Buffered
Crossbar Switch Chip

Master of Science Thesis
Dimitrios G. Simos

2
Outline

Introduction to the Variable-Packet-Size Buffered
Crossbar (VPS bufXbar) Architecture
Switch Organization
Logic Synthesis
Placement & Routing
Power Consumption
Conclusions, Lessons Learned

3
Introduction: Unbuffered (CIOQ) vs. Buffered
Crossbars (CICQ)

Buffered crossbar (CICQ):
Distributed scheduling
Fast/Efficient → no speedup required
New: Variable-size packets
→ no SAR (segmentation and reassembly) → no speedup required
→ No Output Queues

Unbuffered crossbar (CIOQ):
Central scheduling
Inefficient → speedup required
Fixed-size cells for synchronous operation
→ SAR → speedup required
→ Output Queues

4
Motivation/Contribution

How large should the crosspoint memories be?
Buffer Size = MaxPacketSize + RTT
E.g. 1500 + 500 ≈ 2 KBytes in our implementation
See Katevenis et al., "Variable Packet Size
Buffered Crossbar (CICQ) Switches", ICC 2004,
http://archvlsi.ics.forth.gr/bufXbar/
Until now, it was difficult to fit large amounts of
SRAM on-chip
Today, 2-4 MBytes can fit on-chip
VPS bufXbar simulations showed very nice results,
but
is it feasible to build an actual multi-port
(e.g. 32x32) chip core of this kind using a standard
ASIC flow?
If yes,
How difficult is it?
Area?
Power?
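
As a rough check of the sizing question on this slide, the sketch below reads the sizing rule as MaxPacketSize + RTT window and assumes one such buffer per crosspoint of a 32x32 crossbar. The constant and function names are illustrative and do not come from the thesis.

```python
# Sizing sketch; all names are illustrative assumptions, not from the thesis.
MAX_PACKET_BYTES = 1500   # maximum packet size (from the slide)
RTT_WINDOW_BYTES = 500    # RTT window in bytes (from the slide)
PORTS = 32                # 32x32 switch

def crosspoint_buffer_bytes(max_packet=MAX_PACKET_BYTES, rtt=RTT_WINDOW_BYTES):
    """Buffer size per crosspoint, read as MaxPacketSize + RTT window."""
    return max_packet + rtt

def total_memory_bytes(ports=PORTS):
    """Assumes one buffer per crosspoint, i.e. ports * ports buffers."""
    return ports * ports * crosspoint_buffer_bytes()

print(crosspoint_buffer_bytes())      # bytes per crosspoint
print(total_memory_bytes() / 1e6)     # MBytes for the whole crossbar
```

Under these assumptions the total comes to roughly 2 MBytes, which sits at the low end of the 2-4 MBytes quoted as fitting on-chip.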

5
Switch Organization: Block Level

Crosspoint Blocks (Crosspoints, XPs)
Packet enqueue/dequeue
Synchronization
Output Schedulers (OSs)
Flow selection
Packet transmission
Credit transmission
Credit Schedulers (CSs)
Accumulation & transmission
of credits to inputs
Line Cards (outside of switch)
Append destination bit-mask
Control 'sop' and 'eop' signals
Send packets to switch
Receive and handle credits
Packet Format
Size: 40-1500 Bytes
Packet size: multiple of 4 Bytes
2 Clock Domains

[Figure: Switch block-level organization]

6
Switch Organization: Crosspoints

2-port SRAM for data
Packet enqueue/dequeue to/from SRAM
Clock domain synchronization
1-bit 3-FF synchronizer for control signal
5 clock cycles latency
→ Cut-through

7
Switch Organization: Output Schedulers

Round-robin selection of next eligible flow
Eligible = contains a packet
Back-to-back transmission
Packet transmission to switch outputs
Credit signal transmission to line cards
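
The round-robin selection described on this slide can be sketched as follows. This is a behavioral illustration only, not the thesis RTL: `occupancy` (packets queued per crosspoint in the column) and `last_served` are hypothetical names introduced here.

```python
def next_eligible(occupancy, last_served):
    """Return the next flow after last_served that holds a packet, or None.

    occupancy[i] is the number of packets queued in crosspoint i of this
    output's column; a flow is eligible when that count is non-zero.
    """
    n = len(occupancy)
    for offset in range(1, n + 1):          # scan every flow once, wrapping
        i = (last_served + offset) % n
        if occupancy[i] > 0:
            return i
    return None                             # no crosspoint holds a packet

# Flows 1 and 3 hold packets; flow 1 was served last, so flow 3 is next.
print(next_eligible([0, 2, 0, 1], last_served=1))
```

Starting the scan one position after the last served flow is what keeps the selection fair: the flow that was just served is considered last.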

8
Switch Organization: Credit Schedulers

Credit pulse accumulation
Credit Semantics
Not "a new packet has just departed from the
crossbar",
but a QFC-like protocol for error tolerance:
credit equals the total number of transmitted
packets mod 2^k
Round-robin credit transmission to line cards
Only those that have changed since last time
If none has changed, send all in round-robin order
If a credit is lost, it is retransmitted in the
next round
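
To illustrate why a cumulative count tolerates lost credits, here is a minimal sketch. The counter width `K` is a made-up value (the slides do not state k), and the class and function names are hypothetical.

```python
K = 4                  # hypothetical counter width; the slides do not give k
MOD = 1 << K           # credits carry a count modulo 2**k

class CrosspointCounter:
    """Counts packets that have departed from one crosspoint, mod 2**k."""
    def __init__(self):
        self.departed = 0

    def packet_departed(self):
        self.departed = (self.departed + 1) % MOD

    def credit(self):
        # The credit is the cumulative count itself, not a one-shot pulse.
        return self.departed

def packets_acknowledged(saved, received):
    """Departures implied by a credit, given the last credit the receiver saw."""
    return (received - saved) % MOD

xp = CrosspointCounter()
saved = xp.credit()
for _ in range(3):
    xp.packet_departed()   # assume the first two credit messages were lost
print(packets_acknowledged(saved, xp.credit()))
```

Because each credit restates the running total, the next credit that does arrive covers every departure since the last one received. The sketch assumes fewer than 2^k departures happen between two received credits.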

[Figure: Credit Format]

9
Line Card Logic

Semi-behavioral
VOQ handling
Maintain packet-size FIFOs for each flow
Maintain free space of the row's XP MEMs
Packet transmission
Append destination bit-mask
Enqueue packet size to FIFO
Decrease corresponding XP MEM free space
Credit arrival and handling
Check if new credit differs from corresponding
saved one
If different, dequeue packet size & increase
corresponding XP MEM free space
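
The bookkeeping listed on this slide can be sketched per flow as below. This is an illustration under assumptions, not the thesis model: the class name, the 2000-byte buffer (1500 + 500 from the earlier slide), and the credit modulus of 16 are chosen here for the example, and a changed credit is taken to release as many packets as the cumulative count has advanced.

```python
from collections import deque

XP_MEM_BYTES = 2000    # assumed crosspoint buffer size (1500 + 500)
CREDIT_MOD = 16        # hypothetical 2**k; the slides do not give k

class FlowState:
    """Per-output state kept by one line card (illustrative names)."""
    def __init__(self):
        self.sizes = deque()            # packet-size FIFO for this flow
        self.free_space = XP_MEM_BYTES  # free bytes in the matching XP MEM
        self.saved_credit = 0           # last credit value seen

    def send(self, size):
        """Try to send one packet; returns False if the XP MEM lacks space."""
        if not (40 <= size <= 1500 and size % 4 == 0):
            raise ValueError("size must be 40-1500 bytes and a multiple of 4")
        if size > self.free_space:
            return False
        self.sizes.append(size)         # enqueue packet size to FIFO
        self.free_space -= size         # decrease XP MEM free space
        return True

    def on_credit(self, credit):
        """Release space for each departure the new credit reports."""
        departed = (credit - self.saved_credit) % CREDIT_MOD
        for _ in range(departed):
            self.free_space += self.sizes.popleft()
        self.saved_credit = credit

flow = FlowState()
flow.send(1500)
flow.send(400)
flow.on_credit(1)          # first packet has left the crossbar
print(flow.free_space)
```

A repeated credit with an unchanged value leaves the state untouched, which matches the "check if different" step on the slide.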

10
Logic Synthesis

Tool: Synopsys Design Compiler
Design Hierarchy
Gate-Level Simulation
Gate Count, FF & Area Results

11
Logic Synthesis: Hierarchy

Flat synthesis impossible due to design size /
number of instances
Hierarchical approach

12
Logic Synthesis: Constraints & Gate-Level
Simulation

Constraints
Clock frequency: 300 MHz (max. memory freq.)
→ 9.6 Gbit/sec per port
Logic area: minimal
Circuit compilation
Gate-level netlist to be imported to P&R tool
Gate-level verification
Minimum-size (40 B) packets
Maximum-size (1500 B) packets
Randomly-selected-size packets
All inputs send to one/all output(s) with
load = 100%
In-order back-to-back packet transmission to
outputs
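
The per-port rate on this slide follows from the clock frequency if the datapath is 32 bits wide. That width is an inference (9.6 Gbit/s divided by 300 MHz, consistent with packet sizes being multiples of 4), not a figure stated in the slides. A small sketch with illustrative names:

```python
CLOCK_HZ = 300e6       # clock frequency from the slide
DATAPATH_BITS = 32     # assumption: inferred from 9.6 Gbit/s at 300 MHz
PORTS = 32

def port_throughput_gbps(clock_hz=CLOCK_HZ, bits=DATAPATH_BITS):
    """Line rate of one port in Gbit/s: one datapath word per clock cycle."""
    return clock_hz * bits / 1e9

def cycles_per_packet(size_bytes, bits=DATAPATH_BITS):
    """Clock cycles needed to move one packet across the datapath."""
    return size_bytes * 8 // bits

print(port_throughput_gbps())             # per-port rate
print(port_throughput_gbps() * PORTS)     # aggregate over 32 ports
print(cycles_per_packet(40), cycles_per_packet(1500))
```

Under the same assumption, a minimum-size packet occupies a port for 10 cycles and a maximum-size packet for 375 cycles.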

13
Logic Synthesis: Gate Count, FF & Area Results

Logic area only 5%, rest is memory
Proof of feasibility: 140 mm² in 0.13 µm

14
Placement & Routing (P&R)

Tool: Cadence Encounter
Hierarchical approach
Gate Count, FF & Area Results
Power Consumption

15
Placement & Routing: Hierarchy

Flat P&R impossible
Design size
Fast results
Hierarchy decision
Easy to route at
top-level
Small/Medium hierarchy component size
Chip core aspect ratio (AR)
Three alternatives:
group by
squares
rows
columns
→ Per-column organization seemed best

16
Placement & Routing: Column Organization

Each column:
32 XPs, 1 OS
Best org. decision
All three tested
AR 1:3
Unroutable
Success, AR 1:2

17
Placement & Routing: Column Layout

Output Scheduler in bottom half of column
Buffered line in feedthroughs
Column optimization: output buffer insertion
Credit regs near credit scheduler inputs
Else top-level P&R impossible

18
Placement & Routing: Top-Level

Impossible to P&R 32 Columns and Credit Scheduler
Line at Top-Level
Hierarchy level addition: group 32 columns into 2
sets of 16
3 components at top-level

19
Placement & Routing: Chip Core Layout

Core Layout: 1.45 x 2.88 cm ≈ 420 mm² in 0.18 µm

20
Placement & Routing: Gate, FF & Area Results

Differ from synthesis
Area 40% larger due to wiring (90%), hierarchy
overhead (10%)
Logic gates 60% more due to opt. buffers, clock
tree
Memories 70%, Wiring 25%, Logic 5%
Still feasible: under 200 mm² in 0.13 µm, 110 mm² in
0.09 µm
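
The "under 200 mm²" figure can be cross-checked against the synthesis estimate. The sketch below assumes the 40% increase applies to the 140 mm² synthesis result in 0.13 µm, which is one reading of the slides rather than something they state outright. Names are illustrative.

```python
SYNTH_AREA_MM2 = 140.0   # synthesis estimate in 0.13 um (from the slides)
PR_OVERHEAD = 0.40       # post-layout area is 40% larger (from the slides)
SHARES = {"memories": 0.70, "wiring": 0.25, "logic": 0.05}

def post_layout_area(synth=SYNTH_AREA_MM2, overhead=PR_OVERHEAD):
    """Post-P&R area, assuming the overhead scales the synthesis estimate."""
    return synth * (1.0 + overhead)

def area_breakdown(total):
    """Split a total area by the post-layout shares on the slide."""
    return {name: total * share for name, share in SHARES.items()}

total = post_layout_area()
print(round(total, 1))                         # about 196 mm², under 200
print({k: round(v, 1) for k, v in area_breakdown(total).items()})
print(round(14.5 * 28.8, 1))                   # 0.18 um core layout in mm²
```

The last line multiplies out the 1.45 x 2.88 cm core dimensions, which give about 418 mm² and so agree with the rounded 420 mm² figure.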

21
Placement & Routing: Power Estimation

Typical case
32 (XPs, OSs, CSs)
Total Core Power: 5.75 W
62% Wiring/Buffering
17% Memory
21% Logic
3.2 W in 0.13 µm
SERDES Power: 23 W!
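
For a sense of scale, the percentages on this slide can be converted to watts. This is plain arithmetic on the slide's numbers; the dictionary and function names are illustrative.

```python
CORE_POWER_W = 5.75      # total core power (from the slide)
SERDES_POWER_W = 23.0    # SERDES power (from the slide)
SHARES_PERCENT = {"wiring/buffering": 62, "memory": 17, "logic": 21}

def power_breakdown(total_w=CORE_POWER_W):
    """Watts per category, from the percentage shares on the slide."""
    return {name: total_w * pct / 100 for name, pct in SHARES_PERCENT.items()}

for name, watts in power_breakdown().items():
    print(f"{name}: {watts:.2f} W")
print(f"SERDES / core: {SERDES_POWER_W / CORE_POWER_W:.1f}x")
```

The ratio in the last line shows that the SERDES estimate is four times the core power.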

[Figure: Chip core power consumption]

22
Conclusions, Lessons Learned

VPS bufXbar is good!
New idea: no SAR/speedup overhead, no output
queues
Feasibility proof: chip core almost 100 mm² in
90 nm
Synthesis area underestimation
30% erroneous area results when lots of wiring
Hierarchical design is challenging
Choose optimal organization
Produce fast results
Meet timing constraints
Hier. Placement & Routing
Time-consuming: 40% of thesis
Back-end tools still not perfect: human
intervention required
Power of long wires/drivers is large
50-60% in large, hierarchical cores
