Title: Design of a 32x32 Variable-Packet-Size Buffered Crossbar Switch Chip
1. Design of a 32x32 Variable-Packet-Size Buffered Crossbar Switch Chip
- Master of Science Thesis
- Dimitrios G. Simos
2. Outline
- Introduction to the Variable-Packet-Size Buffered Crossbar (VPS bufXbar) Architecture
- Switch Organization
- Logic Synthesis
- Placement & Routing
- Power Consumption
- Conclusions, Lessons Learned
3. Introduction: Unbuffered (CIOQ) vs. Buffered Crossbars (CICQ)
Buffered crossbar (CICQ):
- Distributed scheduling
- Fast/Efficient: no speedup required
- New: Variable-size packets
- → no SAR (segmentation and reassembly), no speedup required
- → No Output Queues
Unbuffered crossbar (CIOQ):
- Central scheduling
- Inefficient: speedup required
- Fixed-size cells for synchronous operation
- → SAR, speedup required
- → Output Queues
4. Motivation/Contribution
- How large crosspoint memories?
- Buffer Size = MaxPacketSize + RTT
- E.g. 1500 + 500 Bytes = 2 KBytes in our implementation
- See Katevenis et al., "Variable Packet Size Buffered Crossbar (CICQ) Switches", ICC 2004, http://archvlsi.ics.forth.gr/bufXbar/
- Until now, difficult to fit large amounts of SRAM on-chip
- Today, 2-4 MBytes can fit on-chip
- VPS bufXbar simulations showed very nice results, but
- is it feasible to build an actual multi-port (e.g. 32x32) chip core of this kind using a standard ASIC flow?
- If yes,
- How difficult is it?
- Area?
- Power?
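The sizing rule on this slide (crosspoint buffer = maximum packet size plus one RTT window) can be checked with a few lines of arithmetic. This is a minimal sketch using the slide's figures (1500 Bytes, 500 Bytes, 32x32 ports); the function names are illustrative and do not come from the thesis.

```python
def crosspoint_buffer_bytes(max_packet_bytes: int, rtt_window_bytes: int) -> int:
    """Crosspoint buffer size = maximum packet size + one RTT window of data."""
    return max_packet_bytes + rtt_window_bytes


def total_crosspoint_memory_bytes(ports: int, buffer_bytes: int) -> int:
    """An N x N buffered crossbar holds one buffer per crosspoint (N * N in total)."""
    return ports * ports * buffer_bytes


per_xp = crosspoint_buffer_bytes(1500, 500)
total = total_crosspoint_memory_bytes(32, per_xp)
print(per_xp)  # 2000
print(total)   # 2048000
```

The total of roughly 2 MBytes falls inside the 2-4 MBytes that the slide says can fit on-chip today.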
5. Switch Organization: Block Level
- Crosspoint Blocks (Crosspoints-XPs)
- Packet enqueue/dequeue
- Synchronization
- Output Schedulers (OSs)
- Flow selection,
- Packet transmission
- Credit transmission
- Credit Schedulers (CSs)
- Accumulation & transmission of credits to inputs
- Line Cards (outside of switch)
- Append destination bit-mask
- Control sop and eop signals
- Send packets to switch
- Receive and handle credits
- Packet Format
- Size: 40-1500 Bytes
- Packet size: multiple of 4 Bytes
- 2 Clock Domains
Switch block-level organization
6. Switch Organization: Crosspoints
- 2-port SRAM for data
- Packet enqueue/dequeue to/from SRAM
- Clock domain synchronization
- 1-bit 3-FF synchronizer for control signal
- 5 clock cycles latency
- → Cut-through
7. Switch Organization: Output Schedulers
- Round robin selection of next eligible flow
- Eligible: contains a packet
- Back-to-Back transmission
- Packet transmission to switch outputs
- Credit signal transmission to line cards
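The round-robin selection of the next eligible flow can be sketched as follows. This is only an illustration under simple assumptions: each crosspoint of the column is modeled by a packet count, "eligible" means the count is non-zero, and the names `next_eligible`, `queues` and `pointer` are hypothetical rather than taken from the design.

```python
from typing import List, Optional


def next_eligible(occupancy: List[int], last_served: int) -> Optional[int]:
    """Return the first crosspoint after last_served (wrapping around) that
    holds at least one packet, or None if every crosspoint is empty."""
    n = len(occupancy)
    for offset in range(1, n + 1):
        candidate = (last_served + offset) % n
        if occupancy[candidate] > 0:
            return candidate
    return None


# Hypothetical 4-input column: number of packets queued per crosspoint.
queues = [2, 0, 1, 0]
pointer = 3  # start so that input 0 is examined first
served = []
while (flow := next_eligible(queues, pointer)) is not None:
    queues[flow] -= 1      # transmit one packet from the selected crosspoint
    served.append(flow)
    pointer = flow         # the search resumes after the flow just served
print(served)  # [0, 2, 0]
```

Because the search always resumes after the flow just served, a busy input cannot starve the others, and packets go out back-to-back as long as any crosspoint is non-empty.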
8. Switch Organization: Credit Schedulers
- Credit pulse accumulation
- Credit Semantics
- not "a new packet has just departed from crossbar"
- but (QFC-like protocol, for error tolerance) "total number of transmitted packets mod 2^k equals ..."
- Round-robin credit transmission to line cards
- Only those that have changed since last time
- If none has changed, send all of them in round-robin order
- If a credit is lost, it is retransmitted in the next round
Credit Format
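A small model may help to see why absolute counters make the credit protocol tolerant to losses. The sketch below assumes one departure counter (mod 2^k) per crosspoint of an input row and a single credit sent per round; the class name, the method names and the value of `k` are hypothetical, and the real credit format is not reproduced here.

```python
from typing import List, Tuple


class CreditScheduler:
    """Sketch of a per-input credit scheduler with one counter per crosspoint."""

    def __init__(self, num_outputs: int, k: int) -> None:
        self.modulus = 1 << k
        self.counters: List[int] = [0] * num_outputs    # departures mod 2^k
        self.changed: List[bool] = [False] * num_outputs  # updated since last sent?
        self.pointer = 0                                # round-robin position

    def packet_departed(self, output: int) -> None:
        """Accumulate one credit pulse from the crosspoint towards `output`."""
        self.counters[output] = (self.counters[output] + 1) % self.modulus
        self.changed[output] = True

    def next_credit(self) -> Tuple[int, int]:
        """Choose the credit for this round: changed counters first,
        otherwise plain round robin over all counters."""
        n = len(self.counters)
        order = [(self.pointer + i) % n for i in range(n)]
        chosen = next((j for j in order if self.changed[j]), order[0])
        self.changed[chosen] = False
        self.pointer = (chosen + 1) % n
        return chosen, self.counters[chosen]


cs = CreditScheduler(num_outputs=4, k=2)
cs.packet_departed(2)
cs.packet_departed(2)
cs.packet_departed(0)
sent = [cs.next_credit() for _ in range(4)]
print(sent)  # [(0, 1), (2, 2), (3, 0), (0, 1)]
```

The last two credits are refreshes of unchanged counters. Since each credit carries the counter value rather than an increment, a refresh restores the receiver's view after an earlier credit was lost.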
9. Line Card Logic
- Semi-behavioral
- VOQ handling
- Maintain packet-size FIFOs for each flow
- Maintain row XP MEM free space
- Packet transmission
- Append destination bit-mask
- Enqueue packet size to FIFO
- Decrease corresponding XP MEM free space
- Credit arrival and handling
- Check if new credit differs from the corresponding saved one
- If different, dequeue packet size & increase corresponding XP MEM free space
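The per-flow bookkeeping on the line card can be sketched as below. Two points are assumptions rather than statements from the slides: the credit is treated as a departure counter mod 2^k, and the difference between the new and the saved credit is used to release several packets at once, which is one way to recover after a lost credit. It also assumes fewer than 2^k departures happen between two credits that the line card actually receives. All names are hypothetical.

```python
from collections import deque
from typing import Deque


class LineCardFlow:
    """State a line card keeps for one crosspoint (flow) of its row."""

    def __init__(self, buffer_bytes: int, k: int) -> None:
        self.modulus = 1 << k
        self.free_space = buffer_bytes      # believed free bytes in the XP memory
        self.sizes: Deque[int] = deque()    # sizes of packets sent, not yet credited
        self.saved_credit = 0               # last credit value seen for this flow

    def can_send(self, size: int) -> bool:
        """A packet may be sent only if the crosspoint memory has room for it."""
        return size <= self.free_space

    def send(self, size: int) -> None:
        """Record the packet size and decrease the free space."""
        self.sizes.append(size)
        self.free_space -= size

    def credit_arrived(self, credit: int) -> None:
        """Release one recorded packet size per departure implied by the credit."""
        departed = (credit - self.saved_credit) % self.modulus
        for _ in range(departed):
            self.free_space += self.sizes.popleft()
        self.saved_credit = credit


flow = LineCardFlow(buffer_bytes=2000, k=3)
for size in (1500, 40, 400):
    flow.send(size)
blocked = not flow.can_send(100)  # only 60 bytes are believed free
flow.credit_arrived(2)            # credit value 1 was lost; value 2 covers both departures
after_first = flow.free_space
flow.credit_arrived(2)            # unchanged credit (a refresh): nothing to do
flow.credit_arrived(3)
print(blocked, after_first, flow.free_space)  # True 1600 2000
```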
10. Logic Synthesis
- Tool: Synopsys Design Compiler
- Design Hierarchy
- Gate-Level Simulation
- Gate Count, FF & Area Results
11. Logic Synthesis: Hierarchy
- Flat synthesis impossible due to design size/instance number
- Hierarchical approach
12. Logic Synthesis: Constraints & Gate-Level Simulation
- Constraints
- Clock frequency: 300 MHz (max. memory freq.)
- → 9.6 Gbit/sec per port
- Logic area: minimal
- Circuit compilation
- Gate-level netlist to be imported to P&R tool
- Gate-Level verification
- Minimum-size (40 B) packets
- Maximum-size (1500 B) packets
- Randomly-selected-size packets
- All inputs send to one/all output(s) with load = 100%
- In-order back-to-back packet transmission to outputs
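The per-port rate follows from the clock constraint once a datapath width is fixed. The sketch below assumes a 32-bit datapath; that width is not stated on the slide and is inferred here from 9.6 Gbit/s at 300 MHz (it also matches the rule that packet sizes are multiples of 4 Bytes). The function names are illustrative.

```python
CLOCK_HZ = 300_000_000  # from the slide: 300 MHz
DATAPATH_BITS = 32      # assumption: inferred from 9.6 Gbit/s at 300 MHz


def port_rate_gbps(clock_hz: int, width_bits: int) -> float:
    """Raw per-port rate when one datapath word moves per clock cycle."""
    return clock_hz * width_bits / 1e9


def cycles_per_packet(size_bytes: int, width_bits: int) -> int:
    """Clock cycles needed to move one packet, rounded up to whole words."""
    word_bytes = width_bits // 8
    return (size_bytes + word_bytes - 1) // word_bytes


rate = port_rate_gbps(CLOCK_HZ, DATAPATH_BITS)
min_cycles = cycles_per_packet(40, DATAPATH_BITS)
max_cycles = cycles_per_packet(1500, DATAPATH_BITS)
print(rate, min_cycles, max_cycles)
```

Under this assumption a minimum-size packet occupies a port for 10 cycles and a maximum-size packet for 375 cycles, which gives a feel for the two extremes exercised in gate-level verification.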
13. Logic Synthesis: Gate Count, FF & Area Results
- Logic area only 5%, rest is memory
- Proof of feasibility: 140 mm² in 0.13 µm
14. Placement & Routing (P&R)
- Tool: Cadence Encounter
- Hierarchical approach
- Gate Count, FF & Area results
- Power Consumption
15. Placement & Routing: Hierarchy
- Flat P&R impossible
- Design size
- Fast results
- Hierarchy Decision
- Easy to route at top-level
- Small/Medium hierarchy component size
- Chip core aspect ratio (AR)
- Three alternatives: group by
- squares
- rows
- columns
- → Per-column organization seemed best
16. Placement & Routing: Column Organization
- Each column
- 32 XPs, 1 OS
- Best organization decision
- All three tested
- AR 1:3
- Unroutable
- Success, AR 1:2
17. Placement & Routing: Column Layout
- Output Scheduler in bottom half of column
- Buffered line in feedthroughs
- Column optimization: output buffer insertion
- Credit regs near credit scheduler inputs
- Else top-level PR impossible
18. Placement & Routing: Top-Level
- Impossible to P&R 32 Columns and Credit Scheduler Line at Top-Level
- Hierarchy level addition: group 32 columns into 2 sets of 16
- 3 components at top-level
19. Placement & Routing: Chip Core Layout
Core layout: 1.45 x 2.88 cm ≈ 420 mm² in 0.18 µm
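As a quick sanity check, the quoted core dimensions can be turned into an area and an aspect ratio. This is plain arithmetic on the two numbers from the slide; the helper names are illustrative.

```python
def core_area_mm2(width_cm: float, height_cm: float) -> float:
    """Rectangle area in mm^2 from side lengths given in cm."""
    return (width_cm * 10.0) * (height_cm * 10.0)


def aspect_ratio(width_cm: float, height_cm: float) -> float:
    """Long side divided by short side."""
    return max(width_cm, height_cm) / min(width_cm, height_cm)


area = core_area_mm2(1.45, 2.88)
ratio = aspect_ratio(1.45, 2.88)
print(f"{area:.1f} mm^2, aspect ratio 1:{ratio:.2f}")
```

The result is about 418 mm² with an aspect ratio close to 1:2, in line with the rounded 420 mm² quoted for the layout.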
20. Placement & Routing: Gate, FF & Area Results
- Differ from synthesis
- Area 40% larger due to wiring (90%), hierarchy overhead (10%)
- Logic gates 60% more due to opt. buffers, clock tree
- Memories 70%, Wiring 25%, Logic 5%
- Still feasible: under 200 mm² in 0.13 µm, 110 mm² in 0.09 µm
21. Placement & Routing: Power Estimation
- Typical case
- 32 (XPs + OSs + CSs)
- Total Core Power: 5.75 W
- 62% Wiring/Buffering
- 17% Memory
- 21% Logic
- 3.2 W in 0.13 µm
- SERDES Power: 23 W!
Chip core power consumption
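The percentage breakdown can be converted into absolute figures to show where the core power goes. The sketch uses only the total and the three shares quoted on this slide; the dictionary keys are illustrative labels.

```python
TOTAL_CORE_POWER_W = 5.75  # typical case, from the slide

# Shares of total core power, from the slide.
shares = {
    "wiring/buffering": 0.62,
    "memory": 0.17,
    "logic": 0.21,
}

# Absolute power per component in watts.
watts = {name: TOTAL_CORE_POWER_W * share for name, share in shares.items()}

for name, value in watts.items():
    print(f"{name}: {value:.1f} W")
```

This puts wiring and buffering at roughly 3.6 W, against about 1.0 W for memory and 1.2 W for logic, so long wires and their drivers dominate the core power.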
22. Conclusions, Lessons Learned
- VPS bufXbar is good!
- New idea: no SAR & speedup overhead, no output queues
- Feasibility proof: chip core almost 100 mm² in 90 nm
- Synthesis Area Underestimation
- 30% erroneous area results when lots of wiring
- Hierarchical Design is Challenging
- Choose optimal organization
- Produce fast results
- Meet timing constraints
- Hier. Placement & Routing
- Time-consuming: 40% of thesis
- Back-end tools still not perfect: human intervention required
- Power of long wires/drivers is large
- 50-60% in large, hierarchical cores