Design of a 32x32 Variable-Packet-Size Buffered Crossbar Switch Chip

1
Design of a 32x32 Variable-Packet-Size Buffered
Crossbar Switch Chip
  • Master of Science Thesis
  • Dimitrios G. Simos

2
Outline
  • Introduction to the Variable-Packet-Size Buffered
    Crossbar (VPS bufXbar) Architecture
  • Switch Organization
  • Logic Synthesis
  • Placement & Routing
  • Power Consumption
  • Conclusions, Lessons Learned

3
Introduction: Unbuffered (CIOQ) vs. Buffered
Crossbars (CICQ)
  • Buffered crossbars (CICQ):
  • Distributed scheduling
  • Fast/efficient: no speedup required
  • New: variable-size packets
  • → No SAR, no speedup required
  • → No output queues
  • Unbuffered crossbars (CIOQ):
  • Central scheduling
  • Inefficient: speedup required
  • Fixed-size cells for synchronous operation
  • → SAR, speedup required
  • → Output queues

4
Motivation/Contribution
  • How large crosspoint memories?
  • Buffer size = MaxPacketSize + RTT
  • E.g. 1500 + 500 ≈ 2 KBytes in our implementation
  • See Katevenis et al., "Variable Packet Size
    Buffered Crossbar (CICQ) Switches", ICC 2004,
    http://archvlsi.ics.forth.gr/bufXbar/
  • Until now, difficult to fit large amounts of
    SRAMs on-chip
  • Today, 2-4 Mbytes can fit on-chip
  • VPS bufXbar simulations showed very good results,
    but
  • is it feasible to build an actual multi-port
    (e.g. 32x32) chip core of this kind using a
    standard ASIC flow?
  • If yes,
  • How difficult is it?
  • Area?
  • Power?
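
The memory budget behind these questions can be checked with a little arithmetic. The sketch below is illustrative only: the constant and function names are hypothetical, and it assumes the per-crosspoint buffer is 1500 bytes for one maximum-size packet plus 500 bytes for the round-trip window, as the "2 KBytes" example on this slide suggests.

```python
# Rough on-chip memory budget for a 32x32 buffered crossbar.
# All names are illustrative; the byte counts are taken from the slide.
PORTS = 32
MAX_PACKET_BYTES = 1500   # largest packet the switch accepts
RTT_WINDOW_BYTES = 500    # assumed round-trip window, in bytes


def crosspoint_buffer_bytes(max_packet=MAX_PACKET_BYTES, rtt=RTT_WINDOW_BYTES):
    """Buffer needed at one crosspoint: one maximum packet plus one RTT window."""
    return max_packet + rtt


def total_buffer_bytes(ports=PORTS):
    """Total crosspoint memory: one buffer per (input, output) pair."""
    return ports * ports * crosspoint_buffer_bytes()


if __name__ == "__main__":
    print(crosspoint_buffer_bytes())  # bytes per crosspoint
    print(total_buffer_bytes())       # bytes for all 1024 crosspoints
```

With these assumptions the total is about 2 MBytes, which sits at the low end of the 2-4 Mbytes the slide says can fit on-chip today.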

5
Switch Organization: Block Level
  • Crosspoint Blocks (Crosspoints-XPs)
  • Packet enqueue/dequeue
  • Synchronization
  • Output Schedulers (OSs)
  • Flow selection
  • Packet transmission
  • Credit transmission
  • Credit Schedulers (CSs)
  • Accumulation and transmission of credits to inputs
  • Line Cards (outside of switch)
  • Append destination bit-mask
  • Control sop and eop signals
  • Send packets to switch
  • Receive and handle credits
  • Packet Format
  • Size 40-1500 Bytes
  • Packet size multiple of 4
  • 2 Clock Domains

Switch block-level organization
6
Switch Organization: Crosspoints
  • 2-port SRAM for data
  • Packet enqueue/dequeue to/from SRAM
  • Clock domain synchronization
  • 1-bit 3-FF synchronizer for control signal
  • 5 clock cycles latency
  • → Cut-through

7
Switch Organization: Output Schedulers
  • Round-robin selection of next eligible flow
  • Eligible = contains a packet
  • Back-to-Back transmission
  • Packet transmission to switch outputs
  • Credit signal transmission to line cards
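
The round-robin choice of the next eligible flow can be sketched in a few lines. This is a behavioral illustration under stated assumptions, not the thesis RTL: the function name and the `occupancy` list (packets waiting at each crosspoint of the column) are hypothetical.

```python
def next_eligible_flow(occupancy, last_served):
    """Return the index of the next flow to serve, or None if all are empty.

    occupancy   -- packets waiting in each crosspoint of this output's column
    last_served -- index of the flow served most recently
    A flow is eligible when its crosspoint holds at least one packet.
    """
    n = len(occupancy)
    # Start the search just after the last served flow and wrap around,
    # so the last served flow gets the lowest priority.
    for offset in range(1, n + 1):
        candidate = (last_served + offset) % n
        if occupancy[candidate] > 0:
            return candidate
    return None


if __name__ == "__main__":
    print(next_eligible_flow([0, 2, 0, 1], last_served=1))
```

Starting the scan one position after the last served flow is what keeps the selection fair: a busy flow cannot be served twice in a row while another flow has a packet waiting.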

8
Switch Organization: Credit Schedulers
  • Credit pulse accumulation
  • Credit Semantics
  • Not "a new packet has just departed from
    crossbar"
  • But QFC-like protocol for error tolerance:
    total number of transmitted packets mod 2^k
    equals
  • Round Robin credit transmission to line cards
  • Only those that have changed since last time
  • If none has changed, send all of them in round-robin order
  • If a credit is lost, it is retransmitted in the
    next round
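
A behavioral sketch of this credit policy follows. It is an illustration only: the class and method names are hypothetical, the counter width `k` is an arbitrary choice, and the real design is hardware rather than Python.

```python
class CreditScheduler:
    """Toy model of one credit scheduler (names and k are assumptions)."""

    def __init__(self, num_flows, k=4):
        self.modulus = 1 << k                 # counters wrap at 2^k
        self.counters = [0] * num_flows       # departed packets per flow, mod 2^k
        self.changed = [False] * num_flows    # changed since last sent?
        self.pointer = 0                      # round-robin position

    def packet_departed(self, flow):
        """Accumulate one credit pulse for a flow."""
        self.counters[flow] = (self.counters[flow] + 1) % self.modulus
        self.changed[flow] = True

    def next_credit(self):
        """Pick the next (flow, counter) pair to send to the line card.

        Flows whose counter changed are preferred; if none changed,
        every flow is eligible and the scheduler simply cycles through them.
        """
        n = len(self.counters)
        only_changed = any(self.changed)
        for offset in range(n):
            flow = (self.pointer + offset) % n
            if self.changed[flow] or not only_changed:
                self.changed[flow] = False
                self.pointer = (flow + 1) % n
                return flow, self.counters[flow]
        return None  # only reached when there are no flows at all


if __name__ == "__main__":
    cs = CreditScheduler(num_flows=4, k=2)
    cs.packet_departed(2)
    print(cs.next_credit())
    print(cs.next_credit())
```

Because each credit carries a running total rather than a single pulse, a lost credit does no lasting harm: the next credit sent for the same flow carries the up-to-date count.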

Credit Format
9
Line Card Logic
  • Semi-behavioral
  • VOQ handling
  • Maintain packet-size FIFOs for each flow
  • Maintain row XP MEM free space
  • Packet transmission
  • Append destination bit-mask
  • Enqueue packet size to FIFO
  • Decrease corresponding XP MEM free space
  • Credit arrival and handling
  • Check if new credit different from corresponding
    saved one
  • If different, dequeue packet size and increase
    corresponding XP MEM free space
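
The per-flow bookkeeping on the line card can be sketched as follows. This is a minimal model under assumptions: the class name, the 2000-byte crosspoint memory, and the counter width `k` are hypothetical, and handling a credit that advanced by more than one packet (by dequeuing several sizes) is one interpretation of the QFC-like counter, not something the slides spell out.

```python
from collections import deque


class LineCardFlow:
    """Toy model of the line-card state kept for one crosspoint (one flow)."""

    def __init__(self, xp_mem_bytes=2000, k=4):
        self.modulus = 1 << k            # credit counters wrap at 2^k
        self.free_space = xp_mem_bytes   # bytes believed free at the crosspoint
        self.sizes = deque()             # sizes of packets sent, oldest first
        self.saved_credit = 0            # last credit value seen for this flow

    def can_send(self, size):
        return size <= self.free_space

    def send(self, size):
        """Packet transmission: remember the size, consume free space."""
        if not self.can_send(size):
            raise ValueError("not enough crosspoint memory free space")
        self.sizes.append(size)
        self.free_space -= size

    def credit_arrived(self, credit):
        """Credit handling: a changed counter means packets have departed."""
        # Assumption: the difference mod 2^k is the number of departed packets.
        departed = (credit - self.saved_credit) % self.modulus
        for _ in range(departed):
            self.free_space += self.sizes.popleft()
        self.saved_credit = credit


if __name__ == "__main__":
    flow = LineCardFlow()
    flow.send(1500)
    flow.send(40)
    flow.credit_arrived(2)   # both packets have left the crosspoint
    print(flow.free_space)
```

A credit equal to the saved value leaves the state untouched, which matches the "check if different" step on the slide.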

10
Logic Synthesis
  • Tool: Synopsys Design Compiler
  • Design Hierarchy
  • Gate-Level Simulation
  • Gate Count, FF Area Results

11
Logic Synthesis: Hierarchy
  • Flat synthesis impossible due to design size /
    instance count
  • Hierarchical approach

12
Logic Synthesis: Constraints, Gate-Level
Simulation
  • Constraints
  • Clock frequency: 300 MHz (max. memory freq.)
  • → 9.6 Gbit/sec per port
  • Logic area: minimal
  • Circuit compilation
  • Gate-level netlist to be imported to P&R tool
  • Gate-Level verification
  • Minimum-size (40 B) packets
  • Maximum-size (1500 B) packets
  • Randomly-selected-size packets
  • All inputs send to one/all output(s) with
    load = 100%
  • In-order back-to-back packet transmission to
    outputs
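
The per-port rate on this slide follows from the clock frequency once a datapath width is fixed. The sketch below assumes a 32-bit datapath, which is an inference from the 4-byte packet-size granularity and from 9.6 Gbit/s divided by 300 MHz, not something the slides state; all names are illustrative.

```python
CLOCK_HZ = 300_000_000   # 300 MHz, limited by the memory
DATAPATH_BITS = 32       # assumption: one 4-byte word per clock cycle


def line_rate_gbps(clock_hz=CLOCK_HZ, width_bits=DATAPATH_BITS):
    """Raw per-port throughput in Gbit/s."""
    return clock_hz * width_bits / 1e9


def packet_cycles(size_bytes, width_bits=DATAPATH_BITS):
    """Clock cycles needed to move one packet across the datapath."""
    return size_bytes * 8 // width_bits


if __name__ == "__main__":
    print(line_rate_gbps())
    print(packet_cycles(40), packet_cycles(1500))
```

Under this assumption a minimum-size 40-byte packet occupies the datapath for 10 cycles and a maximum-size 1500-byte packet for 375 cycles.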

13
Logic Synthesis: Gate Count, FF, Area Results
  • Logic area only 5%, rest is memory
  • Proof of feasibility: 140 mm² in 0.13 µm

14
Placement & Routing (P&R)
  • Tool: Cadence Encounter
  • Hierarchical approach
  • Gate Count, FF Area results
  • Power Consumption

15
Placement & Routing: Hierarchy
  • Flat P&R impossible
  • Design size
  • Fast results
  • Hierarchy decision
  • Easy to route at top level
  • Small/medium hierarchy component size
  • Chip core aspect ratio (AR)
  • Three alternatives: group by
  • squares
  • rows
  • columns
  • → Per-column organization seemed best

16
Placement & Routing: Column Organization
  • Each column
  • 32 XPs, 1 OS
  • Best org. decision
  • All three tested
  • AR 1:3
  • Unroutable
  • Success, AR 1:2

17
Placement & Routing: Column Layout
  • Output scheduler in bottom half of column
  • Buffered line in feedthroughs
  • Column optimization: output buffer insertion
  • Credit regs near credit scheduler inputs
  • Otherwise top-level P&R impossible

18
Placement & Routing: Top Level
  • Impossible to P&R 32 columns and the credit
    scheduler line at top level
  • Added hierarchy level: group 32 columns into 2
    sets of 16
  • 3 components at top-level

19
Placement & Routing: Chip Core Layout
Core layout: 1.45 x 2.88 cm ≈ 420 mm² in 0.18 µm
20
Placement & Routing: Gate, FF, Area Results
  • Differ from synthesis
  • Area 40% larger due to wiring (90%), hierarchy
    overhead (10%)
  • Logic gates 60% more due to opt. buffers, clock
    tree
  • Memories 70%, wiring 25%, logic 5%
  • Still feasible: under 200 mm² in 0.13 µm, 110 mm² in
    0.09 µm
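
The area figures quoted in this deck can be cross-checked with simple arithmetic. The sketch below is a back-of-the-envelope check, not a result from the thesis: the function names are hypothetical, and it assumes the 40% post-layout growth applies directly to the 140 mm² synthesis estimate for 0.13 µm.

```python
SYNTH_AREA_MM2 = 140.0   # synthesis estimate in 0.13 um, from the slides
PNR_OVERHEAD = 0.40      # area growth after placement and routing


def post_pnr_area(synth_area=SYNTH_AREA_MM2, overhead=PNR_OVERHEAD):
    """Synthesis area scaled by the post-layout overhead (assumed multiplicative)."""
    return synth_area * (1 + overhead)


def core_area_mm2(width_cm, height_cm):
    """Core area in mm^2 from its dimensions in cm."""
    return (width_cm * 10) * (height_cm * 10)


if __name__ == "__main__":
    print(round(post_pnr_area(), 1))            # compare with "under 200 mm2"
    print(round(core_area_mm2(1.45, 2.88), 1))  # compare with "420 mm2" in 0.18 um
```

Under these assumptions the 0.13 µm estimate lands just below 200 mm², and the 1.45 x 2.88 cm core works out to roughly 418 mm², which the deck rounds to 420 mm².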

21
Placement & Routing: Power Estimation
  • Typical case
  • 32 (XPs + OSs + CSs)
  • Total core power: 5.75 W
  • 62% wiring/buffering
  • 17% memory
  • 21% logic
  • 3.2 W in 0.13 µm
  • SERDES power: 23 W!
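
To put the breakdown in absolute terms, the sketch below converts the three shares into watts. It reads the figures 62, 17 and 21 on this slide as percentages of the 5.75 W total, which is an interpretation (they sum to 100), and the names are illustrative.

```python
TOTAL_CORE_W = 5.75
SHARES_PCT = {"wiring_buffering": 62, "memory": 17, "logic": 21}


def breakdown(total_w=TOTAL_CORE_W, shares_pct=SHARES_PCT):
    """Split the total core power according to the percentage shares."""
    return {name: total_w * pct / 100 for name, pct in shares_pct.items()}


if __name__ == "__main__":
    for name, watts in breakdown().items():
        print(f"{name}: {watts:.2f} W")
```

Read this way, wiring and buffering alone account for roughly 3.6 W of the core power, while the whole core is still well below the 23 W attributed to the SERDES.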

Chip core power consumption
22
Conclusions, Lessons Learned
  • VPS bufXbar is good!
  • New idea: no SAR, no speedup overhead, no output
    queues
  • Feasibility proof: chip core almost 100 mm² in
    90 nm
  • Synthesis area underestimation
  • Area results off by 30% when there is a lot of wiring
  • Hierarchical design is challenging
  • Choose optimal organization
  • Produce fast results
  • Meet timing constraints
  • Hier. Placement & Routing
  • Time-consuming: 40% of thesis
  • Back-end tools still not perfect: human
    intervention required
  • Power of long wires/drivers is large
  • 50-60% in large, hierarchical cores