A New Scalable and Cost-Effective Congestion Management Strategy for Lossless Multistage Interconnection Networks
1
A New Scalable and Cost-Effective Congestion
Management Strategy for Lossless Multistage
Interconnection Networks
  • J. Duato¹, I. Johnson², J. Flich¹, F. Naven²,
    P.J. García³, T. Nachiondo¹

¹ Technical University of Valencia, Valencia, Spain
² Xyratex, Havant, UK
³ University of Castilla-La Mancha, Albacete, Spain
The Eleventh International Symposium on
High-Performance Computer Architecture, San
Francisco, 2005
2
Outline
  • Introduction
  • Congestion and HOL blocking
  • Why now?
  • Why previous proposals are inadequate
  • Proposal RECN
  • Performance evaluation
  • Conclusions

3
Interconnection Networks
  • MPPs
    • Earth Simulator (640 vector CPUs)
    • ASCI Q (12,288 EV68 CPUs, Quadrics network)
    • BlueGene/L (65,536 nodes, each with 2
      processors, 360 TFlops)
  • PC Clusters
  • Storage Area Networks (SANs)
    • Google (6,000 CPUs and 12,000 disks)
    • Thunder (1,024 nodes, each with 4 Itaniums /
      8 GB)
    • Many data centers all around the world

[Photos: Thunder, ASCI Q, Earth Simulator]
4
Network Throughput beyond Saturation
5
Congestion and HOL Blocking
  • Network Contention
  • Several packets request the same output port
  • One makes progress, the others wait
  • Network Congestion
  • Persistent network contention
  • It is quickly propagated by flow control
    (lossless nets)
  • Network performance degrades dramatically
  • Head-of-line (HOL) blocking
  • When the first packet in a queue is blocked,
    every other packet in the same queue is also
    blocked, even if it requests available resources
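The effect above can be shown with a minimal toy model (mine, not the paper's simulator): a FIFO input queue drains one packet per cycle, but only if its head packet's output port is free. Packets behind a blocked head are stuck even when their own outputs are idle.

```python
# Toy HOL-blocking model: a FIFO drains at most one packet per cycle,
# and only when the *head* packet's requested output port is free.
from collections import deque

def drain(queue, busy_outputs, cycles):
    """Return how many packets leave the FIFO within `cycles` cycles."""
    delivered = 0
    for _ in range(cycles):
        if queue and queue[0] not in busy_outputs:
            queue.popleft()          # head advances
            delivered += 1
        # else: head is blocked, so every packet behind it waits too
    return delivered

# Output port 7 is congested; ports 1..4 are idle.
fifo = deque([7, 1, 2, 3, 4])        # head packet wants the hot port
print(drain(fifo, busy_outputs={7}, cycles=10))  # 0: total HOL blocking

# With the hot-flow packet set aside (the idea behind RECN's SAQs),
# the remaining packets flow freely.
fifo = deque([1, 2, 3, 4])
print(drain(fifo, busy_outputs={7}, cycles=10))  # 4 delivered
```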

6
Congestion and HOL Blocking
Network contention
7
Congestion and HOL Blocking
Persistent network contention
8
Congestion and HOL Blocking
Flow control
Persistent network contention
9
Congestion and HOL Blocking
Congestion propagates
Persistent network contention
10
Congestion and HOL Blocking
  • Congestion introduces HOL blocking, and this may
    degrade network performance dramatically

11
Traditional Solution
  • Overdimensioning the network

12
Why Congestion Management Now?
  • New problems arising
  • System cost: recent interconnects (Myrinet,
    InfiniBand, ASI) are expensive compared to
    processors
  • Power consumption: as network size increases,
    power consumption and heat dissipation grow
  • Possible solutions
  • Frequency/voltage scaling techniques: not very
    efficient, and do not solve the system cost
    problem
  • Reducing the number of network components:
    possible by using a suitable topology, but link
    utilization increases
13
Why Are Current Techniques Not Suitable?
  • Proactive Congestion Management (congestion
    prevention)
  • Path setup before data transmission
  • Used in ATM, computer networks (QoS)
  • High overhead, high latencies (not suitable for
    HPC)
  • Reactive Congestion Management (congestion
    recovery)
  • Injection limitation techniques using closed-loop
    feedback
  • Do not scale with network size and link bandwidth
  • Notification delay (proportional to distance)
  • Link capacity (proportional to clock frequency)
  • May produce network instabilities

The real problem is not the congestion, but its
negative effects (HOL blocking)
14
Why Are Current Techniques Not Suitable?
  • HOL blocking elimination/reduction
  • DAMQs and Virtual Channels
  • Not efficient for multihop networks
  • VOQ (Virtual Output Queueing)
  • VOQ at switch level scales, but does not
    eliminate HOL blocking
  • VOQ at network level: a separate queue at every
    input port for every destination
  • The number of required resources scales at least
    quadratically with network size!!!
  • Credit Flow Controlled ATM
  • References congestion to network output only
  • Consumes a large number of buffers: a separate
    queue at every output port for every destination
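A back-of-the-envelope sketch makes the scaling argument concrete. The per-switch port count (8) is illustrative; the 8 SAQs per port is the limit used in the evaluation later in the talk.

```python
# Queue count per switch for N endpoints, under the schemes discussed
# above (illustrative model; port and SAQ counts are assumptions).
def queues_per_switch(n_hosts, ports=8, saqs=8):
    return {
        "VOQnet": ports * n_hosts,    # one queue per input port per destination
        "VOQsw":  ports * ports,      # one queue per input port per output port
        "RECN":   ports * (1 + saqs), # one cold queue plus a few SAQs per port
    }

for n in (64, 256, 512):
    print(n, queues_per_switch(n))
# VOQnet grows linearly per switch (quadratically network-wide);
# RECN's requirement is constant regardless of N.
```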

15
Proposal
  • Initial idea
  • Exploit spatial and temporal locality in packet
    destinations
  • Manage the set of queues as a cache
  • No equivalent to main memory!!! (where to
    replace?)
  • Not enough locality!!! (reduction in queue
    silicon area by a factor of 4)
  • Observation
  • Non-congested flows do not introduce significant
    HOL blocking
  • RECN: Regional Explicit Congestion Notification
  • Non-congested flows are mapped to the same queue
  • Effective reduction in the number of queues, and
    no replacement needed
  • Congested flows are detected and mapped to
    set-aside queues (SAQs)
  • RECN is a scalable congestion management
    technique because
  • It reacts locally (and thus, it is not affected
    by propagation delays)
  • A very small number of queues (SAQs) for a wide
    range of network sizes
  • RECN enables
  • Effective reduction of network cost by working
    closer to the saturation point
  • More efficient use of voltage/frequency scaling
    techniques

16
RECN
  • Based on the PCI Express Advanced Switching
    Interconnect (ASI) specification
  • Routing (turnpools)
  • Relevant switch architectural features
  • Congestion detection
  • Congestion notification and queue allocation
  • Queue deallocation
  • Packet processing
  • Flow control

17
Turnpools
18
Switch Model
[Switch diagram: per-port input/output RAMs with link
controllers (LC), a crossbar (XBAR, S1.5), a central
arbiter, and dynamic queue management (VCs) at both
input and output sides]
19
RAM in and RAM out
20
How it Works
A congestion point forms
21
How it Works
Cold queue fills over a threshold
22
How it Works
  • Congestion Detection
  • The cold queue at the output port side fills over
    the Detection Threshold
  • Congested point: the output port
  • SAQs are not allocated at the output port
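The detection step above can be sketched as follows (names and the threshold value are mine, not from the ASI specification): when an output port's cold queue crosses the detection threshold, every input port with packets headed to it receives an internal notification carrying the turnpool and mask bits that identify the congested point.

```python
# Hedged sketch of detection plus internal notification.
THRESHOLD = 32  # flits; illustrative value, not from the spec

def check_output_port(occupancy, turnpool, mask, senders):
    """Return the notifications to deliver (empty if no congestion)."""
    if occupancy <= THRESHOLD:
        return []
    # One notification per input port currently sending to this output.
    return [{"to": port, "turnpool": turnpool, "mask": mask}
            for port in senders]

notes = check_output_port(occupancy=40, turnpool=0b0011, mask=0b1111,
                          senders=[0, 2, 5])
print(len(notes))  # 3 notifications, one per sending input port
```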

23
How it Works
24
How it Works
Internal notification to each input port
sending packets to the output port
25
How it Works
  • Congestion Information Notification
  • Congestion is notified to the input ports sending
    packets to the congested port
  • The notification includes turnpool information
    and mask bits
  • The root token is set for the input port

26
How it Works
27
How it Works
Input ports allocate a new SAQ for packets
addressed to the congested output port
28
How it Works
  • Actions after receiving notification
  • A new SAQ is allocated
  • The notified Turnpool and Mask bits are used to
    map the new SAQ
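The two actions above can be sketched together (class and encodings are illustrative, not the ASI bit layout): the input port allocates a SAQ keyed by the notified (turnpool, mask) pair, then routes each arriving packet by masked turnpool comparison, with matches set aside in the SAQ and everything else sharing the cold queue.

```python
# Hedged sketch of SAQ allocation and packet mapping at an input port.
class InputPort:
    def __init__(self):
        self.cold_queue = []
        self.saqs = {}            # (turnpool, mask) -> list of packets

    def on_notification(self, turnpool, mask):
        self.saqs.setdefault((turnpool, mask), [])   # allocate a new SAQ

    def enqueue(self, packet_turnpool):
        for (turnpool, mask), saq in self.saqs.items():
            if packet_turnpool & mask == turnpool & mask:
                saq.append(packet_turnpool)          # congested flow: set aside
                return "SAQ"
        self.cold_queue.append(packet_turnpool)      # non-congested flows share
        return "cold"

port = InputPort()
port.on_notification(turnpool=0b0011, mask=0b1111)
print(port.enqueue(0b0011))   # "SAQ"  -> headed for the congested point
print(port.enqueue(0b0110))   # "cold" -> unaffected flow stays in cold queue
```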

29
How it Works
  • Reception of packets after mapping SAQs (Example
    1)

[Diagram: each arriving packet's turnpool is compared,
under the stored mask bits, against the CAM entries of
the allocated SAQs; matching packets go to the
corresponding SAQ, all others to the cold queue]
30
How it Works
  • Reception of packets after mapping SAQs (Example
    2)

[Diagram: a packet whose masked turnpool matches no
allocated SAQ entry is stored in the cold queue]
31
How it Works
32
How it Works
Notification sent when the SAQ fills over a
threshold
33
How it Works
  • Congestion propagation
  • A RECN packet including the turnpool, mask bits,
    and SAQ id is sent

[Diagram: a RECN packet carrying the turnpool, mask
bits, and SAQ id is sent upstream when SAQ 0 fills
over its threshold]
34
How it Works
35
How it Works
A new SAQ allocated for the congested port at
each output port
36
How it Works
Internal notification when the SAQ fills over a
threshold
37
How it Works
The input port allocates a new SAQ
38
How it Works
At the end, the congestion tree builds and is
mapped entirely onto SAQs
39
Performance Evaluation
  • Evaluation based on simulation results
  • Two evaluation studies
  • Network performance when using
  • RECN
  • VOQ at network level (VOQnet)
  • VOQ at switch level (VOQsw)
  • 4 queues at ingress and egress ports (4Q)
  • 1 queue at ingress and egress ports (1Q)
  • RECN scalability

40
Simulation Model
  • Network configurations evaluated
  • 64 hosts connected by a 64x64 BMIN
  • 256 hosts connected by a 256x256 BMIN
  • 512 hosts connected by a 512x512 BMIN
  • Simulation assumptions
  • BMINs based on shuffle-exchange connection scheme
  • Deterministic routing
  • 128 KB memories at ingress/egress ports
  • Multiplexed crossbar (BW = 12 Gbps)
  • Serial full-duplex pipelined links (BW = 8 Gbps)
  • 64 and 512-byte packets
  • Credit-based and Xon-Xoff (for SAQs) flow control
  • Maximum of 8 SAQs at ingress/egress ports (RECN)
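The Xon/Xoff flow control applied to SAQs can be sketched as a simple hysteresis rule (threshold values are assumptions, not from the paper): the receiver signals Xoff when a SAQ fills past a high-water mark and Xon once it drains below a low-water mark, so the upstream SAQ only forwards while Xon is in effect.

```python
# Hedged sketch of Xon/Xoff flow control for a SAQ.
XOFF_LEVEL, XON_LEVEL = 24, 8   # flits; illustrative watermarks

def flow_state(occupancy, currently_on):
    """Return True (Xon) or False (Xoff) from occupancy and prior state."""
    if currently_on and occupancy >= XOFF_LEVEL:
        return False                  # send Xoff: stop the upstream SAQ
    if not currently_on and occupancy <= XON_LEVEL:
        return True                   # send Xon: resume
    return currently_on               # hysteresis: no change in between

print(flow_state(30, True))    # False -> Xoff
print(flow_state(5, False))    # True  -> Xon
print(flow_state(15, True))    # True  -> unchanged between watermarks
```

The gap between the two watermarks damps oscillation between stop and go as the SAQ occupancy hovers near a single threshold.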

41
Traffic Load
  • Synthetic Traffic
  • Traces
  • From I/O activity at cello system disk interface
  • Different compression factors applied

               Srcs  Dst.      Inj. Rate (%)  Traffic Start  Traffic End
Corner Case 1  75%   Random    50             0              Sim. End
               25%   Hot-Spot  100            800 µs         970 µs
Corner Case 2  75%   Random    100            0              Sim. End
               25%   Hot-Spot  100            800 µs         970 µs
42
Performance Comparison
  • Network throughput - Corner case 1, 64x64 BMIN

43
Performance Comparison
  • Network throughput - Corner case 2, 64x64 BMIN

44
Performance Comparison
  • Network throughput: Traces, 64x64 BMIN

Compression Factor set to 20
Compression Factor set to 40
45
Scalability Analysis
  • SAQ utilization: Corner Case 1, 64x64 BMIN

[Plots: maximum SAQs used (egress), maximum SAQs used
(ingress), total number of active SAQs]
46
Scalability Analysis
  • SAQ utilization: Corner Case 2, 64x64 BMIN

[Plots: maximum SAQs used (egress), maximum SAQs used
(ingress), total number of active SAQs]
47
Scalability Analysis
  • SAQ utilization: Traces, Comp. Factor 20, 64x64
    BMIN

[Plots: maximum SAQs used (egress), maximum SAQs used
(ingress), total number of active SAQs]
48
Scalability Analysis
  • SAQ utilization: Traces, Comp. Factor 40, 64x64
    BMIN

[Plots: maximum SAQs used (egress), maximum SAQs used
(ingress), total number of active SAQs]
49
Scalability Analysis
  • Network throughput: Corner Case 2, 256x256 BMIN

50
Scalability Analysis
  • Network throughput: Corner Case 2, 512x512 BMIN

51
Final Remarks
  • We also designed a protocol to deallocate SAQs
    when they are no longer needed
  • Many optimizations
  • CAM IDs to reduce control message size
  • CAM search done in parallel with packet reception
  • Merging of congestion trees
  • Silicon area reduced with respect to switch-level
    VOQs

52
Conclusions
  • We have proposed a scalable congestion management
    strategy for lossless networks
  • We have shown that it only requires a small
    number of buffers for a wide range of network
    sizes
  • We have modeled an existing ASI switch design,
    verifying that
  • RECN maintains network performance close to that
    of the ideal (but non-scalable) solution
  • Silicon area requirements are now smaller than
    for the original design