Title: A New Scalable and Cost-Effective Congestion Management Strategy for Lossless Multistage Interconnection Networks

Slide 1: A New Scalable and Cost-Effective Congestion Management Strategy for Lossless Multistage Interconnection Networks
J. Duato (1), I. Johnson (2), J. Flich (1), F. Naven (2), P.J. García (3), T. Nachiondo (1)
(1) Technical University of Valencia, Valencia, Spain
(2) Xyratex, Havant, UK
(3) University of Castilla-La Mancha, Albacete, Spain
The Eleventh International Symposium on High-Performance Computer Architecture, San Francisco, 2005
Slide 2: Outline
- Introduction
- Congestion and HOL blocking
- Why now?
- Why previous proposals are inadequate
- Proposal: RECN
- Performance evaluation
- Conclusions
Slide 3: Interconnection Networks
- MPPs
  - Earth Simulator (640 vector CPUs)
  - ASCI Q (12,288 EV68 CPUs, Quadrics network)
  - BlueGene/L (65,536 nodes, each with 2 processors, 360 TFlops)
- PC clusters
- Storage Area Networks (SANs)
  - Google (6,000 CPUs and 12,000 disks)
  - Thunder (1,024 nodes, each with 4 Itaniums and 8 GB)
- Many data centers all around the world
[Photos: Thunder, ASCI Q, Earth Simulator]
Slide 4: Network Throughput beyond Saturation
[Figure: network throughput beyond the saturation point]
Slide 5: Congestion and HOL Blocking
- Network contention
  - Several packets request the same output port
  - One makes progress, the others wait
- Network congestion
  - Persistent network contention
  - Quickly propagated by flow control (lossless networks)
  - Network performance degrades dramatically
- Head-of-line (HOL) blocking
  - When the first packet in a queue is blocked, every other packet in the same queue is also blocked, even if it requests resources that are available
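The HOL blocking effect described above can be sketched in a few lines of Python. This is a toy single-round model (not the paper's simulator): a FIFO input queue can only forward its head packet, so a busy output port stalls everything behind it.

```python
from collections import deque

def forward_fifo(queue, busy_ports):
    """Drain a single FIFO input queue for one arbitration round.

    Only the head packet is eligible: if its requested output port is
    busy, every packet behind it stalls too (HOL blocking), even when
    their own output ports are free.
    """
    delivered = []
    while queue and queue[0] not in busy_ports:
        delivered.append(queue.popleft())
    return delivered

# Output port 'B' is persistently congested; port 'A' is idle.
assert forward_fifo(deque(['B', 'A', 'A']), busy_ports={'B'}) == []
# With a separate queue per destination (VOQ-style), the same 'A'
# packets are no longer stuck behind the packet for 'B':
assert forward_fifo(deque(['A', 'A']), busy_ports={'B'}) == ['A', 'A']
```

The two assertions show the contrast the slides draw later: the packets for the idle port are delivered only when they do not share a queue with the packet heading to the congested port.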
Slides 6-9: Congestion and HOL Blocking
[Diagram sequence: network contention appears; the contention becomes persistent; flow control backpressure sets in; congestion propagates upstream]
Slide 10: Congestion and HOL Blocking
- Congestion introduces HOL blocking, which may degrade network performance dramatically
Slide 11: Traditional Solution
- Overdimensioning the network
Slide 12: Why Congestion Management Now?
- System cost: recent interconnects (Myrinet, InfiniBand, ASI) are expensive compared to processors
- Power consumption: as network size increases, power consumption and heat dissipation grow
- Possible solutions
  - Frequency/voltage scaling techniques: not very efficient, and they do not solve the system cost problem
  - Reducing the number of network components: possible with a suitable topology, but link utilization increases
Slide 13: Why Are Current Techniques Not Suitable?
- Proactive congestion management (congestion prevention)
  - Path setup before data transmission
  - Used in ATM and computer networks (QoS)
  - High overhead, high latencies (not suitable for HPC)
- Reactive congestion management (congestion recovery)
  - Injection limitation techniques using closed-loop feedback
  - Do not scale with network size and link bandwidth
    - Notification delay (proportional to distance)
    - Link capacity (proportional to clock frequency)
  - May produce network instabilities
The real problem is not congestion itself, but its negative effects (HOL blocking)
Slide 14: Why Are Current Techniques Not Suitable?
- HOL blocking elimination/reduction
  - DAMQs and virtual channels: not efficient for multihop networks
  - VOQ (Virtual Output Queueing)
    - VOQ at switch level scales but does not eliminate HOL blocking
    - VOQ at network level: a separate queue at every input port for every destination; the number of required resources scales at least quadratically with network size!
  - Credit flow-controlled ATM
    - References congestion to the network output only
    - Consumes a large number of buffers: a separate queue at every output port for every destination
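The quadratic growth claimed for network-level VOQ can be checked with back-of-the-envelope arithmetic. The switch radix and stage counts below are illustrative assumptions, not the configurations evaluated in the paper:

```python
def voqnet_queues(num_hosts, switch_ports, num_switches):
    # Network-level VOQ: one queue per destination host at every
    # switch input port.
    return num_switches * switch_ports * num_hosts

# Hypothetical BMINs built from 8-port switches:
#   64 hosts  -> 2 stages x 8 switches  = 16 switches
#   512 hosts -> 3 stages x 64 switches = 192 switches
q64 = voqnet_queues(64, 8, 16)      # 8,192 queues
q512 = voqnet_queues(512, 8, 192)   # 786,432 queues
assert q512 // q64 == 96            # 8x the hosts, ~96x the queues
```

Both the per-port queue count and the number of ports grow with network size, which is why the total resource count scales worse than linearly.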
Slide 15: Proposal
- Initial idea
  - Exploit spatial and temporal locality in packet destinations
  - Manage the set of queues as a cache
  - No equivalent to main memory! (where to replace?)
  - Not enough locality! (queue silicon area reduced only by a factor of 4)
- Observation
  - Non-congested flows do not introduce significant HOL blocking
- RECN: Regional Explicit Congestion Notification
  - Non-congested flows are mapped to the same queue: an effective reduction in the number of queues, with no replacement needed
  - Congested flows are detected and mapped to set-aside queues (SAQs)
- RECN is a scalable congestion management technique because
  - It reacts locally (and thus is not affected by propagation delays)
  - It needs only a very small number of queues (SAQs) for a wide range of network sizes
- RECN enables
  - An effective reduction of network cost by working closer to the saturation point
  - More efficient use of voltage/frequency scaling techniques
Slide 16: RECN
- Based on the PCI Express Advanced Switching Interconnect (ASI) specification
- Routing (turnpools)
- Relevant switch architectural features
- Congestion detection
- Congestion notification and queue allocation
- Queue deallocation
- Packet processing
- Flow control
Slide 17: Turnpools
[Figure: turnpool routing example]
Slide 18: Switch Model
[Figure: switch organization with per-port input RAM ("RAM in") and output RAM ("RAM out"), link controllers (LC), an arbiter, and a crossbar (XBAR); dynamic queue management (VCs) on both the input and output sides]
Slide 19: RAM in and RAM out
[Figure: organization of the input and output RAMs]
Slide 20: How It Works
A congestion point forms
Slide 21: How It Works
A cold queue fills over a threshold
Slide 22: How It Works
- Congestion detection
  - The cold queue at the output port side fills over the detection threshold
  - Congested point: the output port
  - SAQs are not allocated at the output port
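The detection step just described can be sketched as follows. The threshold value and the notification message fields are illustrative assumptions, not the paper's exact parameters:

```python
def check_output_port(occupancy, capacity, contributing_inputs,
                      turnpool, mask, threshold=0.5):
    """Congestion detection at an output port (sketch).

    When the cold queue fills beyond the detection threshold, the
    output port becomes a congested point, and one internal
    notification is produced for every input port that is currently
    sending packets to it.
    """
    if occupancy <= threshold * capacity:
        return []
    return [{"input_port": p, "turnpool": turnpool, "mask": mask}
            for p in contributing_inputs]

# Below threshold: no congested point, no notifications.
assert check_output_port(10, 100, [0, 2], 0b110100, 0b111100) == []
# Above threshold: one notification per contributing input port.
assert len(check_output_port(80, 100, [0, 2], 0b110100, 0b111100)) == 2
```

Note that detection is purely local to the switch, which is what lets RECN react without waiting for end-to-end feedback.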
Slide 23: How It Works
[Figure]
Slide 24: How It Works
An internal notification is sent to each input port sending packets to the congested output port
Slide 25: How It Works
- Congestion information notification
  - Congestion is notified to the input ports sending packets to congested ports
  - The notification includes turnpool information and mask bits
  - The root token is set for the input port
Slide 26: How It Works
[Figure]
Slide 27: How It Works
Input ports allocate a new SAQ for packets addressed to the congested output port
Slide 28: How It Works
- Actions after receiving a notification
  - A new SAQ is allocated
  - The notified turnpool and mask bits are used to map packets to the new SAQ
Slide 29: How It Works
- Reception of packets after mapping SAQs (Example 1)
[Figure: incoming packets carrying different turnpools are matched against the stored (turnpool, mask) entries and steered to the cold queue or to one of SAQ 0-3]
Slide 30: How It Works
- Reception of packets after mapping SAQs (Example 2)
[Figure: a further example of packets matched against the SAQ (turnpool, mask) entries; non-matching packets go to the cold queue]
Slide 31: How It Works
[Figure]
Slide 32: How It Works
A notification is sent when the SAQ fills over a threshold
Slide 33: How It Works
- Congestion propagation
  - A RECN packet including the turnpool, the mask bits, and the SAQ id is sent
[Figure: the RECN packet travels upstream carrying the turnpool, mask bits, and SAQ id; the SAQ state at the receiving port is updated]
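The upstream propagation step can be sketched as follows. The field layout and the fill threshold are assumptions for illustration; the ASI wire format differs:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RECNPacket:
    """Congestion notification sent one hop upstream (sketch)."""
    turnpool: int   # route toward the congested point
    mask: int       # significant bits of the turnpool
    saq_id: int     # SAQ that filled over its threshold

def maybe_notify(saq_occupancy, capacity, saq_id, turnpool, mask,
                 threshold=0.75) -> Optional[RECNPacket]:
    # The congestion tree grows one hop upstream only when the SAQ
    # itself fills over a threshold (0.75 is an illustrative value).
    if saq_occupancy > threshold * capacity:
        return RECNPacket(turnpool, mask, saq_id)
    return None

assert maybe_notify(40, 100, 0, 0b0101, 0b1111) is None
pkt = maybe_notify(80, 100, 0, 0b0101, 0b1111)
assert pkt is not None and pkt.saq_id == 0
```

Because a notification is emitted only when a SAQ fills, the tree of SAQs grows no further upstream than the congestion actually reaches.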
Slide 34: How It Works
[Figure]
Slide 35: How It Works
A new SAQ is allocated for the congested port at each output port
Slide 36: How It Works
An internal notification is sent when the SAQ fills over a threshold
Slide 37: How It Works
The input port allocates a new SAQ
Slide 38: How It Works
In the end, the congestion tree is built and mapped entirely onto SAQs
Slide 39: Performance Evaluation
- Evaluation based on simulation results
- Two evaluation studies
  - Network performance when using
    - RECN
    - VOQ at network level (VOQnet)
    - VOQ at switch level (VOQsw)
    - 4 queues at ingress and egress ports (4Q)
    - 1 queue at ingress and egress ports (1Q)
  - RECN scalability
Slide 40: Simulation Model
- Network configurations evaluated
  - 64 hosts connected by a 64x64 BMIN
  - 256 hosts connected by a 256x256 BMIN
  - 512 hosts connected by a 512x512 BMIN
- Simulation assumptions
  - BMINs based on the shuffle-exchange connection scheme
  - Deterministic routing
  - 128 KB memories at ingress/egress ports
  - Multiplexed crossbar (BW = 12 Gbps)
  - Serial full-duplex pipelined links (BW = 8 Gbps)
  - 64- and 512-byte packets
  - Credit-based and Xon/Xoff (for SAQs) flow control
  - Maximum of 8 SAQs at ingress/egress ports (RECN)
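The Xon/Xoff flow control used for the SAQs can be sketched as a stop-and-go queue: the receiver tells the sender to stop at a high watermark and to resume at a low one. The watermark values below are illustrative assumptions:

```python
class XonXoffQueue:
    """Stop-and-go (Xon/Xoff) flow control for a SAQ (sketch)."""

    def __init__(self, capacity, xoff_at=0.8, xon_at=0.2):
        self.capacity = capacity
        self.xoff_at = xoff_at * capacity   # high watermark
        self.xon_at = xon_at * capacity     # low watermark
        self.occupancy = 0
        self.sender_on = True               # upstream may transmit

    def _update(self):
        if self.sender_on and self.occupancy >= self.xoff_at:
            self.sender_on = False          # emit Xoff upstream
        elif not self.sender_on and self.occupancy <= self.xon_at:
            self.sender_on = True           # emit Xon upstream

    def enqueue(self, n=1):
        self.occupancy += n
        self._update()

    def dequeue(self, n=1):
        self.occupancy = max(0, self.occupancy - n)
        self._update()
```

The gap between the two watermarks provides hysteresis, so the sender is not toggled on and off by every single packet; the cold queues, by contrast, use credit-based flow control.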
Slide 41: Traffic Load
- Synthetic traffic
- Traces
  - From I/O activity at the cello system disk interface
  - Different compression factors applied

Corner Case 1:
  Srcs   Dst.      Injection Rate (%)   Traffic Start   Traffic End
  75%    Random    50                   0               Sim. end
  25%    Hot-spot  100                  800 µs          970 µs

Corner Case 2:
  Srcs   Dst.      Injection Rate (%)   Traffic Start   Traffic End
  75%    Random    100                  0               Sim. end
  25%    Hot-spot  100                  800 µs          970 µs
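The corner-case loads above can be sketched as per-cycle packet sources: a fraction of the hosts always targets one hot-spot destination while the rest inject uniformly at random. Function names and the per-cycle injection model are illustrative, not the simulator's interface:

```python
import random

def make_source(is_hotspot, num_hosts, hotspot, rate):
    """Return a per-cycle source function yielding a destination or
    None. Hot-spot sources always target one host; the others pick
    destinations uniformly. 'rate' is the injection probability per
    cycle (e.g. 0.5 for a 50% injection rate)."""
    def source(rng):
        if rng.random() >= rate:
            return None                 # no packet injected this cycle
        return hotspot if is_hotspot else rng.randrange(num_hosts)
    return source

rng = random.Random(0)
hot = make_source(True, 64, hotspot=13, rate=1.0)
assert all(hot(rng) == 13 for _ in range(100))
```

With 25% of the sources behaving like `hot` during a 170 µs window, a congestion tree rooted at host 13's port forms and then drains, which is the scenario the corner cases are designed to stress.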
Slide 42: Performance Comparison
- Network throughput: Corner Case 1, 64x64 BMIN
Slide 43: Performance Comparison
- Network throughput: Corner Case 2, 64x64 BMIN
Slide 44: Performance Comparison
- Network throughput: traces, 64x64 BMIN
  - Compression factor set to 20
  - Compression factor set to 40
Slide 45: Scalability Analysis
- SAQ utilization: Corner Case 1, 64x64 BMIN
  - Maximum SAQs used (egress)
  - Maximum SAQs used (ingress)
  - Total number of active SAQs
Slide 46: Scalability Analysis
- SAQ utilization: Corner Case 2, 64x64 BMIN
  - Maximum SAQs used (egress)
  - Maximum SAQs used (ingress)
  - Total number of active SAQs
Slide 47: Scalability Analysis
- SAQ utilization: traces, compression factor 20, 64x64 BMIN
  - Maximum SAQs used (egress)
  - Maximum SAQs used (ingress)
  - Total number of active SAQs
Slide 48: Scalability Analysis
- SAQ utilization: traces, compression factor 40, 64x64 BMIN
  - Maximum SAQs used (egress)
  - Maximum SAQs used (ingress)
  - Total number of active SAQs
Slide 49: Scalability Analysis
- Network throughput: Corner Case 2, 256x256 BMIN
Slide 50: Scalability Analysis
- Network throughput: Corner Case 2, 512x512 BMIN
Slide 51: Final Remarks
- We also designed a protocol to deallocate SAQs when they are no longer needed
- Many optimizations
  - CAM IDs to reduce control message size
  - CAM search done in parallel with packet reception
  - Merging of congestion trees
- Silicon area reduced with respect to switch-level VOQs
Slide 52: Conclusions
- We have proposed a scalable congestion management strategy for lossless networks
- We have shown that it requires only a small number of buffers for a wide range of network sizes
- We have modeled an existing ASI switch design, verifying that
  - It maintains network performance close to the ideal (but non-scalable) solution
  - Its silicon area requirements are smaller than those of the original design