Title: A New Scalable and Cost-Effective Congestion Management Strategy for Lossless Multistage Interconnection Networks

Slide 1: A New Scalable and Cost-Effective Congestion Management Strategy for Lossless Multistage Interconnection Networks
J. Duato (1), I. Johnson (2), J. Flich (1), F. Naven (2), P.J. García (3), T. Nachiondo (1)
(1) Technical University of Valencia, Valencia, Spain
(2) Xyratex, Havant, UK
(3) University of Castilla-La Mancha, Albacete, Spain
The Eleventh International Symposium on High-Performance Computer Architecture, San Francisco, 2005
Slide 2: Outline
- Introduction
- Congestion and HOL blocking
- Why now?
- Why previous proposals are inadequate
- Proposal: RECN
- Performance evaluation
- Conclusions
Slide 3: Interconnection Networks
- MPPs
  - Earth Simulator (640 vector CPUs)
  - ASCI Q (12,288 EV68 CPUs, Quadrics network)
  - BlueGene/L (65,536 nodes, each with 2 processors, 360 TFlops)
- PC clusters
- Storage Area Networks (SANs)
  - Google (6,000 CPUs and 12,000 disks)
  - Thunder (1,024 nodes, each with 4 Itaniums and 8 GB)
- Many data centers all around the world
[Photos: Thunder, ASCI Q, Earth Simulator]
Slide 4: Network Throughput beyond Saturation
[Figure: network throughput beyond the saturation point]
Slide 5: Congestion and HOL Blocking
- Network contention
  - Several packets request the same output port
  - One makes progress, the others wait
- Network congestion
  - Persistent network contention
  - Quickly propagated by flow control (lossless networks)
  - Network performance degrades dramatically
- Head-of-line (HOL) blocking
  - When the first packet in a queue is blocked, every other packet in the same queue is also blocked, even if it requests resources that are available
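The HOL blocking effect described above can be sketched in a few lines of Python. This is a toy single-round model (not the paper's simulator): a FIFO input queue can only forward its head packet, so a busy output port stalls everything behind it.

```python
from collections import deque

def forward_fifo(queue, busy_ports):
    """Drain a single FIFO input queue for one arbitration round.

    Only the head packet is eligible: if its requested output port is
    busy, every packet behind it stalls too (HOL blocking), even when
    their own output ports are free.
    """
    delivered = []
    while queue and queue[0] not in busy_ports:
        delivered.append(queue.popleft())
    return delivered

# Output port 'B' is persistently congested; port 'A' is idle.
assert forward_fifo(deque(['B', 'A', 'A']), busy_ports={'B'}) == []
# With a separate queue per destination (VOQ-style), the same 'A'
# packets are no longer stuck behind the packet for 'B':
assert forward_fifo(deque(['A', 'A']), busy_ports={'B'}) == ['A', 'A']
```

The two assertions show the contrast the slides draw later: the packets for the idle port are delivered only when they do not share a queue with the packet heading to the congested port.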
Slides 6-9: Congestion and HOL Blocking
[Diagram sequence: network contention appears; the contention becomes persistent; flow control backpressure sets in; congestion propagates upstream]
Slide 10: Congestion and HOL Blocking
- Congestion introduces HOL blocking, which may degrade network performance dramatically
Slide 11: Traditional Solution
- Overdimensioning the network
Slide 12: Why Congestion Management Now?
- System cost: recent interconnects (Myrinet, InfiniBand, ASI) are expensive compared to processors
- Power consumption: as network size increases, power consumption and heat dissipation grow
- Possible solutions
  - Frequency/voltage scaling techniques: not very efficient, and they do not solve the system cost problem
  - Reducing the number of network components: possible with a suitable topology, but link utilization increases
Slide 13: Why Are Current Techniques Not Suitable?
- Proactive congestion management (congestion prevention)
  - Path setup before data transmission
  - Used in ATM and computer networks (QoS)
  - High overhead, high latencies (not suitable for HPC)
- Reactive congestion management (congestion recovery)
  - Injection limitation techniques using closed-loop feedback
  - Do not scale with network size and link bandwidth
    - Notification delay (proportional to distance)
    - Link capacity (proportional to clock frequency)
  - May produce network instabilities
The real problem is not congestion itself, but its negative effects (HOL blocking)
Slide 14: Why Are Current Techniques Not Suitable?
- HOL blocking elimination/reduction
  - DAMQs and virtual channels: not efficient for multihop networks
  - VOQ (Virtual Output Queueing)
    - VOQ at switch level scales but does not eliminate HOL blocking
    - VOQ at network level: a separate queue at every input port for every destination; the number of required resources scales at least quadratically with network size!
  - Credit flow-controlled ATM
    - References congestion to the network output only
    - Consumes a large number of buffers: a separate queue at every output port for every destination
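The quadratic growth claimed for network-level VOQ can be checked with back-of-the-envelope arithmetic. The switch radix and stage counts below are illustrative assumptions, not the configurations evaluated in the paper:

```python
def voqnet_queues(num_hosts, switch_ports, num_switches):
    # Network-level VOQ: one queue per destination host at every
    # switch input port.
    return num_switches * switch_ports * num_hosts

# Hypothetical BMINs built from 8-port switches:
#   64 hosts  -> 2 stages x 8 switches  = 16 switches
#   512 hosts -> 3 stages x 64 switches = 192 switches
q64 = voqnet_queues(64, 8, 16)      # 8,192 queues
q512 = voqnet_queues(512, 8, 192)   # 786,432 queues
assert q512 // q64 == 96            # 8x the hosts, ~96x the queues
```

Both the per-port queue count and the number of ports grow with network size, which is why the total resource count scales worse than linearly.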
Slide 15: Proposal
- Initial idea
  - Exploit spatial and temporal locality in packet destinations
  - Manage the set of queues as a cache
  - No equivalent to main memory! (where to replace?)
  - Not enough locality! (queue silicon area reduced only by a factor of 4)
- Observation
  - Non-congested flows do not introduce significant HOL blocking
- RECN: Regional Explicit Congestion Notification
  - Non-congested flows are mapped to the same queue: an effective reduction in the number of queues, with no replacement needed
  - Congested flows are detected and mapped to set-aside queues (SAQs)
- RECN is a scalable congestion management technique because
  - It reacts locally (and thus is not affected by propagation delays)
  - It needs only a very small number of queues (SAQs) for a wide range of network sizes
- RECN enables
  - An effective reduction of network cost by working closer to the saturation point
  - More efficient use of voltage/frequency scaling techniques
Slide 16: RECN
- Based on the PCI Express Advanced Switching Interconnect (ASI) specification
- Routing (turnpools)
- Relevant switch architectural features
- Congestion detection
- Congestion notification and queue allocation
- Queue deallocation
- Packet processing
- Flow control
Slide 17: Turnpools
[Figure: turnpool routing example]
Slide 18: Switch Model
[Figure: switch organization with per-port input RAM ("RAM in") and output RAM ("RAM out"), link controllers (LC), an arbiter, and a crossbar (XBAR); dynamic queue management (VCs) on both the input and output sides]
Slide 19: RAM in and RAM out
[Figure: organization of the input and output RAMs]
Slide 20: How It Works
A congestion point forms
Slide 21: How It Works
A cold queue fills over a threshold
Slide 22: How It Works
- Congestion detection
  - The cold queue at the output port side fills over the detection threshold
  - Congested point: the output port
  - SAQs are not allocated at the output port
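The detection step just described can be sketched as follows. The threshold value and the notification message fields are illustrative assumptions, not the paper's exact parameters:

```python
def check_output_port(occupancy, capacity, contributing_inputs,
                      turnpool, mask, threshold=0.5):
    """Congestion detection at an output port (sketch).

    When the cold queue fills beyond the detection threshold, the
    output port becomes a congested point, and one internal
    notification is produced for every input port that is currently
    sending packets to it.
    """
    if occupancy <= threshold * capacity:
        return []
    return [{"input_port": p, "turnpool": turnpool, "mask": mask}
            for p in contributing_inputs]

# Below threshold: no congested point, no notifications.
assert check_output_port(10, 100, [0, 2], 0b110100, 0b111100) == []
# Above threshold: one notification per contributing input port.
assert len(check_output_port(80, 100, [0, 2], 0b110100, 0b111100)) == 2
```

Note that detection is purely local to the switch, which is what lets RECN react without waiting for end-to-end feedback.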
Slide 23: How It Works
[Figure]
Slide 24: How It Works
An internal notification is sent to each input port sending packets to the congested output port
Slide 25: How It Works
- Congestion information notification
  - Congestion is notified to the input ports sending packets to congested ports
  - The notification includes turnpool information and mask bits
  - The root token is set for the input port
Slide 26: How It Works
[Figure]
Slide 27: How It Works
Input ports allocate a new SAQ for packets addressed to the congested output port
Slide 28: How It Works
- Actions after receiving a notification
  - A new SAQ is allocated
  - The notified turnpool and mask bits are used to map packets to the new SAQ
Slide 29: How It Works
- Reception of packets after mapping SAQs (Example 1)
[Figure: incoming packets carrying different turnpools are matched against the stored (turnpool, mask) entries and steered to the cold queue or to one of SAQ 0-3]
Slide 30: How It Works
- Reception of packets after mapping SAQs (Example 2)
[Figure: a further example of packets matched against the SAQ (turnpool, mask) entries; non-matching packets go to the cold queue]
Slide 31: How It Works
[Figure]
Slide 32: How It Works
A notification is sent when the SAQ fills over a threshold
Slide 33: How It Works
- Congestion propagation
  - A RECN packet including the turnpool, the mask bits, and the SAQ id is sent
[Figure: the RECN packet travels upstream carrying the turnpool, mask bits, and SAQ id; the SAQ state at the receiving port is updated]
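The upstream propagation step can be sketched as follows. The field layout and the fill threshold are assumptions for illustration; the ASI wire format differs:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RECNPacket:
    """Congestion notification sent one hop upstream (sketch)."""
    turnpool: int   # route toward the congested point
    mask: int       # significant bits of the turnpool
    saq_id: int     # SAQ that filled over its threshold

def maybe_notify(saq_occupancy, capacity, saq_id, turnpool, mask,
                 threshold=0.75) -> Optional[RECNPacket]:
    # The congestion tree grows one hop upstream only when the SAQ
    # itself fills over a threshold (0.75 is an illustrative value).
    if saq_occupancy > threshold * capacity:
        return RECNPacket(turnpool, mask, saq_id)
    return None

assert maybe_notify(40, 100, 0, 0b0101, 0b1111) is None
pkt = maybe_notify(80, 100, 0, 0b0101, 0b1111)
assert pkt is not None and pkt.saq_id == 0
```

Because a notification is emitted only when a SAQ fills, the tree of SAQs grows no further upstream than the congestion actually reaches.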
Slide 34: How It Works
[Figure]
Slide 35: How It Works
A new SAQ is allocated for the congested port at each output port
Slide 36: How It Works
An internal notification is sent when the SAQ fills over a threshold
Slide 37: How It Works
The input port allocates a new SAQ
Slide 38: How It Works
In the end, the congestion tree is built and mapped entirely onto SAQs
Slide 39: Performance Evaluation
- Evaluation based on simulation results
- Two evaluation studies
  - Network performance when using
    - RECN
    - VOQ at network level (VOQnet)
    - VOQ at switch level (VOQsw)
    - 4 queues at ingress and egress ports (4Q)
    - 1 queue at ingress and egress ports (1Q)
  - RECN scalability
Slide 40: Simulation Model
- Network configurations evaluated
  - 64 hosts connected by a 64x64 BMIN
  - 256 hosts connected by a 256x256 BMIN
  - 512 hosts connected by a 512x512 BMIN
- Simulation assumptions
  - BMINs based on the shuffle-exchange connection scheme
  - Deterministic routing
  - 128 KB memories at ingress/egress ports
  - Multiplexed crossbar (BW = 12 Gbps)
  - Serial full-duplex pipelined links (BW = 8 Gbps)
  - 64- and 512-byte packets
  - Credit-based and Xon/Xoff (for SAQs) flow control
  - Maximum of 8 SAQs at ingress/egress ports (RECN)
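The Xon/Xoff flow control used for the SAQs can be sketched as a stop-and-go queue: the receiver tells the sender to stop at a high watermark and to resume at a low one. The watermark values below are illustrative assumptions:

```python
class XonXoffQueue:
    """Stop-and-go (Xon/Xoff) flow control for a SAQ (sketch)."""

    def __init__(self, capacity, xoff_at=0.8, xon_at=0.2):
        self.capacity = capacity
        self.xoff_at = xoff_at * capacity   # high watermark
        self.xon_at = xon_at * capacity     # low watermark
        self.occupancy = 0
        self.sender_on = True               # upstream may transmit

    def _update(self):
        if self.sender_on and self.occupancy >= self.xoff_at:
            self.sender_on = False          # emit Xoff upstream
        elif not self.sender_on and self.occupancy <= self.xon_at:
            self.sender_on = True           # emit Xon upstream

    def enqueue(self, n=1):
        self.occupancy += n
        self._update()

    def dequeue(self, n=1):
        self.occupancy = max(0, self.occupancy - n)
        self._update()
```

The gap between the two watermarks provides hysteresis, so the sender is not toggled on and off by every single packet; the cold queues, by contrast, use credit-based flow control.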
Slide 41: Traffic Load
- Synthetic traffic
- Traces
  - From I/O activity at the cello system disk interface
  - Different compression factors applied

Corner Case 1:
  Srcs   Dst.      Injection Rate (%)   Traffic Start   Traffic End
  75%    Random    50                   0               Sim. end
  25%    Hot-spot  100                  800 µs          970 µs

Corner Case 2:
  Srcs   Dst.      Injection Rate (%)   Traffic Start   Traffic End
  75%    Random    100                  0               Sim. end
  25%    Hot-spot  100                  800 µs          970 µs
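The corner-case loads above can be sketched as per-cycle packet sources: a fraction of the hosts always targets one hot-spot destination while the rest inject uniformly at random. Function names and the per-cycle injection model are illustrative, not the simulator's interface:

```python
import random

def make_source(is_hotspot, num_hosts, hotspot, rate):
    """Return a per-cycle source function yielding a destination or
    None. Hot-spot sources always target one host; the others pick
    destinations uniformly. 'rate' is the injection probability per
    cycle (e.g. 0.5 for a 50% injection rate)."""
    def source(rng):
        if rng.random() >= rate:
            return None                 # no packet injected this cycle
        return hotspot if is_hotspot else rng.randrange(num_hosts)
    return source

rng = random.Random(0)
hot = make_source(True, 64, hotspot=13, rate=1.0)
assert all(hot(rng) == 13 for _ in range(100))
```

With 25% of the sources behaving like `hot` during a 170 µs window, a congestion tree rooted at host 13's port forms and then drains, which is the scenario the corner cases are designed to stress.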
Slide 42: Performance Comparison
- Network throughput: Corner Case 1, 64x64 BMIN
Slide 43: Performance Comparison
- Network throughput: Corner Case 2, 64x64 BMIN
Slide 44: Performance Comparison
- Network throughput: traces, 64x64 BMIN
  - Compression factor set to 20
  - Compression factor set to 40
Slide 45: Scalability Analysis
- SAQ utilization: Corner Case 1, 64x64 BMIN
  - Maximum SAQs used (egress)
  - Maximum SAQs used (ingress)
  - Total number of active SAQs
Slide 46: Scalability Analysis
- SAQ utilization: Corner Case 2, 64x64 BMIN
  - Maximum SAQs used (egress)
  - Maximum SAQs used (ingress)
  - Total number of active SAQs
Slide 47: Scalability Analysis
- SAQ utilization: traces, compression factor 20, 64x64 BMIN
  - Maximum SAQs used (egress)
  - Maximum SAQs used (ingress)
  - Total number of active SAQs
Slide 48: Scalability Analysis
- SAQ utilization: traces, compression factor 40, 64x64 BMIN
  - Maximum SAQs used (egress)
  - Maximum SAQs used (ingress)
  - Total number of active SAQs
Slide 49: Scalability Analysis
- Network throughput: Corner Case 2, 256x256 BMIN
Slide 50: Scalability Analysis
- Network throughput: Corner Case 2, 512x512 BMIN
Slide 51: Final Remarks
- We also designed a protocol to deallocate SAQs when they are no longer needed
- Many optimizations
  - CAM IDs to reduce control message size
  - CAM search done in parallel with packet reception
  - Merging of congestion trees
- Silicon area reduced with respect to switch-level VOQs
Slide 52: Conclusions
- We have proposed a scalable congestion management strategy for lossless networks
- We have shown that it requires only a small number of buffers for a wide range of network sizes
- We have modeled an existing ASI switch design, verifying that
  - It maintains network performance close to the ideal (but non-scalable) solution
  - Its silicon area requirements are smaller than those of the original design