Title: Multiprocessor Interconnection Networks Mahim Mishra CS 495 April 16, 2002
- Topics
- Network design issues
- Network Topology
- Performance
Networks
- How do we move data between processors?
- Design Options
- Topology
- Routing
- Physical implementation
Evaluation Criteria
- Latency
- Bisection Bandwidth
- Contention and hot-spot behavior
- Partitionability
- Cost and scalability
- Fault tolerance
Communication Performance: Latency
- Time(n)_{s-d} = overhead + routing delay + channel occupancy + contention delay
- occupancy = (n + n_e) / b
- Routing delay?
- Contention?
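The latency decomposition above is easy to make concrete. A minimal sketch in Python, where n is the payload size, n_e the envelope (header) size, and b the channel bandwidth; function and parameter names are my own:

```python
def channel_occupancy(n, n_e, b):
    """Time the message occupies a channel: (n + n_e) / b."""
    return (n + n_e) / b

def latency(overhead, routing_delay, occupancy, contention=0.0):
    """Total source-to-destination time as the sum of its components."""
    return overhead + routing_delay + occupancy + contention

# Example: 1024-byte payload, 16-byte envelope, 512 MB/s channel
occ = channel_occupancy(1024, 16, 512e6)                      # seconds
t = latency(overhead=1e-6, routing_delay=0.5e-6, occupancy=occ)
print(t)
```

The routing-delay and contention terms are the ones the following slides dig into; here they are just opaque inputs.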
Link Design/Engineering Space
- Cable of one or more wires/fibers with connectors at the ends, attached to switches or interfaces
- Synchronous: source & dest on same clock vs. Asynchronous: source encodes clock in signal
- Narrow: control, data and timing multiplexed on wire vs. Wide: control, data and timing on separate wires
- Short: single logical value at a time vs. Long: stream of logical values at a time
Buses
- Simple and cost-effective for small-scale multiprocessors
- Not scalable (limited bandwidth, electrical complications)
Crossbars
- Each port has a link to every other port
- Low latency and high throughput
- Cost grows as O(N^2), so not very scalable
- Difficult to arbitrate and to get all data lines into and out of a centralized crossbar
- Used in small-scale MPs (e.g., C.mmp) and as a building block for other networks (e.g., Omega)
Rings
- Cheap: cost is O(N)
- Point-to-point wires and pipelining can be used to make them very fast
- High overall bandwidth
- High latency: O(N)
- Examples: KSR machine, Hector
(Multidimensional) Meshes and Tori
(Figures: 2D grid, 3D cube)
- O(N) switches (but switches are more complicated)
- Latency O(kn) (where N = k^n)
- High bisection bandwidth
- Good fault tolerance
- Physical layout hard for multiple dimensions
Real-World 2D Mesh
- 1824-node Intel Paragon: 16 x 114 array
Hypercubes
- Also called binary n-cubes; number of nodes N = 2^n
- Latency is O(log N); out-degree of each PE is O(log N)
- Minimizes hops; good bisection BW, but tough to lay out in 3-space
- Popular in early message-passing computers (e.g., Intel iPSC, nCUBE)
- Used as a direct network => emphasizes locality
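E-cube routing on a hypercube falls directly out of the node numbering: XOR the current and destination addresses, then correct the differing bits in a fixed order. A toy sketch (function name is my own):

```python
def ecube_route(src, dst, n):
    """Return the sequence of nodes visited from src to dst in an
    n-cube, correcting differing address bits from LSB to MSB."""
    path = [src]
    cur = src
    for bit in range(n):
        mask = 1 << bit
        if (cur ^ dst) & mask:   # addresses differ in this dimension
            cur ^= mask          # traverse the link in dimension `bit`
            path.append(cur)
    return path

# In a 3-cube, 0b000 -> 0b101 flips bit 0, then bit 2:
print(ecube_route(0b000, 0b101, 3))   # [0, 1, 5]
```

Because bits are corrected in a fixed order, every route follows a monotonic dimension sequence, which is also why e-cube routing is deadlock-free.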
k-ary d-cubes
- Generalization of hypercubes (k nodes in every dimension)
- Total number of nodes N = k^d
- k > 2 reduces the number of channels at the bisection, thus allowing for wider channels but more hops
Embeddings in Two Dimensions
(Figures: 6 x 3 x 2 and 3 x 3 x 3 embeddings)
- Embed multiple logical dimensions in one physical dimension using long wires
Trees
- Cheap: cost is O(N)
- Latency is O(log N)
- Easy to lay out as planar graphs (e.g., H-Trees)
- For random permutations, the root can become a bottleneck
- To avoid the root being a bottleneck: Fat-Trees (used in the CM-5)
Multistage Logarithmic Networks
- Key idea: have multiple layers of switches between sources and destinations
- Cost is O(N log N); latency is O(log N); throughput is O(N)
- Generally indirect networks
- Many variations exist (Omega, Butterfly, Benes, ...)
- Used in many machines: BBN Butterfly, IBM RP3, ...
Omega Network
- All stages are the same, so can use a recirculating network
- Single path from source to destination
- Can add extra stages and pathways to minimize collisions and increase fault tolerance
- Can support combining; used in IBM RP3
Butterfly Network
- Equivalent to the Omega network; easy to see routing of messages
- Also very similar to hypercubes (direct vs. indirect, though)
- Clearly see that the bisection of the network is N/2 channels
- Can use higher-degree switches to reduce depth
Properties of Some Topologies

Topology       Degree     Diameter        Ave Dist      Bisection   D (D ave) @ P=1024
1D Array       2          N-1             N/3           1           huge
1D Ring        2          N/2             N/4           2           huge
2D Mesh        4          2(N^1/2 - 1)    2/3 N^1/2     N^1/2       63 (21)
2D Torus       4          N^1/2           1/2 N^1/2     2 N^1/2     32 (16)
k-ary d-cube   2d         dk/2            dk/4          dk/4        15 (7.5) @ d=3
Hypercube      n = log N  n               n/2           N/2         10 (5)
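Two of the table's rows are easy to sanity-check numerically by evaluating their formulas at P = 1024 (variable names are mine):

```python
import math

N = 1024

# 2D torus: diameter N^(1/2), average distance (1/2) N^(1/2)
torus_diam = int(math.isqrt(N))   # 32
torus_ave = torus_diam // 2       # 16

# Hypercube: n = log2 N, diameter n, average distance n/2
n = int(math.log2(N))
hcube_diam = n                    # 10
hcube_ave = n / 2                 # 5.0

print(torus_diam, torus_ave, hcube_diam, hcube_ave)   # 32 16 10 5.0
```

Both match the last column of the table.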
Real Machines
- In general, wide links => smaller routing delay
- Tremendous variation
How Many Dimensions?
- d = 2 or d = 3
- Short wires, easy to build
- Many hops, low bisection bandwidth
- Requires traffic locality
- d >= 4
- Harder to build, more wires, longer average length
- Fewer hops, better bisection bandwidth
- Can handle non-local traffic
- k-ary d-cubes provide a consistent framework for comparison
- N = k^d
- scale dimension (d) or nodes per dimension (k)
- assume cut-through
Scaling k-ary d-cubes
- What is equal cost?
- Equal number of nodes?
- Equal number of pins/wires?
- Equal bisection bandwidth?
- Equal area?
- Equal wire length?
- Each assumption leads to a different optimum
- Recall:
- switch degree: d; diameter: d(k-1)
- total links: Nd
- pins per node: 2wd
- bisection: k^(d-1) = N/k links in each direction
- 2Nw/k wires cross the middle
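These resource counts can be tabulated directly from k, d, and the channel width w. A sketch (function name is hypothetical), using the identity k^(d-1) = N/k for the per-direction bisection count:

```python
def kary_dcube(k, d, w):
    """Resource counts for a k-ary d-cube with channel width w wires:
    N = k^d nodes, Nd total links, 2wd pins per node, k^(d-1) links
    crossing the bisection in each direction, 2Nw/k wires in the middle."""
    N = k ** d
    return {
        "nodes": N,
        "diameter": d * (k - 1),
        "total_links": N * d,
        "pins_per_node": 2 * w * d,
        "bisection_links_each_dir": k ** (d - 1),
        "bisection_wires": 2 * N * w // k,
    }

# A binary 10-cube (hypercube) with 32-bit channels:
print(kary_dcube(k=2, d=10, w=32))
```

Plugging in different (k, d) pairs with N held fixed is exactly the comparison the next few slides carry out under the various equal-cost assumptions.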
Scaling Latency
- Assumes equal channel width
Average Distance with Equal Width
- Avg. distance = d(k-1)/2
- Assumes equal channel width
- but equal channel width is not equal cost!
- Higher dimension => more channels
Latency with Equal Width
Latency with Equal Pin Count
- Baseline: d=2 has w=32 (128 wires per node)
- fix 2dw pins => w(d) = 64/d
- distance goes up with d, but channel time goes down
Latency with Equal Bisection Width
- N-node hypercube has N bisection links
- 2D torus has 2N^(1/2)
- Fixed bisection => w(d) = N^(1/d) / 2 = k/2
- 1M nodes: d=2 has w=512!
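The equal-bisection channel width is easy to check numerically. A sketch assuming w(d) = k/2 with k = N^(1/d) (function name is mine):

```python
def width_equal_bisection(N, d):
    """Channel width under a fixed bisection budget: w(d) = N^(1/d) / 2 = k/2."""
    k = round(N ** (1 / d))   # nodes per dimension
    return k // 2

# 1M nodes, d = 2: k = 1024, so w = 512 wires per channel
print(width_equal_bisection(2**20, 2))    # 512
# Same node count as a binary 20-cube: k = 2, so w = 1
print(width_equal_bisection(2**20, 20))   # 1
```

This is the quantitative core of the low-dimension argument: at fixed bisection cost, low-dimensional networks get dramatically wider channels.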
Larger Routing Delay (with Equal Pin Count)
- Conclusions strongly influenced by assumptions about routing delay
Latency under Contention
- Optimal packet size? Channel utilization?
Saturation
- Fatter links shorten queueing delays
Data Transferred per Cycle
- Higher-degree networks have larger available bandwidth
- cost?
Advantages of Low-Dimensional Nets
- What can be built in VLSI is often wire-limited
- LDNs are easier to lay out:
- more uniform wiring density (easier to embed in 2-D or 3-D space)
- mostly local connections (e.g., grids)
- Compared with HDNs (e.g., hypercubes), LDNs have:
- shorter wires (reduces hop latency)
- fewer wires (increases bandwidth given constant bisection width)
- increased channel width is the major reason why LDNs win!
- LDNs have better hot-spot throughput
- more pins per node than HDNs
Routing
- Recall: the routing algorithm determines
- which of the possible paths are used as routes
- how the route is determined
- R: N x N -> C, which at each switch maps the destination node n_d to the next channel on the route
- Issues:
- Routing mechanism
- arithmetic
- source-based port select
- table-driven
- general computation
- Properties of the routes
- Deadlock free
Store-and-Forward vs. Cut-Through Routing
- h(n/b + D) vs. n/b + hD
- h = hops, n = packet length, b = bandwidth, D = routing delay at each switch
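The two expressions are easy to compare numerically. A sketch (parameter names follow the slide):

```python
def store_and_forward(h, n, b, D):
    """Each switch receives the whole packet before forwarding: h(n/b + D)."""
    return h * (n / b + D)

def cut_through(h, n, b, D):
    """Packet is pipelined; only the header pays the per-hop delay: n/b + hD."""
    return n / b + h * D

# 10 hops, 1024-bit packet, 1 bit/cycle, 2-cycle routing delay per switch
print(store_and_forward(10, 1024, 1, 2))   # 10260.0 cycles
print(cut_through(10, 1024, 1, 2))         # 1044.0 cycles
```

Cut-through pays the serialization time n/b once rather than at every hop, which is why it dominates whenever packets are long relative to the per-switch delay.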
Routing Mechanism
- need to select the output port for each input packet in a few cycles
- Reduce relative address of each dimension in order
- Dimension-order routing in k-ary d-cubes
- e-cube routing in n-cubes
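Dimension-order routing can be sketched directly: resolve the address offset one dimension at a time, in a fixed order. A toy Python illustration for a k-ary d-cube with wraparound (torus) links; the function name and tuple addressing are my own:

```python
def dimension_order_route(src, dst, k, d):
    """Hop-by-hop route in a k-ary d-cube (torus), resolving the offset
    in each dimension in order. Nodes are d-digit base-k tuples."""
    path = [tuple(src)]
    cur = list(src)
    for dim in range(d):
        # shortest signed offset around the ring in this dimension
        delta = (dst[dim] - cur[dim]) % k
        step = 1 if delta <= k // 2 else -1
        while cur[dim] != dst[dim]:
            cur[dim] = (cur[dim] + step) % k
            path.append(tuple(cur))
    return path

# 4-ary 2-cube: (0,0) -> (3,1) wraps -1 in dimension 0, then +1 in dimension 1
print(dimension_order_route((0, 0), (3, 1), k=4, d=2))
```

Note this only illustrates port selection; using the wraparound links deadlock-free in a torus additionally requires virtual channels or a similar restriction.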
Routing Mechanism (cont.)
(Figure: message routed through switches P0, P1, P2, P3)
- Source-based
- message header carries a series of port selects
- used and stripped en route
- CRC? Packet format?
- CS-2, Myrinet, MIT Arctic
- Table-driven
- message header carries an index for the next port at the next switch
- o = R[i]
- table also gives the index for the following hop
- o, i' = R[i]
- ATM, HIPPI
Properties of Routing Algorithms
- Deterministic
- route determined by (source, dest), not intermediate state (i.e., traffic)
- Adaptive
- route influenced by traffic along the way
- Minimal
- only selects shortest paths
- Deadlock-free
- no traffic pattern can lead to a situation where no packets move forward
Deadlock Freedom
- How can it arise?
- necessary conditions:
- shared resource
- incrementally allocated
- non-preemptible
- think of a channel as a shared resource that is acquired incrementally
- source buffer, then destination buffer
- channels along a route
- How do you avoid it?
- constrain how channel resources are allocated
- ex: dimension order
- How do you prove that a routing algorithm is deadlock-free?
Proof Technique
- Resources are logically associated with channels
- Messages introduce dependences between resources as they move forward
- Need to articulate the possible dependences between channels
- Show that there are no cycles in the Channel Dependence Graph
- find a numbering of channel resources such that every legal route follows a monotonic sequence
- => no traffic pattern can lead to deadlock
- The network need not be acyclic, only the channel dependence graph
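The proof technique above can be mechanized for a small example. A sketch (all names are my own) that builds the channel dependence graph induced by X-then-Y dimension-order routing on a small mesh, then checks it for cycles with a depth-first search:

```python
from itertools import product

def dor_route(src, dst):
    """X-then-Y dimension-order route on a mesh, returned as the list of
    directed channels (from_node, to_node) the message acquires in order."""
    route, (x, y) = [], src
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        route.append(((x, y), (nx, y)))
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        route.append(((x, y), (x, ny)))
        y = ny
    return route

def cdg_is_acyclic(k):
    """Build the channel dependence graph over all src->dst routes in a
    k x k mesh and check it for cycles with a colored DFS."""
    deps = {}
    for src in product(range(k), repeat=2):
        for dst in product(range(k), repeat=2):
            r = dor_route(src, dst)
            for c1, c2 in zip(r, r[1:]):   # c2 is acquired while holding c1
                deps.setdefault(c1, set()).add(c2)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}
    def dfs(c):
        color[c] = GRAY
        for nxt in deps.get(c, ()):
            if color.get(nxt, WHITE) == GRAY:
                return False               # back edge => cycle
            if color.get(nxt, WHITE) == WHITE and not dfs(nxt):
                return False
        color[c] = BLACK
        return True
    for c in list(deps):
        if color.get(c, WHITE) == WHITE and not dfs(c):
            return False
    return True

print(cdg_is_acyclic(3))   # True: X-then-Y routing on a mesh is deadlock-free
```

The acyclicity corresponds to a channel numbering (X channels before Y channels, each ordered along its direction of travel) that every legal route follows monotonically.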
Examples
- Why is the obvious routing on X deadlock-free?
- butterfly?
- tree?
- fat tree?
- Any assumptions about the routing mechanism? Amount of buffering?
- What about wormhole routing on a ring?
(Figure: 8-node ring, nodes numbered 0-7)
Flow Control
- What do you do when push comes to shove?
- Ethernet: collision detection and retry after delay
- FDDI, token ring: arbitration token
- TCP/WAN: buffer, drop, adjust rate
- any solution must adjust to the output rate
- Link-level flow control
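The slide leaves the link-level mechanism open; one common realization is credit-based flow control, where the sender holds one credit per free buffer slot at the receiver. A toy sketch of that idea, not any particular machine's scheme:

```python
class CreditLink:
    """Toy credit-based link-level flow control: the sender may only
    transmit while it holds credits, one per free receiver buffer slot."""
    def __init__(self, buffers):
        self.credits = buffers   # receiver buffer slots initially free
        self.queue = []          # flits buffered at the receiver

    def send(self, flit):
        if self.credits == 0:
            return False         # back-pressure: sender must stall
        self.credits -= 1
        self.queue.append(flit)
        return True

    def drain(self):
        """Receiver forwards a flit downstream and returns a credit."""
        if self.queue:
            self.queue.pop(0)
            self.credits += 1

link = CreditLink(buffers=2)
assert link.send("a") and link.send("b")
assert not link.send("c")   # buffers full: sender stalls
link.drain()                # downstream drains one flit
assert link.send("c")       # credit returned, transmission resumes
```

The key property is that the sender adjusts to the output rate automatically: it can never overrun the receiver, because credits only come back as buffers drain.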
Examples
- Short links
- Long links
- several messages on the wire
Link vs. Global Flow Control
- Hot spots
- Global communication operations
- Natural parallel program dependences
Case Study: Cray T3E
- 3-dimensional torus, with 1024 switches each connected to 2 processors
- Short, wide, synchronous links
- Dimension-order, cut-through, packet-switched routing
- Variable-sized packets, in multiples of 16 bits
Case Study: SGI Origin
- Hypercube-like topologies with up to 256 switches
- Each switch supports 4 processors and connects to 4 other switches
- Long, wide links
- Table-driven routing: programmable, allowing for flexible topologies and fault avoidance