Title: Multiprocessor Interconnection Networks Mahim Mishra CS 495 April 16, 2002
- Topics
- Network design issues
- Network Topology
- Performance
Networks
- How do we move data between processors?
- Design Options
- Topology
- Routing
- Physical implementation
Evaluation Criteria
- Latency
- Bisection Bandwidth
- Contention and hot-spot behavior
- Partitionability
- Cost and scalability
- Fault tolerance
Communication Performance: Latency
- Time(n)_{s-d} = overhead + routing delay + channel occupancy + contention delay
- occupancy = (n + n_e) / b
- Routing delay?
- Contention?
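The latency decomposition above is easy to make concrete. A minimal sketch in Python, where n is the payload size, n_e the envelope (header) size, and b the channel bandwidth; function and parameter names are my own:

```python
def channel_occupancy(n, n_e, b):
    """Time the message occupies a channel: (n + n_e) / b."""
    return (n + n_e) / b

def latency(overhead, routing_delay, occupancy, contention=0.0):
    """Total source-to-destination time as the sum of its components."""
    return overhead + routing_delay + occupancy + contention

# Example: 1024-byte payload, 16-byte envelope, 512 MB/s channel
occ = channel_occupancy(1024, 16, 512e6)                      # seconds
t = latency(overhead=1e-6, routing_delay=0.5e-6, occupancy=occ)
print(t)
```

The routing-delay and contention terms are the ones the following slides dig into; here they are just opaque inputs.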
Link Design/Engineering Space
- Cable of one or more wires/fibers with connectors at the ends, attached to switches or interfaces
- Synchronous: source & dest on same clock vs. Asynchronous: source encodes clock in signal
- Narrow: control, data and timing multiplexed on wire vs. Wide: control, data and timing on separate wires
- Short: single logical value at a time vs. Long: stream of logical values at a time
Buses
- Simple and cost-effective for small-scale multiprocessors
- Not scalable (limited bandwidth, electrical complications)
Crossbars
- Each port has a link to every other port
- Low latency and high throughput
- Cost grows as O(N^2), so not very scalable
- Difficult to arbitrate and to get all data lines into and out of a centralized crossbar
- Used in small-scale MPs (e.g., C.mmp) and as a building block for other networks (e.g., Omega)
Rings
- Cheap: cost is O(N)
- Point-to-point wires and pipelining can be used to make them very fast
- High overall bandwidth
- High latency: O(N)
- Examples: KSR machine, Hector
(Multidimensional) Meshes and Tori
(Figures: 2D grid, 3D cube)
- O(N) switches (but switches are more complicated)
- Latency O(kn) (where N = k^n)
- High bisection bandwidth
- Good fault tolerance
- Physical layout hard for multiple dimensions
Real-World 2D Mesh
- 1824-node Intel Paragon: 16 x 114 array
Hypercubes
- Also called binary n-cubes; number of nodes N = 2^n
- Latency is O(log N); out-degree of each PE is O(log N)
- Minimizes hops; good bisection BW, but tough to lay out in 3-space
- Popular in early message-passing computers (e.g., Intel iPSC, nCUBE)
- Used as a direct network => emphasizes locality
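E-cube routing on a hypercube falls directly out of the node numbering: XOR the current and destination addresses, then correct the differing bits in a fixed order. A toy sketch (function name is my own):

```python
def ecube_route(src, dst, n):
    """Return the sequence of nodes visited from src to dst in an
    n-cube, correcting differing address bits from LSB to MSB."""
    path = [src]
    cur = src
    for bit in range(n):
        mask = 1 << bit
        if (cur ^ dst) & mask:   # addresses differ in this dimension
            cur ^= mask          # traverse the link in dimension `bit`
            path.append(cur)
    return path

# In a 3-cube, 0b000 -> 0b101 flips bit 0, then bit 2:
print(ecube_route(0b000, 0b101, 3))   # [0, 1, 5]
```

Because bits are corrected in a fixed order, every route follows a monotonic dimension sequence, which is also why e-cube routing is deadlock-free.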
k-ary d-cubes
- Generalization of hypercubes (k nodes in every dimension)
- Total number of nodes N = k^d
- k > 2 reduces the number of channels at the bisection, thus allowing for wider channels but more hops
Embeddings in Two Dimensions
(Figures: 6 x 3 x 2 and 3 x 3 x 3 embeddings)
- Embed multiple logical dimensions in one physical dimension using long wires
Trees
- Cheap: cost is O(N)
- Latency is O(log N)
- Easy to lay out as planar graphs (e.g., H-Trees)
- For random permutations, the root can become a bottleneck
- To avoid the root being a bottleneck: Fat-Trees (used in the CM-5)
Multistage Logarithmic Networks
- Key idea: have multiple layers of switches between sources and destinations
- Cost is O(N log N); latency is O(log N); throughput is O(N)
- Generally indirect networks
- Many variations exist (Omega, Butterfly, Benes, ...)
- Used in many machines: BBN Butterfly, IBM RP3, ...
Omega Network
- All stages are the same, so can use a recirculating network
- Single path from source to destination
- Can add extra stages and pathways to minimize collisions and increase fault tolerance
- Can support combining; used in IBM RP3
Butterfly Network
- Equivalent to the Omega network; easy to see routing of messages
- Also very similar to hypercubes (direct vs. indirect, though)
- Clearly see that the bisection of the network is N/2 channels
- Can use higher-degree switches to reduce depth
Properties of Some Topologies

Topology       Degree     Diameter        Ave Dist      Bisection   D (D ave) @ P=1024
1D Array       2          N-1             N/3           1           huge
1D Ring        2          N/2             N/4           2           huge
2D Mesh        4          2(N^1/2 - 1)    2/3 N^1/2     N^1/2       63 (21)
2D Torus       4          N^1/2           1/2 N^1/2     2 N^1/2     32 (16)
k-ary d-cube   2d         dk/2            dk/4          dk/4        15 (7.5) @ d=3
Hypercube      n = log N  n               n/2           N/2         10 (5)
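Two of the table's rows are easy to sanity-check numerically by evaluating their formulas at P = 1024 (variable names are mine):

```python
import math

N = 1024

# 2D torus: diameter N^(1/2), average distance (1/2) N^(1/2)
torus_diam = int(math.isqrt(N))   # 32
torus_ave = torus_diam // 2       # 16

# Hypercube: n = log2 N, diameter n, average distance n/2
n = int(math.log2(N))
hcube_diam = n                    # 10
hcube_ave = n / 2                 # 5.0

print(torus_diam, torus_ave, hcube_diam, hcube_ave)   # 32 16 10 5.0
```

Both match the last column of the table.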
Real Machines
- In general, wide links => smaller routing delay
- Tremendous variation
How Many Dimensions?
- d = 2 or d = 3
- Short wires, easy to build
- Many hops, low bisection bandwidth
- Requires traffic locality
- d >= 4
- Harder to build, more wires, longer average length
- Fewer hops, better bisection bandwidth
- Can handle non-local traffic
- k-ary d-cubes provide a consistent framework for comparison
- N = k^d
- scale dimension (d) or nodes per dimension (k)
- assume cut-through
Scaling k-ary d-cubes
- What is equal cost?
- Equal number of nodes?
- Equal number of pins/wires?
- Equal bisection bandwidth?
- Equal area?
- Equal wire length?
- Each assumption leads to a different optimum
- Recall:
- switch degree: d; diameter: d(k-1)
- total links: Nd
- pins per node: 2wd
- bisection: k^(d-1) = N/k links in each direction
- 2Nw/k wires cross the middle
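These resource counts can be tabulated directly from k, d, and the channel width w. A sketch (function name is hypothetical), using the identity k^(d-1) = N/k for the per-direction bisection count:

```python
def kary_dcube(k, d, w):
    """Resource counts for a k-ary d-cube with channel width w wires:
    N = k^d nodes, Nd total links, 2wd pins per node, k^(d-1) links
    crossing the bisection in each direction, 2Nw/k wires in the middle."""
    N = k ** d
    return {
        "nodes": N,
        "diameter": d * (k - 1),
        "total_links": N * d,
        "pins_per_node": 2 * w * d,
        "bisection_links_each_dir": k ** (d - 1),
        "bisection_wires": 2 * N * w // k,
    }

# A binary 10-cube (hypercube) with 32-bit channels:
print(kary_dcube(k=2, d=10, w=32))
```

Plugging in different (k, d) pairs with N held fixed is exactly the comparison the next few slides carry out under the various equal-cost assumptions.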
Scaling Latency
- Assumes equal channel width
Average Distance with Equal Width
- Avg. distance = d(k-1)/2
- Assumes equal channel width
- but equal channel width is not equal cost!
- Higher dimension => more channels
Latency with Equal Width
Latency with Equal Pin Count
- Baseline: d=2 has w=32 (128 wires per node)
- fix 2dw pins => w(d) = 64/d
- distance goes up with d, but channel time goes down
Latency with Equal Bisection Width
- N-node hypercube has N bisection links
- 2D torus has 2N^(1/2)
- Fixed bisection => w(d) = N^(1/d) / 2 = k/2
- 1M nodes: d=2 has w=512!
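The equal-bisection channel width is easy to check numerically. A sketch assuming w(d) = k/2 with k = N^(1/d) (function name is mine):

```python
def width_equal_bisection(N, d):
    """Channel width under a fixed bisection budget: w(d) = N^(1/d) / 2 = k/2."""
    k = round(N ** (1 / d))   # nodes per dimension
    return k // 2

# 1M nodes, d = 2: k = 1024, so w = 512 wires per channel
print(width_equal_bisection(2**20, 2))    # 512
# Same node count as a binary 20-cube: k = 2, so w = 1
print(width_equal_bisection(2**20, 20))   # 1
```

This is the quantitative core of the low-dimension argument: at fixed bisection cost, low-dimensional networks get dramatically wider channels.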
Larger Routing Delay (with Equal Pin Count)
- Conclusions strongly influenced by assumptions about routing delay
Latency under Contention
- Optimal packet size? Channel utilization?
Saturation
- Fatter links shorten queueing delays
Data Transferred per Cycle
- Higher-degree networks have larger available bandwidth
- cost?
Advantages of Low-Dimensional Nets
- What can be built in VLSI is often wire-limited
- LDNs are easier to lay out:
- more uniform wiring density (easier to embed in 2-D or 3-D space)
- mostly local connections (e.g., grids)
- Compared with HDNs (e.g., hypercubes), LDNs have:
- shorter wires (reduces hop latency)
- fewer wires (increases bandwidth given constant bisection width)
- increased channel width is the major reason why LDNs win!
- LDNs have better hot-spot throughput
- more pins per node than HDNs
Routing
- Recall: the routing algorithm determines
- which of the possible paths are used as routes
- how the route is determined
- R: N x N -> C, which at each switch maps the destination node n_d to the next channel on the route
- Issues:
- Routing mechanism
- arithmetic
- source-based port select
- table-driven
- general computation
- Properties of the routes
- Deadlock free
Store-and-Forward vs. Cut-Through Routing
- h(n/b + D) vs. n/b + hD
- h = hops, n = packet length, b = bandwidth, D = routing delay at each switch
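The two expressions are easy to compare numerically. A sketch (parameter names follow the slide):

```python
def store_and_forward(h, n, b, D):
    """Each switch receives the whole packet before forwarding: h(n/b + D)."""
    return h * (n / b + D)

def cut_through(h, n, b, D):
    """Packet is pipelined; only the header pays the per-hop delay: n/b + hD."""
    return n / b + h * D

# 10 hops, 1024-bit packet, 1 bit/cycle, 2-cycle routing delay per switch
print(store_and_forward(10, 1024, 1, 2))   # 10260.0 cycles
print(cut_through(10, 1024, 1, 2))         # 1044.0 cycles
```

Cut-through pays the serialization time n/b once rather than at every hop, which is why it dominates whenever packets are long relative to the per-switch delay.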
Routing Mechanism
- need to select the output port for each input packet in a few cycles
- Reduce relative address of each dimension in order
- Dimension-order routing in k-ary d-cubes
- e-cube routing in n-cubes
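Dimension-order routing can be sketched directly: resolve the address offset one dimension at a time, in a fixed order. A toy Python illustration for a k-ary d-cube with wraparound (torus) links; the function name and tuple addressing are my own:

```python
def dimension_order_route(src, dst, k, d):
    """Hop-by-hop route in a k-ary d-cube (torus), resolving the offset
    in each dimension in order. Nodes are d-digit base-k tuples."""
    path = [tuple(src)]
    cur = list(src)
    for dim in range(d):
        # shortest signed offset around the ring in this dimension
        delta = (dst[dim] - cur[dim]) % k
        step = 1 if delta <= k // 2 else -1
        while cur[dim] != dst[dim]:
            cur[dim] = (cur[dim] + step) % k
            path.append(tuple(cur))
    return path

# 4-ary 2-cube: (0,0) -> (3,1) wraps -1 in dimension 0, then +1 in dimension 1
print(dimension_order_route((0, 0), (3, 1), k=4, d=2))
```

Note this only illustrates port selection; using the wraparound links deadlock-free in a torus additionally requires virtual channels or a similar restriction.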
Routing Mechanism (cont.)
(Figure: message routed through switches P0, P1, P2, P3)
- Source-based
- message header carries a series of port selects
- used and stripped en route
- CRC? Packet format?
- CS-2, Myrinet, MIT Arctic
- Table-driven
- message header carries an index for the next port at the next switch
- o = R[i]
- table also gives the index for the following hop
- o, i' = R[i]
- ATM, HIPPI
Properties of Routing Algorithms
- Deterministic
- route determined by (source, dest), not intermediate state (i.e., traffic)
- Adaptive
- route influenced by traffic along the way
- Minimal
- only selects shortest paths
- Deadlock-free
- no traffic pattern can lead to a situation where no packets move forward
Deadlock Freedom
- How can it arise?
- necessary conditions:
- shared resource
- incrementally allocated
- non-preemptible
- think of a channel as a shared resource that is acquired incrementally
- source buffer, then destination buffer
- channels along a route
- How do you avoid it?
- constrain how channel resources are allocated
- ex: dimension order
- How do you prove that a routing algorithm is deadlock-free?
Proof Technique
- Resources are logically associated with channels
- Messages introduce dependences between resources as they move forward
- Need to articulate the possible dependences between channels
- Show that there are no cycles in the Channel Dependence Graph
- find a numbering of channel resources such that every legal route follows a monotonic sequence
- => no traffic pattern can lead to deadlock
- The network need not be acyclic, only the channel dependence graph
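The proof technique above can be mechanized for a small example. A sketch (all names are my own) that builds the channel dependence graph induced by X-then-Y dimension-order routing on a small mesh, then checks it for cycles with a depth-first search:

```python
from itertools import product

def dor_route(src, dst):
    """X-then-Y dimension-order route on a mesh, returned as the list of
    directed channels (from_node, to_node) the message acquires in order."""
    route, (x, y) = [], src
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        route.append(((x, y), (nx, y)))
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        route.append(((x, y), (x, ny)))
        y = ny
    return route

def cdg_is_acyclic(k):
    """Build the channel dependence graph over all src->dst routes in a
    k x k mesh and check it for cycles with a colored DFS."""
    deps = {}
    for src in product(range(k), repeat=2):
        for dst in product(range(k), repeat=2):
            r = dor_route(src, dst)
            for c1, c2 in zip(r, r[1:]):   # c2 is acquired while holding c1
                deps.setdefault(c1, set()).add(c2)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}
    def dfs(c):
        color[c] = GRAY
        for nxt in deps.get(c, ()):
            if color.get(nxt, WHITE) == GRAY:
                return False               # back edge => cycle
            if color.get(nxt, WHITE) == WHITE and not dfs(nxt):
                return False
        color[c] = BLACK
        return True
    for c in list(deps):
        if color.get(c, WHITE) == WHITE and not dfs(c):
            return False
    return True

print(cdg_is_acyclic(3))   # True: X-then-Y routing on a mesh is deadlock-free
```

The acyclicity corresponds to a channel numbering (X channels before Y channels, each ordered along its direction of travel) that every legal route follows monotonically.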
Examples
- Why is the obvious routing on X deadlock-free?
- butterfly?
- tree?
- fat tree?
- Any assumptions about the routing mechanism? Amount of buffering?
- What about wormhole routing on a ring?
(Figure: 8-node ring, nodes numbered 0-7)
Flow Control
- What do you do when push comes to shove?
- Ethernet: collision detection and retry after delay
- FDDI, token ring: arbitration token
- TCP/WAN: buffer, drop, adjust rate
- any solution must adjust to the output rate
- Link-level flow control
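The slide leaves the link-level mechanism open; one common realization is credit-based flow control, where the sender holds one credit per free buffer slot at the receiver. A toy sketch of that idea, not any particular machine's scheme:

```python
class CreditLink:
    """Toy credit-based link-level flow control: the sender may only
    transmit while it holds credits, one per free receiver buffer slot."""
    def __init__(self, buffers):
        self.credits = buffers   # receiver buffer slots initially free
        self.queue = []          # flits buffered at the receiver

    def send(self, flit):
        if self.credits == 0:
            return False         # back-pressure: sender must stall
        self.credits -= 1
        self.queue.append(flit)
        return True

    def drain(self):
        """Receiver forwards a flit downstream and returns a credit."""
        if self.queue:
            self.queue.pop(0)
            self.credits += 1

link = CreditLink(buffers=2)
assert link.send("a") and link.send("b")
assert not link.send("c")   # buffers full: sender stalls
link.drain()                # downstream drains one flit
assert link.send("c")       # credit returned, transmission resumes
```

The key property is that the sender adjusts to the output rate automatically: it can never overrun the receiver, because credits only come back as buffers drain.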
Examples
- Short links
- Long links
- several messages on the wire
Link vs. Global Flow Control
- Hot spots
- Global communication operations
- Natural parallel program dependences
Case Study: Cray T3E
- 3-dimensional torus, with 1024 switches each connected to 2 processors
- Short, wide, synchronous links
- Dimension-order, cut-through, packet-switched routing
- Variable-sized packets, in multiples of 16 bits
Case Study: SGI Origin
- Hypercube-like topologies with up to 256 switches
- Each switch supports 4 processors and connects to 4 other switches
- Long, wide links
- Table-driven routing: programmable, allowing for flexible topologies and fault avoidance