Asynchronous Interconnection Network and Communication - PowerPoint PPT Presentation




1
Asynchronous Interconnection Network and
Communication
  • Chapter 3 of Casanova et al.

2
Interconnection Network Topologies
  • The processors in a distributed memory parallel
    system are connected using an interconnection
    network.
  • All computers have specialized coprocessors that
    route messages and place data in local memories
  • Nodes consist of a (computing) processor, a
    memory, and a communications coprocessor
  • Nodes are often called processors, when not
    ambiguous.

3
Network Topology Types
  • Static Topologies
  • A fixed network that cannot be changed
  • Nodes connected directly to each other by
    point-to-point communications links
  • Dynamic Topologies
  • Topology can change at runtime
  • One or more nodes can request direct
    communication be established between them.
  • Done using switches

4
Some Static Topologies
  • Fully connected network (or clique)
  • Ring
  • Two-Dimensional grid
  • Torus
  • Hypercube
  • Fat tree

5
Examples of Interconnection Topologies
6
Static Topologies Features
  • Fixed number of nodes
  • Degree
  • Nr of edges incident to a node
  • Distance between nodes
  • Length of shortest path between two nodes
  • Diameter
  • Largest distance between two nodes
  • Number of links
  • Total number of Edges
  • Bisection Width
  • Minimum nr. of edges that must be removed to
    partition the network into two disconnected
    networks of the same size.
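As a quick illustration of these features (my own sketch, not from the slides), the closed-form values for a ring and a hypercube can be tabulated in a few lines; the function names and parameter choices are assumptions:

```python
# Sketch: closed-form feature values for two static topologies.
# p is the number of nodes; the hypercube uses p = 2**n.

def ring_features(p):
    # every node has two neighbors; the farthest node is half-way around
    return {"degree": 2, "diameter": p // 2, "links": p,
            "bisection_width": 2}

def hypercube_features(n):
    p = 2 ** n
    # degree and diameter both equal the dimension n = log2(p)
    return {"degree": n, "diameter": n, "links": n * p // 2,
            "bisection_width": p // 2}

print(ring_features(8))       # e.g. diameter 4 for an 8-node ring
print(hypercube_features(3))  # a 3-cube has 12 links
```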

7
Classical Interconnection Networks Features
  • Clique (or Fully Connected)
  • All processors are connected
  • p(p-1)/2 edges
  • Ring
  • Very simple and very useful topology
  • 2D Grid
  • Degree of interior processors is 4
  • Not symmetric, as edge processors have different
    properties
  • Very useful when computations are local and
    communications are between neighbors
  • Has been heavily used previously

8
Classical Network
  • 2D Torus
  • Easily formed from 2D mesh by connecting matching
    end points.
  • Hypercube
  • Has been extensively used
  • Using recursive defn, can design simple but very
    efficient algorithms
  • Has a small diameter that is logarithmic in the
    nr of nodes
  • Degree and total number of edges grows too
    quickly to be useful with massively parallel
    machines.

9
(No Transcript)
10
Dynamic Topologies
  • The fat tree is different from the other networks
    included
  • The compute nodes are only at the leaves.
  • Nodes at higher level do not perform computation
  • Topology is a binary tree both in 2D front view
    and in side view.
  • Provides extra bandwidth near root.
  • Used by Thinking Machine Corp. on the CM-5
  • Crossbar Switch
  • Has p² switches, which is very expensive for
    large p
  • Can connect n processors to combination of n
    processors
  • Cost rises with the number of switches, which
    grows quadratically with the number of processors.

11
(No Transcript)
12
Dynamic Topologies (cont)
  • Benes Networks and Omega Networks
  • Use smaller size crossbars arranged in stages
  • Only crossbars in adjacent stages are connected
    together.
  • Called multi-stage networks; cheaper to build
    than a full crossbar.
  • Configuring multi-stage networks is more
    difficult than crossbar.
  • Dynamic networks are now the most commonly used
    topologies.

13
A Simple Communications Performance Model
  • Assume a processor Pi sends a message of length m
    to Pj.
  • Cost to transfer message along a network link is
    roughly linear in message length.
  • Results in cost to transfer message along a
    particular route to be roughly linear in m.
  • Let ci,j(m) denote the time to transfer this
    message.

14
Hockney Performance Model for Communications
  • The time ci,j(m) to transfer this message can be
    modeled by
  • ci,j(m) = Li,j + m/Bi,j = Li,j + m·bi,j
  • m is the size of the message
  • Li,j is the startup time, also called latency
  • Bi,j is the bandwidth, in bytes per second
  • bi,j = 1/Bi,j, the inverse of the bandwidth
  • Proposed by Hockney in 1994 to evaluate the
    performance of the Intel Paragon.
  • Probably the most commonly used model.
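The model is easy to evaluate numerically. A minimal sketch, using made-up values for L and B just to show the two regimes (latency-bound small messages, bandwidth-bound large ones):

```python
# Hockney model from the slide: c_ij(m) = L_ij + m / B_ij.

def hockney_time(m, L, B):
    """Startup latency plus transfer time for an m-byte message."""
    return L + m / B

L = 1e-5   # 10 microsecond startup (assumed value)
B = 1e9    # 1 GB/s bandwidth (assumed value)
print(hockney_time(100, L, B))        # small message: latency dominates
print(hockney_time(1_000_000, L, B))  # large message: bandwidth dominates
```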

15
Hockney Performance Model (cont.)
  • Factors that Li,j and Bi,j depend on
  • Length of route
  • Communication protocol used
  • Communications software overhead
  • Ability to use links in parallel
  • Whether links are half or full duplex
  • Etc.

16
Store and Forward Protocol
  • SF is a point-to-point protocol
  • Each intermediate node receives and stores the
    entire message before retransmitting it
  • Implemented in earliest parallel machines in
    which nodes did not have communications
    coprocessors.
  • Intermediate nodes are interrupted to handle
    messages and route them towards their destination.

17
Store and Forward Protocol (cont)
  • If d(i,j) is the number of links between Pi and
    Pj, the formula for ci,j(m) can be re-written as
  • ci,j(m) = d(i,j)(L + m/B) = d(i,j)L + d(i,j)·m·b
  • where
  • L is the initial latency and b is the reciprocal
    of the bandwidth for one link.
  • This protocol yields poor latency and bandwidth
  • The communication cost can be reduced using
    pipelining.

18
Store and Forward Protocol using Pipelining
  • The message is split into r packets of size m/r.
  • The packets are sent one after another from Pi to
    Pj.
  • The first packet reaches node j after ci,j(m/r)
    time units.
  • The remaining r-1 packets arrive in
  • (r-1)(L + m·b/r) time units
  • Simplifying, the total communication time reduces to
  • (d(i,j) - 1 + r)(L + m·b/r)
  • Casanova et al. derive the optimal value for r.
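The optimization mentioned above can be sketched numerically (parameter values are assumptions of mine): minimizing T(r) = (d - 1 + r)(L + m·b/r) by setting dT/dr = 0 gives r* = sqrt((d - 1)·m·b / L).

```python
import math

# Minimize the pipelined store-and-forward time
# T(r) = (d - 1 + r)(L + m*b/r); the zero of the derivative is
# r* = sqrt((d - 1) * m * b / L).

def sf_pipelined_time(r, d, L, m, b):
    return (d - 1 + r) * (L + m * b / r)

d, L, m, b = 5, 1e-4, 1e6, 1e-9   # hops, latency (s), bytes, s/byte (assumed)
r_opt = math.sqrt((d - 1) * m * b / L)
print(r_opt, sf_pipelined_time(r_opt, d, L, m, b))
```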

19
Two Cut-Through Protocols
  • Common performance model
  • ci,j(m) = L + d(i,j)·δ + m/B
  • where
  • L is the one-time cost of creating a message.
  • δ is the routing management overhead
  • Generally δ << L, as routing management is
    performed by hardware while L involves software
    overhead
  • m/B is the time required to transmit the message
    through the entire route

20
Circuit-Switching Protocol
  • First cut-through protocol
  • Route created before first message is sent
  • Message sent directly to destination through this
    route
  • The nodes along this route cannot be used for
    any other communication while this transmission
    is in progress

21
Wormhole (WH) Protocol
  • A second cut-through protocol
  • The destination address is stored in the header
    of the message.
  • Routing is performed dynamically at each node.
  • Message is split into small packets called flits
  • If two flits arrive at the same time, flits are
    stored in the intermediate nodes' internal
    registers

22
Point-to-Point Communication Comparisons
  • Store and Forward is not used in physical
    networks but only at the application level
  • Cut-through protocols are more efficient
  • Hide distance between nodes
  • Avoid large buffer requirement for intermediate
    nodes
  • Almost no message loss
  • For small networks, flow-control mechanism not
    needed
  • Wormhole generally preferred to circuit
    switching
  • Latency is normally much lower

23
(No Transcript)
24
LogP Model
  • Models based on the LogP model are more precise
    than the Hockney model
  • Involves the three components of a communication:
    the sender, the network, and the receiver
  • At times, some of these components may be busy
    while others are not.
  • Some parameters for LogP
  • m is the message size (in bytes)
  • w is the size of the packets the message is split
    into
  • L is an upper bound on the latency
  • o is the overhead
  • Defined to be the time that a node is engaged
    in the transmission or reception of a packet

25
LogP Model (cont)
  • Parameters for LogP (cont)
  • g or gap is the minimal time interval between
    consecutive packet transmission or packet
    reception
  • During this time, a node may not use the
    communication coprocessor (i.e., network card)
  • 1/g is the communication bandwidth available per
    node
  • P is the number of nodes in the platform
  • Cost of sending m bytes with packet size w
  • Processor occupational time on sender/receiver
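The cost formula itself is on a slide figure missing from this transcript; the expression below is the commonly cited LogP send time, stated here as an assumption rather than the slide's exact formula: m bytes travel as ceil(m/w) packets, costing 2·o + L + (ceil(m/w) − 1)·g end to end.

```python
import math

# Commonly cited LogP send time (an assumption, not taken from the
# missing slide figure): the first packet pays o + L + o, and each of
# the remaining ceil(m/w) - 1 packets is injected one gap g apart.

def logp_send_time(m, w, L, o, g):
    n_packets = math.ceil(m / w)
    return 2 * o + L + (n_packets - 1) * g

# illustrative (made-up) parameter values
print(logp_send_time(m=4096, w=1024, L=5e-6, o=1e-6, g=2e-6))
```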

26
(No Transcript)
27
Other LogP Related Models
  • LogP attempts to capture in a few parameters the
    characteristics of parallel platforms.
  • Platforms are fine-tuned and may use different
    protocols for short and long messages
  • LogGP is an extension of LogP where G captures
    the bandwidth for long messages
  • pLogP is an extension of LogP where L, o, and g
    depend on the message size m.
  • Also separates the sender overhead os and the
    receiver overhead or.

28
Affine Models
  • The use of the floor functions in LogP models
    causes them to be nonlinear.
  • Causes many problems in analytic theoretical
    studies.
  • Has resulted in proposal of many fully linear
    models
  • The time that Pi is busy sending a message is
    expressed as an affine function of the message
    size
  • An affine function of m has the form f(m) = a·m + b,
    where a and b are constants. If b = 0, then f is a
    linear function
  • Similarly, the time Pj is busy receiving the
    message is expressed as an affine function of the
    message size
  • We will postpone further coverage of the topic of
    affine models for the present

29
Modeling Concurrent Communications
  • Multi-port model
  • Assumes that communications are contention-free
    and do not interfere with each other.
  • A consequence is that a node may communicate with
    an unlimited number of nodes without any
    degradation in performance.
  • Would require a clique interconnection network to
    fully support.
  • May simplify proofs that certain problems are
    hard
  • If hard under ideal communications conditions,
    then hard in general.
  • Assumption not realistic - communication
    resources are always limited.
  • See Casanova text for additional information.

30
Concurrent Communications Models (2/5)
  • Bounded Multi-port model
  • Proposed by Hong and Prasanna
  • For applications that use threads (e.g., on
    multi-core processors), the network link can be
    shared by several incoming and outgoing
    communications.
  • The sum of the bandwidths allocated by the
    operating system to all communications cannot
    exceed the bandwidth of the network card.
  • An unbounded nr of communications can take place
    if they share the total available bandwidth.
  • The sharing determines the bandwidth allotted to
    each communication
  • Bandwidth sharing by the application is unusual,
    as sharing is usually handled by the operating
    system.

31
Concurrent Communications Models (3/5)
  • 1 port (unidirectional or half-duplex) model
  • Avoids unrealistically optimistic assumptions
  • Forbids concurrent communication at a node.
  • A node can either send data or receive it, but
    not simultaneously.
  • This model is very pessimistic, as real-world
    platforms can achieve some concurrent
    communication.
  • The model is simple, and it is easy to design
    algorithms that follow it.

32
Concurrent Communications Models (4/5)
  • 1 port (bidirectional or full-duplex) model
  • Currently, most network cards are full-duplex.
  • Allows a single emission and single reception
    simultaneously.
  • Introduced by Blat et al.
  • Current hardware does not easily enable multiple
    messages to be transmitted simultaneously.
  • Multiple sends and receives are claimed to be
    eventually serialized by the single hardware
    port.
  • Saif and Parashar did experimental work that
    suggests asynchronous sends become serialized
    when message sizes exceed a few megabytes.

33
Concurrent Communications Models (5/5)
  • k-ports model
  • A node may have k > 1 network cards
  • This model allows a node to be involved in a
    maximum of one emission and one reception on each
    network card.
  • This model is used in Chapters 4 5.

34
Bandwidth Sharing
  • The previous concurrent communication models only
    consider contention on nodes
  • Other parts of the network can also limit
    performance
  • It may be useful to determine constraints on each
    network link
  • This type of network model is useful for
    performance evaluation purposes, but too
    complicated for algorithm design purposes.
  • The Casanova text evaluates algorithms using two
    models
  • Hockney model or even simplified versions (e.g.
    assuming no latency)
  • Multi-port (ignoring contention) or the 1 port
    model.

35
Case Study Unidirectional Ring
  • We first consider the platform of p processors
    arranged in a unidirectional ring.
  • Processors are denoted Pk for k = 0, 1, …, p-1.
  • Each PE can find its logical index by calling
    My_Num().

36
Unidirectional Ring Basics
  • The processor can determine the number of PEs by
    calling NumProcs()
  • Both preceding commands are supported in MPI, a
    message-passing library implemented on most
    asynchronous systems.
  • Each processor has its own memory
  • All processors execute the same program, which
    acts on data in their local memories
  • Single Program, Multiple Data or SPMD
  • Processors communicate by message passing
  • explicitly sending and receiving messages.

37
Unidirectional Ring Basics (cont 2/5)
  • A processor sends a message using the function
  • send(addr,m)
  • addr is the memory address (in the sender
    process) of first data item to be sent.
  • m is the message length (i.e., nr of items to be
    sent)
  • A processor receives a message using function
  • receive(addr,m)
  • The addr is the local address in receiving
    processor where first data item is to be stored.
  • If processor Pi executes a receive, then its
    predecessor (P(i-1) mod p) must execute a send.
  • Since each processor has a unique predecessor and
    successor, they do not have to be specified

38
Unidirectional Ring Basics (cont 3/5)
  • A restrictive assumption is that both the send
    and the receive are blocking.
  • Then the participating processes can not continue
    until the communication is complete.
  • The blocking assumption is typical of 1st
    generation platforms
  • A classical assumption is to keep the receive
    blocking but to allow the send to be non-blocking
  • The processor executing a send can continue while
    the data transfer takes place.
  • To implement, one function is used to initiate
    the send and another function is used to
    determine when communication is finished.

39
Unidirectional Ring Basics (cont 4/5)
  • In algorithms, we simply indicate the blocking
    and non-blocking operations
  • A more recently proposed assumption is that a
    single processor can send data, receive data, and
    compute simultaneously.
  • All three can occur only if no race condition
    exists.
  • Convenient to think of three logical threads of
    control running on a processor
  • One for computing
  • One for sending data
  • One for receiving data
  • We will usually use the less restrictive third
    assumption

40
Unidirectional Ring Basics (cont 5/5)
  • Timings for Send/Receive
  • We use a simplified version of the Hockney model
  • The time to send or receive over one link is
  • c(m) = L + m·b
  • m is the length of the message
  • L is the startup cost in seconds due to the
    physical latency and the software overhead
  • b is the inverse of the data transfer rate.

41
The Broadcast Operation
  • The broadcast operation allows a processor Pk to
    send the same message of length m to all other
    processors.
  • At the beginning of the broadcast operation, the
    message is stored at the address addr in the
    memory of the sending process, Pk.
  • At the end of the broadcast, the message will be
    stored at address addr in the memory of all
    processors.
  • All processors must call the following function
  • Broadcast(k, addr, m)

42
Broadcast Algorithm Overview
  • The message will go around the ring from
    processor to processor: from Pk to Pk+1 to Pk+2
    … to Pk-1.
  • We will assume the processor numbers are modulo
    p, where p is the number of processors. For
    example, if k = 0 and p = 8, then k-1 = p-1 = 7.
  • Note there is no parallelism in this algorithm,
    since the message advances around ring only one
    processor per round.
  • The predecessor of Pk (i.e., Pk-1) does not send
    the message back to Pk.
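The algorithm itself appears on a slide figure not reproduced in this transcript. A hypothetical simulation of the behaviour described above, with Python lists standing in for send/receive, might look like:

```python
# Hypothetical sketch (the book's actual pseudocode is on a missing
# figure): each processor forwards the message to its successor
# (q+1) mod p, and the predecessor of the root k only receives.

def ring_broadcast(p, k, message):
    mem = [None] * p
    mem[k] = message               # root already holds the message
    q = k
    while (q + 1) % p != k:        # stop before wrapping back to the root
        mem[(q + 1) % p] = mem[q]  # "send" one hop along the ring
        q = (q + 1) % p
    return mem

print(ring_broadcast(p=6, k=4, message="hello"))  # all 6 slots filled
```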

43
(No Transcript)
44
Analysis of Broadcast Algorithm
  • For the algorithm to be correct, the receive in
    Step 10 must execute before the send in Step 11.
  • Running Time
  • Since we have a sequence of p-1 communications,
    the time to broadcast a message of length m is
  • (p-1)(L + m·b)
  • MPI does not typically use a ring topology for
    implementing communication primitives
  • Instead it uses various tree topologies that are
    more efficient on modern parallel computer
    platforms.
  • However, these primitives are simpler on a ring.
  • This prepares readers to implement primitives
    when that is more efficient than using the MPI
    primitives.

45
Scatter Algorithm
  • Scatter operation allows Pk to send a different
    message of length m to each processor.
  • Initially, Pk holds the message of length m to be
    sent to Pq at location addr[q].
  • To keep the array of addresses uniform, space for
    a message to Pk is also provided.
  • At the end of the algorithm, each processor
    stores its message from Pk at location msg.
  • The efficient way to implement this algorithm is
    to pipeline the messages.
  • The message to the most distant processor (i.e.,
    Pk-1) is followed by the message to processor
    Pk-2.

46
(No Transcript)
47
Discussion of Scatter Algorithm
  • In Steps 5-6, Pk successively sends messages to
    the other p-1 processors in the order of their
    distance from Pk.
  • In Step 7, Pk stores its message to itself.
  • The other processors concurrently move messages
    along as they arrive in steps 9-12.
  • Each processor uses two buffers with addresses
    tempS and tempR.
  • This allows processors to send a message and to
    receive the next message in parallel in Step 12.

48
Discussion of Scatter Algorithm (cont)
  • In Step 11, tempS ↔ tempR means the two addresses
    are swapped so the received value can be sent to
    the next processor.
  • When a processor receives its message from Pk,
    the processor stops forwarding (Step 10).
  • Whatever is in the receive buffer, tempR, at the
    end is stored as its message from Pk (Step 13).
  • The running time of the scatter algorithm is the
    same as for the broadcast, namely
  • (p-1)(L + m·b)

49
Example for Scatter Algorithm
  • Example: In Figure 3.7, let p = 6 and k = 4.
  • Steps 5-6: For i = 1 to p-1 do
  • send(addr[(k+p-i) mod p], m)
  • Let PE = (k+p-i) mod p = (10-i) mod 6
  • For i=1, PE = 9 mod 6 = 3
  • For i=2, PE = 8 mod 6 = 2
  • For i=3, PE = 7 mod 6 = 1
  • For i=4, PE = 6 mod 6 = 0
  • For i=5, PE = 5 mod 6 = 5
  • Note messages are sent to processors in the order
    3, 2, 1, 0, 5
  • That is, messages to most distant processors sent
    first.
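The order worked out above is easy to check mechanically (a quick script of my own, not from the text):

```python
# Reproduce the send order of Steps 5-6 for p = 6, k = 4:
# the root sends to processor (k + p - i) mod p for i = 1 .. p-1.

p, k = 6, 4
order = [(k + p - i) % p for i in range(1, p)]
print(order)  # most distant processor first, ending with the successor
```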

50
Example for Scatter Algorithm (cont)
  • Example: In Figure 3.7, let p = 6 and k = 4.
  • Step 10: For i = 1 to (k-1-q) mod p do
  • Compute (k-1-q) mod p = (3-q) mod 6 for all q.
  • Note q ≠ k, which is 4
  • q = 5 ⇒ i = 1 to 4, since (3-5) mod 6 = 4
  • PE 5 forwards values in the loop from i = 1 to 4
  • q = 0 ⇒ i = 1 to 3, since (3-0) mod 6 = 3
  • PE 0 forwards values from i = 1 to 3
  • q = 1 ⇒ i = 1 to 2, since (3-1) mod 6 = 2
  • PE 1 forwards values from i = 1 to 2
  • q = 2 ⇒ i = 1 to 1, since (3-2) mod 6 = 1
  • PE 2 is active in the loop when i = 1

51
Example for Scatter Algorithm (cont)
  • q = 3 ⇒ i = 1 to 0, since (3-3) mod 6 = 0
  • PE 3 precedes PE k, so it never forwards a value
  • However, it receives and stores a message in Step 9
  • Note that in step 9, all processors store the
    first message they receive.
  • That means even the processor k-1 receives a
    value to store.

52
All-to-All Algorithm
  • This command allows all p processors to
    simultaneously broadcast a message (to all PEs)
  • Again, it is assumed all messages have length m
  • At the beginning, each processor holds the
    message it wishes to broadcast at address
    my_message.
  • At the end, each processor will hold an array
    addr of p messages, where addr[k] holds the
    message from Pk.
  • Using pipelining, the running time is the same as
    for a single broadcast, namely
  • (p-1)(L + m·b)

53
(No Transcript)
54
Gossip Algorithm
  • Last of the classical collection of communication
    operations
  • Each processor sends a different message to every
    other processor.
  • Gossip algorithm is Problem 3.7 in textbook.
  • Note it will take 1 step for each PE to send a
    message to its closest neighbor using all links.
  • It will take 2 steps for each PE to send a
    message to its 2nd closest neighbor, using all
    links.
  • In general, it will take 1 + 2 + … + (p-1)
    steps for each PE to send messages to all other
    nodes, using all links of the network during each
    step.
  • Complexity is 1 + 2 + … + (p-1) = p(p-1)/2 =
    O(p²).
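The step count above is just an arithmetic series; a two-line check (mine, not from the text):

```python
# Sanity check of the gossip step count: 1 + 2 + ... + (p-1) = p(p-1)/2.

def gossip_steps(p):
    return sum(range(1, p))

for p in (6, 8):
    assert gossip_steps(p) == p * (p - 1) // 2
print(gossip_steps(8))  # 28 steps for p = 8
```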

55
Pipelined Broadcast by kth processor
  • Longer messages can be broadcast faster if they
    are broken into smaller pieces
  • Suppose they are broken into r pieces of the same
    length.
  • The sender sends the pieces out in order, and has
    them travel simultaneously on the ring.
  • Initially, the pieces are stored at addresses
    addr[0], addr[1], …, addr[r-1].
  • At the end, all pieces are stored in all
    processors.
  • At each step, when a processor receives a message
    piece, it also forwards the piece it previously
    received, if any, to its successor.


56
(No Transcript)
57
Pipelined Broadcast (cont)
  • There must be p-1 communication steps for the
    first piece to reach the last processor, Pk-1.
  • Then it takes r-1 more steps for the rest of the
    pieces to reach Pk-1.
  • The required time is (p + r - 2)(L + m·b/r)
  • The value of r that minimizes this expression can
    be found by setting its derivative (with respect
    to r) to zero and solving for r.
  • For large m, the time required tends to m·b
  • Does not depend on p.
  • Compares well to the broadcast time (p-1)(L + m·b)
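The large-m claim above can be sketched numerically (parameter values are assumptions): T(r) = (p + r - 2)(L + m·b/r), minimized near r* = sqrt((p - 2)·m·b / L), approaches m·b as m grows.

```python
import math

# Evaluate the pipelined broadcast time at the optimal piece count
# and compare it to the asymptotic value m*b.

def pipelined_bcast_time(r, p, L, m, b):
    return (p + r - 2) * (L + m * b / r)

p, L, b = 16, 1e-4, 1e-9   # assumed platform parameters
for m in (1e6, 1e8, 1e10):
    r = math.sqrt((p - 2) * m * b / L)
    print(m, pipelined_bcast_time(r, p, L, m, b) / (m * b))  # ratio -> 1
```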

58
Hypercube
  • Defn: A 0-cube consists of a single vertex. For
    n > 0, an n-cube consists of two identical
    (n-1)-cubes with edges added to join matching
    pairs of vertices in the two (n-1)-cubes.

59
Hypercubes (cont)
  • Equivalent defn: An n-cube is a graph consisting
    of 2^n vertices numbered 0 to 2^n - 1 such that
    two vertices are connected if and only if their
    binary representations differ in a single bit.
  • Property The diameter and degree of an n-cube
    are equal to n.
  • Proof is left for reader. Easy if use recursion.
  • Hamming Distance Let A and B be two points in an
    n-cube. H(A,B) is the number of bits that differ
    between the binary labels of A and B.
  • Notation: If b is a binary bit, then let b̄ = 1 - b,
    the bit complement of b.

60
Hypercube Paths
  • Using binary representation, let
  • A = a_{n-1} a_{n-2} … a_2 a_1 a_0
  • B = b_{n-1} b_{n-2} … b_2 b_1 b_0
  • WLOG, assume that A and B differ exactly in their
    last k bits.
  • Having the differing bits at the end makes
    numbering them easier
  • Then a path from A to B can be created by the
    following sequence of nodes
  • A = a_{n-1} a_{n-2} … a_2 a_1 a_0
  • Vertex 1 = a_{n-1} a_{n-2} … a_2 a_1 ā_0
  • Vertex 2 = a_{n-1} a_{n-2} … a_2 ā_1 ā_0
  • ........
  • B = Vertex k = a_{n-1} a_{n-2} … a_k ā_{k-1} … ā_2 ā_1 ā_0

61
Hypercube Paths (cont)
  • Independent of which bits of A and B agree, there
    are k choices for first bit to flip, (k-1)
    choices for next bit to flip, etc.
  • This gives k! different paths from A to B.
  • How many independent paths exist from A to B?
  • I.e., paths with only A and B as common vertices.
  • Theorem If A and B are n-cube vertices that
    differ in k bits, then there exist exactly k
    independent paths from A to B.
  • Proof First, we show k independent paths exist.
  • We build an independent path for each j with
    0 ≤ j < k

62
Hypercube Paths (cont)
  • Let P(j, j-1, j-2, …, 0, k-1, k-2, …, j+1)
    denote the path from A to B that complements the
    bits in that order, giving the following sequence
    of nodes
  • A = a_{n-1} … a_k a_{k-1} … a_{j+1} a_j a_{j-1} … a_1 a_0
  • V(1) = a_{n-1} … a_k a_{k-1} … a_{j+1} ā_j a_{j-1} … a_1 a_0
  • V(2) = a_{n-1} … a_k a_{k-1} … a_{j+1} ā_j ā_{j-1} … a_1 a_0
  • .........
  • V(j+1) = a_{n-1} … a_k a_{k-1} … a_{j+1} ā_j ā_{j-1} … ā_1 ā_0
  • V(j+2) = a_{n-1} … a_k ā_{k-1} a_{k-2} … a_{j+1} ā_j ā_{j-1} … ā_1 ā_0
  • .............
  • V(k-1) = a_{n-1} … a_k ā_{k-1} … ā_{j+2} a_{j+1} ā_j ā_{j-1} … ā_1 ā_0
  • B = V(k) = a_{n-1} … a_k ā_{k-1} … ā_{j+2} ā_{j+1} ā_j ā_{j-1} … ā_1 ā_0

63
Hypercube Pathways (cont)
  • Suppose the following two paths have a common
    vertex X other than A and B.
  • P(j, j-1, j-2, …, 0, k-1, k-2, …, j+1)
  • P(t, t-1, t-2, …, 0, k-1, k-2, …, t+1)
  • Since the paths are different, and A and B differ
    in k bits, we may assume 0 ≤ t < j < k
  • Let A and X differ in q bits
  • To travel from A to X along either path, exactly
    q bits in the circular sequence must have been
    flipped, in left-to-right order for each path
  • 1st path: j, j-1, …, t, …, 0, k-1, k-2, …, j+1
  • 2nd path: t, t-1, …, 0, k-1, k-2, …, t+1
  • This is impossible, as the q bits flipped along
    the two paths cannot be exactly the same bits.

64
Hypercube Paths (cont)
  • Finally, there can not be another independent
    path Q from A to B (i.e., without another common
    vertex)
  • If so, the first node in path Q following A would
    have flipped one bit, say bit q, to agree with B.
  • But then, the path described earlier that flipped
    bit q first would have a common interior vertex
    with this path Q.

65
Hypercube Routing
  • XOR is the exclusive OR: the output is 1 when
    exactly one input is 1.
  • To design a route from A to B in the n-cube, we
    use the algorithm that always flips the rightmost
    bit that disagrees with B.
  • The XOR of the binary representations of A and B
    indicates by 1s the bits that have to be
    flipped.
  • For the 5-cube, if A is 10111 and B is 01110, then
  • A XOR B is 11001, and the routing is as follows
  • A = 10111 → 10110 → 11110 → 01110 = B
  • This algorithm can be executed as follows
  • A XOR B = 10111 XOR 01110 = 11001, so A routes
    the message along link 1 (the rightmost digit's
    link) to node A1 = 10110
  • A1 XOR B = 10110 XOR 01110 = 11000, so node A1
    routes the message along link 4 to node A2 = 11110
  • A2 XOR B = 11110 XOR 01110 = 10000, so A2
    routes the message along link 5 to B
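The routing rule is simple to express in code; a sketch (the function name is mine) using Python's integer XOR:

```python
# Flip the rightmost differing bit until the destination is reached,
# exactly as in the 5-cube example above.

def hypercube_route(a, b):
    path = [a]
    while a != b:
        diff = a ^ b              # XOR marks the bits still to flip
        lowest = diff & -diff     # isolate the rightmost 1 of the XOR
        a ^= lowest               # traverse the corresponding link
        path.append(a)
    return path

route = hypercube_route(0b10111, 0b01110)
print([format(v, "05b") for v in route])
# 10111 -> 10110 -> 11110 -> 01110, matching the slide
```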

66
Hypercube Routing (cont)
  • This routing algorithm can be used to implement a
    wormhole or cut-through protocol in hardware.
  • Problem If another pair of processors have
    already reserved one link on the desired path,
    the message may stall until the end of the other
    communication.
  • Solution Since there are multiple paths, the
    routers select links to use based on a link
    reservation table and message labels.
  • In our example, if link 1 in Node A is busy, then
    instead use link 4 to forward the message to new
    node 11111 that is on a new path to B.
  • If at some point the current vertex determines
    that no useful link is available, then the
    current vertex will have to wait for a useful
    link to become available
  • Alternatively, if the desired links are not
    available at some vertex, the algorithm could use
    a link that extends the path length

67
Gray Code
  • Recursive construction of the Gray code
  • G1 = (0, 1) and has 2^1 = 2 elements
  • G2 = (00, 01, 11, 10) and has 2^2 = 4 elements
  • G3 = (000, 001, 011, 010, 110, 111, 101, 100) and
    has 2^3 = 8 elements
  • etc.
  • The Gray code for dimension n ≥ 1 is denoted Gn
    and is defined recursively: G1 = (0, 1) and, for
    n > 1, Gn is the sequence 0Gn-1 followed by the
    sequence 1Gn-1^rev, where
  • xG is the sequence obtained by prefixing every
    element of G with x
  • G^rev is the sequence obtained by listing the
    elements of G in reverse order
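The recursive definition above translates directly into code (a sketch; `gray_code` is my own name):

```python
# Gn = 0·G(n-1) followed by 1·reversed(G(n-1)), with G1 = (0, 1).

def gray_code(n):
    if n == 1:
        return ["0", "1"]
    prev = gray_code(n - 1)
    return ["0" + g for g in prev] + ["1" + g for g in reversed(prev)]

print(gray_code(3))
# ['000', '001', '011', '010', '110', '111', '101', '100']
```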

68
Gray Code (cont.)
  • Since Gn-1 has 2^(n-1) elements and Gn contains
    exactly two copies of Gn-1, Gn has 2^n elements.
  • Summary The Gray code Gn is an ordered sequence
    of all the 2^n binary codes with n digits whose
    successive values differ from each other in
    exactly one bit.
  • Notation Let gi(r) denote the ith element of the
    Gray code of dimension r.
  • Observation The Gray code Gn = (g1(n), g2(n), …,
    g_{2^n}(n)) forms an ordered sequence of names
    for all of the nodes in a 2^n ring.

69
Embeddings
  • Defn An embedding of a topology (e.g., ring, 2D
    mesh, etc) into an n-cube is a 1-1 function f
    from the vertices of the topology into the
    n-cube.
  • An embedding is said to preserve locality if the
    images of any two neighbors are also neighbors in
    the n-cube
  • If embeddings do not preserve locality, then we
    try to minimize the distance of neighbors in the
    hypercube.
  • An embedding is said to be onto if the embedding
    function f has its range to be the entire n-cube.

70
A 2^n Ring Embedding onto the n-cube
  • Theorem There is an embedding of a ring with 2^n
    vertices onto an n-cube that preserves locality.
  • Proof
  • Our construction of the Gray code provides an
    ordered sequence of binary values with n digits
    that can be used as names of the nodes of the
    ring with 2^n vertices
  • The first name (e.g., 000) can be assigned to
    any node in the ring.
  • The remaining Gray code names are then assigned
    to the ring nodes successively, in clockwise or
    counter-clockwise order

71
Embedding Ring onto n-cube (cont)
  • The Gray code binary numbers are identical to
    the names assigned to the hypercube nodes.
  • Two successive n-cube nodes are connected, since
    they differ in only one binary digit
  • So the embedding is a result of the Gray code
    providing an ordered sequence of n-cube names
    that is used to name the 2^n ring nodes
    successively.
  • This concludes the proof that this embedding
    follows from the construction of the Gray code.
  • However, the following formal proof provides
    more details and an after-the-construction
    argument.

72
A 2^n Ring Embedding onto the n-cube
  • The following optional formal proof is included
    for those who do not find the preceding argument
    convincing
  • Theorem There is an embedding of a ring with 2^n
    vertices onto an n-cube that preserves locality.
  • Proof
  • We establish the following claim: The mapping
    f(i) = gi(n) is an embedding of the ring onto the
    n-cube.
  • This claim is true for n = 1, as G1 = (0, 1) and
    nodes 0 and 1 are connected on both the ring and
    the 1-cube.
  • We assume the above claim is true for n-1, where
    n > 1.
  • We use the neighbor-preserving embedding of the
    vertices of a ring with 2^(n-1) nodes onto the
    (n-1)-cube to build a similar embedding of a
    ring with 2^n vertices onto the n-cube.

73
Ring Embedding for n-cube (cont)
  • Recall that the Gray code Gn is the sequence
    0Gn-1 followed by the sequence 1Gn-1^rev
  • The n-cube consists of one copy of an (n-1)-cube
    with a 0 prefix and a second copy of an
    (n-1)-cube with a 1 prefix.
  • By the assumption that the claim is true for n-1,
    the Gray code sequence 0Gn-1 provides the binary
    codes for a ring of elements in the first
    (n-1)-cube, with each successive element
    differing in one digit from its previous element.
  • Likewise, the Gray code sequence 1Gn-1^rev
    provides the binary codes for a ring of elements
    in the second copy of the (n-1)-cube, with each
    successive element differing in one digit from
    its previous element.
  • The last element of 0Gn-1 is identical to the
    first element of 1Gn-1^rev except for the prefix
    digit, so these two elements also differ in one
    bit.

74
2D Torus Embedding for n-cube
  • We embed a 2^r x 2^s torus onto an n-cube with n = r + s by using
    the Cartesian product Gr x Gs of two Gray codes.
  • A processor with coordinates (i, j) on the grid is mapped to the
    processor f(i, j) = (g_i(r), g_j(s)) in the n-cube.
  • Recall that the map f1(i) = g_i(r) is an embedding of a 2^r ring
    onto an r-cube (a row ring), and f2(j) = g_j(s) is an embedding of
    a 2^s ring onto an s-cube (a column ring).
  • We identify (g_i(r), g_j(s)) with the node of the (r+s)-cube whose
    first r bits are given by g_i(r) and whose next s bits are given by
    g_j(s).
  • Then, for a fixed j, the nodes f(i ± 1, j) are neighbors of
    f(i, j), since f1 is an embedding of a 2^r ring onto an r-cube.
  • Likewise, for a fixed i, the nodes f(i, j ± 1) are neighbors of
    f(i, j), since f2 is an embedding of a 2^s ring onto an s-cube.
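As a sketch (function names are illustrative, not from the text), the torus-to-hypercube map and its locality can be checked directly:

```python
def gray_code(n):
    """Reflected Gray code: G_n = 0*G_{n-1} followed by 1*reverse(G_{n-1})."""
    if n == 1:
        return ["0", "1"]
    prev = gray_code(n - 1)
    return ["0" + c for c in prev] + ["1" + c for c in reversed(prev)]

def torus_to_cube(i, j, r, s):
    """Map torus node (i, j) on a 2^r x 2^s torus to an (r+s)-cube label:
    first r bits from g_i(r), next s bits from g_j(s)."""
    return gray_code(r)[i] + gray_code(s)[j]

def hamming(a, b):
    """Number of bit positions in which two labels differ."""
    return sum(x != y for x, y in zip(a, b))

r, s = 2, 3
for i in range(2**r):
    for j in range(2**s):
        here = torus_to_cube(i, j, r, s)
        # Torus neighbours (with wrap-around) map to hypercube neighbours.
        assert hamming(here, torus_to_cube((i + 1) % 2**r, j, r, s)) == 1
        assert hamming(here, torus_to_cube(i, (j + 1) % 2**s, r, s)) == 1
```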

75
Collective Communications in Hypercube
  • Purpose: gain an overview of the complexity of collective
    communications on the hypercube.
  • We will focus on broadcast on the hypercube.
  • Assume processor 0 wants to broadcast, and consider the naïve
    algorithm:
  • Processor 0 sends the message to all of its neighbors.
  • Next, every neighbor sends the message to all of its neighbors.
  • Etc.
  • Redundancy in the naïve algorithm:
  • The same processor receives the same message many times.
  • E.g., processor 0 receives the message back from all of its
    neighbors.
  • Mismatched SENDs and RECEIVEs may happen.
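The redundancy can be made concrete with a small simulation of the naïve flood (a sketch under our own naming; the book does not give this code):

```python
from collections import defaultdict

def naive_broadcast_receives(n):
    """Naive flood on the n-cube: at each step, every node that received
    the message at the previous step sends it to all n of its neighbours.
    Returns a map from node to the number of copies it received."""
    received = defaultdict(int)
    frontier = {0}
    seen = {0}
    for _ in range(n):
        next_frontier = set()
        for p in frontier:
            for b in range(n):
                q = p ^ (1 << b)        # neighbour across dimension b
                received[q] += 1
                if q not in seen:
                    seen.add(q)
                    next_frontier.add(q)
        frontier = next_frontier
    return received

recv = naive_broadcast_receives(3)
# Processor 0 gets the message back from all three of its neighbours,
# and 21 messages are sent in total where 7 would suffice.
print(recv[0], sum(recv.values()))
```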

76
Improved Hypercube Broadcast
  • We seek a strategy where:
  • Each processor receives the message only once.
  • The number of steps is minimal.
  • We will use one or more spanning trees.
  • SEND and RECEIVE need a parameter specifying the dimension (link)
    on which the communication takes place:
  • SEND( cube_link, send_addr, m )
  • RECEIVE( cube_link, send_addr, m )

77
Hypercube Broadcast Algorithm
  • There are n steps, numbered from n-1 down to 0.
  • Every processor receives the message on the link corresponding to
    its rightmost 1.
  • A processor that has received the message forwards it on the links
    whose indices are smaller than the position of its rightmost 1.
  • At step i, every processor whose rightmost 1 is strictly larger
    than i forwards the message on link i.
  • Let the broadcast originate with processor 0.
  • Assume processor 0 has a fictitious 1 at position n.
  • This adds an additional digit, as each label otherwise has binary
    digits only at positions 0, 1, ..., n-1.

78
Trace of Hypercube Broadcast for n4
  • Since the broadcast originates with processor 0, we treat its
    index as 10000 (the fictitious 1 at position 4).
  • Processor 0 sends the broadcast message on link 3 to processor
    1000.
  • Next, both processors 0000 and 1000 have their rightmost 1 at
    position at least 3, so both send the message along link 2.
  • The message goes to 0100 and 1100, respectively.
  • This process continues until the last step, at which every
    even-numbered processor sends the message along its link 0.
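The trace above can be reproduced with a short simulation (a sketch with our own helper names, not code from the book):

```python
def rightmost_one(x, n):
    """Position of the lowest set bit; node 0 acts as if a fictitious
    bit n were set."""
    if x == 0:
        return n
    return (x & -x).bit_length() - 1

def hypercube_broadcast(n):
    """Spanning-tree broadcast from node 0 on the n-cube: at step i
    (i = n-1 down to 0), every holder whose rightmost 1 is strictly
    larger than i forwards the message on link i."""
    received = {0: 0}   # node -> number of copies received
    for i in range(n - 1, -1, -1):
        for p in list(received):
            if rightmost_one(p, n) > i:
                q = p ^ (1 << i)        # neighbour along link i
                received[q] = received.get(q, 0) + 1
    return received

recv = hypercube_broadcast(4)
assert len(recv) == 16                                  # all nodes reached
assert all(c == 1 for p, c in recv.items() if p != 0)   # each exactly once
```

After n = 4 steps every processor holds the message, and no processor receives it twice, unlike in the naïve flood.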

79
Broadcasting using Spanning Tree
80
Broadcast Algorithm
  • Let BIT(A, b) denote the value of the b-th bit of processor A's
    label.
  • The algorithm for the broadcast of a message of length m by
    processor k is given in Algorithm 3.5.
  • Since there are n steps, the execution time under the
    store-and-forward model is n(L + mb).
  • This algorithm is valid for the 1-port model:
  • At each step, a processor communicates with at most one other
    processor.

81
(No Transcript)
82
Broadcast Algorithm in Hypercube (cont)
  • Observations about the algorithm steps:
  • The algorithm specifies the action taken by processor q.
  • n is the number of binary digits used to label processors.
  • pos = q XOR k, where XOR is the bitwise exclusive OR.
  • The broadcast is from Pk.
  • The XOR with k relabels the processors so that the algorithm can
    proceed as if P0 were the root of the broadcast.
  • Steps 5-7 set first-1 to the position of the first (rightmost) 1
    in pos.
  • Note: Steps 8-10 are the core of the algorithm:
  • phase steps through the link dimensions, higher dimensions first.
  • When phase equals first-1, q receives the message on its first-1
    link, i.e., from its parent in the spanning tree.
  • q then sends the message along each dimension smaller than
    first-1 to the processors below it in the tree.
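Putting these observations together, the behavior for an arbitrary source Pk can be sketched as a simulation (our reconstruction of the idea; the names pos and first-1 follow the slide, but this is not the book's Algorithm 3.5 verbatim):

```python
def broadcast_from(k, n):
    """Broadcast from P_k on the n-cube. Each processor works on its
    virtual address pos = q XOR k, so the real root P_k plays the role
    of P_0 (XOR by k is an automorphism of the hypercube)."""
    holders = {k}
    deliveries = {k: 0}   # node -> number of copies received
    for i in range(n - 1, -1, -1):          # phases, higher links first
        for q in list(holders):
            pos = q ^ k                     # relabel: root k becomes 0
            # first_1 = position of the rightmost 1 in pos; the root
            # has a fictitious 1 at position n
            first_1 = n if pos == 0 else (pos & -pos).bit_length() - 1
            if first_1 > i:                 # forward on link i
                dest = q ^ (1 << i)         # real neighbour along link i
                deliveries[dest] = deliveries.get(dest, 0) + 1
                holders.add(dest)
    return deliveries

d = broadcast_from(5, 4)
assert len(d) == 16 and all(c == 1 for q, c in d.items() if q != 5)
```

Because the relabeling is an automorphism, the same n-step spanning tree works regardless of which processor is the source.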

83
(No Transcript)