Title: Asynchronous Interconnection Network and Communication
1. Asynchronous Interconnection Network and Communication
- Based on Chapter 3 of Casanova et al.
2. Interconnection Network Topologies
- The processors in a distributed-memory parallel system are connected by an interconnection network.
- All nodes have specialized coprocessors that route messages and place data in local memories.
- A node consists of a (computing) processor, a memory, and a communications coprocessor.
- Nodes are often simply called processors when this is not ambiguous.
3. Network Topology Types
- Static topologies
  - A fixed network that cannot be changed
  - Nodes are connected directly to each other by point-to-point communication links
- Dynamic topologies
  - The topology can change at runtime
  - One or more nodes can request that direct communication be established between them
  - Done using switches
4. Some Static Topologies
- Fully connected network (or clique)
- Ring
- Two-dimensional grid
- Torus
- Hypercube
- Fat tree
5. Examples of Interconnection Topologies
6. Static Topology Features
- Fixed number of nodes
- Degree
  - Number of edges incident to a node
- Distance between nodes
  - Length of the shortest path between two nodes
- Diameter
  - Largest distance between any two nodes
- Number of links
  - Total number of edges
- Bisection width
  - Minimum number of edges that must be removed to partition the network into two disconnected networks of the same size
7. Classical Interconnection Network Features
- Clique (or fully connected network)
  - Every pair of processors is directly connected
  - p(p-1)/2 edges
- Ring
  - Very simple and very useful topology
- 2D grid
  - Degree of interior processors is 4
  - Not symmetric, as edge processors have different properties
  - Very useful when computations are local and communications are between neighbors
  - Has been heavily used in the past
8. Classical Networks (cont.)
- 2D torus
  - Easily formed from a 2D grid by connecting matching end points
- Hypercube
  - Has been used extensively
  - Using its recursive definition, one can design simple but very efficient algorithms
  - Has a small diameter, logarithmic in the number of nodes
  - Degree and total number of edges grow too quickly for it to be practical in massively parallel machines
9. [Slide not transcribed]
10. Dynamic Topologies
- The fat tree differs from the other networks discussed
  - The compute nodes are only at the leaves
  - Nodes at higher levels do not perform computation
  - The topology is a binary tree both in the 2D front view and in the side view
  - Provides extra bandwidth near the root
  - Used by Thinking Machines Corp. on the CM-5
- Crossbar switch
  - Has p^2 switches, which is very expensive for large p
  - Can connect n processors to any combination of n processors
  - Cost rises with the number of switches, which is quadratic in the number of processors
11. [Slide not transcribed]
12. Dynamic Topologies (cont.)
- Beneš networks and Omega networks
  - Use smaller crossbars arranged in stages
  - Only crossbars in adjacent stages are connected together
  - Called multi-stage networks; cheaper to build than a full crossbar
  - Configuring a multi-stage network is more difficult than configuring a crossbar
- Dynamic networks are now the most commonly used topologies.
13. A Simple Communication Performance Model
- Assume a processor P_i sends a message of length m to P_j.
- The cost to transfer a message along a network link is roughly linear in the message length.
- As a result, the cost to transfer a message along a particular route is roughly linear in m.
- Let c_{i,j}(m) denote the time to transfer this message.
14. Hockney Performance Model for Communications
- The time c_{i,j}(m) to transfer this message can be modeled by
  - c_{i,j}(m) = L_{i,j} + m/B_{i,j} = L_{i,j} + m·b_{i,j}
  - m is the size of the message
  - L_{i,j} is the startup time, also called latency
  - B_{i,j} is the bandwidth, in bytes per second
  - b_{i,j} = 1/B_{i,j}, the inverse of the bandwidth
- Proposed by Hockney in 1994 to evaluate the performance of the Intel Paragon.
- Probably the most commonly used model.
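- To make the model concrete, here is a minimal Python sketch of the Hockney cost function; the parameter values are made up for illustration:

    # Hockney model: time to send m bytes from P_i to P_j.
    # L_ij is the startup latency (s); B_ij is the bandwidth (bytes/s).
    def hockney_time(m, L_ij, B_ij):
        return L_ij + m / B_ij

    # Illustrative values: 10 microsecond latency, 1 GB/s bandwidth.
    print(hockney_time(1_000_000, 10e-6, 1e9))   # ~0.00101 s for 1 MB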
15. Hockney Performance Model (cont.)
- Factors that L_{i,j} and B_{i,j} depend on:
  - Length of the route
  - Communication protocol used
  - Communication software overhead
  - Ability to use links in parallel
  - Whether links are half or full duplex
  - Etc.
16. Store and Forward Protocol
- Store and forward (SF) is a point-to-point protocol.
- Each intermediate node receives and stores the entire message before retransmitting it.
- Implemented in the earliest parallel machines, in which nodes did not have communications coprocessors.
- Intermediate nodes are interrupted to handle messages and route them toward their destinations.
17. Store and Forward Protocol (cont.)
- If d(i,j) is the number of links between P_i and P_j, the formula for c_{i,j}(m) can be rewritten as
  - c_{i,j}(m) = d(i,j)(L + m/B) = d(i,j)·L + d(i,j)·m·b
- where L is the latency of one link and b is the reciprocal of the bandwidth B of one link.
- This protocol yields poor latency and poor bandwidth.
- The communication cost can be reduced using pipelining.
18. Store and Forward Protocol with Pipelining
- The message is split into r packets of size m/r.
- The packets are sent one after another from P_i to P_j.
- The first packet reaches node P_j after c_{i,j}(m/r) time units.
- The remaining r-1 packets arrive in (r-1)(L + m·b/r) additional time units.
- Simplifying, the total communication time reduces to
  - (d(i,j) - 1 + r)(L + m·b/r)
- Casanova et al. derive the optimal value of r for the expression above.
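- A small sketch (with made-up parameter values) that evaluates the pipelined time for each r and finds the best choice numerically:

    # Pipelined store-and-forward: m bytes in r packets over d links,
    # with simplified Hockney parameters L (latency) and b (1/bandwidth).
    def sf_pipelined(m, d, L, b, r):
        return (d - 1 + r) * (L + m * b / r)

    m, d, L, b = 1_000_000, 8, 10e-6, 1e-9        # illustrative values
    r_best = min(range(1, 1001), key=lambda r: sf_pipelined(m, d, L, b, r))
    print(r_best, sf_pipelined(m, d, L, b, r_best))
    print(d * (L + m * b))                        # non-pipelined, for comparison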
19. Two Cut-Through Protocols
- Common performance model
  - c_{i,j}(m) = L + d(i,j)·δ + m/B
- where
  - L is the one-time cost of creating the message
  - δ is the routing management overhead per hop
  - Generally δ << L, as routing management is performed by hardware while L involves software overhead
  - m/B is the time required to transmit the message through the entire route
20. Circuit-Switching Protocol
- The first cut-through protocol
- The route is created before the first message is sent
- The message is sent directly to the destination through this route
- The nodes used in this transmission cannot be used for any other communication while it is in progress
21. Wormhole (WH) Protocol
- A second cut-through protocol
- The destination address is stored in the header of the message
- Routing is performed dynamically at each node
- The message is split into small packets called flits
- If two flits arrive at a node at the same time, flits are stored in the intermediate node's internal registers
22. Point-to-Point Communication Comparisons
- Store and forward is not used in physical networks, only at the application level
- Cut-through protocols are more efficient
  - Hide the distance between nodes
  - Avoid large buffer requirements at intermediate nodes
  - Almost no message loss, so for small networks a flow-control mechanism is not needed
- Wormhole is generally preferred to circuit switching
  - Its latency is normally much lower
23. [Slide not transcribed]
24. LogP Model
- Models based on the LogP model are more precise than the Hockney model.
- Communication involves three components: the sender, the network, and the receiver.
- At times, some of these components may be busy while others are not.
- Some parameters of LogP:
  - m is the message size (in bytes)
  - w is the size of the packets the message is split into
  - L is an upper bound on the latency
  - o is the overhead, defined as the time that a node is engaged in the transmission or reception of a packet
25. LogP Model (cont.)
- Parameters of LogP (cont.):
  - g, or gap, is the minimal time interval between consecutive packet transmissions or receptions
    - During this time, a node may not use the communication coprocessor (i.e., network card)
  - 1/g is the communication bandwidth available per node
  - P is the number of nodes in the platform
- The model gives the cost of sending m bytes with packet size w, and the processor occupation time on the sender and receiver.
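- The slide does not reproduce the formulas; under the standard LogP accounting, sending ceil(m/w) packets costs roughly 2o + L + (ceil(m/w) - 1)·max(o, g). A sketch under that assumption:

    import math

    # Standard LogP estimate (an assumption; the slide omits the formula):
    # total delivery time for m bytes sent as ceil(m/w) packets.
    def logp_time(m, w, L, o, g):
        k = math.ceil(m / w)                       # number of packets
        return 2 * o + L + (k - 1) * max(o, g)     # send o + gaps + wire + recv o

    # Occupation time of the sender (or receiver) processor.
    def logp_occupation(m, w, o, g):
        k = math.ceil(m / w)
        return o + (k - 1) * max(o, g)

    print(logp_time(4096, 512, L=5e-6, o=1e-6, g=2e-6))    # made-up parameters
    print(logp_occupation(4096, 512, o=1e-6, g=2e-6))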
26. [Slide not transcribed]
27. Other LogP-Related Models
- LogP attempts to capture the characteristics of parallel platforms in a few parameters.
- Platforms are fine-tuned and may use different protocols for short and long messages.
- LogGP is an extension of LogP in which G captures the bandwidth for long messages.
- pLogP is an extension of LogP in which L, o, and g depend on the message size m.
  - It also separates the sender overhead o_s from the receiver overhead o_r.
28. Affine Models
- The use of floor functions makes LogP-style models nonlinear.
  - This causes many problems in analytic/theoretical studies.
  - It has led to the proposal of many fully linear models.
- The time that P_i is busy sending a message is expressed as an affine function of the message size.
  - An affine function of m has the form f(m) = a·m + b, where a and b are constants. If b = 0, then f is a linear function.
- Similarly, the time P_j is busy receiving the message is expressed as an affine function of the message size.
- We postpone further coverage of affine models for the present.
29. Modeling Concurrent Communications
- Multi-port model
  - Assumes that communications are contention-free and do not interfere with each other.
  - A consequence is that a node may communicate with an unlimited number of nodes without any degradation in performance.
  - Fully supporting this would require a clique interconnection network.
  - May simplify proofs that certain problems are hard: if a problem is hard under ideal communication conditions, then it is hard in general.
  - The assumption is not realistic, as communication resources are always limited.
  - See the Casanova text for additional information.
30. Concurrent Communications Models (2/5)
- Bounded multi-port model
  - Proposed by Hong and Prasanna.
  - For applications that use threads (e.g., on multi-core technology), the network link can be shared by several incoming and outgoing communications.
  - The sum of the bandwidths allocated by the operating system to all communications cannot exceed the bandwidth of the network card.
  - An unbounded number of communications can take place if they share the total available bandwidth.
  - The application defines the bandwidth allotted to each communication.
  - Bandwidth sharing by the application is unusual, as it is usually handled by the operating system.
31. Concurrent Communications Models (3/5)
- 1-port (unidirectional or half-duplex) model
  - Avoids unrealistically optimistic assumptions.
  - Forbids concurrent communications at a node: a node can either send data or receive it, but not simultaneously.
  - This model is very pessimistic, as real-world platforms can achieve some communication concurrency.
  - The model is simple, and it is easy to design algorithms that follow it.
32. Concurrent Communications Models (4/5)
- 1-port (bidirectional or full-duplex) model
  - Currently, most network cards are full-duplex.
  - Allows a single emission and a single reception simultaneously.
  - Introduced by Bhat et al.
  - Current hardware does not easily enable multiple messages to be transmitted simultaneously.
  - Multiple sends and receives are claimed to be eventually serialized by the single hardware port to the network.
  - Saif and Parashar reported experimental work suggesting that asynchronous sends become serialized once message sizes exceed a few megabytes.
33. Concurrent Communications Models (5/5)
- k-ports model
  - A node may have k > 1 network cards.
  - This model allows a node to be involved in a maximum of one emission and one reception on each network card.
  - This model is used in Chapters 4 and 5.
34. Bandwidth Sharing
- The previous concurrent communication models only consider contention at the nodes.
- Other parts of the network can also limit performance.
- It may be useful to determine constraints on each network link.
- This type of network model is useful for performance evaluation purposes, but too complicated for algorithm design purposes.
- The Casanova text evaluates algorithms using two models:
  - The Hockney model, or even simplified versions (e.g., assuming no latency)
  - The multi-port model (ignoring contention) or the 1-port model
35. Case Study: Unidirectional Ring
- We first consider a platform of p processors arranged in a unidirectional ring.
- Processors are denoted P_k for k = 0, 1, ..., p-1.
- Each PE can find its logical index by calling My_Num().
36. Unidirectional Ring Basics
- A processor can determine the number of PEs by calling NumProcs().
- Both of the preceding functions are supported in MPI, a message-passing library implemented on most asynchronous systems.
- Each processor has its own memory.
- All processors execute the same program, which acts on data in their local memories.
  - Single Program, Multiple Data, or SPMD
- Processors communicate by message passing, explicitly sending and receiving messages.
37. Unidirectional Ring Basics (cont. 2/5)
- A processor sends a message using the function
  - send(addr, m)
  - addr is the memory address (in the sending processor) of the first data item to be sent
  - m is the message length (i.e., number of items to be sent)
- A processor receives a message using the function
  - receive(addr, m)
  - addr is the local address in the receiving processor where the first data item is to be stored
- If processor P_i executes a receive, then its predecessor P_{(i-1) mod p} must execute a send.
- Since each processor has a unique predecessor and successor, they do not have to be specified.
38. Unidirectional Ring Basics (cont. 3/5)
- A restrictive assumption is that both the send and the receive are blocking.
  - The participating processors cannot continue until the communication is complete.
  - The blocking assumption is typical of first-generation platforms.
- A classical assumption keeps the receive blocking but allows the send to be non-blocking.
  - The processor executing a send can continue while the data transfer takes place.
  - To implement this, one function initiates the send and another function determines when the communication has finished.
39. Unidirectional Ring Basics (cont. 4/5)
- In algorithms, we simply indicate which operations are blocking and which are non-blocking.
- A more recently proposed assumption is that a single processor can send data, receive data, and compute simultaneously.
  - All three can occur concurrently only if no race condition exists.
- It is convenient to think of three logical threads of control running on each processor:
  - One for computing
  - One for sending data
  - One for receiving data
- We will usually use this less restrictive third assumption.
40. Unidirectional Ring Basics (cont. 5/5)
- Timings for send/receive
- We use a simplified version of the Hockney model.
- The time to send or receive a message over one link is
  - c(m) = L + m·b
  - m is the length of the message
  - L is the startup cost in seconds, due to the physical latency and the software overhead
  - b is the inverse of the data transfer rate
41. The Broadcast Operation
- The broadcast operation allows a processor P_k to send the same message of length m to all other processors.
- At the beginning of the broadcast, the message is stored at address addr in the memory of the sending processor P_k.
- At the end of the broadcast, the message is stored at address addr in the memory of every processor.
- All processors must call the following function:
  - Broadcast(k, addr, m)
42. Broadcast Algorithm Overview
- The message goes around the ring from processor to processor: from P_k to P_{k+1} to P_{k+2} ... to P_{k-1}.
- We assume processor indices are taken modulo p, where p is the number of processors. For example, if k = 0 and p = 8, then k-1 = p-1 = 7.
- Note there is no parallelism in this algorithm, since the message advances around the ring by only one processor per step.
- The predecessor of P_k (i.e., P_{k-1}) does not send the message back to P_k.
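- A minimal SPMD-style sketch of this ring broadcast, simulated with Python threads; the queues play the role of the ring links, and put/get stand in for the book's send/receive:

    import queue, threading

    P = 8
    links = [queue.Queue() for _ in range(P)]   # links[q]: messages arriving at P_q
    out = [None] * P

    def broadcast(q, k, msg):
        # SPMD code for processor q; the broadcast originates at P_k.
        succ = (q + 1) % P
        if q == k:
            links[succ].put(msg)                # the root only sends
            out[q] = msg
            return
        m = links[q].get()                      # blocking receive from predecessor
        if succ != k:                           # P_{k-1} does not forward to P_k
            links[succ].put(m)
        out[q] = m

    threads = [threading.Thread(target=broadcast, args=(q, 0, "msg")) for q in range(P)]
    for t in threads: t.start()
    for t in threads: t.join()
    assert all(m == "msg" for m in out)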
43. [Slide not transcribed]
44. Analysis of Broadcast Algorithm
- For the algorithm to be correct, the receive in Step 10 must execute before the send in Step 11.
- Running time
  - Since we have a sequence of p-1 communications, the time to broadcast a message of length m is
  - (p-1)(L + m·b)
- MPI does not typically use the ring topology for its communication primitives.
  - It instead uses various tree topologies that are more efficient on modern parallel computer platforms.
- However, these primitives are simpler on a ring.
- This prepares readers to implement primitives themselves when that is more efficient than using MPI primitives.
45. Scatter Algorithm
- The scatter operation allows P_k to send a different message of length m to each processor.
- Initially, P_k holds the message of length m to be sent to P_q at address addr[q].
- To keep the array of addresses uniform, space for a message from P_k to itself is also provided.
- At the end of the algorithm, each processor stores its message from P_k at address msg.
- The efficient way to implement this algorithm is to pipeline the messages.
  - The message to the most distant processor (i.e., P_{k-1}) is sent first, followed by the message to processor P_{k-2}, and so on.
46. [Slide not transcribed]
47. Discussion of Scatter Algorithm
- In Steps 5-6, P_k successively sends messages to the other p-1 processors in order of decreasing distance from P_k.
- In Step 7, P_k stores its own message to itself.
- The other processors concurrently move messages along as they arrive (Steps 9-12).
- Each processor uses two buffers with addresses tempS and tempR.
  - This allows a processor to send one message and receive the next message in parallel in Step 12.
48. Discussion of Scatter Algorithm (cont.)
- In Step 11, tempS <-> tempR means the two addresses are swapped so the received value can be sent on to the next processor.
- When a processor receives its own message from P_k, it stops forwarding (Step 10).
- Whatever is in the receive buffer tempR at the end is stored as the processor's message from P_k (Step 13).
- The running time of the scatter algorithm is the same as for the broadcast, namely
  - (p-1)(L + m·b)
49. Example for Scatter Algorithm
- Example: In Figure 3.7, let p = 6 and k = 4.
- Steps 5-6: for i = 1 to p-1 do
  - send(addr[(k+p-i) mod p], m)
- Let PE = (k+p-i) mod p = (10-i) mod 6
  - For i = 1, PE = 9 mod 6 = 3
  - For i = 2, PE = 8 mod 6 = 2
  - For i = 3, PE = 7 mod 6 = 1
  - For i = 4, PE = 6 mod 6 = 0
  - For i = 5, PE = 5 mod 6 = 5
- Note messages are sent to processors in the order 3, 2, 1, 0, 5.
  - That is, messages to the most distant processors are sent first.
50. Example for Scatter Algorithm (cont.)
- Example: In Figure 3.7, let p = 6 and k = 4.
- Step 10: for i = 1 to (k-1-q) mod p do
- Compute (k-1-q) mod p = (3-q) mod 6 for all q.
  - Note q ≠ k, which is 4.
- q = 5: i = 1 to 4, since (3-5) mod 6 = 4
  - PE 5 forwards values in the loop from i = 1 to 4
- q = 0: i = 1 to 3, since (3-0) mod 6 = 3
  - PE 0 forwards values from i = 1 to 3
- q = 1: i = 1 to 2, since (3-1) mod 6 = 2
  - PE 1 forwards values from i = 1 to 2
- q = 2: i = 1 to 1, since (3-2) mod 6 = 1
  - PE 2 is active in the loop only when i = 1
51. Example for Scatter Algorithm (cont.)
- q = 3: i = 1 to 0, since (3-3) mod 6 = 0
  - PE 3 immediately precedes PE k, so it never forwards a value
  - However, it receives and stores a message in Step 9
- Note that in Step 9, all processors store the first message they receive.
  - That means even processor k-1 receives a value to store.
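- The index arithmetic of this example can be checked in a few lines:

    p, k = 6, 4
    send_order = [(k + p - i) % p for i in range(1, p)]
    print(send_order)                                   # [3, 2, 1, 0, 5]
    forwards = {q: (k - 1 - q) % p for q in range(p) if q != k}
    print(forwards)                                     # {0: 3, 1: 2, 2: 1, 3: 0, 5: 4}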
52. All-to-All Algorithm
- This operation allows all p processors to broadcast a message simultaneously (to all PEs).
- Again, it is assumed all messages have length m.
- At the beginning, each processor holds the message it wishes to broadcast at address my_message.
- At the end, each processor holds an array addr of p messages, where addr[k] holds the message from P_k.
- Using pipelining, the running time is the same as for a single broadcast, namely
  - (p-1)(L + m·b)
53. [Slide not transcribed]
54. Gossip Algorithm
- The last of the classical collective communication operations.
- Each processor sends a different message to each other processor.
- The gossip algorithm is Problem 3.7 in the textbook.
- Note it takes 1 step for each PE to send a message to its closest neighbor, using all links.
- It takes 2 steps for each PE to send a message to its 2nd closest neighbor, using all links.
- In general, it takes 1 + 2 + ... + (p-1) steps for each PE to send messages to all other nodes, using all links of the network during each step.
- The complexity is therefore 1 + 2 + ... + (p-1) = p(p-1)/2 = O(p^2).
55. Pipelined Broadcast by the kth Processor
- Longer messages can be broadcast faster if they are broken into smaller pieces.
- Suppose the message is broken into r pieces of the same length.
- The sender sends the pieces out in order, and they travel simultaneously around the ring.
- Initially, the pieces are stored at addresses addr[0], addr[1], ..., addr[r-1].
- At the end, all pieces are stored in all processors.
- At each step, when a processor receives a message piece, it also forwards the piece it previously received, if any, to its successor.
56. [Slide not transcribed]
57. Pipelined Broadcast (cont.)
- It takes p-1 communication steps for the first piece to reach the last processor, P_{k-1}.
- It then takes r-1 more steps for the remaining pieces to reach P_{k-1}.
- The required time is (p - 2 + r)(L + m·b/r).
- The value of r that minimizes this expression can be found by setting its derivative (with respect to r) to zero and solving for r.
- For large m, the required time tends to m·b.
  - It does not depend on p.
- This compares well to the plain broadcast time (p-1)(L + m·b).
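- Setting the derivative of (p - 2 + r)(L + m·b/r) to zero gives r* = sqrt(m·b·(p-2)/L); a quick numeric check under made-up parameters:

    import math

    def t_pipe(m, p, L, b, r):
        # pipelined ring broadcast time with r equal pieces
        return (p - 2 + r) * (L + m * b / r)

    m, p, L, b = 1_000_000, 16, 10e-6, 1e-9       # illustrative values
    r_star = math.sqrt((p - 2) * m * b / L)       # zero of the derivative
    r_best = min(range(1, 2001), key=lambda r: t_pipe(m, p, L, b, r))
    print(round(r_star), r_best)                  # the two agree closely
    print(t_pipe(m, p, L, b, r_best), (p - 1) * (L + m * b))  # vs plain broadcast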
58. Hypercube
- Defn: A 0-cube consists of a single vertex. For n > 0, an n-cube consists of two identical (n-1)-cubes with edges added to join matching pairs of vertices in the two (n-1)-cubes.
59. Hypercubes (cont.)
- Equivalent defn: An n-cube is a graph consisting of 2^n vertices numbered 0 to 2^n - 1 such that two vertices are connected if and only if their binary representations differ in exactly one bit.
- Property: The diameter and degree of an n-cube are both equal to n.
  - The proof is left to the reader; it is easy using recursion.
- Hamming distance: Let A and B be two vertices of an n-cube. H(A,B) is the number of bits that differ between the binary labels of A and B.
- Notation: If b is a binary bit, then let ~b = 1-b denote the complement of b.
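- In code, the Hamming distance between two node labels is just the population count of their XOR; a one-line sketch:

    def hamming(a, b):
        # number of bit positions in which labels a and b differ
        return bin(a ^ b).count("1")

    assert hamming(0b10111, 0b01110) == 3   # the 5-cube example used below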
60. Hypercube Paths
- Using binary representation, let
  - A = a_{n-1} a_{n-2} ... a_2 a_1 a_0
  - B = b_{n-1} b_{n-2} ... b_2 b_1 b_0
- WLOG, assume that A and B differ exactly in their last k bits.
  - Having the differing bits at the end makes numbering them easier.
- Then a path from A to B can be created by the following sequence of nodes, flipping one bit at a time from the right:
  - A = a_{n-1} a_{n-2} ... a_2 a_1 a_0
  - Vertex 1 = a_{n-1} a_{n-2} ... a_2 a_1 ~a_0
  - Vertex 2 = a_{n-1} a_{n-2} ... a_2 ~a_1 ~a_0
  - ...
  - B = Vertex k = a_{n-1} a_{n-2} ... a_k ~a_{k-1} ... ~a_2 ~a_1 ~a_0
61. Hypercube Paths (cont.)
- Regardless of which bits of A and B differ, there are k choices for the first bit to flip, k-1 choices for the next bit to flip, etc.
  - This gives k! different paths from A to B.
- How many independent paths exist from A to B?
  - I.e., paths with only A and B as common vertices.
- Theorem: If A and B are n-cube vertices that differ in k bits, then there exist exactly k independent paths from A to B.
- Proof: First, we show k independent paths exist.
  - We build an independent path for each j with 0 ≤ j < k.
62. Hypercube Paths (cont.)
- Let P(j, j-1, ..., 0, k-1, k-2, ..., j+1) denote the path from A to B that flips the differing bits in that order, giving the following sequence of nodes:
  - A = a_{n-1} ... a_k a_{k-1} ... a_{j+1} a_j a_{j-1} ... a_1 a_0
  - V(1) = a_{n-1} ... a_k a_{k-1} ... a_{j+1} ~a_j a_{j-1} ... a_1 a_0
  - V(2) = a_{n-1} ... a_k a_{k-1} ... a_{j+1} ~a_j ~a_{j-1} a_{j-2} ... a_1 a_0
  - ...
  - V(j+1) = a_{n-1} ... a_k a_{k-1} ... a_{j+1} ~a_j ~a_{j-1} ... ~a_1 ~a_0
  - V(j+2) = a_{n-1} ... a_k ~a_{k-1} a_{k-2} ... a_{j+1} ~a_j ~a_{j-1} ... ~a_1 ~a_0
  - ...
  - V(k-1) = a_{n-1} ... a_k ~a_{k-1} ... ~a_{j+2} a_{j+1} ~a_j ~a_{j-1} ... ~a_1 ~a_0
  - B = V(k) = a_{n-1} ... a_k ~a_{k-1} ... ~a_{j+2} ~a_{j+1} ~a_j ~a_{j-1} ... ~a_1 ~a_0
63. Hypercube Paths (cont.)
- Suppose the following two paths have a common vertex X other than A and B:
  - P(j, j-1, ..., 0, k-1, k-2, ..., j+1)
  - P(t, t-1, ..., 0, k-1, k-2, ..., t+1)
- Since the paths are different, and A and B differ in k bits, we may assume 0 ≤ t < j < k.
- Let A and X differ in q bits.
- To travel from A to X along either path, exactly q bits of the circular sequence have been flipped, in left-to-right order for each path:
  - 1st path: j, j-1, ..., t, t-1, ..., 0, k-1, k-2, ..., j+1
  - 2nd path: t, t-1, ..., 0, k-1, k-2, ..., j, j-1, ..., t+1
- This is impossible, as the first q bits flipped along the two paths cannot be exactly the same set of bits.
64. Hypercube Paths (cont.)
- Finally, there cannot be another independent path Q from A to B (i.e., one sharing no interior vertex with the k paths above).
- If there were, the first node of Q after A would have to flip one bit, say bit q, toward agreement with B.
- But then the path described earlier that flips bit q first would have a common interior vertex with path Q.
65. Hypercube Routing
- XOR denotes exclusive OR: a result bit is 1 exactly when the corresponding input bits differ.
- To design a route from A to B in the n-cube, we use the algorithm that always flips the rightmost bit that disagrees with B.
- The XOR of the binary representations of A and B marks with 1s the bits that have to be flipped.
- For the 5-cube, if A = 10111 and B = 01110, then A XOR B = 11001, and the routing is as follows:
  - A = 10111 -> 10110 -> 11110 -> 01110 = B
- This algorithm can be executed as follows:
  - A XOR B = 10111 XOR 01110 = 11001, so A routes the message along link 1 (the low-order bit's link) to node A1 = 10110
  - A1 XOR B = 10110 XOR 01110 = 11000, so node A1 routes the message along link 4 to node A2 = 11110
  - A2 XOR B = 11110 XOR 01110 = 10000, so A2 routes the message along link 5 to B
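- A sketch of this rightmost-differing-bit rule; note that "link i" on the slide is 1-indexed, i.e., link i flips bit i-1:

    def route(a, b):
        # Repeatedly flip the rightmost bit in which the current
        # node disagrees with the destination b.
        path = [a]
        while a != b:
            diff = a ^ b                  # 1s mark the bits still to flip
            a ^= diff & -diff             # flip the rightmost such bit
            path.append(a)
        return path

    print([format(v, "05b") for v in route(0b10111, 0b01110)])
    # ['10111', '10110', '11110', '01110']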
66. Hypercube Routing (cont.)
- This routing algorithm can be used to implement a wormhole or cut-through protocol in hardware.
- Problem: If another pair of processors has already reserved a link on the desired path, the message may stall until the end of the other communication.
- Solution: Since there are multiple paths, the routers select which links to use based on a link reservation table and message labels.
- In our example, if link 1 at node A is busy, node A can instead use link 4 to forward the message to node 11111, which is on a different shortest path to B.
- If at some point the current vertex determines that no useful link is available, it must wait for a useful link to become available.
- Alternatively, if the desired links are not available at some vertex, the algorithm could use a link that extends the path length.
67. Gray Code
- Recursive construction of the Gray code:
  - G1 = (0, 1) and has 2^1 = 2 elements
  - G2 = (00, 01, 11, 10) and has 2^2 = 4 elements
  - G3 = (000, 001, 011, 010, 110, 111, 101, 100) and has 2^3 = 8 elements
  - etc.
- The Gray code of dimension n ≥ 1 is denoted Gn and is defined recursively: G1 = (0, 1) and, for n > 1, Gn is the sequence 0Gn-1 followed by the sequence 1Gn-1^rev, where
  - xG is the sequence obtained by prefixing every element of G with x
  - G^rev is the sequence obtained by listing the elements of G in reverse order
68. Gray Code (cont.)
- Since Gn-1 has 2^{n-1} elements and Gn consists of exactly two copies of Gn-1, Gn has 2^n elements.
- Summary: The Gray code Gn is an ordered sequence of all 2^n binary codes of n digits whose successive values differ from each other in exactly one bit.
- Notation: Let g_i(r) denote the ith element of the Gray code of dimension r.
- Observation: The Gray code Gn = (g_1(n), g_2(n), ..., g_{2^n}(n)) forms an ordered sequence of names for all of the nodes of a 2^n-node ring.
69. Embeddings
- Defn: An embedding of a topology (e.g., ring, 2D mesh, etc.) into an n-cube is a one-to-one function f from the vertices of the topology into the vertices of the n-cube.
- An embedding is said to preserve locality if the images of any two neighbors are also neighbors in the n-cube.
- If an embedding does not preserve locality, we try to minimize the distance in the hypercube between images of neighbors.
- An embedding is said to be onto if the range of the embedding function f is the entire n-cube.
70. A 2^n-Node Ring Embedding onto the n-Cube
- Theorem: There is an embedding of a ring with 2^n vertices onto an n-cube that preserves locality.
- Proof:
  - Our construction of the Gray code provides an ordered sequence of n-digit binary values that can be used as names for the nodes of the ring with 2^n vertices.
  - The first name (e.g., 000) can be assigned to any node of the ring.
  - The Gray code names are then given to the ring nodes successively, in clockwise or counter-clockwise order.
71. Embedding the Ring onto the n-Cube (cont.)
- The Gray code binary numbers are identical to the names assigned to the hypercube nodes.
- Two successive names denote connected n-cube nodes, since they differ in only one binary digit.
- So the embedding results from the Gray code providing an ordered sequence of n-cube names that is used to name the 2^n ring nodes successively.
- This concludes the proof that the embedding follows from the construction of the Gray code.
- However, the following formal proof provides more details, via an inductive argument.
72. A 2^n-Node Ring Embedding onto the n-Cube (Formal Proof)
- The following optional formal proof is included for those who do not find the preceding argument convincing.
- Theorem: There is an embedding of a ring with 2^n vertices onto an n-cube that preserves locality.
- Proof:
  - We establish the following claim: the mapping f(i) = g_i(n) is an embedding of the ring onto the n-cube.
  - The claim is true for n = 1, as G1 = (0, 1) and nodes 0 and 1 are connected in both the ring and the 1-cube.
  - We assume the claim is true for a fixed n-1 with n > 1.
  - We use the neighbor-preserving embedding f' of the vertices of a ring with 2^{n-1} nodes onto the (n-1)-cube to build a similar embedding f of a ring with 2^n vertices onto the n-cube.
73. Ring Embedding onto the n-Cube (cont.)
- Recall that the Gray code Gn is the sequence 0Gn-1 followed by the sequence 1Gn-1^rev.
- The n-cube consists of one copy of an (n-1)-cube whose labels have a 0 prefix and a second copy of an (n-1)-cube whose labels have a 1 prefix.
- By the assumption that the claim is true for n-1, the Gray code sequence 0Gn-1 provides the binary codes for a ring of elements in the first (n-1)-cube, with each successive element differing in one digit from its predecessor.
- Likewise, the Gray code sequence 1Gn-1^rev provides the binary codes for a ring of elements in the second copy of the (n-1)-cube, with each successive element differing in one digit from its predecessor.
- The last element of 0Gn-1 is identical to the first element of 1Gn-1^rev except for the prefixed digit, so these two elements also differ in one bit.
- Similarly, the first element of 0Gn-1 and the last element of 1Gn-1^rev differ only in the prefixed digit, which closes the ring.
74. 2D Torus Embedding onto the n-Cube
- We embed a 2^r × 2^s torus onto an n-cube with n = r+s by using the Cartesian product Gr × Gs of two Gray codes.
- A processor with coordinates (i,j) on the grid is mapped to the processor f(i,j) = (g_i(r), g_j(s)) in the n-cube.
- Recall the map f1(i) = g_i(r) is an embedding of a 2^r-node ring onto an r-cube (row ring), and f2(j) = g_j(s) is an embedding of a 2^s-node ring onto an s-cube (column ring).
- We identify (g_i(r), g_j(s)) with the node of the (r+s)-cube whose first r bits are given by g_i(r) and whose next s bits are given by g_j(s).
- Then, for a fixed j, the nodes f(i±1, j) are neighbors of f(i,j), since f1 is an embedding of a 2^r ring onto an r-cube.
- Likewise, for a fixed i, the nodes f(i, j±1) are neighbors of f(i,j), since f2 is an embedding of a 2^s ring onto an s-cube.
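- A sketch verifying that this product embedding maps every torus edge to a hypercube edge; gray() repeats the generator sketched earlier so the block is self-contained:

    def gray(n):
        if n == 1:
            return ["0", "1"]
        prev = gray(n - 1)
        return ["0" + c for c in prev] + ["1" + c for c in reversed(prev)]

    r, s = 2, 3
    Gr, Gs = gray(r), gray(s)

    def f(i, j):
        # torus node (i, j) -> (r+s)-bit hypercube label g_i(r) g_j(s)
        return Gr[i % 2**r] + Gs[j % 2**s]

    def nbits_differ(a, b):
        return sum(x != y for x, y in zip(a, b))

    # every torus edge maps to a hypercube edge (labels differ in one bit)
    for i in range(2**r):
        for j in range(2**s):
            assert nbits_differ(f(i, j), f(i + 1, j)) == 1
            assert nbits_differ(f(i, j), f(i, j + 1)) == 1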
75. Collective Communications in the Hypercube
- Purpose: gain an overview of the complexity of collective communications on the hypercube.
- We will focus on broadcast on the hypercube.
- Assume processor 0 wants to broadcast, and consider the naïve algorithm:
  - Processor 0 sends the message to all of its neighbors.
  - Next, every neighbor sends the message to all of its neighbors.
  - Etc.
- Redundancy in the naïve algorithm:
  - The same processor receives the same message many times.
  - E.g., processor 0 receives it from all of its neighbors.
  - Mismatched SENDs and RECEIVEs may happen.
76. Improved Hypercube Broadcast
- We seek a strategy where
  - Each processor receives the message only once
  - The number of steps is minimal
- We will use one or more spanning trees.
- Send and receive need a parameter indicating the dimension along which the communication takes place:
  - SEND(cube_link, send_addr, m)
  - RECEIVE(cube_link, recv_addr, m)
77. Hypercube Broadcast Algorithm
- There are n steps, numbered from n-1 down to 0.
- Every processor receives its message on the link corresponding to its rightmost 1.
- A processor that has received the message forwards it on the links whose indices are smaller than that of its rightmost 1.
- At step i, every processor whose rightmost 1 is strictly larger than i forwards the message on link i.
- Let the broadcast originate at processor 0.
  - Assume processor 0 has a fictitious 1 at position n.
  - This adds an additional digit, as processors have binary digits for positions 0, 1, ..., n-1.
78. Trace of Hypercube Broadcast for n = 4
- Since the broadcast originates at processor 0, we treat its index as 10000 (with the fictitious leading 1).
- Processor 0 sends the broadcast message on link 3 to processor 1000.
- Next, both processors 0000 and 1000 have their rightmost 1 in a position of at least three, so both send the message along link 2.
  - The message goes to 0100 and 1100, respectively.
- This process continues until the last step, at which every even-numbered processor sends the message along its link 0.
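- A sketch of this schedule for a broadcast from processor 0, checking that every processor receives the message exactly once in n steps:

    def rightmost_one(q, n):
        # position of q's rightmost 1; processor 0 gets a fictitious 1 at n
        return (q & -q).bit_length() - 1 if q else n

    n = 4
    have = {0}                                   # broadcast originates at 0
    for i in range(n - 1, -1, -1):               # steps n-1 down to 0
        for q in [q for q in have if rightmost_one(q, n) > i]:
            dest = q ^ (1 << i)                  # forward along link i
            assert dest not in have              # each node receives exactly once
            have.add(dest)
    assert len(have) == 2 ** n                   # all 2^n processors reached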
79. Broadcasting Using a Spanning Tree
80. Broadcast Algorithm
- Let BIT(A, b) denote the value of the bth bit of processor A's index.
- The algorithm for the broadcast of a message of length m by processor k is given in Algorithm 3.5.
- Since there are n steps, the execution time under the store-and-forward model is n(L + m·b).
- This algorithm is valid for the 1-port model.
  - At each step, a processor communicates with at most one other processor.
81. [Slide not transcribed]
82. Broadcast Algorithm in Hypercube (cont.)
- Observations about the algorithm's steps:
  - The algorithm specifies the actions taken by processor q.
  - n is the number of binary digits used to label processors.
  - pos = q XOR k (exclusive OR).
    - The broadcast is from P_k.
    - This relabels the processors so that the algorithm works as if P_0 were the root of the broadcast.
  - Steps 5-7 set first-1 to the location of the first 1 in pos.
  - Steps 8-10 are the core of the algorithm:
    - phase steps through the link dimensions, higher ones first.
    - When the link dimension reaches first-1, q receives the message on its first-1 link.
    - q then sends the message along each dimension smaller than first-1 to the processors below it in the spanning tree.
83. [Slide not transcribed]